1. Introduction
Electricity price forecasting has become a critical topic as power markets undergo deregulation and experience increasing volatility. Since the beginning of this century, various authors have emphasized its relevance for both generators and consumers in operational planning and financial risk management. In this context, Contreras et al. [
1] highlighted that, in competitive markets, the ability to anticipate spot prices is essential for maximizing revenues and minimizing costs, while later reviews, such as that by Aggarwal et al. [
2], reinforced this view within deregulated market environments. More recently, several studies have addressed the challenge of price forecasting in highly volatile electricity markets by incorporating advanced machine learning techniques and artificial neural networks [
3,
4]. Tschora et al. [
5] investigated the application of various machine learning algorithms to forecast prices in the European day-ahead market, with particular emphasis on the predictive potential of features such as historical prices from neighboring countries—an approach that has been underexplored to date. In this context, electricity price forecasting has emerged as a strategic tool for both market participants and regulators seeking stability and efficiency in energy supply.
Dominated by regression-based approaches, the methodological landscape of electricity price forecasting centers around linear regression models. Both standard multiple linear regression and regularized variants—such as LASSO, ridge, and elastic net—are employed to enhance model generalization and address multicollinearity [
6,
7]. In addition, quantile regression techniques have gained traction for probabilistic forecasting by capturing price uncertainty through the estimation of different quantiles of the distribution [
8]. These regression methods are often embedded in hybrid frameworks that combine linear models with more complex algorithms (e.g., time series decomposition, decision tree ensembles, or neural networks) to exploit both linear effects and nonlinear patterns in the data [
9,
10].
Nevertheless, most studies have focused on markets in Europe, North America, and other developed economies, with limited research addressing emerging electricity markets. Dragašević et al. [
11], for instance, analyzed a developing market by applying multivariable regression techniques to identify the fundamental factors influencing price formation. In the case of Ecuador, the electricity sector presents particular characteristics that distinguish it from extensively studied markets. The Ecuadorian generation matrix is highly dependent on hydropower: approximately 79% of electricity production came from hydroelectric sources in 2021 [
12]. This reliance on a single source renders the system vulnerable to extreme climatic events. In fact, a severe drought triggered a national energy crisis in 2023, leading to extended electricity rationing through scheduled blackouts across the country [
13]. This critical situation underscored the need to enhance energy price forecasting and planning tools in Ecuador, both to strengthen supply security and to inform energy policy decisions under uncertain conditions.
Motivated by this context, the present study aims to model the monthly average unit price of energy supply (USD cents/kWh) in Ecuador’s distribution system using multiple linear regression models. Unlike many studies focused on short-term and hourly spot prices, this analysis considers monthly behavior, which is relevant for national energy and tariff planning. Several key indicators from the electricity sector are used as explanatory variables: national energy demand, disaggregated generation by source type (hydropower, thermal, and other renewables), production costs (fixed and variable) and transmission costs, imported energy volume, and a settlement index reflecting economic adjustments in the wholesale market. These variables capture the main structural determinants of price in the Ecuadorian context, encompassing supply–demand balance, generation mix and operating costs, and international energy exchanges. To select the most relevant variables and avoid multicollinearity, three regression-based methodologies were implemented: (1) expert-driven variable selection, (2) direct selection based on correlation with price, and (3) dimensionality reduction through principal component analysis (PCA), used both to construct latent factors and to select variables based on their factor loadings in the first principal components. The models were calibrated and validated using monthly data from the 2018–2024 period, covering years both before and after the 2023 energy crisis, which allows the robustness of the approach to be assessed under different market conditions. Additionally, an ARIMAX model was applied to forecast prices based on each of the proposed regression models, further enhancing the predictive capabilities of the analysis.
Despite the wide adoption of forecasting models such as PCA-enhanced regression and ARIMAX in developed markets, their application remains scarce in emerging energy systems with high structural dependencies and regulatory heterogeneity, such as Ecuador. In particular, no prior studies have systematically combined dimensionality reduction through PCA with multiple linear regression to explain electricity cost dynamics in Ecuador. Moreover, the use of ARIMAX models with exogenous regressors derived from cost-based linear models has not been previously implemented in the region. This study addresses this empirical gap by proposing an integrated modeling framework tailored to Ecuador’s unique generation profile, regulatory adjustments, and market volatility, thereby contributing to methodological innovation and regional planning strategies.
The remainder of the article is structured as follows.
Section 2 presents the related works and discusses the relevant literature.
Section 3 describes the background and context of the Ecuadorian electricity sector.
Section 4 details the methodology employed, including the variable selection techniques, regression models, and ARIMAX model.
Section 5 presents the results obtained and their analysis.
Section 6 discusses the implications of these findings and potential future research directions.
Section 7 outlines the main conclusions of the study.
2. Related Works
Accurate determination of the average unit price of energy supply is essential to ensure efficient management within electricity distribution systems. Several studies have addressed this issue using statistical techniques and advanced predictive models, particularly multiple linear regression, due to its ability to incorporate multiple explanatory variables and provide clear interpretations of the relationships among energy, economic, and regulatory factors. This section presents a synthesis of recent research applying mathematical approaches to model and forecast energy supply costs, highlighting the methodologies used, variables considered, results obtained, and key limitations identified, as summarized in
Table 1. Such comparative analysis provides context for the contributions of the present study and supports the relevance of the proposed methodological approach.
Weron [
14] presented a comprehensive review of the state of the art in electricity price forecasting (EPF). The study systematically compiled and analyzed a broad range of methods used over the previous 15 years, covering classical linear statistical models as well as nonlinear approaches such as GARCH, neural networks, and others. The review outlines the complexity of available solutions, discusses their strengths and weaknesses, and identifies both opportunities and challenges associated with forecasting tools. Additionally, the author looks ahead and emphasizes the need for more objective comparative studies (using identical datasets, robust error metrics, and statistical significance testing) to guide the next decade of EPF research. As a synthesis article, it does not propose new models or perform empirical validations, serving more as a theoretical reference than a practical guide for implementation.
Agudelo et al. [
15] addressed electricity price forecasting in the Colombian power exchange using neural network techniques. They employed a nonlinear autoregressive neural network with exogenous inputs (NARX), incorporating variables such as electricity demand, the hydrothermal generation ratio, El Niño event probability, and reservoir storage levels, in addition to historical prices as autoregressive inputs. The NARX model exhibited strong predictive performance: the simulated prices reproduced 96% of the variability of the actual time series, with residuals behaving as white noise, demonstrating robust generalization even beyond the training period. However, the model relies heavily on hydrological conditions and specific local data, limiting its direct applicability in other contexts without appropriate recalibration.
Ramos et al. [
16] analyzed monthly electricity prices in the Iberian market (MIBEL) using a structural model with economic and climatic variables. A multiple linear regression was applied with exogenous variables such as per capita consumption, heating and cooling degree days, the hydro productivity index (HPI), and the industrial production index (IPI), among others, to explain variations in electricity prices. The model explained approximately 53% of price variability in Portugal and 29% in Spain; it also found that higher values of indicators such as the HPI and IPI are associated with lower electricity prices. One limitation of the study is the use of aggregated monthly series, which does not capture daily or hourly variations. Another is the presence of autocorrelation in the residuals, which suggests that the model could be refined to improve its predictive capacity.
Zheng et al. [
17] proposed an approach for forecasting the Locational Marginal Price (LMP) by decomposing the problem into its price subcomponents. Specifically, the LMP was separated into energy, congestion, and loss components, with individual forecasting models developed for each component and subsequently combined into an ensemble forecast. This component-based strategy enabled more accurate and robust LMP predictions compared to traditional methods that treat the price as a monolithic variable. However, the effectiveness of the approach depends on correctly decomposing the price components; if the separation is inadequate or the input data are limited, the model’s advantages and performance may be compromised.
Ulgen and Poyrazoglu [
18] examined key predictors for electricity price forecasting in the Turkish wholesale market. They applied a multiple linear regression model incorporating both lagged electricity prices (including moving averages) and exogenous fuel variables—natural gas, oil, and coal prices—within a dynamic estimation framework. Including historical prices and these energy inputs improved forecasting accuracy; these predictors proved significant for estimating electricity prices, substantially reducing prediction errors compared to simpler models. As a limitation, the study was based on data from Turkey with a short forecasting horizon (12 days), and it did not explore more complex nonlinear relationships in price dynamics, leaving room for future enhancement using more advanced techniques.
Zhang et al. [
19] designed an optimized linear regression scheme for real-time electricity load forecasting to support market participants in their operational decisions. The approach focused on simplifying the model and enhancing its robustness using statistical criteria: stepwise variable selection based on the Akaike Information Criterion (AIC) and influence analysis (Cook’s distance) were employed to refine the model by addressing potential outliers and multicollinearity. Despite its simplicity, the resulting linear model achieved effective demand forecasts using only publicly available data, demonstrating the usefulness of a parsimonious and transparent approach for short-term applications. However, as the study is oriented toward load (demand) forecasting rather than price forecasting, its conclusions cannot be directly transferred to electricity price prediction models without additional considerations.
As evidenced by the preceding analysis, various advanced statistical methodologies and predictive models have been successfully applied in specific contexts, highlighting the importance of incorporating local, economic, and regulatory factors for effective electricity price forecasting. However, these approaches often exhibit limitations regarding the direct transferability of results across markets due to regional differences in institutional and corporate structures, energy resource availability, regulatory frameworks, and economic conditions. In this regard, the Ecuadorian electricity market—characterized by strong dependence on hydropower generation, climatic variability, and specific regulatory schemes—requires tailored studies that explicitly account for these local factors. Therefore, it becomes essential to develop customized mathematical models based on multiple linear regressions, calibrated with context-specific Ecuadorian variables, to accurately forecast the average unit price of energy supply in distribution systems. Such models would contribute to strengthening the technical and economic management of the national electricity system.
3. Energy Supply in the Ecuadorian Electricity Sector
The electricity supply industry is typically segmented according to its main activities: production (generation), transportation (transmission), distribution, and commercialization [
21]. Depending on the corporate structure defined by national legislation, each activity is carried out by different institutions or companies dedicated to a specific function. The cost of supplying electricity to major load blocks connected to the distribution network is primarily determined by the sum of production and transmission costs.
Ecuador’s electricity generation matrix is diversified across primary sources, including renewable hydropower (with or without reservoirs), thermal generation from fossil fuels, and non-conventional renewable sources such as biomass, photovoltaic, and wind energy. However, hydropower remains predominant, covering 69.1% of national demand in 2023. Thermal generation accounted for 25.6% during the same period, while the remaining percentage was supplied by non-conventional renewable generation and electricity imports from the Colombian power system [
22]. This composition of the generation matrix has a direct impact on both electricity production costs and the average supply price.
In Ecuador, the largest share of installed hydropower generation capacity is concentrated in the Amazon basin. Water resource availability in this region exhibits marked seasonal fluctuations, characterized by a significant decrease in precipitation—and consequently, river flow—between the months of September and November each year [
23]. Since hydropower constitutes the main source of supply for national demand, its production costs and the average electricity prices are directly affected by this hydrological variability, tending to increase during periods of reduced water availability.
As shown in
Figure 1, the variation in average energy purchase prices exhibits peaks and troughs that coincide with the dry and rainy seasons of the Ecuadorian Amazon basin—that is, higher prices from September through November or December each year. For the period from 2018 to 2022, price fluctuations remain relatively small; however, in 2023 and 2024, the most significant variations across the entire seven-year dataset are observed.
This behavior is partly explained by the Ecuadorian system’s strong dependence on hydropower generation. During periods of high water availability, the average price decreases, whereas in dry seasons the price increases—a pattern that, in 2023 and 2024, revealed significant vulnerability to hydrological variability [
24]. Another contributing factor is the application, since 2016, of the generation and transmission cost settlement mechanism established by the Electricity Regulation and Control Agency (ARCONEL). This mechanism defines a cost redistribution index among distribution companies. In the analyzed case, the regulatory index has shown a sustained decline since 2017, with increases during the dry periods of 2023 and 2024 as mandated by ARCONEL.
These costs are mostly classified as fixed costs, since their occurrence is independent of the volume of electricity produced. Additionally, variable costs are considered, which are directly correlated with the amount of energy generated during a specific period. This category includes items such as fuel supply and transportation, lubricants, and chemical products, among others [
25]. The recognition and management of these costs are formalized through regulated energy purchase agreements, signed between generation companies and distribution and commercialization entities.
Additionally, energy transmission service costs are considered. Although not directly related to production, transmission is an essential service that enables the effective delivery of power and energy blocks to distribution networks, and ultimately to the load. According to current regulations [
25], these costs are calculated based on a tariff that remunerates both fixed and variable costs of the state-owned transmission company.
One cost that has had a significant impact in recent years on the average supply price at the distribution network level is related to energy imports from the Colombian power system. During the period from October to December 2023, a severe drought event occurred in the Amazon basin, triggering a major energy crisis in Ecuador due to the reduced generation capacity of hydroelectric plants, even leading to energy rationing [
24]. As a result, the national power system became heavily reliant on imported energy to mitigate the adverse effects of reduced domestic generation availability. However, the cost of imports was very high, as the neighboring country was also affected by the drought and relied on high-cost thermal sources for energy exports.
The hydrological crises of 2023–2024, particularly the severe drought in the Amazon basin, are empirically captured in the dataset by three variables: a significant decline in monthly hydropower generation (Hydropower), a sharp increase in cross-border energy imports (TIES), and adjustments in regulatory compensation schemes (Settlement ratio). These variables serve as quantitative proxies for the structural and climatic disruptions observed during the period and are explicitly incorporated into the regression and ARIMAX models.
Additionally,
Figure 2 clearly shows that the settlement ratio has a significant effect on the average unit energy supply price. Until 2022, both variables remained relatively stable, with only minor fluctuations. However, starting in 2023, there is an abrupt change in the settlement ratio that coincides with a sharp increase in the average price, suggesting that regulatory changes have an immediate and notable impact on energy supply costs in Ecuador. This finding highlights the importance of explicitly incorporating regulatory factors in energy price modeling.
In the current context of the Ecuadorian electricity market—characterized by the absence of competition among producers and a strong dependence on variables such as primary resource availability (particularly hydropower), demand growth, and limited supply expansion—energy supply is inherently tied to the conditions and availability of imports from the Colombian power system.
Due to space constraints, detailed regulatory and technical specifications of Ecuador’s electricity system—including tariff composition, transmission cost structures, and contract mechanisms—are provided in
Appendix A. This allows the main text to focus on the variables and modeling components directly relevant to cost forecasting.
4. Methodology
This study aims to analyze the relationship between the average electricity price (dependent variable) and various energy and economic variables (independent variables). The process begins with a description of each variable used, followed by a normality analysis. Based on the results of the normality analysis, appropriate statistical tests are selected. Subsequently, an analysis of significant differences, correlation among variables, and a principal component analysis are conducted. Finally, using the statistical information obtained, mathematical models are developed to identify the levels of fit and significance.
To guide the empirical analysis, the study is structured around the following research hypotheses:
H1 (Expert-driven variable selection): A regression model constructed with variables selected based on domain knowledge (e.g., hydropower, thermal generation, energy imports) will provide interpretable insights into cost drivers, though with moderate predictive accuracy.
H2 (Correlation-based variable selection): A regression model based on variables with the highest statistical correlation with the average unit price will outperform other approaches in terms of predictive accuracy and statistical fit.
H3 (PCA-based variable reduction): A regression model using variables derived from principal component analysis will achieve acceptable forecasting accuracy while reducing multicollinearity and dimensionality, offering a parsimonious alternative.
Furthermore, each of these hypotheses is tested using ARIMAX modeling to evaluate the medium-term forecasting performance of the selected regression models. The comparative results aim to inform both methodological practices and regulatory planning in energy markets with structural constraints, such as Ecuador.
To assess the robustness and interpretability of the modeling framework, three different variable selection strategies were implemented and compared. The first approach, based on expert judgment, reflects practical knowledge of the Ecuadorian electricity sector and its structural dependencies. The second approach, driven by statistical correlation, identifies the strongest direct relationships with the dependent variable and aims to maximize predictive power. The third approach applies PCA to explore the underlying structure of the data, reduce dimensionality, and address multicollinearity. This comparative setup allows the evaluation of trade-offs between model accuracy, explanatory interpretability, and complexity, thereby providing a robust basis for selecting the most suitable modeling strategy for policy and planning applications.
4.1. Analysis Variables
The variables considered are defined and described below:
Average price (USD cents/kWh): This variable represents the unit average value per kWh purchased to supply the load in a given month. It is calculated as the ratio between the total cost of energy purchased during the month and the total amount of energy purchased in the same period.
Demand (kWh/month): This is the total amount of energy consumed per month by the load. It represents the overall energy requirement that must be met by various generation sources and imports.
Hydropower energy (kWh/month): This is the amount of hydroelectric energy used to supply the load. It reflects the contribution of hydropower to the energy mix.
TIES (kWh/month): International Electricity Transactions; this is the amount of imported energy from the Colombian system used to supply the load. It indicates the reliance on and utilization of cross-border electricity.
Thermal energy (kWh/month): This is the amount of thermoelectric energy used to supply the load. It reflects the share of thermal generation in meeting the demand.
Other-source energy (kWh/month): This is the amount of energy from other technologies (wind, photovoltaic, biogas, etc.) used to supply the load. It represents the contribution of non-conventional renewable sources and other emerging technologies.
Fixed cost CR (USD/month): Cost without index application; this is the total monthly cost paid to generators with regulated energy purchase contracts for investment, amortization, and other components not dependent on the energy produced. It includes items such as investment recovery, amortizations, and other fixed charges associated with the availability of contracted generation.
Variable cost CR (USD/month): Cost without index application; this is the total monthly cost paid to generators with regulated energy purchase contracts for components that depend on the amount of energy produced. It mainly includes fuel costs and other variable inputs associated with generation.
GNC cost (USD/month): Costs of non-conventional generation; this is the total monthly cost paid to generators with or without energy purchase contracts whose production is based on renewable or non-conventional energy sources (wind, photovoltaic, biogas, etc.). It includes the variable costs associated with this type of generation.
TIES cost (USD/month): This is the total monthly cost paid for energy imports from the Colombian power system. It reflects the expense associated with acquiring electricity from the neighboring market.
TFT cost (USD/month): This is the total monthly cost paid for the energy transmission service. It covers the charges associated with the use of transmission networks.
Settlement ratio: Regulatory factor for reallocating energy purchase costs. According to the methodology established in the regulation, it is applied as a multiplier to generation (excluding TIES) and transmission costs. Its function is to reassign energy purchase costs among different market agents in accordance with current regulations.
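As an illustration of how the definitions above fit together, the following minimal Python sketch computes a monthly average price in USD cents/kWh from the cost components. The class and field names, the way the cost items are aggregated, the exact placement of the settlement ratio, and all numerical values are assumptions made for this example; they are not taken from the dataset or from the regulation.

```python
from dataclasses import dataclass


@dataclass
class MonthlyPurchase:
    """One month of energy purchases (field names are hypothetical)."""
    energy_kwh: float        # total energy purchased, kWh/month
    fixed_cost_cr: float     # USD/month, before index application
    variable_cost_cr: float  # USD/month, before index application
    gnc_cost: float          # USD/month
    tft_cost: float          # USD/month
    ties_cost: float         # USD/month
    settlement_ratio: float  # dimensionless multiplier


def total_cost_usd(m: MonthlyPurchase) -> float:
    # Assumption: the ratio scales generation (excluding TIES) and
    # transmission costs, while import costs are added unscaled.
    scaled = m.fixed_cost_cr + m.variable_cost_cr + m.gnc_cost + m.tft_cost
    return m.settlement_ratio * scaled + m.ties_cost


def average_price_cents_per_kwh(m: MonthlyPurchase) -> float:
    # Ratio of total monthly cost to total monthly energy, in USD cents/kWh.
    return 100.0 * total_cost_usd(m) / m.energy_kwh


# Made-up figures for illustration only.
example = MonthlyPurchase(
    energy_kwh=1_000_000, fixed_cost_cr=10_000, variable_cost_cr=8_000,
    gnc_cost=2_000, tft_cost=4_000, ties_cost=6_000, settlement_ratio=1.0,
)
print(average_price_cents_per_kwh(example))
```

Under these assumptions, lowering the settlement ratio reduces the scaled part of the cost but leaves the import cost untouched, so months with large imports remain expensive regardless of the index.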
4.2. Normality Analysis of the Variables
To ensure the validity of the statistical methods applied in this study, a normality analysis was performed on each of the variables considered in the model. This analysis assesses whether the variable distributions conform to the normality assumption, which underlies the parametric techniques considered alongside multiple linear regression (strictly, linear regression requires approximately normal residuals rather than normally distributed variables). The evaluation included both formal statistical tests and graphical representations, aimed at identifying potential skewness, kurtosis, or outlier behavior that could affect the robustness of the proposed model.
Figure 3a shows the distribution of the average price, which is neither symmetric nor unimodal: most values are concentrated between 2 and 3 USD cents/kWh, while some months exhibit relatively high prices (7–8 USD cents/kWh), forming a separate cluster of elevated values. In the Q–Q plot, a noticeable deviation from the red reference line is observed, particularly in the upper and lower tails. This clear deviation, together with the Shapiro–Wilk test, confirms that the average price does not follow a normal distribution, so the normality assumption must be rejected.
Figure 3b shows that the distribution of demand is not symmetric, due to the concentration of a single mode between 0.95 and
kWh/month. However, the presence of substantial bars toward the right may suggest the influence of some high values. In the Q–Q plot, a deviation from the red reference line is observed in both the upper and lower tails. This deviation, together with the Shapiro–Wilk test, indicates that Demand does not follow a normal distribution, leading to the rejection of the normality assumption.
Figure 3c shows that the distribution of Hydroelectric Power is not symmetric, due to the presence of a main peak between 0.85 and
kWh/month. In the Q–Q plot, a deviation similar to that of Demand is observed. This deviation, together with the Shapiro–Wilk test, indicates that Hydroelectric Power does not follow a normal distribution, leading to the rejection of the normality assumption.
The remainder of this subsection summarizes the main characteristics of the other variable distributions and the conclusions derived from the normality tests. For a more detailed analysis of the distributions and Q–Q plots of the variables studied, refer to Appendix A.
The variables TIES and Cost TIES exhibit non-symmetric, right-skewed distributions, with a single mode near zero and a long tail toward high values. The Q–Q plots show significant deviations in both tails, and the Shapiro–Wilk test confirms that neither TIES nor Cost TIES follows a normal distribution.
In contrast, the variable GNC cost displays an approximately symmetric distribution with slight right skewness. The Q–Q plot indicates a reasonable alignment with normality, especially in the central region, and the Shapiro–Wilk test does not reject normality for this variable.
The variables Thermal Energy and Variable cost CR exhibit non-symmetric distributions, with a primary mode around 0.5 and secondary peaks. Their Q–Q plots reveal significant deviations, and the Shapiro–Wilk test confirms that neither variable follows a normal distribution. In the case of Fixed cost, although the central portion of the distribution aligns with normality, the tails show clear deviations, leading to a rejection of the normality assumption.
The variables Energy from other sources and Settlement ratio exhibit bimodal and multimodal distributions, respectively. Their Q–Q plots show irregular deviations, and the Shapiro–Wilk test indicates that they do not follow a normal distribution. Finally, the variable Cost TFT is moderately symmetric, with slight right skewness. Although the central region aligns with the normal distribution, the tails exhibit deviations, leading to rejection of normality.
In general, almost all analyzed variables depart from a normal distribution. The Shapiro–Wilk test yields very low p-values for most variables, indicating that their distributions differ significantly from normality. The only variable for which normality was not rejected is GNC cost, whose distribution, as observed, is approximately normal. These deviations may be due to skewness (for example, Demand and Hydroelectric Power exhibit right tails), high kurtosis, or bimodal/multimodal distributions caused by structural changes during the period (as with the average price, which experienced an abrupt change in mid-2023).
Given the lack of normality in most variables, analyses that assume normality should be applied with caution. Transformations may be considered if a closer approximation to a normal distribution is required. However, for the main goal of variable selection and price modeling, this non-normality is not necessarily problematic; it suggests that robust or non-parametric methods may be preferable and that the model assumptions must be verified after fitting.
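The graphical part of this diagnosis can be sketched with the Python standard library alone. The snippet below computes a moment-based skewness coefficient and the point pairs of a normal Q–Q plot for a short, made-up price series that mimics the cluster of elevated values described for the average price. The function names and the data are hypothetical, and the formal Shapiro–Wilk test is not reimplemented here; in practice a statistical library routine would be used for it.

```python
from statistics import NormalDist, mean, pstdev


def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (zero for a symmetric sample)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def qq_points(values):
    """Pairs (theoretical normal quantile, observed standardized quantile)."""
    n = len(values)
    m, s = mean(values), pstdev(values)
    observed = sorted((v - m) / s for v in values)
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


# Hypothetical monthly prices in USD cents/kWh: a main cluster plus two high months.
prices = [2.1, 2.3, 2.2, 2.5, 2.4, 2.6, 2.2, 2.3, 7.5, 7.9]

print("skewness:", round(sample_skewness(prices), 2))
for theo, obs in qq_points(prices):
    # Points far from the line obs == theo indicate departures from normality.
    print(f"{theo:6.2f} {obs:6.2f}")
```

For a sample drawn from a normal distribution the pairs fall close to the identity line; a positive skewness and an upper tail lying above the line are the kind of pattern that leads to rejecting normality.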
4.3. Analysis of Significant Differences
Group comparison tests are used to examine whether the variables exhibit significant changes over time or across other groupings. Since the normality analysis indicates that most variables do not follow a normal distribution, parametric tests (Student’s t and ANOVA) are replaced with non-parametric alternatives (Mann–Whitney U, Kruskal–Wallis H, Wilcoxon signed-rank, Friedman).
Because more than two independent groups are compared (the years 2018 to 2024), the Kruskal–Wallis H test is used to determine whether significant differences exist among the group medians. The obtained H statistic and its associated p-value indicate highly significant differences among the analyzed years. This implies that the groups are not homogeneous in terms of the dependent variable, average price.
Figure 4 shows that the average energy prices (in USD cents/kWh) exhibit low variability and consistent values between 2018 and 2022, with medians close to 2 or 3 USD cents/kWh depending on the year. In contrast, the years 2023 and 2024 show notable increases in both medians and variability, indicating a significant change in price patterns. Additionally, outliers are identified in 2020, 2021, 2022, 2023, and 2024, highlighting unusual fluctuations during these periods.
These results, supported by the Kruskal–Wallis H statistical analysis, suggest significant differences between the years, with a marked increase in prices in the most recent years.
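A comparison of this kind can be sketched with `scipy.stats.kruskal`. The yearly price samples below are synthetic and only mimic the pattern described in the text (stable prices through 2022, higher and more variable prices afterwards); the numeric levels are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic monthly prices (USD cents/kWh), 12 per year; values are illustrative only.
prices_by_year = {year: rng.normal(2.5, 0.2, 12) for year in range(2018, 2023)}
prices_by_year[2023] = rng.normal(4.0, 0.8, 12)
prices_by_year[2024] = rng.normal(5.0, 1.0, 12)

# Kruskal-Wallis H test across the seven independent yearly groups.
h_stat, p_value = stats.kruskal(*prices_by_year.values())
print(f"H = {h_stat:.2f}, p = {p_value:.3g}")
```

A small p-value only says that at least one year differs; identifying which years differ requires a post hoc procedure.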
To complement the Kruskal–Wallis H results and identify specific pairwise differences in average energy prices across years, Dunn’s post hoc test was applied with Bonferroni correction. The resulting heatmap (Figure 5) presents adjusted p-values for all year-to-year comparisons. Darker cells correspond to non-significant differences, while lighter tones highlight statistically significant contrasts.
Statistically meaningful differences emerge between 2024 and most previous years, with particularly low adjusted p-values in the comparisons with 2019, 2022, and 2021 (Figure 5). These results reflect a sharp deviation in average unit prices during 2024, likely influenced by persistent effects of the 2023 hydrological crisis and ongoing regulatory adjustments. Additionally, significant differences between 2023 and 2020, as well as between 2023 and 2021, indicate transitional price behavior leading up to the crisis period.
Conversely, the years 2019–2022 exhibit no significant pairwise differences, suggesting a relatively stable pricing regime prior to major climatic and market disruptions. These findings reinforce the temporal segmentation observed in the boxplot analysis, providing further statistical support for claims of structural shifts in energy price dynamics beginning in late 2022.
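For readers who want to reproduce this type of pairwise analysis, the sketch below implements Dunn's test with Bonferroni adjustment directly from its rank-based definition. It is a simplified version that omits the tie correction (reasonable for continuous prices without repeated values), the function name `dunn_bonferroni` is hypothetical, and the three yearly samples are synthetic.

```python
import itertools

import numpy as np
from scipy import stats


def dunn_bonferroni(groups):
    """Pairwise Dunn z-tests on pooled ranks with Bonferroni-adjusted p-values.

    Simplification (assumption): no tie correction is applied.
    """
    labels = list(groups)
    pooled = np.concatenate([np.asarray(groups[g], dtype=float) for g in labels])
    ranks = stats.rankdata(pooled)
    n_total = pooled.size

    mean_rank, size, start = {}, {}, 0
    for g in labels:
        size[g] = len(groups[g])
        mean_rank[g] = ranks[start:start + size[g]].mean()
        start += size[g]

    pairs = list(itertools.combinations(labels, 2))
    adjusted = {}
    for a, b in pairs:
        std_err = np.sqrt(n_total * (n_total + 1) / 12.0 * (1.0 / size[a] + 1.0 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / std_err
        p_raw = 2.0 * stats.norm.sf(abs(z))  # two-sided p-value
        adjusted[(a, b)] = min(1.0, p_raw * len(pairs))  # Bonferroni adjustment
    return adjusted


rng = np.random.default_rng(2)
# Synthetic monthly prices for three years; illustrative values only.
yearly_prices = {
    2019: rng.normal(2.5, 0.2, 12),
    2022: rng.normal(2.6, 0.2, 12),
    2024: rng.normal(5.0, 0.5, 12),
}
adjusted_p = dunn_bonferroni(yearly_prices)
for pair, p in adjusted_p.items():
    print(pair, f"adjusted p = {p:.4g}")
```

The dictionary of adjusted p-values is the information a heatmap of year-to-year comparisons would display.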
4.4. Correlation Analysis
Relationships among the variables are analyzed using Spearman’s rank correlation coefficient. This coefficient measures the strength of the monotonic association between two quantitative variables and, unlike Pearson’s coefficient, does not assume normality, which suits the non-normal distributions found above. Values close to $\pm 1$ indicate a strong monotonic correlation (positive or negative), while a value near 0 indicates little or no monotonic relationship.
Figure 6 shows the correlation matrix calculated among all numerical variables in the dataset, including the average price. The column and row corresponding to the average price (USD cents/kWh) are notable, as it is the dependent variable. It exhibits a strong positive correlation with several cost variables, such as Cost TIES, Settlement ratio, Fixed cost CR, and Variable cost CR, and with generated Thermal Energy. This suggests that the unit energy price increases as associated costs rise, especially with higher generation of Thermal Energy. Conversely, the average price shows a strong negative correlation with Hydroelectric Power, indicating that in periods of higher hydropower generation, the average price tends to be lower.
Significant relationships are also observed among the independent variables. Hydroelectric Power is negatively correlated with Thermal Energy and with Variable cost CR. This reflects that in months with higher hydropower contribution, thermal generation and the variable fuel-related costs decrease.
Thermal Energy shows a very strong correlation with Variable cost CR, which most likely arises because variable costs (such as fuel expenses) depend directly on the amount of thermal energy produced. This extremely high correlation indicates near-perfect multicollinearity between these two variables. Additionally, Cost TIES is strongly correlated with both Variable cost CR and Thermal Energy, suggesting that increases in internal thermal generation may coincide with rises in import or exchange costs (TIES).
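The sketch below shows how such rank correlations can be computed with `scipy.stats.spearmanr`. The three series are synthetic and were built (as an assumption) so that thermal generation falls when hydropower rises and variable cost is a strictly increasing, nonlinear function of thermal generation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic monthly series; the functional forms are assumptions for illustration.
hydro = rng.uniform(50.0, 100.0, 84)
thermal = 120.0 - hydro + rng.normal(0.0, 3.0, 84)
variable_cost = np.exp(thermal / 20.0)  # strictly increasing in thermal generation

rho_hydro_thermal, p_hydro_thermal = stats.spearmanr(hydro, thermal)
rho_thermal_cost, p_thermal_cost = stats.spearmanr(thermal, variable_cost)

print(f"hydro vs thermal:         rho = {rho_hydro_thermal:.3f}")
print(f"thermal vs variable cost: rho = {rho_thermal_cost:.3f}")
```

Because Spearman's coefficient works on ranks, the exponential link between the second pair still yields a coefficient at its maximum, which is the kind of near-perfect association that signals multicollinearity.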
4.5. Principal Component Analysis
To reduce data dimensionality and capture the maximum possible variability, a principal component analysis was conducted. This method allows the identification of patterns in the data and prioritizes those components that explain a significant proportion of the total variability.
The loadings of the eleven independent variables indicate the relative contribution of each variable to each principal component. Each principal component (PC) is a linear combination of the original variables.
To determine the optimal number of principal components, the cumulative variance explained by each component was analyzed. As shown in Figure 7, the first three principal components explain approximately 95% of the total data variability, exceeding the 90% threshold (considered a standard criterion in this analysis). Consequently, the first three principal components were selected for subsequent analyses.
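The component-selection rule can be sketched as follows: standardize the variables, obtain the principal components from a singular value decomposition, and keep the smallest number of components whose cumulative explained variance reaches 90%. The data matrix below is synthetic (eleven variables driven by three latent factors, an assumption chosen to resemble the reported result), and standardization before PCA is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: 84 observations of 11 variables generated from 3 latent factors.
latent = rng.normal(size=(84, 3))
mixing = rng.normal(size=(3, 11))
X = latent @ mixing + 0.1 * rng.normal(size=(84, 11))

# Standardize each column so that all variables contribute on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA through the SVD of the standardized matrix.
_, singular_values, vt = np.linalg.svd(Z, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

threshold = 0.90
# First index whose cumulative share reaches the threshold, converted to a count.
n_components = int(np.searchsorted(cumulative, threshold) + 1)

loadings = vt[:n_components].T  # rows: variables, columns: retained components
print("cumulative explained variance:", np.round(cumulative[:4], 3))
print("components retained:", n_components)
```

The `loadings` array is what one would inspect to see which original variables dominate each retained component.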
Analyzing the extracted principal components reveals significant patterns in the data variability. Principal Component 1 (PC1) is strongly influenced by Thermal Energy (kWh/month), Variable cost CR (USD/month), and Cost TIES (USD/month). This suggests that PC1 is primarily related to thermal energy generation and associated variable costs, highlighting the relevance of these variables in the total variability.
In contrast, Principal Component 2 (PC2) shows a marked association with Demand (kWh/month), Cost TFT (USD/month), and GNC cost (USD/month). This indicates that PC2 captures a dimension of variability linked to energy demand and to costs associated with certain generation technologies, representing a distinct axis of variation from PC1.
Finally, Principal Component 3 (PC3) is primarily defined by Energy from other sources (kWh), GNC cost (USD/month), and the Settlement ratio. PC3 reflects variations related to less predominant energy sources and their impact on overall costs.
Together, these results demonstrate that the first principal components successfully extract fundamental aspects of the analyzed energy system, encompassing costs associated with thermal generation, demand variations, and specific costs of certain energy sources. Based on this identification, relevant variables from these principal components were selected, as detailed in Table 2.
5. Results
The results are presented in four specific sections: (1) linear regression models using variables selected according to the authors’ criteria, (2) linear regression models employing variables with the highest correlation, (3) linear regression models using variables defining the principal components, and (4) autoregressive models for price forecasting.
In analyzing the results, it is necessary to consider that the variables (both dependent and independent) do not fully comply with the assumptions of normality and linearity. Additionally, significant differences are observed among the studied years, with a notable increase in prices during the most recent periods, thereby increasing the complexity of the resulting models.
5.1. Regression Model Using Variables Selected According to Professional Criteria
Figure 8 presents the results of the multiple linear regression model for an initial selection of variables. It shows the relationship between the average kilowatt-hour price and Hydroelectric Power (kWh/month) (HP), TIES (kWh/month), Thermal Energy (kWh/month) (TE), and Energy from other sources (kWh/month) (EOS). Blue points represent observed values, while the red dashed line is included as an ideal fit reference. The considerable dispersion of the blue points around this reference line suggests a significant deviation from the ideal linear behavior expected in the relationship between the independent variables and price.
The fitted linear regression equation gives the coefficient of each predictor variable. The coefficient of determination ($R^2$) is 0.7035, suggesting that approximately 70.35% of the variability in the average kilowatt-hour price can be explained by the variables included in this model. The extremely low p-value of the overall model indicates that it is statistically significant for predicting the average kilowatt-hour price.
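The two summary statistics reported for each model, $R^2$ and the p-value of the overall F-test, can be computed as in the following sketch. The four predictors, the coefficients, and the noise level are synthetic assumptions; only the formulas are general.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic data: 84 observations, 4 standardized predictors (illustrative only).
n_obs, n_pred = 84, 4
X = rng.normal(size=(n_obs, n_pred))
true_beta = np.array([-0.8, 0.3, 0.6, 0.1])
y = 3.0 + X @ true_beta + rng.normal(0.0, 0.5, n_obs)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef

# Coefficient of determination.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Overall F-test: H0 is that all slope coefficients are zero.
df_model, df_resid = n_pred, n_obs - n_pred - 1
f_stat = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
p_value = stats.f.sf(f_stat, df_model, df_resid)

print("intercept and slopes:", np.round(coef, 3))
print(f"R^2 = {r_squared:.4f}, F = {f_stat:.1f}, p = {p_value:.3g}")
```

Note that both statistics are in-sample measures: they describe the fit to the data used for estimation, not forecast accuracy on new months.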
Figure 9 presents the results of the multiple linear regression model for the second variable selection, where the average kilowatt-hour price is modeled as a function of Fixed cost CR (USD/month) (FcCR), Variable cost CR (USD/month) (VcCR), GNC cost (USD/month) (CGNC), Cost TIES (USD/month) (CTIES), and Cost TFT (USD/month) (CTFT).
The fitted regression equation (reported in Table 3), together with a high coefficient of determination ($R^2 = 0.8961$), suggests that approximately 89.61% of the variability in the average kilowatt-hour price is explained by the variables included in this model. The extremely low p-value confirms the overall statistical significance of the model for predicting the average kilowatt-hour price based on these costs.
Figure 10 shows the multiple linear regression model for the third variable selection, where the average kilowatt-hour price is explained by Hydroelectric Power (kWh/month) (HP), TIES (kWh/month) (TIES), Thermal Energy (kWh/month) (TE), Cost TIES (USD/month) (CTIES), and the Settlement ratio (SR). The figure plots the actual average price against the average price predicted by the fitted model, whose equation is reported in Table 3. The proximity of the points to the red dashed line, which represents an ideal model, indicates the predictive capability of the model.
The high coefficient of determination ($R^2 = 0.9654$) suggests that approximately 96.54% of the variability in the average kilowatt-hour price is explained by the model variables. The extremely low p-value confirms the overall statistical significance of the model for predicting the average kilowatt-hour price based on these variables.
5.2. Regression Models Using Variables with Highest Correlation
The multiple linear regression model constructed from variables with significant correlations identified in Figure 6 reveals a strong association between the average kilowatt-hour price and the considered predictor variables: Hydroelectric Power (HP), TIES (TIES), Thermal Energy (TE), Fixed cost CR (FcCR), Variable cost CR (VcCR), Cost TIES (CTIES), and the Settlement ratio (SR). The equation estimated by the model is reported in Table 3.
Figure 11 shows a high concentration of blue points around the ideal red reference line, visually suggesting excellent predictive capability of the model. This is statistically confirmed by a coefficient of determination ($R^2$) of 0.9887, indicating that approximately 98.87% of the variability in the average kilowatt-hour price is explained by the linear combination of independent variables included in the model. The model’s p-value is extremely low, evidencing robust statistical significance of the model as a whole for predicting the average kilowatt-hour price.
5.3. Regression Models Using Principal Components
In the application of PCA to reduce data dimensionality and prioritize variables with the greatest variability, significant patterns were identified in the first three principal components. Based on the variable loadings in these components, six representative variables were selected for constructing a multiple linear regression model: Demand (kWh/month) (DCSUR), Thermal Energy (kWh/month) (TE), Energy from other sources (kWh) (EOS), Variable cost CR (USD/month) (VcCR), GNC cost (USD/month) (CGNC), and Cost TFT (USD/month) (CTFT).
Figure 12 shows the fitted linear regression model, which reveals a substantial relationship between the average kilowatt-hour price and the linear combination of these six variables. The fitted model equation is reported in Table 3.
A coefficient of determination ($R^2$) of 0.7135 indicates that approximately 71.35% of the variability in the average kilowatt-hour price can be explained by this reduced model. The p-value associated with the model is highly significant, suggesting that the model as a whole has statistically relevant predictive capability for the average kilowatt-hour price using the variables identified as important through principal component analysis.
To model the average kilowatt-hour price using the condensed information from the original variables through PCA, a linear regression model was fitted using the first three principal components (PC1, PC2, and PC3), which together account for a large proportion of the total data variability.
Figure 13 shows the relationship between the actual average kilowatt-hour price values and those predicted by the model based on the principal components. The fitted linear regression model equation is reported in Table 3.
The coefficient of determination ($R^2$) of 0.6504 indicates that approximately 65.04% of the variability in the average kilowatt-hour price can be explained by this model. The associated p-value of the model is highly significant, suggesting that the model based on the principal components has statistically relevant predictive capability for the average kilowatt-hour price, capturing an important portion of the variance through these linear combinations of the original variables.
Table 3 presents a summary detailing the different linear models implemented for determining the average unit energy supply price. This table facilitates comparison of the models in terms of their explanatory power ($R^2$), statistical significance (p-value), variables included in each model, and their respective equations.
In the highest-performing regression model ($R^2 = 0.9887$), the settlement ratio coefficient is 3.76. This implies that, holding other variables constant, a unit increase in the settlement ratio is associated with an increase of 3.76 cents in the average unit cost of energy supply (USD cents/kWh). Since the settlement ratio is a regulatory factor applied multiplicatively to cost components, such as generation and transmission, its direct impact on the unit price is expected. Thus, the positive coefficient is consistent with the regulatory transmission of cost burdens to the final tariff.
Similarly, hydropower energy exhibits a negative coefficient, meaning that as hydropower generation increases, the average energy cost decreases. This aligns with expectations, as hydroelectric sources are lower-cost and displace more expensive thermal or imported energy.
Some models show negative coefficients for TIES (imported energy), which may seem counterintuitive. However, this could result from multicollinearity with thermal energy and fixed costs, as both tend to rise during crisis periods, masking the true marginal cost of TIES. Additionally, the low frequency of high TIES values (e.g., only during extreme droughts) may cause underestimation of their real impact when using monthly data.
5.4. Autoregressive Integrated Moving Average Model with Exogenous Variables for Cost Forecasting
Prediction of the average unit value of the kilowatt-hour was addressed using an Autoregressive Integrated Moving Average model with exogenous variables (ARIMAX). This time series methodology is used to model and forecast variables that evolve over time, incorporating both the internal dynamics of the series (through its own past values and errors) and the influence of external variables. In this case, the predictions generated by the previously selected linear regression model, based on variables with the highest correlation to the average price, were included as an exogenous variable to enhance the predictive capacity of the ARIMAX model by considering external causal factors.
For the correct application of an ARIMAX model, it is crucial to ensure the stationarity of the time series, which implies that its statistical properties (mean and variance) remain constant over time. The original average price series is non-stationary: as Figure 14 shows, it exhibits trends and time-dependent patterns.
Second-order differencing, as shown in Figure 15, was required to induce the stationarity necessary for modeling. Additionally, selecting the orders of the autoregressive (AR), integrated (I, the number of differencing operations), and moving average (MA) components of the ARIMAX model, together with properly identifying and specifying the exogenous variables, is fundamental to capturing the complexity of the time series and the influence of relevant external factors.
In this study, three linear regression approaches based on different hypotheses were evaluated to identify the most suitable exogenous variable for the ARIMAX prediction model. The first approach consisted of regression models using variables selected according to the criteria of industry professionals, incorporating expert knowledge to identify factors potentially influencing the average kilowatt-hour price. The second approach explored regression models employing the variables with the highest statistical correlation to the average price. The third approach used models based on principal components derived from PCA, aiming to capture the key variability of the data empirically with reduced dimensionality.
Predictions obtained from each of the evaluated linear regression models, representing different strategies for selecting predictor variables, were individually considered as potential exogenous variables for the ARIMAX model. The central hypothesis of this stage was that the inclusion of the prediction from the linear regression model demonstrating the most robust relationship with the average price (whether based on expert knowledge or derived from statistical analysis of correlation and principal components) would enrich the ARIMAX time series model with valuable information, thereby leading to improved prediction accuracy.
5.4.1. ARIMAX Model Based on First Variable Selection
Prediction of the average unit value of the kilowatt-hour was performed using as an exogenous variable the predictions derived from the first variable selection in the linear regression model.
Figure 16 illustrates the time series of actual values and the prediction obtained from the ARIMAX model, along with its confidence interval. The graph shows that the differenced series tends to stabilize around zero over the forecast horizon, despite the high volatility observed in recent data. It is notable that, while the model captures the overall dynamics, the width of the confidence interval increases significantly as the forecast extends into the future, reflecting the inherent uncertainty of long-term prediction.
The underlying linear regression is the first-selection model of Section 5.1, whose equation is reported in Table 3.
5.4.2. ARIMAX Model Based on Second Variable Selection
The prediction of the average unit value of the kilowatt-hour was obtained from a time series model that incorporated as an exogenous variable the estimates generated by the second variable selection in the linear regression model.
Figure 17 illustrates the evolution of actual values and the projection for the average kilowatt-hour price. In this representation, the forecasted series tends to converge toward an equilibrium value in the short and medium term, despite the high volatility evidenced in the recent data of the original series. The pink shaded band delimits the confidence interval of the prediction, whose width increases as the forecast horizon extends, reflecting the inherent uncertainty in future estimates.
This prediction relies on the exogenous input incorporated into the model, whose coefficients are derived from the underlying linear regression (the second-selection model of Section 5.1, with its equation reported in Table 3).
5.4.3. ARIMAX Model Based on Third Variable Selection
The prediction of the average unit value of the kilowatt-hour was derived from a time series model that incorporated as an exogenous variable the projections resulting from the third variable selection in the linear regression model.
Figure 18 displays the evolution of the measured average price values and the obtained future prediction. It can be observed that the forecasted series, despite recent volatility in the historical data, shows a tendency to stabilize over the prediction horizon. The confidence interval accompanying the point forecast progressively widens, reflecting the increase in uncertainty as the projection extends over time.
This prediction relies on the exogenous input incorporated into the model, whose coefficients are derived from the underlying linear regression (the third-selection model of Section 5.1, with its equation reported in Table 3).
5.4.4. ARIMAX Model Based on Variables with Highest Correlation
Figure 19 presents the time series of actual average price values along with the prediction obtained by the ARIMAX model that uses the highest-correlation regression as its exogenous input, accompanied by a confidence interval. The prediction suggests a stabilization trend or slight recovery after an initial decline, oscillating around values close to zero for the forecasted period, although with increasing uncertainty as the forecast horizon extends, as indicated by the width of the confidence interval.
This prediction relies on the exogenous input incorporated into the model, whose coefficients are derived from the underlying linear regression (the correlation-based model of Section 5.2, with its equation reported in Table 3).
5.4.5. ARIMAX Model Based on PCA Variables
The prediction of the average unit value of the kilowatt-hour was obtained from a time series model that incorporated as an exogenous variable the estimates generated by the linear regression model based on the variables selected through PCA.
Figure 20 shows the historical series of actual values and the projection for the average price. Despite the volatility observed in recent data, the prediction suggests that the series tends to stabilize or show a slight recovery after an initial decline, oscillating around values close to zero over the forecast horizon. The confidence interval accompanying the prediction expands over time, indicative of the increasing uncertainty as the forecast horizon extends.
This prediction relies on the exogenous input incorporated into the model, whose coefficients are derived from the underlying linear regression (the PCA-variable model of Section 5.3, with its equation reported in Table 3).
6. Discussion
According to the current regulatory framework governing wholesale electricity transactions in Ecuador (ARCERNNR Regulation 001/23), generation dispatch is executed on an hourly basis, yet commercial energy settlements are carried out monthly. Furthermore, the majority of electricity transactions are conducted through long-term regulated contracts between generation companies and distribution utilities, which are also settled monthly. For this reason, the present study focuses on modeling the monthly average unit cost of energy supply. Although spot market transactions—subject to hourly price volatility—exist in Ecuador, their share is marginal, as most of the energy is allocated via regulated agreements. Additionally, Ecuador has not experienced negative electricity prices, largely due to the low penetration of renewable sources, which currently account for only around 5% of total generation. This contextual specificity justifies the modeling approach based on monthly resolution and reinforces the relevance of variables such as hydropower, imports, and regulatory indices.
The implementation of three regression approaches (expert-driven selection, correlation-based selection, and PCA-based dimensionality reduction) was designed to evaluate trade-offs between interpretability, predictive performance, and robustness. While all models achieved statistically significant results, the correlation-based model outperformed the others in terms of explanatory power ($R^2 = 0.9887$), capturing nearly 99% of the variance in average energy price. However, given this near-perfect fit, the possibility of overfitting cannot be dismissed. Moreover, the available dataset is limited in size, and no formal out-of-sample validation or cross-validation procedures were conducted. This remains a methodological limitation that future work should address by applying rolling-window or time-series cross-validation to test generalizability under different market conditions.
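One way the suggested time-series cross-validation could look is the expanding-window (rolling-origin) scheme sketched below: the regression is refitted on all data up to month t and evaluated on month t + 1 only, so no future information leaks into the fit. The data, the 48-month minimum training window, and the function name `rolling_origin_errors` are assumptions for illustration.

```python
import numpy as np


def rolling_origin_errors(X, y, min_train=48):
    """One-step-ahead forecast errors from an expanding-window OLS refit."""
    errors = []
    for t in range(min_train, len(y)):
        # Fit on observations 0..t-1 only.
        design = np.column_stack([np.ones(t), X[:t]])
        coef, *_ = np.linalg.lstsq(design, y[:t], rcond=None)
        # Predict observation t, which the fit has not seen.
        x_next = np.concatenate([[1.0], X[t]])
        errors.append(y[t] - x_next @ coef)
    return np.asarray(errors)


rng = np.random.default_rng(8)
n_obs = 84
X = rng.normal(size=(n_obs, 2))  # synthetic predictors
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.1, n_obs)

errors = rolling_origin_errors(X, y)
rmse = float(np.sqrt(np.mean(errors**2)))
mae = float(np.mean(np.abs(errors)))
print(f"folds = {len(errors)}, out-of-sample RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

Comparing these out-of-sample errors with the in-sample residuals gives a direct indication of whether a very high $R^2$ reflects overfitting.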
We also acknowledge the potential for multicollinearity, especially given the extremely high correlation between some predictors. While variance inflation factors (VIFs) were not computed in this version of the study, two strategies were implemented to mitigate this issue: the application of PCA to isolate latent factors and reduce redundancy, and the comparative analysis of model performance metrics, such as adjusted $R^2$ and statistical significance, to detect collinear effects. Future extensions should include formal diagnostic measures such as VIF and condition indices to further validate the regression models.
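A VIF diagnostic of the kind proposed for future work can be sketched in a few lines. Each predictor is regressed on the remaining ones and the VIF is 1 / (1 - R_j^2). The three synthetic predictors are assumptions designed so that two are nearly collinear (mimicking thermal energy and variable cost) and one is independent; the cutoff of 10 mentioned in the comment is a common rule of thumb, not a value from the study.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column: regress it on the other columns (with intercept)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_pred = X.shape
    vifs = np.empty(n_pred)
    for j in range(n_pred):
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_obs), others])
        coef, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
        resid = X[:, j] - design @ coef
        r_squared = 1.0 - resid.var() / X[:, j].var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


rng = np.random.default_rng(9)
thermal = rng.normal(0.0, 1.0, 84)
variable_cost = 2.0 * thermal + rng.normal(0.0, 0.1, 84)  # nearly collinear
demand = rng.normal(0.0, 1.0, 84)                          # independent

names = ["thermal", "variable_cost", "demand"]
vifs = variance_inflation_factors(np.column_stack([thermal, variable_cost, demand]))
for name, vif in zip(names, vifs):
    flag = "high" if vif > 10 else "ok"  # 10 is a rule-of-thumb cutoff
    print(f"{name:14s} VIF = {vif:8.2f} ({flag})")
```

Predictors with very large VIFs are candidates for removal or combination before the coefficients are interpreted.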
Beyond statistical performance, the proposed modeling framework holds practical value for decision-makers in the energy sector. First, the high accuracy of the correlation-based model enables monthly forecasts that can support operational planning, procurement decisions, and budget forecasting for distribution utilities. For example, energy suppliers can anticipate cost increases during drought periods (captured by hydropower reductions and TIES imports) and plan contractual strategies accordingly.
Second, the interpretability of the regression coefficients offers actionable insights for tariff design. The positive and statistically significant coefficient associated with the settlement ratio indicates that regulatory adjustments in cost reallocation have immediate and substantial effects on average prices. This enables regulatory agencies to simulate the impact of policy shifts or price band updates on end-user tariffs in advance.
Third, the high sensitivity of the model to thermal generation and cross-border imports highlights the risks associated with fossil-fuel dependency and limited diversification of the energy matrix. From a long-term planning perspective, the results underscore the importance of promoting renewable investment to reduce exposure to international price shocks and hydrological volatility.
To further contextualize the contribution of this study, it is important to compare the proposed models with existing techniques in the literature. While machine learning approaches such as NARX neural networks [15] and LSTM or XGBoost-based models [20] have achieved strong results in electricity price forecasting, they often require large volumes of high-frequency data and offer limited transparency for policymaking. In contrast, the regression and ARIMAX models developed in this study combine statistical rigor, interpretability, and moderate data requirements, making them well suited for structurally constrained markets such as Ecuador.
Moreover, while GARCH and stochastic volatility models [14] are effective for capturing price volatility in hourly or intraday markets, they are less compatible with the monthly cost settlement logic prevailing in the Ecuadorian system. The ARIMAX model used in this study, fed with exogenous predictors from the best-performing regression, enhances medium-term forecast capability without sacrificing transparency or policy alignment.
In conclusion, the integrated use of multiple linear regression and ARIMAX modeling provides a robust, interpretable, and policy-relevant framework for forecasting energy supply costs in regulated and hydrologically sensitive electricity markets. These findings are valuable not only for Ecuador but also for other emerging economies with similar structural and regulatory characteristics.
7. Conclusions
This study presents a comprehensive modeling framework for estimating and forecasting the average unit cost of energy supply in Ecuador’s electricity distribution system using multiple linear regression and ARIMAX models. The analysis incorporates a diverse set of operational and economic variables, capturing the structural characteristics and regulatory dynamics of the national power market.
Empirical results reveal that the models exhibit strong explanatory power, with the best-performing linear regression achieving an $R^2$ of 0.9887. Among the most influential variables, the settlement ratio, fixed and variable generation costs, and imported energy costs (TIES) show the strongest positive correlations with energy prices, while hydropower generation exhibits a strong negative association. These findings are consistent with the price-suppressing role of hydropower and the significant cost pressure imposed by external supply and regulatory redistributions.
The modeling approach further highlights the importance of capturing crisis dynamics. The inclusion of 2023–2024 data, characterized by a major hydrological drought and increased reliance on high-cost imports, allows the models to reflect the volatility and regulatory adjustments faced by the Ecuadorian power system. Such integration strengthens the validity of the framework under both stable and disruptive conditions.
From a policy standpoint, the results underscore the urgent need to enhance system resilience through diversification of the energy mix, more flexible import contracts, and adaptive regulatory mechanisms that mitigate the cost volatility passed onto distribution companies and end users. The models developed here can serve as decision-support tools for tariff setting, risk planning, and energy procurement strategies.
As future work, we recommend the development of high-frequency models using daily or hourly data to better capture short-term volatility and operational shocks. Additionally, further studies could incorporate nonlinear modeling techniques, such as regime-switching models or machine learning algorithms, to explore potential threshold effects and improve predictive performance under extreme conditions. A backtesting exercise targeting specific crisis periods—such as the fourth quarter of 2023—will also be implemented to evaluate out-of-sample robustness.
In conclusion, this research provides a robust, interpretable, and context-sensitive approach to energy price forecasting, contributing both methodologically and practically to energy economics and planning in emerging markets.