A Latent-Factor System Model for Real-Time Electricity Prices in Texas

Abstract: A novel methodology that models electricity prices and latent causes as endogenous, multivariate time series is developed and applied to the Texas energy market. In addition to exogenous factors like the type of renewable energy and system load, observed prices are also influenced by some combination of latent causes. For instance, prices may be affected by power outages, erroneous short-term weather forecasts, unanticipated transmission bottlenecks, etc. Before disappearing, these hidden, unobserved factors are usually present for a contiguous period of time, thereby affecting prices. Using our system-wide latent factor model, we find that: (a) latent causes have a highly significant impact on prices in Texas; (b) the estimated latent factor series correlates strongly and positively with system-wide prices during peak and off-peak hours; (c) the merit-order effect of wind significantly dampens prices, regardless of region and time of day; and (d) nuclear baseload generation also significantly lowers prices over a 24-h period in the entire system.


Introduction
Energy prices are known in the day-ahead market, but actual real-time prices deviate from the day-ahead prices for many "hidden" reasons; see [1]. For example, errors in load, wind, or solar output forecasts, the outage of a power plant or transmission line, and many other unforeseen events cause real-time prices to deviate from day-ahead prices. These latent factors are difficult to measure and adjust for in real time, and yet their impact on prices can be significant. This is reflected in the fact that a new real-time price is set as often as every five minutes.
A typical approach to explaining real-time prices is to start with the ex-post day-ahead price and model deviations of the real-time price from the day-ahead price as a function of forecasting errors. While this approach may be helpful, it fails to consider the myriad unobserved latent factors. Also, system-wide hidden factors are difficult to forecast in the single-equation models that are used to explain real-time prices.
One of the two main aims of this paper is to present a novel methodology that uses unobserved latent factors and exogenous variables to explain energy prices in Texas by modeling these prices as endogenous, multivariate time-series. This system-wide approach then leads to estimating the attendant merit-order effects of baseload generation (nuclear energy) as well as renewable energy generation (wind and solar).
The hourly real-time market (RTM) energy price used in our analysis originates from the 5-min real-time energy prices based on the real-time operation of the Electric Reliability Council of Texas (ERCOT). ERCOT uses a security-constrained economic dispatch model (SCED) to simultaneously manage energy, system power balance, and network congestion, yielding 5-min locational marginal prices (LMPs) for each electrical bus within the market. The SCED process seeks to minimize offer-based costs, subject to power balance and network constraints. The zonal settlement price for a load-serving entity's real-time energy purchase is a load-weighted average of all 5-min LMPs in a load zone, which is converted to 15-min values or hourly values by ERCOT.
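The aggregation described above can be illustrated with a minimal Python sketch. The function names and sample numbers are hypothetical, and the simple mean used to collapse twelve 5-min values into an hourly value is an assumption made purely for illustration; ERCOT's actual settlement rules may differ.

```python
def load_weighted_price(lmps, loads):
    """Load-weighted average of bus-level LMPs ($/MWH) for one 5-min interval."""
    if len(lmps) != len(loads):
        raise ValueError("lmps and loads must have the same length")
    total_load = sum(loads)
    if total_load <= 0:
        raise ValueError("total load must be positive")
    return sum(p * w for p, w in zip(lmps, loads)) / total_load


def hourly_price(five_min_prices):
    """Collapse twelve 5-min zonal prices into one hourly value.

    Assumption: a simple (unweighted) mean over the hour.
    """
    if len(five_min_prices) != 12:
        raise ValueError("expected twelve 5-min intervals in an hour")
    return sum(five_min_prices) / 12


# Hypothetical zone with two buses: the more heavily loaded bus dominates.
zonal = load_weighted_price(lmps=[20.0, 40.0], loads=[1.0, 3.0])
```

Weighting by load means that a price spike at a lightly loaded bus moves the zonal price less than the same spike at a heavily loaded bus.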
Methodological Contribution. We explore this topic using hourly real-time price data for the years 2015-2018 from the ERCOT market. Divided into eight zones (North, Houston, South, West, Austin Energy, CPS Energy, Lower Colorado River Authority, and Rayburn Electric Cooperative), ERCOT serves the electrical needs of the largest electricity-consuming state in the U.S.; it accounts for about 8% of the nation's total electricity generation and is repeatedly cited as North America's most successful attempt to introduce competition in both the generation and retail segments of the power industry (Distributed Energy Financial Group, 2015). In the interest of brevity, we report the findings for the Houston, Austin, and West regions, since the results from the other regions are similar.
To the best of our knowledge, this study is the first attempt at developing a latent-factor system-wide model for estimating the merit-order effects of baseload and renewable energy generation. While we use the Texas energy market to exemplify the methodology, the models developed here are readily applicable to other markets as well. Moreover, while we focus on real-time prices, the methodology readily lends itself to the study of day-ahead prices as well.
Section 2 describes the data and variables used in the study. The system-wide latent factor model for prices is detailed in Section 3. Section 4 provides the results, followed by a discussion and conclusion in Section 5.

Data and Variables
This section describes the data used in the analytic models, including the geographical scope and sample period.

Geographical Scope
The current ERCOT market with its eight zones is the focus of the paper; see Figure 1 for a map of ERCOT. The North and Houston zones account for about 37% and 27%, respectively, of ERCOT market energy sales, while the South and West zones contribute 12% and 9%. Further, these four zones account for nearly all of the state's retail competition, and most of the competitive generation resides within these zones.

Sample Period and Variables
The sample period starts on 1 January 2015 and ends on 31 December 2018. Thus, we have a very large dataset since all the eight price series will appear together as endogenous variables in the multivariate response matrix.
As noted earlier, we discuss at length the results for the following three zones: Houston, Austin, and the West. Additionally, we examine the merit-order effects stemming from three hours in a 24-h cycle: 4:00 a.m., 12:00 p.m., and 4:00 p.m. The first is off-peak and the other two correspond to peak hours. There is nothing special about the specific hours we chose to work with; a similar analysis with other hours yields the same overall conclusions reported here.
Table 1 provides the summary statistics for the prices ($/MWH) for the three hours and the three zones, respectively. The corresponding time-series plots of the nine series, shown in red, appear in Figure 2. In the analysis, however, we work with the natural log of the price data. One of the insights we hope to gain is to see how the one-hour-ahead in-sample estimates of the latent factor time series track the price plots in Figure 2. If we can show that there is a strong correlation between the estimated latent factor series and the price series, then that bodes well for the estimation of merit-order effects from alternative energy and baseload generation. On the other hand, if there is a very weak relationship between latent factors and the price series, then exogenous factors should suffice in understanding the fluctuations in the price data.
A brief discussion of each of the independent variables now follows. These variables were selected based on careful data exploration via summary plots/correlation tables, practical considerations of data size, modeling aims, and computational complexities. Additionally, price formation in the ERCOT market has been analyzed in a variety of antecedent studies using many of the same data sources and variables employed in this study [14][15][16].
The exogenous variables used in this study are split into those that appear in the observation and latent factor equations, respectively; these equations are detailed in the next section.
Observation Equation Exogenous Variables. Wind generation, nuclear generation, solar generation, the Henry Hub gas price, and a dummy variable for price spikes that exceed $500/MWH are the exogenous variables. ERCOT analysts have noted that industrial customers tend to scale back significantly when prices exceed USD 500. So, a binary dummy variable for extreme price spikes is used. The solar generation variable and the dummy variable do not appear in the 4:00 a.m. equations. We downloaded daily natural gas prices for Henry Hub from the DOE/EIA (see http://www.eia.gov/dnav/ng/hist/rngwhhdd.htm; last accessed 18 July 2020). We use the Henry Hub price instead of the local natural gas price (e.g., Houston Ship Channel) since the Henry Hub price is highly correlated with the local natural gas price (r > 0.95). Finally, the latent factor variable, which is estimated endogenously from within the system, appears as an exogenous variable in the observation equation. All variables are on the natural log scale except, of course, the dummy variable.
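As a rough illustration of how these regressors could be assembled for one day and one hour, consider the Python sketch below. The function name, argument names, and dictionary keys are hypothetical, and the choice of which real-time price triggers the spike dummy is an assumption; only the $500/MWH threshold, the log transform, and the exclusion of solar and the dummy at 4:00 a.m. come from the text.

```python
import math

SPIKE_THRESHOLD = 500.0  # $/MWH, the threshold stated in the text


def observation_regressors(wind, nuclear, solar, gas_price, rt_price, hour):
    """Build the exogenous regressors of the observation equation for one hour.

    Generation inputs and the gas price are logged; the spike dummy is left
    on its original 0/1 scale. All inputs must be strictly positive where a
    log is taken (solar is only logged for daylight hours).
    """
    row = {
        "ln_wind": math.log(wind),
        "ln_nuclear": math.log(nuclear),
        "ln_gas": math.log(gas_price),
    }
    if hour != 4:
        # Solar generation and the spike dummy are omitted at 4:00 a.m.
        row["ln_solar"] = math.log(solar)
        row["spike"] = 1 if rt_price > SPIKE_THRESHOLD else 0
    return row


# Hypothetical inputs for the 4:00 p.m. hour (hour=16) on a spike day.
example = observation_regressors(5000.0, 4000.0, 100.0, 3.0, 600.0, hour=16)
```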

Latent Factor Equation Exogenous Variables. Recall that the latent factors are unobserved variables; there is no data for them. The parameters corresponding to these variables are recursively estimated from within the system at each point in time, which leads to the following intuition: if one could observe these latent causes, then they are most likely going to be related to load and prices. For instance, power outages, erroneous short-term weather forecasts, unanticipated transmission bottlenecks, etc., would most certainly impact demand and price distributions across ERCOT. Therefore, we use system-wide load (MWH) and lagged weighted price ($/MWH) across all eight zones as the exogenous factors that could likely associate with the unobserved factor variables. Additionally, a first-order autoregressive process for the latent factor is used. This allows us to capture the potential lingering effects of hidden variables over time. As described in the next section, while we could use higher-order lags, we do not do so in the interests of parsimony. Also, the lagged weighted prices do capture some of the previous time period's effect on the latent factor. Note that the system-wide load is, in one sense, endogenous to the observation equation via the latent factor. Finally, we work with the natural logs of all these variables.

The Latent Factor Systems Model
Following [23,24], suppose there are $k$ endogenous variables. Let $n_f < k$ denote the number of unknown or hidden latent factors. Then, the system of equations that represents the prices in the $k = 8$ zones in ERCOT with $n_f = 1$ is given by:

$$\begin{aligned} \mathbf{y}_t &= \mathbf{P}\,\mathbf{f}_t + \mathbf{Q}\,\mathbf{x}_t + \mathbf{u}_t, \\ \mathbf{f}_t &= \mathbf{R}\,\mathbf{w}_t + \mathbf{A}\,\mathbf{f}_{t-1} + \boldsymbol{\nu}_t, \end{aligned} \tag{1}$$

where the first equation is called the observation equation and the second is termed the latent factor equation. The dimensions of the various quantities in Equation (1) are: $\mathbf{y}_t$ is a $k \times 1$ vector of endogenous variables; $\mathbf{P}$ is a $k \times n_f$ matrix of parameters; $\mathbf{f}_t$ is an $n_f \times 1$ vector of latent factors; $\mathbf{Q}$ is a $k \times n_x$ matrix of parameters; $\mathbf{x}_t$ is an $n_x \times 1$ vector of exogenous variables; $\mathbf{u}_t$ is a $k \times 1$ vector of random errors that are assumed to be normally distributed with mean zero and unknown standard deviation $\sigma_u$; $\mathbf{R}$ is an $n_f \times n_w$ matrix of parameters; $\mathbf{w}_t$ is an $n_w \times 1$ vector of exogenous variables; $\mathbf{A}$ is an $n_f \times n_f$ matrix of parameters; and $\boldsymbol{\nu}_t$ is an $n_f \times 1$ vector of random errors that is normally distributed with mean zero and unknown standard deviation $\sigma_\nu$. It is possible to introduce another equation in (1) that represents an autoregressive structure for the observation error $\mathbf{u}_t$. However, this leads to a larger number of parameters than is warranted in most applications. Moreover, convergence issues abound when the parameter space and the sample size are large. As it is, the class of models contained in (1) is quite rich. By appropriately restricting $n_f$, $p$ and $q$ (the autoregressive orders of the latent factor and of the observation error, respectively), we can obtain Zellner's Seemingly Unrelated Regression model, Vector Autoregressive models, Dynamic Factors with Errors models, etc.; see, for example, [25,26]. Williamson et al. [1] developed an alternative Bayesian latent factor model, using nonparametric methods, that complements the latent factor model in Equation (1).
We could also add higher-order latent factors ($n_f > 1$), but again we err on the side of parsimony. Indeed, we could also increase the dimension of the autoregressive component of the latent factor vector, which we have set as an AR(1) process. But we refrain from doing this since we also include the lagged weighted price of all the zones as an exogenous variable in the vector $\mathbf{w}_t$; i.e., we allow the weighted values of lagged prices from the eight zones to guide the hidden factors that could drive each zone's price in the observation equation, where these prices are endogenous in the system given in (1).
Thus, $\mathbf{y}_t$ is the endogenous vector of prices from the eight zones; $\mathbf{x}_t$ contains the exogenous variables wind, nuclear and solar generation, where the last one appears only in the sunlight hours; the Henry Hub gas price; and a dummy variable for real-time prices exceeding USD 500, which will not appear in the night and early morning hours since prices do not rise to very high levels at these times. The endogenous factor variable $\mathbf{f}_t$ also appears as an exogenous input in the observation equation. The implication is that these hidden factors could influence prices throughout the day. In the latent factor equation, the exogenous variables in $\mathbf{w}_t$ are system-wide load (MWH) and lagged weighted price (USD/MWH) across all eight zones; these are contemporaneous in time. Additionally, we assume the latent factor follows a first-order autoregressive process. Since we separate our analysis for each hour of the day, the lagged variables are the variables of the previous day. Since $\mathbf{f}_t$ enters the observation equation exogenously, the system-wide load affects system-wide prices via $\mathbf{f}_t$. Lastly, the AR(1) specification for $\mathbf{f}_t$ in the second equation captures the lagged nature of hidden factors; for example, poor weather forecasts, which could be one of the latent factors, tend to be contiguous over time.
The maximum likelihood estimates (MLEs) for all the parameters (including $\sigma_u$ and $\sigma_\nu$) are found via an iterative method that combines the two algorithms developed in [27,28]. All analyses were conducted in STATA.
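To make the structure of the system concrete, the following Python sketch simulates a single-factor version of it and recovers one-step-ahead predictions of the latent factor with a Kalman filter. This is an illustration under stated assumptions, not the estimation procedure used in the paper: the parameters are treated as known rather than estimated by maximum likelihood, there is one exogenous variable in each equation, the observation errors are assumed independent across zones with a common standard deviation, and every name and numeric value is hypothetical.

```python
import math
import random


def simulate(T, P, Q, R, A, s_u, s_v, seed=0):
    """Simulate y_t = P f_t + Q x_t + u_t and f_t = R w_t + A f_{t-1} + v_t."""
    rng = random.Random(seed)
    f_prev = 0.0
    ys, xs, ws, fs = [], [], [], []
    for _ in range(T):
        x = rng.gauss(0.0, 1.0)  # stand-in for an observation-equation regressor
        w = rng.gauss(0.0, 1.0)  # stand-in for a factor-equation regressor
        f = R * w + A * f_prev + rng.gauss(0.0, s_v)
        y = [P[i] * f + Q[i] * x + rng.gauss(0.0, s_u) for i in range(len(P))]
        ys.append(y)
        xs.append(x)
        ws.append(w)
        fs.append(f)
        f_prev = f
    return ys, xs, ws, fs


def one_step_ahead_factor(ys, xs, ws, P, Q, R, A, s_u, s_v):
    """Kalman filter for a scalar factor; returns predictions of f_t made
    from data through t-1 plus the exogenous input w_t. Requires |A| < 1."""
    f_filt = 0.0
    V_filt = s_v ** 2 / (1.0 - A ** 2)  # rough stationary starting variance
    preds = []
    for y, x, w in zip(ys, xs, ws):
        # Prediction step from the latent factor equation.
        f = R * w + A * f_filt
        V = A * A * V_filt + s_v ** 2
        preds.append(f)
        # Update step, one zone at a time (valid for independent errors).
        for i in range(len(P)):
            S = P[i] ** 2 * V + s_u ** 2       # innovation variance
            K = V * P[i] / S                   # Kalman gain
            f += K * (y[i] - P[i] * f - Q[i] * x)
            V *= 1.0 - K * P[i]
        f_filt, V_filt = f, V
    return preds


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Eight zones loading on one factor; all values are hypothetical.
P = [1.0] * 8
Q = [-0.1] * 8
ys, xs, ws, fs = simulate(1000, P, Q, R=0.5, A=0.8, s_u=0.5, s_v=1.0, seed=1)
preds = one_step_ahead_factor(ys, xs, ws, P, Q, R=0.5, A=0.8, s_u=0.5, s_v=1.0)
```

Calling `corr(preds, fs)` compares the predicted factor with the simulated one. In a maximum likelihood setting, the innovations and their variances computed in the update step would feed the likelihood that is maximized over the parameters.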

Results
Here, we report and discuss the results for three regions: Houston, Austin, and West; details on all other regions are available on request. Where appropriate, we highlight the empirics from the other regions as well. For the three regions, we report the results for 4:00 a.m., 12:00 p.m., and 4:00 p.m.; these are representative of the other off-peak and peak hours. Thus, we estimate Equation (1) nine times since we have nine models in total. We have the following major results.
Wald Test. The Wald statistic has an asymptotic chi-square distribution under the null hypothesis. It tests the null hypothesis that all the unknown coefficients in the observation and latent factor equations are jointly zero; this is similar to the overall F-test in multiple linear regression. For all nine models, the Wald statistic soundly rejects the null hypothesis at any conventional significance level (p < 0.00001).
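For reference, a standard textbook form of the Wald statistic is sketched here. The symbols $\hat{\boldsymbol{\theta}}$, $\hat{\mathbf{V}}$, and $q$ are notation introduced only for this illustration, and exactly which coefficients enter the joint test is not specified above. With $\hat{\boldsymbol{\theta}}$ the $q \times 1$ vector of estimated coefficients under test and $\hat{\mathbf{V}}$ its estimated covariance matrix, the statistic for $H_0: \boldsymbol{\theta} = \mathbf{0}$ is

$$W = \hat{\boldsymbol{\theta}}^{\top}\, \hat{\mathbf{V}}^{-1}\, \hat{\boldsymbol{\theta}} \;\overset{a}{\sim}\; \chi^2_q \quad \text{under } H_0 .$$

Large values of $W$ relative to the $\chi^2_q$ distribution yield small p-values and lead to rejection of $H_0$.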
Actual versus predicted price series. Consider Figure 3, which shows the actual and predicted series. As expected, there are some outliers in the data, especially during the 4:00 p.m. hour for all three zones. Also, consider Figure 2 again. Note that the predicted time series, shown in grey, track the original price series in red quite well for the three different hours, barring the time points corresponding to the outliers.
Correlations between actual price series and estimated factor series. Table 2 shows the correlations between the predicted latent factor series and each of the price series from all eight zones for the three hours; all of them are positive. We highlighted the correlations for the regions Houston, Austin, and West in Table 2 in order to emphasize two points. First, note that the West zone has the weakest correlation during the peak hours of 12:00 p.m. and 4:00 p.m., compared to other regions. This is because of the larger impact of wind generation in the West during these hours, compared to other zones. Second, consider Figures 4-6. Each comprises four plots. For the sake of clarity, let us focus on just Figure 6, corresponding to the 4:00 p.m. hour. The top left plot is the latent factor one-step-ahead estimated time series. The other three plots in each of the panels are the actual price series for Houston, Austin, and West. The corresponding correlations between the latent factor series and these three price series from Table 2 are 0.587, 0.591, and 0.457, respectively. It is evident that the latent factor series structurally evolves like the three price series, which are representative of the price series for the entire ERCOT system. The presence of outliers in the price series is unavoidable in the ERCOT data. This would explain why some of the correlations are not as high as one might expect. We experimented with higher-order lags in the autoregressive error structure for the latent factor series in Equation (1). But such an increase in model dimensionality does not change the overall conclusions by much. Hence, we err on the side of parsimony.
Note to Table 2: Certain values are bold in order to better understand the Figure 6 discussion.
Significance of the latent factor coefficient. From Table 3, the endogenous latent factor variable, $\mathbf{f}_t$, when it appears as an exogenous variable in the observation equation, is statistically significant for all nine models (p < 0.00001). This result confirms one of the principal assertions in this paper, namely that there are hidden, unobserved factors that influence the distribution of real-time prices throughout a 24-h cycle across all zones. Damien et al. [29] do not use latent factors in their system-wide price and demand ERCOT model. It is evident from this research that latent factors play a significant role in ERCOT's pricing structure.
Note to Table 3: All coefficients have p-values < 0.00001. The latent factor is a vector quantity; hence it appears in bold font to be consistent with the notation in Section 3.
The marginal effects of the exogenous variables. Consider Tables 4-6, which show the maximum likelihood estimates (MLEs) for the coefficients that appear in the observation and latent factor equations in (1), the corresponding p-values, and the 95% confidence intervals for hours 4:00 a.m., 12:00 p.m., and 4:00 p.m., respectively, for the three zones. Since we are dealing with the natural logs of all the variables, the MLEs represent elasticities. We first describe some overarching conclusions from all three tables here, saving for later the discussion of the merit-order effects.
First, from the latent factor equations for all three hours and zones, system-wide load (SystemLoad) positively and significantly impacts the hidden factors. Second, lagged weighted price (LagWtPr) is not significant in the off-peak hour but is significant during the peak hours. Interestingly, it impacts the hidden factors negatively at the noon hour and positively at the 4:00 p.m. hour. Third, the lagged latent factor variable is positive and statistically significant at all three hours for all three zones in the latent factor equation. In conjunction with the plots shown in Figures 4-6, this further confirms the importance of the latent factor dynamics for energy prices in all eight zones. Fourth, from the observation equation for the three zones, during all three hours, as expected, wind generation and nuclear generation have negative elasticities, and Henry Hub gas has a positive elasticity. Fifth, solar generation is a mixed bag, largely because this resource is still growing in Texas, and as such its data are non-stationary. Thus, solar generation is not significant at 12:00 p.m. and its elasticities are positive and weak at 4:00 p.m. Finally, the impact of extreme spikes in real-time prices (the dummy variable) at 12:00 p.m. and 4:00 p.m. is highly significant in all three zones.
System-wide merit-order effects. To best understand the merit-order effects shown as elasticities in Tables 4-6, consider the price boxplots shown in Figure 7. The top, middle and bottom panels, corresponding to hours 4:00 a.m., 12:00 p.m., and 4:00 p.m., respectively, have three boxplots in each panel. On the X-axis, the box titled "Before Price" is the group of mean prices in the eight ERCOT zones before accounting for any merit-order effect. The second and third boxes show the change in mean prices after accounting for merit-order effects in wind and nuclear generation, respectively. The Y-axis represents the mean price values ($/MWH). Each value on this axis is the mean price from each of the eight zones during the years 2015-2018.
Focus on the 4:00 a.m. panel at the top. The interquartile range (IQR) of the mean prices of the eight zones in ERCOT at this hour is $16.18 to $16.61; see the left-most box in blue. Next, assume wind generation increases by 10%. Using the MLEs of the price elasticities for wind generation for each of the eight zones from our latent-factor system model, we adjust the mean prices in the blue box and construct the resulting change in prices due to increased wind generation. The corresponding distribution of the adjusted mean prices in ERCOT is shown as the second box in red. From the caption, the IQR for the prices, after accounting for increased wind generation, is between $15.59 and $16.07. Finally, we make a similar adjustment to energy prices using the parameter estimates for nuclear generation; this is shown as the green box in the top row of Figure 7. The IQR is between $14.98 and $15.42. Observing the three panels, it is also interesting to note that there is less volatility in the mean prices in the entire ERCOT system during the off-peak hour.
Consider the middle panel, which corresponds to the 12:00 p.m. hour. While the reduction in energy prices is smaller now, wind and nuclear generation still have a measurable impact on real-time prices in ERCOT as a whole. Also, there is more volatility in real-time prices during this peak hour.
Finally, the bottom panel shows the impact on prices due to the merit-order effects at 4:00 p.m. Nuclear generation is much more influential than wind at this hour of the day; its boxplot barely intersects with the boxplot from wind generation. Also, the volatility in ERCOT's prices is lower at 4:00 p.m. than at 12:00 p.m.
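The price adjustment behind these boxplots can be illustrated with a small Python sketch. In a log-log model, a coefficient $\beta$ on log generation implies that scaling generation by a factor $(1 + \delta)$ scales the price by $(1 + \delta)^{\beta}$, which is approximately $1 + \beta\delta$ for small $\delta$. The function name, the zone prices, and the elasticities below are hypothetical placeholders, not the estimates reported in Tables 4-6.

```python
def adjusted_price(price, elasticity, pct_change):
    """Price implied by a constant-elasticity (log-log) relationship after a
    fractional change in the regressor, e.g. pct_change=0.10 for +10%."""
    return price * (1.0 + pct_change) ** elasticity


# Hypothetical mean prices ($/MWH) and wind elasticities for four zones.
zone_mean_prices = [16.1, 16.3, 16.5, 16.7]
wind_elasticities = [-0.35, -0.40, -0.38, -0.42]

# Mean prices after a 10% increase in wind generation.
after_wind = [
    adjusted_price(p, e, 0.10)
    for p, e in zip(zone_mean_prices, wind_elasticities)
]
```

Because each zone has its own elasticity, the adjusted prices shift by different proportions, which is why the location and spread of the boxes can both change after the adjustment.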

Conclusions
This paper demonstrated the relevance of latent factors on real-time energy prices using a system-wide approach. The ERCOT system served as the case study. Using energy prices from eight inter-connected zones as endogenous variables, we found that hidden factors significantly impact the merit-order effects of baseload and renewable energy generation.
The latent-factor approach developed here can be improved and extended. Damien et al. [29] use a hierarchical Bayesian approach to compare the impact of day-ahead and real-time prices on wholesale demand in ERCOT. However, they do not model latent factors. This paper clearly shows the importance of accounting for such factors. A Bayesian latent factor system-wide model for prices and/or demand is possible in principle; see [30]. However, the challenges are formidable. First, since the parameter space is very large, convergence issues will be a difficult problem to overcome. Concurrently, while studying energy prices or demand, the attendant datasets tend to be very large, as in this paper. This too will add to convergence issues since the likelihood function will have to be evaluated many-fold in any Markov chain Monte Carlo scheme that is required to obtain posterior distributions.
Another topic for future research that this paper proposes is to model the system of equations in Equation (1) via non-normal errors. For example, Williamson et al. [1] use a nonparametric error distribution, the Indian Buffet Process, to develop a new class of latent factor models. But with large datasets, such nonparametric approaches are even more computationally involved compared to parametric formulations.
Why should a non-normal error structure matter in the context of energy prices, and in the estimation of merit-order effects? Recent studies [21,22] have shown that prices have asymmetric distributions with large kurtosis. Consequently, error distributions from normal linear models tend to be non-normal, heteroscedastic, and autocorrelated. Hence, quantile regressions have been proposed and exemplified in the energy literature. However, there is a trade-off. Because of the mathematics underlying them, quantile regressions are essentially single-equation models. Thus, the prices of each of ERCOT's eight zones can be modeled separately using quantile regressions; see [31]. But the results in this paper clearly demonstrate the importance of treating the eight zones as part of an interconnected system so that we can better understand how latent factors influence prices jointly.
This leads to an open question: how should one construct a system-wide, latent-factor quantile regression model that is equivalent to Equation (1) in this paper? This is a very challenging problem for multiple reasons. For example, consider a bivariate time series that represents prices from, say, two of ERCOT's eight zones. Further, suppose the error term in the observation model in Equation (1) follows a bivariate skew-t distribution, since this distribution allows for varying degrees of skewness. How should one jointly model the quantiles of this bivariate distribution as functions of latent factors and exogenous variables? The answer is not at all evident even in this simple bivariate setup. Therefore, instead of multivariate quantile regression systems, we believe, as a first step, it may be easier to recast Equation (1) using nonparametric prior distributions. Indeed, this could also lead to stronger correlations between the factor and price series since nonparametric priors can better treat outliers. The resulting estimation of the merit-order effects in energy markets would be a useful advancement.