2.1. Posterior Predictive Process
Gaussian processes give rise to a class of Bayesian nonparametric models that have garnered tremendous attention in the machine learning community. The recent attention belies the fact that Gaussian processes, and more broadly kernel methods, have been used in scientific computing and numerical analysis for over 70 years, so these methods draw on a rich theoretical and practical background. The popularity of Gaussian processes is due to the flexibility of modeling in infinite-dimensional space, the efficiency with which the data are used, the ease of implementation, and the natural handling of uncertainty. These properties make Gaussian processes ideal for forecasting load for energy trading, where accurate point estimates are required and uncertainty (i.e., risk) quantification is desirable.
Consider a matrix of input data
$\mathbf{X}\in {\mathbb{R}}^{m\times d}$ such that
$\mathbf{X}:={[{\mathbf{x}}_{1},\dots ,{\mathbf{x}}_{m}]}^{\top}$, and
${\left({\mathbf{x}}_{i}\right)}_{i=1}^{m}\in {\mathbb{R}}^{d}$ are column vectors. For this paper, the input data consist of temperatures
${\left({\mathbf{z}}_{i}\right)}_{i=1}^{m}\in {\mathbb{R}}^{(d-1)/2}$, dew points
${\left({\mathbf{w}}_{i}\right)}_{i=1}^{m}\in {\mathbb{R}}^{(d-1)/2}$, and a time component
${\left({t}_{i}\right)}_{i=1}^{m}\in \mathbb{R}$, such that for
$i=1,\dots ,m$,
$$\mathbf{x}_{i}:={\left[{\mathbf{z}}_{i}^{\top},\,{\mathbf{w}}_{i}^{\top},\,{t}_{i}\right]}^{\top}.$$
Output data, load, are also observed,
$\mathbf{y}\in {\mathbb{R}}_{+}^{m}$ such that
$\mathbf{y}:={[{y}_{1},\dots ,{y}_{m}]}^{\top}$. The goal is to predict the load,
${\mathbf{f}}_{\mathrm{new}}\in {\mathbb{R}}_{+}^{{m}_{\mathrm{new}}}$, at
${m}_{\mathrm{new}}$ future times given future inputs
${\mathbf{X}}_{\mathrm{new}}\in {\mathbb{R}}^{{m}_{\mathrm{new}}\times d}:={[{\mathbf{x}}_{m+1},\dots ,{\mathbf{x}}_{m+{m}_{\mathrm{new}}}]}^{\top}$. For Gaussian processes, the posterior distribution of the predictive process relies on the particular positive definite kernel chosen for the task; the properties of various kernels are discussed in
Section 2.3. For shorthand, it is a common abuse of notation to pass the kernel matrix-valued arguments; the corresponding output is a kernel matrix (often referred to as the Gram matrix), as discussed in
Appendix A. The posterior process is derived in [
16], and the result is that
${\mathbf{f}}_{\mathrm{new}}$, when conditioned on the observed data
${\mathbf{X}}_{\mathrm{new}},\mathbf{X},\mathbf{y}$, is Gaussian distributed. This is denoted as follows:
$${\mathbf{f}}_{\mathrm{new}}\mid {\mathbf{X}}_{\mathrm{new}},\mathbf{X},\mathbf{y}\sim \mathcal{N}\left(K({\mathbf{X}}_{\mathrm{new}},\mathbf{X}){\left[K(\mathbf{X},\mathbf{X})+{\sigma}_{n}^{2}\mathbf{I}\right]}^{-1}\mathbf{y},\;K({\mathbf{X}}_{\mathrm{new}},{\mathbf{X}}_{\mathrm{new}})-K({\mathbf{X}}_{\mathrm{new}},\mathbf{X}){\left[K(\mathbf{X},\mathbf{X})+{\sigma}_{n}^{2}\mathbf{I}\right]}^{-1}K(\mathbf{X},{\mathbf{X}}_{\mathrm{new}})\right),\tag{1}$$
where ${\mathit{\mu}}_{\mathrm{post}}$ is the posterior mean,
$${\mathit{\mu}}_{\mathrm{post}}:=K({\mathbf{X}}_{\mathrm{new}},\mathbf{X}){\left[K(\mathbf{X},\mathbf{X})+{\sigma}_{n}^{2}\mathbf{I}\right]}^{-1}\mathbf{y},\tag{2}$$
and ${\Gamma}_{\mathrm{post}}$ is the posterior variance,
$${\Gamma}_{\mathrm{post}}:=K({\mathbf{X}}_{\mathrm{new}},{\mathbf{X}}_{\mathrm{new}})-K({\mathbf{X}}_{\mathrm{new}},\mathbf{X}){\left[K(\mathbf{X},\mathbf{X})+{\sigma}_{n}^{2}\mathbf{I}\right]}^{-1}K(\mathbf{X},{\mathbf{X}}_{\mathrm{new}}).\tag{3}$$
The mean term is also referred to as the maximum a posteriori (MAP) estimate. In our context, it is a point forecast of the load. Sample realizations can be drawn from the predictive distribution, providing a mathematically rigorous and computationally efficient estimate of uncertainty. As a density estimate for future load, the posterior distribution can be used downstream in electricity price forecasting and modeling the risk associated with generation or trading decisions.
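To make the computation concrete, the posterior mean and covariance above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code used for the forecasts in this paper; the `kernel` argument and `sigma_n2` noise variance stand in for whatever kernel and noise level the analyst chooses.

```python
import numpy as np

def gp_posterior(X, y, X_new, kernel, sigma_n2):
    """Posterior mean and covariance of a GP regression model.

    kernel(A, B) must return the matrix of covariances between rows of A and B;
    sigma_n2 is the observation-noise variance.
    """
    K = kernel(X, X) + sigma_n2 * np.eye(len(X))  # regularized Gram matrix
    K_star = kernel(X_new, X)                     # cross-covariance
    L = np.linalg.cholesky(K)                     # stable solves via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu_post = K_star @ alpha                      # posterior mean (the MAP forecast)
    v = np.linalg.solve(L, K_star.T)
    gamma_post = kernel(X_new, X_new) - v.T @ v   # posterior covariance
    return mu_post, gamma_post
```

Sample realizations can then be drawn from the predictive distribution with, for example, `np.random.multivariate_normal(mu_post, gamma_post)`.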
2.3. Properties of Positive Definite Kernels
Equations (
1) and (
3) make clear that the problem of modeling in a Gaussian process framework revolves around the positive definite kernel
K, and its kernel matrix
$\mathbf{K}$. A brief mathematical background on positive definite kernels is provided in the Appendix. Gaussian processes are
nonparametric models due to the fact that parameters do not appear in the model (see, for instance, Equation (
1)). This nomenclature is somewhat confusing as the kernel itself is typically parameterized. The parameterization of the kernel is important to the modeling process; it allows the analyst to impose structure on the covariance (to specify, for instance, periodicity), and allows for intuitive interpretations of the model. In the statistics literature, a positive definite kernel is referred to as a
covariance kernel, emphasizing that
$K(\cdot ,\cdot )$ is a covariance function. Evaluating this function at particular arguments,
$K({\mathbf{x}}_{i},{\mathbf{x}}_{j})$, provides the covariance between two points
${\mathbf{x}}_{i}$ and
${\mathbf{x}}_{j}$; performing this operation for
$i=1,\dots ,m$ and
$j=1,\dots ,m$ yields the kernel matrix
$\mathbf{K}\in {\mathbb{R}}^{m\times m}$. The goal is to identify the correct covariance structure of the data by selecting an appropriate kernel and parameterizing it correctly. Since the choice of kernel and its parameterization determine the properties of the model, making an appropriate selection is crucial. In this section, we provide examples of the kernels used in our model, explain the rationale behind their use, and discuss mathematical properties needed to construct the model from these kernels.
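As a small illustration of the construction just described, the Gram matrix can be assembled by evaluating a kernel at every pair of inputs. This is a sketch; in practice, vectorized kernel evaluations are preferred for speed.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Evaluate a covariance kernel at all pairs of rows of X,
    yielding the m-by-m kernel (Gram) matrix."""
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K
```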
The most obvious structure of the data is the cyclic nature over the course of the day, week, and year. A discussion of the 7- and 365-day seasonality can be found in [
28], where the authors showed load data in the frequency domain to identify the cyclic behavior. Their analysis has intuitive appeal as it confirms the seasonal structure of power demand one might expect before observing any data. The periodic kernel, first described in [
29], can be used to model seasonality:
$$K({t}_{i},{t}_{j})={\theta}_{1}^{2}\exp \left(-\frac{2{\sin}^{2}\left(\pi |{t}_{i}-{t}_{j}|/{\theta}_{2}\right)}{{\theta}_{3}^{2}}\right),\tag{4}$$
where ${\theta}_{1},\;{\theta}_{2},\;{\theta}_{3}$ are the amplitude, the period, and the lengthscale, respectively.
Figure 4 provides examples of periodic kernels of different lengthscales, as well as realizations of Gaussian processes drawn using the kernels as covariance functions.
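A minimal implementation of the periodic kernel is sketched below, using the standard form with amplitude $\theta_1$, period $\theta_2$, and lengthscale $\theta_3$; the exact parameterization in our model may differ slightly.

```python
import numpy as np

def periodic_kernel(t1, t2, theta1, theta2, theta3):
    """Periodic covariance between two sets of time points.

    theta1: amplitude, theta2: period, theta3: lengthscale."""
    t1 = np.asarray(t1, dtype=float).reshape(-1, 1)
    t2 = np.asarray(t2, dtype=float).reshape(1, -1)
    lag = np.abs(t1 - t2)
    return theta1**2 * np.exp(-2.0 * np.sin(np.pi * lag / theta2)**2 / theta3**2)
```

Points separated by an integer number of periods are perfectly correlated; the lengthscale controls how quickly correlation falls off within a period.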
The Gaussian (also known as squared-exponential) kernel is commonly used in the machine learning literature. It has the form
$$K({\mathbf{x}}_{i},{\mathbf{x}}_{j})={\theta}_{1}^{2}\exp \left(-\frac{{\Vert {\mathbf{x}}_{i}-{\mathbf{x}}_{j}\Vert}^{2}}{2{\theta}_{2}^{2}}\right),\tag{5}$$
where in this context ${\theta}_{1},\;{\theta}_{2}$ are the amplitude and lengthscale parameters, respectively.
Figure 5 provides insight into the effect of the lengthscale parameter for the Gaussian kernel. Notice that the samples in all cases are smooth even for short lengthscales, which is characteristic of the Gaussian kernel.
The Matérn 5/2 kernel, commonly used in the spatial statistics literature, has the form
$$K({\mathbf{x}}_{i},{\mathbf{x}}_{j})={\theta}_{1}^{2}\left(1+\frac{\sqrt{5}\,r}{{\theta}_{2}}+\frac{5{r}^{2}}{3{\theta}_{2}^{2}}\right)\exp \left(-\frac{\sqrt{5}\,r}{{\theta}_{2}}\right),\quad r:=\Vert {\mathbf{x}}_{i}-{\mathbf{x}}_{j}\Vert .\tag{6}$$
Here, ${\theta}_{1},{\theta}_{2}$ are the amplitude and lengthscale parameters, respectively, which are analogous to their counterparts in the Gaussian and periodic kernels. The general Matérn kernel can be viewed as a generalization of the covariance function of the Ornstein–Uhlenbeck (OU) process to higher dimensions ([
16] Section 4.2). The OU process has been used before to forecast load (see, e.g., [
30]), although in a different context than we use the Matérn here.
Figure 6 provides insight into the effect of the lengthscale parameter for the Matérn 5/2 kernel. Although
Figure 6 (right) appears almost indistinguishable from the corresponding frame of
Figure 5, it is apparent in
Figure 6 (left) that draws from Gaussian processes with the Matérn kernel are not smooth (though they are twice-differentiable).
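The two stationary kernels above can be sketched as follows for one-dimensional inputs (illustrative only; the multivariate case replaces $|x_i - x_j|$ with the Euclidean norm):

```python
import numpy as np

def gaussian_kernel(x1, x2, theta1, theta2):
    """Gaussian (squared-exponential) kernel: infinitely smooth sample paths."""
    r = np.abs(np.asarray(x1, float).reshape(-1, 1) - np.asarray(x2, float).reshape(1, -1))
    return theta1**2 * np.exp(-r**2 / (2.0 * theta2**2))

def matern52_kernel(x1, x2, theta1, theta2):
    """Matern 5/2 kernel: twice-differentiable but rougher sample paths."""
    r = np.abs(np.asarray(x1, float).reshape(-1, 1) - np.asarray(x2, float).reshape(1, -1))
    s = np.sqrt(5.0) * r / theta2
    return theta1**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```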
At this point, we have introduced several kernels, but the posterior predictive distribution outlined in
Section 2.1 only identifies a single kernel. The following properties of positive definite kernels allow for their combination:
Remark 1. Let ${K}_{1},\;{K}_{2}$ be two positive definite kernels taking arguments from a vector space $\mathcal{X}$. Then, for all ${\mathbf{x}}_{i},\;{\mathbf{x}}_{j}\in \mathcal{X}$:
$K({\mathbf{x}}_{i},{\mathbf{x}}_{j})={K}_{1}({\mathbf{x}}_{i},{\mathbf{x}}_{j})+{K}_{2}({\mathbf{x}}_{i},{\mathbf{x}}_{j})$ is a positive definite kernel.
$K({\mathbf{x}}_{i},{\mathbf{x}}_{j})={K}_{1}({\mathbf{x}}_{i},{\mathbf{x}}_{j})\times {K}_{2}({\mathbf{x}}_{i},{\mathbf{x}}_{j})$ is a positive definite kernel.
Furthermore, suppose ${K}_{1}$ takes arguments from ${\mathcal{X}}_{1}$ (for instance, time) and ${K}_{2}$ takes arguments from ${\mathcal{X}}_{2}$ (for instance, space). Then, for all ${t}_{i},\;{t}_{j}$ in ${\mathcal{X}}_{1}$ and all ${\mathbf{z}}_{i},\;{\mathbf{z}}_{j}$ in ${\mathcal{X}}_{2}$:
$K\left(\left(\begin{array}{c}{\mathbf{z}}_{i}\\ {t}_{i}\end{array}\right),\left(\begin{array}{c}{\mathbf{z}}_{j}\\ {t}_{j}\end{array}\right)\right)={K}_{1}({t}_{i},{t}_{j})+{K}_{2}({\mathbf{z}}_{i},{\mathbf{z}}_{j})$ is a positive definite kernel.
$K\left(\left(\begin{array}{c}{\mathbf{z}}_{i}\\ {t}_{i}\end{array}\right),\left(\begin{array}{c}{\mathbf{z}}_{j}\\ {t}_{j}\end{array}\right)\right)={K}_{1}({t}_{i},{t}_{j})\times {K}_{2}({\mathbf{z}}_{i},{\mathbf{z}}_{j})$ is a positive definite kernel.
Proofs of the statements in the above remark are straightforward and can be found in ([
31] Section 13.1). This remark allows for the combination of all the kernels that have been discussed in a way that is intuitive and easy to code. For example, we use the periodic kernel with a Matérn decay on time:
$$K({t}_{i},{t}_{j})={\theta}_{1}^{2}\exp \left(-\frac{2{\sin}^{2}\left(\pi |{t}_{i}-{t}_{j}|/{\theta}_{2}\right)}{{\theta}_{3}^{2}}\right)\left(1+\frac{\sqrt{5}\,r}{{\theta}_{4}}+\frac{5{r}^{2}}{3{\theta}_{4}^{2}}\right)\exp \left(-\frac{\sqrt{5}\,r}{{\theta}_{4}}\right),\quad r:=|{t}_{i}-{t}_{j}|.\tag{7}$$
The kernel can be interpreted as follows: the Matérn portion allows for a decay away from strict periodicity, so the periodic structure can change with time. A small value of ${\theta}_{4}$ corresponds to rapid changes in the periodic structure, whereas a long lengthscale suggests that the periodicity remains constant over long periods of time. We combine the amplitude parameters of the two kernels into a single parameter, ${\theta}_{1}$.
Figure 7 demonstrates the effect that varying
${\theta}_{4}$ of Equation (
7) has on the kernel, and on realizations of a process using that kernel. Consider that the structure of the top realization in
Figure 7 (left) has clear similarities between successive periods, but after four or five periods the differences become more pronounced; this is in contrast to the bottom realization for which the similarities after two periods are already difficult to identify.
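Following Remark 1, a quasi-periodic kernel such as Equation (7) is simply an element-wise product of a periodic factor and a Matérn 5/2 factor. The sketch below is illustrative and uses the same parameter names as the text.

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, theta1, theta2, theta3, theta4):
    """Periodic kernel multiplied by a Matern 5/2 decay in time.

    theta1: shared amplitude, theta2: period, theta3: periodic lengthscale,
    theta4: decay lengthscale (small values let the periodicity drift quickly)."""
    t1 = np.asarray(t1, dtype=float).reshape(-1, 1)
    t2 = np.asarray(t2, dtype=float).reshape(1, -1)
    r = np.abs(t1 - t2)
    periodic = np.exp(-2.0 * np.sin(np.pi * r / theta2)**2 / theta3**2)
    s = np.sqrt(5.0) * r / theta4
    matern = (1.0 + s + s**2 / 3.0) * np.exp(-s)
    # the element-wise product of two positive definite kernels is positive definite
    return theta1**2 * periodic * matern
```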
2.4. Creating a Composite Kernel for Load Forecasts
To showcase the properties of this method, and to give insight into how one might go about creating a composite kernel using domain expertise, we step through the construction of one such kernel in this section. At each step, we discuss the desired phenomena that we would like to capture with the structure of the latest model. We also provide figures to help interpret the effect the changes have on the resulting predictions. The purpose of this section is to demonstrate how practitioners and researchers can create kernels which incorporate their own understanding of the ways load can vary with the forecast inputs. For illustrative purposes, we use the same training/test data for every step in this process.
We train each model on PJM-ISO data beginning on 17 September 2018 and ending on 26 October 2018. We then predict load for 27 October 2018. The point estimate prediction is the posterior mean of the Gaussian process. We also draw 1000 samples from the posterior distribution to illustrate the uncertainty of the model. Parameters that are not set manually are estimated via maximum likelihood, as described in
Section 2.5. The parameter
${\sigma}_{n}^{2}$ is not associated with any kernel, but reflects the magnitude of the noise, and is required for regularizing the kernel, as shown in Equations (
1) and (
3).
We begin with two kernels meant to capture the periodicity. As discussed in
Section 2.3, there is known daily and weekly seasonality to the data, so we fix the parameters that control the periods. The kernel is thus,
where
${\theta}_{0}$ is the amplitude of the composite kernel. The parameters are provided in
Table 1.
Figure 8 shows that the daily and weekly periodicity capture a substantial portion of the variation in load. As expected, a purely periodic structure is not sufficient to accurately predict future load. Nevertheless, capturing the weekly and daily periodicity is an important first step for creating an accurate model. The estimated uncertainty is too low, likely because the model is not flexible enough to capture all the variation in load. We next develop a kernel that specifies a more realistic model to address this problem.
Studying the structure of the forecasted and actual data in
Figure 8 suggests that a decay away from exact periodicity is desirable. One way to achieve this is via the kernel described in Equation (
7). In particular, we want to allow consecutive days to covary more strongly than nonconsecutive days, and similarly for weeks. This structure is described in Equation (
9) and
Table 2.

Relaxing the strict periodic structure gives the model the flexibility required to capture the shape of the load curve. The noise term regularizes as needed to avoid overfitting. The training set is nearly perfectly predicted, but the model seems to generalize adequately, suggesting that the regularization is working.
Figure 9 shows that the decaying periodic model better captures the structure of the load, and the uncertainty estimate is more realistic. The predicted uncertainty is now somewhat too high, which we address by adding more structure to the model.
The model appears to explain the temporal trends in the training data. For example, note the distinct dip and rise in the test predictions around the morning peak, characteristic of an autumn load curve. The discrepancy between the forecast and actual values on the test set may be due to the inability of a strictly time series model to capture all of the intricacies of power demand on the electric grid.
A reasonable next step is to incorporate temperature information. We do this with a tensor product over time to allow for a decay of the relevance of information as time passes. The resulting kernel is described by Equation (
10) with the parameters outlined in
Table 3. The temperature data are modeled with a single Gaussian kernel, which gives changes in temperature at every location equal weight. More sophisticated methods for handling the high-dimensional temperature data are discussed in
Section 2.6.
Figure 10 shows the results of using the kernel to predict only on the test set. The simulations shown in
Figure 10 are more accurate than those shown in
Figure 9. The errors are smaller and the uncertainty appears more reasonable. Clearly, some phenomena are not picked up by the model. There remains a persistent error of approximately 5000 MW throughout the day. A more rigorous evaluation of the performance and a more thorough analysis of the forecast error and uncertainty for more complex models is provided in
Section 3.
The model used in
Section 3 is the result of creating a composite kernel using the procedure described in this section. The final kernel includes additional structure meant to capture phenomena not discussed here. In contrast to many modern machine learning techniques, we develop this structure in a manner consistent with domain-specific knowledge, as described above. The ability to incorporate this type of knowledge makes Gaussian processes among the most powerful tools available to researchers with subject expertise. The benefits of Gaussian processes are likely smaller in situations where the modeler has little domain-specific knowledge.
2.5. Parameter Estimation
Once the kernel has been specified, the values of the parameters of the model must be determined. We use maximum likelihood estimation, a popular and mathematically sound method for parameter estimation. The log marginal likelihood of a Gaussian process is
$$\log p(\mathbf{y}\mid \mathbf{X},\mathit{\theta})=-\frac{1}{2}{\mathbf{y}}^{\top}{\left[\mathbf{K}+{\sigma}_{n}^{2}\mathbf{I}\right]}^{-1}\mathbf{y}-\frac{1}{2}\log \det \left(\mathbf{K}+{\sigma}_{n}^{2}\mathbf{I}\right)-\frac{m}{2}\log 2\pi ,\tag{11}$$
where the parameter vector $\mathit{\theta}$ is implicit in the kernel and $\det $ is the determinant of the matrix. The observed data are fixed, thus the goal is to choose the parameters which are most likely to have resulted in the observed data. Equation (
11) has a convenient interpretation that helps explain why Gaussian processes are useful for modeling complex phenomena: the first term is the datafit, encapsulating how well the kernel evaluated at the inputs represents the outputs. The second term is a penalty on the complexity of the model, depending only on the covariance function and the inputs. The third term is a normalization constant. Maximum likelihood estimation of the Gaussian process occurs over the hyperparameters
$\mathit{\theta}$ and the complexity penalty inherently regularizes the solution.
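The three terms of the log marginal likelihood can be computed directly; the sketch below uses a Cholesky factorization so that the log-determinant (the complexity penalty) comes almost for free. The function name and interface are illustrative.

```python
import numpy as np

def log_marginal_likelihood(X, y, kernel, sigma_n2):
    """Log marginal likelihood of a GP, split into the three terms
    discussed above: data-fit, complexity penalty, and normalization."""
    m = len(y)
    K = kernel(X, X) + sigma_n2 * np.eye(m)   # regularized kernel matrix
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha               # how well the kernel explains y
    complexity = -np.sum(np.log(np.diag(L)))  # equals -0.5 * log det(K)
    normalization = -0.5 * m * np.log(2.0 * np.pi)
    return data_fit + complexity + normalization
```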
Likelihood maximization is an optimization problem that can be tackled in a variety of ways. Due to the high dimensionality of the problem, a grid search is too time-consuming. For the forecasts in
Section 3, we use a generalization of gradient descent called stochastic subspace descent, as described in [
32] and defined in Algorithms 1 and 2.
Algorithm 1 Generate a scaled Haar-distributed matrix (based on [33]).
Inputs: $\ell ,\,d$ ▹ dimensions of the desired matrix, $d>\ell $
Outputs: $\mathbf{P}\in {\mathbb{R}}^{d\times \ell}$ such that ${\mathbf{P}}^{\top}\mathbf{P}=(d/\ell ){\mathbf{I}}_{\ell}$ and the columns of $\mathbf{P}$ are orthogonal
Initialize $\mathbf{X}\in {\mathbb{R}}^{d\times \ell}$ with ${X}_{i,j}\sim \mathcal{N}(0,1)$
Calculate the QR decomposition $\mathbf{X}=\mathbf{Q}\mathbf{R}$
Let $\mathbf{\Lambda}=\mathrm{diag}\left({R}_{1,1}/|{R}_{1,1}|,\dots ,{R}_{\ell ,\ell}/|{R}_{\ell ,\ell}|\right)$
Set $\mathbf{P}=\sqrt{d/\ell}\,\mathbf{Q}\mathbf{\Lambda}$

Algorithm 2 is called by passing a random initialization of the parameters, as well as a step size, and a parameter
ℓ which dictates the rank of the subspace that is used for the descent. The generic function
$f(\xb7)$ that Algorithm 2 minimizes is specified by Equation (
11).
Algorithm 2 Stochastic subspace descent.
Inputs: $\alpha ,\,\ell $ ▹ step size, subspace rank
Initialize: ${\mathbf{\theta}}_{0}$
for $k=1,2,\dots $ do
  Generate $\mathbf{P}$ by Algorithm 1
  ${\mathbf{\theta}}_{k}={\mathbf{\theta}}_{k-1}-\alpha \,\mathbf{P}{\mathbf{P}}^{\top}\nabla f\left({\mathbf{\theta}}_{k-1}\right)$
end for

This particular optimization routine is designed for high-dimensional problems with expensive function evaluations and gradients that are difficult or impossible to evaluate. Stochastic subspace descent uses finite differences to compute directional derivatives along low-dimensional random projections of the gradient, reducing the computational burden of gradient estimation. The subroutine in Algorithm 1 generates the random matrix that determines which directional derivatives to compute. Automatic differentiation software such as
autograd for Python [
34] can speed up the implementation by removing the need for finite differences, or, for simpler kernels, the gradients can be calculated by hand. For complex kernels, this may not be feasible, so for generality and programming ease we use a zeroth-order method.
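A compact sketch of Algorithms 1 and 2 is given below. The finite-difference step size `h`, the function names, and the quadratic test function are our own illustrative choices, not part of [32,33].

```python
import numpy as np

def haar_matrix(d, ell, rng):
    """Scaled Haar-distributed matrix with P.T @ P = (d/ell) * I (Algorithm 1)."""
    X = rng.standard_normal((d, ell))
    Q, R = np.linalg.qr(X)
    lam = np.sign(np.diag(R))            # diag(R_ii / |R_ii|)
    return np.sqrt(d / ell) * Q * lam    # column-wise sign fix and scaling

def stochastic_subspace_descent(f, theta0, alpha, ell, iters, h=1e-6, seed=0):
    """Minimize f using only ell directional derivatives per step (Algorithm 2).

    The directional derivatives along the columns of P approximate
    P.T @ grad(f) by forward finite differences (the zeroth-order setting)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        P = haar_matrix(len(theta), ell, rng)
        f0 = f(theta)
        g = np.array([(f(theta + h * P[:, i]) - f0) / h for i in range(ell)])
        theta = theta - alpha * P @ g    # theta_k = theta_{k-1} - alpha * P P^T grad f
    return theta
```

Only $\ell + 1$ function evaluations are needed per iteration, which is the computational saving over a full finite-difference gradient when $\ell \ll d$.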
An important attribute of stochastic subspace descent is that the routine avoids getting caught in saddle points, a typical problem in high-dimensional, nonconvex optimization, as discussed in [
35]. Despite the fact that this algorithm avoids saddle points, there is no guarantee that the likelihood surface is convex for any particular set of data and associated parameterization. Nonconvexity implies that there may be local maxima with suboptimal likelihood that are far from the global maximum in parameter space. To address this concern, we perform multiple restarts of the optimization routine with the parameters initialized randomly over the domain.
2.7. Model Combination
Ensemble methods, which combine multiple models to create a single, more expressive model, have been common in the machine learning community for many years; see [
42] for an early review in the context of classification. Recently, such methods have been applied successfully to load forecasting; in a paper analyzing the prestigious M4 forecasting competition [
43], model combination is touted as one of the most important tools available to practitioners. Done correctly, models can be created and combined without substantial additional computational overhead. This is due to the parallel nature of many ensembles and is true of our proposed method. Several strategies exist for combining models, particularly in the Bayesian framework which allows for a natural weighting of different models by the
model evidence ([
44] Section 3.4), which in our case is expressed by (
11). Extensive research has been conducted into the combination of Gaussian process models; a comparison of various methods is provided in [
24]. In this paper, we propose using the Generalized Product of Experts (GPoE) method originally described in [
45].
The standard product of experts (PoE) framework [
46] can take advantage of the Gaussian nature of the posterior process. We recall the posterior density from Equation (
1) with an additional subscript to denote the model index:
$${\mathbf{f}}_{\mathrm{new}}\mid {\mathbf{X}}_{\mathrm{new}},\mathbf{X},\mathbf{y}\sim \mathcal{N}\left({\mathit{\mu}}_{j},{\mathbf{\Gamma}}_{j}\right),\quad j=1,\dots ,M.$$
The product of Gaussian densities remains Gaussian, up to a constant:
$$\prod _{j=1}^{M}\mathcal{N}\left({\mathit{\mu}}_{j},{\mathbf{\Gamma}}_{j}\right)\propto \mathcal{N}\left(\mathit{\mu},\mathbf{\Gamma}\right),$$
where $M$ is the number of models to be combined. The density of the PoE model is
$$\mathcal{N}\left(\mathit{\mu},\mathbf{\Gamma}\right),\quad \mathit{\mu}=\mathbf{\Gamma}\sum _{j=1}^{M}{\mathbf{\Gamma}}_{j}^{-1}{\mathit{\mu}}_{j},\quad \mathbf{\Gamma}={\left(\sum _{j=1}^{M}{\mathbf{\Gamma}}_{j}^{-1}\right)}^{-1}.\tag{12}$$
Although Equation (
12) may look complicated, it has a simple interpretation. We can rewrite the mean as $\mathit{\mu}={\sum}_{j}{\mathbf{W}}_{j}{\mathit{\mu}}_{j}$, where ${\mathbf{W}}_{j}=\mathbf{\Gamma}{\mathbf{\Gamma}}_{j}^{-1}$ is a weight matrix for the $j$th model, corresponding to the inverse of the uncertainty of that model. That is, models for which the uncertainty is high are downweighted compared to those with low uncertainty. The variance $\mathbf{\Gamma}$ can readily be seen as the inverse of the sum of the constituent precision matrices ${\mathbf{\Gamma}}_{j}^{-1}$.
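The PoE combination rule is only a few lines of linear algebra. The sketch below computes the combined mean and covariance from per-model means and covariances; it is illustrative, and the GPoE weighting described in Appendix B is not included.

```python
import numpy as np

def product_of_experts(mus, gammas):
    """Combine M Gaussian experts N(mu_j, Gamma_j) by multiplying densities.

    Precisions add, and each expert's mean is weighted by its precision,
    so low-uncertainty experts dominate the combination."""
    precision = sum(np.linalg.inv(G) for G in gammas)
    gamma = np.linalg.inv(precision)  # combined covariance
    mu = gamma @ sum(np.linalg.inv(G) @ m for m, G in zip(mus, gammas))
    return mu, gamma
```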
It is apparent that models with low uncertainty will dominate
$\mathbf{\Gamma}$, and cause the variance of the PoE model to be relatively small. This is because if $\Vert {\mathbf{\Gamma}}_{j}\Vert $ is small, then $\Vert {\mathbf{\Gamma}}_{j}^{-1}\Vert $ will be large, dominating the variance of the other models in the sum. The GPoE framework is designed to ameliorate this by allocating additional weight to the models for which the data are most informative. The details of the algorithm, including how we measure the informativeness of the data to a model, are available in
Appendix B. See [
45] for a more detailed discussion.
There is a tradeoff to model combination. On the one hand, it is straightforward to implement, provides empirically and provably better estimates (see, e.g., [
42] for a straightforward explanation in the context of classification), and has enormous practical value as demonstrated in the M4 competition. The cost of these benefits is that there is no longer a single kernel to provide interpretability. Since the method for combining the models is to take a product of the densities, the kernels of the individual models get lost in the mix, turning the GPoE model into a black box. Depending on the application, interpretability may or may not be a relevant consideration. The ability to combine models in this way provides the data analyst the opportunity to make a decision based on the requirements placed on the forecast.