A Proposal to Fix the Number of Factors on Modeling the Dynamics of Futures Contracts on Commodity Prices

Abstract: In the literature on modeling commodity futures prices, the stochastic behavior of the spot price is described as a response to between one and four factors, including both short- and long-term components. The more factors considered in modeling a spot price process, the better the fit to observed futures prices, but the more complex the procedure can be. With a view to contributing to the knowledge of how many factors should be considered, this study presents a new way of computing the best number of factors to be accounted for when modeling the risk management of energy derivatives. The new method identifies the number of factors one should consider in the model and the type of stochastic process to be followed. This study aims to add value to previous studies which consider principal components by assuming that the spot price can be modeled as a sum of several factors. When the method is applied to four different commodities (weekly observations of futures prices traded at the NYMEX for WTI light sweet crude oil, heating oil, unleaded gasoline and Henry Hub natural gas), we find that, while crude oil and heating oil are satisfactorily modeled with two factors, unleaded gasoline and natural gas need a third factor to capture seasonality.


Introduction
Forecasting is not a highly regarded activity for economists and financiers. For some, it evokes images of speculators, chart analysts and questionable investor newsletters. For others, there are memories of the grandiose econometric forecasting failures of the 1970s. Nevertheless, there is a need for forecasting in risk management. A prudent corporate treasurer or fund manager must have some way of measuring the risk of earnings, cash flows or returns. Any measure of risk must incorporate some estimate of the probability distribution of the future asset prices on which financial performance depends. Consequently, forecasting is an indispensable element of prudent financial management.
When a company is planning to develop a crude oil or natural gas field, the investment is significant, and production usually lasts many years. However, there must be an initial investment for there to be any return (see, for example, [1,2], among others). Because futures values are not known beyond a certain date, as there is no trade, it is difficult to measure the risk of these projects. Since commodities (crude oil, gas, gasoline, etc.) are physical assets, their price dynamics are much more complex than those of financial assets because their prices are affected by storage and transportation costs (cost of carry). Due to such complexity, in order to model these price dynamics we need factor models such as those in [3][4][5][6][7][8][9]. In addition, in the transport sector, [10] and [11] use different factor models for modeling bulk shipping prices and freight prices.
In order to measure exposure to price risk due to a single underlying asset, it is necessary to know the dynamics of the term structure of asset prices. Specifically, the value-at-risk (VaR, [12]) of the underlying asset price, the most widely known measure of market risk [13], is characterized by knowing the stochastic dynamic of the price, the volatility of the price and the correlation of different prices at different times. For these reasons, to date, the behavior of commodity prices has been modeled under the assumption that the spot price and/or the convenience yield of the commodity follow a stochastic process.
In the literature we find that the spot price is considered as the sum of both short-term and long-term components (see, for example, [14,15]). Short-term factors account for the mean-reverting components in commodity prices, while long-term factors account for the long-term dynamics of commodity prices, assuming they follow a random walk. Sometimes a deterministic seasonal component needs to be added [16].
Following this approach, some multifactor models have been proposed in the literature. Focusing on the number of factors initially considered, [17] developed a two-factor model to value oil-linked assets. Later, [14] proposed a one-factor model, a two-factor model and a three-factor model, adding stochastic interest rates to the previous factors. This was superseded by a new formulation which appeared in [15], enhancing the latter article and developing a short-term/long-term model. [18] added the long-term spot price return as a third risk factor. Finally, [19] offered researchers a general N-factor model.
At this point, it should be stressed that the decision regarding the number of factors to be used in the model needs to be made a priori. According to the literature consulted above, the models are usually specified with two, three or four factors. However, in this study, the need to assume a fixed number of factors in the model is removed. We propose a new method that identifies the number of factors one should consider in the model and the type of stochastic process to be followed. This method avoids having to fix, possibly inaccurately, a concrete number of factors in advance. This is very useful for researchers and practitioners because the optimal number of factors could change, depending on the accuracy needed in each problem. Clearly, if we do not use the optimal number of factors in modeling the commodity price dynamics, the results will not be optimal.
To the best of our knowledge, there are three previous studies applying principal component analysis [20] to the modeling of commodity futures price dynamics [21][22][23]. However, they model only the futures price dynamics and ignore the dynamics followed by the spot price; consequently, they risk being incoherent, since the futures price is the expected value of the spot price under the Q measure.
This study aims to add value to previous contributions by assuming that the spot price can be modeled as a sum of several factors (long term and short term, seasonality, etc.). Therefore, since it is widely accepted (see, for example, [24]) that the futures price is the expected value of the spot price under the Q measure, from the variance-covariance matrix of the futures prices we can deduce the best structure for modeling the spot price dynamics.
The remainder of this study is organized as follows. Section 2 presents a general theoretical model. Section 3 explains the methodology proposed to set an optimal set of factors. In Section 4, we describe the datasets used to show the methodology and these results are described in Section 5. Finally, Section 6 sets out the conclusions.

Theoretical Model
In the main literature to date (for example, [19]), it is assumed that the commodity log spot price is the sum of several stochastic factors, collected in a vector of state variables with linear dynamics. The dynamics matrix $A$ can be expressed in terms of its eigenvalues $k_1, \dots, k_N$, with $k_1 = 0$, by simply changing the state space basis. Therefore, we already have $M$, $A$ and $C$.
It is also easy to prove that, since $dW_t$ is an $N \times 1$ vector of correlated Brownian motion increments, $R$ enters only through the product $RR'$, which is what appears in all formulae. In fact, it can be proved that any factorization of $RR'$ corresponds to a different definition of the noise, so we can safely take $R$ as any Cholesky factorization of $RR'$. In the Black-Scholes world (risk-neutral world), knowing the real dynamics, the risk-neutral one is obtained by adjusting the drift with $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_N)'$, the vector formed from each state variable's risk premium.
Following [25], the futures price is given by
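For orientation, the following is a sketch of the standard linear Gaussian state-space form that is consistent with the symbols used in this section ($M$, $A$, $C$, $R$, $\lambda$). The state vector name $Z_t$, the starred quantities, the sign convention $M^{*} = M - \lambda$ and the use of $T$ as time to maturity are assumptions made for this illustration; they are not taken from the text and may differ from the notation in [19] and [25].

```latex
% Assumed notation (hypothetical): Z_t = state vector, M^* = M - \lambda,
% W_t^* = Brownian motion under Q with independent components (R absorbs correlation),
% T = time to maturity.
\begin{align}
X_t &= C Z_t, \\
dZ_t &= (M + A Z_t)\,dt + R\,dW_t && \text{(real dynamics)}, \\
dZ_t &= (M^{*} + A Z_t)\,dt + R\,dW_t^{*}, \quad M^{*} = M - \lambda && \text{(risk-neutral dynamics)}, \\
F_{t,T} &= E^{Q}\!\left[ e^{X_{t+T}} \mid Z_t \right] \nonumber \\
        &= \exp\!\Big( C\Big[ e^{AT} Z_t + \int_0^T e^{A(T-s)} M^{*}\,ds \Big]
           + \tfrac{1}{2}\, C \Big[ \int_0^T e^{A(T-s)} R R' e^{A'(T-s)}\,ds \Big] C' \Big).
\end{align}
```

Under these assumptions the log futures price is affine in $Z_t$, so its diffusion term is $C e^{AT} R\,dW_t^{*}$ and the covariance structure across maturities depends on $R$ only through $RR'$.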

2.2. A General Procedure to Determine the Stochastic Factors
In the previous subsection, we have presented the general model for characterizing the commodity price dynamics based on the assumption that the log commodity spot price is the sum of several factors. However, to the best of the authors' knowledge, the optimal number of stochastic factors has not yet been studied for these models.
This subsection presents a theoretical procedure to establish the optimal number of factors. It also presents a way to determine how those factors should be aligned (long-term, short-term, seasonal, etc.).
To address this problem, let us suppose that there are $M$ futures maturities and $n$ observations of the forward curve, that is, the matrix $U = (\log F_{t,T_i})$, $t = 0, \dots, n$; $i = 1, \dots, M$. We further assume, as usual, that $n \gg M$. To determine the optimal number of stochastic factors needed to characterize the commodity price dynamics, we must first realize that the number of factors is equal to $\operatorname{rank}(R)$ and that, from the previous expression, $\operatorname{rank}(R)$ has to be equal to the rank of the variance-covariance matrix of $U$. If, as usual, the process $X_t$ has a unit root, it is non-stationary and its variances and covariances are infinite, so we need another matrix to determine the rank of the variance-covariance matrix of $U$.
We therefore work with volatility (instantaneous variance), which remains finite for a process with a unit root, and define the matrix $\Theta$ of instantaneous variances and covariances of the log futures prices across maturities. We can estimate it directly from our database and we can also estimate its rank. Once we have this rank, as stated above, we know the number of stochastic factors ($N$) that define the commodity price dynamics. From a practical point of view, however, if we follow this procedure as explained above, unless one futures maturity is a linear combination of the rest (which is not likely), we obtain $N = M$. Nevertheless, the weights of these factors are going to be different and most of them will have an insignificant weight.
Fortunately, from this procedure, we can also estimate the eigenvalues $k_1, \dots, k_N$ and, from there, determine the factor weight through the eigenvalues' relative weight. We can estimate the eigenvalues of $A$ via a nonlinear search procedure that exploits the dependence of $\Theta$ on these eigenvalues and on the futures maturities. Moreover, from the eigenvalues of matrix $A$, it is also easy to determine the type of each factor by looking at the factor's stochastic differential equation (SDE). If the eigenvalue is zero, the factor is a long-term one because the SDE associated with this factor is a random walk (general Brownian motion (GBM)). If the eigenvalue is real and negative, the factor is a short-term one because the SDE associated with this factor is an Ornstein-Uhlenbeck process. If the eigenvalue is complex, the factor is a seasonal one.
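The classification by eigenvalue can be made concrete with a single-factor SDE. The sketch below assumes a diagonal dynamics matrix and uses hypothetical symbols ($z_{i,t}$ for the $i$-th factor, $m_i$ for its drift, $\sigma_i$ for its volatility) that do not appear in the text.

```latex
% Assumed notation (hypothetical): z_{i,t} = i-th factor, m_i = drift,
% \sigma_i = volatility, k_i = i-th eigenvalue of A.
\begin{align}
\text{generic factor:}\quad & dz_{i,t} = (m_i + k_i z_{i,t})\,dt + \sigma_i\,dW_{i,t}, \\
k_i = 0:\quad & dz_{i,t} = m_i\,dt + \sigma_i\,dW_{i,t}
   && \text{(random walk with drift, long-term)}, \\
k_i < 0:\quad & dz_{i,t} = k_i\Big( z_{i,t} + \frac{m_i}{k_i} \Big)dt + \sigma_i\,dW_{i,t}
   && \text{(Ornstein-Uhlenbeck, short-term)}.
\end{align}
```

In the Ornstein-Uhlenbeck case the factor reverts to the level $-m_i/k_i$ at speed $|k_i|$. A complex eigenvalue appears together with its conjugate, and the pair produces damped or sustained oscillations, which is why it is read as a seasonal factor.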
From a practical point of view, when we carry out this procedure we get N eigenvalues and we need to decide how many of them to optimally choose. The way to decide this is through the relative weight of the eigenvalues. By normalizing the largest one to 1, the smallest eigenvalues represent negligible factors. This allows us to decide how many factors must be optimally chosen.
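The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is estimated as the sample covariance of changes in log futures prices scaled by an assumed sampling interval `dt`, the function names `eigenvalue_weights` and `choose_n_factors` and the 99% target are hypothetical, and the data are simulated rather than taken from the datasets used in this study.

```python
import numpy as np

def eigenvalue_weights(log_futures, dt=1.0 / 52.0):
    """Eigenvalue weights of the covariance of log futures price changes.

    log_futures: array of shape (n_obs, n_maturities).
    dt: assumed sampling interval in years (weekly data by default).
    """
    changes = np.diff(log_futures, axis=0)
    theta_hat = np.cov(changes, rowvar=False) / dt   # annualized covariance estimate
    eigvals = np.linalg.eigvalsh(theta_hat)[::-1]    # eigenvalues in decreasing order
    eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negative values
    relative = eigvals / eigvals[0]                  # largest eigenvalue normalized to 1
    explained = eigvals / eigvals.sum()              # share of total variance
    return relative, explained, np.cumsum(explained)

def choose_n_factors(cumulative, target=0.99):
    """Smallest number of factors whose cumulative share reaches `target`."""
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative))

# Simulated example: one random-walk factor plus one mean-reverting factor.
rng = np.random.default_rng(0)
n_obs = 500
maturities = np.arange(1, 10) / 12.0                 # nine maturities, in years
long_term = np.cumsum(rng.normal(0.0, 0.04, n_obs))
short_term = np.zeros(n_obs)
for t in range(1, n_obs):
    short_term[t] = 0.9 * short_term[t - 1] + rng.normal(0.0, 0.03)
log_f = long_term[:, None] + short_term[:, None] * np.exp(-1.5 * maturities)[None, :]

relative, explained, cumulative = eigenvalue_weights(log_f)
n_factors = choose_n_factors(cumulative, target=0.99)
print(n_factors, np.round(cumulative[:3], 4))
```

Because the simulated curve is driven by exactly two factors, at most two eigenvalues carry non-negligible weight. With real data the remaining eigenvalues are small but not zero, and the choice of target reflects the trade-off between fit and estimation effort.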
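To clarify these concepts, consider the following example, in which we have $M = 9$ futures with maturities at times $T_1, \dots, T_9$. The method is as follows.
1. Estimate $\Theta$ from the data and denote the estimate by $\hat{\Theta}$.
2. Compute the rank of $\Theta$. Let us assume that this is 3.
3. As a result, we have three eigenvalues $k_1$, $k_2$ and $k_3$. It is usual to assume that $k_1 = 0$, as the futures process is not stationary, but $k_1$ can nevertheless be estimated. If we do assume it, however, we obtain the general equation
$$\Theta_{ij} = \alpha_{11} + \alpha_{12} e^{k_2 T_j} + \alpha_{13} e^{k_3 T_j} + \alpha_{21} e^{k_2 T_i} + \alpha_{31} e^{k_3 T_i} + \alpha_{22} e^{k_2 (T_i + T_j)} + \alpha_{23} e^{k_2 T_i + k_3 T_j} + \alpha_{32} e^{k_3 T_i + k_2 T_j} + \alpha_{33} e^{k_3 (T_i + T_j)},$$
where the $\alpha_{ab}$ are constant coefficients. The pair $(k_2, k_3)$ is then estimated as follows:
a. Select an initial estimate of $(k_2, k_3)$.
b. Regress $\hat{\Theta}_{ij}$ on the exponential terms of the general equation and compute the error.
c. Iteratively select another estimate of $(k_2, k_3)$ and go back to b.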
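The search over $(k_2, k_3)$ can be sketched as follows. The example assumes that each entry of the matrix is a linear combination of products of exponentials in the two maturities, with $k_1 = 0$, so that for fixed eigenvalues the coefficients follow from ordinary least squares. A plain grid search stands in for the nonlinear search; the function names, the grid, the maturities and the coefficient matrix are all hypothetical and serve only to show that the loop recovers known eigenvalues from an exactly specified matrix.

```python
import itertools
import numpy as np

def fit_error(theta_hat, maturities, ks):
    """Regress theta_hat on products of exponentials for fixed eigenvalues ks.

    Returns the sum of squared residuals and the fitted coefficient matrix.
    """
    basis = np.exp(np.outer(maturities, ks))           # basis[i, a] = exp(k_a * T_i)
    design = np.kron(basis, basis)                     # column (a, b) holds exp(k_a T_i + k_b T_j)
    coef, *_ = np.linalg.lstsq(design, theta_hat.ravel(), rcond=None)
    resid = theta_hat.ravel() - design @ coef
    return float(resid @ resid), coef.reshape(len(ks), len(ks))

def grid_search(theta_hat, maturities, grid):
    """Try every pair k3 < k2 < 0 from the grid (with k1 fixed at 0) and keep the best fit."""
    best = None
    for k2, k3 in itertools.combinations(grid, 2):     # grid is sorted from least to most negative
        err, alpha = fit_error(theta_hat, maturities, np.array([0.0, k2, k3]))
        if best is None or err < best[0]:
            best = (err, k2, k3, alpha)
    return best

# Hypothetical test case: build the matrix from known eigenvalues, then recover them.
maturities = np.arange(1, 10) / 4.0                    # nine maturities, in years
true_ks = np.array([0.0, -0.5, -2.0])
chol = np.array([[0.30, 0.00, 0.00],
                 [0.10, 0.20, 0.00],
                 [0.05, 0.02, 0.10]])
alpha_true = chol @ chol.T                             # symmetric positive definite coefficients
basis_true = np.exp(np.outer(maturities, true_ks))
theta = basis_true @ alpha_true @ basis_true.T

grid = [-0.25, -0.5, -1.0, -2.0, -4.0]
err, k2, k3, alpha = grid_search(theta, maturities, grid)
print(k2, k3, err)
```

In practice the grid would be replaced by a numerical optimizer and the matrix by its sample estimate, in which case the minimum error is no longer close to zero and its size indicates how well the chosen number of factors reproduces the covariance structure.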
To the best of the authors' knowledge, no method has combined the knowledge of this concrete specification with a nonlinear search procedure to identify factors, which is one of the contributions made by this article.
Once we have determined the optimal number and form of the stochastic factors to characterize the commodity price dynamics, we can estimate model parameters using standard techniques. One option is the Kalman filter (see, for example, [26]), which requires a complex calibration technique. Other techniques include approximations such as [18] or [27]. Finally, the recently published option by [28] presents an optimal way of estimating model parameters by avoiding the use of the Kalman filter. Model parameters are estimated in the papers cited and so, for the sake of brevity, we do not estimate the parameters in this study.

Data
In this section, we briefly describe the datasets used in this study. The datasets include weekly observations corresponding to futures prices for four commodities: WTI light sweet crude oil, heating oil, unleaded gasoline (RBOB) and Henry Hub natural gas. These futures were taken into consideration because they are the most representative and classic among the products. They are futures with long historical series and many maturities. Therefore, they are considered ideal for studying the optimal number of factors that should be chosen.
In this study, two datasets were considered for each commodity. Dataset 1 contains fewer futures maturities but more years of observations, while dataset 2 contains more futures maturities but fewer years of observations. For WTI crude oil, dataset 1 comprised contracts from 4 September 1989 to 3 June 2013 (1240 weekly observations) for futures maturities from F1 to F17, F1 being the contract for the month closest to maturity, F2 the contract for the second-closest month to maturity, etc. In the case of heating oil, it contained contracts from 21 January 1991. Table 1 shows the main descriptive statistics of the futures, particularly the mean and volatility, for each dataset. It is interesting to note that the lack of low-cost transportation and the limited storability of natural gas make its supply unresponsive to seasonal variation in demand. Thus, natural gas prices are strongly seasonal [3]. Unleaded gasoline is also seasonal.

Main Results
We now present the results of applying the proposed method to the four commodities (two datasets per commodity) described above in order to select the number of factors to model the behavior of commodity prices. The results correspond to the eigenvalues in decreasing order, the percentage of the overall variability that they explain and the cumulative proportion of explained variance. These are reported in Tables 2-5. As a general rule, the first factor, which corresponds to the first eigenvalue, is clearly dominant in the sense that it explains a percentage of the total variance ranging between 95.2% and 99.7%, depending on the commodity. It captures qualitative long-run effects. However, it is always necessary to consider a second factor capable of taking up short-term effects. Together, the first and second factors explain a cumulative proportion of overall variance between 97.5% and 99.9%, depending on the case under study. In WTI light sweet crude oil, these two factors explain more than 99.99% of the total variance, while in the heating oil case studies this percentage is approximately 99.88%, and in unleaded gasoline and Henry Hub natural gas it is approximately 97%-98%.
Consequently, for the first commodity (crude oil) it is recommended that just the first two factors are considered. The reason is that a third factor would impose a larger estimation effort for a minimal reduction in terms of error measures. The first factor will capture long-term effects, such as world economic events, which significantly affect commodity prices. The second factor will capture the nature of short-term components such as temporary issues and unforeseen situations. The third and following stochastic factors can be considered as seasonal factors [28] and, as we know, crude oil is a non-seasonal commodity. This reinforces the idea that it is suitable to consider a model with only the first two factors.
The next commodity, heating oil, presents some seasonal behavior, which could be captured by a third factor. However, the fact that the cumulative proportion of overall variance only goes from 99.88% to 99.94% and from 99.90% to 99.94% in its respective datasets suggests that the inclusion of a third factor is not necessary. Conversely, for unleaded gasoline and Henry Hub natural gas, at least a third factor seems to be necessary. Both are seasonal commodities (see, for example, [3]). They are characterized by very limited storability and their prices are highly dependent on the commodity demand. Third and fourth factors will acknowledge this behavior. It seems necessary to capture more than long-term and short-term dynamics. Depending on the cumulative variance, if we would like to explain 98%-99% of it, we need to consider at least a third factor, or two more. In the unleaded gasoline case, the inclusion of a third factor would increase the cumulative proportion of overall variance from 98.48% to 99.73% and from 97.49% to 98.73%. With a fourth factor, we would reach 99.86% and 99.76%, respectively. When we apply the proposed methodology to the Henry Hub natural gas datasets, we also verify the need to consider a third and even a fourth factor to explain 99.80% and 99.65% of the total variance, respectively. These results are coherent with the patterns shown in the futures contracts of each commodity. By considering seasonality as a stochastic factor instead of a deterministic one, we can choose from two- to four-factor models to better model the behavior of commodity prices. It should be noted that the long-term and short-term effects, captured by the first two factors, are clearly dominant in terms of their eigenvalues' relative weight. However, the seasonality should be considered if necessary.
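The comparison between heating oil and unleaded gasoline comes down to the marginal gain in cumulative explained variance from a third factor. The short sketch below recomputes those gains from the percentages quoted in this section. The 0.5 percentage-point cut-off (`min_gain`) is a hypothetical threshold chosen only for illustration and is not a rule stated in this study, and the dictionary labels simply follow the order in which the figures are quoted.

```python
# Cumulative explained variance (%) with two and with three factors, as quoted in the text.
reported = {
    "heating oil (first figure)": (99.88, 99.94),
    "heating oil (second figure)": (99.90, 99.94),
    "unleaded gasoline (first figure)": (98.48, 99.73),
    "unleaded gasoline (second figure)": (97.49, 98.73),
}

min_gain = 0.5  # hypothetical cut-off, in percentage points

gains = {}
for name, (two_factors, three_factors) in reported.items():
    gains[name] = three_factors - two_factors
    verdict = "add a third factor" if gains[name] >= min_gain else "two factors suffice"
    print(f"{name}: gain {gains[name]:.2f} points -> {verdict}")
```

The gains for heating oil are a few hundredths of a percentage point, whereas for unleaded gasoline they exceed one percentage point, which is the pattern behind the recommendations above.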
It is important to bear in mind that the distinction between long term and short term is not always direct. It is related to the eigenvalue of the factor, which, as we have stated, is always of the form $e^{k}$ with $k \le 0$ (a positive $k$ would mean an explosive process, which is clearly not observed in the data). If $k = 0$, we have a long-term effect (a unit root). The more negative $k$ is, the shorter the effect. Therefore, $k = -1$ means a much shorter effect than $k = -0.01$, for example. The explanatory power of each factor is measured according to its (relative) contribution to the global variance. For example, if there is a unique factor related to eigenvalue $k = 0$ that gives 90% of variance, we would conclude that long-term dynamics explain 90% of the variance. It should be noted that this article focuses on the econometric theory and identifies the optimal number of factors to characterize the dynamics of commodity prices. Apart from this econometric approach, where each factor represents a component (long term, short term, seasonal, etc.), these factors may also capture economic forces [29][30][31]. In other words, there are economic forces that are being captured by these factors, such as technology effects (long term) or the functioning of the market (short term). Following [15], we argue that the long-term factor reflects expectations of the exhaustion of the existing supply, improvements in technology for the production and discovery of the commodity, inflation, as well as political and regulatory effects. The short-term factor reflects short-term changes in demand or intermittent supply disruptions. An interpretation of seasonal factors can be found in [3].
This method provides a new selection criterion for obtaining the optimal number of factors. It is always important to keep in mind the purpose of modeling such commodity prices. If we need more accuracy because, for example, we are designing investment strategies, the consideration of more factors is understandable. We could also use fewer factors in a different case. This is important because, on the one hand, if we use too many factors the model will be too complex and parameter estimation may not be accurate. On the other hand, if we use too few factors the model will not be acceptable because it will not capture all the characteristics of the price dynamics that we need to consider in order to solve our problem.
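The reading of $k$ described above can be illustrated with a small helper. The sketch assumes that a shock to a mean-reverting factor decays as $e^{kt}$, so that its half-life is $\ln 2 / |k|$ in whatever time unit $k$ is expressed. The function names and the numerical tolerance are hypothetical.

```python
import math

def classify_factor(k, tol=1e-8):
    """Label a factor from its eigenvalue k, following the rules described in the text."""
    k = complex(k)
    if abs(k.imag) > tol:
        return "seasonal"        # complex eigenvalue
    if k.real > tol:
        raise ValueError("positive eigenvalue: explosive process")
    if abs(k.real) <= tol:
        return "long-term"       # unit root
    return "short-term"          # real and negative: mean reversion

def half_life(k):
    """Time for a shock to decay by half when it decays as exp(k * t), with k < 0."""
    return math.log(2.0) / abs(k)

for k in (0.0, -0.01, -1.0, complex(-0.1, 2.0 * math.pi)):
    print(k, classify_factor(k))

print(round(half_life(-1.0), 3), round(half_life(-0.01), 1))
```

With these assumptions, a shock to a factor with $k = -1$ halves in about 0.69 time units, while with $k = -0.01$ it takes about 69 time units, which is why the second factor behaves almost like a long-term one over short samples.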
We believe our findings to be very useful for researchers and practitioners. Based on our findings, a researcher who needs to model a commodity price dynamic can use our method to identify the number and the characteristics of the factors to be included in the model. Moreover, a practitioner who is investing or measuring risk can also use our methodology in order to identify the optimal number of factors needed and their characteristics.
Finally, as stated above, we have chosen to order the factors according to their relative (joint) contribution to variance because it is a direct and simple way to interpret the results. We are aware that collinearity and, in general, correlation structures can modify the results. However, since the first eigenvalue explains around 95% of variance, it seems unlikely that results are going to change substantially by a more refined analysis.

Summary and Conclusions
In this article, we propose a novel methodology for choosing the optimal number of stochastic factors to be included in a model of the term structure of commodity futures prices. With this method, we add to the research on how commodity price dynamics are characterized.
The procedure is based on the eigenvalues of the variance-covariance matrix. Moreover, in deciding how many of them to choose, we propose using the relative weight of the eigenvalues and the percentage of the total variance explained by them and balancing this with the effort of estimating more parameters.
In this article, we applied our method to eight datasets, corresponding to four different commodities: crude oil, heating oil, unleaded gasoline and natural gas. Results indicate that two factors, which correspond to the two largest eigenvalues, are suitable for modeling the first two commodity prices, since they are sufficient to account for both long-term and short-term structures. Nevertheless, in the case of unleaded gasoline and natural gas, a third or even fourth factor is needed. We think that, in accordance with the literature, this is related to their seasonal behavior.
Our results support the notion that including too many or too few factors or factors with characteristics which are not optimal in a model for commodity prices could lead to results which may not be as accurate as they should be.