4.1. Preliminary Analysis
Before we fit the model to the data, we examine the log transformations chosen for the two observed variables. As explained above, the log transformations ensure that both variables remain non-negative without restricting the model parameters. Another important reason for using the log transformation of the outcome variable is that it brings the data close to the Gaussian distribution assumed for the likelihood function.
Table A2 in the online Appendix A.3 shows the sample third (skewness) and fourth (kurtosis) moments of the untransformed variable, its square root, and its log for the 27 stocks in the sample. All three transformations are positively skewed. The untransformed variable has the largest positive skewness, followed by the square-root transformation, both with skewness above one. The log transformation has the smallest skewness, below one in every case, though still significantly different (at size 0.05) from the Gaussian value of zero. The untransformed variable also has the largest kurtosis, well above the Gaussian value of three. The square-root transformation has somewhat smaller kurtosis, but its smallest value, 7.42 (for INTC), is still well above three. The log transformation has kurtosis much closer to the Gaussian value, with the largest value at 4.15 (for CVX).
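As an informal illustration of this moment check, the following is a minimal sketch that computes sample skewness and kurtosis for the three transformations. The Series name rv and the use of the square root as the intermediate transformation are assumptions made for illustration, not the paper's notation.

```python
import numpy as np
import pandas as pd

def moment_check(rv: pd.Series) -> pd.DataFrame:
    """Sample skewness and kurtosis of the level, square-root, and log transformations."""
    transforms = {"level": rv, "sqrt": np.sqrt(rv), "log": np.log(rv)}
    rows = {}
    for name, x in transforms.items():
        z = (x - x.mean()) / x.std(ddof=0)        # standardize the series
        rows[name] = {"skewness": (z ** 3).mean(), "kurtosis": (z ** 4).mean()}
    return pd.DataFrame(rows).T

# toy example: a positive, right-skewed series standing in for a realized measure
rv = pd.Series(np.exp(np.random.default_rng(0).normal(size=2000)))
print(moment_check(rv))
```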
As mentioned in the introduction, the proposed specification (Section 3.1) is based on two empirical features of the two observed (log-transformed) variables: their common persistence and their correlation. Figure A1 in the online Appendix A.3 shows their sample autocorrelations for the 27 stocks in the sample. For all stocks, one of the two series is somewhat more persistent than the other: its autocorrelations slowly decay from about 0.8, while those of the other series slowly decay from about 0.6. Both sets of autocorrelations die out slowly with the lag, a feature of long-memory series. The two series are also highly correlated with each other, with all pairwise sample correlations above 0.95.
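A minimal sketch of how the persistence and correlation checks could be computed follows; log_rv and log_rq are hypothetical stand-ins (simulated here) for the two log-transformed observed series.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations at lags 1..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, nlags + 1)])

# toy stand-ins for the two log-transformed observed series
rng = np.random.default_rng(1)
eps = rng.normal(size=1000)
log_rv = np.zeros(1000)
for t in range(1, 1000):
    log_rv[t] = 0.97 * log_rv[t - 1] + eps[t]        # persistent toy AR(1)
log_rq = log_rv + rng.normal(scale=0.2, size=1000)   # closely related second series

print(sample_acf(log_rv, 10).round(2))               # slowly decaying autocorrelations
print(np.corrcoef(log_rv, log_rq)[0, 1].round(3))    # pairwise correlation
```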
The model (Section 3.1) implies that the observed realized process should follow the restricted ARMA(1,1) process (2). The slowly decaying autocorrelations in Figure A1 are not inconsistent with an ARMA(1,1) process with a large AR(1) coefficient. As a further check, Table A5 in the online Appendix A.3 reports estimates of an unrestricted ARMA(1,1) model fitted to the log-transformed series for the 27 stocks in the sample. Both the AR and MA coefficients are statistically significant, with a large positive AR(1) coefficient and a large negative MA(1) coefficient. The estimated AR(1) coefficient ranges from 0.976 (XOM) to 0.992 (MCD), with the extremes of the negative MA(1) coefficients occurring for XOM and NKE. Although the portmanteau test for residual correlation up to lag 5 rejects the white-noise null (at size 0.05), the residual serial correlations are small in magnitude: the first-order residual serial correlation ranges from 0.036 (CVX) to 0.103 (NKE).
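The unrestricted ARMA(1,1) check and the lag-5 portmanteau test could be reproduced along the following lines with statsmodels; log_rv below is a simulated stand-in for one stock's log-transformed series, so the numbers are purely illustrative.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# simulated stand-in for one stock's log-transformed realized measure
rng = np.random.default_rng(2)
log_rv = np.zeros(2000)
for t in range(1, 2000):
    log_rv[t] = 0.98 * log_rv[t - 1] + rng.normal(scale=0.3)   # persistent toy series

res = ARIMA(log_rv, order=(1, 0, 1)).fit()   # unrestricted ARMA(1,1) with a constant
print(res.params)                            # constant, AR(1), MA(1), innovation variance
print(acorr_ljungbox(res.resid, lags=[5]))   # portmanteau test up to lag 5
```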
4.2. In-Sample Estimates
To evaluate the performance of the model out of sample, pseudo-out-of-sample rolling forecasts were obtained. As parameter estimation by maximum likelihood is computationally expensive compared to least squares, a rolling forecast window of one (calendar) month was moved forward each month, starting from January 2006. The first set of parameter estimates was obtained for the estimation sample from the beginning of the data sample, April 1997, to December 2005 (104 months). Using these parameter estimates, h-step forecasts were obtained for the trading days in January 2006. These forecasts use the same parameter estimates, but the conditioning information set, i.e., the lagged variables on the right-hand side, is updated as we move forward within the forecast window.
The next set of estimates was obtained for the sample from May 1997 to January 2006, and forecasts for the trading days in February 2006 were obtained. This resulted in 96 sets of parameter estimates, with forecasts from the beginning of January 2006 to the end of December 2013. The forecast sample includes the 2008–2009 financial crisis period, when volatility spiked.
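A schematic sketch of this rolling scheme is given below. The fit and forecast routines are placeholders for the model's estimation and h-step prediction steps, the DataFrame is assumed to be indexed by trading day, and the estimation-window length (in months) is passed as a parameter; none of these names come from the paper.

```python
import numpy as np
import pandas as pd

def rolling_monthly_forecasts(df, fit, forecast, first_month, window_months):
    """Re-estimate once per calendar month on a rolling window and forecast every
    trading day of that month, updating the conditioning set daily."""
    months = pd.period_range(first_month, df.index[-1].to_period("M"), freq="M")
    out = []
    for m in months:
        start = (m - window_months).start_time                   # rolling estimation window
        est = df[(df.index >= start) & (df.index < m.start_time)]
        params = fit(est)                                        # one (costly) estimation per month
        fdays = df[(df.index >= m.start_time) & (df.index <= m.end_time)].index
        for day in fdays:
            hist = df[df.index <= day]                           # conditioning set updated daily
            out.append((day, forecast(params, hist)))
    return pd.DataFrame(out, columns=["date", "forecast"]).set_index("date")

# toy usage with placeholder estimation and forecasting routines
idx = pd.bdate_range("1997-04-01", "2006-03-31")
df = pd.DataFrame({"rv": np.random.default_rng(3).lognormal(size=len(idx))}, index=idx)
fits = rolling_monthly_forecasts(df, fit=lambda est: est["rv"].mean(),
                                 forecast=lambda p, hist: p,
                                 first_month="2006-01", window_months=24)
print(fits.head())
```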
Numerical maximization of the Gaussian likelihood can be sensitive to the choice of starting values. To guard against getting stuck in local maxima, a few alternative random starting values are tried for each estimation window. To start the recursion, the presample values of the two latent processes were set to the sample variance computed over the estimation sample. (The alternative of setting these presample values to the model's unconditional means (3) often resulted in nonconvergence of the numerical optimizer and was sensitive to the choice of starting parameter values.)
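A minimal sketch of the multi-start strategy is shown below, using a generic negative log-likelihood neg_loglik as a placeholder rather than the paper's actual likelihood; the number of random starts and their scale are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def multistart_mle(neg_loglik, n_params, n_starts=5, seed=0):
    """Run the optimizer from several random starting points and keep the best fit."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.normal(scale=0.5, size=n_params)     # random starting values
        res = minimize(neg_loglik, x0, method="L-BFGS-B")
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return best

# toy example: a quadratic 'likelihood' with a unique optimum at (1, 2)
print(multistart_mle(lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2, n_params=2).x)
```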
There are a large number of estimated parameters (10 parameters for each of the 96 estimation windows for each of the 27 stocks). Figure 1 summarizes the estimated parameters across the forecast stepsizes h (in days). Following Bollerslev et al. [12], for the h-step forecast the outcome variable on the left-hand side of (1a) is the realized measure averaged over the next h trading days. The h-step-ahead variable for the log transformation is defined so that its forecasts can be compared to those from the other transformations. For one-step forecasts, taking the log of the variance or of its square root results only in a different scaling. For multistep forecasts, both the log of the average variance and the log of the average volatility (square root) are considered, since one cannot be recovered from the other by rescaling alone. Strictly speaking, the model parameters change with the forecast stepsize h and should be written as a function of h.
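A sketch of how the overlapping h-step outcomes could be constructed is given below, assuming the daily realized measure is stored in a pandas Series rv; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

def hstep_outcomes(rv: pd.Series, h: int) -> pd.DataFrame:
    """Overlapping h-day averages of the realized measure, dated at the forecast origin t."""
    fwd_avg_rv = rv.shift(-1).rolling(h).mean().shift(-(h - 1))        # mean of rv[t+1..t+h]
    fwd_avg_vol = np.sqrt(rv).shift(-1).rolling(h).mean().shift(-(h - 1))
    return pd.DataFrame({
        "avg_rv": fwd_avg_rv,                # level outcome
        "log_avg_rv": np.log(fwd_avg_rv),    # log of the average variance
        "log_avg_vol": np.log(fwd_avg_vol),  # log of the average volatility (square root)
    })

# toy usage
rv = pd.Series(np.random.default_rng(4).lognormal(size=100))
print(hstep_outcomes(rv, h=5).dropna().head())
```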
Each panel in Figure 1 corresponds to a parameter. The shaded area is the interquartile range (from the 0.25 to the 0.75 quantile) across the 27 stocks, and the thick solid line is the median estimate across the 27 stocks. The intercept and the in-mean parameter of Equation (1a) are both positive. A positive in-mean parameter is to be expected given the similarity of the dynamics of the two observed series documented above. The in-mean parameter declines with the stepsize h and, for multistep forecasts, is smaller for one of the two log-average outcomes than for the other.
The parameters that determine the persistence of the two processes are all positive and imply a persistence close to, but below, the stationarity boundary of one (Figure 2). As a function of the stepsize h, the persistence does not change much for one of the two log-average outcomes but increases somewhat for the other. Among these parameters, one increases with h from about 0.7 for one-step forecasts to above 0.8 at the longest horizon, while another decreases with h from about 6 to less than 4.
Figure 2 shows that the rolling estimates of the persistence parameter increase with the stepsize h. This is to be expected, as the h-step outcome variable becomes smoother, and hence more persistent, as h grows.
The asymmetric response parameter is positive and decreases with the stepsize h, from about 1.3 for one-step forecasts to below 0.5 at the longest horizon. The coefficient on the quadratic term switches sign from positive at short horizons to negative at long horizons. The volatility parameter of Equation (1c) increases with the stepsize h.
4.3. Pseudo Out-of-Sample Forecasts
This section evaluates the performance of the proposed model in terms of the (pseudo) out-of-sample rolling forecasts described in Section 4.2. If the model captures the time-series dependence of the outcome variable, the forecast errors should not be serially correlated. A formal test of this hypothesis needs to account for the sampling error introduced by generating forecasts from estimated parameters.
Figure A2, Figure A3 and Figure A4 in the online Appendix A.4 provide an informal check by plotting the autocorrelation functions of the rolling forecast errors. In addition to the forecast errors from the proposed model (Section 3.1), these figures also show the forecast error correlations from the HAR model of Corsi [1] and the HARQ model of Bollerslev et al. [12], each with its respective outcome variable.
For one-step forecasts, the forecast error serial correlations are small in magnitude, but most of them fall outside the asymptotic interval for a white-noise series. (A portmanteau test for serial correlation up to lag 5 rejects the white-noise null for every stock at the conventional size of 0.05; however, these are asymptotic p-values that ignore the sampling error from parameter estimation.) For multistep forecasts, the forecast errors show a large positive correlation up to lag h for all models. This dependence is due to the overlapping samples used in generating the multistep outcome variable (4).
To evaluate the relative forecast performance of the proposed model, its forecast accuracy is compared against a baseline model. The baseline is the HARQ model of Bollerslev et al. [12], which extends the HAR model of Corsi et al. [17] with an additional interaction term involving the realized quarticity. The model proposed in this paper also uses the realized quarticity, but without the lagged realized-variance terms of HAR(Q). Following Bollerslev et al. [12], the outcome variable of the baseline HARQ model is the untransformed realized variance. The model parameters are estimated by least squares, without restrictions that would ensure non-negative predicted values. Bollerslev et al. ([12], footnote 17) apply the 'insanity' filter and replace predicted values that fall outside the in-sample range with the in-sample mean value. The baseline predictions in this paper do not apply this somewhat ad hoc insanity filter.
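For reference, the following is a least-squares sketch of a HARQ-type baseline regression with daily, weekly, and monthly lags and an interaction of the daily lag with the square root of realized quarticity. The column names, lag lengths (5 and 22 days), and exact construction are the conventional HAR choices and are assumptions here, not details taken from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_harq(rv: pd.Series, rq: pd.Series):
    """OLS fit of a HARQ-type regression; fitted values are not restricted to be
    non-negative (no 'insanity' filter is applied)."""
    X = pd.DataFrame({
        "rv_d": rv.shift(1),                              # daily lag
        "rv_w": rv.shift(1).rolling(5).mean(),            # weekly average lag
        "rv_m": rv.shift(1).rolling(22).mean(),           # monthly average lag
        "rv_d_x_sqrt_rq": rv.shift(1) * np.sqrt(rq.shift(1)),  # interaction term
    })
    data = pd.concat([rv.rename("rv"), X], axis=1).dropna()
    return sm.OLS(data["rv"], sm.add_constant(data.drop(columns="rv"))).fit()

# toy usage with simulated positive series
rng = np.random.default_rng(5)
rv, rq = pd.Series(rng.lognormal(size=500)), pd.Series(rng.lognormal(size=500))
print(fit_harq(rv, rq).params)
```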
An alternative, natural baseline that ensures non-negative predicted values is the HARQ model specified in logs, with a corresponding interaction term. This log version of the HARQ specification, however, does not perform as well as the untransformed specification of Bollerslev et al. [12].
Table A3 in the online Appendix A.3 reports the estimated coefficients on the interaction term for the full sample across the forecast stepsizes. For the level (untransformed) specification, these coefficients are all negative, with t-ratios above two in absolute value (with two exceptions), as reported in Bollerslev et al. [12]. For the log specification, many coefficients are positive and insignificant, and at some stepsizes all t-ratios (except one) are below two in absolute value. Bollerslev et al. [12] motivate the use of the realized quarticity as a correction for possible measurement error in the realized variance. The negative coefficient on the interaction term may be correcting for this error mainly in the tails, given the high excess kurtosis of the untransformed variable. This may explain the insignificant interaction coefficients in the log specification, as the excess kurtosis largely disappears under the log transformation. The appropriate HARQ specification is not the focus of this study and is left for further research.
To evaluate the relative accuracy of predictions from alternative models, we need to specify a loss or scoring function. This study uses two members of the Bregman family of consistent scoring functions that are robust to additive noise in the realized measure [24]:

MS(y, F) = (y − F)^2 and QL(y, F) = y/F − log(y/F) − 1,

where y is the actual realized value and F its forecast. MS is the mean squared error and QL is the QLIKE scoring function; both were also used in Bollerslev et al. [12]. MS is defined for all values of F, while QL is undefined for F ≤ 0. As the baseline HARQ predictions are not filtered, the few cases of non-positive predictions are removed when evaluating QL.
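A direct implementation sketch of the two scoring functions as written above, with non-positive forecasts dropped before evaluating QL:

```python
import numpy as np

def ms_score(y, f):
    """Mean squared error score: defined for all forecast values."""
    return (np.asarray(y, dtype=float) - np.asarray(f, dtype=float)) ** 2

def ql_score(y, f):
    """QLIKE score; undefined for non-positive forecasts, which are dropped here."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    keep = f > 0                      # remove the few non-positive baseline predictions
    r = y[keep] / f[keep]
    return r - np.log(r) - 1.0

# toy usage: the second forecast is negative and is excluded from the QL average
y = np.array([1.0, 2.0, 0.5]); f = np.array([1.2, -0.1, 0.6])
print(ms_score(y, f).mean(), ql_score(y, f).mean())
```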
The difference in scores between the baseline and the comparison model is tested with the t-ratio of the equal-forecast-accuracy test of Diebold and Mariano [27]. As mentioned above, the forecast errors are serially correlated for multistep forecasts. To account for this correlation, the denominator of the t-ratio, the standard error of the difference in average scores, is computed using the Bartlett kernel with the bandwidth set to the forecast stepsize h.
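A sketch of the Diebold–Mariano t-ratio with a Bartlett-kernel standard error and bandwidth equal to the stepsize h, as described above:

```python
import numpy as np

def dm_tstat(loss_baseline, loss_model, h):
    """Diebold–Mariano t-ratio for equal forecast accuracy.

    Positive values favour the comparison model (smaller loss). The long-run
    variance of the loss differential uses a Bartlett kernel with bandwidth h.
    """
    d = np.asarray(loss_baseline, dtype=float) - np.asarray(loss_model, dtype=float)
    n = d.size
    dbar = d.mean()
    dc = d - dbar
    lrv = (dc @ dc) / n                          # lag-0 term
    for k in range(1, h + 1):
        w = 1.0 - k / (h + 1.0)                  # Bartlett weights
        lrv += 2.0 * w * (dc[k:] @ dc[:-k]) / n
    return dbar / np.sqrt(lrv / n)

# toy usage with simulated losses
rng = np.random.default_rng(6)
print(dm_tstat(rng.chisquare(1, 500), rng.chisquare(1, 500), h=5))
```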
Table 1 reports these t-ratios for the full forecast sample. A positive value indicates a better forecast (smaller score) from the comparison model than from the baseline HARQ. Table 1 also shows that, for about half of the 27 stocks, the baseline HARQ model produces at least some negative forecasts. As mentioned above, QL is then computed over the forecast sample excluding these negative predictions, so a sign switch between the MS and QL t-ratios may be due to the difference in the evaluation samples of the two scoring functions. With this caveat, the test results are somewhat sensitive to the choice of scoring function and forecast stepsize h. At the shortest stepsize, both MS and QL indicate a lower average score, i.e., better forecast accuracy, for the baseline HARQ model than for the proposed model (Section 3.1). The forecast accuracy of the proposed model relative to the baseline generally improves with the forecast stepsize h. At the longest horizons, one of the two scores is inconclusive in the sense that all of its t-ratios are less than two in absolute value; for the other, none of the t-ratios fall below −2, while about a quarter of the stocks have values above +2, indicating more accurate forecasts from the model of Section 3.1 than from the baseline.
The tests in Table 1 are based on the full forecast sample. The results could be specific to that choice of sample, and a particular sample could be cherry-picked to obtain a desired outcome. As a guard against such cherry-picking, Figure A5, Figure A6 and Figure A7 in the online Appendix A.5 show running t-ratios of the Diebold and Mariano [27] equal-forecast-accuracy test for all possible ends of the evaluation sample. These are the t-ratios obtained when the test is applied to an expanding evaluation sample, from the beginning of the forecast sample (3 January 2007) up to each successive date. (The three figures correspond to alternative (un)transformations used to obtain the forecasts being compared.)
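The running t-ratios can be obtained by applying the same statistic to expanding evaluation samples; a minimal sketch is given below. The statistic is passed in as a function (for example, the dm_tstat sketch above), and the minimum sample length of 60 observations is an assumption for illustration.

```python
import numpy as np

def running_dm(loss_base, loss_model, h, dm_stat, min_obs=60):
    """Apply an equal-forecast-accuracy statistic (e.g., the dm_tstat sketch above)
    to every expanding evaluation sample ending at successive dates."""
    loss_base = np.asarray(loss_base, dtype=float)
    loss_model = np.asarray(loss_model, dtype=float)
    return np.array([dm_stat(loss_base[:end], loss_model[:end], h)
                     for end in range(min_obs, loss_base.size + 1)])
```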
One common feature of the running t-ratio results is a change in relative performance shortly after the 2008–2009 financial crisis. As expected, the test results up to the financial crisis period are quite noisy, but the majority of the t-ratios for both MS and QL are positive at the longer stepsizes, indicating better forecast accuracy from the model of Section 3.1 than from the baseline. The problem of the level HARQ model producing negative forecasts occurs mostly during the financial crisis period: Table A4 in the online Appendix A.3 lists all dates with negative HARQ predictions, the majority of which fall in 2008. The t-ratios remain quite stable after the financial crisis period, indicating robustness to the choice of forecast sample post-crisis. When judging the significance of these running t-ratios, one should keep in mind the multiple comparisons problem; the usual critical values are likely to be too small.