4.2. In-Sample Analysis with Regime Breaks
Table 3 contains in-sample results based on Table 1, row 1, similar to those reported in Table 2 but accounting for distinct regime breaks, following Bai and Perron [63]. This method allows the parameters to change across regimes. The R-squared grows substantially, now ranging from 18.30% to 22.64%, with an average of 20.51%, more than three times the average R-squared reported in Table 2 (6.6%). Explanatory power increases because allowing the parameters to change across regimes yields a better fit. VIX, VXD, and VXO were significant in four of the six regime breaks, while RVX was significant in two. The findings indicate predictability before, during, and after the 2008 financial crisis.
Interestingly, the coefficients shift from positive in the regimes before and during the 2008 financial crisis to negative during the recovery from the 2008 financial crisis (2010M08–2013M12). Moreover, three of the fourteen significant coefficients were negative, all occurring during the recovery period. Finally, the 2008 financial crisis does not drive the in-sample results in Table 2, because eleven of the fourteen significant coefficients were positive.
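The Bai and Perron approach selects break dates by minimizing the sum of squared residuals over candidate sample partitions. As a minimal illustration of that core idea only (not the full multiple-break dynamic-programming procedure, and not the estimation code used for Table 3), the following sketch grid-searches a single break date with two sub-sample OLS fits; all variable names and the toy data are hypothetical:

```python
import numpy as np

def single_break_ols(y, X, trim=0.15):
    """Grid-search one structural break by minimizing the total sum of
    squared residuals (SSR) of two sub-sample OLS fits. Sample edges
    are trimmed so both regimes keep enough observations."""
    n = len(y)
    lo, hi = int(trim * n), int((1 - trim) * n)
    best_ssr, best_k = np.inf, None
    for k in range(lo, hi):
        ssr = 0.0
        for seg in (slice(0, k), slice(k, n)):
            beta, *_ = np.linalg.lstsq(X[seg], y[seg], rcond=None)
            e = y[seg] - X[seg] @ beta
            ssr += e @ e
        if ssr < best_ssr:
            best_ssr, best_k = ssr, k
    return best_k, best_ssr

# Toy data: the slope changes from 0.2 to 1.5 at t = 60.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = np.where(np.arange(100) < 60, 0.2 * x, 1.5 * x) + 0.1 * rng.normal(size=100)
k, _ = single_break_ols(y, X)  # estimated break date, close to 60
```

The full procedure repeats this search over multiple breaks and tests their number sequentially, which is what allows the regime-specific coefficients reported in Table 3.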
Despite these in-sample indications, there may still be concerns regarding the actual usefulness of financial reports tone disagreement as a forecasting tool, mainly due to data-mining and overfitting issues, as highlighted by Clark and McCracken [64]. To address this concern, an out-of-sample approach is more appropriate for determining whether these insights hold and whether they are helpful to practitioners making real-time predictions, especially those involved in financial stability and portfolio optimization tasks. In the following sections, we describe a set of results obtained using out-of-sample techniques, allowing for a better understanding of the pervasiveness of the relationship between tone disagreement in financial reports and implied volatility.
4.3. Out-of-Sample Analysis
This subsection presents three out-of-sample exercises: ENCNEW (Table 4), ENC-t (Table 5), and a Mean Directional Accuracy analysis (Table 6). For the out-of-sample analyses, we contrast the predictive performance of our core model (see Table 1, row 1) with that of the benchmark models (see Table 1, rows 2 and 3). Note that setting the coefficient on tone disagreement to zero reduces our model to an autoregressive benchmark (Table 1, rows 2 and 3).
According to Clark and McCracken [64], results based on just one ad hoc window size may still be highly debatable, because predictability could apply only to a single sub-sample and consequently not be robust to different window sizes. Therefore, to mitigate any concerns about overfitting, we consider four different window sizes (π = 4, 2, 1, and 0.4, where π denotes the ratio of out-of-sample to in-sample observations; these correspond to estimating our model with 20%, 33%, 50%, and 71% of the sample observations, respectively, and evaluating the forecasting models with the remaining observations).
Table 4 reports the results from the ENCNEW test, which contrasts the predictive ability of the core models (Table 1, row 1) with that of the benchmark models (Table 1, rows 2 and 3). Additionally, we extended the analysis with two further autoregressive benchmark specifications. The core models outperformed the benchmark models in 72% of the exercises. Remarkably, VIX and VXO could be consistently forecasted, with significant results across all autoregressive and estimation window specifications. Models predicting the VXD achieved significant results in 12 out of 16 specifications. Finally, we observe the weakest results predicting RVX, which yielded only a single significant result across all 16 exercises.
The results point to stronger null hypothesis rejections for the models that used 71% of the sample to estimate the parameters (π = 0.4) and the remainder to evaluate the forecasts, which yielded significant results in 81% of the exercises. The models with estimation window specifications of π = 4 and π = 2 achieved the same frequency of significant results, with significance found in 75% of the exercises. Finally, the models with an estimation window specification of π = 1 achieved the lowest frequency, with significance found in 50% of the exercises. All significant coefficients were positive, in line with the in-sample finding that a higher level of financial reports tone disagreement forecasts an increase in implied volatility.
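The ENCNEW statistic can be sketched compactly. Assuming the standard Clark–McCracken form (the paper's exact implementation is not shown here), the statistic scales the average of e1·(e1 − e2) by the core model's mean squared error, where e1 are benchmark forecast errors and e2 core-model errors; positive, significant values indicate the core model adds information the benchmark misses:

```python
import numpy as np

def enc_new(e_bench, e_core):
    """ENCNEW forecast-encompassing statistic (Clark-McCracken style):
    P * mean(e1*(e1 - e2)) / mean(e2**2). Critical values are
    nonstandard and must come from the appropriate tables."""
    e1 = np.asarray(e_bench, float)
    e2 = np.asarray(e_core, float)
    P = len(e1)
    return P * np.mean(e1 * (e1 - e2)) / np.mean(e2 ** 2)

# Hypothetical errors: the core model halves every benchmark error.
e1 = np.array([1.0, -1.0, 2.0, -2.0])
e2 = 0.5 * e1
stat = enc_new(e1, e2)  # -> 8.0
```

When the two error series coincide, the statistic is zero, matching the intuition that the core model then adds nothing beyond the benchmark.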
Table 5 reports the results from the ENC-t test, which also contrasts the predictive ability of the core models (Table 1, row 1) with that of the benchmark models (Table 1, rows 2 and 3), again considering the four autoregressive benchmark specifications. The core models outperformed the benchmark models in 75% of the exercises. Strikingly, VXO could be consistently forecasted, with significant results across all autoregressive and estimation window specifications. The VIX could be forecasted in 14 out of 16 exercises, and models predicting the VXD achieved significant results in 11 out of 16 specifications. Again, we obtained the weakest results predicting RVX, which yielded significant results in 8 of the 16 exercises.
Similarly to the ENCNEW results, we found strong null hypothesis rejections for the models that used 71% of the sample to estimate the parameters (π = 0.4) and the remainder to evaluate the forecasts, as these models yielded significant results across all exercises. The models with estimation window specifications of π = 4 and π = 2 yielded significant results in 75% and 69% of the exercises, respectively. Finally, as with the ENCNEW test, the models with an estimation window specification of π = 1 achieved the lowest frequency of significance, with significance found in 56% of the exercises. All significant coefficients were positive and consistent with the in-sample and out-of-sample predictions discussed above.
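The ENC-t test works with the same encompassing terms c_t = e1_t·(e1_t − e2_t) but forms a t-ratio on their mean rather than a scaled sum. A minimal sketch, assuming the usual Clark–McCracken/Harvey–Leybourne–Newbold form (variants of the variance estimator exist, and this is not the authors' code):

```python
import numpy as np

def enc_t(e_bench, e_core):
    """ENC-t encompassing statistic: sqrt(P-1) * mean(c) / sd(c),
    with c_t = e1_t * (e1_t - e2_t) and sd computed without a
    degrees-of-freedom correction."""
    e1 = np.asarray(e_bench, float)
    e2 = np.asarray(e_core, float)
    c = e1 * (e1 - e2)
    P = len(c)
    return np.sqrt(P - 1) * c.mean() / np.sqrt(np.mean((c - c.mean()) ** 2))

# Hypothetical errors chosen so that c = [1, 2, 3, 2].
stat = enc_t([1.0, 1.0, 1.0, 1.0], [0.0, -1.0, -2.0, -1.0])
```

A large positive statistic rejects the null that the benchmark forecast encompasses the core model, i.e., that tone disagreement adds no predictive content.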
As a final observation, the core models performed remarkably well in predicting the direction of change of the two most popular and most heavily traded indices, the VIX and VXO. Moreover, they were significantly superior in every exercise, consistent with the ENCNEW and ENC-t test results.
As a last out-of-sample exercise, Table 6 reports the Mean Directional Accuracy of predictions made with the core models (Table 1, row 1), calculating the hit rate as a simple average of the directional-accuracy indicator defined in Equation (8). We contrast the null hypothesis that the hit rate equals 50% with the alternative that it exceeds 50%, a direct comparison against a "pure luck" benchmark. The results in Table 6 are remarkable: as reported in the rows titled "Over 50%", the core models outperformed the "pure luck" benchmark in 91% of the exercises. Furthermore, considering all exercises, the average hit rate was 14% higher than the 50% "pure luck" benchmark.
The models with an estimation window specification of π = 0.4, which use 71% of the sample observations to estimate the parameters and the remaining observations to evaluate the forecast models, yielded significance across all exercises. The three remaining estimation window specifications, π = 4, 2, and 1, each yielded significant results in 88% of the exercises. Of the two autoregressive benchmark specifications, one yielded significant results in 91% of the exercises and the other in 88%.
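The hit-rate test against the 50% "pure luck" benchmark can be illustrated with a short sketch. Assuming Equation (8) is the usual same-sign indicator between forecasted and realized changes (the exact form is defined in the paper, not here), a one-sided normal approximation to the binomial gives the significance test; the data below are hypothetical:

```python
import math
import numpy as np

def mda_test(actual_change, forecast_change):
    """Mean Directional Accuracy: share of periods where forecasted
    and realized changes share the same sign, tested one-sided
    against H0: hit rate = 0.5 via a normal approximation."""
    hits = (np.sign(actual_change) == np.sign(forecast_change)).astype(float)
    P = len(hits)
    hit_rate = hits.mean()
    z = (hit_rate - 0.5) / math.sqrt(0.25 / P)      # Var under H0 is 0.25/P
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))   # one-sided P(Z > z)
    return hit_rate, z, p_value

# Hypothetical changes: 3 of 4 directions predicted correctly.
hit_rate, z, p = mda_test(np.array([1.0, -1.0, 1.0, 1.0]),
                          np.array([0.8, -0.2, -0.3, 1.1]))
```

With a realistic number of evaluation periods, hit rates well above 50% translate into comfortable rejections of the pure-luck null.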
Table 6 also reports, in the rows titled "Benchmark", the difference between the percentage of accurate directional predictions of the core model (Table 1, row 1) and that of the benchmark model (Table 1, row 2). Considering all exercises, the core models achieved a hit rate 5% higher than the benchmark models, on average. Under this approach, the core models significantly surpass the benchmark models in 56% of the exercises. In addition, all significant coefficients were positive, corroborating the superior performance of the core models in predicting the correct direction of change in the volatility indices. The models with estimation window specifications of π = 4 and π = 2 yielded significant results in 63% of the exercises, followed by the specifications of π = 1 and π = 0.4, which both yielded significant results in 50% of the exercises.