Improving Many Volatility Forecasts Using Cross-Sectional Volatility Clusters

Abstract: The inhomogeneity of the cross-sectional distribution of realized assets' volatility is explored and used to build a novel class of GARCH (Generalized Autoregressive Conditional Heteroskedasticity) models. The inhomogeneity of the cross-sectional distribution of realized volatility is captured by a finite Gaussian mixture model plus a uniform component that represents abnormal variations in volatility. Based on the cross-sectional mixture model, at each time point, memberships of assets to risk groups are retrieved via maximum likelihood estimation, as well as the probability that an asset belongs to a specific risk group. The latter is profitably used for specifying a state-dependent model for volatility forecasting. We propose novel GARCH-type specifications whose parameters act "clusterwise" conditional on past information about the volatility clusters. The empirical performance of the proposed models is assessed by means of an application to a panel of U.S. stocks traded on the NYSE. An extensive forecasting experiment shows that, when the main goal is to improve many univariate volatility forecasts overall, the method proposed in this paper has some advantages over the state of the art.


Introduction
A well-known stylized fact in financial econometrics states that the dynamics of conditional volatility are state dependent, since they are affected by the (latent) long-run level of volatility itself. This issue has motivated a variety of time-varying extensions of the standard GARCH class of models, including latent-state regime-switching models (Gallo and Otranto 2018; Hamilton and Susmel 1994; Marcucci 2005), observation-driven regime-switching models (Bauwens and Storti 2009, WGARCH), Generalized Autoregressive Score models (Creal et al. 2013, GAS), and Component GARCH models (Engle et al. 2013, GARCH-MIDAS; Engle and Rangel 2008, Spline-GARCH). All these models, directly or indirectly, relate the conditional variance dynamics to its long-run level.
At the same time, in recent years there has been growing interest in modeling the volatility of large-dimensional portfolios, with particular attention to the dynamic interdependencies among several assets (Barigozzi et al. 2014; Engle et al. 2019). However, modeling dynamic interdependencies among several assets would in principle require the estimation of an unrealistically huge number of parameters compared with the available data dimensions. Model parsimony is achieved by introducing severe constraints on the dependence structure of the data (see Pakel et al. 2011). In this respect, Bauwens and Rombouts (2007) showed evidence in favor of the hypothesis that there exist cluster-wise dependence structures in the distribution of financial returns.
In this paper, we propose a novel approach to modelling volatility dynamics for large panels of stocks. In particular, we exploit the cluster structure of returns at the cross-sectional level to build GARCH-type models where the volatility of each asset depends on past information about the cross-sectional volatility clusters.
Let I{A} be the indicator function of the set A, let I_{t-1} be the information set at time t-1, and let E[· | I_{t-1}] denote the expectation conditional on I_{t-1}. Define the indicator variables D_{t,s,j} = I{asset s ∈ group j at time t}, and assume that E[D_{t,s,j} | I_{t-1}] = π_{t,j} ∈ [0, 1]. That is, at time t the j-th group is expected to contain a proportion π_{t,j} of the assets; in other words, π_{t,j} captures the expected size of the j-th cluster at time t. The vector g_{t,s} = (D_{t,s,1}, D_{t,s,2}, ..., D_{t,s,G})' is a complete description of the class memberships of the s-th asset at time t. The elements of g_{t,s} are called "hard memberships", because they link each asset to a unique group, inducing a partition at each t.
Therefore, conditional on the groups' memberships at time t-1, model (1) specifies a GARCH(1,1) dynamic structure for all assets within a given group j. Based on this, model (1) is referred to as the "Clusterwise GARCH" (CW-GARCH) model. Within the j-th cluster, the model parameters ω_{s,j}, α_{s,j} and β_{s,j} can be interpreted as usual as the intercept, shock-reaction and volatility-decay coefficients of the j-th latent component. Therefore, we refer to (ω_{s,j}, α_{s,j}, β_{s,j}) as the "within-group" GARCH(1,1) coefficients. The advantage of model (1) is that it models the conditional variance dynamics in terms of an ordinal state variable, so that the switch between G different regimes is contingent on the overall market behavior. Volatility clusters are arranged in increasing order of risk from j = 1 to j = G, and in two different time periods t_1 and t_2 an asset s may belong to the same risk group, i.e., g_{t_1,s} = g_{t_2,s}, leading to the same GARCH regime, although its volatility may have changed dramatically because the overall market volatility changed dramatically. The state variable g_{t-1,s} transforms a cardinal notion, i.e., volatility, into an ordinal notion induced by the memberships to ordered risk groups. In order to see how the cluster dynamics interact with the GARCH-type coefficients, define
ω_{t,s} = ω_s' g_{t-1,s} = Σ_{j=1}^{G} ω_{s,j} D_{t-1,s,j},  α_{t,s} = α_s' g_{t-1,s} = Σ_{j=1}^{G} α_{s,j} D_{t-1,s,j},  β_{t,s} = β_s' g_{t-1,s} = Σ_{j=1}^{G} β_{s,j} D_{t-1,s,j}.
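The mechanics of the hard-membership recursion can be sketched in a few lines of code. The snippet below is a minimal illustration of the selection of within-group coefficients via the one-hot state vector, not the authors' implementation; the function name `cw_garch_sigma2` and the assumption of demeaned returns are ours.

```python
import numpy as np

def cw_garch_sigma2(r, D, omega, alpha, beta, sigma2_init):
    """CW-GARCH(1,1) variance recursion for one asset under hard memberships.

    r      : (T,) demeaned returns
    D      : (T, G) one-hot rows, D[t, j] = 1 iff the asset is in group j at t
    omega, alpha, beta : (G,) within-group GARCH(1,1) coefficients
    """
    T = len(r)
    sigma2 = np.empty(T)
    sigma2[0] = sigma2_init
    for t in range(1, T):
        # coefficients are selected by the group held at time t-1
        w, a, b = D[t - 1] @ omega, D[t - 1] @ alpha, D[t - 1] @ beta
        sigma2[t] = w + a * r[t - 1] ** 2 + b * sigma2[t - 1]
    return sigma2
```

If the asset never leaves group j, the recursion collapses to an ordinary GARCH(1,1) with parameters (ω_{s,j}, α_{s,j}, β_{s,j}), which is the nesting property exploited later in the paper.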
Equation (2) resembles a GARCH(1,1) specification with time-varying coefficients, obtained by weighting the within-group classical GARCH(1,1) parameters by the class-membership state variables. Although model (2) leads to a convenient interpretation, its formulation is not that of a model with genuinely dynamic parameters: the three parameter vectors ω_s, α_s, β_s do not depend on t, and the dynamics of ω_{t,s}, α_{t,s}, β_{t,s} are driven by the state variables {D_{t-1,s,j}}. From (2) it can be seen that the dynamics of σ²_{t,s} change discontinuously because of the transition of an asset from one risk group to another. We introduce an alternative formulation of the model where the dynamics of σ²_{t,s} are smoothed by replacing the hard-membership state variables {D_{t-1,s,j}} with a smooth version. There are situations where the membership of an asset is not totally clear (e.g., assets on the border of the transition between two risk groups); in these situations one may wish to smooth the transition between groups. Instead of assigning an asset to a single risk group, one can attach a measure of the strength of the membership. In the classification literature these are called "soft labels" or "smooth memberships" (see Hennig et al. 2016). There are various possibilities for defining a smooth assignment; the following choice will be motivated in Section 2.2. Define τ_{t,s,j} ∈ [0, 1], with Σ_{j=1}^{G} τ_{t,s,j} = 1 for all t = 1, 2, .... The quantity τ_{t,s,j} gives the strength at which asset s is associated with the j-th cluster at time t, based on the information at time t. Define the vector τ_{t,s} = (τ_{t,s,1}, τ_{t,s,2}, ..., τ_{t,s,G})'. We propose the following alternative version of model (1), where the variance dynamic equation is replaced by the following σ² process. We call (4) the "Smooth CW-GARCH" (sCW-GARCH) model.
For the sCW-GARCH, the variance process can be written as a weighted sum of GARCH(1,1) models. As before, write the previous equation in terms of time-varying GARCH components. From the latter it can easily be seen that the time-varying GARCH(1,1) components of the sCW-GARCH change smoothly as assets migrate from one risk group to another. In this case, the formulation above also gives a better intuition about the role of the within-group GARCH parameters {ω_{s,j}, α_{s,j}, β_{s,j}}. Note that
∂ω_{t,s}/∂τ_{t-1,s,j} = ω_{s,j},  ∂α_{t,s}/∂τ_{t-1,s,j} = α_{s,j},  ∂β_{t,s}/∂τ_{t-1,s,j} = β_{s,j}.
Therefore, each within-cluster GARCH parameter expresses the marginal variation of the corresponding GARCH component caused by a change in the degree of membership with respect to the corresponding risk group. From a different angle, it is worth noting that the sCW-GARCH model can be seen as a state-dependent dynamic volatility model with a continuous state space where, at time t, the current value of the state is determined by the smooth memberships τ_{t,s}. In contrast, in the CW-GARCH model the state space is discrete, since only G values are feasible, and the current value of the state is determined by the hard memberships g_{t,s}.
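Under smooth memberships the same variance recursion applies with τ-weighted coefficients. Again a minimal sketch under our own naming conventions, not the authors' code:

```python
import numpy as np

def scw_garch_sigma2(r, tau, omega, alpha, beta, sigma2_init):
    """sCW-GARCH(1,1) variance recursion: each coefficient is the
    smooth-membership weighted average of the within-group parameters.

    r   : (T,) demeaned returns
    tau : (T, G) rows of membership weights, each row summing to 1
    """
    T = len(r)
    sigma2 = np.empty(T)
    sigma2[0] = sigma2_init
    for t in range(1, T):
        # linear in tau: the within-group parameters are marginal effects
        w, a, b = tau[t - 1] @ omega, tau[t - 1] @ alpha, tau[t - 1] @ beta
        sigma2[t] = w + a * r[t - 1] ** 2 + b * sigma2[t - 1]
    return sigma2
```

With one-hot rows of `tau` the recursion reproduces the hard-membership CW-GARCH path, so the smooth model nests the hard one.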

Cross-Sectional Cluster Model
In this section, we introduce a model for the cross-sectional distribution of assets' volatility. While considerable research has investigated the time-series structure of the volatility process and its relationships with market and expected returns (see, among others, Campbell and Hentschel 1992; Glosten et al. 1993), the question of what the distribution of assets' volatility looks like at a given time point has received less attention. The key assumption in this work is that, at a given time point, there are groups of assets whose volatilities cluster together to form groups of homogeneous risk. This assumption has already been explored in earlier work, and it is empirically motivated by analyzing cross-sectional realized volatility data. From the data set studied in Section 4, including 123 stocks traded on the New York Stock Exchange (NYSE), in Figure 1 we show the kernel density estimate of the cross-sectional distribution of the realized volatility on two consecutive trading days. Details on how the realized volatility is computed are postponed to Section 4. In both panels of Figure 1 there is evidence of multimodality, which is consistent with the idea that there are groups of assets forming sub-populations with different average volatility levels. In panel (b) the kernel density estimate is corrupted by a few assets exhibiting abnormally large realized volatility; this happens on a large number of trading days. In Figure 1b the two rightmost density peaks, close to 0.006 and 0.012, each capture just a few scattered points, although the plot seems to suggest the presence of two symmetric components. It is not the kernel density estimator itself, but the estimate of the optimal bandwidth of Sheather and Jones (1991) used in this case, that is heavily affected by this typical right-tail behavior of the distribution. This sort of artifact is common to other kernel density estimators.
Unless strong smoothing is introduced, in the presence of scattered points in low-density regions the kernel density estimate is mainly driven by the shape of the kernel function. On the other hand, increasing the smoothing would obscure the underlying multimodal structure.
The cross-sectional structure discussed above is consistent with mixture models. The idea is that each mixture component represents a group of assets that share similar risk behaviour. Let h_{t,s} be the realized volatility of asset s at time t. We assume that at each t there are G groups of assets. Furthermore, conditional on D_{t,s,j} = 1, the j-th group has a distribution f_j(·) that is symmetric about the center m_{t,j}, has variance v_{t,j}, and its expected size (proportion) is π_{t,j} = E[D_{t,s,j} | I_{t-1}]. This implies that the cross-sectional distribution of the realized volatility (unconditional on {D_{t,s,j}}) is represented by the finite location-scale mixture model below. The parameter m_{t,j} represents the average within-group realized volatility, while v_{t,j} represents its dispersion. Mixture models can reproduce clustered populations, and they are a popular tool in model-based cluster analysis (see Banfield and Raftery 1993; McLachlan and Peel 2000, among others). Earlier work exploited the assumed mixture structure, using robust model-based cluster analysis to group assets with similar risk, and proposed a parsimonious multivariate dynamic model where aggregate clusters' volatility is modeled instead of individual assets' volatility. The main goal of that work was to reduce the unfeasibly large dimensionality of multivariate volatility models for large portfolios. Clustering methods applied to financial time series were also used in Otranto (2008), where the autoregressive metric (see Corduas and Piccolo 2008) is applied to measure the distance between GARCH processes. Finite mixtures of Gaussians, that is, when f_j(·) is the Gaussian density, are effective for modeling symmetrically shaped clusters even when clusters are not exactly normal.
In this case, however, some more structure is needed to capture the effects of large variations in assets' volatility that often show up on many trading days, as in the example shown in panel (b) of Figure 1. Earlier work proposed to adopt the approach of Coretto and Hennig (2016, 2017), where an improper constant-density mixture component is introduced to catch points (interpreted as noise or contamination) in low-density regions. This makes sense in all those situations where there are points extraneous to each cluster that have an unstructured behavior and can potentially appear anywhere in the data space. This is not exactly the case studied in this paper. In fact, realized volatility is a positive quantity, and this heterogeneous component can only affect the right tail of the distribution. Here we assume that the group of the few assets inflating the right tail of the distribution has a proper uniform distribution whose support does not overlap with the other regular volatility clusters. Let j = 0 denote this group of assets exhibiting "abnormal" large volatility; we call this group of points "noise". The term noise here is inherited from the robust clustering and classification literature (see Ritter 2014), where it is understood as a "noisy cluster", that is, a cluster of points with an unstructured shape compared with the main groups in the population. The uniform distribution is a convenient choice for capturing an atypical group of points not having a central location, scattered in density regions somewhat separated from the main bulk of the data (Banfield and Raftery 1993; Coretto and Hennig 2016; Hennig 2004).
We assume the following, where l_t < u_t are respectively the lower and upper limits of the support of the uniform distribution. This implies that, without conditioning on the class labels {D_{t,s,j}}, the cross-sectional distribution of the assets' volatility is represented by the following finite mixture model, where φ(·) is the Gaussian density function. The unknown mixture parameter vector is θ_t = (π_{t,0}, l_t, u_t, π_{t,1}, m_{t,1}, v_{t,1}, ..., π_{t,G}, m_{t,G}, v_{t,G})'. This class of models was introduced in Banfield and Raftery (1993), and studied in Coretto and Hennig (2010, 2011) to perform robust clustering. The additional problem here is that, if θ_t is unrestricted, one may have situations where the support of the noise group overlaps with one or more regular clusters if l_t is small enough. The latter would be inconsistent with the empirical evidence. To overcome this, we propose the following restriction: we assume that θ_t satisfies (8). The constant λ > 0 controls the maximum degree of overlap between the support of the uniform distribution representing atypical observations and the closest regular Gaussian component. To see this, let z_{0.99} be the 99% quantile of the standard normal distribution, and take λ = z_{0.99}. The restriction (8) then means that the uniform component can only overlap with the closest Gaussian component in its 1% tail probability. Restriction (8) thus ensures good separation between regular and non-regular assets. Although the mixture model introduced in this section is interesting for how it fits the cross-sectional distribution of realized volatility at each time period, the main issue here is to obtain the hard class-membership variables {D_{t,s,j}} and the smooth version {τ_{t,s,j}}. Since τ_{t,s,j} = Pr{D_{t,s,j} = 1 | I_t}, the quantities in (9) are obtained simply by applying the Bayes rule; this is the reason why they are also called "posterior weights" for class memberships.
It can easily be shown (see Velilla and Hernández 2005, and references therein) that the optimal partition of the points can be obtained by applying the following assignment rule, also called the "Bayes classifier". Basically, the Bayes classifier assigns a point h_{t,s} to the group with the largest posterior probability of membership. The assignment rule in (10) is optimal in the sense that it achieves the lowest misclassification rate. Therefore, in order to obtain (10) and (9) from the data, one needs to estimate θ_t at each cross-section. Although we use the subscript t to distinguish the θ parameter in each cross-section, we do not assume any dependence in it. Here we treat the number of groups G as fixed and known. While in some situations, including the one studied in this paper, a reasonable value of G can be determined based on subject-matter considerations, this is not always the case. In Section 4 we motivate our choice of G = 3 for this study and give some insights on how to fix it in general.
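Computationally, the posterior weights and the Bayes classifier amount to normalizing component densities and taking an argmax. The numpy sketch below, with names of our own choosing, illustrates this for a single cross-section of the uniform-plus-Gaussian mixture; it assumes the parameters are already estimated.

```python
import numpy as np

def posterior_weights(h, pi0, l, u, pi, m, v):
    """Posterior membership weights and Bayes-classifier labels for one
    cross-section under the uniform (j = 0) plus Gaussian (j = 1..G) mixture.

    h         : (S,) realized volatilities at a given time point
    pi0, l, u : weight and support [l, u] of the uniform noise component
    pi, m, v  : (G,) weights, means and variances of the Gaussian groups
    """
    h = np.asarray(h, dtype=float)
    G = len(pi)
    dens = np.zeros((len(h), G + 1))
    dens[:, 0] = pi0 * ((h >= l) & (h <= u)) / (u - l)       # uniform density
    for j in range(G):
        dens[:, j + 1] = pi[j] * np.exp(-(h - m[j]) ** 2 / (2 * v[j])) \
                         / np.sqrt(2 * np.pi * v[j])         # Gaussian density
    tau = dens / dens.sum(axis=1, keepdims=True)             # Bayes rule
    labels = tau.argmax(axis=1)                              # Bayes classifier
    return tau, labels
```

Label 0 corresponds to the noise group, while labels 1..G index the regular clusters (ordered by risk after the relabeling step described in Section 4).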
In Section 3 we introduce estimation methods for the quantities of interest, that is, the class memberships {D_{t,s,j}} and the smooth weights {τ_{t,s,j}}. But before concluding this section we show how model (7) under (8) fits the data sets of the example in Figure 1. In this paper we are mainly interested in the cluster structure of the cross-sectional data; however, it is also of interest to investigate the fitting capability of the cross-sectional model. The estimated densities for the two data sets of the example in Figure 1 are shown in Figure 2. When the group of noisy assets is reasonably concentrated, the corresponding support of the fitted uniform distribution becomes smaller, and the uniform density easily dominates the tail of the closest regular distribution (e.g., this happens in panel (b)). The discontinuity of the model density is less noticeable in cases like the one in panel (d), where the support of the uniform is rather large because of extremely scattered abnormal realized volatility. Although the discontinuity introduced by the uniform distribution adds some technical issues for the estimation theory (see Coretto and Hennig 2011), it has two main advantages: (i) it allows representing unstructured groups of points with a well-defined, simple and proper probability model; (ii) the noisy cluster is understood as a group of points not having any particular shape, different and distinguished from the regular clusters' prototype model (the Gaussian density here). Therefore, a discontinuous transition between regular and non-regular density regions obeys the philosophy in robust statistics that, in order to identify outliers/noise, they need to arise in low-density regions under the model for the regular points. Further discussion of the model-based treatment of noise/outliers in clustering can be found in Hennig (2004) and Coretto and Hennig (2016).
Figure 2. (a) Fitted density based on model (7) with G = 3, under the restriction (8) and with π_0 = 0 (that is, the uniform component is inactive). (b) Realized volatility on 27/Aug/1998; fitted density based on model (7) with G = 3, under the restriction (8).

Estimation
The main goal of this paper is to estimate models (1) and (4). If the class-membership variables {D_{t,s,j}} or the posterior weights {τ_{t,s,j}} were known, one could estimate the unknown parameters by optimizing the sample log-likelihood function of (1) or (4). However, neither {D_{t,s,j}} nor {τ_{t,s,j}} is known. In Section 2.2 we developed a model for the cross-sectional distribution of the realized volatility, and an unknown parameter vector θ_t controls this distribution at each time point. It may be possible to embed both the time-series model and the cross-sectional model in a single likelihood function and proceed with a single Maximum Likelihood (ML) estimation. Although this is in principle possible, it would lead to an unfeasible optimization problem. Since we assume that at each time point the time-series model (1) (or (4)) interacts with the current values of {D_{t,s,j}} (or {τ_{t,s,j}}), we propose to simplify the estimation into two separate steps.
Step 1: using cross-sectional data on realized volatility, at each time period the parameter θ_t is estimated by ML. The fitted θ_t is used to obtain estimates of {D_{t,s,j}} and {τ_{t,s,j}} based on (9) and (10).
Step 2: the ML (or quasi-ML) method is applied to estimate the unknown parameters of (1) or (4).
Step 1 is performed by solving the constrained ML program (11). Let θ̂_t be the solution of this ML program. In addition to the model constraint introduced in (8), the proposed ML optimization has two further constraints that require the specification of a positive constant v_min. This is a lower bound on the variance of both types of mixture components (i.e., the Gaussian and the uniform). It is well known that the unconstrained ML for mixtures of location-scale distributions does not exist, and in practice it can easily lead to spurious solutions (see McLachlan and Peel 2000; Redner and Walker 1984, among others). In our study we set v_min = 10^{-5}, while we take λ = z_{0.99} as explained in Section 2.2. The numerical solution of (11) is complicated by the discontinuities introduced by the uniform component and by the additional constraint implementing (8). The Expectation-Maximization (EM) algorithm proposed in Coretto and Hennig (2011) can be easily adapted to obtain θ̂_t. Based on θ̂_t, the corresponding {D_{t,s,j}} and {τ_{t,s,j}} are computed by plugging θ̂_t into (9) and (10). In Step 2, conditional on the estimated memberships {D_{t,s,j}} and {τ_{t,s,j}}, for any asset s the parameters of the model for returns, ψ_s = (µ_s, ω_s, α_s, β_s)', are estimated by maximizing the conditional likelihood function subject to ω_j > 0, α_j ≥ 0, and β_j ≥ 0, for j = 1, 2, ..., G,
where φ(x|·) denotes the conditional density of X, with L_{s,t-1} = g_{t,s} for the CW-GARCH model and L_{s,t-1} = τ_{t,s} for the sCW-GARCH model.
In order to gain some insight into the statistical properties of the estimation procedure for the dynamic volatility coefficients, we performed a Monte Carlo simulation study. First, conditional on appropriately selected state variables g_{t,s} (see below), we generated n_sim = 500 time series of length T = 2500, taking the CW-GARCH as the Data Generating Process (DGP). The same model was then fitted to each of these series by maximizing the conditional likelihood function in (12). Next, the same simulation procedure was implemented taking the sCW-GARCH model as the DGP, again conditioning on appropriately selected state variables τ_{t,s}. In order to facilitate the comparison of results across different DGPs, the same vector of volatility coefficients ψ was used for simulating from both the CW-GARCH and the sCW-GARCH.
The length of the simulated series was selected to match the time-series dimension of the panel analyzed in Section 4. We note that the models proposed in this paper are conditional on the state variables g_{t,s} and τ_{t,s}. In practical applications these are not observed and, in fact, the cross-sectional clustering algorithm is introduced precisely to recover them. In order to have a sampling design consistent with the empirical evidence, the simulation uses as input state variables the g_{t,s} and τ_{t,s} obtained by applying the clustering algorithm (11) to carefully selected stocks in the panel studied in Section 4. We recall that models (2) and (4) contain the GARCH(1,1) model as a nested case if an asset does not migrate across states, but stays stably in one of the volatility clusters. In order to have a sample representing maximum heterogeneity, we looked for the g_{t,s} and τ_{t,s} of the asset whose distribution over the states {low, medium, high, noise} has maximal entropy. This particular selection guarantees that the state variables g_{t,s} and τ_{t,s} used as input for the simulation represent the dynamics of an asset that travels uniformly across the states. The same GARCH-type parameter vector, corresponding to the ψ-row in Table 1, was fixed to simulate both (2) and (4). While there exists an extensive empirical literature on the classical GARCH(1,1) model, so that it is known what realistic parameter values to pick for a numerical experiment, this obviously does not apply to models (2) and (4) proposed in this paper. Therefore we fixed ψ as a perturbation of the median behavior (across the market) obtained from the in-sample results in Section 4.1.
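A conditional simulation of this kind can be set up in a few lines. The sketch below is our own minimal version, assuming Gaussian innovations and a pre-specified hard-membership path D; it is not the authors' simulation code.

```python
import numpy as np

def simulate_cw_garch(D, omega, alpha, beta, sigma2_init, seed=0):
    """Simulate returns from a CW-GARCH(1,1) DGP, conditional on a given
    hard-membership path D (T x G one-hot), with standard normal innovations."""
    rng = np.random.default_rng(seed)
    T = D.shape[0]
    r = np.zeros(T)
    sigma2 = np.zeros(T)
    sigma2[0] = sigma2_init
    r[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, T):
        # coefficients selected by the state held at t-1, as in model (2)
        sigma2[t] = D[t - 1] @ omega + (D[t - 1] @ alpha) * r[t - 1] ** 2 \
                    + (D[t - 1] @ beta) * sigma2[t - 1]
        r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return r, sigma2
```

The sCW-GARCH DGP is obtained by passing rows of smooth weights τ in place of the one-hot rows of D.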
The simulation results, summarized in Table 1, support the following facts. First, for both DGPs, the estimates are on average close to the DGP coefficients. Second, in general, the simulated Root Mean Squared Errors (RMSE) are, in relative terms, small compared to the level of the underlying coefficient. Finally, no clear ordering of the two models arises in terms of efficiency in the estimation of model parameters. In Table 1, the superscripts c and s refer to the CW-GARCH and sCW-GARCH models, respectively; se(E(ψ̂_A)) denotes the simulated standard error of the Monte Carlo mean of ψ̂_A, and RMSE(ψ̂_A) the simulated RMSE of ψ̂_A.

Empirical Study
Our data set is composed of 5-min log-returns for 123 assets traded on the NYSE from 11/Aug/1998 to 18/Jul/2008, for a total of 2500 trading days. The assets were selected in order to (i) guarantee the availability of a sufficiently long time span, and (ii) maximize liquidity over the period of interest. The 5-min returns were then aggregated on a daily scale to compute daily Realized Variances on open-to-close log-returns.
For Step 1 of the estimation we fix G = 3. The choice of G is an important decision, and it is rarely determined from the data alone. Although there is a large collection of methods for selecting an appropriate number of clusters, a consensus in the literature has not been reached. In the model-based clustering context a popular choice is to select G based on the BIC, or alternative information criteria (for a comprehensive review see McLachlan and Peel 2000). However, none of the existing methods has strong theoretical guarantees, and often it makes more sense to supervise the choice of G with subject-matter considerations in addition to data-driven methods (Hennig and Liao 2013; Hennig et al. 2016). The choice of G = 3 is motivated by various arguments. First of all, we considered G ranging from 2 to 6 and looked at many cross-sectional fits at random (as in Figure 2). The BIC always suggested G ∈ {3, 4}, and a supervised analysis of the results confirmed that meaningful cross-sectional fits were always obtained with G = 3 or G = 4. We also sought the opinion of several professional financial analysts, and the general conclusion is that a classification into low-medium-high risk is the conventional operational way to categorize risk. Our cross-sectional model (7) also includes an atypical component for catching extremely large risk levels. Therefore, we finally compared G = 3 and G = 4 and, based on the performance in the forecasting experiment described in Section 4.2, concluded that G = 3 is the appropriate number of regular groups, because it produced the overall lowest average (across assets) Root Mean Square Prediction Error (RMSPE). Details about forecasting performance are treated in Section 4.2. Selecting the value of G that optimizes the predictive performance is an appropriate way to approach the choice of G when the main goal is to predict future volatility.
For each cross-sectional fit of θ_t the mixture components are always ordered in terms of increasing risk, so that the groups j = 1, 2, 3 always correspond to groups of increasing average realized volatility. In other words, the cluster indexes j are rearranged so that m_{t,1} ≤ m_{t,2} ≤ m_{t,3}; in case of equal means the order is based on the variances v_{t,j}, and for equal variances the order is based on the expected sizes π_{t,j}. We refer to the groups corresponding to j = 1, 2, 3 as group = low, medium, high. The group identified with j = 0 is called noise, and in the ordering scheme it is placed after the high group because, in this empirical application, it always captures the overall highest-volatility assets. For the noise component, an indexing alternative to j = 0 could be chosen; in principle outlying points can lie anywhere in the data range, and the uniform component may well catch small outliers or inliers (see Coretto and Hennig 2010). In the robust clustering literature it is common to denote the noise component by j = 0, so that G still defines the number of regular clusters.
Step 2 of the estimation procedure is performed using a Gaussian Quasi-Likelihood function: we take the conditional density in (13) to be Gaussian, leading to the expression (14) for the overall Quasi-Likelihood. Maximizing (14) with respect to ψ_s, for any asset s, leads to consistent and asymptotically Normal estimates of (µ_s, ω_s, α_s, β_s), even when the conditional returns distribution in (13) is not correctly specified (Bollerslev and Wooldridge 1992).
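In practice the maximization can be carried out numerically on the Gaussian quasi-log-likelihood. The sketch below (our own parameterization and initialization choices, not the authors' code) evaluates the negative quasi-log-likelihood for one asset given its estimated membership path:

```python
import numpy as np

def neg_qml(psi, r, L):
    """Negative Gaussian quasi-log-likelihood for one asset.

    psi : (1 + 3G,) parameter vector (mu, omega_1..G, alpha_1..G, beta_1..G)
    r   : (T,) returns
    L   : (T, G) membership path (one-hot for CW-GARCH, smooth for sCW-GARCH)
    """
    G = L.shape[1]
    mu = psi[0]
    omega, alpha, beta = psi[1:G + 1], psi[G + 1:2 * G + 1], psi[2 * G + 1:]
    e = r - mu
    T = len(r)
    sigma2 = np.empty(T)
    sigma2[0] = e.var()                      # crude variance initialization
    for t in range(1, T):
        sigma2[t] = L[t - 1] @ omega + (L[t - 1] @ alpha) * e[t - 1] ** 2 \
                    + (L[t - 1] @ beta) * sigma2[t - 1]
    ll = -0.5 * (np.log(2 * np.pi) + np.log(sigma2[1:]) + e[1:] ** 2 / sigma2[1:])
    return -ll.sum()
```

Minimizing this function, e.g. with `scipy.optimize.minimize` under the bounds ω_j > 0, α_j ≥ 0, β_j ≥ 0, gives the quasi-ML estimates of ψ_s.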

In-Sample Results
The cross-sectional estimation showed various interesting patterns in the data. Assets migrate from one risk group to another over time, but these class-switch dynamics show a strong dose of persistence. In Figures 3-5 we show time series of the realized volatility and the class switches for three assets: JNJ, LSI, and NOVL. In Figure 3 we note that asset JNJ persistently stays in the low risk class, and occasionally switches to the next (medium) group for short time intervals when the market volatility bumps up. The interesting thing to observe is that this behavior is almost the same in the two periods, i.e., the dynamics of these switches do not depend on the level of the market volatility. As expected, in the high-volatility period the persistence in the low group decreases, but overall the pattern is rather stable. If we count the number of changes from one class to another, asset JNJ is the most stable asset overall, with the longest permanence in the low risk group. On the other hand, asset LSI in Figure 4 was chosen because it is the most unstable, i.e., it is the asset with the largest number of jumps from one class to another. This asset has less persistent dynamics in terms of class switches. The interesting thing is that this asset stays longer in the lower risk group when the market volatility increases. A different pattern is observed for asset NOVL in Figure 5. NOVL, like LSI, has a strong tendency to stay in the high risk group, with a preference for the noise group. However, contrary to LSI, NOVL increases its stay in the noise group as the market volatility increases. The cross-sectional clustering step also transforms a cardinal quantity like volatility into an ordinal variable that is much easier to interpret. One of the advantages of such a categorization of volatility is that it is market contingent. These graphical plots of the class switches may be of practical interest for financial analysts.
This is an interesting application in itself, perhaps worth exploring in more detail in further research. Moving our attention to the results of Step 2 of the estimation procedure, Table 2 reports the average and median values of the second-stage conditional log-likelihood and of the Bayesian Information Criterion (BIC) for the CW-GARCH, the sCW-GARCH and two benchmarks given by the GARCH and GJR models of order (1,1), respectively. We recall that the volatility equation of a GJR(1,1) model (Glosten et al. 1993) is given by
σ²_t = ω + (α + γ I{ε_{t-1} < 0}) ε²_{t-1} + β σ²_{t-1},
where the additional γ coefficient controls for leverage effects. In particular, the occurrence of leverage effects is associated with positive values of this coefficient. The GARCH(1,1) model is nested within the GJR specification for γ = 0. The CW-GARCH and sCW-GARCH models clearly outperform the benchmarks in terms of both average and median BIC, with the sCW-GARCH being the best performer. As expected, the GJR-GARCH outperforms the simpler GARCH model. The picture is completed by the analysis of Figure 6, which reports the box-plots of the ratios between the BIC values of the CW-GARCH and sCW-GARCH, in the numerator, and those of the benchmarks, in the denominator. For ease of presentation, the ratios have been multiplied by 100 so that, in the plot, values above 100 indicate that the benchmark is outperformed by the model in the numerator. The plot makes evident that, in the vast majority of cases, the CW-GARCH and sCW-GARCH perform better than the GARCH and GJR models. Also, while there are several assets for which the CW-GARCH and sCW-GARCH do remarkably better than the benchmarks (in some cases the gain in BIC is above 40%, for the GARCH, and above 25%, for the GJR), the reverse does not hold.
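For reference, the GJR(1,1) benchmark recursion can be written in a few lines. This is a generic sketch of the standard model, with hypothetical parameter values in the check below, not the parameters estimated in the paper.

```python
import numpy as np

def gjr_sigma2(e, omega, alpha, gamma, beta, sigma2_init):
    """GJR(1,1): sigma2_t = omega + (alpha + gamma * 1{e_{t-1} < 0}) * e_{t-1}^2
                            + beta * sigma2_{t-1}.
    A positive gamma means negative shocks raise volatility more (leverage)."""
    T = len(e)
    sigma2 = np.empty(T)
    sigma2[0] = sigma2_init
    for t in range(1, T):
        lev = gamma if e[t - 1] < 0 else 0.0
        sigma2[t] = omega + (alpha + lev) * e[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```

Setting `gamma=0` recovers the plain GARCH(1,1) recursion, which is the nesting used for the benchmark comparison.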
Finally, it is interesting to look at the average (Table 3) and median (Table 4) estimated coefficients across the whole set of 123 assets. For both CW-GARCH and sCW-GARCH, some regularities arise. First and most importantly, the α coefficient tends to decrease as we move from low to high volatility components, taking its minimum value for the noise component. This implies that the impact of past squared returns tends to be down-weighted as the relative volatility level increases. Intuitively, this effect reaches its extreme in the case of the noise component, where it is reasonable to expect that current returns do not offer a strong signal for the prediction of future conditional variance. Accordingly, the volatility persistence (α + β) also takes its minimum value for the noise component.
Table 3. Averages of estimated coefficients for models (1) and (4). For scaling reasons, coefficients marked with (*) are multiplied by 10^5. The average is taken across the S = 123 assets in the sample.
Table 4. Medians of estimated coefficients for models (1) and (4). For scaling reasons, coefficients marked with (*) are multiplied by 10^5. The median is taken across the S = 123 assets in the sample.
In addition, since the GARCH model can be obtained as a special case of the CW-GARCH model, we test the significance of the likelihood gains yielded by the latter via a Likelihood Ratio Test (LRT). The p-values of the LRT, for all 123 assets included in our panel, are graphically represented in Figure 7. In order to improve the readability of the plot, the p-values have been transformed to a logit scale. The null hypothesis is rejected at any reasonable significance level for 122 assets out of 123.
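The LRT step and the logit transform used in Figure 7 can be illustrated with a short, dependency-free sketch. The degrees of freedom equal the number of extra parameters of the CW-GARCH relative to the nested GARCH; the closed-form chi-square tail below assumes an even number of degrees of freedom, which is our simplifying assumption for the example, not a feature of the paper's test.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability for even df, via the closed form
    P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    k = df // 2
    term, total = 1.0, 0.0
    for i in range(k):
        if i > 0:
            term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def lrt(ll_restricted, ll_full, df):
    """LRT of the nested GARCH (restricted) against CW-GARCH (full):
    statistic 2*(ll_full - ll_restricted), chi-square with df extra parameters."""
    stat = 2.0 * (ll_full - ll_restricted)
    return stat, chi2_sf_even(stat, df)

def logit(p, eps=1e-300):
    """Logit transform used to spread very small p-values for plotting."""
    p = min(max(p, eps), 1.0 - 1e-16)
    return math.log(p / (1.0 - p))
```

On the logit scale, near-zero p-values map to large negative values, so the 122 rejections of the null are visually well separated from the single non-rejection.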

Forecasting Experiments
In order to assess the ability of the proposed models to generate accurate one-step-ahead volatility forecasts, we have performed an out-of-sample forecasting exercise based on a mixed-rolling-window design with re-estimation every 50 observations over a moving window of 1500 days, implying 20 re-estimations for each asset. Accordingly, the first 1500 observations have been taken as the initial sample period, while the last 1000 have been kept for out-of-sample forecast evaluation. As in the in-sample analysis, the GARCH(1,1) and GJR(1,1) models are considered as benchmarks.
For clarity, for a generic asset s, the structure of the forecasting design can be summarized in the following steps:
1. Using observations from 1 to 1500, fit a CW-GARCH, sCW-GARCH, GARCH and GJR-GARCH model to the time series of returns on asset s.
2. Generate one-step-ahead forecasts of the conditional variance for the subsequent 50 days, that is, for times 1501 to 1550.
3. Re-estimate the models using an updated estimation window from 51 to 1550.
4. Iterate steps 2-3 until the end of the series.
The series of forecasts obtained by the models considered are then scored using two different performance measures: the Root Mean Squared Prediction Error (RMSPE) and the QLIKE loss (Patton 2011). Letting h_t be the realized volatility at time t and $\hat{\sigma}^2_{t,k}$ the conditional variance predicted by model k, where k ∈ {GARCH, GJR-GARCH, CW-GARCH, sCW-GARCH}, the RMSPE and QLIKE for model k are given by
$\mathrm{RMSPE}_k = \sqrt{\frac{1}{1000}\sum_{j=1}^{1000}\left(h_{T+j} - \hat{\sigma}^2_{T+j,k}\right)^2}, \qquad \mathrm{QLIKE}_k = \frac{1}{1000}\sum_{j=1}^{1000}\left[\frac{h_{T+j}}{\hat{\sigma}^2_{T+j,k}} - \log\frac{h_{T+j}}{\hat{\sigma}^2_{T+j,k}} - 1\right],$
for T = 1500 and j = 1, . . . , 1000. In both cases, lower average values of the loss are associated with better performance. Both the RMSPE and QLIKE are strictly consistent for the conditional variance of returns and can be shown to be robust to the quality of the volatility proxy used for forecast evaluation.
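In code, the two losses read as follows. This is a minimal sketch under our naming conventions: `proxy` holds the realized-volatility values h and `forecast` the predicted conditional variances over the evaluation window.

```python
import math

def rmspe(proxy, forecast):
    """Root Mean Squared Prediction Error over the evaluation window."""
    n = len(proxy)
    return math.sqrt(sum((h - f) ** 2 for h, f in zip(proxy, forecast)) / n)

def qlike(proxy, forecast):
    """QLIKE loss (Patton 2011) in the normalization
    L(h, s2) = h/s2 - log(h/s2) - 1, which is zero for a perfect forecast
    and penalizes under-prediction of variance more heavily."""
    n = len(proxy)
    return sum(h / f - math.log(h / f) - 1.0 for h, f in zip(proxy, forecast)) / n
```

Averaging each loss across the 1000 out-of-sample days gives the per-asset scores summarized in Tables 5 and 6.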
The results of the forecasting comparison for all 123 assets in our data set are summarized in Table 5 which, for both RMSPE and QLIKE, reports the percentage of times that each of the models considered has been found to be the best performer. Slightly different pictures emerge under the RMSPE and QLIKE losses. Under the RMSPE loss, the CW-GARCH model is the best performer for approximately one third of the assets, while the remaining models show very close performances, with the GJR-GARCH slightly outperforming the GARCH and sCW-GARCH models. Overall, a state-dependent model, either CW-GARCH or sCW-GARCH, turns out to be the best performer for approximately 55% of the assets. Under the QLIKE loss, no clear winner arises. The GJR-GARCH and CW-GARCH give very close "winning" frequencies, with the former slightly outperforming the latter. Overall, one of the state-dependent models turns out to be the best performer for approximately 50% of the assets. Although looking at "winning" frequencies offers a simple and tempting way of summarizing the forecasting performance of different models across a large number of assets, it should be remarked that this approach does not allow one to consistently rank models according to their overall "aggregate" forecasting performance, since "winning" frequencies do not incorporate any information from a cardinal measure of a model's predictive ability. To overcome this limitation, Table 6 provides a summary of the RMSPE and QLIKE distributions across the 123 assets in our panel. Namely, the table shows that the median loss is substantially lower for the CW-GARCH and sCW-GARCH models than for the benchmarks. The performance gap appears more evident for the RMSPE than for the QLIKE.
Furthermore, compared to the GARCH and GJR-GARCH models, the CW-GARCH and sCW-GARCH models exhibit more stable forecasting performances, as documented by the values of the InterQuartile Range (IQR) for both RMSPE and QLIKE. Figure 8 reports, for each model, the boxplot of the percentage differences between the loss value yielded by the model and that registered by the top performer for a given asset. The plots clearly show that these gaps tend to be smaller for the CW-GARCH and sCW-GARCH models than for the benchmarks. Our findings on the distribution of performance gaps among the four models considered are further investigated and confirmed by Figure 9, for the RMSPE, and Figure 10, for the QLIKE. In each figure, the first two plots compare the average loss yielded by CW-GARCH and sCW-GARCH, respectively, on the y-axis, with the average obtained for the GARCH model on the x-axis. The two scatter plots in the second row repeat the comparison, replacing the GARCH with the GJR model on the x-axis.
Finally, as a further robustness check, we have computed the Diebold-Mariano statistic (Diebold and Mariano 1995) to test the null hypothesis of Equal Predictive Ability (EPA) of CW-GARCH and sCW-GARCH against the two benchmarks given by GARCH and GJR-GARCH, respectively. The results of this testing exercise are summarized in Table 7. Here, %L denotes the percentage of assets for which the average MSPE of the benchmark (GARCH or GJR-GARCH) is lower than that of the CW-GARCH or sCW-GARCH model, respectively. In parentheses, below this value, we report the percentage of times in which this corresponded to a significant Diebold-Mariano test statistic at the usual 5% level. Similarly, %W indicates the percentage of assets for which the average MSPE of CW-GARCH or sCW-GARCH is lower than that of the benchmark. For example, when the GARCH is taken as a benchmark, its average MSPE is lower than that yielded by CW-GARCH in approximately 28.5% of cases, corresponding to 35 assets.
In addition, for 80% of these assets, corresponding to 28 stocks, the Diebold-Mariano test rejects the null of EPA at the 5% level.
The results of our analysis reveal that, in the vast majority of cases (≈ 70%), both CW-GARCH and sCW-GARCH yield an average MSPE that is lower than that of the benchmarks. Furthermore, in over 80% of cases, this leads to a rejection of the EPA hypothesis at the 5% significance level. Key to table: %L: percentage of assets for which the average MSPE of the benchmark (GARCH or GJR) is lower than that of CW-GARCH or sCW-GARCH; %W: percentage of assets for which the average MSPE of CW-GARCH or sCW-GARCH is lower than that of the benchmark; in parentheses: percentage of times in which this corresponded to a significant Diebold-Mariano test statistic at the usual 5% level.
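The Diebold-Mariano statistic itself is straightforward to compute for one-step-ahead forecasts, where no long-run (HAC) variance correction is needed. The sketch below is illustrative, not the paper's code: it compares two per-day loss series and refers the statistic to the standard normal distribution under the null of equal predictive ability.

```python
import math

def diebold_mariano(loss_a, loss_b):
    """DM statistic for equal predictive ability, one-step-ahead case:
    with d_t = loss_a_t - loss_b_t, DM = dbar / sqrt(var(d)/n),
    asymptotically N(0,1) under the null. A large positive value means
    model A incurs significantly higher losses than model B."""
    n = len(loss_a)
    d = [a - b for a, b in zip(loss_a, loss_b)]
    dbar = sum(d) / n
    var = sum((x - dbar) ** 2 for x in d) / (n - 1)  # sample variance of d
    return dbar / math.sqrt(var / n)
```

Applying the test asset by asset, with the per-day squared prediction errors of the benchmark and of the CW-GARCH model as the two loss series, yields the rejection percentages reported in Table 7.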

Conclusions and Final Remarks
We have presented a novel approach to forecasting volatility for large panels of assets. Compared to existing approaches, our modelling strategy has some important advantages. First, conditional on the information on the group structure, inference is based on a computationally feasible two-stage procedure. This implies that the investigation of the multivariate group structure of the panel, performed in Stage 1, is separated from the fitting of the dynamic volatility forecasting models, which is performed in Stage 2. The desirable consequence of this architecture is that, at Stage 2, it is possible to incorporate multivariate information on the cross-sectional volatility distribution without having to face the computational complexity of multivariate time series modelling.
Second, for a given stock, the fitted volatility forecasting model has coefficients that are asset-specific and time-varying. Last but not least, the structure of our modelling approach is inherently flexible and can be easily adapted to alternative choices of the clustering variables, as well as of the second-stage parametric specifications used for volatility forecasting. For example, the GARCH specification of the latent components of the CW-GARCH and sCW-GARCH could easily be replaced by other conditionally heteroskedastic models, such as the GJR (Glosten et al. 1993) or even more sophisticated Realized GARCH models (Hansen et al. 2012), which directly use information from realized volatility measures for conditional variance forecasting. On empirical grounds, both in-sample and out-of-sample results provide strong evidence that the proposed approach is able to improve over simple univariate GARCH-type models, thus confirming the intuition that taking into account information on the group structure of volatility can be profitable in terms of predictive accuracy in volatility forecasting.
Author Contributions: All authors contributed equally to this manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Acknowledgments:
The authors would like to thank two anonymous Referees for their stimulating comments and suggestions that helped us to enhance the quality of our paper.

Conflicts of Interest:
The authors declare no conflict of interest.