4.1. Data
This study analyzes financial market time series data with up to 100 dimensions. For the training set, we selected the daily adjusted closing prices of the top 100 Nasdaq-listed stocks by market capitalization, covering the period from 2 January 2009 to 20 September 2021; the start date was chosen to avoid the impact of the 2008 global financial crisis. Only stocks listed before 2009 were included, resulting in a total of 3201 daily observations. The test set spans from 21 September 2021 to 21 December 2021, comprising 65 daily observations. All data were obtained from Yahoo Finance using MATLAB 2022a’s data feed function. The experiments were implemented in MATLAB and conducted on a laptop equipped with an 11th Gen Intel(R) Core(TM) i7-1185G7 processor (3.00 GHz) and 32 GB of RAM.
After completing data cleaning, we obtained 100 stock indices for use in our empirical study. A full list of these indices is provided in Appendix C. To evaluate the performance of the proposed models described in Section 2, we conducted two empirical experiments using two datasets: one consisting of all 100 time series and another with only the last 30 stock indices. The goal of these experiments was to assess the scalability and effectiveness of the proposed model as the dimensionality increases.
The descriptive statistics of the dataset are summarized as follows. The top five business sectors represented in our analysis are technology (22 firms), health care (18 firms), finance (11 firms), consumer services (11 firms), and capital goods (8 firms). The total market capitalization of the selected stocks is USD 30,814 billion, accounting for approximately 46% of the total market capitalization on the Nasdaq exchange.
All stocks exhibit a positive mean return, with return values ranging from −39.02% to 52.29%. The standard deviation ranges from 1.06% to 3.67%, while skewness values range between −1.04 and 1.56. Kurtosis spans from 5.95 to 47.26. These statistics indicate that the return distributions are asymmetric and exhibit heavy tails.
Additionally, the Jarque–Bera test statistics for all returns strongly reject the null hypothesis of normality at the 1% significance level, confirming that the return distributions deviate significantly from normality.
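The normality check above can be illustrated with a minimal sketch. It assumes the usual moment-based skewness and kurtosis and the asymptotic chi-square distribution with 2 degrees of freedom for the Jarque–Bera statistic; the series `returns` is synthetic (a scale mixture of normals) and only stands in for a heavy-tailed return series, not for the data used in this study.

```python
import math
import random

def moments(x):
    """Return mean, standard deviation, skewness and kurtosis (population form)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2

def jarque_bera(x):
    """Jarque-Bera statistic and its asymptotic p-value (chi-square, 2 d.o.f.)."""
    n = len(x)
    _, _, skew, kurt = moments(x)
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    return jb, math.exp(-jb / 2.0)

# Synthetic heavy-tailed "returns" (illustrative only, not the study's data).
random.seed(0)
returns = [random.gauss(0.0, 0.03 if random.random() < 0.1 else 0.01)
           for _ in range(3201)]
jb_stat, p_value = jarque_bera(returns)
```

A kurtosis well above 3 inflates the statistic quickly at this sample size, which is why normality is rejected so decisively for heavy-tailed series.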
4.2. Regular Vine Distribution Selection
Multivariate copula models can be fitted to n-dimensional time series data using a two-stage sequential approach [30]. In the first stage, a univariate marginal model is selected and fitted separately to each time series to obtain standardized residuals, which are then transformed into marginally uniform variables. In this study, we apply our proposed marginal model, detailed in Section 2.3, and compare its performance to a widely used benchmark: the ARMA(1,1)-GARCH(1,1) model with Student-t innovations.
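The first stage can be sketched as follows. This is a simplified illustration and not the marginal model of this study: it assumes a constant-mean GARCH(1,1) filter with parameters supplied by the caller (the arguments `mu`, `omega`, `alpha`, `beta` are hypothetical inputs, not estimates), and it replaces the parametric probability integral transform with rank-based pseudo-observations.

```python
import math

def garch_standardize(returns, mu, omega, alpha, beta):
    """Filter returns through a GARCH(1,1) variance recursion.

    Returns the standardized residuals z_t = (r_t - mu) / sigma_t.
    """
    var = omega / (1.0 - alpha - beta)  # start from the unconditional variance
    z = []
    for r in returns:
        eps = r - mu
        z.append(eps / math.sqrt(var))
        var = omega + alpha * eps ** 2 + beta * var
    return z

def pseudo_observations(z):
    """Map residuals to (0, 1) via the rescaled empirical CDF: rank / (n + 1)."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1.0)
    return u

# Usage with made-up numbers: standardize, then transform to the unit interval.
z = garch_standardize([0.01, -0.02, 0.005, 0.03], 0.0, 1e-5, 0.05, 0.9)
u = pseudo_observations(z)
```

Dividing by n + 1 rather than n keeps every value strictly inside (0, 1), which avoids infinite copula log-densities at the boundary.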
In the second stage, the dependence structure among the transformed residuals is modeled using a multivariate copula. It is well established in the literature that different pairs of financial variables exhibit varying degrees of asymmetry and tail dependence. Traditional multivariate copulas are generally unable to capture such heterogeneous dependence structures. Vine copulas, however, are specifically designed to address this limitation.
While D-vine and C-vine copulas are effective in modeling complex dependencies, they are constrained by their predefined path structures. To overcome this limitation, we investigate the use of regular vine (R-vine) copulas—a more flexible and general class that encompasses both D-vines and C-vines as special cases. R-vines offer significantly greater modeling capacity due to their extensive variety of possible tree structures and pair-copula combinations.
In fact, for high-dimensional datasets, the number of possible R-vine structures grows rapidly, with the total number in n dimensions given by $\frac{n!}{2} \cdot 2^{\binom{n-2}{2}}$ [65], underscoring their flexibility and complexity compared to standard copula models and their vine subclasses.
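To give a sense of scale, the short sketch below evaluates the standard counting results for vine tree structures: n!/2 · 2^C(n−2, 2) regular vines on n variables, and n!/2 structures for each of the C-vine and D-vine subclasses. The function names are illustrative, and the counts concern tree structures only, before any pair-copula family is chosen.

```python
from math import comb, factorial

def count_r_vines(n):
    """Number of regular vine tree structures on n >= 2 variables."""
    return factorial(n) // 2 * 2 ** comb(n - 2, 2)

def count_c_or_d_vines(n):
    """Number of C-vine (equivalently, D-vine) structures on n >= 2 variables."""
    return factorial(n) // 2

# Order of magnitude (number of decimal digits) for the two settings studied here.
digits_30 = len(str(count_r_vines(30)))
digits_100 = len(str(count_r_vines(100)))
```

For n = 4 the 24 regular vines are exactly the 12 C-vines plus the 12 D-vines; from n = 5 onward most regular vines belong to neither subclass.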
The sequential steps for fitting an R-vine copula with marginal specification are as follows:
- (a)
Standardize the residuals of the returns using the univariate marginal model.
- (b)
Select the R-vine path structure of tree $T_j$ by maximizing the empirical Kendall’s $\tau$, via
- b.1
Computation of the empirical Kendall’s $\tau$ for all possible pairs of variables;
- b.2
Spanning tree maximization (of the sum of absolute empirical Kendall’s $\tau$);
- b.3
For each edge, selection of the bivariate copula family and its parameter(s) from all bivariate copula candidates via the Bayesian information criterion (BIC).
- (c)
Iterate Step (b) from the first tree (unconditional tree, $T_1$) to the last tree (conditional tree, $T_{n-1}$), subject to the proximity condition D.
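Steps b.1 and b.2 can be sketched for the first tree as follows, assuming that the dependence measure is the empirical Kendall’s τ and using Prim’s algorithm for the maximum spanning tree. The function names are illustrative, and the O(n²) pair counting is chosen for readability rather than speed.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a by direct pair counting."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return s / (n * (n - 1) / 2)

def max_spanning_tree(weights, d):
    """Prim's algorithm on the complete graph with d nodes.

    weights maps (i, j) with i < j to an edge weight; returns the tree edges.
    """
    in_tree = {0}
    edges = []
    while len(in_tree) < d:
        best = max(
            ((i, j) for i in in_tree for j in range(d) if j not in in_tree),
            key=lambda e: weights[(min(e), max(e))],
        )
        edges.append((min(best), max(best)))
        in_tree.add(best[1])  # best[1] is the newly attached node
    return edges

def first_tree(columns):
    """Select Tree 1 by maximizing the sum of absolute empirical Kendall's tau."""
    d = len(columns)
    weights = {(i, j): abs(kendall_tau(columns[i], columns[j]))
               for i, j in combinations(range(d), 2)}
    return max_spanning_tree(weights, d)
```

For higher trees the same maximization applies, but the candidate edges are restricted by the proximity condition and the inputs are the conditional pseudo-observations from the previous tree, which this sketch does not cover.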
The methodology described above yields a complete specification of a regular vine copula, including the tree structure, the pair-copula families, and their parameters. QMLE is employed in Step b.3 to estimate all relevant parameters. This full specification is then used as the foundation for our Bayesian inference and machine learning procedures.
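The family selection in Step b.3 can be illustrated with a deliberately small example. It assumes only two candidates (the independence copula and the Clayton copula) and a grid search in place of a numerical optimizer; the data are simulated, and the helper names are hypothetical.

```python
import math
import random

def clayton_loglik(theta, u, v):
    """Log-likelihood of the Clayton copula (theta > 0)."""
    ll = 0.0
    for a, b in zip(u, v):
        s = a ** -theta + b ** -theta - 1.0
        ll += (math.log1p(theta)
               - (1.0 + theta) * (math.log(a) + math.log(b))
               - (2.0 + 1.0 / theta) * math.log(s))
    return ll

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_family(u, v, grid):
    """Pick the candidate with the smallest BIC; returns (name, parameter, scores)."""
    n = len(u)
    theta_hat = max(grid, key=lambda t: clayton_loglik(t, u, v))
    candidates = {
        "independence": (0.0, 0, None),  # density 1: log-likelihood 0, no parameter
        "clayton": (clayton_loglik(theta_hat, u, v), 1, theta_hat),
    }
    scores = {name: bic(ll, k, n) for name, (ll, k, _) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, candidates[best][2], scores

def simulate_clayton(theta, n, seed):
    """Draw n pairs from a Clayton copula by conditional inversion."""
    rng = random.Random(seed)
    u, v = [], []
    for _ in range(n):
        a = 1.0 - rng.random()  # value in (0, 1]
        w = 1.0 - rng.random()
        b = (a ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        u.append(a)
        v.append(b)
    return u, v

u_sim, v_sim = simulate_clayton(theta=2.0, n=500, seed=1)
family, theta_hat, scores = select_family(u_sim, v_sim, [0.1 * k for k in range(1, 51)])
```

With the full candidate set, the same comparison runs over all families on each edge, with two-parameter families paying twice the log(n) penalty.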
It is important to note that, for the first tree in the vine structure, parallel computation is feasible due to the independence of the path structure in R-vines. The method proceeds as follows:
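Step 1 (Tree 1): A parallel computation technique is applied to estimate the pair copulas on the edges of the first tree, $T_1$.
Step 2: A FOR loop is used to compute the remaining trees $T_j$ sequentially, where $j = 2, \ldots, n-1$.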
For more details on the maximum spanning tree algorithm used in this context, see [30].
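The two-step scheme above can be sketched as a control-flow skeleton. This is an assumption-laden outline, not the MATLAB implementation used in the experiments: `fit_pair` and `build_next_tree` are hypothetical callbacks standing in for pair-copula estimation and for tree selection under the proximity condition, and the demo callbacks only return placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_first_tree(edges, fit_pair, max_workers=4):
    """Step 1: Tree 1 edges depend only on the marginal data, so fit them concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(edges, pool.map(fit_pair, edges)))

def fit_vine(n, first_tree_edges, fit_pair, build_next_tree):
    """Step 2: trees 2..n-1 each depend on the previous tree, so use a plain loop."""
    trees = [fit_first_tree(first_tree_edges, fit_pair)]
    for level in range(2, n):
        edges = build_next_tree(trees[-1], level)
        trees.append({e: fit_pair(e) for e in edges})
    return trees

def demo_fit_pair(edge):
    """Placeholder for family selection and parameter estimation on one edge."""
    return "fitted" + str(edge)

def demo_next_tree(prev_tree, level, n=4):
    """Placeholder structure: a D-vine-style path with edges (i, i + level)."""
    return [(i, i + level) for i in range(n - level)]

trees = fit_vine(4, [(0, 1), (1, 2), (2, 3)], demo_fit_pair, demo_next_tree)
```

In CPython, threads mainly help when the fitting routine releases the interpreter lock; a process pool would be the usual alternative for CPU-bound estimation.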
Our study employs a range of bivariate copula functions to capture the diverse (a)symmetric dependence structures present in financial time series data. While the core families considered include the Gaussian, Student-t, Gumbel, and Frank copulas, consistent with the approach of [30], we extend the analysis by incorporating a total of 13 copula families. The complete list of candidate bivariate copula families used in this study is provided below:
The Gaussian/Normal copula is symmetric with no tail dependence.
The Student-t copula is symmetric with both upper and lower tail dependence.
The Frank copula is symmetric with tail independence.
The Clayton copula is asymmetric with lower tail dependence.
The B5/Joe, Gumbel, and Galambos copulas are asymmetric, with upper tail dependence but no lower tail dependence.
The BB1, BB3, BB4, BB5, BB7, and BB10 copulas are asymmetric, with different upper and lower tail dependence.
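The tail behavior of several of these families has a closed form, which the following sketch evaluates. It uses the standard textbook expressions for the lower and upper tail dependence coefficients under the usual parameterizations (Clayton θ > 0; Gumbel and Joe θ ≥ 1; BB1 θ > 0, δ ≥ 1; BB7 θ ≥ 1, δ > 0), which are assumed here to match those in Appendix A.

```python
def clayton_tail(theta):
    """(lower, upper) tail dependence of the Clayton copula."""
    return 2.0 ** (-1.0 / theta), 0.0

def gumbel_tail(theta):
    """(lower, upper) tail dependence of the Gumbel copula."""
    return 0.0, 2.0 - 2.0 ** (1.0 / theta)

def joe_tail(theta):
    """(lower, upper) tail dependence of the Joe (B5) copula."""
    return 0.0, 2.0 - 2.0 ** (1.0 / theta)

def bb1_tail(theta, delta):
    """(lower, upper) tail dependence of the BB1 copula."""
    return 2.0 ** (-1.0 / (theta * delta)), 2.0 - 2.0 ** (1.0 / delta)

def bb7_tail(theta, delta):
    """(lower, upper) tail dependence of the BB7 copula."""
    return 2.0 ** (-1.0 / delta), 2.0 - 2.0 ** (1.0 / theta)

# A BB1 copula with separate parameters gives different lower and upper tails.
lower, upper = bb1_tail(theta=0.5, delta=1.5)
```

The two-parameter BB families are useful precisely because the two coefficients can be set separately, whereas the one-parameter families fix one tail at zero.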
Details of the bivariate copula distributions and their key dependence structure properties are summarized in Appendix A. These include the cumulative distribution functions, lower and upper tail dependence orders, and rank-based dependence measures. For additional discussion on copula properties, see, among others, [17] (Chapter 4) and [66] (Appendix).
We also conducted Ljung–Box tests on the standardized residuals from all proposed models. The tests do not reject the null hypothesis of no serial correlation, supporting the independence of the residuals. Full details of these tests are provided in Appendix B.
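For reference, the Ljung–Box statistic can be computed as in the sketch below. It assumes an even number of lags so that the chi-square p-value has a closed form, and it uses the number of lags as the degrees of freedom without any correction for estimated model parameters. Both test series are simulated.

```python
import math
import random

def ljung_box(x, lags=10):
    """Ljung-Box Q statistic and chi-square p-value (lags must be even here)."""
    if lags % 2:
        raise ValueError("this sketch only handles an even number of lags")
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    q = 0.0
    for k in range(1, lags + 1):
        ck = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
        q += (ck / c0) ** 2 / (n - k)
    q *= n * (n + 2)
    # Chi-square survival function for 2m degrees of freedom:
    # exp(-q/2) * sum_{j<m} (q/2)^j / j!
    half = q / 2.0
    p = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(lags // 2))
    return q, p

rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
ar1 = [0.0]
for _ in range(499):
    ar1.append(0.8 * ar1[-1] + rng.gauss(0.0, 1.0))  # strongly autocorrelated series

q_wn, p_wn = ljung_box(white_noise)
q_ar, p_ar = ljung_box(ar1)
```

A large p-value means that the test finds no evidence of autocorrelation up to the chosen lag, which is the outcome one hopes for in standardized residuals.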
4.3. Empirical Results
In this section, we present the empirical results obtained from our proposed Bayesian models and estimation methods for R-vine copulas. A comprehensive set of 13 bivariate copula families is employed within the regular vine structure to capture complex dependence patterns.
For the univariate margins, we apply the mixture distribution defined in Equation (7) to the ICAPM-EGARCH models, which are designed to capture key features of financial time series data such as volatility clustering and heavy tails.
As a benchmark, we use the traditional QMLE method and compare its performance with our proposed approaches, including the rw-MH algorithm and the VBDA methods. These methods are implemented in the R-vine copula framework for dimensions of up to 100. In total, the empirical study involves the estimation of up to 4950 bivariate copula functions and 6138 parameters.
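The pair-copula count follows directly from the vine structure: an R-vine on n variables consists of n − 1 trees, and tree j has n − j edges, each carrying one bivariate copula. As a quick check for the two settings considered here:

```latex
\[
\sum_{j=1}^{n-1} (n-j) = \frac{n(n-1)}{2},
\qquad
\frac{30 \cdot 29}{2} = 435,
\qquad
\frac{100 \cdot 99}{2} = 4950 .
\]
```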
The experimental results, summarized in Table 1 and Table 2, indicate that the VBDA1 method delivers the best performance in terms of marginal likelihood on the training set. On the test set, the variational Bayes approaches, particularly VBDA1, demonstrate substantial improvements in forecasting accuracy as the dimensionality increases, making them strong candidates for high-dimensional modeling. Specifically, as both the number of dimensions and the forecast horizon grow, VBDA1 provides more accurate variability estimates. In contrast, the QMLE method performs relatively poorly in higher dimensions and over longer forecast horizons. However, in the case of 30 dimensions with a three-month forecast window, QMLE outperforms the other methods.
For 100-dimensional data, VBDA1 achieves the best forecasting performance based on the mean absolute deviation (MAD). Meanwhile, both the rw-MH and QMLE methods yield the best performance in terms of root mean square error (RMSE) for the three-month test set, while rw-MH shows the strongest performance in one-month forecasts at the 100-dimensional level.
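The two error measures can be written compactly as below. The sketch assumes that MAD denotes the mean absolute deviation between realized and forecast values and that RMSE is the usual root mean square error; the numbers in the usage lines are made up.

```python
import math

def mad(actual, forecast):
    """Mean absolute deviation between realized and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean square error between realized and forecast values."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# One large miss among otherwise exact forecasts (illustrative numbers).
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [1.0, 2.0, 3.0, 8.0]
mad_value = mad(actual, forecast)
rmse_value = rmse(actual, forecast)
```

Because RMSE squares the errors, a single large miss weighs more heavily than under MAD, which is one reason why the two criteria can rank methods differently.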
Figure 3 illustrates a representative example of the empirical results for a normalized R-vine copula contour plot and rank-based dependence measures. These results are based on the univariate ICAPM-EGARCH–mixture marginal model using the VBDA2 estimation method in a 30-dimensional setting.
Due to space constraints, we present a normalized 10-dimensional subset of the R-vine contour plot for selected stock indices. The stock names appear along the diagonal of the matrix. The lower triangular part of the matrix displays the R-vine copula contours, where each row corresponds to a tree derived from the full 30-dimensional dataset. The upper triangular part shows the empirical Kendall’s $\tau$ and Spearman’s $\rho$, where each column corresponds to a tree.
Similarly, Figure 4 provides an empirical example of the normalized 10-dimensional R-vine contour plots and rank-based dependence measures from the 100-dimensional analysis, using the VBDA1 method. Complete empirical results for both figures, including Kendall’s $\tau$, Spearman’s $\rho$, and the two estimated vine matrices, are available in the Supplementary Materials.
Table 1 summarizes the model performance for the 30-dimensional case. Based on the training dataset, the proposed VBDA1–R-vine–ICAPM–EGARCH–mixture model delivers the best performance in terms of marginal likelihood. The next best performers are the VBDA2 and VBDA0 methods, respectively. Among all estimation methods, VBDA2 is the most computationally intensive, which explains its longer runtime for R-vine copula estimation. VBDA1, while still relatively complex, requires less computing time than VBDA2 and converges more efficiently than simpler methods such as VBDA0.
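Convergence of a variational Bayes scheme is typically monitored through the evidence lower bound (ELBO). The toy example below is not the VBDA algorithm; it assumes a one-dimensional target with unnormalized log-density −(θ − m)²/(2s²) and a Gaussian approximation N(μ, σ²), for which the ELBO and its gradients are available in closed form, and it uses plain gradient ascent with a fixed step size.

```python
import math

def elbo(mu, sigma, m, s):
    """ELBO = E_q[log target] + entropy of q, for q = N(mu, sigma^2)."""
    expected_log_target = -((mu - m) ** 2 + sigma ** 2) / (2.0 * s ** 2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)
    return expected_log_target + entropy

def fit_gaussian_vb(m, s, mu=0.0, sigma=1.0, lr=0.1, n_iter=200):
    """Gradient ascent on the ELBO; returns the fitted (mu, sigma) and the ELBO trace."""
    trace = [elbo(mu, sigma, m, s)]
    for _ in range(n_iter):
        grad_mu = -(mu - m) / s ** 2
        grad_sigma = -sigma / s ** 2 + 1.0 / sigma
        mu += lr * grad_mu
        sigma += lr * grad_sigma
        trace.append(elbo(mu, sigma, m, s))
    return mu, sigma, trace

mu_hat, sigma_hat, elbo_trace = fit_gaussian_vb(m=1.0, s=0.5)
```

In this toy setting the ELBO rises monotonically and levels off at the log normalizing constant of the target, log(s√(2π)); the step size 0.1 is stable for s = 0.5 but would need to be reduced for a much narrower target.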
An illustration of the convergence behavior of the evidence lower bound (ELBO) for all VBDA methods is shown in Figure 5, using the stock index pair (DE, CAT) as an example. Additionally, Figure 6 presents the first tree structure from the spanning tree maximization algorithm for the VBDA2–R-vine–ICAPM–EGARCH–mixture model. In contrast, when evaluated by the Akaike information criterion (AIC), Bayesian information criterion (BIC), and log-likelihood values, the QMLE–R-vine–ARMA–GARCH–Student-t model achieves the best performance. The second-best performer under these criteria is the rw-MH-based R-vine–ICAPM–EGARCH–mixture model, which achieves a favorable acceptance rate in the MH algorithm.
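The acceptance rate mentioned above is a by-product of the random-walk Metropolis–Hastings (rw-MH) sampler. A minimal one-dimensional sketch is given below; it assumes a standard normal target and a Gaussian proposal with a hand-picked step size, neither of which is taken from this study.

```python
import math
import random

def rw_metropolis(log_target, start, step, n_iter, seed=0):
    """Random-walk Metropolis-Hastings; returns the chain and the acceptance rate."""
    rng = random.Random(seed)
    x, lp = start, log_target(start)
    samples, accepted = [], 0
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, step)  # symmetric proposal
        lp_prop = log_target(prop)
        # Accept with probability min(1, target(prop) / target(x)).
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepted += 1
        samples.append(x)
    return samples, accepted / n_iter

def log_std_normal(x):
    """Log-density of the standard normal, up to an additive constant."""
    return -0.5 * x * x

chain, acc_rate = rw_metropolis(log_std_normal, start=0.0, step=2.4, n_iter=20000)
chain_mean = sum(chain) / len(chain)
```

The step size trades off the two failure modes: very small steps are almost always accepted but explore slowly, while very large steps are rarely accepted, so an intermediate acceptance rate is usually taken as a sign of a well-tuned sampler.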
For the test dataset, forecasting performance was evaluated over one-month and three-month horizons using the MAD and RMSE metrics. In terms of forecast errors, the QMLE and rw-MH methods show comparable predictive performance and emerge as the best-performing models overall. Interestingly, within the variational Bayes framework, VBDA1 demonstrates improved performance—especially as the forecast horizon increases—outperforming the other VBDA variants based on MAD values. This trend is further supported by the experimental results.
Table 2 summarizes the performance of the proposed models in the 100-dimensional setting, using both training and test datasets. Overall, as the number of dimensions increases to 100, the performance of variational Bayes methods improves noticeably.
Among all approaches, the Bayesian inference and machine learning methods—particularly the Variational Bayes with Data Augmentation Type 1 (VBDA1)—outperform the traditional QMLE method, especially in terms of the mean absolute deviation (MAD) for the three-month forecast horizon. The VBDA1–R-vine–ICAPM–EGARCH–mixture model achieves the lowest three-month MAD, accurate to three decimal places, indicating superior forecasting accuracy.
These findings suggest that variational Bayes—specifically, the VBDA1 algorithm—performs effectively in high-dimensional settings. Both the training and test results confirm that the VBDA1–R-vine–ICAPM–EGARCH–mixture model surpasses all competing models in terms of marginal likelihood and three-month MAD forecast error.
Based on the model comparisons, the rw-MH algorithm combined with the R-vine-ICAPM-EGARCH-Mixture model emerges as the strongest overall, particularly due to its reasonable acceptance rate. Variational Bayes inference also performs very well, with the VBDA0-R-vine-ICAPM-EGARCH-Mixture model ranking second according to the AIC and BIC values. While the traditional QMLE method shows a decline in performance on the training data, it remains a viable alternative on the test data.
Table 3 and Table 4 summarize the copula families and their parameter counts across models and methods for the 30- and 100-dimensional settings, as selected by Algorithm 1. In total, there are 435 pair-copula functions in 30 dimensions and 4950 in 100 dimensions. In the 30-dimensional case, the best-performing method, VBDA1, estimates 638 parameters, which is the second-lowest among all methods. Conversely, in the 100-dimensional case, VBDA1 estimates 6138 parameters, the highest. This indicates that better model performance does not necessarily correspond to a smaller number of parameters. Among the selected bivariate copula families, the elliptical copulas (Student-t and Gaussian) fit the empirical data best, followed by Archimedean families. The Student-t, Gaussian, and Frank copulas together account for at least 68% of all copulas used, implying that the remaining pair copulas, at most approximately 32%, exhibit asymmetric or tail dependence characteristics. In 30 dimensions, the most frequently selected families are Student-t and Frank, while in 100 dimensions, the elliptical copulas dominate. Notably, the BB6 copula was not selected in the 30-dimensional case, and the BB10 copula was not used in the 100-dimensional case.
In this study, we apply the ICAPM–EGARCH–mixture model with an R-vine copula, which we believe is a useful model. However, it may be subject to model misspecification, which is a limitation of this study. Computing performance is another possible limitation; improving it is left for future research. Validating a practical implementation from the practitioner’s viewpoint is an interesting challenge for future work.