Bayesian Hierarchical Copula Models with a Dirichlet–Laplace Prior

: We discuss a Bayesian hierarchical copula model for clusters of ﬁnancial time series. A similar approach has been developed in recent paper. However, the prior distributions proposed there do not always provide a proper posterior. In order to circumvent the problem, we adopt a proper global–local shrinkage prior, which is also able to account for potential dependence structures among different clusters. The performance of the proposed model is presented via simulations and a real data analysis.


Introduction
There is a large body of literature with respect to hierarchical model settings. The concept to pull the mean of a single group towards the mean across different groups can be found at least in Kelley [1]. Tiao and Tan [2] and Hill [3] consider the one-way random effects model and they discuss a Bayesian approach for the analysis of variance because the frequentist unbiased estimator of the variance of random effects could be negative. For the same model, Stone and Springer [4] discuss and resolve a paradox that arises with the use of Jeffreys' prior. The foundation for the Bayesian hierarchical linear model is established in Lindley and Smith [5]. More recently, Gelman [6] discuss a review on prior distributions for variance parameters in the hierarchical model.
More recently, Zhuang et al. [7] introduced a hierarchical model in a copula framework; they suggest using, for the variance parameters of two different priors, (i) the standard improper prior for scale parameters, which is proportional to σ −2 , or (ii) a vaguely informative prior, say an inverse gamma density with both parameters equal to a small value.
However, both the above proposals might be impractical: in the first case, the posterior is simply not proper (as we show in the Appendix A); in the second case, the use of small parameters of the inverse Gamma priors simply hides the problem without actually solving it; see for example Berger [8].
Hobert and Casella [9] also provide another review on the effect of improper priors in the Gibbs sampling algorithm.
In this paper, we propose a Bayesian hierarchical copula model using a different prior. In particular, we adopt a global-local shrinkage prior. These prior distributions naturally arise in a linear regression framework with high dimensional data and where a sparsity constraint is necessary for the vector of coefficients. Several different global-local shrinkage families of priors have been proposed: Park and Casella [10] and Hans [11] discuss the Bayesian LASSO; Carvalho et al. [12] introduce the Horseshoe prior, Armagan et al. [13] propose a Generalized Double Pareto prior. Here, we will use a Dirichlet-Laplace prior, proposed in Bhattacharya et al. [14], with a slight modification; while in a regression framework, it is natural to adopt a prior that shrinks the parameters towards zero, this is not the case for our hierarchical copula model, where the zero value does not have a Stats 2022, 5 particular interpretation in the model. For this reason we need to introduce a further level of hierarchy, assuming a prior distribution on the location of the shrinkage point.
The rest of this paper is organized as follows: The next section is devoted to illustrating the statistical model and the prior distribution, highlighting the differences with the approach described in Zhuang et al. [7]; we conclude the section with a description of the sampling algorithm. In the third section, we perform a simulation study in order to compare the mean square error of the estimates produced by our model and compare them with a standard maximum likelihood approach. Then, we reconsider a dataset discussed in Zhuang et al. [7] and compare the results of the two approaches. We conclude with another illustration of the model in the problem of clustering financial time series.

Likelihood and Priors Distributions
Copula representation is a way to recast a multivariate distribution in such a way that the dependence structure is not influenced by the shape, the parametrization, and the unit of measurement of the marginal distributions. Their applications in statistical inferences and a review on the most popular approaches can be found in Hofert et al. [15]. In this paper we will consider several different parametric forms of copula functions: In particular, in the bivariate case, we will use the standard Archimedean families, namely the Joe, Clayton, Gumbel, and Frank copulae. For more than two dimensions, we will concentrate on the use of the most popular elliptical versions, namely the Gaussian and Student's t copulae. Since the main objective of the paper is the clusterization of the dependence structure, for the sake of simplicity and without a loss of generality , we will assume that all marginal distributions are known or, equivalently, their parameters have been previously estimated. In this way, we can directly work with the transformed variables: Let c i (·|ψ i ) be the generic copula density function associated with the i-th group . The statistical model can be stated as follows: where m denotes the number of groups or clusters. Set the following: and assume the following.
In the previous expressions, b i and B i , respectively, denote the lower and the upper bound of the parameter space of the corresponding ψ i , and γ i is the mapping of ψ i into the real axis; d i is the dimension of i-th group, and a is a hyperparameter, which we typically set to 1, although different values can be used. In general, the Archimedean copulae are parametrized in terms of Kendall's Tau, for which its range of values has been restricted to (0, 1) for the Clayton, Joe, and Gumbel copulae, while it is set to (−1, 1) for the Frank copula. In the elliptical case, the Gaussian copula is parametrized in terms of the correlation coefficient ρ, which ranges in (−1, 1); finally, Student's t copula has the additional parameter ν, and that is the number of degrees of freedom: A discrete uniform prior on {1, 2, . . . , 35} has been used here. When dimension d of the specific group is larger than two, we restrict the analysis to elliptical copulae with an equi-correlation matrix: in that case, it is well known that the range of the correlation parameter is (−1/(d − 1), 1).
Let U be entire observed sample and let U ijk be the k-th observation of i-th component in the j-th group, and let n j be the number of observation in the j-th group. The posterior distribution on the parameter vector (γ, ξ, α, τ) is then described as follows: where γ = (γ 1 , γ 2 , . . . , γ m ) and α = (α 1 , α 2 , . . . , α m ).
The complex form of the posterior distribution requires the use of simulation based methods of inference. In particular, we will adapt the algorithm of Bhattacharya et al. [14] with a minor modification for the updates of γ and the shrinkage location ξ. Following,Bhattacharya et al. [14], we introduce a vector β = β 1 , β 2 , . . . , β m ∈ R m in order to have a latent variable representation of the γ prior; then, the following is obtained.
Here, we briefly describe the algorithm. Start the chain at time 0 by drawing a sample from the prior. At time t, we use the following updating procedure: 1.
In previous statements, Cauchy(a, b) denotes a one-dimensional Cauchy distribution with location a and scale b, while GIG(p, a, b) is the generalized inverse Gaussian distribution with the following density function.
Notice that IG(a, b) is the inverse Gaussian distribution, and it is known that Finally, δ γ and δ ξ are scalar tuning parameters. In the case of the Student's t copula, we need to add another step between stride 1 and 2 in order to update ν = (ν 1 , ν 2 , . . . , ν m ): Compute the following.

Prior Distribution of ξ
The choice of the prior distribution for the shrinkage location ξ needs some explanation. First of all, notice that, according to our prior specification, . . , m} implies a standard logistic density for ξ.

Previous Work
Apart form the prior specification, the model described in previous sections is the one proposed by Zhuang et al. [7]. We restrict our discussion to the case where each copula expression has one parameter only. Their prior can be stated as follows.
There is no unique choice for the distributions of (σ 2 , λ, δ), although the authors suggest using weakly informative priors, for example, inverse gamma densities with small hyperparameters values or, as an alternative, an objective prior: for example, an improper uniform prior. However, one can prove that, in the second case, the posterior distribution cannot be proper no matter what the sample size is. We show this result in Appendix A. When the posterior distribution is improper, the resulting summary statistics are meaningless. In fact, the Markov Chain implied by the MCMC does not have a limiting distribution so the Ergodic theorem does not hold and the posterior is completely useless. Moreover, even the first solution is not feasible. In fact, when an improper prior produces an improper posterior, using a vague proper prior can typically hide-not solve-the problem. In these cases, in fact, as shown in Berger [8] (p. 398), the use of a vague prior approximating an improper prior typically concentrates the posterior mass on some boundary of the parameter space.

Simulation Study
We compare the performance of our approach with the results based on a maximum likelihood approach in a simulation study. We will use a Student's t copula with an equi-correlation matrix and set the number of groups m equal to five. We repeat the procedure 100 times; at iteration j for the i-th group, we sample the true value γ T ij from a standard normal distribution, the degrees of freedom ν T ij are sampled from the prior distribution, and the dimensions d ij of the groups are sampled from the uniform discrete distribution in {1, 2, . . . , 5}. Given the parameters and dimensions of the groups, we sample 20 observations for each group. In the maximum likelihood framework, we estimate the following: and compute the standard errors.
In a Bayesian framework, we use the posterior mean as a point estimate, obtained from the use of the MCMC algorithm described above. We ran six independent chains of 2.5 × 10 5 scans, discarded the first 5 × 10 4 as a burn-in, and finally computed theγ Bay ij via the sample mean of simulation outputs for all i ∈ {1, . . . , 5}. As a tuning parameters, we set δ γ = 10 −3 and δ ξ = 10 −1 . Then, we compute the following.
Comparison are performed in terms of the corresponding mean square errors.

Real Data Applications
This section is devoted to the implementation of the method in two different applications. The first one is the same as in Zhuang et al. [7] and we include it for comparative purposes; to this end, we quantify the goodness of fit of the model using a predictive approach based on the conditional version of the Widely Applicable Information Criterion, WAIC, in a hierarchical setting, as discussed in Millar [16]. The second one deals with clustering financial time series.

Column Vertebral Data
We apply our model to the Column Vertebral Data, available at the UCI Machine Learning Repository. It consists of 60 patients with disk hernia, 150 subjects with spondylolisthesis, and 100 healthy individuals; data are available for the following variables: angle of pelvic incidence (PI), angle of pelvic tilt (PT), lumbar lordosis angle (LL), sacral slope (SS), pelvic radius (PR), and the degree of spondylolisthesis (DS). As in Zhuang et al. [7], we adopt the generalized skew-t distribution for the marginals, use a maximum likelihood estimator in order to calibrate the parameters and then transform data via the fitted cumulative distribution function. Computations were performed using the R package sgt available on CRAN. Table 2 reports the values of fitted parameters for the marginals. Following Zhuang et al. [7], we consider the same parametric copulae for the bivariate distributions of the features of interest, and for each of these, we construct our Bayesian hierarchical copula model for three groups of subjects. We run six independent chains of 2.5 × 10 6 simulations and discard the first 5 × 10 5 . We also set δ γ = 10 −3 and δ ξ = 10 −1 . We did not report any convergence issues, and the multiple Gelman-Rubin test scores for each of the six implemented models Gelman [17] were very close to the optimal value 1. In terms of the goodness of fit, we have computed the WAIC index for all six models. Our findings is that the most significant relation is the one between PI and PT. Table 3 compares the results of Zhuang et al. [7] (model A) with our ones (model B). The main difference between the results obtained with the two methods is related to the posterior uncertainty quantification. Credible intervals obtained with model B are systemically larger than those obtianed with model A. Our feeling is that it depends on the fact that results in model A are obtained by running a chain where some hyperparameters are fixed to some estimated values, as explained in Zhuang et al. [7]. Fixing values of the hyperparameters eliminates a critical source of variation, inducing shrinkage in credible intervals size.
For the ease of comparisons, we follow Zhuang et al. [7] and report the results not in terms of parameter γ but rather according the natural parameter of each copula, that is, ρ for the Gaussian copula and θ for the Archimedean ones.

Financial Data Application
Grouping financial time series is important for diversification purposes; a portfolio manager should avoid investing in instruments with a high degree of positive dependence, and clustering procedures allow the construction of groups according to some specific risk measure. In this way, financial instruments that belong to the same group will show a certain degree of association; however, the strength of dependence within groups may well be different in different groups. It is then important to assess the strength of the association for each single cluster, and a method to perform this is to use a hierarchical structure, such as the one discussed in this paper.
As a risk measure, we consider the so-called tail index, which measures the strength of dependence between two variables when one of them takes extremely low values. Following De Luca and Zuccolotto [18], we construct a dissimilarity measure based on the lower tail coefficient. Let (Y 1 , Y 2 ) be a bivariate random vector; the lower tail coefficient λ L of (Y 1 , Y 2 ) is defined as follow: or, equivalently, where C(·, ·) is the cumulative distribution function of the copula associated to (Y 1 , Y 2 ). In order to estimate λ L , we use the empirical estimator discussed in [19]: whereĈ(·, ·) is the empirical copula, and n is the sample size. The dissimilarity measure is then defined as follows.
The preliminary clustering procedure has been implemented using a complete linkage method. Notice that a bivariate lower tail coefficient is not the unique method for modeling dependence on extreme low values: Durante et al. [20] proposed a conditioned correlation coefficient estimated using a nonparametric approach; Fuchs et al. [21] analyzed dissimilarity measure applicable to a multivariate lower tail coefficient. We consider the "S&P 500 Full Dataset" available at Kaggle: It contains more relevant information for the components of S&P 500. We take the daily closing prices from 5 June 2000 to 5 June 2020 and discard instruments without a complete record for this period. Then, we restrict our analysis to 379 components. For all of them, we computed the log-returns by taking log-differences and filter data by fitting; for each time series, an ARMA(1,1)GJR-GARCH(1,1) model with Student's t innovations was used; then, we extracted residuals and transformed them via the fitted cumulative distribution function in order to obtain pseudo-data. Computations were performed using the CRAN package rugarch. Hence, we compute the empirical estimator of the lower tail coefficient for any possible pair and the dissimilarity measure associated and use them to feed the clustering algorithm. Due to computational complexities, we used the coarsest partition under the constraint that the largest group must have at most 10 components. We obtained 30 groups with dimensions of more than one and discarded instruments that belong to groups with only one component. The final number of instruments was thus reduced to 93.
We ran the MCMC algorithm described above for the 30 clusters, performing 12 independent chains of 10 5 scans and discarding the first 1.5 × 10 4 as they burned in. Tuning parameters were set to δ γ = 10 −6 , δ ξ = 10 −3 . Moreover, in this example, we did not report any convergence issues, and the Gelman-Rubin test score was 1.02. For each scan and for any group, we compute the lower tail coefficient via the following formula: where T ν (·) is the univariate cumulative distribution function of a Student's t random variable with ν degrees of freedom. The copula used in this example was a Student's t copula with an equi-correlation matrix: As a consequence, we obtained a single value for the lower tail coefficient for each cluster. Table 4 reports the results for each pair that belongs to the same group. Finally, we report the estimation results.

Conclusions
We discussed and improved a fully Bayesian analysis for a hierarchical copula model proposed in Zhuang et al. [7]. We proposed the use of a proper prior, which is able to induce shrinkage and, at the same time, dependence among different clusters of observations. This prior does not mimic the behavior of an improper prior and is better suited for objectively representing information coming from the data. Our prior belongs to the large family of globa-local shrinkage densities, with an extra stage in the hierarchy, due to the absence of a significant shrinkage value; we experienced that this approach is very effective and useful in the case of parametric copulae depending on a single parameter. In a more general situation, this approach needs to be modified, and this can be easily accommodated.
Finally, we presented an application in a financial context, where the goal was to estimate the lower tail coefficient of several financial time series in a parametric way using the Student's t copula.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
Here, we show that the prior proposed in Zhuang et al. [7] leads to an improper posterior. The statistical model consists of m d-dimensional copulae governing different sets of observations. U 1i , U 2i , . . . , Let γ i = η i g i (θ i ); here, η i is a scaling parameter that can be considered known. One-to-one mapping functions g i (·) are needed to put all dependence parameters on the real line. Zhuang et al. [7] made the following assumptions.
Hyper-parameters σ i 's, λ, and δ 2 are given a suitable prior distribution. For the moment, we do not specify the priors and set the following.
The next proposition shows that, using standard noninformative priors for scale and location parameters, the resulting posterior will be improper independently of the sample size.
Consider only the following: and setμ = 1 m m ∑ i=1 µ i ; then, we obtain the following.
For any choice of m > 1, π(µ) can be written as follows.
Now, we compute the following.
Notice that the following is the case: µ i : then, we obtain the following.
(µ i −μ) 2 is a convex parabolic function of µ m , and by the Weierstrass theorem, a global maximum exists for all bounded and closed sets. By integrating µ m , one obtains the following.
For the same argument, one can also see that the following obtains. A similar argument can be used to prove the following result.