Triple the gamma -- A unifying shrinkage prior for variance and variable selection in sparse state space and TVP models

Time-varying parameter (TVP) models are very flexible in capturing gradual changes in the effect of a predictor on the outcome variable. However, particularly when the number of predictors is large, there is a known risk of overfitting and poor predictive performance, since the effect of some predictors is constant over time. We propose a prior for variance shrinkage in TVP models, called the triple gamma prior. The triple gamma prior encompasses a number of priors that have been suggested previously, such as the Bayesian Lasso, the double gamma prior and the Horseshoe prior. We present the desirable properties of such a prior and its relationship to Bayesian model averaging for variance selection. The features of the triple gamma prior are then illustrated in the context of time-varying parameter vector autoregressive models, both for simulated datasets and for a series of macroeconomic variables in the Euro Area.


Introduction
Model selection in a high-dimensional setting is a common challenge in statistical and econometric inference. The introduction of Bayesian model averaging (BMA) techniques in the statistical literature [1][2][3] has led to many interesting applications, see, among others, [4][5][6][7] for early references in econometrics.
Predictor selection for possibly very high-dimensional regression problems through shrinkage priors is an attractive alternative to BMA, which relies on discrete mixture priors; see Bhadra et al. [8] for an excellent review. There is a vast and growing literature on shrinkage priors for regression problems that focuses on three aspects: first, how to choose sensible priors for high-dimensional model selection problems in a Bayesian framework; second, how to design efficient algorithms to cope with the associated computational challenges; and third, how such priors perform in high-dimensional problems, both from a theoretical and a practical viewpoint.
A striking duality exists in this very active area between Bayesian and traditional approaches. For many shrinkage priors, the mode of the posterior distribution obtained in a Bayesian analysis can be regarded as a point estimate from a regularization approach, see Fahrmeir et al. [9] and Polson and Scott [10]. One such example is the popular Lasso [11], which is equivalent to a double-exponential shrinkage prior in a Bayesian context [12]. However, the two approaches differ when it comes to selecting penalty parameters that impact the sparsity of the solution. One advantage of the Bayesian framework in this context is that the penalty parameters are considered to be unknown hyperparameters which can be learned from the data. Such "global-local" shrinkage priors [13] adjust to the overall degree of sparsity that is required in a specific application through a global shrinkage parameter and separate signal from noise through local, individual shrinkage parameters.
While predictor selection through shrinkage priors in regression models is addressed in a vast literature, the use of shrinkage priors for more general econometric models for time series analysis, such as state space models and time-varying parameter (TVP) models, is less well studied in comparison. Sparsity in the context of such models refers to the presence of a few large variances among many (nearly) zero variances in the latent state processes that drive the observed time series data. A common goal in this setting is to recover a few dynamic states, driven by such a state space model, among many (nearly) constant coefficients. As shown by Frühwirth-Schnatter and Wagner [14], this variance selection problem can be cast into a variable selection problem in the non-centered parametrization of a state space model. Once this link has been established, shrinkage priors that are known to perform well in high-dimensional regression problems can be applied to variance selection in state space models, as demonstrated for the Lasso [15] and the normal-gamma [16,17].
Despite this already existing variety, we introduce a new shrinkage prior for variance selection in sparse state space and TVP models in the present paper, called triple gamma prior as it has a representation involving three gamma distributions. This prior can be related to various shrinkage priors that were found to be useful for high-dimensional regression problems such as the generalized beta mixture prior [18] and contains the popular Horseshoe prior [19,20] as a special case. Furthermore, the half-t and the half Cauchy [21,22], suggested as robust alternatives to the inverse gamma distribution for variance parameters in hierarchical models, as well as the Lasso and the double gamma, are special cases of the triple gamma. In this context, the triple gamma can also be regarded as an extension of the scaled beta2 distribution [23].
Among Bayesian shrinkage priors, usually a clear distinction is made between two-group mixture or spike-and-slab priors and continuous shrinkage priors, of which the triple gamma is a special case. An important contribution of the present paper is to show that the triple gamma provides a bridge between these two approaches and has the following property, which is favourable both in sparse and dense situations. One of the hyperparameters allows high concentration over the region in the shrinkage profile that is relevant for shrinking noise, while the other hyperparameter allows high concentration over the region that prevents overshrinking of signals. This leads to a behaviour of the triple gamma prior that very much resembles Bayesian model averaging based on discrete spike-and-slab priors, with a strong prior concentration at the corner solutions where some of the variances are very close to zero. While this is reminiscent of the Horseshoe prior, the shrinkage profile induced by the triple gamma is more flexible than that of a Horseshoe. Thanks to the estimation of the hyperparameters, it is not constrained to be symmetric around one half, enabling adaptation to varying degrees of sparsity in the data.
The triple gamma prior also scores well from a computational perspective. While exploring the full posterior distribution for spike-and-slab priors leads to computational challenges due to the combinatorial complexity of the model space, Bayesian inference based on Markov chain Monte Carlo (MCMC) methods is straightforward for continuous shrinkage priors, exploiting their Gaussian-scale mixture representation [17,24]. An extension of these schemes to the triple gamma prior is fairly straightforward.
We will study the empirical performance of the triple gamma for a challenging setting in econometric time series analysis, namely for time-varying parameter vector autoregressive models with stochastic volatility (TVP-VAR-SV models). Since the influential paper of Primiceri [25] (see Del Negro and Primiceri [26] for a corrigendum), this model has become a benchmark for analyzing relationships between macroeconomic variables that evolve over time, see Nakajima [27], Koop and Korobilis [28], Eisenstat et al. [29], Chan and Eisenstat [30], Feldkircher et al. [31] and Carriero et al. [32], among many others. Due to the high dimensionality of the time-varying parameters, even for moderately sized systems, shrinkage priors such as the triple gamma prior are instrumental for efficient inference.
The rest of the paper is organized as follows. In Section 2, we define the triple gamma prior and discuss some of its properties. The close relationship between the triple gamma and spike-and-slab priors applied in a BMA context is investigated in Section 3.2. Section 4 introduces an efficient MCMC scheme and Section 5 provides applications to TVP-VAR-SV models. Section 6 concludes the paper.

Motivation and definition
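Let us recall the state space form of a TVP model. For $t = 1, \ldots, T$, we have that

$$y_t = x_t \beta_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_t^2), \qquad \beta_t = \beta_{t-1} + w_t, \quad w_t \sim \mathcal{N}_d(0, Q), \tag{1}$$

where $Q = \mathrm{Diag}(\theta_1, \ldots, \theta_d)$, $y_t$ is a univariate response variable, $x_t = (x_{t1}, \ldots, x_{td})$ is a $d$-dimensional row vector containing the regressors at time $t$, with $x_{t1}$ corresponding to the intercept, and the initial value follows a normal distribution, $\beta_0 \sim \mathcal{N}_d(\beta, Q)$, with initial mean $\beta = (\beta_1, \ldots, \beta_d)^\top$. Model (1) can be rewritten equivalently in the non-centered parametrization introduced in Frühwirth-Schnatter and Wagner [14] as

$$\tilde\beta_t = \tilde\beta_{t-1} + \tilde w_t, \quad \tilde w_t \sim \mathcal{N}_d(0, I_d), \qquad y_t = x_t \beta + x_t \, \mathrm{Diag}(\sqrt{\theta_1}, \ldots, \sqrt{\theta_d}) \, \tilde\beta_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_t^2), \tag{2}$$

with $\tilde\beta_0 \sim \mathcal{N}_d(0, I_d)$. The error variance in the observation equation is either homoscedastic ($\sigma_t^2 \equiv \sigma^2$ for all $t = 1, \ldots, T$) or follows a stochastic volatility (SV) specification [33], where the log volatility $h_t = \log \sigma_t^2$ follows an AR(1) process. Specifically,

$$h_t \mid h_{t-1}, \mu, \phi, \sigma_\eta^2 \sim \mathcal{N}\left(\mu + \phi (h_{t-1} - \mu), \sigma_\eta^2\right). \tag{3}$$

To motivate the triple gamma prior, let us recall that, in TVP models, shrinkage priors are placed on each scale parameter $\sqrt{\theta_j}$, $j = 1, \ldots, d$, in order to shrink dynamic coefficients to static ones, hence avoiding overfitting. One such prior is the double gamma prior, employed recently by [17] for shrinkage of variances. The double gamma prior can be expressed as a scale mixture of gamma distributions, with the following hierarchical representation:

$$\theta_j \mid \xi_j^2 \sim \mathcal{G}\left(\frac{1}{2}, \frac{1}{2\xi_j^2}\right), \qquad \xi_j^2 \mid a^\xi, \kappa_B^2 \sim \mathcal{G}\left(a^\xi, \frac{a^\xi \kappa_B^2}{2}\right). \tag{4}$$

In the double gamma prior, each innovation variance $\theta_j$ is mixed over its own $\xi_j^2$, each of which has an independent gamma distribution, with a common hyperparameter $\kappa_B^2$. Moreover, the parameters $\xi_j^2$ play the role of local (component-specific) shrinkage parameters, while the parameter $\kappa_B^2$ is a (common) global shrinkage parameter.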
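The equivalence of the centered and the non-centered parametrization can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it assumes a random walk state equation with diagonal innovation covariance, Gaussian homoscedastic errors, and standard normal regressors; the function name `simulate_tvp` and all numerical values are hypothetical and not taken from the paper.

```python
import math
import random

def simulate_tvp(T, beta, theta, sigma2=1.0, seed=1):
    """Simulate one TVP regression path in both parametrizations.

    beta  : initial means beta_1, ..., beta_d
    theta : innovation variances theta_1, ..., theta_d
    Returns (y_centered, y_noncentered, centered_states), all built
    from the same random shocks.
    """
    rng = random.Random(seed)
    d = len(beta)
    root = [math.sqrt(th) for th in theta]
    # Non-centered state: standard normal initial value and increments.
    bt = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y_c, y_nc, states = [], [], []
    for _ in range(T):
        bt = [b + rng.gauss(0.0, 1.0) for b in bt]
        # Centered state recovered as beta_j + sqrt(theta_j) * tilde(beta)_jt.
        b_cent = [beta[j] + root[j] * bt[j] for j in range(d)]
        # Regressors: intercept first, then standard normal draws (assumption).
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
        eps = rng.gauss(0.0, math.sqrt(sigma2))
        y_c.append(sum(x[j] * b_cent[j] for j in range(d)) + eps)
        y_nc.append(sum(x[j] * beta[j] for j in range(d))
                    + sum(x[j] * root[j] * bt[j] for j in range(d)) + eps)
        states.append(b_cent)
    return y_c, y_nc, states

# Sparse example: only the second coefficient is truly time-varying.
y_c, y_nc, states = simulate_tvp(T=200, beta=[1.0, -0.5, 0.0],
                                 theta=[0.0, 0.04, 0.0])
max_gap = max(abs(u - v) for u, v in zip(y_c, y_nc))
```

In this sketch a variance $\theta_j = 0$ makes the corresponding coefficient exactly constant at $\beta_j$, which is the sense in which variance selection in the centered form becomes variable selection on $\sqrt{\theta_j}$ in the non-centered form.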
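We propose an extension of the double gamma prior to a triple gamma prior, where another layer is added to the hierarchy:

$$\theta_j \mid \xi_j^2 \sim \mathcal{G}\left(\frac{1}{2}, \frac{1}{2\xi_j^2}\right), \qquad \xi_j^2 \mid a^\xi, \kappa_j^2 \sim \mathcal{G}\left(a^\xi, \frac{a^\xi \kappa_j^2}{2}\right), \qquad \kappa_j^2 \mid c^\xi, \kappa_B^2 \sim \mathcal{G}\left(c^\xi, \frac{c^\xi}{\kappa_B^2}\right). \tag{5}$$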

The main difference with the double gamma prior is that the $\xi_j^2$ are not identically distributed; each one depends on its own component-specific parameter $\kappa_j^2$. Prior (5) contains many well-known shrinkage priors as special cases, as will be discussed in Section 2.3.
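To make the shrinkage behaviour of the triple gamma prior more apparent, we will work with representations that involve the scale parameter $\sqrt{\theta_j}$ rather than the variance $\theta_j$, using the fact that $\theta_j \mid \xi_j^2 \sim \xi_j^2 \chi_1^2$ follows a re-scaled $\chi_1^2$-distribution. If we consider both the positive and the negative root of $\theta_j$, then we obtain

$$\sqrt{\theta_j} \mid \xi_j^2 \sim \mathcal{N}(0, \xi_j^2), \qquad \xi_j^2 \mid a^\xi, \kappa_j^2 \sim \mathcal{G}\left(a^\xi, \frac{a^\xi \kappa_j^2}{2}\right), \qquad \kappa_j^2 \mid c^\xi, \kappa_B^2 \sim \mathcal{G}\left(c^\xi, \frac{c^\xi}{\kappa_B^2}\right). \tag{6}$$

Hence, prior (5) corresponds to $\sqrt{\theta_j}$ following the so-called normal-gamma-gamma prior considered by Griffin and Brown [16] in the context of defining hierarchical shrinkage priors for regression models.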
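To allow shrinkage of dynamic coefficients toward fixed, but significant ones, we extend Bitto and Frühwirth-Schnatter [17] further by assuming such a normal-gamma-gamma prior on the fixed parameters $\beta_1, \ldots, \beta_d$:

$$\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2), \qquad \tau_j^2 \mid a^\tau, \lambda_j^2 \sim \mathcal{G}\left(a^\tau, \frac{a^\tau \lambda_j^2}{2}\right), \qquad \lambda_j^2 \mid c^\tau, \lambda_B^2 \sim \mathcal{G}\left(c^\tau, \frac{c^\tau}{\lambda_B^2}\right). \tag{7}$$

In Section 2.4, we will discuss hierarchical versions of both priors, obtained by putting a hyperprior on the parameters $\kappa_B^2$, $\lambda_B^2$, $a^\xi$, $a^\tau$, $c^\xi$, and $c^\tau$.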

Properties of the triple gamma prior
It will be shown in Theorem 1 that the triple gamma prior is a global-local shrinkage prior in the sense of Polson and Scott [10], where the local shrinkage parameters arise from the $\mathcal{F}(2a^\xi, 2c^\xi)$ distribution. This representation allows us to relate the triple gamma to the well-known Horseshoe prior, see Section 2.3.
Furthermore, a closed form of the marginal shrinkage prior $p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi)$ is given in Theorem 1, which is proven in Appendix A.
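Theorem 1. For the triple gamma prior defined in (5), with $a^\xi > 0$ and $c^\xi > 0$, the following holds:
(a) It has the following representation as a local-global shrinkage prior:

$$\sqrt{\theta_j} \mid \psi_j^2, \kappa_B^2 \sim \mathcal{N}\left(0, \frac{2}{\kappa_B^2} \psi_j^2\right), \qquad \psi_j^2 \mid a^\xi, c^\xi \sim \mathcal{F}(2a^\xi, 2c^\xi). \tag{8}$$

(b) With $\phi^\xi = 2c^\xi / (\kappa_B^2 a^\xi)$, the marginal prior of $\sqrt{\theta_j}$ is available in closed form:

$$p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi) = \frac{\Gamma(c^\xi + \frac{1}{2})}{\sqrt{2\pi\phi^\xi} \, B(a^\xi, c^\xi)} \, U\left(c^\xi + \frac{1}{2}, \frac{3}{2} - a^\xi, \frac{\theta_j}{2\phi^\xi}\right), \tag{9}$$

where $U(a, b, z)$ is the confluent hypergeometric function of the second kind:

$$U(a, b, z) = \frac{1}{\Gamma(a)} \int_0^\infty e^{-zt} t^{a-1} (1 + t)^{b-a-1} \, dt.$$

In Figure 1 we can see the marginal prior distribution of $\sqrt{\theta_j}$ under the triple gamma prior with $a^\xi = c^\xi = 0.1$ and under other well-known shrinkage priors which are special cases of the triple gamma, see Table 1. Using Bitto and Frühwirth-Schnatter [17, Footnote 3], Theorem 1 also allows us to give a closed form for the prior $p(\theta_j \mid \phi^\xi, a^\xi, c^\xi) = p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi) / \sqrt{\theta_j}$.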
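The global-local representation with local parameters from the $F(2a^\xi, 2c^\xi)$ distribution can be illustrated by simulation. The sketch below is a hedged illustration, not the authors' code: it assumes that prior (5) is the three-layer hierarchy $\xi_j^2 \mid \kappa_j^2 \sim G(a^\xi, a^\xi\kappa_j^2/2)$, $\kappa_j^2 \sim G(c^\xi, c^\xi/\kappa_B^2)$ in shape and rate form, and that the global scale in the F-representation is $2/\kappa_B^2$. The function names are hypothetical. Because the prior has very heavy tails, the two samplers are compared through the median of $|\sqrt{\theta_j}|$ rather than through moments.

```python
import math
import random
import statistics

def draw_sqrt_theta_hier(a, c, kappa2_B, rng):
    """One draw of sqrt(theta_j) via the assumed three-layer gamma hierarchy."""
    # random.gammavariate takes (shape, scale), so rates are inverted.
    kappa2_j = rng.gammavariate(c, kappa2_B / c)        # G(c, rate c / kappa2_B)
    xi2_j = rng.gammavariate(a, 2.0 / (a * kappa2_j))   # G(a, rate a * kappa2_j / 2)
    return rng.gauss(0.0, math.sqrt(xi2_j))

def draw_sqrt_theta_F(a, c, kappa2_B, rng):
    """One draw via the global-local form with psi2_j ~ F(2a, 2c)."""
    # F(2a, 2c) written as a ratio of two scaled gamma variates.
    psi2_j = (rng.gammavariate(a, 1.0) / a) / (rng.gammavariate(c, 1.0) / c)
    return rng.gauss(0.0, math.sqrt(2.0 / kappa2_B * psi2_j))

rng = random.Random(42)
n = 20000
# a = c = 1/2 with kappa2_B = 2 is the Horseshoe special case with tau^2 = 1.
a, c, kappa2_B = 0.5, 0.5, 2.0
med_hier = statistics.median(
    abs(draw_sqrt_theta_hier(a, c, kappa2_B, rng)) for _ in range(n))
med_F = statistics.median(
    abs(draw_sqrt_theta_F(a, c, kappa2_B, rng)) for _ in range(n))
```

Under these assumptions the two medians should agree up to Monte Carlo error, since both samplers target the same marginal prior.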
Global-local shrinkage priors are typically compared in terms of the concentration around the origin and the tail behaviour. For the triple gamma prior $p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi)$, the two shape parameters $a^\xi$ and $c^\xi$ play a crucial role in this respect, see Theorem 2, which is proven in Appendix A.
Theorem 2. The triple gamma prior (9) satisfies the following:
(a) For $0 < a^\xi < 0.5$ and small values of $\sqrt{\theta_j}$,
(b) For $a^\xi = 0.5$ and small values of $\sqrt{\theta_j}$, where $\psi(\cdot)$ is the digamma function.
(c) For $a^\xi > 0.5$,
From Theorem 2, Parts (a) and (b), we find that the triple gamma prior $p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi)$ has a pole at the origin if $a^\xi \leq 0.5$. According to Part (a), the pole is more pronounced the closer $a^\xi$ gets to 0. For $a^\xi > 0.5$, we find from Part (c) that $p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi)$ is bounded at zero by a positive upper bound which is finite, as long as $0 < c^\xi < \infty$. Part (d) shows that the triple gamma prior $p(\sqrt{\theta_j} \mid \phi^\xi, a^\xi, c^\xi)$ has polynomial tails, with the shape parameter $c^\xi$ controlling the tail index. Prior moments $E((\sqrt{\theta_j})^k \mid \phi^\xi, a^\xi, c^\xi)$ exist up to $k < 2c^\xi$. Hence, the triple gamma prior has no finite moments for $c^\xi < 1/2$.
Finally, additional useful representations of the triple gamma prior as a global-local shrinkage prior are summarized in Lemma 1, which is proven in Appendix A. Representation (a) shows that the triple gamma is an extension of the double gamma prior where the Gaussian prior $\sqrt{\theta_j} \mid \xi_j^2 \sim \mathcal{N}(0, \xi_j^2)$ is substituted by a heavier-tailed Student-t prior, making the prior more robust to large values of $\sqrt{\theta_j}$. Representations (b) and (c) will be useful for MCMC inference in Section 4. Representations (c) and (d) show that for a triple gamma prior with finite $a^\xi$ and $c^\xi$, $\phi^\xi$ acts as a global shrinkage parameter, in addition to $2/\kappa_B^2$.
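Lemma 1. For $a^\xi > 0$ and $c^\xi > 0$, the triple gamma prior (5) has the following alternative representations: Additional representations for $0 < a^\xi < \infty$ and where $\mathcal{BP}(a^\xi, c^\xi)$ is the beta-prime distribution.¹

¹ Note that the $X \sim \mathcal{BP}(a, b)$-distribution has pdf $f(x) = \frac{1}{B(a, b)} x^{a-1} (1 + x)^{-(a+b)}$ for $x > 0$.

to variance selection in state space and TVP models. This is evident from rewriting (5) as $\xi_j^2 \sim \mathcal{G}(a^\xi, \lambda_j)$, $\lambda_j \sim \mathcal{G}(c^\xi, \phi^\xi)$. We exploit this relationship in Section 3.1 to investigate the shrinkage profile of a triple gamma prior. Using Armagan et al. [18, Definition 2], the triple gamma prior can be written as

$$\sqrt{\theta_j} \mid \rho_j \sim \mathcal{N}\left(0, \frac{1}{\rho_j} - 1\right), \qquad \rho_j \mid a^\xi, c^\xi, \phi^\xi \sim \mathcal{TPB}(a^\xi, c^\xi, \phi^\xi), \tag{14}$$

where $\mathcal{TPB}(a^\xi, c^\xi, \phi^\xi)$ is the three-parameter beta (TPB) distribution with density:

$$p(\rho_j) = \frac{\Gamma(a^\xi + c^\xi)}{\Gamma(a^\xi) \Gamma(c^\xi)} (\phi^\xi)^{c^\xi} \rho_j^{c^\xi - 1} (1 - \rho_j)^{a^\xi - 1} \left(1 + (\phi^\xi - 1) \rho_j\right)^{-(a^\xi + c^\xi)}. \tag{15}$$

From (14) and (15), it becomes evident that the Strawderman-Berger prior $\sqrt{\theta_j} \sim \mathcal{N}(0, 1/\rho_j - 1)$, $\rho_j \sim \mathcal{B}(1/2, 1)$ [35,36] is that special case of the triple gamma prior where $\phi^\xi = 1$, $a^\xi = 1$, and $c^\xi = 1/2$, since (15) reduces to $\rho_j \sim \mathcal{B}(c^\xi, a^\xi)$ for $\phi^\xi = 1$.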
The special case of a triple gamma where $a^\xi = c^\xi = 1/2$ corresponds to a Horseshoe prior [19,20] on $\sqrt{\theta_j}$ with global shrinkage parameter $\tau^2 = 2/\kappa_B^2$, since $\psi_j^2 \sim \mathcal{F}(1, 1)$ implies that $\psi_j \sim t_1$. The Horseshoe prior has been introduced for variable selection in regression models and has been shown to have excellent theoretical properties in this context for the "nearly black" case [37]. The triple gamma is a generalization of the Horseshoe prior, with a similar shrinkage profile, however with much more mass close to the corner solutions. Most importantly, as will be discussed in Section 3.1, this leads to a BMA-type behaviour of the triple gamma prior for small values of $a^\xi$ and $c^\xi$.
The vast literature on shrinkage priors contains many more related priors. Rescaling $\xi_j^2 = (2/\kappa_B^2) \psi_j^2$ in (8), for instance, yields a representation involving a scaled beta2 distribution,²

$$\sqrt{\theta_j} \mid \xi_j^2 \sim \mathcal{N}(0, \xi_j^2), \qquad \xi_j^2 \mid a^\xi, c^\xi, \phi^\xi \sim \mathrm{SBeta2}(a^\xi, c^\xi, \phi^\xi), \tag{16}$$

as is easily derived from (A2). The scaled beta2 was introduced by Pérez et al. [23] in hierarchical models as a robust prior for scale parameters, $\sqrt{\theta_j}$, and variance parameters, $\theta_j$, alike. Based on (16), the triple gamma can be seen as a hierarchical extension of this prior which puts a scaled beta2 distribution on the scaling parameter $\xi_j^2$ of a Gaussian prior for $\sqrt{\theta_j}$, see Table 1. Griffin and Brown [16] termed prior (16) the gamma-gamma distribution, denoted by $\mathcal{GG}(a^\xi, c^\xi, \phi^\xi)$.
For $a^\xi = 1$, the triple gamma reduces to the normal-exponential-gamma, which has a representation as a scale mixture of double exponential $\mathcal{DE}(0, \sqrt{2}\psi_j)$-distributions, see Table 1. It has been considered for variable selection in regression models [38] and locally adaptive B-spline models [39]. The R2-D2 prior suggested by Zhang et al. [40] for high-dimensional regression models is another special case of the triple gamma. In the R2-D2 prior, $a = d \, a^\tau$ and $\sigma^2$ is the residual error variance of the regression model. As shown by Zhang et al. [40], this implies the following prior for the coefficient of determination: $R^2 \sim \mathcal{B}(a, b)$, which motivates holding $a$ fixed, while $a^\tau$ decreases as $d$ increases. Using that $\phi_j \omega \sim \mathcal{G}(a^\tau, \tau)$, we can show that the R2-D2 prior is equivalent to the hierarchical normal gamma prior applied in Bitto and Frühwirth-Schnatter [17] for TVP models. The popular Dirichlet-Laplace prior, $\sqrt{\theta_j} \mid \psi_j \sim \mathcal{DE}(0, \psi_j)$, however, is not related to the triple gamma, as the prior scale $\psi_j$ rather than $\psi_j^2$ follows a gamma distribution, see again Table 1.

² The pdf of a $\mathrm{SBeta2}(a, c, \phi)$-distribution reads: $f(x) = \frac{1}{\phi \, B(a, c)} \left(\frac{x}{\phi}\right)^{a-1} \left(1 + \frac{x}{\phi}\right)^{-(a+c)}$ for $x > 0$.

Using the triple gamma for variance selection in TVP models
A challenging question is how to choose the parameters $a^\xi$, $c^\xi$ and $\kappa_B^2$ or $\phi^\xi$ of the triple gamma prior in the context of variance selection for TVP models. In addition, in a TVP context, the shrinkage parameters $a^\tau$, $c^\tau$ and $\lambda_B^2$ or $\phi^\tau = 2c^\tau / (a^\tau \lambda_B^2)$ for the prior (7) of the initial values have to be selected. In high-dimensional settings it is appealing to have a prior that addresses two major issues: first, high concentration around the origin to favor strong shrinkage of small variances toward zero; second, heavy tails to introduce robustness to large variances and to avoid over-shrinkage. For the triple gamma prior, both issues are addressed through the choice of $a^\xi$ and $c^\xi$, see Theorem 2. First of all, we need values $0 < a^\xi \leq 0.5$ to induce a pole at 0. Second, values of $0 < c^\xi < 0.5$ will lead to very heavy tails. For very small values of $a^\xi$ and $c^\xi$, the triple gamma is a proper prior that behaves nearly as the improper normal-Jeffreys prior [41], where $p(\sqrt{\theta_j}) \propto 1 / |\sqrt{\theta_j}|$.
Ideally, we would place a hyper prior distribution on all shrinkage parameters, which would allow us to learn the global and the local degree of sparsity, both for the variances and the initial values. Such a hierarchical triple gamma prior introduces dependence among the local shrinkage parameters $\xi_1^2, \ldots, \xi_d^2$ in (5) and, consequently, among $\theta_1, \ldots, \theta_d$ in the joint (marginal) prior $p(\theta_1, \ldots, \theta_d)$. Introducing such dependence is desirable in that it allows us to learn the degree of variance sparsity in TVP models, meaning that how much a variance is shrunken toward zero depends on how close the other variances are to zero. However, first naïve approaches with rather uninformative, independent priors on $\kappa_B^2$, $a^\xi$, $c^\xi$ and $\lambda_B^2$, $a^\tau$, $c^\tau$ were not met with much success and we found it necessary to carefully design appropriate hyper priors.
Hierarchical versions of the Bayesian Lasso [15] and the double gamma prior [17] in TVP models are based on the gamma prior $\kappa_B^2 \sim \mathcal{G}(d_1, d_2)$. Interestingly, this choice can be seen as a heavy-tailed extension of both priors, where each marginal density $p(\sqrt{\theta_j} \mid d_1, d_2)$ follows a triple gamma prior with the same parameter $a^\xi$ (being equal to one for the Bayesian Lasso) and tail index $c^\xi = d_1$. In light of this relationship, it is not surprising that very small values of $d_1$ were applied in these papers to ensure heavy tails of $p(\sqrt{\theta_j} \mid d_1, d_2)$. Since a triple gamma prior already has heavy tails, we choose a different hyperprior in the present paper.
For the case $a^\xi = c^\xi = 1/2$, the global shrinkage parameter $\tau$ of the Horseshoe prior typically follows a Cauchy prior, $\tau \sim t_1$ [19,42], see also Bhadra et al. [8, Section 5]. The relationship $\phi^\xi = 2/\kappa_B^2 = \tau^2$ between the various global shrinkage parameters (see Table 1) implies in this case $\phi^\xi \sim \mathcal{F}(1, 1)$ or, equivalently, $\kappa_B^2/2 \sim \mathcal{F}(1, 1)$. For a triple gamma prior with arbitrary $a^\xi$ and $c^\xi$, this is a special case of the following prior:

$$\frac{\kappa_B^2}{2} \mid a^\xi, c^\xi \sim \mathcal{F}(2a^\xi, 2c^\xi), \tag{17}$$

which will be motivated in Section 3.2. Under this prior, the triple gamma prior exhibits a BMA-like behavior with a uniform prior on an appropriately defined model size (see Theorem 3). Prior (17) admits the equivalent representations (18). Concerning $a^\xi$ and $c^\xi$, we choose the priors (19), which restrict the support of $a^\xi$ and $c^\xi$ to $(0, 0.5)$, following the insights brought to us by Theorem 2.
We follow a similar strategy for the parameters $a^\tau$, $c^\tau$ and $\lambda_B^2$ ($\phi^\tau$) of the prior (7), see (20).
An interesting special case is the "symmetric" triple gamma, where $a^\xi = c^\xi$. Despite this constraint, the favourable shrinkage behaviour is preserved, and decreasing $a^\xi = c^\xi$ toward zero simultaneously leads to a high concentration around the origin and a heavy-tailed behaviour. For a symmetric triple gamma prior, the two global shrinkage parameters are related through $\phi^\xi = 2/\kappa_B^2$, independently of $a^\xi$ and $c^\xi$. This induces shrinkage profiles that are symmetric around 1/2, see Section 3.1. Interestingly, a symmetric triple gamma resolves the question whether to choose a gamma or an inverse gamma prior for a variance parameter $\psi_j^2$. It implies the same symmetric beta-prime distribution on the variance, $\psi_j^2 \sim \mathcal{BP}(a^\xi, a^\xi)$, and the information, $(\psi_j^2)^{-1} \sim \mathcal{BP}(a^\xi, a^\xi)$, and can be represented as a gamma prior with the scale arising from an inverse gamma prior or, equivalently, as an inverse gamma prior with the scale arising from a gamma prior.

Shrinkage profiles
In the sparse normal-means problem where $y \mid \beta \sim \mathcal{N}_d(\beta, \sigma^2 I_d)$ and $\sigma^2 = 1$, the parameter $\rho_j = 1/(1 + \psi_j^2)$ appearing in (14) is known as the shrinkage factor and plays a fundamental role in comparing different shrinkage priors, as $\rho_j$ determines shrinkage toward 0.
Also in a variance selection context, it is evident from (14) that values of ρ j ≈ 0 will introduce no shrinkage on θ j , whereas values of ρ j ≈ 1 will introduce strong shrinkage of θ j toward 0. Hence, the prior p(ρ j ), also called shrinkage profile, will play an instrumental role in the behaviour of different shrinkage priors. Following Carvalho et al. [20], shrinkage priors are often compared in terms of the prior they imply on ρ j , i.e. how they handle shrinkage for small "observations" (in our case innovations) and how robust they are to large "observations". Note that we ideally want a shrinkage profile that has a pole in zero (heavy tails to avoid over-shrinking signals) and a pole in one (spikiness to shrink noise). The Horseshoe prior, e.g., implies ρ j ∼ B (1/2, 1/2) which is a shrinkage profile that takes this much desired form of a "Horseshoe", see Figure 2.
For the triple gamma prior, the shrinkage profile is given by the three-parameter beta prior $p(\rho_j)$ provided in (15). For $\phi^\xi = 1$, $\rho_j \sim \mathcal{B}(c^\xi, a^\xi)$ and $\kappa_B^2 = 2c^\xi / a^\xi$. Choosing small values $a^\xi \ll 1$ will put prior mass close to 1, choosing small values $c^\xi \ll 1$ will put prior mass close to 0, whereas values for both $a^\xi$ and $c^\xi$ smaller than one will induce the form of a Horseshoe for the prior of $\rho_j$. Evidently, for $\phi^\xi = 1$, a symmetric triple gamma prior with $a^\xi = c^\xi$ implies a Horseshoe-shaped prior for $\rho_j$ that is symmetric around 0.5. This is illustrated in Figure 2 for a symmetric triple gamma with $a^\xi = c^\xi = 0.1$.
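The shape of the shrinkage profile for $\phi^\xi = 1$ can be explored by simulation. The following sketch rests on two assumptions that are not spelled out at this point of the text: the prior variance of $\sqrt{\theta_j}$ is drawn as a beta-prime variate $BP(a^\xi, c^\xi)$, generated as a ratio of gamma variates, and the shrinkage factor is $\rho_j = 1/(1 + \text{variance})$. The helper names are hypothetical.

```python
import random

def shrinkage_factor_draws(a, c, n, seed=0):
    """Draw rho_j = 1 / (1 + psi2_j) with psi2_j ~ BP(a, c), i.e. phi = 1."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Beta-prime variate as a ratio of independent gamma variates.
        psi2 = rng.gammavariate(a, 1.0) / rng.gammavariate(c, 1.0)
        draws.append(1.0 / (1.0 + psi2))
    return draws

def corner_mass(rho, eps=0.01):
    """Share of draws within eps of the corner solutions 0 and 1."""
    return sum(r < eps or r > 1.0 - eps for r in rho) / len(rho)

n = 20000
# Symmetric case a = c = 0.1: compare with direct Beta(c, a) draws.
rho_sym = shrinkage_factor_draws(0.1, 0.1, n)
rng = random.Random(1)
rho_beta = [rng.betavariate(0.1, 0.1) for _ in range(n)]
share_below_half = sum(r < 0.5 for r in rho_sym) / n

# Asymmetric case: a small a relative to c moves mass toward rho_j = 1;
# the mean of Beta(c, a) is c / (a + c) = 0.75 for a = 0.1, c = 0.3.
rho_asym = shrinkage_factor_draws(0.1, 0.3, n)
mean_asym = sum(rho_asym) / n
```

In the symmetric case most of the prior mass sits within 0.01 of the two corners, which is the Horseshoe-like shape described above, only more extreme than for $a^\xi = c^\xi = 1/2$.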
In Figure 2 we can also see the shrinkage profile for the Bayesian Lasso and the double gamma, which correspond to a triple gamma where $c^\xi \to \infty$.³ For the Bayesian Lasso with $a^\xi = 1$ it is clear that the shrinkage profile $p(\rho_j)$ converges to a constant for $\rho_j \to 1$, while there is no mass around $\rho_j = 0$. This means that this prior tends to over-shrink signals, while not shrinking the noise completely to zero. A double gamma prior with $a^\xi < 1$ has the potential to shrink the noise completely to zero, as $p(\rho_j)$ has a pole at $\rho_j = 1$, but $p(\rho_j)$ also has zero mass around $\rho_j = 0$, meaning the prior encourages over-shrinking of signals.
When we make $\kappa_B^2$ random, we obtain a "prior density" of shrinkage profiles, see Figure 3. We can see that such hierarchical versions of the Lasso and the double gamma have shrinkage profiles that resemble the ones of the Horseshoe and the triple gamma. We have used $\kappa_B^2 \sim \mathcal{G}(0.01, 0.01)$ for the Lasso and the double gamma, $2/\kappa_B^2 \sim \mathcal{F}(1, 1)$ for the Horseshoe and $2/\kappa_B^2 \sim \mathcal{F}(0.2, 0.2)$ for the triple gamma, see Section 2.4.

³ Using (4), the prior for $\rho_j = 1/(1 + \psi_j^2)$ is obtained by the law of transformation of densities.

BMA-type behaviour
From the perspective of Bayesian model averaging (BMA), an ideal approach for handling sparsity in TVP models would be the use of discrete mixture priors as suggested in Frühwirth-Schnatter and Wagner [14],

$$p(\sqrt{\theta_j}) = (1 - \pi) \, \delta_0 + \pi \, p_{\text{slab}}(\sqrt{\theta_j}), \tag{21}$$

with $\delta_0$ being a Dirac measure at 0, while $p_{\text{slab}}(\sqrt{\theta_j})$ is the prior for non-zero variances. In terms of shrinkage profiles, the discrete mixture prior (21) has a spike at $\rho_j = 1$, with probability $1 - \pi$, and a lot of prior mass at $\rho_j = 0$, provided that the tails of $p_{\text{slab}}(\sqrt{\theta_j})$ are heavy enough. The mixture prior (21) is considered the "gold standard" in BMA, both theoretically and empirically, see e.g. Johnstone and Silverman [43]. However, MCMC inference under this prior is extremely challenging. As opposed to this, MCMC inference for the triple gamma prior is straightforward, see Section 4.
In this section, we relate the triple gamma prior to BMA based on the discrete mixture prior (21). An interesting insight is that the triple gamma prior shows a behaviour very similar to that of a discrete mixture prior if both $a^\xi$ and $c^\xi$ approach zero. This induces a BMA-type behaviour on the joint shrinkage profile $p(\rho_1, \ldots, \rho_d)$, with a spike at all corner solutions, where some $\rho_j$ are very close to one, whereas the remaining ones are very close to zero.
A very important aspect of BMA is the choice of a prior for the model dimension, $K$, see e.g. Fernández et al. [44] and Ley and Steel [45]. In the discrete mixture prior (21), the distribution of $K$ depends on the choice of $\pi$. Fixing $\pi$ corresponds to a very informative prior on the model dimension; for example, $\pi = 0.5$ assigns more prior probability to models of dimension $d/2$ and lower prior probability to empty or full models. In fact, let $\delta_j$ be the indicator that tells us if the $j$-th coefficient is included in the model; then we have that $K = \sum_{j=1}^{d} \delta_j \sim \mathrm{BiNom}(d, \pi)$. Placing a uniform prior on $\pi$ has been shown to be a good choice, since it corresponds to placing a prior on $K$ which is uniform on $\{0, \ldots, d\}$. Note that $\pi$ is a global parameter that is learned using information from all the variables and adapts to the degree of sparsity.
Following ideas in [19], we believe that a natural way to perform variable selection in the continuous shrinkage prior framework is through thresholding. Specifically, we say that when $(1 - \rho_j) > 0.5$, or $\rho_j < 0.5$, the variable is included, otherwise it is not. Notice that this classification via thresholding makes perfect sense in the case of a triple gamma, of which the Horseshoe is a special case, but less so for a Lasso or double gamma prior, even if the shrinkage profile shows a Horseshoe-like behaviour for hierarchical versions of these priors (see again Figure 3). Notice that thresholding implies a prior on the model dimension $K$. Specifically,

$$K \mid a^\xi, c^\xi, \phi^\xi \sim \mathrm{BiNom}(d, \pi_\xi), \qquad \pi_\xi = \Pr(\rho_j < 0.5 \mid a^\xi, c^\xi, \phi^\xi), \tag{22}$$

where $\rho_j \mid a^\xi, c^\xi, \phi^\xi \sim \mathcal{TPB}(a^\xi, c^\xi, \phi^\xi)$, see (15). The choice of $\phi^\xi$ (or $\kappa_B^2$) will strongly impact the prior on $K$. For a symmetric triple gamma with $a^\xi = c^\xi$, for instance, and fixed $\phi^\xi = 1$, that is $\kappa_B^2 = 2$, we obtain $K \sim \mathrm{BiNom}(d, 0.5)$, since $\pi_\xi = 0.5$ regardless of $a^\xi$. Hence, we have to face similar problems as with fixing $\pi = 0.5$ for the discrete mixture prior (21).
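How strongly a fixed global parameter drives the implied inclusion probability can be illustrated by Monte Carlo. The sketch below assumes, as a working hypothesis, that the prior variance of $\sqrt{\theta_j}$ equals $\phi^\xi$ times a $BP(a^\xi, c^\xi)$ variate and that $\rho_j$ is one over one plus this variance; the function name `inclusion_prob` and the chosen values of $\phi^\xi$ are hypothetical.

```python
import random

def inclusion_prob(a, c, phi, n=20000, seed=0):
    """Monte Carlo estimate of pi_xi = P(rho_j < 0.5) for a fixed phi."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Prior variance of sqrt(theta_j): phi times a BP(a, c) variate.
        var = phi * rng.gammavariate(a, 1.0) / rng.gammavariate(c, 1.0)
        rho = 1.0 / (1.0 + var)
        hits += rho < 0.5   # thresholding rule: include if rho_j < 0.5
    return hits / n

# Symmetric triple gamma with a = c = 0.2 and three fixed values of phi.
pi_small, pi_one, pi_large = (
    inclusion_prob(0.2, 0.2, phi) for phi in (0.01, 1.0, 100.0)
)
```

With $d$ coefficients, each estimate is the success probability of the binomial prior on $K$; fixing $\phi^\xi$ therefore pins down the expected model size $d \pi_\xi$ in advance, which is the issue the hyperprior on $\phi^\xi$ is meant to address.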
Placing a hyper prior on $\phi^\tau$ and $\phi^\xi$ or, equivalently, on $\lambda_B^2$ and $\kappa_B^2$, as we did in Section 2.4, is as instrumental for BMA-type variable and variance selection for the triple gamma prior as making $\pi$ random is for the discrete mixture prior (21). Ideally, we would like to have a uniform distribution on the model size $K$. We show in Theorem 3 that the hyperprior for $\kappa_B^2$ defined in (17) achieves exactly this goal, since $\pi_\xi$ is uniformly distributed, see Appendix A for a proof.
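Theorem 3. For a hierarchical triple gamma prior with fixed $a^\xi > 0$ and $c^\xi > 0$, the probability $\pi_\xi$ defined in (22) follows a uniform distribution, $\pi_\xi \sim \mathcal{U}[0, 1]$, under the hyper prior

$$\phi^\xi \mid a^\xi, c^\xi \sim \mathcal{BP}(c^\xi, a^\xi),$$

or, equivalently, under the hyper prior

$$\frac{\kappa_B^2}{2} \mid a^\xi, c^\xi \sim \mathcal{F}(2a^\xi, 2c^\xi).$$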
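The uniformity of the implied model size can be checked empirically. The sketch below is illustrative and rests on assumptions: the hyperprior is taken to be $\phi^\xi \sim BP(c^\xi, a^\xi)$, the conditional prior variance of $\sqrt{\theta_j}$ is taken to be $\phi^\xi$ times a $BP(a^\xi, c^\xi)$ variate, and inclusion is defined by $\rho_j < 0.5$. The function name and the values $d = 4$, $a^\xi = 0.3$, $c^\xi = 0.2$ are hypothetical.

```python
import random
from collections import Counter

def model_size_draws(d, a, c, n, seed=0):
    """Draw the model size K under the hierarchical prior (assumed form)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        # Hyperprior: phi ~ BP(c, a), drawn as a ratio of gamma variates.
        phi = rng.gammavariate(c, 1.0) / rng.gammavariate(a, 1.0)
        k = 0
        for _ in range(d):
            # Conditional on phi, the d shrinkage factors are independent.
            var = phi * rng.gammavariate(a, 1.0) / rng.gammavariate(c, 1.0)
            k += (1.0 / (1.0 + var)) < 0.5
        sizes.append(k)
    return sizes

d = 4
sizes = model_size_draws(d, a=0.3, c=0.2, n=20000)
counts = Counter(sizes)
freq = {k: counts.get(k, 0) / len(sizes) for k in range(d + 1)}
```

If the assumptions hold, each of the $d + 1$ possible model sizes should appear with relative frequency close to $1/(d+1)$, even though $a^\xi \neq c^\xi$ in this example.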

MCMC algorithm
Let $y = (y_1, \ldots, y_T)$ be the vector of time series observations and let $z$ be the set of all latent variables and unknown model parameters in a TVP model. Moreover, let $z_{-x}$ denote the set of all unknowns but $x$. Bayesian inference based on MCMC sampling from the posterior $p(z \mid y)$ is summarized in Algorithm 1. The hierarchical priors introduced in Section 2.4 are employed, where $(a^\tau, c^\tau, \lambda_B^2)$ follow (20), $(a^\xi, c^\xi)$ follow (19), and $\kappa_B^2$ follows (17). For certain sampling steps, the hierarchical representation (18) is used for $\kappa_B^2$, and similarly for $\lambda_B^2$.
Algorithm 1 extends several existing algorithms such as the MCMC schemes introduced for the Horseshoe prior by Makalic and Schmidt [24] and for the double gamma prior by Bitto and Frühwirth-Schnatter [17]. We exploit various representations of the triple gamma prior given in Lemma 1 and choose representation (12) as the baseline representation of our MCMC algorithm, with $\phi^\tau = 2c^\tau / (\lambda_B^2 a^\tau)$ and $\phi^\xi = 2c^\xi / (\kappa_B^2 a^\xi)$. All conditional distributions in our MCMC scheme are available in closed form, except the ones for $a^\xi$, $c^\xi$, $a^\tau$ and $c^\tau$, for which we resort to a Metropolis-Hastings (MH) step within Gibbs. Several conditional distributions are the same as for the double gamma prior, and for these we apply Algorithm 1 of Bitto and Frühwirth-Schnatter [17]. We provide more details on the derivation of the various densities in Appendix B.

Model
Consider an $m$-dimensional time series, $Y_1, \ldots, Y_T$. The joint dynamics of such time series can be modeled through a time-varying parameter vector autoregressive model with stochastic volatility (TVP-VAR-SV). Since the influential paper of Primiceri [25] (see Del Negro and Primiceri [26] for a corrigendum), this model has become a benchmark for analyzing relationships between macroeconomic variables that evolve over time, see Nakajima [27], Koop and Korobilis [28], Eisenstat et al. [29], Chan and Eisenstat [30], Feldkircher et al. [31] and Carriero et al. [32], among many others. A TVP-VAR-SV model of order $p$ can be expressed as follows:

$$Y_t = c_t + \Phi_{1,t} Y_{t-1} + \ldots + \Phi_{p,t} Y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}_m(0, \Sigma_t),$$

where $c_t$ is the $m$-dimensional intercept, $\Phi_{j,t}$, for $j = 1, \ldots, p$, is an $m \times m$ matrix of time-varying coefficients, and $\Sigma_t$ is the time-varying variance covariance matrix of the error term. The TVP-VAR-SV model can be written in a more compact notation as

$$Y_t = (I_m \otimes X_t) \beta_t + \varepsilon_t,$$

where $X_t = (Y_{t-1}^\top, \ldots, Y_{t-p}^\top, 1)$ is a row vector of length $mp + 1$ and $\beta_t = (\beta_t^1, \ldots, \beta_t^m)^\top$, where $\beta_t^i = (\Phi_{1,t}^{i\bullet}, \ldots, \Phi_{p,t}^{i\bullet}, c_{t,i})^\top$. Here, $\Phi_{j,t}^{i\bullet}$ denotes the $i$-th row of the matrix $\Phi_{j,t}$ and $c_{t,i}$ denotes the $i$-th element of $c_t$.
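Following Bitto and Frühwirth-Schnatter [17], we use an $LDL^\top$ decomposition of the time-varying covariance matrix, that is $\Sigma_t = A_t D_t A_t^\top$, where $D_t$ is a diagonal matrix and $A_t$ is a lower unitriangular matrix, see also [32]. We denote by $a_{ij,t}$ the element in the $i$-th row and $j$-th column of $A_t$, and by $d_{i,t}$ the $i$-th diagonal element of $D_t$. In total, we have $m(m-1)/2 + m(mp+1)$ (potentially) time-varying parameters. Using the $LDL^\top$ decomposition, we can rewrite the system as:

$$Y_t = (I_m \otimes X_t) \beta_t + A_t \eta_t, \qquad \eta_t \sim \mathcal{N}_m(0, D_t).$$

This, in turn, allows us to write

$$y_{1,t} = X_t \beta_t^1 + \eta_{1,t}, \qquad y_{2,t} = X_t \beta_t^2 + a_{21,t} \eta_{1,t} + \eta_{2,t}, \qquad \ldots$$

Generalizing, for the $i$-th equation we have that

$$y_{i,t} = X_t \beta_t^i + \sum_{j=1}^{i-1} a_{ij,t} \eta_{j,t} + \eta_{i,t}, \qquad \eta_{i,t} \sim \mathcal{N}(0, d_{i,t}),$$

with independent error terms $\eta_{i,t}$ across equations. In practice, the $i$-th equation of the system can be written as a TVP regression model where the residuals of the preceding $i - 1$ equations have been added as "regressors". The time-varying regression parameters are assumed to follow a random walk, specifically

$$\beta_{j,t}^i = \beta_{j,t-1}^i + v_{ij,t}, \quad v_{ij,t} \sim \mathcal{N}(0, \theta_{ij}^\beta), \quad \text{for } i = 1, \ldots, m, \text{ and } j = 1, \ldots, mp + 1,$$

$$a_{ij,t} = a_{ij,t-1} + w_{ij,t}, \quad w_{ij,t} \sim \mathcal{N}(0, \theta_{ij}^a), \quad \text{for } i = 1, \ldots, m, \text{ and } j = 1, \ldots, i - 1,$$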
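The decomposition into a lower unitriangular factor and a diagonal matrix can be sketched in a few lines. The code below is a generic textbook $LDL^\top$ routine applied to a single, fixed covariance matrix; it is not the authors' sampler, the example matrix is made up for illustration, and the function names are hypothetical. The helper `n_tvp_parameters` simply evaluates the parameter count $m(m-1)/2 + m(mp+1)$ stated above.

```python
def ldl_decompose(S):
    """Return (A, D) with S = A diag(D) A^T and A lower unitriangular.

    S is a symmetric positive definite matrix given as a list of rows.
    """
    m = len(S)
    A = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    D = [0.0] * m
    for j in range(m):
        # Diagonal entry: remove what earlier columns already explain.
        D[j] = S[j][j] - sum(A[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, m):
            A[i][j] = (S[i][j]
                       - sum(A[i][k] * A[j][k] * D[k] for k in range(j))) / D[j]
    return A, D

def reconstruct(A, D):
    """Multiply the factors back together: A diag(D) A^T."""
    m = len(D)
    return [[sum(A[i][k] * D[k] * A[j][k] for k in range(m))
             for j in range(m)] for i in range(m)]

def n_tvp_parameters(m, p):
    """Free elements of A_t plus VAR coefficients including intercepts."""
    return m * (m - 1) // 2 + m * (m * p + 1)

# Illustrative covariance matrix for m = 3 series (hypothetical values).
Sigma = [[4.0, 2.0, 0.4],
         [2.0, 2.0, 0.7],
         [0.4, 0.7, 1.5]]
A, D = ldl_decompose(Sigma)
Sigma_back = reconstruct(A, D)
```

For this example the factors are $A = ((1,0,0),(0.5,1,0),(0.1,0.5,1))$ and $D = \mathrm{Diag}(4, 1, 1.21)$. With $m = 3$ and $p = 2$, `n_tvp_parameters(3, 2)` gives 24 potentially time-varying parameters, which shows how quickly the dimension grows even for small systems.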
with initial values $\beta_{j,0}^i \sim \mathcal{N}(\beta_{ij}^\beta, \theta_{ij}^\beta)$ and $a_{ij,0} \sim \mathcal{N}(\beta_{ij}^a, \theta_{ij}^a)$. Here, $\beta_{j,t}^i$ denotes the $j$-th element of the vector $\beta_t^i$. Shrinkage priors are then employed row-wise, for the initial expectations $\beta_{ij}^\beta$ and $\beta_{ij}^a$ as well as the variances $\theta_{ij}^\beta$ and $\theta_{ij}^a$. To allow for greater flexibility in the prior structure, the $\beta_{ij}^\beta$ and $\beta_{ij}^a$ are assumed to follow independent shrinkage priors, and similarly for $\theta_{ij}^\beta$ and $\theta_{ij}^a$, see prior (30), where $x = \beta$ for the VAR coefficients and $x = a$ for the elements of $A$. Following Section 2.4, the priors for the global shrinkage parameters are those in (31).

Finally, we assume that the idiosyncratic shocks $\eta_{i,t} \sim \mathcal{N}(0, d_{i,t})$ follow an SV model as in (3), with row-specific parameters. Specifically, letting $h_{i,t} = \log d_{i,t}$, the logarithms of the elements of the diagonal matrix $D_t$ follow independent AR(1) processes:

$$h_{i,t} = \mu_i + \phi_i\, (h_{i,t-1} - \mu_i) + \nu_{i,t}, \qquad \nu_{i,t} \sim \mathcal{N}(0, \sigma_{\eta,i}^2).$$

Here, $\mu_i$ is the mean of the $i$-th log-volatility, $\phi_i$ is the equation-specific persistence parameter, and $\sigma_{\eta,i}^2$ is the variance of the $i$-th log-volatility.
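The state dynamics described above can be simulated in a few lines. The following Python sketch assumes a random walk whose initial value and increments share the same variance $\theta$, and a centered AR(1) parameterization $h_t = \mu + \phi (h_{t-1} - \mu) + \nu_t$ for the log-volatility. The function names, the stationary initialization of $h_0$ and all numerical values are assumptions made for illustration only.

```python
import numpy as np

def simulate_random_walk(T, beta_mean, theta, rng):
    """States for t = 0..T: start ~ N(beta_mean, theta), increments ~ N(0, theta)."""
    sd = np.sqrt(theta)
    start = beta_mean + sd * rng.normal()
    steps = sd * rng.normal(size=T)
    return start + np.concatenate([[0.0], np.cumsum(steps)])

def simulate_log_vol(T, mu, phi, sigma2_eta, rng):
    """Log-volatilities h_0..h_T from a centered AR(1) process."""
    h = np.empty(T + 1)
    # Assumption: h_0 is drawn from the stationary distribution of the AR(1)
    h[0] = mu + np.sqrt(sigma2_eta / (1.0 - phi ** 2)) * rng.normal()
    for t in range(1, T + 1):
        h[t] = mu + phi * (h[t - 1] - mu) + np.sqrt(sigma2_eta) * rng.normal()
    return h

rng = np.random.default_rng(2)
T = 200
static_path = simulate_random_walk(T, beta_mean=0.5, theta=0.0, rng=rng)    # theta = 0: constant
dynamic_path = simulate_random_walk(T, beta_mean=0.5, theta=0.01, rng=rng)  # theta > 0: time-varying
d = np.exp(simulate_log_vol(T, mu=-1.0, phi=0.95, sigma2_eta=0.05, rng=rng))  # variances d_t
```

Setting `theta=0.0` collapses the path to the constant `beta_mean`, which is the static case that variance shrinkage is meant to detect; if in addition `beta_mean=0.0`, the coefficient is insignificant altogether.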

A brief sketch of the TVP-VAR-SV MCMC algorithm
Our algorithm exploits the aforementioned unitriangular decomposition to estimate the model parameters equation by equation. Due to the prior structure introduced in (30), the estimation of the $\beta_t^i$ and the $a_{ij,t}$ is separated into two blocks, with the algorithm cycling through the equations, alternating between sampling $\beta_t^i$ conditional on $\Sigma_t$ and sampling the $a_{ij,t}$ and $d_{i,t}$ conditional on the VAR coefficients $\beta_t^i$. Given a set of initial values, the algorithm repeats the following steps:

Algorithm 2. MCMC inference for TVP-VAR-SV models under the triple gamma prior
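Choose starting values for all global and local shrinkage parameters in prior (30) for each equation and repeat the following steps:

For $i = 1, \ldots, m$:

(a) Conditional on $A_t$ and $D_t$, create $\tilde{y}_{i,t} = y_{i,t} - \sum_{j=1}^{i-1} a_{ij,t}\, \eta_{j,t}$ and use Algorithm 1 (sans the step for the variance of the observation equation) on the TVP regression $\tilde{y}_{i,t} = X_t \beta_t^i + \eta_{i,t}$ to sample the time-varying coefficients $\beta_t^i$, for $t = 0, \ldots, T$, from the respective conditional posteriors, as well as the corresponding initial expectations, process variances and local and global shrinkage parameters.

(b) Conditional on $\beta_t^i$, create $\check{y}_{i,t} = y_{i,t} - X_t \beta_t^i$ and use Algorithm 1 on the TVP regression $\check{y}_{i,t} = \sum_{j=1}^{i-1} a_{ij,t}\, \eta_{j,t} + \eta_{i,t}$ (where the residuals from the previous $i - 1$ equations are used as regressors) to sample the volatilities $d_{i,t}$ and the time-varying coefficients of $A_t$ in row $i$, $a_{ij,t}$, for $t = 0, \ldots, T$, from the respective conditional posteriors, as well as the initial expectations and process variances $\beta_{ij}^a$ and $\theta_{ij}^a$ and the local and global shrinkage parameters $\check{\tau}_{ij}^{a,2}$, $\check{\lambda}_{ij}^{a,2}$, $\check{\xi}_{ij}^{a,2}$, $\check{\kappa}_{ij}^{a,2}$, $\lambda_{B,i}^{a,2}$ and $\kappa_{B,i}^{a,2}$.

In the following applications, we run our algorithm for $M = 200000$ iterations, discarding the first 100000 iterations as burn-in, and then keeping every 100th draw.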
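The data transformations that feed the two blocks of each sweep can be sketched as follows in Python with NumPy. This is not an implementation of Algorithm 1 or of the sampler itself: it only illustrates, under the assumption that the $i$-th equation reads $y_{i,t} = X_t \beta_t^i + \sum_{j<i} a_{ij,t} \eta_{j,t} + \eta_{i,t}$, how the pseudo-responses for the coefficient block and for the covariance block are built from current draws. The function name and array layouts are hypothetical.

```python
import numpy as np

def pseudo_responses(y, X, beta, A):
    """Build the per-equation pseudo-responses used in one sweep.

    y: (T, m) data, X: (T, k) regressors, beta: (T, m, k) current coefficient draws,
    A: (T, m, m) current lower unitriangular draws.
    Returns (eta, y_coef, y_cov), each of shape (T, m):
      y_coef[:, i] = y_i minus the cross-equation term (response for sampling beta^i)
      y_cov[:, i]  = y_i minus X beta^i (response for sampling a_ij and d_i)
    """
    T, m = y.shape
    eta = np.zeros((T, m))
    y_coef = np.zeros((T, m))
    y_cov = np.zeros((T, m))
    for i in range(m):
        fitted = (X * beta[:, i, :]).sum(axis=1)           # X_t beta^i_t
        cross = (A[:, i, :i] * eta[:, :i]).sum(axis=1)     # sum_{j<i} a_ij,t eta_j,t
        y_coef[:, i] = y[:, i] - cross
        y_cov[:, i] = y[:, i] - fitted
        eta[:, i] = y[:, i] - fitted - cross               # residual, reused for later rows
    return eta, y_coef, y_cov

# Number of retained draws under the run length, burn-in and thinning stated in the text
M, burn_in, thin = 200_000, 100_000, 100
n_kept = (M - burn_in) // thin
```

Because row $i$ only needs the residuals of rows $1, \ldots, i-1$, the loop can proceed equation by equation, which is what makes the unitriangular decomposition convenient here.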

Illustrative example with simulated data
To illustrate the merit of our methodology in the context of TVP-VAR-SV models, we simulate data from two TVP-VAR-SV models with $T = 200$ points in time, $p = 1$ lag and $m = 7$ equations, with varying degrees of sparsity. In the dense regime, approximately 30% of the values of $\beta$ and $\theta$ (here referring to the means of the initial states and the variances of the innovations as defined in Section 2, respectively) are truly zero, while in the sparse regime approximately 90% are truly zero. We show results for the triple gamma prior, the Horseshoe prior, the double gamma prior and the Lasso. Regarding the priors on the hyperparameters, we use prior (31) with $\alpha_{a^\tau} = \alpha_{c^\tau} = \alpha_{a^\xi} = \alpha_{c^\xi} = 1$ and $\beta_{a^\tau} = \beta_{c^\tau} = \beta_{a^\xi} = \beta_{c^\xi} = 6$ for the triple gamma. The probability density function of the corresponding beta prior is monotonically increasing, with a maximum at 0.5. This prior places positive mass in a neighborhood of the Horseshoe, but allows for more flexibility. In practice, placing a prior on the spike and slab parameters of the triple gamma, instead of fixing them to 0.5 as in the Horseshoe, allows us to learn the shrinkage profile from the data. Moreover, since the spike and the slab parameters are allowed to be different, the shrinkage profile can be asymmetric and adapt to the sparseness of the data.
We assume that the global shrinkage parameters $\lambda_{B,i}^{\beta,2}$, $\lambda_{B,i}^{a,2}$, and $\kappa_{B,i}^{a,2}$ follow an $F(1, 1)$ distribution for the Horseshoe prior, which corresponds to the prior in [19], and a $\mathcal{G}(0.001, 0.001)$ distribution for the Lasso and the double gamma prior, as suggested in [15] and [17]. Concerning the spike parameters $a_i^{\tau,a}$, $a_i^{\xi,a}$, $a_i^{\tau,\beta}$, and $a_i^{\xi,\beta}$ of the double gamma, we employ a rescaled beta prior to force them to be smaller than 0.5. Specifically, we use a $\mathcal{B}(4, 6)$ prior which places most of its mass between 0.05 and 0.4, a range that [17] have found to induce desirable shrinkage characteristics.

Figure 5 shows the posterior path against time for a constant non-significant parameter, that is, one for which $\theta_{ij}^\beta = 0$ and $\beta_{ij}^\beta = 0$ for all times, in the sparse regime. The entire set of states for the triple gamma prior can be found in Appendix C. Note that, while the zero line is contained in the 95% posterior credible interval for all priors, said interval is narrower under the triple gamma prior and the double gamma prior than under the Lasso and the Horseshoe prior. However, the light tails of the double gamma prior, like those of the Lasso, can over-shrink weak signals.
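The statement that the rescaled $\mathcal{B}(4, 6)$ prior on the double gamma spike parameters places most of its mass between 0.05 and 0.4 can be verified with a short calculation. The Python sketch below assumes that "rescaled" means the spike parameter equals $0.5 \cdot Z$ with $Z \sim \mathcal{B}(4, 6)$, and uses the standard identity $I_x(a, b) = P(\mathrm{Bin}(a + b - 1, x) \geq a)$ for integer shape parameters, so that only the standard library is needed. The function names are hypothetical.

```python
from math import comb

def beta_cdf_integer(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b,
    using I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(a, n + 1))

def rescaled_beta_mass(lo, hi, a, b, scale=0.5):
    """Prior mass on [lo, hi] for scale * Z with Z ~ Beta(a, b)."""
    return beta_cdf_integer(hi / scale, a, b) - beta_cdf_integer(lo / scale, a, b)

# Mass that the rescaled B(4, 6) prior puts on the interval [0.05, 0.4]
mass = rescaled_beta_mass(0.05, 0.4, 4, 6)
```

Under this reading of the rescaling, `mass` comes out at roughly 0.99, in line with the description in the text.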
The above statement becomes clearer when looking at the posterior inclusion probabilities. We calculate the posterior inclusion probabilities based on the thresholding approach introduced in Section 3.2, comparing the fully unrestricted triple gamma prior to widely used special cases.

[Figure 6: posterior inclusion probabilities; panels labeled TG (triple gamma), HS (Horseshoe), DG (double gamma) and LS (Lasso).]

Modeling area macroeconomic and financial variables in the Euro Area
Our application investigates a subset of the area-wide model of the European Union of [48], which comprises quarterly macroeconomic data spanning from 1970 to 2017. We include 7 of the variables present in the dataset, namely real output (YER), prices (YED), short-term interest rate (STN), investment (ITR), consumption (PCR), exchange rate (EEN) and unemployment (URX). A more detailed description of the data and the transformations performed to make the time series stationary can be found in Table A2 in Appendix D. To stay in line with the literature, e.g. [49], we estimate a TVP-VAR-SV model with $p = 2$ lags on all endogenous variables. The hyperparameter choices are the same as in Section 5.3. As in the example with simulated data, we run the algorithm for $M = 200000$ iterations, discarding the first 100000 iterations as burn-in, and then keeping every 100th draw.

Figures 7 and 8 display the posterior inclusion probabilities for the means of the initial states and the innovation variances of the VAR coefficients, respectively. A few things about Figure 7 are striking. First, the posterior inclusion probabilities on the diagonal, meaning those belonging to the parameter of each equation's own autoregressive term, appear to be the highest, while off-diagonal elements are more likely to be excluded. Second, the equation for the short-term interest rate is characterized by a large number of parameters with a high inclusion probability, across all priors. Third, the first lag tends to have higher posterior inclusion probabilities than the second lag, which is in line with the literature. Finally, the triple gamma prior often has either the largest or the smallest posterior inclusion probability among the priors considered, which reflects the large amount of prior mass placed near the shrinkage factors $\rho_{ij}^\beta = 1$ and $\rho_{ij}^\beta = 0$ of $\beta_{ij}^\beta$, as illustrated in Section 3.
This BMA-like behavior yields a prior that tends to be more decisive when it comes to inclusion decisions. Now, we shift our focus to the posterior inclusion probabilities for the $\theta_{ij}^\beta$ plotted in Figure 8. Compared to the means of the initial states, almost all inclusion probabilities are essentially zero, with virtually only the triple gamma picking up (faint) signals, in particular with respect to the equations for the financial variables in the model, namely interest rate and nominal exchange rate. This lack of variability is unsurprising, as it is well known (see, e.g., [49]) that stochastic volatility in a TVP-VAR model for macroeconomic variables can explain a large part of the variability in the data. Despite this, the triple gamma, thanks to its heavy tails, is still capable of picking up weak signals in the data that the other shrinkage priors we considered are not able to distinguish from noise.
Given that the triple gamma tends to include more time variation than the other priors, overfitting might be considered a concern. However, Figures 9 and 10 put these fears to rest. They display the posterior median of $\beta_{ij}^\beta$ and $\sqrt{\theta_{ij}^\beta}$, respectively. Here the triple gamma can be seen to be quite conservative, both in terms of which parameters to include and in terms of their magnitude. In particular, the medians of the $\sqrt{\theta_{ij}^\beta}$ are interesting, as they are closest to zero under the triple gamma prior, despite having the highest posterior inclusion probabilities among all considered priors, pointing towards the triple gamma's ability to pick up even small signals with a higher degree of confidence than other priors.
In Figures A3 and A4 in Appendix D, all the posterior paths of $\Phi_{1,t}$ and $\Phi_{2,t}$ under the triple gamma prior are shown.

Conclusion
In the present paper, shrinkage for time-varying parameter (TVP) models was investigated within a Bayesian framework with the goal of automatically reducing time-varying parameters to static ones if the model is overfitting. This goal was achieved by suggesting the triple gamma prior as a new shrinkage prior for the process variances of varying coefficients, extending previous work using spike-and-slab priors, the Bayesian Lasso, or the double gamma prior. The triple gamma prior is related to the normal-gamma-gamma prior applied for variable selection in highly structured regression models [16]. It contains the well-known Horseshoe prior as a special case; however, it is more flexible, with two shape parameters that control concentration at zero and the tail behaviour. This leads to a BMA-type behaviour which allows not only variance shrinkage, but also variance selection.
In our application, we considered time-varying parameter VAR models with stochastic volatility. Overall, our findings suggest that the family of triple gamma priors introduced in this paper for sparse TVP models is successful in avoiding overfitting, if coefficients are, indeed, static or even insignificant. The framework developed in this paper is very general and holds the promise to be useful for introducing sparsity in other TVP and state space models in many different settings. Nevertheless, a number of extensions seem to be worth pursuing.
In particular, in ultra-sparse settings, modifications seem sensible. Currently, the hyperprior for the global shrinkage parameter of the triple gamma prior is selected in such a way that it implies a uniform prior on "model size". A generalization of Theorem 3 would allow choosing hyperpriors that induce higher sparsity. Furthermore, in the variable selection literature, special priors such as the Horseshoe+ [50] were suggested for very sparse, ultra-high dimensional settings. Exploiting once more the non-centered parametrization of a state space model, it is straightforward to extend this prior to variance selection through a hierarchical representation. We leave both extensions for future research. An important limitation of our approach is that shrinking a variance toward zero implies that a coefficient is fixed over the entire observation period of the time series. In future research we will investigate dynamic shrinkage priors [51][52][53] where coefficients can be both fixed and dynamic.
Author Contributions: The authors contributed equally to the work.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A Proofs
Proof of Theorem 1. To prove Part (a), rewrite prior (6) in the following way by rescaling $\xi_j^2$ and $\kappa_j^2$: and use the fact that in (A1) the random variable $\psi_j^2 = \xi_j^2 / \kappa_j^2$ follows the F-distribution: where $p(\psi_j^2)$ is given by: This yields (8).