Triple the Gamma—A Unifying Shrinkage Prior for Variance and Variable Selection in Sparse State Space and TVP Models

Cadonna, Annalisa; Frühwirth-Schnatter, Sylvia; Knaus, Peter

doi:10.3390/econometrics8020020

by Annalisa Cadonna, Sylvia Frühwirth-Schnatter ^* and Peter Knaus

Department of Finance, Accounting and Statistics, WU Vienna University of Economics and Business, 1020 Vienna, Austria

^* Author to whom correspondence should be addressed.

Econometrics 2020, 8(2), 20; https://doi.org/10.3390/econometrics8020020

Submission received: 9 December 2019 / Revised: 6 April 2020 / Accepted: 29 April 2020 / Published: 20 May 2020

(This article belongs to the Special Issue Bayesian and Frequentist Model Averaging)

Abstract

Time-varying parameter (TVP) models are very flexible in capturing gradual changes in the effect of explanatory variables on the outcome variable. However, in particular when the number of explanatory variables is large, there is a known risk of overfitting and poor predictive performance, since the effect of some explanatory variables is constant over time. We propose a new prior for variance shrinkage in TVP models, called triple gamma. The triple gamma prior encompasses a number of priors that have been suggested previously, such as the Bayesian Lasso, the double gamma prior and the Horseshoe prior. We present the desirable properties of such a prior and its relationship to Bayesian model averaging for variance selection. The features of the triple gamma prior are then illustrated in the context of time-varying parameter vector autoregressive models, both for a simulated dataset and for a series of macroeconomic variables in the Euro Area.

Keywords:

Bayesian model averaging; horseshoe prior; lasso prior; sparsity; stochastic volatility; triple gamma prior; VAR models

1. Introduction

Model selection in a high-dimensional setting is a common challenge in statistical and econometric inference. The introduction of Bayesian model averaging (BMA) techniques in the statistical literature (Brown et al. 2002; Cottet et al. 2008; Raftery et al. 1997) has led to many interesting applications, see, among others, (Frühwirth-Schnatter and Tüchler 2008; Kleijn and van Dijk 2006; Koop and Potter 2004; Sala-i-Martin et al. 2004) for early references in econometrics.

Selecting explanatory variables for possibly very high-dimensional regression problems through shrinkage priors is an attractive alternative to BMA, which relies on discrete mixture priors; see Bhadra et al. (2019) for an excellent review. There is a vast and growing literature on shrinkage priors for regression problems that focuses on the following aspects: first, how to choose sensible priors for high-dimensional model selection problems in a Bayesian framework; second, how to design efficient algorithms to cope with the associated computational challenges; and third, how such priors perform in high-dimensional problems, both from a theoretical and a practical viewpoint.

A striking duality exists in this very active area between Bayesian and traditional approaches. For many shrinkage priors, the mode of the posterior distribution obtained in a Bayesian analysis can be regarded as a point estimate from a regularization approach, see Fahrmeir et al. (2010) and Polson and Scott (2012a). One such example is the popular Lasso (Tibshirani 1996) which is equivalent to a double-exponential shrinkage prior in a Bayesian context (Park and Casella 2008). However, the two approaches differ when it comes to selecting penalty parameters that impact the sparsity of the solution. One advantage of the Bayesian framework in this context is that the penalty parameters are considered to be unknown hyperparameters which can be learned from the data. Such “global-local” shrinkage priors (Polson and Scott 2011) adjust to the overall degree of sparsity that is required in a specific application through a global shrinkage parameter and separate signal from noise through local, individual shrinkage parameters.

While the inclusion of potentially many explanatory variables through shrinkage priors in regression models is addressed in a vast literature, the use of shrinkage priors for more general econometric models in time series analysis, such as state space models and time-varying parameter (TVP) models, is less well-studied in comparison. Sparsity in the context of such models refers to the presence of a few large variances among many (nearly) zero variances in the latent state processes that drive the observed time series data. A common goal in this setting is to recover a few dynamic states, driven by such a state space model, among many (nearly) constant coefficients. As shown by Frühwirth-Schnatter and Wagner (2010), this variance selection problem can be cast into a variable selection problem in the non-centered parametrization of a state space model. Once this link has been established, shrinkage priors that are known to perform well in high-dimensional regression problems can be applied to variance selection in state space models, as demonstrated for the Lasso (Belmonte et al. 2014) and the normal-gamma (Bitto and Frühwirth-Schnatter 2019; Griffin and Brown 2017).

Despite this already existing variety, the present paper introduces a new shrinkage prior for variance selection in sparse state space and TVP models, called the triple gamma prior, as it has a representation involving three gamma distributions. This prior can be related to various shrinkage priors that were found to be useful for high-dimensional regression problems, such as the generalized beta mixture prior (Armagan et al. 2011), and contains the popular Horseshoe prior (Carvalho et al. 2009, 2010) as a special case. Furthermore, the half-t and the half-Cauchy (Gelman 2006; Polson and Scott 2012b), suggested as robust alternatives to the inverse gamma distribution for variance parameters in hierarchical models, as well as the Lasso and the double gamma, are special cases of the triple gamma. In this context, the triple gamma can also be regarded as an extension of the scaled beta2 distribution (Pérez et al. 2017).

Among Bayesian shrinkage priors, usually a clear distinction is made between two-group mixture or spike-and-slab priors and continuous shrinkage priors, of which the triple gamma is a special case. An important contribution of the present paper is to show that the triple gamma provides a bridge between these two approaches and has the following property, which is favourable both in sparse and dense situations: one of the hyperparameters allows high concentration over the region in the shrinkage profile that is relevant for shrinking noise, while the other hyperparameter allows high concentration over the region that prevents overshrinking of signals. This allows the triple gamma prior to exhibit behavior that very much resembles Bayesian model averaging based on discrete spike-and-slab priors, with a strong prior concentration at the corner solutions where some of the variances are nearly zero. While this is reminiscent of the Horseshoe prior, the shrinkage profile induced by the triple gamma is more flexible than that of a Horseshoe. Thanks to the estimation of the hyperparameters, it is not constrained to be symmetric around one half, enabling adaptation to varying degrees of sparsity in the data.

The triple gamma prior also scores well from a computational perspective. While exploring the full posterior distribution for spike-and-slab priors leads to computational challenges due to the combinatorial complexity of the model space, Bayesian inference based on Markov chain Monte Carlo (MCMC) methods is straightforward for continuous shrinkage priors, exploiting their Gaussian-scale mixture representation (Bitto and Frühwirth-Schnatter 2019; Makalic and Schmidt 2016). An extension of these schemes to the triple gamma prior is fairly straightforward.

We will study the empirical performance of the triple gamma for a challenging setting in econometric time series analysis, namely for time-varying parameter vector autoregressive models with stochastic volatility (TVP-VAR-SV models). Since the influential paper of Primiceri (2005) (see Del Negro and Primiceri (2015) for a corrigendum), this model has become a benchmark for analyzing relationships between macroeconomic variables that evolve over time, see Nakajima (2011), Koop and Korobilis (2013), Eisenstat et al. (2014), Chan and Eisenstat (2016), Feldkircher et al. (2017) and Carriero et al. (2019), among many others. Due to the high dimensionality of the time-varying parameters, even for moderately sized systems, shrinkage priors such as the triple gamma prior are instrumental for efficient inference.

The rest of the paper is organized as follows. In Section 2, we define the triple gamma prior and discuss some of its properties. The close relationship between the triple gamma and spike-and-slab priors applied in a BMA context is investigated in Section 3.2. Section 4 introduces an efficient MCMC scheme and Section 5 provides applications to TVP-VAR-SV models. Section 6 concludes the paper.

2. The Triple Gamma as a Prior for Variance Parameters

2.1. Motivation and Definition

To motivate the triple gamma prior, consider the state space form of a TVP model for a univariate time series $y_{t}$. For $t = 1, \dots, T$, we have that

\begin{matrix} β_{t} = β_{t - 1} + w_{t}, w_{t} \sim N_{d} (0, Q), \\ y_{t} = x_{t} β_{t} + ε_{t}, ε_{t} \sim N (0, σ_{t}^{2}), \end{matrix}

(1)

where $Q = Diag (θ_{1}, \dots, θ_{d})$ and the initial value of the state process follows a normal distribution, $β_{0} \sim N_{d} (β, Q)$, with initial mean $β = {(β_{1}, \dots, β_{d})}^{⊤}$. The d-dimensional row vector $x_{t} = (x_{t 1}, \dots, x_{t d})$ contains the explanatory variables at time t. The variables $x_{t j}$ can be exogenous control variables and/or be equal to lagged values of $y_{t}$. Usually, one of the variables, say $x_{t 1}$, corresponds to the intercept, but an intercept need not be present. This approach can be straightforwardly adapted to the multivariate case as for the TVP-VAR-SV model that will be considered in Section 5.
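
To make the state space form (1) concrete, the following Python sketch simulates a univariate TVP regression with a constant error variance. The function name `simulate_tvp` and all numerical values are illustrative assumptions, not taken from the paper; NumPy is assumed to be available.

```python
import numpy as np

def simulate_tvp(x, beta_init, theta, sigma2, rng):
    """Simulate model (1) with constant error variance sigma2 (hypothetical helper).

    x: (T, d) regressors; beta_init: (d,) initial means beta;
    theta: (d,) innovation variances, i.e. the diagonal of Q.
    """
    T, d = x.shape
    sd_w = np.sqrt(np.asarray(theta, dtype=float))
    # Initial state: beta_0 ~ N_d(beta, Q)
    state = np.asarray(beta_init, dtype=float) + sd_w * rng.standard_normal(d)
    betas = np.empty((T, d))
    y = np.empty(T)
    for t in range(T):
        # State equation: beta_t = beta_{t-1} + w_t, w_t ~ N_d(0, Q)
        state = state + sd_w * rng.standard_normal(d)
        betas[t] = state
        # Observation equation: y_t = x_t beta_t + eps_t
        y[t] = x[t] @ state + np.sqrt(sigma2) * rng.standard_normal()
    return y, betas

rng = np.random.default_rng(1)
T = 200
x = np.column_stack([np.ones(T), rng.standard_normal(T)])  # intercept plus one regressor
# theta_1 = 0 yields a static coefficient, theta_2 > 0 a time-varying one
y, betas = simulate_tvp(x, beta_init=[1.0, -0.5], theta=[0.0, 0.04], sigma2=0.25, rng=rng)
```

Setting an innovation variance to zero freezes the corresponding coefficient at its initial mean, which is exactly the sparse situation that a variance shrinkage prior is meant to detect.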

The error variance $σ_{t}^{2}$ in the observation equation is either homoscedastic ($σ_{t}^{2} \equiv σ^{2}$ for all $t = 1, \dots, T$) or follows a stochastic volatility (SV) specification (Jacquier et al. 1994), where the log volatility $h_{t} = \log σ_{t}^{2}$ follows an AR(1) process. Specifically,

\begin{matrix} h_{t} | h_{t - 1}, μ, ϕ, σ_{η}^{2} \sim N (μ + ϕ (h_{t - 1} - μ), σ_{η}^{2}) . \end{matrix}

(2)
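
The AR(1) recursion (2) can be simulated in a few lines. This is a minimal sketch: the helper name `simulate_log_volatility`, the parameter values, and the explicit starting value `h0` are assumptions made for illustration, since the text above does not specify an initial distribution for the log volatility.

```python
import numpy as np

def simulate_log_volatility(T, mu, phi, sigma_eta, h0, rng):
    """Simulate h_1, ..., h_T from recursion (2), started at h0 (an assumed input)."""
    h = np.empty(T)
    prev = h0
    for t in range(T):
        # h_t | h_{t-1} ~ N(mu + phi (h_{t-1} - mu), sigma_eta^2)
        prev = mu + phi * (prev - mu) + sigma_eta * rng.standard_normal()
        h[t] = prev
    return h

rng = np.random.default_rng(2)
h = simulate_log_volatility(T=500, mu=-1.0, phi=0.95, sigma_eta=0.2, h0=-1.0, rng=rng)
sigma2_t = np.exp(h)  # error variances, since h_t = log sigma_t^2
```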

For Bayesian inference, priors have to be chosen for the unknown variances $θ_{1}, \dots, θ_{d}$ and the unknown initial means $β_{1}, \dots, β_{d}$. In order to shrink dynamic coefficients to static ones and, in this way, avoid overfitting, a shrinkage prior is placed on $θ_{j}$ that puts a lot of prior mass close to zero. One such prior is the double gamma prior, employed recently by Bitto and Frühwirth-Schnatter (2019). The double gamma prior can be expressed as a scale-mixture of gamma distributions, with the following hierarchical representation:

\begin{matrix} θ_{j} | ξ_{j}^{2} \sim G (\frac{1}{2}, \frac{1}{2 ξ_{j}^{2}}), ξ_{j}^{2} | a^{ξ}, κ_{B}^{2} \sim G (a^{ξ}, \frac{a^{ξ} κ_{B}^{2}}{2}) . \end{matrix}

(3)

In the double gamma prior, each innovation variance $θ_{j}$ is mixed over its own scale parameter $ξ_{j}^{2}$, each of which has an independent gamma distribution, with a common hyperparameter $κ_{B}^{2}$. Moreover, the $ξ_{j}^{2}$’s play the role of local (component specific) shrinkage parameters, while the parameter $κ_{B}^{2}$ is a (common) global shrinkage parameter.

We propose an extension of the double gamma prior to a triple gamma prior, where another layer is added to the hierarchy:

\begin{matrix} θ_{j} | ξ_{j}^{2} \sim G (\frac{1}{2}, \frac{1}{2 ξ_{j}^{2}}), ξ_{j}^{2} | a^{ξ}, κ_{j}^{2} \sim G (a^{ξ}, \frac{a^{ξ} κ_{j}^{2}}{2}), κ_{j}^{2} | c^{ξ}, κ_{B}^{2} \sim G (c^{ξ}, \frac{c^{ξ}}{κ_{B}^{2}}) . \end{matrix}

(4)

The main difference from the double gamma prior is that the prior scales of the $ξ_{j}^{2}$’s are not identical, but each $ξ_{j}^{2}$ depends on its component specific scale $κ_{j}^{2}$. We will show in Section 2.2 that the triple gamma prior can be represented as a global-local shrinkage prior in the sense of Polson and Scott (2012a) where the local shrinkage parameters $ξ_{j}^{2}$ arise from an $F (2 a^{ξ}, 2 c^{ξ})$ distribution. Hence, the triple gamma prior contains the Horseshoe prior and many other well-known shrinkage priors as special cases, as will be discussed in Section 2.3.
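
As a rough numerical illustration of hierarchy (4), the sketch below draws $θ_{j}$ through the three gamma layers and, for comparison, through the normal scale mixture with F-distributed local parameters given in (8). The function names and parameter values are assumptions; $G(\cdot, \cdot)$ is read as shape and rate, so NumPy's `scale` argument is the reciprocal of the second parameter.

```python
import numpy as np

def sample_theta_hierarchy(a, c, kappa_B2, size, rng):
    """Draw theta_j from the three gamma layers in (4) (hypothetical helper)."""
    kappa_j2 = rng.gamma(c, scale=kappa_B2 / c, size=size)   # G(c, c / kappa_B^2)
    xi_j2 = rng.gamma(a, scale=2.0 / (a * kappa_j2))         # G(a, a kappa_j^2 / 2)
    return rng.gamma(0.5, scale=2.0 * xi_j2)                 # G(1/2, 1 / (2 xi_j^2))

def sample_theta_f_mixture(a, c, kappa_B2, size, rng):
    """Draw theta_j as the square of a normal with F-distributed local variance factor."""
    psi_j2 = rng.f(2.0 * a, 2.0 * c, size=size)
    sqrt_theta = rng.standard_normal(size) * np.sqrt(2.0 / kappa_B2 * psi_j2)
    return sqrt_theta**2

rng = np.random.default_rng(3)
n = 200_000
theta_h = sample_theta_hierarchy(0.5, 0.5, kappa_B2=2.0, size=n, rng=rng)
theta_f = sample_theta_f_mixture(0.5, 0.5, kappa_B2=2.0, size=n, rng=rng)
# The prior is heavy-tailed, so compare quantiles of log(theta) rather than means
q_h = np.quantile(np.log(theta_h), [0.25, 0.5, 0.75])
q_f = np.quantile(np.log(theta_f), [0.25, 0.5, 0.75])
```

The two sets of quantiles should agree up to Monte Carlo error. Comparing on the log scale is a deliberate choice, because sample means are unreliable for a prior with such heavy tails.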

The shrinkage behaviour of the triple gamma prior becomes even more apparent when we rewrite model (1) in the non-centered parametrization introduced in Frühwirth-Schnatter and Wagner (2010):

\begin{matrix} {\tilde{β}}_{t} = {\tilde{β}}_{t - 1} + {\tilde{w}}_{t}, {\tilde{w}}_{t} \sim N_{d} (0, I_{d}), \\ y_{t} = x_{t} β + x_{t} Diag (\sqrt{θ_{1}}, \dots, \sqrt{θ_{d}}) {\tilde{β}}_{t} + ε_{t}, ε_{t} \sim N (0, σ_{t}^{2}), \end{matrix}

(5)

with ${\tilde{β}}_{0} \sim N_{d} (0, I_{d})$, where $I_{d}$ is the d-dimensional identity matrix. Both representations are equivalent and we can specify a prior either on the variances $θ_{j}$ in (1) or the scale parameters ${\sqrt{θ}}_{j}$ in (5). Using the fact that $θ_{j} / ξ_{j}^{2} \sim χ_{1}^{2}$ and the $χ_{1}^{2}$-distribution can be represented as $χ_{1}^{2} = Z_{j}^{2}$, where $Z_{j} \sim N (0, 1)$ follows a standard normal distribution, we can match prior (4) to the non-centered parametrization (5). This yields

\begin{matrix} {\sqrt{θ}}_{j} | ξ_{j}^{2} \sim N (0, ξ_{j}^{2}), ξ_{j}^{2} | a^{ξ}, κ_{j}^{2} \sim G (a^{ξ}, \frac{a^{ξ} κ_{j}^{2}}{2}), κ_{j}^{2} | c^{ξ}, κ_{B}^{2} \sim G (c^{ξ}, \frac{c^{ξ}}{κ_{B}^{2}}) . \end{matrix}

(6)

In (6), we could force ${\sqrt{θ}}_{j}$ to take on only positive values; however, we do not impose such a constraint and allow ${\sqrt{θ}}_{j}$ to take on negative values. Since the half-normal ${\sqrt{θ}}_{j} \sim N (0, ξ_{j}^{2}) I \{{\sqrt{θ}}_{j} > 0\}$ also implies that $θ_{j} \sim ξ_{j}^{2} χ_{1}^{2}$, the question arises whether the negative half is of importance. Whenever inference is performed under the non-centered parametrization (5), as is done in Section 4, restricting the prior to the positive half will lead to automatic truncation of the full conditional posterior $p ({\sqrt{θ}}_{j} | {\tilde{β}}_{0}, \dots, {\tilde{β}}_{T}, y, \cdot)$ to the positive part during MCMC sampling. If the positive and the negative mode of the marginal posterior $p ({\sqrt{θ}}_{j} | y)$ are well-separated, then this will not matter. However, if the true value of $θ_{j}$ is close or equal to zero and $p ({\sqrt{θ}}_{j} | y)$ is concentrated at zero, this truncation will introduce a bias, because the negative half is not accounted for.
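
The equivalence of the centered form (1) and the non-centered form (5), and the reason why the sign of ${\sqrt{θ}}_{j}$ is not identified from the data, can be checked numerically. The sketch below is illustrative only; all variable names and values are assumptions, and the same noise draws are reused so that the comparison is exact.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d = 50, 3
x = rng.standard_normal((T, d))
beta = np.array([1.0, 0.0, -2.0])          # initial means
sqrt_theta = np.array([0.3, 0.0, -0.1])    # signed scale parameters
eps = 0.1 * rng.standard_normal(T)

# Non-centered states: beta_tilde_0 ~ N_d(0, I_d), followed by a standard random walk
beta_tilde = np.cumsum(rng.standard_normal((T + 1, d)), axis=0)

def observations(beta, sqrt_theta, beta_tilde, x, eps):
    """y_t = x_t beta + x_t Diag(sqrt_theta) beta_tilde_t + eps_t for t = 1, ..., T."""
    return x @ beta + np.sum(x * sqrt_theta * beta_tilde[1:], axis=1) + eps

y_nc = observations(beta, sqrt_theta, beta_tilde, x, eps)

# Centered states implied by (5): beta_t = beta + sqrt_theta * beta_tilde_t
beta_t = beta + sqrt_theta * beta_tilde
y_c = np.sum(x * beta_t[1:], axis=1) + eps

# Flipping the sign of sqrt_theta_j together with beta_tilde_j leaves y unchanged
flip = np.array([-1.0, 1.0, 1.0])
y_flip = observations(beta, sqrt_theta * flip, beta_tilde * flip, x, eps)
```

Because `y_nc`, `y_c` and `y_flip` coincide, the likelihood is symmetric in the sign of each scale parameter. This is why the marginal posterior has a positive and a negative mode, and why truncating to the positive half matters when the mass sits near zero.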

Interestingly, prior (6) is related to the so-called normal-gamma-gamma prior considered by Griffin and Brown (2017) in the context of defining hierarchical shrinkage priors for regression models. This relation is helpful in choosing a prior on the fixed coefficients $β_{1}, \dots, β_{d}$. To allow shrinkage of these coefficients toward zero in a TVP model, we extend Bitto and Frühwirth-Schnatter (2019) further by assuming such a normal-gamma-gamma prior on $β_{1}, \dots, β_{d}$:

\begin{matrix} β_{j} | τ_{j}^{2} \sim N (0, τ_{j}^{2}), τ_{j}^{2} | a^{τ}, λ_{j}^{2} \sim G (a^{τ}, \frac{a^{τ} λ_{j}^{2}}{2}), λ_{j}^{2} | c^{τ}, λ_{B}^{2} \sim G (c^{τ}, \frac{c^{τ}}{λ_{B}^{2}}) . \end{matrix}

(7)

In Section 2.4, we will discuss hierarchical versions of both priors, by putting a hyperprior on the parameters $κ_{B}^{2}$, $λ_{B}^{2}$, $a^{ξ}$, $a^{τ}$, $c^{ξ}$, and $c^{τ}$.

2.2. Properties of the Triple Gamma Prior

In this section, we study the mathematical properties of the triple gamma prior. It is shown in Theorem 1 that the triple gamma prior is a global-local shrinkage prior where the local shrinkage parameters arise from the $F (2 a^{ξ}, 2 c^{ξ})$ distribution. Furthermore, a closed form of the marginal shrinkage prior $p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})$ is given in Theorem 1, which is proven in Appendix A.

Theorem 1.

For the triple gamma prior defined in (4), with $a^{ξ} > 0$ and $c^{ξ} > 0$, the following holds:

(a): It has the following representation as a local-global shrinkage prior:

$\begin{matrix} {\sqrt{θ}}_{j} | ψ_{j}^{2}, κ_{B}^{2} \sim N (0, \frac{2}{κ_{B}^{2}} ψ_{j}^{2}), ψ_{j}^{2} | a^{ξ}, c^{ξ} \sim F (2 a^{ξ}, 2 c^{ξ}) . \end{matrix}$

(8)
(b): The marginal prior $p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})$ takes the following form with $ϕ^{ξ} = \frac{2 c^{ξ}}{κ_{B}^{2} a^{ξ}}$ ,

$\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (c^{ξ} + \frac{1}{2})}{\sqrt{2 π ϕ^{ξ}} B (a^{ξ}, c^{ξ})} U (c^{ξ} + \frac{1}{2}, \frac{3}{2} - a^{ξ}, \frac{θ_{j}}{2 ϕ^{ξ}}), \end{matrix}$

(9)

where $U (a, b, z)$ is the confluent hyper-geometric function of the second kind:

$\begin{matrix} U (a, b, z) = \frac{1}{Γ (a)} \int_{0}^{\infty} e^{- z t} t^{a - 1} {(1 + t)}^{b - a - 1} d t . \end{matrix}$
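
The closed form (9) can be evaluated with SciPy's `scipy.special.hyperu`, which implements the confluent hypergeometric function of the second kind. The sketch below also cross-checks the value against a direct numerical integration of a normal density with variance $ϕ^{ξ} s$ over a beta-prime mixing variable $s$, which is the mixture form stated as representation (13) in Lemma 1. The function names and the test point are assumptions chosen for illustration.

```python
import numpy as np
from scipy import integrate, special, stats

def triple_gamma_pdf(x, a, c, phi):
    """Marginal density (9) of sqrt(theta_j), evaluated at real x != 0."""
    z = np.asarray(x, dtype=float) ** 2 / (2.0 * phi)
    log_const = (special.gammaln(c + 0.5) - 0.5 * np.log(2.0 * np.pi * phi)
                 - special.betaln(a, c))
    return np.exp(log_const) * special.hyperu(c + 0.5, 1.5 - a, z)

def mixture_pdf(x, a, c, phi):
    """Same density by integrating N(0, phi * s) over s ~ BP(a, c) numerically."""
    def integrand(s):
        return stats.norm.pdf(x, scale=np.sqrt(phi * s)) * stats.betaprime.pdf(s, a, c)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

a, c, phi = 0.3, 0.4, 1.5
p_closed = float(triple_gamma_pdf(0.7, a, c, phi))
p_mixture = mixture_pdf(0.7, a, c, phi)
```

The two numbers should agree up to quadrature error. The constant is computed on the log scale to avoid overflow in the gamma and beta functions for extreme shape parameters; the origin itself is excluded because the density has a pole there for small $a^{ξ}$.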

In Figure 1, we can see the marginal prior distribution of ${\sqrt{θ}}_{j}$ under the triple gamma prior with $a^{ξ} = c^{ξ} = 0.1$ and under other well-known shrinkage priors which are special cases of the triple gamma, see Table 1. Theorem 1 also allows us to give a closed form for the prior $p (θ_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) / {\sqrt{θ}}_{j}$.^1

Global-local shrinkage priors are typically compared in terms of the concentration around the origin and the tail behaviour. For the triple gamma prior $p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})$, the two shape parameters $a^{ξ}$ and $c^{ξ}$ play a crucial role in this respect, see Theorem 2 which is proven in Appendix A.

Theorem 2.

The triple gamma prior (9) satisfies the following:

(a): For $0 < a^{ξ} < 0.5$ and small values of ${\sqrt{θ}}_{j}$ ,

$\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (\frac{1}{2} - a^{ξ})}{\sqrt{π} {(2 ϕ^{ξ})}^{a^{ξ}} B (a^{ξ}, c^{ξ})} {(\frac{1}{{\sqrt{θ}}_{j}})}^{1 - 2 a^{ξ}} + O (1) . \end{matrix}$
(b): For $a^{ξ} = 0.5$ and small values of ${\sqrt{θ}}_{j}$ ,

$\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{1}{\sqrt{2 π ϕ^{ξ}} B (0.5, c^{ξ})} (- log θ_{j} + log (2 ϕ^{ξ}) - ψ (c^{ξ} + 0.5)) + O (| θ_{j} log θ_{j} |), \end{matrix}$

where $ψ (\cdot)$ is the digamma function.
(c): For $a^{ξ} > 0.5$ ,

$\begin{matrix} lim_{{\sqrt{θ}}_{j} \to 0} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (c^{ξ} + \frac{1}{2}) Γ (a^{ξ} - \frac{1}{2})}{\sqrt{2 π ϕ^{ξ}} Γ (c^{ξ}) Γ (a^{ξ})} . \end{matrix}$
(d): As ${\sqrt{θ}}_{j} \to \infty$ ,

$\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (c^{ξ} + \frac{1}{2}) {(2 ϕ^{ξ})}^{c^{ξ}}}{\sqrt{π} B (a^{ξ}, c^{ξ})} {(\frac{1}{{\sqrt{θ}}_{j}})}^{2 c^{ξ} + 1} [1 + O (\frac{1}{θ_{j}})] . \end{matrix}$

From Theorem 2, Parts (a) and (b), we find that the triple gamma prior $p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})$ has a pole at the origin if $a^{ξ} \leq 0.5$. According to Part (a), the pole is more pronounced the closer $a^{ξ}$ gets to 0. For $a^{ξ} > 0.5$, we find from Part (c) that $p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})$ is bounded at zero by a positive upper bound which is finite, as long as $0 < c^{ξ} < \infty$. Part (d) shows that the triple gamma prior $p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})$ has polynomial tails, with the shape parameter $c^{ξ}$ controlling the tail index. Prior moments $E ({({\sqrt{θ}}_{j})}^{k} | ϕ^{ξ}, a^{ξ}, c^{ξ})$ exist for $k < 2 c^{ξ}$. Hence, the triple gamma prior has no finite moments for $c^{ξ} < 1 / 2$.
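
The polynomial tail in Part (d) can be inspected numerically by looking at the slope of the log density against $\log {\sqrt{θ}}_{j}$. The following sketch evaluates (9) through `scipy.special.hyperu` at two moderately large arguments; the helper names, the evaluation points and the parameter values are assumptions for illustration.

```python
import numpy as np
from scipy import special

def triple_gamma_logpdf(x, a, c, phi):
    """Log of the marginal density (9) of sqrt(theta_j) at x != 0."""
    z = np.asarray(x, dtype=float) ** 2 / (2.0 * phi)
    return (special.gammaln(c + 0.5) - 0.5 * np.log(2.0 * np.pi * phi)
            - special.betaln(a, c) + np.log(special.hyperu(c + 0.5, 1.5 - a, z)))

def loglog_slope(x_lo, x_hi, a, c, phi):
    """Finite-difference slope of log p against log x between two points."""
    lp = triple_gamma_logpdf(np.array([x_lo, x_hi]), a, c, phi)
    return (lp[1] - lp[0]) / (np.log(x_hi) - np.log(x_lo))

a, c, phi = 0.3, 0.4, 1.0
# Part (d) suggests a slope close to -(2c + 1) in the tail
tail_slope = loglog_slope(10.0, 50.0, a, c, phi)
```

Under Part (d), `tail_slope` should be close to $-(2 c^{ξ} + 1)$, with a small deviation that shrinks as the evaluation points move further out. The same helper can be pointed at small arguments to look at the pole described in Part (a), where convergence to the limiting slope is slower because of the $O(1)$ remainder.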

Finally, additional useful representations of the triple gamma prior as a global-local shrinkage prior are summarized in Lemma 1 which is proven in Appendix A. Representation (a) shows that the triple gamma is an extension of the double gamma prior where the Gaussian prior ${\sqrt{θ}}_{j} | ξ_{j}^{2} \sim N (0, ξ_{j}^{2})$ is substituted by a heavier-tailed Student-t prior, making the prior more robust to large values of ${\sqrt{θ}}_{j}$. Representations (b) and (c) will be useful for MCMC inference in Section 4. Representations (c) and (d) show that for a triple gamma prior with finite $a^{ξ}$ and $c^{ξ}$, $ϕ^{ξ}$ acts as a global shrinkage parameter, in addition to $2 / κ_{B}^{2}$.

Lemma 1.

For $a^{ξ} > 0$ and $c^{ξ} > 0$, the triple gamma prior (4) has the following alternative representations:

\begin{matrix} (a) {\sqrt{θ}}_{j} | {\tilde{ξ}}_{j}^{2}, c^{ξ}, κ_{B}^{2} \sim t_{2 c^{ξ}} (0, \frac{2}{κ_{B}^{2}} {\tilde{ξ}}_{j}^{2}), {\tilde{ξ}}_{j}^{2} | a^{ξ} \sim G (a^{ξ}, a^{ξ}), \end{matrix}

(10)

\begin{matrix} (b) {\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, c^{ξ}, κ_{B}^{2} \sim t_{2 c^{ξ}} (0, \frac{2}{a^{ξ} κ_{B}^{2}} {\overset{ˇ}{ξ}}_{j}^{2}), {\overset{ˇ}{ξ}}_{j}^{2} | a^{ξ} \sim G (a^{ξ}, 1) . \end{matrix}

(11)

Additional representations for $0 < a^{ξ} < \infty$ and $0 < c^{ξ} < \infty$ based on $ϕ^{ξ} = \frac{2 c^{ξ}}{κ_{B}^{2} a^{ξ}}$ are

\begin{matrix} (c) {\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, {\overset{ˇ}{κ}}_{j}^{2}, ϕ^{ξ} \sim N (0, ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2} / {\overset{ˇ}{κ}}_{j}^{2}), {\overset{ˇ}{ξ}}_{j}^{2} | a^{ξ} \sim G (a^{ξ}, 1), {\overset{ˇ}{κ}}_{j}^{2} | c^{ξ} \sim G (c^{ξ}, 1), \end{matrix}

(12)

\begin{matrix} (d) {\sqrt{θ}}_{j} | {\tilde{ψ}}_{j}^{2}, ϕ^{ξ} \sim N (0, ϕ^{ξ} {\tilde{ψ}}_{j}^{2}), {\tilde{ψ}}_{j}^{2} | a^{ξ}, c^{ξ} \sim BP (a^{ξ}, c^{ξ}), \end{matrix}

(13)

where $BP (a^{ξ}, c^{ξ})$ is the beta-prime distribution.^2
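
The representations in Lemma 1 can be compared by simulation. The sketch below draws ${\sqrt{θ}}_{j}$ from representation (10) and from representation (12). It assumes that the second argument of $t_{2 c^{ξ}} (0, \cdot)$ is a squared scale, in analogy to the variance in $N (0, \cdot)$; the function names and parameter values are likewise assumptions.

```python
import numpy as np

def sample_rep_a(a, c, kappa_B2, size, rng):
    """Representation (10): Student-t with 2c degrees of freedom and gamma-mixed scale."""
    xi_tilde2 = rng.gamma(a, scale=1.0 / a, size=size)       # G(a, a)
    scale = np.sqrt(2.0 / kappa_B2 * xi_tilde2)              # assumed: squared-scale notation
    return scale * rng.standard_t(2.0 * c, size=size)

def sample_rep_c(a, c, kappa_B2, size, rng):
    """Representation (12): normal with variance phi * xi_check2 / kappa_check2."""
    phi = 2.0 * c / (kappa_B2 * a)
    xi_check2 = rng.gamma(a, size=size)                      # G(a, 1)
    kappa_check2 = rng.gamma(c, size=size)                   # G(c, 1)
    return np.sqrt(phi * xi_check2 / kappa_check2) * rng.standard_normal(size)

rng = np.random.default_rng(5)
n = 200_000
x_a = sample_rep_a(0.3, 0.4, 2.0, n, rng)
x_c = sample_rep_c(0.3, 0.4, 2.0, n, rng)
# Compare quantiles of log|x|, which are stable even without finite moments
q_a = np.quantile(np.log(np.abs(x_a)), [0.25, 0.5, 0.75])
q_c = np.quantile(np.log(np.abs(x_c)), [0.25, 0.5, 0.75])
```

Representation (12) only needs two independent gamma draws with unit rate and a standard normal, which is one way to see why it is convenient as a building block for sampling schemes.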

2.3. Relation of the Triple Gamma to Other Shrinkage Priors

The triple gamma prior can be related to the very active research on shrinkage priors in a Bayesian framework in various ways. On the one hand, popular priors for variance parameters introduced as robust alternatives to the inverse gamma prior are special cases of the triple gamma, see Table 1. For instance, in (8), $ψ_{j}^{2}$ converges a.s. to 1, as $a^{ξ} \to \infty$ and $c^{ξ} \to \infty$, and the triple gamma reduces to a normal distribution for ${\sqrt{θ}}_{j}$, applied for univariate TVP models (Frühwirth-Schnatter 2004) and unobserved component state space models (Frühwirth-Schnatter and Wagner 2010). For $c^{ξ} \to \infty$, $F (2 a^{ξ}, 2 c^{ξ})$ converges to the $G (a^{ξ}, a^{ξ})$ distribution and the triple gamma reduces to the Bayesian Lasso for $a^{ξ} = 1$ (Belmonte et al. 2014) and otherwise to the double gamma (Bitto and Frühwirth-Schnatter 2019) applied in sparse TVP models.

Gelman (2006) introduced the half-t and the half-Cauchy prior for variance parameters in hierarchical models, by assuming that ${\sqrt{θ}}_{j}$ follows a “folded” t-distribution, that is, a t-distribution truncated to $[0, \infty)$, see also Polson and Scott (2012b). In (10), ${\tilde{ξ}}_{j}^{2}$ converges a.s. to 1 as $a^{ξ} \to \infty$ and the triple gamma reduces to a $t_{2 c^{ξ}}$-distribution and to the Cauchy distribution for $c^{ξ} = 1 / 2$, however without being “folded”, since we allow ${\sqrt{θ}}_{j}$ to take on negative values, as explained in Section 2.1.

On the other hand, the triple gamma prior is related to popular shrinkage priors in regression models. It extends the generalized beta mixture prior introduced by Armagan et al. (2011) for variable selection in regression models,

\begin{matrix} β_{j} | ξ_{j}^{2} \sim N (0, ξ_{j}^{2}), ξ_{j}^{2} \sim G (a^{ξ}, λ_{j}), λ_{j} \sim G (c^{ξ}, ϕ^{ξ}), \end{matrix}

to variance selection in state space and TVP models. This is evident from rewriting (4) as $ξ_{j}^{2} \sim G (a^{ξ}, λ_{j}), λ_{j} \sim G (c^{ξ}, ϕ^{ξ})$. We exploit this relationship in Section 3.1 to investigate the shrinkage profile of a triple gamma prior. Using Armagan et al. (2011, Definition 2), the triple gamma prior can be written as

\begin{matrix} {\sqrt{θ}}_{j} | ρ_{j} \sim N (0, 1 / ρ_{j} - 1), ρ_{j} | a^{ξ}, c^{ξ}, ϕ^{ξ} \sim TPB (a^{ξ}, c^{ξ}, ϕ^{ξ}), \end{matrix}

(14)

where $TPB (a^{ξ}, c^{ξ}, ϕ^{ξ})$ is the three-parameter beta distribution with density:

\begin{matrix} p (ρ_{j}) = \frac{1}{B (a^{ξ}, c^{ξ})} {(ϕ^{ξ})}^{c^{ξ}} ρ_{j}^{c^{ξ} - 1} {(1 - ρ_{j})}^{a^{ξ} - 1} {(1 + (ϕ^{ξ} - 1) ρ_{j})}^{- (a^{ξ} + c^{ξ})} . \end{matrix}

(15)

From (14) and (15), it becomes evident that the Strawderman-Berger prior ${\sqrt{θ}}_{j} | ρ_{j} \sim N (0, 1 / ρ_{j} - 1)$, $ρ_{j} \sim B (1 / 2, 1)$ (Berger 1980; Strawderman 1971) is the special case of the triple gamma prior where $ϕ^{ξ} = 1$, $a^{ξ} = 1 / 2$, and $c^{ξ} = 1$.

The special case of a triple gamma, where $a^{ξ} = c^{ξ} = 1 / 2$, corresponds to a Horseshoe prior (Carvalho et al. 2009, 2010) on ${\sqrt{θ}}_{j}$ with global shrinkage parameter $τ^{2} = 2 / κ_{B}^{2}$, since $ψ_{j}^{2} \sim F (1, 1)$ implies that $ψ_{j} \sim t_{1}$. The Horseshoe prior has been introduced for variable selection in regression models and has been shown to have excellent theoretical properties in this context for the “nearly black” case (van der Pas et al. 2014). The triple gamma is a generalization of the Horseshoe prior, with a similar shrinkage profile, however with much more mass close to the corner solutions. Most importantly, as will be discussed in Section 3.1, this leads to a BMA-type behaviour of the triple gamma prior for small values of $a^{ξ}$ and $c^{ξ}$.
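
The three-parameter beta density (15) is straightforward to code. The sketch below evaluates it on the log scale, checks that it integrates to one for an arbitrary parameter choice, and evaluates the Horseshoe case $a^{ξ} = c^{ξ} = 1 / 2$ with $ϕ^{ξ} = 1$, where (15) simplifies to a $B (1 / 2, 1 / 2)$ density. The function name `tpb_pdf` and the chosen values are assumptions.

```python
import numpy as np
from scipy import integrate, special, stats

def tpb_pdf(rho, a, c, phi):
    """Three-parameter beta density (15) for the shrinkage factor rho in (0, 1)."""
    rho = np.asarray(rho, dtype=float)
    log_p = (c * np.log(phi) - special.betaln(a, c)
             + (c - 1.0) * np.log(rho) + (a - 1.0) * np.log1p(-rho)
             - (a + c) * np.log1p((phi - 1.0) * rho))
    return np.exp(log_p)

# Horseshoe case: a = c = 1/2 and phi = 1
rho_grid = np.linspace(0.05, 0.95, 19)
p_horseshoe = tpb_pdf(rho_grid, 0.5, 0.5, 1.0)
p_beta = stats.beta.pdf(rho_grid, 0.5, 0.5)

# Normalization check for an arbitrary parameter choice without endpoint poles
total, _ = integrate.quad(tpb_pdf, 0.0, 1.0, args=(2.0, 3.0, 0.7))
```

Plotting `tpb_pdf` over a grid for smaller values of both shape parameters is a quick way to see the extra mass that the triple gamma places near the corner solutions $ρ_{j} = 0$ and $ρ_{j} = 1$ relative to the Horseshoe.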

The vast literature on shrinkage priors contains many more related priors. Rescaling $ξ_{j}^{2} = (2 / κ_{B}^{2}) ψ_{j}^{2}$ in (8), for instance, yields a representation involving a scaled beta2 distribution,^3

\begin{matrix} {\sqrt{θ}}_{j} | ξ_{j}^{2} \sim N (0, ξ_{j}^{2}), ξ_{j}^{2} | a^{ξ}, c^{ξ}, ϕ^{ξ} \sim SBeta 2 (a^{ξ}, c^{ξ}, ϕ^{ξ}), \end{matrix}

(16)

as is easily derived from (A2). The scaled beta2 was introduced by Pérez et al. (2017) in hierarchical models as a robust prior for scale parameters, ${\sqrt{θ}}_{j}$, and variance parameters, $θ_{j}$, alike. Based on (16), the triple gamma can be seen as a hierarchical extension of this prior which puts a scaled beta2 distribution on the scaling parameter $ξ_{j}^{2}$ of a Gaussian prior for ${\sqrt{θ}}_{j}$, see Table 1. Griffin and Brown (2017) termed prior (16) the gamma-gamma distribution, denoted by $G G (a^{ξ}, c^{ξ}, ϕ)$.

For $a^{ξ} = 1$, the triple gamma reduces to the normal-exponential-gamma which has a representation as a scale-mixture of double exponential $DE (0, \sqrt{2} ψ_{j})$-distributions, see Table 1. It has been considered for variable selection in regression models (Griffin and Brown 2011) and locally adaptive B-spline models (Scheipl and Kneib 2009). The R2-D2 prior suggested by Zhang et al. (2017) for high-dimensional regression models is another special case of the triple gamma. It reads

\begin{matrix} β_{j} \sim N (0, σ^{2} ϕ_{j} ω), (ϕ_{1}, \dots, ϕ_{d}) \sim D (a^{τ}, \dots, a^{τ}), ω \sim G (a, τ), τ \sim G (b, 1), \end{matrix}

where $a = d a^{τ}$ and $σ^{2}$ is the residual error variance of the regression model. As shown by Zhang et al. (2017), this implies the following prior for the coefficient of determination: $R^{2} \sim B (a, b)$, which motivates holding $a$ fixed, while $a^{τ}$ decreases as d increases. Using that $ϕ_{j} ω \sim G (a^{τ}, τ)$, we can show that the R2-D2 prior is equivalent to the following hierarchical normal gamma prior applied in Bitto and Frühwirth-Schnatter (2019) for TVP models:

\begin{matrix} β_{j} | τ_{j}^{2} \sim N (0, τ_{j}^{2}), τ_{j}^{2} \sim G (a^{τ}, a^{τ} λ_{B}^{2} / 2), λ_{B}^{2} \sim G (b, 2 σ^{2} / a^{τ}) . \end{matrix}

The popular Dirichlet-Laplace prior, ${\sqrt{θ}}_{j} | ψ_{j} \sim DE (0, ψ_{j})$, however, is not related to the triple gamma as the prior scale $ψ_{j}$ rather than the prior variance $ψ_{j}^{2}$ follows a gamma distribution, see again Table 1.

2.4. Using the Triple Gamma for Variance Selection in TVP Models

A challenging question is how to choose the parameters $a^{ξ}$, $c^{ξ}$ and $κ_{B}^{2}$ or $ϕ^{ξ}$ of the triple gamma prior in the context of variance selection for TVP models. In addition, in a TVP context, the shrinkage parameters $a^{τ}$, $c^{τ}$ and $λ_{B}^{2}$ or $ϕ^{τ} = 2 c^{τ} / (a^{τ} λ_{B}^{2})$ for the prior (7) of the initial values $β_{j}$ have to be selected.

In high-dimensional settings it is appealing to have a prior that addresses two major issues: first, high concentration around the origin to favor strong shrinkage of small variances toward zero; second, heavy tails to introduce robustness to large variances and to avoid over-shrinkage. For the triple gamma prior, both issues are addressed through the choice of $a^{ξ}$ and $c^{ξ}$, see Theorem 2. First of all, we need values $0 < a^{ξ} \leq 0.5$ to induce a pole at 0. Second, values of $0 < c^{ξ} < 0.5$ will lead to very heavy tails. For very small values of $a^{ξ}$ and $c^{ξ}$, the triple gamma is a proper prior that behaves nearly as the improper normal-Jeffreys prior (Figueiredo 2003), where $p ({\sqrt{θ}}_{j}) \propto 1 / {\sqrt{θ}}_{j}$ and $p (ρ_{j}) \propto ρ_{j}^{- 1} {(1 - ρ_{j})}^{- 1}$.

Ideally, we would place a hyperprior distribution on all shrinkage parameters which would allow us to learn the global and the local degree of sparsity, both for the variances and the initial values. Such a hierarchical triple gamma prior introduces dependence among the local shrinkage parameters $ξ_{1}^{2}, \dots, ξ_{d}^{2}$ in (4) and, consequently, among $θ_{1}, \dots, θ_{d}$ in the joint (marginal) prior $p (θ_{1}, \dots, θ_{d})$. Introducing such dependence is desirable in that it allows us to learn the degree of variance sparsity in TVP models, meaning that how much a variance is shrunken toward zero depends on how close the other variances are to zero. However, first naïve approaches with rather uninformative, independent priors on $κ_{B}^{2}$, $a^{ξ}$, $c^{ξ}$ and $λ_{B}^{2}$, $a^{τ}$, $c^{τ}$ were not met with much success and we found it necessary to carefully design appropriate hyperpriors.

Hierarchical versions of the Bayesian Lasso (Belmonte et al. 2014) and the double gamma prior (Bitto and Frühwirth-Schnatter 2019) in TVP models are based on the gamma prior $κ_{B}^{2} \sim G (d_{1}, d_{2})$. Interestingly, this choice can be seen as a heavy-tailed extension of both priors, where each marginal density $p ({\sqrt{θ}}_{j} | d_{1}, d_{2})$ follows a triple gamma prior with the same parameter $a^{ξ}$ (being equal to one for the Bayesian Lasso) and tail index $c^{ξ} = d_{1}$. In light of this relationship, it is not surprising that very small values of $d_{1}$ were applied in these papers to ensure heavy tails of $p ({\sqrt{θ}}_{j} | d_{1}, d_{2})$. Since a triple gamma prior already has heavy tails, we choose a different hyperprior in the present paper.

For the case $a^{ξ} = c^{ξ} = 1 / 2$, the global shrinkage parameter $τ$ of the Horseshoe prior typically follows a Cauchy prior, $τ \sim t_{1}$ (Bhadra et al. 2017b; Carvalho et al. 2009), see also Bhadra et al. (2019, Section 5). The relationship $ϕ^{ξ} = 2 / κ_{B}^{2} = τ^{2}$ between the various global shrinkage parameters (see Table 1) implies in this case $ϕ^{ξ} \sim F (1, 1)$ or, equivalently, $κ_{B}^{2} / 2 \sim F (1, 1)$.

For a triple gamma prior with arbitrary $a^{ξ}$ and $c^{ξ}$, this is a special case of the following prior:

\begin{matrix} \frac{κ_{B}^{2}}{2}| a^{ξ}, c^{ξ} \sim F (2 a^{ξ}, 2 c^{ξ}), \end{matrix}

(17)

which will be motivated in Section 3.2. Under this prior, the triple gamma prior exhibits BMA-like behavior with a uniform prior on an appropriately defined model size (see Theorem 3). Prior (17) is equivalent to the following representations:

\begin{matrix} κ_{B}^{2} | a^{ξ} \sim G (a^{ξ}, d_{2}), d_{2} | a^{ξ}, c^{ξ} \sim G (c^{ξ}, \frac{2 c^{ξ}}{a^{ξ}}), \\ ϕ^{ξ} | a^{ξ}, c^{ξ} \sim BP (c^{ξ}, a^{ξ}) . \end{matrix}

(18)
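The equivalence between (17) and (18) can be checked by simulation. The following is a minimal sketch rather than the authors' code: it assumes that G(shape, rate) denotes a gamma distribution in shape-rate form and that an F(2a, 2c) variate can be built as a ratio of two scaled gamma variates; the function names are hypothetical.

```python
import random

def draw_kappa2_f(a, c, rng):
    """kappa_B^2 via (17): kappa_B^2 / 2 ~ F(2a, 2c), built from gamma ratios."""
    num = rng.gammavariate(a, 1.0) / a   # chi^2_{2a} / (2a)
    den = rng.gammavariate(c, 1.0) / c   # chi^2_{2c} / (2c)
    return 2.0 * num / den

def draw_kappa2_hier(a, c, rng):
    """kappa_B^2 via (18): d_2 ~ G(c, 2c/a), then kappa_B^2 | d_2 ~ G(a, d_2)."""
    d2 = rng.gammavariate(c, a / (2.0 * c))   # gammavariate takes a scale = 1 / rate
    return rng.gammavariate(a, 1.0 / d2)

def frac_below(xs, q):
    """Empirical CDF of the draws xs evaluated at q."""
    return sum(x < q for x in xs) / len(xs)

rng = random.Random(1)
a, c, n = 0.4, 0.3, 40000
k_f = [draw_kappa2_f(a, c, rng) for _ in range(n)]
k_h = [draw_kappa2_hier(a, c, rng) for _ in range(n)]
# Empirical CDFs are compared because the tails are too heavy for moments.
for q in (0.01, 1.0, 100.0):
    print(q, frac_below(k_f, q), frac_below(k_h, q))
```

Under these assumptions the two empirical distribution functions should agree up to Monte Carlo error at every evaluation point.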

Concerning

a^{ξ}

and

c^{ξ}

, we choose the following priors:

\begin{matrix} 2 a^{ξ} \sim B (α_{a^{ξ}}, β_{a^{ξ}}), 2 c^{ξ} \sim B (α_{c^{ξ}}, β_{c^{ξ}}) . \end{matrix}

(19)

Hence, we are restricting the support of

a^{ξ}

and

c^{ξ}

to

(0, 0.5)

, following the insights brought to us by Theorem 2.

We follow a similar strategy for the parameters

a^{τ}

,

c^{τ}

and

λ_{B}^{2}

(

ϕ^{τ}

) of the prior (7) of the initial values

β_{j}

:

\begin{matrix} \frac{λ_{B}^{2}}{2}| a^{τ}, c^{τ} \sim F (2 a^{τ}, 2 c^{τ}), 2 a^{τ} \sim B (α_{a^{τ}}, β_{a^{τ}}), 2 c^{τ} \sim B (α_{c^{τ}}, β_{c^{τ}}), \end{matrix}

(20)

which is equivalent to

λ_{B}^{2} | a^{τ} \sim G (a^{τ}, e_{2})

,

e_{2} | a^{τ}, c^{τ} \sim G (c^{τ}, 2 c^{τ} / a^{τ})

, and

ϕ^{τ} | a^{τ}, c^{τ} \sim BP (c^{τ}, a^{τ})

.

An interesting special case is the “symmetric” triple gamma, where

a^{ξ} = c^{ξ}

. Despite this constraint, the favourable shrinkage behaviour is preserved and decreasing

a^{ξ} = c^{ξ}

toward zero simultaneously leads to a high concentration around the origin and a heavy-tailed behaviour. For a symmetric triple gamma prior, the global shrinkage parameter

ϕ^{ξ}

is independent of

a^{ξ}

and

c^{ξ}

and is related to the global shrinkage parameter

κ_{B}^{2}

through

ϕ^{ξ} = 2 / κ_{B}^{2}

. This induces shrinkage profiles that are symmetric around 1/2, see Section 3.1. Interestingly, a symmetric triple gamma resolves the question whether to choose a gamma or an inverse gamma prior for a variance parameter

ψ_{j}^{2}

. It implies the same symmetric beta-prime distribution on the variance,

ψ_{j}^{2} \sim F (2 a^{ξ}, 2 a^{ξ}) = BP (a^{ξ}, a^{ξ})

, and the information,

{(ψ_{j}^{2})}^{- 1} \sim BP (a^{ξ}, a^{ξ})

, and can be represented as a gamma prior with the scale arising from an inverse gamma prior or, equivalently, as an inverse gamma prior with the scale arising from a gamma prior:

\begin{matrix} ψ_{j}^{2} = {\overset{ˇ}{ξ}}_{j}^{2} \times \frac{1}{{\overset{ˇ}{κ}}_{j}^{2}}, {(ψ_{j}^{2})}^{- 1} = {\overset{ˇ}{κ}}_{j}^{2} \times \frac{1}{{\overset{ˇ}{ξ}}_{j}^{2}}, {\overset{ˇ}{ξ}}_{j}^{2} \sim G (a^{ξ}, 1), {\overset{ˇ}{κ}}_{j}^{2} \sim G (a^{ξ}, 1) . \end{matrix}

3. Shrinkage Profiles and BMA-Like Behavior

3.1. Shrinkage Profiles

In the sparse normal-means problem where

y | β \sim N_{d} (β, σ^{2} I_{d})

and

σ^{2} = 1

, the parameter

ρ_{j} = 1 / (1 + ψ_{j}^{2})

appearing in (14) is known as the shrinkage factor and plays a fundamental role in comparing different shrinkage priors, as

ρ_{j}

determines shrinkage toward 0.

Also in a variance selection context, it is evident from (14) that values of

ρ_{j} \approx 0

will introduce no shrinkage on

θ_{j}

, whereas values of

ρ_{j} \approx 1

will introduce strong shrinkage of

θ_{j}

toward 0. Hence, the prior

p (ρ_{j})

, also called shrinkage profile, will play an instrumental role in the behaviour of different shrinkage priors. Following Carvalho et al. (2010), shrinkage priors are often compared in terms of the prior they imply on

ρ_{j}

, that is, how they handle shrinkage for small “observations” (in our case innovations) and how robust they are to large “observations”. Note that we ideally want a shrinkage profile that has a pole in zero (heavy tails to avoid over-shrinking signals) and a pole in one (spikiness to shrink noise). The Horseshoe prior, for example, implies

ρ_{j} \sim B (1 / 2, 1 / 2)

which is a shrinkage profile that takes this much desired form of a “horseshoe”, see Figure 2.

For the triple gamma prior, the shrinkage profile is given by the three-parameter beta prior

p (ρ_{j})

provided in (15). For

ϕ^{ξ} = 1

,

ρ_{j} \sim B (c^{ξ}, a^{ξ})

and

κ_{B}^{2} = 2 c^{ξ} / a^{ξ}

. Choosing small values

a^{ξ} < < 1

will put prior mass close to 1, choosing small values

c^{ξ} < < 1

will put prior mass close to 0, whereas values for both

a^{ξ}

and

c^{ξ}

smaller than one will induce the form of a horseshoe prior for

ρ_{j}

. Evidently, for

ϕ^{ξ} = 1

, a symmetric triple gamma prior with

a^{ξ} = c^{ξ}

implies a Horseshoe prior for

ρ_{j}

that is symmetric around 0.5. This is illustrated in Figure 2 for a symmetric triple gamma with

a^{ξ} = c^{ξ} = 0.1

.
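The shape of this shrinkage profile can be explored by simulation. The sketch below is illustrative only: it assumes the representation ψ_j^2 = ϕ^ξ ξ̌_j^2 / κ̌_j^2 with independent G(a^ξ, 1) and G(c^ξ, 1) factors, and the function name `draw_rho` is hypothetical.

```python
import random

def draw_rho(a, c, phi, rng):
    """One draw of rho_j = 1 / (1 + psi_j^2), assuming psi_j^2 = phi * xi / kappa."""
    xi = rng.gammavariate(a, 1.0)      # xi-check_j^2 ~ G(a, 1)
    kappa = rng.gammavariate(c, 1.0)   # kappa-check_j^2 ~ G(c, 1)
    # Algebraically equal to 1 / (1 + phi * xi / kappa), but avoids dividing by kappa.
    return kappa / (kappa + phi * xi)

rng = random.Random(2)
a = c = 0.1                            # symmetric case shown in Figure 2
rho = [draw_rho(a, c, 1.0, rng) for _ in range(20000)]
near_zero = sum(r < 0.05 for r in rho) / len(rho)
near_one = sum(r > 0.95 for r in rho) / len(rho)
middle = 1.0 - near_zero - near_one
print(near_zero, near_one, middle)
```

For a symmetric prior the shares near zero and near one should be roughly equal and each should exceed the share in the wide middle interval, which is the horseshoe shape described above.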

In Figure 2 we can also see the shrinkage profile for the Bayesian Lasso and the double gamma, which correspond to a triple gamma where

c^{ξ} \to \infty

.4 For the Bayesian Lasso with

a^{ξ} = 1

, it is clear that the shrinkage profile

p (ρ_{j})

converges to a constant for

ρ_{j} \to 1

, while there is no mass around

ρ_{j} = 0

. This means that this prior tends to over-shrink signals, while not shrinking the noise completely to zero. A double gamma prior with

a^{ξ} < 1

has the potential to shrink the noise completely to zero, as

p (ρ_{j})

has a pole at

ρ_{j} = 1

, but

p (ρ_{j})

also has zero mass around

ρ_{j} = 0

, meaning the prior encourages over-shrinking of signals.

When we make

κ_{B}^{2}

random, we obtain a “prior density” of shrinkage profiles, see Figure 3. We can see that such hierarchical versions of the Lasso and the double gamma have shrinkage profiles that resemble the ones of the Horseshoe and the triple gamma. We have used

κ_{B}^{2} \sim G (0.01, 0.01)

for the Lasso and the double gamma,

2 / κ_{B}^{2} \sim F (1, 1)

for the Horseshoe and

2 / κ_{B}^{2} \sim F (0.2, 0.2)

for the triple gamma, see Section 2.4.

3.2. BMA-Type Behaviour

Bayesian model averaging (BMA) provides statisticians and practitioners with an essential and coherent tool to account for model uncertainty. In a multiple regression setting, the uncertainty is inherent in the choice of variables to be included. In a TVP framework, there is additional uncertainty about the time-variation of the state parameters, that is, which explanatory variables have a static and which ones a dynamic effect on the response variable. In this section, we show that the triple gamma prior mimics the typical BMA behavior, thus allowing us to incorporate model uncertainty with respect to time variation.

From the perspective of Bayesian model averaging, an ideal approach for handling sparsity in TVP models would be the use of discrete mixture priors as suggested in Frühwirth-Schnatter and Wagner (2010),

\begin{matrix} p ({\sqrt{θ}}_{j}) = (1 - π) δ_{0} + π \cdot p_{slab} ({\sqrt{θ}}_{j}), \end{matrix}

(21)

with

δ_{0}

being a Dirac measure at 0, while

p_{slab} ({\sqrt{θ}}_{j})

is the prior for non-zero variances. In terms of shrinkage profiles, the discrete mixture prior (21) has a spike at

ρ_{j} = 1

, with probability

1 - π

, and a lot of prior mass at

ρ_{j} = 0

, provided that the tails of

p_{slab} ({\sqrt{θ}}_{j})

are heavy enough. The mixture prior (21) is considered the “gold standard” in BMA, both theoretically and empirically, see for example, Johnstone and Silverman (2004). However, MCMC inference under this prior is extremely challenging. As opposed to this, MCMC inference for the triple gamma prior is straightforward, see Section 4.

In this section, we relate the triple gamma prior to BMA based on the discrete mixture prior (21). An interesting insight is that the triple gamma prior shows a behaviour very similar to a discrete mixture prior, if both

a^{ξ}

and

c^{ξ}

approach zero. This induces BMA-type behaviour on the joint shrinkage profile

p (ρ_{1}, \dots, ρ_{d})

, with a spike at all corner solutions, where some

ρ_{j}

are very close to one, whereas the remaining ones are very close to zero.

The bivariate shrinkage profiles shown in Figure 4 give us some intuition about the convergence of a symmetric triple gamma prior with

a^{ξ} = c^{ξ} \to 0

toward a discrete spike and slab mixture. As opposed to the Lasso and the double gamma prior, the Horseshoe and the triple gamma prior put nearly all prior mass on the “corner solutions”, which correspond to the four possibilities (a)

ρ_{1} = ρ_{2} = 0

, that is, no shrinkage on

θ_{1}

and

θ_{2}

, (b)

ρ_{1} = 1, ρ_{2} = 0

, that is, shrinkage of

θ_{1}

toward 0 and no shrinkage on

θ_{2}

, (c)

ρ_{1} = 0, ρ_{2} = 1

, that is, no shrinkage on

θ_{1}

and shrinkage of

θ_{2}

toward 0

, and (d)

ρ_{1} = ρ_{2} = 1

, that is, shrinkage of both

θ_{1}

and

θ_{2}

toward 0.

A very important aspect of BMA is that of choosing a prior for the model dimension, K, see for example, Fernández et al. (2001) and Ley and Steel (2009). In the discrete mixture prior (21), the distribution of K depends on the choice of

π

. Fixing

π

corresponds to a very informative prior on the model dimension, for example

π = 0.5

assigns more prior probability to models of dimension

d / 2

and lower prior probability to empty or full models. In fact, let

δ_{j}

be the indicator that tells us if the j-th coefficient is included in the model, then we have that

K = \sum_{j = 1}^{d} δ_{j} \sim Binom (d, π)

. Placing a uniform prior on

π

has been shown to be a good choice, since it corresponds to placing a prior on K which is uniform on

{0, \dots, d}

. Note that

π

will be learned using information from all the variables. In this sense,

π

is a global shrinkage parameter which will adapt to the degree of sparsity.
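The contrast between a fixed π and a uniform prior on π can be made concrete by computing the implied prior on K exactly. This is a small illustrative sketch; the function names are hypothetical, and the second function uses the standard beta integral to average the binomial probabilities over π.

```python
from math import comb, factorial

def size_pmf_fixed(d, pi):
    """Prior pmf of K ~ Binom(d, pi) for a fixed inclusion probability pi."""
    return [comb(d, k) * pi**k * (1 - pi)**(d - k) for k in range(d + 1)]

def size_pmf_uniform_pi(d):
    """Prior pmf of K when pi ~ U[0, 1]: the binomial pmf integrated over pi.

    The integral of pi^k (1 - pi)^(d - k) over [0, 1] is the beta function
    B(k + 1, d - k + 1) = k! (d - k)! / (d + 1)!.
    """
    return [comb(d, k) * factorial(k) * factorial(d - k) / factorial(d + 1)
            for k in range(d + 1)]

print(size_pmf_fixed(4, 0.5))     # mass concentrates around d / 2
print(size_pmf_uniform_pi(4))     # every model size receives mass 1 / (d + 1)
```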

Following ideas in Carvalho et al. (2009), we believe that a natural way to perform variable selection in the continuous shrinkage prior framework is through thresholding. Specifically, we say that when

(1 - ρ_{j}) > 0.5

, or

ρ_{j} < 0.5

, the variable is included, otherwise it is not. Notice that this classification via thresholding makes perfect sense in the case of a triple gamma of which the Horseshoe is a special case, but less so for a Lasso or double gamma prior, even if the shrinkage profile shows a Horseshoe-like behaviour for hierarchical versions of these priors (see again Figure 3). Notice that thresholding implies a prior on the model dimension K. Specifically,

\begin{matrix} K = \sum_{j = 1}^{d} I {ρ_{j} < 0.5} \sim Binom (d, π^{ξ}), π^{ξ} = \Pr (ρ_{j} < 0.5), \end{matrix}

(22)

where

ρ_{j} | a^{ξ}, c^{ξ}, ϕ^{ξ} \sim TPB (a^{ξ}, c^{ξ}, ϕ^{ξ})

, see (15). The choice of

ϕ^{ξ}

(or

κ_{B}^{2}

) will strongly impact the prior on K. For a symmetric triple gamma with

a^{ξ} = c^{ξ}

, for instance, and fixed

ϕ^{ξ} = 1

, that is

κ_{B}^{2} = 2

, we obtain

K \sim Binom (d, 0.5)

, since

π^{ξ} = 0.5

regardless of

a^{ξ}

. Hence, we have to face similar problems as with fixing

π = 0.5

for the discrete mixture prior (21).

Placing a hyper prior on

ϕ^{τ}

and

ϕ^{ξ}

(or equivalent ones on

λ_{B}^{2}

and

κ_{B}^{2}

), as we did in Section 2.4, is as vital for BMA-type variable and variance selection through the triple gamma prior, as making

π

random is for the discrete mixture prior (21). Ideally, we would like to have a uniform distribution on the model size K. We show in Theorem 3 that the hyperprior for

κ_{B}^{2}

defined in (17) achieves exactly this goal, since

π^{ξ}

is uniformly distributed, see Appendix A for a proof.

Theorem 3.

For a hierarchical triple gamma prior with fixed

a^{ξ} > 0

and

c^{ξ} > 0

the probability

π^{ξ}

defined in (22) follows a uniform distribution,

π^{ξ} \sim U [0, 1]

, under the hyper prior

\begin{matrix} \frac{κ_{B}^{2}}{2}| a^{ξ}, c^{ξ} \sim F (2 a^{ξ}, 2 c^{ξ}), \end{matrix}

(23)

or, equivalently, under the hyper prior

\begin{matrix} ϕ^{ξ} | a^{ξ}, c^{ξ} \sim BP (c^{ξ}, a^{ξ}) . \end{matrix}

(24)
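Theorem 3 can be illustrated by simulating the prior on the model size K implied by thresholding. The sketch below is not taken from the paper: it assumes ψ_j^2 = ϕ^ξ ξ̌_j^2 / κ̌_j^2, generates a BP(c^ξ, a^ξ) variate as the ratio of a G(c^ξ, 1) and a G(a^ξ, 1) variate, and uses a hypothetical function name.

```python
import random

def draw_model_size(d, a, c, rng):
    """Draw K = #{j : rho_j < 0.5} under the hierarchical triple gamma prior."""
    # phi ~ BP(c, a) as in (24), generated as a ratio of independent gamma variates.
    phi = rng.gammavariate(c, 1.0) / rng.gammavariate(a, 1.0)
    k = 0
    for _ in range(d):
        xi = rng.gammavariate(a, 1.0)
        kappa = rng.gammavariate(c, 1.0)
        # rho_j < 0.5  <=>  psi_j^2 = phi * xi / kappa > 1  <=>  phi * xi > kappa
        if phi * xi > kappa:
            k += 1
    return k

rng = random.Random(3)
d, a, c, n = 3, 0.3, 0.2, 40000
counts = [0] * (d + 1)
for _ in range(n):
    counts[draw_model_size(d, a, c, rng)] += 1
freqs = [cnt / n for cnt in counts]
print(freqs)   # each entry should be close to 1 / (d + 1) = 0.25
```

Because π^ξ is uniform under this hyper prior, K is a uniform mixture of binomials, so all model sizes 0, ..., d should appear with approximately equal frequency even for asymmetric choices of a^ξ and c^ξ.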

Finally, it is important to point out that the thresholding approach allows us to estimate posterior inclusion probabilities, that is the probability that the corresponding variable is included in the model or, in the case of variance selection, that the corresponding parameter is time varying. In our simulations (Section 5.3) and in our application (Section 5.4), we will estimate the posterior inclusion probabilities obtained under different shrinkage priors.
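As a minimal sketch of the thresholding rule, the posterior inclusion probability of a coefficient can be estimated as the share of MCMC draws with ρ_j < 0.5. The code assumes that posterior draws of ψ_j^2 are available; the function name and the numerical values are purely hypothetical.

```python
def inclusion_probability(psi2_draws, threshold=0.5):
    """Share of posterior draws with rho_j = 1 / (1 + psi_j^2) below the threshold."""
    rho = [1.0 / (1.0 + p) for p in psi2_draws]
    return sum(r < threshold for r in rho) / len(rho)

# Hypothetical posterior draws of psi_j^2 for a single coefficient.
draws = [0.02, 3.5, 1.8, 0.4, 12.0, 2.2, 0.9, 5.1]
pip = inclusion_probability(draws)
print(pip)   # 5 of the 8 draws exceed 1, which is equivalent to rho_j < 0.5
```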

4. MCMC Algorithm

Let

y = (y_{1}, \dots, y_{T})

be the vector of time series observations and let

z

be the set of all latent variables and unknown model parameters in a TVP model. Moreover, let

z_{- x}

denote the set of all unknowns but x. Bayesian inference based on MCMC sampling from the posterior

p (z | y)

is summarized in Algorithm 1. The hierarchical priors introduced in Section 2.4 are employed, where

κ_{B}^{2}

follows (17),

(a^{ξ}, c^{ξ})

follow (19), and

(a^{τ}, c^{τ}, λ_{B}^{2})

follow (20). For certain sampling steps, the hierarchical representation (18) is used for

κ_{B}^{2}

, and similarly for

λ_{B}^{2}

.

Algorithm 1 extends several existing algorithms such as the MCMC schemes introduced for the Horseshoe prior by Makalic and Schmidt (2016) and for the double gamma prior by Bitto and Frühwirth-Schnatter (2019). We exploit various representations of the triple gamma prior given in Lemma 1 and choose representation (12) as the baseline representation of our MCMC algorithm:

\begin{matrix} β_{j} | {\overset{ˇ}{τ}}_{j}^{2}, {\overset{ˇ}{λ}}_{j}^{2}, ϕ^{τ} \sim N (0, ϕ^{τ} {\overset{ˇ}{τ}}_{j}^{2} / {\overset{ˇ}{λ}}_{j}^{2}), {\overset{ˇ}{τ}}_{j}^{2} | a^{τ} \sim G (a^{τ}, 1), {\overset{ˇ}{λ}}_{j}^{2} | c^{τ} \sim G (c^{τ}, 1), \\ {\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, {\overset{ˇ}{κ}}_{j}^{2}, ϕ^{ξ} \sim N (0, ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2} / {\overset{ˇ}{κ}}_{j}^{2}), {\overset{ˇ}{ξ}}_{j}^{2} | a^{ξ} \sim G (a^{ξ}, 1), {\overset{ˇ}{κ}}_{j}^{2} | c^{ξ} \sim G (c^{ξ}, 1), \end{matrix}

where

ϕ^{τ} = 2 c^{τ} / (λ_{B}^{2} a^{τ})

and

ϕ^{ξ} = 2 c^{ξ} / (κ_{B}^{2} a^{ξ})

. All conditional distributions in our MCMC scheme are available in closed form, except for the ones for

a^{ξ}

,

c^{ξ}

,

a^{τ}

and

c^{τ}

, for which we will resort to a Metropolis-Hastings (MH) step within Gibbs. Several conditional distributions are the same as for the double gamma prior and we apply Algorithm 1 of Bitto and Frühwirth-Schnatter (2019). We provide more details on the derivation of the various densities in Appendix B.

Algorithm 1. MCMC inference for TVP models under the triple gamma prior.

Choose starting values for all global shrinkage parameters

(a^{τ}, c^{τ}, λ_{B}^{2}, a^{ξ}, c^{ξ}, κ_{B}^{2})

and local shrinkage parameters

{{\overset{ˇ}{τ}}_{j}^{2}, {\overset{ˇ}{λ}}_{j}^{2}, {\overset{ˇ}{ξ}}_{j}^{2}, {\overset{ˇ}{κ}}_{j}^{2}}_{j = 1}^{d}

, and repeat the following steps:

(a): Define for $j = 1, \dots, d$ , $τ_{j}^{2} = ϕ^{τ} {\overset{ˇ}{τ}}_{j}^{2} / {\overset{ˇ}{λ}}_{j}^{2}$ and $ξ_{j}^{2} = ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2} / {\overset{ˇ}{κ}}_{j}^{2}$ and sample from the posterior $p ({\tilde{β}}_{0}, \dots, {\tilde{β}}_{T}, β_{1}, \dots, β_{d}, \sqrt{θ_{1}}, \dots, \sqrt{θ_{d}} | {ξ_{j}^{2}, τ_{j}^{2}}_{j = 1}^{d}, y)$ using Algorithm 1, Steps (a), (b), and (c) in Bitto and Frühwirth-Schnatter (2019). In the homoscedastic case, use Step (f) of this algorithm to sample from $σ^{2} | z_{- σ^{2}}, y$ . For the SV model (2), sample the parameters μ, ϕ, and $σ_{η}^{2}$ as in Kastner and Frühwirth-Schnatter (2014), for example, using the R package stochvol (Kastner 2016).
(b): Use the prior $p ({\sqrt{θ}}_{j} | {\overset{ˇ}{κ}}_{j}^{2}, a^{ξ}, c^{ξ})$ , marginalized w.r.t. ${\overset{ˇ}{ξ}}_{j}^{2}$ , to sample $a^{ξ}$ from $p (a^{ξ} | z_{- a^{ξ}}, y)$ via a random walk MH step on $z = log (a^{ξ} / (0.5 - a^{ξ}))$ . Propose $a^{ξ, (*)} = 0.5 e^{z^{*}} / (1 + e^{z^{*}})$ , where $z^{*} \sim N (z^{(m - 1)}, v^{2})$ and $z^{(m - 1)} = log (a^{ξ, (m - 1)} / (0.5 - a^{ξ, (m - 1)}))$ depends on the previous value $a^{ξ, (m - 1)}$ of $a^{ξ}$ , accept $a^{ξ, (*)}$ with probability

$\begin{matrix} min \{1, \frac{q_{a} (a^{ξ, (*)})}{q_{a} (a^{ξ, (m - 1)})}\}, q_{a} (a^{ξ}) = p (a^{ξ} | z_{- a^{ξ}}, y) a^{ξ} (0.5 - a^{ξ}), \end{matrix}$

and update $ϕ^{ξ} = 2 c^{ξ} / (κ_{B}^{2} a^{ξ})$ . Explicit forms for $p (a^{ξ} | z_{- a^{ξ}}, y)$ and $log q_{a} (a^{ξ})$ are provided in (A3) and (A4).
Similarly, use the prior $p (β_{j} | {\overset{ˇ}{λ}}_{j}^{2}, a^{τ}, c^{τ})$ , marginalized w.r.t. ${\overset{ˇ}{τ}}_{j}^{2}$ , to sample $a^{τ}$ via a random walk MH step and update $ϕ^{τ} = 2 c^{τ} / (a^{τ} λ_{B}^{2})$ .
(c): Sample ${\overset{ˇ}{ξ}}_{j}^{2}$ , $j = 1, \dots, d$ , from a generalized inverse Gaussian distribution, see (A5):

$\begin{matrix} {\overset{ˇ}{ξ}}_{j}^{2} | {\overset{ˇ}{κ}}_{j}^{2}, θ_{j}, a^{ξ}, ϕ^{ξ} \sim GIG (a^{ξ} - \frac{1}{2}, 2, \frac{{\overset{ˇ}{κ}}_{j}^{2} θ_{j}}{ϕ^{ξ}}) . \end{matrix}$

(25)

Similarly, update ${\overset{ˇ}{τ}}_{j}^{2}$ , $j = 1, \dots, d$ , conditional on $a^{τ}$ :

$\begin{matrix} {\overset{ˇ}{τ}}_{j}^{2} | β_{j}, {\overset{ˇ}{λ}}_{j}^{2}, a^{τ}, ϕ^{τ} \sim GIG (a^{τ} - \frac{1}{2}, 2, \frac{{\overset{ˇ}{λ}}_{j}^{2} β_{j}^{2}}{ϕ^{τ}}) . \end{matrix}$
(d): Use the marginal Student-t distribution $p ({\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, c^{ξ}, κ_{B}^{2})$ given in (11) to sample $c^{ξ}$ from $p (c^{ξ} | z_{- c^{ξ}}, y)$ via a random walk MH step on $z = log (c^{ξ} / (0.5 - c^{ξ}))$ . Propose $c^{ξ, (*)} = 0.5 e^{z^{*}} / (1 + e^{z^{*}})$ , where $z^{*} \sim N (z^{(m - 1)}, v^{2})$ and $z^{(m - 1)} = log (c^{ξ, (m - 1)} / (0.5 - c^{ξ, (m - 1)}))$ depends on the previous value $c^{ξ, (m - 1)}$ of $c^{ξ}$ , accept $c^{ξ, (*)}$ with probability

$\begin{matrix} min \{1, \frac{q_{c} (c^{ξ, (*)})}{q_{c} (c^{ξ, (m - 1)})}\}, q_{c} (c^{ξ}) = p (c^{ξ} | z_{- c^{ξ}}, y) c^{ξ} (0.5 - c^{ξ}), \end{matrix}$

and update $ϕ^{ξ} = 2 c^{ξ} / (κ_{B}^{2} a^{ξ})$ . Explicit forms for $p (c^{ξ} | z_{- c^{ξ}}, y)$ and $log q_{c} (c^{ξ})$ are provided in (A6) and (A7).
Similarly, to sample $c^{τ}$ via a random walk MH step use the marginal distribution of $β_{j} | {\overset{ˇ}{τ}}_{j}^{2}, a^{τ}, c^{τ}$ with respect to ${\overset{ˇ}{λ}}_{j}^{2}$ and update $ϕ^{τ} = 2 c^{τ} / (a^{τ} λ_{B}^{2})$ .
(e): Sample ${\overset{ˇ}{κ}}_{j}^{2}$ , for $j = 1, \dots, d$ , from the following gamma distribution, see (A8):

$\begin{matrix} {\overset{ˇ}{κ}}_{j}^{2} | θ_{j}, {\overset{ˇ}{ξ}}_{j}^{2}, c^{ξ}, ϕ^{ξ} \sim G (\frac{1}{2} + c^{ξ}, \frac{θ_{j}}{2 ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2}} + 1) . \end{matrix}$

(26)

Similarly, update ${\overset{ˇ}{λ}}_{j}^{2}$ , $j = 1, \dots, d$ , conditional on $c^{τ}$ :

$\begin{matrix} {\overset{ˇ}{λ}}_{j}^{2} | β_{j}, {\overset{ˇ}{τ}}_{j}^{2}, c^{τ}, ϕ^{τ} \sim G (\frac{1}{2} + c^{τ}, \frac{β_{j}^{2}}{2 ϕ^{τ} {\overset{ˇ}{τ}}_{j}^{2}} + 1) . \end{matrix}$
(f): Sample $d_{2}$ from $d_{2} | a^{ξ}, c^{ξ}, κ_{B}^{2} \sim G (a^{ξ} + c^{ξ}, κ_{B}^{2} + \frac{2 c^{ξ}}{a^{ξ}})$ , see (A9); sample $κ_{B}^{2}$ from the following gamma distribution,

$\begin{matrix} κ_{B}^{2} | {θ_{j}, {\overset{ˇ}{κ}}_{j}^{2}, {\overset{ˇ}{ξ}}_{j}^{2}}_{j = 1}^{d}, a^{ξ}, c^{ξ}, d_{2} \sim G (\frac{d}{2} + a^{ξ}, \frac{a^{ξ}}{4 c^{ξ}} \sum_{j = 1}^{d} \frac{{\overset{ˇ}{κ}}_{j}^{2}}{{\overset{ˇ}{ξ}}_{j}^{2}} θ_{j} + d_{2}), \end{matrix}$

(27)

see (A10), and update $ϕ^{ξ} = 2 c^{ξ} / (κ_{B}^{2} a^{ξ})$ .
Similarly, sample $e_{2}$ from $e_{2} | a^{τ}, c^{τ}, λ_{B}^{2} \sim G (a^{τ} + c^{τ}, λ_{B}^{2} + \frac{2 c^{τ}}{a^{τ}})$ , sample $λ_{B}^{2}$ from

$\begin{matrix} λ_{B}^{2} | {β_{j}, {\overset{ˇ}{λ}}_{j}^{2}, {\overset{ˇ}{τ}}_{j}^{2}}_{j = 1}^{d}, a^{τ}, c^{τ}, e_{2} \sim G (\frac{d}{2} + a^{τ}, \frac{a^{τ}}{4 c^{τ}} \sum_{j = 1}^{d} \frac{{\overset{ˇ}{λ}}_{j}^{2}}{{\overset{ˇ}{τ}}_{j}^{2}} β_{j}^{2} + e_{2}), \end{matrix}$

and update $ϕ^{τ} = 2 c^{τ} / (a^{τ} λ_{B}^{2})$ .
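The random walk MH updates in Steps (b) and (d) share the same structure, which can be sketched generically as follows. This is an illustration under stated assumptions, not the authors' implementation: `log_post` stands in for the log conditional posterior given in Appendix B, the toy target used below is hypothetical, and the proposal standard deviation v is chosen arbitrarily.

```python
import math
import random

def mh_step_logit(a_prev, log_post, v, rng):
    """One random walk MH update for a parameter restricted to (0, 0.5).

    log_post(a) is the log of the (unnormalized) conditional posterior of a.
    """
    z_prev = math.log(a_prev / (0.5 - a_prev))
    z_star = rng.gauss(z_prev, v)
    a_star = 0.5 / (1.0 + math.exp(-z_star))   # = 0.5 e^z / (1 + e^z)
    if not 0.0 < a_star < 0.5:                 # guard against rounding at the edges
        return a_prev

    def log_q(a):
        # Target on the z scale: posterior times the Jacobian term a (0.5 - a).
        return log_post(a) + math.log(a) + math.log(0.5 - a)

    log_ratio = log_q(a_star) - log_q(a_prev)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return a_star
    return a_prev

# Toy target standing in for p(a | ...): 2a ~ B(2, 2), so that a has mean 0.25.
def toy_log_post(a):
    return math.log(2.0 * a) + math.log(1.0 - 2.0 * a)

rng = random.Random(4)
a, chain = 0.1, []
for _ in range(20000):
    a = mh_step_logit(a, toy_log_post, 1.0, rng)
    chain.append(a)
print(sum(chain[2000:]) / len(chain[2000:]))
```

Working on the logit scale keeps every proposal inside (0, 0.5), so no proposals are wasted on values outside the support imposed by prior (19).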

The MCMC scheme in Algorithm 1 is not a full conditional scheme, as several steps are based on partially marginalized distributions. That means that the sampling order matters. For instance, in Step (b), we marginalize w.r.t.

{\overset{ˇ}{ξ}}_{1}^{2}, \dots, {\overset{ˇ}{ξ}}_{d}^{2}

, hence we need to update

{\overset{ˇ}{ξ}}_{1}^{2}, \dots, {\overset{ˇ}{ξ}}_{d}^{2}

after sampling

a^{ξ}

, before we update

c^{ξ}

in Step (d) conditional on

{\overset{ˇ}{ξ}}_{1}^{2}, \dots, {\overset{ˇ}{ξ}}_{d}^{2}

. Similarly, due to marginalization in Step (d), we need to update

{\overset{ˇ}{κ}}_{1}^{2}, \dots, {\overset{ˇ}{κ}}_{d}^{2}

, before we update

d_{2}

in Step (f). Furthermore, both Step (b) and Step (d) are based on the marginal prior of

κ_{B}^{2}

, given in (17). Hence, in Step (f),

d_{2}

has to be updated from

d_{2} | a^{ξ}, c^{ξ}, κ_{B}^{2}

, before

κ_{B}^{2}

is updated conditional on

d_{2}

.
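The ordering constraint for Steps (e) and (f) can be sketched in code. The functions below simply transcribe the gamma conditionals (26) and (27) and the update of d_2; they assume the shape-rate parameterization of G, the function names are hypothetical, and the inputs are made-up values rather than output of a real sampler.

```python
import random

def gamma_rate(shape, rate, rng):
    """Draw from G(shape, rate); random.gammavariate expects a scale instead."""
    return rng.gammavariate(shape, 1.0 / rate)

def update_kappa_chk(theta, xi_chk, c, phi, rng):
    """Step (e): kappa_j ~ G(1/2 + c, theta_j / (2 phi xi_j) + 1), see (26)."""
    return [gamma_rate(0.5 + c, t / (2.0 * phi * x) + 1.0, rng)
            for t, x in zip(theta, xi_chk)]

def update_global_xi(theta, xi_chk, kappa_chk, a, c, kappa_B2, rng):
    """Step (f): draw d_2 first, then kappa_B^2 given d_2, then recompute phi."""
    d = len(theta)
    # d_2 | a, c, kappa_B^2 ~ G(a + c, kappa_B^2 + 2c / a)
    d2 = gamma_rate(a + c, kappa_B2 + 2.0 * c / a, rng)
    # kappa_B^2 | ... ~ G(d/2 + a, a/(4c) * sum_j kappa_j / xi_j * theta_j + d_2), see (27)
    s = sum(k / x * t for k, x, t in zip(kappa_chk, xi_chk, theta))
    kappa_B2_new = gamma_rate(d / 2.0 + a, a / (4.0 * c) * s + d2, rng)
    phi = 2.0 * c / (kappa_B2_new * a)
    return d2, kappa_B2_new, phi

rng = random.Random(6)
theta = [0.5, 0.0, 2.0]        # hypothetical process variances theta_j
xi_chk = [1.0, 0.2, 3.0]       # hypothetical local parameters xi-check_j^2
a, c, kappa_B2 = 0.3, 0.2, 2.0
phi = 2.0 * c / (kappa_B2 * a)
kappa_chk = update_kappa_chk(theta, xi_chk, c, phi, rng)           # Step (e)
d2, kappa_B2, phi = update_global_xi(theta, xi_chk, kappa_chk,
                                     a, c, kappa_B2, rng)          # Step (f)
print(d2, kappa_B2, phi)
```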

For a symmetric triple gamma prior, where

a^{ξ} = c^{ξ}

, the MCMC scheme in Algorithm 1 has to be modified only slightly. Either

q_{a} (a^{ξ})

in Step (b) is adjusted and Step (d) is skipped, setting

c^{ξ} = a^{ξ}

, or

q_{c} (c^{ξ})

in Step (d) is adjusted and Step (b) is skipped, setting

a^{ξ} = c^{ξ}

. In Appendix B, we provide details in (A11) for the first case and in (A12) for the second case. Similar modifications are needed, if

a^{τ} = c^{τ}

. All other steps in Algorithm 1 remain the same for

a^{ξ} = c^{ξ}

and/or

a^{τ} = c^{τ}

.

5. Applications to TVP-VAR-SV Models

5.1. Model

In this section, we consider a generalization of the TVP model (1), where

y_{t}

is an m-dimensional time series, observed for

t = 1, \dots, T

. The time series

y_{t}

is assumed to follow a time-varying parameter vector autoregressive model with stochastic volatility (TVP-VAR-SV) of order p:

\begin{matrix} y_{t} = c_{t} + Φ_{1, t} y_{t - 1} + Φ_{2, t} y_{t - 2} + \dots + Φ_{p, t} y_{t - p} + ε_{t}, ε_{t} \sim N_{m} (0, Σ_{t}), \end{matrix}

(28)

where

c_{t}

is the m-dimensional time-varying intercept,

Φ_{j, t}

, for

j = 1, \dots, p

is an

m \times m

matrix of time-varying coefficients, and

Σ_{t}

is the time-varying variance covariance matrix of the error term. The TVP-VAR-SV model can be written in a more compact notation as the following TVP model:

\begin{matrix} y_{t} = (I_{m} \otimes x_{t}) β_{t} + ε_{t}, ε_{t} \sim N_{m} (0, Σ_{t}), \end{matrix}

(29)

where

x_{t} = (y_{t - 1}^{'}, \dots, y_{t - p}^{'}, 1)

is a row vector of length

m p + 1

and the time-varying parameter

β_{t}

is defined as

β_{t} = {({β_{t}^{1}}^{'}, \dots, {β_{t}^{m}}^{'})}^{'}

, where

β_{t}^{i} = {({Φ_{1, t}}_{i •}, \dots, {Φ_{p, t}}_{i •}, c_{t, i})}^{'}

. Here,

{Φ_{j, t}}_{i •}

denotes the i-th row of the matrix

Φ_{j, t}

and

c_{t, i}

denotes the i-th element of

c_{t}

. Since the influential paper of Primiceri (2005) (see Del Negro and Primiceri (2015) for a corrigendum), this model has become a benchmark for analyzing relationships between macroeconomic variables that evolve over time, see Nakajima (2011), Koop and Korobilis (2013), Eisenstat et al. (2014), Chan and Eisenstat (2016), Feldkircher et al. (2017) and Carriero et al. (2019), among many others.

Following Frühwirth-Schnatter and Tüchler (2008), we use a Cholesky decomposition of the time-varying covariance matrix

Σ_{t}

, that is

Σ_{t} = A_{t} D_{t} A_{t}^{'}

, where

D_{t}

is a diagonal matrix and

A_{t}

is a lower unitriangular matrix, see Carriero et al. (2019) and Bitto and Frühwirth-Schnatter (2019) for related models. We denote with

a_{i j, t}

the element at the i-th row and j-th column of

A_{t}

, and with

σ_{i, t}^{2}

the i-th diagonal element of

D_{t} = Diag (σ_{1, t}^{2} \dots σ_{m, t}^{2})

. In total, we have

m (m - 1) / 2 + m (m p + 1)

(potentially) time-varying parameters. Using the Cholesky decomposition, we can rewrite the system as:

\begin{matrix} y_{t} = (I_{m} \otimes x_{t}) β_{t} + A_{t} η_{t}, η_{t} \sim N_{m} (0, D_{t}), \end{matrix}

(30)

where

η_{t} = {(η_{1, t}, \dots, η_{m, t})}^{⊤}

. The idiosyncratic shocks

η_{i, t} \sim N (0, σ_{i, t}^{2})

follow independent SV processes as in (2), with row specific parameters. Specifically, with

h_{i, t} = log σ_{i, t}^{2}

, we have that the logarithms of the diagonal elements of

D_{t}

follow independent AR(1) processes:

\begin{matrix} h_{i, t} = μ_{i} + ϕ_{i} (h_{i, t - 1} - μ_{i}) + ν_{i, t}, ν_{i, t} \sim N (0, σ_{η, i}^{2}), \end{matrix}

for

i = 1, \dots, m

. Here,

μ_{i}

is the mean,

ϕ_{i}

is the persistence parameter, and

σ_{η, i}^{2}

is the variance of the ith log-volatility

h_{i, t}

.
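The two building blocks of the error structure, the decomposition Σ_t = A_t D_t A_t' and the AR(1) log-volatility process, can be sketched as follows. The example is illustrative only: the function names, the 3 x 3 inputs and the SV parameter values are hypothetical, and plain Python lists are used in place of a matrix library.

```python
import math
import random

def covariance_from_cholesky(a_lower, sigma2):
    """Return Sigma = A D A' for unit lower triangular A and D = Diag(sigma2).

    a_lower[i][j] holds a_{ij} for j < i; entries on or above the diagonal are ignored.
    """
    m = len(sigma2)
    A = [[1.0 if i == j else (a_lower[i][j] if j < i else 0.0)
          for j in range(m)] for i in range(m)]
    return [[sum(A[i][k] * sigma2[k] * A[j][k] for k in range(m))
             for j in range(m)] for i in range(m)]

def simulate_log_vol(T, mu, phi, sigma2_eta, h0, rng):
    """Simulate h_t = mu + phi (h_{t-1} - mu) + nu_t with nu_t ~ N(0, sigma2_eta)."""
    h, path = h0, []
    for _ in range(T):
        h = mu + phi * (h - mu) + rng.gauss(0.0, math.sqrt(sigma2_eta))
        path.append(h)
    return path

a_t = [[0.0, 0.0, 0.0],
       [0.5, 0.0, 0.0],
       [-1.0, 2.0, 0.0]]
sigma2_t = [4.0, 1.0, 0.25]
Sigma_t = covariance_from_cholesky(a_t, sigma2_t)
for row in Sigma_t:
    print(row)

rng = random.Random(5)
h = simulate_log_vol(200, mu=-1.0, phi=0.95, sigma2_eta=0.05, h0=-1.0, rng=rng)
sigma2_path = [math.exp(x) for x in h]   # sigma_{i,t}^2 = exp(h_{i,t})
```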

It is possible to write the TVP-VAR-SV model (30) as a system of m univariate TVP models as in (1):

\begin{matrix} y_{1, t} = & x_{t} β_{t}^{1} + η_{1, t}, η_{1, t} \sim N (0, σ_{1, t}^{2}), \\ y_{2, t} = & x_{t} β_{t}^{2} + a_{21, t} η_{1, t} + η_{2, t}, η_{2, t} \sim N (0, σ_{2, t}^{2}), \\ y_{3, t} = & x_{t} β_{t}^{3} + a_{31, t} η_{1, t} + a_{32, t} η_{2, t} + η_{3, t}, η_{3, t} \sim N (0, σ_{3, t}^{2}), \\ \dots \\ y_{m, t} = & x_{t} β_{t}^{m} + a_{m 1, t} η_{1, t} + \dots + a_{m, m - 1, t} η_{m - 1, t} + η_{m, t}, η_{m, t} \sim N (0, σ_{m, t}^{2}) . \end{matrix}

Note that for

i > 1

, the i-th equation of this system is a TVP model where the residuals of the preceding

i - 1

equations are added as explanatory variables:

\begin{matrix} y_{i, t} = x_{t} β_{t}^{i} + \sum_{j = 1}^{i - 1} a_{i j, t} η_{j, t} + η_{i, t}, η_{i, t} \sim N (0, σ_{i, t}^{2}), \end{matrix}

and all time-varying parameters follow a random walk as in the TVP model (1):

\begin{matrix} β_{j, t}^{i} = β_{j, t - 1}^{i} + v_{i j, t}, v_{i j, t} \sim N (0, θ_{i j}^{β}), for i = 1, \dots, m, and j = 1, \dots, m p + 1, \\ a_{i j, t} = a_{i j, t - 1} + w_{i j, t}, w_{i j, t} \sim N (0, θ_{i j}^{a}), for i = 1, \dots, m, and j = 1, \dots, i - 1, \end{matrix}

with initial values

β_{j, 0}^{i} \sim N (β_{i j}^{β}, θ_{i j}^{β})

and

a_{i j, 0} \sim N (β_{i j}^{a}, θ_{i j}^{a})

. Here,

β_{j, t}^{i}

denotes the jth element of the vector

β_{t}^{i}

.
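The role of the process variance in these state equations can be illustrated with a short simulation. This is a sketch under the random walk specification above; the function name and the parameter values are hypothetical. With θ = 0 the initial value collapses to its expectation and the coefficient stays constant over time, which is the case the variance shrinkage prior is designed to detect.

```python
import math
import random

def simulate_state(T, beta, theta, rng):
    """Simulate one coefficient path: initial draw N(beta, theta), then a random walk."""
    sd = math.sqrt(theta)
    state = rng.gauss(beta, sd)           # value at t = 0
    path = [state]
    for _ in range(T):
        state += rng.gauss(0.0, sd)       # innovation with variance theta
        path.append(state)
    return path

rng = random.Random(8)
dynamic = simulate_state(200, beta=0.5, theta=0.01, rng=rng)   # time-varying coefficient
static = simulate_state(200, beta=0.5, theta=0.0, rng=rng)     # constant coefficient
print(min(dynamic), max(dynamic), set(static))
```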

To achieve shrinkage for each VAR coefficient

β_{j, t}^{i}

as well as for each Cholesky factor

a_{i j, t}

, we proceed as in Section 2 and introduce shrinkage priors for the initial expectations

β_{i j}^{β}

and

β_{i j}^{a}

as well as the variances

θ_{i j}^{β}

and

θ_{i j}^{a}

. We do this independently for each equation of the system. Within each equation, the

β_{i j}^{β}

s and

β_{i j}^{a}

s are assumed to follow independent shrinkage priors to allow for flexibility in the prior structure, and similarly for

θ_{i j}^{β}

and

θ_{i j}^{a}

:

\begin{matrix} β_{i j}^{x} \sim N (0, ϕ_{i}^{τ, x} {\overset{ˇ}{τ}}_{i j}^{x, 2} / {\overset{ˇ}{λ}}_{i j}^{x, 2}), {\overset{ˇ}{τ}}_{i j}^{x, 2} \sim G (a_{i}^{τ, x}, 1), {\overset{ˇ}{λ}}_{i j}^{x, 2} \sim G (c_{i}^{τ, x}, 1), ϕ_{i}^{τ, x} = 2 c_{i}^{τ, x} / (λ_{B, i}^{x, 2} a_{i}^{τ, x}), \\ {\sqrt{θ}}_{i j}^{x} \sim N (0, ϕ_{i}^{ξ, x} {\overset{ˇ}{ξ}}_{i j}^{x, 2} / {\overset{ˇ}{κ}}_{i j}^{x, 2}), {\overset{ˇ}{ξ}}_{i j}^{x, 2} \sim G (a_{i}^{ξ, x}, 1), {\overset{ˇ}{κ}}_{i j}^{x, 2} \sim G (c_{i}^{ξ, x}, 1), ϕ_{i}^{ξ, x} = 2 c_{i}^{ξ, x} / (κ_{B, i}^{x, 2} a_{i}^{ξ, x}), \end{matrix}

(31)

where

x = β

for the VAR-coefficients and

x = a

for the elements of

A_{t}

. Following Section 2.4, the priors for the global shrinkage parameters in the ith equation read

\begin{matrix} \frac{λ_{B, i}^{x, 2}}{2} | a_{i}^{τ, x}, c_{i}^{τ, x} \sim F (2 a_{i}^{τ, x}, 2 c_{i}^{τ, x}), 2 a_{i}^{τ, x} \sim B (α_{a^{τ}}, β_{a^{τ}}), 2 c_{i}^{τ, x} \sim B (α_{c^{τ}}, β_{c^{τ}}), \\ \frac{κ_{B, i}^{x, 2}}{2} | a_{i}^{ξ, x}, c_{i}^{ξ, x} \sim F (2 a_{i}^{ξ, x}, 2 c_{i}^{ξ, x}), 2 a_{i}^{ξ, x} \sim B (α_{a^{ξ}}, β_{a^{ξ}}), 2 c_{i}^{ξ, x} \sim B (α_{c^{ξ}}, β_{c^{ξ}}) . \end{matrix}

(32)

5.2. A Brief Sketch of the TVP-VAR-SV MCMC Algorithm

Our algorithm exploits the aforementioned unitriangular decomposition to estimate the model parameters equation-by-equation. Due to the prior structure introduced in (31), the estimation of

β_{t}^{i}

and the

a_{i j, t}

’s is separated into two blocks, with the algorithm cycling through the m equations, alternating between sampling

β_{t}^{i}

conditional on

Σ_{t}

and sampling the

a_{i j, t}

s and

σ_{i, t}^{2}

s conditional on the VAR coefficients

β_{t}^{i}

. Given a set of initial values, the algorithm repeats the following steps:

Algorithm 2. MCMC inference for TVP-VAR-SV models under the triple gamma prior.

Choose starting values for all global and local shrinkage parameters in prior (31) for each equation and repeat the following steps:

For

i = 1, \dots, m

, update all the unknowns in the ith equation:

(a): Conditional on $A_{t}$ and $D_{t}$ , create ${\overset{ˇ}{y}}_{i, t} = y_{i, t} - \sum_{j = 1}^{i - 1} a_{i j, t} η_{j, t}$ and define the following TVP model:

${\overset{ˇ}{y}}_{i, t} = x_{t} β_{t}^{i} + η_{i, t}, η_{i, t} \sim N (0, σ_{i, t}^{2}) .$

Apply Algorithm 1 (sans the step for the variance of the observation equation) to this univariate TVP model, to draw from the conditional posterior distribution of the time-varying VAR-coefficients $β_{t}^{i},$ for $t = 0, \dots, T,$ their initial expectations $β_{i j}^{β}$ , the process variances $θ_{i j}^{β}$ , the local shrinkage parameters ${\overset{ˇ}{τ}}_{i j}^{β, 2}, {\overset{ˇ}{λ}}_{i j}^{β, 2}, {\overset{ˇ}{ξ}}_{i j}^{β, 2}, {\overset{ˇ}{κ}}_{i j}^{β, 2}$ , as well as the global shrinkage parameters $λ_{B, i}^{β, 2}, κ_{B, i}^{β, 2}, a_{i}^{τ, β}, c_{i}^{τ, β}, a_{i}^{ξ, β}$ , and $c_{i}^{ξ, β}$ .
(b): For $i > 1$ , create $y_{i, t}^{\ast} = y_{i, t} - x_{t} β_{t}^{i}$ , conditional on $β_{t}^{i}$ , and define the following TVP model:

$y_{i, t}^{\ast} = \sum_{j = 1}^{i - 1} a_{i j, t} η_{j, t} + η_{i, t}, η_{i, t} \sim N (0, σ_{i, t}^{2}),$

where the residuals from the previous $i - 1$ equations, $(η_{1, t}, \dots, η_{i - 1, t})$ , are used as explanatory variables and no intercept is present. Apply Algorithm 1 to this univariate TVP model, to sample the volatilities $σ_{i, t}^{2}$ and the time-varying coefficients $a_{i j, t}$ in the ith row of $A_{t}$ for $t = 0, \dots, T$ from the respective conditional posteriors, as well as the initial expectations $β_{i j}^{a}$ , the process variances $θ_{i j}^{a}$ , the local shrinkage parameters ${\overset{ˇ}{τ}}_{i j}^{a, 2}, {\overset{ˇ}{λ}}_{i j}^{a, 2}, {\overset{ˇ}{ξ}}_{i j}^{a, 2}, {\overset{ˇ}{κ}}_{i j}^{a, 2}$ and the global shrinkage parameters $λ_{B, i}^{a, 2}, κ_{B, i}^{a, 2}, a_{i}^{τ, a}, c_{i}^{τ, a}, a_{i}^{ξ, a}$ , and $c_{i}^{ξ, a}$ .
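The two adjusted responses used in Steps (a) and (b) of Algorithm 2 can be sketched for a single equation and time point. The function name and all numerical values are hypothetical; the example assumes m = 3 and p = 1, so that x_t has length mp + 1 = 4 and the third equation uses two preceding residuals.

```python
def adjusted_response(y_i, x, beta_i, a_i, eta_prev):
    """Return (y_check, y_star) for equation i at one time point.

    y_check = y_i - sum_j a_ij * eta_j   (Step (a), response for the VAR block)
    y_star  = y_i - x beta_i             (Step (b), response for the Cholesky block)
    """
    y_check = y_i - sum(a * e for a, e in zip(a_i, eta_prev))
    y_star = y_i - sum(xk * bk for xk, bk in zip(x, beta_i))
    return y_check, y_star

# Hypothetical values for equation i = 3.
y_check, y_star = adjusted_response(
    y_i=2.0,
    x=[1.0, -0.5, 2.0, 1.0], beta_i=[0.5, 1.0, 0.25, 0.0],
    a_i=[0.5, -1.0], eta_prev=[1.0, 0.25])
print(y_check, y_star)
```

Each adjusted response removes the part of y_{i,t} explained by the block that is currently held fixed, so that the remaining block can be updated with the univariate sampler of Algorithm 1.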

In the following applications, we run our algorithm for

M = 200, 000

iterations, discarding the first

100, 000

iterations as burn-in, and then keeping every 100th draw.
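The bookkeeping implied by this burn-in and thinning scheme can be sketched as follows. The exact indexing convention is an assumption: iterations are numbered from 1 to M and the first stored draw is taken to be the 100th iteration after burn-in.

```python
M, burn_in, thin = 200_000, 100_000, 100

# Assumption: iterations are numbered 1..M and every 100th post-burn-in draw is stored.
kept = [it for it in range(1, M + 1) if it > burn_in and (it - burn_in) % thin == 0]
print(len(kept), kept[0], kept[-1])
```

Under this convention 1000 draws per parameter remain for posterior inference.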

5.3. Illustrative Example with Simulated Data

To illustrate the merit of our methodology in the context of TVP-VAR-SV models, we simulate data from two TVP-VAR-SV models with

T = 200

points in time,

p = 1

lags and

m = 7

equations, with varying degrees of sparsity. In the dense regime, approximately 30% of the values of

β

and

θ

(here referring to the means of the initial states and the variances of the innovations as defined in Section 2, respectively) are truly zero, while in the sparse regime approximately 90% are truly zero. We show results for the triple gamma prior, the Horseshoe prior, the double gamma and the Lasso.

Regarding the priors on the hyperparameters, we use prior (32) with

α_{a^{τ}} = α_{c^{τ}} = α_{a^{ξ}} = α_{c^{ξ}} = 1

and

β_{a^{τ}} = β_{c^{τ}} = β_{a^{ξ}} = β_{c^{ξ}} = 6

for the triple gamma. The probability density function of the corresponding beta prior is monotonically increasing, with a maximum at

0.5

. This prior places positive mass in a neighborhood of the Horseshoe, but allows for more flexibility. In practice, placing a prior on the spike and slab parameters of the triple gamma, instead of fixing them to 0.5 as in the Horseshoe, allows us to learn the shrinkage profile from the data, including asymmetric profiles.

We assume that the global shrinkage parameters

λ_{B, i}^{β, 2}, κ_{B, i}^{β, 2}, λ_{B, i}^{a, 2}

, and

κ_{B, i}^{a, 2}

follow a

F (1, 1)

distribution for the Horseshoe prior which corresponds to the prior in Carvalho et al. (2009) and a

G (0.001, 0.001)

distribution for the Lasso and the double gamma prior, as suggested in Belmonte et al. (2014) and Bitto and Frühwirth-Schnatter (2019). Concerning the spike parameters

a_{i}^{τ, a}, a_{i}^{ξ, a}, a_{i}^{τ, β}

, and

a_{i}^{ξ, β}

of the double gamma, we employ a rescaled beta prior to force them to be smaller than

0.5

. Specifically, we use a

B (4, 6)

prior which places most of its mass between

0.05

and

0.4

, a range that Bitto and Frühwirth-Schnatter (2019) have found to induce desirable shrinkage characteristics.
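As a quick check of where this prior concentrates, the following sketch computes the mass that a spike parameter a = X/2 with X ~ B(4, 6) places on the interval from 0.05 to 0.4. It assumes that the rescaling is a = X/2 (so that a < 0.5) and relies on the closed-form binomial-sum expression of the Beta CDF for integer shape parameters. The function names are illustrative and are not taken from the authors' code.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of a Beta(a, b) distribution at x, for positive integers a and b.

    Uses the identity I_x(a, b) = P(Bin(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a, n + 1))


def mass_of_rescaled_beta(lower, upper, a=4, b=6, scale=0.5):
    """P(lower < scale * X < upper) for X ~ Beta(a, b)."""
    return beta_cdf_int(upper / scale, a, b) - beta_cdf_int(lower / scale, a, b)


mass = mass_of_rescaled_beta(0.05, 0.4)
print(f"P(0.05 < a < 0.4) = {mass:.4f}")
```

Under these assumptions, nearly all of the prior mass falls inside the stated range, which agrees with the description above.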

Figure 5 shows the posterior path of a permanently non-significant state, that is, a state where the true

β_{j, t}^{i} = 0

for

t = 1, \dots, T

, in the sparse regime. The entire set of states for the triple gamma prior can be found in Appendix C. Note that, while the zero line is contained in the 95% posterior credible interval for all priors, said interval is thinner under the triple gamma prior and the double gamma prior than under the Lasso and the Horseshoe prior.

We calculate the posterior inclusion probabilities based on the thresholding approach introduced in Section 3.2, comparing the triple gamma prior to widely used special cases. In a variance selection context, the posterior inclusion probabilities reflect the uncertainty on whether a state should be time varying or constant over time. Figure 6 shows the posterior inclusion probabilities for the variance of the innovations (

θ_{i j}^{β}

’s) under four different shrinkage priors, for the sparse and the dense scenario, respectively. The cells are shaded in gray when the corresponding true state parameter is time-varying (

θ_{i j}^{β} \neq 0

), while the background is white when the corresponding true state parameter is not time-varying (

θ_{i j}^{β} = 0

). In this simulated example, the posterior inclusion probabilities under the triple gamma prior are consistently higher for the variances that are actually different from 0, even when they are very small. This outcome is in line with the analytical results derived in Section 2.2, which show that the tails of the triple gamma prior are heavier than those of the other priors.

5.4. Modeling Macroeconomic and Financial Variables in the Euro Area

Our application investigates a subset of the area-wide model for the euro area of Fagan et al. (2005), which comprises quarterly macroeconomic data spanning from 1970 to 2017. We include seven of the variables present in the dataset, namely real output (YER), prices (YED), short-term interest rate (STN), investment (ITR), consumption (PCR), exchange rate (EEN) and unemployment (URX). A more detailed description of the data and the transformations performed to make the time series stationary can be found in Table A1 in Appendix D. To stay in line with the literature (see, for example, Feldkircher et al. (2017)), we estimate a TVP-VAR-SV model with

p = 2

lags on all endogenous variables. The hyperparameter choices are the same as in Section 5.3. As in the example with simulated data, we run the algorithm for

M = 200,000

iterations, discarding the first 100,000 iterations as burn-in, and then keeping every 100th draw.
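The burn-in and thinning scheme can be made concrete with a short sketch. It assumes that the retained draw is the last one in each block of 100 post-burn-in iterations, a convention the text does not specify; the helper name is hypothetical.

```python
def retained_draw_indices(n_iter, burn_in, thin):
    """Return 0-based indices of the MCMC draws kept after burn-in and thinning."""
    if not 0 <= burn_in < n_iter or thin < 1:
        raise ValueError("need 0 <= burn_in < n_iter and thin >= 1")
    # Keep the last draw of every block of `thin` post-burn-in iterations.
    return list(range(burn_in + thin - 1, n_iter, thin))


kept = retained_draw_indices(n_iter=200_000, burn_in=100_000, thin=100)
print(len(kept))
```

With M = 200,000 iterations, a burn-in of 100,000 and a thinning factor of 100, posterior summaries are based on 1000 retained draws.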

Figure 7 and Figure 8 display the posterior inclusion probabilities for the means of the initial states and the innovation variances of the VAR coefficients, respectively. A few things about Figure 7 are noteworthy. First, the posterior inclusion probabilities on the diagonal, that is, those belonging to each equation’s own autoregressive term, tend to be the highest, while off-diagonal elements are more likely to be excluded. Second, the equation for the short-term interest rate is characterized by a large number of parameters with a high inclusion probability, across all priors. Third, the first lag tends to have higher posterior inclusion probabilities than the second lag, which is in line with the literature. In most cases, the triple gamma prior has either the largest or the smallest posterior inclusion probability among the priors considered. This reflects the fact that the triple gamma prior places more mass on the edges of the shrinkage profile, as illustrated in Section 3.

Now, we shift our focus to the posterior inclusion probabilities for the

θ_{i j}^{β}

’s plotted in Figure 8. Compared to the means of the initial states, almost all inclusion probabilities are essentially zero. This lack of time variation is unsurprising, as it is well known (see, e.g., Feldkircher et al. (2017)) that stochastic volatility in a TVP-VAR model for macroeconomic variables can explain a large part of the variability in the data. However, the triple gamma prior appears to allow posterior distributions that place slightly more mass on models with some time variation, in particular with respect to the financial variables.

Figure 9 and Figure 10 display the posterior median of

β_{i j}^{β}

and

|{\sqrt{θ}}_{i j}^{β}|

, respectively. Here the triple gamma can be seen to be quite conservative, both in which parameters it includes and in their magnitude. In particular, the medians of the

|{\sqrt{θ}}_{i j}^{β}|

are interesting, as they are closest to zero under the triple gamma prior, despite having the highest posterior inclusion probabilities among all considered priors.

In Figure A3 and Figure A4 in Appendix D, all the posterior paths of

Φ_{1, t}

and

Φ_{2, t}

under the triple gamma prior are shown.

6. Conclusions

In the present paper, shrinkage for time-varying parameter (TVP) models was investigated within a Bayesian framework with the goal of automatically reducing time-varying parameters to static ones if the model is overfitting. This goal was achieved by suggesting the triple gamma prior as a new shrinkage prior for the process variances of varying coefficients, extending previous work using spike-and-slab priors, the Bayesian Lasso, or the double gamma prior. The triple gamma prior is related to the normal-gamma-gamma prior applied for variable selection in highly structured regression models (Griffin and Brown 2017). It contains the well-known Horseshoe prior as a special case; however, it is more flexible, with two shape parameters that control concentration at zero and the tail behaviour. This leads to a BMA-type behaviour which allows not only variance shrinkage, but also variance selection.

In our application, we considered time-varying parameter VAR models with stochastic volatility. Overall, our findings suggest that the family of triple gamma priors introduced in this paper for sparse TVP models is successful in avoiding overfitting, if coefficients are, indeed, static or even insignificant. The framework developed in this paper is very general and holds the promise to be useful for introducing sparsity in other TVP and state space models in many different settings.

A number of extensions seem to be worth pursuing. First, the triple gamma prior is relevant not only for TVP models, but for any model containing variance parameters, such as random-effect models or Bayesian P-spline models (Scheipl and Kneib 2009). Second, modifications of the triple gamma prior seem sensible, in particular in ultra-sparse settings. Currently, the hyperprior for the global shrinkage parameter of the triple gamma prior is selected in such a way that it implies a uniform prior on “model size”. A generalization of Theorem 3 would allow the choice of hyperpriors that induce higher sparsity. Furthermore, in the variable selection literature, special priors such as the Horseshoe+ (Bhadra et al. 2017a) were suggested for ultra-sparse, high-dimensional settings. Exploiting once more the non-centered parametrization of a state space model, it is straightforward to extend this prior to variance selection using the following hierarchical representation:

\begin{matrix} {\sqrt{θ}}_{j} | κ_{j}^{2}, ξ_{j}^{2} \sim N (0, \frac{2}{κ_{B}^{2}} κ_{j}^{2} ξ_{j}^{2}), κ_{j} \sim t_{1}, ξ_{j} \sim t_{1} . \end{matrix}

We leave these extensions for future research.
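To make the Horseshoe+ type representation above more tangible, the following sketch draws from it for a fixed value of the global parameter. It is only an illustration under stated assumptions: the global parameter is treated as known, the t_1 distribution is sampled as a standard Cauchy through its inverse CDF, and the function name is hypothetical.

```python
import math
import random


def draw_horseshoe_plus_sqrt_theta(kappa_b_sq, d, rng):
    """Draw d values of sqrt(theta_j) from the hierarchical representation above."""
    draws = []
    for _ in range(d):
        # A t_1 (standard Cauchy) draw via the inverse-CDF transform.
        kappa_j = math.tan(math.pi * (rng.random() - 0.5))
        xi_j = math.tan(math.pi * (rng.random() - 0.5))
        # Conditional standard deviation: sqrt(2 / kappa_B^2) * |kappa_j * xi_j|.
        sd = math.sqrt(2.0 / kappa_b_sq) * abs(kappa_j * xi_j)
        draws.append(rng.gauss(0.0, sd))
    return draws


rng = random.Random(42)
sample = draw_horseshoe_plus_sqrt_theta(kappa_b_sq=2.0, d=20_000, rng=rng)
share_positive = sum(x > 0 for x in sample) / len(sample)
print(share_positive)
```

Because the product of two Cauchy scales is extremely heavy-tailed, such draws combine many values very close to zero with occasional very large ones, which is the behaviour sought in ultra-sparse settings.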

Finally, an important limitation of our approach is that shrinking a variance toward zero implies that a coefficient is fixed over the entire observation period of the time series. In future research we will investigate dynamic shrinkage priors (Kalli and Griffin 2014; Kowal et al. 2019; Ročková and McAlinn 2020) where coefficients can be both fixed and dynamic.

Author Contributions

The authors contributed equally to the work. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest

The authors declare that there is no conflict of interest.

Appendix A. Proofs

Proof of Theorem 1.

To prove Part (a), rewrite prior (6) in the following way by rescaling

ξ_{j}^{2}

and

κ_{j}^{2}

:

\begin{matrix} {\sqrt{θ}}_{j} | {\tilde{ξ}}_{j}^{2}, {\tilde{κ}}_{j}^{2}, κ_{B}^{2} \sim N (0, \frac{2}{κ_{B}^{2}} \frac{{\tilde{ξ}}_{j}^{2}}{{\tilde{κ}}_{j}^{2}}), {\tilde{ξ}}_{j}^{2} | a^{ξ} \sim G (a^{ξ}, a^{ξ}), {\tilde{κ}}_{j}^{2} | c^{ξ} \sim G (c^{ξ}, c^{ξ}), \end{matrix}

(A1)

and use the fact that in (A1) the random variable

ψ_{j}^{2} = {\tilde{ξ}}_{j}^{2} / {\tilde{κ}}_{j}^{2}

follows the F-distribution:

\begin{matrix} ψ_{j}^{2} = \frac{{\tilde{ξ}}_{j}^{2}}{{\tilde{κ}}_{j}^{2}} \sim \frac{G (a^{ξ}, a^{ξ})}{G (c^{ξ}, c^{ξ})} =_{d} F (2 a^{ξ}, 2 c^{ξ}), \end{matrix}

where

p (ψ_{j}^{2})

is given by:

\begin{matrix} p (ψ_{j}^{2}) = \frac{1}{B (a^{ξ}, c^{ξ})} {(\frac{a^{ξ}}{c^{ξ}} ψ_{j}^{2})}^{a^{ξ} - 1} {(1 + \frac{a^{ξ}}{c^{ξ}} ψ_{j}^{2})}^{- (a^{ξ} + c^{ξ})} . \end{matrix}

(A2)

This yields (8).

Using that

η_{j} = 1 / ψ_{j}^{2} \sim F (2 c^{ξ}, 2 a^{ξ})

, we obtain from (8) that

\begin{matrix} p ({\sqrt{θ}}_{j} | κ_{B}^{2}, a^{ξ}, c^{ξ}) = & \frac{\sqrt{κ_{B}^{2}} {(c^{ξ})}^{c^{ξ}}}{\sqrt{4 π} {(a^{ξ})}^{c^{ξ}} B (a^{ξ}, c^{ξ})} \int_{0}^{\infty} exp (- \frac{θ_{j} κ_{B}^{2} η_{j}}{4}) η_{j}^{c^{ξ} - \frac{1}{2}} {(1 + \frac{c^{ξ} η_{j}}{a^{ξ}})}^{- (a^{ξ} + c^{ξ})} d η_{j} . \end{matrix}

A change of variable with

y_{j} = c^{ξ} η_{j} / a^{ξ}

proves Part (b):

\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = & \frac{1}{\sqrt{2 π ϕ^{ξ}} B (a^{ξ}, c^{ξ})} \int_{0}^{\infty} exp (- \frac{θ_{j}}{2 ϕ^{ξ}} y_{j}) y_{j}^{c^{ξ} - \frac{1}{2}} {(1 + y_{j})}^{- (a^{ξ} + c^{ξ})} d y_{j} \\ = & \frac{Γ (c^{ξ} + \frac{1}{2})}{\sqrt{2 π ϕ^{ξ}} B (a^{ξ}, c^{ξ})} U (c^{ξ} + \frac{1}{2}, \frac{3}{2} - a^{ξ}, \frac{θ_{j}}{2 ϕ^{ξ}}), \end{matrix}

where

ϕ^{ξ} = \frac{2 c^{ξ}}{κ_{B}^{2} a^{ξ}}

. □
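The ratio-of-gammas step used in the proof can be checked by simulation. The sketch below assumes that G(a, a) denotes a gamma distribution with shape a and rate a (mean one), and it uses shape values larger than one purely so that the mean of the F-distribution exists; these values are not the ones used in the paper. The sample mean of the ratio is compared with the mean c/(c - 1) of an F(2a, 2c) distribution.

```python
import random


def draw_psi_sq(a, c, rng):
    """Draw psi^2 = xi_tilde^2 / kappa_tilde^2 with xi_tilde^2 ~ G(a, a) and kappa_tilde^2 ~ G(c, c)."""
    # random.gammavariate takes (shape, scale); a rate of a corresponds to a scale of 1 / a.
    xi_sq = rng.gammavariate(a, 1.0 / a)
    kappa_sq = rng.gammavariate(c, 1.0 / c)
    return xi_sq / kappa_sq


rng = random.Random(0)
a, c = 2.0, 5.0  # illustrative values only, chosen so that the F mean is finite
draws = [draw_psi_sq(a, c, rng) for _ in range(200_000)]
sample_mean = sum(draws) / len(draws)
f_mean = c / (c - 1.0)  # mean of F(2a, 2c), valid for c > 1
print(sample_mean, f_mean)
```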

Proof of Theorem 2.

Using Abramowitz and Stegun (1973, 13.5.8), we obtain for a and

1 < b < 2

fixed that

U (a, b, z)

behaves for small z as:

\begin{matrix} U (a, b, z) = \frac{Γ (b - 1)}{Γ (a)} z^{1 - b} + O (1) . \end{matrix}

Since

b = 3 / 2 - a^{ξ}

in the expression for

p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ})

given in (9), the condition

1 < b < 2

is equivalent to

0 < a^{ξ} < 0.5

and this proves Part (a):

\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (\frac{1}{2} - a^{ξ})}{\sqrt{π} {(2 ϕ^{ξ})}^{a^{ξ}} B (a^{ξ}, c^{ξ})} {(\frac{1}{{\sqrt{θ}}_{j}})}^{1 - 2 a^{ξ}} + O (1) . \end{matrix}

For

b = 1

we obtain from Abramowitz and Stegun (1973, 13.5.9) that

U (a, b, z)

behaves for small z as follows:

\begin{matrix} U (a, b, z) = - \frac{1}{Γ (a)} (log z + ψ (a)) + O (| z log z |), \end{matrix}

where

ψ (\cdot)

is the digamma function. Since

b = 1

is equivalent to

a^{ξ} = 0.5

, this proves Part (b):

\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{1}{\sqrt{2 π ϕ^{ξ}} B (a^{ξ}, c^{ξ})} (- log θ_{j} + log (2 ϕ^{ξ}) - ψ (c^{ξ} + \frac{1}{2})) + O (| θ_{j} log θ_{j} |) . \end{matrix}

Using formulas 13.5.10-13.5.12 in Abramowitz and Stegun (1973), we obtain for a and

b < 1

fixed that

U (a, b, z)

behaves for small z as follows:

\begin{matrix} U (a, b, z) = \{\begin{matrix} \frac{Γ (1 - b)}{Γ (1 + a - b)} + O (z^{1 - b}), & 0 < b < 1, \\ \frac{1}{Γ (1 + a)} + O (| z log z |), & b = 0, \\ \frac{Γ (1 - b)}{Γ (1 + a - b)} + O (| z |), & b < 0 . \end{matrix} \end{matrix}

Since

O (z^{1 - b})

with

b < 1

,

O (| z log z |)

and

O (| z |)

converge to 0 as

z \to 0

, we obtain:

\begin{matrix} lim_{z \to 0} U (a, b, z) = \frac{Γ (1 - b)}{Γ (1 + a - b)} . \end{matrix}

This proves Part (c) as condition

b < 1

is equivalent to

a^{ξ} > 0.5

:

\begin{matrix} lim_{{\sqrt{θ}}_{j} \to 0} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (c^{ξ} + \frac{1}{2})}{\sqrt{2 π ϕ^{ξ}} B (a^{ξ}, c^{ξ})} lim_{z \to 0} U (c^{ξ} + \frac{1}{2}, \frac{3}{2} - a^{ξ}, z) = \frac{Γ (c^{ξ} + \frac{1}{2}) Γ (a^{ξ} - \frac{1}{2})}{\sqrt{2 π ϕ^{ξ}} B (a^{ξ}, c^{ξ}) Γ (a^{ξ} + c^{ξ})} . \end{matrix}

Finally, using Abramowitz and Stegun (1973, 13.1.8), we obtain as

z \to \infty

:

\begin{matrix} U (a, b, z) = z^{- a} [1 + O (\frac{1}{z})] . \end{matrix}

Therefore as

{\sqrt{θ}}_{j} \to \infty

\begin{matrix} p ({\sqrt{θ}}_{j} | ϕ^{ξ}, a^{ξ}, c^{ξ}) = \frac{Γ (c^{ξ} + \frac{1}{2}) {(2 ϕ^{ξ})}^{c^{ξ}}}{\sqrt{π} B (a^{ξ}, c^{ξ})} {(\frac{1}{{\sqrt{θ}}_{j}})}^{2 c^{ξ} + 1} [1 + O (\frac{1}{θ_{j}})] . \end{matrix}

□
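As a worked special case, consider the Horseshoe, where both shape parameters equal 1/2. Substituting B(1/2, 1/2) = π, Γ(1) = 1 and ψ(1) = -γ (with γ the Euler-Mascheroni constant) into Part (b) and into the tail expansion above gives the following expressions, which show the logarithmic pole at the origin and the polynomial tail:

```latex
% Horseshoe case: a^\xi = c^\xi = 1/2
\begin{align*}
p(\sqrt{\theta}_j \mid \phi^\xi)
  &= \frac{1}{\pi \sqrt{2 \pi \phi^\xi}}
     \left( - \log \theta_j + \log (2 \phi^\xi) + \gamma \right)
     + O(|\theta_j \log \theta_j|),
  && \sqrt{\theta}_j \to 0, \\
p(\sqrt{\theta}_j \mid \phi^\xi)
  &= \frac{\sqrt{2 \phi^\xi}}{\pi^{3/2}} \, \frac{1}{\theta_j}
     \left[ 1 + O\!\left( \frac{1}{\theta_j} \right) \right],
  && \sqrt{\theta}_j \to \infty .
\end{align*}
```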

Proof of Lemma 1.

To derive representation (a), integrate (A1) with respect to

{\tilde{κ}}_{j}^{2}

, using the common normal-scale mixture representation of the Student-t distribution. Representation (b) is obtained from (10) by rescaling. Representation (c) is obtained from (A1) by rescaling

{\tilde{ξ}}_{j}^{2}

and

{\tilde{κ}}_{j}^{2}

. Finally, by defining

{\tilde{ψ}}_{j}^{2} = \frac{a^{ξ}}{c^{ξ}} ψ_{j}^{2}

, representation (d) follows immediately from (8) and (A2). □

Proof of Theorem 3.

The equivalence of (23) and (24) follows immediately from

ϕ^{ξ} = (c^{ξ} / a^{ξ}) (2 / κ_{B}^{2}) \sim BP (c^{ξ}, a^{ξ}),

since

2 / κ_{B}^{2} \sim F (2 c^{ξ}, 2 a^{ξ})

. In addition, (24) implies that

\begin{matrix} \frac{ϕ^{ξ}}{1 + ϕ^{ξ}} \sim B (c^{ξ}, a^{ξ}) . \end{matrix}

Using representations (13) and (14) of the triple gamma prior, we can show:

\begin{matrix} ρ_{j} < 0.5 \Leftrightarrow ξ_{j}^{2} = \frac{1}{ρ_{j}} - 1 > 1 \Leftrightarrow ϕ^{ξ} {\tilde{ψ}}_{j}^{2} > 1 \Leftrightarrow \frac{1}{1 + {\tilde{ψ}}_{j}^{2}} < \frac{ϕ^{ξ}}{1 + ϕ^{ξ}}, \end{matrix}

where

{\tilde{ψ}}_{j}^{2} \sim BP (a^{ξ}, c^{ξ})

and, consequently,

\begin{matrix} \frac{{\tilde{ψ}}_{j}^{2}}{1 + {\tilde{ψ}}_{j}^{2}} \sim B (a^{ξ}, c^{ξ}) \Leftrightarrow \frac{1}{1 + {\tilde{ψ}}_{j}^{2}} \sim B (c^{ξ}, a^{ξ}) . \end{matrix}

Hence,

π^{ξ} = \Pr (ρ_{j} < 0.5) = F_{X} (Y)

, where

F_{X}

is the cdf of a random variable

X \sim B (c^{ξ}, a^{ξ})

and the random variable

Y \sim B (c^{ξ}, a^{ξ})

arises from the same distribution. It follows immediately that

π^{ξ} \sim U [0, 1]

. □
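The uniformity result can be illustrated with a nested Monte Carlo experiment. This is a rough sketch rather than part of the authors' procedure: the beta prime distributions are simulated as ratios of independent unit-scale gamma draws, the shape values 0.3 and 0.4 are arbitrary choices inside (0, 0.5), and the inner probability is only estimated, so the resulting values are uniform up to simulation noise.

```python
import random


def model_size_probability(phi_num, phi_den, a, c, rng, n_inner=200):
    """Monte Carlo estimate of Pr(phi * psi_tilde^2 > 1 | phi), with phi = phi_num / phi_den."""
    hits = 0
    for _ in range(n_inner):
        # psi_tilde^2 ~ BP(a, c) is a ratio of independent G(a, 1) and G(c, 1) draws.
        psi_num = rng.gammavariate(a, 1.0)
        psi_den = rng.gammavariate(c, 1.0)
        # The event phi * psi_tilde^2 > 1, written without divisions.
        if phi_num * psi_num > phi_den * psi_den:
            hits += 1
    return hits / n_inner


def simulate_pi(a, c, n_outer, rng):
    """Draw phi ~ BP(c, a) repeatedly and estimate Pr(rho_j < 0.5 | phi) for each draw."""
    pis = []
    for _ in range(n_outer):
        phi_num = rng.gammavariate(c, 1.0)
        phi_den = rng.gammavariate(a, 1.0)
        pis.append(model_size_probability(phi_num, phi_den, a, c, rng))
    return pis


rng = random.Random(1)
pis = simulate_pi(a=0.3, c=0.4, n_outer=2000, rng=rng)
mean_pi = sum(pis) / len(pis)
share_below_quarter = sum(p < 0.25 for p in pis) / len(pis)
print(mean_pi, share_below_quarter)
```

If the simulated probabilities are approximately uniform on [0, 1], their mean should be close to 0.5 and roughly a quarter of them should fall below 0.25.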

Appendix B. Details on the MCMC Scheme

In Step (b),

\begin{matrix} p (a^{ξ} | z_{- a^{ξ}}, y) & \propto \prod_{j = 1}^{d} p ({\sqrt{θ}}_{j} | {\overset{ˇ}{κ}}_{j}^{2}, ϕ^{ξ}) p (κ_{B}^{2} | a^{ξ}, c^{ξ}) p (a^{ξ}), \end{matrix}

where

p (κ_{B}^{2} | a^{ξ}, c^{ξ})

is given by:

\begin{matrix} p (κ_{B}^{2} | a^{ξ}, c^{ξ}) = \frac{1}{2^{a^{ξ}} B (a^{ξ}, c^{ξ})} {(\frac{a^{ξ}}{c^{ξ}} κ_{B}^{2})}^{a^{ξ} - 1} {(1 + \frac{a^{ξ}}{2 c^{ξ}} κ_{B}^{2})}^{- (a^{ξ} + c^{ξ})} . \end{matrix}

Therefore,

\begin{matrix} p (a^{ξ} | z_{- a^{ξ}}, y) & \propto \frac{2^{- d a^{ξ}}}{{Γ (a^{ξ})}^{d}} {(a^{ξ})}^{d (a^{ξ} + 1 / 2) / 2} {(\frac{κ_{B}^{2}}{c^{ξ}})}^{d a^{ξ} / 2} \\ \cdot {(\prod_{j = 1}^{d} {\overset{ˇ}{κ}}_{j}^{2} θ_{j})}^{a^{ξ} / 2} \prod_{j = 1}^{d} K_{a^{ξ} - 1 / 2} (\sqrt{{\overset{ˇ}{κ}}_{j}^{2} κ_{B}^{2} a^{ξ} | θ_{j} | / c^{ξ}}) \\ \cdot \frac{1}{2^{a^{ξ}} B (a^{ξ}, c^{ξ})} {(\frac{a^{ξ}}{c^{ξ}} κ_{B}^{2})}^{a^{ξ} - 1} {(1 + \frac{a^{ξ}}{2 c^{ξ}} κ_{B}^{2})}^{- (a^{ξ} + c^{ξ})} {(2 a^{ξ})}^{α_{a^{ξ}} - 1} {(1 - 2 a^{ξ})}^{β_{a^{ξ}} - 1} . \end{matrix}

(A3)

Hence,

log q_{a} (a^{ξ})

is given by (using

Γ (a^{ξ}) = Γ (a^{ξ} + 1) / a^{ξ}

):

\begin{array}{l} log q_{a} (a^{ξ}) & = a^{ξ} (- d log 2 + \frac{d}{2} log κ_{B}^{2} - \frac{d}{2} log c^{ξ} + \frac{1}{2} \sum_{j = 1}^{d} log {\overset{ˇ}{κ}}_{j}^{2} + \frac{1}{2} \sum_{j = 1}^{d} log θ_{j}) \\ + \frac{5}{4} d log a^{ξ} + d \frac{a^{ξ}}{2} log a^{ξ} - d log Γ (a^{ξ} + 1) \\ + \sum_{j = 1}^{d} log K_{a^{ξ} - 1 / 2} (\sqrt{{\overset{ˇ}{κ}}_{j}^{2} κ_{B}^{2} a^{ξ} | θ_{j} | / c^{ξ}}) (prior on θ_{j}) \\ - log B (a^{ξ}, c^{ξ}) + a^{ξ} (log a^{ξ} + log (\frac{κ_{B}^{2}}{2 c^{ξ}})) - log a^{ξ} - (a^{ξ} + c^{ξ}) log (1 + \frac{a^{ξ} κ_{B}^{2}}{2 c^{ξ}}) (prior on κ_{B}^{2}) \\ + (α_{a^{ξ}} - 1) log (2 a^{ξ}) + (β_{a^{ξ}} - 1) log (1 - 2 a^{ξ}) (prior on a^{ξ}) \\ + log a^{ξ} + log (0.5 - a^{ξ}) (change of variable) \end{array}

(A4)

In Step (c),

\begin{matrix} p ({\overset{ˇ}{ξ}}_{j}^{2} | z_{- {\overset{ˇ}{ξ}}_{j}^{2}}, y) & \propto p ({\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, {\overset{ˇ}{κ}}_{j}^{2}, ϕ^{ξ}) p ({\overset{ˇ}{ξ}}_{j}^{2} | a^{ξ}) \\ \propto {({\overset{ˇ}{ξ}}_{j}^{2})}^{- 1 / 2} exp \{- \frac{{\overset{ˇ}{κ}}_{j}^{2}}{2 ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2}} θ_{j}\} \cdot {({\overset{ˇ}{ξ}}_{j}^{2})}^{a^{ξ} - 1} exp \{- {\overset{ˇ}{ξ}}_{j}^{2}\} \\ = {({\overset{ˇ}{ξ}}_{j}^{2})}^{a^{ξ} - 1 / 2 - 1} exp \{- \frac{1}{2} (\frac{{\overset{ˇ}{κ}}_{j}^{2} θ_{j}}{ϕ^{ξ}} \frac{1}{{\overset{ˇ}{ξ}}_{j}^{2}} + 2 {\overset{ˇ}{ξ}}_{j}^{2})\}, \end{matrix}

(A5)

which is equal to the GIG-distribution given in (25).

In Step (d),

\begin{matrix} p (c^{ξ} | z_{- c^{ξ}}, y) & \propto \prod_{j = 1}^{d} p ({\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, c^{ξ}, κ_{B}^{2}) p (κ_{B}^{2} | a^{ξ}, c^{ξ}) p (c^{ξ}) \\ \propto \prod_{j = 1}^{d} \frac{Γ (\frac{2 c^{ξ} + 1}{2})}{Γ (\frac{2 c^{ξ}}{2}) {(2 π ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2})}^{1 / 2}} {(1 + \frac{θ_{j}}{2 {\overset{ˇ}{ξ}}_{j}^{2} ϕ^{ξ}})}^{- \frac{2 c^{ξ} + 1}{2}} \\ \cdot \frac{1}{2^{a^{ξ}} B (a^{ξ}, c^{ξ})} {(\frac{a^{ξ}}{c^{ξ}} κ_{B}^{2})}^{a^{ξ} - 1} {(1 + \frac{a^{ξ}}{2 c^{ξ}} κ_{B}^{2})}^{- (a^{ξ} + c^{ξ})} {(2 c^{ξ})}^{α_{c^{ξ}} - 1} {(1 - 2 c^{ξ})}^{β_{c^{ξ}} - 1} . \end{matrix}

(A6)

Hence,

log q_{c} (c^{ξ})

is given by (using

Γ (c^{ξ}) = Γ (c^{ξ} + 1) / c^{ξ}

):

\begin{array}{l} log q_{c} (c^{ξ}) & = d log Γ (c^{ξ} + 0.5) - d log Γ (c^{ξ} + 1) + \frac{d}{2} log c^{ξ} \\ - (c^{ξ} + 0.5) (\sum_{j = 1}^{d} log (4 c^{ξ} {\overset{ˇ}{ξ}}_{j}^{2} + θ_{j} κ_{B}^{2} a^{ξ}) - \sum_{j = 1}^{d} log (4 c^{ξ} {\overset{ˇ}{ξ}}_{j}^{2})) (prior on θ_{j}) \\ - log B (a^{ξ}, c^{ξ}) - (a^{ξ} - 1) log c^{ξ} - (a^{ξ} + c^{ξ}) log (1 + \frac{a^{ξ} κ_{B}^{2}}{2 c^{ξ}}) (prior on κ_{B}^{2}) \\ + (α_{c^{ξ}} - 1) log (2 c^{ξ}) + (β_{c^{ξ}} - 1) log (1 - 2 c^{ξ}) (prior on c^{ξ}) \\ + log c^{ξ} + log (0.5 - c^{ξ}) (change of variable) \end{array}

(A7)
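For readers who want to evaluate this log target numerically, a minimal sketch follows. It transcribes the terms of (A7) one by one, with the beta prior term entering as (β - 1) log(1 - 2c), in agreement with the factor (1 - 2c)^(β - 1) in (A6). The function and argument names are hypothetical, and the value is only defined up to an additive constant.

```python
from math import lgamma, log


def log_q_c(c, a, kappa_b_sq, theta, xi_check_sq, alpha_c, beta_c):
    """Evaluate log q_c(c) from (A7), up to an additive constant, for 0 < c < 0.5.

    theta and xi_check_sq are equal-length sequences holding theta_j and the
    local scales (xi-check_j)^2.
    """
    if not 0.0 < c < 0.5:
        raise ValueError("c must lie in (0, 0.5)")
    d = len(theta)
    # Terms from p(sqrt(theta_j) | xi-check_j^2, c, kappa_B^2).
    out = d * lgamma(c + 0.5) - d * lgamma(c + 1.0) + 0.5 * d * log(c)
    out -= (c + 0.5) * sum(
        log(4.0 * c * x + t * kappa_b_sq * a) - log(4.0 * c * x)
        for t, x in zip(theta, xi_check_sq)
    )
    # Terms from the prior on kappa_B^2.
    log_beta_fn = lgamma(a) + lgamma(c) - lgamma(a + c)
    out += -log_beta_fn - (a - 1.0) * log(c)
    out -= (a + c) * log(1.0 + a * kappa_b_sq / (2.0 * c))
    # Rescaled beta prior on c: (2c)^(alpha - 1) * (1 - 2c)^(beta - 1).
    out += (alpha_c - 1.0) * log(2.0 * c) + (beta_c - 1.0) * log(1.0 - 2.0 * c)
    # Change-of-variable term as stated in (A7).
    out += log(c) + log(0.5 - c)
    return out


value = log_q_c(
    c=0.2, a=0.3, kappa_b_sq=2.0,
    theta=[0.1, 0.5, 2.0], xi_check_sq=[0.7, 1.3, 0.4],
    alpha_c=2.0, beta_c=3.0,
)
print(value)
```

Within a Metropolis-Hastings step, only differences of this function between a proposed and a current value matter, so the omitted constant is irrelevant.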

In Step (e),

\begin{matrix} p ({\overset{ˇ}{κ}}_{j}^{2} | z_{- {\overset{ˇ}{κ}}_{j}^{2}}, y) & \propto p ({\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, {\overset{ˇ}{κ}}_{j}^{2}, ϕ^{ξ}) p ({\overset{ˇ}{κ}}_{j}^{2} | c^{ξ}) \\ \propto {({\overset{ˇ}{κ}}_{j}^{2})}^{1 / 2} exp \{- \frac{{\overset{ˇ}{κ}}_{j}^{2}}{2 ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2}} θ_{j}\} \times {({\overset{ˇ}{κ}}_{j}^{2})}^{c^{ξ} - 1} exp \{- {\overset{ˇ}{κ}}_{j}^{2}\} \\ = {({\overset{ˇ}{κ}}_{j}^{2})}^{1 / 2 + c^{ξ} - 1} exp \{- {\overset{ˇ}{κ}}_{j}^{2} (\frac{θ_{j}}{2 ϕ^{ξ} {\overset{ˇ}{ξ}}_{j}^{2}} + 1)\}, \end{matrix}

(A8)

which is equal to the gamma distribution given in (26).

In Step (f),

p (d_{2} | z_{- d_{2}}, y)

is equal to the following gamma distribution:

\begin{matrix} p (d_{2} | z_{- d_{2}}, y) & \propto p (κ_{B}^{2} | d_{2}) p (d_{2} | a^{ξ}, c^{ξ}) \\ \propto {(d_{2})}^{a^{ξ}} exp \{- d_{2} κ_{B}^{2}\} {(d_{2})}^{c^{ξ} - 1} exp \{- d_{2} \frac{2 c^{ξ}}{a^{ξ}}\} \\ = {(d_{2})}^{a^{ξ} + c^{ξ} - 1} exp \{- d_{2} (κ_{B}^{2} + \frac{2 c^{ξ}}{a^{ξ}})\}, \end{matrix}

(A9)

and

\begin{matrix} p (κ_{B}^{2} | z_{- κ_{B}^{2}}, y) & \propto \prod_{j = 1}^{d} p ({\sqrt{θ}}_{j} | {\overset{ˇ}{ξ}}_{j}^{2}, {\overset{ˇ}{κ}}_{j}^{2}, ϕ^{ξ}) p (κ_{B}^{2} | d_{2}) \\ \propto {(κ_{B}^{2})}^{d / 2} exp \{- \frac{κ_{B}^{2} a^{ξ}}{4 c^{ξ}} \sum_{j = 1}^{d} \frac{{\overset{ˇ}{κ}}_{j}^{2}}{{\overset{ˇ}{ξ}}_{j}^{2}} θ_{j}\} \times {(κ_{B}^{2})}^{a^{ξ} - 1} exp \{- d_{2} κ_{B}^{2}\} \\ = {(κ_{B}^{2})}^{d / 2 + a^{ξ} - 1} exp \{- κ_{B}^{2} (\frac{a^{ξ}}{4 c^{ξ}} \sum_{j = 1}^{d} \frac{{\overset{ˇ}{κ}}_{j}^{2}}{{\overset{ˇ}{ξ}}_{j}^{2}} θ_{j} + d_{2})\}, \end{matrix}

(A10)

which is equal to the gamma distribution given in (27).
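The gamma updates in Steps (e) and (f) can be sketched in a few lines. The shapes and rates below are read off the kernels in (A8), (A9) and (A10). The update order, the function name and the argument names are assumptions made for illustration, and the remaining steps of the sampler are not shown.

```python
import random


def update_local_and_global_scales(theta, xi_check_sq, a, c, kappa_b_sq, rng):
    """One pass over the gamma full conditionals in (A8), (A9) and (A10)."""
    d = len(theta)
    phi = 2.0 * c / (kappa_b_sq * a)
    # Step (e), (A8): kappa-check_j^2 ~ G(c + 1/2, theta_j / (2 phi xi-check_j^2) + 1).
    # random.gammavariate takes (shape, scale), so the rate is inverted.
    kappa_check_sq = [
        rng.gammavariate(c + 0.5, 1.0 / (t / (2.0 * phi * x) + 1.0))
        for t, x in zip(theta, xi_check_sq)
    ]
    # Step (f), (A9): d_2 ~ G(a + c, kappa_B^2 + 2c / a).
    d2 = rng.gammavariate(a + c, 1.0 / (kappa_b_sq + 2.0 * c / a))
    # Step (f), (A10): kappa_B^2 ~ G(d/2 + a, a/(4c) * sum_j kappa-check_j^2 theta_j / xi-check_j^2 + d_2).
    rate = a / (4.0 * c) * sum(
        k * t / x for k, t, x in zip(kappa_check_sq, theta, xi_check_sq)
    ) + d2
    new_kappa_b_sq = rng.gammavariate(0.5 * d + a, 1.0 / rate)
    return kappa_check_sq, d2, new_kappa_b_sq


rng = random.Random(7)
theta = [0.1, 0.5, 2.0]
xi_check_sq = [0.7, 1.3, 0.4]
kappa_check_sq, d2, new_kappa_b_sq = update_local_and_global_scales(
    theta, xi_check_sq, a=0.3, c=0.4, kappa_b_sq=2.0, rng=rng
)
print(kappa_check_sq, d2, new_kappa_b_sq)
```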

For a symmetric triple gamma prior, where

a^{ξ} = c^{ξ}

, Step (b) is modified in the following way, if Step (d) is dropped:

\begin{matrix} q_{a} (a^{ξ}) = p (a^{ξ} | z_{- a^{ξ}}, y) \prod_{j = 1}^{d} p ({\overset{ˇ}{κ}}_{j}^{2} | c^{ξ} = a^{ξ}) \propto p (a^{ξ} | z_{- a^{ξ}}, y) \frac{1}{Γ {(a^{ξ})}^{d}} {(\prod_{j = 1}^{d} {\overset{ˇ}{κ}}_{j}^{2})}^{a^{ξ}}, \end{matrix}

(A11)

where

p (a^{ξ} | z_{- a^{ξ}}, y)

is given by (A3). If Step (b) is dropped, then Step (d) is modified in the following way:

\begin{matrix} q_{c} (c^{ξ}) = p (c^{ξ} | z_{- c^{ξ}}, y) \prod_{j = 1}^{d} p ({\overset{ˇ}{ξ}}_{j}^{2} | a^{ξ} = c^{ξ}) \propto p (c^{ξ} | z_{- c^{ξ}}, y) \frac{1}{Γ {(c^{ξ})}^{d}} {(\prod_{j = 1}^{d} {\overset{ˇ}{ξ}}_{j}^{2})}^{c^{ξ}}, \end{matrix}

(A12)

where

p (c^{ξ} | z_{- c^{ξ}}, y)

is given by (A6).

Appendix C. Posterior Paths for the Simulated Data

Figure A1. Each cell represents the corresponding state of the matrix

Φ_{1, t}

, for

t = 1, \dots, T

, for the sparse regime described in Section 5.3. The solid line is the median and the shaded areas represent

50 %

and

95 %

posterior credible intervals under the triple gamma prior.

Figure A2. Each cell represents the corresponding state of the matrix

Φ_{1, t}

, for

t = 1, \dots, T

, for the dense regime described in Section 5.3. The solid line is the median and the shaded areas represent

50 %

and

95 %

posterior credible intervals under the triple gamma prior.

Appendix D. Application

Appendix D.1. Data Overview

Table A1. Data overview.

Variable	Abbreviation	Description	Tcode
Real output	YER	Gross domestic product (GDP) at market prices in millions of Euros, chain linked volume, calendar and seasonally adjusted data, reference year 1995.	1
Prices	YED	GDP deflator, index base year 1995. Defined as the ratio of nominal and real GDP.	1
Short-term interest rate	STN	Nominal short-term interest rate, Euribor 3-month, percent per annum	2
Investment	ITR	Gross fixed capital formation in millions of Euros, chain linked volume, calendar and seasonally adjusted data, reference year 1995.	1
Consumption	PCR	Individual consumption expenditure in millions of Euros, chain linked volume, calendar and seasonally adjusted data, reference year 1995.	1
Exchange rate	EEN	Nominal effective exchange rate, Euro area-19 countries vis-à-vis the NEER-38 group of main trading partners, index base Q1 1999.	1
Unemployment	URX	Unemployment rate, percentage of civilian work force, total across age and sex, seasonally adjusted, but not working day adjusted.	2

Note: Data was retrieved from https://eabcn.org/page/area-wide-model. Tcode

= 1

indicates that differences of logs were taken, while Tcode

= 2

implies that the raw data was used.
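The two transformation codes in the note to Table A1 can be expressed as a small helper. This is a sketch with a hypothetical function name; it assumes plain first differences of natural logs for Tcode 1, without any rescaling to percentages or annual rates.

```python
import math


def apply_tcode(series, tcode):
    """Transform a series according to the Tcode column of Table A1."""
    if tcode == 1:
        # Differences of logs; the result is one observation shorter.
        logs = [math.log(x) for x in series]
        return [curr - prev for prev, curr in zip(logs, logs[1:])]
    if tcode == 2:
        # Raw data, returned as a copy.
        return list(series)
    raise ValueError(f"unknown Tcode: {tcode}")


growth = apply_tcode([100.0, 102.0, 103.5], tcode=1)
rates = apply_tcode([3.2, 3.1, 2.9], tcode=2)
print(growth, rates)
```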

Appendix D.2. Posterior Paths

Figure A3. Each cell represents the corresponding state of the matrix

Φ_{1, t}

, for

t = 1, \dots, T

, for the data described in Section 5.4. The solid line is the median and the shaded areas represent

50 %

and

95 %

posterior credible intervals under the triple gamma prior.

Figure A4. Each cell represents the corresponding state of the matrix

Φ_{2, t}

, for

t = 1, \dots, T

, for the data described in Section 5.4. The solid line is the median and the shaded areas represent

50 %

and

95 %

posterior credible intervals under the triple gamma prior.

References

Abramowitz, Milton, and Irene A. Stegun, eds. 1973. Handbook of Mathematical Functions. New York: Dover Publications. [Google Scholar]
Armagan, Artin, David B. Dunson, and Merlise Clyde. 2011. Generalized beta mixtures of Gaussians. In Advances in Neural Information Processing Systems. Vancouver: NIPS, pp. 523–31. [Google Scholar]
Belmonte, Miguel, Gary Koop, and Dimitris Korobolis. 2014. Hierarchical shrinkage in time-varying parameter models. Journal of Forecasting 33: 80–94. [Google Scholar] [CrossRef]
Berger, James O. 1980. A robust generalized Bayes estimator and confidence region for a multivariate normal mean. The Annals of Statistics 8: 716–61. [Google Scholar] [CrossRef]
Bhadra, Anindya, Jyotishka Datta, Nicholas G. Polson, and Brandon Willard. 2017a. The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis 12: 1105–31. [Google Scholar] [CrossRef]
Bhadra, Anindya, Jyotishka Datta, Nicholas G. Polson, and Brandon Willard. 2017b. Horseshoe regularization for feature subset selection. arXiv arXiv:1702.07400. [Google Scholar] [CrossRef]
Bhadra, Anindya, Jyotishka Datta, Nicholas G. Polson, and Brandon Willard. 2019. Lasso meets horseshoe: A survey. Statistical Science 34: 405–27. [Google Scholar] [CrossRef]
Bitto, Angela, and Sylvia Frühwirth-Schnatter. 2019. Achieving shrinkage in a time-varying parameter model framework. Journal of Econometrics 210: 75–97. [Google Scholar] [CrossRef]
Brown, Philip J., Marina Vannucci, and Tom Fearn. 2002. Bayes model averaging with selection of regressors. Journal of the Royal Statistical Society, Ser. B 64: 519–36. [Google Scholar] [CrossRef]
Carriero, Andrea, Todd E. Clark, and Massimiliano Marcellino. 2019. Large Bayesian vector autoregressions with stochastic volatility and non-conjugate priors. Journal of Econometrics 212: 137–54. [Google Scholar] [CrossRef]
Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2009. Handling sparsity via the horseshoe. Journal of Machine Learing Research W&CP 5: 73–80. [Google Scholar]
Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2010. The horseshoe estimator for sparse signals. Biometrika 97: 465–80. [Google Scholar] [CrossRef]
Chan, Joshua C. C., and Eric Eisenstat. 2016. Bayesian model comparison for time-varying parameter VARs with stochastic volatilty. Journal of Applied Econometrics 218: 1–24. [Google Scholar]
Cottet, Remy, Robert J. Kohn, and David J. Nott. 2008. Variable selection and model averaging in semiparametric overdispersed generalized linear models. Journal of the American Statistical Association 103: 661–71. [Google Scholar] [CrossRef]
Del Negro, Marco, and Giorgio E. Primiceri. 2015. Time Varying Structural Vector Autoregressions and Monetary Policy: A Corrigendum. The Review of Economic Studies 82: 1342–45. [Google Scholar] [CrossRef]
Eisenstat, Eric, Joshua C.C. Chan, and Rodney W. Strachan. 2014. Stochastic model specification search for time-varying parameter VARs. SSRN Electronic Journal 01/2014. [Google Scholar] [CrossRef]
Fagan, Gabriel, Jerome Henry, and Ricardo Mestre. 2005. An area-wide model for the euro area. Economic Modelling 22: 39–59. [Google Scholar] [CrossRef]
Fahrmeir, Ludwig, Thomas Kneib, and Susanne Konrath. 2010. Bayesian regularisation in structured additive regression: A unifying perspective on shrinkage, smoothing and predictor selection. Statistics and Computing 20: 203–19. [Google Scholar] [CrossRef]
Feldkircher, Martin, Florian Huber, and Gregor Kastner. 2017. Sophisticated and small versus simple and sizeable: When does it pay off to introduce drifting coefficients in Bayesian VARs. arXiv arXiv:1711.00564. [Google Scholar]
Fernández, Carmen, Eduardo Ley, and Mark F. J. Steel. 2001. Benchmark priors for Bayesian model averaging. Journal of Econometrics 100: 381–427. [Google Scholar] [CrossRef]
Figueiredo, Mario A. T. 2003. Adaptive sparseness for supervised learning. IEEE Transaction on Pattern Analysis and Machine Intelligence 25: 1150–59. [Google Scholar] [CrossRef]
Frühwirth-Schnatter, Sylvia. 2004. Efficient Bayesian parameter estimation. In State Space and Unobserved Component Models: Theory and Applications. Edited by Andrew Harvey, Siem Jan Koopman and Neil Shephard. Cambridge: Cambridge University Press, pp. 123–51. [Google Scholar]
Frühwirth-Schnatter, Sylvia, and Regina Tüchler. 2008. Bayesian parsimonious covariance estimation for hierarchical linear mixed models. Statistics and Computing 18: 1–13. [Google Scholar] [CrossRef]
Frühwirth-Schnatter, Sylvia, and Helga Wagner. 2010. Stochastic model specification search for Gaussian and partially non-Gaussian state space models. Journal of Econometrics 154: 85–100. [Google Scholar] [CrossRef]
Gelman, Andrew. 2006. Prior distributions for variance parameters in hierarchical models (Comment on Article by Browne and Draper). Bayesian Analysis 1: 515–34. [Google Scholar] [CrossRef]
Griffin, Jim E., and Phil J. Brown. 2011. Bayesian hyper-lassos with non-convex penalization. Australian & New Zealand Journal of Statistics 53: 423–42. [Google Scholar]
Griffin, Jim E., and Phil J. Brown. 2017. Hierarchical shrinkage priors for regression models. Bayesian Analysis 12: 135–59. [Google Scholar] [CrossRef]
Jacquier, Eric, Nicholas G. Polson, and Peter E. Rossi. 1994. Bayesian analysis of stochastic volatility models. Journal of Business & Economic Statistics 12: 371–417. [Google Scholar]
Johnstone, Iain M., and Bernard W. Silverman. 2004. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics 32: 1594–649.
Kalli, Maria, and Jim E. Griffin. 2014. Time-varying sparsity in dynamic regression models. Journal of Econometrics 178: 779–93.
Kastner, Gregor. 2016. Dealing with stochastic volatility in time series using the R package stochvol. Journal of Statistical Software 69: 1–30.
Kastner, Gregor, and Sylvia Frühwirth-Schnatter. 2014. Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics and Data Analysis 76: 408–23.
Kleijn, Richard, and Herman K. van Dijk. 2006. Bayes model averaging of cyclical decompositions in economic time series. Journal of Applied Econometrics 21: 191–212.
Koop, Gary, and Dimitris Korobilis. 2013. Large time-varying parameter VARs. Journal of Econometrics 177: 185–98.
Koop, Gary, and Simon M. Potter. 2004. Forecasting in dynamic factor models using Bayesian model averaging. Econometrics Journal 7: 550–65.
Kowal, Daniel R., David S. Matteson, and David Ruppert. 2019. Dynamic shrinkage processes. Journal of the Royal Statistical Society, Ser. B 81: 673–806.
Ley, Eduardo, and Mark F. J. Steel. 2009. On the effect of prior assumptions in Bayesian model averaging with applications to growth regression. Journal of Applied Econometrics 24: 651–74.
Makalic, Enes, and Daniel F. Schmidt. 2016. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters 23: 179–82.
Nakajima, Jouchi. 2011. Time-varying parameter VAR model with stochastic volatility: An overview of methodology and empirical applications. Monetary and Economic Studies 29: 107–42.
Park, Trevor, and George Casella. 2008. The Bayesian Lasso. Journal of the American Statistical Association 103: 681–86.
Pérez, Maria-Eglée, Luis Raúl Pericchi, and Isabel Cristina Ramírez. 2017. The scaled beta2 distribution as a robust prior for scales. Bayesian Analysis 12: 615–37.
Polson, Nicholas G., and James G. Scott. 2011. Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bayesian Statistics 9. Edited by José M. Bernardo, M. J. Bayarri, James O. Berger, Phil Dawid, David Heckerman, Adrian F. M. Smith and Mike West. Oxford: Oxford University Press, pp. 501–38.
Polson, Nicholas G., and James G. Scott. 2012a. Local shrinkage rules, Lévy processes, and regularized regression. Journal of the Royal Statistical Society, Ser. B 74: 287–311.
Polson, Nicholas G., and James G. Scott. 2012b. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis 7: 887–902.
Primiceri, Giorgio E. 2005. Time varying structural vector autoregressions and monetary policy. Review of Economic Studies 72: 821–52.
Raftery, Adrian E., David Madigan, and Jennifer A. Hoeting. 1997. Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92: 179–91.
Ročková, Veronika, and Kenichiro McAlinn. 2020. Dynamic variable selection with spike-and-slab process priors. Bayesian Analysis.
Sala-i-Martin, Xavier, Gernot Doppelhofer, and Ronald I. Miller. 2004. Determinants of long-term growth: A Bayesian averaging of classical estimates (BACE) approach. The American Economic Review 94: 813–35.
Scheipl, Fabian, and Thomas Kneib. 2009. Locally adaptive Bayesian p-splines with a normal-exponential-gamma prior. Computational Statistics and Data Analysis 53: 3533–52.
Strawderman, William E. 1971. Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics 42: 385–88.
Tibshirani, Robert. 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Ser. B 58: 267–88.
van der Pas, Stéphanie, Bas Kleijn, and Aad van der Vaart. 2014. The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics 8: 2585–618.
Zhang, Yan, Brian J. Reich, and Howard D. Bondell. 2017. High dimensional linear regression via the R2-D2 shrinkage prior. Technical report. arXiv:1609.00046v2.

1	Let $f_{{\sqrt{θ}}_{j}} (x)$ and $F_{{\sqrt{θ}}_{j}} (x)$ be, respectively, the pdf and cdf of the random variable ${\sqrt{θ}}_{j}$. For $x > 0$, the cdf $F_{θ_{j}} (x)$ of the random variable $θ_{j}$ is given by $\begin{matrix} F_{θ_{j}} (x) = \Pr (θ_{j} \leq x) = \Pr (- \sqrt{x} \leq {\sqrt{θ}}_{j} \leq \sqrt{x}) = F_{{\sqrt{θ}}_{j}} (\sqrt{x}) - F_{{\sqrt{θ}}_{j}} (- \sqrt{x}) = 2 F_{{\sqrt{θ}}_{j}} (\sqrt{x}) - 1, \end{matrix}$ since $f_{{\sqrt{θ}}_{j}} (x)$ is symmetric around 0, so that $F_{{\sqrt{θ}}_{j}} (- \sqrt{x}) = 1 - F_{{\sqrt{θ}}_{j}} (\sqrt{x})$. The pdf $f_{θ_{j}} (x)$ is obtained by taking the first derivative of $F_{θ_{j}} (x)$ with respect to $x$: $\begin{matrix} f_{θ_{j}} (x) = \frac{d F_{θ_{j}} (x)}{d x} = f_{{\sqrt{θ}}_{j}} (\sqrt{x}) / \sqrt{x} . \end{matrix}$
2	Note that the $BP (a, b)$-distribution has pdf $f (x) = \frac{1}{B (a, b)} \frac{x^{a - 1}}{{(1 + x)}^{a + b}} .$ Furthermore, if $X \sim BP (a, b)$, then $Y = X / (1 + X)$ follows the $B (a, b)$-distribution.
3	The pdf of a $SBeta 2 (a, c, ϕ)$-distribution reads: $\begin{matrix} f (x) = \frac{1}{ϕ^{a} B (a, c)} x^{a - 1} {(1 + x / ϕ)}^{- (a + c)} . \end{matrix}$
4	Using (3), we obtain the following prior for $ρ_{j} = 1 / (1 + ψ_{j}^{2})$ by the law of transformation of densities: $\begin{matrix} p (ρ_{j}) = \frac{1}{Γ (a^{ξ})} {(\frac{a^{ξ} κ_{B}^{2}}{2})}^{a^{ξ}} {(1 - ρ_{j})}^{a^{ξ} - 1} {ρ_{j}}^{- (a^{ξ} + 1)} \exp (- (\frac{1 - ρ_{j}}{ρ_{j}}) \frac{a^{ξ} κ_{B}^{2}}{2}) . \end{matrix}$
5	The pdf of the $GIG (p, a, b)$-distribution is given by $\begin{matrix} f (x) = \frac{{(a / b)}^{p / 2}}{2 K_{p} (\sqrt{a b})} x^{p - 1} e^{- \frac{1}{2} (a x + b / x)}, \end{matrix}$ where $K_{p} (z)$ is the modified Bessel function of the second kind.
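
The transformations in footnotes 1 and 2 can be checked numerically. The following is a minimal sketch and not part of the original article: it assumes, purely for illustration, that ${\sqrt{θ}}_{j}$ is standard normal (any density symmetric around 0 would do), and all function names are hypothetical.

```python
import math
import random


def norm_pdf(x):
    """Standard normal pdf, used as an example density symmetric around 0."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf_theta(x):
    """Footnote 1: cdf of theta = (sqrt_theta)^2, valid for x > 0."""
    return 2.0 * norm_cdf(math.sqrt(x)) - 1.0


def pdf_theta(x):
    """Footnote 1: pdf of theta, f(sqrt(x)) / sqrt(x), valid for x > 0."""
    return norm_pdf(math.sqrt(x)) / math.sqrt(x)


def max_derivative_gap(points, h=1e-5):
    """Largest gap between a central difference of the cdf and the pdf."""
    return max(
        abs((cdf_theta(x + h) - cdf_theta(x - h)) / (2.0 * h) - pdf_theta(x))
        for x in points
    )


def monte_carlo_cdf(x, n=100_000, seed=1):
    """Empirical Pr(Z^2 <= x) for standard normal draws Z."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, 1.0) ** 2 <= x for _ in range(n)) / n


def beta_fn(a, b):
    """Beta function B(a, b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def betaprime_pdf(x, a, b):
    """Footnote 2: pdf of the BP(a, b)-distribution."""
    return x ** (a - 1.0) / ((1.0 + x) ** (a + b) * beta_fn(a, b))


def beta_pdf(y, a, b):
    """Pdf of the B(a, b)-distribution on (0, 1)."""
    return y ** (a - 1.0) * (1.0 - y) ** (b - 1.0) / beta_fn(a, b)


def pdf_of_ratio(y, a, b):
    """Density of Y = X / (1 + X), X ~ BP(a, b), by change of variables."""
    x = y / (1.0 - y)  # inverse map, with dx/dy = 1 / (1 - y)^2
    return betaprime_pdf(x, a, b) / (1.0 - y) ** 2
```

Under these assumptions, `max_derivative_gap([0.5, 1.0, 2.0])` should be negligible, `monte_carlo_cdf(1.0)` should be close to `cdf_theta(1.0)`, and `pdf_of_ratio(y, a, b)` should agree with `beta_pdf(y, a, b)` for any `y` in (0, 1).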

Figure 1. Marginal prior distribution of ${\sqrt{θ}}_{j}$ under the triple gamma prior with $a^{ξ} = c^{ξ} = 0.1$ and $κ_{B}^{2} = 2$, in comparison to the Horseshoe prior with $ϕ^{ξ} = 1$, the double gamma prior with $a^{ξ} = 0.1$ and $κ_{B}^{2} = 2$, and the Lasso prior with $κ_{B}^{2} = 2$. Spike (left-hand side) and tail (right-hand side) of the marginal prior.

Figure 2. Marginal univariate shrinkage profile under the triple gamma prior with $a^{ξ} = c^{ξ} = 0.1$, in comparison to the Horseshoe prior, the double gamma prior with $a^{ξ} = 0.1$ and the Lasso prior, with $κ_{B}^{2} = 2$ for all the prior specifications.

Figure 3. “Prior density” of shrinkage profiles for (from left to right) a Lasso prior, a double gamma prior with $a^{ξ} = 0.2$, a Horseshoe prior and a triple gamma prior with $a^{ξ} = c^{ξ} = 0.1$, when $κ_{B}^{2}$ is random. The solid line is the median, while the shaded areas represent 50% and 95% prior credible bands. We have used $κ_{B}^{2} \sim G (0.01, 0.01)$ for the Lasso and the double gamma, $2 / κ_{B}^{2} \sim F (1, 1)$ for the Horseshoe and $2 / κ_{B}^{2} \sim F (0.2, 0.2)$ for the triple gamma.

Figure 4. Bivariate shrinkage profile $p (ρ_{1}, ρ_{2})$ for (from left to right) the Lasso prior, the double gamma prior with $a^{ξ} = 0.1$, the Horseshoe prior, and the triple gamma prior with $a^{ξ} = c^{ξ} = 0.1$, with $κ_{B}^{2} = 2$ for all the priors. The contour plots of the bivariate shrinkage profile are shown, together with 500 samples from the bivariate prior distribution of the shrinkage parameters.

Figure 5. Posterior path against time for a constant non-significant parameter $β_{j, t}^{i}$ in the sparse regime.

Figure 6. Posterior inclusion probability for the $θ_{i j}^{β}$’s in the sparse and dense regime, under the triple gamma prior, the Horseshoe prior, the Lasso prior and the double gamma prior. The true values of the $θ_{i j}^{β}$’s are reported in each cell.

Figure 7. Posterior inclusion probability for state parameters $β_{i j}^{β}$ associated with the first lag (on the left) and with the second lag (on the right), for the Euro Area data under the triple gamma prior, the Horseshoe prior, the double gamma prior and the Lasso prior.

Figure 8. Posterior inclusion probability for the $θ_{i j}^{β}$’s associated with the first lag (on the left) and with the second lag (on the right), for the Euro Area data under the triple gamma prior, the Horseshoe prior, the double gamma prior and the Lasso prior.

Figure 9. Posterior median of $β_{i j}^{β}$ under the triple gamma, Horseshoe, double gamma and Lasso for the Euro Area model. The vertical lines delimit the intercept, first and second lag, respectively.

Figure 10. Posterior median of $|{\sqrt{θ}}_{i j}^{β}|$ under the triple gamma, Horseshoe, double gamma and Lasso for the Euro Area model. The vertical lines delimit the intercept, first and second lag, respectively.

Table 1. Priors on ${\sqrt{θ}}_{j}$ which are equivalent to (top) or special cases of (bottom) the triple gamma prior.

Prior for ${\sqrt{θ}}_{j}$	Name	$a^{ξ}$	$c^{ξ}$	$κ_{B}^{2}$	$ϕ^{ξ}$
$N (0, ψ_{j}^{2}), ψ_{j}^{2} \sim G G (a^{ξ}, c^{ξ}, ϕ^{ξ})$	normal-gamma-gamma	$a^{ξ}$	$c^{ξ}$	$\frac{2 c^{ξ}}{ϕ^{ξ} a^{ξ}}$	$ϕ^{ξ}$
$N (0, \frac{1}{κ_{j}} - 1), κ_{j} \sim TPB (a^{ξ}, c^{ξ}, ϕ^{ξ})$	generalized beta mixture	$a^{ξ}$	$c^{ξ}$	$\frac{2 c^{ξ}}{ϕ^{ξ} a^{ξ}}$	$ϕ^{ξ}$
$N (0, ψ_{j}^{2}), ψ_{j}^{2} \sim SBeta 2 (a^{ξ}, c^{ξ}, ϕ^{ξ})$	hierarchical scaled beta2	$a^{ξ}$	$c^{ξ}$	$\frac{2 c^{ξ}}{ϕ^{ξ} a^{ξ}}$	$ϕ^{ξ}$
$DE (0, \sqrt{2} ψ_{j}), ψ_{j}^{2} \sim G (c^{ξ}, \frac{1}{λ^{2}})$	normal-exponential-gamma	1	$c^{ξ}$	$2 λ^{2} c^{ξ}$	$\frac{1}{λ^{2}}$
$N (0, τ^{2} ψ_{j}^{2}), ψ_{j} \sim t_{1}$	Horseshoe	$\frac{1}{2}$	$\frac{1}{2}$	$\frac{2}{τ^{2}}$	$τ^{2}$
$N (0, \frac{1}{κ_{j}} - 1), κ_{j} \sim B (1 / 2, 1)$	Strawderman-Berger	$\frac{1}{2}$	1	4	1
$N (0, τ^{2} {\tilde{ξ}}_{j}), {\tilde{ξ}}_{j} \sim G (a^{ξ}, a^{ξ})$	double gamma	$a^{ξ}$	∞	$\frac{2}{τ^{2}}$	-
$N (0, τ^{2} {\tilde{ξ}}_{j}), {\tilde{ξ}}_{j} \sim E (1)$	Lasso	1	∞	$\frac{2}{τ^{2}}$	-
$t_{ν} (0, τ^{2})$	half-t	∞	$\frac{ν}{2}$	$\frac{2}{τ^{2}}$	-
$t_{1} (0, τ^{2})$	half-Cauchy	∞	$\frac{1}{2}$	$\frac{2}{τ^{2}}$	-
$N (0, B_{0})$	normal	∞	∞	$\frac{2}{B_{0}}$	-
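
Every row of Table 1 with a finite $ϕ^{ξ}$ satisfies $κ_{B}^{2} = 2 c^{ξ} / (ϕ^{ξ} a^{ξ})$, and the hierarchical scaled beta2 row suggests a simple way to simulate from the prior. The sketch below is illustrative only and not taken from the article's MCMC scheme: it assumes the standard representation of a $BP (a, c)$ variable as a ratio of independent $G (a, 1)$ and $G (c, 1)$ draws, scaled by $ϕ$ as in footnote 3, and the function names are hypothetical.

```python
import math
import random


def kappa_b2(a, c, phi):
    """Relation between the parameterizations in Table 1: 2c / (phi * a)."""
    return 2.0 * c / (phi * a)


def sample_psi2(a, c, phi, rng):
    """One draw of psi^2 ~ SBeta2(a, c, phi), assumed to equal phi * G1 / G2."""
    g1 = rng.gammavariate(a, 1.0)  # shape a, scale 1
    g2 = rng.gammavariate(c, 1.0)  # shape c, scale 1
    return phi * g1 / g2


def sample_sqrt_theta(a, c, phi, n, seed=0):
    """n draws of sqrt_theta | psi^2 ~ N(0, psi^2) with psi^2 ~ SBeta2(a, c, phi)."""
    rng = random.Random(seed)
    return [
        rng.gauss(0.0, math.sqrt(sample_psi2(a, c, phi, rng))) for _ in range(n)
    ]


# Horseshoe row of Table 1: a = c = 1/2, phi = tau^2, kappa_B^2 = 2 / tau^2.
tau2 = 2.0
horseshoe_kappa = kappa_b2(0.5, 0.5, tau2)

# Strawderman-Berger row: a = 1/2, c = 1, phi = 1, kappa_B^2 = 4.
strawderman_berger_kappa = kappa_b2(0.5, 1.0, 1.0)

# Draws from the Horseshoe special case with tau^2 = 1.
horseshoe_draws = sample_sqrt_theta(0.5, 0.5, 1.0, n=20_000, seed=0)
```

For very small $a^{ξ}$ or $c^{ξ}$ the gamma draws can become extremely small, so a more careful implementation would probably work on the log scale; this sketch makes no attempt to do so.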

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
