#### 2.1. BTA and Linear Gaussian Regression

We start with a standard Gaussian regression exercise. Let $\mathit{Y}$ be a length-$n$ univariate response with $Y_i \in \mathbb{R}$ and $\mathit{X}$ be an $n \times p$ matrix of covariates. Furthermore, let $M \subset \{1, \cdots, p\}$ be a model over a subset of the $p$ potential covariates and $\mathit{X}_M$ the sub-matrix of columns associated with the model $M$. The standard BMA regression with known variance is then

We note that fixing the variance of $\epsilon_i$ to 1 in (2) is done for expositional convenience; considering the general case of unknown variance is not important to the developments of Section 2.2. Under the g-prior (Zellner 1962) we have that the integrated likelihood of this model is

where

Now suppose that there is a natural partition of the $p$ covariates into two groups; that is, the first $p_1$ columns of $\mathit{X}$ belong to group 1 and the final $p_2$ columns ($p_1 + p_2 = p$) belong to group 2. Then instead of considering a single model $M \subset \{1, \cdots, p\}$, we could imagine there is a collection $(M_1, M_2)$ of models with $M_1 \subset \{1, \cdots, p_1\}$ and $M_2 \subset \{p_1 + 1, \cdots, p_1 + p_2\}$. In many BMA-driven studies such a partition is natural, since each concept under study is proxied by a collection of features meant to encapsulate it quantitatively. We therefore find it natural to discuss the model $M_1$ as the “theory one” model and the model $M_2$ as the “theory two” model.
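Under the known-variance g-prior the integrated likelihood has a familiar closed form that can be verified numerically. The sketch below is ours, not the paper's code, and assumes unit error variance, a zero prior mean for $\beta$, and no intercept:

```python
import numpy as np

def log_marginal(Y, X_M, g):
    """Log integrated likelihood of Y under the known-variance g-prior:
    Y | beta ~ N(X_M beta, I),  beta ~ N(0, g (X_M' X_M)^{-1}).
    Marginally Y ~ N(0, I + g P) with P the projection onto col(X_M),
    so det = (1+g)^{p_M} and the inverse is I - (g/(1+g)) P."""
    n, p_M = X_M.shape
    P = X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)  # projection onto col(X_M)
    quad = Y @ Y - (g / (1.0 + g)) * (Y @ P @ Y)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * p_M * np.log(1 + g) - 0.5 * quad

# Sanity check against the dense multivariate normal density
rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
g = float(n)
Sigma = np.eye(n) + g * X @ np.linalg.solve(X.T @ X, X.T)
sign, logdet = np.linalg.slogdet(Sigma)
dense = -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * Y @ np.linalg.solve(Sigma, Y)
assert np.isclose(log_marginal(Y, X, g), dense)
```

Because the determinant and inverse simplify through the projection matrix, the closed form avoids any $n \times n$ factorization.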

We note that at this point the posterior $pr(M_1, M_2 \mid \mathit{Y}, \mathit{X})$ can be evaluated jointly and efficiently via (3). However, while there is no reason to do so, one could instead elect to update the models $M_1$ and $M_2$ separately.

In particular, suppose that $M_1$ and $\mathit{\beta}_1$ are given. Then

where

and

Thus, we have effectively “separated” the response $\mathit{Y}$ from the update of $M_2$ by replacing it with the residual calculation $\mathit{E}_1$ given the theory one parameter set. This leads to the alternative representation

Thus, again while there is no need to do so, an MCMC for the overall BMA exercise could be conducted by alternating between updating model $M_1$ and thereby $\mathit{I}_1$, then updating model $M_2$ and $\mathit{I}_2$. These two summary variables $\mathit{I}_1$ and $\mathit{I}_2$ can then be referred to as the theory one and theory two indices, respectively.
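The separated update can be sketched concretely: given $(M_1, \mathit{\beta}_1)$, the theory-two model is scored against the residual $\mathit{E}_1$ alone. The toy data and names below are ours; the `log_marginal` helper again assumes a unit-variance g-prior:

```python
import itertools
import numpy as np

def log_marginal(E, X_M, g):
    """g-prior log integrated likelihood of a (residual) response E, unit variance."""
    n, p_M = X_M.shape
    P = X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)
    quad = E @ E - (g / (1.0 + g)) * (E @ P @ E)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * p_M * np.log(1 + g) - 0.5 * quad

rng = np.random.default_rng(1)
n, p1, p2 = 50, 2, 3
X1, X2 = rng.standard_normal((n, p1)), rng.standard_normal((n, p2))
beta1 = np.array([1.0, -0.5])
Y = X1 @ beta1 + X2[:, 0] + rng.standard_normal(n)  # only column 0 of group 2 matters

# Given (M1, beta1): replace Y with the residual E1, then score every
# candidate theory-two model M2 against E1 alone.
E1 = Y - X1 @ beta1
models = [m for k in range(1, p2 + 1) for m in itertools.combinations(range(p2), k)]
scores = {m: log_marginal(E1, X2[:, list(m)], g=float(n)) for m in models}
best_M2 = max(scores, key=scores.get)
```

In a full sampler the scores would feed a Metropolis or Gibbs move over $M_2$ rather than a hard maximization; the point is only that $\mathit{Y}$ enters the theory-two update exclusively through $\mathit{E}_1$.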

In the Bayesian paradigm it is often natural to now incorporate a notion of over-dispersion. In particular, we can imagine that while $\mathit{X}_{M_1} \mathit{\beta}_1$ represents the “mean” theory one index given the features $\mathit{X}_{M_1}$, a random process adds a source of randomness to this mean level. It is therefore common to replace (3) with

where the overdispersion parameter $\nu_1$ can then be given a prior distribution, for example $\Gamma(a_1/2, b_1/2)$. A similar formulation can be made for $\mathit{I}_2$. In the context of econometric BMA exercises we feel such a random effects representation is eminently sensible, as it implicitly admits that the features $\mathit{X}_1$ can only ever be imperfect encapsulations of a theory’s essence.

At this juncture, the joint marginal likelihood (3) is no longer directly applicable. However, the conditional strategy of alternating between models $M_1$ and $M_2$ using (4) can still be used, with an important modification. In particular, we note that given $\mathit{\beta}_2, \mathit{I}_1, \nu_2$ we have

Furthermore, given $\mathit{I}_2$ we may replace (5) with

Subsequent to the sampling of the latent factors $\mathit{I}_2$ we may resample the random effects precision parameters $\nu_t$ via a standard Gibbs step.
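The Gibbs step is the usual conjugate Gamma-normal update: with a $\Gamma(a_t/2, b_t/2)$ prior and Gaussian residuals $e_{it}$, the full conditional for $\nu_t$ is $\Gamma((a_t + n)/2, (b_t + \sum_i e_{it}^2)/2)$. A minimal sketch (the function name is ours, and the Gamma is parameterized by its rate):

```python
import numpy as np

def gibbs_precision(resid, a, b, rng):
    """Draw nu_t | rest ~ Gamma(shape=(a + n)/2, rate=(b + sum(e^2))/2),
    the conjugate update for a Gamma(a/2, b/2) prior on a Gaussian precision."""
    n = resid.size
    shape = 0.5 * (a + n)
    rate = 0.5 * (b + np.sum(resid ** 2))
    return rng.gamma(shape, 1.0 / rate)  # NumPy's gamma takes a scale argument

rng = np.random.default_rng(2)
true_nu = 4.0
resid = rng.normal(0.0, true_nu ** -0.5, size=5000)
draws = np.array([gibbs_precision(resid, a=1.0, b=1.0, rng=rng) for _ in range(200)])
# With n large, the full conditional concentrates near the true precision of 4.
```
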

Indeed, we could then consider one final embellishment where

with $\gamma_t \in \{0, 1\}$ and, for example, the prior probability that $\gamma_t = 1$ set to $1/2$ (or any other value in $(0,1)$). Then when $\gamma_2 = 0$ the update of $\mathit{I}_2$ would simply be

that is, a sample from the prior conditional on $\mathit{\beta}_2$. Updating the parameter $\gamma_2$ conditional on all other factors would then involve a straightforward Metropolis-Hastings step. If the models $M_1$ and $M_2$ indicate which variables are included in the theory one and theory two models, then $\gamma_1$ and $\gamma_2$ act as wholesale inclusion parameters which dictate the overall relevance of the respective theories.

This partitioning and random effects strategy forms the basis of our development in Section 2.2. We note that the inclusion of the random effects component keeps model evaluations conditionally Gaussian, which enables the use of conditional Bayes factors to efficiently resample model parameters.

#### 2.2. Multivariate BTA and Generalized Regression Models

We now generalize to the case where we have $R$ responses from a general response family. Let $\mathit{Y}_i$ be an $R$-dimensional response vector for observation $i$ and $\mathcal{D} = \{\mathit{Y}_1, \cdots, \mathit{Y}_n\}$ be a collection of $n$ such observations. Each variate $Y_{ir}$ in the vector $\mathit{Y}_i$ is assumed to belong to a general set $\mathcal{F}_r$. In this paper we consider examples where $\mathcal{F}_r$ is $\{0,1\}$, $\mathbb{R}$, and $\mathbb{R}_+$, though others, such as $\mathbb{N}$, could easily be entertained. We associate $Y_{ir}$ with an outcome distribution as

where $g_r$ is a general probability density or mass function, $\mathit{\alpha}_r$ is a set of global parameters, and $\mu_{ir}$ is an observation-$i$-dependent mean value. We note that the assumption that only the mean parameter $\mu_{ir}$ varies with the observation $i$ could be relaxed in future work.

The parameter $\mu_{ir}$ is then assumed to have the form

In the above formulation $\gamma_{rt}$ is either exactly 0 or a free parameter in $\mathbb{R}$. We assign a prior probability of $1/2$ to each of these two possibilities; clearly other prior probabilities could be entertained. By convention, if several $\gamma_{rt}$ are non-zero for a given index $t$ then one of these non-zero $\gamma_{rt}$ is set to 1 to avoid issues related to identification. This matter is discussed subsequently.
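This convention can be implemented as a small renormalization that fixes the first non-zero loading of theory $t$ to 1 and absorbs its magnitude into the index and proxy coefficients; the products $\gamma_{rt} I_{it}$, and hence the likelihood, are unchanged. A hypothetical helper (names are ours, not the paper's):

```python
import numpy as np

def normalize_gamma(gamma_t, I_t, beta_t):
    """Rescale so the first non-zero loading of theory t equals 1, pushing the
    scale c into the index and proxy coefficients: (gamma/c, c*I, c*beta)
    leaves every product gamma_rt * I_it unchanged."""
    nz = np.flatnonzero(gamma_t)
    if nz.size == 0:
        return gamma_t, I_t, beta_t
    c = gamma_t[nz[0]]
    return gamma_t / c, I_t * c, beta_t * c

gamma = np.array([0.0, 0.5, -2.0])   # loadings of one theory across responses r
I = np.array([1.0, 2.0, 3.0])        # latent index values for three observations
beta = np.array([0.25, -1.0])        # proxy regression coefficients
g2, I2, b2 = normalize_gamma(gamma, I, beta)
assert np.allclose(np.outer(g2, I2), np.outer(gamma, I))  # likelihood contributions invariant
assert g2[1] == 1.0  # first non-zero loading pinned to 1
```
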

The variable $I_{it}$ is then referred to as the theory-$t$ index for observation $i$. We further assume that $I_{it}$ depends on a set of $p_t$ theory proxies $\mathit{X}_{it}$ according to the linear model

where $\epsilon_{it} \sim \mathcal{N}(0, \nu_t^{-1})$ independently. The precision term $\nu_t$ is assigned a $\Gamma(a_t/2, b_t/2)$ prior. We note that this prior is forced to adapt throughout the procedure (by adjusting the $a_t, b_t$ parameters) to control for issues of identification; we discuss this aspect below. We typically begin the inference procedure by setting $a_t$ and $b_t$ to 1.

Associated with the parameter $\mathit{\beta}_t$ is a model $M_t \subset \{1, \cdots, p_t\}$ such that $\beta_{it} = 0$ when $i \notin M_t$, a standard BMA formulation. As the “null” model can be controlled by the $\gamma_{rt}$ parameter, we exclude $M_t = \varnothing$ from our consideration; see Kourtellos et al. (2019) for a motivation of this structure. Writing $\mathit{\beta}_{M_t}$ to represent the subvector of $\mathit{\beta}_t$ not constrained to zero, we assume

where $p_{M_t}$ is the size of model $M_t$, independently across $t$. As with the prior parameters $a_t, b_t$, the g-prior parameter $g_t$ adapts throughout the procedure; we begin with $g_t = 1/n$. Alternative priors for this model could have been considered; see our discussion in the Conclusions section.

Finally, the model $M_t$ can have a number of priors; see Ley and Steel (2009) for an overview of potential issues to consider when selecting this prior. For the time being we choose the uniform prior

When $\gamma_{rt} \in \mathbb{R}$ we assign the prior $\gamma_{rt} \sim \mathcal{N}(0, 1)$. This has the effect of imposing a uniform model prior on the inclusion of theories in the outcome equation. Alternatively, joint priors for the $\gamma$ factors could be considered, which would control for the size of the included theories. However, since the number of theories is meant to be modest (roughly five to ten), we have avoided such aspects in the current framework.

The system outlined above then serves as the core latent process which drives the subsequent outcome variables. The models $M_t$ thus investigate which proxies best encode a theory quantitatively, while also accounting for the obvious model uncertainty in this formulation and incorporating a notion of over-dispersion. The $\gamma_{rt}$ terms serve two purposes. First, by examining their non-zero elements we see for which response equations a given theory is relevant. Second, by requiring the first non-zero $\gamma_{rt}$ to be equal to 1 and all others to be in $\mathbb{R}$, the $\gamma_r$ term scales the latent indices to allow them to enter into model parameters differentially and indeed in opposite directions.

Finally, the latent theory indices $I_{it}$ are potentially of greatest interest, as they are meant to encapsulate the way that the theory proxies affect the outcome equations. Again, as outlined in the Appendix, these terms suffer from potential identification issues when combined with the restrictions placed on a given $\gamma_r$. The hyperparameters $a_t, b_t$ ultimately control this aspect, and therefore final interest focuses on the scale-free term $\tilde{I}_{it} = (a_t/b_t) I_{it}$.

This concern regarding identification requires a modicum of bookkeeping when conducting posterior inference. If, for example, all non-zero $\gamma$ values were allowed to be in $\mathbb{R}$, then the final outcome equation could have a variety of $\gamma_{rt}$ and $\mathit{\beta}_t$ combinations that would yield the same posterior probability. This is the justification for our restriction that the $\gamma_{rt}$ with the smallest $r$ be constrained to 1.

However, this constraint yields its own issues, primarily due to its effects on the priors for the $\mathit{\beta}$ and $\nu$ parameters. If, for example, $\gamma_{11} = 1$ and $\gamma_{21} = 0.5$ and our chain sets $\gamma_{11}$ to 0, then $\gamma_{21}$ will suddenly double, as the identification constraint now rescales it to 1. This would imply that $\gamma_{21} \mathit{I}_1$ suddenly has twice the effect on the mean value of outcome Equation (2). The obvious answer is to simultaneously halve $\mathit{I}_1$ or, equivalently, halve $\mathit{\beta}_1$. However, it would no longer be appropriate to keep the priors for $\mathit{\beta}_t$ and $\nu_t$ fixed, and therefore their priors are also adjusted by this factor. Technical details are given in Appendix A.

To review, our full modeling framework therefore takes the form

Choices for the families $g_r$ that control the outcome variables are considerable. In our application we focus on three models. The first is logistic regression. In this case $Y_{ir} \in \{0,1\}$, $\alpha_r$ is univariate, and

We use this logistic regression to model the probability that a country will default on its sovereign debt based on the theory indices.
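Assuming the canonical logit link, the outcome equation maps the global intercept and the scaled theory indices into a default probability. A minimal sketch of this assumed form (function name is ours):

```python
import math

def logistic_outcome_prob(alpha_r, gammas, indices):
    """P(Y_ir = 1) = sigmoid(alpha_r + sum_t gamma_rt * I_it), the assumed
    logistic outcome equation driven by the latent theory indices."""
    eta = alpha_r + sum(g * i for g, i in zip(gammas, indices))
    return 1.0 / (1.0 + math.exp(-eta))

p = logistic_outcome_prob(0.0, [1.0, -0.5], [0.0, 0.0])
assert p == 0.5  # a zero linear predictor gives probability one half
```
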

The second family considered corresponds to non-central asymmetric Laplace variates. In this case $\mathit{\alpha}_r$ is two dimensional, with $\alpha_{r1}$ denoting the intercept and $\alpha_{r2}$ the log-precision parameter. In particular, we write

where $\tau$ is the quantile under consideration. This model is often referred to as Bayesian quantile regression, since its posterior mode is related to the quantile regression estimate under the so-called pin-ball loss $\rho_\tau$. We employ this model for two separate variates, the inflation and unemployment rates, and set $\tau = 0.9$ for both, thus focusing on the 90th percentile of the respective distributions.
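The pin-ball loss and the corresponding asymmetric Laplace log density can be sketched directly; the exact parameterization below (a precision factor scaling the loss) is our assumption rather than the paper's notation:

```python
import math

def pinball(u, tau):
    """Check (pin-ball) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def al_logpdf(y, mu, log_prec, tau):
    """Log density of an asymmetric Laplace variate, the working likelihood of
    Bayesian quantile regression: f(y) = tau*(1-tau)*lam * exp(-lam * rho_tau(y - mu))."""
    lam = math.exp(log_prec)
    return math.log(tau * (1.0 - tau) * lam) - lam * pinball(y - mu, tau)

# With tau = 0.9 the penalty for under-prediction (y above mu) is nine times
# that for over-prediction, which pulls the posterior mode toward the 90th percentile.
assert pinball(1.0, 0.9) == 0.9
assert math.isclose(pinball(-1.0, 0.9), 0.1)
```
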

Finally, we consider the Generalized Extreme Value (GEV) model with $\mathit{\alpha}_r = (\alpha_{r1}, \alpha_{r2}, \alpha_{r3})$ parameterized by

for $h(Y_{ir}) > 0$ with

The GEV model is used to model block maxima and hence understand the nature of extreme behavior. In our case we use it to model the largest daily percentage jump in a country’s exchange rate (relative to USD) seen over the course of a year. The global parameters $\alpha_{r2}$ and $\alpha_{r3}$ are the log-precision and shape, respectively, while $\alpha_{r1}$ again serves as the global intercept.
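For reference, the standard GEV log density with location, log-precision, and shape parameters can be written as follows; mapping the log-precision to the scale via $\sigma = \exp(-\alpha_{r2})$ is our assumption:

```python
import math

def gev_logpdf(y, mu, log_prec, shape):
    """Log density of the GEV with location mu, scale sigma = exp(-log_prec),
    and shape xi; defined only where h(y) = 1 + xi * (y - mu) / sigma > 0."""
    sigma = math.exp(-log_prec)
    z = (y - mu) / sigma
    h = 1.0 + shape * z
    if h <= 0:
        return -math.inf  # outside the support
    if shape == 0.0:      # Gumbel limit as the shape tends to zero
        return -math.log(sigma) - z - math.exp(-z)
    return -math.log(sigma) - (1.0 + 1.0 / shape) * math.log(h) - h ** (-1.0 / shape)

# As the shape parameter tends to 0, the GEV density approaches the Gumbel density.
assert math.isclose(gev_logpdf(0.5, 0.0, 0.0, 1e-8),
                    gev_logpdf(0.5, 0.0, 0.0, 0.0), rel_tol=1e-4)
```
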

Based on $\mathcal{D}$ we then conduct posterior inference on the full parameter set, which includes the global parameters $\mathit{\alpha}_r$, the theory-level models $M_t$, the theory-inclusion and scaling parameters $\gamma_{rt}$, and the linear model parameters $\mathit{\beta}_t$, as well as the latent indices $\mathit{I}_t$ and their random effects precisions $\nu_t$. Posterior inference is performed via Markov chain Monte Carlo (MCMC). Given the involved and nested nature of the MCMC, several different approaches are employed at different stages of the hierarchy; full details are provided in Appendix A.

However, the main themes of the MCMC involve conditional Bayes factors (CBFs) to change the models $M_t$ and update the proxy regression parameters $\mathit{\beta}_t$. Standard block Metropolis-Hastings proposals using local Laplacian approximations of the log posterior density are used to update the latent theory indices $I_{it}$ as well as any global parameters in $\mathit{\alpha}_r$. Finally, reversible jump methods (Green 1995) alternate $\gamma_{rt}$ between being 0 or in $\mathbb{R}$, with a modicum of bookkeeping to ensure that at least one $\gamma_{rt}$ is set to 1 when theory $t$ is represented in more than one dependent equation $r$, again to ensure identification of the system. When conducting this bookkeeping exercise, prior distributions are adjusted accordingly so that log-posterior density values are not affected by mere changes in variate representation.