Cointegration and unit root tests: A fully Bayesian approach

To perform statistical inference for time series, one should be able to assess if they present deterministic or stochastic trends. For univariate analysis, one way to detect stochastic trends is to test if the series has unit roots; for multivariate studies, it is often relevant to search for stationary linear relationships between the series, that is, to test if they cointegrate. The main goal of this article is to briefly review the shortcomings of unit root and cointegration tests proposed under the Bayesian approach to statistical inference and to show how they can be overcome by the fully Bayesian significance test (FBST), a procedure designed to test sharp or precise hypotheses. We will compare its performance with the most used frequentist alternatives, namely, the augmented Dickey-Fuller test for unit roots and the maximum eigenvalue test for cointegration.
Keywords: Time series; Bayesian inference; Hypothesis testing; Unit root; Cointegration.


INTRODUCTION
Several time series present deterministic or stochastic trends, which imply that the effects of these trends on the level of the series are permanent. Consequently, the mean and variance of the series will not be constant and will not revert to a long term value. This feature reflects the fact that the stochastic processes generating these series are not (weakly) stationary, making it problematic to perform inductive inference using the most traditional estimators or predictors, because the usual properties of these procedures will not be valid under such conditions. Therefore, when modelling non-stationary time series, one should be able to properly detrend the series used, either by directly modelling the trend with deterministic functions, or by transforming the series to remove stochastic trends. To determine which strategy is the suitable solution, several statistical tests have been developed since the 1970's by the frequentist school of statistical inference.
The Augmented Dickey-Fuller (ADF) test is one of the most popular tests used to assess whether a time series has a stochastic trend or, for series described by auto-regressive models, whether it has a unit root. When one is searching for long term relationships between the multiple series under analysis, it is crucial to know if there are stationary linear combinations of these series, i.e. if the series are cointegrated. Cointegration tests were also developed by the frequentist school, in the late 1980's [14] and early 1990's [20]. Only in the late 1980's did the Bayesian approach to testing the presence of unit roots start to be developed.
Both unit root and cointegration tests may be considered tests of precise or sharp hypotheses, i.e. those in which the dimension of the parameter space under the tested hypothesis is smaller than the dimension of the unrestricted parameter space. Testing sharp hypotheses poses major difficulties for both the frequentist and the Bayesian paradigms, such as the need to eliminate nuisance parameters.
The main goal of this article is to briefly review the shortcomings of the tests proposed by the Bayesian school and how they can be overcome by the fully Bayesian significance test (FBST). More specifically, we will compare its performance with the most used frequentist alternatives, the ADF for unit roots and the maximum eigenvalue test for cointegration. Since this is a review article, it is important to remark that the results presented here were published elsewhere by the same authors. 1 To accomplish this objective we will define the FBST in the next Section, also showing how it can be implemented in a general context. The following section discusses the problems of testing the existence of unit roots in univariate time series and how the Bayesian tests approach the problem. Section 4 then shows how the FBST is applied to test if a time series has unit roots and illustrates this with applications on a real data set. In the sequel we discuss the Bayesian alternatives to cointegration tests. We then apply the FBST to test for cointegration using simulated and real data sets. We conclude with some remarks and possible extensions for future work.

FBST
The fully Bayesian significance test was proposed in [31] mainly to deal with sharp hypotheses. The procedure has several desirable properties 2, most interestingly the fact that it is based only on posterior densities, thus avoiding complications such as the elimination of nuisance parameters or the adoption of priors with positive probabilities attached to sets of zero Lebesgue measure.
We shall consider general statistical spaces in which the parameter space is denoted by Θ ⊆ R m , m ∈ N. A sharp hypothesis H assumes that θ , the parameter vector of the chosen statistical model, belongs to a sub-manifold Θ H of smaller dimension than Θ. This implies, for continuous parameter spaces, that the subset Θ H has null Lebesgue measure whenever H is sharp. The sample space, set of all possible values of the observable random variables (or vectors), is here denoted by X .
Following the Bayesian paradigm, let h(·) be a prior probability density over Θ, x ∈ X the observed sample (scalar or vector), and L(· | x) the likelihood derived from the data x. To evaluate the Bayesian evidence based on the FBST, the sole relevant entity is the posterior probability density for θ given x,

g(θ | x) ∝ h(θ) L(θ | x).

It is important to highlight that the procedure may also be used when the parameter space is discrete. However, when the posterior probability distribution over Θ is absolutely continuous the FBST appears as a more suitable alternative for significance testing of sharp hypotheses. For notational simplicity, we will denote Θ_H by H in the sequel.
Let r(θ) be a reference density on Θ such that the function s(θ) = g(θ | x)/r(θ) is a relative surprise 3 function. The reference density is important because it guarantees that the FBST is invariant under reparametrizations, even when r(θ) is improper. 4 Thus, when r(θ) is proportional to a constant, the surprise function is, in practical terms, equivalent to the posterior density. For the applications considered in this article we will use the improper uniform density as the reference density on Θ. The authors of [29] remark that it is possible to generalize the procedure using other reference densities, such as neutral, reference or non-informative priors, if they are available and desirable.

Definition 1 [Tangent set]: Considering a sharp hypothesis H : θ ∈ Θ_H, the tangential set of the hypothesis given the sample is

T_x = {θ ∈ Θ : s(θ) > s*},   where s* = sup_{θ ∈ H} s(θ).
Notice that the tangent set T x is the highest relative surprise set, that is, the set of points of the parameter space with higher relative surprise than any point in H, being tangential to H in this sense. This approach takes into consideration the statistical model in which the hypothesis is defined, using several components of the model to define an evidential measure favouring the hypothesis.
Definition 2 [Evidence]: The evidence against H is the posterior probability of the tangent set,

ev̄ = Pr(θ ∈ T_x | x) = ∫_{T_x} dG_x(θ),     (2)

where G_x(θ) denotes the posterior distribution function of θ and the above integral is of the Riemann-Stieltjes type.
Definition 2 sets ev̄ as the posterior probability of the tangent set, which is interpreted as an evidence value against H. Hence, the evidence value supporting H is its complement, ev = 1 − ev̄. Notwithstanding, ev is not evidence against A : θ ∉ Θ_H, the alternative hypothesis (which is not sharp anyway). Equivalently, ev̄ is not evidence in favor of A, although it is evidence against H.

Definition 3 [Test]:
The FBST is the procedure that rejects H whenever ev = 1 − ev̄ is smaller than a critical level, ev_c.
Thus, we are left with the problem of deciding the critical level ev c for each particular application. We briefly discuss this and other practical issues in the following subsection.
2.1. Practical implementation: critical values and numerical computation.
Since ev is a statistic, it has a sampling distribution derived from the adopted statistical model, and in principle this distribution could be used to find a threshold value. If the likelihood and the posterior distribution satisfy certain regularity conditions 5, [11] proved that, asymptotically, there is a relationship between ev and the p-values obtained from the frequentist likelihood ratio procedure used to test the same hypotheses. This fact provides a way to find, at least asymptotically, a critical value for ev below which the hypothesis being tested is rejected.
In a recent review [43], the authors discuss different ways to provide a threshold for ev. Among these alternatives, we highlight the standardized e-value, which follows, asymptotically, the uniform distribution on (0, 1). 6 One could also try to define the FBST as a Bayes test derived from a particular loss function and the respective minimization of the posterior expected loss. Following this strategy, Madruga et al. (2001) showed that there are loss functions which result in ev as a Bayes estimator of φ = I H (θ ). 7 Hence, the FBST is in fact a Bayes procedure in the formal sense as defined by Wald in [48].
TABLE 1. General algorithm: compute ev supporting the hypothesis H : θ ∈ Θ_H
1. Specify the statistical model (likelihood) and the prior distribution on Θ.
2. Specify the reference density, r(θ), and derive the relative surprise function, s(θ).
3. Find s*, the maximum value of s(θ) under the constraint θ ∈ H.
4. Integrate the posterior distribution over the tangent set (eq. (2)) to find ev̄.
5. Find ev = 1 − ev̄.

After defining the statistical model and prior, it is simple to find the surprise function, s(θ). In step 3 one should find the point of the parameter space in H that maximizes s(θ); that is, it is a problem of constrained numerical maximization. In several applications this step does not have a closed form solution, requiring the use of numerical optimizers.
Step 4 involves the integration of the posterior distribution over a subset of Θ that can be highly complex, the tangent set T_x. Once more, since in many cases it is fairly difficult to find an explicit expression for T_x, one may use various numerical techniques to compute the integral. If it is possible to generate random samples from the posterior distribution, one may use Monte Carlo integration, as we will do in this work to compute ev̄. Another alternative is to use approximation techniques, such as those proposed in [45], based on a Laplace approximation. We discuss how to implement such approximations for unit root and cointegration tests in [9,10].
5 See [37], p. 436.
6 See also [3] for more on the standardized version of ev.
7 I_A(x) denotes the indicator function, equal to one if x ∈ A and zero otherwise.
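To make the recipe of Table 1 concrete, the following is a minimal sketch, in Python, of the Monte Carlo computation described above for a toy problem: testing the sharp hypothesis H : µ = 0 for the mean of a Gaussian sample with known variance, with a flat prior and a flat reference density, so that the surprise function equals the posterior density. All names and numerical values are illustrative and not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: Gaussian sample with known variance; with a flat prior the posterior of mu
# is N(xbar, sigma2 / n).  We test the sharp hypothesis H: mu = 0.
x = rng.normal(loc=0.3, scale=1.0, size=30)
sigma2, n, xbar = 1.0, len(x), x.mean()
posterior = stats.norm(loc=xbar, scale=np.sqrt(sigma2 / n))

# Step 2: flat reference density, so the surprise function s(mu) is the posterior density itself.
# Step 3: H contains a single point, so s* is the posterior density evaluated at mu = 0.
s_star = posterior.pdf(0.0)

# Step 4: Monte Carlo estimate of ev_bar, the posterior probability of the tangent set
# {mu : s(mu) > s*}.
draws = rng.normal(xbar, np.sqrt(sigma2 / n), size=50_000)
ev_bar = np.mean(posterior.pdf(draws) > s_star)

# Step 5: evidence supporting H.
print(f"ev_bar = {ev_bar:.3f}, ev = {1.0 - ev_bar:.3f}")
```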

BAYESIAN UNIT ROOT TESTS
Before presenting the Bayesian procedures used to test for the presence of unit roots, let us fix notation. We will denote by y_t the t-th value of a univariate time series observed at dates t = 1, ..., T + p, where T and p are positive integers. The usual approach is to assume that the series under analysis is described by an auto-regressive process with p lags, AR(p), meaning that the data generating process is fully described by a stochastic difference equation of order p, possibly with an intercept or drift and a deterministic linear trend, i.e.

y_t = µ + δ t + φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t.     (3)

Using the lag or backshift operator B, we denote y_{t−k} by B^k y_t, allowing us to rewrite (3) as

φ(B) y_t = µ + δ t + ε_t,     (4)

where φ(B) = 1 − φ_1 B − ... − φ_p B^p is the autoregressive polynomial. The difference equation (3) will be stable, implying that the process generating {y_t}_{t=1}^{T+p} is (weakly) stationary, whenever the roots of the characteristic polynomial φ(z), z ∈ C, lie outside the unit circle (the roots may be complex). 8 If some of the roots lie exactly on the unit circle, the process is said to have unit roots. In order to test such a hypothesis statistically, (3) is rewritten as

∆y_t = µ + δ t + Γ_0 y_{t−1} + Γ_1 ∆y_{t−1} + ... + Γ_{p−1} ∆y_{t−p+1} + ε_t,     (5)

where Γ_0 = −φ(1) = φ_1 + ... + φ_p − 1 and the Γ_i, i ≥ 1, are functions of the original autoregressive coefficients. If the generating process has exactly one unit root, then z = 1 is a root of the polynomial φ(z), i.e. φ(1) = 0, and all the other roots lie outside the unit circle. In this case Γ_0 = 0, the hypothesis that will be tested when modelling (5). Even though tests based on these assumptions verify whether the process has a single unit root, there are generalizations based on the same principles that test for the existence of multiple unit roots. 9

The search for Bayesian unit root tests began in the late 1980's. As far as we know, [39] and [40] were the first works to propose a Bayesian approach for unit root tests. The frequentist criticism of these articles received a proper answer in [33,34], generating a fruitful debate that produced a long list of papers in the literature of Bayesian time series. A good summary of the debate and of the Bayesian papers that resulted from it is presented in [2]. We will present here only the most relevant strategies proposed by the Bayesian school to test for unit roots.
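As a quick numerical check of the reparametrization (5), the small sketch below (with arbitrary coefficient values, not taken from the paper) inspects the roots of φ(z) and the implied value of Γ_0 directly:

```python
import numpy as np

# AR(2) coefficients in y_t = phi_1 y_{t-1} + phi_2 y_{t-2} + eps_t
# (arbitrary values, chosen so that the process has exactly one unit root).
phi = np.array([1.3, -0.3])

# Characteristic polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p;
# numpy.roots expects coefficients from the highest degree down to the constant term.
roots = np.roots(np.concatenate((-phi[::-1], [1.0])))
print("roots of phi(z):", roots)      # one root equal to 1, the other outside the unit circle
print("moduli:", np.abs(roots))       # stationarity requires all moduli > 1

# Gamma_0 = -phi(1) = phi_1 + ... + phi_p - 1: the coefficient of y_{t-1} in (5).
print("Gamma_0 =", phi.sum() - 1.0)   # equals 0 under a single unit root
```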
For models like (5), the Bayesian tests reviewed below write the parameter vector as θ = (ρ, ψ), where ρ denotes the autoregressive parameter of interest and ψ collects the remaining (nuisance) parameters, and work with the marginal likelihood of ρ,

L_m(ρ | y) = ∫_Ψ L(θ | y) dψ,

where y = {y_t}_{t=1}^{T+p} is the observations vector, L(θ | y) the full likelihood and Ψ the support of the random vector ψ. This marginal likelihood, associated with a prior for ρ, is the main ingredient used by standard Bayesian procedures to test for the existence of unit roots. Even though the procedure varies among authors according to some specific aspects, mentioned below, basically all of them use Bayes factors and posterior probabilities.
One important issue is the specification of the null hypothesis: some authors, starting from [38], consider H_0 : ρ = 1 against H_1 : ρ < 1. This is the way the frequentist school addresses the problem, 10 but following this approach no explosive value for ρ is considered. The decision theoretic Bayesian approach addresses the problem using the ratio of posterior probabilities, or Bayes factor,

B_{01} = L_m(ρ = 1 | y) / ∫_{ρ<1} L_m(ρ | y) g_0(ρ) dρ,

where g_0(ρ) is the prior density of ρ under the alternative. Advocates of this solution argue that one of the advantages of this approach is that the null and the alternative hypotheses are given equal weight. However, the expression above is not defined if g_0(ρ) is not a proper density, since the denominator of the Bayes factor is equal to the predictive density, which is defined only if g_0(ρ) is proper. There are also problems if L_m(ρ = 1 | y) is zero or infinite.
The problem is approached in [33,25] by testing H_0 : ρ ≥ 1 against H_1 : ρ < 1, explicitly considering explosive values for ρ. The main advantage of this strategy is the possibility of computing posterior probabilities like

Pr(ρ ≥ 1 | y) = ∫_{[1,∞)} g_m(ρ | y) dρ,

which is well defined even for improper priors on ρ, where g_m is the marginal posterior for ρ.
In [8] the authors do not choose ρ as the parameter of interest, examining instead the largest absolute value of the roots of the characteristic polynomial and then verifying whether it is smaller or larger than one. Usually this value is slightly smaller than ρ, but the authors argue that this small difference may be important. When this approach is used, unit roots are found less frequently. For an AR(3) model with a constant and deterministic trend, [8] derived the posterior density of the dominant root for the 14 series used in [30] and concluded the following: for eleven of the series the dominant root was smaller than one, that is to say, these series were trend-stationary. These results were based on a flat prior for the autoregressive parameters and the deterministic trend coefficient.
Another controversy is about the prior over ρ: [33] argues that the difference between the results given by the frequentist and the Bayesian inferences is due to the fact that the flat prior proposed in [39] overweights the stationary region. Hence, he derived a Jeffreys prior for the AR(1) model: this prior quickly diverges as ρ increases past one. The resulting posterior produces the same results as [30]. The next section discusses these results in detail. The critics of the approach adopted by Phillips in [33] 11 judged the Jeffreys prior to be unrealistic from a subjective point of view. This is a nonsensical objection if one considers that the Jeffreys prior is key to ensuring an invariant inferential procedure, and invariance is a highly desirable property, for either objective or subjective reasons. 12

A final controversial point concerns the modelling of the initial observations. If they are included in the likelihood, the process is implicitly considered stationary. In fact, when it is known that the process is stationary and it is believed that the data generating process has been at work for a long period, it is reasonable to assume that the parameters of the model determine the marginal distribution of the initial observations. In the simplest AR(1) model, this would imply that y_1 ∼ N(0, σ²/(1 − ρ²)). In this scenario, performing the inference conditional on the first observation would discard relevant information. On the other hand, there is no marginal distribution defined for y_1 if the generating process is not stationary. In that case, it is valid to use a likelihood conditional on the initial observations. For the models presented here, we always work with the conditional likelihood. As argued in [39], inferences for stationary models are little affected by the use of conditional likelihoods, especially for large samples; the author compares these inferences with the ones based on exact likelihoods under explicit modelling of the initial observations.

IMPLEMENTING THE FBST FOR UNIT ROOT TESTING
We will now describe how to use the FBST to test for the presence of unit roots, referring to the general model 13 (5), i.e.

∆y_t = µ + δ t + Γ_0 y_{t−1} + Γ_1 ∆y_{t−1} + ... + Γ_{p−1} ∆y_{t−p+1} + ε_t,   with ε_t ∼ N(0, σ²) independently.
Recalling the steps to implement the FBST, displayed in Table 1, we have just specified the statistical model. The likelihood, conditional on the first p observations and derived from the Gaussian model, is

L(θ | y) ∝ σ^{−T} exp{ −(1/(2σ²)) Σ_{t=p+1}^{T+p} (∆y_t − µ − δ t − Γ_0 y_{t−1} − Σ_{i=1}^{p−1} Γ_i ∆y_{t−i})² },

where θ = (ψ, σ) and ψ = (µ, δ, Γ_0, Γ_1, ..., Γ_{p−1})′. To complete step 1 of Table 1 we need a prior distribution for θ; for all the series modelled in this article we will use a non informative, improper prior. We are aware of the problems caused by improper priors applied to this problem when one uses alternative approaches, like those mentioned in [2]. However, one of our goals is to show how the FBST can be implemented even for a potentially problematic prior like this one. To write the posterior we use the following notation: ∆Y is the (T × 1) vector stacking ∆y_t for t = p + 1, ..., T + p, X is the (T × (p + 2)) matrix whose row for date t is (1, t, y_{t−1}, ∆y_{t−1}, ..., ∆y_{t−p+1}), and ψ is the corresponding ((p + 2) × 1) coefficient vector, so that the model reads ∆Y = Xψ + ε. Thanks to this notation we can write, using primes to denote transposed matrices, the full posterior as

g(θ | y) ∝ h(θ) σ^{−T} exp{ −(1/(2σ²)) [ (ψ − ψ̂)′X′X(ψ − ψ̂) + (∆Y − ∆Ŷ)′(∆Y − ∆Ŷ) ] },     (8)

where ψ̂ = (X′X)^{−1}X′∆Y is the ordinary least squares (OLS) estimator of ψ and ∆Ŷ = Xψ̂ its prediction for ∆Y. Thus, the full posterior is a Normal-Inverse Gamma density.
Step 2 demands a reference density in order to define the relative surprise function. Since we will use the improper density r(θ) ∝ 1, the surprise function will be equivalent to the posterior distribution in our applications.
13 It is also possible to include q ∈ N moving average terms in (3) to model the process, a case that will not be covered in this article but that, in principle, shall not imply major problems for the FBST.
Given this, to find s* (Step 3) we need to find the maximum value of the posterior under the hypothesis being tested, in our case Γ_0 = 0. This maximization step is fairly simple to implement given the modelling choices made here: Gaussian likelihood, non informative prior and reference density proportional to a constant. The restricted (assuming H) posterior distribution is

g_r(θ_r | y) ∝ h(θ_r) σ^{−T} exp{ −(1/(2σ²)) (∆Y − X_r ψ_r)′(∆Y − X_r ψ_r) },     (9)

in which θ_r = (ψ_r, σ), ψ_r being the vector ψ without Γ_0, and X_r is simply the matrix X above without its third column, since under H : Γ_0 = 0 and the coefficient of the third column of X is Γ_0. 14 From (9), it is easy to show that the maximum a posteriori (MAP) estimator of θ_r is given by (ψ̂_r, σ̂_r), where ψ̂_r = (X_r′X_r)^{−1}X_r′∆Y is the least squares estimator of ψ_r, ∆Ŷ_r = X_r ψ̂_r denotes the prediction for ∆Y given by the restricted model, and σ̂_r is the value of σ that maximizes (9) once ψ̂_r is plugged in, a simple function of the restricted residual sum of squares (∆Y − ∆Ŷ_r)′(∆Y − ∆Ŷ_r). Plugging the values of ψ̂_r and σ̂_r into (9) we find s*, as requested in Step 3.
Step 4 will also be easy to implement thanks to the structure of the models assumed in this section. Since the full posterior, (8), is a Normal-Inverse Gamma density, a simple Gibbs sampler allows us to obtain a random sample from this distribution, suggesting a Monte Carlo approach to compute ev̄. From (8), the conditional posteriors of ψ and σ are, respectively, 15 a multivariate normal centered at the OLS estimator ψ̂, with covariance matrix σ²(X′X)^{−1}, and an Inverse-Gamma whose parameters depend on T and on the residual sum of squares (∆Y − Xψ)′(∆Y − Xψ). With a sizeable random sample from the full posterior 16 we estimate ev̄ as the proportion of sampled vectors that generate a value of the posterior greater than s*, found in Step 3. Hence, in Step 5, we only compute one minus the estimate of ev̄ found in Step 4. The whole procedure is summarized in Table 2.

4.1. Results.
We implemented the FBST as described above to test for the presence of unit roots in 14 U.S. macroeconomic time series, all with annual frequency, first mentioned in [30]. We used the extended series, analyzed in [38]. Table 3 reports the names of the tested series, the year of the first observation, the adopted value for p 17, whether a linear (deterministic) trend was included in the model or not, the ADF test statistic and its respective p-value 18.
14 See equation (5).
15 Appendix A brings the parametrization and the probability density function of the Inverse-Gamma distribution.
16 For the implementations in this article we sampled 51,000 vectors from (8) and discarded the first 1,000 as a burn-in sample.
17 See equation (8).
18 We have used the computer procedure described in [28] to find the ADF p-values, available in the R library urca.
TABLE 2. Computing ev for the unit root hypothesis H : Γ_0 = 0 (final steps; steps 1-3 follow Table 1 with the choices described above): 4. Evaluate the posterior at the sampled vectors and estimate ev̄ as the proportion of the N draws at which the evaluated values are larger than s*. 5. Find ev = 1 − ev̄.

The last two columns of Table 3 report the posterior probability of non-stationarity, Γ_0 ≥ 0, and the FBST ev-values for the specified models. In order to obtain comparable results, we adopted the model chosen by [2] for all the series. 19 The results show that the non-stationarity posterior probabilities are quite distant from the ADF p-values. These results were highlighted in [39,40]. Considering the simplest AR(1) model, the authors argued that the discrepancy arises because frequentist inference is based on the sampling distribution of the estimator of ρ given ρ = 1, which is skewed, whereas Bayesian inference uses a distribution (the marginal posterior of ρ) which is not skewed.
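Before turning to the results, the block below is a minimal, self-contained sketch of the whole recipe of Table 2 (Steps 1-5) for a simulated random walk with drift. It is an illustration under explicitly stated assumptions: in particular, it parametrizes the model by (ψ, σ²) and adopts the illustrative improper prior p(ψ, σ²) ∝ 1/σ², which need not coincide with the prior used for Table 3. With these choices the unnormalized log-posterior, its restricted maximum s* and the Gibbs draws are mutually consistent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a random walk with drift (one unit root) and fit the ADF-style regression (5)
# with p = 2 lags, intercept and deterministic trend.  T = 120 usable observations.
T_use, p = 120, 2
N = T_use + p
y = np.cumsum(0.1 + rng.normal(size=N))

dy = np.diff(y)                                     # Delta y_t
rows = range(p, N)
dY = np.array([dy[i - 1] for i in rows])            # dependent variable
X = np.array([[1.0, i, y[i - 1]] + [dy[i - 1 - j] for j in range(1, p)]
              for i in rows])                       # columns: 1, t, y_{t-1}, Delta y_{t-1}
T, k = X.shape

def log_post(psi, sigma2):
    # Unnormalized log-posterior of (psi, sigma2) under the assumed prior 1/sigma2.
    rss = np.sum((dY - X @ psi) ** 2)
    return -(T / 2 + 1) * np.log(sigma2) - rss / (2.0 * sigma2)

# Step 3: maximize the posterior under H: Gamma_0 = 0 (drop the y_{t-1} column, the third one).
Xr = np.delete(X, 2, axis=1)
psi_r = np.linalg.lstsq(Xr, dY, rcond=None)[0]      # restricted OLS estimate
rss_r = np.sum((dY - Xr @ psi_r) ** 2)
sigma2_r = rss_r / (T + 2)                          # maximizer of the log-posterior in sigma2
s_star = -(T / 2 + 1) * np.log(sigma2_r) - rss_r / (2.0 * sigma2_r)

# Step 4: Gibbs sampler on the full (Normal-Inverse Gamma) posterior and Monte Carlo ev_bar.
XtX_inv = np.linalg.inv(X.T @ X)
psi_hat = XtX_inv @ X.T @ dY
n_draws, burn, above = 51_000, 1_000, 0
sigma2 = 1.0
for it in range(n_draws):
    psi = rng.multivariate_normal(psi_hat, sigma2 * XtX_inv)   # psi | sigma2, y
    rss = np.sum((dY - X @ psi) ** 2)
    sigma2 = 1.0 / rng.gamma(T / 2.0, 2.0 / rss)               # sigma2 | psi, y ~ Inv-Gamma(T/2, rss/2)
    if it >= burn and log_post(psi, sigma2) > s_star:
        above += 1

ev_bar = above / (n_draws - burn)
print("ev =", 1.0 - ev_bar)   # Step 5: evidence supporting Gamma_0 = 0 (typically high here)
```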
As mentioned before, [33] claims that the difference in results between the frequentist and Bayesian approaches is due to the flat prior, which puts too much weight on the stationary region. He proposed the use of Jeffreys priors, which restored the conclusions drawn by the frequentist test. Phillips argued that the flat prior was, actually, informative when used in time series models like those for unit root tests.

BAYESIAN COINTEGRATION TESTS
Before starting our brief review of the most relevant Bayesian cointegration tests, we fix notation and present the definitions to which we will refer in the sequel.
All the tests mentioned here are based on the following multivariate framework. Let Y_t = [y_1t ... y_nt]′ be a vector with n ∈ N time series, all of them assumed to be integrated of order d ∈ N, i.e. to have d unit roots. The series are said to be cointegrated if there is a nontrivial linear combination of them that has b ∈ N unit roots, b < d. We will assume that, as in most applications, d = 1 and b = 0, meaning that, if the time series in Y_t are cointegrated, there is a linear combination a′Y_t that is stationary, where a ∈ R^n is the cointegrating vector. Since the linear combination a′Y_t is often motivated by problems found in economics, it is called a long-run equilibrium relationship. The explanation is that non-stationary time series that are related by a long-run relationship cannot drift too far from the equilibrium because economic forces will act to restore the relationship.
19 All the models considered the intercept or constant term, µ in (8).
Notice also that: (i) the cointegrating vector is not uniquely determined, since for any scalar s, (s · a) is also a cointegrating vector; and (ii) if Y_t has more than two series, it is possible that there is more than one cointegrating vector generating a stationary linear combination.
It is assumed that the data generating process of Y_t is described by the following vector autoregression 20 with p ∈ N lags, denoted VAR(p), and given by

Y_t = c + Φ_0 D_t + Φ_1 Y_{t−1} + ... + Φ_p Y_{t−p} + E_t,     (12)

in which c is a (n × 1) vector of constants, D_t a (n × 1) vector with some deterministic variables 21 and Φ_0 its coefficient matrix, Φ_1, ..., Φ_p are (n × n) coefficient matrices and E_t is a (n × 1) stochastic vector with multivariate normal distribution with null expected value and covariance matrix Ω, denoted N_n(0, Ω). This dynamic model is assumed valid for t = 1, ..., T + p, the available span of observations of Y_t. Model (12) can be rewritten using the lag or backshift operator as

Φ(B) Y_t = c + Φ_0 D_t + E_t,     (13)

where Φ(B) = I_n − Φ_1 B − ... − Φ_p B^p is the (multivariate) autoregressive polynomial and I_n denotes the n-dimensional identity matrix. The associated characteristic polynomial in this context is the determinant of Φ(z), z ∈ C. If all the roots of the characteristic polynomial lie outside the unit circle, it is possible to show that Y_t has a stationary representation, 22 such as equation (12). In order to determine if this is the case, model (12) is rewritten as a (vector) error correction model (VECM):

∆Y_t = c + Φ_0 D_t + Π Y_{t−1} + Γ_1 ∆Y_{t−1} + ... + Γ_{p−1} ∆Y_{t−p+1} + E_t,     (14)

where ∆Y_t = [∆y_1t ... ∆y_nt]′, Π = −Φ(1) = Φ_1 + ... + Φ_p − I_n and Γ_i = −(Φ_{i+1} + ... + Φ_p), i = 1, ..., p − 1. It is possible to show that, when all the roots of det(Φ(z)) are outside the unit circle, matrix Π in (14) has full rank, i.e. all the n eigenvalues of Π are non null. If the rank of Π is null, this matrix cannot be distinguished from a null matrix, implying that the series in Y_t have at least one unit root and that a valid representation is a VAR of order p − 1 for ∆Y_t, i.e. model (14) without the term ΠY_{t−1}. 23 Finally, if the (n × n) matrix Π has rank r, 0 < r < n, it has only r non null eigenvalues (the other n − r being null), implying that the series in Y_t have at least one unit root and that a valid representation is given by the VECM in equation (14). In this case, Π = αβ′, where α and β are (n × r) matrices of rank r. Matrix β is the one holding the cointegrating vectors and matrix α is called the loading matrix, since it contains the weights of the equilibrium relationships. The tests developed in [20] focus on the rank of matrix Π.
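As a small numerical aside (with illustrative values only, not taken from the paper), the decomposition Π = αβ′ and the link between the rank of Π and the number of cointegrating relations can be checked directly:

```python
import numpy as np

# Illustrative 3-variable system with one cointegrating relation (r = 1):
# beta holds the cointegrating vector, alpha the loadings (adjustment weights).
beta = np.array([[1.0], [-1.0], [0.5]])          # (n x r), n = 3, r = 1
alpha = np.array([[-0.2], [0.1], [0.0]])         # (n x r)
Pi = alpha @ beta.T                              # (n x n) long-run matrix

print("rank of Pi:", np.linalg.matrix_rank(Pi))                  # 1 cointegrating relation
print("eigenvalues of Pi:", np.round(np.linalg.eigvals(Pi), 3))  # n - r of them are zero
```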
The pioneering Bayesian works to study VAR models and reduced rank regressions are [7,1,15]. However, the main concern of these papers is to estimate the model parameters and their (marginal) posterior distributions. The usual approach is to assume a given rank for the long run matrix Π, and proceed with all the computations conditional on the given rank. The Bayesian initiatives to test the rank of the referred matrix are recent, the main reference for Bayesian inference on VECM's being [24].
To justify inferential procedures based on prespecified ranks of matrix Π, [2] argued that an empirical cointegration analysis should be based on economic theory, which proposes models obeying equilibrium relationships.
20 As in the univariate case, one may include moving average terms in (12), i.e. lags for E_t, but this, in principle, would not cause major problems in the Bayesian framework.
21 Such as deterministic trends or seasonal dummies.
22 See Johansen (1996).
23 It is possible that the series in Y_t have two unit roots each, implying that the correct VECM must be written with ∆²Y_t as the dependent variable.
According to this view, cointegration research should be "confirmatory" rather than "exploratory". Even though the advocated conditional inference is simple to implement and very useful for small samples, [2] recognized that tests for the rank of matrix Π should be developed. To our knowledge, few initiatives with this purpose have been developed so far. One common approach to test sharp hypotheses in the Bayesian framework is by means of Bayes factors; testing the rank of matrix Π by Bayes factors implies several computational complications and requires the use of proper priors, as shown in [21]. Using an informal approach, [1] obtained the posterior distribution of the ordered eigenvalues of the square of the long run matrix of the VECM, obtained from a VAR model without assuming the existence of cointegration relations. As the long run matrix has reduced rank, it has some null eigenvalues, and this should be revealed by the fact that the smallest eigenvalues accumulate a lot of probability mass on values close to zero. The computations can be made straightforwardly, simulating values for the long run matrix from its posterior, which is a matrix t-Student distribution under the non informative prior (16), considered in the sequel.
Another common procedure is to estimate the rank of Π as the value r that maximizes the (marginal) posterior distribution of the rank. Conditioned on such an estimate, one proceeds to derive the full posterior and eventually estimate the cointegration space.
A different approach was proposed by [4], who used the Posterior Information Criterion (PIC), developed in [35], as a criterion to choose the mode of the posterior distribution of the rank of Π. However, as highlighted in [24], one of the advantages of the Bayesian approach is the possibility to incorporate the uncertainty about the parameters in the analysis, represented by the posterior distribution of the rank and, whatever the tool the scientist uses to infer the value of r, it is derived from this posterior distribution.
The authors of [22] nested the reduced rank models in an unrestricted VAR and used Metropolis-Hastings sampling with the Savage-Dickey density ratio 24 to estimate the Bayes Factors of all the models with incomplete rank up to the model with full rank. The Bayes Factor derivation requires the estimation of an error correction factor for the incomplete rank. This factor, however, is not defined for improper priors due to a problem known as Bartlett paradox, which arises whenever one compares models of different dimensions. The difficulty is relevant in the present case because after deriving the rank posterior density, one may consider that models of different dimensions are being compared. The paradox is stated informally as: improper priors should be avoided when one computes Bayes Factors (except for parameters common to both models) as they depend on arbitrary constants (that are integrals).
More recently, [47] developed an efficient procedure to obtain the posterior distribution of the rank using a uniform proper prior over the linearly normalized cointegration space. The author derived solutions for the posterior probabilities of the null rank and of the full rank. The posterior probabilities of each intermediate rank are derived from the posterior samples of the matrices that compose the long run matrix (α and β), properly normalized, under each rank, using the method proposed by [5].

IMPLEMENTING THE FBST AS A COINTEGRATION TEST
This section describes how to implement the FBST to test for cointegration. We will proceed in the same spirit as Section 4, i.e. going through the steps given in Table 1 to implement the test for cointegration.
Let us begin by recalling the VECM given by equation (14), in which E_t ∼ N_n(0, Ω), with 0 a null vector of dimension (n × 1) and Ω a symmetric positive definite real matrix. Notice that these assumptions already specify the statistical model (Gaussian) and its implied likelihood. Before giving it explicitly, let us rewrite equation (14) in matrix notation as

∆Y = Z η + E,

where the error matrix is E ∼ MN_{T×n}(0, I_T, Ω), denoting the matrix normal distribution 25. Now the parameter vector is given by Θ = (η, Ω).
Notice that ∆Y is formed by stacking the T transposed vectors ∆Y_t′, thus resulting in a matrix with T rows and n columns, where n is the number of time series in the vector Y_t; these are also the dimensions of matrix E. Matrix Z is constructed likewise, always stacking the transposed vectors, resulting in a matrix with T rows and pn + n + 1 columns. Finally, matrix η stacks the coefficient matrices, resulting in a matrix with pn + n + 1 rows and n columns.
Given the assumptions above, ∆Y ∼ MN_{T×n}(Z η, I_T, Ω), implying that the likelihood is

L(Θ | y) ∝ |Ω|^{−T/2} exp{ −(1/2) tr[ Ω^{−1} (∆Y − Zη)′(∆Y − Zη) ] },

where y denotes the set of observed values of the vectors Y_t for t = 1, ..., T + p. As in Section 4, we will consider an improper, non informative prior for Θ, (16), and our reference density, r(Θ), will also be proportional to a constant, leading to a surprise function equivalent to the (full) posterior distribution. These choices correspond to steps 1 and 2 of Table 1, and imply a posterior density, (17), proportional to the Gaussian likelihood above multiplied by the improper prior. To implement Step 3 of Table 1, we need to find the maximum a posteriori of (17) under the constraint Θ ∈ Θ_H, i.e. we need to maximize the posterior over Θ_H. Since we are testing the rank of matrix Π, 26 it is necessary to maximize the posterior assuming the rank of Π is r, 0 ≤ r ≤ n. Thanks to the modelling choices made here (Gaussian likelihood and equation (16) as prior), our posterior is almost identical to a Gaussian likelihood, allowing us to find this maximum using a strategy similar to that proposed by [20], who derived the maximum of the (Gaussian) likelihood function assuming a reduced rank of Π. We will summarize Johansen's algorithm, providing in Appendix B a heuristic argument for why it indeed yields the maximum value of the posterior under the assumed hypotheses.
We begin by estimating a VAR(p − 1) model for ∆Y_t with all the explanatory variables shown in (14) except for Y_{t−1}. Using the matrix notation established above, this corresponds to estimating, by OLS, the multivariate regression of ∆Y on Z_1, where Z_1 is obtained from matrix Z by removing its last n columns, exactly those corresponding to Y_{t−1}.
25 See Appendix A for more information on this distribution.
26 See the discussion made in the beginning of Section 5.
We also estimate a second set of auxiliary equations, regressing Y_{t−1} on a vector of constants and D_t, ∆Y_{t−1}, ..., ∆Y_{t−p+1}. By piling up all the (transposed) vectors Y′_{t−1} for t = p + 1, ..., T + p we have a (T × n) matrix, denoted Y_{−1}. As above, these equations can be represented by the multivariate regression of Y_{−1} on Z_1. Considering the OLS estimates of these two sets of equations, we write U and V for the respective matrices of estimated residuals. It is possible to show 27 that the estimated residuals of these auxiliary regressions are related through Π in the regression

U = V Π′ + W.     (20)

It is also possible to show that the OLS estimates of Π obtained from (14) and from (20) are numerically identical, as are the estimated residuals E and W.
The second stage of Johansen's algorithm requires the computation of the following sample covariance matrices of the OLS residuals obtained above,

Σ_UU = U′U/T,   Σ_VV = V′V/T,   Σ_UV = U′V/T,   Σ_VU = Σ′_UV,

and, from these, the n eigenvalues of the matrix Σ_VV^{−1} Σ_VU Σ_UU^{−1} Σ_UV, ordered decreasingly as λ_1 > λ_2 > ... > λ_n. The maximum value attained by the log posterior subject to the constraint that there are r (0 ≤ r ≤ n) cointegration relationships is

ℓ*_r = K − (T/2) Σ_{i=1}^{r} log(1 − λ_i),     (21)

where K is a constant that depends only on T, n and y. 28 Since ℓ*_r represents the maximum of the log-posterior, to obtain s* one should take s* = exp(ℓ*_r), completing step 3 of Table 1.
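A minimal sketch of this eigenvalue step on simulated data is given below. It mirrors the two auxiliary regressions and the eigenproblem described above and reports, for each rank r, the term −(T/2) Σ_{i≤r} log(1 − λ_i), i.e. the part of (21) that changes with r; the additive constant K is omitted, and all numerical choices (data generating process, sample size) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a bivariate VAR(1) whose long-run matrix Pi = Phi_1 - I_2 has rank 1,
# i.e. one cointegrating relation (all numerical values are illustrative).
Phi1 = np.array([[1.0, 0.0],
                 [0.3, 0.5]])
n_obs, n = 200, 2
Y = np.zeros((n_obs, n))
for t in range(1, n_obs):
    Y[t] = Phi1 @ Y[t - 1] + rng.normal(size=n)

# VECM pieces for a fit with a constant and one lagged difference:
# Delta Y_t regressed on (1, Delta Y_{t-1}) and on Y_{t-1}.
dY_full = np.diff(Y, axis=0)
dY = dY_full[1:]                                   # dependent block, T x n
Ylag = Y[1:-1]                                     # Y_{t-1}
Z1 = np.column_stack([np.ones(len(dY)), dY_full[:-1]])
T = len(dY)

def residuals(A, B):
    # OLS residuals of each column of A on the columns of B.
    coef = np.linalg.lstsq(B, A, rcond=None)[0]
    return A - B @ coef

U = residuals(dY, Z1)                              # first auxiliary regression
V = residuals(Ylag, Z1)                            # second auxiliary regression

S_uu, S_vv, S_uv = U.T @ U / T, V.T @ V / T, U.T @ V / T
mat = np.linalg.inv(S_vv) @ S_uv.T @ np.linalg.inv(S_uu) @ S_uv
lam = np.sort(np.linalg.eigvals(mat).real)[::-1]   # lambda_1 > ... > lambda_n, all in (0, 1)
print("eigenvalues:", np.round(lam, 3))

# Rank-dependent part of the restricted maximum (21), i.e. (21) without the common constant K.
for r in range(n + 1):
    print(f"r = {r}:  -(T/2) * sum_i<=r log(1 - lambda_i) = {-T / 2 * np.log(1 - lam[:r]).sum():.2f}")
```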
As in Section 4, we compute ev̄ in step 4 by means of a Monte Carlo algorithm. It is easy to factor the full posterior (17) as the product of a (matrix) normal and an Inverse-Wishart, 29 suggesting a Gibbs sampler to generate random samples from the full posterior. The conditional posteriors for η and Ω are, respectively, a matrix normal, η | Ω, y ∼ MN_{k×n}(η̂, (Z′Z)^{−1}, Ω), and an Inverse-Wishart, IW, with scale matrix S = ∆Y′∆Y − ∆Y′Z(Z′Z)^{−1}Z′∆Y and degrees of freedom depending on T and k, where k = pn + n + 1 is the number of rows of η and η̂ = (Z′Z)^{−1}Z′∆Y is its OLS estimator, as above. From a Gibbs sampler set with these conditionals we obtain a random sample from the full posterior and estimate ev̄ as the proportion of sampled parameter values that generate a value of the posterior greater than s*. Finally, we obtain ev = 1 − ev̄ in the final step (5). The whole implementation for cointegration tests is summarized in Table 4.
27 A result known as the Frisch-Waugh-Lovell theorem in the econometrics literature. See [17], theorem 3.3, or [6], Section 2.4.
28 By means of the marginal distribution of the data set, y.
29 See Appendix A for more on the Inverse-Wishart distribution.
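To illustrate Step 4, the sketch below draws from a posterior of exactly this matrix normal / Inverse-Wishart form for a small simulated system and evaluates the unnormalized log-posterior at each draw. The prior behind the degrees of freedom, p(η, Ω) ∝ |Ω|^{−(n+1)/2}, is an illustrative assumption and not necessarily the paper's exact choice; scipy's invwishart and matrix_normal are used only for convenience.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(3)

# Simulated bivariate system; the design Z stacks a constant (playing the role of D_t),
# Delta Y_{t-1} and Y_{t-1}, as in the matrix notation of the text.
T, n = 150, 2
Y = np.cumsum(rng.normal(size=(T + 2, n)), axis=0)        # two random walks
dY_full = np.diff(Y, axis=0)
dY = dY_full[1:]                                           # T x n dependent block
Z = np.column_stack([np.ones(T), dY_full[:-1], Y[1:-1]])   # T x k regressor block
k = Z.shape[1]

ZtZ_inv = np.linalg.inv(Z.T @ Z)
eta_hat = ZtZ_inv @ Z.T @ dY                               # OLS / posterior location of eta
S = dY.T @ dY - dY.T @ Z @ ZtZ_inv @ Z.T @ dY              # residual cross-product matrix

def log_post(eta, Omega):
    # Unnormalized log-posterior under the illustrative prior p(eta, Omega) ∝ |Omega|^(-(n+1)/2).
    R = dY - Z @ eta
    _, logdet = np.linalg.slogdet(Omega)
    return -(T + n + 1) / 2.0 * logdet - 0.5 * np.trace(np.linalg.solve(Omega, R.T @ R))

log_post_draws = []
for _ in range(2_000):                                     # short run, for illustration only
    Omega = invwishart.rvs(df=T - k, scale=S, random_state=rng)        # Omega | y
    eta = matrix_normal.rvs(mean=eta_hat, rowcov=ZtZ_inv, colcov=Omega,
                            random_state=rng)                           # eta | Omega, y
    log_post_draws.append(log_post(eta, Omega))

# Given log_s_star from the rank-restricted maximization of the same unnormalized log-posterior,
# ev_bar would be np.mean(np.array(log_post_draws) > log_s_star) and ev = 1 - ev_bar.
print("mean unnormalized log-posterior over the draws:", np.mean(log_post_draws))
```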
TABLE 4. Computing ev for the cointegration rank hypotheses (final steps; steps 1-3 follow Table 1 with the choices described above): 4. Evaluate the posterior at the sampled vectors and estimate ev̄ as the proportion of the N draws at which the evaluated values are larger than s*. 5. Find ev = 1 − ev̄.

Before presenting the results of the procedure applied to real and simulated data sets, it is important to remark on one feature of the FBST applied to cointegration tests. Since the estimated eigenvalues of matrix Π, λ_i, lie between zero and one, 30 (21) shows that ℓ*_0 ≤ ℓ*_1 ≤ ... ≤ ℓ*_n, where ℓ*_r denotes the maximum of the posterior (17) assuming Π has rank r, 0 ≤ r ≤ n. Therefore, one may say that the hypotheses rank(Π) = r are nested, in the sense that the respective evidence values obtained by the FBST for these hypotheses are always non-decreasing: ev(0) ≤ ev(1) ≤ ... ≤ ev(n).
This nested formulation is also present in the frequentist procedure proposed by [20], based on likelihood ratio statistics for successive ranks of Π. Thus, the FBST should be used, like the maximum eigenvalue test, in a sequential procedure to test for the number of cointegrating relationships. We will show how this should be done when presenting the applied results in the sequel.

6.1. Results.
We now present, by means of four examples, the application of the FBST as a cointegration test. The first two examples use simulated data sets, while the last two use data sets obtained from published articles. 31 In all the examples we have adopted a Gaussian likelihood and the improper prior (16). The Gibbs sampler was implemented as described above, providing 51,000 random vectors from the posterior (17). The first 1,000 samples were discarded as a burn-in sample, the remaining 50,000 being used to estimate the integral (2).
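The sequential use of the evidence values just described can be sketched in a few lines; the cutoff is illustrative and would, in practice, come from one of the criteria discussed in Section 2.

```python
# Sequential choice of the cointegration rank: accept the smallest r whose sharp hypothesis
# rank(Pi) = r is not rejected, i.e. whose supporting evidence reaches the chosen cutoff.
def choose_rank(ev_values, cutoff=0.95):
    # ev_values[r] is the FBST evidence ev supporting rank(Pi) = r, for r = 0, ..., n.
    for r, ev in enumerate(ev_values):
        if ev >= cutoff:
            return r
    return len(ev_values) - 1          # full rank if every reduced-rank hypothesis is rejected

print(choose_rank([0.01, 0.62, 0.98, 1.00]))   # illustrative evidence values -> rank 2
```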
Example 1: we generated 50 observations (vectors) of a three dimensional VAR(1) with a fixed coefficient matrix Φ_1. Matrix Φ_1 has eigenvalues equal to 1, 0.5 and 0.3, and therefore there are two non-null eigenvalues in Π (= Φ_1 − I_3), i.e. there are two cointegration vectors. The results are summarized in Table 5. The first column gives r, the rank of matrix Π being tested, the second column the evidence value, the third column the maximum eigenvalue test statistic 32, and the last column its respective p-value.

TABLE 5. FBST and maximum eigenvalue test applied to data simulated as described in Example 1.
Example 2: we simulated a two dimensional VAR(2) without deterministic variables, with fixed coefficient matrices Φ_1 and Φ_2 and error covariance matrix Ω = [1, 0.5; 0.5, 1.5]. The corresponding VECM is (14) with Π = Φ_1 + Φ_2 − I_2 and Γ_1 = −Φ_2. Given the values set for Φ_1 and Φ_2, the process has one cointegration relationship, since matrix Π = Φ_1 + Φ_2 − I_2 has one non-null eigenvalue. The results of the tests applied to the simulated vectors are shown in Table 6.

TABLE 6. FBST and maximum eigenvalue test applied to data simulated as described in Example 2.

Example 3: we applied the FBST to the Finnish data set used in the seminal work [20].
The authors used the logarithms of the series of the M1 monetary aggregate, the inflation rate, real income and the primary interest rate set by the Bank of Finland to model the demand for money which, in theory, follows a long term relationship. The sample has 106 quarterly observations of the mentioned variables, starting in the second quarter of 1958 and ending in the third quarter of 1984. The chosen model was a VAR(2) with unrestricted constant 33 and seasonal dummies for the first three quarters of the year. Writing the chosen model in the error correction form, we have

∆Y_t = c + Φ_0 D_t + Π Y_{t−1} + Γ_1 ∆Y_{t−1} + E_t,

where Π = Φ_1 + Φ_2 − I_3, Γ_1 = −Φ_2, c is a vector with constants and D_t = (D_1t, D_2t, D_3t)′ collects the seasonal dummies for quarters i = 1, 2, 3. The results are displayed in Table 7.
In [20], the authors concluded that there are at least two cointegration vectors. Since the hypotheses being tested are nested, for the FBST r = 1 seems to be the most suitable rank to choose. 34
32 See [20] for more information on the test. It is important to mention that it tests H_0 : rank(Π) = r against H_1 : rank(Π) = r + 1, 0 ≤ r ≤ n.
33 This means that the series in Y_t have one unit root with drift vector c and the cointegrating relations may have a non-zero mean. For more information about how to specify deterministic terms in a VAR see [27], chapter 6.
34 This would clearly depend on the criterion chosen to make a decision, such as a cut-off value for the FBST or a loss function, as discussed in Section 2.

TABLE 7. FBST and maximum eigenvalue test applied to the Finnish data set of Johansen and Juselius (1990).

Example 4: as a final example we apply the FBST to a US data set discussed in [26]. The observations have annual periodicity and run from 1900 to 1985. We tested for cointegration between real national income, the M1 monetary aggregate deflated by the GDP deflator and the commercial paper return rate. The chosen model was a VAR(1) with unrestricted constant. The series were used in natural logarithms and the results follow below.

FINAL REMARKS
In the past few decades, the econometric literature has introduced statistical tests to identify unit roots and cointegration relationships in time series. The Bayesian approach applied to these topics advanced considerably after the 1990's, developing interesting alternatives, mostly for unit root testing. The (parametric) frequentist tests mentioned here may not be suitable because these procedures rely on the distribution of the test statistic 35, which depends on a particular statistical model. 36 When the distributions of such statistics cannot be obtained, the procedure is saved by asymptotic results. If the researcher considers different statistical models and the available sample is small, the results of the tests may be quite misleading.
The present work reviewed a simple and powerful Bayesian procedure that can be applied to both purposes: unit root and cointegration testing. We have also shown that the FBST works considerably well even when one uses improper priors, a choice that may preclude the derivation of Bayes factors, a standard Bayesian procedure in hypothesis testing.
A long series of articles 37 has shown the versatility and the properties of the FBST, such as:
a. the e-value derivation and implementation are straightforward from its general definition;
b. it uses absolutely no artificial restrictions, like a distinct probability measure on the hypothesis set induced by some specific parametrization;
c. it is in strict compliance with the likelihood principle;
d. it can conduct the test with any prior distribution;
e. it does not need closed conjectures concerning error distributions, even for small samples;
f. it is an exact procedure, since it does not rely on asymptotic assumptions; and
g. it is invariant with respect to the null hypothesis parametrization and with respect to the parameter space parametrization 38.
To proceed with this research agenda, it would be interesting to perform more simulation studies with the FBST applied to unit root testing for a larger group of parametric and semi-parametric models (likelihoods). Another possibility is to include moving average terms in the data generating processes and work with Gaussian and non-Gaussian ARMA models. Notice that, given the points made above, these extensions would not impose major problems on the FBST as they would on the frequentist procedures. Regarding cointegration, the same extensions may be studied in future works, although the adoption of statistical models outside the Gaussian family would require further efforts to numerically implement the FBST. We shall also investigate the effect of the prior choice on the estimates of the cointegration relations, especially for small samples.
35 Usually assuming the hypothesis being tested is true.
36 Usually Gaussian.

APPENDIX A. PROBABILITY DISTRIBUTIONS
A.1. Inverse-Gamma. The probability density function of a random variable x following the Inverse-Gamma distribution with parameters a and b is proportional to x^{−a−1} e^{−b/x} for x > 0 and zero otherwise. The parameters a and b are both positive real numbers and Γ is the gamma function, which appears in the normalizing constant b^a/Γ(a).
A.2. Matrix Normal. The probability density function of a random matrix X with dimensions p × q that follows the matrix normal distribution MN_{p×q}(M, U, V) has the form

f(X | M, U, V) ∝ exp{ −(1/2) tr[ V^{−1}(X − M)′ U^{−1}(X − M) ] },

where M ∈ R^{p×q}, U ∈ R^{p×p} and V ∈ R^{q×q}, with U and V symmetric positive semidefinite matrices. The matrix normal distribution can be characterized by the multivariate normal distribution as follows: X ∼ MN_{p×q}(M, U, V) if and only if vec(X) ∼ N_{pq}(vec(M), V ⊗ U).

A.3. Inverse-Wishart. The probability density function of a random matrix x that follows the Inverse-Wishart distribution with ν degrees of freedom and scale matrix Λ is proportional to |x|^{−(ν+p+1)/2} exp{ −(1/2) tr(Λ x^{−1}) }, where x and Λ are p × p positive-definite matrices, and Γ_p is the multivariate gamma function, which appears in the normalizing constant. Notice that we may also write the same density with tr(x^{−1}Λ) inside the exponential function, as would be convenient in our implementation of the Gibbs sampler in Section 6.

APPENDIX B. HEURISTIC PROOF OF JOHANSEN'S PROCEDURE
The goal of this appendix is to provide a brief heuristic explanation of the procedure, discussed in Section 6, that finds the maximum of the posterior (17) subject to the hypothesis that matrix Π has reduced rank r, 0 ≤ r ≤ n. The procedure is based on the algorithm proposed in [19,20] to maximize a Gaussian likelihood under the same assumption (reduced rank of matrix Π). 39 As mentioned in Section 6, Johansen's algorithm can be applied to the posterior (17) since this distribution is very close to a (multivariate) Gaussian likelihood.
The first step of the algorithm involves "concentrating" the posterior, i.e. assuming Ω and Π are given and maximizing the posterior with respect to all the other parameters in Θ. Hence, let γ denote the matrix η without the rows corresponding to Π, i.e. γ′ = [c Φ′_0 Γ′_1 ... Γ′_{p−1}], and let γ̂(Π) denote the corresponding maximizer. The concentrated log-posterior, denoted by M, is found by replacing γ with γ̂(Π) in (17):
39 The formal proof of Johansen's algorithm can be found in [18], chapter 20.

M(Π, Ω) = C − (T/2) log |Ω| − (1/2) tr[ Ω^{−1} (U − V Π′)′(U − V Π′) ],     (28)
where C is a constant that depends on T, n and y. The strategy behind concentrating the posterior is that, if we can find the values of Ω and Π that maximize M, then these same values, along with γ̂(Π), will maximize (17) under the constraint rank(Π) = r. Carrying the concentration one step further, we can find the value of Ω that maximizes (28) assuming Π known, giving Ω̂(Π) =