Formula I(1) and I(2): Race Tracks for Likelihood Maximization Algorithms of I(1) and I(2) Cointegrated VAR Models

This paper provides test cases, called circuits, for the evaluation of Gaussian likelihood maximization algorithms for the cointegrated vector autoregressive model. Both I(1) and I(2) models are considered. The performance of algorithms is compared first in terms of effectiveness, defined as the ability to find the overall maximum. The next step is to compare their efficiency and reliability across experiments. The aim of the paper is to commence a collective learning project by the profession on the actual properties of algorithms for cointegrated vector autoregressive model estimation, in order to improve their quality and, as a consequence, also the reliability of empirical research.


Introduction
Since the late 1980s, cointegrated vector autoregressive models (CVAR) have been extensively used to analyze nonstationary macro-economic data with stochastic trends. Estimation of these models often requires numerical optimization, both for stochastic trends integrated of order 1, I(1), and of order 2, I(2). This paper proposes a set of test cases to analyze the properties of the numerical algorithms for likelihood maximization of CVAR models. This is an attempt to start a collective learning project by the profession about the actual properties of algorithms, in order to improve their quality and, as a consequence, the reliability of empirical research using CVAR models.
The statistical analysis of CVAR models for data with I(1) stochastic trends was developed in Johansen (1988, 1991). The I(1) CVAR model is characterized by a reduced rank restriction of the autoregressive impact matrix. Gaussian maximum likelihood estimation (MLE) in this model can be performed by Reduced Rank Regression (RRR, see Anderson 1951), which requires the solution of a generalized eigenvalue problem.
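To make the generalized eigenvalue problem behind RRR concrete, the following sketch solves |λ S11 − S10 S00⁻¹ S01| = 0 via a Cholesky factor of S11. The function name `rrr_eigen` and the toy product-moment matrices are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rrr_eigen(S00, S01, S11):
    """Solve the generalized eigenvalue problem |lam*S11 - S10 S00^{-1} S01| = 0
    that underlies reduced rank regression (illustrative sketch)."""
    S10 = S01.T
    C = np.linalg.cholesky(S11)          # C C' = S11
    Cinv = np.linalg.inv(C)
    # Symmetric matrix with the same eigenvalues as the generalized problem
    M = Cinv @ S10 @ np.linalg.solve(S00, S01) @ Cinv.T
    lam, W = np.linalg.eigh(M)           # ascending eigenvalues
    idx = np.argsort(lam)[::-1]          # sort descending
    lam = lam[idx]
    V = Cinv.T @ W[:, idx]               # eigenvectors of the generalized problem
    return lam, V

# Toy product-moment matrices built from simulated residual blocks
rng = np.random.default_rng(0)
Z0 = rng.standard_normal((200, 3))
Z1 = rng.standard_normal((200, 3))
S00, S01, S11 = Z0.T @ Z0 / 200, Z0.T @ Z1 / 200, Z1.T @ Z1 / 200
lam, V = rrr_eigen(S00, S01, S11)
```

The eigenvalues are squared sample canonical correlations between the two residual blocks, so they lie in [0, 1].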
Simple common restrictions on the cointegrating vectors can be estimated explicitly by modifications of RRR, see Johansen and Juselius (1992). However, MLE under more general restrictions, such as equation-by-equation overidentifying restrictions on the cointegration parameters, cannot be reduced to RRR; here several algorithms can be applied to maximize the likelihood. Johansen and Juselius (1994) and Johansen (1995a) provided an algorithm that alternates RRR over each cointegrating vector in turn, keeping the others fixed. They called this a 'switching algorithm', and since then this label has been used for the alternating variables algorithms in the CVAR literature. Boswijk and Doornik (2004) provide an overview. Implementations may also differ in their starting value routine: e.g., there may be a simple estimator that is consistent, or an approximation that is reasonable.
Some implementations use a small set of randomized starting values, then pick the best. This approach is general, so it could be used by all algorithms. The advantage of the present approach is that estimation is evaluated as it is presented to the user. The drawback is that it is harder to determine the source of performance differences.
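The multistart idea can be sketched as follows: run a local optimizer from several starting values and keep the best result. The crude hill-climber `local_max`, the bimodal objective, and the deterministic grid of starts (standing in for randomized draws, for reproducibility) are all illustrative assumptions.

```python
import numpy as np

def local_max(f, x0, step=0.01, n=2000):
    """Crude 1-D hill-climber: move along a clipped numerical gradient."""
    x = x0
    for _ in range(n):
        g = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
        x = x + step * np.sign(g) * min(abs(g), 1.0)
    return x

# Bimodal objective: local maximum near x = -1, global maximum near x = 2
f = lambda x: -((x + 1) ** 2) * ((x - 2) ** 2) + 0.5 * x

starts = np.linspace(-3, 3, 13)          # stand-in for randomized starting values
cands = [local_max(f, s) for s in starts]
xbest = max(cands, key=f)                # keep the best local solution
```

Starts in the left basin end near the inferior local maximum; pooling over starts recovers the global one.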

Convergence
The termination decision rule is also left to the algorithm, so it is a further source of difference between implementations. One expects this to have a small impact: participants in the races should ensure that convergence is tight enough not to change the computed evaluation statistics. If convergence is set too loose, the algorithm will score worse on reliability. However, setting it too tight will increase the required number of iterations, sometimes substantially if convergence is linear or rounding errors prevent achieving the desired accuracy.
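The trade-off between tolerance and iteration count can be illustrated with a toy linearly convergent fixed-point iteration (an assumption of this sketch, not one of the CVAR algorithms):

```python
def iterate_until(tol, max_iter=10_000):
    """Linearly convergent fixed point x <- 0.5*x + 1 (limit 2.0), stopped
    when successive values change by less than tol."""
    x, n = 0.0, 0
    while n < max_iter:
        x_new = 0.5 * x + 1.0
        n += 1
        if abs(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x, n

x_loose, n_loose = iterate_until(1e-4)   # looser tolerance: fewer iterations
x_tight, n_tight = iterate_until(1e-12)  # tighter tolerance: more iterations
```

With linear convergence, every extra decimal digit of accuracy costs a fixed number of additional iterations.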

DGP-Model Pair
The chosen DGPs generate I(1) or I(2) processes, as presented in Sections 5 and 6 below; the associated statistical models are (possibly restricted) I(1) or I(2) models, defined in Section 3. Exercises include both cases with correct specification, i.e. when the DGP is contained in the model, as well as cases with mis-specification, i.e., when the DGP is not contained in the model.
Mis-specification is limited here, in the sense that all models still belong to the appropriate model class: an I(1) DGP is always analyzed with an I(1) (sub-)model, and similarly for an I(2) DGP and (sub-)model. Indeed, the sources of mis-specification present here are a subset of the ones faced by econometricians in real applications. The hope is that results for the mis-specification cases covered here can give some lower bounds on the effects of mis-specification for real applications.
Common econometric wisdom says that algorithms tend to be less successful when the model is mis-specified. The present design provides insights as to what extent this is the case in I(1) and I(2) CVAR models, within the limited degree of mis-specification present in these races.

Construction of Test Cases
Different approaches can be used to create test cases:

1. Estimate models on real data
This is the most realistic setting, because it reflects the complexities of data sets that are used for empirical analyses. On the other hand, it can be hard to study causes of poor performance, as there can be many sources, such as unmodelled correlations or heteroscedasticities. Aggregating performance over different real datasets may hide heterogeneity in performance due to the different DGPs that have generated the real data.

2. Generate artificial data from models estimated on real data
This is a semi-realistic setting where it is known from which structure the data are generated. Coefficient matrices of the DGP will normally be dense, with a non-diagonal error variance matrix.

3. Use a purely artificial DGP
This usually differs from the previous case in that the DGPs are controlled by only a few coefficients that are deemed important. So it is the least realistic case, but offers the possibility to determine the main causes of performance differences.
Formula I(1) and I(2) adopt the artificial DGP approach with a sparse design to construct test data. The drawback is that it can only cover a limited number of DGPs, which may not reflect all empirically relevant situations. The present set of designs is no exception to this rule; however, it improves on the current state of play, in which no agreed common design of experiments has been proposed in this area. A future extension will be to include a set of tests based on real-world data sets.

Generation of Test Data
All experiments are run with errors that are fixed in advance to ensure that every participant generates exactly the same artificial data sets. The sample size is an important design characteristic; test data are provided for up to 1000 observations, but only races that use 100 or 1000 are included here.
In terms of comparability of results for different algorithms, the possibility to fix the innovations, and hence the data in each lap, controls for one known source of Monte Carlo variability when estimating difference in behavior; see Abadir and Paruolo (2009), Paruolo (2002) or Hendry (1984, §4.1) on the use of common random numbers.
The choice of common random numbers makes it possible to detect (significant) differences in the behavior of algorithms with a smaller number of cases, and hence less computer time, than when innovations vary across teams. Moreover, it also allows investigating the presence of multiple maxima.
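The common-random-numbers idea amounts to every team reconstructing identical innovations from the same fixed source. A minimal sketch, assuming a shared seed stands in for the distributed innovation files (the function name is hypothetical):

```python
import numpy as np

def make_innovations(seed, T, p):
    """Fixed innovations: every team reconstructs the same draws from the seed."""
    return np.random.default_rng(seed).standard_normal((T, p))

# Two 'teams' generate the innovations independently but identically
eps_team_a = make_innovations(20240101, T=100, p=4)
eps_team_b = make_innovations(20240101, T=100, p=4)
```

Because both teams hold the same data lap by lap, differences in their results are attributable to the algorithms, not to Monte Carlo variation.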

Evaluation
Each estimation, unless it ends in failure, results in a set of coefficient estimates with a corresponding log-likelihood. Ideally, all algorithms converge to the same stationary point, which is also the global maximum of the likelihood. This does not always happen: it is not known whether these models have unimodal likelihoods, and there is evidence to the contrary in many of the experiments considered. Moreover, the global maximum is not known, even though it is the target of each estimation. As a consequence, evaluation is largely based on a comparison with the best solution found.

Effectiveness
The overall maximum is the best function value of all algorithms that have been applied to the same problem. This is the best attempt at finding the global maximum, but remains open to revision. Consensus is informative: the more algorithms agree on the maximum, the more confident one is that the global maximum has been found. Similarly, disagreement may indicate multimodality. This is one of the advantages of pooling the results of different algorithms.
If one algorithm finds a lower log-likelihood than another, this indicates that it found a different local maximum, converged prematurely, ended up in a saddle point, or, hopefully less commonly, contains a programming error. Differences may result from the adopted initial parameter values, from the path that is taken, from the termination conditions, or from a combination of all of the above.
However, algorithms that systematically fail to reach the overall maximum should be considered inferior to the ones that find it. Inability to find the global maximum may have serious implications for inference, leading to over-rejection or under-rejection of the null hypothesis for likelihood ratio (LR) tests, depending on whether the maximization error affects the restricted or the unrestricted model.

Efficiency
Efficiency can be expressed in terms of a measure of the number of operations involved, or a measure of time. Time can be expressed as CPU time or as the total elapsed time to complete an estimation. While elapsed time is very useful information for a user of the software, it is difficult to use in the present setting. First, the same algorithm implemented in two different languages (say Ox and Matlab) may have very different timings on identical hardware. Next, this project expects submissions of completed results, where the referee team has no control over the hardware.
Even if the referee team were to rerun the experiments, this would be done on different computers with different (versions of) operating systems. Finally, the level of parallelism and number of cores plays a role: even when the considered algorithms cannot be parallelized, matrix operations inside them may be. When running one thousand replications, one could normally do replications in parallel. This suggests not to use time to measure efficiency.
With time measurements ruled out, one is left with counting some aspects of the algorithm. This could be the number of times the objective function is evaluated, the number of parameter update steps, or some other measure. All these measures have a (loose) connection to clock time. E.g., a quadratically convergent algorithm will require fewer function calls and parameter updates than a linearly convergent algorithm, and will usually be much faster as well. However, the actual speed advantage can be undermined if the former requires very costly Hessian computations (say).
For the switching algorithms that are most commonly used in CVARs when RRR is not possible, the number of parameter update steps is a better metric to express efficiency. An analysis of all the timings reported in Doornik (2017b) shows that, after allowing for two outlying experiments, the number of updates can largely explain CPU time, while the number of objective function evaluations is insignificant. Both the intercept and the coefficient that maps the update count to CPU time are influenced by CPU type, amount of memory, software environment, etc.
In line with common practice, an iteration is defined as one parameter update step. This definition also applies to quasi-Newton methods, although, unlike switching algorithms, each iteration then also involves the computation of first derivatives. As a consequence, an iteration can be slower than a switching update step, but in many situations a comparison is still informative. When comparing the efficiency of algorithms, the number of iterations appears to be of more fundamental value than CPU time, and it is certainly useful when comparing the same implementation across experiments that have been run on a variety of hardware.
There remains one small caveat: changing compiler can affect the number of iterations. When code generation differences mean that rounding errors accumulate differently, this can impact the convergence decision. This effect is expected to be small.
Summing up, the remainder of the paper uses the number of iterations as the measure of efficiency. An update of the parameter vector is understood to define an iteration, and each team participating in Formula I(1) and I(2) is expected to use the same definition.

Definitions and Statistical Models
This section introduces the car racing terminology and defines more precisely the notions of DGP and statistical model.

Terminology
Analogous to car racing terminology, a circuit refers to a specific DGP-model pair, i.e. a DGP coupled with a model specification, characterized by given restrictions on the parameters. Each circuit needs to be completed a certain number of times, i.e. laps (replications).
Circuits are grouped in two championships, called 'Formula I(1)' and 'Formula I(2)'. The implementation of an algorithm corresponds to a driver with a constructor team, which is called a team for simplicity. The definition of an algorithm is taken to include everything that is required to maximize the likelihood function; in particular, it includes the choice of the starting values and of the convergence criterion or termination rule.
In the following there are 96 Formula I(1) circuits and 1456 Formula I(2) circuits. Teams do not have to participate in all circuits. For each circuit, a participating team has to: (i) reconstruct N = 1000 datasets (one for each lap) using the innovations provided and the DGP documented below; (ii) for each dataset, estimate the specified model(s); (iii) report the results in a given format, described in Appendix A.
An econometrician may implement more than one algorithm, so enter multiple teams in the races.

Definitions
This subsection is devoted to more technical definitions of a DGP, a statistical model and its parametrization. A DGP is a completely specified stochastic process that generates the sample data X_1:T := (X_1 : · · · : X_T). Here : is used to indicate horizontal concatenation, with the exception of expressions involving indices, such as (1 : T), which is shorthand for (1 : · · · : T). For example, X_t i.i.d. N(0, 1), t = 1, . . . , T, is a DGP.
The present design of experiments considers a finite number of DGPs; these are grouped into two classes, called the I(1) DGP class and the I(2) DGP class. Each DGP class is indexed by a set of coefficients; for example X_t i.i.d. N(0, 1), t = 1, . . . , T, with T ∈ {100, 1000}, is a DGP class. A parametric statistical model is a collection of stochastic processes indexed by a vector of parameters ϕ, which belongs to a parameter space Φ, usually an open subset of R^m, where m is the number of parameters. A model is said to be correctly specified if its parameter space contains a value of the parameters that characterizes the DGP which has generated the data, and it is mis-specified otherwise. E.g. X_t i.i.d. N(µ, σ²), t = 1, . . . , T, −∞ < µ < ∞, 0 < σ² < ∞, is a parametric statistical model, when X_t is a scalar and ϕ = (µ : σ²)′. The parameter space Φ is given here by R × R_+. Note that a parametric statistical model is needed in order to write the likelihood function for the sample data X_1:T; in the case above, the likelihood is the joint Gaussian density of X_1:T, considered as a function of ϕ. Consider now one model A for X_1:T, indexed by the parameter vector ϕ with parameter space Φ_A. Assume also that model B is the same, except that the parameter vector ϕ lies in a larger parameter space Φ_B ⊇ Φ_A. The two models differ by the parameter points that are contained in Φ_B but not in Φ_A (i.e. points in the set Φ_B\Φ_A). If some points in Φ_B\Φ_A cannot be obtained as limits of sequences in Φ_A (i.e., Φ_B does not coincide with the closure of Φ_A), then model A is said to be a submodel of model B. For example, model A can be X_t i.i.d. N(µ, 1), t = 1, . . . , T, Φ_A := {µ : 0 ≤ µ < ∞}, while model B can be the same with Φ_B := {µ : −∞ < µ < ∞}. Here Φ_B\Φ_A = {µ : −∞ < µ < 0}, whose points cannot be obtained as limits of sequences in Φ_A. Hence model A is a submodel of model B.
When all the parameter values in Φ_B\Φ_A can be obtained as limits of sequences in Φ_A, then models A and B are essentially the same, and no distinction between them is made here. In this case, or in case the mappings between parametrizations are bijective, it is said that A and B provide equivalent parametrizations of the same model. As an example, let Φ_A := {µ : 0 ≤ µ < ∞} as above and let Φ_B := {µ = exp η : −∞ < η < ∞}; the two models are essentially the same, and their parametrizations are equivalent. This is because, despite µ = 0 being present only in the µ parametrization, µ = 0 can be obtained as a limit of points in µ = exp η, η ∈ (−∞, ∞), e.g., by choosing η_i = −i, i = 1, 2, 3, . . . Hence in this case the η and µ parametrizations are equivalent, as they essentially describe the same model.
In the present design of experiments all models are (restricted versions of) the I(1) and the I(2) models, defined below. The case of equivalent models in the above sense is relevant for different parametrizations of the I(2) statistical model, see Noack Jensen (2014).

The Cointegrated VAR
Both the I(1) and I(2) statistical models are sub-models of the Gaussian VAR model with k lags,

X_t = µ_0 + µ_1 t + ∑_{i=1}^{k} A_i X_{t−i} + ε_t,    ε_t i.i.d. N(0, Ω),    t = k + 1, . . . , T,    (1)

where X_t, ε_t, µ_0, µ_1 are p × 1, A_i and Ω are p × p, and Ω is symmetric and positive definite.
The presample values X_1, ..., X_k are fixed and given. The (possibly restricted) parameters associated with µ_0, µ_1, A_i, i = 1, . . . , k, are called the mean parameters and are indicated by θ. The ones associated with Ω are called the variance parameters, and they are here always unrestricted, except for the requirement that Ω be positive definite. The parameter vector is made of the unrestricted entries in θ and Ω. The Gaussian loglikelihood (excluding a constant term) is given by

ℓ(θ, Ω) = −(1/2) ∑_{t=k+1}^{T} ( log det Ω + ε_t(θ)′ Ω^{−1} ε_t(θ) ),

where ε_t(θ) equals ε_t in (1), considered as a function of θ. Maximizing the loglikelihood with respect to Ω, one finds Ω̂(θ) := (T − k)^{−1} ∑_{t=k+1}^{T} ε_t(θ) ε_t(θ)′, and the concentrated loglikelihood

ℓ(θ) = −((T − k)/2) log det Ω̂(θ).    (2)

The loglikelihood is here defined as ℓ(θ), calculated as in (2). The I(1) and I(2) models are submodels of (1).
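A minimal sketch of the concentrated Gaussian loglikelihood computed from a residual matrix, assuming the standard form −(n/2) log det Ω̂ with Ω̂ the average residual outer product (function name hypothetical); it returns −∞ when Ω̂ is numerically singular, matching the reporting convention used later for failed likelihood evaluations:

```python
import numpy as np

def concentrated_loglik(resid):
    """Concentrated Gaussian loglikelihood (up to an additive constant):
    l(theta) = -(n/2) log det(Omega_hat),  Omega_hat = resid'resid / n."""
    n = resid.shape[0]
    omega = resid.T @ resid / n
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:                      # numerically singular covariance
        return -np.inf
    return -0.5 * n * logdet

rng = np.random.default_rng(3)
resid = rng.standard_normal((98, 4))   # e.g. T = 100, k = 2 gives T - k = 98 residuals
ll = concentrated_loglik(resid)
```

Using `slogdet` rather than `log(det(...))` avoids overflow and gives a clean signal when the covariance degenerates.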

I(1) statistical models
The unrestricted I(1) statistical model under consideration is given by

∆X_t = α(β′X_{t−1} + β_D′ t) + ∑_{i=1}^{k−1} Γ_i ∆X_{t−i} + µ_0 + ε_t.    (3)

Here α and (β′ : β_D′)′ are respectively p × r and (p + 1) × r parameter matrices, r < p, with β_D a 1 × r vector. The long-run autoregressive matrix Π = −I + ∑_{i=1}^{k} A_i is here restricted to satisfy rank(Π) ≤ r, because it is expressed as a product Π = αβ′, where α and β have r columns. The coefficient µ_1 is restricted as µ_1 = αβ_D′. The Γ_i matrices are unconstrained. Some Formula I(1) races have restrictions on the columns of α and β.
The I(1) model is indicated as M(r) in what follows. The likelihood of the I(1) model M(r) has to be maximized with respect to the parameters α, β, Γ 1 , ..., Γ k−1 , µ 0 and Ω.
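The flavor of the switching algorithms used when RRR is not available can be conveyed by a toy alternating least squares sketch: approximate a matrix by a rank-one product ab′, updating a with b fixed and vice versa. This is an illustration of the alternating idea only, not the Johansen-Juselius procedure, and the helper name is hypothetical.

```python
import numpy as np

def switch_rank1(P, n_iter=500):
    """Approximate P by an outer product a b' via alternating (switching)
    least squares updates of each block holding the other fixed."""
    p, q = P.shape
    b = np.ones(q)
    for _ in range(n_iter):
        a = P @ b / (b @ b)        # LS update of a with b fixed
        b = P.T @ a / (a @ a)      # LS update of b with a fixed
    return a, b

rng = np.random.default_rng(1)
P = rng.standard_normal((5, 4))
a, b = switch_rank1(P)
resid = np.linalg.norm(P - np.outer(a, b))
# By Eckart-Young, the best rank-1 residual depends on the singular values:
best = np.sqrt(np.linalg.norm(P, "fro")**2 - np.linalg.svd(P, compute_uv=False)[0]**2)
```

For this bilinear problem the switching iteration converges (generically) to the global optimum; in restricted CVAR models no such guarantee is available, which is precisely what the races probe.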

Performance Evaluation
This section defines a number of indicators later employed to measure the performance of algorithms. As introduced above, θ indicates the vector of mean parameters and Ω the variance-covariance matrix of the innovations.

Elementary Information to Be Reported by Each Team
Each lap is indexed by i = 1, ..., N, each circuit by c = 1, . . . , C, and each team (i.e., algorithm) by a. The set of teams participating in the race on circuit c is indicated as A c ; this set contains n c algorithms. The subscript c indicates that A c and n c depend on c, because a team might not participate in all circuits. The following subsections describe the results that each team a has to report, as well as the calculations that the referee team of the race will make on the basis of it.
For each lap i of circuit c, when team a terminates optimization, it produces the optimized value θ_{a,c,i} of the parameter vector θ. The team should also set the convergence indicator S_{a,c,i} equal to 1 if the algorithm has satisfied the (self-selected) convergence criterion, and set S_{a,c,i} to 0 if no convergence was achieved. Teams should report the loglikelihood value obtained at the maximum, ℓ_{a,c,i} := ℓ(θ_{a,c,i}), using (2). In case S_{a,c,i} = 0, θ_{a,c,i} indicates the last value of θ before failure of algorithm a. Algorithm a may not have converged either because ℓ(θ) can no longer be evaluated numerically (e.g., when Ω̂(θ) becomes numerically singular) or because the maximum number of iterations has been reached. In the latter case the final loglikelihood should be reported. In the former case, when the likelihood evaluation failed, the team should report ℓ_{a,c,i} = −∞.[7] So, regardless of success or failure in convergence, a loglikelihood is always reported.
For each lap i in circuit c, team a should also report the number of performed iterations N_{a,c,i}. This number equals the maximum number of iterations if that is the reason why the algorithm terminated. Choosing smaller or larger maximum numbers of iterations will affect the results of each team in an obvious way. Teams are asked to choose their own maximum number of iterations. N_{a,c,i} is assumed here to be inversely proportional to the speed of the algorithm. In practice, the speed of the algorithm depends on the average time spent in each iteration, which is influenced by many factors, such as the hardware and software specifications of the implementation. However, because these additional factors vary among teams, the number N_{a,c,i} is taken to provide an approximate indicator of the slowness of the algorithm.
The choice of starting value of the algorithm a is taken to be an integral part of the definition of the algorithm itself. Starting values cannot be based on the results of other teams. It is recommended that the teams document their algorithm in a way that facilitates replication of their results, including providing the computer code used in the calculations and a description of the choice of initial values.
Reported results from the races should be organised in a file, whose name indicates the circuit, and where each row should contain the following information: the convergence indicator S_{a,c,i}, the loglikelihood values ℓ_{a,c,i} and ℓ^u_{a,c,i}, the iteration count N_{a,c,i}, and the reported coefficient vector θ^R_{a,c,i}. (In case of no deterministics, the satisfied inequalities are rank(αβ′) ≤ r and rank(α⊥′Γβ⊥) ≤ s.)

[7] Because there is no clear convention on writing −∞, any value of −10^308 or lower is interpreted as −∞.
where ℓ^u_{a,c,i} is the maximum of the loglikelihood of a reference unrestricted model detailed in Appendix A, and the reported part of the coefficient vector, θ^R_{a,c,i}, is defined as θ^R_{a,c,i} := (vec(α_{a,c,i})′ : vec(β_{a,c,i})′)′ for the Formula I(1) circuits; the Formula I(2) circuits use an analogous definition that also covers the additional I(2) parameters. The reported part of θ_{a,c,i} includes the estimated parameters except for the parameters of the short-run dynamics and the covariance Ω of the innovations. More details on the reporting conventions are given in Appendix A.

Indicators of Teams' Performance
After completion of lap i of circuit c by the set of teams A_c, the referee team will compute the overall maximum ℓ_{c,i} and the deviations D_{a,c,i} from it as

ℓ_{c,i} := max_{a ∈ A_c} ℓ_{a,c,i},    D_{a,c,i} := ℓ_{c,i} − ℓ_{a,c,i}.

If all a ∈ A_c report failed convergence, S_{a,c,i} = 0, then ℓ_{c,i} is set equal to −∞ and D_{a,c,i} is set equal to 0. Observe that D_{a,c,i} ≥ 0 by construction; D_{a,c,i} is considered small if less than 10^{−7}, moderately small if between 10^{−7} and 10^{−2}, and large if greater than 10^{−2}.[8] Next define the indicators

SC_{a,c,i} := S_{a,c,i} · 1(D_{a,c,i} ≤ 10^{−7}),
WC_{a,c,i} := S_{a,c,i} · 1(10^{−7} < D_{a,c,i} ≤ 10^{−2}),
DC_{a,c,i} := S_{a,c,i} · 1(D_{a,c,i} > 10^{−2}),
FC_{a,c,i} := 1 − S_{a,c,i},

where 1(·) is the indicator function, SC stands for 'strong' convergence, WC for 'weak' convergence, DC for 'distant' convergence, i.e. convergence to a point distant from the overall maximum, and FC for failed convergence. Note that SC_{a,c,i} + WC_{a,c,i} + DC_{a,c,i} + FC_{a,c,i} = 1 by construction. When ℓ_{c,i} = −∞, note that SC_{a,c,i} = WC_{a,c,i} = DC_{a,c,i} = S_{a,c,i} = 0 and FC_{a,c,i} = 1. A summary across laps of the performance of algorithm a in circuit c is given by the quantities

SC_{a,c} := N^{−1} ∑_{i=1}^{N} SC_{a,c,i},

and similarly for WC_{a,c}, DC_{a,c} and FC_{a,c}. These indicators deliver information on the percentage of times each algorithm reached strong convergence, weak convergence, convergence to a point which is not the overall maximum, or did not converge.[9] The set of pairs {(ℓ_{c,i}, ℓ_{a,c,i}) : DC_{a,c,i} = 1}, a ∈ A_c, i = 1 : N, contains the detailed information on the effects of convergence to a point that is distant from the overall maximum. These pairs are later plotted, together with the distribution of the relevant test statistics. Focusing on the laps where DC_{a,c,i} = 1, it is also interesting to calculate the average distance of ℓ_{a,c,i} from ℓ_{c,i}. This is given by

AD_{a,c} := (∑_{i=1}^{N} DC_{a,c,i} D_{a,c,i}) / (∑_{i=1}^{N} DC_{a,c,i}).

Conditionally on convergence, the average number of iterations is defined as

AN_{a,c} := (∑_{i=1}^{N} S_{a,c,i} N_{a,c,i}) / (∑_{i=1}^{N} S_{a,c,i}).

[8] As all reference values, the present ones of 10^{−2} and 10^{−7} are chosen in an ad-hoc way. In the opinion of the proponents they reflect reasonable values for the differences between loglikelihoods, which can be interpreted approximately as relative differences. Hence a difference of 10^{−2} means roughly that the two likelihoods differ by 1%, while a difference of 10^{−7} means roughly that the two likelihoods differ by 0.1 in a million, in relative terms.
[9] One may wish to consider these indicators conditionally on the number of converged cases. To do this, one can replace the division by N in the above formulae for SC_{a,c}, WC_{a,c}, DC_{a,c} with division by ∑_{i=1}^{N} S_{a,c,i}.
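The lap-level classification into strong, weak, distant, and failed convergence can be sketched as follows (one reading of the definitions, using the 10^{-7} and 10^{-2} thresholds; function name hypothetical):

```python
import numpy as np

def classify(loglik, converged):
    """Classify each algorithm's lap result as strong (SC), weak (WC) or
    distant (DC) convergence, or failure (FC), relative to the overall maximum."""
    loglik = np.asarray(loglik, dtype=float)
    converged = np.asarray(converged, dtype=bool)
    if not converged.any():
        return ["FC"] * len(loglik)     # all failed: overall maximum is -inf
    lmax = loglik.max()                 # overall maximum across teams
    out = []
    for l, s in zip(loglik, converged):
        if not s:
            out.append("FC")
        elif lmax - l <= 1e-7:
            out.append("SC")
        elif lmax - l <= 1e-2:
            out.append("WC")
        else:
            out.append("DC")
    return out

labels = classify([-512.30, -512.30000005, -512.305, -513.0, -530.0],
                  [True, True, True, True, False])
```

Five hypothetical teams on one lap: two agree with the best value to within 10^{-7}, one is moderately off, one has found a distant point, and one failed.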

Summary Analysis of Circuits and Laps
The referee team will compute summary statistics for each circuit. First, in order to identify laps where all algorithms fail, the following DNF ('did not finish') indicator is defined:

DNF_{c,i} := ∏_{a ∈ A_c} (1 − S_{a,c,i}),

which equals 1 if all algorithms fail to converge.
In order to harvest information on the number of different maxima reported by the teams, the following indicator is constructed. Let ℓ_{(1),c,i} ≥ ℓ_{(2),c,i} ≥ · · · ≥ ℓ_{(m_c),c,i} be the ordered log-likelihood values reported by those algorithms a ∈ A_c that have reported convergence, i.e., for which S_{a,c,i} = 1. This list can be used to define the 'number of reported optima' indicator, NOR, as follows:

NOR_{c,i} := 1 + ∑_{j=2}^{m_c} 1(ℓ_{(j−1),c,i} − ℓ_{(j),c,i} > 10^{−2}).

Note that for each j = 2, . . . , m_c, the difference ℓ_{(j−1),c,i} − ℓ_{(j),c,i} ≥ 0 is the decrement of successive reported log-likelihood values; if this decrement is smaller than a selected numerical threshold, here taken to be 10^{−2}, the two algorithms corresponding to (j − 1) and (j) have in practice reported the same log-likelihood. In this case, the counter NOR is not increased. If the decrement is greater than the numerical threshold of 10^{−2}, then the two reported log-likelihoods are classified as different, and the counter NOR is incremented. Overall, NOR counts the number of maxima found by different algorithms that are separated by a distance of at least 10^{−2}. NOR is influenced by the number of participating teams.
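A sketch of the NOR computation under this reading of the definition (sorted loglikelihoods of converged runs, decrements compared with the 10^{-2} threshold; function name hypothetical):

```python
def number_of_reported_optima(logliks, threshold=1e-2):
    """NOR: count distinct optima among converged loglikelihoods; values whose
    successive decrement is at most `threshold` count as the same optimum."""
    if not logliks:
        return 0                        # no converged run reported
    ordered = sorted(logliks, reverse=True)
    nor = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if prev - cur > threshold:      # decrement exceeds threshold: new optimum
            nor += 1
    return nor

# Five converged reports: two near -100, two identical at -100.5, one far away
n = number_of_reported_optima([-100.0, -100.000001, -100.5, -100.5, -123.4])
```

Here the two values near -100 merge, the tied values at -100.5 merge, and -123.4 stands alone, so three distinct optima are counted.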
Observe that the reported log-likelihood value ℓ_{(j),c,i} can correspond to an actual maximum or to any other point judged as stationary by the termination criterion used in each algorithm. No check is made by the referee team to distinguish between these situations; NOR should hence be interpreted as an indicator of the potential presence of multiple maxima; a proper check of the number of maxima would require a more dedicated analysis.[10] Especially for a difficult lap i, it is interesting to pool the information obtained by different algorithms on convergence to points that are distant from the overall maximum, i.e., when D_{a,c,i} is large. This can be averaged across the set of algorithms A_c that participate in circuit c in the indicator

DD_{c,i} := n_c^{−1} ∑_{a ∈ A_c} D_{a,c,i}.

The indicators DD_{c,i} and AD_{a,c} are obviously related; they differ in how they are averaged, either across laps or across algorithms.
The DNF and NOR indicators are also aggregated over all laps in a circuit, giving

DNF_c := N^{−1} ∑_{i=1}^{N} DNF_{c,i},    NOR_c := N^{−1} ∑_{i=1}^{N} NOR_{c,i}.

Formula I(1) Circuits
This section introduces Formula I(1) circuits, i.e. DGP-model pairs where the DGP produces I(1) variables. The model is the I(1) model M(r) or a submodel of it. For some circuits the statistical model is correctly specified, whereas for others it is mis-specified, as discussed in Section 3.2.
In the specification of the I(1) and I(2) DGPs, the innovations ε_t are chosen to be uncorrelated (and independent, given normality). This choice is made to represent the simplest possible case; it can be changed in future developments of the project.[11]

I(1) DGPs
The I(1) DGP class for lap i is indexed by the scalars (p, T, ρ_0, ρ_1). For t = 1 : T, partitioning X^{(i)}_t into a first and a last block of p/2 variables each, the data are generated as

∆X^{(i)}_{1,t} = ρ_1 ∆X^{(i)}_{1,t−1} + ε^{(i)}_{1,t},    X^{(i)}_{2,t} = ρ_0 X^{(i)}_{2,t−1} + ε^{(i)}_{2,t}.

Here and below, I_m denotes the identity matrix of order m. All possible combinations are considered of the dimension p, the sample size T ∈ {100, 1000}, and the coefficients ρ_0 ∈ {0, 0.9} and ρ_1 ∈ {0, 0.9}. Note that in these DGPs:

• the first p/2 variables in X^{(i)}_t are either random walks (when ρ_1 = 0), or I(1) AR(2) processes whose first difference is persistent (when ρ_1 = 0.9). Therefore, ρ_1 is interpreted as a 'near I(2)-ness' coefficient.

• The last p/2 variables in X^{(i)}_t are either white noise (when ρ_0 = 0), or persistent stationary AR(1) processes (when ρ_0 = 0.9). Therefore, ρ_0 is a 'near I(1)-ness' or 'weak mean reversion' coefficient. For simplicity, it is referred to in the following as the 'weak mean reversion' coefficient.
The DGPs can be written as follows, see (3):

∆X^{(i)}_t = αβ′X^{(i)}_{t−1} + Γ_1 ∆X^{(i)}_{t−1} + ε^{(i)}_t,    (7)

with α = (0_r : (ρ_0 − 1)I_r)′, β = (0_r : I_r)′ and Γ_1 = diag(ρ_1 I_r, 0_r) block-diagonal, where µ_0 = µ_1 = 0, r = p/2, and 0_r is a square block of zeros of dimension r.
To create the Monte Carlo datasets X^{(i)}_{1:T}, each team has to use the DGP (7) with the relevant values of (p, T, ρ_0, ρ_1), together with the realizations of the ε's as determined by the race organizers. Further details are in Appendix A.
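A sketch of dataset reconstruction, consistent with the verbal description of the I(1) DGP above (first block with persistent or random-walk differences, second block AR(1)); zero presample values and the function name are assumptions of this sketch:

```python
import numpy as np

def simulate_i1_dgp(eps, rho0, rho1):
    """Simulate the I(1) DGP sketch: first p/2 series follow
    dX1_t = rho1*dX1_{t-1} + e1_t (random walks when rho1 = 0), and the
    last p/2 series follow X2_t = rho0*X2_{t-1} + e2_t (white noise when
    rho0 = 0). Presample values are set to zero (an assumption here)."""
    T, p = eps.shape
    r = p // 2
    X = np.zeros((T, p))
    X1_prev = np.zeros(r)
    dX1_prev = np.zeros(r)
    X2_prev = np.zeros(r)
    for t in range(T):
        dX1 = rho1 * dX1_prev + eps[t, :r]
        X[t, :r] = X1_prev + dX1
        X[t, r:] = rho0 * X2_prev + eps[t, r:]
        X1_prev, dX1_prev, X2_prev = X[t, :r], dX1, X[t, r:]
    return X

rng = np.random.default_rng(4)           # stand-in for the provided innovations
eps = rng.standard_normal((100, 4))
X = simulate_i1_dgp(eps, rho0=0.0, rho1=0.0)
```

With ρ0 = ρ1 = 0 the first block reduces to cumulated innovations and the second block to the innovations themselves, which gives a quick sanity check on the recursion.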

I(1) Statistical Models
Using the generated data X (i) 1:T as a realization for X 1:T , the I(1) model M(r) in (3) has to be estimated on each lap i = 1, ..., N; as noted above, MLE of the unrestricted I(1) models M(r) in (3) is obtained by RRR. The estimation sample starts at t = k + 1, so uses T − k observations. Two alternative values for lag length k are used: k ∈ {2, 5}.
All I(1) circuits use the correct rank r = p/2 and are subject to further restrictions on the cointegrating vectors, with or without restrictions on their loadings. To express these restrictions, the matrix structures in (8) are introduced, where a * stands for any value, i.e. an unrestricted coefficient. Remark that R_{0,m} sets all elements except the diagonal to zero; R_{1,m} has two bands of unity along the diagonal; R_{2,m} fixes the diagonal to unity, but is otherwise unrestricted. All these matrices are square.
Finally, U_{m,n} stands for an unrestricted m × n dimensional matrix, see (9).

Restriction I(1)-A. Model A has the following overidentifying restrictions on β:

β = (R_{0,r} : I_r : U_{r,1})′.

This imposes 2(r − 1) overidentifying restrictions on β. These restrictions are mis-specified, in the sense that the DGP is outside the parameter space of the statistical model, see Section 3.

Restriction I(1)-C. Model C imposes the following, correctly specified, overidentifying restrictions on α and β:

α = (U_{r,r} : R_{0,r})′,    β = (R_{0,r} : R_{2,r} : U_{r,1})′.
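To make the restriction patterns concrete, one can encode which coefficients are free as boolean masks. This is a sketch under the verbal descriptions above (R_{1,m} is omitted, and identification normalizations are not imposed); the helper names are hypothetical.

```python
import numpy as np

def free_mask_R0(m):
    """R_{0,m}: only diagonal entries free, off-diagonal entries fixed at zero."""
    return np.eye(m, dtype=bool)

def free_mask_R2(m):
    """R_{2,m}: diagonal fixed at one, all other entries free."""
    return ~np.eye(m, dtype=bool)

def free_mask_U(m, n):
    """U_{m,n}: all entries free."""
    return np.ones((m, n), dtype=bool)

r = 3
# beta' under restriction I(1)-A: (R_{0,r} : I_r : U_{r,1}); the I_r block is fully fixed.
mask_A = np.hstack([free_mask_R0(r), np.zeros((r, r), dtype=bool), free_mask_U(r, 1)])
n_free = int(mask_A.sum())
```

Counting the `True` entries gives the number of free coefficients in β under the restriction, which is what a switching step has to update.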

Formula I(2) Circuits
This section introduces Formula I(2) circuits, following a similar approach to Formula I(1).

I(2) DGPs
The I(2) DGP class is indexed by the scalars (p, T, ω, ρ_1); the data X^{(i)}_{1:T} are generated according to (13), where X^{(i)}_t = (X^{(i)′}_{1,t} : X^{(i)′}_{2,t} : X^{(i)′}_{3,t})′, t = 1, ..., T, i = 1, ..., N, and each block ε^{(i)}_{j,t} is p/3 × 1. As in the I(1) case, all possible combinations are considered of the dimension p, the sample size T ∈ {100, 1000}, and the coefficients ω ∈ {0, 0.9} and ρ_1 ∈ {0, 0.9}. The DGPs can be written in the form (4), with µ_0 = µ_1 = 0. Note that in these DGPs:

• X_{1,t} is a pure cumulated random walk, and hence I(2);

• X_{2,t} is I(1), and does not cointegrate with any other variable in X_t. Moreover, X_{2,t} is a pure random walk when ρ_1 = 0, and it is I(1) near I(2) when ρ_1 = 0.9; therefore, as in the I(1) case, the parameter ρ_1 is interpreted as a 'near I(2)-ness' coefficient.

• X_{3,t} is the block of variables that reacts to the multi-cointegration relations, which are given by (ω − 1)X_{3,t} + ∆X_{1,t} − ∆X_{3,t}. These relations can be read off as the last block in the equilibrium correction formulation in the last display. When ω = 0, the levels X_{3,t} and the differences ∆X_{1,t}, ∆X_{3,t} have the same weight (apart from the sign) in the multi-cointegration relations; when ω = 0.9, the weight 1 − ω = 0.1 of the levels is smaller than the ones of the first differences. Hence ω can be interpreted as the 'relative weight of first differences in the multi-cointegrating relation'.
One can verify that the I(2) rank condition holds for these DGPs. To create the Monte Carlo dataset X^{(i)}_{1:T}, each team has to use the DGP (13) with the relevant values of (p, T, ω, ρ_1), together with the drawings of ε as determined by the race organisers. Details are in Appendix A.
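The verbal description of the three blocks can be sketched in code. The recursions below are not the paper's equations (13), which are not reproduced in this excerpt; the adjustment coefficient `adj` and the exact timing of the multi-cointegration relation are illustrative assumptions chosen only to match the properties listed above.

```python
# NOT the paper's DGP (13): an illustrative recursion matching the verbal
# description, with scalar blocks (p = 3). 'adj' and the timing of the
# multi-cointegration relation are assumptions made for this sketch only.
import random

def simulate(T=200, omega=0.0, rho1=0.0, adj=0.2, seed=42):
    rng = random.Random(seed)
    X1 = X2 = X3 = 0.0
    dX1 = dX2 = dX3 = 0.0
    path = []
    for _ in range(T):
        e1, e2, e3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        dX1 += e1                              # Delta X1 random walk -> X1 is I(2)
        dX2 = rho1 * dX2 + e2                  # Delta X2 AR(1): near I(2) if rho1 near 1
        mcr = (omega - 1.0) * X3 + dX1 - dX3   # multi-cointegration relation
        dX3 = adj * mcr + e3                   # hypothetical adjustment to the relation
        X1, X2, X3 = X1 + dX1, X2 + dX2, X3 + dX3
        path.append((X1, X2, X3))
    return path

baseline = simulate()           # omega = 0, rho1 = 0
near_i2 = simulate(rho1=0.9)    # Delta X2 slowly mean reverting
```

Varying `omega` and `rho1` over {0, 0.9} mimics the grid of DGP indices described above.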

I(2) Statistical Models
Using the generated data X (i) 1:T as a realization for X 1:T , the I(2) model M(r, s) in (4) has to be estimated on each lap i = 1, ..., N. The estimation sample starts at t = k + 1, so uses T − k observations. Two alternative values for the lag k are used, namely k ∈ {2, 5}.
An I(2) analysis usually starts with a procedure to determine the rank indices r, s. This requires estimating the M(r, s) model under all combinations of (r, s), and computing all LR test statistics. Usually, a table is produced with r along the rows and s 2 = p − r − s along the columns.
In the I(2) model M(r, s), the MLE does not reduce to RRR or OLS except when: (i) r = 0, corresponding to an I(1) model for ∆X t , or (ii) p − r − s = 0, corresponding to I(1) models for X t , or (iii) r = p, corresponding to an unrestricted VAR for X t .
All restricted I(2) circuits use the correct rank indices r = s = p/3. Restrictions are expressed using the matrix structures (8) and (9). In addition to restrictions on β and α in (4), there are circuits with restrictions on τ, which is a basis of the space spanned by (β : β_⊥η). Under the DGP (13), the correctly specified τ is any matrix of the stated type, with r = s = p/3 and A any full-rank (r + s) × (r + s) matrix. Recall that 0_r indicates a square matrix of zeros of dimension r; the 0_{1,r} vectors are added in the last row to account for the presence of the trend in the I(2) model (4).
Unrestricted I(2) models are estimated for 1 ≤ r ≤ p − 1 and 0 ≤ s ≤ p − r − 1. The number of models satisfying these inequalities is p(p − 1)/2. Obviously, some of these models are correctly specified, and some are mis-specified.

Restriction I(2)-A. Model A is estimated with r = s = p/3 under the following overidentifying restrictions on β: β = (R_{0,r} : U_{r,r} : I_r : U_{r,1})′.
This imposes r (r − 1) overidentifying restrictions on β. These restrictions are correctly specified.
Restriction I(2)-C. Specification (16) imposes 2r(r − 1) correctly specified restrictions on α and β; r(r − 1) of them would suffice to just-identify α and β, and hence r(r − 1) restrictions are overidentifying.

Restriction I(2)-D. Model D has r = s = p/3 and 2r(r − 1) correctly specified overidentifying restrictions on τ of the type τ = (τ_1 : τ_2), where τ_1′ = (R_{0,r} : I_r : 0_{r,r} : U_{r,1}) and τ_2′ = (R_{0,r} : 0_{r,r} : I_r : U_{r,1}).

Test Drive on Formula I(1) Circuits
To illustrate the type of information one can obtain by participating in the Formula I(1) circuits, this section presents a 'test drive' of four algorithms, i.e., teams. The results of these teams also provide a benchmark for other teams willing to participate at a later stage.
Four teams participated in the first Formula I(1) races:

• Team 1: the switching algorithm proposed in Boswijk and Doornik (2004), as implemented in Mosconi and Paruolo (2016), which alternates maximization between β and α. The algorithm is initialized using the unrestricted estimates obtained by RRR. Normalizations are not maintained during optimization, but applied after convergence. The algorithm was implemented in RATS version 9.10.

• Team 2: the CATS3 'alpha-beta switching' algorithm as described in Doornik (2017b, §2.2), using the LBeta acceleration procedure. CATS3 is an Ox 7 (Doornik 2013) class for estimation of I(1) and I(2) models, including bootstrapping.

• Team 3: the CATS3 'alpha-beta hybrid' algorithm, an enhanced version of alpha-beta switching:
1. using standard starting values, as well as twenty randomized starting values, then
2. alpha-beta switching, followed by
3. BFGS iteration for a maximum of 200 iterations, followed by
4. alpha-beta switching.
This offers some protection against false convergence, because BFGS is based on first derivatives combined with an approximation to the inverse Hessian.
More important is the randomized search for better starting values, created as perturbations of the default starting values. Twenty versions of starting values are created this way, and each is followed for ten iterations. Then (almost) identical versions are merged and the worse half is discarded; the remainder is run for another ten iterations. This is repeated until a single candidate is left. The iterations used in this start-up process are included in the iteration count.

• Team 4: the PcGive algorithm, see Doornik and Hendry (2013, §12.9). This algorithm allows for nonlinear restrictions on α and β, based on switching between the two after a Gauss-Newton warm-up. It is implemented in Ox (Doornik 2013). The iteration count for Team 4 cannot be extracted.
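Team 3's randomized start-value search described above can be sketched generically: perturb a default start, advance each candidate a few iterations, merge near-identical candidates, discard the worse half, and repeat until one survives. The objective function, step sizes, and merge tolerance below are illustrative assumptions; the actual CATS3 routine is not reproduced here.

```python
# Generic sketch of the randomized start-value search. The objective,
# perturbation scale, hill-climbing stand-in for 'ten iterations', and
# merge tolerance are illustrative assumptions.
import random

def improve(x, objective, steps=10, step_size=0.1):
    # crude hill climbing, standing in for ten iterations of the real algorithm
    for _ in range(steps):
        trial = x + random.uniform(-step_size, step_size)
        if objective(trial) > objective(x):
            x = trial
    return x

def tournament(default_start, objective, n_starts=20, tol=1e-6):
    random.seed(0)
    candidates = [default_start] + [default_start + random.gauss(0.0, 1.0)
                                    for _ in range(n_starts - 1)]
    while len(candidates) > 1:
        candidates = [improve(x, objective) for x in candidates]
        merged = []
        for x in sorted(candidates, key=objective, reverse=True):
            if all(abs(x - y) > tol for y in merged):   # merge near-duplicates
                merged.append(x)
        candidates = merged[:max(1, len(merged) // 2)]  # keep the better half
    return candidates[0]

# toy objective with a global maximum at x = 2 and a local one near x = -1
f = lambda x: -((x - 2.0) * (x + 1.0)) ** 2 - 0.1 * (x - 2.0) ** 2
best = tournament(0.0, f)
```

Because `improve` only accepts improving moves and the best candidate always survives each round, the final objective value is never below that of the default start.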
In the circuits with T = 1000, there is not much difference between k = 2 and k = 5, so the presentation is limited to only one of these values. Combining a long lag length with a small sample size is more problematic; Onatski and Uhlig (2012) consider that situation. They find that the roots of the characteristic polynomial of the VAR tend to a uniform distribution on the unit circle when log(T)/k and k³/T tend to zero.
Before analyzing the Formula I(1) races, the tests for cointegration rank are used as 'qualifying races'; these only require RRR. The qualifying races for Formula I(1) parallel the ones for Formula I(2), reported later. The overall results for the qualifying races show that: (i) even when the MLE is computed by RRR, inference on the cointegration rank is not easy (not even at T = 1000); (ii) large VAR dimension, long lag length, near-I(2)-ness, and weak mean reversion are all complicating factors for the use of asymptotic results.
In more detail, Table 1 records the acceptance frequency at the 5% significance level of the trace test, using p-values from the Gamma approximation of Doornik (1998); the null is that the rank is less than or equal to r, against the alternative of unrestricted rank up to p, where the true rank equals p/2. For T = 1000 and p = 6, the tests behave as expected. When p = 12, they tend to favour lower rank values under slow mean reversion and higher ranks under near-I(2) behaviour.
The results for T = 100 are more problematic. When p = 12, a lag length of five is excessive relative to the sample size and leads to overfitting; this shows up as the selection of a too-large rank with frequency close to 1. A lag length of 2 gives the opposite result: a too-low rank tends to be selected away from the near-I(2) cases, and a too-high rank is chosen in the near-I(2) cases. In the remainder only k = 2 is considered, as this already illustrates an interesting range of results.

Table 2 presents the Formula I(1) results for the four teams. For each team, the table reports the convergence quality (SC, WC, DC, for strong, weak, and distant convergence) as a percentage of laps, followed by the percentage of laps that failed (FC), the average error distance (AD), and the average iteration count for converged laps only (IT). Team 4 does not report the iteration count. The last two columns are averages over all teams and laps; NOR is the average number of optima reported, where unity means that in all laps all teams reported the same maximum.

Turning attention to some specific circuits, consider I(1)-A, which has valid overidentifying restrictions on β. For a large enough sample, T = 1000, all teams finish equally and quickly. This suggests that any estimation problem for T = 100 is a small-sample issue.
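The classification of laps into SC, WC, DC, and FC can be sketched as follows; the tolerance thresholds and the log-likelihood values are hypothetical, since the paper's formal definitions of the convergence categories are given in an earlier section not reproduced in this excerpt.

```python
# Hypothetical classification of one lap; the thresholds strong_tol/weak_tol
# and the log-likelihood values are illustrative, not the paper's definitions.

def classify(loglik, overall_max, converged, strong_tol=1e-6, weak_tol=1e-2):
    if not converged:
        return 'FC'                # failed to converge
    d = overall_max - loglik       # error distance from the overall maximum
    if d <= strong_tol:
        return 'SC'                # strong convergence
    if d <= weak_tol:
        return 'WC'                # weak convergence
    return 'DC'                    # distant convergence: a lower optimum

# one lap, four teams; None marks a failed lap (hypothetical values)
team_lls = [-512.31, -512.31, -514.90, None]
overall = max(ll for ll in team_lls if ll is not None)
labels = [classify(ll, overall, ll is not None) for ll in team_lls]
```

Aggregating such labels over laps yields the SC/WC/DC/FC percentages of Table 2.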
Consider now the circuits with p = 6, ρ_0 = 0.9, ρ_1 = 0, i.e., the third and ninth rows of results in Table 2, panel 1. Figure 1 plots three densities: the empirical densities (kernel density estimates) of the likelihood ratio test LR_{c,i} := 2(ℓ^u_{c,i} − ℓ_{c,i}) for T = 100 and T = 1000, where ℓ^u_{c,i} is the maximized log-likelihood under the cointegration rank restriction only, along with the χ²(6) reference asymptotic distribution. Notice that when T = 100 the empirical distribution is very different from the asymptotic one: using the 95th percentile of the asymptotic distribution as critical value would lead to severe over-rejection (more than 70%). Finite-sample corrections would therefore be very important. Notice that even when T = 1000, although the distribution approaches the asymptotic one, the difference is still substantial (the rejection rate is about 10%).
To gain some understanding of the implications of 'distant convergence', for T = 100 all laps where any of the teams obtained a distant maximum were pooled, obtaining the pairs (ℓ_{c,i}, ℓ_{a,c,i}) for which DC_{a,c,i} = 1, a ∈ A_c, i = 1, ..., N. Figure 1 plots in blue the cdf of the LR test based on the overall maximum, LR_{c,i}, as the left endpoint of the horizontal lines; the right endpoint represents the LR test based on the distant maximum, i.e., LR_{a,c,i} = 2(ℓ^u_{c,i} − ℓ_{a,c,i}). Relative to the χ²(6), distant convergence has almost no practical implications, since the inappropriate asymptotic distribution would lead to over-rejection anyway. Conversely, relative to the empirical density, in several cases one would (correctly) accept using LR_{c,i} and (wrongly) reject using LR_{a,c,i}. Distant convergence therefore has implications for hypothesis testing, at least if one takes finite-sample problems into account.

Figure 1. Black: kernel-estimate pdfs of 2(ℓ^u_{c,i} − ℓ_{c,i}) for T = 100 and T = 1000, based on 1000 laps, along with the asymptotic χ²(6). Blue: empirical cdf of 2(ℓ^u_{c,i} − ℓ_{c,i}) for T = 100, considering only laps and algorithms where distant convergence has been reported. The blue filled diamond denotes the LR calculated using the overall maximum, 2(ℓ^u_{c,i} − ℓ_{c,i}); the empty circle the LR calculated using the distant maximum, 2(ℓ^u_{c,i} − ℓ_{a,c,i}).
Consider now the mis-specified restrictions I(1)-B. Table 2 clearly shows that, whatever the DGP, maximizing the likelihood under mis-specified restrictions induces optimization problems. The number of iterations is, for all teams, much higher than under restrictions I(1)-A, and it does not decrease even when T = 1000. Failure to converge (FC) becomes a serious problem for Teams 1 and 4, whereas Teams 2 and 3 do not suffer from this problem but have a much higher percentage of distant convergence (DC). Whether distant convergence is a nuisance or an advantage in this case is, however, debatable: since the hypothesis is false, rejecting is correct, and therefore distant convergence increases the power of the test (see below).

Figure 2, in analogy with Figure 1, illustrates the challenging 'weak mean reversion' circuits with p = 6, ρ_0 = 0.9, ρ_1 = 0, i.e., the third and ninth rows of results in Table 2, panel 3. The asymptotic distribution of the LR test is χ²(4), whose 95th percentile is 9.49. Using this as a critical value, LR_{c,i} = 2(ℓ^u_{c,i} − ℓ_{c,i}) would reject about 70% of the time when T = 100, and 100% of the time when T = 1000. The power seems reasonably good also in small samples, but one needs to keep in mind that, as illustrated when discussing Figure 1, the asymptotic distribution is very inappropriate here.

Figure 2 also illustrates the impact of distant convergence. Using LR_{a,c,i} = 2(ℓ^u_{c,i} − ℓ_{a,c,i}) instead of LR_{c,i} = 2(ℓ^u_{c,i} − ℓ_{c,i}) has no practical implication for large samples (T = 1000), where the power would be 1 anyway. Conversely, in small samples (T = 100) distant convergence has a somewhat beneficial effect on power, increasing the rejection rate. Notice that distant convergence seems to occur more frequently when the null hypothesis is false, as in I(1)-B, than when it is true, as in I(1)-A; therefore, a tentative optimistic conclusion is that the gain in power due to distant convergence is more relevant than the loss in size.
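The critical values quoted above are easy to verify: for even degrees of freedom 2m, the χ² survival function has the closed form P(X > x) = e^{−x/2} Σ_{i=0}^{m−1} (x/2)^i / i!, so the 95th percentiles of χ²(6) and χ²(4) can be recovered by bisection. The log-likelihood values in the last lines are hypothetical illustration values.

```python
# Check of the quoted critical values: for even df = 2m the chi-squared
# survival function is P(X > x) = exp(-x/2) * sum_{i<m} (x/2)^i / i!.
# The log-likelihoods in the last lines are hypothetical illustration values.
import math

def chi2_sf_even(x, df):
    # exact survival function of chi-squared with even df
    m, h = df // 2, x / 2.0
    return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(m))

def chi2_crit_even(df, alpha=0.05, lo=0.0, hi=100.0):
    # 1 - alpha quantile by bisection (survival function is decreasing in x)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if chi2_sf_even(mid, df) > alpha else (lo, mid)
    return (lo + hi) / 2.0

crit6 = chi2_crit_even(6)          # reference value for the chi2(6) tests
crit4 = chi2_crit_even(4)          # reference value for the chi2(4) tests
lr = 2.0 * (-500.0 - (-506.2))     # LR = 2*(l_u - l_restricted), hypothetical
reject = lr > crit4
```

This recovers the 12.59 and 9.49 percentiles without statistical tables or external libraries.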
I(1)-C imposes valid overidentifying restrictions on α and β. The first two circuits for I(1)-C are similar to I(1)-A, except that Team 4 has a higher percentage of low-quality and failed convergence. The third circuit shows a more dramatic difference. The effect of the persistent autoregressive component when ρ_0 = 0.9 is to reduce the significance of the α coefficients. As a consequence, some laps yield solutions where some coefficients in α get very large, offset by near-zero coefficients in β. The product Π still looks reasonable, but the computation of standard errors of α and β fails (giving huge values), suggesting this may be towards a boundary of the parameter space.

Lap 999 of the third circuit for I(1)-C provides an illustration. Team 1 fails, and Teams 2 and 4 have distant convergence. Team 3 has the best results, with coefficients in which α̂ is numerically close to reduced rank. This model has a log-likelihood below that of the unrestricted I(1) model with rank 2. Because the switching algorithms really impose rank(Π) ≤ r rather than rank(Π) = r, they occasionally fail or yield unattractive results when α is statistically weakly determined. Team 4 provides more attractive estimates with reasonable standard errors, albeit with a lower log-likelihood. These characteristics are compatible with several scenarios, including the possibility that in this part of the parameter space the likelihood has a horizontal asymptote. A proper detailed analysis of these and other difficult cases is, however, beyond the scope of the present paper and is left for future research. (Recall that Figure 1 shows that, when testing I(1)-A, the asymptotic critical values lead to a 70% rejection rate even if the null hypothesis is true.)

Test Drive on Formula I(2) Circuits
As for Formula I(1), Formula I(2) circuits are illustrated through a test drive for three teams. The following teams participated in the races:
As previously illustrated, Formula I(2) is based on 1456 circuits. Although results for all circuits were obtained and stored to serve as a benchmark for future comparisons, the Formula I(2) circuits and results are too numerous to present in tabular form; hence only the cases with p = 6 and k = 2 are shown here.
The first group of circuits are called 'qualifying races', as in Formula I(1). They are designed to: (i) check the ability of the numerical algorithms to maximize the likelihood of the I(2) model M(r, s) with no restrictions except for the specification of r and s; and (ii) analyze the difficulties of the cointegration rank tests in spotting the correct r and s in the different DGPs.
Results for task (i) are illustrated in Table 3. For this part of the analysis, only the cases r = 1, ..., p − 1 and s = 1, ..., p − r − 1 are considered. This excludes all cases with r = 0 and/or s = p − r (i.e., s_2 = p − r − s = 0), since in these cases the likelihood of the I(2) model can be maximized exactly by RRR. Preliminary analysis of the results shows that there is relatively little variation across different values of ω and ρ_1; as a consequence, in Table 3 all circuits with the same p, k, T, r, s are analyzed together, irrespective of ω and ρ_1.

On the whole, Table 3 shows that the teams perform well. The percentage of 'distant convergence' (DC) is very small, and there are almost no failures. There are a few cases with a large 'average distance' (AD), but only when the ranks are smaller than in the DGP. Convergence is quick, usually in about 10 iterations. For T = 1000, 'weak convergence' (WC) occurs quite frequently, especially in mis-specified (over-restricted) models, and sometimes a large 'average distance' (AD) is observed.

On the whole, likelihood maximization is reasonably accurate for each circuit and lap; one can then proceed to find the maximum of the maximized likelihoods reported by the three teams. On this basis, the likelihood ratio tests for the cointegration ranks r and s were computed on the overall maximum. As done in Table 1 for the I(1) case, Table 4 records the acceptance frequency at the 5% significance level of the LR cointegration test, using p-values from the Gamma approximation of Doornik (1998); the null is that rank(αβ′) ≤ r and rank(α′_⊥(Γ : µ_0)β_⊥) ≤ s, against the alternative of an unrestricted VAR.

Table 4. Formula I(2), qualifying race: I(2) rank-test table. Acceptance frequencies at the 5% significance level of the LR test for ranks (r, s) against the unrestricted VAR, with p = 6, k = 2 and s_2 = p − r − s. Bold entries correspond to the true ranks r = s = s_2 = p/3 = 2.
For this aspect of the analysis, all cases r = 0, ..., p − 1 and s = 1, ..., p − r are considered. However, Table 4 does not report r = 0, since the acceptance rate is exactly zero for all values of s in that case. The case r = 1 is also not reported, because the acceptance rate is always zero for T = 1000 and very close to zero for T = 100 (less than 0.02, except for ω = ρ_1 = 0.9, where it is 0.06). The model corresponding to the DGP, i.e., r = s = s_2 = p/3 = 2, is highlighted in boldface.
Note that r is almost never underestimated, even when T = 100, irrespective of the value of ω. This seems to be a major difference with respect to Formula I(1), where r is frequently underestimated when ρ_0 is 0.9. It is important to remark that the interpretation of ρ_0 in Formula I(1) is different from that of ω in Formula I(2), although they both affect the magnitude of the coefficients in Π. In fact, ρ_0 may be interpreted as 'weak mean reversion', whereas ω has no implication for the speed of adjustment but is rather related to the relative weight of levels and differences in the polynomial cointegration relations; this might be the reason why for ω = 0.9 there is no tendency to underestimate r. It is, however, surprising that when ω = 0.9 (so that the weight of the levels is reduced to 1 − ω = 0.1) one tends to overestimate r, rejecting r = 2 in favour of r = 3 or even r = 4.
The impact of the 'near-I(2)' parameter ρ_1 is linked to the form of the DGP in (13): when ρ_1 = 0.9 the variables in ∆X_{2,t} are stationary but slowly mean reverting, so that X_{2,t} is almost I(2). Not surprisingly, then, when ρ_1 = 0.9 the tests tend to underestimate s (i.e., overestimate s_2 = p − r − s), at least when T = 100, so that r = 2, s = 0 is selected very frequently. When T = 1000 the power against s = 0 goes to 1, but one would still select r = 2, s = 1 about 10% of the time.
The results on the Formula I(2) circuits with restrictions on the cointegration parameters (in addition to the restrictions on the ranks) are illustrated in Table 5. The cases I(2)-A, I(2)-B and I(2)-C involve only the matrix Π; more specifically, as in Formula I(1), models I(2)-A involve correctly specified restrictions on β, models I(2)-B contain mis-specified restrictions on β, while models I(2)-C contain correctly specified restrictions on α and β. (The Formula I(2) circuits may be extended in the future by introducing another coefficient in analogy with ρ_0 of Formula I(1); this would amount to replacing the third equation in (13).) All three algorithms seem to perform quite well in maximizing the likelihood of the I(2) model under restrictions on Π only, with Triangular-hybrid beating the others. In particular, under the correctly specified restrictions A and C the likelihood is easily and quickly maximized (especially when T = 1000), with almost no cases of distant convergence. Conversely, the mis-specified restrictions in model I(2)-B require more iterations and, for the first two teams, induce distant convergence quite frequently. However, as observed when discussing the Formula I(1) results, it is important to keep in mind that distant convergence is a problem when the restriction is correctly specified, since it leads to over-rejection, whereas for mis-specified restrictions it can be seen as beneficial, since it increases the power of the test.
More generally, the analysis of restrictions A, B, C seems to suggest that estimation of restricted α and β is easier in the I(2) case than in the I(1) case. Note, however, that the comparison is not completely fair, since most of the difficulties in the I(1) case arise when ρ_0 = 0.9 (weak mean reversion), and this coefficient does not appear in the current Formula I(2) design.

Consider finally the restrictions I(2)-D and I(2)-E, reported in the last two panels of Table 5. Remember that I(2)-D is a correctly specified model with restrictions on τ, while I(2)-E is a mis-specified model with restrictions on τ. Table 5 shows serious difficulties in maximizing the likelihood under restrictions on τ: in both cases (i.e., whether the hypothesis is true or false), the number of iterations is much higher than under restrictions A, B and C, and it does not decrease even when T = 1000. Failure to converge (FC) becomes a serious problem for triangular switching (and to some extent delta switching), and there is a high percentage of distant convergence (DC) for all three algorithms; Triangular-hybrid performs better, having a smaller average distance (AD). Notice that for model I(2)-D (where the null hypothesis is true) distant convergence is more problematic, since it leads to over-rejection.
To analyze this problem, as done for Formula I(1), Figure 3 illustrates the impact of distant convergence. It is apparent from the figure that over-rejection is substantial here. Since the 5% critical value of the asymptotic χ²(4) distribution is 9.49, the analysis clearly shows several cases where one would (correctly) accept using the overall maximum and (wrongly) reject using the distant maximum. The striking difference with respect to Formula I(1) is that here the over-rejection due to distant convergence remains even when T = 1000.

As the final aspect of Formula I(2), consider the mis-specified restrictions on τ in model I(2)-E. Figure 4 shows that distant convergence has no practical implication for large samples (T = 1000), where the power would be 1 anyway. Conversely, in small samples (T = 100) distant convergence slightly increases the rejection rate, which would be quite high in any case.

Overall, in the setting of Formula I(2), maximizing the likelihood under correctly specified restrictions on α and β seems fast and accurate. Conversely, when correctly specified restrictions on τ are introduced, finding the overall maximum of the likelihood is not easy. Since β is one of the components of τ, one might guess that the problems arise from the directions complementary to β within τ; the issue deserves further exploration.
As in the I(1) case, maximizing the likelihood under mis-specified restrictions is difficult; however, the consequences of this difficulty are benign, because they appear to increase the power of the test in the current design of the Formula I(2) races.

Conclusions
The test run of the championships shows that there is room for improving the algorithms. It demonstrates the strength of this 'collective learning' experiment, in which other researchers may try and propose new algorithms to improve on the existing ones. All algorithms win in the end, since each team learns where and how to improve its algorithm design.
Other circuits may be added in the future, as algorithms improve. Races with a similar spirit can be set up in other adjacent fields, like fractional cointegration; the same principles may in fact be applied to any other model class where maximizing the likelihood requires numerical optimization.

Table A3. Definition of the model index m.
The model index is m := 2 i_r + i_k + 1, with:

i_r = 0 : Restriction I(1)-A or I(2)-A
i_r = 1 : Restriction I(1)-B or I(2)-B
i_r = 2 : Restriction I(1)-C or I(2)-C
i_r = 3 : Restriction I(2)-D
i_r = 4 : Restriction I(2)-E
i_r = 4 + r + (r + s − 1)(r + s)/2 : M(r, s), with ordering as in Table A4
i_k = 0 : k = 2
i_k = 1 : k = 5

The ordering of the unrestricted I(2) estimates corresponds to the column vectorization of the upper diagonal of the relevant part of a rank-test table. For instance, for p = 6 the ordering of the models is the one in Table A4.

Table A4. Ordering of the models M(r, s) for the case p = 6. Entries in the table correspond to the numbering of the models, where s_2 = p − r − s. The ordering is similar for the case p = 12.

r\s_2   5   4   3   2   1
1       1   2   4   7  11
2           3   5   8  12
3               6   9  13
4                  10  14
5                      15

As an example, results for the Formula I(2) circuit with n = 13 (i_T = 1, i_p = 1, i_0 = 0 and i_1 = 0) and m = 6 (i_r = 2, i_k = 1) should be stored in a file named FI2DGP013MOD006.csv (or FI2DGP013MOD006.txt).
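The model index, the Table A4 ordering, and the file-name convention can be checked mechanically; the small sketch below reproduces the worked example (n = 13, m = 6).

```python
# Mechanical check of the model index, the Table A4 ordering and the
# file-name convention, reproducing the worked example n = 13, m = 6.

def model_number(r, s):
    # Table A4: column vectorization of the upper triangle
    return r + (r + s - 1) * (r + s) // 2

def model_index(i_r, i_k):
    return 2 * i_r + i_k + 1

def file_name(n, m, ext='csv'):
    return 'FI2DGP%03dMOD%03d.%s' % (n, m, ext)

m = model_index(2, 1)          # Restriction I(1)-C / I(2)-C with k = 5
name = file_name(13, m)

# Table A4 for p = 6: M(r, s) with 1 <= r and s2 = p - r - s >= 1
p = 6
numbers = sorted(model_number(r, s) for r in range(1, p) for s in range(0, p - r))
```

The enumeration over the rank indices yields exactly the numbers 1 through p(p − 1)/2 = 15 of Table A4.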
Coefficients must be reported exactly in the given order, providing at least 8 significant digits (but 15 digits is recommended). No particular normalization is required.
If the algorithm failed because the likelihood evaluation failed (e.g., a singular Ω), then ℓ_{a,c,i} = −∞ should be reported. The data is processed with Ox, so .NaN and .Inf are allowed. Because there is no clear convention for writing −∞, any value of −10^308 or lower is interpreted as −∞. Table A5 provides the start of the first three lines of three selected output files.
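The reporting conventions can be sketched as follows; the one-token-per-value handling is an illustrative assumption, since the exact file layout is given in Table A5.

```python
# Sketch of the reporting conventions: 15 significant digits per coefficient,
# and any value of -1e308 or lower read back as minus infinity. The exact
# file layout (Table A5) is not reproduced; one token per value is assumed.

def format_coeff(x):
    # at least 8 significant digits are required; 15 are recommended
    return '%.15g' % x

def read_loglik(token):
    # -1e308 or lower stands for a failed likelihood evaluation
    v = float(token)
    return float('-inf') if v <= -1e308 else v

failure_token = format_coeff(-1e308)   # what a team would report on failure
```

Round-tripping `failure_token` through `read_loglik` recovers −∞, as the convention requires.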