Multivariate Count Data Models for Time Series Forecasting

Count data appears in many research fields and exhibits certain features that make modeling difficult. Most popular approaches to modeling count data can be classified into observation and parameter-driven models. In this paper, we review two models from these classes: the log-linear multivariate conditional intensity model (also referred to as an integer-valued generalized autoregressive conditional heteroskedastic model) and the non-linear state-space model for count data. We compare these models in terms of forecasting performance on simulated data and two real datasets. In simulations, we consider the case of model misspecification. We find that both models have advantages in different situations, and we discuss the pros and cons of inference for both models in detail.


Introduction
Modeling time series of counts is relevant in a range of application areas, including the dynamics of the number of infectious diseases, number of road accidents or number of bank failures. In many applications, such count data dynamics are correlated across several data series. Examples include from correlated number of bank failures [1], number of crimes [2] to COVID-19 contagion dynamics [3]. The analysis of such correlations provides detailed information about the overall connectedness of the series, as well as the dynamics of an individual series conditional on the others. Several multivariate count data models have been proposed to capture the overall connectedness of multivariate count data. Each one of these models has different underlying assumptions as well as computational challenges. We present a comparative study of two families of multivariate count data models, namely State Space Models (SSM) and log-linear multivariate autoregressive conditional intensity (MACI) models, based on simulation studies and two empirical applications.
We provide some examples of the count data and discuss particular properties that one desires to model when dealing with such data. In this paper, we assume that the counts are unbounded and we assume both models to be stationary. For discussion on the difference between bounded and unbounded count data and the difference of the modeling approaches for these data we refer to [4]. The top panels in Figure 1 present two conventional data sets that have been used for univariate illustrations, namely the monthly number of cases of poliomyelitis in the U.S. between 1970 and 1983, and asthma presentations at a Sydney hospital. The middle panel in Figure 1 presents the number of bank failures in the U.S. over time, a dataset that we also analyze in this paper, and the number of transactions for BMW in a 30 second interval. The bottom panel in Figure 1 presents a number of car crashes and a number of earthquakes. The former, number of car crashes over time is analyzed in Park and Lord [5] with a multivariate Poisson log-normal model with correlations for modeling the crash frequency by severity. The authors demonstrate that, accounting for the correlations in the multivariate model can improve the accuracy of the estimation. A common feature in all presented datasets is the autocorrelation present in count data over time that is visible in the time series plots. In multivariate count time series data, this correlation generalizes to a correlation between past and current values of a specific series as well as between different series. Models for multivariate count time series typically rely on multivariate Poisson distributions, where time-variation is defined through one or more rate parameters [6]. In some cases, Gaussian approximations are used but, as has been shown in [7], this can lead to reduced performance in the risk forecasting assessment. In general, the quality of such approximations depends on a particular problem [8]. Estimation of these models is computationally demanding for high numbers of counts as the estimation relies on the sum over all counts. In addition, these models typically have positivity restrictions on the conditional intensity function that governs the Poisson process over time and the correlation between different time series. A few exceptions to the positive correlation assumption exist, see for example, [9,10].
An alternative model for the joint distribution of count data is the copula model. A number of papers proposed different copula models for multivariate count time series, see, for example, [11][12][13][14]. Copulas are generally used for modeling dependency in multiple series, which makes them attractive methods also for multiple count time series. However, several issues arise in their applications to count data such as unidentifiability and not accounting for potential overdispersion-a property that is common for count data. Genest and Nešlehová [15] provides a detailed overview of copula models for count data, and proposes and compares Bayesian and classical inferential approaches for multivariate Poisson regression. They show that computationally Bayesian and classical approaches are of a similar order.
Both approaches of modeling joint distribution of count data-multivariate Poisson distribution and copulas-can be incorporated in the autoregressive conditional intensity (ACI) framework, often also referred to as an integer-valued generalized autoregressive conditional heteroskedasticity model (INGARCH). This model belongs to the class of observation driven models as opposed to parameter driven models, a classification proposed by Cox et al. [16]. ACI models have been dominating the literature for quite a long time despite their restrictiveness: these models only allow only for positive coefficients in the equation for conditional intensity. These bounded coefficients lead to several problems besides potentially unrealistic dependence structure for some data. In particular, the problem of calculating confidence sets for the parameters that are close to or on the boundary rises and has not been yet solved in the literature. Another observation driven model that has been proposed as an alternative to ACI framework is log-linear model, see Fokianos and Tjøstheim [17], a multivariate extension of which has been considered in Doukhan et al. [10]. Even though the problem of modeling joint distribution remains, the advantage of this approach is that no restrictions on the parameter space are required due to the log-transform of the data.
Another class of models that can be considered for modeling count data, but is rarely used in the literature, is parameter driven models and, in particular, non-linear state-space models. In this framework, the observations are driven by an independent unobserved stochastic process which, for instance, can be a (vector) autoregressive process (VAR(p)). These models have been discussed extensively in the univariate case, see, for example Davis et al. [18]. However, these models are rarely used in multivariate applications due to the computationally demanding estimation methods that have to be used. To our knowledge, only one very recent study has considered them in a multivariate application Zhang et al. [19]. Non-linear state-space models are capable of modeling and inferring complex dependence structures in the data. They allow for both negative and positive contemporaneous correlation, as well as for both negative and positive Granger-causal feedback. Thereby, these models avoid the problem of modeling the joint distribution of time series of counts and provide a coherent inferential tool in the Bayesian framework. This is what distinguishes our approach from the approach discussed in Zhang et al. [19] who consider frequentist estimation of these models. We also compare SSM to log-linear models instead of MACI models since they allow for negative dependence between the intensities and hence appear to be more natural competitors of SSM models.
In this paper, we compare two classes of models, observation driven and parameter driven models, in terms of their forecasting performances. We estimate the observation driven models the quasi-maximum likelihood method. Parameter driven models, however, fit very well into the Bayesian paradigm and that is what we use for estimation. Certain advantages come together with this framework, such as those naturally obtained from the posterior distribution uncertainty about the parameters of the model and forecast of the multivariate time series [20]. We in particular use particle Markov chain Monte Carlo (pMCMC) [21] for the estimation of the parameter driven model. As is discussed in [22], pMCMC outperforms other methods (variational Bayes [23], integrated nested Laplace approximation [24] and Riemann manifold Hamiltonian Monte Carlo [25]) in terms of parameter estimation. There are other recent methods for the estimation of the state-space models such as auxiliary likelihood-based approximate Bayesian computation [26] and variational Sequential Monte Carlo [27], but their performance has to be investigated further, which is outside of the scope of this paper.
We present a set of simulation studies to show how these models perform when they are correctly specified and misspecified. The simulation results show that, as expected, the correctly specified models perform generally well, but there are exceptions. Particularly, parameter driven models have better forecast performances in some simulations even if they are misspecified. In addition to these simulation studies, we compare the performances of these models in two real data applications. The two data sets we analyze exhibit different sample sizes, standard deviation, dispersion and maximum counts. We show that the overall forecast performances of the models can be very different, depending on the applications. Furthermore, for the second data set we analyze, we find that observation driven models capture extreme data values better than parameter driven models.
The remainder of this paper is as follows: Sections 2 and 3 summarize observation and parameter driven models, respectively. Section 4 presents the model and forecast comparison tools we use for multivariate count data models. Section 5 presents simulation results. Section 6 presents results from applications to two data sets. Finally, Section 7 concludes the paper.

Observation Driven Models
In this section, we summarize two observation driven models: multivariate autoregressive conditional intensity model and log-linear analog of it. Both of these models are characterized by the dynamics that depend on the past of the process itself and some noise. Both models have been considered in Doukhan et al. [10], where the authors discussed some theoretical properties and proposed to use copula approach for modeling joint count distribution. Copulas are flexible tools for modeling dependence structure but their use in count time series models brings challenges. We first summarize the use of Poisson distribution for count data, analyze both models under an independence assumption in the Poisson random variables, and at the end of this section, we discuss the extension of modeling multiple count time series with multivariate Poisson distribution.

Poisson Distribution
Many of the count time series models take their origins in the idea of Poisson regression model, an extensive overview of these models is given in Fokianos [28]. Specifically, both models considered in this section as well as the parameter driven models in Section 3 rely on Poisson distributions. We therefore first provide some background on the Poisson distribution. Poisson distribution has played an important role in modeling count time series data as its interpretation is the number of independent events that occur in a time period. The Poisson distribution is defined for a random variable x takes integer values in {0, 1, . . . }. The mean of the distribution, λ, describes the average occurrences per interval, the distribution has the equi-dispersion property since the variance its variance is also λ, and the probability mass function (pmf) of the distribution is For the simple multivariate case, two Poisson random variables, say x 1 and x 2 , the joint pmf reads The multivariate extension in (2) is rather naive due to the underlying independence assumption between x 1 and x 2 . Such a model would ignore potential dependency of the real world data, thus is potentially not suitable for the majority of applications. One way to use the Poisson distribution for modeling count data in multivariate case and incorporate correlation structure is through the so-called trivariate reduction [13,29]. The idea is that correlation can be modeled through the third Poisson variable. Assume we have three independent random variables Define Y t,1 = X 1 + X 2 and Y t,2 = X 2 + X 3 . In this way, the random variable X 2 is exploited to model the dependence between Y 1 and Y 2 . The restriction of this approach is that the correlation is the same between all the series (in case one wants to model the systems beyond bivariate case) and the dependence can only be positive. We further discuss the trivariate reduction technique in the context of ACI/INGARCH models. In particular, the difficulties of extending this to higher dimensions is of interest and presents one with a challenging task.

MACI (INGARCH)
The Poisson integer-valued generalized autoregressive conditional heteroscedastic process (INGARCH) models [30]-also called multivariate autoregressive conditional intensity models (MACI) in the literature-are built upon GARCH framework and are capable of capturing time series properties of count data. As for GARCH-type models, it is assumed that the conditional mean of the process at time t depends on the value of the process at period t − 1 and its conditional mean at time t − 1. The time series of counts follow Poisson process with the conditional mean λ t , that is, The corresponding joint pmf reads The dynamics of the conditional intensity Note that the elements of ω, a i , b j are assumed to be positive to ensure the positivity of the intensity process λ. (Doukhan et al. [10] argue that the condition || A + B || 2 < 1 guarantees stationarity.) In addition, we assume no contemporaneous correlation in the counts. Consider the bivariate case for the conditional intensity process From Equation (6) it is clear that when A and B are diagonal, there is no dependence structure between the intensities. Further, when a 12 = 0 and b 12 = 0 then the intensity of the first process, λ 1,t , depends only on its own past while the second process can depend on the dynamics of the first one. Finally, if we restrict A to be diagonal and B to be nondiagonal, every intensity process would depend on its past and possibly on the past of all of the observations. This constraint is relevant when one wants to apply graphical modeling to this problem.

Quasi-Maximum Likelihood for MACI Models
In this section, we discuss how the inference for MACI/INGARCH models can be executed. The details for the multivariate case have also been discussed in Doukhan et al. [10]. For these models, we make use of the classical estimation framework and in particular use quasi-maximum likelihood estimation. The conditional quasi-likelihood for this model and θ reads where θ are the parameters of interest. It follows the the quasi log-likelihood function is and the corresponding score function reads where ∂λ t /∂θ T is n × d matrix with d ≡ n(1 + 2n) being the dimension of the parameter vector θ, D t is an n × n diagonal matrix and its diagonal elements are λ i,t (θ), i = 1, 2, . . . , n, and X t consists of elements x i,t , i = 1, 2, . . . , n, t = 1, 2, . . . , T. Thus the recursions for the quasi-maximum likelihood estimation follow Finally, the Hessian matrix and the conditional information matrix are correspondingly Further, one can show that S n (θ) = 0 has a unique solution,θ, which is strongly consistent and asymptotically normal. For further details of these properties, we refer the reader to Doukhan et al. [10]. However, that theoretical properties ofθ are proven under assumption that the true value θ 0 belongs to the interior of the parameter space Θ. The problems certainly arise when the true parameter is close or on the boundary of the parameter space. Dealing with the theoretical problems of the constrained optimization and parameters near or on the boundary of parameter space is out of the scope of this paper and generally establishing the theory for this case is a complicated task. One of the possible solutions is to exploit bootstrap methods for this task, see Hilmer et al. [31] for a comparison of some bootstrap methods related to this sort of problem and review of other possible approaches.

Log-Linear Autoregressive Model
Log-linear models have appeared in the count data literature in the recent years [10] and have good potential since they allow for both positive/negative correlation and avoid parameter boundary problems which MACI models suffer from.
where F Y,λ t−1 is the σ−field generated by {X 0 , . . . , X t , λ 0 }, 1 n is the n-dimensional vector of ones, ν t ≡ log λ t . Parameters of this model, ω, A, and B, do not have to be positive which makes this model more attractive than MACI.

Quasi-Maximum Likelihood for Log-Linear Models
The inference in log-linear models is very similar to the quasi-maximum likelihood approach derived for MACI models in Section 2.3. Only minor adjustments have to be made in corresponding recursions [10]. In particular, the score function for the log-linear model reads the Hessian matrix is (18) and the conditional information matrix for the log-linear model reads Doukhan et al. [10] prove theoretical properties of this model. In particular, they show that there exists a unique solutionθ which is strongly consistent and asymptotically normal. The authors also show that the condition ∑ ∞ j=0 || A j B || 2 < 1 guarantees both stationarity and weak dependence.

Multivariate Poisson Distribution
To allow for contemporaneous correlation, we need to use trivariate reduction technique discussed before. We consider the bivariate case to give an example, assume that there are three independent random variables Y 1 , Y 2 , Y 3 with positive means λ 1 , λ 2 , λ 3 respectively. Define random variables The new random variables will have means λ 1 + λ 3 and λ 2 + λ 3 , where λ 3 would also correspond to the covariance between X 1 and X 2 . The covariance is clearly restricted to be positive, while correlation will lie between 0 and min{ }. Thereby the joint pmf of interest, alternative to what we have in Equation (3), becomes Extending this approach to contemporaneous correlation in higher dimensions is not trivial. Suppose that we would like to model n Poisson random variables, thus X i ∼ Poisson(λ i ), i = 1, . . . , n. By defining a random variable X 0 ∼ Poisson(λ 0 ) and extending the argument of the trivariate reduction we can define random variables where m = min(x 1 , x 2 , . . . , x n ). This approach assumes that the covariance is the same for all the pairs of Poisson random variables which is very restrictive. Karlis and Meligkotsidou [9] consider the case with richer covariance structure which we discuss further. For simplicity, assume we want to model trivariate time series of counts Y 1 , Y 2 , Y 3 . As before, let us specify through X i and X ij univariate Poisson random variables, i.e., Then the random variables Y i with i ∈ {1, 2, 3} are defined in the following way Finally, these random variables follow the Poisson distribution with λ = (λ 1 , λ 2 , λ 3 , λ 12 , λ 13 , λ 23 ), and hence with the mean vector Aλ = (λ 1 + λ 12 + λ 13 , λ 2 + λ 12 + λ 23 , λ 3 + λ 13 + λ 23 ) . The variancecovariance matrix for this distribution is given by It is clear from the above examples that the modeling of the time series of counts with multivariate Poisson distribution in higher dimensions is restrictive and cumbersome. It is restrictive, since it allows only for positive dependency in the data, which can be unreasonable for real-world applications. It is cumbersome since the method is only computationally tractable when one has low counts data, see Equation (21) in which the number of terms in the sum depends on the number of observed counts. Methods such as expectation maximization can be applied in this case but they are not trivial and stable in case of high counts. Moreover, in this case, incorporation of the multivariate Poisson distribution into MACI or log-linear models also affects computational speed substantially, and these models lose their attractiveness over more complex models such as nonlinear state-space models in the next section.

Parameter Driven Model: Nonlinear State-Space Model
The advantage of parameter driven models is the clear interpretability of the model parameters and the high degree of flexibility. The model can easily incorporate different distributions and extends easily to the multivariate framework. Moreover, in the Bayesian framework, we have coherent inferential tools derived from the posterior distributions of the parameters, such as highest posterior density intervals. These models also take into account uncertainty about the parameters which is incorporated into predictions. The disadvantage of this approach is challenging estimation procedures that are computationally intensive. Hence, even though theoretically estimation methodologies are possible to extend to any dimension, in practice that is not feasible due to the time con-straints. In this paper, we estimate the parameter-driven model for multivariate count data using Sequential Monte Carlo and particle Markov Chain Monte Carlo. These methods became popular with the availability of more computational power. They are restricted in some ways, and we will discuss these restrictions in the next subsections after introducing the nonlinear state space model (SSM), which we will compare to the observation driven models.

Multivariate SSM
A state-space model is usually presented by an observation equation and a state equation. The state equation represents a latent process, say h t , which drives the dynamics of observations y t . In the multivariate SSM for count data below, this dependence between the observations and the state is nonlinear where η t ∼ N(0, Σ η ). Equation (24) shows that the observations have Poisson distribution with mean λ t defined through the Equation (25), and λ t nonlinearly depends on the latent process h t which is defined through Equation (26). Note that the latent process is defined through a VAR(p) process, and hence corresponding theory applies. In particular, the stationarity condition is that the roots of the Equation (27) must lie outside the unit circle The dependence structure between counts is modeled through the dependence in the latent process. Conditioned on the latent process {h t } T t=1 the observations {X t } T t=1 are independent. Furthermore, since the latent process of the model is a VAR(p), we can account for various dependence structures: positive and negative contemporaneous correlation, and positive and negative Granger-causal feedback. These models are challenging to estimate, and an assumption of p = 1 can simplify the inference. (For extending the model to p > 1 we advise the reader to consider using sparse priors, such as Minnesota prior, spike and slab or horseshoe prior.) Bivariate specification of the nonlinear state-space model with the lag p = 1 reads The dependence structure between series is described by the Granger-causal relationship in the latent processes h it and contemporaneous relations that are incorporated in Σ η . For example, we say that h 2,t does not Granger-cause h 1,t if φ 12 = 0. Correlation coefficient ρ in this model allows us to model both positive and negative correlation between the counts.

Bayesian Inference in Multivariate SSM
The estimation of nonlinear state-space models naturally fits into the Bayesian framework. The presence of the unobservable process in the model and nonlinear dependence of the observations on this unobservable process leads to an intractable likelihood and posterior. For this reason, and due to the nonlinear SSM structure, we use particle Markov Chain Monte Carlo (PMCMC) for the estimation of the posterior distribution of the model parameters Andrieu et al. [21]. The method consists of two parts. First, the likelihood is estimated in a sequential manner through a particle filter. Second, this estimate is used within an MCMC sampler, in our case Metropolis-Hastings algorithm. An extensive introduction to nonlinear state-space models and particle filtering can be found, for example, in Särkkä [32].
Recall the Bayes rule on which the inference is based where π(h 1:T ) is the prior distribution on the parameters of the volatility process defined by the dynamic model, p(x 1:T |h 1:T ) is the likelihood of the observations, p(x 1:T ) is the normalization constant that is ignored during the inference. Thus, we use Bayes rule in proportionality terms p(h 1:T |x 1:T ) ∝ p(x 1:T |h 1:T )π(h 1:T ).
We use particle Metropolis-Hastings to estimate the posterior distribution of the parameters of the model since neither the likelihood nor the posterior are available in closed form. We use Metropolis-Hastings algorithm to sample from the posterior of the parameters. Algorithm 1 presents an iteration of the Metropolis-Hastings algorithm. At every iteration of the algorithm we make a new proposal θ c i for the parameter vector using a proposal mechanism q(·|θ (i) ). Then we accept the proposed candidate θ c i with probability α. The acceptance probability in Algorithm 1 depends on p(θ, h 1:T |x 1:T )target distribution-and q(·)-proposal distribution. How well we manage to explore the posterior distribution depends on the acceptance rate of the algorithm. When the acceptance rate is too high it is often related to too small proposal steps, and the other way around. Overall, either case slows down the convergence of the Markov Chain. General advice for the optimal performance of the algorithm is an acceptance rate that is around 0.234 [33].
We further discuss how p θ (x 1:T ) can be estimated.

Estimation of the Likelihood with SMC
Sampling from the posterior distribution with algorithms such as Metropolis-Hastings requires evaluationg the likelihood. In case of non-linear state-space models, this likelihood evaluation is not straightforward since the likelihood is a high dimensional integral L(x 1:T ) = p(x 1:T , h 1:T )dh 1:T = p(x 1:T |h 1:T )p(h 1:T )dh 1:T and this likelihood is not analytically tractable. Instead of relying on an analytical result, the integral from Equation (33) can be approximated using Sequential Monte Carlo methods, also known as particle filters. This estimate of the likelihood is then used in Algoperithm 1 asp θ (x 1:T ). Algorithm 2 represents a simple version of a particle filter which we use in this paper. The algorithm consists of three main steps: prediction, updating and resampling. In the prediction step we sample N particles according to the assumed dynamics of the latent process p(h t |h t−1 ). Then we weight each particle according to the distribution of the data given the latent state p(x t |h t ). Finally, in the resampling step we resample the particles according to these weights. Resampling step is meant to solve the known problem of particle degeneracy: without resampling we end up only with a few particles with high weights over time. The particle filter provides us with the sequence of distributions p(h t |x t ), however due to particle degeneracy problem discussed previously, sampling from p(h 1:T |x 1:T ) and approximating p(h k |x 1:T ), k = 1, . . . , T, is inefficient. One of the possible solutions to this problem is using so called forward filtering -backward smoothing recursions [34]. The algorithms starts with sampling h * T ∼p(h T |x 1:T ), and then backwards for k = T − 1, T − 2, . . . , 1, we sample h * k ∼p(h k |h * k+1 , x 1:n ) . Then we can approximate the distributionp(h k |x 1:T ) as followsp The smoothing comes at cost of O(NT) operations to sample a path from p(h 1:T |x 1:T ) and O(N 2 T) operations to approximate p(h k |x 1:T ). This method works very well, in particular when dealing with large sample sizes. However, its performance comes at the price of a high computational costs. Thereby, it is generally recommended to use it when the sample size of the data is large and hence Sequential Monte Carlo is more likely to suffer from particle degeneracy. There exist other methods that are computationally less expensive [34]. However, in higher dimensions, they would be less reliable, and it would be recommendable to use more expensive methods.

Forecasting with SSM
One of the interests of statistical inference is the ability to perform forecasting exercises and thus handling the uncertainty about the future in the best possible way. In this section, we will discuss how forecasting task fits into the Bayesian framework, and in particular how it can be done for models of our interest.
Recall that we estimated multivariate SSM model for count data in the Bayesian framework. Observing the data x = (x 1 , . . . , x T ) we estimated the posterior distribution of the parameters in our model using particle Markov Chain Monte Carlo methods p(θ|x). Suppose that we are interested in predicting the next s observations, that is, x T+1 , . . . , x T+s . First, note that the prediction equation for the next step reads In the framework of particle Markov Chain Monte Carlo, it is natural to adopt a sequential nature of SMC and the fact that we obtain posterior draws in the MCMC part of the algorithm. Thereby, for every vector θ of the parameters drawn in the MCMC, we can propagate the particles obtained at time T and based on those make one-step ahead forecast. The similar idea extends to s-steps ahead forecasts. In this case the uncertainty about the parameters is included in the forecasts.
When forecasting, the most natural but cumbersome approach would be to update the posterior distribution every time we get a new observation. It would mean that we generate as many MCMC chains as we have steps for forecasting. This can be carried out in a straightforward way by re-estimating the posterior distribution every time or more efficiently by incorporating this into the SMC framework. However, for large enough samples, adding an extra estimation into the PMCMC framework should not change the results substantially. Ignoring this update also makes the forecasting performance of the frequentist and Bayesian approaches more comparable. Both frameworks are estimated in different paradigms. While SSM naturally fits into the Bayesian paradigm, the log-linear model is usually estimated using frequentist methods (quasi-maximum likelihood in this case). Since our goal is not to compare the two approaches to statistics, this design of forecasting exercise is more fair. Figure 2 illustrates the forecasting approach we undertake with the state-space model in a graphical representation. In particular, one can see that we do not re-estimate the posterior distribution every time we receive a data point.

Model Comparison and Prediction Assessment
We next summarize the measures using which we compare the models in Sections 2 and 3. Observation driven models in this comparison are represented by the log-linear autoregressive model. The log-linear autoregressive model is more flexible than the MACI model since it can account for a negative correlation and thus it is a fairer competitor. The parameter driven approach is the state-space model, where observations are generated from the Poisson distribution and dependency is modeled through a latent process. Note that, for the latter framework, we follow a fully Bayesian approach. Thereby, we compare these two classes of models by model fit and forecasting performance criteria. The standard measures to access the model fit and forecast accuracy would be Mean Squared Error (MSE) and Mean Absolute Error (MAE) defined in Equation (36) respectively.
In Czado et al. [35] the authors propose comparing forecast performance using some scoring rules. To define scoring rules, let P be the predictive distribution and x be the counts; then the penalty is defined through s(P, x). Table 1 presents some of the scoring rules one can use for comparing the performance of count data models. Table 1. Scoring rules for assessment of the forecasts. The table summarizes scoring rules that we use to assess forecasting performance of the models under consideration, proposed in Czado et al. [35] for count data.
Logarithmic score log(P, Squared error score ses(P, x) = (x − µ p ) 2 , where µ p is the mean of P Note, that in practice, one calculates the mean score To compare our results with the conclusions in Zhang et al. [19] we also report Dawid-Sebastiani (DS) score which is defined in Equation (38)

Simulation Examples
In this section, we demonstrate the performances of the models based on simulated data. We generate data from various specifications of SSM and log-linear MACI models and compare the models on forecasting performance. We assess forecasting performance based on six different scoring measures discussed in the previous section. The design of the simulation study allows us to assess forecasting performance in the cases of both correct model specification and misspecified case. Table 2 summarizes three different specifications of the state-space approach for data generation and Table 3 summarizes specifications of the log-linear MACI for data generation. Figure 3 illustrates two examples of bivariate time series generated from these models. For each simulation setting, we generate ten datasets with different random seeds and report the average results from these ten datasets. State-space model was estimated using particle Metropolis-Hastings algorithm with N = 5000 particles and M = 20000 Metropolis-Hastings step with a warmup period of 5000 steps. The acceptance rate was targeted to be between 25% and 40%.

Data Set
We assess the forecasting performance of two models for multivariate count data: statespace model and log-linear model. Tables 4 and 5 summarize the forecasting performances of the models according to various scoring rules. The rows of the tables correspond to a particular data generating process (see for details of specification Tables 2 and 3) and columns for a particular scoring rule (see scoring rules specification in Table 1). In particular, Table 4 shows performance of the state-space model and Table 5 the performance of the log-linear model. The state-space model outperforms the log-linear MACI model when the data are generated from SSM 1 (SSM with positive correlation) and LL 1 (log-linear model with a negative a 11 coefficient). It is particularly interesting that when the data is simulated from LL 1 , SSM performs best according to all measures despite being a misspecified model.
When the data are generated from SSM 2 and SSM 3 , the state-space approach performs best based on most measures. This result is expected as SSM is the correct model specification for these simulated data. Finally, log-linear MACI model performs best in the case of data set LL 2 -in the case when the model is correctly specified and all the coefficients are positive-according to most measures.  Table 2. (b) Dimension 2 of the bivariate time series generated from SSM2 in Table 2. (c) Dimension 1 of the bivariate time series generated from LL2 in Table 2. (d) Dimension 2 of the bivariate time series generated from LL2 in Table 2.

Empirical Applications
In this section, we compare the models in two empirical applications-bank failures and transactions data. These data sets exhibit different sample sizes, standard deviation, dispersion and maximum counts. In particular, bank failure time series reach a maximum of 10 and 24 counts while transaction data reach up to 67 and 60 counts with comparable mean counts.

Bank Failures
Bank failures have been analyzed using a univariate Poisson process [36]. A number of researches have investigated bank failure data of different time periods, see e.g., Schoenmaker [1] for an analysis of contagion risk in banking. Overall, it is reasonable to expect that bank failures in different countries are driven by similar economic phenomena, and possible contagion/spillover effects exist between economies of different countries.
For this application, we analyze count data using a bivariate data set of bank failures in the U.S. and Russia that has not been considered in the literature before. We use monthly number of bank failures for the period between January 2008 and December 2012 for both countries and apply the bi-variate specifications of the models in Sections 2 and 3. Especially due to the global financial crisis included in this period, it is important to allow for potential correlation between the number of bank failures in the U.S. and Russia using the multivariate count data models. Figure 4 illustrates these time series and Table 6 presents descriptive statistics for this data set.   The estimation results from both models are presented in Tables 7 and 8. Based on the state-space model, the correlation is estimated as being low negative and 0 is included in highest posterior density interval for this parameter. Despite that log-linear MACI model estimates correlation coefficient to be positive, it provides a large confidence interval for this parameter which also includes 0. Thus, for this relatively small data set, we do not find an indication of correlated bank failures using both models. We also note that some confidence intervals in Table 8 include point 0. As discussed in Section 2, applying observation driven models with positivity constraints would be problematic for these data especially in terms of the calculation of confidence intervals. We next compare the models in terms of their forecast performances. For this comparison, we take a sample size of T = 55, and we make five steps ahead predictions using the log-linear model and the state-space approach. Table 9 presents scores for this forecasting exercise. Based on all scores, except for the rank probability score (rps), the state-space model outperforms the log-linear model in terms of forecasting. Based on the simulation results in Section 5, we conjecture the following: The better performance of the state-space model can be due to this model being close to the true data generating process, or due to its property of capturing data properties well even if it is misspecified.

Transactions
In this empirical application, we analyze the number of transactions on 30 s intervals for Deutsche Bank AG and Bayer AG (the datawere obtained from FactSet, in the time period of 3 August 2015 09:05:30 until 3 August 2015 12:25:00 for the training data). We expect such transactions to be correlated due to their dependence on the time of the day and the market conditions. The sample size in this application is T = 400, which is significantly larger than the sample size in the bank failures application. The summary statistics for this data set are provided in Table 10   We apply the bi-variate counterparts of the count data models in Sections 2 and 3 to these data and compare the model performances based on 100 steps ahead forecasts. In Tables 11 and 12 we present parameter estimates of both models. Both models estimate positive correlation between these time series. However, in the case of the log-linear MACI model the estimated correlation coefficient is much higher. In addition, the confidence intervals of parameter estimates such as b 12 and b 22 in Table 12 include point 0. Thus, true parameters being non-positive is a potential problem if other observation driven models, with positivity constraints, in Section 2 were applied to these data. Table 13 presents the scores of each model in the forecast sample. In this application, the log-linear model performs best according to all scoring rules. Based on the simulation results in Section 5, we conjecture that the log-linear model is potentially closer to the true data generating process compared to the state space model. We further analyze the forecasting performances of the models in Figure 6. Particularly in Figure 6b, we observe that the log-linear MACI model captures better high spikes of the counts and then returns to the original level of the data. The forecast with the state-space model appears to be too smooth compared to the data points. Thus, the better forecast performance of the log-linear MACI is potentially due to its ability to capture these extreme data values successfully.
The under-performance and over-smoothing of the state-space approach can be mitigated by implementing a different particle filter. For example, one could take the direction of implementing look-ahead particle filters such as [37,38]. General idea of the look-ahead approaches is that in the particle filtering algorithm we make a proposal not just according to the dynamics of the model p(h t |h t−1 ), but taking the current observation into account p(h t |h t−1 , y t ) or taking into account the complete time series p(h t |h t−1 , y 1:T ) as in [38]. These methods, however, have not been developed for the distributional assumptions we are considering in this paper and further research is needed in this direction.

Discussion
In this paper, we have reviewed and compared two approaches for modeling multivariate count time series data. One of the challenges that appears in the literature and have not been resolved is modeling the dependency between counts in a flexible way that would also allow for feasible estimation. We have discussed multivariate autoregressive conditional intensity models (MACI), their log-linear alternative which we refer to as the multivariate log-linear model and nonlinear state-space model. Both models have advantages and disadvantages. In particular, the nonlinear state-space framework allows for various interpretable dependencies that one cannot easily incorporate into MACI or log-linear approach. However, these models can be computationally expensive to estimate, in particular in higher-dimensions. Challenges in estimation arise from different sources. State-space models naturally fit into the Bayesian framework, however, since both the likelihood and the posterior of the model are analytically intractable this leads to computationally expensive procedure. MACI models, on the other hand, are quite restrictive: they restrict coefficients in the model to be positive as well as the correlation between time series. Both assumptions can be unrealistic in many real-world applications. Log-linear model avoids the problem of only positive coefficients in the model by logarithmic transformation of the data. However, estimation can be unstable, and good starting points need to be chosen for the estimation. When the dimension of the model grows, it becomes harder to choose good starting points for the optimization problem. The computational advantage of log-linear and MACI models decreases with the increase in either dimensionality of the model or the number of counts. This reduction in the computational advantage is due to the usage of the multivariate Poisson distribution as every pairwise correlation has to be modeled as a separate Poisson random variable. Moreover, the summation in the specification of the joint distribution runs through the number of counts. Generally, one could say that estimation of log-linear models much faster than of the state-space models. In low dimensions and with the small number of counts these models do not require much of computational power, however, once the number of counts increases and once we deal with higher dimensions, the computations become much more extensive due to large sums in the multivariate Poisson distribution. Moreover, while running the model on simulated and empirical data, we found that the estimation can be numerically unstable and can highly depend on the starting values in the estimation procedure. We follow the suggestion of Doukhan et al. [10], and the first estimate the model for univariate time series. These estimates we further use in multivariate estimation. However, the problem of numerical instability especially remains in small samples according to our experience. Nevertheless, in terms of flexibility, this model is the best competitor for the state-space approach.
We have compared log-linear models and state-space models for count data in terms of forecasting performance on multiple simulated data sets and real data applications. We found that on the simulated data state-space framework generally outperforms loglinear model, sometimes even under model misspecification. On the real data sets, the state-space model performs better in bank failures applications which consists of two time series of bank failures in Russia and U.S. and the counts remain relatively low and the data are relative smooth. The log-linear model performs better in the transactions applications in which we consider two time series of transactions counts in 30 seconds intervals. The challenge of transactions application is that there are spikes of counts which deviate a lot from the mean. In this case, we notice that the log-linear model approximates these spikes better. Thus, a possible direction for future research is adapting a multivariate state-space model for count data to capture such spikes better. A possible way to improve the model in this regard would be to adapt the particle filtering algorithm. We used bootstrap particle filter which does not take into account observations when making a proposal for particles, but taking current (or all) observation into account in the proposal mechanism for the particles can help approximating the spikes in the data. There have been proposed multiple look-ahead approaches for particle filters [37,38], but they have not been adapted to count data.
Finally, both approaches have their drawbacks. In particular, the log-linear model seems to have numerical stability issues and finding optimal starting values for optimization can be a challenge. In the state-space approach, the challenging part is the estimation of the likelihood, which is intractable and sampled from the posterior distribution. Additionally, the state-space model in its current implementation is challenged by possible spikes in the data to a larger degree than the log-linear model.

Data Availability Statement:
The data for Polio and Asthma cases is publicly available from R library glarma [39]. Number of car crashes previously published in [40]. The earthquakes data is available from https://earthquake.usgs.gov/ (accessed on 14 June 2017). Transactions data can be downloaded from FactSet. The data for bank failures is publicly available at www.banki.ru (Russia) (accessed on 2 June 2021) and www.fdic.gov (U.S.) (accessed on 2 June 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: