
# Likelihood Inference for Generalized Integer Autoregressive Time Series Models

by Harry Joe
Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
Econometrics 2019, 7(4), 43; https://doi.org/10.3390/econometrics7040043
Received: 17 May 2019 / Revised: 23 September 2019 / Accepted: 2 October 2019 / Published: 11 October 2019
(This article belongs to the Special Issue Discrete-Valued Time Series: Modelling, Estimation and Forecasting)

## Abstract

For modeling count time series data, one class of models is generalized integer autoregressive of order p based on thinning operators. It is shown how numerical maximum likelihood estimation is possible by inverting the probability generating function of the conditional distribution of an observation given the past p observations. Two data examples are included and show that thinning operators based on compounding can substantially improve the model fit compared with the commonly used binomial thinning operator.

## 1. Introduction

Modeling of count time series has been an active area of research for a few decades (see Davis et al. 2015 and Weiß 2018). There are applications in business and econometrics, and in health and medical studies; the models can be applied in general situations for count response variables with time dependence and covariates.
One class of count time series models is based on thinning operators, which replace the multiplication operator to maintain support on the non-negative integers. The theory for count time series modeling with thinning operators has two approaches: (a) models with a specified stationary univariate margin combined with operators that can yield Markov time series models with a lag 1 serial correlation between 0 and 1; (b) models based on thinning operators applied to the p previous observations and a distribution for the innovation, so that in the stationary setting, the autocorrelation functions have similar properties to the Gaussian case (except that only positive serial correlations are possible).
The general model of type (b) is generalized integer autoregressive of order p, denoted as GINAR(p). For GINAR(p), mainly the binomial thinning operator has been used, whereas there are many thinning operators proposed for models of type (a).
For models of type (a), there is an extension to include covariates if the distribution of the innovation and the univariate margin are in the same infinitely divisible family, and then the convolution parameter can be a function of covariates. For models of type (b), the extension to include covariates typically requires the parameters of the innovation distribution to be functions of the covariates.
For type (a), estimation can usually proceed with numerical maximum likelihood. For type (b), the estimation method is typically method of moments or conditional least squares. The disadvantage of the latter two estimation methods is that the predictive distributions for the next observations given the recent observations cannot be obtained.
In this paper, we use some numerical techniques from the literature for models of type (a) for parameter estimation in GINAR(p) models. The numerical technique applies when there is a closed form for the probability generating function (pgf) of the conditional distribution of the next observation given the past, because the pgf can be efficiently numerically inverted to get the conditional probability mass function for likelihood calculations.
The remainder of the paper is organized as follows. Section 2 describes the background and notation for count time series models and thinning operators. Section 3 presents the formulation of GINAR(p). Section 4 presents some probabilistic properties of GINAR(p) in the stationary setting. Section 5 presents the details to show how the likelihood can be numerically computed, and includes some simulation results. Section 6 presents two data examples that show how generalized thinning operators lead to better models than binomial thinning. Section 7 presents the concluding discussion.

## 2. Background for Thinning Operators

This section presents the background and notation for count data and thinning operators.
Let $\{Y_t : t = 1, 2, \ldots\}$ be a count time series (a sequence of dependent random variables), where $Y_t \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$. The realized data will be denoted as $\{y_t : t = 1, 2, \ldots, n\}$ for a sample of size $n$. If there are trend/seasonal terms or covariates at time $t$, they are incorporated into a vector $x_t$. We mainly consider the case of small counts, with zeros and overdispersion relative to Poisson.
For early research in Markov count time series with given margins, some references are McKenzie (1986, 1987), Al-Osh and Aly (1992), and Alzaid and Al-Osh (1993). Letting $\alpha$ denote the non-negative lag 1 serial correlation, the integer autoregressive model of order 1 has the form:
$Y_t = R_t(Y_{t-1}; \alpha) + \epsilon_t(\alpha), \quad t = 1, 2, \ldots, \quad 0 \le \alpha \le 1. \qquad (1)$
$Y_t$ has a (parametric) distribution such as Poisson, negative binomial (NB) or generalized Poisson. The innovation random variables are $\epsilon_t(\alpha) \in \mathbb{N}_0$ for $t = 1, 2, \ldots$, and the serially dependent component of the model consists of random variables $R_t(y; \alpha) \in \mathbb{N}_0$ such that
$E[R_t(Y_{t-1}; \alpha) \mid Y_{t-1} = y] = E[R_t(y; \alpha)] = \alpha y, \quad y \in \mathbb{N}_0, \quad 0 \le \alpha \le 1, \qquad (2)$
and $Y_t \stackrel{d}{=} R_t(Y_{t-1}; \alpha) + \epsilon_t(\alpha)$ for all $0 \le \alpha \le 1$. Please note that to preserve the space $\mathbb{N}_0$ (non-negative integers), (2) can hold only in expectation: the deterministic choice $R_t(y; \alpha) \equiv \alpha y$ is possible only when the state space is the reals or the non-negative reals.
Some examples include the following. (a) $Y_t \sim$ Poisson and $R_t(y; \alpha) \sim \mathrm{Binomial}(y, \alpha)$ for binomial thinning; (b) $Y_t \sim$ negative binomial and $R_t(y; \alpha)$ has a beta-binomial distribution; (c) $Y_t \sim$ generalized Poisson and $R_t(y; \alpha)$ has a quasi-binomial distribution.
For an extension to include covariates, if $Y_t$, $R_t(Y_{t-1}; \alpha)$ and $\epsilon_t(\alpha)$ are in the same convolution-closed family for all $0 < \alpha < 1$ with a convolution parameter $\vartheta$, then $\vartheta$ can be a function of the covariates. However, with this approach, the extension to Markov of higher order is not simple. Markov of order 2 is tractable, but not order 3 or more; see Joe (1996) and Section 8.4.4 of Joe (1997).
A broad class of generalized thinning operators with the property of (2) can be obtained with a family of compounding operators based on a family of random variables $\{K(\alpha) : 0 \le \alpha \le 1\}$, where $K(\alpha) \in \mathbb{N}_0$ and $E[K(\alpha)] = \alpha$.
Definition 1
(Compounding operator based on random variable $K$). Let $K$ be a non-negative integer random variable and let $y$ be a non-negative integer. Then $K$ as a compounding operator is denoted as $K \circledast y$, where
$K \circledast y \stackrel{d}{=} \sum_{i=1}^{y} K_i$
with $K_i$ being independent replicates of $K$. In the above, $K \circledast y = 0$ if $y = 0$ with the conventional meaning of a null sum.
With the family $\{K(\alpha) : 0 \le \alpha \le 1\}$ of compounding random variables acting as thinning operators, in (1) let
$R_t(y; \alpha) = \sum_{i=1}^{y} K_{ti}(\alpha),$
where $K_{ti}(\alpha)$ are independent replicates of $K(\alpha)$ and $K(\alpha)$ has pgf $G_K(s; \alpha) = E[s^{K(\alpha)}]$ for $0 \le s \le 1$. With the $\circledast$ notation, the Markov stationary count time series model can be written as
$Y_t = K_t(\alpha) \circledast Y_{t-1} + \epsilon_t(\alpha) = \sum_{i=1}^{Y_{t-1}} K_{ti}(\alpha) + \epsilon_t(\alpha), \quad t = 1, 2, \ldots \qquad (3)$
For the expectation thinning requirement, $\alpha = 0$ implies $K(\alpha) \equiv 0$ and $\epsilon_t = Y_t$ in (3) for an independent and identically distributed sequence, and $\alpha = 1$ implies $K(\alpha) \equiv 1$ and $\epsilon_t = 0$ in (3) for perfect dependence.
Please note that if $K$ has pgf $G_K$ and $Y$ has pgf $G_Y$, then $K \circledast y$ has pgf $G_K^{\,y}$ and $K \circledast Y$ has pgf $G_Y \circ G_K$.
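As a brief illustration of these two pgf identities (a standard calculation added here as a check, not taken from the article), consider binomial thinning with $G_K(s; \alpha) = (1 - \alpha) + \alpha s$ and, as an assumed margin, $Y \sim \mathrm{Poisson}(\lambda)$ with $G_Y(s) = \exp\{\lambda (s - 1)\}$. Then
$G_K^{\,y}(s; \alpha) = [(1 - \alpha) + \alpha s]^{y}, \qquad G_Y(G_K(s; \alpha)) = \exp\{\lambda [(1 - \alpha) + \alpha s - 1]\} = \exp\{\alpha \lambda (s - 1)\}.$
The first expression is the pgf of a $\mathrm{Binomial}(y, \alpha)$ random variable and the second is the pgf of a $\mathrm{Poisson}(\alpha \lambda)$ random variable, so that $E[K \circledast Y] = \alpha \lambda = \alpha E(Y)$, in agreement with (2).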
With the family ${ K ( α ) }$ of compounding operators, a subclass with the self-generalized property has some special properties. Self-generalization is defined next; properties based on this concept are in Zhu and Joe (2003, 2010a).
Definition 2
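(Self-generalized family $\{K(\alpha) : 0 \le \alpha \le 1\}$ with pgf $G_K(\cdot\,; \alpha)$). $\{K(\alpha) : 0 \le \alpha \le 1\}$ satisfies the property of self-generalization if
$G_K(G_K(s; \alpha); \alpha') = G_K(s; \alpha \alpha') \quad \text{for all } \alpha, \alpha' \in (0, 1).$
This self-generalized property implies the following.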
(a) One can embed (3) into a continuous-time Markov process.
(b) $\mathrm{Var}[K(\alpha)] = c\,\alpha(1 - \alpha)$ for some constant $c \ge 1$.
(c) If (3) can hold in distribution for all $0 < \alpha < 1$, then the marginal distribution $F_Y$ is infinitely divisible and said to be generalized discrete self-decomposable (GDSD).
There are three known families of self-generalized operators that are quite tractable. These are summarized in the next definition.
Definition 3
(Three families of self-generalized thinning operators).
(I1) (binomial thinning) $G_K(s; \alpha) = (1 - \alpha) + \alpha s$ with $\mathrm{Var}[K(\alpha)] = \alpha(1 - \alpha)$.
(I2) $G_K(s; \alpha; \gamma) = \dfrac{(1 - \alpha) + (\alpha - \gamma) s}{(1 - \alpha\gamma) - (1 - \alpha)\gamma s}$, $0 \le \gamma \le 1$, with $\mathrm{Var}[K(\alpha)] = \alpha(1 - \alpha)(1 + \gamma)/(1 - \gamma)$. Please note that $\gamma = 0$ implies $G_K(s; \alpha) = (1 - \alpha) + \alpha s$.
(I3) $G_K(s; \alpha; \gamma) = \gamma^{-1}\,[\,1 + \gamma - (1 + \gamma - \gamma s)^{\alpha}\,]$, $0 \le \gamma$, with $\mathrm{Var}[K(\alpha)] = \alpha(1 - \alpha)(1 + \gamma)$. Please note that $\gamma \to 0$ implies $G_K(s; \alpha) = (1 - \alpha) + \alpha s$.
The I2 and I3 families have an additional parameter $\gamma$ besides $\alpha \in [0, 1]$ to allow different degrees of conditional heteroscedasticity. The I2 and I3 families include binomial thinning when $\gamma \to 0^{+}$. The second operator family I2 has been used in different parametrizations; see Aly and Bouzar (1994, 2019).
In the next section, we use these three classes of self-generalized compounding operators for GINAR(p) models in which compounding acts (independently) on each of the previous p observations. Please note that model (3) has no simple extension to Markov of higher order if a univariate stationary margin such as negative binomial is desired.
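The three pgfs in Definition 3 are simple enough to check numerically. The following Python sketch is illustrative only: the function names (`pgf_i1`, `pgf_i2`, `pgf_i3`, `mean_var_from_pgf`) and the finite-difference step are assumptions made here, not part of the article. It recovers the mean and variance of $K(\alpha)$ from derivatives of the pgf at $s = 1$, which allows the stated variances and the self-generalization property to be verified for particular parameter values.

```python
import numpy as np

def pgf_i1(s, alpha):
    """I1 (binomial thinning): pgf of a Bernoulli(alpha) variable."""
    return (1 - alpha) + alpha * s

def pgf_i2(s, alpha, gamma):
    """I2 family from Definition 3; gamma = 0 reduces to binomial thinning."""
    return ((1 - alpha) + (alpha - gamma) * s) / ((1 - alpha * gamma) - (1 - alpha) * gamma * s)

def pgf_i3(s, alpha, gamma):
    """I3 family from Definition 3 (gamma > 0); gamma -> 0 gives binomial thinning."""
    return (1 + gamma - (1 + gamma - gamma * s) ** alpha) / gamma

def mean_var_from_pgf(pgf, h=1e-4):
    """Mean G'(1) and variance G''(1) + G'(1) - G'(1)^2 via central differences.

    The step h is an arbitrary illustrative choice.
    """
    g1 = (pgf(1 + h) - pgf(1 - h)) / (2 * h)
    g2 = (pgf(1 + h) - 2 * pgf(1.0) + pgf(1 - h)) / h**2
    return g1, g2 + g1 - g1**2

if __name__ == "__main__":
    alpha, gamma = 0.4, 0.5
    print(mean_var_from_pgf(lambda s: pgf_i2(s, alpha, gamma)))  # mean alpha, variance per (I2)
    print(mean_var_from_pgf(lambda s: pgf_i3(s, alpha, gamma)))  # mean alpha, variance per (I3)
```

Composing a pgf with itself, for example `pgf_i2(pgf_i2(s, a1, g), a2, g)`, and comparing with `pgf_i2(s, a1 * a2, g)` gives a quick numerical check of self-generalization.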

## 3. GINAR($p$): Generalized Integer Autoregressive of Order $p ≥ 1$

In this section, we define INAR(p) and GINAR(p) count time series models, and summarize estimation methods that have appeared in the literature. Also, it is indicated why the generalized form is more interpretable.
The INAR(p) model, based on the binomial thinning operator (denoted by $*$), is defined in Du and Li (1991) as:
$Y_t = \sum_{j=1}^{p} \alpha_j * Y_{t-j} + \epsilon_t ;$
the innovation random variables $\epsilon_t$ have a distribution such as Poisson or negative binomial (NB), and they can have parameters that depend on covariates.
For an extension beyond binomial thinning, the GINAR(p) model as defined in Gauthier and Latour (1994) and Joe (2015) is as follows:
$Y_t = \sum_{j=1}^{p} K_t(\alpha_j) \circledast Y_{t-j} + \epsilon_t = \sum_{j=1}^{p} \sum_{i=1}^{Y_{t-j}} K_{tji}(\alpha_j) + \epsilon_t, \qquad (5)$
where $0 \le \alpha_j \le 1$ for $j = 1, \ldots, p$, the $K_{tji}(\alpha_j)$ are independent over $t$, $j$ and $i$, and $\epsilon_t$ is the innovation at time $t$.
Model (5) could also be defined if the thinning operators are based on other non-compounding thinning operators that appear in other constructions of count time series models. However as shown below, the feasibility of numerical maximum likelihood depends on the use of compounding operators with closed form pgfs.
This GINAR(p) model is defined in Section 3.2 of Weiß (2018) but there is no discussion of good choices of generalized thinning operators and no applications in subsequent sections of the book.
Binomial thinning in the INAR(p) model with $p \ge 2$ does not have a survivor-immigration interpretation, in which a random fraction of the currently counted units continues to the next time point (page 18 of Weiß (2018)). GINAR(p) is more interpretable, with a unit count at time $t$ branching into (or contributing) 0, 1 or more counts at time $t + 1$, $t + 2$, etc.; this is referred to as branching with immigration on page 19 of Weiß (2018).
The original estimation method in Du and Li (1991) is Yule-Walker or method of moments. An approximate likelihood inference method based on the saddlepoint approximation is used in Pedeli et al. (2015), and Lu (2018) has an implementation of maximum likelihood that obtains the probability mass function from the pgf by differentiation and Taylor expansion. In Pedeli et al. (2015), only the binomial thinning operator is used and an NB innovation with a fixed convolution parameter is assumed. That is, the convolution parameter (whose reciprocal was called the dispersion parameter) is fixed at different values, and the remaining parameters are then estimated using the saddlepoint approximation. Lu (2018) assumes a Poisson innovation; the approach extends to NB innovations but may not be practical for thinning operators with support on all non-negative integers.
By inverting the pgf using the numerical integral in Davies (1973), the likelihood can be evaluated to high precision whenever the conditional distribution of $[Y_t \mid Y_{t-1} = y_{t-1}, \ldots, Y_{t-p} = y_{t-p}]$ has a numerically tractable pgf. The pgf of this conditional distribution is
$\prod_{j=1}^{p} [G_K(s; \alpha_j)]^{y_{t-j}} \, G_\epsilon(s).$
This pgf is tractable for thinning operators I1, I2 and I3, combined with Poisson or NB distributions for innovations. Some details are provided in Section 5. In Zhu and Joe (2010b), the numerical technique for the likelihood was used for some Markov time series models of order 1 with NB marginal distributions.
An advantage of the likelihood method over method of moments and least squares is that different distributions for $\epsilon_t$ can be used in a sensitivity analysis, and prediction intervals of $Y_{t+1}$ given $y_t, y_{t-1}, \ldots, y_{t-p+1}$ are possible. Covariates can be accommodated into the parameters of $F_{\epsilon_t}$.

## 4. Probabilistic Properties and Numerical Techniques

In this section, probabilistic properties and numerical techniques for GINAR(p) are given.
The method for inverting the conditional pgf is summarized in Section 4.1 and it can be used for simulation of GINAR(p) as shown in Section 4.2. In Section 4.3, an algorithm is given for obtaining the autocorrelation function of (5), assuming stationarity. Section 4.4 summarizes the validation of the numerical methods.

#### 4.1. Conditional Distributions

In this subsection, we explain how to compute the probability mass function (pmf) $\Pr(Y_t = y \mid Y_{t-1} = y_{t-1}, \ldots, Y_{t-p} = y_{t-p})$ for GINAR(p) in (5).
There are two approaches indicated below.
(a) Let the pmf of the extended binomial distribution be
$f(z; \alpha_1, \ldots, \alpha_p; G_K) = \Pr\Bigl( \sum_{j=1}^{p} K_t(\alpha_j) \circledast y_{t-j} = z \Bigr) \qquad (6)$
for $z = 0, 1, 2, \ldots$; this is obtained by inverting $\prod_{j=1}^{p} G_K^{\,y_{t-j}}(s; \alpha_j)$, which is the pgf of $\sum_{j=1}^{p} K_t(\alpha_j) \circledast y_{t-j}$. Please note that (6) is a binomial distribution when the thinning operator is binomial thinning and $\alpha_1 = \cdots = \alpha_p$. Let $f_\epsilon(\cdot)$ be the pmf of the innovation random variable. Then
$f_{Y_t \mid Y_{t-1}, \ldots, Y_{t-p}}(y \mid y_{t-1}, \ldots, y_{t-p}) = \Pr(Y_t = y \mid Y_{t-1} = y_{t-1}, \ldots, Y_{t-p} = y_{t-p}) = \sum_{z=0}^{y} f(z; \alpha_1, \ldots, \alpha_p; G_K) \, f_\epsilon(y - z). \qquad (7)$
(b) The conditional pmf (7) is obtained by inverting $G_\epsilon(s) \prod_{j=1}^{p} G_K^{\,y_{t-j}}(s; \alpha_j)$, which is the pgf of $\sum_{j=1}^{p} K_t(\alpha_j) \circledast y_{t-j} + \epsilon_t$.
Approach (a) can be used if the innovation random variable has a simple form for the pmf but not the pgf. In the next result that shows the inversion, $W$ is either $\sum_{j=1}^{p} K_t(\alpha_j) \circledast y_{t-j}$ or $\sum_{j=1}^{p} K_t(\alpha_j) \circledast y_{t-j} + \epsilon_t$.
Let $W$ be a random variable with support in $\mathbb{N}_0$. The characteristic function $\varphi_W(u) = E(e^{iuW}) = G_W(e^{iu})$ of $W$ can be inverted via the algorithm of Davies (1973). Let
$a(w) = \frac{1}{2} - \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathrm{Re}\Bigl[ \frac{\varphi_W(u) \, e^{-iuw}}{1 - e^{-iu}} \Bigr] \, du ,$
so that $a(w) = \Pr(W < w)$ for integer $w$. The function $a(w)$ is straightforward to evaluate via numerical quadrature. The cumulative distribution function (cdf) and pmf of $W$ are
$F_W(w) = a(w + 1), \quad w = 0, 1, 2, \ldots$
$f_W(0) = a(1), \quad f_W(w) = a(w + 1) - a(w), \quad w = 1, 2, \ldots .$
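The inversion step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: it takes $a(w) = \Pr(W < w)$ in the integral form $a(w) = \tfrac12 - (2\pi)^{-1} \int_{-\pi}^{\pi} \mathrm{Re}[\varphi_W(u) e^{-iuw} / (1 - e^{-iu})] \, du$, and it uses a midpoint rule with an even number of nodes (so that the removable singularity at $u = 0$ is never evaluated). The names `davies_a` and `pmf_from_pgf` and the default number of nodes are hypothetical choices.

```python
import math
import numpy as np

def davies_a(pgf, w, n_nodes=2048):
    """Approximate a(w) = Pr(W < w) for an N0-valued W with pgf `pgf`.

    Midpoint rule on (-pi, pi); n_nodes should be even so that u = 0 is not a node.
    `pgf` must accept complex NumPy arrays.
    """
    u = -np.pi + (np.arange(n_nodes) + 0.5) * (2.0 * np.pi / n_nodes)
    phi = pgf(np.exp(1j * u))                      # characteristic function G_W(e^{iu})
    integrand = np.real(phi * np.exp(-1j * u * w) / (1.0 - np.exp(-1j * u)))
    return 0.5 - integrand.mean()                  # mean over nodes = integral / (2 pi)

def pmf_from_pgf(pgf, w_max, n_nodes=2048):
    """pmf f_W(0), ..., f_W(w_max) using f_W(0) = a(1), f_W(w) = a(w+1) - a(w)."""
    a = np.array([davies_a(pgf, w, n_nodes) for w in range(1, w_max + 2)])
    return np.diff(a, prepend=0.0)

if __name__ == "__main__":
    lam = 2.5
    approx = pmf_from_pgf(lambda s: np.exp(lam * (s - 1.0)), 10)
    exact = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(11)]
    print(np.max(np.abs(approx - np.array(exact))))  # discrepancy from the Poisson pmf
```

For a conditional distribution in GINAR(p), `pgf` would be the product of the thinning pgfs raised to the powers $y_{t-j}$, times the innovation pgf.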

#### 4.2. Simulating GINAR(p)

For simulating from GINAR(p), one can start with something approximate for the first p observations and then generate the remaining observations from the conditional distributions in the preceding subsection. If one discards an initial burn-in period, then the rest of the resulting time series will be close to stationary, assuming that the parameters $\alpha_1, \ldots, \alpha_p$ satisfy the condition for stationarity (as given in the next subsection).
Suppose one has the cdf corresponding to the pmf $\Pr(Y_t = z \mid Y_{t-1} = y_{t-1}, \ldots, Y_{t-p} = y_{t-p})$ as in the previous subsection. Then the usual inverse-cdf technique can be used for simulating a discrete random variable. Below is an algorithm to generate a sequence of length $n > p$.
• Simulate $y_1$ so that it has the theoretical stationary mean and variance.
• Simulate $y_2, \ldots, y_p$ to try to match the lag 1 to lag $p - 1$ serial correlations (approximately).
• For $i = p + 1, \ldots, n$:
•  obtain the cdf $F c o n d ( z ) ← F Y t | Y t − 1 , … , Y t − p ( z | y t − 1 , … , y t − p )$ for z from 0 to, say, the integer closest to
$E ( Y t | Y t − 1 = y t − 1 , … , Y t − p = y t − p ) + 5 Var ( Y t | Y t − 1 = y t − 1 , … , Y t − p = y t − p )$.
•  generate a uniform random number $r$ in $(0, 1)$
•  assign $y_i \leftarrow \min\{ z : F_{\mathrm{cond}}(z) \ge r \}$.
• End of for loop
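The loop above can be sketched as follows for the special case of binomial thinning with Poisson innovations. Several simplifications here are assumptions of this sketch rather than the article's procedure: the first p values are set crudely to the rounded stationary mean and a burn-in is discarded instead of matching the initial moments, the conditional cdf is truncated at a generous bound of the conditional mean plus ten conditional standard deviations plus ten, and the names `cond_cdf` and `simulate_ginar` are hypothetical.

```python
import numpy as np

def cond_cdf(y_past, alphas, lam, z_max, n_nodes=256):
    """cdf F(z), z = 0..z_max, of Y_t given the past (binomial thinning, Poisson(lam))."""
    u = -np.pi + (np.arange(n_nodes) + 0.5) * (2.0 * np.pi / n_nodes)
    s = np.exp(1j * u)
    phi = np.exp(lam * (s - 1.0))                  # innovation pgf at e^{iu}
    for a, y in zip(alphas, y_past):
        phi = phi * ((1.0 - a) + a * s) ** y       # pgf of K(a) compounded y times
    w = np.arange(1, z_max + 2)[:, None]           # F(z) = a(z + 1)
    integrand = np.real(phi * np.exp(-1j * u * w) / (1.0 - np.exp(-1j * u)))
    return 0.5 - integrand.mean(axis=1)

def simulate_ginar(n, alphas, lam, burn_in=200, seed=0):
    """Simulate n observations after a burn-in, by inverse-cdf sampling."""
    rng = np.random.default_rng(seed)
    alphas = np.asarray(alphas, dtype=float)
    p = len(alphas)
    mu_y = lam / (1.0 - alphas.sum())              # stationary mean
    y = [int(round(mu_y))] * p                     # crude start (assumption of this sketch)
    for _ in range(burn_in + n):
        past = y[-1:-p - 1:-1]                     # (y_{t-1}, ..., y_{t-p})
        mean = float(np.dot(alphas, past)) + lam
        var = float(np.dot(alphas * (1.0 - alphas), past)) + lam
        z_max = int(mean + 10.0 * np.sqrt(var)) + 10   # generous truncation (assumption)
        cdf = cond_cdf(past, alphas, lam, z_max)
        r = rng.uniform()
        y.append(int(np.searchsorted(cdf, r)))     # min{z : F(z) >= r}
    return np.array(y[-n:])

if __name__ == "__main__":
    series = simulate_ginar(1000, [0.3, 0.2], 2.0, seed=1)
    print(series[:20], series.mean())
```

For I2 or I3 thinning, only the pgf factor inside `cond_cdf` and the conditional variance used for the truncation bound would change.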

#### 4.3. Moments under Stationarity

In this subsection, some results are derived and summarized under the assumption that model (5) is in a stationary state. It is assumed that the sequence $\{Y_t\}$ has finite variance $\sigma_Y^2$ and mean $\mu_Y$. We use the property that $\{K(\alpha) : 0 \le \alpha \le 1\}$ is a family of self-generalized random variables with $E[K(\alpha)] = \alpha$ and $\mathrm{Var}[K(\alpha)] = c\,\alpha(1 - \alpha)$ for a constant $c \ge 1$. Du and Li (1991) has most of these results for the special case of binomial thinning. The condition for stationarity is that $0 \le \sum_{j=1}^{p} \alpha_j < 1$.
• $E[K(\alpha) \circledast Y \mid Y = y] = y \, E[K(\alpha)] = \alpha y$; $E[K(\alpha) \circledast Y] = \alpha \mu_Y$.
• For stationary GINAR(p), $\mu_Y = \mu_Y \sum_{j=1}^{p} \alpha_j + \mu_\epsilon$ or
$\mu_Y = \mu_\epsilon \Big/ \Bigl( 1 - \sum_{j=1}^{p} \alpha_j \Bigr).$
• $\mathrm{Var}[K(\alpha) \circledast Y \mid Y = y] = y \, \mathrm{Var}[K(\alpha)] = c\,\alpha(1 - \alpha) y$.
• $\mathrm{Var}[K(\alpha) \circledast Y] = c\,\alpha(1 - \alpha)\mu_Y + \alpha^2 \sigma_Y^2$ and $E[(K(\alpha) \circledast Y)^2] = c\,\alpha(1 - \alpha)\mu_Y + \alpha^2 E(Y^2)$.
• Let $Y_a, Y_b$ be two (distinct) dependent counts, with independent thinning operations at the same time or at different times: $\mathrm{Cov}[K_{t_1}(\alpha_j) \circledast Y_a, K_{t_2}(\alpha_m) \circledast Y_b \mid Y_a = y_a, Y_b = y_b] = 0$, and $\mathrm{Cov}[K_t(\alpha_j) \circledast Y_a, Y_b \mid Y_a = y_a, Y_b = y_b] = 0$.
• Let $Y_a, Y_b$ be two (distinct) dependent counts, with independent thinning operations at the same time or at different times. From the preceding items,
$\begin{aligned}
\mathrm{Cov}[K_t(\alpha_j) \circledast Y_a, Y_b] &= \mathrm{Cov}[E(K_t(\alpha_j) \circledast Y_a \mid Y_a, Y_b), Y_b] = \mathrm{Cov}(\alpha_j Y_a, Y_b) = \alpha_j \mathrm{Cov}(Y_a, Y_b); \\
\mathrm{Cov}[K_{t_1}(\alpha_j) \circledast Y_a, K_{t_2}(\alpha_m) \circledast Y_b] &= \mathrm{Cov}[E(K_{t_1}(\alpha_j) \circledast Y_a \mid Y_a, Y_b), E(K_{t_2}(\alpha_m) \circledast Y_b \mid Y_a, Y_b)] \\
&= \mathrm{Cov}(\alpha_j Y_a, \alpha_m Y_b) = \alpha_j \alpha_m \mathrm{Cov}(Y_a, Y_b).
\end{aligned}$
• For the case $a = b$, write $Y_a = Y_b = Y$; then
$\mathrm{Cov}[K_t(\alpha) \circledast Y, Y] = \mathrm{Cov}[E(K_t(\alpha) \circledast Y \mid Y), Y] = \mathrm{Cov}(\alpha Y, Y) = \alpha \mathrm{Var}(Y).$
With $t_1 \ne t_2$,
$\mathrm{Cov}[K_{t_1}(\alpha_j) \circledast Y, K_{t_2}(\alpha_m) \circledast Y] = \mathrm{Cov}[E(K_{t_1}(\alpha_j) \circledast Y \mid Y), E(K_{t_2}(\alpha_m) \circledast Y \mid Y)] = \mathrm{Cov}(\alpha_j Y, \alpha_m Y) = \alpha_j \alpha_m \mathrm{Var}(Y).$
• $\mathrm{Cov}(Y_t, \epsilon_t) = \mathrm{Var}(\epsilon_t) = \sigma_\epsilon^2$ since $\epsilon_t$ is independent of past observations.
From the above, we can develop recursion equations for autocovariances $γ h$ or autocorrelations $ρ h$ for $h ∈ N 0$. These are the same as for Gaussian AR(p) for lags $h ≥ p$.
Variance calculations under stationarity are as follows.
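$\begin{aligned}
Y_t &= \sum_{j=1}^{p} K_t(\alpha_j) \circledast Y_{t-j} + \epsilon_t, \\
\mathrm{Var}(Y_t) &= \sum_{j=1}^{p} \mathrm{Var}[K_t(\alpha_j) \circledast Y_{t-j}] + 2 \sum_{1 \le j < m \le p} \mathrm{Cov}[K_t(\alpha_j) \circledast Y_{t-j}, K_t(\alpha_m) \circledast Y_{t-m}] + \mathrm{Var}(\epsilon_t), \\
\gamma_0 &= c \mu_Y \sum_{j=1}^{p} \alpha_j (1 - \alpha_j) + \gamma_0 \sum_{j=1}^{p} \alpha_j^2 + 2 \sum_{1 \le j < m \le p} \alpha_j \alpha_m \gamma_{m-j} + \sigma_\epsilon^2, \\
\Bigl( 1 - \sum_{j=1}^{p} \alpha_j^2 \Bigr) \gamma_0 &= c \mu_Y \sum_{j=1}^{p} \alpha_j (1 - \alpha_j) + 2 \sum_{1 \le j < m \le p} \alpha_j \alpha_m \gamma_{m-j} + \sigma_\epsilon^2, \\
\gamma_h &= \sum_{j=1}^{p} \alpha_j \mathrm{Cov}(Y_{t-j}, Y_{t-h}) = \sum_{j=1}^{p} \alpha_j \gamma_{|h-j|}, \quad h \ge 1.
\end{aligned}$
For example, for GINAR(3),
$\begin{aligned}
\gamma_0 &= \Bigl\{ c \mu_Y \sum_{j=1}^{3} \alpha_j (1 - \alpha_j) + \sigma_\epsilon^2 \Bigr\} + \gamma_0 \sum_{j=1}^{3} \alpha_j^2 + 2 [\alpha_1 \alpha_2 \gamma_1 + \alpha_2 \alpha_3 \gamma_1 + \alpha_1 \alpha_3 \gamma_2] \\
&=: a_0 + b_0 \gamma_0 + 2 [\alpha_1 \alpha_2 \gamma_1 + \alpha_2 \alpha_3 \gamma_1 + \alpha_1 \alpha_3 \gamma_2], \\
\gamma_1 &= \mathrm{Cov}(Y_t, Y_{t-1}) = \alpha_1 \gamma_0 + \alpha_2 \gamma_1 + \alpha_3 \gamma_2, \\
\gamma_2 &= \mathrm{Cov}(Y_t, Y_{t-2}) = \alpha_1 \gamma_1 + \alpha_2 \gamma_0 + \alpha_3 \gamma_1 = \alpha_2 \gamma_0 + (\alpha_1 + \alpha_3) \gamma_1, \\
\gamma_1 &= \alpha_1 \gamma_0 + \alpha_3 \alpha_2 \gamma_0 + \alpha_2 \gamma_1 + \alpha_3 (\alpha_1 + \alpha_3) \gamma_1, \\
\gamma_1 &= (\alpha_1 + \alpha_2 \alpha_3) \gamma_0 / [1 - \alpha_2 - \alpha_3 (\alpha_1 + \alpha_3)] =: \rho_1 \gamma_0, \\
\gamma_2 &= [\alpha_2 + (\alpha_1 + \alpha_3) \rho_1] \gamma_0 =: \rho_2 \gamma_0, \\
\gamma_0 &= a_0 + b_0 \gamma_0 + 2 (\alpha_1 \alpha_2 + \alpha_2 \alpha_3) \rho_1 \gamma_0 + 2 \alpha_1 \alpha_3 \rho_2 \gamma_0, \\
\gamma_0 &= a_0 / [1 - b_0 - 2 (\alpha_1 \alpha_2 + \alpha_2 \alpha_3) \rho_1 - 2 \alpha_1 \alpha_3 \rho_2].
\end{aligned}$
In general, the first p serial correlations can be obtained by solving a linear system in the equations for $\rho_1, \ldots, \rho_p$, and then $\rho_{p+1}, \rho_{p+2}, \ldots$ can be obtained by recursion.
Next is the algorithm for GINAR(p) for computing $\rho_1, \ldots, \rho_p$ and $\gamma_0 = \sigma_Y^2$, given inputs of $\alpha_1, \ldots, \alpha_p$. The higher-order serial correlations are then obtained via:
$\rho_h = \sum_{j=1}^{p} \alpha_j \rho_{h-j}, \quad h > p.$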
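• Initialize a $p \times p$ matrix $M$ to 0.
• For $j_1 \in \{1, \ldots, p\}$
•  $M_{j_1, j_1} \leftarrow 1$
•  for $j_2 \in \{1, \ldots, p\}$
•   $h \leftarrow |j_1 - j_2|$
•   if ($h > 0$) $M_{j_1, h} \leftarrow M_{j_1, h} - \alpha_{j_2}$
•  end of loop for $j_2$
• end of loop for $j_1$.
• Solve
$M \begin{pmatrix} \rho_1 \\ \vdots \\ \rho_p \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_p \end{pmatrix}.$
• Let $a_0 = c \mu_Y \sum_{j=1}^{p} \alpha_j (1 - \alpha_j) + \sigma_\epsilon^2$, $b_0 = \sum_{j=1}^{p} \alpha_j^2$. Then
$\gamma_0 = a_0 \Big/ \Bigl[ 1 - b_0 - 2 \sum_{1 \le j_1 < j_2 \le p} \alpha_{j_1} \alpha_{j_2} \rho_{j_2 - j_1} \Bigr]$
is the stationary variance.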
Please note that the stationary autocorrelation function (acf) depends only on $α 1 , … , α p$ and not on the distribution of the innovation random variables. The acf also does not depend on the family ${ K ( α ) }$ of thinning operations, but the stationary mean and variance are affected by ${ K ( α ) }$ and the distribution of the innovations.
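The linear-system algorithm of this subsection translates directly into code. The sketch below is an illustrative transcription, with the function name `ginar_acf` and its argument names chosen here as assumptions; `c` is the constant in $\mathrm{Var}[K(\alpha)] = c\,\alpha(1 - \alpha)$, for example $c = 1$ for binomial thinning.

```python
import numpy as np

def ginar_acf(alphas, mu_eps, var_eps, c=1.0, max_lag=10):
    """Stationary mean, variance and acf (lags 0..max_lag) of GINAR(p)."""
    alphas = np.asarray(alphas, dtype=float)
    p = len(alphas)
    # Build M so that M @ (rho_1, ..., rho_p) = (alpha_1, ..., alpha_p).
    M = np.zeros((p, p))
    for j1 in range(1, p + 1):
        M[j1 - 1, j1 - 1] += 1.0
        for j2 in range(1, p + 1):
            h = abs(j1 - j2)
            if h > 0:
                M[j1 - 1, h - 1] -= alphas[j2 - 1]
    rho = np.linalg.solve(M, alphas)
    mu_y = mu_eps / (1.0 - alphas.sum())
    a0 = c * mu_y * np.sum(alphas * (1.0 - alphas)) + var_eps
    b0 = np.sum(alphas**2)
    cross = sum(alphas[j1] * alphas[j2] * rho[j2 - j1 - 1]
                for j1 in range(p) for j2 in range(j1 + 1, p))
    gamma0 = a0 / (1.0 - b0 - 2.0 * cross)
    # Extend the acf by the recursion rho_h = sum_j alpha_j rho_{h-j} for h > p.
    acf = [1.0] + list(rho)
    for h in range(p + 1, max_lag + 1):
        acf.append(sum(alphas[j] * acf[h - j - 1] for j in range(p)))
    return mu_y, gamma0, np.array(acf[:max_lag + 1])

if __name__ == "__main__":
    # Binomial thinning (c = 1) with Poisson(2) innovations, so mu_eps = var_eps = 2.
    print(ginar_acf([0.3, 0.2, 0.1], mu_eps=2.0, var_eps=2.0))
```

For $p = 3$ the output can be compared against the closed-form expressions for $\rho_1$, $\rho_2$ and $\gamma_0$ in the GINAR(3) example.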

#### 4.4. Validation

With the simulation method in Section 4.2 and numerical maximum likelihood based on (7), we simulated many time series of length in the thousands under different stationary parameter settings for $( α 1 , … , α p )$ and different parameters for the Poisson or NB innovation. Some representative simulation results are summarized in the next section.
The sample acf’s are close to the theoretical acf’s in Section 4.3 and the maximum likelihood estimates are close to the “true” parameters when considering sampling variability. We also checked that as p increases and $\sum_{j=1}^{p} \alpha_j$ gets closer to 1, the serial correlations require longer lags before they are close to 0.

## 5. Likelihood and Numerical Implementation

In this section, we summarize the log-likelihoods that can be used for model (5), where the innovation random variables can be independent and identically distributed with a parametric distribution or they can depend parametrically on covariates.
To compare models with different autoregressive order using Akaike information criterion (AIC) values, we use the (conditional) likelihood as the product of the conditional densities starting at an index $i_{\mathrm{start}}$ which is larger than $p_{\max}$ (the maximum autoregressive order that will be considered). With a large sample size, the maximum likelihood estimates are not sensitive to the few initial conditional probabilities, so one could fit GINAR(p) starting from the conditional probability for $Y_{p+1}$ to get parameter estimates and then omit a few conditional probability terms at the beginning so that $i_{\mathrm{start}}$ is common for different p. This form of likelihood implies that we do not have to determine the distribution of the first few observations under the assumption of stationarity. In fact, we then do not have to assume that the time series starts in a stationary state, and covariates can be included. The likelihood is:
$L = \prod_{t = i_{\mathrm{start}}}^{n} f_{Y_t \mid Y_{t-1}, \ldots, Y_{t-p}}(y_t \mid y_{t-1}, \ldots, y_{t-p}; \alpha_1, \ldots, \alpha_p, \gamma, \theta_{\mathrm{innov}}), \qquad (10)$
where $\alpha_1, \ldots, \alpha_p \in (0, 1)$, with $0 \le \alpha_1 + \cdots + \alpha_p < 1$, are the autoregressive parameters, $\gamma$ is the conditional heteroscedastic parameter in Definition 3 if I2 or I3 thinning is used, and $\theta_{\mathrm{innov}}$ is the vector of parameters for the innovation random variables. If there are covariates, then $\theta_{\mathrm{innov}}$ consists of (regression) parameters linking the covariates to parameters of the innovation distribution. The conditional density is obtained via the numerical technique in Section 4.1. We refer to the parameter vector maximizing (10), or equivalently minimizing the negative log conditional likelihood, as the conditional maximum likelihood (CML) estimate.
For example, $\theta_{\mathrm{innov}} = \lambda > 0$ for Poisson innovations, and $\theta_{\mathrm{innov}} = (\vartheta, \xi)$ for NB innovations with convolution parameter $\vartheta$, mean $\vartheta \xi$ and variance $\vartheta \xi (1 + \xi)$. With covariate vector $x_t$ at time $t$, $\theta_{\mathrm{innov}} = (\beta_0, \beta)$ for Poisson innovations with mean $\mu(x_t) = \exp\{\beta_0 + \beta^{T} x_t\}$, and $\theta_{\mathrm{innov}} = (\beta_0, \beta, \xi)$ for NB innovations with mean $\mu(x_t) = \exp\{\beta_0 + \beta^{T} x_t\}$ and $\vartheta_t = \mu(x_t) / \xi$, which gives constant overdispersion.
Because of the summations and numerical integrals, to gain computational speed in the numerical maximum likelihood, the negative log-likelihood can be coded in a high-level programming language such as Fortran 90. The numerical optimization could be done by interfacing to statistical software such as R.
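To make the structure of (10) concrete, the following Python sketch evaluates the negative log conditional likelihood for GINAR(p) with I2 thinning and Poisson innovations. It is a slow, illustrative version rather than the compiled implementation described above: the function names (`cond_pmf`, `neg_loglik`, `aic`), the midpoint quadrature with 512 nodes, and the floor of $10^{-300}$ inside the logarithm are all assumptions of this sketch. Setting `gamma = 0` recovers binomial thinning.

```python
import math
import numpy as np

def pgf_i2(s, alpha, gamma):
    """I2 thinning pgf from Definition 3; gamma = 0 is binomial thinning."""
    return ((1 - alpha) + (alpha - gamma) * s) / ((1 - alpha * gamma) - (1 - alpha) * gamma * s)

def cond_pmf(y, y_past, alphas, gamma, lam, n_nodes=512):
    """Pr(Y_t = y | y_{t-1}, ..., y_{t-p}) by numerically inverting the conditional pgf."""
    u = -np.pi + (np.arange(n_nodes) + 0.5) * (2.0 * np.pi / n_nodes)
    s = np.exp(1j * u)
    phi = np.exp(lam * (s - 1.0))                  # Poisson innovation pgf
    for a, yp in zip(alphas, y_past):
        phi = phi * pgf_i2(s, a, gamma) ** yp      # thinning of each lagged count

    def a_fun(w):                                  # a(w) = Pr(W < w)
        return 0.5 - np.mean(np.real(phi * np.exp(-1j * u * w) / (1.0 - np.exp(-1j * u))))

    return a_fun(y + 1) - a_fun(y)

def neg_loglik(alphas, gamma, lam, y, i_start):
    """Negative log of (10); i_start is the 1-based index of the first term used."""
    p = len(alphas)
    nll = 0.0
    for t in range(i_start - 1, len(y)):           # 0-based positions of y_{i_start}, ..., y_n
        past = [y[t - j] for j in range(1, p + 1)]
        prob = cond_pmf(y[t], past, alphas, gamma, lam)
        nll -= math.log(max(prob, 1e-300))         # guard against log(0) (assumption)
    return nll

def aic(nll, n_params):
    """AIC = 2 * (negative log-likelihood) + 2 * (number of parameters)."""
    return 2.0 * nll + 2.0 * n_params

if __name__ == "__main__":
    y = [3, 1, 4, 2, 5, 3, 2, 6, 4, 3]             # made-up toy series
    nll = neg_loglik([0.3, 0.2], 0.5, 1.0, y, i_start=3)
    print(nll, aic(nll, n_params=4))
```

In practice `neg_loglik` would be passed to a numerical optimizer with the constraints $\alpha_j \in (0, 1)$, $\sum_j \alpha_j < 1$ and $\lambda > 0$ handled by a suitable reparametrization.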
Table 1 reports some representative simulation results for I1 and I2 thinning to show the accuracy of the numerical methods and how the computing time increases when there are increases in the (i) number of parameters, (ii) sample size and (iii) maximum count. Similar patterns occur for GINAR with I3 thinning. When the sample size increases by a factor of 4 (from 500 to 2000), the SDs of the parameter estimates decrease by a factor of 2 (as expected); the computing time increases by a factor of slightly more than 4 because the maximum count is larger for a sample size of 2000 and the pmf in (7) must be computed up to a larger value. The parameter vectors in Table 1 are based on maximum likelihood values when fitting these different models to the Ericsson data in Section 6.1.

## 6. Data Examples

In this section, we show results of fitting model (5) to two data examples that appeared in the literature. The use of operators with a branching stochastic representation is more interpretable, so it is not surprising that, based on AIC, the I2 and I3 thinning operators in Definition 3 provide much better fits than binomial thinning.
Because the I2 and I3 thinning operators provide conditional heteroscedasticity, it turns out that the innovation random variable need not be as overdispersed relative to Poisson as p gets larger. With the binomial thinning operator, in which each count from the previous p time points can contribute at most 1 to the next count observation, an NB innovation random variable provides a much better fit than Poisson.

#### 6.1. Ericsson Transaction Data

The data set consists of the number of transactions per minute for the stock Ericsson B for business days and hours during 2 to 22 July in the year 2002. The sample size is $n = 460$. The original source is Brännäs and Quoreshi (2010). This data set is also used in Fokianos et al. (2009) and in Examples 4.1.5 and 4.2.4 in Weiß (2018).
Different models were used by the previous authors: integer moving average INMA of order q with a large q and INGARCH(1,1) with Poisson and overdispersed Poisson distributions for the innovations. The empirical autocorrelation function (see Table 3) suggests that a low-order autoregressive model is not appropriate for these data (as indicated by Weiß (2018)). Here, for comparison, we consider stationary GINAR(p) with p increasing until there is no improvement in the log-likelihood and the highest order $α$ parameter becomes close to 0.
Table 2 presents a summary of AIC values to compare GINAR(p) with binomial thinning and the I2, I3 thinning operators.
For thinning with I2 and I3, based on AIC values, the models with Poisson innovations are a little better than with NB innovations for $p = 3$ to 7 for I2 and $p = 4$ to 7 for I3. For binomial thinning, based on AIC values, the models with NB innovations are much better than with Poisson innovations.
The best models based on AIC values are GINAR(6) models with I2 and I3 thinning operators and Poisson-distributed innovations. They are quite an improvement on INAR(6) with binomial thinning and NB-distributed innovations. Based on the context of the data, thinning operations based on compounding with support on all of $\mathbb{N}_0$ instead of $\{0, 1\}$ are more reasonable.
The I2 and I3 thinning operators account for some conditional heteroscedasticity so that the use of the NB-distributed innovations leads to a flatter log-likelihood over the NB parameters. Hence Poisson distributed innovations are adequate to handle the marginal overdispersion.
We next compare AIC values with other models that have been used for this data set. Table 4.4 and Example 4.2.4 of Weiß (2018) have values of maximized log conditional likelihoods at the CML estimates for INGARCH(1,1) models with Poisson, NB and generalized Poisson innovations. The use of NB and generalized Poisson distributions leads to much better fitting models with AIC values in the range 2662 to 2666. With a further adjustment for starting at the eighth observation in the conditional log-likelihood, the AIC values are comparable with those in Table 2.
We also compare with the Poisson autoregression or INGARCH models in Fokianos et al. (2009). The models are conditional Poisson with a latent mean process $\{\Lambda_t\}$, where $[Y_t \mid \Lambda_s = \lambda_s : s \le t, \; Y_s = y_s : s < t] \sim \mathrm{Poisson}(\lambda_t)$ with
$\lambda_t = b_0 + b_1 \lambda_{t-1} + b_2 y_{t-1}$
or
$\lambda_t = ( b_0 + b_1 \exp\{ -\gamma \lambda_{t-1}^2 \} ) \lambda_{t-1} + b_2 y_{t-1}.$
As in Fokianos et al. (2009), we start the latent process with $\lambda_0 = 0$. There is some sensitivity to the starting value $y_0$, but we get similar maximum likelihood parameter estimates when starting $y_0$ at the sample mean. When evaluated with $i_{\mathrm{start}} = 8$, the AIC values are between 2847 and 2849 for these two models and correspond roughly to the AIC value for INGARCH(1,1) with Poisson innovations in Weiß (2018).
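For comparison with the GINAR likelihood, the linear recursion for $\lambda_t$ leads to a very short likelihood computation. The sketch below covers only the linear model, uses the starting values $\lambda_0 = 0$ and $y_0$ equal to the sample mean as described above, and assumes $b_0 > 0$ so that every $\lambda_t$ is positive; the function name `ingarch_neg_loglik` and its signature are hypothetical.

```python
import math

def ingarch_neg_loglik(b0, b1, b2, y, lam0=0.0, y0=None, i_start=1):
    """Negative log-likelihood for lambda_t = b0 + b1 * lambda_{t-1} + b2 * y_{t-1}.

    Terms with 1-based index t < i_start are skipped, but the recursion still
    runs from t = 1 so that lambda_t is defined for all later t.
    """
    if y0 is None:
        y0 = sum(y) / len(y)                       # start y_0 at the sample mean
    lam_prev, y_prev, nll = lam0, y0, 0.0
    for t, yt in enumerate(y, start=1):
        lam = b0 + b1 * lam_prev + b2 * y_prev
        if t >= i_start:
            # Poisson log pmf: y log(lambda) - lambda - log(y!)
            nll -= yt * math.log(lam) - lam - math.lgamma(yt + 1)
        lam_prev, y_prev = lam, yt
    return nll

if __name__ == "__main__":
    y = [3, 1, 4, 2, 5, 3, 2, 6, 4, 3]             # made-up toy series
    nll = ingarch_neg_loglik(0.5, 0.3, 0.4, y, i_start=3)
    print(nll, 2.0 * nll + 2.0 * 3)                # AIC with three parameters
```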
Table 3 has (a) CML estimates for the GINAR(6) models with I1, I2 and I3 thinning operators, and (b) model-based moments and acf’s from these models to compare with the empirical values. The comparison with empirical values provides a simple goodness-of-fit procedure; in this case, it shows that the fit from GINAR(6) with binomial thinning is worse.
Based on the parameter estimates in this table, and the last 6 values in the data series, $(y_{460}, y_{459}, y_{458}, y_{457}, y_{456}, y_{455}) = (3, 9, 29, 18, 20, 7)$, we can estimate the conditional mean $\hat\mu_{\mathrm{cond}}$, conditional variance $\hat\sigma^2_{\mathrm{cond}}$ and central 50% and 80% intervals from the estimated pmf $f_{Y_{n+1} \mid Y_n, \ldots, Y_{n-5}}(\cdot \mid y_n, \ldots, y_{n-5}; \hat\alpha_1, \ldots, \hat\alpha_6, \hat\theta_{\mathrm{innov}})$. These are summarized below.
• I1 (binomial thinning): $\hat\mu_{\mathrm{cond}} = 11.62$, $\hat\sigma^2_{\mathrm{cond}} = 25.66$, $[7, 12]$ with probability content 0.52; $[5, 16]$ with probability content 0.82;
• I2: $\hat\mu_{\mathrm{cond}} = 12.20$, $\hat\sigma^2_{\mathrm{cond}} = 30.31$, $[7, 14]$ with probability content 0.56; $[5, 18]$ with probability content 0.82;
• I3: $\hat\mu_{\mathrm{cond}} = 12.20$, $\hat\sigma^2_{\mathrm{cond}} = 30.92$, $[7, 13]$ with probability content 0.51; $[5, 18]$ with probability content 0.82.
For binomial thinning, the point and interval predictions are smaller and shorter. Please note that these prediction intervals would not be possible with estimation based on conditional least squares or the method of moments.
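Once the one-step-ahead pmf has been computed on $0, 1, \ldots, z_{\max}$, summaries of this kind follow from a few array operations. The sketch below uses one common convention for a central interval, taking the lower and upper endpoints as the $(1 - \text{level})/2$ and $(1 + \text{level})/2$ quantiles of the predictive cdf; this convention and the function name `predictive_summary` are assumptions here, and the article does not state which rule was used for the intervals listed above.

```python
import numpy as np

def predictive_summary(pmf, level=0.5):
    """Mean, variance, central interval and its probability content from a pmf on 0..z_max.

    The pmf is assumed to be truncated far enough that it sums to (nearly) 1.
    """
    pmf = np.asarray(pmf, dtype=float)
    z = np.arange(len(pmf))
    mean = float(np.sum(z * pmf))
    var = float(np.sum(z**2 * pmf) - mean**2)
    cdf = np.cumsum(pmf)
    lo = int(np.searchsorted(cdf, (1.0 - level) / 2.0))   # smallest z with F(z) >= (1 - level)/2
    hi = int(np.searchsorted(cdf, (1.0 + level) / 2.0))   # smallest z with F(z) >= (1 + level)/2
    content = float(cdf[hi] - (cdf[lo - 1] if lo > 0 else 0.0))
    return mean, var, (lo, hi), content

if __name__ == "__main__":
    toy_pmf = np.full(10, 0.1)                     # uniform on 0..9, for illustration only
    print(predictive_summary(toy_pmf, level=0.5))
```

Because the distribution is discrete, the probability content of the resulting interval is generally somewhat larger than the nominal level, as in the values 0.52 and 0.82 reported above.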

#### 6.2. Meningococcal Disease Data

The data set comes from the German national surveillance system for notifiable diseases, administered by the Robert Koch Institute. The time series consists of weekly numbers of meningococcal disease cases in Germany for the years 2001–2006 and the sample size is $n = 312$. In Pedeli et al. (2015), INAR(p) models are fitted with approximate likelihood based on a saddlepoint approximation. There is a seasonal pattern over the year and sinusoidal terms were used as covariates in the mean parameter of the innovations as indicated in Section 5.
We fit several GINAR(p) models with I2 and I3 thinning in addition to binomial thinning, using the numerical techniques in the preceding sections. The primary sinusoidal terms are $x_{t1} = \sin ( 2 \pi t / 52 )$ and $x_{t2} = \cos ( 2 \pi t / 52 )$. As indicated in Pedeli et al. (2015), adding the harmonic terms $x_{t3} = \sin ( 4 \pi t / 52 )$ and $x_{t4} = \cos ( 4 \pi t / 52 )$ does not lead to improvements based on AIC; the estimates of the corresponding $\beta$ regression parameters are at least 10 times smaller than those for $x_{t1}, x_{t2}$. The autoregressive order started at 1 and was increased to a value $p_{max}$ at which the last $\hat{\alpha}_j$ was close to 0 and the negative log-likelihood value was no longer improving.
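The seasonal covariates for weekly data can be generated as in the sketch below. The helper names are hypothetical, and the log-linear form of the innovation mean is an assumption made for illustration; Section 5 specifies how covariates actually enter the innovation distribution.

```python
import math

def harmonic_covariates(n, period=52, n_harmonics=1):
    """Rows [sin(2*pi*h*t/period), cos(2*pi*h*t/period) for h = 1..n_harmonics], t = 1..n."""
    rows = []
    for t in range(1, n + 1):
        row = []
        for h in range(1, n_harmonics + 1):
            angle = 2.0 * math.pi * h * t / period
            row += [math.sin(angle), math.cos(angle)]
        rows.append(row)
    return rows

def innovation_mean(x_row, beta0, beta):
    """Assumed log-linear mean exp(beta0 + sum_k beta_k * x_tk) of the innovation."""
    return math.exp(beta0 + sum(b * x for b, x in zip(beta, x_row)))

# n = 312 weeks; two harmonics give the columns x_t1, x_t2, x_t3, x_t4.
X = harmonic_covariates(312, n_harmonics=2)
```

Setting `n_harmonics=1` keeps only $x_{t1}$ and $x_{t2}$, the pair retained in the final models.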
Table 4 has some AIC values, based on $\mathit{istart} = 5$ in (10). For thinning with I2 and I3, based on AIC values, the models with Poisson innovations are a little better than those with NB innovations for $p = 2, 3, 4$. For binomial thinning, based on AIC values, the models with NB innovations are much better than those with Poisson innovations.
As in Section 6.1, the I2 and I3 thinning operators account for some conditional heteroscedasticity so that the use of the NB-distributed innovations leads to a flatter log-likelihood over the NB parameters.
Overall, from Table 4, the GINAR(2) models with I2 or I3 thinning, Poisson-distributed innovations and $x_{t1}, x_{t2}$ as covariates provide the best models. The best AIC values from GINAR(p) with I2 and I3 thinning are smaller than the best AIC values in Pedeli et al. (2015) (based on binomial thinning).
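The order selection can be reproduced mechanically from the tabulated values. The sketch below copies the AIC values for the models with covariates from Table 4 and picks the order with the smallest AIC for each thinning operator; the dictionary layout and function names are hypothetical.

```python
# AIC values from Table 4, models with covariates x_t1 and x_t2, keyed by order p.
AIC_WITH_COVARIATES = {
    "I1/NB": {1: 1689.3, 2: 1686.0, 3: 1684.5, 4: 1686.6},
    "I2/Po": {1: 1684.8, 2: 1681.5, 3: 1683.5, 4: 1685.9},
    "I3/Po": {1: 1683.9, 2: 1681.9, 3: 1682.3, 4: 1684.7},
}

def aic(neg_loglik, n_params):
    """AIC from a minimized negative log-likelihood and the number of parameters."""
    return 2.0 * neg_loglik + 2.0 * n_params

def best_order(aic_by_p):
    """Return (p, AIC) for the order with the smallest AIC."""
    p = min(aic_by_p, key=aic_by_p.get)
    return p, aic_by_p[p]

best = {model: best_order(values) for model, values in AIC_WITH_COVARIATES.items()}
```

For I2/Po and I3/Po the minimum is attained at $p = 2$, in line with the conclusion above, whereas binomial thinning with NB innovations needs $p = 3$ to reach its smallest AIC.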

## 7. Discussion

We showed how GINAR(p) count time series models can be estimated using numerical maximum likelihood. This allows for sensitivity analysis of model assumptions, which is not possible with the estimation methods of conditional least squares and the method of moments.
Future research includes the use of the thinning operators in Definition 3 in other applications, as well as derivations of other families of self-generalized random variables satisfying Definition 2.
Where researchers have been using binomial thinning in count time series models, we recommend the replacement with more general thinning operators that satisfy the self-generalized properties. These operators are more interpretable when the survivor interpretation for binomial thinning does not apply. The examples in this paper show that better fitting models can be obtained.
For likelihood computations, alternative methods can be considered. Although a greater coding effort would be needed, one could consider whether the method of Lu (2018), consisting of obtaining probabilities from derivatives of the pgf, is feasible for thinning operators with support on all non-negative integers.
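To make the idea of recovering probabilities from a pgf concrete, the sketch below uses a generic discrete Fourier inversion on the unit circle, $P(Y = k) \approx N^{-1} \sum_{m=0}^{N-1} G(e^{2 \pi i m / N}) e^{-2 \pi i m k / N}$. This is a textbook technique swapped in for illustration; it is neither the inversion algorithm used for the results in this paper nor the derivative-based method of Lu (2018), which relies on $P(Y = k) = G^{(k)}(0) / k!$. The example pgf assumes binomial thinning with a Poisson innovation, and all names and parameter values are hypothetical.

```python
import cmath
import math

def conditional_pgf(s, alphas, y_past, lam):
    """pgf of sum_j Binomial(y_j, alpha_j) + Poisson(lam), evaluated at complex s."""
    value = cmath.exp(lam * (s - 1.0))
    for a, y in zip(alphas, y_past):
        value *= (1.0 - a + a * s) ** y
    return value

def pmf_from_pgf(pgf, n_points=256):
    """Approximate P(Y = k), k = 0..n_points-1, by inverting the pgf on the unit circle.
    The result is aliased modulo n_points, so n_points must exceed the effective support."""
    roots = [cmath.exp(2j * math.pi * m / n_points) for m in range(n_points)]
    values = [pgf(s) for s in roots]
    pmf = []
    for k in range(n_points):
        # roots[(-m * k) % n_points] equals exp(-2*pi*i*m*k / n_points).
        total = sum(v * roots[(-m * k) % n_points] for m, v in enumerate(values))
        pmf.append(total.real / n_points)
    return pmf

# Toy case: one lag with y = 5, alpha = 0.3, Poisson innovation mean 2.
pmf = pmf_from_pgf(lambda s: conditional_pgf(s, [0.3], [5], 2.0))
```

In this toy case the first probability can be checked by hand, since $P(Y = 0) = 0.7^5 e^{-2}$, and the mean of the recovered pmf should equal $0.3 \cdot 5 + 2 = 3.5$ up to numerical error.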

## Funding

This research was funded by NSERC Discovery Grant 8698.

## Acknowledgments

Thanks to Konstantinos Fokianos for providing the data sets. Thanks to the referees for their constructive suggestions to improve the presentation.

## Conflicts of Interest

The author declares no conflict of interest.

## References

1. Al-Osh, Mohamed A., and Emad-Eldin A. A. Aly. 1992. First order autoregressive time-series with negative binomial and geometric marginals. Communications in Statistics—Theory and Methods 21: 2483–92.
2. Aly, Emad-Eldin A. A., and Nadjib Bouzar. 1994. Explicit stationary distributions for some Galton-Watson processes with immigration. Communications in Statistics—Stochastic Models 10: 499–517.
3. Aly, Emad-Eldin A. A., and Nadjib Bouzar. 2019. Expectation thinning operators based on linear fractional probability generating functions. Journal of the Indian Society for Probability and Statistics 20: 89–107.
4. Alzaid, Abdulhamid A., and Mohamed A. Al-Osh. 1993. Some autoregressive moving average processes with generalized Poisson marginal distributions. Annals of the Institute of Statistical Mathematics 45: 223–32.
5. Brännäs, Kurt, and A. M. M. Shahiduzzaman Quoreshi. 2010. Integer-valued moving average modelling of the number of transactions in stocks. Applied Financial Economics 20: 1429–40.
6. Davies, Robert B. 1973. Numerical inversion of a characteristic function. Biometrika 60: 415–17.
7. Davis, Richard A., Scott H. Holan, Robert Lund, and Nalini Ravishanker. 2015. Handbook of Discrete-Valued Time Series. Boca Raton: Chapman & Hall/CRC.
8. Du, Jin-Guan, and Yuan Li. 1991. The integer-valued autoregressive (INAR(p)) model. Journal of Time Series Analysis 12: 129–42.
9. Fokianos, Konstantinos, Anders Rahbek, and Dag Tjøstheim. 2009. Poisson autoregression. Journal of the American Statistical Association 104: 1430–39.
10. Gauthier, Geneviève, and Alain Latour. 1994. Convergence forte des estimateurs des paramètres d’un processus GENAR(p). Ann. Sci. Math. Québec 18: 49–71.
11. Joe, Harry. 1996. Time series models with univariate margins in the convolution-closed infinitely divisible class. Journal of Applied Probability 33: 664–77.
12. Joe, Harry. 1997. Multivariate Models and Dependence Concepts. London: Chapman & Hall.
13. Joe, Harry. 2015. Markov count time series models with covariates. In Handbook of Discrete-Valued Time Series. Edited by Richard A. Davis, Scott H. Holan, Robert Lund and Nalini Ravishanker. Boca Raton: Chapman & Hall/CRC, chp. 2. pp. 29–49.
14. Lu, Yang. 2018. Probabilistic Forecasting in Higher-Order INAR(p) Models. MPRA Paper 83682. Munich: University Library of Munich.
15. McKenzie, Ed. 1986. Autoregressive moving-average processes with negative-binomial and geometric marginal distributions. Advances in Applied Probability 18: 679–705.
16. McKenzie, Ed. 1987. Innovation distributions for gamma and negative binomial autoregressions. Scandinavian Journal of Statistics 14: 79–85.
17. Pedeli, Xanthi, Anthony C. Davison, and Konstantinos Fokianos. 2015. Likelihood estimation for the INAR(p) model by saddlepoint approximation. Journal of the American Statistical Association 110: 1229–38.
18. Weiß, Christian H. 2018. An Introduction to Discrete-Valued Time Series. Hoboken: Wiley.
19. Zhu, Rong, and Harry Joe. 2003. A new type of discrete self-decomposability and its application to continuous-time Markov processes for modeling count data time series. Stochastic Models 19: 235–54.
20. Zhu, Rong, and Harry Joe. 2010a. Count data time series models based on expectation thinning. Stochastic Models 26: 431–62.
21. Zhu, Rong, and Harry Joe. 2010b. Negative binomial time series models based on expectation thinning operators. Journal of Statistical Planning and Inference 140: 1874–88.
Table 1. Simulation results for different parameter vectors that come from fits to the Ericsson data in Section 6.1. The likelihoods were coded in Fortran90 and the numerical optimization was done with a link to R using nlm as the implementation of the quasi-Newton method. The simulation sample size was 500, and the timings were based on a PC with Intel Core i7-6770HQ processor at 2.6 GHz.
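
I1/NB, $p = 2$, $n = 500$, av = 0.85 min

| parameter | $\alpha_1 = 0.27$ | $\alpha_2 = 0.15$ | $\theta = 1.85$ | $\xi = 3$ |
|---|---|---|---|---|
| bias | −0.001 | −0.003 | 0.06 | −0.02 |
| rmse | 0.035 | 0.037 | 0.34 | 0.38 |

I1/NB, $p = 2$, $n = 2000$, av = 3.7 min

| parameter | $\alpha_1 = 0.27$ | $\alpha_2 = 0.15$ | $\theta = 1.85$ | $\xi = 3$ |
|---|---|---|---|---|
| bias | 0.000 | −0.001 | 0.02 | −0.02 |
| rmse | 0.018 | 0.018 | 0.17 | 0.20 |

I2/Po, $p = 2$, $n = 500$, av = 1.2 min

| parameter | $\gamma = 0.7$ | $\alpha_1 = 0.3$ | $\alpha_2 = 0.2$ | $\lambda = 4.5$ |
|---|---|---|---|---|
| bias | 0.001 | 0.000 | −0.005 | 0.03 |
| rmse | 0.027 | 0.045 | 0.046 | 0.35 |

I2/Po, $p = 2$, $n = 2000$, av = 5.0 min

| parameter | $\gamma = 0.7$ | $\alpha_1 = 0.3$ | $\alpha_2 = 0.2$ | $\lambda = 4.5$ |
|---|---|---|---|---|
| bias | −0.001 | 0.000 | 0.000 | 0.01 |
| rmse | 0.013 | 0.022 | 0.022 | 0.17 |

I2/Po, $p = 3$, $n = 500$, av = 2.9 min

| parameter | $\gamma = 0.64$ | $\alpha_1 = 0.27$ | $\alpha_2 = 0.14$ | $\alpha_3 = 0.20$ | $\lambda = 4.5$ |
|---|---|---|---|---|---|
| bias | 0.003 | −0.001 | −0.005 | −0.003 | 0.07 |
| rmse | 0.030 | 0.045 | 0.049 | 0.043 | 0.46 |

I2/Po, $p = 3$, $n = 2000$, av = 12.2 min

| parameter | $\gamma = 0.64$ | $\alpha_1 = 0.27$ | $\alpha_2 = 0.14$ | $\alpha_3 = 0.20$ | $\lambda = 4.5$ |
|---|---|---|---|---|---|
| bias | 0.000 | −0.001 | −0.001 | −0.001 | 0.02 |
| rmse | 0.014 | 0.022 | 0.024 | 0.022 | 0.23 |
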
Table 2. Ericsson transaction data: AIC values for (5) with three thinning operators and Poisson (Po) and negative binomial (NB) distributions for the innovation. When the autoregressive order reaches 7, there is no improvement in the log-likelihood and the last estimated $\alpha_j$ is close to 0. The AIC values are based on (10) with $\mathit{istart} = 8$. The AIC values for the best models for each of I1, I2, I3 are boldfaced. For $p = 1$ and 2, the AIC values for I2/NB are 2694.6 and 2679.4 respectively, and for I3/NB they are 2690.5 and 2677.1 respectively. For larger p, there is enough conditional heteroscedasticity from the thinning operators, and NB innovations did not lead to improved AIC values over Poisson innovations.
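
| p | I1/NB | I2/Po | I3/Po |
|---|---|---|---|
| 1 | 2695.7 | 2702.9 | 2722.0 |
| 2 | 2682.5 | 2681.0 | 2686.0 |
| 3 | 2670.6 | 2663.9 | 2666.1 |
| 4 | 2662.7 | 2654.3 | 2654.9 |
| 5 | 2657.0 | 2648.1 | 2647.3 |
| 6 | 2651.9 | 2641.1 | 2639.5 |
| 7 | 2653.3 | 2642.6 | 2640.9 |
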
Table 3. Ericsson transaction data: CML parameter estimates and corresponding SEs for GINAR(6) with I1, I2 and I3 thinning; also model-based summary statistics, to compare with empirical.
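
| Parameter | I1/NB | I2/Po | I3/Po |
|---|---|---|---|
| $\hat{\gamma}$ | | 0.533 (0.036) | 2.321 (0.333) |
| $\hat{\alpha}_1$ | 0.172 (0.037) | 0.187 (0.047) | 0.194 (0.047) |
| $\hat{\alpha}_2$ | 0.057 (0.037) | 0.068 (0.048) | 0.071 (0.047) |
| $\hat{\alpha}_3$ | 0.086 (0.036) | 0.109 (0.049) | 0.109 (0.047) |
| $\hat{\alpha}_4$ | 0.086 (0.037) | 0.116 (0.048) | 0.117 (0.047) |
| $\hat{\alpha}_5$ | 0.093 (0.038) | 0.104 (0.050) | 0.109 (0.047) |
| $\hat{\alpha}_6$ | 0.105 (0.038) | 0.142 (0.048) | 0.146 (0.047) |
| $\hat{\vartheta}$ | 1.068 (0.125) | | |
| $\hat{\xi}$ | 3.717 (0.509) | | |
| $\hat{\lambda}$ | | 2.704 (0.538) | 2.507 (0.531) |

| Summary | I1/NB | I2/Po | I3/Po | Empirical |
|---|---|---|---|---|
| $\hat{\mu}_Y$ | 9.879 | 9.889 | 9.892 | 9.909 |
| $\hat{\sigma}^2_Y$ | 27.460 | 30.070 | 31.707 | 32.837 |
| $\hat{\rho}_1$ | 0.257 | 0.350 | 0.374 | 0.405 |
| $\hat{\rho}_2$ | 0.176 | 0.278 | 0.302 | 0.340 |
| $\hat{\rho}_3$ | 0.189 | 0.297 | 0.317 | 0.372 |
| $\hat{\rho}_4$ | 0.193 | 0.305 | 0.326 | 0.377 |
| $\hat{\rho}_5$ | 0.201 | 0.302 | 0.326 | 0.358 |
| $\hat{\rho}_6$ | 0.205 | 0.321 | 0.343 | 0.352 |
| $\hat{\rho}_7$ | 0.123 | 0.227 | 0.250 | 0.298 |
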
Table 4. Meningococcal disease data: AIC values for (5) with three thinning operators and Poisson (Po) and negative binomial (NB) distributions for the innovation. The AIC values are based on (10) with $\mathit{istart} = 5$. The AIC values of the best models for each of I1, I2, I3 are boldfaced. The covariates are $x_{t1} = \sin ( 2 \pi t / 52 )$ and $x_{t2} = \cos ( 2 \pi t / 52 )$.
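
| p | I1/NB (no covariates) | I2/Po (no covariates) | I3/Po (no covariates) | I1/NB (covariates $x_{t1}, x_{t2}$) | I2/Po (covariates $x_{t1}, x_{t2}$) | I3/Po (covariates $x_{t1}, x_{t2}$) |
|---|---|---|---|---|---|---|
| 1 | 1766.5 | 1754.8 | 1758.5 | 1689.3 | 1684.8 | 1683.9 |
| 2 | 1738.5 | 1731.2 | 1730.0 | 1686.0 | 1681.5 | 1681.9 |
| 3 | 1726.6 | 1723.2 | 1721.6 | 1684.5 | 1683.5 | 1682.3 |
| 4 | 1728.7 | 1725.2 | 1723.6 | 1686.6 | 1685.9 | 1684.7 |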