A New Extension of Thinning-Based Integer-Valued Autoregressive Models for Count Data

Thinning operators play an important role in the analysis of integer-valued autoregressive (INAR) models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, a new thinning operator, named extended binomial thinning, is introduced, which contains binomial thinning as a special case. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersion features of count time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case, and conditional maximum likelihood estimation is also discussed. We also obtain the asymptotic properties of the two-step CLS estimator. Finally, three overdispersed or underdispersed real data sets are considered to illustrate the superior performance of the proposed model.


Introduction
Count time series arise naturally in many contexts, including actuarial science, epidemiology, finance, and economics. The last few years have witnessed rapid development in the modeling of time series of counts. One of the most common approaches for modeling integer-valued autoregressive (INAR) time series is based on thinning operators. In order to fit different kinds of situations, many corresponding operators have been developed; see [1] for a detailed discussion of thinning-based INAR models.
The most popular thinning operator is the binomial thinning operator introduced by [2]. Let X be a non-negative integer-valued random variable and α ∈ (0, 1); the binomial thinning operator is defined as

α ∘ X = Σ_{i=1}^{X} B_i for X > 0, and 0 otherwise,    (1)

where {B_i} is a sequence of independent and identically distributed (i.i.d.) Bernoulli random variables with fixed success probability α, and the B_i are independent of X. Based on the binomial thinning operator, [3,4] independently proposed the INAR(1) model

X_t = α ∘ X_{t−1} + ε_t,    (2)

where {ε_t} is a sequence of i.i.d. integer-valued random variables with finite mean and variance. Since this seminal work, INAR-type models have received considerable attention; for recent literature on this topic, see [5,6], among others. Note that B_i in (1) follows a Bernoulli distribution, so α ∘ X is always less than or equal to X; in other words, the first term on the right side of (2) cannot be greater than X_{t−1}, which limits the flexibility of the model. Despite this shortcoming, the simple form makes the parameter easy to estimate, and the operator shares many properties with the multiplication operator in the continuous case. For this reason, there have been many extensions of the binomial thinning operator since its emergence. Zhu and Joe [7] proposed the expectation thinning operator, a generalization of binomial thinning from the perspective of the probability generating function (pgf). Although this extension is very successful, its estimation procedure is somewhat complicated; compared with it, the thinning operator we propose is simpler and more intuitive. More recently, Yang et al. [8] proposed the generalized Poisson (GP) thinning operator, defined by replacing B_i with a GP counting series. Although the GP thinning operator is flexible and adaptable, we argue that it has a potential drawback: the GP distribution is not a strict probability distribution in the conventional sense.
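The two definitions above can be sketched in code. The following Python snippet (the paper's numerical work uses R, so this is only an illustrative translation, with function names of our own choosing) simulates binomial thinning and an INAR(1) path:

```python
import random

def binomial_thinning(alpha, x, rng=random):
    # alpha o X: the sum of X i.i.d. Bernoulli(alpha) counting variables,
    # and 0 when X = 0, so the result never exceeds X.
    return sum(1 for _ in range(x) if rng.random() < alpha)

def simulate_inar1(alpha, innovations, x0=0, rng=random):
    # X_t = alpha o X_{t-1} + eps_t for a supplied innovation sequence {eps_t}.
    path, x = [], x0
    for eps in innovations:
        x = binomial_thinning(alpha, x, rng) + eps
        path.append(x)
    return path
```

Because the thinning part is bounded by X_{t−1}, large downward jumps in the simulated path can only come from the thinning, which illustrates the flexibility limitation discussed above.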
Recently, Aly and Bouzar [9] introduced a two-parameter expectation thinning operator based on a linear fractional probability generating function, which can be regarded as a general case of at least nine thinning operators. Kang et al. [10] proposed another flexible thinning operator, named GSC after the three initiators of its counting series: Gómez-Déniz, Sarabia and Calderín-Ojeda.
Although the binomial thinning operator is very popular, it may not perform well for count time series with large values. Under such circumstances, the predicted data are often volatile, and the data are more likely to be non-stationary. We therefore aim to establish a new thinning operator that meets the following requirements: (i) it is an extension of the binomial thinning operator; (ii) it contains two parameters to achieve flexibility; (iii) it has a simple structure and is easy to implement.
Based on the above considerations, we propose a new thinning operator based on the extended binomial (EB) distribution. The operator has two parameters: a real-valued α and an integer-valued m (0 ≤ α ≤ 1, m ≥ 2), which makes it more flexible than single-parameter thinning operators, and the binomial thinning operator (1) can be regarded as the special case m = 2 of EB thinning. The case m > 2 usually performs better than m = 2 on some data sets with large values; in other words, EB thinning alleviates the main defect of binomial thinning to some extent. Since EB thinning is not a special case of the expectation thinning in [9], we further extend the framework of thinning-based INAR models and provide a new option for practical application. We therefore propose an INAR(1) model based on the EB thinning operator, which is an extension of model (2) and can more accurately and flexibly capture the dispersion features of real data. This paper is organized as follows. In Section 2, we review the properties of the EB distribution and then introduce the EB thinning operator; based on the new thinning operator, we propose a new INAR(1) model. In Section 3, two-step conditional least squares estimation is investigated for the innovation-free case of the model and the asymptotic properties of the estimator are obtained; conditional maximum likelihood estimation is discussed, and numerical simulations are conducted. In Section 4, we focus on forecasting and introduce two criteria to compare the prediction results for three overdispersed or underdispersed real data sets, which illustrate the better performance of the proposed model. In Section 5, we give some conclusions and related discussions.

A New INAR(1) Model
The EB distribution arises from the theory of extended Pascal triangles and can be regarded as a multivariate generalization of the binomial distribution; see [11] for more details. Based on this distribution, we introduce the EB thinning operator and propose a corresponding INAR(1) model.

EB Distribution
The EB random variable X_n(m, α), denoted EB(m, n, α), is defined as follows: where m and n are both integers satisfying m ≥ 2 and n ≥ 1; C_m(n, r) can be calculated as where s_1 = min{n, the integer part of r/m}; and α and β in (3) satisfy the following restriction, which is equivalent to β^m − α^m = β − α. The mean and variance of the EB random variable X_n(m, α) follow, respectively, as does the pgf of X_n(m, α). As X_n(m, α) can be expressed as a convolution, the EB distribution has the reproductive property.
Based on the reproductive property of the EB distribution, we define the EB thinning operator "∘" as follows for a non-negative integer-valued random variable X, where (m, α) ∘ X = 0 if X = 0. Note that the EB thinning operator reduces to the binomial thinning operator (1) when m = 2. It is easy to see that (m, α) ∘ X may be less than, equal to, or greater than X, so the EB thinning operator is quite flexible when dealing with overdispersed or underdispersed data sets. Remark 1. The computation of (α, β) for a given (µ, σ²) is based on (4) and (5); the solution can be obtained by solving these nonlinear equations. When m = 3, β = (−α + √(4 − 3α²))/2, and a closed-form expression is also available when m = 4. For more complex cases (m ≥ 5), we can derive the solution (α, β, m) by solving the corresponding large-scale nonlinear systems; a more detailed calculation procedure is given in Section 3.3.
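To make Remark 1 concrete: for β ≠ α, the restriction β^m − α^m = β − α is equivalent to Σ_{k=0}^{m−1} β^k α^{m−1−k} = 1, and the left side is increasing in β ≥ 0, so the root can be found by simple bisection. A minimal Python sketch (function name ours; the paper instead uses the R package BB for such systems), assuming 0 < α < 1:

```python
import math

def solve_beta(m, alpha, tol=1e-12):
    # Solve beta^m - alpha^m = beta - alpha for the root beta != alpha,
    # i.e. h(beta) = sum_{k=0}^{m-1} beta^k * alpha^(m-1-k) - 1 = 0,
    # by bisection on (0, 1); h is increasing in beta, h(0) < 0 < h(1).
    def h(beta):
        return sum(beta**k * alpha ** (m - 1 - k) for k in range(m)) - 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For m = 3 this agrees with the closed form β = (−α + √(4 − 3α²))/2 stated in Remark 1.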

EB-INAR(1) Model
Based on the EB thinning operator, we define the EB-INAR(1) model as

X_t = (m, α) ∘ X_{t−1} + ε_t,    (7)

where α ∈ (0, 1), {X_t} is a sequence of non-negative integer-valued random variables, the innovation process {ε_t} is a sequence of i.i.d. integer-valued random variables with finite mean and variance, and ε_t is independent of {X_s, s < t}.
In order to obtain the estimation equations, we give some conditional and unconditional moments of the EB-INAR(1) model in the following proposition. Proposition 1. Suppose {X_t} is a stationary process defined by (7) and let µ < 1; then for t ≥ 1, the following moments hold, where µ_ε and σ²_ε are the expectation and variance of the innovation ε_t, respectively.
The proof of some of these properties is given in Appendix A.

Remark 2. Inspired by the INAR(p) model in [12], we can further extend this model to INAR(p); the EB-INAR(p) model is defined as

X_t = (m, α_1) ∘ X_{t−1} + · · · + (m, α_p) ∘ X_{t−p} + ε_t,

where α_1, . . . , α_p ∈ (0, 1), m is an integer satisfying m ≥ 2, {X_t} is a sequence of non-negative integer-valued random variables, the innovation process {ε_t} is a sequence of i.i.d. integer-valued random variables with finite mean and variance, and ε_t is independent of {X_s, s < t}.
We will show that the new model can accurately and flexibly capture the dispersion features of real data in Section 4.

Estimation
We use the two-step conditional least squares (CLS) estimation proposed by [13] to investigate the innovation-free case, and the asymptotic properties of the estimators are obtained. Conditional maximum likelihood estimation for the parametric case is also discussed. Finally, we demonstrate the finite-sample performance via simulation studies.
Step 1.1. The estimator for θ_1. Let the CLS criterion function be as follows. Then the CLS estimator θ̂_{1,CLS} := (µ̂_CLS, µ̂_{ε,CLS}) of θ_1 can be obtained by solving the score equation set to zero, which yields a closed-form solution. Step 1.2. The estimator for θ_2.
Then the CLS criterion function for θ_2 can be written as follows. By solving the score equation set to zero, we obtain the CLS estimator θ̂_{2,CLS} := (σ̂²_CLS, σ̂²_{ε,CLS}) of θ_2, which also has a closed-form solution. Step 2. Estimating the parameters (m, α) via the method of moments. The estimator (m̂, α̂) of (m, α), called the two-step CLS estimator, can be obtained by solving the following estimation equations, where α and β satisfy (4). The resulting CLS estimator is Θ̂_CLS = (m̂_CLS, α̂_CLS, µ̂_{ε,CLS}, σ̂²_{ε,CLS})′. To study the asymptotic behaviour of the estimator, we make the following assumptions: Assumption 1. {X_t} is a stationary and ergodic process.
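Since the conditional mean of the model is linear, E(X_t | X_{t−1}) = µ X_{t−1} + µ_ε, the Step 1.1 closed form amounts to ordinary least squares of X_t on X_{t−1}. A minimal Python sketch of this standard regression form (function name ours):

```python
def cls_step1(x):
    # CLS for the linear conditional mean E(X_t | X_{t-1}) = mu * X_{t-1} + mu_eps:
    # minimizing sum_t (X_t - mu * X_{t-1} - mu_eps)^2 is OLS of X_t on X_{t-1}.
    xprev, xcurr = x[:-1], x[1:]
    n = len(xprev)
    mx = sum(xprev) / n
    my = sum(xcurr) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xprev, xcurr))
    sxx = sum((a - mx) ** 2 for a in xprev)
    mu_hat = sxy / sxx
    mu_eps_hat = my - mu_hat * mx
    return mu_hat, mu_eps_hat
```

Step 1.2 then estimates the variance parameters from the squared residuals of this regression in the same least-squares fashion.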

Proposition 2.
Under Assumptions 1 and 2, the CLS estimator θ̂_{1,CLS} is strongly consistent and asymptotically normal. To obtain the asymptotic normality of θ̂_{2,CLS}, we make a further assumption (Assumption 3); then we have the following proposition.

Proposition 3.
Under Assumptions 1 and 3, the CLS estimator θ̂_{2,CLS} is strongly consistent and asymptotically normal. Based on Propositions 2 and 3 and Theorem 3.2 in [14], we have the following proposition.

Proposition 5.
Under Assumptions 1 and 3, the CLS estimator (m̂_CLS, α̂_CLS) is strongly consistent and asymptotically normal, where D is given in (9), and m_0 and α_0 denote the true values of m and α, respectively.
The brief proofs of Propositions 2-5 are given in Appendix A.

Conditional Maximum Likelihood Estimation
We maximize the likelihood function with respect to the model parameters θ = (m, α, δ) to obtain the conditional maximum likelihood (CML) estimate in the parametric case, where δ is the parameter of ε_i, P_{X_1} is the pmf of X_1 and P_θ(X_{i+1} | X_i) is the conditional pmf. Since the marginal distribution is difficult to obtain in general, a simple approach is to condition on the observed X_1. Ignoring the dependence on the initial value and maximizing the conditional log-likelihood over Θ given X_1, we denote the CML estimate by θ̂ = (m̂, α̂, δ̂). The log-likelihood function is as follows, where α and β satisfy (4) and ε_i follows a non-negative discrete distribution with parameter δ. In what follows, we consider the two cases m = 3, 4.
where β is given in Remark 1. For higher-order m, the formula is somewhat tedious and is omitted here. For the EB-INAR(p) model, CML estimation is too complicated, but the two-step CLS estimation remains quite feasible; the procedure is similar to the case p = 1.
For this reason, we only consider the EB-INAR(1) case in the simulation studies.
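For intuition about the conditional likelihood above, in the binomial thinning special case (m = 2) with Poisson(λ) innovations the transition pmf has a simple convolution form: a Binomial(X_{t−1}, α) number of survivors plus a Poisson(λ) innovation. A Python sketch of this m = 2 case only (function names ours; the EB transition pmf for m ≥ 3 is more involved and is not shown):

```python
import math

def transition_pmf(x_next, x_prev, alpha, lam):
    # P(X_t = x_next | X_{t-1} = x_prev) for binomial-thinning INAR(1) with
    # Poisson(lam) innovations: convolve Binomial(x_prev, alpha) survivors
    # with a Poisson(lam) innovation count.
    total = 0.0
    for k in range(min(x_next, x_prev) + 1):
        binom = math.comb(x_prev, k) * alpha**k * (1 - alpha) ** (x_prev - k)
        pois = math.exp(-lam) * lam ** (x_next - k) / math.factorial(x_next - k)
        total += binom * pois
    return total

def cond_log_lik(x, alpha, lam):
    # Conditional log-likelihood given X_1, as in the CML approach above.
    return sum(math.log(transition_pmf(x[t], x[t - 1], alpha, lam))
               for t in range(1, len(x)))
```

Maximizing `cond_log_lik` over (α, λ) with a numerical optimizer (the paper uses the R package maxLik) gives the CML estimate for this special case.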

Simulation
A Monte Carlo simulation study was conducted to evaluate the finite sample performance of the estimator. For CLS estimation, we used the package BB in R for solving and optimizing large-scale nonlinear systems to solve Equations (4) and (8). For CML estimation, we used the package maxLik in R to maximize the log-likelihood function.
We considered the following configurations of the parameters. For the CLS estimate, the solutions of (4) and (8) are sensitive to µ̂ and σ̂², so we adopted the following estimation procedure: first, calculate 500 groups of µ̂ and σ̂² estimates; then use the mean values of µ̂ and σ̂² to solve Equations (4) and (8). The CLS simulation results are summarized in Table 1. We found that the estimates get closer to the true values and the RMSE values gradually decrease as the sample size increases. As it is somewhat difficult to estimate the parameter m by CML, we treated m as known. The CML simulation results are given in Table 2. In all cases, the estimates show small RMSE values, which gradually decrease as the sample size increases.

Real Data Examples
In this section, three real data sets, covering overdispersed and underdispersed settings, are considered to illustrate the better performance of the proposed model. The first example is overdispersed crime data from Pittsburgh; the second is overdispersed stock data from the New York Stock Exchange (NYSE); and the third is underdispersed crime data from Pittsburgh, which was also analyzed by [15]. As is well known, forecasting is very important in evaluating time series models. We first introduce two forecasting criteria and some other preparations.

Forecasting
Before introducing the evaluation criteria, we briefly describe the basic procedure. First, we divide the n_1 + n_2 observations into two parts: the training set with the first n_1 observations and the prediction set with the last n_2 observations. The training set is used to estimate the parameters and evaluate the fit of the model. We can then evaluate the efficiency of each model by comparing the following criteria between the predicted data and the real data in the prediction set.
Similar to the procedure in [16], which performs an out-of-sample experiment to compare the forecasting performance of two model-based bootstrap approaches, our forecasting procedure is as follows. For each t = (n_1 + 1), . . . , (n_1 + n_2 − 5), we estimate an INAR(1) model on the data x_1, . . . , x_t and use the fitted result to generate the next five forecasts, called the 5-step-ahead forecasts x^F_{t+1}, . . . , x^F_{t+5}, where x^F_t is the forecast at time t. In this way we obtain many sequences of 1-, 2-, . . . , 5-step-ahead forecasts; finally, we replicate the whole procedure P times. We can then evaluate the point forecast accuracy by the forecast mean square error (FMSE) and the forecast mean absolute error (FMAE), where x_i is the true value at time i, x̄^F_i is the mean of all the forecasts at i and P is the number of replicates.
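The two criteria can be computed as in the following Python sketch (function name ours; averaging the squared and absolute errors over the prediction points is our assumed normalisation):

```python
def fmse_fmae(actuals, mean_forecasts):
    # FMSE: average squared error between the true x_i and the mean forecast
    # xbar^F_i; FMAE: average absolute error, over the prediction points.
    n = len(actuals)
    fmse = sum((x - f) ** 2 for x, f in zip(actuals, mean_forecasts)) / n
    fmae = sum(abs(x - f) for x, f in zip(actuals, mean_forecasts)) / n
    return fmse, fmae
```

Lower values of both criteria indicate better point forecast accuracy.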

Overdispersed Cases
We consider two overdispersed data sets. The first contains 144 observations and represents monthly tallies of crime data from the Forecasting Principles website http://www.forecastingprinciples.com; these crimes were reported in the police car beats in Pittsburgh from January 1990 to December 2001. The second is the Empire District Electric Company (EDE) data set from the Trades and Quotes (TAQ) database of the NYSE, which contains 300 observations and was also analyzed by [17].

P1V Data
The 45th P1V (Part 1 Violent Crimes) data set contains crimes of murder, rape, robbery and other kinds; see the data dictionary on the Forecasting Principles website for more details. Figure 1 shows the time series plot, the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the 45th P1V series. The maximum value of the data is 15 and the minimum is 0; the mean is 4.3333 and the variance is 7.4685. From the ACF plot, we see that the data are dependent. From the PACF plot, only the first lag is significant, which strongly suggests an INAR(1) model.
First, we divided the data set into two parts: the training set with the first n_1 = 134 observations and the prediction set with the last n_2 = 10 observations. We fit the training set with the following models: the expectation thinning INAR(1) (ETINAR(1)) model in [9], the GSC thinning INAR(1) (GSCINAR(1)) model in [10], the binomial thinning INAR(1) model and the EB thinning EB-INAR(1) models with m = 3, 4. According to the mean and variance of the P1V data, we used the geometric distribution, one of the most common settings, as the innovation distribution in the above models.
In order to compare the effectiveness of the models, we consider the following evaluation criteria: (1) AIC. (2) The mean and standard error of the Pearson residuals r_t and the related Ljung-Box statistics, where the Pearson residual is defined as r_t = (X_t − µ̂_t)/σ̂_t, with µ̂_t and σ̂²_t the estimated conditional expectation and variance under the related thinning operators, respectively. (3) Three goodness-of-fit statistics: RMS (root mean square error), MAE (mean absolute error) and MdAE (median absolute error), where the error is defined as X_t − E(X_t | X_{t−1}), t = 1, . . . , n_1. (4) The mean x̄ of the training set calculated from the estimated results. Next, focusing on forecasting, we generated P = 100 replicates based on the training set for each model and calculated the FMSE and FMAE for each model.
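Criterion (2) can be computed as follows. This Python sketch (helper names ours) forms the Pearson residuals from fitted conditional moments and the Ljung-Box statistic Q = n(n + 2) Σ_{k=1}^{h} ρ̂²_k/(n − k):

```python
def pearson_residuals(x, cond_means, cond_vars):
    # r_t = (X_t - fitted E(X_t | X_{t-1})) / sqrt(fitted Var(X_t | X_{t-1})).
    return [(xi - m) / v ** 0.5 for xi, m, v in zip(x, cond_means, cond_vars)]

def ljung_box(resid, lags=15):
    # Q = n (n + 2) * sum_{k=1}^{h} rhohat_k^2 / (n - k), where rhohat_k is the
    # lag-k sample autocorrelation of the residuals.
    n = len(resid)
    mean = sum(resid) / n
    c0 = sum((r - mean) ** 2 for r in resid) / n
    q = 0.0
    for k in range(1, lags + 1):
        ck = sum((resid[t] - mean) * (resid[t - k] - mean)
                 for t in range(k, n)) / n
        q += (ck / c0) ** 2 / (n - k)
    return n * (n + 2) * q
```

A well-fitting model leaves residuals with Q below the χ² critical value for the chosen number of lags.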
All results of the fitted models are given in Table 3. There is no evidence of any correlation within the residuals of any of the five models, which is also supported by the Ljung-Box statistic based on 15 lags (because χ²_{0.05}(14) = 23.6847). There were no significant differences in the RMS, MAE, MdAE and x̄ values (the true mean of the 134 training observations is 4.3880) across the models. In other words, no model performed best in terms of these four criteria, so we also considered AIC. Note that the CML estimator cannot be adopted for the GSCINAR(1) model.

EDE Data

Here we analyze a portion of the data between the first and fourth trading days. As there are 75 five-minute intervals per day, the sample size is T = 300. Figure 2 shows the time series plot, the ACF and the PACF of the EDE series. The maximum value of the data is 25 and the minimum is 0; the mean is 4.6933 and the variance is 14.1665. Based on the time series plot, the series does not seem completely stationary, with several outliers or influential observations. Zhu et al. [18] analyzed Poisson autoregression for stock transaction data with extreme values, which could be considered in the current setting. From the ACF plot, we see that the data are dependent. From the PACF plot, only the first lag is significant, which strongly suggests an INAR(1) model. We used the same procedures and criteria as before, with the geometric distribution as the innovation distribution in the above models.
First, we divided the data set into two parts: the training set with the first n_1 = 270 observations and the prediction set with the last n_2 = 30 observations. All results of the fitted models are given in Table 4.

Underdispersed Case
The 11th FAMVIOL data set contains crimes of family violence and can also be obtained from the Forecasting Principles website. Figure 3 shows the time series plot, the ACF and the PACF of the 11th FAMVIOL series. The maximum value of the data is 3 and the minimum is 0; the mean is 0.4027 and the variance is 0.3820. We use the procedures and criteria of Section 4.2.1 to compare the different models. According to the mean and variance of the FAMVIOL data, we used the Poisson distribution, one of the most common settings, as the innovation distribution in the above models.
All results of the fitted models are given in Table 5. There is no evidence of any correlation within the residuals of any of the five models, which is also supported by the Ljung-Box statistic based on 15 lags. There are no significant differences in the fit and forecasting criteria across the models, although ETINAR(1), with the largest AIC, performed the worst. We now give a brief summary. For the P1V data and the stock data, which are overdispersed with moderately high counts, the EB-INAR(1) model with m > 2 is clearly better than that with m = 2. For the FAMVIOL data, which is underdispersed with small counts, the EB-INAR(1) model with m > 2 is also competitive.

Conclusions
This paper proposes an EB-INAR(1) model based on the newly constructed EB thinning operator, which extends the family of thinning-based INAR models. We gave estimation methods for the parameters and established the asymptotic properties of the estimators for the innovation-free case. The simulations and real data analysis show that the EB-INAR(1) model can accurately and flexibly capture the dispersion features of the data, demonstrating its effectiveness and practicality. Compared with other models, such as ETINAR(1) and GSCINAR(1), our model is competitive.
We point out that many existing integer-valued models can be generalized by replacing the binomial thinning operator with the EB thinning operator, such as the models in [19][20][21][22][23]. In addition, the first-order INAR model considered here can be extended to higher-order ones. Further research will be pursued in the future.

Conflicts of Interest:
The authors declare no conflict of interest.
Thus, the autocorrelation function is Corr(X_t, X_{t−h}) = µ^h.
Proof of Propositions 2 and 3. Propositions 2 and 3 are similar to Theorems 1 and 2 in [8] and can be proved by verifying the regularity conditions of Theorems 3.1 and 3.2 in [24]. For instance, in the proof of Proposition 2, the requirement in [24] that the partial derivatives ∂g(α, F_{m−1})/∂α_i have finite fourth moments corresponds to Assumption 2 in Section 3.1, and u²_m(α) in [24] corresponds to q_{1t}(θ_1) in Step 1.1. Hence, Proposition 2 can be regarded as a direct consequence of Theorem 3.2.
Besides, the proof of Proposition 3 is similar to that of Proposition 2; the procedure is almost the same as for Theorem 3.2 in [24].