Generalized Binary Time Series Models

Abstract: The serial dependence of categorical data is commonly described using Markovian models. Such models are very flexible, but they can suffer from a huge number of parameters if the state space or the model order becomes large. To address the problem of a large number of model parameters, the class of (new) discrete autoregressive moving-average (NDARMA) models has been proposed as a parsimonious alternative to Markov models. However, NDARMA models do not allow any negative model parameters, which might be a severe drawback in practical applications. In particular, this model class cannot capture any negative serial correlation. For the special case of binary data, we propose an extension of the NDARMA model class that allows for negative model parameters and, hence, negative autocorrelations, leading to the considerably larger and more flexible model class of generalized binary ARMA (gbARMA) processes. We provide stationarity conditions, give the stationary solution, and derive stochastic properties of gbARMA processes. For the purely autoregressive case, classical Yule-Walker equations hold that facilitate parameter estimation of gbAR models. Yule-Walker type equations are also derived for gbARMA processes.


Introduction
Categorical time series data are collected in many fields of application, and statistical research focusing on such data structures has evolved considerably in recent years. As an important special case, binary time series, i.e., categorical data with two categories, occur in many different contexts. Often, binary time series are obtained by binarization of observed real-valued data. Such processes are considered, e.g., in Kedem and Fokianos (2002). In Figure 1, we show three real data examples of binary time series from different fields of research. For example, in Figure 1a, the eruption duration of the Old Faithful Geyser in Yellowstone National Park is binarized using a threshold: an eruption is coded with a value of one if it lasts longer than three minutes and zero if it is shorter. In economics, the two states of recession and economic growth are of interest, as discussed, e.g., in Bellégo (2009). One example of a recession/no-recession time series is shown in Figure 1b, where for every quarter it is shown whether Italy is in a recession, indicated by zero, or not, indicated by one. Recently, there has been great interest in air pollution in European cities, where an exceedance of the threshold of 50 µg/m³ PM10 (fine dust) triggers a fine dust alarm. The resulting sequence of states, with no exceedance corresponding to zero and exceedance corresponding to one, is shown in Figure 1c. Further examples can be found, e.g., in geography, where sequences with the two states of dry and wet days are considered, e.g., in Buishand (1978). In biomedical studies, binary time series occur when participants keep daily diaries of their disease.
For example, in clinical trials, as in Fitzmaurice et al. (1995), participants' binary self-assessments of their arthritis are collected, where poor is indicated by zero and good by one. In natural language processing, the occurrence of vowels as a sequence can be of interest, as considered in Weiß (2009b), where a text is binarized into the two states consonant and vowel. The binarization of a time series by a threshold, as, e.g., in the PM10 example, or by categorizing the time series into two states, as, e.g., into dry and wet days, indeed simplifies the real-valued time series to a binary version. As mentioned in Kedem (1980), the transformation nevertheless preserves the random mechanism from which the data are generated. For the example of PM10 data, it is often of more interest whether a certain threshold is crossed than what the actual amount is. In general, the rhythm within the binarized time series contains a great amount of information about the original data. As discussed in Kedem (1980), binary Markov chains are typically used for modelling the dependence structure due to their great flexibility. However, the number of parameters to estimate from the data grows exponentially with the order of the Markov model, leading to over-parametrization (see, e.g., McKenzie (2003)).
To avoid the estimation of a large number of parameters, Jacobs and Lewis (1983) proposed the class of (new) discrete autoregressive moving-average (NDARMA) models for categorical time series. More precisely, for processes with discrete and finite state space, a parsimonious model is suggested. The idea is to choose the current value X_t randomly either from the past values of the time series X_{t−1}, …, X_{t−p} or from one of the innovations e_t, e_{t−1}, …, e_{t−q} with certain probabilities, respectively. This random selection mechanism is described by independent and identically distributed (i.i.d.) random vectors (P_t, t ∈ Z) with P_t := (a_t^(1), …, a_t^(p), b_t^(0), …, b_t^(q)) ~ Mult(1; P), where Mult(1; P) denotes the multinomial distribution with parameter 1 and probability vector P := (α^(1), …, α^(p), β^(0), …, β^(q)) with α^(1), …, α^(p) ∈ [0, 1), β^(0) ∈ (0, 1] and β^(1), …, β^(q) ∈ [0, 1) such that ∑_{i=1}^p α^(i) + ∑_{j=0}^q β^(j) = 1. Then, the NDARMA(p,q) model equation is given by

X_t = a_t^(1) X_{t−1} + ⋯ + a_t^(p) X_{t−p} + b_t^(0) e_t + ⋯ + b_t^(q) e_{t−q},

where (e_t)_{t∈Z} is an i.i.d. process taking values in a discrete and finite state space S. Since for each time point t only one entry of the random vector P_t is realized to be one while all others are zero, X_t takes either one of the values X_s for s ∈ {t − 1, …, t − p} or one of the error terms e_s for s ∈ {t, …, t − q}. This sampling mechanism ensures that the time series takes values in the state space S, such that, e.g., for a binary time series with S = {0, 1}, the process stays binary. In contrast to the real-valued ARMA model, the lagged time series values and errors are not weighted according to the model coefficients and summed up, since only one of them is actually multiplied by one and all the others by zero, depending on the realization of P_t. The model parameters are the probabilities of the multinomial distribution, summarized in the parameter vector P, where all entries of P lie in the unit interval and sum up to one.
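To make the selection mechanism concrete, the following sketch simulates a binary NDAR(1) process, i.e., an NDARMA(1,0) model (the function name and parameter values are ours, chosen for illustration):

```python
import numpy as np

def simulate_ndar1(alpha, p1, T, seed=0):
    """Simulate a binary NDAR(1) process: X_t equals its predecessor
    X_{t-1} with probability alpha, and a fresh innovation
    e_t ~ Bernoulli(p1) with probability 1 - alpha."""
    rng = np.random.default_rng(seed)
    x = np.empty(T, dtype=int)
    x[0] = rng.binomial(1, p1)           # start from the error distribution
    for t in range(1, T):
        if rng.random() < alpha:         # pick the predecessor ...
            x[t] = x[t - 1]
        else:                            # ... or a new innovation e_t
            x[t] = rng.binomial(1, p1)
    return x

x = simulate_ndar1(alpha=0.8, p1=0.5, T=50000)
# for NDAR(1), the lag-1 autocorrelation equals alpha, here 0.8
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
```

A large α produces the long runs of identical values visible in Figure 2, since switching requires selecting the innovation and the innovation taking the opposite value.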
In comparison to Markov chains, NDARMA models maintain the nicely interpretable ARMA-type structure and have a parsimonious parameterization. Furthermore, NDARMA models fulfill certain Yule-Walker-type equations, as shown in Weiß and Göb (2008).
In Figure 2, one realization of an NDARMA(1,0) process, denoted by NDAR(1), with binary state space is shown. NDAR(1) models are probably the simplest members of the NDARMA class, but Figure 2 nicely illustrates the limited flexibility of the whole NDARMA class. The sampling mechanism of choosing the predecessor with some probability α tends to generate long runs of the same value, in particular when the parameter α ∈ (0, 1) is large. A switch from one state to the other, e.g., from X_{t−1} = 0 to X_t = 1, can only occur if the error term e_t is selected (with probability 1 − α) and takes the value e_t = 1. Hence, the NDARMA class does not allow systematically selecting the opposite value of X_{t−1} for X_t.
In the NDARMA class, all model parameters are restricted to be non-negative, which explains in particular why the NDARMA class can model exclusively non-negative autocorrelations in the data. For the example of an NDAR(1) process, the autocorrelation at lag one is equal to α ∈ [0, 1), such that any alternating pattern that corresponds to negative model parameters, as, e.g., observed in Figure 1a, cannot be captured. For a more detailed discussion of the properties of NDARMA models, we refer also to Jacobs and Lewis (1983) or Weiß (2009a). To increase its flexibility, Gouveia et al. (2018) proposed an extension of the NDARMA model class by using a variation function, but the resulting models also do not allow for negative model parameters and, hence, no negative dependence structure. Consequently, whenever a negative dependence structure is present in binary time series data, the NDARMA model is not suitable. In fact, in all three data examples of Figure 1, a straightforward estimation based on Yule-Walker estimators leads to at least some negative coefficients, such that NDAR models turn out to be not applicable.
To address this lacking flexibility of the NDARMA model class, we propose a simple and straightforward extension of the original idea of Jacobs and Lewis (1983) that also allows for negative serial dependence. The resulting generalized binary ARMA (gbARMA) model class maintains the nicely interpretable model structure. Furthermore, no additional parameters are required to handle the negative dependence, preserving the parsimonious parameterization as well. In Figure 3, a realization of a gbARMA(1,0) process, denoted as gbAR(1), following Equation (4) with parameter vector P = [−0.7, 0.3] and error distribution P(e_t = 1) = 0.5 is shown together with the corresponding autocorrelation function (ACF). As a straightforward extension of the NDAR(1) model in Figure 2, gbAR(1) models allow for negative serial dependence. In fact, the range of the autocorrelation at lag one is extended from [0, 1) for NDAR(1) to (−1, 1) for gbAR(1) models.
To allow for negative autocorrelation to some limited extent, Kanter (1975) proposed the binary ARMA model class, in which the modulo 2 operator is applied in an ARMA-type model equation. Using the modulo operation ensures that the process stays in the binary state space, but the nice interpretability of the dependence structure in the model is lost, since the past values of the time series are summed up prior to the modulo operation; see also McKenzie (1981). We follow a different path in this paper and propose a much simpler operation that enables modeling a systematic change of the state from one time point to the other.
The idea of allowing for negative serial dependence resulting in the gbARMA class is as follows: a negative model parameter α ∈ (−1, 0) (and, hence, a negative autocorrelation α ∈ (−1, 0)) in binary time series data corresponds to the time series systematically changing from one state to the other over time. Hence, the natural idea to incorporate negative serial dependence in the binary NDAR(1) Model (Equation (2)) is to replace X_{t−1} by 1 − X_{t−1}, as Corr(X_{t−1}, 1 − X_{t−1}) = −1 holds. This leads to the model equation

X_t = a_t (1 − X_{t−1}) + b_t e_t, with (a_t, b_t) ~ Mult(1; (|α|, 1 − |α|)).

This process has negative autocorrelation α at lag one. Note that, in comparison to Equation (2), as α ∈ (−1, 0) here, we have to use its absolute value |α| as the probability of selecting 1 − X_{t−1}. Altogether, for α ∈ (−1, 1), we can define the generalized binary AR(1) (gbAR(1)) process by the model equation

X_t = a_t X_{t−1} + b_t e_t if α ≥ 0, and X_t = a_t (1 − X_{t−1}) + b_t e_t if α < 0.   (4)

Note that Equation (4) extends the parameter space from α ∈ [0, 1) for NDAR(1) models to α ∈ (−1, 1) for gbAR(1) models. Further, note that, for identification of the model, we have to assume β^(0) = β ∈ (0, 1]. Using indicator variables, Equation (4) can be compactly written as

X_t = a_t ( 1(α ≥ 0) X_{t−1} + 1(α < 0)(1 − X_{t−1}) ) + b_t e_t.   (5)

In Figure 3, a realization of a gbAR(1) process with negative parameter α = −0.7 is shown, where the time series tends to systematically take the opposite state of its predecessor. The corresponding autocorrelation plot reflects the negative serial dependence, leading to an alternating pattern. Runs of the same state can only occur when the error term e_t is selected (with probability 1 − |α|) and the error term e_t takes the same value as X_{t−1}, that is, e_t = X_{t−1}. The empirical autocorrelations for the Old Faithful Geyser data can be found in Figure 4a, where the pronounced alternating behavior clearly indicates negative linear dependence to be present in the data.
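This sign-dependent selection can be sketched directly in code (an illustrative simulation, not the authors' implementation; the function name is ours):

```python
import numpy as np

def simulate_gbar1(alpha, p1, T, seed=0):
    """Simulate a binary gbAR(1) process for alpha in (-1, 1): with
    probability |alpha| take X_{t-1} (if alpha >= 0) or the mirrored
    value 1 - X_{t-1} (if alpha < 0); otherwise draw e_t ~ Bernoulli(p1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(T, dtype=int)
    x[0] = rng.binomial(1, p1)
    for t in range(1, T):
        if rng.random() < abs(alpha):
            x[t] = x[t - 1] if alpha >= 0 else 1 - x[t - 1]
        else:
            x[t] = rng.binomial(1, p1)
    return x

x = simulate_gbar1(alpha=-0.7, p1=0.5, T=50000)
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]   # should be close to alpha = -0.7
```

For α = −0.7 the sample path alternates most of the time, matching the behavior shown in Figure 3.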
The idea of allowing for a negative model coefficient by replacing X_{t−1} by 1 − X_{t−1} in gbAR(1) processes (Equation (5)) can also be employed for each parameter in pth-order gbAR processes, where each X_{t−i}, i = 1, …, p, may be replaced by 1 − X_{t−i} in the model equation.
The paper is organized as follows. In Section 2, generalized binary AR processes of order p ∈ N are defined, and we also give stationarity conditions and state the stationary solution. Further, stochastic properties are derived that include formulas for the transition probabilities, the marginal distribution, and Yule-Walker equations. As a real data example, we illustrate the applicability of our model class to the geyser eruption data from Section 1. In Section 3, we present several simulation experiments. First, in Section 3.1, for the example of a gbAR(2) model, we illustrate the generality of the resulting gbAR model class in comparison to natural competitors including AR, NDAR, and Markov models of order two, respectively. In Section 3.2.1, we examine the estimation performance of Yule-Walker estimators in gbAR models. In Section 3.2.2, we investigate the benefit of using the parsimonious gbAR models in comparison to Markov models and their robustness in cases where the model is mis-specified. By adding a moving-average part to gbAR models in Section 4, ARMA-type extensions of gbAR models leading to gbARMA processes are discussed. We conclude in Section 5. All proofs are deferred to Appendix A.

The Generalized Binary Autoregressive (gbAR) Model Class
We now define generalized binary AR(p) (gbAR(p)) models for binary data based on the notation of NDAR(p) models by adopting the idea of replacing X_{t−1} by 1 − X_{t−1} for a negative parameter α, as in Equations (5) and (6), separately for all or some of the lagged values X_{t−1}, …, X_{t−p}. To be most flexible, each parameter α^(i) corresponding to the lagged value X_{t−i}, i = 1, …, p, is allowed to be either positive or negative, that is, α^(i) ∈ (−1, 1), respectively.

gbAR Models
The parameter vector P := (α^(1), …, α^(p), β^(0)) contains the probabilities of the multinomial distribution that controls the selection mechanism of NDAR models. As we now allow for α^(i) ∈ (−1, 1), i = 1, …, p, i.e., the parameters can be negative, P has to be modified to serve again as a parameter vector of probabilities. This is achieved by taking entry-wise absolute values, and we define

P_|·| := (|α^(1)|, …, |α^(p)|, β^(0)),   (7)

where β^(0) := 1 − ∑_{i=1}^p |α^(i)|. This enables us to give the definition of the generalized binary AR model of arbitrary order p ∈ N.
Definition 1 (Generalized binary AR processes). Let (X_t)_{t∈Z} be a stationary process taking values in {0, 1}. Let (e_t)_{t∈Z} be a binary error process such that e_t is independent of (X_s)_{s<t} with mean µ_e = E(e_t) = P(e_t = 1) and variance σ²_e = Var(e_t) = P(e_t = 1)(1 − P(e_t = 1)) > 0. Let P := (α^(1), …, α^(p), β^(0)) be the parameter vector with P_|·| as in Equation (7) such that P_|·|^T 1_{p+1} = 1, with 1_{p+1} the one vector of length p + 1. Further, let

(P_t := (a_t^(1), …, a_t^(p), b_t^(0)), t ∈ Z) ~ Mult(1; P_|·|)

be i.i.d. random vectors, which are independent of (e_t)_{t∈Z} and (X_s)_{s<t}. Then, the process (X_t)_{t∈Z} is said to be a generalized binary AR process of order p (gbAR(p)) if it follows the recursion

X_t = ∑_{i=1}^p ( a_t^(+,i) X_{t−i} + a_t^(−,i) (1 − X_{t−i}) ) + b_t^(0) e_t,   (8)

where a_t^(+,i) := a_t^(i) 1(α^(i) ≥ 0) and a_t^(−,i) := a_t^(i) 1(α^(i) < 0). By rewriting the random variables a_t^(·,i), · ∈ {−, +}, in the defining model (Equation (8)), the model can be represented in the spirit of Equation (5). However, the benefit of the representation in Equation (8) is that only one random variable is multiplied with the lagged value X_{t−i}, whereas a_t^(−,i) is an additional random variable that accounts for the switching that leads to negative model coefficients.
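The recursion of Definition 1 can be sketched as a direct simulation: draw the multinomial index with probabilities (|α^(1)|, …, |α^(p)|, β^(0)) and, depending on the sign of the selected α^(i), keep or mirror the lagged value (an illustrative sketch; names and parameter values are ours):

```python
import numpy as np

def simulate_gbar(alphas, p1, T, seed=0):
    """Simulate a gbAR(p) process. `alphas` may contain negative entries;
    the multinomial selection uses |alpha_i|, and a negative alpha_i means
    the mirrored value 1 - X_{t-i} is taken instead of X_{t-i}."""
    alphas = np.asarray(alphas, dtype=float)
    beta0 = 1.0 - np.abs(alphas).sum()
    assert beta0 > 0, "stationarity requires sum(|alpha_i|) < 1"
    probs = np.concatenate([np.abs(alphas), [beta0]])
    p = len(alphas)
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p1, size=p).tolist()   # arbitrary burn-in start
    for t in range(p, T + p):
        k = rng.choice(p + 1, p=probs)          # which term is selected
        if k == p:                              # innovation e_t
            x.append(rng.binomial(1, p1))
        elif alphas[k] >= 0:                    # keep lagged value
            x.append(x[t - 1 - k])
        else:                                   # switch lagged value
            x.append(1 - x[t - 1 - k])
    return np.array(x[p:])

x = simulate_gbar([-0.5, 0.3], p1=0.5, T=20000)
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
```

If the classical Yule-Walker relations hold, this parameter choice implies a clearly negative lag-one autocorrelation of about α^(1)/(1 − α^(2)) ≈ −0.71.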

Stochastic Properties of gbAR Models
Before calculating moments of the binary time series process (X_t)_{t∈Z} itself, we first consider the expectation of the random variables related to the multinomial selection mechanism. Note that E(a_t^(+,i)) = |α^(i)| 1(α^(i) ≥ 0), E(a_t^(−,i)) = |α^(i)| 1(α^(i) < 0) and E(b_t^(0)) = β^(0). This enables us to compute the stationary mean µ_X = E(X_t) of the process, directly leading to

µ_X = ( ∑_{i: α^(i) < 0} |α^(i)| + β^(0) µ_e ) / ( 1 − ∑_{i=1}^p α^(i) ).   (9)

If all parameters α^(1), …, α^(p) are non-negative, the above formula becomes µ_X = µ_e, the well-known formula for the mean of NDAR(p) models. Otherwise, we have µ_X ≠ µ_e for gbAR(p) models, in contrast to NDAR(p) models (see, e.g., Weiß (2009a)).
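As a quick sanity check (an illustrative Monte Carlo sketch with hypothetical parameter values): for p = 1 and α < 0, the selection mechanism gives the fixed-point relation µ_X = |α|(1 − µ_X) + (1 − |α|)µ_e, whose solution should match the simulated mean and visibly differ from µ_e:

```python
import numpy as np

# gbAR(1) with alpha < 0: X_t = 1 - X_{t-1} with probability |alpha|,
# and X_t = e_t ~ Bernoulli(mu_e) with probability 1 - |alpha|.
alpha, mu_e = -0.7, 0.8
# solve mu_X = |alpha| * (1 - mu_X) + (1 - |alpha|) * mu_e for mu_X
mu_X = (abs(alpha) + (1 - abs(alpha)) * mu_e) / (1 + abs(alpha))

rng = np.random.default_rng(1)
T = 200000
x = np.empty(T, dtype=int)
x[0] = rng.binomial(1, mu_e)
for t in range(1, T):
    x[t] = (1 - x[t - 1]) if rng.random() < abs(alpha) else rng.binomial(1, mu_e)
emp_mean = x.mean()   # should be close to mu_X, not to mu_e
```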
For the familiar stationarity condition imposed on the model parameters α^(1), …, α^(p), that all roots of the characteristic polynomial lie outside the unit circle, i.e., if

1 − α^(1) z − ⋯ − α^(p) z^p ≠ 0 for all |z| ≤ 1   (10)

holds, the stationary solution of the gbAR(p) model can be derived. Note that the condition in (10) is equivalent to ∑_{i=1}^p |α^(i)| < 1, such that the error has to be selected with strictly positive probability β^(0) > 0 by the multinomial distribution. If the stationarity condition in Equation (10) holds, a moving-average representation of the gbAR(p) process can be derived.
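The root condition can be verified numerically, e.g., with numpy.roots (a sketch; the helper name is ours):

```python
import numpy as np

def is_stationary_gbar(alphas):
    """Check the stationarity condition via the characteristic polynomial
    1 - alpha_1 z - ... - alpha_p z^p: all roots must lie outside the
    unit circle."""
    # np.roots expects coefficients ordered from the highest degree down
    coeffs = np.r_[[-a for a in alphas[::-1]], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

ok = is_stationary_gbar([-0.5, 0.3])    # sum of |alpha_i| = 0.8 < 1
bad = is_stationary_gbar([0.9, 0.2])    # sum of |alpha_i| = 1.1 >= 1
```

For gbAR parameter vectors, checking ∑_i |α^(i)| < 1 directly is of course simpler; the root check mirrors the classical formulation of the condition.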
For constructing the stationary solution of the gbAR time series, we follow the common approach based on a multivariate representation of the model, as in (Lütkepohl 2005, Chap. 11.3.2). Precisely, the gbAR(p) model can be written as a p-dimensional gbVAR(1) process (Y_t, t ∈ Z) with suitable matrices and vectors, such that the first entry of (Y_t, t ∈ Z) is equal to the gbAR(p) process. We define Y_t := (X_t, X_{t−1}, …, X_{t−p+1})^T. To obtain a vector autoregressive representation for Y_t, we have to define several matrices that contain the random variables of the multinomial distribution. Precisely, for · ∈ {−, +}, let A_t^(·) be p × p matrices that carry the selection variables a_t^(·,1), …, a_t^(·,p) in their first row, where 0_{r×s} denotes the (r × s)-dimensional zero matrix. Based on the notation introduced above, gbAR(p) processes can be represented as a vector-valued gbAR model of first order (gbVAR(1)), where 1_p is the one vector of length p. This notation enables us to state a moving-average representation of gbAR(p) processes as follows.
Theorem 1 (Moving-average representation of gbAR processes). Let (X_t)_{t∈Z} be a stationary gbAR(p) process, that is, (X_t)_{t∈Z} fulfills Equation (10). Then, we have: (i) for p = 1, the gbAR(1) model has a gbMA(∞)-type representation (in the L²-sense), in which X_t is written as an infinite weighted sum of present and past error terms with weights ζ_i, where ζ_0 := I_K and the remaining ζ_i are built from products of the random selection matrices. Here, e_1 is the first unit vector and 1_p is the one vector of length p.
The notation used here is obtained from that used in Section 4.2 for the special case of q = 0.
Hence, the process can be represented as an infinite weighted sum of the error terms. However, in comparison to classical AR or NDAR processes, an additional term appears that accounts for potentially negative parameters and, consequently, allows negative dependence to be modeled. This term vanishes if all parameters α^(1), …, α^(p) are positive.
The second-order dependence structure of gbAR processes coincides with that of AR or NDAR processes in the sense that the same Yule-Walker equations for h ≠ 0 hold. However, note again that the parameter space for gbAR models is considerably larger than for NDAR models, allowing for negative parameters and, hence, more flexibility. The Yule-Walker equations link the model parameters to the autocovariances of the process. Hence, they can be used for estimating the model parameters by the same well-known Yule-Walker estimators. A link between the autocovariances, the model coefficients, and the mean and variance of the error terms is established by the Yule-Walker equation for h = 0.
(i) For all h ∈ N, we have γ_X(h) = ∑_{i=1}^p α^(i) γ_X(h − i). (ii) For h = 0, a corresponding equation links γ_X(0) to the model coefficients and the mean and variance of the error terms. The next lemma states some basic properties of the marginal distribution of gbAR processes and their transition probabilities. Since the time series has a binary state space, these conditional probabilities allow quantifying the probability of reaching a certain state from the past values. In their derivation, the multinomial selection mechanism plays a crucial role, and, in the stated formulas, the Kronecker delta δ_ij indicates whether a past value actually has an impact on the outcome of the time series or not.
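Since Part (i) is the classical Yule-Walker system, the autoregressive parameters can be estimated exactly as for real-valued AR models by solving a Toeplitz system in the sample autocorrelations (a sketch; the function name and test values are ours):

```python
import numpy as np

def yule_walker_gbar(x, p):
    """Yule-Walker estimates for a gbAR(p) process: solve the classical
    system rho(h) = sum_i alpha_i * rho(h - i), h = 1..p, using sample
    autocorrelations of the binary series x."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    T = len(x)
    gamma = np.array([xc[:T - h] @ xc[h:] / T for h in range(p + 1)])
    rho = gamma / gamma[0]
    # Toeplitz system R @ alpha = r with R[i, j] = rho(|i - j|)
    R = np.array([[rho[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, rho[1:p + 1])

# hypothetical check on a simulated gbAR(1) path with alpha = -0.6
rng = np.random.default_rng(2)
T = 50000
x = np.empty(T, dtype=int)
x[0] = rng.binomial(1, 0.5)
for t in range(1, T):
    x[t] = (1 - x[t - 1]) if rng.random() < 0.6 else rng.binomial(1, 0.5)
alpha_hat = yule_walker_gbar(x, p=1)
```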
Lemma 1 (Marginal, joint and transition probabilities of gbAR processes). Let (X_t)_{t∈Z} be a stationary gbAR(p) model and set p_i := P(e_t = i). Then, formulas for the marginal, joint, and transition probabilities can be derived. Comparing the results in Lemma 1 with (Weiß 2009a, Lemma 11.2.1.3), established for NDARMA(p,q) processes, the main difference is in Part (iii). The marginal distribution of the NDARMA process is equal to the marginal distribution of the error term process, but this does not hold for gbAR processes. Instead, the marginal distribution of gbAR processes depends on an additional term that results from the absolute values of the negative parameters.
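For the special case p = 1, the transition probabilities of Lemma 1 can be tabulated as a 2 × 2 matrix. The sketch below (hypothetical parameter values) shows how the sign of α decides whether the extra probability mass |α| is placed on repeating or on switching the previous state:

```python
import numpy as np

def gbar1_transition_matrix(alpha, p1):
    """2x2 one-step transition matrix of a gbAR(1) process.
    Row = state at t-1, column = state at t.  For alpha >= 0 the
    predecessor's value is repeated with probability alpha; for alpha < 0
    the opposite value is taken with probability |alpha|."""
    p = np.array([1 - p1, p1])              # error distribution (p_0, p_1)
    Q = np.empty((2, 2))
    for i1 in (0, 1):
        for i0 in (0, 1):
            match = (i0 == i1) if alpha >= 0 else (i0 == 1 - i1)
            Q[i1, i0] = (1 - abs(alpha)) * p[i0] + abs(alpha) * match
    return Q

Q = gbar1_transition_matrix(alpha=-0.7, p1=0.5)
# e.g. from state 0, the chain moves to state 1 with probability
# 0.7 + 0.3 * 0.5 = 0.85
```

With α = −0.7 and p_1 = 0.5, each row places probability 0.85 on switching, which is exactly the alternating behavior of Figure 3.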
We conclude this section with a more detailed look at the gbAR(1) model in the following example and with a subsequent real data example.
Example 1 (The gbAR(1) model). The i.i.d. error term process (e_t, t ∈ Z) follows the distribution P(e_t = 1) = p_1 ∈ (0, 1), such that µ_e = p_1 and σ²_e = p_1(1 − p_1) > 0. Then, the model equation equals Equation (5). At each time point t, if α^(1) ≥ 0, either the predecessor X_{t−1} with probability α^(1) or the error term e_t with probability β^(0) is selected by a multinomially distributed random variable to determine X_t. In the case of α^(1) < 0, either 1 − X_{t−1} with probability |α^(1)| or the error term e_t with probability β^(0) is selected. That is, for each t, either a_t or b_t is equal to one and the other is zero. For positive values of α^(1), the gbAR(1) model coincides with the NDAR(1) model. A corresponding realization is shown in Figure 2, where for large values of α^(1) mainly the predecessor X_{t−1} is chosen and long runs of the same value occur. Figure 3 shows one realization of a gbAR(1) process with a negative value of α^(1). The time series switches its state from zero to one and vice versa at most time points. The transition probability to move from state i_1 at time t − 1 to state i_0 at time t is given by

P(X_t = i_0 | X_{t−1} = i_1) = β^(0) p_{i_0} + |α^(1)| ( 1(α^(1) ≥ 0) δ_{i_0, i_1} + 1(α^(1) < 0) δ_{i_0, 1−i_1} ).   (16)

The probability of the process taking the value i_0 = 1 depends on two terms. First, the probability of choosing the error term is multiplied by the probability of the error term taking the same value as X_t, i.e., P(e_t = i_0) = p_{i_0} with i_0 = 1. Whether the probability of choosing the predecessor is added depends on its value and the sign of α. If, for example, α < 0, then the probability of choosing X_{t−1} is only added if its value is the opposite of i_0, such that the Kronecker delta is equal to one. This leads to a case-wise representation of Equation (16).

Example 2 (Eruption duration of the Old Faithful Geyser). The binarized eruption duration of the Old Faithful Geyser is illustrated in Figure 1a.
Its empirical autocorrelation, as shown in Figure 4a, clearly indicates that there is negative serial dependence present in the data, such that a gbAR(p) process appears to be appropriate. The order selection using the AIC criterion leads to a model of order p = 2 with AIC = 159.83. This selection is confirmed by an inspection of the partial autocorrelation in Figure 4b. Parameter estimation is based on the Yule-Walker Equation (14), leading to the estimated parameter vector P = [−0.3949, 0.2659, 0.3393] and the corresponding fitted gbAR(2) model. The sample mean of the binary time series is equal to µ_X = 0.6488, since long eruptions of the geyser arise more often than short ones. The first parameter α^(1) is indeed estimated to be negative and the second one, α^(2), to be positive. From Figure 1a, a change from zero to one or vice versa can be observed in many time steps, whereas the runs of ones in the time series correspond in most cases to choosing an error term. The error term distribution is calculated by Equation (9) with µ_e = P(e_t = 1) = 0.9953. To measure the predictive power of the estimated model, we use ROC curves and the corresponding area under the curve (AUC). The ROC concept indicates a good predictive performance whenever the resulting curve is "far away" above the diagonal, leading to an AUC larger than 0.5. Note that the diagonal corresponds to the case of independent observations, where no prediction based on past values is meaningful. For the one-step-ahead prediction, the transition probability of Lemma 1 (ii) is used by plugging in the estimated probabilities.
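The AUC itself can be obtained from the rank (Mann-Whitney) formula, i.e., the probability that a randomly chosen one receives a higher predicted probability than a randomly chosen zero. The following sketch is a self-contained illustration, not the authors' implementation:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formula:
    the probability that a randomly chosen positive receives a higher
    score than a randomly chosen negative (ties count one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# toy check: perfectly separated scores give AUC = 1
a = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to the diagonal, i.e., to prediction no better than chance.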
Comparing the predictor to the realized values in the sample leads to the ROC curve shown in Figure 5, where the corresponding AUC becomes 0.8317. Hence, as the ROC curve is "far away" above the diagonal and the AUC is larger than 0.5, the prediction performance of the gbAR model turns out to be considerably better than that of a model that relies on independent observations. By allowing for negative model parameters, gbAR models appear to be suitable for this real data example that shows negative serial dependence. For further improvement, the Yule-Walker estimates might serve as starting values for a maximum likelihood estimation (MLE) based on the conditional log-likelihood function with conditional probabilities p(x_t | x_{t−1}, …, x_{t−p}) := P(X_t = x_t | X_{t−1} = x_{t−1}, …, X_{t−p} = x_{t−p}) (see also Weiß (2018)). The corresponding results are reported in Table 1; they differ only slightly, and the differences decrease with increasing subsample sizes.
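For p = 1, the conditional log-likelihood can be assembled directly from the transition probabilities in Equation (16) (a sketch with a hypothetical function name; for an alternating path, the likelihood indeed favors a negative α):

```python
import numpy as np

def cond_loglik_gbar1(x, alpha, p1):
    """Conditional log-likelihood of a binary gbAR(1) path, built from the
    one-step transition probabilities:
    P(X_t = i0 | X_{t-1} = i1) = (1 - |alpha|) * p_{i0}
        + |alpha| * [i0 == i1]      if alpha >= 0,
        + |alpha| * [i0 == 1 - i1]  if alpha < 0."""
    x = np.asarray(x, dtype=int)
    ll = 0.0
    for t in range(1, len(x)):
        p_err = p1 if x[t] == 1 else 1 - p1
        match = (x[t] == x[t - 1]) if alpha >= 0 else (x[t] == 1 - x[t - 1])
        prob = (1 - abs(alpha)) * p_err + abs(alpha) * match
        ll += np.log(prob)
    return ll

# the likelihood should prefer a negative alpha for an alternating path
x = [0, 1, 0, 1, 0, 1, 0, 1]
ll_neg = cond_loglik_gbar1(x, alpha=-0.6, p1=0.5)
ll_pos = cond_loglik_gbar1(x, alpha=0.6, p1=0.5)
```

Maximizing this criterion over α and p_1, e.g., with a grid or a numerical optimizer started at the Yule-Walker estimates, gives the conditional MLE.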

Generality of the gbAR Model Class and Estimation Performance
In this section, we investigate the generality of the gbAR model class in comparison to obvious competitors in Section 3.1 and address the estimation performance in different setups and in comparison to parameter-intensive Markovian models in Section 3.2.
The parameter range as well as the range of autocorrelation pairs for the gbAR(2) model is considerably larger than that of an NDAR(2) model. The range of the classical AR model is again larger, but this is an unfair comparison, as the AR model has been proposed for continuous data and is not suitable for binary data at all. In Figure 6b, the areas of AR(2), NDAR(2), and binAR(2) models are hyperboloid-shaped and, as shown in Jacobs and Lewis (1983), the autocorrelations of the NDAR(2) model take only positive values. In contrast to NDAR(2) processes, the binAR(2) process captures an additional area that corresponds to negative serial dependence. The range of autocorrelation pairs for the gbAR(2) model is not hyperboloid-shaped, but forms a triangle, which actually comes close to the range of the AR(2) model. Compared to NDAR(2) processes, the extension allowing also for negative parameters leads to a much larger range of possible autocorrelation pairs than just the mirrored half-parabola. This is explained by the four times larger possible range for the model parameters of gbAR(2) processes in comparison to NDAR(2) processes, as shown in Figure 6a. In summary, by allowing for negative model parameters in gbAR models, we obtain a considerably more flexible model class in comparison to NDAR models that is suitable to capture a wider range of dependence structures of binary time series data. In Figure 7, the possible ranges of autocorrelation pairs of gbAR(2) processes and Markov chains of order two are shown together. Recall that the gbAR(2) model is a parsimonious member of the class of Markov chains of order two and hence less flexible.
Interestingly, with respect to pairs of autocorrelations at lags one and two, the possible range for the gbAR(2) model is only slightly smaller than that of a Markov chain of order two. Moreover, the largest range shown for the (continuous) AR(2) models in Figure 6b (in black) cannot be attained by Markov chains of order two. Hence, gbAR(2) models cover a large portion of the possible range of autocorrelation pairs at lags one and two of second-order Markov chains. However, keep in mind that, for a pth-order Markov chain, 2^p parameters have to be estimated. For example, 2^2 = 4 parameters need to be specified for a second-order Markov chain, whereas gbAR(2) processes only require three parameters.

Simulations
In this simulation study, we addressed two things. First, in Section 3.2.1, we investigated the estimation performance of Yule-Walker estimators for gbAR models of different orders and sample sizes. Second, in Section 3.2.2, we studied the flexibility of the gbAR model class and compared the prediction performance to Markovian models, both in the case where the estimated model was correctly specified as a gbAR model and in the case where the underlying model was a Markovian model that does not belong to the class of gbAR models.

Estimation Performance
To study the estimation performance of Yule-Walker estimators in gbAR models, we considered three different specifications of gbAR(p) processes with p = 1, 2, 3 and sample sizes T = 100, 200, 500, 1000, referred to as data generating processes DGP1-3 in the following. The model parameters summarized in P were estimated based on the Yule-Walker Equation (14) and the error term distribution using Equation (9). Note that, in all setups, we considered gbAR models that make use of the extended parameter space by including negative parameters α^(i) in the model. For each DGP, we simulated 1000 replications and calculated the mean squared error (MSE) to measure the estimation performance. Table 2 summarizes the simulation results for all DGPs and all considered sample sizes. The estimation performance is generally good, as confirmed by rather small MSEs. As expected, in all considered setups, the estimation performance improves with increasing sample size. It is interesting to note that, relative to the estimation of the other quantities, the estimation of the mean of the error terms µ_e is generally less precise. This can be explained by the fact that the error term e_t enters the time series only when it is actually selected, which happens only with probabilities β^(0) = 0.15, 0.2 and 0.0874 for DGP1, DGP2 and DGP3, respectively. A comparison of the gbAR models of different orders shows that the estimation performance declines with increasing order, which is plausible, as the number of parameters gets larger, leading to more estimation uncertainty.
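The design of this study can be mimicked in a few lines. The following Monte Carlo sketch (hypothetical gbAR(1) DGP with α = −0.6 and µ_e = 0.5, not one of the DGPs above) computes the MSE of the lag-one Yule-Walker estimate over repeated simulations:

```python
import numpy as np

# Monte Carlo sketch: simulate many gbAR(1) paths, estimate alpha by the
# lag-1 sample autocorrelation, and average the squared errors.
rng = np.random.default_rng(3)
alpha, p1, T, reps = -0.6, 0.5, 500, 200
errs = []
for _ in range(reps):
    x = np.empty(T)
    x[0] = rng.binomial(1, p1)
    for t in range(1, T):
        x[t] = (1 - x[t - 1]) if rng.random() < abs(alpha) else rng.binomial(1, p1)
    xc = x - x.mean()
    alpha_hat = (xc[:-1] @ xc[1:]) / (xc @ xc)   # Yule-Walker at lag 1
    errs.append((alpha_hat - alpha) ** 2)
mse = float(np.mean(errs))
```

Increasing T shrinks the MSE, mirroring the pattern reported in Table 2.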

Robustness of gbAR Model Class
The class of gbAR(p) models forms a parsimoniously parametrized subclass of the class of Markovian models of order p. To study the benefit of this newly proposed class of binary models, we compared the gbAR(p) model to Markov chains, which are most commonly used for binary data.
First, let us consider the case of an underlying gbAR model. Since gbAR(p) processes have a Markov chain representation, a comparison in terms of the transition probabilities is suitable. From Lemma 1 (ii), the transition probabilities of gbAR models compute to

P(X_t = i_0 | X_{t−1} = i_1, …, X_{t−p} = i_p) = β^(0) p_{i_0} + ∑_{j=1}^p |α^(j)| ( 1(α^(j) ≥ 0) δ_{i_0, i_j} + 1(α^(j) < 0) δ_{i_0, 1−i_j} ).   (18)

First, for p = 1, 2, 3, we simulated realizations of different sample sizes from the gbAR(p) models defined as DGP1-3 in Section 3.2.1 in order to estimate the transition probabilities of (a) a pth-order Markov chain and (b) a gbAR(p) process. For gbAR models, the true transition probabilities are given by Equation (18) and can be estimated by replacing the model parameters by the corresponding estimators. Then, the MSE is calculated model-wise over all transition probabilities. For all three DGPs with model orders p = 1, 2, 3, the simulation results are stated in Table 3. By "MSE gbAR(p)", we denote the mean squared error obtained by evaluating the difference between the estimated transition probabilities of a gbAR(p) fit and the truth over all possible transition probabilities. Equivalently, "MSE MC" denotes the corresponding difference between the estimated transition probabilities of a Markov model and the truth over all possible transition probabilities. The MSEs are calculated over 1000 replications and show clearly that the estimated gbAR(p) transition probabilities have smaller MSEs for all sample sizes and orders in comparison to the MSEs of the Markov chain fits. This indicates that, in the case of an underlying gbAR(p) process, fitting the more parsimonious model to the data leads to better estimation performance than fitting a Markov chain.
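For the Markov chain side of the comparison, the transition probabilities can be estimated nonparametrically by pattern frequencies. The following sketch (names ours) tabulates, for each of the 2^p past patterns, the empirical probabilities of moving to state zero or one:

```python
import numpy as np

def estimate_markov_transitions(x, p):
    """Empirical transition probabilities of a binary Markov chain of
    order p: for each of the 2**p past patterns, the relative frequencies
    of moving to state 0 or 1."""
    x = np.asarray(x, dtype=int)
    counts = np.zeros((2 ** p, 2))
    for t in range(p, len(x)):
        # encode the pattern (x_{t-p}, ..., x_{t-1}) as an integer
        idx = int("".join(map(str, x[t - p:t])), 2)
        counts[idx, x[t]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1                 # avoid division by zero
    return counts / totals

# toy deterministic alternating series 0,1,0,1,... of length 10
x = np.arange(10) % 2
Q_hat = estimate_markov_transitions(x, p=1)
```

Unlike the gbAR fit, this estimator has to fill all 2^p rows from the data, which explains its larger estimation uncertainty for small samples.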
Next, we considered the situation where the underlying model is a Markov chain of order p that does not belong to the subclass of gbAR(p) models. In general, the 2 p × 2-dimensional transition probability matrix Q of a Markov chain of order p is defined by where For the simulations, we had to make sure that the used specifications of Q are such that the resulting model is not a member of the gbAR class. For a set of specified transition probabilities, it is actually easy to check whether the resulting model is a gbAR model, by checking whether Equation (18) hold true. It turns out that the class of gbAR(1) models and the class of binary Markov chains of order 1 coincide. Hence, for the simulations study, we chose transition probabilities such that Equation (18) Using such transition probabilities summarized in Q, binary time series were generated. Again, a gbAR(p) process and a pth order Markov chain were fitted. In Table 4, the MSE estimation performance for the different DGPs is summarized. Interestingly, although the corresponding gbAR fits (for p = 2, 3) do actually estimate the wrong models, with respect to MSE over all transition probabilities, their estimation performances for small sample sizes are superior to those of Markov chains that estimate the correct models. However, for large sample sizes, the estimated Markov models outperform the mis-specified gbAR model fits. As it is estimating the true model, this pattern was expected. In summary, for time series with small sample size, where the true underlying DGP is indeed a Markov chain and not a gbAR(p) process, the parsimonious gbAR model might be a good approximation, leading potentially to more precise estimates of the transition probabilities although the model is mis-specified. Table 3. 
Comparison of the estimation performance of gbAR(p) model fits and Markov chain fits of order p, for p = 1, 2, 3, for the three different gbAR-DGP1-3, with respect to the mean squared difference between estimated transition probabilities and the truth over 1000 Monte Carlo replications.

Further Extension: The Generalized Binary ARMA Class
In this section, we extend the gbAR model class and give a definition of generalized binary ARMA (gbARMA) models that additionally contain a moving-average part in their model equations. In the spirit of the gbAR model as an extension of the NDAR model class, we also allow for negative parameters in the moving-average part of the model.
First, we provide the definition of the gbARMA(p,q) model, derive its stationary solution, and state some basic properties of marginal, joint, and transition probabilities of gbARMA(p,q) processes. We conclude this section with an example of a gbARMA(1,1) process.

Stochastic Properties of gbARMA Models
When allowing for possibly negative parameters also in the moving-average part of gbARMA models, the idea of Equation (4) is employed for the lagged error terms as well. Hence, negative dependence can be modeled in the moving-average part, too. In the multinomial distribution, all entries of the parameter vector P have to be taken in absolute value; thus, we have to use P^{|·|} as defined in Equation (21). For the expectation of gbARMA processes, two additional sums show up in comparison to the NDARMA case. The construction of the stationary solution of a gbARMA time series is similar to the construction of the gbAR(p) process introduced in Section 2.1 and (Lütkepohl 2005, chp. 11.3.2). The vector representation of the process (Y_t, t ∈ Z) is equipped with a moving-average part, and thus the dimension of the corresponding random matrices becomes (p + q) × (p + q). Precisely, the gbARMA(p,q) model can be written as a (p + q)-dimensional gbVAR(1) process (Y_t, t ∈ Z) with suitable matrices and vectors, such that the first entry of Y_t equals the gbARMA(p,q) process. To obtain a vector autoregressive representation for Y_t, we directly define matrices that contain the random variables of the multinomial distribution. Precisely, for · ∈ {−, +}, the diagonal and upper blocks of these matrices are p × p, p × q, and q × q matrices, respectively, and A_{t,21} := 0_{q×p}. Based on the notation introduced above, gbARMA(p,q) processes can be represented as a vector-valued gbAR model of first order (gbVAR(1)), with 1_{p+q} being the vector of ones of length p + q.
To derive a suitable stationarity condition for the process, we can draw on results in Lütkepohl (2005). A gbARMA(p,q) process is stationary if the roots of the characteristic polynomial of the autoregressive part lie outside the unit circle. This assumption is fulfilled whenever an error term has a positive probability of being selected, that is, whenever |β^(j)| > 0 for some j ∈ {0, . . . , q}. In this case, the sum over all probabilities of choosing a predecessor fulfills ∑_{i=1}^p |α^(i)| < 1. Without loss of generality, we assume that β^(0) is strictly positive for a stationary gbARMA process, i.e., β^(0) ∈ (0, 1]. For a stationary gbARMA(p,q) process, a moving average representation can be derived using the vectors and matrices defined above.
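The stationarity condition can be checked numerically. The following sketch (our own helper, not from the paper) tests whether all roots of the AR characteristic polynomial built from the |α^(i)| lie outside the unit circle; for nonnegative coefficients this is equivalent to ∑ |α^(i)| < 1.

```python
import numpy as np

def is_stationary(alphas):
    """Check the stationarity condition for the AR part of a gbAR/gbARMA model:
    all roots of 1 - |a1| z - ... - |ap| z^p must lie outside the unit circle."""
    a = np.abs(np.asarray(alphas, dtype=float))
    # numpy.roots expects coefficients ordered from the highest degree down
    coeffs = np.concatenate((-a[::-1], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary([0.3, -0.4]))  # sum of |alpha_i| = 0.7 < 1, stationary
print(is_stationary([0.6, 0.5]))   # sum of |alpha_i| = 1.1, not stationary
```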
Theorem 3 (Moving average representation of gbARMA processes). Let (X_t)_{t∈Z} be a stationary gbARMA(p,q) process with gbVAR(1) representation (Equation (23)). Then, X_t admits a moving average representation, where the product of the random matrices satisfies ∏_{j=0}^{k} A_{t−j} → 0_{(p+q)×(p+q)} in L^1 as k → ∞ and e_1 is the first unit vector.
The univariate moving average representation is obtained from the multivariate formula by multiplying with the first unit vector, because X_t = e_1^T Y_t. Considering the autocorrelation structure, Jacobs and Lewis (1983) and Weiß (2011) showed that the NDARMA(p,q) model fulfills a set of Yule-Walker type equations, which Möller and Weiß (2018) also derived for the GenDARMA class of categorical processes. The following result shows that this property is maintained for the gbARMA class.
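Since classical Yule-Walker equations hold for gbAR processes, the model parameters can be estimated by plugging sample autocorrelations into the Yule-Walker system and solving it. The following sketch illustrates this for a simulated gbAR(1) path; the function names are our own, and the simulation uses the multinomial recursion described earlier.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations rho(1), ..., rho(max_lag)."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(xc, xc)
    return np.array([np.dot(xc[k:], xc[:-k]) / denom for k in range(1, max_lag + 1)])

def yule_walker_gbar(x, p):
    """Estimate gbAR(p) parameters by solving the Yule-Walker system R alpha = r."""
    rho = sample_acf(x, p)
    R = np.eye(p)
    for i in range(p):
        for j in range(p):
            if i != j:
                R[i, j] = rho[abs(i - j) - 1]
    return np.linalg.solve(R, rho)

# Simulate a gbAR(1) path with alpha = -0.4 via the multinomial recursion
rng = np.random.default_rng(7)
alpha, pi, n = -0.4, 0.5, 20000
x = np.empty(n, dtype=int)
x[0] = rng.binomial(1, pi)
for t in range(1, n):
    a_plus, a_minus, b = rng.multinomial(
        1, [max(alpha, 0.0), max(-alpha, 0.0), 1.0 - abs(alpha)])
    x[t] = a_plus * x[t - 1] + a_minus * (1 - x[t - 1]) + b * rng.binomial(1, pi)

alpha_hat = yule_walker_gbar(x, 1)
print(np.round(alpha_hat, 3))  # close to the true alpha = -0.4
```

For p = 1 the system reduces to the familiar identity that the parameter equals the lag-1 autocorrelation.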
For the generalized binary ARMA model, formulas for the marginal, joint and transition probabilities can be calculated, extending the results from Lemma 1.
Lemma 2 (Marginal, joint, and transition probabilities of gbARMA processes). Let (X_t)_{t∈Z} be a stationary gbARMA(p,q) process. Then, closed-form expressions for the marginal, joint, and transition probabilities hold, extending the corresponding parts of Lemma 1. The flexibility of gbARMA models obtained by allowing for negative parameters also shows in the transition probabilities and in the joint and marginal distributions. Hence, more complex structures can be captured, since systematic changes in the error terms are allowed as well.
We conclude this section with an example of a gbARMA(1,1) model.
Example 3 (gbARMA(1,1) process). Let (X_t)_{t∈Z} be a stationary gbARMA(1,1) process. Then, the process follows the recursion X_t = a_t^(+,1) X_{t−1} + a_t^(−,1) (1 − X_{t−1}) + b_t^(0) e_t + b_t^(+,1) e_{t−1} + b_t^(−,1) (1 − e_{t−1}), where the selection variables follow a multinomial distribution with parameter vector P^{|·|}. Four sign combinations of the parameter pair (α^(1), β^(1)) are possible, and the corresponding model equations are obtained by the respective switching of the predecessors. Whereas β^(0) only takes positive values for identification purposes, the predecessors X_{t−1} and e_{t−1} are systematically switched if the corresponding model parameters are negative. From the stationarity assumption, we have |α^(1)| < 1, β^(0) ∈ (0, 1], and |β^(1)| ∈ [0, 1). The moving average representation of a stationary gbARMA(1,1) process consists of three parts. First, there is a sum over the terms a_{t−j}^(−,1) covering the potential case α^(1) < 0; this part accounts for choosing a predecessor and switching it. Since β^(0) is strictly positive, the second part is a sum over all error terms that enter without any modification. In the third sum, the random variable b_t^(−,1) appears and controls the case β^(1) < 0.
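A minimal simulation of the gbARMA(1,1) recursion may look as follows. The selection probabilities P^{|·|} and all variable names reflect our reading of the example, so this is a sketch rather than the authors' implementation.

```python
import numpy as np

def simulate_gbarma11(alpha1, beta0, beta1, pi, n, rng):
    """Simulate a gbARMA(1,1) path following the recursion of Example 3:
    X_t = a+_t X_{t-1} + a-_t (1 - X_{t-1}) + b0_t e_t + b+_t e_{t-1} + b-_t (1 - e_{t-1}),
    where exactly one of the selection variables equals one (a multinomial draw
    with probabilities P^{|.|}) and e_t ~ Bernoulli(pi)."""
    probs = [max(alpha1, 0.0), max(-alpha1, 0.0),
             beta0, max(beta1, 0.0), max(-beta1, 0.0)]
    assert abs(sum(probs) - 1.0) < 1e-12 and beta0 > 0.0
    x = np.empty(n, dtype=int)
    e_prev = rng.binomial(1, pi)
    x[0] = rng.binomial(1, pi)
    for t in range(1, n):
        e = rng.binomial(1, pi)
        a_p, a_m, b0, b_p, b_m = rng.multinomial(1, probs)
        x[t] = (a_p * x[t - 1] + a_m * (1 - x[t - 1])
                + b0 * e + b_p * e_prev + b_m * (1 - e_prev))
        e_prev = e
    return x

rng = np.random.default_rng(3)
x = simulate_gbarma11(alpha1=-0.3, beta0=0.5, beta1=-0.2, pi=0.5, n=10000, rng=rng)

# With negative alpha and beta parameters, the lag-1 autocorrelation is negative
xc = x - x.mean()
rho1 = np.dot(xc[1:], xc[:-1]) / np.dot(xc, xc)
print(round(float(rho1), 3))
```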

Conclusions
By extending the NDARMA model class of Jacobs and Lewis (1983) to allow for negative parameters on the binary state space, the generalized binary ARMA model remains parsimonious, but it becomes more flexible: negative model parameters, and hence negative dependence structures in the data, can be captured. The extension of the model to a larger parameter space enables applications to real data without as many restrictions as in the NDARMA model class. Although the extension leads to additional terms in the model equation, the Yule-Walker equations still provide a direct way to estimate the model parameters.
We discuss stationarity conditions for gbARMA models and derive the stationary solution. The resulting moving average representation contains additional terms compared to most MA(∞)-type representations; these terms control the switching of the states.
An illustration of the autocorrelation pairs (ρ(1), ρ(2)) of four different models of order 2 compares the dependence structures captured by the respective time series models. It reveals that the proposed gbARMA model can capture a wide range of negative and positive dependence structures. A second-order Markov chain is shown to capture only a slightly larger range of negative dependence structures than gbAR(2) models. Hence, by allowing for negative parameters, the proposed extension of the NDARMA model class leads to a new model class that captures a wide range of dependence structures in binary time series data while maintaining a parsimonious parametrization. Moreover, for small sample sizes, parsimonious gbAR models might turn out to be beneficial even in cases where the model is mis-specified, as they may provide a sufficient approximation to the true model.
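The attainable (ρ(1), ρ(2)) region of gbAR(2) models can be traced numerically from the classical Yule-Walker equations; the following sketch scans the stationarity region ∑ |α^(i)| < 1 stated earlier, and is our own illustration rather than the paper's figure.

```python
import numpy as np

def gbar2_acf_pair(a1, a2):
    """Autocorrelations (rho(1), rho(2)) of a gbAR(2) process from the
    classical Yule-Walker equations rho(k) = a1 rho(k-1) + a2 rho(k-2)."""
    rho1 = a1 / (1.0 - a2)
    rho2 = a1 * rho1 + a2
    return rho1, rho2

# Scan the admissible parameter region |a1| + |a2| < 1 (stationarity condition)
grid = np.linspace(-0.95, 0.95, 39)
pairs = [gbar2_acf_pair(a1, a2)
         for a1 in grid for a2 in grid if abs(a1) + abs(a2) < 1.0]
rho1s, rho2s = zip(*pairs)
print(min(rho1s), max(rho1s))  # both strongly negative and positive rho(1) occur
```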

Author Contributions:
Both authors contributed to the methodology, investigation, writing, and formal analysis. The software was written by L.R.; supervision and conceptualization were provided by C.J.
Funding: This research was financially supported by the Eliteprogramme for Postdocs of the Baden-Württemberg Stiftung.

Proof of Theorem 3. Since |A_{t−l}| =: A_{t−l}^{|·|} ∈ {0, 1} and the expectation E A_{t−l}^{|·|} =: A^{|·|} has only positive entries in (0, 1), the product term vanishes for k → ∞ and, as a result, we obtain the gbVMA(∞) representation.
For k ≥ 1, the corresponding equation is obtained; by defining φ_k := ϕ_k σ_e^2, the recursion of Theorem 4 follows.
Part (ii) of Theorem 2 is obtained directly by inserting the model equation and using the property that the multinomial draw sets exactly one entry of P_t equal to one and all others to zero.

Appendix A.4. Proof of Lemmas 1 and 2

Proof. (i) The conditional probability is an immediate consequence of the model equation and the multinomial distribution of the random variables a_t and b_t.
(ii) With the independence assumption on the error terms, the conditional probability without conditioning on the current error term follows by averaging Part (i) over the current error term.

(iii) Consider the probability that the time series is in state j ∈ {0, 1} at time point t and note that P(e_t = 1 − j) = 1 − P(e_t = j) and P(X_t = 1 − j) = 1 − P(X_t = j).
This yields ∑_{l=1}^{q} |β^(l)| ( 1{β^(l) ≥ 0} P(e_t = j) + 1{β^(l) < 0} P(e_t = 1 − j) ) + β^(0) P(e_t = j). Then, by rearranging the terms on the right-hand side, we get the claim.

(iv) Consider the joint probability of the error term and the time series at time point t. We get ∑_{l=1}^{q} |β^(l)| 1{β^(l) ≥ 0} P(e_{t−l} = i_0, e_t = j_0) + ∑_{l=1}^{q} |β^(l)| 1{β^(l) < 0} P(e_{t−l} = i_0, e_t = j_0). By inserting Part (iii) into the equation above and using the properties of the parameters, the joint distribution of the time series and the error term follows.