Identifying the Most Relevant Lag with Runs

In this paper, we propose a nonparametric statistical tool to identify the most relevant lag in the model description of a time series; we also show that it can be used for model identification. The statistic is based on the number of runs obtained when the time series is symbolized according to its empirical quantiles. With a Monte Carlo simulation, we show the size and power performance of the new test statistic under linear and nonlinear data generating processes. From a theoretical point of view, this is the first time that symbolic analysis and runs are proposed for identifying characteristic lags and for helping in the identification of univariate time series models. From a more applied point of view, the results show the power and competitiveness of the proposed tool with respect to other techniques, without presuming or specifying a model.


Introduction
In this paper, we are particularly interested in providing new statistical tools that help in the process of modelling univariate time series. We focus on the selection of the appropriate time lags when the data faced by the modeller may potentially come from either a linear or a nonlinear dynamic process. Needless to say, a correct estimate of the lag is essential in forecasting, basically because the introduction of delayed information into dynamic models significantly changes their asymptotic properties. Traditionally, autocorrelation and partial autocorrelation coefficients have been used by empirical modellers to specify the appropriate delays. However, it is well established [1] that processes with zero autocorrelation can still exhibit high-order or nonlinear time dependence. This is the case for some bilinear processes and even for purely deterministic chaotic models, among others. In general, autocorrelation-based procedures may be misleading for nonlinear models, may fail to detect important nonlinear relationships present in the data, and are therefore of limited utility in detecting appropriate time delays (lags), especially in those scenarios where nonlinear phenomena are more the rule than the exception. The relevance of nonlinear time-dependent processes in science and the social sciences, as well as in macroeconomics and finance, is well established. However, statistical tools that help to specify which lag(s) to use in a nonlinear description of an observed time series are scarce.
From a statistical point of view, this situation has motivated the development of tests for serial independence (see [2] for a review) with statistical power against alternative hypotheses that exhibit general types of dependence. The vast majority of these statistical tests are of a nonparametric nature, hence trying to avoid restrictive assumptions on the marginal distributions of the model. However, these tests are not designed for selecting relevant lags. This partly explains the relative scarcity of nonparametric techniques for investigating lag dependence regardless of the linear or nonlinear nature of the process, which is an aspect that is unknown in most practical cases. Some notable exceptions to this relative scarcity are [3][4][5][6]. A common characteristic of most of these techniques is the use of entropy-based measures to identify the correct lag. In particular, in [5,6], the use of permutation entropy, evaluated at several delay times, is theoretically motivated and then applied to identify the characteristic lag of the generating system from a time series. Interesting physical applications of this approach are [7,8]. To complete the paper, the newly proposed technique is compared with the widely applied autocorrelation function and with recent techniques based on permutation entropy.
In this paper, we construct a new nonparametric runs statistic, based on symbolic analysis, which estimates the lag that best describes a time series sample. The versatile nature of runs tests is well known in the statistical literature, as they have been used to analyze independence, symmetries, randomness, etc. Furthermore, symbolic analysis is a field of increasing interest in several scientific disciplines (see [9]). It has foundations in information theory and in the theory of dynamical systems. For example, properties of symbolic encodings are central to the theory of communication [10]. Furthermore, there is a well-established mathematical discipline, namely symbolic dynamics, that studies the behavior of dynamical systems. This discipline started in 1898 with the pioneering work of Hadamard, who developed a symbolic description of sequences of geodesic flows; it was later extended by [11], who coined the name symbolic dynamics. [12] showed that a complete description of the behavior of a dynamical system can be captured in terms of symbols. This observation is crucial for understanding this paper, because important characteristics of a random variable can also be captured by analyzing the symbols derived from it.
The paper finally shows that the new approach can be useful for model identification, and it is applied to a real time series, namely, the New York Stock Exchange index.

Definitions and Notation
Let {X_i}_{i∈I} be a time series. Assume first that {X_i}_{i∈I} is a sequence of categorical data with q categories. Denote by n_k the number of elements of the k-th category. Therefore

n_1 + n_2 + ··· + n_q = n,

where n = |I| is the cardinality of the set of time indexes I. Under this setting, we define a run as a sequence of consecutive observations of the same category.
In the case of quantitative (continuous or discrete) data, we encode the sequence {X_i}_{i∈I} into q different categories, for q a positive integer, in the following manner. Denote by Q_k the empirical quantile of order k/q of {X_i}_{i∈I}, for k = 1, 2, ..., q − 1. We then encode the sequence {X_i}_{i∈I} as the sequence {Z_i}_{i∈I} defined by

Z_i = k  if and only if  Q_{k−1} < X_i ≤ Q_k,    (1)

with the conventions Q_0 = −∞ and Q_q = +∞. A run is obtained by encoding the time series {X_i}_{i∈I} with a finite set of symbols Γ; a sequence of consecutive symbols of the same type is then called a run. For the symbolization given in Equation (1), the set of symbols is Γ = {1, 2, ..., q}.
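As a quick illustration, the quantile symbolization of Equation (1) can be sketched in a few lines of Python. The helper name `symbolize` and the use of NumPy are our own choices for this sketch, not part of the paper.

```python
import numpy as np

def symbolize(x, q=3):
    """Encode a series into q symbols via its empirical quantiles.

    Illustrative implementation of Equation (1): Z_i = k exactly
    when Q_{k-1} < X_i <= Q_k, where Q_k is the k/q empirical
    quantile of the series.
    """
    x = np.asarray(x, dtype=float)
    # Empirical quantiles Q_1, ..., Q_{q-1} at levels k/q.
    cuts = np.quantile(x, [k / q for k in range(1, q)])
    # side="left" sends an observation equal to Q_k into symbol k,
    # matching the "<= Q_k" convention of Equation (1).
    return np.searchsorted(cuts, x, side="left") + 1
```

For instance, `symbolize(range(9), q=3)` splits the nine values into three equal tercile groups, yielding the symbol sequence 1,1,1,2,2,2,3,3,3.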

Constructing the Statistic
The classical runs test is defined for q = 2, where only two categories or symbols are used. Here we consider the multinomial scenario with q > 2. For completeness, we give the construction of the runs test for i.i.d. (independently and identically distributed) data in the multinomial case, as developed in [13].
Let MR be the random variable counting the number of runs in {X_i}_{i∈I} (if this sequence is not categorical, we use the symbolization procedure given by Equation (1)). Define the following indicator function:

I_j = 1 if Z_j ≠ Z_{j−1}, and I_j = 0 otherwise,

for j = 2, 3, ..., n. The variable I_j is a Bernoulli variable B(p) with probability of success

p = 1 − Σ_{k=1}^{q} p_k²,

where p_k = n_k/n is the relative frequency of symbol k. Then the statistic MR_q can be written as

MR_q = 1 + Σ_{j=2}^{n} I_j,

and its expected value is

E[MR_q] = 1 + (n − 1)(1 − Σ_{k=1}^{q} p_k²).

The estimation of the variance of MR_q is more complicated (see [13] for a detailed explanation of the computation). When the lag order of the underlying process is known, say p, one can consider the following p subsampled time series:

Γ_j = {X_j, X_{j+p}, X_{j+2p}, ...},  for j = 1, 2, ..., p.

For each of the time series Γ_j, one can compute the normalized statistic

Υ^j_q = (MR_q(Γ_j) − E[MR_q]) / √(Var(MR_q)).

Under the null of i.i.d. data, the statistic Υ^j_q is asymptotically normally distributed (see [13]). Notice that if p is the most relevant lag in the underlying data generating process, then the runs statistic MR measured on each Γ_j will differ from its expected value more than for any other lag. We then define

Λ^p_q = |Σ_{j=1}^{p} Υ^j_q| / √p,

that is, the absolute value of the sum of the statistics Υ^j_q for j = 1, 2, ..., p, divided by the square root of p. If the time series {X_i}_{i∈I} is i.i.d., and therefore no relevant lags are present, the distribution of the statistic Λ^p_q is the folded standard normal distribution, and hence its expected value is E[Λ^p_q] = √(2/π) ≈ 0.7979. Hence, if p_0 is the most relevant lag describing the dynamics of the underlying data generating process, then Λ^{p_0}_q = max{Λ^p_q | p ∈ N}, and Λ^{p_0}_q has to be greater than √(2/π).
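A minimal sketch of the runs count MR_q and its expected value under the i.i.d. null, following the formulas above; the function names are illustrative only.

```python
import numpy as np

def runs_count(z):
    """MR: number of runs, i.e. 1 plus the number of positions j
    where the symbol changes (Z_j != Z_{j-1})."""
    z = np.asarray(z)
    return 1 + int(np.sum(z[1:] != z[:-1]))

def expected_runs(z):
    """E[MR_q] = 1 + (n - 1) * (1 - sum_k p_k^2) under the i.i.d.
    null, with p_k the observed relative frequency of symbol k."""
    z = np.asarray(z)
    _, counts = np.unique(z, return_counts=True)
    p = counts / len(z)
    return 1.0 + (len(z) - 1) * (1.0 - float(np.sum(p ** 2)))
```

For example, the symbol sequence (1, 1, 2, 2, 3) contains 3 runs, while its expected number of runs under the i.i.d. null with the observed frequencies (2/5, 2/5, 1/5) is 1 + 4 · (1 − 0.36) = 3.56.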

Monte Carlo Simulation Experiments
In order to show the statistical power performance of the Λ^p_q statistic under different scenarios, we have considered (and therefore simulated) the following data generating processes (DGPs), chosen for their rich linear and nonlinear variety. The models are the following: We have considered mainly nonlinear models, but also some linear ones, in order to study and compare the procedure with other statistical tools that are commonly used for model specification. Model 1 is a linear process, while Models 2-6 are nonlinear. DGP 1 is an AR(1) autoregressive process with decaying memory at lag order 1, so the procedure should detect (select) p = 1. DGP 2 is a nonlinear autoregressive model of order 1. DGP 3 is a nonlinear moving average process of order 2, and the statistical procedure is therefore expected to select p = 2. DGP 4 is a bilinear process with autoregressive and white-noise components of orders 2 and 1, respectively. Conditional heteroskedastic models (i.e., those with structure in the conditional variance) are commonly employed in financial applications (for example, for time series that show periods of high and low market uncertainty), and accordingly it is interesting to know how the procedure behaves under this kind of nonlinearity in the conditional variance, so we have included DGP 5. Finally, a purely deterministic model (DGP 6) and an independent and identically distributed stochastic process (DGP 7) are incorporated, as they represent two models of opposite nature.
To evaluate the performance of the nonparametric method in finite samples, we compute 1000 Monte Carlo replications of each model, and we consider 6 lags (p). In general, experiments using large data sets are desirable; however, situations do occur in which the number of available data points is small. Statistical techniques, especially model-free ones such as this, are very sensitive to the number of observations. For this reason, in the Monte Carlo experiment we study the effect of small sample sizes on the outcome of the statistical procedure. We present the results for several different sample sizes, namely T = 120, 360, 500, 1000, 5000 and 10,000. In order to estimate Λ^p_q, it is necessary to fix the number of quantiles used to symbolize the time series under consideration; in this case we select q = 3, so only 3 symbols are used to obtain a conclusion about the dynamic structure of the time series under study.
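The procedure for a single realization can be sketched as follows. Since the closed-form variance of MR_q from [13] is not reproduced here, the null mean and variance of the runs count are estimated by shuffling each subseries; this shuffle-based standardization is a stand-in for the paper's exact formula, and all parameter values (seed, the 0.7 coefficient, number of shuffles) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def symbolize(x, q=3):
    # Quantile symbolization of Equation (1).
    cuts = np.quantile(x, [k / q for k in range(1, q)])
    return np.searchsorted(cuts, x, side="left") + 1

def n_runs(z):
    return 1 + int(np.sum(z[1:] != z[:-1]))

def lambda_stat(x, p, q=3, n_null=300):
    """Sketch of Lambda^p_q: standardize the runs count of each
    subseries Gamma_j = (x_j, x_{j+p}, ...) and combine the p
    standardized values. Null moments of the runs count are
    estimated by shuffling (a stand-in for the closed form of [13])."""
    x = np.asarray(x, dtype=float)
    ups = []
    for j in range(p):
        z = symbolize(x[j::p], q)
        null = np.array([n_runs(rng.permutation(z)) for _ in range(n_null)])
        ups.append((n_runs(z) - null.mean()) / null.std())
    return abs(sum(ups)) / np.sqrt(p)

# AR(1) process X_t = 0.7 X_{t-1} + eps_t: lag 1 should dominate.
n = 1000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t]

scores = {p: lambda_stat(x, p) for p in range(1, 7)}
best = max(scores, key=scores.get)
```

On such a realization, `best` should come out as 1, with `scores[1]` many standard deviations above the i.i.d. benchmark of √(2/π) ≈ 0.7979.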
As mentioned in the introduction, we also compare the proposed method with the widely applied autocorrelation functions and with the permutation entropy based technique used in [5], which is related to [6] in the sense that both papers look for the lag that minimizes the permutation entropy. In order to apply this procedure, we also fix the embedding reconstruction dimension at m = 3, and we refer to it as h_3.
Tables 1-7 show the percentage of times that the Λ^p_q statistic, the autocorrelation function (ACF) and the partial autocorrelation function (PAF) estimate the lag parameter p in 1000 Monte Carlo replications. For sample sizes T = 1000 (or larger), Λ^p_3 always selects the correct lag for the linear autoregressive process (DGP 1); the same can be said for the ACF and PAF. As the sample size is reduced (to T = 360), the Λ^p_q statistic reduces its statistical power to 93.2%, while the autocorrelation functions only deteriorate to 99.7%. This is to be expected because autocorrelation functions are ideal for linear processes: even for the smallest sample size, their empirical performance is high (around 90%). According to these results, if the underlying process is linear, the researcher may use either the autocorrelation functions or Λ^p_q for sample sizes above 500; it is not advisable to use Λ^p_q if the sample is below 360 observations. Similarly, for DGP 2, which is a nonlinear variation of the autoregressive DGP 1, both approaches perform extremely well for sample sizes larger than 1000 observations. For these two processes (DGP 1 and DGP 2), it can be observed that Λ^p_3 outperforms h_3. On the other hand, if the lag dependence comes from a nonlinear moving average process, we can observe clear differences in favour of the symbolic-runs proposal: the results for DGP 3 show, for large data sets (5000 and 10,000), that Λ^p_3 is ideal, as it detects the correct lag 100% of the time, while the autocorrelation functions correctly estimate the lag only 30% of the time regardless of the sample size, although a better performance is obtained with h_3. With DGP 4 we study a combination of lag dependence in an autoregressive component of (dominant) order 2 and moving average lag dependence of order 1. Clearly, the results for this bilinear process show that Λ^p_3 captures the correct lag, while the ACF and PAF do not. The empirical evidence is in favor of Λ^p_3 when compared
with h_3. As commented earlier, if the delay structure is introduced via the second conditional moment of the stochastic process (the variance), a practitioner would like a statistical procedure that can also detect the correct lag. This is what we study with DGP 5. The proposed statistic is very effective in detecting the correct lag for T = 1000 or higher. However, it is remarkable that the autocorrelation functions correctly estimate the lag less than 50% of the time for all sample sizes. In comparison with the permutation entropy based technique h_3, Λ^p_3 again outperforms it. Finally, the last two models (DGP 6 and DGP 7) are also illuminating. The first one is a purely deterministic logistic model, so no stochastic terms enter it, and the second one is a purely normal distribution. Autocorrelation-based approaches perform poorly in detecting the correct lag (i.e., lag 1) for the logistic model. Further, the results of the ACF and PAF for the normal samples are statistically indistinguishable from those obtained for the logistic equation. In contrast, the Λ^p_3 procedure detects, even for small sample sizes, that there is a dependence structure and that it comes from lag 1. Interestingly, for this purely deterministic process, the entropy-based procedure is superior in these terms, hinting that permutation entropy is very effective when there are no noise terms.
The results provided for DGP 7 allow us to understand that failing to detect the most relevant lag parameter(s) is equivalent to finding that all considered lags are equally important, that is to say, δ = 1/τ, where τ is the number of lags the user has considered in the study. In our Monte Carlo experiment, τ = 6 and then δ = 16.66667%. Therefore, for a lag parameter to be detected, the percentage of times Λ^p_3 identifies that lag should be above

γ = δ + z_α √(δ(1 − δ)/n),

for a nominal level α, where n is the number of Monte Carlo replications and z_α is the quantile 1 − α of the standard normal distribution N(0, 1). In our experiment, γ = 18.599% for a significance level of α = 0.05.
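The detection threshold can be checked numerically. The computation below applies the one-sided binomial normal approximation described above, with z_α rounded to 1.645; the variable names are our own.

```python
import math

# With tau = 6 candidate lags, a lag selected uniformly at random is
# picked with probability delta = 1/tau. A lag is "detected" when its
# selection frequency over n replications exceeds
#   gamma = delta + z_alpha * sqrt(delta * (1 - delta) / n).
tau, n, z_alpha = 6, 1000, 1.645   # z_alpha ~ 95th percentile of N(0,1)
delta = 1 / tau
gamma = delta + z_alpha * math.sqrt(delta * (1 - delta) / n)
# gamma comes out at roughly 0.186, i.e. about the 18.6% threshold
# reported in the text (small differences come from rounding z_alpha).
```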

Model Identification
In the previous sections we have shown, first theoretically and then empirically, that the proposed method correctly estimates the lag that a data analyst might use to model time series data. Now we are concerned with trying to identify the appropriate generating model with the help of Λ^p_q evaluated at several lags. For instance, we are particularly interested in studying the behavior of Λ^p_q when the data generating process is an autoregressive linear model (an AR(p) model) and how it behaves for a moving average model (an MA(p) model). If the statistic is able to discriminate between models, it would not only be useful for selecting lags but also for distinguishing between models of very different nature. To evaluate the performance of Λ^p_q for identifying models, the following stochastic models have been studied: The shared characteristic of these models is that they all have a linear conditional mean. However, some are autoregressive, some are moving averages of external shocks, and some are mixtures of linear processes. These models are well known in univariate time series analysis, so for the statistic to be of some utility a clear detection of the lag and of the model is expected. Autoregressive models have memory, while moving average models do not; accordingly, this essential difference should be detected. We compute the average value of the Λ^p_3 statistic for sample sizes n = 120, 360, 500, 1000, 5000, 10,000 over 1000 Monte Carlo simulations for the seven models. We also do the same for a benchmark normal N(0,1) model, which is an i.i.d. process and allows us to check whether the expected value E[Λ^p_q] = √(2/π) ≈ 0.7979 is achieved empirically. Averages are given in Figure 1, which reports the results for the largest sample size, while the behavior for the remaining sample sizes is given in Figures 4-8, which can be found in the appendix. For models AR(1) and AR(2) the statistic clearly shows an exponential decrease in Λ^p_3, indicating (a) that the correct lags are
1 and 2, respectively; (b) that the data generating process has memory with respect to the true lag, as each successive lag is less relevant than the preceding one; and (c) that this occurs for all sample sizes. Interestingly, observation (b) sharply contrasts with classical techniques, which do not capture this salient empirical fact. For models MA(1) and MA(2), our statistic also performs as expected for identifying the model: the true lag is detected, the basic memoryless property is clearly observed for all lags, and therefore the model cannot be confused with an AR(p) model. The results for mixed models are also of interest. Regarding the ARMA(1,1), the statistic reaches its maximum at the correct lag, namely 1, and then decays exponentially to zero; given the MA structure at lag 1, the statistic's decay from lag 1 to lag 2 is more prominent than in the case of an AR(1). This can also be observed for the ARMA(1,2) model: the two relevant lags, namely 1 and 2, are clearly detected; the statistic does not decay very fast at lag 2, given its MA(2) structure, but the decay reappears for p > 2. Finally, in the MA(2;4) model we have considered a moving average linear process with different weights in the two relevant lags, so we expect, and observe, that the proposed statistic first estimates the correct lag and then determines its relevance. To complete the analysis, the results are compared with two nonlinear models previously studied (DGP 2 and DGP 3), which are nonlinear counterparts of the AR and MA models. It can be observed that, for the autoregressive models, the exponentially decaying behavior is much faster in the nonlinear case than in the linear case. For the moving average models, the values of the statistic, while statistically significant, are lower than in the linear MA counterpart.
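The qualitative difference between the AR and MA profiles described above can be reproduced in a small simulation. As before, shuffle-estimated null moments stand in for the closed-form ones of [13], and all parameter choices (seed, coefficients 0.7 and 0.8, sample size) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def symbolize(x, q=3):
    # Quantile symbolization of Equation (1).
    cuts = np.quantile(x, [k / q for k in range(1, q)])
    return np.searchsorted(cuts, x, side="left") + 1

def n_runs(z):
    return 1 + int(np.sum(z[1:] != z[:-1]))

def lambda_profile(x, max_p=4, q=3, n_null=300):
    """Lambda^p_q for p = 1..max_p, with shuffle-estimated null
    moments of the runs count (stand-in for the closed form of [13])."""
    x = np.asarray(x, dtype=float)
    out = {}
    for p in range(1, max_p + 1):
        ups = []
        for j in range(p):
            z = symbolize(x[j::p], q)
            null = np.array([n_runs(rng.permutation(z))
                             for _ in range(n_null)])
            ups.append((n_runs(z) - null.mean()) / null.std())
        out[p] = abs(sum(ups)) / np.sqrt(p)
    return out

n = 2000
eps = rng.standard_normal(n + 1)
# AR(1): memory decays geometrically, so the profile decreases smoothly.
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.7 * ar[t - 1] + eps[t]
# MA(1): dependence lives only at lag 1, so the profile collapses at p >= 2.
ma = eps[1:] + 0.8 * eps[:-1]

ar_prof, ma_prof = lambda_profile(ar), lambda_profile(ma)
```

Both profiles should peak at p = 1, but for the MA(1) series the statistic at p = 2 drops toward the i.i.d. benchmark, whereas for the AR(1) series it remains clearly elevated; this is the memory/memoryless signature discussed above.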
We now illustrate how our tools can be used to help in the modelling process of real data. To this end, we have considered a well-known empirical time series, namely, the returns obtained from the closing values of the daily New York Stock Exchange (NYSE) index from 2000 to 2008 (Figure 2).
Given the series of returns, our proposal consists of using the tools previously presented. To this end, we compute Λ^p_3 at several lags for the selected time series. The results are given in Figure 3. According to the results, our statistical procedure identifies lags 2 and 4, so the modeller is recommended to use these two lags. Regarding the identification of the underlying model, in view of Figure 3 when compared with Figure 2, the model does not seem to be of an autoregressive linear nature; it seems closer to a moving average process in which lags 2 and 4 play a relevant role.

Conclusions
In this paper, we have presented a statistical procedure, based on the distribution of symbols (the number of runs), to estimate the relevant lag of a dynamic generating process from which the researcher only has one observed sample. This is the first time runs have been used for detecting lag structure, and it is also the first time they have been used for model identification. The technique shows several appealing advantages: (1) it is model independent, so the end-user can easily apply it without assuming or estimating a model; (2) it can be used for stochastic processes of linear or nonlinear nature, because in both studied scenarios the procedure is very competitive and robust; (3) in the studied models, it detects the correct lag even with a relatively small number of observations, which facilitates its use in sciences where samples are small; (4) when compared with the standard autocorrelation and/or partial autocorrelation functions, the empirical results are undoubtedly in favor of the new statistical tool, and when compared with permutation entropy based procedures, it generally has more statistical power; (5) particularly interesting for financial data analysis, it shows an extraordinary empirical behavior when the lag structure is in the second conditional moment of the data generating process; and (6) it can be used for identifying not only lags but also linear models. Points (1) to (6) make the new statistical tool general and widely applicable for data analysts.

A. Appendix

Figure 1 .
Figure 1. Mean value of the Λ^p_3 statistic as a function of the lag time (for p = 1, 2, ..., 6). The sample size for each realization is fixed at n = 10,000. The number of Monte Carlo simulations is 1000 for each model. Blue bars show the average Λ^p_3 for each model. Red bars refer to the benchmark i.i.d. process.

Figure 3 .
Figure 3. The Λ^p_3 function for NYSE daily returns (blue) and the expected value of Λ^p_3 in the case of no relevant lag (red).

Figure 4 .
Figure 4. Mean value of the Λ^p_3 statistic as a function of the lag time (for p = 1, 2, ..., 6). The sample size for each realization is fixed at n = 120. The number of Monte Carlo simulations is 1000 for each model. Blue bars show the average Λ^p_3 for each model. Red bars refer to the benchmark standard normal process.

Figure 5 .
Figure 5. Mean value of the Λ^p_3 statistic as a function of the lag time (for p = 1, 2, ..., 6). The sample size for each realization is fixed at n = 360. The number of Monte Carlo simulations is 1000 for each model. Blue bars show the average Λ^p_3 for each model. Red bars refer to the benchmark standard normal process.

Figure 6 .
Figure 6. Mean value of the Λ^p_3 statistic as a function of the lag time (for p = 1, 2, ..., 6). The sample size for each realization is fixed at n = 500. The number of Monte Carlo simulations is 1000 for each model. Blue bars show the average Λ^p_3 for each model. Red bars refer to the benchmark standard normal process.

Figure 7 .
Figure 7. Mean value of the Λ^p_3 statistic as a function of the lag time (for p = 1, 2, ..., 6). The sample size for each realization is fixed at n = 1000. The number of Monte Carlo simulations is 1000 for each model. Blue bars show the average Λ^p_3 for each model. Red bars refer to the benchmark standard normal process.

Figure 8 .
Figure 8. Mean value of the Λ^p_3 statistic as a function of the lag time (for p = 1, 2, ..., 6). The sample size for each realization is fixed at n = 5000. The number of Monte Carlo simulations is 1000 for each model. Blue bars show the average Λ^p_3 for each model. Red bars refer to the benchmark standard normal process.

Table 5 .
Comparison of Λ^p_3 against ACF, PAF and h_3 for DGP 5. Percentage of times each lag parameter has been detected in 1000 Monte Carlo simulations. T stands for sample size and p for the considered lags.

Table 6 .
Comparison of Λ^p_3 against ACF, PAF and h_3 for DGP 6: X_t = 4X_{t−1}(1 − X_{t−1}). Percentage of times each lag parameter has been detected in 1000 Monte Carlo simulations. T stands for sample size and p for the considered lags.