Predicting Random Walks and a Data-Splitting Prediction Region

Abstract: Perhaps the first nonparametric, asymptotically optimal prediction intervals are provided for univariate random walks, with applications to renewal processes. Perhaps the first nonparametric prediction regions are introduced for vector-valued random walks. This paper further derives nonparametric data-splitting prediction regions, which are underpinned by very simple theory. Some of the prediction regions can be used when the data distribution does not have first moments, and some can be used for high-dimensional data, where the number of predictors is larger than the sample size. The prediction regions can make use of many estimators of multivariate location and dispersion.


Introduction
This paper suggests prediction intervals and regions for univariate and vector-valued random walks. This section reviews random walks, renewal processes, nonparametric prediction intervals, and nonparametric prediction regions. Section 2.1 presents new nonparametric data-splitting regions.
The random walk can be written as Y_t = Y_0 + ∑_{i=1}^t e_i, where Y_0 = y_0 is often a constant. A stochastic process {N(t) : t ≥ 0} is a counting process if N(t) counts the total number of events that occurred in the time interval (0, t]. Let e_n be the interarrival time or waiting time between the (n − 1)th and nth events counted by the process, n ≥ 1. If the nonnegative e_i are iid with P(e_i = 0) < 1, then {N(t), t ≥ 0} is a renewal process. Let Y_n = ∑_{i=1}^n e_i denote the time of occurrence of the nth event, i.e., the waiting time until the nth event. Then Y_n is a random walk with Y_0 = y_0 = 0. Let the expected value E(e_i) = µ > 0. Then E(Y_n) = nµ, and the variance V(Y_n) = nV(e_i) if V(e_i) exists. A Poisson process with rate λ is a renewal process where the e_i are iid exponential EXP(λ) with E(e_i) = 1/λ. See Ross [1] for the Poisson process and the renewal process. Given Y_1, ..., Y_n, then n events have occurred, and the 1-step-ahead PI is for the time until the next event, the 2-step-ahead PI is for the time until the next 2 events, and the h-step-ahead PI is for the time until the next h events.
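As a small illustration (our own R sketch, not code from the paper), a Poisson process can be simulated as a random walk of iid exponential interarrival times:

set.seed(1)
lambda <- 2; n <- 1000
e <- rexp(n, rate = lambda)   # iid EXP(lambda) interarrival times, E(e_i) = 1/lambda
Y <- cumsum(e)                # Y_n = time of the nth event: a random walk with Y_0 = 0
c(Y[n], n / lambda)           # Y_n should be near E(Y_n) = n * mu = n / lambda
sum(Y <= 100)                 # N(100) counts the events in (0, 100]; near 100 * lambda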
For forecasting, we predict the test data Y_{n+1}, ..., Y_{n+L} using the past training data Y_1, ..., Y_n. A large sample 100(1 − δ)% PI is an interval [L_n, U_n] such that the probability that the future value lies in [L_n, U_n] converges to at least 1 − δ as n → ∞. A large sample 100(1 − δ)% PI is asymptotically optimal if it has the shortest asymptotic length: the length of [L_n, U_n] converges to U_s − L_s as n → ∞, where [L_s, U_s] is the population shorth: the shortest interval covering at least 100(1 − δ)% of the mass.
The shorth estimator of the population shorth is defined as follows. If the data are Z_1, ..., Z_n, let Z_{(1)} ≤ ··· ≤ Z_{(n)} be the order statistics. Let ⌈x⌉ denote the smallest integer greater than or equal to x (e.g., ⌈7.7⌉ = 8). Consider the intervals that contain c cases: [Z_{(1)}, Z_{(c)}], [Z_{(2)}, Z_{(c+1)}], ..., [Z_{(n−c+1)}, Z_{(n)}]. Compute the lengths Z_{(c)} − Z_{(1)}, Z_{(c+1)} − Z_{(2)}, ..., Z_{(n)} − Z_{(n−c+1)}. Then shorth(c) = [Z_{(d)}, Z_{(d+c−1)}] is the interval with the shortest length.
For a large sample 100(1 − δ)% PI, the nominal coverage is 100(1 − δ)%. Undercoverage occurs if the actual coverage is below the nominal coverage. For example, if the actual coverage is 0.93 when n = 100, then for a large sample 95% PI, the undercoverage is 0.02 = 2%. Suppose the data Z_1, ..., Z_n are iid, and a large sample 100(1 − δ)% PI is desired for a future value Z_f. The shorth(c) interval is a large sample 100(1 − δ)% PI if c/n → 1 − δ as n → ∞, and it often has the asymptotically shortest length. Frey [2] showed that for large nδ and iid data, the shorth(k_n = ⌈n(1 − δ)⌉) prediction interval has maximum undercoverage ≈ 1.12√(δ/n); hence, the large sample 100(1 − δ)% shorth PI (1) uses the correction factor c = min(n, ⌈n[1 − δ + 1.12√(δ/n)]⌉). The correction factor also increases the length of PI (1). The shorth PI (1) often has good coverage for n ≥ 50 and 0.05 ≤ δ ≤ 0.1, but the convergence of U_n − L_n to the population shorth length U_s − L_s can be quite slow. Under regularity conditions, Grübel [3] showed that for iid data, the length and center of the shorth(k_n) interval are √n-consistent and n^{1/3}-consistent estimators of the length and center of the population shorth interval, respectively. Einmahl and Mason [4] provide large sample theory for the shorth under slightly milder conditions than Grübel [3]. Chen and Shao [5] show that the shorth PI converges to the population shorth under mild conditions for ergodic data.
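A minimal R sketch (ours, not code from the paper) of the shorth(c) PI (1) with the Frey correction factor; the function name shorth_pi is our own:

shorth_pi <- function(z, delta = 0.05) {
  n <- length(z)
  # correction factor: c = min(n, ceiling(n * (1 - delta + 1.12 * sqrt(delta / n))))
  cc <- min(n, ceiling(n * (1 - delta + 1.12 * sqrt(delta / n))))
  zs <- sort(z)
  lens <- zs[cc:n] - zs[1:(n - cc + 1)]  # lengths of all intervals containing cc cases
  d <- which.min(lens)                   # left endpoint of the shortest such interval
  c(L = zs[d], U = zs[d + cc - 1])
}
set.seed(1)
shorth_pi(rnorm(1000))  # 95% PI for iid N(0,1) data; roughly (-1.96, 1.96)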
To describe the Olive [6] nonparametric prediction region, Mahalanobis distances will be useful. Let the p × 1 column vector T be a multivariate location estimator, and let the p × p symmetric positive definite matrix C be a dispersion estimator. Then the ith squared sample Mahalanobis distance is the scalar D_i^2 = D_i^2(T, C) = (w_i − T)^T C^{−1} (w_i − T) for each observation w_i, where i = 1, ..., n. Notice that the Euclidean distance of w_i from the estimate of center T is D_i(T, I_p), where I_p is the p × p identity matrix. The classical Mahalanobis distance D_i uses (T, C) = (w̄, S), the sample mean and sample covariance matrix, where w̄ = (1/n) ∑_{i=1}^n w_i and S = (1/(n − 1)) ∑_{i=1}^n (w_i − w̄)(w_i − w̄)^T. Consider predicting a future test value w_f, given past training data w_1, ..., w_n, where w_1, ..., w_n, w_f are iid. Prediction intervals are a special case of prediction regions with p = 1, so the w_i are random variables.
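In R, the squared Mahalanobis distances can be computed with the base function mahalanobis(); a brief sketch (ours):

w <- matrix(rnorm(100 * 3), ncol = 3)                     # n = 100 iid cases, p = 3
D2 <- mahalanobis(w, center = colMeans(w), cov = cov(w))  # classical D_i^2(wbar, S)
Deuc <- sqrt(mahalanobis(w, colMeans(w), diag(3)))        # D_i(T, I_p): Euclidean distances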
A large sample 100(1 − δ)% prediction region is a set A_n such that P(w_f ∈ A_n) ≥ 1 − δ asymptotically. A prediction region is asymptotically optimal if its volume converges in probability to the volume of the minimum volume covering region or the highest density region of the distribution of w_f. Like prediction intervals, prediction regions often need correction factors. For iid data from a distribution with a p × p nonsingular covariance matrix, it was found that the simulated maximum undercoverage of the prediction region (5) without the correction factor was about 0.05 when n = 20p. Hence, the correction factor (4) is used to provide better coverage for small n. Let

q_n = min(1 − δ + 0.05, 1 − δ + p/n) for δ > 0.1, and q_n = min(1 − δ/2, 1 − δ + 10δp/n), otherwise.   (4)

If 1 − δ < 0.999 and q_n < 1 − δ + 0.001, set q_n = 1 − δ. Let D^2_{(U_n)} be the U_n th order statistic of the D_i^2 = D_i^2(w̄, S), where U_n = ⌈n q_n⌉. The large sample 100(1 − δ)% nonparametric prediction region for a future value w_f given iid data w_1, ..., w_n is

{w : D_w^2(w̄, S) ≤ D^2_{(U_n)}}.   (5)

The nonparametric prediction region is a large sample prediction region if the iid w_i have a nonsingular covariance matrix, and it is asymptotically optimal for a large class of elliptically contoured distributions, including multivariate normal distributions with nonsingular covariance matrices. Regions with smaller asymptotic volumes can exist if the distribution is not elliptically contoured. From Olive [7], simulated coverage was often near the nominal coverage for n ≥ 20p, but simulated volumes behaved better for n ≥ 50p. The shorth PIs do not need the mean or variance of the e_t to exist.
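A minimal R sketch (ours) of the nonparametric prediction region (5) with the correction factor (4); the function name nppredreg is our own, and the region is returned as its center, covariance, cutoff, and a membership test:

nppredreg <- function(w, delta = 0.05) {
  n <- nrow(w); p <- ncol(w)
  # correction factor (4)
  if (delta > 0.1) {
    qn <- min(1 - delta + 0.05, 1 - delta + p / n)
  } else {
    qn <- min(1 - delta / 2, 1 - delta + 10 * delta * p / n)
  }
  if (1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta
  center <- colMeans(w); S <- cov(w)
  D2 <- mahalanobis(w, center, S)
  cutoff <- sort(D2)[ceiling(n * qn)]  # D^2_(Un) with Un = ceiling(n * qn)
  list(center = center, cov = S, cutoff = cutoff,
       inregion = function(wf) mahalanobis(rbind(wf), center, S) <= cutoff)
}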
Prediction regions have some nice applications besides prediction. Applying a prediction region to data generated from a posterior distribution provides an estimated credible region for Bayesian statistics. See Chen and Shao [5]. Certain prediction regions applied to a bootstrap sample result in a confidence region. See Rajapaksha and Olive [16], Rajapaksha [17], and Olive [18]. Mykland [19] converts prediction regions into investment strategies.
New data-splitting prediction regions that do not need the nonsingular covariance matrix to exist are provided in Section 2.1. Section 2.2 describes the prediction intervals and regions for the random walk, while Section 3 presents two examples and simulations.

A Data-Splitting Prediction Region
Some of the new data-splitting prediction regions, described in this section, can handle data x_i from a distribution where the population mean does not exist. Data splitting divides the training data x_1, ..., x_n into two sets: the set H and the validation set V, where H has n_H of the cases and V has the remaining n_V = n − n_H cases x_{i_1}, ..., x_{i_{n_V}}. A common method of data splitting randomly divides the training data into two sets, H and V. Often, n_H ≈ ⌈n/2⌉.
The estimator (T_H, C_H) is computed using the dataset H. Then, the squared validation distances D_j^2 = D^2_{x_{i_j}}(T_H, C_H) are computed for the n_V cases in V. Let D^2_{(U_V)} be the U_V th order statistic of the D_j^2, where

U_V = min(n_V, ⌈(n_V + 1)(1 − δ)⌉).   (6)

The new large sample 100(1 − δ)% data-splitting prediction region for x_f is

{x : D_x^2(T_H, C_H) ≤ D^2_{(U_V)}}.   (7)

To show that (7) is a prediction region, suppose the x_i are iid for i = 1, ..., n, n + 1, where x_f = x_{n+1}. Compute (T_H, C_H) from the cases in H. Consider the squared validation distances D_k^2 for k = 1, ..., n_V and the squared validation distance D^2_{n_V+1} for the case x_f. Since these n_V + 1 cases are iid, the probability that D_t^2 has rank j for j = 1, ..., n_V + 1 is 1/(n_V + 1) for each t, i.e., the ranks follow the discrete uniform distribution. Let t = n_V + 1, and let D^2_{(j)} denote the ordered squared validation distances using j = 1, ..., n_V. That is, we obtain the order statistics without using the unknown squared validation distance D^2_{n_V+1}. If there are no tied ranks, then

P(x_f is in the region (7)) = P(D^2_{n_V+1} ≤ D^2_{(U_V)}) = U_V/(n_V + 1).

Note that we can obtain the actual coverage U_V/(1 + n_V) close to 1 − δ for n_V ≥ 20 with δ = 0.05 even if (T_H, C_H) is a bad estimator; however, for a bad estimator, the volume of the prediction region tends to be much larger than that of the highest density region, even if C_H is well conditioned. We likely need U_V ≥ 50 for D^2_{(U_V)} to approximate the corresponding population percentile of the squared distance. The above prediction region coverage theory does not depend on the dimension p as long as C is nonsingular. If C = I_p or C = diag(S_1^2, ..., S_p^2), then the prediction region (7) can be used for high-dimensional data, where p > n. Regularized covariance matrices or precision matrices could also be used.
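A minimal R sketch (ours) of the data-splitting prediction region (7), using (T_H, C_H) = (coordinatewise median, I_p) as in Example 2 below, so the region remains usable when p > n; the function name dspredreg and its defaults are our own:

dspredreg <- function(x, delta = 0.05, nH = ceiling(nrow(x) / 2)) {
  n <- nrow(x); p <- ncol(x)
  idH <- sample(n, nH)                              # random data splitting
  H <- x[idH, , drop = FALSE]; V <- x[-idH, , drop = FALSE]
  TH <- apply(H, 2, median)                         # (T_H, C_H) = (coord. median, I_p)
  nV <- nrow(V)
  UV <- min(nV, ceiling((nV + 1) * (1 - delta)))    # order statistic index (6)
  D2 <- mahalanobis(V, TH, diag(p))                 # squared validation distances
  cutoff <- sort(D2)[UV]                            # D^2_(UV)
  list(TH = TH, cutoff = cutoff,
       inregion = function(xf) mahalanobis(rbind(xf), TH, diag(p)) <= cutoff)
}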

Prediction Intervals and Regions for the Random Walk
To our knowledge, asymptotically optimal nonparametric prediction intervals for the random walk have not previously been proposed. The nonparametric prediction regions described in this section may be the first ones proposed for vector-valued random walks, and they are asymptotically optimal if the ε_i = ε_{i,h} are iid from a large class of elliptically contoured distributions. The random walk with drift is an AR(1) model with a unit root and an ARIMA(0,1,0) model, since Y_t − Y_{t−1} = e_t. Parametric prediction intervals are given by Niwitpong and Panichkitkosolkul [20] and Panichkitkosolkul and Niwitpong [21]. Wolf and Wunderli [22] consider time series prediction regions for (Y_{n+1}, ..., Y_{n+L})^T. Parametric prediction regions have been given for vector autoregression (VAR) models. See Kim [23,24] for details and references.
The new prediction intervals and regions for random walks are simple. First, consider the random walk Y_t = Y_{t−1} + e_t, where the e_t are iid. Since Y_{n+h} = Y_n + ε_{f,h}, where ε_{f,h} = e_{n+1} + ··· + e_{n+h}, find the nonoverlapping h-step increments ε_{i,h} = Y_{hi} − Y_{h(i−1)} for i = 1, ..., m = ⌊n/h⌋. Assume n ≥ 50h, and let [L, U] be the shorth(c) PI (1) for a future value ε_{f,h} based on ε_{1,h}, ..., ε_{m,h} with m ≥ 50. Then, the large sample 100(1 − δ)% PI for Y_{n+h} is [Y_n + L, Y_n + U]. This PI tends to be asymptotically optimal as long as the e_t are iid. This PI is equivalent to applying the shorth(c) PI (1) to Y_n + ε_{1,h}, ..., Y_n + ε_{m,h}.
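Combining these steps, a minimal R sketch (ours) of the h-step-ahead random walk PI, reusing the shorth_pi function sketched above; the increments are computed from the observed one-step differences:

rwpi <- function(Y, h = 1, delta = 0.05) {
  e <- diff(Y)                                    # one-step errors e_t = Y_t - Y_{t-1}
  m <- floor(length(e) / h)
  eps <- colSums(matrix(e[1:(m * h)], nrow = h))  # m nonoverlapping h-step increments
  Y[length(Y)] + shorth_pi(eps, delta)            # shift the shorth PI by Y_n
}
set.seed(2)
Y <- cumsum(rnorm(500, mean = 1))  # Gaussian random walk with drift
rwpi(Y, h = 2)                     # 95% 2-step-ahead PI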
For the vector-valued random walk Y_t = Y_{t−1} + e_t, find ε_{1,h}, ..., ε_{m,h}. The nonparametric 100(1 − δ)% prediction region for a future value ε_{f,h} is

{w : D_w^2(ε̄_h, S_h) ≤ D^2_{(U_m)}},   (8)

where S_h is the sample covariance matrix of the ε_{i,h} and ε̄_h is their sample mean. This prediction region is a hyperellipsoid centered at the sample mean ε̄_h. The following large sample 100(1 − δ)% prediction region for Y_{n+h} shifts the hyperellipsoid (8) to be centered at Y_n + ε̄_h:

{w : D_w^2(Y_n + ε̄_h, S_h) ≤ D^2_{(U_m)}}.   (9)

Since Y_{n+h} has the same distribution as Y_n + ε_{f,h}, the probability that Y_{n+h} falls in the region (9) equals the probability that ε_{f,h} falls in the region (8), which is bounded below by 1 − δ, asymptotically. The prediction region (9) is equivalent to applying the nonparametric prediction region (5) to Y_n + ε_{1,h}, ..., Y_n + ε_{m,h}. The prediction region (9) is similar to the Olive [7] prediction region for the multivariate regression model. Given that the ε_i = ε_{i,h} are iid, alternative prediction intervals and regions, such as those in Section 2.1 or Hyndman [25] for small p, could be used.
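A minimal R sketch (ours) of the region (9), reusing the nppredreg function sketched above on the shifted increments Y_n + ε_{i,h}:

rwpredreg <- function(Y, h = 1, delta = 0.05) {
  # Y: n x p matrix whose rows are the vector-valued random walk
  e <- diff(Y)                                   # one-step errors, (n - 1) x p
  m <- floor(nrow(e) / h)
  # m nonoverlapping h-step increments eps_{i,h}
  eps <- t(sapply(1:m, function(i) colSums(e[((i - 1) * h + 1):(i * h), , drop = FALSE])))
  shifted <- sweep(eps, 2, Y[nrow(Y), ], "+")    # Y_n + eps_{i,h}
  nppredreg(shifted, delta)                      # region (9) via region (5)
}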

Example 1.
Common examples of random walks are stock prices. The EuStockMarkets dataset, available in the R software, is a multivariate time series with 1860 observations on 4 variables. The observations are the daily closing prices of major European stock indices: Germany DAX, Switzerland SMI, France CAC, and UK FTSE. The data are sampled in business time, i.e., weekends and holidays are omitted. If we consider Y_t = DAX, the plot of the random walk errors e_t = Y_t − Y_{t−1} is rectangular about the e = 0 line for cases 1-1460. Cases 1461-1800 scatter about the e = 0 line but have much more variability (not shown, but see Figure 9.1 in Haile [26]). Let cases 1-1450 be the training data, and let cases 1451-1460 be the test data. Figure 1 shows a plot of Y_{t−1} versus Y_t on the vertical axis for t = 2 to 1450. The two parallel lines correspond to the one-step-ahead 95% prediction intervals, which cover slightly more than 95% of the training data.
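A sketch (ours) of the computation for this example in R, using the rwpi and shorth_pi functions sketched above; the case ranges follow the example:

data(EuStockMarkets)
Y <- as.numeric(EuStockMarkets[, "DAX"])
train <- Y[1:1450]; test <- Y[1451:1460]
pi1 <- rwpi(train, h = 1)  # one-step-ahead 95% PI for case 1451
pi1; test[1]               # check whether the first test case is covered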
Example 2. The Wisseman, Hopke, and Schindler-Kaudelka [27] pottery data consist of a chemical analysis of pottery shards. The dataset has 36 cases and 5 groups corresponding to types of pottery shards. The variables x_1, ..., x_{20} correspond to the p = 20 chemicals analyzed. Consider the n = 18 group 1 cases where the pottery shards are Arretine, which is a type of Roman pottery. We randomly select case 4 from group 1 to be x_f and compute the 88.89% data-splitting prediction region with the remaining 17 cases, n_V = 8, and (T, C) = (MED(W), I_p), where MED(W) is the coordinatewise median computed from the 9 cases in H. The cutoff is D^2_{(U_V)} = 612.2, and D^2(x_f) = 353.8. Hence, x_f is in the 88.89% prediction region. Next, we set x_f equal to each of the 36 cases in turn. Then, 8 cases x_f are not in the above prediction region, including 7 of the 18 cases that are not from group 1.
Let the population forecast error be e(h). For type 1, where the e_t are iid N(1, 1), e(h) ∼ N(h, h); hence, the asymptotically optimal lengths of the large sample 95% PIs are 3.92√h for h = 1, 2, 3, and 4.
The results are shown in Table 1. We roughly need n ≥ 50h for good coverage. Thus, n = 100 is too small for the h-step-ahead PIs with h = 3 and h = 4. The Cauchy distribution requires a large n before the average PI lengths get close to the asymptotically optimal lengths. Two lines are given for each distribution-sample size combination. The first line provides the coverages, while the second line provides the average PI lengths with the standard deviation of the lengths in parentheses. The coverage denotes the proportion of the 5000 PIs that contain the test data case Y_f = Y_{fi} for i = 1, ..., 5000. The last two lines of Table 1 correspond to the uniform (0,2) distribution with n = 800. The h = i label corresponds to the i-step-ahead 95% prediction interval with i = 1, 2, 3, and 4. The coverages are near 0.95, and the simulated average lengths (1.9014, 3.1666, 3.9651, 4.6357) are near the asymptotically optimal lengths (1.90, 3.11, 3.87, 4.48). A small vector-valued random walk simulation is also done for the large sample 95% prediction regions using 5000 runs. We use distributions with nonsingular population covariance matrices. Let u_t = (u_{t1}, ..., u_{tp})^T, where the u_{ti} are iid from the type (1) N(1, 1), (2) 1 + t_5 (where t_5 has a t distribution with 5 degrees of freedom), (3) EXP(1), or (4) U(0,2) distribution. Then e_t = A u_t, where the p × p matrix A = (a_{ij}) has diagonal elements a_{ii} = 1 and off-diagonal elements a_{ij} = ψ for i ≠ j.
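A sketch (ours) of generating the correlated errors e_t = A u_t for the vector-valued simulation, with type 1 u_{ti} ∼ N(1, 1) and an arbitrary ψ:

p <- 8; psi <- 0.5; n <- 400
A <- matrix(psi, p, p); diag(A) <- 1       # a_ii = 1, a_ij = psi for i != j
u <- matrix(rnorm(n * p, mean = 1), n, p)  # type 1: u_ti iid N(1, 1)
e <- u %*% t(A)                            # rows are e_t = A u_t
Y <- apply(e, 2, cumsum)                   # vector-valued random walk with Y_0 = 0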
Table 2 shows some results for p = 8, giving the coverages. We roughly need n ≥ 20ph to obtain good coverage near 0.95. Thus, n = 400 is too small for p = 8 with h = 3 or h = 4, although the undercoverage is small for h = 3. Note that ε_t = (ε_{1t}, ..., ε_{8t})^T. The value ψ = 0 makes the ε_{it} uncorrelated. Increasing ψ increases the correlation ρ = cor(ε_{it}, ε_{jt}) for i ≠ j. The prediction regions are hyperellipsoids, which have volumes (not given) instead of lengths. Simulations were also run for the data-splitting prediction region.
The theory for the new prediction regions is simple; thus, Table 3 serves more as a verification that the programs work than as a test of the theory. See Zhang [31] for more simulations. The output variables include cov = the observed coverage, up ≈ the actual coverage U_V/(n_V + 1), and mnhsq = the mean cutoff D^2_{(U_V)}. With 5000 runs, we expect the observed coverage to be in [0.94, 0.96] if the actual coverage is close to 0.95. The random vector is x = Aw, where x = w ∼ N_p(0, I_p) for xtype = 3, and x ∼ N_p(0, diag(1, ..., p)) for xtype = 1. For xtype = 2, w has the w_i iid lognormal(0,1), with A a diagonal matrix. For xtype = 3, where x ∼ N_p(0, I_p), the squared distances D_x^2(0, I_p) ∼ χ²_p; hence, the cutoff D^2_{(U_V)} should estimate the population percentile χ²_{p,0.95} if n ≥ max(20p, 200) and n_V = 100. This result did occur in the simulations.
Table 3 gives n, p, n_V, a number 'xtype' corresponding to the distribution of x, and a number 'dtype' corresponding to the (T, C) used in the prediction region (7). High-dimensional data were used, since p ≥ n. With n_V = 20, the actual coverage is 20/21 = 0.9524; n_V = 25 has an actual coverage of 25/26 = 0.9615, and n_V = 50 has an actual coverage of 49/51 = 0.9608. The observed coverages are close to the actual coverages in Table 3.
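A sketch (ours) of a small coverage check in the spirit of Table 3, reusing the dspredreg function sketched above; with n_V = 20 and δ = 0.05, the observed coverage should be near 20/21 ≈ 0.9524:

set.seed(3)
p <- 50; n <- 40  # high-dimensional: p > n
runs <- 5000; hits <- 0
for (r in 1:runs) {
  x <- matrix(rnorm((n + 1) * p), ncol = p)
  reg <- dspredreg(x[1:n, ], delta = 0.05, nH = 20)  # n_H = 20, so n_V = 20
  if (reg$inregion(x[n + 1, ])) hits <- hits + 1
}
hits / runs  # expect about 20/21 = 0.9524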

Discussion
The new nonparametric, asymptotically optimal h-step-ahead prediction intervals for the random walk appear to perform well if n ≥ 50h. The new nonparametric h-step-ahead 95% prediction regions for the vector-valued random walk appear to have coverages near 0.95 for n ≥ 20ph. The new nonparametric prediction regions are fast, have simple theory, and have coverage ≥ min(n_V, ⌈(n_V + 1)(1 − δ)⌉)/(n_V + 1) = U_V/(n_V + 1). Datasets where future data do not behave like past data are common, and then the prediction intervals and regions tend to perform poorly. In Example 1, cases 1-1460 appear to follow one random walk, while cases 1461-1800 follow another random walk with more variability.
Some prediction intervals for stochastic processes include Pan and Politis [32], Vidoni [33], and Vit [34]. Makridakis et al. [35] noted that a PI for the random walk, derived assuming normal errors, often failed to give good coverage. Pankratz [36] noted that the random walk model has been found to be a good model for many stock price time series.
Conformal prediction gives precise levels of coverage for one future observation, and the prediction region (7) is a conformal prediction region that can have a large volume. As an example, consider using (T, C) = (MED(W), I_p). Then the prediction region is a hypersphere centered at the coordinatewise median. The prediction region is good if the iid w_i ∼ N_p(µ, σ² I_p), but if w_i ∼ N_p(µ, Σ) such that the highest density region is a hyperellipsoid tightly clustered about a vector in the direction of 1 = (1, 1, ..., 1)^T, then the prediction region (7) has a large volume compared to the highest density region.
There are many methods where prediction is useful. For example, Garg, Aggarwal, et al. [37] used support vector machines, while Garg, Belarbi, et al. [38] used Gaussian process regression. Olive [7] shows how to obtain prediction intervals when the model is Y_i = m(x_i) + e_i if the errors are iid. If heterogeneity is present and there are enough cases x_i with m(x_i) near m(x_f), we can make a prediction interval using the Y_i corresponding to those x_i. Graphically, in a plot of m(x_i) versus Y_i (on the vertical axis), we make a narrow vertical slice centered at m(x_f) and then make the PI from the Y_i in the slice, as in the sketch below.
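A minimal R sketch (ours) of the vertical-slice PI, with hypothetical fitted values mhat and a user-chosen slice half-width eps; it reuses the shorth_pi function sketched above:

slice_pi <- function(mhat, Y, mf, eps, delta = 0.05) {
  # PI from the Y_i whose fitted values m(x_i) lie in a slice around m(x_f)
  inslice <- abs(mhat - mf) <= eps
  shorth_pi(Y[inslice], delta)
}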

Figure 1. PI plot of the DAX dataset.