
Predicting Random Walks and a Data-Splitting Prediction Region

Mulubrhan G. Haile 1, Lingling Zhang 2 and David J. Olive 3
1 Mathematics and Physics Department, Westminster College, Fulton, MO 65251, USA
2 Mathematics and Statistics Department, University at Albany, Albany, NY 12222, USA
3 School of Mathematical & Statistical Sciences, Southern Illinois University, Carbondale, IL 62901, USA
* Author to whom correspondence should be addressed.
Stats 2024, 7(1), 23-33; https://doi.org/10.3390/stats7010002
Submission received: 27 November 2023 / Revised: 21 December 2023 / Accepted: 24 December 2023 / Published: 8 January 2024
(This article belongs to the Section Statistical Methods)

Abstract

Perhaps the first nonparametric, asymptotically optimal prediction intervals are provided for univariate random walks, with applications to renewal processes. Perhaps the first nonparametric prediction regions are introduced for vector-valued random walks. This paper further derives nonparametric data-splitting prediction regions, which are underpinned by very simple theory. Some of the prediction regions can be used when the data distribution does not have first moments, and some can be used for high-dimensional data, where the number of predictors is larger than the sample size. The prediction regions can make use of many estimators of multivariate location and dispersion.

1. Introduction

This paper suggests prediction intervals and regions for univariate and vector-valued random walks. This section reviews random walks, renewal processes, nonparametric prediction intervals, and nonparametric prediction regions. Section 2.1 presents new nonparametric data-splitting regions.
A random walk (with drift) is defined as $Y_t = Y_{t-1} + e_t$, where the $e_t$ are independent and identically distributed (iid). Suppose there is a sample $Y_1, \dots, Y_n$ and we want a prediction interval (PI) for $Y_{n+h}$. Then $Y_t = Y_{t-2} + e_{t-1} + e_t = Y_{t-h} + e_{t-h+1} + \cdots + e_t = Y_0 + e_1 + \cdots + e_t$, so $Y_{n+h} = Y_n + e_{n+1} + e_{n+2} + \cdots + e_{n+h} = Y_n + \epsilon_{n,h}$. Let $e_j = Y_j - Y_{j-1}$ for $j = 2, \dots, n$. Divide $e_2, \dots, e_n$ into blocks of length $h$ and let $\epsilon_i$ be the sum of the $e_j$ in each block. Hence $\epsilon_1 = e_2 + \cdots + e_{h+1}$, $\epsilon_2 = e_{h+2} + \cdots + e_{2h+1}$, and $\epsilon_i = e_{(i-1)h+2} + e_{(i-1)h+3} + \cdots + e_{(i-1)h+h+1}$ for $i = 1, \dots, m = \lfloor n/h \rfloor$. These $\epsilon_i$ are iid from the same distribution as $\epsilon_{n,h}$. The same decomposition can be made for a vector-valued random walk $\boldsymbol{Y}_t = \boldsymbol{Y}_{t-1} + \boldsymbol{e}_t$, where the vectors are $p \times 1$. Thus, $\boldsymbol{\epsilon}_i = \boldsymbol{e}_{(i-1)h+2} + \boldsymbol{e}_{(i-1)h+3} + \cdots + \boldsymbol{e}_{(i-1)h+h+1}$ for $i = 1, \dots, m$.
The random walk can be written as $Y_t = Y_0 + \sum_{i=1}^t e_i$, where $Y_0 = y_0$ is often a constant. A stochastic process $\{N(t) : t \ge 0\}$ is a counting process if $N(t)$ counts the total number of events that occurred in the time interval $(0, t]$. Let $e_n$ be the interarrival time or waiting time between the $(n-1)$th and $n$th events counted by the process, $n \ge 1$. If the nonnegative $e_i$ are iid with $P(e_i = 0) < 1$, then $\{N(t), t \ge 0\}$ is a renewal process. Let $Y_n = \sum_{i=1}^n e_i$ denote the time of occurrence of the $n$th event, that is, the waiting time until the $n$th event. Then $Y_n$ is a random walk with $Y_0 = y_0 = 0$. Let the expected value $E(e_i) = \mu > 0$. Then $E(Y_n) = n\mu$ and the variance $V(Y_n) = n\,V(e_i)$ if $V(e_i)$ exists. A Poisson process with rate $\lambda$ is a renewal process where the $e_i$ are iid exponential EXP($\lambda$) with $E(e_i) = 1/\lambda$. See Ross [1] for the Poisson process and the renewal process. Given $Y_1, \dots, Y_n$, then $n$ events have occurred, and the 1-step-ahead PI is for the time until the next event, the 2-step-ahead PI is for the time until the next 2 events, and the $h$-step-ahead PI is for the time until the next $h$ events.
For forecasting, we predict the test data $Y_{n+1}, \dots, Y_{n+L}$ using the past training data $Y_1, \dots, Y_n$. A large sample $100(1-\delta)\%$ prediction interval for $Y_{n+h}$ is $[L_n, U_n]$, where the coverage $P(L_n \le Y_{n+h} \le U_n) = 1 - \delta_n$ is eventually bounded below by $1 - \delta$ as $n \to \infty$. We often want $1 - \delta_n \to 1 - \delta$ as $n \to \infty$. A large sample $100(1-\delta)\%$ PI is asymptotically optimal if it has the shortest asymptotic length: the length of $[L_n, U_n]$ converges to $U_s - L_s$ as $n \to \infty$, where $[L_s, U_s]$ is the population shorth, the shortest interval covering at least $100(1-\delta)\%$ of the mass.
The shorth estimator of the population shorth will be defined as follows. If the data are $Z_1, \dots, Z_n$, let $Z_{(1)} \le \cdots \le Z_{(n)}$ be the order statistics. Let $\lceil x \rceil$ denote the smallest integer greater than or equal to $x$ (e.g., $\lceil 7.7 \rceil = 8$). Consider the intervals that contain $c$ cases: $[Z_{(1)}, Z_{(c)}], [Z_{(2)}, Z_{(c+1)}], \dots, [Z_{(n-c+1)}, Z_{(n)}]$. Compute $Z_{(c)} - Z_{(1)}, Z_{(c+1)} - Z_{(2)}, \dots, Z_{(n)} - Z_{(n-c+1)}$. Then the estimator shorth($c$) $= [Z_{(s)}, Z_{(s+c-1)}]$ is the interval with the shortest length.
For a large sample $100(1-\delta)\%$ PI, the nominal coverage is $100(1-\delta)\%$. Undercoverage occurs if the actual coverage is below the nominal coverage. For example, if the actual coverage is 0.93 when $n = 100$, then the undercoverage of a large sample 95% PI is 0.02, or 2%. Suppose the data $Z_1, \dots, Z_n$ are iid, and a large sample $100(1-\delta)\%$ PI is desired for a future value $Z_f$. The shorth($c$) interval is a large sample $100(1-\delta)\%$ PI if $c/n \to 1 - \delta$ as $n \to \infty$, and it often has the asymptotically shortest length. Frey [2] showed that for large $n\delta$ and iid data, the shorth($k_n = \lceil n(1-\delta) \rceil$) prediction interval has maximum undercoverage approximately $1.12\sqrt{\delta/n}$, and used the large sample $100(1-\delta)\%$ PI shorth($c$) $=$
$[L_n, U_n] = [Z_{(s)}, Z_{(s+c-1)}]$ with
$c = \min\left(n, \left\lceil n\left[1 - \delta + 1.12\sqrt{\delta/n}\,\right] \right\rceil\right).$   (1)
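To make the construction concrete, here is a minimal R sketch of the shorth($c$) PI (1); the function name shorthPI is illustrative and is not the tspack.txt implementation.

# Minimal sketch of the shorth(c) PI (1); shorthPI is an illustrative name,
# not the tspack.txt implementation.
shorthPI <- function(z, delta = 0.05) {
  n <- length(z)
  zs <- sort(z)
  # Frey (2013) correction: c = min(n, ceiling(n[1 - delta + 1.12 sqrt(delta/n)]))
  cc <- min(n, ceiling(n * (1 - delta + 1.12 * sqrt(delta / n))))
  # among the n - cc + 1 intervals containing cc ordered cases, take the shortest
  lengths <- zs[cc:n] - zs[1:(n - cc + 1)]
  s <- which.min(lengths)
  c(L = zs[s], U = zs[s + cc - 1])
}

# Example: 95% PI for a future observation from iid N(0,1) data
set.seed(1)
shorthPI(rnorm(100))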
The shorth PI (1) often has good coverage for $n \ge 50$ and $0.05 \le \delta \le 0.1$, but the convergence of $U_n - L_n$ to the population shorth length $U_s - L_s$ can be quite slow. Under regularity conditions, Grübel [3] showed that for iid data, the length and center of the shorth($k_n$) interval are $\sqrt{n}$-consistent and $n^{1/3}$-consistent estimators of the length and center of the population shorth interval, respectively. The correction factor also increases the length of PI (1). Einmahl and Mason [4] provide large sample theory for the shorth under slightly milder conditions than Grübel [3]. Chen and Shao [5] show that the shorth PI converges to the population shorth under mild conditions for ergodic data.
The large sample $100(1-\delta)\%$ shorth PI (1) may or may not be asymptotically optimal if the $100(1-\delta)\%$ population shorth is $[L_s, U_s]$ and the cumulative distribution function (cdf) $F(x)$ does not strictly increase in the intervals $(L_s - \epsilon, L_s + \epsilon)$ and $(U_s - \epsilon, U_s + \epsilon)$ for some $\epsilon > 0$. Suppose that $Y$ has a probability mass function (pmf) $p(0) = 0.4$, $p(1) = 0.3$, $p(2) = 0.2$, $p(3) = 0.06$, and $p(4) = 0.04$. Then the 90% population shorth is [0,2] and the $100(1-\delta)\%$ population shorth is [0,3] for $(1-\delta) \in (0.9, 0.96]$. Let $W_i = I(Y_i \le x) = 1$ if $Y_i \le x$ and 0, otherwise. The empirical cdf
$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(Y_i \le x) = \frac{1}{n} \sum_{i=1}^n I(Y_{(i)} \le x)$   (2)
is the sample proportion of the $Y_i \le x$. If $Y_1, \dots, Y_n$ are iid, then for fixed $x$, $n\hat{F}_n(x) \sim \text{binomial}(n, F(x))$. Thus, $\hat{F}_n(x) \sim AN(F(x), F(x)(1 - F(x))/n)$, where $AN$ stands for asymptotically normal. For the $Y$ with the above pmf, $\hat{F}_n(2) \xrightarrow{P} 0.9$ as $n \to \infty$ with $P(\hat{F}_n(2) < 0.9) \to 0.5$ and $P(\hat{F}_n(2) > 0.9) \to 0.5$ as $n \to \infty$. Hence, the large sample 90% PI (1) will be [0,2] or [0,3] with probabilities $\to 0.5$ as $n \to \infty$, with an expected asymptotic length of 2.5 and expected asymptotic coverage converging to 0.93. However, the large sample $100(1-\delta)\%$ PI (1) converges to [0,3] and is asymptotically optimal with asymptotic coverage 0.96 for $(1-\delta) \in (0.9, 0.96)$.
To describe the Olive [6] nonparametric prediction region, Mahalanobis distances will be useful. Let the $p \times 1$ column vector $T$ be a multivariate location estimator, and let the $p \times p$ symmetric positive definite matrix $\boldsymbol{C}$ be a dispersion estimator. Then the $i$th squared sample Mahalanobis distance is the scalar
$D_i^2 = D_i^2(T, \boldsymbol{C}) = D_{\boldsymbol{w}_i}^2(T, \boldsymbol{C}) = (\boldsymbol{w}_i - T)^T \boldsymbol{C}^{-1} (\boldsymbol{w}_i - T)$   (3)
for each observation $\boldsymbol{w}_i$, where $i = 1, \dots, n$. Notice that the Euclidean distance of $\boldsymbol{w}_i$ from the estimate of center $T$ is $D_i(T, \boldsymbol{I}_p)$, where $\boldsymbol{I}_p$ is the $p \times p$ identity matrix. The classical Mahalanobis distance $D_i$ uses $(T, \boldsymbol{C}) = (\overline{\boldsymbol{w}}, \boldsymbol{S})$, the sample mean and sample covariance matrix, where
$\overline{\boldsymbol{w}} = \frac{1}{n} \sum_{i=1}^n \boldsymbol{w}_i \quad \text{and} \quad \boldsymbol{S} = \frac{1}{n-1} \sum_{i=1}^n (\boldsymbol{w}_i - \overline{\boldsymbol{w}})(\boldsymbol{w}_i - \overline{\boldsymbol{w}})^T.$
Consider predicting a future test value $\boldsymbol{w}_f$, given past training data $\boldsymbol{w}_1, \dots, \boldsymbol{w}_n$, where $\boldsymbol{w}_1, \dots, \boldsymbol{w}_n, \boldsymbol{w}_f$ are iid. Prediction intervals are the special case of prediction regions with $p = 1$, so the $w_i$ are random variables.
A large sample $100(1-\delta)\%$ prediction region is a set $A_n$ such that $P(\boldsymbol{w}_f \in A_n) \ge 1 - \delta$ asymptotically. A prediction region is asymptotically optimal if its volume converges in probability to the volume of the minimum volume covering region or the highest-density region of the distribution of $\boldsymbol{w}_f$.
Like prediction intervals, prediction regions often need correction factors. For iid data from a distribution with a $p \times p$ nonsingular covariance matrix, it was found that the simulated maximum undercoverage of the prediction region (5) without the correction factor was about 0.05 when $n = 20p$. Hence, the correction factor (4) is used to provide better coverage for small $n$. Let $q_n = \min(1 - \delta + 0.05,\; 1 - \delta + p/n)$ for $\delta > 0.1$ and
$q_n = \min(1 - \delta/2,\; 1 - \delta + 10\delta p/n), \quad \text{otherwise}.$   (4)
If $1 - \delta < 0.999$ and $q_n < 1 - \delta + 0.001$, set $q_n = 1 - \delta$. Let $D_{(U_n)}$ be the $100 q_n$th sample quantile of the $D_i$, where $i = 1, \dots, n$.
The large sample $100(1-\delta)\%$ nonparametric prediction region for a future value $\boldsymbol{w}_f$, given iid data $\boldsymbol{w}_1, \dots, \boldsymbol{w}_n$, is
$\{\boldsymbol{z} : (\boldsymbol{z} - \overline{\boldsymbol{w}})^T \boldsymbol{S}^{-1} (\boldsymbol{z} - \overline{\boldsymbol{w}}) \le D_{(U_n)}^2\} = \{\boldsymbol{z} : D_{\boldsymbol{z}}^2(\overline{\boldsymbol{w}}, \boldsymbol{S}) \le D_{(U_n)}^2\}.$   (5)
The nonparametric prediction region (5) is a large sample prediction region if the iid $\boldsymbol{w}_i$ have a nonsingular covariance matrix, and it is asymptotically optimal for a large class of elliptically contoured distributions, including multivariate normal distributions with nonsingular covariance matrices. Regions with smaller asymptotic volumes can exist if the distribution is not elliptically contoured. From Olive [7], the simulated coverage was often near the nominal coverage for $n \ge 20p$, but the simulated volumes behaved better for $n \ge 50p$. The shorth PIs do not need the mean or variance of the $e_t$ to exist.
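For illustration, a minimal R sketch of the nonparametric prediction region (5), including the correction factor (4), is given below; the function name predrgn and the use of R's default sample quantile are illustrative assumptions and not the paper's tspack.txt code.

# Sketch of the nonparametric prediction region (5) with correction factor (4).
# predrgn is an illustrative name, not the tspack.txt implementation.
predrgn <- function(w, zf, delta = 0.05) {
  w <- as.matrix(w); n <- nrow(w); p <- ncol(w)
  # correction factor (4)
  if (delta > 0.1) qn <- min(1 - delta + 0.05, 1 - delta + p / n)
  else qn <- min(1 - delta / 2, 1 - delta + 10 * delta * p / n)
  if (1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta
  wbar <- colMeans(w); S <- cov(w)
  D2 <- mahalanobis(w, center = wbar, cov = S)   # squared distances D_i^2
  cutoff <- unname(quantile(D2, qn))             # cutoff D_(Un)^2
  D2f <- mahalanobis(matrix(zf, nrow = 1), center = wbar, cov = S)
  list(cutoff = cutoff, D2f = D2f, inregion = (D2f <= cutoff))
}

# Example: iid bivariate normal training data and a candidate future case
set.seed(2)
w <- matrix(rnorm(200 * 2), ncol = 2)
predrgn(w, zf = c(0.5, -1))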
There are many prediction intervals and regions in the literature. See Beran [8,9], Fontana, Zeni, and Vantini [10], Guan [11], Olive [6,7], Steinberger and Leeb [12], and Tian, Nordman, and Meeker [13] for references. The new prediction regions can be used for distributions that do not have an expected value if an appropriate $(T, \boldsymbol{C})$ is used, e.g., $(T, \boldsymbol{C}) = (\mathrm{MED}(W), \boldsymbol{I}_p)$, where $\mathrm{MED}(W)$ is the coordinate-wise median. Pelawa Watagoda and Olive [14] and Lei et al. [15] use data splitting to obtain prediction intervals for the multiple linear regression model.
Prediction regions have some nice applications besides prediction. Applying a prediction region to data generated from a posterior distribution provides an estimated credible region for Bayesian statistics. See Chen and Shao [5]. Certain prediction regions applied to a bootstrap sample result in a confidence region. See Pelawa Watagoda and Olive [16], Rajapaksha and Olive [17], and Rathnayake and Olive [18]. Mykland [19] converts prediction regions into investment strategies.
New data-splitting prediction regions that do not need the nonsingular covariance matrix to exist are provided in Section 2.1. Section 2.2 describes the prediction intervals and regions for the random walk, while Section 3 presents two examples and simulations.

2. Materials and Methods

2.1. A Data-Splitting Prediction Region

Some of the new data-splitting prediction regions described in this section can handle $\boldsymbol{\epsilon}_i$ from a distribution where the population mean does not exist. Data splitting divides the training data $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$ into two sets: H and the validation set V, where H has $n_H$ of the cases and V has the remaining $n_V = n - n_H$ cases $i_1, \dots, i_{n_V}$. A common method of data splitting randomly divides the training data into the two sets H and V. Often, $n_H \approx n/2$.
The estimator $(T_H, \boldsymbol{C}_H)$ is computed using dataset H. Then the squared validation distances $D_j^2 = D_{\boldsymbol{x}_{i_j}}^2(T_H, \boldsymbol{C}_H) = (\boldsymbol{x}_{i_j} - T_H)^T \boldsymbol{C}_H^{-1}(\boldsymbol{x}_{i_j} - T_H)$ are computed for the $j = 1, \dots, n_V$ cases in the validation set V. Let $D_{(U_V)}^2$ be the $U_V$th order statistic of the $D_j^2$, where
$U_V = \min\left(n_V, \left\lceil (n_V + 1)(1 - \delta) \right\rceil\right).$   (6)
The new large sample $100(1-\delta)\%$ data-splitting prediction region for $\boldsymbol{x}_f$ is
$\{\boldsymbol{z} : D_{\boldsymbol{z}}^2(T_H, \boldsymbol{C}_H) \le D_{(U_V)}^2\}.$   (7)
To show that (7) is a prediction region, suppose the $\boldsymbol{x}_i$ are iid for $i = 1, \dots, n, n+1$, where $\boldsymbol{x}_f = \boldsymbol{x}_{n+1}$. Compute $(T_H, \boldsymbol{C}_H)$ from the cases in H. Consider the squared validation distances $D_k^2$ for $k = 1, \dots, n_V$ and the squared validation distance $D_{n_V+1}^2$ for the case $\boldsymbol{x}_f$. Since these $n_V + 1$ cases are iid, the probability that $D_t^2$ has rank $j$, for $j = 1, \dots, n_V + 1$, is $1/(n_V + 1)$ for each $t$; i.e., the ranks follow the discrete uniform distribution. Let $t = n_V + 1$ and let $D_{(j)}^2$ denote the ordered squared validation distances using $j = 1, \dots, n_V$. That is, we obtain the order statistics without using the unknown squared validation distance $D_{n_V+1}^2$. Then $D_{(i)}^2$ has rank $i$ if $D_{(i)}^2 < D_{n_V+1}^2$, but rank $i+1$ if $D_{(i)}^2 > D_{n_V+1}^2$. Thus, $D_{(U_V)}^2$ has rank $U_V + 1$ if $D_{\boldsymbol{x}_f}^2 < D_{(U_V)}^2$, and
$P(\boldsymbol{x}_f \in \{\boldsymbol{z} : D_{\boldsymbol{z}}^2(T_H, \boldsymbol{C}_H) \le D_{(U_V)}^2\}) = P(D_{\boldsymbol{x}_f}^2 \le D_{(U_V)}^2) \ge U_V/(1 + n_V)$
$\to 1 - \delta$ as $n_V \to \infty$. If there are no tied ranks, then
$P(D_{\boldsymbol{x}_f}^2 \le D_{(U_V)}^2) = P(D_{\boldsymbol{x}_f}^2 < D_{(U_V)}^2) = P(\text{rank of } D_{\boldsymbol{x}_f}^2 \le U_V) = U_V/(1 + n_V).$
Note that we can obtain actual coverage $U_V/(1 + n_V)$ close to $1 - \delta$ for $n_V \ge 20$ with $\delta = 0.05$, even if $(T_H, \boldsymbol{C}_H)$ is a poor estimator. However, the volume of the prediction region then tends to be much larger than that of the highest-density region, even if $\boldsymbol{C}_H$ is well conditioned. We likely need $U_V \ge 50$ for $D_{(U_V)}^2$ to approximate the population percentile of $D_j^2 = (\boldsymbol{x}_{i_j} - T_H)^T \boldsymbol{C}_H^{-1}(\boldsymbol{x}_{i_j} - T_H)$.
The above prediction region coverage theory does not depend on the dimension $p$ as long as $\boldsymbol{C}$ is nonsingular. If $\boldsymbol{C} = \boldsymbol{I}_p$ or $\boldsymbol{C} = \mathrm{diag}(S_1^2, \dots, S_p^2)$, then the prediction region (7) can be used for high-dimensional data, where $p > n$. Regularized covariance matrices or precision matrices could also be used.
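A minimal R sketch of the data-splitting prediction region (7) with $(T, \boldsymbol{C}) = (\mathrm{MED}(W), \boldsymbol{I}_p)$, which can be used when $p > n$, is given below; the function name splitrgn and the random half split are illustrative, and the tspack.txt function predrgn2 may differ in detail.

# Sketch of the data-splitting prediction region (7) using (T, C) = (MED(W), I_p).
# splitrgn is an illustrative name; the tspack.txt function predrgn2 may differ.
splitrgn <- function(x, xf, delta = 0.05) {
  x <- as.matrix(x); n <- nrow(x)
  nH <- floor(n / 2)
  H <- sample(n, nH)                             # random split into H and V
  TH <- apply(x[H, , drop = FALSE], 2, median)   # coordinate-wise median from H
  # with C_H = I_p, the squared validation distances are squared Euclidean distances
  D2 <- rowSums(sweep(x[-H, , drop = FALSE], 2, TH)^2)
  nV <- length(D2)
  UV <- min(nV, ceiling((nV + 1) * (1 - delta)))  # order statistic index (6)
  cutoff <- sort(D2)[UV]                          # D_(UV)^2
  D2f <- sum((xf - TH)^2)
  list(cutoff = cutoff, D2f = D2f, inregion = (D2f <= cutoff))
}

# Example: high-dimensional data with p = 100 > n = 50
set.seed(3)
x <- matrix(rnorm(50 * 100), ncol = 100)
splitrgn(x, xf = rnorm(100))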

2.2. Prediction Intervals and Regions for the Random Walk

To our knowledge, asymptotically optimal nonparametric prediction intervals for the random walk have not previously been proposed. The nonparametric prediction regions described in this section may be the first ones proposed for vector-valued random walks, and they are asymptotically optimal if the $\boldsymbol{\epsilon}_i = \boldsymbol{\epsilon}_{i,h}$ are iid from a large class of elliptically contoured distributions. The random walk with drift is an AR(1) model with a unit root and an ARIMA(0,1,0) model, since $Y_t - Y_{t-1} = e_t$. Parametric prediction intervals are given by Niwitpong and Panichkitkosolkul [20] and Panichkitkosolkul and Niwitpong [21]. Wolf and Wunderli [22] consider time series prediction regions for $(Y_{n+1}, \dots, Y_{n+L})^T$. Parametric prediction regions have been given for vector autoregression (VAR) models. See Kim [23,24] for details and references.
The new prediction intervals and regions for random walks are simple. First, consider the random walk $Y_t = Y_{t-1} + e_t$, where the $e_t$ are iid. Find the $\epsilon_i$ for $i = 1, \dots, m = \lfloor n/h \rfloor$. Assume $n \ge 50h$, and let $[L, U]$ be the shorth($c$) PI (1) for a future value $\epsilon_f$ based on $\epsilon_1, \dots, \epsilon_m$ with $m \ge 50$. Then the large sample $100(1-\delta)\%$ PI for $Y_{n+h}$ is $[Y_n + L, Y_n + U]$. This PI tends to be asymptotically optimal as long as the $e_t$ are iid. This PI is equivalent to applying the shorth($c$) PI (1) to $Y_n + \epsilon_1, \dots, Y_n + \epsilon_m$.
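The following R sketch illustrates this h-step-ahead random walk PI; it reuses the shorthPI sketch given after PI (1), and the function name rwPI is illustrative (the tspack.txt simulation function rwpisim may differ).

# Sketch of the h-step-ahead PI for a univariate random walk.
# Assumes the shorthPI() sketch defined earlier in this article.
rwPI <- function(Y, h, delta = 0.05) {
  n <- length(Y)
  e <- diff(Y)                               # e_j = Y_j - Y_{j-1}, j = 2,...,n
  m <- floor(length(e) / h)                  # number of complete blocks
  # block sums: eps_i = sum of h consecutive errors
  eps <- colSums(matrix(e[1:(m * h)], nrow = h))
  PI <- shorthPI(eps, delta)                 # shorth PI for a future eps_f
  Y[n] + PI                                  # shift by Y_n: [Y_n + L, Y_n + U]
}

# Example: random walk with N(1,1) errors, 4-step-ahead 95% PI
set.seed(4)
Y <- cumsum(c(1, rnorm(800, mean = 1)))
rwPI(Y, h = 4)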
For the vector-valued random walk $\boldsymbol{Y}_t = \boldsymbol{Y}_{t-1} + \boldsymbol{e}_t$, find $\boldsymbol{\epsilon}_{1,h}, \dots, \boldsymbol{\epsilon}_{m,h}$. The nonparametric $100(1-\delta)\%$ prediction region for a future value $\boldsymbol{\epsilon}_{f,h}$ is
$\{\boldsymbol{z} : (\boldsymbol{z} - \overline{\boldsymbol{\epsilon}})^T \boldsymbol{S}_h^{-1} (\boldsymbol{z} - \overline{\boldsymbol{\epsilon}}) \le D_{(U_m)}^2\} = \{\boldsymbol{z} : D_{\boldsymbol{z}}^2(\overline{\boldsymbol{\epsilon}}, \boldsymbol{S}_h) \le D_{(U_m)}^2\},$   (8)
where $\boldsymbol{S}_h$ is the sample covariance matrix of the $\boldsymbol{\epsilon}_{i,h}$ and $D_i^2 = (\boldsymbol{\epsilon}_{i,h} - \overline{\boldsymbol{\epsilon}})^T \boldsymbol{S}_h^{-1} (\boldsymbol{\epsilon}_{i,h} - \overline{\boldsymbol{\epsilon}})$. This prediction region is a hyperellipsoid centered at the sample mean $\overline{\boldsymbol{\epsilon}}$. The following large sample $100(1-\delta)\%$ prediction region for $\boldsymbol{Y}_{n+h}$ shifts the hyperellipsoid (8) so that it is centered at $\boldsymbol{Y}_n + \overline{\boldsymbol{\epsilon}}$:
$\{\boldsymbol{z} : [\boldsymbol{z} - (\boldsymbol{Y}_n + \overline{\boldsymbol{\epsilon}})]^T \boldsymbol{S}_h^{-1} [\boldsymbol{z} - (\boldsymbol{Y}_n + \overline{\boldsymbol{\epsilon}})] \le D_{(U_m)}^2\}.$   (9)
Since $\boldsymbol{Y}_{n+h}$ has the same distribution as $\boldsymbol{Y}_n + \boldsymbol{\epsilon}_{f,h}$, $P(\boldsymbol{Y}_{n+h} \in (9)) = P(\boldsymbol{\epsilon}_{f,h} \in (8)) = 1 - \delta_n$, which is bounded below by $1 - \delta$, asymptotically. The prediction region (9) is equivalent to applying the nonparametric prediction region (5) to $\boldsymbol{Y}_n + \boldsymbol{\epsilon}_{1,h}, \dots, \boldsymbol{Y}_n + \boldsymbol{\epsilon}_{m,h}$. The prediction region (9) is similar to the Olive [7] prediction region for the multivariate regression model.
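A short R sketch of the h-step-ahead prediction region (9) for a vector-valued random walk follows; it exploits the equivalence just noted and reuses the predrgn sketch given after region (5). The names are illustrative rather than the tspack.txt code.

# Sketch of the h-step-ahead prediction region (9) for a vector random walk.
# Assumes the predrgn() sketch defined after region (5); names are illustrative.
rwRegion <- function(Y, h, zf, delta = 0.05) {
  # Y: n x p matrix of observations; zf: candidate p-vector for Y_{n+h}
  n <- nrow(Y)
  E <- diff(Y)                                    # rows are e_2, ..., e_n
  m <- floor(nrow(E) / h)
  # block sums eps_{i,h}: sum h consecutive error vectors
  idx <- rep(1:m, each = h)
  eps <- rowsum(E[1:(m * h), , drop = FALSE], group = idx)
  # region (9) = region (5) applied to Y_n + eps_1, ..., Y_n + eps_m
  shifted <- sweep(eps, 2, as.numeric(Y[n, ]), "+")
  predrgn(shifted, zf, delta)
}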
Given that the $\boldsymbol{\epsilon}_i = \boldsymbol{\epsilon}_{i,h}$ are iid, alternative prediction intervals and regions, such as those in Section 2.1 or Hyndman [25] for small $p$, could be used.

3. Results

Example 1.
Common examples of random walks are stock prices. The EuStockMarkets dataset, available in the R software, is a multivariate time series with 1860 observations on 4 variables. The observations are the daily closing prices of major European stock indices: Germany DAX, Switzerland SMI, France CAC, and UK FTSE. The data are sampled in business time, i.e., weekends and holidays are omitted. If we take $Y_t$ = DAX, the plot of the random walk errors $e_t = Y_t - Y_{t-1}$ is rectangular about the $e = 0$ line for cases 1–1460. Cases 1461–1800 also scatter about the $e = 0$ line, but have much more variability (not shown, but see Figure 9.1 in Haile [26]). Let cases 1–1450 be the training data, and let cases 1451–1460 be the test data. Figure 1 shows a plot of $Y_{t-1}$ versus $Y_t$ (on the vertical axis) for $t = 2$ to 1450. The two parallel lines correspond to the one-step-ahead 95% prediction intervals, which cover slightly more than 95% of the training data.
Example 2.
The Wisseman, Hopke, and Schindler-Kaudelka [27] pottery data consist of a chemical analysis of pottery shards. The dataset has 36 cases and 5 groups corresponding to types of pottery shards. The variables $x_1, \dots, x_{20}$ correspond to the $p = 20$ chemicals analyzed. Consider the $n = 18$ group 1 cases, where the pottery shards are Arretine, a type of Roman pottery. We randomly select case 4 from group 1 to be $\boldsymbol{x}_f$ and compute the 88.89% data-splitting prediction region from the remaining 17 cases, with $n_V = 8$ and $(T, \boldsymbol{C}) = (\mathrm{MED}(W), \boldsymbol{I}_p)$, where $\mathrm{MED}(W)$ is the coordinate-wise median computed from the 9 cases in H. The cutoff is $D_{(U_V)}^2 = 612.2$ and $D^2(\boldsymbol{x}_f) = 353.8$. Hence, $\boldsymbol{x}_f$ is in the 88.89% prediction region. Next, we let $\boldsymbol{x}_f$ equal each of the 36 cases in turn. Then 8 cases $\boldsymbol{x}_f$ are not in the above prediction region, including 7 of the 18 cases that are not from group 1.
The remainder of this section presents simulations for the prediction intervals and regions. More simulations and tables are given in Haile [26]. With 5000 runs, observed coverages between 0.94 and 0.96 give no reason to believe that the actual coverage differs from the nominal 0.95.
A small random walk simulation is conducted for the large-sample 95% PIs using 5000 runs with $Y_0 = 1$. The errors $e_t$ are iid from four distributions: (i) N(1,1), (ii) Cauchy(1,1), (iii) EXP(1), and (iv) uniform(0,2). Only distribution (iii) is not symmetric. We compute the $h$-step-ahead 95% PIs for $h = 1, 2, 3, 4 = L$. We want $n \ge 50L$, but simulations may use a smaller $n$, such as $n = 25L$. The asymptotically optimal lengths are (i) 3.92, 5.54, 6.79, 7.84; (ii) 25.41, 50.82, 76.24, 101.65; (iii) 3.00, 4.72, 6.11, 7.22; and (iv) 1.90, 3.11, 3.87, 4.48.
Let the population forecast error be $e(h)$. For type 1, the asymptotically optimal lengths of the large-sample 95% PIs are $3.92\sqrt{h}$, where $e(h) \sim N(h, \sigma^2 = h)$. For type 2, $e(h) \sim C(h, \sigma = h)$, a Cauchy distribution. For type 3, $e(h) \sim G(h, 1)$, a gamma distribution. For type 4, $e(2) \sim$ triangular(0,4). The distribution of the sum of $n$ iid U(0,1) random variables is known as the Irwin–Hall distribution. See Gray and Odell [28], Marengo, Farnsworth, and Stefanic [29], and Roach [30].
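As a check, using standard facts about sums of independent random variables: for type 1, the sum of $h$ iid $N(1,1)$ errors has the $N(h, h)$ distribution, so the shortest 95% interval is $h \pm 1.96\sqrt{h}$ with length $2(1.96)\sqrt{h} = 3.92\sqrt{h}$, reproducing the values 3.92, 5.54, 6.79, and 7.84 above. Similarly, for type 2 the sum of $h$ iid Cauchy(1,1) errors has the Cauchy$(h, h)$ distribution, and since the Cauchy$(\mu, \sigma)$ quantile function is $\mu + \sigma \tan(\pi(u - 0.5))$, the shortest 95% interval has length $2\sigma \tan(0.475\pi) \approx 25.41\sigma = 25.41h$, reproducing 25.41, 50.82, 76.24, and 101.65.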
The results are shown in Table 1. We roughly need $n \ge 50h$ for good coverage. Thus, $n = 100$ is too small for the $h$-step-ahead PIs with $h = 3$ and $h = 4$. The Cauchy distribution requires large $n$ before the average PI lengths get close to the asymptotically optimal lengths. Two lines are given for each distribution–sample size combination. The first line gives the coverages, while the second line gives the average PI lengths with the standard deviation of the lengths in parentheses. The coverage is the proportion of the 5000 PIs that contain the test data case $Y_f = Y_{f,i}$ for $i = 1, \dots, 5000$. The last two lines of Table 1 correspond to the uniform(0,2) distribution with $n = 800$. The label $h = i$ corresponds to the $i$-step-ahead 95% prediction interval with $i = 1, 2, 3$, and 4. The coverages are near 0.95, and the simulated average lengths (1.9014, 3.1666, 3.9651, 4.6357) are near the asymptotically optimal lengths (1.90, 3.11, 3.87, 4.48).
A small vector-valued random walk simulation is also done for the large-sample 95% prediction regions using 5000 runs. We use distributions with nonsingular population covariance matrices. Let $\boldsymbol{u}_t = (u_{t1}, \dots, u_{tp})^T$, where the $u_{ti}$ are iid from the type (1) N(1,1), (2) $1 + t_5$ (where $t_5$ denotes a $t$ distribution with 5 degrees of freedom), (3) EXP(1), or (4) U(0,2) distribution. Then $\boldsymbol{e}_t = \boldsymbol{A}\boldsymbol{u}_t$, where the $p \times p$ matrix $\boldsymbol{A} = (a_{ij})$ has diagonal elements $a_{ii} = 1$ and $a_{ij} = \psi$ for $i \ne j$.
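A small R sketch of this error construction is given below for illustration; it is not the tspack.txt simulation code, and the choices of n and psi are only examples.

# Sketch of the simulated errors e_t = A u_t with a_ii = 1 and a_ij = psi (i != j).
# Illustrative only; the tspack.txt function rwprsim may differ in detail.
p <- 8; psi <- 0.354; n <- 400
A <- matrix(psi, p, p); diag(A) <- 1
U <- matrix(1 + rt(n * p, df = 5), nrow = n)   # type (2): u_ti iid 1 + t_5
E <- U %*% t(A)                                # rows of E are the error vectors e_t
Y <- apply(E, 2, cumsum)                       # vector-valued random walk with Y_0 = 0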
Table 2 shows some results for $p = 8$, giving the coverages. We roughly need $n \ge 20ph$ to obtain good coverage near 0.95. Thus, $n = 400$ is too small for $p = 8$ with $h = 3$ or $h = 4$, although the undercoverage is small for $h = 3$. Note that $\boldsymbol{\epsilon}_t = (\epsilon_{1t}, \dots, \epsilon_{8t})^T$. The value $\psi = 0$ makes the $\epsilon_{it}$ uncorrelated. Increasing $\psi$ increases the correlation $\rho = \mathrm{cor}(\epsilon_{it}, \epsilon_{jt})$, where $i \ne j$. The prediction regions are hyperellipsoids, which have volumes (not given) instead of lengths.
Simulations for the data-splitting prediction region.
The theory for the new prediction regions is simple; thus, Table 3 serves more as a verification that the programs work than as a test of the theory itself. See Zhang [31] for more simulations. The output variables include cov = the observed coverage, up $\approx$ the actual coverage $U_V/(1 + n_V)$, and mnhsq = the mean cutoff $D_{(U_V)}^2$. With 5000 runs, we expect the observed coverage to be in $[0.94, 0.96]$ if the actual coverage is close to 0.95. The random vector is $\boldsymbol{x} = \boldsymbol{A}\boldsymbol{w}$, where $\boldsymbol{x} = \boldsymbol{w} \sim N_p(\boldsymbol{0}, \boldsymbol{I}_p)$ for xtype = 3, and $\boldsymbol{x} \sim N_p(\boldsymbol{0}, \mathrm{diag}(1, \dots, p))$ for xtype = 1. For xtype = 2, $\boldsymbol{w}$ has the $w_i$ iid lognormal(0,1) with $\boldsymbol{A} = \mathrm{diag}(1, 2, \dots, p)$. The dispersion matrix types are dtype = 1 if $(T, \boldsymbol{C}) = (\overline{\boldsymbol{x}}, \boldsymbol{I}_p)$ and dtype = 2 if $(T, \boldsymbol{C}) = (\mathrm{MED}(W), \boldsymbol{I}_p)$, where $\mathrm{MED}(W)$ is the coordinate-wise median of the $\boldsymbol{x}_i$.
When xtype = 3 and dtype = 1, $(T, \boldsymbol{C}) = (\overline{\boldsymbol{x}}, \boldsymbol{I}_p)$, where $\boldsymbol{x}_i \sim N_p(\boldsymbol{0}, \boldsymbol{I}_p)$. Then $D_{(U_V)}^2$ should estimate the population percentile $\chi^2_{p, 0.95}$ if $n \ge \max(20p, 200)$ and $n_V = 100$. This result did occur in the simulations.
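A quick R check of this statement, with the illustrative choices p = 10, n = 400, and n_V = 100, is sketched below.

# Quick check: with x_i ~ N_p(0, I_p) and (T, C) = (xbar, I_p), the squared
# distances are approximately chi-square with p df, so D_(UV)^2 should be
# near qchisq(0.95, p). Illustrative choices of n, p, and n_V.
set.seed(6)
p <- 10; n <- 400; nV <- 100
x <- matrix(rnorm(n * p), ncol = p)
H <- 1:(n - nV)                                # first n - nV cases form H
TH <- colMeans(x[H, ])
D2 <- rowSums(sweep(x[-H, ], 2, TH)^2)
UV <- min(nV, ceiling((nV + 1) * 0.95))
sort(D2)[UV]        # cutoff D_(UV)^2
qchisq(0.95, p)     # population percentile it should approximate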
Table 3 gives $n$, $p$, $n_V$, a number 'xtype' corresponding to the distribution of $\boldsymbol{x}$, and a number 'dtype' corresponding to the $(T, \boldsymbol{C})$ used in the prediction region (7). High-dimensional data were used, since $p \ge n$. With $n_V = 20$, the actual coverage is $20/21 = 0.9524$; $n_V = 25$ has actual coverage $25/26 = 0.9615$; and $n_V = 50$ has actual coverage $49/51 = 0.9608$. The observed coverages in Table 3 are close to the actual coverages.

4. Discussion

The new nonparametric, asymptotically optimal $h$-step-ahead prediction intervals for the random walk appear to perform well if $n \ge 50h$. The new nonparametric $h$-step-ahead 95% prediction regions for the vector-valued random walk appear to have coverages near 0.95 for $n \ge 20ph$. The new nonparametric data-splitting prediction regions are fast, have simple theory, and have coverage $U_V/(n_V + 1) = \min(n_V, \lceil (n_V + 1)(1 - \delta) \rceil)/(n_V + 1)$.
Datasets where future data do not behave like past data are common; for such data, the prediction intervals and regions tend to perform poorly. In Example 1, cases 1–1460 appear to follow one random walk, while cases 1461–1800 follow another random walk with more variability.
Some prediction intervals for stochastic processes include Pan and Politis [32], Vidoni [33], and Vit [34]. Makridakis et al. [35] noted that a PI for the random walk, derived assuming normal errors, often failed to give good coverage. Pankratz [36] noted that the random walk model has been found to be a good model for many stock price time series.
Conformal prediction gives precise levels of coverage for one future observation, and the prediction region (7) is a conformal prediction region that can have a large volume. As an example, consider using $(T, \boldsymbol{C}) = (\mathrm{MED}(W), \boldsymbol{I}_p)$. Then the prediction region is a hypersphere centered at the coordinate-wise median. The prediction region is good if the iid $\boldsymbol{w}_i \sim N_p(\boldsymbol{\mu}, \sigma^2 \boldsymbol{I}_p)$, but if $\boldsymbol{w}_i \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ such that the highest density region is a hyperellipsoid tightly clustered about a vector in the direction of $\boldsymbol{1} = (1, 1, \dots, 1)^T$, then the prediction region (7) has a large volume compared to the highest density region.
There are many methods where prediction is useful. For example, Garg, Aggarwal, et al. [37] used support vector machines, while Garg, Belarbi, et al. [38] used Gaussian process regression. Olive [7] shows how to obtain prediction intervals when the model is $Y_i = m(\boldsymbol{x}_i) + e_i$ with iid errors. If heterogeneity is present and there are enough cases $\boldsymbol{x}_i$ with $\hat{m}(\boldsymbol{x}_i)$ near $\hat{m}(\boldsymbol{x}_f)$, we can make a prediction interval using the $Y_i$ corresponding to those $\boldsymbol{x}_i$. Graphically, in a plot of $\hat{m}(\boldsymbol{x}_i)$ versus $Y_i$ (on the vertical axis), we make a narrow vertical slice centered at $\hat{m}(\boldsymbol{x}_f)$ and then make the PI from the $Y_i$ in the slice, as in the sketch below.
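A rough R sketch of this vertical-slice idea follows; the slice half-width and the use of the shorth PI on the $Y_i$ in the slice are illustrative choices rather than the paper's exact procedure, and the sketch assumes the shorthPI function given earlier.

# Rough sketch of the vertical-slice PI: collect the Y_i whose fitted values are
# close to mhat(x_f) and apply the shorth PI to them. Assumes shorthPI() from the
# earlier sketch; the slice half-width 'eps' is an illustrative tuning choice,
# and the slice is assumed to be nonempty.
slicePI <- function(mhat, Y, mhatf, delta = 0.05, eps = NULL) {
  if (is.null(eps)) eps <- 0.5 * sd(mhat)          # illustrative slice half-width
  inslice <- abs(mhat - mhatf) <= eps              # cases in the narrow vertical slice
  shorthPI(Y[inslice], delta)
}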
Plots and simulations were conducted in R. See R Core Team [39]. Programs are in the collection of functions tspack.txt. See (http://parker.ad.siu.edu/Olive/tspack.txt), accessed on 15 December 2023. Table 1 and Table 2 used the functions rwpisim and rwprsim for the random walk simulations. The function predsim2 simulates the data-splitting prediction region for Table 3. The function predrgn2 computes the prediction region (7) using $(T, \boldsymbol{C}) = (\mathrm{MED}(W), \boldsymbol{I}_p)$. The pottery data are available from (http://parker.ad.siu.edu/Olive/sldata.txt), accessed on 15 December 2023.

Author Contributions

Conceptualization, M.G.H., L.Z. and D.J.O.; methodology, M.G.H., L.Z. and D.J.O.; writing—original draft preparation, D.J.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The EuStockMarkets dataset is available from the R software, version 4.0.3.

Acknowledgments

The authors thank the editors and referees for their work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ross, S.M. Introduction to Probability Models, 11th ed.; Academic Press: San Diego, CA, USA, 2014. [Google Scholar]
  2. Frey, J. Data-driven nonparametric prediction intervals. J. Stat. Plan. Inference 2013, 143, 1039–1048. [Google Scholar] [CrossRef]
  3. Grübel, R. The length of the shorth. Ann. Stat. 1988, 16, 619–628. [Google Scholar] [CrossRef]
  4. Einmahl, J.H.J.; Mason, D.M. Generalized quantile processes. Ann. Stat. 1992, 20, 1062–1078. [Google Scholar] [CrossRef]
  5. Chen, M.H.; Shao, Q.M. Monte Carlo estimation of Bayesian credible and HPD intervals. J. Comput. Graph. Stat. 1999, 8, 69–92. [Google Scholar]
  6. Olive, D.J. Asymptotically optimal regression prediction intervals and prediction regions for multivariate data. Intern. J. Stat. Probab. 2013, 2, 90–100. [Google Scholar] [CrossRef]
  7. Olive, D.J. Applications of hyperellipsoidal prediction regions. Stat. Pap. 2018, 59, 913–931. [Google Scholar] [CrossRef]
  8. Beran, R. Calibrating prediction regions. J. Am. Stat. Assoc. 1990, 85, 715–723. [Google Scholar] [CrossRef]
  9. Beran, R. Probability-centered prediction regions. Ann. Stat. 1993, 21, 1967–1981. [Google Scholar] [CrossRef]
  10. Fontana, M.; Zeni, G.; Vantini, S. Conformal prediction: A unified review of theory and new challenges. Bernoulli 2023, 29, 1–23. [Google Scholar] [CrossRef]
  11. Guan, L. Localized conformal prediction: A generalized inference framework for conformal prediction. Biometrika 2023, 110, 33–50. [Google Scholar] [CrossRef]
  12. Steinberger, L.; Leeb, H. Conditional predictive inference for stable algorithms. Ann. Stat. 2023, 51, 290–311. [Google Scholar] [CrossRef]
  13. Tian, Q.; Nordman, D.J.; Meeker, W.Q. Methods to compute prediction intervals: A review and new results. Stat. Sci. 2022, 37, 580–597. [Google Scholar] [CrossRef]
  14. Pelawa Watagoda, L.C.R.; Olive, D.J. Comparing six shrinkage estimators with large sample theory and asymptotically optimal prediction intervals. Stat. Pap. 2021, 62, 2407–2431. [Google Scholar] [CrossRef]
  15. Lei, J.; G’Sell, M.; Rinaldo, A.; Tibshirani, R.J.; Wasserman, L. Distribution-free predictive inference for regression. J. Am. Stat. Assoc. 2018, 113, 1094–1111. [Google Scholar] [CrossRef]
  16. Pelawa Watagoda, L.C.R.; Olive, D.J. Bootstrapping multiple linear regression after variable selection. Stat. Pap. 2021, 62, 681–700. [Google Scholar] [CrossRef]
  17. Rajapaksha, K.W.G.D.H.; Olive, D.J. Wald type tests with the wrong dispersion matrix. Commun. Stat. Theory Methods 2022. [Google Scholar] [CrossRef]
  18. Rathnayake, R.C.; Olive, D.J. Bootstrapping some GLMs and survival regression models after variable selection. Commun. Stat. Theory Methods 2023, 52, 2625–2645. [Google Scholar] [CrossRef]
  19. Mykland, P.A. Financial options and statistical prediction intervals. Ann. Stat. 2003, 31, 1413–1438. [Google Scholar] [CrossRef]
  20. Niwitpong, S.; Panichkitkosolkul, W. Prediction interval for an unknown mean Gaussian AR(1) process following unit root test. Manag. Sci. Stat Decis. 2009, 6, 43–51. [Google Scholar]
  21. Panichkitkosolkul, W.; Niwitpong, S. On multistep-ahead prediction intervals following unit root tests for a Gaussian AR(1) process with additive outliers. Appl. Math. Sci. 2011, 5, 2297–2316. [Google Scholar]
  22. Wolf, M.; Wunderli, D. Bootstrap joint prediction regions. J. Time Ser. Anal. 2015, 36, 352–376. [Google Scholar] [CrossRef]
  23. Kim, J.H. Asymptotic and bootstrap prediction regions for vector autoregression. Intern. J. Forecast. 1999, 15, 393–403. [Google Scholar] [CrossRef]
  24. Kim, J.H. Bias-corrected bootstrap prediction regions for vector autoregression. J. Forecast. 2004, 23, 141–154. [Google Scholar] [CrossRef]
  25. Hyndman, R.J. Highest density forecast regions for non-linear and non-normal time series models. J. Forecast. 1995, 14, 431–441. [Google Scholar] [CrossRef]
  26. Haile, M.G. Inference for Time Series after Variable Selection. Ph.D. Thesis, Southern Illinois University, Carbondale, IL, USA, 2022. Available online: http://parker.ad.siu.edu/Olive/shaile.pdf (accessed on 15 December 2023).
  27. Wisseman, S.U.; Hopke, P.K.; Schindler-Kaudelka, E. Multielemental and multivariate analysis of Italian terra sigillata in the world heritage museum, university of Illinois at Urbana-Champaign. Archeomaterials 1987, 1, 101–107. [Google Scholar]
  28. Gray, H.L.; Odell, P.L. On sums and products of rectangular variates. Biometrika 1966, 53, 615–617. [Google Scholar] [CrossRef]
  29. Marengo, J.E.; Farnsworth, D.L.; Stefanic, L. A geometric derivation of the Irwin-Hall distribution. Intern. J. Math. Math. Sci. 2017, 2017, 3571419. [Google Scholar] [CrossRef]
  30. Roach, S.A. The frequency distribution of the sample mean where each member of the sample is drawn from a different rectangular distribution. Biometrika 1963, 50, 508–513. [Google Scholar] [CrossRef]
  31. Zhang, L. Data Splitting Inference. Ph.D. Thesis, Southern Illinois University, Carbondale, IL, USA, 2022. Available online: http://parker.ad.siu.edu/Olive/slinglingphd.pdf (accessed on 15 December 2023).
  32. Pan, L.; Politis, D.N. Bootstrap prediction intervals for Markov processes. Comput. Stat. Data Anal. 2016, 100, 467–494. [Google Scholar] [CrossRef]
  33. Vidoni, P. Improved prediction intervals for stochastic process models. J. Time Ser. Anal. 2004, 25, 137–154. [Google Scholar] [CrossRef]
  34. Vit, P. Interval prediction for a Poisson process. Biometrika 1973, 60, 667–668. [Google Scholar] [CrossRef]
  35. Makridakis, S.; Hibon, M.; Lusk, E.; Belhadjali, M. Confidence intervals: An empirical investigation of the series in the M-competition. Intern. J. Forecast. 1987, 3, 489–508. [Google Scholar] [CrossRef]
  36. Pankratz, A. Forecasting with Univariate Box-Jenkins Models; Wiley: New York, NY, USA, 1983. [Google Scholar]
  37. Garg, A.; Aggarwal, P.; Aggarwal, Y.; Belarbi, M.O.; Chalak, H.D.; Tounsi, A.; Gulia, R. Machine learning models for predicting the compressive strength of concrete containing nano silica. Comput. Concr. 2022, 30, 33–42. [Google Scholar]
  38. Garg, A.; Belarbi, M.-O.; Chalak, H.D.; Tounsi, A.; Li, L.; Singh, A.; Mukhopadhyay, T. Predicting elemental stiffness matrix of fg nanoplates using Gaussian process regression based surrogate model in framework of layerwise model. Eng. Anal. Bound. Elem. 2022, 143, 779–795. [Google Scholar] [CrossRef]
  39. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020. [Google Scholar]
Figure 1. PI plot of the DAX dataset.
Table 1. Random walk 95% PIs. For each (n, dist) combination, the first line gives the coverages and the second line gives the average PI lengths with the standard deviation (sd) of the lengths in parentheses.
n     dist   h = 1              h = 2                h = 3                h = 4
100   N      0.9528             0.9578               0.9456               0.9220
100          4.1683 (0.3923)    6.3504 (0.9390)      7.2516 (1.2066)      7.8247 (1.4372)
100   C      0.9606             0.9656               0.9472               0.9262
100          47.33 (39.38)      1075.43 (41,234.9)   1079.36 (41,233.0)   1065.19 (41,233.7)
100   EXP    0.9552             0.9562               0.9408               0.9242
100          3.6615 (0.6325)    6.3141 (1.4891)      7.1391 (1.6336)      7.6647 (1.8121)
100   U      0.9486             0.9584               0.9408               0.9212
100          1.9023 (0.0408)    3.2878 (0.2577)      3.9791 (0.5093)      4.4074 (0.6977)
400   N      0.9526             0.9506               0.9556               0.9508
400          4.0646 (0.1868)    5.7753 (0.3813)      7.2431 (0.6028)      8.3282 (0.7921)
400   C      0.9600             0.9622               0.9654               0.9632
400          32.7277 (8.3139)   71.7138 (28.29)      133.9884 (79.20)     188.3578 (146.52)
400   EXP    0.9582             0.9598               0.9602               0.9578
400          3.3131 (0.2598)    5.1497 (0.4369)      6.7619 (0.6877)      7.9367 (0.8970)
400   U      0.9542             0.9534               0.9568               0.9558
400          1.9028 (0.0193)    3.1602 (0.1268)      4.0569 (0.2564)      4.7092 (0.3808)
800   N      0.9514             0.9520               0.9536               0.9514
800          4.0205 (0.1334)    5.7498 (0.2720)      7.0086 (0.4012)      8.1579 (0.5338)
800   C      0.9520             0.9550               0.9516               0.9522
800          29.7122 (4.9301)   65.2292 (16.21)      98.9266 (31.08)      144.3277 (57.72)
800   EXP    0.9564             0.9550               0.9518               0.9596
800          3.2000 (0.1727)    5.0514 (0.3100)      6.4202 (0.4333)      7.6747 (0.5787)
800   U      0.9506             0.9522               0.9522               0.9518
800          1.9014 (0.0132)    3.1666 (0.0908)      3.9651 (0.1835)      4.6357 (0.2693)
Table 2. Random walk 95% prediction regions, coverages, p = 8.
n      ψ       type   h = 1    h = 2    h = 3    h = 4
400    0       1      0.9426   0.9438   0.9370   0.9214
400    0       2      0.9490   0.9502   0.9444   0.9270
400    0       3      0.9466   0.9530   0.9476   0.9392
400    0       4      0.9416   0.9446   0.9388   0.9216
400    0.354   1      0.9514   0.9446   0.9456   0.9186
400    0.354   2      0.9450   0.9572   0.9460   0.9290
400    0.354   3      0.9556   0.9546   0.9496   0.9314
400    0.354   4      0.9416   0.9412   0.9340   0.9182
400    0.9     1      0.9484   0.9462   0.9424   0.9198
400    0.9     2      0.9524   0.9502   0.9480   0.9310
400    0.9     3      0.9482   0.9576   0.9546   0.9392
400    0.9     4      0.9458   0.9376   0.9346   0.9228
800    0       1      0.9458   0.9450   0.9460   0.9484
800    0       2      0.9516   0.9554   0.9514   0.9506
800    0       3      0.9494   0.9508   0.9480   0.9544
800    0       4      0.9432   0.9408   0.9438   0.9418
800    0.354   1      0.9456   0.9464   0.9478   0.9450
800    0.354   2      0.9474   0.9550   0.9540   0.9488
800    0.354   3      0.9534   0.9516   0.9532   0.9536
800    0.354   4      0.9494   0.9466   0.9480   0.9518
800    0.9     1      0.9436   0.9482   0.9478   0.9450
800    0.9     2      0.9500   0.9494   0.9512   0.9514
800    0.9     3      0.9552   0.9520   0.9514   0.9484
800    0.9     4      0.9474   0.9450   0.9494   0.9464
1600   0       1      0.9506   0.9516   0.9476   0.9464
1600   0       2      0.9522   0.9534   0.9532   0.9514
1600   0       3      0.9496   0.9530   0.9524   0.9522
1600   0       4      0.9418   0.9428   0.9414   0.9430
1600   0.354   1      0.9506   0.9472   0.9504   0.9502
1600   0.354   2      0.9440   0.9520   0.9488   0.9502
1600   0.354   3      0.9506   0.9572   0.9574   0.9570
1600   0.354   4      0.9488   0.9418   0.9444   0.9462
1600   0.9     1      0.9510   0.9496   0.9476   0.9458
1600   0.9     2      0.9492   0.9500   0.9532   0.9474
1600   0.9     3      0.9524   0.9558   0.9548   0.9540
1600   0.9     4      0.9450   0.9508   0.9452   0.9500
Table 3. Data-splitting nominal 95% prediction region.
n     p     nV    xtype   dtype   cov
50    100   20    1       1       0.9560
50    100   20    2       1       0.9466
50    100   20    3       1       0.9504
50    100   20    1       2       0.9558
50    100   20    2       2       0.9508
50    100   20    3       2       0.9522
100   100   50    1       1       0.9620
100   100   50    2       1       0.9622
100   100   50    3       1       0.9596
100   100   50    1       2       0.9638
100   100   50    2       2       0.9578
100   100   50    3       2       0.9638
100   100   25    1       1       0.9588
100   100   25    2       1       0.9658
100   100   25    3       1       0.9568
100   100   25    1       2       0.9622
100   100   25    2       2       0.9672
100   100   25    3       2       0.9662
