Model-Based and Model-Free point prediction algorithms for locally stationary random fields

The Model-free Prediction Principle has been successfully applied to general regression problems, as well as problems involving stationary and locally stationary time series. In this paper we demonstrate how Model-free Prediction can be applied to handle random fields that are only locally stationary, i.e., fields that can be assumed stationary only over a limited part of their entire region of definition. We construct one-step-ahead point predictors and compare the performance of Model-free to Model-based prediction using models that incorporate a trend and/or heteroscedasticity. Both aspects of the paper, Model-free and Model-based, are novel in the context of random fields that are locally (but not globally) stationary. We demonstrate the application of our Model-based and Model-free point prediction methods to synthetic data as well as images from the CIFAR-10 dataset; in the latter case we show that our best Model-free point predictors outperform those obtained using Model-based prediction.


Introduction
Consider a real-valued random field dataset {Y_t, t ∈ Z²} defined over a 2-D index set D, e.g., pixel values over an image or satellite data observed on an ocean surface. It may be unrealistic to assume that the stochastic structure of such a random field Y_t has stayed invariant over the entire region of definition D; hence, we cannot assume that {Y_t} is stationary. It is therefore more realistic to assume a slowly changing stochastic structure, i.e., a locally stationary model. Discussions of such models for locally stationary time series can be found in [18], [19], [4] and [3]. In the context of random fields, locally stationary models have been proposed in [14] and references therein, where the data Y_t are defined over a continuous subset S of R^d. In this paper we assume a locally stationary model for random fields Y_t ∈ R defined over t ∈ S where S ⊂ Z^d, d = 2. Given data Y_{t_1}, Y_{t_2}, ..., Y_{t_n}, our objective is to perform point prediction for a future unobserved data point Y_{t_{n+1}}. Here t_1, t_2, ..., t_n, t_{n+1} ∈ Z² denote the coordinates of the random field over the 2-D index set D; the notion of a "future" data point at a coordinate of a random field, for purposes of predictive inference over t ∈ Z², is defined in Section 2. Algorithms for point prediction and prediction intervals of locally stationary time series, with applications to both synthetic and real-life datasets, have been discussed in [6]. Our work in this paper extends this framework to point prediction over locally stationary random fields, with applications involving both synthetic and real-life image data.
The usual approach for dealing with nonstationary series is to assume that the data can be decomposed as the sum of three components: µ(t) + S_t + W_t, where µ(t) is a deterministic trend function, S_t is a seasonal (periodic) series, and {W_t} is (strictly) stationary with mean zero; this is the 'classical' decomposition of a time series into trend, seasonal and stationary components, see e.g. [1], which can also be used for the decomposition of nonstationary random field data. The seasonal (periodic) component, be it random or deterministic, can be easily estimated and removed; having done that, the 'classical' decomposition simplifies to the following model with additive trend, i.e.,

$$Y_t = \mu(t) + W_t \quad (1)$$

which can be generalized to accommodate a coordinate-changing variance as well, i.e.,

$$Y_t = \mu(t) + \sigma(t)\, W_t. \quad (2)$$

In both models above, the series {W_t} is assumed to be (strictly) stationary and weakly dependent, e.g. strong mixing, and to satisfy E W_t = 0; in model (2), it is also assumed that Var(W_t) = 1. As usual, the deterministic functions µ(·) and σ(·) are unknown but assumed to belong to a class of functions that is either finite-dimensional (parametric) or not (nonparametric); we will focus on the latter, in which case it is customary to assume that µ(·) and σ(·) possess some degree of smoothness, i.e., that µ(t) and σ(t) change smoothly (and slowly) with t.
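As a concrete illustration of model (2), the following minimal sketch (Python; every numerical choice here, the trend, the scale and the AR-type coefficients, is ours for illustration and not from the paper) simulates a locally stationary field by modulating a causally generated stationary field:

```python
import numpy as np

# Minimal sketch: simulate model (2), Y_t = mu(t) + sigma(t) * W_t, on an
# N x N grid.  mu, sigma and the AR-type coefficients are illustrative only.
rng = np.random.default_rng(0)
N = 60
t1, t2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
mu = 0.05 * t1 + 0.02 * t2                    # smooth, slowly varying trend
sigma = 1.0 + 0.5 * np.sin(np.pi * t1 / N)    # smooth positive scale

# Stationary field W, generated causally in raster (lexicographic) order so
# that each site depends only on 'past' sites, as in the NSHP setup below.
W = np.zeros((N, N))
v = rng.normal(0.0, 0.4, size=(N, N))
for i in range(N):
    for j in range(N):
        W[i, j] = v[i, j]
        if i > 0:
            W[i, j] += 0.5 * W[i - 1, j]
        if j > 0:
            W[i, j] += 0.3 * W[i, j - 1]

W = (W - W.mean()) / W.std()                  # crude standardization, Var ~ 1
Y = mu + sigma * W
```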
As far as capturing the first two moments of Y_t, models (1) and (2) are considered general and flexible, especially when µ(·) and σ(·) are not parametrically specified, and have been studied extensively in the case of time series; see e.g. [21], [22]. However, it may be that the skewness and/or kurtosis of Y_t changes with t, in which case centering and studentization alone cannot render the problem stationary. To see why, note that under model (2), E Y_t = µ(t) and Var Y_t = σ²(t); hence, the studentized quantity

$$W_t = \big(Y_t - \mu(t)\big)/\sigma(t) \quad (3)$$

cannot be (strictly) stationary unless the skewness and kurtosis of Y_t are constant. Furthermore, it may be the case that the nonstationarity is due to a feature of the m-th marginal distribution not being constant for some m ≥ 1; e.g., perhaps the correlation Corr(Y_{t_j}, Y_{t_{j+1}}), where t_j, t_{j+1} ∈ Z², changes smoothly (and slowly) with t_j. Notably, models (1) and (2) only concern themselves with features of the 1st marginal distribution.
For all the above reasons, it seems valuable to develop a methodology for the statistical analysis of nonstationary random fields that does not rely on simple additive models such as (1) and (2). Fortunately, the Model-free Prediction Principle of [16], [17] suggests a way to accomplish Model-free inference in the general setting of random fields that are only locally stationary. We first describe the region over which predictive inference will be performed. For this purpose we adopt the framework proposed in [2] and consider the random fields discussed in this paper to be defined over a subset of the non-symmetric half-plane (NSHP), denoted H_∞. Figure 1 shows an NSHP centered at (0, 0). The NSHP can also be centered at any other point t by translation, i.e., H_∞(t) = {t − s : s ∈ H_∞}. Such non-symmetric half-planes have been used previously for specifying causal 2-D AR models [2]. In such cases, a causal 2-D AR model with region of support (ROS) H_p ⊂ H_∞ can be defined as in equation (5) below:

$$Y_{t_1, t_2} = \sum_{(j,k) \in H_p} a_{j,k}\, Y_{t_1 - j,\, t_2 - k} + v_{t_1, t_2}. \quad (5)$$

Here H_p = {(j, k) | j = 1, 2, ..., p; k = 0, ±1, ..., ±p} ∪ {(0, k) | k = 1, 2, ..., p}, and v_{t_1,t_2} is a 2-D white noise process with mean 0 and variance σ² > 0. Based on [7], a 2-D AR process with ROS S is causal if there exists a subset C of Z² satisfying the following conditions:
• the set C consists of 2 rays emanating from the origin and the points between the rays;
• the angle between the 2 rays is strictly less than 180 degrees.
Since H_p ⊂ H_∞ satisfies these conditions, the 2-D AR process in (5) is causal. Therefore we can use this framework to describe a causal random field defined over the NSHP and perform predictive inference on it. Given this, our setup for point prediction of random fields is described below.
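The index bookkeeping for H_p is easy to get wrong, so the following small sketch (helper names are ours) constructs the region of support written above and verifies that every lag points to a lexicographically 'past' site:

```python
def nshp_ros(p):
    """Region of support H_p on the non-symmetric half-plane, as written in
    the text: full rows j = 1..p with k = -p..p, plus the partial row j = 0
    with k = 1..p.  A sketch of the index bookkeeping only."""
    ros = [(j, k) for j in range(1, p + 1) for k in range(-p, p + 1)]
    ros += [(0, k) for k in range(1, p + 1)]
    return ros

def lex_less(s, t):
    """Lexicographic ordering: s precedes t iff s lies in t's 'past'."""
    return s[0] < t[0] or (s[0] == t[0] and s[1] < t[1])

H2 = nshp_ros(2)
print(len(H2))  # 2 rows of 5 lags, plus 2 lags on the partial row: 12
# every lag (j, k) in H_p corresponds to a site preceding t lexicographically:
print(all(lex_less((-j, -k), (0, 0)) for (j, k) in H2))  # True
```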
Consider random field data {Y_t, t ∈ E}, where E can be any finite subset of Z², e.g., the pixel grid of an image. We wish to predict a "future" value Y_t = Y_{t_1,t_2}; this value is determined using the data defined over the region E_{t,n} shown in Figure 2.

Model-based inference
Throughout Section 3, we will assume model (2), which includes model (1) as a special case, together with a nonparametric assumption on the smoothness of µ(·) and σ(·).
For j ≺ J, define F_j^J(Y) to be the information set {Y_j, ..., Y_J} (a σ-field), and note that the information sets F_{−∞}^t(Y) and F_{−∞}^t(W) are identical for any t, i.e., knowledge of {Y_s for s ≺ t} is equivalent to knowledge of {W_s for s ≺ t}. Here µ(·) and σ(·) are assumed known, and the symbol ≺ denotes lexicographic ordering on the region of support of the random field as described in Section 2. Hence, for large n, and due to the assumption that W_t is weakly dependent (and therefore the same must be true for Y_t as well), the following large-sample approximation is useful:

$$E\big(Y_{t_{n+1}} \mid \mathcal{F}^{t_n}_{-\infty}(Y)\big) \approx \mu(t_{n+1}) + \sigma(t_{n+1})\, E\big(W_{t_{n+1}} \mid W_{t_n}\big), \quad (6)$$

where W_{t_n} = (W_{t_1}, ..., W_{t_n})′. We therefore need to construct an approximation for E(W_{t_{n+1}} | W_{t_n}). For this purpose, the L²-optimal linear predictor of W_{t_{n+1}} can be obtained by fitting a (causal) AR(p, q) model to the data W_{t_1}, ..., W_{t_n}, with p, q chosen by minimizing AIC, BIC or a related criterion as described in [2]; this entails fitting the model

$$W_{t_1, t_2} = \sum_{(j,k) \in H_{p,q}} a_{j,k}\, W_{t_1 - j,\, t_2 - k} + v_{t_1, t_2},$$

where H_{p,q} = {(j, k) | j = 1, ..., p; k = 0, ±1, ..., ±q} ∪ {(0, k) | k = 1, ..., q}, and v_{t_1,t_2} is a 2-D white noise process, i.e., an uncorrelated sequence, with mean 0 and variance σ² > 0.
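As an illustration of the fitting step, the sketch below estimates the coefficients a_{j,k} of a causal 2-D AR over an NSHP region of support by ordinary least squares for a fixed order (p, q); the paper selects the order by AIC/BIC as in [2], and the exact shape of H_{p,q} here is our assumption:

```python
import numpy as np

def fit_2d_ar(W, p, q):
    """Least-squares fit of a causal 2-D AR over an NSHP region of support
    H_{p,q} = {(j,k): j=1..p, k=-q..q} U {(0,k): k=1..q}.
    A sketch for a fixed order (p, q); the paper chooses (p, q) by
    minimizing AIC/BIC as in [2]."""
    lags = [(j, k) for j in range(1, p + 1) for k in range(-q, q + 1)]
    lags += [(0, k) for k in range(1, q + 1)]
    n1, n2 = W.shape
    rows, y = [], []
    for i in range(p, n1):
        for j in range(q, n2 - q):      # keep every lagged index inside the grid
            rows.append([W[i - a, j - b] for (a, b) in lags])
            y.append(W[i, j])
    X, y = np.asarray(rows), np.asarray(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return {"lags": lags, "coef": coef, "sigma2": float(resid.var())}
```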

Trend estimation and practical prediction
To construct the L²-optimal predictor (6), we need to estimate the smooth trend µ(·) and variance σ(·) in a nonparametric fashion; this can easily be accomplished via kernel smoothing using 2-D kernels, see e.g. [11], [12], [15]. Note, furthermore, that the problem of predicting Y_{t_{n+1}}, which involves estimating the functions µ(·) and σ(·) at the edge of the observed region, is essentially a boundary problem. In such cases, it is well known that local linear fitting has better properties, in particular smaller bias, than kernel smoothing, which is tantamount to local constant fitting; see [8], [9], or [15]. Note that for time series problems {Y_t, t ∈ Z}, local linear nonparametric estimation approximates the trend locally by a straight line, whereas for the random fields {Y_t, t ∈ Z²} discussed in this paper, local linear estimation approximates the trend locally by a plane.
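A minimal sketch of such a one-sided local linear fit follows: a plane is fit by weighted least squares over the lexicographically 'past' sites, with a 2-D Gaussian kernel. The helper name and the exact bandwidth convention are ours, not the paper's:

```python
import numpy as np

def ll_plane_onesided(coords, Y, t, b):
    """One-sided local linear trend estimate at site t: fit a plane
    mu(s) ~ beta0 + beta1*(s1-t1) + beta2*(s2-t2) by weighted least squares
    over lexicographically 'past' sites only, then report the plane's value
    at t (i.e., beta0).  Sketch only; assumes enough past sites exist."""
    coords = np.asarray(coords, dtype=float)
    d = coords - np.asarray(t, dtype=float)
    past = (d[:, 0] < 0) | ((d[:, 0] == 0) & (d[:, 1] < 0))   # sites s with s ≺ t
    d, y = d[past], np.asarray(Y, dtype=float)[past]
    w = np.exp(-0.5 * (d ** 2).sum(axis=1) / b ** 2)          # 2-D Gaussian kernel
    X = np.column_stack([np.ones(len(d)), d])
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta[0]
```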
Remark 3.1 (One-sided estimation). Since the goal is predictive inference on Y_{t_{n+1}}, local constant and/or local linear fitting must be performed in a one-sided way. Furthermore, to compute Ē(W_{t_{n+1}} | W_{t_n}) in eq. (9) we need access to the stationary data W_{t_1}, ..., W_{t_n}. The W_t's are not directly observed but, much like residuals in a regression, they can be reconstructed via eq. (3) with estimates of µ(t) and σ(t) plugged in. What is important is that the way W_t is reconstructed/estimated by (say) Ŵ_t must remain the same for all t; otherwise the reconstructed data Ŵ_{t_1}, ..., Ŵ_{t_n} cannot be considered stationary. Since W_t can only be estimated in a one-sided way for t close to t_n, the same one-sided scheme must also be used for t in the middle of the dataset, even though in that case two-sided estimation is possible.
By analogy to model-based regression as described in [16], the one-sided Nadaraya-Watson (NW) kernel estimators of µ(t) and σ(t) can be defined in two ways. The bandwidth parameter b will be assumed to satisfy b → ∞ as n → ∞ with b/n → 0, i.e., b is analogous to the product hn where h is the usual bandwidth in nonparametric regression. We will assume throughout that K(·) is a nonnegative, symmetric 2-D kernel function.

1. NW-Regular fitting: let t_k ∈ [t_1, t_n] and define

$$\hat\mu(t_k) = \frac{\sum_{i:\, t_i \preceq t_k} Y_{t_i}\, K\big((t_k - t_i)/b\big)}{\sum_{i:\, t_i \preceq t_k} K\big((t_k - t_i)/b\big)} \quad \text{and} \quad \hat\sigma^2(t_k) = \frac{\sum_{i:\, t_i \preceq t_k} Y_{t_i}^2\, K\big((t_k - t_i)/b\big)}{\sum_{i:\, t_i \preceq t_k} K\big((t_k - t_i)/b\big)} - \hat\mu^2(t_k).$$

Using μ̂(t_k) and σ̂(t_k) we can now define the fitted residuals by Ŵ_{t_k} = (Y_{t_k} − μ̂(t_k))/σ̂(t_k).

2. NW-Predictive fitting (delete-1): let μ̃(t_k) and σ̃(t_k) be defined exactly as above but with the sums running over {i : t_i ≺ t_k}, i.e., excluding the point Y_{t_k} itself. Using μ̃(t_k) and σ̃(t_k) we can now define the predictive residuals by W̃_{t_k} = (Y_{t_k} − μ̃(t_k))/σ̃(t_k).

Similarly, the one-sided local linear (LL) fitting estimators of µ(t_k) and σ(t_k) can be defined in two ways.
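The following sketch implements the one-sided NW estimates with both variants; the delete_self flag switches between the regular (fitted) and predictive (delete-1) versions. The second-moment route to σ̂ matches the formulas above, while the function name and floor on the variance are our own conventions:

```python
import numpy as np

def nw_onesided(coords, Y, t, b, delete_self=False):
    """One-sided Nadaraya-Watson estimates (mu_hat, sigma_hat) at site t,
    using sites s ⪯ t (strictly s ≺ t when delete_self=True, which gives the
    'predictive' delete-1 variant).  A sketch of the estimators in the text."""
    coords = np.asarray(coords, dtype=float)
    d = coords - np.asarray(t, dtype=float)
    past = (d[:, 0] < 0) | ((d[:, 0] == 0) & (d[:, 1] < 0))
    if not delete_self:
        past |= (d[:, 0] == 0) & (d[:, 1] == 0)
    y = np.asarray(Y, dtype=float)[past]
    w = np.exp(-0.5 * (d[past] ** 2).sum(axis=1) / b ** 2)
    w /= w.sum()
    mu = w @ y
    var = max(w @ (y ** 2) - mu ** 2, 1e-12)   # guard against tiny negatives
    return mu, np.sqrt(var)

# Fitted residuals W_hat use delete_self=False;
# predictive residuals W_tilde use delete_self=True.
```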
Using one of the above four methods (NW vs. LL, regular vs. predictive) gives estimates of the quantities needed to compute the L²-optimal predictor (6). In order to approximate E(W_{t_{n+1}} | W_{t_n}), one treats the proxies Ŵ_{t_k} or W̃_{t_k} as if they were the true W_{t_k} and proceeds as outlined in Section 3.1. The bandwidth b in all four algorithms described above can be determined by cross-validation as described in Section 5.

Model-free inference
Model (2) is a flexible way to account for a spatially changing mean and variance of Y_t. However, nothing precludes the random field {Y_t, t ∈ Z²} from having nonstationarity in its third (or higher) moment, and/or in some other feature of its m-th marginal distribution. A way to address this difficulty, and at the same time give a fresh perspective to the problem, is provided by the Model-free Prediction Principle of Politis (2013, 2015) [16], [17].
A convenient way to ensure both the smoothness and data-based consistent estimation of L(Y_{t_k}, Y_{t_{k−1}}, ..., Y_{t_{k−m+1}}) is to assume that, for all t_k,

$$Y_{t_k} = f_{t_k}(W_{t_k}) \quad (33)$$

for some function f_{t_k}(w) that is smooth in both arguments t_k and w, and some strictly stationary and weakly dependent, univariate series W_{t_k}; without loss of generality, we may assume that W_{t_k} is a Gaussian series.
In fact, Eq. (33) with f_{t_k}(·) not depending on t_k is a familiar assumption in studying non-Gaussian and/or long-range dependent stationary processes; see e.g. [20]. By allowing f_{t_k}(·) to vary smoothly (and slowly) with t_k, Eq. (33) can be used to describe a rather general class of locally stationary processes. Note that model (2) is a special case of Eq. (33) with m = 1 and the function f_{t_k}(w) being affine/linear in w. Thus, for concreteness and easy comparison with the model-based case of Eq. (2), we will focus in the sequel on the case m = 1. For reference, model-free estimators for point prediction and prediction intervals in the case of locally stationary time series with m = 1 have been discussed in [6].

Constructing the theoretical transformation
Hereafter, we adopt the setup of Eq. (33) with m = 1, and let D_t(y) = P{Y_t ≤ y} denote the 1st marginal distribution of the random field {Y_t}. Throughout Section 4, the default assumption will be that D_t(y) is (absolutely) continuous in y for all t.
We now define new variables via the probability integral transform, i.e., let

$$U_{t_k} = D_{t_k}(Y_{t_k}) \ \text{ for } k = 1, \dots, n, \quad (34)$$

so that U_{t_1}, ..., U_{t_n} are Uniform(0, 1) random variables, and let

$$Z_{t_k} = \Phi^{-1}(U_{t_k}), \quad (35)$$

where Φ denotes the standard normal distribution function, so that Z_{t_1}, ..., Z_{t_n} are standard normal. In practice, D_{t_k}(·) must be estimated from the data; fitted and predictive one-sided kernel estimators can be denoted D̂_{t_k}(y) and D̃_{t_k}(y) respectively. Local linear estimators of D_{t_k}(y) exhibit better behavior for estimation at the boundary, e.g., smaller bias, than either D̂_{t_k}(y) or D̃_{t_k}(y). However, there is no guarantee that the local linear estimators will be proper distribution functions as functions of y, i.e., nondecreasing in y with a left limit of 0 and a right limit of 1; see [15] for a discussion.
One proposed solution, put forward by [10], involves a straightforward adjustment to the local linear estimator of a conditional distribution function that maintains its favorable asymptotic properties. The local linear versions of D̂_{t_k}(y) and D̃_{t_k}(y) adjusted via Hansen's (2004) proposal are given by

$$\hat D^{LLH}_{t_k}(y) = \sum_{i=1}^{T} w_i^{*}\, \mathbf{1}\{Y_{t_i} \le y\}, \quad \text{where} \quad w_i^{*} = \frac{\max(w_i, 0)}{\sum_{j=1}^{T} \max(w_j, 0)}.$$

The adjusted weights w_i^* are derived from the weights w_i described in equations (24) and (32) for the fitted and predictive cases. As with eqs. (37) and (38), we can let T = k or T = k − 1 in the above, leading to fitted vs. predictive local linear estimators of D_{t_k}(y), denoted D̂^{LLH}_{t_k}(y) and D̃^{LLH}_{t_k}(y) respectively.
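A sketch of the adjustment step follows; the weights themselves would come from the local linear fit of eqs. (24)/(32), which we take as given here:

```python
import numpy as np

def hansen_adjust(w):
    """Hansen's (2004) adjustment for local linear CDF weights: zero out the
    negative weights and renormalize the rest, which keeps the resulting
    estimator a proper (nondecreasing) distribution function."""
    w = np.where(np.asarray(w, dtype=float) > 0, w, 0.0)
    return w / w.sum()

def ll_cdf_hansen(w_ll, Y_past, y_grid):
    """Adjusted local linear estimate of D_t(y) on a grid of y values: a
    weighted empirical CDF of the past observations.  Sketch only."""
    w = hansen_adjust(w_ll)
    Y_past = np.asarray(Y_past, dtype=float)
    return np.array([(w * (Y_past <= y)).sum() for y in y_grid])
```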

Uniformization using Monotone Local Linear Distribution Estimation
Hansen's (2004) proposal replaces negative weights by zeros and then renormalizes the remaining nonnegative weights.
The problem here is that if estimation is performed at the boundary (as is the case with one-step-ahead prediction of random fields), negative weights are crucially needed in order to ensure that the extrapolation takes place with minimal bias. A recent proposal by [5] maintains monotonicity while keeping the negative weights; it works with the local linear estimate of the density d_{t_k}(y) = D'_{t_k}(y),

$$\hat d^{LL}_{t_k}(y) = \frac{1}{h_0} \sum_{j=1}^{T} w_j\, \lambda\!\Big(\frac{y - Y_{t_j}}{h_0}\Big),$$

where λ(y) is the derivative of Λ(y) and the weights w_j can be derived based on equations (24) and (32) for the fitted and predictive cases.

The monotone estimator D̂^{LLM}_{t_k}(y) is then obtained in three steps:

1. Compute the (possibly non-proper) density estimate d̂^{LL}_{t_k}(y) above.
2. Replace its negative part by zero, i.e., let d̂⁺_{t_k}(y) = max(d̂^{LL}_{t_k}(y), 0).
3. To make the above a proper density function, renormalize it to area one, i.e., let

$$\hat d^{LLM}_{t_k}(y) = \frac{\hat d^{+}_{t_k}(y)}{\int \hat d^{+}_{t_k}(s)\, ds}, \qquad \hat D^{LLM}_{t_k}(y) = \int_{-\infty}^{y} \hat d^{LLM}_{t_k}(s)\, ds.$$

The above modification of the local linear estimator allows one to maintain monotonicity while retaining the negative weights that are helpful in problems involving estimation at the boundary. As with eqs. (37) and (38), we can let T = k or T = k − 1 in the above, leading to fitted vs. predictive local linear estimators of D_{t_k}(y) that are monotone.
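A sketch of this three-step correction on a finite grid of y values; the Gaussian choice for λ and the trapezoid-rule integration are our assumptions:

```python
import numpy as np

def monotone_ll_cdf(w, Y_past, y_grid, h0):
    """Monotone LL distribution estimate in the spirit of [5]: form the
    density estimate (the weights w may be negative), clip its negative
    part, renormalize to unit area, and integrate to a proper CDF."""
    w = np.asarray(w, dtype=float)
    y_grid = np.asarray(y_grid, dtype=float)
    u = (y_grid[:, None] - np.asarray(Y_past, dtype=float)[None, :]) / h0
    dens = (w * np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum(axis=1) / h0
    dens = np.clip(dens, 0.0, None)                        # step 2: positive part
    seg = 0.5 * (dens[1:] + dens[:-1]) * np.diff(y_grid)   # trapezoid areas
    cdf = np.concatenate([[0.0], np.cumsum(seg)])
    return cdf / cdf[-1]                                   # step 3: renormalize
```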
Different algorithms could also be employed for performing the monotonicity correction on the original estimator D̂^{LL}_{t_k}(y); these are discussed in detail in [5]. In practice, Algorithm 4.1 is preferable because it is the fastest to implement; notably, density estimates can be obtained quickly (using the Fast Fourier Transform) with standard functions in statistical software such as R. Computational speed is important in point prediction but is critical for cross-validation, where a large number of estimates of D̂^{LLM}_{t_k}(y) must be computed to determine the optimal bandwidth.

Estimation of the whitening transformation
To implement the whitening transformation (36), i.e., the mapping ε_n = C_n^{−1} Z_{t_n} where C_n is the (lower) triangular Cholesky factor of Γ_n, it is necessary to estimate Γ_n, the n × n covariance matrix of the random vector Z_{t_n} = (Z_{t_1}, ..., Z_{t_n})′, with the Z_t being the normal random variables defined in eq. (35).
The problem involves positive definite estimation of Γ_n based on the sample Z_{t_1}, ..., Z_{t_n}. Let Γ̂^AR_n be the n × n covariance matrix associated with the AR(p, q) model fitted to the data Z_{t_1}, ..., Z_{t_n}, with p, q chosen by minimizing AIC, BIC or a related criterion as described in [2]. Let γ̂^AR_{k,l} denote the autocovariance at 2-D lag (k, l) implied by the fitted model; the (i, j) element of the (block) Toeplitz matrix Γ̂^AR_n is then γ̂^AR evaluated at the lag t_i − t_j. Using the 2-D Yule-Walker equations to fit the AR model implies that γ̂^AR_{k,l} equals the sample autocovariance γ̄_{k,l} for k = 0, 1, ..., p and l = 0, 1, ..., q. For the cases where k > p or l > q, γ̂^AR_{k,l} can be obtained by iterating the difference equation that characterizes the fitted 2-D AR model. In the R software this procedure is automated for time series by the ARMAacf() function; here we extend the same idea to stationary data over random fields.
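The sketch below shows the extension step: given autocovariances at small lags (from the 2-D Yule-Walker fit) and the fitted coefficients, larger lags are produced by iterating the AR difference equation. Lags falling outside the stored table are treated as negligible, a simplification relative to a full implementation:

```python
def extend_acvf(gamma, coefs, K, L):
    """Extend 2-D autocovariances by iterating the fitted AR difference
    equation, gamma(k,l) = sum_{(j,m)} a_{j,m} * gamma(k-j, l-m); a
    random-field analogue of R's ARMAacf().  Sketch: `gamma` maps small
    lags (from 2-D Yule-Walker) to values, `coefs` maps each ROS lag
    (j, m) to its coefficient a_{j,m}."""
    def get(k, l):
        if k < 0 or (k == 0 and l < 0):   # symmetry: gamma(-k,-l) = gamma(k,l)
            k, l = -k, -l
        return gamma.get((k, l), 0.0)     # lags far outside the table ~ 0
    for k in range(K + 1):
        for l in range(-L, L + 1):
            if (k, l) not in gamma and (-k, -l) not in gamma:
                gamma[(k, l)] = sum(a * get(k - j, l - m)
                                    for (j, m), a in coefs.items())
    return gamma
```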
Estimating the 'uniformizing' transformation D_t(·) and the whitening transformation based on Γ_n allows us to estimate the transformation H_n : Y_{t_n} → ε_n. However, in order to put the Model-free Prediction Principle to work, we also need to estimate the transformation H_{n+1} (and its inverse). To do so, we need a positive definite estimator for the matrix Γ_{n+1}; this can be accomplished by extending the covariance matrix associated with the fitted 2-D AR(p, q) model to dimension (n + 1) × (n + 1), i.e., by calculating Γ̂^AR_{n+1}.
Consider the 'augmented' vectors:
• Z_{t_{n+1}} = (Z_{t_1}, ..., Z_{t_n}, Z_{t_{n+1}})′ and
• ε_{n+1} = (ε_1, ..., ε_n, ε_{n+1})′,
where the values Y_{t_{n+1}}, Z_{t_{n+1}} and ε_{n+1} are as yet unobserved. We now show how to obtain the inverse transformation H^{−1}_{n+1}. Write Z_{t_{n+1}} = C_{n+1} ε_{n+1}, where C_{n+1} is the (lower) triangular Cholesky factor of (our positive definite estimate of) Γ_{n+1}. From the above, it follows that

$$Z_{t_{n+1}} = \sum_{i=1}^{n} c_i\, \varepsilon_i + c_{n+1}\, \varepsilon_{n+1},$$

where c_{n+1} = (c_1, ..., c_n, c_{n+1}) is a row vector consisting of the last row of the matrix C_{n+1}.
i. Compute the normal random variable Z_{t_{n+1}} = ∑_{i=1}^{n} c_i ε_i + c_{n+1} ε_{n+1}. (43)
ii. Create the uniform random variable U_{n+1} = Φ(Z_{t_{n+1}}). (44)
iii. Finally, define Y_{t_{n+1}} = D^{−1}_{t_{n+1}}(U_{n+1}); (45)
of course, in practice, the above will be based on an estimate of D^{−1}_{t_{n+1}}(·).
Since Y_{t_n} has already been created using (the first n coordinates of) ε_{n+1}, the above completes the construction of Y_{t_{n+1}} based on ε_{n+1}, i.e., the mapping H^{−1}_{n+1} : ε_{n+1} → Y_{t_{n+1}}.
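Putting the three steps together, here is a sketch of H^{−1}_{n+1} restricted to its last coordinate; cdf_inv is a hypothetical estimate of D^{−1}_{t_{n+1}}, e.g. obtained by interpolating the monotone LL estimator:

```python
import numpy as np
from scipy.stats import norm

def inverse_map_next(Gamma_np1, eps, cdf_inv):
    """Sketch of H^{-1}_{n+1} for the last coordinate: given a positive
    definite (n+1)x(n+1) covariance estimate, the full innovation vector
    eps (length n+1), and an estimated quantile function cdf_inv for
    D_{t_{n+1}}, produce Y_{t_{n+1}}."""
    C = np.linalg.cholesky(Gamma_np1)   # lower triangular Cholesky factor
    c_last = C[-1, :]                   # the row vector c_{n+1} in the text
    Z_next = c_last @ eps               # step i:   eq. (43)
    U_next = norm.cdf(Z_next)           # step ii:  eq. (44)
    return cdf_inv(U_next)              # step iii: eq. (45)
```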

Model-free point prediction
In the previous sections, it was shown how to construct the transformation H_n : Y_{t_n} → ε_n and its inverse H^{−1}_n : ε_n → Y_{t_n}, where the random variables ε_1, ε_2, ... are i.i.d. Note that by combining eqs. (43), (44) and (45) we can write the formula

$$Y_{t_{n+1}} = D^{-1}_{t_{n+1}}\Big(\Phi\Big(\sum_{i=1}^{n} c_i\, \varepsilon_i + c_{n+1}\, \varepsilon_{n+1}\Big)\Big).$$

Recall that c_{n+1} · ε_{n+1} = ∑_{i=1}^{n} c_i ε_i + c_{n+1} ε_{n+1}; hence, the above can be compactly denoted as

$$Y_{t_{n+1}} = D^{-1}_{t_{n+1}}\big(\Phi(c_{n+1} \cdot \varepsilon_{n+1})\big). \quad (46)$$

Eq. (46) is the predictive equation required by the Model-free Prediction Principle; conditionally on Y_{t_n}, it can be used like a model equation to compute the L²- and L¹-optimal point predictors of Y_{t_{n+1}}. We will give these in detail as part of the general algorithm for the construction of Model-free point predictors.
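The conditional expectation (L²) or median (L¹) implied by the predictive equation rarely has a closed form, but it can be evaluated by Monte Carlo over the unobserved innovation, as in this sketch; cdf_inv is again a hypothetical, vectorized quantile-function estimate:

```python
import numpy as np
from scipy.stats import norm

def mf_point_predictor(c, eps_obs, cdf_inv, M=10_000, loss="L2", seed=0):
    """Monte Carlo evaluation of the Model-free point predictor from the
    predictive equation (46): the mean (L2) or median (L1) of
    D^{-1}(Phi(sum_i c_i*eps_i + c_{n+1}*eps*)) over eps* ~ N(0,1).
    `c` is the last Cholesky row; `eps_obs` holds the n reconstructed
    innovations.  Sketch only."""
    rng = np.random.default_rng(seed)
    base = c[:-1] @ eps_obs                     # part fixed by the observed data
    draws = cdf_inv(norm.cdf(base + c[-1] * rng.standard_normal(M)))
    return draws.mean() if loss == "L2" else float(np.median(draws))
```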
Bandwidth selection via cross-validation

For each bandwidth b in a reasonable range, we can form either

$$\text{PRESS}(b) = \sum_{k=k_0}^{n-1} \big(Y_{t_{k+1}} - \hat Y_{t_{k+1}}\big)^2 \quad \text{or} \quad \text{PRESAR}(b) = \sum_{k=k_0}^{n-1} \big|\,Y_{t_{k+1}} - \hat Y_{t_{k+1}}\big|,$$

where Ŷ_{t_{k+1}} denotes the chosen point predictor of Y_{t_{k+1}} computed from the data Y_{t_1}, ..., Y_{t_k}; here k_0 should be big enough so that estimation is accurate, e.g., k_0 can be of the order of √n. The cross-validated bandwidth choice is then the b that minimizes PRESS(b); alternatively, we can choose to minimize PRESAR(b) if an L¹ measure of loss is preferred. Finally, note that a quick-and-easy (albeit suboptimal) version of the above is to use the predictor Ŷ_{t_{k+1}} = μ̂(t_{k+1}) and base PRESS(b) or PRESAR(b) on this approximation. For the problem of selecting h_0 in the case of Model-free point predictors, as in [16], our final choice is h_0 = h² where h = b/n. Note that an initial choice of h_0 (needed to perform uniformization and cross-validation to determine the optimal bandwidth b) can be set by any plug-in rule; the effect of this initial value of h_0 is minimal. Point prediction performance as measured by mean squared error (MSE) is used to compare the estimators.
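A sketch of the cross-validation loop; `predictor` is a hypothetical hook wrapping whichever of the point predictors above is being tuned:

```python
import numpy as np

def press(b, Y, k0, predictor):
    """Cross-validated score PRESS(b): sum of squared one-step prediction
    errors over the last n - k0 sites, with Y a 1-D array already in
    lexicographic order.  `predictor(Y_past, b)` returns the point
    prediction of the next value from its predecessors.
    Replace e**2 by abs(e) to obtain the L1 criterion PRESAR(b)."""
    return sum((Y[k] - predictor(Y[:k], b)) ** 2 for k in range(k0, len(Y)))

# Cross-validated bandwidth over a candidate grid:
# b_cv = min(bandwidth_grid, key=lambda b: press(b, Y, k0, predictor))
```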

Simulation: Additive model with stationary 2-D AR errors
Let a random field be generated using a 2-D AR process of the form (5). Let this field be generated over a rectangular region; the NSHP limits are set from (101, 101) to (50, 50), which defines the region E_{t,n} shown in Figure 2. The data Y_t are generated using the additive model in eq. (1), with a trend µ(t) that is linear in t (see Figure 3). Here the v_{t_1,t_2} are i.i.d. N(0, τ²) with τ = 0.1. Point prediction is performed at t_1 = 50, t_2 = 50. Bandwidths for estimating the trend are calculated using cross-validation as described in Section 5. Results for point prediction in terms of mean squared error (MSE) over all MB and MF methods are shown in Table 1. A total of 100 realizations of the dataset were used for measuring point prediction performance. From this table it can be seen that MB-LL is the best point predictor. This is expected, since the data were generated by a 2-D AR model, the same model used in MB-LL prediction. In addition, the estimation is performed at the boundary of the random field with a strong linear trend, as shown in Figure 3, where LL regression is expected to perform best. It can also be observed that MF-LLM performs best among all MF point predictors and approaches the performance of MB-LL. This shows that the monotonicity correction in the LLM distribution estimator has minimal effect on the center of the distribution, which is what is used for point prediction.

Real-life example: CIFAR images
The CIFAR-10 dataset [13] is used as a real-life example to compare the Model-based and Model-free prediction algorithms discussed above. The original CIFAR-10 dataset consists of 60000 32-by-32 color images in 10 classes, with 6000 images per class. We pick 100 images from the class "dog"; the original images have 3 RGB (red, green, blue) channels with discrete pixel values. We pick the R (red) channel of each image and standardize it to generate a new real-valued dataset. Our final transformed dataset thus has 100 real-valued 32-by-32 images. Results for point prediction in terms of mean squared error (MSE) over all MB and MF methods are shown in Table 1. From this table it can be seen that MF-LLH and MF-LLM are the best point predictors. We attribute this to the fact that the CIFAR-10 image data is not compatible with the additive model given by eq. (1). It can also be seen that, unlike with the synthetic 2-D AR dataset, the two best predictors MF-LLH and MF-LLM are much closer in performance, owing to the lack of a linear trend at the point where prediction is performed.
Lastly, for point prediction there is a difference in performance between fitted and predictive residuals for some estimators, which was not the case with the synthetic dataset discussed above. This is due to finite-sample effects, as the CIFAR image random field is smaller in size and we use only a part of it for our one-sided prediction.
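For reference, the red-channel preprocessing described above might look as follows; the use of TensorFlow's CIFAR-10 loader and the per-image standardization are our assumptions (the paper does not specify its exact pipeline), though label 5 does correspond to "dog" in CIFAR-10:

```python
import numpy as np
from tensorflow.keras.datasets import cifar10

# Sketch of the preprocessing described above (not the authors' code).
(x_train, y_train), _ = cifar10.load_data()
dogs = x_train[y_train.flatten() == 5][:100]   # label 5 is 'dog' in CIFAR-10
red = dogs[..., 0].astype(float)               # R channel: shape (100, 32, 32)
# Per-image standardization (our assumption; the text just says 'standardize'):
red = (red - red.mean(axis=(1, 2), keepdims=True)) / red.std(axis=(1, 2), keepdims=True)
```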

Conclusions and Future Work
In this paper we investigate the problem of one-sided prediction for random fields that are stationary only over a limited part of their entire region of definition. For such locally stationary random fields we develop frameworks for point prediction both via a Model-based approach, which includes a coordinate-changing trend and/or variance, and via the Model-free Prediction Principle proposed in [16], [17]. We apply our algorithms to synthetic data as well as a real-life dataset consisting of images from the CIFAR-10 dataset.
In the latter case we obtain the best performance using the Model-free approach, thereby demonstrating the superiority of this technique over the Model-based approach, in which an additive model is assumed arbitrarily for purposes of prediction. In future work we plan to investigate both Model-based and Model-free prediction for random fields with non-uniform spacing of data, as well as to extend our algorithms to the estimation of prediction intervals.

Model-Free vs. Model-Based Inference: empirical comparisons

The performance of the Model-free and Model-based predictors described above is compared empirically, on simulated and real-life data, in terms of point prediction. The Model-based local constant and local linear methods are denoted MB-LC and MB-LL respectively; these predictors are described in Section 3. The Model-free methods using the local constant, local linear (Hansen) and local linear (monotone) distribution estimators are denoted MF-LC, MF-LLH and MF-LLM respectively; these predictors are described in Section 4.

Figure 2: The region E_{t,n} over which the data used for one-sided prediction of Y_t = Y_{t_1,t_2} are defined.

For the CIFAR experiment, only part of each image is used: the rest of the image is considered occluded, and its pixel values are not available for prediction. Sample images used for prediction are shown in Figure 4. Point prediction is performed at t_1 = 16, t_2 = 16. Bandwidths for estimating the trend are calculated using cross-validation, for both the Model-based and Model-free cases, as described in Section 5.

Figure 4: Sample images from the CIFAR-10 dataset with label "dog". (Note: full images are shown although only part of each is used for prediction.)

The key to Model-free inference is the construction of an invertible transformation H_n : Y_{t_n} → ε_n, where Y_{t_n} = (Y_{t_1}, Y_{t_2}, ..., Y_{t_n})′ denotes the random field data under consideration and ε_n = (ε_1, ..., ε_n)′ is a random vector with i.i.d. components; the details for point prediction are given in Section 4. In Section 3 we visit the problem of model-based inference and develop a point predictor based on the causal 2-D AR framework of [2]. Both model-based and model-free inference for Y_{t_1,t_2} are performed using the data specified over the region E_{t,n}. We consider predictive inference at Y_t = Y_{t_1,t_2} given the data (Y_s | s ≺ t and s ∈ E_{t,n}), where the symbol ≺ denotes lexicographic ordering on the region of support of the random field, i.e., (a_k, b_k) ≺ (a_{k+1}, b_{k+1}) if and only if either a_k < a_{k+1}, or a_k = a_{k+1} and b_k < b_{k+1} [2]. In the subsequent discussion, the lexicographically ordered "past" data Y_s will be denoted Y_{t_1}, Y_{t_2}, ..., Y_{t_n}, and point prediction will be performed at Y_t = Y_{t_{n+1}}.
It is well known that the L²-optimal predictor of Y_{t_{n+1}} given the data Y_{t_n} = (Y_{t_1}, ..., Y_{t_n})′ is the conditional expectation E(Y_{t_{n+1}} | Y_{t_n}). Furthermore, under model (2), we have E(Y_{t_{n+1}} | Y_{t_n}) = µ(t_{n+1}) + σ(t_{n+1}) E(W_{t_{n+1}} | Y_{t_n}). Throughout, K(·) is a nonnegative, symmetric 2-D Gaussian kernel function whose covariance matrix has diagonal values set to the bandwidth b and off-diagonal terms set to 0. The random field data are denoted Y_{t_1}, ..., Y_{t_k}, ..., Y_{t_n}.
In order to construct such a transformation in our context, let m ≥ 1, and denote by L(Y_{t_k}, Y_{t_{k−1}}, ..., Y_{t_{k−m+1}}) the m-th marginal of the random field Y_{t_k}, i.e., the joint probability law of the vector (Y_{t_k}, Y_{t_{k−1}}, ..., Y_{t_{k−m+1}})′. Although we abandon model (2) in what follows, we still want to employ nonparametric smoothing for estimation; thus, we must assume that L(Y_{t_k}, Y_{t_{k−1}}, ..., Y_{t_{k−m+1}}) changes smoothly (and slowly) with t_k. Here {Y_{t_k}, t_k ∈ Z²} can be defined over a 2-D index set D, and the set Y_{t_k}, Y_{t_{k−1}}, ..., Y_{t_{k−m+1}} can be considered lexicographically ordered as discussed previously in Section 2.

It then follows that the entries of ε_n = (ε_1, ..., ε_n)′ are uncorrelated standard normal. Assuming that the random variables Z_{t_1}, ..., Z_{t_n} are jointly normal, this can be strengthened to the claim that ε_1, ..., ε_n are i.i.d. N(0, 1). In our setting, interest lies in one-sided estimation at the boundary of the random field. Let D̂^{LL}_{t_k}(y) and D̃^{LL}_{t_k}(y) denote the local linear estimators of D_{t_k}(y) based on either the indicator variables 1{Y_{t_i} ≤ y} or the smoothed variables Λ((y − Y_{t_i})/h_0). Recall that ε_n and Y_{t_n} are related in a one-to-one way via the transformation H_n, so the values Y_{t_1}, ..., Y_{t_n} are obtainable as Y_{t_n} = H^{−1}_n(ε_n). Hence, we just need to show how to create the unobserved Y_{t_{n+1}} from ε_{n+1}; this is done in the following three steps.