Estimation of the Covariance Matrix in Hierarchical Bayesian Spatio-Temporal Modeling via Dimension Expansion

Ozone concentrations are key indicators of air quality. Modeling ozone concentrations is challenging because they change both spatially and temporally with complicated structures. Missing data bring even more difficulties. One of our interests in this paper is to model ozone concentrations in a region in the presence of missing data. We propose a method without any assumptions on the correlation structure to estimate the covariance matrix through a dimension expansion method for modeling the semivariograms in nonstationary fields based on the estimations from the hierarchical Bayesian spatio-temporal modeling technique (Le and Zidek). Further, we apply an entropy criterion (Jin et al.) based on a predictive model to decide if new stations need to be added. This entropy criterion helps to solve the environmental network design problem. For demonstration, we apply the method to the ozone concentrations at 25 stations in the Pittsburgh region studied. The comparison of the proposed method and the one is provided through leave-one-out cross-validation, which shows that the proposed method is more general and applicable.


Introduction
Ozone concentrations are the daily maximum 8 h moving averages of hourly ozone concentration data recorded in micrograms per cubic meter, µg/m 3 , which are key indicators of air quality. Monitoring the changes both spatially and temporally is very important for the assessment of air quality change, which has a great impact on our environment, society and economy. However, modeling the ozone concentrations is not an easy task since the ozone concentrations vary over space and time with complicated spatial structures, temporal structures and spatio-temporal interactions. Furthermore, the presence of missing data brings even more difficulties. As commented in [1], although we cannot escape the "curse of dimensionality", we can take advantage of recent developments in computing speed and numerical advances (e.g., Markov chain Monte Carlo) that allow us to implement Bayesian spatio-temporal dynamical models in a hierarchical framework. Such a framework provides simple strategies for incorporating complicated spatio-temporal interactions at different stages of the models' hierarchy, and the models are feasible to be implemented for high-dimensional data. Two popular hierarchical Bayesian spatio-temporal models can be found in [1,2], among others. The latter one was used in [3].
Ref. [3] studied the ozone concentrations within −79 • to −81.5 • longitude and 39.5 • to 41.5 • latitude around the Pittsburgh region (−79.23 • , 43.39 • ), in which all of the monitoring stations have missing data. That paper dealt with the missing problems in two steps. First, it filled in some of the missing measurements by using linear models so that the pattern of missing data became monotone (the monotone missing is also referred to as the staircase pattern). Second, it applied hierarchical Bayesian spatio-temporal (HBST) modeling proposed in [2] on this staircase of missing data to estimate the hyperparameters of the spatial-temporal model. Based on the estimated hyperparemeters, it estimated the spatial correlation function for the monitoring stations. Then, it estimated the covariance matrix for all of the stations and derived the predictive distribution for the ungauged sites.
Generalized linear models can be used to accommodate non-Gaussian geostatistical data (e.g., see [4]). Ref. [3] selected the generalized linear model with the quasi-Poisson family as an appropriated spatial correlation function by examining the pattern of spatial correlations obtained via the hierarchical model in the plot. However, their link function is not appropriate if there are negative correlations. This is a strong restriction because negative correlations are common for the ozone concentrations and other spatial-temporal data. Moreover, choosing a model by examining the plots derived in terms of the observed data set is not rigorous enough and may only be suitable just for a particular data set.
In this paper, we propose a method to estimate the covariance matrix through a dimension expansion method for modeling the semivariograms in nonstationary fields based on the estimations from hierarchical Bayesian spatio-temporal modeling. For demonstration, we apply the proposed method on the same data as in Jin et al. [3]. Without any assumption on the correlation structure, the proposed method is more general than the method in [3] such that it is applicable to other spatio-temporal data sets. Using the covariance matrix estimated by the proposed method on the entropy criterion in the environmental network design problem, our study provides interesting findings, and the locations of the selected ungauged stations are more reasonable. We provide comparison of these two methods through leave-one-out cross-validation, which shows that the proposed method provides improved results.
The paper is arranged as follows. In Section 2, we briefly introduce hierarchical Bayesian spatio-temporal modeling. In Section 3, we describe the ozone concentrations in the Pittsburgh region and apply the hierarchical Bayesian spatio-temporal modeling techniques for filling in missing measurements following [3]. In Section 4, we model the ozone concentrations in the Pittsburgh region. We first introduce the method for estimating the covariance matrix through a dimension expansion method for modeling the semivariograms in nonstationary fields, and we then give spatial predictive distributions on the ungauged sites using the covariance matrix estimated by the proposed method. In Section 5, we present the results of the entropy of the predictive distributions and an optimality criterion for extending an environmental network. In Section 6, we provide the model evaluation through leave-one-out cross-validation. We conclude this paper with a conclusion in Section 7.
Throughout the rest of the paper, the L 1 -norm of a vector c is denoted by c 1 , a p × p identity matrix is denoted by I p , the transpose of a matrix A is denoted by A and the trace of a square matrix B is denoted by tr(B). In addition, '⊗' represents the Kronecker product, N k× (·, ·) refers to a matrix Gaussian distribution, t k× (·, ·) denotes a matric-t distribution, IW(·, ·) stands for the inverted Wishart distribution (see (a) of the appendix for definitions of these distributions) and GIW(·, ·) denotes the generalized inverted Wishart distribution.

Hierarchical Bayesian Spatio-Temporal Modeling
We briefly describe HBST modeling in this section, which is the same as that given in [3] excluding Step 3 in the HBST modeling procedure. It is noted that this modeling is a special case of the HBST modeling presented in Chapter 10 of Le and Zidek (2006) excluding Step 3 in the HBST modeling procedure.
Define the following notations: d = number of different type stations (e.g., agricultural, residential, commercial and industrial); n = number of time points (e.g., number of days); u = number of locations with no monitors (i.e., ungauged sites); g = number of locations with monitors (i.e., gauged sites). The stations are organized into k blocks where the g j (j = 1, 2, · · · , k) sites in the jth block have the same number of timepoints m j at which no measurements are taken. These blocks are numbered so that the measurements correspond to a monotone data pattern or a staircase structure, that is, The response variables are written as Here, Y [u] of dimension n × u denotes the unobserved responses at ungauged sites while Y [g] of dimension n × g is given by is an m j × g j matrix of missing measurements at the g j gauged sites for the m j time points and Y is an (n − m j ) × g j matrix of observed measurements at the g j gauged sites for the (n − m j ) time points.
We assume that the response matrix Y follows the Gaussian and generalized inverted Wishart model specified by where B is an l × (g + u) coefficient matrix with the hyperparameter mean matrix B 0 and the variance components V B , X is the matrix of covariates which is defined in (4) and {Θ, δ} is a set of model parameters specified below. We partition B corresponding to the l time-varying covariates in conformance with the block structure as .
By assuming an exchangeable structure across sites, B can be written as B =BR, wherẽ B is the l × d hyperparameter matrix and R = (r i,j ) d×(u+g) with r i,j = 1 for Station j under Class i and r ij = 0 otherwise. Likewise, we partition the (u + g) × (u + g) covariance matrix Σ over gauged and ungauged sites conformably as where Σ [u,u] is a u × u matrix being for the ungauged sites. Further, we partition the g × g covariance matrix Σ [g,g] for the gauged site blocks as follows: Similarly, for j = 1, · · · , k, we put We reparametrize the matrix Σ through the recursive one-to-one Bartlett transformation for the two blocks: g] . Similarly, by applying the Bartlett decomposition, we can represent the submatrix Σ [g j ,··· ,g k ] , for j = 1, · · · , k − 1, as where Γ k = Σ [g k ,g k ] and for j = 1, · · · , k − 1, Therefore, the GIW prior distribution for Σ in (1) is equivalently defined in terms of (Γ [u] , Υ [u] ) and {(Γ 1 , Υ 1 ), · · · , (Γ k , Υ k ), Σ k } as follows: where Υ [u] is the slope of the optimal linear predictor of Y [u] based on Y [g] and Γ [u] is the residual covariance of the optimal linear predictor. Similar interpretations can be applied to Υ j and Γ j , for j = 1, · · · , k − 1.
Let H be the set of the hyperparameters in (1) and (2) If a data matrix appears to be an ascending staircase, the HBST modeling procedure is given as follows: Step 1. Compute the hyperparameter values that maximize the marginal distribution f (Y [g o ] |H g ) using an empirical Bayesian approach (see (b) of Appendix A). The EM algorithm is used to obtainĤ g .
,Ĥ g ) of missing measurements as in (c) of Appendix A. Fill in the missing data by using the predictive distributions.
Step 3. Obtain the estimateΣ [g,g] from the estimate ofĤ g . In terms ofΣ [g,g] , obtain the estimate of the covariance matrix by using a dimension expansion method given in Qin et al. [5] and the thin-plate spline method given in Wabba and Wendelberger (1980). The details are given in Section 4.1.
Step 4. Estimate the hyperparameters H u and obtain the conditional predictive distribution f (Y [u] |Y [g] ,Ĥ) (see Section 4.2).

Ozone Concentrations from the Monitoring Stations in Pittsburgh Region
The ozone concentrations were recorded within −79 • to −81.5 • longitude and 39.5 • to 41.5 • latitude around the Pittsburgh region (−79.23 • , 43.39 • ) for four consecutive summer months, June, July, August and September, over the period from 1995 to 2007. There were 25 monitoring stations in the region as shown in Figure 1, which is the same as Figure 1 in [3]. The original data set Y 0 was collected from 25 stations, and there were a total of 1586 (13 years × 122 days) measurements at each station. The number of missing data in Y 0 is shown by N1.Miss in Table 1, which is the same as Table 1 in [3]. In this section, we fill in missing measurements.

Filling in the Missing Measurements for Each Monitoring Station within the Period of Monitoring Blocks
Since there are missing data in the dataset, we follow the steps in [3] in filling in some missing measurements occurred during the operation of each monitoring station, using the regression model as for i = 1, . . . , 13, and j = 1, . . . , 122, where a and b are regression coefficients, c i , for i = 1, . . . , 13, are the categorical factors and {ε t } is a sequence of independently and identically distributed Gaussian random variables with mean 0 and variance σ 2 . The model (3) assigns different means to the years with a yearly cycle of 122 days. We reexpress the 13 factors in the model via Helmert contrasts, which compare the first level of the factor with all later levels, the second level with all later levels, and so forth. The Helmert matrix, Z 13×13 , is defined as follows.

Filling in the Missing Measurements in Y 1
To fill in the missing measurements in Y 1 , we can proceed as follows [3]: (i) Obtain a new data set Y 2 from Y 1 by filling in the 488 missing measurements at Stations 5 and 25 during the end of the time period by using the HBST modeling technique. N3.Miss in Table 1 displays the number of missing data in the data set Y 2 , which shows that Y 2 has a staircase data structure, as all of the missing data are located in the beginning of the time period. (ii) Put d = 4, l = 15, n = 1586, k = 7, m 1 = 854, m 2 = 610, m 3 = 488, m 4 = 366, m 5 = 318, m 6 = 244, m 7 = 0, g 1 = 1, g 2 = 1, g 3 = 0, g 4 = 3, g 5 = 1, g 6 = 1 and g 7 = 16. Fill in the remaining missing values in Y 2 by executing Steps 1-2 of the HBST modeling procedure.

Model the Ozone Concentrations in the Pittsburgh Region
To model ozone concentrations in the Pittsburgh region by spatial interpolation, we cover the region by the 100 grid boxes of a spatial resolution of latitude 0.2 • × longitude 0.2 • . Thus, u = 100. The grid points are ungauged sites, and their classes are displayed in Figure 4. To derive the predictive distributions for these grid points, a key step is to estimate the covariance matrix.

Estimation of the Covariance Matrix
In this subsection, we introduce a method for estimating the covariance matrix through a dimension expansion method for modeling the semivariograms in nonstationary fields in terms ofĤ g from Step 1 of the HBST procedure.
Let {Y(x) : x ∈ S}, S ∈ R d , be an environmental random process, where x is a d-dimensional spatial index that varies continuously throughout the region S. At n spatial locations denoted by {x i : i = 1, . . . , n}, we observe realizations of the random process Y(x), i.e., {Y(x i ) : i = 1, . . . , n}. We are interested in learning the spatial dependency of the process through the observed data. The semivariogram function which describes the degree of spatial dependency of an intrinsic stationary random process is a cornerstone in spatial statistics. An intrinsic stationary random process satisfies the following two conditions (Cressie [6]): where a semivariogram is defined as γ( for two different locations, x i and x j , in the monitored region. The estimated covariance matrix of the monitoring stationsΣ [g,g] is based on the estimation of H g from Step 1 of the HBST procedure. We estimate the semivariograms of the ozone concentrations from the monitoring stations byγ From Figure 2, we notice that the estimated semivariograms related to Station 3 (marked by "×") are much higher than the other stations. We examine the location of Station 3 and notice that it was on the edge of the monitored region. Moreover, there were over ten airports around this station. According to Xue et al. [7], there is a great impact of high-altitude aircrafts on the ozone layer in the stratosphere. This becomes an influential factor in modeling the ozone concentrations. Next, we introduce how this factor is considered in the proposed modeling technique. It is obvious that this field is not stationary. Bornn et al. [8] proposed a novel approach to find the latent dimensions over which the nonstationary fields exhibit stationarity through dimension expansion. They justified that for a nonstationary Gaussian process Y(x), where x ∈ R d , there exists a vector z ∈ R p , p > 0, such that the expanded process Y([x, z]) is stationary under appropriate moment constraints. Note that [x, z] is the concatenation of the vectors x and z. The stationary semivariogram with latent vectors can be expressed by where [x i , z i ] is the expanded spatial index for the ith location. Qin et al. [5] improved the method in Bornn et al. [8] by considering the covariance structure of theγ i,j , for j = i, which are generally correlated. In our application, we use the lasso-penalized weighted least-squares criterion (WLS) in Qin et al. [5] as follows, to estimate the parameters and the expanded dimensions. Here,γ i,j is the estimated semivariogram by (5)  Huijbregts [9] and Cressie [6]), among others). For example, the exponential model is defined as The semivariogram plot with estimated expanded dimensions (Figure 3) of the monitoring stations shows that the field is in good agreement with the theoretical model, as most of the points are near the solid red line, the fitted exponential semivariogram model. Two extra dimensions are added to the original coordinate with λ = 0.01. Figure 3 shows that with the extra dimensions, Station 3 is pushed much further out of the two-dimensional plane, reflecting the impact of high-altitude aircrafts on the ozone layer in the stratosphere we have mentioned earlier. After the expanded dimensions for the monitoring stations are obtained, we use the thin-plate spline method [10] to estimate the hidden dimensions for the ungauged sites. The semivariograms for the ungauged stations are estimated by the exponential model using the estimated parameter vectorφ. Next, we estimate the semivariograms γ s i ,s j between stations s i and s j using the exponential model based on the distances over the space composed by the original and the expanded dimensions. Last, the covariance between any two sites can be estimated bŷ whereσ Y(s i ) andσ Y(s j ) are estimates of σ Y(s i ) and σ Y(s j ) obtained by the thin-plate spline approach.

Prediction of the Daily Ozone Concentrations at the Grid Points
By Chapter 10 of Le and Zidek (2006), spatial predictive distributions at the grid points given the monitoring sites are as follows: where We estimate the hyperparameters associated with the grid points Λ 0 , Υ 00 , H 0 and δ 0 viaδ , j = 1, . . . , k − 1, After all of the hyperparameters in the predictive distributions are estimated, we can predict the daily ozone concentrations at all the grid points in the time period of study by generating samples from the predictive distributions.

Environmental Network Extension
Assume that Y has the density function f . The total reduction in uncertainty of Y can be presented by the entropy of its distribution, i.e., is a not necessarily integrable reference density (Jaynes [11]). According to the predictive distribution (7), the total entropy H(Y [u] |Y [g] ) can be defined as where c u (u, q) is a constant depending on the degree of freedom and the dimension of the ungauged sites.
The key step in expanding an environmental network is to find appropriate ungauged sites to add to the existing network that maximizes the corresponding entropy. We use the following optimality criterion as given in [3]: The add sites, in a vector of dimension u 1 , are selected to maximize the entropy in (8). In [3], the grid points {91, 92, 93} were selected with the highest entropy 11.3774. The proposed method selects the grid points {41, 71, 100} with entropy 12.1207. This selection is more reasonable, as they are not gathered in the southeast corner of the region like {91, 92, 93}. The selected sites among 100 grid points by the two methods are shown in Figure 4 below.  [3] and red circled points by our method).

Model Evaluation
In this section, we use the leave-one-out cross-validation to evaluate the accuracy of the predictive model derived using the proposed method and compare the proposed method with the one in [3]. We select the observations from one of the original 25 stations as validation data, and observations in the remaining 24 stations are treated as training data. We use the data from day 855 to day 1586 at the end of the study from each station to evaluate the prediction because during this period, none of the stations has missing data. By choosing this period, we avoid using the Bayesian hierarchical modeling technique for estimating the missing data in the training data set, which is time-consuming and not our intention for evaluating the proposed method on estimating the covariance matrix. Station 22 is excluded because it is the only industrial station in the study. For each of the 24 stations, we generate 100 samples from the predictive distribution with parameters estimated using observations from the rest of the 23 stations. We compute the average of relative absolute bias (ARAB) as ∑ 100 j=1 y j,i,t − y i,t y i,t , where y j,i,t is the jth sample generated from the predictive distributions and y i,t is the observation from Station i on time t. The results are given in Table 2.
In Table 2, "-" means that there is no prediction for the station because there are negative correlations and the method in [3] is not applicable to estimate the predictive distribution. The results in Table 2 show that the proposed method provides slightly more accurate predictions than the one in [3] for most of the stations. More important is that, when there are negative correlations obtained from the estimations of the hierarchical Bayesian spatio-temporal modeling technique, the method in [3] fails to estimate the covariance matrix, while the proposed method still provides accurate predictions except for Station 3. This is expected because Station 3 is an influential station. Therefore, if we use observations at Station 3 as the validation data set, it has a great impact on estimating the covariance matrix.

Conclusions
In this paper, we have derived a predictive model through the hierarchical Bayesian spatio-temporal modeling technique given in [12] at ungauged sites based on the covariance matrix estimated by a dimension expansion method for modeling semivariograms in nonstationary fields. Further, we have applied an entropy criterion (see [12] or [3] for details) based on the predictive model to decide if new stations need to be added. This entropy criterion helps to solve the environmental network design problem. For demonstration, we have applied the proposed method on ozone concentrations at 25 stations in the Pittsburgh region studied in [3]. The proposed method has provided satisfactory results. Moreover, the results have shown that the method is more general and applicable, as no assumption is imposed on the correlation structure.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author, upon reasonable request.
Inverted Wishart distribution. If, for a p × p positive definite matrix Ψ and a positive constant δ ≥ p, the density function of a p × p random matrix S has the form then S is said to have an inverted Wishart distribution and is denoted by S ∼ IW(Ψ, δ). Matric-t distribution. If, for an n × m matrix U , positive definite matrices A : n × n and B : m × m and a positive constant δ > 0, the density function of an n × m random matrix X has the form , then X is said to have a matric-t distribution and is denoted by X ∼ t n×m (U , A ⊗ B, δ).
(b) Estimation of H g = {V B , B 0 , (Υ 01 , H 1 , Λ 1 , δ 1 ), · · · , (Υ 0,k−1 , H k−1 , Λ k−1 , δ k−1 ), (Λ k , δ k )} Le and Zidek (2006) showed that the predictive distributions derived through the integrated framework above are completely characterized by their hyperparameters, which are estimated by an empirical Bayes approach, that is, to estimate them by maximizing the marginal likelihood of all the measured responses (conditional on those hyperparameters) evaluated at their observed values. This procedure is referred to as type-II maximum likelihood estimation (type-II MLE). To estimate H g , the following procedure can be employed: Compute the hyperparameter values that maximize the marginal distribution f ( }. The subscript g indicates that not all the hyperparameters are involved in this marginal distribution. The response matrix Y follows the GIW distribution specified by (1) and (2). The marginal distribution f (Y [g o ] |H g ) can be written as , for j = 1, · · · , k − 1,   : : m j × m j m j × (n − m j ) (n − m j ) × m j (n − m j ) × (n − m j ) = I n + XV B X +ε [g j+1 ,··· ,g k ] H j (ε [g j+1 ,··· ,g k ] ) .
Although f (Y [g o ] |H g ) can be written as a matric-t distribution as in (A1), direct maximization of this marginal density presents a challenge. The EM algorithm helps circumvent it.