Article

Graph Regression Model for Spatial and Temporal Environmental Data—Case of Carbon Dioxide Emissions in the United States

Roméo Tayewo, François Septier, Ido Nevat and Gareth W. Peters

1 Univ Bretagne Sud, CNRS UMR 6205, LMBA, F-56000 Vannes, France
2 TUMCREATE, 1 Create Way, #10-02 CREATE Tower, Singapore 138602, Singapore
3 Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, CA 93106, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2023, 25(9), 1272; https://doi.org/10.3390/e25091272
Submission received: 10 July 2023 / Revised: 18 August 2023 / Accepted: 24 August 2023 / Published: 29 August 2023
(This article belongs to the Special Issue Spatial–Temporal Data Analysis and Its Applications)

Abstract

We develop a new model for spatio-temporal data. More specifically, a graph penalty function is incorporated into the cost function in order to estimate the unknown parameters of a spatio-temporal mixed-effect model based on a generalized linear model. This model allows for more flexible and general regression relationships than classical linear ones through the use of generalized linear models (GLMs), and it also captures the inherent structural dependencies of the data through a regularization based on the graph Laplacian. We use a publicly available dataset from the National Centers for Environmental Information (NCEI) in the United States of America and perform statistical inference of future CO2 emissions in 59 counties. We empirically show how the proposed method outperforms widely used methods, such as ordinary least squares (OLS) and ridge regression, on this challenging problem.

1. Introduction

Statistical models for spatio-temporal data are invaluable tools in environmental applications, providing insights, predictions, and actionable information for understanding and managing complex environmental phenomena [1]. Such models help uncover complex patterns and trends, revealing how environmental variables change geographically and temporally. Many environmental datasets are collected at specific locations and times, leaving gaps in information. Statistical models help interpolate and map values between observation points, providing a complete spatial and temporal picture of the phenomenon being studied. Moreover, environmental applications frequently require predicting future values or conditions. Statistical models allow for accurate predictions by capturing the spatial and temporal dependencies present in the data. These predictions provide valuable information for decision makers by quantifying the effects of various factors on the environment and projecting the consequences of different actions.
Let $\{y_{t,s} : s \in \Omega_s,\ t \in \Omega_t\}$ denote the spatio-temporal random process for a phenomenon of interest evolving through space and time. As an example, $y_{t,s}$ might be the CO2 emission level at a geographical coordinate $s = (\text{latitude}, \text{longitude})$ on the sphere at a given time $t$. Traditionally, one considers models for such a process from a descriptive context, primarily in terms of the first few moments of a probability distribution (i.e., mean and covariance functions in the case of a Gaussian process). Descriptive models are generally based on the spatio-temporal mixed-effect model [1,2], in which the spatio-temporal process is described with a deterministic mean function and some random effects capturing the spatio-temporal variability and interaction:

$y_{t,s} = \mu_{t,s} + \epsilon_{t,s}$ (1)

where $\mu_{t,s}$ is a deterministic (spatio-temporal) mean function or trend, and $\epsilon_{t,s}$ a zero-mean random effect, which generally depends on some finite number of unknown parameters. A common choice for the trend is the linear form $\mu_{t,s} = \phi_{t,s}^\top \beta$, where $\phi_{t,s}$ represents a vector of known covariates and $\beta$ a set of unknown coefficients. With such a model, it is generally assumed that the process of interest is Gaussian. However, in real-world scenarios, data can exhibit heavy tails or outliers, which can significantly affect the distribution's shape and parameters. If these extreme values are not accounted for, estimates can be biased and inferences incorrect. As a consequence, a more advanced model based on a generalized linear model (GLM) has been proposed [3]. The systematic component of the GLM specifies a relationship between the mean response and the covariates through a possibly nonlinear but known link function. Note that additional random effects can be added to the transformed mean function, leading to the so-called generalized linear mixed model (GLMM) [4].
The main challenge of such models lies in estimating the unknown parameters. Once this important step is done, the different tasks of interest (prediction, decision, etc.) can be performed. Unfortunately, the inference of these parameters can lead to overfitting, multicollinearity-related instability, and a lack of variable selection, resulting in complex models with high variance. As a consequence, regularization methods using the $\ell_1$ and/or $\ell_2$ norm as penalty function are generally used in practice to mitigate these issues by controlling the model complexity, improving generalization, and enhancing the stability of coefficient estimates [5,6].

Contributions

Graph signal processing is a rapidly developing field that lies at the intersection of signal processing, machine learning and graph theory. In recent years, graph-based approaches to machine learning problems have proven effective at exploiting the intrinsic structure of complex datasets [7]. Recently, graph penalties have been applied successfully to the reconstruction of time-varying graph signals [8,9] and to regression with a simple linear model [10,11]. In these works, the results highlight that regularization based on the graph structure can have an advantage over more traditional norm-based regularization in situations where the data or variables have inherent structural dependencies or relationships. The main advantage of graph penalties is that they take into account the underlying graph structure of the variables, capturing dependencies and correlations that might not be adequately addressed by norm-based penalties.
In this work, we propose a novel and general spatio-temporal model that incorporates a graph penalty function in order to estimate the unknown parameters of a spatio-temporal mixed-effect model based on a generalized linear model. In addition, different structures of graph dependencies are discussed. Finally, the proposed model is applied to a real and important environmental problem: the prediction of CO2 emissions in the United States. As recently discussed in [12], regression analysis is one of the most widely used statistical methods for characterizing the influence of selected independent variables on a dependent variable and has thus been widely used to forecast CO2 emissions. To the best of our knowledge, this is the first time that a more advanced model, i.e., a GLM-based spatio-temporal mixed-effect model with graph penalties, has been proposed to predict CO2 emissions.

2. Problem Statement—The Classical Approach

In this section, we first provide background on graphs and their properties; we then introduce the system model of our problem, followed by the classical approach, which uses a linear regression structure.

2.1. Preliminaries

Let us consider a weighted, undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$ composed of $|\mathcal{V}| = N$ vertices. $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix, where $A_{ij} \geq 0$ represents the strength of the interaction between nodes $i$ and $j$. An example of such a graph is depicted in Figure 1. $\mathcal{E}$ is the set of edges; therefore, $(i,j) \in \mathcal{E}$ implies $A_{ij} > 0$ and $(i,j) \notin \mathcal{E}$ implies $A_{ij} = 0$. The graph can be defined through the (unnormalized) Laplacian matrix $L \in \mathbb{R}^{N \times N}$:

$L = D - A$ (2)

where $D$ corresponds to the degree matrix of the graph, $D = \mathrm{diag}(D_{11}, D_{22}, \ldots, D_{NN})$, where $D_{ii}$ is the $i$-th column sum (or row sum) of the adjacency matrix $A$.
The graph Laplacian, closely related to the continuous-domain Laplace operator, has many interesting properties. One of them is the ability to inform about the connectedness of the graph. Combining this property with any graph signal at time $t$, $y_t \in \mathbb{R}^N$, the quadratic form

$y_t^\top L\, y_t = \frac{1}{2} \sum_{i,j} A_{ij} \left( y_{t,i} - y_{t,j} \right)^2$ (3)

can be considered a measure of the cross-sectional similarity of the signal, with smaller values indicating a smoother signal, reaching a minimum of zero for a signal that is constant on all connected sub-components [13].
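As a quick numerical illustration of this property, the following sketch builds the Laplacian of Equation (2) and checks the identity in Equation (3) on a constant and on a rapidly varying signal; the small adjacency matrix is synthetic and chosen purely for illustration.

```python
import numpy as np

# Minimal illustration of the Laplacian smoothness measure in Equation (3),
# using a small synthetic adjacency matrix (values for illustration only).
A = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
D = np.diag(A.sum(axis=1))             # degree matrix
L = D - A                              # unnormalized graph Laplacian

y_smooth = np.array([1.0, 1.0, 1.0])   # constant signal on a connected graph
y_rough = np.array([1.0, -1.0, 2.0])   # rapidly varying signal

# y^T L y equals 0.5 * sum_ij A_ij (y_i - y_j)^2
for y in (y_smooth, y_rough):
    quad = y @ L @ y
    direct = 0.5 * sum(A[i, j] * (y[i] - y[j]) ** 2
                       for i in range(3) for j in range(3))
    print(f"y^T L y = {quad:.3f} (direct sum: {direct:.3f})")
```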

2.2. System Model

The main objective of this paper is to design a statistical regression model in order to characterize and predict CO2 emissions across time and space. More precisely, the paper is concerned with the situation where a signal $y_t = [y_{t,1}, y_{t,2}, \ldots, y_{t,N}]^\top \in \mathbb{R}^N$ is measured on the vertices of a fixed graph at a set of discrete times $t \in \{1, 2, \ldots, T\}$. This vector corresponds to the CO2 emissions measured at $N$ different spatial locations at time $t$. At each of these time instants, a vector of $K$ covariates $x_t \in \mathbb{R}^K$ is also measured, which is not necessarily linked to any node or set of nodes.
Objectives:
1. Determine, for each of the $N$ different locations, the specific relationship between the response variables $\{y_{t,i}\}_{t=1}^T$ and the set of covariates $\{x_t\}_{t=1}^T$.
2. Based on this relationship, make a prediction of the CO2 levels at different locations in space and time.

2.3. Problem Formulation with a Classical Linear Regression Model

The most common form of structural assumption is that the responses are assumed to be related to the predictors through some deterministic function $f$ and some additive random error component $\epsilon_i$, so that, for the $i$-th location and $t = 1, \ldots, T$, we have

$y_{t,i} = f_i(x_t) + \epsilon_i,$ (4)

where $\epsilon_i$ is a zero-mean error random variable. Therefore, a classical procedure consists of approximating the true function $f_i$ by a linear combination of basis functions:

$f_i(x_t) \approx \sum_{p=1}^{P} \beta_{i,p}\, \phi_{i,p}(x_t) = \phi_i(x_t)^\top \beta_i,$ (5)

where $\beta_i = [\beta_{i,1}\ \cdots\ \beta_{i,P}]^\top$ is the set of coefficients corresponding to the basis functions $\phi_i(x_t) = [\phi_{i,1}(x_t)\ \cdots\ \phi_{i,P}(x_t)]^\top$ used to approximate the function $f_i(\cdot)$ associated with the signal over time at the $i$-th location, i.e., $\{y_{t,i}\}_{t=1}^T$.
The linear regression model over all $N$ different locations can be formulated in matrix form as follows, for all $t \in \{1, 2, \ldots, T\}$:

$y_t = \Phi_t \beta + \epsilon_t,$ (6)

where

$\Phi_t = \begin{bmatrix} \phi_1(x_t)^\top & 0_{1\times P} & \cdots & 0_{1\times P} \\ 0_{1\times P} & \phi_2(x_t)^\top & \cdots & 0_{1\times P} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{1\times P} & 0_{1\times P} & \cdots & \phi_N(x_t)^\top \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_N \end{bmatrix} \quad \text{and} \quad \epsilon_t = \begin{bmatrix} \epsilon_{t,1} \\ \vdots \\ \epsilon_{t,N} \end{bmatrix}.$ (7)

As a consequence, this linear regression can be fully summarized as

$y = \Phi \beta + \epsilon,$ (8)

where $y = [y_1^\top\ y_2^\top\ \cdots\ y_T^\top]^\top \in \mathbb{R}^{NT \times 1}$ and

$\Phi = \begin{bmatrix} \Phi_1 \\ \vdots \\ \Phi_T \end{bmatrix} \in \mathbb{R}^{NT \times NP} \quad \text{and} \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_T \end{bmatrix} \in \mathbb{R}^{NT \times 1},$

with $\mathbb{E}[\epsilon] = 0_{NT \times 1}$ and $\mathrm{Var}[\epsilon] = \Sigma$.
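To make the dimensions concrete, the following sketch assembles the block-diagonal matrices $\Phi_t$ of Equation (7) and the stacked matrix $\Phi$ of Equation (8); it assumes, for illustration only, the trivial basis $\phi_i(x) = x$, so that $P = K$.

```python
import numpy as np
from scipy.linalg import block_diag

# Sketch of the stacked design matrix Phi in Equation (8); dimensions and
# data are illustrative, and phi_i is a placeholder identity basis.
N, P, K, T = 3, 2, 2, 4          # locations, basis size, covariates, times
rng = np.random.default_rng(0)
X = rng.normal(size=(T, K))      # covariates x_t; here P = K

def phi_i(x, i):
    """Basis functions for location i; here simply the covariates themselves."""
    return x

# Phi_t is block-diagonal with one 1 x P block per location (Equation (7)).
Phi_rows = [block_diag(*[phi_i(X[t], i).reshape(1, P) for i in range(N)])
            for t in range(T)]
Phi = np.vstack(Phi_rows)        # shape (N*T, N*P)
print(Phi.shape)                 # (12, 6)
```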
In such a model, the most common approach to estimating the regression coefficients is the generalized least squares (GLS) method, which minimizes the squared Mahalanobis distance of the residual vector:

$\hat{\beta}_{\mathrm{GLS}} = \arg\min_\beta\ (y - \Phi\beta)^\top \Sigma^{-1} (y - \Phi\beta).$ (9)
Theorem 1 (Aitken [14]). Consider that the following conditions are satisfied:

(A1) The matrix $\Phi$ is nonrandom and has full rank, i.e., its columns are linearly independent;
(A2) The vector $y$ is a random vector such that the following hold:
(i) $\mathbb{E}[y] = \Phi \beta_0$ for some $\beta_0$;
(ii) $\mathrm{Var}[y] = \Sigma$ is a known positive definite matrix.

Then, the generalized least squares estimator from (9) is given by

$\hat{\beta}_{\mathrm{GLS}} = \left( \Phi^\top \Sigma^{-1} \Phi \right)^{-1} \Phi^\top \Sigma^{-1} y.$ (10)

Moreover, $\hat{\beta}_{\mathrm{GLS}}$ corresponds to the best linear unbiased estimator of $\beta_0$, and its covariance matrix is $\mathrm{Var}[\hat{\beta}_{\mathrm{GLS}}] = \left( \Phi^\top \Sigma^{-1} \Phi \right)^{-1}$.
Let us remark that the ordinary least squares (OLS) estimator is nothing but a special case of the GLS estimator: the two coincide whenever the covariance matrix is spherical, i.e., $\Sigma = \sigma^2 I$.
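As a minimal numerical sketch (on synthetic data, using linear solves in place of explicit inverses), the GLS estimator of Equation (10) can be implemented as follows; the final two lines illustrate the remark above, since with $\Sigma = \sigma^2 I$ the GLS and OLS estimates coincide.

```python
import numpy as np

def gls_estimator(Phi, y, Sigma):
    """GLS estimator of Equation (10):
    beta_hat = (Phi^T Sigma^{-1} Phi)^{-1} Phi^T Sigma^{-1} y,
    computed with linear solves instead of explicit inverses."""
    Sinv_Phi = np.linalg.solve(Sigma, Phi)
    Sinv_y = np.linalg.solve(Sigma, y)
    return np.linalg.solve(Phi.T @ Sinv_Phi, Phi.T @ Sinv_y)

# Toy check: with Sigma = sigma^2 I, GLS reduces to OLS.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 3))
beta0 = np.array([1.0, -2.0, 0.5])
y = Phi @ beta0 + 0.1 * rng.normal(size=50)
print(gls_estimator(Phi, y, 0.01 * np.eye(50)))   # close to beta0
print(np.linalg.lstsq(Phi, y, rcond=None)[0])     # OLS, identical here
```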

2.4. Generalized Linear Models

In this paper, we propose to use the generalized linear model (GLM) structure [15], a flexible generalization of the linear regression model discussed previously. In this model, the additivity assumption on the random component is removed and, more importantly, the response variables can follow distributions more general than the normal distribution typically assumed in the standard linear model; see the discussions in [16,17]. The likelihood distribution of the response variables, $f_Y(y \mid \beta)$, is a member of the exponential family, which includes the normal, binomial, Poisson and gamma distributions, among others.
Moreover, in a GLM, a smooth and invertible function $g(\cdot)$, called the link function, is introduced in order to transform the expectation of the response variable, $\mu_{t,i} \triangleq \mathbb{E}[y_{t,i}]$:

$g(\mu_{t,i}) = \eta_{t,i} = \phi_i(x_t)^\top \beta_i.$

Because the link function is invertible, we can also write

$\mu_{t,i} = g^{-1}(\eta_{t,i}) = g^{-1}\!\left( \phi_i(x_t)^\top \beta_i \right),$ (11)

and thus the GLM may be thought of as a linear model for a transformation of the expected response, or as a nonlinear regression model for the response. In theory, the link function can be any monotonic and invertible function. The inverse link $g^{-1}$ is also called the mean function. Commonly employed link functions and their inverses can be found in [15]. Note that the identity link simply returns its argument unaltered, $\mu_{t,i} = g^{-1}(\eta_{t,i}) = \eta_{t,i} = \phi_i(x_t)^\top \beta_i$, and is therefore equivalent to assumption (A2)-(i) of Theorem 1 used in the classical linear model.
In a GLM, due to the nonlinearity induced by the link function, the regression coefficients are generally obtained by maximum likelihood, which is equivalent to minimizing a cost function defined as the negative log-likelihood [16]:

$\hat{\beta} = \arg\min_\beta\ V(y; \beta),$ (12)

with $V(y; \beta) = -\ln f_Y(y \mid \beta)$.
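As an illustration of Equation (12), the following sketch fits a Poisson GLM with log link (so $g^{-1}(\eta) = e^\eta$) by directly minimizing the negative log-likelihood on synthetic data; the quasi-Newton optimizer is our choice for the sketch, not a prescription of the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the GLM fit in Equation (12) for a Poisson response with log
# link; data are synthetic and for illustration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
beta_true = np.array([0.8, -0.5])
y = rng.poisson(np.exp(X @ beta_true))

def neg_log_lik(beta):
    """V(y; beta) = -ln f_Y(y | beta) for the Poisson GLM, up to a constant."""
    eta = X @ beta
    return np.sum(np.exp(eta) - y * eta)

res = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
print(res.x)  # close to beta_true
```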

3. Proposed Graph Regression Model

In this section, we develop our penalized regression model over graph. We first show how to overcome some of the deficiencies of traditional regression models by introducing new penalty terms that regularize the solution. Finally, we provide details regarding the estimation procedure and the algorithm we develop.

3.1. Penalized Regression Model over Graph

In the previous section, we introduced a flexible generalization in order to model our spatial and temporal response variables of interest. Unfortunately, two main issues can arise. On the one hand, the solution of the optimization problem defined in (12) may not be unique if $\Phi$ is rank deficient or when the number of regression coefficients exceeds the number of observations (i.e., $NP > NT$). On the other hand, the learned model could suffer from poor generalization due to, for example, the choice of an overcomplicated model. To avoid such problems, the most commonly used approach is to introduce a penalty function into the optimization problem to further constrain the resulting solution:

$\hat{\beta} = \arg\min_\beta\ V(y; \beta) + h(\beta; \gamma).$ (13)

The penalty term $h(\beta; \gamma)$ can be decomposed as the sum of $p$ penalty functions and therefore depends on some positive tuning parameters $\{\gamma_i\}_{i=1}^p$ (regularization coefficients), which control the importance of each elementary penalty function in the resulting solution. When every parameter is null, i.e., $\{\gamma_i\}_{i=1}^p = 0$, we recover the classical GLM solution in (12). On the contrary, for large values of $\gamma$, the influence of the penalty term on the coefficient estimate increases. The most commonly used penalty functions are the $\ell_2$ norm (ridge), the $\ell_1$ norm (LASSO), or a combination of both (elastic net); see [18] for details.
In this paper, we propose to use an elementary penalty function that takes into account the specific graph structure of the observations. As in [10,11], a penalty function can be introduced in order to enforce some smoothness of the predicted mean of the signal $\mathbb{E}[y_t]$ over the underlying graph at each time instant. More specifically, we propose to use the following estimator:

$\hat{\beta} = \arg\min_\beta\ V(y; \beta) + \gamma_1 \beta^\top \beta + \gamma_2 \sum_{t=1}^{T} \mathbb{E}[y_t]^\top L\, \mathbb{E}[y_t]$
$\phantom{\hat{\beta}} = \arg\min_\beta\ V(y; \beta) + \gamma_1 \beta^\top \beta + \gamma_2 \sum_{t=1}^{T} g^{-1}\!\left( \phi(x_t) \beta \right)^\top L\, g^{-1}\!\left( \phi(x_t) \beta \right)$
$\phantom{\hat{\beta}} = \arg\min_\beta\ V(y; \beta) + \gamma_1 \beta^\top \beta + \gamma_2\, g^{-1}(\Phi \beta)^\top (I_T \otimes L)\, g^{-1}(\Phi \beta),$ (14)

where the function $g^{-1}(\cdot): \mathbb{R}^{NT} \to \mathbb{R}^{NT}$ corresponds to the element-wise application of the inverse link function introduced in (11) to its input argument, and $I_T \otimes L$ stands for the Kronecker (tensor) product between the identity matrix of size $T$ ($I_T$) and the Laplacian matrix of the underlying graph ($L$). The penalty function is therefore the sum of two elementary ones, with $\gamma_1, \gamma_2 \geq 0$ their regularization coefficients. The regularization $\beta^\top \beta = \|\beta\|_2^2$ imposes some smoothness conditions on possible solutions, which also remain bounded. Finally, the regularization based on the graph Laplacian $L$ enforces the expectation of the response variable under the GLM to be smooth over the considered graph $\mathcal{G}$ at each time $t$; this follows from the property of the Laplacian matrix discussed in Section 2.1.
As recently discussed in both [8,9], in some practical applications, the reconstruction of a time-varying graph signal can be significantly improved by adequately exploiting the correlations of the signal in both space and time. The authors show, on several real-world datasets, that the time-difference signal (i.e., $\mathbb{E}[y_t] - \mathbb{E}[y_{t-1}]$ in our case) exhibits smoothness on the graph even when the signals $\mathbb{E}[y_t]$ themselves are not smooth on the graph. The proposed model can simply be rewritten as follows in order to take this property into account:

$\hat{\beta} = \arg\min_\beta\ V(y; \beta) + \gamma_1 \beta^\top \beta + \gamma_2\, g^{-1}(\Phi \beta)^\top \tilde{L}\, g^{-1}(\Phi \beta).$ (15)
With this general formulation, several cases can be considered (a small construction sketch follows the list):
  • Case 1—$\tilde{L} = I_T \otimes L$: the penalization induces the smoothness of the successive mean vectors $\mathbb{E}[y_1], \ldots, \mathbb{E}[y_T]$ over a static graph structure $L$.
  • Case 2—$\tilde{L} = \mathrm{diag}(L_1, \ldots, L_T)$: the penalization induces the smoothness of the successive mean vectors $\mathbb{E}[y_1], \ldots, \mathbb{E}[y_T]$ over a time-varying graph structure $L_1, \ldots, L_T$.
  • Case 3—$\tilde{L} = D_h (I_{T-1} \otimes L) D_h^\top$ or $\tilde{L} = D_h\, \mathrm{diag}(L_1, \ldots, L_{T-1})\, D_h^\top$: the penalization induces the smoothness of the time-difference mean vectors $\mathbb{E}[y_2] - \mathbb{E}[y_1], \ldots, \mathbb{E}[y_T] - \mathbb{E}[y_{T-1}]$ over a graph structure that is either static or time-varying, respectively. The matrix $D_h$, of dimension $NT \times N(T-1)$ and defined as

    $D_h = \begin{bmatrix} -I_N & 0_N & \cdots & 0_N \\ I_N & -I_N & \ddots & \vdots \\ 0_N & I_N & \ddots & 0_N \\ \vdots & \ddots & \ddots & -I_N \\ 0_N & \cdots & 0_N & I_N \end{bmatrix},$

    transforms the mean vector into the time-difference mean vector.
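The following sketch illustrates how the matrices $\tilde{L}$ of Cases 1 and 3 can be assembled with Kronecker products; the sizes and adjacency matrix are illustrative, and the final assertion checks that Case 3 indeed penalizes the time differences.

```python
import numpy as np

# Sketch of the penalty matrices L_tilde for Cases 1 and 3, with a static
# Laplacian L; sizes are illustrative.
N, T = 3, 4
A = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
L = np.diag(A.sum(axis=1)) - A

# Case 1: smoothness of each mean vector E[y_t] on the graph.
L_tilde_1 = np.kron(np.eye(T), L)                    # shape (N*T, N*T)

# Case 3: smoothness of the time differences E[y_{t+1}] - E[y_t].
# D_h has dimension NT x N(T-1): -I_N on the diagonal blocks, I_N below.
D_h = np.kron(np.eye(T, T - 1), -np.eye(N)) \
    + np.kron(np.eye(T, T - 1, k=-1), np.eye(N))
L_tilde_3 = D_h @ np.kron(np.eye(T - 1), L) @ D_h.T  # shape (N*T, N*T)

# Check: mu^T L_tilde_3 mu = sum_t (mu_{t+1}-mu_t)^T L (mu_{t+1}-mu_t)
mu = np.sin(np.arange(N * T, dtype=float))
diffs = mu.reshape(T, N)[1:] - mu.reshape(T, N)[:-1]
assert np.isclose(mu @ L_tilde_3 @ mu, sum(d @ L @ d for d in diffs))
```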
Proposition 1. When the response variables are normally distributed, i.e., $y \sim \mathcal{N}(\Phi \beta, \Sigma)$, the solution that minimizes the cost function defined in Equation (15) is given by

$\hat{\beta} = \left( \Phi^\top \Sigma^{-1} \Phi + \gamma_1 I_{NP} + \gamma_2 \Phi^\top \tilde{L}\, \Phi \right)^{-1} \Phi^\top \Sigma^{-1} y.$ (16)

Proof. See Appendix A. □
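Under this Gaussian assumption (with identity link), Proposition 1 can be implemented directly; the helper below is a sketch of Equation (16) using linear solves rather than explicit inverses, and its name and interface are ours.

```python
import numpy as np

def graph_penalized_estimator(Phi, y, Sigma, L_tilde, gamma1, gamma2):
    """Closed-form solution of Equation (16) under a Gaussian likelihood
    and identity link; a sketch using linear solves instead of inverses."""
    Sinv_Phi = np.linalg.solve(Sigma, Phi)
    NP = Phi.shape[1]
    lhs = Phi.T @ Sinv_Phi + gamma1 * np.eye(NP) + gamma2 * Phi.T @ L_tilde @ Phi
    return np.linalg.solve(lhs, Phi.T @ np.linalg.solve(Sigma, y))

# With gamma1 = gamma2 = 0 this reduces to the GLS estimator of Equation (10);
# with gamma2 = 0 it reduces to ridge regression.
```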

3.2. Learning and Prediction Procedure

As discussed in the previous section, our proposed estimator in (15) results from a regression model with a penalization function over the graph, which depends on some hyperparameters, $\gamma = [\gamma_1, \gamma_2]$. Cross-validation techniques are the most commonly used strategies for the calibration of such hyperparameters, as they provide an estimator of the generalization error of a model [19]. In this paper, a cross-validation technique is used by partitioning the dataset into training, validation and test sets. Only the training and validation sets are used to obtain the selected parameter/hyperparameter set. Finally, the model with the selected set is evaluated on the test set.
Cross-validation (CV) is a resampling method that uses different portions of the data to test and train a model over different iterations. Resampling is appropriate when working with i.i.d. data. Time-series data, in contrast, usually possess temporal dependence, and one should therefore respect the temporal structure when performing CV in that context. To that end, we follow the forward-validation procedure (which we refer to as time-series CV) originally due to [20]. More specifically, the dataset is partitioned as $\mathcal{D}_{\mathrm{train}} = \{x_t, y_t\}_{t=1}^{\rho_{\mathrm{train}} T}$, $\mathcal{D}_{\mathrm{val}} = \{x_t, y_t\}_{t=\rho_{\mathrm{train}} T+1}^{(\rho_{\mathrm{train}}+\rho_{\mathrm{val}}) T}$ and $\mathcal{D}_{\mathrm{test}} = \{x_t, y_t\}_{t=(\rho_{\mathrm{train}}+\rho_{\mathrm{val}}) T+1}^{T}$, where $\rho_{\mathrm{train}}$ and $\rho_{\mathrm{val}}$ correspond to the percentages of the dataset used for training and validation, respectively. In this paper, we set $\rho_{\mathrm{val}} = \frac{1 - \rho_{\mathrm{train}}}{2}$ so that the validation and test sets contain the same number of data points. The hyperparameters and parameters are obtained by minimizing the generalization error approximated on the validation set. In practice, the hyperparameters are optimized using either numerical optimization methods that do not require a gradient (e.g., the Nelder–Mead optimizer) or a grid of discrete values. The proposed learning procedure used in this work is summarized in Algorithm 1.
Algorithm 1 Learning procedure of the proposed penalized regression model over graph

Input: $\mathcal{D}_{\mathrm{train}} = \{x_t, y_t\}_{t=1}^{\rho_{\mathrm{train}} T}$,
       $\mathcal{D}_{\mathrm{val}} = \{x_t, y_t\}_{t=\rho_{\mathrm{train}} T+1}^{(\rho_{\mathrm{train}}+\rho_{\mathrm{val}}) T}$,
       $\mathcal{D}_{\mathrm{test}} = \{x_t, y_t\}_{t=(\rho_{\mathrm{train}}+\rho_{\mathrm{val}}) T+1}^{T}$

1: Iterations of a numerical optimization method
2: while $E_{\mathcal{D}_{\mathrm{val}}} > E_{\mathcal{D}_{\mathrm{val}}}^{\min}$ do
3:   Let $\gamma$ denote the candidate values of the hyperparameters for this iteration of the chosen derivative-free optimization technique.
4:   Given $\gamma$, obtain the optimal regression coefficients $\hat{\beta}$ in (15) using only the data from the training set $\mathcal{D}_{\mathrm{train}}$:
     $\hat{\beta} = \arg\min_\beta\ V(y_{\mathcal{D}_{\mathrm{train}}}; \beta) + \gamma_1 \beta^\top \beta + \gamma_2 \sum_{t \in \mathcal{D}_{\mathrm{train}}} g^{-1}\!\left( \phi(x_t) \beta \right)^\top \tilde{L}\, g^{-1}\!\left( \phi(x_t) \beta \right),$
     either by a numerical optimization technique or via Equation (16) in the case of a Gaussian likelihood.
5:   Compute the estimator of the generalization error using the validation set:
     $E_{\mathcal{D}_{\mathrm{val}}} = \frac{1}{\rho_{\mathrm{val}} T} \sum_{t \in \mathcal{D}_{\mathrm{val}}} \left\| y_t - g^{-1}\!\left( \phi(x_t) \hat{\beta} \right) \right\|^2$
6: end while

Output: Optimal hyperparameters $\hat{\gamma}$ and regression coefficients $\hat{\beta}$
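The sketch below is one possible realization of Algorithm 1 for a Gaussian likelihood with identity link: the inner step uses the closed form of Equation (16) and the outer loop tunes $\gamma = (\gamma_1, \gamma_2)$ with the Nelder–Mead optimizer. The data, the placeholder penalty matrix and the split sizes are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def closed_form(Phi, y, Sigma, L_tilde, g1, g2):
    # Equation (16), via linear solves rather than explicit inverses.
    lhs = (Phi.T @ np.linalg.solve(Sigma, Phi)
           + g1 * np.eye(Phi.shape[1])
           + g2 * Phi.T @ L_tilde @ Phi)
    return np.linalg.solve(lhs, Phi.T @ np.linalg.solve(Sigma, y))

def val_error(log_gamma, Phi_tr, y_tr, Sig_tr, Lt_tr, Phi_val, y_val):
    g1, g2 = np.exp(log_gamma)           # log scale keeps both gammas positive
    beta = closed_form(Phi_tr, y_tr, Sig_tr, Lt_tr, g1, g2)
    return np.mean((y_val - Phi_val @ beta) ** 2)

# Forward-validation split on synthetic data (sizes illustrative): the first
# 70% of the samples train the model, the next 15% select the hyperparameters.
rng = np.random.default_rng(4)
n, p = 300, 6
Phi = rng.normal(size=(n, p))
y = Phi @ rng.normal(size=p) + 0.2 * rng.normal(size=n)
L_tilde = np.eye(n)                      # placeholder penalty matrix
n_tr = int(0.7 * n)
n_val = (n - n_tr) // 2                  # rho_val = (1 - rho_train) / 2
tr, val = slice(0, n_tr), slice(n_tr, n_tr + n_val)

res = minimize(val_error, x0=np.zeros(2), method="Nelder-Mead",
               args=(Phi[tr], y[tr], np.eye(n_tr), L_tilde[tr, tr],
                     Phi[val], y[val]))
print("tuned (gamma1, gamma2):", np.exp(res.x))
```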

4. Numerical Study—CO2 Prediction in the United States

In this section, we empirically assess the benefit of using our proposed penalized regression model over graph for the prediction of CO2 in the United States. For this purpose, the CO2 emission levels were obtained from the Vulcan project (https://vulcan.rc.nau.edu/ (accessed on 1 August 2023)) [21], and more specifically from the dataset (https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1810 (accessed on 1 August 2023)), which provides emissions on a 1 km by 1 km regular grid with an hourly time resolution for the 2010–2015 period. The response variable vector $y_t$ corresponds to the CO2 emissions on the $t$-th day after 1 January 2011 in $N = 59$ different counties on the east coast of the United States of America (see Appendix B for the full list of selected counties).
The explanatory variables, presented in detail below, include daily weather data available from the platform https://www.ncdc.noaa.gov/ghcnd-data-access (accessed on 1 August 2023) of the National Centers for Environmental Information (NCEI) in the United States of America. NCEI manages one of the largest archives of atmospheric, coastal, geophysical, and oceanic research data in the world.

4.1. Choice of Covariates and Data Pre-Processing

The covariates we propose to use to model the daily CO2 emissions at the US county level comprise three types of data:
  • Daily weather data (available on the platform of the National Centers for Environmental Information (NCEI), https://www.ncdc.noaa.gov/ghcnd-data-access (accessed on 1 August 2023)) in the United States of America, including maximal temperature (TMAX), minimal temperature (TMIN) and precipitation (PREC);
  • Temporal information to capture the time patterns of the data;
  • Lagged CO2 emission variables to take into account the time correlation of the response.
All the variables related to the first two points are common covariates across counties, whereas the lagged variables are county-specific.
First, several steps are taken to pre-process the weather data before feeding them into the learning procedure described in Algorithm 1. Any weather station in the 59 US counties with a large proportion of missing values over the period is discarded. Missing values at the retained weather stations are interpolated linearly between the available readings. The weather data are then summarized at the state level (the 59 counties belong to 19 different states): for each state, the 3 weather variables (TMAX, TMIN and PREC) are averaged over the retained weather stations of that state. Whatever the county considered, the weather variables of all 19 states are used as covariates in $\{\phi_i\}_{i=1}^N$ of Equation (7). The final step before estimation is to standardize all variables, i.e., scale and translate them to zero mean and unit marginal variance.
Second, for the temporal patterns in the data, we consider three variables: a week identifier (WI), a weight associated with each day of the week (WD) and a trend variable (TREND). The variable WI simply corresponds to a one-hot encoding of the week number of the year. The variable WD was added after observing a regular weekly pattern in the evolution of the CO2 emissions: as shown in Figure 2, lower emissions are typically observed during the weekend. The trend variable (TREND) is simply a linear, regularly increasing function at the daily rate from 0 (1 January 2010) to 1 (31 December 2015).
Finally, to take into account the time correlation of the CO2 emissions, we use lagged response variables as covariates. More precisely, after analyzing the autocorrelation function (ACF) of the CO2 time series of each county (see Figure 3 for the ACF of three different counties), we use three lagged versions of the response variable: for the $i$-th county at time $t$, the predictors are the 365-day lagged variable $y_{t-365,i}$ (one year), the 182-day lagged variable $y_{t-182,i}$ (about six months) and the 14-day lagged variable $y_{t-14,i}$ (two weeks).
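A possible construction of these lagged covariates with pandas is sketched below; the DataFrame layout and column names are assumptions for illustration, not the authors' code.

```python
import pandas as pd

# Sketch of the lagged-covariate construction, assuming a DataFrame `co2`
# indexed by date with one column per county (names illustrative).
co2 = pd.DataFrame(
    {"Anoka County": range(800), "Dakota County": range(800)},
    index=pd.date_range("2010-01-01", periods=800, freq="D"),
)

lags = {"lag365": 365, "lag182": 182, "lag14": 14}
features = pd.concat(
    {name: co2.shift(days) for name, days in lags.items()}, axis=1
)
# Rows within the first 365 days have no one-year lag and are dropped.
features = features.dropna()
print(features.columns[:3])  # MultiIndex: (lag name, county)
```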

4.2. Graph Construction of the Spatial Component

In this work, the 59 counties are considered the nodes of a common graph. The locations of the chosen counties are depicted in Figure 4. As a consequence, Case 1 of the graph penalty function of Section 3.1 is considered, i.e., $\tilde{L} = I_T \otimes L$. The single Laplacian matrix $L$ is defined through the adjacency matrix.
A graph adjacency matrix should reflect the tendency of measurements made at pairs of nodes to have similar mean values. There are many possible choices for the design of this adjacency matrix; in this work, two different choices are compared. As in [11], we first construct the adjacency matrix based on distances by setting

$A_{i,j}^{\mathrm{dist}} = e^{-l\, \frac{d_{i,j}^2}{\sum_{i,j} d_{i,j}^2}},$

where $d_{i,j}$ denotes the geodesic distance between the $i$-th and $j$-th counties in kilometers, and $l$ is a scaling hyperparameter to be optimized using Algorithm 1. A heat map of the geodesic distances in kilometers between counties is shown in Figure 5.
The second proposition for the adjacency matrix utilizes the empirical correlations between the counties' CO2 emissions. For two counties $i$ and $j$, the adjacency coefficient is defined as
$A_{i,j}^{\mathrm{corr}} = e^{-l\, \max(0,\, \rho_{i,j})^2},$
where $\rho_{i,j}$ is the empirical correlation between $y_i$ and $y_j$, the CO2 emissions of the $i$-th and $j$-th counties, respectively.
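The two constructions can be sketched as follows, with synthetic coordinates and emission series standing in for the real data; the correlation kernel follows our reading of the definition above.

```python
import numpy as np

# Sketch of the distance-based and correlation-based adjacency matrices;
# the distance matrix and emission series below are synthetic placeholders.
rng = np.random.default_rng(3)
N = 5
coords = rng.uniform(size=(N, 2)) * 10.0                       # stand-in coords
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # distances
Y = rng.normal(size=(365, N))                                   # stand-in CO2 series

l = 2.0  # scaling hyperparameter, tuned by Algorithm 1 in the paper

# Distance-based adjacency (kernel of squared distances, normalized).
A_dist = np.exp(-l * d ** 2 / np.sum(d ** 2))
np.fill_diagonal(A_dist, 0.0)

# Correlation-based adjacency, with negative correlations truncated to
# zero before squaring.
rho = np.corrcoef(Y, rowvar=False)
A_corr = np.exp(-l * np.maximum(0.0, rho) ** 2)
np.fill_diagonal(A_corr, 0.0)
```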

4.3. Numerical Experiments

In the following numerical experiments, the proposed penalized regression model over graph is compared to two classical models, namely ridge regression and the ordinary least squares (OLS) solution. These two models are nothing but special cases of the proposed model, obtained by setting $\gamma_2 = 0$ or $(\gamma_1 = 0,\ \gamma_2 = 0)$ in Equation (16), respectively.
First, we empirically study the performance of the penalized regression model over graph for the two possible choices of the Laplacian matrix. As shown in Table 1, using the adjacency matrix based on geodesic distances rather than on empirical correlations improves the RMSE on both the validation and the test sets. The smaller RMSE on the training set with the correlation-based adjacency matrix suggests that this choice could lead to overfitting.
Table 2 shows the root mean squared error (RMSE) on the different sets (training, validation and test) for varying amounts of training data. Let us remark, as described more precisely in Section 3.2, that since the total number of data is fixed, increasing the size of the training set reduces the sizes of both the validation and test sets. As expected, since the proposed model is a generalization of both the ridge and the OLS solutions, a smaller RMSE is obtained in all configurations on the validation and test sets. More importantly, the proposed model yields a quite significant improvement on the test set compared to both the ridge and the OLS solutions, which clearly demonstrates its superior generalization.
Next, in Table 3, we present the RMSE obtained when the models are fitted without any lagged variables as covariates. Comparing these values with those obtained with the lagged variables in Table 2 clearly shows the benefit of introducing an auto-regressive structure into the regression model through such lagged response variables.
In Figure 6, the weekly RMSE is depicted as a function of time for three different counties. These weekly RMSEs are obtained by aggregating the daily forecasts of the proposed regression model trained on 50% of the dataset. It is interesting to observe that the weekly RMSE does not explode with time but rather remains quite stable.
In order to ensure that the previous conclusions are not too sensitive to the specific 59 chosen counties, we compute the RMSE on the three different sets for the different regression models by randomly selecting 2 counties in each of the 19 states. Let us remark that we use transfer learning for the hyperparameters of the models (i.e., $\gamma_1$ and $\gamma_2$): they are not optimized for each random choice of data but are set to the values optimized in the previous scenario in which all 59 counties are used. From the results depicted in Figure 7, the same conclusions as before can be drawn. It is worth noting that, even though the hyperparameters are not optimized for each random choice, the RMSE on the validation set is still smaller using the proposed model. Finally, the boxplots obtained on the test sets empirically show the better predictive power of the proposed penalized regression model over graph.

5. Conclusions

In this paper, we propose a novel GLM-based spatio-temporal mixed-effect model with graph penalties. This graph penalization allows us to take into account the inherent structural dependencies or relationships of the data. Another advantage of this model is its ability to model more complicated and realistic phenomena through the use of generalized linear models (GLMs). To illustrate the performance of our model, a publicly available dataset from the National Centers for Environmental Information (NCEI) in the United States of America is used, on which we perform statistical inference of future CO2 emissions over 59 counties. We show that the proposed method outperforms widely used methods, such as the ordinary least squares (OLS) and ridge regression models. In the future, we will further study how to improve this model for this specific CO2 prediction task. In particular, the use of different likelihood and link functions will be studied, along with other adjacency matrices. We will also study whether considering, for the graph penalties, time differences instead of the direct mean values, as discussed in Section 3.1, could improve the prediction accuracy. Finally, it will be interesting to connect this prediction model to decision-making problems as in [22].

Author Contributions

Conceptualization, R.T., F.S., I.N. and G.W.P.; methodology, R.T., F.S., I.N. and G.W.P.; software, R.T. and F.S.; writing—original draft preparation, R.T., F.S., I.N. and G.W.P.; writing—review and editing, R.T., F.S., I.N. and G.W.P.; visualization, R.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be available on request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Proposition 1

With the normality assumption on the response variables, the resulting estimator $\hat{\beta}$ defined in (15) is given by

$\hat{\beta} = \arg\min_\beta\ (y - \Phi\beta)^\top \Sigma^{-1} (y - \Phi\beta) + \gamma_1 \beta^\top \beta + \gamma_2\, \beta^\top \Phi^\top \tilde{L}\, \Phi \beta.$

The partial derivative of this cost function $C(\beta)$ with respect to $\beta$ is

$\frac{\partial C(\beta)}{\partial \beta} = -2 \Phi^\top \Sigma^{-1} (y - \Phi\beta) + 2 \gamma_1 \beta + 2 \gamma_2 \Phi^\top \tilde{L}\, \Phi \beta = -2 \Phi^\top \Sigma^{-1} y + 2 \Phi^\top \Sigma^{-1} \Phi \beta + 2 \gamma_1 \beta + 2 \gamma_2 \Phi^\top \tilde{L}\, \Phi \beta.$

Setting this derivative to zero yields

$\frac{\partial C(\beta)}{\partial \beta} = 0 \iff \Phi^\top \Sigma^{-1} \Phi \hat{\beta} + \gamma_1 \hat{\beta} + \gamma_2 \Phi^\top \tilde{L}\, \Phi \hat{\beta} = \Phi^\top \Sigma^{-1} y \iff \left( \Phi^\top \Sigma^{-1} \Phi + \gamma_1 I_{NP} + \gamma_2 \Phi^\top \tilde{L}\, \Phi \right) \hat{\beta} = \Phi^\top \Sigma^{-1} y.$

We finally obtain

$\hat{\beta} = \left( \Phi^\top \Sigma^{-1} \Phi + \gamma_1 I_{NP} + \gamma_2 \Phi^\top \tilde{L}\, \Phi \right)^{-1} \Phi^\top \Sigma^{-1} y. \qquad \square$

Appendix B. List of Counties Used in the Numerical Study

Table A1. List of counties.

Number  County              State            Number  County              State
1       Anoka County        Minnesota        31      Daviess County      Kentucky
2       Dakota County       Minnesota        32      Hopkins County      Kentucky
3       Lyon County         Minnesota        33      Russell County      Kentucky
4       Buchanan County     Iowa             34      Alamance County     North Carolina
5       Crawford County     Iowa             35      Lenoir County       North Carolina
6       Page County         Iowa             36      Pender County       North Carolina
7       Union County        Iowa             37      Randolph County     North Carolina
8       Ashley County       Arkansas         38      Charleston County   South Carolina
9       Columbia County     Arkansas         39      Dillon County       South Carolina
10      Outagamie County    Wisconsin        40      Lee County          South Carolina
11      Dane County         Wisconsin        41      Marlboro County     South Carolina
12      Clark County        Illinois         42      Pickens County      South Carolina
13      Mercer County       Illinois         43      Bartholomew County  Indiana
14      Ogle County         Illinois         44      Posey County        Indiana
15      Stephenson County   Illinois         45      Mahoning County     Ohio
16      Lawrence County     Tennessee        46      Shelby County       Ohio
17      Obion County        Tennessee        47      Delta County        Michigan
18      Cumberland County   Tennessee        48      Montcalm County     Michigan
19      Hinds County        Mississippi      49      Washtenaw County    Michigan
20      Tate County         Mississippi      50      Armstrong County    Pennsylvania
21      Blount County       Alabama          51      Montour County      Pennsylvania
22      Autauga County      Alabama          52      Lebanon County      Pennsylvania
23      Marengo County      Alabama          53      Luzerne County      Pennsylvania
24      Morgan County       Alabama          54      Addison County      Vermont
25      Talladega County    Alabama          55      Windsor County      Vermont
26      Bulloch County      Georgia          56      Grant Parish        Louisiana
27      Habersham County    Georgia          57      Red River Parish    Louisiana
28      Bradford County     Florida          58      Vermilion Parish    Louisiana
29      Clay County         Florida          59      Madison Parish      Louisiana
30      Taylor County       Florida

References

  1. Cressie, N.; Wikle, C. Statistics for Spatio-Temporal Data; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
  2. Wikle, C. Modern Perspectives on Statistics for Spatio-Temporal Data. Wires Comput. Stat. 2014, 7, 86–98. [Google Scholar] [CrossRef]
  3. Wikle, C.K.; Zammit-Mangion, A.; Cressie, N. Spatio-Temporal Statistics with R; Chapman & Hall/CRC: Boca Raton, FL, USA, 2019. [Google Scholar]
  4. Stroup, W. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications; Chapman & Hall/CRC Texts in Statistical Science; Chapman & Hall/CRC: Boca Raton, FL, USA, 2012. [Google Scholar]
  5. St-Pierre, J.; Oualkacha, K.; Bhatnagar, S.R. Efficient penalized generalized linear mixed models for variable selection and genetic risk prediction in high-dimensional data. Bioinformatics 2023, 39, btad063. [Google Scholar] [CrossRef] [PubMed]
  6. Schelldorfer, J.; Meier, L.; Bühlmann, P. GLMMLasso: An Algorithm for High-Dimensional Generalized Linear Mixed Models Using $\ell_1$-Penalization. J. Comput. Graph. Stat. 2014, 23, 460–477. [Google Scholar] [CrossRef]
  7. Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98. [Google Scholar] [CrossRef]
  8. Qiu, K.; Mao, X.; Shen, X.; Wang, X.; Li, T.; Gu, Y. Time-Varying Graph Signal Reconstruction. IEEE J. Sel. Top. Signal Process. 2017, 11, 870–883. [Google Scholar] [CrossRef]
  9. Giraldo, J.H.; Mahmood, A.; Garcia-Garcia, B.; Thanou, D.; Bouwmans, T. Reconstruction of Time-Varying Graph Signals via Sobolev Smoothness. IEEE Trans. Signal Inf. Process. Over Netw. 2022, 8, 201–214. [Google Scholar] [CrossRef]
  10. Belkin, M.; Niyogi, P.; Sindhwani, V. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. J. Mach. Learn. Res. 2006, 7, 2399–2434. [Google Scholar]
  11. Venkitaraman, A.; Chatterjee, S.; Händel, P. Predicting Graph Signals Using Kernel Regression Where the Input Signal is Agnostic to a Graph. IEEE Trans. Signal Inf. Process. Over Netw. 2019, 5, 698–710. [Google Scholar] [CrossRef]
  12. Karakurt, I.; Aydin, G. Development of regression models to forecast the CO2 emissions from fossil fuels in the BRICS and MINT countries. Energy 2023, 263, 125650. [Google Scholar] [CrossRef]
  13. Fouss, F.; Saerens, M.; Shimbo, M. Algorithms and Models for Network Data and Link Analysis; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
  14. Aitken, A.C. On Least-squares and Linear Combinations of Observations. Proc. R. Soc. Edinb. 1936, 55, 42–48. [Google Scholar] [CrossRef]
  15. Nelder, J.A.; Baker, R. Generalized Linear Models; Wiley Online Library: Hoboken, NJ, USA, 1972. [Google Scholar]
  16. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman & Hall: London, UK, 1989; p. 500. [Google Scholar]
  17. Denison, D.G. Bayesian Methods for Nonlinear Classification and Regression; John Wiley & Sons: Hoboken, NJ, USA, 2002; Volume 386. [Google Scholar]
  18. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  19. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef]
  20. Hjorth, U.; Hjort, U. Model Selection and Forward Validation. Scand. J. Stat. 1982, 9, 95–105. [Google Scholar]
  21. Gurney, K.R.; Liang, J.; Patarasuk, R.; Song, Y.; Huang, J.; Roest, G. The Vulcan Version 3.0 High-Resolution Fossil Fuel CO2 Emissions for the United States. J. Geophys. Res. Atmos. 2020, 125, e2020JD032974. [Google Scholar] [CrossRef] [PubMed]
  22. Nevat, I.; Mughal, M.O. Urban Climate Risk Mitigation via Optimal Spatial Resource Allocation. Atmosphere 2022, 13, 439. [Google Scholar] [CrossRef]
Figure 1. Example of a graph with $|\mathcal{V}| = 6$ vertices at time $t$.
Figure 2. Choice of the covariate WD to encapsulate information about the weekday for the CO2 emission. (a) Spatial and temporal average of the CO2 emission per weekday. (b) Values assigned to the covariate WD depending on the current weekday.
Figure 3. Illustration of the time correlation of the daily CO2 emissions per county with the autocorrelation function (ACF) of three different counties.
Figure 4. US counties selected as nodes of the graph, depicted in green.
Figure 5. Geodesic distances in kilometers between counties.
Figure 6. RMSE as a function of time for three different counties.
Figure 7. Boxplots of the RMSE obtained after 50 random choices of two counties per state for the different regression models (70% of the dataset is used for training).
Table 1. RMSE of the penalized regression model over graph with the Laplacian defined using an adjacency matrix based either on geodesic distances or on empirical correlations.

Root Mean Square Error (RMSE): Distances Versus Empirical Correlations

             Testing Set                        Validation Set                     Training Set
Perc. Train  Graph (Dist.)  Graph (Corr.)      Graph (Dist.)  Graph (Corr.)      Graph (Dist.)  Graph (Corr.)
70%          16.42          27.04              13.67          14.92              13.40          7.96
Table 2. RMSE of the different regression models for different sizes of the training set.

Root Mean Square Error (RMSE)

             Testing Set                 Validation Set              Training Set
Perc. Train  Graph Reg.  Ridge  OLS     Graph Reg.  Ridge  OLS     Graph Reg.  Ridge  OLS
50%          35.65       41.43  42.10   16.80       17.86  17.65   9.13        6.74   6.55
60%          30.02       36.77  41.41   15.02       19.60  19.73   21.73       6.52   6.52
70%          16.42       22.65  49.52   13.67       17.13  16.44   13.40       7.94   7.02
Table 3. RMSE of the different regression models without the use of the lagged response variables as covariates.

Root Mean Square Error (RMSE) without Lagged Variables

             Testing Set                 Validation Set              Training Set
Perc. Train  Graph Reg.  Ridge  OLS     Graph Reg.  Ridge  OLS     Graph Reg.  Ridge  OLS
70%          38.54       38.54  41.76   20.28       20.28  20.34   9.65        9.65   9.64
