1. Introduction
PM
2.5 can be emitted directly from various sources, including moving vehicles, industrial activities, and chemical reactions. The long-time exposure to PM
2.5 can cause serious health problems, including respiratory system diseases [
1], chronic kidney diseases [
2], and cardiovascular diseases [
3]. In addition to harmful effects on public health, various studies have shown that PM
2.5 also leads to economic losses. According to [
4], the total economic loss due to PM
2.5 exposure was about 0.91% of the total Chinese GDP in 2016. In a seminal work [
5], evidence showed that from 2014 to 2016, China experienced a downward trend in total economic losses, while some regions such as Beijing–Tianjin–Hebei experienced great annual economic losses. To mitigate its adverse impacts, many countries have implemented policies and regulations aimed at reducing exposure to PM
2.5. For instance, the State Council of China released the Action Plan for the Control of Air Pollution as early as 2013, which is a guideline document to control air pollution. Controlling PM
2.5 concentration is crucial for protecting public health and making informed policy decisions related to air pollution [
6].
However, accurately predicting PM
2.5 is a challenging task due to the following two reasons. First, PM
2.5 is a complex pollutant that can be influenced by many factors such as meteorological effects, emission behaviors, and land use and land cover (LULC) [
7]. In a recent work [
8], it was also found that wind speed and terrain elevation are main factors influencing PM
2.5 concentration. Thus, accurately capturing the spatio-temporal patterns of PM
2.5 concentrations requires considering multiple variables [
9]. Second, the concentration of PM
2.5 can vary drastically across different locations and over time.
Figure 1 shows the time series of PM
2.5 concentrations in Beijing and Shanghai from 1 January 2021 to 31 March 2021. As can be seen, the concentration of PM
2.5 in Beijing exhibited significant fluctuations, ranging from 3 to 488 μg/m
3, while the concentration in Shanghai varied from 7 to 149 μg/m
3 during the same period. Such spatial and temporal variations pose major challenges for accurately predicting PM
2.5 concentrations.
Despite all these challenges, advanced deep learning models like recurrent neural networks (RNNs) and long short-term memory (LSTM) have shown promising results in predicting PM
2.5 concentrations, resulting in superior forecasting accuracy [
10,
11]. Moreover, a linear machine learning model was designed to deliver superior performance in capturing rare peaks of air pollution concentrations [
12]. However, RNNs or LSTM models do not fully consider spatial dependencies. To tackle this problem, image-based and graph-based methods have been developed based on the input data structure [
13]. In image-based methods, the input data are typically a two-dimensional map of the geographical area, with each pixel representing the PM
2.5 concentrations at a specific location. Accordingly, convolutional neural network (CNN) models can be utilized to handle the spatial dependencies. To simultaneously consider the temporal dependencies, models such as LSTM and the gated recurrent unit (GRU) are integrated with CNNs for the spatio-temporal modeling of PM
2.5 [
14,
15].
In contrast to image-based approaches, the graph-based methods represent air quality monitoring stations or air sensors as nodes, and their spatial dependencies can be expressed as edges through a graph structure. Graph neural network (GNN) models are extensively used along this line, showcasing huge advantages over CNN-based models due to their ability to capture the complex spatial relationships and dependencies between different geographical regions. For instance, Ref. [
16] introduced a novel geo-context-based diffusion convolutional recurrent neural network (GC-DCRNN) model for short-term PM
2.5 concentration forecasting. This model can effectively capture spatial and temporal dependencies in location-dependent time series data. In a similar vein, Ref. [
17] proposed a hybrid model that integrated graph convolutional networks (GCNs) and LSTM networks (GC-LSTM) to model and predict the spatio-temporal variation of PM
2.5 concentrations. More recently, Ref. [
13] developed a PM
2.5 forecasting model that combines a knowledge-enhanced GNN with a spatio-temporal RNN, which can capture long-term dependencies.
In this work, we propose a GNN framework for PM2.5 concentration prediction by incorporating the GRU module, which is referred to as a graph attention recurrent neural network (GARNN) model. The GARNN model predicts future PM2.5 concentrations by utilizing both meteorological and geographical information. To this end, an attention-based graph neural network is first designed to capture the spatial patterns and calculate the interactions between neighboring cities based on the graph convolution operations and residual attention mechanism. Then, a gated recurrent unit is employed to further establish the spatio-temporal modeling of PM2.5. Extensive experiments demonstrate the superior performance of GARNN compared to other existing models in terms of predictive capabilities. The main contributions of this paper are threefold. (1) We build a comprehensive dataset spanning 308 cities in China, encompassing data on seven pollutants and the meteorological variables from January 2015 to September 2022. (2) We establish a novel GNN-based deep learning framework to effectively predict PM2.5 concentrations. (3) We conduct a comparative study and an ablation study to highlight the superior performance of the proposed method over other existing methods.
The remainder of this paper is organized as follows. In
Section 2, we introduce the data sources, including pollutant data, meteorological data, and geographical data. The methodology is illustrated in
Section 3, showing the detailed structure of the proposed GARNN model. In
Section 4, various experiments are conducted to demonstrate the practical usefulness of GARNN in comparison with other existing methods; this is followed by concluding remarks in
Section 5.
2. Data Sources
Our work utilized three sets of data to forecast PM2.5, i.e., pollutant concentrations, meteorological variables, and geographic distributions. Specifically, the original data were collected at a one-hour frequency. To balance the computational burden and prediction accuracy, we selected data from January 2015 to September 2022 at a three-hour frequency.
2.1. Pollutant Data
The pollutant data were obtained from the Ministry of Environmental Protection of China and the China National Environmental Monitoring Center, in terms of automatic measurement type. We acquired data from 308 cities in China, each with 22,640 observations of 7 pollutant variables, including PM2.5, PM10, SO2, NO2, CO, O3, and the Air Quality Index (AQI). Internal correlations among pollutants are useful for predicting PM2.5 concentrations. PM2.5 concentration exhibited a strong positive correlation with AQI, CO, PM10, NO2, and SO2, with time series correlation coefficients of 0.82, 0.62, 0.60, 0.54, and 0.36, respectively. In contrast, there was a weak negative correlation between O3 and PM2.5 concentrations, with a Pearson correlation coefficient of −0.21.
2.2. Meteorological Data
Meteorological data were collected from the European Centre for Medium-Range Weather Forecasts (ECMWF), which provides hourly forecast data across the whole world. In line with previous research [
13], we selected the following meteorological variables: 2m temperature, total precipitation, boundary layer height,
K index, relative humidity, surface pressure, wind speed, and wind direction. A detailed description of the meteorological variables is provided in
Table 1, which is in line with ECMWF. These variables are extensively studied and known to affect the transfer and dissipation of pollutants.
2.3. Geographical Data
We selected 308 cities in China as the study area in this work, encompassing a comprehensive representation of the country’s diverse regions. By integrating both the temporal and the spatial aspects, our dataset comprised a total of 6,973,120 observations.
3. Methodology
In this work, we constructed the GARNN structure upon domain knowledge, characterizing the pattern of PM
2.5 concentration via the combination of RNN and GNN. From a local perspective, factors such as boundary layer height, rainfall, and humidity can affect the accumulation and dissipation of pollutants locally, while the concentration levels of other pollutants like PM
10 maintain a long-term correlation with the PM
2.5 concentrations. From an external perspective, wind is the primary driving force for transport, and factors such as distance, wind direction, and wind speed can influence the strength of transmission from surrounding cities to the target city. Based on the spatial geographic location of each city, we abstracted them into a graph, with geographic variables such as boundary layer height, temperature, and humidity, as well as pollutant variables such as PM
2.5 and PM
10 concentrations, serving as node attributes in the graph. Variables such as distance between two locations, wind speed, wind direction, and the angle between the connection lines were used as edge attributes in the graph. At each time point, information from the current location was first transmitted with information from the surrounding locations, and the integrated hidden layer vector was then fed into the RNN to output future PM
2.5 predictions over time.
Figure 2 shows the workflow of the proposed GARNN, which mainly consists of two modules, i.e., an attention-based graph neural network and a GRU model.
3.1. Graph Construction
A directed graph
is defined as a collection of cities (i.e., nodes) and their interactions (i.e., edges). The set
corresponds to the nodes, while
refers to the set of edges. We denoted the number of nodes as
and the number of directed edges as
. Furthermore, we predefined a distance threshold of 400 km and a height threshold of 1200 m. If the distance between two cities was less than the distance threshold and the path connecting them did not exceed the mountain range height threshold, then we considered that there existed a connection relationship between the two cities. In other words, these two nodes could be represented as having a two-way connection in the graph structure [
13].
3.2. Problem Definition
Let
be the collection of d pollutants, i.e., PM
2.5, PM
10, SO
2, NO
2, CO, O
3, and AQI, in
locations at time
. Let
be the collection of
meteorological variables in
locations at time
. In addition,
denotes the geographical variables on
edges, i.e., the distance between two locations and its angle direction. Consequently,
and
represent nodal and edge attributes, respectively. In general, geographical attributes do not change over time. However, for the sake of notation consistency, we still denoted them as
. Moreover, the meteorological attributes for the next 72 h could be obtained through the ECMWF. Furthermore, only historical pollutant concentrations (e.g., PM
2.5) were available. As a result, our primary objective was to predict the pollutant concentrations, particularly those of PM
2.5, in a dynamic way. Specifically, we aimed to solve the following prediction problem:
where
is the length of historical information, and
is the number of steps that we intend to predict. Based on prior knowledge, we set
so that the PM
2.5 concentrations in the next 72 h could be predicted since our data were collected at a three-hour frequency.
3.3. The Attention-Based Graph Neural Network
A GNN is a type of neural network designed to process data in the form of graph structures. It operates by aggregating information from neighboring nodes in the graph through the use of differentiable functions, thereby simulating complex processes such as pollutant transport and movement with the atmosphere. At each node, pollutant concentration and meteorological variables are encoded into a hidden layer representation through a fully connected layer and nonlinear operations. This hidden layer represents the current pollution and meteorological conditions of the node. Next, the transfer of information between nodes is calculated based on hidden layer status, distance, and angle direction for each directed edge in the graph. Therefore, two hidden layers and edge attributes can be input into the graph neural network operation layer to calculate the information received by each city from surrounding cities. To reduce the noise of unimportant city information, the residual attention mechanism from the graph attention network (GAT) is applied to the final layer, which allows each city to selectively receive information from surrounding important cities. This attention mechanism is implemented by defining the three projection matrices of “query, key, and value”.
The GNN structure is defined as follows. First, at time step
, the pollutant concentration
or the predicted concentration
is concatenated with the meteorological variables
and passed through the first fully connected layer
to obtain the initial hidden layer
for each node. The GNN operation is then performed, where information is transfer from node
to node
, denoted as
. It is calculated based on the hidden layer states
and
, as well as on the directed edge attribute
. Note that
in Equation (2) represents a fully connected layer.
After obtaining the information transmission amount of each directed edge
, the information from each edge is weighted and aggregated using the residual attention mechanism. More specifically, the query hidden layer
, the key hidden layer
, and the value hidden layer
of the neighboring node
to the central node
are, respectively, calculated in Equation (3) for information transmission, where
,
, and
are the three learnable projection matrices of the attention layer,
and
,
, and
are the bias terms.
Finally, the query result weights are normalized as
. All the information from the neighboring nodes is aggregated, and the original information and the updated information are aggregated by residual connection as follows:
where
refers to the collection of connected neighbors of node
.
3.4. The GRU Model
The GNN structure models the cross-sectional pattern of pollutant concentration in each time slice. To learn the temporal pattern, it is also necessary to build an RNN structure. Since the historical time length considered in this work was relatively long, i.e.,
, a common RNN is prone to gradient disappearance, which affects the overall model convergence efficiency. Therefore, we employed the GRU network for the RNN part of the model, which includes two gates, i.e., a reset gate and an update gate. The RNN structure is shown in Equation (5). More specifically, at each time step t, the hidden layer
derived from the GNN is used as the input of the GRU network. After the calculation of the reset gate and the update gate, the hidden layer update value
is computed, and the predicted value
of pollutant concentration at the next time step is the output
where
,
, and
are learnable matrix parameters,
denotes the sigmoid active function,
denotes the tanh active function, and
represents the fully connected layer function. Furthermore,
denotes the element-wise product operation.
4. Experiments
4.1. Comparative Study
To demonstrate the effectiveness of the proposed GARNN model, we compared it with other existing methods, including MLP, LSTM, GRU, GC-LSTM, nodesFC-GRU, and PM2.5-GNN. Specifically, these models were carefully selected for the following two main reasons. On one hand, these models are extensively utilized for PM2.5 prediction, exhibiting a strong ability in accurate prediction. On the other hand, these models represent three different aspects of modeling. The LSTM model only considers time information, the second model takes both temporal and spatial information into consideration, with spatial information being incorporated via GCN, and the PM2.5-GNN model employs a GNN to deal with spatio-temporal dependencies.
The Multilayer Perceptron (MLP) is a neural network architecture that does not explicitly model the temporal or spatial dependencies of the input data. Instead, it takes as input the pollutant concentration and the meteorological variables in each city and processes them using a 5-layer fully connected neural network with a hidden layer size of 16.
The Long Short-Term Memory (LSTM) [
18] architecture is an RNN that is well-suited for modeling temporal dependencies. To capture the temporal relationships between pollutant concentrations
and meteorological variables
, we employed a 2-layer LSTM model with a hidden layer size of 16.
The Gated Recurrent Unit (GRU) [
19] is a type of RNN that is computationally more efficient than the LSTM architecture and it uses fewer parameters. To leverage the temporal dependencies in the pollutant concentration
and meteorological variables
for each city, we employed a 2-layer GRU model with a hidden layer size of 16.
The GC-LSTM [
17] is a hybrid model that combines GCN and LSTM to model both the spatial and the temporal patterns of pollutant concentrations. The GCN component is constructed using an undirected graph that does not consider edge attributes but captures the transmission of spatial information among the nodes. However, this model does not incorporate critical factors such as air flow, wind speed, wind direction, or other pollutant information.
The nodesFC-GRU [
13] architecture was developed to model the spatio-temporal dependencies in pollutant concentration data. This model combines fully connected layers and GRU units to capture both spatial and temporal patterns in the given data. The fully connected layers enable the direct summarization of information from all adjacent nodes, while the GRU network learns the temporal dependencies between pollutant concentrations over time. However, this model lacks the characterization of factors such as air flow, wind speed, and wind direction.
The PM
2.5-GNN model [
13] employs a GNN to model the spatial patterns of PM
2.5 concentrations. A directed graph with bidirectional edges is constructed, and the domain knowledge is used. However, this model does not incorporate historical information or other pollutant information, which may limit its accuracy and applicability in certain scenarios.
4.2. Experimental Setting and Performance Assessment
To evaluate and compare the predictive capabilities of different models, we employed the evaluation metrics root-mean-square error (RMSE) and mean absolute error (MAE) to carry out performance assessment. We further followed the China Ambient Air Quality Standard (
https://www.mee.gov.cn/ywgz/fgbz/bz/bzwb/dqhjbh/dqhjzlbz/201203/t20120302_224165.shtml, accessed on 30 June 2024) and set the threshold to build the confusion matrix at 75 μg/m
3. This standard is the daily average concentration, which is in line with the WHO standard. As a result, commonly utilized meteorological metrics can be applied, including critical success index (CSI), probability of detection (POD), and false alarm rate (FAR). Among these metrics, higher values for CSI and POD indicate superior performance.
4.3. Experimental Results
Selecting the PM2.5 concentrations for the next three days () as the prediction target, we used all data from 2015 to 2022 from all monitoring stations as the dataset. Training was performed in a rolling manner, where the training set included data from 1 January 2015 to 31 December 2019, the validation set consisted of data from 1 January 2020 to 31 December 2020, and the testing set comprised data from 1 January 2021 to 31 December 2022.
Table 2 shows the experimental results. The top panel shows the results obtained without utilizing historical information (
). The GNN methods that consider information exchange between nodes (i.e., GC-LSTM, nodesFC-GRU, PM
2.5-GNN, GARNN) exhibited stronger predictive capabilities compared to methods that independently predicted the values at each station (i.e., MLP, LSTM, GRU). To be specific, the MLP model exhibited the highest average RMSE, amounting to 16.81. Other models, such as LSTM and GRU, which only incorporate temporal information, yielded an average RMSE value exceeding 15. In contrast, approaches that integrate both temporal and spatial information demonstrated superior performance, with an average RMSE below 15. It is noteworthy that our GARNN model achieved the lowest average RMSE, i.e., 14.12. The bottom panel shows the results obtained by utilizing historical information from the past 24 steps (
). It can be seen that the GNN method incorporating temporal dependencies (i.e., GC-LSTM) showed a certain improvement in prediction accuracy compared to the LSTM or GRU models, although the improvement was relatively small (i.e., from 15.18 to 14.40). Furthermore, the average RMSE of our proposed GARNN decreased from 14.12 to 13.51. Overall, the GNN methods that incorporate information exchange between nodes achieved better results in predicting pollutant concentrations. Moreover, in scenarios with less historical information, the interaction between nodes becomes more important, highlighting the advantages of GNN methods.
Furthermore, the GARNN model could predict the future PM
2.5 concentrations in each city for a period of time based on historical data. We also compared the results of three representative neural networks, i.e., MLP, GRU, and GARNN, in different prediction periods. The RMSE of the predictions was calculated for future periods from
to 24.
Figure 3 illustrates that the RMSE values of the three models increased as the prediction period grew, indicating that the uncertainty of the predictions also increased with the prediction period. Additionally, the difference in accuracy among these three models was relatively small for the future one-period predictions. However, as the prediction period increased, the advantage of neural network models with temporal information (e.g., GRU and GARNN) became more apparent. Additionally, a detailed illustration of the testing data is provided in
Figure 4, with predicted values. The data were selected from Beijing, spanning from 17 November 2021 to 29 November 2021. It can be seen that the model that only considered temporal information heavily underestimated the PM
2.5 concentration (yellow line indicating the LSTM model). Compared with other GCN (i.e., GC-LSTM) and GNN (i.e., PM
2.5-GNN) methods, our newly proposed GARNN model captured the pattern of PM
2.5 concentration very well.
4.4. Variable Importance
In this subsection, we conducted a variable importance analysis on the meteorological variables and other pollutants used in this paper. The permutation test, also known as bootstrap test, was initially introduced in the 1930s by Fisher and others as a method of statistical inference [
20,
21]. It falls under the category of non-parametric tests, requiring no assumptions about the sample population distribution. With the advancement of deep machine learning models, the concept of permutation test has also been applied to evaluate variable importance in “black-box” algorithms [
22]. This process can involve individually shuffling the order of each feature in the test set and subsequently employing a pre-trained model for inference and prediction. The more pronounced the decline in model performance after shuffling a particular variable, the more important that variable within the original model.
To be more specific, the GARNN model trained in
Section 4 was utilized. For each variable, a permutation test was performed on the test set. During the permutation, the values of that variable across all cities and time steps were simultaneously extracted. These values were then randomly shuffled and reassigned to each observation. Subsequently, the pre-trained GARNN model was employed for prediction. The changes in the model’s predictive performance after permuting each variable were compared. The results are presented in
Table 3. As can be seen, among all meteorological variables, the 2m temperature and boundary layer height variables stood out as, relatively, the most significant. After shuffling the 2m temperature, the model’s average RMSE increased by 3.28, MAE increased by 2.95, and FAR increased by 3.52%. After shuffling the boundary layer height, the model’s average CSI decreased by 6.99% and POD decreased by 5.66%. These findings are consistent with the existing literature [
23] on exploring meteorological influences on PM
2.5 concentrations.
4.5. Dissociation Experiment
An important component of the GARNN model established in this work is the GNN. To analyze the functionality of the components of the GNN, we attempted to replace the computational mechanism of the GNN in this subsection. Specifically, we experimented with scenarios where graph information propagation was not performed (i.e., GARNN-no-graph) and where modeling was conducted using the average propagation among neighboring cities (i.e., GARNN-wavg).
As shown in
Table 4, when no graph information propagation was performed, the GARNN-no-graph model could only build its predictions based on the historical pollutant concentrations and meteorological variables of the current site. Therefore, its predictive capability for the next three days was significantly weaker compared to that of the other two models that considered information from surrounding cities. When using the average of neighboring city hidden layers for information propagation (GARNN-wavg), the model did not take into consideration factors like inter-city distances, angular directions, wind speed, and wind direction. As a result, it could only provide a vague representation of pollutant environments and climatic conditions around it. The GARNN model first incorporated the feature hidden layers of pairwise cities along with edge attributes into a hierarchical fully connected layer for information exchange. Then, an attention mechanism was applied to re-weight the information from all neighboring cities. This allowed different neighboring cities to contribute differentially to the central city, enhancing the predictive capabilities compared to those achieved when modeling through average information propagation from surrounding cities (GARNN-wavg).
In addition to the graph information propagation module affecting the effectiveness of the GNN, the construction of the graph also has a certain impact on the model performance. By experimenting with variations in the distance threshold used in constructing the urban network graph, as shown in
Table 5, it became evident that as the distance threshold increased, the predictive performance of the GARNN model initially improved and then began to decline. More particularly, when the distance threshold was set at 300 km, the model’s performance across all metrics was at its lowest. Notably, the most substantial enhancement in performance was observed as the distance threshold increased from 300 km to 400 km. Beyond the 400 km threshold, the model’s efficacy experienced only marginal gains, with just two evaluation metrics displaying improvement.
However, when we took into account the number of edges, we found that at a distance threshold of 300 km, the graph contained a total of 3796 edges, while at 400 km, there were 5852 edges, and at 500 km, 8092 edges. As the distance threshold increased, the temporal and spatial complexity of each GNN computation grew. Therefore, when balancing predictive performance and computational complexity, a distance threshold of 400 km is more appropriate for the GARNN model.
5. Conclusions
This study proposes a novel graph neural network named GARNN for PM2.5 predictions. GARNN integrates three crucial components, i.e., historical information, meteorological variables, and geographical information, ensuring accurate forecasting of PM2.5. Notably, the attention-based GNN was designed to capture spatial patterns, while the GRU module was employed for an effective temporal modeling of PM2.5. To empirically demonstrate the effectiveness of the GARNN model, we gathered pollutant and meteorological data from 308 cities in China, obtaining a dataset of 6,973,120 observations. Rigorous experiments suggested that the GARNN model outperforms alternative methods in terms of predictive accuracy. Additionally, a variable importance analysis highlighted that 2m temperature and boundary layer height played pivotal roles in determining the accuracy of PM2.5 predictions. Various graph message passing mechanisms were also explored, accompanied by the evaluation of different distance thresholds. This exploration served to illustrate the operational effectiveness of the components within the GARNN framework. To summarize, this paper provides a potential solution for PM2.5 predictions in large-scale scenarios, contributing to the effective prevention and control of air pollution.
For future studies, we propose three possible directions. Firstly, our model could be applied in the field of social network analysis. This is because the network structure can be seen as a graph so that the GARNN framework can be utilized. Under such circumstances, some model objects, rather than measurement objects, can be studied such as the number of posts of a user on social network platforms. Secondly, utilizing multimodal data including satellite imagery and GIS geospatial data could effectively improve the model performance. In future work, these data can be combined for more accurate PM2.5 prediction, if available. Thirdly, we investigated the model complexity and efficiency of our GARNN model and the competitors. The number of parameters in our model is more than twenty thousand with a training cost of 2.8 min per epoch on average. We admit that the efficiency of our model is not the best. This is mainly because the GARNN model has a very complex structure, which allowed it to achieve the best prediction accuracy in this study. In the future, a possible research direction is to balance the model complexity and prediction ability. Fourthly, regarding data availability, if the current values of air pollution and meteorological variables cannot be obtained, our methodology can still be applied.