Article

A Dynamic Global–Local Spatiotemporal Graph Framework for Multi-City PM2.5 Long-Term Forecasting

1
School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
2
Faculty of Science and Technology, University of Macau, Macau 999078, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2750; https://doi.org/10.3390/rs17162750
Submission received: 11 June 2025 / Revised: 5 August 2025 / Accepted: 6 August 2025 / Published: 8 August 2025
(This article belongs to the Special Issue Remote Sensing and Climate Pollutants)

Abstract

Accurate PM2.5 prediction is essential for effective urban air quality management. However, existing methods often struggle to capture the complex, nonlinear, and coupled spatiotemporal dynamics of long-term air pollution evolution. Most existing models rely on short-term observations and overlook long-range temporal trends and inter-station dependencies, which limits their ability to capture the spatiotemporal evolution of air pollution. To address these challenges, we propose a novel dynamic global–local spatiotemporal graph framework for PM2.5 long-term forecasting across multiple cities. Specifically, we introduce a Multi-Station iTransformer (MS-iTransformer) module to capture long-term temporal dependencies from station-specific historical sequences. To globally model evolving inter-city relationships, we design a bilinear spatiotemporal attention (BSTA) module to adaptively build dynamic spatiotemporal graphs using bilinear spatial and temporal attention. Furthermore, we propose a Graph-Enhanced Spatiotemporal Module (GESM) to capture localized spatiotemporal dependencies through graph convolution and recurrent modeling. Experimental results on three real-world air quality datasets demonstrate that our model significantly improves PM2.5 forecasting and outperforms widely adopted baseline approaches: the MAE and RMSE are decreased by 1.7665 and 1.8578, respectively; the FAR is reduced by 0.0312; and the CSI and R² are improved by 0.0194 and 0.0260, respectively. Therefore, the proposed method achieves accurate air quality forecasting by effectively capturing long-term temporal trends, dynamic spatial dependencies, and localized spatiotemporal interactions.

1. Introduction

Recently, air pollution has intensified globally to emerge as a major threat to public health and the ecological environment. PM2.5 poses one of the most serious risks due to its extremely fine particle size, which allows it to penetrate the respiratory tract, reach the lungs, and even enter the bloodstream, causing severe harm to human health [1]. Therefore, PM2.5 is widely regarded as one of the most hazardous air pollutants [2,3]. Accurately predicting the trends of PM2.5 concentration has become a central challenge in air quality monitoring and environmental management. Traditional monitoring methods are no longer sufficient for the increasing complexity of air pollution sources. We employ advanced spatiotemporal modeling and intelligent forecasting techniques to enable efficient, accurate, and long-term dynamic prediction of PM2.5 concentrations.
The PM2.5 concentration forecasting consists of short-term predictions and long-term predictions. The former is primarily influenced by rapidly changing factors (such as temperature and wind speed) and sudden emission events [4]. In contrast, the seasonal patterns and annual climate variations affect long-term forecasting [5,6], which requires a comprehensive understanding of dependencies across multiple temporal scales to accurately capture the dynamic evolution of air quality. Currently, long-term PM2.5 forecasting methods face a variety of external interferences, such as fluctuations in meteorological conditions and human activities [7]. Further, they often require the integration of multi-source heterogeneous data, including meteorological variables, pollutant indices, and other exogenous and anthropogenic factors, which significantly increases the complexity of model construction and places greater demands on generalization capabilities [8]. Recurrent Neural Networks (RNNs) [9], Long Short-Term Memory (LSTM) [10], and Gated Recurrent Units (GRUs) [11] have shown advantages in handling small-scale or short-sequence tasks due to their relatively simple structures and fewer parameters. However, they struggle to fully capture complex temporal features and long-range dependencies for long-sequence tasks [12,13]. Therefore, Zhu et al. [14] designed an attention mechanism-based parallel network model, which extracts short-term and long-term temporal features to effectively capture complex temporal dependencies and significantly improve the accuracy of PM2.5 concentration forecasts. Fang et al. [15] introduced a novel decomposing-ensemble and spatiotemporal attention model, which decomposes mixed-mode time series into single-mode series and automatically assigns weights for spatiotemporal factors to enhance prediction precision.
In addition, Wen et al. [16] integrated CNNs with LSTM variants to capture spatiotemporal dependencies. Zhang et al. [17] extracted local features using CNNs and employed a spatiotemporal attention mechanism to assign different weights to various time steps and spatial regions, thereby improving the model's sensitivity to dynamic spatiotemporal variations. However, the spatial relationships between monitoring stations are difficult to characterize accurately in Euclidean space, as they are influenced not only by distance but also by a variety of non-structural factors such as topography, terrain, and wind direction [18,19]. Therefore, the spatial dependencies in PM2.5 concentration forecasting are more suitably modeled in a non-Euclidean space [20].
Graph Neural Networks (GNNs) have emerged as a widely used approach for modeling relationships in non-Euclidean spaces, enabling information propagation and feature extraction through connections between nodes [21]. In recent years, GNNs have demonstrated outstanding performance in numerous real-world tasks, such as classification, regression, and clustering, by integrating graph structures with node attributes [22]. In transportation networks, GNNs model spatial dependencies between roads and are used for traffic flow prediction and congestion mitigation [23]. In bioinformatics and drug discovery, GNNs leverage molecular graphs to enhance protein modeling, molecular property prediction, and drug–target interaction analysis [24,25]. In social networks, GNNs support node classification, community detection, and relationship prediction [26]. In recommender systems, GNNs integrate user–item interaction graphs to improve recommendation accuracy and robustness [27]. Further, GNNs have also found applications in natural language processing [28] and brain disease analysis [29], demonstrating their ability to flexibly capture interactions based on graph structures and to support long-range dependency modeling [30,31,32].
In spatiotemporal data analysis, GNNs not only represent complex structures but also capture dynamic spatiotemporal dependencies, demonstrating strong adaptability and generalization ability. Zhang et al. [33] leveraged GNNs to extract meteorological features and employed a Gray Wolf Optimization (GWO) algorithm to adaptively optimize model parameters, yielding remarkable advantages, especially in handling complex external environmental factors. Liu et al. [34] integrated a spatial graph modeling module with a gated continuous-time forecasting cell for long-term PM2.5 concentration prediction, which jointly models inter-city spatial dependencies and temporal evolution to improve adaptability to environmental conditions and complex meteorological conditions. Zhao et al. [35] developed a forecasting model integrating mixed graph convolutional GRU and a self-attention network, enhancing both the accuracy and stability of long-term forecasts.
The PM2.5 concentration forecasting methods based on GNNs and temporal modeling have made significant advancements [30,31]. However, these methods still face several limitations. The spatiotemporal variations in air quality exhibit highly nonlinear and strongly coupled dynamics at both global and local scales. Meanwhile, the majority of existing long-term forecasting models primarily depend on short-term historical data, thereby neglecting the intrinsic long-term trends and periodicities [36]. Consequently, the predictions tend to be overly sensitive to short-term fluctuations, and due to the temporal variability of influencing factors across different periods, these models often struggle to capture the evolving temporal dependencies effectively. This challenge impedes the ability to simultaneously model long-range dependencies and fine-grained local interactions with sufficient accuracy [37,38]. To address these challenges, we propose a long-term PM2.5 forecasting approach across multiple cities based on a dynamic global–local spatiotemporal graph framework. The main contributions of this work are summarized as follows:
(1) We propose an MS-iTransformer module to capture long-term trends in PM2.5 sequences. The time series of each station is fed into an individual iTransformer to learn station-specific temporal dependencies. The MS-iTransformer improves the accuracy and robustness of long-term forecasts through station-wise normalization and multi-station self-attention.
(2) We propose a BSTA module to capture global spatiotemporal dynamic dependencies across all cities within a region. By integrating spatial and temporal bilinear attention mechanisms, BSTA adaptively constructs a Spatiotemporal Dynamic Graph (STDG) that models the evolving inter-city spatial correlations over time.
(3) We propose a GESM to capture localized spatiotemporal dependencies for fine-grained air quality prediction. The GESM aggregates neighbor information via graph convolution and models short-term temporal dynamics using recurrent units to effectively learn local interaction patterns across both spatial and temporal dimensions.
The rest of the paper is structured as follows: Section 2 introduces the study area and available data; Section 3 details the proposed methodology; Section 4 presents the experimental results and discussion; and Section 5 concludes this paper.

2. Study Area and Available Data

This paper selects 184 urban areas in China as the research object and conducts spatiotemporal prediction analysis of the PM2.5 concentration, as shown in Figure 1. The formation and diffusion of air pollution are jointly affected by a variety of environmental and geographical factors, including meteorological variables, such as temperature, humidity, precipitation, wind speed, and air pressure, as well as distance between cities and terrain characteristics [39].
Initially, we considered a total of 17 meteorological variables in our model, including average temperature, meridional and zonal wind speed, relative humidity, precipitation, surface pressure, and others. To minimize the impact of irrelevant or redundant features on model performance, we employed random forest feature importance ranking and Pearson correlation coefficient (PCC) analysis, as shown in Figure 2 and Figure 3. Figure 2 shows the feature importance scores from the random forest—higher bars indicate greater influence. Figure 3 illustrates the correlation strength between each feature and PM2.5 concentration based on PCC.
Specifically, we select eight features with strong physical significance and high statistical correlation to PM2.5 concentrations: Boundary_layer_height is negatively correlated with PM2.5, and higher layers facilitate vertical diffusion of pollutants. K-index indicates tropospheric instability; higher values reflect better diffusion conditions, usually resulting in lower PM2.5 levels. u_component_of_wind + 950 & v_component_of_wind + 950 represent horizontal wind speed at 950 hPa (~500 m altitude), where PM2.5 tends to accumulate; stronger winds help disperse pollutants. 2 m_temperature influences PM2.5 through cold front activities and ventilation efficiency. Surface pressure is strongly associated with vertical stability; higher pressure may lead to stratified layers that trap pollutants. Relative_humidity + 950: Water vapor contributes to PM2.5 formation and particle growth. Total_precipitation reduces PM2.5 through wet scavenging and downward airflow, showing a strong negative correlation [40].
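The PCC-based screening described above can be illustrated with a minimal sketch. The values below are made-up toy numbers (not taken from the dataset), and the helper name `pearson` and the two-feature dictionary are assumptions chosen only to show how features could be ranked by the absolute value of their correlation with PM2.5.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Toy values for illustration only (hypothetical, not real observations).
pm25 = [35.0, 50.0, 80.0, 120.0, 60.0]
features = {
    "boundary_layer_height": [900.0, 700.0, 400.0, 200.0, 600.0],
    "2m_temperature": [10.0, 12.0, 11.0, 9.0, 13.0],
}

# Rank candidate features by the strength (absolute value) of their correlation.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], pm25)),
    reverse=True,
)
for name in ranking:
    print(name, round(pearson(features[name], pm25), 3))
```

In this toy example the boundary layer height is strongly negatively correlated with PM2.5, which matches the physical interpretation given above; in practice such a ranking would be combined with the random forest importance scores.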
Further, the regional pollution not only comes from local emissions, but is also significantly affected by the transmission effect of neighboring regions. For example, the PM2.5 levels of some cities in North China are often affected by the long-distance transmission of industrial emissions from Beijing and Tianjin [41]. The spatial diffusion of pollutants usually depends on factors such as wind direction, wind speed, atmospheric boundary layer height, and geographical proximity, which together shape a complex spatiotemporal propagation mechanism. Therefore, it is necessary to comprehensively consider the spatial correlation between cities and the potential impact of external inputs on local pollution levels.
The edge attributes connecting adjacent city nodes consist of $w_s$, $w_d$, $d_{rf}$, $d_b$, and $a_c$; detailed descriptions of these features are provided in Table 1.
To more realistically simulate pollutant transmission pathways between cities, we introduce the advection coefficient a c [40], as in Equation (1), to model dynamic inter-city pollution transport. This coefficient accounts for the interaction among wind direction, wind speed, and inter-city distance, effectively representing potential pollutant transmission routes.
$$a_c = \mathrm{ReLU}\left(\frac{|w_s|}{d_b}\cos\left(d_{rf} - w_d\right)\right), \tag{1}$$
When the wind speed is high and its direction aligns with the vector between two cities (i.e., $\cos(d_{rf} - w_d)$ approaches 1), and the distance between the cities is short, the pollutant transport is stronger, resulting in a higher coefficient value. Conversely, if the wind direction opposes the inter-city vector, the wind speed is low, or the distance is long, the coefficient becomes smaller, indicating weaker pollutant transport capability.
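As a small illustration of Equation (1), the sketch below evaluates the advection coefficient for an aligned and an opposed wind. The function name, the argument names, and the units (degrees for directions, kilometers for distance) are assumptions made for this example only.

```python
import math

def advection_coefficient(wind_speed, wind_dir_deg, city_dir_deg, distance_km):
    """a_c = ReLU(|w_s| / d_b * cos(d_rf - w_d)), with angles given in degrees."""
    angle = math.radians(city_dir_deg - wind_dir_deg)
    # ReLU clips negative transport (wind blowing away from the target city) to zero.
    return max(0.0, abs(wind_speed) / distance_km * math.cos(angle))

# Wind blowing exactly along the direction from the source city to the target city.
aligned = advection_coefficient(5.0, 90.0, 90.0, 100.0)
# Wind blowing in the opposite direction.
opposed = advection_coefficient(5.0, 270.0, 90.0, 100.0)
print(aligned, opposed)
```

With these inputs the aligned case gives a positive coefficient, while the opposed case is clipped to zero by the ReLU, so no pollutant transport is assigned to that edge.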

3. Methodology

Figure 4 illustrates the overall architecture of the proposed model, which consists of three main components: First, the MS-iTransformer module is proposed to capture long-term temporal trends from historical data at individual monitoring stations. Second, a BSTA module is proposed to dynamically construct a global inter-city spatiotemporal graph, thereby modeling spatial and temporal dependencies at the global scale. Finally, the GESM uncovers localized spatiotemporal dependencies through graph convolution and recurrent modeling.

3.1. Global Temporal Feature Extraction

As shown in Figure 5, we utilize historical PM2.5 observations from multiple cities and employ an iTransformer to predict PM2.5 trends for each city over multiple future time steps. Specifically, given the historical PM2.5 of multiple cities over the past $L$ time steps $(t, t-1, \ldots, t-L)$, the model forecasts the PM2.5 of these cities for the next $P$ time steps. The core of the iTransformer lies in its ability to learn effective representations of historical PM2.5 variations in each city and to enhance prediction accuracy by modeling the dynamic correlations among cities. The historical PM2.5 time series of each city is passed through an embedding module that converts it into a token representing its characteristic temporal features. Given the historical PM2.5 series $X$ of each city, the PM2.5 features $Y$ used for prediction are obtained via the following steps in Equation (2):
$$h_n^{0} = \mathrm{Embedding}(X), \quad H^{l+1} = \mathrm{TrmBlock}(H^{l}),\ l = 0, 1, \ldots, L-1, \quad Y = \mathrm{Projection}(h_n^{L}), \tag{2}$$
where $H = \{h_1, \ldots, h_N\} \in \mathbb{R}^{N \times D}$ is the latent feature representation of the $N$ cities, and $D$ denotes the dimension of each token. The iTransformer then employs a multi-city self-attention mechanism over the tokens of different cities to learn their dynamic relationships. Specifically, the self-attention module uses linear mappings of $H$ to generate the query ($Q$), key ($K$), and value ($V$) matrices [42]. The query and key vectors of individual cities are denoted as $q_i, k_j \in \mathbb{R}^{d_k}$. Each element of the score matrix before the Softmax operation in the attention computation is calculated as in Equation (3):
$$A_{i,j} = \left(QK^{T}/\sqrt{d_k}\right)_{i,j} \propto q_i^{T} k_j, \tag{3}$$
Since the features of each city are normalized along the feature dimension before input, each element in the score matrix partially reflects the correlation of historical PM2.5 trends between different cities. However, traditional normalization methods are not well-suited for multi-city PM2.5 prediction tasks, as different cities are influenced by varying pollution sources, geographic environments, and climatic conditions. Applying a uniform normalization across all cities introduces non-causal noise and may cause temporal lag effects, which hinders learning the local dynamic patterns of each city and degrades prediction performance. Therefore, the historical PM2.5 of each city is individually normalized to a standard normal distribution (mean 0, variance 1), as defined in Equation (4):
$$\mathrm{LayerNorm}(H) = \left\{ \frac{h_n - \mathrm{Mean}(h_n)}{\sqrt{\mathrm{Var}(h_n)}} \;\middle|\; n = 1, \ldots, N \right\}, \tag{4}$$
where $\mathrm{Mean}(h_n)$ and $\mathrm{Var}(h_n)$ denote the mean and variance of the historical PM2.5 representation of the $n$-th city. After individual normalization, a feed-forward network is applied to the entire historical PM2.5 representation of each city. Finally, the latent space features are mapped to PM2.5 at future time steps by a linear projection.
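The steps in Equations (2)–(4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sizes, the random weights, the linear embedding, and the order in which embedding and normalization are applied are all chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, D = 4, 24, 8                        # cities, history length, token dimension (illustrative)
X = rng.normal(50.0, 20.0, size=(N, L))   # hypothetical PM2.5 history, one row per city

def normalize_per_city(H):
    """Normalize each city's token to zero mean and unit variance, as in Equation (4)."""
    mean = H.mean(axis=1, keepdims=True)
    var = H.var(axis=1, keepdims=True)
    return (H - mean) / np.sqrt(var + 1e-8)

# Embedding: the whole series of a city becomes a single D-dimensional token.
W_embed = rng.normal(size=(L, D)) / np.sqrt(L)
H = normalize_per_city(X @ W_embed)       # shape (N, D)

# Multi-city self-attention scores before Softmax, as in Equation (3).
d_k = D
W_q = rng.normal(size=(D, d_k))
W_k = rng.normal(size=(D, d_k))
Q, K = H @ W_q, H @ W_k
A = Q @ K.T / np.sqrt(d_k)                # shape (N, N), entry (i, j) scales with q_i^T k_j

# Row-wise Softmax turns the scores into attention weights between cities.
weights = np.exp(A - A.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(weights.round(3))
```

Because each city is normalized with its own statistics, a city with persistently high concentrations and a city with low concentrations end up on the same scale before their tokens are compared in the attention step.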

3.2. Global Spatiotemporal Dependency of Auxiliary Features

Many existing spatial models use static graphs based on fixed distance or correlation, which fail to account for evolving inter-city pollution transport paths driven by dynamic wind patterns, shifting emission sources, or seasonal meteorology [37]. To capture the dynamic spatiotemporal correlations among stations within a region across different time periods, we propose a BSTA based on a bilinear spatiotemporal attention mechanism. The BSTA adaptively constructs an STDG with temporal dynamics to effectively mine the evolving spatiotemporal dependencies among multiple stations, thereby enhancing the regional generalization ability and spatiotemporal modeling capacity. Figure 6 illustrates the structure of the BSTA, comprising the temporal bilinear attention (TBA) and spatial bilinear attention (SBA) module to model dynamic correlations in the temporal and spatial dimensions, respectively.
Temporal bilinear attention module: The TBA module first models the correlations between different time steps along the temporal dimension to highlight the characteristics of key moments in the sequence. Let the input temporal sequence be $X \in \mathbb{R}^{L \times N \times d}$, where $L$ denotes the number of historical time steps, $N$ represents the number of monitoring stations, and $d$ denotes the feature dimension at each time step. The temporal attention matrix $E \in \mathbb{R}^{T \times T}$ is defined as in Equations (5)–(7):
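$$l_{\mathrm{time}} = (X^{T} \cdot G_1) \cdot G_2, \tag{5}$$
$$r_{\mathrm{time}} = G_3 \cdot X, \tag{6}$$
$$E = \mathrm{Softmax}\left(V_e \cdot \sigma\left(l_{\mathrm{time}} \cdot r_{\mathrm{time}} + b_e\right)\right), \tag{7}$$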
where $\sigma$ denotes the sigmoid activation function, $X$ is the input feature tensor, and the learnable parameters include $V_e, b_e \in \mathbb{R}^{T \times T}$, $G_1 \in \mathbb{R}^{N}$, $G_2 \in \mathbb{R}^{d_m \times N}$, and $G_3 \in \mathbb{R}^{d_m}$. This bilinear attention mechanism integrates both the station and feature dimension information to capture the global correlations across different time steps. Subsequently, we normalize the temporal attention matrix $E$ to obtain the temporal attention weights. Finally, the input feature tensor $X$ is multiplied by $E$ to obtain the weighted temporal feature matrix $X_E \in \mathbb{R}^{N \times T \times d_m}$ as in Equation (8):
$$X_E = X \cdot E, \tag{8}$$
Spatial bilinear attention module: The SBA module is capable of perceiving the influence of temporal dynamics on spatial relationships. After obtaining the temporally weighted features $X_E$, we further model the dynamic correlations among different monitoring stations in the spatial dimension. To this end, we construct a spatial attention matrix using a bilinear operation as in Equations (9)–(11):
$$l_s = (X_E \cdot Q_1) \cdot Q_2, \tag{9}$$
$$r_s = Q_3 \cdot X_E, \tag{10}$$
$$S = \mathrm{Softmax}\left(V_s \cdot \sigma\left(l_s \cdot r_s^{T} + b_s\right)\right), \tag{11}$$
where $V_s, b_s \in \mathbb{R}^{N \times N}$, $Q_1 \in \mathbb{R}^{T}$, $Q_2 \in \mathbb{R}^{d_m \times T}$, and $Q_3 \in \mathbb{R}^{d_m}$ are learnable parameters, and $S$ is the normalized spatial attention weight matrix.
The spatial attention weight matrix constitutes an STDG that varies with the input sequence and is directly utilized in downstream graph-structured modeling tasks, such as graph convolution.
Based on the above analysis, the BSTA jointly models the spatiotemporal dependencies across multiple monitoring stations through TBA and SBA mechanisms to dynamically generate an STDG structure, thereby effectively enhancing its ability to represent and learn complex spatiotemporal structures across multiple regions.
Our BSTA module introduces a dynamic graph construction mechanism via bilinear attention, which jointly considers spatial and temporal relevance. Specifically, bilinear temporal attention reweights the input features to emphasize temporally salient patterns, which are then used in bilinear spatial attention to infer adaptive spatiotemporal graphs. This allows the model to adaptively capture changing inter-city relationships over time—for example, temporary downwind pollution transfer from one city to another. Thus, BSTA complements the limitations of static or implicitly encoded spatial methods by offering flexible, data-driven graph adaptation.
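To make the shapes in the TBA and SBA computations concrete, the following is a minimal NumPy sketch of Equations (5)–(11). It is a sketch under stated assumptions rather than the authors' implementation: the tensor layout (station, feature, time), the row-wise Softmax axis, the random parameter values, and the helper names `sigmoid` and `softmax` are all chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d_m, T = 5, 3, 6                       # stations, feature dimension, time steps (illustrative)
X = rng.normal(size=(N, d_m, T))          # assumed layout: (station, feature, time)

# Temporal bilinear attention, Equations (5)-(7).
G1, G2, G3 = rng.normal(size=N), rng.normal(size=(d_m, N)), rng.normal(size=d_m)
V_e, b_e = rng.normal(size=(T, T)), rng.normal(size=(T, T))
l_time = np.einsum("nft,n->tf", X, G1) @ G2          # (T, d_m) @ (d_m, N) -> (T, N)
r_time = np.einsum("f,nft->nt", G3, X)               # (N, T)
E = softmax(V_e @ sigmoid(l_time @ r_time + b_e))    # (T, T), rows sum to 1
X_E = X @ E                                          # Equation (8), shape (N, d_m, T)

# Spatial bilinear attention on the temporally reweighted features, Equations (9)-(11).
Q1, Q2, Q3 = rng.normal(size=T), rng.normal(size=(d_m, T)), rng.normal(size=d_m)
V_s, b_s = rng.normal(size=(N, N)), rng.normal(size=(N, N))
l_s = (X_E @ Q1) @ Q2                                # (N, d_m) @ (d_m, T) -> (N, T)
r_s = np.einsum("f,nft->nt", Q3, X_E)                # (N, T)
S = softmax(V_s @ sigmoid(l_s @ r_s.T + b_s))        # (N, N) dynamic graph weights
print(S.round(3))
```

Because $S$ is recomputed from every input window, two windows with different wind or pollution patterns generally yield different graphs, which is the sense in which the STDG is dynamic.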

3.3. Local Spatiotemporal Feature Extraction

In practical urban air quality forecasting scenarios, pollutant diffusion and meteorological transport between neighboring regions exhibit significant spatial correlations. Therefore, we construct a spatial topology graph based on the adjacency relationships among cities to capture these inter-regional interactions and enhance representational capacity. Figure 7 shows the structure of the GESM. Specifically, the $N$ cities are defined as a set $C = \{c_1, c_2, \ldots, c_N\}$, where $c_n$ denotes the time series of monitoring data for city $n$. For a target city $n$ at a given time $t$, the model constructs the input features from the historical observations of the city over the past $L$ time steps $(t, t-1, \ldots, t-L)$ together with the historical data of its first-order and higher-order neighbors in the spatial graph, and predicts the corresponding values of $c_n$ over the next $P$ time steps $(t+1, t+2, \ldots, t+P)$.
We introduce a GNN to capture the latent spatial dependencies between cities, which dynamically learns the interaction mechanisms between nodes by integrating both the spatial topological structure and node feature information, thereby enhancing the representation of the spatiotemporal air quality dynamics. The GNN typically consists of aggregation, update, and iteration. Specifically, the feature of a single city node may be sparse and insufficient to accurately reflect its future air quality trends. Therefore, the GNN aggregates information from neighboring nodes of a target node, leveraging their features to compensate for the insufficiency of single-node representations. Then, the aggregated features from neighboring nodes are fused with the original features of the target node to update its representation through weighted summation, nonlinear transformations, or gated mechanisms. Finally, the GNN performs multiple rounds of aggregation and update, progressively incorporating broader neighborhood information for each node until the feature representations converge or it reaches a predefined number of iterations.
Graph convolution is a commonly used implementation of the GNN in practical applications [43]. In this study, we employ a Graph Convolutional Network (GCN) to capture the spatial dependencies among monitoring stations. Specifically, we construct the normalized Laplacian matrix based on the adjacency matrix $A$ and the degree matrix $D$, both obtained during the preprocessing stage. The adjacency matrix $A$ is computed using Vincenty’s formula [44] to accurately quantify the geodesic distances between spatial nodes. The normalized Laplacian is then formulated as in Equation (12):
$$\hat{A} = D^{-\frac{1}{2}} (D - A) D^{-\frac{1}{2}}, \tag{12}$$
where $\hat{A}$ is regarded as the aggregation operator in each update iteration. For example, an $m$-layer GCN is computed as in Equation (13):
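$$\begin{aligned} F_t^{(1)} &= \mathrm{ReLU}\left(\hat{A} F_t^{(0)} W^{(0)}\right) \\ F_t^{(2)} &= \mathrm{ReLU}\left(\hat{A} F_t^{(1)} W^{(1)}\right) \\ &\;\;\vdots \\ F_t^{(m)} &= \mathrm{ReLU}\left(\hat{A} F_t^{(m-1)} W^{(m-1)}\right), \end{aligned} \tag{13}$$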
where $F_t^{(m)}$ denotes the feature matrix of all nodes (i.e., cities) at time step $t$ after the $m$-th iteration, and $W^{(m-1)}$ is a learnable weight matrix. The embedding feature matrix encodes the spatial dependencies among cities, which is reflected in two key aspects: (1) Each row of the matrix denotes the embedding feature of a city at time step $t$; it no longer corresponds to the raw air quality or meteorological indicators, and its dimensionality has typically changed. During each update, the features of neighboring nodes are aggregated through multiplication with the normalized Laplacian matrix $\hat{A}$, which maintains the original feature dimension. This is followed by multiplication with a non-square learnable weight matrix $W^{(\cdot)}$ that performs the feature dimension transformation, typically mapping the features into a lower-dimensional space. (2) The aforementioned matrix represents the spatial feature embeddings of all cities at time step $t$, capturing the spatial relationships among features. To fully model temporal dynamics, the same graph convolution operation is applied to the city features over the past $L$ time steps $(t, t-1, \ldots, t-L)$, forming a feature sequence $F_t^{(m)}, F_{t-1}^{(m)}, \ldots, F_{t-L}^{(m)}$ that incorporates historical temporal information. Subsequently, we form a new input feature tensor from the temporal features and global spatiotemporal features, which is fed into the GRU together with the hidden state $h_n$ from the previous time step to update the hidden state. The updated hidden state serves as the final feature representation for the current time step and is passed through a fully connected layer to generate the prediction tensor $x_n$, which is appended to the prediction sequence pm25_pred, thereby producing the predicted PM2.5 concentrations for all cities over the next M time steps.
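The graph convolution in Equations (12) and (13) can be sketched as follows. The four-city adjacency matrix, the feature sizes, and the random weights are hypothetical values chosen for illustration, and the recurrent (GRU) step is omitted; this is a sketch of the propagation rule, not the authors' implementation.

```python
import numpy as np

# Hypothetical adjacency among four cities (1 = neighbors); in the paper the
# adjacency is derived from geodesic distances between cities.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_hat = D_inv_sqrt @ (np.diag(deg) - A) @ D_inv_sqrt   # Equation (12)

def gcn_forward(A_hat, F0, weights):
    """Apply F^(k+1) = ReLU(A_hat @ F^(k) @ W^(k)) for every weight matrix in turn."""
    F = F0
    for W in weights:
        F = np.maximum(A_hat @ F @ W, 0.0)
    return F

rng = np.random.default_rng(0)
F0 = rng.normal(size=(4, 3))                           # 4 cities, 3 input features at time t
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
F_out = gcn_forward(A_hat, F0, weights)                # 2-layer GCN, output shape (4, 2)
print(F_out.round(3))
```

Multiplying by `A_hat` mixes the features of neighboring cities without changing the feature dimension, while each non-square `W` changes the dimension (here from 3 to 4 to 2), mirroring the two roles described above.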

4. Experimental Setting and Results Analysis

4.1. Experimental Setting

Dataset: The KnowAir dataset [40] contains air quality monitoring data spanning four years, from 1 January 2015 to 31 December 2018, covering 184 cities (nodes) across China. To evaluate the effectiveness of our model in multi-city PM2.5 concentration prediction, we use historical data from the past 3 h to forecast PM2.5 trends for the next 72 h across all cities.
To assess the performance of our model in different scenarios, we divide the dataset into datasets 1, 2, and 3, as detailed in Table 2. Dataset 1: The entire dataset consists of training, validation, and testing sets, with a ratio of 2:1:1, evaluating overall air quality prediction performance. Dataset 2: This subset focuses on winter high-pollution periods and is partitioned equally (1:1:1). Due to increased emissions from winter heating combined with frequent northerly or northwesterly winds, PM2.5 emissions and long-range transport become more severe, which is more challenging for prediction, thus evaluating the adaptability under extreme pollution conditions. Dataset 3: This subset predicts PM2.5 concentrations of the subsequent month by using the first three months, with a split ratio of 3:1:1, which mainly serves to evaluate the model performance on long-term trend forecasting [45]. By employing these dataset partitions, we are able to comprehensively evaluate the adaptability and generalization ability of the model across diverse prediction scenarios.
Experimental Settings: We perform hyperparameter tuning using grid search and refer to the initialization strategies of baseline models. The Adam optimizer is used to adaptively adjust the learning rate based on gradient magnitudes, improving the stability and convergence speed of training. The batch size is set to 32, and the number of epochs is 50. The learning rate is set to 0.005, and the weight decay is 0.0001. To prevent overfitting and improve training efficiency, an early stopping mechanism is introduced to monitor performance on the validation set. Specifically, training is terminated early if the validation performance does not improve for 10 consecutive epochs, thereby avoiding unnecessary training and saving computational resources. Additionally, the Mean Squared Error (MSE) is employed as the loss function to quantify the difference between predicted and actual PM2.5 concentrations, serving as the optimization objective during model training.
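The early stopping rule described above can be sketched as a small function. The function name and its interface (a list of per-epoch validation losses) are assumptions made for illustration; only the patience of 10 epochs and the limit of 50 epochs come from the settings above.

```python
def train_with_early_stopping(val_losses, patience=10, max_epochs=50):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            # Validation loss improved: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch

# Hypothetical validation curve: improves for three epochs, then stagnates.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stop_epoch, best_epoch = train_with_early_stopping(losses)
print(stop_epoch, best_epoch)
```

In this example the best validation loss occurs at epoch 3, and training stops after ten further epochs without improvement; in a real training loop the model weights from the best epoch would typically be restored.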
Evaluation Metrics: To comprehensively evaluate the prediction accuracy and fitting performance, this paper utilizes the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R2) [46] as evaluation metrics, which are defined as in Equation (14):
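$$\mathrm{RMSE} = \sqrt{\frac{\sum_{j=1}^{n} (X_j - \hat{X}_j)^2}{n}}, \quad \mathrm{MAE} = \frac{\sum_{j=1}^{n} \left| X_j - \hat{X}_j \right|}{n}, \quad R^2 = 1 - \frac{\sum_{j=1}^{n} (X_j - \hat{X}_j)^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}, \tag{14}$$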
where $n$ is the number of samples, $X_j$ and $\hat{X}_j$ denote the actual and predicted PM2.5 concentrations of the $j$-th sample, respectively, and $\bar{X}$ represents the mean PM2.5 concentration across all samples.
Further, we use the Critical Success Index (CSI) and False Alarm Rate (FAR) [33] to comprehensively evaluate model performance under pollution threshold conditions in air quality prediction tasks. The CSI measures the accuracy in predicting rare pollution events (e.g., high PM2.5 concentrations), with higher values indicating better detection capability. The FAR assesses the frequency of incorrect pollution event predictions, where lower values suggest more reliable performance in forecasting such events. Specifically, we binarize the predicted and observed PM2.5 concentrations into 0/1 according to whether they exceed 75 μg/m³, the PM2.5 threshold in the ambient air quality standards of China [40], which serves as the criterion for good air quality.
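The evaluation metrics can be computed as in the sketch below. The regression metrics follow Equation (14) with the standard definition of R². The paper does not spell out formulas for CSI and FAR, so the commonly used contingency-table definitions are assumed here: CSI = hits / (hits + misses + false alarms) and FAR = false alarms / (hits + false alarms). The function names and the sample values are hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R2) for paired observations and predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

def event_metrics(actual, predicted, threshold=75.0):
    """Return (CSI, FAR) after binarizing both series at the pollution threshold."""
    hits = misses = false_alarms = 0
    for a, p in zip(actual, predicted):
        observed, forecast = a > threshold, p > threshold
        if observed and forecast:
            hits += 1
        elif observed:
            misses += 1
        elif forecast:
            false_alarms += 1
    # Guard against empty denominators when no events occur or are forecast.
    csi_den = hits + misses + false_alarms
    far_den = hits + false_alarms
    csi = hits / csi_den if csi_den else 0.0
    far = false_alarms / far_den if far_den else 0.0
    return csi, far

# Hypothetical concentrations in μg/m³, for illustration only.
actual = [80.0, 60.0, 90.0, 70.0]
predicted = [85.0, 80.0, 70.0, 65.0]
rmse, mae, r2 = regression_metrics(actual, predicted)
csi, far = event_metrics(actual, predicted)
print(rmse, mae, r2, csi, far)
```

In this toy example there is one hit, one miss, and one false alarm, which illustrates how a model can have moderate regression errors and still miss or falsely flag threshold exceedances.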

4.2. Experimental Results and Analysis

4.2.1. Comparison Evaluation with Different Models

To comprehensively evaluate the effectiveness of the proposed model, we compare it against several types of baseline models, including single-sequence forecasting models, hybrid deep learning models, and deep learning models optimized with swarm intelligence. Specifically, MLP, LSTM [47], and GRU [48] represent single-sequence forecasting models; GC-LSTM [49] and PM2.5-GNN [40] fall under hybrid deep learning models; and GWO-GART [33] is a deep learning model enhanced through swarm intelligence optimization. The EGCFC [34] method combines a hybrid graph convolutional GRU with a self-attention network. Table 3, Table 4, and Table 5 show the performance comparison between the proposed model and baseline methods on datasets 1, 2, and 3, respectively, with each value representing the average result over 10 independent runs.
It can be seen from Table 3, Table 4, and Table 5 that the traditional models (MLP, LSTM, and GRU) generally underperform the more advanced architectures in prediction accuracy: the MLP shows the weakest performance, while the LSTM and GRU are slightly better at capturing long-term dependencies within time series. Because PM2.5 concentrations are influenced by multiple complex factors, single neural network models have limited expressive power. In contrast, the hybrid methods (GAT-GRU, GC-LSTM, and PM2.5-GNN) capture both spatial and temporal dependencies and thereby improve prediction performance.
In particular, PM2.5-GNN enhances the interactions among nodes, improving predictive accuracy over GC-LSTM. GWO-GART, which integrates the GWO algorithm, surpasses PM2.5-GNN on most metrics, demonstrating the benefit of swarm intelligence in model optimization. The EGCFC model further strengthens the capacity to capture long-term spatiotemporal dependencies and outperforms the above methods. Our proposed model goes further in capturing complex spatiotemporal patterns in long-term PM2.5 forecasting through three components. The MS-iTransformer module captures long-range temporal trends specific to each monitoring site through station-wise temporal encoding and multi-station self-attention, improving robustness and accuracy over extended periods. The BSTA module introduces a bilinear attention mechanism that dynamically constructs spatiotemporal dependency graphs, enabling the model to adaptively learn evolving inter-city correlations across time. The GESM leverages graph convolution and recurrent modeling to extract fine-grained local spatial and temporal dependencies, enabling precise modeling of localized air pollution dynamics. By jointly capturing long-term temporal trends, dynamic global spatial dependencies, and localized spatiotemporal interactions, the proposed method achieves better multi-city PM2.5 forecasting performance than the existing methods.
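To make the idea of a dynamically constructed graph concrete, the following sketch shows a generic bilinear attention score between city feature vectors, score(i, j) = x_i^T W x_j, followed by a row-wise softmax that yields a dense, input-dependent adjacency matrix. It is a simplified illustration under stated assumptions and not the BSTA module itself: the function names, the toy features, and the identity weight matrix are all hypothetical, and in practice W would be learned.

```python
import math


def bilinear_scores(features, weight):
    """Pairwise bilinear scores: score[i][j] = x_i^T W x_j.

    `features` is a list of n vectors of length d; `weight` is a d x d matrix.
    """
    n = len(features)
    d = len(weight)
    # projected[i] = x_i^T W
    projected = [
        [sum(x[m] * weight[m][k] for m in range(d)) for k in range(d)]
        for x in features
    ]
    return [
        [sum(projected[i][k] * features[j][k] for k in range(d)) for j in range(n)]
        for i in range(n)
    ]


def row_softmax(matrix):
    """Normalize each row into attention weights that sum to 1."""
    result = []
    for row in matrix:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def dynamic_adjacency(features, weight):
    """Input-dependent adjacency built from bilinear attention scores."""
    return row_softmax(bilinear_scores(features, weight))


# Three hypothetical cities with two features each.
city_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity: reduces to a plain dot product
adjacency = dynamic_adjacency(city_features, weight)
for row in adjacency:
    print([round(v, 3) for v in row])
```

Because the adjacency is recomputed from the current features, the edge weights change whenever the inputs change, which is the property that distinguishes a dynamic graph from a fixed distance-based one.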
Quantitatively, compared to the EGCFC model on dataset 1 (Table 3), the proposed method reduces the MAE and RMSE by 0.7537 and 0.7751, respectively, improves the CSI and R2 by 0.0071 and 0.0159, and reduces the FAR by 0.0272. On dataset 2 (Table 4), the MAE and RMSE decrease by 1.0849 and 1.0547, respectively, R2 improves by 0.0188, and the FAR decreases by 0.0440. On dataset 3 (Table 5), the MAE and RMSE decrease by 1.7665 and 1.8578, respectively, the CSI and R2 improve by 0.0194 and 0.0260, and the FAR decreases by 0.0312. These results indicate the robustness, adaptability, and generalization capability of the proposed model under diverse forecasting scenarios.
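The dataset 1 improvements quoted above follow directly from the mean values in Table 3. The short check below recomputes them; the dictionary layout and the helper name are illustrative choices, while the numbers are the EGCFC and "ours" rows of Table 3.

```python
# Mean metric values on dataset 1, copied from Table 3.
egcfc = {"MAE": 14.8373, "RMSE": 19.0834, "CSI": 0.4959, "FAR": 0.2567, "R2": 0.6168}
ours = {"MAE": 14.0836, "RMSE": 18.3083, "CSI": 0.5030, "FAR": 0.2295, "R2": 0.6327}

# For these metrics a decrease is an improvement; for CSI and R2 an increase is.
LOWER_IS_BETTER = {"MAE", "RMSE", "FAR"}


def improvement(metric):
    """Signed gain of the proposed model over EGCFC (positive means better)."""
    diff = egcfc[metric] - ours[metric]
    return round(diff if metric in LOWER_IS_BETTER else -diff, 4)


gains = {metric: improvement(metric) for metric in egcfc}
print(gains)
```

Expressing every gain with the same sign convention (positive means better) makes it easy to spot a metric on which a model regresses.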

4.2.2. Comparison Forecast Performance in a Representative City

To validate the effectiveness of our proposed model, we compare the predicted and actual PM2.5 concentrations on dataset 3 across different models, as illustrated in Figure 8. Specifically, we select Xianyang city to evaluate predictive capabilities because it has multiple pollution sources and significant PM2.5 fluctuations.
It can be seen from Figure 8 that our model demonstrates higher fitting accuracy compared to other methods, with its prediction curve closely aligning with the observed values. Notably, our model maintains strong predictive performance during periods of high PM2.5 concentrations. These results indicate that our model outperforms others in long-term forecasting tasks, particularly under complex pollution conditions.
To further validate the effectiveness of our proposed method, we evaluate the correlation between the predicted and actual PM2.5 concentrations. Figure 9 presents the scatter plot of predictions for Xianyang city, where the solid line is the linear regression fit and the dashed line is the reference line. Our model demonstrates a significantly higher correlation between predicted and observed values compared to other models. The predicted points and the fitted regression line are notably closer to the reference line, accurately capturing the variation patterns of PM2.5 concentrations in long-term forecasting tasks. Particularly, our model consistently outperforms others under high PM2.5 concentration scenarios. In contrast, the predictions of other models show larger deviations from the actual values, highlighting the superior accuracy of our proposed model even under complex environmental conditions.
We also compare prediction performance in Yanan, which is located in a sparsely monitored region, as shown in Figure 10. The sparse spatial layout around Yanan (Figure 1) limits the contextual information available from neighboring stations, thereby weakening the spatial dependencies that the model can capture. In addition, the observed PM2.5 series in Yanan exhibits strong fluctuations and irregular patterns, as shown in Figure 10, where the MLP model yields a low R2 value of 0.1062. This suggests that unstable and noisy temporal dynamics further hinder predictive accuracy. These findings highlight that both limited spatial connectivity and temporal instability can negatively affect model performance. Nevertheless, our model remains robust under such adverse conditions, outperforming baseline methods in both stable and unstable environments.
The results reveal that station density and data stability both impact predictive performance. However, our model consistently maintains superior results in both scenarios. Consequently, the model generalizes well to both densely and sparsely instrumented areas, as evidenced by the strong correlation and low error metrics achieved in both cases.
Nevertheless, we also recognize potential limitations. The current model is primarily trained and tested on PM2.5 data. While the architecture is flexible and can be extended to other pollutants (e.g., NO2 and O3), this transfer may require retraining and feature re-selection to account for different physical and chemical characteristics.

4.2.3. Comparison of Our Model and Existing Methods at Multiple Time Steps

Long-term prediction typically refers to horizons beyond 48 h [33]. Because meteorological influences and pollution dynamics accumulate over multiple days, we use 60 h and 72 h forecasts as representative examples of long-term forecasting. Table 6 presents the evaluation results of MAE and RMSE on dataset 3. The proposed model significantly outperforms the other methods in both metrics. For example, compared to the EGCFC method for the 60 h and 72 h forecasting tasks, the MAE is reduced by 1.7849 and 2.4503, respectively, and the RMSE is reduced by 2.5580 and 1.8578, respectively. We attribute these gains in prediction accuracy to the explicit modeling of dynamic spatiotemporal relationships, which enhances the model's ability to track evolving pollutant transport patterns rather than relying solely on short-term autoregressive signals.
Our method also shows marked advantages in detecting and handling extreme pollution events. Table 7 presents the evaluation results of the CSI and FAR metrics on dataset 3. Our model achieves the best CSI among all compared methods, with particularly notable improvements in mid-term to long-term forecasting. Specifically, the CSI is improved by 0.0154 in the 60 h forecast and by 0.0194 in the 72 h forecast, suggesting a higher true positive rate in predicting pollution exceedances. We attribute this to the model's precise modeling of complex inter-city dependencies across both spatial and temporal dimensions. For the FAR metric, our proposed model performs comparably to EGCFC in short-term predictions (3–24 h) but exhibits significant advantages in mid-term to long-term predictions (24–72 h), where the FAR is decreased by 0.0321 at 60 h and by 0.0312 at 72 h, meaning fewer false alarms in severe cases. This robustness under extreme conditions reflects the effectiveness of our spatial edge design, particularly the advection coefficient, which encodes physically plausible pollution transport based on wind direction and city-to-city proximity. This allows the model to anticipate pollutant incursions even when the local conditions alone do not strongly indicate a pollution rise.
Table 8 presents the comparison results of the R2 metric on dataset 3. Similar to the FAR metric, our proposed model performs comparably to EGCFC in short-term predictions (3–24 h). However, the proposed method demonstrates a clear advantage in mid-term to long-term predictions (24–72 h), with R2 improvements of 0.0044 and 0.0260 at 60 and 72 h, respectively, indicating that our model better captures the underlying variance in PM2.5 concentrations over extended horizons. This advantage is closely related to our model's ability to represent both short-range and long-range pollutant transport patterns, which become increasingly important in long-term forecasting. Baseline models often rely on temporal continuity or local correlations, whose predictive power diminishes beyond 24 h. In contrast, our approach leverages both physically grounded features (e.g., boundary layer height, K-index, and wind components) and adaptive graph structures based on advection coefficients, enabling the model to simulate evolving pollution dynamics across city networks. This facilitates more accurate extrapolation beyond immediate historical data. In addition, the spatiotemporal attention mechanisms embedded in our architecture allow for flexible reweighting of relevant historical and spatial information, dynamically adapting to different atmospheric and emission regimes. This ensures that critical dependencies, such as long-range transport during stagnant weather or delayed cross-city pollution drift, are not overlooked in multi-step forecasting.
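As a rough illustration of how an advection-style edge weight can encode wind-driven transport between two cities, the sketch below uses a form along the lines of the one described for PM2.5-GNN [40]: the source wind speed is projected onto the source-to-target direction, divided by the inter-city distance, and clipped at zero. This exact form is an assumption made for illustration, and the definition used in our methodology may differ. The parameter names follow the edge attributes in Table 1 (ws, wd, drf, db), and both angles are assumed to use the same convention (the direction toward which air moves).

```python
import math


def advection_coefficient(ws, wd_deg, drf_deg, db_km):
    """Illustrative advection-style edge weight (assumed form).

    ws      -- wind speed at the source city
    wd_deg  -- wind direction in degrees
    drf_deg -- direction from source city to target city in degrees
    db_km   -- distance between the two cities in km (must be > 0)
    """
    theta = math.radians(abs(drf_deg - wd_deg))
    # Component of the wind blowing toward the target, attenuated by distance.
    # Negative values (wind blowing away from the target) are clipped to zero.
    return max(0.0, ws * math.cos(theta) / db_km)


# Hypothetical values: wind blowing straight toward the target vs. straight away.
aligned = advection_coefficient(ws=10.0, wd_deg=90.0, drf_deg=90.0, db_km=100.0)
opposed = advection_coefficient(ws=10.0, wd_deg=270.0, drf_deg=90.0, db_km=100.0)
print(aligned, opposed)
```

Under this form an edge is active only when the wind has a component toward the target city, so the resulting graph is directed and changes with the weather.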

4.2.4. Comparison of Model Runtime and Complexity

To evaluate the computational efficiency of our proposed model, we report both runtime (in seconds) and complexity, as shown in Table 9. All models are tested under the same experimental environment to ensure fairness. Table 9 shows that although simple models such as MLP, GRU, and LSTM have very low runtime and computational complexity, they also achieve significantly lower prediction accuracy (as discussed in Section 4.2.1). Our model, while more complex than traditional RNN-based models (e.g., GRU and LSTM), demonstrates competitive computational efficiency compared to other graph-based models, such as PM2.5-GNN, EGCFC, and especially GWO-GART. Specifically, our model achieves a runtime of 1001.20 s with 52.008 G FLOPs and only 0.090 M parameters, which is about 11% faster than EGCFC (1125.36 s) and substantially more efficient than GWO-GART. Compared to PM2.5-GNN, our model has slightly higher FLOPs (52.008 G vs. 51.330 G) and a longer runtime, primarily due to the richer modeling components and edge feature mechanisms. However, its cost remains well within a practical range for real-world forecasting scenarios. These results indicate that the proposed method provides a favorable trade-off between model expressiveness and computational feasibility, making it suitable for large-scale spatiotemporal applications, such as forecasting air pollution across an urban network of 184 cities.
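For reference, the relative figures quoted above can be checked from the reported runtimes and FLOPs. The worked arithmetic below uses only the values stated in the text; expressing the FLOPs gap as a percentage is an added illustration.

$$
\frac{1125.36 - 1001.20}{1125.36} = \frac{124.16}{1125.36} \approx 0.110, \qquad
\frac{52.008 - 51.330}{51.330} = \frac{0.678}{51.330} \approx 0.013 .
$$

That is, the proposed model runs about 11% faster than EGCFC while requiring roughly 1.3% more FLOPs than PM2.5-GNN.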

4.3. Ablation Study

We conduct ablation experiments on datasets 1, 2, and 3, as shown in Table 10, to evaluate the contribution of the key modules in our proposed model. Specifically, we design the following ablation variants:
w/o GESM: This variant removes the GESM, which includes graph convolution and recurrent units used to capture localized spatial and short-term temporal dependencies. Compared with this variant on dataset 1, the full model improves the MAE and RMSE, increases R2 and CSI by 0.0024 and 0.0012, respectively, and decreases the FAR by 0.0097. Table 10 further shows that the gain of the full model over this variant is more pronounced on datasets 2 and 3. Many baselines either model spatial and temporal dependencies separately or overlook localized spatiotemporal patterns (e.g., city-specific short-term meteorological events). The GESM integrates graph convolution with gated recurrent mechanisms to jointly model short-term dynamics and localized structural dependencies. This is particularly effective in scenarios where a city experiences a sudden weather shift or emission spike that does not immediately propagate to others. The GESM ensures that such localized signals are captured and learned effectively, reducing false alarms and enhancing response sensitivity.
w/o MS-iTransformer: This variant removes the MS-iTransformer module, which is designed to capture long-term temporal trends from the historical data of individual stations. Compared with this variant on dataset 3, the full model improves the MAE and RMSE, increases R2 and CSI by 0.0227 and 0.0144, respectively, and decreases the FAR by 0.0147. The full model also achieves a notable improvement in prediction accuracy over this variant on datasets 1 and 2. Traditional time-series models (e.g., LSTM and GRU) focus primarily on short-term or fixed-length temporal windows. These methods struggle to learn long-term pollutant accumulation trends or delayed meteorological influences (e.g., distant precipitation or multi-day wind patterns) [30]. The MS-iTransformer overcomes this by leveraging a station-specific transformer structure, which enables the model to extract long-range temporal dependencies without being constrained by fixed memory lengths. This is particularly beneficial for recognizing multi-day pollution buildup or lagged meteorological effects, which often influence air quality trends beyond 48 h.
Notably, the proposed model achieves the best performance across all datasets, with the MAE and RMSE reduced by 4.0328 and 4.6769, the CSI increased by 0.0364, the FAR decreased by 0.0740, and R2 improved by 0.0630 compared to the baseline model on dataset 3. The ablation study results demonstrate that the spatial feature and global feature modules are effective in enhancing long-term prediction performance.

5. Conclusions

This paper proposes a novel dynamic global–local spatiotemporal graph framework for long-term PM2.5 forecasting across multiple cities. Specifically, the MS-iTransformer module captures station-specific long-term temporal trends, the BSTA module dynamically models inter-city spatiotemporal dependencies, and the GESM learns fine-grained local interactions. By jointly modeling long-range temporal patterns, global spatial dynamics, and localized spatiotemporal relationships, the proposed model demonstrates superior performance in multi-step PM2.5 prediction, showing significant improvements over existing methods on real-world multi-city air quality datasets. Relative to the strongest baseline (EGCFC) on dataset 3, the MAE, RMSE, and FAR are decreased by 1.7665, 1.8578, and 0.0312, respectively, and the CSI and R2 are improved by 0.0194 and 0.0260, respectively.

Author Contributions

Conceptualization, R.W.; methodology, Y.H.; validation, Y.H.; formula derivation, Y.H., and R.W.; writing—original draft preparation, Y.H.; writing—review and editing, Y.H., Y.X., and S.F.; visualization, X.Z.; funding acquisition, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (grant numbers 61771299 and 62071286).

Data Availability Statement

Data are contained within the article.

Acknowledgments

We thank the Ministry of Ecology and Environment of China for providing the PM2.5 concentration data for each city, and the ERA5 atmospheric reanalysis project for offering the meteorological and environmental indicators used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feng, Y.; Castro, E.; Wei, Y.; Jin, T.; Qiu, X.; Dominici, F.; Schwartz, J. Long-term exposure to ambient PM2.5, particulate constituents and hospital admissions from non-respiratory infection. Nat. Commun. 2024, 15, 1518.
  2. Lim, S.; Bassey, E.; Bos, B.; Makacha, L.; Varaden, D.; Arku, R.E.; Baumgartner, J.; Brauer, M.; Ezzati, M.; Kelly, F.J.; et al. Comparing human exposure to fine particulate matter in low and high-income countries: A systematic review of studies measuring personal PM2.5 exposure. Sci. Total Environ. 2022, 833, 155207.
  3. Han, D.; Guo, Y.; Wang, J.; Zhao, B. Global disparities in indoor wildfire-PM2.5 exposure and mitigation costs. Sci. Adv. 2025, 11, eads4360.
  4. Bae, M.; Kang, Y.; Kim, E.; Kim, S.; Kim, S. A multifaceted approach to explain short-and long-term PM2.5 concentration changes in Northeast Asia in the month of January during 2016–2021. Sci. Total Environ. 2023, 880, 163309.
  5. Wei, J.; Wang, J.; Li, Z.; Kondragunta, S.; Anenberg, S.; Wang, Y.; Zhang, H.; Diner, D.; Hand, J.; Lyapustin, A.; et al. Long-term mortality burden trends attributed to black carbon and PM2.5 from wildfire emissions across the continental USA from 2000 to 2020: A deep learning modelling study. Lancet Planet. Health 2023, 7, e963–e975.
  6. Lin, M.D.; Liu, P.Y.; Huang, C.W.; Lin, Y.H. The application of strategy based on LSTM for the short-term prediction of PM2.5 in city. Sci. Total Environ. 2024, 906, 167892.
  7. Zhu, S.; Tang, J.; Zhou, X.; Li, P.; Liu, Z.; Zhang, C.; Zou, Z.; Li, T.; Peng, C. Research progress, challenges, and prospects of PM2.5 concentration estimation using satellite data. Environ. Rev. 2023, 31, 605–631.
  8. Ma, Z.; Dey, S.; Christopher, S.; Liu, R.; Bi, J.; Balyan, P.; Liu, Y. A review of statistical methods used for developing large-scale and long-term PM2.5 models from satellite data. Remote Sens. Environ. 2022, 269, 112827.
  9. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2.
  10. Graves, A.; Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; MIT Press: Cambridge, MA, USA, 2012; pp. 37–45.
  11. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
  12. Waqas, M.; Humphries, U.W. A critical review of RNN and LSTM variants in hydrological time series predictions. MethodsX 2024, 13, 102946.
  13. Zhou, H.; Li, J.; Zhang, S.; Zhang, S.; Yan, M.; Xiong, H. Expanding the prediction capacity in long sequence time-series forecasting. Artif. Intell. 2023, 318, 103886.
  14. Zhu, J.; Deng, F.; Zhao, J.; Zheng, H. Attention-based parallel networks (APNet) for PM2.5 spatiotemporal prediction. Sci. Total Environ. 2021, 769, 145082.
  15. Fang, S.; Li, Q.; Karimian, H.; Liu, H.; Mo, Y. DESA: A novel hybrid decomposing-ensemble and spatiotemporal attention model for PM2.5 forecasting. Environ. Sci. Pollut. Res. 2022, 29, 54150–54166.
  16. Wen, C.; Liu, S.; Yao, X.; Peng, L.; Li, X.; Hu, Y.; Chi, T. A novel spatiotemporal convolutional long short-term neural network for air pollution prediction. Sci. Total Environ. 2019, 654, 1091–1099.
  17. Zhang, K.; Yang, X.; Cao, H.; Thé, J.; Tan, Z.; Yu, H. Multi-step forecast of PM2.5 and PM10 concentrations using convolutional neural network integrated with spatial–temporal attention and residual learning. Environ. Int. 2023, 171, 107691.
  18. Teutscher, D.; Bukreev, F.; Kummerländer, A.; Simonis, S.; Bächler, P.; Rezaee, A.; Hermansdorfer, M.; Krause, M.J. A digital urban twin enabling interactive pollution predictions and enhanced planning. Build. Environ. 2025, 281, 113093.
  19. Zhang, D.; Martin, R.V.; Bindle, L.; Li, C.; Eastham, S.D.; van Donkelaar, A.; Gallardo, L. Advances in simulating the global spatial heterogeneity of air quality and source sector contributions: Insights into the global South. Environ. Sci. Technol. 2023, 57, 6955–6964.
  20. Chen, X.; Zhang, Y.; Wang, Y.; Zhang, L.; Yi, Z.; Zhang, H.; Mathiopoulos, P.T. A spatiotemporal interpolation graph convolutional network for estimating PM2.5 concentrations based on urban functional zones. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–14.
  21. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80.
  22. Jin, M.; Koh, H.Y.; Wen, Q.; Zambon, D.; Alippi, C.; Webb, G.I.; King, I.; Pan, S. A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10466–10485.
  23. Peng, H.; Wang, H.; Du, B.; Bhuiyan, M.Z.A.; Ma, H.; Liu, J.; Wang, L.; Yang, Z.; Du, L.; Wang, S.; et al. Spatial temporal incidence dynamic graph neural networks for traffic flow forecasting. Inf. Sci. 2020, 521, 277–290.
  24. Li, Y.; Liang, W.; Peng, L.; Zhang, D.; Yang, C.; Li, K.C. Predicting drug-target interactions via dual-stream graph neural network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 21, 948–958.
  25. Zhang, Y.; Hu, Y.; Han, N.; Yang, A.; Liu, X.; Cai, H. A survey of drug-target interaction and affinity prediction methods via graph neural networks. Comput. Biol. Med. 2023, 163, 107136.
  26. Kumar, S.; Mallik, A.; Khetarpal, A.; Panda, B.S. Influence maximization in social networks using graph embedding and graph neural network. Inf. Sci. 2022, 607, 1617–1636.
  27. Sharma, K.; Lee, Y.C.; Nambi, S.; Salian, A.; Shah, S.; Kim, S.W.; Kumar, S. A survey of graph neural networks for social recommender systems. ACM Comput. Surv. 2024, 56, 1–34.
  28. Chen, C.; Wu, Y.; Dai, Q.; Zhou, H.Y.; Xu, M.; Yang, S.; Han, X.; Yu, Y. A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10297–10318.
  29. Klepl, D.; Wu, M.; He, F. Graph neural network-based eeg classification: A survey. IEEE Trans. Neural Syst. Rehabil. Eng. 2024, 32, 493–503.
  30. Ye, Y.; Cao, Y.; Dong, Y.; Yan, H. A Graph Neural Network and Transformer-based model for PM2.5 prediction through spatiotemporal correlation. Environ. Model. Softw. 2025, 191, 106501.
  31. Chen, Y.; Wu, Y.; Zhang, S.; Yuan, K.; Huang, J.; Shi, D.; Hu, S. Regional PM2.5 prediction with hybrid directed graph neural networks and Spatio-temporal fusion of meteorological factors. Environ. Pollut. 2025, 366, 125404.
  32. Chang-Silva, R.; Tariq, S.; Loy-Benitez, J.; Yoo, C. Smart solutions for urban health risk assessment: A PM2.5 monitoring system incorporating spatiotemporal long-short term graph convolutional network. Chemosphere 2023, 335, 139071.
  33. Zhang, C.; Wang, S.; Wu, Y.; Zhu, X.; Shen, W. A long-term prediction method for PM2.5 concentration based on spatiotemporal graph attention recurrent neural network and grey wolf optimization algorithm. J. Environ. Chem. Eng. 2024, 12, 111716.
  34. Zhang, C.; Li, X.; Sheng, H.; Shen, Y.; Xie, W.; Zhu, X. Long-term prediction method for PM2.5 concentration using edge channel graph attention network and gating closed-form continuous-time neural networks. Process Saf. Environ. Prot. 2024, 189, 356–373.
  35. Zhao, G.; Yang, X.; Shi, J.; He, H.; Wang, Q. A PM2.5 spatiotemporal prediction model based on mixed graph convolutional GRU and self-attention network. Environ. Pollut. 2025, 368, 125748.
  36. Zheng, W.; Hu, J. Multivariate time series prediction based on temporal change information learning method. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7034–7048.
  37. Chen, X.; Hu, Y.; Dong, F.; Chen, K.; Xia, H. A multi-graph spatial-temporal attention network for air-quality prediction. Process Saf. Environ. Prot. 2024, 181, 442–451.
  38. Hu, W.; Zhang, Z.; Zhang, S.; Chen, C.; Yuan, J.; Yao, J.; Zhao, S.; Guo, L. Learning spatiotemporal dependencies using adaptive hierarchical graph convolutional neural network for air quality prediction. J. Clean. Prod. 2024, 459, 142541.
  39. Liu, X.; Chang, M.; Zhang, J.; Wang, J.; Gao, H.; Gao, Y.; Yao, X. Rethinking the causes of extreme heavy winter PM2.5 pollution events in northern China. Sci. Total Environ. 2021, 794, 148637.
  40. Wang, S.; Li, Y.; Zhang, J.; Meng, Q.; Meng, L.; Gao, F. PM2.5-gnn: A domain knowledge enhanced graph neural network for PM2.5 forecasting. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 3–6 November 2020; pp. 163–166.
  41. Pang, N.; Gao, J.; Che, F.; Ma, T.; Liu, S.; Yang, Y.; Zhao, P.; Yuan, J.; Liu, J.; Xu, Z.; et al. Cause of PM2.5 pollution during the 2016-2017 heating season in Beijing, Tianjin, and Langfang, China. J. Environ. Sci. 2020, 95, 201–209.
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS′17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  43. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
  44. Kipf, T.N. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
  45. Abuouelezz, W.; Ali, N.; Aung, Z.; Altunaiji, A.; Shah, S.B.; Gliddon, D. Exploring PM2.5 and PM10 ML forecasting models: A comparative study in the UAE. Sci. Rep. 2025, 15, 9797.
  46. Chen, M.H.; Chen, Y.C.; Chou, T.Y.; Ning, F.S. PM2.5 concentration prediction model: A CNN–RF ensemble framework. Int. J. Environ. Res. Public Health 2023, 20, 4077.
  47. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955.
  48. Weerakody, P.B.; Wong, K.W.; Wang, G.; Ela, W. A review of irregular time series data handling with gated recurrent neural networks. Neurocomputing 2021, 441, 161–178.
  49. Qi, Y.; Li, Q.; Karimian, H.; Liu, D. A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory. Sci. Total Environ. 2019, 664, 1–10.
Figure 1. The 184 urban areas included in this study, the pentagram denotes the background.
Figure 2. Feature analysis based on random forest.
Figure 3. Feature analysis based on Pearson correlation coefficient.
Figure 4. The proposed overall framework.
Figure 5. MS-iTransformer module.
Figure 6. The structure of BSTA.
Figure 7. GESM.
Figure 8. Comparison of predicted and observed PM2.5 concentrations in Xianyang; panels (a–h) correspond to MLP, LSTM, GRU, GC-LSTM, PM2.5-GNN, GWO-GART, EGCFC, and our proposed model.
Figure 9. Comparison scatter plot of predicted and observed PM2.5 concentrations in Xianyang.
Figure 10. Comparison of predicted and observed PM2.5 concentrations in Yanan; panels (a–h) correspond to MLP, LSTM, GRU, GC-LSTM, PM2.5-GNN, GWO-GART, EGCFC, and our proposed model.
Table 1. Data attributes.
| Nodes/Edges | Feature | Unit |
|---|---|---|
| Nodes | k_index | K |
| Nodes | 2m_temperature | K |
| Nodes | surface_pressure | Pa |
| Nodes | total_precipitation | m |
| Nodes | boundary_layer_height | m |
| Nodes | relative_humidity+950 | % |
| Nodes | u_component_of_wind+950 | m/s |
| Nodes | v_component_of_wind+950 | m/s |
| Edges | wind_speed_of_source_city (ws) | km/h |
| Edges | wind_direction_of_target_city (wd) | ° |
| Edges | direction_from_source_city_to_target_city (drf) | ° |
| Edges | distance_between_source_city_and_target_city (db) | km |
| Edges | advection_coefficient | % |
Table 2. KnowAir dataset.
| Dataset | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Dataset 1 | 1 January 2015–31 December 2016 | 31 December 2016–31 December 2017 | 31 December 2017–31 December 2018 |
| Dataset 2 | 1 November 2015–28 February 2016 | 1 November 2016–28 February 2017 | 1 November 2017–28 February 2018 |
| Dataset 3 | 1 September 2016–30 November 2016 | 30 November 2016–30 December 2016 | 30 December 2016–31 January 2017 |
Table 3. The performance comparison between our model and baseline methods on dataset 1 (mean ± standard deviation).
| Dataset | Model | Train Loss | Validate Loss | Test Loss | MAE | RMSE | CSI | FAR | R2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MLP | 0.5624 ± 0.0085 | 0.5269 ± 0.0088 | 0.5537 ± 0.0092 | 18.1240 ± 0.1940 | 22.6306 ± 0.1950 | 0.4252 ± 0.0047 | 0.3502 ± 0.0145 | 0.4651 ± 0.0089 |
| 1 | LSTM | 0.4229 ± 0.0037 | 0.4287 ± 0.0017 | 0.4571 ± 0.0024 | 16.1657 ± 0.1289 | 20.4499 ± 0.1110 | 0.4615 ± 0.0037 | 0.3038 ± 0.0079 | 0.5585 ± 0.0023 |
| 1 | GRU | 0.4286 ± 0.0020 | 0.4266 ± 0.0021 | 0.4512 ± 0.0018 | 16.0839 ± 0.1229 | 20.3553 ± 0.1049 | 0.4653 ± 0.0042 | 0.3029 ± 0.0110 | 0.5642 ± 0.0017 |
| 1 | GC-LSTM | 0.4098 ± 0.0030 | 0.4206 ± 0.0016 | 0.4411 ± 0.0031 | 15.9462 ± 0.1132 | 20.1977 ± 0.1063 | 0.4702 ± 0.0046 | 0.3049 ± 0.0138 | 0.5739 ± 0.0030 |
| 1 | PM2.5-GNN | 0.3972 ± 0.0042 | 0.3987 ± 0.0035 | 0.4185 ± 0.0042 | 15.4801 ± 0.1412 | 19.6491 ± 0.1355 | 0.4852 ± 0.0036 | 0.2897 ± 0.0114 | 0.5957 ± 0.0041 |
| 1 | GWO-GART | 0.3621 ± 0.0043 | 0.4013 ± 0.0038 | 0.4229 ± 0.0041 | 15.3649 ± 0.1534 | 19.6826 ± 0.1415 | 0.4891 ± 0.0046 | 0.2753 ± 0.0155 | 0.5912 ± 0.0039 |
| 1 | EGCFC | 0.3384 ± 0.0081 | 0.3858 ± 0.0032 | 0.3997 ± 0.0038 | 14.8373 ± 0.0050 | 19.0834 ± 0.1511 | 0.4959 ± 0.0049 | 0.2567 ± 0.0150 | 0.6168 ± 0.0045 |
| 1 | ours | 0.3567 ± 0.0034 | 0.3601 ± 0.0017 | 0.3756 ± 0.0018 | 14.0836 ± 0.0676 | 18.3083 ± 0.0851 | 0.5030 ± 0.0068 | 0.2295 ± 0.0093 | 0.6327 ± 0.0044 |
Table 4. The performance comparison between our model and baseline methods on dataset 2.
| Dataset | Model | Train Loss | Validate Loss | Test Loss | MAE | RMSE | CSI | FAR | R² |
|---|---|---|---|---|---|---|---|---|---|
| 2 | MLP | 0.6409 ± 0.0066 | 0.6372 ± 0.0083 | 0.6523 ± 0.0096 | 28.4975 ± 0.3455 | 35.1934 ± 0.3549 | 0.4628 ± 0.0116 | 0.3081 ± 0.0116 | 0.3770 ± 0.0092 |
| | LSTM | 0.4464 ± 0.0140 | 0.5172 ± 0.0065 | 0.5459 ± 0.0107 | 25.8818 ± 0.3199 | 32.3494 ± 0.3496 | 0.5114 ± 0.0090 | 0.2975 ± 0.0076 | 0.4785 ± 0.0102 |
| | GRU | 0.4584 ± 0.0070 | 0.5068 ± 0.0031 | 0.5333 ± 0.0065 | 25.4581 ± 0.2491 | 31.8953 ± 0.2443 | 0.5142 ± 0.0059 | 0.2958 ± 0.0097 | 0.4906 ± 0.0062 |
| | GC-LSTM | 0.4336 ± 0.0102 | 0.5136 ± 0.0055 | 0.5410 ± 0.0098 | 25.7895 ± 0.2607 | 32.2493 ± 0.2876 | 0.5125 ± 0.0065 | 0.2933 ± 0.0082 | 0.4832 ± 0.0093 |
| | PM2.5-GNN | 0.4379 ± 0.0079 | 0.4855 ± 0.0032 | 0.5110 ± 0.0044 | 24.9161 ± 0.2012 | 31.2798 ± 0.1915 | 0.5258 ± 0.0052 | 0.2906 ± 0.0086 | 0.511 ± 0.0042 |
| | GWO-GART | 0.4319 ± 0.0068 | 0.4847 ± 0.0031 | 0.4941 ± 0.0042 | 24.1134 ± 0.1944 | 30.5662 ± 0.1871 | 0.5338 ± 0.0053 | 0.2729 ± 0.0083 | 0.5285 ± 0.0043 |
| | EGCFC | 0.4005 ± 0.0072 | 0.4787 ± 0.0032 | 0.4912 ± 0.0042 | 23.9113 ± 0.1932 | 30.3868 ± 0.1862 | 0.5493 ± 0.0054 | 0.2650 ± 0.0084 | 0.5314 ± 0.0044 |
| | Ours | 0.3477 ± 0.0112 | 0.4429 ± 0.0025 | 0.4542 ± 0.0076 | 22.8264 ± 0.3842 | 29.3321 ± 0.3881 | 0.5396 ± 0.0123 | 0.2210 ± 0.0111 | 0.5502 ± 0.0115 |
Table 5. The performance comparison between our model and baseline methods on dataset 3.
| Dataset | Model | Train Loss | Validate Loss | Test Loss | MAE | RMSE | CSI | FAR | R² |
|---|---|---|---|---|---|---|---|---|---|
| 3 | MLP | 0.6229 ± 0.0101 | 0.7502 ± 0.0171 | 0.5570 ± 0.0108 | 38.1941 ± 0.3776 | 46.4208 ± 0.3766 | 0.5665 ± 0.0050 | 0.3125 ± 0.0094 | 0.4110 ± 0.0114 |
| | LSTM | 0.4386 ± 0.0060 | 0.5471 ± 0.0066 | 0.4862 ± 0.0124 | 36.3341 ± 0.6214 | 44.3482 ± 0.6369 | 0.6096 ± 0.0038 | 0.3070 ± 0.0054 | 0.4859 ± 0.0131 |
| | GRU | 0.4600 ± 0.0113 | 0.5525 ± 0.0104 | 0.4717 ± 0.0082 | 35.8335 ± 0.3977 | 43.706 ± 0.4276 | 0.6105 ± 0.0039 | 0.3091 ± 0.0079 | 0.5012 ± 0.0086 |
| | GC-LSTM | 0.4358 ± 0.0068 | 0.5535 ± 0.0124 | 0.4822 ± 0.0100 | 36.2248 ± 0.5390 | 44.2294 ± 0.5000 | 0.6055 ± 0.0040 | 0.3099 ± 0.0073 | 0.4901 ± 0.0106 |
| | PM2.5-GNN | 0.4401 ± 0.0081 | 0.5147 ± 0.0086 | 0.4636 ± 0.0128 | 35.1663 ± 0.6300 | 42.9891 ± 0.6633 | 0.6168 ± 0.0031 | 0.3063 ± 0.0077 | 0.5097 ± 0.0135 |
| | GWO-GART | 0.4125 ± 0.0076 | 0.5034 ± 0.0084 | 0.4462 ± 0.0123 | 34.3 ± 0.6151 | 42.52 ± 0.6570 | 0.6276 ± 0.0032 | 0.2956 ± 0.0074 | 0.5278 ± 0.0139 |
| | EGCFC | 0.3945 ± 0.0073 | 0.4913 ± 0.0082 | 0.4311 ± 0.0119 | 32.9 ± 0.5896 | 41.17 ± 0.6352 | 0.6338 ± 0.0032 | 0.2635 ± 0.0066 | 0.5467 ± 0.0145 |
| | Ours | 0.3690 ± 0.0194 | 0.4585 ± 0.0041 | 0.4322 ± 0.0073 | 31.1335 ± 0.5273 | 39.3122 ± 0.5665 | 0.6532 ± 0.0074 | 0.2323 ± 0.0116 | 0.5727 ± 0.0136 |
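
The improvements quoted in the abstract (MAE −1.7665, RMSE −1.8578, FAR −0.0312, CSI +0.0194, R² +0.0260) appear to correspond to the gap between EGCFC and the proposed model on dataset 3. The short check below is a sketch that recomputes those gaps from the mean values in Table 5; the dictionary and function names are illustrative, and reading the abstract's figures as a dataset 3 comparison against EGCFC is an inference from the numbers rather than something the table states.

```python
# Dataset 3 means from Table 5: strongest baseline (EGCFC) vs. the proposed model.
egcfc = {"MAE": 32.9, "RMSE": 41.17, "CSI": 0.6338, "FAR": 0.2635, "R2": 0.5467}
ours = {"MAE": 31.1335, "RMSE": 39.3122, "CSI": 0.6532, "FAR": 0.2323, "R2": 0.5727}

# For these metrics a smaller value is better; for CSI and R2 larger is better.
LOWER_IS_BETTER = {"MAE", "RMSE", "FAR"}


def improvement(metric):
    """Gain of `ours` over EGCFC, signed so that a positive value means better."""
    diff = egcfc[metric] - ours[metric]
    if metric not in LOWER_IS_BETTER:
        diff = -diff
    return round(diff, 4)


gains = {metric: improvement(metric) for metric in egcfc}
```

Rounded to four decimals, `gains` reproduces the five figures given in the abstract.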
Table 6. MAE and RMSE of different models at different periods.
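| Model | Metric | +3 h | +6 h | +12 h | +24 h | +36 h | +48 h | +60 h | +72 h |
|---|---|---|---|---|---|---|---|---|---|
| MLP | MAE | 11.1459 | 18.0483 | 25.8614 | 32.5212 | 35.6843 | 37.3319 | 37.4012 | 38.1941 |
| | RMSE | 15.7626 | 23.7607 | 32.4327 | 39.9506 | 43.6865 | 45.7564 | 45.1916 | 46.4208 |
| LSTM | MAE | 10.0366 | 15.9801 | 22.7599 | 29.0874 | 32.6889 | 34.7104 | 34.8446 | 36.3341 |
| | RMSE | 14.1938 | 21.1434 | 28.8040 | 36.1058 | 40.4199 | 42.9563 | 42.3574 | 44.3482 |
| GRU | MAE | 10.1385 | 16.0489 | 22.6351 | 28.8163 | 32.4622 | 34.3892 | 34.7412 | 35.8335 |
| | RMSE | 14.3379 | 21.2472 | 28.6710 | 35.7979 | 40.1031 | 42.5039 | 42.2465 | 43.7062 |
| GC-LSTM | MAE | 10.2688 | 16.0763 | 22.5396 | 28.7403 | 32.3440 | 34.6697 | 34.9810 | 36.2248 |
| | RMSE | 14.5223 | 21.2557 | 28.5414 | 35.7349 | 40.0674 | 42.9681 | 42.5618 | 44.2294 |
| PM2.5-GNN | MAE | 9.9502 | 15.7211 | 21.9645 | 27.9514 | 31.8948 | 33.6419 | 33.9457 | 35.1663 |
| | RMSE | 14.0718 | 20.8274 | 27.8935 | 34.8443 | 39.4647 | 41.7387 | 41.3908 | 42.9891 |
| GWO-GART | MAE | 9.7014 | 15.3325 | 21.4217 | 27.2566 | 31.1105 | 32.7984 | 33.1064 | 34.3000 |
| | RMSE | 13.7721 | 20.3798 | 27.2782 | 34.0619 | 38.5830 | 40.8396 | 41.5091 | 42.5200 |
| EGCFC | MAE | 9.5024 | 15.0137 | 20.9761 | 26.6936 | 30.4595 | 32.1280 | 32.4181 | 33.5838 |
| | RMSE | 13.6028 | 20.1214 | 26.9477 | 33.6629 | 38.1245 | 40.3239 | 40.9876 | 41.1700 |
| Ours | MAE | 9.4753 | 14.8783 | 20.8946 | 26.3614 | 29.1230 | 31.0708 | 30.6332 | 31.1335 |
| | RMSE | 13.4000 | 19.8016 | 26.7688 | 33.3437 | 36.9280 | 39.5794 | 38.4296 | 39.3122 |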
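
Table 6 breaks the errors down by lead time. As a rough illustration of how such per-horizon figures could be produced, the sketch below computes MAE and RMSE separately for each forecast step. It assumes forecasts are stored as one sequence per sample with one value every 3 hours; the function name `horizon_errors`, the `step_hours` parameter, and the toy numbers are all made up for illustration and are not taken from the paper.

```python
import math


def horizon_errors(y_true, y_pred, step_hours=3):
    """Per-lead-time MAE and RMSE.

    y_true, y_pred: lists of equal-length sequences, one sequence per sample,
    where index k holds the value at lead time (k + 1) * step_hours.
    Returns {lead_hours: (mae, rmse)}.
    """
    n_steps = len(y_true[0])
    out = {}
    for k in range(n_steps):
        errs = [p[k] - t[k] for t, p in zip(y_true, y_pred)]
        mae = sum(abs(e) for e in errs) / len(errs)
        rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
        out[(k + 1) * step_hours] = (mae, rmse)
    return out


# Toy data (made up): two samples, two lead times (+3 h and +6 h).
truth = [[10.0, 20.0], [30.0, 40.0]]
pred = [[12.0, 20.0], [30.0, 46.0]]
errors = horizon_errors(truth, pred)
```

Because RMSE squares each error before averaging, it is never smaller than MAE at the same lead time, which is consistent with every MAE/RMSE pair in Table 6.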
Table 7. CSI and FAR of different models at different periods.
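| Model | Metric | +3 h | +6 h | +12 h | +24 h | +36 h | +48 h | +60 h | +72 h |
|---|---|---|---|---|---|---|---|---|---|
| MLP | CSI | 0.8803 | 0.8071 | 0.7240 | 0.6516 | 0.6110 | 0.5922 | 0.5821 | 0.5665 |
| | FAR | 0.0582 | 0.1042 | 0.1667 | 0.2304 | 0.2637 | 0.2856 | 0.3037 | 0.3125 |
| LSTM | CSI | 0.8914 | 0.8298 | 0.7628 | 0.6974 | 0.6579 | 0.6331 | 0.6252 | 0.6096 |
| | FAR | 0.0627 | 0.1054 | 0.1506 | 0.2048 | 0.2409 | 0.2665 | 0.2916 | 0.3070 |
| GRU | CSI | 0.8897 | 0.8270 | 0.7617 | 0.7006 | 0.6598 | 0.6374 | 0.6274 | 0.6105 |
| | FAR | 0.0616 | 0.1038 | 0.1545 | 0.2089 | 0.2474 | 0.2715 | 0.2928 | 0.3091 |
| GC-LSTM | CSI | 0.8895 | 0.8272 | 0.7631 | 0.7001 | 0.6597 | 0.6318 | 0.6208 | 0.6055 |
| | FAR | 0.0590 | 0.1030 | 0.1510 | 0.2038 | 0.2428 | 0.2709 | 0.2943 | 0.3099 |
| PM2.5-GNN | CSI | 0.8917 | 0.8303 | 0.7692 | 0.7091 | 0.6692 | 0.6476 | 0.6322 | 0.6168 |
| | FAR | 0.0619 | 0.1084 | 0.1543 | 0.2034 | 0.2504 | 0.2674 | 0.2895 | 0.3063 |
| GWO-GART | CSI | 0.8920 | 0.8319 | 0.7702 | 0.7101 | 0.6716 | 0.6553 | 0.6411 | 0.6276 |
| | FAR | 0.0608 | 0.1073 | 0.1531 | 0.1914 | 0.2485 | 0.2511 | 0.2595 | 0.2956 |
| EGCFC | CSI | 0.8929 | 0.8332 | 0.7735 | 0.7100 | 0.6753 | 0.6623 | 0.6442 | 0.6338 |
| | FAR | 0.0636 | 0.1043 | 0.1550 | 0.1814 | 0.2359 | 0.2393 | 0.2493 | 0.2635 |
| Ours | CSI | 0.8961 | 0.8399 | 0.7788 | 0.7232 | 0.6910 | 0.6641 | 0.6596 | 0.6532 |
| | FAR | 0.0550 | 0.0886 | 0.1204 | 0.1647 | 0.1889 | 0.2031 | 0.2172 | 0.2323 |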
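
CSI and FAR in Table 7 are categorical scores for pollution events. The sketch below uses the standard contingency-table definitions, CSI = hits / (hits + misses + false alarms) and FAR = false alarms / (hits + false alarms). The 75 µg/m³ event threshold, the function name `csi_far`, and the toy values are assumptions for illustration; this section does not state which threshold the authors used.

```python
def csi_far(y_true, y_pred, threshold=75.0):
    """Critical success index and false alarm rate for exceedance events.

    An event is a value strictly above `threshold` (assumed: 75 ug/m^3).
    Returns (csi, far); a score is NaN when its denominator is zero.
    """
    hits = misses = false_alarms = 0
    for obs, pred in zip(y_true, y_pred):
        if obs > threshold and pred > threshold:
            hits += 1  # event observed and predicted
        elif obs > threshold:
            misses += 1  # event observed but not predicted
        elif pred > threshold:
            false_alarms += 1  # event predicted but not observed
    denom_csi = hits + misses + false_alarms
    denom_far = hits + false_alarms
    csi = hits / denom_csi if denom_csi else float("nan")
    far = false_alarms / denom_far if denom_far else float("nan")
    return csi, far


# Toy observations and forecasts (made up): 2 hits, 1 miss, 1 false alarm.
obs = [80.0, 90.0, 60.0, 50.0, 100.0]
fc = [85.0, 70.0, 78.0, 40.0, 110.0]
csi, far = csi_far(obs, fc)
```

A higher CSI and a lower FAR are better, so the two rows for each model in Table 7 should be read in opposite directions.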
Table 8. R² of different models at different periods.
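| Model | +3 h | +6 h | +12 h | +24 h | +36 h | +48 h | +60 h | +72 h |
|---|---|---|---|---|---|---|---|---|
| MLP | 0.8929 | 0.7988 | 0.6708 | 0.5443 | 0.4786 | 0.4385 | 0.4365 | 0.4110 |
| LSTM | 0.9069 | 0.8318 | 0.7301 | 0.6265 | 0.5581 | 0.5147 | 0.5272 | 0.4859 |
| GRU | 0.9063 | 0.8308 | 0.7321 | 0.6293 | 0.5605 | 0.5231 | 0.5268 | 0.5012 |
| GC-LSTM | 0.9059 | 0.8329 | 0.7366 | 0.6340 | 0.5649 | 0.5168 | 0.5233 | 0.4901 |
| PM2.5-GNN | 0.9094 | 0.8390 | 0.7465 | 0.6456 | 0.5749 | 0.5352 | 0.5395 | 0.5097 |
| GWO-GART | 0.9091 | 0.8487 | 0.7483 | 0.6461 | 0.5821 | 0.5467 | 0.5532 | 0.5278 |
| EGCFC | 0.9092 | 0.8443 | 0.7518 | 0.6498 | 0.5946 | 0.5655 | 0.5769 | 0.5467 |
| Ours | 0.9123 | 0.8435 | 0.7522 | 0.6595 | 0.6058 | 0.5687 | 0.5813 | 0.5727 |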
Table 9. The model runtime and complexity on dataset 3.
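| Model | Runtime (s) | FLOPs (G) | Params (M) |
|---|---|---|---|
| MLP | 395.36 | 0.104 | 0.001 |
| GRU | 369.08 | 0.932 | 0.007 |
| LSTM | 334.44 | 1.221 | 0.009 |
| GC-LSTM | 631.06 | 0.837 | 0.006 |
| PM2.5-GNN | 832.31 | 51.330 | 0.020 |
| GWO-GART | 108,000 | 52.430 | 0.091 |
| EGCFC | 1125.36 | 55.430 | 0.103 |
| Ours | 1001.20 | 52.008 | 0.090 |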
Table 10. Quantitative results of ablation study on datasets 1, 2, and 3.
| Dataset | Model | MAE | RMSE | CSI | FAR | R² |
|---|---|---|---|---|---|---|
| 1 | Baseline | 15.4801 ± 0.1412 | 19.6491 ± 0.1355 | 0.4852 ± 0.0036 | 0.2897 ± 0.0114 | 0.5957 ± 0.0041 |
| | w/o GESM | 14.1623 ± 0.0656 | 18.4013 ± 0.0707 | 0.5018 ± 0.0095 | 0.2392 ± 0.0162 | 0.6303 ± 0.0042 |
| | w/o MS-iTransformer | 14.1053 ± 0.0618 | 18.3386 ± 0.0653 | 0.5017 ± 0.0068 | 0.2348 ± 0.0124 | 0.6324 ± 0.0030 |
| | Ours | 14.0836 ± 0.0676 | 18.3083 ± 0.0851 | 0.5030 ± 0.0068 | 0.2295 ± 0.0093 | 0.6327 ± 0.0044 |
| 2 | Baseline | 24.9161 ± 0.2012 | 31.2798 ± 0.1915 | 0.5258 ± 0.0052 | 0.2906 ± 0.0086 | 0.5119 ± 0.0042 |
| | w/o GESM | 22.9963 ± 0.2168 | 29.5258 ± 0.2950 | 0.5365 ± 0.0073 | 0.2349 ± 0.0063 | 0.5477 ± 0.0099 |
| | w/o MS-iTransformer | 22.8453 ± 0.2401 | 29.3219 ± 0.2482 | 0.5393 ± 0.0125 | 0.2238 ± 0.0168 | 0.5486 ± 0.0101 |
| | Ours | 22.8264 ± 0.3842 | 29.3321 ± 0.3881 | 0.5396 ± 0.0123 | 0.2210 ± 0.0111 | 0.5502 ± 0.0115 |
| 3 | Baseline | 35.1663 ± 0.6300 | 42.9891 ± 0.6633 | 0.6168 ± 0.0031 | 0.3063 ± 0.0077 | 0.5097 ± 0.0135 |
| | w/o GESM | 32.4696 ± 0.5980 | 40.5141 ± 0.5914 | 0.6330 ± 0.0073 | 0.2566 ± 0.0128 | 0.5542 ± 0.0139 |
| | w/o MS-iTransformer | 32.2708 ± 0.4987 | 40.4062 ± 0.5761 | 0.6388 ± 0.0061 | 0.2470 ± 0.0153 | 0.5500 ± 0.0139 |
| | Ours | 31.1335 ± 0.5273 | 39.3122 ± 0.5665 | 0.6532 ± 0.0074 | 0.2323 ± 0.0116 | 0.5727 ± 0.0136 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, Y.; Zhu, X.; Wang, R.; Xie, Y.; Fong, S. A Dynamic Global–Local Spatiotemporal Graph Framework for Multi-City PM2.5 Long-Term Forecasting. Remote Sens. 2025, 17, 2750. https://doi.org/10.3390/rs17162750
