A Lightweight Spatiotemporal Graph Framework Leveraging Clustered Monitoring Networks and Copula-Based Pollutant Dependency for PM2.5 Forecasting

Abbasi, Mohammad Taghi; Alesheikh, Ali Asghar; Rezaie, Fatemeh

doi:10.3390/land14081589


by Mohammad Taghi Abbasi ¹, Ali Asghar Alesheikh ¹ and Fatemeh Rezaie ¹,²,*

¹ Department of GIS, Faculty of Geodesy and Geomatics Engineering, K. N. Toosi University of Technology, Tehran 19967-15433, Iran

² Department of Geophysical Exploration, Korea University of Science and Technology, 217 Gajeong-ro, Yuseong-gu, Daejeon 34113, Republic of Korea

* Author to whom correspondence should be addressed.

Land 2025, 14(8), 1589; https://doi.org/10.3390/land14081589

Submission received: 27 June 2025 / Revised: 31 July 2025 / Accepted: 1 August 2025 / Published: 4 August 2025

(This article belongs to the Section Land Innovations – Data and Machine Learning)


Abstract

Air pollution threatens human health and ecosystems, making timely forecasting essential. The spatiotemporal dynamics of pollutants, shaped by various factors, challenge traditional methods. Therefore, spatiotemporal graph-based deep learning has gained attention for its ability to capture spatial and temporal dependencies within monitoring networks. However, many existing models, despite their high predictive accuracy, face computational complexity and scalability challenges. This study introduces the clustered and lightweight spatio-temporal graph convolutional network with gated recurrent unit (ClusLite-STGCN-GRU), a hybrid model that integrates spatial clustering based on pollutant time series for graph construction, Copula-based dependency analysis for selecting relevant pollutants to predict PM_2.5, and graph convolution combined with gated recurrent units to extract spatiotemporal features. Unlike conventional approaches that require learning or dynamically updating adjacency matrices, ClusLite-STGCN-GRU employs a fixed, simple cluster-based structure. Experimental results on Tehran air quality data demonstrate that the proposed model not only achieves competitive predictive performance compared to more complex models, but also significantly reduces computational cost (by up to 66% in training time, 83% in memory usage, and 84% in the number of floating-point operations), making it suitable for real-time applications and offering a practical balance between accuracy, interpretability, and efficiency.

Keywords: air pollution; spatiotemporal graph; clustering; dependency analysis; PM_2.5 prediction; computational efficiency

1. Introduction

Urbanization and industrialization, though vital to socioeconomic progress, have significantly worsened air quality worldwide [1]. In 2024, the World Health Organization (WHO) reported that nearly 7.92 billion people worldwide are exposed to polluted air, leading to 6.7 million deaths each year due to its harmful effects on cardiovascular and respiratory health, including conditions such as asthma [2]. Air pollution is a complex mixture of particulate matter and gaseous contaminants, with key constituents including particulate matter less than 10 μm (PM₁₀) and 2.5 μm (PM_2.5) in diameter, along with sulfur dioxide (SO₂), nitrogen dioxide (NO₂), carbon monoxide (CO), and ozone (O₃) [3]. Among these, PM_2.5 is the most concerning due to its strong association with an increased risk of lung cancer and other serious health effects [4]. Given the absence of a comprehensive strategy for the complete eradication of air pollution, the development of forecasting models and early warning systems is essential [5]. This is particularly important for critical pollutants like PM_2.5, as they play a key role in formulating effective mitigation policies and protecting public health [6].

Governments have established networks of air quality monitoring stations (AQMSs) to provide accurate data on current conditions and historical trends of air pollution. However, one of the main limitations of these networks is their inability to forecast future pollutant concentrations [7]. Air pollutant concentration forecasting based on AQMS data faces several challenges, including spatial dependencies among AQMSs [8], temporal dependencies on historical data [9], and the influence of meteorological conditions [10]. Additionally, nonlinear chemical interactions, such as NO₂ oxidation and secondary O₃ formation, complicate predictions [11]. These complexities underscore the necessity of developing spatiotemporal models that can simultaneously capture spatial and temporal dependencies, account for the influence of meteorological conditions on pollutant concentrations, and identify the key pollutants affecting the target variable, such as PM_2.5.

Advancements in technology and increased Graphics Processing Unit (GPU) power have transformed deep learning into an advanced branch of machine learning, capable of analyzing complex data and modeling spatiotemporal relationships [12]. As a result, hybrid deep learning models combining spatial and temporal learning are gaining interest, with graph convolutional networks (GCNs) used for capturing spatial dependencies, and recurrent neural networks (RNNs), temporal convolutional networks (TCNs), or attention mechanisms employed to model temporal dependencies [13].

Over the past few years, a growing number of studies have explored various architectures and techniques to improve the accuracy and efficiency of spatiotemporal air pollution forecasting models [14]. For example, Guan et al. [7] introduced a multi-branch TGCN model to extract features from different meteorological variables, improving short- and long-term PM_2.5 forecasts. Chen et al. [15] developed an adaptive adjacency matrix learned from meteorological and point of interest (POI) data, which significantly reduced prediction errors compared to fixed distance-based graphs. Hierarchical spatial modeling was explored by Hu et al. [16], who showed that multi-scale graph aggregation improves accuracy. Other works, such as that of Zeng et al. [17], combined convolution with spatial attention mechanisms to capture both local and global spatial dependencies, while Liu et al. [18] enhanced the representation of nodes with few connections using fine-grained graph convolutions. Zhao et al. [19] developed dynamic graph structures that incorporate wind field data, capturing spatial dependencies more effectively through geographically informed directed graphs. Huang et al. [20] proposed a hybrid Transformer-GCN model, GCN-FFPformer, which combines spatial feature extraction through GCN with frequency-domain temporal modeling using an enhanced transformer based on fast Fourier transform (FFT). Zeng et al. [21] tackled the over-smoothing issue by introducing DAGJN, a dynamic adaptive graph jump network that integrates multi-head self-attention to better capture spatiotemporal dependencies. Table S1 summarizes the features of these papers for air pollutant concentration prediction.

With the growing body of research in this area, recent studies (Table S1) have prioritized enhancing prediction accuracy. Nevertheless, these gains in accuracy frequently lead to greater model complexity, resulting in more challenging implementation, longer processing times, and a substantial increase in the number of model parameters [22]. In real-world applications, where resources are limited and rapid prediction is required for horizons such as 1 to 72 h ahead, high accuracy alone is not sufficient [23]. Models must be not only accurate but also lightweight and simple [24]. Therefore, balancing accuracy and ease of deployment is a fundamental challenge in the design of spatiotemporal models.

To address this, a novel model, ClusLite-STGCN-GRU (clustered and lightweight spatio-temporal graph convolutional network with gated recurrent unit), is proposed for near-future air pollution forecasting. The main objective of this model is to achieve high prediction accuracy while reducing computational complexity. To this end, a previously developed clustering framework [25] is employed to partition the AQMS network into a set of local subgraphs. This clustering process is based on time–frequency analysis, dimensionality reduction through principal component analysis (PCA), and agglomerative hierarchical clustering. This decomposition offers three key advantages: (1) it enhances training stability by isolating more stationary spatial patterns, which aligns well with the inherently local nature of graph convolutional filters; (2) it substantially lowers computational costs by eliminating the need for dynamic adjacency matrix learning and by limiting connections to within-cluster interactions; and (3) it implicitly captures meteorological influences by clustering AQMSs based on full PM_2.5 time series over the study period (2019–2022), without requiring explicit weather data. Since pollutant patterns are shaped by factors like wind and temperature, similar time series often reflect similar underlying climate conditions. In addition, Copula models are utilized to identify pollutants that exhibit significant dependencies on PM_2.5 (the target variable), and only these pollutants are considered as inputs. This reduces the input size, accelerates training, and enables a lightweight and efficient framework. Experimental results demonstrate that ClusLite-STGCN-GRU outperforms all baseline models that either neglect spatial dependencies or use CNNs to extract spatial information among AQMSs. It achieves accuracy comparable to more complex graph-based models for short-term forecasts (up to 8 h ahead), while consistently outperforming them over longer horizons (from 12 to 72 h), thereby improving accuracy while reducing complexity.

The remainder of this paper is organized into the following six sections: a description of the study area and dataset along with the problem definition, the theoretical background, a detailed explanation of the proposed model, a comparative analysis of the prediction results and model performance, and finally, discussion and conclusions.

2. Materials and Problem Definition

2.1. Study Area and Data Description

Our study area is Tehran, the capital and most densely populated city in Iran (Figure 1a). This metropolis spans approximately 18,814 square kilometers and has a population of about 16 million, making it the second-largest city in the Middle East after Cairo [26]. Located at an elevation of 1200 m above sea level on the southern slopes of the Alborz Mountains, Tehran is naturally constrained to the north by these mountains, which limit the dispersion of air pollutants in that direction. However, the city opens to expansive plains in the south, which allow for some pollutant dispersion. The prevailing winds, coming mainly from the west and southwest, carry pollutants from industrial zones and densely populated areas to other parts of the city. Tehran has a dry climate, receiving only about 200 mm of rainfall per year and averaging 25 °C, which, along with rapid population growth, industrial expansion, and heavy traffic, has made Tehran one of the most polluted megacities globally [27].

In Tehran, air quality data are published by Tehran Municipality (https://www.tehran.ir/, accessed on 31 July 2025) and the Department of Environment (DOE, https://www.doe.ir/, accessed on 31 July 2025). In this study, hourly pollutant concentration data (PM_2.5, PM₁₀, NO₂, SO₂, CO, and O₃) were collected from these two organizations, covering the period from 00:00 on 1 January 2019 to 23:00 on 31 December 2022. Additionally, meteorological observations, including temperature, pressure, humidity, dew point temperature, and wind components (wind_x and wind_y), were obtained from the Iran Meteorological Organization (https://www.irimo.ir/, accessed on 31 July 2025), which is responsible for monitoring atmospheric conditions. The meteorological data were collected for the same time period and matched to AQMSs using the inverse distance weighting (IDW) interpolation method. It is worth mentioning that in our proposed model, meteorological data are not used directly as inputs but are collected only for baseline models that cannot implicitly capture their effects. Our model accounts for these impacts indirectly through time series clustering. Table 1 summarizes the descriptive statistics of the relevant variables, and the spatial distribution of AQMSs and meteorological stations is shown in Figure 1b.

2.2. Data Preprocessing

Air quality sensors in real-world conditions face challenges such as outliers and missing data due to factors like adverse weather conditions, equipment wear, and power outages [14]. Missing data hinders the model training process as models require complete data for learning. On the other hand, outliers cause the model to deviate from the underlying patterns and reduce prediction accuracy [28]. Additionally, due to scale differences in the variables measured by AQMSs (Table 1), data normalization is essential. This process ensures the uniformity of feature scales, thereby guaranteeing that each feature has an equal impact on the modeling process [29]. Finally, appropriate data splitting into training, validation, and test sets is crucial for accurately evaluating model performance and avoiding issues such as overfitting. Preprocessing procedures for imputing missing data and removing outliers followed the method proposed by Abbasi et al. [25]. Min–max normalization was applied to ensure consistent feature scaling. The dataset was sequentially split into training, validation, and test sets, covering the first 70%, the next 15%, and the final 15% of the data, respectively.
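As an illustration of the last two preprocessing steps, the following minimal sketch applies a chronological 70/15/15 split followed by min–max normalization to a synthetic array. The array layout (time, stations, features), the function names, and the choice to compute the scaling statistics on the training split only are assumptions made for this example; the text above does not state how the normalization statistics were obtained.

```python
import numpy as np

def chronological_split(data, train_frac=0.70, val_frac=0.15):
    """Split a (time, stations, features) array in time order, without shuffling."""
    n = data.shape[0]
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

def fit_min_max(train):
    """Per-feature minimum and range over all time steps and stations.

    Assumption: statistics come from the training split only, so that no
    information from the validation or test periods leaks into the scaling.
    """
    lo = train.min(axis=(0, 1), keepdims=True)
    hi = train.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return lo, span

def apply_min_max(x, lo, span):
    """Scale x with previously fitted statistics."""
    return (x - lo) / span

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: 1000 hourly steps, 11 stations, 3 features.
data = rng.gamma(shape=2.0, scale=15.0, size=(1000, 11, 3))

train, val, test = chronological_split(data)
lo, span = fit_min_max(train)
train_n, val_n, test_n = (apply_min_max(s, lo, span) for s in (train, val, test))
print(train_n.shape, val_n.shape, test_n.shape)
```

With training-only statistics, validation and test values can fall slightly outside [0, 1], which is expected behavior rather than an error.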

2.3. Problem Formulation

The objective of this study was to predict the spatiotemporal air quality at the locations of AQMSs using historical data from the station itself and its neighboring AQMSs. To achieve this, the data from these AQMSs are treated as signals recorded every hour of the day, with the domain of these signals represented as a graph. In this graph, AQMSs are nodes, and the observations from each station serve as the features of those nodes. The spatial relationships between AQMSs are captured through the adjacency matrix, which defines the edges of the graph. Ultimately, the spatiotemporal deep learning model, by learning a mapping function F, predicts air quality for the next τ hours at each node based on graph G and the historical node features of length T:

X_{t_{0} + 1}, X_{t_{0} + 2}, \dots, X_{t_{0} + τ} = F_{θ} (X_{t_{0} - T + 1}, X_{t_{0} - T + 2}, \dots, X_{t_{0}}; G)

(1)

where θ represents the learnable parameters and X represents the air quality features observed at AQMSs, with each X_{t} being a matrix of features recorded at all stations at time t, capturing pollutant levels and related environmental factors.
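The mapping in Equation (1) implies that the hourly record is cut into pairs of a history window of length T and a target window of length τ. A minimal sketch of this windowing is given below; the function name `make_windows` and the array layout (time, nodes, features) are assumptions introduced for illustration.

```python
import numpy as np

def make_windows(series, history, horizon):
    """Build supervised samples from a (time, nodes, features) array.

    Returns inputs of shape (samples, history, nodes, features) and
    targets of shape (samples, horizon, nodes, features).
    """
    n_samples = series.shape[0] - history - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested windows")
    inputs = np.stack([series[s:s + history] for s in range(n_samples)])
    targets = np.stack(
        [series[s + history:s + history + horizon] for s in range(n_samples)]
    )
    return inputs, targets

# Toy record: 10 time steps, 2 nodes, 1 feature; value = 2 * t + node index.
series = np.arange(10 * 2 * 1, dtype=float).reshape(10, 2, 1)
X, Y = make_windows(series, history=4, horizon=2)
print(X.shape, Y.shape)
```

Here `history` plays the role of T and `horizon` the role of τ, and the last input step of each sample corresponds to t_{0}.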

3. Theoretical Background

3.1. GCNs

GCNs are a class of deep learning models designed for graph-structured data. Unlike traditional neural networks that process Euclidean data (e.g., images, text, and videos), GCNs are capable of capturing the topological structure of graphs and learning node representations [30]. They are categorized into spectral and spatial graph convolutions [31]. Spectral methods use the eigendecomposition of the graph Laplacian, where the eigenvectors of the Laplacian act as the graph’s Fourier basis, similar to how sine and cosine functions work in Fourier space. Although the convolution operation in Fourier space is equivalent to a Hadamard multiplication, it is computationally expensive due to the need for eigenvalue decomposition [32]. Spatial methods have limitations compared to spectral methods in capturing long-range dependencies and global structural patterns. However, they operate without the need for spectral decomposition, directly aggregating local neighborhood features, which enhances scalability and computational efficiency for large-scale graphs [33].

The evolution of GCNs began with the models proposed by Bruna et al. [34] and Henaff et al. [35], which formulated convolution in the spectral domain using the eigendecomposition of the graph Laplacian matrix L. However, the dense nature of L incurs a computational complexity of O(n³), making it impractical for large-scale graphs. To address this, Defferrard et al. [36] proposed ChebNet, which utilizes Chebyshev polynomial approximations to eliminate the need for eigendecomposition, thereby reducing the computational complexity to O(K|E|), where K is the polynomial order and |E| is the number of edges. Building on this, Kipf & Welling [37] introduced a simplified first-order model (K = 1) with linear complexity O(|E|), enabling scalable learning on large graphs through localized filtering. This process is defined by the following equation:

Z = {\tilde{D}}^{- \frac{1}{2}} \tilde{A} {\tilde{D}}^{- \frac{1}{2}} X Θ

(2)

where \tilde{A} = A + I_{N} is the adjacency matrix with self-loops; \tilde{D} is the degree matrix of \tilde{A}; X is the input feature matrix; Θ is the matrix of trainable weights; and Z is the output. This formulation ensures an efficient balance between scalability and expressive power.
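To make Equation (2) concrete, the following sketch evaluates one propagation step on a toy graph with NumPy. The function names and the toy values are illustrative assumptions, and no activation function is applied because Equation (2) does not include one.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for a binary adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(axis=1)             # diagonal of the degree matrix D~
    d_inv_sqrt = 1.0 / np.sqrt(deg)       # deg >= 1 because of the self-loops
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A, X, Theta):
    """One graph convolution: Z = D~^(-1/2) A~ D~^(-1/2) X Theta."""
    return normalized_adjacency(A) @ X @ Theta

# Toy graph with two disconnected pairs of nodes: {0, 1} and {2, 3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [10.0, 4.0],
              [20.0, 8.0]])
Theta = np.eye(2)  # identity weights make the neighborhood averaging visible

Z = gcn_layer(A, X, Theta)
print(Z)
```

With identity weights, each node's output is the average of its own features and those of its neighbor, which shows the localized filtering that the formulation performs.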

3.2. RNNs

In 1906, Andrey Markov introduced the Markov chain model, where the future state of a system depends solely on its current state, without reference to prior states [38]. In 1982, John Hopfield introduced RNNs, enabling the modeling of longer sequence dependencies, moving beyond the reliance on just the current state [39]. These networks are particularly effective for tasks that involve time-series data or sequences, as they can maintain information over time and use it for predictions [40]. The Elman RNN (ERNN) [41] represents one of the pioneering RNN architectures specifically designed to capture sequence dependencies. In this architecture, the current input x_{t} is integrated with the hidden state h_{t - 1} from the previous time step, with both components being weighted by learnable parameters. The hidden state at time step t, denoted as h_{t}, is computed using the following equation:

h_{t} = \tanh (W_{h} h_{t - 1} + W_{x} x_{t} + b)

(3)

where \tanh is the non-linear activation function; W_{h} represents the weight matrix for the previous hidden state; W_{x} is the weight matrix for the current input; and b is the bias term. ERNN is effective in modeling short-term dependencies but suffers from the vanishing gradient problem, which restricts its ability to capture long-term patterns [42].

The long short-term memory (LSTM) [43] was introduced to address the vanishing gradient problem in the ERNN by using gating mechanisms to preserve long-term dependencies [44]. The LSTM architecture consists of three primary gates: input, forget, and output. Each gate applies a linear transformation to the current input x_{t} and the previous hidden state h_{t - 1}, followed by a sigmoid activation function to produce gating values between 0 and 1:

g_{t} = σ (W_{g} h_{t - 1} + U_{g} x_{t} + b_{g})

(4)

where g_{t} \in \{f_{t}, i_{t}, o_{t}\} represents the output of the forget, input, or output gate, respectively. Here, W_{g} and U_{g} are learnable weight matrices applied to the previous hidden state and current input, respectively, and b_{g} is a bias vector. These gates regulate the information flow through the memory cell. The forget gate f_{t} controls which parts of the previous cell state C_{t - 1} are retained, while the input gate i_{t} determines the contribution of the new candidate state {\tilde{C}}_{t}. The cell state is updated as follows:

C_{t} = f_{t} * C_{t - 1} + i_{t} * {\tilde{C}}_{t}

(5)

The output gate o_{t} then determines the hidden state:

h_{t} = \tanh (C_{t}) * o_{t}

(6)

This structure enables LSTM to effectively preserve long-term dependencies and mitigate the vanishing gradient problem [45].

The gated recurrent unit (GRU) [46] is a simpler variant of the LSTM that uses only two gates: update and reset, reducing model complexity while maintaining the ability to capture long- and short-term dependencies [47,48]. Each gate is computed as follows:

g_{t} = σ (W_{g} h_{t - 1} + U_{g} x_{t} + b_{g})

(7)

where g_{t} \in \{z_{t}, r_{t}\} corresponds to the update (z_{t}) and reset (r_{t}) gates, respectively. The candidate hidden state {\tilde{h}}_{t} and the final hidden state h_{t} are updated as follows:

{\tilde{h}}_{t} = \tanh (W \cdot [r_{t} * h_{t - 1}, x_{t}])

(8)

h_{t} = (1 - z_{t}) * h_{t - 1} + z_{t} * {\tilde{h}}_{t}

(9)

This structure allows GRU to effectively regulate information flow without a separate memory cell, making it computationally efficient while mitigating the vanishing gradient problem [49]. The architectures of ERNN, LSTM, and GRU are shown in Figure 2a–c, respectively.
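For readers who prefer code to notation, Equations (7)–(9) can be sketched as a single GRU step in NumPy. The parameter names mirror the symbols in the equations; the random initialization, the dimensions, and the helper names `init_gru` and `gru_step` are assumptions for this example only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru(n_in, n_hidden, rng, scale=0.1):
    """Random parameters named after the symbols in Equations (7)-(9)."""
    p = {}
    for gate in ("z", "r"):
        p["W_" + gate] = scale * rng.normal(size=(n_hidden, n_hidden))
        p["U_" + gate] = scale * rng.normal(size=(n_hidden, n_in))
        p["b_" + gate] = np.zeros(n_hidden)
    # W acts on the concatenation [r_t * h_{t-1}, x_t] from Equation (8).
    p["W"] = scale * rng.normal(size=(n_hidden, n_hidden + n_in))
    return p

def gru_step(p, h_prev, x_t):
    """One GRU update for a single sequence."""
    z = sigmoid(p["W_z"] @ h_prev + p["U_z"] @ x_t + p["b_z"])    # update gate, Eq. (7)
    r = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x_t + p["b_r"])    # reset gate, Eq. (7)
    h_cand = np.tanh(p["W"] @ np.concatenate([r * h_prev, x_t]))  # candidate, Eq. (8)
    return (1.0 - z) * h_prev + z * h_cand                        # new state, Eq. (9)

rng = np.random.default_rng(0)
params = init_gru(n_in=3, n_hidden=4, rng=rng)
h = np.zeros(4)
for x_t in rng.random((24, 3)):  # 24 synthetic hourly inputs with 3 features
    h = gru_step(params, h, x_t)
print(h)
```

Because h_{t} is a gate-weighted mix of the previous state and a tanh output, every component stays within (−1, 1) when the initial state is zero.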

4. Proposed Model and Experimental Settings

4.1. Model Design

The proposed model in this study is a hybrid deep learning framework based on the combination of GCN and GRU networks, designed to simultaneously predict PM_2.5 concentrations at all AQMSs for a forecast horizon of 1 to 72 h. As shown in Figure 3, the proposed model consists of five separate blocks, namely Copula-based dependency analysis, pollutant time series clustering, graph convolution layers, GRU layers, and output PM_2.5 prediction layers.

4.1.1. Copula-Based Dependency Analysis Block

The concentrations of air pollutants are often influenced by common emission sources such as traffic and industrial activities, and under certain meteorological conditions, complex chemical interactions may occur among them [50,51]. These factors result in statistical dependencies among pollutants, where variations in the concentration of one pollutant are associated with changes in others. Therefore, dependency analysis helps identify pollutants related to PM_2.5 (the target pollutant) and prevents the inclusion of all pollutants in the model, thereby reducing the number of input parameters. Methods such as Pearson and Spearman correlation coefficients are commonly used for dependency analysis between variables. However, these methods are only capable of detecting linear (Pearson) or monotonic (Spearman) dependencies and have limited effectiveness in identifying critical dependency structures—particularly when pollutant concentrations simultaneously reach very high or very low levels. In this context, Copula models provide a more flexible statistical framework for modeling the dependency structure among random variables without requiring specific assumptions about their marginal distributions [52]. These models possess a strong ability to analyze nonlinear dependencies and examine the joint behavior of variables under extreme conditions (tail dependencies) [53].

In the Copula-based modeling approach, each marginal variable is first transformed into a standard uniform distribution over the interval [0, 1] using its empirical cumulative distribution function (ECDF)—a process known as marginal transformation. Subsequently, the dependence structure among the transformed variables is estimated using an appropriate Copula function [54]. Table 2 presents the most commonly used Copula families, the types of dependence they capture, and their typical applications in the context of air pollution. It is worth noting that in addition to the Copulas listed in Table 2, Copulas can be rotated by 90 or 180 degrees to better capture asymmetric dependence patterns, especially when tail dependence is observed only in specific quadrants of the joint distribution. Such rotations enhance the flexibility of the Copula framework, making it particularly valuable in environmental applications where dependencies may not be symmetric or uniformly distributed.

Key characteristics of dependency structures in all Copula models include Kendall’s tau rank correlation coefficient (τ), upper tail dependence (λ_{upper}), and lower tail dependence (λ_{lower}). The τ coefficient quantifies the overall strength and direction of association between two variables and ranges from –1 to +1. A positive τ suggests that increases in one pollutant are generally associated with increases in another (positive association), while a negative τ indicates that an increase in one variable tends to coincide with a decrease in the other (negative association or inverse trend). In contrast, λ_{upper} and λ_{lower}, both ranging from 0 to 1, quantify the probability of co-occurrence of extreme events in the upper and lower tails of the joint distribution, respectively, providing critical insights into the tail-dependent behavior of pollutants [55].
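The quantities described above can be approximated directly from data before any Copula family is fitted. The sketch below performs the marginal transformation with the ECDF, computes Kendall’s τ from pairwise concordance, and estimates the tail dependence at a finite threshold. It is a nonparametric illustration only: the threshold q = 0.95, the function names, and the synthetic "pm25" and "co" series are assumptions, and a fitted parametric Copula would give model-based values of λ_{upper} and λ_{lower} instead.

```python
import numpy as np

def pseudo_observations(x):
    """Marginal transformation: map a sample into (0, 1) through its ECDF."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1  # ranks 1..n; ties are broken by order
    return ranks / (len(x) + 1)

def kendall_tau(u, v):
    """Kendall's tau (tau-a) from pairwise concordance; O(n^2) memory and time."""
    du = np.sign(u[:, None] - u[None, :])
    dv = np.sign(v[:, None] - v[None, :])
    n = len(u)
    return float((du * dv).sum() / (n * (n - 1)))

def empirical_tail_dependence(u, v, q=0.95):
    """Finite-threshold estimates of upper and lower tail dependence.

    upper ~ P(V > q | U > q), lower ~ P(V < 1 - q | U < 1 - q).
    """
    upper = np.mean((u > q) & (v > q)) / (1.0 - q)
    lower = np.mean((u < 1.0 - q) & (v < 1.0 - q)) / (1.0 - q)
    return float(upper), float(lower)

rng = np.random.default_rng(0)
shared = rng.normal(size=500)               # common driver, e.g. traffic (synthetic)
pm25 = shared + 0.5 * rng.normal(size=500)  # hypothetical PM2.5 series
co = shared + 0.5 * rng.normal(size=500)    # hypothetical CO series

u, v = pseudo_observations(pm25), pseudo_observations(co)
tau = kendall_tau(u, v)
lam_upper, lam_lower = empirical_tail_dependence(u, v)
print(tau, lam_upper, lam_lower)
```

Because the ranks are invariant under monotone transformations, these estimates do not depend on the marginal distributions, which is the property that motivates the Copula approach.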

4.1.2. Pollutant Time Series Clustering Block

AQMSs measure pollutant concentrations at fixed sampling intervals, resulting in multivariate time series data for each pollutant across multiple AQMSs. Herein, an adjacency matrix is constructed separately for each pollutant by identifying the time series of AQMSs—which serve as graph nodes—that exhibit similar levels and patterns of variation. The adjacency matrix is defined such that only AQMSs within the same cluster are connected, while no connections exist between AQMSs belonging to different clusters. Moreover, pollutants that do not exhibit a distinct spatial clustering pattern—that is, those whose concentration levels and variation trends are similar across all urban AQMSs—are excluded from the model inputs due to their lack of added spatial value. This strategy removes the need to learn the adjacency matrix during training and implicitly captures meteorological effects, as AQMSs experiencing similar weather conditions often exhibit similar pollution dynamics [56]. By removing explicit meteorological features and pollutants without spatial cluster structures, and employing a fixed adjacency matrix, the model benefits from reduced input dimensionality and fewer trainable parameters.

To achieve this, a methodology inspired by a prior study [25] was adopted, consisting of the following steps: (1) transformation from the time domain to the time–frequency domain to capture the amplitude and frequency of pollutant concentration variations, (2) application of PCA to extract a low-dimensional feature representation from the transformed signals, and (3) agglomerative hierarchical clustering (AHC) to group AQMSs into clusters based on the extracted features. For further methodological details, the reader is referred to Abbasi et al. [25].
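The three steps can be sketched end to end as follows. This is a stand-in under stated assumptions, not the procedure of [25]: the time–frequency transform is approximated by the magnitude of a windowed Fourier transform, PCA is computed by singular value decomposition, the clustering uses average linkage, and all function names, window lengths, and synthetic series are hypothetical.

```python
import numpy as np

def time_frequency_features(series, window=24):
    """series: (stations, time). Magnitude spectrum of each non-overlapping window."""
    n_stations, n_steps = series.shape
    n_windows = n_steps // window
    segments = series[:, :n_windows * window].reshape(n_stations, n_windows, window)
    spectra = np.abs(np.fft.rfft(segments, axis=-1))
    return spectra.reshape(n_stations, -1)

def pca(features, n_components):
    """Project centered features onto their leading principal components."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage (assumed linkage criterion)."""
    clusters = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge the closest pair
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

rng = np.random.default_rng(0)
t = np.arange(24 * 60)                            # 60 synthetic days of hourly data
busy = 40 + 10 * np.sin(2 * np.pi * t / 24)       # high level, daily cycle
quiet = 15 + 3 * np.sin(2 * np.pi * t / 168)      # low level, weekly cycle
series = np.stack(
    [busy + rng.normal(0, 1, t.size) for _ in range(3)]
    + [quiet + rng.normal(0, 1, t.size) for _ in range(3)]
)

labels = agglomerative(pca(time_frequency_features(series), 2), 2)
print(labels)
```

In this toy setting the first three stations share one label and the last three share another, because both the level and the dominant frequency of their series differ.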

4.1.3. Graph Convolution Block

A key strength of using GCNs lies in their ability to simultaneously process information from multiple AQMSs while accounting for spatial dependencies among them. In this study, this capability is realized by constructing graphs in which each AQMS is considered as a node (a total of N nodes corresponding to the total number of AQMSs), and each node is associated with a P-dimensional feature vector. This feature vector is composed of variables selected by two prior blocks: pollutants that exhibit non-zero dependency with PM_2.5 in all three parts of the distribution (central dependency, upper tail dependency, and lower tail dependency), as well as pollutants that display distinct spatial clustering behavior. For each selected pollutant, a separate sequence of graphs is constructed over the study period, such that each time step corresponds to a snapshot graph. As discussed in Section 4.1.2, the structure of each graph is defined by an adjacency matrix specific to the corresponding pollutant, which is constructed through spatial clustering of its concentration time series across all AQMSs. In this setting, the edges are binary and undirected, and are defined as follows:

A_{ij} = \begin{cases} 1, & \text{if } i \neq j \text{ and nodes } i, j \text{ belong to the same cluster} \\ 0, & \text{otherwise} \end{cases}

(10)

This explicit construction of the adjacency matrix removes the need for learning graph connectivity parameters during model training, thereby reducing model complexity compared to approaches that require dynamically learned or time-varying adjacency matrices. Within the GCN framework, the node feature vectors are combined using this adjacency matrix to produce spatially embedded outputs.
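Given a vector of cluster labels for the AQMSs, the adjacency matrix of Equation (10) follows from a single comparison. The sketch below assumes the labels are already available; the label values and the function name are illustrative.

```python
import numpy as np

def cluster_adjacency(labels):
    """Binary, undirected adjacency: 1 for distinct nodes in the same cluster."""
    labels = np.asarray(labels)
    A = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)  # Equation (10) requires i != j
    return A

# Hypothetical clustering of five stations into two clusters.
A = cluster_adjacency([0, 0, 1, 1, 1])
print(A)
```

The result is block-diagonal up to a reordering of the stations, so graph convolution only mixes information within a cluster.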

4.1.4. GRU Block

In the graph convolution block, for each pollutant, a distinct graph is constructed at every time step based on its spatial clustering. Graph convolution is applied separately on each pollutant’s graph, and their outputs are concatenated to form a combined spatial embedding vector. This sequence of embeddings is then input to a GRU block to capture temporal dependencies. As discussed in Section 3.2, the GRU employs two gating mechanisms—the update and reset gates—to selectively retain or discard past information, thereby modeling the temporal dynamics of the data. This mechanism, when applied to the output of the graph convolution block, is mathematically defined as follows:

g_{t} = σ (W_{g} h_{t - 1} + U_{g} \mathrm{GCN}(X_{t}, A) + b_{g}); \quad {\tilde{h}}_{t} = \tanh (W \cdot [r_{t} * h_{t - 1}, \mathrm{GCN}(X_{t}, A)]); \quad h_{t} = (1 - z_{t}) * h_{t - 1} + z_{t} * {\tilde{h}}_{t}

(11)

where \mathrm{GCN}(X_{t}, A) denotes the output of the GCN at time step t given the input features X_{t} and the adjacency matrix A.
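The interaction between the two blocks can be illustrated with a small forward pass: a graph convolution per pollutant, concatenation of the resulting embeddings, and a GRU update per time step. Several details here are assumptions made to keep the sketch short and are not taken from the model description: each pollutant contributes one scalar feature per node, the graph convolution has no activation, the GRU weights are shared across nodes, and all weights are random rather than trained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cluster_adjacency(labels):
    labels = np.asarray(labels)
    A = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_gru_forward(X_seq, adjs, thetas, gru, n_hidden):
    """X_seq: (T, N, P); adjs: P matrices (N, N); thetas: P matrices (1, d)."""
    T, N, P = X_seq.shape
    A_hats = [normalized_adjacency(A) for A in adjs]
    h = np.zeros((N, n_hidden))
    for t in range(T):
        # Graph convolution on each pollutant's own graph, then concatenation.
        s = np.concatenate(
            [A_hats[p] @ X_seq[t, :, p:p + 1] @ thetas[p] for p in range(P)], axis=1
        )
        z = sigmoid(h @ gru["W_z"] + s @ gru["U_z"] + gru["b_z"])  # update gate
        r = sigmoid(h @ gru["W_r"] + s @ gru["U_r"] + gru["b_r"])  # reset gate
        h_cand = np.tanh(np.concatenate([r * h, s], axis=1) @ gru["W"])
        h = (1.0 - z) * h + z * h_cand
    return h

rng = np.random.default_rng(0)
T, N, P, d, n_hidden = 6, 4, 2, 3, 5
X_seq = rng.random((T, N, P))                    # synthetic, already normalized
adjs = [cluster_adjacency([0, 0, 1, 1]),         # hypothetical clusters, pollutant 0
        cluster_adjacency([0, 0, 0, 1])]         # hypothetical clusters, pollutant 1
thetas = [rng.normal(size=(1, d)) for _ in range(P)]
gru = {
    "W_z": 0.1 * rng.normal(size=(n_hidden, n_hidden)),
    "U_z": 0.1 * rng.normal(size=(P * d, n_hidden)),
    "b_z": np.zeros(n_hidden),
    "W_r": 0.1 * rng.normal(size=(n_hidden, n_hidden)),
    "U_r": 0.1 * rng.normal(size=(P * d, n_hidden)),
    "b_r": np.zeros(n_hidden),
    "W": 0.1 * rng.normal(size=(n_hidden + P * d, n_hidden)),
}

h = gcn_gru_forward(X_seq, adjs, thetas, gru, n_hidden)
print(h.shape)
```

Under these assumptions, nodes 0 and 1 share a cluster for both pollutants and therefore receive identical hidden states, which makes visible how strongly a fully connected cluster graph smooths the features within a cluster.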

4.1.5. Output Block

After extracting spatiotemporal features through the graph convolution and GRU blocks, the output block employs fully connected (FC) layers to map the learned hidden representations to the target prediction dimensions. For multi-step forecasting over a prediction horizon of T' time steps, the output is calculated as follows:

{\hat{Y}}_{t + 1 : t + T'} = W F + b

(12)

where {\hat{Y}}_{t + 1 : t + T'} denotes the predicted pollutant concentrations (e.g., PM_2.5) for the next T' time steps; F represents the spatiotemporal feature vector extracted from the previous layers; and W and b are the weight matrix and bias vector of the fully connected layers, respectively.

4.2. Model Validation

To validate the performance of the proposed model in predicting PM_2.5 concentrations, four statistical metrics were used: mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R²), and Index of Agreement (IA). These metrics were selected not only because of their widespread use in the regression literature but also because they suit the specific characteristics of air pollution data and the analytical needs of this domain [21,57,58,59]. In air pollution forecasting, the accuracy of predictions is critical for decision-making related to air quality and public health. MAE measures the average absolute error, while RMSE emphasizes large errors, helping evaluate the model’s performance during significant concentration changes. R² assesses the proportion of variance in the observed data that is predictable from the model, offering an interpretable measure of model fit across varying concentration levels. IA measures the overall agreement between the predicted and observed values, serving as a dimensionless metric that reflects the model’s overall accuracy in predicting data behavior. Their definitions are given in Equations (13)–(16):

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |o_{i} - p_{i}|

(13)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(o_{i} - p_{i})}^{2}}

(14)

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(o_{i} - p_{i})}^{2}}{\sum_{i = 1}^{n} {(o_{i} - \bar{o})}^{2}}

(15)

\mathrm{IA} = 1 - \frac{\sum_{i = 1}^{n} {(o_{i} - p_{i})}^{2}}{\sum_{i = 1}^{n} {(|p_{i} - \bar{o}| + |o_{i} - \bar{o}|)}^{2}}

(16)

where o_{i} and p_{i} are the observed (ground truth) and predicted values, respectively, and \bar{o} is the average of the n observed values.
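The four metrics translate directly into a few lines of NumPy. The sketch below follows Equations (13)–(16); the function name `evaluate` and the four-point example are illustrative and are not values from the experiments.

```python
import numpy as np

def evaluate(obs, pred):
    """Return MAE, RMSE, R^2, and IA for observed and predicted values."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = obs - pred
    o_bar = obs.mean()
    sse = np.sum(err ** 2)
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": float(1.0 - sse / np.sum((obs - o_bar) ** 2)),
        "IA": float(
            1.0 - sse / np.sum((np.abs(pred - o_bar) + np.abs(obs - o_bar)) ** 2)
        ),
    }

# Toy example: absolute errors 2, 2, 3, 1 around a mean observation of 25.
metrics = evaluate([10, 20, 30, 40], [12, 18, 33, 39])
print(metrics)
```

For this toy input the sum of squared errors is 18, so MAE = 2, RMSE = \sqrt{4.5}, R² = 1 − 18/500, and IA = 1 − 18/1938.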

4.3. Model Evaluation

In this subsection, the performance of the proposed model is evaluated. The aim is to assess the accuracy, robustness, and reliability of the model in predicting PM_2.5 concentrations, in comparison with baseline models. To ensure a fair and consistent comparison, identical training, validation, and testing datasets are used across all models. Moreover, the evaluation is conducted using the performance metrics introduced in Section 4.2. For optimal performance, the hyperparameters of each model are fine-tuned using a grid search strategy. The proposed model is evaluated by comparing it with seven baseline models:

(1): GRU competitive method

As one of the competitive prediction methods, this approach relies solely on the temporal dependencies of observational data from AQMSs to forecast future values, without considering the spatial dependencies among AQMSs. This approach can highlight the impact of ignoring the spatial dimension in predicting PM_2.5 levels. In this method, the pollutant observations from AQMSs, along with the relevant meteorological data as auxiliary information, are fed into the network. The model makes predictions separately for each AQMS, and then evaluation metrics are calculated as the average across all AQMSs. Thus, the input matrix is three-dimensional, consisting of batch size, time step, and number of features. What enables these models to effectively simulate temporal dependencies and variations in pollutants over time is the allocation of a separate dimension to historical data as time step. The structure of the proposed GRU model for the problem of predicting PM_2.5 concentrations at 11 AQMSs in Tehran is shown in Figure S1.

(2): LSTM competitive method

As another competitive prediction method, this method—similar to GRU—focuses exclusively on the temporal dependencies in individual AQMS time series, while ignoring the spatial dependencies among AQMSs. LSTM uses three gating mechanisms (input, forget, and output), which result in a larger parameter space and increased computational overhead compared to GRU, which only uses two gates (update and reset) with a simpler architecture, as discussed in Section 3.2. This configuration highlights the trade-off between enhanced temporal modeling capabilities and the associated computational complexity. Similar to the GRU model, pollutant observations and auxiliary meteorological data are fed into the network in a three-dimensional tensor (batch size, time step, features). Each AQMS is modeled independently, and the prediction accuracy is evaluated as the average across all AQMSs. The structure of the proposed LSTM model is shown in Figure S2.

(3): GRU and LSTM with multi-head attention competitive method

In standard GRU and LSTM architectures, the trainable weights (parameters) are shared across all time steps, although the hidden states vary at each step. Although this weight-sharing strategy promotes efficiency and temporal generalization, it limits the model’s ability to adaptively focus on specific time steps during sequence processing. By integrating a multi-head attention mechanism applied over the full sequence of GRU and LSTM outputs, these models can learn to assign distinct attention weights to each time step. This enables them to emphasize relevant temporal segments and capture more complex dependencies. Specifically, the input tensor has the shape (batch size, time step, number of features), and the attention mechanism computes the query, key, and value matrices from the full output sequence of the LSTM or GRU. Attention scores are calculated using scaled dot-product operations, normalized through the softmax function, and applied to weight the value vectors accordingly. The resulting multi-head outputs are concatenated and projected back to the original hidden size dimension. For the final prediction, the representation obtained from the attention mechanism corresponding to the last time step is passed through a fully connected layer to estimate PM_2.5 concentrations. This approach enables the model to focus on more important parts of the temporal data, although it increases computational complexity. The overall architectures of the models are shown in Figures S3 and S4.

(4): CNN-GRU competitive method

The hybrid CNN-GRU architecture, which functions as a competitive forecasting method, simultaneously models spatial and temporal dependencies. In this approach, the irregular graph-based structure of AQMSs is transformed into a regular spatial grid. This model allows for evaluating the impact of neglecting the true spatial topology on the model’s performance. The model’s input is a four-dimensional tensor, where each time step corresponds to a two-dimensional image representing the spatial distribution of AQMS data. For each AQMS, its three geographically nearest neighbors are placed adjacent to it, forming a local region of data. In the first stage, an independent convolutional filter with a kernel size of (1 × 4) is directionally applied along the width axis for each time step (which will later be set to 24). This design enables the model to learn local patterns across horizontally adjacent AQMSs while preserving the vertical structure of the image (height dimension). The extracted spatial features are then combined with auxiliary data (i.e., meteorological variables) and passed through two consecutive GRU layers to capture temporal dependencies. Finally, the GRU output is processed through fully connected layers to produce a two-dimensional output matrix, where each row corresponds to the predictions for a specific AQMS, and each column represents the predicted PM_2.5 concentration for one of the next 8 time steps. The structure of this model is shown in Figure S5.
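The grid construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the station coordinates, the ordering of neighbors within a row (the station itself first, then its neighbors by increasing distance), and the kernel weights are all assumptions, and in the actual model the (1 × 4) kernel weights are learned during training.

```python
import math


def neighbor_rows(coords, k=3):
    """For each station index, return [itself, its k nearest neighbours]."""
    rows = []
    for i, (xi, yi) in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            # Sort by distance; ties are broken by station index
            key=lambda j: (math.hypot(coords[j][0] - xi, coords[j][1] - yi), j),
        )
        rows.append([i] + others[:k])
    return rows


def build_image(values, rows):
    """Arrange one time step of per-station values into a (stations x 4) grid."""
    return [[values[j] for j in row] for row in rows]


def apply_1x4_kernel(image, weights):
    """Slide a (1 x 4) kernel along the width axis: one output per row."""
    return [sum(w * v for w, v in zip(weights, row)) for row in image]


# Hypothetical planar coordinates (km) and PM2.5 values for five stations
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
rows = neighbor_rows(coords)
image = build_image([10.0, 11.0, 12.0, 30.0, 31.0], rows)
features = apply_1x4_kernel(image, [0.4, 0.2, 0.2, 0.2])  # illustrative weights
print(rows)
print(features)
```

Since each row has width 4, a (1 × 4) kernel produces exactly one spatial feature per station and time step, leaving the height (station) dimension intact.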

(5): Distance-based GCN-GRU and wind-driven dynamic GAT-GRU

Two other competitive methods considered to compare the performance in forecasting PM_2.5 concentration are based on combining graph convolutional networks (GCNs) or graph attention networks (GATs) with a GRU block. The goal of these methods is to evaluate the performance of using static graphs and dynamic graphs with graph attention mechanisms for modeling spatial dependencies among AQMSs, in comparison to our proposed model. The distance-based GCN-GRU model differs from our proposed model in the formulation of the static adjacency matrix: a connection between two AQMSs is established only if their pairwise distance falls below a predefined threshold, and the corresponding edge weight is assigned as the inverse of their distance, as defined by the following:

A_{ij} = \begin{cases} \frac{1}{d_{ij}}, & \text{if } i \neq j \text{ and } d_{ij} < R \\ 0, & \text{otherwise} \end{cases}

(17)

where $A_{ij}$ denotes the weight of the edge between nodes (AQMSs) $i$ and $j$; $d_{ij}$ is the Euclidean distance between these two AQMSs; and $R$ is the fixed distance threshold, set to 5 km.
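The static adjacency matrix of Equation (17) can be illustrated with the following minimal sketch. The function name `distance_adjacency` and the station coordinates are hypothetical; the coordinates are assumed to be planar (projected) and expressed in kilometers so that the Euclidean distance is meaningful.

```python
import math


def distance_adjacency(coords, radius_km=5.0):
    """Build the inverse-distance adjacency matrix of Eq. (17)."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops in Eq. (17)
            d = math.dist(coords[i], coords[j])
            # Connect only stations closer than the threshold R;
            # co-located stations (d == 0) are skipped to avoid division by zero
            if 0.0 < d < radius_km:
                adj[i][j] = 1.0 / d
    return adj


# Hypothetical station coordinates in km
coords = [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0), (10.0, 10.0)]
A = distance_adjacency(coords)
for row in A:
    print([round(v, 3) for v in row])
```

In this toy layout the first two stations are exactly 5 km apart and therefore remain unconnected, because the condition in Equation (17) is strict, and the fourth station is isolated. Isolated nodes are one practical consequence of a fixed distance threshold.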

The wind-driven dynamic GAT-GRU model differs from our proposed model not only in the formulation of the adjacency matrix, but also in its use of GAT instead of GCN. In this model, the graph structure, i.e., the presence of directed edges between nodes, is dynamically determined at each time step based on wind direction. Specifically, an edge from node $i$ to node $j$ is created if the wind at AQMS $i$ points toward AQMS $j$, formalized by the following:

\cos(\theta_{ij}^{(t)}) = \frac{\vec{w}_{i}^{(t)}}{\lVert \vec{w}_{i}^{(t)} \rVert} \cdot \frac{\vec{p}_{j} - \vec{p}_{i}}{\lVert \vec{p}_{j} - \vec{p}_{i} \rVert} > 0

(18)

Edge weights are computed by the following:

A_{ij}^{(t)} = \lVert \vec{w}_{i}^{(t)} \rVert \cdot \cos(\theta_{ij}^{(t)}) \cdot \exp\left(-\frac{d_{ij}^{2}}{2\sigma^{2}}\right)

(19)

where $\vec{w}_{i}^{(t)}$ denotes the wind vector at AQMS $i$ at time $t$; $\vec{p}_{i}$ and $\vec{p}_{j}$ are the coordinates of AQMSs $i$ and $j$, respectively; and $\sigma$ is a distance smoothing hyperparameter. This formulation yields a sparse, directed, and temporally dynamic graph. The GAT layer assigns learnable attention coefficients to edges, reflecting their relative importance in node feature aggregation and enabling the model to learn an optimal graph structure at each time step. In both models, after extracting spatial features through GCN or GAT layers, the outputs are fed into a GRU block to model temporal dependencies in the data. The final structure of each model includes fully connected layers to generate the final prediction of PM_2.5 concentration at different AQMSs. The overall architecture of these models is shown in Figures S6 and S7.
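Equations (18) and (19) can be illustrated for a single time step with the sketch below. Several details are assumptions made only for this example: the wind vector is taken to point in the direction the air is moving toward, coordinates are planar and in kilometers, stations with calm wind (zero speed) emit no edges, and the value of `sigma_km` is arbitrary rather than the tuned hyperparameter.

```python
import math


def wind_adjacency(coords, winds, sigma_km=5.0):
    """Directed edge weights for one time step, following Eqs. (18)-(19)."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        wx, wy = winds[i]
        speed = math.hypot(wx, wy)
        if speed == 0.0:
            continue  # assumption: calm wind defines no direction, so no edges
        for j in range(n):
            if i == j:
                continue
            dx = coords[j][0] - coords[i][0]
            dy = coords[j][1] - coords[i][1]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # co-located stations have no defined bearing
            # Eq. (18): cosine between wind direction and the bearing i -> j
            cos_theta = (wx * dx + wy * dy) / (speed * dist)
            if cos_theta > 0.0:
                # Eq. (19): wind speed x alignment x Gaussian distance decay
                adj[i][j] = speed * cos_theta * math.exp(
                    -dist ** 2 / (2 * sigma_km ** 2)
                )
    return adj


# Hypothetical stations (km) and wind vectors (m/s) at one time step
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
winds = [(2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
A_t = wind_adjacency(coords, winds)
for row in A_t:
    print([round(v, 4) for v in row])
```

The resulting matrix is generally asymmetric: in this toy case station 0 sends an edge to station 1, but station 1 sends none back, which is what makes the graph directed. Recomputing the matrix at every time step is also the source of the additional cost relative to a fixed adjacency matrix.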

4.4. Experimental Settings

The experiments in this study were conducted on a computing system running Windows 11. The hardware configuration included an NVIDIA GeForce RTX 3050 Ti GPU and 16 GB of RAM. On the software side, Python 3.10.4 was used along with PyTorch 2.4.1 and PyTorch Geometric (PyG) 2.6.1 to implement and train all models. The pollutant concentration data were collected by AQMSs at an hourly sampling rate; therefore, each time step corresponds to one hour. An input window of 24 previous time steps (equivalent to 24 h), referred to as the time step, was used to predict future PM_2.5 concentrations. This window length was selected using the Fourier transform, which converts the pollutant concentration data from the time domain to the frequency domain. As shown in Figure S8, the frequency corresponding to the 24 h cycle was the dominant component in this domain. The goal of the prediction was to forecast PM_2.5 concentrations simultaneously across all AQMSs at horizons of 1, 2, 4, 8, 12, 24, 48, and 72 h ahead. All models were trained using the mean squared error (MSE) loss function with a weight decay of 0.0001, a batch size of 64, and a maximum of 400 training epochs. To reduce computational cost and prevent overfitting, early stopping with a patience value of 5 was employed. The remaining hyperparameters for each model were independently tuned using a grid search approach to identify the combination that yields the best model performance. The final selected hyperparameter values for each model are presented in Table S2.
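The use of the Fourier transform to choose the input window can be illustrated with the following sketch, which applies a naive discrete Fourier transform to a synthetic hourly signal and reports the period of the strongest component. The signal is artificial (a 24 h cycle plus a weaker 12 h cycle) and is not the Tehran data; the function name `dominant_period` is hypothetical, and a production analysis would normally use an FFT routine instead of this quadratic-time loop.

```python
import cmath
import math


def dominant_period(signal, sampling_interval_h=1.0):
    """Return the period (in hours) of the strongest non-zero frequency."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the zero-frequency offset
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient, computed directly from the definition
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    # Frequency index k corresponds to k cycles over the whole record
    return n * sampling_interval_h / best_k


# Synthetic 10-day hourly series: strong daily cycle, weaker half-day cycle
signal = [
    30.0 + 10.0 * math.sin(2 * math.pi * t / 24)
    + 3.0 * math.sin(2 * math.pi * t / 12)
    for t in range(240)
]
period_h = dominant_period(signal)
print(period_h)
```

When the dominant period is 24 h, an input window of 24 time steps covers exactly one full cycle of the strongest periodic component.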

5. Results

5.1. PM_2.5 Dependency on Other Pollutants

The concentrations of PM_2.5 and five other major air pollutants (PM₁₀, NO₂, SO₂, CO, and O₃) were monitored hourly at each AQMS. To obtain a unified time series for the entire study area, the hourly average concentration of each pollutant across all AQMSs was calculated. This process yielded a single continuous hourly time series spanning the study period. Using this aggregated time series, the Copula modeling framework was employed to quantify the statistical dependencies between PM_2.5 (as the target pollutant) and the other pollutants.

This analysis showed that PM_2.5 exhibits varying statistical dependencies with other pollutants (Table 3). PM₁₀ showed the strongest positive dependence with PM_2.5 (τ = 0.665), with symmetric tail dependence ($\lambda_{upper}$ = $\lambda_{lower}$ = 0.229), modeled using a t-Student copula. This strong association is primarily attributed to the physical nature of the pollutants, as PM_2.5 is a subset of PM₁₀, and their common anthropogenic sources such as vehicular emissions, road dust resuspension, and industrial activities [60]. This explains their consistent co-occurrence across both mild and extreme pollution levels. NO₂ and SO₂ exhibited moderate positive dependencies with PM_2.5 (with Kendall’s τ values of 0.424 and 0.388, respectively), both modeled using a t-Student copula with non-zero upper and lower tail dependencies. These results indicate a significant co-occurrence of high (or low) concentrations of these gaseous pollutants alongside PM_2.5, which can be attributed to common anthropogenic sources such as vehicular emissions and industrial activities [9]. CO showed a weaker dependence with PM_2.5 (τ = 0.386), with both upper and lower tail dependence coefficients equal to zero, as modeled using a Frank copula. This indicates a general positive association without significant tail dependence. The relationship between O₃ and PM_2.5 was negative, with a Kendall’s τ of −0.262, modeled using a 90° rotated Gumbel copula, which exhibited no tail dependence. This negative association aligns with previous findings in the literature [55,61,62], as the photochemical processes responsible for ozone formation typically occur under atmospheric conditions that differ from those leading to elevated PM_2.5 levels.
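For readers unfamiliar with Kendall’s τ, the rank-based dependence measure reported above, the following sketch computes it from concordant and discordant pairs. It implements the simple tau-a variant, which ignores ties; the estimator actually used in the study is not stated here, and the concentration values are invented purely for illustration.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    concordant = discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # A pair is concordant when both series move in the same direction
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented hourly averages, for illustration only
pm25 = [12.0, 18.0, 25.0, 31.0, 40.0]
pm10 = [30.0, 28.0, 55.0, 60.0, 90.0]
o3 = [50.0, 45.0, 38.0, 40.0, 20.0]

tau_pm10 = kendall_tau(pm25, pm10)
tau_o3 = kendall_tau(pm25, o3)
print(tau_pm10, tau_o3)
```

A positive τ means that high values of one pollutant tend to coincide with high values of the other, and a negative τ indicates the opposite. τ summarizes the overall (central) dependence only; the tail dependence coefficients come from the fitted copula and are not computed by this sketch.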

5.2. Spatial Clustering of AQMSs

Spatial clustering analysis using AHC was conducted on the time series of air pollutants recorded at AQMSs within the study area. Internal validation indices, including the Silhouette score, Dunn index, and Calinski–Harabasz index, were employed to determine the optimal number of clusters. The validity of the resulting clusters was assessed using the q-index and the non-central F-test. The findings revealed that for O₃, CO, and SO₂, the temporal behavior of pollutant concentrations was not significantly influenced by the spatial location of the AQMSs, resulting in the formation of a single cluster encompassing all AQMSs. However, for the pollutants NO₂, PM₁₀, and PM_2.5, the analysis showed that their temporal concentration patterns were influenced by the spatial locations of the AQMSs, resulting in the formation of five, four, and five clusters, respectively, each exhibiting distinct spatial patterns, as shown in Figure 4a–c.
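Of the internal validation indices mentioned above, the Silhouette score is the simplest to illustrate. The sketch below evaluates two candidate groupings of four toy station time series. The Euclidean distance between series, the toy data, and the convention of scoring singleton clusters as zero are assumptions made for this example; the sketch does not reproduce the AHC procedure or the other indices.

```python
import math


def silhouette_score(series, labels):
    """Mean silhouette over all series, using Euclidean distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")
    n = len(series)
    scores = []
    for i in range(n):
        own = [
            math.dist(series[i], series[j])
            for j in range(n)
            if j != i and labels[j] == labels[i]
        ]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(series[i], series[j])
                for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


# Toy station time series (three time steps each), for illustration only
series = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [10.0, 11.0, 12.0], [10.0, 12.0, 12.0]]
good = silhouette_score(series, [0, 0, 1, 1])
bad = silhouette_score(series, [0, 1, 0, 1])
print(good, bad)
```

A candidate number of clusters would be scored by running the clustering for each value and keeping the partition with the highest score, ideally in agreement with the other indices.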

For NO₂ (Figure 4a), five distinct clusters were identified. Cluster 1 (blue) includes AQMSs located in the central–southwestern urban–industrial zones with high population and emission density; Cluster 2 (green) consists of AQMSs situated in eastern peripheral areas as well as one central station, reflecting both regional transport and localized urban effects; Cluster 3 (maroon) contains only the Rey AQMS, located in the southern part of the city, likely influenced by nearby power plants and industrial facilities; Cluster 4 (pastel green) encompasses AQMSs in the northern part of the city, where urban infrastructure and topographic influences may shape NO₂ patterns; and Cluster 5 (yellow) includes a single centrally located AQMS, suggesting site-specific temporal variability. For PM₁₀ (Figure 4b), four distinct clusters were identified, with most AQMSs within each cluster exhibiting close spatial proximity. This spatial coherence suggests that local environmental conditions and emission sources play a dominant role in shaping PM₁₀ distribution patterns. Nevertheless, two AQMSs deviated from the general spatial clustering pattern. Mantagheh 22 AQMS, despite being located in the western part of the city, was grouped with AQMSs in the southern region, suggesting similar temporal trends likely driven by shared emission sources, long-range pollutant transport, or dominant wind directions. Piroozi AQMS, although geographically situated in the eastern zone, displayed temporal behavior more aligned with central urban stations, indicating that urban activity intensity or localized environmental conditions exerted a stronger influence than its physical location. For PM₂.₅ (Figure 4c), five distinct clusters were identified that, similar to PM₁₀, mostly consist of AQMSs with significant spatial proximity within each cluster, reflecting the strong influence of local environmental conditions and shared pollutant sources in shaping similar temporal patterns. 
However, some deviations from this pattern were observed: Mantagheh 22 AQMS, despite being located in western Tehran, was grouped in Cluster 2 (dark green) along with AQMSs from central areas, indicating similar temporal trends likely influenced by pollutant transport, prevailing wind directions, or common pollution sources such as industrial activities. Additionally, Masoudieh 22 AQMS, although situated at the eastern boundary of Tehran, was grouped with northern AQMSs in Cluster 3 (yellow), reflecting temporal alignment with northern city patterns. Furthermore, Piroozi AQMS, also geographically located at the eastern boundary of Tehran, was grouped in Cluster 4 (light green) alongside central city AQMSs, suggesting that the intensity of urban activities or local environmental conditions had a stronger influence than its physical location. Based on the dependency analysis (Section 5.1) and the spatial clustering results (Section 5.2), among the six main pollutants measured by the AQMSs, besides PM_2.5 as the target pollutant, NO₂ and PM₁₀ were selected as input variables to the model. Both exhibit non-zero dependency with PM_2.5 across all three parts of the distribution (central dependency, upper tail dependency, and lower tail dependency) as well as clustered spatial patterns across the AQMSs in Tehran.

5.3. Comparison of Model Performance Across Spatial Clustering of AQMSs

Table 4 summarizes the performance of the proposed ClusLite-STGCN-GRU and seven baseline models on the test set across forecasting horizons from +1 to +72 h, using the validation metrics introduced in Section 4.2; the best results are marked in bold. Table S3 further details the performance across the training, validation, and test datasets. Additionally, the correlation between observed and predicted PM_2.5 values for 1, 2, 4, 8, 12, 24, 48, and 72 h ahead forecasting across various models on the training and test datasets is shown in Figures S9–S16. As can be seen, the non-hybrid baseline models (i.e., the first four competitive methods) exhibit lower forecasting accuracy than the hybrid architectures incorporating CNN, GCN, or GAT components (i.e., the remaining three baselines and the proposed model). These findings align with the main hypothesis of this study, which assumes that air pollutant concentrations and dispersion patterns are governed not only by the temporal dynamics of individual AQMS observations but also by spatial interactions among AQMSs. Additionally, although prediction errors increase with longer forecasting horizons in all models, our proposed model exhibits a lower rate of error growth compared to the baseline models.

Among the non-hybrid baseline models, GRU demonstrated the best balance between simplicity, computational efficiency, and forecasting accuracy across all horizons. While incorporating multi-head attention into GRU led to moderate improvements in short-term predictions (up to 8 h), it performed slightly worse than GRU in long-term forecasts, despite the added complexity. In contrast, both LSTM and LSTM with multi-head attention consistently underperformed compared to their GRU counterparts, indicating limited benefits from the added complexity under the current configuration.

Among the hybrid baseline models, CNN-GRU, which uses CNNs to extract spatial dependencies, performed worse than graph-based models (GCN- or GAT-based). This suggests that converting the irregular distribution of AQMSs into a regular Euclidean grid—as required by CNNs—results in a loss of spatial precision, thereby reducing forecasting accuracy. Nevertheless, it still outperformed purely temporal models, highlighting the importance of incorporating spatial features, even when the spatial representation is suboptimal.

Among the graph-based baseline models, wind-driven dynamic GAT-GRU showed better performance in short-term predictions (up to 8 h). This is likely due to the model’s ability to adaptively and dynamically model spatiotemporal relationships influenced by wind, highlighting the crucial role of wind in the rapid dispersion of pollutants over short periods. However, at longer horizons (from 12 h onward), the advantage of the wind-based model diminished. At these horizons, the performance of wind-driven dynamic GAT-GRU fell below that of the ClusLite-STGCN-GRU model, indicating that wind plays a less significant role at longer horizons, where factors such as cumulative uncertainty, atmospheric mixing, and broader environmental variables are more influential. This trend confirms that although wind is an important factor in the accuracy of short-term forecasts, broader temporal and spatial factors dominate in long-term predictions. In contrast, the clustering approach in the ClusLite-STGCN-GRU model, which groups AQMSs based on temporal and behavioral similarities of pollutant patterns, demonstrated significant effectiveness, particularly in medium-term (+8 to +12 h) and long-term (+24 to +72 h) horizons. By simplifying complex spatial relationships and focusing on similar behavioral patterns, this model was able to maintain stability and appropriate accuracy in predictions.

5.4. Comparison Results of Model Complexity

In this subsection, the complexity of the ClusLite-STGCN-GRU model is further analyzed. Given that the proposed model is a hybrid graph-based deep learning model, the comparison was limited to two similar deep learning approaches: distance-based GCN-GRU and wind-driven dynamic GAT-GRU. Table 5 presents the comparison results from various aspects of computational complexity, including training and inference time, memory consumption, number of floating-point operations (FLOPs), and number of parameters. Among them, the forward and backward propagation time (FBP), measured in seconds (s), evaluates the offline training speed per epoch, while the forward propagation time (FP), in milliseconds (ms), reflects the online inference speed per sample. Total epoch time (seconds) accounts for the entire duration of one training epoch, including computation and data handling. FLOPs, a unitless measure, indicate the number of floating-point operations needed to process one sample—a lower value means a lighter, faster model. The number of parameters, also unitless, represents the count of trainable weights and biases, reflecting model capacity. Inference memory allocated, reported in megabytes (MB), shows the RAM or GPU memory required during prediction.

Overall, the results in Table 5 highlight the superior computational efficiency of the ClusLite-STGCN-GRU model compared to the other two graph-based deep learning baselines. It achieves 12% faster training per epoch compared to distance-based GCN-GRU and is 65.7% faster than wind-driven dynamic GAT-GRU. In terms of online inference speed, ClusLite-STGCN-GRU is significantly faster—66.8% and 79.7% faster, respectively—than the two baselines. Moreover, it reduces inference memory usage by 82.72% compared to wind-driven dynamic GAT-GRU, and its FLOP count is up to 84.29% lower, indicating substantially lighter computations. Although its number of parameters is slightly higher (about 1.8% more than distance-based GCN-GRU), this does not affect performance; in fact, the model still achieves a 15.3% reduction in total epoch time over distance-based GCN-GRU and a 65.7% reduction compared to wind-driven dynamic GAT-GRU. These results confirm its overall advantage in time efficiency, scalability, and suitability for real-time deployment.

6. Discussion

Table 6 compares the results of our study with those of previous research conducted in the same study area, Tehran. All results in this comparison correspond to daily PM_2.5 predictions. Although all of these studies focus on PM_2.5 prediction, their conditions differ in datasets, time periods, and prediction models, which limits the comparability of the results and constitutes a limitation of the present study. Nevertheless, with these limitations acknowledged, the comparison indicates that the proposed method achieves improved performance in PM_2.5 prediction.

Nabavi et al. [63] employed random forest to estimate daily PM_2.5 concentrations in Tehran using satellite-based 10 km resolution merged dark target and deep blue (DB_DT) aerosol optical depth (AOD) along with meteorological data from 2011 to 2016. To enhance the relationship between satellite AOD and surface-level PM_2.5, the study incorporated relative humidity adjustments and normalized AOD values using planetary boundary layer height (PBLH), supported by aerosol layer height (ALH) data derived from 159 CALIPSO profiles. While the model achieved a moderate level of accuracy (RMSE = 17.52 μg/m³, R² = 0.68), it exhibited reduced performance in the summer and in the northern and eastern regions of Tehran, likely due to the absence of variables representing secondary aerosol formation and long-range pollutant transport mechanisms. Zamani Joharestani et al. [64] improved upon this by implementing XGBoost on data collected from 2015 to 2018, incorporating a comprehensive set of 23 features including ground-measured PM_2.5, satellite AOD at 3 km resolution, meteorological parameters, and geographical information. Their model achieved an RMSE of 13.58 μg/m³ and an MAE of 9.93 μg/m³, with a maximum R² of 0.81 after eliminating irrelevant features. However, they found that including satellite-derived AOD reduced model performance (R² dropped to 0.63–0.67), suggesting that AOD may not significantly enhance PM_2.5 prediction in dense urban environments such as Tehran, particularly when high-resolution ground and meteorological data are available.

In the most similar study, Faraji et al. [47] proposed a hybrid deep learning model that integrates three-dimensional convolutional neural networks (3D CNNs) with gated recurrent units (GRUs) to simultaneously capture spatiotemporal dependencies in PM_2.5 data. The model was applied to air quality data collected from multiple AQMSs across Tehran between 2016 and 2019, and its performance was directly compared with machine learning models such as SVR and standalone deep learning models including ANN, LSTM, and GRU. For daily predictions, their proposed model achieved an RMSE of 15.21 µg/m³ and an MAE of 12.00 µg/m³, outperforming other models. However, the relatively higher error values compared to some other studies are likely due to the greater complexity of the data during the study period—such as stronger fluctuations in pollutant concentrations or limited availability of high-quality auxiliary data—rather than a limitation of the model architecture itself. In comparison, our proposed ClusLite-STGCN-GRU model, applied to more recent data from 2019 to 2022, achieved an RMSE of 13.45 μg/m³ and an MAE of 9.24 μg/m³, demonstrating the best performance among all studies conducted in Tehran. Although variations in study periods, features, and data quality limit direct comparison, the improved performance of our model can be attributed to the effective integration of spatial clustering with spatiotemporal graph convolutional networks. This approach better captures the complex spatial and temporal dependencies among AQMSs, resulting in enhanced accuracy of pollutant modeling, as confirmed by comparisons with competitive baseline models.

7. Conclusions

The irregular distribution of AQMSs and the spatiotemporal dynamics of air pollution have made graph-based spatiotemporal modeling an effective approach for accurate air quality forecasting. In such models, node features and edge weights vary over time, resulting in dynamic graph structures. However, many of the existing models rely on techniques such as time-varying adjacency matrices, adjacency matrices learned during training, or the combination of multiple adjacency matrices to achieve high accuracy. Although these approaches achieve good results, they often lead to increased computational complexity, implementation challenges, and reduced efficiency in practical applications. To address these limitations, a new lightweight spatiotemporal prediction model, ClusLite-STGCN-GRU, is proposed. The experimental results show that our proposed model, by simplifying the graph structure through clustering of AQMSs and feature selection based on spatiotemporal dependencies, achieves an effective balance between prediction accuracy, computational efficiency, and implementation simplicity. In medium- and long-term forecasting horizons (8 to 72 h), the model shows more stable and accurate performance compared to more complex graph-based models. Although there is a decrease in short-term (1 to 8 h) accuracy compared to models with dynamic and complex structures such as wind-driven dynamic GAT-GRU, the proposed model still delivers comparable and acceptable performance. From a computational standpoint, ClusLite-STGCN-GRU demonstrates superior efficiency, with reductions of up to 65.7% in training time, 82.7% in memory usage, and 84.3% in FLOPs compared to baseline models, making it well-suited for real-time applications. These results confirm that the proposed model establishes an effective balance between accuracy, simplicity, and computational efficiency, making it a practical choice for spatiotemporal air pollution forecasting.

To further evaluate the generalizability and robustness of the proposed model, future studies should investigate its performance across diverse geographic regions with varying topographical, climatic, and pollution-source characteristics. In addition, the effective deployment of this model in real-world environments, its integration with intelligent big data platforms, and its role in the development of comprehensive smart urban management systems are of great importance. To build such systems, leveraging the complementary strengths of these two types of models—dynamic models such as wind-driven dynamic GAT-GRU and clustering-based models such as ClusLite-STGCN-GRU—can prove beneficial. While dynamic models like wind-driven dynamic GAT-GRU offer higher accuracy for short-term predictions, their computational complexity may hinder deployment in real-time and rapid-response scenarios. Optimizing and simplifying such models can facilitate their use in short-term forecasting tasks. In contrast, clustering-based models such as ClusLite-STGCN-GRU are better suited for medium- to long-term forecasting, as they capture stable patterns based on temporal and behavioral similarities among monitoring stations. This performance distinction highlights the potential for designing hybrid intelligent systems that integrate both approaches at different levels of decision-making. Moreover, given the generalizable graph-based structure of the proposed model, it can also be applied to other domains such as weather forecasting, urban traffic flow analysis, and transportation demand estimation.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/land14081589/s1, Figure S1: GRU predictive model framework; Figure S2: LSTM predictive model framework; Figure S3: GRU with multi-head attention predictive model framework; Figure S4: LSTM with multi-head attention predictive model framework; Figure S5: CNN-GRU predictive model framework; Figure S6: distance-based GCN-GRU predictive model framework; Figure S7: wind-driven dynamic GAT-GRU predictive model framework; Figure S8: (a) hourly PM_2.5 concentration signal in the time domain; (b) frequency domain representation highlighting dominant 24 h, 12 h, and 6 h cycles; Figures S9–S16: correlation between observed and predicted PM_2.5 values in 1, 2, 4, 8, 12, 24, 48, and 72 h ahead forecasting, respectively, by different models on the training and test datasets: (a) GRU, (b) LSTM, (c) GRU with multi-head attention, (d) LSTM with multi-head attention, (e) CNN-GRU, (f) distance-based GCN-GRU, (g) wind-driven dynamic GAT-GRU, (h) ClusLite-STGCN-GRU; Table S1: features of the papers devoted to the implementation of spatiotemporal hybrid deep learning models for air quality prediction; Table S2: details of the experimental settings; Table S3: model results on the three datasets: training, validation, and test.

Author Contributions

Conceptualization, M.T.A.; methodology, M.T.A., A.A.A. and F.R.; data curation, M.T.A.; formal analysis, M.T.A. and A.A.A.; investigation, M.T.A., A.A.A. and F.R.; project administration, M.T.A.; validation, M.T.A., A.A.A. and F.R.; visualization, M.T.A. and F.R.; writing—original draft preparation, M.T.A.; writing—review and editing, A.A.A. and F.R.; supervision, A.A.A. and F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this paper are available at the following GitHub repository: https://github.com/m-t-abbasi/ClusLite-STGCNGRU, accessed on 31 July 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, J.; Fu, M.; Wang, L.; Liang, Y.; Tang, F.; Li, S.; Wu, C. Impact of Urban Shrinkage on Pollution Reduction and Carbon Mitigation Synergy: Spatial Heterogeneity and Interaction Effects in Chinese Cities. Land 2025, 14, 537.
2. WHO (World Health Organization). Available online: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (accessed on 31 July 2025).
3. Habibi, R.; Alesheikh, A.A.; Mohammadinia, A.; Sharif, M. An Assessment of Spatial Pattern Characterization of Air Pollution: A Case Study of CO and PM2.5 in Tehran, Iran. ISPRS Int. J. Geo-Inf. 2017, 6, 270.
4. Liu, F.; Jia, S.; Ma, L.; Lu, S. Spatiotemporal Dynamic Evolution of PM2.5 Exposure from Land Use Changes: A Case Study of Gansu Province, China. Land 2025, 14, 795.
5. Song, Y.; Mao, H.; Li, H. Spatio-Temporal Modeling for Air Quality Prediction Based on Spectral Graph Convolutional Network and Attention Mechanism. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–9.
6. Liu, Z.; Fang, Z.; Hu, Y. A Deep Learning-Based Hybrid Method for PM2.5 Prediction in Central and Western China. Sci. Rep. 2025, 15, 10080.
7. Guan, Q.; Wang, J.; Ren, S.; Gao, H.; Liang, Z.; Wang, J.; Yao, Y. Predicting Short-Term PM2.5 Concentrations at Fine Temporal Resolutions Using a Multi-Branch Temporal Graph Convolutional Neural Network. Int. J. Geogr. Inf. Sci. 2024, 38, 778–801.
8. Wang, Z.; Hu, K.; Wang, Z.; Yang, B.; Chen, Z. Impact of Urban Neighborhood Morphology on PM2.5 Concentration Distribution at Different Scale Buffers. Land 2024, 14, 7.
9. Faridi, S.; Niazi, S.; Yousefian, F.; Azimi, F.; Pasalari, H.; Momeniha, F.; Mokammel, A.; Gholampour, A.; Hassanvand, M.S.; Naddafi, K. Spatial Homogeneity and Heterogeneity of Ambient Air Pollutants in Tehran. Sci. Total Environ. 2019, 697, 134123.
10. Mun, H.; Li, M.; Jung, J. Spatial-Temporal Characteristics and Influencing Factors of Particulate Matter: Geodetector Approach. Land 2022, 11, 2336.
11. Alharbi, B.H.; Alduwais, A.K.; Alhudhodi, A.H. An Analysis of the Spatial Distribution of O3 and Its Precursors during Summer in the Urban Atmosphere of Riyadh, Saudi Arabia. Atmos. Pollut. Res. 2017, 8, 861–872.
12. Hu, Y.; Li, Q.; Shi, X.; Yan, J.; Chen, Y. Domain Knowledge-Enhanced Multi-Spatial Multi-Temporal PM2.5 Forecasting with Integrated Monitoring and Reanalysis Data. Environ. Int. 2024, 192, 108997.
13. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat. Deep Learning and Process Understanding for Data-Driven Earth System Science. Nature 2019, 566, 195–204.
14. Abbasi, M.T.; Alesheikh, A.A.; Lotfata, A.; Azizi, Z. Hybrid Graph Convolutional Networks for Air Quality Prediction: A Systematic Review of Foundations, Challenges, and Opportunities. Int. J. Environ. Sci. Technol. 2025.
15. Chen, Q.; Ding, R.; Mo, X.; Li, H.; Xie, L.; Yang, J. An Adaptive Adjacency Matrix-Based Graph Convolutional Recurrent Network for Air Quality Prediction. Sci. Rep. 2024, 14, 4408.
16. Hu, W.; Zhang, Z.; Zhang, S.; Chen, C.; Yuan, J.; Yao, J.; Zhao, S.; Guo, L. Learning Spatiotemporal Dependencies Using Adaptive Hierarchical Graph Convolutional Neural Network for Air Quality Prediction. J. Clean. Prod. 2024, 459, 142541.
17. Zeng, Q.; Cao, Y.; Fan, M.; Chen, L.; Zhu, H.; Wang, L.; Li, Y.; Liu, S. Fine Particulate Matter Concentration Prediction Based on Hybrid Convolutional Network with Aggregated Local and Global Spatiotemporal Information: A Case Study in Beijing and Chongqing. Atmos. Environ. 2024, 333, 120647.
18. Liu, H.; Han, Q.; Lu, D.; Sheng, J.; Sui, S.; Sun, H. Fine-Grained Graph Convolutional Network with Learning-Based Bi-Relational Graph for Spatiotemporal Forecasting. Expert Syst. Appl. 2025, 265, 125959.
19. Zhao, Q.; Liu, J.; Yang, X.; Qi, H.; Lian, J. Spatiotemporal PM2.5 Forecasting via Dynamic Geographical Graph Neural Network. Environ. Model. Softw. 2025, 186, 106351.
20. Huang, Y.; Han, F.; Feng, Q. A Novel Model for Predicting PM2.5 Concentrations Utilizing Graph Convolutional Networks and Transformer. IEEE Access 2025.
21. Zeng, Q.; Zeng, H.; Fan, M.; Chen, L.; Tao, J.; Zhang, Y.; Zhu, H.; Liu, S.; Zhu, Y. Adaptive Graph-Generating Jump Network for Air Quality Prediction Based on Improved Graph Convolutional Network. Atmos. Pollut. Res. 2025, 16, 102488.
22. Wang, P.; Zhang, H.; Cheng, S.; Zhang, T.; Lu, F.; Wu, S. A Lightweight Spatiotemporal Graph Dilated Convolutional Network for Urban Sensor State Prediction. Sustain. Cities Soc. 2024, 101, 105105.
23. Zheng, Y.; Capra, L.; Wolfson, O.; Yang, H. Urban Computing: Concepts, Methodologies, and Applications. ACM Trans. Intell. Syst. Technol. 2014, 5, 1–55.
24. Van, N.H.; Van Thanh, P.; Tran, D.N.; Tran, D.-T. A New Model of Air Quality Prediction Using Lightweight Machine Learning. Int. J. Environ. Sci. Technol. 2023, 20, 2983–2994.
25. Abbasi, M.T.; Alesheikh, A.A.; Jafari, A.; Lotfata, A. Spatial and Temporal Patterns of Urban Air Pollution in Tehran with a Focus on PM2.5 and Associated Pollutants. Sci. Rep. 2024, 14, 25150.
26. Kalankesh, L.R.; Khajavian, N.; Soori, H.; Vaziri, M.H.; Saeedi, R.; Hajighasemkhan, A. Association Metrological Factors with Covid-19 Mortality in Tehran, Iran (2020-2021). Int. J. Environ. Health Res. 2024, 34, 1725–1736.
27. Taksibi, F.; Khajehpour, H.; Saboohi, Y. On the Environmental Effectiveness Analysis of Energy Policies: A Case Study of Air Pollution in the Megacity of Tehran. Sci. Total Environ. 2020, 705, 135824.
28. Zhu, J.; Ge, Z.; Song, Z.; Gao, F. Review and Big Data Perspectives on Robust Data Mining Approaches for Industrial Process Modeling with Outliers and Missing Data. Annu. Rev. Control 2018, 46, 107–133.
29. Singh, D.; Singh, B. Feature Wise Normalization: An Effective Way of Normalizing Data. Pattern Recognit. 2022, 122, 108307.
30. Jiang, W.; Luo, J. Graph Neural Network for Traffic Forecasting: A Survey. Expert Syst. Appl. 2022, 207, 117921.
31. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph Neural Networks: A Review of Methods and Applications. AI Open 2020, 1, 57–81.
32. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph Convolutional Networks: A Comprehensive Review. Comput. Soc. Networks 2019, 6, 1–23.
33. Wu, G.; Al-qaness, M.A.A.; Al-Alimi, D.; Dahou, A.; Abd Elaziz, M.; Ewees, A.A. Hyperspectral Image Classification Using Graph Convolutional Network: A Comprehensive Review. Expert Syst. Appl. 2024, 257, 125106.
34. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. arXiv 2013, arXiv:1312.6203.
35. Henaff, M.; Bruna, J.; LeCun, Y. Deep Convolutional Networks on Graph-Structured Data. arXiv 2015, arXiv:1506.05163.
36. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. Adv. Neural Inf. Process. Syst. 2016, 29.
37. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
38. Shin, J.; Yeon, K.; Kim, S.; Sunwoo, M.; Han, M. Comparative Study of Markov Chain with Recurrent Neural Network for Short Term Velocity Prediction Implemented on an Embedded System. IEEE Access 2021, 9, 24755–24767.
39. Hopfield, J.J. Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558.
40. Farmanifard, S.; Alesheikh, A.A.; Sharif, M. A Context-Aware Hybrid Deep Learning Model for the Prediction of Tropical Cyclone Trajectories. Expert Syst. Appl. 2023, 231, 120701.
41. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211.
42. Hamedi, H.; Alesheikh, A.A.; Panahi, M.; Lee, S. Landslide Susceptibility Mapping Using Deep Learning Models in Ardabil Province, Iran. Stoch. Environ. Res. Risk Assess. 2022, 36, 4287–4310.
43. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
44. Fan, J.; Li, R.; Zhao, M.; Pan, X. A BiLSTM-Based Hybrid Ensemble Approach for Forecasting Suspended Sediment Concentrations: Application to the Upper Yellow River. Land 2025, 14, 1199.
45. Hakim, W.L.; Nur, A.S.; Rezaie, F.; Panahi, M.; Lee, C.-W.; Lee, S. Convolutional Neural Network and Long Short-Term Memory Algorithms for Groundwater Potential Mapping in Anseong, South Korea. J. Hydrol. Reg. Stud. 2022, 39, 100990.
46. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
47. Faraji, M.; Nadi, S.; Ghaffarpasand, O.; Homayoni, S.; Downey, K. An Integrated 3D CNN-GRU Deep Learning Method for Short-Term Prediction of PM2.5 Concentration in Urban Environment. Sci. Total Environ. 2022, 834, 155324.
48. Li, M.; Yan, Y. Comparative Analysis of Machine-Learning Models for Soil Moisture Estimation Using High-Resolution Remote-Sensing Data. Land 2024, 13, 1331.
49. Xiong, B.; Tang, J.; Li, Y.; Zhou, P.; Zhang, S.; Zhang, X.; Dong, C.; Gooi, H.B. A Flow-Rate-Aware Data-Driven Model of Vanadium Redox Flow Battery Based on Gated Recurrent Unit Neural Network. J. Energy Storage 2023, 74, 109537.
50. Szramowiat-Sala, K.; Marczak-Grzesik, M.; Karczewski, M.; Kistler, M.; Giebl, A.K.; Styszko, K. Chemical Investigation of Polycyclic Aromatic Hydrocarbon Sources in an Urban Area with Complex Air Quality Challenges. Sci. Rep. 2025, 15, 6987.
51. Yang, L.; Wang, G.; Wang, Y.; Wang, Y.; Ma, Y.; Zhang, X. A Rapid Computational Method for Quantifying Inter-Regional Air Pollutant Transport Dynamics. Atmosphere 2025, 16, 163.
52. Joe, H. Dependence Modeling with Copulas; CRC Press: Boca Raton, FL, USA, 2014; ISBN 1466583223.
53. Lyu, M.-Z.; Fei, Z.-J.; Feng, D.-C. Copula-Based Cloud Analysis for Seismic Fragility and Its Application to Nuclear Power Plant Structures. Eng. Struct. 2024, 305, 117754.
54. Pan, S.; Joe, H. Predicting Times to Event Based on Vine Copula Models. Comput. Stat. Data Anal. 2022, 175, 107546.
55. Zhang, J.; Li, Y.; Liu, C.; Wu, B.; Shi, K. A Study of Cross-Correlations between PM2.5 and O3 Based on Copula and Multifractal Methods. Phys. A Stat. Mech. Its Appl. 2022, 589, 126651.
56. Zhang, Y. Dynamic Effect Analysis of Meteorological Conditions on Air Pollution: A Case Study from Beijing. Sci. Total Environ. 2019, 684, 178–185.
57. Qi, Y.; Li, Q.; Karimian, H.; Liu, D. A Hybrid Model for Spatiotemporal Forecasting of PM2.5 Based on Graph Convolutional Neural Network and Long Short-Term Memory. Sci. Total Environ. 2019, 664, 1–10.
58. Zhou, H.; Zhang, F.; Du, Z.; Liu, R. A Theory-Guided Graph Networks Based PM2.5 Forecasting Method. Environ. Pollut. 2022, 293, 118569.
59. Wang, H.; Zhang, L.; Wu, R.; Cen, Y. Spatio-Temporal Fusion of Meteorological Factors for Multi-Site PM2.5 Prediction: A Deep Learning and Time-Variant Graph Approach. Environ. Res. 2023, 239, 117286.
60. Pillai, P.S.; Babu, S.S.; Moorthy, K.K. A Study of PM, PM10 and PM2.5 Concentration at a Tropical Coastal Station. Atmos. Res. 2002, 61, 149–167.
61. Wang, P.; Guo, H.; Hu, J.; Kota, S.H.; Ying, Q.; Zhang, H. Responses of PM2.5 and O3 Concentrations to Changes of Meteorology and Emissions in China. Sci. Total Environ. 2019, 662, 297–306.
62. Chuang, M.-T.; Chou, C.C.-K.; Lin, C.-Y.; Lee, J.-H.; Lin, W.-C.; Chen, Y.-Y.; Chang, C.-C.; Lee, C.-T.; Kong, S.S.-K.; Lin, T.-H. A Numerical Study of Reducing the Concentration of O3 and PM2.5 Simultaneously in Taiwan. J. Environ. Manag. 2022, 318, 115614.
63. Nabavi, S.O.; Haimberger, L.; Abbasi, E. Assessing PM2.5 Concentrations in Tehran, Iran, from Space Using MAIAC, Deep Blue, and Dark Target AOD and Machine Learning Algorithms. Atmos. Pollut. Res. 2019, 10, 889–903.
64. Zamani Joharestani, M.; Cao, C.; Ni, X.; Bashir, B.; Talebiesfandarani, S. PM2.5 Prediction Based on Random Forest, XGBoost, and Deep Learning Using Multisource Remote Sensing Data. Atmosphere 2019, 10, 373.

Figure 1. (a) Study area; (b) distribution of AQMSs.

Figure 2. Architectures of (a) Elman RNN, (b) LSTM, and (c) GRU.

Figure 3. The framework of our proposed model for PM_2.5 prediction.

Figure 4. Cluster patterns of (a) NO₂, (b) PM₁₀, and (c) PM_2.5.

Table 1. Descriptive statistics of variables.

Variable	Unit	Range	Mean	St. Dev.
PM_2.5	μg/m³	[0.167, 249.724]	30.680	20.309
PM₁₀	μg/m³	[0.677, 697.977]	76.780	46.340
SO₂	ppb	[0.051, 142.400]	7.561	6.472
NO₂	ppb	[0.565, 301.055]	48.597	22.942
O₃	ppb	[0.0680, 213.035]	20.702	20.204
CO	ppm	[0.0075, 15.7300]	1.871	1.269
Temperature	°C	[−7.631, 40.888]	17.770	10.191
Pressure	mbar	[956.452, 1037.462]	1011.236	8.928
Humidity	%	[2.479, 99.147]	36.656	20.766
Dew point temperature	°C	[−26.819, 24.372]	0.061	5.308
Wind_x	m/s	[−11.653, 7.269]	−0.874	1.408
Wind_y	m/s	[−18.558, 9.383]	−0.188	1.936
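
Table 1 reports wind as two signed components, Wind_x and Wind_y, rather than as speed and direction. The sketch below shows one common way such components can be derived. It assumes the meteorological convention (direction in degrees clockwise from north, giving where the wind blows from), which may differ from the convention used in this study; the function name `wind_components` is hypothetical.

```python
import math


def wind_components(speed_ms, direction_deg):
    """Split wind speed (m/s) and direction (degrees) into x/y components.

    Assumption: direction is the meteorological "from" direction, measured
    clockwise from north, so a wind from the north (0 deg) blows southward.
    """
    theta = math.radians(direction_deg)
    wind_x = -speed_ms * math.sin(theta)  # eastward component
    wind_y = -speed_ms * math.cos(theta)  # northward component
    return wind_x, wind_y


# Illustrative input (not taken from the dataset): 5 m/s wind from the west.
wx, wy = wind_components(5.0, 270.0)
print(f"Wind_x = {wx:.3f} m/s, Wind_y = {wy:.3f} m/s")
```

Encoding wind this way avoids the discontinuity between 359° and 0° that a raw direction feature would carry.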

Table 2. Copula families and their dependence characteristics.

Copula Family	Tail Dependence	Symmetry	Type of Dependence Captured	Typical Use Case in Air Pollution
Clayton	Lower tail (strong)	Asymmetric	Captures stronger association in low extremes	Simultaneous decrease in pollutant concentrations
Gumbel	Upper tail (strong)	Asymmetric	Captures stronger association in high extremes	Joint increase in pollutants under severe pollution events
t-Student	Both tails (moderate/strong)	Symmetric	Models symmetric tail dependence	Extreme events with co-movements in both directions
Gaussian	None (only linear correlation)	Symmetric	Captures linear correlation but no tail dependence	Mild/moderate dependence under normal conditions
Frank	No tail dependence (moderate)	Symmetric	Captures moderate dependence across the whole range	Balanced and non-extreme pollutant interactions
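
The tail behaviour summarised in Table 2 can be made concrete with the standard closed-form expressions for the Clayton and Gumbel families. The sketch below is illustrative only: the parameter value `theta = 2.0` is an assumption chosen for the example, not a value fitted to the Tehran data, and the function names are hypothetical.

```python
def clayton_tau(theta):
    """Kendall's tau of a Clayton copula (theta > 0)."""
    return theta / (theta + 2.0)


def clayton_lower_tail(theta):
    """Lower-tail dependence coefficient of a Clayton copula (theta > 0)."""
    return 2.0 ** (-1.0 / theta)


def gumbel_tau(theta):
    """Kendall's tau of a Gumbel copula (theta >= 1)."""
    return 1.0 - 1.0 / theta


def gumbel_upper_tail(theta):
    """Upper-tail dependence coefficient of a Gumbel copula (theta >= 1)."""
    return 2.0 - 2.0 ** (1.0 / theta)


theta = 2.0  # illustrative parameter, not fitted to any data in this study
summary = {
    "Clayton": {"tau": clayton_tau(theta),
                "lambda_lower": clayton_lower_tail(theta),
                "lambda_upper": 0.0},
    "Gumbel": {"tau": gumbel_tau(theta),
               "lambda_lower": 0.0,
               "lambda_upper": gumbel_upper_tail(theta)},
}
for family, values in summary.items():
    print(family, {k: round(v, 3) for k, v in values.items()})
```

With this parameter both families have the same Kendall's τ, yet Clayton concentrates its dependence in the lower tail and Gumbel in the upper tail, which is the asymmetry the table describes.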

Table 3. Copula-based dependence measures between PM_2.5 and other pollutants.

Pollutant	Fitted Copula Model	$\tau$	$\lambda_{upper}$	$\lambda_{lower}$
O₃	Rotated Gumbel 90°	−0.262	0	0
CO	Frank	0.386	0	0
NO₂	t-Student	0.424	0.136	0.136
SO₂	t-Student	0.388	0.189	0.189
PM₁₀	t-Student	0.665	0.229	0.229
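
For the pollutants fitted with a t-Student copula, the Kendall's τ values in Table 3 can be translated into the copula's correlation parameter through the standard relation ρ = sin(πτ/2), which holds for elliptical copulas. The sketch below applies that relation to the τ values from the table. It is an illustration, not the estimation procedure used in this study, and the names `tau_to_rho` and `ranked` are hypothetical.

```python
import math

# Kendall's tau for the pollutants fitted with a t-Student copula (Table 3).
tau_by_pollutant = {"NO2": 0.424, "SO2": 0.388, "PM10": 0.665}


def tau_to_rho(tau):
    """Correlation parameter of an elliptical copula implied by Kendall's tau."""
    return math.sin(math.pi * tau / 2.0)


rho_by_pollutant = {name: tau_to_rho(tau) for name, tau in tau_by_pollutant.items()}

# Rank pollutants from strongest to weakest implied dependence on PM2.5.
ranked = sorted(rho_by_pollutant, key=rho_by_pollutant.get, reverse=True)
for name in ranked:
    print(f"{name}: tau = {tau_by_pollutant[name]:.3f}, rho = {rho_by_pollutant[name]:.3f}")
```

Because the mapping is monotonic, the ranking by ρ is the same as the ranking by τ; the conversion only changes the scale on which the dependence is expressed.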

Table 4. Results of models on the test dataset.

Model	Metric	+1 h	+2 h	+4 h	+8 h	+12 h	+24 h	+48 h	+72 h
GRU	IA	0.886	0.835	0.761	0.708	0.647	0.606	0.543	0.513
	R²	0.688	0.585	0.463	0.360	0.319	0.277	0.229	0.202
	MAE (μg/m³)	6.761	7.912	9.187	10.121	10.432	10.655	10.976	11.175
	RMSE (μg/m³)	10.208	11.780	13.438	14.709	15.170	15.647	16.194	16.550
LSTM	IA	0.868	0.812	0.737	0.662	0.629	0.578	0.527	0.504
	R²	-	-	-	-	-	-	-	-
	MAE (μg/m³)	6.911	8.008	9.186	10.072	10.337	10.652	10.935	11.044
	RMSE (μg/m³)	10.548	12.050	13.542	14.684	15.140	15.593	16.012	16.251
GRU with multi-head attention	IA	0.892	0.835	0.756	0.689	0.657	0.604	0.519	0.480
	R²	0.691	0.581	0.456	0.355	0.322	0.270	0.189	0.142
	MAE (μg/m³)	6.688	7.893	9.287	10.370	10.611	10.814	11.383	11.703
	RMSE (μg/m³)	10.121	11.788	13.478	14.722	15.128	15.705	16.614	17.171
LSTM with multi-head attention	IA	0.877	0.825	0.748	0.669	0.638	0.585	0.503	0.471
	R²	0.663	0.562	0.437	0.329	0.294	0.245	0.162	0.128
	MAE (μg/m³)	7.090	8.160	9.440	10.458	10.739	11.021	11.608	11.783
	RMSE (μg/m³)	10.569	12.052	13.697	14.983	15.389	15.922	16.847	17.255
CNN-GRU	IA	0.914	0.863	0.811	0.748	0.735	0.674	0.585	0.549
	R²	0.740	0.627	0.546	0.451	0.434	0.359	0.247	0.200
	MAE (μg/m³)	6.376	7.778	8.822	9.783	9.909	10.257	11.018	11.359
	RMSE (μg/m³)	9.838	11.783	13.045	14.313	14.421	14.854	15.966	16.440
Distance-based GCN-GRU	IA	0.921	0.876	0.811	0.766	0.743	0.720	0.614	0.592
	R²	0.748	0.644	0.554	0.466	0.431	0.386	0.239	0.224
	MAE (μg/m³)	6.185	7.396	8.476	9.258	9.593	9.880	10.992	11.089
	RMSE (μg/m³)	9.686	11.510	12.922	14.147	14.485	14.539	16.066	16.211
Wind-driven dynamic GAT-GRU	IA	0.935	0.892	0.843	0.786	0.752	0.715	0.567	0.546
	R²	0.802	0.717	0.632	0.551	0.506	0.457	0.284	0.241
	MAE (μg/m³)	5.613	6.763	7.856	8.848	9.184	9.471	10.796	10.972
	RMSE (μg/m³)	8.485	10.306	11.790	12.999	13.515	13.721	15.625	16.059
ClusLite-STGCN-GRU	IA	0.920	0.884	0.842	0.788	0.776	0.752	0.633	0.624
	R²	0.765	0.687	0.617	0.533	0.512	0.475	0.323	0.317
	MAE (μg/m³)	5.905	7.008	8.001	8.909	9.062	9.244	10.458	10.461
	RMSE (μg/m³)	9.373	10.797	11.981	13.212	13.400	13.450	15.148	15.210
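
The four criteria reported in Table 4 can be computed from paired observations and predictions as sketched below. This is a minimal illustration under stated assumptions: IA is taken to be Willmott's index of agreement and R² the coefficient of determination, which may differ in detail from the definitions used in this study. The function name `evaluate` and the sample values are hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return IA, R2, MAE and RMSE for two equal-length sequences."""
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    sse = sum(e * e for e in errors)                 # sum of squared errors
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    rmse = math.sqrt(sse / n)                        # root mean squared error

    # Coefficient of determination (assumed definition of R2).
    sst = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - sse / sst

    # Willmott's index of agreement (assumed definition of IA).
    potential = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2
        for o, p in zip(observed, predicted)
    )
    ia = 1.0 - sse / potential

    return {"IA": ia, "R2": r2, "MAE": mae, "RMSE": rmse}


# Made-up concentrations in ug/m3, for illustration only.
obs = [20.0, 30.0, 40.0, 50.0]
pred = [22.0, 28.0, 43.0, 47.0]
scores = evaluate(obs, pred)
print({name: round(value, 3) for name, value in scores.items()})
```

IA and R² are dimensionless and higher is better, whereas MAE and RMSE carry the unit of the target (μg/m³) and lower is better, which is why the table lists them side by side.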

Table 5. Comparison of computational complexity from different aspects among hybrid graph-based deep learning models.

Model	FBP (s)	FP (ms)	Total Epoch Time (s)	Inference Memory Allocated (MB)	FLOPs	Number of Parameters
Distance-based GCN-GRU	383.35	11.03	440.65	920.40	78,312,960	252,844
Wind-driven dynamic GAT-GRU	995.87	18.01	1089.40	4166.76	105,318,400	176,304
ClusLite-STGCN-GRU	337.27	3.66	373.41	720.09	16,549,904	257,288
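
The relative savings of ClusLite-STGCN-GRU can be derived directly from Table 5. The sketch below computes the percentage reduction against each graph-based baseline; the dictionary keys and the function name `reduction_percent` are hypothetical, and treating total epoch time as the training-time measure is an assumption.

```python
# Values copied from Table 5.
table5 = {
    "Distance-based GCN-GRU": {
        "epoch_s": 440.65, "memory_mb": 920.40, "flops": 78_312_960},
    "Wind-driven dynamic GAT-GRU": {
        "epoch_s": 1089.40, "memory_mb": 4166.76, "flops": 105_318_400},
    "ClusLite-STGCN-GRU": {
        "epoch_s": 373.41, "memory_mb": 720.09, "flops": 16_549_904},
}


def reduction_percent(baseline, proposed):
    """Percentage by which `proposed` is smaller than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


proposed = table5["ClusLite-STGCN-GRU"]
reductions = {
    name: {key: reduction_percent(row[key], proposed[key]) for key in row}
    for name, row in table5.items()
    if name != "ClusLite-STGCN-GRU"
}
for name, row in reductions.items():
    print(name, {key: round(value, 1) for key, value in row.items()})
```

Against the wind-driven dynamic GAT-GRU, the reductions come to roughly 66% in total epoch time, 83% in inference memory, and 84% in FLOPs, matching the figures quoted in the abstract; the savings relative to the distance-based GCN-GRU are smaller.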

Table 6. A comparison of the findings of our study with those of earlier research conducted in Tehran.

Authors	Publication Year	Study Period	Model	Evaluation Criteria
Nabavi et al. [63]	2019	2011–2016	Machine Learning (Random Forest)	RMSE = 17.52 μg/m³; MAE = not reported
Zamani Joharestani et al. [64]	2019	2015–2018	Machine Learning (XGBoost)	RMSE = 13.58 μg/m³; MAE = 9.93 μg/m³
Faraji et al. [47]	2022	2016–2019	Deep Learning (3D CNN-GRU)	RMSE = 15.21 μg/m³; MAE = 12.00 μg/m³
Ours	-	2019–2022	Deep Learning (ClusLite-STGCN-GRU)	RMSE = 13.45 μg/m³; MAE = 9.24 μg/m³

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


