Article

Air Pollutant Concentration Prediction Using a Generative Adversarial Network with Multi-Scale Convolutional Long Short-Term Memory and Enhanced U-Net

1 Jiangxi Engineering Technology Research Center of Nuclear Geoscience Data Science and System, East China University of Technology, Nanchang 330013, China
2 School of Artificial Intelligence and Information Engineering, East China University of Technology, Nanchang 330013, China
3 School of Software, East China University of Technology, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(24), 11177; https://doi.org/10.3390/su172411177
Submission received: 5 October 2025 / Revised: 21 November 2025 / Accepted: 11 December 2025 / Published: 13 December 2025
(This article belongs to the Special Issue Atmospheric Pollution and Microenvironmental Air Quality)

Abstract

Accurate prediction of air pollutant concentrations, particularly fine particulate matter (PM2.5), is essential for controlling and preventing heavy pollution incidents by providing early warnings of harmful substances in the atmosphere. This study proposes a novel spatiotemporal model for PM2.5 concentration prediction based on a Conditional Wasserstein Generative Adversarial Network with Gradient Penalty (CWGAN-GP). The framework incorporates three key design components: First, the generator employs an Inception-style Convolutional Long Short-Term Memory (ConvLSTM) network, integrating parallel multi-scale convolutions and hierarchical normalization. This design enhances multi-scale spatiotemporal feature extraction while effectively suppressing boundary artifacts via a map-masking layer. Second, the discriminator adopts an architecturally enhanced U-Net, incorporating spectral normalization and shallow instance normalization. Feature-guided masked skip connections are introduced, and the output is designed as a raw score map to mitigate premature saturation during training. Third, a composite loss function is utilized, combining adversarial loss, feature-matching loss, and inter-frame spatiotemporal smoothness. A sliding-window conditioning mechanism is also implemented, leveraging multi-level features from the discriminator for joint spatiotemporal optimization. Experiments conducted on multi-source gridded data from Dongguan demonstrate that the model achieves 12 h prediction performance with a Root Mean Square Error (RMSE) of 4.61 μg/m3, a Mean Absolute Error (MAE) of 6.42 μg/m3, and a Coefficient of Determination (R2) of 0.80. The model significantly alleviates performance degradation in long-term prediction: when the forecast horizon is extended from 3 to 12 h, the RMSE increases by only 1.84 μg/m3, and regional deviations remain within ±3 μg/m3.
These results indicate strong capabilities in spatial topology reconstruction and robustness against concentration anomalies, highlighting the model’s potential for hyperlocal air quality early warning. It should be noted that the empirical validation is limited to the specific environmental conditions of Dongguan, and the model’s generalizability to other geographical and climatic settings requires further investigation.

1. Introduction

Fine particulate matter (PM2.5) refers to particles in ambient air with an aerodynamic equivalent diameter of 2.5 μm or less [1]. Due to their small size, PM2.5 can remain suspended in the atmosphere for extended periods and travel long distances, making their concentration a key indicator for assessing air pollution levels [2]. Furthermore, PM2.5 readily adsorbs toxic and hazardous substances such as heavy metals and microorganisms, posing a serious threat to public health [3]. According to data released by the World Health Organization in 2023, approximately 92% of the global population lives in areas where PM2.5 concentrations exceed recommended levels [4]. This exposure leads to over seven million premature deaths annually from respiratory diseases and results in estimated annual economic losses of approximately US $5 trillion. Therefore, the accurate prediction of PM2.5 levels, the clear identification of air pollutant concentration characteristics, and the timely implementation of warnings and control measures are essential, not only for safeguarding public health but also for reducing economic losses [5,6]. PM2.5 concentration prediction can be seen as a classic spatiotemporal modeling problem that elucidates future spatiotemporal evolution patterns and the regional transmission mechanisms of pollutants based on historical pollution and meteorological data [7,8,9].
Methods for predicting air pollutant concentrations can be broadly classified into numerical simulation models, statistical models, traditional machine-learning algorithms, and deep-learning approaches. Numerical simulation models are grounded in atmospheric dynamics and pollutant dispersion theory and represent the transport, diffusion, and deposition of pollutants through coupled physical and mathematical equations [10,11,12]. These models, however, depend heavily on high-accuracy emission inventories, are computationally demanding, and are difficult to calibrate dynamically. Statistical models infer concentration dynamics from historical observations; common approaches include the autoregressive integrated moving average (ARIMA) model [13], multiple linear regression (MLR) [14,15], and geographically weighted regression [16]. Such methods have relatively low data requirements and are computationally efficient, making them well suited for short-term predictions. Nevertheless, they generally struggle to capture the nonlinear spatiotemporal interactions arising from meteorology–terrain coupling and often demand frequent empirical tuning. Traditional machine-learning methods include decision trees, random forests, support vector machines (SVM), and extreme gradient boosting (XGBoost). Mogollón-Sotelo et al. [17] developed an SVM with a Radial Basis Function (RBF) kernel to predict PM2.5 in the complex terrain of Bogotá, Colombia, achieving a correlation coefficient of 0.654, an index of agreement of 0.732, a root mean square error (RMSE) of 9.302 μg/m3, and a mean bias of 1.405 μg/m3. This result demonstrates the effectiveness of the method for short-term pollution prediction in tropical urban environments. Wei et al. [18] compared random forest, XGBoost, and fully connected neural networks for PM2.5 and O3 forecasting in the Beijing–Tianjin–Hebei region, and found that XGBoost achieved the best performance.
Overall, traditional machine-learning methods capture nonlinear relationships effectively and are robust to noisy data. However, they remain limited in dynamic spatiotemporal modeling and in capturing complex spatiotemporal dependencies.
In recent years, deep learning has demonstrated exceptional performance in spatiotemporal prediction tasks, owing to its powerful nonlinear modeling capabilities [19,20]. Numerous studies in air pollution forecasting confirm that deep neural architectures can effectively capture complex spatiotemporal patterns. For instance, Mu et al. [21] developed a Seasonal-Trend Loss Transformer (STL-Transformer) model that decomposes ozone time series into seasonal, trend, and residual components, enhancing the extraction of long-term dependencies and improving model interpretability. Xia et al. [22] proposed a multimodal deep learning model named Res-GCN, which integrates high-resolution remote sensing images with multi-station air quality time-series data; using residual and spatiotemporal graph convolutional networks, the model extracts features to predict future air quality. Su et al. [23] introduced a graph neural network (GNN)-based deep learning model to predict PM2.5 concentrations up to 48 h ahead in Taiwan, China; by combining GNN with gated recurrent units (GRU), the model effectively captures long-term spatiotemporal characteristics in air quality time-series data. Zhang et al. [24] constructed an RCL-Learning model by integrating convolutional long short-term memory (ConvLSTM) and residual neural networks (ResNet), which showed superior performance in predicting the spatiotemporal distribution characteristics of PM2.5 compared to traditional deep learning methods. Kalajdjieski et al. [25] proposed a generative adversarial network (GAN) combined with data augmentation that uses camera images and weather data for air pollution prediction and effectively handles imbalanced sample distributions, in which low-pollution samples far outnumber high-pollution cases. Yin et al. [26] integrated spatiotemporal modeling modules with a large language model (LLM), proposing the Spatio-Temporal LLM Generative Adversarial Network (STLLM-GAN); this framework incorporates adversarial training into the learning process, blending unsupervised and supervised learning, and by simultaneously optimizing adversarial loss and mean squared error (MSE) loss it significantly enhances training robustness and improves generalization. Although these hybrid deep learning models, Transformers, and GANs outperform traditional pollutant prediction methods and classical machine-learning algorithms in capturing spatiotemporal features and in long-term forecasting, they still have limitations. A key drawback is the insufficient ability to adaptively capture multi-scale coupling effects between local pollution events and regional diffusion processes. Furthermore, they struggle to effectively capture and maintain long-range spatiotemporal dependencies. In long-term forecasting of atmospheric pollutant concentrations, error accumulation becomes a critical issue: when the prediction horizon exceeds 3 h, spatial accuracy declines significantly due to accumulated errors, and evaluation metrics degrade rapidly as the prediction timestep increases, indicating clear memory decay.
To address the weak modeling of complex spatiotemporal correlations and the accuracy degradation in long-term fine-grained forecasting caused by error accumulation, this study proposes a CWGAN-GP-based spatiotemporal PM2.5 concentration prediction model. The generator is an Inception-style ConvLSTM network, and the discriminator is an architecturally enhanced U-Net. The model uses data from October 2021 to May 2022, including air pollutant monitoring data, meteorological observations, satellite remote sensing data, and mobile ground monitoring data from Dongguan. Gridded air pollutant concentration maps are generated via inverse distance weighting interpolation. The goal is to predict PM2.5 concentration grids for the next 12 h. The main contributions of this study are summarized as follows:
(1)
A novel generative adversarial architecture is proposed. The generator employs a multi-scale ConvLSTM integrated with a map masking layer, which enhances multi-scale spatiotemporal feature extraction while effectively suppressing boundary blurring in predictions. The discriminator utilizes an architecturally enhanced U-Net network. By incorporating spectral normalization and a raw score map output mechanism, it mitigates the premature convergence commonly encountered in traditional GAN during pixel-level training.
(2)
A joint optimization training mechanism is constructed. A composite loss function combining adversarial loss, feature matching loss, and spatiotemporal smoothness constraints is designed, significantly improving gradient propagation efficiency. Using historical pollutant grid sequences as sliding window conditions, the mechanism iteratively refines the generated sequences by leveraging the discriminator’s multi-level features, effectively alleviating memory degradation in long-term forecasting.
(3)
A comprehensive spatiotemporal PM2.5 concentration prediction framework is established. Based on CWGAN-GP, the framework enables refined spatiotemporal forecasting of atmospheric pollutant concentrations over long periods for target cities. Experimental results on real-world datasets demonstrate that the model achieves significantly higher accuracy in 12 h prediction tasks compared to various state-of-the-art deep learning models.

2. Study Area and Dataset Analysis

2.1. Study Area

Dongguan is situated in the central–southern part of Guangdong Province, on the eastern bank of the Pearl River Estuary. It lies within the alluvial plain of the lower Dongjiang River in the Pearl River Delta region (113°31′–114°15′ E, 22°39′–23°09′ N). The city covers a total land area of approximately 2460.38 square kilometers and reported a permanent population of 10.4853 million by the end of 2023. Dongguan experiences a subtropical monsoon climate, characterized by long summers, mild winters, abundant sunlight, and distinct wet and dry seasons. The local topography is generally higher in the southeast and lower in the northwest. This semi-enclosed basin, formed by the Yinping Mountain range and the Pearl River Estuary, hinders the effective dispersion of atmospheric pollutants. Consequently, pollutants tend to accumulate, often leading to the formation of regional pollution belts, particularly during autumn and winter. As a major national manufacturing base, Dongguan has complex air pollution sources. Primary contributors include industrial emissions (e.g., from electronics manufacturing and plastic processing), vehicle exhaust, and dust. These sources contribute to significant PM2.5 pollution, which presents considerable challenges for control. As of 2024, the Dongguan Ecological Environment Bureau operates a network of 36 ambient air quality monitoring stations across the city’s key areas (Figure 1), including locations in Huangjiang, Hongmei, Qishi, Houjie, Humen, Zhongtang, Wanjiang Jintai, Xiegang, Chashan, and Dalang.

2.2. Data Sources

This study utilizes historical data provided by the Dongguan Meteorological Bureau, covering the period from 1 October 2021 to 14 May 2022. The dataset consists of four main components: (1) PM2.5 concentration measurements from 36 monitoring stations; (2) Meteorological data observed at weather stations near the monitoring sites, including wind direction, wind speed, rainfall, and solar radiation density at ground level; (3) Himawari-8 satellite AHI imager data across visible, near-infrared, thermal infrared, and infrared bands, along with Aerosol Optical Depth (AOD) products derived using the New Dark Target (New-DT) algorithm; and (4) street-level PM2.5 concentration data obtained from mobile ground monitoring vehicles (Figure 1). AOD retrieval applies the New-DT algorithm:
$$\rho_{TOA}(\lambda) = \rho_{atm}(\lambda) + \frac{\rho_{surf}(\lambda)\, T(\theta_s, \theta_v)}{1 - S(\lambda)\, \rho_{surf}(\lambda)}$$
where $\rho_{TOA}(\lambda)$ denotes the top-of-atmosphere reflectance measured by the satellite at wavelength $\lambda$, $\rho_{surf}(\lambda)$ denotes surface reflectance, $T(\theta_s, \theta_v)$ represents the atmospheric transmittance function, which depends on the solar zenith angle $\theta_s$ and view zenith angle $\theta_v$, and $S(\lambda)$ signifies spherical albedo.
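As an illustrative sketch, the forward relation above can be evaluated directly; this is only the forward computation (not the full New-DT inversion for AOD), and all band values below are hypothetical:

```python
import numpy as np

def toa_reflectance(rho_atm, rho_surf, T, S):
    """Forward New-DT-style relation: top-of-atmosphere reflectance from
    atmospheric path reflectance rho_atm, surface reflectance rho_surf,
    transmittance T(theta_s, theta_v), and spherical albedo S."""
    return rho_atm + rho_surf * T / (1.0 - S * rho_surf)

# Hypothetical single-band values for illustration
rho = toa_reflectance(rho_atm=0.05, rho_surf=0.10, T=0.8, S=0.2)
```

In an actual retrieval this relation is inverted: the path reflectance term is varied over candidate aerosol loadings until the modeled TOA reflectance matches the satellite measurement.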

2.3. Data Preprocessing

2.3.1. Concentration Grid Generation

This study develops a spatiotemporally continuous PM2.5 concentration grid through a multi-stage data fusion process (Figure 2). The methodology integrates historical multi-source observations from four components. First, we employ a Bayesian model to generate initial concentration estimates. The model’s core formulation is expressed as:
$$P(\theta \mid X) \propto P(X \mid \theta)\, P(\theta)$$
Here, $\theta$ represents the spatial distribution parameters of pollutants, while $X$ denotes the multi-source observational data. The prior distribution $P(\theta)$ combines historical emission inventories with satellite AOD retrievals. The likelihood function $P(X \mid \theta)$ is constructed from the joint probability of ground monitoring measurements and satellite-derived results. The posterior distribution $P(\theta \mid X)$ yields the optimized initial concentration field.
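The Bayesian update above can be sketched on a discrete grid of candidate values; this is a minimal illustration, not the authors' implementation, and both the prior and likelihood shapes below are hypothetical Gaussians:

```python
import numpy as np

def posterior(prior, likelihood):
    """Discrete Bayes update: P(theta|X) is proportional to
    P(X|theta) * P(theta), normalized over the candidate grid."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

# Toy grid of candidate mean concentrations (ug/m3)
theta = np.linspace(0.0, 100.0, 101)
prior = np.exp(-0.5 * ((theta - 40.0) / 15.0) ** 2)      # emissions/AOD-informed prior
likelihood = np.exp(-0.5 * ((theta - 30.0) / 5.0) ** 2)  # ground-station evidence
post = posterior(prior, likelihood)
post_mean = float((theta * post).sum())
```

Because the likelihood is sharper than the prior, the posterior mean is pulled strongly toward the ground observations, which is the intended behavior of the fusion step.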
Subsequently, spatial interpolation is performed using the Inverse Distance Weighting (IDW) method:
$$Z(x_0) = \sum_{i=1}^{n} \frac{d(x_i, x_0)^{-p}}{\sum_{j=1}^{n} d(x_j, x_0)^{-p}}\, Z(x_i)$$
where $Z(x_0)$ is the estimated concentration at the target grid point, $Z(x_i)$ represents the observed concentration at the i-th monitoring station, and $d(x_i, x_0)$ is the spatial distance between them. The power parameter $p$ controls the interpolation behavior; based on empirical knowledge of atmospheric pollutants, $p$ typically ranges from 2.0 to 2.8. Through 5-fold cross-validation across nine candidate values, we selected $p = 2.5$ to balance interpolation smoothness and local feature preservation. The cross-validation procedure involved partitioning monitoring stations into five spatial subsets; iteratively using each subset as validation data; predicting validation concentrations using the remaining four subsets; calculating RMSE values; and finally selecting the $p$ value minimizing RMSE.
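A minimal sketch of the IDW estimate at a single grid point, assuming hypothetical station coordinates and concentrations (the actual pipeline applies this over every cell of the 50 × 75 grid):

```python
import numpy as np

def idw_interpolate(stations, values, target, p=2.5):
    """Inverse Distance Weighting estimate at one target grid point.

    stations: (n, 2) station coordinates (km)
    values:   (n,) observed PM2.5 concentrations (ug/m3)
    target:   (2,) coordinates of the grid point to estimate
    p:        power parameter (empirically 2.0-2.8 for pollutants)
    """
    d = np.linalg.norm(stations - target, axis=1)
    if np.any(d < 1e-9):                  # target coincides with a station
        return float(values[np.argmin(d)])
    w = d ** (-p)                         # inverse-distance weights
    return float(np.sum(w * values) / np.sum(w))

# Toy example: three hypothetical stations equidistant from the target
stations = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
values = np.array([20.0, 30.0, 40.0])
est = idw_interpolate(stations, values, np.array([1.0, 1.0]), p=2.5)
```

With equal distances the weights cancel and the estimate reduces to the plain mean; larger $p$ concentrates weight on the nearest station, which is the smoothness/locality trade-off the cross-validation tunes.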
Figure 2. A Bayesian–Kalman integration approach for PM2.5 mapping from multi-source observations.
For dynamic updating, we implement a Kalman filter:
$$X_k = F_k X_{k-1} + K_k \left( Y_k - H_k F_k X_{k-1} \right)$$
The state transition matrix $F_k$ is driven by pollutant dispersion processes simulated by the WRF-Chem model. $Y_k$ represents observational data, $H_k$ is the observation operator, and $K_k$ denotes the Kalman gain, which optimally adjusts the concentration field's evolution trajectory.
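The update equation can be sketched on a toy two-cell concentration field; this is only the correction step with a fixed illustrative gain (in the real system $F_k$ comes from WRF-Chem and $K_k$ is computed from the error covariances), and all numbers below are hypothetical:

```python
import numpy as np

def kalman_update(x_prev, F, H, y, K):
    """One fusion step: advect the prior state with F, then correct it
    toward the observation y using gain K, i.e.
    X_k = F X_{k-1} + K (Y_k - H F X_{k-1})."""
    x_pred = F @ x_prev               # dispersion-driven forecast
    innovation = y - H @ x_pred       # observation minus forecast
    return x_pred + K @ innovation

F = np.array([[0.9, 0.1], [0.1, 0.9]])   # simple transport/mixing matrix
H = np.eye(2)                            # stations observe both cells
K = 0.5 * np.eye(2)                      # fixed gain for illustration
x_prev = np.array([10.0, 30.0])          # prior concentrations (ug/m3)
y = np.array([14.0, 26.0])               # new observations
x_new = kalman_update(x_prev, F, H, y, K)
```

The forecast step mixes the two cells to [12, 28]; the correction then moves the state halfway toward the observations.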
The final output comprises spatiotemporally continuous concentration grids. Each grid image has dimensions of 50 × 75 pixels with 1 km × 1 km resolution. Pixel values represent near-surface pollutant concentrations within corresponding grid cells. This integrated Bayesian–Kalman fusion system provides high-quality spatiotemporal data inputs suitable for subsequent predictive modeling applications.

2.3.2. Concentration Grid Processing

Prior to model construction, PM2.5 concentration grids undergo comprehensive preprocessing: Administrative boundaries of Dongguan City guide geographic masking to exclude sea areas, river confluence zones, waterways, and Lion Sea estuaries—regions characterized by rapid pollutant dispersion due to strong hydrodynamic and wind forces, resulting in unreliable low-concentration distributions. Subsequent dynamic range adjustment normalizes concentration values, eliminating outliers to achieve normal distribution for noise reduction and accelerated convergence. Following Air Quality Index (AQI) concentration thresholds, values exceeding 500 µg/m3 are discarded while values ≤ 0 are replaced via anisotropic diffusion-guided bilinear interpolation. Final standardization enforces zero-mean and unit-variance normalization across all processed grids.

3. Methodology

3.1. Framework Overview

This study develops a CWGAN-GP framework in which the generator employs an Inception-style ConvLSTM network and the discriminator is based on an architecturally enhanced U-Net (ICLU-CWGAN). Through adversarial training, the framework progressively drives the generated PM2.5 concentration grid distribution to approximate the real distribution [27,28]. The model training workflow comprises three main steps: (1) Generator forward propagation: Historical PM2.5 concentration grid sequences are input into the Inception-style ConvLSTM to extract spatiotemporal features across local, regional, and macroscopic scales step-by-step, yielding predicted concentration grid sequences. (2) Discriminator error calculation: The generated concentration distribution maps from the Inception-style ConvLSTM and the corresponding real observational data at the same timestep are fed into the U-Net network to evaluate the spatial authenticity of the generated data. (3) Adversarial optimization and parameter update: Errors from both the discriminator and generator undergo backpropagation, updating parameters. This establishes an iterative adversarial competition between the generator and discriminator, compelling the generator to produce PM2.5 concentration grid sequences that the discriminator cannot reliably distinguish from real data (Figure 3).

3.2. CWGAN-GP

To address the challenges of boundary blurring and insufficient pixel-level error backpropagation in PM2.5 concentration grid generation, this study proposes two key optimizations: (1) Embedding an identical map-masking layer in the generator as used in the discriminator; (2) Augmenting the traditional adversarial loss with a feature matching loss and a spatiotemporal smoothing term (i.e., inter-frame consistency loss derived from generated PM2.5 concentration grids). The formulas for the feature matching loss $L_{fm}$ and spatiotemporal smoothing term $L_{temp}$ are as follows:
$$L_{fm} = \mathbb{E}_{x,z}\left[ \sum_{l=1}^{L} \lambda_l \left\| f_l(x) - f_l(G(z)) \right\|_1 \right]$$

$$L_{temp} = \mathbb{E}_{z \sim p_z}\left[ \sum_{t=1}^{T-1} \left\| G(z)_{t+1} - G(z)_t \right\|_1 \right]$$
where $f_l(x)$ denotes the feature map at the l-th layer of the Critic (U-Net) when fed with real data $x$, $f_l(G(z))$ represents the corresponding feature map when fed with the generator output $G(z)$, $\lambda_l$ is the layer-wise weighting coefficient, $\|\cdot\|_1$ represents the L1 norm, and $G(z)_t$ is the output of the generator at the t-th timestep. The Critic loss and Generator loss can be calculated using the following equations:
$$L_{Critic} = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right] - \mathbb{E}_{x \sim P_r}\left[ D(x) \right] + \lambda_{gp}\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \left\| \nabla_{\hat{x}} D(\hat{x}) \right\|_2 - 1 \right)^2 \right]$$

$$L_{Generator} = -\mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right] + \lambda_{fm} L_{fm} + \lambda_{temp} L_{temp}$$
In the equations, $\mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})]$ and $\mathbb{E}_{x \sim P_r}[D(x)]$ respectively denote the expected Critic scores over the generated data distribution $P_g$ and real data distribution $P_r$, where their difference estimates the Wasserstein distance between these distributions. The hyperparameters include $\lambda_{gp}$ as the gradient penalty coefficient, $\lambda_{fm}$ as the weight coefficient for the feature matching loss, and $\lambda_{temp}$ as the weight coefficient for the spatiotemporal smoothing term. The gradient penalty term $\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$ involves the Critic's input gradient $\nabla_{\hat{x}} D(\hat{x})$ evaluated at interpolated samples $\hat{x}$, constructed through random linear interpolation $\hat{x} = \varepsilon x + (1 - \varepsilon)\tilde{x}$ with $\varepsilon \sim U(0, 1)$, where $x \sim P_r$ and $\tilde{x} \sim P_g$ represent real and generated data samples, respectively.
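The two auxiliary terms can be sketched directly (the adversarial and gradient-penalty terms need an autodiff framework and are omitted here); this is an illustrative NumPy version with toy feature maps and layer weights, not the authors' implementation:

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake, lambdas):
    """L_fm: weighted mean-L1 distance between Critic feature maps of
    real and generated data, summed over layers l = 1..L."""
    return sum(lam * np.abs(fr - ff).mean()
               for lam, fr, ff in zip(lambdas, feats_real, feats_fake))

def temporal_smoothness_loss(frames):
    """L_temp: mean L1 difference between consecutive generated frames;
    frames has shape (T, H, W)."""
    return np.abs(frames[1:] - frames[:-1]).mean()

# Toy check: identical features and a static sequence give zero loss
rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 4)), rng.normal(size=(2, 2))]
l_fm_zero = feature_matching_loss(feats, feats, lambdas=[1.0, 0.5])
l_temp_zero = temporal_smoothness_loss(np.ones((3, 5, 5)))
```

In training, both terms are added to the negated Critic score to form the generator objective, so gradients flow through intermediate Critic features as well as the final score map.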

3.3. Inception-Style ConvLSTM

To effectively capture the spatial dependencies in PM2.5 concentration spatiotemporal sequences, this study proposes an Inception-Style ConvLSTM network that integrates Inception-based multi-scale feature extraction with ConvLSTM spatiotemporal modeling. The architecture adopts a hierarchical design comprising multiple stacked units, each incorporating multi-scale convolutional algorithms (Figure 4).

3.3.1. Multi-Scale Feature Extraction and Fusion

The network initiates hierarchical feature extraction by applying multi-scale convolutional operations to grid-based PM2.5 input data. To address the differential diffusion characteristics of PM2.5 concentrations across temporal domains, we implement a stratified processing scheme: (a) For immediate prediction intervals, 1 × 1 convolutions with Batch Normalization capture fine-grained local structures; (b) For medium-term intervals, 3 × 3 convolutions with Group Normalization extract regional patterns; (c) For distant temporal ranges, 5 × 5 convolutions with Instance Normalization characterize macro-scale distributions (Figure 4). Residual connections are integrated into each processing pathway to enhance spatiotemporal dependency extraction, optimize information flow, and mitigate vanishing gradients while promoting multi-scale feature fusion. The fundamental convolutional operation $\mathrm{Conv}_k(A)$ is formalized as:
$$\mathrm{Conv}_k(A) = W_k^{k \times k} * A + b_k$$
where $k$ denotes the convolutional kernel size (specifically $k \in \{1, 3, 5\}$, corresponding to near-, mid-, and long-term temporal horizons), $W_k$ represents the learnable weight tensor of dimensionality $(C_{out}, C_{in}, k, k)$, the operator $*$ signifies the spatial convolution operation, $b_k$ is the bias vector of dimension $C_{out}$, and $A$ constitutes the input feature map with dimensionality $(H, W, C_{in})$.

3.3.2. Gated Spatiotemporal Modeling

The gating mechanism in ConvLSTM precisely regulates information flow. The input gate controls the retention of new features via a sigmoid-activated coefficient. The forget gate determines which historical information to remove using a learned weighting factor. The output gate dynamically modulates how the current cell state contributes to the hidden state. Together, these gates enable effective cell state updates and support multi-scale spatiotemporal feature propagation. To handle the non-negative nature of pollutant concentrations and the small magnitude of normalized inputs, we use Parametric ReLU (PReLU) to activate the candidate cell state (Figure 5). PReLU extends ReLU by introducing a trainable parameter $\alpha$: when the input is negative, PReLU outputs $\alpha x$, preserving a small gradient. All gates and states are updated as follows:
$$f_t = \sigma\left( \sum_{k \in \{1,3,5\}} \left[ \mathrm{Conv}_k(x_t) + \mathrm{Conv}_k(h_{t-1}) \right] \right)$$

$$i_t = \sigma\left( \sum_{k \in \{1,3,5\}} \left[ \mathrm{Conv}_k(x_t) + \mathrm{Conv}_k(h_{t-1}) \right] \right)$$

$$o_t = \sigma\left( \sum_{k \in \{1,3,5\}} \left[ \mathrm{Conv}_k(x_t) + \mathrm{Conv}_k(h_{t-1}) \right] \right)$$

$$\tilde{c}_t = \mathrm{PReLU}\left( \sum_{k \in \{1,3,5\}} \left[ \mathrm{Conv}_k(x_t) + \mathrm{Conv}_k(h_{t-1}) \right] \right)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

$$h_t = o_t \odot \mathrm{ReLU}(c_t)$$
where $f_t$ is the output of the forget gate, $i_t$ is the output of the input gate, $o_t$ is the output of the output gate, $\tilde{c}_t$ is the candidate cell state, $c_t$ is the cell state at the current time step $t$, $\sigma$ represents the sigmoid activation function, $h_t$ and $h_{t-1}$ respectively denote the hidden states at time steps $t$ and $t-1$, $c_{t-1}$ indicates the prior cell state at time step $t-1$, $x_t$ constitutes the input data tensor at time step $t$, and $\odot$ designates the Hadamard product.
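A single gated update step can be sketched as follows. This is a drastically simplified single-channel NumPy illustration with tiny random kernels; for brevity each gate reuses one kernel per scale for both the input and hidden paths, which differs from a trained multi-channel model:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive same-padding 2D convolution (single channel, odd kernel)
    standing in for Conv_k."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prelu(z, alpha=0.1):
    return np.where(z >= 0, z, alpha * z)

def convlstm_step(x_t, h_prev, c_prev, kernels):
    """One multi-scale gated update: each gate sums Conv_k over
    k in {1, 3, 5} applied to the input and the previous hidden state."""
    def gate(name, act):
        s = sum(conv2d_same(x_t, kernels[name][k]) +
                conv2d_same(h_prev, kernels[name][k]) for k in (1, 3, 5))
        return act(s)
    f, i, o = gate("f", sigmoid), gate("i", sigmoid), gate("o", sigmoid)
    c_tilde = gate("c", prelu)
    c_t = f * c_prev + i * c_tilde          # Hadamard-product state update
    h_t = o * np.maximum(c_t, 0.0)          # h_t = o_t (elementwise) ReLU(c_t)
    return h_t, c_t

# Toy 6x6 grid, hypothetical kernels
rng = np.random.default_rng(1)
kernels = {g: {k: 0.1 * rng.normal(size=(k, k)) for k in (1, 3, 5)}
           for g in ("f", "i", "o", "c")}
x = rng.normal(size=(6, 6))
h, c = convlstm_step(x, np.zeros((6, 6)), np.zeros((6, 6)), kernels)
```

Because the hidden state is produced as a sigmoid gate times ReLU of the cell state, it is always non-negative, matching the non-negative nature of concentration fields.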

3.4. U-Net with Architectural Enhancements

U-Net serves as the discriminator, which takes a set of real and generated images as input and is tasked with outputting a probability value for each pixel [29]. However, the original U-Net architecture leads to premature convergence of the CWGAN-GP training to a steady state in pixel-level tasks, causing parameter updates to stagnate. To address this limitation, an enhanced architecture is proposed, as shown in Figure 6.
(1)
Spectral normalization and activation removal: Apply spectral normalization (SN) to all convolutional kernel weights and remove all activation functions within layers to enforce Lipschitz continuity throughout the discriminator. The normalized weights are computed as:
$$W_{SN} = \frac{W}{\sigma(W)}$$
where $\sigma(W)$ denotes the spectral norm (largest singular value) of weight matrix $W$, approximated via power iteration; $W_{SN}$ represents the spectrally normalized weights; and $W$ corresponds to the original convolutional kernel weights.
(2)
Dual-attention guidance mechanism based on map masking: Given that valid information in grid map datasets is only present within target regions, this study introduces a spatial attention mechanism constrained by map masks into the feature maps at various network layers, thereby enhancing the pixel-level feature extraction accuracy of both skip connections and convolutional layers. Additionally, a feature guidance mechanism is deployed along the skip connection paths, which dynamically adjusts the fusion ratio between encoder and decoder features through channel attention weights, effectively suppressing the transmission of redundant background information. The channel attention weight $\alpha$ and the adaptively fused output feature map $F_{out}$ are defined by the following equations, respectively:
$$\alpha = \sigma\left( W_2\, \mathrm{ReLU}\left( W_1 \left( M \odot [F_{enc}; F_{dec}] \right) \right) \right)$$

$$F_{out} = \alpha \odot F_{enc} + (1 - \alpha) \odot F_{dec}$$
where $M \in \{0, 1\}^{H \times W_d}$ is the binary mask matrix (1 for target regions, 0 otherwise) with spatial dimensions $H$ (height) and $W_d$ (width) matching the current feature map; $W_1$ and $W_2$ are 1 × 1 convolutional weight matrices; $\odot$ denotes the Hadamard product; $[F_{enc}; F_{dec}]$ represents channel-wise concatenation of encoder and decoder features; and $\sigma$ is the sigmoid activation function.
(3)
Instance normalization in shallow encoder layers: Instance Normalization (IN) layers are incorporated into shallow encoder blocks. The normalized output features are calculated as:
$$y_{tijk} = \gamma_i \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \varepsilon}} + \beta_i$$

$$\mu_{ti} = \frac{1}{H W_d} \sum_{j=1}^{W_d} \sum_{k=1}^{H} x_{tijk}$$

$$\sigma_{ti}^2 = \frac{1}{H W_d} \sum_{j=1}^{W_d} \sum_{k=1}^{H} \left( x_{tijk} - \mu_{ti} \right)^2$$
where $x_{tijk}$ is the feature value at spatial position $(j, k)$ of channel $i$ for the t-th sample; $H$ and $W_d$ are the height and width of the feature map; $\mu_{ti}$ and $\sigma_{ti}^2$ are the mean and variance for channel $i$; $\gamma_i$ is the learnable scale parameter for channel $i$; $\beta_i$ is the learnable shift parameter for channel $i$; and $\varepsilon$ is a numerical stability constant preventing division by zero.
(4)
Linear raw score map output: A linear convolutional transformation is applied to the final feature map to directly output a raw score map instead of probability values. This avoids discriminator saturation, preserves accurate expression of real/generated data discrepancies, and ensures continuous gradient signals for the generator:
$$D(x) = W_{final} * F_{f\_out} + b_{final}$$
where $W_{final} \in \mathbb{R}^{1 \times 1 \times C \times 1}$ is the convolutional kernel, $*$ denotes the convolution operation, $F_{f\_out} \in \mathbb{R}^{H \times W_d \times C}$ is the final U-Net feature map ($H$, $W_d$, and $C$ being height, width, and channel dimensions), and $b_{final}$ is the bias term.
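The spectral normalization of step (1) can be sketched as follows; this is an illustrative power-iteration routine on a small matrix (for a convolutional kernel, the weight tensor is typically reshaped to a 2D matrix of shape $(C_{out}, C_{in} k^2)$ first), not the authors' implementation:

```python
import numpy as np

def spectral_normalize(W, n_iters=50):
    """Estimate sigma(W), the largest singular value, by power
    iteration, and return (W / sigma(W), sigma)."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # Rayleigh-quotient estimate of sigma(W)
    return W / sigma, sigma

# Toy 2x2 weight matrix with singular values sqrt(45) and sqrt(5)
W = np.array([[3.0, 0.0], [4.0, 5.0]])
W_sn, sigma = spectral_normalize(W)
```

After normalization the largest singular value of the weight is 1, so each layer is 1-Lipschitz, which is the property the Critic needs for a stable Wasserstein estimate. Frameworks usually amortize this with a single power-iteration step per training update rather than iterating to convergence.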

3.5. Loss Function

The loss function employs a hybrid Mean Squared Error (MSE) and Mean Absolute Error (MAE) loss to enhance model robustness while maintaining computational efficiency. The formulation is given by Equation (19):
$$\mathrm{MixLoss} = \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ a \left| y_{i,j,t} - \hat{y}_{i,j,t} \right| + b \left( y_{i,j,t} - \hat{y}_{i,j,t} \right)^2 \right]$$
where $a$ and $b$ denote the weighting coefficients for MAE and MSE, respectively ($a = 0.5$, $b = 0.5$); $y_{i,j,t}$ represents the observed value at timestep $t$ and spatial position $(i, j)$; $\hat{y}_{i,j,t}$ is the corresponding predicted value; $m$ and $n$ indicate the spatial grid dimensions (rows and columns); $T$ denotes the total number of timesteps; and $N$ represents the total number of test samples, calculated as $N = m \times n \times T$.
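The hybrid loss reduces to a one-line mean over all grid cells and timesteps; a minimal sketch with the paper's $a = b = 0.5$ and toy arrays:

```python
import numpy as np

def mix_loss(y_true, y_pred, a=0.5, b=0.5):
    """Hybrid MAE/MSE loss averaged over all grid cells and timesteps;
    y arrays have shape (T, m, n), so N = T * m * n is implicit in mean."""
    err = y_true - y_pred
    return float(np.mean(a * np.abs(err) + b * err ** 2))

# Toy grids: a constant error of 2 gives 0.5*|2| + 0.5*2^2 = 3.0
y_true = np.zeros((2, 2, 2))
y_pred = np.full((2, 2, 2), 2.0)
loss = mix_loss(y_true, y_pred)
```

The L1 term keeps gradients bounded for outlier cells while the L2 term sharpens convergence near zero error, which is the robustness/efficiency trade-off the section describes.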

3.6. Metrics

The ICLU-CWGAN model presented in this study was compared with other prediction models on the same dataset. Root mean square error (RMSE), mean absolute error (MAE), and Coefficient of Determination (R2) were used as metrics to confirm the effectiveness of the proposed method. Experimental metrics were calculated by the following formulas:
$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( y_{i,j,t} - \hat{y}_{i,j,t} \right)^2 }$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| y_{i,j,t} - \hat{y}_{i,j,t} \right|$$

$$R^2 = 1 - \frac{ \sum_{t=1}^{T} \left( y_t - \hat{y}_t \right)^2 }{ \sum_{t=1}^{T} \left( y_t - \bar{y} \right)^2 }$$
where $y_{i,j,t}$ denotes the observed value at time step $t$ and spatial position $(i, j)$, $\hat{y}_{i,j,t}$ represents the corresponding predicted value, $N$ indicates the total number of test samples, $T$ signifies the total number of time steps in the test set, $m$ and $n$ specify the grid dimensions (rows and columns) with $N = m \times n \times T$, $y_t$ is the observed grid mean at time step $t$, $\hat{y}_t$ denotes the predicted grid mean at time step $t$, and $\bar{y}$ represents the overall average of observed grid means.
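A minimal sketch of the three metrics for gridded sequences, assuming arrays of shape (T, m, n) and following the definitions above (RMSE and MAE over all cells, R2 over the grid-mean series); the test data below are synthetic:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_grid_mean(y_true, y_pred):
    """R^2 on grid-mean series: average each (m, n) frame to one value
    per timestep, then compare residuals against the overall mean."""
    yt = y_true.mean(axis=(1, 2))
    yp = y_pred.mean(axis=(1, 2))
    ss_res = np.sum((yt - yp) ** 2)
    ss_tot = np.sum((yt - yt.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic sequence with a rising trend plus small prediction noise
rng = np.random.default_rng(4)
y_true = rng.normal(size=(10, 5, 5)) + np.linspace(10, 20, 10)[:, None, None]
y_pred = y_true + 0.1 * rng.normal(size=(10, 5, 5))
```

Note that RMSE is always at least as large as MAE on the same residuals, a useful sanity check when tabulating results.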

4. Results

4.1. Parameter Setting

In this study, the dataset was partitioned into 70% for training, 10% for validation, and 20% for testing to maintain independence across model training, hyperparameter tuning, and evaluation phases. During training, an early stopping strategy terminated the process when validation loss showed no improvement for 10 consecutive epochs, preventing overfitting. Network weights were initialized using the Xavier method to ensure gradient stability during forward/backward propagation, while model optimization employed the Adam algorithm with an initial learning rate of 0.005 for adaptive parameter updates. Dropout regularization (rate = 0.2) was applied to randomly deactivate neurons, enhancing generalization capability. For prediction tasks, two sliding window configurations were implemented: (1) 10 h (hour) windows comprising 7 h input sequences to generate 3 h forecasts (7–3 h task), and (2) 20 h windows comprising 8 h input sequences to produce 12 h predictions (8–12 h task). Post-experiment validation parameters are documented in Table 1.
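The two sliding-window configurations can be sketched as a simple slicing routine over an hourly grid sequence; the 24-frame toy sequence below is hypothetical:

```python
import numpy as np

def make_windows(frames, n_in, n_out):
    """Slice a (T, H, W) grid sequence into (input, target) pairs with
    a stride-1 sliding window: n_in history frames predict the next
    n_out frames."""
    X, Y = [], []
    total = n_in + n_out
    for s in range(len(frames) - total + 1):
        X.append(frames[s:s + n_in])
        Y.append(frames[s + n_in:s + total])
    return np.array(X), np.array(Y)

# 24 hourly toy frames of a 1x1 "grid"
frames = np.arange(24).reshape(24, 1, 1).astype(float)
X3, Y3 = make_windows(frames, n_in=7, n_out=3)     # 7 -> 3 h task (10 h window)
X12, Y12 = make_windows(frames, n_in=8, n_out=12)  # 8 -> 12 h task (20 h window)
```

A 24-frame sequence yields 15 windows for the 7–3 h task and 5 windows for the 8–12 h task, since each configuration consumes n_in + n_out consecutive frames.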

4.2. Multi-Timescale Prediction

Table 2 and Table 3 comprehensively evaluate the multi-timescale PM2.5 prediction performance of our ICLU-CWGAN model against ConvLSTM, Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM), Convolutional Gated Recurrent Unit (ConvGRU), Spatio-Temporal Attention Residual Convolutional Neural Network (STA-ResCNN), Spatio-Temporal Transformer (ST-Transformer), and Adaptive Graph Convolutional and Temporal Convolutional Network (AGCTCN) across the full test set. For short-term forecasting (3 h prediction from 7 h historical concentration grids), Table 2 demonstrates ICLU-CWGAN's superiority with an RMSE of 2.77 μg/m3, a 40.93% reduction versus ConvLSTM, and a 14.10% R2 improvement over CNN-LSTM. For long-term forecasting (12 h prediction from 8 h inputs), Table 3 shows that ICLU-CWGAN achieves the best metrics (RMSE = 4.61 μg/m3, MAE = 6.42 μg/m3, R2 = 0.80), outperforming all comparators (ConvLSTM, CNN-LSTM, ConvGRU). Crucially, the model exhibits attenuated temporal decay: while the comparison models suffered RMSE degradation of at least 4.23 μg/m3 when extending from 3 h to 12 h prediction (e.g., ConvLSTM's 4.23 μg/m3 increase), ICLU-CWGAN's RMSE rose by only 1.84 μg/m3, empirically validating its mitigation of long-term memory degradation.

4.3. Model-Centric Performance Validation

Figure 7 evaluates the predictive performance of different models for the 8–12 h task on the test set, plotting mean predicted PM2.5 grid concentrations (y-axis) against mean observed values (x-axis). The black line indicates the y = x reference, and the black dots denote deviation magnitudes between observed and predicted grid averages. Dispersion analysis reveals that (1) all models exhibit increased dispersion when PM2.5 concentrations exceed 18 μg/m3, with ConvLSTM showing the highest dispersion and ICLU-CWGAN the lowest; and (2) within 0–20 μg/m3, ICLU-CWGAN maintains superior dispersion control. Observed and predicted grid averages agree closely for ICLU-CWGAN. Correlation coefficients across the full test set confirm this superiority: ConvLSTM (0.41), CNN-LSTM (0.66), ConvGRU (0.63), and ICLU-CWGAN (0.80), the last indicating the strongest correlation between observed and predicted grid-averaged concentrations.
Figure 8 shows the generalization capability of the different models on the same test set for the 8–12 h task. The x-axis, spanning 565 h, represents a randomly selected continuous 565 h segment from the test set used to evaluate predictive performance over this period. This validation method, adopted from prior studies [32,33], primarily serves to visualize prediction quality, highlighting the models' prediction performance and fitting capability. The blue curve represents the mean of the observed values, and the red curve the mean of the predicted values. Figure 8 presents results for four representative forecasting models, showing the fitting trends of ConvLSTM, CNN-LSTM, ConvGRU, and ICLU-CWGAN over this segment. As demonstrated in Figure 8, the ICLU-CWGAN model maintains good forecast accuracy over the extended 565 h horizon, exhibiting minimal deviation between the mean predicted and mean observed values. In particular, during most periods the predicted mean curve closely follows the envelope of the observed mean curve, further validating the model's superiority.

4.4. Scenario-Oriented Robustness Evaluation

To comprehensively validate the robustness of the proposed model in real-world scenarios of Dongguan city, we conducted multi-scale analyses through temporal trend forecasting, spatial grid predictions, and station-level concentration outputs. Figure 9 compares 12 h ahead PM2.5 prediction means against observational means (test-set samples randomly selected across temporal phases). The ICLU-CWGAN prediction curve (blue) exhibits strong concordance with observational trends (red), maintaining precise tracking during both stationary and volatile periods. This temporal consistency confirms the model’s adaptability to dynamic PM2.5 diffusion processes over extended durations.
Figure 10 demonstrates spatial topology reconstruction using predicted PM2.5 concentration grids for Dongguan. Figure 10a,b present two randomly selected temporal windows, each displaying (from left to right): the observed grid, the predicted grid from the ICLU-CWGAN model, and the predicted grid from the ConvLSTM model. Comparative analysis reveals the following: (1) At Timestep T1, the predicted grid from the ICLU-CWGAN model fully preserves the elliptical topology of the PM2.5 core zone. Concentration gradient variations within dashed-circle areas match the observed grid, and local concentration extrema are accurately captured. In contrast, the predicted grid from the ConvLSTM model reconstructs only low-concentration gradient features. (2) At Timestep T2, the predicted grid from the ICLU-CWGAN model maintains the continuity of the high-concentration hollow annular belt, with diffusion front sharpness approaching observed values. Conversely, the predicted grid from the ConvLSTM model exhibits fragmentation in the western region and reduced resolution. (3) Within dashed-circle administrative boundary areas, the predicted grid from the ICLU-CWGAN model shows high spatial congruence with the observed grid regarding high-concentration saturation zones, distribution patterns, and gradient transition characteristics. The predicted grid from the ConvLSTM model, however, fails to reproduce comparable spatial features.
Table 4 and Figure 11 present the evaluation of heterogeneous geographical unit generalization based on a 72 h validation set across 12 monitoring sites. The ICLU-CWGAN model maintains regional concentration prediction deviations within ±3 μg/m3 with an overall deviation rate below 15%. This represents a 57% improvement in spatial prediction accuracy compared to the ±7 μg/m3 fluctuation range of ConvLSTM in concurrent testing. Particularly in high-concentration gradient areas such as Qiaotou Station (observed: 35.7 μg/m3; predicted: 36.9 μg/m3), ICLU-CWGAN stabilizes prediction error rates below 3.4%, significantly outperforming the typical 12–18% error rates of the ConvLSTM model in comparable regions. The predicted-observed scatter plot (R2 = 0.74, RMSE = 4.61 μg/m3) further confirms strong correlation without systematic bias, indicating enhanced spatiotemporal feature decoupling capabilities through the generative adversarial mechanism.
In summary, ICLU-CWGAN significantly outperforms baseline models in spatial topology reconstruction (elliptical structures/ring continuity), microscopic feature preservation (front sharpness), anomaly localization, and administrative boundary characterization, demonstrating spatial generalization capability suitable for hyperlocal air quality early-warning requirements.

5. Discussion

5.1. Comparison with Previous Prediction Models

The ICLU-CWGAN model proposed in this study demonstrates significant improvements in prediction accuracy and robustness. This is achieved through a fused architecture combining generative adversarial training and multi-scale spatiotemporal feature extraction. Recently, integrating physical mechanisms with deep learning has emerged as a key direction for enhancing model generalizability. For instance, Li et al. [34] proposed a physics-informed deep learning framework. By embedding advection-diffusion equations into the neural network, they achieved a 16–42% reduction in systematic bias for pollutant prediction. This aligns with our approach, where the multi-scale ConvLSTM captures diffusion processes under physical constraints. Together, these studies demonstrate the critical role of physical mechanisms in improving prediction accuracy.
Architecturally, the generator utilizes an Inception-style ConvLSTM. This design employs parallel multi-scale convolutional kernels and hierarchical normalization strategies. It enhances the adaptive capture of multi-scale features within the PM2.5 concentration field. This concept resonates with the design philosophy of the Memo-UNet model [35], which uses an updatable memory module to adaptively leverage crucial historical information. Both architectures focus on improving the long-term modeling capability of dynamic spatiotemporal processes. This architectural refinement proved effective. Our model reduced the RMSE by 40.93% in 3 h prediction tasks compared to the original ConvLSTM (Table 2). Simultaneously, the discriminator adopts a U-Net architecture constrained by spectral normalization. This effectively mitigates the training instability common in traditional GANs for pixel-level prediction tasks. This technique is consistent with trends observed in Temporal U-Net [36] and Enhanced U-Net [37]. These models also employ structural optimizations to enhance discriminator stability and feature representation capability. Collectively, they contribute to the spatial topological authenticity of the generated data (Figure 10). Notably, Li et al. (2025) [38] improved ACGAN in radar target recognition research. By integrating self-attention mechanisms and Wasserstein distance optimization, they significantly enhanced model recognition accuracy in complex environments. This corroborates the effectiveness of combining self-attention mechanisms with Wasserstein distance, a strategy also reflected in our technical approach.
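To make the branch-and-concatenate pattern concrete, the following toy numpy sketch applies parallel same-padded convolutions at 1 × 1, 3 × 3, and 5 × 5 scales and stacks the results as channels. The fixed averaging kernels and the `inception_branch` helper are illustrative stand-ins for the learned kernels embedded in the ConvLSTM gates, not the paper's implementation.

```python
import numpy as np

def conv2d_same(x, kernel):
    # Naive 'same'-padded 2D convolution on a single-channel map.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def inception_branch(x, kernel_sizes=(1, 3, 5)):
    # Parallel multi-scale convolutions whose outputs are concatenated
    # along the channel axis (toy averaging kernels as weights).
    maps = [conv2d_same(x, np.ones((k, k)) / (k * k)) for k in kernel_sizes]
    return np.stack(maps)  # shape: (num_scales, H, W)
```

Because all branches use 'same' padding, each scale preserves the spatial grid, so the multi-scale responses can be concatenated without resampling.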
From an algorithmic optimization perspective, our model employs a composite loss function. This function combines adversarial loss, feature matching loss, and spatiotemporal smoothness constraints, enabling multi-objective cooperative optimization. This design philosophy is also reflected in similar multi-objective optimization strategies applied to the WRF-Chem model [1]. There, it achieved an RMSE improvement ranging from 38.90% to 48.86%. This further confirms that hybrid loss functions can effectively balance data-driven accuracy with physical plausibility.
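The three loss terms can be sketched numerically as below; the helper names, the L1 feature distance, and the unit weights are placeholder assumptions, not the paper's exact formulation or coefficients.

```python
import numpy as np

def temporal_smoothness(pred):
    # Mean squared difference between consecutive predicted frames,
    # penalizing abrupt inter-frame jumps in the generated sequence.
    return float(np.mean((pred[1:] - pred[:-1]) ** 2))

def feature_matching(real_feats, fake_feats):
    # Mean absolute distance between discriminator feature maps of
    # real and generated samples, averaged over feature levels.
    return float(np.mean([np.mean(np.abs(r - f))
                          for r, f in zip(real_feats, fake_feats)]))

def composite_loss(adv, fm, smooth, w_fm=1.0, w_sm=1.0):
    # Weighted sum of adversarial, feature-matching, and smoothness terms.
    return adv + w_fm * fm + w_sm * smooth
```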

5.2. Long-Term Series Prediction and Model Comparison

The ICLU-CWGAN model demonstrates a significant advantage in mitigating performance degradation in long-term predictions, showing notable strengths in medium- to long-term forecasting of PM2.5 concentration fields. Compared with the RCL-Learning model proposed by Zhang et al. (2022) [24], which addressed long-term dependencies by extracting correlations between pollutants and meteorological data, our model performs better. RCL-Learning exhibited a substantial increase in prediction error with forecast lead time (RMSE rose from 13.622 μg/m3 at 3 h to 22.927 μg/m3 at 15 h). In contrast, our model incorporates a sliding-window dynamic optimization mechanism and discriminator multi-level feature fusion: when the forecast horizon extends from 3 h to 12 h, the RMSE increases by only 1.84 μg/m3, and the performance decay is reduced by 56.5% compared with ConvLSTM. This significantly enhances the capture of long-term evolutionary patterns and prediction stability; this approach of optimizing the training process to improve long-term stability is also reflected in other advanced models. The performance advantage of our model is further highlighted in comparison with other state-of-the-art models. Compared with the STLLM-GAN proposed by Yin et al. (2025) [26], which enhances reasoning about complex spatiotemporal relationships by introducing a large language model, our model offers a more efficient alternative: its specialized design based on multi-scale ConvLSTM and an enhanced U-Net maintains comparable prediction performance while avoiding a massive parameter footprint, demonstrating superior computational efficiency and engineering applicability. Furthermore, compared with the AGCTCN model proposed by Choudhury et al. (2022) [39] and baselines such as ST-Transformer (2024) [30], our model achieves a lower RMSE (4.61 μg/m3) and a higher R2 (0.80) in the 12 h prediction task.
This validates the effectiveness of the proposed architecture for long-term spatiotemporal prediction.
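The decay comparison follows directly from the RMSE values reported in Tables 2 and 3:

```python
# RMSE growth when extending the horizon from 3 h to 12 h:
convlstm_decay = 8.92 - 4.69   # ConvLSTM: 4.23 ug/m3 increase
ours_decay = 4.61 - 2.77       # ICLU-CWGAN: 1.84 ug/m3 increase

# Relative reduction in performance decay versus ConvLSTM:
reduction = (convlstm_decay - ours_decay) / convlstm_decay
print(round(reduction * 100, 1))  # -> 56.5
```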

6. Conclusions

This study proposes a spatiotemporal pollutant concentration prediction model based on CWGAN-GP (ICLU-CWGAN). The model features an Inception-style ConvLSTM generator with parallel multi-scale convolutional kernels and hierarchical normalization, while the discriminator employs an architecturally enhanced U-Net. The key advantages of the proposed method are summarized as follows:
(1)
The generator integrates physical diffusion mechanisms with deep learning. This integration enables adaptive capture of multi-scale spatial features in pollutant concentration fields.
(2)
The adversarial training framework provides strong robustness. It effectively learns nonlinear dynamic characteristics at concentration field boundaries. This capability suppresses error accumulation caused by traditional models’ reliance on stationarity assumptions.
Several limitations should be acknowledged. The study area, Dongguan City, covers approximately 2460 km2. As a single-city case study, its spatial representativeness is limited. The observation period spans only eight and a half months. This duration fails to cover complete annual climate and pollution pattern variations. Most importantly, the absence of validation in other geographical and environmental conditions necessitates further investigation into the model’s generalizability.
Despite these limitations, this study offers significant value. Theoretically, it provides an innovative technical pathway for spatiotemporal prediction through multi-scale feature extraction and adversarial training. Practically, the model demonstrates superior hyperlocal prediction accuracy and long-term stability in Dongguan. It establishes a reliable technical paradigm for air quality warning systems in similar-sized cities. The model’s precise characterization of complex pollution processes within limited spatiotemporal scope convincingly validates the effectiveness and advancement of its architectural design.

Author Contributions

J.Z.: Conceptualization, Methodology, Writing—original draft, Visualization, Writing—review & editing. P.S.: Conceptualization, Methodology, Data curation, Writing—original draft, Visualization, Investigation, Formal analysis. J.W.: Software, Validation. Z.C.: Software, Validation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by Jiangxi Provincial Natural Science Foundation (20202BAB204035) and Jiangxi Engineering Technology Research Center of Nuclear Geoscience Data Science and System (JETRCNGDSS202103).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ma, X.; Liu, H.; Peng, Z. Improving WRF-Chem PM2.5 predictions by combining data assimilation and deep-learning-based bias correction. Environ. Int. 2025, 195, 109199.
2. Wu, D.; Zheng, H.; Li, Q.; Jin, L.; Lyu, R.; Ding, X.; Huo, Y.; Zhao, B.; Jiang, J.; Chen, J.; et al. Toxic potency-adjusted control of air pollution for solid fuel combustion. Nat. Energy 2022, 7, 194–202.
3. Fan, Y.; Chen, Z.; He, T. The Impact of Carbon-Emission Trading Scheme Policies on Air Quality in Chinese Cities. Sustainability 2024, 16, 10023.
4. Zhao, M.; Wang, K. Short-term effects of PM2.5 components on the respiratory infectious disease: A global perspective. Environ. Geochem. Health 2024, 46, 293.
5. Ding, L.; Fang, X.; Cheng, K. The impact of PM2.5 pollution on residents' health and economic loss accounting in China. Econ. Geogr. 2021, 41, 82–92.
6. Shi, H.; Chen, L.; Zhang, S.; Li, R.; Wu, Y.; Zou, H.; Wang, C.; Cai, M.; Lin, H. Dynamic association of ambient air pollution with incidence and mortality of pulmonary hypertension: A multistate trajectory analysis. Ecotoxicol. Environ. Saf. 2023, 262, 115126.
7. Wen, C.; Liu, S.; Yao, X.; Peng, L.; Li, X.; Hu, Y.; Chi, T. A novel spatiotemporal convolutional long short-term neural network for air pollution prediction. Sci. Total Environ. 2019, 654, 1091–1099.
8. Liu, X.; Li, W. MGC-LSTM: A deep learning model based on graph convolution of multiple graphs for PM2.5 prediction. Int. J. Environ. Sci. Technol. 2023, 20, 10297–10312.
9. Lolli, S. Urban PM2.5 concentration monitoring: A review of recent advances in ground-based, satellite, model, and machine learning integration. Urban Clim. 2025, 63, 102566.
10. Grell, G.A.; Peckham, S.E.; Schmitz, R.; McKeen, S.A.; Frost, G.; Skamarock, W.C.; Eder, B. Fully coupled "online" chemistry within the WRF model. Atmos. Environ. 2005, 39, 6957–6975.
11. Seinfeld, J.H.; Pandis, S.N. Atmospheric Chemistry and Physics: From Air Pollution to Climate Change; John Wiley & Sons: Hoboken, NJ, USA, 2016.
12. Chi, X.; Li, Z.; Liu, H.; Chen, J.; Gao, J. Predicting air pollutant emissions of the foundry industry: Based on the electricity big data. Sci. Total Environ. 2024, 917, 170323.
13. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2015.
14. Kutner, M.H.; Nachtsheim, C.J.; Neter, J.; Li, W. Applied Linear Statistical Models, 5th ed.; McGraw-Hill Irwin: New York, NY, USA, 2004.
15. Wang, J.; Ogawa, S. Effects of Meteorological Conditions on PM2.5 Concentrations in Nagasaki, Japan. Int. J. Environ. Res. Public Health 2015, 12, 9089–9101.
16. Brunsdon, C.; Fotheringham, A.S.; Charlton, M.E. Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity. Geogr. Anal. 1996, 28, 281–298.
17. Mogollón-Sotelo, C.; Casallas, A.; Vidal, S.; Celis, N.; Ferro, C.; Belalcazar, L. A support vector machine model to forecast ground-level PM2.5 in a highly populated city with a complex terrain. Air Qual. Atmos. Health 2021, 14, 399–409.
18. Wei, C.; Zhao, C.; Hu, Y.; Tian, Y. Predicting the Concentration Levels of PM2.5 and O3 for Highly Urbanized Areas Based on Machine Learning Models. Sustainability 2025, 17, 9211.
19. Kim, H.Y.; Won, C.H. Forecasting the volatility of stock price index: A hybrid model integrating LSTM with multiple GARCH-type models. Expert Syst. Appl. 2018, 103, 25–37.
20. Mao, X.; Liu, G.; Wang, J.; Lai, Y. BiTCN-ISInformer: A Parallel Model for Regional Air Pollutant Concentration Prediction Using Bidirectional Temporal Convolutional Network and Enhanced Informer. Sustainability 2025, 17, 8631.
21. Mu, L.; Bi, S.; Ding, X.; Xu, Y. Transformer-based ozone multivariate prediction considering interpretable and priori knowledge: A case study of Beijing, China. J. Environ. Manag. 2024, 366, 121883.
22. Xia, H.; Chen, X.; Wang, Z.; Chen, X.; Dong, F. A Multi-Modal Deep-Learning Air Quality Prediction Method Based on Multi-Station Time-Series Data and Remote-Sensing Images: Case Study of Beijing and Tianjin. Entropy 2024, 26, 91.
23. Su, I.F.; Chung, Y.C.; Lee, C.; Huang, P.M. Effective PM2.5 concentration forecasting based on multiple spatial–temporal GNN for areas without monitoring stations. Expert Syst. Appl. 2023, 234, 121074.
24. Zhang, B.; Zou, G.; Qin, D.; Ni, Q.; Mao, H.; Li, M. RCL-Learning: ResNet and convolutional long short-term memory-based spatiotemporal air pollutant concentration prediction model. Expert Syst. Appl. 2022, 207, 118017.
25. Kalajdjieski, J.; Zdravevski, E.; Corizzo, R.; Lameski, P.; Kalajdziski, S.; Pires, I.M.; Garcia, N.M.; Trajkovik, V. Air pollution prediction with multi-modal data and deep neural networks. Remote Sens. 2020, 12, 4142.
26. Yin, C.; Mao, Y.; Deng, L.; Chen, M.; Rong, Y.; He, X.; Zhou, X. STLLM-GAN: Spatio-temporal LLM Generative Adversarial Network for PM2.5 prediction. Expert Syst. Appl. 2025, 292, 128250.
27. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
28. Arjovsky, M.; Bottou, L. Towards Principled Methods for Training Generative Adversarial Networks. arXiv 2017, arXiv:1701.04862.
29. Schonfeld, E.; Schiele, B.; Khoreva, A. A U-Net based discriminator for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 8207–8216.
30. Zhang, K.; Yang, X.; Cao, H.; Thé, J.; Tan, Z.; Yu, H. Multi-step forecast of PM2.5 and PM10 concentrations using convolutional neural network integrated with spatial–temporal attention and residual learning. Environ. Int. 2023, 171, 107691.
31. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875.
32. Huang, C.-J.; Kuo, P.-H. A deep CNN-LSTM model for particulate matter (PM2.5) forecasting in smart cities. Sensors 2018, 18, 2220.
33. Park, S.; Kim, M.; Kim, M.; Namgung, H.G.; Kim, K.T.; Cho, K.H.; Kwon, S.B. Predicting PM10 concentration in Seoul metropolitan subway stations using artificial neural network (ANN). J. Hazard. Mater. 2018, 341, 75–82.
34. Li, L.; Khalili, R.; Lurmann, F.; Pavlovic, N.; Wu, J.; Xu, Y.; Liu, Y.; O'Sharkey, K.; Ritz, B.; Oman, L.; et al. Knowledge-informed deep learning to mitigate bias in joint air pollutant prediction. Environ. Int. 2025, 206, 109915.
35. Fang, T.; Li, X.; Shi, C.; Zhang, X.; Xiao, W.; Kou, Y.; Mumtaz, I.; Huang, Z. Memo-UNet: Leveraging historical information for enhanced wave height prediction. Neurocomputing 2025, 634, 129840.
36. Tong, Q.; Wang, L.; Dai, Q.; Zheng, C.; Zhou, F. Enhanced cloud removal via temporal U-Net and cloud cover evolution simulation. Sci. Rep. 2025, 15, 4544.
37. Sahragard, E.; Farsi, H.; Mohamadzadeh, S. Advancing semantic segmentation: Enhanced UNet algorithm with attention mechanism and deformable convolution. PLoS ONE 2025, 20, e0305561.
38. Li, Q.; Zhu, H. Target classification with low-resolution radars based on cyclic bispectrum and improved ACGAN. Measurement 2025, 259, 119715.
39. Choudhury, A.; Middya, A.I.; Roy, S. Attention enhanced hybrid model for spatiotemporal short-term forecasting of particulate matter concentrations. Sustain. Cities Soc. 2022, 86, 104112.
Figure 1. Distribution map of air pollutant concentration monitoring stations and meteorological stations in Dongguan, China.
Figure 3. Architecture of the ICLU-CWGAN adversarial training framework for PM2.5 concentration prediction.
Figure 4. Structure diagram of multiscale convolutional algorithm.
Figure 5. Inception-style ConvLSTM network architecture.
Figure 6. Architecturally enhanced U-Net.
Figure 7. Degree of fit between the observed and predicted values on the test set.
Figure 8. Fitting trends of the different models in the 12 h task. (a–d) show the fitting trends of the ConvLSTM, ConvGRU, CNN-LSTM, and ICLU-CWGAN models, respectively.
Figure 9. Prediction of PM2.5 concentration trends over the different periods. The red curve represents the observed values; the blue curve represents the predicted values.
Figure 10. Comparison of 12 h concentration prediction grids (random time-steps) from ICLU-CWGAN and ConvLSTM.
Figure 11. Degree of fit between average observed and predicted values at PM2.5 monitoring sites.
Table 1. Inception-style ConvLSTM (IS ConvLSTM) parameters.

| Network Layer | Layer Hierarchy | Parameters | Input Size | Output Size |
|---|---|---|---|---|
| Input sequence | – | – | – | (2, 7/8, 1, 50, 75) |
| Encoder | IS ConvLSTMBlock (Layer 1) | f = [1 × 1, 3 × 3, 5 × 5]; s = 1; p = [1, 1, 2]; d = 32 | (2, 7/8, 1, 50, 75) | (2, 7/8, 16, 50, 75) |
| | IS ConvLSTMBlock (Layer 2) | f = [1 × 1, 3 × 3, 5 × 5]; s = 1; p = [1, 1, 2]; d = 64 | (2, 7/8, 16, 50, 75) | (2, 7/8, 32, 50, 75) |
| | IS ConvLSTMBlock (Layer 3) | f = [1 × 1, 3 × 3, 5 × 5]; s = 1; p = [1, 1, 2]; d = 32 | (2, 7/8, 32, 50, 75) | (2, 7/8, 64, 50, 75) |
| Decoder | IS ConvLSTMBlock (Layer 1) | f = [1 × 1, 3 × 3, 5 × 5]; s = 1; p = [1, 1, 2]; d = 64 | (2, 7/8, 64, 50, 75) | (2, 7/8, 32, 50, 75) |
| | IS ConvLSTMBlock (Layer 2) | f = [1 × 1, 3 × 3, 5 × 5]; s = 1; p = [1, 1, 2]; d = 32 | (2, 7/8, 32, 50, 75) | (2, 7/8, 16, 50, 75) |
| | IS ConvLSTMBlock (Layer 3) | f = [1 × 1, 3 × 3, 5 × 5]; s = 1; p = [1, 1, 2]; d = 1 | (2, 7/8, 16, 50, 75) | (2, 7/8, 1, 50, 75) |
| Mapping layer | – | seq_len = 7/8; pre_len = 3/12 | (2, 7/8, 50, 75, 1) | (2, 3/12, 50, 75, 1) |
| Output sequence | – | – | – | (2, 3/12, 1, 50, 75) |

Notation: f: convolution kernel size; s: convolution stride; p: padding size; d: total convolutional kernel count in the layer; seq_len: input sequence length (historical time steps); pre_len: output sequence length (predicted time steps).
Table 2. Performance comparison of all models for the 7–3 h task.

| Model | RMSE (μg/m3) | MAE (μg/m3) | R2 |
|---|---|---|---|
| ConvLSTM | 4.69 | 7.23 | 0.50 |
| CNN-LSTM | 5.88 | 8.10 | 0.78 |
| ConvGRU | 6.08 | 8.22 | 0.70 |
| STA-ResCNN [29] | 11.72 | 7.72 | – |
| ST-Transformer [30] | 6.92 | 4 | – |
| AGCTCN [31] | 8.75 | 11.76 | 0.64 |
| ICLU-CWGAN | 2.77 | 5.48 | 0.89 |
Table 3. Performance comparison of ICLU-CWGAN and state-of-the-art methods for the 8–12 h prediction.

| Model | RMSE (μg/m3) | MAE (μg/m3) | R2 |
|---|---|---|---|
| ConvLSTM | 8.92 | 18.76 | 0.41 |
| CNN-LSTM | 10.45 | 9.73 | 0.66 |
| ConvGRU | 11.06 | 11.66 | 0.63 |
| ICLU-CWGAN | 4.61 | 6.42 | 0.80 |
Table 4. Site-specific PM2.5 concentrations generated by ICLU-CWGAN model.

| Observed Area | Observed Value (μg/m3) | Predicted Value (μg/m3) |
|---|---|---|
| Huangjiang | 16.9 | 14.9 |
| Hongmei | 31.3 | 34.2 |
| Qishi | 21.8 | 15.8 |
| Houjie | 19.3 | 20.8 |
| Zhongtang | 10.8 | 9.4 |
| Shilong | 7.3 | 16.1 |
| Qiaotou | 35.7 | 36.9 |
| Xiegang | 12.2 | 11.5 |
| Liaobu | 17.8 | 16.6 |
| Dalang | 16.6 | 12.1 |
| Changan | 3.1 | 11.2 |
| Humen | 14.9 | 8.8 |
