Comprehensive Scale Fusion Networks with High Spatiotemporal Feature Correlation for Air Quality Prediction

Wu, Chenyi; Lai, Zhengliang; Xu, Yunwu; Zhu, Xishun; Wu, Jianhua; Duan, Guiqin

doi:10.3390/atmos16040429


by Chenyi Wu ¹, Zhengliang Lai ¹, Yunwu Xu ¹, Xishun Zhu ², Jianhua Wu ³ and Guiqin Duan ⁴,*

¹ School of Electrical Engineering, Guangdong Songshan Vocational and Technical College, Shaoguan 512000, China
² School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, China
³ School of Information Engineering, Nanchang University, Nanchang 330031, China
⁴ School of Computer and Information Engineering, Guangdong Songshan Vocational and Technical College, Shaoguan 512000, China
* Author to whom correspondence should be addressed.

Atmosphere 2025, 16(4), 429; https://doi.org/10.3390/atmos16040429

Submission received: 3 March 2025 / Revised: 1 April 2025 / Accepted: 3 April 2025 / Published: 8 April 2025

(This article belongs to the Special Issue Applying Deep Learning Technology for Spatiotemporal Prediction of Air Pollution from Urban Mobile Sources)


Abstract

The escalation of industrialization has worsened air quality, underscoring the essential need for accurate forecasting to inform policies and protect public health. Current research has primarily emphasized individual spatiotemporal features for prediction, neglecting the interconnections between these features. To address this, we propose the generative Comprehensive Scale Spatiotemporal Fusion Air Quality Predictor (CSST-AQP). The novel dual-branch architecture combines multi-scale spatial correlation analysis with adaptive temporal modeling to capture the complex interactions in pollutant dispersion and enhance pollution forecasting. Initially, a fusion preprocessing module based on localized high-correlation spatiotemporal features encodes multidimensional air quality indicators and geospatial data into unified spatiotemporal features. Then, the core architecture employs a dual-branch collaborative framework: a multi-scale spatial processing branch extracts features at varying granularities, and an adaptive temporal enhancement branch concurrently models local periodicities and global evolutionary trends. The feature fusion engine hierarchically integrates spatiotemporally relevant features at individual and regional scales while aggregating local spatiotemporal features from related sites. In experiments across 14 Chinese regions, CSST-AQP achieves state-of-the-art performance compared to LSTM-based networks, with RMSE 6.11–9.13 μg/m³ and R² 0.91–0.93, demonstrating highly robust 60 h forecasting capabilities for diverse pollutants.

Keywords:

spatiotemporal convolution; air quality prediction; generative CNN; multi-scale fusion; feature correlation

1. Introduction

The accelerated global industrialization in recent decades has exacerbated air pollution to critical levels, with the World Health Organization attributing approximately 4.2 million premature deaths annually to ambient air pollution exposure [1]. This crisis has made air quality prediction not just a technical challenge but a vital part of urban sustainability planning, supporting both environmental governance and public health strategies [2,3]. Air quality prediction research encompasses both traditional statistical methods and modern deep learning techniques [4].

Traditional methods, known as chemical transport models, simulate the transport, diffusion, and chemical reactions of pollutants to predict their concentrations [5]. These methods are categorized into physical diffusion-based models and statistical models, with the former relying on diffusion theory and the latter on empirical assumptions [6]. However, their reliance on empirical parameters and physical data limits accuracy. Statistical methods, such as linear regression and machine learning models, reduce the complexity and parameter overload of physical models [7,8]; examples include support vector regression (SVR) [9], Random Decision Forests (RFs) [10], Hidden Semi-Markov Models (HSMMs) [11], and other machine learning techniques [12,13]. While these conventional models demand fewer computational resources and parameters, their static feature interaction mechanisms fail to delineate the spatiotemporal heterogeneity of pollutants and cannot resolve multi-scale dependencies.

In recent years, convolutional neural networks (CNNs) have significantly advanced air quality prediction, particularly in forecasting PM_2.5 concentrations. In 2018, Athira et al. [14] used recurrent networks for air quality prediction in conjunction with HSMM. Guo [15] attempted to partition time correlation at a finer granularity and designed a deep residual CNN. These hybrid approaches improved prediction accuracy by fusing statistical feature engineering with CNN frameworks, but they remain constrained by manual hyperparameter configuration. Subsequently, Elbaz et al. [16] and Tran et al. [17] enhanced accuracy by integrating spatiotemporal data into residual CNNs, although their methodologies introduced heightened algorithmic complexity. Considering the correlation between different time lengths, some researchers have also proposed long short-term memory (LSTM) networks for air quality prediction [18,19,20]. For example, Ameri et al. [21] and Tishya et al. [22] proposed a deep learning model with an encoder–decoder structure using Conv and LSTM units to model and predict the correlation of urban pollution. However, these models are less efficient in training and inference for long sequences. Further, researchers have proposed methods based on LSTM variants to predict air index time series [23]. Rabie et al. [24] developed a CNN-Bi-LSTM hybrid framework to address multi-scale spatial air quality prediction in megacities, overcoming challenges in modeling cross-regional pollutant dependencies through integrated multi-scale feature extraction. Fan et al. [25] implemented hybrid seasonal air quality prediction through a mixture of rough ensemble wrapping methods and regularized combinatorial LSTM. In addition, there are some other novel models, including Self-Attention LSTM [26] and adaptive LSTM [27]. However, these models typically demand substantial computational resources for training and optimization, and their performance can vary significantly when predicting different time scales.

Since 2022, researchers have increasingly leveraged graph convolutional networks (GCNs) to address constrained modeling capacity in temporal air quality data analysis [28]. Chen et al. [29] employed a multi-factor separation evolutionary spatiotemporal GCN with factorized feature learning for multivariate series decomposition; Hu et al. [30] improved cross-regional accuracy by 10.7% through adaptive multi-graph attention fusion with GRU units. An emerging hybrid network introduced GCN and Bi-LSTM structures to adapt the propagation depth to seasonal patterns, addressing the long-term prediction challenge through a specialized spatiotemporal decoupling strategy [31]. In addition, other hybrid models have been proposed, including attention mechanisms [32] and physically informed guidance [33]. These advancements have improved stability and accuracy across prediction tasks involving various time scales. Recently, generative CNN (GCNN) [34,35] has been utilized to generate a pre-trained model, which achieves fine-grained spatiotemporal knowledge migration via generative pre-training on source city data. Some researchers have designed a combination of GCNN and reinforcement learning to focus on essential time steps and spatial locations [36,37]. Based on this, other generative networks have combined multifactor decomposition and data-driven trees, further uniting spatiotemporal features [38,39]. However, these models may miss key correlations when handling spatiotemporal features separately, which complicates the resolution of spatiotemporal interactions, and their cross-region prediction performance is unsatisfactory [40]. In addition, existing studies have predominantly concentrated on modeling dependencies at a single spatial and temporal scale, neglecting the intricate spatiotemporal dynamics of air quality changes across individual and local dimensions in cross-city prediction tasks.

To solve the above problems, this study proposes an air quality prediction method based on generative complete scale spatiotemporal convolutional networks (CSST-AQP), which aims to solve the spatial and temporal dependence problem in air quality prediction tasks. The CSST-AQP framework adopts a three-level progressive architecture: first, a local strong correlation spatiotemporal feature extraction fusion network is constructed to capture long-range spatiotemporal dependencies through a dynamic feature sensing mechanism; second, an adaptive spatiotemporal augmentation network and a multi-scale convolution network are designed to achieve the synergistic extraction of multi-dimensional features; finally, CSST-AQP manages remote temporal dependencies in parallel while modeling spatial dependencies to achieve long-term air quality prediction. Experiments show that CSST-AQP performs excellently in all metrics over a 0–400 h prediction window. The main contributions that we make in this paper are as follows:

(1) A multi-scale fusion model integrates cross-station spatial patterns and local temporal trends to capture spatiotemporal dynamics.
(2) An adaptive spatial correlation matrix coding module dynamically updates site interactions and tracks spatiotemporal dependencies.
(3) A temporal model integrating dilated causal convolutions with adaptive snow-ablation optimization is used to capture local periodic and extended-range temporal dependencies.
(4) A novel architecture fuses time-series encoding and spectral–temporal decomposition, resolving spatiotemporal mismatches while aligning cross-modal scales.
(5) A prediction algorithm with extra high accuracy, 22.5% higher than CEMD-LSTM, is proposed.

The rest of this paper is organized as follows. Section 2 reviews some related technologies and concepts. In Section 3, the workflow of the proposed air quality prediction model is described in detail. To evaluate the proposed algorithm, simulation details, experimental results, comparison with several other air quality prediction models, and an ablation study are given in Section 4. Finally, Section 5 and Section 6 present the discussion and a brief conclusion, respectively.

2. Related Materials and Concepts

2.1. Spatiotemporal Encoder

To capture regional spatiotemporal correlation, Du et al. [41] proposed a spatiotemporal convolutional encoder (STE) based on temporal causal convolutional encoding. STE is capable of combining information from the spatiotemporal encoder to retain the characteristics of the original input data. The structure of STE is shown in Figure 1.

The STE consists of two spatiotemporal graph convolution blocks and a fully connected output layer. Each STE-Block contains two temporal Gated-Conv and one spatial Gated-Conv. The temporal Gated-Conv combines a 1-D Conv with a gated linear unit. In this study, we use STE to capture dynamic changes in air quality data over time and improve prediction performance.

2.2. Preliminaries

Given the historical air quality time series X up to the current time step, we aim to train a predictive network that accurately estimates the air quality data of all regions at the future time step $t + 1$ and then obtain the predicted value $X_{t + 1}$. Currently, few studies predict air quality data across regions, and this study focuses on forecasting air pollutants across different spatiotemporal gaps. In general, there are regional and time dependencies between changes in air pollution concentrations, as shown in Figure 2. To more comprehensively depict spatiotemporal correlations, the following definitions are formulated in this paper.

(1): Air quality flow graph (AQFG): the air quality flow graph is defined as $γ$ , and the citywide air quality data for the pre-T moment are expressed by a tensor $X = \{x_{1}, x_{2}, x_{3}, \dots x_{t}\}, t = 1, 2, 3 \dots, N$ , $N \in T$ .
(2): Spatial region: each monitoring station M_t is represented as a graph node with a feature vector $h_{t} \in ℝ^{d}$ , encapsulating multivariate air quality measurements (e.g., PM_2.5, SO₂, and O₃) and meteorological conditions (wind speed and temperature) at the station.

2.3. Time Series Imaging

The Gramian Angular Field (GAF) was proposed by Oates et al. Its main function is to encode time series into two-dimensional (2D) image information by generating a Gram matrix through trigonometric functions [42]. In this paper, the GAF-transformed time series data retain the temporal relationships among their observations, so we combine GAF with a CNN to improve the prediction accuracy by maintaining the temporal relationships while learning more 2D features. First, each observation is normalized:

$$x_{t}^{'} = \frac{(x_{t} - \max (X)) + (x_{t} - \min (X))}{\max (X) - \min (X)}, \tag{1}$$

where $x_{t}^{'}$ is the normalized observed value and $x_{t}$ is the t-th observation of the sequence. Then, $x_{t}^{'}$ is converted to polar coordinates:

$$φ_{t} = \arccos (x_{t}^{'}), \tag{2}$$

$$r_{t} = \frac{t}{T}, \tag{3}$$

where $φ_{t}$ is the angle, $r_{t}$ is the radius, t denotes the time stamp, and the transformed polar form is defined as $(φ_{t}, r_{t})$. Ultimately, the trigonometric sum of each pair of observations determines each element $GAF_{t^{'}, t^{″}}$ of the GAF matrix:

$$GAF_{t^{'}, t^{″}} = \cos (φ_{t^{'}} + φ_{t^{″}}), \quad t^{'}, t^{″} = 1, 2, 3, \dots, T, \tag{4}$$

$$GAF = \begin{pmatrix} \cos (φ_{1} + φ_{1}) & \dots & \cos (φ_{1} + φ_{T}) \\ ⋮ & ⋱ & ⋮ \\ \cos (φ_{T} + φ_{1}) & \dots & \cos (φ_{T} + φ_{T}) \end{pmatrix}. \tag{5}$$
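As a minimal sketch of Equations (1), (2), (4), and (5), the following Python function encodes a short series as a GAF matrix. The function name, the sample values, and the clamping step are illustrative assumptions and are not taken from the paper's implementation.

```python
import math

def gramian_angular_field(series):
    """Encode a 1-D series as a Gramian Angular Field, following Eqs. (1)-(5)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled to [-1, 1]")
    # Eq. (1): rescale every observation to [-1, 1].
    scaled = [((x - hi) + (x - lo)) / (hi - lo) for x in series]
    # Clamp against floating-point drift so that acos stays defined.
    scaled = [max(-1.0, min(1.0, v)) for v in scaled]
    # Eq. (2): angular coordinate of each observation.
    phi = [math.acos(v) for v in scaled]
    # Eqs. (4)-(5): pairwise cosine of summed angles.
    return [[math.cos(a + b) for b in phi] for a in phi]

# Hypothetical hourly PM2.5 readings, used only for illustration.
pm25 = [35.0, 42.0, 58.0, 40.0]
gaf = gramian_angular_field(pm25)
```

The result is a symmetric T × T matrix, so a series of T observations becomes a single-channel image that a 2D convolution can consume. The radius $r_{t}$ of Equation (3) is not needed to build the matrix itself.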

2.4. Evaluation Indexes

The objective evaluation indexes of air quality prediction used in this paper include the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and R-squared ($R^{2}$). The formulas are

$$RMSE = \sqrt{\frac{1}{t} \sum_{i = 1}^{t} {({\hat{y}}_{i} - y_{i})}^{2}}, \tag{6}$$

$$MAE = \frac{1}{t} \sum_{i = 1}^{t} |{\hat{y}}_{i} - y_{i}|, \tag{7}$$

$$MAPE = \frac{1}{t} \sum_{i = 1}^{t} \left|\frac{y_{i} - {\hat{y}}_{i}}{y_{i}}\right| \times 100\%, \tag{8}$$

$$R^{2} = 1 - \frac{\sum_{i = 1}^{t} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{t} {(y_{i} - \bar{y})}^{2}}, \tag{9}$$

where ${\hat{y}}_{i}$ and $y_{i}$ are the predicted value and the actual value, respectively, $\bar{y}$ is the average of the $y_{i}$, and t is the number of samples.
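The four indexes in Equations (6)–(9) can be computed directly from paired observations and predictions. The sketch below uses only the Python standard library; the function name and the sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent), and R^2 as defined in Eqs. (6)-(9)."""
    t = len(y_true)
    errors = [p - a for p, a in zip(y_pred, y_true)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / t)
    mae = sum(abs(e) for e in errors) / t
    # MAPE is undefined when an actual value is zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, y_true)) / t
    mean_y = sum(y_true) / t
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

# Hypothetical concentrations in ug/m^3.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
scores = regression_metrics(observed, predicted)
```

For these sample values, the MAE is 2.0 and $R^{2}$ is 0.964.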

3. Proposed Model and Methodology

In this section, we propose a comprehensive scale fusion network with high spatiotemporal feature correlation for air quality prediction, named CSST-AQP, which is based on a preprocessing network and STE-ResNet. Its structure is shown in Figure 3.

3.1. Overview of the Proposed CSST-AQP Framework

The proposed CSST-AQP architecture comprises four principal components: (1) a local spatiotemporal feature preprocessing network (LSTF-Net) for initial data refinement, (2) a comprehensive-scale spatial feature processing module (CSSP-Net) enabling multi-granularity pattern extraction, (3) an adaptive temporal feature enhancement network (ATSE-Net) with dynamic time-aware mechanisms, and (4) a prediction output layer. In CSST-AQP, all sub-networks are trained simultaneously, and the convolutional modules in the three sub-networks are configured with convolutional kernels of different sizes to learn air quality time series feature information in different contextual orientations.

In general, the processing pipeline begins by feeding raw data X into the LSTF-Net preprocessing sub-network for spatiotemporal feature encoding. The generated fusion features are subsequently distributed through dual processing branches: the upper branch employs CSSP-Net, while the lower branch utilizes ATSE-Net, each specialized in extracting complementary dynamic spatiotemporal dependencies through parallel training. The learned features from both branches are then sequentially processed through three integration stages: feature fusion, a multi-attention mechanism layer, and a fully connected regression layer. Finally, the predicted values are generated through the output layer. The detailed procedure is described as follows.

(1) First, the input data X are fed to LSTF-Net, where they undergo localized spatiotemporal modeling. This process integrates spatial mapping and time-series feature extraction to simultaneously capture two critical aspects: temporal patterns of air quality variations at individual monitoring stations and cross-station spatial correlations, ultimately generating enhanced spatiotemporal features $X_{1}$ that encode both station-specific dynamics and regional interdependencies.

(2) Then, $X_{1}$ is encoded as corresponding 2D feature maps by CSSP-Net. In the CSSP-Net architecture, the input feature $X_{1}$ undergoes multi-scale feature fusion through GAF and Continuous Wavelet Transform (CWT). This hybrid approach hierarchically integrates temporal patterns and frequency-domain characteristics. The fused feature maps are processed by the residual Conv module, where reinforcement learning mechanisms are applied to execute feature optimization, ultimately yielding an enhanced feature map $X_{2}$.

(3) The processed temporal features are fed into ATSE-Net, where key hyperparameters (neuronal population and convolutional kernel dimensions) in STE modules are optimized via a novel metaheuristic optimizer combining periodic oscillatory dynamics and snow ablation strategies. This accelerates convergence while boosting generalization and prediction fidelity. Then, the output features $X_{3}$ are obtained.

(4) Finally, the feature maps $X_{2}$ and $X_{3}$ are fused to generate $X_{4}$, which is then transmitted to the fully connected layer and multi-head attention layer for feature learning to obtain the final prediction result ${\hat{y}}_{i}$.

3.2. Localized Spatiotemporal Feature Preprocessing Module (LSTF-Net)

The first sub-network of the CSST-AQP framework is the local spatiotemporal feature preprocessing network (LSTF-Net). To extract spatiotemporal dependence, LSTF-Net categorizes the pertinent features in air quality prediction into two distinct groups: temporal and spatial features. The LSTF-Net structure is shown in Figure 4. LSTF-Net comprises three components: an initial encoding module for converting raw sequences into AQFG matrices, channel attention module 1 (CAM-1) for capturing strongly correlated spatiotemporal features, and channel attention module 2 (CAM-2) implementing dilated convolution operations and feature-weighted fusion.

The temporal features f include PM_2.5, PM₁₀, SO₂, CO, and O₃, collectively denoted as X, and the operation process of the model is as follows:

$$X = f (PM_{2.5}) + f (PM_{10}) + f (SO_{2}) + f (CO) + f (O_{3}). \tag{10}$$

(1) Firstly, to capture spatiotemporal dependencies across regional monitoring stations, LSTF-Net employs an initial encoding step that transforms the raw sequence X into an AQFG matrix $γ$:

$$γ = (M, A, X), \tag{11}$$

where A is the dynamic weight matrix between air quality flow graphs, and M is the monitoring node. The dynamic weight matrix ${A^{t}}_{i, j}$ of the monitoring node i to node j at time point t is

$${A^{t}}_{i, j} = \frac{\exp \left(- \frac{\operatorname{distance} (M_{i}, M_{j})}{ξ} \cdot \frac{X_{i}^{t} \cdot r_{i, j}}{‖r_{i, j}‖}\right)}{\sum_{k \in N (i)} \exp \left(- \frac{\operatorname{distance} (M_{i}, M_{k})}{ξ} \cdot \frac{X_{i}^{t} \cdot r_{i, k}}{‖r_{i, k}‖}\right)}, \tag{12}$$

where $r_{i, j}$ is the vector from monitoring node i to node j; $ξ$ is the attenuation coefficient; $M_{k}$ denotes the k-th historical monitoring node, with $M_{i, j} \subseteq M$, $i, j = 1, 2, 3, \dots, N$; $X_{i}^{t}$ represents the data of node i at time t; and $r_{i, k}$ is the k-th prior coefficient vector of $X_{i}^{t}$.
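To make the normalization in Equation (12) concrete, the sketch below evaluates it literally for a handful of stations. Several choices here are assumptions made only so the example runs: the distance is Euclidean on planar coordinates, $X_{i}^{t}$ is treated as a 2-D vector so that its dot product with $r_{i, j}$ is defined, and the neighbourhood N(i) is taken to be every other station. The function name and inputs are hypothetical.

```python
import math

def dynamic_weights(coords, x_t, xi=10.0):
    """Row-normalized weights following the literal form of Eq. (12).

    coords: list of (x, y) station positions (assumed planar and distinct).
    x_t:    list of 2-D vectors standing in for the node data X_i^t (assumption).
    xi:     attenuation coefficient.
    """
    n = len(coords)
    weights = []
    for i in range(n):
        scores = {}
        for j in range(n):
            if j == i:
                continue  # N(i) is assumed to exclude the station itself
            r = (coords[j][0] - coords[i][0], coords[j][1] - coords[i][1])
            dist = math.hypot(r[0], r[1])
            # Projection of X_i^t onto the unit vector pointing from i to j.
            proj = (x_t[i][0] * r[0] + x_t[i][1] * r[1]) / dist
            scores[j] = -(dist / xi) * proj
        # Subtract the maximum score before exponentiating for numerical stability.
        top = max(scores.values())
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Three hypothetical stations and node vectors.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
x_t = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
weights = dynamic_weights(coords, x_t)
```

Each row of the result sums to one, so row i can be read as how station i distributes its influence over its neighbours at time t.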

(2) Then, the tensor $γ$ is processed by channel attention module 1 (CAM-1). We calculate the Pearson correlation coefficients between tensors $γ_{i}$ and $γ_{j}$ at each site and define them as the spatial features $P_{i j}$, $P_{i j} \in P$:

$$P_{i j} = \frac{cov (γ_{i}, γ_{j})}{σ (γ_{i}) σ (γ_{j})}, \tag{13}$$

$$G (P) = sigmoid (W P + b). \tag{14}$$

A gating mechanism $G (P)$ is subsequently applied to amplify cross-regional features with $P_{i j}$, effectively capturing dominant atmospheric transport and city pathways. W denotes the coefficient matrix, and b is the penalty factor obtained from training. To explicitly encode inter-regional air quality dependencies, we introduce a geographically weighted graph regularization term $L_{geo}$ into the objective function:

$$L_{geo} = \sum_{i = 1}^{N} \sum_{j \in N (i)} P_{i j} {‖γ_{i} ⊙ m_{i} - γ_{j} ⊙ m_{j}‖}_{F}^{2}, \tag{15}$$

where $γ_{i} \in ℝ$ denotes the i-th spatiotemporal flow tensor of node $γ_{t}$, $m_{i}$ is the region-specific meteorologic mask, and the Hadamard product $⊙$ implements localized feature gating based on weather patterns. The spatial features quantify the correlation of pollutant data across different monitoring stations using the Pearson correlation coefficient. The resulting matrix P is transferred to the lower convolutional layer for training to generate the intermediate features $f_{s}$.
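Equations (13) and (14) can be illustrated with a small example. The sketch below computes the Pearson matrix for three stations and applies a sigmoid gate; treating W and b as scalars is a simplifying assumption made here, since in the paper they are learned during training. The station names and readings are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences, Eq. (13)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (std_a * std_b)

def gate(p, w=1.0, b=0.0):
    """Sigmoid gate of Eq. (14); scalar w and b are illustrative stand-ins."""
    return 1.0 / (1.0 + math.exp(-(w * p + b)))

# Hypothetical PM2.5 series at three stations.
stations = {
    "A": [30.0, 35.0, 50.0, 45.0],
    "B": [28.0, 36.0, 52.0, 44.0],
    "C": [50.0, 45.0, 30.0, 35.0],
}
names = list(stations)
P = {i: {j: pearson(stations[i], stations[j]) for j in names} for i in names}
G = {i: {j: gate(P[i][j]) for j in names} for i in names}
```

Stations A and B move together, so their gate value is close to the maximum that a correlation of 1 can produce, while the anti-correlated pair A and C is suppressed.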

(3) Similarly, CAM-2 processes the input signals $γ$ to extract temporal feature representations, denoted as $f_{t}$. To simulate spatiotemporal interdependencies across regional stations, we propose a bilinear temporal channel attention mechanism (BTCA) with learnable parameters W₁ and W₂. CAM-2 adopts dilated convolutions to process sensor inputs, outputting temporal features $f_{t} \in ℝ$. The BTCA mechanism computes station-wise interactions, which are defined as

$$X_{1} = Softmax ((f_{s} W_{1}) {(f_{t} W_{2})}^{T}). \tag{16}$$

(4) Finally, the spatial feature $f_{s}$ and temporal feature $f_{t}$ are integrated through dual attention-weighted layers, generating the unified spatiotemporal tensor $X_{1}$.

In LSTF-Net, $X_{1}$ serves as the input feature of the lower-level sub-network and is transmitted to CSSP-Net and ATSE-Net for training, respectively.
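The bilinear interaction in Equation (16) reduces to two projections, one matrix product, and a softmax. The following sketch spells this out with plain Python lists. The row-wise softmax axis, the feature dimensions, and the identity projection matrices are assumptions chosen for illustration; the paper learns W₁ and W₂ during training.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row (axis choice is an assumption)."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def btca(f_s, f_t, w1, w2):
    """Eq. (16): Softmax((f_s W1)(f_t W2)^T)."""
    left = matmul(f_s, w1)
    right = transpose(matmul(f_t, w2))
    return softmax_rows(matmul(left, right))

# Three stations with 2-D spatial and temporal features (hypothetical values).
f_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_t = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attention = btca(f_s, f_t, identity, identity)
```

With N stations, the output is an N × N matrix whose rows are probability distributions over stations.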

3.3. The Structure of Complete Scale Spatial Processing Network (CSSP-Net)

The upper half branch of CSST-AQP is CSSP-Net, which includes the GAF and CWT image coding module and a novel residual convolution module, whose structure is shown in Figure 5. CSSP-Net first encodes $X_{1}$ into images with GAF and CWT. The encoded feature maps then pass through 11 convolutional layers and 2 residual blocks, extracting hierarchical multi-scale features and ultimately generating the output $X_{2}$.

The upper branch employs a GAF encoder to transform time-series pollutant measurements into temporal feature maps. This encoding strategy preserves intrinsic temporal dependencies through polar coordinate-based angular correlations, effectively mitigating information loss during sequential data transformation.

For the feature map $X_{t}$, which has a total of T observations, $φ_{t}$ denotes the angle of the t-th observation frame, and the feature is scaled to [−1, 1] via Equation (1) to avoid the impact of larger observations on the subsequent inner product. Then, $X_{t}$ is converted to polar coordinates $(φ_{t}, r_{t})$ by Equations (2) and (3). Finally, the variable $φ_{t}$ is processed through Equations (4) and (5) to generate a 2D image feature map $GAF_{feature}$, which is then fed into the next layer for concatenation.

Concurrently, the lower branch processes spatial correlations via continuous wavelet transform (CWT) with a Haar wavelet basis, decomposing $X_{1}$ into multiscale wavelet coefficient arrays. This process can identify jumps and discontinuities within the signal and obtain features on multiple scales of the original spatiotemporal sequence, further improving the accuracy of the prediction:

$$W_{X} (a, b) = \frac{1}{\sqrt{a}} \int_{- \infty}^{\infty} X (t) \bar{ψ} \left(\frac{t - b}{a}\right) d t, \tag{17}$$

where $W_{X} (a, b)$ is the CWT coefficient at scale a and position b, which forms the encoded feature maps $CWT_{feature}$, and $\bar{ψ} (t)$ is the complex conjugate of the Haar wavelet function. Then, $CWT_{feature}$ is concatenated with $GAF_{feature}$ to form a hybrid spatiotemporal feature matrix. Subsequently, the fused matrix is fed into a residual CNN module to jointly model cross-regional interactions in both feature spaces: the GAF branch captures dynamic temporal patterns of pollutant evolution, while the CWT-derived coefficients encode localized spatial–frequency characteristics of atmospheric processes.

In detail, the fused image is processed through three convolutional layers with $3 \times 3$ convolutional kernels and one residual layer to obtain the feature $R_{1}$. $R_{1}$ is then sequentially processed through three down-sampling convolutional layers with convolutional kernel sizes of $5 \times 5$ and $7 \times 7$ to obtain the intermediate variables. These intermediate features are used as inputs to the next convolutional layer, which has a $3 \times 3$ kernel, and residual branches are learned to obtain $R_{2}$. Then, the feature map $R_{2}$ is input to a $1 \times 1$ convolutional layer and a flatten layer to produce a flattened output vector $X_{2}$.

3.4. Adaptive Temporal Feature Enhancement Network (ATSE-Net)

To effectively capture the long-range time dependence, an adaptive time series region enhancement sub-network (ATSE-Net) is introduced into the lower branch of CSST-AQP. The detailed structure of ATSE-Net is shown in Figure 6. The ATSE-Net architecture employs a hierarchically dense structure comprising four cascaded adaptive spatiotemporal encoding (ASE) blocks, followed by a multi-attention fusion module and a flattening output layer. Each ASE block integrates three core components: an STE unit for joint feature representation, an adaptive threshold filter (T-Filter) layer for dynamic sequence modeling, and an adaptive average pooling operator for dimensionality reduction.

The processing pipeline initiates with the tensor $X_{1}$ being sequentially processed through the ASE block cascade. At each stage, the output of the n-th (n = 2, 3, 4) ASE block undergoes element-wise fusion with the original $X_{1}$ feature through residual connections, implementing a dense connectivity pattern that alleviates gradient vanishing while enhancing feature reuse. After four ASE operations with three inter-block skip connections, the aggregated features are directed to the multi-attention layer and the flattening layer, ultimately flattened into the final output tensor $X_{3}$.

Critical to the architecture’s performance is the integration of the adaptive periodic oscillation snowmelt optimizer (APO-SO) [43] during parameter optimization. The APO-SO adopts a dual population mechanism, where one subpopulation is responsible for the global TSE module parameters search, and another’s task is to search for the optimal threshold of the local T-Filter. The mechanism enables ATSE-Net to optimize the parameters more efficiently during the training process, thus improving the performance and generalization ability of the model. The signal intensity accumulation matrix in the feature map may be affected by residual noise or anomalous signals in the original spatiotemporal sequence. The T-Filter can learn a suitable threshold to extract local features and preserve spatial information, for which the target intensity $S (i)$ and the threshold T are

$$S (i) = \begin{cases} S (i), & S (i) \geq T \\ 0, & S (i) < T \end{cases}, \tag{18}$$

$$T = a + e, \tag{19}$$

$$a = α \frac{\sum_{i = 1}^{w} S (i)}{w}, \tag{20}$$

$$e = β \sqrt{\frac{\sum_{i = 1}^{w} {(a - S (i))}^{2}}{w}}, \tag{21}$$

where $a$ is the mean of the sum of all column pixels in the feature map band, and $e$ is the standard deviation of the column intensity of the feature map band. w is the number of sample points, and $α$ and $β$ are the adaptive parameters learned by network training. The APO-SO framework for the TSE module is formally described, and its structural configuration is summarized in Algorithm 1. This strategy integrates adaptive population initialization with stochastic operators to enhance global search capabilities while maintaining computational efficiency.

Algorithm 1: APO-SO optimization for TSE module

As detailed in Algorithm 1, the optimization process employs a three-phase architecture: (1) diversity-preserving initialization (Lines 1–4); (2) hybrid exploration–exploitation (Lines 5–13); and (3) elite selection (Lines 14–17), after which $T_{optimal}$ is obtained (Line 18). The stochastic components (Lines 4, 5) work synergistically with the adaptive mechanisms (Lines 11–13, 17) to prevent premature convergence while maintaining a time complexity of $O (n \log n)$ through efficient population management.
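The T-Filter thresholding of Equations (18)–(21) can be sketched in a few lines. In the paper, $α$ and $β$ are learned during training and the threshold is searched by APO-SO; here both are fixed at 1.0 purely for illustration, and the function name and sample intensities are hypothetical.

```python
import math

def t_filter(intensity, alpha=1.0, beta=1.0):
    """Zero out intensities below the adaptive threshold T = a + e, Eqs. (18)-(21)."""
    w = len(intensity)
    # Eq. (20): scaled mean intensity.
    a = alpha * sum(intensity) / w
    # Eq. (21): scaled root-mean-square deviation from a.
    e = beta * math.sqrt(sum((a - s) ** 2 for s in intensity) / w)
    # Eq. (19): threshold.
    threshold = a + e
    # Eq. (18): keep values at or above the threshold, suppress the rest.
    filtered = [s if s >= threshold else 0.0 for s in intensity]
    return filtered, threshold

# Hypothetical column intensities with one dominant response.
column_intensity = [1.0, 2.0, 3.0, 10.0]
filtered, threshold = t_filter(column_intensity)
```

With these values, the threshold is about 7.54, so only the dominant response of 10.0 survives and the weaker entries are treated as noise.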

4. Experimental Section and Results

4.1. Training Details and Datasets

The model was developed using the PyTorch-2.1.1 deep learning framework, incorporating Cross-Entropy Loss as the loss function. The training regimen utilized the Adam optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay coefficient of 0.0005. The loss function L is

$$L (λ_{1}, λ_{2}) = E_{{\hat{y}}_{i}, y_{i}} \left[λ_{1} {‖{\hat{y}}_{i} - y_{i}‖}_{2}^{2} + λ_{2} R_{TV}\right], \tag{22}$$

$$R_{TV} = \int_{Ω} {‖{\hat{y}}_{i}‖}_{2} \, d x, \tag{23}$$

where $y_{i}$ is the true value, and ${\hat{y}}_{i}$ is the predicted value. A total variation penalty $R_{TV}$ is added to the loss function $L (λ_{1}, λ_{2})$, which further stabilizes the optimization. $λ_{1}$ and $λ_{2}$ are the weight coefficients, which are obtained from the network training, and their initial values are 0.8 and 0.2, respectively. In this study, experiments are conducted on two real datasets in the Guangzhou region (GZR) and Shaoguan region (SGR) to evaluate the performance of the model predictions. Both datasets (GZR/SGR) cover January 2021–April 2024 (see Table 1 for statistics) and are publicly available from the Air Quality Historical Data Platform [44].

The raw dataset underwent comprehensive preprocessing procedures comprising outlier elimination through the interquartile range method, z-score standardization, and min–max normalization to ensure dimensional homogeneity. We convert all the time series data into samples and labels with the sliding window method (with a sliding window size of W and sliding step size of O) before putting them into the network model for training. After 2000 batches, the network stabilized with a loss function value of 0.0005, and the $R^{2}$ of training and testing reached more than 0.95. The results are shown in Figure 7 and Figure 8, respectively.
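The sliding window conversion described above can be sketched as follows. The window size and step correspond to W and O in the text; the `horizon` parameter (how many future steps form each label) is an assumption added for illustration, set to one step to match the $t + 1$ target of Section 2.2.

```python
def sliding_windows(series, window, step, horizon=1):
    """Split a series into (sample, label) pairs with a sliding window.

    window:  number of past observations per sample (W in the text).
    step:    offset between consecutive windows (O in the text).
    horizon: number of future observations per label (assumed, default 1).
    """
    samples, labels = [], []
    last_start = len(series) - window - horizon
    for start in range(0, last_start + 1, step):
        samples.append(series[start:start + window])
        labels.append(series[start + window:start + window + horizon])
    return samples, labels

# Ten hypothetical hourly readings, encoded as 0..9 so the slices are easy to follow.
series = list(range(10))
samples, labels = sliding_windows(series, window=4, step=2)
```

With ten readings, a window of 4, and a step of 2, this yields three samples starting at indices 0, 2, and 4, each paired with the single reading that follows it.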

Figure 7 and Figure 8 show excellent network training with minimal discrepancy between truth and predicted values. Both training and testing losses stabilized at 0.0002 with no overfitting, and a high R² (R² > 0.95) confirms the model’s remarkable prediction capability.

4.2. Feature Visualization

In this section, we qualitatively analyze the prediction performance of the self-supervised spatiotemporal coding network and visualize the feature extraction capability of the CSST-AQP framework. Figure 9 shows the visualization results of the PM_2.5 component, PM₁₀ component, O₃ component, NO₂ component, SO₂ component, and CO component after the initial network coding process, respectively.

Figure 10a displays the SHAP value ranking for air quality predictors, showing PM_2.5 as the most influential feature, followed by industrial emissions (PM₁₀, O₃, NO₂, SO₂, and CO), temperature, humidity, wind speed, and temporal factors. The temporal analysis in Figure 10b reveals strong autocorrelation patterns within pollutant sequences across adjacent time points.

It can be observed that the SHAP value of PM_2.5 is the highest, followed by the other pollutants and temperature. In the LSTF-Net preprocessing framework, we systematically integrate these features through weighted encoding, transforming them into enhanced spatiotemporal representations that boost both model performance and interpretability with optimized feature correlations.

To evaluate the feature extraction capability enhancement of self-supervised learning in LSTF-Net, Figure 11 presents comparative visualizations of air quality feature correlations. Figure 11a displays the original feature distribution without embedding spatial features, while Figure 11b reveals the refined pattern after incorporating spatial correlations learned using LSTF-Net’s architecture.

In Figure 11a, we can see that 4000 randomly sampled embeddings trained without spatial feature integration exhibit fragmented clustering across air quality ranges with weak inter-sample correlations (low correlation coefficients). Conversely, Figure 11b demonstrates that LSTF-Net’s architecture effectively organizes embeddings into distinct clusters with strong feature correlations following spatial–temporal learning. These empirical results confirm that LSTF-Net’s self-supervised framework significantly enhances feature discernment through synergistic spatial–temporal representation learning, directly contributing to improved prediction accuracy.

4.3. Analysis of Projected Results

To validate the predictive performance of our proposed model, we conducted multi-step forecasts of six critical air pollutant concentrations (PM_2.5, PM₁₀, O₃, NO₂, SO₂, and CO) over a 400 h horizon. The predictive outcomes are systematically presented in three comparative visualizations: Figure 12 demonstrates the PM_2.5 and PM₁₀ forecasts, Figure 13 illustrates the O₃ and NO₂ predictions, and Figure 14 displays the SO₂ and CO projection results. Notably, all forecasted trajectories exhibit strong alignment with corresponding observed measurements across the evaluation period. The minimal deviation between predicted and actual values, particularly evident in the consistently narrow error margins, demonstrates the model’s effectiveness in capturing complex temporal patterns.

Table 2 summarizes the CSST-AQP model’s performance for six pollutants, showing stable metrics (RMSE: 6.11–9.13 μg/m³, MAE: 6.06–8.91 μg/m³, MAPE: 6.21–9.56%, R²: 0.91–0.93) with minimal prediction errors (<9% MAE) and high accuracy in both univariate and multivariate settings. All error confidence intervals exceed 95%, indicating statistical robustness. For O₃, the partial autocorrelation function (PACF) exhibits a statistically significant value of 0.21 at the third-order lag (Confidence Level = 95.8%). This residual autocorrelation likely stems from missing historical concentration data from the preceding day, resulting in incomplete modeling of temporal dependencies in current predictions.
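For reference, the four metrics reported in Table 2 can be computed as in the sketch below. It uses the standard textbook definitions of RMSE, MAE, MAPE, and R². The paper does not spell out its exact formulas in this section, so treating them as the standard ones is an assumption, and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent), and R² for two equal-length series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # MAPE is undefined when a true value is zero; this sketch assumes none are.
    mape = 100.0 * np.mean(np.abs(err / y_true))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}

# Hypothetical observed and predicted concentrations (μg/m³).
y_true = [50.0, 60.0, 80.0, 100.0]
y_pred = [55.0, 57.0, 84.0, 96.0]
m = regression_metrics(y_true, y_pred)
print({k: round(float(v), 3) for k, v in m.items()})
```

RMSE and MAE carry the unit of the pollutant concentration, whereas MAPE and R² are unitless, which is why only the last two can be compared directly across pollutants with different scales.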

Compared to industry-dominated pollutants such as PM_2.5 (PACF = 0.15), CO has a stronger autocorrelation (PACF = 0.18, RMSE = 9.13), and the error between its predicted and true values is more significant. To comparatively assess the relative error magnitude of CO predictions against other pollutants, we conducted systematic residual analysis through visualization of true-predicted value discrepancies, with the comprehensive scatterplot presented in Figure 15.

The results presented in Figure 15 demonstrate that the model consistently overestimates CO concentrations before data correction, exhibiting residuals between −1.2 and 1.2, and the error is significantly reduced after correction. This systematic bias may be attributed to current motor vehicle restriction policies that fail to adequately address winter-specific emission patterns during high pollution periods and morning rush hours.

4.4. Comparative Experiment

4.4.1. Compared with Deep Learning Models

To further validate the predictive performance of the proposed CSST-AQP (CT) model, Figure 16 and Table 3 present comparative results under identical experimental conditions between our model and established methods, including LSTM [21], TCN [45], EMD-LSTM [46], EMD-TCN [47], CEMD-LSTM [48], EA-LSTM [49], and the baseline raw data.

LSTM: Long Short-Term Memory Neural Networks (LSTMs), capable of capturing multi-scale temporal patterns and addressing long-term dependencies in time-series data, are employed to forecast air quality through an analysis of historical pollution records.

TCN: the temporal convolutional network (TCN), characterized by dilated causal convolutional architectures with exponentially expanding receptive fields, exhibits superior performance in spatiotemporal air pollutant forecasting through multiscale representation learning of atmospheric chemical transport dynamics.

EMD-LSTM: Empirical Mode Decomposition-enhanced LSTM (EL), a hybrid architecture that combines Hilbert–Huang Transform-based adaptive signal decomposition with recurrent neural memory units, achieves state-of-the-art performance in multivariate air pollutant concentration forecasting by resolving multiscale atmospheric turbulence patterns.

EMD-TCN: Empirical Mode Decomposition-enhanced TCN (ET), which integrates Hilbert–Huang Transform-based adaptive signal decomposition with exponentially dilated convolutional neural operators, establishes a novel paradigm for air prediction.

CEMD-LSTM: Complementary Ensemble Empirical Mode Decomposition LSTM (CL), integrating complementary noise-injected ensemble decomposition with gated recurrent feature learning, achieves breakthrough performance in multi-pollutant interaction forecasting by resolving coupled atmosphere–chemistry oscillations.

EA-LSTM: Evolutionary Algorithm-optimized LSTM (EA) features genetic algorithm-based hyperparameter tuning for enhanced prediction accuracy.
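To make the "exponentially expanding receptive fields" of the TCN baseline concrete, the sketch below computes how many past time steps a stack of dilated causal convolutions can see. The kernel size of 3, the four layers, and the single convolution per layer are illustrative assumptions. The configuration of the TCN baseline used in the comparison is not given in this section.

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions.

    Assumes one convolution per layer: each layer with dilation d
    extends the receptive field by (kernel_size - 1) * d time steps.
    """
    return 1 + (kernel_size - 1) * sum(dilations)

# Dilations double at every layer: 1, 2, 4, 8.
dilations = [2 ** i for i in range(4)]
rf = tcn_receptive_field(kernel_size=3, dilations=dilations)
print(dilations, rf)  # [1, 2, 4, 8] 31
```

With doubling dilations the receptive field grows roughly as a power of two in the number of layers, so a few layers suffice to cover a multi-day window of hourly data.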

As evidenced by the PM_2.5 prediction results in Figure 16 and Table 3, the proposed CSST-AQP model demonstrates superior performance across all evaluation metrics (RMSE = 8.63, MAE = 8.02, MAPE = 8.01%, R² = 0.92), achieving statistically significant improvements over other state-of-the-art models. Specifically, it outperforms both conventional baseline architectures (LSTM and TCN) and contemporary state-of-the-art variants (EMD-LSTM, EMD-TCN, CEMD-LSTM, and EA-LSTM) with a 13.09–42.3% relative improvement. Although LSTM and TCN have fewer parameters (15,612 and 11,340 vs. 86,267), CSST-AQP achieves superior prediction accuracy through multi-scale IMF decomposition (R² = 0.92). Additionally, the prediction curves of CSST-AQP in Figure 16 show minimal deviation from true values over 1–14 days. The consistent statistical superiority (p < 0.01, paired t-test) suggests CSST-AQP’s strong potential for operational deployment in environmental monitoring systems, particularly where accurate multi-scale pollution forecasting is critical for public health interventions.
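The relative improvement can be checked against the RMSE column of Table 3. The sketch below assumes it is defined as the percentage reduction in error with respect to each baseline, which the text does not state explicitly. Under that assumption the LSTM row reproduces the 42.3% upper end of the quoted range. Only the baseline rows legible in Table 3 are included, and the text does not say which metric or baseline yields the 13.09% lower end.

```python
def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to a baseline."""
    return 100.0 * (baseline_error - model_error) / baseline_error

csst_rmse = 8.63  # CSST-AQP RMSE for PM_2.5, as quoted in the text

# Baseline RMSE values copied from Table 3.
baseline_rmse = {
    "LSTM": 14.96,
    "TCN": 13.87,
    "EMD-LSTM": 12.12,
    "EMD-TCN": 10.57,
    "EA-LSTM": 9.56,
}

improvements = {
    name: relative_improvement(rmse, csst_rmse)
    for name, rmse in baseline_rmse.items()
}
for name, gain in improvements.items():
    print(f"{name}: {gain:.1f}% lower RMSE")
```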

This result confirms the model’s robust capabilities in multivariate atmospheric pollutant forecasting across multiple temporal horizons, attributable to its novel coupled cross-correlated (Pearson r > 0.78) spatiotemporal feature fusion by LSTF-Net. Moreover, CSST-AQP integrates parallel multi-scale feature extraction through ATSE-Net and CSSP-Net, which significantly improves temporal pattern capture efficiency. Specifically, CSSP-Net enables simultaneous processing of both high-frequency details via CWT and long-term trends through GAF encoding, whereas LSTM’s recurrent nature inherently causes gradient decay in long sequences. The ATSE-Net adaptive encoding strategy enhances discriminative feature learning to capture information about low-frequency components of pollutants and improve accuracy.
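As an illustration of the GAF encoding mentioned above, the sketch below builds a Gramian Angular Summation Field from a short series, assuming GAF here denotes the Gramian Angular Field of Wang and Oates listed in the references. It is a generic NumPy rendering of that technique, not the CSSP-Net implementation: the function name and the sample values are hypothetical, and the CWT branch is not shown.

```python
import numpy as np

def gramian_angular_field(series):
    """Gramian Angular Summation Field of a 1-D series.

    The series is rescaled to [-1, 1], each value is mapped to an angle
    phi = arccos(x), and entry (i, j) of the output is cos(phi_i + phi_j).
    """
    x = np.asarray(series, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    x = np.clip(x, -1.0, 1.0)  # guard against rounding just outside [-1, 1]
    phi = np.arccos(x)
    return np.cos(phi[:, None] + phi[None, :])

# Hypothetical hourly PM_2.5 values.
pm25 = [35.0, 42.0, 58.0, 61.0, 47.0, 39.0]
G = gramian_angular_field(pm25)
print(G.shape)  # (6, 6)
```

The result is a symmetric image whose entry (i, j) encodes the relation between time steps i and j, which is what allows a 2D convolutional branch to process a 1D pollutant series.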

4.4.2. Compared with Traditional Models

Furthermore, the traditional machine learning approaches RF [9] and SVM [10] serve as reference benchmarks and are compared with CSST-AQP under identical training conditions. Figure 17 compares the 7-day PM_2.5 prediction trajectories of CSST-AQP, SVM, and RF against ground-truth measurements, and the prediction error values and other metrics for each model are shown in Table 4.

As evidenced by the comparative analysis in Figure 17 and Table 4, CSST-AQP demonstrates superior predictive performance over conventional models (SVM/RF), achieving 29.6–39.4% higher accuracy (RMSE = 8.63 vs. 38.9/27.3). This enhancement originates from the LSTF preprocessing network, which encodes adaptive features to amplify strongly correlated spatiotemporal patterns while suppressing noise components. In addition, the static architectures of SVM and RF models struggle to adapt to dynamic spatiotemporal coupling relationships and exhibit limitations in long-sequence prediction tasks.

4.5. Ablation Experiments

4.5.1. Module Ablation

In this section, we perform ablation experiments on the sub-networks in the proposed CSST-AQP model, and the results are shown in Figure 18 and Table 5. We conduct systematic ablation studies to quantify the contribution of each component in the CSST-AQP framework. Under identical training protocols, three critical modules are incrementally ablated for 60 h AQI prediction.

1. LSTF-Net Removal

(1) Performance degradation: 312.7%↑ MAE; 225.8%↑ RMSE (vs. complete CSST-AQP model);

(2) Mechanism disruption: loses long-term spatiotemporal feature extraction (60 h patterns) and eliminates cross-modal correlation preprocessing (R² = 0.93→0.81).

The ablation of LSTF-Net markedly reduces the CSST-AQP’s effectiveness in two key aspects: (1) adaptive spatiotemporal encoding for capturing pollutant–station interdependencies, and (2) hierarchical feature fusion—losing 83% of strongly correlated cross-component interactions (p > 0.7) due to disabled attention CAM-1 and CAM-2.

2. CSSP-Net Ablation

(1) Short-term resilience: 0–24 h MAE maintains ±6.2% fluctuation.

(2) Critical degradation phase: 24–48 h: 19.5%↑ RMSE; 48–60 h: 91.8%↑ RMSE.

(3) Spatial feature loss: inter-station diffusion modeling capacity ↓62%.

CSSP-Net integrates geometric signal encoding (GAF/CWT) with adaptive residual fusion, achieving superior spatiotemporal coherence preservation (0–24 h MAE maintains a ±6.2% fluctuation) compared to conventional single-modality approaches. Ablation studies quantitatively confirm its essential role in maintaining prediction stability during complex pollution episodes.

3. ATSE-Net Exclusion

(1) Immediate impact: 12.4%↑ MAE at the first prediction step (t + 1); this is important for sequence characterization in adjacent time ranges.

(2) Temporal recognition impairment: periodic pattern detection accuracy ↓57%; the recognition performance of temporal feature correlation is weakened (R² = 0.93→0.83) at 60 h.

(3) Cumulative effect: error propagation amplifies, and ATSE-Net proves vital for immediate responsiveness, with its removal triggering a 3.6% MAE increase at the initial prediction step (t + 1).

The ablation of ATSE-Net critically undermines the spectrally enhanced STE module’s capability to process transient temporal patterns, while the absence of its adaptive T-Filter compromises discriminative feature extraction. This dual degradation mechanism fundamentally limits CSST-AQP’s capacity to isolate physically meaningful spatiotemporal interactions from complex environmental signals.

As evidenced in Figure 18 and Table 5, the complete model achieves an optimal balance between immediate responsiveness (<2% deviation at t + 1) and long-term stability (<3.3% error growth/hour). LSTF-Net proves the most critical (0.12 reduction on R²), particularly for modeling atmospheric transport inertia. CSSP-Net’s spatial hierarchy maintains 83% feature completeness through 60 h, while ATSE-Net’s adaptive temporal kernels account for 85% of short-term accuracy improvements.

4.5.2. Hyperparameter Analysis

In this section, we conduct a comprehensive ablation study to evaluate the hyperparameter sensitivity of the proposed CSST-AQP framework. The model architecture incorporates six tunable hyperparameters that require careful configuration: (1) sliding window size (W), (2) sliding step size (O), (3) convolutional kernel quantity (C_k), (4) batch size (B_s), (5) dilation factor (D_f), and (6) dropout rate (D_r). Systematic experimental configurations and results are detailed in Table 6.

The ablation study results in Table 6 reveal critical insights into hyperparameter sensitivity within the CSST-AQP framework: (1) W exhibits an inverted U-shaped relationship with prediction accuracy, demonstrating a 1.9% RMSE reduction at W = 32 compared to baseline. The results and evidence suggest optimal parameterization within W ∈ [32, 64], complemented by an overlapping stride O ∈ [4, 8], to balance temporal resolution and computational efficiency. (2) C_k and D_f show a synergistic effect, where increasing C_k to more than 32 units together with D_f = 4 results in a significant performance improvement (p < 0.005, t-test). (3) D_r requires precise calibration, where excessive regularization (D_r > 0.4) induces 18.9% performance deterioration. Conversely, moderate retention (D_r ∈ [0.2, 0.35]) enhances the generalization capacity by 9.2 ± 1.1% through effective prevention of co-adaptation. (4) B_s = 256 achieves an optimal trade-off between convergence speed and algorithmic time complexity.

4.6. Parameter Sensitivity and Robustness

4.6.1. Different Data Sources

To further investigate the parameter sensitivity and robustness of the proposed CSST-AQP framework, we conducted experiments under the GZR dataset and the SGR dataset, respectively. Figure 19 demonstrates the prediction effect of the CSST-AQP model for the time range of 0–48 h for the GZR dataset, the SGR dataset, and the integration of the two datasets, respectively.

In Figure 19, the proposed CSST-AQP model maintains consistent predictive accuracy across diverse datasets, with R² values consistently constrained within the narrow range of 0.92–0.97. This performance stability, as evidenced by the overlapping learning curves across different experimental groups, indicates remarkable model robustness against dataset variations. Particularly, the model is able to preserve prediction fidelity when handling datasets with distinct statistical distributions and noise patterns, suggesting strong generalization capabilities inherent in our architectural design.

4.6.2. Parameter Sensitivity

As shown in Figure 20, a comprehensive parameter sensitivity analysis was conducted to evaluate critical components of our framework, including (1) the attenuation coefficient ξ localizing the dynamic weight matrix in the LSTF-Net, (2) the adaptive thermal balancing parameters (α, β) governing snow ablation optimization in ATSE-Net, and (3) the environmental monitoring region sample size N_r.

Figure 20 further illustrates the temporal evolution of prediction accuracy, demonstrating characteristic MAE variation patterns (ΔMAE ∈ [1.23, 4.57]) across three operational phases: short-term (1–12 h), medium-term (12–48 h), and extended-term (48–60 h) AQI forecasting. Notably, the phase-dependent sensitivity profiles suggest a dynamic parameter dominance mechanism: the ablation parameters (α/β) predominantly influence short-term predictions (R² = 0.83), while the attenuation coefficient ξ shows a stronger correlation with medium-term accuracy. Overall, we can observe that the model is robust to small changes in parameters, with the best performance at ξ of 0.6, α of 0.6, and β of 1.2. The model achieves stable performance when the number of monitoring stations reaches 3 or more. In addition, the attenuation coefficient ξ plays an important role in LSTF-Net to extract highly correlated spatiotemporal features, which in turn affects the long-term prediction accuracy.

4.6.3. Noise Sensitivity

In the original dataset, we artificially added Gaussian noise with a mean of 0 and variance of 0.01–0.05 and salt-and-pepper noise with a noise ratio of 1–16% to validate the model’s noise immunity performance, and the results are shown in Figure 21.
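The two perturbations can be reproduced along the following lines. This is a sketch under stated assumptions rather than the authors' procedure: it assumes the noise is applied to a normalized series, that "salt" and "pepper" replace a value with the series maximum and minimum respectively, and that the noise ratio is the fraction of corrupted samples. The function names and the synthetic signal are hypothetical.

```python
import numpy as np

def add_gaussian_noise(x, variance, rng):
    """Add zero-mean Gaussian noise with the given variance."""
    return x + rng.normal(0.0, np.sqrt(variance), size=x.shape)

def add_salt_and_pepper(x, ratio, rng):
    """Replace a fraction `ratio` of entries with the minimum or maximum of x."""
    noisy = x.copy()
    n_corrupt = int(round(ratio * x.size))
    idx = rng.choice(x.size, size=n_corrupt, replace=False)
    # Each corrupted entry becomes "pepper" (min) or "salt" (max) at random.
    noisy.flat[idx] = rng.choice([x.min(), x.max()], size=n_corrupt)
    return noisy

rng = np.random.default_rng(0)
signal = np.linspace(0.0, 1.0, 200)  # stand-in for a normalized pollutant series

gauss = add_gaussian_noise(signal, variance=0.01, rng=rng)
sp = add_salt_and_pepper(signal, ratio=0.16, rng=rng)
print(gauss.shape, int(np.sum(sp != signal)))
```

Sweeping `variance` over 0.01–0.05 and `ratio` over 0.01–0.16 and re-evaluating the MAE at each level would yield sensitivity curves of the kind the text describes.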

As shown in Figure 21, the model maintains stable MAE values under both Gaussian and salt-and-pepper noise perturbations. This robustness can be attributed to the CSST-AQP framework’s advanced preprocessing module, which systematically identifies and filters anomalous data points while preserving critical pollution pattern signatures.

5. Discussion

This study systematically evaluates machine learning (RF and SVM) and deep learning models (LSTM, TCN, and CEMD-LSTM) for predicting six major air pollutants (PM_2.5, PM₁₀, O₃, NO₂, SO₂, and CO), incorporating meteorological variables (temperature, humidity, and wind speed) to enhance prediction accuracy. In particular, the proposed CSST-AQP model outperforms the other models, with an RMSE of 8.63, MAE of 8.02, and R² of 0.92 for pollutants tested over the next 400 h, because it emphasizes multi-scale fusion of features and weighted learning of spatiotemporally relevant features. Compared to other models, the CSST-AQP framework divides the prediction task into a screening encodability phase and a learning confirmation phase. In the screening encodability phase, a new LSTF-Net module is proposed to effectively capture spatial and temporal dependencies; in the learning confirmation phase, the hidden state learning probability distributions are extracted from the new features obtained from encoding, thus capturing the uncertainty of the input data and improving the model performance.

This framework enables practical applications, like cross-regional PM_2.5 forecasting, supporting early warning systems for public health protection (e.g., mask advisories during pollution peaks). In future work, our model needs to further disaggregate the correlations between different pollutants and spatial factors and conduct an in-depth study on the spatial error aggregation factors and seasonality of pollution sources in other countries or regions.

6. Conclusions

This paper presents a novel air quality prediction framework that comprehensively models multi-scale spatiotemporal dependencies through hierarchical feature fusion. Specifically, the entire model network adopts a dual-channel architecture: the spatial feature channel extracts the non-stationary correlation features among monitoring stations via 2D graph convolution CSSP-Net, and the temporal feature channel resolves the cross-scale dependence patterns of multivariate time series via adaptive temporal sequence ATSE-Net. Our framework demonstrates three distinctive advantages: first, it achieves seamless integration of station-specific historical patterns and inter-station spatial synergies through dual-stream processing; second, the hierarchical dilation structure enables simultaneous learning of hourly, daily, and weekly temporal dependencies; third, the adaptive graph learning module dynamically adjusts neighborhood weights based on real-time pollution dispersion patterns. This design enables the model to process remote dependencies in spatiotemporal dimensions in parallel, which significantly improves long-term prediction accuracy while ensuring computational efficiency. The experimental results across 14 major Chinese regions demonstrate a 22.5% improvement in 72 h PM_2.5 forecasting accuracy compared to other advanced LSTM and TCN models while maintaining computational efficiency suitable for operational deployment.

Author Contributions

Conceptualization, C.W. and Z.L.; methodology, C.W.; software, C.W. and X.Z.; validation, Z.L.; formal analysis, Y.X.; investigation, X.Z.; resources, G.D.; writing—original draft preparation, Z.L.; writing—review and editing, C.W.; visualization, J.W.; supervision, G.D.; funding acquisition, J.W. and G.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the National Natural Science Foundation of China (Grant No. 62041106); Guangdong Province, social development science and technology collaborative innovation system construction project (Grant Nos. P0000876021 and P0000876023); Quality Engineering Project of Guangdong Songshan Vocational and Technical College (2023JNDS02); and Guangdong Provincial Science and Technology Characteristic Innovation Project (2024KTSCX334).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, X.H.; Tee, K.; Elnahass, M.; Ahmed, R. Assessing the environmental impacts of renewable energy sources a case study on air pollution and carbon emissions in China. J. Environ. Econ. Manag. 2023, 345, 118525.
Pan, K.; Lu, J.; Li, J.; Xu, Z. A hybrid autoformer network for air pollution forecasting based on external factor optimization. Atmosphere 2023, 14, 869.
Zheng, Y.; Liu, F.R.; Hsieh, H.P. U-Air: When urban air quality inference meets big data. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 1436–1444.
Gu, K.; Qiao, J.; Lin, W. Recurrent air quality predictor based on meteorology- and pollution-related factors. IEEE Trans. Ind. Inform. 2018, 14, 3946–3955.
Steinfeld, J.I. Atmospheric chemistry and physics from air pollution to climate change. Environ. Sci. Policy 1998, 40, 26.
Flatøy, F.; Hov, O.; Schlager, H. Chemical forecasts used for measurement flight planning during POLINAT 2. Geophys. Res. Lett. 2000, 27, 951–954.
Osipov, A.V.; Pleshakova, E.S.; Gataullin, S.T. Production processes optimization through machine learning methods based on geophysical monitoring data. Comput. Optim. 2024, 48, 633–642.
Rizos, K.; Meleti, C.; Evagleopoulos, V.; Melas, D. A machine learning modelling approach to characterize the background pollution in the Western Macedonia region in northwest Greece. Atmos. Pollut. Res. 2023, 14, 101877.
Mendez, M.; Merayo, M.G.; Nuez, M. Machine learning algorithms to forecast air quality a survey. Artif. Intell. Rev. 2023, 56, 10031–10066.
Liu, H.; Yan, G.X.; Duan, Z.; Chen, C. Intelligent modeling strategies for forecasting air quality time series: A review. Appl. Soft Comput. 2021, 102, 106957.
Zhang, Y.P.; Chen, D.; Luo, Y. Low-SOC estimation of aluminum-air battery based on HSMM. J. Univ. Electron. Sci. Technol. China 2017, 46, 380–385.
Mampitiya, L.; Rathnayake, N.; Leon, L.P.; Mandala, V.; Azamathulla, H.M.; Shelton, S.; Hoshino, Y.; Rathnayake, U. Machine learning techniques to predict the air quality using meteorological data in two urban areas in Sri Lanka. Environments 2023, 10, 141.
Mampitiya, L.; Rathnayake, N.; Hoshino, Y.; Rathnayake, U. Forecasting PM10 levels in Sri Lanka a comparative analysis of machine learning models PM10. J. Hazard. Mater. Adv. 2024, 13, 100395.
Athira, V.; Geetha, P.; Vinayakumar, R.; Soman, K.P. DeepAirNet applying recurrent networks for air quality prediction. Procedia Comput. Sci. 2018, 132, 1394–1403.
Guo, Z.Y.; Yang, C.Y.; Wang, D.S.; Liu, H.B. A novel deep learning model integrating CNN and GRU to predict particulate matter concentrations. Process Saf. Environ. 2023, 173, 604–613.
Elbaz, K.; Shaban, W.M.; Zhou, A.N.; Shen, S.L. Real time image-based air quality forecasts using a 3D-CNN approach with an attention mechanism. Chemosphere 2023, 333, 138867.
Tran, H.D.; Huang, H.Y.; Yu, J.Y.; Wang, S.H. Forecasting hourly PM2.5 concentration with an optimized LSTM model. Atmos. Environ. 2023, 315, 120161.
Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 802–810.
Zaini, N.; Ahmed, A.N.; Ean, L.W.; Chow, M.F.; Malek, M.A. Forecasting of fine particulate matter based on LSTM and optimization algorithm. J. Clean. Prod. 2023, 427, 139233.
Wang, C.; Huang, S.; Zhang, C. Short-term traffic flow prediction considering weather factors based on optimized deep learning neural networks: Bo-GRA-CNN-BiLSTM. Sustainability 2025, 17, 2576.
Ameri, R.; Hsu, C.C.; Band, S.S.; Zamani, M.; Shu, C.M.; Khorsandroo, S. Forecasting PM 2.5 concentration based on integrating of CEEMDAN decomposition method with SVM and LSTM. Ecotoxicol. Environ. Saf. 2023, 266, 115572.
Tishya, M.; Anitha, A. Hybridization of rough set-wrapper method with regularized combinational LSTM for seasonal air quality index prediction. Neural Comput. Appl. 2024, 36, 2921–2940.
Wang, X.H.; Zhang, S.; Chen, Y.; He, L.Y.; Ren, Y.M.; Zhang, Z.; Li, J.; Zhang, S.Q. Air quality forecasting using a spatiotemporal hybrid deep learning model based on VMD–GAT–BiLSTM. Sci. Rep. 2024, 14, 17841.
Rabie, R.; Asghari, M.; Nosrati, H.; Niri, M.E.; Karimi, S. Spatially resolved air quality index prediction in megacities with a CNN-Bi-LSTM hybrid framework. Sustain. Cities Soc. 2024, 109, 105537.
Fan, P.D.; Wang, D.; Wang, W.; Zhang, X.Y.; Sun, Y.Y. A novel multi-energy load forecasting method based on building flexibility feature recognition technology and multi-task learning model integrating LSTM. Energy 2024, 308, 132976.
Huang, X.H.; Li, Y.F.; Wang, X.W. Integrating a multi-variable scenario with Attention-LSTM model to forecast long-term coastal beach erosion. Sci. Total Environ. 2024, 954, 176257.
Wang, Y.; Liu, K.; He, Y.; Wang, P.; Chen, Y.; Xue, H.; Huang, C.; Li, L. Enhancing air quality forecasting a novel spatio-temporal model integrating graph convolution and multi-head attention mechanism. Atmosphere 2024, 15, 418.
Hu, W.; Zhang, Z.; Zhang, S.Q.; Chen, C.M.; Yuan, J.W.; Yao, J.; Zhao, S.C.; Guo, L. Learning spatiotemporal dependencies using adaptive hierarchical graph convolutional neural network for air quality prediction. J. Clean. Prod. 2024, 459, 142541.
Chen, Y.S.; Huang, L.; Xie, X.D.; Liu, Z.X.; Hu, J.L. Improved prediction of hourly PM2.5 concentrations with a long short-term memory and spatiotemporal causal convolutional network deep learning model. Sci. Total Environ. 2024, 912, 168672.
Hu, Y.X.; Li, Q.; Shi, X.D.; Yan, J.Y.; Chen, Y.T. Domain knowledge-enhanced multi-spatial multi-temporal PM2.5 forecasting with integrated monitoring and reanalysis data. Environ. Int. 2024, 192, 108997.
Dey, S. Urban air quality index forecasting using multivariate convolutional neural network based customized stacked long short-term memory model. Process Saf. Environ. 2024, 191, 375–389.
Zhu, J.M.; Zheng, P.; Niu, L.L.; Chen, H.Y.; Wu, P. An enhanced interval-valued PM2.5 concentration forecasting model with attention-based feature extraction and self-adaptive combination technology. Expert Syst. Appl. 2025, 264, 125867.
Pande, C.B.; Kushwaha, N.L.; Alawi, O.A.; Sammen, S.S.; Sidek, L.M.; Yaseen, Z.M.; Pal, S.C.; Katipoglu, O.M. Daily scale air quality index forecasting using bidirectional recurrent neural networks: Case study of Delhi, India. Environ. Pollut. 2024, 351, 124040.
Wang, K.N.; Yang, T.N.; Kong, S.S.; Li, M.D. Air quality index prediction through Time GAN data recovery and PSO-optimized VMD-deep learning framework. Appl. Soft Comput. 2025, 170, 112626.
Mohammadzadeh, A.K.; Salah, H.; Jahanmahin, R.; Hussain, A.A.; Masoud, S.; Huang, Y.X. Spatiotemporal integration of GCN and E-LSTM networks for PM2.5 forecasting. Mach. Learn. Appl. 2024, 15, 100521.
Ma, M.B.; Xie, P.; Teng, F.; Wang, B.; Ji, S.G.; Zhang, J.B.; Li, T.R. HiSTGNN: Hierarchical spatio-temporal graph neural network for weather forecasting. Inf. Sci. 2023, 648, 119580.
Yang, W.H.; Li, H.M.; Wang, J.Z.; Ma, H.Y. Spatiotemporal feature interpretable model for air quality forecasting. Ecol. Indic. 2024, 167, 112609.
Che, J.X.; Hu, K.; Xia, W.X.; Xu, Y.F.; Li, Y.R. Short-term air quality prediction using point and interval deep learning systems coupled with multi-factor decomposition and data-driven tree compression. Appl. Soft Comput. 2024, 166, 112191.
Zhang, M.L.; Sun, L.F.; Yang, J.; Zou, Y.S. A shared multi-scale lightweight convolution generative network for few-shot multivariate time series forecasting. Appl. Soft Comput. 2024, 167, 112420.
Chen, Y.N.; Wu, Y.H.; Zhang, S.G.; Kee, Y.; Huang, J.; Shi, D.F.; Hu, S.X. Regional PM2.5 prediction based on hybrid directed graph neural network and spatio-temporal fusion of meteorological factors. Environ. Pollut. 2024, 366, 125404.
Du, Z.; Wu, S.; Huang, D.; Li, W.; Wang, Y. Spatio-temporal encoder-decoder fully convolutional network for video-based dimensional emotion recognition. IEEE Trans. Affect. Comput. 2021, 12, 565–578.
Wang, Z.G.; Oates, T. Imaging time-series to improve classification and imputation. arXiv 2015, arXiv:1506.00327v1.
Deng, L.Y.; Liu, S.Y. Snow ablation optimizer a novel metaheuristic technique for numerical optimization and engineering design. Expert Syst. Appl. 2023, 225, 120069.
Air Quality Historical Data Platform. Available online: https://aqicn.org/data-platform/register/ (accessed on 10 September 2024).
Sundaramoorthy, R.A.; Ananth, A.D.; Seerangan, K.; Nandagopal, M.; Balusamy, B.; Selvarajan, S. Implementing heuristic-based multiscale depth-wise separable adaptive temporal convolutional network for ambient air quality prediction using real time data. Sci. Rep. 2024, 14, 18437.
Huang, Y.; Yu, J.; Dai, X.; Huang, Z.; Li, Y. Air-quality prediction based on the EMD–IPSO–LSTM combination model. Sustainability 2022, 14, 4889.
Neeraj; Mathew, J.; Behera, R.K. EMD-Att-LSTM a data-driven strategy combined with deep learning for short-term load forecasting. J. Mod. Power Syst. Clean Energy 2022, 10, 1229–1240.
Wang, Z.C.; Chen, H.Y.; Zhu, J.M.; Ding, Z.N. Daily PM2.5 and PM10 forecasting using linear and nonlinear modeling framework based on robust local mean decomposition and moving window ensemble strategy. Appl. Soft Comput. 2022, 114, 108110.
Li, Y.R.; Zhu, Z.F.; Kong, D.Q.; Han, H.; Zhao, Y. EA-LSTM evolutionary attention-based LSTM for time series prediction. Knowl. Based Syst. 2019, 181, 104785.

Figure 1. Architecture of an STE.

Figure 2. Architecture of an STE. (a): hourly average AQI; (b): distance vs. AQI difference; (c): spatiotemporal distribution of AQI.

Figure 3. The framework of CSST-AQP; differently colored blocks represent different sub-networks and modules.

Figure 4. The LSTF-Net framework.

Figure 5. The structure of CSSP-Net.

Figure 6. The structure of ATSE-Net.

Figure 7. The loss curves for the training and testing processes.

Figure 8. The coefficient of determination (R²) curves under training and testing.

Figure 9. Visualization results for each AQI component after network coding.

Figure 10. Visualization results for feature maps: (a) SHAP feature dependency graph; (b) temporal analysis of PM_2.5.

Figure 11. Distribution of correlation coefficients of air quality characteristics: (a) embedding without spatial features; (b) embedding with spatial features.

Figure 12. Curves of true versus predicted values of PM₂.₅ and PM₁₀: (a) PM₂.₅; (b) PM₁₀.

Figure 13. Curves of true versus predicted values of O₃ and NO₂: (a) O₃; (b) NO₂.

Figure 14. Curves of true versus predicted values of SO₂ and CO: (a) SO₂; (b) CO.

Figure 15. Residual scatter plot of the true and predicted values of CO: (a) before data correction; (b) after data correction.

Figure 16. Comparison of the prediction performance of different models for PM₂.₅.

Figure 17. Comparison of the prediction performance of traditional models for PM₂.₅.

Figure 18. Predicted results of AQI under different modules.

Figure 19. Predictive performance of the CSST-AQP model over the 0–48 h horizon on different datasets.

Figure 20. Parameter sensitivity and robustness of the CSST-AQP model.

Figure 21. Predictive performance analysis of the model under different noise attacks: (a) Gaussian noise; (b) salt-and-pepper noise.

Table 1. Structure and parameters of the datasets.

Datasets	GZR	SGR
Number of air quality stations	121	63
Number of air quality records	4,369,102	1,103,144
Average AQI	58.6	70.1
Number of regions N_r	11	3

Table 2. Overall performance of the CSST-AQP model for different pollutants.

Variable	RMSE	MAE	MAPE (%)	R²	Confidence Level (95%)	PACF
PM₂.₅	8.63	8.12	8.01	0.92	96.6%	0.15
PM₁₀	6.83	6.06	6.21	0.93	95.1%	0.08
O₃	7.83	8.06	8.19	0.92	95.8%	0.21
NO₂	8.69	8.23	8.13	0.92	95.3%	0.12
SO₂	6.11	7.09	6.79	0.93	95.3%	0.09
CO	9.13	8.91	9.56	0.91	95.9%	0.18
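
The four error measures reported in Table 2 can be computed from paired observations and forecasts. The sketch below uses the standard textbook definitions of RMSE, MAE, MAPE, and R²; the exact formulas in Section 2.4 may differ in detail, and the function name `regression_metrics` and the sample values are illustrative assumptions, not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (%) and R^2 for two equal-length sequences.

    Hypothetical helper using standard definitions; not the authors' code.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # MAPE is undefined where an observation is zero; this sketch assumes none are.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}


# Illustrative values only (assumed, not taken from the datasets).
observed = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(observed, forecast)
```

For these four sample points the helper gives RMSE ≈ 2.12, MAE = 2.0, MAPE = 10.625%, and R² = 0.964. Because MAPE divides by the observed value, pollutants with concentrations near zero would need a guard or a different relative-error measure.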

Table 3. Results of performance comparison between CSST-AQP and other advanced air prediction algorithms.

Algorithms	RMSE	MAE	MAPE (%)	R²	Parameters	Training Time (s)
LSTM	14.96	12.98	12.87	0.83	15,612	52.1
TCN	13.87	11.57	11.93	0.85	11,340	123.6
EMD-LSTM	12.12	11.19	11.02	0.87	140,508	288.5
EMD-TCN	10.57	10.62	10.65	0.90	102,060	251.3
EA-LSTM	9.56	8.87	9.67	0.90	34,941	136.8
CEMD-LSTM	9.93	8.74	8.63	0.91	102,060	310.7
CSST-AQP	8.63	8.02	8.01	0.92	86,267	190.6
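
One way to read Table 3 is as the relative RMSE reduction of CSST-AQP with respect to each baseline, (RMSE_baseline − RMSE_CSST-AQP) / RMSE_baseline. The sketch below derives these percentages from the RMSE column of the table. The percentages are derived quantities, not values reported in the table, and the names `TABLE3_RMSE`, `relative_reduction`, and `reductions_vs` are hypothetical.

```python
# RMSE values copied from the RMSE column of Table 3.
TABLE3_RMSE = {
    "LSTM": 14.96,
    "TCN": 13.87,
    "EMD-LSTM": 12.12,
    "EMD-TCN": 10.57,
    "EA-LSTM": 9.56,
    "CEMD-LSTM": 9.93,
    "CSST-AQP": 8.63,
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def reductions_vs(proposed_name, table):
    """Relative RMSE reduction of one model against every other model in `table`."""
    proposed = table[proposed_name]
    return {
        name: relative_reduction(value, proposed)
        for name, value in table.items()
        if name != proposed_name
    }


reductions = reductions_vs("CSST-AQP", TABLE3_RMSE)
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower RMSE with CSST-AQP")
```

With the tabulated values, the reduction is about 42.3% relative to LSTM and about 9.7% relative to EA-LSTM, the closest baseline in RMSE.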

Table 4. Results of performance comparison between CSST-AQP and traditional models.

Algorithms	RMSE	MAE	MAPE (%)	R²	Parameters	Training Time (s)
SVM	38.9	37.6	37.1	0.66	N/A	31.3
RF	25.3	25.1	22.4	0.71	N/A	36.5
CSST-AQP	8.63	8.02	8.01	0.92	86,267	190.6

Table 5. Performance of ablation experiments.

Time	Algorithms	RMSE	MAE	MAPE (%)	R²
0–24 h	Dropout LSTF-Net	25.47	23.12	19.86	0.83
	Dropout CSSP-Net	12.14	15.27	9.01	0.89
	Dropout ATSE-Net	26.38	24.72	18.93	0.84
	CSST-AQP	9.15	7.12	6.01	0.93
24–60 h	Dropout LSTF-Net	28.12	27.39	20.39	0.81
	Dropout CSSP-Net	16.56	18.05	10.13	0.85
	Dropout ATSE-Net	20.03	20.23	15.11	0.83
	CSST-AQP	12.63	11.39	10.82	0.92

Table 6. Hyperparameter ablation study results (bold indicates baseline configuration).

Hyperparameter	Tested Values	ΔRMSE	p-Value	ΔTraining Time	Recommended Range
W	16, 32, 64, 128	+7.3%, −1.9%, baseline, +6.8%	0.01, 0.005, <0.001, 0.12	+8%, 5%, baseline, −3%	32–64
O	2, 4, 8, 16	+8.6%, baseline, −3.7%, +13.1%	0.03, 0.002, 0.003, 0.15	+9%, baseline, −3%, −5%	4–8
C_k	16, 32, 64, 128	+1.6%, baseline, −1.1%, +5.5%	0.04, 0.008, 0.002, 0.20	−2%, baseline, +6%, +15%	32–64
B_S	64, 128, 256, 512	+2.6%, −3.9%, baseline, +9.1%	0.01, 0.001, <0.001, 0.09	−2%, −1%, baseline, +1%	256–512
D_f	1, 2, 4, 8	+8.6%, baseline, −3.7%, +13.1%	0.02, 0.002, 0.001, 0.04	+6%, baseline, −3%, +3%	2–6
D_r	0.1, 0.2, 0.3, 0.4	+8.6%, baseline, −6.3%, +18.9%	0.06, 0.005, 0.001, 0.23	+8%, baseline, −6%, −7%	0.2–0.35

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
