High-Precision Air Quality Prediction via Attention-Driven Hybrid Neural Networks and Adaptive Feature Optimization

Zhan, Leqing; Feng, Kai; Gu, Xiaoyang; Han, Te

doi:10.3390/atmos16121363


by Leqing Zhan ^1,†, Kai Feng ^2,†, Xiaoyang Gu ^2,* and Te Han ^2

^1 School of Mathematics, University of Bristol, Bristol BS8 1UG, UK

^2 Beijing Laboratory for System Engineering of Carbon Neutrality, Beijing 100081, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Atmosphere 2025, 16(12), 1363; https://doi.org/10.3390/atmos16121363

Submission received: 11 October 2025 / Revised: 11 November 2025 / Accepted: 25 November 2025 / Published: 30 November 2025

(This article belongs to the Section Air Quality)


Abstract

Rapid urbanization and industrialization have intensified air pollution, posing severe challenges to sustainable development and public health. As a core economic zone in China, the Beijing–Tianjin–Hebei (BTH) region faces persistent air quality deterioration, highlighting the urgent need for accurate and intelligent prediction models. However, existing studies often suffer from limited adaptability of single models and subjective feature selection thresholds, constraining predictive performance and generalization capability. To address these challenges, this study proposes a feature-optimized hybrid deep learning framework for AQI prediction across Beijing, Tianjin, and Shijiazhuang. An adaptive feature selection strategy is first developed by integrating the Relief_F algorithm with the Bat Optimization Algorithm (BOA), which adaptively determines feature importance, thereby enhancing objectivity and effectiveness in identifying key pollutant and meteorological indicators. Subsequently, an attention-enhanced CNN–BiLSTM–GRU hybrid network is constructed, where the attention mechanism emphasizes critical temporal information that most influences prediction results. Experiments show that the proposed model achieves MAPE values of 1.00%, 1.15%, and 1.09% for Beijing, Tianjin, and Shijiazhuang, outperforming benchmark models by 18.43–45.05%. These results confirm the framework’s reliability for practical application with strong robustness and statistical validity.

Keywords:

air quality forecast; hybrid neural network; feature selection; attention mechanism

1. Introduction

As the global urbanization drive gathers pace and industrialization scales up, air pollution has evolved into a critical environmental concern that holds back the sustainable advancement of society. This form of pollution gives rise to numerous health issues, including respiratory illnesses, cardiovascular conditions and premature mortality, while also bringing about considerable risks to ecological systems [1]. In China, the Beijing–Tianjin–Hebei region, as a core engine of economic development and a densely populated area, has always attracted wide attention at home and abroad regarding its air quality. Air pollution monitoring plays a vital role in assessing atmospheric pollutant concentrations against ambient air quality criteria, and it also serves as a core component for implementing effective air quality governance [2,3]. However, the sources and types of air pollution are complex and change with time and geographical location, making it difficult to predict air quality [4]. In January 2024, China’s Ministry of Ecology and Environment put forward that priority should be given to bolstering the technical capacity for air quality forecasting, with its core objective being to enhance the overall accuracy of such forecasts over a 72-h timeframe, and in particular, to overcome the key technical bottlenecks in the prediction and forecasting of heavy pollution weather processes. By the end of 2025, provincial and municipal ecological environment monitoring and forecasting institutions shall fully possess the ability to forecast air quality for 7–10 days. The supply of effective air quality prediction information enables governments to devise and enforce environmental protection policies in a scientific way, which in turn promotes the development of a green economy and achieves the goal of sustainable urban development [5].

At present, the academic community has developed various technical paths around air quality prediction, ranging from early traditional models based on the statistical laws of data, to machine learning methods relying on nonlinear mapping capabilities, and then to deep learning architectures with deep feature mining capabilities [6]. Various methods show differentiated advantages in prediction accuracy, applicable scenarios, and computational efficiency [7,8]. Meanwhile, the advancement of feature selection techniques also serves as a crucial safeguard for optimizing model inputs and enhancing predictive performance.

Early air quality prediction mainly relied on traditional statistical models, which center on the temporal regularity or linear correlation of the data. Time series analysis is a key branch, with the ARIMA model widely used for Air Quality Index (AQI) prediction because it fits univariate time series well. Zhong et al. (2024) improved ARIMA to address the nonlinearity and non-stationarity of carbon emission data, verifying its accuracy and efficiency [9]. In addition, the multiple linear regression model establishes linear correlations between multiple factors and air quality. Ma et al. (2020) used SPSS to construct a multiple linear regression model with five types of pollutants selected as the key factors affecting AQI, and showed that the model has practical value for air quality prediction [10]. Mendes et al. (2022) proposed a statistical prediction method combining Classification and Regression Trees with Multiple Regression, and applied it to air quality prediction in Lisbon, Madeira, and Macao. It predicts next-day $\mathrm{PM}_{10}$ and $\mathrm{PM}_{2.5}$ concentrations with $R^{2}$ values of 0.50–0.89, following pollutant trends well [11].

As air quality data grows in dimension and complexity, traditional statistical models show prominent limitations in adapting to its nonlinear and non-stationary traits. In contrast, machine learning models have become a key breakthrough for air quality prediction. Support Vector Machines (SVM) demonstrate unique advantages in predicting data characterized by limited sample quantities and elevated dimensionality. Kulkarni et al. (2022) developed an SVM-based system to forecast AQI and pollutant concentrations over the next 15 h, outperforming linear models in Root Mean Square Error (RMSE) [12]. In addition, Yu et al. (2025) [13] proposed an attention-enhanced Random Forest for multi-pollutant estimation to address the single-pollutant focus of prior studies. Experimental validation demonstrated superiority over single-task comparison models, as evidenced by an increase in $R^{2}$ from 9% to 26%. Varghese et al. (2023) built an Extreme Gradient Boosting (XGBoost) regression model that combined multiple pollutants such as Pb, $\mathrm{NH}_{3}$, $\mathrm{SO}_{2}$, and $\mathrm{NO}_{2}$ with meteorological data for prediction, achieving higher accuracy than traditional models [14]. Van et al. (2023) proposed an AQI prediction method combining data processing technology and lightweight machine learning algorithms; comparisons on Indian regional datasets showed that, among Decision Tree, Random Forest, and XGBoost, XGBoost performed best in Mean Absolute Error (MAE), RMSE, and $R^{2}$ [15]. Although machine learning achieves higher accuracy than traditional models, single machine learning models still face issues such as heavy reliance on feature engineering and insufficient capture of long-term temporal dependencies, driving research into deep learning models with stronger deep feature mining capabilities.

Deep learning models, by constructing a multi-layer neural network structure, can realize in-depth mining of spatiotemporal features and long short-term temporal dependencies, and effectively solve the performance bottlenecks of machine learning models in complex scenarios [16,17]. Zhou (2023) [18] tackled the constraint that Long Short-Term Memory (LSTM) focuses solely on historical information, and put forward an air quality prediction model built on an improved LSTM. This model leverages a Bidirectional Long Short-Term Memory (BiLSTM) network to read data in both forward and backward directions, enabling the extraction of more comprehensive temporal characteristics. It also incorporates an attention mechanism to assign weights to the outputs of the BiLSTM hidden layer, thereby realizing the selective utilization of key input information. In addition, CNNs, relying on their local feature extraction capabilities, provide a new approach to predicting the spatial distribution of air quality. Wang et al. (2024) [19] constructed a CNN model to predict the CO concentration at a 10-m resolution in Nanjing. The model input integrates various factors such as building height, terrain, and emission sources. The results showed that the $R^{2}$ of the CNN predictions is greater than 0.8, confirming its spatial generalization ability.

To further integrate the capabilities of capturing long-term temporal dependencies, hybrid deep learning architectures have become a research hotspot. Gilik et al. (2022) [20] put forward a deep learning model based on CNN and LSTM. In the pollutant prediction of three cities, compared with the single-hidden-layer LSTM, this model reduced the prediction error of PM by 11–53%, $\mathrm{O}_{3}$ by 20–31%, nitrogen oxides by 9–47%, and $\mathrm{SO}_{2}$ by 18–46%. Currently, prediction performance is largely affected by the architecture of deep learning models, and different architectures exhibit different performance. How to propose an applicable model architecture for specific problems remains a highly challenging issue.

Although deep learning models have significantly improved prediction performance, the increase in model complexity has also brought about the problem of feature redundancy. Therefore, efficient feature selection technology has become the key to optimizing model input and computational efficiency. Feature selection approaches rooted in machine learning algorithms have found widespread application in air quality prediction, thanks to their capacity to quantify the importance of features. Jamei et al. (2022) [21] used XGBoost and CART to screen key prediction factors, and determined the optimal input combination through Best Subset Regression (BSR). After inputting this combination into the LSTM model, the prediction accuracy for $\mathrm{PM}_{2.5}$ and $\mathrm{PM}_{10}$ was better than that of models such as LightGBM. Tao made use of the Boruta feature selection method to determine which input variables are most significant. Experiments on three daily AQI sequences in China verified that the model could yield positive outcomes for these three cities [22]. Significant progress has been made in feature selection technology, but selecting the optimal algorithm for air quality prediction problems remains a challenge. The attention mechanism allows the model to focus more on important details, and how to apply it to the air quality prediction problem also remains a challenge. In addition, feature selection involves hyperparameters, and how to set them reasonably is an issue that needs to be considered.

To solve the above problems, this study constructed an attention-enhanced hybrid neural network and took three core cities (Beijing, Tianjin, and Shijiazhuang) in the Beijing–Tianjin–Hebei region as research objects. It designed single-model multivariable comparison experiments, feature selection method comparison experiments, hyperparameter optimization comparison experiments, robustness tests, and Diebold–Mariano (DM) tests.

The novelty of this research manifests itself in the three aspects outlined below. First, in the model construction part, a CNN-BiLSTM-Gated Recurrent Unit (GRU)-Attention hybrid architecture is built. The CNN extracts the local spatiotemporal correlational features of pollutants, and the BiLSTM and GRU collaborate to capture both long-term and short-term temporal associations; the Attention mechanism strengthens the influence of pollution peak periods and key factors through dynamic weight allocation, designing a hybrid prediction model with adaptive feature selection, which breaks through the performance bottleneck of traditional models. Second, the Relief_F algorithm is used to measure the correlation weight between features and AQI, and core features are screened in combination with dynamic thresholds. Finally, in the hyperparameter optimization part, the Bat Optimization (BAT) algorithm is introduced to adaptively select the feature selection threshold of the Relief_F algorithm.

The remainder of this paper is structured as follows. Section 2 presents the proposed hybrid model prediction framework, together with the problem definition and the methods used. Section 3 covers the dataset, data processing, and the design of the comparison experiments. Section 4 describes the comparison experiments and discusses the experimental results. Section 5 concludes the paper.

2. Proposed Model

2.1. Proposed Model Framework

The BAT-Relief_F-CNN-BiLSTM-GRU-Attention model proposed in this paper achieves high-precision prediction in the Beijing–Tianjin–Hebei region through three parts: feature selection, adaptive hyperparameter optimization, and optimal base model combination. The specific research framework is shown in Figure 1.

In the Feature Selection section, the Relief_F algorithm serves to assess input features quantitatively. By computing the correlation weight between each feature and the target variable, it selects key pollutant indicators such as $\mathrm{PM}_{2.5}$ and $\mathrm{PM}_{10}$, as well as meteorological features such as temperature and precipitation.

In the Hyperparameter Optimization segment, the BAT algorithm is brought in to accomplish the hyperparameter optimization for the Relief_F algorithm. By emulating the echolocation behavior of bats, it makes dynamic adjustments and reaches the optimal solution after multiple iterations, thus addressing the issues of high subjectivity and restricted accuracy in conventional manual parameter tuning.

The third part centers on the Hybrid Neural Network. It builds a combined model of CNN-BiLSTM-GRU-Attention. In this model, the CNN takes charge of extracting the local spatiotemporal features of pollutant concentrations; the BiLSTM and GRU work together to capture long-term and short-term temporal dependencies; and the Attention mechanism enhances the impact of crucial time steps and feature variables by allocating weights.

2.2. Definition of the Problem

This study aims to construct a data-driven air quality prediction model, which realizes the accurate prediction of concurrent air quality by learning historical data.

Let the input data be a time-series dataset $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ containing multi-dimensional features, where $x_{i} = (x_{i1}, x_{i2}, \dots, x_{id})$ is the d-dimensional original feature vector at the i-th time step. These features include six types of pollutants as well as temperature and precipitation. First, a feature selection operator S is applied to reduce the original features, and a subset of key features that contribute to the prediction results is selected to obtain the processed feature set $X' = \{x'_{1}, x'_{2}, \dots, x'_{n}\}$, where $x'_{i} = S(x_{i})$ is the screened feature data at the i-th time step.

The output is a time series of air quality indexes $Y = \{y_{1}, y_{2}, \dots, y_{n}\}$, where $y_{i}$ is the AQI at the i-th time step. Through a series of data processing and feature learning operations F, the model establishes an association between the screened features and the concurrent AQI values, that is:

y_{i} = F (x_{i}^{'}) = F (S (x_{i})) (i = 1, 2, \dots, n)

(1)

Among these components, operation F encompasses core links like feature acquisition and temporal sequence correlation modeling of the deep learning network, feature weight allocation of the attention mechanism, and hyperparameter optimization. Combined with the previous feature screening step, the prediction accuracy of the features at the i-th time step for the concurrent AQI value is improved through multi-step collaboration.

2.3. Relief_F Feature Selection Method

Relief evaluates feature importance by calculating the ability of features to distinguish instances of different classes [23]. It can effectively identify features related to the target variable, reduce redundant information, and improve model efficiency and generalizability.

The Relief_F algorithm randomly selects a sample R from the dataset, and then finds its k nearest neighbor samples $H_{near}$ in the same class and its k nearest neighbor samples $M_{near}$ in other classes. For each feature A, its weight $W(A)$ is updated as follows:

W (A) = W (A) - \sum_{i = 1}^{k} \frac{d (A, R, H_{near}^{i})}{m \times k} + \sum_{i = 1}^{k} \frac{d (A, R, M_{near}^{i})}{m \times k}

(2)

Here, $d(A, R, X)$ represents the distance between sample R and sample X in terms of feature A, and m is the size of the dataset. Distances to same-class neighbors lower the weight and distances to other-class neighbors raise it, so a feature that separates the classes accumulates a large weight. After multiple iterations, features with larger weights are considered more important for the classification or prediction task and thus are retained, while redundant features with smaller weights are eliminated.
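The weight update can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes discrete class labels, a range-normalized absolute difference for d(A, R, X), and it pools all other classes when searching for misses instead of weighting them by class prior. It follows the usual Relief convention in which same-class neighbors lower a weight and other-class neighbors raise it. The function name `relieff_weights` and its parameters are hypothetical.

```python
import numpy as np

def relieff_weights(X, y, k=3, n_samples=None, seed=0):
    """Toy Relief-style feature weights (assumed simplification, see lead-in)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    m = n if n_samples is None else n_samples
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # avoid division by zero
    w = np.zeros(d)
    for idx in rng.choice(n, size=m, replace=(m > n)):
        diff = np.abs(X - X[idx]) / span       # d(A, R, X) per sample and feature
        dist = diff.sum(axis=1)
        dist[idx] = np.inf                     # never pick R as its own neighbor
        same = np.flatnonzero(y == y[idx])
        other = np.flatnonzero(y != y[idx])
        hits = same[np.argsort(dist[same])[:k]]
        hits = hits[np.isfinite(dist[hits])]   # drop R if the class is tiny
        misses = other[np.argsort(dist[other])[:k]]
        w -= diff[hits].sum(axis=0) / (m * k)    # same-class distance lowers W(A)
        w += diff[misses].sum(axis=0) / (m * k)  # other-class distance raises W(A)
    return w

# Hypothetical toy data: feature 0 tracks the class, feature 1 is noise.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
features = np.column_stack([labels + 0.05 * rng.standard_normal(100),
                            rng.standard_normal(100)])
weights = relieff_weights(features, labels, k=5)
```

Features whose weight exceeds a threshold would then be retained; choosing that threshold is the role the paper assigns to the BAT algorithm.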

2.4. BAT Hyperparameter Optimization Method

The BAT algorithm simulates the flight and hunting behavior of bats, has strong global search ability, and can automatically find the optimal hyperparameter combination of the model [24]. Each bat represents a set of hyperparameters. Bats search the optimal solution space by adjusting their positions and velocities. The update formulas for the position and velocity of bats are:

x_{i}^{t + 1} = x_{i}^{t} + v_{i}^{t}

(3)

v_{i}^{t + 1} = ω v_{i}^{t} + (x^{*} - x_{i}^{t}) \times f_{i}

(4)

Here, $x_{i}^{t}$ is the position of the i-th bat at the t-th iteration (corresponding to a hyperparameter combination), $v_{i}^{t}$ is its velocity, $ω$ is the inertia weight, $x^{*}$ is the position of the current global optimal solution, and $f_{i}$ is the frequency of the i-th bat. The frequency $f_{i}$ is calculated as:

f_{i} = f_{m i n} + (f_{m a x} - f_{m i n}) \times β

(5)

Here, $f_{min}$ and $f_{max}$ are the minimum and maximum frequencies, respectively, and $β$ is a random number in $[0, 1]$.

In addition, bats also have the ability of random search and perform local search with a certain probability to prevent the algorithm from falling into local optimality. During the search process, the quality of each bat’s position is evaluated, and the global optimal solution is continuously refreshed.
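As a rough illustration of Equations (3)–(5), the sketch below searches a one-dimensional threshold with a simplified bat-style loop. Several choices are assumptions, not taken from the paper: the position update uses the freshly updated velocity (the common convention), loudness and pulse-rate schedules are omitted, the local search is a Gaussian walk around the current best, and the quadratic objective is a hypothetical stand-in for a validation error. The name `bat_search` and all default values are illustrative.

```python
import numpy as np

def bat_search(objective, lower, upper, n_bats=20, n_iter=100,
               f_min=0.0, f_max=2.0, omega=0.7, local_prob=0.3, seed=0):
    """Simplified bat-style minimizer over a scalar in [lower, upper]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, size=n_bats)      # positions x_i
    v = np.zeros(n_bats)                            # velocities v_i
    fitness = np.array([objective(xi) for xi in x])
    best, best_fit = x[fitness.argmin()], fitness.min()
    for _ in range(n_iter):
        beta = rng.random(n_bats)
        f = f_min + (f_max - f_min) * beta          # Eq. (5)
        v = omega * v + (best - x) * f              # Eq. (4)
        cand = np.clip(x + v, lower, upper)         # Eq. (3), with updated v
        # Local search: some bats take a small random step around the best.
        local = rng.random(n_bats) < local_prob
        step = 0.01 * (upper - lower) * rng.standard_normal(n_bats)
        cand[local] = np.clip(best + step[local], lower, upper)
        cand_fit = np.array([objective(c) for c in cand])
        improved = cand_fit < fitness               # greedy acceptance
        x[improved] = cand[improved]
        fitness[improved] = cand_fit[improved]
        if fitness.min() < best_fit:                # refresh the global best
            best, best_fit = x[fitness.argmin()], fitness.min()
    return float(best), float(best_fit)

# Hypothetical objective whose minimum sits at a threshold of 0.3.
best_threshold, best_error = bat_search(lambda t: (t - 0.3) ** 2, 0.0, 1.0)
```

In the paper's setting the objective would instead score a candidate Relief_F threshold, for example by the validation error of the downstream predictor.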

2.5. Standard Deep Learning Model

In this model, the CNN, BiLSTM and GRU together form the core part of feature extraction and time-series modeling, and they work collaboratively to capture the beneficial details in the input data.

CNN has a key advantage in effectively extracting local data features. Through the convolution operation between the convolution kernels in the convolution layer and the input data, the perception of local region features is realized. Next, the down-sampling process of the pooling layer is employed to reduce data dimensionality, and ultimately, the fully connected layer is utilized for feature integration [25]. The CNN has been employed in air pollution prediction and delivered high-performance prediction outcomes [26]. In air quality prediction, CNN can be used to extract local correlation information between features such as different pollutant concentrations and meteorological factors. The mathematical expression of its convolution operation is:

z_{i, j}^{l} = \sum_{m = 0}^{M - 1} \sum_{n = 0}^{N - 1} w_{m, n}^{l} \times x_{i + m, j + n}^{l - 1} + b^{l}

(6)

Here, $z_{i,j}^{l}$ is the output feature value at position $(i, j)$ of the l-th convolution layer, $w_{m,n}^{l}$ is the weight of the convolution kernel of the l-th layer, $x_{i+m,j+n}^{l-1}$ is the input at position $(i + m, j + n)$ of the (l − 1)-th layer, M and N are the height and width of the convolution kernel, and $b^{l}$ is the bias term. The average pooling operation of the pooling layer is expressed as:

p_{i, j}^{l} = \frac{1}{S \times S} \sum_{m = 0}^{S - 1} \sum_{n = 0}^{S - 1} z_{i \times S + m, j \times S + n}^{l}

(7)

Here, $p_{i,j}^{l}$ is the output of the l-th pooling layer at position (i, j), and S is the size of the pooling window.
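Equations (6) and (7) can be checked numerically with a small NumPy sketch. It assumes a single channel, no padding, unit stride for the convolution, and non-overlapping pooling windows. The function names are illustrative only.

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """Eq. (6): z[i, j] = sum_m sum_n w[m, n] * x[i + m, j + n] + b."""
    M, N = w.shape
    H, W = x.shape
    z = np.empty((H - M + 1, W - N + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            z[i, j] = np.sum(w * x[i:i + M, j:j + N]) + b
    return z

def avg_pool(z, S):
    """Eq. (7): mean over non-overlapping S x S windows."""
    H, W = z.shape[0] // S, z.shape[1] // S
    return z[:H * S, :W * S].reshape(H, S, W, S).mean(axis=(1, 3))

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5 x 5 input
w = np.ones((2, 2))                            # 2 x 2 kernel of ones
z = conv2d_valid(x, w)                         # shape (4, 4)
p = avg_pool(z, 2)                             # shape (2, 2)
```

With this toy input, `z[0, 0]` is 0 + 1 + 5 + 6 = 12, and `p[0, 0]` is the mean of 12, 16, 32, and 36, which is 24.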

BiLSTM is an extended form of the LSTM network, which consists of a forward LSTM and a backward LSTM [27]. LSTM can effectively handle the long-term dependence problem in long-sequence data through three gating mechanisms: input gate, forget gate, and output gate [28], avoiding the common gradient vanishing or explosion phenomenon in traditional Recurrent Neural Network (RNN). The mathematical formulas of its gating mechanisms are as follows:

Input gate: i_{t} = σ (W_{ii} x_{t} + W_{hi} h_{t-1} + b_{i})

(8)

Forget gate: f_{t} = σ (W_{if} x_{t} + W_{hf} h_{t-1} + b_{f})

(9)

Output gate: o_{t} = σ (W_{io} x_{t} + W_{ho} h_{t-1} + b_{o})

(10)

Candidate memory cell: {\tilde{C}}_{t} = \tanh (W_{ic} x_{t} + W_{hc} h_{t-1} + b_{c})

(11)

Memory cell: C_{t} = f_{t} ⊙ C_{t-1} + i_{t} ⊙ {\tilde{C}}_{t}

(12)

Hidden state: h_{t} = o_{t} ⊙ \tanh (C_{t})

(13)

Here, $σ$ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, ⊙ represents element-wise multiplication, W denotes a weight matrix, b denotes a bias vector, $x_{t}$ is the input at the current time step, $h_{t-1}$ is the hidden state at the previous time step, and $C_{t-1}$ is the memory cell at the previous time step. On this basis, BiLSTM extracts information from both the forward and backward directions of the sequence, thereby more comprehensively capturing the bidirectional dependence relationship of the sequence data. Assuming that the output of the forward LSTM is $\overrightarrow{h}_{t}$ and the output of the backward LSTM is $\overleftarrow{h}_{t}$, the final output of BiLSTM is $h_{t} = [\overrightarrow{h}_{t}; \overleftarrow{h}_{t}]$, where $[;]$ represents vector concatenation.
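To make Equations (8)–(13) and the bidirectional concatenation concrete, here is a minimal NumPy sketch. It assumes randomly initialized weights, zero initial states, and separate parameter sets for the two directions. The helper names (`init_params`, `lstm_step`, `run_lstm`, `bilstm`) are hypothetical and no training is performed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(d_in, d_h, rng):
    """Random weights for the four gates i, f, o, c (illustrative only)."""
    p = {}
    for g in "ifoc":
        p[f"W_i{g}"] = 0.5 * rng.standard_normal((d_h, d_in))
        p[f"W_h{g}"] = 0.5 * rng.standard_normal((d_h, d_h))
        p[f"b_{g}"] = np.zeros(d_h)
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    i = sigmoid(p["W_ii"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])          # Eq. (8)
    f = sigmoid(p["W_if"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])          # Eq. (9)
    o = sigmoid(p["W_io"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])          # Eq. (10)
    c_tilde = np.tanh(p["W_ic"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])    # Eq. (11)
    c = f * c_prev + i * c_tilde                                          # Eq. (12)
    h = o * np.tanh(c)                                                    # Eq. (13)
    return h, c

def run_lstm(xs, p, d_h):
    h, c = np.zeros(d_h), np.zeros(d_h)
    out = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        out.append(h)
    return np.stack(out)

def bilstm(xs, p_fwd, p_bwd, d_h):
    fwd = run_lstm(xs, p_fwd, d_h)
    bwd = run_lstm(xs[::-1], p_bwd, d_h)[::-1]   # re-align to forward time order
    return np.concatenate([fwd, bwd], axis=1)    # h_t = [forward; backward]

rng = np.random.default_rng(0)
xs = rng.standard_normal((6, 3))                 # 6 time steps, 3 features
p_f, p_b = init_params(3, 4, rng), init_params(3, 4, rng)
out = bilstm(xs, p_f, p_b, 4)                    # shape (6, 8)
```

The forward half of `out[0]` depends only on the first input, while the backward half has already seen the whole sequence, which is the property BiLSTM exploits.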

GRU is a simplified recurrent neural network, which has a more concise structure than LSTM and only includes a reset gate and an update gate [29]. Its mathematical expressions are:

Reset gate: r_{t} = σ (W_{ir} x_{t} + W_{hr} h_{t-1} + b_{r})

(14)

Update gate: z_{t} = σ (W_{iz} x_{t} + W_{hz} h_{t-1} + b_{z})

(15)

Candidate hidden state: {\tilde{h}}_{t} = \tanh (W_{ih} x_{t} + r_{t} ⊙ (W_{hh} h_{t-1}) + b_{h})

(16)

Hidden state: h_{t} = (1 - z_{t}) ⊙ h_{t-1} + z_{t} ⊙ {\tilde{h}}_{t}

(17)
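A single GRU step following Equations (14)–(17) can be sketched in NumPy as below. The weights are random placeholders and the function name `gru_step` is hypothetical. The example also shows the role of the update gate: when it is driven toward zero, the hidden state is carried over almost unchanged.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    r = sigmoid(p["W_ir"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])              # Eq. (14)
    z = sigmoid(p["W_iz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])              # Eq. (15)
    h_tilde = np.tanh(p["W_ih"] @ x_t + r * (p["W_hh"] @ h_prev) + p["b_h"])  # Eq. (16)
    return (1.0 - z) * h_prev + z * h_tilde                                   # Eq. (17)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = {k: rng.standard_normal((d_h, d_in)) for k in ("W_ir", "W_iz", "W_ih")}
params.update({k: rng.standard_normal((d_h, d_h)) for k in ("W_hr", "W_hz", "W_hh")})
params.update({k: np.zeros(d_h) for k in ("b_r", "b_z", "b_h")})

x_t = rng.standard_normal(d_in)
h_prev = rng.uniform(-1.0, 1.0, d_h)
h_t = gru_step(x_t, h_prev, params)

# A strongly negative update-gate bias pushes z_t toward 0, so h_t stays near h_prev.
closed = dict(params, b_z=np.full(d_h, -50.0))
h_kept = gru_step(x_t, h_prev, closed)
```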

However, the CNN, BiLSTM, and GRU models pay the same attention to all features or time steps during processing, making it difficult to highlight the key information. Thus, incorporating the attention mechanism to allocate distinct weights to different features and time steps can optimize the air quality prediction performance.

2.6. Attention Mechanism

The attention mechanism imitates the process of human attention concentration, and during the processing of sequence data, it is capable of assigning varying weights to information in different positions [30]. In air quality prediction, by adding Attention after the GRU layer, the weights between different elements are calculated.

The attention mechanism mainly uses three vectors for calculation: Query (Q), Key (K), and Value (V). Assume that the input sequence is $X = [x_{1}, x_{2}, \dots, x_{n}]$, where $x_{i}$ represents the GRU output feature vector at the i-th time step. The Q, K, and V vectors are generated by multiplying $x_{i}$ with the learnable weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$, respectively. The calculation formulas are as follows:

Q_{i} = W^{Q} \cdot x_{i}

(18)

K_{i} = W^{K} \cdot x_{i}

(19)

V_{i} = W^{V} \cdot x_{i}

(20)

The additive Attention mechanism can better handle input sequences of different lengths. The specific calculation formula is:

e_{i j} = v^{T} tanh (W_{1} Q_{i} + W_{2} K_{j})

(21)

Here, $e_{ij}$ represents the attention score between the query $Q_{i}$ at the i-th position and the key $K_{j}$ at the j-th position. $W_{1}$, $W_{2}$, and v are learnable parameters. The tanh function is used for non-linear transformation to enhance the model’s ability to express complex relationships.

Moreover, the softmax function is used to normalize the scores to obtain the attention weights $α_{ij}$:

α_{i j} = \frac{\exp (e_{i j})}{\sum_{k = 1}^{n} \exp (e_{i k})}

(22)

The attention weight $α_{ij}$ represents the degree of attention paid to the j-th position by the i-th position, and $\sum_{j=1}^{n} α_{ij} = 1$.

Finally, the attention weights are used to generate the output representation of each time step through weighted summation:

O_{i} = \sum_{j = 1}^{n} α_{i j} V_{j}

(23)

Here, $O_{i}$ is the final output of the i-th time step, which integrates the information of each position in the entire input sequence and focuses on the information highly relevant to the current position.
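Equations (18)–(23) can be traced end to end with a compact NumPy sketch. The dimensions, random weights, and the function name `additive_attention` are illustrative assumptions. Rows of `X` stand in for the GRU outputs, and the softmax subtracts the row maximum for numerical stability, which does not change the result.

```python
import numpy as np

def additive_attention(X, W_q, W_k, W_v, W_1, W_2, v):
    """X: (n, d) sequence of feature vectors; returns outputs O and weights alpha."""
    Q = X @ W_q.T                                   # Eq. (18): Q_i = W^Q x_i
    K = X @ W_k.T                                   # Eq. (19)
    V = X @ W_v.T                                   # Eq. (20)
    proj_q = Q @ W_1.T                              # W_1 Q_i for every i
    proj_k = K @ W_2.T                              # W_2 K_j for every j
    # Eq. (21): e_ij = v^T tanh(W_1 Q_i + W_2 K_j), built by broadcasting
    e = np.tanh(proj_q[:, None, :] + proj_k[None, :, :]) @ v
    e = e - e.max(axis=1, keepdims=True)            # stabilize the softmax
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)       # Eq. (22)
    O = alpha @ V                                   # Eq. (23)
    return O, alpha

rng = np.random.default_rng(0)
n, d, d_a, d_s = 5, 4, 3, 6                         # steps, input, attention, score dims
X = rng.standard_normal((n, d))
O, alpha = additive_attention(
    X,
    rng.standard_normal((d_a, d)), rng.standard_normal((d_a, d)),
    rng.standard_normal((d_a, d)),
    rng.standard_normal((d_s, d_a)), rng.standard_normal((d_s, d_a)),
    rng.standard_normal(d_s),
)
```

Each row of `alpha` sums to one, so every output `O[i]` is a convex combination of the value vectors.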

3. Experiment

3.1. Dataset Description

In this study, the AQI is used as a comprehensive evaluation indicator for air pollution. Compared with traditional air pollution indices, this indicator not only covers a more comprehensive range of pollutant detection but also improves the objectivity of evaluation results through more stringent classification limit standards. To comprehensively investigate the temporal variation patterns of AQI and its influencing factors, this study collected daily average AQI data for three cities from 1 January 2018 to 31 December 2024, resulting in 2513 daily observations for each city. All data were obtained from http://www.tianqihoubao.com. At the same time, to construct a more complete analytical framework, the study also obtained two types of key variables related to AQI from this website: one type is air pollution indicators, including $\mathrm{PM}_{2.5}$, $\mathrm{PM}_{10}$, $\mathrm{SO}_{2}$, $\mathrm{NO}_{2}$, CO, and $\mathrm{O}_{3}$, which affect air quality through physical and chemical effects; the other type is environmental meteorological indicators, where temperature and precipitation indicators are selected to reflect the urban environmental meteorological conditions.

Considering the non-stationary characteristics of the AQI time series, the study introduced Wavelet Coherency (WTC) to explore the time-frequency correlation law between AQI and air pollutant indicators [31]. This method combines the multi-scale analysis capability of wavelet transform with the idea of coherence analysis, which can quantify the covariance intensity of two time series and present the correlation and phase relationship of signals at different time and frequency scales.

The wavelet coherence results between AQI and air quality factors in Beijing, Tianjin, and Shijiazhuang are shown in Figure 2, Figure 3 and Figure 4. Each point in the wavelet coherence image represents the coherence value at a specific time and scale. High coherence values indicate strong signal synchronization; the direction of the arrows reflects the phase relationship, where an arrow pointing horizontally to the right represents in-phase, one pointing to the left represents anti-phase, and the vertical direction reflects the lead-lag order between a factor and AQI. The area enclosed by a solid line passes the red noise test at the 5% significance level, and the area within the conical dashed line is the exclusion area.

The study divides the time-frequency scales into small (period < 32 days), medium (32 days < period < 128 days), and large (period > 128 days) scales, and conducts specific analyses for the three cities of Beijing, Tianjin, and Shijiazhuang.

At the small time-frequency scale, the yellow and orange areas in the images are extensive, and pollutants in most regions generally show high coherence with AQI. For example, the AQI of the three cities shows in-phase positive coherence with CO, $\mathrm{PM}_{2.5}$, and $\mathrm{PM}_{10}$, while the yellow areas in the coherence images with $\mathrm{O}_{3}$, $\mathrm{NO}_{2}$, and $\mathrm{SO}_{2}$ are small, indicating weak coherence.

At the medium time-frequency scale, the coherence distribution varies, and some pollutants in certain regions still maintain a certain degree of coherence in specific medium-to-long cycles. For instance, $\mathrm{O}_{3}$ had a significant impact from 2022 to 2023, and $\mathrm{NO}_{2}$ showed annual periodic changes. The phase relationship between AQI and CO in the three cities is complex over time, and there is no significant coherence between AQI and $\mathrm{SO}_{2}$.

At the large time-frequency scale, the AQI of Beijing shows negative coherence with $\mathrm{O}_{3}$, where $\mathrm{O}_{3}$ lags behind AQI by approximately 1/4 of a cycle, and there is no coherence with CO. This also indicates that CO mainly affects the abrupt changes in air quality. For all remaining factors, the factor leads the AQI of Beijing. For Tianjin, the AQI shows a leading relationship with $\mathrm{PM}_{2.5}$, $\mathrm{NO}_{2}$, and $\mathrm{SO}_{2}$, and a lagging relationship with $\mathrm{O}_{3}$. For Shijiazhuang, the relationship between AQI and $\mathrm{O}_{3}$ is the same as that in Tianjin, and the AQI shows a leading relationship with CO, $\mathrm{PM}_{2.5}$, and $\mathrm{NO}_{2}$.

In summary, the AQI of each city shows strong coherence with $\mathrm{PM}_{2.5}$ and $\mathrm{PM}_{10}$ across all time-frequency scales. At small and medium time-frequency scales, the AQI has strong coherence with CO, indicating that changes in CO concentration have a rapid and significant impact on AQI in a short period, and pollution fluctuations are easily reflected in the air quality index quickly. In contrast, the large time-frequency scale reflects the continuous effect of processes such as regional pollution transmission and accumulation on AQI over a longer time scale.

3.2. Data Preparation

Step 1: Data Preprocessing

The step of AQI data cleaning is crucial to ensure the accuracy of the final results. First, identify the NaN values in the AQI data, temporarily store their positions, and remove samples containing NaN values to avoid interfering with calculations. Next, use the Hampel filter to process the time-series AQI data, where time is used as the X vector and AQI values as the Y vector. The filter automatically sets a reasonable window half-width and anomaly threshold: it detects AQI outliers that significantly deviate from the normal fluctuation range, then replaces these outliers with the median of the AQI data within the corresponding local window. Finally, reinsert the previously temporarily stored NaN values back into their original positions to complete the AQI data cleaning.
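The cleaning procedure described above can be sketched as follows. The paper does not state the window half-width or the threshold, so the values below (half-width 3, threshold of 3 scaled median absolute deviations) are assumptions, and `hampel_clean` is a hypothetical helper, not the authors' code. The AQI values in the example are made up for illustration.

```python
import numpy as np

def hampel_clean(y, half_width=3, n_sigmas=3.0):
    """Set NaNs aside, replace local outliers by the window median, restore NaNs."""
    y = np.asarray(y, dtype=float)
    nan_mask = np.isnan(y)                 # remember where the NaNs were
    vals = y[~nan_mask]
    cleaned = vals.copy()
    for i in range(len(vals)):
        lo = max(0, i - half_width)
        hi = min(len(vals), i + half_width + 1)
        window = vals[lo:hi]
        med = np.median(window)
        # 1.4826 scales the median absolute deviation to a Gaussian std estimate
        mad = 1.4826 * np.median(np.abs(window - med))
        if np.abs(vals[i] - med) > n_sigmas * mad:
            cleaned[i] = med               # replace the outlier with the local median
    out = np.full_like(y, np.nan)
    out[~nan_mask] = cleaned               # reinsert NaNs at their original positions
    return out

# Hypothetical AQI series with one missing value and one spike.
aqi = np.array([50, 52, 51, np.nan, 53, 500, 52, 51, 50, 52], dtype=float)
aqi_clean = hampel_clean(aqi)
```

In this toy series the spike of 500 is replaced by the local median 52, the missing value stays missing, and the remaining points are unchanged.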

Step 2: Dataset Division

There are altogether 2513 samples in the original data. In this research, the data spanning from 2018 to 2024 is split into a training set, a validation set, and a test set with a proportion of 7:1:2 [32]. The dataset division for Beijing, Tianjin, and Shijiazhuang is presented in Figure 5.
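A 7:1:2 division can be sketched as below, assuming a chronological split without shuffling and floor rounding for the first two parts. Both are assumptions: the exact boundaries used in the paper are those shown in Figure 5 and may differ by a few samples. The helper name `chronological_split` is hypothetical.

```python
def chronological_split(n, ratios=(7, 1, 2)):
    """Return index slices for train, validation, and test in time order."""
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    # The test set takes whatever remains, so every sample is used exactly once.
    return (slice(0, n_train),
            slice(n_train, n_train + n_val),
            slice(n_train + n_val, n))

train_idx, val_idx, test_idx = chronological_split(2513)
```

Keeping the order intact means the test period lies strictly after the training and validation periods, which avoids leaking future observations into training.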

Step 3: Model Training

The AQI data, air pollution indicator data, and environmental meteorological indicator data are used as the input of the BAT-Relief_F-CNN-BiLSTM-GRU-Attention model. After training the model, it can predict the AQI data for multiple future time steps.

3.3. Evaluation Metrics and Experiment Design

3.3.1. Evaluation Metrics

This study uses a system of five metrics: Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), Symmetric Directional Absolute Percentage Error (SDAPE), Direction Accuracy (DA) of the prediction results, and Performance Parameter (PP). The calculation formulas and descriptions of each metric are as follows:

MAPE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_{c}(i) - \hat{y}_{c}(i)}{y_{c}(i)} \right| \times 100\%

(24)

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{c}(i) - \hat{y}_{c}(i) \right)^{2}}

(25)

SDAPE = \frac{2}{N} \sum_{i=1}^{N} \frac{\left| y_{c}(i) - \hat{y}_{c}(i) \right|}{\left| y_{c}(i) \right| + \left| \hat{y}_{c}(i) \right|}

(26)

DA = \frac{1}{l} \sum_{i=1}^{l} \omega_{i}^{l}

(27)

\omega_{i}^{l} = \begin{cases} 1, & \text{if } \left( y_{c}(i+1) - y_{c}(i) \right) \left( \hat{y}_{c}(i+1) - \hat{y}_{c}(i) \right) > 0 \\ 0, & \text{otherwise} \end{cases}

(28)

PP = 1 - \frac{N^{-1} \sqrt{\sum_{i=1}^{N} \left[ \hat{y}_{c}(i) - y_{c}(i) \right]^{2}}}{\sqrt{y_{c}(i)}}

(29)

Among these symbols, ∑ denotes the summation operation, y_{c}(i) denotes the observed (true) value, \hat{y}_{c}(i) denotes the predicted value, N denotes the number of samples, c denotes the city, and i denotes the time.
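To make Equations (24) to (28) concrete, the following sketch computes MAPE, RMSE, SDAPE, and DA with NumPy for a single city. The function names are illustrative, and taking l in Equation (27) as the N − 1 consecutive pairs is an assumption. PP is omitted because the denominator of Equation (29) as printed does not specify how y_{c}(i) is aggregated.

```python
import numpy as np

def mape(y, y_hat):
    """Equation (24): mean absolute percentage error, in percent."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)

def rmse(y, y_hat):
    """Equation (25): root mean square error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def sdape(y, y_hat):
    """Equation (26): symmetric absolute percentage error (as a fraction)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(2.0 * np.mean(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))))

def direction_accuracy(y, y_hat):
    """Equations (27)-(28): share of steps where truth and forecast move the same way.

    Assumes l = N - 1, i.e. every consecutive pair is evaluated.
    """
    dy = np.diff(np.asarray(y, float))
    dy_hat = np.diff(np.asarray(y_hat, float))
    return float(np.mean(dy * dy_hat > 0))

y_true = [100, 110, 105, 120]       # made-up AQI values for illustration
y_pred = [98, 112, 107, 118]
print(rmse(y_true, y_pred))                # 2.0
print(direction_accuracy(y_true, y_pred))  # 1.0
print(round(mape(y_true, y_pred), 2))      # 1.85
```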

3.3.2. Experiment Design

This study introduces several types of benchmark models for comparative evaluation. The statistical learning benchmark is the ARIMAX model, and the machine learning benchmark is the XGBoost model. The neural network benchmarks comprise three standard models (GRU, LSTM, and BiLSTM) and two combined models (CNN-BiLSTM and CNN-BiLSTM-GRU-Attention).

In addition, this study introduces two types of prediction models related to feature selection as benchmarks. The first type is feature selection-prediction models, comprising four models: Grey-CNN-BiLSTM-GRU-Attention, Lasso-CNN-BiLSTM-GRU-Attention, Relief_F-CNN-BiLSTM-GRU-Attention, and Principal Component Analysis (PCA)-CNN-BiLSTM-GRU-Attention. Here, “Grey” refers to feature selection based on the grey relational grade; “Lasso” performs feature selection by shrinking the coefficients of unimportant features to zero; “Relief_F” uses the Relief_F model to calculate the distance between features and the similarity between instances, thereby evaluating feature importance and screening out the features with the strongest discriminative ability for classification; and PCA retains as much information as possible by calculating the contribution rate and cumulative contribution rate of each factor.

The second type is algorithm-optimized feature selection-prediction models, comprising three models: BAT-Relief_F-CNN-BiLSTM-GRU-Attention, GWO-Relief_F-CNN-BiLSTM-GRU-Attention, and DA-Relief_F-CNN-BiLSTM-GRU-Attention. The Bat-inspired Algorithm (BAT) simulates the flight and hunting behaviors of bats and has strong global search capability; the Grey Wolf Optimizer (GWO) simulates the leader-follower behaviors of grey wolf packs and has fast convergence speed and good local search capability; the Dragonfly Algorithm (DA) simulates the behaviors of dragonflies when searching for food and is a heuristic optimization algorithm based on bionics principles. In model names, DA denotes the Dragonfly Algorithm rather than the Direction Accuracy metric.

The basic parameters of the proposed model are set as follows. To improve the reliability of the results, the model was trained 10 times and the final prediction indicators were averaged. The specific parameters of the models are displayed in Table 1.

4. Result Analysis

4.1. Comparison Results of Single Model with Multivariate Input

The results of the comparison experiment for single models with multivariate input are shown in Table 2. Across the three cities (Beijing, Tianjin, and Shijiazhuang), the CNN-BiLSTM-GRU-Attention (CBGA) model achieves the lowest MAPE among the comparative models. This superiority stems from the complementary advantages of the hybrid network: CNN extracts local spatial features of the multivariate input data, BiLSTM and GRU capture bidirectional and long-term temporal dependencies, respectively, and the integrated attention mechanism emphasizes key information. Together, these components enhance the model’s ability to mine complex AQI variation patterns. The full proposed model, which adds BAT-optimized Relief_F feature selection to this network (BAT-RCBGA, Table 4), attains an 82.33% lower MAPE than XGBoost in Shijiazhuang (1.09% versus 6.17%). The MAPE values of CBGA in Beijing, Tianjin, and Shijiazhuang are 2.35%, 4.43%, and 2.12%, respectively, which are 9.26%, 10.51%, and 4.71% lower than those of the CNN-BiLSTM model. The traditional statistical model ARIMAX performs the worst, with a MAPE of 15.55% in Beijing, indicating that ARIMAX is poorly suited to this task. Although XGBoost outperforms ARIMAX, it is limited by the capacity of tree-based models to forecast air quality, and its MAPE in Shijiazhuang still reaches 6.17%, which is 2.9 times that of CBGA.

The DA values of the CNN-BiLSTM-GRU-Attention model all exceeded 0.95, and its PP values (column P in Table 2) all exceeded 0.5 in the air quality prediction tasks for the three cities. This indicates that the model has a strong ability to judge the changing trends of air quality and delivers stable overall prediction performance.

4.2. Comparison Results of Feature Selection Methods

Four feature selection methods (PCA, Relief_F, Lasso, and Grey) were selected for comparison. Default threshold values were adopted for feature screening, and a threshold of 0.5 was used in this study. The results are shown in Table 3. Compared with prediction using all variables without feature selection, the results after PCA, Lasso, and Grey feature selection were unstable: prediction accuracy improved in some cases and declined in others. This is because the default parameters used for feature selection fail to identify the optimal features, which leads to the loss of important feature information.

By contrast, the model with Relief_F feature screening achieved the lowest MAPE of the four feature selection methods in all three cities. Relative to the CBGA model without feature selection, its MAPE values in Beijing, Tianjin, and Shijiazhuang decreased by 22.55%, 68.17%, and 32.55%, respectively, while its RMSE values decreased by 78.81%, 81.17%, and 74.03%, respectively.
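The percentage reductions quoted here are relative reductions with respect to the CBGA baseline. A short check against the Table 3 values (the helper name `relative_reduction` is illustrative) reproduces them:

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Values from Table 3, ordered Beijing, Tianjin, Shijiazhuang.
cbga_mape, relief_mape = [2.35, 4.43, 2.12], [1.82, 1.41, 1.43]
cbga_rmse, relief_rmse = [17.79, 20.29, 10.36], [3.77, 3.82, 2.69]

mape_drop = [round(relative_reduction(b, n), 2) for b, n in zip(cbga_mape, relief_mape)]
rmse_drop = [round(relative_reduction(b, n), 2) for b, n in zip(cbga_rmse, relief_rmse)]
print(mape_drop)  # [22.55, 68.17, 32.55]
print(rmse_drop)  # [78.81, 81.17, 74.03]
```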

4.3. Comparison Results of Adaptive Hyperparameter Optimization

A default threshold can cause the model to lose important feature information, so this parameter needs to be optimized. In this study, BAT, GWO, and DA are used to optimize the feature selection threshold of the Relief_F algorithm, enabling it to select features that are more beneficial to the prediction results. The experimental results of adaptive hyperparameter optimization for the Relief_F-CNN-BiLSTM-GRU-Attention (RCBGA) model are shown in Table 4.

Compared with the model without hyperparameter optimization, the model optimized by the BAT algorithm exhibited reduced errors: its MAPE values in Beijing, Tianjin, and Shijiazhuang decreased by 45.05%, 18.43%, and 23.77%, respectively, and its RMSE values decreased by 48.27%, 41.36%, and 62.82%, respectively. Among these, Shijiazhuang showed the largest reduction in RMSE, from 2.69 to 1.00 (Table 4). Additionally, Table 4 indicates that the error of the BAT algorithm after multiple rounds of parameter optimization is smaller than that of the GWO and DA algorithms on nearly all indicators; the exception is Shijiazhuang, where GWO attains a slightly lower MAPE (1.02% versus 1.09%). Therefore, in air quality prediction, the BAT algorithm also makes the model’s prediction performance more stable.

To depict the model fit more clearly, the first 200 test set data points were selected for demonstration. The forecast results are shown in Figure 6. As shown in the figure, the curves of Grey-CBGA and PCA-CBGA deviate significantly from the true values, indicating the poor performance of the Grey and PCA methods. In contrast, the Relief_F algorithm, when used as a feature selection method, achieved high prediction accuracy. After adopting the adaptive hyperparameter optimization method, the fit of the model improved further. Among these methods, the BAT algorithm yielded the largest gain in prediction accuracy and the best fit.

These results can be explained by the mechanisms of the two algorithms. The Relief_F algorithm is based on an iterative update mechanism of feature weights: it quantifies the correlation strength between features and AQI by calculating the differential contribution of each feature between samples of the same class and samples of different classes, thereby determining the contribution of each influencing factor to air quality prediction. The BAT algorithm, on the other hand, is based on a bionic optimization mechanism that simulates bat echolocation: by simulating the behavior of bats emitting sound waves and receiving echoes, it dynamically adjusts the search strategy using parameters such as pulse frequency and loudness to achieve adaptive optimization of the Relief_F feature selection threshold. The two algorithms are complementary: Relief_F does not rely on assumptions about the probability distribution of the data, and the BAT algorithm reduces the errors caused by manual selection of the feature selection threshold.
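The threshold search described above can be illustrated with a one-dimensional sketch of the basic bat algorithm. This is a simplified illustration under several assumptions, not the authors' implementation: loudness and pulse rate are held constant at the Table 1 values, the frequency range and local step size are arbitrary, and `toy_validation_error` is a hypothetical stand-in for "select the features whose Relief_F weight exceeds the threshold, train the network, and return the validation error".

```python
import numpy as np

def bat_minimize(objective, lower, upper, n_bats=10, n_iter=100,
                 loudness=0.2, pulse_rate=0.5, f_min=0.0, f_max=2.0, seed=0):
    """Minimize a scalar function of one variable with a basic bat algorithm."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, n_bats)          # candidate thresholds
    v = np.zeros(n_bats)                           # velocities
    fitness = np.array([objective(xi) for xi in x])
    best = float(x[np.argmin(fitness)])
    best_fit = float(fitness.min())
    for _ in range(n_iter):
        for i in range(n_bats):
            # Frequency-tuned velocity and position update.
            freq = f_min + (f_max - f_min) * rng.random()
            v[i] += (x[i] - best) * freq
            cand = float(np.clip(x[i] + v[i], lower, upper))
            # With probability (1 - pulse_rate), walk locally around the best bat.
            if rng.random() > pulse_rate:
                step = 0.01 * rng.standard_normal()   # assumed local step size
                cand = float(np.clip(best + step, lower, upper))
            f_cand = float(objective(cand))
            # Accept the candidate for this bat if it is no worse, gated by loudness.
            if f_cand <= fitness[i] and rng.random() < loudness:
                x[i], fitness[i] = cand, f_cand
            # Track the global best separately.
            if f_cand < best_fit:
                best, best_fit = cand, f_cand
    return best, best_fit

def toy_validation_error(threshold):
    """Hypothetical error curve with its minimum at threshold = 0.37."""
    return (threshold - 0.37) ** 2 + 0.01

best_threshold, best_error = bat_minimize(toy_validation_error, 0.0, 1.0)
print(best_threshold, best_error)
```

In a real setting each objective evaluation would involve training the network, so the population size and iteration count directly determine the computational cost of the search.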

4.4. Robustness Test

The robustness test evaluates the model’s resistance to interference: disturbances are added to the input air quality data, and the prediction results are checked to see whether they remain accurate. In this study, disturbances of 5%, 7%, and 10% were added to the input data of the test set, and the network trained on the original training set was still used for prediction.
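One way to set up such a test is sketched below. The form of the disturbance is not specified in the text, so multiplicative uniform noise bounded by the disturbance level is an assumption here, and `perturb` and the sample values are purely illustrative.

```python
import numpy as np

def perturb(x, level, seed=0):
    """Scale each input value by a random factor in [1 - level, 1 + level].

    The uniform multiplicative noise model is an assumption.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    noise = rng.uniform(-level, level, size=x.shape)
    return x * (1.0 + noise)

# Two made-up test samples with three input features each.
x_test = np.array([[80.0, 35.0, 0.9],
                   [120.0, 60.0, 1.4]])

for level in (0.05, 0.07, 0.10):
    x_noisy = perturb(x_test, level)
    max_rel_change = float(np.max(np.abs(x_noisy - x_test) / x_test))
    print(level, round(max_rel_change, 4))
    # x_noisy would then be passed to the already trained model, and the
    # metrics recomputed against the unperturbed targets.
```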

The results of the noise disturbance verification are shown in Table 5. As the disturbance intensity increased, the model error indicators fluctuated slightly but remained low. In Beijing, the MAPE values under 5% and 7% disturbances were both 0.91%; the RMSE rose from 1.95 to 2.73 under 5% disturbance but fell back to 2.36 under 7% disturbance, and the DA value remained stable at around 0.96. In Tianjin, the MAPE under 10% disturbance was 1.73%, only 0.58 percentage points higher than without disturbance, and the DA value stayed at or above 0.95. In Shijiazhuang, the MAPE increased to 3.63% and the RMSE increased to 2.73 under 10% disturbance; this may be related to Shijiazhuang being an industrial city, where pollutant sources are more complex and concentration fluctuations are more intense. Even so, its DA value under 10% disturbance was still 0.93, indicating that the model still judges the changing trend of pollution reliably. Overall, the model maintains stable prediction performance when the data contain a certain degree of noise or fluctuation, meeting the anti-interference requirement of practical air quality early warning.

4.5. DM Test

The Diebold-Mariano (DM) test is used to compare the performance of the models quantitatively. It applies a statistical test to determine whether the prediction errors of the models differ significantly in AQI prediction, which avoids the judgment bias of relying purely on subjective comparison of error values [30]. The DM test uses three indicators, MAPE, MSE, and Mean Absolute Deviation (MAD), to validate the prediction efficiency of the models. The outcomes are presented in Table 6, Table 7 and Table 8.

The null hypothesis of the DM test is that the prediction efficiency of the two models is identical, that is, the average values of their loss sequences are the same. The alternative hypothesis is that the prediction efficiency of the two models differs. A DM value above 0 indicates that the prediction effect of Model 2 is superior to that of Model 1. In this research, each benchmark model served as Model 1, and the model proposed in this paper served as Model 2.

Based on the DM test results, under all three evaluation indicators, the DM statistics of the proposed model against each benchmark model were notably larger than 0, and all p-values were below 0.05. At the 0.05 significance level, the null hypothesis of “no difference in prediction efficiency between the two types of models” can therefore be rejected, verifying that the proposed model has higher prediction accuracy. Its superiority is not due to random factors but reflects a statistically significant enhancement.
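The sign convention above (DM > 0 favors Model 2) can be illustrated with a minimal sketch of the DM statistic. It assumes one-step-ahead forecasts, so the variance of the loss differential is estimated without autocovariance terms, and it uses the standard normal distribution for a two-sided p-value. The function name `dm_test` and the synthetic data are illustrative only.

```python
import math
import numpy as np

def dm_test(y, pred_model1, pred_model2, loss="mse"):
    """Return (DM statistic, two-sided p-value); DM > 0 favors Model 2."""
    y, p1, p2 = (np.asarray(a, dtype=float) for a in (y, pred_model1, pred_model2))
    e1, e2 = y - p1, y - p2
    if loss == "mse":
        l1, l2 = e1 ** 2, e2 ** 2
    elif loss == "mad":
        l1, l2 = np.abs(e1), np.abs(e2)
    elif loss == "mape":
        l1, l2 = np.abs(e1 / y), np.abs(e2 / y)
    else:
        raise ValueError(f"unknown loss: {loss}")
    d = l1 - l2                         # positive where Model 2 has the smaller loss
    n = len(d)
    dm = d.mean() / math.sqrt(d.var() / n)
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|DM|))
    return float(dm), float(p_value)

# Synthetic example: Model 1 (benchmark) is noisier than Model 2 (proposed).
rng = np.random.default_rng(0)
y = 100.0 + 20.0 * np.sin(np.linspace(0.0, 12.0, 200))
benchmark = y + rng.normal(0.0, 5.0, size=y.size)
proposed = y + rng.normal(0.0, 1.0, size=y.size)

for loss in ("mape", "mse", "mad"):
    dm, p = dm_test(y, benchmark, proposed, loss=loss)
    print(loss, dm > 0, p < 0.05)
```

For multi-step forecasts, the variance estimate would normally include autocovariances of the loss differential up to lag h − 1, which this sketch leaves out.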

5. Conclusions

High-precision air quality prediction is not only a key support for coordinated regional pollution control, but also an important prerequisite for ensuring the health and travel safety of the public. To meet the demand for high-precision air quality prediction in the Beijing–Tianjin–Hebei region, this study combines feature selection, hyperparameter optimization, and a hybrid neural network. First, the Relief_F algorithm is used to screen key features. Then, the BAT algorithm is employed for adaptive optimization of the feature selection threshold of Relief_F. Finally, a hybrid neural network is constructed to capture spatiotemporal features and temporal dependencies. The core innovation lies in integrating an attention mechanism into the hybrid neural network, while further improving model performance through the coordination of feature selection and hyperparameter optimization. The experimental findings demonstrate that this approach markedly outperforms ARIMAX, XGBoost, and individual deep learning models. The MAPE in Beijing, Tianjin, and Shijiazhuang is as low as 1.00–1.15%, which is 18.43–45.05% lower than that of the same network with Relief_F feature selection at the default threshold (RCBGA). Moreover, robustness tests and DM tests confirm that the proposed method has strong anti-interference ability and that its improvement is statistically significant, and that it can effectively handle the nonlinear and non-stationary characteristics of air quality data.

To broaden the application scope of the proposed framework, future research could integrate temperature and other environmental variables to establish a combined evaluation network for air pollution and heat exposure, drawing on studies that have validated the nonlinear impacts of urban morphology on microclimate and thermal environment [33,34]. Additionally, incorporating multi-source data such as ground monitoring, satellite observations, and meteorological reanalysis can further enhance the spatial/temporal resolution of the evaluation, as demonstrated in relevant urban environmental assessment studies. Such extensions would provide more comprehensive decision support for low-carbon urban planning and public health risk management.

Author Contributions

L.Z. conceptualization, methodology, software; K.F. validation, formal analysis, data curation, writing—original draft; X.G. methodology, resources, investigation, writing—review and editing; T.H. writing—review and editing, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zaid, M.; Nawale, P.; Kumar, V.; Malyan, V.; Sahu, M. Investigating IoT-Based Low-cost Sensor Network for Real-Time Hyper-Local Air Quality Monitoring and Exposure Assessment. Atmos. Pollut. Res. 2025, 102749.
2. Gryech, I.; Assad, C.; Ghogho, M.; Kobbane, A. Applications of machine learning and IoT for Outdoor Air Pollution Monitoring and Prediction: A Systematic Literature Review. Eng. Appl. Artif. Intell. 2024, 137, 109182.
3. Kumar, V.; Malyan, V.; Sahu, M. Significance of Meteorological Feature Selection and Seasonal Variation on Performance and Calibration of a Low-Cost Particle Sensor. Atmosphere 2022, 13, 587.
4. Mallet, V.; Sportisse, B. Air quality modeling: From deterministic to stochastic approaches. Comput. Math. Appl. 2008, 55, 2329–2337.
5. Wang, Z.; Chen, L.; Chen, H.; Yang, J.; Rehman, N.U. Graph signal processing meets machine learning: Multi-scale spatial-temporal ensemble learning methodology for air quality forecasting. Expert Syst. Appl. 2025, 291, 128538.
6. Liao, H.; Yuan, L.; Wu, M.; Chen, H. Air quality prediction by integrating mechanism model and machine learning model. Sci. Total Environ. 2023, 899, 165646.
7. Han, T.; Gu, X.; Li, D.; Chen, K.; Cong, R.G.; Zhao, L.T.; Wei, Y.M. Causal neural network for carbon prices probabilistic forecasting. Appl. Energy 2025, 397, 126343.
8. Han, T.; Wang, X.; Guo, J.; Chang, Z.; Chen, Y. Health-Aware Joint Learning of Scale Distribution and Compact Representation for Unsupervised Anomaly Detection in Photovoltaic Systems. IEEE Trans. Instrum. Meas. 2025, 74, 3538811.
9. Zhong, W.; Zhai, D.; Xu, W.; Gong, W.; Yan, C.; Zhang, Y.; Qi, L. Accurate and efficient daily carbon emission forecasting based on improved ARIMA. Appl. Energy 2024, 376, 124232.
10. Ma, L.; Gao, Y.; Zhao, C. Research on Machine Learning Prediction of Air Quality Index Based on SPSS. In Proceedings of the 2020 International Conference on Computer Network, Electronic and Automation (ICCNEA), Xi’an, China, 25–27 September 2020; pp. 1–5.
11. Mendes, L.; Monjardino, J.; Ferreira, F. Air Quality Forecast by Statistical Methods: Application to Portugal and Macao. Front. Big Data 2022, 5, 826517.
12. Kulkarni, M.; Raut, A.; Chavan, S.; Rajule, N.; Pawar, S. Air Quality Monitoring and Prediction using SVM. In Proceedings of the 2022 6th International Conference On Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 26–27 August 2022; pp. 1–4.
13. Yu, X.; Wong, M.S.; Lee, K.H. Attention mechanism augmented random forest model for multiple air pollutants estimation. Int. J. Appl. Earth Obs. Geoinf. 2025, 141, 104661.
14. Varghese, A.A.; Krishnadas, J.; Antony, A.M. Robust Air Quality Prediction Based on Regression and XGBoost. In Proceedings of the 2023 Advanced Computing and Communication Technologies for High Performance Applications (ACCTHPA), Ernakulam, India, 20–21 January 2023; pp. 1–6.
15. Van, N.H.; Van Thanh, P.; Tran, D.N.; Tran, D.T. A new model of air quality prediction using lightweight machine learning. Int. J. Environ. Sci. Technol. 2023, 20, 2983–2994.
16. Zheng, Y.; Yi, X.; Li, M.; Li, R.; Shan, Z.; Chang, E.; Li, T. Forecasting Fine-Grained Air Quality Based on Big Data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 2267–2276.
17. Zhang, Y.; Lv, Q.; Gao, D.; Shen, S.; Dick, R.; Hannigan, M.; Liu, Q. Multi-group encoder-decoder networks to fuse heterogeneous data for next-day air quality prediction. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; AAAI Press: Palo Alto, CA, USA, 2019; Volume IJCAI’19, pp. 4341–4347.
18. Zhou, Z. Air Quality Prediction Based on Improved LSTM Model. In Proceedings of the 2023 4th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 7–9 April 2023; pp. 392–395.
19. Wang, S.; McGibbon, J.; Zhang, Y. Predicting high-resolution air quality using machine learning: Integration of large eddy simulation and urban morphology data. Environ. Pollut. 2024, 344, 123371.
20. Gilik, A.; Ogrenci, A.S.; Ozmen, A. Air quality prediction using CNN+LSTM-based hybrid deep learning architecture. Environ. Sci. Pollut. Res. 2022, 29, 11920–11938.
21. Jamei, M.; Ali, M.; Malik, A.; Karbasi, M.; Sharma, E.; Yaseen, Z.M. Air quality monitoring based on chemical and meteorological drivers: Application of a novel data filtering-based hybridized deep learning model. J. Clean. Prod. 2022, 374, 134011.
22. Tao, H.; Al-Sulttani, A.O.; Saad, M.A.; Ahmadianfar, I.; Goliatt, L.; Hassan Kazmi, S.S.U.; Alawi, O.A.; Marhoon, H.A.; Tan, M.L.; Yaseen, Z.M. Optimized ensemble deep random vector functional link with nature inspired algorithm and boruta feature selection: Multi-site intelligent model for air quality index forecasting. Process Saf. Environ. Prot. 2024, 191, 1737–1760.
23. Zhang, B.; Li, Y.; Chai, Z. A novel random multi-subspace based ReliefF for feature selection. Knowl.-Based Syst. 2022, 252, 109400.
24. Meng, X.B.; Gao, X.; Liu, Y.; Zhang, H. A novel bat algorithm with habitat selection and Doppler effect in echoes for optimization. Expert Syst. Appl. 2015, 42, 6350–6364.
25. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019.
26. Marpaung, Y.P.L.; Nuha, H.H.; Oktaria, D.; Sailellah, H. Air Pollution Forecasting using Integrated Weather Stations and Convolutional Neural Network (CNN) Algorithm. In Proceedings of the 2024 International Conference on Data Science and Its Applications (ICoDSA), Kuta, Bali, Indonesia, 10–11 July 2024; pp. 178–182.
27. Li, F.; Liu, S.; Wang, T.; Liu, R. Optimal planning for integrated electricity and heat systems using CNN-BiLSTM-Attention network forecasts. Energy 2024, 309, 133042.
28. Al Mehedi, M.A.; Amur, A.; Metcalf, J.; McGauley, M.; Smith, V.; Wadzuk, B. Predicting the performance of green stormwater infrastructure using multivariate long short-term memory (LSTM) neural network. J. Hydrol. 2023, 625, 130076.
29. Zhou, R.; Hu, C.; Ou, T.; Wang, Z.; Zhu, Y. Intelligent GRU-RIC Position-Loop Feedforward Compensation Control Method With Application to an Ultraprecision Motion Stage. IEEE Trans. Ind. Inform. 2024, 20, 5609–5621.
30. Meng, S.; Shi, Z.; Peng, M.; Li, G.; Zheng, H.; Liu, L.; Zhang, L. Landslide displacement prediction with step-like curve based on convolutional neural network coupled with bi-directional gated recurrent unit optimized by attention mechanism. Eng. Appl. Artif. Intell. 2024, 133, 108078.
31. Wang, Y.; Yang, G.; Yuan, S.; Zhang, H.; Tang, H. Nonstationary response of hydrology and water quality in river-connnected lakes: Comparative analysis of pre- and post- three Gorges Dam. J. Hydrol. 2025, 662, 133958.
32. Madokoro, H.; Nix, S. Multimodal Particulate Matter Prediction: Enabling Scalable and High-Precision Air Quality Monitoring Using Mobile Devices and Deep Learning Models. Sensors 2025, 25, 4053.
33. Chen, S.; Wong, N.H.; Zhang, W.; Ignatius, M. The impact of urban morphology on the spatiotemporal dimension of estate-level air temperature: A case study in the tropics. Build. Environ. 2023, 228, 109843.
34. Yu, Z.; Yu, R.; Ge, X.; Fu, J.; Hu, Y.; Chen, S. Tabular prior-data fitted network for urban air temperature inference and high temperature risk assessment. Sustain. Cities Soc. 2025, 128, 106484.

Figure 1. Research framework for AQI prediction. * represents the optimal feature weight.

Figure 2. WTC results of Beijing.

Figure 3. WTC results of Tianjin.

Figure 4. WTC results of Shijiazhuang.

Figure 5. AQI time series.

Figure 6. Results of multi-model comparison for AQI prediction.

Table 1. Model parameter settings.

Model	Parameters	Values
CNN	Filter size, Number of filters, Padding, Stride, Dilation Factor	3, 8, “same”, 1, 2
GRU	Number of hidden units	220
LSTM	Number of hidden units	220
BiLSTM	Number of hidden units	100
Attention	Number of attention heads, Feature dimension	4, 64
Grey	Distinguishing coefficient	0.5
Lasso	Lambda, Threshold for feature selection	5, 0.5
Relief_F	Threshold for feature selection	0.6
PCA	Number of principal components	5
BAT	Population size, Loudness, Pulse rate	10, 0.2, 0.5
GWO	Number of search agents, Maximum iterations	20, 100
DA	Number of search agents, Maximum iterations	20, 100

Table 2. Comparison results of single-model with multivariate input.

Model	MAPE (%)			RMSE			SDAPE			DA			P
Model	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ
ARIMAX	15.55	13.33	13.75	21.63	22.01	18.07	15.40	13.20	13.61	0.85	0.86	0.84	0.53	0.50	0.65
XGBoost	9.93	6.76	6.17	16.48	18.10	13.74	13.39	8.62	15.99	0.89	0.94	0.93	0.66	0.65	0.72
GRU	3.67	4.88	2.83	18.56	22.33	10.68	3.63	4.84	2.81	0.96	0.95	0.96	0.59	0.49	0.79
LSTM	3.14	5.12	2.77	19.64	25.27	12.06	3.11	5.07	2.74	0.96	0.95	0.96	0.57	0.43	0.77
BiLSTM	2.82	4.94	2.46	18.45	24.60	10.21	2.88	4.99	2.43	0.96	0.95	0.96	0.59	0.44	0.80
CNN-BiLSTM	2.59	4.95	2.22	17.83	21.09	11.48	2.54	4.89	2.18	0.95	0.95	0.96	0.60	0.52	0.79
CBGA	2.35	4.43	2.12	17.79	20.29	10.36	2.39	4.37	2.37	0.96	0.96	0.96	0.59	0.51	0.80

Note: BJ stands for Beijing; TJ stands for Tianjin; SJZ stands for Shijiazhuang.

Table 3. Comparison results of feature selection methods.

Model	MAPE (%)			RMSE			SDAPE			DA			P
Model	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ
CBGA	2.35	4.43	2.12	17.79	20.29	10.36	2.39	4.37	2.37	0.96	0.96	0.96	0.59	0.51	0.80
Grey-CBGA	2.75	1.89	2.29	3.76	3.44	2.79	2.72	1.87	2.27	0.95	0.95	0.95	0.92	0.92	0.95
Lasso-CBGA	1.94	1.70	1.79	3.73	3.28	2.27	1.92	1.68	1.78	0.96	0.95	0.96	0.92	0.93	0.96
PCA-CBGA	2.06	1.90	1.85	3.30	3.51	2.32	2.04	1.88	1.84	0.95	0.95	0.96	0.93	0.92	0.96
Relief_F-CBGA	1.82	1.41	1.43	3.77	3.82	2.69	1.80	1.39	1.42	0.96	0.95	0.96	0.92	0.91	0.95

Note: BJ stands for Beijing; TJ stands for Tianjin; SJZ stands for Shijiazhuang.

Table 4. Comparison results of adaptive hyperparameter optimization.

Model	MAPE (%)			RMSE			SDAPE			DA			P
Model	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ	BJ	TJ	SJZ
RCBGA	1.82	1.41	1.43	3.77	3.82	2.69	1.80	1.39	1.42	0.96	0.95	0.96	0.92	0.91	0.95
GWO-RCBGA	1.18	2.43	1.02	2.86	5.77	1.73	1.16	2.41	1.01	0.96	0.93	0.96	0.94	0.87	0.97
DA-RCBGA	1.42	1.32	1.59	2.21	2.24	1.60	1.40	1.31	1.58	0.96	0.96	0.96	0.95	0.95	0.97
BAT-RCBGA	1.00	1.15	1.09	1.95	2.24	1.00	0.99	1.13	1.08	0.96	0.96	0.97	0.96	0.95	0.98

Note: BJ stands for Beijing; TJ stands for Tianjin; SJZ stands for Shijiazhuang.

Table 5. Robustness test results.

Cities	Perturbation	MAPE (%)	RMSE	SDAPE	DA	P
Beijing	0	1.00	1.95	0.99	0.96	0.96
	0.05	0.91	2.73	0.90	0.97	0.94
	0.07	0.91	2.36	0.90	0.96	0.95
	0.10	1.74	2.18	1.72	0.96	0.95
Tianjin	0	1.15	2.24	1.13	0.96	0.95
	0.05	1.38	3.01	1.37	0.96	0.93
	0.07	1.15	3.56	1.14	0.95	0.92
	0.10	1.73	2.83	1.72	0.96	0.94
Shijiazhuang	0	1.09	1.00	1.08	0.97	0.98
	0.05	1.65	1.32	1.63	0.96	0.97
	0.07	1.71	1.56	1.70	0.96	0.96
	0.10	3.63	2.73	3.60	0.93	0.95

Table 6. DM test results of Beijing.

Model	MAPE (%)	MSE	MAD
ARIMAX	31.25 *	4.23 *	17.99 *
XGBoost	17.70 *	2.47 *	10.67 *
GRU	13.81 *	2.19 *	5.73 *
LSTM	10.18 *	2.05 *	5.08 *
BiLSTM	15.25 *	2.29 *	6.33 *
CNN-BiLSTM	16.14 *	2.51 *	7.22 *
CBG-Attention	13.58 *	2.17 *	4.64 *
PCA-CBG-Attention	8.22 *	2.07 *	5.00 *
Relief_F-CBG-Attention	7.92 *	2.14 *	5.02 *
Lasso-CBG-Attention	9.55 *	2.11 *	5.21 *
Grey-CBG-Attention	14.81 *	2.10 *	6.30 *
GWO-Relief_F-CBG-Attention	8.19 *	2.11 *	4.82 *
DA-Relief_F-CBG-Attention	8.22 *	2.06 *	4.83 *

Note: * indicates p < 0.05.

Table 7. DM test results of Tianjin.

Model	MAPE (%)	MSE	MAD
ARIMAX	28.72 *	4.81 *	17.52 *
XGBoost	17.36 *	2.12 *	8.85 *
GRU	17.72 *	2.31 *	7.44 *
LSTM	17.62 *	2.31 *	7.12 *
BiLSTM	17.86 *	2.32 *	7.29 *
CNN-BiLSTM	19.48 *	2.91 *	9.83 *
CBG-Attention	17.48 *	3.01 *	9.76 *
PCA-CBG-Attention	11.76 *	2.65 *	6.33 *
Relief_F-CBG-Attention	8.68 *	2.64 *	5.33 *
Lasso-CBG-Attention	10.50 *	2.58 *	5.72 *
Grey-CBG-Attention	12.94 *	2.57 *	6.13 *
GWO-Relief_F-CBG-Attention	11.98 *	2.64 *	7.40 *
DA-Relief_F-CBG-Attention	8.97 *	2.57 *	5.22 *

Note: * indicates p < 0.05.

Table 8. DM test results of Shijiazhuang.

Model	MAPE (%)	MSE	MAD
ARIMAX	29.52 *	6.65 *	21.99 *
XGBoost	8.63 *	2.09 *	8.62 *
GRU	10.91 *	1.80 *	6.22 *
LSTM	10.18 *	1.89 *	5.86 *
BiLSTM	10.70 *	1.90 *	5.99 *
CNN-BiLSTM	12.26 *	2.09 *	5.93 *
CBG-Attention	15.73 *	2.05 *	6.41 *
PCA-CBG-Attention	8.02 *	2.07 *	5.54 *
Relief_F-CBG-Attention	12.33 *	2.12 *	7.27 *
Lasso-CBG-Attention	10.61 *	2.10 *	6.62 *
Grey-CBG-Attention	8.69 *	2.04 *	5.90 *
GWO-Relief_F-CBG-Attention	7.92 *	2.04 *	5.51 *
DA-Relief_F-CBG-Attention	9.20 *	2.03 *	5.93 *

Note: * indicates p < 0.05.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
