Article

Air Quality Index Prediction Based on Transformer Encoder–CNN–BiLSTM Model

School of Mathematics and Systems Science, Wuhan University of Science and Technology, Wuhan 430065, China
* Author to whom correspondence should be addressed.
Atmosphere 2026, 17(3), 249; https://doi.org/10.3390/atmos17030249
Submission received: 14 January 2026 / Revised: 24 February 2026 / Accepted: 24 February 2026 / Published: 27 February 2026
(This article belongs to the Special Issue Air Quality in China (4th Edition))

Abstract

Accurate AQI forecasting is essential for public health and environmental management. However, existing network models for AQI forecasting still exhibit limited predictive accuracy, and current research gives insufficient consideration to key influencing factors. We therefore present a hybrid model, Transformer Encoder–CNN–BiLSTM. The model not only considers the influence of six major atmospheric pollutant factors (PM2.5, PM10, CO, NO2, SO2, O3), but also offers advantages in modeling long-range dependencies in time series, extracting local features, and capturing the periodicity and seasonal trends of AQI. Taking Shanghai, China as the research object, the R2, MAE and RMSE of the proposed model are 0.9781, 2.4266 and 4.0321, respectively, far superior to those of the comparison models. In a cross-city validation experiment, AQI forecasting for Beijing, which has distinct climatic conditions from Shanghai while sharing the same national AQI standard and a similar dominant pollutant structure, also demonstrates favorable performance, with an R2 of 0.9712, an MAE of 3.1275 and an RMSE of 6.6269. The results indicate that the model can effectively forecast the AQI of Chinese megacities with consistent AQI evaluation criteria.

1. Introduction

Urban air pollution is worsening due to rapid urbanization and industrialization [1,2], posing health risks and raising concerns about future air quality. To quantitatively evaluate ambient air quality, the air quality index (AQI) has been widely adopted as a comprehensive indicator that integrates the concentrations of multiple pollutants into a single numerical value. According to the national air quality standard (GB3095-2012), six major pollutants, namely ozone (O3), carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), PM2.5 and PM10, are incorporated into the AQI calculation [3]. Accurately predicting the air quality index has become an important research topic for regional development and public health [4].
The technologies for air quality forecasting can be categorized into three types: numerical simulation, statistics, and deep learning. Numerical simulation methods [5,6] are based on atmospheric theory and employ models such as the Gauss model, Lagrange model, Euler model, and chemical transport models (CTMs) [7], which are widely used to simulate atmospheric pollutants, especially the criteria pollutants included in AQI calculations. These models characterize complex atmospheric physical and chemical processes and realistically represent air pollution dynamics by integrating emission inventories, meteorological fields, and chemical reaction mechanisms. Meanwhile, statistical methods [8,9] utilize classification, regression, filtering, and fitting algorithms to identify correlations among heterogeneous data based on historical records and subsequently predict air quality. Although these traditional approaches have been extensively studied, their predictive accuracy is limited by the complex, nonlinear, and non-stationary properties of air pollution processes. As deep learning rapidly advances, data-driven models have substantially improved the accuracy and reliability of air quality forecasting and are poised to play a pivotal role in high-resolution, real-time prediction.
Artificial neural networks (ANNs), the foundational technology of deep learning, possess strong nonlinear modeling and adaptive learning capabilities, making them well suited for air quality prediction. For example, in 2020, He et al. investigated the predictive accuracy of LSTM models for forecasting AQI over time intervals ranging from 0 to 48 h [10]; their research shows that LSTM is more robust than traditional neural networks. In 2021, Krishna et al. proposed the MTCAN model to predict PM2.5 concentrations [11]. This approach preserves the temporal relationships among recorded features, meteorological data and pollution data to fill missing PM2.5 values. These studies demonstrate the effectiveness of deep learning-based approaches in capturing complex air quality dynamics.
Although multiple studies have explored combining the Transformer with LSTM/BiLSTM and other recurrent structures for air quality prediction in recent years, this study exhibits essential differences in both methodological design and modeling objectives compared with existing works. Specifically, Liu et al. [12] and Dong et al. [13] mainly enhance temporal dependency modeling through Transformer–BiLSTM cascade architectures or by introducing signal decomposition strategies (such as EMD). While these approaches improve predictive performance to a certain extent, they also increase model complexity and introduce additional modeling procedures. In contrast, this paper constructs a unified end-to-end Transformer Encoder–CNN–BiLSTM framework that directly models multivariate pollutant concentration data (O3, CO, NO2, SO2, PM2.5, and PM10) without requiring signal decomposition or factor analysis. Within this framework, a convolutional neural network (CNN) is employed to extract local temporal features, the Transformer encoder models long-range temporal dependencies [14], and BiLSTM further captures bidirectional temporal dynamics, thereby enabling effective fusion of multi-level temporal features within a single model. In addition, unlike factor-analysis-based approaches relying on latent variable transformations [15,16], the proposed method preserves original pollutant-level information, which enhances interpretability and robustness in practical AQI prediction tasks. Based on these design considerations, this study proposes a prediction framework with a distinct modeling strategy and complementary significance relative to existing Transformer–BiLSTM-based approaches.
The primary contributions of this paper are as follows: (1) Existing AQI prediction models suffer from several problems, such as poor prediction accuracy and a lack of long-sequence modeling. We therefore collected air quality data for Shanghai, China, and are the first to propose the Transformer Encoder–CNN–BiLSTM hybrid model for predicting AQI from the concentrations of atmospheric pollutants (O3, CO, NO2, SO2, PM2.5, PM10). (2) The model integrates the self-attention mechanism of the Transformer, the local feature extraction of the CNN, and the temporal modeling capability of the BiLSTM, enabling effective feature extraction and fusion. In particular, the Transformer encoder improves the ability to capture long-range temporal dependencies, which are difficult for conventional approaches. (3) The model exhibits high predictive accuracy under the evaluation metrics MAE, RMSE, and R2, demonstrating smaller errors than other methods. Moreover, the model also performs well in generalization tests, proving its high practical value.

2. Theory of Transformer Encoder–CNN–BiLSTM Hybrid Model

2.1. Transformer Encoder

As illustrated in Figure 1, the core of the Transformer is the self-attention mechanism, where the scaled dot-product attention computes the degree of correlation. The formulation is given as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Q, K and V represent the query, key and value feature matrices, respectively; $d_k$ is the dimension of the key vectors, used to scale the dot products; and softmax is the activation function.
The specific values of Q, K and V are obtained by transforming the input. In addition, to calculate the weighted sum, K and V are transformed from the same input. It is worth noting that the self-attention mechanism selects V by calculating the similarity between Q and K. Obviously, a recurrent neural network (RNN) acquires global information through gradual recursion based on previous hidden states, whereas a Transformer acquires global information directly from the perspective of the entire sequence.
Building on the scaled dot-product attention mechanism, the Transformer introduces multi-head attention mechanism that splits Q, K, and V into h parts after they are transformed, and then it computes the scaled dot-product attention mechanism separately for each part (Figure 2a). The “multi-headed” in the multi-head attention mechanism means that the above steps are performed multiple times, and the formula is as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
In the formulas above, $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ represent the parameter matrices for the i-th head, with i ranging over [1, h], and $W^{O}$ is the parameter matrix used for the output. In comparison to a single-head attention mechanism, the advantage of multi-head attention is that it can learn information from different subspaces.
Finally, the classic Transformer encoder structure is shown in Figure 2, consisting of multi-head attention layers and feed-forward network layers, each accompanied by a residual connection and layer normalization. Multi-head attention allows the model to consider the information of all other data points while processing each data point, effectively capturing different levels of dependency in the series; in contrast to conventional RNN structures, it enhances computational efficiency and feature extraction capability. The addition (ADD) structure, inspired by ResNet, helps alleviate gradient vanishing and exploding in deep networks, and layer normalization (Norm) improves training efficiency. In recent years, this structure has made significant contributions in various fields [17,18].
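The scaled dot-product and multi-head attention formulas above can be sketched directly in NumPy (a minimal illustration of the computation only; the sequence length, model dimension and head count here are arbitrary, not the paper's configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row maximum for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # transform the input into Q, K, V, split each into h heads,
    # attend per head, then concatenate and project with W_o
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1] // h
    heads = [attention(Q[:, i*d:(i+1)*d], K[:, i*d:(i+1)*d], V[:, i*d:(i+1)*d])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))          # 12 time steps, model dimension 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h=2)
```

Because the softmax weights in each row sum to one, every attention output is a convex combination of the value rows, which is what lets each position aggregate information from the whole sequence in parallel.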

2.2. Convolutional Neural Network

CNNs are effective at processing grid-structured data and have been widely used in image recognition tasks [19,20]. As illustrated in Figure 3, the input data are successively processed by convolutional and pooling layers for local feature extraction and dimensionality reduction, followed by classification or regression through fully connected and output layers.
The convolutional layer is a key component in CNNs. It uses a convolution kernel to extract features from the input variables and is only connected to a part of the neurons from its preceding layer. Moreover, the convolutional layer utilizes convolution calculations rather than matrix multiplication to produce feature maps. The computation for each element of the feature map is given by:
$$x_{i,j}^{\mathrm{out}} = f_{\mathrm{cov}}\left(\sum_{m=0}^{k}\sum_{n=0}^{k} w_{m,n}\, x_{i+m,\,j+n}^{\mathrm{in}} + b\right)$$
where $x_{i,j}^{\mathrm{out}}$ denotes the output value at position (i, j) of the feature map; $x_{i+m,\,j+n}^{\mathrm{in}}$ represents the input value at row i + m and column j + n of the input matrix; $f_{\mathrm{cov}}$ denotes the activation function; $w_{m,n}$ is the weight at position (m, n) of the convolution kernel; and b is the bias of the convolution kernel. The input matrix is typically convolved with several kernels; each kernel extracts features from the input matrix and generates a feature map. The subsequent pooling layer down-samples the feature map, reducing its spatial dimensions and improving computational efficiency.
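The element-wise feature-map computation can be checked with a short NumPy sketch (illustrative only: a 2x2 all-ones kernel with identity activation and zero bias, not the configuration used in the paper):

```python
import numpy as np

def conv2d_valid(x, w, b=0.0, activation=lambda v: v):
    """'Valid' 2-D convolution matching the feature-map formula:
    each output element is the activated dot product of the kernel
    with the input patch whose top-left corner is (i, j), plus a bias."""
    k = w.shape[0]                       # square k x k kernel
    H, W = x.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = activation(np.sum(w * x[i:i+k, j:j+k]) + b)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)   # toy 3x3 input
fmap = conv2d_valid(x, np.ones((2, 2)))       # 2x2 kernel -> 2x2 feature map
```

With an all-ones kernel, each feature-map element is simply the sum of the corresponding 2x2 input patch, which makes the sliding-window mechanics easy to verify by hand.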

2.3. Long Short-Term Memory and Bidirectional Long Short-Term Memory

As shown in Figure 4a, LSTM is an advanced form of RNN. Traditional RNN models lack specific constraints on the information that gets updated, often leading to confusion within the transmitted data. LSTM uses memory units to retain historical information and makes predictions based on data from multiple past time steps before the target moment; that is, LSTM can only learn information along the forward time axis. Due to this limitation, researchers developed the bidirectional long short-term memory network. BiLSTM consists of two LSTMs, one processing the forward sequence and the other the backward sequence. As shown in Figure 4, $x_t$ and $h_t$ are the input and output vectors, respectively. In the forward LSTM, input data enters sequentially in time order, with each time step's output serving as the input for the next; conversely, the backward LSTM processes the data in reverse order. The outputs of the forward and backward LSTM units are concatenated to form the final output. The model can therefore learn information from both directions of the time series simultaneously and better uncover hidden information [21,22,23]. In this study, the BiLSTM operates strictly within the 12-h historical input window described in Section 3, with no access to any future information, including the t + 1 target value.
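The forward/backward fusion described above can be sketched in NumPy with a minimal LSTM cell (an illustrative forward pass with random, untrained weights; the stacked-gate parameterization and sizes are assumptions for the sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(X, W, U, b, reverse=False):
    """One LSTM pass over X (T, d_in). W (4*d_h, d_in), U (4*d_h, d_h) and
    b (4*d_h,) stack the input, forget, cell and output gate parameters."""
    T = X.shape[0]
    d_h = U.shape[1]
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    out = np.zeros((T, d_h))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        z = W @ X[t] + U @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated memory update
        h = sigmoid(o) * np.tanh(c)                    # gated hidden output
        out[t] = h
    return out

def bilstm(X, fwd_params, bwd_params):
    # concatenate the forward and backward hidden states at each time step
    return np.concatenate([lstm_pass(X, *fwd_params),
                           lstm_pass(X, *bwd_params, reverse=True)], axis=-1)

rng = np.random.default_rng(1)
d_in, d_h = 6, 4
params = lambda: (rng.normal(size=(4 * d_h, d_in)),
                  rng.normal(size=(4 * d_h, d_h)),
                  np.zeros(4 * d_h))
H = bilstm(rng.normal(size=(12, d_in)), params(), params())  # (12, 2*d_h)
```

Each row of H pairs a summary of everything up to that hour with a summary of everything after it, which is the "both directions" property the text describes.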

2.4. Proposed Model

Figure 5 depicts the overall architecture of the Transformer Encoder–CNN–BiLSTM model, which can be divided into five parts: a data preprocessing layer, a Transformer encoder layer, a CNN layer, a BiLSTM layer, and an output layer, delineated as follows:
1.
Data Preprocessing Layer: Data preprocessing includes outlier detection, missing value filling, data set segmentation and data normalization, which will be introduced in Section 3.
2.
Transformer Encoder Layer:
  • Multi-head Attention: In order for the model to simultaneously learn information from different subspaces, the Transformer encoder utilizes a multi-head attention mechanism. This mechanism enables the model to carry out self-attention operations in parallel, where each attention head focuses on a distinct representation subspace of the input data, and the resulting information is then combined.
  • Layer Normalization and Feed-forward Networks: After the self-attention layer, the model’s training process is stabilized through layer normalization, followed by a feed-forward network that further processes the representation at each time point. The above operations not only allow Transformer encoder to capture complex time dependencies, but also extract deeper features through nonlinear transformations to enhance model expressiveness.
3.
CNN Layer:
  • Convolutional Layers: Convolutional layers slide over the input data with a convolutional kernel, calculating the dot product between the kernel and the local data, thus generating feature maps. This helps the model capture local features within the data.
  • Pooling Layers: Pooling layers down-sample feature maps through max or average pooling to reduce spatial dimensions while preserving key information.
4.
BiLSTM Layer: The core of BiLSTM is to use two separate LSTM units to process time series data: one for forward sequences (from the past to the future) and the other for reverse sequences (from the future to the past). The final output is the combination of the outputs from two LSTM units, which allows the network to consider both past and future information. By fusing forward and backward information, BiLSTM is able to provide a comprehensive view of the information at each time point. This allows the model to predict AQI more accurately through cyclical and seasonal trends.
5.
Output Layer: Finally, a weighted sum is performed on the fully connected layer to obtain a final AQI value.
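Under heavy simplification, the data flow through these five stages can be traced with stand-in NumPy transforms (a shape walk-through only: the real model uses trained Transformer, CNN and BiLSTM layers, and every size, weight and stand-in function here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
d_model = 16
W_embed = rng.normal(size=(6, d_model)) * 0.1        # stand-in for the encoder stage
kernel = rng.normal(size=(3, d_model, d_model)) * 0.1  # 1-D temporal conv weights
W_head = rng.normal(size=(2 * d_model,)) * 0.1       # output-layer weights

def encoder_stage(x):                  # (12, 6) -> (12, d_model)
    return np.tanh(x @ W_embed)

def cnn_stage(z):                      # 'same' 1-D convolution over the time axis
    pad = np.pad(z, ((1, 1), (0, 0)))
    return np.tanh(np.stack([sum(pad[t + m] @ kernel[m] for m in range(3))
                             for t in range(z.shape[0])]))

def bilstm_stage(z):                   # stand-in: forward/backward running means
    fwd = np.cumsum(z, axis=0) / np.arange(1, len(z) + 1)[:, None]
    bwd = np.cumsum(z[::-1], axis=0)[::-1] / np.arange(len(z), 0, -1)[:, None]
    return np.concatenate([fwd, bwd], axis=1)

def predict_aqi(window):               # window: 12 h x 6 pollutants -> scalar AQI
    return float(bilstm_stage(cnn_stage(encoder_stage(window)))[-1] @ W_head)

aqi = predict_aqi(rng.uniform(size=(12, 6)))
```

The point of the sketch is the tensor shapes at each hand-off: (12, 6) pollutant input, (12, d_model) after global encoding, the same time length after local convolution, (12, 2*d_model) after the bidirectional stage, and a single weighted sum at the output layer.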

3. Experiment Preparation

3.1. Experiment Environment

Computational environment: The experiments were implemented using an Intel Core i7-10870H CPU (2.20 GHz), 16 GB RAM, and an NVIDIA GeForce RTX 2070 Max-Q GPU. The model was developed in Python 3.8.0 with PyCharm 2023.3.2 (64-bit), and TensorFlow 2.9.1 was used as the deep learning framework. Based on the above hardware configuration (laptop-level GPU) and the complexity of the proposed three-module hybrid model, the total training time for the Shanghai dataset (34,874 samples) was about 2.8 h (with 50 training epochs, batch size = 64), and the average time of a single training epoch was 12.5 min. The whole data preprocessing stage (outlier detection, missing value filling, normalization and dataset division) for the full dataset took approximately 40 min, which was mainly determined by the computational efficiency of the iForest algorithm and matrix normalization operation on the CPU.
Realistic scenario setting: The model adopts a design of predicting the next 1 h from the data of the past 12 h. This not only meets the real-time requirements of hourly decision-making, such as immediate protection for sensitive groups and temporary urban traffic control, but can also be extended to multi-step prediction by means of rolling prediction to support air quality management over longer cycles. The low RMSE of the model keeps the cumulative error of multi-step prediction within a controllable range.
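The rolling-prediction extension mentioned above can be sketched as a generic iterated-forecast helper (an illustration only: writing the prediction back into one assumed target column and persisting the other columns is a simplifying assumption, since the actual model's inputs are the six pollutant series rather than past AQI):

```python
import numpy as np

def rolling_forecast(predict, window, steps, target_col=-1):
    """Iterated multi-step forecasting: predict hour t+1 from the current
    12-row window, slide the window forward one hour, write the prediction
    into the assumed target column (other columns keep their last observed
    values), and repeat for the requested number of steps."""
    window = window.copy()
    preds = []
    for _ in range(steps):
        y = predict(window)
        preds.append(y)
        nxt = window[-1].copy()
        nxt[target_col] = y
        window = np.vstack([window[1:], nxt])
    return np.asarray(preds)

toy = lambda w: w[-1, -1] + 1.0   # toy stand-in model: last value plus one
preds = rolling_forecast(toy, np.zeros((12, 6)), steps=3)
```

With the toy model, each step feeds the previous prediction back in, so the outputs climb by one per step; with a real model, this same loop is where single-step errors accumulate, which is why a low one-step RMSE matters for multi-step use.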

3.2. Data Preparation and Dataset Division

The experimental data are obtained from https://data.epmap.org/product/nationair (accessed on 23 February 2026), specifically air quality data for Shanghai from 1 January 2020, 0:00 to 31 December 2023, 23:00. The major atmospheric pollutants included in the dataset are O3, CO, NO2, SO2, PM2.5 and PM10, with a total of 34,874 sample points.
Before training, the dataset is divided into training, validation, and test sets, with an 8:2 split between training and test data, and 10% of the training set reserved for validation. A fixed historical time window is adopted for prediction, using pollutant concentration data of the past 12 consecutive hours (t − 11 to t) as input to predict the AQI value of the next hour (t + 1).
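The sliding-window construction described above can be expressed in a few lines (a minimal sketch; the function and array names are illustrative):

```python
import numpy as np

def make_windows(features, aqi, lookback=12):
    """Turn hourly records into supervised pairs: the pollutant matrix for
    hours t-11 .. t becomes one input sample, and the AQI value at hour
    t+1 becomes its target."""
    X, y = [], []
    for t in range(lookback - 1, len(features) - 1):
        X.append(features[t - lookback + 1:t + 1])   # 12 consecutive hours
        y.append(aqi[t + 1])                         # next-hour AQI target
    return np.asarray(X), np.asarray(y)

# toy check with 20 hourly records of 6 pollutant features
feats = np.arange(20 * 6, dtype=float).reshape(20, 6)
targets = np.arange(20, dtype=float)
X, y = make_windows(feats, targets)
```

For N hourly records this yields N - 12 samples of shape (12, 6); the chronological train/validation/test split is then taken over these samples.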

3.3. Data Processing

3.3.1. Outlier Detection

Outliers in time series can reduce prediction accuracy, making it essential to identify and handle them before training. Isolation Forest (iForest) is an ensemble-based algorithm for detecting outliers [24]. The algorithm isolates samples by randomly choosing a feature and then selecting a random split value for that feature; repeating this process builds an isolation tree, and repeating it over the dataset builds multiple trees that form an isolation forest. In an isolation tree, normal data points usually require more splits to be fully isolated, whereas outliers, being scarce and distinct, are isolated in fewer steps. For each sample x, its anomaly score s(x, n) is calculated as follows:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
where $E(h(x))$ denotes the mean path length of sample x across all trees, n represents the number of training samples, $H(i)$ is the harmonic number (estimated as $\ln(i) + 0.5772$, Euler's constant), and $c(n)$ is the mean path length of an unsuccessful binary search tree search, used for normalization.
The advantage of the iForest algorithm lies in its high computational efficiency and strong adaptability. Unlike methods based on statistical probability distribution, iForest is not influenced by data distribution and is suitable for various types of data, including nonlinear and high-dimensional data. Hence, in this experiment, we choose iForest algorithm to detect outliers. Figure 6 shows the comparison between the original AQI and the processed AQI.
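The anomaly score defined above can be computed directly (a sketch of the scoring formula only, using the standard harmonic-number estimate; the construction of the isolation trees themselves is omitted):

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n samples,
    used to normalize the expected isolation depth."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + EULER_GAMMA   # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)): scores near 1 flag outliers,
    scores around 0.5 or below indicate normal points."""
    return 2.0 ** (-mean_path_length / c(n))
```

A sample isolated quickly (short average path) scores close to 1, while a sample whose average path length equals the normalizing value c(n) scores exactly 0.5, which is the usual decision boundary.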

3.3.2. Data Normalization

In the field of air quality prediction, sequence data are often nonlinear, and pollutants such as CO and NO2 affect AQI to different degrees. When feature values differ greatly in magnitude, features with larger values can dominate the model and suppress the influence of features with smaller values. Therefore, feature standardization is necessary to ensure all features are treated equally and to accelerate model convergence.
Max-Min normalization is a common data preprocessing technique. It scales all feature values via a linear transformation into the range [0, 1], maintaining the original data’s relative relationships. The formula is as follows:
$$x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where the subscript norm denotes the normalized value, and $x_{\max}$ and $x_{\min}$ represent the maximum and minimum values of the feature, respectively. Max-Min normalization is easy to implement and scales the data to a unified range without destroying its internal structure.
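A minimal implementation of this scaling (the function name is illustrative; in practice the min/max are fit on the training set and reused for validation and test data to avoid information leakage):

```python
import numpy as np

def minmax_fit_transform(X):
    """Column-wise Max-Min scaling to [0, 1]; also returns the statistics
    so the identical transform can be applied to held-out data."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min), x_min, x_max

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [2.0, 20.0]])
X_norm, x_min, x_max = minmax_fit_transform(X)
```

Each column is mapped linearly so its minimum becomes 0 and its maximum becomes 1, preserving the relative spacing of all values in between.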

3.4. Evaluation Metrics

Model performance is evaluated by comparing predicted values with observed values using multiple criteria, including mean absolute error (MAE), root mean square error (RMSE), and the coefficient of determination ( R 2 ), among others. Employing multiple evaluation criteria ensures a more reliable assessment of predictive accuracy. The evaluation criteria used in this study are defined as follows:
1.
Mean Absolute Error (MAE): MAE directly gives the mean magnitude of the deviation between predicted values and actual values. Its mathematical definition is given as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where n denotes the number of observations, y i represents the true value of the i-th observation, and y ^ i denotes the corresponding predicted value.
2.
Root Mean Square Error (RMSE): RMSE is the square root of the mean squared prediction error, with units matching those of the observed variable. It is given as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
3.
Coefficient of Determination ( R 2 ): R 2 evaluates the goodness of fit between predicted and observed values in a regression model, with values typically ranging from 0 to 1. Its formula is given as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of the observed values.
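The three metrics are a direct transcription of the formulas above and can be sanity-checked on a toy example:

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])        # one prediction off by 1
```

Here one of four predictions is off by 1, giving MAE = 0.25, RMSE = 0.5, and R2 = 0.8; note how RMSE penalizes the single large error more heavily than MAE does.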

4. Experiment Result and Analysis

4.1. Single Model Prediction

To evaluate the predictive capabilities of various models in forecasting AQI, we first selected four single neural network prediction models for the experiment: CNN, Temporal Convolutional Network (TCN), BiLSTM, and Transformer. Our goal is to use data from the past 12 h to predict the AQI value of the next hour. In terms of both the evaluation metrics in Table 1 and the fitting performance shown in Figure 7, the Transformer performs best, while the CNN is slightly inferior.
The Transformer demonstrates strong performance mainly because of its ability to capture global relationships between elements and its efficient parallel computation capabilities. The BiLSTM also performs well due to its ability to handle bidirectional information flows. The CNN, being more suitable for processing spatial data, has an advantage in extracting local features but is slightly less capable in handling long sequences. Each of the above models has its own advantages, so we will use hybrid models to make further predictions for AQI.

4.2. Hybrid Model Prediction

To illustrate the advantages of the proposed model, three distinct hybrid models are chosen for comparison. The R2, MAE and RMSE values of the four prediction models for Shanghai's AQI are shown in Table 2; the proposed model performs best on all three evaluation metrics and is highlighted in bold. The histogram of the three evaluation metrics for the comparison models is shown in Figure 8. Furthermore, as depicted in Figure 9, the proposed model's AQI forecasting curve conforms most closely to the actual curve, whereas the other models merely capture the fluctuating trend of AQI and possess limited forecast precision.
Considering both prediction error and curve analysis, and taking Shanghai as an example, the Transformer Encoder–CNN–BiLSTM model demonstrates higher prediction accuracy than the other hybrid models. According to Table 2 and Figures 8 and 9, we arrive at the following conclusions.
1.
Hierarchical feature abstraction: This model integrates the strengths of three architectures—Transformer encoder, BiLSTM and CNN—allowing for the abstraction and fusion of multiple levels of features in stages. The Transformer can capture the long-range dependencies among different features and assign different weights to each input sequence through a self-attention mechanism to extract global contextual information. The CNN enhances the model’s ability to identify local features of a time series. The BiLSTM can consider past and future information in time series. Due to the temporal continuity of environmental data, BiLSTM is capable of capturing the influence of preceding and subsequent associations on AQI. This stepwise refined feature extraction helps the model predict AQI more accurately.
2.
Model ensemble advantage: Hybrid models can address the limitations of single models that exhibit excessive bias towards a particular aspect during prediction. Additionally, single models are susceptible to overfitting when applied to specific data sets, while hybrid models mitigate this risk by employing ensemble learning, thereby enhancing their robustness and generalization capabilities for novel data.

4.3. Generalization Experiment

To conduct cross-city validation of the proposed model, we select a new dataset containing hourly AQI data of Beijing from 2020 to 2023 for experimental verification. Beijing and Shanghai exhibit distinct climatic conditions, while both adopt the national AQI standard (GB3095-2012) and have similar dominant pollutant structures. The experimental results are shown in Figure 10, where the proposed model achieves an R 2 of 0.9712, MAE of 3.1275 and RMSE of 6.6269 for Beijing’s AQI dataset. The results show that the model performs stably in the cross-city validation of Chinese megacities with the same AQI evaluation system and similar pollutant composition.
To comprehensively evaluate the practical application value of the model, this study introduces the AQI grade prediction accuracy and confusion matrix as additional evaluation metrics, in addition to numerical prediction indicators such as RMSE and MAE. The model achieves a grade prediction accuracy of 95.24% on the test set, which verifies that the model can reliably identify different air quality grades and meet the requirements of practical applications.
The model’s ability to capture abrupt changes in AQI grades holds significant public health implications. For instance, the model can effectively identify the threshold crossing when the actual AQI jumps from Good to Slightly Polluted, as shown in Table 3. Of the samples whose actual grade was Good, 2029 were correctly predicted and only 85 were misclassified as Slightly Polluted, while no samples rated Excellent were misjudged as Slightly Polluted. This reliable identification of grade transitions avoids neglecting actual health risks due to minor numerical differences, ensuring the practical application value of the prediction results.

4.4. Comparison of Activation Functions

To verify the rationality of the activation function selection, this study compares the performance of four activation functions, namely rectified linear unit (ReLU), leaky rectified linear unit (LeakyReLU), Gaussian error linear unit (GELU), and Swish activation function (Swish).
This experiment was conducted on the basic framework of the Transformer Encoder–CNN–BiLSTM model, where only the type of activation function was adjusted while the other hyperparameters, including the number of network layers, batch size and learning rate, were kept consistent. The prediction performance of the model with each activation function was evaluated using three metrics, MAE, RMSE and R 2 , with the results presented in Table 4. The results show that the LeakyReLU activation function achieves the best prediction performance, yielding the lowest RMSE (4.4434) and MAE (2.6562). ReLU exhibits slightly inferior prediction accuracy due to the dying-ReLU problem occurring to a certain extent. Although GELU and Swish enhance the fitting capability of the model, they increase computational complexity, resulting in longer training times with no significant improvement in prediction performance over LeakyReLU. Considering both prediction accuracy and training efficiency, LeakyReLU is finally selected as the activation function for the proposed Transformer Encoder–CNN–BiLSTM hybrid model, providing a reliable hyperparameter basis for the subsequent performance comparison and analysis of hybrid models.
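For reference, the four activation functions compared above can be written as follows (the GELU uses the common tanh approximation, and the LeakyReLU slope of 0.01 is an assumed default, as the paper does not state its value):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # a small negative slope keeps gradients nonzero for x < 0,
    # avoiding the dying-ReLU problem mentioned in the text
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    return x / (1.0 + np.exp(-x))   # equivalent to x * sigmoid(x)
```

The key difference for negative inputs is visible directly: ReLU outputs exactly zero (and contributes no gradient), while LeakyReLU, GELU and Swish all pass through small nonzero values.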

4.5. Model Robustness Verification

To verify the robustness of the model and address concerns about the sensitivity of deep learning models to random initialization, this study conducted five independent training and evaluation runs on the optimal model with LeakyReLU as the activation function. A distinct random seed (42, 100, 200, 300, 400) was adopted for each experiment to control the potential impact of parameter initialization on model performance.
As presented in Table 5, the model exhibited extremely high stability across all runs. Specifically, the model achieved an RMSE of 4.37 ± 0.21, an MAE of 2.59 ± 0.11, an R 2 of 0.974 ± 0.003, and an AQI grade prediction accuracy of 95.18% ± 0.36%. The standard deviation of all key metrics remained at an extremely low level, indicating that the model performance was insensitive to random initialization. This confirms that the results of a single run are highly representative, and the experimental conclusions possess statistical reliability and reproducibility.

5. Conclusions

In recent years, with rapid socioeconomic development, people have become increasingly concerned about ecological and air quality issues. However, due to the influence of various factors, accurately determining future air pollutant concentrations is challenging. Traditional air quality prediction methods are relatively costly and limited in accuracy, whereas deep learning methods can substantially reduce prediction costs and improve prediction accuracy. In this research, we propose an efficient hybrid model, Transformer Encoder–CNN–BiLSTM. The model accounts for the impact of various air pollutant factors and offers advantages in modeling long-range dependencies in time series, extracting local features, and capturing periodicity and seasonal trends in AQI. It achieves favorable predictive performance in Shanghai and Beijing, two Chinese megacities with distinct climatic characteristics yet consistent national AQI standards and similar dominant pollutant structures. The results suggest that the model is potentially applicable to other Chinese cities that follow the same AQI evaluation criteria and have analogous pollutant compositions.
Future research can be primarily categorized into the following three aspects:
1.
To further account for the impact of natural disasters on AQI, future work may incorporate natural disaster prediction into the model to enable more accurate assessment of uncontrollable factors.
2.
Future studies should include data from cities around Shanghai to enhance air quality prediction by considering spatiotemporal factors.
3.
Future studies will integrate multi-source data, such as meteorological factors (wind speed, temperature, humidity), into the model to construct a comprehensive feature input, thus further improving the AQI prediction accuracy.
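The multi-source feature input proposed in the third direction can be sketched as a simple windowing step: pollutant series are concatenated with meteorological series and cut into sliding windows for next-step AQI prediction. The column layout and window length below are illustrative assumptions.

```python
# Sketch: build supervised windows from pollutant + meteorological series.
import numpy as np

def make_windows(pollutants, weather, aqi, window=24):
    """Return (X, y): X stacks pollutant and weather channels per window,
    y is the AQI one step after each window."""
    features = np.concatenate([pollutants, weather], axis=1)  # (T, P + W)
    X = np.stack([features[t:t + window] for t in range(len(aqi) - window)])
    y = aqi[window:]
    return X, y

T = 100
pollutants = np.random.rand(T, 6)   # PM2.5, PM10, CO, NO2, SO2, O3
weather = np.random.rand(T, 3)      # wind speed, temperature, humidity
aqi = np.random.rand(T)

X, y = make_windows(pollutants, weather, aqi)
print(X.shape, y.shape)  # (76, 24, 9) (76,)
```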

Author Contributions

Conceptualization, Z.S. and Q.Z.; methodology, G.C.; software, Z.S.; validation, Z.S. and G.C.; formal analysis, Q.Z.; investigation, Z.S.; resources, Z.S.; data curation, Z.S.; writing—original draft preparation, Z.S.; writing—review and editing, G.C.; visualization, Z.S.; supervision, G.C.; project administration, G.C.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62406258), the Hubei Province Key Laboratory of Systems Science in Metallurgical Process (Wuhan University of Science and Technology) (No. Z202401) and the Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology) (No. HBIR202408).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://data.epmap.org/product/nationair (accessed on 23 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AQI: Air Quality Index
ADD: Addition
BiLSTM: Bidirectional Long Short-Term Memory
CNN: Convolutional Neural Network
LSTM: Long Short-Term Memory
MAE: Mean Absolute Error
RMSE: Root Mean Square Error
RNN: Recurrent Neural Network
TCN: Temporal Convolutional Network
ReLU: Rectified Linear Unit
LeakyReLU: Leaky Rectified Linear Unit
GELU: Gaussian Error Linear Unit
Swish: Swish activation function

Figure 1. Transformer self-attention mechanism.
Figure 2. The structural diagrams of the multi-head attention mechanism and the Transformer encoder. (a) Multi-head attention mechanism. (b) Transformer encoder.
Figure 3. Structure diagram of CNN.
Figure 4. The structural diagrams of LSTM and BiLSTM. (a) Long short-term memory. (b) Bidirectional long short-term memory.
Figure 5. Transformer Encoder–CNN–BiLSTM model architecture.
Figure 6. Comparison between the original AQI and the processed AQI.
Figure 7. Prediction results of the single model.
Figure 8. The histogram for evaluation metrics.
Figure 9. Prediction results of the hybrid model.
Figure 10. Generalization experiment based on Beijing.
Table 1. Evaluation metrics of single prediction model.

| Model | R² | MAE | RMSE |
|---|---|---|---|
| CNN | 0.8952 | 5.7717 | 8.8245 |
| TCN | 0.9311 | 4.5229 | 7.1559 |
| BiLSTM | 0.9356 | 4.7670 | 6.9163 |
| Transformer | 0.9585 | 3.6765 | 5.5496 |
Table 2. Evaluation metrics of hybrid prediction model.

| Model | R² | MAE | RMSE |
|---|---|---|---|
| CNN–BiLSTM | 0.9496 | 3.4577 | 6.1213 |
| Transformer Encoder–CNN | 0.9524 | 3.4617 | 5.9463 |
| Transformer Encoder–BiLSTM | 0.9658 | 3.3223 | 5.0417 |
| Proposed model | 0.9781 | 2.4266 | 4.0321 |
Table 3. AQI grade evaluation result.

| | Predicted Good | Predicted Moderate | Predicted Lightly Polluted | Predicted Moderately Polluted |
|---|---|---|---|---|
| Actual Good | 4254 | 134 | 0 | 0 |
| Actual Moderate | 90 | 2029 | 85 | 0 |
| Actual Lightly Polluted | 0 | 13 | 295 | 7 |
| Actual Moderately Polluted | 0 | 0 | 3 | 63 |
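The overall grade accuracy implied by the Table 3 confusion matrix is the sum of its diagonal divided by the total sample count, which is consistent with the 95.24% grade accuracy of the seed-42 run in Table 5. A quick check:

```python
# Compute overall AQI grade accuracy from the Table 3 confusion matrix.
confusion = [
    [4254, 134, 0, 0],   # actual Good
    [90, 2029, 85, 0],   # actual Moderate
    [0, 13, 295, 7],     # actual Lightly Polluted
    [0, 0, 3, 63],       # actual Moderately Polluted
]
correct = sum(confusion[i][i] for i in range(4))       # correctly graded samples
total = sum(sum(row) for row in confusion)             # all samples
accuracy = 100 * correct / total
print(f"{accuracy:.2f}%")  # 95.24%
```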
Table 4. Activation function comparison experiment.

| Activation Function | Rounds | RMSE | MAE | R² |
|---|---|---|---|---|
| ReLU | 100 | 5.2765 | 3.923 | 0.9625 |
| LeakyReLU | 100 | 4.4434 | 2.6562 | 0.9734 |
| GELU | 87 | 5.1046 | 3.4257 | 0.9649 |
| Swish | 100 | 4.9263 | 2.9497 | 0.9673 |
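For reference, the four activation functions compared in Table 4 can be written as scalar functions as below (GELU in its common tanh approximation). The LeakyReLU slope 0.01 is the usual default and an assumption here, as the paper's value is not restated in this section.

```python
# Minimal scalar implementations of the activations compared in Table 4.
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):      # alpha = 0.01 is an assumed default
    return x if x >= 0 else alpha * x

def gelu(x):                        # tanh approximation of GELU
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):             # x * sigmoid(beta * x)
    return x / (1 + math.exp(-beta * x))

for f in (relu, leaky_relu, gelu, swish):
    print(f.__name__, f(-1.0), f(1.0))
```

Unlike ReLU, the other three pass a small (or smoothly scaled) signal through for negative inputs, which is one plausible reason LeakyReLU avoided the dead-neuron effect and performed best in Table 4.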
Table 5. Performance of the model over 5 independent runs with different random seeds.

| Run | Random Seed | RMSE | MAE | R² | Grade Accuracy (%) |
|---|---|---|---|---|---|
| 1 | 42 | 4.4434 | 2.6562 | 0.9734 | 95.24 |
| 2 | 100 | 4.5355 | 2.7081 | 0.9723 | 94.59 |
| 3 | 200 | 4.6076 | 2.6616 | 0.9714 | 95.10 |
| 4 | 300 | 4.1682 | 2.4857 | 0.9766 | 95.58 |
| 5 | 400 | 4.0890 | 2.4502 | 0.9775 | 95.40 |
| Mean ± Std | | 4.37 ± 0.21 | 2.59 ± 0.11 | 0.974 ± 0.003 | 95.18 ± 0.36 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, Z.; Zhang, Q.; Chen, G. Air Quality Index Prediction Based on Transformer Encoder–CNN–BiLSTM Model. Atmosphere 2026, 17, 249. https://doi.org/10.3390/atmos17030249
