BiTCN-ISInformer: A Parallel Model for Regional Air Pollutant Concentration Prediction Using Bidirectional Temporal Convolutional Network and Enhanced Informer

Mao, Xinyi; Liu, Gen; Wang, Jian; Lai, Yongbo

doi:10.3390/su17198631



School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 102616, China

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(19), 8631; https://doi.org/10.3390/su17198631

Submission received: 18 July 2025 / Revised: 18 September 2025 / Accepted: 23 September 2025 / Published: 25 September 2025


Abstract

Accurate and dependable prediction of air pollutant concentrations, particularly PM_2.5, is crucial for protecting human health and preserving a healthy natural environment. This research proposes a robust deep learning-based prediction system for regional PM_2.5 concentrations over the next one to twenty-four hours. First, the input features of the prediction system are screened using a correlation analysis of various air pollutants and meteorological factors. Next, the BiTCN-ISInformer prediction model with a two-branch parallel architecture is constructed. On the one hand, the model improves the probabilistic sparse attention mechanism of the traditional Informer network by replacing sparse sampling alone with a synergistic mechanism that combines sparse sampling and importance sampling, which improves prediction accuracy and reduces computational complexity; on the other hand, by introducing the bidirectional temporal convolutional network (BiTCN) in a parallel architecture, the model captures both the short-term fluctuations and the long-term trends of the temporal data and increases inference speed. Experiments show that the proposed model outperforms state-of-the-art baseline models in prediction accuracy and performance. In the single-step and multi-step prediction experiments on Shanghai’s PM_2.5 concentration, the proposed model achieves a root mean square error (RMSE) ranging from 2.010 to 10.029 and a mean absolute error (MAE) ranging from 1.436 to 6.865. The prediction system proposed in this research therefore shows promise for use in air pollution early warning and prevention.

Keywords:

deep learning; air pollutant concentration prediction; bidirectional temporal convolutional network; ISInformer

1. Introduction

One of the most important concerns in global environmental governance in recent years has been air pollution [1]. The concentration and extent of air pollution have been growing as a result of human activities’ growing impact on the atmospheric environment. This is particularly true during the urbanization process, where energy consumption, industrial emissions, and traffic exhaust have all contributed to a significant increase in air pollution levels [2]. The World Health Organization’s (WHO) statistics and earlier research indicate that air pollution has caused a number of issues, including the sharp decline in biodiversity, the occurrence of respiratory diseases more frequently, and the deterioration of soil quality. These issues pose a serious threat to both human health and the ecological environment, which is essential to our survival [3,4]. PM_2.5 is a unique air pollutant that is of particular concern because it carries heavy metals, polycyclic aromatic hydrocarbons (PAHs), and other hazardous and toxic substances. When it enters the human body, it disrupts normal cellular metabolism and signaling pathways [5,6]. In addition to affecting the endocrine system, prolonged exposure to PM_2.5 can result in neurodegenerative illnesses. Therefore, accurate and timely PM_2.5 forecasting is not merely a technical challenge but a critical tool for public health intervention. It enables the early issuance of health advisories, helps vulnerable populations take preventive measures, and ultimately mitigates the disease burden associated with air pollution exposure. Furthermore, PM_2.5 may disrupt photochemical reactions in the atmosphere, hence impacting climatic stability [7]. Therefore, given the severe air pollution issues, it is imperative that precise and trustworthy PM_2.5 concentration predictions be made. 
Even though the prediction alone cannot directly address the issue of air pollution, it can give the public and government agencies useful early warning and decision support [8], allowing them to be ready for prevention and control and thereby lessening the threat that PM_2.5 poses to human health and the environment.

1.1. Literature Review

In general, there have been four stages in the development of methods for predicting the concentrations of air pollutants: numerical modeling, statistical modeling, simple machine learning, and deep learning modeling. Numerical models primarily use mathematical-physical equations constructed under specific idealized conditions to simulate the spread of air pollutants in the atmosphere. Jiang and Yoo [9] utilized a community multiscale air quality (CMAQ) model to predict the daily mean of PM_2.5 concentration; Zhang et al. [10] estimated the long-term trend of SO₂ concentration in China using a MOZART model based on chemical transport simulation. These are common techniques for using numerical modeling to predict pollution concentrations. Numerical models, however, are less robust to data that contains errors or is incomplete and therefore rely heavily on high-quality input data. Furthermore, numerical models must abstract complicated real-world situations into numerical equations, which inevitably results in some information loss and is not accurate [11]. Consequently, the creation of models that more closely represent the real situation is necessary for predicting air pollution concentrations.

In contrast to numerical models, statistical models can instantly identify patterns in data and do not need intricate physical or chemical processes to be abstracted from complex real-world problems. Some of the more often used statistical models that researchers have selected in recent years are the autoregressive integrated moving average model (ARIMA) [12] and the multiple linear regression model [13]. However, a number of variables, including the spatial environment and meteorological conditions, affect the concentrations of air pollutants. These variables interact to create intricate nonlinear relationships. Complex nonlinear relationships are more difficult for statistical models to handle since they are frequently predicated on linear assumptions or specific probability distributions. As a result, statistical models have a similarly limited capacity for prediction.

Another class of methods for predicting air pollution concentrations is simple machine learning models. Decision trees [14], the K-nearest neighbor algorithm (KNN) [15], and support vector machines (SVM) [16] are examples of frequently used machine learning models. Razavi-Termeh et al. [17] integrated the conventional CatBoost machine learning algorithm with two optimization algorithms to construct the CatBoost HHO and CatBoost GWO models, respectively. In their experiments, both models achieved a notable decrease in prediction error on the urban air pollution prediction task. Emeç and Yurtsever [18] proposed a novel stacked ensemble model that combines the predictions of several machine learning models. The model outperformed both statistical and conventional machine learning models in predicting PM_2.5 in cities like Beijing and Istanbul. Compared to statistical models, machine learning models are better at capturing nonlinear data patterns and offer better generalization and easier integration with other methods. However, the effectiveness of machine learning models depends largely on the quality of manual feature engineering, which not only demands a high level of skill and experience from the researcher but also hinders model transfer and scaling because the engineered features are task-specific [19]. In addition, machine learning models are currently inadequate for big data and are better suited to tasks involving small amounts of data [20].

Deep learning models have gained popularity in the field of predicting the concentrations of air pollutants in recent years. By independently mining the temporal connection patterns hidden in historical and real-time monitoring data, deep learning models create end-to-end prediction systems. One of the key characteristics that sets them apart from other models is their capacity to model nonlinear interactions of multidimensional features [21]. Many researchers employ these models as baselines for novel time-series prediction research, including Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), and Gated Recurrent Units (GRU) [22]. Liang et al. [23] proposed the HGA-LSTM model, which incorporates a new hyperparameter optimization method and uses LSTM as the foundation architecture. Combining local search and genetic algorithms, the model tackles the challenge of selecting parameters for deep learning models, particularly LSTM, and shows a low prediction error for air pollution. Qing [24] suggested the GRA-GRU model to increase the precision of PM_2.5 concentration prediction, beginning with the dataset processing. Applying gray correlation analysis (GRA) results to the dataset is the fundamental component of this model, which enables GRU to acquire inputs with spatial weights. This method lessens the interference from redundant features and permits a greater release of the GRU’s performance. Deep learning models outperform numerical, statistical, and machine learning models in terms of prediction accuracy and performance for the majority of time-series prediction problems. Single deep learning models, however, are still inadequate for handling multi-scale temporal dependencies, and dynamically balancing the distinct characterization requirements of short-term and long-term dependencies is challenging. 
In the meantime, it is challenging for a single deep learning model to thoroughly examine the possible relationships among the variables when dealing with complex spatiotemporal datasets including meteorology and air pollution [25]. These limit how well a single deep learning model can predict pollutants.

Researchers have started looking into hybrid deep learning models in an effort to overcome the drawbacks of single deep learning models. Hybrid models combine several deep learning models into a system with complementary functions that incorporates the benefits of each component and enhances prediction performance [26]. Ahmed et al. [27] created the CLSTM-BiGRU hybrid deep learning model, which was applied effectively to the prediction of the air quality index (AQI). Unlike a single deep learning model, this model combines CNN, LSTM, and BiGRU into a recursive structure, which avoids the network degrading to random guessing on long time series and makes it possible to move seamlessly from capturing short-term dependencies to long-term ones. Based on the spatiotemporal connection of air pollutants in the area, Wu et al. [28] suggested a hybrid Res-GCN-BiLSTM deep learning model. BiLSTM captures rich temporal details, whereas graph convolutional networks (GCN) integrate each monitoring station’s topological information to derive spatial features. The experimental findings show that the Res-GCN-BiLSTM model can mine potential spatiotemporal dependencies, greatly increasing prediction accuracy. Zhang et al. [29] proposed the RCL-Learning model to forecast the concentration of PM_2.5 in a city of interest. The model combines the benefits of convolutional long short-term memory networks (ConvLSTM) and residual neural networks (ResNet) and performs best when compared to conventional deep learning methods. These findings point to the need for comprehensive predictive models that mine the multivariate features of spatiotemporal datasets more thoroughly.

Furthermore, Transformer models have gained popularity in recent years in the areas of load forecasting, traffic flow prediction, and disease prediction. They have also given rise to a number of excellent variations, including Autoformer [30], iTransformer [31], and Pathformer [32]. Hybrid deep learning models form more powerful prediction systems by flexibly combining the strengths of different models. Transformer models, on the other hand, primarily rely on the self-attention mechanism to directly model global dependencies among all elements in a sequence, without requiring recurrent or convolutional structures. The task of predicting the concentration of air pollutants has also been the subject of related study. Wang et al. [33] suggested a CNN-Transformer hybrid model to predict high-resolution PM_2.5 concentrations in cities and confirmed its effectiveness through trials. Mu et al. [34] created an STL-Transformer model to forecast the concentration of ozone in the air. In order to enhance predictive performance, the model creatively begins with the input data processing, breaking down the ozone time series into trend, season, and residual components using STL decomposition. In addition to improving Transformer’s capacity to extract long-term dependence features, this procedure fortifies the model’s interpretability. Nevertheless, prior research has not taken into account merging the extraction of short-term and long-term dependence features in order to thoroughly mine the data’s multi-scale dependencies [35]. Moreover, complicated models make the prediction system less effective and increase computing complexity. As a result, it is worthwhile to look into air pollutant concentration prediction methods further in order to increase the prediction system’s efficiency and accuracy.

This study selected PM_2.5 as the target pollutant and focused on the core area of China’s Yangtze River Delta urban cluster (with Shanghai as the primary focus) due to its typicality and pressing practical needs. PM_2.5 ranks among the most prevalent and health-threatening air pollutants in the Yangtze River Delta. As one of China’s most urbanized regions with the highest population density and economic activity, the area faces persistent, severe challenges from complex atmospheric pollution, characterized by heavy PM_2.5 pollution loads. Therefore, achieving precise PM_2.5 concentration forecasting in this region not only provides an ideal testing ground for validating complex spatiotemporal prediction models, but also offers critical early warning information and decision support for protecting the public health of its densely populated areas. This holds significant scientific value and practical importance.

1.2. Contribution and Innovation

This study aims to develop a deep learning-based method for single-step and multi-step PM_2.5 concentration prediction in the core area of the Yangtze River Delta urban agglomeration in China, using data from 1 January 2022 to 30 November 2024. As the review above indicates, current models fail to effectively integrate the extraction of short-term and long-term dependence features, and they also need to become more computationally efficient. To overcome these problems and enhance prediction performance and accuracy, we propose the BiTCN-ISInformer hybrid prediction model for both single-step and multi-step regional PM_2.5 concentration prediction. The contributions and innovations of this paper are summarized as follows:

(1) Sampling method optimization for probabilistic sparse attention mechanism in the conventional Informer network: sparse sampling alone is optimized to a synergistic mechanism that combines sparse sampling and importance sampling. The network that has been optimized is called ISInformer. On the one hand, the improvement can better capture the dependencies between query-keys in a sequence and enhance the precision of attention calculation; on the other hand, it lowers the computational complexity and boosts the model’s computational efficiency.

(2) A BiTCN-ISInformer prediction model with a two-branch parallel architecture is constructed to fully extract the local details and long-term trends of spatiotemporal data. BiTCN captures local short-term dependence features, while ISInformer captures global long-term dependence features. The parallel architecture makes it possible to extract comprehensive information, which enhances performance and prediction accuracy on the complex time-series prediction task. It also speeds up the model’s inference.

(3) A prediction system that includes spatiotemporal data processing, model construction, and model evaluation is proposed. Among these, the model evaluation thoroughly assesses the proposed model from four angles: stability, computational efficiency, generalization ability and universality. It is demonstrated that the prediction system can offer a solid and accurate foundation for warning and preventing regional air pollution.

2. Study Area and Dataset Analysis

2.1. Study Area

The study area is the core region of the Yangtze River Delta urban agglomeration, which is situated in China’s lower reaches of the Yangtze River and borders the Yellow Sea and East China Sea. It has a subtropical humid monsoon climate, with mild to moderate winter rainfall and high summer temperatures and heavy rainfall. The distribution of the study area is displayed in Figure 1. The Yangtze River Delta is situated in the prime geographical zone between approximately 116°21′ and 123°08′ east longitude and 29°56′ and 35°08′ north latitude. Covering an area of approximately 358,000 square kilometers and home to over 235 million permanent residents, the Yangtze River Delta holds a pivotal position in China’s economic landscape, serving as the core engine driving the nation’s high-quality economic development. Shanghai, the hub of China’s worldwide economy, finance, trade, shipping, and scientific and technical innovation, is a world-class city and the core of the Yangtze River Delta urban agglomeration. Therefore, Shanghai is chosen as the main research object, and the experimentally validated model will also be applied to the other 13 cities. The Yangtze River Delta urban agglomeration, one of China’s most developed regions, consistently has the worst air quality in the country. According to earlier research, PM_2.5 is the most common air pollutant in the area [36]. Thus, it is crucial to accurately anticipate PM_2.5 levels in the area in order to serve as a foundation for controlling air pollution.

2.2. Data Description and Preprocessing

PM_2.5 and other air pollutants interact and are connected to one another. For instance, the atmospheric chemical interactions of SO₂ and NO₂ generated by human industrial activities contribute to the generation of PM_2.5 [37]. Additionally, we employed the Pearson correlation coefficient to assess the strength of the linear correlation between the target variable and each candidate feature, thereby further validating the candidate feature selection. We set |r| ≥ 0.5 as the threshold for strong correlation. Thus, the air quality index (AQI) and key air pollutants (PM_2.5, PM₁₀, SO₂, NO₂, O₃, and CO) were used as input features for the model to predict PM_2.5. Meteorological factors such as temperature, relative humidity, precipitation, and wind speed also change the intensity of chemical reactions in the atmosphere. These factors additionally cause physical phenomena such as pollutant agglomeration, deposition, and dilution, which in turn affect the concentration of air pollutants [38,39]. Consequently, meteorological factors (Temperature, Dew/Frost Point, Relative Humidity, Precipitation Corrected, Surface Pressure, Wind Speed, and Wind Direction) were also used as model input features. Furthermore, prior research has demonstrated that air pollutants are spatially correlated, so spatial information must be introduced to improve the prediction of PM_2.5 in the target city [40]. Thus, we pre-selected pollutant and meteorological data from the 13 other cities in the study area, in addition to the target city, as input features for the model. Because the epidemic affected air pollutant concentrations in earlier years, we used hourly air pollutant concentrations and meteorological data for the 14 cities from 1 January 2022 to 30 November 2024 to keep the study timely and accurate. 
The Shared Data website (https://quotsoft.net/air/, accessed on 20 December 2024) provided information on air pollutant concentrations, while the Open website (https://power.larc.nasa.gov, accessed on 21 December 2024) provided meteorological data.
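The correlation screening described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `pearson_r` and `screen_features` and the toy values are assumptions made for the example, while the |r| ≥ 0.5 threshold comes from the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def screen_features(target, candidates, threshold=0.5):
    """Keep candidates whose |r| with the target reaches the threshold."""
    selected = {}
    for name, values in candidates.items():
        r = pearson_r(target, values)
        if abs(r) >= threshold:
            selected[name] = r
    return selected

# Toy hourly values, invented purely for illustration.
pm25 = [12.0, 18.0, 25.0, 31.0, 28.0, 20.0]
candidates = {
    "PM10": [20.0, 29.0, 41.0, 50.0, 46.0, 33.0],  # moves with PM2.5
    "noise": [5.0, 1.0, 4.0, 2.0, 5.0, 1.0],       # no clear relation
}
selected = screen_features(pm25, candidates)
print(sorted(selected))
```

Using the absolute value means that strongly negative correlations pass the screen as well, which matches the |r| form of the threshold.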

The collected data were preprocessed as follows. First, to simplify the model inputs and prevent the introduction of random measurement noise from individual stations, the average of each factor over the monitoring stations in a city was used as input. Since fewer than 2% of the values in the dataset were missing, gaps were then filled using simple linear interpolation when values were missing for certain time intervals within a single day, or using data from the day before or after when an entire day was missing. Lastly, the data were min-max normalized, which rescales every feature to a common fixed range, in order to remove the effect of magnitude and speed up model convergence [41]. After preprocessing, the data were split chronologically into training and test sets so that the two do not overlap and the training period precedes the test period. The training set comprises 80% of the data and the test set 20%. This study did not establish an independent validation set, primarily because time-series data must be partitioned strictly in chronological order to prevent information leakage. The model employs Dropout regularization and early stopping based on the training loss to control overfitting. With sufficient total data volume, the training samples adequately support model learning, and the test set is statistically representative, enabling effective evaluation of generalization capability. The statistics of Shanghai’s air pollutant and meteorological data are given in Table S1 in the Supplementary Materials.
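The preprocessing steps can be sketched as follows for a single feature series. This is an illustrative sketch rather than the authors' pipeline: the helper names and toy values are assumptions, gaps at the very start or end of a series are simply left unfilled, and fitting the min-max bounds on the training portion only is a choice made here (the text does not say which portion the bounds were computed on).

```python
def interpolate_gaps(series):
    """Fill interior None values by linear interpolation between valid neighbours."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1                  # last valid index before the gap
            end = i
            while end < n and filled[end] is None:
                end += 1                   # first valid index after the gap
            if start >= 0 and end < n:     # only interior gaps are filled
                step = (filled[end] - filled[start]) / (end - start)
                for j in range(i, end):
                    filled[j] = filled[start] + step * (j - start)
            i = end
        else:
            i += 1
    return filled

def chronological_split(series, train_ratio=0.8):
    """Earlier part for training, later part for testing, no shuffling."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]

def min_max_scale(values, lo, hi):
    """Rescale values with bounds lo and hi (here taken from the training set)."""
    return [(v - lo) / (hi - lo) for v in values]

# Toy hourly series with two gaps, invented for illustration.
raw = [10.0, None, 14.0, 20.0, None, None, 26.0, 30.0, 22.0, 18.0]
filled = interpolate_gaps(raw)
train, test = chronological_split(filled)
lo, hi = min(train), max(train)
train_scaled = min_max_scale(train, lo, hi)
test_scaled = min_max_scale(test, lo, hi)
```

With bounds taken from the training set, test values can fall slightly outside [0, 1] when the test period contains more extreme values than the training period.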

3. Methodology

3.1. The Framework of the Proposed Prediction System

The three primary phases of the proposed prediction system’s framework are depicted in Figure 2.

The first stage is the analysis and processing of the data. Section 2 indicates that, on the one hand, PM_2.5 is correlated with other air pollutants and with meteorological factors, so the model must take multiple types of air pollutant and meteorological data as inputs. On the other hand, air pollutants are spatially correlated, so the model must incorporate spatial information by including air pollutant and meteorological data from the other cities in the study area besides Shanghai. Accordingly, a spatiotemporal matrix was created from the 14 cities’ historical air pollutant concentrations and meteorological data. This spatiotemporal matrix is a three-dimensional tensor whose dimensions are features (air pollutant indicators and meteorological factors), time steps (the number of time points to look back), and cities (14 cities).
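As a rough illustration of how such a tensor can be cut into supervised samples, the sketch below slides a look-back window over hourly data and pairs each window with a future target value. The index order `data[t][c][f]` (time, city, feature), the function name `make_windows`, and the toy sizes are assumptions for the example; the text only states that the three dimensions are features, time steps, and cities.

```python
def make_windows(data, lookback, horizon, target_city=0, target_feature=0):
    """Build (input window, target) pairs from data[t][c][f].

    Each input holds `lookback` consecutive hours for all cities and features.
    The target is one feature of one city, `horizon` hours after the window ends.
    """
    inputs, targets = [], []
    last_start = len(data) - lookback - horizon
    for start in range(last_start + 1):
        window = data[start:start + lookback]
        target_t = start + lookback + horizon - 1
        inputs.append(window)
        targets.append(data[target_t][target_city][target_feature])
    return inputs, targets

# Toy data: 6 hours, 2 cities, 3 features; values encode their own indices.
hours, cities, features = 6, 2, 3
data = [[[float(100 * t + 10 * c + f) for f in range(features)]
         for c in range(cities)]
        for t in range(hours)]

X, y = make_windows(data, lookback=3, horizon=1)
```

Setting `horizon` between 1 and 24 would correspond to the one- to twenty-four-hour prediction range mentioned in the abstract.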

The second stage, the core of the proposed prediction system, uses the proposed BiTCN-ISInformer model to perform spatiotemporal prediction of PM_2.5 in Shanghai. In this two-branch parallel architecture, the spatiotemporal matrix enters both branches simultaneously as input. The first branch, BiTCN, uses a hierarchical structure of bidirectional dilated causal convolutions to model local features in detail and outputs rich local short-term dependence features. The second branch, ISInformer, captures long-distance dependencies at a global scale in a fine-grained manner by using three embedding modes and a multilayer stacked encoder–decoder structure, and outputs rich global long-term dependence features. ISInformer enhances the probabilistic sparse attention mechanism of the classic Informer model by adding importance sampling, so that higher-quality keys take part in the attention calculation. This enhancement not only speeds up inference by lowering the model’s computational complexity, but also improves the model’s performance and prediction accuracy. The outputs of the two branches are concatenated and then mapped to the final prediction result by a fully connected layer.
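The final fusion step, concatenation followed by a fully connected layer, can be sketched as below. The feature vectors stand in for the outputs of the two branches and, like the weights and the function name `fuse_and_project`, are placeholders invented for the example rather than values from the paper.

```python
def fuse_and_project(local_features, global_features, weights, bias):
    """Concatenate two branch outputs and apply one fully connected layer."""
    fused = local_features + global_features   # list concatenation
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature length")
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]

local_features = [0.2, -0.1]            # stand-in for the BiTCN branch output
global_features = [0.5, 0.3, 0.0]       # stand-in for the ISInformer branch output
weights = [[1.0, 1.0, 1.0, 1.0, 1.0]]   # one output unit, placeholder weights
bias = [0.1]

prediction = fuse_and_project(local_features, global_features, weights, bias)
```

Because the two branches do not depend on each other's output, they can be evaluated independently before this step, which is what makes the architecture parallel.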

The model evaluation step is the final stage. The superiority of the proposed model in PM_2.5 concentration prediction is confirmed by comparing it with the state-of-the-art baseline model using a number of assessment indicators, comparing the predicted and true values, and confirming its stability and generalization ability.

3.2. BiTCN Module

To strengthen the Informer model’s feature extraction, we introduce the Bidirectional Temporal Convolutional Network (BiTCN). The Informer model’s sparse attention mechanism allows it to capture global long-term dependencies of sequences, but it is poorly suited to capturing local short-term dependencies. BiTCN, on the other hand, uses a hierarchical structure of bidirectional dilated causal convolutions to model local features finely. Using BiTCN and Informer together improves the model’s feature extraction capabilities and achieves multi-granularity fusion from local to global features. As illustrated in Figure 3, BiTCN is made up of two independent temporal convolutional networks (TCNs) operating in the forward and backward directions. Our model consists of three forward and three backward TCN layers and controls the size of the dilation coefficient to capture local features finely. The main architecture of BiTCN is represented as follows:

H_{t}^{forward} = Dropout (ReLU (WN (\sum_{k = 0}^{K - 1} W_{k}^{forward} \cdot X_{t - d \cdot k})))

(1)

H_{t}^{backward} = Reverse (Dropout (ReLU (WN (\sum_{k = 0}^{K - 1} W_{k}^{backward} \cdot X_{t - d \cdot k}^{reverse}))))

(2)

H^{BiTCN} = (X^{forward} + H^{forward}) \oplus (X^{backward} + H^{backward})

(3)

where WN represents weight normalization, $K$ represents the convolution kernel size, $d$ represents the dilation coefficient, $W_{k}$ represents the convolution kernel weight, $X_{t - d \cdot k}$ represents the feature values of the input sequence at time step $t - d \cdot k$, $X^{forward}$ and $X^{backward}$ represent the outputs of the forward and backward residual connections, $+$ represents element-wise addition, and $\oplus$ represents concatenation.
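A single-channel sketch of Equations (1) to (3) may help make the indexing concrete. It is a simplified illustration under stated assumptions, not the authors' implementation: dropout and weight normalization are omitted, positions before the start of the sequence are treated as zero, the residual terms $X^{forward}$ and $X^{backward}$ are taken to be the input itself, and all function names are hypothetical.

```python
def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - dilation * k], with zeros before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - dilation * k
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

def relu(values):
    return [max(0.0, v) for v in values]

def bitcn_layer(x, w_forward, w_backward, dilation):
    """One bidirectional layer: forward path, reversed path, residual, concat."""
    # Eq. (1): causal convolution over the sequence in its original order.
    h_fwd = relu(dilated_causal_conv(x, w_forward, dilation))
    # Eq. (2): same operation on the reversed sequence, then reverse back,
    # so position t sees x[t], x[t + d], ... instead of x[t], x[t - d], ...
    h_bwd = relu(dilated_causal_conv(x[::-1], w_backward, dilation))[::-1]
    # Eq. (3): residual addition per direction, then concatenation per step.
    fwd = [a + b for a, b in zip(x, h_fwd)]
    bwd = [a + b for a, b in zip(x, h_bwd)]
    return list(zip(fwd, bwd))

x = [1.0, 2.0, 3.0, 4.0]
out = bitcn_layer(x, w_forward=[1.0, 1.0], w_backward=[1.0, 1.0], dilation=1)
print(out)
```

The example uses a kernel of size 2 so the sums are easy to follow by hand; the model described in the text uses $K = 3$.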

By dilating the causal convolutional layers, BiTCN is able to capture local short-term dependencies. The convolution kernel size $K$ and the dilation coefficient $d$ are crucial because they determine the length of the temporal window covered along each path. In our model, the kernel size is $K = 3$ and the dilation coefficients are kept small ($d = 1, 2, 4$), so the window span of BiTCN is explicitly restricted to the local neighborhood. At the same time, BiTCN’s bidirectional design allows it to capture fine-grained patterns both before and after the current time step, significantly improving the model’s ability to extract local features. Finally, the residual connections ensure that the original local features are not lost along the bidirectional paths, providing the model with richer and more precise local feature information [42].
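As a worked illustration of how local this window is, the receptive field $R$ of one direction can be computed from the stated values. The calculation assumes three stacked layers with exactly one dilated convolution each; the number of convolutions per layer is not given in the text, so the result is indicative only.

```latex
R = 1 + (K - 1) \sum_{l = 1}^{3} d_{l} = 1 + (3 - 1)(1 + 2 + 4) = 15
```

Under this assumption each direction covers the current time step plus 14 neighboring steps. If each layer instead contained two convolutions, as in some TCN residual blocks, the same reasoning would give $1 + 2 (K - 1)(1 + 2 + 4) = 29$.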

BiTCN normalizes the convolution kernel weights by decomposing $W_{k}$ into a direction vector $v$, which determines the weight direction, and a gain coefficient $g$, which controls the weight magnitude:

W_{k} = g \cdot \frac{v}{{‖v‖}_{2}}

(4)

where ${‖v‖}_{2}$ represents the L2 norm of $v$. After normalization, the direction $v / {‖v‖}_{2}$ has unit length, while the magnitude of the weight is carried entirely by $g$, so that ${‖W_{k}‖}_{2} = |g|$. Decoupling direction from magnitude in this way constrains the scale of gradient propagation, which helps mitigate gradient vanishing or explosion. At the same time, the learnable gain $g$ balances gradient stability and feature strength and smooths the optimization path of the loss function, which accelerates model convergence and improves generalization ability [43].
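Equation (4) can be checked numerically with a short sketch. The function name `weight_norm` and the example values are assumptions for illustration; the point is that the norm of the resulting weight equals the gain and does not depend on the scale of $v$.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(c * c for c in vec))

def weight_norm(v, g):
    """Eq. (4): W = g * v / ||v||_2, direction from v and magnitude from g."""
    norm = l2_norm(v)
    return [g * c / norm for c in v]

v = [3.0, 4.0]   # direction parameter, ||v||_2 = 5
g = 2.0          # gain parameter

w = weight_norm(v, g)
w_rescaled = weight_norm([30.0, 40.0], g)   # same direction, ten times longer

print(w, l2_norm(w))
```

Rescaling `v` leaves the weight unchanged, so only `g` controls how large the kernel weights are.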

3.3. ISInformer Module

The traditional probabilistic sparse attention mechanism in the Informer model selects important queries by sparsity metric but uses fixed sampling for the keys of each query [44]. Such a method has two drawbacks: first, some unimportant keys are double-counted, resulting in computational duplication; second, unsampled keys may contain important information that is omitted. By adding importance sampling after sparsity sampling, we improve the traditional probabilistic sparse attention mechanism and increase the sample quality by precisely controlling the key sampling procedure. As illustrated in Figure 4, the improved model is called ISInformer. The following are the details.

3.3.1. Improved Probabilistic Sparse Attention Mechanism

The improved probabilistic sparse attention mechanism is shown in Figure 5. Firstly, calculate the complete attention score matrix:

Q K_{all} = \frac{Q \cdot K^{T}}{\sqrt{D}}

(5)

where $D$ represents the feature dimension. Next, calculate the sparsity score for each query $q_{i}$; the higher the score, the stronger the association pattern between the query and the keys. Then, select the n_top most important queries based on their scores:

M (q_{i}) = \max_{j} (Q K_{all} [i, j]) - \frac{1}{L_{K}} \sum_{j = 1}^{L_{K}} Q K_{all} [i, j]

(6)

where $L_{K}$ represents the sequence length of the keys. After selecting the queries, perform importance sampling on the keys based on the query-key importance score distribution, so that highly correlated keys are more likely to be selected. Compared with the fixed sampling of traditional methods, this avoids sampling keys blindly. For each query $q_{i}$, its importance score distribution over the keys $k_{j}$ is:

\begin{aligned} P (k_{j} | q_{i}) &= Softmax (Q K_{all} [i, j]) \\ &= \frac{\exp (Q K_{all} [i, j])}{\sum_{l = 1}^{L_{K}} \exp (Q K_{all} [i, l])} \end{aligned}

(7)

Finally, calculate the local attention scores for the selected queries ($Q_{n\_top}$) and the sampled keys ($K_{sample}$) to obtain the final attention score matrix:

Q K_{sample} = Softmax (\frac{Q_{n\_top} \cdot K_{sample}^{T}}{\sqrt{D}})

(8)

Instead of computing the attention scores of each time step with regard to all other time steps, the ISInformer model computes attention only for the significant time steps, relying on the improved probabilistic sparse attention mechanism. This enables the model to handle lengthy sequences effectively and significantly lowers its computing cost. In addition, the introduced importance sampling makes the sampling process more refined and enhances the model’s ability to capture key features, thereby improving the model’s prediction accuracy and performance.
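The four steps of Equations (5) to (8) can be sketched end to end with the standard library. This is an interpretation under stated assumptions, not the authors' implementation: keys are drawn per query with replacement and duplicates are merged, the number of sampled keys `n_sample` is a free parameter, and the function name `is_attention` and the toy matrices are invented for the example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)                              # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def is_attention(Q, K, n_top, n_sample, rng):
    """Return {query index: {key index: attention weight}} for selected queries."""
    d = len(Q[0])
    # Eq. (5): full scaled score matrix.
    scores = [[dot(q, k) / math.sqrt(d) for k in K] for q in Q]
    # Eq. (6): sparsity score = max score minus mean score, per query.
    sparsity = [max(row) - sum(row) / len(row) for row in scores]
    top = sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)[:n_top]
    attention = {}
    for i in top:
        # Eq. (7): importance distribution over keys for query i.
        probs = softmax(scores[i])
        drawn = rng.choices(range(len(K)), weights=probs, k=n_sample)
        sampled = sorted(set(drawn))          # merge repeated draws
        # Eq. (8): softmax restricted to the sampled keys.
        local = softmax([scores[i][j] for j in sampled])
        attention[i] = dict(zip(sampled, local))
    return attention

Q = [[4.0, 0.0], [0.1, 0.1], [0.0, 3.0]]      # toy queries
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]   # toy keys
attn = is_attention(Q, K, n_top=2, n_sample=3, rng=random.Random(0))
```

In this toy case the second query has nearly uniform scores, so its sparsity score is the lowest and it is the one left out of the attention computation.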

3.3.2. Encoder

The purpose of ISInformer is to capture long-term, global dependencies of sequences. Through a sampling method, the probabilistic sparse attention mechanism reduces the computational complexity from O (L^{2}) to O (L \log L), allowing the model to concentrate on the global scope with higher efficiency and readily capture long-range dependency information. In the meantime, the encoder and decoder’s deep stacking structure abstracts temporal features layer by layer, allowing the high-level network to incorporate contextual information over long distances. This, in turn, makes it possible to model long-span dependencies. Since the proposed BiTCN-ISInformer model is built using a parallel architecture to expedite training and inference, raw sequence data serves as the encoder’s input. The original inputs then proceed to the embedding layer, where value embedding (VE), position embedding (PE), and temporal embedding (TE) transform the multi-source data into a high-dimensional vector representation that the model can process more readily. So that the model can accurately capture long-distance dependency information, value embedding converts discrete data into a more compact and continuous vector space; position embedding then adds the position information of each data point in the sequence to the vector; and temporal embedding aids in the identification and modeling of seasonal and cyclical patterns in the sequence data. By combining the three embeddings, the model is better equipped to represent the long-term dependencies present in the data. The embedding layer is represented as follows:

\begin{cases} F_{embed} = Dropout (VE (x_{enc}) + PE + TE (x_{mark}^{enc})) \\ VE (x_{enc}) = Conv1D (x_{enc}) \\ PE_{(pos, 2i)} = \sin (pos / 10000^{2i / d_{model}}) \\ PE_{(pos, 2i + 1)} = \cos (pos / 10000^{2i / d_{model}}) \\ TE (x_{mark}^{enc}) = Linear (x_{mark}^{enc}) \end{cases}

(9)

where F_{embed} represents the output of the embedding layer, x_{enc} represents the input data of the encoder layer, and x_{mark}^{enc} represents the timestamp information input to the encoder layer.
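
The position embedding in Equation (9) is the standard sinusoidal encoding and can be sketched as below. The function name `positional_embedding` and the choice of `seq_len` and `d_model` are illustrative assumptions; the value embedding (Conv1D) and temporal embedding (Linear) are learned layers and are not reproduced here.

```python
import numpy as np


def positional_embedding(seq_len, d_model):
    """Sinusoidal PE from Equation (9); d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]          # positions, shape (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # the even indices 2i
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even feature indices use sine
    pe[:, 1::2] = np.cos(angle)  # odd feature indices use cosine
    return pe


# Hypothetical sizes chosen only for illustration.
pe = positional_embedding(seq_len=24, d_model=16)
print(pe.shape)
```

The resulting matrix has one row per time step and is added element-wise to the value and temporal embeddings before dropout.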

By using an improved probabilistic sparse attention mechanism, the encoder layer dynamically filters important queries and sparsifies key-value pair sampling, reducing computational complexity while maintaining sensitivity to long-distance temporal dependence; residual connection and layer normalization work together to stabilize gradient flow and speed up the deeper network’s convergence, preventing the issue of information attenuation of long-distance signals in multilayer transmission; and feedforward neural networks further process the attention mechanism’s outputs through a high-dimensional nonlinear mapping to extract more intricate features. The stacked multi-layer encoder progressively expands the temporal features that have been extracted to the global scale. Its outputs serve as key-value pairs for the decoder’s cross-attention, which enables the decoder to precisely concentrate on the long-term dependence features that the encoder has extracted, improving the model prediction’s performance and accuracy.

3.3.3. Decoder

Two components make up the decoder’s input: the global features of the encoder output, which are fed into the decoder via a cross-attention mechanism, and the target sequence to be predicted. The embedding layer receives the input data and creates high-dimensional vectors that the model can effectively identify. The decoder is made up of several stacked decoder layers, the main parts of which are the cross-attention mechanism and the self-attention mechanism with masks. The self-attention mechanism circumvents the issue of gradient vanishing in the transmission process of long-distance dependencies by explicitly modeling the dependence patterns of time steps at arbitrary distances. At the same time, Softmax normalized weights allow the model to adaptively focus on the important time steps, minimizing noise interference. The following is a precise representation of the self-attention mechanism with mask:

\begin{cases} SelfAttn (Q_{dec}, K_{dec}, V_{dec}) = Softmax (\frac{Q_{dec} K_{dec}^{T}}{\sqrt{D}} + Mask) V_{dec} \\ Mask (i, j) = \begin{cases} 0, & \text{if } j \leq i \\ - \infty, & \text{otherwise} \end{cases} \end{cases}

(10)

where Mask (i, j) is the upper triangular mask, which ensures causality by limiting the decoder to concentrating only on the historical and present time step information for prediction. By interacting between the decoder’s query and the encoder’s key and value, the cross-attention mechanism allows the decoder to make predictions about the future based on both present and historical data. With this design, the prediction is more stable, and the cumulative error is decreased since the encoder gives a global view and the decoder makes precise modifications based on the current information. The following is a representation of the cross-attention mechanism:

CrossAttn (Q_{dec}, K_{enc}, V_{enc}) = Softmax (\frac{Q_{dec} K_{enc}^{T}}{\sqrt{D}}) V_{enc}

(11)

Finally, the output of the decoder maps the high-dimensional features to the target dimension through the fully connected layer.
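
Equations (10) and (11) can be illustrated with a small NumPy sketch. It is a single-head version without learned projections, written only to show the effect of the mask; the function names and the toy tensor sizes are assumptions, not part of the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; exp(-inf) evaluates to 0."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(length):
    """Mask of Eq. (10): 0 where j <= i, -inf where j > i."""
    mask = np.zeros((length, length))
    mask[np.triu_indices(length, k=1)] = -np.inf
    return mask


def masked_self_attention(q_dec, k_dec, v_dec):
    """Eq. (10): each step attends only to itself and earlier steps."""
    d = q_dec.shape[-1]
    scores = q_dec @ k_dec.T / np.sqrt(d) + causal_mask(q_dec.shape[0])
    return softmax(scores) @ v_dec


def cross_attention(q_dec, k_enc, v_enc):
    """Eq. (11): decoder queries attend to all encoder outputs."""
    d = q_dec.shape[-1]
    return softmax(q_dec @ k_enc.T / np.sqrt(d)) @ v_enc


rng = np.random.default_rng(0)
x_dec = rng.normal(size=(5, 4))   # hypothetical decoder states
x_enc = rng.normal(size=(12, 4))  # hypothetical encoder outputs
self_out = masked_self_attention(x_dec, x_dec, x_dec)
cross_out = cross_attention(self_out, x_enc, x_enc)
print(self_out.shape, cross_out.shape)
```

Because the first decoder step can attend only to itself, its self-attention output equals its own value vector, and changing a later step leaves the outputs of earlier steps unchanged.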

Our proposed model is built as a two-branch parallel architecture, combining local short-term dependence features from BiTCN with global long-term dependence features from ISInformer. The model synthesizes local details and long-term trends, realizes the effective coupling of short-term fluctuations and long-term trends, and significantly improves the prediction accuracy and performance of complex time series prediction tasks. Furthermore, the parallel design efficiently accelerates the model’s training and inference, as experiments in Section 5.5.1 will show.

3.4. Model Evaluation

3.4.1. Baseline Models

To evaluate the advantages of the proposed model we compare it with the following state-of-the-art deep learning models.

(1) CNN [45]: Convolutional neural network is one of the most classic deep learning models in air pollutant concentration prediction tasks.

(2) LSTM [46]: Long short-term memory networks are also one of the most classical models in time-series prediction tasks, which can effectively extract long-term dependencies in data.

(3) TCN [47]: Temporal convolutional network is a variant of CNN specifically designed for temporal prediction tasks. The TCN used in this paper adopts a convolution kernel size of 3 and a dilation factor of 2.

(4) TCN-LSTM [48]: A hybrid deep learning model that combines TCN with LSTM. This model was selected as a baseline to demonstrate the advantages of the hybrid model over the single model.

(5) Transformer [49]: The Transformer model in this paper consists of an input layer, an embedding layer, encoder layers, and an output layer. There are 2 encoder layers, and the multi-head self-attention mechanism is used inside them.

(6) CBAM-CNN-BiLSTM [50]: A hybrid model consisting of convolutional block attention module, convolutional neural network, and bidirectional long short-term memory network. The use of attention mechanism enhances the performance of the model.

(7) ST-Transformer: The model adds temporal embedding, positional embedding and value embedding based on Transformer to focus attention on the most useful contextual information in spatial, temporal and variable dimensions. The model is referenced in the same way as Transformer.

The baseline models were trained and tested using the same hyperparameters and evaluation metrics as the proposed BiTCN-ISInformer, which are described in Section 3.4.2 and Section 4.

3.4.2. Evaluation Metrics

By computing several evaluation metrics, we conduct a quantitative comparison between the baseline model and the proposed model. Each metric’s comprehensive information is displayed below. Root mean squared error (RMSE), mean absolute error (MAE), and index of agreement (IA) show how well the model predicted outcomes, and coefficient of determination (R²) shows how well the independent variables explained the dependent variable. As the RMSE and MAE decrease, the model’s prediction accuracy increases; the model’s prediction performance improves when the IA approaches 1, and its goodness-of-fit improves when the R² approaches 1.

RMSE = \sqrt{\frac{\sum_{i = 1}^{T} {(y_{i} - {\hat{y}}_{i})}^{2}}{T}}

(12)

MAE = \frac{1}{T} \sum_{i = 1}^{T} |y_{i} - {\hat{y}}_{i}|

(13)

IA = 1 - \frac{\sum_{i = 1}^{T} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{T} {(|y_{i} - \bar{y}| + |{\hat{y}}_{i} - \bar{y}|)}^{2}}

(14)

R^{2} = 1 - \frac{\sum_{i = 1}^{T} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{T} {(y_{i} - \bar{y})}^{2}}

(15)

where y_{i} is the observed value, {\hat{y}}_{i} is the predicted value, T is the test set size, and \bar{y} is the average observed value.

Furthermore, we examine the stability of the model using the variance of prediction error as an indication. A model’s stability is defined as its capacity to continue producing accurate predictions even when input data and internal structural parameter changes occur [51]. The following is the expression for the stability test formula:

S_{var} = \frac{1}{N} \sum_{i = 1}^{N} {(e_{i} - \bar{e})}^{2}

(16)

where e_{i} is the absolute prediction error of the i-th period and \bar{e} is the average of the N absolute prediction errors.
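
The metrics in Equations (12)–(16) can be computed as in the following NumPy sketch. The function name `evaluate` and the toy values are illustrative only and are not results from the paper; the sketch also assumes that each test sample counts as one period when computing the stability variance.

```python
import numpy as np


def evaluate(y_true, y_pred):
    """Return RMSE, MAE, IA, R2 and S_var following Equations (12)-(16)."""
    y = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_pred, dtype=float)
    err = y - y_hat
    abs_err = np.abs(err)
    y_bar = y.mean()
    sse = np.sum(err ** 2)  # sum of squared errors shared by Eqs. (12), (14), (15)
    ia_den = np.sum((np.abs(y - y_bar) + np.abs(y_hat - y_bar)) ** 2)
    return {
        "RMSE": float(np.sqrt(sse / y.size)),
        "MAE": float(abs_err.mean()),
        "IA": float(1.0 - sse / ia_den),
        "R2": float(1.0 - sse / np.sum((y - y_bar) ** 2)),
        # Assumption: each sample is treated as one period in Eq. (16).
        "S_var": float(np.mean((abs_err - abs_err.mean()) ** 2)),
    }


# Toy values for illustration only.
metrics = evaluate([10, 20, 30, 40], [12, 18, 33, 39])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A perfect prediction gives RMSE, MAE, and S_var of 0 and IA and R² of 1, which is a quick sanity check for any implementation of these formulas.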

4. Experimental Design

The device used in this study is a Windows 10 system with the following basic configuration: CPU: i7-9750 @ 2.60 GHz, GPU: NVIDIA GeForce GTX1650 4GB, and RAM: 8G. The Python version used is Python 3.9.12. During data processing, model construction, training, and testing, open-source libraries and frameworks such as PyTorch 2.2.2, Pandas 2.2.1, NumPy 1.26.4, scikit-learn 1.4.2, and matplotlib 3.9.0 were utilized.

Section 2.2 established that air pollutant and meteorological data from the 14 cities are candidate inputs for the model. However, adding data from other cities can make the model more complex, and any noise this introduces lowers the model’s prediction accuracy [52]. Therefore, before starting the prediction experiments, we aimed to identify the specific input information. The design keeps the number of input air pollutant and meteorological factors constant and sequentially merges data from Shanghai with data from each of the other cities to create distinct variable pairs. These were then entered into the model and the changes in evaluation metrics were examined.

Optimizing model performance requires careful consideration of both the parameter settings for model structure and those for model training. Multiple iterative experiments were carried out using an exploratory method in order to determine the set of parameters that would produce the highest predicted performance. The evaluation metrics RMSE, MAE, IA, and R² were used to evaluate predictive performance. Table 1 displays the specific parameter settings that were decided upon.

5. Results and Discussion

5.1. The Impact of Relevant Factors on PM_2.5 Concentration Prediction

Adding too much city information could affect the BiTCN-ISInformer model’s prediction accuracy. Therefore, to determine which city information to input to the model, we ran an experiment. The experiment was carried out in the context of single-step prediction (window size of three and prediction step of one) to explore how the model’s prediction accuracy changed when data from Shanghai were combined with data from each of the other cities in turn. Note that all of a city’s pollutant and meteorological data are entered in each experiment, so the experimental variable is the city itself. The experimental outcomes are reported in Table S2 in the Supplementary Materials.

As shown in Table S2, compared with the baseline (using only Shanghai data, RMSE = 2.779, R² = 0.951), incorporating data from most cities improves the prediction accuracy (decrease in RMSE and increase in R²). However, introducing data from Wuxi, Taizhou, and Huzhou leads to a decline in model performance, with RMSE values increasing to 3.138, 2.932, and 2.833, respectively. We attribute this anomaly to two possible causes. On one hand, differences in industrial structures result in distinct emission patterns and chemical compositions, so the generation and variation patterns of PM_2.5 in these cities have a weak correlation with those in Shanghai. On the other hand, geographical factors can also give rise to this anomaly. For instance, Wuxi and Huzhou are located on the shore of Taihu Lake. The unique local micro-climate and humidity conditions may affect the secondary generation and sedimentation processes of pollutants, causing deviations in their time series compared to Shanghai. Overall, these results support the analysis in Section 2.2, which found spatial relationships between air pollutants across cities, and indicate that information from other cities should be added to improve the model’s inputs and prediction accuracy. As a result, in the experiments that followed, we fed the model with pollutant and meteorological data from Shanghai and the ten cities we selected.

5.2. Single Step Prediction of PM_2.5 in Shanghai

Single-step prediction is the process of predicting Shanghai PM_2.5 concentrations for the upcoming hour based on past data collected over a three-hour period. The experiment compares the prediction accuracy and performance of the proposed model and the baseline models; the best prediction results are marked in bold. Each model uses the identical set of hyperparameters, and Table 2 displays the experimental results. Compared with the baseline models, our proposed BiTCN-ISInformer model yields the smallest RMSE and MAE and the largest IA and R², demonstrating the model’s superior prediction accuracy. Additionally, Table 2 shows that single deep learning models (CNN, LSTM, and TCN) perform worse than hybrid deep learning models, which is in line with the conclusions in Section 1.1.

From the test set, we chose 1000 consecutive hours at random and plotted the fit between the predicted and real values. In Figure 6, the red curve represents the predicted values and the blue curve represents the real values. Compared with the baseline models, the BiTCN-ISInformer model has the best fit of the predicted values to the real values, especially at abrupt change points, at peaks and valleys, and in fluctuation-intensive time periods of the data.

Using data from the same time period as the fitting plot, we also created scatter plots showing the predicted and real values for each baseline model and the proposed model. As illustrated in Figure 7, the vertical axis displays the real PM_2.5 values, the horizontal axis the predicted values, and the red dashed line is the constant line. In comparison to the baseline model, BiTCN-ISInformer exhibits the most concentrated distribution of outliers and the shortest interquartile range (IQR). In comparison to the baseline models CNN, LSTM, TCN, TCN-LSTM, Transformer, CBAM-CNN-BiLSTM, and ST-Transformer, the proposed model’s R² is 0.973, which is 0.224, 0.166, 0.162, 0.106, 0.090, 0.085, and 0.030 higher, according to the quantitative evaluation results. This suggests that the proposed model is appropriate for single step prediction tasks involving the concentration of air pollutants and offers outstanding prediction accuracy and precision. The low prediction error indicates that the forecast results possess sufficient accuracy to provide meaningful early warnings for air quality deterioration events, thereby supporting proactive public health measures.

5.3. Multi Step Prediction of PM_2.5 in Shanghai

Multi-step PM_2.5 prediction is the process of predicting PM_2.5 concentrations in the future hours using historical air pollutant and meteorological data. The length of the past period is called the historical window and the length of the future period is called the prediction step. We performed six sets of multi-step prediction trials, the results of which are shown in Table 3. In all six sets of experiments, the proposed BiTCN-ISInformer model outperforms the comparator models in terms of accuracy and performance. We computed the mean evaluation metrics values for the six sets of prediction scenarios and presented the histograms to more clearly illustrate each model’s performance. As illustrated in Figure 8, each model’s multi-step prediction performance is ordered from CBAM-CNN-BiLSTM to TCN-LSTM, ST-Transformer, and BiTCN-ISInformer in ascending order.
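
The historical window and prediction step described above correspond to a sliding-window split of the hourly series. The sketch below shows one common way to build such samples; the function name `make_windows`, the synthetic data, and the multi-step setting in the usage lines are hypothetical and do not reproduce the paper's six experimental configurations.

```python
import numpy as np


def make_windows(features, target, window, step):
    """Build supervised samples from an hourly series.

    features: (n_hours, n_features) inputs; target: (n_hours,) values.
    Returns X of shape (n, window, n_features) and Y of shape (n, step).
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target) - window - step + 1
    if n <= 0:
        raise ValueError("series is shorter than window + step")
    # Each sample pairs `window` past hours with the next `step` hours.
    X = np.stack([features[t:t + window] for t in range(n)])
    Y = np.stack([target[t + window:t + window + step] for t in range(n)])
    return X, Y


# Synthetic data: 48 hours with 2 input features.
hours = 48
features = np.arange(hours * 2, dtype=float).reshape(hours, 2)
target = np.arange(hours, dtype=float)

# Single-step setting from Section 5.1: window of three, step of one.
X1, Y1 = make_windows(features, target, window=3, step=1)
# Hypothetical multi-step setting, chosen only for illustration.
X2, Y2 = make_windows(features, target, window=12, step=6)
print(X1.shape, Y1.shape, X2.shape, Y2.shape)
```

Longer prediction steps reduce the number of available samples and require the model to output several future hours at once, which is one reason errors grow with the forecast horizon.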

It should be noted that as the forecast horizon increases, the prediction errors (RMSE, MAE) of all models exhibit an upward trend. This phenomenon stems from the accumulation of uncertainty and the complexity of long-term dependencies, representing a common challenge in the field of time series forecasting. Against this backdrop, the BiTCN-ISInformer model proposed in this study maintains the lowest error levels across all forecast horizons, highlighting its superiority in addressing long-term forecasting challenges. The current experimental results demonstrate the overall trend and average level of model performance.

5.4. Ablation Experiments on the Proposed Model

We carried out an ablation experiment to explore how each module of the proposed model contributed to the prediction performance and accuracy. The specific methods are as follows: (1) Module Removal: Retain only the BiTCN module or the ISInformer module separately to evaluate their independent ability to capture short-term local dependencies or long-term global dependency features; (2) Module Replacement: Replace the ISInformer module with the original Informer model to independently validate the performance gains brought by the cooperative mechanism of sparse sampling and importance sampling. All ablation models were trained and tested under identical datasets, hyperparameter settings, and training conditions. Performance comparisons were conducted using the same evaluation metrics as the complete BiTCN-ISInformer model.

Two groups are chosen from the multi-step prediction in order to predict Shanghai’s PM_2.5 levels for the upcoming three and twelve hours. Table 4 presents the experiment’s findings, demonstrating the efficacy of the proposed model architecture by showing that the removal of any module negatively impacts the model’s capacity for prediction. The model’s capacity to extract features was severely constrained when ISInformer was eliminated, allowing it to only identify short-term, localized patterns. Among the compared models, this ablated version’s prediction accuracy was the lowest in both sets of experiments. The model’s capacity for modeling local temporal features was severely compromised when BiTCN was eliminated, which resulted in a decline in prediction performance, with RMSE values increasing by 0.359 and 0.336, respectively.

This study also enhances the query-key filtering approach in the probabilistic sparse attention mechanism by optimizing the traditional sparse sampling to work in tandem with importance sampling. The ablation experiments show that the prediction accuracy of the BiTCN-Informer variant with only the traditional sampling strategy is significantly lower (RMSE rises by 0.426 and 0.576, respectively), and even lower than the performance of the ISInformer module alone. The reason is that conventional approaches introduce a huge number of pointless calculations that result in significant noise interference by evenly and randomly selecting a fixed number of keys for each query in order to compute the attention score [53]. In contrast, importance sampling greatly increases the accuracy of dependency modeling by filtering keys through the importance score distribution (Equation (7)), ensuring that the query only interacts with strongly linked keys. This demonstrates the rationality and efficacy of the proposed BiTCN-ISInformer model.

5.5. Computational Efficiency and Stability of the Models

This section evaluates the significant differences in computational efficiency and stability between the proposed BiTCN-ISInformer model and the baseline model. The excellent performance of the proposed model is further validated.

5.5.1. Computational Efficiency of the Models

To assess the models’ computational efficiency, we use the average computation time for each model across various prediction tasks. The model indicated in Section 3 is constructed as a parallel architecture, which effectively speeds up the model’s inference. This is confirmed by Table 5, which shows that the proposed BiTCN-ISInformer model’s average computation time is shorter than that of TCN-LSTM and CBAM-CNN-BiLSTM. Furthermore, the proposed BiTCN-ISInformer model’s average computation time is less than that of BiTCN-Informer, demonstrating that the importance sampling method’s incorporation successfully lowers the model’s computational complexity and boosts computational efficiency.

5.5.2. Stability of the Models

Many complex factors can influence how well PM_2.5 concentration prediction models perform, and the models respond differently to prediction tasks with varying time spans. Highly stable models possess greater task adaptability and higher prediction accuracy [54]. Consequently, we evaluate the stability of the proposed model and the baseline models using the variance of the prediction error S_var (Equation (16)). The results of the stability test are displayed in Figure 9, where HW denotes the historical window, PS denotes the prediction step, and a smaller S_var value indicates a more stable model. Figure 9 shows that the proposed BiTCN-ISInformer model has the lowest S_var value across the seven prediction tasks. This demonstrates that the model outperforms the baseline models in terms of prediction performance and stability, which should contribute to an accurate prediction of Shanghai’s PM_2.5 concentration.

The stability demonstrated by the BiTCN-ISInformer model across multiple prediction time horizons serves as a key indicator of its potential operational reliability. In practical deployment, this signifies the model’s ability to deliver consistent and dependable forecasts over time, unaffected by fluctuations in input data trends. This capability is crucial for decision-makers to establish confidence in the forecasting system. Furthermore, high stability enables the model to process larger datasets, demonstrating robust scalability.

5.6. Application of the Proposed Model to the Entire Study Area

We implemented the proposed prediction system in each of the 14 target cities in the study area to further confirm its universality and capacity for generalization. Specifically, the BiTCN-ISInformer model was utilized for single-step and multi-step prediction of PM_2.5 concentrations in each of the 14 cities, and all air pollutant and meteorological data of the 14 cities were used as input data for each prediction task. Ordinary kriging interpolation was then used to explore the spatial distribution characteristics of PM_2.5 concentrations and the spatial distribution characteristics of model prediction errors (RMSE) [55].

Figure 10 shows the spatial distribution of the average PM_2.5 concentration in the cities on the test set, which is characterized by low PM_2.5 concentrations in the east and high ones in the west. Correspondingly, the distribution of PM_2.5 concentration prediction errors in Figure 11 likewise displays a very similar spatial pattern, with low values in the east and high values in the west. Compared to low-concentration locations, high-concentration regions typically have larger model prediction uncertainties and errors due to more complicated meteorological variables, emission source architecture, and regional transit [56]. The consistency of the distribution pattern between Figure 10 and Figure 11 indicates that the model can show reasonable prediction behaviors at various concentration levels and geographical locations, demonstrating excellent generalization ability and universality.

Thus, the aforementioned analysis demonstrates the superior prediction accuracy and performance of the proposed BiTCN-ISInformer model, as well as its wide range of potential applications in regional PM_2.5 concentration prediction.

6. Conclusions

This research proposes a deep learning-based prediction method for both single-step and multi-step regional PM_2.5 concentration prediction. First, analyzing the correlation between regional air pollutant and meteorological data provides the initial selection of input features for the model. Then, as the central component of the prediction system, the BiTCN-ISInformer prediction model with a two-branch parallel architecture is constructed. On the one hand, the model optimizes the sampling method of the probabilistic sparse attention mechanism in the traditional Informer network from a single sparse sampling to a synergistic mechanism combining sparse sampling and importance sampling; on the other hand, the model can fully model both short-term fluctuations and long-term trends of spatiotemporal data by introducing the BiTCN and designing the parallel architecture. The proposed model outperformed the baseline model in both single-step and multi-step prediction experiments of Shanghai’s PM_2.5 concentration, with root mean square error (RMSE) and mean absolute error (MAE) ranging from 2.010 to 10.029 and 1.436 to 6.865, respectively. Subsequent ablation experiments confirmed that every module of the model made a significant contribution to improving prediction accuracy and performance. Last but not least, the stability test, computational efficiency analysis, and Kriging space interpolation experiments confirm that the proposed model has outstanding stability, generalization ability and universality, and that the inference speed has increased. Thus, it is anticipated that the research presented in this paper will offer a solid and precise foundation for air pollution early warning and prevention, which has a wide range of potential applications.

Although the advantages of the prediction system proposed in this study have been confirmed, there is still room for improvement. A viable optimization approach is to transform the geospatial data of every monitoring station in the study area into topological data that is fed into the model, so that short-term dependence features, long-term dependence features, and spatial features are extracted jointly, which is expected to increase the model’s prediction accuracy.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su17198631/s1, Table S1: Statistical information of Shanghai; Table S2: The impact of information from neighboring cities on PM_2.5 concentration prediction.

Author Contributions

X.M.: Writing – original draft, Validation, Formal analysis, Visualization, Software, Methodology. G.L.: Writing – review and editing, Funding acquisition, Resources. J.W.: Data curation, Supervision. Y.L.: Investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Khurram, H.; Lim, A. Analyzing and forecasting air pollution concentration in the capital and Southern Thailand using a lag-dependent Gaussian process model. Environ. Monit. Assess. 2024, 196, 1106. [Google Scholar] [CrossRef]
Qi, G.; Che, J.; Wang, Z. Differential effects of urbanization on air pollution: Evidences from six air pollutants in mainland China. Ecol. Indic. 2023, 146, 109924. [Google Scholar] [CrossRef]
Goossens, J.; Jonckheere, A.C.; Dupont, L.J.; Bullens, D.M. Air pollution and the airways: Lessons from a century of human urbanization. Atmosphere 2021, 12, 898. [Google Scholar] [CrossRef]
He, C.; Clifton, O.; Felker-Quinn, E.; Fulgham, S.R.; Juncosa Calahorrano, J.F.; Lombardozzi, D.; Purser, G.; Riches, M.; Schwantes, R.; Tang, W.; et al. Interactions between air pollution and terrestrial ecosystems: Perspectives on challenges and future directions. Bull. Am. Meteorol. Soc. 2021, 102, E525–E538. [Google Scholar] [CrossRef]
Wrotek, A.; Jackowska, T. Molecular Mechanisms of RSV and Air Pollution Interaction: A Scoping Review. Int. J. Mol. Sci. 2022, 23, 12704. [Google Scholar] [CrossRef]
Zhang, X.J.; Wei, F.Y.; Fu, H.Y.; Guo, H.B. Characterisation of environmentally persistent free radicals and their contributions to oxidative potential and reactive oxygen species in sea spray and size-resolved ambient particles. npj Clim. Atmos. Sci. 2025, 8, 27. [Google Scholar] [CrossRef]
Li, G.; Sun, S. Changing PM_2.5 concentrations in China from 1998 to 2014. Environ. Plan. A Econ. Space 2018, 50, 5–8. [Google Scholar]
Wang, J.; Xu, W.; Dong, J.; Zhang, Y. Two-stage deep learning hybrid framework based on multi-factor multi-scale and intelligent optimization for air pollutant prediction and early warning. Stoch. Environ. Res. Risk Assess. 2022, 36, 3417–3437. [Google Scholar] [CrossRef] [PubMed]
Jiang, X.; Yoo, E.H. The importance of spatial resolutions of Community Multiscale Air Quality (CMAQ) models on health impact assessment. Sci. Total Environ. 2018, 627, 1528–1543. [Google Scholar] [CrossRef] [PubMed]
Zhang, X.; Wang, Z.; Cheng, M.; Wu, X.; Zhan, N.; Xu, J. Long-term ambient SO₂ concentration and its exposure risk across China inferred from OMI observations from 2005 to 2018. Atmos. Res. 2021, 247, 105150. [Google Scholar] [CrossRef]
Xu, M.; Jin, J.B.; Wang, G.Q.; Segers, A.; Deng, T.; Lin, H.X. Machine learning based bias correction for numerical chemical transport models. Atmos. Environ. 2021, 248, 118022. [Google Scholar]
Houria, B.; Abderrahmane, M.; Kenza, K.; Gábor, G. Short-term predictions of PM₁₀ and NO₂ concentrations in urban environments based on ARIMA search grid modeling. CLEAN-Soil Air Water 2024, 52, 2300395. [Google Scholar] [CrossRef]
Gong, S.; Zhang, L.; Liu, C.; Lu, S.; Pan, W.; Zhang, Y. Multi-scale analysis of the impacts of meteorology and emissions on PM_2.5 and O₃ trends at various regions in China from 2013 to 2020 2. Key weather elements and emissions. Sci. Total Environ. 2022, 824, 153847. [Google Scholar] [CrossRef]
Mahesh, T.R.; Balajee, A.; Dorai, D.R.; Sehgal, L.; Khan, S.B.; Kumar, V.V.; Almusharraf, A. RSPDT: Randomized Search Probabilistic Decision Tree Classifier for Pollution Level Prediction in Smart Cities. Hum.-Centric Comput. Inf. Sci. 2025, 15, 35–50. [Google Scholar]
Singh, S.; Suthar, G. Machine learning and deep learning approaches for PM_2.5 prediction: A study on urban air quality in Jaipur, India. Earth Sci. Inform. 2025, 18, 97. [Google Scholar]
Liu, C.C.; Lin, T.C.; Yuan, K.Y.; Chiueh, P.T. Spatio-temporal prediction and factor identification of urban air quality using support vector machine. Urban Clim. 2022, 41, 101055. [Google Scholar] [CrossRef]
Razavi-Termeh, S.V.; Sadeghi-Niaraki, A.; Jelokhani-Niaraki, M.; Choi, S.M. Exploring multi-pollution variability in the urban environment: Geospatial AI-driven modeling of air and noise. Int. J. Digit. Earth 2024, 17, 2378819. [Google Scholar]
Emeç, M.; Yurtsever, M. A novel ensemble machine learning method for accurate air quality prediction. Int. J. Environ. Sci. Technol. 2025, 22, 459–476. [Google Scholar]
Pan, K.; Lu, J.; Li, J.; Xu, Z. A Hybrid Autoformer Network for Air Pollution Forecasting Based on External Factor Optimization. Atmosphere 2023, 14, 869. [Google Scholar] [CrossRef]
Kow, P.Y.; Chang, L.C.; Lin, C.Y.; Chou, C.C.K.; Chang, F.J. Deep neural networks for spatiotemporal PM_2.5 forecasts based on atmospheric chemical transport model output and monitoring data. Environ. Pollut. 2022, 306, 119348.
Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Ali, A.B.M.S.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617.
Bekkar, A.; Hssina, B.; Douzi, S.; Douzi, K. Air-pollution prediction in smart city, deep learning approach. J. Big Data 2021, 8, 161.
Liang, J.; Lu, Y.; Su, M. Hga-lstm: LSTM architecture and hyperparameter search by hybrid GA for air pollution prediction. Genet. Program. Evolvable Mach. 2024, 25, 20.
Qing, L. PM_2.5 concentration prediction using GRA-GRU network in air monitoring. Sustainability 2023, 15, 1973.
Zhang, B.; Liu, Y.; Yong, R.; Zou, G.; Yang, R.; Pan, J.; Li, M. A spatial correlation prediction model of urban PM_2.5 concentration based on deconvolution and LSTM. Neurocomputing 2023, 544, 126280.
Mao, W.; Jiao, L.; Wang, W.; Wang, J.; Tong, X.; Zhao, S. A hybrid integrated deep learning model for predicting various air pollutants. GISci. Remote Sens. 2021, 58, 1395–1412.
Ahmed, A.A.M.; Jui, S.J.J.; Sharma, E.; Ahmed, M.H.; Raj, N.; Bose, A. An advanced deep learning predictive model for air quality index forecasting with remote satellite-derived hydro-climatological variables. Sci. Total Environ. 2024, 906, 167234.
Wu, C.L.; He, H.D.; Song, R.F.; Zhu, X.H.; Peng, Z.R.; Fu, Q.Y.; Pan, J. A hybrid deep learning model for regional O₃ and NO₂ concentrations prediction based on spatiotemporal dependencies in air quality monitoring network. Environ. Pollut. 2023, 320, 121075.
Zhang, B.; Zou, G.; Qin, D.; Ni, Q.; Mao, H.; Li, M. RCL-Learning: ResNet and convolutional long short-term memory-based spatiotemporal air pollutant concentration prediction model. Expert Syst. Appl. 2022, 207, 118017.
Jiang, Y.; Gao, T.; Dai, Y.; Si, R.; Hao, J.; Zhang, J.; Gao, D.W. Very short-term residential load forecasting based on deep-autoformer. Appl. Energy 2022, 328, 120120.
Zou, Y.; Chen, Y.; Xu, Y.; Zhang, H.; Zhang, S. Short-term freeway traffic speed multistep prediction using an iTransformer model. Phys. A 2024, 655, 130185.
Liu, X.; Tao, Y.; Cai, Z.; Bao, P.; Ma, H.; Li, K.; Li, M.; Zhu, Y.; Lu, Z.J.; Wren, J. Pathformer: A biological pathway informed transformer for disease diagnosis and prognosis using multi-omics data. Bioinformatics 2024, 40, btae316.
Wang, Y.Z.; He, H.D.; Huang, H.C.; Yang, J.M.; Peng, Z.R. High-resolution spatiotemporal prediction of PM_2.5 concentration based on mobile monitoring and deep learning. Environ. Pollut. 2025, 364, 125342.
Mu, L.; Bi, S.; Ding, X.; Xu, Y. Transformer-based ozone multivariate prediction considering interpretable and priori knowledge: A case study of Beijing, China. J. Environ. Manag. 2024, 366, 121883.
Gu, K.; Liu, Y.C.; Liu, H.Y.; Liu, B.; Qiao, J.F.; Lin, W.S.; Zhang, W.J. Air Pollution Monitoring by Integrating Local and Global Information in Self-Adaptive Multiscale Transform Domain. IEEE Trans. Multimed. 2025, 27, 3716–3728.
Yan, Y.; Li, Y.; Sun, M.; Wu, Z. Primary pollutants and air quality analysis for urban air in China: Evidence from Shanghai. Sustainability 2019, 11, 2319.
Guo, F.F.; Xie, S.D. Formation Mechanisms of Secondary Sulfate and Nitrate in PM_2.5. Prog. Chem. 2023, 35, 1313–1326.
Chang-Hoi, H.; Park, I.; Oh, H.R.; Gim, H.J.; Hur, S.K.; Kim, J.; Choi, D.R. Development of a PM_2.5 prediction model using a recurrent neural network algorithm for the Seoul metropolitan area, Republic of Korea. Atmos. Environ. 2021, 245, 118021.
Zhu, Y.Y.; Gao, Y.X.; Liu, B.; Wang, X.Y.; Zhu, L.L.; Xu, R.; Wang, W.; Ding, J.N.; Li, J.J.; Duan, X. Concentration characteristics and assessment of model-predicted results of PM_2.5 in the Beijing-Tianjin-Hebei region in autumn and winter. Huan Jing Ke Xue 2019, 40, 5191–5201.
Li, D.; Wang, J.; Tian, D.; Chen, C.; Xiao, X.; Wang, L.; Wen, Z.; Yang, M.; Zou, G. Residual neural network with spatiotemporal attention integrated with temporal self-attention based on long short-term memory network for air pollutant concentration prediction. Atmos. Environ. 2024, 329, 120531.
Yan, R.; Liao, J.; Yang, J.; Sun, W.; Nong, M.; Li, F. Multi-hour and multi-site air quality index forecasting in Beijing using CNN, LSTM, CNN-LSTM, and spatiotemporal clustering. Expert Syst. Appl. 2021, 169, 114513.
Li, J.; Wen, M.; Zhou, Z.; Wen, B.; Yu, Z.; Liang, H.; Zhang, X.; Qin, Y.; Xu, C.; Huang, H. Multi-objective optimization method for power supply and demand balance in new power systems. Int. J. Electr. Power Energy Syst. 2024, 161, 110204.
Yuan, X.; Shen, X.; Mehta, S.; Li, T.; Ge, S.; Zha, Z. Structure injected weight normalization for training deep networks. Multimed. Syst. 2022, 28, 433–444.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115.
Sayeed, A.; Choi, Y.; Eslami, E.; Lops, Y.; Roy, A.; Jung, J. Using a deep convolutional neural network to predict 2017 ozone concentrations, 24 hours in advance. Neural Netw. 2020, 121, 396–408.
Li, X.; Peng, L.; Yao, X.; Cui, S.; Hu, Y.; You, C.; Chi, T. Long short-term memory neural network for air pollutant concentration predictions: Method development and evaluation. Environ. Pollut. 2017, 231, 997–1004.
Huang, J.; Liu, S.; Hassan, S.G.; Xu, L. Pollution index of waterfowl farm assessment and prediction based on temporal convoluted network. PLoS ONE 2021, 16, e0254179.
Ren, Y.; Wang, S.; Xia, B. Deep learning coupled model based on TCN-LSTM for particulate matter concentration prediction. Atmos. Pollut. Res. 2023, 14, 101703.
Yu, M.; Masrur, A.; Blaszczak-Boxe, C. Predicting hourly PM_2.5 concentrations in wildfire-prone areas using a SpatioTemporal Transformer model. Sci. Total Environ. 2023, 860, 160446.
Li, D.; Liu, J.; Zhao, Y. Prediction of multi-site PM_2.5 concentrations in Beijing using CNN-Bi LSTM with CBAM. Atmosphere 2022, 13, 1719.
Zhang, W.; Zhang, L.; Wang, J.; Niu, X. Hybrid system based on a multi-objective optimization and kernel approximation for multi-scale wind speed forecasting. Appl. Energy 2020, 277, 115561.
Wen, C.; Liu, S.; Yao, X.; Peng, L.; Li, X.; Hu, Y.; Chi, T. A novel spatiotemporal convolutional long short-term neural network for air pollution prediction. Sci. Total Environ. 2019, 654, 1091–1099.
Liang, S.; Hua, Z.; Li, J. Enhanced feature interaction network for remote sensing change detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4900–4915.
Sun, W.; Li, Z. Hourly PM_2.5 concentration forecasting based on mode decomposition-recombination technique and ensemble learning approach in severe haze episodes of China. J. Clean. Prod. 2020, 263, 121442.
Nori-Sarma, A.; Thimmulappa, R.K.; Venkataramana, G.V.; Fauzie, A.K.; Dey, S.K.; Venkareddy, L.K.; Berman, J.D.; Lane, K.J.; Fong, K.C.; Warren, J.L.; et al. Low-cost NO₂ monitoring and predictions of urban exposure using universal kriging and land-use regression modelling in Mysore, India. Atmos. Environ. 2020, 226, 117395.
Zhang, K.; Yang, X.; Cao, H.; Thé, J.; Tan, Z.; Yu, H. Multi-step forecast of PM_2.5 and PM₁₀ concentrations using convolutional neural network integrated with spatial-temporal attention and residual learning. Environ. Int. 2023, 171, 107691.

Figure 1. Distribution of core areas in the Yangtze River Delta urban agglomeration.

Figure 2. The framework of the proposed prediction system.

Figure 3. The architecture of the BiTCN module.

Figure 4. The architecture of the ISInformer module.

Figure 5. Improved probabilistic sparse attention mechanism.

Figure 6. Fitting diagrams for single-step prediction. Panels (a–h) show the fitting results for CNN, LSTM, TCN, TCN-LSTM, Transformer, CBAM-CNN-BiLSTM, ST-Transformer, and BiTCN-ISInformer, respectively.

Figure 7. Scatter plots for single-step prediction. Panels (a–h) show the distribution of predicted against observed values for CNN, LSTM, TCN, TCN-LSTM, Transformer, CBAM-CNN-BiLSTM, ST-Transformer, and BiTCN-ISInformer, respectively.

Figure 8. The average PM_2.5 prediction performance for different models.

Figure 9. Stability test results: variance of the prediction error for different models.

Figure 10. The distribution of the average PM_2.5 concentration on the test set.

Figure 11. Distribution of RMSE for PM_2.5 concentration prediction. Panels (a–f) correspond to forecasting tasks with different combinations of (historical window, prediction step): (3, 1), (4, 2), (6, 3), (8, 4), (16, 12), and (28, 24).

Table 1. Model parameters.

Parameter	Value
Sampling factor	5
BiTCN layers	3
Dimension of hidden layers in BiTCN	64
Encoder layers	3
Decoder layers	2
Dimension of hidden layers in ISInformer	256
Dimension of feedforward neural network	256
Dropout	0.1
Batch size	256
Epochs	100
Optimizer	Adam
Learning rate	0.001
Loss function	0.5 × MSE + 0.5 × MAE
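
The loss function listed in Table 1 weights MSE and MAE equally. A minimal sketch of that combination in plain Python is given below. The function name `combined_loss` and the `alpha` parameter are illustrative assumptions, not the authors' implementation, which would normally operate on tensors inside a deep learning framework.

```python
def combined_loss(y_true, y_pred, alpha=0.5):
    """Return alpha * MSE + (1 - alpha) * MAE for two equal-length sequences.

    alpha=0.5 reproduces the 0.5 x MSE + 0.5 x MAE entry in Table 1.
    The name and signature are hypothetical, chosen for illustration only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n   # mean squared error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error
    return alpha * mse + (1.0 - alpha) * mae


# Toy values, not data from the paper.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
loss = combined_loss(observed, predicted)
```

The squared term penalizes large errors during pollution peaks, while the absolute term keeps the loss less sensitive to outliers; the equal weighting is the choice reported in Table 1.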

Table 2. Comparison of single-step prediction performance of different models.

Model	RMSE	MAE	IA	R²
CNN	6.116	4.271	0.914	0.749
LSTM	5.532	3.977	0.940	0.807
TCN	5.477	4.017	0.946	0.811
TCN-LSTM	4.596	3.346	0.961	0.867
Transformer	4.313	3.189	0.967	0.883
CBAM-CNN-BiLSTM	4.220	2.970	0.969	0.888
ST-Transformer	3.000	2.259	0.986	0.943
BiTCN-ISInformer	2.010	1.436	0.993	0.973
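
The four metrics reported in Table 2 can be sketched as follows. This is a hedged illustration: IA is assumed here to be Willmott's index of agreement and R² the coefficient of determination, which are the usual definitions; the formulas in the paper's evaluation-metrics section take precedence if they differ. All function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def index_of_agreement(y_true, y_pred):
    """Willmott's index of agreement (assumed definition of IA)."""
    mean_obs = sum(y_true) / len(y_true)
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((abs(p - mean_obs) + abs(t - mean_obs)) ** 2
              for t, p in zip(y_true, y_pred))
    return 1.0 - num / den


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_obs) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not data from the paper.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "IA": index_of_agreement(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

Lower RMSE and MAE indicate better accuracy, while IA and R² approach 1 as predictions match observations, which is the direction of improvement seen from CNN to BiTCN-ISInformer in Table 2.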

Table 3. Comparison of multi-step prediction performance of different models.

Historical window (h)	Prediction step (h)	TCN-LSTM RMSE	TCN-LSTM MAE	CBAM-CNN-BiLSTM RMSE	CBAM-CNN-BiLSTM MAE	ST-Transformer RMSE	ST-Transformer MAE	BiTCN-ISInformer RMSE	BiTCN-ISInformer MAE
4	2	5.411	3.988	4.989	3.530	4.276	3.302	2.974	2.041
6	3	6.153	4.468	5.733	4.054	4.238	2.876	4.025	3.004
8	4	6.132	4.309	6.583	4.635	5.218	3.669	4.809	3.245
10	6	6.892	4.876	7.480	5.153	6.208	4.231	6.093	4.139
16	12	8.724	5.860	9.039	6.254	8.908	6.165	8.371	6.116
28	24	10.261	7.147	10.213	7.199	10.119	6.996	10.029	6.865
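
Each row of Table 3 pairs a historical window with a prediction step. The sketch below shows one common way to cut an hourly series into such input/target pairs. It assumes a univariate series and a stride of one hour; the paper's actual preprocessing (multivariate features, stride, normalization) is not shown in this section, and `make_windows` is a hypothetical helper.

```python
def make_windows(series, history, horizon):
    """Split a sequence into (input, target) pairs.

    history: number of past hourly values used as input
    horizon: number of future hourly values to predict
    Assumes a stride of one step between consecutive samples.
    """
    samples = []
    last_start = len(series) - history - horizon
    for start in range(last_start + 1):
        x = series[start:start + history]
        y = series[start + history:start + history + horizon]
        samples.append((x, y))
    return samples


# Toy hourly series (not data from the paper), using the first
# Table 3 setting: historical window = 4 h, prediction step = 2 h.
hourly = list(range(10))
pairs = make_windows(hourly, history=4, horizon=2)
```

Longer horizons leave fewer samples for a series of the same length and make each target harder to predict, which is consistent with the rising errors down the rows of Table 3.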

Table 4. Ablation experiments for multi-step prediction of PM_2.5 in Shanghai.

Model	RMSE (window 6 h, step 3 h)	MAE (window 6 h, step 3 h)	RMSE (window 16 h, step 12 h)	MAE (window 16 h, step 12 h)
BiTCN	4.501	3.140	9.753	6.576
ISInformer	4.384	3.082	8.707	6.521
BiTCN-Informer	4.451	3.091	8.947	6.365
BiTCN-ISInformer	4.025	3.004	8.371	6.116
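
To read the ablation results as relative gains, the RMSE values in Table 4 can be converted into percentage reductions achieved by the full model over each variant. The short sketch below does this for the 6 h window, 3 h step setting; the helper name `relative_reduction` is an illustrative assumption, and the numbers are copied from Table 4.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# RMSE values from Table 4 (historical window 6 h, prediction step 3 h).
rmse_3h = {
    "BiTCN": 4.501,
    "ISInformer": 4.384,
    "BiTCN-Informer": 4.451,
    "BiTCN-ISInformer": 4.025,
}

full_model = rmse_3h["BiTCN-ISInformer"]
reductions = {
    name: relative_reduction(value, full_model)
    for name, value in rmse_3h.items()
    if name != "BiTCN-ISInformer"
}
```

Under these figures the full model lowers RMSE by roughly 8 to 11 percent relative to each ablated variant at this horizon.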

Table 5. Comparison of average computation time of different models (in seconds).

Model	Average Computation Time (s)
TCN-LSTM	196.705
CBAM-CNN-BiLSTM	192.650
ST-Transformer	187.245
BiTCN-Informer	190.220
BiTCN-ISInformer	188.060

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
