Prediction of Particulate Matter 2.5 Concentration Using a Deep Learning Model with Time-Frequency Domain Information

Abstract: In recent years, deep learning models have gained significant traction and found extensive applications in the realm of PM2.5 concentration prediction. PM2.5 concentration sequences are rich in frequency information; however, existing PM2.5 concentration prediction models lack the ability to capture this frequency information. Therefore, we propose the Time-frequency domain, Bidirectional Long Short-Term Memory (BiLSTM), and attention (TF-BiLSTM-attention) model. First, the model uses the Discrete Cosine Transform (DCT) to convert time domain information into its corresponding frequency domain representation. Second, it joins the time domain information with the frequency domain information, which enables the model to capture frequency domain information on top of the original. Simultaneously, incorporating the attention mechanism after BiLSTM enhances the importance of critical time steps. Empirical results underscore the superior predictive performance of our proposed univariate model across all sites, outperforming the univariate BiLSTM, univariate BiLSTM-attention, and univariate TF-BiLSTM. Meanwhile, for the multivariate model, which adds the PM2.5 concentrations from other sites in the study area as input variables, our proposed model outperforms basic models such as BiLSTM and hybrid models such as CNN-BiLSTM at all sites.


Introduction
For a long time, air pollution has attracted the attention of the public, the government, and the scientific community. Air pollution not only affects weather and climate, leading to more extreme weather, but also endangers human health [1]. Particulate matter with an aerodynamic diameter less than 2.5 µm (PM2.5), as a prominent air pollutant, is able to penetrate the gas exchange area of the lungs and cause harm to other organs through the lungs [2]. Moreover, PM2.5 stands out as a pivotal factor influencing visibility, wherein escalated PM2.5 concentration induces alterations in the sky's color and leads to diminished atmospheric clarity [3]. Therefore, accurate prediction of PM2.5 concentration in advance holds tremendous significance in the realms of air pollution mitigation, human lifestyle, and physical health.
Various methods have been proposed by researchers to predict PM2.5 concentration in recent years. The existing methods for PM2.5 concentration prediction can be classified into two types: deterministic methods and statistical methods [4]. The deterministic methods use theoretical meteorological emissions and a chemical model to simulate the formation and dispersion process of pollutants, which ultimately achieves the prediction of future concentration trends of pollutants [5]. Among them, the Community Multiscale Air Quality (CMAQ) model [6] stands out as a prevalent deterministic approach. Compared with the deterministic methods, the statistical methods are able to identify the complex dependencies between air pollutant concentration and potential predictors by using existing data [7], so this approach effectively circumvents the intricacies and unwarranted complexities inherent in the modeling process, showcasing commendable performance. Among statistical methods, the artificial neural network stands out as a paramount representative. The artificial neural network is not constrained by physical, biological, or chemical processes and is capable of handling non-linear relationships with strong fitting and predictive capabilities.
A contemporary advancement in the field of PM2.5 concentration prediction research involves the utilization of deep neural networks, a specialized form of artificial neural network adept at handling extensive data through intricate model architectures. Among these techniques, the prevalent employment of deep learning models revolves around architectures such as Long Short-Term Memory (LSTM) [8], Bidirectional LSTM (BiLSTM) [9], Convolutional Neural Network-LSTM (CNN-LSTM) [10], and CNN-BiLSTM [11]. Moreover, researchers have sought to enhance the predictive prowess of these models by amalgamating diverse techniques, encompassing data decomposition and data/model fusion [12]. One research direction for data fusion is to incorporate the neighboring sites of the target site into the PM2.5 prediction.
Air quality data belong to time series data, which are rich in frequency information. Addressing the limitations observed in certain models like the transformer and LSTM, Jiang et al. [13] contended that these models exhibited a notable discrepancy between their predicted outcomes and the true values of the datasets due to their inadequate capacity to capture frequency information effectively. As a remedy, they explored the utilization of the Discrete Cosine Transform (DCT) instead of the commonly employed Fourier Transform (FT) for time-frequency transformations, thereby mitigating the occurrence of the Gibbs phenomenon during the transformation process. Building on this concept, they introduced the Frequency Enhanced Channel Attention Mechanism (FECAM), enabling proficient feature extraction from time series data. Extensive experiments conducted across diverse time series datasets demonstrated that FECAM, as a versatile approach, substantially enhanced the prediction performance of LSTM models. Chen et al. [14] introduced an innovative approach named the joint time-frequency domain transformer (JTFT) to facilitate multivariate prediction tasks. The JTFT method capitalized on the sparsity inherent in time series data within the frequency domain, skillfully employing a limited number of learnable frequencies to adeptly capture temporal dependencies. Through extensive experimentation, the authors demonstrated that this method significantly enhanced prediction performance while concurrently reducing computational overheads, rendering it a promising and efficient approach for time series forecasting.
Recently, the integration of the attention mechanism [15] within the domain of deep learning has garnered substantial traction, yielding promising and impactful outcomes. Zhou et al. [16] put forth an alternative prediction method for air pollutant concentrations, leveraging the Kalman filter, attention, and long and short-term memory (Kalman-attention-LSTM) model. The augmentation of the attention mechanism into the conventional LSTM architecture empowered the model with enhanced capabilities to effectively capture temporal information features. Through extensive experimentation, the researchers unveiled that the second prediction approach, employing the Kalman-attention-LSTM model, exhibited superior fitting results in comparison with six other competing models. This underscored the potential efficacy of the proposed method in advancing air pollutant concentration predictions. Wang et al. [17] introduced a novel air quality prediction model called CNN-BiNLSTM-attention. This model comprised three essential components: CNN, BiNLSTM, and attention. The incorporation of attention was instrumental in effectively capturing the impacts of distinct temporal feature conditions on Air Quality Index (AQI) prediction, thereby facilitating more precise AQI predictions for the subsequent hour. The empirical findings demonstrated that the proposed CNN-BiNLSTM-attention model outperformed the other five models, rendering it a more suitable choice for air quality prediction tasks.
In order to take full advantage of the frequency information in the PM2.5 concentration data, as well as to enable the BiLSTM model to focus attention on key positions, this paper establishes a TF-BiLSTM-attention model for PM2.5 concentration prediction across 12 monitoring sites in Beijing, China. This paper contributes to existing knowledge in the following ways: (1) First, it uses DCT to transform the input data from the time domain to the frequency domain. Next, the time domain dataset and the frequency domain dataset are combined to generate an integrated dataset. This unified dataset is designated as the model's input, enabling optimal utilization of frequency information while encompassing essential time domain information. (2) Adding the attention mechanism after the BiLSTM layer can improve the role of important time steps in BiLSTM, which can in turn improve the prediction effect of the model. Using time-frequency domain information, BiLSTM, and attention, it develops the TF-BiLSTM-attention model to predict PM2.5 concentration. (3) For the multivariate model, the input variables consist solely of the PM2.5 concentration at the site itself and at the remaining 11 sites within the study area, without considering the effects of other pollutant factors and meteorological factors. Empirical findings demonstrate that the multivariate model with these variables added has a good prediction effect.

Study Area and Materials
The air quality dataset employed in this study originates from the dataset provided by Zhang et al. [18], which is readily accessible for download from the University of California, Irvine (UCI) Machine Learning Repository page. This dataset encompasses air quality records spanning the period from 2013 to 2017, comprising data collected from 12 distinct Guokong monitoring sites situated in and around Beijing. Specifically, the monitoring sites are Aotizhongxin, Changping, Dingling, Dongsi, Guanyuan, Gucheng, Huairou, Nongzhanguan, Shunyi, Tiantan, Wanliu, and Wanshouxigong. To facilitate the description, we number the above-mentioned sites, with Aotizhongxin noted as S1, Changping as S2, Dingling as S3, and so on.
The individual sites are represented by raw sensor data obtained at hourly intervals, spanning 1461 days with a total of 35,064 data samples. The dataset for each site contains 12 variables. For the purposes of this study, however, only the PM2.5 data series from each site is utilized.
Due to subjective and objective reasons such as instrument damage and human factors [19], a certain percentage of data is missing from this dataset. Within this study, the missing data are filled in using the random forest interpolation algorithm [20]. In order to eliminate the different dimensions between the features and speed up convergence, we normalize and scale the dataset to [0, 1]. Equation (1) illustrates the Min-Max normalization method:

$$X = \frac{x - \min}{\max - \min} \quad (1)$$

where X denotes the normalized value, x denotes the original value, and min and max denote the minimum and maximum values in the dataset, respectively.
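As a minimal sketch of Equation (1) applied to a PM2.5 series (the function name is ours, not from the paper):

```python
def min_max_normalize(values):
    """Scale a sequence to [0, 1] using Min-Max normalization (Equation (1))."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Illustrative hourly PM2.5 readings (µg/m³)
pm25 = [12.0, 80.0, 46.0, 12.0, 148.0]
scaled = min_max_normalize(pm25)
# the minimum maps to 0.0 and the maximum maps to 1.0
```

The same min and max must later be reused to invert the scaling when the model's predictions are mapped back to concentration units.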

Methodology Framework
The methodology framework for TF-BiLSTM-attention is depicted in Figure 1, and the detailed training steps for the model are outlined below.
Step 1. Preprocess the data. Firstly, remove the outliers in the original dataset, which include anomalously high or low pollutant concentration values. That is, values that deviate by more than three standard deviations from the mean pollutant value are removed. Then the corresponding missing data are filled in using the random forest interpolation algorithm. Next, scale the data to between 0 and 1 using the Min-Max normalization algorithm, which speeds up convergence.
Step 2. Convert the data format. Convert the raw multivariate air quality sequences into the supervised learning sequence format. That is, convert the multivariate air quality sequences into a set of sequences containing pairs of inputs and outputs.
Step 3. Divide the dataset. The dataset is partitioned into training and testing sets, with 80% allocated to the training set and the remaining 20% designated for testing purposes.
Step 4. Find optimal hyperparameters and predict the result. Hyperparameters of a model are parameters that are predefined empirically before model training, such as the learning rate, number of iterations, etc. In recent years, some researchers have started to use automatic machine learning (auto-ML) methods to replace manual tuning and thus accomplish hyperparameter optimization. In this paper, the free and open-source Neural Network Intelligence (NNI) framework and the Tree-structured Parzen Estimator (TPE) [21] method are used as the optimization method. The hyperparameter search space of NNI is shown in Table 1. Then, the model's performance is examined on a test dataset to find the optimal hyperparameter combination.
Step 5. Save the prediction model parameters.
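Step 2 above can be sketched as a sliding-window conversion; the helper name is illustrative (not from the paper), and the window length of 7 matches the sequence length reported later:

```python
def to_supervised(series, seq_len):
    """Convert a univariate series into (input window, next value) pairs
    for supervised learning: each window of seq_len past values predicts
    the value immediately after it."""
    pairs = []
    for i in range(len(series) - seq_len):
        pairs.append((series[i:i + seq_len], series[i + seq_len]))
    return pairs

samples = to_supervised(list(range(10)), seq_len=7)
# a series of 10 points yields 10 - 7 = 3 input/output pairs;
# the first pair is ([0, 1, 2, 3, 4, 5, 6], 7)
```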

Discrete Cosine Transform
Real-world datasets, including time series datasets, contain rich frequency information. In order to take full advantage of the frequency information in time series data and better uncover the hidden patterns of the series, this paper introduces the DCT to turn time domain information into frequency domain information.
DCT is similar to the Fourier Transform [13], but DCT achieves better time-frequency energy compaction characteristics than the Discrete Fourier Transform (DFT), and there is no redundant data in the resultant sequence [13]. Therefore, DCT can extract the frequency information in a time series well.
The DCT of a one-dimensional sequence of length N [22] is defined as

$$F(u) = c(u)\sum_{x=0}^{N-1} f(x)\cos\left(\frac{(x+0.5)\pi}{N}u\right), \quad u = 0, 1, 2, \ldots, N-1$$

where F is the transformed sequence, $\cos\left(\frac{(x+0.5)\pi}{N}u\right)$ is called the forward DCT transformation kernel, f(·) is the original sequence, and c(u) [22] is a compensation coefficient defined as

$$c(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}$$
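As a plain-Python illustration (not the implementation used in the paper), the definition above can be computed directly:

```python
import math

def dct(f):
    """DCT-II of sequence f with orthonormal compensation coefficients c(u)."""
    N = len(f)
    out = []
    for u in range(N):
        c = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
        s = sum(f[x] * math.cos((x + 0.5) * math.pi * u / N) for x in range(N))
        out.append(c * s)
    return out

# A constant sequence concentrates all energy in the zero-frequency term:
F = dct([1.0, 1.0, 1.0, 1.0])  # F[0] = 2.0, all other coefficients ≈ 0
```

In practice an optimized routine (e.g. a library FFT-based DCT) would be used; this direct O(N²) form only mirrors the formula.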

Bidirectional Long Short-Term Memory Neural Network
Long Short-Term Memory Neural Network
LSTM, a type of Recurrent Neural Network (RNN), offers advancements over traditional RNNs by effectively addressing both short-term and long-term dependency issues. Notably, LSTM demonstrates the capability to retain crucial information from early sequences even within lengthy sequences, making it a compelling choice for numerous applications [23]. In recent years, the LSTM neural network has been widely used in air quality prediction. Fundamentally, the LSTM architecture comprises three pivotal gates: the input gate, forget gate, and output gate, as depicted in Figure 2.
As shown in Figure 2, the inputs and outputs of the LSTM structure are described as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

where σ is the sigmoid function, x_t is the input at time t, h_t is the hidden state, C_t is the cell state, f_t, i_t, and o_t are the forget, input, and output gate activations, and the W and b terms are the corresponding weight matrices and bias vectors.

Bidirectional Long Short-Term Memory Neural Network
The BiLSTM model, an extension of the LSTM architecture, comprises two LSTM layers: a forward LSTM layer and a backward LSTM layer. The forward LSTM processes the sequence in the forward direction and the backward LSTM processes the sequence in the reverse direction, and the outputs of the two LSTMs are concatenated after processing is completed [11]. Figure 3 presents a schematic representation of the BiLSTM model structure. As can be seen in Figure 3, BiLSTM can process both forward and backward time series using two LSTMs. Each hidden layer in both directions is adept at capturing relevant information from both past and future contexts pertaining to a specific time step. As a result, the features of air pollutants can be extracted more comprehensively using BiLSTM, thus improving the predictive performance of the hybrid model.

Attention Mechanism
In the process of time series forecasting, the input features at different times of the time series have different effects on the predicted values, and the smaller the interval from the prediction point, the greater the influence of feature information on the prediction point [16]. The basic BiLSTM network assigns equal weight values to all input features, ignoring the degree of influence of the input features on the predicted values. In this paper, we use the attention mechanism to optimize the basic BiLSTM network, and adding the attention mechanism after the BiLSTM layer can improve the role of important time steps in BiLSTM, which in turn improves the prediction effect of the model.
In this study, the output vectors from the BiLSTM hidden layer serve as inputs to the attention layer, trained by a fully connected layer. The outputs of the fully connected layer are normalized using the softmax function to derive the weight assigned to each hidden layer vector, whose size indicates the importance of the hidden state for the prediction result at each time step. The weight training process and the weighted sum of the hidden layer output vectors using the trained weights are described as follows:

$$e_i = \tanh(W k_i + b)$$
$$a_i = \frac{\exp(e_i)}{\sum_{j}\exp(e_j)}$$
$$O_t = \sum_{i} a_i k_i$$

where W is the weight coefficient, b is the bias coefficient, and k_i is the ith hidden unit state value of the output at moment t in the BiLSTM layer; e_i is the score of each hidden unit, a_i is the normalized score, and O_t is the final output of the attention layer.
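The scoring, normalization, and weighted-sum steps can be sketched as follows. This is a simplified version with a scalar score per hidden unit; the paper's actual layer dimensions are not specified, so the shapes here are illustrative:

```python
import math

def attention_pool(hidden_states, W, b):
    """Score each hidden state, softmax-normalize the scores,
    and return the attention-weighted sum of the states."""
    e = [math.tanh(W * k + b) for k in hidden_states]        # scores e_i
    exp_e = [math.exp(v) for v in e]
    total = sum(exp_e)
    a = [v / total for v in exp_e]                           # normalized a_i
    return sum(ai * ki for ai, ki in zip(a, hidden_states))  # output O_t

# Pool three illustrative BiLSTM hidden-state values into one output
o = attention_pool([0.2, 0.9, -0.4], W=1.0, b=0.0)
```

Because the weights a_i sum to 1, the output is a convex combination of the hidden states, so states with larger scores dominate the result.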

The TF-BiLSTM-Attention Model
To improve the accuracy of PM2.5 concentration prediction, this study introduces a hybrid model called TF-BiLSTM-attention, integrating the Discrete Cosine Transform [13], BiLSTM [9], and attention mechanism [15]. The model's framework is visually represented in Figure 4.
Firstly, the data in the time domain are converted into data in the frequency domain by DCT, and then the time domain data and the frequency domain data are combined to form the input fed into the model. In this way, the model can make use of both time domain and frequency domain information in the prediction process. Next, the long-term and short-term dependencies in the input data are extracted using the BiLSTM network. Then, the prediction results of BiLSTM are obtained, and the attention mechanism is leveraged to assign higher weights to influential factors, thus optimizing resource allocation and elevating prediction accuracy within the model. Finally, the optimized results are put into the output layer and the prediction results are output according to the required dimensions.

Network Architecture and Hyperparameter Setting
The deep neural network in this study is constructed using the PyTorch [24] framework. To determine the optimal performance configuration for the TF-BiLSTM-attention model, experimental investigations are conducted with the assistance of NNI. The final parameter settings that yielded the best results are identified as follows: a hidden size of 11, a sequence length of 7, a batch size of 16, an epoch size of 20, a learning rate of 0.0001, and the adoption of the "adam" optimization function, a widely used technique in deep neural networks. Moreover, to measure the model's predictive efficacy, the Mean Squared Error (MSE) loss function is employed, effectively guiding the training process towards attaining accurate predictions.

Feature Selection
To enhance the predictive accuracy of PM2.5 concentrations, adding pollutant factors and meteorological factors as input features to the model [25] is a common approach. Considering the spatio-temporal dependence, Wardana et al. [26] analyzed all PM2.5 samples from the 12 sites in Beijing and calculated the PM2.5 correlation coefficients between the sites. The experimental results showed that there is a strong correlation of PM2.5 concentration between sites and that adding PM2.5 from other sites as features to the input data can improve the prediction effect of the model. Within this study, we employ the Pearson correlation coefficient [25] to quantify the relationship between the PM2.5 concentration and the candidate input features. The Pearson correlation coefficient is shown in Equation (15):

$$\rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \quad (15)$$

where x and y denote the input features, and n is the number of samples in the sequence.
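Equation (15) can be computed directly; the helper below is illustrative, not the paper's code:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear: r = 1.0
```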
Figure 5 shows a heat map of the correlation coefficients between 17 features: the six pollutants at site S1 itself and the PM2.5 concentrations at the remaining 11 sites, where S1, S2, S3, S4, etc. represent the PM2.5 concentration of each site. As can be seen in Figure 5, the correlations of the six pollutant factors at site S1 vary considerably, with the PM2.5 concentration at site S1 having the highest correlation with the PM10 concentration at its own site and the lowest correlation with the O3 concentration at its own site. In contrast, the PM2.5 concentrations at all 12 sites are strongly correlated with each other (ρ > 0.7). This strong correlation indicates a substantial spatial dependence of PM2.5 levels among the various monitoring sites.

Meanwhile, Wardana et al. [26] conducted comparative experiments using different input variables. The findings demonstrated that, for the same model, using the PM2.5 concentration at the target site together with the PM2.5 concentrations at other sites as the input variables produced better predictions than using the pollutant factors and meteorological factors at the target site as the input variables. Therefore, in this paper, only the PM2.5 concentration at the target site and the PM2.5 concentrations at the other sites are used as input variables for the multivariate model, without considering other pollutant factors and meteorological factors.

Evaluation of Prediction Results
In order to assess the predictive accuracy of the model, we chose the mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R²) as model evaluation metrics. The calculations are shown in Equations (16)-(19):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \quad (16)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \quad (17)$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \quad (18)$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (19)$$

where y_i denotes the true value, ŷ_i denotes the predicted value, and ȳ denotes the mean of all true values. Lower values of MAE, RMSE, and MAPE indicate smaller errors and thus more accurate predictions, while an R² value closer to 1 indicates a better model fit.
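Equations (16)-(19) can be sketched as follows (an illustrative helper, not the paper's code):

```python
import math

def metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in %), and R^2 per Equations (16)-(19)."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mape = 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, mape, r2

# Perfect predictions give zero errors and R^2 = 1
mae, rmse, mape, r2 = metrics([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
```

Note that MAPE is undefined when a true value is zero, so PM2.5 series (strictly positive) are a safe input.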

Results
To validate the effectiveness of our proposed model, a series of experiments is conducted. First, only the PM2.5 concentration at the target site is used as an input to the model. Specifically, we compare four models using a single variable: BiLSTM, BiLSTM-attention, TF-BiLSTM, and TF-BiLSTM-attention. Second, a total of 12 variables, including the PM2.5 concentration at the target site and the PM2.5 concentrations at the remaining 11 sites in the study area, are used as inputs to the model. Our proposed univariate model is first compared with the multivariate model, and then the multivariate model is compared with other basic multivariate models and hybrid multivariate models.

Comparison with Different Univariate Models
Table 2 demonstrates the comparison between the four models using univariate inputs for each site. A comparison of the overall predictive effectiveness of the four univariate models across the 12 sites is shown in Figures 6 and 7, which illustrate that the TF-BiLSTM-attention model has the smallest mean RMSE and mean MAE and the largest mean R² for the 12 sites. This indicates that the TF-BiLSTM-attention model performs the best among the four models. The difference in forecast performance between the TF-BiLSTM and TF-BiLSTM-attention models is very small, as is the difference between the BiLSTM and BiLSTM-attention models. Incorporating the attention mechanism still yields a slight improvement over the corresponding model without it, but in the case of univariate input, the attention mechanism does not improve the prediction performance significantly. Finally, the prediction performance of the TF-BiLSTM model is significantly better than that of the BiLSTM model, which indicates that adding frequency information to the model improves its predictions. Meanwhile, Table 2 shows that the same model performs differently at different sites. Taking the univariate TF-BiLSTM-attention model, which performs best overall, as an example, site S12 has the worst performance, site S3 has the smallest RMSE, site S7 has the smallest MAE, site S10 has the largest R², and site S6 has the smallest MAPE.

Comparison with Different Multivariate Models
In this study, we take the PM2.5 concentration at its own site and the PM2.5 concentration at 11 other sites as the input variables of the multivariate model. The prediction effects of the univariate TF-BiLSTM-attention and multivariate TF-BiLSTM-attention models are shown in Figure 8. Figure 8 illustrates that the multivariate model provides significantly better prediction performance compared to the univariate model. This suggests that incorporating the PM2.5 concentration data from the remaining 11 sites as input variables enhances the predictive capability of the model.

For this study, we select four basic models, LSTM, BiLSTM, Gate Recurrent Unit (GRU), and Bidirectional Gate Recurrent Unit (BiGRU), and four hybrid models, CNN-LSTM, CNN-BiLSTM, CNN-GRU, and CNN-BiGRU, to compare with our proposed model. In addition, two models, TF-BiLSTM and TF-CNN-BiLSTM, are constructed for comparison. The prediction performances of the different multivariate models are shown in Tables 3 and 4, which clearly illustrate that our proposed model has the smallest mean RMSE, mean MAE, and mean MAPE and the largest mean R² over the 12 sites. This indicates that the TF-BiLSTM-attention model has the best prediction performance among these multivariate models.
From the perspective of a single site, the evaluation metrics of the multivariate LSTM, BiLSTM, GRU, and BiGRU models differ little from one another, which suggests that under the same conditions these four models predict about equally well. Meanwhile, the difference in evaluation metrics between each hybrid model with a CNN network added and its base model is also small, which indicates that adding a layer of 1D-CNN to a single base model does not significantly improve the prediction performance in this study.
From the overall results for the 12 sites, when the three multivariate models TF-BiLSTM, TF-CNN-BiLSTM, and TF-BiLSTM-attention are compared with one another, the TF-BiLSTM-attention model predicts best, the TF-BiLSTM model second best, and the TF-CNN-BiLSTM model worst. This indicates that, after frequency domain information is added to the model, the CNN network cannot extract the feature information more effectively; instead, adding the attention mechanism improves the prediction performance of the model. Meanwhile, the attention mechanism yields a larger improvement in prediction performance for the multivariate model than for the univariate model.
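The role the attention mechanism plays here, weighting the BiLSTM hidden states by importance instead of treating all time steps equally, can be sketched as a simple softmax pooling. The score vector `w` below stands in for the learned attention parameters and is an illustrative assumption:

```python
import numpy as np

def attention_pool(H, w):
    """Softmax-weighted sum of hidden states.

    H: (timesteps, hidden) matrix of BiLSTM outputs; w: (hidden,) scoring
    vector. Time steps with higher scores contribute more to the pooled
    context vector passed on to the output layer.
    """
    scores = H @ w
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to 1
    return alpha @ H                                # context vector, shape (hidden,)
```

With a zero score vector the weights are uniform and the pooling reduces to a plain average; a trained `w` instead emphasizes the critical time steps, which is the effect credited to the attention mechanism above.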

Conclusions
In this paper, we propose the TF-BiLSTM-attention model for PM2.5 concentration prediction at 12 sites in Beijing. To capture the frequency information in the PM2.5 concentration series, the DCT is first used to transform the time domain series into the frequency domain series. Second, the original time domain series is concatenated with the frequency domain series, allowing the model to exploit both time and frequency domain information. The input features at different time steps affect the predicted values to different degrees, while the basic BiLSTM network assigns equal weight to all input features, ignoring their varying influence on the predictions. For this reason, we add the attention mechanism after the BiLSTM network to give higher weights to the more influential factors. At the same time, exploiting the high spatial dependence of PM2.5 concentrations across sites, a total of 12 variables, namely the PM2.5 concentration at a given site and those of the remaining 11 sites in the study area, are used as input variables for the multivariate model. The results demonstrate superior prediction performance of the model incorporating frequency domain information compared to the model using time domain information alone. Furthermore, the hybrid model augmented with the attention mechanism outperforms the hybrid model without it. Our proposed TF-BiLSTM-attention model outperforms all the basic and hybrid models in the experiment. The improved model can capture key information and trends in the time series data more accurately, which helps improve the prediction of pollutant concentrations, meteorological conditions, and other factors, thus improving overall prediction accuracy. Nevertheless, applying the DCT and adding the attention mechanism to the base BiLSTM model increases model complexity. A more complex model takes longer to train, and the longer computation time may affect the real-time availability of air quality predictions, making it harder to update and provide information in a timely manner. Therefore, in future applications, we will strive to find a suitable balance between model complexity and computational efficiency. In addition, this study lacks spatial feature information related to PM2.5 that would explain how the model generalizes to different regions. For this reason, a large amount of real and valid air quality data will be collected in the future, and the model proposed in this paper will be validated on datasets from several different regions.
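The time-frequency construction summarized above, a DCT of each window followed by concatenation with the original window, can be sketched as follows. An orthonormal DCT-II is assumed here (the paper does not state which DCT variant is used); in practice `scipy.fft.dct` would serve, but the transform is written out with NumPy for self-containment:

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)
    X = np.cos(np.pi / N * (n + 0.5) * k) @ x
    X[0] /= np.sqrt(N)            # orthonormal scaling for k = 0
    X[1:] *= np.sqrt(2.0 / N)     # orthonormal scaling for k > 0
    return X

def time_frequency_features(window):
    """Join a time-domain PM2.5 window with its frequency-domain DCT."""
    window = np.asarray(window, dtype=float)
    return np.concatenate([window, dct_ii(window)])  # time + frequency features
```

The concatenated vector doubles the feature length of each window, giving the downstream BiLSTM-attention network access to both representations at once.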

Figure 1 .
Figure 1.Methodology framework of this study.


Figure 3 .
Figure 3. Structure of the BiLSTM network.


Figure 5 .
Figure 5. Heat map of correlation coefficients between different features of site S1.


Table 1 .
Search space of Neural Network Intelligence.

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (11)$$
where $f_t$ represents the forget gate, $i_t$ represents the input gate, $\tilde{C}_t$ is a vector created by the tanh layer for the new candidate value, $C_t$ refers to the cell state, $o_t$ represents the output gate, $h_t$ refers to the hidden state, $W_f$, $W_i$, $W_c$, and $W_o$ are input weights, $b_f$, $b_i$, $b_c$, and $b_o$ are bias weights, $t$ denotes the current state, and $t-1$ denotes the previous state. $\sigma$ and $\tanh$ are activation functions. Figure 3 presents a schematic representation of the BiLSTM model structure.
Bidirectional Long Short-Term Memory Neural Network

Table 2 .
Comparison of different models using PM2.5 as the input in terms of RMSE, MAE, R², and MAPE.

Comparison of mean values of RMSE and MAE for 12 sites with different univariate models.

Comparison of mean values of MAPE and R² for 12 sites with different univariate models.

Table 3 .
Comparison of RMSE and MAE of different multivariate models.

Table 4 .
Comparison of R² and MAPE of different multivariate models.