A New Combination Model for Air Pollutant Concentration Prediction: A Case Study of Xi’an, China

Abstract: As energy demand continues to increase, environmental pollution is becoming more severe. Governments and researchers have made great efforts to avoid and reduce air pollution. The prediction of PM2.5, an important index of air quality, is therefore of great significance. However, PM2.5 concentration follows a complex change process that makes its prediction challenging. Analysis of PM2.5 concentration together with the other pollutants in the atmosphere and meteorological factors shows that the variation in PM2.5 concentration is influenced by multiple factors, and that the relevant features also influence each other. To reduce computational loss while giving full consideration to the influencing factors, we used the maximum correlation minimum redundancy (MRMR) algorithm to calculate the correlation and redundancy between features. In addition, the Brock–Dechert–Scheinkman (BDS) statistical results show that the change in PM2.5 is nonlinear. Due to the outstanding performance of bidirectional long short-term memory (BiLSTM) neural networks in nonlinear prediction, we constructed an encoder–decoder model based on BiLSTM, named ED-BiLSTM, to predict the PM2.5 concentration at monitoring stations. For areas without monitoring stations, the lack of historical data limits the application of neural networks. To obtain the pollutant concentration distribution across the study area, we divided it into a 1 km × 1 km grid and combined the ED-BiLSTM model with the inverse distance weighting (IDW) algorithm to obtain PM2.5 concentration values in regions without monitoring stations. Finally, ArcGIS was used to visualize the results. The data for the case study were obtained from Xi'an. The results show that, compared with the standard long short-term memory (LSTM) model, the RMSE, MAE, and MAPE of our proposed model were reduced by 24.06%, 24.93%, and 22.9%, respectively.
The proposed model has a low error in PM2.5 prediction and can provide a theoretical basis for the formulation of environmental protection policies.


Introduction
Environmental pollution is becoming increasingly severe as industrialization and urbanization accelerate, attracting the attention of governments and experts worldwide [1]. PM2.5, an essential factor causing haze [2] and reducing atmospheric visibility [3], contributes much more to heavily polluted weather than other pollutants [4]. Many researchers have studied the relationship between PM2.5 and health. One study shows that breathing air with excessive pollution levels for a long time can worsen cardiovascular and respiratory diseases [5]. Ashtari et al. [6] demonstrated a positive relationship between air quality and multiple sclerosis severity through an eight-year study in the Isfahan region of Iran. An estimated 1.65 million to 2.19 million early deaths every year are associated with PM2.5 [7]. In order to prevent air pollution and protect the environment, air pollution prediction is crucial [8]. An accurate PM2.5 concentration forecast model can help the public make sensible travel arrangements, lessen the risks to people's health, and guide governments and the public to make the right choices.
Obtaining accurate PM2.5 concentration predictions from the existing large quantity of historical meteorological and pollutant monitoring data is a crucial issue in air pollution research. The concentration of pollutants at a given point in time can affect future pollutant concentrations, and the effects may last hours, days, or longer. At present, the time granularity in pollutant prediction is mainly hourly [9] or daily [10], and some studies have examined pollutant emissions on a yearly scale [11,12]. Current pollutant prediction models can be classified into mechanism models, statistical models, and deep learning models. Mechanism models simulate the chemical and physical evolution of pollutant transport through atmospheric kinetic theory. Typical methods are community multiscale air quality (CMAQ) [13], the nested air quality prediction and modeling system (NAQPMS) [14], and the weather research and forecasting (WRF) model [15]. The advantage of these methods is that modeling does not require historical data. Still, the complexity of the model calculations and parameter settings requires a strong background in atmospheric science, limiting the practical application of these methods. With the development of sensing technology, data acquisition has become easy, and statistical models are now used for pollutant prediction. Statistical models are divided into traditional statistical models and machine learning models. For pollutant prediction, the traditional statistical models known as autoregressive integrated moving average (ARIMA) [16] and autoregressive moving average with exogenous inputs (ARMAX) [17] are frequently utilized. However, they are based on linear presumptions and have straightforward model structures that cannot be used to solve nonlinear issues. Machine learning techniques are also used to address nonlinear issues in pollutant prediction. Machine learning-based prediction methods include back propagation (BP) [18], extreme learning machine (ELM) [19], and support vector regression (SVR) [20]. Even though machine learning algorithms have produced positive results for pollutant prediction, PM2.5 concentration sequences exhibit complex fluctuations over time and are influenced by a variety of factors. The prediction accuracy of statistical models is constrained because they cannot satisfy the requirements of multivariate nonlinear data prediction.
With the development of computer technology, deep learning has demonstrated outstanding nonlinear data processing capabilities. For example, computer vision [21,22], image classification [23], and natural language processing [24] have demonstrated the effectiveness of deep learning models. Because recurrent neural networks (RNNs) are good at dealing with sequential data, they have been used in numerous models to predict pollutant concentration [25]. Nevertheless, RNNs suffer from gradient vanishing and gradient explosion when predicting long time series. Long short-term memory (LSTM) [26], a variant of the RNN, can solve this problem [27]. To demonstrate the viability of using LSTM models in short-term prediction, LSTM-based models have been compared with statistical models [28–30]. Both Wu et al. [31] and Wang et al. [32] predicted air pollution using LSTM; the difference was that Wang et al. [32] took the role of influencing factors into account through the chi-square test. As the LSTM can only encode data in one direction, Graves and Schmidhuber [33] proposed the bidirectional long short-term memory (BiLSTM) model, which contains a forward LSTM unit and a backward LSTM unit. Experiments have shown that BiLSTM is more accurate and stable in pollutant prediction than the single LSTM model, particularly for maximums and minimums. A single neural network cannot handle complex prediction tasks, since the changes in pollutant concentration are complex and have spatiotemporal characteristics, so more and more combination models are being developed. Bai et al. [34] considered the seasonal characteristics of PM2.5 concentration variation and then obtained PM2.5 predicted values using a stacked auto-encoder model. Pak et al. [35] extracted the spatial features of multidimensional time series using a convolutional neural network and then used LSTM to make predictions. Chang et al. [36] achieved PM2.5 prediction by aggregating multiple LSTM units and considered the correlation of influencing factors in the model. A combination of auto-encoder and BiLSTM was proposed as a new model [37]. Furthermore, the encoder–decoder model based on LSTM has also been applied to time series prediction [38,39]. Because LSTM-based encoders and decoders can fully extract time series features, they are widely used in pollutant concentration prediction [40,41]. These models use an LSTM unit as an encoder to extract the features of the sequence data and then use another LSTM unit as a decoder to decode the encoded vector and obtain the predicted values. Existing studies show that BiLSTM performs better than LSTM in time series prediction, so we construct an encoder–decoder prediction model using BiLSTM.
In addition to prediction models, the presence of excessive features in PM2.5 short-term prediction can cause information redundancy and lead to error accumulation [42]. Multiple-series data can lead to the overfitting of models [43]. Based on the above issues, we select features from the many pollutants and the large quantity of historical meteorological monitoring data. Feature selection improves prediction accuracy by selecting strongly correlated variables from multidimensional data. The methods commonly used for feature selection include mutual information [44,45], the Pearson correlation coefficient [46,47], and the Kendall correlation coefficient [48]. As part of their study, Bai et al. [34] calculated the Kendall correlation coefficient for PM2.5 and meteorological variables. Zhang et al. [37] solved the same problem using the Pearson correlation coefficient, but the Pearson correlation coefficient is not a good measure of nonlinear relationships [49]. To measure correlations between PM2.5 and other pollutants, Wang et al. [2] used interval gray incidence analysis. However, the above studies did not consider the redundancy between features. Studies also show that adjacent monitoring sites can provide important information for the prediction model. To overcome these limitations, we use the maximum correlation minimum redundancy (MRMR) algorithm based on mutual information [50] to decide the model's input variables. Mutual information is also used to calculate the correlation of PM2.5 between adjacent stations.
Although the existing methods can adequately extract time series features, they only consider the correlation between multidimensional feature variables and ignore the effect of feature redundancy on model accuracy. In addition, few studies have been conducted to predict pollutant concentrations in areas without monitoring stations. In this paper, a new combination model for PM2.5 concentration prediction is proposed based on the above analysis. This study contributes to the following areas: (1) Based on the nonlinear characteristics of the PM2.5 concentration series, this paper adopts the MRMR algorithm based on mutual information for feature selection, removing the effect of redundant features with full consideration of the influencing factors. (2) To fully extract spatial and temporal features, we construct a dual encoder–decoder prediction model based on BiLSTM for predicting PM2.5 concentration, named ED-BiLSTM. The proposed model accurately predicts PM2.5 concentration. (3) To obtain the PM2.5 concentration distribution in Xi'an, China, we combine the ED-BiLSTM model with inverse distance weighting (IDW) spatial interpolation to obtain predicted PM2.5 concentration values for areas without monitoring stations. Finally, the overall distribution is obtained through ArcGIS visualization.

Study Data
The study area is located in Xi'an, Shaanxi Province, China. Xi'an is in the middle of the Guanzhong Basin, with the Loess Plateau in the north and the Qinling Mountains in the south. The area is long and narrow from east to west, with a low altitude in the east and a high altitude in the west. Only the cold air from the northwest is beneficial to the diffusion of pollutants, while the contaminants entering the basin from the east are not easy to diffuse and remove. Moreover, the area to the northeast is a heavy-industry distribution area in China and is prone to the retention and accumulation of pollutants due to the influence of climate and topography. Therefore, pollution is an urgent issue to be solved in Xi'an, and PM2.5 prediction can provide a theoretical basis for pollution control. We collected pollutant data from the Xi'an environmental monitoring stations; information on the monitoring stations is shown in Table 1. The study data include pollutant and meteorological data for each hour from 1 January 2017 to 31 December 2019. With 26,280 historical records, the data set contains 1180 missing values. The pollutant features are PM2.5, PM10, SO2, NO2, O3, and CO, and the meteorological features are wind direction (WD), wind speed (WS), temperature (Tem), dewpoint temperature (TDew), pressure (Pre), and relative humidity (RH), 12 features in total. A detailed description of the feature variables can be found in Table 2. In order to ensure data integrity, missing values need to be filled. We use the linear function in Equation (1) to fill missing values; the missing value x_t at time t is replaced by F(x_t):

F(x_t) = a·t + b (1)

where a and b represent the slope and intercept obtained by fitting the sample points of the characteristic variable x, and F(x_t) is the fitted value at time t.
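As an illustration, the gap-filling step above can be sketched in a few lines of Python. The least-squares estimation of the slope a and intercept b, and the short hourly series below, are hypothetical stand-ins for the paper's actual data pipeline.

```python
def linear_fill(series):
    """Fill missing values (None) by fitting a line t -> a*t + b
    to the observed points of the series via least squares."""
    pts = [(t, v) for t, v in enumerate(series) if v is not None]
    n = len(pts)
    sum_t = sum(t for t, _ in pts)
    sum_v = sum(v for _, v in pts)
    sum_tt = sum(t * t for t, _ in pts)
    sum_tv = sum(t * v for t, v in pts)
    a = (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t ** 2)  # slope
    b = (sum_v - a * sum_t) / n                                    # intercept
    return [v if v is not None else a * t + b for t, v in enumerate(series)]

pm25 = [10.0, 12.0, None, 16.0, 18.0]   # hypothetical hourly PM2.5 readings
print(linear_fill(pm25))                # the gap at t = 2 is filled with 14.0
```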

Data Normalization
The multidimensional feature variables collected in this paper have different units and value ranges. We utilize max–min normalization to eliminate the dimensional differences between data, reduce the computational complexity, and increase the convergence speed. Equation (2) shows the calculation formula:

x_i' = (x_i − x_min) / (x_max − x_min) (2)

where x_max and x_min represent the maximum and minimum values of the original sample x, respectively, and x_i' represents the normalized sample value.
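A minimal Python sketch of the max–min scaling described above; the temperature readings are hypothetical.

```python
def min_max_normalize(values):
    """Scale a feature column to [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

temps = [5.0, 10.0, 15.0, 25.0]      # hypothetical temperature readings
print(min_max_normalize(temps))      # [0.0, 0.25, 0.5, 1.0]
```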

Methods
We propose a new three-stage combination model for short-term PM2.5 prediction. In the first stage, we select the temporal and spatial features related to PM2.5 concentration through the MRMR algorithm. In the second stage, we construct a dual encoder–decoder model based on BiLSTM to train and learn the spatiotemporal features and perform prediction. Finally, the study area is divided into 1 km × 1 km grids, and the IDW method is used to interpolate the grids to obtain the PM2.5 concentration values of all grids. Figure 1 presents a flow chart of the proposed model. This section describes the methods involved in the model in detail.

MRMR Feature Selection Algorithm
PM2.5 concentration is affected by other atmospheric pollutants, meteorological conditions [51], and the PM2.5 concentration at adjacent sites [52]. However, using multiple feature variables may result in information redundancy and reduce the prediction model's accuracy [42]. To eliminate the information redundancy between features, we use the MRMR feature selection algorithm based on mutual information to remove irrelevant and redundant features, reduce computational consumption, and improve prediction accuracy.
The MRMR algorithm considers the correlation and redundancy between features simultaneously. The correlation between the random variables x and y can be calculated using mutual information (MI), as in Equation (3):

I(x, y) = ∬ P(x, y) lg [P(x, y) / (P(x)P(y))] dx dy (3)

where P(x) and P(y) are the probability distribution functions of x and y, P(x, y) is the joint probability distribution function of x and y, and I(x, y) is the mutual information of x and y.
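For discrete (or pre-binned) data, the integral in Equation (3) can be estimated from empirical frequencies. The following pure-Python sketch is illustrative; the paper's continuous variables would first need to be discretized or handled with a continuous MI estimator.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(x, y) for two discrete sequences from their empirical
    joint and marginal distributions; result is in nats."""
    n = len(xs)
    p_x, p_y = Counter(xs), Counter(ys)
    p_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in p_xy.items():
        p_joint = c / n
        # p_joint * log(p_joint / (p(x) * p(y)))
        mi += p_joint * math.log(p_joint * n * n / (p_x[x] * p_y[y]))
    return mi

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # identical sequences: log 2 ≈ 0.693
print(mutual_information(x, [0, 1, 0, 1]))  # empirically independent: 0.0
```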
Suppose that S denotes the set of feature variables x_i and |S| = m, where m represents the number of features. The correlation between the feature variables and the target variable is calculated by Equation (4), and the redundancy between the feature variables is calculated by Equation (5):

C = (1/|S|) Σ_{x_i ∈ S} I(x_i; y) (4)

R = (1/|S|²) Σ_{x_i, x_j ∈ S} I(x_i; x_j) (5)

where x_i is the value of the ith feature variable, y is the value of the target variable, I(x_i; x_j) is the MI between two feature variables, I(x_i; y) is the MI between the target variable and a feature variable, C is the correlation between the feature variables and the target variable, and R is the redundancy between the features. The objective function for selecting the initial feature subset is shown in Equations (6) and (7):

max Φ(C, R) (6)

Φ = C − R (7)

After determining the initial feature subset, an incremental search is performed using Equation (8):

max_{x_j ∈ S − S_{k−1}} [ I(x_j; y) − (1/(k − 1)) Σ_{x_i ∈ S_{k−1}} I(x_j; x_i) ] (8)

The steps for conducting an incremental search are as follows: Step 1: The initial feature set determined by the MRMR feature selection algorithm is represented as S_{k−1}, which contains k − 1 features. The remaining features are expressed as S − S_{k−1}. The feature variables contained in S − S_{k−1} are denoted as x_1, x_2, . . ., x_j.
Step 2: Add x_1 to S_{k−1} to form the feature set S′. Input S_{k−1} and S′ into the prediction model and calculate the evaluation indices of the model for each. If the prediction error corresponding to input S_{k−1} is smaller, the original data set is kept; if the error corresponding to input S′ is smaller, S′ replaces S_{k−1} as the new data set.
Step 3: Add the remaining features in S − S_{k−1} to the feature set with the minimum error in turn and perform Step 2 until the final feature set with the minimum error is obtained. The final feature set fully preserves the correlation between the multidimensional feature variables and the prediction target, and achieves a dimensionality reduction of the prediction model's input, which reduces the computational loss.
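A minimal sketch of the greedy relevance-minus-redundancy selection described above. The `toy_mi` dependence score (fraction of matching entries) and the small feature table are illustrative stand-ins for the paper's mutual information estimates, not its actual data.

```python
def mrmr_select(features, target, k, mi):
    """Greedy MRMR: at each step pick the feature that maximizes
    relevance mi(f, target) minus mean redundancy with chosen features."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(f):
            relevance = mi(features[f], target)
            redundancy = (sum(mi(features[f], features[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def toy_mi(a, b):
    """Toy dependence score for the demo: fraction of matching entries."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

target = [0, 0, 1, 1, 2, 2]
feats = {"f1": [0, 0, 1, 1, 2, 1],   # most relevant feature
         "f2": [0, 0, 1, 1, 0, 1],   # relevant but highly redundant with f1
         "f3": [1, 2, 1, 1, 2, 2]}   # equally relevant, less redundant
print(mrmr_select(feats, target, 2, toy_mi))  # f3 beats f2 at step 2
```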

ED-BiLSTM Prediction Model
The ED-BiLSTM model comprises a dual encoder and a decoder based on BiLSTM. One encoder encodes the temporal features, and the other encodes the spatial features. The encoded vectors are then aggregated and input to the decoder to obtain the ED-BiLSTM model. Finally, PM2.5 concentration prediction is realized via the test set.
In a deep neural network prediction model, time series modeling and time step calculation for the characteristic variables are the basis of the prediction model. Therefore, before proposing the ED-BiLSTM prediction model, we first introduce the time series modeling and time step calculation methods.

Time Series Sample Modeling
To convert the data into a form that the computer can understand, we model the time series through a rolling window. An example of time series sample modeling by a rolling window is shown in Figure 2. Suppose that the period contains 10 records X_T1, X_T2, . . ., X_T10, where X = {x_1, x_2, . . ., x_n} and n is the number of feature dimensions. When ∆t = 6, for Sample 1, X_T1 through X_T6 are the features and X_T7 is the label; for Sample 2, the features are X_T2 through X_T7 and X_T8 is the label; and all samples are modeled in the same way. In the prediction model, a small ∆t will make the input information incomplete, while a larger one will increase the noise and the computational consumption [27]. We determined ∆t by calculating the autocorrelation coefficient (ACF) and the partial autocorrelation coefficient (PACF), as implemented in Section 3.2.2.
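The rolling-window construction above can be sketched as follows; the integer series simply stands in for the multidimensional records X_T1 . . . X_T10.

```python
def make_samples(records, dt):
    """Rolling-window sample modeling: dt consecutive records form the
    features, and the record immediately after them is the label."""
    samples = []
    for i in range(len(records) - dt):
        samples.append((records[i:i + dt], records[i + dt]))
    return samples

series = list(range(1, 11))    # stand-in for the 10 records X_T1 .. X_T10
samples = make_samples(series, 6)
print(len(samples))            # 4 samples
print(samples[0])              # ([1, 2, 3, 4, 5, 6], 7)
```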


Time Step Calculation
The number of time stamps in each sample is determined by the size of the time step ∆t, which reflects the lag effect. A small value of ∆t does not ensure sufficient information input, while a large value will introduce irrelevant information and cause computational loss. Therefore, determining an appropriate ∆t is crucial to designing the prediction model. We determined ∆t by analyzing the ACF and PACF.
As a measure of correlation, the ACF describes the degree of correlation between values over time. For the time series {Y_t, t ∈ T}, the ACF between the values in [Y_{t−∆t}, Y_t) and Y_t can be measured by Equation (9):

ρ_{∆t} = Σ_{t=∆t+1}^{n} (Y_t − Ȳ)(Y_{t−∆t} − Ȳ) / Σ_{t=1}^{n} (Y_t − Ȳ)² (9)

A larger ACF value indicates a stronger correlation between [Y_{t−∆t}, Y_t) and Y_t. Here, Y_t is the label, Y_{t−∆t} is the sample with a ∆t time lag, and Ȳ is the average value of the sample. Compared to the ACF, the PACF focuses on the direct correlation between Y_{t−∆t} and Y_t: the PACF of Y_{t−∆t} and Y_t is the correlation coefficient between Y_{t−∆t} and Y_t after removing the indirect effects of [Y_{t−∆t+1}, Y_{t−∆t+2}, . . ., Y_{t−1}]. The formula for the PACF is shown in Equations (10)-(12). The closer the PACF value is to 0, the weaker the correlation between Y_{t−∆t} and Y_t.
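The sample ACF at a given lag can be computed directly, as in this small sketch on a toy alternating series (the series itself is illustrative):

```python
def acf(series, lag):
    """Sample autocorrelation coefficient at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((y - mean) ** 2 for y in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

alternating = [1.0, -1.0] * 4      # a strongly periodic toy series
print(acf(alternating, 1))         # -0.875: adjacent values anti-correlated
print(acf(alternating, 2))         # 0.75: values two steps apart correlated
```

In practice, libraries such as statsmodels provide `acf` and `pacf` functions that also compute the partial autocorrelations used for choosing ∆t.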

ED-BiLSTM Model
The BiLSTM [33] neural network structure contains a forward LSTM cell and a backward LSTM cell. The BiLSTM can fully mine time series information and has shown outstanding performance in time series prediction tasks [27,29,53]. Figure 3 shows a comparison between LSTM and BiLSTM. Compared with LSTM, BiLSTM extracts features not only forward but also backward. To extract bidirectional features from sequence data, we use the BiLSTM as the main component to construct the dual encoder and decoder model. To provide a deeper understanding of BiLSTM, we give the detailed calculation process of LSTM. Figure 4 shows that each LSTM neural unit consists of an input gate, an output gate, and a forget gate. Data are received through the input gate, historical information is preserved through the forget gate, and information is output through the output gate. Equations (13)-(18) give the calculation formulas.
f_t = σ(W_f · [h_{t−1}, x_t] + b_f) (13)

i_t = σ(W_i · [h_{t−1}, x_t] + b_i) (14)

c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c) (15)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t (16)

o_t = σ(W_o · [h_{t−1}, x_t] + b_o) (17)

h_t = o_t ⊙ tanh(c_t) (18)

where h_{t−1} is the output of the hidden layer at t − 1, c_{t−1} is the output of the memory cell at t − 1, x_t is the current input vector, σ is the sigmoid function, W and b are the weight matrices and bias vectors of the gates, and ⊙ denotes element-wise multiplication.
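To make the gate structure concrete, the following toy implementation runs one LSTM step for scalar input and state. The weight values are chosen only to illustrate the gating behaviour (a saturated forget gate carries the cell state through unchanged); they are not trained parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM time step for scalar input and state. w maps each gate
    name to (weight on h_prev, weight on x_t, bias)."""
    def gate(name, activation):
        wh, wx, b = w[name]
        return activation(wh * h_prev + wx * x_t + b)
    f = gate("f", sigmoid)            # forget gate
    i = gate("i", sigmoid)            # input gate
    c_cand = gate("c", math.tanh)     # candidate memory content
    c = f * c_prev + i * c_cand       # new cell state
    o = gate("o", sigmoid)            # output gate
    h = o * math.tanh(c)              # new hidden state
    return h, c

# Saturated gates: forget ~1 and input ~0, so the old cell state survives.
w = {"f": (0.0, 0.0, 50.0), "i": (0.0, 0.0, -50.0),
     "c": (0.0, 1.0, 0.0), "o": (0.0, 0.0, 50.0)}
h, c = lstm_step(x_t=5.0, h_prev=0.0, c_prev=0.5, w=w)
print(h, c)    # h ≈ tanh(0.5), c ≈ 0.5
```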
The output of the BiLSTM neural network unit is a combination of the forward LSTM hidden layer output →h_t and the backward LSTM hidden layer output ←h_t. The calculation process is shown in Equation (19):

h_t = α · →h_t + β · ←h_t (19)
where α and β are the output weights of the forward and backward LSTM hidden layers, respectively. For a multivariate time series, one hidden layer cannot express all the information, and adding hidden layers may give more accurate results. The general approach is to stack multiple hidden layers [54], using the output of the previous hidden layer as the input of the next. Since both temporal and spatial features are considered in our study, simply adding hidden layers cannot solve the complex data feature problem. Therefore, we propose a dual encoder and decoder model (ED-BiLSTM) based on BiLSTM. Figure 5 shows the model structure, which consists of two parts: the encoder and the decoder. One encoder encodes the temporal features obtained from MRMR, and the other extracts spatial features from adjacent sites. Then, the encoded vectors from both encoders are aggregated and fed into the decoder for prediction. The ED-BiLSTM prediction model is also beneficial for calculating the grid spatial interpolation for areas without monitoring stations.

Inverse Distance Weight (IDW) Interpolation Method
With increasing public concern over environmental pollution, air pollutant prediction has made significant progress in recent years. However, for deep learning to be effective, large amounts of historical data are required, which leads most research on pollutant prediction to target monitoring stations [30,42]. There have also been a few studies on new sites with limited historical data [27]. For areas without monitoring sites, however, the absence of historical data has led to a lack of research. In this study, the ED-BiLSTM prediction model was combined with the IDW algorithm to interpolate a 1 km × 1 km spatial grid and obtain predicted values for each grid cell. The method consists of the following steps. Firstly, the PM2.5 concentration at all monitoring stations in the study area at the next moment is predicted using the ED-BiLSTM model. Then, the study area is divided into grids of 1 km × 1 km, and the coordinates of each grid center point are determined based on latitude and longitude. Finally, the IDW spatial interpolation algorithm obtains the predicted values for all grids.
IDW is a point-based spatial interpolation algorithm. For an interpolated grid cell with center coordinates (x, y), the predicted PM2.5 value is calculated by Equation (20):

z(x, y) = Σ_{i=1}^{n} (z_i / d_i^k) / Σ_{i=1}^{n} (1 / d_i^k) (20)

where n is the number of monitoring stations, z_i is the predicted value at monitoring station i, and d_i is the distance between the interpolation grid (x, y) and the monitoring point (x_i, y_i), calculated by Equation (21):

d_i = √((x − x_i)² + (y − y_i)²) (21)

k is the power exponent of the distance. The higher the value of k, the more significant the effect of nearby stations on the interpolated value; the smaller the value of k, the more significant the relative effect of remote stations. The core of the IDW algorithm is the selection of k. It has been shown that the search radius does not affect the results when k is 3 or more [55]. Based on the analysis of previous studies, we set k to 3.
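The interpolation step can be sketched as a short function; the station coordinates and predicted values below are hypothetical.

```python
import math

def idw(grid_point, stations, k=3):
    """Inverse distance weighting at grid_point from (x, y, value)
    stations, with distance power exponent k."""
    num = den = 0.0
    for x, y, z in stations:
        d = math.hypot(grid_point[0] - x, grid_point[1] - y)
        if d == 0:
            return z          # grid cell coincides with a station
        w = 1.0 / d ** k
        num += w * z
        den += w
    return num / den

stations = [(0.0, 0.0, 40.0), (2.0, 0.0, 80.0)]   # hypothetical predictions
print(idw((1.0, 0.0), stations))                  # equidistant midpoint: 60.0
```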

Evaluation Index
To quantify prediction performance, we evaluated the ED-BiLSTM model and the comparison models using the root mean square error (RMSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE), and R squared (R²). The deviation between actual and predicted values is measured by the RMSE and MAE, and the MAPE determines the percentage deviation. The smaller the RMSE, MAE, and MAPE values, the more accurate the prediction. R² reflects the ability to fit the data and ranges from 0 to 1; the higher the R², the better the model fits the data. MAE and MAPE are less affected by extreme values, while the RMSE uses the square of the error, which amplifies large prediction errors and makes it more sensitive to anomalies. Therefore, we combine these evaluation metrics to evaluate the ED-BiLSTM model. The formulas are shown in Equations (22)-(25):

RMSE = √((1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²) (22)

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i| (23)

MAPE = (100%/N) Σ_{i=1}^{N} |(y_i − ŷ_i)/y_i| (24)

R² = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)² / Σ_{i=1}^{N} (y_i − ȳ)² (25)

where y_i is the observed value of PM2.5 concentration, ŷ_i is the predicted value, ȳ is the mean of the monitored values, and N denotes the number of observed samples.
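The four metrics can be computed directly from the definitions above; the observed and predicted values in the example are hypothetical.

```python
import math

def metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in %), and R^2 for paired observations."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 / n * sum(abs(e) / yt for e, yt in zip(errors, y_true))
    mean = sum(y_true) / n
    r2 = 1.0 - (sum(e * e for e in errors)
                / sum((yt - mean) ** 2 for yt in y_true))
    return rmse, mae, mape, r2

rmse, mae, mape, r2 = metrics([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
print(round(mape, 2), round(r2, 2))   # 10.0 0.96
```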

Case Study
Feature Selection

PM2.5 concentration is affected by other atmospheric pollutants, meteorological conditions, and the concentration of PM2.5 at adjacent sites. However, in short-term predictions, the presence of multiple feature variables may result in information redundancy and reduce the prediction model's accuracy. To eliminate the information redundancy caused by excessive features, we calculated the MI between PM2.5 and the other pollutants and meteorological factors, as well as the MI of PM2.5 between adjacent sites.
To choose the appropriate correlation analysis method, we first checked the linearity of the PM2.5 series using the Brock–Dechert–Scheinkman (BDS) statistic [56]. The BDS test result with the dimension set to 6 is presented in Table 3. Here, the Z-statistic is higher than the critical value of 1.96, and the p value is less than 0.05, indicating that the BDS statistic rejects the initial linearity hypothesis and proves the nonlinearity of the PM2.5 time series. Therefore, we used mutual information to calculate the correlation between variables. Then, we used the MRMR algorithm to remove redundant features and determine the optimal subset of features with which to implement the dimensionality reduction of the input data. Figure 6 displays the heat map of the mutual information values between the PM2.5 concentration and the influencing factors at the 1463A site. A darker color in the heat map indicates a higher mutual information value and a stronger correlation between the variables. The PM2.5 concentration at time t + 1 and the values of the other influencing factors at time t were used to calculate the mutual information. Figure 6 illustrates that PM10, NO2, O3, RH, SO2, and PM2.5 are highly correlated, with mutual information values greater than 1. When k = 6, the feature subset determined according to the mutual information values was PM2.5, PM10, NO2, SO2, O3, and RH. However, the subset of features calculated by the MRMR algorithm, considering redundancy, comprised PM2.5, PM10, NO2, SO2, CO, and RH. In Figure 6, we see that O3 strongly correlates with PM10, NO2, and RH, indicating that the MRMR feature selection algorithm fully accounts for the redundancy between features.
We conducted experiments by feeding the feature subsets determined by the two selection algorithms, one based on mutual information alone and one based on the MRMR algorithm, into the prediction model. Table 4 lists the evaluation metrics for the different feature inputs. The MRMR algorithm obtained the optimal feature set; we then added the remaining features one by one using an incremental search. Comparing the evaluation metrics shows that the features identified by the MRMR algorithm at k = 6 achieve the best prediction results. Therefore, the input feature subset of the prediction model contains PM2.5, PM10, NO2, SO2, RH, and CO. In addition, we calculated the mutual information of PM2.5 concentration between adjacent stations; Figure 7 shows a strong correlation between PM2.5 concentrations at adjacent sites. Comparing the model before and after adding these spatial features, the RMSE decreased by 12.2% and the MAE by 27.3%. Based on the feature selection results, we used PM2.5, PM10, NO2, SO2, RH, CO, and the PM2.5 concentrations of adjacent sites as inputs to the prediction model.

4.2. Parameter Setting of ED-BiLSTM Model

4.2.1. Time Step Calculation
In a time series forecasting task, the time step ∆t determines how many historical records are input to the model, and a suitable value of ∆t can improve the model's forecasting performance. In this paper, we determine ∆t by calculating the autocorrelation function (ACF) and partial autocorrelation function (PACF).
Figure 8 shows the ACF of the PM2.5 concentration at different sites in Xi'an. The ACF decreases as the time lag lengthens: when ∆t is less than 5, the ACF at most sites is greater than 0.9 and the data are highly correlated, whereas when ∆t is greater than 30, the correlation is weak. Compared with the ACF, the PACF focuses on the direct correlation between Y t−∆t and Y t. Figure 9 shows the PACF of the PM2.5 concentration at site 1463A. The PACF tends to 0 when ∆t = 4, and the ACF is 0.905 at this point, indicating a strong correlation between Y t and Y t−1, Y t−2, Y t−3, and Y t−4. When ∆t > 4, the PACF fluctuates around zero, and the fluctuation becomes relatively flat after ∆t > 20. To further determine ∆t, we constructed training samples with ∆t in the range [1, 24] and used them to train and predict with ED-BiLSTM. Table 5 shows the prediction results under the different ∆t settings: when ∆t is set to 4, the RMSE is 6.603, the MAE is 4.066, and the prediction error is smallest. To show the effect of ∆t on prediction performance more intuitively, Figure 10 plots the RMSE and MAE for the different time lags. The two metrics follow a similar trend, and a regular pattern emerges: the prediction error is smaller when ∆t is a multiple of 4. This agrees with the conclusion drawn from the PACF, and is also confirmed by the study of Huang et al. [9]. Based on the ACF and PACF calculations, we set ∆t to 4; that is, when predicting Y t, the values Y t−1, Y t−2, Y t−3, and Y t−4 are input into the prediction model.
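The ACF/PACF computations used above to pick ∆t can be sketched in pure NumPy as follows (libraries such as statsmodels provide equivalent `acf`/`pacf` functions). The Durbin–Levinson recursion for the PACF is a standard choice; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation function for lags 0..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

def pacf(x, nlags):
    """Partial autocorrelation via the Durbin-Levinson recursion."""
    r = acf(x, nlags)
    phi = np.zeros((nlags + 1, nlags + 1))
    out = np.zeros(nlags + 1)
    out[0] = 1.0
    for k in range(1, nlags + 1):
        # update the order-k AR coefficients from the order-(k-1) ones
        num = r[k] - np.dot(phi[k - 1, 1:k], r[1:k][::-1])
        den = 1.0 - np.dot(phi[k - 1, 1:k], r[1:k])
        phi[k, k] = num / den
        for j in range(1, k):
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        out[k] = phi[k, k]              # PACF at lag k
    return out
```

For an AR(1) series, the ACF decays geometrically while the PACF cuts off after lag 1, which is the cut-off behavior used above to justify ∆t = 4.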


4.2.2. Experimental Settings
The experimental data in this paper, taken from the historical records of PM2.5, PM10, NO2, SO2, RH, CO, and adjacent sites' PM2.5 concentrations over the previous 4 h, are used to predict the PM2.5 concentration for the next hour. We model the temporal features and the spatial features separately. In the model, the hidden layer of each encoder is a BiLSTM unit, a concatenate layer after the encoders fuses the feature vectors of the two encoders, and the decoder contains one hidden layer. The prediction model with the lowest error on the training set is obtained by continuously adjusting the parameters; Table 6 shows the parameters of the ED-BiLSTM model. During training, samples are fed to the model in batches with a batch size of 128. The number of epochs is set to 200 and the number of neurons in the BiLSTM to 128. The optimizer is Adam, the loss function is the mean absolute error (MAE), the learning rate is 1 × 10−3, and the activation function is ReLU. The dropout rate is set to 0.5 to prevent overfitting, and the number of encoder and decoder layers is set to 1. Once training is complete, predictions are made on the test set and the evaluation metrics are calculated.
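For readers who want a starting point, the dual-encoder architecture described above might be assembled with the Keras functional API roughly as follows. The input shapes, the number of adjacent stations (`N_SPATIAL`), and the single-step decoder arrangement (a RepeatVector feeding a BiLSTM layer) are assumptions on our part; Table 6 holds the authoritative hyperparameters.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIME_STEPS = 4        # delta-t chosen via ACF/PACF
N_TEMPORAL = 6        # PM2.5, PM10, NO2, SO2, RH, CO
N_SPATIAL = 3         # assumed number of adjacent stations

temporal_in = keras.Input(shape=(TIME_STEPS, N_TEMPORAL), name="temporal")
spatial_in = keras.Input(shape=(TIME_STEPS, N_SPATIAL), name="spatial")

# Two independent BiLSTM encoders, one per feature group
enc_t = layers.Bidirectional(layers.LSTM(128))(temporal_in)
enc_s = layers.Bidirectional(layers.LSTM(128))(spatial_in)

fused = layers.Concatenate()([enc_t, enc_s])   # fuse the two encoder outputs
fused = layers.Dropout(0.5)(fused)             # regularization, as in the text

# BiLSTM decoder for a single-step (next hour) prediction
dec = layers.RepeatVector(1)(fused)
dec = layers.Bidirectional(layers.LSTM(128))(dec)
out = layers.Dense(1, name="pm25_next_hour")(dec)

model = keras.Model([temporal_in, spatial_in], out)
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mae")
```

Given batches shaped `(batch, 4, 6)` and `(batch, 4, 3)`, the model outputs one PM2.5 value per sample; `model.fit` with `batch_size=128` and 200 epochs would match the training setup above.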

4.3. Prediction Results
To verify the predictive performance of the proposed model, we predicted the PM2.5 concentration at station 1463A for the next hour using ED-BiLSTM and the comparative models. We constructed seven comparative models: the machine learning model SVR, a single LSTM, and a single BiLSTM, as well as four combination models, namely ED-LSTM, AE-BiLSTM, ED-BiLSTM-One, and ED-BiLSTM-IF. In ED-LSTM, the hidden layers of the encoder and decoder use LSTM units. In AE-BiLSTM, the encoder is an auto-encoder and the decoder is a BiLSTM. ED-BiLSTM-One uses a single BiLSTM encoder to encode the spatiotemporal features jointly. ED-BiLSTM-IF selects influencing factors by considering only the feature variables of the target site, disregarding the spatial correlation of PM2.5 concentrations at adjacent sites. All models were built in the Keras framework with identical parameter settings to ensure the validity of the comparison, and each deep learning model was run 10 times to reduce the randomness of neural network training. Table 7 shows the experimental results.

The results in Table 7 show that the RMSE, MAE, and MAPE of the ED-BiLSTM model are lower than those of the comparative models, and that its R2 is closest to 1, indicating that ED-BiLSTM fits the PM2.5 time series better than the other machine learning and deep learning models. The average RMSE of ED-BiLSTM is 6.603 and its MAE is 4.066. Compared with the machine learning model SVR, the average RMSE is 53.91% lower and the MAE 65.73% lower; compared with the single LSTM and BiLSTM models, the average RMSE is 24.06% and 17.63% lower and the average MAE 24.93% and 22.18% lower, respectively. This shows that ED-BiLSTM is more suitable for short-term PM2.5 prediction than the machine learning and single neural network models. Comparing the predicted and true value curves of all models, Figure 11 shows that ED-BiLSTM achieves the best fit.

In addition, we compared the prediction results of the four hybrid models with those of our model (Table 7). The AE-BiLSTM model follows the work of Zhang et al. [37]; we constructed the ED-LSTM, ED-BiLSTM-One, and ED-BiLSTM-IF models. These models differ significantly in their time series feature modeling and hidden layer settings. ED-LSTM differs from ED-BiLSTM only in that LSTM replaces BiLSTM as the hidden layer unit. Compared with ED-LSTM, ED-BiLSTM reduces the RMSE by 26.12% and the MAE by 24.04% on average, indicating that using BiLSTM as the hidden layer of the encoder and decoder better extracts the forward and backward dependencies in the sequence data. The encoder of AE-BiLSTM is an auto-encoder; compared with AE-BiLSTM, the average RMSE of our model is 37.12% lower and its MAE 52.25% lower, indicating that BiLSTM interprets sequence data better than the auto-encoder. ED-BiLSTM-One uses one encoder to extract the spatiotemporal features; compared with it, our model reduces the RMSE by 12.49% and the MAE by 14.27% on average, indicating that using an independent encoder for the spatial features of adjacent stations improves the model's prediction accuracy. ED-BiLSTM-IF ignores the mutual influence between adjacent stations; ED-BiLSTM reduces the RMSE by 12.21% and the MAE by 27.3% on average compared with ED-BiLSTM-IF, which demonstrates the necessity of considering spatial features in pollutant prediction. The comprehensive comparison shows that ED-BiLSTM has the smallest error and the best fit for short-term PM2.5 prediction.
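The four evaluation metrics quoted throughout (RMSE, MAE, MAPE, and R2) can be computed with a small helper; the function name `evaluate` is ours, and the MAPE here assumes there are no zero observations.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return RMSE, MAE, MAPE, and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err / y_true)))   # y_true must be nonzero
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                    # 1 means a perfect fit
    return rmse, mae, mape, r2
```

A perfect prediction yields RMSE = MAE = MAPE = 0 and R2 = 1, which is why the comparison above treats a smaller error triple and an R2 closer to 1 as better.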
Figure 11 shows the prediction results of all models together with a partial enlargement. As the prediction effect of each model is difficult to discern from the overlapping curves, 50 prediction points are enlarged to show the results of the different models more clearly. Figure 11 shows that the ED-BiLSTM model fits the predicted value curve better than the machine learning model, the single deep learning models, and the other combined models. Figure 12 plots the predicted and actual values for the ED-BiLSTM model; the red line is the predicted value and the blue line the monitored value. The figure shows that ED-BiLSTM tracks the fluctuations of the PM2.5 concentration well, especially the sudden changes and the maximum and minimum points. Figure 13 shows a scatter plot of predicted against actual values, with the horizontal and vertical axes representing the monitored and predicted values, respectively; the black line is y = x, and the scatter points lie closely around it. Combining Figures 12 and 13, it is clear that ED-BiLSTM fits the PM2.5 concentration sequence best. The proposed model is significantly better than the comparative models and can be effectively applied to PM2.5 concentration prediction.

4.4. Other Cases
To evaluate the robustness and stability of our model, we selected 12 monitoring sites in Xi'an from 1 January 2017 to 31 December 2019 for PM2.5 concentration prediction; the information on the monitoring stations is shown in Table 1. The feature selection and prediction model construction process is the same as for site 1463A. To simplify the description and avoid repetition, we directly give the prediction results for the 12 monitoring stations. Tables 8 and 9 show the evaluation metrics of each model at the 12 stations; the prediction results differ among monitoring stations. The RMSE, MAE, and MAPE of the proposed model are smaller than those of the other comparative models, and its R2 is the largest. The largest RMSE, MAE, and MAPE values of the proposed model are 9.553, 5.786, and 0.128, respectively, and the smallest R2 is 0.974. Although the prediction results differ by site, the proposed model's RMSE, MAE, and MAPE remain small and its R2 is greater than 0.97, indicating that the proposed model can be used to estimate PM2.5 concentrations. In addition, comparing the prediction results across all monitoring stations in Xi'an, the proposed model has the smallest prediction error at 1470A and the worst result at 1467A. Station 1470A is in Yanliang Industrial Park, with high PM2.5 emission concentrations, whereas 1467A is in the Xi'an High-Tech Industrial Park, which combines modern technology and natural ecology and has low pollutant emissions. This suggests that the model's PM2.5 predictions tend to be high, but the error is still in the low range.

4.5. PM2.5 Concentration Prediction in Areas without Stations
In Sections 4.3 and 4.4, we discussed the feasibility and prediction performance of the model. However, the information that can be obtained from PM2.5 predictions at the monitoring stations alone is limited. To grasp the future PM2.5 distribution over the whole study area more comprehensively, Xi'an is gridded at a 1 km × 1 km resolution. Then, according to the prediction results at the monitoring stations, the IDW algorithm is used to interpolate a PM2.5 concentration prediction for each grid cell. Finally, the interpolation results are visualized in ArcGIS so that researchers and managers can see the distribution of PM2.5 concentration over the whole region.
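A minimal sketch of the IDW step: each grid cell is estimated as a distance-weighted average of the station predictions. The `power=2` exponent and the exact-match handling are common defaults and our assumptions; the paper does not specify them here.

```python
import numpy as np

def idw(stations, values, grid_points, power=2):
    """Inverse distance weighting.
    stations: (n, 2) station coordinates; values: (n,) station predictions;
    grid_points: (m, 2) grid-cell coordinates. Returns (m,) estimates."""
    stations = np.asarray(stations, dtype=float)
    values = np.asarray(values, dtype=float)
    grid_points = np.asarray(grid_points, dtype=float)
    # pairwise distances between each grid point and each station
    d = np.linalg.norm(grid_points[:, None, :] - stations[None, :, :], axis=2)
    out = np.empty(len(grid_points))
    for i, di in enumerate(d):
        hit = di < 1e-12                 # grid point coincides with a station
        if hit.any():
            out[i] = values[hit][0]
        else:
            w = 1.0 / di ** power        # weights decay with distance
            out[i] = np.dot(w, values) / w.sum()
    return out
```

A grid point midway between two stations receives the mean of their values, and nearer stations dominate elsewhere, which is what produces the smooth concentration surfaces visualized in ArcGIS.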
To see the variation in PM2.5 concentration over a whole day more clearly, Figure 14 visualizes the distribution of PM2.5 concentration at two-hour intervals. The PM2.5 concentration varies significantly over the day, and its distribution across the regions is imbalanced. To explain this phenomenon, Figure 15 shows the administrative divisions of Xi'an. Xi'an, the largest central city in northwest China, consists of 11 districts (Xincheng, Beilin, Lianhu, Baqiao, Weiyang, Yanta, Yanliang, Lintong, Chang'an, Gaoling, and Huyi) and 2 counties (Lantian and Zhouzhi); the main urban area comprises Weiyang, Baqiao, Yanta, Lianhu, Xincheng, and Beilin districts. We combined Figures 14 and 15 to analyze the variation in PM2.5 concentration in Xi'an on 10 September 2019. From the perspective of regional division, the PM2.5 concentrations in Zhouzhi, Huyi, and Lantian are low, whereas the concentrations in the main urban area and Yanliang District are higher than in the other districts and counties. This is because of the large population and dense traffic in the central metropolitan area, and because Yanliang District has a dense industrial park. From the perspective of temporal evolution, PM2.5 concentrations are higher at night than during the day: in autumn and winter, the atmosphere easily forms a cold inversion layer at night, preventing air convection. The accumulation of daytime PM2.5 emissions and the passage of high-emission trucks at night are also significant sources of PM2.5. After a night of accumulation and morning motor vehicle emissions, the highest PM2.5 concentration occurs at 8:00 a.m., during the latter part of the morning peak. Then, as the ground temperature gradually rises, the inversion layer breaks up and the airflow accelerates; the lowest pollutant concentration of the day occurs from 14:00 to 16:00, which is also the best time for outdoor activities. The PM2.5 concentration gradually increases again after 18:00 as the ground-level temperature decreases and the evening peak arrives.
In summary, the complex road traffic in the main urban area and the large number of PM2.5 emission sources increase the difficulty of model prediction. The worst prediction result in Section 4.4 was obtained at site 1467A, which is in the Yanta District of the urban area, so the interpolation analysis is consistent with the conclusions drawn from the model prediction results. Therefore, more stations could be added when predicting PM2.5 concentration in areas with complex emission sources. Secondly, we strongly recommend that the government invest more effort in dealing with the high PM2.5 concentrations in industrial parks. Finally, according to the distribution of pollutant concentrations over the day, residents are advised to choose mid-afternoon, when the PM2.5 concentration is lowest, for outdoor activities.

Discussion
This paper proposes a new combination prediction model, combining the MRMR feature selection algorithm, the BiLSTM neural network, and the IDW spatial interpolation algorithm, to predict PM2.5 concentrations in Xi'an. Firstly, we selected an appropriate feature subset using the MRMR algorithm; experiments show that the algorithm fully considers the correlation and redundancy among features, which helps to improve the accuracy of the prediction model. Secondly, we compared our model with machine learning and single deep learning models; because our model considers the spatiotemporal correlation of multivariate time series, it shows clear advantages. Thirdly, the model based on BiLSTM units was compared with models based on LSTM and auto-encoder units, and BiLSTM was found to have the more robust feature extraction ability. Next, our model improves predictive ability by considering the spatial correlation between adjacent sites. Finally, the experimental analysis shows that the dual-encoder model outperforms a single encoder in extracting the spatiotemporal features of multiple time series. Among all the comparative experiments, the ED-BiLSTM model has the lowest prediction error. In addition, we divided Xi'an into 1 km × 1 km grids to predict the PM2.5 concentration in each grid cell and explored the daily variation in PM2.5 concentration through ArcGIS visualization. The proposed model combines a feature selection algorithm, a neural network, and an interpolation algorithm, which addresses both feature redundancy in PM2.5 prediction and the difficulty of predicting PM2.5 in regions without monitoring stations.
According to the analysis of the PM2.5 distribution in Figure 14, industrial and traffic pollution are the primary sources of PM2.5. The public can reduce PM2.5 emissions by choosing public transportation more often and reducing motor vehicle use. Regarding industrial emissions, production processes should be optimized to reduce and purify industrial waste gas. In addition, emissions of PM2.5 precursors, such as NOx, SO2, and volatile organic compounds, should be controlled in PM2.5 management. When the level of PM2.5 pollution in the atmosphere is high, it is advisable for the public to wear masks when traveling, which can effectively protect the cardiovascular system. Effective source prevention and control can cut off pollution sources and solve pollution problems; accordingly, environmental protection departments should formulate more reasonable pollution prevention and control countermeasures to eliminate pollution at the source.
Our model predicts PM2.5 concentration at the hourly level in the short term, and in the case analysis its accuracy was verified using stations in Xi'an only. However, this does not mean that the proposed model is geographically limited; it can be applied to PM2.5 or other pollutant prediction in different cities. Due to the limitation of data sources, we only considered the effects of other pollutants and meteorological factors on PM2.5 concentration. In addition, we only considered the effects of feature correlation and the nonlinearity of the PM2.5 series on the prediction results, ignoring the errors caused by the non-stationarity of the PM2.5 concentration series. In the future, we will analyze this non-stationarity to achieve more accurate PM2.5 concentration prediction. Model parameter optimization using optimization algorithms is also a future research direction.

Conclusions
To accurately predict PM2.5 concentration, we constructed a new combination prediction model based on feature selection, BiLSTM, an encoder–decoder structure, and a spatial interpolation algorithm. The model comprises the following three parts. (1) Feature selection: the correlation of PM2.5 concentration with other pollutants and meteorological factors is calculated using the maximum relevance minimum redundancy algorithm based on mutual information. (2) Neural network model construction: the temporal and spatial correlation features of PM2.5 concentration are extracted using a dual-encoder–decoder model. (3) Spatial interpolation: predicted PM2.5 concentrations for areas without monitoring sites are obtained by the IDW spatial interpolation algorithm, and ArcGIS produces the distribution map of pollutant concentration across the study area. Taking the data of Xi'an as a case study to verify the proposed model, the main conclusions are as follows.
(1) Considering the correlation and redundancy between features achieves dimensionality reduction and increases prediction performance. The experimental results show that, with the influencing factors selected in this way, the RMSE, MAE, and MAPE of the prediction model decreased by 12.2%, 27.3%, and 11.4%, respectively. (2) The model based on the dual encoder–decoder fully extracts multi-dimensional time series features. The results show that the RMSE, MAE, and MAPE of ED-BiLSTM were reduced by 17.63%, 22.18%, and 9.8%, respectively, compared with a single BiLSTM model. (3) The IDW spatial interpolation results and ArcGIS visualization show that Weiyang, Gaoling, and Yanliang districts are seriously polluted and that pollution prevention and control in these areas should be strengthened. In addition, it is recommended that more air quality monitoring stations be installed in the main urban area of Xi'an, where air quality prediction is complicated by the complex roads and high daytime traffic flow.
The main contribution of this paper is a new method for predicting PM2.5 concentration. The prediction results of the model can not only provide real-time air quality warnings and fill in monitoring values missing due to instrument failures, but also capture the distribution of PM2.5 concentrations across the study area. It should also be noted that, within the proposed framework, other feature selection algorithms can be used in future research, and the encoder–decoder structure can be adapted to other neural network models; the proposed model therefore has good generalization ability.
x_i = (x − x_min) / (x_max − x_min), where x_max and x_min, respectively, represent the maximum and minimum values of the original sample x, and x_i represents the normalized sample value.

Figure 1. The flow chart of the proposed model.

Figure 2. Modeling of time series samples. Black denotes the features; red denotes the label.

Figure 3. LSTM and BiLSTM. Blue indicates forward propagation, and orange indicates backward propagation.

Figure 6. Mutual information of characteristic variables for site 1463A.

Figure 7. Mutual information between PM 2.5 concentrations at sites.

4.2. Parameter Setting of the ED-BiLSTM Model

4.2.1. Time Step Calculation

In the time series forecasting task, the time step ∆t determines how many historical record values are input to the model, and a suitable value of ∆t can increase the model's prediction accuracy.
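The autocorrelation function (ACF) used to choose ∆t (see Figure 8) can be computed with the usual sample estimator; the sketch below follows the standard definition, and the names are illustrative.

```python
def acf(series, lag):
    """Sample autocorrelation of a series at a given lag:
    covariance of the series with its lagged copy, divided by the variance."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var
```

The ACF at lag 0 is always 1; a ∆t is typically chosen where the ACF drops below a significance threshold.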

Figure 8. The ACF at different time lags for each site.

Figure 10. The RMSE and MAE with different time lags.
One, and ED-BiLSTM-IF. The hyperparameter settings for all deep learning models are identical to those of the ED-BiLSTM model. The results in Table

Sustainability 2023, 15.

Figure 11. Results of all models and partial magnification.

learning model, and the other combined models. Figure 12 plots a curve of the predicted and actual values for the ED-BiLSTM model; the red line is the predicted value, and the blue line is the monitored value. The figure shows that the ED-BiLSTM model can track the fluctuations of PM 2.5 concentration well, especially the sudden changes and the maximum and minimum points. Figure 13 plots the scatter of predicted against actual values, with the horizontal and vertical coordinates representing the monitored and predicted values, respectively. The black line represents y = x, and the scatter points lie closely around it.

Figure 12. Result of the ED-BiLSTM prediction model.

Figure 13. The scatter plot of predicted and true values on the test set.

and the distribution of PM 2.5 concentration in each region is imbalanced. To explain this phenomenon more clearly, the distribution of the administrative divisions of Xi'an is given in Figure 15. Xi'an, the largest central city in northwest China, consists of 11 districts (Xincheng, Beilin, Lianhu, Baqiao, Weiyang, Yanta, Yanliang, Lintong, Chang'an, Gaoling, and Huyi) and 2 counties (Lantian and Zhouzhi); the main urban area comprises the Weiyang, Baqiao, Yanta, Lianhu, Xincheng, and Beilin districts.

Figure 14. The distribution of PM 2.5 concentrations at two-hour intervals on 10 September 2019.

Figure 15. The map of Xi'an districts.

Table 1. The information on the monitoring stations.

Table 2. The attributes of the feature variables. Some missing values inevitably exist due to instrument failure and other reasons.
U_f, U_i, U_c, and U_O are the parameter matrices controlling the hidden state; W_f, W_i, W_c, and W_O are the parameter matrices controlling the input information; b_f, b_i, b_c, and b_o are the bias vectors; σ is the sigmoid function; tanh is the hyperbolic tangent function; and * is the Hadamard (element-wise) product. f_t is the forget gate used to control the degree of forgetting of the previous state c_{t−1}, c̃_t is the memory of the current input information, and i_t is the input gate; the Hadamard product of c̃_t and i_t determines how much new information is stored. The forgetting of past states and the updating of the current input information together determine the cell state c_t. The output state h_t is obtained from the operation of o_t on c_t.
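To make the gating operations concrete, here is a minimal scalar sketch of one LSTM cell step. It is an illustrative toy (scalar weights stored in a plain dictionary, not the paper's trained model), but it follows the gate structure described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step on a scalar input x, given the previous hidden
    state h_prev, previous cell state c_prev, and parameter dict p."""
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["bf"])          # forget gate
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["bi"])          # input gate
    c_tilde = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["bc"])  # candidate memory
    c = f * c_prev + i * c_tilde   # forget old state, store new information
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["bo"])          # output gate
    h = o * math.tanh(c)           # hidden state output
    return h, c
```

A BiLSTM runs two such recurrences, one forward and one backward over the sequence, and concatenates their hidden states.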

Table 3. The BDS results of the PM 2.5 concentration series at the 1463A site.

Table 4. PM 2.5 prediction results of different feature selection algorithms.

Table 5. ED-BiLSTM model evaluation indices with different time lags.

Table 6. The parameter settings of the prediction model. We model 26,280 historical records. Because the time step is set to 4, we obtain 26,276 samples, comprising 21,020 training samples, 2628 validation samples, and 2628 test samples.

Table 7. The prediction results of all models at the 1463A site.

Table 8. Prediction results of different monitoring stations; bold indicates the best value.

Table 9. Prediction results from monitoring stations 1469A-1474A; bold indicates the best value.