Prediction of Ambient PM2.5 Concentrations Using a Correlation Filtered Spatial-Temporal Long Short-Term Memory Model

Abstract: Due to the increasingly serious air pollution problem, air quality prediction has become an important approach for air pollution control and prevention. Many prediction methods have been proposed in recent years to improve prediction accuracy. However, most existing methods either do not consider the spatial relationships between monitoring stations or overlook the strength of the correlation. Excluding spatial correlation, or including too many weakly correlated spatial inputs, can affect the modeling and reduce prediction accuracy. To overcome this limitation, this paper proposes a correlation filtered spatial-temporal long short-term memory (CFST-LSTM) model for air quality prediction. The model is designed based on the original LSTM model and is equipped with a spatial-temporal filter (STF) layer. This layer not only takes into account the spatial influence between stations, but also extracts highly correlated sequential data and drops weakly correlated ones. To evaluate the proposed CFST-LSTM model, hourly PM2.5 concentration data for California are collected and preprocessed, and several experiments are conducted. The experimental results show that the CFST-LSTM model can effectively improve prediction accuracy and generalizes well.


Introduction
In the last decades, along with the rapid development of urbanization and industrialization, emissions of air pollutants such as PM2.5, PM10, CO, and SO2 have caused serious environmental problems. Each year, millions of people die from exposure to severe air pollution [1,2]. Billions of dollars are lost due to the direct and indirect effects of air pollution [3]. Deteriorating air quality has become a critical environmental concern. A number of policies and measures have been introduced to reduce the emissions of air pollutants and mitigate the impacts of air pollution on human society [4]. Air pollution monitoring stations have also been constructed in many areas to monitor and collect air pollution data.
In academia, scholars have also devoted considerable effort to studying how to better manage air pollution. The literature commonly covers topics such as air pollution dispersion mechanism modeling [5], influential factor analysis [6,7], and air quality prediction [8,9]. Air quality prediction is one of the main areas in this domain. Based on accurate air pollution forecasts, early warnings can be issued; the public can then prepare in advance to mitigate the impact of air pollution, and the government can adopt effective measures, such as traffic restrictions, to control air pollution.
In order to obtain better forecasting results, statistical methods and machine learning techniques have been widely adopted for modeling air quality. Statistical methods refer to methods developed based on large amounts of statistical data and linear mathematical functions. For example, Davis and Speckman [10] took a generalized additive model (GAM) approach to predict ozone concentrations one day in advance for Houston. Li et al. [11] employed a set of numerical forecasting models, such as the autoregressive integrated moving average (ARIMA), to improve the forecasting of air pollutants, including PM2.5, NO2, and O3, in Hong Kong. Kulkarni et al. [12] also adopted the ARIMA time series model for forecasting air pollution in India.
Machine learning techniques also predict air pollution based on historical data. They model the data in a non-linear way, which is more consistent with the non-linearity of real-world air pollution data and can therefore achieve higher prediction accuracy. For example, Osowski and Garanty [13] presented a method for daily air pollution forecasting based on support vector machines (SVM) and wavelet decomposition. Gardner and Dorling [14] trained multilayer perceptron (MLP) neural networks to model hourly NOx and NO2 pollutant concentrations in Central London. Jusoh and Ibrahim [15] utilized artificial neural networks (ANNs) for air pollution index forecasting.
Besides algorithms developed based on traditional statistical and machine learning methods, more and more studies have recently started to apply deep learning technologies to air pollution modeling. Deep learning is an advanced non-linear modeling technique that builds on artificial neural networks but stacks the neuron-like computation units deeper for better modeling. It has been tested in several studies and reported to have outstanding prediction performance in air quality forecasting. For example, Prakash et al. [16] proposed a wavelet-based recurrent neural network (RNN) model to forecast one-step-ahead hourly, daily mean, and daily maximum concentrations of ambient CO, NO, PM2.5, and other prevalent air pollutants. Li et al. [17] extended a long short-term memory (LSTM) network for air pollution prediction and achieved better performance than existing methodologies.
However, although most of the aforementioned approaches have generated accurate forecasting results, the majority of them only modeled the prediction based on historical air pollutant data and meteorological data. Only a few considered the temporal and spatial correlation of the data recorded by neighboring stations. Yang et al. [18] developed a space-time support vector regression (STSVR) model to predict hourly PM2.5 concentrations, incorporating spatial dependence and spatial heterogeneity into the modeling process. Li et al. [17] proposed a long short-term memory neural network extended (LSTME) model that inherently considers spatial-temporal correlation for air pollutant concentration prediction. Szpiro et al. [19] described a methodology for assigning individual estimates of long-term average air pollution concentrations that accounts for a complex spatial-temporal structure and can accommodate spatial-temporally misaligned observations. Still, these papers share a common limitation. Since the distribution of air quality monitoring stations is dense and balanced in some places but sparse and imbalanced in others, the spatial and temporal correlation of air pollutant concentrations between two neighboring stations can differ: it tends to be strong where stations are dense and weak where they are sparse. Weakly correlated inputs can add noise during the modeling process and degrade model performance, yet most previous studies did not address this point well. Therefore, a model that considers the strength of the spatial and temporal correlation of air pollutant concentrations is required to further improve air quality prediction accuracy.
To this end, this paper proposes a correlation filtered spatial-temporal long short-term memory (CFST-LSTM) neural network for air quality prediction. A special spatial-temporal filter (STF) layer is added to the ordinary LSTM network to select among the various spatial-temporal time series from the input layer. In this way, highly correlated inputs are retained, the influence of noisy data is mitigated, and the complexity of the model is reduced, so the model performance can be improved. In this paper, PM2.5 concentration is selected as the prediction target, and historical PM2.5 data for California are collected.

Data Collection
To validate the effectiveness of the proposed methodology, a case study was conducted. California, US was selected as the study area due to data availability. Hourly PM2.5 concentration data for California from 2016 to 2017 were collected from the United States Environmental Protection Agency (EPA), covering 30 monitoring stations. A brief summary of the data is shown in Table 1. The distribution of these 30 stations is presented in Figure 2. It can be seen that the distribution is imbalanced: dense in the northern and southeastern areas of California but sparse in the middle areas. As mentioned in the Introduction, the spatial and temporal correlation of air pollutant concentrations between two neighboring stations can differ due to the spatial distribution of stations. Take station 11 as an example. It may have a higher correlation with station 12 but a lower correlation with station 5, since station 5 is geospatially farther away. The data from station 5 may have a smaller impact on predicting the air quality at station 11 but increase the risk of noise contamination and slow down computation. Therefore, it is important to consider the spatial-temporal correlation among stations when building the forecasting model. To address this, this paper developed a correlation filtered spatial-temporal LSTM (CFST-LSTM) model, which can automatically determine the highly correlated data segments and simultaneously optimize the model from both temporal and spatial aspects. The detailed modeling process of CFST-LSTM is introduced later.

Data Preprocessing
After the raw data were collected, data preprocessing needed to be performed. The collected datasets unavoidably involve some missing values due to machine failure, routine maintenance, human error, insufficient sampling, and other factors. Missing values normally need to be removed or filled to ensure the performance of the modeling [35]. The missing rate in this experiment was relatively small, at 2.65%. We implemented the linear interpolation method following [36] to fill the empty values. After this procedure, each station had 17,544 records of PM2.5 concentrations.
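As an illustration, the gap-filling step can be sketched as follows. This is a minimal pure-Python sketch; the exact interpolation routine of [36] is not specified here, so the function below is an assumption that linearly interpolates interior gaps only.

```python
def fill_missing(series):
    """Linearly interpolate interior gaps (None) in an hourly series.

    Leading/trailing gaps are left untouched; this is an illustrative
    assumption, not the paper's exact procedure.
    """
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            j = i
            while j < n and filled[j] is None:
                j += 1                       # find the end of the gap
            if i > 0 and j < n:              # interior gap: interpolate
                left, right = filled[i - 1], filled[j]
                gap = j - i + 1
                for k in range(i, j):
                    frac = (k - i + 1) / gap
                    filled[k] = left + frac * (right - left)
            i = j
        else:
            i += 1
    return filled
```

For example, `fill_missing([10.0, None, None, 16.0])` fills the two missing hours evenly between the neighboring observations.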
Furthermore, to eliminate the influence of differing scales and speed up model training, min-max normalization was adopted to normalize the data. The normalization is formulated as Equation (1): x* = (x − x min)/(x max − x min) (1), where x max represents the maximum value in the dataset and x min represents the minimum value.
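The normalization of Equation (1) can be sketched in a few lines (assuming a non-constant series, so the denominator is non-zero):

```python
def min_max_normalize(values):
    """Scale a sequence to [0, 1] with Equation (1):
    x* = (x - x_min) / (x_max - x_min).
    Assumes the series is not constant (x_max > x_min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    return [(x - x_min) / span for x in values]
```

For example, `min_max_normalize([2.0, 4.0, 6.0])` yields `[0.0, 0.5, 1.0]`.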


Long Short-Term Memory (LSTM)
After the data collection and preprocessing, the modeling algorithm could be applied to model the data and predict PM2.5 concentrations. In this paper, a correlation filtered spatial-temporal long short-term memory (CFST-LSTM) model was proposed to accomplish the task. The future air quality of a target station was predicted based on the historical data of the station itself and of surrounding stations.
The proposed CFST-LSTM network was developed based on the long short-term memory (LSTM) model. LSTM is a special architecture of recurrent neural network (RNN) proposed by Hochreiter and Schmidhuber [30]. It overcomes the vanishing and exploding gradient problems of RNNs and is capable of learning long-term dependencies through a gating mechanism. Due to its strong ability to model temporal sequential data, LSTM has been reported to achieve state-of-the-art performance in many domains, such as speech recognition, language modeling, and sequence forecasting [37][38][39][40].
The network structure of LSTM consists of an input layer, an output layer, and a plurality of hidden layers. The specialty of LSTM lies in the composition of its hidden layers, which consist of one or more self-recurrent memory blocks. These blocks allow a value (forward pass) or gradient (backward pass) that flows into the block to be preserved and subsequently retrieved at the required time step [41]. The basic structure of the memory block is shown in Figure 3. It consists of three gates (the forget gate f t, the input gate i t, and the output gate o t) and a self-recurrent cell C t. x t is the input to the current block, h t−1 is the hidden state of the last block, and h t is the state of the current block. At time step t, the LSTM memory block can be defined with the following set of equations:

f t = σ(W f · [h t−1, x t] + b f) (2)
i t = σ(W i · [h t−1, x t] + b i) (3)
C̃ t = tanh(W C · [h t−1, x t] + b C) (4)
C t = f t ⊙ C t−1 + i t ⊙ C̃ t (5)
o t = σ(W o · [h t−1, x t] + b o) (6)
h t = o t ⊙ tanh(C t) (7)

where W represents the connection weights between neurons, b represents the bias, ⊙ denotes the element-wise product, and σ denotes the sigmoid activation function used by the gates.
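A single step of the memory block described above can be sketched directly from the standard gate equations. The dictionary-based weight layout and shapes below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM memory-block step using the standard gate equations.

    W and b are dicts with keys "f", "i", "c", "o"; each W[k] acts on
    the concatenated [h_{t-1}, x_t] vector (an assumed layout).
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # cell-state update
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                  # hidden state
    return h_t, c_t
```

In practice, a deep learning framework's LSTM layer would be used; the sketch only mirrors the equations to make the gating mechanism concrete.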


CFST-LSTM
However, when predicting air pollutant concentrations, the ordinary LSTM model neither takes into account the spatial correlation among the monitoring stations nor considers the strength of the spatial-temporal correlations. Consequently, its performance is limited when the air pollution in one area is largely influenced by other areas. To overcome this problem, this study proposed a correlation filtered spatial-temporal long short-term memory (CFST-LSTM) model, which can analyze the lagged spatial-temporal correlation of the input data and select the time series data that meet a pre-set threshold. The architecture of the CFST-LSTM model is presented in Figure 4. It consists of five layers: an input layer, an STF layer, an LSTM layer, a fully connected (FC) layer, and an output layer. The working process of the CFST-LSTM model is as follows.

Firstly, calculate the correlation coefficient matrix R = Corr(S target_site, S other_sites), as shown in Equation (8), where Corr(·) denotes the function for calculating the correlation coefficient matrix, S target_site represents the time series of the target station, and S other_sites represents the lagged time series matrix of the other sites, which can be formulated as Equation (9).
where n represents the number of sites, r represents the largest time lag, t represents the length of the time series, and each S t−j site_i represents the time series of the ith station lagged by j moments with respect to the target station, 1 ≤ i ≤ n, 1 ≤ j ≤ r. For Corr(·), each R (i,j) = ρ(S target_site, S t−j site_i), where ρ(·) is the Pearson correlation function. Its calculation is shown in Equation (10).
ρ(X, Y) = Cov(X, Y) / (σ X σ Y) (10)

where X represents the original input of the model, Y represents the output, Cov(X, Y) represents the covariance of X and Y, σ X and σ Y represent the standard deviations of X and Y, respectively, and ρ ∈ (−1, 1). The formulation of X can be expressed as Equation (11).
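The lagged correlation matrix R of Equations (8)-(10) can be sketched as follows. The array layout (others carrying r extra leading hours so every lag 1..r yields a full-length window) is an illustrative assumption:

```python
import numpy as np

def lagged_corr_matrix(target, others, r):
    """Correlation matrix R: R[i, j-1] is the Pearson correlation between
    the target series and station i's series lagged by j hours.

    `target` has length t; `others` is an (n, t + r) array holding each
    station's series with r extra leading hours (assumed layout).
    """
    n = others.shape[0]
    t = len(target)
    R = np.empty((n, r))
    for i in range(n):
        for j in range(1, r + 1):
            lagged = others[i, r - j : r - j + t]   # shifted back j hours
            R[i, j - 1] = np.corrcoef(target, lagged)[0, 1]
    return R
```

Each entry is the plain Pearson coefficient Cov(X, Y)/(σ X σ Y) computed by `np.corrcoef`.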
where site_i t−j represents the value recorded at the ith station at time t − j.
Secondly, transform R based on the correlation threshold ρ th into a binary matrix R I, as shown in Equation (12): R I (i,j) = I(R (i,j) ≥ ρ th) (12).
where I(·) represents the indicator function. When R (i,j) ≥ ρ th , R I (i,j) = 1. When R (i,j) < ρ th , R I (i,j) = 0. Thirdly, apply the element-wise product to X and R I . Then, the final output X of STF layer is given, as presented in Equation (13).
where * represents the element-wise product and (·) T represents the matrix transpose.
After the filtered X is obtained, it is input into the LSTM layer. The subsequent steps are the same as in the ordinary LSTM model.
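The thresholding and masking of Equations (12)-(13) amount to a few array operations. The (n, r, t) layout for the lagged inputs X is an assumption for illustration:

```python
import numpy as np

def stf_filter(X, R, rho_th):
    """Spatial-temporal filter: keep only the lagged input series whose
    correlation with the target meets the threshold.

    X is an (n, r, t) array of lagged input series and R the (n, r)
    correlation matrix; the layout is an illustrative assumption.
    """
    R_I = (R >= rho_th).astype(float)   # Equation (12): indicator mask
    return X * R_I[:, :, None]          # Equation (13): element-wise product
```

Series falling below ρ th are zeroed out, so the LSTM layer only receives the highly correlated inputs.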

LSTM Structure Optimization
To validate the effectiveness of the proposed model, hourly PM2.5 data from 30 stations in California were collected. After the data were preprocessed, they were input into the model. Before that, however, the parameters of the model needed to be optimized. For illustration purposes, monitoring station 11 is used as an example. It contains 17,544 records of PM2.5 concentrations; 70% of its data are used as training samples, while the remaining 30% are used as testing samples. The number of samples depends on the largest time lag value: if the time lag is 12, there are 12,272 training samples and 5260 testing samples. Note that the optimization results reported in the following are all based on the performance on the testing set.
Before applying the proposed model in further experiments, its parameters need to be identified and optimized for a better result. Although this study designed an STF layer into the LSTM network, some parameter settings of the deep neural network can be re-used. Following [42], this paper sets the epoch and batch size to 1000 and 48, respectively. MSE is selected as the loss function, and RMSprop is adopted as the optimizer. The learning rate of the model is set to 0.001. One fully connected layer with 64 neurons and a linear activation function is used, and the output layer has one neuron. The largest time lag of the STF layer and the ρ threshold are discussed in later sections; they are pre-set to 12 and 0.4, respectively.
The most important parameter that needs to be tuned before model implementation is the structure of the stacked LSTM layer. The optimization performance is evaluated using root mean square error (RMSE), mean absolute error (MAE), and R 2. Calculations of the three metrics are presented in Equations (14)-(16), respectively:

RMSE = sqrt( (1/N) Σ (y true,i − y pred,i)^2 ) (14)
MAE = (1/N) Σ | y true,i − y pred,i | (15)
R 2 = 1 − Σ (y true,i − y pred,i)^2 / Σ (y true,i − y mean)^2 (16)

where N represents the number of samples, y true,i represents the true value of the ith sample, y pred,i represents the predicted value of the ith sample, and y mean represents the mean of the true values. The number of neurons is the same in each LSTM layer. Candidates for the number of LSTM layers were narrowed down to {2, 3, 4} after a trial-and-error process, and the number of neurons per layer was chosen from {32, 64, 128}. Optimization results are shown in Table 2. It can be observed that when the number of LSTM layers is 2 and the number of neurons in each layer is 64, the model has the lowest RMSE and MAE and the highest R 2. Therefore, this paper sets the number of LSTM layers to 2 and the number of neurons to 64.
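The three evaluation metrics can be computed directly from their definitions:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Equation (14)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, Equation (15)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination, Equation (16)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction gives RMSE = MAE = 0 and R 2 = 1.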

Comparison of Different Correlation Threshold
Besides the structure of the stacked LSTM layer, it is also important to optimize the largest time lag r and the correlation threshold ρ th. These two parameters are the key parameters of the newly designed STF layer: the time lag r determines the feature pool of the model, while the correlation threshold ρ th determines the quality of that pool. Since the air quality time series at one location usually follows a daily period, and one day may not be enough to extract the useful pattern, the test range of r was set to 48 h, and the model performance was calculated for different pairs of r and ρ th. The results are shown in Figures 5-7.
It can be seen from Figures 5-7 that both parameters affect the model performance considerably. For example, the R 2 value ranges from a low of 0.909 up to 0.958, a difference of almost 5%. The setting of ρ th shows a clear separation in model performance: when ρ th ≤ 0.4, the R 2 value never falls below 0.920, whereas when ρ th > 0.4, most of the R 2 values are lower than 0.920. This is because when ρ th is too large, many useful inputs are filtered out, leaving limited knowledge to learn from; when ρ th is too low, the inputs retain much noise, and the model performance drops. Overall, Figures 5-7 show that when r = 12 and ρ th = 0.4, the model has the lowest RMSE and MAE and the highest R 2 value, so this pair is optimal for these two parameters. The goodness-of-fit plot is shown in Figure 8.
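The exhaustive search over (r, ρ th) pairs described above can be sketched as follows. The `score_fn` callable stands in for the full train-and-evaluate routine (train CFST-LSTM with the given pair, return the testing-set R 2), which is assumed here:

```python
def tune_stf_parameters(score_fn, r_values, rho_values):
    """Evaluate every (r, rho_th) pair and keep the pair with the
    highest score (e.g., testing-set R^2).

    score_fn is a hypothetical stand-in for training and evaluating
    the model at a given parameter pair.
    """
    best = None
    for r in r_values:
        for rho_th in rho_values:
            score = score_fn(r, rho_th)
            if best is None or score > best[0]:
                best = (score, r, rho_th)
    return best  # (best_score, best_r, best_rho_th)
```

With the paper's grid (r up to 48 h, several thresholds), such a search would return r = 12 and ρ th = 0.4 as reported.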

Model Comparison
To prove the effectiveness of the CFST-LSTM model, its prediction performance is compared with that of traditional machine learning models and commonly used neural networks: three traditional machine learning models, namely LASSO regression, Ridge regression, and support vector regression (SVR), and two commonly used neural networks, namely the artificial neural network (ANN) and the recurrent neural network (RNN) [43][44][45][46][47]. The parameters of these algorithms were all tuned using the same grid search process. Table 3 presents the results of the model comparison. (1) It can be seen from the first six rows that, compared with the other models, the two deep learning networks, RNN and CFST-LSTM, have lower RMSE/MAE and higher R 2. This is because, compared with Lasso and Ridge, neural network-based learning algorithms are better at modeling non-linear real-world relationships; therefore, ANN, RNN, and CFST-LSTM perform better. (2) Also, compared with the traditional machine learning methods SVR and ANN, the two deep learning based models (RNN and CFST-LSTM) have lower errors and higher R 2, because they are specifically designed for time series problems and can more easily learn the impact of historical data. (3) Lastly, CFST-LSTM outperforms RNN, exhibiting the lowest RMSE and the highest R 2, and the difference is quite significant. The main reason is that, except for CFST-LSTM, the other models did not consider the influence of nearby stations.

Since the newly designed STF layer in CFST-LSTM is a neural network layer, it can also be added to ANN and RNN. To test how much it helps improve the performance of neural network models, this study also calculated the R 2 values of CFST-ANN and CFST-RNN. The results are shown in the last two columns of Table 3. It can be seen that the R 2 values of these two neural networks improved, which, from another angle, proves the effectiveness of the proposed STF layer. However, the overall performance of CFST-ANN and CFST-RNN still lags behind CFST-LSTM, because the LSTM layer in CFST-LSTM is better at learning long-term dependencies from time series data.

Comparison of Ordinary LSTM and CFST-LSTM
Besides the comparison between CFST-LSTM and other commonly used machine learning/deep learning algorithms, this study also compared the performance of CFST-LSTM and the ordinary LSTM network. To better illustrate the comparison and the advantages of the proposed CFST-LSTM model, this experiment expands the test site from site 11 to all sites within California. For each site, the data are separated into a 70% training set and a 30% testing set, and models are then trained using the ordinary LSTM (O-LSTM), the full-site-inputs LSTM (F-LSTM), and CFST-LSTM. The differences between these three models are shown in Equations (17) to (19).
The results are all calculated based on the testing sets of the different sites, and they are shown in Figure 9. Figure 9 (left) plots the R 2 values on different sites: R 2 O denotes the R 2 value calculated using O-LSTM and is marked with a blue line, R 2 F denotes that of F-LSTM (green line), and R 2 C denotes that of CFST-LSTM (orange line). It can be seen that, overall, CFST-LSTM performs the best, O-LSTM second, and F-LSTM third. To show the differences between these three models more clearly, we calculated the values of R 2 O − R 2 F and R 2 O − R 2 C and plotted them in Figure 9 (right). From this view, it can be seen that O-LSTM performs better than F-LSTM at most sites, while it seldom surpasses CFST-LSTM.
The average values of these three indicators over all sites are shown in Table 4. CFST-LSTM has the highest average R 2 , with a value of 0.9155, which is 2.88% better than O-LSTM and 6.29% better than F-LSTM. This is reasonable, since CFST-LSTM not only considers the influence from nearby stations but also filters out less related time series inputs. F-LSTM performs the worst, with a value of 0.8613, which is 3.32% lower than O-LSTM. This is also understandable, since its inputs contain too much noise when modeling the PM2.5 concentrations. Overall, the comparison between these three models helps prove the effectiveness of the proposed CFST-LSTM. It reflects that retaining the stations with higher correlation and dropping those with lower correlation can effectively improve the prediction accuracy of PM2.5 concentrations.
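As a quick sanity check, the Table 4 percentages quoted above are mutually consistent, and the O-LSTM average (not stated explicitly in the text) can be recovered from them. The relative-difference conventions used below are our assumption.

```python
# Consistency check of the quoted Table 4 averages and percentages.
r2_cfst, r2_f = 0.9155, 0.8613          # given CFST-LSTM and F-LSTM averages
r2_o = r2_cfst / 1.0288                 # CFST-LSTM is 2.88% better than O-LSTM

# CFST-LSTM is 6.29% better than F-LSTM
assert abs(r2_cfst / r2_f - 1 - 0.0629) < 1e-3
# F-LSTM is 3.32% lower than O-LSTM (relative difference)
assert abs((r2_o - r2_f) / r2_f - 0.0332) < 1e-3

print(round(r2_o, 4))  # recovered O-LSTM average, about 0.8899
```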

Improvements Interpolation
To further explore the features of CFST-LSTM, this study interpolates the R 2 improvements geospatially. The inverse distance weighted (IDW) technique of GIS [48] is used to visualize and interpolate the value of ∆R 2 = R 2 C − R 2 O over the map of California. The spatial distribution of ∆R 2 is presented in Figure 10. Red means the value of ∆R 2 is high, while yellow means it is low. It can be seen that the improvement is higher in areas with denser stations than in areas with sparser stations. This is because in areas with a higher density of stations, more stations remain after filtering during the modeling process of CFST-LSTM, and therefore more related information is utilized to learn the temporal patterns. However, in areas with a lower density of stations, only one or two stations might be retained by CFST-LSTM as the inputs. The prediction result, therefore, could be similar to that of the ordinary model, which uses the historical data of the target station only.
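A minimal IDW sketch in the spirit of the GIS operation described above. The station coordinates and improvement values are made-up placeholders, and the power of 2 is an assumed (though common) default.

```python
# Minimal inverse distance weighted (IDW) interpolation sketch.
import numpy as np

def idw(xy_known, values, xy_query, power=2.0, eps=1e-12):
    """Interpolate `values` at `xy_query` from scattered points `xy_known`."""
    # pairwise distances: (n_query, n_known)
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    w = 1.0 / (d ** power + eps)        # closer stations weigh more
    return (w * values).sum(axis=1) / w.sum(axis=1)

stations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
delta_r2 = np.array([0.05, 0.01, 0.03])  # hypothetical R2 improvements
grid = np.array([[0.5, 0.5], [0.0, 0.0]])
print(idw(stations, delta_r2, grid))
```

A query point coinciding with a station recovers that station's value (the `eps` term only guards against division by zero), which is why the interpolated map in Figure 10 passes through the observed improvements at the station locations.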


Conclusions
This paper proposed a correlation filtered spatial-temporal long short-term memory (CFST-LSTM) model for the prediction of PM2.5 concentrations. For a target station, not only its own historical data but also the data of its surrounding stations are used. A spatial-temporal filter (STF) layer was designed to automatically remove station data with low correlation to the target station. Hourly PM2.5 concentration data of 30 stations in California were collected to validate the model's effectiveness. The prediction performance of the CFST-LSTM model was compared with that of traditional machine learning models and commonly used neural networks. The results show that:
• The proposed CFST-LSTM model outperforms other commonly used machine learning/deep learning models, with a better fitting degree and higher prediction accuracy. Its R 2 can reach 0.9583.
• Compared with the ordinary LSTM, our method not only considers the influence from nearby stations but also filters out less related time series inputs, which improves R 2 by 2.88% in our tests on 30 sites. In contrast, simply adding the time series inputs from all other stations lowers performance by 3.32% due to a higher level of noise.
• According to the experiment on R 2 improvements over the sites in California, the proposed method exhibits a higher improvement over the ordinary LSTM in areas with denser sites, and a lower improvement in sparser districts. This reflects that our method performs better at places with denser spatial inputs.
• Parameter optimization of the newly designed STF layer is quite important to the proposed method. The experiment showed that the difference in R 2 between proper and improper parameters can reach around 5.39% of the overall performance.
The main contribution of this study is an improved neural network model for spatial-temporal time series prediction, built on deep learning techniques. Besides the prediction of PM2.5 concentrations, the model is expected to be applicable to other types of spatial-temporal time series problems, such as the prediction of weather, wind power, and other types of air pollutants; further studies are needed to verify this.
Due to data availability, only historical PM2.5 concentration data were collected and tested in this paper. Other possible influencing factors of PM2.5, such as meteorological characteristics and traffic emissions, were not considered. Further studies could explore the feasibility of applying the proposed method to multivariate inputs.