PM2.5 Prediction with a Novel Multi-Step-Ahead Forecasting Model Based on Dynamic Wind Field Distance

Particulate matter with a diameter of less than 2.5 μm (PM2.5) has damaged public health globally for a decade. Accurate forecasts of PM2.5 concentration can provide early warnings to prevent the public from hazard exposure. However, existing methods have not considered the available spatiotemporal data sufficiently due to their architecture or inadequate input, and most neglected wind impact on spatiotemporal correlation when selecting related sites. To fill this gap, we proposed a long short-term memory-convolutional neural network based on dynamic wind field distance (LSTM-CNN-DWFD) to predict PM2.5 concentration of a specific site for the next 24 h. A KNN method based on dynamic wind field distance was developed and applied to select highly related sites considering wind impact. A local stateful LSTM model was employed to capture temporal correlations in historical air quality and meteorological data for each related site. Then, these temporal features were integrated as a spatiotemporal matrix, and input into CNN for extracting spatiotemporal correlation features. Weather forecasts were also integrated into the model to promote accuracy. Hourly PM2.5 data from 36 monitoring sites in Beijing, China collected from 1 May 2014 to 30 April 2015 were used as experimental dataset. Six-fold rolling origin method was employed to conduct experiments on each site, and the results of 216 experiments validated the performance of the proposed LSTM-CNN-DWFD model. The mean R2 values of the next 1–6 h prediction were 0.85, 0.81, 0.76, 0.70, 0.64, and 0.59, respectively, showing a decrease trend, indicating that the prediction accuracy decreases as the prediction time increases. Comparisons of LSTM-CNN-DWFD results to results from six other methods show that it delivered higher accuracy PM2.5 predictions, with the mean RMSE (MAE) of 1–6, 7–12, and 13–24 h prediction were 43.90 (29.17), 57.89 (42.16), and 63.14 (47.64), respectively. The results also demonstrate that the sites selected based on dynamic wind field distance are more related to the central site than that based on geographical distance, also contributing to prediction accuracy.


Introduction
Over the past decades, the serious pollution problems caused by the rapid global economic development have attracted the attention of both researchers and the general public worldwide. According to the World Health Organization [1], approximately 90% of people worldwide breathe air that does not accord with air quality standards, and around three million deaths each year are related to outdoor air pollution. Particles suspended in the air can harm the respiratory and cardiovascular

Materials and Methods
This study aims to forecast a specific site's hourly PM 2.5 concentration of the next 24 h. We divided 1-24 h into three intervals, that is, 1-6, 7-12, and 13-24 h, and trained three LSTM-CNN-DWFD forecasting models for them. As shown in Figure 1, the historical data of past r hours were input into the model. We forecasted the real-valued concentration of PM 2.5 for the next 1-6 h and the maximum and minimum values of the next 7-12 and 13-24 h. Therefore, the proposed model is a multi-step-ahead forecasting model. In the following subsections, we introduce the details of the proposed LSTM-CNN-DWFD model, the experimental datasets, and the evaluating methods. pollution. Huang and Kuo [27] proposed APNet model that used CNN to capture the correlation 96 among the air quality and meteorological data of the target station and LSTM to extract temporal 97 features. Wen and Liu [28] used CNN to capture the spatial correlation of the PM2.5 data of multiple 98 stations and then used LSTM to capture temporal tendency. However, both of them have not 99 considered the air quality and meteorological spatiotemporal data of surrounding sites sufficiently. 100 The current study proposed an LSTM-CNN multi-step-ahead forecasting model using dynamic 101 wind field distance (LSTM-CNN-DWFD) to fill the gap of existing methods. First, we selected K-102 nearest sites as neighbor sites in accordance with defined dynamic wind field distance instead of the 103 most popular geographical distance. Second, we built an LSTM-CNN hybrid neural network to 104 provide predictions. We input the historical air quality and meteorological condition of the target site 105 and neighbor sites into a local LSTM and FC to capture temporal features of each site. Then, we used 106 CNN to capture spatiotemporal features. We also concatenated the spatiotemporal features with the 107 weather forecasts. In turn, we input these forecasts into FC to predict the hourly PM2.5 concentration 108 of the target site over the next 24 h. We selected the PM2. 5  makes the following contributions: (1) When using KNN to select related sites, wind impact was 113 taken into consideration by replacing geographical distance with dynamic wind field distance; (2) 114 spatiotemporal data, which includes the air quality and weather data of the target and neighbor sites, 115 were fully considered in the proposed LSTM-CNN-DWFD model. From the results, the proposed 116 method was found more effective in selecting related sites and extracting spatiotemporal features 117 and performed with higher accuracy in multi-scale predictions of PM2.5 concentration than other 118 methods. 119

Materials and Methods 120
This study aims to forecast a specific site's hourly PM2.5 concentration of the next 24 h. We 121 divided 1-24 h into three intervals, that is, 1-6, 7-12, and 13-24 h, and trained three LSTM-CNN-122 DWFD forecasting models for them. As shown in Figure 1, the historical data of past r hours were 123 input into the model. We forecasted the real-valued concentration of PM2.5 for the next 1-6 h and the 124 maximum and minimum values of the next 7-12 and 13-24 h. Therefore, the proposed model is a 125 multi-step-ahead forecasting model. In the following subsections, we introduce the details of the 126 proposed LSTM-CNN-DWFD model, the experimental datasets, and the evaluating methods. 127 −1 ... ...

Data 133
As the capital of China, Beijing is one of the most economically dynamic cities and the most 134 polluted cities. According to the dataset collected from Urban Air project of Microsoft Research [29], 135 the air quality and meteorological data measured in Beijing was selected as the experimental data. 136 Figure 1. Multi-step-ahead forecasting model. r refers to the time lag; z T−r+1 to z T refer to the historical data at time T − r + 1 to T, including air pollution and meteorological factors; wf refers to weather forecasts of the target site; x T+1 to x T+24 refer to PM 2.5 concentration at time T + 1 to T + 24.

Data
As the capital of China, Beijing is one of the most economically dynamic cities and the most polluted cities. According to the dataset collected from Urban Air project of Microsoft Research [29], This project was established for using a diversity of big data to infer and predict fine-grained air 137 quality throughout a city, and finally tackle air pollutions [10,30,31]. This project collected real-time 138 meteorological data, weather forecasts, and air quality data of different cities from a list of public 139 websites and public/private web services. Hourly air quality data from 2296 stations in 302 Chinese 140 cities were collected, including PM2.5, PM10, NO2, CO, SO2, and O3. The meteorological data were 141 collected from 3514 cities/districts/stations, which included observations of time, weather, 142 temperature, pressure, humidity, wind speed, and wind direction. Additionally, weather forecasts 143 data were also collected from 2612 cities/districts, including weather, wind direction, wind level, and 144 up and down temperature. The whole dataset of four major Chinese cities (Beijing, Tianjin, 145 Guangzhou, and Shenzhen) were organized as general text files and open to the public for further 146 researches [32], and more details of this dataset can be found in Zheng [10]. This dataset has been 147 applied to many recent empirical researches for exploring PM2.5 concentration forecasting models 148 [12,14,22,33,34]. According to this dataset, we selected the hourly data related to Beijing as 149 experimental data, including the air quality records of 36 monitoring stations, meteorological 150 observations of meteorological stations, and weather forecasts. Its time period is from 1 May 2014 to 151 30 April 2015. Figure 2 shows the locations of the 36 monitoring stations. The color represents the 152 grading of the average PM2.5 concentration. Discrete variables, namely, wind level, wind direction, 153 and weather, were input into the embedding layer to reduce the dimensionality.  The air quality and meteorological observations contained some missing values. We filled these 158 missing values via two means: If the values of the previous hour and the subsequent hour existed, 159 then, we interpolated the missing value via linear interpolation based on these values; otherwise, we 160 filled the missing value via inverse distance interpolation in accordance with the values of other sites. 161 The interpolation accuracy was tested by randomly interpolating 100 valid PM2.5 concentration 162 observations. The mean absolute error (defined in Subsection 2.  The air quality and meteorological observations contained some missing values. We filled these missing values via two means: If the values of the previous hour and the subsequent hour existed, then, we interpolated the missing value via linear interpolation based on these values; otherwise, we filled the missing value via inverse distance interpolation in accordance with the values of other sites. The interpolation accuracy was tested by randomly interpolating 100 valid PM 2.5 concentration observations. The mean absolute error (defined in Section 2.3) was 7.85 µg/m 3 . Finally, all observations were centralized and standardized by the following equation to eliminate the impact of different dimensions.
where x i and x * i represent the original observation and transformed values of factor x, respectively; x and σ are the mean and standard deviation of all observations, respectively. Figure 3 presents the overall framework of the proposed LSTM-CNN-DWFD. The input (see the blue arrows in Figure 3) originated from two parts: The historical air quality and meteorological data of the target and neighbor sites at past r hours and weather forecasts of the target station. The output was the PM 2.5 concentration predictions of the target station at T + 1, . . . , T + 24. The architecture of LSTM-CNN-DWFD consisted of two steps: Selecting highly related sites using KNN based on dynamic wind field distance (KNN-DWFD) and extracting spatiotemporal features using a hybrid neural network LSTM-CNN. Particularly, LSTM-CNN contained a spatiotemporal relation extractor (SRE) and a weather forecast combinator (WFC) as shown in Figure 3. The processes of KNN-DWFD and LSTM-CNN are as follows.  (SRE) and a weather forecast combinator (WFC) as shown in Figure 3.

181
Step means the dimension of forecast; it equals to 6 for 1-6 h forecasting, whereas it equals to 2 for 7-  predicted. We defined dynamic wind field distance to evaluate the distance between the target and 185 other sites at time . Then, we selected the K-nearest stations as the most related sites, and we defined 186 them as neighbor sites to the target site (blue flags in Figure 3). The historical air quality (PM2.5, PM10, 187 NO2, CO, SO2, and O3) and meteorological data (weather, temperature, pressure, humidity, wind 188 Step means the dimension of forecast; it equals to 6 for 1-6 h forecasting, whereas it equals to 2 for 7-12 and 13-24 h forecasting.

KNN-DWFD.
As shown at the bottom of Figure 3, the red flag represents the target station to be predicted. We defined dynamic wind field distance to evaluate the distance between the target and other sites at time T. Then, we selected the K-nearest stations as the most related sites, and we defined them as neighbor sites to the target site (blue flags in Figure 3). The historical air quality (PM 2.5 , PM 10 , NO 2 , CO, SO 2, and O 3 ) and meteorological data (weather, temperature, pressure, humidity, wind speed, and wind direction) of the target and neighbor stations at past r hours (T − r + 1, . . . , T) were input into the following model. The wind impact on spatial correlation was fully considered, and the interference of less relevant sites was avoided by the selection process.
LSTM-CNN: SRE. As the green part in Figure 3 shows, we constructed an SRE to extract spatiotemporal features from the above historical spatiotemporal data. Local LSTM was built on each site to learn the long-and short-term tendencies of historical data. By decoding the outputs of FC, the temporal features were obtained for each site. Finally, we employed CNN to learn spatiotemporal correlation among these temporal features. The compressed spatiotemporal features were output by the convolution and pooling layers.
LSTM-CNN: WFC. The weather forecasts (i.e., weather, wind direction, wind level, and up and down temperature) of the target station were integrated with the spatiotemporal features obtained by SRE into an FC layer (see the pink part in Figure 3), and the final predictions of the future PM 2.5 concentration of target station were output.
The three parts were combined together to form LSTM-CNN-DWFD. The details of each part are introduced in the following subsubsections.

KNN-DWFD
Traditional KNN and Its Limitations KNN [35] aims to calculate the distance between a sample and other samples in accordance with their characteristics and to choose the K-nearest samples as the most related ones. In this paper, we define the K most related sites as neighbor sites to the target site. KNN has been extensively employed to choose the most related sites in the pollution field due to its high computation efficiency. The most commonly used distance criterion is the geographical distance, and the formula is as follows: where x i and x j are the locations (longitude, latitude) of stations i and j, respectively, and d ij represents their geographical distance. This distance considers spatial correlation to be related only with geographical distance. Therefore, the spatial correlation is considered to be isotropic by this definition of distance. However, the diffusion of pollutants is based on wind direction and wind speed. As such, the spatial correlation should be considered to be anisotropic. For example, in Figure 4a, p is the target station and its wind direction is northeast (the red arrow in Figure 4a); the upwind region is denoted as shadow area. n 1 , n 2 , . . . , n 5 are the surrounding monitoring stations that locate at different directions. Evidently, n 1 , n 2 , and n 3 are at the upwind area, and their pollutants are transported to p depending on the wind. Therefore, their affect degree to p are higher than that indicated by the geographical distance. Moreover, the nearer to the target station and the closer to the upwind direction, the easier the pollutants can be transported to p, thus the higher the affect degree of the surrounding sites. We define θ ij (t) to represent the angle between the wind direction of target site i at time t and the edge between i and a surrounding site j (the blue arrow in Figure 4a), and d ij represents the geographical distance between sites i and j. Then, owing to the transport process based on wind, the spatial correlation is negative with d ij and θ ij (t). Moreover, the affect degree of upwind sites is higher than that of downwind sites with the same distance.

Definition of KNN-DWFD 230
The traditional geographical distance can only represent the dependence of spatial distance; 231 it cannot describe the effect of wind direction. To address the negative correlation between spatial 232 relation with and (t) and simultaneously increase the affect degree of upwind sites, a 233 dynamic wind field distance based on Gauss kernel function ( ) was presented as shown in 234 the following equation. 235 where is the geographical distance between sites i and j, which can be calculated by their 236 locations; (t) is defined as the above, and it ranges in [0, 180°]; (t) and _ ( ) are the 237 temporal variables, which can be computed by the dynamic real-time wind direction, because wind 238 direction changes over time; is a tradeoff between the effect of geographical distance and wind 239 direction. The dynamic wind field distance of upwind sites as defined above is smaller than their 240 geographical distance. Accordingly, the evaluated affect degree increases. Meanwhile, the dynamic 241 wind field distance of downwind sites remains equal to their geographical distance. The value of 242 should be moderate. For example, Figure 4(b) shows the variation curve between _ ( ) and 243 (t) when = 1 and σ = 1,2,3. When = 1, the wind field distance of the exact upwind location 244 ( (t) = 0) is 0.6 times its actual distance, thereby exaggerating the wind field effect. However, when 245 increases to 3, the wind field distance of the exact upwind location is 0.95 times its actual distance; 246 obviously, wind direction might not be functional in this situation. Thus, in our experiments, was 247 set to 2.

248
At the bottom of Figure  neighbor sites selected, and r represents the time lag. For example, when we use the historical data 255 of the past 3 h to predict the future PM2.5 concentration, r refers to 3. f represents the number of 256 features in consideration, and in our study, the features included air quality (PM2.5, PM10, NO2, CO, 257 SO2, and O3) and meteorological factors (weather, temperature, pressure, humidity, wind speed, and 258 wind direction). Compared with the traditional KNN based on spatial distance, KNN-DWFD 259

Definition of KNN-DWFD
The traditional geographical distance d ij can only represent the dependence of spatial distance; it cannot describe the effect of wind direction. To address the negative correlation between spatial relation with d ij and θ ij (t) and simultaneously increase the affect degree of upwind sites, a dynamic wind field distance based on Gauss kernel function d ij wind (t) was presented as shown in the following equation.
where d ij is the geographical distance between sites i and j, which can be calculated by their locations; θ ij (t) is defined as the above, and it ranges in 0, 180 • ; θ ij (t) and d ij_wind (t) are the temporal variables, which can be computed by the dynamic real-time wind direction, because wind direction changes over time; σ is a tradeoff between the effect of geographical distance and wind direction. The dynamic wind field distance of upwind sites as defined above is smaller than their geographical distance. Accordingly, the evaluated affect degree increases. Meanwhile, the dynamic wind field distance of downwind sites remains equal to their geographical distance. The value of σ should be moderate. For example, Figure 4b shows the variation curve between d ij_wind (t) and θ ij (t) when d ij = 1 and σ = 1, 2, 3. When σ = 1, the wind field distance of the exact upwind location (θ ij (t) = 0) is 0.6 times its actual distance, thereby exaggerating the wind field effect. However, when σ increases to 3, the wind field distance of the exact upwind location is 0.95 times its actual distance; obviously, wind direction might not be functional in this situation. Thus, in our experiments, σ was set to 2. At the bottom of Figure 3 is an illustrated process of KNN-DWFD. In the case of predicting the PM 2.5 concentration of the target station i at time T + 1, . . . , T + 24, we computed the dynamic wind field distance of all other sites at time T d ij_wind (T). We selected and defined the K-nearest sites as neighbor sites. The historical air quality and meteorological data of the target and neighbor stations at r time points before time T + 1 were chosen. Consequently, K + 1 r * f 2D matrices with time series were obtained (see the blue frames at the bottom of Figure 3), where K is the number of neighbor sites selected, and r represents the time lag. For example, when we use the historical data of the past 3 h to predict the future PM 2.5 concentration, r refers to 3. f represents the number of features in consideration, and in our study, the features included air quality (PM 2.5 , PM 10 , NO 2 , CO, SO 2, and O 3 ) and meteorological factors (weather, temperature, pressure, humidity, wind speed, and wind direction). Compared with the traditional KNN based on spatial distance, KNN-DWFD considers the effect of wind on the spreading of air pollutants. Therefore, it chooses more related sites than KNN based on geographical distance.

LSTM-CNN: SRE
To perform a full excavation of the spatiotemporal correlations from the above K + 1 r * f 2D matrices, we constructed an SRE (see the Spatiotemporal Relation Extractor in Figure 3) that combined the LSTM characteristics that learns long-and short-term tendencies of input time series data along with the CNN characteristics that extracts and compresses spatiotemporal features from input data. The concrete flow of SRE is provided below.
First, LSTM was employed for conducting time series analysis. LSTM is a special recurrent neural network, with its recurrent neuron simultaneously learning long-and short-term tendencies of the time series data. The LSTM model used in our model was stateful LSTM, which uses the state of the current batch of LSTM samples as the initial state of the next batch of samples. It is more suitable for processing long-term time series data than the other models. As illustrated in the middle of Figure 3, we constructed local stateful LSTM models in the target and neighbor stations to capture long-and short-term tendencies from the historical air quality and meteorological data. Figure 5 shows the structure of the recurrent memory cell of LSTM, where x t represents the input, that is, the historical air quality and meteorological data of each site, and h t represents the output. The training process of the recurrent memory cell is given by the following equations. considers the effect of wind on the spreading of air pollutants. Therefore, it chooses more related sites 260 than KNN based on geographical distance. 261

LSTM-CNN: SRE 262
To perform a full excavation of the spatiotemporal correlations from the above K + 1 r * f 2D 263 matrices, we constructed an SRE (see the Spatiotemporal Relation Extractor in Figure 3) that combined 264 the LSTM characteristics that learns long-and short-term tendencies of input time series data along 265 with the CNN characteristics that extracts and compresses spatiotemporal features from input data. 266 The concrete flow of SRE is provided below. 267 First, LSTM was employed for conducting time series analysis. LSTM is a special recurrent 268 neural network, with its recurrent neuron simultaneously learning long-and short-term tendencies 269 of the time series data. The LSTM model used in our model was stateful LSTM, which uses the state 270 of the current batch of LSTM samples as the initial state of the next batch of samples. It is more suitable 271 for processing long-term time series data than the other models. As illustrated in the middle of Figure  272 3, we constructed local stateful LSTM models in the target and neighbor stations to capture long-and 273 short-term tendencies from the historical air quality and meteorological data. Figure 5 shows the 274 structure of the recurrent memory cell of LSTM, where represents the input, that is, the historical 275 air quality and meteorological data of each site, and ℎ represents the output. The training process 276 of the recurrent memory cell is given by the following equations.
Forget gate: Selectively forget some past trends and other factors: 283 Output gate: Determine the output information for the current observation: Forget gate: Selectively forget some past trends and other factors: Output gate: Determine the output information for the current observation: where σ and tan h are activation functions; W and b are weight matrix and bias vector, respectively.
Second, a spatiotemporal matrix was constructed. The outputs of LSTM were input to a higher layer, namely FC layer, to output step temporal features. Here, step represents the dimension of predictions; it equals to 6 for 1-6 h prediction but equals to 2 for 7-12 and 13-24 h predictions. The temporal features were obtained via LSTM and FC. Hence, they were inherently correlated. We merged these temporal features of different sites into a 2D matrix (see the green matrix in Figure 3), in which each row represented a station's temporal features extracted by local LSTM and FC. Accordingly, the adjacent elements of this matrix were correlated. The matrix is a spatiotemporal matrix, thereby forming the input of the next neural network.
Finally, CNN was employed for further spatiotemporal analysis. CNN can automatically learn and detect the spatiotemporal features from the input high-dimensional matrix by the convolution layer. Thus, a 2D-convolution layer was used to extract spatiotemporal correlation from the above spatiotemporal matrix, and a pooling layer was used to conduct compression and output the spatiotemporal features. Figure 6 illustrates the process of CNN. where σ and tanh are activation functions; W and b are weight matrix and bias vector, 285 respectively. 286 Second, a spatiotemporal matrix was constructed. The outputs of LSTM were input to a higher 287 layer, namely FC layer, to output step temporal features. Here, step represents the dimension of 288 predictions; it equals to 6 for 1-6 h prediction but equals to 2 for 7-12 and 13-24 h predictions. The 289 temporal features were obtained via LSTM and FC. Hence, they were inherently correlated. We 290 merged these temporal features of different sites into a 2D matrix (see the green matrix in Figure 3), 291 in which each row represented a station's temporal features extracted by local LSTM and FC. 292 Accordingly, the adjacent elements of this matrix were correlated. The matrix is a spatiotemporal 293 matrix, thereby forming the input of the next neural network. 294 Finally, CNN was employed for further spatiotemporal analysis. CNN can automatically learn 295 and detect the spatiotemporal features from the input high-dimensional matrix by the convolution 296 layer. Thus, a 2D-convolution layer was used to extract spatiotemporal correlation from the above 297 spatiotemporal matrix, and a pooling layer was used to conduct compression and output the 298 spatiotemporal features. Figure 6 illustrates the process of CNN.

301
The left matrix TF in Figure 6 represents the spatiotemporal matrix obtained above. Kernel where (•) is the activation function, and it was chosen as linear function. Superscript represents the 305 element value of the corresponding position, and as well as Q are the sizes of the kernels. To 306 capture additional spatiotemporal features from TF, the number of kernels can be increased, and the 307 number of obtained feature map can be increased accordingly. Then, the feature map FM was input 308 into a pooling layer. The maximums of each submatrix with a pre-defined size (e.g., 2 × 2) was 309 calculated, and they formed the final compressed spatiotemporal features SF. CNN can directly 310 handle 2D spatiotemporal matrix without any flattening by convolution kernels and weight sharing. 311 Compared with FC, CNN reduces the complexity of feature extraction, and additional deep 312 spatiotemporal features can be extracted owing to the above advantages. 313

LSTM-CNN: WFC 314
Apart from past air quality and weather, the future weather condition is also highly related to 315 the future PM2.5 concentration. To promote prediction accuracy, we integrated the weather forecasts, 316 which included weather, wind direction, wind level, and up and bottom temperature, of the target 317 station with the spatiotemporal features of past data extracted by SRE into FC. At this time, we output 318 the PM2.5 predictions of the target site at the prediction time (See the pink frame at the top of Figure  319 3). 320 The left matrix TF in Figure 6 represents the spatiotemporal matrix obtained above. Kernel matrix is denoted by w, and the output feature map is denoted by FM. The elements of FM was obtained by the following equation: where g(·) is the activation function, and it was chosen as linear function. Superscript represents the element value of the corresponding position, and P as well as Q are the sizes of the kernels. To capture additional spatiotemporal features from TF, the number of kernels can be increased, and the number of obtained feature map can be increased accordingly. Then, the feature map FM was input into a pooling layer. The maximums of each submatrix with a pre-defined size (e.g., 2 × 2) was calculated, and they formed the final compressed spatiotemporal features SF. CNN can directly handle 2D spatiotemporal matrix without any flattening by convolution kernels and weight sharing. Compared with FC, CNN reduces the complexity of feature extraction, and additional deep spatiotemporal features can be extracted owing to the above advantages.

LSTM-CNN: WFC
Apart from past air quality and weather, the future weather condition is also highly related to the future PM 2.5 concentration. To promote prediction accuracy, we integrated the weather forecasts, which included weather, wind direction, wind level, and up and bottom temperature, of the target station with the spatiotemporal features of past data extracted by SRE into FC. At this time, we output the PM 2.5 predictions of the target site at the prediction time (See the pink frame at the top of Figure 3).

Evaluation Methods
The predictions of the next 24 h were divided into three intervals: 1-6, 7-12, and 13-24 h. As mentioned above, we predicted the real-valued concentration of PM 2.5 for 1-6 h. Meanwhile, for 7-12 and 13-24 h, the maximum and minimum of PM 2.5 concentration were forecasted, respectively. The evaluation criteria were the root mean square error (RMSE), the mean absolute error (MAE), and R-square (R 2 ). RMSE and MAE were employed to evaluate the prediction error of different models. Smaller values result in better prediction accuracy of the model. R 2 represents the fitting degree to the true PM 2.5 observations of different models. A high value increases the reliability of the model. The formulas of these criteria are defined as follows: where n is the sample size; y i and y * i represent the observations and predictions of PM 2.5 , respectively; Cov(·) represents the covariance; Var(·) represents the variance; Y and Y * represent the predicted and observed sequences, respectively. Particularly, for the prediction task of the next 7-12 and 13-24 h, y i and y * i represent the mean of the observed and predicted maximum and minimum PM 2.5 concentration of each interval, respectively, and Y and Y * are the corresponding sequences of y i and y * i .

Results
The experimental data included the hourly air pollutant observations, meteorological factor observations, and weather forecasts of the 36 stations in Beijing from 1 May 2014 to 30 April 2015. A total of 8760 samples were collected for each station, and Section 2.1 introduces the detailed description of the data. Considering that evaluation results based on a single forecast origin can be unreliable when the forecasting results are sensitive to randomness and systematic business cycle effects [36], rolling origin has become a widely used evaluation technique in time series studies [22,37,38]. In the rolling origin method, the time series data are divided into several periods. The first several periods are selected as train set, and the next period is selected as test set. Then, the forecasting origin moves to the next period in turn and the forecasts are produced from each origin [39]. Rolling origin method partially controls for specific effects arising from a particular origin. In this study, considering the required sample size, the moving window and forecast window was set as one month. Considering the requirement of sample size, six-fold rolling origin experiments were conducted on each station in Beijing, and the results of totally 216 experiments were used to evaluate the performance of the proposed model. Table 1 shows the concrete time span of each train set and test set. It is worth noting that the train and test sets of each fold were determined in accordance with time sequence, guaranteeing they have no overlap.
Before building the LSTM-CNN-DWFD model, several parameters should be preset. According to Section 2.2.1, σ in the definition of dynamic wind field distance being 3 or 1 causes the wind impact to be considered inadequate or excessive. Therefore, we set σ equals to 2 in the following experiments.
The best values of other parameters were determined via the rolling origin experiments on Site 1, where the number of neighbor sites selected K was set to 9, and the time lags r was set to 3, 6, and 12 for 1-6, 7-12, and 13-24 h, respectively. To avoid overfitting of the neural network, dropout layer and early stopping method were employed in our experiments.
To demonstrate the effectiveness of the proposed LSTM-CNN-DWFD model, Section 3.1 presents the spatiotemporal distribution of its prediction error, from which we can obtain details of its prediction performance. In Sections 3.2 and 3.3, we design a series of comparisons with six other methods to show its advantages in extracting spatiotemporal features and to confirm the superiority of the proposed KNN-DWFD method in selecting related sites.

Performance of LSTM-CNN-DWFD Model
This subsection shows the forecasting performance of LSTM-CNN-DWFD model.  Figure 7) slowly decreased than the true value. Hence, a low prediction accuracy occurs. Additionally, (g) and (h) in Figure 7 show that, during 13-24 h, the range of the predicted PM 2.5 concentration was larger than that in 7-12 h. In other words, the level of uncertainty increases in 13-24 h. However, the mean of the observations of each interval (red lines) constantly fell in the predicted range. experiments on Site 1, where the number of neighbor sites selected K was set to 9, and the time lags 358 r was set to 3, 6, and 12 for 1-6, 7-12, and 13-24 h, respectively. To avoid overfitting of the neural 359 network, dropout layer and early stopping method were employed in our experiments. 360 To demonstrate the effectiveness of the proposed LSTM-CNN-DWFD model, Subsection 3.1 361 presents the spatiotemporal distribution of its prediction error, from which we can obtain details of 362 its prediction performance. In Subsections 3.2 and 3.3, we design a series of comparisons with six 363 other methods to show its advantages in extracting spatiotemporal features and to confirm the 364 superiority of the proposed KNN-DWFD method in selecting related sites. 365

Performance of LSTM-CNN-DWFD Model 366
This subsection shows the forecasting performance of LSTM-CNN-DWFD model. and 0.59, respectively. Accordingly, the explained variance of our model decreases as prediction time 371 increases. As shown in Figure 7 (a-f), as prediction time increases, the prediction value of the peak 372 value around 9 March (the circles in Figure 7) slowly decreased than the true value. Hence, a low 373 prediction accuracy occurs. Additionally, (g) and (h) in Figure 7 show that, during 13-24 h, the range 374 of the predicted PM2.5 concentration was larger than that in 7-12 h. In other words, the level of 375 uncertainty increases in 13-24 h. However, the mean of the observations of each interval (red lines) 376 constantly fell in the predicted range.     (42.16), and 63.14 (47.64) for 1-6, 7-12, and 13-24 h, respectively. The standard errors of RMSE (MAE) of 36 stations were 9.48 (6.84), 12.13 (9.16), and 12.73 (9.88), respectively, which illustrates the stability of LSTM-CNN-DWFD model. Additionally, the RMSEs of the 1-6, 7-12, and 13-24 h predictions in Fold1-4 ranged in 45-52, 58-73, and 62-82, respectively. However, in Fold5-6, the RMSEs ranged in [34][35][42][43][44][45][46][47], and 44-52, respectively. Hence, the RMSEs of Fold5-6 were relatively lower in general than those in other folds. The results of MAE also show the same trend. Therefore, the prediction accuracy in spring (Fold5-6) tends to be higher than in autumn and winter (Fold1-4). Figure 8 presents the spatial distribution of prediction error in each fold for the next 1-6 h prediction task. The bluer the color, the lower the prediction error (RMSE or MAE).

399
As shown in Figure 8, the color of sites becomes bluer as fold increases. Thus, the prediction 400 error decreases. The spatial distribution of the prediction error of all folds shows that the RMSE 401 and MAE of southern sites were higher than those of northern sites. Hence, the prediction 402 performance in northern sites are better than that in southern sites. 403  As shown in Figure 8, the color of sites becomes bluer as fold increases. Thus, the prediction error decreases. The spatial distribution of the prediction error of all folds shows that the RMSE and MAE of southern sites were higher than those of northern sites. Hence, the prediction performance in northern sites are better than that in southern sites.

Effectiveness of LSTM-CNN-DWFD in Extracting Spatiotemporal Correlation
We proposed a multi-step-ahead forecasting model LSTM-CNN-DWFD in this paper. The LSTM-CNN part was designed to extract spatiotemporal correlation from the input data. To show the effectiveness of its architecture, we conducted three groups of comparison experiments between LSTM-CNN and five baseline models. We chose the neighbor sites in accordance with geographical distance in all models.
(a) Evaluate the effect of neighbor sites' data: LSTM-NN versus LSTM-CNN, CNN, and LSTM-FC [22]. LSTM-NN adopted an LSTM layer to capture the temporal trend of the historical data of target station and used FC layer to integrate weather forecasts. It did not consider the spatiotemporal data of neighbor sites. However, CNN, LSTM-FC, and LSTM-CNN included the historical data of target and neighbor stations as well as weather forecasts as input.
(b) Evaluate the effectiveness in feature extraction: LSTM-CNN versus CNN and LSTM-FC. The input data of these models were the same, but their neural network differed. CNN adopted a convolution layer and a pooling layer to capture spatiotemporal features and integrated weather forecasts by an FC layer. The LSTM-FC separately trained local LSTM models in the target and neighbor sites similar to LSTM-CNN model. However, it adopted FC to extract the spatial feature from the outputs of local LSTM models. Finally, the weather forecasts were also integrated by FC layer in LSTM-FC.
(c) Evaluate the effect of weather forecasts: LSTM-NN versus LSTM and APNet [27]. LSTM and APNet considered the historical data of the target station as input, thereby neglecting the effect of weather forecasts. LSTM used an LSTM layer to capture temporal correlation of the input. APNet first used three 1D-convolution and batch normalization layers to compress the input data; it then used LSTM to capture temporal features.
For fairness, all of the LSTM layers used above were stateful LSTM. These models were trained and tested on 36 stations using six-fold rolling origin, and the mean RMSE, MAE, and R 2 of the total 216 experiments can provide us their general prediction performance. Table 3 clearly shows the prediction error of different methods in three different prediction intervals. The best prediction performance (the smallest RMSE and MAE and the largest R 2 ) of each column is marked in boldface. As shown in Table 3, LSTM-NN showed higher RMSE and MAE and lower R 2 at all prediction scales than CNN, LSTM-FC, and LSTM-CNN. This result is explained by the LSTM-NN neglecting the spatiotemporal correlation of neighbor sites, thereby causing its low prediction accuracy. The RMSEs of the proposed LSTM-CNN model for 1-6, 7-12, and 13-24 h were 44.68, 58.77, and 63.40, respectively, and all RMSEs were lower than those of CNN and LSTM-FC. From the aspect of MAE and R 2 , LSTM-CNN performed the best in the 1-6 and 13-24 h prediction tasks with lower MAE and higher R 2 . Therefore, in 1-6 and 13-24 h predictions, LSTM-CNN has the highest prediction accuracy. In 7-12 h prediction task, CNN, LSTM-FC, and LSTM-CNN, respectively showed the best performance in accordance with R 2 , MAE, and RMSE. Therefore, the overall performance of the three models can be regarded as close in 7-12 h predictions. In addition, the RMSE and MAE of APNet and LSTM were obviously higher than those of LSTM-NN, where their R 2 value were much lower. Hence, neglecting weather forecasts causes substantial loss to the prediction accuracy. Note: The best prediction performance (the smallest RMSE and MAE and the largest R 2 ) of each column is marked in boldface.

Effectiveness of LSTM-CNN-DWFD in Selecting Related Sites
The KNN-DWFD part in the LSTM-CNN-DWFD model was designed to select K neighbor sites considering wind impact. This subsection compares the prediction performance of the forecasting model under KNN in accordance with geographical distance (in LSTM-CNN model) and dynamic wind field distance (in LSTM-CNN-DWFD model).

The Difference of KNN and KNN-DWFD in Selecting Neighbor Sites
To demonstrate the process of using KNN-DWFD for selecting neighbor sites, we compared the five nearest sites to Site 1 under different wind directions in accordance with dynamic wind field distance in Figure 9. Here, * represents Site 1, and the triangles represent the five nearest sites. The bluer the color, the nearer the site, and the higher its affect degree.

Effectiveness of LSTM-CNN-DWFD in Selecting Related Sites 448
The KNN-DWFD part in the LSTM-CNN-DWFD model was designed to select K neighbor sites 449 considering wind impact. This subsection compares the prediction performance of the forecasting 450 model under KNN in accordance with geographical distance (in LSTM-CNN model) and dynamic 451 wind field distance (in LSTM-CNN-DWFD model). 452

The Difference of KNN and KNN-DWFD in Selecting Neighbor Sites 453
To demonstrate the process of using KNN-DWFD for selecting neighbor sites, we compared the 454 five nearest sites to Site 1 under different wind directions in accordance with dynamic wind field 455 distance in Figure 9. Here, * represents Site 1, and the triangles represent the five nearest sites. The 456 bluer the color, the nearer the site, and the higher its affect degree. 457 458 Figure 9. Illustration of the five nearest stations to Site 1 under different wind directions according to 459 dynamic wind field distance.

460
As shown in Figure 9, Sites 6 and 24 are approximately the same geographical distance from Site 461 1; however, they are at different directions, thereby causing their ranks of affect degree to change 462 with wind direction. At time when the wind direction of Site 1 was northwest, Site 24 was in the 463 upwind area, thereby making it more related to Site 1 than the other sites (excluding the nearest 464 surrounding site Site 2). By contrast, at time , when the wind direction of Site 1 changed to 465 southeast, Site 6 was nearly in the exact upwind area. Hence, the affect degree of Site 6 was higher 466 than that of other sites (excluding the nearest surrounding site Site 2). This illustrates that KNN-467 DWFD can dynamically select the most related surrounding sites according to dynamic wind field. 468

Comparing Prediction Performance of LSTM-CNN and LSTM-CNN-DWFD 469
Six-fold rolling origin experiments were performed to build LSTM-CNN and LSTM-CNN-470 DWFD models in 36 stations. Table 4  is 47.54, which was slightly higher than that of LSTM-CNN. However, the RMSE of LSTM-CNN-475 DWFD was 63.14, and was 34.21%, both of which showed better performance than those of 476 As shown in Figure 9, Sites 6 and 24 are approximately the same geographical distance from Site 1; however, they are at different directions, thereby causing their ranks of affect degree to change with wind direction. At time t 1 when the wind direction of Site 1 was northwest, Site 24 was in the upwind area, thereby making it more related to Site 1 than the other sites (excluding the nearest surrounding site Site 2). By contrast, at time t 2 , when the wind direction of Site 1 changed to southeast, Site 6 was nearly in the exact upwind area. Hence, the affect degree of Site 6 was higher than that of other sites (excluding the nearest surrounding site Site 2). This illustrates that KNN-DWFD can dynamically select the most related surrounding sites according to dynamic wind field.

Comparing Prediction Performance of LSTM-CNN and LSTM-CNN-DWFD
Six-fold rolling origin experiments were performed to build LSTM-CNN and LSTM-CNN-DWFD models in 36 stations. Table 4 shows the mean RMSE, MAE, and R 2 at multiple prediction scales, where Models 1 and 2 represent LSTM-CNN and LSTM-CNN-DWFD, respectively. In 1-6 and 7-12 h prediction tasks, LSTM-CNN-DWFD showed low RMSE and MAE and high R 2 , thereby indicating a high prediction accuracy. For 13-24 h prediction task, the MAE of LSTM-CNN-DWFD is 47.54, which was slightly higher than that of LSTM-CNN. However, the RMSE of LSTM-CNN-DWFD was 63.14, and R 2 was 34.21%, both of which showed better performance than those of LSTM-CNN. Thus, LSTM-CNN-DWFD performed higher accuracy than LSTM-CNN in all prediction scales. In 7-12 h prediction, LSTM-CNN, LSTM-FC, and CNN respectively had the lowest RMSE, the lowest MAE, and the highest R 2 as shown in Table 3. However, LSTM-CNN-DWFD further improved the prediction accuracy of LSTM-CNN from all aspects. The RMSE, MAE, and R 2 of LSTM-CNN-DWFD for 7-12 h prediction were 57.89, 42.16 and 49.43%, respectively, all of which outperformed those of LSTM-CNN, LSTM-FC, and CNN. Therefore, from all kinds of criteria, such as RMSE, MAE, and R 2 , LSTM-CNN-DWFD has the best prediction performance. Notes: Models 1 and 2 represent LSTM-CNN and LSTM-CNN-DWFD, respectively. The best prediction performance (the smallest RMSE and MAE and the largest R 2 ) of each column is marked in boldface.
The density of stations highly affects the significance of spatial correlations. The higher the density, the nearer neighbor sites, and the more significant the spatial correlation. Table 5, Table 6, and Table 7 show the prediction error of LSTM-CNN and LSTM-CNN-DWFD in regions with different densities of stations at three different prediction intervals. Here, #ns is the number of surrounding sites within 1.5 km to the target station, which represents the density of stations. The number in the bracket stands for the number of target stations that locate the corresponding density area.  Notes: Models 1 and 2 represent LSTM-CNN and LSTM-CNN-DWFD, respectively. Here, #ns represents the number of surrounding sites within 1.5 km to the target station. The number in the brackets stands for the number of target stations that locate the corresponding density area. The best prediction performance (the smallest RMSE and MAE and the largest R 2 ) of each column is marked in boldface.
The distribution of stations in Beijing is uneven. A total of 15 stations have no more than two sites within 1.5 km. Meanwhile, nine sites have more than 12 sites within the same distance range. The comparison among Table 5, Table 6, and Table 7 shows that the highest prediction accuracy for 1-6, 7-12, and 13-24 h prediction tasks of both models all occurred in 2 < #ns ≤ 12 region, as a lower RMSE and MAE and a higher R 2 indicated. However, the accuracy in #ns ≤ 2 and #ns > 12 regions was relatively worse.
Nonetheless, for #ns ≤ 2, the RMSEs of LSTM-CNN-DWFD at 7-12 and 13-24 h are 62.35 and 67.24, respectively, both of which were lower than that of LSTM-CNN. For 2 < #ns ≤ 12 and #ns > 12 region, the RMSE and MAE of LSTM-CNN-DWFD were generally all lower than LSTM-CNN at multiple prediction scales (except for the MAE at 13-24 h prediction in 2 < #ns ≤ 12 region), and R 2 of LSTM-CNN-DWFD were all higher. Hence, LSTM-CNN-DWFD showed a better prediction accuracy at all prediction scales and all regions with different densities than the other models. In addition, as the density of stations increases, the difference among the RMSE, MAE, and R 2 of the two models increases, which means the superiority of LSTM-CNN-DWFD increases in areas where spatial correlation is important.

Discussion
This paper proposed a novel PM 2.5 forecasting model-LSTM-CNN-DWFD, which constructed a hybrid neural network to extract spatiotemporal data, and took wind impact into consideration when selecting related surrounding sites. To demonstrate the advantage of the proposed model, six-fold rolling origin experiments were conducted, and Section 3 shows the results. The experimental data was restricted to a single year, and the test sets covered part autumn (Nov 2014 in Fold1), a whole winter (Dec 2014 to Feb 2015 in Fold 2-4), and part spring (Mar 2015 to Apr 2015 in Fold 5-6). As shown in Table 2 and Figure 8, from the view of season, LSTM-CNN-DWFD performed better in spring than in autumn and winter; from the view of space, LSTM-CNN-DWFD performed better in the north than in the south. Similar results were obtained by Zhao [22] and Bai [40], both of which found that the performance in winter was the worst, followed by autumn, spring, and summer. Thus, the proposed model is expected to perform higher accuracy if the test set is expanded to a longer time period. The seasonal difference in the prediction accuracy resulted from the variations of atmospheric environment and human activities. The atmosphere environment in autumn and winter (Nov 2014 to Feb 2015) was more stable than spring (Mar 2015 to Apr 2015), including a lower temperature (0.96 versus 10.72°C) and lower wind speed (7.16 versus 7.72 m/s). The stable atmosphere structure in winter contributed weak diffusion of PM 2.5 in both horizontal and vertical directions. In addition, human activities in winter (e.g., heating and use of festival firecrackers) contributed anthropogenic emission. According to Liang and Zou [41], heating activities in winter has contributed more than 50% increase (on average) in PM 2.5 concentration in Beijing since 2010. Ye and Chen [42] found that the traditions of exploding firecrackers had a direct effect on the air pollution aggravation during the Chinese New Year. As a result, the variations of PM 2.5 concentration in winter were more dramatic and the peaks were higher, as Figure 10 shows. Consequently, higher contribution of anthropogenic emission in winter and higher peaks of PM 2.5 concentration caused predicting air quality based on meteorological factors more difficult, and the prediction accuracy was lower. The spatial difference was caused due to that the pollution condition is worse in the south (as shown in Figure 2), and the monitoring stations in the south are fewer and farther between than those in north. Hence, the CNN-based spatial relation extractor cannot capture the spatial dependence well.

548
From the comparison results between LSTM-CNN and five baseline models in Table 3, three 549 useful findings can be extracted. 550 (1) From comparison (a), CNN, LSTM-FC, and LSTM-CNN exhibited lower RMSE and MAE 551 and higher than those in LSTM-NN. This result is explained by neighbor stations having a high 552 effect on the pollution of the target station due to the transport of pollutants. Similar conclusions were 553 also drawn by Zhao [22] and Wen [28] by comparing the performance of the models with and without 554 the surrounding sites considered. Therefore, considering related neighbor stations can further 555 improve prediction accuracy.

556
(2) From comparison (b), LSTM-CNN had higher prediction accuracy, especially in 1-6 and 13-557 24 h, than those of CNN and LSTM-FC. This result is due to the special architecture of combining 558 LSTM with CNN in the SRE part of LSTM-CNN. Compared with CNN, the LSTM layer in SRE is 559 more suitable for processing time series data. The recurrent cell of LSTM contains input gate, forget 560 gate, and output gate. The three gates make the recurrent neuron able to store long-term tendency of 561 the input time series data and extract useful short-term tendency at the same time. This was also 562 confirmed in the work of Li and Peng [11]. However, CNN do not have a recurrent neuron in its 563 architecture, therefore it cannot learn the temporal dependency of time series data. Hence, LSTM can 564 more efficiently extract temporal features than CNN. By comparing the performance of air quality 565 prediction on city scale of CNN-alone and LSTM-alone models, Qin and Yu [26] also found that CNN 566 performance was poor in dealing with long-term sequence prediction. In addition, compared with 567 LSTM-FC, the CNN layer in SRE can directly handle the 2D spatiotemporal matrix and extract 568 spatiotemporal features therefrom. However, FC can only employ 1D data as input, so 2D 569 spatiotemporal matrix must be flattened to 1D data to be processed by FC. The flattening process 570 causes some loss of the spatiotemporal dependency among the element of 2D matrix. Therefore, the 571 spatiotemporal information can be more fully utilized by CNN than FC, and additional deep 572 spatiotemporal features can be extracted. By combining LSTM with CNN in SRE, the proposed 573 LSTM-CNN model showed higher prediction accuracy than CNN and LSTM-FC.

574
( From the comparison results between LSTM-CNN and five baseline models in Table 3, three useful findings can be extracted. (1) From comparison (a), CNN, LSTM-FC, and LSTM-CNN exhibited lower RMSE and MAE and higher R 2 than those in LSTM-NN. This result is explained by neighbor stations having a high effect on the pollution of the target station due to the transport of pollutants. Similar conclusions were also drawn by Zhao [22] and Wen [28] by comparing the performance of the models with and without the surrounding sites considered. Therefore, considering related neighbor stations can further improve prediction accuracy.
(2) From comparison (b), LSTM-CNN had higher prediction accuracy, especially in 1-6 and 13-24 h, than those of CNN and LSTM-FC. This result is due to the special architecture of combining LSTM with CNN in the SRE part of LSTM-CNN. Compared with CNN, the LSTM layer in SRE is more suitable for processing time series data. The recurrent cell of LSTM contains input gate, forget gate, and output gate. The three gates make the recurrent neuron able to store long-term tendency of the input time series data and extract useful short-term tendency at the same time. This was also confirmed in the work of Li and Peng [11]. However, CNN do not have a recurrent neuron in its architecture, therefore it cannot learn the temporal dependency of time series data. Hence, LSTM can more efficiently extract temporal features than CNN. By comparing the performance of air quality prediction on city scale of CNN-alone and LSTM-alone models, Qin and Yu [26] also found that CNN performance was poor in dealing with long-term sequence prediction. In addition, compared with LSTM-FC, the CNN layer in SRE can directly handle the 2D spatiotemporal matrix and extract spatiotemporal features therefrom. However, FC can only employ 1D data as input, so 2D spatiotemporal matrix must be flattened to 1D data to be processed by FC. The flattening process causes some loss of the spatiotemporal dependency among the element of 2D matrix. Therefore, the spatiotemporal information can be more fully utilized by CNN than FC, and additional deep spatiotemporal features can be extracted. By combining LSTM with CNN in SRE, the proposed LSTM-CNN model showed higher prediction accuracy than CNN and LSTM-FC.
(3) From comparison (c), LSTM-NN performed better than LSTM and APNet, especially at 13-24 h. This illustrates that weather forecast data are highly related to the future PM 2.5 concentration, especially for long-term prediction. Introducing the weather forecasts can improve prediction performance.
Compared with LSTM-CNN, which selected neighbor sites in accordance with geographical distance, LSTM-CNN-DWFD selected neighbor sites in accordance with dynamic wind field distance, and obtained higher prediction accuracy. Table 4-7 provide the comparison of the prediction accuracy between LSTM-CNN and LSTM-CNN-DWFD. On the basis of the results, both models performed worse in #ns ≤ 2 and #ns > 12 regions. The bad performance, namely, the low prediction accuracy, results from the number of neighbor stations K set as 9 in our experiments. For #ns ≤ 2, some less relevant sites were introduced into the model. Meanwhile, for #ns > 12, some high relevant sites were ignored in the model. Therefore, a more adaptive selection method can be explored to make the number of selected surrounding sites be able to be adaptive to different density of sites. Nonetheless, the results show that LSTM-CNN-DWFD performed well for 1-6, 7-12, and 13-24 h prediction tasks and all regions with different densities. Moreover, the higher the density of stations, the more important the spatial correlation, and the more significant the superiority of LSTM-CNN-DWFD. This result is explained by spatial correlation being anisotropy which is affected by wind. However, geographical distance describes that the spatial dependency is affected by distance, and it takes spatial correlation to be isotropy. In this study, the proposed dynamic wind field distance introduced wind direction into the evaluation of the distance between sites, making it more suitable to represent the spatial relations between sites than geographical distance. Consequently, the neighbor sites selected in LSTM-CNN-DWFD contributed more to the PM 2.5 concentration prediction of target station than the neighbor sites selected in LSTM-CNN. Similar trends also occurred in some studies on spatial interpolations of air pollutants. By introducing wind direction and wind speed into the evaluation of the distance between sites, Li [43] and Li [44] both improved the accuracy of spatial interpolation of pollutants. However, both of them did not discuss the applications in air quality forecasting. Therefore, in the future study, wind speed can be introduced to the definition of wind field distance, and we believe that the prediction performance of the forecasting model will be better.
Compared with six other methods, the proposed LSTM-CNN-DWFD model showed the highest prediction accuracy in forecasting hourly PM 2.5 concentration. The LSTM-CNN architecture is shown to be more effective in extracting spatiotemporal features, and dynamic wind field distance fits the spatial correlation better than geographical distance. Due to the limited sample size of the employed dataset, the performance of summer and autumn was not evaluated enough. However, as mentioned above, many related studies demonstrated that the performance of summer and autumn were better than winter [22,40]; thus, the proposed model is believed to have a higher accuracy if a longer time period is covered in the test set.

Conclusions
This study presented a site-specific forecasting model, namely, LSTM-CNN-DWFD, to predict air pollutant concentrations over the next 24 h using historical air quality, meteorological data, and weather forecasts. By combining LSTM and 2D-CNN, the proposed model simultaneously handled long-and short-term temporal trends and spatial dependency of the spatiotemporal data. Additionally, using a new KNN method, namely, KNN-DWFD, highly related neighbor stations were chosen in the model with wind effect considered. Finally, accurate and stable predictions were realized via the combination of KNN-DWFD and LSTM-CNN in LSTM-CNN-DWFD. Furthermore, through the six-fold rolling origin comparison experiments for 1-6, 7-12, and 13-24 h prediction tasks conducted on the 36 stations in Beijing, LSTM-CNN-DWFD has the highest prediction accuracy, taking RMSE, MAE, and R 2 as indicators. The following are the main findings of this study:

1.
The historical air quality and meteorological data of neighbor stations are valuable spatiotemporal data, and fully utilizing these data can considerably improve prediction accuracy. Additionally, taking weather forecasts into consideration can also help predict the future PM 2.5 concentration, especially for long-term prediction. 2.
The proposed model, namely, LSTM-CNN, can more efficiently capture the spatiotemporal features by combining local LSTM models and CNN than CNN and LSTM-FC. Hence, it exhibited better prediction performance than the other models as indicated by its low RMSE and MAE and high R 2 .

3.
We proposed a dynamic wind field distance to replace geographical distance in new KNN method-KNN-DWFD. The comparison results show that it can fit the spatial correlation better than geographical distance. LSTM-CNN-DWFD is more capable of adapting to different prediction time and density levels than LSTM-CNN, thereby providing more accurate and stable predictions as indicated by its low RMSE and MAE and high R 2 .
Future studies should focus on the following aspects: (1) Develop a method to choose the number of neighbor stations adaptively for areas with different densities of stations so that the forecasting model can fit the spatial correlations well accordingly; (2) explore a wind field distance definition that simultaneously considers the impact of wind speed and direction and not only the wind direction; (3) explore other patterns to introduce wind impact into the spatial dependency.