Urban PM2.5 Concentration Prediction via Attention-Based CNN–LSTM

Abstract: Urban particulate matter forecasting is an essential task for the early warning and control of air pollution, especially fine particulate matter (PM2.5). However, existing methods for PM2.5 concentration prediction neglect the effects of feature states at different past times on future PM2.5 concentration, and most fail to model the temporal and spatial dependencies of PM2.5 concentration simultaneously. With this in mind, we propose a deep learning-based method, AC-LSTM, which comprises a one-dimensional convolutional neural network (CNN), a long short-term memory (LSTM) network, and an attention-based network, for urban PM2.5 concentration prediction. Instead of using only air pollutant concentrations, we also add meteorological data and the PM2.5 concentrations of adjacent air quality monitoring stations as inputs to AC-LSTM. Hence, the spatiotemporal correlation and interdependence of multivariate air quality-related time-series data are learned by the CNN–LSTM network in AC-LSTM. The attention mechanism captures the degree to which feature states at different past times affect future PM2.5 concentration, and the attention-based layer automatically weighs the past feature states to improve prediction accuracy. In addition, we predict the PM2.5 concentrations over the next 24 h using air quality data from Taiyuan city, China, and compare the results with six baseline methods. To compare the overall performance of each method, the mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination ($R^2$) are used. The experimental results indicate that our method achieves the best PM2.5 prediction performance among the compared methods.


Introduction
Air pollution is a serious environmental problem that is attracting increasing attention worldwide [1]. With the rapid development of the Chinese economy and the acceleration of industrialization, urban air pollution is getting worse. As one of the main pollutants in the air, fine particulate matter (PM2.5) contains a large amount of toxic and harmful substances due to its small particle size. It not only stays in the atmosphere for a long time, but also has a long transport distance, resulting in a decrease in air visibility and seriously affecting our living environment and physical health. In response, the Chinese government established air quality monitoring stations in most cities to detect PM2.5 and other air pollutant concentrations in real time [2]. However, the expensive equipment inevitably places a significant financial burden on the government [3,4]. In addition to monitoring, there is a rising demand for the prediction of future air quality. The prediction of real-time and future PM2.5 concentration is therefore essential for air pollution control and for the prevention of health issues caused by air pollution.
Appl. Sci. 2020, 10, 1953
With the development of machine learning in recent years, artificial neural networks (ANNs), support vector regression (SVR), and other methods have been successfully applied to the prediction of air pollutant concentration. Zheng et al. [5] used the spatial features of roads, factories, and parks in the prediction area to predict the concentrations of PM10 and NO2. Li et al. [6] used SVR to predict the PM2.5 concentration of a target station using observation data from the surrounding monitoring stations. Although these methods made use of the spatial features that affect pollutant concentrations, they did not consider the temporal correlation of air pollutants or the time-delay characteristics of PM2.5.
Due to the dynamic nature of the atmospheric environment, the recurrent neural network (RNN) is especially suitable for modeling the temporal evolution of air pollutant distributions, because RNNs can handle arbitrary input sequences and can therefore learn temporal patterns [7]. Ong et al. [8] used meteorological data to predict PM2.5 concentration using an RNN. Feng et al. [9] combined random forest (RF) and an RNN to analyze and forecast the PM2.5 concentration over the next 24 h in Hangzhou, China. When there is a long time lag, however, the traditional RNN may suffer from vanishing and exploding gradients [10]. These RNN-based methods do not take full advantage of spatial features either. Additionally, the feature states at different times have different effects on future PM2.5 concentrations. The existing studies extracted only the temporal correlation features of historical data and did not consider how feature states at different past times affect air pollutants.
To tackle the aforementioned problems, we propose an attention-based convolutional neural network (CNN)–long short-term memory (LSTM) model, AC-LSTM, for predicting the PM2.5 concentrations over the next 24 h. The proposed AC-LSTM model comprises a one-dimensional CNN, an LSTM network [10], and an attention-based network. As a representative RNN, the LSTM network overcomes the vanishing and exploding gradient problems of the traditional RNN thanks to its special cell structure [10]. It can capture the spatiotemporal correlation and interdependence of air quality-related time-series data at the same time. The one-dimensional CNN extracts spatiotemporal features from air quality data and local spatial correlation features of PM2.5 concentrations among air monitoring stations. The attention mechanism has been shown to yield superior results in image recognition [11], machine translation [12], and sentence summarization [13]. Therefore, the attention mechanism [12] is applied in the AC-LSTM model to capture the degree to which past feature states at different times affect PM2.5 concentration.
The major contributions of this paper are as follows: (1) by analyzing the spatiotemporal correlation of air quality data, we propose a novel deep learning method that can capture the spatiotemporal dependency of air pollutant concentration, to predict PM2.5 concentrations in the next 24 h; (2) according to the degree to which past feature states affect PM2.5 concentration, the attention-based layer weighs the past feature states in our predictive model to improve prediction accuracy; (3) by comparing our model with six popular machine learning methods on the air pollution prediction problem, we validate the practicality and feasibility of the proposed model for PM2.5 concentration prediction.

Overview of the AC-LSTM Framework
As shown in Figure 1, the framework of our approach consists of three major parts: model input, feature extraction, and aggregation and prediction. Since PM2.5 concentration is strongly affected by spatiotemporal features, recent air pollutant concentrations, meteorological data, and the PM2.5 concentrations of all adjacent stations are stacked to construct an input tensor for the one-dimensional CNN layer. In this way, the spatiotemporal features are extracted by the CNN layer. Then, the spatiotemporal correlation is learned by the LSTM layer. Because past states at different times affect the PM2.5 concentration differently, the attention-based layer weighs the feature states of the different past hours. Finally, the weighted features are aggregated to produce the prediction.

How the model predicts the PM2.5 concentrations of the next 24 h is described in Figure 2. As shown in Figure 2, $X_t$ represents the input data of the model at time $t$ (e.g., the air quality data and meteorological data in Figure 1), $Y_{t+1}$ represents the predicted value of the PM2.5 concentration at time $t + 1$, and $k$ represents the time lag. We group the air quality data within a particular time lag to formulate different inputs (shown in the broken rectangle) for multiscale predictors, which are used to train separate models corresponding to different time intervals. The time lag of the model input indicates how many past hours of data the input contains. Each blue arrow shown in Figure 2 represents a different predictor. A separate model is trained for each hour over the next 3 h. The next 4-24 h are divided into three time intervals, i.e., 4-6, 7-12, and 13-24 h, and separate models are trained to predict the mean PM2.5 concentration during each time interval.
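The multiscale scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HORIZONS` and `make_samples` are hypothetical, the data are plain Python lists, and each predictor is assumed to receive a window of the last `lag` hourly feature vectors. Hourly predictors target a single future hour, while interval predictors target the mean PM2.5 concentration over their interval.

```python
from statistics import mean

# Horizon layout taken from the text: one predictor per hour for hours 1-3,
# then one predictor per interval for hours 4-6, 7-12 and 13-24.
HORIZONS = [(1, 1), (2, 2), (3, 3), (4, 6), (7, 12), (13, 24)]


def make_samples(features, pm25, lag, horizon):
    """Build (input window, target) pairs for one predictor (hypothetical helper).

    features: list of per-hour feature vectors X_t
    pm25:     list of per-hour PM2.5 readings at the target station
    lag:      number of past hours k in each input window
    horizon:  (first_hour, last_hour) ahead of t, inclusive
    """
    first, last = horizon
    samples = []
    # t is the index of the most recent observed hour in the window.
    for t in range(lag - 1, len(pm25) - last):
        window = features[t - lag + 1 : t + 1]
        # For hourly predictors first == last, so the mean is a single value.
        target = mean(pm25[t + first : t + last + 1])
        samples.append((window, target))
    return samples
```

Calling `make_samples` once per entry of `HORIZONS` yields six training sets, one for each separate model.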


Data Description
The spatiotemporal variation of atmospheric particulate matter is affected by various factors such as pollution emission sources and meteorological conditions [14,15]. The PM2.5 concentration is related not only to the atmospheric state and PM2.5 concentration at the previous time, but also to the PM2.5 concentration in the adjacent areas [16,17]. The air quality data used for the AC-LSTM model input consist of readings of pollutant concentrations from air quality monitoring stations and meteorological data.
In this paper, the air quality data from nine air quality monitoring stations in Taiyuan City, China, were obtained from the National Environmental Protection Bureau and the Shanxi Provincial Environmental Protection Department. The location map of Taiyuan is shown in Figure 3a, where the yellow marker represents Taiyuan. The experimental data were collected from 1 January 2014 to 25 December 2016 at an hourly rate, and the spatial locations of the air quality monitoring stations are illustrated in Figure 3b. The experimental data contain the concentrations of PM10, PM2.5, SO2, NO2, O3, and CO. The detailed characteristics of the Taiyuan air quality monitoring stations are shown in Table 1. The meteorological data include indicators such as air pressure, temperature, wind speed, humidity, and visibility.


Data Preprocessing
The collected data were preprocessed to improve data quality before data mining. Air quality and meteorological monitoring equipment can leave gaps in the collected data due to machine failure, regular inspection and maintenance, unstable transmission, bad weather, and other uncontrollable factors. Such missing values affect data mining and normally need to be removed or filled to ensure modeling performance [18]. When a data record has several missing values, we remove the record directly; when a data record has only one missing value, linear interpolation [19] is used to fill it. After that, features in the data that are described by text, such as weather (sunny, cloudy, foggy, snowy, rainy, etc.) and wind direction (north, west, east, south, northwest, northeast, etc.), are quantified. Furthermore, to accelerate the convergence of the model and reduce the impact of outliers, the features in the data records are normalized as follows:
$$f' = \frac{f - f_{\min}}{f_{\max} - f_{\min}} \tag{1}$$
where $f$ is the original feature value, $f'$ is its normalized value, $f_{\min}$ represents the minimum value, and $f_{\max}$ represents the maximum value.
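The gap filling and normalization steps can be sketched as below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, missing readings are assumed to be marked with `None`, and interpolation is assumed to run along the time axis of a single feature, filling only isolated one-hour gaps with the midpoint of the two neighbours.

```python
def fill_single_gaps(series):
    """Linearly interpolate isolated missing values (None) in a time series.

    Only gaps of exactly one sample with valid neighbours on both sides are
    filled; longer gaps stay as None so the caller can drop those records.
    """
    filled = list(series)
    for i in range(1, len(series) - 1):
        prev_value, next_value = series[i - 1], series[i + 1]
        if series[i] is None and prev_value is not None and next_value is not None:
            # Midpoint of the neighbours = linear interpolation for one step.
            filled[i] = (prev_value + next_value) / 2.0
    return filled


def min_max_normalize(values):
    """Scale values to [0, 1] using (f - f_min) / (f_max - f_min)."""
    f_min, f_max = min(values), max(values)
    if f_max == f_min:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(f - f_min) / (f_max - f_min) for f in values]
```

In practice the minimum and maximum would usually be computed on the training split only and reused for the test split, so that no test information leaks into the scaling.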

Correlation between Meteorological Features and PM2.5
PM2.5 concentrations are correlated with meteorological features, as shown in Figure 4. In Figure 4, the black dots in the boxes represent the average PM2.5 concentration, and the horizontal line in the middle of each box represents the median. When the air temperature near the ground is high, atmospheric convection is strengthened, and the concentration of pollutants can be reduced. When the air temperature near the ground is low, the atmosphere tends to form an inversion layer, which is not conducive to the diffusion of pollutants. It can be seen from Figure 4a that the concentration of PM2.5 decreases as the temperature increases, indicating that higher temperatures are conducive to the diffusion and dilution of PM2.5.
A higher relative humidity results in a weaker diffusion ability of PM2.5. Such an environment increases the hygroscopicity of pollutants and accelerates their chemical transformation, thus aggravating the degree of air pollution. As can be seen from Figure 4b, there is a correlation between PM2.5 and relative humidity: as relative humidity increases, PM2.5 concentration also increases.
The main reason for the loss of visibility is air pollution, with particulate matter having the greatest impact on visibility. Low visibility means heavy air pollution, while high visibility means light air pollution. Figure 4c shows that a higher visibility corresponds to a lower concentration of PM2.5. There is a significant correlation between PM2.5 concentration and visibility.
Weather is an important factor affecting air quality. Rain and snow wash particulate matter and other pollutants from the air, effectively purifying the air. However, fog, sandstorms, and haze increase the pollution level and reduce air quality. In addition, air pressure affects the flow of air, which affects the migration of PM2.5.
Wind direction determines the direction of migration and horizontal diffusion of atmospheric pollutants. The relationship between wind direction, wind speed, and PM2.5 is shown in Figure 5. Taiyuan is located in north-central China, with northerly and northwesterly winds prevailing. In the north and west of Taiyuan, there are a large number of factories, such as steel mills and power plants. When a northerly wind prevails in Taiyuan, a large number of pollutants are transported from north to south, aggravating the urban pollution. According to Figure 5, when the wind direction is westerly or northerly, PM2.5 concentration is relatively high. A higher wind level facilitates PM2.5 migration, which leads to more pollution downwind.


Spatiotemporal Correlation Analysis
Because of meteorological conditions, especially wind speed and wind direction, air pollutants are affected by the environment in the surrounding area. Similarly, particulate matter stays in the air for a long time and is more susceptible to the surrounding area. To analyze the spatial correlation of PM2.5 concentrations, the Pearson correlation coefficient [20] is calculated among all monitoring stations, and the results are shown in Table 2. According to the table, the correlation coefficients among most stations were greater than 0.7, except for those involving station S3. This indicates that PM2.5 concentrations are highly correlated among most stations. The correlation coefficients between S3 and the other stations are small because S3 is far away from the other stations and is located in a rural area. Thus, the spatial correlation of PM2.5 concentrations can be used to optimize the input of the model and improve the prediction performance. As shown in Figure 1, the PM2.5 concentrations of the adjacent stations are added to the model input. In the experiment, we used the PM2.5 concentrations of all stations as input because Taiyuan city has only a small number of air quality monitoring stations.
The PM2.5 concentration is highly correlated in the temporal domain and is likewise affected by the past values of other features. The autocorrelation function [21] below was used to measure the temporal correlations within the PM2.5 concentration time series at each station. The detailed formula of the autocorrelation function is as follows:
$$\rho_k = \frac{\mathrm{Cov}(y(t),\, y(t+k))}{\sigma_{y(t)}\, \sigma_{y(t+k)}} \tag{2}$$
where $\rho_k$ represents the autocorrelation coefficient when the time lag is $k$, $y(t)$ represents the PM2.5 concentration vector, $y(t + k)$ represents the PM2.5 concentration vector after $k$ hours, $\mathrm{Cov}(y(t), y(t + k))$ is the covariance of $y(t)$ and $y(t + k)$, and $\sigma_{y(t)}$ and $\sigma_{y(t+k)}$ represent the standard deviations of $y(t)$ and $y(t + k)$, respectively. The autocorrelation coefficients of all stations for different values of the time lag $k$ are shown in Figure 6. In general, the autocorrelation coefficients of all stations declined as the time lag grew: the smaller the time lag, the larger the autocorrelation coefficient. This indicates that a PM2.5 concentration closer to the current time has a stronger correlation with the PM2.5 concentration at the current time. Within one day, the PM2.5 concentrations within a lag of 15 h are strongly correlated with each other. It is worth noting that, when the time lag was 24, 48, and 72 h, the autocorrelation coefficients of each station in Figure 6 showed a temporary rise. This is very likely due to the periodic living pattern across different days in the same geographical environment and season. As PM2.5 tends to stay in the air for a long time, its concentrations in the past few hours will affect the data observed in the future. In addition to PM2.5 concentration, past weather conditions, such as wind and rain/snow, also affect PM2.5 concentration. According to the above analysis, the currently observed PM2.5 concentration is closely related to the past states. These findings can help us choose a suitable time lag for the multiscale predictors.
In general, PM2.5 concentration is strongly influenced by the spatiotemporal correlations among monitoring stations and by the past states of the prediction area. The AC-LSTM model proposed in this paper can capture the spatiotemporal relations of the variation in PM2.5 concentrations. The attention mechanism is introduced in the proposed method to weigh the past states modeled by the LSTM, which helps to measure the importance of past states at different times for PM2.5 concentrations.
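The autocorrelation coefficient used in this section is the Pearson correlation between a series and a copy of itself shifted by k hours, so both analyses can share one helper. The sketch below is illustrative only: the function names are hypothetical and the series is assumed to be a gap-free list of hourly readings.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / sqrt(var_a * var_b)


def autocorrelation(series, k):
    """Correlation between y(t) and y(t + k) for a time lag of k hours."""
    if k == 0:
        return 1.0
    return pearson(series[:-k], series[k:])
```

Applying `pearson` to the PM2.5 series of two stations gives the kind of spatial coefficient reported in Table 2, and evaluating `autocorrelation` over a range of lags gives a curve of the kind shown in Figure 6. A series with a daily cycle produces local peaks at lags of 24, 48, and 72 h.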

Convolutional Neural Network
The CNN has excellent performance in image processing [22], and it can be effectively applied to time series analysis [2]. The CNN's local perception and weight sharing features can reduce the number of parameters for processing multivariate time series, thereby improving the learning efficiency [2]. Spatiotemporal features can be easily extracted from the model input by the one-dimensional CNN (1D-CNN). Let the given model input be $X = [x_1, x_2, \cdots, x_t]$, consisting of meteorological data, pollutant concentrations, and PM2.5 concentrations at adjacent stations in the past. Firstly, the model input $X$ is fed to the 1D-CNN layer; hence, we have
$$l_t = \tanh(x_t * k_t + b_l) \tag{3}$$
where $x_t$ represents the input vector, $k_t$ is the convolution kernel, $*$ denotes the convolution operation, $b_l$ represents the bias vector, and $l_t$ is the output vector of the 1D-CNN layer. The output of the 1D-CNN layer is a spatiotemporal feature matrix $L = [l_1, l_2, \cdots, l_t]$.
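To make the 1D-CNN step concrete, the sketch below slides a single filter along the time axis of a multivariate series. It is a simplified illustration under several assumptions that are not taken from the paper: the function name `conv1d` is hypothetical, only one filter is applied, no padding is used, the activation is assumed to be tanh, and, as in most deep learning libraries, the kernel is not flipped (strictly a cross-correlation).

```python
from math import tanh


def conv1d(inputs, kernel, bias):
    """Apply one 1D filter over the time axis, followed by tanh.

    inputs: list of T feature vectors, each of length F
    kernel: list of W weight vectors, each of length F (a single filter)
    bias:   scalar bias of this filter
    Returns T - W + 1 activations (no padding).
    """
    width = len(kernel)
    outputs = []
    for t in range(len(inputs) - width + 1):
        total = bias
        # The same weights are reused at every position t (weight sharing).
        for j in range(width):
            total += sum(x * w for x, w in zip(inputs[t + j], kernel[j]))
        outputs.append(tanh(total))
    return outputs
```

Because the kernel holds only W × F weights regardless of the series length, the parameter count stays small, which is the efficiency benefit of local perception and weight sharing mentioned above. A full layer would apply many such filters and stack their outputs into the feature matrix.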

LSTM Network
As a special kind of RNN, the LSTM network [10] is capable of learning long-term dependencies. It has the advantage of connecting previous information to the present task [23,24]. Because of its special memory cell architecture, the LSTM network overcomes the defects of the traditional RNN, especially the problems of vanishing and exploding gradients. The architecture of an LSTM memory cell is shown in Figure 7, where each cell has three "gate" structures, namely, the input gate, the forget gate, and the output gate. A chain of repeating cells forms the LSTM layer. The calculation process of the spatiotemporal feature matrix $L = [l_1, l_2, \cdots, l_t]$ in the LSTM layer is given in Equations (4)-(9).
$$f_t = \sigma(W_f \cdot [h_{t-1}, l_t] + b_f) \tag{4}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, l_t] + b_i) \tag{5}$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, l_t] + b_c) \tag{6}$$
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \tag{7}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, l_t] + b_o) \tag{8}$$
$$h_t = o_t * \tanh(c_t) \tag{9}$$
where $W_f$, $W_i$, $W_c$, and $W_o$ denote the weights of the forget gate, the input gate, the candidate cell state, and the output gate, respectively, whereas $b_f$, $b_i$, $b_c$, and $b_o$ are the corresponding bias vectors, and $\sigma$ denotes the sigmoid activation function. Equation (4) represents the forget gate, which decides what information should be thrown away from the cell state, where $f_t$ denotes the output of the forget gate. Equations (5) and (6) represent the input gate, which decides what new information should be stored in the cell state, where $i_t$ and $\tilde{c}_t$ denote the outputs of the input gate; Equation (7) then updates the cell state, where $c_t$ denotes the activation vector of the current cell. Equations (8) and (9) represent the output gate, where $o_t$ denotes the output of the output gate, $h_{t-1}$ is the hidden state of the previous cell, and $h_t$ is the hidden state of the current cell. The feature state matrix $H = [h_1, h_2, \cdots, h_t]$ is the output of the LSTM layer.
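A single update of the memory cell can be written out directly from the gate descriptions. The sketch below assumes the standard LSTM formulation in which every gate acts on the concatenation of the previous hidden state and the current input; the names `lstm_step`, `affine`, and the `params` layout are hypothetical choices made for illustration, and real models would use a library layer instead.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def affine(weights, vector, bias):
    """Return W · vector + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(l_t, h_prev, c_prev, params):
    """One LSTM cell update.

    params maps "f", "i", "c", "o" to (W, b) pairs for the forget gate,
    input gate, candidate cell state and output gate, respectively.
    """
    z = h_prev + l_t  # list concatenation [h_{t-1}, l_t]

    def gate(name, activation):
        weights, bias = params[name]
        return [activation(a) for a in affine(weights, z, bias)]

    f_t = gate("f", sigmoid)    # what to discard from the cell state
    i_t = gate("i", sigmoid)    # how much new information to admit
    c_hat = gate("c", tanh)     # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = gate("o", sigmoid)    # what part of the cell state to expose
    h_t = [o * tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Iterating `lstm_step` over the rows of the feature matrix and collecting each returned hidden state yields the feature state matrix that is passed on to the attention layer.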

Attention Layer
The attention mechanism [12] allows the model to capture the most important parts of the PM2.5 concentration when different features of past states are considered. In order to take advantage of the information of the past states, an attention-based layer is added to the LSTM layer in the proposed AC-LSTM model. It ranks the importance degrees of different feature states in the past as follows, where $H = [h_1, h_2, \cdots, h_t]$ is the feature state matrix in the attention layer:
$$u_t = \tanh(W h_t + b) \tag{10}$$
$$\alpha_t = \frac{\exp(u_t^{\top} v)}{\sum_{t} \exp(u_t^{\top} v)} \tag{11}$$
$$s = \sum_{t} \alpha_t h_t \tag{12}$$
where $W$ and $b$ are the weight matrix and bias of the projection, $u_t$ and $v$ denote the projection vectors, $\alpha_t$ is the normalized attention weight of $h_t$, and $s$ is the weighted output vector of the attention layer.
According to the importance of each vector in the feature state matrix $H$, Equations (10) and (11) calculate the normalized weight of each vector, and Equation (12) gives the weighted vector $s$. In this way, the importance of the feature states at different times is measured. Finally, the weighted vector $s$ passes through a fully connected layer to obtain the PM2.5 concentration of the prediction task.
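The weighting and aggregation performed by the attention layer can be sketched as follows. This is a hedged illustration: the function name `attention_pool` is hypothetical, the projection vectors are assumed to be computed already and are passed in directly, and the normalization is assumed to be a softmax over the scores between each projection vector and the vector v.

```python
from math import exp


def attention_pool(projections, states, v):
    """Weigh hidden states by softmax-normalized scores and sum them.

    projections: list of projection vectors u_t, one per time step
    states:      list of hidden state vectors h_t, one per time step
    v:           vector scored against each u_t
    Returns (attention weights, weighted output vector s).
    """
    scores = [sum(a * b for a, b in zip(u_t, v)) for u_t in projections]
    # Subtracting the maximum keeps exp() numerically stable and does not
    # change the resulting softmax weights.
    top = max(scores)
    exps = [exp(score - top) for score in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(states[0])
    s = [sum(alpha * h[d] for alpha, h in zip(alphas, states)) for d in range(dim)]
    return alphas, s
```

The weights are non-negative and sum to one, so the output is a convex combination of the hidden states; time steps with larger scores contribute more to the vector handed to the fully connected layer.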


Results and Discussion
The collected dataset is divided into two parts: the data of the first 28 months are used to train the models, and the data of the last 8 months are used to test them and to benchmark the proposed model against the others. The mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R²) are used as evaluation metrics for the different models in this paper.
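For reference, the three metrics can be computed as in the following sketch, which uses their standard definitions. The function names and the sample values are illustrative only and are not taken from the experiments.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted PM2.5 values
observed = [35.0, 50.0, 80.0, 120.0]
predicted = [40.0, 45.0, 85.0, 110.0]

mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
r2_value = r2(observed, predicted)
```

MAE and RMSE are in the units of the target, with RMSE penalizing large errors more heavily, while R² is unitless and approaches 1 as the predictions approach the observations.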

Experimental Set-Up
This section describes the hardware and software environment of the experiment and the configuration of hyperparameters [2]. The code for all the prediction methods in this paper was written in Python. Our model and the other deep learning comparison models were implemented with Keras, an open-source deep learning library based on TensorFlow. All experiments were conducted on a server with two NVIDIA GTX 1080Ti graphics processing units (GPUs) and an Intel Xeon E5 central processing unit (CPU).
There are several hyperparameters in the AC-LSTM prediction model, including the time lag, the number of LSTM layers, the number of nodes in each LSTM layer, and the learning rate. They need to be preset before the model structure is built. We determined the optimal value of each hyperparameter through experiments, holding all other parameters unchanged.
In the end, we built our model structure using four LSTM layers, with 800 nodes in each LSTM layer. The learning rate was 0.0001 in all experiments. This setting outperformed the others in our experiments.
The time lag is one of the most important hyperparameters. It determines the number of past hours used in the model input and is necessary for multiscale prediction tasks. We therefore evaluated the performance of the model with different time lags in order to find the optimal one. For each time lag, we predicted the PM 2.5 concentrations of all stations in the training set in the next hour. The calculated MAE and RMSE are compared in Table 3. The RMSE of the model was lowest when the time lag was 10, whereas the MAE was lowest when the time lag was 14. According to the analysis in Section 3.4 and previous studies on RNNs [10], if the time lag is too small, the temporal correlation between time-series data cannot be fully learned and the prediction accuracy decreases. However, a large time lag may lead to a longer training time and unnecessary noise. As a result, the time lag in our model was set to 12, which lies between these two optima, for the one-hour prediction task. For prediction tasks at other time scales, the optimal lag can be found through experiments in a similar way.
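To make the role of the time lag concrete, the sketch below turns an hourly series into supervised samples, each using the previous `time_lag` hours to predict a value `horizon` hours ahead. The helper `make_windows` and the toy series are hypothetical and are not part of the original implementation. For brevity the series is univariate, whereas the model input described in this paper is multivariate.

```python
def make_windows(series, time_lag, horizon=1):
    """Split an hourly series into (past window, future target) pairs.

    Each input holds `time_lag` consecutive hours; the target is the
    value `horizon` hours after the end of that window.
    """
    inputs, targets = [], []
    last_start = len(series) - time_lag - horizon
    for start in range(last_start + 1):
        end = start + time_lag
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets

# Toy hourly series of 20 values (illustrative only)
hourly_pm25 = [float(v) for v in range(20)]

# Time lag of 12 hours, predicting the next hour
X, y = make_windows(hourly_pm25, time_lag=12, horizon=1)
```

With 20 hourly values, a lag of 12, and a one-hour horizon, this yields 8 samples. A larger lag lengthens every input window and reduces the number of samples that can be cut from the same series.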

Effects of Different Features
The input of our model was composed of three types of features: pollutant concentration (F_p), meteorological data (F_m), and PM 2.5 concentrations of adjacent monitoring stations (F_a). To evaluate the effectiveness of the different features in the proposed AC-LSTM model, we conducted experiments with different combinations of features and computed the errors on the multiscale prediction tasks. Because Taiyuan has only a small number of monitoring stations, we used PM 2.5 concentrations from all stations rather than only from adjacent stations. The effects of the various features in AC-LSTM are shown in Tables 4 and 5. As can be seen, gradually adding features generally improved the prediction accuracy of the model. The model with all three types of features as input had the best overall performance, although it did not achieve the lowest MAE in the next 1 h and 13-24 h prediction tasks. This shows that the past feature states and the PM 2.5 concentrations of adjacent monitoring stations can help predict the PM 2.5 concentration.
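The feature combinations can be assembled as in the following sketch, which concatenates the selected feature groups hour by hour. The helper `build_inputs`, the order in which groups are added, and all values and variable choices within each group are assumptions made for illustration.

```python
def build_inputs(groups, selected):
    """Concatenate the selected feature groups hour by hour."""
    n_hours = len(groups[selected[0]])
    rows = []
    for t in range(n_hours):
        row = []
        for name in selected:
            row.extend(groups[name][t])
        rows.append(row)
    return rows

# Two hours of hypothetical data per feature group
groups = {
    "F_p": [[35.0, 12.0], [40.0, 14.0]],   # e.g. PM2.5 and one other pollutant
    "F_m": [[3.2, 180.0], [2.8, 170.0]],   # e.g. wind speed and wind direction
    "F_a": [[33.0, 38.0], [41.0, 39.0]],   # e.g. PM2.5 at two other stations
}

# Assumed ablation order: add one feature group at a time
ablations = [("F_p",), ("F_p", "F_m"), ("F_p", "F_m", "F_a")]
widths = {combo: len(build_inputs(groups, combo)[0]) for combo in ablations}
```

Each combination produces an input of a different width, so the model's input layer has to be rebuilt for every experiment in the comparison.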

Model Convergence
After setting appropriate model parameters, it was necessary to verify whether AC-LSTM converges during training. Therefore, the training loss of the AC-LSTM model in the one-hour PM 2.5 prediction task was calculated, as shown in Figure 8, and the results were compared with three other deep learning methods (simple RNN, LSTM, and CNN-LSTM). The parameters of all models in Figure 8 were the same, and the mean square error (MSE) after data normalization was used as the loss function for training. It can be seen from Figure 8 that all models had converged by epoch = 20. In Figure 8a, after 20 epochs in the one-hour prediction task, the MSE losses of the three models were close, but the MSE loss of the AC-LSTM model was slightly smaller than those of the other two models, LSTM and CNN-LSTM. At epoch = 1, the MSE loss of the simple RNN model in Figure 8b was nearly 100 times greater than that of the three models in Figure 8a. In addition, at epoch = 80, the loss value of the simple RNN model was 0.00154, while none of the three models in Figure 8a had a value greater than 0.0015. The three models in Figure 8a (LSTM, CNN-LSTM, and AC-LSTM) therefore converged better, which we attribute to their special memory cell architecture.

Model Comparison
To verify the feasibility and efficacy of the proposed model, we compared our AC-LSTM model with six state-of-the-art models: support vector regression (SVR) [6], random forest regression (RFR) [9], multilayer perceptron (MLP) [25], simple RNN [9,26], LSTM [27], and CNN-LSTM [28]. Using the same training and testing datasets for all models, we predicted the PM 2.5 concentrations of all stations at different time scales for performance evaluation. We selected an appropriate time lag and hyperparameters for each prediction scale in our AC-LSTM model in the same way as described above. Furthermore, each experiment was repeated five times, and the averaged results were used for comparison, as shown in Tables 6 and 7.
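Averaging over repeated runs can be sketched as follows. The scores are hypothetical placeholders, not results from this paper, and the standard deviation is an addition not reported here, shown only because it indicates how much the runs vary.

```python
from statistics import mean, stdev

def summarize_runs(run_scores):
    """Return {model: (mean, sample standard deviation)} over repeated runs."""
    return {model: (mean(scores), stdev(scores))
            for model, scores in run_scores.items()}

# Hypothetical MAE values from five repeated runs of two models
run_scores = {
    "model_a": [10.0, 11.0, 12.0, 11.0, 11.0],
    "model_b": [9.0, 9.5, 10.0, 9.5, 9.5],
}
summary = summarize_runs(run_scores)
```

Repeating each experiment reduces the influence of random weight initialization and batch ordering on the reported comparison.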
The prediction results from our approach and the six others in terms of MAE and RMSE are compared in Tables 6 and 7, from which several observations can be made. Firstly, the performance of all models gradually deteriorated as the prediction horizon became longer. The detailed comparison results of each model for the different scale prediction tasks are shown in Figures A1-A4 in Appendix A. From Figures A1-A4, it is more obvious that the prediction accuracy of the three models (SVR, RFR, and MLP) worsened as the prediction horizon became longer. The lack of sufficient and directly relevant input data makes it difficult to predict PM 2.5 concentrations for longer future periods. Secondly, the performance of the four deep learning methods, i.e., simple RNN, LSTM, CNN-LSTM, and AC-LSTM, was much better than that of the three traditional shallow learning methods, SVR, RFR, and MLP, particularly for horizons beyond one hour. As can be seen from Tables 6 and 7, the MAE and RMSE of the four deep learning methods were relatively low. The predicted values of the four models on the multiscale prediction tasks were also closer to the observed values in Figures A1-A4.
Thirdly, as can be seen from Figure A1 and the tables, the prediction accuracy of the three non-deep learning models on the one-hour prediction task was comparable to that of the four deep learning models. However, according to the goodness-of-fit plots for all models in Figure A5, the predicted value distributions of the three models were relatively dispersed, and their R² values were lower than those of the four deep learning models. The predicted value distributions of the four deep learning models were close to the 45-degree line (y = x). On the one hand, this shows the limitation of conventional approaches; on the other hand, it demonstrates the superior performance of the deep learning models in modeling long-term dependency for effective prediction of future PM 2.5 concentrations. The reason is that the three traditional shallow models cannot process time-series data and fail to learn the temporal correlation of air pollutants. By contrast, simple RNN is able to predict PM 2.5 concentrations over the next 24 h. Compared to simple RNN, the three models of LSTM, CNN-LSTM, and AC-LSTM yield further improvements by overcoming the defects of the conventional RNN.
Furthermore, according to the tables, the MAE and RMSE of the AC-LSTM model were the lowest among all benchmarked models, except for the MAE in the 13-24 h prediction task. The predicted values of the AC-LSTM model on the multiscale prediction tasks were closer to the observed values in Figures A1-A4. Moreover, the R² of the AC-LSTM model in the one-hour PM 2.5 prediction task in Figure A5 was the highest. After adding the attention mechanism, AC-LSTM outperformed LSTM and CNN-LSTM in the multiscale prediction tasks. The results show that the proposed AC-LSTM model can effectively learn the spatiotemporal correlation of air pollutants and that it is suitable for predicting future urban PM 2.5 concentrations.
However, our study has several limitations. Emissions have a significant impact on air quality. Since emission data are difficult to obtain, the data collected in this paper do not include emissions from factories and vehicles in the area. This affects the prediction accuracy of our model. Moreover, when a sudden pollution accident occurs, the PM 2.5 concentration changes abruptly. Whether the proposed model can predict such changes well still needs to be demonstrated.

Conclusions and Future Work
In this paper, we propose an attention-based CNN-LSTM model to predict urban PM 2.5 concentrations over the next 24 h. By taking the pollutant concentrations in air quality data, meteorological data, and PM 2.5 concentrations of adjacent monitoring stations as the input, the model can learn the spatiotemporal correlation and long-term dependence of PM 2.5 concentrations. At the same time, the attention mechanism can capture the importance of the feature states at different past times and further improve the prediction accuracy of the model. The experimental results show that the AC-LSTM model improved performance in the multiscale prediction tasks. The main conclusions of this paper are as follows:
1. The analysis of air quality data shows that PM 2.5 concentration has a strong spatiotemporal correlation. Due to the air flow, the PM 2.5 concentration in the predicted area can be easily affected by the PM 2.5 concentrations of the adjacent monitoring stations. As PM 2.5 stays in the air for a long time, the past feature states also affect future PM 2.5 concentration. This motivated the design of a spatiotemporal model for effective prediction of PM 2.5 concentrations.
2. The experimental results indicate that, compared with using only the pollutant concentrations of the air monitoring stations, adding the meteorological data and the PM 2.5 concentrations of the adjacent monitoring stations can improve the prediction accuracy of the model, especially for prediction tasks on time scales over one hour.
3. The proposed AC-LSTM model can be applied to multiscale predictors at different time gaps. Compared with traditional machine learning methods, such as SVR, MLP, and RFR, its prediction accuracy was improved significantly, especially in predicting the PM 2.5 concentrations beyond a gap of one hour. Compared with deep learning methods, such as simple RNN, LSTM, and CNN-LSTM, AC-LSTM produced improved predictions with lower MAE and RMSE owing to the attention mechanism introduced into the LSTM model.
Although the proposed model can support the multiscale prediction of PM 2.5 concentrations in the temporal domain, in the future we will also explore its extension to multiscale prediction in the spatial domain. In addition, the model will be extended to predict other pollutants. Last but not least, sensing data, especially satellite data, will be utilized for large-scale prediction of PM 2.5 concentrations and other pollutants, for early warning of air pollution and the protection of people's health.

Figure 1. Framework of the proposed approach.

Figure 2. Illustration of the multiscale predictors.

Figure 3. (a) The location map of Taiyuan City; (b) distribution of air quality monitoring stations in Taiyuan City.

Figure 5. Relationship between wind direction, wind speed, and PM 2.5.

Figure 6. Comparison of the autocorrelation coefficients at different time lags for different stations.

... represent the input gate, which decides what new information should be stored in the cell state, where i_t and c̃_t denote the output of the input gate, and c_t denotes the activation vector of the current cell. Equations (8) and (9) represent the output gate, where o_t denotes the output of the output gate. h_{t-1} is the hidden state of the last cell, and h_t is the state of the current cell. The feature state matrix of the LSTM layer.

Figure 8. The loss convergence of deep learning methods in one-hour PM 2.5 prediction: (a) loss convergence of LSTM, convolutional neural network (CNN)-LSTM, and AC-LSTM models; (b) loss convergence of simple recurrent neural network (RNN) model.

Table 1. Characteristics of Taiyuan air quality monitoring stations. N: north; E: east.

Table 2. Correlation coefficients of PM 2.5 among all monitoring stations.

Table 3. Effect of different time lags. MAE: mean absolute error; RMSE: root-mean-square error.

Table 5. The root-mean-square error (RMSE) of various features in AC-LSTM.

Table 6. The performances of the different models in terms of mean absolute error (MAE). SVR: support vector regression; RFR: random forest regression; MLP: multilayer perceptron.

Table 7. The performances of the different models in terms of root-mean-square error (RMSE).