A Long Short-Term Memory (LSTM) Network for Hourly Estimation of PM 2.5 Concentration in Two Cities of South Korea

Featured Application: Forecasting particulate matter smaller than 2.5 µm (PM 2.5) in big cities is a major challenge for the scientific community. In addition to its environmental impacts, particulate matter causes various diseases, such as cardiopulmonary disease, stroke, lung cancer, and even neurological disorders. Forecasting high PM 2.5 events helps to raise awareness so that people can take precautionary measures, such as limiting outdoor activities and using masks. In the future, advanced Machine Learning (ML) based PM 2.5 forecasting will help to reduce the cost of PM 2.5 sampling, such as the sampler and equipment costs needed to measure the concentration of particulate matter in air.

Abstract: Air pollution not only damages the environment but also leads to various illnesses, such as respiratory tract and cardiovascular diseases. Estimating the concentration of air pollutants is becoming very important so that people can prepare in advance for the hazardous impact of air pollution. Various deterministic models have been used to forecast air pollution. In this study, along with various pollutants and meteorological parameters, we also use the concentrations of pollutants predicted by the community multiscale air quality (CMAQ) model, which are strongly related to PM 2.5 concentration. After combining these parameters, we implement various machine learning models to forecast the hourly PM 2.5 concentration in two big cities of South Korea and compare their results. The results show that the Long Short-Term Memory (LSTM) network outperforms well-known gradient tree boosting models as well as other recurrent and convolutional neural networks.


Introduction
The industrial revolution and modernization have led us to a new era of science and technology. On the one hand, it has opened new horizons for transportation, trade, mining, agriculture, and urbanization. On the other hand, it has become a vital factor in polluting air, soil, and water. In the last two decades, many environmental researchers have been monitoring the quality of ambient air. Particulate matter (PM) is found to be the most dangerous kind of air pollution among various other air pollutants. After a study done by the World Health Organization (WHO) and the International

Related Work
Time series forecasting is an important class of ML regression problems, and both shallow and deep learning (DL) models have been used for this purpose. Tree-based models such as decision trees, random forests (RF) [7], and gradient tree boosting models are known to perform well and have been widely used in supervised ML. They can map non-linear relationships in the data, unlike linear ML models such as linear regression [8] and support vector machine (SVM) [9]. The RF model has been used to study the impact of various factors on pollutant concentrations by utilizing meteorological parameters, pollutant concentrations, and traffic flow [10]. XGBoost [11], introduced by Chen, T. and Guestrin, C., is an ensemble of boosted decision trees that uses gradient descent for model optimization and has been widely used in regression [12], classification [13], and time series forecasting [14]. XGBoost was implemented to predict PM 2.5 concentration in [15], where the author analyzed the data of one station in China and compared the results with RF, SVM, Multiple Linear Regression (MLR) [16], and Decision Tree Regression (DTR) algorithms [17]. The input variables used in that research were pollutant concentrations such as PM 10, CO, NO 2, SO 2, and O 3; among all the models, XGBoost showed the best results. LightGBM [18] also belongs to the gradient tree boosting models; it grows each decision tree leaf-wise, splitting the leaf with the best fit, which reduces the loss and improves accuracy. Similarly, XGBoost and LightGBM models have been used to predict thermal power energy development [19], where the latter showed a lower Mean Absolute Percentage Error (MAPE) on the dataset. Along with shallow ML models, DL models are now commonly used and have been applied successfully to pollutant forecasting [20].
In recent studies [21,22], the LSTM model was used to predict PM 10 and PM 2.5 concentrations from pollutant concentrations and meteorological parameters. The authors compared the results with the Community Multi-scale Air Quality (CMAQ) model [23] and found that the DL-based model performs better. CNNLSTM is a variant of the LSTM model in which a CNN [24] extracts features that are then fed to the LSTM to produce the forecast; such models are used in various time series prediction problems [25,26]. Huang, C. J. [27] used only three meteorological parameters (wind speed, wind direction, and precipitation) to predict PM 2.5 concentrations. Their proposed model, which they named "APNet" (a combination of CNN and LSTM), showed good results against SVM, DTR, RF, MLP, CNN, and LSTM. In a recent study [28], the authors proposed a novel CNNLSTM model with an attention mechanism. Along with pollutant concentrations and meteorological parameters, they also utilized information from the nearest stations to capture spatial dependencies. GRU [29] is also a type of RNN and a variant of LSTM with fewer gates, which makes the model faster. It has also been adopted in many time series forecasting problems. In [30], GRU is used to estimate primary energy consumption in China, and the results are compared with SVM and MLR, with GRU giving good prediction accuracy. Similarly, a combination of Bidirectional and Unidirectional LSTM (BiULSTM) was used for PM 10 forecasting by Yun, J. [31], who compared it against SVM and MLR and found that BiULSTM provided better predictions than the other methods. In that study, the input features were the concentrations of pollutants (SO 2, CO, NO 2, and O 3), the meteorological parameters, and the PM 10 concentration of the nearest stations.
The input features play an important role in the predictions of any machine learning model, and the models' performance can be improved by using background knowledge of the parameters that are vital in the formation of PM particles. In our study, we used meteorological parameters and pollutant concentrations that strongly influence PM 2.5 formation, collected from ground-based monitoring sites, together with predictions from the CMAQ model.

Methodology
In this section, we discuss how the study was conducted. Before feeding the data to the ML models, we carried out data collection, feature correlation analysis, and data preprocessing. After that, each model was constructed and optimized by tuning its hyperparameters. The models were then trained, and predictions were generated on a test dataset. Finally, each model was evaluated using statistical evaluation metrics. The process of this study is shown in Figure 1. Section 3.1 contains the description and preprocessing of the data. Section 3.2 describes the architecture of the LSTM network. The experimental process of setting up the models is described in Section 3.3. The evaluation metrics and their formulas are discussed in Section 3.4.

Data and Preprocessing
The dataset contains meteorological parameters, measured pollutant concentrations from ground-based stations, and the values of four pollutants predicted by the CMAQ model in South Korea, recorded hourly from 1 January 2016 to 31 December 2016. Six ground-based pollutant observations were collected: PM 2.5, PM 10, sulfur dioxide, ozone, nitrogen dioxide, and carbon monoxide concentrations, measured in µg/m³. They are available at the Air-Korea website [32]. Six meteorological parameters (temperature, wind speed, relative humidity, surface roughness, planetary boundary layer, and precipitation) were taken from the Korean public data website [33]. PM 2.5 has a strong correlation with pollutants such as elemental carbon, nitrate, and ammonium, as described in various studies [34,35]. Ground-based sites do not measure these pollutants, but the CMAQ model is able to predict them. The CMAQ data were predicted and provided by the Air Lab at Gwangju Institute of Science and Technology [36] for the same period. The CMAQ predictive feature labels are CMAQ_EC, CMAQ_ANO3, CMAQ_ANH4, and CMAQ_OTHR, measured in µg/m³. To check the models' performance, we selected data from four sites in Seoul and four sites in Gwangju (a city located south of Seoul). The average evaluation results over all stations for each model, with and without CMAQ data, are given in Section 4; they show that including CMAQ features gives better prediction results.
It is necessary to analyze the relationship between PM 2.5 and the other features. For this purpose, a heat map is provided in Figure 2. Variables with a higher correlation with PM 2.5 concentration are shown in dark red, while variables with a lower correlation are shown in light pink. The correlation of PM 2.5 with the pollutants, from highest to lowest, is: PM 10 > ammonium = nitrate ions > carbon monoxide > other pollutants > nitrogen dioxide > elemental carbon > sulfur dioxide. Ozone and the meteorological parameters are negatively correlated with PM 2.5 concentration. The order of the negatively correlated features, from highest to lowest, is: relative humidity > surface roughness > precipitation > wind speed > ozone > planetary boundary layer > temperature. The distribution of each feature is shown in the histograms in Figure 3. There are 8727 records for each station, of which 7680 were selected for training and 1023 for testing the models (approximately a 9:1 train-to-test ratio). Missing values were imputed by linear interpolation; records from 1 January 2016 to 15 November 2016 were used for training and records from 16 November to 31 December for testing. The inputs of the models are the hourly observations of the 16 selected features discussed above over the last 24 h, and the output (label) is the PM 2.5 concentration forecast for the next 1 h. The train and test periods are separate and do not overlap. For each prediction model, all training was done on the train dataset, while validation and evaluation were done on the test dataset. We used two gradient tree boosting models, namely extreme gradient boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), and reshaped the data to be appropriate for time series forecasting.
Four widely used deep learning models were also used: Long Short-Term Memory (LSTM), a combination of Bidirectional and Unidirectional LSTM (BiULSTM), Gated Recurrent Unit (GRU), and Convolution LSTM (CNNLSTM). The results were compared by calculating their respective Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Correlation Coefficient (R), and Index of Agreement (IA), which are defined in Section 3.4.
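The input and output layout described above (the last 24 hourly observations of 16 features as input, the next-hour PM 2.5 value as label) can be sketched as a sliding window over the hourly records. This is a minimal illustration rather than the authors' code: the function names `make_windows` and `chronological_split` and the synthetic array are assumptions made for the example.

```python
import numpy as np

def make_windows(features, target, lookback=24):
    """Turn hourly records into (samples, lookback, n_features) inputs
    and next-hour labels. `features` is (n_records, n_features) and
    `target` is (n_records,), both ordered in time."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples = len(features) - lookback
    if n_samples <= 0:
        raise ValueError("need more than `lookback` records")
    # Window i covers hours i .. i+lookback-1; its label is hour i+lookback.
    X = np.stack([features[i:i + lookback] for i in range(n_samples)])
    y = target[lookback:]
    return X, y

def chronological_split(X, y, n_train):
    """Split without shuffling so the test period follows the train period."""
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])

# Synthetic stand-in for one station: 100 hourly records, 16 features,
# with column 0 playing the role of PM 2.5 (an assumption for the demo).
records = np.random.default_rng(0).random((100, 16))
X, y = make_windows(records, records[:, 0], lookback=24)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y, n_train=60)
```

Keeping the split chronological, rather than random, matches the statement that the train and test periods do not overlap.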
Before training deep learning models, it is recommended to normalize the data, so all input features were scaled between 0 and 1. The formula for scaling the data is given in Equation (1):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where $x$ is the original value of a feature and $x_{\min}$ and $x_{\max}$ are its minimum and maximum values. After training the models, we re-scaled the predictions back to their original range to obtain the prediction results. We also included the observations from high fine dust periods, which usually occur in the spring and winter seasons [37], in our training data so that we could observe how well our models predict high dust concentration values.
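The scaling to the range 0 to 1 and the later re-scaling of predictions can be sketched as follows. The helper names are hypothetical, and fitting the minimum and maximum on the training data only is an assumption of this sketch; the text does not state which records were used to compute them.

```python
import numpy as np

def fit_minmax(train):
    """Per-feature minimum and maximum, computed on the training data
    (assumption: statistics come from the train period only)."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def scale(x, lo, hi):
    """Min-max scaling: (x - min) / (max - min)."""
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (np.asarray(x, dtype=float) - lo) / span

def unscale(x_scaled, lo, hi):
    """Inverse transform, used to bring predictions back to original units."""
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.asarray(x_scaled, dtype=float) * span + lo

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
lo, hi = fit_minmax(train)
scaled = scale(train, lo, hi)
restored = unscale(scaled, lo, hi)
```

When only the PM 2.5 prediction needs re-scaling, the same `unscale` call can be applied with the minimum and maximum of the PM 2.5 column alone.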

LSTM Network
The LSTM network [22], introduced in 1997 by Hochreiter, S. and Schmidhuber, J., uses a cell state together with input, output, and forget gates to retain long-term dependencies and overcome the vanishing gradient problem of typical RNNs. The LSTM processes the data sequentially, passing information on as it propagates forward. The operations within the LSTM allow it to keep or forget information. The architecture of the LSTM model is shown in Figure 4. The cell state, shown as a horizontal line, runs through the entire network, and information can be added to it or removed from it with the help of gates. The role of the cell state is to carry information through the sequence; in theory, information from earlier time steps can be carried all the way to the last time step, thus reducing the effect of short-term memory. As processing goes on, information is added to or removed from the cell state via the gates, which decide what information is allowed on the cell state. The first gate, the forget gate, learns which information should be kept or forgotten; like the other gates, it contains a sigmoid function. The sigmoid function generates numbers between zero and one, describing how much of each component should be let through. The tanh function generates a new candidate vector, which is added to the state. The cell state is updated based on the outputs generated by the gates.
The sigmoid function is given in Equation (2), and Equations (3)-(8) represent the flow of information at each gate and the cell state of the LSTM network. Here, f_t, i_t, and o_t represent the outputs generated by the forget gate, input gate, and output gate, respectively. W_f, W_i, W_C, and W_o are the corresponding input weights, b_f, b_i, b_C, and b_o are bias terms, and H_t is the output of the LSTM network.
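For reference, the standard LSTM formulation that matches the symbols defined above can be written as follows. This is a sketch under the assumption that the network follows the usual formulation. The symbols $x_t$ (input at time step $t$), $H_{t-1}$ (previous output), $C_{t-1}$ and $C_t$ (previous and current cell state), $\tilde{C}_t$ (candidate state), the concatenation $[H_{t-1}, x_t]$, and the elementwise product $\odot$ are notation assumed here, and the equation numbers are inferred from the references in the text.

```latex
\begin{align}
\sigma(x) &= \frac{1}{1 + e^{-x}} \tag{2}\\
f_t &= \sigma\left(W_f \cdot [H_{t-1}, x_t] + b_f\right) \tag{3}\\
i_t &= \sigma\left(W_i \cdot [H_{t-1}, x_t] + b_i\right) \tag{4}\\
\tilde{C}_t &= \tanh\left(W_C \cdot [H_{t-1}, x_t] + b_C\right) \tag{5}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{6}\\
o_t &= \sigma\left(W_o \cdot [H_{t-1}, x_t] + b_o\right) \tag{7}\\
H_t &= o_t \odot \tanh(C_t) \tag{8}
\end{align}
```

Because $f_t$ lies between zero and one, the first term of the cell state update scales down the old state before the gated candidate is added, which is the keep-or-forget behaviour attributed to the forget gate.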

Experimental Set-Up
All models were implemented in Python 3.6.7 and were trained and tested on a computer with an Intel Core i7-8700 CPU (3.20 GHz), 8192 MB of RAM, and a GeForce GTX 1080 Ti graphics card, running Linux Ubuntu 18.04.4 LTS. The parameter settings for the models are discussed in Sections 3.3.1 and 3.3.2.

XGBoost and LightGBM
For the extreme gradient tree boosting algorithm, we used the standard XGBRegressor from the Python package xgboost, version 0.90, and for the LightGBM model we used LGBMRegressor from the lightgbm Python package, version 2.1.1. To get better results from the tree-based models, we searched for the best parameters of each model using a customized search approach. The best parameters for the XGBoost model are: n_estimators = 70, max_depth = 2, min_child_weight = 1, learning_rate = 0.2, gamma = 0, colsample_bytree = 1, alpha = 10, and objective = reg:squarederror, with all other parameters set to default. For LightGBM, the parameter setting is: learning_rate = 0.1, max_depth = -1, metric = {'l1', 'l2'}, num_leaves = 255, colsample_bytree = 1.0, objective = regression, subsample = 0.6, and seed = 10. The model was trained with early stopping (patience of 5 epochs) to prevent overfitting, and the best results were obtained at 28 epochs.
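The reported settings can be collected as follows. This is a hedged sketch, not the authors' script: `reg_alpha` and `random_state` are the scikit-learn wrapper names used here in place of the `alpha` and `seed` given in the text, `flatten_windows` is a hypothetical helper, and flattening each 24 h window into a single row is only one plausible reading of "reshaped the data", since the exact reshaping is not described.

```python
import numpy as np

# XGBoost settings reported in the text (`reg_alpha` stands for `alpha`).
XGB_PARAMS = {
    "n_estimators": 70,
    "max_depth": 2,
    "min_child_weight": 1,
    "learning_rate": 0.2,
    "gamma": 0,
    "colsample_bytree": 1,
    "reg_alpha": 10,
    "objective": "reg:squarederror",
}

# LightGBM settings reported in the text (`random_state` stands for `seed`).
LGBM_PARAMS = {
    "learning_rate": 0.1,
    "max_depth": -1,
    "num_leaves": 255,
    "colsample_bytree": 1.0,
    "objective": "regression",
    "subsample": 0.6,
    "random_state": 10,
}
LGBM_EVAL_METRICS = ["l1", "l2"]  # the reported metric = {'l1', 'l2'}

def flatten_windows(X):
    """Assumed reshaping for tree models: (samples, 24, 16) -> (samples, 384)."""
    X = np.asarray(X)
    return X.reshape(X.shape[0], -1)

def build_tree_models():
    """Create both regressors; imports are local so the settings above can
    be inspected without xgboost or lightgbm installed."""
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor
    return XGBRegressor(**XGB_PARAMS), LGBMRegressor(**LGBM_PARAMS)

X_flat = flatten_windows(np.zeros((5, 24, 16)))  # shape (5, 384)
```

Tree ensembles expect one row per sample, which is why a two-dimensional layout is assumed here, whereas the recurrent models can consume the three-dimensional windows directly.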

Deep Learning Models
To implement the recurrent neural networks (RNNs), we used Keras, a high-level neural network API, with a TensorFlow backend. We tried different settings for each DL model by varying parameters such as the number of neurons, number of layers, optimizer, and learning rate, to obtain a model that not only performs well on the training data but also gives good predictions on the unseen test data. We built each RNN model with 2-4 layers and with 50, 70, 100, or 150 neurons per layer, and found that two layers with 70 neurons each gave the best performance while limiting overfitting and model complexity. To compare the RNNs, we used the same number of epochs, batch size, dropout, and loss function; the hyperparameter settings for GRU, LSTM, and BiULSTM were kept the same. During model construction, we used dropout [38], which is a common way to prevent overfitting in neural networks. The number of neurons or units in RNN, dropout rate, and other parameters in each layer from top to bottom are given as:

Each DL model was trained with a mini-batch size of 32, and the early stopping technique [39] was also used to prevent overfitting. Callbacks were used to save the best weights for each model. To optimize the models, we used RMSprop [40], an unpublished optimization algorithm introduced by Hinton, G. and designed for neural networks.
A customized search method was adopted to find the best learning rate for the DL models, and 0.0001 was found to be appropriate. Mean absolute error was used as the loss function and was monitored during the training process.
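A two-layer LSTM with the settings stated above (70 units per layer, RMSprop with a learning rate of 0.0001, mean absolute error loss, mini-batch size of 32, early stopping) could be assembled in Keras roughly as follows. This is a sketch under assumptions, not the authors' model definition: the dropout rate and the early stopping patience are not given in the text and are hypothetical values, the names `CONFIG`, `build_lstm`, and `make_callbacks` are invented for the example, and the input shape of 24 time steps by 16 features is taken from the data description.

```python
CONFIG = {
    "lookback": 24,          # hours of history per sample (from the text)
    "n_features": 16,        # selected input features (from the text)
    "n_layers": 2,           # best setting reported
    "units": 70,             # neurons per layer reported
    "learning_rate": 1e-4,   # reported
    "batch_size": 32,        # reported
    "loss": "mean_absolute_error",
    "dropout": 0.2,          # hypothetical: rate not stated in the text
    "patience": 5,           # hypothetical: patience not stated for DL models
}

def build_lstm(cfg=CONFIG):
    """Stacked LSTM regressor; TensorFlow is imported lazily."""
    import tensorflow as tf
    layers = [tf.keras.layers.Input(shape=(cfg["lookback"], cfg["n_features"]))]
    for i in range(cfg["n_layers"]):
        is_last = i == cfg["n_layers"] - 1
        # Every recurrent layer except the last returns the full sequence.
        layers.append(tf.keras.layers.LSTM(cfg["units"],
                                           return_sequences=not is_last))
        layers.append(tf.keras.layers.Dropout(cfg["dropout"]))
    layers.append(tf.keras.layers.Dense(1))  # next-hour PM 2.5 value
    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=cfg["learning_rate"]),
        loss=cfg["loss"],
    )
    return model

def make_callbacks(cfg=CONFIG):
    """Early stopping that restores the best weights seen during training."""
    import tensorflow as tf
    return [tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=cfg["patience"], restore_best_weights=True)]
```

Training would then call `model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=CONFIG["batch_size"], epochs=..., callbacks=make_callbacks())`, with the number of epochs left open because it is not stated. Swapping `LSTM` for `GRU` in the loop gives a comparable GRU model with the same hyperparameters.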

Performance Evaluation for Models
To evaluate the performance of our models, we compared the observed and predicted concentrations of PM 2.5 using four statistical evaluation metrics: MAE, RMSE, R, and IA. They are given in Equations (9)-(12). In these equations, y_i is the actual PM 2.5 concentration, ŷ_i is the predicted PM 2.5 concentration, ȳ is the average of the observed values, and n is the number of predictions in the test set.
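The four metrics can be computed as in the sketch below. The MAE and RMSE follow their usual definitions; for R the Pearson correlation coefficient is assumed, and for IA Willmott's index of agreement is assumed, since these are the common forms and the text does not spell them out. The function names are illustrative.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean square error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def pearson_r(y, y_hat):
    """Pearson correlation between observed and predicted values (assumed form of R)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.corrcoef(y, y_hat)[0, 1])

def index_of_agreement(y, y_hat):
    """Willmott's index of agreement (assumed form of IA):
    1 - sum((y - y_hat)^2) / sum((|y_hat - y_bar| + |y - y_bar|)^2)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    y_bar = y.mean()
    num = np.sum((y - y_hat) ** 2)
    den = np.sum((np.abs(y_hat - y_bar) + np.abs(y - y_bar)) ** 2)
    return float(1.0 - num / den)

observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = {
    "MAE": mae(observed, predicted),    # 2.5 for these values
    "RMSE": rmse(observed, predicted),  # sqrt(6.5)
    "R": pearson_r(observed, predicted),
    "IA": index_of_agreement(observed, predicted),
}
```

Lower MAE and RMSE and higher R and IA indicate better agreement, and IA reaches 1 only for a perfect prediction.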

Results and Discussions
The first part of this section compares the models' mean performance with and without including the CMAQ parameters. The second part covers the performance of each model at all sites after including CMAQ features.

Models' Average Performance with and without CMAQ Data
Tables A1-A4 (see Appendix A) give the details of each model's performance at every station before adding CMAQ features. Tables 1-4 show the average MAE, RMSE, R, and IA values over all stations before and after including CMAQ features. F_p, F_m, and F_c represent pollutants, meteorological parameters, and CMAQ features, respectively. The results in Tables 1-4 show that, by including CMAQ features that are highly correlated with the PM 2.5 concentration, each model's MAE and RMSE values decrease while its R and IA values increase, thus improving the models' performance.

Figure 11 shows the results of all models, and the numerical analysis is subsequently provided in Tables 5-8. In Figure 5, it can be seen that, at Stations 2 and 4, XGBoost fails to predict the peak values at some points. In Figure 6, LightGBM has difficulty predicting the actual values, especially at Stations 1, 2, 4, and 7, where there is a wide difference between actual and predicted values. The results of GRU are shown in Figure 7, which shows that it is unable to predict the real values at Stations 1, 2, and 4. CNNLSTM, shown in Figure 8, provides good predictions only at Stations 3 and 8, while, at Station 4, its prediction results deviate from the original values. BiULSTM predictions and actual values are drawn in Figure 9. On average, it shows better results than the models discussed above; however, at Station 4, it is unable to detect the peak values. LSTM prediction results are shown in Figure 10; it gives better results than all other models except at Stations 2 and 7, where the BiULSTM error values are lower. Overall, LSTM performs well, giving lower error values and a higher IA.
The MAE, RMSE, R, and IA values for all models after adding CMAQ data are given in Tables 5-8. From the results of our experiments, the MAE and RMSE values of the LSTM network are the lowest while the correlation coefficient R and IA are the highest, which shows that this model performs well on this dataset. The BiULSTM network is the next best after LSTM on all evaluation metrics. The following limitations remain:
1. The observation period for this study is only one year. With more data, the network would be better able to learn the spatial and temporal dependencies.
2. Our networks were trained on the past 24 h of data to predict the PM 2.5 concentration for the next 1 h. As the forecast horizon increases, the prediction accuracy of the network usually drops. In the future, we will try to generate 24-72 h predictions and check the models' performance.

Conclusions and Future Work
In this study, ground-based measurements of pollutants, meteorological parameters, and predictive data from the CMAQ model were combined after analyzing the features that affect the concentration of PM 2.5. We estimated the hourly values of PM 2.5 concentration by applying various well-known machine learning models. In the training process, we fed the past 24 h of these features to the ML models to obtain the prediction for the next 1 h. Due to spatial and temporal constraints, each station gives different prediction results; therefore, average evaluation values were calculated over all sites. The results show that a well-optimized LSTM network performs better than the other models used in the study. Although ML models, and specifically RNNs, are able to map temporal features, it is very important to analyze the data first and then optimize the model. The advantages of pollutant forecasting using ML models include:
1. The time, effort, and cost of collecting and measuring data from ground-based stations or other sensors are reduced.
2. If measuring equipment or sensors are defective or fail, the resulting missing data can be generated by ML models from past data with limited resources and time.
3. As other pollutants such as NO 2 , ozone, and PM 10 are also correlated with the concentration of PM 2.5 , ML models can predict their values as well.
In a nutshell, ML models can be applied in the development of forecasting systems, especially in weather and pollutants concentration predictions. In the future, we will try to overcome the limitations discussed in Section 4 to get better forecasting results.

Acknowledgments:
We would like to thank Muhammad Ishfaq Hussain for his valuable feedback and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.