Comparative Analysis of Predictive Models for Fine Particulate Matter in Daejeon, South Korea

Abstract: Air pollution is a critical problem that is of major concern worldwide. South Korea is one of the countries most affected by air pollution. Rapid urbanization and industrialization in South Korea have induced air pollution in multiple forms, such as smoke from factories and exhaust from vehicles. In this paper, we perform a comparative analysis of predictive models for fine particulate matter in Daejeon, the fifth largest city in South Korea. This study is conducted for three purposes. The first purpose is to determine the factors that may cause air pollution. Two main factors are considered: meteorological and traffic. The second purpose is to find an optimal predictive model for air pollutant concentration. We apply machine learning and deep learning models to the collected dataset to predict hourly air pollutant concentrations. The accuracy of the deep learning models is better than that of the machine learning models. The third purpose is to analyze the influence of road conditions on predicting air pollutant concentration. Experimental results demonstrate that considering wind direction and wind speed could significantly decrease the error rate of the predictive models.


Introduction
Air pollution is a major issue in numerous countries worldwide because it causes harmful diseases, including physical and mental illnesses [1][2][3]. A World Health Organization report states that air pollution causes approximately one in eight premature deaths annually, estimated at 6.5 million people [4]. Industrial emissions, vehicle engine emissions, and meteorological factors are considered the root causes of air pollution [5]. The air quality index (AQI) represents the pollution caused by six primary air pollutants: particulate matter (PM10 and PM2.5), ozone (O3), nitrogen dioxide (NO2), carbon monoxide (CO), and sulfur dioxide (SO2). Among these, fine PM is a major air pollutant. PM10 refers to PM with a diameter of 10 μm or less, and PM2.5 refers to PM with a diameter of 2.5 μm or less. PM includes particles generated by combustion engines, solid-fuel burning, energy production, and other activities.
According to an air quality map obtained using a NASA satellite, South Korea is severely affected by air pollution [6][7][8]. The transportation system in South Korea has grown significantly because of rapid urbanization and industrialization. Even though South Korea has one of the world's most modern transportation systems, most people still use personal vehicles. With a population of approximately 51 million, South Korea had approximately 23 million on-road motor vehicles registered as of 2018 [9]. The large

Related Work
Various studies have been conducted on the harmful effects of air pollution. We classify these studies into the following three categories: 1) studies that use only meteorological data, 2) studies that use only traffic data, and 3) studies that use meteorological and traffic data. The subsequent sections discuss each category in detail.

Prediction of AQI Using Meteorological Data
Several authors have proposed machine learning-based and deep learning-based methods for predicting the AQI using meteorological data [16,[20][21][22][23][24]. For example, Park et al. [16] predicted PM2.5 concentrations on the basis of meteorological features, including temperature, humidity, wind direction, and wind speed. A dataset was collected from two areas in Seoul, South Korea. The study used long short-term memory (LSTM) and artificial neural network (ANN) models to predict PM concentrations after a certain time. The authors proposed an algorithm that selected either the LSTM or the ANN model on an hourly basis. The accuracy of the proposed model was higher than that of the LSTM and ANN models alone. Lee et al. [20] predicted PM2.5 concentrations in Taiwan using the gradient boosting (GB) model. They used a dataset consisting of hourly measurements obtained over one year from 77 air monitoring stations and 580 meteorological stations in Taiwan. Experimental results indicated that the model provided accurate 24-h predictions at most air stations. Chang et al. [21] used the random forest (RF) model to predict PM2.5 concentrations on the basis of meteorological features such as wind direction, wind speed, temperature, humidity, and rainfall. The authors compared the proposed model with two other time-series models: logistic regression and linear discriminant analysis. Experimental results demonstrated that the RF model was the most accurate for predicting PM2.5 concentrations. Choubin et al. [22] assessed the spatial hazard of PM10 concentrations using three machine learning models: RF, bagged CART, and mixture discriminant analysis. The study area was Barcelona, an urban and industrial area in Western Europe. The authors assembled a dataset that included PM concentrations (PM10, PM2.5, PM1, and others) and meteorological features (wind speed, wind direction, etc.). In addition, the features that affected PM modeling were identified by a feature selection approach referred to as simulated annealing.
Experimental results demonstrated that the accuracies of all three machine learning models were higher than 87% for predicting PM10 concentrations.
A few studies have used deep learning approaches to predict the AQI. For example, Qadeer et al. [23] predicted hourly PM2.5 concentrations in two large South Korean cities (Seoul and Gwangju) using various pollutant and meteorological features. The pollutant features consisted of PM2.5, PM10, SO2, O3, NO2, and CO concentrations. The meteorological features consisted of temperature, wind speed, relative humidity, surface roughness, planetary boundary layer, and precipitation. Experimental results showed that the LSTM model outperformed the XGBoost, light gradient boosting machine (LGBM), recurrent neural network (RNN), and convolutional neural network models in predicting hourly PM2.5 concentrations. Xayasouk et al. [24] applied the LSTM and deep autoencoder (DAE) models to predict hourly PM2.5 and PM10 concentrations in Seoul, South Korea. The authors used AQI data for 2015-2018 and various meteorological features, such as humidity, rain, wind speed, wind direction, temperature, and atmospheric conditions. Experimental results showed that the performance of the LSTM model was slightly better than that of the DAE model in terms of the root mean square error (RMSE).

Prediction of AQI Using Traffic Data
Numerous researchers have proposed approaches for determining the relationship between air quality and traffic [25][26][27]. For example, Comert et al. [25] studied the impact of traffic volume on air quality in South Carolina, United States. They predicted O3 and PM2.5 concentrations on the basis of the annual average daily traffic (AADT) by obtaining historical traffic volume and air quality data between 2006 and 2016 from monitoring stations. Experimental results showed that air quality worsened as the AADT increased. Adams et al. [26] examined the PM2.5 concentration caused by vehicles at schools, particularly in the morning when parents dropped their children off. A dataset was obtained from a study of 23-116 personal vehicles at 25 schools with 160-765 students. A linear regression model was fitted to the dataset to predict the PM2.5 concentration, which was 10-50 μg/m³ in the morning at the drop-off locations. The study concluded that the use of private vehicles could significantly deteriorate air quality. Askariyeh et al. [27] studied PM2.5 concentrations on the basis of traffic on highways and arterial roads. Near-road PM2.5 concentrations depended on the road type, vehicle weight, traffic volume, and other features. A dataset was collected from a hotspot in Dallas, Texas, by the U.S. Environmental Protection Agency (EPA). The authors proposed a traffic-related PM2.5 concentration model combining emission modeling based on the MOtor Vehicle Emission Simulator (MOVES) and dispersion modeling based on the American Meteorological Society/Environmental Protection Agency Regulatory Model (AERMOD). The MOVES model required traffic-related variables, including exhaust, brake wear, and tire wear, whereas AERMOD required emission and meteorological features. Experimental results revealed that emission and dispersion modeling increased the prediction accuracy of near-road PM2.5 concentrations by up to 74%.

Prediction of AQI Using Meteorological and Traffic Data
Studies have used a combination of meteorological and traffic data [28][29][30][31][32] to improve the accuracy of AQI prediction models. For example, Rossi et al. [28] studied the effect of road traffic flows on air pollution. The dataset of the study was collected in Padova, Italy, during the COVID-19 lockdown. The authors analyzed pollutant concentrations (NO, NO2, NOx, and PM10) together with vehicle counts and meteorology. Statistical tests, correlation analyses, and multivariate linear regression models were applied to investigate the effect of traffic on air pollution. Experimental results indicated that PM10 concentrations were not primarily affected by local traffic; however, vehicle flows significantly affected NO, NO2, and NOx concentrations. Lešnik et al. [29] performed a predictive analysis of PM10 concentrations using meteorological and detailed traffic data. They used a dataset consisting of wind direction, atmospheric pressure, wind speed, rainfall, ambient temperature, relative humidity, vehicle speed, and traffic volume. They proposed a genetic algorithm to perform multiple regression analysis. Experimental results showed that the proposed genetic algorithm was more accurate than the current state-of-the-art algorithms. Wei et al. [30] proposed a framework to explore the relationship between roadside PM2.5 concentrations and traffic volume. They collected three types of data, i.e., meteorological, traffic volume, and PM2.5 concentration data, from Beijing, China. Their framework analyzed the data using a wavelet transform, which divided the data into different frequency components. The framework revealed two microscale rules: 1) the characteristic period of PM2.5 concentrations and 2) a delay of 0.3-0.9 min between PM2.5 concentrations and traffic volume. Catalano et al. [31] predicted peak air pollution episodes using an ANN. The study area was Marylebone Road in London, which consists of three lanes on each side.
The dataset used in the study contained traffic volume, meteorological conditions, and air quality data obtained over ten years (1998-2007). The authors compared the ANN with an autoregressive integrated moving average model with an exogenous variable (ARIMAX) in terms of the mean absolute percentage error. Experimental results showed that the ANN produced 2% lower error than the ARIMAX model. Askariyeh et al. [32] predicted near-road PM2.5 concentrations using wind speed and wind direction. The EPA has installed monitors in near-road environments in Houston, Texas, which collect PM2.5 concentrations and meteorological data. The authors created a multiple linear regression model to predict 24-h PM2.5 concentrations. The results indicated that wind speed and wind direction affected near-road PM2.5 concentrations.

Proposed Method
Figure 1 shows the overall flow of the proposed method. It consists of the following steps: data acquisition, data preprocessing, model training, and evaluation. Our main objective is to predict PM10 and PM2.5 concentrations on the basis of meteorological and traffic features using machine learning and deep learning models. First, we collected data from various governmental online resources via web crawling. Then, we integrated the collected data into a raw dataset and preprocessed it using several data-cleaning techniques. Finally, we applied machine learning and deep learning models to predict PM10 and PM2.5 concentrations and analyzed the prediction results. Each step is described in detail in the following subsections.

Study Area
The study area was Daejeon, which is located in the central region of the Korean Peninsula. Daejeon experiences severe air pollution owing to the high usage of personal vehicles and its proximity to power plants. There are 11 air pollution measurement stations in the five districts of Daejeon, as shown in Figure 2(a). These stations measure the city's AQI for six pollutants (PM2.5, PM10, O3, NO2, CO, and SO2) every hour. We selected eight roads for our study on the basis of traffic congestion, i.e., Gyeryong-ro, Daedeok-daero, Dunsan-daero, Munye-ro, Munjeong-ro, Wolpyeong-ro, Cheongsaseo-ro, and Hanbat-daero, as shown in Figure 2(b).

Data Collection
All datasets used in this study were retrieved from South Korea's open government data portals. Air quality data were obtained from AirKorea [33], which is operated by the Korean Ministry of Environment and the Korea Environment Corporation, and meteorological data were obtained from the Korea Meteorological Administration [34]. Traffic data were collected from the Daejeon Transportation Data Warehouse system [35], which provides road traffic information such as travel speed and traffic volume. The data were collected using web crawling techniques, which access web pages over the HTTP protocol to retrieve and extract data in the HTML or JSON format.
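As an illustrative sketch of the crawling step, the snippet below parses an hourly JSON payload of the kind such portals can return and extracts (DateTime, PM10, PM2.5) records. The payload structure and field names here are hypothetical and do not reflect the actual AirKorea, KMA, or Daejeon Transportation Data Warehouse schemas.

```python
import json

# Hypothetical response body; real portal schemas will differ.
SAMPLE_PAYLOAD = """
{"items": [
  {"dateTime": "2018-01-01 00:00", "pm10": 42, "pm25": 28},
  {"dateTime": "2018-01-01 01:00", "pm10": 45, "pm25": 30}
]}
"""

def parse_hourly_records(payload):
    """Extract (DateTime, PM10, PM2.5) tuples from a JSON response body."""
    data = json.loads(payload)
    return [(item["dateTime"], item["pm10"], item["pm25"])
            for item in data["items"]]

records = parse_hourly_records(SAMPLE_PAYLOAD)
```

In practice, each page would be fetched over HTTP before parsing, and the extracted records would be appended to the hourly time-series dataset keyed by the DateTime field.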
We collected hourly time-series data between January 1, 2018, and December 31, 2018. We concatenated the collected datasets into one dataset on the basis of the DateTime index. The final dataset consisted of 8,760 observations. Figure 3 shows the distribution of the AQI by the (a) DateTime index, (b) month, and (c) hour. The AQI is relatively better from July to September than in the other months. There are no major differences among the hourly distributions of the AQI, although the AQI worsens somewhat from 10 a.m. to 1 p.m.

Competing Models
Several models were used to predict air pollutant concentrations in Daejeon. Specifically, we fitted the data using ensemble machine learning models (RF, GB, and LGBM) and deep learning models (GRU and LSTM). This subsection provides a detailed description of these models and their mathematical foundations.
The RF [36], GB [37], and LGBM [38] models are ensemble machine learning algorithms that are widely used for classification and regression tasks. The RF and GB models combine multiple single decision trees into an ensemble model. The main difference between the RF and GB models lies in how they create and train the set of decision trees: the RF model creates each tree independently and combines the results at the end of the process, whereas the GB model creates one tree at a time and combines the results during the process. The RF model uses the bagging technique, which is expressed by Equation (1). Here, n represents the number of training subsets, h_i represents a single prediction model trained on the i-th subset, and F is the final ensemble model, which predicts values as the mean of the n single prediction models. The GB model uses the boosting technique, which is expressed by Equation (2). Here, M and m represent the total number of iterations and the current iteration number, respectively. F_m is the ensemble model at iteration m, and γ_m represents a weight calculated on the basis of the errors of the previous iteration; the weighted model h_m is then added to the ensemble.
The LGBM model extends the GB model with automatic feature selection. Specifically, it reduces the number of features by identifying features that can be merged. This increases the speed of the model without decreasing its accuracy.
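The contrast between bagging (Equation (1)) and boosting (Equation (2)) can be illustrated with a minimal pure-Python sketch that uses one-dimensional decision stumps as base learners. The function names and toy data below are illustrative and are not part of the paper's implementation.

```python
import random

def stump_fit(xs, ys):
    """Fit a 1-D regression stump (single threshold split) by minimizing SSE."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                      # all x values identical: predict the mean
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def bagging_fit(xs, ys, n=10, seed=0):
    """Bagging (Eq. (1)): train n models on bootstrap subsets, average them."""
    rng = random.Random(seed)
    models = []
    for _ in range(n):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(stump_fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(h(x) for h in models) / len(models)

def boosting_fit(xs, ys, m=20, lr=0.5):
    """Boosting (Eq. (2)): add one weighted model at a time, each fitted
    to the residual errors of the current ensemble."""
    base = sum(ys) / len(ys)
    models, resid = [], [y - base for y in ys]
    for _ in range(m):
        h = stump_fit(xs, resid)
        models.append(h)
        resid = [r - lr * h(x) for x, r in zip(xs, resid)]
    return lambda x: base + lr * sum(h(x) for h in models)

# Toy step-shaped target for demonstration.
x_train = [1, 2, 3, 4, 5, 6, 7, 8]
y_train = [1, 1, 1, 1, 5, 5, 5, 5]
bagged = bagging_fit(x_train, y_train, n=25)
boosted = boosting_fit(x_train, y_train)
```

The bagged predictor averages independently trained models, while the boosted predictor accumulates weighted corrections, mirroring the two equations.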
An RNN is a deep learning model for analyzing sequential data such as text, audio, video, and time series. However, RNNs have a limitation referred to as the short-term memory problem. An RNN predicts the current value by looping over past information; hence, its accuracy decreases when there is a large gap between the relevant past information and the current value. The GRU [39] and LSTM [40] models overcome this limitation by using additional gates to pass information along long sequences. The GRU cell uses two gates: an update gate and a reset gate. The update gate determines whether to update a cell, and the reset gate determines whether the previous cell state is important. The LSTM cell uses three gates: an input gate, a forget gate, and an output gate. The input gate plays the same role as the update gate of the GRU model, the forget gate removes information that is no longer required, and the output gate passes the output to the next cell state. The GRU and LSTM models are expressed by Equations (3) and (4), respectively.
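In their standard formulations, which Equations (3) and (4) follow, the GRU and LSTM cells are commonly written as below, where σ denotes the logistic sigmoid, ⊙ the element-wise product, x_t the input, and h_t the hidden state.

```latex
% GRU cell (Equation (3)): update gate z_t and reset gate r_t
\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}

% LSTM cell (Equation (4)): input gate i_t, forget gate f_t, output gate o_t
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```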

Evaluation Metrics
The models are evaluated to study their prediction accuracy and determine which model should be used. Three of the most frequently used metrics for evaluating models are the coefficient of determination (R²), RMSE, and mean absolute error (MAE). The RMSE measures the square root of the average squared distance between the actual and predicted values. Because the errors are squared before being averaged, the RMSE penalizes large errors heavily.
The R², RMSE, and MAE are expressed by Equations (5), (6), and (7), respectively. Here, N represents the number of samples, y_i represents an actual value, ŷ_i represents the corresponding predicted value, and ȳ represents the mean of the observations. The central quantity is the difference between y_i and ŷ_i, i.e., the error or residual. A model is considered more accurate as these two values become closer.
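The three metrics can be computed directly from their definitions; the following is a minimal pure-Python sketch consistent with Equations (5)-(7), with function names of our choosing.

```python
import math

def rmse(actual, predicted):
    """Root mean square error (Eq. (6)): sqrt of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error (Eq. (7)): mean of the absolute residuals."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def r2(actual, predicted):
    """Coefficient of determination (Eq. (5)): 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

A perfect predictor yields RMSE = MAE = 0 and R² = 1, while a predictor no better than the mean of the observations yields R² = 0.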

Preprocessing
The datasets used in this study consisted of hourly air quality, meteorological, and traffic observations. The blank cells in the datasets represented a value of zero for wind direction and snow depth: when the cells for wind direction were blank, the wind was not notable (the wind speed was zero or almost zero), and the cells for snow depth were blank on non-snow days. Hence, blank cells were replaced by zero. A seasonal factor was extracted from the DateTime column of the datasets. A new column, month, was used to represent the month in which an observation was obtained; it consisted of 12 values (Jan-Dec). The wind direction column was converted from a numerical value in degrees (0°-360°) into five categorical values. A wind direction of 0° was labeled N/A, indicating that no notable wind was detected. Wind directions of 1°-90° were labeled northeast (NE), 91°-180° southeast (SE), 181°-270° southwest (SW), and 271° or more northwest (NW). The average traffic speed was calculated and binned. The bin size was set to 10 km/h because the minimum average speed was approximately 25 km/h and the maximum was approximately 60 km/h. The binned values were then divided into four groups, whose average speeds were 25-35 km/h, 36-45 km/h, 46-55 km/h, and more than 55 km/h, respectively.
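The wind-direction and speed-binning rules above can be sketched as two small helper functions; the function names are ours, not from the original implementation.

```python
def wind_direction_label(degrees):
    """Map wind direction in degrees to the five categorical labels
    used in preprocessing (0 degrees means no notable wind)."""
    if degrees == 0:
        return "N/A"
    if degrees <= 90:
        return "NE"
    if degrees <= 180:
        return "SE"
    if degrees <= 270:
        return "SW"
    return "NW"        # 271 degrees and above

def average_speed_group(speed_kmh):
    """Bin average traffic speed (km/h) into the four groups described above."""
    if speed_kmh <= 35:
        return 1       # 25-35 km/h
    if speed_kmh <= 45:
        return 2       # 36-45 km/h
    if speed_kmh <= 55:
        return 3       # 46-55 km/h
    return 4           # more than 55 km/h
```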
The datasets were combined into one dataset, as shown in Table 1. A few observations in this dataset were missing or invalid. Missing values were treated as a type of data error in which the value of an observation cannot be found; missing data can cause errors or failure in the model-building process. Thus, in the preprocessing stage, we replaced the missing values with logically estimated values. Three techniques were considered for filling the missing values. Among them, interpolation, which constructs new data points within the range of a discrete set of known data, provided the best result in estimating the missing values, as shown in Figure 4. Thus, this method was used to fill in the missing values.
Figure 5 shows the process of data integration, model training, and testing. First, the data from the three datasets were integrated into one dataset by mapping them using the DateTime index. Here, T, WS, WD, H, AP, and SD represent temperature, wind speed, wind direction, humidity, air pressure, and snow depth, respectively, from the meteorological dataset. R1 to R8 represent the eight roads from the traffic dataset, and PM indicates PM2.5 and PM10 from the air quality dataset. In addition, it is important to note that machine learning methods are not directly suited to time-series modeling. Therefore, at least one timekeeping variable must be used. We used the following time variables for this purpose: month (M), day of the week (DoW), and hour (H).
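The interpolation step can be sketched as follows for a single hourly series; this is a minimal linear-interpolation sketch of our own (it assumes the series contains at least one observed value), not the paper's implementation.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest known
    neighbours; runs of leading/trailing Nones are filled with the nearest
    known value. Assumes at least one entry is not None."""
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1                       # j is the next known index (or n)
            left = out[i - 1] if i > 0 else out[j]
            right = out[j] if j < n else out[i - 1]
            gap = j - i + 1
            for k in range(i, j):
                out[k] = left + (right - left) * (k - i + 1) / gap
            i = j
        else:
            i += 1
    return out
```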

Hyperparameters of Competing Models
Most machine learning models are sensitive to hyperparameter values. Therefore, hyperparameters must be determined carefully to build an efficient model. Suitable hyperparameter values depend on various factors. For example, the results of the RF and GB models change considerably with the max_depth parameter, and the accuracy of the LSTM model can be improved by carefully selecting the window and learning_rate parameters. We applied the cross-validation technique to each model, as shown in Figure 6. First, we divided the dataset into training (80%) and test (20%) sets. The training data were then divided into folds, with a different fold held out for validation in each round. We selected several candidate values for each hyperparameter of each model, and the cross-validation procedure determined the best parameters using the training subsets and candidate values.
Figure 6. Cross-validation technique used to find the optimal hyperparameters of the competing models. Adapted from [41].
Table 2 presents the selected and candidate values of the hyperparameters of each model and their descriptions. The RF and GB models were applied using Scikit-learn [41]. As both models are tree-based ensemble methods implemented in the same library, their hyperparameters were similar. We selected the following five essential hyperparameters for these models: the number of trees in the forest (n_estimators, where higher values increase performance but decrease speed), the maximum depth of each tree (max_depth), the number of features considered when searching for the best split (max_features), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required at a leaf node (min_samples_leaf, where a higher value helps cover outliers).
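The grid-search procedure over candidate hyperparameter values can be sketched in pure Python as follows; the scoring function here is a toy stand-in for the cross-validated model accuracy, and all names are illustrative.

```python
from itertools import product

def grid_search(param_grid, cv_score):
    """Evaluate every combination of candidate hyperparameter values and
    return the combination with the highest validation score.
    cv_score(params) is expected to run cross-validation and return a score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for cross-validated accuracy:
# it peaks at max_depth = 8 and prefers more estimators.
def toy_score(params):
    return -(params["max_depth"] - 8) ** 2 + params["n_estimators"] / 1000

best_params, best_score = grid_search(
    {"max_depth": [4, 8, 16], "n_estimators": [100, 300]}, toy_score)
```

In the actual experiments, this role is played by a library grid-search routine combined with k-fold cross-validation on the 80% training split.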
We selected the following five essential hyperparameters for the LGBM model, using the LightGBM Python library: the number of boosted trees (n_estimators), the maximum tree depth for base learners (max_depth), the maximum number of tree leaves for base learners (num_leaves), the minimum loss reduction required to make a further split (min_split_gain), and the minimum number of samples required at a leaf node (min_child_samples). We used the grid search function to evaluate the model for each possible combination of hyperparameter values and determined the best value of each parameter. We used the window size, learning rate, and batch size as the hyperparameters of the deep learning models. Fewer hyperparameters were tuned for the deep learning models than for the machine learning models because training the deep learning models required considerable time. Two hundred epochs were used for training the deep learning models, and early stopping with a patience value of 10 was used to prevent overfitting and reduce the training time. The LSTM model consisted of eight layers, including LSTM, ReLU, dropout, and dense layers. The input features were passed through three LSTM layers with 128 and 64 units, and we added a dropout layer after each LSTM layer to prevent overfitting. The GRU model consisted of seven GRU, dropout, and dense layers, with three GRU layers of 50 units each.
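The early-stopping rule with a patience value can be sketched as a small pure-Python function; the function name and return convention are ours, not those of any particular deep learning framework.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch): training stops once the validation
    loss has not improved for `patience` consecutive epochs; otherwise it
    runs to the end of the schedule."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch              # patience exhausted
    return len(val_losses) - 1, best_epoch
```

With a 200-epoch schedule and patience 10, training halts 10 epochs after the last improvement in validation loss, which both limits overfitting and shortens training.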

Impacts of Different Features
The first experiment compared the error rates of the models using three different feature sets: meteorological features, traffic features, and both combined. The main purpose of this experiment was to identify the most appropriate features for predicting air pollutant concentrations. Figure 7 shows the RMSE values of each model obtained using the three feature sets. The error rates obtained using the meteorological features are lower than those obtained using the traffic features. Furthermore, the error rates decrease significantly when all features are used. Thus, we used the combination of meteorological and traffic features for the rest of the experiments presented in this paper. Table 3 shows the R², RMSE, and MAE of the machine learning and deep learning models for predicting the 1 h AQI. The performance of the deep learning models is generally better than that of the machine learning models for predicting PM2.5 and PM10 values. Specifically, the GRU and LSTM models show the best performance in predicting PM10 and PM2.5 values, respectively. The RMSE of the deep learning models is approximately 15% lower than that of the machine learning models in PM10 prediction. Figure 8 shows the PM10 and PM2.5 predictions obtained using all models. The blue and orange lines represent the actual and predicted values, respectively. The PM2.5 values predicted by the LSTM model are 27% more accurate than those predicted by the other models.

Comparison of Prediction Time
We performed an experiment to analyze the effect of time scales (1 h, 3 h, 6 h, and 12 h) on the accuracy of the machine learning and deep learning models. Figure

Influence of Wind Direction and Speed
In recent years, numerous studies have considered the influence of wind direction and speed on air quality [42][43][44]. Wind direction and speed are essential features for the stations that measure air quality: depending on them, air pollutants may move away from a station or settle around it. Thus, we conducted additional experiments to examine the influence of wind direction and speed on the prediction of air pollutant concentrations. For this purpose, we developed a method of assigning road weights on the basis of wind direction. We selected the air quality measurement station located in the middle of all eight roads. Figure 10 shows the air pollution station and the surrounding roads. On the basis of the figure, we can assume that traffic on Roads 4 and 5 may increase the AQI close to the station when the wind blows from the east, whereas the other roads have a weaker effect on the AQI around the station. We applied the computed road weights to the deep learning models as an additional feature. The roads around the station were classified on the basis of the wind direction (NE, SE, SW, and NW), as shown in Table 4, and the road weights were set to 0 or 1. For example, if the wind direction was NE, the weights of Roads 3, 4, and 5 were 1 and those of the other roads were 0. We built and trained the GRU and LSTM models using wind speed, wind direction, road speed, and road weight to evaluate the effect of road weights. Figure 11 shows the RMSE of the GRU and LSTM models with (orange) and without (blue) road weights. For the GRU model, the RMSE values with and without road weights are similar. In contrast, for the LSTM model, the RMSE values with road weights are approximately 21% and 33% lower than those without road weights for PM10 and PM2.5, respectively.
Figure 11. Error rates of the GRU and LSTM models with and without the application of road weights.
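The road-weighting rule can be sketched as follows. Only the NE assignment (Roads 3-5) is stated explicitly in the text, so the remaining direction-to-road entries below are hypothetical placeholders for the actual contents of Table 4.

```python
# Upwind roads per wind direction. Only the NE entry is taken from the text;
# the other entries are hypothetical stand-ins for Table 4.
UPWIND_ROADS = {
    "NE": {3, 4, 5},
    "SE": {4, 5, 6},   # hypothetical
    "SW": {6, 7, 8},   # hypothetical
    "NW": {1, 2, 8},   # hypothetical
}

def road_weights(wind_direction, n_roads=8):
    """Return a 0/1 weight per road: 1 if the road lies upwind of the
    measurement station for the given wind direction, otherwise 0."""
    upwind = UPWIND_ROADS.get(wind_direction, set())  # "N/A" -> all zeros
    return [1 if road in upwind else 0 for road in range(1, n_roads + 1)]
```

The resulting weight vector is appended to the feature set of the deep learning models, so that only upwind roads contribute traffic information for a given hour.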

Discussion and Conclusions
We presented a comparative analysis of predictive models for fine PM in Daejeon, South Korea. For this purpose, we first examined the factors that can affect air quality. We collected AQI, meteorological, and traffic data in an hourly time-series format from January 1, 2018, to December 31, 2018. We applied machine learning and deep learning models using 1) only meteorological features, 2) only traffic features, and 3) both meteorological and traffic features. Experimental results revealed that the models performed better with only meteorological features than with only traffic features. Furthermore, the accuracy of the models increased significantly when meteorological and traffic features were combined.
Next, we determined the model most suitable for predicting air pollutant concentrations. We examined three machine learning models (RF, GB, and LGBM) and two deep learning models (GRU and LSTM). The deep learning models outperformed the machine learning models. Specifically, the LSTM and GRU models showed the best accuracy in predicting PM2.5 and PM10 concentrations, respectively, while the accuracies of the GB and RF models were similar. We also compared the effect of time scales (1 h, 3 h, 6 h, and 12 h) on the models: the AQI predicted at a time scale of 1 h was more accurate than that predicted at the other time scales.
Finally, we analyzed the effect of road conditions on the prediction of air pollutant concentrations. Specifically, we measured the relationship between traffic and wind direction and speed. We selected an air pollution measurement station surrounded by eight roads and set a weight for each road based on its location and the wind direction. The consideration of road weights reduced the RMSE by approximately 21% and 33% for PM10 and PM2.5, respectively.
We conducted the experiments on time-series data (i.e., air pollution, meteorological, and traffic data), which are widely used in predicting air pollutant concentrations. Given that most countries and cities now publish their environmental data openly, we expect that the proposed methodology can be readily applied to predict air pollutant concentrations in both local and international settings.
Our study has several limitations that should be addressed in the future. First, we considered only meteorological and traffic factors; however, air pollution is affected by several other factors, which should be investigated further. Second, we considered only roads located in the city center when analyzing the effect of road conditions, although suburban roads can also help characterize the overall air pollution of a city. Finally, we used a relatively small dataset covering a one-year period. In the future, we aim to improve the prediction accuracy in two ways. The first is to consider other causes of air pollution, such as power plants and industrial emissions. The second is to use more data, treat outliers, and further tune the models.