Outdoor Particulate Matter Correlation Analysis and Prediction Based Deep Learning in the Republic of Korea

Particulate matter (PM) has become a problem worldwide, with many deleterious health effects such as worsened asthma, impaired lung function, and various toxin-induced cancers. The International Agency for Research on Cancer (IARC) under the World Health Organization (WHO) has designated PM as a group 1 carcinogen. Korea Environment Corporation forecasts the status of outdoor PM four times a day, based on whichever is higher of PM10 and PM2.5, but it forecasts only the stage of PM, so it remains difficult to predict the value of PM when going out. We analyzed the correlation between PM and the 24 solar terms, address format, air quality data, and weather data in the Republic of Korea. We performed PM prediction using the Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Gated Recurrent Unit (GRU) models. We evaluated performance according to the sequence length and batch size and found the best outcome with a sequence length of 7 days and a batch size of 96. The CNN model showed some similarity to the actual values on the training data but could not predict the test data. The LSTM and GRU models generated similar prediction results. We confirmed that the LSTM model has higher accuracy than the other two models.


Introduction
Recently, PM has become a problem worldwide. The causes of PM include industrial activity [1]. PM10 and PM2.5 denote particles with a diameter of 10 μm or less and 2.5 μm or less, respectively. Air pollution causes 7 million deaths annually worldwide, including children [2]. Air pollution causes respiratory symptoms and breathing problems and decreases respiratory function [3][4][5][6][7][8][9][10][11][12]. In particular, PM worsens asthma and affects the lungs [13][14][15][16]. In 2013, the IARC under the WHO designated PM as a group 1 carcinogen. In 2018, to prevent the deterioration of health caused by PM, the Republic of Korea strengthened its standards for PM levels to match those of advanced countries (Table 1).

PM is generated by power plants and vehicles [17]. Power plant filters reduce PM10 but do not reduce PM2.5 [17]. PM in the Republic of Korea comes from domestic sources and from neighboring countries such as China [17][18][19]. The Government of South Korea estimates that 30 to 50% of ultrafine dust comes from China [20]. The Deputy Mayor for Climate & Environment in the Republic of Korea estimates that the ratio of ultrafine dust generated in Korea is 50-70% [21]. According to a report by the Korea Environmental Industry & Technology Institute (KEITI) [22], it is estimated that 51% of ultrafine dust in Seoul is generated. A high concentration of PM occurs every day in the Republic of Korea [21]. In addition, Korea Environment Corporation forecasts the status of PM, based on whichever is higher of PM10 and PM2.5, four times a day (5 AM, 11 AM, 5 PM, and 11 PM) according to four stages: Good, Normal, Bad, and Very bad (Table 2). Because Korea Environment Corporation forecasts only the stage of PM, it is difficult to predict the value of PM when going out.

Recently, the prediction of PM in the Republic of Korea has been studied [23][24][25]. Yi et al. [26] embedded air quality data, PM data, and location information in a grid-pattern image and applied the ConvLSTM model. However, their approach was applied to only one city and used only air quality data and PM data. PM is correlated with weather data, but Yi et al. did not consider weather data. Their study also suffered the limitation of needing pre-processing to create a grid image from time series data. Vong et al. [27] used SVM and ELM models with air quality data and weather data to predict PM in Macau, China. When using 11 variables, the accuracy of the ELM model was approximately 81%, and the accuracy of the SVM model was approximately 79%. When using 52 variables, the accuracy of the ELM model was approximately 75%, and the accuracy of the SVM model was approximately 78%. PM2.5 has more adverse effects on health, but Vong et al. predicted only PM10. Xayasouk et al. [25] studied PM prediction with Deep AutoEncoder (DAE) and LSTM models in Seoul, the Republic of Korea. The LSTM model was more accurate than the DAE model.

Accordingly, we consider location information and air quality data and predict PM2.5 as well as PM10. We consider Recurrent Neural Network (RNN) models such as the LSTM model. We analyze the correlation between PM and address format, air quality data, and weather data, and develop a PM prediction model.

Air Pollution Stations
In the Republic of Korea, air pollution stations are installed in eight cities and nine provinces (Table 3). The Republic of Korea has seven metropolitan cities, one self-governing city, eight provinces, and one self-governing province. We used air quality data from 2015 to 2019. Air quality data were collected by AirKorea [28]. The collected air quality data include PM10, PM2.5, O3, CO, SO2, and NO2 and were measured every hour; however, some data were missing due to communication problems in the air pollution stations.

Weather Stations
In the Republic of Korea, weather stations are installed in eight cities and nine provinces (Table 4). We used only the weather data measured at stations with the Automated Surface Observing System (ASOS) installed. The Republic of Korea has seven metropolitan cities, one self-governing city, eight provinces, and one self-governing province. We used weather data from 2015 to 2019. Weather data were collected by the Korea Meteorological Administration [29]. The collected weather data were temperature, humidity, wind speed, wind direction, snowfall, precipitation, and atmospheric pressure, and were measured every hour; however, some data were missing due to communication problems in the weather stations.

Data Description
Air quality data and weather data are important for predicting PM. We used Korea's air quality data and weather data from 1 January 2015 to 30 November 2019, because the air quality data provided by AirKorea did not yet include final measurement data for December 2019. We added the weather data for each region to the air quality data. Locations where air quality data were measured but weather data were not were deleted. We combined the weather data and air quality data based on address and time.
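The combination step described above can be sketched as a join on address and time. This is a minimal illustration, assuming pandas and hypothetical column names and values (address, time, PM10, and so on); the paper does not state which tooling or schema was used.

```python
import pandas as pd

# Hypothetical hourly air quality records (names and values are illustrative).
air = pd.DataFrame({
    "address": ["Seoul A", "Seoul A", "Busan B"],
    "time": pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 00:00"]
    ),
    "PM10": [45.0, 50.0, 30.0],
    "PM2.5": [20.0, 22.0, 12.0],
})

# Hypothetical hourly weather records; "Busan B" has no weather station here.
weather = pd.DataFrame({
    "address": ["Seoul A", "Seoul A"],
    "time": pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00"]),
    "temperature": [-2.0, -2.5],
    "humidity": [55.0, 57.0],
})

# An inner join keeps only rows that have both air quality and weather data,
# which drops locations without weather measurements.
merged = air.merge(weather, on=["address", "time"], how="inner")
print(merged)
```

The inner join is one way to realize the deletion of locations that lack weather data; a left join followed by an explicit drop would make the removed rows easier to audit.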

Weather Data
Weather data are meaningful for predicting PM. Temperature affects air quality data [30]. Low humidity and low temperature are related to the PM2.5 concentration [31]. Also, the wind carries PM and affects the PM concentration [30,31]. In addition, rain removes air pollution [32,33].

Air Quality Data
NO2 and CO are related to automobile emissions [33], and SO2 and PM are related to factory emissions. O3 is produced when there are NO2 and NO in the atmosphere [34]. SO2, NO2, and O3 have a high correlation with PM [34], whereas CO has a low correlation with PM [35].

Address Format
PM is affected by the surrounding cities because it moves with the wind [34,36]. We represented the address in three ways. The first method divides the address into (1) city name, (2) district, county, or ward name, and (3) dong, eup, or myeon name. The second method divides the address into (1) province name, (2) city or county name, and (3) dong, eup, or myeon name. The third method is the full address of the station, from which we removed the building name. Accordingly, each station uses either the first or the second method, together with the third method. We converted the address format using a one-hot encoding library.
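As an illustration of the one-hot conversion, the sketch below encodes two address levels with pandas. The paper does not name the encoding library, so `pandas.get_dummies` is an assumption, and the column names and station addresses are hypothetical.

```python
import pandas as pd

# Hypothetical station addresses split into two levels (illustrative values).
stations = pd.DataFrame({
    "address1": ["Seoul", "Seoul", "Gyeonggi-do"],
    "address2": ["Jung-gu", "Gangnam-gu", "Suwon-si"],
})

# Each distinct value becomes its own 0/1 column, so no artificial ordering
# is imposed on cities or districts.
encoded = pd.get_dummies(stations, columns=["address1", "address2"], dtype=int)
print(encoded.columns.tolist())
```

Stations in the same city share the same city-level column, which is one plausible way for a model to pick up regional effects.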

Twenty-Four Solar Terms
The 24 solar terms subdivide the seasons. The 24 solar terms are significant in time series data affected by weather [37]. We assigned the 24 solar terms according to the dates provided by the Korea Meteorological Administration [38]. We converted the 24 solar terms using a one-hot encoding library.

Wind_x, Wind_y
The wind is expressed in terms of wind speed and wind direction. The wind direction is an angle from 0 to 360 degrees, and the wind speed is the distance the wind travels per second. The accuracy is higher when using wind_x and wind_y than when using wind speed and wind direction [39].
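A common way to obtain wind_x and wind_y is to project the wind speed onto two axes using the direction angle. The sketch below assumes that decomposition; the paper's own equations for wind_x and wind_y are not reproduced in this text, so the exact sign and axis conventions here are assumptions, and the function name is hypothetical.

```python
import numpy as np

def wind_components(speed, direction_deg):
    """Split wind speed and direction (degrees) into x and y components.

    Assumed convention: wind_x = speed * cos(theta), wind_y = speed * sin(theta).
    """
    theta = np.deg2rad(direction_deg)
    wind_x = speed * np.cos(theta)
    wind_y = speed * np.sin(theta)
    return wind_x, wind_y

speed = np.array([2.0, 2.0, 3.0])
direction = np.array([0.0, 360.0, 90.0])
wind_x, wind_y = wind_components(speed, direction)
print(wind_x, wind_y)
```

One motivation for this representation is that directions of 0 and 360 degrees, which are numerically far apart, map to the same pair of components.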

Data Parameters
We predict PM10 and PM2.5 using air quality data and weather data. Air quality data include SO2, O3, CO, and NO2. Weather data include temperature, atmospheric pressure, humidity, wind speed, wind direction, precipitation, and snowfall. We performed deep learning using four air quality variables, seven weather variables, wind_x, wind_y, the 24 solar terms, and the address format. Table 5 shows the input data, measurement units, and ranges. We used data from 2015 to 2019. Of the dataset, 70% was used as training data, 24% as validation data, and 6% as test data. We corrected the missing values of precipitation and snowfall to zero. Missing values of SO2 and CO were corrected using the average. Resampling was performed for NO2, O3, temperature, atmospheric pressure, wind speed, wind direction, and humidity. SO2 has 631319 missing values, CO has 636586, O3 has 474455, and NO2 has 492315. Temperature has 470 missing values, wind speed has 5242, wind direction has 19926, humidity has 6052, and atmospheric pressure has 1554. An IsRain value of 0 means no rain, and 1 means rain. An IsSnow value of 0 means no snow, and 1 means snow. We analyzed the correlation among the input data (Figure 1). The correlation analysis excludes Address1, Address2, Address3, Full Address, IsRain, and IsSnow, which are expressed as integers, because only continuous time-series data are meaningful for correlation.
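The zero-filling of precipitation and snowfall and the 70/24/6 split can be sketched as follows. The paper does not say whether the split was chronological or random; a chronological split is assumed here because the data are a time series, and the function and column names are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_and_split(df, train=0.70, val=0.24):
    """Zero-fill rain/snow gaps, then split rows in order into train/val/test."""
    df = df.copy()
    for col in ("precipitation", "snowfall"):
        # A missing value is treated as "no rain" or "no snow" at that hour.
        df[col] = df[col].fillna(0.0)
    n = len(df)
    train_end = round(n * train)
    val_end = train_end + round(n * val)
    # The remaining rows (about 6%) form the test set.
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Hypothetical 100-hour sample.
sample = pd.DataFrame({
    "precipitation": [np.nan, 1.5] * 50,
    "snowfall": [np.nan] * 100,
    "PM10": np.arange(100.0),
})
train_df, val_df, test_df = fill_and_split(sample)
print(len(train_df), len(val_df), len(test_df))
```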

Methods
In previous studies [40], the CNN-LSTM hybrid model showed better performance than the CNN and LSTM models. We applied the CNN-LSTM and CNN-GRU hybrid models, but excluded them because of their low accuracy. We therefore used the CNN model, the LSTM model, and the GRU model, which is similar to the LSTM model.

Long Short-Term Memory
The LSTM model is one of the RNN models. The RNN model takes the previous results as input, which causes long-term dependency problems (Figure 2). The LSTM model is one of the models that solve the long-term dependency problem of the RNN model. Figure 3 shows the structure of the LSTM model. The LSTM model has high accuracy for the prediction of time-series data.
The LSTM model uses Equation (4) to determine which data to keep and which to delete.
The LSTM model determines the data to be written to long-term memory through Equations (5) and (6). The LSTM model updates long-term memory through Equation (7).
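The gate operations described above can be illustrated with one time step of a standard LSTM cell in NumPy. This is a sketch of the textbook formulation, not a reproduction of the paper's equations; the parameter names `W`, `U`, and `b` and the dictionary layout are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'f', 'i', 'c', 'o'."""
    # Forget gate: which parts of long-term memory to keep or delete.
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    # Input gate and candidate: which new data to write to long-term memory.
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    # Long-term memory (cell state) update.
    c_t = f_t * c_prev + i_t * c_tilde
    # Output gate and new hidden state.
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny example with zero weights: every gate evaluates to 0.5.
n_in, n_hid = 3, 2
W = {k: np.zeros((n_hid, n_in)) for k in "fico"}
U = {k: np.zeros((n_hid, n_hid)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h_t, c_t = lstm_step(np.ones(n_in), np.zeros(n_hid), np.ones(n_hid), W, U, b)
print(h_t, c_t)
```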

Gated Recurrent Unit
The GRU model simplifies the complex equations of the LSTM model and thereby improves its computation speed. Figure 4 shows the structure of the GRU model. The GRU model consists of the reset gate and the update gate. Here, h denotes the hidden state of the current state when calculating the j-th hidden unit, and the output is the hidden state.
Table 7 shows the variables used in the GRU model. In the GRU model, the logistic sigmoid function is implemented through Equation (3). Depending on the value of r obtained through Equation (10), the GRU model decides whether the previous hidden state is ignored. This improves on the long-term memory modification of the LSTM.
Depending on the value of z obtained through Equation (11), the GRU model selects whether the hidden state is kept or updated with the new hidden state. It updates h through Equation (12). The predicted value is output through Equation (13).
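The roles of the reset gate r and the update gate z can be illustrated with one GRU time step in NumPy. The sketch follows the commonly used formulation in which z weights the previous hidden state; the paper's own equations are not reproduced in this text, so this convention and the parameter names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W_h, U_h):
    """One GRU time step returning the new hidden state."""
    # Reset gate: values near 0 make the candidate ignore the previous state.
    r = sigmoid(W_r @ x_t + U_r @ h_prev)
    # Update gate: balances the previous state against the candidate.
    z = sigmoid(W_z @ x_t + U_z @ h_prev)
    # Candidate (new) hidden state.
    h_new = np.tanh(W_h @ x_t + U_h @ (r * h_prev))
    # Interpolate between the previous state and the candidate.
    return z * h_prev + (1.0 - z) * h_new

# Tiny example with zero weights: r = z = 0.5 and the candidate is 0.
n_in, n_hid = 3, 2
Wz = np.zeros((n_hid, n_in))
Uz = np.zeros((n_hid, n_hid))
h_prev = np.array([0.4, -0.8])
h_next = gru_step(np.ones(n_in), h_prev, Wz, Uz, Wz, Uz, Wz, Uz)
print(h_next)
```

Compared with the LSTM step, there is no separate cell state and one fewer gate, which is where the reduction in computation comes from.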

Convolutional Neural Network
The CNN model is used to extract hidden patterns through the convolution and pooling stages.
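The convolution and pooling stages can be illustrated on a one-dimensional series with NumPy. This is a minimal sketch with a hand-picked kernel; in a trained CNN the kernel weights are learned, and the function names here are hypothetical.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide the kernel over x without padding (cross-correlation, as in most deep learning libraries)."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, size):
    """Keep the maximum of each non-overlapping window of the given size."""
    x = np.asarray(x, dtype=float)
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
features = conv1d_valid(series, [1.0, -1.0])  # differences between neighbors
pooled = max_pool1d(features, 2)
print(features, pooled)
```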

Evaluation Methods
We evaluated the performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). In deep learning, regression is verified via errors. MAE and RMSE are calculated using Equations (14) and (15), respectively. Both measure the difference between the actual and predicted values, so the lower the values of MAE and RMSE, the higher the accuracy.
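For reference, the standard definitions of the two metrics can be computed as in the sketch below; the sample values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

Here the absolute errors are 2, 2, and 3, so MAE is 7/3 and RMSE is the square root of 17/3. RMSE is never smaller than MAE on the same data.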

Data Preparation
We corrected outliers and missing values in the collected data. When values of PM2.5, PM10, humidity, temperature, precipitation, or snowfall were outside the ranges in Table 5, they were replaced with the average. Missing values of precipitation and snowfall were corrected to zero because a missing value means that there was no rain or snow at that time. To process the remaining missing values of air quality data and weather data, we resampled them using the LSTM model. When resampling the air pollution information, we first corrected the missing values using the average value and then evaluated the predictions (Table 9). Because the air quality values are small in magnitude, it was difficult to judge from MAE and RMSE alone whether they were predicted appropriately. Accordingly, we also evaluated the air pollution information with a comparison graph between the actual values (blue line) and the predicted values (orange line; Figure 5). The comparison showed that NO2 and O3 were predicted closely, but SO2 and CO were not. We therefore resampled the missing values of NO2 and O3 using the predicted values and corrected the missing values of SO2 and CO using the average. When resampling the weather data, we likewise first corrected the missing values using the average value. We then predicted the weather data for evaluation (Table 10) and compared the predicted values (orange line) with the actual values (blue line) in a graph (Figure 6). The comparison showed that atmospheric pressure, temperature, wind speed, wind direction, and humidity were predicted closely, so we resampled the missing values of the weather data using the predicted values.
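The range-based outlier correction can be sketched as follows. The valid range used in the example (0 to 100 for humidity) is a hypothetical stand-in for the ranges in Table 5, and replacing outliers with the mean of the in-range values is an assumption about which average was used.

```python
import numpy as np
import pandas as pd

def replace_out_of_range(series, low, high):
    """Replace values outside [low, high] with the mean of in-range values.

    Missing values are left untouched so they can be handled separately.
    """
    valid = series.between(low, high)      # NaN counts as not valid here
    mean_valid = series[valid].mean()
    keep = valid | series.isna()           # keep in-range values and NaN
    return series.where(keep, mean_valid)

humidity = pd.Series([50.0, 60.0, -5.0, np.nan, 70.0])
cleaned = replace_out_of_range(humidity, 0.0, 100.0)
print(cleaned.tolist())
```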

Comparison of PM Prediction Accuracy According to SO2
As a baseline for comparing the effect of each air quality variable, we first predicted PM10 and PM2.5 using only weather data (Table 11, Figure 7). In the figures, the orange line is the predicted value, and the blue line is the actual value. Based on the results in Table 11, we analyzed the correlation for each air quality variable. We analyzed the correlation between SO2 and PM by comparing predictions made with weather data alone against predictions made with weather data plus SO2. As a result, the RMSE and MAE of PM2.5 and PM10 were lowered (Table 12); hence SO2 is related to PM2.5 and PM10. The actual and predicted values of PM10 and PM2.5 were compared (Figure 8). We confirmed that the accuracy of PM prediction is improved via SO2.
Table 12. Evaluation of the predicted PM using weather data with SO2.

Comparison of PM Prediction Accuracy According to CO
We analyzed the correlation between CO and PM by comparing predictions made with weather data alone against predictions made with weather data plus CO. As a result, the RMSE and MAE of PM2.5 and PM10 were lowered (Table 13); hence CO is related to PM2.5 and PM10. The actual and predicted values of PM10 and PM2.5 were compared (Figure 9). The orange line is the predicted value, and the blue line is the actual value. We confirmed that the accuracy of PM prediction is improved via CO.

Comparison of PM Prediction Accuracy According to NO2
We analyzed the correlation between NO2 and PM by comparing predictions made with weather data alone against predictions made with weather data plus NO2. As a result, the RMSE and MAE of PM2.5 and PM10 were lowered (Table 14); hence NO2 is related to PM2.5 and PM10. The actual and predicted values of PM10 and PM2.5 were compared (Figure 10). The orange line is the predicted value, and the blue line is the actual value. We confirmed that the accuracy of PM prediction is improved via NO2.
Table 14. Evaluation of the predicted PM using weather data with NO2.

Comparison of PM Prediction Accuracy According to O3
We analyzed the correlation between O3 and PM by comparing predictions made with weather data alone against predictions made with weather data plus O3. The RMSE and MAE of PM10 were lowered, whereas the MAE of PM2.5 was higher (Table 15); hence O3 is related to PM10. The actual and predicted values of PM10 and PM2.5 were compared (Figure 11). The orange line is the predicted value, and the blue line is the actual value. We confirmed that the accuracy of PM10 prediction is improved via O3.

Comparison of PM Prediction Accuracy According to Atmospheric Pressure
We analyzed the correlation between atmospheric pressure and PM by comparing predictions made with weather data alone against predictions made with weather data plus atmospheric pressure. As a result, the RMSE and MAE of PM10 were lowered, whereas the RMSE and MAE of PM2.5 were higher (Table 16); hence atmospheric pressure is related to PM10. The actual and predicted values of PM10 and PM2.5 were compared (Figure 12). The orange line is the predicted value, and the blue line is the actual value. We confirmed that the accuracy of PM10 prediction is improved via atmospheric pressure.
Table 16. Evaluation of the predicted PM using weather data with atmospheric pressure.

Comparison of PM Prediction Accuracy According to Wind_x, Wind_y
We examined whether the accuracy improves when wind speed and wind direction are converted to wind_x and wind_y through Equations (1) and (2), discussed in Section 2.3.5. After removing wind speed and wind direction, we added wind_x and wind_y and evaluated the performance (Table 17). We confirmed that the accuracy of PM prediction is improved via wind_x and wind_y (Figure 13).

Comparison of PM Prediction Accuracy According to Address Format
We used the address format as discussed in Section 2.3.3. The results were compared with Table 9 and confirmed to be improved (Table 18). This suggests that separating the address into city and road components captures the influence of adjacent cities and provinces. Figure 14 shows the actual and predicted values of PM10 and PM2.5. The orange line is the predicted value, and the blue line is the actual value. We confirmed that the accuracy of PM prediction is improved via the address format.
Table 18. Evaluation of the predicted PM using weather data with the address format.

Comparison of PM Prediction Accuracy According to 24 Solar Terms
We used the 24 solar terms as discussed in Section 2.3.4. The results were compared with Table 7 and confirmed to be improved (Table 19). We confirmed that the accuracy of PM prediction is improved via the 24 solar terms (Figure 15).

Comparison of PM Prediction Accuracy According to Sequence Length
We evaluated the performance according to the sequence length with the data parameters in Table 5. Because the data are hourly, one day corresponds to a sequence length of 24. The comparison confirmed that the best result was a sequence length of 7 days (Table 20), which generated the lowest values for PM2.5 MAE, PM2.5 RMSE, PM10 MAE, and PM10 RMSE. The reduced MAE and RMSE values are highlighted in the table.
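A 7-day sequence of hourly data corresponds to 7 × 24 = 168 time steps per input window. The sketch below builds such windows with NumPy, assuming that each window is used to predict the value of the hour immediately following it; the paper does not state the prediction horizon, and the function name is hypothetical.

```python
import numpy as np

HOURS_PER_DAY = 24

def make_windows(features, targets, days=7):
    """Build (samples, seq_len, n_features) inputs and next-hour targets.

    Requires more than days * 24 rows of data.
    """
    seq_len = days * HOURS_PER_DAY
    X, y = [], []
    for start in range(len(features) - seq_len):
        X.append(features[start:start + seq_len])   # 168 consecutive hours
        y.append(targets[start + seq_len])          # the hour right after
    return np.stack(X), np.stack(y)

# Hypothetical data: 200 hours, 3 input features, 1 target series.
hourly_features = np.arange(200 * 3, dtype=float).reshape(200, 3)
hourly_targets = np.arange(200, dtype=float)
X, y = make_windows(hourly_features, hourly_targets, days=7)
print(X.shape, y.shape)
```

During training, windows like these would be grouped into mini-batches; with the batch size of 96 reported in the paper, each batch would hold 96 such windows.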

Comparison of PM Prediction Accuracy According to Batch Size
We evaluated the performance according to the batch size and found the best result with a batch size of 96 (Table 21). The reduced MAE and RMSE values are highlighted in the table. Based on these experimental results, we evaluated the LSTM, CNN, and GRU models with a sequence length of 7 days and a batch size of 96 (Table 22); the lowest MAE and RMSE values are highlighted. The LSTM model appeared to be highly accurate on the training data (Figure 16), but the test data confirmed that the predictions of the CNN model did not resemble the actual values (Figure 17). In the figures, the gray line is the actual value, the red line is the GRU prediction, the green line is the CNN prediction, and the blue line is the LSTM prediction. For each model, the difference between the actual and predicted values differed between the training data and the test data. In particular, the CNN model showed some similarity on the training data but could not predict the test data. We confirmed that the highest accuracy was obtained using the LSTM model (Table 22).

Conclusions
We reviewed existing PM prediction models. We analyzed the correlation between PM and address format, air quality data, and weather data. We developed a PM prediction model using the 24 solar terms provided by the Korea Meteorological Administration, weather data, and air quality data provided by AirKorea. Our paper makes the following key contributions to the literature. 1) It was confirmed that the address format improves the accuracy when developing prediction models that are affected by region, such as those for PM and weather. 2) It was confirmed that the 24 solar terms improve the accuracy when developing forecasting models that are affected by season, such as those for PM and weather. 3) It was confirmed that air quality data improve the accuracy when developing the PM prediction model. 4) It was confirmed that atmospheric pressure improves the accuracy when developing the PM prediction model. For future research, the authors will apply visualization such as a map and will apply hybrid models of the LSTM model and other deep learning models (e.g., RNN, CNN, GRU, DAE, Q-Networks) to improve the accuracy of PM prediction.

Figure 2. Structure of the RNN model.

Figure 4. Structure of the GRU model.

Figure 5. A comparison graph of actual and predicted values for NO2, SO2, O3, and CO. (a) Predicted and actual values of NO2; (b) predicted and actual values of SO2; (c) predicted and actual values of O3; (d) predicted and actual values of CO.

Figure 6. A comparison graph of actual and predicted values for atmospheric pressure, temperature, wind speed, wind direction, and humidity. (a) Predicted and actual values of humidity; (b) predicted and actual values of wind speed; (c) predicted and actual values of atmospheric pressure; (d) predicted and actual values of wind direction; (e) predicted and actual values of temperature.

Figure 7. A graph of the actual and predicted values of PM10 and PM2.5. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 8. A graph of the actual and predicted values of PM10 and PM2.5 using weather data with SO2. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 9. A graph of the actual and predicted values of PM10 and PM2.5 using weather data with CO. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 10. A graph of the actual and predicted values of PM10 and PM2.5 using weather data with NO2. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 11. A graph of the actual and predicted values of PM10 and PM2.5 using weather data with O3. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 12. A graph of the actual and predicted values of PM10 and PM2.5 using weather data with atmospheric pressure. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 13. A graph of the actual and predicted values of PM10 and PM2.5 using weather data with wind_x and wind_y. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 14. A graph of the actual and predicted values of PM10 and PM2.5. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 15. A graph of the actual and predicted values of PM10 and PM2.5. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Figure 16. A graph of actual and predicted values of PM10 and PM2.5 using training data. (a) Predicted and actual values of PM2.5; (b) predicted and actual values of PM10.

Table 1. Air Quality Standard.

Table 2. Stages of PM in the Republic of Korea.

Table 3. The number of air pollution stations in each city and province.

Table 4. The number of weather stations in each city and province.

Table 6. Variables of the LSTM model.

Table 7. Variables of the GRU model.
t - 1: The previous state.
j: Index of the hidden unit.
r: Unit of the reset gate of the current state when calculating the j-th hidden unit. It decides whether the previous hidden state is ignored.
z: Unit of the update gate of the current state when calculating the j-th hidden unit. It selects whether the hidden state is to be updated with a new hidden state.
U: Weight matrix of the reset gate.
W: Weight matrix of the reset gate.
h̃: A new hidden state of the current state when calculating the j-th hidden unit.

Table 8. Variables used in MAE and RMSE.

Table 9. Results of MAE and RMSE of air quality data.

Table 10. Results of MAE and RMSE of weather data.

Table 11. Evaluation of the predicted PM using weather data.

Table 13. Evaluation of the predicted PM using weather data with CO.

Table 15. Evaluation of the predicted PM using weather data with O3.

Table 17. Evaluation of the predicted PM using weather data with wind_x and wind_y.

Table 19. Evaluation of the predicted PM using weather data with 24 solar terms.

Table 20. Performance evaluation results according to the sequence length.

Table 21. Performance evaluation results according to the batch size.

Table 22. Performance evaluation results according to each model.