3.1. Trend and Correlation Analysis for Input Parameters
Table 4 displays the basic statistics of the air pollutant factors at the nationwide and administrative district levels. Based on Table 4 and the South Korea comprehensive air quality index (Table S1), PM10 and PM2.5 are at normal levels on average, whereas NO2, SO2, O3, and CO are at good levels. However, the maximum values in Table 4 show harmfully high levels of PM10 and PM2.5 and bad levels of NO2 and O3, depending on the region. Table S2 provides further details on PM2.5, which is the response variable in this study. Over 80% of the PM2.5 daily averages were good or normal, whereas very bad concentrations were scarce.
Figure 2 shows the daily average of the air pollutant data and its 30-day moving average. Each air pollutant factor had a seasonal pattern; PM10, NO2, SO2, and CO showed a decreasing trend every year, whereas O3 showed an increasing trend. PM10 was high in winter and low in summer. NO2, SO2, CO, and PM2.5 showed patterns similar to that of PM10, as shown in Figure 2. Conversely, O3 was high in summer and low in winter.
Figure 3 shows the daily standard deviation of the air pollutant data and its 30-day moving average. A seasonal pattern was also observed in the standard deviation. As shown in Figure 3, the daily standard deviation of PM2.5 varies seasonally: it is large in winter and small in summer. For NO2, SO2, and CO, the standard deviation in summer decreases each year. As shown in Figure 2 and Figure 3, the CO concentration and its standard deviation are high in winter and low in summer, and both decrease every year. The standard deviation of O3 was large in summer and small in winter (Figure 3). As shown in Figure 2 and Figure 3, PM10, O3, CO, and PM2.5 all had large standard deviations when the daily average was high and small standard deviations when the daily average was low.
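The quantities plotted in Figure 2 and Figure 3 (daily average, daily standard deviation, and their 30-day moving averages) can be sketched as follows. This is a minimal illustration, not the authors' processing code: it assumes hourly readings grouped by day, a population standard deviation, and a trailing (not centered) window, none of which is specified in the text. The function names `daily_stats` and `moving_average` are hypothetical.

```python
from statistics import mean, pstdev


def daily_stats(hourly_by_day):
    """Return (daily means, daily standard deviations).

    hourly_by_day: list of lists, one inner list of hourly readings per day.
    Assumption: population standard deviation (pstdev) within each day.
    """
    means = [mean(day) for day in hourly_by_day]
    stds = [pstdev(day) for day in hourly_by_day]
    return means, stds


def moving_average(values, window=30):
    """Trailing moving average over `window` days.

    Positions that do not yet have a full window are returned as None.
    Assumption: trailing window; the paper does not state the alignment.
    """
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out
```

Applying `moving_average` to both outputs of `daily_stats` gives the two smoothed series; a centered window would shift the curves by about half a window but leave the seasonal pattern unchanged.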
Table 5 presents the basic statistics of the meteorological factors, where precipitation indicates the amount of rain per hour and snowfall indicates the height of accumulated snow recorded by the hourly indicator. Therefore, the daily average snowfall represents the average height of snow above the ground over the day. The daily average snowfall differed significantly between administrative districts. Precipitation and snowfall do not occur every day; on days without them the recorded value is 0, and because neither is observed on more than half of the days of the year, the median is 0 or close to 0.
Figure 4 shows the daily average and 30-day moving average of the meteorological data. Each meteorological factor exhibited a seasonal pattern. Temperature, humidity, and precipitation were high in summer and low in winter, whereas wind speed and atmospheric pressure were high in winter and low in summer. Snowfall was observed only in winter.
Figure 5 shows the daily standard deviation of the meteorological data and its 30-day moving average. The standard deviation showed seasonal patterns similar to those of the daily average. As shown in Figure 5, the standard deviations of temperature, humidity, wind speed, and atmospheric pressure are small in summer and large in winter, whereas the standard deviation of precipitation is large in summer and small in winter. Because snowfall is observed exclusively in winter, its standard deviation appears only during winter.
Because the air pollutant factors are known to affect the occurrence of PM2.5 and the meteorological factors reflect seasonal changes, both are appropriate as independent variables for predicting PM2.5. To examine these relationships in greater depth, the correlations of the air pollutant and meteorological factors with the next-day PM2.5 concentration were analyzed. The independent and dependent variables used to develop the prediction model are summarized in Table 1. The air pollutant and meteorological factors were the independent variables, and the next-day PM2.5 concentration was the dependent variable to be predicted. Because the correlations obtained from the NW data and the AD data are very similar, only the correlations for the NW data are presented.
The results of the correlation analysis between next-day PM2.5 and the air pollutant factors are summarized in Table 6. The daily average concentrations of all air pollutants except O3 (−0.02), namely PM10 (0.57), NO2 (0.62), SO2 (0.52), CO (0.63), and PM2.5 (0.71), showed a high positive correlation with next-day PM2.5. In addition, next-day PM2.5 was positively correlated with the daily variances of the air pollutants, notably those of NO2 (0.58), CO (0.46), and PM2.5 (0.49). Based on these results, using the daily variances of air pollutants as independent variables is reasonable.
Table 7 shows the correlation between next-day PM2.5 and the meteorological factors. In terms of the daily average, next-day PM2.5 was positively correlated with atmospheric pressure (0.29) and snowfall (0.05). Temperature (−0.31), humidity (−0.21), wind speed (−0.32), and precipitation (−0.26) were negatively correlated with next-day PM2.5. In terms of the daily variance, next-day PM2.5 was positively correlated with temperature (0.40), humidity (0.30), and snowfall (0.02), and negatively correlated with the remaining factors. In general, the meteorological factors had lower correlations than the air pollutant factors, but they were used in developing the prediction model as important independent variables that capture seasonal changes.
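The next-day correlations reported in Table 6 and Table 7 pair each factor on day t with the PM2.5 concentration on day t+1. A minimal sketch of that pairing is given below, assuming the Pearson coefficient (the text does not name the coefficient used) and two equally long daily series; the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def next_day_correlation(factor, pm25):
    """Correlate a factor on day t with PM2.5 on day t+1.

    The last factor value has no next-day target and the first PM2.5 value
    has no previous-day factor, so both are dropped before correlating.
    """
    return pearson(factor[:-1], pm25[1:])
```

Passing the PM2.5 series as both arguments gives its lag-1 autocorrelation, which is how the same-pollutant entry in Table 6 can be read.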
3.2. PM2.5 Prediction Model for Nationwide
According to the factor levels in Table 2, 18 prediction models were developed for NW. Table 8 shows the analysis of variance (ANOVA) table that statistically explains the results of the factorial design experiment. The response variable in this table is the out-of-sample R-squared. At a significance level of 0.05, all three main effects (hidden layer layout, month, and threshold) affected R-squared. Two interaction effects (month and hidden layer layout, and hidden layer layout and threshold) also affected R-squared. Because all three factors affected the predictive model, it is appropriate to search for the optimal model across the factor levels.
Figure 6a,b show the in-sample and out-of-sample R-squared, respectively, of the 18 prediction models developed according to the factorial design. The higher the R-squared value, the better the predictive ability of the model. However, to avoid overfitting, the difference between the in-sample and out-of-sample R-squared should be minimal. Thus, among the models whose in-sample and out-of-sample R-squared differed by no more than 10%, the model with the highest average R-squared (the mean of the in-sample and out-of-sample values) was selected as the optimal model. Models that included the month factor showed higher R-squared, and the in-sample R-squared decreased as the threshold increased. However, when the threshold increased from 0.01 to 0.05 or 0.1, the gap between the in-sample and out-of-sample R-squared decreased, which reduced the overfitting problem. Therefore, with the overfitting problem reduced, the model with the highest average R-squared (month factor: included, hidden layer layout: 24 × 24, threshold: 0.05) was determined to be the optimal model. Assuming accurate weather prediction, when the meteorological factors of the forecast day were used instead of those of the previous day, R-squared increased from 0.65 to 0.81 for in-sample and from 0.64 to 0.73 for out-of-sample, as shown in Table 9. The absolute error in n σ also increased, indicating that the accuracy of the model could be further improved if accurate weather predictions were used.
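The selection rule described above (keep the models whose in-sample and out-of-sample R-squared are close, then take the highest average R-squared) can be sketched as follows. This is an illustration under assumptions: "within 10%" is read here as an absolute gap of at most 0.10 in R-squared, the dictionary keys and the function name `select_optimal` are hypothetical, and the numbers in the usage example are made up rather than taken from Figure 6.

```python
def select_optimal(models, max_gap=0.10):
    """Pick the model with the highest average R-squared among those whose
    in-sample/out-of-sample gap does not exceed `max_gap`.

    models: list of dicts with keys "name", "r2_in", "r2_out" (hypothetical).
    Assumption: the 10% criterion is an absolute R-squared difference.
    Returns None when no model satisfies the gap criterion.
    """
    candidates = [
        m for m in models if abs(m["r2_in"] - m["r2_out"]) <= max_gap
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: (m["r2_in"] + m["r2_out"]) / 2)


# Illustrative values only (not results from the study).
example_models = [
    {"name": "A", "r2_in": 0.90, "r2_out": 0.60},  # large gap: excluded
    {"name": "B", "r2_in": 0.66, "r2_out": 0.62},  # average 0.64
    {"name": "C", "r2_in": 0.60, "r2_out": 0.58},  # average 0.59
]
best = select_optimal(example_models)
```

In this made-up example, model A has the highest in-sample R-squared but is excluded by the gap filter, so the rule trades some apparent fit for stability between the two samples.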
The predicted and actual values of the optimal prediction model are presented in Figure 7. Most of the points lie close to the centerline, indicating that the model adequately predicted PM2.5. Figure 7b shows a slightly smaller difference between the predicted and measured values than Figure 7a.
3.3. PM2.5 Prediction Model for Administrative Districts
The PM2.5 prediction model for AD was developed in the same way as the NW PM2.5 prediction model. As shown in Table 10, at a significance level of 0.05, the two main effects of the hidden layer layout and the threshold were significant, and the interaction between them was the most significant among all the interaction effects. Unlike in the NW analysis, the month factor was not clearly significant in the ANOVA.
As with the NW prediction model, the models with the month factor had a higher overall R-squared than the models without it (Figure 8). Although the effect is not as sharp as for the NW model in Figure 6, the in-sample R-squared decreased as the threshold increased. Figure 8 also shows that, as the number of nodes in the hidden layer layout decreased and the threshold increased, the gap between the in-sample and out-of-sample R-squared decreased and the overfitting problem was reduced. Therefore, the model with the highest average R-squared and a difference of less than 10% between the in-sample and out-of-sample R-squared was selected as the optimal PM2.5 prediction model. This model included the month factor, had 12 nodes in the hidden layer layout (6 × 6), and used a threshold of 0.05.
Table 11 shows the R-squared and the absolute errors in n σ of the optimal PM2.5 prediction model for AD (month factor: included, hidden layer layout: 6 × 6, threshold: 0.05). As shown in Figure 8 and Table 5, the accuracy of the PM2.5 prediction model for AD is slightly lower than that of the NW PM2.5 prediction model. Without weather forecasting, the R-squared values of the AD PM2.5 prediction model are 0.65 (in-sample) and 0.55 (out-of-sample), and the absolute error in 1 σ is 88% and 84%, respectively. Assuming accurate weather prediction, when the meteorological factors of the forecast day were used instead of those of the previous day, R-squared increased from 0.65 to 0.73 for in-sample and from 0.55 to 0.65 for out-of-sample. The accuracy of the NW PM2.5 prediction model can be improved with high-accuracy weather forecasting.
Figure 9 presents the predicted and measured values of the optimal PM2.5 prediction model for AD. PM2.5 in each region is accurately predicted without considering regional variables. Figure 9b shows a slightly smaller difference between the predicted and measured values than Figure 9a. In future studies, a regional variable should be considered, or a separate regional model should be developed, to further improve the predictive power of the model.
Next-day PM2.5 prediction models are compared in Table 12. The PM2.5 prediction model developed in this study is superior to those of the two preceding studies in terms of R-squared and mean absolute error (MAE). However, because the data used in each study are different, comparing only R-squared and MAE may be inappropriate.