Visibility Prediction over South Korea Based on Random Forest

In this study, the visibility of South Korea was predicted (VISRF) using a random forest (RF) model based on ground observation data from the Automated Synoptic Observing System (ASOS) and air pollutant data from the European Centre for Medium-Range Weather Forecasts (ECMWF) Copernicus Atmosphere Monitoring Service (CAMS) model. Visibility was predicted and evaluated using a training set for the period 2017–2018 and a test set for 2019. VISRF results were compared and analyzed using visibility data from the ASOS (VISASOS) and the Unified Model (UM) Local Data Assimilation and Prediction System (LDAPS) (VISLDAPS) operated by the Korea Meteorological Administration (KMA). Bias, root mean square error (RMSE), and correlation coefficients (R) for the VISASOS and VISLDAPS datasets were 3.67 km, 6.12 km, and 0.36, respectively, compared to 0.14 km, 2.84 km, and 0.81, respectively, for the VISASOS and VISRF datasets. Based on these comparisons, the applied RF model offers significantly better predictive performance and more accurate visibility data (VISRF) than the currently available VISLDAPS outputs. This modeling approach can be implemented by authorities to accurately estimate visibility and thereby reduce accidents, risks to public health, and economic losses, as well as inform on urban development policies and environmental regulations.


Introduction
Visibility refers to the maximum horizontal distance visible to the human eye as a measure of the distance through which an object or light can be identified. The World Meteorological Organization (WMO) [1] defines the meteorological optical range (MOR) as the distance at which light intensity decreases to 5% of its original level [2]. This provides a useful metric of visibility based on distinguishing and measuring objects, such as buildings, identified by human observation or by detecting the amount of attenuated or scattered light using optical sensors [3]. Visibility is often measured using networks of visibility sensors, such as Vaisala or Biral sensors, and can reach 300 km when only Rayleigh scattering and gas absorption are taken into account [4]; however, distances of 145-225 km are typical for unpolluted atmospheric conditions, and distances of 10-100 km are commonly recorded [5]. Importantly, precipitation and air pollution can have significant impacts on visibility, with low-visibility conditions of a few kilometers not uncommon [6], and such effects have impacts on weather and climate change [7,8]. In addition, low visibility is linked to socioeconomic losses, including road and air traffic accidents and public health risks [9][10][11].
Visibility is generally predicted spatiotemporally using numerical weather prediction (NWP) models such as the Unified Model (UM) and the Weather Research and Forecasting (WRF) model [12,13]. NWP models predict visibility based on various meteorological variables, including the liquid/ice water content of clouds and water droplets, aerosol concentrations, and rain/snow, which are included in the parameterization of cloud physics and microphysical processes [14–16]. However, because visibility is sensitive to a range of variables, prediction based on NWP is challenging and yields poor predictive performance compared with other meteorological variables such as precipitation [3,10,12,17–19]. Therefore, various studies have been reported that focus on parameterization improvement [14,20,21],

Table 1. Ground-based observation variables at the 72 ASOS sites (numbers in parentheses are the KMA site classification numbers) and meteorological forecasting variables of the CAMS model.

The air pollutant datasets included particulate matter (sea salt (SS), dust (DU), organic matter (OM), black carbon (BC), and sulfate (SU)) [11,22] as well as ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), and carbon monoxide (CO), all of which affect visibility [7,32,49,50]. Together with the Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2) [51,52], these are the only model prediction datasets offering global-scale air pollutant information, with correlation coefficients in the range 0.5–0.8 and uncertainties in the range 10–20% compared to ground observations [53–55]. However, MERRA-2 has a temporal resolution of 1 h and a relatively low spatial resolution (0.5° × 0.625°), which is unsuitable for local-scale analyses; discrepancies with ground-based measurements can, therefore, be larger for MERRA-2 data than for CAMS model outputs [55]. Furthermore, as the CAMS model provides air pollutant data as mass mixing ratios (MMRs; kg/kg), these data were converted to mass concentrations (µg/m³) using Equation (1) for the RF model input. The CAMS MMRs were assumed to be similar to those in the lower atmosphere, and the ground-level mass concentrations were calculated using the ASOS temperature and pressure observations [47,56].
where P and T are the pressure (Pa) and temperature (K) derived from the ASOS, respectively, and R is the specific gas constant (287.06 J/(kg·K)).
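Equation (1) itself is not reproduced in this text. A minimal sketch of a conversion consistent with the definitions above, assuming the standard ideal-gas form (air density = P/(R·T), mass concentration = MMR × air density), could look as follows. The function name and the example input values are illustrative assumptions, not taken from the study.

```python
# Specific gas constant quoted in the text, J/(kg·K).
R_SPECIFIC = 287.06


def mmr_to_mass_concentration(mmr_kg_per_kg, pressure_pa, temperature_k):
    """Convert a mass mixing ratio (kg/kg) to a mass concentration (µg/m³).

    Assumes the ideal gas law for air density; this is a standard form and
    may differ in detail from Equation (1) of the study.
    """
    air_density = pressure_pa / (R_SPECIFIC * temperature_k)  # kg/m³
    return mmr_kg_per_kg * air_density * 1e9  # kg/m³ -> µg/m³


# Hypothetical example: MMR of 2e-8 kg/kg at 1013.25 hPa and 15 °C.
example = mmr_to_mass_concentration(2.0e-8, 101325.0, 288.15)
print(round(example, 2))
```

Under these assumed inputs, the air density is about 1.225 kg/m³, so an MMR of 2 × 10⁻⁸ kg/kg corresponds to roughly 24.5 µg/m³.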

Evaluation of Predictions
To evaluate the accuracy of the visibility predictions, visibility data from the ASOS and the UM Local Data Assimilation and Prediction System (LDAPS) [57] were used. As shown in Figure 1, the ASOS visibility data were acquired using Biral VPF730 (24 sites) and Vaisala PWD22 (48 sites) sensors at 72 observation sites distributed across South Korea (Table 1). These sensors measure visibility in the range 0.01–20 km with an uncertainty of ±10–15% within the measurement range [58,59]. The red and blue points in Figure 1 indicate the representative sites in urban and island regions analyzed in the results (Section 3). Visibility data from the UM LDAPS are local predictions provided by the Korea Meteorological Administration (KMA) with a spatial resolution of 1.5 × 1.5 km² at 3-h intervals eight times a day (+0–36-h predictions at 3-h intervals at 0000, 0600, 1200, and 1800 UTC and +0-h and +3-h predictions at 0300, 0900, 1500, and 2100 UTC), derived using a three-dimensional variational data assimilation (3DVAR) method [60,61]. We used the analysis field data (+0 h), which are available at 3-h intervals eight times a day. As the UM LDAPS predicts visibility in the range 0.01–100 km, values > 20 km were capped at 20 km (maximum), in accordance with the measurement range of the visibility sensors, before analysis. The same cap was applied to the results predicted using the RF model. For comparison, bias, root mean square error (RMSE), and correlation coefficients (R) were calculated for the visibility data predicted by the RF (VISRF) and LDAPS (VISLDAPS) models, and the visibility measurements from the ASOS (VISASOS), using Equations (2)–(4), where P is the prediction, M is the measurement, and N is the number of data points.
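Equations (2)–(4) are not reproduced in this text. The sketch below assumes the conventional definitions: bias as the mean of P − M (positive when the model over-predicts), RMSE as the square root of the mean squared difference, and R as the Pearson correlation coefficient. It also applies the 20 km cap described above. The function names and sample values are illustrative only.

```python
import math

MAX_VIS_KM = 20.0  # upper limit of the visibility sensors' measurement range


def clip_visibility(values, upper=MAX_VIS_KM):
    """Cap predicted visibility at the sensor maximum (20 km)."""
    return [min(v, upper) for v in values]


def bias(pred, meas):
    """Mean difference P - M; positive values indicate over-prediction."""
    return sum(p - m for p, m in zip(pred, meas)) / len(pred)


def rmse(pred, meas):
    """Root mean square error between predictions and measurements."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, meas)) / len(pred))


def pearson_r(pred, meas):
    """Pearson correlation coefficient between predictions and measurements."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_m = sum(meas) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, meas))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_m = sum((m - mean_m) ** 2 for m in meas)
    return cov / math.sqrt(var_p * var_m)


# Hypothetical values in km; the first prediction exceeds the 20 km cap.
predicted = clip_visibility([25.0, 10.0, 5.0, 18.0])
measured = [20.0, 8.0, 6.0, 15.0]
print(bias(predicted, measured), rmse(predicted, measured), pearson_r(predicted, measured))
```

Capping before scoring matters for a model such as the UM LDAPS, whose raw output extends to 100 km: without the cap, clear-sky cases would inflate both bias and RMSE relative to sensors that cannot report more than 20 km.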

Random Forest (RF) Model Sets
The random forest adopted in this study constructs N decision trees, each grown as a regression tree by combining variables randomly selected at each node, as shown in Figure 2. The results of the individual decision trees are then ensembled to obtain the prediction result [43,44]; thus, in the RF ensemble learning method, each tree contributes to the final prediction [62]. By ensembling the results of all decision trees, the adopted RF method minimizes prediction variability and overfitting, and it is widely applied given its high prediction accuracy and ability to rapidly process large datasets [45]. The 'Ranger' R package was used to construct the RF model; it shows prediction performance similar to that of other RF packages with learning and prediction speeds that are ten times faster [46].

The variables listed in Table 1 were used as training and test datasets in the RF model. The training set was composed of data from 1 January 2017 to 31 December 2018, and the test set was composed of data from 1 January 2019 to 31 December 2019. As hygroscopic aerosols and pollutants show deliquescent properties in wet and high-RH conditions [31], visibility predictions are often limited to days with no precipitation and RH < 80% [9,25,26,28].
Here, to predict visibility under various weather conditions, a visibility prediction model was constructed using a training set containing all the variables shown in Table 1, thus removing the restriction on meteorological conditions. The mtry parameter was set to 4, which is sqrt(number of variables) (default), and the number of trees was set to 500 (default). As the number of trees in the RF model increases, more trees are averaged, which reduces overfitting and yields a more stable prediction model. However, beyond 500 trees, no significant change in the out-of-bag (OOB) estimation result was observed, and only the training time increased [63].

Figure 3 shows the permutation importance of the input variables in the RF model. RH has a well-known influence on visibility, and among the air pollutants, SU and CO have high importance [28,30,64,65]. In addition to the meteorological variables, time and location variables also influence visibility predictions. In comparison, wind direction (WD) had the lowest influence on visibility predictions (0.23) among the input meteorological variables. When this variable was removed from the RF model, the OOB error (RMSE) was 2.46 km and R was 0.85.

Figure 3. Relative importance of meteorological variable inputs in the random forest model. "Before" and "after" indicate the importance before and after removing WD.
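The idea behind permutation importance can be illustrated with a simplified, model-agnostic sketch: shuffle one input variable at a time and record how much the prediction error grows. This is not the OOB-based procedure implemented in the 'Ranger' package; the toy model, the feature layout (column 0 standing in for RH, column 1 for WD), and all names below are assumptions made for illustration.

```python
import math
import random


def rmse(pred, meas):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, meas)) / len(pred))


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse([predict(r) for r in rows], targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(rmse([predict(r) for r in permuted], targets) - baseline)
    return importances


def toy_predict(row):
    """Hypothetical stand-in model: visibility depends on column 0 only."""
    return 20.0 - 0.15 * row[0]


# Column 0 mimics RH (%), column 1 mimics WD (degrees); values are synthetic.
rows = [[float(rh), float(wd)] for rh, wd in zip(range(30, 100, 5), range(0, 350, 25))]
targets = [toy_predict(r) for r in rows]
importance = permutation_importance(toy_predict, rows, targets, n_features=2)
print(importance)
```

In this toy setting, shuffling the column the model ignores leaves the error unchanged, which mirrors the reasoning in the text: a variable with very low importance, such as WD, can be dropped with little effect on the OOB error.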
Figure 4 shows the RF results using training data (2017–2018). The RF model predicts in the direction of lowering the variance but shows a positive bias for low predicted values and a negative bias for relatively high values [66,67]. Therefore, bias correction was performed using the equation Y = 1.18X − 2.49, which increased the predictive performance of the model relative to the VISASOS data, reducing the RMSE from 1.04 km (without bias correction) to 0.88 km (with bias correction).
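The linear bias correction can be sketched as below, assuming that X denotes the raw RF prediction and Y the corrected value (both in km); the text gives only the coefficients, so this reading and the function name are assumptions.

```python
SLOPE = 1.18       # coefficient from the bias-correction equation in the text
INTERCEPT = -2.49  # km, intercept from the same equation


def bias_correct(raw_prediction_km):
    """Apply Y = 1.18 * X - 2.49 to a raw RF visibility prediction (km).

    Assumes X is the uncorrected RF output and Y the corrected value.
    """
    return SLOPE * raw_prediction_km + INTERCEPT


# Hypothetical raw predictions: one low, one high.
low_corrected = bias_correct(5.0)    # lowered, offsetting the positive bias
high_corrected = bias_correct(18.0)  # raised, offsetting the negative bias
print(round(low_corrected, 2), round(high_corrected, 2))
```

With these coefficients, raw predictions below about 13.8 km (2.49/0.18) are reduced and those above are increased, which matches the described pattern of a positive bias at low values and a negative bias at high values.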
Figure 4. (a) Before bias correction; (b) after bias correction.

Figure 5 shows VISASOS, VISLDAPS, and VISRF from 1 January to 31 December 2019 as scatter and time-series plots. Based on Figure 5a, the frequency of VISLDAPS data exceeding 20 km was high (approximately 63% of the entire dataset) irrespective of VISASOS. This indicates that the LDAPS tended to over-predict visibility compared to the observational data. During this period, VISLDAPS predictions showed large differences relative to VISASOS, with an overall bias of 3.67 km, an RMSE of 6.12 km, and an R of 0.35; these results indicate low predictive performance. In comparison, VISRF, predicted using the RF model (Figure 5b), was relatively consistent with the 1:1 line with VISASOS, with a bias of 0.14 km, an RMSE of 2.84 km, and an R of 0.81. Figure 5c shows the time series of the daily mean VISASOS, VISLDAPS, and VISRF for all of the observation sites. VISLDAPS predicted values (following the same method currently employed by the KMA) were high relative to VISASOS observations owing to the large number of cases exceeding 20 km (mean = 17.98 km), with a bias of 3.61 km, an RMSE of 4.53 km, and an R of 0.71. Comparatively, the RF-derived VISRF estimates show better predictive performance, with a bias of 0.12 km, an RMSE of 1.54 km, and a high R of 0.89.
Table 2 shows monthly comparisons of VISASOS, VISLDAPS, and VISRF alongside monthly mean data for the meteorological variables used in the RF model. The Korean Peninsula experiences a rainy season from late spring to late autumn (April–October), and in summer (June–August), the temperature and RH are high, with many cloudy days due to general low-pressure conditions [68–70]; meteorological conditions have the opposite characteristics during the winter season (September–February). Therefore, during the rainy season, with the exceptions of SS, DU, and O3, which naturally occur due to higher RH and rainfall, the mass concentrations of anthropogenic air pollutants tend to decrease [34,71]. SS and O3 generally have high mass concentrations on islands and near coastal areas, increasing in association with precipitation, surface temperature, and wind speed during the rainy season [72]. In the case of DU, mass concentrations increase via inward transport from surrounding dry areas, such as large cities and deserts, during spring and winter when high pressure and sunny conditions dominate [26,73,74]. In contrast, during the winter and spring dry season (November–March), OM, BC, SU, NO2, SO2, and CO tend to increase as a result of fossil fuel burning for heating [53,75–77]. Based on the CAMS model data analyses, these air pollutants showed strong positive correlations (R = 0.53–0.98); the VISASOS data are strongly affected by RH, precipitation, and air pollutants, although these effects vary seasonally. Overall, VISASOS was negatively correlated with RH, precipitation, and air pollutants, with mean R-values of −0.57, −0.25, and −0.20 during the rainy season, and −0.59, −0.16, and −0.48 during the dry season, respectively.

Results and Discussion
The VISLDAPS predictions in Table 2 were approximately 3 km higher than the VISASOS observations, with a mean RMSE of 5 km or more and R-values ranging from as low as 0.11 to 0.54. The VISLDAPS predictions tended to overestimate visibility under clear conditions and showed poor predictive performance on a monthly timescale. In contrast, the VISRF predictions had a bias of just −0.22 km, an RMSE of 3.01 km, and an R of 0.75 during the rainy season, compared with a bias of 0.63 km, an RMSE of 2.56 km, and an R of 0.87 during the dry season. These results indicate that the predictive performance of the RF model is best during the dry season, when RH and precipitation are low and the mass concentrations of air pollutants are high, and poorer during the rainy season, when the opposite conditions prevail [29,31]. For example, in September, when RH was as high as 81.96%, precipitation was 1.00 mm (approximately 10 precipitation days), and the mass concentrations of air pollutants were relatively low, the VISRF predictions had a bias of 0.17 km, an RMSE of 3.26 km, and an R of 0.71, corresponding to the poorest predictive performance. By comparison, in January, when the RH was as low as 54.66%, precipitation was 0.16 mm (approximately 3 precipitation days), and the mass concentrations of air pollutants were relatively high, the VISRF predictions had a bias of 0.61 km, an RMSE of 2.23 km, and an R of 0.91, corresponding to the best predictive performance.
In the selected urban region (Figure 6a and Table 3), RH was 9.53% lower than in the island region during the study period, precipitation was 0.26 mm lower, and 22 fewer precipitation days occurred, whereas the mass concentrations of air pollutants were between 33% (SU) and 1150% (NO2) higher. In particular, for Seoul (#108) in the urban region, with a population of close to 10 million, the mass concentrations of air pollutants were between 141% (DU) and 4435% (NO2) higher than at Ulleungdo (#115) in the island region, which had the cleanest air quality (OM: 530%, BC: 700%, SU: 208%, SO2: 2243%, CO: 379%). The correlations (R) between the VISRF predictions and anthropogenic air pollutants in the urban region ranged from −0.52 (SO2) to −0.72 (SU) (Table 4), and similar relationships were observed between the VISASOS observations and these variables. SU is known to strongly attenuate visibility, worsening visibility relative to other pollutants including OM, BC, and CO [32,50,65]. In particular, as the mass concentrations of air pollutants are highest during the dry season, visibility in the urban region was greatly reduced compared to the island region during this season (mean visibility was 0.72 km lower; Figure 6c). During the dry season, VISRF predictions for the urban region had a bias of −0.26 km, an RMSE of 2.10 km, and an R of 0.85.
In comparison, in the selected island region (Figure 6b), the correlations between the VISRF predictions and RH and precipitation were higher than in the urban region (R = −0.72 and −0.41, respectively) (Table 4). Therefore, in the island region, VISASOS was low during the rainy season when precipitation was high (Figure 6d), and the VISRF predictions were very similar to the observations, with a bias of 0.01 km, an RMSE of 1.74 km, and an R of 0.89. In contrast, the VISLDAPS results over-predicted visibility in the island region, where the mass concentrations of air pollutants were low, with a bias of 4.23 km, an RMSE of 5.45 km, and an R of 0.40. Thus, the UM LDAPS visibility predictions had a %bias and %RMSE of 27.84% and 35.88%, respectively, relative to the mean visibility; these results indicate a high uncertainty in visibility prediction. Therefore, the RF model constructed from the training set that included ground-based observational data and air pollutant data yielded better predictive performance than the NWP model.
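The %bias and %RMSE figures can be checked with a short calculation, assuming they are the bias and RMSE divided by the mean visibility and expressed as percentages. The mean visibility is not stated in the text; the value of about 15.19 km used below is back-calculated from the reported numbers and should be read as an assumption.

```python
def percent_of_mean(metric_km, mean_visibility_km):
    """Express an error metric (km) as a percentage of the mean visibility."""
    return 100.0 * metric_km / mean_visibility_km


# Assumed mean visibility (km), inferred from 4.23 km / 27.84%; not from the text.
ASSUMED_MEAN_VIS_KM = 15.19

pct_bias = percent_of_mean(4.23, ASSUMED_MEAN_VIS_KM)  # VISLDAPS island bias
pct_rmse = percent_of_mean(5.45, ASSUMED_MEAN_VIS_KM)  # VISLDAPS island RMSE
print(round(pct_bias, 2), round(pct_rmse, 2))
```

Both reported percentages are reproduced to within rounding by the same assumed mean, which suggests that the two figures share a single normalization.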

Conclusions
In this study, an RF machine learning model was constructed using meteorological data (T, P, RH, WS, and precipitation) observed by the ASOS, and a range of aerosol and pollutant (SS, DU, OM, BC, SU, O3, NO2, SO2, and CO) datasets from the ECMWF CAMS model to predict visibility over South Korea. Data (3-h resolution) for the period between 1 January 2017 and 31 December 2018 were used as the model training dataset, and data for 2019 were used as the test dataset. The visibility predictions were analyzed in comparison with VISLDAPS outputs of the UM LDAPS model administered by the KMA, and VISASOS observational data from 72 ASOS sites across South Korea. VISLDAPS values tended to over-predict visibility under relatively clear conditions (bias = 3.67 km, RMSE = 6.12 km, and R = 0.36), whereas the VISRF predictions showed that the RF model provides excellent predictive performance (bias = 0.14 km, RMSE = 2.84 km, and R = 0.81).
The RF model showed the best predictive performance during the dry season (bias = 0.63 km, RMSE = 2.56 km, and R = 0.87) when the RH and precipitation are low and the mass concentrations of air pollutants are high. Furthermore, in the urban region, with high mass concentrations of air pollutants, the correlation (R) between the VISRF predictions and pollutants ranged from −0.52 (SO2) to −0.72 (SU); in the island region, with warmer and wetter conditions, the correlations (R) between the VISRF predictions and RH and precipitation were −0.72 and −0.41, respectively. These results demonstrate that the predictions of the RF model reflected the meteorological conditions most strongly affecting visibility in both urban (bias = −0.26 km, RMSE = 2.10 km, and R = 0.85) and island (bias = 0.01 km, RMSE = 1.74 km, and R = 0.89) regions. In contrast, the VISLDAPS predictions showed a bias, RMSE, and R of 1.01 km, 3.28 km, and 0.63 for the urban region and 4.23 km, 5.45 km, and 0.40 for the island region, indicating generally poor predictive performance.
Based on these results, the VISRF predictions derived using the RF model demonstrate excellent predictive performance, offering a suitable replacement for VISLDAPS predictions. In this study, visibility was predicted using ASOS data from the KMA meteorological observation network. In the future, if data from a denser observational network, including the Automatic Weather Station (AWS) and ASOS sites distributed across South Korea, are used, higher predictive performance is expected [79–81]. Given that visibility is a useful indicator of meteorological and climatic change, the impacts of changes in air quality and visibility due to anthropogenic sources can be accurately estimated using this modeling approach [26]. Thus, the suggested approach will be helpful in reducing traffic accidents, economic losses, and public health risks associated with atmospheric pollution and visibility, and accurate visibility predictions are essential to inform urban development policy and environmental control interventions.

Data Availability Statement: ECMWF CAMS datasets were downloaded from https://apps.ecmwf.int/datasets/data/cams-nrealtime/levtype=sfc/ (accessed on 24 April 2021) and https://apps.ecmwf.int/datasets/data/cams-nrealtime/levtype=pl/ (accessed on 24 April 2021).