Comparing Methods to Impute Missing Daily Ground-Level PM10 Concentrations between 2010–2017 in South Africa

Good quality and completeness of ambient air quality monitoring data is central in supporting actions towards mitigating the impact of ambient air pollution. In South Africa, however, availability of continuous ground-level air pollution monitoring data is scarce and incomplete. To address this issue, we developed and compared different modeling approaches to impute missing daily average particulate matter (PM10) data between 2010 and 2017 using spatiotemporal predictor variables. The random forest (RF) machine learning method was used to explore the relationship between average daily PM10 concentrations and spatiotemporal predictors like meteorological, land use and source-related variables. National (8 models), provincial (32) and site-specific (44) RF models were developed to impute missing daily PM10 data. The annual national, provincial and site-specific RF cross-validation (CV) models explained on average 78%, 70% and 55% of ground-level PM10 concentrations, respectively. The spatial components of the national and provincial CV RF models explained on average 22% and 48%, while the temporal components of the national, provincial and site-specific CV RF models explained on average 78%, 68% and 57% of ground-level PM10 concentrations, respectively. This study demonstrates a feasible approach based on RF to impute missing measurement data in areas where data collection is sparse and incomplete.


Introduction
Ambient particulate air pollution is a major environmental risk to health. An estimated 4.14 million mortality in 2019 was associated with exposure to ambient air pollution [1]. Routine ambient air quality measurements at a sufficient spatial and temporal scale are essential for the management and evaluation of ambient air pollution regulations, policies and mitigation measures. They are also crucial for calibrating air pollution statistical models for accurate exposure assessment in epidemiological studies investigating the link between air pollution and health. However, in low-and middle-income countries (LMIC), routine air pollution monitoring stations are sparse due to the limited financial, human and technical capacities to manage these monitoring networks [2,3]. The lack of air pollution measurements in LMIC obstructs the development of aforementioned air pollution models for estimating ambient air pollution exposures and thus informing population health studies. Particulate matter less than or equal to 10 micrometers in aerodynamic diameter (PM 10 µg/m 3 ) is associated with acute and chronic adverse health outcomes and it is of high public health significance globally [1, 4,5]. PM 10 is one of the criteria air pollutants in most countries including South Africa, it is measured in South Africa by an air quality monitoring network managed by three levels of government (National, Provincial and Metropolitan/Local) and privately managed air quality monitoring stations [6]. However, due to limited air quality management capacities, these monitoring stations are concentrated around the designated air pollution priority areas. To date, four areas (located in four of the nine provinces) of South Africa have been designated an air pollution priority area; Vaal triangle, Highveld, South Durban Basin and Waterberg based on historical evidence of poor ambient air quality due to the presence of possible source of air pollution [7]. The quality of available data is a major concern with only a small number of South Africa's ambient air pollution monitoring stations accredited by South African National Accreditation System [8].
In South Africa, air quality measurements are often missing due to various reasons such as vandalization of monitoring facilities, and periodic interruption of measurements due to electrical shut down or breakdown of monitoring equipment. This has led to a significant number of monitoring stations being out of operation for months or years resulting in long time-series of PM 10 measurements missing [9]. Inconsistent air quality data hampers epidemiological studies in South Africa from investigating the association between air pollution and health. Previous studies in South Africa have documented the trends in air pollutants for raising public health awareness about the need for air pollution control [10][11][12][13].
Univariable methods of unconditional mean or median, nearest neighbour have been compared with multivariable methods from regression models using other environmental predictors' for imputing daily PM 10 measurements [14,15]. Multivariable methods were reported to be more robust in performance when the proportion of missing data are higher than 10% [14]. The relationship between PM 2.5 and PM 10 at co-located monitoring sites was explored using multivariable methods with the aim to predict PM 2.5 at sites with PM 10 data only in Switzerland and India [16,17]. However, this approach is not feasible in South Africa due to the paucity of PM 2.5 data as it was only designated a criteria air pollutant in 2012 [18].
Random forest (RF)-a machine learning method can be classified as a multivariable method that aggregates the predictions of several regression trees to improve the performance of single regression models. Several studies have been published using RF and other multivariable models to predict missing air pollutants in areas with no or sparse monitoring networks [16,17,[19][20][21]. However, this study aims to leverage on the spatial and temporal dependence characteristics of air pollutants [22,23], by combining observed PM 10 data with spatial and temporal predictors as well as chemical transport estimates of PM 10 , ozone (O 3 ) and nitrogen dioxide (NO 2 ) in a RF model to predict missing daily PM 10 observation in some monitoring stations across four provinces of South Africa for years 2010-2017. The result of this analysis will be subsequently used to construct models to predict PM 10 in areas without monitoring sites.

Methods
The RF machine learning method was employed to accommodate the non-linear relationship between PM 10 measurements and covariates. For each year we constructed RF models at 3 geographical scales to predict missing daily PM 10 data: (1) one national model, using all daily PM 10 measurements from the four provinces combined; (2) four provincial models using daily PM 10 monitoring measurements from sites within each province; and (3) site specific models exclusively using daily PM 10 measurements from individual sites.

Monitoring Sites
The focus of this investigation was on PM 10 monitoring sites in South Africa, which are located in Mpumalanga, Gauteng, Western Cape and KwaZulu-Natal (Figure 1). These stations are managed by the Department of Environmental Affairs, South Weather Services, provincial, local governments and private industries. Hourly PM 10 data from the four provinces were obtained from the South African Air Quality Information System (SAAQIS) for 61 monitoring sites (27 in Gauteng,17 in Mpumalanga, 10 in Western Cape, 7 in Kwazulu-Natal) for the study period 1 January 2010-30 December 2017. Air quality monitoring stations instruments were serviced and calibrated bi-weekly, undergoing a full calibration annually, using National Metrology Institute of South Africa certified gases. The number of sites per year varies across the study period. Figure 2 shows the data completeness of the PM 10 observations obtained from the SAAQIS by province between 2010 and 2017. SAAQIS provides PM 10

Monitoring Sites
The focus of this investigation was on PM10 monitoring sites in South Africa, whi are located in Mpumalanga, Gauteng, Western Cape and KwaZulu-Natal (Figure These stations are managed by the Department of Environmental Affairs, South Weath Services, provincial, local governments and private industries. Hourly PM10 data from t four provinces were obtained from the South African Air Quality Information Syste (SAAQIS) for 61 monitoring sites (27 in Gauteng,17 in Mpumalanga, 10 in Western Cap 7 in Kwazulu-Natal) for the study period 1 January 2010-30 December 2017. Air qual monitoring stations instruments were serviced and calibrated bi-weekly, undergoing full calibration annually, using National Metrology Institute of South Africa certifi gases. The number of sites per year varies across the study period. Figure 2 shows the da completeness of the PM10 observations obtained from the SAAQIS by province betwe 2010 and 2017. SAAQIS provides PM10 data for research purposes in South Africa up completion of the required data disclosure forms. SAAQIS can be reached via their we site (https://saaqis.environment.gov.za/. Accessed on 22 October 2018)

Quality Check and Data Management
To ensure quality of the PM 10 data, the following quality check filters were applied. All negative values or observations greater or less than four times the interquartile range of each monitoring stations were considered outliers and were subsequently removed. A threshold of 75% hourly data per day was used to aggregate hourly data to a daily mean concentration.

Quality Check and Data Management
To ensure quality of the PM10 data, the following quality check filters were applied. All negative values or observations greater or less than four times the interquartile range of each monitoring stations were considered outliers and were subsequently removed. A threshold of 75% hourly data per day was used to aggregate hourly data to a daily mean concentration.

Temporal Parameters
Daily meteorological parameters of total precipitation, boundary layer height, temperature, the component of the horizontal wind towards east (U wind component) and the component of the horizontal wind towards north (V wind component) at a spatial resolution of 0.125 × 0.125° (approximately 10 × 10 km 2 ) for the hour 12:00:00 were downloaded from the European Centre for Medium-Range Weather Forecasts Reanalysis 5 th Generation (ERA5) global climate reanalysis dataset for the year 2010-2017 for South Africa. The U and V wind components were subsequently used to calculate wind speed (ws) and wind direction (wd) respectively using the formulas below: In addition to Copernicus Atmosphere Monitoring Service (CAMS) Reanalysis PM10 estimates, columnar daily ensemble estimates of pollutant gases of nitrogen dioxide, ozone were also downloaded from the CAMS data store at 0.125 × 0.125° (approximately 10 × 10 km 2 ). All temporal predictors were resampled at a 1 × 1 km 2 resolution, matching the 1 × 1 km 2 resolution of the raster specifically constructed for this study. The monitoring stations locations were subsequently linked to this raster to extract the temporal predictors.

Temporal Parameters
Daily meteorological parameters of total precipitation, boundary layer height, temperature, the component of the horizontal wind towards east (U wind component) and the component of the horizontal wind towards north (V wind component) at a spatial resolution of 0.125 × 0.125 • (approximately 10 × 10 km 2 ) for the hour 12:00:00 were downloaded from the European Centre for Medium-Range Weather Forecasts Reanalysis 5th Generation (ERA5) global climate reanalysis dataset for the year 2010-2017 for South Africa. The U and V wind components were subsequently used to calculate wind speed (ws) and wind direction (wd) respectively using the formulas below: In addition to Copernicus Atmosphere Monitoring Service (CAMS) Reanalysis PM 10 estimates, columnar daily ensemble estimates of pollutant gases of nitrogen dioxide, ozone were also downloaded from the CAMS data store at 0.125 × 0.125 • (approximately 10 × 10 km 2 ). All temporal predictors were resampled at a 1 × 1 km 2 resolution, matching the 1 × 1 km 2 resolution of the raster specifically constructed for this study. The monitoring stations locations were subsequently linked to this raster to extract the temporal predictors.

Spatial Parameters
A number of spatial geographic information system (GIS) predictor variables were calculated for this study at the aforementioned 1 × 1 km 2 grid (see Table 1). South Africa's road network was obtained from OpenStreetMap. For each 1 × 1 km 2 grid cell, we calculated the sum of road length for two categories: major roads and other roads. Land cover data were extracted from the 2018 South Africa National Land cover dataset. The initial 72 land use classes were re-categorized into five major categories: residential; industrial; built-up; agriculture; and water bodies. South Africa's climatic zones were extracted based on the South Africa Bureau of Standards 2005 classification. Population density was obtained from the Socioeconomic data and Application Center (SEDAC) dataset. For the light at night, data extracted from the Visible Infrared Imaging Radiometer Suite-Day/Night Band (VIIRS-DNB) was extracted and averaged at the 1 × 1 km 2 resolution. Elevation and impervious surface were extracted from respectively the Shuttle Radar Topography Mission Digital Elevation Database version 4.1 and the National Oceanic and Atmospheric Administration database.

Random Forest Model
RF is a non-parametric machine learning algorithm and an ensemble method that can be used to perform regression for continuous outcome variable (e.g., PM 10 ). Imputation of missing daily PM 10 data for stations with at least 70% of annual PM 10 was achieved by combining the measured PM 10 and spatial and temporal predictor variables at three geographical scales; national, provincial and site specific.
To impute missing PM 10 , all possible monitoring stations with valid PM 10 measurements were included in RF analysis. RF was used to estimate the PM 10 concentration for the missing days by exploring the relationship between observed PM 10 and spatial and temporal predictors. RF leverages on averaging several independent bootstrap ensemble trees to reduce the variance in the predicted PM 10 by [24,25]: Randomly resample the data with replacement to create training and validation sets of same sample size as the original dataset.

2.
Repeatedly construct regression trees on the training sets and predict on the validation sets.

3.
At each trees node, the best predictors from the random subsets of predictors were subsequently used to partition the nodes of respective trees. 4.
The final estimate of PM 10 is the average of individual trees of PM 10 predictions in a process called bagging.
In this study, the RF parameters number of variables randomly sampled as candidates at each split (mtry) and number of trees to grow (ntree) and minimum number of observations in a terminal node (min.node.size) were selected based on the combinations that minimized out of bag prediction error in the one-third sample left out for validation. Throughout this study, 500 trees were considered. Generally, mtry was tuned at each terminal nodes with two and respective predictors to de-correlate the trees. RF models are less sensitive to parameter tuning for low dimensional data [26]. Similarly, using minimum number of predictors that substantially contribute to explaining the variance in PM 10 could prevent overfitting the models as RF is prone to overfitting when spatial and temporal variables are included as predictors [27,28].
The feature importance of the models was ranked based on predictors that reduced prediction error when used as splits over the ensemble trees in the RF models. For all the RF models, the faster implementation of RF via the ranger packages was accessed from the caret package in R [29].

Model Validation
Spatial and temporal cross-validation was used to assess the daily PM 10 models prediction errors in time and space. Spatial leave one location out cross-validation (LOLO CV) was used to evaluate the national and provincial models. The national model was split into four folds using the province as splitting criterion. Thus, a model was trained on data from all but one province (n − 1). The hold-out provinces sites were iteratively used to estimate the prediction errors of using these models to predict for sites not included in the training data. Sites were used as the splitting criterion for the different provincial models. To account for possible spatial autocorrelation in the models, a complete time-series of observations of a site was sequentially withheld (n − 1) for cross-validation. Spatial LOLO CV was not possible for the site-specific models. Temporal leave time out cross-validation (LTO CV) was used to assess the model's performance in time. Day of the year was used to split the dataset 10 fold. All three models were sequentially trained on all but one held-out fold. All the models cross-validation were implemented using CAST (Caret Applications for Spatial-Temporal Models) package-a caret package wrapper for spatial and temporal cross-validation [28].

Error Metrics
Coefficient of determination (R 2 ), the square of the correlation coefficient between the observed and predicted daily PM 10 observation was used to evaluate the variance explained by the models. For all the models but sites models, we computed three R 2 measures to assess the models performance. The model building R 2 describing the overall models ability to explain the variance between observed and predicted daily PM 10 observation. Spatial and temporal R 2 to quantify the contribution of the spatial and temporal level to the total variance of daily PM 10 model predictions on held-out stations and days.
Root mean squared error (RMSE), the square root of the mean quadratic differences between observed and predicted daily PM 10 .
Mean absolute error (MAE), the average over the absolute differences between the observed daily PM 10 and predicted daily PM 10 were also calculated to provide summary estimates of the models prediction errors.

National Model
The RF models combined spatial and temporally predictor variables with ground monitored PM 10 from all the four provinces to construct national models for 2010-2017 (Table 2). Figure 3 shows the top 15 ranked variable of importance based on the predictors that reduced prediction error when used as splits over the ensemble trees in the RF models. Temporal predictors of chemical transport model-based estimates of PM 10 , humidity, Julian date and the spatial variable population emerged as influential variables across 2010-2017. The national RF models for 2010 to 2017 explained between 77% and 79% of the variation in daily PM 10 concentrations. Spatial CV was used to assess the robustness of the models. vation. Spatial and temporal R 2 to quantify the contribution of the spatial and temporal level to the total variance of daily PM10 model predictions on held-out stations and days.
Root mean squared error (RMSE), the square root of the mean quadratic differences between observed and predicted daily PM10.
Mean absolute error (MAE), the average over the absolute differences between the observed daily PM10 and predicted daily PM10 were also calculated to provide summary estimates of the models prediction errors.

National Model
The RF models combined spatial and temporally predictor variables with ground monitored PM10 from all the four provinces to construct national models for 2010-2017 (Table 2). Figure 3 shows the top 15 ranked variable of importance based on the predictors that reduced prediction error when used as splits over the ensemble trees in the RF models. Temporal predictors of chemical transport model-based estimates of PM10, humidity, Julian date and the spatial variable population emerged as influential variables across 2010-2017. The national RF models for 2010 to 2017 explained between 77% and 79% of the variation in daily PM10 concentrations. Spatial CV was used to assess the robustness of the models. The R 2 of the spatial and temporal cross validation varies between 0.

Provincial Model
The provincial model explored the relationship between PM 10 and predictor variables by each province across 2010-2017. Supplementary Material Figures S1-S4 highlight chemical transport model-based estimates of PM 10 , humidity, total precipitation, sites coordinates as variables of importance for explaining the intra-province PM 10 variability. The contribution of these variables also varied across the study period and provincesunderlying the heterogeneity in the provincial characteristics of PM 10 concentration.
The performance of the provincial models while predicting PM 10 for held-out sites varied across provinces and study period ( Table 2). The CV results of the RF models for Gauteng, for example, explained between 26% and 52% of spatial variability and between 52% and 79% of temporal variability in measured PM 10 concentrations. Mpumalanga RF models slightly improved on the Gauteng models with R 2 ranges of 0.39-0.69 (spatial) and 0.73-0.78 (temporal).  * The provincial models included all possible sites with PM 10 observation; ** The sites models included the monitoring stations with at least 70% annual PM 10 observation. NA: Not applicable. These are individual site models-Spatial cross-validation (CV) cannot be perform for models with less than two sites. LOLO: Leave one location out spatial cross-validation; LTO: Leave time out temporal cross-validation. Range: The minimum and maximum values of the statistics metrics from the models across 2010.

Site-Specific Models
Site-specific or individual site models were used to assess the relationship between PM 10 and temporal predictor variables if the site have at least 70% annual PM 10 data. The site-specific models were explored independently from each other. The models for Witbank monitoring station performed best with explaining PM 10 variability between 72% and 83% ( Table 2). Leandra monitoring station performed worst with a range of explained PM 10 variability between 29% and 36%. The temporal variables of chemical transport model-based estimates of PM 10 , humidity, Julian date, wind speed, temperature, total precipitation are important variables for explaining PM 10 variability of the different sites (Supplementary Material Figure S5). Table 3 compares the distribution of observed PM 10 values against the CV predicted PM 10 for the three models (national, provincial and site-specific) for days with PM 10 measurements. The site-specific models outperformed the national and provincial models in capturing the variability in PM 10 . The mean and the standard deviation of the predicted PM 10 from the provincial and site-specific models are somewhat comparable to that of the observed PM 10 concentrations. The range of the predicted mean PM 10 concentrations from the national models differs substantially from the observed PM 10 concentrations. Table 3. Range of the observed versus predicted PM 10 concentrations (in µg/m 3 ) for the 3 different models (National, Provincial and Site-specific) averaged over all sites and years (2010-2017) by province for the mean, standard deviation (SD) and 5th, 25th, 50th, 75th and 95th percentiles).

Discussion
This study explored methods for imputing missing daily PM 10 measurements in South Africa, while considering the spatial distribution pattern of the sparsely PM 10 monitoring stations across four provinces of South Africa. The RF models, representing three different geographical domains, exhibit markedly different predictive performances for predicting missing daily PM 10 measurements across four provinces of South Africa.
The performance of the national models and provincial models decreased considerably when used to predict daily PM 10 in the LOLO validation. Table 3 indicates that the provincial and site-specific models predicted PM 10 concentrations do not differ substantially from the observed PM 10 concentrations in terms of mean and standard deviation. In addition, we constructed a national model for the entire eight years (2010-2017) to compare the performance of this model to the yearly models (not presented in Table 2). The overall performance of the model R 2 of 0.67 (RMSE, 17.70) suggest a reduced performance when compared to the range of the yearly models R 2 of 0.77-0.79 (RMSE 2.10-16.76). The cross-validated spatial R 2 of 0.24 (RMSE, 23.47) is within the range of yearly models R 2 (0.11-0.35), RMSE (17.72-29.47). The better performance of the yearly models might be because most of the PM 10 sites did not provide measurements consistently through the eight years. Also, the levels of PM 10 between the years are different due to changing PM 10 related emission variables. The national model, despite high overall R 2 s (0.77-0.79), performed poorly in the LOLO CV (R 2 0.11-0.35). This was also reflected in the poor ability to predict the observed PM 10 concentration (Table 3). This is perhaps not surprising given the large geographical domain of South Africa. The distances between the provinces are substantial (e.g., approximately 1000 km between Western Cape and the other three provinces) and, therefore, they exhibit different local emission characteristics driven by social and economic factors, but also by different climatological differences. The air pollution priority areas of Mpumalanga, Gauteng and KwaZulu-Natal provinces are home to the majority of coal reserves, mining and steel facilities in South Africa. The combined impact of these anthropogenic sources with other local sources of PM 10 and different climatic zones is likely to result in spatial variation in PM 10 concentration levels between the provinces resulting in distinct provincial characteristics of PM 10 , which are not transferable between the provinces.
Our provincial models were based on few monitoring stations relative to the size of the four provinces. For example Western Cape Province, the largest province among the four provinces (area = 129,462 km 2 ), has only 10 operating sites to capture the variability in PM 10 . The lack of sufficient representative monitoring sites to capture intra-province variability in PM 10 could explain the relative poor performance of the provincial and national models. Previous studies also reported on the limitation of regulatory monitoring networks in capturing small-scale spatial variations of pollutant concentrations due to the sparse distribution of the few monitoring stations [30,31].
The site-specific models' PM 10 predictions did not differ substantially from the distribution pattern of the observed PM 10 ( Table 3). The site-specific RF models, only using temporal predictor variables, were able to capture the observed temporal variability in PM 10 better than the national and provincial models. Previous studies in India and Switzerland have explored the association between PM 2.5 and PM 10 in co-located sites to impute missing daily PM 2.5 observations. These studies were able to develop imputation models explaining 89% (Switzerland) and 92% (India) variability in PM 2.5 [16,17]. These two studies were able to use sufficient PM 10 and PM 2.5 measurements at co-located sites to inform their models and then apply these to PM 10 only sites to impute PM 2.5 . In South Africa, there were insufficient co-located sites to follow this approach. Despite this disadvantage, we were able to explain PM 10 variance by between 29% and 83% in the site-specific models.
This finding highlights the paucity of air quality monitoring data in South Africa where only four provinces provided PM 10 measurements used for this study. Increasing the number of air pollution monitoring sites in South Africa and improving the data capture will provide more power to model more improved and reliable exposure estimates. Nonetheless, the RF variable of importance ranking across the four provinces indicates that chemical transport model estimates of PM 10 and meteorological variables contributed considerably to explaining ground-level PM 10 across our study area and study period.

Conclusions
This study compared three models (national, provincial and site-specific) combining spatial, temporal and chemical transport model-based estimates of PM 10 , O 3 and NO 2 with observed PM 10 concentrations to predict missing daily PM 10 concentrations across 44 monitoring sites in four provinces of South Africa between 2010-2017. Given the extent of air quality monitoring currently conducted in South Africa, the site-specific and provincial models showed a better performance compared to the national models in capturing the variability of ground-level PM 10 . Thus, our study provides evidence that a model constructed with sites from a province is less generalizable to another province.
The results of this study, complete time-series of daily PM 10 concentrations containing a mix between measured and imputed PM 10 concentrations, will be used in subsequent air pollution exposure studies aimed at informing population health studies in South Africa.
Supplementary Materials: The following are available online at https://www.mdpi.com/1660-460 1/18/7/3374/s1, Figure S1: Mpumalanga province random forest variable of importance, Figure S2: Gauteng province random forest variable of importance, Figure S3: Western Cape province random forest variable of importance, Figure S4: KwaZulu-Natal province random forest variable of importance, Figure S5: Site-specific model's random forest variable of importance.