Development and Evaluation of Spatio-Temporal Air Pollution Exposure Models and Their Combinations in the Greater London Area, UK

Land use regression (LUR) and dispersion/chemical transport models (D/CTMs) are frequently applied to predict exposure to air pollution at a fine scale for use in epidemiological studies. Moreover, satellite aerosol optical depth data have been a key predictor, especially for particulate matter pollution and when studying large populations. Within the STEAM project we present a hybrid spatio-temporal modeling framework by (a) incorporating predictions from dispersion modeling of nitrogen dioxide (NO2), ozone (O3) and particulate matter with an aerodynamic diameter equal to or less than 10 μm (PM10) and less than 2.5 μm (PM2.5) into a spatio-temporal LUR model; and (b) combining the predictions from LUR and dispersion modeling and additionally, only for PM2.5, from an ensemble machine learning approach, using a generalized additive model (GAM). We used air pollution measurements from 2009 to 2013 from 62 fixed monitoring sites for O3, 115 for particles and up to 130 for NO2, obtained from the dense network in the Greater London Area, UK. We assessed all models following a 10-fold cross validation (10-fold CV) procedure. The hybrid models performed better than the separate LUR models. Incorporating the dispersion estimates in the LUR models as a predictor improved the LUR model fit: CV-R2 increased to 0.76 from 0.71 for NO2, to 0.79 from 0.57 for PM10, to 0.81 from 0.66 for PM2.5 and to 0.75 from 0.62 for O3. The CV-R2 obtained from the hybrid GAM framework was also higher than that of the separate LUR models (CV-R2 = 0.80 for NO2, 0.76 for PM10, 0.79 for PM2.5 and 0.75 for O3). Our study supports the combined use of different air pollution exposure assessment methods in a single modeling framework to improve the accuracy of spatio-temporal predictions for subsequent use in epidemiological studies.


Introduction
Epidemiological studies have been utilizing various air pollution exposure assessment methods to associate individualized exposure to air pollution concentrations with health effects [1][2][3]. Early studies [4][5][6] of long-term air pollution exposure assigned the concentration measurement from the nearest fixed monitoring site or applied interpolation methods (e.g., inverse distance weighting and kriging) to a geographical point of interest (usually the participant's residential address). A limitation of this approach is its limited ability to capture temporal variation in concentrations. More recently, land use regression (LUR) and dispersion/chemical transport models (D/CTMs) have been among the most common methods applied to predict air pollution concentrations at a fine spatial scale for use in epidemiological studies [1,2][7][8][9]. D/CTMs can capture temporal variation, but are limited by the accuracy of the emissions inventories. Moreover, satellite aerosol optical depth (AOD) data have been frequently used for predicting fine particulate matter pollution [3,10], especially where fixed site monitoring data are lacking and in nation-wide or large population studies.
These more recent methods have been shown to be useful tools [11] for different epidemiological study designs since they can be extended to predict both the spatial and temporal variability of air pollution concentrations. For example, in time series and panel studies they can provide daily predictions, while in studies assessing long-term exposure health effects, the daily predictions obtained can be averaged over the time period of interest. In addition, these developments in exposure modeling provide spatially resolved daily estimates, enabling an integrated assessment of health effects arising from both long- and short-term exposures. However, they also have some limitations. LUR model development relies on air pollution measurements provided from a fixed monitoring network or is based upon specifically designed monitoring campaigns within a study [12]. As a result, a spatially sparse monitoring network or the limited temporal coverage of specifically designed monitoring campaigns may increase exposure measurement error. D/CTMs are based on the description of the physicochemical processes of air pollution, involving the emission sources of pollutants and their precursors [13]. Therefore, they require high quality input data to produce accurate predictions. AOD data have contributed to developing national models with a spatial resolution of 1 × 1 km for particulate matter with an aerodynamic diameter equal to or less than 2.5 µm (PM2.5) [14], but, as they measure PM within a height of several kilometers above ground, modelling is required to estimate concentrations at the height of the human breathing zone [10,11]. Additionally, values are missing on days with cloud coverage, which may be a significant problem for certain geographical areas and seasons. Therefore, models based only on AOD data may have increased uncertainty and may not allow the adequate assessment of intra-city variations.
Air pollution is a major environmental risk factor for human health [15,16] and it is crucial to provide epidemiological studies with accurate estimates of exposure. To overcome the limitations of single exposure assessment methods and to improve the accuracy of predictions, recent studies are combining methodologies or outputs from different methods. Incorporating predictions from D/CTMs and/or satellite-derived air pollutant concentrations as predictor variables within a LUR model has been shown to improve model performance in terms of predicting the spatial variability of air pollution [17,18]. On the other hand, very few studies have developed hybrid models by incorporating the output of different exposure assessment methods into a single spatiotemporal (ST) modeling framework [18][19][20].
Within the "Comparative evaluation of Spatio-Temporal Exposure Assessment Methods for estimating health effects of air pollution" (STEAM) project we developed two hybrid ST modeling approaches for air pollutants, by combining LUR, dispersion and machine learning modeling, in the Greater London Area for the years 2009-2013. In the first approach, we incorporate the predictions from dispersion modeling of nitrogen dioxide (NO2), ozone (O3), particulate matter with an aerodynamic diameter equal to or less than 10 µm (PM10) and PM2.5 into an ST LUR model framework. In the second approach, we apply an ST generalized additive model (GAM) to combine the predictions of the individual models. For NO2, O3 and PM10 this is carried out by including the predictions of two methods (LUR and CTM modeling), while for PM2.5 by including the predictions of three methods (LUR, CTM and an ensemble model using machine learning and informed also by satellite data), as independent variables in the GAM.

Study Area
The Greater London study area has a total of 9,784,200 inhabitants (Census 2011 data; https://data.london.gov.uk/dataset/2011-census-demography, accessed on 1 March 2017) across 5373 census-based aggregation units, the Lower Layer Super Output Areas (LSOAs), whose centroids (longitude, latitude) are located within the London Orbital Motorway (M25). Each LSOA within the study area has a minimum population of 1000 persons, with an average population in 2010 of 1722 [21]. Within the same area there are 219,093 postcodes, each with a defined population-weighted centroid. The Greater London Area has an exceptionally extensive air pollution monitoring network, which allows the performance of various models to be tested, as described in the following sections.

Air Pollution Monitoring Data and Enhanced PM2.5 Database
We obtained daily measured concentrations of NO2, O3, PM10 and PM2.5 for 2009 through 2013 from the London Air Quality Network [22], Air Quality England [23] and the Automatic Urban and Rural Network [24]. For NO2, PM10 and PM2.5 we compiled a database of the 24 h average measured pollutant concentration (µg/m3), while for O3 we calculated the daily maximum 8 h (8 h max) average concentration (µg/m3). NO2 measurements were available from 130 monitoring sites, PM10 from 115 sites and PM2.5 from 33 sites. To enhance the representation of PM2.5 monitoring sites in the study area, we combined a regression model and a random forest (RF) approach to predict PM2.5 concentrations at fixed sites with PM10 measurements (but without PM2.5 measurements). More details on the methods applied for PM2.5 can be found in [25]. This procedure was essential for informing the ST LUR and machine learning PM2.5 model development with a sufficient number of monitoring locations (n = 104). For O3, we used measurements from 62 sites located in an extended area including 9688 LSOAs. O3 is a secondary pollutant whose levels are mostly higher in rural areas than in urban settings. Therefore, we extended our study area to account for its formation properties and transport patterns. Figure 1 shows the study area and the geographical location of the monitoring sites.
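The two daily summaries described above (a 24 h average and a daily maximum 8 h running average) can be sketched as follows. This is only an illustration under stated assumptions: the source does not describe how the daily values were derived from hourly data, so the function names, the completeness thresholds (`min_valid`) and the choice to use only 8 h windows that lie entirely within one calendar day are hypothetical.

```python
def daily_mean(hourly, min_valid=18):
    """24 h average of one day's hourly values (None marks a missing hour).

    min_valid is a hypothetical completeness rule, not taken from the source.
    """
    valid = [v for v in hourly if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


def daily_max_8h_mean(hourly, min_valid=6):
    """Maximum 8 h running mean within one day (the "8 h max" metric).

    Assumption: only the 17 windows fully inside the day are used, and a
    window needs at least min_valid valid hours to count.
    """
    best = None
    for start in range(len(hourly) - 7):
        window = [v for v in hourly[start:start + 8] if v is not None]
        if len(window) < min_valid:
            continue
        mean = sum(window) / len(window)
        if best is None or mean > best:
            best = mean
    return best


# Toy day: low overnight values, an afternoon peak, then an evening decline.
o3_hourly = [10.0] * 8 + [50.0] * 8 + [20.0] * 8
print(daily_mean(o3_hourly), daily_max_8h_mean(o3_hourly))
```

For a pollutant with a strong afternoon peak, such as O3, the 8 h max is much higher than the 24 h mean of the same day, which is why the two metrics are not interchangeable.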

Brief Summary of the First Stage Exposure Assessment
At the first stage, ST LUR and dispersion models were developed for all pollutants, and additionally machine learning models for PM2.5. The outputs of these approaches were then combined in different hybrid models, developed as the second stage. "Temporal" refers to daily (24 h) variation, while "spatial" refers to variation within the study area at the coordinates of interest (LSOA centroid).



LUR Models
We developed semi-parametric ST LUR models to predict daily NO2, O3, PM10 and PM2.5 concentrations at any point of interest in the study area for the years 2009 to 2013. Similar to the approach previously applied in Athens, Greece [26,27], the log-transformed air pollutant measurement (except for O3, which was normally distributed) at location i on day t was modeled using a set of covariates that had either a linear or a smooth effect on the pollutant, plus a bivariate smooth function of the fixed monitoring sites' geographical location (longitude, latitude) to account for remaining spatial correlation. We used the available air pollution measurement data and tested 97 potential predictors of the air pollutants' spatio-temporal variability. These variables can be classified in four categories: (a) land use type (Land Cover Map, LCM, of Great Britain from 2007) in buffers of 100, 300, 500, 1000 and 5000 m around each fixed monitoring site; (b) meteorological variables (daily means), obtained from the UK Meteorological Office: temperature (°C), relative humidity (%), wind direction (°N), wind speed (m/s), cloud coverage (okta), barometric pressure (mBar/hPa) and solar radiation (W/m2); (c) traffic-related variables: total length of major roads (m) in buffers of 25, 50, 100, 300, 500 and 1000 m, inverse distance of the fixed monitoring site to the nearest major road (m−1), traffic intensity on the nearest major road to the fixed monitoring site (veh day−1) and total traffic load within each buffer of 25 to 1000 m (veh day−1 m), with traffic counts obtained from the Department for Transport in the United Kingdom; (d) indicators of the trend within a year (a day count variable coded from 1 to 365 or 366), of the trend over the years of the study period (4 dummy variables with year 2009 as the reference category) and of the day of the week (6 dummy variables with Sunday as the reference category).
For the smooth function we used a penalized spline with degrees of freedom (df) estimated via Restricted Maximum Likelihood (REML). For each continuous variable, the final model included the term (linear or smoothed) that provided the better model fit. The final set of explanatory variables was selected based on the model's adjusted-R2 value. In addition, the coefficients of the spatial covariates had to conform to the pre-defined direction of effect. Variables were added until none increased the adjusted-R2 value by more than 1%. The final variables included in each of the ST LUR models are shown in Figure 2. All the predictor variables of the spatial variation of O3, PM10 and PM2.5 concentrations are traffic related. For NO2, the spatial variables included, in addition to the traffic-related variables, a land use type variable (area characterized as urban within a buffer of 300 m).
Regarding temporal variables, all final ST LUR models included meteorological and indicator variables for study years. Moreover, all final models included the geography (longitude, latitude) of the fixed monitoring sites. We validated our developed models using ten-fold cross-validation (10-fold CV). All land use types and traffic-related variables were extracted by conducting GIS analysis via ArcGIS Desktop, Release 10 [28]. All statistical analysis was conducted using the R statistical software (version 3.3.3; R Core Team, 2017, sourced from Athens, Greece) [29] and the R library "SemiPar" version 1.0-2 [30].
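The variable selection rule described in this section (add the candidate that most improves adjusted R2, require the pre-defined direction of effect, stop when no candidate adds more than 1%) can be sketched as below. This is a simplified illustration, not the authors' R/SemiPar code: it uses ordinary least squares with linear terms only (no penalized splines), interprets "1%" as an absolute gain of 0.01 in adjusted R2, and all function and variable names are hypothetical.

```python
def ols_fit(X, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def adjusted_r2(X, y, beta):
    """Adjusted R2; the column count of X includes the intercept."""
    n, p = len(X), len(X[0])
    pred = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - p)


def forward_select(columns, y, expected_sign, min_gain=0.01):
    """Forward selection on adjusted R2 with a direction-of-effect check."""
    selected, best = [], 0.0  # intercept-only model has adjusted R2 = 0
    while True:
        candidate = None
        for name in columns:
            if name in selected:
                continue
            names = selected + [name]
            X = [[1.0] + [columns[m][r] for m in names] for r in range(len(y))]
            beta = ols_fit(X, y)
            sign = expected_sign.get(name, 0)  # 0 means no constraint
            if sign and beta[-1] * sign <= 0:
                continue  # coefficient has the wrong direction of effect
            score = adjusted_r2(X, y, beta)
            if score - best > min_gain and (candidate is None or score > candidate[1]):
                candidate = (name, score)
        if candidate is None:
            return selected, best
        selected.append(candidate[0])
        best = candidate[1]


# Toy data: the response is driven almost entirely by the traffic variable.
columns = {
    "traffic": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "green": [8.0, 6.0, 7.0, 5.0, 4.0, 2.0, 3.0, 1.0],
    "junk": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.1]
selected, best = forward_select(columns, y, {"traffic": 1, "green": -1})
print(selected, round(best, 4))
```

In the toy data the traffic variable alone explains almost all of the variance, so neither remaining candidate can add more than 0.01 and the procedure stops after one step.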

Dispersion Models
We used a modeling system that combines anthropogenic and natural emissions data with the Weather Research and Forecasting (WRF) meteorological model [31] and the Community Multiscale Air Quality (CMAQ) model [32], which has been coupled to the Atmospheric Dispersion Modelling System (ADMS) roads model [33,34], to predict hourly NO2, O3, PM10 and PM2.5 concentrations at a 20 m grid resolution over the study area for the period 2009 to 2013. The anthropogenic emissions data were obtained by combining the UK National Atmospheric Emissions Inventory (NAEI) [35], the London Atmospheric Emissions Inventory [36], King's road transport emissions model [37] and EMEP European emissions (https://www.ceip.at/, accessed on 15 June 2017). The biogenic emissions from vegetation and soils were estimated using the Biogenic Emission Inventory System version 3 (BEIS3) model [38]. Sea-salt emissions were calculated in-line in CMAQ. Bias in the 2 × 2 km CMAQ PM2.5 and PM10 hourly output was corrected using a sample of background sites before the local scale dispersion modelling stage. The discrepancies between the model output and the measurements at a random sample of 50% of background sites in the case of PM10, and at 5 sites in the case of PM2.5, were interpolated onto the 2 × 2 km grid to create a correction surface. This interpolation was carried out using two iterations of a multilevel B-spline algorithm [39], which normally takes around eight iterations to interpolate points exactly, so that the resultant error surface provided a smoothly varying bias correction across the domain, rather than fixing the model output to the measurements. The results from the CMAQ-urban model were evaluated at 152 fixed monitoring stations from the UK and London monitoring networks, using methods described in the UK Department for Environment, Food and Rural Affairs (DEFRA) model evaluation protocol [40].
Figure 2. Final variables included in each of the ST LUR models. Abbreviations: [...]; TRAFMLOAD_300: traffic load of major roads (veh*m/day) in a buffer of 50 m, 100 m and 300 m around each fixed monitoring site, respectively; MROADLENGTH_100: total length of major roads (m) in a buffer of 100 m around each fixed monitoring site; INVDIST: inverse distance of fixed monitoring sites to the nearest major road (m−1); URBAN_300: urban areas (m2) in a buffer of 300 m around each fixed monitoring site; DAYCOUNT: day count variable accounting for trends within each year, coded from 1 to 365 or 366 (included as penalized splines with 6 degrees of freedom (df)); YEARS: years of the study period (4 dummy variables with year 2009 as reference category); TEMP: daily mean temperature (°C, included as penalized splines with 3 df); WDIR: daily mean wind direction (°N, included as penalized splines with 3 df); WSPEED: daily mean wind speed (m/s); RHUM: daily mean relative humidity (%); CLOUD: daily mean cloud coverage (okta); BARPRESS: daily mean barometric pressure (mBar/hPa, included as penalized splines with 3 df).

PM2.5 Prediction Model Based on an Ensemble Machine Learning ST Approach
We applied an ensemble machine learning approach including AOD, land use and meteorological data to predict daily PM2.5 concentrations in the study area on a 1 km × 1 km grid (3960 grid cells in total). Details on the prediction model development can be found in [41]. In brief, the machine learners used in the process were the gradient boosting machine (GBM), the random forest (RF) and the k-nearest neighbor (KNN). AOD data were provided by the MAIAC algorithm for the Moderate Resolution Imaging Spectroradiometer (MODIS) instruments on the Aqua and Terra satellites [42]. Predictors of the spatial variation of PM2.5 were: population density (persons/km2), land use type (LCM 2007), distance to water (km), distance to Heathrow airport (m), normalized difference vegetation index (NDVI), traffic counts, average daily PM2.5 across the Greater London area (µg/m3), light at night, elevation (m), distance to nearest major road (km), distance to nearest bus stop (km), average building height (m), and the length of major roads (km), number of bus stops and number of buildings in the grid cell. The included meteorological covariates were: cloudiness (okta), barometric pressure (mBar/hPa), wind direction (°N), wind speed (m/s), dew point temperature (°C), temperature (°C) and inverse of the height of the planetary boundary layer (m−1). Additionally, variables on the temporal scale were included to account for seasonal variations of PM2.5 (sine of day of the year, cosine of day of the year and day of week) and for long-term trends (number of days from time of origin and year). Model training used a grid search to optimize the hyper-parameters of the algorithms, with the mean square error (MSE) and cross-validated R2 values as selection criteria.
We then obtained the final ensemble-averaged PM2.5 predictions from a GAM whose independent variables were a smoothed function (a penalized spline with degrees of freedom estimated via REML) of the predictions obtained from each machine learning method and a bivariate smooth function of latitude and longitude. Ten-fold CV was applied to evaluate model performance.
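The temporal covariates listed above (sine and cosine of the day of the year, day of week, number of days from a time origin, and year) can be derived from a calendar date as in the following sketch. The exact encoding used in [41] is not given in the text, so the feature names, the choice of 1 January 2009 as the origin and the scaling of the day of the year to one full cycle per year are assumptions made for illustration.

```python
import calendar
import math
from datetime import date


def temporal_features(day, origin=date(2009, 1, 1)):
    """Seasonal and trend covariates for one calendar day.

    The origin date and the feature names are hypothetical choices.
    """
    day_of_year = day.timetuple().tm_yday
    days_in_year = 366 if calendar.isleap(day.year) else 365
    angle = 2.0 * math.pi * day_of_year / days_in_year
    return {
        "sin_doy": math.sin(angle),      # seasonal cycle, first component
        "cos_doy": math.cos(angle),      # seasonal cycle, second component
        "day_of_week": day.weekday(),    # Monday = 0, ..., Sunday = 6
        "days_since_origin": (day - origin).days,  # long-term trend
        "year": day.year,
    }


print(temporal_features(date(2011, 7, 1)))
```

Using the sine and cosine pair instead of the raw day count keeps 31 December and 1 January close together in feature space, which a tree-based or nearest-neighbor learner cannot infer from a 1 to 365 counter.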

Agreement between First Stage (Independent) Exposure Assessment Methods
Lin's concordance correlation coefficient was calculated as a measure of agreement, at the temporal and spatial level, between the independent exposure assessment methods [43]. Agreement at the spatial level was investigated by comparing the annual means of the estimates (from the 3 methods) for each Lower Layer Super Output Area (LSOA) centroid. Agreement at the temporal level was investigated by taking into account the daily estimates over the whole study area. Moreover, the Bland-Altman method was applied to evaluate the mean differences and to estimate a 95% agreement interval of the differences (limits of agreement, LoA) between the independent exposure assessment methods (ST LUR models, dispersion models and the ensemble machine learning approach) [44].

Depending on the availability of the independent exposure assessment approaches, we developed hybrid models by incorporating estimates derived from dispersion modeling and the prediction model for PM2.5 from an ensemble machine learning approach, of the following form:

$$f_p(\mathrm{poll}_{it}) = \sum_{l=1}^{q} f_l(S_{l,it}) + h(\mathrm{geog}_{ij}) + W_{lt}\beta + \sum_{k} g_k M_{k,it} + \varepsilon_{it} \qquad (1)$$

where $\mathrm{poll}_{it}$ is the air pollutant concentration measurement at location i on day t; $f_l(\cdot)$, l = 1, 2, ..., q, is a smooth function reflecting the non-linear effect of covariate $S_{l,it}$ on the pollutant's concentration $\mathrm{poll}_{it}$; $S_{l,it}$ stands for the lth smoothed covariate; h is a bivariate smooth function of the fixed monitoring sites' geographical coordinates $\mathrm{geog}_{ij}$ = (longitude$_i$, latitude$_j$) that accounts for residual correlation between locations i and j; $W_{lt}$ is the vector of covariates that have a linear effect on $\mathrm{poll}_{it}$ and $\beta$ is the corresponding vector of regression coefficients. $M_{k,it}$ is an ST smooth function of the predictions from method k (the dispersion model and, for PM2.5, the additional prediction model), with coefficient $g_k$. For $g_k = 0$, model (1) is equivalent to the non-integrated ST LUR model. The errors ($\varepsilon_{it}$) are assumed to be independent and normally distributed, with a mean value of zero and a constant variance $\sigma^2_\varepsilon$.

For NO2, PM10 and PM2.5 the air pollutant concentration measurement was log-transformed. We did not log-transform the O3 measurements since they were normally distributed. Therefore, for NO2, PM10 and PM2.5: $f_p(\mathrm{poll}_{it}) = \log(\mathrm{poll}_{it})$, while for O3: $f_p(\mathrm{poll}_{it}) = \mathrm{poll}_{it}$. Based on the above, the specific models constructed were: a hybrid ST LUR framework incorporating predicted concentrations of NO2, O3 and PM10 from dispersion modeling, and predicted concentrations of PM2.5 from dispersion modeling and the ensemble machine learning approach, as covariates within the ST LUR. These models are hereafter referred to as "Hybrid 1" and "ensemble", respectively.
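The two agreement statistics used to compare the first-stage methods (Lin's concordance correlation coefficient and the Bland-Altman mean difference with 95% limits of agreement) can be computed as in the sketch below. It uses the standard textbook definitions; the function names are illustrative, and the choice of the population (1/n) variance in the concordance coefficient and the sample (n − 1) standard deviation in the limits of agreement is an assumption, since the source does not state which estimators were used.

```python
import math


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Penalizes both poor correlation and systematic shift between methods.
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


def bland_altman(x, y):
    """Mean difference (x - y) and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sd = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


# Toy annual means from two hypothetical methods at four locations.
method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated but shifted by +1
print(lins_ccc(method_a, method_b))
print(bland_altman(method_a, method_b))
```

The toy example shows why the concordance coefficient is preferred over a plain correlation for method comparison: the two series are perfectly correlated, yet the constant offset of one unit lowers the concordance well below 1.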

Combination of Predictions Derived from LUR, Dispersion and for PM2.5 also Based on an Ensemble Machine Learning Modeling within a GAM
Similarly, depending on the availability of independent exposure assessment approaches, we constructed a GAM by fitting a smooth function (penalized splines) of the predictions from each method (LUR, dispersion and the prediction model based on an ensemble machine learning approach). The predictions reflect the daily estimated air pollutant concentrations at the fixed monitoring sites located in the study area. Since both the LUR model and the prediction model based on machine learning use the air pollutant measurements during the model development procedure (as the dependent variable), the 10-fold CV predictions obtained from each of these methods were used in the GAM. Our GAM has the following form:

$$\mathrm{poll}_{it} = s(\mathrm{LUR}_{\mathrm{pred\,10foldCV}}.\mathrm{poll}_{it}) + s(\mathrm{dispersion}_{\mathrm{pred}}.\mathrm{poll}_{it}) + s(\mathrm{ML}_{\mathrm{pred\,10foldCV}}.\mathrm{poll}_{it}) + \varepsilon_{it} \qquad (2)$$

where $\mathrm{poll}_{it}$ is the air pollutant concentration measurement at the fixed monitoring site location i on day t; s is a penalized spline with the degree of smoothness based on the generalized cross validation (GCV) criterion; $\mathrm{LUR}_{\mathrm{pred\,10foldCV}}.\mathrm{poll}_{it}$ are the 10-fold CV predictions of the air pollutant concentrations obtained from the ST LUR models; $\mathrm{dispersion}_{\mathrm{pred}}.\mathrm{poll}_{it}$ are the estimated air pollutant concentrations obtained from dispersion modeling; $\mathrm{ML}_{\mathrm{pred\,10foldCV}}.\mathrm{poll}_{it}$ are the 10-fold CV predictions of the air pollutant concentrations (available only for PM2.5) obtained from the ensemble machine learning approach, at location i on day t; and $\varepsilon_{it}$ is the error term. The final output from the GAM framework is the weighted-average daily prediction of each air pollutant. The weighting of the single methods was carried out through the smoothed functions, which allow the contribution of each method to vary along the concentration range of the pollutant in case one method performs better than another in a specific concentration range. The specific models constructed were: a hybrid GAM combining predicted concentrations of NO2, O3, PM10 and PM2.5 from the LUR, the dispersion model and, for PM2.5, the prediction model based on an ensemble machine learning approach (hereafter, "hybrid 2").

Validation
The performance of the hybrid 1 and 2 models was evaluated using 10-fold CV. In this method, all models were fitted to 90% of the fixed monitoring sites (N − 10%) and the predicted concentrations were compared with the measured (observed) concentrations at the left-out sites. This procedure was repeated 10 times. Finally, the overall level of fit (based on the model's R2 value) between the predicted and observed concentrations, across all sites, was calculated as a measure of our hybrid models' performance. Moreover, we separately investigated the temporal and spatial validity of the developed hybrid models. To assess each model's temporal validity, we regressed the difference between the daily predicted and the mean annual predicted pollutant concentrations on the difference between the measured pollutant concentrations and their mean annual levels. To assess spatial validity, we regressed the mean annual predicted concentrations of each pollutant on the mean annual measured concentrations at the fixed monitoring sites. The level of temporal and spatial fit was evaluated by obtaining the models' 10-fold CV R2 value.
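The validation scheme described above can be sketched in two parts: assigning whole monitoring sites to ten folds, and splitting the agreement between observed and cross-validated predicted values into a spatial part (annual means) and a temporal part (daily deviations from the annual mean). This is a minimal sketch under assumptions: the fold assignment by shuffled round-robin, the record layout `(site, year, observed, predicted)` and the function names are hypothetical, and R2 is computed as the squared Pearson correlation, which equals the R2 of a simple linear regression of one series on the other.

```python
import random
from collections import defaultdict


def site_folds(site_ids, k=10, seed=0):
    """Assign each monitoring site (not each daily record) to one of k folds."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    return {site: i % k for i, site in enumerate(sites)}


def r_squared(obs, pred):
    """R2 of a simple linear regression of pred on obs (squared correlation)."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sxx = sum((o - mean_o) ** 2 for o in obs)
    syy = sum((p - mean_p) ** 2 for p in pred)
    return sxy * sxy / (sxx * syy)


def spatial_temporal_r2(records):
    """records: iterable of (site, year, observed, cv_predicted) daily values."""
    groups = defaultdict(list)
    for site, year, obs, pred in records:
        groups[(site, year)].append((obs, pred))
    annual_obs, annual_pred, dev_obs, dev_pred = [], [], [], []
    for pairs in groups.values():
        mean_o = sum(o for o, _ in pairs) / len(pairs)
        mean_p = sum(p for _, p in pairs) / len(pairs)
        annual_obs.append(mean_o)
        annual_pred.append(mean_p)
        for o, p in pairs:
            dev_obs.append(o - mean_o)   # daily deviation from annual mean
            dev_pred.append(p - mean_p)
    return r_squared(annual_obs, annual_pred), r_squared(dev_obs, dev_pred)


# Toy records: predictions track daily changes exactly but carry a site bias.
observed = {"A": [10.0, 12.0, 14.0], "B": [20.0, 21.0, 25.0], "C": [30.0, 28.0, 35.0]}
bias = {"A": 1.0, "B": -2.0, "C": 0.5}
records = [(s, 2009, o, o + bias[s]) for s, values in observed.items() for o in values]
print(spatial_temporal_r2(records))
```

Holding out whole sites rather than individual days is what makes the cross-validated R2 informative about prediction at unmonitored locations; in the toy data a site-specific bias lowers the spatial R2 while leaving the temporal R2 at 1.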

Application
Subsequently, the hybrid models were used to predict daily concentrations of NO2, O3, PM10 and PM2.5 per LSOA, by averaging the predictions at all postcode centroids located within the LSOA. Thus, for the 5373 LSOAs included in the Greater London study area we predicted pollutant concentrations for 219,093 postcode centroids.

Table 1 presents the distribution of the estimated NO2, O3, PM10 and PM2.5 concentrations predicted from the first-stage (independent) and the ensemble/hybrid exposure assessment methods. The data shown are long-term (years 2009 to 2013) predicted concentrations of pollutants per LSOA, after averaging the predictions at all postcode centroids within the area of the LSOA. The number of postcode centroids per LSOA ranged from 1 to 1585, with a median value of 32 postcodes. The ST LUR predicted pollutant concentrations slightly underestimate the measurements from fixed sites. The same holds for the NO2 and PM10 predictions obtained from dispersion modeling.

Table 2 summarizes the agreement between the independent exposure assessment methods. Regarding NO2, O3 and PM10 we assessed the spatial and temporal agreement by comparing the predictions obtained from LUR and dispersion models, while for PM2.5 we compared the methods two at a time. The spatial agreement between LUR and dispersion models is moderate. The temporal agreement is better than the spatial. The pollutant which displays the highest spatial agreement between exposure assessment methods is NO2. The highest temporal agreement was observed for PM2.5, between the dispersion and the ensemble machine learning approach. However, the mean difference is larger for NO2 and O3 compared to PM.

Table 3 presents the performance of the independent and the hybrid modeling approaches for NO2, O3, PM10 and PM2.5 concentrations at the fixed air pollution monitoring sites.
The ST LUR model for NO2 performed well (CV-R2: 0.71), with better ability to predict spatially than over time (i.e., for NO2 spatial R2: 0.67 and temporal R2: 0.33), while for O3 and particulate matter the ST LUR model performed moderately well. Compared to the LUR model, the dispersion model performed similarly regarding the prediction of nitrogen dioxide and ozone and better regarding the prediction of particulate matter. Both hybrid modeling approaches (hybrid model 1 and 2) outperformed the independent models and improved the accuracy of predictions for all pollutants in terms of RMSE. Incorporating the estimates derived from dispersion models and the ensemble machine learning approach (only for PM2.5) into the LUR model (hybrid model 1) resulted in an increase in the CV-R2 value of 5 to 22 percentage points. The combination of estimates derived from the separate exposure assessment methods within a GAM increased the CV-R2 value by 9 to 19 percentage points. The largest improvement in terms of CV-R2 was for PM10.

The hybrid modeling approaches for NO2 showed better ability to predict spatially than temporally, while for ozone and particulate matter they predicted better temporally (Table 4). Figure 3 shows the yearly pattern (2009 to 2013) of the combined estimates derived from independent exposure assessment methods, within a GAM. The distribution of the estimated pollutant concentrations over the study period is similar. PM concentrations have more outliers compared to the NO2 and O3 series.

Table 3. Model performance evaluated by the value of adjusted R2 and 10-fold cross validated (CV) R2, root mean square error (RMSE) and mean bias, for the independent and hybrid modeling approaches.

Figure 4 displays an application of the combination of estimates derived from independent exposure assessment methods, within a GAM, to predict pollutant long-term concentrations per LSOA. The combination of estimates derived from ST LUR and dispersion models into a GAM framework (hybrid model 2) allows the relative weights for each model to vary spatially and by concentration and, therefore, displays better performance.
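The per-LSOA aggregation used in the application (daily predictions at postcode centroids averaged within each LSOA, then averaged over the study period for long-term concentrations) can be sketched as follows. The data layout, the identifiers and the use of an unweighted mean across postcode centroids are assumptions made for illustration; the source states only that predictions at all postcode centroids within an LSOA were averaged.

```python
from collections import defaultdict


def lsoa_daily_means(predictions, postcode_to_lsoa):
    """Average daily postcode-level predictions within each LSOA.

    predictions: dict mapping (postcode, day) -> predicted concentration.
    postcode_to_lsoa: dict mapping postcode -> LSOA code (hypothetical lookup).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for (postcode, day), value in predictions.items():
        key = (postcode_to_lsoa[postcode], day)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def long_term_means(daily_by_lsoa):
    """Average the daily LSOA values over all available days."""
    values = defaultdict(list)
    for (lsoa, _day), value in daily_by_lsoa.items():
        values[lsoa].append(value)
    return {lsoa: sum(v) / len(v) for lsoa, v in values.items()}


# Toy example: two postcodes in LSOA "L1", one in "L2", two days of predictions.
lookup = {"P1": "L1", "P2": "L1", "P3": "L2"}
preds = {
    ("P1", 1): 10.0, ("P2", 1): 20.0, ("P3", 1): 30.0,
    ("P1", 2): 12.0, ("P2", 2): 18.0, ("P3", 2): 40.0,
}
daily = lsoa_daily_means(preds, lookup)
print(daily)
print(long_term_means(daily))
```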


Findings
We developed a number of air pollution exposure assessment approaches for the Greater London Area, based on different methodological principles, to estimate concentrations at fine spatial (LSOA level) and temporal (daily) scales. For NO2, O3 and PM10 we developed ST LUR and dispersion models, whilst for PM2.5 we additionally developed a model using machine learning algorithms and incorporating satellite data. These independent methods of estimating pollutant concentrations are prone to errors from the uncertainty inherent in the measurement of the variables used to develop the corresponding models. The errors are likely independent, as the dispersion modelling uses an emission inventory and information on atmospheric transformation processes influenced by the urban-scape, whilst the LUR models are based on air pollution measurements and the spatial and temporal variables determining their magnitude. So, it appears intuitively attractive to combine these methods, expecting that the errors of each separate method will cancel out. In the present project we used two types of combination models for each pollutant: one that incorporates the results of the dispersion model as a covariate in the LUR (hybrid 1) and one that combines the predictions from the 2 or 3 (for PM2.5) independent models with a GAM (hybrid 2). The Greater London Area has the advantage of a very dense monitoring network, with measurements from 130 sites for NO2, 115 for PM10 and 62 for O3. The smaller number of sites measuring PM2.5 (n = 33) was enhanced by additionally using the database developed in [25], resulting in the use of 104 sites. This network was used to predict the pollutant concentrations at each site and to assess the exposure error and the agreement between methods.
Our results indicate that the combination models performed better than the separate models in terms of cross-validated R2, RMSE and mean bias. It should be noted that for pollutants with high spatial variability and well-understood determinants of this variability, such as NO2, the models explain a larger proportion of the spatial variability. For pollutants that are more spatially homogeneous and tend to have larger temporal variability, the models tend to explain the temporal variability better. It is also interesting to note that Hybrid 1 performs better for PM2.5 and PM10, whilst Hybrid 2 performs better for the gaseous pollutants.
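For reference, the three performance indices named above can be computed from out-of-fold predictions as in the following sketch. It assumes an ordinary least-squares model, random assignment of observations to 10 folds, and CV-R2 defined as one minus the ratio of residual to total sum of squares; the fold scheme and R2 definition used in the study may differ. The function name `kfold_cv_metrics` and the demo data are hypothetical.

```python
import numpy as np


def kfold_cv_metrics(X, y, k=10, seed=0):
    """Return (CV-R2, RMSE, mean bias) from k-fold out-of-fold OLS predictions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    pred = np.empty(len(y), dtype=float)
    for test_idx in folds:
        # Train on every observation that is not in the current fold.
        train_idx = np.setdiff1d(idx, test_idx)
        A = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        B = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        pred[test_idx] = B @ coef
    resid = pred - y
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    rmse = np.sqrt(np.mean(resid ** 2))
    bias = np.mean(resid)  # positive values mean over-prediction on average
    return float(r2), float(rmse), float(bias)


# Synthetic demonstration data (not study data): two predictors plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = 20.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0.0, 1.0, 500)

cv_r2, cv_rmse, cv_bias = kfold_cv_metrics(X_demo, y_demo)
print(f"CV-R2 = {cv_r2:.2f}, RMSE = {cv_rmse:.2f}, mean bias = {cv_bias:.3f}")
```

With repeated daily measurements per monitoring site, holding out whole sites rather than individual observations would give a stricter test of spatial prediction; the random split here is only the simplest choice.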

Evaluation of the Combined Modeling Performance
Relatively few studies have compared the performance of combinations of LUR and dispersion models at daily and fine spatial scales. In the published work the temporal scale ranges from annual to monthly or bi-weekly, and is seldom daily. Additionally, the models have been developed for very different geographical areas and, although the concepts are often similar, the models developed and those compared follow very different methodologies. Some studies compared different prediction methods (e.g., regression models and machine learning algorithms) using the same predictors and generally found that the application of machine learning methods yields better predictions [45][46][47].
De Hoogh et al. [18] compared the performance of LUR and dispersion modelling but did not assess the performance of any combination model. In a later work, De Hoogh et al. [48] developed and evaluated an extended LUR model for predicting annual concentrations of PM2.5, NO2, BC and O3 in Western Europe, incorporating dispersion model estimates, kriging and satellite observations in addition to the land use variables. The combined full model performed better than the less sophisticated models. This work was extended to eight elemental PM components [49]. Akita et al. [17] compared the performance of several models with a combined one using Bayesian Maximum Entropy in predicting annual NO2 concentrations in Catalunya, Spain, and reported that their proposed combined framework outperformed the more conventional approaches (LUR, dispersion and others) based on RMSE and other indices. Wang et al. [19] developed a combined LUR and chemical transport model, using a geostatistical modelling framework, for bi-weekly estimation of O3 and PM2.5 in the Los Angeles Basin and reported that the combined model outperformed the initial models, especially in improving the accuracy of O3 predictions. Tripathy et al. [50] developed a combined model ("Hybrid") for PM2.5, BC and metal components for the Pittsburgh area, which performed better for PM2.5 than for the other components, but its performance was not compared to other models.
We compared the performance of models in terms of the accuracy and bias of their predictions in space and time. Other papers evaluate how well the models perform in terms of providing valid and accurate estimates of the association between pollution exposure and health outcomes. The optimal methods under these two types of assessment do not necessarily coincide [51]. The exposure assessment methods presented in this paper have also been evaluated in terms of their performance in estimating health effects, using simulations which indicated that the Hybrid 2 model performed better for both PM and gaseous pollutants [52,53].

Advantages and Limitations
Our work has some advantages: it relies on a very dense and extensive monitoring network, allowing many points in space for validation. Additionally, it combines exposure assessment approaches widely used in epidemiological studies. Furthermore, it assesses concentration estimates at a very fine spatial and temporal scale. However, it also has a number of limitations. London has a dense air pollution monitoring network. This may not be the case in other urban settings, especially those suffering from poor air quality. In such cities, the lack of measurements may limit the possible modelling and hybrid approaches. However, this could be overcome by designing a specific monitoring campaign or by applying methods to enhance existing measurement databases [25]. The methods evaluated are only a subset of those that can be developed for the same area. Thus, other models may incorporate further data as available (for example, satellite data, which were used only for PM2.5 in our study) and other algorithms for prediction. Other types of combinations to produce hybrid models may also be used. Models using land use/cover variables to predict air pollutant concentrations should be periodically updated to capture any changes in land use. In Europe, freely available land use databases (e.g., CORINE, Urban Atlas) are updated every 6 years. Data from local sources could be included to account for intense or fast changes in land use and therefore improve the predictive performance of LUR models. Additionally, the transferability of the comparison results to other areas is questionable. As many characteristics determining space- and time-specific pollutant concentrations depend on the local topography, urban characteristics, population behavior and climate, our results may not be readily transferable to other locations, and this aspect should be further investigated.

Conclusions
In conclusion, we show that combination or hybrid exposure models, which combine independent modelling methods based on different methodological principles, perform better than the individual methods in terms of valid and accurate estimation of concentrations in time and space. This is broadly in accordance with the sparse and not directly comparable results that have already been published for other geographical locations. Future work should further evaluate methods that combine approaches (often termed "hybrid" to denote a variety of combinations), which appear to consistently outperform the single-method approach.