Time-Dependent Downscaling of PM 2.5 Predictions from CAMS Air Quality Models to Urban Monitoring Sites in Budapest

Abstract: Budapest, the capital of Hungary, has been facing serious air pollution episodes in the heating season similar to other metropolises. In the city a dense urban air quality monitoring network is available; however, air quality prediction is still challenging. For this purpose, 24-h PM 2.5 forecasts obtained from seven individual models of the Copernicus Atmosphere Monitoring Service (CAMS) were downscaled by using hourly measurements at six urban monitoring sites in Budapest for the heating season of 2018–2019. A 10-day long training period was applied to fit spatially consistent model weights in a linear combination of CAMS models for each day, and the 10-day additive bias was also corrected. Results were compared to the CAMS ensemble median, the 10-day bias-corrected CAMS ensemble median, and the 24-h persistence. Downscaling reduced the root mean square error (RMSE) by 1.4 µg/m 3 for the heating season and by 4.3 µg/m 3 for episodes compared to the CAMS ensemble, mainly by eliminating the general underestimation of PM 2.5 peaks. As a side-effect, an overestimation was introduced in rapidly clearing conditions. Although the bias-corrected ensemble and model fusion had similar overall performance, the latter was more efficient in episodes. Downscaling of the CAMS models was found to be capable and necessary to capture high wintertime PM 2.5 concentrations for the short-range air quality prediction in Budapest.


Introduction
Hungary's capital, Budapest, the ninth largest city of the European Union, has been facing high concentrations of particulate matter (PM), and the guidelines defined by the European Environmental Agency have been exceeded at urban monitoring sites, especially in the winter [1]. The World Health Organization estimated approximately 8000 deaths related to outdoor air pollution in Hungary each year [2], and the APHEKOM (Improving Knowledge and Communication for Decision Making on Air Pollution and Health in Europe) project found that excess pollution caused by particulate matter smaller than 2.5 µm in diameter (PM 2.5 ) reduced life expectancy by 19 months in Budapest in the period of 2004-2006, which was the second highest value among the 25 investigated European cities [3]. The dominant local source of PM in the Budapest area is domestic heating [1,4,5], but the contribution from large-scale transport is also considerable [5,6]. Winter stagnation episodes with persistent inversions and high PM 2.5 concentrations pose a major environmental risk in Budapest [7,8] and are expected to amplify the seasonal flu epidemic [9]. Urban air quality regulation and decision-making

Methods
A fusion of CAMS air quality models was applied to downscale PM 2.5 air quality predictions to monitoring sites in Budapest. The fused prediction $c_{\mathrm{fusion},x,t}$ for location x and time t was constructed as a time-dependent linear combination of the 7 independent models based on the idea presented in [36]:

$$c_{\mathrm{fusion},x,t} = w_{0,t} + \sum_{i=1}^{7} w_{i,t}\, c_{i,x,t}$$

The prediction $c_{i,x,t}$ was either (a) the raw prediction or (b) the bias-corrected PM 2.5 prediction from the ith CAMS model for time t and the nearest grid point to monitoring location x. Model weights $w_{0,t}, \ldots, w_{7,t}$ were optimized on each day by minimizing the regularized RMSE cost function $J_t$ over all available locations and over the training period [t − T − d, t − d], where X is the number of observation sites with available hourly data in the training period and T is the length of the training period. A delay d was introduced to simulate delays in the availability of monitoring data. Note that the model weights were time-dependent but spatially consistent: all stations were included in the optimization of the model weights to enhance the consistency of the fused prediction for the entire urban area.

To reduce overfitting, a regularization term R was added to the cost function. It contains two terms that penalize deviations from the ensemble mean and large temporal shifts in consecutive model weighting, respectively [36].

Depending on the available urban monitoring sites reporting hourly PM 2.5 concentrations in Budapest, X = 4, X = 5 or X = 6 was used (see data availability in Table 1). The training period was set to 10 days, i.e., T = 240 was applied for hourly data. Thus, a total of 960-1440 observation-prediction pairs were used to fit 8 independent weights for each day, representing each CAMS model's relevance for predicting urban PM 2.5 concentrations during the past 10 days, plus a zero-degree term.

Regularization strength parameters α and β were set to 0.1 and 100, respectively. Model experiments with a set of different regularization strength parameters between 0.1 and 100 were carried out and compared to the measurements. Very little sensitivity to the α parameter was found. Increasing the β parameter improved the prediction of concentration peaks but had little overall impact. The delay d in monitoring data availability was set to 24 h, permitting manual data quality assurance.
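The fusion step described above can be illustrated with a minimal sketch. It only evaluates the fused prediction and a regularized RMSE cost for a given set of weights; the minimization itself can be done with any numerical optimizer. The exact form of the regularization term is not given in this section, so the quadratic penalties below, the pairing of α with the ensemble-mean term and β with the temporal-shift term, and all function and variable names are assumptions made for illustration only.

```python
import math

N_MODELS = 7  # number of individual CAMS models in the fusion


def fused_prediction(weights, model_preds):
    """Linear combination for one site and hour.

    weights: [w0, w1, ..., w7], where w0 is the zero-degree term.
    model_preds: the 7 model predictions for that site and hour.
    """
    return weights[0] + sum(w * c for w, c in zip(weights[1:], model_preds))


def cost(weights, prev_weights, samples, alpha=0.1, beta=100.0):
    """Regularized RMSE over one training window.

    samples: (model_preds, observed) pairs pooled over all sites and
    all hours of the training window, so the weights stay spatially
    consistent.
    """
    squared = [(fused_prediction(weights, preds) - obs) ** 2
               for preds, obs in samples]
    rmse = math.sqrt(sum(squared) / len(squared))
    # ASSUMPTION: quadratic penalty on deviation from equal (ensemble
    # mean) weighting of the 7 models, scaled by alpha.
    dev_from_mean = sum((w - 1.0 / N_MODELS) ** 2 for w in weights[1:])
    # ASSUMPTION: quadratic penalty on the change of every weight
    # relative to the previous day's fit, scaled by beta.
    temporal_shift = sum((w - p) ** 2 for w, p in zip(weights, prev_weights))
    return rmse + alpha * dev_from_mean + beta * temporal_shift


equal = [0.0] + [1.0 / N_MODELS] * N_MODELS
samples = [([7.0] * N_MODELS, 7.0)]
print(cost(equal, equal, samples))
```

With equal weights and no change from the previous day, both penalty terms vanish and the cost reduces to the plain RMSE.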
Hourly PM 2.5 monitoring data were obtained from the Hungarian Air Quality Network for 6 sites in Budapest, including city and suburban locations (Table 1, Figure 1). Monitoring sites included suburban residential neighborhoods with low local traffic, but strong local domestic sources and major highway roads within 2 km distance (Budatétény, Gilice tér); densely inhabited suburban hubs with high local traffic and domestic emissions (Gergely utca, Kőrakás park); and city center areas with very high local traffic and low domestic emissions (Erzsébet tér, Honvéd). All stations lie between 100-150 m elevation above mean sea level over a flat terrain bounded by hills from the west reaching 400-500 m elevation.

Near-real-time, daily initialized 24-h regional model predictions were obtained from the Copernicus Atmosphere Monitoring Service (CAMS) [15,16].

The data fusion method was implemented in two ways: (a) using the raw CAMS forecasts as input predictions; (b) correcting the previous 10-day additive bias independently for each model and using the bias-corrected CAMS model forecasts as input predictions for the fusion.
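A minimal sketch of the 10-day additive bias correction for a single model is given here. It assumes that the bias is the mean model-minus-observation difference over the training window, pooled over whatever site-hour pairs are passed in, and that missing observations are marked with None; the function names are hypothetical.

```python
def additive_bias(past_model, past_obs):
    """Mean (model - observation) over the training window.

    Pairs with a missing observation (None) are skipped.
    """
    diffs = [m - o for m, o in zip(past_model, past_obs) if o is not None]
    if not diffs:
        return 0.0  # no usable observations: leave the forecast unchanged
    return sum(diffs) / len(diffs)


def bias_correct(forecast, past_model, past_obs):
    """Subtract the training-window bias from each forecast value."""
    bias = additive_bias(past_model, past_obs)
    return [c - bias for c in forecast]


# A model that was 5 µg/m3 too low in the window is shifted upward.
print(bias_correct([20.0, 30.0], [10.0, 12.0, 14.0], [15.0, 17.0, None]))
```

Because the correction is a constant shift, it changes the mean level of a forecast but not its temporal shape within the forecast day.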
Results of the fusion method were compared with the raw and bias-corrected CAMS models, the CAMS ensemble and the 24-h persistence. Bias correction is an important step, as PM 2.5 model predictions typically had a negative bias, while urban PM 2.5 measurements are assumed to be positively biased compared to representative values [45]. Comparison statistics for each bias-corrected model are also presented. Model-observation differences were measured by the mean absolute bias (MAB), root mean square error (RMSE) and Pearson correlation (r) (Appendix A). The European Air Quality Index (EAQI) accuracy measure was introduced from an operational perspective, as the typical application of air quality predictions is to communicate air quality indices to the public. EAQI accuracy is therefore defined as the percentage of hourly concentrations for which the modeled and the observed values fell in the same EAQI category (Appendix B).
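The RMSE, Pearson correlation and EAQI accuracy measures can be sketched as follows. The category bounds used here (10, 20, 25, 50 and 75 µg/m 3 for PM 2.5) and the treatment of values lying exactly on a bound are assumptions, since the authoritative definitions are those of Appendix B; the function names are hypothetical.

```python
import math

# ASSUMED upper bounds (µg/m3) of the lower five EAQI PM2.5 categories;
# anything above the last bound falls in the top category.
EAQI_PM25_UPPER = [10.0, 20.0, 25.0, 50.0, 75.0]


def rmse(model, obs):
    """Root mean square error of paired hourly values."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def pearson_r(model, obs):
    """Pearson correlation coefficient of paired hourly values."""
    n = len(obs)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_m * var_o)


def eaqi_category(conc, upper=EAQI_PM25_UPPER):
    """Index of the category containing conc (0 is the cleanest)."""
    for idx, bound in enumerate(upper):
        if conc <= bound:  # ASSUMPTION: bounds belong to the lower class
            return idx
    return len(upper)


def eaqi_accuracy(model, obs):
    """Percentage of hours with modeled and observed category equal."""
    hits = sum(eaqi_category(m) == eaqi_category(o)
               for m, o in zip(model, obs))
    return 100.0 * hits / len(obs)


print(eaqi_accuracy([5.0, 15.0, 30.0, 60.0], [8.0, 22.0, 40.0, 80.0]))
```

The narrow lower categories mean that a small absolute error can change the category at low concentrations, whereas the same error often leaves it unchanged in the wider upper categories.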
Model performance was evaluated on three subsets of the October 2018-April 2019 period. The first subset was the official heating season between 15 October and 15 April, representing the overall quality of predictions in the winter half-year. The second subset was restricted to days of episodes (polluted periods), when air quality predictions are of regulatory interest and are actively communicated to the public. The third subset was the pattern shift days marking the onset and clear-up of polluted periods, which are landmark events in the assessment of air pollution and are of primary interest to users of air quality predictions. Model performance on pattern shift days is also an important indicator of potential overfitting on the data of past days and demonstrates the added value of modeling compared to persistence or measurement time series extrapolation methods.
An episode was defined as a day when the daily mean PM 2.5 concentration exceeded 25 µg/m 3, or a day when at least three of the four neighboring days (the previous two and the following two) had a daily mean PM 2.5 concentration above 25 µg/m 3 at any of the six stations. Pattern shift days were defined as the first days of an episode and the first days after an episode, given that both the episode and the following clear period lasted at least 3 days. By these definitions, episodes occurred on a total of 70 days and pattern shift days on a total of 15 days. Selected temporal subsets are presented in Table 2.
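The episode and pattern shift definitions can be sketched as below. The wording leaves some room for interpretation, so this sketch makes several assumptions: the input is the highest daily mean among the stations for each day, neighboring days outside the series count as non-exceeding, and an episode contributes its onset and clear-up days only if both the episode and the clear period following it last at least 3 days. All names are hypothetical.

```python
THRESHOLD = 25.0  # µg/m3, daily mean PM2.5 limit used in the definition


def episode_flags(daily_max):
    """Flag episode days.

    daily_max: for each day, the highest daily mean among the stations.
    """
    exceed = [v > THRESHOLD for v in daily_max]
    n = len(exceed)
    flags = []
    for d in range(n):
        # the two previous and two following days that exist in the series
        neighbours = [exceed[j] for j in (d - 2, d - 1, d + 1, d + 2)
                      if 0 <= j < n]
        flags.append(exceed[d] or sum(neighbours) >= 3)
    return flags


def pattern_shift_days(flags, min_len=3):
    """Return indices of onset and clear-up days of qualifying episodes."""
    runs = []  # (start index, length, is_episode) for each run of equal flags
    start = 0
    for d in range(1, len(flags) + 1):
        if d == len(flags) or flags[d] != flags[start]:
            runs.append((start, d - start, flags[start]))
            start = d
    shifts = []
    for k in range(len(runs) - 1):
        ep_start, ep_len, is_episode = runs[k]
        clear_start, clear_len, _ = runs[k + 1]
        if is_episode and ep_len >= min_len and clear_len >= min_len:
            shifts.extend([ep_start, clear_start])
    return shifts


daily = [10, 10, 10, 30, 30, 12, 30, 30, 10, 10, 10, 10]
flags = episode_flags(daily)
print(flags)
print(pattern_shift_days(flags))
```

In the example, the day with 12 µg/m 3 is still counted as part of the episode because all four neighboring days exceed the threshold, which is the gap-filling role of the second clause of the definition.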

Results
A linear model fusion with time-dependent but spatially consistent weights of CAMS air quality models was applied to obtain 24-h PM 2.5 predictions for Budapest. Figure 2 presents the observed time series, the fused prediction and the CAMS ensemble median for three urban sites. Time-dependent bias correction was also applied for each day by removing the additive bias of the previous 10-day long window. The bias-corrected ensemble and the fusion of bias-corrected individual models (bias-corrected fusion) are also presented with points in Figure 2.
The CAMS ensemble median generally underestimated urban PM 2.5 concentrations, especially during episodes. Downscaled forecasts, both by time-dependent bias correction of the CAMS ensemble and by the fusion of individual CAMS models, largely improved the predictions and better captured the episode peaks (Figure 2). Exceptions were the clear-up days following episodes, which were overestimated by the downscaled predictions. If the 10-day training period included an episode, a large negative bias was corrected and/or the model predicting higher concentrations was overweighted, which subsequently resulted in overestimations after the episode. Meanwhile, the clear-up event was clearly captured by all models, and the overestimation was less severe in the data fusion model than for the bias-corrected CAMS ensemble. Accordingly, the daily root mean square error (RMSE) calculated from hourly model-observation pairs at all available observation sites (Figure 3) was lower for the fused model than for any individual model during the entire heating season, except for the clear periods following an episode.

The observed intra-urban variability reached a 20 µg/m 3 difference between peak 24-h PM 2.5 concentrations among urban sites. As expected, this could not be captured by the models: the fused model underestimated peaks at more polluted sites (Gergely utca) and overestimated peaks at less polluted sites (Kőrakás park) (Figure 2). Site-specific weighting might solve this issue, but this research aimed to produce a downscaled model for the entire urban area, and not for specific local environments.
In the 15 October 2018-15 April 2019 heating period, the comparison with hourly urban PM 2.5 measurements yielded a Pearson correlation of 0.71 for the fusion model and 0.70 for the CAMS ensemble median. The individual models ranged between 0.58 and 0.68, except for SILAM, which had a slightly better correlation than the ensemble (Figure 4). As expected, the 10-day bias correction only slightly affected correlations, although EMEP and EURAD gained larger improvements. Twenty-four-hour persistence had a correlation of 0.56, worse than any of the CAMS models, underlining the added value of the models. CAMS model predictions were negatively biased, consistent with both the general negative bias of CAMS PM 2.5 forecasts during this winter [45] and the non-representativity of urban sites. The notable exception was SILAM, with a positive bias over the entire heating season as well as in the two subsets of episodes and pattern shifts.
The added value of fusion became more visible during episodes, where the fused model reached a Pearson correlation of 0.56, while persistence (0.41) was only slightly worse than the CAMS ensemble (0.43) and better than many individual models. In contrast, on the pattern shift days, the fused Pearson correlation of 0.48 was weak compared to 0.65 for the CAMS ensemble. Note that the definition of pattern shift days meant that the correlation of persistence was negative in this subset.

The RMSE of the CAMS ensemble prediction improved from 11.4 µg/m 3 to 10.0 µg/m 3 both by bias correction and by data fusion, a value lower than that of any of the individual models and the persistence. The performance gain was more pronounced during episodes, improving the RMSE from 17.2 µg/m 3 for the CAMS ensemble to 13.2 µg/m 3 by bias correction and 12.9 µg/m 3 with fusion. Note that only one of the original individual models (EMEP) had a lower RMSE than persistence during episodes, but a simple 10-day additive bias correction largely improved predictions for each model. On the other hand, on pattern shift days the RMSE worsened from 9.3 µg/m 3 for the CAMS ensemble to 9.5 µg/m 3 by bias correction and 10.5 µg/m 3 with fusion.
The EAQI category was accurately predicted in 51% of the cases by the CAMS ensemble, better than any of the individual models and the persistence. The same accuracy was observed after bias correction and for model fusion. However, during episodes, the negative bias of the original CAMS ensemble was serious and thus the EAQI accuracy was only 32%, compared to 50% after bias correction and 51% with model fusion. On pattern shift days, EAQI accuracy was 45% for the CAMS ensemble, slightly worsened by the downscaling methods to 41-45%. Note that the EAQI categories are very sensitive at low concentrations, thus higher EAQI accuracy can be expected in more polluted periods (Appendix B).
For the entire heating season, the bias-corrected CAMS ensemble, the fusion of individual CAMS models, and the fusion of bias-corrected individual CAMS models resulted in similar prediction performance and clearly improved on the original CAMS ensemble. However, model selection must consider a tradeoff between the correct prediction of concentrations during relatively persistent episodes and on rapidly changing pattern shift days. Most public and policy interest in air quality forecasts occurs during episodes, where the fusion of bias-corrected individual models offers the best results. In contrast, on pattern shift days, the lagged training period introduces error, and the original CAMS ensemble or a fusion of the original CAMS models is preferable (Figure 4).
Among the individual models, there are large differences in performance, and in some cases one of the models performs better than the ensemble (Figure 4). However, the time series of model weights in the fusion model (Figure 5) shows that each model gains relatively high weights in some periods. Notably, the positively biased SILAM was the dominant model in the polluted mid-winter (Figure 5), but it was almost neglected in the cleaner spring months. The 10-day training window makes it possible to benefit from the added value of each model in its optimal season and to down-weight it in other conditions, although with a delay of 10 days. With more available observations to reliably fit the model weights, the training period could be shortened, and thus downscaling for rapid pattern shifts could be improved.

Discussion
Local on-site air quality measurements are often applied to improve and downscale air quality forecasts. In this study, an increase in the correlation coefficient from 0.43 to 0.56 was observed during air pollution episodes by a time-dependent weighting of the CAMS model ensemble. The RMSE was reduced by 12% in general and by 25% during episodes, mostly due to the elimination of a large negative bias. The obtained correlations are similar to the 0.5-0.64 range found by Borrego et al. after applying bias correction of hourly PM 2.5 predictions from three European air quality models [34]. Monteiro et al. found an improvement of 18% in the RMSE of PM 10 prediction by the static linear regression method, similar to the model fusion method applied here with the important difference of using spatially consistent weights [32]. A similar study reached a decrease of 43% in the RMSE of PM 10 predictions by bias correction [33]; however, it found that the CAMS ensemble median was not better than the best individual model, whereas here the bias-corrected ensemble outperformed all individual models in PM 2.5 prediction and was slightly further improved by model fusion during episodes (a total RMSE reduction of 12-25% relative to the original CAMS ensemble). The selection between spatially consistent or spatially dependent weights is based on the intended representativity of the results. Spatially dependent weights optimize predictions for each monitoring site, while spatially consistent weights produce an optimized consistent model prediction for the entire domain. In this study, the latter approach was selected to provide a general downscaling for the city of Budapest and avoid overfitting on the local environment of each site.
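The quoted relative RMSE reductions follow directly from the absolute values reported in the Results. A short worked check, using only those reported numbers (the helper name is hypothetical):

```python
def relative_reduction(before, after):
    """Relative decrease in percent when going from before to after."""
    return 100.0 * (before - after) / before


# RMSE values in µg/m3 as reported for the CAMS ensemble and the fusion
season = relative_reduction(11.4, 10.0)    # whole heating season
episodes = relative_reduction(17.2, 12.9)  # episode days only

print(f"heating season: {season:.0f}%, episodes: {episodes:.0f}%")
```

The same absolute differences, 1.4 µg/m 3 and 4.3 µg/m 3, are the ones quoted in the abstract.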
Incorporating surface air quality measurements into air quality models is generally performed by data assimilation. Assimilating additional surface measurements into air quality simulations yields a large improvement in prediction quality (e.g., a 0.4 increase in the correlation coefficient of the PM 2.5 analysis in a study over Europe [47]). However, data assimilation is performed as part of the atmospheric chemistry transport simulation and enforces grid-scale consistency [25]; thus, it cannot be considered a downscaling method. Surface air quality measurements over Europe are operationally assimilated in the CAMS air quality prediction system, so data assimilation was already included in the input data for this study. However, urban measurements can also be used to post-process model outputs for downscaling to local levels. For this purpose, learning algorithms and data fusion methods are applied.
Learning algorithms, e.g., artificial neural networks, rely on the regression relationship between external meteorological factors and air quality and are typically applied in complex urban environments where the representativity of atmospheric chemistry models is limited. Neural network PM 2.5 predictions reached correlations of 0.37-0.46 in a North American study [48] and 0.52-0.85 in a European study [49], compared to the range of 0.56-0.71 obtained in this study with the downscaling method. While learning algorithms are not considered downscaling tools, atmospheric transport model results can be used as a predictor variable in the learning algorithm. An artificial neural network coupled with an atmospheric trajectory model reached an RMSE of 19.8 µg/m 3 for 1-day PM 2.5 forecasting and was very efficient at predicting the high peaks observable in the Jing-Jin-Ji area, China [50]. Better performance was found in a study performed in Iran that reached correlations of 0.86-0.91 and an RMSE of 2.76-7.04 µg/m 3 for hourly PM 2.5 forecasts; however, these results were obtained by using PM 10 as a predictor variable for PM 2.5 forecasting [51]. While neural networks show impressive performance in predicting air quality at complex urban sites, their main limitation is that they can hardly be generalized to broader areas, and obtaining predictions for locations without measurements requires an atmospheric chemistry model or interpolation between sites. In contrast, the data fusion method applied in this study benefits from the spatial consistency of the underlying air quality models.
Observing the model weights in different periods offers forecasters quantitative, time-dependent information on model applicability and thus supports CAMS-based operational air quality forecasting and decision making. For example, SILAM had the highest RMSE among all models during the whole season, but it gained the largest fitted weights in the polluted mid-winter days. The second highest overall RMSE was that of MATCH, which, on the other hand, gained weights larger than 1/7 in the cleaner spring months. CHIMERE and MOCAGE were two models with low overall RMSEs, but their fitted weights dropped to near zero in the polluted January 2019. EMEP showed the most balanced weights as well as the lowest overall RMSE; however, it was often outweighed by other models, such as EURAD in November, and SILAM in December and January. Time-dependent model weights provide more applicable information for flexible model selection in daily forecasts than the overall validation statistics, and they can therefore enhance the quality of derived forecasts and public trust in CAMS air quality forecasting.

Conclusions
A downscaling method using 24-h forecasts from seven independent air quality models of the Copernicus Atmosphere Monitoring Service (CAMS) was introduced to improve PM 2.5 predictions in Budapest in the heating season of 2018-2019. Hourly observations from six urban monitoring sites were used to fit time-dependent, but spatially consistent weights in a 10-day long moving training period to produce a model-weighted prediction. A 10-day additive bias was also corrected for each model and the CAMS ensemble median.
Both the bias-corrected ensemble and the model fusion improved model predictions compared to the original CAMS ensemble. The RMSE in the overall heating season improved from 11.4 µg/m 3 to 10.0 µg/m 3 . The added value of downscaling was more pronounced during episodes, improving the RMSE from 17.2 µg/m 3 to 12.9 µg/m 3 . This came at the price of introducing forecast error in rapidly changing conditions due to the lagged training period; however, the time and direction of pollution changes were still captured. The European Air Quality Index (EAQI) category was correctly predicted in 51% of the hourly cases. With downscaling, the same accuracy could be reached in episodes, while the original CAMS ensemble reached an accuracy of only 32% during episodes. While the bias-corrected ensemble and the model fusion had similar overall performance, the latter was more efficient at predicting PM 2.5 peaks, and the time-dependent model weighting benefited from all the widely different CAMS model systems in their optimal conditions.