Probabilistic Forecasting of the 500 hPa Geopotential Height over the Northern Hemisphere Using TIGGE Multi-Model Ensemble Forecasts

Bayesian model averaging (BMA) and ensemble model output statistics (EMOS) were used to improve the prediction skill of the 500 hPa geopotential height field over the northern hemisphere with lead times of 1–7 days based on ensemble forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF), National Centers for Environmental Prediction (NCEP), and UK Met Office (UKMO) ensemble prediction systems. The performance of BMA and EMOS were compared with each other and with the raw ensembles and climatological forecasts from the perspective of both deterministic and probabilistic forecasting. The results show that the deterministic forecasts of the 500 hPa geopotential height distribution obtained from BMA and EMOS are more similar to the observed distribution than the raw ensembles, especially for the prediction of the western Pacific subtropical high. BMA and EMOS provide a better calibrated and sharper probability density function than the raw ensembles. They are also superior to the raw ensembles and climatological forecasts according to the Brier score and the Brier skill score. Comparisons between BMA and EMOS show that EMOS performs slightly better for lead times of 1–4 days, whereas BMA performs better for longer lead times. In general, BMA and EMOS both improve the prediction skill of the 500 hPa geopotential height field.


Introduction
Numerical weather prediction is an important technique in modern weather forecasting in which the basic equations of atmospheric motion are solved numerically to predict the future state of the atmosphere given the initial and boundary conditions [1,2]. However, chaos theory shows that even small perturbations in the initial conditions will lead to very different forecasts [3,4]. The prediction skill of a single deterministic forecast is limited by the uncertainties associated with the models and the initial conditions [5,6]. It is therefore necessary to use probabilistic forecasting rather than deterministic forecasting to provide a better numerical weather prediction service.
Probabilistic weather prediction describes the probability of the occurrence of future weather events as a percentage [7]. The forecast information (i.e., the uncertainty and probability) provided by probabilistic weather prediction is important in decision-making and can help to predict extreme weather events [8][9][10]. Probabilistic forecasting gradually became operational in many countries and regions from about 1960-for example, the Meteorological Development Laboratory of the US National Weather Service has tested and operationally run the model output statistics forecast equation as a tool for weather forecasting since the 1990s.
Ensemble forecasting is a key technique used for conceptual transition from a single deterministic forecast to a probabilistic forecast [11][12][13]. Ensemble forecasting aims to quantitatively describe the uncertainty of the forecast and to provide a more reliable probability density function (PDF) rather than a better deterministic forecast [14,15]. However, ensemble forecasts are usually biased and under-dispersed as a result of the imperfect ensemble prediction system (EPS) and the limited initial perturbation schemes [16,17]. Several statistical post-processing methods have therefore been proposed and applied to reduce systematic errors and forecast uncertainties [18][19][20][21][22][23][24][25][26][27].
Ensemble model output statistics (EMOS) [28][29][30] and Bayesian model averaging (BMA) [31][32][33] are the two most popular and effective post-processing methods for the correction of bias in ensemble forecasting and produce probabilistic forecasts. The BMA method combines predictive PDFs from different EPSs by weighted averages. The weights are equal to the posterior probabilities of the EPSs and reflect their relative performance during the training period. Contributing EPSs with a better performance over the training period will be assigned a larger weight and will thus contribute more to the BMA predictive PDF. The EMOS method generates a single parametric PDF using the raw ensemble forecasts directly rather than combining their kernels or PDFs, which need to be computed separately in BMA. The predictive parameters in EMOS are estimated using multiple regression equations with the raw ensemble forecasts and the corresponding verifications over a training period. Both BMA and EMOS have been applied to several different weather quantities (i.e., temperature, wind, and precipitation) and have been shown to improve prediction skills relative to the raw ensembles [34][35][36][37][38][39][40][41][42][43][44][45][46][47].
Few studies have applied BMA and EMOS to the geopotential height for probabilistic forecasting. The geopotential height field reflects the state of the atmosphere circulation and has an important influence on the regional and hemispheric climate. It is also important in short-and medium-range weather forecasts [48][49][50][51][52][53]. An improvement in forecasting of the 500 hPa geopotential height field will help forecasters to better understand largescale atmospheric circulation patterns and give more accurate predictions of the weather. We therefore used these two ensemble-based statistical post-processing methods for bias correction and probabilistic forecasting of the 500 hPa geopotential height in the northern hemisphere and compared their performance.
The rest of this paper is organized as follows. Section 2 describes the datasets and the BMA and EMOS methods and introduces the verification methods. Section 3 compares the raw ensembles and forecasts obtained from the BMA and EMOS prediction models and our summary and conclusions are presented in Section 4.

Data
We used multi-member 500 hPa geopotential height forecasts produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Centers for Environmental Prediction (NCEP) and the UK Met Office (UKMO) EPSs at a horizontal resolution of 2.5 • × 2.5 • . The forecasts were initialized daily at 1200 UTC for lead times of 1-7 days using the TIGGE datasets ( Table 1). The TIGGE datasets are a key component of THORPEX and have collected ensemble forecast data from 10 global model prediction centers since 2006. We analyzed the data for the northern hemisphere for a one-year period from 1 January to 31 December 2013. The ECMWF ERA5 reanalysis dataset for the 500 hPa geopotential height products at 1200 UTC was used as the verification dataset.

Bayesian Model Averaging
The BMA method was first applied to the ensemble forecasting of surface pressure and temperature by Raftery et al. [31]. Following their research, the BMA predictive PDF of a weather quantity y is a weighted average of the individual PDFs centered on the bias-corrected forecasts corresponding to the members of the ensemble. The weights reflect the relative performance of the ensemble members during a given training period. Raftery et al. [31] suggested that temperature and sea-level pressure can be fitted reasonably well by a normal distribution. The BMA predictive PDF is then given by where f k is the forecast of the kth model, g k (y| f k ) is the conditional normal PDF with a mean of (a k + b k f k ) and a standard deviation σ. ω k is the weight corresponding to the performance of the kth model in the given training period. A simple linear regression is used to estimate the BMA parameters a k and b k (k = 1, 2, . . . , K); ω k and σ are estimated by a maximum likelihood method [31,54]. The BMA parameters are estimated during a training period. A sliding training period has been proved to be better than a fixed period [55]. We therefore retrained the BMA prediction model on each grid point in the study area every day throughout the forecast period using a training sample period of N previous days. The performance of different BMA prediction models using different lengths of training period showed that the previous 45 days before the forecast period was the optimum length of the training period (not shown).

Ensemble Model Output Statistics
The EMOS predictive PDF of a weather quantity y is single parametric where the parameters are obtained from the ensemble members of the contributing EPSs. Following Gneiting et al. [28], the EMOS normal predictive distribution is defined as: with f 1 , . . . , f K , for the K-member of an individually distinguishable forecast at a geopotential height of 500 hPa. a and b 1 , . . . , b K are the regression coefficients. S 2 is the ensemble variance with non-negative coefficients c and d. We estimated the EMOS coefficients by minimizing the continuous ranked probability score (CRPS) during the training period.
Our EMOS experiments were partially implemented in the ensembleMOS packages of R.

Verification Methods
We used the root mean square error (RMSE) to evaluate the deterministic forecast of the raw ensemble, BMA and EMOS forecasts. The CRPS, Brier score (BS), and Brier skill score (BSS) were adopted to assess the probabilistic forecasts.

Root Mean Square Error
The RMSE is the square root of the average of the squared differences between the forecasts and observations: where N is the total number of grid points and y i and o i are the model forecast and observed value of the ith grid sample, respectively. The RMSE is negatively oriented, which means that a smaller value indicates a better prediction skill. The CRPS is a measure of the integrated squared difference between the cumulative distribution function of the forecasts and the corresponding cumulative distribution function of the observations: where f (y i ) is the forecast cumulative distribution function (CDF) at model forecast y i and o i is the corresponding observed value. H(y i − o i ) is the Heaviside function and is equal to 1 when y i ≥ o i , otherwise it has a value of 0. The CRPS is also negatively oriented and 0 indicates a perfect score.

Brier Score and Brier Skill Score
The BS is the mean square error of probabilistic two-category forecasts [56]: where n is the number of forecast-event pairs, f i is the forecast probability and O i is the observation, which is either 0 (no occurrence) or 1 (occurrence). The BS ranges from 0 to 1 and is 0 for perfect forecasts. The BSS is a skill score based on the values of the BS to assess the relative improvement of the forecast over the reference forecast [57].
We used the climatological forecast as the reference forecast. The climatological probability of the sample was determined for each month and each grid point to consider the differences between the frequency of the climatological event in different regions within the northern hemisphere [58]. The forecast was considered to be perfect when BSS = 1. BSS > 0 meant that there was an improvement compared to the reference (climatological) forecast.

Deterministic Forecasts
We selected two extreme events to investigate the performance of the deterministic forecasts by raw ensembles and two statistical post-processing methods (BMA and EMOS). The first extreme event was the persistent severe rainstorms that occurred in South China from late March 2013. Continuous extreme precipitation processes mainly occurred during the periods 26 March-11 April and 23 April-30 May. The intensity of precipitation was significantly stronger in the second period than in the first period and was affected by the west Pacific subtropical high (WPSH, the area surrounded by the 5880 gpm line on the 500hPa geopotential height) and the South China Sea monsoon. The WPSH covered a larger area and was more intense than the climatological average and the location of the WPSH ridge was further south [59,60]. The observed distribution of the 500 hPa geopotential height showed two relatively stable ridges of high pressure located to the west of the Ural Mountains and east of Lake Baikal, respectively ( Figure 1). Northeast China, North China, and the Yellow and Huaihe river basins were therefore mainly controlled by high-pressure ridges, whereas the area from Xinjiang to the west of Lake Baikal was a low-pressure trough in the 500 hPa geopotential height field. The deterministic forecasts of the ECMWF, NCEP, and UKMO EPSs were similar. They all predicted higher values of 500 hPa geopotential height between the Urals and Lake Baikal, and lower values in the east of Russia compared to the observations. The ECMWF and UKMO EPSs had under-forecasted the geopotential height over China, while NCEP EPS had slightly larger predictions. For WSPH area in the northern hemisphere shown in the observations, the raw ensemble predictions with the lead times of 7 days were not strong enough. The difference between the observations and deterministic forecasts of BMA and EMOS were smaller in comparison of the raw ensembles, with the smallest RMSE and highest ACC values. The location and area predictions of the WPSH using the BMA and EMOS methods were fairly similar to the observations, even with lead times of 7 days.
WPSH covered a larger area and was more intense than the climatological average and the location of the WPSH ridge was further south [59,60]. The observed distribution of the 500 hPa geopotential height showed two relatively stable ridges of high pressure located to the west of the Ural Mountains and east of Lake Baikal, respectively ( Figure 1). Northeast China, North China, and the Yellow and Huaihe river basins were therefore mainly controlled by high-pressure ridges, whereas the area from Xinjiang to the west of Lake Baikal was a low-pressure trough in the 500 hPa geopotential height field. The deterministic forecasts of the ECMWF, NCEP, and UKMO EPSs were similar. They all predicted higher values of 500 hPa geopotential height between the Urals and Lake Baikal, and lower values in the east of Russia compared to the observations. The ECMWF and UKMO EPSs had under-forecasted the geopotential height over China, while NCEP EPS had slightly larger predictions. For WSPH area in the northern hemisphere shown in the observations, the raw ensemble predictions with the lead times of 7 days were not strong enough. The difference between the observations and deterministic forecasts of BMA and EMOS were smaller in comparison of the raw ensembles, with the smallest RMSE and highest ACC values. The location and area predictions of the WPSH using the BMA and EMOS methods were fairly similar to the observations, even with lead times of 7 days.  The second extreme event was the heatwaves that occurred in the summer of 2013. The average number of heatwave days (daily maximum temperature > 35 • C) in this summer was 31 days, the highest number since 1951 [61]. The temperature in South China was abnormally high and there were many extreme high temperature events. One of the main Atmosphere 2021, 12, 253 6 of 14 reasons for this heatwave event was that the WPSH was significantly further north than normal. The WPSH frequently strengthens and expands westward into inland areas, so that most of South China is controlled by the western side of the subtropical high [62]. This strengthened WSPH causes a stronger downward movement of air, weaker convection, and less precipitation in these regions, leading to extreme high temperature events in South China. The raw ensembles predicted higher values in northern Russia and lower values in western Pacific. The BMA and EMOS deterministic forecasts of BMA and EMOS were similar to the observations, whereas the ECMWF, NCEP, and UKMO EPSs predicted a weaker WPSH than the observations (Figure 2).  Figure 3 shows the predictive PDFs obtained from the BMA, EMOS and raw ensemble forecasts for a certain grid point. The raw ensemble PDFs tended to spread the distri-  Figure 3 shows the predictive PDFs obtained from the BMA, EMOS and raw ensemble forecasts for a certain grid point. The raw ensemble PDFs tended to spread the distribution, whereas the BMA and EMOS predictive PDFs were sharper and more concentrated than the raw ensembles. The EMOS PDF was sharper than the BMA PDF. Probabilistic forecasting maximizes the sharpness of the predictive PDF through calibration [31,63]. We therefore evaluated the calibration using the verification rank histogram [64,65] for the raw ensemble forecasts and the probability integral transform (PIT) histogram [18] for the BMA and EMOS forecast distributions. A straight and uniform distribution characterizes an "ideal" EPS. The verification rank histograms for the ECMWF EPS showed a U-shaped distribution ( Figure 4). These uneven distributions indicated that the raw ECMWF EPS was uncalibrated and that the raw ensemble forecasts were underdispersed; the results for other EPSs (not shown) were qualitatively similar. The PIT histogram is a continuous analog of the verification rank histogram obtained by calculating the value of CDF(x) at verification x. The PIT histogram of the calibrated probabilistic forecasts should also be straight and uniform. The BMA forecasts were over-dispersed and yielded inverted U-shaped distributions, although the deviations from uniformity were small for longer lead times. The EMOS PIT histograms were close to uniform, whereas the EMOS performance became worse as the lead time increased. The BMA and EMOS forecasts were better calibrated than the raw ensemble forecasts. The best EMOS performance occurred at lead times of 1-4 days, whereas BMA was superior to EMOS for longer lead times. Probabilistic forecasting maximizes the sharpness of the predictive PDF through calibration [31,63]. We therefore evaluated the calibration using the verification rank histogram [64,65] for the raw ensemble forecasts and the probability integral transform (PIT) histogram [18] for the BMA and EMOS forecast distributions. A straight and uniform distribution characterizes an "ideal" EPS. The verification rank histograms for the ECMWF EPS showed a U-shaped distribution ( Figure 4). These uneven distributions indicated that the raw ECMWF EPS was uncalibrated and that the raw ensemble forecasts were under-dispersed; the results for other EPSs (not shown) were qualitatively similar. The PIT histogram is a continuous analog of the verification rank histogram obtained by calculating the value of CDF(x) at verification x. The PIT histogram of the calibrated probabilistic forecasts should also be straight and uniform. The BMA forecasts were over-dispersed and yielded inverted U-shaped distributions, although the deviations from uniformity were small for longer lead times. The EMOS PIT histograms were close to uniform, whereas the EMOS performance became worse as the lead time increased. The BMA and EMOS forecasts were better calibrated than the raw ensemble forecasts. The best EMOS performance occurred at lead times of 1-4 days, whereas BMA was superior to EMOS for longer lead times. Figure 5 shows the RMSE and CRPS values for the raw ensembles, BMA and EMOS forecasts with different lead times. As expected, the prediction skill generally decreases as the lead time increases for all predictions. The ECMWF EPS performed better than the NCEP and UKMO EPSs and had the smallest RMSE and CRPS values. The EMOS forecasts were the most skillful with lead times of 1-4 days, whereas the BMA forecasts had the best performance for lead times of 5-7 days. The EMOS forecasts were less skillful than the best single-center model (the ECMWF EPS) for longer lead times.   Figure 5 shows the RMSE and CRPS values for the raw ensembles, BMA and EMOS forecasts with different lead times. As expected, the prediction skill generally decreases as the lead time increases for all predictions. The ECMWF EPS performed better than the NCEP and UKMO EPSs and had the smallest RMSE and CRPS values. The EMOS forecasts were the most skillful with lead times of 1-4 days, whereas the BMA forecasts had the best performance for lead times of 5-7 days. The EMOS forecasts were less skillful than the best single-center model (the ECMWF EPS) for longer lead times.   Figure 5 shows the RMSE and CRPS values for the raw ensembles, BMA and EMOS forecasts with different lead times. As expected, the prediction skill generally decreases as the lead time increases for all predictions. The ECMWF EPS performed better than the NCEP and UKMO EPSs and had the smallest RMSE and CRPS values. The EMOS forecasts were the most skillful with lead times of 1-4 days, whereas the BMA forecasts had the best performance for lead times of 5-7 days. The EMOS forecasts were less skillful than the best single-center model (the ECMWF EPS) for longer lead times.  The probabilistic prediction skill for the particular probabilistic forecast event in which the 500 hPa geopotential height forecast at each grid point was higher than its climatological value was measured by the BS and BSS. The reference forecast for the BSS metric was the climatological forecast. In this case, the NCEP and UKMO EPSs performed in a similar manner, whereas the ECMWF EPS had a relatively low prediction skill ( Figure 6). The BS (BSS) scores of the BMA and EMOS forecasts were lower (higher) than the scores for all the raw ensembles, which shows that the forecast skill was improved by the statistical post-processing methods. BMA and EMOS had almost the same probabilistic forecast skill for lead times of 1-4 days, with BMA showing a slightly better performance than EMOS. All the predictions were more skillful than the climatological forecast.

Probabilistic Forecasts
which the 500 hPa geopotential height forecast at each grid point was higher than its climatological value was measured by the BS and BSS. The reference forecast for the BSS metric was the climatological forecast. In this case, the NCEP and UKMO EPSs performed in a similar manner, whereas the ECMWF EPS had a relatively low prediction skill ( Figure  6). The BS (BSS) scores of the BMA and EMOS forecasts were lower (higher) than the scores for all the raw ensembles, which shows that the forecast skill was improved by the statistical post-processing methods. BMA and EMOS had almost the same probabilistic forecast skill for lead times of 1-4 days, with BMA showing a slightly better performance than EMOS. All the predictions were more skillful than the climatological forecast. Reliability displayed via reliability diagram, is commonly used to evaluate probabilistic forecasts for dichotomous events by plotting the observed frequency as a function of the forecast probability [66,67]. The yellow areas in Figure 7 represent the areas of skill relative to the climatology. The curves of raw ensembles were generally above the diagonal, indicating under forecasting (probabilities too low). The distributions of BMA and EMOS were closer to the diagonal, which represents the perfect forecast, than the raw ensembles. EMOS performed better than BMA for low probability forecasts and BMA was slightly superior to EMOS at high probability forecasts. Reliability displayed via reliability diagram, is commonly used to evaluate probabilistic forecasts for dichotomous events by plotting the observed frequency as a function of the forecast probability [66,67]. The yellow areas in Figure 7 represent the areas of skill relative to the climatology. The curves of raw ensembles were generally above the diagonal, indicating under forecasting (probabilities too low). The distributions of BMA and EMOS were closer to the diagonal, which represents the perfect forecast, than the raw ensembles. EMOS performed better than BMA for low probability forecasts and BMA was slightly superior to EMOS at high probability forecasts. To compare the performance of BMA and EMOS more intuitively, Figure 8 shows the maps of the best-performing method in terms of the BSS over the northern hemisphere with different lead times. The performances of BMA and EMOS were similar for shorter lead times. EMOS had a better prediction skill at mid-and high-latitudes, whereas BMA had a better prediction skill at low latitudes. BMA provided the best forecasts for most of the grid points, especially for lead times of 5-7 days. To compare the performance of BMA and EMOS more intuitively, Figure 8 shows the maps of the best-performing method in terms of the BSS over the northern hemisphere with different lead times. The performances of BMA and EMOS were similar for shorter lead times. EMOS had a better prediction skill at mid-and high-latitudes, whereas BMA

Discussion and Conclusions
We applied two state-of-the-art approaches to improve ensemble-based probabilistic forecasts of the 500 hPa geopotential height over the northern hemisphere. BMA and EMOS prediction models were established based on the TIGGE multi-model ensembles with lead times of 1-7 days. The results obtained from these two statistical post-processing methods were then compared in terms of the deterministic and probabilistic forecasts and compared with the raw ensembles of the ECMWF, NCEP, and UKMO EPSs using the TIGGE datasets.
A sliding temporal window of 45 days before the forecast period was shown to be optimum training period for all lead times in the BMA prediction model. The EMOS prediction model also used the previous 45 days as the training sample for a fair comparison of its performance with that of the BMA prediction model.
From the perspective of deterministic forecasts, the predictions of the 500 hPa geopotential height distribution from BMA and EMOS were much closer to the observations than the deterministic forecasts of the raw EPSs. In the two selected extreme events, BMA and EMOS predicted the WPSH well, with a strong intensity and large area even for a

Discussion and Conclusions
We applied two state-of-the-art approaches to improve ensemble-based probabilistic forecasts of the 500 hPa geopotential height over the northern hemisphere. BMA and EMOS prediction models were established based on the TIGGE multi-model ensembles with lead times of 1-7 days. The results obtained from these two statistical post-processing methods were then compared in terms of the deterministic and probabilistic forecasts and compared with the raw ensembles of the ECMWF, NCEP, and UKMO EPSs using the TIGGE datasets.
A sliding temporal window of 45 days before the forecast period was shown to be optimum training period for all lead times in the BMA prediction model. The EMOS prediction model also used the previous 45 days as the training sample for a fair comparison of its performance with that of the BMA prediction model.
From the perspective of deterministic forecasts, the predictions of the 500 hPa geopotential height distribution from BMA and EMOS were much closer to the observations than the deterministic forecasts of the raw EPSs. In the two selected extreme events, BMA and EMOS predicted the WPSH well, with a strong intensity and large area even for a lead time of 7 days. The potential of the BMA and EMOS methods to improve deterministic forecasts shows that it may be possible to apply these two methods to extended-range weather forecasts with lead times of 10-30 days. The pre-rainy season in South China, the meiyu and the rainy season in North China are usually accompanied by northward movement of the WPSH. These regional rainy seasons determine the distribution and evolution of droughts and floods during the flood season in mid-to eastern China. If the circulation at 500 hPa could be predicted correctly, then timing of these three regional rainy seasons could also be predicted better.
The BMA and EMOS methods also provide a better calibrated and sharper PDF than the raw ensembles. The raw ensembles were under-dispersed, whereas the PIT histograms of the BMA and EMOS methods were comparatively uniform. The overall performance of the ECMWF EPS was the best among the three EPSs and had the smallest RMSE and CRPS. For a specific probabilistic forecast event, in which the 500 hPa geopotential height forecast value at each grid point was higher than its climatological value, the BMA and EMOS predictions were both more skillful than the raw ensembles and the climatological forecast in terms of the BS and BSS verification.
In general, the BMA and EMOS methods were similar for lead times of 1-4 days, whereas the performance of EMOS was poorer at longer lead times. This was probably due to the length of the training period. For a fair comparison, the lengths of the training period for the BMA and EMOS prediction models were both set to 45 days for all lead times. However, a 45-day training period may be not suitable for the EMOS method at longer lead times. EMOS is designed based on the model output statistics method and regression techniques. Estimation of the EMOS parameters is strongly dependent on the choice of the training period and the stability of the forecast error of each contributing EPS. Krishnamurti et al. [68] reported that a stable forecast error is a necessary condition for linear regression. Therefore, the optimum length of the training period for EMOS needs to be investigated further.
The use of BMA and EMOS as statistical post-processing methods based on ensemble forecasts therefore improves the quality of 500 hPa geopotential height forecasts. We used the ECMWF, NCEP, and UKMO EPSs, which have relatively high prediction skills. We tried to increase the number of EPSs contributing to the BMA and EMOS prediction models-for example, the China Meteorological Administration EPS-but the results were less good and the poor performance of the China Meteorological Administration EPS made a negative contribution. We therefore suggest that the EPSs used by the BMA and EMOS prediction models should be selected for further study.