Probabilistic Evaluation of the Multicategory Seasonal Precipitation Re-Forecast

: The Meteo-France seasonal forecasting system 7 provides a 7-month forecast range with 25 ensembles. The seasonal precipitation re-forecast (from May to November 1993–2015) was evaluated by the Brier score in terms of accuracy and reliability based on tercile probabilities. Multiple analyses were performed to assess the robustness of the score. These results show that the spatial distribution of the Brier score depends signiﬁcantly on tercile thresholds, reference data, sampling methods, and ensemble types. Large probabilistic errors over the dry regions on land and the Nino regions in the Paciﬁc can be reduced by adjusting the tercile thresholds. The forecast errors were identiﬁed when they were insensitive to different analysis methods. All the analyses detected that the errors increase/decrease with the lead time over the tropical Indian/Paciﬁc Ocean. The intra-seasonal analysis reveals that some of these errors are inherited from monthly forecasts, which may be related to large-scale, short-term variability modes. A new conﬁdence interval calculation was formulated for the “uncertain” case in the reference data. The conﬁdence interval at a 95% level for the mean Brier score over the entire tropical region was quantiﬁed. The best estimations are ~6% the mean Brier score for both the above and below-normal terciles.


Introduction
The seasonal precipitation forecast has important values for social and economic development and activities. For instance, low precipitation during the growing season and heavy precipitation during the harvest season have the potential to damage crops. Therefore, the quality of the forecast, in particular about excessive rainfall or drought, is crucial to sectors such as agriculture and insurance [1][2][3][4].
The seasonal forecast is now regularly produced by major operational forecast centers around the world. These forecast systems are generally based on dynamical ensemble predictions (e.g., https://climate.copernicus.eu/seasonal-forecasts, accessed on 30 May 2022. The prediction utilizes a set of initial conditions and/or numerical models to compute a set of deterministic forecasts. Each of the ensemble members represents a possible outcome in the future. The dynamical ensembles, combined with observations and a statistical model, form a typical probabilistic forecast. However, because of biases in the initial conditions and in the dynamical prediction models, the real-time seasonal forecasts need to be complemented by ensemble hindcasts to take into account the model uncertainties [5,6]. In this regard, a set of long term hindcasts are necessary, and its quality is critical to the estimation of the model uncertainties. The deterministic scores, such as bias, correlation, and root mean squared error, provide information only about the accuracy of forecasted precipitation intensity, but not the probability, which can be described as a binary event with a probability of 1 (precipitation) or 0 (no precipitation). For an ensemble forecast system, probabilistic properties, such as accuracy, skill, and reliability, are essential, especially at different intensity levels, e.g., heavy precipitation or low precipitation. The Brier score (BS) [7,8] is commonly used to measure accuracy of probabilistic forecasts by the mean squared errors of probabilities from the forecast. It can be decomposed to reliability, resolution, and uncertainty [9]. Kharin and Zwiers [10] use the BS to assess the accuracy of the probability forecasts issued by the Canadian Centre for Climate Modelling and Analysis (CCCma). They found that forecast probabilities estimated by the statistical model using the parametric Gaussian distribution estimator have smaller mean square errors than the nonparametric counting estimator on seasonal timescales. The Brier skill score (BSS) indicates the degree of improvement of BS relative to the BS of a climatological forecast. The BSS can be improved by adjusting the mean and standard deviation with two scaling factors to match the observed distribution when the Gaussian estimator is used [10]. Tippett et al. [11] attempted to improve the forecast of seasonal precipitation over land based on ensemble simulation of the general circulation model ECHAM4.5 [12] at the International Research Institute for Climate and Society (IRI). They made comparisons of the nonparametric counting estimator with two parametric estimators for tercile probabilities, but the comparison were assessed by the ranked probability skill score (RPSS), or the multicategory BSS, which is the weighted average of squared error skill scores across categories [13]. They found that the that the RPSS varies with the probability estimators too. The tercile category probability of these estimators is a function of the ensemble size, mean, and variance. However, the RPSS possesses a special property for reliable forecasts: it is only associated with the squared correlation between forecast probabilities and occurrences, and it is insensitive to the number of categories [14]. Becker and Van Den Dool [15] examine the performance of the global seasonal probabilistic forecast based on the North American multi-model ensemble forecasting system. Their study suggests that model diversity is a more important source for the improvement of the evaluation score than the ensemble size.
A reliability diagram is one of the tools used to assess the reliability of a seasonal forecast [16]. It is a graph of the observed frequency of an event against the forecast probability. More reliable forecasts can be achieved through setting the category thresholds, which are determined by fitting the transformed Gaussian distribution/estimator of the forecasted precipitation anomalies to count for the biases in the ensemble mean and variance [15]. Its shortcoming lies in the graph interpretation, which relies on visualization only [17].
The ranked probability score (RPS) is the sum of several BSs evaluated at several probability thresholds [18]. Compared to RPS, the continuous ranked probability score (CRPS) can be seen as an RPS with an infinite number of classes, each of zero width or as the integral of the Brier score over all possible threshold values [18]. The score calculation is not based on any specific intervals. The continuous ranked probability skill score (CRPSS) [19] is suitable for probabilistic evaluation and comparisons of different versions of the forecast systems. The comparison of the 5th generation of the operational European Centre for Medium Weather Forecast (ECMWF) seasonal forecast systems (SEAS5) to the 4th version is made using CRPSS [20]. The SEAS5 is a state-of-the-art, coupled atmospheric, ocean, sea ice, and wave modeling system. Compared to version 4, SEAS5 is improved through a substantial upgrading of atmosphere and ocean models at higher resolutions, and an addition of the new prognostic sea-ice model [20]. Other improvements include a large ensemble forecast with 51 members, advanced initialization techniques, and bias corrections. The system is verified and calibrated by comparing a set of seasonal hindcasts to the observed historical record. The improvement of SEAS5 in the seasonal precipitation forecast over the tropical Pacific, in particular for JJA, is identified by the CRPSS.
The most distinctive feature of SEAS5 is the high El Niňo and the Southern Oscillation (ENSO) forecast skill. ENSO is the most important source of seasonal predictability [21]. It is a combined ocean-atmosphere mode of interannual climate variability, manifesting mostly by sea surface temperature (SST). Palmer [22] showed that the residence time in the regimes of the Lorenz attractor [23] can change in a predictable way by including a forcing in the Lorenz equations. This means that the response of the climate system is predictable when there exists an external forcing. As the forcing from the ocean changes slowly, the predictability of the forecast system resides partly in the oceans [24]. The impact of ENSO on seasonal precipitation is through direct effects or remote teleconnections from SST anomalies [25,26]. The forecast qualities are, thus, strongly related to forecast errors in the large-scale modes driving seasonal and interannual climate internal variabilities such as ENSO [21].
Soil moisture is another important memory component of the climate system, thus, it is a useful source of predictability for seasonal forecasting [27,28]. The interactions of atmosphere and land surface transform the signal with larger variability and less predictability in precipitation to a more predictable soil moisture signal [29]. Components contained in the atmospheric initial conditions are particularly important for shorter intra-seasonal timescales, but information such as heat fluxes contained in the large-scale atmospheric circulation has also been shown to be important for processes such as ENSO on the seasonal time scale [21].
Numerous observational studies have shown that the Madden-Julian Oscillation (MJO) is the dominant mode on the intra-seasonal timescale. The interannual warming of the equatorial sea surface (ENSO) in the central and eastern Pacific is commonly due to the advancement of the western Pacific warm pool associated with the MJO [30]. The MJO is considered a dominant constituent of stochastic forcing for ENSO. Their interaction through exchanges of heat, moisture, and momentum regulates intra-seasonal SST variations. The momentum transfer can drive upper ocean surface currents and, for particularly strong and long-lived westerly wind events, excite oceanic Kelvin waves. Kelvin waves are sometimes associated with ENSO initiation [31]. The results from multi-model ensembles indicate that forecast skill of subseasonal precipitation predictability has a modest but statistically significant relation with ENSO and MJO [32]. Vigaud et al. [33] found that the subseasonal forecast skill is related to the parameter fitting of the extended logistic regression model when it is assessed by RPSS.
As shown by previous studies, probabilistic forecasts can be improved not only by upgrading the dynamic ensemble model and its simulation method, but also by reducing uncertainty sources in the statistical model. Large uncertainties arise from, for example, the choice of probability estimators, the adjustment of their parameters, number of forecast categories, and sampling methods. They can directly alter the forecasts and evaluation scores.
The Meteo-France seasonal ensemble forecasting system is a state-of-the-art coupled climate modelling system. It has been in the operational mode since 1990s [6]. The current system 7 has been extensively evaluated against observations by deterministic scores (available at http://seasonal.meteo.fr/content/PS-scores-cartes, accessed on 30 May 2022), but it still needs rigorous probabilistic evaluation. The current study uses the BS for the evaluation of the accuracy and reliability of precipitation hindcast, especially at low and high intensity levels for the application of seasonal and monthly precipitation products. As numerous uncertainties in the statistical model are not fully and explicitly investigated in the previous research, this study aims to assess the robustness of the BS at the seasonal time scale. Multiple analyses were carried out to detect and to investigate the forecast errors as well as to demonstrate the sensitivity of the BS to different analysis methods. The uncertainty of the BS derived from the analysis was quantified by the confidence interval, considering not only the classical binary precipitation event, but also the uncertain event. This article has 4 sections. Section 2 introduces the forecast system, the observation data, the methods of evaluation in this study. Section 3 describes and compares the results for seasonal and intra-seasonal analyses. Section 4 summarizes the main findings and conclusions obtained in this study.

Dynamical Seasonal Ensemble Forecast
The Meteo-France seasonal forecasting system 7 (System 7) uses the coupled National Centre for Meteorological Research Climate Model version 6 (CNRM-CM6) [34]. It has four component models which are coupled together by the OASIS-MCT (Model Coupling Toolkit) version 3.0 coupler [35] to dynamically exchange fields between models. The Atmosphere model is ARPEGE version 6.4 [36], SURFEX version 8.1 including ISBA land surface model [37] and ISBA-CTRIP river routing model [38,39], ocean model NEMO version 3.6 [40], sea ice model GELATO version 6 [41]. The system uses stochastic perturbations in the dynamics [42] and the initial conditions by nudging technique to generate ensemble members. The observations/references for the nudging are 6 hourly ERA5 prognostic variables. The systematic biases are partly corrected by computing the forecast anomalies with respect to the model climatology. The resolution of ARPEGE is 1 • × 1 • at the horizontal direction, and 91 levels at the vertical direction. System 7 produces 7-month lead forecast, with 12 months (from January to December) as starting months (0 lead month). The hindcast spans from 1993 to 2016 (24 years) with 25 ensemble members (24 ensembles with nudging and one control simulations). The initial condition is the ERA5 reanalysis data archived at ECMWF for atmosphere, and for land surface [43], and Mercator-Ocean [44] for ocean and sea-ice. The system is operated and managed by the Environment for Climate Simulation (ECLIS). The precipitation is forecasted every 12 h on a 1 • × 1 • regular grid with the global coverage.

Observation Data
The Global Precipitation Climatology Project data (GPCP) version 2 [45] contains monthly mean precipitation rates with 2.5 • × 2.5 • resolution from January 1979 to August 2016. It is a merged analysis that incorporates precipitation estimates from low-orbit satellite microwave data, geosynchronous-orbit satellite infrared data, and surface rain gauge observations. The current existing GPCP data at CNRM is rescaled from 2.5 • × 2.5 • to 1 • × 1 • resolution, from 1993 to 2016. The GPCP data provides global coverage, and it is widely applied for climate variability studies and model validation. It was used by other evaluation metrics for the System 7. To be consistent, monthly mean data of the GPCP version 2, instead of the more recent versions, was applied for the probabilistic assessment of the precipitation hindcast. However, the mean precipitation from the GPCP data is likely under-estimated possibly due to missing light rain over ocean (especially the Southern Ocean) and orographic precipitation over land, through interpolation of the data to other coordinates using the conservative remapping method.
The Multi-Source Weighted-Ensemble Precipitation (MSWEP) version 1.2 [46] was also involved in this study. It combines various precipitation sources: satellite data, World Meteorological Organization Global Telecommunication System rain gauges and reanalysis. The monthly mean precipitation data is available for the period 1979-2015 and covers the global at 0.25 • × 0.25 • horizontal resolution. The current existing data was aggregated to 1 • × 1 • regular grid through conservative remapping.

Probabilistic Evaluation Method
The Brier Score (BS) is the mean squared differences between pairs of forecast probabilities f and the binary observations x. N is the total forecast number. It measures the total probability error, considering that the observation is 1 if the event occurs, and 0 if the event does not occur (dichotomous events).
The BS takes values in the range of 0 to 1. Perfect forecasts receive 0 and less accurate forecasts receive higher scores. Under the condition that x is 0.5 when the observation data is uncertain, the mean squared differences between the forecast probabilities and observation at 0.5 is calculated. In the R software package s2dverification [47], the BS is decomposed to BS = reliability − resolution + uncertainty The reliability measures the degree of correspondence between the forecast probability and the observed frequency for an event or outcome that is being predicted. It summarizes the conditional bias of the forecasts for a given event and is equal to the weighted average of squared differences between the forecast and conditional observed probabilities. If the reliability is 0, the forecast is perfectly reliable. To observe the frequency distribution, the forecast probability, from 0 to 1, is divided into 100 bins to compare to the observed frequency in each of the same bin in this study. The resolution characterizes the amplitude of the deviations of the conditional probability from the climatological frequency. The higher this term is the better. The uncertainty measures the inherent uncertainty in the outcomes of the event. It is the variance of observation's frequency. For binary events, it is 0 if an outcome always occurs or never occurs, and it is maximum if the probability of the event is 0.5. As such the uncertainty characterizes the properties of the observed system only.
A probabilistic forecast may be characterized by the full probability distribution such as CRPS or by giving the probabilities of an event falling into classified categories. Tercile category contains the above-normal (TerA), near-normal (TerN), and below-normal (TerB) three categories. Each category or tercile takes 33.3% of the total sample number. The probability based on the tercile category is referred to as the tercile probability. Counting the number of ensemble member falling into each category divided by the total ensemble numbers is a simple method of using a forecast ensemble to estimate categorical probabilities. Due to the inhomogeneous of precipitation probability density functions (pdf) at each grid point on a global scale, some pdf curves are skewed or asymmetric, others may have several peaks. Therefore, the counting estimator instead of parametric estimator was applied in this study. The two tercile thresholds were determined by the values at 1/3 and 2/3 of the total number of the climatological samples in an ascending order.

Experiment Design for Seasonal and Intra-Seasonal Hindcasts
This study evaluated precipitation hindcast, from May to November, with an emphasizes on the summer precipitation on a global scale produced by System 7. The data used is the monthly averaged precipitation. Note that all monthly data were averaged from ensemble daily data. The daily precipitation data is not involved because it is more suitable for the subseasonal assessment over small regions [48]. The spatial distribution of the BS was generated from the probability based on the ensemble hindcast at each grid point during 1993-2015, with the GPCP as the reference data.
There are two types of ensembles for probability estimations. The single initialization ensemble has 25 ensemble members initialized from May. Each member is the monthly mean of the hindcast daily precipitation. The mixed initialization ensemble has seven ensemble members because of the 7-month seasonal forecast range in each year. Each member is the monthly 25 ensemble mean, initialized by 12 months, from January to December.
To test the sensitivity of spatial distribution patterns and values of the BS to the basic elements in the statistical model, multiple analyses were designed with the combinations of different sample types, sample sizes, the way to determine the tercile thresholds (TTs), ensemble types, reference data, and number of the data sets. Table 1 summarizes the features of these six analysis methods. Intra-seasonal analysis is not included in the Table 1, because it can be viewed as an extension of the seasonal analysis by Method 1. If the probabilistic errors are insensitive to the analysis methods, it suggests that they are more related to the forecast system than to the evaluation methods.
The forecast lead time at 0, 1, 2, 3, or 4 months corresponds to the forecast seasons of MJJ, JJA, JAS, ASO, and SON, and the forecast months of May, June, July, August, and September, respectively. For the seasonal hindcast, results in MJJ at the lead time at 0 months are excluded. The evaluation period is 23 years (1993-2015) because the 1 • × 1 • GPCP observation is available till August 2016.

Confidence Interval
The uncertainty of the BS derived by different analysis methods was further quantified by the 95% confidence interval (CI95%), which was estimated by the multiple resampling approaches and analytical expressions of the method of moments. The multiple resampling method with a replacement is usually referred to as the bootstrap method [49]. This method is suitable for a wide range of samples disregarding the underlying distribution type. The BSs were initially calculated for those resampled 25 ensemble members with the GPCP as the reference. For 80 bootstrap resamples of the mean difference, the 2nd value and 78th value of the ranked mean differences are the low and upper boundaries of the CI95%, equivalent to the 2.5 and 97.5 percentile of the resampling distribution.
The bootstrap resampling involves n repeated trials of simple random sampling with a replacement. This method may not result in samples that are equally informative [50]. The sample size, following the sequential bootstrap resampling scheme, is kept as a constant, which is equal to the preassigned value 0.632n [51]. To select the sample, where m is bootstrap samples, and n is the ensemble members. For the n equal to 25, m is 15.8, so 15 samples are taken each time. The sample size for 15 out of 25 ensemble members is 375 (15 × 23 years). The advantage of this percentile confidence interval is that due to the central limit theorem, the distribution from resampling should be close to a normality, except that the BS is a non-linear function of probabilities. The disadvantage is that its computation cost is high, because sometimes it is difficult to converge to the forecast mean. An alternative way is to calculate the CI95% for the BS by the analytical approach proposed by Bradley et al. [52], which is limited to the probability forecast for a dichotomous event. Let observation x be a Bernoulli variable that equals to 1 if the event occurs and 0 if it does not. Let f be a probability forecast of the occurrence of the event; f is either a discrete or continuous random variable, and the BS can be expressed as a sample estimator of the mean squared error (MSE), The expected value of MSE is where µ (2) f is the second-order noncentral moment for the forecast, µ f |x=1 is the first conditional moment and observation is 1 for the forecast, and µ x is the first moment (mean) of the observation, equivalent to the climatological probability of the event occurrence.
The sample estimator for the m order's noncentral moment for the forecasts iŝ The conditional mean, µ f |x=0 , is estimated using subsamples, { f 0 j , k = 1, . . ., N 0 } when x is 0, and the conditional mean µ f |x=1 is estimated using subsamples, { f 1 k , k = 1, . . ., N 1 } when x is 1, in a similar way as above.
The analytical approaches are derived only for binary events within the distributionoriented framework [53], assuming that each forecast-observation pair is independent and identically distributed. The extended analytical equation for the case when x is uncertain (0.5) is presented in Appendix A. It can be generalized to non-binary cases in the same fashion as above.

Probabilistic Evaluation of Seasonal Precipitation Hindcast
In seasonal and decadal forecasts, imperfect initial conditions and model physics and dynamics, as well as long lead time, have large impacts on the long-term mean drift in the model climate. The seasonal precipitation forecast or hindcast often has an asymptote drift, which is that the mean forecast bias is of the same sign but smaller than the longterm bias [54]. By replacing precipitation with precipitation anomaly, it partly corrects the systematic bias in the mean, for instance, the systematic difference between the climatology of the model and the observation. It also removes the trends due to aspects such as seasonal cycle, focusing on variability. Figure 1 displays the BS spatial distribution patterns from JJA to SON, generated by Method 1. The feature of this method is that it evaluates the probabilistic accuracy of the seasonal precipitation variability, as the sample is a precipitation anomaly. Another important feature of this method is that the hindcast and the reference data share the observed TTs. To ensure a wide sample range, the climatological samples of 3 months instead of the 3-month mean were used. For TerB and TerA, the BS ranges generally fell between 0 and 0.55. The BSs greater than 0.3 (>~0.3), which are visible on the map, are distributed mostly in the tropical or sub-tropical region, and in the Antarctic. The high BS value is set to be around or more than 0.45 (>~0.45). The high BSs are found in both terciles, for TerB in particular, over the extremely dry Sahara Desert in North Africa. This indicates the probabilistic error is significant for a low precipitation forecast. The BS distribution pattern shows seasonal features in this region. Large BS (>~0.45) distribution patterns in TerA and TerB appear in JJA. As the lead time increases, they remain in JAS, and are reduced in ASO and SON. The high BS in TerB is more persistent than that in TerA. These errors may be related to the West Africa monsoon circulation. During the monsoon (June to September), the south westerly wind is strong, pushing the ITCZ northward. The very dry and hot Harmattan retreats toward the north.
Meteorology 2022, 2, FOR PEER REVIEW 8 TerA. These errors may be related to the West Africa monsoon circulation. During the monsoon (June to September), the south westerly wind is strong, pushing the ITCZ northward. The very dry and hot Harmattan retreats toward the north. Seasonal features also can be seen in the BS spatial distribution in the Arabian Peninsula, which is also a desert region. A high BS in JJA and JAS gradually reduces, and it reaches the minimum in SON. For TerB, a high BS remains even in SON. This is possibly related to the subsidence of the Indian summer monsoon flow during June to September [55], which superimposes on the impact of ENSO. In the southwest of the peninsula, a north-south-oriented mountain range (>1500 m) is bordering the Red Sea. It plays an important role in uplifting the Indian monsoon flow from the Red Sea [56] to produce heavy precipitation [57]. The recent study identifies that the large biases exist in daily maximum and minimum land surface temperatures in North Africa and the Arabian Peninsula [58]. The under-estimation can have a direct impact on latent and sensible heat fluxes and their feedback from the atmosphere, leading to too-low moisture transport and precipitation in arid regions. The precipitation rates for these two regions from System 7 indeed have negative biases (−1 mm/day), especially in JJA and JAS, as shown by the bias chart. A low anomaly correlation coefficient (ACC) frequently appears in the region where the BS is located. Charts are available at http://seasonal.meteo.fr/content/PS-scores-cartes (accessed on 30 May 2022).
Tropical rainforests have a significant impact on precipitation due to high moisture flux exchange rates between land and atmosphere. Soil moisture is also a slow varying variable at a seasonal time scale, and acts as a low boundary forcing factor of the model. A high BS distribution pattern can be observed initially in JJA for both terciles over the tropical rainforest in the Kongo Basin, but it disappears gradually after summer. The precipitation is very low there (~100 mm monthly) during the dry season from May to September, with the lowest precipitation being in July [59]. Similar conditions also can be found near the southeast of the Amazon rainforest in South America.
The continent of the Antarctic is also extremely dry (it is technically a desert). Almost all Antarctic precipitation falls as snow. West Antarctica is mostly covered by a massive ice sheet. The greenhouse-gases-induced global warming causes strong ice loss along the area 50° W-90° W. This is possibly not totally captured by the hindcast as the BS, especially for TerA, which starts to increase in ASO in the West Antarctic. The high BS pattern expands till SON when the warm season starts in the southern hemisphere (SH).
The tropical Pacific Ocean is the main region with large probabilistic errors due to its heavy precipitation in summer. Over the Nino 1 (80°-90° W and 0°-10° S) to 3.4 re- Seasonal features also can be seen in the BS spatial distribution in the Arabian Peninsula, which is also a desert region. A high BS in JJA and JAS gradually reduces, and it reaches the minimum in SON. For TerB, a high BS remains even in SON. This is possibly related to the subsidence of the Indian summer monsoon flow during June to September [55], which superimposes on the impact of ENSO. In the southwest of the peninsula, a north-south-oriented mountain range (>1500 m) is bordering the Red Sea. It plays an important role in uplifting the Indian monsoon flow from the Red Sea [56] to produce heavy precipitation [57]. The recent study identifies that the large biases exist in daily maximum and minimum land surface temperatures in North Africa and the Arabian Peninsula [58]. The under-estimation can have a direct impact on latent and sensible heat fluxes and their feedback from the atmosphere, leading to too-low moisture transport and precipitation in arid regions. The precipitation rates for these two regions from System 7 indeed have negative biases (−1 mm/day), especially in JJA and JAS, as shown by the bias chart. A low anomaly correlation coefficient (ACC) frequently appears in the region where the BS is located. Charts are available at http://seasonal.meteo.fr/content/PS-scores-cartes (accessed on 30 May 2022).
Tropical rainforests have a significant impact on precipitation due to high moisture flux exchange rates between land and atmosphere. Soil moisture is also a slow varying variable at a seasonal time scale, and acts as a low boundary forcing factor of the model. A high BS distribution pattern can be observed initially in JJA for both terciles over the tropical rainforest in the Kongo Basin, but it disappears gradually after summer. The precipitation is very low there (~100 mm monthly) during the dry season from May to September, with the lowest precipitation being in July [59]. Similar conditions also can be found near the southeast of the Amazon rainforest in South America.
The continent of the Antarctic is also extremely dry (it is technically a desert). Almost all Antarctic precipitation falls as snow. West Antarctica is mostly covered by a massive ice sheet. The greenhouse-gases-induced global warming causes strong ice loss along the area 50 • W-90 • W. This is possibly not totally captured by the hindcast as the BS, especially for TerA, which starts to increase in ASO in the West Antarctic. The high BS pattern expands till SON when the warm season starts in the southern hemisphere (SH).
The tropical Pacific Ocean is the main region with large probabilistic errors due to its heavy precipitation in summer. Over the Nino 1 (80 • -90 • W and 0 • -10 • S) to 3.4 region (170 • -120 • W and 5 • N-5 • S) in the eastern tropical Pacific Ocean, the potential predictability for SST forcing is highest where natural variability is comparatively low, and the atmosphere responds directly to the change in SST [21]. It is true that this region has the highest ensemble mean ACC for precipitation. Nevertheless, a high BS distribution pattern along the equator in TerA can be seen in JJA. It gradually extends to the west, and it shifts further to the Nino 3 (90 • -150 • W and 5 • N-5 • S) and 4 regions (150 • -160 • E and 5 • N-5 • S) till SON, as the heavy precipitation center is moving in the same direction. Similar conditions occur for TerB with a lower BS. Marine precipitation depends greatly on SST and the moisture flux from the ocean. As the coupled system is able to capture the SST variation quite well, e.g., high ACC (see http://seasonal.meteo.fr/content/PS-scores-cartes, accessed on 30 May 2022), discrepancies in the precipitation anomalies are possibly also associated with other tropical atmospheric and oceanic short-term forcing, which propagates eastward from the Indian Ocean or westward from the Pacific Ocean in summer. In the eastern Pacific adjacent to the west coast of the US, a high BS distribution pattern in JJA gradually disappears after JAS.
Some high BSs in TerA and TerB are found to be distributed over the northwest coast of Australia in JJA. This is a northwest cloud-bands (NWCB) active region. During the dry season (May to October) in the SH, the heaviest rain often occurs from May to July due to NWCB [60]. This is a complicated process due to the influences from multiple large-scale variability modes. High BS distribution patterns in TerA and TerB subsequently emerge over the adjacent Indian Ocean in JAS during the transition of maximum and minimum frequencies of NWCB [61], expanding over the Indian Ocean till SON. This seasonal feature and/or lead time feature is particularly strong over the tropical Indian Ocean. It is worth noticing that the equatorial Indian Ocean is the origin of the MJO [30]. This suggests that intra-seasonal precipitation anomalies should be also analyzed. Figure 2 is the same as Figure 1 but is generated by Method 2. This method uses the 3-month mean precipitation as samples, so the reference data has a smaller sample size. The most distinct feature of Method 2 is that the hindcast and the GPCP reference data each defines their own TTs. The hindcast takes advantage of the availability of the large amount of ensemble climatological samples to define TTs. Usually, it is more likely for a large number of samples to capture the distribution of a small set of samples. It can be seen that there is a significant improvement in the BS compared to that in Figure 1. The high BS distribution patterns over land shown by Figure 1 all disappeared. This indicates that the hindcast has a very good agreement with the GPCP observation over North Africa, the Arabian Peninsula, and the West Antarctic, as well as those around the Amazon and Kongo Basin during all seasons. The evaluation Method 2 demonstrates that System 7 ensemble hindcast climatology is capable of reproducing the observed precipitation probability accurately, even under drought conditions. Compared to Figure 1, high BSs are reduced significantly, but are still visible over the tropical oceans and oceans in SH, more clearly in TerA than in TerB. They decrease over the Nino regions and increase over the Indian Ocean with the lead time, exhibiting seasonal features as well as lead time features. The similar evolution also can be observed in Figure 1. This is because the climate variability of large-scale modes like ENSO is also seasonally dependent [21]. A few errors are persistent over the tropical Atlantic during all seasons, similar to those in Figure 1 Over the tropical western Pacific (north of Australia), errors in TerA are discerned in JJA and JAS in Figure 2 but not in Figure 1.
There are some contradictions in the distribution patterns between the BS, the bias, and the ACC over the tropical ocean, where the high BS, ACC, and bias can be seen (http: //seasonal.meteo.fr/content/PS-scores-cartes, accessed on 30 May 2022). The precipitation ACC is the highest over the tropical ocean, the Pacific Ocean in particular, but the negative correlations appear below the high values over the Nino 1 to 3 regions from JJA to SON. This pattern can be found in the BS distributions generated by Method 2. Indeed, the lowest BS exists in that region too. However, the negative ACC does not show any moving features, while the high BS patterns show clear moving features for both terciles. This is partly because ACC reflects the relevancy between forecast and observed anomalies. It is calculated by all samples. The BSs for TerA and TerB zoom into the samples with high or low precipitation rates to measure how well the forecast of precipitation/anomaly frequencies is achieved. It is worth pointing out that for the BS in TerN, there are few errors in the tropical Pacific by Method 2 (not shown). It is obvious that the BS is related to the bias to some extent. However, the bias in the tropical ocean increases significantly with the lead time as the season progresses, while the BS does not. The probabilistic errors, along with the highest ACC in the JJA seasonal precipitation hindcast, are also identified over the Nino 1 to 2 regions by Johnson et al. [20], with the ECMWF's SEAS5, using the CRPSS evaluation score. Figure 3 is the same as Figure 2 but is created by Method 3. The only difference between Method 2 and 3 is that the hindcast in Method 3 defines its TTs by the ensemble's mean climatological samples. Only the results in JJA are presented in Figure 3. The BS spatial distribution for TerA is very close to the pattern generated by Method 2, except the slight differences in North Africa and the Arabian Peninsula. However, the BS spatial distribution for TerB in Figure 3a is significantly different from that in Figure 2a over the arid regions. It is similar to the BS distribution pattern for TerB in Figure 1a, but the errors are larger. Method 3 also estimated some high BSs for TerA and TerB distributed over tropical oceans, for example, the high BS over the western Pacific adjacent to the northwest of Australia. Note the sample size for the hindcast is the same for Method 2 and 3, but Method 2 uses more samples to define the hindcast TTs, while the hindcast and reference in Method 3 have the same sample number to define the TTs. This implies that the BS is more sensitive to TTs than to the ensemble size, although a larger ensemble size may also affect the hindcast TTs.
Meteorology 2022, 2, FOR PEER REVIEW 11 any moving features, while the high BS patterns show clear moving features for both terciles. This is partly because ACC reflects the relevancy between forecast and observed anomalies. It is calculated by all samples. The BSs for TerA and TerB zoom into the samples with high or low precipitation rates to measure how well the forecast of precipitation/anomaly frequencies is achieved. It is worth pointing out that for the BS in TerN, there are few errors in the tropical Pacific by Method 2 (not shown). It is obvious that the BS is related to the bias to some extent. However, the bias in the tropical ocean increases significantly with the lead time as the season progresses, while the BS does not. The probabilistic errors, along with the highest ACC in the JJA seasonal precipitation hindcast, are also identified over the Nino 1 to 2 regions by Johnson et al. [20], with the ECMWF's SEAS5, using the CRPSS evaluation score.  Figure 3 is the same as Figure 2 but is created by Method 3. The only difference between Method 2 and 3 is that the hindcast in Method 3 defines its TTs by the ensemble's mean climatological samples. Only the results in JJA are presented in Figure 3. The BS spatial distribution for TerA is very close to the pattern generated by Method 2, except the slight differences in North Africa and the Arabian Peninsula. However, the BS spatial distribution for TerB in Figure 3a is significantly different from that in Figure 2a over the arid regions. It is similar to the BS distribution pattern for TerB in Figure 1a, but the errors are larger. Method 3 also estimated some high BSs for TerA and TerB distributed over tropical oceans, for example, the high BS over the western Pacific adjacent to the northwest of Australia. Note the sample size for the hindcast is the same for Method 2 and 3, but Method 2 uses more samples to define the hindcast TTs, while the hindcast and reference in Method 3 have the same sample number to define the TTs. This implies that the BS is more sensitive to TTs than to the ensemble size, although a larger ensemble size may also affect the hindcast TTs. The spatial distribution of the Brier score for the seasonal hindcast of a 3-month mean precipitation, using the hindcast tercile thresholds determined by the ensemble climatology (Method 2). The reference is the GPCP observations. The Brier scores for the below, (a,c,e,g), and the above-normal terciles, (b,d,f,h), at 1, 2, 3, and 4 lead months correspond to JJA, JAS, ASO, and SON, respectively. Figure 4 is the same as Figure 2 but is generated by Method 4. The only difference between Method 4 and 2 is that the hindcast and the reference data in Method 4 share the TTs defined by the GPCP observations, like Method 1. Only the results in JJA are presented in Figure 4. Compared to Figure 2, the most noticeable difference is the high BS for TerB over North Africa and the Arabian Peninsula. The BS spatial distribution pattern for TerA is similar to that in Figure 2b, except for over North Africa and the Arabian Peninsula, and the slightly higher BS (~0.3) distribution pattern can be discerned. It means that the BS in TerB is more sensitive to the TTs. Compared to Figure 1, the probabilistic errors in Figure 4 grow quickly with the seasons/lead time using Method 4 (not shown), although the BS for TerA is closer or even better than that in Figure 1b in JJA. Figure 5 is the same as Figure 2 but is generated by Method 5, in which two observation data sets were used. Only the results in JJA and SON are presented. The ensemble of the GPCP and the MSWEP observation data increases the observed climatological sample size and slightly the observed sample range, which may result in changes of the observed TTs. It is more important that one more outcome, "uncertain", was added to the observed binary events. The BS calculation still follows its definition. The resulted BS distribution patterns for TerA and TerB are very similar. Compared to Method 2, the BS shows an overall improvement in all seasons, ranging between 0 and 0.44. The most noticeable change is that the high BS for TerA and TerB are eliminated over the Nino region in the tropical Pacific.
A little high BS (>~0.36) was still visible in the tropical western Pacific adjacent to the northeast of Australia in JJA and JAS. This is possibly due to the sampling method for the hindcast in Method 2, 3, and 4. The high BS (>~0.36) distribution patterns over the Indian Ocean for TerA and TerB in JJA gradually become larger in summer, reaching the maximum till SON. This is similar to those shown by Figure 1g,h and Figure 2g,h, indicating the high BS over the Indian Ocean is insensitive to different analysis methods. This is also true for small errors over the tropical Atlantic, shown by very slight BS distributions in that region, as in Figures 1 and 2. The implication of this analysis is that the ensemble hindcast may generate tercile probabilities closer to 0.5 than to 0 or 1, such as over the tropical Pacific. However, the BS spatial distributions are similar when either the GPCP or MSWEP is used as the reference (not show). Note that Method 5 may have a disadvantage to the grid which has a probability closer to 0 or 1.  Figure 4 is the same as Figure 2 but is generated by Method 4. The only difference between Method 4 and 2 is that the hindcast and the reference data in Method 4 share the TTs defined by the GPCP observations, like Method 1. Only the results in JJA are presented in Figure 4. Compared to Figure 2, the most noticeable difference is the high BS for TerB over North Africa and the Arabian Peninsula. The BS spatial distribution pattern for TerA is similar to that in Figure 2b, except for over North Africa and the Arabian Peninsula, and the slightly higher BS (~0.3) distribution pattern can be discerned. It means that the BS in TerB is more sensitive to the TTs. Compared to Figure 1, the probabilistic errors in Figure 4 grow quickly with the seasons/lead time using Method 4 (not shown), although the BS for TerA is closer or even better than that in Figure 1b Figure 4 is the same as Figure 2 but is generated by Method 4. The only difference between Method 4 and 2 is that the hindcast and the reference data in Method 4 share the TTs defined by the GPCP observations, like Method 1. Only the results in JJA are presented in Figure 4. Compared to Figure 2, the most noticeable difference is the high BS for TerB over North Africa and the Arabian Peninsula. The BS spatial distribution pattern for TerA is similar to that in Figure 2b, except for over North Africa and the Arabian Peninsula, and the slightly higher BS (~0.3) distribution pattern can be discerned. It means that the BS in TerB is more sensitive to the TTs. Compared to Figure 1, the probabilistic errors in Figure 4 grow quickly with the seasons/lead time using Method 4 (not shown), although the BS for TerA is closer or even better than that in Figure 1b in JJA.   analysis methods. This is also true for small errors over the tropical Atlantic, shown by very slight BS distributions in that region, as in Figure 1 and Figure 2. The implication of this analysis is that the ensemble hindcast may generate tercile probabilities closer to 0.5 than to 0 or 1, such as over the tropical Pacific. However, the BS spatial distributions are similar when either the GPCP or MSWEP is used as the reference (not show). Note that Method 5 may have a disadvantage to the grid which has a probability closer to 0 or 1. (c) (d) Figure 5. The spatial distribution of the Brier score for the seasonal hindcast of precipitation using the GPCP and the MSWEP ensemble observation as the reference data (Method 5). The Brier Figure 5. The spatial distribution of the Brier score for the seasonal hindcast of precipitation using the GPCP and the MSWEP ensemble observation as the reference data (Method 5). The Brier scores for the below, (a,c), and the above-normal terciles, (b,d), at lead times of 1 and 4 months correspond to JJA and SON, respectively. Figure 6 presents the BS spatial distribution in JJA generated by Method 6. Method 6 uses 3-month precipitation anomalies as samples like Method 1, but the hindcast defines its TTs by the ensemble mean climatological samples, like Method 3. The most special feature of this method is the ensemble construction. Each ensemble member starts from a different month, rather than a fixed month May, so that more ensemble data were involved. Generally, the BS in TerB is significantly higher than that generated by other methods. Although the BS spatial distribution for TerA is similar to that in Figures 3b and 4b, it is significantly higher than that in Figure 5b. This analysis demonstrates the importance of the ensemble type in the statistical model for the evaluation.
The above analyses show that the errors over the tropical land in the BS maps for TerB, and for TerA as well, were detected by Method 1, 3, 4, and 6, not by Method 2 and 5 due to the TTs effects. Errors over the tropical Pacific were detected by Method 1, 2, 3, 4, 5, and 6, as Method 5 can correct those over the Nino regions with an additional probability of 0.5 in the reference data, but not over the western Pacific (north of Australia). Those over the Indian Ocean were derived from all methods, supported by the large negative precipitation bias of System 7. fines its TTs by the ensemble mean climatological samples, like Method 3. The most special feature of this method is the ensemble construction. Each ensemble member starts from a different month, rather than a fixed month May, so that more ensemble data were involved. Generally, the BS in TerB is significantly higher than that generated by other methods. Although the BS spatial distribution for TerA is similar to that in Figures 3b  and 4b, it is significantly higher than that in Figure 5b. This analysis demonstrates the importance of the ensemble type in the statistical model for the evaluation.
(a) (b) Figure 6. The spatial distribution of the Brier scores for the 3₋month precipitation anomalies in JJA, using the tercile thresholds for the hindcast derived from ensemble mean climatological samples, with a mixed month initialization ensemble (Method 6). The Brier scores for the below (a) and the above₋normal terciles (b) at 1 lead month correspond to JJA.
The above analyses show that the errors over the tropical land in the BS maps for TerB, and for TerA as well, were detected by Method 1, 3, 4, and 6, not by Method 2 and 5 due to the TTs effects. Errors over the tropical Pacific were detected by Method 1, 2, 3, 4, 5, and 6, as Method 5 can correct those over the Nino regions with an additional probability of 0.5 in the reference data, but not over the western Pacific (north of Australia). Those over the Indian Ocean were derived from all methods, supported by the large negative precipitation bias of System 7. Figure 7 is the reliability decomposed from the BS for JJA using ensemble observations (Method 5) shown by Figure 5. The overall reliability, ranging from 0 to 0.33, is more satisfactory than the corresponding BS. Locations with a high BS due to large probabilistic errors are mostly caused by low reliabilities over the tropical western Pacific and the Indian Ocean. The strong correlations between reliability and the BS are also found in the analyses using other methods (not shown). Consequently, their spatial distribution patterns are similar, such as in Figure 5a, b and Figure 7a, b. Generally, high reliability leads to a good BS, but resolution may also affect the BS. Reliability is important to an ensemble prediction system as it is crucial to an accurate forecast, as well as the predictability of the system [14]. Figure 6. The spatial distribution of the Brier scores for the 3-month precipitation anomalies in JJA, using the tercile thresholds for the hindcast derived from ensemble mean climatological samples, with a mixed month initialization ensemble (Method 6). The Brier scores for the below (a) and the above-normal terciles (b) at 1 lead month correspond to JJA. Figure 7 is the reliability decomposed from the BS for JJA using ensemble observations (Method 5) shown by Figure 5. The overall reliability, ranging from 0 to 0.33, is more satisfactory than the corresponding BS. Locations with a high BS due to large probabilistic errors are mostly caused by low reliabilities over the tropical western Pacific and the Indian Ocean. The strong correlations between reliability and the BS are also found in the analyses using other methods (not shown). Consequently, their spatial distribution patterns are similar, such as in Figure 5a,b and Figure 7a,b. Generally, high reliability leads to a good BS, but resolution may also affect the BS. Reliability is important to an ensemble prediction system as it is crucial to an accurate forecast, as well as the predictability of the system [14].

Probabilistic Evaluation of the Intra-seasonal Precipitation Hindcast
The analyses of the seasonal precipitation hindcast detected probabilistic errors mostly over the tropical region, in particular, over the Indian Ocean and the Pacific Ocean. Precipitation forecast in the tropical region depends significantly on ENSO at seasonal to interannual time scales. By far, ENSO is the most successfully predicted large-scale mode and a known source of predictability. This leads to the intra-seasonal probabilistic evaluation in order to improve our understanding about intra-seasonal to Figure 7. The reliability decomposed from the Brier score for precipitation in JJA using the GPCP and the MSWEP ensemble observation as the reference, in the above-normal (a) and the below-normal terciles (b).

Probabilistic Evaluation of the Intra-Seasonal Precipitation Hindcast
The analyses of the seasonal precipitation hindcast detected probabilistic errors mostly over the tropical region, in particular, over the Indian Ocean and the Pacific Ocean. Pre-cipitation forecast in the tropical region depends significantly on ENSO at seasonal to interannual time scales. By far, ENSO is the most successfully predicted large-scale mode and a known source of predictability. This leads to the intra-seasonal probabilistic evaluation in order to improve our understanding about intra-seasonal to seasonal precipitation variations, which are dominated by large-scale nature variability modes and their interactions, such as MJO and ENSO. As shown by several recent studies, it is evident that ENSO also has influences on the subseasonal precipitation forecast [62,63]. Figure 8 is the BS spatial distribution for TerA in June, July, and August. The 23-year monthly precipitation anomalies were assessed for these three months using Method 1 as an extension of the seasonal analysis of JJA. The sample size was decreased from the 3-month data in Method 1 to the monthly data. The BS has a general increase, as it is sensitive to the sample size. In addition, this is because the intra-seasonal precipitation is more affected by short-term and random forcings, which may be difficult to capture by the dynamic model. It can be also attributed to the reduced observed climatological sample range, which determines TTs. Generally, large BSs distribute at the same locations as those in the BS maps in JJA, displayed in Figure 1a,b. They remain or diverge in the tropical region and in the subtropical eastern Pacific. Compared to the BS for the seasonal precipitation anomalies (Figure 1), a significant increase of the BS for TerA occurs over the eastern equatorial Pacific Nino 1 to 3 regions in June (Figure 8a). It gradually shifts westward to the Nino 3 to 4 regions till September. This feature manifests in the seasonal BS maps shown by Figure 1, as the heavy precipitation center displaces westward. The seasonal precipitation variability in the tropical Pacific is closely associated with ENSO activity through SST [64], while MJO is more responsible for the subseasonal precipitation variability [65]. Unlike the stationary ENSO, MJO has a slow eastward-moving character, affecting the convective processes and precipitation in the tropical Pacific. Vigaud et al. [32] found that the intra-seasonal probabilistic forecast of precipitation, assessed by RPSS, is related to La Nina and El Nino, and is significantly correlated with MJO during Asian and West Africa monsoons. Over the northwest coast of Australia and the nearby equatorial Indian Ocean, the error of the monthly hindcast for TerA persists and is stronger than that in the JJA map. It suggests that the probabilistic error in this region is possibly more related to intra-seasonal short-term forcing. Figure 9 shows the pattern of the high BS emerging over the Indian Ocean for TerB in September. This is consistent with the seasonal analysis. Note that over the tropical Indian and Pacific Ocean monsoon areas in July and August, the maximum SST zone is away from Equator. Thus, the strong SST gradients initiate the convection and cross equatorial low-level jet streams, which grow through a positive feedback process [66]. In a more recently study, tropical shallow convection is identified as one of the main sources of biases in the tropical precipitation [67].
The intra-seasonal analysis reveals that the probabilistic errors in the JJA seasonal forecast are the manifestation of the deficiency mostly in the intra-seasonal forecast for June, July, and August, such as over the equatorial oceans. The decrease of forecast sample size may also have an impact on the BS.
forcing. Figure 9 shows the pattern of the high BS emerging over the Indian Ocean for TerB in September. This is consistent with the seasonal analysis. Note that over the tropical Indian and Pacific Ocean monsoon areas in July and August, the maximum SST zone is away from Equator. Thus, the strong SST gradients initiate the convection and cross equatorial low-level jet streams, which grow through a positive feedback process [66]. In a more recently study, tropical shallow convection is identified as one of the main sources of biases in the tropical precipitation [67].  The intra-seasonal analysis reveals that the probabilistic errors in the JJA seasonal forecast are the manifestation of the deficiency mostly in the intra-seasonal forecast for June, July, and August, such as over the equatorial oceans. The decrease of forecast sample size may also have an impact on the BS.

Quantification of Uncertainties by Confidence Intervals
The uncertainty of the averaged BS over the entire tropical region (30° S-30° N and 0°-360°) was quantified by the CI95%, as more errors indicated by higher BSs were detected there. The CI95%s was initially estimated by the resampling approach for the rela-

Quantification of Uncertainties by Confidence Intervals
The uncertainty of the averaged BS over the entire tropical region (30 • S-30 • N and 0 • -360 • ) was quantified by the CI95%, as more errors indicated by higher BSs were detected there. The CI95%s was initially estimated by the resampling approach for the relative comparisons of the different methods. Although the averaged BS estimated by Method 5 is the best, mostly due to the ensemble of the two observation data sets, it has a larger CI95% than that for the BSs by other methods, such as Method 2. The results for the averaged BSs in JJA are displayed in Table 2 under "the bootstrap resampling approach". The analytical approach with a non-central moment estimator [13] was then applied to the accurate estimations of the CI95% for the BSs by Method 1, 2, and 3. Table 2 also lists the CI95%s for the averaged BSs over the entire tropical region in JJA, estimated by the method of moments. Using the extended analytical approach (see Equations (A5)-(A8)) for Method 5 with the observation x being uncertain, the CI95% for the BS is smaller than that estimated by the bootstrap approach (~25%). Since Method 4 uses observed TTs and precipitation as samples, it is similar but less favored than Method 1. While Method 6 produces the highest BS for TerB, it has less possibilities to be applied for the evaluation. The CI95% for the BS in JJA, estimated by Method 1, is the highest at about 6% of the forecast mean. This is partly because the precipitation anomalies were bias-corrected to minimize the difference between the hindcast and the observation, so that the spread of its BS is narrower. It may be also because the confidence interval is disproportional to the sample size. The number of the samples used in Method 1 are three times more than those applied in Method 2 and 3. The BS estimated by Method 3 has a better CI95% than the one for the BS by Method 2 for both TerA (14.6% vs. 19.6%) and TerB (17.4% vs. 17.8%). Since the sample size of the ensemble hindcast is the same for Method 2 and 3, the difference is caused by TTs only.
With 80 drawings, the resampling approach generated bootstrap mean BS is too higher than the forecast mean BS. Experiments were performed to test the effects of different sample sizes and the total numbers of resampling. The CI95% changes very little when the bootstrap resampling increased more than 80 times. The CI95% can be calculated with much simplicity by the analytical approach with a moment estimator [13]. The samples of the BS are assumed to be normally distributed according to the central limit theorem, with the critical value of the t distribution at a significance level of 0.05 because of the unknown true standard deviation. Table 3 lists the CI95%s for the averaged BS estimated by Method 2 over the tropical region. The CI95%s for all seasons were calculated by the analytical approach. The averaged values of the CI95% in TerB and TerA are very close to each other, about 0.04. Converting the absolute value to the percentage of the mean BS, TerA has better CI95%s (16-18%) than TerB does (18-19%) for all seasons. It is interesting to see that the CI95% decreases with the lead time, while the hindcast mean BS increases with the lead time till ASO, then it drops slightly down in SON. For TerA and TerB, the mean BS in SON achieves the best CI95%s: 0.0392 or 16.4% and 0.0387 or 18.15%, respectively. The evolution of the mean BS shows that the probabilistic errors are affected by strong seasonal features of large-scale modes like ENSO, which is strong from November to January. This is consistent with previous analysis for the BS spatial distribution. The changes in the mean BS and its CI95% with the lead time are different. There is no simple and direct relationship between the mean BS and CI95% over the tropical region. Table 3. The averaged CI95%s, lower (LB) and upper (UB) bounds, and percentage of the mean, estimated by the analytical approach for the seasonal evaluation of the Brier score (Method 2), over the entire tropical region (30 • S-30 • N). TerB and TerA represents the below-and above-normal terciles.

Conclusions
The probabilistic evaluation in terms of accuracy and reliability was performed for 23 year (1993-2015) seasonal precipitation re-forecast, from May to November, using the BS and its decomposition. The re-forecast was produced by the Meteo-France operational seasonal forecasting system 7, with 25 ensemble members, perturbed model dynamics and initial conditions. This is a new evaluation score, in addition to the existing deterministic metrics for the seasonal and intra-seasonal forecast products.
The BS was estimated based on tercile probabilities using the non-parametric counting estimator, with the GPCP observation data as the reference. Six analyses for the seasonal hindcast were carried out to assess the robustness of the evaluation score. It is found that BS spatial distributions change significantly with sampling methods, TTs, reference data, and ensemble types. The 3-month mean precipitation and 3-month precipitation anomalies were used as samples. TTs were determined for the hindcast by (1) observed climatological samples, (2) hindcast ensemble climatological samples, (3) and hindcast mean ensemble climatological samples, for the reference by (1) the GPCB climatological samples (2), the ensemble of GPCP, and MSWEP ensemble climatological samples. The ensembles were formed by both single-and multiple-month initializations. The analysis results show that the significant probabilistic errors indicated by high BSs in TerB, and some in TerA over the dry regions, such as North Africa, the Arabian Peninsula, the Antarctic, Kongo Basin, South America, and Australia, can be corrected by taking hindcast ensemble climatological samples. Large errors, especially for TerA over the tropical ocean, were detected by all analyses. Those over the Nino region in the Pacific Ocean can be reduced in the same way. The errors over the tropical Pacific Ocean/Indian Ocean generally decrease/increase with lead time, respectively, showing seasonal and/or lead time features. The ensemble of observation data can significantly reduce the overall BS, in particular, over the tropical ocean, except over the western Pacific. This analysis method creates a new outcome, "uncertain", for the precipitation events so that the BS is reduced. It implies that the tercile probability from the ensemble hindcast is often closer to 0.5 than either 0 or 1. Under such a condition, the increase of ensemble size is not very effective. However, the increase of BS to the maximum in SON over the tropical Indian Ocean can be seen even with different analysis methods. A few small errors persist over the Atlantic Ocean.
The intra-seasonal analysis reveals that the BS spatial distribution patterns in the intraseasonal precipitation hindcast are similar to those in the seasonal hindcast, but the BSs are higher partly due to reduced sample size. This is also possibly related to MJO and tropical shallow convection, which can be improved by the implementation of a more sophisticated physics algorithm and the improved ensemble initial conditions for the ocean, as well as the high-resolution model configuration. The high BS patterns are mostly manifested in the JJA seasonal hindcast, except those over northwest Australia, which can be eliminated in the seasonal analysis when the sample size was increased and TTs were changed.
Since there are more probabilistic errors in the tropical region, and they change significantly with the analysis methods, the CI95% was used to quantify the uncertainty for the averaged BS over the entire tropical region. The CI95% was initially estimated by the bootstrap resampling approach for relative comparisons. Note that it is not an accuracy estimation. The CI95% was further estimated by the analytical approach of the moment estimator, which was extended to cases of non-binary outcomes and non-Bernoulli random variables. The BS calculated by Method 1 is generally the highest, but the CI95% is about 6-7% of the mean BS. The BS calculated by Method 2 was improved, but the CI95% is about 17-19% of the mean. The BS calculated by Method 3 has a relatively higher BS compared to that calculated by Method 2, but the CI95% is better, about 14-17% of the mean BS. For Method 5, the CI95%s for the averaged BS is not comparable to that calculated by Method 1, though the estimated BS is the lowest.
The averaged BS and its CI95% are not linearly correlated. The CI95% was subsequently estimated using an analytical approach for the averaged BS by Method 2 in MJJ, JJA, JAS, ASO, and SON. The CI95% is around 0.03-0.04, about 16-19% of the forecast mean, which decreases with the lead time. However, the averaged BS does not change linearly with the lead time. Generally, the BS varies with different analysis methods, and CI95% changes accordingly. However, the impact of the sample and sample size on the confidence interval should not be neglected. The increase of forecast ensemble members will certainly affect the confidence level.
To enhance the reliability of the forecast products, to provide more objective evaluations, and to detect the forecast system deficiencies, it is necessary to reduce the uncertainties in the evaluation method. As the uncertainty in the statistic model is quite large, in the future, this kind of assessment should also be applied to other evaluation scores, including the forecast skill score and the relative operative characteristic (ROC). Data Availability Statement: The system 7 forecast data are archived at the Meteo-France storage machine, Hendrix, with restricted access, at/home/copernicusps/SYS7/HINDCAST/HYYYYEMME/ iox/output/HYYYEMME.tar.nc. YYYY is the year, and MME is the ensemble number. The size of the tar file is about 63,000,000 bytes. It was first available on 15 January 2020.

Acknowledgments:
The author is grateful for the support from the Copernicus Climate Change Service. She also thanks Jean-Francois Guérémy for his guidance and helpful discussions.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A
For the dichotomous case when x is 0 or 1, For the case with an additional event, x = 0.5, the conditional mean of µ f |x=0.5 is estimated using subsamples, { f 2 l , k = 1, . . ., N 2 } when x is 0.5. E x 2 i and E f 2 i are calculated based on the definitions, and The analytical expression of the BS is For the variance of the BS, The second term at the right-hand side of Equation (A6) is the square of the BS; the first term can be expanded as For the case of an additional outcome, x = 0.5, To generalize Equation (A8) when there are more outcomes, 0, 1, A, B, . . . , in a large collection of samples, and sufficient samples for each outcome,