A Weighting Scheme in A Multi-Model Ensemble for Bias-Corrected Climate Simulation

: A model weighting scheme is important in multi-model climate ensembles for projecting future changes. The climate model output typically needs to be bias corrected before it can be used. When a bias-correction (BC) is applied, equal model weights are usually derived because some BC methods cause the observations and historical simulation to match perfectly. This equal weighting is sometimes criticized because it does not take into account the model performance. Unequal weights reﬂecting model performance may be obtained from raw data before BC is applied. However, we have observed that certain models produce excessively high weights, while the weights generated in all other models are extremely low. This phenomenon may be partly due to the fact that some models are more ﬁt or calibrated to the observations for a given applications. To address these problems, we consider, in this study, a hybrid weighting scheme including both equal and unequal weights. The proposed approach applies an “imperfect” correction to the historical data in computing their weights, while it applies ordinary BC to the future data in computing the ensemble prediction. We employ a quantile mapping method for the BC and a Bayesian model averaging for performance-based weighting. Furthermore, techniques for selecting the optimal correction rate based on the chi-square test statistic and the continuous ranked probability score are examined. Comparisons with ordinary ensembles are provided using a perfect model test. The usefulness of the proposed method is illustrated using the annual maximum daily precipitation as observed in the Korean peninsula and simulated by 21 models from the Coupled Model Intercomparison Project Phase 6.


Introduction
Over the last few decades, ensemble methods based on global climate models have become an important part of climate forecasting owing to their ability to reduce uncertainty in prediction. The multi-model ensemble (MME) methods in climatic projection have proven to improve the systematic bias and general limitations of single simulation models. It has been argued that model uncertainty can be reduced by giving more weight to those models that are more skillful and realistic for a specific process or application. A number of model weighting techniques have been proposed to produce a suitable probability density function of the changes in climatic variables of interest [1][2][3][4][5][6].
Of the many possible ensemble methods, the method we apply here is Bayesian model averaging (BMA), which determines static weights for each model using the posterior probability [3,[7][8][9][10]. Other weighting methods such as reliability ensemble averaging [1] and combined performance-independence weighting [6,11] are similarly applicable to the proposed method in this study, though they were not actually considered here. One advantage of the BMA is that the uncertainty in the BMA ensemble prediction is separated into two variances: uncertainty due to the within-model and due to the among-model. It is therefore relatively easy in BMA to differentiate and quantify the main sources of uncertainty in simulating future climate.
Our interest is in predicting extreme climatic events; the generalized extreme value distribution (GEVD) is typically used as an assumed probability distribution in BMA and accordingly can be used for our propose. The GEVD encompasses all three possible asymptotic extreme value distributions predicted by the large sample theory [12]. In this study, we use the approach by Zhu et al. [13], which is a BMA method embedded with GEVD in projecting the future extreme climate using several climate models. The model weights in the BMA are determined by the distance between the observations and historical data generated by each model. This distance is viewed as the performance or skill of the model.
There is a high probability that the simulation model is biased systematically. To solve this problem, one can apply the bias-correction (BC) technique which constructs a transfer function that matches the observations and the historical data well. The transfer function is then applied to the future simulation outputs [14]. While not without controversy, the BC can be useful in confidence in climate projection and is, in practice, a common component of climate change impacts studies [15,16].
Some BC methods, such as quantile mapping or delta change [14], make a perfect matching in the sense that the quantiles of the observations and the historical data are same.
When BC such as quantile mapping is used, all the model weights from the BMA method become equal because of a perfect matching, and consequently, the prediction is the simple average of bias-corrected model outputs. This equal-weighted ensemble may be accepted by researchers by acknowledging the conceptual and practical difficulties in quantifying the absolute uncertainty of each model [17]. Annan and Hargreaves [18] presented evidence that the models are "statistically indistinguishable" from one another and their observations, and they determined that equal weighting of models is a reasonable approach. Wang et al. [19] conclude that it is likely that using BC and equal weighting is viable and sufficient for hydrological impact studies. Another perspective is that it is futile to assign weighting factors to models in order to create probability distributions from climate model ensembles, but rather to view the ensemble as the "non-discountable range of possibilities" [20]. Nonetheless, the simple average or "model democracy" [21] can be criticized because it does not take into account the performance, uncertainty, and independency of each model in constructing an ensemble, i.e., it is not an optimal strategy [5,6,[22][23][24].
In our experience, when non-bias-corrected (non-BC) historical data are used in BMA, only a few models exhibit extreme weights and most others have very low weights [25] (see Figure 3). Specifically, the posterior distribution may excessively depend on a few "outlier models" close to the observation, when all other models fail to capture observations of the historical period-a common situation for precipitation metrics [26]. This phenomenon may be due to the fact that some models are more "fit for purpose" (or calibrated) to the observations for a given applications (e.g., variable, region, or lead time) than others, and thus, receive very high weights in (or dominate) the multi-model estimate of change [27]. This occurrence is also a result of the BMA weights being obtained based only on the performance of the model. It would be dangerous to weight a few models too strongly based on the performance when observational uncertainty is large. Moreover, weighting may not be robust in quantifying uncertainty in a multi-model ensemble. Some researchers [5,6,28,29] considered a weighting method that accounts for model performance and independency simultaneously. The weights obtained from these research turned out to the tendency of smoothing the performance-based weights.
In this study, we do not take the model dependency into account, but investigate another method to control both unfairly high weights and distinctionless simple averaging. The weighting scheme proposed in this article prevents the situation where only a few models have very high weights and the most others have very low weights, which happens frequently in the BMA approach. Our method strives to drive the weights far from those that are equal. Therefore, it searches for a balanced "sweet spot" between the BMA weights and simple averaging. One can view it as a technique for smoothing the performance-based weights. The proposed approach applies the level of BC differently to the historical data and to the future simulation data. An "imperfect" BC of the historical data to the observations is employed in computing the weights, while ordinary BC is used for the future data in computing the weighted average for projection.
To illustrate the usefulness of our approach, we use the annual maximum daily precipitation (AMP1) as simulated by the Coupled Model Intercomparison Project Phase 6 (CMIP6) models in historical and future experiments over the Korean peninsula. The AMP1 is employed here because the authors are interested in the variable in Korea. The proposed method in this study can be applied similarly to other climate variables in the other regions, with some modification.
The objective of this study is to assess the impacts of weighting schemes on the skill of the future projections, and to find the optimal distribution of weights that leads to an increase in the skill score when the BC is applied to the future simulation data. As a measure of the skill, the continuous ranked probability score (CRPS) is used based on a perfect model test (or model-as-truth).

Data and Simulation Models
Several types of data are used for this study. These comprise observations for past years at each grid cell, the historical data obtained from each simulation model for the reference period, and the future data generated by each model for certain scenarios of future periods. Each simulation model generates R × P future datasets for both R scenarios and P future periods. Thus, for each grid cell, there is one observation dataset, K historical datasets where K is the number of simulation models, and K × R × P future datasets. The statistical BC is done for each grid cell and each simulation's data. The observations and historical data in the reference period are sometimes referred to as "in-sample", while the future data is called "out-of-sample" [30].
For clarity, the following notations are employed.
• x f : future value •x f : bias corrected value in the future • x obs : observed value in the reference period • x h : historical value in the reference period Here, the subscript h stands for the historical data in the reference period, obs stands for the observations, and f stands for the future. The reference period is 1973-2014, and future period is 2021-2060 (p1) and 2061-2100 (p2). The scenario levels for the future climate are the shared socioeconomic pathway (SSP) 3 and 5 [31]. For regridding to common grid points of 1.5 • × 1.5 • , the iterative Barnes interpolation scheme [32] was employed for the observations and simulation data from 21 models. The Barnes technique produces a rainfall field on a regular grid from irregularly distributed observed rainfall stations. It has gained large importance in the mesoscale analysis (see, e.g., in [33][34][35]). Figure 1 shows a map of the Korean peninsula, the spatial distribution of 127 rainfall observed stations, and the 15 grid cells used in this study. The observations for the 42-year reference period were obtained from the Korea Meteorological Administration. Table 1 is the list of 21 CMIP6 climate models used in this study.

Preliminary Methods
Model weighting or model averaging is a statistical method used to improve the accuracy of a set of models [36] and estimate the conceptual uncertainty of climate model projections. Generally, model averaging can improve the skill of projections and forecasts from multi-model prediction systems [6,24,37]. The model-weighting approach that we propose in this study is applied to the bias-corrected extreme precipitation in the BMA framework [13]. The main features of the GEVD, the BMA, and the BC method are briefly described here.

Generalized Extreme Value Distribution
The GEVD is widely used to analyze univariate extreme values. The three types of extreme value distributions are sub-classes of GEVD. The cumulative distribution function of the GEVD is as follows, where µ, σ, and ξ are the location, scale, and shape parameters, respectively. The particular case for ξ = 0 in Equation (1) is the Gumbel distribution, whereas the cases for ξ > 0 and ξ < 0 are known as the Fréchet and the negative Weibull distributions, respectively [12]. Assuming the data approximately follow a GEV distribution, the parameters can be estimated by the maximum likelihood method [12,38] or the method of L-moments estimation. The L-moments estimator is more efficient than the maximum likelihood estimator in small samples for typical shape parameter values [39]. The L-moments method is employed in this study using the "lmom" package in R [40] because a relatively small number of samples, about 40 years worth, are analyzed for each comparison period. Moreover, the formulae used to obtain the L-moments estimator are simple compared to that of obtaining the maximum likelihood estimator, which needs an iterative optimization until convergence.
It can be helpful to describe changes in extremes in terms of changes in extreme quantiles. These are obtained by inverting (1) Here, z p is known as the return level associated with the return period 1/p, as the level z p is expected to be exceeded on average once every 1/p years [12]. For example, a 20-year return level is computed as the 95th percentile of the fitted GEVD and a 50-year return level as the 98th percentile.

Bayesian Model Averaging
Among the various ensemble methods, the BMA method is commonly used to integrate over multi-model ensembles of climate series. It combines the forecast distributions of different models and builds a weighted predictive distribution from them. Many empirical studies including those in [8,9,22,[41][42][43] have shown that various BMA approaches outperform other competitors, including a single best model and a simple averaging in prediction performance.
Zhu et al. [13] proposed a bootstrap-based BMA in which the weight of each model was calculated by comparing the observations with the historical data from a simulation model. For each model and each bootstrap realization (from i = 1 to B), the GEV parameters and rainfall intensity (i.e., return level) in return period T were estimated.Then, they used the Gaussian likelihood function as follows, where I i (T) is the rainfall intensity of the i-th bootstrap from the observations, I k i (T) is the intensity of the i-th bootstrap from the historical data of model M k , and σ 2 where p(M k ) is the prior probability of M k , and p(R|M k ) can be approximated by L(M k ) [44]. Equal priors are used for all models in this study. The BMA prediction of rainfall intensity I over K models is then given by with the posterior variance where w k = p(M k |R) is the weight of each model k given in (3). For the historical data, these quantities are calculated from the B parametric bootstrap samples used in (2): For the future simulation data, the above equations are not used, but the intensity estimates are obtained by fitting GEVDs to each model's future data, i.e.,Ê(I|R, M k ) =Î k f (T).

Bias Correction by Quantile Mapping
Although simulations from climate or meteorological models provide much information, the simulated data are associated with potential biases that their statistical distribution differs from the distribution of the observations. This is partly because of unpredictable internal variability that differs from the observations, and because global climate models (GCMs) have a very low spatial resolution to be employed directly in most of the impact models [45,46]. For example, in GCM precipitation fields, the bias may be due to errors in convective parameterizations and unresolved subgrid-scale orography [16]. BC methods are commonly applied to transform the simulated data into new data with no or at least fewer statistical biases with respect to a reference, this being generally an observed time series. Many BC methods are available including delta change, quantile mapping, transfer function method, trend preserving BC, stochastic BC, and multivariate BC methods [14]. Among these, quantile mapping (QM) is simple, classical, and most famous [46][47][48][49].
QM adjusts different quantiles individually using the cumulative distribution functions (cdf) of observations and modeled historical values. From all modeled values (x h,i ) and observations (x obs,i ) over the reference period, the corresponding cdfs F h and F o are estimated. The modeled value (x f ,i ) for the future, which are specific quantiles of the modeled distribution, are then mapped onto the corresponding observed quantiles asx QM is used if trust in model simulations at the daily scale is high. Then, the full advantage of the dynamical model, including changes in dynamical properties such as the temporal dependence structure, can be exploited [14]. Implementations differ in the model used to estimate the cdf. Some authors consider empirical cdf with linear interpolation, while others employ parametric models such as a normal distribution for temperature, a gamma distribution for precipitation, and a GEV distribution for extreme events.

α-Correction
We consider a cdf representing the data between the historical data and the observations. The new cdf is defined as a linear combination of F obs and F where h is the cdf of the historical data from the k-th model with k = 1, · · · , K. If α = 1, F α = F obs . If α = 0, then F α = F h . We refer to α as "correction rate". The α-corrected value of x h is given bŷ We refer (9) the "α-correction" of the historical data to the observations. If α = 1, then the α-correction is an ordinary QM. If α = 0, then no BC is performed for the historical data. A sketch of the method with α = 0.4 and 0.7 is given in Figure 2. Note thatx The idea for the α-correction was inspired from examining various QM graphs in [48,49].

α-Weights
As α approaches 1, the BMA weights obtained from (3) become almost equal. When α = 0, the usual BMA weights without any BC are obtained. In the latter case, sometimes a few excessively high weights dominate, while the remainder is very low (see Figure 3). If equal weights or a few dominating weights are undesirable, a hybrid one between these two situations may be an alternative. We expect that such hybrid weights can be obtained from the α-correction with a α value between 0 and 1, perhaps close to 0.5. We call it "α-weighting", and denote the weights by w k (α). See Figure 3 for the weights with α = 0, 0.4, and 0.7, where a few very high weights for α = 0 decrease and few very low weights increase. The 0.4-weights follow approximately the pattern of weights obtained by the original (α = 0) BMA, whereas the 0.7-weights go toward an even distribution. A figure depicting the the weights of additional α values is presented in the Supplementary Material.
Next, we describe the details of computing w k (α). Here, to obtain the BMA weights or the α-weights, the historical data and observations are included without the future data. To obtain the α-weights in this study, we just followed the same procedure for calculating model weights as in the BMA, but with "nonzero" α.
We denote M k (α) as the α-corrected historical data of k-th model. The Gaussian likelihood is now modified to be dependent on α; L(M k (α), T) is the same as in (2) except that I k i (T) is replaced by I k i (α, T), which is the intensity of the i-th bootstrap from the M k (α). Then, we have the posterior probability of M k (α) for given the observations R: where p(R|M k (α)) can be approximated by L(M k (α)) = 1 4 ∑ T L(M k (α), T) as in [44], where 5, 10, 20, and 50 years are used for T in this study. Note that this averaged likelihood L(M k (α)) is actually a type of distance between GEVD(R) and GEVD(M k (α)). When equal priors are given, the α-weights are The bias-corrected values in the future from the k-th model are obtained by the ordinary QM in our proposed method.x h (x f )). (12) The α-correction does not apply to the future data, but only to the past data to obtain the α-weights.
The ensemble prediction for a quantity such as a 20-year return level ( r E 20 ) in the future is obtained by where r

Selection of the Correction Rate
Researchers may want to fix the α at one value. For this purpose, two methods applicable to each grid are presented here: The first method is the chi-square statistic for testing the null hypothesis of uniform (or equal) distribution of the weights, where H 0 : w k (α) = 1/K for every k, and we let g k (α) = w k (α) × 100. Then, for each grid, the chi-square statistic to test the above H 0 given by [38] The values of g k (α) that are less than 5 are summed up to one category such that the total frequency is greater than 5. Let us denote K as the number of the categories that have the summed frequency greater than 5. If g k (α) are far from equal values, then χ 2 0 (α) will not be small. The null hypothesis is rejected at the 5% significance level when There is more chance to reject H 0 for smaller α, and less chance to reject for α near to 1, because the α-weights become equal to each other as α goes up to 1. By changing the α from 0 to 1 in increments of 0.05, we can select the optimal α such that the null hypothesis is rejected, as follows, If there is no such α * in [0,1], we set α * = 0. As this selection is done for each grid cell, the α * value is determined independently for each cell. Figure 4 shows the chi-square test statistics calculated from various α-weights and the selected optimal α * j for select j-th grid cell.  Figure 4. Plots of the chi-square test statistics as α changes from 0 to 1, and the selected optimal correction rate α * for some grid cells.
As another way of selecting an optimal α, we consider the use of the continuous ranked probability score (CRPS), which can be employed to evaluate the impact of the weighting on the skill of the future projections. The CRPS defined for a single forecast can be extended for multiple occasions by averaging it, which is detailed in [50]. The averaged CRPS can be formulated as a function of α-weights, denoted by CRPS(α). In the next section, the leave-one-model-out version is averaged to compute CRPS cv (α) as in Equation (16). Thus, an optimal α can be chosen at the minimum of the CRPS cv (α). These optimal α j s values, selected differently from each grid, can lead to an increase in skill while minimizing the probability of overfitting [27]. In selecting an appropriate α, one can consider other criterion such as entropy [19] or the cross-validated mean squared error. Figure 5 shows the selected optimal α j and the values of CRPS cv (α) as α increases from 0 to 1 for each j-th grid. More figures for different future periods and scenarios are available in the Supplementary Material. 15 (16) and the selected optimal correction rate α for select grid cells and for future period 1 (2021-2060) under SSP3 scenario. Figure 6 shows the weights for 21 models obtained from various α-weighting methods. We see that two high weights from BMA decrease significantly, while most small weights increase slightly by the α-correction methods. It seems that the methods based on CRPS alter the weights by bringing them closer to equal in value. The α selection method based on the chi-square statistic is relatively easier to computate and it smooths the BMA weights well, but may lack climatological meaning. The method based on the CRPS has a clear climatological meaning, but for this study at least, shifts the weights towards being different in value.
Instead of using one set of weights, one can apply several sets of weights that are obtained from various values of α. For example, α can be set to 0, 0.25, 0.5, 0.75, and 1. Several different prediction results from such α values are available. These multiple results provide various plausible shapes of the future climate change which may not be detected by a single ensemble.

Comparison of Weighting Schemes
Some metrics can be used to compare weighting schemes and assess the skill of "out-of-sample" ensemble prediction. Those metrics include absolute error, ensemble spread, overconfidence bias, and ranked probability skill score [30]. The latter is based on CRPS and is essentially a combination of accuracy (absolute error of prediction) and precision (the width of predictive distribution) [30]. These are computed under the following models-as-truth experiments. In this study, however, we consider only CRPS cv (α) as a metric.

Leave-One-Model-Out Validation
This approach picks each model from a multi-model ensemble in turn and treats it as the true representation of the climate system. The data from this perfect model are treated as true observations, termed as "pseudo-observations". Then, the α-weights are obtained by following the same process presented in the above Section 4.2. Actually, the pseudo-observations from the perfect model and the α-corrected historical data for the remaining models are used. This leave-one-model-out method is called a model-as-truth experiment or a perfect model test [27,30]. This allows for an evaluation of the impact of the weighting in the future based on each model representing the truth once.
As a measure of the method's skill, we use the CRPS(α) as mentioned in the above section. Here, the leave-one-model-out cross-validated version, CRPS (−k) (α), is employed. This is basically the average of the mean squared error between the distribution of the k-th perfect model and the distribution of all other models except for the k-th model [27]. The distribution of all other models is composed of α-weighted average of the distributions of the K − 1 models. Then, the CRPS cv (α) is averaged over the K models as. Thus, Small values for this quantity may represent a better performance in projecting future climate. For numerical computation, we used the "scoringRules" package in R [51]. Figure 7 shows parallel coordinated box-plots of CRPS cv (α) calculated over 15 grid cells for future periods P1 and P2 under both SSP3 and SSP5 scenarios. The details of the parallel coordinated box-plots are described in the Supplementary Material. The optimal α-weights based on CRPS cv (α) produces the best performance among those considered in this study. Its performances in P2 of the SSP3 scenario and P1 of the SSP5 scenario are, however, similar to those of the simple average. This is because the selected optimal αs based on the CRPS are equal to 1 in many of the grid cells. In addition, we see that the chi-square-based approach works better than the BMA.
The CRPS (−k) (α) can be interpreted as the difficulty in predicting the k-th model from all other K − 1 models. When its value is small, the k-th model is more explicable by a combination of other K − 1 models than when it is large. Considering this, it is notable that the CRPS values for P2 (SSP5) are greater than those for P1 (SSP3) in Figure 7. This may mean that the mutual predictability or the coherence among the models is weaker for the far future and the SSP5 scenario than those for the near future and the SSP3. In calculating the LOOCV based on CRPS on 15 grid cells for 21 models, we experienced too much computing time, as α changes from 0 to 1 by 0.05. It took more than 120 min on an Intel i5 PC with 16 GB memory. Therefore, a parallel computing using "foreach" package [52] in R program was executed on a GPU server (Xeon*G 6230) with 80 cores. It took approximately 5 min. When the study region gets wider or the number of grid cells are large for several weather variables and for more than 21 models in CMIP6, it seems the parallel computing is necessary. In the sense of computing time only, selection of the correction rate based on the chi-square statistic may be preferred to that based on CRPS with a perfect model test. Figure 8 shows schematic box-plots of the 20-and 50-year return levels of the annual maximum daily precipitation for 15 grid cells over the Korean peninsula obtained from various α-weights based on the BMA, the chi-square statistic, the CRPS, and the simple averaging (SA). Compared to the observations, the return levels for all cases increase in the future; more in P2 and the SSP5 scenario than in P1 and the SSP3, respectively. Figure 9 depicts the similar plots as Figure 8 but for 20-and 50-year return periods (i.e., waiting time) relative to the observations from 1973 to 2014. The results from Figures 7-9 show the differences due to the weighting schemes, but the differences may not be as distinctive as they appear. Because the results from the weighted and unweighted means are similar, one may question why the simple average could not be used. Lorenz et al. [29] argued that although the resulting numbers may be similar, the interpretation of the spread is different between the unweighted and the weighted multi-model means. In that paper, they wrote the following. "The spread of the simple average is just a spread and is not a measure of uncertainty. It is an ad hoc measure of spread reflecting the ensemble design, or lack thereof, whereas the spread in the weighted multi-model mean can be interpreted as a measure of uncertainty given everything we know. We thus should have more confidence in the latter." Plots in Figure 9 indicate that the 20-year return periods for P2 are about 0.75 times shorter than those for P1. Specially, the 20-year return period for P2 under the SSP5 is very short, about 7 years in median. By reading the right-hand plots too, we realize that a 1-in-20 year (1-in-50 year) annual maximum daily precipitation in the Korean peninsula will likely become a 1-in-10 (1-in-20) year and a 1-in-7 (1-in-15) year event in median by the end of the 21st century based on the SSP3 (the SSP5) scenarios, respectively, when compared to the observation from 1973 to 2014. These show that the 20-year and 50-year return periods will likely reduce to approximately half (40%) under the SSP3 scenario and approximately one-third (30%) under SSP5 by the end of the 21st century. This is approximately of equal frequency as that in the result by Lee et al. [25], which was obtained by a multi-model ensemble with the BMA weights based on 17 CMIP5 simulation models. They showed that both 20-and 50-year return periods across the Korean peninsula will likely reduce to approximately half for RCP 4.5 and to approximately one-quarter for RCP 8.5 by the end of the 21st century, compared to the observations from 1971 to 2005.

Discussion
The idea of the α-correction based on QM proposed in this paper is only based only on the at-site BC, which may be unreasonable and, and therefore is a limitation of this approach. The α-correction can be applied to the BC based on some variants of QM [14]. The variation of the CRPS cv (α) values among the grid cells is larger than those among various α-weights, as seen in Figure 5. This high variation among grid cells may be reduced by using the regional BC, which takes into account the spatial dependence between nearby grids. Some multivariate spatial BC methods [16,46] with the α-correction may lead to reduce uncertainty, consideration of spatial pattern or correlation, and to increased skill of ensembles more than an at-site BC.
Ensembles from various sets of α-weights can be recombined to produce another ensemble. This ensemble of ensembles is referred to the double ensemble (DE) or stacking [36]. If there are L number of ensembles in which each ensemble is constructed from a set of α l -weights, for l = 1, · · · , L, the prediction is a form of I DE = ∑ L l=1 v(α l ) I(α l ), where v(α l ) is another weight for the l-th ensemble, and I(α l ) is the predicted value from the l-th ensemble with a set of α l . One can choose α l = 0, 0.5, 1 or α l = 0, 1/4, 0.5, 3/4, 1 for simple examples. In this case, the selection of a specific α is no longer required. As another example, in addition to α l = 0, 1, one can include the ensembles with α * 1 and α * 2 , which are the optimal correction rates obtained by the chi-square test statistic and by the CRPS cv (α), respectively, for each grid cell. It is known that this DE generally leads to better prediction than a single ensemble in the sense of reducing the variance [36], but this requires more computational efforts. Equal values can be set to v(α l ) for a simple average calculation. One may assign different values for v(α l ) based on a cross-validated criterion as in statistical learning.
The leave-one-out (LOO) cross-validation (CV) based on CRPS was employed in this study to find an optimal correction rate and to compare weighting schemes. The LOOCV perfect model test and the model-as-truth experiment have been used to select hyperparameters in the weights and to evaluate the weighting scheme, respectively, in multi-model ensemble studies [6,27,29,37]. In the statistical learning community, however, the k-fold CV is preferred over LOOCV due to the higher variance of LOOCV, than does that of k-fold CV [36]. The k-fold CV approach involves randomly dividing the set of models into k groups or folds of approximately same size. The first fold is treated as a validation set, and the method calculates the CRPS (−1) based on the remaining k − 1 folds. This is repeated for all k-folds; each time, a different fold is treated as a validation set. It finally produces the averaged value CRPS cv similarly to a formula in Equation (16). LOOCV is a special case of k-fold CV in which k is set to equal the number (K) of models. In practice, one can perform k-fold CV using k = 5 or k = 10 (when K is large). LOOCV is not involved with randomness, whereas the k-fold CV is dependent on random divisions of models into k groups. The latter approach requires further considerations to be applicable to the weighting scheme in the multi-model climate ensemble study. If we employed the k-fold CV instead of LOOCV in the above section, the results might be altered.
The weights in the BMA employed in this study are obtained based on the performance of the model. The performance is defined by the distance between the historical data and the observations, which measures how well the model reproduces the historical data close to the observations. This performance, however, may not guarantee the reliability of future climate change. In some cases, nonetheless, past trends are strongly related to future trends, e.g., for large-scale greenhouse gas-induced warming [53] or Arctic sea ice decline [54]. The study by Smith and Chandler [55] in this issue shows that present-day climate and variability are related to the predicted change in precipitation in parts of Australia [21]. Therefore, a performance criterion based on the distance between the historical data and the observations may be necessary in calculating the model weights, but is not sufficient. In addition to the performance measure, one can include others such as uncertainties in the model or in the observations, the model convergence criterion [1], or the model independence [5,6]. The α-weighting proposed in this study is related to uncertainties in data. Large uncertainties in the model or in the observations weaken the confidences on the model performance and on the weights based on the performance. We can infer that the weights in this situation are determined by more chance than in the situation with lower uncertainties. Especially, a few models with very high weights which dominate most other models with very low weights would be unfair and may result in an unreliable and unrobust prediction. The method in this study was thus proposed to smooth or regulate such weights.
If we have another set of weights, for instance, d m , such as for the independence of the m-th model, then a new weighting scheme of combining both weights is obtained by, for m = 1, · · · , K, This approach can lead to a weighting scheme accounting for performance and independence simultaneously [5,6,27]. One can further consider the α-weighting to the future model data too. It may introduce another "semi"-bias correction, which adds complexity to the situation. Nevertheless, this would enrich the methodology of ensemble prediction. We leave this undertaking future work.

Conclusions
Multi-model ensemble methods in climatic projection have proved to improve upon the systematic bias and general limitations of a single simulation model. It has been argued that model uncertainty can be reduced by attributing more weight to those models that are more skillful and realistic for a specific process or application. As both bias-correction (BC) and model weighting are common procedure in impact studies, we considered a weighting scheme that includes both equal and unequal weights for a bias-corrected simulation model output. The proposed approach applies an "imperfect" BC or α-correction to the historical data in computing the model weights, while it applies ordinary BC to the future data in computing the weighted average for projection.
The proposed weighting scheme prevents the situation where only a few models have very high weights and the most majority have very low weights, which frequently occurs in the BMA approach. Our method, conversely, seeks to shift the high weights far from those that are equal. It, therefore, searches for a balanced "sweet spot" between BMA weights and simple averaging. A weighting scheme in which an optimal correction rate is selected based on the chi-square test statistic smooths unfairly high or low weights, while it continues to reject the hypothesis of the uniform distribution.
Based on this generalized or hybrid scheme, researchers can present their optimal results between equal and BMA weights. We illustrated from model-as-truth experiments that an ensemble with a set of weights obtained by minimizing the CRPS cv (α) can improve the skill to a greater extent than the BMA and simple averaging. One may provide multiple results from several weighted ensembles to capture various plausible shapes and uncertainties of future changes.
The numerical results and the selected αs illustrated here using the annual maximum daily precipitation in the Korean peninsula depend strongly on variables, regions, and simulation model outputs. The introduction of α-correction and α-weights, however, is a step that can serve to combine simulation models optimally and with more flexibility. A more refined α-weights method, such as the one we developed in this study, can deserve to be included in a discourse about tactics to build a better multi-model ensemble in predicting future climate.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4433/11/8/775/s1, Figure S1: Distributions of the weights obtained from various α-corrections for each grid cell. Red, green, and blue bars represent the α-weights with α = 0, 0.4, 0.7, respectively. Figure S2: Distributions of the weights obtained from various α-corrections for each grid cell. Red, green, and blue bars represent the α-weights with α = 0.2, 0.5, 0.8, respectively. Figure S3: Plots of the chi-square test statistics as α changes from 0 to 1, and the selected optimal correction rate α * for 15 grid cells. Figure S4: Computed values of CRPS cv (α) defined as in Equation (16) and the selected optimal correction rate α for 15 grid cells and for the future period 1 (2021-2060) under the SSP3 scenario. Figure S5: Same as Figure S4 but for the future period 2 (2061-2100) under the SSP3 scenario. Figure S6: Same as Figure S4 but for the future period 1 (2021-2060) under the SSP5 scenario. Figure S7: Same as Figure S4 but for the future period 2 (2061-2100) under the SSP5 scenario.