Consider $y$ as the quantity to be forecasted (i.e., the predictand), and let $Y=\left({y}_{1},{y}_{2},\dots ,{y}_{T}\right)$ denote the observation series over the training period, with data length $T$. Having $K$ different models (i.e., $M=\left({M}_{1},{M}_{2},\dots ,{M}_{K}\right)$) results in ${Y}^{f}=\left({Y}^{{M}_{1}},{Y}^{{M}_{2}},\dots ,{Y}^{{M}_{K}}\right)$, the ensemble of model predictions for the aforementioned training period, where ${Y}^{{M}_{i}}=\left({y}_{1}^{{M}_{i}},{y}_{2}^{{M}_{i}},\dots ,{y}_{T}^{{M}_{i}}\right)$. Based on the law of total probability and the assumption that the model forecasts are independent, the PDF of the predictand conditioned on the models over the given training period can be formulated as follows [15]:
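In the standard BMA formulation [15,16], this conditional PDF takes the form

$P\left(y|Y\right)={\sum}_{i=1}^{K}P\left({Y}^{{M}_{i}}|Y\right)\,P\left(y|{Y}^{{M}_{i}},Y\right)$ (1)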

where $P(y|{Y}^{{M}_{i}},Y)$ is the posterior distribution of $y$ given the prediction of model ${M}_{i}$ and the observed data $Y$, which can simply be regarded as the forecast PDF of $y$ based on model ${M}_{i}$. Moreover, $P\left({Y}^{{M}_{i}}|Y\right)$ is the posterior probability, or the likelihood of model ${M}_{i}$'s prediction being correct over the training period. Owing to the assumption of model independence, the posterior probabilities of the models sum to unity, ${\sum}_{i=1}^{K}P\left({Y}^{{M}_{i}}|Y\right)=1$, and they can therefore be treated as weights (i.e., ${w}_{i}=P\left({Y}^{{M}_{i}}|Y\right)$ is the weight of model $i$). Furthermore, the BMA approach assumes that the model forecasts are unbiased, meaning that the expected value of the difference between the observation and each model forecast equals zero (i.e., $E\left(Y-{Y}^{{M}_{i}}\right)=0$ for $i\in \left[1,K\right]$). Hence, before implementing BMA, a bias-correction method should be applied to create an unbiased ensemble of predictions. Although several bias-correction methods can serve this purpose, a linear-regression technique is used in the original BMA [16]. The bias-corrected results, ${F}^{{M}_{i}}={a}_{i}\times {Y}^{{M}_{i}}+{b}_{i}$ (where ${a}_{i}$ and ${b}_{i}$ are the coefficients of the linear regression model), replace the original model forecasts (${Y}^{{M}_{i}}$). Therefore, the BMA predictive model (Equation (1)) can be rewritten as follows:
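In terms of the bias-corrected forecasts and the weights defined above, the standard form of this predictive model is

$P\left(y|Y\right)={\sum}_{i=1}^{K}{w}_{i}\,P\left(y|{F}^{{M}_{i}},Y\right)$ (2)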

In the original BMA method, it is further assumed that the aforementioned posterior probability (i.e., $P\left(y|{F}^{{M}_{i}},Y\right)$) follows a normal (Gaussian) distribution, $g(y|{F}^{{M}_{i}},{\sigma}_{i}^{2})$, with mean ${F}^{{M}_{i}}$ and variance ${\sigma}_{i}^{2}$, reflecting the uncertainty within the individual model $i$. As explained in the introduction, some studies have argued that this assumption is a poor choice for a non-Gaussian forecast variable such as streamflow. They therefore proposed using more representative distribution types (e.g., the gamma distribution) or applying a data transformation procedure (e.g., the Box–Cox transformation method [50]) to map the data from their original space to a Gaussian space. It is worth mentioning that when a data transformation procedure is applied, an inverse transformation must be available to revert the results back to the original variable space.
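As a minimal sketch of such a transformation step (using SciPy and hypothetical streamflow values, not data from this study), the Box–Cox transform and its exact inverse can be applied as follows:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Hypothetical streamflow sample (must be strictly positive for Box-Cox)
flow = np.array([12.0, 30.0, 7.5, 55.0, 18.0])

# Fit the transformation parameter lambda and map to a near-Gaussian space
transformed, lam = boxcox(flow)

# Invert the transformation to recover the original variable space
restored = inv_boxcox(transformed, lam)
```

Because the inverse is exact, `restored` matches the original `flow` values, which is the reverting property required before the BMA outputs can be interpreted in the original space.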

Finally, based on Equation (2) and the Gaussian assumption, the BMA predictive mean and its associated variance can be determined from the following two equations [15,16]. The mean is the weighted average of the individual predictions, and the BMA variance consists of (1) the between-model variance, reflecting the spread of the ensemble, and (2) the within-model variance, representing the uncertainty associated with each individual model's forecast.
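Following Raftery et al. [16], these take the standard forms

$E\left(y|Y\right)={\sum}_{i=1}^{K}{w}_{i}{F}^{{M}_{i}}$

$\mathrm{Var}\left(y|Y\right)={\sum}_{i=1}^{K}{w}_{i}{\left({F}^{{M}_{i}}-{\sum}_{j=1}^{K}{w}_{j}{F}^{{M}_{j}}\right)}^{2}+{\sum}_{i=1}^{K}{w}_{i}{\sigma}_{i}^{2}$

where the first term of the variance is the between-model spread and the second is the within-model variance.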

Successful implementation of the BMA method relies on the proper estimation of its parameters, namely the weight (${w}_{i}$) and variance (${\sigma}_{i}^{2}$) of each individual prediction ($i=1,\dots ,K$). Following Raftery et al. [16], the standard BMA uses the EM algorithm to maximize the log-likelihood function of the parameter vector ($\theta =\left\{{w}_{i},{\sigma}_{i}^{2},\ i=1,2,\dots ,K\right\}$), approximated as follows:
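Under the Gaussian assumption, this log-likelihood takes the standard form [16]

$\ell \left(\theta \right)={\sum}_{t=1}^{T}\mathrm{log}\left({\sum}_{i=1}^{K}{w}_{i}\,g\left({y}_{t}|{F}_{t}^{{M}_{i}},{\sigma}_{i}^{2}\right)\right)$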

Given that no analytical solution exists for maximizing the sum of the aforementioned function over the training period, an iterative procedure such as the EM algorithm is used. In this procedure, the optimization problem is set up by introducing a latent variable (${Z}_{k}$). Apart from initialization, the algorithm comprises (1) an expectation step, where the latent variable is calculated based on the current values of the parameters, and (2) a maximization step, where the parameters are estimated according to the determined value of the latent variable (Figure 4b). It is worth noting that, although the EM algorithm is computationally efficient, it has been argued that other optimization methods can lead to a more robust estimation of the parameters.
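The two EM steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes NumPy/SciPy, per-model variances, and bias-corrected forecasts arranged as a $T\times K$ array.

```python
import numpy as np
from scipy.stats import norm

def bma_em(y, F, n_iter=200, tol=1e-8):
    """Estimate BMA weights and variances with the EM algorithm.

    y : (T,) array of observations over the training period.
    F : (T, K) array of bias-corrected ensemble forecasts.
    Returns the weights w (K,) and variances sigma2 (K,).
    """
    T, K = F.shape
    w = np.full(K, 1.0 / K)                            # uniform initial weights
    sigma2 = np.full(K, np.var(y - F.mean(axis=1)))    # common initial variance
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation step: latent responsibilities z[t, i] from current parameters
        dens = norm.pdf(y[:, None], loc=F, scale=np.sqrt(sigma2))
        num = w * dens
        z = num / num.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate weights and per-model variances from z
        w = z.mean(axis=0)
        sigma2 = (z * (y[:, None] - F) ** 2).sum(axis=0) / z.sum(axis=0)
        # Stop once the log-likelihood no longer improves
        ll = np.log(num.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, sigma2
```

The expectation step computes, for each time step, the probability that each ensemble member generated the observation; the maximization step then updates the weights and variances from those responsibilities, so the log-likelihood increases monotonically until convergence.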