Inverse modeling of hydrologic parameters in CLM4 via generalized polynomial chaos in the Bayesian framework

In this study, the applicability of generalized polynomial chaos (gPC) expansion for land surface model parameter estimation is evaluated. We compute the (posterior) distribution of the critical hydrological parameters that are subject to great uncertainty in the Community Land Model (CLM). The unknown parameters include those that have been identified as the most influential factors on the simulations of surface and subsurface runoff, latent and sensible heat fluxes, and soil moisture in CLM4.0. We set up this inversion problem in the Bayesian framework in two steps: (i) build a surrogate model expressing the input-output mapping, and (ii) compute the posterior distributions of the input parameters. The surrogate model is developed with a Bayesian procedure based on variable selection methods that use gPC expansions. Our approach accounts for basis-selection uncertainty and quantifies the importance of the gPC terms, and hence of all the input parameters, via the associated posterior probabilities.


Introduction
Scientists use land surface models (LSMs) to quantitatively simulate the exchange of water and energy fluxes at the Earth surface-atmosphere interface. During the past decades, LSMs have evolved from oversimplified schemes describing only the surface boundary conditions for general circulation models (GCMs) to complex models that integrate modules representing biogeochemical, hydrological, and energy cycles at the surface-atmosphere interface [Pitman (2003)].
Built upon mathematical formulations of the laws of physics, the model parameters are usually associated with certain physical meaning and influence the major model outputs such as water and energy fluxes. It is a common hypothesis that the parameters are measurable and transferable to locations sharing the same physical properties or site conditions, as assumed in the Project for Intercomparison of Land Surface Parameterization Schemes (PILPS) [Bastidas et al. (1999); Henderson-Sellers et al. (1995, 1996)]. However, default assignment of parameter values is actually inappropriate according to [Bastidas et al. (1999); Rosero et al. (2010)]. Meanwhile, given the high dimensionality of the parameter space and the complexity of the land surface system, more studies are needed to understand which parameters are more uncertain and what the potential is for using observations to constrain or calibrate the uncertain parameters to better capture uncertainty in the resulting land surface states [Hou et al. (2012); Huang et al. (2013)]. This study aims at quantifying the uncertainties related to a subset of parameters in a community LSM named the Community Land Model (CLM), which is the land component within the Community Earth System Model (CESM), formerly known as the Community Climate System Model (CCSM) [Collins et al. (2006); Gent et al. (2010); Lawrence et al. (2011)].
There are different sources of uncertainty associated with LSMs, including model structural uncertainty due to simplified assumptions or representations of the actual processes or phenomena; many such assumptions are only valid under specific conditions. In addition, LSMs are subject to uncertainty related to input parameter values, particularly because a large number of input parameters, such as those associated with land cover, land use, and soil properties, are not directly measurable at the scales of their applications.
A common practice in land surface modeling has been to define a set of default parameter values that are globally applicable. Efforts have been made by the land surface modeling community in the last two decades to deal with uncertainty in model parameters, data, and model structure. The focus of this study is to reduce uncertainty in model parameters via generalized polynomial chaos expansion and Bayesian inversion.
Stochastic inversion for a high-dimensional parameter space is computationally demanding. To address this problem, surrogate models can be used as alternatives to the numerical simulators. Ensemble simulations, which are required to develop surrogate models, can be performed efficiently in a task-parallel manner on supercomputing facilities, but surrogate development itself is a non-trivial effort, and the use of surrogates in the calibration of climate models or LSMs remains relatively uncommon. In [Ray et al. (2015); Huang et al. (2016)], the authors used various surrogates (e.g., polynomials and/or universal kriging) to calibrate hydrological parameters of CLM 4.0 using measurements of latent heat fluxes.
Two competing models were used for the model-data mismatch to estimate a composite of measurement error and (a crude estimate of) the structural error of CLM. In [Gong et al. (2015)], the authors used adaptive surrogate-based optimization to perform parameter estimation of the Common Land Model using six observables jointly; 12 independent parameters were (deterministically) calibrated. [Sargsyan et al. (2014)] attempted to construct surrogates for five variables of interest from CLM4 with prognostic carbon and nitrogen modules turned on (i.e., CLM4-CN) using Bayesian compressive sensing (BCS) in combination with polynomial chaos expansions (PCEs). They found that the input-output relationship in CLM4-CN could be composed of qualitatively different regimes (i.e., live or dead vegetation regimes associated with different regions in the parameter space), so that clustering-based and classification-based piecewise PCE construction is needed.
This study evaluates the applicability of gPC to the calibration of the CLM4 hydrological model. We present a fully Bayesian procedure, based on (Karagiannis and Lin, 2014), which couples fully Bayesian statistics, variable selection, and generalized polynomial chaos surrogate models to address the uncertainty quantification and model inversion problem in CLM4.
The procedure produces a cheap mathematical/statistical approximation of the model output (latent heat flux) as a function of a set of model parameters. Bayesian inversion of the model parameters given observations is performed by using the cheap gPC surrogate model, instead of the expensive computer model output, in the likelihood function, and then performing Bayesian parametric inference facilitated by Markov chain Monte Carlo methods. The method allows dimension reduction and selection of the important model parameters that significantly influence the output, by computing posterior inclusion probabilities.
The layout of the paper is as follows: in Section 2, we describe the study site, input data, and the conducted numerical simulations, and present the parameters of interest that we calibrate; In Section 3, we present the inversion methodology using gPC in the Bayesian framework; In Section 4, we evaluate the inversion results; In Section 5, we draw our conclusions.

Dataset and Parameterization
The FLUXNET database (www.fluxdata.org) contains half-hourly observations of ecosystem CO2, heat fluxes and meteorological data of more than 250 sites worldwide and for a total of 960 site-years. The study site in this study is US-ARM (ARM Southern Great Plains site, Lamont, Oklahoma) [Fischer et al. (2007); Torn (2016)]. It has a vegetation type of croplands, covered with temporary crops followed by harvest and a bare soil period (e.g., single and multiple cropping systems), a humid subtropical climate, and clay-type soil texture.
Observational data used in parameter estimation are observed latent heat fluxes and runoff measurements, which are processed and gap-filled to obtain daily and monthly averaged data. We consider 10 critical hydrological parameters that are likely to have dominant impacts on the simulation of surface and subsurface runoff, latent and sensible heat fluxes, and soil moisture, as suggested in existing literature [Hou et al. (2012); Niu et al. (2005, 2007); Oleson et al. (2008, 2010)]. The selected parameters are fmax, Cs, fover, fdrai, qdrai,max (denoted as Qdm hereinafter), Sy, b, ψs, Ks, and θs. Explanations of the 10 parameters and their prior information are given in Table 1 in Hou et al. (2012). Prior distributions of the parameters were derived based on entropy theory, and 256 samples were generated using quasi-Monte Carlo sampling. Numerical simulations corresponding to the sampled parameter sets were conducted, yielding a data matrix of inputs (i.e., realizations of the 10 parameters) and outputs (i.e., latent heat fluxes), which enables the development of response surfaces or surrogates that can be used for sensitivity analysis, parameter ranking, and model calibration.
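The sampling step above can be sketched as follows. This is a minimal illustration, assuming Beta-shaped maximum-entropy priors rescaled to each parameter's physical range; the shape parameters and bounds below are placeholders, not the actual values from Hou et al. (2012), Table 1.

```python
import numpy as np
from scipy.stats import beta, qmc

names = ["fmax", "Cs", "fover", "fdrai", "Qdm", "Sy", "b", "psi_s", "Ks", "theta_s"]
d = len(names)                          # the 10 uncertain CLM4 parameters
a_shape, b_shape = 2.0, 2.0             # hypothetical Beta shape parameters
lo = np.zeros(d)                        # hypothetical lower bounds
hi = np.ones(d)                         # hypothetical upper bounds

sobol = qmc.Sobol(d=d, scramble=True, seed=0)
u01 = sobol.random(n=256)               # 256 quasi-Monte Carlo points in [0,1]^10
# Map uniform points through the Beta inverse CDF, then rescale to each range.
xi = lo + (hi - lo) * beta.ppf(u01, a_shape, b_shape)
# xi is the (256 x 10) matrix of parameter sets for the CLM4 ensemble runs.
```

Each row of `xi` defines one CLM4 simulation; running the ensemble yields the input-output data matrix used for surrogate construction.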

Bayesian methodology
We describe a synergy of Bayesian methods aiming at quantifying the importance of the input CLM4 model parameters, calibrating these parameters against real measurements, and building a surrogate model describing the input-output relation of CLM4, all in the Bayesian framework.

Bayesian inverse problem setup
Bayesian inverse methods allow the uncertainty quantification of input parameters of a computer model from observations in a probabilistic manner (Marzouk and Xiu, 2009). Bayesian inference is performed through the posterior distribution, which is derived according to Bayes' theorem and requires the specification of two components: the likelihood function, representing the information from the measurements, and the prior distribution, representing the researcher's prior information about the uncertain parameters.
We consider that the observed output value u_f ∈ R^{d_u} is associated with some unknown input ξ via a forward model (e.g., CLM4) u(·) : R^{d_ξ} → R^{d_u}, and is possibly contaminated by some additive observational noise ε_f (residuals); namely, u_f = u(ξ) + ε_f. A reasonable assumption is to model ε_f as Normally distributed with mean zero and variance σ². This is justified by central limit theorem arguments, and by the fact that the noise term accumulates several insignificant random measurement errors from observations, as well as modeling errors due to model parameterization. Thus, we consider the likelihood u_f | ξ, σ² ∼ N(u(ξ), σ²). To account for the uncertainty about the input parameters, we assign a Beta prior distribution to each ξ_i on its range [ξ_{i,min}, ξ_{i,max}]; the bounds ξ_{i,min} and ξ_{i,max} can be specified because a reasonable range of each model parameter is often a priori known. Also, the shape parameters a_{ξ_i} and b_{ξ_i} can be specified by using the method of moments or the maximum entropy method (Berger, 2013), because reasonable values for the mean and variance of each model parameter are often a priori known. To account for the uncertainty about the unknown parameters {σ²_j}, we assign a prior distribution σ²_j ∼ π_{σ²_j}(·), for j = 1, ..., d_u. A computationally convenient choice for the priors of the variances {π_{σ²_j}(·)} is the Inverse Gamma distribution IG(a_{σ²}, b_{σ²}) with parameters a_{σ²} > 0 and b_{σ²} > 0, as it is a semi-conjugate prior to the likelihood. By considering the likelihood variance σ² as a random unknown parameter and treating the problem in the Bayesian framework, we let the data decide a proper value for σ² through the posterior distribution. The prior hyper-parameters are considered fixed values, pre-defined by the researcher.
The posterior distribution of (ξ, σ²) given the observations u_f is obtained by using Bayes' theorem, and the marginal distribution π_ξ(ξ|u_f) quantifies the uncertainty about the model parameters ξ.
The posterior density (1) is usually intractable. If u(·) were known or cheap to compute, inference on ξ could be obtained using standard Markov chain Monte Carlo (MCMC) methodology (Robert and Casella, 2004). In Algorithm 1, we present a simple MCMC sampler that iteratively updates ξ and σ² in two blocks. The output sample {ξ^(t), σ^{2,(t)}}_{t=1}^T of Algorithm 1 can be used to perform inference on ξ and σ²; e.g., the expectation of any function h(·) of ξ can be approximated as E[h(ξ)|u_f] ≈ (1/T) Σ_{t=1}^T h(ξ^(t)), for large T.
In the present application, u(·) is prohibitively expensive to compute iteratively, because each evaluation requires a full CLM4 run. This prevents us from directly using this method, and in particular Algorithm 1. To overcome this issue, we build a cheap but accurate proxy (called a surrogate model) for u(·) and plug it into (1). The construction of such a surrogate model, in the Bayesian framework, is discussed in Section 3.2.

Algorithm 1 Blocks of the MCMC sweep
Block I (update ξ): simulate ξ from a Metropolis-Hastings transition probability that targets the full conditional distribution π(ξ | σ², u_f).
Block II (update σ²): simulate σ² from its full conditional distribution (an Inverse Gamma, by semi-conjugacy).
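A minimal sketch of Algorithm 1, assuming a scalar output, a flat prior on ξ over the unit hypercube, and the semi-conjugate Inverse Gamma update for σ². The forward model `forward` below is a cheap stand-in (in practice it would be the gPC surrogate of CLM4), and the dimensions and hyper-parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(xi):                        # cheap stand-in for the (surrogate) model u(.)
    return xi.sum()

u_f = 2.0                               # observed output value
d = 4                                   # input dimension (10 in the CLM4 case)
a_s2, b_s2 = 1e-3, 1e-3                 # Inverse Gamma hyper-parameters

def log_post_xi(xi, s2):                # log-likelihood plus flat prior on [0,1]^d
    if np.any(xi < 0.0) or np.any(xi > 1.0):
        return -np.inf
    return -0.5 * (u_f - forward(xi)) ** 2 / s2

xi, s2 = np.full(d, 0.5), 1.0
samples = []
for t in range(5000):
    # Block I: random-walk Metropolis-Hastings update of xi
    prop = xi + 0.1 * rng.standard_normal(d)
    if np.log(rng.uniform()) < log_post_xi(prop, s2) - log_post_xi(xi, s2):
        xi = prop
    # Block II: Gibbs update of sigma^2 from its Inverse Gamma full conditional
    resid2 = (u_f - forward(xi)) ** 2
    s2 = 1.0 / rng.gamma(a_s2 + 0.5, 1.0 / (b_s2 + 0.5 * resid2))
    samples.append(xi.copy())
```

The retained `samples` approximate the posterior of ξ; ergodic averages over them estimate posterior expectations as in the formula above.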

Surrogate model specification
We describe a fully Bayesian procedure for building a surrogate model, to be used as a cheap but accurate proxy of u(·) in (1), based on gPC expansions and MCMC methods.
The highlight is that, apart from producing a surrogate model, the procedure is able to quantify the importance of each PC basis via posterior inclusion probabilities, which allows the selection of the important input parameters: Fmax, Cs, Fover, Fdrai, Qdm, Sy, B, Psis, Ks, and thetas.

Generalized polynomial chaos expansion
We consider the output parameter u(ξ) as a function of the d ξ -dimensional vector of random input variables ξ ∈ Ξ that admits distribution f (·).
The output parameter u(ξ) can be represented by an infinite series of PC bases {ψ_α(·)} and PC coefficients {c_α} in the tensor form u(ξ) = Σ_{α ∈ N_0^{d_ξ}} c_α ψ_α(ξ), for ξ ∼ f(·) (Xiu, 2010). Here, α := (α_1, ..., α_{d_ξ}) denotes a multi-index of size d_ξ defined on the set of non-negative integers N_0. Each multidimensional PC basis ψ_α(·) results as a tensor product of univariate orthogonal polynomial bases ψ_{α_j}(·) of degree α_j ∈ N_0, namely ψ_α(ξ) = Π_{j=1}^{d_ξ} ψ_{α_j}(ξ_j). It is common in practice, but not a panacea, for the family of PC bases {ψ_α(·)} to be pre-defined so that they are orthogonal with respect to the distribution f(·). In this way, many common distributions can be associated with a specific family of polynomials, e.g.
the Askey family (Xiu and Karniadakis, 2002). In this work, we focus on Jacobi polynomial bases (Xiu and Karniadakis, 2002), which can be defined recursively through the standard three-term recurrence relation on the reference interval [-1, 1]; each input is rescaled as ξ̃ = 2(ξ - ξ_min)/(ξ_max - ξ_min) - 1 ∈ [-1, 1], where ξ_min and ξ_max are the minimum and maximum of ξ.
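The rescaling and basis evaluation can be sketched with SciPy's Jacobi polynomial routine; the parameter range and Jacobi shape parameters below are hypothetical (in the application they are matched to the Beta priors of the inputs):

```python
import numpy as np
from scipy.special import eval_jacobi

# Rescale a physical input to the Jacobi reference interval [-1, 1].
xi_min, xi_max = 0.2, 5.0               # hypothetical parameter range
xi = np.linspace(xi_min, xi_max, 7)
x = 2.0 * (xi - xi_min) / (xi_max - xi_min) - 1.0

alpha_j, beta_j = 1.0, 2.0              # hypothetical Jacobi shape parameters
# Evaluate the univariate Jacobi bases of degrees 0..3 at the rescaled points.
vals = [eval_jacobi(deg, alpha_j, beta_j, x) for deg in range(4)]
```

The degree-0 basis is the constant 1, and higher degrees follow the three-term recurrence internally.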
In practice, a truncated version of (2) is used by considering a finite set of available PC bases. Traditionally, the total-degree truncation rule is used, which results in the expansion form u(ξ) ≈ Σ_{α ∈ A} c_α ψ_α(ξ), accounting only for the finite set of multi-indices A = {α ∈ N_0^{d_ξ} : ||α||_1 ≤ p_ξ}. Other truncation rules can be adopted (Blatman and Sudret, 2011; Sargsyan et al., 2013).
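The total-degree multi-index set A can be enumerated directly; a small sketch for illustrative dimensions:

```python
from itertools import product

def total_degree_multi_indices(d, p):
    """All multi-indices alpha in N_0^d with |alpha|_1 <= p (total-degree rule)."""
    return [a for a in product(range(p + 1), repeat=d) if sum(a) <= p]

# Example: d = 3 inputs, total degree p = 2.
A = total_degree_multi_indices(d=3, p=2)
# The set size is binomial(d + p, d), which grows quickly with d and p.
```

This combinatorial growth is exactly why basis selection (Section 3.2.2) matters in higher dimensions.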
The evaluation of the gPC expansion is challenging. Although the PC coefficients can formally be expressed as projections of the output onto the bases (Xiu, 2010), they are not available in closed form due to the intractable integration in the expectation. Moreover, in high-dimensional scenarios, if a high degree of accuracy is required, the number of unknown PC coefficients is of order d_ξ^{p_ξ} and grows rapidly with the dimension d_ξ and the PC degree p_ξ. This causes computational problems such as over-fitting (Doostan and Owhadi, 2011; Karagiannis and Lin, 2014). Reduction of the PC degree, or careless omission of PC bases in order to reduce the number of unknowns, may lead to a significant increase of bias and provide inaccurate surrogate models. Hence, there is a particular interest in keeping in the gPC only the inputs or bases that significantly affect the output. The Bayesian procedure in Section 3.2.2 effectively addresses these matters.

Bayesian training procedure
We describe a stochastic and automatic Bayesian procedure which accurately evaluates the PC coefficients and the gPC surrogate model, while allowing the selection of the significant PC bases, and hence of the important input model parameters. This procedure is able to trade off efficiently between the bias (caused by omitting bases) and the over-fitting (caused by retaining too many bases).
Furthermore, it can select the significant PC bases and estimate the PC coefficients simultaneously, while providing credible intervals.
We assume there is available a training dataset D = {(u_j, ξ_j)}_{j=1}^{n_ξ}, where n_ξ is the size of the dataset, ξ_j denotes the j-th random input value, and u_j := u(ξ_j) denotes the corresponding output value. Given the training dataset D and the gPC expansion, it is u_j = Σ_{a ∈ A} c_a ψ_a(ξ_j) + ε_j, where ε_j ∈ R is a residual term associated with the j-th datum of D. Eq. (5) can be written in matrix form u = X c + ε, where u := (u_j; j = 1 : n_ξ), ε := (ε_j; j = 1 : n_ξ), X_a := (ψ_a(ξ_j); j = 1 : n_ξ), and X := (X_a; a ∈ A) is an n_ξ × m_ξ dimensional matrix of basis functions.
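Assembling the design matrix X from the training inputs can be sketched as below; the dimensions are small and illustrative, the inputs are assumed already rescaled to [-1, 1], and Jacobi shape parameters of zero (i.e., Legendre bases) are used purely as placeholders:

```python
import numpy as np
from itertools import product
from scipy.special import eval_jacobi

rng = np.random.default_rng(0)
d, p, n_xi = 2, 3, 50                   # illustrative input dimension, degree, sample size
xis = rng.uniform(-1.0, 1.0, size=(n_xi, d))   # training inputs, rescaled to [-1,1]
A = [a for a in product(range(p + 1), repeat=d) if sum(a) <= p]  # total-degree set

def psi(alpha, xi_row, aj=0.0, bj=0.0):
    """Tensor-product Jacobi basis: product of univariate polynomials."""
    out = 1.0
    for deg, x in zip(alpha, xi_row):
        out *= eval_jacobi(deg, aj, bj, x)
    return out

# n_xi x m_xi design matrix: one row per training point, one column per basis.
X = np.array([[psi(a, row) for a in A] for row in xis])
```

With X in hand, the surrogate-fitting problem reduces to the linear model u = Xc + ε of the text.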
In the following, we formulate the Bayesian model. The likelihood function L(u|c, σ²) := L({u_j}|{ξ_j}, c, σ²) is L(u|c, σ²) = N(u | X c, σ² I_{n_ξ}), where N(·|μ, Σ) denotes the Normal density with mean μ and covariance Σ. The choice of the likelihood is merely a modelling choice. Here, the likelihood function can be considered as a measure of goodness-of-fit of the truncated gPC expansion to the training dataset, where the statistical discrepancy between the real model and the gPC expansion, for a given training dataset, is quantified by the residual terms {ε_j}. We consider the hierarchical prior model (8) for π(c, γ, σ², λ, ρ), where a_λ, b_λ, a_ρ, and b_ρ are fixed prior hyper-parameters, predetermined by the researcher. In the Bayesian framework, inference on the unknown parameters of the model can be performed based on the posterior distribution π(c, γ, σ², λ, ρ|D) = L(u|c, σ²) π(c, γ, σ², λ, ρ) / ∫ L(u|c, σ²) π(c, γ, σ², λ, ρ) d(c, γ, σ², λ, ρ).
Particular interest lies in the computation of the inclusion probabilities π(γ_a|D), which refer to the marginal posterior probability that the a-th basis is important, and of π(c_a|D, γ_a), which refers to the posterior density of the a-th PC coefficient.
To fit the Bayesian model, we resort to Markov chain Monte Carlo (MCMC) methods because the above posterior distribution (9) is intractable and cannot be sampled directly.
The conjugate prior model (8) allows the design of a Gibbs sampler (Geman and Geman, 1984; Hans, 2010) whose blocks involve sampling from the full conditional distributions of the updated parameters. In Algorithm 2, we present pseudo-code for one sweep of the Gibbs sampler, along with the associated full conditional distributions. We highlight that the procedure is fully automatic, because there is no need to tune the algorithmic parameters involved in Algorithm 2. The notation c_{-a} (and X_{-a}) refers to the vector c (and matrix X) excluding the a-th element (and column). Moreover, we denote m_{ξ,γ} = Σ_{a∈A} γ_a.
In order to evaluate the surrogate model, as well as quantify the importance of the input model parameters, we consider two fully Bayesian approaches: Bayesian model averaging (Hoeting et al., 1999), most suitable in cases where the predictive ability of the surrogate model is of main interest, and the median probability model (Barbieri and Berger, 2004), most suitable in cases where the interest is in discovering a sparse (or parsimonious) representation of the stochastic solution or in selecting important input model parameters.
Bayesian model averaging: The evaluation of the gPC expansion (4) can be performed by Bayesian model averaging (BMA) (Hoeting et al., 1999) if the predictive ability of the gPC expansion is of main interest.
We consider a Gibbs sample {γ^(t), c^(t), σ^{2,(t)}, λ^(t), ρ^(t)}_{t=1}^T generated by Algorithm 2. Estimates and associated standard errors of {c_a} can be computed by ergodic averages according to standard Bayesian practice (Robert and Casella, 2004; Hoeting et al., 1999), e.g., ĉ_a = (1/T) Σ_{t=1}^T γ_a^(t) c_a^(t), with the surrogate evaluated as û(ξ) = Σ_{a∈A} ĉ_a ψ_a(ξ). (10) Median probability model based evaluation: A parsimonious (or sparse) surrogate model involving only significant basis functions and important input model parameters can be obtained by examining the estimated inclusion probabilities π(γ_a|D). A suitable probabilistic basis-selection mechanism is the median probability model (MPM) (Barbieri and Berger, 2004).
Given a Gibbs sample {γ^(t), c^(t), σ^{2,(t)}, λ^(t), ρ^(t)}_{t=1}^T drawn from Algorithm 2, the marginal inclusion posterior probabilities, namely the posterior probabilities that each PC basis is significant, can be estimated as P̂_a = π̂(γ_a|D) = (1/T) Σ_{t=1}^T γ_a^(t). After the selection of the significant PC bases according to the aforementioned MPM rule (retain basis a if P̂_a ≥ 0.5), inference about the unknown quantities of interest can be performed by re-running the Gibbs sampler (Algorithm 2) with γ fixed at γ^(MPM). The new Gibbs sample {c^(t), σ^{2,(t)}, ...}_{t=1}^T can be used to perform inference and estimation. For instance, estimates for μ, v, and u(ξ) can be computed by Monte Carlo integration, using the equations of the estimators in (10). Note that a number of the coefficients {c_a^(t)} will be constantly zero for all t = 1, ..., T. The reason is that, unlike in the BMA approach, here we consider only a single subset of PC bases, and hence the inclusion parameter is fixed at γ^(t) = γ^(MPM), for t = 1, ..., T.
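The estimation of P̂_a and the MPM rule can be sketched with synthetic inclusion draws standing in for the output of Algorithm 2 (the inclusion rates below are fabricated for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
T, m = 1000, 6                          # Gibbs sample length, number of PC bases
true_p = np.array([0.95, 0.9, 0.1, 0.05, 0.7, 0.2])   # synthetic inclusion rates
gamma = rng.uniform(size=(T, m)) < true_p             # stand-in draws of gamma^(t)

# Marginal inclusion probabilities: ergodic average of the indicator draws.
P_hat = gamma.mean(axis=0)
# Median probability model: retain the bases with P_hat >= 0.5.
mpm = P_hat >= 0.5
```

In the application, `gamma` would come from the Gibbs sampler, and the bases flagged by `mpm` define the sparse surrogate used in the re-run.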
The MPM approach allows the selection of the important input model parameters that significantly affect the output. If an input parameter ξ_j is not represented by any significant PC basis in the gPC expansion, it is reasonable to consider that ξ_j does not significantly affect the model output, and hence it can be omitted from the analysis.

[Table 1: monthly observed values of LH at the US-ARM site, used for calibration]

Results
We apply the above methodology, which involves two stages: (i) build a surrogate model to replace the accurate but expensive forward model CLM4, according to the methodology in Section 3.2; and (ii) conduct the inversion of the 10 inputs, according to the theory in Section 3.1.

Surrogate model building step
For each month, we build a surrogate model that maps the CLM4 input ξ to the output LH u. For this type of dataset, Hou et al. (2012) and Sun et al. (2013) observed that the dependency between the LH outputs corresponding to different months is weak and hence can be neglected. Therefore, we build surrogate models for each month independently. An advantage of assuming independence is that it leads to a simpler parameterization of the statistical model, which is easier to treat and interpret.
To build the gPC expansion, we follow the procedure in Section 3.2. For the design of the gPC expansion, we consider PC bases from the Jacobi polynomial family. The parameters of the Jacobi PC bases are set according to the prior information on the CLM4 inputs in (Hou et al., 2012, Table 1), by matching the moments of the corresponding shifted Beta distribution to which they are orthogonal. We consider the prior model (8) with hyper-parameters a_λ = 10^{-3}, b_λ = 10^{-3}, a_σ = 10^{-3}, b_σ = 10^{-3}, a_ρ = 1, and b_ρ = 1. This choice of hyper-parameters leads to weakly informative priors, which is reasonable given the lack of prior information about the parameterization of the surrogate model. For training the gPC expansion, we used the US-ARM training dataset. We ran Algorithm 2 for 2 · 10^5 iterations, where the first 10^5 were discarded as burn-in.
In Figures 1-2, we present the marginal posterior probabilities {Pr(γ j |D)} computed by the ergodic average of the occurrences of the corresponding PC bases in the Gibbs sample.
We observe that during the period May-August the marginal inclusion probabilities are higher than those of the remaining months. This indicates that the input parameters of CLM4 may have a larger impact on the output LH during those months. From the hydrology point of view, this is expected, because LH is higher on average and has larger variability during these months, and the effects of hydrological parameters are expected to be more pronounced in the summer months.
We can infer that the input parameters Fdrai, Qdm, and B are significant according to the MPM rule. This is because these input parameters are represented by significant PC basis functions; namely the corresponding marginal inclusion probabilities in Figures 1-2 are greater than 0.5. From the hydrology perspective this is reasonable, because these parameters are major factors controlling the drainage and runoff generation, which in turn impact heat fluxes. The results are also consistent with the previous work in [Hou et al. (2012)].
In Figures 3-4, we use box-plots to represent the posterior density estimates of the PC coefficients generated by the Gibbs sampler. We observe that the coefficients with narrow bounds around zero correspond to the non-significant PC bases in Figures 1-2, namely those with Pr(γ_j|D) < 0.5, which shows that the method is consistent. Moreover, we observe that the significant PC coefficients corresponding to the period May-August are in general larger in magnitude than those of the remaining months, which is as expected.
The gPC expansion of LH as a function of the input parameters ξ can be evaluated according to the estimator in (10) which is the ergodic average of the Gibbs sample.

Inversion step:
We calibrate the 10 CLM parameters ξ = (Fmax, Cs, Fover, Fdrai, Qdm, Sy, B, Psis, Ks, thetas) against the measurements u_f of LH in Table 1, as described in Section 3.1. We consider Beta priors on the CLM parameters, whose hyper-parameters are specified by using the method of moments and based on the prior information in Hou et al. (2012).
Calibration is performed by running the MCMC sampler (Algorithm 1) and evaluating the posterior distributions according to the procedure in Section 3.1. Even though, in the previous step, we detected that only Fdrai, Qdm, and B are significant, we consider all 10 parameters for calibration, to obtain information about the remaining ones as well. In order to make the MCMC sampler tractable, we replace the forward model u(·) of CLM4 in Algorithm 1 with the estimated gPC expansion, which serves as a surrogate model.
We run the MCMC sampler (Algorithm 1) for 2 · 10^4 iterations and discard the first 10^4 as burn-in.
In Figures 5-6, we present the estimated posterior densities of the input parameters of CLM4.

Model validation
We validate the effectiveness of the Bayesian inversion procedure by evaluating the predictive distribution of the output LH, using the derived gPC surrogate model and the MCMC sample of the input parameters generated by Algorithm 1. In Figures 7-8, we present the resulting predictive distributions of the output LH for the 12 months. The blue bars belong to the histogram estimate, the red line is the kernel density estimate of the predictive distribution, and the green arrow represents the observed output value of LH. The plots show that, for each month, the observed output value u_f for LH lies below the mode (i.e., within the high-probability region) of the corresponding marginal predictive distribution. This implies that the proposed methodology is valid, and that the surrogate model derived from the method is able to produce accurate predictions.
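The predictive check described above can be sketched as follows, with a toy stand-in for the fitted gPC surrogate and synthetic MCMC draws; the function, dimensions, and values below are illustrative, not the actual CLM4 quantities:

```python
import numpy as np

rng = np.random.default_rng(3)

def surrogate(xi):                      # stand-in for the fitted gPC expansion of LH
    return 50.0 + 30.0 * xi.mean()

# Stand-in MCMC output from Algorithm 1: posterior draws of (xi, sigma^2).
xi_draws = rng.uniform(0.0, 1.0, size=(2000, 10))
s2_draws = rng.gamma(2.0, 1.0, size=2000)

# Posterior predictive draws: propagate each xi draw through the surrogate
# and add Normal(0, sigma^2) observational noise.
u_pred = np.array([surrogate(x) + rng.normal(0.0, np.sqrt(s2))
                   for x, s2 in zip(xi_draws, s2_draws)])
# A histogram / kernel density of u_pred is then compared with the observed LH value.
```

The resulting sample of `u_pred` plays the role of the histograms and kernel density estimates shown in Figures 7-8.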

Conclusion
In this study, we focused on evaluating uncertainties associated with hydrologic parameters in CLM4 in the Bayesian framework. We presented a fully Bayesian uncertainty quantification (UQ) methodology that couples generalized polynomial chaos expansions with Bayesian variable selection methods. The methodology involves two steps: the construction of a surrogate model to express the input-output mapping, and the evaluation of the posterior distribution of the input parameters for a given value of the output parameter LH.
For the construction of the surrogate model, we propose a Bayesian procedure, based on variable selection methods, that uses gPC expansions and accounts for basis-selection uncertainty. The advantage of this approach is that it can quantify the significance of the gPC terms, and hence the importance of the input parameters, in a probabilistic manner. The input posterior distributions were evaluated by Bayesian inverse modeling. Our empirical results showed that the proposed method is suitable for performing inverse modeling of hydrologic parameters in CLM4 and is able to effectively describe the uncertainty related to these parameters.

[Figure 8: Distributions of the output LH associated with the input parameters of CLM4, drawn from the posterior distributions; July-December]
Our future work involves the comparison of the proposed method against other methods, based on neural networks and generalized linear models, on which we are currently working.