A Bayesian Approach to the Estimation of Parameters and Their Interdependencies in Environmental Modeling

We present a case study for Bayesian analysis and proper representation of distributions and dependence among parameters when calibrating process-oriented environmental models. A simple water quality model for the Elbe River (Germany) serves as an example, but the approach is applicable to a wide range of environmental models with time-series output. Model parameters are estimated by Bayesian inference via Markov Chain Monte Carlo (MCMC) sampling. While the best-fit solution matches usual least-squares model calibration (with a penalty term for excessive parameter values), the Bayesian approach has the advantage of yielding a joint probability distribution for parameters. This posterior distribution encompasses all possible parameter combinations that produce a simulation output that fits observed data within measurement and modeling uncertainty. Bayesian inference further permits the introduction of prior knowledge, e.g., positivity of certain parameters. The estimated distribution shows to what extent model parameters are controlled by observations through the process of inference, highlighting issues that cannot be settled unless more information becomes available. An interactive interface enables tracking of how the ranges of parameter values that are consistent with observations change during a step-by-step assignment of fixed parameter values. Based on an initial analysis of the posterior via an undirected Gaussian graphical model, a directed Bayesian network (BN) is constructed. The BN transparently conveys information on the interdependence of parameters after calibration. Finally, a strategy to reduce the number of expensive model runs in MCMC sampling is introduced, based on a newly developed variant of delayed acceptance sampling with a Gaussian process surrogate and linear dimensionality reduction to support function-valued outputs.


Introduction
Mathematical ecosystem models differ with regard to mathematical complexity and the number of free parameters involved. Even the most complex models vastly simplify reality, including arbitrary choices with respect to their structure. It remains an ever-present challenge to suitably balance model complexity with the amount of data available for model calibration. Having agreed on a certain structure, overparameterization, i.e., parameters not being controlled by observational evidence, is a ubiquitous problem. Different parameter combinations may produce very similar results, e.g., [1][2][3][4][5]. Simultaneously strengthening one process and weakening another may have no substantial overall effect. As a result, some parameters can be set to unrealistic values without contradicting the data under study.
From a purely mathematical point of view, the problem of overparameterization should be alleviated by striving for a reduced number of model parameters, discarding simulation of details. Alternatively, variables might be aggregated into a set of a few compound variables. However, lumped parameters often lack a clear process-oriented interpretation. A simplified pure input-output model might be successful in predicting variables of interest but usually cannot explain why certain things are going to happen. In process-oriented simulations, each model parameter has a specific meaning. Knowing its interpretation will be important for any informed management action based on integrated assessments [6]. The method proposed in this study therefore keeps the full set of model parameters. However, instead of providing for each of these parameters its most probable value (possibly together with an error bar), we provide a description of the joint probability density of the full set of parameters.
We tackle model calibration by methods of Bayesian probability theory (see, e.g., [7,8] for an introduction). This theory provides a consistent framework for assigning probabilities to quantify a degree of belief. In contrast to frequentist analysis, where probability is usually defined in terms of countable statistics, Bayesian analysis enables the assignment of probabilities to sets of model parameters before (prior) and after (posterior) the observation of calibration data, using the likelihood that the model generates these data. Investigating the structure of the posterior distribution sheds light on the quality of the estimation of model parameters and their interdependencies by observational data.
We propose using the Bayesian network (BN) technology [9,10] to describe interdependencies within parameter sets that produce simulations in agreement with observed data. In contrast to their established use for statistical modeling, we apply BNs to the analysis of a posterior after calibrating a process model. The underlying ensemble of successful simulations is produced by performing Markov Chain Monte Carlo (MCMC) simulations. A BN represents a joint distribution of multivariate data by its factorization in terms of conditional probabilities. Various software packages are available for that purpose. Evidence provided for any subset of parameters can be spread across the whole network, potentially changing the marginal distributions of all other parameters.
The problem with constructing a saturated BN (retaining all possible interactions) is the dimension and overall size of conditional probability tables needed. However, as long as all parameters are allowed to interact, conditional marginal distributions can also be obtained by sub-sampling from the set of successful parameter sets generated by MCMC. In a first step, we will follow this direct approach. A benefit from using a BN based on conditional probability tables fitted to the data arises when the goal is to focus on just the most important interaction patterns. A BN displays such interaction structure in terms of a directed acyclic graph (DAG). Often, a DAG is seen as a means to represent cause-effect relationships [11,12]. Although this concept is not applicable for the example under study, we explore graph simplification to visualize key parameter dependencies.
An important limitation of the presented approach is the computing time required for simulation runs during MCMC sampling. Delayed acceptance can accelerate the procedure by up to a factor of one over the acceptance rate [13,14]. To do so, it requires a surrogate of the posterior, which, in the case of model calibration, contains the cost function inside the likelihood. The simplest way to implement delayed acceptance relies on a surrogate with scalar output built for this cost function or for the likelihood. Here, we take an intermediate step and construct a surrogate for the functional output of a blackbox model to be calibrated against reference data. Typical examples are numerical simulations that output time series or spatial data and depend on tunable input parameters. We demonstrate the application of this approach on two examples using usual and hierarchical Bayesian model calibration. In the latter case, a surrogate beyond the L2 cost function is required if the likelihood depends on additional auxiliary parameters. As an example, we allow variations of the (fractional) order of the norm, thereby marginalizing over different noise models, including Gaussian and Laplacian noise.
This paper is organized in the following way. In Section 2, this work is put into context with existing literature. In Section 3, we describe the problem addressed as a case study, a simple numerical model that simulates chlorophyll a time series observed at a station on the Elbe River. Section 4 gives a short introduction into Bayesian probability theory and the way Markov Chain Monte Carlo is implemented in the context of model calibration. A brief introduction is given to graphical modeling, encompassing both Gaussian graphical models and Bayesian networks. A special tool for analyzing conditional posterior marginal distributions of MCMC parameter samples is presented. In Section 5, we first illustrate model output uncertainties that arise from posterior model parameter uncertainties. Then, dependencies between calibrated parameters are explored, looking at conditionalized marginal distributions. Finally, a Bayesian network with simplified parameter dependencies is devised based on the results from fitting a Gaussian graphical model to the MCMC parameter samples. Section 6 provides a comprehensive discussion followed by some conclusions.

Relation to Existing Work
In the light of inevitable model uncertainties, Fedra [15] proposed replacing predictions that pretend an unrealistic precision with multiple predictions covering predictive uncertainty. He suggested models be used for more qualitative discrimination between different options rather than for detailed predictions. Identifiability of model parameters requires the observed part of model output to be sensitive to these parameters [16]. However, a lack of identifiability must be distinguished from model output insensitivity [17]. Local sensitivity analyses based on local derivatives of some model output are suitable when a model is essentially linear. By contrast, global sensitivity analysis (GSA) takes a sampling approach to apportion model output uncertainties to uncertainties of single input parameters or combinations thereof [18]. Model output variance may be decomposed in terms of orthogonal partial variances with an increasing number of input parameters contributing to them (Sobol' indices; [19]). In practice, the large number of Monte Carlo simulations needed often makes calculation of higher order partial variances infeasible. Sudret [20] proposes the use of surrogate models based on polynomial chaos expansion (PCE), originally developed by Wiener [21], as a possible way out of this difficulty. Using PCE, Sobol' indices can be calculated analytically.
Referring to the problem of 'equifinality' of different acceptable models, Beven and Freer [22] developed the generalized likelihood uncertainty estimation (GLUE) methodology in which model simulations are ranked according to their performance. Interrelationships between parameters that lead to satisfactory results are implicitly represented by the respective subset of Monte Carlo sampling. In contrast to Bayesian Monte Carlo (BMC) [23], for instance, GLUE does not employ likelihood in a statistically rigorous sense. For a comparison of GLUE with formal Bayesian approaches, including also Markov Chain Monte Carlo (MCMC), see Vrugt et al. [24] or Camacho et al. [25], for instance. Ratto et al. [26] combined GLUE and GSA, replacing model output variability in GSA by the variability of an informal GLUE likelihood measure. Callies et al. [27] applied the GSA-GLUE approach to a model, which is similar to the model [28] the present case study refers to.
There exist numerous related works treating blackbox models with functional outputs via surrogates. Campbell et al. [29] use an adaptive basis of principal component analysis (PCA) to perform global sensitivity analysis. Pratola et al. [30] and Ranjan et al. [31] use GP regression for sequential model calibration in a Bayesian framework. Lebel et al. [32] model the likelihood function in an MCMC model calibration via a Gaussian process. Perrin [33] compares the use of a multi-output GP surrogate with a Kronecker structure to an adaptive basis approach. The extensions presented here rely on the adaptive basis approach in principal components (Karhunen-Loève expansion or functional PCA) to reduce the dimension of the functional output, while modeling the map from inputs to weights in this basis via GP regression.
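The adaptive basis idea can be sketched in a few lines of plain NumPy (a minimal illustration: the function names and the use of an SVD in place of a dedicated PCA routine are our choices, and a full implementation would fit a GP regressor to the returned weights rather than stop here):

```python
import numpy as np

def fit_pca_basis(outputs, n_components):
    """Reduce an ensemble of functional outputs to a few PCA weights.

    outputs : array of shape (n_runs, n_times), one simulated time
              series per model run
    Returns the ensemble mean, the truncated principal basis and the
    low-dimensional weights (coordinates) of each run in that basis.
    """
    mean = outputs.mean(axis=0)
    # SVD of the centered ensemble yields the principal directions
    _, _, vt = np.linalg.svd(outputs - mean, full_matrices=False)
    basis = vt[:n_components]              # (n_components, n_times)
    weights = (outputs - mean) @ basis.T   # (n_runs, n_components)
    return mean, basis, weights

def reconstruct(mean, basis, weights):
    """Map low-dimensional weights back to a full time series."""
    return mean + weights @ basis
```

A surrogate then only has to predict `weights` from the input parameters, e.g., with one GP per retained component.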

General Background
Weir Geesthacht, located on the Elbe River (Elbe km 586) some 40 km upstream of the city of Hamburg (Figure 1), separates the riverine part of the Elbe River from its estuary issuing into the North Sea. Quasi-continuous observations of several parameters (see [28]), obtained from an automated flow-through unit operated at the weir by the former GKSS Research Center Geesthacht (now Helmholtz-Zentrum Hereon), are available for the years 1997-2001. Here, we focus on concentrations of chlorophyll a (obtained from observed fluorescence, using calibration based on high-performance liquid chromatography (HPLC)) and silica, observed during March-October. Chlorophyll a data were collected quasi-continuously, silica on an hourly basis. Our case study on the use of Bayesian methods will be based on a very simple water quality model published previously. Considering no algal species other than diatoms, the model simulates chlorophyll a time series at Geesthacht Weir as a function of a few environmental parameters. We do not consider other observations available at Geesthacht Weir (oxygen, pH, turbidity, nutrients), as our intention is not to improve the model. We rather wish to demonstrate the presence of overparameterization in even this simple model and to show how Bayesian methods can tackle this problem and make users aware of it.
According to Karrasch et al. [34], diatoms dominate algae biomass in the Elbe River. Our diatom-based model fits chlorophyll a observations quite well, despite its simplicity. For the Rhine river, de Ruyter van Steveninck et al. [35] identified a potential silica limitation effect during an experiment in 1990. Generally, it is very difficult to identify those among different biological processes that really control observed phytoplankton growth. Numerical models provide a means to at least formalize different hypotheses, estimate their consequences and compare them with observational evidence.
The complexity of the example model described below is much lower than that of most other mechanistic models trying to resolve processes in more detail, e.g., [36]. A key feature of the simple model is its ability (depending on how parameter values are set) to explain sporadic sharp decreases in chlorophyll a concentrations in terms of diatoms suffering from lack of silica [28]. At station Geesthacht, very low silica concentrations (below 0.1 mg Si/L) are observed during summer. However, the model also offers temperature dependent grazing rates as a potential alternative mechanism. Hardenbicker et al. [37] report an experimental study on major differences between the plankton dynamics in the two rivers Rhine and Elbe. They try to substantiate the hypothesis that much lower phytoplankton densities in the Rhine than in the Elbe River might be due to grazing by invasive bivalves being more abundant in the Rhine than in the Elbe River. Similarly, Waylett et al. [38] argue that between-year differences in grazing are likely to explain interannual variability of phytoplankton loss observed in the upper Thames. Although the temperature dependent loss rate assumed in our model is very simplistic compared to real world conditions, model calibration can nevertheless be hoped to indicate how distinguishable such temperature dependent mechanisms are from silica related effects.

Lagrangian Model Concept
For our case study, we revive a simple model originally introduced by Callies et al. [27] and even further simplified (neglect of all algae species other than diatoms, assumption of constant maximum growth rates of diatoms, no shading by mineral compounds) by Scharfe et al. [28]. Using a Lagrangian model concept originally suggested by Schroeder [39], fluid parcels are released at Schmilka close to the Czech-German border (Figure 1). These fluid parcels are then assumed to travel downstream until they reach Geesthacht Weir. During its journey, each parcel is treated like a biological reactor exposed to time dependent external forcing. Ideally, simulations would be complemented by Lagrangian sampling campaigns, trying to follow water parcels during their transport, e.g., [35,37]. However, the data available for this study provide observations just at the drift paths' end points, so that concentrations simulated on a fluid parcel's arrival at Geesthacht Weir will be contrasted with corresponding observations at that time. For each parcel travelling down the Elbe River, the following equation for chlorophyll a concentration C_chl is integrated in time t:

dC_chl/dt = (µ(t) − σ(t)) C_chl. (1)

Time dependent growth rate µ and loss rate σ will be further detailed in Section 3.3.
The potential for chlorophyll a development is assumed to depend on the amount of silica being available. Our simple model concept assumes that an initial reservoir of silica, C_Si(t_0), is continuously depleted due to assimilation of silica by algae. The following equation describes the evolution of silica concentration C_Si in a given fluid parcel:

dC_Si/dt = −f_Si µ(t) C_chl, (2)

where parameter f_Si specifies the ratio of silica to chlorophyll a in algal biomass. Note that Equation (2) does not take into account any sources of silica, like releases from the sediment or inputs from tributaries. Following Scharfe et al. [28], we initialized Equation (2) with concentrations observed at station Schmilka near the Czech-German border (Elbe km 4, see Figure 1). This station marks the end of the upper reach of the river (about 370 km long) with a mean river discharge of about 310 m^3/s (compared to about 730 m^3/s at Geesthacht Weir, 580 km further downstream). Initial concentrations were constrained, however, by a minimum value of 2 mg Si/L. Simulations for different times are completely independent of each other; any temporal coherence at Geesthacht Weir is brought about just by the fact that external forcing (light and temperature) will be the same for trajectories overlapping in time. Each fluid parcel is initialized with the same low chlorophyll a concentration (10 µg chl a/L), a value expected to be largely overwritten during the water parcel's 580 km journey. What might contribute to the success of the very simple approach is that in nature there seem to be only small contributions of chlorophyll a from major tributaries of the river Elbe [37].
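The coupled parcel dynamics of Equations (1) and (2) can be illustrated with a simple forward-Euler integration (a hedged sketch: function names, the integration scheme and the example rates in the usage below are placeholders, not those of the actual model):

```python
import numpy as np

def integrate_parcel(mu, sigma, f_si, t, c_chl0=10.0, c_si0=2.0):
    """Forward-Euler integration of Equations (1) and (2) along one parcel.

    mu(t, c_si) : growth rate [1/day], may depend on available silica
    sigma(t)    : loss rate [1/day]
    f_si        : silica-to-chlorophyll-a ratio in algal biomass
    t           : time points [days] from release to arrival at the weir
    """
    c_chl = np.empty_like(t)
    c_si = np.empty_like(t)
    c_chl[0], c_si[0] = c_chl0, c_si0
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        growth = mu(t[i - 1], c_si[i - 1]) * c_chl[i - 1]
        # Eq. (1): dC_chl/dt = (mu - sigma) * C_chl
        c_chl[i] = c_chl[i - 1] + dt * (growth - sigma(t[i - 1]) * c_chl[i - 1])
        # Eq. (2): dC_Si/dt = -f_Si * mu * C_chl  (silica is only depleted)
        c_si[i] = max(c_si[i - 1] - dt * f_si * growth, 0.0)
    return c_chl, c_si
```

With constant illustrative rates, e.g., `integrate_parcel(lambda t, s: 0.5, lambda t: 0.1, 0.05, np.linspace(0, 10, 101))`, chlorophyll a grows while the silica reservoir is drawn down.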
It must be stressed that our drift simulations take into account neither variations in river geometry nor any dispersion processes or supply from external sources. The only consequence of changing river discharge is a changing travel time, i.e., the time available for algae growth and loss processes. Travel times τ (usually less than 10 days) are estimated as a function of discharge Q, with reference values Q_ref = 270 m^3/s and τ_ref = 10 days. The resulting simple relationship was found to reasonably agree with existing flow time data [40].
Simulations are scheduled in such a way that drift trajectories arrive once a day at noon. For a more detailed illustration of the Lagrangian approach, also showing examples of concentrations that develop within single fluid parcels, the reader is referred to the original paper of Scharfe et al. [28].

Parameterizations Used in the Model
In Equation (1), both µ and σ depend on environmental conditions and therefore vary with time t. The value of µ(t) results as the triple product of a constant maximum growth rate µ_0, a light dependent limitation factor F_light(t) parameterizing restricted growth efficiency under unfavorable light conditions, and another limitation factor F_Si(t) parameterizing detrimental effects of possibly limited availability of silica:

µ(t) = µ_0 F_light(t) F_Si(t). (4)

Both F_light(t) and F_Si(t) can assume values between 0 and 1. According to Beer's law, radiation intensity I(t) at the water surface implies a radiation intensity I(t)e^{−λ(t)z} at water depth z. Here, coefficient λ(t) is used to parameterize light attenuation due to so-called algal self-shading, assumed to be proportional to chlorophyll a concentration, with some constant factor λ_S. Our simple model does not explicitly resolve the water depth coordinate z, so that the time dependent light limitation factor F_light in Equation (4) is obtained by vertical averaging over water depth D. The resulting formula,

F_light(t) = (1/(λ(t)D)) ln[ (I(t)/K_light + sqrt(1 + (I(t)/K_light)^2)) / (I(t)e^{−λ(t)D}/K_light + sqrt(1 + (I(t)e^{−λ(t)D}/K_light)^2)) ], (5)

is based on the 'Smith formula' [41]. If light intensity equals the half-saturation constant K_light, algal growth rate will assume 71% of its maximum possible value. Global radiation on an hourly basis was taken from the GKSS Research Center located in close vicinity to the weir. During trajectory calculation, water depth D is treated as a constant. For each individual trajectory, however, the value of D is adjusted to the water discharge observed at station Neu Darchau (about 50 km upstream of Geesthacht) at the time when this trajectory reaches Geesthacht Weir. A polynomial formula well reproduces an empirical relationship between discharge and water depth, although slightly enhancing small values of D (see [28], their Figure 3).
A half-saturation constant K_Si is introduced to specify limitation factor F_Si in Equation (4):

F_Si(t) = C_Si / (C_Si + K_Si). (6)

Parameter K_Si is set to the fixed value of 0.1 mg Si/L. A silica concentration equal to this value will imply a 50% reduction in growth rate µ in Equation (4).
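The limitation factors and the growth-rate product of Equation (4) translate into a few lines (the light factor shown is the pointwise Smith form I/sqrt(I^2 + K^2), i.e., before vertical averaging over depth; function names are illustrative):

```python
import math

def f_si(c_si, k_si=0.1):
    """Michaelis-Menten silica limitation; equals 0.5 at c_si == k_si."""
    return c_si / (c_si + k_si)

def f_light(intensity, k_light):
    """Pointwise Smith light limitation; ~0.71 at intensity == k_light."""
    return intensity / math.sqrt(intensity ** 2 + k_light ** 2)

def growth_rate(mu0, intensity, k_light, c_si, k_si=0.1):
    """Equation (4): mu = mu0 * F_light * F_Si."""
    return mu0 * f_light(intensity, k_light) * f_si(c_si, k_si)
```

Both factors approach 1 under ample light and silica, so µ then approaches the maximum growth rate µ_0.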
Scharfe et al. [28] identified in each of the five years, 1997-2001, a short period when the model consistently failed to reproduce a fast increase in chlorophyll a concentrations after a late spring chlorophyll a minimum. This model deficiency could not be fixed by any parameter adjustments, which suggests the relevance of certain processes that are lacking in the present model framework (e.g., dominance of algae other than diatoms). The authors therefore decided to modify the model in such a way that in each year, assimilation of silica is suspended during a 1-2 week period (see [28], Figure 11 therein). In this study, we adopt this approach to prevent the large short-term discrepancies from dominating the overall model evaluation. In all time series shown in this paper, these special periods will be highlighted. This workaround was chosen because any extension of the model would be beyond the scope of the present study.
Loss rate σ (also including respiration) in Equation (1) is assumed temperature dependent only if temperature T exceeds 20 °C; otherwise it is set constant:

σ(t) = σ_0 for T(t) ≤ 20 °C, σ(t) = σ_0 a^{(T(t) − 20 °C)/1 °C} for T(t) > 20 °C. (8)

The interpretation of coefficient a being greater than one remains unspecific but could cover an increased zooplankton grazing rate, for instance. Evaluation of Equation (8) is based on 24 h means of water temperature at Geesthacht Weir. Only for the year 1997, data from station Schnackenburg (Elbe km 475, from the former ARGE ELBE; since 2010, part of the 'Flussgebietsgemeinschaft Elbe-FGG Elbe'; https://www.fgg-elbe.de, accessed on 29 September 2021) had to be used to fill existing data gaps.

Parameters Selected for Calibration
From the above equations, we selected six parameters for this calibration study: maximum growth rate µ_0, half-saturation constant K_light, light attenuation constant λ_S, algal silica content f_Si, loss rate σ_0 and coefficient a for loss rate temperature dependence above 20 °C. These parameters cover all essential aspects of the model.

Methods of Bayesian Analysis and Complexity Reduction
Here, we provide an introductory overview of the Bayesian methods used to calibrate and analyze the above-described model. A more detailed introduction to Bayesian concepts and techniques can, e.g., be found in [7,8]. In addition, Gaussian graphical models and principal component analysis are briefly outlined, as well as tools for pre- and postprocessing.

Bayesian Inference
Model calibration means estimating a set of M model parameters θ = (θ_1, θ_2, ..., θ_M) from observed data d. As mentioned in the introduction, both input and output of the Bayesian parameter estimation procedure are probability distributions in θ. Before considering d, we choose a prior distribution p(θ) that contains all previous knowledge independent of d (e.g., certain parameters are positive or limited by plausible physical boundaries). Even though procedural models usually yield deterministic output for a given parameter vector θ, observational data d are subject to measurement uncertainties or noise. The further away d is from the model prediction, the less likely it is that the parameters θ are correct. Quantitatively, this is encoded in the likelihood p(d | θ), the conditional probability to observe d for given θ. Since the given information is observed data d rather than abstract parameters θ, our goal is to estimate the posterior distribution p(θ | d), where the conditionality is flipped compared to the likelihood. For this purpose, one applies the chain rule of conditional probabilities, transformed to Bayes' rule

p(θ | d) = p(d | θ) p(θ) / p(d). (10)

The denominator p(d), the probability to observe d at all for a given model over all possible parameter sets, is called the marginal likelihood or evidence. While for parameter estimation in a single model p(d) cancels out as a normalization factor, it becomes important when comparing several models.
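For a single parameter, Bayes' rule can be made concrete by brute-force evaluation on a grid. In this toy sketch (a Gaussian likelihood for one observation and a flat prior on positive values, all numbers illustrative), the evidence appears only as a normalization constant:

```python
import numpy as np

theta = np.linspace(0.0, 5.0, 1001)          # parameter grid (theta >= 0)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                  # flat prior on the positive range
d, sigma = 2.0, 0.5                          # one observation and its error
likelihood = np.exp(-0.5 * ((d - theta) / sigma) ** 2)   # p(d | theta)
unnormalized = likelihood * prior
evidence = unnormalized.sum() * dtheta       # p(d): Riemann sum over theta
posterior = unnormalized / evidence          # Bayes' rule
```

The posterior integrates to one and peaks at the observed value; changing the prior reshapes the posterior without touching the likelihood.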

Markov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC) [7,[42][43][44] is a method to obtain unbiased samples from potentially high-dimensional probability distributions. These samples can then be used to compute quantities of interest such as parameter means and variances. Here, we are interested in samples distributed according to the posterior p(θ | d) for the model parameters θ given the data d via Equation (10). The key idea underlying the MCMC approach is an iterative exploration of a target probability distribution such that the distribution of the samples asymptotically converges to it. In the commonly used Metropolis-Hastings algorithm, proposal steps (i.e., changing the parameter values from θ to θ′) which increase the probability are always accepted. If, however, the probability is reduced, then such a parameter step is only accepted with acceptance probability

α = min(1, p(θ′ | d) / p(θ | d)) (11)

for symmetric proposal functions ([43], chapter 15.8). If a proposed step is not accepted, then the old parameter vector θ is added again to the chain of sampled parameter values; otherwise the system state is set to the new value θ′. Under some weak technical conditions (such as ergodicity, detailed balance), the distribution of this chain converges to the desired distribution p(θ | d) [45]. Now let θ denote the vector of M = 6 model parameters θ_k (see Table 1) we wish to calibrate based on existing observations. Let d_t be a time series with N_t observations of chlorophyll a depending on time t, and f_t(θ) the corresponding output of one chlorophyll a simulation. Note that N_t is not the number of samples; rather, each sample is a time series of length N_t that is compared to the observed time series. Model calibration is based on the cost function

J(θ) = 1/(2σ_chl^2) Σ_t (f_t(θ) − d_t)^2 (12)

with an assumed standard deviation σ_chl of the observational error, assumed to be a Gaussian random variable. Instead of just minimizing J, we use it as the negative log-likelihood in a Gaussian error model for probabilistic inference.
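A minimal random-walk Metropolis sampler along these lines might look as follows (a sketch, not the implementation used in this study; the linear toy model f_t(θ) = θ·t stands in for the expensive simulation, and the reflection of negative proposals anticipates the positivity constraint discussed below):

```python
import numpy as np

def metropolis(log_post, theta0, n_steps, step=0.1, seed=0):
    """Random-walk Metropolis sampling of log_post, starting at theta0."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = [theta.copy()]
    for _ in range(n_steps):
        # reflect negative proposals into the positive range (absolute value)
        prop = np.abs(theta + step * rng.standard_normal(theta.size))
        lp_prop = log_post(prop)
        # Metropolis criterion: accept with probability min(1, p'/p)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta.copy())   # repeat old state on rejection
    return np.array(chain)

def log_post(theta, t=np.arange(5.0), sigma_chl=1.0):
    """Negative cost function J as log-likelihood (Gaussian error model);
    toy forward model f_t(theta) = theta * t, 'observations' d_t = 2 t."""
    d = 2.0 * t
    return -0.5 * np.sum((theta[0] * t - d) ** 2) / sigma_chl ** 2
```

After a burn-in phase, the chain samples concentrate around the parameter value that minimizes J, here θ = 2.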
The MCMC search algorithm should not explore unrealistically large values for parameters or parameter combinations insufficiently controlled by observations. To realize this while not being overly restrictive, for each parameter θ_k, a heavy-tailed Cauchy distribution is introduced as a prior:

p(θ_k) = (2/(π b_k)) · 1/(1 + (θ_k/b_k)^2) for θ_k ≥ 0. (14)

All parameters θ_k considered in this study are constrained to positive values. Thus, we truncate the prior to vanish for values θ_k < 0; scaling ensures that ∫_0^∞ p(θ_k) dθ_k = 1. Coefficients b_k in Equation (14) control the width of the probability distribution for each parameter θ_k. This width is fixed by specifying values θ̃_k that should be quite unlikely to be exceeded (see Table 1). Values for b_k are chosen such that the probability P to find θ_k < θ̃_k is close to one. This probability is given by the respective value of the cumulative distribution function

P = (2/π) arctan(θ̃_k / b_k). (15)

Table 1. Parameter values θ̃_k assumed to be exceeded with probability 10 percent (1 − P, see Equation (15)).
With Equations (14) and (15), coefficients b_k can then be calculated as

b_k = θ̃_k / tan(πP/2). (16)

In the following, we assume a value of P = 0.9.
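Assuming the prior of Equation (14) is the positive half-Cauchy density with CDF (2/π) arctan(θ/b_k), the width computation can be checked numerically (function names illustrative):

```python
import math

def cauchy_width(theta_tilde, p=0.9):
    """Width b_k such that P(theta_k < theta_tilde) = p
    under the positive half-Cauchy prior."""
    return theta_tilde / math.tan(math.pi * p / 2.0)

def half_cauchy_cdf(x, b):
    """CDF of the half-Cauchy prior: (2/pi) * arctan(x/b)."""
    return (2.0 / math.pi) * math.atan(x / b)
```

Inverting the CDF at p recovers the quantile, so `half_cauchy_cdf(theta_tilde, cauchy_width(theta_tilde, p))` returns p; a larger quantile θ̃_k simply scales up the width b_k.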
In the MCMC algorithm, positivity of parameters θ_k is ensured by flipping negative proposed parameter test values into the positive range by taking absolute values. This approach reproduces the truncated prior and maintains the detailed balance necessary for the convergence of MCMC [8]. Freni and Mannina [46] studied the implications of the choice of prior distributions for uncertainty analysis. Here, we sought to keep prior constraints as weak as possible. A more rigorous alternative would be to use the maximum entropy principle to minimize the amount of prior information. This is not done here for simplicity, since we expect the number of samples to be sufficiently large to make the influence of the exact form of the prior on the posterior negligible for practical purposes.

Graphical Modeling
Graphical models can be used to highlight key interrelationships between parameters, discarding dependencies of minor importance. Data are represented in terms of nodes (or vertices) for each variable and a number of edges connecting them. Edges in undirected Gaussian graphical models (e.g., [47,48]) represent partial correlations (i.e., correlations between pairs of variables when all other variables are held constant). By contrast, directed edges in Bayesian networks (e.g., [9,10]) represent conditional probability distributions approximated by tables for response variables (child nodes) given the values of all explanatory variables (parent nodes) from which edges are pointing into them.

Gaussian Graphical Models (GGMs)
Undirected edges in a GGM represent non-zero pairwise partial correlations conditioned by all the rest of the variables. Covariance selection developed by Dempster [49] provides a general framework to assess whether or not the set of constraints displayed by a GGM contradicts observations. See Whittaker [48] for a comprehensive presentation of the concept. More recent developments of variational methods are described in Jordan [50]. Callies [51] and Callies and Scharfe [52] applied graphical modeling for analyzing interaction structures from water quality observations; Taeb et al. [53] used it to characterize dependencies among water reservoirs. The specific aspect in this study is that we do not apply the method to observed variables but rather to a set of parameters in a process-oriented simulation model that were calibrated in order to adjust model output (time series of one single variable) to its observed counterpart.
The basic idea of graphical Gaussian modeling is to modify a sample correlation matrix S within the limits of observational uncertainty in such a way that small partial correlations are replaced by zero values. The partial correlation matrix S_p is closely related to the precision matrix S^{−1}, and zero valued elements of the two matrices coincide, see [48]. Among all correlation matrices that satisfy the constraints of a given GGM, some matrix V will fit the data best. The difference between the log-likelihoods of sample correlation matrix S and V provides an entropy type measure of the amount of information in the data against the interaction structure hypothesized by the graph G. The deviance dev_S(G) is defined as twice this difference of log-likelihoods or twice the sample size N times the Kullback-Leibler information divergence between two jointly normal distributions, assuming that their means are equal [54]. Specific properties of V imply that the deviance assumes the following simplified form [47]:

dev_S(G) = N ln(det V / det S). (17)

If data are normally distributed, the deviance has an asymptotic χ^2 distribution with the degrees of freedom given by the number of edges missing in the graph [48].
With Equation (17), it is straightforward to evaluate the effects of either excluding another edge from the graph (edge exclusion deviance, EED) or re-establishing an edge previously removed (edge inclusion deviance, EID). In this study, we will not rely on statistical significance of graph simplification. Due to the very large number of samples, i.e., the number of underlying successful Monte Carlo simulations (N = 10^6), formal statistical significance will always be satisfied. Strict statistical testing for model parameter interactions would also be inconsistent with the fact that even very detailed process-oriented models necessarily comprise substantial simplifications and parameterizations of much more detailed natural processes. In the light of this unavoidable incorrectness of any process-oriented model, we try to characterize the model's key interaction structure by stopping graph simplification at a point when the smallest EED in the simplified graph G is clearly larger than the largest EID among the EIDs of all edges discarded previously. It must be noted that this graph simplification is a manual and to some extent subjective procedure, complicated by the fact that removal or establishment of an edge generally can affect the relevance of all other edges.
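For nearly multinormal MCMC samples, partial correlations and the deviance can be computed directly from correlation matrices. The sketch below takes the fitted matrix V as given, whereas in practice it results from fitting the GGM (e.g., by iterative proportional scaling); the simplified deviance form N ln(det V / det S) is assumed:

```python
import numpy as np

def partial_correlations(corr):
    """Partial correlation matrix from a correlation matrix:
    element (i, j) is the correlation of variables i and j with all
    remaining variables held constant (scaled negative precision)."""
    prec = np.linalg.inv(corr)
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

def deviance(sample_corr, fitted_corr, n):
    """Deviance of a fitted GGM correlation matrix V against the
    sample correlation matrix S for sample size n."""
    _, logdet_v = np.linalg.slogdet(fitted_corr)
    _, logdet_s = np.linalg.slogdet(sample_corr)
    return n * (logdet_v - logdet_s)
```

Zeroing a small partial correlation (removing an edge) can only increase the deviance; comparing the increments corresponds to the EED/EID bookkeeping described above.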

Bayesian Networks (BNs)
Contrary to Gaussian graphical models, Bayesian networks are directed acyclic graphs (DAGs) [9,10]. Nodes of the BN represent random variables with usually discrete states, often (as in our case) obtained by binning a continuous variable into a certain number of categories. For each node, a table specifies the conditional probabilities for its states, given all possible combinations of states of the node's ancestors according to the DAG. The joint distribution for N variables X_1, ..., X_N is then given as a product of conditional probabilities,

P(X_1, ..., X_N) = Π_{i=1}^N P(X_i | Pa(X_i)),

where Pa(X_i) denotes the set of all parent nodes of node X_i. For root nodes without parents (this applies to at least one node in a DAG), the conditional probability P(X_i | Pa(X_i)) is replaced by the simple prior distribution P(X_i). Structuring a BN in terms of parent and child nodes can often be related to the concept of causality [11,12]. However, for nodes representing parameters of a process model, such an interpretation is not applicable.
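The factorization into conditional probability tables can be sketched for a hypothetical three-node network A → B ← C with binary states (all tables here are invented for illustration only):

```python
# Root nodes carry plain priors; the child carries a conditional table.
P_A = {0: 0.3, 1: 0.7}                      # prior P(A)
P_C = {0: 0.5, 1: 0.5}                      # prior P(C)
P_B_given_AC = {                            # conditional table P(B | A, C)
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.4, 1: 0.6},
    (1, 1): {0: 0.1, 1: 0.9},
}

def joint(a, b, c):
    """Joint probability as the product of conditionals along the DAG."""
    return P_A[a] * P_C[c] * P_B_given_AC[(a, c)][b]

# The product over the DAG defines a proper joint distribution:
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

With M parameters binned into many categories, the table for a node with M − 1 parents grows exponentially, which is the dimensionality limitation discussed below.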
Interactive BN software provides a convenient tool to explore parameter dependences empirically. However, depending on how many state categories are used for each parameter, the maximum number of M − 1 parent nodes (with M denoting the number of parameters) can be a serious limitation for the application of BN software. That is why, for saturated graphs (with all edges being maintained), specification of conditional marginal distributions by sampling directly from the data (see Section 4.8) may be preferable to the specification of conditional probability tables. Omission of edges from the graph can greatly reduce the problem of dimensionality. For nearly multinormal distributions, graphical Gaussian modeling may guide such a simplification of a BN.

Gaussian Process Regression and Bayesian Global Optimization
Gaussian process regression [55][56][57] is a commonly used tool to construct flexible non-parametric surrogates. Based on observed outputs f(x_k) at training points x_k and a covariance function k(x, x′), the GP regressor predicts a Gaussian posterior distribution at any point x*. For a single prediction f(x*), the expected value and variance of this distribution are given by

f̄(x*) = m(x*) + K_* (K + σ_n² I)^{-1} (f − m),
Var[f(x*)] = K_** − K_* (K + σ_n² I)^{-1} K_*^T,

where m(x*) is the mean model, the covariance matrix K contains entries K_ij = k(x_i, x_j) based on the training set, K_*i = k(x*, x_i) are the entries of a row vector K_*, and K_** = k(x*, x*) is a scalar. The identity matrix I is added, scaled by the noise variance σ_n², which regularizes the problem and is usually estimated in an optimization loop together with the other kernel hyperparameters.
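These predictive formulas can be sketched in a few lines of numpy (zero mean model and a squared-exponential kernel with illustrative hyperparameters, not the paper's settings):

```python
import numpy as np

def kern(a, b, length=1.0):
    """Squared-exponential covariance function k(x, x')."""
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)

def gp_predict(x_train, y_train, x_star, sigma_n=0.1):
    """GP posterior mean and variance at a single point x_star."""
    K = kern(x_train[:, None], x_train[None, :])      # K_ij = k(x_i, x_j)
    K_star = kern(x_star, x_train)                    # row vector K_*
    K_ss = kern(x_star, x_star)                       # scalar K_**
    A = K + sigma_n ** 2 * np.eye(len(x_train))       # K + sigma_n^2 I
    mean = K_star @ np.linalg.solve(A, y_train)       # posterior mean
    var = K_ss - K_star @ np.linalg.solve(A, K_star)  # posterior variance
    return mean, var

# Toy training data: noisy-free samples of a smooth function
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.sin(x_train)
mean, var = gp_predict(x_train, y_train, 1.5)
```

In practice, one solves with a Cholesky factorization of K + σ_n² I instead of repeated `solve` calls, and optimizes the hyperparameters by maximum likelihood.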
Such a surrogate with uncertainty information can be used for Bayesian global optimization [58][59][60] of the log-posterior as a cost function. Here, we apply this method to reach the vicinity of the posterior's mode before sampling. As an acquisition function, we use the expected improvement (see, e.g., [59]) at a newly observed location x* given existing training data d,

EI(x*) = (f̂ − μ(x*)) Φ(z) + σ(x*) φ(z),  z = (f̂ − μ(x*))/σ(x*),   (21)

where f̂ is the optimum value of f(x) observed so far, μ(x*) and σ(x*) are the predictive mean and standard deviation of the surrogate, and Φ and φ denote the standard normal cumulative distribution function and density, respectively. Due to the non-linear transformation from the functional blackbox output to the value of the cost function, it is more convenient to realize Bayesian optimization with a direct GP surrogate of the cost function that is constructed in addition to the surrogate for the functional output for the KL expansion coefficients described below.
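The expected-improvement acquisition (here in its standard closed form for minimization; the numerical values are invented) can be sketched with the standard library alone:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI at a candidate point for minimization, given the surrogate's
    predictive mean mu, standard deviation sigma, and the best (lowest)
    cost f_best observed so far."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal density
    return (f_best - mu) * Phi + sigma * phi

# A candidate predicted well below the incumbent gets a large EI; one
# predicted far above it, with little uncertainty, gets EI near zero.
ei_good = expected_improvement(mu=0.5, sigma=0.2, f_best=1.0)
ei_bad = expected_improvement(mu=3.0, sigma=0.1, f_best=1.0)
```

The next cost-function evaluation is then placed at the maximizer of EI over the input domain.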

Linear Dimension Reduction via Principal Components
Formally, the blackbox output for given input x can be a function f(t) ∈ H in an infinite-dimensional Hilbert space (though sampled at a finite number of points in practice). Linear dimension reduction in such a space means finding the optimum set of basis functions φ_k(t) that spans the output space f(t; x) for any input x given to the blackbox. The reduced model of order r is then given by

f(t; x) ≈ Σ_{k=1}^r z_k(x) φ_k(t).   (22)

This approach is known as the Karhunen-Loève (KL) expansion [61], in case f(t; x) are interpreted as realizations of a random process, or as functional principal component analysis (FPCA) [62]. For our application, this distinction does not matter. The KL expansion boils down to solving a regression problem in the non-orthogonal basis of N observed realizations to represent new observations. Then an eigenvalue problem is solved to invert the N × N collocation matrix A with entries

A_ij = ⟨f(·; x_i), f(·; x_j)⟩.

Here, the inner product in the Hilbert space and its approximation for a finite set of support points is given by

⟨f, g⟩ = ∫ f(t) g(t) dt ≈ Σ_k f(t_k) g(t_k) Δt_k.

If N_t ≫ N (many support points, few samples), solving the eigenvalue problem of the collocation matrix A is more efficient than the dual one of the covariance matrix C (see [56] for their equivalence via the singular value decomposition of Y_ij = f(t_i , x_j)). The question at which r to truncate the eigenspectrum in (22) depends on the desired accuracy in the output, which is briefly analyzed in the following paragraph.
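The construction of the reduced basis from the eigenvectors of the collocation matrix can be sketched as follows (toy functional outputs built from two smooth modes; uniform quadrature weight Δt for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_t = 8, 200                        # few samples, many support points
t = np.linspace(0.0, 1.0, N_t)
dt = t[1] - t[0]
# toy outputs f(t; x_j): random mixtures of two smooth basis functions
Y = np.array([rng.normal() * np.sin(np.pi * t)
              + rng.normal() * np.cos(np.pi * t) for _ in range(N)])

A = Y @ Y.T * dt                       # collocation matrix A_ij = <f_i, f_j>
lam, W = np.linalg.eigh(A)             # eigenvalue problem of A
order = np.argsort(lam)[::-1]          # sort descending
lam, W = lam[order], W[:, order]

r = 2                                  # data was generated from two modes
# orthonormal basis functions phi_k(t) from the eigenvectors of A
Phi = (W[:, :r].T @ Y) / np.sqrt(np.maximum(lam[:r], 1e-30))[:, None]
z = Y @ Phi.T * dt                     # expansion weights z_k per sample
Y_rec = z @ Phi                        # reduced model of order r
err = np.linalg.norm(Y - Y_rec) / np.linalg.norm(Y)
```

Since the toy data has rank two, the order-2 reconstruction is exact up to rounding; for real blackbox outputs, `err` is controlled by the truncated eigenvalues as discussed next.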
Here, we justify why we can assume an L² truncation error of the order of the ratio λ_r/λ_1 between the smallest eigenvalue considered in the approximation and the largest one. The truncated SVD can be shown to be the best approximation A^(r) of lower rank r to a matrix A in terms of the Frobenius norm ||A||_F (see, e.g., [63]). The latter is simply computed from the L² norm of the singular values,

||A||_F² = Σ_k σ_k²,

where σ_k² = λ_k in case of the real eigenvalues λ_k of a positive semi-definite matrix, as for the covariance or collocation matrix. The truncation error is given by

||A − A^(r)||_F² = Σ_{k=r+1}^N σ_k².

The error estimate for the KL expansion uses this convenient property together with the fact that the Frobenius norm is compatible with the usual L² norm |y| of vectors y, i.e.,

|Ay| ≤ ||A||_F |y|.   (27)

Representing y via the first r eigenvalues of the collocation matrix yields a relative squared reconstruction error of the order of

Σ_{k=r+1}^N λ_k / Σ_{k=1}^N λ_k ≤ (N − r) λ_{r+1}/λ_1.

The last estimate is relatively crude if N ≫ r and the spectrum decays fast with the index variable k. If one assumes a decay rate α with λ_k = λ_1 k^{−α}, one obtains a relative squared error of the order of (λ_r/λ_1) ζ(α), where ζ is the Riemann zeta function. This function diverges for a spectral decay of order α = 1 and reaches its asymptotic value ζ(∞) = 1 relatively quickly for α ≥ 2 (e.g., ζ(3) ≈ 1.2). The spectral decay rate α can be fitted in a log-log plot of λ_k over the index k and takes values between α = 3 and 5 in our use case. The underlying assumptions are violated if the spectrum stagnates at a large number of nearly constant eigenvalues for higher indices k.
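The two identities carrying this estimate, that the singular values of the data matrix Y satisfy σ_k² = λ_k with λ_k the eigenvalues of the collocation matrix YᵀY, and that the best rank-r approximation has squared Frobenius error equal to the discarded tail of the spectrum (Eckart-Young), can be verified numerically on a toy matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(50, 6))                 # N_t = 50 support points, N = 6
lam = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]   # collocation eigenvalues

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
r = 3
Y_r = U[:, :r] * s[:r] @ Vt[:r]              # best rank-r approximation
err2 = np.linalg.norm(Y - Y_r, 'fro') ** 2   # squared truncation error
tail = lam[r:].sum()                         # sum_{k > r} lambda_k
```

Here `err2` and `tail` agree to rounding, so the relative squared reconstruction error is indeed governed by the truncated eigenvalues.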

Delayed Acceptance MCMC
Delayed acceptance MCMC builds on a fast surrogate p̃(θ|d) for the posterior p(θ|d) to reject unlikely proposals early [13,14]. Following the usual Metropolis-Hastings algorithm, the probability to accept a new proposal θ* in this first stage in the n-th step of the Markov chain is, as usual,

P̃_acc^n = min( 1, [p̃(θ*|d) g(θ_n|θ*)] / [p̃(θ_n|d) g(θ*|θ_n)] ),

where g is a transition probability that has been suitably tuned during warmup. The true posterior p(θ|d) is only evaluated if the proposal 'survives' this first stage and enters the final acceptance probability

P_acc^n = min( 1, [p(θ*|d) p̃(θ_n|d)] / [p(θ_n|d) p̃(θ*|d)] ).

Actual computation is, as usual, performed in logarithmic space with the cost function ℓ(θ|d) = −log p(θ|d). If this function is fixed, it is most convenient to just directly build a surrogate ℓ̃(θ|d) for the scalar log-posterior (cost) function y = ℓ(θ|d) depending on x = θ, including the corresponding prior. Below, we describe an alternative approach that models the full functional output instead.
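A minimal sketch of the two-stage scheme on a 1D toy posterior (symmetric proposal, so g cancels; the "surrogate" is just a cheap, deliberately biased approximation of the true log-posterior):

```python
import math
import random

def log_post(x):
    """Expensive true log-posterior (standard normal, toy example)."""
    return -0.5 * x * x

def log_surr(x):
    """Cheap, slightly biased surrogate log-posterior."""
    return -0.5 * (x - 0.05) ** 2

random.seed(2)
x, chain, full_evals = 0.0, [], 0
for _ in range(20000):
    x_star = x + random.gauss(0.0, 1.0)
    # stage 1: screen the proposal with the surrogate only
    if random.random() < math.exp(min(0.0, log_surr(x_star) - log_surr(x))):
        # stage 2: correct with the true posterior; the surrogate terms
        # enter with opposite sign so the chain still targets log_post
        full_evals += 1
        a2 = (log_post(x_star) - log_post(x)) - (log_surr(x_star) - log_surr(x))
        if random.random() < math.exp(min(0.0, a2)):
            x = x_star
    chain.append(x)

mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
```

Proposals rejected in stage 1 never trigger an expensive evaluation (`full_evals` stays well below the chain length), yet the stationary distribution remains the true posterior.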

Bayesian Hierarchical Models and Fractional Norms
One application of modeling the full functional output instead of only the cost function is the existence of additional distribution parameters ζ in the likelihood besides the original model inputs θ. Such dependencies appear within Bayesian hierarchical models [64], where ζ are again subject to a certain (prior) distribution with possibly further levels of hyperparameters. There are essentially two ways to construct a surrogate with support for additional parameters ζ: building a surrogate for the cost function that adds ζ as independent variables, or constructing a surrogate with functional output for f_k(θ) and keeping the dependencies on ζ exact. Here, we focus on the latter and apply this surrogate within delayed acceptance MCMC with both θ and ζ as tunable parameters.
As an example, we use a more general noise model than the usual Gaussian likelihood that builds on arbitrary ζ norms [65][66][67], with real-valued ζ not fixed while traversing the Markov chain. We allow members of the exponential power family for observational noise and specify only its scale but keep ζ as a free parameter. Namely, we model the likelihood for observing d in the output as

p(d|θ, ζ) = Π_i [2σ Γ(1 + ζ^{-1})]^{-1} exp( −(|d_i − f_i(θ)|/σ)^ζ ),   (34)

with the normalized ζ norm to the power of ζ,

Σ_i (|d_i − f_i(θ)|/σ)^ζ,   (35)

as the loss function between observed data d_i and blackbox model f_i(θ). Choosing the usual L² norm leads to a Gaussian likelihood for the noise model, whereas using the L¹ norm means Laplacian noise. To maintain the relative scale when varying ζ, it is important to add the term log Γ(1 + ζ^{-1}) from (34) to the negative log-likelihood. In the following use cases, we are going to compare the cases of fixed and variable ζ.
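The resulting negative log-likelihood, including the Γ normalization term that keeps different values of ζ comparable, can be sketched with the standard library (toy residuals and scale; the function signature is ours):

```python
import math

def neg_log_likelihood(residuals, sigma, zeta):
    """Negative log-likelihood of the zeta-norm (exponential power) noise
    model: per point, (|d_i - f_i| / sigma)^zeta plus the normalization
    terms log(2 sigma) and log Gamma(1 + 1/zeta)."""
    n = len(residuals)
    loss = sum((abs(r) / sigma) ** zeta for r in residuals)
    norm = n * (math.log(2.0 * sigma) + math.lgamma(1.0 + 1.0 / zeta))
    return loss + norm

res = [0.1, -0.2, 0.05]                          # toy residuals d_i - f_i
nll_gauss = neg_log_likelihood(res, sigma=0.3, zeta=2.0)    # L2-type noise
nll_laplace = neg_log_likelihood(res, sigma=0.3, zeta=1.0)  # Laplacian noise
```

Dropping the `lgamma` term would bias a Markov chain over ζ, since moves in ζ would then change the likelihood's overall scale, not only its shape.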

Pre- and Postprocessing
For analyzing measured data and the posterior distribution of model parameters, two techniques have been implemented in the interactive proFit [68] toolkit for probabilistic reduced order model fitting, using GPflow [69,70] and GPy [71] backends and visualization via Plotly/Dash. On the one hand, this concerns the estimation of noise in measured time-series data in the absence of a parameterized model. On the other hand, the analysis of the posterior distribution of the calibrated parameters is facilitated by on-the-fly visualization of conditional marginal distributions.
In order to introduce a scale for the tolerated deviation in the MCMC calibration of model parameters, the random noise in the measured time-series data d has to be known or, as in the present case, estimated. For this purpose, a fairly general Gaussian process (GP) regression [57] with a squared-exponential kernel is applied to the data. The characteristic timescale and random noise are left as free parameters and optimized to their maximum-likelihood values based on the data. This yields a decomposition into a kernel-smoothed representation of the original data plus a Gaussian noise term σ_n. In order for this estimate to be valid, the random error must be sufficiently close to a normal distribution and the characteristic timescale should not vary over time.
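The noise estimation step can be sketched by maximizing the GP log marginal likelihood over the noise level (here over a simple grid, with a fixed timescale; a full implementation, as in the GPflow/GPy backends, optimizes all hyperparameters jointly, and all data below are synthetic):

```python
import numpy as np

def log_marginal_likelihood(t, d, sigma_n, length=2.0):
    """GP log marginal likelihood for a squared-exponential kernel with
    fixed timescale `length` and noise standard deviation `sigma_n`."""
    K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / length ** 2)
    A = K + sigma_n ** 2 * np.eye(len(t))
    _, logdet = np.linalg.slogdet(A)
    alpha = np.linalg.solve(A, d)
    return -0.5 * (d @ alpha + logdet + len(t) * np.log(2.0 * np.pi))

# synthetic "time series": smooth signal plus Gaussian noise of sd 0.2
rng = np.random.default_rng(3)
t = np.linspace(0.0, 20.0, 120)
d = np.sin(0.5 * t) + rng.normal(0.0, 0.2, t.size)

grid = np.linspace(0.05, 0.6, 56)
lml = np.array([log_marginal_likelihood(t, d, s) for s in grid])
sigma_hat = grid[int(np.argmax(lml))]        # maximum-likelihood noise level
```

The recovered `sigma_hat` lies near the true noise level, illustrating the decomposition into a kernel-smoothed signal plus the Gaussian term σ_n.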
An alternative way to estimate noise has been evaluated by keeping observational noise as a free parameter and inferring its value during MCMC sampling. This path has been abandoned, as it leads to an overestimation of noise in the present case. The reason is the following. In contrast to the GP regression, no combination of model parameters can eliminate all systematic deviations from the observed data. Inference with the present Gaussian likelihood model incorrectly identifies these deviations with noise and expands the confidence bands to enforce a match between model and data. In contrast, using the estimated noise from an empirical 'perfect' fit via a GP infers noise and confidence bands mainly from the data alone, without implicitly assuming correctness of the model.
For M parameters, the marginal distribution of the posterior for each model parameter θ_k, given measured data d, is given by

p(θ_k|d) = ∫ p(θ|d) dθ_1 … dθ_{k−1} dθ_{k+1} … dθ_M.

Here, these marginal distributions p(θ_k|d) are approximately computed by taking sums of MCMC data inside histogram intervals. Similar to software for BNs, proFit allows the user to interactively restrict the values of certain parameters to intervals (θ_l^A, θ_l^B) of a certain histogram bar and observe the influence on the conditional marginal distributions of each parameter. This enables fast exploration in parameter spaces that are too high-dimensional to be visualized directly.
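The underlying operation, recomputing marginal histograms after restricting one parameter to a bin, can be sketched on synthetic correlated samples standing in for the MCMC posterior:

```python
import numpy as np

# synthetic "posterior" samples with two correlated parameters
rng = np.random.default_rng(4)
n = 100000
theta1 = rng.normal(0.0, 1.0, n)
theta2 = 0.8 * theta1 + rng.normal(0.0, 0.6, n)     # depends on theta1
samples = np.column_stack([theta1, theta2])

def marginal(samples, k, bins=10):
    """Approximate marginal p(theta_k | d) from sample histograms."""
    hist, edges = np.histogram(samples[:, k], bins=bins, density=True)
    return hist, edges

# 'evidence': restrict theta1 to a high-value interval, as with a histogram
# bar selected interactively, and recompute the marginal of theta2
mask = (samples[:, 0] > 1.0) & (samples[:, 0] < 2.0)
m_all, _ = marginal(samples, 1)
m_cond, _ = marginal(samples[mask], 1)
mean_all = samples[:, 1].mean()
mean_cond = samples[mask, 1].mean()
```

Conditioning on large θ₁ visibly shifts the marginal of θ₂, which is exactly the kind of dependence the interactive interface exposes.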

MCMC Sampling
Based on the cost function defined in Equation (13), MCMC was used to explore the posterior joint distribution of parameters θ. An observational error (standard deviation) of σ_chl = 5 µg chl/L has been determined via the maximum-likelihood estimate of the noise term from a Gaussian process regression as described in Section 4.4. A total of 1000 parallel chains were calculated with 1000 iterations each, starting at randomized locations in parameter space. For each chain, 5 warm-up trajectories with 500 iterations were calculated. To reduce the computational burden, chlorophyll a concentrations at Geesthacht Weir were simulated for just every third day. Acceptance rates close to 35% for all parameters indicated a reasonable choice of the individual proposal step size of the MCMC algorithm [7,72]. The distance of the Gelman-Rubin statistic R [73] from the asymptotic value of 1 was found to be less than 10^-3 for the computed chains, thus yielding no indication of insufficient sampling.
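The convergence check can be sketched with the classic form of the Gelman-Rubin statistic for m parallel chains of length n (synthetic chains below; split-chain variants of R exist but are not shown):

```python
import numpy as np

def gelman_rubin(chains):
    """Classic Gelman-Rubin R for an (m, n) array of m chains of length n,
    comparing between-chain variance B with within-chain variance W."""
    chains = np.asarray(chains)
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = n * means.var(ddof=1)                 # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(5)
converged = rng.normal(0.0, 1.0, size=(8, 5000))   # all chains on target
stuck = converged + np.arange(8)[:, None]          # chains offset from
R_good = gelman_rubin(converged)                   # each other
R_bad = gelman_rubin(stuck)
```

Well-mixed chains give R very close to 1 (as reported above, within 10^-3), while chains exploring different regions give R well above 1.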
Simulations based on 10^6 feasible parameter combinations obtained from MCMC are summarized in Figure 2, where the spread in model outputs arising from parameter uncertainties is represented by means of box plots. A total of 50% of simulations are close to each other (magenta colored boxes). However, there is a surprisingly large spread between the extremes for each day. One may ask how this goes together with the positive evaluation as a reasonable simulation.
First, larger deviations may occur in simulations for specific years because model performance was evaluated for the five-year period in total. Second, the relevance of different parameters for model output depends on environmental conditions (e.g., temperature, availability of silica) at the time of interest, so that particularly large values of some parameters may lead to large anomalies at certain times while being of minor importance during other, probably longer, periods. To illustrate this effect, Figure 3a combines the optimum simulation for 2001 with three other simulations that produce the most extreme chlorophyll a values for 11 May, 10 July and 31 July, respectively. Table 2 compares the parameter values these three simulations are based on with the optimum (i.e., minimum cost function) simulation as a reference.
In Figure 3a, the green curve deviates from the others in that it shows particularly large peak values in May and June. This might be explained by a large maximum growth rate µ_0 in combination with a large (compared to the reference) half-saturation constant K_light (Table 2). The latter assumption partly compensates for the large µ_0 but at the same time makes the growth rate µ more sensitive to variable light conditions (see Equation (6)). The brown curve to some extent follows the opposite approach, which results in a rather smooth simulation. At the end of July and August, it is then the red simulation that much overestimates two minima of chlorophyll a. This simulation is based on a very large parameter a (Table 2, where cost denotes the first term in Equation (13), evaluated for 1997-2001, and prior the second term). Although the algal silica content f_Si underlying the red curve in Figure 3a is the highest among the three example simulations (Table 2), the very low chlorophyll a concentrations imply low consumption of silica and therefore coincide with peak concentrations of this nutrient (Figure 3b). The upper bounds of the silica ranges indicated in Figure 3b are more or less identical with the concentrations prescribed at Schmilka where trajectories start (in the first half of June, the imposed lower bound of 2 mg Si/L can be noticed) and arise from zero consumption of silica. It is interesting to see that the silica simulations generally look quite reasonable, although they were not used for model calibration. In fact, it turned out that their inclusion did not much affect the overall outcome of model calibration (not shown). Note that the simulated unrealistic increase in silica in June is due to silica consumption being switched off during the short period highlighted in yellow.

Principal Component Analysis of Feasible Parameter Combinations
Posterior parameter dependences greatly influence model behavior, but their effects are not easily recognized in higher dimensions. Conventional principal component analysis of the parameter correlation matrix may be applied. Scaling is needed to remove different physical dimensions. If the six selected parameters in our study were strictly independent of each other, each of them (and also each principal component) would contribute 16.7% of the overall parameter variability. It turns out, however, that already the first two principal components (PCs) describe 80% of the overall parameter variation (see Figure 4). The spectrum of eigenvalues λ_k can be used to estimate the statistical degrees of freedom (dof) in parameter space [74]:

dof = (Σ_k λ_k)² / Σ_k λ_k².   (38)

Here, we achieve dof = 2.7 for logarithmized (except f_Si) data in M = 6 dimensions. The logarithm was applied as five marginal distributions showed tails toward large values (see Section 5.3, Figure 5a). Empirical orthogonal functions (EOFs) describe the structure of parameter covariation underlying each mode of variability [75]. According to Figure 4, the two leading EOFs do not suggest a grouping of parameters or a separation between different processes. Only the third EOF (explaining 13.8% of variance) clearly focuses on an interplay between parameters λ_S and K_light (see Equations (5) and (6)).
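The effective-dof estimator, which we read as dof = (Σ_k λ_k)² / Σ_k λ_k² over the eigenvalues of the parameter correlation matrix, can be sketched on toy matrices (not the paper's posterior): independent parameters give dof = M, while strong correlation collapses dof toward 1.

```python
import numpy as np

def effective_dof(corr):
    """Effective statistical degrees of freedom from the eigenvalue
    spectrum of a correlation matrix."""
    lam = np.linalg.eigvalsh(corr)
    return lam.sum() ** 2 / (lam ** 2).sum()

M = 6
dof_indep = effective_dof(np.eye(M))          # independent parameters: M

rho = 0.9                                     # strongly correlated toy case
strong = np.full((M, M), rho) + (1.0 - rho) * np.eye(M)
dof_corr = effective_dof(strong)              # close to a single mode
```

A value such as dof = 2.7 for M = 6 thus indicates that roughly three independent directions suffice to describe the feasible parameter space.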

Exploring Conditional Marginal Distributions
Discrete marginal parameter distributions, with continuous values of each parameter being binned into 10 classes, are shown in Figure 5a. Different colors were used to better distinguish between parameters related to different processes. Most distributions have tails toward large values. The only parameter showing a symmetric distribution is silica content f Si .
We now study implications of assigning specific values to parameter subsets. Confining the value of one parameter may narrow the feasible ranges of other parameters and possibly shift the peaks of their marginal distributions. Figure 5b compares consequences of assigning an either low or high value to maximum algal growth rate µ 0 . The respective choice impacts other parameters to a variable extent. The parameter probably most affected is K light , while impact on parameter a, for instance, remains small. Figure 5c illustrates the effects of additionally assuming an either low or high algal silica content f Si . It turns out that this very much affects parameters σ 0 and a, while much smaller effects occur for λ S , for instance. This is a first indication that silica content is in fact a key variable in the overall model concept.
The widths of the marginal parameter distributions depend on the scaling of model-observation discrepancies in the likelihood function Equation (13), achieved by specifying the standard deviation σ_chl. Figure 6a shows the results of choosing σ_chl = 1 µg chl/L instead of 5 µg chl/L. The assumed high accuracy of observations prevents divergence of the MCMC process and allows us to abandon the provision of prior information on parameter distributions. The resulting marginal parameter distributions are very concentrated and located near the maxima of those distributions that were derived assuming larger observational uncertainty in combination with an estimated prior distribution (Figure 5a).
From Equation (38), the effective statistical dimension of feasible parameter space was estimated to be less than three (dof = 2.7). This means that already fixing the values of 2-3 parameters greatly constrains the joint distribution of the six parameters under study. To substantiate this expectation, we constrain Figure 5a by entering again evidence for the two parameters µ 0 and f Si , now selecting those values that are most likely according to Figure 6a. As a result (Figure 6b

Bayesian Network Assuming Simplified Parameter Interrelationships
So far, we obtained marginal distributions by sampling directly from the MCMC output. However, using advanced BN software with a graphical user interface can very much ease exploration of parameter dependences. Unfortunately, for a larger number of parameters, representation of the joint probability by means of a BN with all edges being kept becomes prohibitive due to the high dimensionality of the conditional probability tables needed. Resolving marginal distributions with a lower number of bins reduces this dimensionality. To avoid coarse resolution, a representation assuming a reduced number of connecting edges often would be the preferred option. However, eliciting the proper simplified structure of a BN [10] is much more difficult than just specifying conditional probability tables for an already given dependence structure. Here, we use Gaussian graphical modeling (Section 4.3.1) as an auxiliary technique, although this concept must be clearly distinguished from the BN approach.
Just like principal component analysis, Gaussian graphical modeling relies on the parameter correlation matrix. An undirected GGM represents conditional correlations (partial correlations) between pairs of variables that are not mediated by any third variable. The idea is to adjust the posterior parameter correlation matrix in a way that generates zero partial correlations. The statistical relevance of any further truncation of the GGM can be assessed in the light of existing data. Here, we applied a more qualitative concept, looking for a graph in which all missing edges are clearly less relevant than those maintained (see Section 4.3.1). In the truncated graph we found in agreement with this criterion (Figure 8), 6 out of 15 edges of the saturated graph could be removed. Table 3 shows the original correlation matrix S of feasible parameter combinations, whose EOFs were displayed in Figure 4. Note the particularly strong correlations (either positive or negative) between algal silica content f_Si and the two algal loss related parameters log(a) and log(σ_0) (see Equation (8)). Table 3 compares S with the correlation matrix V, simplified to agree with the graphical structure shown in Figure 8. The iterative proportional fitting algorithm [48] was applied to adjust matrix V to conform to this independence structure. Numbers in bold type correspond to edges that were maintained. These correlations generally remain unchanged (see [48]).
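Iterative proportional fitting for a GGM (often called iterative proportional scaling in this Gaussian setting) can be sketched as follows: the precision matrix is updated clique by clique until the fitted matrix V matches S on the retained edges while its partial correlations vanish on the removed ones. The 4-variable matrix and the chain-shaped edge list below are invented for illustration.

```python
import numpy as np

def ipf_ggm(S, edges, n_iter=200):
    """Fit a Gaussian graphical model to correlation matrix S.

    Updates the precision matrix Omega on each edge (and vertex) block:
    Omega_cc += inv(S_cc) - inv(Sigma_cc), with Sigma = inv(Omega).
    Off-graph entries of Omega are never touched and stay zero.
    """
    M = S.shape[0]
    Omega = np.eye(M)
    cliques = [np.array(e) for e in edges] + [np.array([i]) for i in range(M)]
    for _ in range(n_iter):
        for c in cliques:
            Sigma = np.linalg.inv(Omega)
            idx = np.ix_(c, c)
            Omega[idx] += np.linalg.inv(S[idx]) - np.linalg.inv(Sigma[idx])
    return np.linalg.inv(Omega), Omega

S = np.array([[1.0, 0.5, 0.3, 0.2],
              [0.5, 1.0, 0.4, 0.1],
              [0.3, 0.4, 1.0, 0.6],
              [0.2, 0.1, 0.6, 1.0]])
edges = [(0, 1), (1, 2), (2, 3)]      # chain graph: 3 of 6 edges removed
V, Omega = ipf_ggm(S, edges)
```

As stated above for Table 3, the correlations on maintained edges remain unchanged (V matches S there), while the zero pattern of the precision matrix encodes the removed edges.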
To assess implications of fitting the graphical model, we quantify posterior parameter dependences in terms of the percentage of uncertainty (variability) of every single parameter that can be modeled as a linear function of all other parameters. Table 4 shows these explained variances as derived from S and V, respectively. Generally high values conform to the low dimensionality (dof = 2.7) of the posterior parameter space. Small differences between the results from either S or V are in favor of the simplified GGM. Moreover, the leading EOFs obtained from V resemble those for S in Figure 4 (not shown). The interpretation of the GGM in Figure 8 is the following. Assume, for instance, that parameters f_Si and σ_0 were given. According to the GGM, this would block all interaction between a and the remaining three parameters. Interaction means that changes of any of the three parameters µ_0, λ_S and K_light could compensate for effects of changing a and vice versa. Given fixed values for f_Si and σ_0, this mechanism would be suppressed. Figure 9 shows a BN with directed edges replacing the undirected edges of the GGM in Figure 8. Generally, it is not possible to exactly translate a GGM into a directed BN. Conversely, given any BN, a corresponding conditional independence graph can be derived by first connecting all joint predecessors (parent nodes) of all child nodes. Then, all directed edges are converted into undirected ones, giving the so-called moral graph [76]. The moral graph derived from the BN in Figure 9, for instance, would also contain an edge between λ_S and f_Si, as these two nodes have the joint children K_light and µ_0. Such an edge is missing in the GGM in Figure 8. Hence, the GGM tends to be more restrictive than the BN, and it can therefore be expected that the simplified BN (6 out of 15 edges removed; the maximum number of parents is 3 instead of 5 for the saturated graph) behaves similarly to the saturated BN.

Accelerated Sampling via Delayed Acceptance
As mentioned above, the large number of model runs required to obtain the presented results can be reduced by delayed acceptance MCMC sampling. Instead of directly using the surrogate for the cost with fixed ζ, we take a step in-between and construct a function-valued surrogate model. Multiple surrogates z̃_k(x) are built, where each maps the input x to one weight z_k(x) in the KL expansion. A surrogate f̃_i(x) ≡ f̃(t_i; x) for the model output is then given by replacing z_k(x) by z̃_k(x) in (22). The corresponding surrogate ℓ̃(y; x, ζ) for the cost function uses f̃_i(x) instead of f_i(x) in (35). Dependencies on ζ are kept exact in this approach. The main algorithm proceeds in the following steps:

1. Construct a GP surrogate for the L² cost function on a space-filling sample sequence over the whole prior range.
2. Refine the sampling points near the posterior's mode by Bayesian global optimization with the L² cost surrogate.
3. Train a multi-output GP surrogate for the functional output z(x) on the refined sampling points.
4. Use the function-valued surrogate for delayed acceptance in the MCMC run.
For all GP surrogates, we use a Matérn 5/2 kernel for k(x, x′) together with a linear mean model for m(x). For step 4, we use Gibbs sampling and the surrogate for z(x), yielding the full output y(t, x) rather than only the L² distance to a certain reference dataset. The idea to refine the surrogate iteratively during MCMC had to be abandoned early. The problem is that detailed balance is violated as soon as the surrogate proposal probabilities change when the GP regressor is modified with a new point. In the following application cases, we compare a usual MCMC evaluation using the full model to MCMC with delayed acceptance using the GP surrogate together with the KL expansion/functional PCA (GP+KL) in the output function space.
First, we test the quality of the algorithm on a toy model y(t, x) with two input parameters. We choose reference values x_1 = 1.15, x_2 = 1.4 to test calibration of x against the corresponding output y_ref(t) ≡ y(t, x_ref) and add Gaussian noise of amplitude σ = 0.05. A flat prior is used for x. For the hierarchical model case (34), we choose a starting guess of ζ = 2 for the norm's order and a Gaussian prior with σ_ζ = 0.5 around this value together with a positivity constraint. The initial sampling domain is the square x_1, x_2 ∈ (0, 2). The comparison between MCMC and delayed acceptance MCMC is made once for fixed ζ = 2 (Gaussian likelihood) and then for a hierarchical model with a random walk also in ζ. The respective Markov chain with 10,000 steps has a correlation length of ≈ 10 steps (Figure 10) and yields a posterior parameter distribution for (x_1, x_2) depicted in Figure 11.
The results in Figure 11 show good agreement in the posterior distributions of full MCMC and delayed acceptance MCMC. Compared to the case with fixed ζ = 2, the additional freedom in ζ in the hierarchical model leads to further exploration of the parameter space. The posterior of ζ according to the Markov chain is given in Figure 12. The similarity to the prior distribution shows that the data does not yield new information on how to choose ζ. The construction of a reliable Gaussian process surrogate for the full six-dimensional input space of the diatom model has not been successful due to the excessive number of support points required. This is why we limit the present analysis to only two input parameters, namely x_1 = θ_3 = K_light and x_2 = θ_1 = µ_0. As in the case of the toy model, we use 10,000 steps in the Markov chain. Results for autocorrelation and posterior samples using the full model versus delayed acceptance are shown in Figures 13 and 14. The correlation time of ≈ 500 steps is much larger than in the toy model, and the decay of the autocorrelation over the lag roughly matches between the two approaches. Delayed acceptance sampling produces similar posterior samples in Figure 14 at about one third of the overall computation time. There one also sees the issue of high correlation between K_light and µ_0 in the posterior of the calibration, making Gibbs sampling inefficient in this particular case.

The Case Study Example
Our case study considered a simple model with only a few parameters that nevertheless reproduced chlorophyll a observations at Geesthacht Weir reasonably well. However, even a good fit does not prove a model's truth [1,3,5]. The fact that even complex environmental models drastically simplify the natural system has attracted much scepticism, e.g., [77]. Model design always relies on certain presumptions. Hornberger and Spear [78] considered their simple model for the Peel Inlet a speculative scenario that presumes phosphorus controls algae growth. An alternative scenario putting nitrogen at the heart of the analysis was reported by Humphries et al. [79]. Similarly, the model in the present case study was based on the hypothesis that lack of silica might explain sudden drops of chlorophyll a concentration observed at Geesthacht Weir. Model calibration shed some light on how (within the a priori specified model structure!) silica limitation and a temperature-dependent loss rate could provide competing concepts to explain the variability of chlorophyll a observations. An indication in favor of the basic model concept is the fact that, when small observational errors were assumed, a disregard of prior knowledge about model parameter values did not let MCMC simulations produce infeasible parameter combinations (Figure 6a). According to van Straten [80], '. . . one may question, whether a model is actually well-structured if the use of parameter constraints is the only way to avoid nonfeasible solutions'. Another vague indication in favor of the underlying model concept is the fact that posterior parameter distributions did not much depend on whether or not silica observations were used in the process of model calibration (not shown). On the other hand, the model consistently failed to reproduce a sharp late-spring chlorophyll a increase, so that in each year silica consumption had to be abandoned for a 1-2 week period.
This model deficiency could not be fixed by any choice of parameters (not discussed in this paper), which confirms that the model's structure is specific and cannot be adjusted to any arbitrary time series.
One must be aware that simple (or even complex) models neglect many external factors that potentially impact time series observed in nature. Observations from different years cannot necessarily be treated like the outcomes of repeated well-defined experiments. External factors not considered in the model (there are plenty of them) might differ between years. Discussing an exceptionally high chlorophyll a concentration in the River Rhine in 2011, Hardenbicker et al. [37] hypothesize a high growth potential of phytoplankton which, however, most of the time is suppressed by some other environmental factor. From a modeler's perspective, Waylett et al. [38] found that particularly high spring chlorophyll a concentrations in one year could not be explained by the physico-chemical factors in their model. They suggested variable strengths of grazing loss rates, possibly depending on over-winter survival rates of benthic filter feeders, as the most feasible explanation for such interannual differences. According to Figure 7, our analyses for the two years 1997 and 1998 suggest an either larger (1997) or smaller (1998) impact of silica limitation, assuming an either high (1997) or low (1998) algal silica content. One must be careful not to prematurely attribute such differences for individual years to the model being overfitted.
A proper choice between the two explanations of chlorophyll a variability (either silica limitation or increased algae loss rate) might gain importance as soon as the model would be run in a predictive mode. However, extrapolation of a simplified model into a domain of unobserved environmental conditions would be a questionable enterprise. The goal of the present study in support of past data interpretation was rather to improve the description of imponderabilities that remain after model calibration.
We assessed model performance in terms of a squared-error loss function (Equation (13)). A quadratic measure, which penalizes large discrepancies in particular, is suitable for highlighting problems with the simulation of major short-term changes such as those that characterized the chlorophyll a time series under study. Using the squared-error loss function might have been less revealing if the general model performance had been worse (which is why we modified the model during the short periods in May/June). In the case of generally strong model-data discrepancies, a linear measure of model-data misfit could have been more adequate.
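To make the contrast concrete, the following minimal sketch (with invented observation values, not those of the Elbe study) shows how a quadratic misfit measure lets one large discrepancy dominate, whereas a linear measure weighs it proportionally:

```python
import numpy as np

def misfit(obs, sim, kind="squared"):
    """Aggregate model-data misfit: 'squared' penalizes large
    discrepancies much more strongly than the 'linear' measure."""
    r = np.asarray(obs, dtype=float) - np.asarray(sim, dtype=float)
    if kind == "squared":
        return float(np.sum(r ** 2))
    if kind == "linear":
        return float(np.sum(np.abs(r)))
    raise ValueError(f"unknown misfit kind: {kind}")

obs = [1.0, 2.0, 10.0]          # one short-lived peak in the series
sim = [1.0, 2.0, 4.0]           # the model misses the peak
print(misfit(obs, sim, "squared"))   # 36.0: dominated by the single miss
print(misfit(obs, sim, "linear"))    # 6.0: the miss only weighs in linearly
```

With an otherwise good fit, the quadratic measure thus concentrates attention on exactly the kind of abrupt events discussed above.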

MCMC in Relation to GLUE and BMC
In modeling eutrophication in the Peel Inlet, Hornberger and Spear [78] and Hornberger [81] formulated a set of six behavioral conditions to discriminate between simulations being either successful or unsuccessful in mimicking key aspects of the system's evolution. Spear and Hornberger [82] found separation-induced correlations between model parameters not to exceed 0.23, which is why the authors did not embark on a deeper analysis of the correlation matrix. According to Spear [4], conventional multivariate analyses also proved unhelpful in other studies using the same approach. In our study, correlations were found to be much higher (see Table 3). We presume that this relates to (a) our model being much more strongly controlled by observations and (b) the huge number of successful simulations (10^6 in our study) affordable with today's computing power.
The aforementioned studies motivated further developments leading to the GLUE (generalized likelihood uncertainty estimation; [83]) technique, which is sometimes referred to as a pseudo-Bayesian approach. In contrast to MCMC (a formal Bayesian approach), the GLUE approach separates parameter sampling (e.g., uniform Monte Carlo or Latin hypercube sampling) from likelihood evaluation [22]. In higher dimensions, this random sampling makes GLUE computationally more expensive than MCMC with its sequential sampling [25]. Random sampling is also used within the Bayesian Monte Carlo (BMC) approach, a method related to GLUE but using a statistically rigorous likelihood function [23]. According to Beven and Freer [22], the advantage of MCMC might diminish when the model output likelihood has a complex shape. That did not seem to be the case in our application.
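The separation of sampling from likelihood evaluation in GLUE can be sketched as follows; the toy linear model, the uniform prior box, and the 2% behavioral threshold are illustrative assumptions, not the settings of any study cited above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)

def model(theta):
    """Toy simulator: straight line with slope theta[0], intercept theta[1]."""
    return theta[0] * x + theta[1]

# Synthetic observations from "true" parameters (2.0, 1.0) plus noise.
obs = model(np.array([2.0, 1.0])) + 0.1 * rng.normal(size=x.size)

# GLUE step 1: non-sequential (uniform) sampling of the prior box,
# independent of any likelihood information.
thetas = rng.uniform([0.0, 0.0], [4.0, 2.0], size=(50_000, 2))

# GLUE step 2: an informal likelihood evaluated afterwards (here via
# the sum of squared errors), with a behavioral threshold keeping
# only the best 2% of all runs.
sims = thetas[:, 0:1] * x + thetas[:, 1:2]
sse = np.sum((obs - sims) ** 2, axis=1)
behavioral = thetas[sse < np.quantile(sse, 0.02)]
print(behavioral.mean(axis=0))       # roughly recovers (2.0, 1.0)
```

Note that all 50,000 model runs are spent regardless of fit quality; a sequential MCMC sampler would instead concentrate its runs in the behavioral region, which is the computational advantage referred to above.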
Tan et al. [84] contrasted results of the GLUE and MCMC approaches, assessing uncertainties of nine parameters of a crop model. The authors did not, however, address an explicit specification of parameter interactions according to the posterior joint distribution. The same holds for a comparative assessment of the two approaches conducted by Li et al. [85] for two conceptual hydrological models, and by Camacho et al. [25] for the calibration of a hydraulic/hydrodynamic model using synthetic data. Our study focused on parameter interactions, and we believe that for this purpose the many samples obtainable from MCMC are a key advantage when it comes to filling a higher-dimensional parameter space. Using BMC to calibrate nine parameters of a simple water quality model, Dilks et al. [23] found that approximately 60% of model output uncertainty could be related to covariances between model input parameters. For a lake modeling example, Fedra et al. [1] found that focusing on meaningful simulations did not much constrain individual parameters; in higher dimensions, however, they found clustering of successful parameter combinations.
Referring to a binary classification of model runs in terms of simulations being successful ('passes' or 'behaviours') or not, Spear et al. [86] described the interactions between parameters that gave rise to passes by a tree-structured estimation technique. Studying an example from groundwater pathways modeling, the authors found discontinuous localized regions whose interactions were not reflected in a linear correlation matrix. No similar feature could be recognized in our study. The reason might be that our analysis was based on a continuous goodness-of-fit index rather than a sharp binary classification.

Benefit from Using BNs
Discussing half-saturation constants, Mulder and Hendriks [87] warn that simultaneous calibration of a whole set of model parameters might not reveal the true values one would obtain in the laboratory. Similarly, Brun et al. [5] emphasize that fixing selected parameters will usually bias the estimates of other parameters. However, experimental data are sparse and often not representative of the overall behavior of a complex natural system. An at least approximate description of the joint posterior distribution of all parameters offers a way out of this dilemma. A BN representing this distribution enables users to explore the extent to which selecting values for any subset of parameters constrains the marginal distributions of all other parameters (see Figure 5), thereby explicitly addressing the concerns raised by Brun et al. [5].
Marginal distributions constrained by available evidence on some of the model parameters can also be calculated directly from MCMC output, without involving specialized BN software. The advantage is that the high-dimensional conditional probability tables that hamper the analysis need not be specified. However, BN software with a graphical user interface provides a more convenient approach, and BN technology is also needed when aiming at a simplified representation of posterior parameter interactions. Such a structural analysis might suggest some kind of stepwise calibration of different process-related modules of a model. We demonstrated how graphical Gaussian modeling can be helpful in this context, given that parameter uncertainty distributions are reasonably well represented by a simple linear correlation matrix. Although undirected and directed graphs cannot be directly translated into each other, the undirected graph nevertheless seems more informative than, for instance, conventional principal component analysis.
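A minimal sketch of this direct route, assuming the MCMC output is available as an array of posterior samples (the bivariate Gaussian below is a synthetic stand-in for a correlated posterior, not the actual output of our diatom model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical MCMC output: 100,000 posterior samples of two strongly
# negatively correlated parameters.
cov = [[1.0, -0.8], [-0.8, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

def constrained_marginal(samples, evid_idx, lo, hi, target_idx):
    """Marginal of parameter `target_idx` given the evidence
    lo <= theta[evid_idx] <= hi, by simple rejection on the sample."""
    mask = (samples[:, evid_idx] >= lo) & (samples[:, evid_idx] <= hi)
    return samples[mask, target_idx]

# Without evidence, parameter 1 has mean 0 and standard deviation 1.
# Fixing parameter 0 to a narrow high range shifts parameter 1 down
# and narrows its spread, mirroring what the BN interface displays.
cond = constrained_marginal(samples, evid_idx=0, lo=1.0, hi=1.5, target_idx=1)
print(cond.mean(), cond.std())
```

The rejection step is exactly the "sub-sampling" alternative mentioned above; its only cost is that narrow evidence intervals leave few samples, which is where the smoothed conditional tables of a BN become convenient.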
Parameter correlations documented in the BN represent alternative model structures, an example of the equifinality which, according to Beven and Freer [22], may occur already at moderate model complexity. Many authors have argued that a model accommodating such overparameterization, rather than pretending that a single true model exists, may even be useful, e.g., [1,3,22,88]. Given the model structure we used, the aspect worst controlled by the data is whether decreasing chlorophyll a concentrations are explained by algae growth limited by a lack of silica or by a strong algal loss rate. According to Figure 5c, choosing a relatively low algal silica content f_Si (and therefore a low depletion of the silica reservoir) implies a large loss rate σ_0 and a small coefficient a governing the effect of temperature on the loss rate (e.g., via grazing). By contrast, for a large silica content, the maximum loss rate should be set to a small value; its variation at high temperatures then becomes less constrained by the data.

Conclusions
For a simple model of riverine diatoms, we provided a detailed description of the posterior joint distribution of adequate model parameters, inferred from observations of chlorophyll a concentration. We argue that this is the most comprehensive information on a model's calibration that potential users can obtain. Results showed how, within the prespecified model structure, two different processes affected model output in a very similar way, thereby offering different interpretations of features in the observations. Of course, it must be kept in mind that (as with any other ecosystem model) the choice of the specific model structure itself already vastly simplified the representation of nature by neglecting many (in fact, the majority of) detailed processes.
Two challenges must be met by the approach we discussed. First, estimation of the joint distribution in a higher-dimensional parameter space requires a large number of simulations. MCMC seems more effective than random sampling because it avoids exploring those regions of parameter space that produce unrealistic simulations. As MCMC would still be prohibitive for large models, computationally less demanding surrogate models could offer a way out.
As a possible way to reduce the number of simulations in MCMC, we have illustrated the application of function-valued surrogates to delayed-acceptance MCMC for parameter calibration in simple as well as hierarchical Bayesian models. Using a surrogate for the functional output rather than for the cost function or likelihood is useful for several reasons. Conceptually, it allows introducing additional distribution parameters in Bayesian hierarchical models. Our results demonstrate that it is possible and efficient to perform MCMC with delayed acceptance on such models while keeping the dependencies on these additional parameters exact. In particular, the fractional order of the norm appearing in the cost function was left free, which is useful for robust model calibration. The method was applied to a toy model and to the present application, restricted to two variable input parameters. In both cases, using delayed acceptance with a surrogate for the functional output produced results comparable to using the full model at only about one third of the full-model evaluations. Compared to direct surrogate modeling of the cost function, we also observed an increase in the quality of the predicted cost. This is likely connected to the higher flexibility gained by modeling the weights of multiple principal components with Gaussian processes having individual hyperparameters.
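The two-stage mechanism can be sketched as follows (the model equations, design sizes, and tuning constants are illustrative assumptions, not the settings of our actual study): a PCA basis is extracted from training runs of the simulator, one hand-rolled GP regressor predicts each retained component weight, and every Metropolis proposal must first pass a cheap surrogate-likelihood screen before the full model is evaluated; the second-stage ratio carries the standard delayed-acceptance correction that preserves detailed balance.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)

def full_model(theta):
    """Stand-in for an expensive simulator with function-valued output."""
    return np.sin(2.0 * np.pi * (t + theta[0])) * np.exp(-2.0 * theta[1] * t)

# --- Surrogate: PCA of training outputs, one GP per component weight ---
X_tr = rng.uniform(0.0, 1.0, size=(40, 2))          # training designs
Y_tr = np.array([full_model(x) for x in X_tr])      # training outputs
y_mean = Y_tr.mean(axis=0)
_, _, Vt = np.linalg.svd(Y_tr - y_mean, full_matrices=False)
k, ell = 3, 0.25                # retained components, RBF length scale

def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

K = rbf(X_tr, X_tr) + 1e-5 * np.eye(len(X_tr))      # jittered Gram matrix
W_tr = (Y_tr - y_mean) @ Vt[:k].T                   # component weights
alpha = np.linalg.solve(K, W_tr)                    # GP dual coefficients

def surrogate(theta):
    """Predicted functional output: GP means mapped back through PCA."""
    return y_mean + (rbf(theta[None, :], X_tr)[0] @ alpha) @ Vt[:k]

# --- Delayed-acceptance Metropolis, surrogate as first-stage screen ---
obs = full_model(np.array([0.3, 0.5])) + 0.05 * rng.normal(size=t.size)
loglik = lambda y: -0.5 * np.sum((obs - y) ** 2) / 0.05 ** 2

theta = np.array([0.5, 0.5])
ll, full_evals, chain = loglik(full_model(theta)), 1, []
for _ in range(500):
    prop = theta + 0.05 * rng.normal(size=2)
    if np.any(prop < 0.0) or np.any(prop > 1.0):    # uniform prior box
        chain.append(theta); continue
    lr1 = loglik(surrogate(prop)) - loglik(surrogate(theta))
    if np.log(rng.random()) < min(0.0, lr1):        # stage 1: cheap screen
        ll_p = loglik(full_model(prop))             # stage 2: full model
        full_evals += 1
        # Correction term keeps the two-stage kernel in detailed balance.
        lr2 = (ll_p - ll) + min(0.0, -lr1) - min(0.0, lr1)
        if np.log(rng.random()) < min(0.0, lr2):
            theta, ll = prop, ll_p
    chain.append(theta)

print(f"{full_evals} full-model runs for {len(chain)} MCMC iterations")
```

Most proposals are discarded at stage one using only the surrogate, so the number of full-model runs stays well below the chain length; fitting independent GPs to the component weights (rather than one GP to the scalar cost) is what the flexibility argument above refers to.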
The described approach is not immune to the curse of dimensionality. On the one hand, the number of required GP regressors grows linearly with the effective dimension of the output function space. Since evaluation is fast and parallelizable, this is a minor issue in practice. On the other hand, increasing the dimension of the input space soon prohibits the construction of a reliable surrogate due to the number of training points required to fill the parameter space. In such cases, the preprocessing overhead is expected to outweigh the speedup of delayed-acceptance MCMC for either functional or scalar surrogates. More detailed investigations will be required to give quantitative estimates of this tradeoff. In the future, it may further be of interest to leverage the uncertainty information provided by the GP regressor to decide in which regions of parameter space the surrogate is reliable enough to serve as a delayed-acceptance proxy.
The second challenge is a convenient representation of the joint posterior parameter distribution. BN software estimates a set of conditional probability tables that represent the results of the model calibration exercise. While direct sub-sampling of the original data avoids technical problems with high-dimensional conditional probability tables, BN technology enables a simplified representation of parameter interactions. For large models, a graph of model parameter interactions could possibly be assembled from a number of sub-graphs dealing with parameters of different modules of a process-oriented model. Proving the feasibility of such an approach is left to further research.