Learning model discrepancy of an electric motor with Bayesian inference

Uncertainty Quantification (UQ) is highly requested in computational modeling and simulation, especially in an industrial context. With the continuous evolution of modern complex systems demands on quality and reliability of simulation models increase. A main challenge is related to the fact that the considered computational models are rarely able to represent the true physics perfectly and demonstrate a discrepancy compared to measurement data. Further, an accurate knowledge of considered model parameters is usually not available. E.g. fluctuations in manufacturing processes of hardware components or noise in sensors introduce uncertainties which must be quantified in an appropriate way. Mathematically, such UQ tasks are posed as inverse problems, requiring efficient methods to solve. A popular approach for UQ in inverse problems is Bayesian inference. This work investigates the influence of model discrepancies onto the calibration of physical model parameters and further considers a Bayesian inference framework including an attempt to correct for model discrepancy by an additional term. A Markov Chain Monte Carlo (MCMC) method is utilized to approximate the posterior distribution. A polynomial expansion with unknown coefficients is used to approximate and learn model discrepancy and system parameters simultaneously. This work extends by discussion and specification of a guideline on how to define the model discrepancy term complexity, i.e. here the maximum polynomial degree, based on the available measurement data. Furthermore, the suggested method is applied to an electric motor model with synthetic measurement data and evaluated by comparing the results to the reference. The example illustrates the importance and promising perspective of the method by good approximation of discrepancy and parameters.


Introduction
The increasing complexity of technical systems yields high demands on model quality and numerical accuracy.The advancements in computational power allow detailed models to represent these systems, but uncertainty is still present in most models.Quantification of these models under uncertainty is desired to make statements about reliability and accuracy.Consequently, there is a need for efficient solvers and new additional methods for analysis, which are able to cope with the soaring complexity of models and that take into account all sources of uncertainty.
Uncertainty Quantification (UQ) methods focus on understanding, quantifying and propagating uncertainty in computational simulation of models.Uncertainty arises from different sources and is often grouped in two categories: aleatoric and epistemic uncertainties.Aleatoric uncertainties are inherent to the model itself and are not due to insufficient knowledge.They are stochastic variations that cannot be reduced, e.g. by performing more measurements.Epistemic uncertainties are due to model assumptions, parameterizations and discretization.If a model for a system of interest is available, then epistemic uncertainty is subdivided into model form uncertainty and parametric uncertainty.The first category arises if reasonable doubts exist that the model itself is structurally correct and the latter, if the model reflects reality well enough, but the parameters are uncertain [19,22].
To be more specific we formulate the general forward problem.Let G : X → Y, for some spaces X ⊆ R d , Y ⊆ R n , d, n ∈ N, be an operator representing a simulation model that maps some input x ∈ X to an output y = G(x).In general G is nonlinear.If this x is uncertain, one might formulate it as a random variable X in a probability space (Ω, Σ, P ).With the random variable X the output Y = G(X) is a random variable as well, with an unknown probability distribution of Y .The task of forward uncertainty quantification methods is now to characterize the distribution of Y with appropriate methods, e.g. with the Monte Carlo method.
On the other hand the inverse problem describes the task of finding a x ∈ X for a given measurement y ∈ Y such that y = G(x).Generally, this equality does not hold as the measurements y are usually corrupted by observation noise.Hence, one considers where η ∈ Y represents the observation noise due to measurement noise or model error.Now, simply inverting G is not possible as η is unknown and in general G −1 is non-existent.A classical approach to solve this problem is by minimizing the data misfit, i.e. min x∈X However, this problem is typically ill-posed in the sense of Hadamard, i.e. multiple solutions might exist and stability might be a problem.In order to obtain a well-posed problem regularization is necessary.One approach is Tikhonov regularization, also known as ridge regression in statistics, where a regularization term R(x) is added to the data misfit, i.e. min x∈X 1 2 y − G(x) 2  Y + λR(x), for a λ > 0 [10].An example is R(x) = x − x 2 for some norm • and a fixed x ∈ X .However, the choice of regularization is somewhat arbitrary [21].The Bayesian approach to inverse problems yields a natural regularization of ill-posed problems [4,10,21] and is furthermore a popular approach for UQ in inverse problems.With the Bayesian approach the problem is formulated in terms of probability theory and one is interested in finding a probability measure µ y (x) on X with probability density π(x|y) that expresses how likely a certain x ∈ X , i.e.G(x), describes the given data y under consideration of the noise.The inverse problem in this setting is well-posed under slight assumptions and leads to a natural way of regularization due to the definition of prior distributions for unknown parameters, see [21].Usually, the problem is formulated as y = G(x) + η, where η is modeled as a realization of a zero mean random variable with probability density ρ, e.g.η ∼ N (0, Σ) with unknown covariance matrix Σ ∈ R n×n .The probability of y given x has density π η (y|x) := ρ(y − G(x)) and is denoted as likelihood.Let π 0 (x) be the prior density expressing prior knowledge of the unknown x ∈ X .Bayes' formula leads to where π(y) = X π(x, y)dx is called marginal or evidence.The evidence works as a normalizing constant and is usually omitted, i.e.only the relation π(x|y) ∼ π η (y|x)π 0 (x) is considered.Generally, the posterior distribution π(x|y) is intractable and one can not sample from it directly, hence approximative methods are required.The approximation can be done by filtering, variational and sampling methods [5,21].An example for the latter is Metropolis-Hastings Markov-Chain Monte-Carlo (MH-MCMC) sampling, which is utilized in this work [5,8,17].
A difficulty in solving the inverse problem is to capture all sources of uncertainty and correctly identify them, often called "identification problem".Uncertainty sources are • parametric uncertainty, e.g.either unknown model parameters (epistemic uncertainty) that are completely unknown or only known to a certain degree for instance by a probability distribution expressing an a-prior degree of believe, or variable / stochastic parameters (aleatoric uncertainty) or even both, unknown stochastic parameters, • structural model uncertainty, e.g.lack of knowledge, missing physics, in modeling the system of interest, • solution method uncertainty, e.g.errors introduced by implementation and also by numerical approximation schemes and • observation uncertainty, e.g.noise and errors that are introduced by observing and recording quantities of interest of the system under consideration.
These categories are not necessarily selective and single uncertainty sources might not uniquely be associated to one single category.For a full quantification of uncertainties in the inverse problem all sources should be considered.A main challenge is related to the fact that the considered computational models G are rarely able to represent the true physics perfectly and demonstrate a discrepancy compared to measurement data.To be more specific the term model discrepancy in this work denotes the difference between simulation model for the true physical parameters and the true system, hence the structural uncertainty that might be due to missing physics in the formulation of the simulation model.It also comprises numerical errors.However, this model discrepancy is usually unknown.Considering model discrepancy in inverse problems is crucial in order to obtain realistic calibrations of unknown physical parameters [3] and to further obtain a quantification of model quality and reliability, which is important for most engineering tasks.In the following we give a brief overview of previous work considering model discrepancy in inverse problems.
The Kennedy and O'Hagan framework [11] is one of the first attempts to model and explicitly take account of all the uncertainty sources that arise in the calibration of computer models.In particular, model discrepancy is considered by an additional term in the Bayesian formulation of the inverse problem.Gaussian Processes are then used to model the model discrepancy and the simulator.One conclusion is that one only calibrates and receives the best fitting parameters according to the given data and defined error structure.These obtained parameter estimates do not necessarily correspond to the true physical values.Often the computer model does explain the data better for different values and restriction of some calibration parameters to the true physical value might lead to worse results.Following [11], Arendt et.al. [1,2] suggest a modular Bayesian approach, where first the hyperparameters of the Gaussian processes for the simulator and the model discrepancy term are approximated by maximum likelihood estimates consecutively.Then the posterior distribution of the parameters conditioned on the data and the hyperparameter estimates are approximated.Further, the identification problem, i.e. the problem of distinguishing between effects of the calibration parameters and the model discrepancy, is discussed.Examples illustrate that sometimes this separation is possible under mild assumptions, e.g.smoothness of the model discrepancy, but also that it is not possible in other cases.In the companion work [2] they show an approach how to improve identifiability by using multiple responses and representing correlation between responses.Another work using multiple responses is [14].Brynjarsdóttir and O'Hagan [3] state that with the Kennedy and O'Hagan framework [11] in order to infer physical parameters and model discrepancy simultaneously a sufficient prior distributions for at least one of those must be given.In an example they show that a constrained Gaussian Process prior for the model discrepancy, including best and most realistic prior information, yields good results for interpolation and learning about physical parameters, but still seems to be bad for extrapolation.Hence, simply introducing model discrepancy with weak prior information is not enough.However, Tuo and Wu [24,25] showed that the choice of the model discrepancy prior has a permanent influence onto the parameter posterior distribution even in the large data limit.Plumlee [15] presented an approach to improve identifiability by defining a prior distribution of the model discrepancy that is orthogonal to the gradient of the model.Nagel et.al. [13] modeled the model discrepancy term by an polynomial expansion, assuming smoothness for the true underlying model discrepancy.
Following all these works we consider a Bayesian model with a term for measurement noise and a model discrepancy term in order to infer parameters of simulation model for given measurement data.This work adapts the idea of representing model discrepancy as a polynomial expansion [13].The major contribution of this work is to provide answers on how to select a polynomial degree by keeping the complexity of the model low while still providing high accuracy in discrepancy modeling.This is shown in a practical guideline, which recommends how to select a sufficient maximum polynomial degree of the truncated polynomial expansion, based on the available data and the estimation of measurement noise.Furthermore, critical points conditioning the choice of the model discrepancy term are discussed in detail.
The framework is then applied to the calibration of a direct current (DC) electric motor model under consideration of model inadequacy, measurement uncertainty and parametric uncertainty in a synthetic setting.I.e.synthetic measurements are created and a modified electric motor model, containing an artificial model error are used to infer model parameters.Due to the synthetic setup of the example, reference parameter values and reference discrepancies are available and allow a quantitative evaluation of the considered methods performance and accuracy.
Considering all sources of uncertainties increases the complexity of the problem one needs to solve and consequently increases computational cost.Hence, efficient methods need to be developed to leverage this drawback.One possibility is to increase the efficiency of the posterior approximation method, e.g. by using additional information (first or second order derivatives, e.g.gradient, Hessian) of the problem [9,18,20].
Another option is to replace the simulation model by an surrogate to leverage computational cost.In this work surrogate modeling is omitted for the sake of simplicity, but we emphasize that in cases where simulation time is a bottleneck surrogate modeling, such as Polynomial Chaos expansions (PCE) [6,27] or Gaussian processes (GPs) [11,16], could leverage overall simulation time and speed up the inference.This paper is structured in the following way: Section 2 formulates the considered example setup, by defining the electric motor model, an artificial model error and synthetic measurements.In Section 3 a Bayesian inference model considering model discrepancy and measurement noise is specified, followed by a discussion of the model discrepancy term complexity.The numerical results of the methods applied to the example are presented in Section 4. This work concludes in Section 5.

Electric motor model
The following introduces briefly the mathematical model of an electric motor and the corresponding numerical approximation.This is followed by the synthetic problem set up with synthetic measurement data creation and artificial model error.

Model differential equations
A direct current (DC) electric motor is a rotating electrical machine converting electrical energy into mechanical energy.It is used in cases where a DC voltage source is available and a variable motor speed operation is required.With the magnitude of the applied voltage the motor speed can be varied.Elements of a classical DC motors are a fixed stator and a movable rotor.One or more field windings on the stator generate a permanent magnetic field, in which the rotor is located.The rotor itself is surrounded by the armature winding consisting of several coils.When electrical current is applied to the armature coil a magnetic field is generated.Due to the permanent magnetic field caused by the stator a physical force, the Lorentz force, affects the rotor.This force causes the rotor to rotate.After a certain rotation, when the magnetic fields are aligned a commutator located on the rotor changes the poles such that the current and therefore the magnetic poles of the rotor reverse their direction.Thus, the rotor remains in its continuous torque.For controlling this system variables of interest are current and voltage on the electrical side and torque and speed on the mechanical side.The following two equations relate these variables [23].Let I [A] denote the electric current and ω [rad/s] the angular velocity, i.e. the speed.For t > 0 the ordinary differential equations with initial conditions

Numerical approximation
In order to approximate a solution of the electric current I and the angular velocity ω for t ∈ [0, T ], T = 0.5[s], we define equidistant discrete time points t i := i t for i = 1, . . ., M, M ∈ N with step size t := T /M .Now, let Î(t i ) and ω(t i ) denote numerical approximations of I and ω at time points t i for a given time integration scheme.For this work an explicit Runge-Kutta method of order 4( 5) is used, in particular dopri5 (Dormand and Prince), see [26].Consequently, the order of the numerical approximation error is O( t 4 ) = O(10 −12 ), for T = 0.5 and M = 501.As the measurements are created synthetically numerical approximation error can be neglected for this work.For notational convenience we define the forward model operator G : X → Y, with X ⊆ R and Y ⊆ R 2M , by as the operator delivering numerical approximations Î(t i ), ω(t i ) at the discrete time points t i , i = 1, . . ., M depending on the parameter R ∈ X and for fixed parameters c g , c m , U, d, L, J ∈ R.

Synthetic measurement
In order to generate synthetic measurements a realization of zero mean Gaussian distributed noise ε is added to the numerical approximation G(R 0 ) for a fixed R 0 = 0.1, i.e.

Artificial model error
As the goal of this work is to learn about model discrepancy an artificial model error φ ∈ R is introduced into the equations of the electric motor Let G φ (R) denote the numerical approximation of the model for an artificial model error factor φ. Obviously, for φ = 1 the identity G = G φ=1 holds, if the same numerical approximation method and the same time step is used.In the following, G φ is referred to as simulation model.If φ = 1, this can be interpreted as missing physics, i.e. by damping or amplifying a part of the equation.Or it can be interpreted as a misspecified model, if the parameter c g is fixed.However, for this showcase it is not important how to interpret this artificial model error.For the following it is sufficient that the approximation of the model with φ = 1 has some discrepancy to the model with φ = 1.And this is the case, if c g is fixed to the same value for both models.
Define by the discrepancy d(R) ∈ R 2M and the noisy discrepancy

Bayesian inference solution process
This section considers methods addressing the inverse problem of finding an R ∈ X such that y = G φ (R), where y ∈ R 2M are noisy measurements and G φ is the simulation model containing an artificial model error φ.The goal here is not just to infer an optimal R ∈ X , but also one that is as close as possible to the reference R 0 , i.e. the true physical value, which was used to create the measurements y.Additionally we consider and learn about the model discrepancy during inference.
As already reasoned in the introduction the Bayesian approach to inverse problems has certain advantages to address the inverse problem.Regularization comes along naturally by defining eventually existing initial knowledge about unknown parameters as prior probability measure with density π 0 .Also measurement noise is considered separately.Furthermore, methods considering model discrepancy can be easily formulated within the Bayesian framework as it allows to define a prior distribution for the model discrepancy term.As we are interested in separating several sources of uncertainty it is natural to formulate the problem in terms of probability theory and use Bayesian inference.
Following [13] we introduce two Bayesian models that differ in their complexity.The simpler model only considers identically and independent distributed (i.i.d.) measurement noise, whereas the more complex model considers model discrepancy additionally.
In this work sampling methods are utilized to approximate the posterior distribution probability density function, in particular a Metropolis-Hastings Markov-Chain Monte-Carlo (MH-MCMC) method [5].We use the MH-MCMC implementation of the Python package PyMC3 [17].

Bayesian model 1 (BM1): measurement noise
As the measurements y are noisy, additive noise ε is added to the simulation model where Here, ε I , ε ω are considered as zero-mean Gaussian random variables N (0, σ 2 I I M ) and N (0, σ 2 ω I M ) with unknown standard deviations σ I , σ ω > 0, respectively.Consequently, the Bayesian model 1 (BM1) is considering only identically and independent distributed (i.i.d.) Gaussian measurement noise.Hence, the likelihood, i.e. the distribution of the measurements y conditioned on the parameters R, σ I , σ ω is with Σ(σ . By expressing a-priori knowledge of the parameters R, σ I , σ ω in terms of a prior probability density function π(R, σ I , σ ω ), Bayes' formula yields for the posterior probability density I.e. the posterior probability density is proportional to the product of likelihood and prior up to a normalization constant.Let the prior be given by π(R, σ I , σ ω ) = π(R)π(σ I )π(σ ω ).The priors for the unknown standard deviations of the noise π(σ I ), π(σ ω ) are defined as uninformative Inverse Gamma distributions InvGamma(1, 1).This is a common choice for conjugate priors of scale parameters in Bayesian statistics [5], in particular for a Gaussian likelihood with given mean.The inverse Gamma distribution InvGamma(1, 1) ensures the positiveness of σ I , σ ω > 0, see Figure 6 for a plot of the density function of InvGamma(1, 1).

Bayesian model 2 (BM2): measurement noise and model discrepancy
The second model approach extends the first one by considering model discrepancy with an additive term δ, additionally to additive i.i.d.measurement noise.Hence, the considered model is where δ = [δ I , δ ω ] T ∈ R 2M .Note that this is already the discretized version and δ I = [δ I (t 0 ), . . ., δ I (t M )], δ ω = [δ ω (t 1 ), . . ., δ ω (t M )] are vectors of the, yet to define, model discrepancy terms δ I , δ ω : D → R evaluated at the discrete time points t i , i = 1, . . ., M .Omitting the superscripts for a moment, we now assume that δ ∈ L 2 (D).The space of square integrable functions L 2 (D) is equipped with the inner product f, g = D f gdµ, where µ is a measure on D and f, g Then for all δ ∈ L 2 (D) there exists {a j } j∈N ⊆ R with j∈N |a j | 2 < ∞ such that δ can be represented by the expansion For practicability reasons the expansion is truncated after a K ∈ N Such an truncated expansion was also used in [13].With this let, for the truncation parameter K I , K ω ∈ N, the truncated functional expansions be approximative models for the model discrepancy terms δ I , δ ω .The bases {p j } j=0,...,K I and {q j } j=0,...,Kω are not necessarily identical.Note that each expansion could be truncated with own truncation parameters K I , K ω , but for notational convenience and due to later usage we stick to K ] ∈ R 2K+2 be the vector containing all coefficients and ] be the vectors of the model discrepancy terms evaluated at the discrete time points denotes the approximation of the true underlying model discrepancy δ, depending on the coefficients a and the truncation parameter K.The Bayesian model 2 (BM2) with all parameter dependencies is The number of unknown parameters depends on K and is now 3 + 2K + 2. With the additional unknown coefficients a, the prior probability density function is defined as where π(a) = K j=0 π(a The prior distributions for π(σ I ) and π(σ ω ) are specified as above.Now, with the likelihood the posterior is given by where π K denotes the dependence on K.

Choosing basis functions
If knowledge about the discrepancy is available, this should be modeled accordingly by defining an appropriate prior distribution for δ, i.e. in the case of δ K (a) for a, K and of course by an appropriate choice of basis functions.However, in general this knowledge is not available and some modeling assumptions need to be made.Following the assumption above, the basis needs to be dense in L 2 (D).With the additional assumption that δ is rather smooth, polynomials are a reasonable choice.For this let p j : D → R be a polynomial with polynomial degree deg(p j ) = j and {p j } j∈N be an orthonormal polynomial basis, with the orthogonality conditions p j , p k = D p j p k dµ, for a measure µ on D. Let w : D → R + be the weighting function of µ.Then, for instance, the 1 − t 2 are dense in L 2 (D) and thus possible choices for the expansion.Of course there are further bases that are dense in L 2 (D).However, for our assumptions Legendre polynomials with the constant weighting function are a reasonable choice for {p j } j=0,...,K .This choice was also made in [13].The first five Legendre polynomials are displayed in Figure 3.

Choosing prior distributions for the coefficients
In [13] they argue that the choice of the prior distribution for the coefficients is a bit delicate and somewhat arbitrary, as they have no physical meaning.Since no knowledge about the model discrepancy δ is available we assume δ K (a) to be as small as possible by specifying priors that are centered around zero.Every around zero centered probability distribution with decaying tails should be sufficient.However, following [13], we opt for zero mean Laplace distributions, as it assign the highest probability around zero and decays exponentially towards the tails.The probability density function of Laplace(a, b) is  distribution for the same specification, see Figure 4. Consequently, the π(a j ) are modeled as Laplace distributions with zero mean and variance 10, according to [13].As the coefficients a (I) j , a (ω) j have independent zero mean priors the resulting distributions for δ K I , δ K ω are centered around zero as well.For the variance we obtain Note that for the particular choice of the Legendre polynomials the variance of δ K (t), δ K (t) is non-constant and varies with t, as displayed in Figure 3.

Choosing the truncation parameter K
The remainder of this section addresses the open question on how to specify an appropriate value for the truncation parameter K.With the choice of Legendre polynomials K now corresponds to the maximum polynomial degree K.This K determines the complexity of the model discrepancy term δ K (a).Important factors in order to specify the model discrepancy term, or more specifically its complexity, are • accuracy, • computational costs, • bias-variance tradeoff (underfitting, overfitting) and • non-identifiability.
These points are important for a general definition of the model discrepancy term.Here, with the already made choice of Legendre polynomials and Laplace distribution for the polynomial coefficients, only the maximum polynomial degree K can be modified to adjust the complexity of δ K (a).Hence, the factors listed above are explained in terms of the maximum polynomial degree K, which will serve as a synonym for the term model complexity.An optimal K should be large enough such that δ K (a) is accurate enough to approximate the underlying discrepancy correct, but at the same time it should be as small as possible since a large K increases the number of unknown parameters and consequently computational costs.Furthermore, a large K might yield overfitting and non-identifiability of parameters.Overfitting occurs when for a large K the high degree polynomials start to reproduce oscillations of the measurement noise.With increasing K the discrepancy term δ K (a) yields an increased flexibility and is able to approximate a growing class of functions.However, a remedy of this increased flexibility is a loss of information in the prior of δ K (a).I.e. the prior gets non-informative, this might yield non-identifiability of all unknown parameters, depending on the information of the remaining model parameter prior distributions.In [3] they state that in order to infer model discrepancy and model parameters at the same time, at least for one of those an informative prior must be given.The term bias-variance tradeoff, illustrated in Figure 5, describes the fact that with increasing model complexity K the bias, i.e. the difference between the mean of the estimator and the reference value (E[R|y] − R 0 ), decreases, but at the same time the variance of the estimator V [R|y] increases with the model complexity [7].Consequently, the mean square error (MSE) of R, which is the sum of squared bias and variance , is minimal for an optimal model complexity K opt .With this optimal model complexity the contributions of bias and variance are somehow balanced.Models with an model complexity over this optimum are called overfitting (explaining the measurement data to good) and below underfitting (explaining the measurement data to bad).
Finally, taking all these factors into account the approach in this work on how to find an optimal K is following: Start with an initial K = K 0 ∈ N and increase K iteratively until the marginal posterior distribution π(σ I , σ ω |y) of the noise standard deviation stabilizes.I.e. that some distance D(π K (σ I , σ ω |y), π K+L (σ I , σ ω |y)) < tol for a given tolerance tol > 0 and L = 1, 2, 3, . . . .Then select the K such that this condition holds.Why is this sufficient?If a model discrepancy is present in BM1, then the noise term is the only instance to capture it.As the noise term is modeled with zero mean, the standard deviation might be overestimated consequently.By adding the model discrepancy term in BM2, this term, which is allowed to have an non-zero mean, captures, depending on its flexibility and degrees of freedom, a fraction of the model discrepancy.As a consequence, the noise term ε needs to represent only

Numerical results
In this section the numerical results for the two Bayesian models, BM1 and BM2, applied to the electric motor model are presented and compared to the reference, which is available due to the fully synthetic problem set up.Reference parameter values are R 0 = 0.1, σ I,0 = 2 and σ ω,0 = 10.The reference discrepancy d 0 and the noisy reference discrepancy d ε 0 as defined in Section 2.4 are displayed in Figure 2. The prior distribution for the unknown parameter is chosen as Gaussian distribution R ∼ N (R 0 , 0.2R 0 ), with prior mean R 0 and variance 0.2R 0 .Variations of the prior mean by +/ − 15% did not influence the upcoming results, thus simply the reference value was chosen as mean of the prior.The results are obtained by approximating the posterior distributions with MH-MCMC samples.The approximated marginal posterior distributions displayed in the following plots are kernel density estimates of the corresponding marginal MH-MCMC samples.Consequently, the marginal posterior moments are empirically approximated using Monte Carlo integration with a certain number of the obtained MH-MCMC samples.

Results Bayesian model 1
This section presents the numerical results obtained with Bayesian model 1 (BM1) for an artificial model error φ = 0.9. Figure 6 displays the marginal MH-MCMC samples and corresponding posterior distributions for parameters R, σ I and σ ω in comparison to the reference values.With respect to a burn in phase, only the last 700 samples of three independently sampled Markov chains, each of total length 1500, are used to obtain the results.The chains stabilize, as can be seen in Figure 6, this confirms convergence.However, the sampling behavior is poor as the effective sample size is low.This is due to the highly informative data and the so-called concentration effect, where the posterior distribution concentrates on a small region, which makes sampling inefficient, see [18,20] for further reading.
Figure 6 shows that the marginal posterior distribution for σ ω is close to the reference, but for R and 0.07 0.08 0.09 0.10 0.11 0.12 0.13 where x 0 is the reference value.x might be the empirical approximation of the marginal posterior mean   is approximated by assuming linearity for G φ .Thus only the mean of R needs to be computed.This does not cause large approximation errors as the posterior distribution of R has very small support and consequently linearity for G φ needs to be assumed in a small neighborhood of R only.
, where d ε I ( R) i denotes the i-th component of d ε I ( R), i.e. at time point t i , one obtains σ † I = 6.23 ≈ σI .The obtained value corresponds to the overestimated marginal posterior distribution of σ I that centers around a similar value, see Figure 6 and Table 1.Obviously, by only considering i.i.d.zero mean measurement noise, BM1 leads to biased and overconfident parameter estimates.Furthermore, the calibrated simulation model does not represent the data accurately enough.However, for BM1 and the model G φ with the artificial model error φ = 0.9, this is the best result one can get.

Results Bayesian model 2
This section presents the numerical results obtained with Bayesian model 2 (BM2) for an artificial model error φ = 0.9.Now, the additional model discrepancy term δ is expected to cover, depending on its definition, at least a fraction of the underlying model discrepancy and consequently yields an improved parameter estimation.As introduced in Section 3 the model discrepancy term δ is represented by a polynomial expansion δ K (a) with maximum polynomial degree K and unknown coefficients a = [a (I) , a (ω) ] = [a (I) 0 , . . ., a 8 displays moments of the marginal posterior distributions, i.e. mean, standard deviation and a 95% confidence interval of R, σ I and σ ω , obtained via BM2 for varying maximum polynomial degree K = 0, . . ., 20.For each K the last 40000 samples of three independently sampled Markov chains of length 100000 are used to approximate the moments.In contrast to BM1 and due to the increasing number of unknowns in BM2, a larger number of MH-MCMC samples is required in order to obtain convergence.For better comparison and interpretation of the results, the index BM 1 in the figure denotes the previous results without δ, obtained via Bayesian model 1.Following the guideline specified in Section 3 the marginal posterior distributions of the noise standard deviations of σ I and σ ω in Figure 8 are considered to pick an appropriate K.Both marginal posterior distributions π K (σ I |y) and π K (σ ω |y) stabilize for K ≥ 7 and are almost identical for K ≥ 9. I.e.mean and standard deviation of the marginal posterior distribution are almost identical for K ≥ 9.This indicates that K = 9 is a sufficient polynomial degree and increasing K further does not improve the results with the current specification of the model discrepancy term.Detailed results for K = 9 are presented later on.Now, assessing the marginal posterior distribution of R for varying K in Figure 8 one observes that for K = 6, . . ., 19 the marginal posterior mean of R roughly stabilizes in some kind of a plateau close to the reference R 0 , but the posterior standard deviation increases with K.For K = 0, . . ., 5 the model discrepancy term δ K (a) is not flexible enough to approximate the underlying discrepancy appropriate enough.As a consequence, the standard deviations of the measurement noise are overestimated and the estimation of the parameter R is still biased for those values of K. But, compared to the results from BM1, already K = 1 improves the inference.
In order to compare and evaluate the solutions we define error measures for the parameters and the model discrepancy term.Definition 4.2.Let x with posterior probability density π(x|y) be an estimation for an unknown x 0 ∈ R, for given data y.Denote x 0 as reference parameter.The mean square error (MSE) of x is given as with bias Bias(x) = (E[x|y] − x 0 ).The mean is with respect to π(x|y).In case of the posterior density where • is the L 2 (R 2M ) norm.With triangle and Cauchy Schwarz' inequality this can be bounded by  Error model complexity and backing up the decision of K = 9 as sufficient polynomial degree.Figure 9 largely corresponds to the bias-variance tradeoff, since the variance increases with K and the bias decreases with K, at least until K = 15.The variance increases due to the increased number of parameters, i.e. more degrees of freedom.This results in more uncertainty, i.e. variance, for each parameter by using the same amount of data.Furthermore, for large K overfitting starts, i.e. δ K (a) starts to oscillate and reproduce parts of the measurement noise.The bias decreases with increasing K, but only until K = 15, then it increases as another phenomena, namely non-identifiability, occurs additionally.for K = 0, . . ., 20 from BM2.For K ≥ 7 the MSE for both stays on the same level mainly as the variance stays on the same level.As for each K the same data y is used there is a point where the estimation of π K (σ I |y) and π K (σ ω |y) can not be improved further, without considering new additional data.Figure 11 displays the normed nBias K (δ * ), nV K (δ * ), nM SE K (δ * ) and nM SE K (δ * ) for the model dis- crepancy approximations δ * ∈ {δ K I (a), δ K ω (a)} depending on K, for K = 0, . . ., 20 from BM2. nBias K (δ * ) is minimal for K = 8.Here again, as in Figure 9, the effects of Bias-variance tradeoff and non-identifiability can be observed.
Especially for K = 20 overfitting and non-identifiability can be observed.For K = 20 the marginal posterior samples of R in Figure 12 do not converge any more, even not for an increased sample size of three independent chains with 500.000samples each.Consequently, the posterior distribution is spread out and biased.Also the posterior distribution of δ K (a) for polynomial degree K = 20, displayed in 0.07 0.08 0.09 0.10 0.11 0.12 0.13 Figure 13, is spread out and the reference discrepancy is not captured anymore.Only the bounds of the posterior distribution of δ K (a) tend towards the reference.This posterior distribution corresponds more to the discrepancy d( R) I that was still present after inference with BM1, see Figure 7. Furthermore, overfitting can be observed as the model discrepancy term tries to capture the high oscillations of the measurement noise.Looking at the posterior distributions of the polynomial coefficients in Figure 14 one observes that for I the coefficients a I 0 and a I 1 of the constant and linear polynomials are somehow specific and have small support, but for a ω 2 , a ω 3 and a ω 4 it is spread out and they are non-identifiable.All other concentrate around zero.For ω one observes flat posterior distributions for a ω 0 and a ω 1 , i.e. the coefficients of the constant and linear polynomials are non-identifiable.All other posterior distributions of the coefficients a ω j for j = 2, . . ., 20 concentrate around zero with decreasing support.However, the posterior distributions of σ I and σ ω (see Figure 12) concentrate at the reference values.This indicates that albeit the posterior distribution of R and δ K (a) are spread out, the sum G φ (R) + δ(a) + ε(σ I,ω ) still represents the data y well.
Finally, considering all these observations in Figure 8, and omitting the reference for a while, it makes sense to consider a polynomial degree K = 9.K = 7 would be already sufficient, but in order to allow a bit more flexibility for δ K (a) K = 9 is considered as a sufficient polynomial degree.Also by considering the reference K = 9 is confirmed to be a good choice.However, K = 15 might yield a better marginal posterior mean and has a comparable MSE, but the variance is also increased and the posterior distribution of δ 15 (a) already shows an oscillating behavior.Furthermore, in order to keep computational cost low a small K is preferred.Now, for BM2 and K = 9 the number of parameters to sample from is 3 + 2(K + 1) = 23.the posterior mean reduces around one order of magnitude compared to the result of BM1, see Table 1.Also, both noise standard deviations concentrate around the reference and significantly improve in relative errors compared to BM1. Figure 16  respect to the L 2 -norm, is 1.0e-01 for I and 9.2e-02 for ω.Also the marginal posterior distributions of the polynomial coefficients a in Figure 17 correspond to the reference coefficients.Overall, the posterior distribution of G φ (R) + δ K yields a good approximation of the measurements y with only a small variance band.

Conclusion
In this work a method to infer model parameters and model discrepancy is considered and applied to synthetic measurement data of an electric motor.The suggested Bayesian model 2 (BM2) considers measurement noise and model discrepancy, in order to separate these two sources of uncertainty and improve physical parameter estimation.The model discrepancy term is modeled as a truncated polynomial expansion δ K (a) with unknown coefficients a and an maximum polynomial degree K.The smoothness assumption, implied by the polynomials, restricts the flexibility of δ K (a) depending on K and introduces specific prior knowledge about the true underlying model discrepancy.This work discusses and then defines a guideline on how to define appropriate model discrepancy term complexity, i.e. here the maximum polynomial degree K of δ K (a) based on the marginal posterior distribution of measurement noise standard deviation.
The framework applied to the electric motor showed promising perspectives by an improved estimation of the model parameters.Furthermore a good approximation of the a-priori unknown model discrepancy is learned during BM2 for a sufficient maximum polynomial degree K = 9.For too small K the flexibility of δ K (a) is not sufficient and estimation is just slightly improved.Additionally, the results showed that if K is too large the prior contains to less information about the smoothness of the reference discrepancy and consequently the posterior distribution does not converge anymore.Consequently, in order to identify both, the underlying parameter value and the reference discrepancy of the test scenario, it is important to formulate at least for one of those unknowns a prior containing some information about the reference.Otherwise the problem is not identifiable.
For this example the model discrepancy δ was rather smooth.This might be not the case in other examples.If one knows that δ is non-smooth one might consider other representations for δ, such as radial basis functions, wavelets, Gaussian Processes or even neuronal networks.Here the question arises, if more flexibility implies non-identifiability.However, in our example, even for small K estimation could be improved, and in the case of no model discrepancy results do not degrade, if one uses an model discrepancy term.This might lead to the conclusion that using a model discrepancy term, even if it is not sufficiently modeled, does not degrade the results.This work showed that with an additional term for model discrepancy inference results can be improved and realistic information about the discrepancy can be collected, if some information about the parameters or the discrepancy is specified in the priors.However, computational costs increase drastically with the number of parameters for the model discrepancy term.As previously suggested, one might consider surrogate models to leverage this or improve sampling.
According to the definition of δ the resulting prior has a time dependent variance with high values at the boundaries.The posterior distribution of δ K (a) shows a similar behavior at the boundary T .This might be due to the prior definition.In order to get rid of the wide posterior distribution at T it might be necessary to define a prior for δ K (a) with (at least almost) constant variance over time.The solution proposed for the next problem, might also help in this case.
With the polynomial approach it is difficult to define the right polynomial degree K a-priori.If the maximal polynomial degree is to low, then it is not possible for δ K (a) to approximate the reference discrepancy d 0 appropriately.If the polynomial degree is to high, δ K (a) tends to overfit and tries to capture, in addition to d 0 , the highly oscillatory behavior of the measurement noise as well.The coefficients belonging to higher degree basis polynomials do not tend to zero as they actually should.A solution to this problem could be to model this in the definition of the polynomial coefficients prior distributions, i.e. by reducing the variance with increasing polynomial degree, e.g.V [a i ] ∼ 1/i, i = 1, . . ., N .
Next steps are to apply this approach to higher dimensional problems and further to real field data to test its capabilities.Then, as real data implies a more complex simulator, surrogate modeling will be mandatory to leverage computational expenses.Additionally, evidence approximation could be an another criteria to select a Bayesian model, which will be considered in future work.
and constant torque T L [N m] required by the load.For the following only the resistance R is considered uncertain and all other parameters are fixed.With the initial conditions I 0 = 0 and ω 0 = 0 and T L = 0 the model describes the start of an electric motor from resting position without load.For the following the parameter values are set to the following arbitrarily chosen values: R 0 = 0.1, c m = 0.01, c g = 0.01, U = 24, d = 10 −6 , L = 10 −4 , J = 5 * 10 −5 .

Figure 2 :
Figure 2: Reference discrepancy d 0 and d ε 0 of electric current I(t) and angular velocity ω(t).

Figure 5 :
Figure 5: Bias-variance tradeoff: This figure shows the contribution of bias and variance to the total error.From http://scott.fortmann-roe.com/docs/BiasVariance.html

Figure 6 :Definition 4 . 1 .
Figure 6: Posterior distributions and samples from BM1: The figures on the right hand side show marginal MH-MCMC samples of the parameters R, σ I and σ ω (from top to bottom).The figures on the left hand side show the kernel density estimates of the corresponding samples as an approximation of the marginal posterior distributions.The prior distributions are plotted as dotted black lines and the reference values as dashed red lines.

E
[x|y] , i.e. x = 1 L L l=1 x l , where x l , l = 1, . . ., L are a selection of MH-MCMC samples of the posterior density π(x|y).The marginal posterior mean values of R, σ I and σ ω and their respective relative error with respect to the reference values R 0 = 0.1, σ I,0 = 2 and σ ω,0 = 10 are displayed in

Figure 8 :
Figure 8: This figures show results obtained with BM2 for varying polynomial degree K = 0, . . ., 20.Moments (mean, mean +/− standard deviation and 95% confidence interval) of the marginal posterior distributions of R are displayed on the left hand side and moments (mean and mean +/− standard deviation) of σ I , σ ω on the right hand side.Reference values are plotted as dashed red lines.The index BM 1 denotes the previous results without δ, obtained via Bayesian model 1.

Definition 4 . 3 .
In order to have something similar for the model discrepancy estimations δ K (a) = [δ K I (a), δ K ω (a)] ∈ R 2M with the posterior density function π K (a|y) and with respect ot the reference discrepancy d 0 we construct for δ Here the normed bias nBias K (δ * ) := E K [δ * |y] − d 0 and variance nV K (δ * ) := V K [δ * |y] .The mean is with respect to π K (a|y).

Figure 9 displays
Figure 9 displays Bias K (R), variance V K [R|y] and mean square error M SE K (R) of R with respect to R 0 , depending on the model complexity, i.e. maximum polynomial degree K of the discrepancy term δ K (a).The MSE of R is minimal around K = 7, 8, 9 and K = 15, indicating those values as an optimal

Figure 9 :
Figure 9: Bias-variance tradeoff: Bias Bias K (R), variance V K [R|y] and mean square error M SE K (R) of R with respect to R 0 , depending on K, for K = 0, . . ., 20 from BM2.The index BM1 denotes the previous results without δ, obtained via Bayesian model 1.

Figure 10
displays bias, variance and MSE of σ I and σ ω with respect to the reference values σ I,0 , σ ω,0 ,

Figure 12 :
Figure 12: Posterior distributions and samples from BM1 and BM2 with K = 20: The figures on the right hand side show marginal MH-MCMC samples of the parameters R, σ I and σ ω (from top to bottom).The figures on the left hand side show the kernel density estimates of the corresponding samples as an approximation of the marginal posterior distributions.The results without model discrepancy term from BM1 are plotted in grey and the results with model discrepancy term from BM2 and polynomial degree K = 20 are black.The prior distributions are plotted as dotted black lines and the reference values as dashed red lines.

Figure 14 :
Figure 14: Marginal posterior distributions of polynomial coefficients a of δ K (a) for K = 20.The upper plot displays the marginal posterior distributions of a (I) j and the lower of a (ω) j , j = 0, . . ., K. The vertical lines are the coefficients a obtained from least squares fitting of δ K (a) to the reference discrepancy d 0 .

Figure 15 :
Figure 15: Posterior distributions and samples from BM1 and BM2 with K = 9: The figures on the right hand side show marginal MH-MCMC samples of the parameters R, σ I and σ ω (from top to bottom).The figures on the left hand side show the kernel density estimates of the corresponding samples as an approximation of the marginal posterior distributions.The results without model discrepancy term from BM1 are plotted in grey and the results with model discrepancy term from BM2 and polynomial degree K = 9 are black.The prior distributions are plotted as dotted black lines and the reference values as dashed red lines.

Figure 17 :
Figure 17: Marginal posterior distributions of polynomial coefficients a of δ K (a) for K = 9.The upper plot displays the marginal posterior distributions of a (I) j and the lower of a (ω) j , j = 0, . . ., K. The vertical lines are the coefficients a obtained from least squares fitting of δ K (a) to the reference discrepancy d 0 .
Compared to a Gaussian distribution, Laplace distribution enforces the zero a bit more and has slightly fatter tails.For a visual comparison of Laplace versus Gaussian j (t) 2 j (p j )

Table 1 .
The relative error rel ( R) is around 10%.Inserting the marginal posterior mean R ≈ E[R|y] of R in the simulation model G φ yields

Table 1 :
Marginal posterior mean and relative error of parameters R, σ I and σ ω for BM1 and