Bayesian Robust Multivariate Time Series Analysis in Nonlinear Models with Autoregressive and t-Distributed Errors †

: Many geodetic measurement data can be modelled as a multivariate time series consisting of a deterministic (“functional”) model describing the trend, and a stochastic model of the correlated noise. These data are also often affected by outliers and their stochastic properties can vary signiﬁcantly. The functional model of the time series is usually nonlinear regarding the trend parameters. To deal with these characteristics, a time series model, which can generally be explained as the additive combination of a multivariate, nonlinear regression model with multiple univariate, covariance-stationary autoregressive (AR) processes the white noise components of which obey independent, scaled t-distributions, was proposed by the authors in previous research papers. In this paper, we extend the aforementioned model to include prior knowledge regarding various model parameters, the information about which is often available in practical situations. We develop an algorithm based on Bayesian inference that provides a robust and reliable estimation of the functional parameters, the coefﬁcients of the AR process and the parameters of the underlying t-distribution. We approximate the resulting posterior density using Markov chain Monte Carlo (MCMC) techniques consisting of a Metropolis-within-Gibbs algorithm.


Introduction
Adjustment calculus offers a rich toolbox of statistical models and procedures for parameter estimation and hypothesis testing based on given numerical observations (cf.[1]).Such models usually consist of a deterministic functional model (e.g., a linear model describing some trend function), a correlation model (e.g., in the form of a variance-covariance matrix or an autoregressive (AR) error process), and a stochastic model (i.e., a probability distribution of the observation errors or the innovations of the AR error process).The stochastic model is often taken to be some multivariate normal distribution, which, however, easily leads to erroneous estimation results if the observations are afflicted by outliers.To take outliers into account, the normal distribution can be replaced by some outlier distribution, for example, a heavy-tailed t-distribution (cf.[2]).A multivariate time series model, including a nonlinear functional model and an autoregressive observation error model with t-distributed innovations, was suggested and investigated in [3] and [4].A shortcoming of that model is that it does not include prior knowledge about the parameters of the functional, correlation or stochastic model, the information about which may readily be available.Therefore, the current paper describes a Bayesian extension of that time series model, which can be expected to result in more robust and more accurate parameter estimates (cf.[5]).
A general Bayesian estimation approach in the specific context of models based on the t-distribution was introduced by [6].Due to the complexity of such a model, the posterior density function must be approximated numerically or analytically.For numerical approximation, Monte-Carlo (MC) simulation and, in particular, Markov-Chain Monte-Carlo (MCMC) methods, which are suitable also for multivariate distributions, have been applied routinely (cf.[7]).In particular, the Gibbs sampler and the Metropolis-Hastings algorithm have been employed for (non-robust) Bayesian estimation of the parameters of a linear functional model with autoregressive moving-average (ARMA) and normally distributed errors [8].MCMC methods have also been applied in the context of the robust Bayesian estimation of ARMA models [9] and AR models [10], with one additional (directly observed) mean parameter in the functional model.In both studies, outliers within the auto-correlated errors and within the uncorrelated innovations were modeled as normally distributed random variables with variances inflated by unknown multipliers.Thus, the stochastic error model was based on a discrete mixture of normal distributions, not on the t-distribution.To incorporate an automatic model selection procedure regarding the AR/ARMA model into the adjustment, the preceding studies also included unknown index parameters, taking the value 0 in case the corresponding AR (or MA) coefficient is 0 (or not significant) and taking the value 1 otherwise.Prior distributions for all of the parameter groups and the likelihood function for the data were fixed, and sampling distributions were then derived in order to obtain a numerical approximation of the posterior distribution for all the unknowns.In [11], an MCMC-based computational algorithm was proposed, to facilitate Bayesian analysis of real data when the error structure can be expressed as a p-order AR model.
The paper is organized as follows: First, the Bayesian multivariate time series model with AR and t-distributed errors is described in detail in Section 2. It is shown how the generic deterministic functional model, the AR process and the t-distribution model are first combined to a likelihood function and how prior information about the model parameters to be estimated is taken into account by means of a specified prior density.Here, we denote unknown parameters with Greek letters, random variables with calligraphic letters, and constants with Roman letters.Furthermore, we distinguish between a random variable (e.g., L t ) and its realization (l t ).Matrices are shown, as usual, as bold capital letters and vectors as bold small letters.Section 3 outlines an MCMC algorithm for determining the posterior density of the unknown parameters of the functional model, the coefficients of the AR process and the scale parameter, as well as the degrees of freedom of the t-distribution.In Section 4, a time series model for GNSS observations of a circle in 3D is proposed, and the results of a Monte Carlo simulation are discussed.These findings are used to evaluate the performance of the implemented Metropolis-Hastings-within Gibbs algorithm in this scenario.

The Bayesian Time Series Model
We assume that an N-dimensional time series ) is observed at equi-spaced time instances t without data gaps.The observation model consists of the three interconnected model components, where (1) defines the "observation equations", (2) the "error equations" and (3) defines the probability distribution of the innovations.The parameters of this observation model are combined within the vector, with On the one hand, the parameters θ are treated as variables of the likelihood function f L|Θ (L|θ), defined by the observation model ( 1)-(3).On the other hand, the parameters θ are viewed as a realization of a random vector Θ, having a specified pdf independent of the observables.According to the Bayes theorem, this prior density f Θ (θ) and the likelihood function f L|Θ (L|θ) are connected to the posterior density f Θ|L (θ|L) via proportionality relationship which serves as the foundation of the inference of the parameters and the adjustment of the observations.The details of this model are described in the following.
The observation equations: Equation (1) reflects the idea that geodetic measurements L k,t are approximated by a "deterministic" model using mathematical functions h k,t (β), which are assumed to be partially differentiable.The index k refers to the time series surveyed by the kth sensor or sensor component, and the time instances t = 1, . . ., n are the same for all sensors.In some applications, the functional model h k,t (β) takes the form h k,t (β) = X k,t β, (10) of a "linear model", where X denotes the design matrix and has a full rank.Since geodetic observables can generally not be modeled using a deterministic model alone, random deviations E k,t are added to absorb the remaining effects.It is assumed that the instruments used to survey the observables are calibrated, so that no systematic errors occur.Thus, the expected values of the random deviations are assumed to be 0.
The error equations: Equation ( 2) is included to take account of auto-correlations within each of the N time series.Since the different sensors or sensor components may have different noise characteristics, AR processes, with individual orders p 1 , . . ., p N and sets of coefficients α 1 , . . ., α N , are selected.The noise characteristics are assumed to be constant throughout the measurement period.For practical purposes, the AR processes considered are therefore required to be (asymptotically) covariance-stationary.The random variables U k,t are referred to as "innovations".Since the observation window is finite, ranging from t = 1, ..., n, the error equations involve errors at times t = 0, −1, .... To ensure asymptotic covariance-stationarity and the computability of the recursive equations, these quantities are set as equal to 0 (cf.[12]).This initial distortion of the AR process fades out as the process advances in time.
The stochastic model: The innovations of an AR process are usually assumed to be Gaussian white noise.Since the assumption of normal distributions is unrealistic in some geodetic applications, for example, due to outliers, the heavy-tailed t-distributions are employed here.These are defined by the probability density function (pdf), where Γ is the gamma function.Since the expected values of the random deviations E k,t should be 0, we can also restrict the location parameter µ to 0. Since the noise of different sensors or sensor components may exhibit different levels of variance and outliers, each time series involves a t-distribution with individual scale parameter ψ 2 k and degree of freedom (df) ν k .It should be mentioned that the alternative usage of a multivariate t-distribution (as defined in [2]) involves a single df and would therefore not allow for the modeling of distinct outlier characteristics within the different time series.
The likelihood function: A likelihood function f L|Θ (L|θ) can be obtained by combining the observation Equation ( 1), the error Equation ( 2) and the stochastic model of the innovations (3).To do so, the well-known method of conditional likelihoods in connection with AR processes with various forms of non-Gaussian innovations is applied (cf.[13][14][15]).Assuming the AR processes to be invertible, the error Equation ( 2), in terms of their numerical realizations, can be rewritten as "innovation equations," As the errors e k,t contained in the observation Equation ( 1) can be expressed as the innovation Equation ( 12) become The conditional likelihood function is then obtained as the product of the univariate pdf (11), evaluated at all the stochastically independent innovations u k,t with location µ = 0, associated scale factor ψ 2 k and df ν k , that is, For the purpose of maximum likelihood (ML) estimation, the logarithm of this likelihood function is easier to handle (see [3]).In that contribution, a computationally convenient ML estimation of the model parameters was achieved by rewriting the t-distributions as conditional normal distributions with latent variables; these variables play the role of weights in an iteratively reweighted least-squares algorithm.As the main innovation of the current contribution, the likelihood function ( 15) is incorporated into a Bayesian model instead, which is described in the following.
The Bayesian model: In this contribution, both informative and non-informative prior information is considered.The result of using a fully non-informative prior is that the posterior density follows directly from the likelihood function.In the case of an informative prior, a joint pdf must be specified for the random vector Θ.This task is simplified by the assumption of stochastic independence of the parameter groups β, α, ψ and ν, so that the factorization property, holds.Consequently, individual prior densities can be specified for these parameter groups.
As the prior density of β depends on the selected functional model h(β), its specification is fixed after the introduction of the application example in Section 4. In contrast, the prior densities for α, ψ and ν do not depend on the choice of the functional model but mainly on the precision of the sensors or instruments employed.In the case that no prior information is available for the employed sensors or instruments, it is still possible to define prior densities for these three groups of parameters.Due to the assumption, in connection with the error Equation ( 2) and the stochastic model (3), that the N time series is stochastically independent, the prior density can be further simplified to Consequently, as far as the parameters ψ and ν are concerned, only univariate prior densities f Θ (ψ k ) and f Θ (ν k ) need to be specified.When it is known that the scale factor ψ is between ψ min and ψ max , the prior density defining the continuous uniform distribution U(ψ min , ψ max ) can be used as a weak form of prior information.The specification of the prior for the df ν k can be based, on the one hand, on the requirement ν k > 2, so that the variance of the t-distributed random variables is defined.On the other hand, the t-distribution is practically indistinguishable from a normal distribution for dfs greater than 120 (cf.[16]), so that the upper limit ν k ≤ 120 can be fixed.In the absence of further information about the dfs, the prior density defining the continuous uniform distribution U(2, 120) is reasonable.The auto-correlations of a time series may, for instance, be induced by calibration corrections within the measurement device, by movements of the measured object, or by a combination of the two effects.Therefore, a general definition of the prior density of the AR coefficients is not trivial.For this reason, a non-informative prior density is specified under the additional assumption that all of the AR coefficients are stochastically independent.

The Developed MCMC Algorithm
Because of the use of the Student distribution for the white measurement noise (Equation ( 15)), an analytical solution of the posterior density based on the Bayes theorem (Equation ( 9)) is not possible, so it can only be solved numerically.The general solution approach is based on generating a so-called Markov Chain for the unknown posterior density using the MCMC method.MCMC algorithms are commonly used in all fields of statistics because of their versatility and generality.When an MCMC method is applied to solve the posterior density function given in (9), it is usually realized with a Gibbs sampler.An implementation of a Gibbs sampler relies on the availability of the complete conditional pdfs of all parameters of interest in our particular problem (cf.Equation ( 4)).However, the complete conditional pdfs of all parameters of interest are not readily available.In such cases, a Metropolis-Hastings (MH) method can be incorporated within a Gibbs sampler to draw samples from the parameters, the full conditional pdf of which cannot be analytically determined.In this paper, we demonstrate the development of such an algorithm, known as Metropolis-Hastings-within Gibbs.In this algorithm, the Gibbs sampler is used to generate the Markov Chain for θ|L, and within the Gibbs sampler the MH algorithm is used to generate random numbers.For a clearer presentation, the solution algorithm is encapsulated by two separate functions.For the Gibbs sampler, the outer function is given by, for: j = 1, . . ., M Draw where M is the length of the MCMC.Such a Markov chain ensures the convergence of the distribution of the samples to the target distribution after a few burn-in periods V [17].
We observe that the full conditional pdf shown in (20) does not fit to any known pdf and, therefore, we cannot directly draw samples from it.However, there are two challenges to calculating the conditional posterior densities.The first challenge is that it results from the likelihood function and the prior density.Consequently, changing the distributional assumption for the prior density results in a new conditional posterior density.While this challenge can be overcome with a small amount of effort, the second challenge is much more fundamental.For the calculation of the conditional posterior densities, several integrals have to be solved and this may not be analytically possible.To overcome these challenges, the MH algorithm is used to draw the required random numbers.The general algorithm for drawing a random number θ j i from the posterior density f Θ|L (θ|L) follows from the following steps: where m in Equation ( 22) denotes the length of the parameter vector.The results of the Metropolis-Hastings-within Gibbs are M random realizations of the unknown parameters β, α, ψ and ν from the posterior density f Θ|L (θ|L).The estimated values θ for the unknown parameters with their variance-covariance matrix (VCM) result from (cf. [5]):

The Framework of the Simulation
In our Closed Loop Monte Carlo simulation (CLS), we rely on the real-world application demonstrated in [3], in which we used a multi-sensor-system (MSS) composed of a laser scanner and two firmly attached pieces of GNSS equipment (see [18] for details).We consider a multivariate, non-linear regression model in terms of a circle in N = 3 dimensions that has the following six parameters: two for the orientation (ϕ and ω) of its unit normal vector; one for the radius (r); and three for the circle center (c x , c y , c z ).The observable 3D circle points are described by l x,t := l 1,t = h 1,t (β) = r cos (κ t ) cos (ϕ) + c x (23) l y,t := l 2,t = h 2,t (β) = r cos (κ t ) sin (ϕ) sin (ω) + r sin (κ t ) cos (ω) + c y (24) The random deviations E t were generated by the AR( 2) processes with true coefficients The innovations of these processes are sampled from the scaled t-distributions with true scale parameters and true dfs In Equation ( 16), the prior density has been introduced for the general Bayes model.In this simulation, only prior densities for the functional parameters Θ (see Equation ( 26)) are assumed to be known: In Equation ( 29), we assume that the prior information for the center of the circle, the radius and the angles are independent of each other.The reason for this assumption is that this information is obtained from data sheets or from calibrations.These are specified as follows.
The prior density of the circle center: The prior for the center of the circle is the knowledge that it must be located approximately in the middle between the observations constituting a circle.The location parameter µ c for the definition of the prior density of the circle center is thus dependent on the observations.Therefore, we consider the prior of the circle center as weak prior information.Hence, we use the identity matrix as a VCM for Σ c .The uncertainty of a coordinate component of the prior information of the circle center is thus significantly larger than the assumed measurement precision of a single observation.As a pdf for the prior f Θ c x , c y , c z , the multivariate normal distribution is assumed with the expected value µ c and VCM Σ c .
The prior density for the radius: The prior information for the radius of the circle is the result of a calibration measurement using a laser tracker.The manufacturer specifies the accuracy of a single point measurement with this instrument as the maximum permissible error of MPE x,y,z = ±15µm + 6 µm m .During the measurement, the diameter of the wing, on which the GNSS antenna was mounted, was determined.For this purpose, the 3D coordinates were measured on the left and right hand side and the Euclidean distance was calculated, which corresponds to the diameter.In total, the radius was estimated four times in this way.By averaging these results, µ r = 30.0026was obtained as the location parameter for the prior density of the radius.The scale parameter for the prior is the result of the standard deviation σ r = 0.0043 of the four determined radii of the laser tracker measurement.The measurement results of the laser tracker are assumed to be normally distributed.Hence, the normal distribution is assumed for the prior density of the radius of the circle.
The prior density for the rotation angles: The prior information for the rotation angles is derived from the instrument's levelling.The manufacturer specifies the accuracy of the bubble level as ±0.0047 rad for a 99.9% confidence interval.It is assumed that this information refers separately to one vial axis, so that f Θ (ω) and f Θ (ϕ) are independent of each other.Due to the specification of the accuracy by means of the confidence interval, the uniform distribution is used as the prior family for the angles.The limit of the confidence interval is used as the limit for the uniform distribution, from which follows a ω = −0.0047rad and b ω = 0.0047 rad.This results in the prior density: For ϕ, the density of the prior is identical to that of the previous formula.

Results of the Simulation
The results described in this section were achieved with M = 10, 000 iterations and a burn-in period with V = 3000 for the MCMC algorithm.We compare in the following three estimation procedures: Bayesian informative (Bayes Inf) using prior knowledge about the functional parameters β, Bayesian non-informative (Bayes Non-Inf) without using prior knowledge about β and the EM (expectation-maximization) algorithm developed in [3].Before the results of the three estimation methods from the entire CLS are compared, the result of the Bayes Non-Inf solution is considered in more detail.For this purpose, the result of the Markov chains from a single simulation solution is considered in Figure 1.In total, Markov chains were generated for 18 unknown parameters.Figure 1 only shows a representative selection of the results and is limited to the results of the z-component and only one rotation angle.For the other components, the generated chains correspond to the behavior of the z-component.The green dashed line in Figure 1 shows the true value of the parameters used to generate the simulated observations.The red dashed line, or the red cross on the secondary diagonal, show the estimated parameters resulting from the Markov chains.On the diagonal, the distribution of the generated random numbers of the chain is shown as a histogram.The histograms show how the generated random numbers scatter around the estimated parameter and that the true value is close to the estimated parameter.The secondary diagonal of the figure represents the scatter of the generated random numbers depending on two parameters.From this distribution, the correlation of the generated chain between two parameters can be calculated.A dependency can be seen, especially for the scaling parameter and the df of the z-components.This also holds for the coefficients of the AR(2) process of the z-component.The same can be noticed for the xand y-components.For all other parameters, the major axes of the ellipse are rather parallel to the axes of the expected values, which corresponds to a correlation around zero.The fact that, for example, the scale parameter and the df show stronger correlation behavior is also to be expected.The reason for this is that, with a smaller scale factor, more observations lie in the tails of the distribution, which leads to a smaller df.In contrast, if the scaling factor is large, there are fewer observations in the tails of the distribution, so a larger df can be selected.
To compare the results of the three approaches, the adjusted observations l are used instead of the estimated parameters β.The reason for this is that, in all estimation procedures, the estimated parameters are closer to the nominal value of the simulation parameters and, therefore, it is difficult to judge which approach produces the best results.On the other hand, the predicted observations l include the cumulative estimation uncertainties of all parameters β, allowing for an easier comparison.Furthermore, we restrict ourselves here to the representation of the z-component because lz is most sensitive to inaccuracies in the estimated angles ω and φ.The results for the x-and y-components show a result similar to that of the model shown in Figure 2 for the z-component.
Before discussing the results of the three approaches, Figure 2 is first explained.The l were calculated with the estimated parameters β using Equation (25).Subsequently, the lz were reduced by the true value for the observations E(l z ), which is why the predicted observations scatter around 0. The dashed line is the mean value of observation l z,t , which results from the 10,000 results of the CLS for one observation t.The dashed blue line of the mean value of the EM algorithm cannot be seen in the figure because it is overlaid by the dashed red line.The colored area shows, for observation t, the 95% confidence interval that results from the 10,000 predicted observations lz,t .The colored lines >0 show the maximum deviation from the true value for the observation lz,t that appeared in the 10,000 simulations.The colored lines <0 show the minimum deviation from the true value for the corresponding approaches.It is expected that the result of the EM algorithm is identical to the Bayes Non-Inf solution.This can be seen clearly in the result of the mean, where the two dashed lines (blue and red) overlap almost completely and only deviate from each other by a maximum of ≈0.002 cm.Furthermore, the mean values of the two estimation procedures are identical to the nominal value of 0 for all predicted observations within 10 −2 cm.This deviation can be explained by the fact that the CLS was only performed 10,000 times and not infinitely often.For the confidence interval of the EM and Bayes Non-Inf solution, a similar behavior can be seen as for the mean value.For most observations t, the two confidence intervals overlap almost perfectly and, only for a few observations, a difference of at most ≈0.05 cm can be identified.However, this identical behavior is not seen in the maximum and minimum deviations of the blue and red lines, where the Bayes Non-Inf solution more often has smaller minimum deviations than the EM results.The reason for this has not yet been analysed in more detail and will be addressed in future work.
In the result of the mean value of Bayes Inf, a deviation from the mean value of EM and Bayes Non-Inf can be seen.The solution of the mean value of Bayes Inf oscillates cyclically around the nominal value of 0 with a maximum deviation of ±0.05 cm, whereby this deviation is smaller by a factor of 10 than the scale factor ψ z of the measurement noise.The mean values of the EM and Bayes also oscillate around 0, but this is smaller by a factor of 100 and therefore cannot be seen visually in Figure 2. The influence of the prior information f Θ (β) on the observations lz,t can be seen well in the confidence interval and the lines marked in black.These are clearly closer to the true value than in the EM algorithm and Bayes Non-Inf.It is of particular interest that the maximum deviation lies in the 95% confidence interval of EM and Bayes Non-Inf.This is not always the case for the minimum deviation of Bayes Inf, but the black line is still closer to zero than the blue and red lines.The reason for this is mainly due to the prior information for the angles f Θ (ω) and f Θ (ϕ), which improves the estimation of ω and ϕ.
To compare the results lx , ly and lz of the three approaches, the RMSE of the true observations is calculated for each CLS result.From these RMSE values, the statistical measures in Table 1 were calculated for each approach.All results in the table show the same behavior as the results previously displayed in Figure 2. The EM and Bayes Non-Inf results are almost identical, with the small deviation explained by the different methods used to estimate the parameters.In the Bayes Inf solution, however, all statistical measures are smaller.For the mean, the RMSE of Bayes Inf is about 37% smaller than the RMSE of the EM algorithm, and for the median the difference is slightly larger, at about 41%.Only for the maximum RMSE value from the 10,000 simulations is the difference between EM and Bayes Informative significantly smaller at about 13%, but the maximum RMSE of Bayes Inf is still significantly smaller than the result from EM.

Conclusions
To achieve an Bayesian adaptive robust adjustment of a multivariate regression time series with outlier-afflicted/heavy-tailed and autocorrelated errors, we described the theory and implementation of an MCMC based approach consisting of a Metropolis-within-Gibbs algorithm.In particular, the Gibbs sampler and the Metropolis-Hastings algorithm have been employed for robust Bayesian estimation of the parameters of a non-linear functional model with AR and t-distributed errors.An advantage of this procedure compared to the EM algorithm, besides the capability to process additional prior knowledge, is that the approximation of the posterior model parameters is feasible without linearization of the functional model.Furthermore, the approximation of the VCM Σθ of the estimated parameters can be derived directly from the generated chains.CLS showed that the bias of the parameter estimates and the adjusted observations, as well as the RMSE, are significantly reduced when an informative Bayesian approach is used, but only under the condition that good prior information, that is, information that contains the true value, is assumed to be known.Otherwise, the prior can also have the opposite effect and may cause the estimated parameters to deteriorate.

with t = 1 ,
. . ., 2000 and where κ t = 2π 200 • t represents fixed rotation angles around the z-axis.In this simulation, the functional parameters are the circle parameters β, which were assumed to take the true values β = c x c y c z r ω ϕ T = 1716.03012.0 1064.0 30.0 0.0019 rad −0.0013 rad T .(

Figure 1 .
Figure 1.Result of the generated Markov chain after burn-in for Bayes Non-Inf: The main diagonal shows the distribution of the generated random numbers of a parameter.The secondary diagonal shows the correlation behavior of the generated random numbers at time j between two parameters.

Figure 2 .
Figure 2. Result of the 10,000 CLS for the 2000 predicted observations of the z-component lz of the EM algorithm, Bayes non-informative (Bayes Non-Inf) and Bayes informative (Bayes Inf).

Table 1 .
Root-Mean-Square-Error (RMSE) for the predicted observations to the true observations.