Unlike the extended ORKF (EORKF) [32], this study investigates only measurement outliers for a practical estimation approach in V-INS, for the following reasons. Since the measurement update is not performed at every time step, outlier detection based on each residual value cannot directly detect outliers in the IMU measurements. Furthermore, in the sequential measurement update, multiple residuals are computed at a single IMU time stamp. In other words, since only rare observations among the feature measurements from one image are corrupted by the remaining outliers, hypothesizing that the outliers come from the IMU may be faulty. Therefore, in the scope of this paper, we handle only measurement outliers.
4.3.1. Student’s t-Distribution
Although the true system contains outliers, the classical EKF assumes that each model in the filter is corrupted by additive white Gaussian noise. The noise levels are assumed to be constant and encoded by the sensor covariance matrices Q and R (i.e., ${\eta}_{k}\sim \mathcal{N}(0,\phantom{\rule{0.166667em}{0ex}}Q),\phantom{\rule{4pt}{0ex}}{\left({\zeta}_{j}\right)}_{k}\sim \mathcal{N}(0,\phantom{\rule{0.166667em}{0ex}}R)$). However, since outliers arise in realistic systems, we no longer restrict the noise to be constant or Gaussian. Instead, its level varies over time, or the noise has heavier tails than the normal distribution, as follows,
where
$\mathrm{ST}(\cdot)$ denotes a Student's t-distribution, and ${\nu}_{k}>m-1$ is the degrees of freedom. Covariance matrix ${\tilde{R}}_{j}$ follows the inverse-Wishart distribution, denoted as ${\mathcal{W}}^{-1}(\cdot)$. ${\Lambda}_{j}\succ 0$ is an $m\times m$ precision matrix.
In Bayesian statistics, the inverse-Wishart distribution is used as the conjugate prior for the covariance matrix of a multivariate normal distribution [20]. The probability density function (pdf) of the inverse-Wishart is
where
$\mathrm{tr}(\cdot)$ denotes the trace of a square matrix. Moreover, in probability and statistics, a Student's t-distribution is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population whose standard deviation is unknown [51]. Whereas a normal distribution describes a full population, a t-distribution describes samples drawn from a full population; thus, the larger the sample, the more closely the distribution resembles a normal distribution. Indeed, as the degrees of freedom go to infinity, the t-distribution approaches the standard normal distribution. In other words, when the variance of a normally distributed random variable is unknown and a conjugate prior that follows an inverse-Wishart distribution is placed over it, the resulting marginal distribution of the variable follows a Student's t-distribution [52]. The Student-t, a sub-exponential distribution with much heavier tails than the Gaussian, is therefore more prone to producing outlying values that fall far from its mean.
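This Gaussian-scale-mixture view can be checked numerically. The short sketch below (illustrative parameters only, not taken from the paper) draws variances from an inverse-gamma prior, the one-dimensional inverse-Wishart, and confirms that the resulting marginal samples exhibit Student-t rather than Gaussian tails:

```python
import numpy as np
from scipy import stats

# Hypothetical 1-D illustration: a zero-mean Gaussian whose variance is drawn
# from an inverse-gamma prior (the 1-D inverse-Wishart) marginalizes to a
# Student's t-distribution, which has far heavier tails than the Gaussian.
nu = 3.0                                   # small dof => heavy tails (assumed)
rng = np.random.default_rng(0)

# Sample variances from the conjugate prior, then conditionally Gaussian data.
var = stats.invgamma(a=nu / 2, scale=nu / 2).rvs(size=200_000, random_state=rng)
x = rng.normal(0.0, np.sqrt(var))

tail_emp = np.mean(np.abs(x) > 4.0)        # empirical mass beyond 4 sigma
tail_t = 2 * stats.t(df=nu).sf(4.0)        # analytic Student-t tail
tail_gauss = 2 * stats.norm.sf(4.0)        # Gaussian tail (much smaller)
print(tail_emp, tail_t, tail_gauss)
```

The empirical tail mass matches the analytic Student-t tail and exceeds the Gaussian tail by orders of magnitude, which is exactly the behavior that makes this noise model suitable for outliers.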
4.3.2. Variational Inference
The purpose of filtering is generally to find approximations of the posterior distributions $p\left({x}_{k}\phantom{\rule{0.166667em}{0ex}}\right|\phantom{\rule{0.166667em}{0ex}}{y}_{1:k})$, where ${y}_{1:k}=[{y}_{1},{y}_{2},\cdots ,{y}_{k}]$ is the history of sensor measurements obtained up to time k. For systems with heavy-tailed noise, we also wish to infer the covariance matrices whose priors follow the inverse-Wishart distribution. Hence, our goal in this section is to find approximations of both the posterior distribution $p({x}_{1:k},{\tilde{R}}_{1:k}|{y}_{1:k})$ and the model evidence $p\left({y}_{1:k}\right)$. Compared to sampling methods, the variational Bayesian method performs approximate posterior inference at low computational cost for a wide range of models [20,52]. In this method, we decompose the log marginal probability
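Under the standard variational Bayes identity (written here for reference, in the notation of the surrounding text), this decomposition takes the form

```latex
\ln p(y_{1:k})
= \mathcal{L}[q]
+ \mathrm{KL}\!\left[\, q(x_{1:k},\tilde{R}_{1:k}) \,\middle\|\, p(x_{1:k},\tilde{R}_{1:k}\,|\,y_{1:k}) \,\right],
\qquad
\mathcal{L}[q]
= \mathbb{E}_{q}\!\left[ \ln \frac{p(x_{1:k},\tilde{R}_{1:k},y_{1:k})}{q(x_{1:k},\tilde{R}_{1:k})} \right],
```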
where
p is the true distribution that is intractable for non-Gaussian noise models, and
q is a tractable approximate distribution.
In probability theory, a measure of the difference between two probability distributions
p and
q is the Kullback–Leibler divergence, denoted as
$\mathrm{KL}[\cdot]$. If we allow any possible choice for q, such as the Gaussian distribution, then the lower bound $\mathcal{L}\left[q\right]$ attains its maximum when the KL divergence vanishes; that is, $q({x}_{1:k},{\tilde{R}}_{1:k})=p({x}_{1:k},{\tilde{R}}_{1:k}\phantom{\rule{0.166667em}{0ex}}|\phantom{\rule{0.166667em}{0ex}}{y}_{1:k})$. To minimize the KL divergence, we seek a member of a restricted family of distributions $q({x}_{1:k},{\tilde{R}}_{1:k})$. Indeed, maximizing $\mathcal{L}\left[q\right]$ is equivalent to minimizing another KL divergence [52], and thus the minimum occurs for the factorized distribution $q({x}_{1:k},{\tilde{R}}_{1:k})=q\left({x}_{1:k}\right)\phantom{\rule{0.166667em}{0ex}}q\left({\tilde{R}}_{1:k}\right)$ when the following Equations (25) and (26) hold,
where
${\mathbb{E}}_{q(\cdot)}$ represents the expectation with respect to $q(\cdot)$. Assuming that initial state ${x}_{1}$ is Gaussian, the measurement updates with varying noise covariance $\mathbb{E}\left[\phantom{\rule{0.166667em}{0ex}}{\tilde{R}}_{t}^{-1}\phantom{\rule{0.166667em}{0ex}}\right]={\Lambda}_{t}^{-1}$, which closely resemble the EKF updates, solve Equation (25). Algorithm 2 describes the details of the updates.
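As a rough illustration of such an update (a minimal linearized sketch, not the paper's exact Algorithm 2; the function name and arguments are assumptions), the fixed covariance R of a standard EKF step is simply replaced by the current variational noise estimate:

```python
import numpy as np

def variational_measurement_update(x_hat, P, y, h, H, Lam):
    """One EKF-style measurement update in which the fixed noise covariance R
    is replaced by the variational estimate Lam (so E[R_tilde^{-1}] = Lam^{-1}).
    A sketch only: h is the measurement function and H its Jacobian at x_hat."""
    S = H @ P @ H.T + Lam                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x_new = x_hat + K @ (y - h(x_hat))         # corrected state estimate
    P_new = (np.eye(len(x_hat)) - K @ H) @ P   # corrected covariance
    return x_new, P_new
```

Inflating Lam relative to R shrinks the gain K, which is how the filter discounts suspect measurements while keeping the familiar EKF structure.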
Now let us assume that the true priors are IID noise models, as is the case in this study; that is, $p\left({\tilde{R}}_{k}\right)$ follows the ${\mathcal{W}}^{-1}(\nu R,\phantom{\rule{0.166667em}{0ex}}\nu )$ distribution. Then, the second term $\ln p\left({\tilde{R}}_{k}\right)$ on the right-hand side of Equation (26) is computed using the pdf of the inverse-Wishart distribution in Equation (21) with its prior noise model.
As this term is the conjugate prior for Equation (20), the approximation $q\left({\tilde{R}}_{k}\right)$ has the same mathematical form as the prior; that is, $q\left({\tilde{R}}_{k}\right)$ also follows the ${\mathcal{W}}^{-1}({\tilde{\nu}}_{k}\phantom{\rule{0.166667em}{0ex}}{\Lambda}_{k},\phantom{\rule{0.166667em}{0ex}}{\tilde{\nu}}_{k})$ distribution.
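This conjugacy is what keeps the update tractable: with a known-mean Gaussian likelihood, an inverse-Wishart prior maps to an inverse-Wishart posterior by simple parameter arithmetic. A generic sketch follows (the function name and the common $\mathcal{W}^{-1}(\Psi, \nu)$ parameterization are illustrative choices, not the paper's scaled convention):

```python
import numpy as np

def iw_posterior(Psi, nu, residuals):
    """Conjugate update for an inverse-Wishart prior IW(Psi, nu) on the
    covariance of zero-mean Gaussian residuals (shape: n_samples x m).
    The posterior is IW(Psi + S, nu + n), where S is the residual scatter."""
    S = residuals.T @ residuals            # scatter matrix of the residuals
    return Psi + S, nu + len(residuals)    # updated scale and dof

# Illustrative numbers: two 2-D residuals update the prior parameters.
Psi0, nu0 = np.eye(2), 5.0
r = np.array([[1.0, 0.0], [0.0, 2.0]])
Psi1, nu1 = iw_posterior(Psi0, nu0, r)
print(Psi1, nu1)
```

Because the posterior stays in the same family, each filter step only has to track a scale matrix and a degrees-of-freedom count rather than a full distribution.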
As
${y}_{t}\phantom{\rule{0.166667em}{0ex}}|\{{x}_{t},{\tilde{R}}_{t}\}\sim \mathcal{N}(h\left({x}_{t}\right),{\tilde{R}}_{t})$,
From Equations (26)–(29), to handle measurement outliers, similar to the derivation of Agamennoni et al. [18,19], we derive how to compute precision matrix ${\Lambda}_{k}$ of the approximate distribution $q\left({\tilde{R}}_{k}\right)$ of ${\tilde{R}}_{k}$ as follows,
where each feature from one image is independent and
Next, in Equation (30),
where estimation error ${e}_{j}={x}_{k}-{\left({\widehat{x}}_{k}\right)}_{j}$ and Jacobian ${C}_{j}=\frac{\partial {h}_{j}}{\partial x}{|}_{{\left({\widehat{x}}_{k}\right)}_{j}}$. In the sequential measurement update, ${\left({\widehat{x}}_{k}\right)}_{j}$ and ${\left({P}_{k}\right)}_{j}$ are corrected by Kalman gain ${K}_{j}$, which is a function of ${\left({\Lambda}_{j}\right)}_{k}$, so these update steps are coupled. Hence, no closed-form solution exists, and we can only solve iteratively. The purpose of the iteration is similar to that of the online learning of the unknown variances of each noise [10]. In addition, similar to Agamennoni et al.'s interpretation [19], the convergence and optimality of the derived update steps for outliers are guaranteed since the variational lower bound is convex with respect to ${\left({\widehat{x}}_{k}\right)}_{j}$, ${\left({P}_{k}\right)}_{j}$, and ${\left({\Lambda}_{j}\right)}_{k}$. In particular, as the j-th feature is observed infinitely many times (i.e., ${\nu}_{j}\to \infty $), ${\Lambda}_{j}$ converges to R in the limit of an infinitely precise noise distribution, so the iterative update steps reduce to the standard sequential measurement update of the EKF.
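The coupled iteration can be sketched for a linear measurement model as below (a simplified single-measurement version in the spirit of Agamennoni et al.'s outlier-robust update, not the paper's exact sequential algorithm; the moment-update line is an assumed form of the precision update):

```python
import numpy as np

def robust_update(x_hat, P, y, H, R, nu, iters=10):
    """Iterative robust measurement update (sketch; linear h(x) = Hx assumed).
    Alternates (i) an EKF-style state update using the current noise estimate
    Lam and (ii) re-estimating Lam from the residual statistics."""
    n, Lam = len(x_hat), R.copy()
    for _ in range(iters):
        # (i) EKF-style correction with the current noise estimate Lam
        S = H @ P @ H.T + Lam
        K = P @ H.T @ np.linalg.inv(S)
        x_new = x_hat + K @ (y - H @ x_hat)
        P_new = (np.eye(n) - K @ H) @ P
        # (ii) moment update: residual outer product plus propagated
        # uncertainty, blended with the prior R through the dof nu
        r = y - H @ x_new
        Lam = (nu * R + np.outer(r, r) + H @ P_new @ H.T) / (nu + 1.0)
    return x_new, P_new, Lam
```

Consistent with the limit discussed above, a very large nu pins Lam to R and the loop behaves like a plain EKF update, while a small nu lets a large residual inflate Lam and down-weight the measurement.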
If true state $x\left({t}_{\mathrm{img}}\right)$ is significantly different from its estimate ${\left({\widehat{x}}_{k}\right)}_{j}$, then the statistic $\mathbb{E}\left[\phantom{\rule{0.166667em}{0ex}}{\left({\zeta}_{j}\right)}_{k}\phantom{\rule{0.166667em}{0ex}}{\left({\zeta}_{j}\right)}_{k}^{\mathrm{T}}\phantom{\rule{0.166667em}{0ex}}\right]$ dominates, and ${\left({\Lambda}_{j}\right)}_{k}$ becomes much larger than R. This ${\zeta}_{j}$ is regarded as a measurement outlier at time k. As Kalman gain ${K}_{j}$ is a function of the inverse of precision matrix ${\left({\Lambda}_{j}\right)}_{k}$, the larger ${\left({\Lambda}_{j}\right)}_{k}$ is, the smaller the Kalman gain. Therefore, to deal with situations where measurement outliers occur, the iteration over Equations (30) and (31) corrects the state estimates and their covariance with low weights.
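This down-weighting mechanism is visible directly in the scalar case (illustrative numbers only): as the learned noise level grows, the gain, and hence the correction applied by an outlying measurement, shrinks toward zero.

```python
# Scalar illustration (assumed values): the Kalman gain K = P*H / (H*P*H + Lam)
# shrinks as the learned measurement-noise level Lam is inflated by an outlier.
P, H = 1.0, 1.0
gains = [P * H / (H * P * H + Lam) for Lam in (1.0, 10.0, 100.0)]
print(gains)  # weights decrease as Lam grows
```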