2.1. Bayesian State-Space Models
The framework used for this analysis will be that of a general Bayesian state-space model. Such a model is fully defined in terms of its transition and observation densities ${f}_{\mathbf{\theta}}(\mathbf{x}_{t+1} \mid \mathbf{x}_{t}, \mathbf{u}_{t})$ and ${g}_{\mathbf{\theta}}(\mathbf{y}_{t} \mid \mathbf{x}_{t}, \mathbf{u}_{t})$, dependent on a set of parameters $\mathbf{\theta}$. The transition density defines, in a probabilistic sense, the distribution over the hidden states at time $t+1$ given the states at time $t$ and any external inputs $\mathbf{u}_{t}$ at time $t$. The observation density defines the distribution of the observed quantity $\mathbf{y}_{t}$ at time $t$ given the states $\mathbf{x}_{t}$ and external inputs $\mathbf{u}_{t}$, also at time $t$. This framework allows a model to be built for evolving sequences of states $\mathbf{x}_{1:T}$ from the initial prior states to those at time $T$, where the notation $(\,\cdot\,)_{a:b}$ indicates the variable from time $a$ to time $b$ inclusively. One important property of this model is that it obeys the Markov property, i.e., the distribution of the states at any point in time $t$ only needs to be conditioned on time $t-1$ in order to capture the effect of the full history of the states from time $0$ to $t-1$.
So far, no restriction has been placed on the form of the above densities. It is possible to define the transition in terms of any distribution, although it can be useful to consider the transition in terms of a function which moves the states from time $t-1$ to $t$, and a process noise. Likewise, for the observation, it is possible to consider a function which is corrupted with some measurement noise. The generative process can be written as,
It is also possible to represent the structure of this model graphically, as in Figure 1. The graphical model is useful in revealing the Markov structure of the process. There are broadly three tasks associated with models of this form: prediction (simulation), filtering and smoothing. These are concerned with recovering the distributions of the hidden states $\mathbf{x}_{1:t}$ given differing amounts of information regarding $\mathbf{y}$. The prediction task is to determine the distribution of the states into the future, i.e., predicting $p(\mathbf{x}_{t} \mid \mathbf{y}_{1:t-k}, \mathbf{u}_{1:t})$ for some $k$. Filtering considers the distribution of the states given observations up to that point in time, $p(\mathbf{x}_{t} \mid \mathbf{y}_{1:t}, \mathbf{u}_{1:t})$. Smoothing infers the distributions of the states given the complete sequence of observations, $p(\mathbf{x}_{t} \mid \mathbf{y}_{1:T}, \mathbf{u}_{1:T})$, where $T \ge t$.
By restricting the forms of the transition and observation densities to be linear models with additive Gaussian noise, it is possible to recover closed-form solutions for the filtering and smoothing densities. These solutions are given by the now-ubiquitous Kalman filtering [14] and Rauch–Tung–Striebel (RTS) smoothing [15] algorithms. With these approaches, it is possible to recover the densities of interest for all linear dynamical systems. However, when the model is nonlinear or the noise is no longer additive Gaussian, the filtering and smoothing densities become intractable.
In order to solve nonlinear state-space problems, a variety of approximation techniques have been developed; these include the Extended Kalman Filter (EKF) [16,17] and the Unscented Kalman Filter (UKF) [18]. Each of these methods, and others [19], approximates the nonlinear system by a Gaussian distribution at each time step. This approach can work well; however, it will not converge to complex (e.g., multimodal) distributions in the states. An alternative approach is one based on importance sampling, commonly referred to as the Particle Filter (PF).
The particle filtering approach will be used in this paper; a good introduction to the method can be found in Doucet and Johansen [20], but a short overview is presented here. The task is to approximate the filtering density of a general nonlinear state-space model as defined in Equation (1). Importance sampling forms a representation of the distribution of interest as a set of weighted point masses (particles), which approximate the distribution in a Monte Carlo manner. The distribution of the states $\pi(\mathbf{x}_{1:t})$ (the conditioning on $\mathbf{y}_{1:t}$ is not explicitly shown here but will still exist) is approximated as,

with $N$ particles ${\delta}_{\mathbf{x}_{1:t}^{i}}(\mathbf{x}_{1:t})$ and corresponding importance weights ${w}_{t}^{i}$. The importance weights are normalised, such that the sum over the weights equals unity, from a set of unnormalised importance weights ${\tilde{w}}_{t}^{i}$ at time $t$. These unnormalised weights are calculated as the ratio of the unnormalised posterior likelihood of the states ${\gamma}_{t}(\mathbf{x}_{1:t})$ (for clarity, while $\pi(\mathbf{x}_{1:t})$ is the object of interest, normally there is only access to ${\gamma}_{t}(\mathbf{x}_{1:t})$, where $\pi(\mathbf{x}_{1:t}) = {\gamma}_{t}(\mathbf{x}_{1:t})/Z_t$; in other words, the posterior distribution of interest is only known up to a constant) and a proposal likelihood ${q}_{t}(\mathbf{x}_{1:t})$ at each point in time,
The importance weights can be computed sequentially, such that,
Applying this sequential method naïvely gives rise to problems where the variance of the estimation increases to unacceptable levels. To avoid this increase, the final building block of the particle filter is to introduce a resampling scheme. At a given time, the particle system is resampled according to the importance weight of each particle. The aim of this resampling is to keep particles which approximate the majority of the probability mass of the distribution, while discarding particles that do not contribute much information. A general SMC algorithm for filtering in state-space models is shown in Algorithm 1. The external inputs $\mathbf{u}_{1:t}$ have been omitted from the notation here for compactness, but can also be included.
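As an illustrative sketch (not part of the original presentation), one widely used scheme is systematic resampling; the helper below returns ancestor indices given a set of normalised weights, replicating each particle roughly in proportion to its weight.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: returns N ancestor indices so that particle i
    is replicated roughly N * weights[i] times. Assumes weights sum to one."""
    N = len(weights)
    # a single uniform offset shared across N evenly spaced positions
    positions = (rng.uniform() + np.arange(N)) / N
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)
```

After resampling, the particle system is set to `particles[indices]` with all weights reset to $1/N$; the shared offset gives lower variance than drawing $N$ independent multinomial indices.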
Algorithm 1 SMC for State-Space Filtering
All operations for $i = 1, \dots, N$ particles
1: Sample $\mathbf{x}_{1}^{i} \sim q(\mathbf{x}_{1} \mid \mathbf{y}_{1})$
2: Calculate ${\tilde{w}}_{1}^{i} = \frac{p(\mathbf{x}_{1}^{i} \mid \mathbf{x}_{0}^{i})\, {g}_{\mathbf{\theta}}(\mathbf{y}_{1} \mid \mathbf{x}_{1}^{i})}{q(\mathbf{x}_{1}^{i} \mid \mathbf{y}_{1})}$ and normalise to obtain ${w}_{1}^{i}$
3: Resample $\{\mathbf{x}_{1}^{i}, {w}_{1}^{i}\}$, updating the particle system to $\{\hat{\mathbf{x}}_{1}^{i}, \frac{1}{N}\}$
4: for $t = 2, \dots, T$ do
5: Sample $\mathbf{x}_{t}^{i} \sim q(\mathbf{x}_{t} \mid \mathbf{y}_{t}, \hat{\mathbf{x}}_{t-1}^{i})$
6: Update path histories $\mathbf{x}_{1:t}^{i} \leftarrow (\hat{\mathbf{x}}_{1:t-1}^{i}, \mathbf{x}_{t}^{i})$
7: Calculate ${\tilde{w}}_{t}^{i} = \frac{g(\mathbf{y}_{t} \mid \mathbf{x}_{t}^{i})\, f(\mathbf{x}_{t}^{i} \mid \mathbf{x}_{t-1}^{i})}{q(\mathbf{x}_{t}^{i} \mid \mathbf{x}_{t-1}^{i})}$ and normalise to obtain ${w}_{t}^{i}$
8: Resample $\{\mathbf{x}_{1:t}^{i}, {w}_{t}^{i}\}$, updating the particle system to $\{\hat{\mathbf{x}}_{1:t}^{i}, \frac{1}{N}\}$
9: end for
Before an effective application of this method can be made, there remains a key user choice: the choice of proposal distribution $q(\mathbf{x}_{t}^{i} \mid \mathbf{x}_{t-1}^{i})$. The optimal choice of this proposal distribution would be the true filtering distribution of the states. This choice is rarely possible; therefore, a different proposal must be chosen. The simplest approach is to set the proposal distribution to be the transition density of the model, i.e., $q(\mathbf{x}_{t}^{i} \mid \mathbf{x}_{t-1}^{i}) = {f}_{\mathbf{\theta}}(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}, \mathbf{u}_{t-1})$; this is commonly referred to as the bootstrap particle filter. In this case, the value of the (unnormalised) weights is given by the likelihood of the observation $\mathbf{y}_{t}$ under the observation density ${g}_{\mathbf{\theta}}(\mathbf{y}_{t} \mid \mathbf{x}_{t}, \mathbf{u}_{t})$. It is possible to reduce the variance in the proposal, which can improve efficiency, by choosing an alternative proposal density [21]; however, that will not be covered here.
The particle filter, alongside providing an approximate representation of the filtering densities, provides an unbiased estimate ${\hat{p}}_{\mathbf{\theta}}(\mathbf{y}_{1:T})$ of the marginal likelihood of the filter ${p}_{\mathbf{\theta}}(\mathbf{y}_{1:T})$. This is given by,

Access to this unbiased estimator will be a key component of a particle Markov Chain Monte Carlo (pMCMC) scheme.
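The pieces above can be sketched as a minimal bootstrap particle filter. This is an illustrative sketch, not the paper's implementation: the callbacks `init_sample`, `transition_sample` and `log_obs_density` are hypothetical placeholders for the user's model densities, and the running product defining the marginal likelihood estimate is accumulated in log space for numerical stability.

```python
import numpy as np

def bootstrap_pf(y, n_particles, init_sample, transition_sample, log_obs_density, rng):
    """Bootstrap particle filter: proposes from the transition density, so the
    unnormalised weight is the observation likelihood g(y_t | x_t).
    Returns the filtered means and the log marginal likelihood estimate."""
    T = len(y)
    x = init_sample(n_particles, rng)            # (N, d) initial particles
    log_Z = 0.0
    means = np.empty((T, x.shape[1]))
    for t in range(T):
        if t > 0:
            x = transition_sample(x, rng)        # propagate through f(x_t | x_{t-1})
        logw = log_obs_density(y[t], x)          # unnormalised log-weights
        m = logw.max()
        w = np.exp(logw - m)
        log_Z += m + np.log(w.sum() / n_particles)   # running estimate of p(y_{1:t})
        w /= w.sum()
        means[t] = w @ x                         # filtered mean E[x_t | y_{1:t}]
        # multinomial resampling back to equal weights
        idx = rng.choice(n_particles, size=n_particles, p=w)
        x = x[idx]
    return means, log_Z
```

For instance, with a scalar linear-Gaussian model supplied through the three callbacks, the filtered means can be checked against the Kalman filter.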
So far, it has been shown how a general model of a nonlinear dynamical system can be established as a nonlinear state-space model. It has been stated that, in general, this model does not have a tractable posterior and so an approximation is required. The particle filter, an SMC-based scheme for inference of the filtering density, has been introduced as a tool to handle this type of model. This algorithm will form the basis of the approach taken in this paper to solve the joint input-state problem for a nonlinear dynamic system.
2.2. Input Estimation as a Latent Force Problem
The specific model form used for the joint input-state estimation task can now be developed. The starting point for this will be the second-order differential equation which is the equation of motion of the system. The methodology will be shown for a single-degree-of-freedom (SDOF) nonlinear system, although the framework can extend to the multi-degree-of-freedom (MDOF) case. Defining some nonlinear system as,
a point with mass $m$ is subjected to an external forcing as a function in time $U$, and its response is governed by its inertial force $m\ddot{z}$ and some internal forces $h(z, \dot{z})$ which are a function of its displacement and velocity (although not shown explicitly, to reduce clutter in the notation, it should be realised that the displacement, velocity and acceleration of the system are time-dependent; in other words, $z$ implies $z(t)$, and similarly for the velocity and acceleration). If $h(z, \dot{z}) = kz + c\dot{z}$ for some constant stiffness $k$ and damping coefficient $c$, the classical linear system is recovered.
When $U$ is known (measured), the particle filter can be used to recover the internal states of the oscillator, the displacement and velocity (computing the acceleration from these is also trivial). By modifying the observation density, a variety of sensor modalities can be handled; however, measurement of acceleration remains the most common approach because of ease of physical experimentation. In this work, it is assumed that the parameters are known and the input is not; for the opposite scenario, see [22]. It is hoped in future to unify these two into a nonlinear joint input-state-parameter methodology. However, in the situation being addressed in this work, it is assumed that there is no access to $U$.
Here, the unknown input will be estimated simultaneously with the unknown states of the model. To do so, a statistical model of the missing force must be established. One elegant approach to this was introduced in Alvarez et al. [23]. To infer the forces, it is necessary to make some assumption about their generating process. The assumption in Alvarez et al. [23] is that the unknown forcing $U$ can be represented by a Gaussian process in time; the same assumption is made in this work.
The Gaussian process (GP) [24,25] is a flexible nonparametric Bayesian regression tool. A GP represents a function as an infinite-dimensional joint Gaussian distribution, where each datapoint conditioned on a given input is a sample from a univariate Gaussian conditioned on that input. An intuitive way of viewing the GP is as a probability distribution over functions; in other words, a single sample from a GP is a function with respect to the specified inputs. A temporal GP is simply a GP where the inputs to the model are time. The temporal GP over a function $l(t)$ is fully defined in terms of its mean function $m(t)$ and covariance kernel $k(t, t')$,

In this work, as is common, it will be assumed that the prior mean function is zero for all time, i.e., $m(t) = 0$.
Alvarez et al. [23] show how to solve the latent force problem for linear systems where,

and in the MDOF case. One of the difficulties in the approach is the computational cost of solving such a problem. However, it is possible to greatly reduce this cost by transforming the inference into a state-space model. It was shown in Hartikainen and Särkkä [26] that inference for a Gaussian process in time can be converted into a linear Gaussian state-space model. Using the Kalman filtering and RTS smoothing solutions, an identical solution to the standard GP is recovered in linear time rather than cubic.
Given the availability of this solution, it was sensible to use this result to improve the efficiency of inference in the latent force model. The combination of the state-space form of the Gaussian process with the linear dynamical system is natural, given the equivalent state-space form of the dynamics; this was shown in Hartikainen and Särkkä [27]. One of the most useful aspects of this approach is that the whole system remains a linear Gaussian state-space model, which can be solved with the Kalman filter and RTS smoother to recover the filtering and smoothing distributions exactly. It is this form of the model that has been exploited previously in structural dynamics [5,6,10].
A full derivation of the conversion from the GP to its state-space form will not be shown here, although it can be found in Hartikainen and Särkkä [26]. Instead, it will be shown how this transformation can be performed for one family of kernels, the Matérn class. This set of kernels is a popular choice in machine learning and has been argued to be a sensible general-purpose nonlinear kernel for physical data [28]. The Matérn kernel is defined as a function of the absolute difference between any two points in time, $r = t - t'$. As such, it is a stationary isotropic kernel, i.e., it is invariant to the absolute values of the inputs, their translation or rotation. Its characteristics are controlled by two hyperparameters: a total scaling ${\sigma}_{f}^{2}$ and a length scale $\ell$ which mediates the characteristic frequency of the function. In addition, particular kernels are selected from this family by virtue of a smoothness parameter $\nu$. The general form of the kernel is given by,

where $\mathsf{\Gamma}(\,\cdot\,)$ is the gamma function and ${\mathcal{K}}_{\nu}(\,\cdot\,)$ is the modified Bessel function of the second kind. The most common Matérn kernels are chosen such that $\nu = p + 1/2$, where $p$ is zero or a positive integer. When $p = 0$, Equation (9) recovers the Matérn 1/2 kernel, which is equivalent to the function being modelled as an Ornstein–Uhlenbeck process [19]; as $\nu \to \infty$, the Matérn covariance converges to the squared-exponential or Gaussian kernel.
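For the half-integer orders, the Matérn kernel reduces to simple closed forms; the sketch below (with assumed hyperparameter names `sigma_f` and `ell`) evaluates the $\nu = 1/2$ and $\nu = 3/2$ cases.

```python
import numpy as np

def matern12(r, sigma_f, ell):
    """Matern 1/2 (Ornstein-Uhlenbeck) kernel: sigma_f^2 * exp(-|r| / ell)."""
    return sigma_f**2 * np.exp(-np.abs(r) / ell)

def matern32(r, sigma_f, ell):
    """Matern 3/2 kernel: sigma_f^2 * (1 + sqrt(3)|r|/ell) * exp(-sqrt(3)|r|/ell)."""
    a = np.sqrt(3.0) * np.abs(r) / ell
    return sigma_f**2 * (1.0 + a) * np.exp(-a)
```

Both are stationary (they depend only on $r = t - t'$) and return the marginal variance $\sigma_f^2$ at $r = 0$.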
To convert these covariance functions into equivalent state-space models, the spectral density of the covariance function is considered. Taking the Fourier transform of Equation (9) gives the spectral density,

where $S(\omega) = \mathcal{F}[k(r)]$. This density can then be rearranged into a rational form with a constant numerator and a denominator that can be expressed as a polynomial in ${\omega}^{2}$. This can be rewritten in the form $S(\omega) = H(i\omega)\, q\, H(-i\omega)$, where $H(i\omega)$ defines the transfer function of a process for $l(t)$ governed by the differential equation below, with $q$ being the spectral density of the white noise process $w(t)$,
The differential equation has order $a$, with associated coefficients ${\sigma}_{0}, \dots, {\sigma}_{a-1}$, assuming ${\sigma}_{a} = 1$. This system is driven by a continuous-time white noise process $w(t)$ which has a spectral density equal to the numerator of the rational form of the spectral density of the kernel.
Returning to the Matérn kernels, it can be shown that,

defining $\lambda = \sqrt{2\nu}/\ell$. With this being the denominator of the rational form, the numerator is simply the constant of proportionality for Equation (12), which will be referred to as $q$, with

Therefore, the spectral density of $w(t)$ is equal to $q$.
The expression for the GP in Equation (11) can now easily be converted into a linear Gaussian state-space model. In continuous time, this produces models in the form,

For example, setting $p = 0$ yields,

This procedure has now provided a state-space representation for the temporal GP, which is the prior distribution for a function in time placed over the unknown force on the oscillator.
At this point, the states of the linear system can be augmented with this new model for the forcing. For example, for a model with a Matérn 3/2 kernel chosen to represent the force, the full system would be,

where $u$ is the augmented hidden state which represents the unknown forcing; as a consequence of the model formulation, its derivative $\dot{u}$ is also estimated. Since this is a linear system, it is a standard procedure to convert it into a discrete-time form, and the Kalman filtering and RTS smoothing solutions can be applied; see [22] for an example of this. Doing so, the smoothing distribution over the forcing is identical to that obtained by solving the problem in Equation (8) as in [23].
Up to this point, the model has been shown in the context of linear systems. This has hopefully provided a roadmap for the construction of an equivalent nonlinear model. The start of developing the nonlinear model will be to generalise Equation (8) to the nonlinear case,
Considering the development of the model for the linear system shown previously, the reader will notice that the dynamics of the system do not enter until the final step in the model construction. It is therefore possible to use an identical procedure to convert the forcing, modelled as a GP, into a state-space form. Depending on the kernel, the force will still be represented in the same way, for example as in Equations (15)–(17). In order to extend the method to the nonlinear case, it is necessary to consider how this linear model of the forcing may interact with the nonlinear dynamics of the system. This can be done by forming the nonlinear state-space equation for the system. Again using the Matérn 3/2 kernel as an example,

with the spectral density of $w(t)$ being $q$, as before. Since this is now a nonlinear model, it is no longer possible to write the transition as a matrix-vector product, and the state equations are written out in full. It is worth bearing in mind that there are certain nonlinearities which may increase the dimension of the hidden states in the model; for example, the Bouc–Wen hysteresis model. This form of nonlinearity can be incorporated into this framework, provided the system equations can be expressed as a nonlinear state-space model.
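To illustrate how such an augmented nonlinear transition might be simulated inside a particle filter, the sketch below takes one Euler–Maruyama step of the state $[z, \dot{z}, u, \dot{u}]$. The cubic-stiffness nonlinearity $h(z, \dot{z}) = kz + c\dot{z} + k_3 z^3$, the state ordering and the parameter names are all assumptions for illustration, not the paper's specific system.

```python
import numpy as np

def augmented_transition(x, dt, m, k, c, k3, sigma_f, ell, rng):
    """One Euler-Maruyama step of the augmented state [z, zdot, u, udot]:
    the oscillator m*zddot + h(z, zdot) = u, driven by a Matern 3/2 GP force.
    The cubic stiffness k3 is an illustrative choice of nonlinearity."""
    z, zd, u, ud = x[:, 0], x[:, 1], x[:, 2], x[:, 3]
    lam = np.sqrt(3.0) / ell
    q = 4.0 * sigma_f**2 * lam**3          # spectral density of the GP white noise
    h = k * z + c * zd + k3 * z**3         # internal restoring force
    znew = z + dt * zd
    zdnew = zd + dt * (u - h) / m
    unew = u + dt * ud
    # only the last state receives process noise, with variance q * dt
    udnew = ud + dt * (-lam**2 * u - 2.0 * lam * ud) \
        + np.sqrt(q * dt) * rng.normal(size=len(u))
    return np.stack([znew, zdnew, unew, udnew], axis=1)
```

In a bootstrap filter this function would play the role of `transition_sample`, propagating all particles through the augmented dynamics at once; an exact discretisation of the GP sub-block could be substituted for the Euler step if preferred.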
2.3. Inference
In possession of a nonlinear model for the system, attention turns to inference. There are two unknowns in the system: the hyperparameters associated with the model of the forcing (${\sigma}_{f}^{2}$ and $\ell$), and the internal states, which have now been augmented with states related to the forcing signal. Additionally, in order to recover the GP representation of the force, the smoothing distribution of the states is required [26]. In this model, the parameters of the physical system are known or fixed a priori, and the task being attempted is joint input-state estimation.
Identification of the smoothing distribution of a nonlinear state-space model is a more challenging task than filtering; however, the basic framework of particle filtering can be used. There are a variety of methods to recover the smoothing distribution of the states starting from a particle filtering approach [19]. In this case, however, since there are also (hyper)parameters which require identification and the Bayesian solution is desired, a Markov Chain Monte Carlo (MCMC) approach is adopted. Specifically, particle MCMC (pMCMC) [1] is an approach to performing MCMC inference for nonlinear state-space models.
The particular approach used in this work is a particle Gibbs scheme. In this methodology, a form of particle filter is used to build a Markov kernel which can draw valid samples from the smoothing distribution of the nonlinear statespace model conditioned on some (hyper)parameters. This sample of the states is then used to sample a new set of (hyper)parameters. Similarly to a classical Gibbs sampling approach, alternating samples from these two conditional distributions allows valid samples from the joint posterior to be generated. This approach shows improved efficiency in the sampler due to a blocked construction where all the states are sampled simultaneously and likewise the (hyper)parameters are sampled together. Each of these two updating steps will now be considered.
2.3.1. Inferring the States
In order to generate samples from the smoothing distributions of the states, the Particle Gibbs with Ancestor Sampling (PGAS) [29] method is used. This technique will be briefly described here, but the reader is referred to Lindsten et al. [29] for a full introduction and a proof that this is a valid Markov kernel with the smoothing distribution as its stationary distribution.
To generate a sample from the smoothing distribution using the particle filter, some modification needs to be made to the procedure seen for the basic particle filter. A more compact form of the Sequential Monte Carlo (SMC) scheme is shown in Algorithm 2 to make clear the necessary modifications. In particular, the notation ${M}_{\mathbf{\theta},t}({a}_{t}, \mathbf{x}_{t})$ will be used to represent a proposal kernel which returns an ancestor index ${a}_{t}^{i}$ and a proposed sample $\mathbf{x}_{t}^{i}$ for the states at time $t$. ${M}_{\mathbf{\theta},t}({a}_{t}, \mathbf{x}_{t})$ represents the usual resampling and proposal steps in the particle filter, and the explicit bookkeeping for the ancestors at each time step aids in recovering the particle paths. The ancestor index ${a}_{t}^{i}$ is simply an integer which indicates which particle from time $t-1$ was used to generate the proposal of $\mathbf{x}_{t}^{i}$. The weighting of each particle is also now more compactly represented by a general weighting function ${W}_{\mathbf{\theta},t}(\mathbf{x}_{1:t})$.
Algorithm 2 Sequential Monte Carlo
All operations for $i = 1, \dots, N$
1: Sample $\mathbf{x}_{1}^{i} \sim {q}_{\mathbf{\theta}}(\mathbf{x}_{1})$
2: Calculate ${w}_{1}^{i} = {W}_{\mathbf{\theta},1}(\mathbf{x}_{1}^{i})$
3: for $t = 2, \dots, T$ do
4: Sample $\{{a}_{t}^{i}, \mathbf{x}_{t}^{i}\} \sim {M}_{\mathbf{\theta},t}({a}_{t}, \mathbf{x}_{t})$
5: Set $\mathbf{x}_{1:t}^{i} \leftarrow (\mathbf{x}_{1:t-1}^{{a}_{t}^{i}}, \mathbf{x}_{t}^{i})$
6: Calculate ${w}_{t}^{i} = {W}_{\mathbf{\theta},t}(\mathbf{x}_{1:t}^{i})$
7: end for
Simply taking samples of the paths from this algorithm does not give rise to a valid Markov kernel for sampling the smoothing distribution [1]; in order to do so, it is necessary to make a modification which ensures stationarity of the kernel. This modification is to include one particle trajectory which is not updated as the algorithm runs, referred to as the reference trajectory. Although the index of this particle does not affect the algorithm, it is customary to set it to be the last index in the filter, i.e., for a particle system with $N$ particles, the $N$th particle would be the reference trajectory. This reference will be denoted as $\mathbf{x}_{1:T}^{\prime} = (\mathbf{x}_{1}^{\prime}, \dots, \mathbf{x}_{T}^{\prime})$. This small change forms the conditional SMC (CSMC) algorithm (Algorithm 3), which is valid for generating samples of the smoothing distribution.
Algorithm 3 Conditional Sequential Monte Carlo
1: Sample $\mathbf{x}_{1}^{i} \sim {q}_{\mathbf{\theta}}(\mathbf{x}_{1})$ for $i = 1, \dots, N-1$
2: Set $\mathbf{x}_{1}^{N} = \mathbf{x}_{1}^{\prime}$
3: Calculate ${w}_{1}^{i} = {W}_{\mathbf{\theta},1}(\mathbf{x}_{1}^{i})$ for $i = 1, \dots, N$
4: for $t = 2, \dots, T$ do
5: Sample $\{{a}_{t}^{i}, \mathbf{x}_{t}^{i}\} \sim {M}_{\mathbf{\theta},t}({a}_{t}, \mathbf{x}_{t})$ for $i = 1, \dots, N-1$
6: Set $\mathbf{x}_{1:t}^{i} \leftarrow (\mathbf{x}_{1:t-1}^{{a}_{t}^{i}}, \mathbf{x}_{t}^{i})$ for $i = 1, \dots, N-1$
7: Set ${a}_{t}^{N} = N$
8: Set $\mathbf{x}_{t}^{N} = \mathbf{x}_{t}^{\prime}$
9: Calculate ${w}_{t}^{i} = {W}_{\mathbf{\theta},t}(\mathbf{x}_{1:t}^{i})$ for $i = 1, \dots, N$
10: end for
To sample from the smoothing distribution of the states, a path is selected by sampling from a multinomial distribution with probabilities $\mathbf{w}_{T}$, i.e., the weights of each particle at the final time step. The path is defined in terms of the ancestors for that particle working backwards in time, which have been updated as the algorithm runs (line 6 of Algorithm 3). CSMC forms a serviceable algorithm; however, it can show poor mixing in the Markov chain close to the beginning of the trajectories, owing to path degeneracy.
The ancestor sampling approach proposed in [29] is a simple yet powerful modification to the algorithm. It adds the additional step of sampling the ancestor for the reference trajectory in the CSMC filter. The ancestor is sampled at each time step according to the following unnormalised weight,

When using the bootstrap filter discussed earlier, and resampling at every time step, Equation (21) reduces to,

remembering that $\mathbf{x}_{t-1}^{N} = \mathbf{x}_{t-1}^{\prime}$. Conceptually, this is sampling the ancestor for the remaining portion of the reference trajectory according to the likelihood of the reference at time $t$ given the transition from each of the particles (including the reference) at the previous time step; doing so maintains an invariant Markov kernel on ${p}_{\mathbf{\theta}}(\mathbf{x}_{1:T})$.
Given this ancestor sampling procedure, the PGAS Markov kernel can be written down as in Algorithm 4. This PGAS kernel will generate a sample $\mathbf{x}_{1:T}^{\star}$ given the inputs of the parameters $\mathbf{\theta}$ and a reference trajectory $\mathbf{x}_{1:T}^{\prime}$.
Algorithm 4 PGAS Markov Kernel
1: Sample $\mathbf{x}_{1}^{i} \sim {q}_{\mathbf{\theta}}(\mathbf{x}_{1})$ for $i = 1, \dots, N-1$
2: Set $\mathbf{x}_{1}^{N} = \mathbf{x}_{1}^{\prime}$
3: Calculate ${w}_{1}^{i} = {W}_{\mathbf{\theta},1}(\mathbf{x}_{1}^{i})$ for $i = 1, \dots, N$
4: for $t = 2, \dots, T$ do
5: Sample $\{{a}_{t}^{i}, \mathbf{x}_{t}^{i}\} \sim {M}_{\mathbf{\theta},t}({a}_{t}, \mathbf{x}_{t})$ for $i = 1, \dots, N-1$
6: Calculate ${\tilde{w}}_{t-1|T}^{i}$
7: Sample ${a}_{t}^{N}$ with $\mathbb{P}({a}_{t}^{N} = i) \propto {\tilde{w}}_{t-1|T}^{i}$
8: Set $\mathbf{x}_{t}^{N} = \mathbf{x}_{t}^{\prime}$
9: Set $\mathbf{x}_{1:t}^{i} \leftarrow (\mathbf{x}_{1:t-1}^{{a}_{t}^{i}}, \mathbf{x}_{t}^{i})$ for $i = 1, \dots, N$
10: Calculate ${w}_{t}^{i} = {W}_{\mathbf{\theta},t}(\mathbf{x}_{1:t}^{i})$ for $i = 1, \dots, N$
11: end for
12: Draw $k$ with $\mathbb{P}(k = i) \propto {w}_{T}^{i}$
13: return $\mathbf{x}_{1:T}^{\star} = \mathbf{x}_{1:T}^{k}$
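A compact sketch of such a PGAS kernel is given below, assuming a bootstrap proposal with multinomial resampling at every step (so the ancestor-sampling weight is the filtering weight times the transition likelihood of the reference). The callbacks are hypothetical placeholders; in particular, `log_trans_density(x_t, X)` is assumed to return the transition log-density of `x_t` under each row of `X`.

```python
import numpy as np

def pgas_kernel(y, x_ref, n_particles, transition_sample, log_trans_density,
                log_obs_density, init_sample, rng):
    """One draw from a PGAS Markov kernel (bootstrap proposal, multinomial
    resampling at every step). x_ref is the reference trajectory, shape (T, d)."""
    T, d = x_ref.shape
    N = n_particles
    x = np.empty((T, N, d))
    x[0, :N - 1] = init_sample(N - 1, rng)
    x[0, N - 1] = x_ref[0]                          # particle N is the reference
    anc = np.zeros((T, N), dtype=int)
    logw = log_obs_density(y[0], x[0])
    for t in range(1, T):
        w = np.exp(logw - logw.max()); w /= w.sum()
        # resample ancestors and propagate the first N-1 particles
        anc[t, :N - 1] = rng.choice(N, size=N - 1, p=w)
        x[t, :N - 1] = transition_sample(x[t - 1, anc[t, :N - 1]], rng)
        # ancestor sampling for the reference: reweight each particle by the
        # transition likelihood of the reference state at time t
        logw_as = np.log(w + 1e-300) + log_trans_density(x_ref[t], x[t - 1])
        w_as = np.exp(logw_as - logw_as.max()); w_as /= w_as.sum()
        anc[t, N - 1] = rng.choice(N, p=w_as)
        x[t, N - 1] = x_ref[t]
        logw = log_obs_density(y[t], x[t])
    # draw a final particle and trace its path back through the ancestors
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(N, p=w)
    path = np.empty((T, d))
    for t in range(T - 1, -1, -1):
        path[t] = x[t, k]
        if t > 0:
            k = anc[t, k]
    return path
```

The returned path can then serve as the reference trajectory for the next iteration of the sampler.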

The methodology outlined in Algorithm 4 now provides an algorithm for generating samples from the smoothing distribution of the states, given a previous sample of the states and a sampled set of parameters. Therefore, initialising the algorithm with a guess of the state trajectory and parameters allows samples from the posterior to be generated. It is also necessary to bear in mind the usual burn-in required for an MCMC method, meaning that the initial samples will be discarded. In other words, it is now possible to sample, at step $j$ in the MCMC scheme, $\mathbf{x}_{1:T}^{(j)}$ as $\mathbf{x}_{1:T}^{\star} \mid \mathbf{x}_{1:T}^{(j-1)}, {\mathbf{\theta}}^{(j-1)}$. It remains to develop a corresponding update for sampling ${\mathbf{\theta}}^{(j)}$.
2.3.2. Inferring the Hyperparameters
Working in the Gibbs sampling setting means that the samples of the hyperparameters are generated based upon the most recent sample of the state values. Therefore, it is necessary to consider samples from the posterior of the (hyper)parameters given the total data likelihood. In other words, a method must be developed to sample from $p(\mathbf{\theta} \mid \mathbf{x}_{1:T}^{(j)}, \mathbf{y}_{1:T})$. Before addressing how that is done, it is useful to clarify what is meant by $\mathbf{\theta}$ in the context of this nonlinear joint input-state estimation problem. Since it is assumed that the physical parameters of the system are known a priori, the only parameters which need identification in this model are the hyperparameters of the GP; this should simplify the problem, as the dimensionality of the parameter vector being inferred remains low. As such, it can be said that $\mathbf{\theta} = \{{\sigma}_{f}^{2}, \ell\}$. Another advantage of taking the GP latent force approach is that the GP over the unknown input also defines the process noise in the model; therefore, this does not need to be estimated explicitly.
It can now be explored how to sample the conditional distribution of $\mathbf{\theta}$, given a sample from the states. Applying Bayes' theorem,
Unfortunately, in this case it is not possible to sample directly from the posterior in Equation (23) in closed form. Therefore, a Metropolis-in-Gibbs approach is taken. That is, a Metropolis–Hastings kernel is used to move the parameters from ${\mathbf{\theta}}^{(j-1)}$ to ${\mathbf{\theta}}^{(j)}$, conditioned on a sample of the states $\mathbf{x}_{1:T}^{(j)}$. This approach is convenient since there is no simple closed-form update for the hyperparameters, which precludes the use of a Gibbs update for them. It follows much the same approach as the standard Metropolis–Hastings kernel, where a new set of parameters is proposed according to a proposal density ${\mathbf{\theta}}^{\prime} \sim {q}^{\prime}(\mathbf{\theta} \mid {\mathbf{\theta}}^{(j-1)})$, and these are accepted with an acceptance probability,
To calculate this acceptance probability, it is not necessary to evaluate the normalising constant $Z_{\mathbf{x}_{1:T}}(\mathbf{\theta})$ in Equation (23), since this cancels. It is necessary, however, to develop an expression for $\gamma_{\mathbf{x}_{1:T}}(\mathbf{\theta})$. The distribution $p(\mathbf{\theta})$ is the prior over the hyperparameters, which is free to be set by the user. The other distribution in the expression for $\gamma_{\mathbf{x}_{1:T}}(\mathbf{\theta})$ is the total data likelihood of the model given a sampled trajectory for $\mathbf{x}_{1:T}$ and the observations $\mathbf{y}_{1:T}$. Given that this is a nonlinear state-space model, the likelihood of interest is given by,
As can be seen in Equation (25), this quantity is directly related to the transition and observation densities of the state-space model. Since these densities and the observations are known, and a sample of the states is available from the PGAS kernel, the quantity in Equation (25) is relatively easy to compute; therefore, so is $\gamma_{\mathbf{x}_{1:T}}(\mathbf{\theta})$ (provided a tractable prior is chosen!). Computing $\alpha$ in Equation (24) is a standard procedure; often the proposal density $q'(\mathbf{\theta}' \mid \mathbf{\theta}^{(j-1)})$ is chosen to be symmetric (e.g., a random walk centred on $\mathbf{\theta}^{(j-1)}$) so that it cancels in Equation (24). As an implementation note, the authors have found that it can be helpful for stability to run a number of these Metropolis–Hastings steps for each new sample of the states $\mathbf{x}_{1:T}^{(j)}$; this does not affect the validity of the identification procedure. It has now been possible to define a Markov kernel
$K\left( \{\mathbf{\theta}^{(j-1)}, \mathbf{x}_{1:T}^{(j)}\} \right)$ which moves the parameters $\mathbf{\theta}$ from step $j-1$ to step $j$ with the stationary distribution $\pi(\mathbf{\theta} \mid \mathbf{x}_{1:T}^{(j)})$.
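The inner workings of this Metropolis-in-Gibbs update can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `log_f`, `log_g`, and `log_prior` are hypothetical user-supplied log-densities for the transition, observation, and hyperparameter prior, and a symmetric Gaussian random walk is assumed for the proposal so that, as noted above, it cancels in the acceptance ratio.

```python
import numpy as np

def log_total_data_likelihood(theta, x, y, log_f, log_g):
    """Sum of log transition and log observation densities along a fixed
    state trajectory, i.e., the total data likelihood of Equation (25).
    log_f(x_next, x_prev, theta) and log_g(y_t, x_t, theta) are
    hypothetical user-supplied log-densities of the state-space model."""
    ll = sum(log_f(x[t + 1], x[t], theta) for t in range(len(x) - 1))
    ll += sum(log_g(y[t], x[t], theta) for t in range(len(y)))
    return ll

def mh_in_gibbs_step(theta, x, y, log_f, log_g, log_prior,
                     step=0.1, n_steps=5, rng=np.random):
    """Metropolis-in-Gibbs update of the hyperparameters conditioned on the
    current state sample.  Several inner Metropolis-Hastings steps are run
    per state sample, as suggested in the text for stability."""
    def log_gamma(th):
        # Unnormalised posterior: prior times total data likelihood.
        return log_prior(th) + log_total_data_likelihood(th, x, y, log_f, log_g)

    lg = log_gamma(theta)
    for _ in range(n_steps):
        # Symmetric random-walk proposal, so q' cancels in the ratio.
        prop = theta + step * rng.standard_normal(theta.shape)
        lg_prop = log_gamma(prop)
        # Accept with probability min(1, gamma(prop) / gamma(theta)).
        if np.log(rng.uniform()) < lg_prop - lg:
            theta, lg = prop, lg_prop
    return theta
```

In practice the step size of the random walk would be tuned to achieve a reasonable acceptance rate for the problem at hand.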
2.3.3. Blocked Particle Gibbs for Joint Input-State Estimation
The full inference procedure is then the combination of these two steps into a blocked Gibbs sampler [30]. Starting from some initial guess of $\mathbf{x}_{1:T}'$ and $\mathbf{\theta}$, the sampler alternates between sampling the states $\mathbf{x}_{1:T}^{\star}$ using Algorithm 4 conditioned on the current sample of $\mathbf{\theta}$, and sampling $\mathbf{\theta}$ by $K\left( \{\mathbf{\theta}^{(j-1)}, \mathbf{x}_{1:T}^{(j)}\} \right)$ conditioned on the latest sample of the states.
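The alternation described above can be sketched as a simple driver loop. Here `sample_states` and `sample_theta` are hypothetical stand-ins for the PGAS kernel (Algorithm 4) and the Metropolis-in-Gibbs kernel, respectively; this is a sketch of the blocking structure only, not the full implementation.

```python
import numpy as np

def blocked_particle_gibbs(theta0, x0, y, sample_states, sample_theta, J=100):
    """Blocked Gibbs sampler: alternate a conditioned draw of the state
    trajectory given theta (the PGAS kernel) with an update of theta given
    that trajectory (the Metropolis-in-Gibbs kernel).  Both kernels are
    supplied as hypothetical callables:
        sample_states(theta, x_cond, y) -> new trajectory x_{1:T}
        sample_theta(theta, x, y)       -> new hyperparameters theta
    Returns the chains of hyperparameter and state samples."""
    theta, x = theta0, x0
    thetas, xs = [], []
    for _ in range(J):
        # Block 1: sample x_{1:T}^{(j)} conditioned on the current theta.
        x = sample_states(theta, x, y)
        # Block 2: sample theta^{(j)} conditioned on the new trajectory.
        theta = sample_theta(theta, x, y)
        thetas.append(theta)
        xs.append(x)
    return np.array(thetas), np.array(xs)
```

Discarding an initial portion of the returned chains as burn-in would be usual before summarising the posterior.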
With this algorithm in hand, it is now possible to perform joint input-state estimation for a nonlinear system where the forcing is modelled as a Gaussian process. The number of steps $J$ for which this chain should be run remains a user choice, although diagnostics on the chain can help to check for convergence [30]. It should also be noted that up to this point the form of the nonlinear system has not been restricted, but it is assumed that the system parameters are known. Likewise, kernels other than the Matérn can be used within this framework by following a similar line of logic.
Before moving to the results, it is worth considering theoretically what will happen when the model of the system is misspecified, i.e., if the system parameters used for the input-state estimation are not correct. Because the Gaussian process model of the forcing is sufficiently flexible, it will absorb this model discrepancy, and the estimated input to the system will be biased. The recovered “forcing” state, which is the Gaussian process, will now be a combination of the unmeasured external forcing and the internal forces required to correct for the discrepancy between the specified nonlinear system model and the “true” system which generated the observations. The level of bias in the estimated force will be affected by a number of factors: firstly, the degree to which the parameters (or model) of the nonlinear system are themselves biased; and secondly, the size of the external force relative to the correction required to align the assumed system with the true model. The authors would caution a potential user as to the dangers of a misspecified nonlinear dynamical model if seeking a highly accurate recovery of the forcing signal; however, the results returned can still be of interest and may give insight into the dynamics of the system even when biased.