4.1. Gaussian Process Regression
Gaussian process regression seeks to infer the unknown latent function
$f\left(X\right)$ (where
$X\in {\mathbb{R}}^{N\times d}$ are inputs) of the noisy function
$\mathit{y}=f\left(X\right)+e$ (where
$\mathit{y}\in {\mathbb{R}}^{N}$ are the target outputs). First, a Gaussian process prior is defined over the latent function,

$f\left(\mathit{x}\right)\sim \mathcal{GP}\left(m\left(\mathit{x}\right),k\left(\mathit{x},{\mathit{x}}^{\prime}\right)\right),$

where $\mathcal{GP}(\cdot ,\cdot)$ is a Gaussian process, with a mean function $m(\cdot)$ and covariance function $k(\cdot ,\cdot)$, defining prior belief about the types of possible functions that could create
$\mathit{f}\in {\mathbb{R}}^{N}$. The mean function defines the prior mean of the process, and the covariance function states the correlation between any two points in the input space (hence a function of $X$ and ${X}^{\prime}$) and controls the function's ‘smoothness’. Here, a zero mean and a Matérn 3/2 covariance function are utilised, given that no prior mean information is known about the missing functional form, and that a Matérn 3/2 process is $(\lceil 3/2\rceil -1)=1$ times mean square differentiable [42], making it well suited to modelling relatively ‘smooth’ real-world functions (for other mean and covariance functions the reader is referred to [41]). The covariance function is defined as,

${K}_{f,f}=k(X,{X}^{\prime})={\sigma}_{f}^{2}\left(1+\sqrt{3}r\right)\mathrm{exp}\left(-\sqrt{3}r\right),$

where,

$r=\sqrt{{\left(\mathit{x}-{\mathit{x}}^{\prime}\right)}^{\top}{L}^{-2}\left(\mathit{x}-{\mathit{x}}^{\prime}\right)},$

where
${K}_{f,f}\in {\mathbb{R}}^{N\times N}$ is the covariance matrix, and $\mathit{\varphi}=\{{\sigma}_{f}^{2},L\}$ is the set of hyperparameters, called the signal variance and lengthscales, respectively, where $L=\mathrm{diag}({l}_{1},\cdots ,{l}_{d})$ (making the covariance function an automatic relevance determination prior, i.e., it reduces the effect of redundant inputs). The shorthand notation ${K}_{f,\ast}=k(X,{X}_{\ast}^{\prime})$ is used, where
$f$ indicates training and $\ast$ test data. Predictions can be made by forming the joint Gaussian distribution between a set of training data $\{X,\mathit{y}\}$ and testing data $\{{X}_{\ast},{\mathit{y}}_{\ast}\}$, assuming a Gaussian likelihood,

$\left[\begin{array}{c}\mathit{y}\\ {\mathit{y}}_{\ast}\end{array}\right]\sim \mathcal{N}\left(\mathbf{0},\left[\begin{array}{cc}{K}_{f,f}+{\sigma}_{n}^{2}\mathbb{I}& {K}_{f,\ast}\\ {K}_{\ast ,f}& {K}_{\ast ,\ast}+{\sigma}_{n}^{2}{\mathbb{I}}_{\ast}\end{array}\right]\right),$

where $\mathbb{I}$ denotes an identity matrix and ${\sigma}_{n}^{2}$ is a Gaussian noise variance (i.e., $e\sim \mathcal{N}(0,{\sigma}_{n}^{2})$). Standard Gaussian conditionals can be applied in order to obtain the predictive posterior distribution,

$p\left({\mathit{y}}_{\ast}\mid \mathit{y},X,{X}_{\ast}\right)=\mathcal{N}\left(\mathbb{E}\left({\mathit{y}}_{\ast}\right),\mathbb{V}\left({\mathit{y}}_{\ast}\right)\right),$

$\mathbb{E}\left({\mathit{y}}_{\ast}\right)={K}_{\ast ,f}{\left({K}_{f,f}+{\sigma}_{n}^{2}\mathbb{I}\right)}^{-1}\mathit{y},$

$\mathbb{V}\left({\mathit{y}}_{\ast}\right)={K}_{\ast ,\ast}+{\sigma}_{n}^{2}{\mathbb{I}}_{\ast}-{K}_{\ast ,f}{\left({K}_{f,f}+{\sigma}_{n}^{2}\mathbb{I}\right)}^{-1}{K}_{f,\ast}.$
The hyperparameters (including the noise variance) are typically inferred through a type-II maximum likelihood approach [41], found by minimising the negative log marginal likelihood, $\widehat{\mathit{\varphi}}=\mathrm{arg}\,\underset{\mathit{\varphi}}{\mathrm{min}}\,-\mathrm{log}\,p\left(\mathit{y}\mid X,\mathit{\varphi}\right)$, where,

$-\mathrm{log}\,p\left(\mathit{y}\mid X,\mathit{\varphi}\right)=\frac{1}{2}{\mathit{y}}^{\top}{\left({K}_{f,f}+{\sigma}_{n}^{2}\mathbb{I}\right)}^{-1}\mathit{y}+\frac{1}{2}\mathrm{log}\left|{K}_{f,f}+{\sigma}_{n}^{2}\mathbb{I}\right|+\frac{N}{2}\mathrm{log}\,2\pi .$
The probabilistic nature of a GP model means that the variance associated with the posterior process (Equation (11)) reflects the uncertainty in the identified latent function. Given that the tool is designed for interpolation, the variance provides a measure of extrapolation, which can be used to identify regions where the input-output pairs were ‘far’ from the training data (where ‘far’ is defined by the lengthscales in the covariance function). Figure 6 demonstrates this effect, whereby the variance increases away from the training observations, indicating that the model is less certain about the mean prediction; a useful property in addressing the question of how to assess when the predictive performance of a digital twin will be poor.
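To make the quantities above concrete, the Matérn 3/2 covariance, the latent posterior, and the negative log marginal likelihood can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the function names are invented, and numerical safeguards (e.g., jitter on the diagonal) are omitted.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def matern32(Xa, Xb, sigma_f2, lengthscales):
    """ARD Matern 3/2 covariance between input sets Xa (n, d) and Xb (m, d)."""
    # scaled pairwise distances r = sqrt((x - x')^T L^-2 (x - x'))
    diff = (Xa[:, None, :] - Xb[None, :, :]) / lengthscales
    r = np.sqrt(np.sum(diff**2, axis=-1))
    return sigma_f2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def gp_posterior(X, y, Xs, sigma_f2, lengthscales, sigma_n2):
    """Posterior mean/covariance of the latent function at test inputs Xs,
    plus the negative log marginal likelihood for type-II inference."""
    Kff = matern32(X, X, sigma_f2, lengthscales) + sigma_n2 * np.eye(len(X))
    Kfs = matern32(X, Xs, sigma_f2, lengthscales)
    Kss = matern32(Xs, Xs, sigma_f2, lengthscales)
    c, low = cho_factor(Kff)
    alpha = cho_solve((c, low), y)
    mean = Kfs.T @ alpha                                   # E(f*)
    cov = Kss - Kfs.T @ cho_solve((c, low), Kfs)           # V(f*)
    # -log p(y | X, phi); sum(log diag(chol)) = 0.5 * log|Kff|
    nlml = 0.5 * y @ alpha + np.sum(np.log(np.diag(c))) \
           + 0.5 * len(X) * np.log(2.0 * np.pi)
    return mean, cov, nlml
```

Far from the training inputs the returned covariance reverts towards the prior signal variance, which is exactly the behaviour exploited in Figure 6 and in the active learning scheme of Section 4.3.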
4.2. Data-Based Model Component
For the digital twin of the shear structure, the data-based component is expected to capture some dynamic content (as the missing latent function will be dynamic). In order to capture this behaviour, the GP input matrix $X$ will contain lagged information (in a similar form to an autoregressive model structure). However, the input vector will only contain lagged information from the forcing $F$ and the physics-based model outputs ${\ddot{Y}}^{p}={\mathcal{M}}^{p}(F,\mathit{\theta})$; meaning $X=\{{\{{\ddot{\mathit{y}}}_{i}^{p}({t}_{n-q}),\cdots ,{\ddot{\mathit{y}}}_{i}^{p}({t}_{n-1}),{\ddot{\mathit{y}}}_{i}^{p}\left({t}_{n}\right)\}}_{i=1}^{3},\phantom{\rule{0.277778em}{0ex}}F({t}_{n-q}),\cdots ,F({t}_{n-1}),F\left({t}_{n}\right)\}$, where $q$ is the number of lags. The outputs for each of the three Gaussian process models are the accelerations at each floor; for example, the output of the Gaussian process modelling the acceleration at the first floor is $y={\ddot{y}}_{1}\left({t}_{n}\right)$. This input-output structure means that the GP model can be used to make $m$-step-ahead predictions online, as long as the forcing input is known for the $m$ steps (where the physics-based model only requires the forcing to produce output predictions).
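The lagged input construction described above might be sketched as follows. This is a hypothetical helper; the array shapes and the ordering of floor terms before forcing terms are assumptions, not the paper's code.

```python
import numpy as np

def build_lagged_inputs(ydd_p, F, q):
    """Assemble the GP input matrix from physics-based model outputs and forcing.

    ydd_p : (T, 3) physics-based model accelerations for the three floors
    F     : (T,)  forcing input
    q     : number of lags
    Row n holds {ydd_i(t_{n-q}), ..., ydd_i(t_n)} for i = 1..3, then
    F(t_{n-q}), ..., F(t_n).
    """
    T = len(F)
    rows = []
    for n in range(q, T):
        floors = [ydd_p[n - q:n + 1, i] for i in range(ydd_p.shape[1])]
        rows.append(np.concatenate(floors + [F[n - q:n + 1]]))
    return np.asarray(rows)  # shape (T - q, 4 * (q + 1))
```

With $q=9$ lags, as used later in this section, each input row would contain $4\times 10=40$ features; the target for the first-floor GP is then the measured $\ddot{y}_1(t_n)$ aligned with row $n$.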
It is important to note that this structure differs from a GP-NARX (nonlinear autoregressive with exogenous inputs) model structure [43,44], where the measured responses (rather than the physics-based model outputs) become inputs to the GP regression model in an autoregressive manner. Additionally, this modelling structure is similar to a nonlinear finite impulse response (FIR) model [45], augmented with the physics-based model outputs. As the input matrix of the data-based component is independent of any measured response variable, the combined physics- and data-based models in Equation (4) can be used to simulate the predicted output response for any forcing input.
The initial digital twin model, structured as Equation (4), was formed from the linear physics-based model trained on dataset one, i.e., ${\ddot{Y}}^{p}={\mathcal{M}}^{p}(X,\mathit{\theta})={\mathcal{M}}^{l}(F,{\mathit{\theta}}_{\mathcal{D}1}^{MAP})$, and three independent Gaussian process models (one for the acceleration at each floor), where nine lags of the forcing and linear model outputs were used (based on the number of lags with the lowest negative log marginal likelihood in training).
Figure 7 shows predictions when the GP models are trained using dataset one, i.e.,
${\ddot{Y}}^{dt}\phantom{\rule{3.33333pt}{0ex}}=\phantom{\rule{3.33333pt}{0ex}}{\mathcal{M}}^{d}({\mathcal{M}}^{p}(X,\mathit{\theta}),\mathit{\varphi})={\mathcal{M}}^{GP}(\{{\ddot{Y}}^{p},F\},{\widehat{\mathit{\varphi}}}_{\mathcal{D}1})$, and applied on datasets one to three, where the NMSEs are
$\{0.000,0.000,0.000\}$,
$\{0.309,2.479,2.089\}$ and
$\{29.554,44.518,32.434\}$, respectively. It is noted that the addition of the data-based model has not completely reduced the error on dataset three (where the harsh nonlinearity was active), which is expected given that dataset one contains only information from the structure in a non-contact state (although the performance is better than that of the linear model alone, which produced NMSEs of $\{34.595,65.310,42.214\}$ on dataset three). The reason the data-augmented model performs better on dataset three than the linear model alone is that the GP model has learnt to compensate for some of the model-form misspecification. For example, the physics of the joints are simplified in the linear shear-model structure, and the residual from this behaviour may have been identified by the GP from the linear behaviour in dataset one. However, the data-based component can be updated, given new information about the nonlinearity, subsequently improving predictive performance on dataset three, without biasing any physical parameters.
In addition to improved predictive accuracy, the digital twin now includes a measure of predictive uncertainty, expressed through the GP predictive latent function variance (i.e., $\mathbb{V}\left({\mathit{f}}_{\ast}\right)=\mathbb{V}\left({\mathit{y}}_{\ast}\right)-{\mathbb{I}}_{\ast}{\sigma}_{n}^{2}$), displayed in Figure 8. Intuitively, a threshold can be defined in order to determine when predictive performance is uncertain and a decision should be made. There are several methods for setting this threshold, discussed in Section 4.3, where, for the example in Figure 8, a threshold has been obtained by taking the maximum pointwise predictive variance from the digital twin's initial predictions on an independent test set (in this case, the first 100 data points of dataset two). It is noted that, as the digital twin, once trained, only requires the forcing input, simulations can be made for any forcing schedule in order to evaluate loading conditions where the digital twin is uncertain, prior to deployment. Future data points that would be valuable for improving the digital twin can therefore be highlighted before they occur; a valuable property, particularly if the digital twin is being used for control or health monitoring. The following section extends this observation, describing an active learning procedure that can be used to highlight data points of interest and continuously update the data-based component of the digital twin.
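Setting the threshold as the maximum pointwise latent variance over an initial test set, and flagging uncertain predictions against it, could look like the following sketch (hypothetical helper names; the paper does not prescribe an implementation).

```python
import numpy as np

def initial_threshold(var_y, sigma_n2):
    """Maximum pointwise latent-function variance, V(f*) = V(y*) - sigma_n^2,
    over an independent test set (e.g., the first 100 points of dataset two)."""
    return np.max(np.asarray(var_y) - sigma_n2)

def uncertain_regions(var_y, sigma_n2, T):
    """Flag predictions whose latent variance exceeds the threshold T."""
    return (np.asarray(var_y) - sigma_n2) > T
```

Because the trained digital twin only needs the forcing input, `uncertain_regions` could be evaluated on simulated responses to candidate forcing schedules before deployment, highlighting loading conditions where the model is uncertain.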
4.3. Active Learning Approach
Active learning, a branch of machine learning, concerns algorithms that actively query data points, with the aim of improving the learner [46,47]. For a digital twin predicting dynamic outputs, the most suitable type of active learning is stream-based active learning [48,49] (a form of selective sampling), in which the learner observes data points and decides whether to query an observation or discard it. By querying a data point, the learner has the ability to use that data point in updating the model. Key to active learning is the aim that, by using a model-informed approach, the training dataset selected by the model will be more informative (leading to better generalisation) than a training dataset selected using a non-model-informed approach, such as random or uniform sampling. This type of approach can be used as a way of answering what does a digital twin do when predictive performance is poor? The proposed active learning approach seeks to use the predictive variance of the latent function (i.e.,
${\{\mathbb{V}\left({\mathit{f}}_{\ast ,i}\right)=\mathbb{V}\left({\mathit{y}}_{\ast ,i}\right)-{\mathbb{I}}_{\ast}{\sigma}_{n,i}^{2}\}}_{i=1}^{3}$) and a threshold $T$ to determine when a decision is made to query a particular observation (in this case, only considering future observations). Any $j$th instance where the latent predictive variance is greater than the threshold, i.e., $\mathbb{V}\left({\mathit{f}}_{j,\ast}\right)>T$, for any output, is a data point that should be queried. In a digital twin context, a query can result in three main actions: recalibrating the physics-based model, retraining the data-based model, or updating the physics-based model with new physics. The latter action is the most challenging to perform in an automated manner, and as such is not considered in the proposed active learning scheme. Instead, when a data point is queried, the instance
$\{{\mathit{x}}_{j},{y}_{j}\}$ becomes part of a new training dataset, i.e., ${\mathcal{D}}_{k+1}=\{{\mathcal{D}}_{k},\{{\mathit{x}}_{j},{y}_{j}\}\}$. The new training dataset can be used to either recalibrate the physics-based model parameters $\mathit{\theta}$ or update the GP model hyperparameters $\mathit{\varphi}$. Unfortunately, recalibrating the physics-based model changes the input-output map of the GP model (as the physics-based model outputs are inputs to the Gaussian process). As a result, deciding to recalibrate the physics-based model is computationally more expensive than updating the data-based component alone. For this reason, the active learning approach will initially consider two actions: do nothing, or update the GP hyperparameters $\mathit{\varphi}$, as outlined in Algorithm 1.
Algorithm 1 Active learning for the data-based component of a digital twin
1: Set initial dataset ${\mathcal{D}}_{1}={\{{\mathit{x}}_{j},{y}_{j}\}}_{j=1}^{{N}_{initial}}$
2: Train GP on ${\mathcal{D}}_{1}$ and obtain ${\widehat{\mathit{\varphi}}}_{1}$
3: Set initial threshold ${T}_{{N}_{initial}-1}$
4: $k=1$
5: for $t={N}_{initial}:N$ do
6:  Predict GP at $\left\{{\mathit{x}}_{\ast ,t}\right\}$ using ${\widehat{\mathit{\varphi}}}_{k}$
7:  Apply ‘forgetting factor’, ${T}_{t}={f}_{f}{T}_{t-1}$
8:  if $\mathbb{V}\left(f\left({\mathit{x}}_{\ast ,t}\right)\right)>T$ then
9:   Update dataset ${\mathcal{D}}_{k+1}=\{{\mathcal{D}}_{k},\{{\mathit{x}}_{t},{y}_{t}\}\}$
10:   Retrain GP on ${\mathcal{D}}_{k+1}$ and obtain ${\widehat{\mathit{\varphi}}}_{k+1}$
11:   Update predictions at $\left\{{\mathit{x}}_{\ast ,1:t}\right\}$ using ${\widehat{\mathit{\varphi}}}_{k+1}$
12:   Update threshold, ${T}_{k}=\mathrm{max}\left[\mathbb{V}\left(f\left({\mathit{x}}_{\ast ,1:t}\right)\right)\right]$
13:   $k=k+1$
14:  end if
15: end for
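A minimal sketch of Algorithm 1 is given below, assuming hypothetical helpers `train_gp` (returning hyperparameters from a dataset) and `predict_var` (returning the latent predictive variance at an input); neither is part of the paper, and taking the initial threshold as the maximum latent variance over the initial dataset is one choice among those discussed below.

```python
def active_learning(stream, D, train_gp, predict_var, f_f=0.99):
    """Stream-based active learning for the data-based GP component (Algorithm 1).

    stream      : iterable of (x_t, y_t) observations arriving online
    D           : list of initial training pairs {x_j, y_j}
    train_gp    : hypothetical helper, dataset -> hyperparameters (lines 2, 10)
    predict_var : hypothetical helper, (x, phi) -> latent variance (line 6)
    f_f         : 'forgetting factor', f_f <= 1 (line 7)
    """
    phi = train_gp(D)
    # one choice of initial threshold: maximum latent variance over initial data
    T = max(predict_var(x, phi) for x, _ in D)
    seen = [x for x, _ in D]
    for x_t, y_t in stream:
        T *= f_f                        # line 7: decay threshold between queries
        seen.append(x_t)
        if predict_var(x_t, phi) > T:   # line 8: query if variance exceeds T
            D = D + [(x_t, y_t)]        # line 9: augment the training dataset
            phi = train_gp(D)           # line 10: retrain the GP
            # lines 11-12: re-predict at all past inputs, reset the threshold
            T = max(predict_var(x, phi) for x in seen)
    return D, phi
```

Setting `f_f=1.0` recovers the adaptive threshold without forgetting, while a fixed threshold corresponds to skipping both the decay and the reset steps.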
The main consideration in Algorithm 1 is how to set (and whether to update) the threshold. One approach is to fix the threshold (i.e., lines 7 and 11 in Algorithm 1 are not performed), using some expected performance of the Gaussian process. This may be challenging to set prior to deployment: if set too low, it will lead to very frequent querying, and if set too high, it will result in all data points being discarded. Another approach is to set an adaptive threshold; two such schemes are considered. The first sets the threshold ${T}_{k}$ at query $k$ to be the maximum variance of the latent function at the previous $t$ observations, i.e., ${T}_{k}=\mathrm{max}\left[\mathbb{V}\left(f\left({\mathit{x}}_{\ast ,1:t}\right)\right)\right]$ (ignoring line 7 in Algorithm 1). This criterion states that the next queried observation must be more informative than past observations. However, it may lead to scenarios where the threshold rises to a value at which no future points can be queried. To overcome this problem, the criterion can be amended with a ‘forgetting factor’ ${f}_{f}$, allowing the threshold to decrease at each sample point $t$ since the previous update $k$ (line 7 of Algorithm 1). The ‘forgetting factor’ should be set such that ${f}_{f}<1$ (where ${f}_{f}=1$ is equivalent to keeping ${T}_{k}$ fixed between queries), meaning that the threshold will decrease until a new point is queried. A value of ${f}_{f}$ close to zero will lead to observations being queried very frequently, regardless of their informativeness, whereas a value close to one means that only data points with a large latent variance are sampled.
Furthermore, in Algorithm 1, the latest GP model (from update $k$) is used to (re)predict the output at all test instances. This assumes that the latest GP model is the best so far. However, it is trivial to amend the algorithm such that the GP trained on dataset ${\mathcal{D}}_{k}$ is only used to make predictions on time points from $k$ until the next query at $k+1$. This change in implementation would be useful if it is assumed that the GP for each training dataset $k$ is optimal only until the next queried data point.
Comparisons of the performance between uniform sampling (left panels) and active learning with both fixed (middle panels) and adaptive thresholds, where
${f}_{f}=1$ (right panels), are presented in
Figure 9. In addition,
Figure 10 provides a comparison of active learning with different ‘forgetting factors’:
${f}_{f}=0.999$ (left panels),
${f}_{f}=0.99$ (middle panels) and
${f}_{f}=0.9$ (right panels). The NMSEs for both figures are calculated over the complete dataset at every update step, therefore assessing both the digital twin's ability to generalise and its predictive performance. The physics-based model for each approach was linear (Equation (
1)), where the parameters were
${\mathit{\theta}}_{\mathcal{D}1}^{MAP}$. Each method was initialised with 75 observations from dataset one, i.e.,
${N}_{initial}=75$. The uniform sampling approach queried every 25th data point (where the threshold criterion is ignored) and, for active learning with a fixed threshold,
$T=0.001$ for each floor (based on the performance of GP models in
Figure 8). For datasets two and three,
${\mathcal{D}}_{1}$ (line 1 in Algorithm 1) was set as the final
${\mathcal{D}}_{k}$ from the previous dataset (meaning information from the previous datasets is carried forward). The number of additional queries for each dataset is provided in
Table 2.
The performance of uniform sampling is poor compared to the other approaches, with the algorithm taking a long time to converge for dataset one, and producing higher NMSEs for all three datasets when compared to the active learning approaches. For dataset three, the approach is actually detrimental to performance, with the predictions for floors two and three becoming worse with additional queries. In contrast, both the fixed and adaptive thresholds (where ${f}_{f}=1$) converge quickly for dataset one, and maintain low NMSEs for both datasets one and two, with both methods querying a large number of observations around 14 s in dataset three, leading to an increase in performance. The main difference between these two approaches is the number of queries made, with the fixed threshold querying significantly more observations, 365 in total, compared to the adaptive threshold (${f}_{f}=1$) with 130 queries. This difference is particularly clear for dataset three just after 16 s, where the adaptive threshold is no longer sampling every data point (due to a high threshold), leading to the approach plateauing earlier, with higher final NMSEs. This shows that there is a trade-off in how the threshold is set, such that the number of queries is not so high as to slow the GP retraining step, whilst balancing potential performance.
A key decision in performing the active learning approach is setting a reasonable threshold, given some engineering judgement. Three other ‘forgetting factors’ were also compared,
${f}_{f}=0.999$,
${f}_{f}=0.99$ and
${f}_{f}=0.9$, where a lower value leads to more queries at locations where the predictive latent variance may not be maximum. This expectation is confirmed by
Table 2, where the number of queries increases as
${f}_{f}$ decreases (with all three approaches querying fewer observations than the fixed threshold). It is interesting to note that the three approaches provide similar NMSE performance on dataset one, even though
${f}_{f}=0.9$ queries over double the number of observations. Both
${f}_{f}=0.999$ and
${f}_{f}=0.99$ perform relatively similarly for dataset two, with
${f}_{f}=0.99$ obtaining a slightly better NMSE for floor two. However,
${f}_{f}=0.9$ shows the best performance on dataset two, even compared to the fixed threshold. All three ‘forgetting factor’ values produce similar initial performance on dataset three, with each querying observations from 14 s and, unlike
${f}_{f}=1$, continuing to query data points after the high-amplitude response from the initial contact, with relatively similar final NMSEs of around
$\{13,28,16\}$ for floors one to three, respectively—better than
${f}_{f}=1$,
$\{13.278,30.787,19.761\}$, but worse than the fixed threshold,
$\{8.414,24.263,10.921\}$.
Figure 11 shows a comparison of the updated predictions at the end of dataset two, i.e., the predictions from the GP model trained on the final training dataset. The figure also shows the query locations (vertical lines), which are the data points that have been selected to form the training dataset.
Figure 11 compares the fixed (left panels) and adaptive (right panels) threshold when
${f}_{f}=0.99$. Queries are sparse for both methods, as expected given that the system behaves linearly. In contrast,
Figure 12 presents the final updated predictions and locations of queries for the fixed threshold (left panels) and
${f}_{f}=0.99$ (right panels). It is clear from the fixed threshold results that the system response has changed in dataset three, leading to all the observations being queried from around 14 s. In comparison,
${f}_{f}=0.99$ queries continuously around the large-amplitude response where the harsh nonlinearity is active. A heuristic could be introduced to the algorithm whereby, given a large number of consecutive queries, the physical system is assumed to have changed, and therefore other actions should be taken, changing the underlying physics-based model. These actions could be either recalibration or the addition of new physics to the model. This modification is left as future research, but could be aided by considering the associated cost of these actions (both in terms of the consequences of poor performance and the computational resources required to perform the action), making a utility-based approach appropriate [50,51]. However, the following section, on real-time hybrid testing, considers the scenario where the active learning results on dataset three have indicated that additional physics is required. Hybrid testing is then considered as one approach for isolating and identifying these physics.