Copula-Based Uncertainty Quantification (Copula-UQ) for Multi-Sensor Data in Structural Health Monitoring

The problem of uncertainty quantification (UQ) for multi-sensor data is one of the main concerns in structural health monitoring (SHM). One important task is multivariate joint probability density function (PDF) modelling. Copula-based statistical inference has attracted significant attention because it decouples inference on the univariate marginal PDF of each random variable from inference on the statistical dependence structure (called the copula) among the random variables. This paper proposes the Copula-UQ, comprising multivariate joint PDF modelling, inference on model class selection and parameter identification, and probabilistic prediction using incomplete information, for multi-sensor data measured from an SHM system. The multivariate joint PDF is modelled based on the univariate marginal PDFs and the copula. Inference is made by combining the idea of the inference functions for margins and the maximum likelihood estimate. Prediction on the PDF of the target variable, using the complete (from normal sensors) or incomplete (due to missing data caused by a sensor fault) information of the predictor variable, is made based on the multivariate joint PDF. One example using simulated data and one example using temperature data from multiple sensors of a monitored bridge are presented to illustrate the capability of the Copula-UQ in joint PDF modelling and target variable prediction.


Introduction
The problem of uncertainty quantification (UQ) for multi-sensor data has been one of the main concerns in nondestructive testing and structural health monitoring (SHM) over the years [1][2][3][4][5][6][7][8][9]. One important task is multivariate joint probability density function (PDF) modelling. Due to irregularities in multi-sensor data, the joint PDF can be too complicated to be modelled by traditional approaches. For example, traditional multivariate PDFs (such as the multivariate normal distribution) cannot model a PDF with multiple peaks. Multivariate mixture PDFs (such as the multivariate normal mixture model), utilized in SHM and damage detection [10][11][12][13], rely on a proper choice of the number and type of mixture components and on the initial value of the parameter vector in the optimization [14]. The Nataf distribution, utilized in SHM and structural reliability [15,16], relies on the assumption that the transformed random variables, obtained from marginal transformations of the original random variables, follow a multivariate normal distribution [17].
In recent years, copula-based statistical inference has attracted significant attention because it decouples inference on the univariate marginal PDF of each random variable from inference on the statistical dependence structure (called the copula) among the random variables. In the areas of SHM and structural assessment, Zhang and Kim [18] investigated a way of detecting bridge damage for long-term health monitoring by using copula theory. Fan and Liu [19] predicted the dynamic reliability of a bridge system based on SHM data. Pan et al. [20] developed a copula-based approach to model the structural health of an operational metro tunnel in a dependent system. Liu et al. [21] considered the correlation between the fatigue equivalent stress and the stress cycle using the copula function in fatigue reliability assessment. Srinivas et al. [22] proposed the multivariate simulation of dependent axle weights of different vehicle classes. Zhang et al. [23] investigated the specification of long-term design loads for offshore structures considering multiple environmental factors.
Although copula-based statistical inference has been widely applied, previous works related to SHM and structural assessment have two limitations. The first limitation is the insufficient variety of probabilistic model candidates for univariate marginal PDF modelling. From the parametrization point of view, there are parametric models and nonparametric models for PDF modelling. The former type, which assumes that the sample data come from a distribution with a fixed set of parameters, is suitable for data with a regular statistical pattern; the latter type, which is not specified a priori but is instead adaptively determined from the data, is suitable for data with an irregular statistical pattern. In SHM, it is well known that the statistical regularities of data from different sensors can differ significantly. Thus, given the complexity of real SHM data, considering only one type of probabilistic model restricts the solution space for UQ and can make it impossible to capture the statistical pattern of the data. However, this issue was not recognized in previous works, so parametric and nonparametric models were not considered simultaneously. For example, References [18][19][20][21][23] solely adopted parametric models, while Reference [22] solely adopted nonparametric models for univariate marginal PDF modelling. This paper attempts to overcome this limitation by including both types of probabilistic models as candidates.
The second limitation is the neglect of probabilistic prediction using available information, especially when the information of the predictor variable is incomplete due to missing data caused by a sensor fault. In the SHM works using the copula [18][19][20][21][22][23], it was not recognized that the joint PDF can be utilized for probabilistic prediction on the target variable using the available information of the predictor variable. Even in a very recent copula-based work from another research area [24], probabilistic prediction on the target variable is limited to the case of complete information of the predictor variable. However, incomplete information of the predictor variable, due to missing data caused by a sensor fault, is both critical and common in SHM. This paper attempts to overcome this limitation by performing marginalization and conditioning on the copula-based joint PDF, so that the PDF of the target variable can be predicted using the complete (from normal sensors) or incomplete (due to missing data caused by a sensor fault) information of the predictor variable.
This paper proposes the copula-based UQ (Copula-UQ), comprising multivariate joint PDF modelling, inference on model class selection and parameter identification, and probabilistic prediction using incomplete information, for multi-sensor data measured from an SHM system. The proposed Copula-UQ contains two stages. The first stage is copula-based multivariate joint PDF modelling, which is based on the univariate marginal PDFs and the copula. The second stage is copula-based inference and prediction. Inference, including determination of the optimal parameters and selection of the optimal model classes, is made by combining the idea of the inference functions for margins (IFM) and the maximum likelihood estimate (MLE). Prediction on the PDF of the target variable, using the complete or incomplete information of the predictor variable, is made based on the copula-based multivariate joint PDF.
The structure of this paper is outlined as follows. Section 2 presents copula-based multivariate joint PDF modelling, including model class candidates for univariate marginal PDFs and copula. Section 3 presents copula-based inference and prediction, including inference on univariate marginal PDFs and copula, and prediction on the target variable. Section 4 presents illustrative examples. One example using simulated data and one example using temperature data of multi-sensor of a monitored bridge are presented to illustrate the capability of the proposed Copula-UQ in joint PDF modelling and target variable prediction.

Copula-Based Multivariate Joint PDF Modelling
Let $p(x_1, x_2, \ldots, x_D)$ denote the joint PDF of $D$ random variables $(X_1, X_2, \ldots, X_D)$, and let $X \in \mathbb{R}^{D \times N}$ denote the measured data matrix with its component $X_{d,i}$ being the $d$-th dimension of the $i$-th data point, with $d = 1, \ldots, D$ and $i = 1, \ldots, N$. Copula-based multivariate joint PDF modelling constructs $p(x_1, x_2, \ldots, x_D)$ from the univariate marginal PDFs $p(x_d)$, $d = 1, 2, \ldots, D$, of each random variable and the statistical dependence structure (called the copula) among the random variables, given the measured data matrix $X$.

Univariate Marginal PDFs
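For the $d$-th univariate random variable $X_d$, consider a set of $N_M$ model class candidates, namely, M

Consider the joint CDF of $(X_1, X_2, \ldots, X_D)$, denoted by $P(x_1, x_2, \ldots, x_D)$. Sklar's theorem states that there exists a $D$-dimensional copula $C$ such that [27]:

$$P(x_1, x_2, \ldots, x_D) = C\big(P(x_1 \mid \theta_1^{m_1}), P(x_2 \mid \theta_2^{m_2}), \ldots, P(x_D \mid \theta_D^{m_D}) \mid \psi\big) \tag{10}$$

where $\psi$ is the parameter vector of the copula. If the marginal CDFs $P(x_d \mid \theta_d^{m_d})$ are continuous, the copula is unique; otherwise, it is uniquely determined on the Cartesian product of the ranges of the marginal CDFs. Sklar's theorem thus indicates that the joint CDF of the random variables can be characterized by a copula in terms of the marginal CDFs.

The joint PDF, $p(x_1, x_2, \ldots, x_D)$, can therefore be derived from the joint CDF, $P(x_1, x_2, \ldots, x_D)$, of Equation (10):

$$p(x_1, x_2, \ldots, x_D) = c\big(u_1^{m_1}, u_2^{m_2}, \ldots, u_D^{m_D} \mid \psi\big) \prod_{d=1}^{D} p(x_d \mid \theta_d^{m_d}) \tag{11}$$

where $c(u_1^{m_1}, u_2^{m_2}, \ldots, u_D^{m_D} \mid \psi)$ is the copula density function:

$$c\big(u_1^{m_1}, u_2^{m_2}, \ldots, u_D^{m_D} \mid \psi\big) = \frac{\partial^D C\big(u_1^{m_1}, u_2^{m_2}, \ldots, u_D^{m_D} \mid \psi\big)}{\partial u_1^{m_1}\, \partial u_2^{m_2} \cdots \partial u_D^{m_D}}$$

and

$$u_d^{m_d} = P(x_d \mid \theta_d^{m_d}), \quad d = 1, 2, \ldots, D$$

In this paper, the multivariate Gaussian copula and the associated copula density function are introduced as follows [28]:

$$C\big(u_1^{m_1}, u_2^{m_2}, \ldots, u_D^{m_D} \mid \psi\big) = \Phi_{\rho(\zeta)}\big(\Phi^{-1}(u_1^{m_1}), \Phi^{-1}(u_2^{m_2}), \ldots, \Phi^{-1}(u_D^{m_D})\big)$$

$$c\big(u_1^{m_1}, u_2^{m_2}, \ldots, u_D^{m_D} \mid \psi\big) = \frac{1}{\sqrt{|\rho(\zeta)|}} \exp\left\{ -\frac{1}{2} \zeta^T \big(\rho(\zeta)^{-1} - I\big) \zeta \right\} \tag{14}$$

$$\zeta = \big[\Phi^{-1}(u_1^{m_1}), \Phi^{-1}(u_2^{m_2}), \ldots, \Phi^{-1}(u_D^{m_D})\big]^T \tag{15}$$

where $\Phi^{-1}(\cdot)$ is the inverse CDF of the univariate standard normal distribution, $\Phi_{\rho(\zeta)}(\cdot)$ is the joint CDF of a $D$-dimensional normal distribution with zero mean vector and covariance matrix equal to the correlation coefficient matrix $\rho(\zeta)$ of $\zeta$ defined in Equation (15), the parameter vector $\psi$ is the collection of the off-diagonal elements of the upper triangular part of $\rho(\zeta)$, and $I$ is the $D$-dimensional identity matrix.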
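As an illustration of how a Gaussian copula density and the resulting joint PDF might be evaluated numerically, the following is a minimal sketch rather than the authors' implementation. The function names `gaussian_copula_density` and `joint_pdf` are hypothetical, the marginals are assumed to be frozen `scipy.stats` distributions, and the data points are stored as rows (shape N × D), which is the transpose of the D × N convention used in the text.

```python
import numpy as np
from scipy.stats import norm


def gaussian_copula_density(u, rho):
    """Gaussian copula density c(u | rho) evaluated row by row.

    u   : array-like of shape (N, D) with entries in (0, 1)
    rho : (D, D) correlation matrix
    """
    u = np.atleast_2d(u)
    zeta = norm.ppf(u)                      # zeta = Phi^{-1}(u), shape (N, D)
    d = rho.shape[0]
    rho_inv = np.linalg.inv(rho)
    # Quadratic form zeta^T (rho^{-1} - I) zeta, one value per row
    quad = np.einsum("ni,ij,nj->n", zeta, rho_inv - np.eye(d), zeta)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(rho))


def joint_pdf(x, marginals, rho):
    """Joint PDF = copula density times the product of the marginal PDFs.

    x         : array-like of shape (N, D)
    marginals : list of D objects exposing .cdf and .pdf (assumed interface)
    """
    x = np.atleast_2d(x)
    u = np.column_stack([m.cdf(x[:, d]) for d, m in enumerate(marginals)])
    pdfs = np.column_stack([m.pdf(x[:, d]) for d, m in enumerate(marginals)])
    return gaussian_copula_density(u, rho) * pdfs.prod(axis=1)
```

A quick sanity check of this sketch is that standard normal marginals combined with a Gaussian copula reproduce the multivariate normal PDF, and that an identity correlation matrix gives a copula density of one everywhere.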

Inference on Univariate Marginal PDFs and Copula
This stage makes inference on $\Theta = \{\theta_1^{m_1}, \ldots, \theta_D^{m_D}\}$ (parameters of the marginal PDFs), $M$ (model class candidates of the marginal PDFs) and $\psi$ (parameters of the multivariate Gaussian copula), based on the measured data matrix $X \in \mathbb{R}^{D \times N}$ and the probability matrix $U \in \mathbb{R}^{D \times N}$, with its component $U_{d,i} = P(X_{d,i} \mid \Theta, M)$. Under the idea of the inference functions for margins (IFM) [29], $\Theta$ (along with $M$) and $\psi$ can be determined separately.
For the univariate marginals, the optimal parameter values can be obtained by the MLE:

$$\hat{\theta}_d^{m_d} = \arg\max_{\theta_d^{m_d}} \sum_{i=1}^{N} \log\big\{ p(X_{d,i} \mid \theta_d^{m_d}) \big\}$$

where $\log\{\cdot\}$ is the logarithmic function. For most of the parametric models, analytical forms of the optimal parameters can be derived (for example, see Reference [30]). For nonparametric models, the optimal value $\hat{\theta}_d^{m_d}$ (bandwidth) can be obtained from the asymptotic mean integrated squared error solution [31], which is expressed in terms of $Q_d(0.75)$ and $Q_d(0.25)$, the 75% and 25% quantiles of $X_d$.
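The optimal marginal PDFs ($\hat{M} = \{\hat{M}_1, \ldots, \hat{M}_D\}$) are selected by comparing the optimal likelihood values of different $m_d$:

$$\hat{M}_d = \arg\max_{m_d} \sum_{i=1}^{N} \log\big\{ p(X_{d,i} \mid \hat{\theta}_d^{m_d}) \big\} \tag{20}$$

After selecting $\hat{M}$, the optimal parameters associated with $\hat{M}$ are denoted as $\hat{\Theta} = \{\hat{\theta}_1, \ldots, \hat{\theta}_D\}$.
Based on $\hat{M}$ and $\hat{\Theta}$, the component of the optimal probability matrix, $\hat{U}_{d,i} = P(X_{d,i} \mid \hat{\Theta}, \hat{M})$, can be obtained.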
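The marginal inference step can be sketched as follows for a single sensor channel. This is an illustrative assumption-laden example, not the paper's code: the candidate set (normal, lognormal, normal kernel) is a small hypothetical subset, the helper names `fit_candidates` and `select_marginal` are invented here, and the kernel bandwidth uses SciPy's built-in Silverman rule as a stand-in for the asymptotic mean integrated squared error solution cited in the text.

```python
import numpy as np
from scipy import stats


def fit_candidates(x):
    """Fit each candidate marginal to 1-D data x.

    Returns a dict {name: (cdf_callable, maximized_log_likelihood)}.
    """
    x = np.asarray(x, dtype=float)
    fitted = {}
    # Parametric candidates: scipy's .fit() returns maximum likelihood estimates
    for name, dist, kwargs in [("normal", stats.norm, {}),
                               ("lognormal", stats.lognorm, {"floc": 0.0})]:
        if name == "lognormal" and np.any(x <= 0):
            continue                          # lognormal needs positive data
        frozen = dist(*dist.fit(x, **kwargs))
        fitted[name] = (frozen.cdf, float(np.sum(frozen.logpdf(x))))
    # Nonparametric candidate: kernel density estimate with a normal kernel
    kde = stats.gaussian_kde(x, bw_method="silverman")   # assumed bandwidth rule
    kde_cdf = np.vectorize(lambda v: kde.integrate_box_1d(-np.inf, v))
    fitted["normal kernel"] = (kde_cdf, float(np.sum(kde.logpdf(x))))
    return fitted


def select_marginal(x):
    """Pick the candidate with the largest log-likelihood and return
    (name, U_row, loglik), where U_row holds the CDF values of the data."""
    x = np.asarray(x, dtype=float)
    fitted = fit_candidates(x)
    best = max(fitted, key=lambda k: fitted[k][1])
    cdf, loglik = fitted[best]
    return best, cdf(x), loglik
```

Applying `select_marginal` to each row of the data matrix and stacking the returned CDF values row by row would give a probability matrix of the same D × N shape as the data.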
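The optimal values of the parameters of the multivariate Gaussian copula, $\hat{\psi}$, can be determined by maximizing $L_c(\hat{U} \mid \psi)$:

$$\hat{\psi} = \arg\max_{\psi} L_c(\hat{U} \mid \psi), \quad L_c(\hat{U} \mid \psi) = \sum_{i=1}^{N} \log\big\{ c\big(\hat{U}_{1,i}, \hat{U}_{2,i}, \ldots, \hat{U}_{D,i} \mid \psi\big) \big\}$$

where $c(\hat{U}_{1,i}, \hat{U}_{2,i}, \ldots, \hat{U}_{D,i} \mid \psi)$ is obtained by substituting $\hat{U}_{1,i}, \hat{U}_{2,i}, \ldots, \hat{U}_{D,i}$ into Equation (14). For the multivariate Gaussian copula, the optimal parameter $\hat{\psi}$ is the collection of the off-diagonal elements of the upper triangular part of $\rho(\zeta)$, with each component being Pearson's correlation coefficient.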
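For the Gaussian copula, the statement that each component of the optimal parameter is a Pearson correlation coefficient can be sketched in a few lines. The function name `fit_gaussian_copula` and the clipping constant `eps` are assumptions introduced here; the clipping only guards against infinite values when a CDF value equals exactly 0 or 1.

```python
import numpy as np
from scipy.stats import norm


def fit_gaussian_copula(U, eps=1e-10):
    """Estimate Gaussian copula parameters from a (D, N) probability matrix.

    Returns (rho, psi): the (D, D) correlation matrix of zeta = Phi^{-1}(U)
    and the vector of its upper-triangular off-diagonal elements.
    """
    U = np.clip(np.asarray(U, dtype=float), eps, 1.0 - eps)  # avoid +/- inf
    zeta = norm.ppf(U)                     # transform to standard normal scores
    rho = np.corrcoef(zeta)                # rows are variables, columns samples
    psi = rho[np.triu_indices(rho.shape[0], k=1)]
    return rho, psi
```

The returned `rho` could then be passed to a Gaussian copula density routine, together with the selected marginals, to assemble the joint PDF.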

Prediction on the Target Variable Given Complete or Incomplete Information (Due to Missing Data Caused by a Sensor Fault Issue)
Let $p(x_1, x_2, \ldots, x_D \mid \hat{\psi}, \hat{\Theta}, \hat{M})$ denote the multivariate joint PDF obtained by substituting $\hat{\psi}$, $\hat{\Theta}$ and $\hat{M}$ into Equation (11). Let the target variable be the set containing the components of $X_1, X_2, \ldots, X_D$ selected for prediction, and let the predictor variable be the complement of the target variable. As the joint PDF contains all the statistical information about the random variables $(X_1, X_2, \ldots, X_D)$, the PDF of the target variable can be predicted using the complete or incomplete information of the predictor variable. Let $x_{ta}$, $x_o$ and $x_{uo}$ denote the target variable, the observed predictor variable and the unobserved predictor variable, respectively. It is worth noting that an unobserved predictor variable $x_{uo}$ is very common in SHM, as it represents missing data in the channels of faulty sensors. However, the very recent work on copula-based prediction [24] was still incapable of handling $x_{uo}$ in its prediction phase. Here, by performing marginalization and conditioning on the copula-based multivariate joint PDF, the prediction on the PDF of $x_{ta}$ based on the observation $x_o = \bar{x}_o$ only (that is, the available information only) can be obtained by:

$$p\big(x_{ta} \mid x_o = \bar{x}_o\big) = \frac{\int p\big(x_{ta}, x_o = \bar{x}_o, x_{uo} \mid \hat{\psi}, \hat{\Theta}, \hat{M}\big)\, dx_{uo}}{\iint p\big(x_{ta}, x_o = \bar{x}_o, x_{uo} \mid \hat{\psi}, \hat{\Theta}, \hat{M}\big)\, dx_{uo}\, dx_{ta}} \tag{23}$$

where $p(x_{ta}, x_o = \bar{x}_o, x_{uo} \mid \hat{\psi}, \hat{\Theta}, \hat{M})$ is the copula-based multivariate joint PDF with $x_o = \bar{x}_o$ substituted. Accordingly, the predicted value (mean) and the associated uncertainty (standard deviation) of $x_{ta}$ can be obtained.
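The marginalization and conditioning step can be illustrated for a scalar target variable as follows. This sketch swaps the explicit integral over the unobserved predictors for a known property of the Gaussian copula: integrating out a variable is equivalent to deleting its row and column from the correlation matrix. The names `conditional_pdf` and `mean_and_std`, the grid-based normalization, and the frozen `scipy.stats` marginals are all assumptions made for illustration.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm


def conditional_pdf(grid, ta, obs, x_obs, marginals, rho):
    """p(x_ta | x_o = x_obs) on a 1-D grid for a scalar target index `ta`.

    obs   : indices of the observed predictors; unobserved predictors are
            simply left out, which marginalizes them for a Gaussian copula
    x_obs : observed values, in the same order as `obs`
    """
    idx = [ta] + list(obs)
    sub = rho[np.ix_(idx, idx)]                    # correlation sub-matrix
    sub_inv = np.linalg.inv(sub)
    z_obs = norm.ppf([marginals[j].cdf(v) for j, v in zip(obs, x_obs)])
    z_ta = norm.ppf(marginals[ta].cdf(grid))
    zeta = np.column_stack([z_ta, np.tile(z_obs, (grid.size, 1))])
    quad = np.einsum("ni,ij,nj->n", zeta, sub_inv - np.eye(len(idx)), zeta)
    # Copula density times target marginal; constants cancel on normalization
    dens = np.exp(-0.5 * quad) * marginals[ta].pdf(grid)
    return dens / trapezoid(dens, grid)            # conditioning step


def mean_and_std(grid, pdf):
    """Predicted value (mean) and uncertainty (standard deviation)."""
    mean = trapezoid(grid * pdf, grid)
    var = trapezoid((grid - mean) ** 2 * pdf, grid)
    return mean, np.sqrt(var)
```

Passing all predictor indices in `obs` corresponds to the complete-information case, while passing only the indices of the working sensors corresponds to the incomplete-information case. The grid should cover the bulk of the target variable's support for the numerical normalization to be accurate.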

Illustrative Examples
One example using simulated data and one example using real SHM data are presented. The simulated data example is designed to validate the following three issues: (1) the necessity of introducing both parametric and nonparametric models to overcome the first limitation (i.e., insufficient types of probabilistic model candidates), (2) the capability of the proposed Copula-UQ in multivariate joint PDF modelling (as the true joint PDF is known) and (3) the performance of the proposed Equation (23) for prediction on the target variable given complete or incomplete information (due to missing data caused by a sensor fault). For the real SHM data example, the performance of the proposed Copula-UQ for prediction on the target variable under complete (normal sensors) or incomplete (faulty sensors) information is further validated by considering the following two cases: (1) the test dataset is identical to the training dataset, and (2) the test dataset is different from the training dataset.

Simulation Data
This example applies the proposed Copula-UQ to multivariate joint PDF modelling and prediction for five-dimensional random variables, $X = [X_1, \ldots, X_5]^T$. First, five uncorrelated random variables, $Z = [Z_1, \ldots, Z_5]^T$, with different marginal PDFs are constructed (see Table 1). Then, the random variables $X = [X_1, \ldots, X_5]^T$ are obtained by applying an affine transformation $X = AZ$, where $A$ is a given invertible $5 \times 5$ matrix. Thus, the analytical form of the joint PDF of $X$ is:

$$p_X(x) = \frac{p_Z\big(z = A^{-1}x\big)}{|\det(A)|} \tag{25}$$

where $p_Z(z = A^{-1}x)$ is the joint PDF of $Z$ with $z = A^{-1}x$. Figure 1 shows the scatter plot of the simulated data for $X_1$ to $X_5$ ($N = 500$). The correlation coefficient matrix for $X$ is given in Equation (26). High correlation (for example, between $x_1$ and $x_4$), medium correlation (for example, between $x_2$ and $x_5$) and low correlation (for example, between $x_1$ and $x_3$) can be found in this case. Table 2 shows the maximum log-likelihood values of different univariate marginal PDFs of $X_1$ to $X_5$. Using Equation (20), the optimal univariate marginal PDF of each dimension can be determined; these are underlined in Table 2. The optimal PDFs of $X_1$ to $X_5$ are the Normal kernel, Triangle kernel, Lognormal distribution, Lognormal distribution and Triangle kernel, respectively. In order to compare the fitting capacities of the different PDFs in Table 2, Figure 2 shows the univariate marginal PDFs of $X_1$ to $X_5$. Each subplot shows the data histogram, the top-ranking PDF (that is, the optimal marginal PDF in Table 2; dash-dot line), an intermediate-ranking PDF in Table 2 (dashed line) and a low-ranking PDF in Table 2 (dotted line). Each subplot confirms that the optimal marginal PDF of each dimension in Table 2 is the best model for uncertainty quantification of the corresponding component of $X$.
It is worth noting from Table 2 that, even though $X$ is a linear mapping of $Z$, which consists only of the very regular types of distributions described in Table 1, the optimal univariate marginal PDFs of $X$ come not only from parametric but also from nonparametric models. This result shows that introducing both parametric and nonparametric models is necessary for overcoming the first limitation (i.e., insufficient types of probabilistic model candidates) described in Section 1, because it provides a larger solution space for uncertainty quantification.

Table 1. Probability density functions (PDFs) of $Z_1$ to $Z_5$ (simulation data). Columns: Random Variable, Distribution Type, PDF.

The multivariate joint PDF of $X$ is determined by Equation (11) by substituting the optimal marginal PDFs $\hat{M}$, the associated optimal parameters $\hat{\Theta}$ and the optimal parameter $\hat{\psi}$ of the multivariate Gaussian copula. Figure 3 shows the projections of the multivariate joint PDF of $X_1$ to $X_5$. Each subplot represents the projection of the multivariate joint PDF onto two specific components of $X$. The black contour is the true PDF of Equation (25), while the green contour is the joint PDF given by the proposed Copula-UQ. It can be seen that, even though the shape of the true PDF is irregular, the proposed Copula-UQ is capable of describing the statistical dependence structure.
Figure 4 shows the comparisons of observed values and predicted values of $X_2$ to $X_5$. The 45-degree reference line indicates where the observed and predicted values are identical. Each subplot shows the predicted value of the target variable, determined based on Equation (23), using the incomplete information (yellow dots) and the complete information (blue dots) of the predictor variable. For example, for the yellow dots of the subplot in the upper left (for $x_2$), the target variable is $x_{ta} = x_2$, and the predictor variable is split into an observed part $x_o$ and a non-empty unobserved part $x_{uo}$. For the blue dots of the subplot in the upper left (for $x_2$), $x_{ta} = x_2$, $x_o = \{x_1, x_3, x_4, x_5\}$ and $x_{uo}$ is an empty set. By comparing the scatter plots of the yellow and blue dots, one can observe how the predicted value changes with the amount of information given by the predictor variable. It can be anticipated that the predicted values improve (that is, the dots lie closer to the 45-degree reference line) when more information from the predictor variable is given. This is confirmed by the subplots in the upper left (for $x_2$), lower left (for $x_4$) and lower right (for $x_5$) of Figure 4.
Note that the predicted values in the upper right subplot (for $x_3$) improve only insignificantly because of the low correlations between $x_3$ and the other components, shown in Equation (26). This result shows that the proposed formulation of Equation (23), which overcomes the second limitation (i.e., neglect of probabilistic prediction using available information) by performing marginalization and conditioning, is capable of making predictions even when the information of the predictor variable is incomplete.
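The true joint PDF of Equation (25), which serves as the reference in this example, follows from the change-of-variables rule for an invertible linear map. A minimal sketch is given below. It assumes that the components of Z are independent (so that their joint PDF factorizes), uses a hypothetical function name `affine_joint_pdf`, and stores the evaluation points as rows; the matrix and marginals passed in are placeholders, not the values used in the paper.

```python
import numpy as np
from scipy import stats


def affine_joint_pdf(x, A, z_marginals):
    """p_X(x) = p_Z(A^{-1} x) / |det A| for X = A Z.

    x           : array-like of shape (N, D), points at which to evaluate
    A           : (D, D) invertible matrix
    z_marginals : list of D frozen scipy.stats distributions for the
                  components of Z (assumed independent)
    """
    x = np.atleast_2d(x)
    z = np.linalg.solve(A, x.T).T             # z = A^{-1} x for each row
    p_z = np.prod([m.pdf(z[:, d]) for d, m in enumerate(z_marginals)], axis=0)
    return p_z / abs(np.linalg.det(A))
```

With standard normal components for Z, the result should coincide with a zero-mean multivariate normal PDF whose covariance matrix is the product of A and its transpose, which gives a simple way to check the routine.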

Temperature Data of Multi-Sensor of a Monitored Bridge
Temperature is a critical loading factor for structures [32]. Variation of temperature in structures significantly influences the material properties (for example, Young's modulus [32]), static characteristics (for example, deflection and deformation [32]), dynamic characteristics (for example, structural frequencies [33][34][35], damping ratios and mode shapes [36]) and boundary conditions [37]. Temperatures measured by multiple sensors on a structure, including ambient air temperature and structural component temperature, are uncertain because they are affected not only by ambient factors, including air temperature variation, solar radiation intensity, humidity and wind speed, but also by the complex processes of heat transfer [38]. In practice, UQ of temperatures is conducted based on temperature data measured by multiple sensors installed at different locations of a monitored structure [38][39][40][41][42]. As these works utilized traditional PDF modelling approaches, the modelling of temperature-related random variables was limited to two dimensions. Here, owing to the multivariate joint PDF modelling capacity of the proposed Copula-UQ, the dimension can be extended to $D$, where $D$ is the number of temperature sensors selected in the analysis.
This study utilized the proposed Copula-UQ to analyze temperature data from multiple sensors on the Dowling Hall Footbridge [36]. The bridge, located on the Medford campus of Tufts University, has a two-span continuous steel frame (each span is 22 m) and a reinforced concrete deck. Temperatures at different locations are monitored using type T thermocouples manufactured by Omega Engineering (measurement range from −250 to +350 °C). The multi-sensor layout for temperature monitoring is shown in Figure 7 of Reference [43]. There are ten temperature sensors in total, and they can be divided into two sensor clusters according to their locations: the west span cluster and the east span cluster. The west span cluster includes sensors for pier temperature ($C_1$), bridge deck temperature ($C_2$), steel temperature at the south side ($S_1$), steel temperature at the north side ($S_3$) and air temperature ($A_1$). The east span cluster includes sensors for pier temperature ($C_4$), bridge deck temperature ($C_3$), steel temperature at the south side ($S_2$), steel temperature at the north side ($S_4$) and air temperature ($A_2$).
The temperature data can be accessed from Reference [44]. Figure 5 shows the time histories of the ten temperature sensors from January 5, 2010 to May 2, 2010. In each subplot, two sensors monitor the same type of temperature, one belonging to the west span cluster and the other to the east span cluster. For example, in the first subplot, both $C_1$ and $C_4$ monitor pier temperature, but $C_1$ belongs to the west span cluster and $C_4$ to the east span cluster. It can be observed that the measurements of two sensors monitoring the same type of temperature differ insignificantly even though they belong to different clusters. It is worth noting that the steel temperatures at the south and north sides of the bridge differ, because sunlight affects the two sides differently. During the daytime hours, the sensors on the south side ($S_1$ and $S_2$) were significantly warmer than the sensors on the north side ($S_3$ and $S_4$) [36]. Therefore, temperature data of five sensors from the west span cluster ($C_1$, $C_2$, $S_1$, $S_3$, $A_1$) are utilized for UQ. The corresponding correlation coefficient matrix is:
Table 3 shows the maximum log-likelihood values of the univariate marginal PDFs of $C_1$, $C_2$, $S_1$, $S_3$ and $A_1$. It can be observed that the optimal PDFs of $C_1$, $C_2$, $S_1$, $S_3$ and $A_1$ are all the nonparametric model with the Normal kernel. Figure 6 shows the univariate marginal PDFs of $C_1$, $C_2$, $S_1$, $S_3$ and $A_1$. In each subplot, the top-ranking model fits the frequency histogram better than the intermediate- and low-ranking models, confirming the model class selection results in Table 3. Figure 7 shows the projections of the multivariate joint PDF of $C_1$, $C_2$, $S_1$, $S_3$ and $A_1$. It can be seen that the contours given by the Copula-UQ are capable of quantifying the uncertainty of the multivariate temperature data.
Figure 8, in the same fashion as Figure 4, shows the comparisons of observed values and predicted values of $C_1$, $C_2$, $S_1$ and $S_3$ (training dataset: full monitoring dataset, test dataset: full monitoring dataset). Again, for each subplot, it can be observed that the predicted values improve when more information of the predictor variable is given. Note that the yellow dots correspond to incomplete information of the predictor variable due to a sensor fault. For example, for the yellow dots of the subplot in the upper left (for $C_1$), the target variable, observed predictor variable and unobserved predictor variable are $x_{ta} = C_1$, $x_o = A_1$ and $x_{uo} = \{C_2, S_1, S_3\}$, respectively. That is, the yellow dots show the predicted value of the target variable $C_1$ using the information from the normal sensor $A_1$ only, without using the information from the faulty sensors $C_2$, $S_1$ and $S_3$. For the blue dots of the subplot in the upper left (for $C_1$), $x_{ta} = C_1$, $x_o = \{A_1, C_2, S_1, S_3\}$ and $x_{uo}$ is an empty set. From the four subplots of Figure 8, although the sensor fault enlarges the scatter of the yellow dots, the proposed Copula-UQ gives satisfactory results, as the available information of sensor $A_1$ is properly utilized for making predictions on the target variable.
To further validate the prediction capacity of the proposed Copula-UQ when data are missing due to a sensor fault, a new computation is conducted as follows: (1) the monitored dataset is divided into the training dataset (data covering the first 90% of the days of the total monitoring period) and the test dataset (the complement of the training dataset), (2) the marginal PDFs along with the copula model are inferred based on the training dataset and (3) the prediction capacity of the trained copula model is validated based on the test dataset, with or without data missing due to a sensor fault. Figure 9, in the same fashion as Figure 8, shows the comparisons of observed values and predicted values of $C_1$, $C_2$, $S_1$ and $S_3$ for this split. It can be observed that the scatter of both the blue dots (corresponding to complete information from normal sensors) and the yellow dots (corresponding to incomplete information due to data missing because of a sensor fault) is acceptable. Therefore, it can be concluded that even though the training dataset is different from the test dataset, the proposed Copula-UQ still gives satisfactory results in multivariate PDF modelling and target variable prediction.

Figure 9. Comparisons of observed values and predicted values of $C_1$, $C_2$, $S_1$, $S_3$ (training dataset: data covering the first 90% of the days of the total monitoring period, test dataset: complement of the training dataset). Note that the yellow dots correspond to incomplete information of the predictor variable due to missing data caused by a sensor fault.

Figure 10 shows the predicted joint PDFs of $S_1$ and $S_3$ under incomplete information, given different values of $A_1$ only and without observing the information of the faulty sensors $C_1$ and $C_2$. That is, the joint PDF $p(S_1, S_3 \mid A_1 = \bar{A}_1)$ shows how the steel temperatures of the south and north sides evolve as the air temperature changes. As $A_1$ increases, the most probable values of $S_1$ and $S_3$ under $p(S_1, S_3 \mid A_1 = \bar{A}_1)$ increase accordingly. It can be observed that the differences between the steel temperature ($S_1$ or $S_3$) and the air temperature ($A_1$) become more significant as $A_1$ increases. The reason is as follows: a higher $A_1$ is associated with a higher solar radiation intensity, and given that the specific heat capacity of steel is lower than that of air, the temperature increase of steel is more significant than that of air. It can also be observed that the temperature of $S_3$ is lower than that of $S_1$. This result coincides with the on-site sunlight conditions of the Dowling Hall Footbridge, in that the sunlight intensity on the north side ($S_3$) is lower than that on the south side ($S_1$) [36]. The predicted joint PDFs of the temperatures at different locations of the structure are important pieces of information about the uncertain thermal loading and can be utilized for the assessment of thermally induced structural responses.

Conclusions
This paper proposed the Copula-UQ for multivariate joint PDF modelling, inference on model class selection and parameter identification, and probabilistic prediction using incomplete information, and presented one example using simulated data and one example using temperature data from multiple sensors of a monitored bridge. For inference on univariate marginal PDFs, the results show that, in general, the optimal univariate marginal PDFs of different dimensions are different, so introducing both parametric and nonparametric models is necessary because it provides a larger solution space for uncertainty quantification. For prediction on the target variable using the complete (from normal sensors) or incomplete (due to missing data caused by a sensor fault) information of the predictor variable, the proposed Copula-UQ is capable of obtaining the PDF of the target variable. The proposed methodology can be extended to other multivariate joint PDF modelling problems in SHM, with emphasis on prediction under incomplete information caused by sensor faults. The predicted PDF of the target variable can be utilized for uncertainty propagation in further analysis.