A Data-Driven Approach Based on Multivariate Copulas for Quantitative Risk Assessment of Concrete Dam

: Risk assessment of dam’s running status is an important part of dam management. A data-driven method based on monitored displacement data has been applied in risk assessment, owing to its easy operation and real-time analysis. However, previous data-driven methods considered displacement data series at each monitoring point as an independent variable and assessed the running status of each monitoring point separately, without considering the correlation between displacement of different monitoring points. In addition, previous studies assessed the dam’s running status qualitatively, without quantifying the risk probability. To solve the above two issues, a displacement-data driven method based on a multivariate copula function is proposed in this paper. Multivariate copula functions can construct a joint distribution which reveals the relevance structure of random variables. We assumed that the risk probability of each dam section is independent and took monitoring points at one dam section as examples. Starting from the risk assessment of single monitoring points, we calculated the residual between the monitored displacement data and the modelled data estimated by the statistical model, and built a risk ratio function based on the residual. Then, using the multivariate copula function, we obtained a combined risk ratio of multi-monitoring points which took the correlation between each monitoring point into account. Finally, a case study was provided. The proposed method not only quantitatively assessed the probability of the real-time dam risk but also considered the correlation between the displacement data of different monitoring points.


Introduction
Risk assessment of a dam's running status is of great importance for dam safety management [1]. In earlier studies, a dam's risk was assessed based on the physical mechanism of its structural behaviours using numerical simulations or theoretical analysis [2][3][4]. With the development of soft computing techniques, data-driven methods based on displacement data have been applied to dam risk assessment, as the displacement data can reflect a dam's structural behaviour and it can be obtained by the scores of monitoring instruments buried in the dam body [5,6].
The principle of data-driven methods is an analysis of the absolute residual between monitored and modelled displacement data [7][8][9]. With long-term continuous displacement and environmental monitoring data sets, models can be trained to seize deterministic relation between displacement and environmental variables (upstream water level and temperature etc.) and modelled displacement represents the usual (or anticipated) response to environmental variables. The value of monitored data deviating from modelled data is then an indicator of a potentially unusual running status. In most studies, researchers considered the displacement data at one monitoring point as an independent random variable and assessed the risk of each monitoring point independently, ignoring the correlations between random variables [10,11]. However, in practical engineering, the displacements of adjacent monitoring points are highly interrelated and they interact with each other, and any parts of the dam jointly afford the common external load such as hydrostatic pressure and temperature [12,13].
Several attempts have been made in recent years to deal with the spatial correlations between displacements of adjacent monitoring points. Samaras et al. [14] applied an analytic hierarchy process method to dam risk assessment and simplified the correlation between random variables as a linear relation. Recently, Qin et al. [15] considered the non-linear correlation between random variables using a principal component analysis method, emphasising selecting dominant indexes rather than a quantitative risk assessment.
In addition, previous studies evaluated dam running status risks by classifying the residual between monitored and modelled displacement data into several intervals, and corresponded these intervals to several degrees of risk [16]. The probability of risk is rarely analysed quantitatively. Specifically, a dam's risk represents the probability of an adverse event, such as dam failure. In practical engineering, the possibility of dam failure is relatively small. However, public society now demands, more than ever before, high vigilance in dam management regarding safety issues and risk levels associated with dams. Once an assessment has been made of the probability of failure, standards of an acceptable level of risk are needed to determine whether safety management requires improvement. Therefore, the present study put the emphasis on providing a criterion for quotidian safety management, namely, once the level of risk exceeds a given value, the frequency of real-time monitoring should be increased so as to prevent potentially serious events.
The present study took a first step toward assessing dam risk quantitatively, with the consideration of non-linear correlations between the displacement of each monitoring point. First, we quantified a single monitoring point's level of risk by establishing a risk ratio function based on the distribution of absolute residual between monitored and modelled displacement data. The modelled data were estimated by a statistical model. Second, the non-linear correlations were taken into account using a Copula function, which is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform [17,18]. Copula functions have been used to describe the dependence between random variables in many other fields [19,20]. To connect the risk ratio of different monitoring points, we started by determining the optimal marginal distribution function of risk ratio of one single monitoring point based on statistical tests; we then used three Archimedean Copula functions [21] (i.e., Clayton copula [22], Frank copula [23] and Gumbel copula [24]) to connect the marginal distribution functions of each monitoring point and selected the best performed Gumbel copula function [25]. Displacement is mainly influenced by reservoir water level, temperature effect and time effect. Displacement consists of a horizontal direction (including alongside and across the stream direction for a gravity dam; radial and tangential direction for an arch dam) and vertical direction. Among them, a radial displacement component alongside a stream displacement component are the crucial parts for a concrete arch dam and a gravity dam, respectively [26,27]. In this work, we selected the Jingpin-I concrete arch dam as an example; therefore, radial displacement is used for the following analysis.
This article is organised as follows-Section 2 presents the method of the present model; Section 3 describes the engineering case and data sets; Section 4 presents the risk assessment results; and the concluding remarks complete the paper in Section 5.

Single Monitoring Point Risk Ratio Function
Monitored displacement data reflect a dam's real-time structural behaviour and can be easily acquired, benefiting from the widely installed monitoring instruments inside the dam. Modelled displacement data, which are estimated from related parameters of the external environment (e.g., upstream water level, temperature), imply the theoretical dam displacement. To assess the dam risk, we established a risk ratio function based on the distribution of residual between the monitored and modelled displacement data. For an arch concrete dam (e.g., the selected dam in the present study), the radial displacement is commonly used to evaluate the risk to dam safety.
A statistical model is commonly used to obtain the radial modelled displacement dataδ. In the statistical model, the dam's radial displacement consists of three components-a water pressure component δ H , a temperature component δ T and an aging component δ θ . The modelled displacement dataδ can be expressed as: The water pressure component δ H is the sum of the displacement of the dam body itself δ 1H , the dam foundation δ 2H and the dam bedrock's rotational displacement δ 3H under the upstream water load. δ 1H , δ 2H and δ 3H are as functions of the upstream water level H.
δ 1H , δ 2H and δ 3H are obtained on the basis of engineering mechanics, which simplifies the gravity dam as a beam structure and the arch dam as an arch shape beam structure. Figure 1 shows a stretch of dam displacement due to water pressure δ H . Taking the gravity dam as an example, Then, for the gravity dam, δ H is written as: For the arch dam, owing to its more complicated structure, δ H is a polynomial function at least of third order: The coefficients a i are related to the dam height h, downstream slope angle m, the distance from the monitoring point to the dam foundation d and the material parameters including the elastic modulus of the dam body E c , Poisson's ratio of the dam body µ c , elastic modulus of the foundation E r and Poisson's ratio of the foundation µ r .
The temperature component δ T is the displacement mainly resulting from internal temperature variation. When the internal monitored temperature data is lacking, researchers often apply a trigonometric function to describe δ T .
where d = 2πt 365 and t is the number of days from the beginning of the monitoring sequence, b j1 and b j2 are pending coefficients. Equation (8) is used to describe the temperature effect with one-year or six-month periodicity.
The aging component δ θ characterizes the irreversible displacement caused by the factors such as creep and fatigue of concrete. Due to the difficulties of expressing δ θ theoretically, we provide a formula considering time as a variable to describe its tendency.
where θ is a parameter related to the time of the observation date t and the time of the initial date t 0 , which can be expressed as θ = (t − t 0 )/100; c 1 and c 2 are pending coefficients. Then, the statistical model used to model the dam's displacement can be written as: where n = 3 for gravity dam and n ≥ 4 for arch dam; a i are pending coefficients in the water pressure component δ H , b j1 and b j2 are pending coefficients in the temperature component δ T and c i are pending coefficients in the aging component δ θ , which can be estimated by Ordinary Least Squares regression with monitored displacement data δ as test sets. After coefficients are determined, with input sets of upstream water level H and number of days from the beginning of monitoring sequence t, the modelled displacementδ can be calculated. In practical engineering, the occurrence probability of the event-that the monitored displacement data δ has a large deviation from the modelled displacement dataδ-is fairly low. We adopted the occurrence probability of the absolute residual between δ andδ to assess dam risk. Here, we quantified the probability of dam risk by risk ratio and proposed a cumulative distribution function (CDF) of the absolute residual to express it (see Equation (11)).
where |ε (t) | = |δ (t) −δ (t) | is the absolute difference between monitored and modelled displacement data; P (|ε|, t) denotes the risk ratio; t is time; F (|X| ≤ |ε (t) |) is the CDF of the absolute residual; α is a failure limit indicator. Figure 2 exhibits the relationship between the risk ratio P and the absolute residual |ε|. The interval of the risk ratio is [0,1]. Based on the small probability method, once |ε| exceeded the limit value α, the monitoring point region would be regarded as being in a failure state.

Multi-Monitoring Point Risk Ratio Function
With the method presented in Section 2.1, we can obtain the real-time risk ratio P(|ε|, t) of a single monitoring point. As both the value of the loading combination and the material resistance are identical for adjacent monitoring points, it is of great importance to take into consideration the correlations of the displacement of adjacent monitoring points. Therefore, we applied a joint CDF to express the risk ratio for multiple monitoring points. The combined risk ratio for multiple monitoring points becomes: where X i (i = 1, 2, . . . , n) denote the random variables of |ε| of the i-th monitoring point; P * (x 1 , x 2 , . . . , x n ; t) denotes the risk ratio of the dam section. Section 2.2 presents how the joint CDF is constructed by copula functions.

Copula Theory
Sklar (1973) [28] proposed that any joint distribution function can be decomposed into N marginal distribution functions and one copula function, in which the copula function describes the relevance structure between random variables. Here, each single-monitoring point risk ratio function is a marginal distribution function and the multi-monitoring point risk ratio function is the joint distribution function. Hence, the copula function essentially constructs the multi-monitoring point risk ratio function by connecting the risk ratio functions of several monitoring points.
Suppose the joint CDF of the d-dimensional random vector (X 1 , , the marginal CDFs are F 1 , F 2 , . . . , F d and C is the copula function that characterizes the correlation between each random vector (X 1 , X 2 , · · · , X d ), the joint CDF of the d-dimensional vector (X 1 , X 2 , · · · , X d ) can be expressed as: Then, several commonly used distributions, including Exponential, Gamma, Lognormal and Weibull distributions, were selected as possible marginal distributions of absolute residuals of each displacement data series. The distribution functions of the selected four marginal distributions and the parameters are presented in Table 1. Table 1. List of the selected marginal distribution functions and parameters.

Distribution Cumulative Distribution Parameters
Three Archimedean copulas (i.e., Clayon copula, Frank copula and Gumbel copula) were used to construct the joint CDFs for a risk ratio of multi-monitoring points. Table 2 shows their generator functions and ranges of related parameters. Table 2. Generator functions and parameter ranges of selected Archimedean copulas.

NO.
Name Generator Function Parameter Range

Parameters of Distribution Functions Determination
We used the maximum likelihood estimation to estimate the parameters in both marginal distribution functions and copula distribution functions. First, the likelihood function is expressed by: In order to make a simplification, we used the logarithm form of the likelihood function lnL (θ) in the following calculations (see Equation (15)).
By solving Equation (16), we can obtain that the maximum likelihood estimatorθ ML satisfying

Optimal Distribution Function Selection
In order to find the optimal distribution functions, three statistical tests, including Kolmogorov-Smirnov (K-S), root mean square error c(RMSE) and the Akaike information criterion (AIC), were used.
The K-S test evaluates whether the random variables X follow the selected CDF by comparing the samples' actual distribution F n (x) and the theoretical distribution of selected CDF F (x). The statistics for the K-S test are D n = sup x |F n (x) − F (x) | and the observation of D n can be defined as: The D n (α) can be obtained, once the significance level α and the sample size n are determined. Then, ifD n < D n (α), we may consider that the theoretical distribution of selected CDF F(x) fits well with the actual distribution of sample data F n (x); otherwise, the selected CDF is not matched with the samples. RMSE reflects the difference between the theoretical probability of selected CDF and the empirical probability of sample data. The equation of RMSE is written as: where n is the sample size, F c is the theoretical probability distribution, P 0 denotes the empirical probability of the sample data. AIC estimates the relative amount of information lost by the selected CDF, with the consideration of its goodness of fit and simplicity. The AIC is expressed as: where L is the likelihood function of the selected CDF, m is the number of parameters in the selected CDF. The flowchart of the proposed method is exhibited in Figure 3.

Data Sets
In this study, we selected the concrete dam at Jinping-I Hydropower station as an engineering example. The selected dam, with a height of 305 m, is currently the highest concrete arch dam in the world. Figure 4 shows a picture of the selected dam and its location. As the dam adopted the construction form that casting the remaining transverse joints between dam sections, we supposed that the risk probabilities of different dam sections are independent and took monitoring points at one dam section as examples.  Figure 5a represents the distribution map of all 24 displacement monitoring points that were installed inside the dam Sections 5#, 9#, 11#, 13#, 16#, 19#, respectively. In this study, we chose Section 9#, which includes three monitoring points (PL9-3, PL9-4 and PL9-5), as an example. We selected the radial displacement monitored data (to the downstream is positive, to the upstream is negative) from 20 November 2012 to 4 November 2016. The radial displacement was measured twice a day during the flood season and once a day during the drought season and we calculated the average radial displacement for each day. There were 676 validated time frames in total. The time evolution of the radial displacement of the selected three monitoring points, as well as the upstream water level, are exhibited in Figure 5b. It is noticeable that the trend of the radial displacement at the selected three monitoring points had high relevance.

Results and Discussion
According to Section 2, we first calculated the modelled displacement dataδ(t) of these three monitoring points by Equation (10) using the Ordinary least square regression. We then determined the absolute residual between modelled and monitored displacement data |ε (t) | = |δ (t) −δ (t) |. With the absolute residual |ε|, we constructed a risk ratio function for each monitoring point. Then, with the optimal marginal distribution of each random variable, we applied copula functions to construct the joint distribution function, so as to determine the risk ratio of the whole dam section.

Single-Point Qualitative Risk Assessment
First, we determined the modelled displacement dataδ i using a statistical model (Section 2.1) for each monitoring point and compared them with the monitored displacement data δ. Time variation of theδ i and δ i of the monitoring points PL9-3, PL9-4 and PL9-5 are shown in Figures 6a, 7a and 8a, respectively. Then, we calculated the absolute residual |ε i | between theδ i and δ i . In many earlier studies, the dam's running status risk was qualitatively analysed based on the classification of the absolute residual |ε i | with a building assessment index system [29]. We adopted s (standard deviation) to set the borders of the intervals. Standard deviation is used to describe the dispersion degree in the sample and we considered that if the residual obeys a normal distribution, then the proportion of the residual belonging to interval [0, s), [s, 2s), [2s, 3s) and [3s, +∞) are 68.3%, 27.3%, 4.2% and 0.2%, respectively. According to the quantitative description and assessment intervals on the evaluation index, the original information of the evaluation indexes are classified into different divided evaluation levels. Here, the intervals on the evaluation index and the evaluation level of the dam's risk are divided as follows:   To avoid the human effect, here we classified the dam's risk into 4 groups based on the probability of the occurrence of the residual between the monitored and modelled data. A higher residual means a higher level of risk; however, the exact value for the risk of dam collapse needs further simulation based on mechanics. The occurrence probability of the 'moderate unusual' group is 0.2% but the risk of dam collapse is a lot lower than 0.2%. The objective of the classification of risk is to provide a quantitative criterion for engineers to determine the frequency of measurement (e.g., whether the dam requires additional measurement). The present study set the interval using s, 2 s and 3 s but these intervals are not absolute. Table 3 presents the frequency of monitoring response to each interval of level of risk.

Correlations between Monitoring Points
In order to determine the strength of the relevance of selected monitoring points, we first evaluated the coefficient of correlation between absolute residuals of each monitoring point (|ε| 9−3 , |ε| 9−4 and |ε| 9−5 )). Using a Pearson correlation coefficient (Equation (21)), the correlation between each random variable is represented in Figure 9.
where r xy is the Pearson correlation coefficient, X i and Y i are random variable pairs of |ε i | (i.e., PL9-3 and PL9-4, PL9-4 and PL9-5, or PL9-5 and PL9-3). As can be seen from Figure 9, the correlations between these three pairs of variables are quite significant. Especially, the Person correlation coefficient r between |ε| 9−3 and |ε| 9−4 is as high as 0.84. The lowest one is 0.62 between |ε| 9−3 and |ε| 9−5 , which also represents a high correlation.
In addition to the Person correlation coefficient, the Spearman's rank correlation coefficient (Equation (22)-(23)) and the Kendall's rank correlation coefficient(Equation (24)-(25)) were also calculated. The correlations between each pair of random variables are represented in Table 4.
where ρ n is Spearman's rank correlation coefficient; R i and S i are ranks of random variables X and Y, respectively.

Marginal Distributions
To determine a best fitted marginal distribution for the random variables, we first selected four commonly used distribution functions-Exponent function, Gamma function, Lognormal function and Weibull function. Then, we adopted a Maximum Likelihood Estimate to estimate the parameters in each selected marginal distribution (see Section 2.2.1). Table 5 exhibits the estimated parameters and the results of statistical tests for selected distribution functions and random variables. The values of the K-S test did not exceed their critical values 0.05 and D 30 (0.05) = 0.409, which implies that all empirical distributions fitted well with the marginal distributions for |ε| 9−3 , |ε| 9−4 and |ε| 9−5 . The CDF fitting curves of the selected four distribution functions for |ε| 9−3 , |ε| 9−4 and |ε| 9−5 are shown in Figures 10-13. It is surprising that the Gamma distribution had the lowest RMSE and AIC for |ε| of all three of these monitoring points; hence, it was selected as the preferred marginal distribution to establish a joint probability model.

Copula-Based Multivariate Joint Distributions
We used three Archimedean copulas including Clayon, Frank and Gumbel to connect the marginal distributions of |ε| 9−3 , |ε| 9−4 and |ε| 9−5 . First, we compared the theoretical probabilities of the Archimedean copulas and the empirical probabilities of sample data. As shown in Figures 14-16, the plots of all copulas have a deviation from the 45 • diagonal line. For the first impression, the minimum deviation from diagonal line were with the Gumbel copula, which implies that the Gumbel is the best suited copula. It should be noted that, when the empirical probability exceeds 0.8, the Gumbel copula starts to overestimate the risk ratio. However, overestimation during a high level of risk improves the safety of risk management. In contrast, Clayon and Frank copulas overestimate the risk ratio when the empirical probabilities are below 0.4; however, they underestimate the probabilities when they exceed 0.4.   In addition, we calculated the statistical tests of these three copula functions and find the same results as with the comparison of probabilities. Table 6 exhibits the copula parameters θ as well as the results of statistical tests RSME and D n for each joint distribution function. The Gumbel distribution was selected as the optimal joint distribution for multiple variables, as it has the lowest RSME of 0.0285 and the lowest D n of 0.0711.

Multi-Point Risk Ratio
In this section, we constructed a joint distribution based on the risk ratio of each monitoring point with the Gumbel copula, which represents the occurrence probability of the event (X 1 ≤ |ε| 1 , X 2 ≤ |ε| 2 , X 3 ≤ |ε| 3 ). 0.95 was the assessment criterion of the joint distribution probability-once the joint distribution probability exceeds 0.95, the running status of the selected dam section will be regarded as moderately unusual.
According to the method in Section 2.1, the risk ratios of these three monitoring points are calculated, respectively. Figure 17 shows the time evolution of risk ratios of these three monitoring points. Each monitoring point has different probabilities of moderate-unusual running status. As can be seen from Figure 17, the risk ratios of the three monitoring points varied sharply during the whole period (from 20 November 2012 to 4 November 2016), while the average values are quite close, being 0.4961, 0.4927 and 0.4975, respectively. Then, we calculated the risk ratios in the most recent year (from January 1 2016 to November 1 2016), the average values being 0.3925, 0.3965 and 0.4329, respectively. By comparison of these risk ratios, we discovered that the risk ratios surrounding the monitoring point PL9-5 in dam Section 9# are relatively high. Therefore, attention should be paid to safety inspection around the dam foundation region. Figure 18 represents the multi-point risk ratios calculated with subjective weight method (setting even weight as 0.333 for PL9-3, PL9-4 and PL9-5 without the consideration of correlations) and multivariate copulas. The results obtained by the proposed method have a similar tendency to the subjective weight method. Taking account of the structural correlation, the risk ratios with multivariate copulas were above 0.95 from 29 December 2014 to 15 March 2015 and from 1 April 2015 to 14 April 2015, which implies the dam section was in moderate-unusual status during those periods. The evaluated time of moderate-unusual running status was 45 days less than that without considering the correlations, namely, the previous methods overestimated the probability of risk. Multivariate copulas construct the joint distribution to describe the occurrence probability of the event (X 1 ≤ |ε| 1 , X 2 ≤ |ε| 2 , X 3 ≤ |ε| 3 ), which considers the running status of these three monitoring points simultaneously. In this way, the description of the running status of one dam section is more practical, as all components in the dam section are considered to be bearing external loads (upstream water pressure and temperature changes) as an integral structure. In addition, the multivariate copulas method expresses the correlations merely on the basis of displacement monitoring data, which is objective. Therefore the multi-point risk ratio of the proposed method was more rational to use for risk assessment.

Conclusions
The present study proposed a data-driven method based on multivariate copulas for the risk assessment of a dam's safety, with the objective of assessing the dam risk quantitatively and to consider the correlation of displacement between each monitoring point. The concrete dam at Jinping-I Hydropower Station was selected as a case study. We first calculated the absolute residuals between monitored and modelled displacement data for a single monitoring point. We then used multivariate copulas to connect the distribution functions of different monitoring points. The Gamma distribution was selected as the optimal marginal distribution. Gumbel copula, which fitted best with the empirical probabilities, was selected as the joint distribution. Considering that the risk probabilities of different dam sections are independent, we took one dam section, which includes three monitoring points, as an example, and we estimated the real-time running status of the selected dam section, which we call the risk ratio. The risk ratio helps us to explain and quantify the level of the dam's safety and the measurement frequency needs. The following conclusions are remarkable: First, the risk ratio of each monitoring point was highly dependent on its displacement and the residual between monitored and modelled displacement data; the risk ratio of the selected dam section had a slight deviation from the risk ratios of each monitoring point.
Second, taking account of the temporal and spatial correlations among the selected monitoring points with a Copula function, the estimated moderate unusual time was 45 days less than that without considering the correlations. This means that the previous methods overestimated the probability of risk. Most previous studies considered displacement at different monitoring points as independent variables and assessed the risk for each monitored point separately.
In addition, the risk ratios during flood seasons were slightly higher than those during drought seasons, indicating that monitoring frequency should be increased during flood seasons. The risk ratio can provide a quantitative indicator for the measures management of the dams.