Concrete Dam Displacement Prediction Based on an ISODATA-GMM Clustering and Random Coefficient Model

Displacement data modelling is of great importance for the safety control of concrete dams. Commonly used artificial intelligence methods model the displacement data at each monitoring point individually, i.e., the data correlations between monitoring points are overlooked, which leads to over-fitting and limits the generalization of the model. This article proposes a novel model that combines clustering based on the Gaussian mixture model and iterative self-organizing data analysis (ISODATA-GMM) with the random coefficient method, and thereby takes the temporal-spatial correlation among the monitoring points into account. By building models for all the points simultaneously, the random coefficient model improves generalization ability by reducing the number of free model variables. Since the random coefficient model assumes that the data follow normal distributions, we use the ISODATA-GMM clustering algorithm to classify the measuring points into several groups according to their temporal and spatial characteristics, so that each group follows one distribution. The resulting model has the advantage of a stronger generalization ability.


Introduction
Dam safety monitoring aims to understand the actual running status of the dam, so as to provide sufficient information to ensure the safety of the concrete dam [1]. Displacement is a dominant indicator of dam safety. One of the most important topics in dam safety management is forecasting the dam's displacement from the data obtained at the monitoring points laid inside the dam [2,3].
Researchers have established many displacement forecast models. At the very beginning, researchers developed statistical models, in which the displacement $\delta$ at each monitoring point is approximated by $\delta = \delta_H + \delta_T + \delta_\theta + k$, where $\delta_H$, $\delta_T$ and $\delta_\theta$ are the displacements due to hydrostatic pressure, temperature and ageing, respectively, and $k$ is determined by regression analysis [4,5,6]. The component $\delta_H$ is usually fitted by a polynomial in the upstream reservoir water level $H$:
$$\delta_H = \sum_{i=1}^{n} a_i H^i$$
where $a_i$ are regression coefficients and $n$ is the polynomial degree. However, the prediction accuracy of the statistical models is limited by their uncertainty and by the multicollinearity caused by the high correlation between explanatory variables.
In recent years, many artificial intelligence models, such as artificial neural networks [7], grey system models [8], support vector machines [9] and genetic algorithms [10], have been applied to displacement data analysis and prediction. Whether in statistical models or in artificial intelligence models, the explanatory variables' coefficients at each monitoring point are analysed and predicted independently, so the spatial correlation between monitoring points is overlooked [11]. However, the displacements of adjacent monitoring points are correlated, as both the hydrostatic pressure and the temperature acting on the dam vary gradually. In addition, to obtain more accurate fitting results, redundant explanatory variables are often adopted, which may weaken the generalization ability of the model.
Instead of modelling the data of each monitoring point individually, we introduce the random coefficient model of multi-dimensional data. The random coefficient model fits the data of several monitoring points simultaneously and constrains the explanatory variables' coefficients of the measuring points to follow asymptotic normal distributions [12,13]. Of course, not all coefficients follow the same normal distribution. Therefore, we classify the measuring points into several clusters based on their structural and temporal characteristics, and then assume that the coefficients within each cluster follow the same distribution.
Clustering methods mainly fall into two categories. One is based on similarity or dissimilarity distances, such as hierarchical cluster analysis [14] and the K-means algorithm [15]. The other is the model-based method, in which each cluster is represented by a parametric distribution, such as a Gaussian distribution, and the entire dataset is modelled by a mixture of these distributions [16,17,18]. Model-based clustering provides a rigorous framework to assess the number of clusters and the role of each variable in the clustering process. In this study, we clustered the data using the Gaussian mixture model (GMM), which assumes a multivariate Gaussian distribution for each component. To avoid divergence in the GMM, iterative self-organizing data analysis (ISODATA) was used to govern the number of individuals in each class.
This article is organised as follows. Section 2 introduces the classical statistical prediction model. The model developed in this study is presented in Section 3, where we first present the clustering method based on ISODATA-GMM in Section 3.1 and then the random coefficient model of multi-dimensional data in Section 3.2. Section 4 describes the data sets. The prediction results and the comparison with the statistical model are discussed in Section 5. Concluding remarks complete the paper in Section 6.

Statistical Prediction Model
The dam displacement $\delta$ consists of three components: the temperature component $\delta_T$, the aging component $\delta_\theta$ and the water pressure component $\delta_H$. The water pressure component is itself mainly composed of three parts: the deformation of the dam body $\delta_{1H}$, the deformation of the dam foundation $\delta_{2H}$ and the rotation of the dam bedrock $\delta_{3H}$ (see Figure 1):
$$\delta_H = \delta_{1H} + \delta_{2H} + \delta_{3H} \tag{1}$$
$\delta_{1H}$, $\delta_{2H}$ and $\delta_{3H}$ mainly depend on the upstream water level $H$, as exhibited in Equations (2)-(4), where $H$ is the upstream water level; $h$ is the height of the dam; $m$ is the downstream slope; $d$ is the distance between the observation point and the dam crest; $E_c$ and $G_c$ are the elastic modulus and shear modulus of the dam concrete, respectively; $E_r$ and $\mu_r$ are the elastic modulus and Poisson's ratio of the foundation, respectively; $\gamma_0$ is the water density; and $\alpha$ is the rotation angle of the dam foundation surface at the dam heel. Because measured temperature data are lacking in most engineering projects, the temperature component $\delta_T$ is commonly expressed by trigonometric functions of different periods. The aging component $\delta_\theta$ is commonly described by a trend function of fixed form. In addition to the above-mentioned deformation components, a random interference term $k$ is often considered, which accounts for human errors, measurement errors, etc. According to the central limit theorem, $k$ obeys a normal distribution with a mean of zero.
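Equation (5) expresses the most commonly used statistical model of dam deformation [19]:
$$\delta = \sum_{i=1}^{n} a_i H^i + \sum_{j} \left( b_{1j} \sin\frac{2\pi j t}{365} + b_{2j} \cos\frac{2\pi j t}{365} \right) + c_1 t + c_2 \ln t + k \tag{5}$$
where $n$ is the polynomial degree, with $n = 3$ for gravity dams and $n = 4$ for arch dams; $t$ is the time; $a_i$, $b_{1j}$, $b_{2j}$, $c_1$ and $c_2$ are coefficients; and $k$ is the random interference term. In this paper, the coefficients of the statistical model were solved by ordinary least squares (OLS) estimation.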
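As an illustration of how the statistical model of Equation (5) can be fitted by OLS, the following minimal sketch builds a design matrix and solves for the coefficients with NumPy. The function names, the intercept column, the choice of two harmonics and the synthetic water level series are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def hst_design_matrix(H, t, n=4, harmonics=2):
    """Columns: 1, H..H^n, sin/cos(2*pi*j*t/365) for j = 1..harmonics, t, ln t."""
    H = np.asarray(H, dtype=float)
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]                       # intercept (assumed)
    cols += [H ** i for i in range(1, n + 1)]      # water pressure terms
    for j in range(1, harmonics + 1):              # temperature terms
        w = 2.0 * np.pi * j * t / 365.0
        cols += [np.sin(w), np.cos(w)]
    cols += [t, np.log(t)]                         # aging terms (requires t > 0)
    return np.column_stack(cols)

def fit_ols(X, y):
    """Ordinary least squares solved with an SVD-based routine."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic demonstration data (hypothetical, not the Jinping-I measurements).
rng = np.random.default_rng(0)
t = np.arange(1.0, 731.0)                          # days since the start
H = 0.7 + 0.2 * np.sin(2.0 * np.pi * t / 500.0) + 0.02 * rng.normal(size=t.size)
X = hst_design_matrix(H, t)                        # n = 4, as for an arch dam
true_coef = rng.normal(size=X.shape[1])
y = X @ true_coef + rng.normal(scale=0.01, size=t.size)

coef = fit_ols(X, y)
rmse = np.sqrt(np.mean((y - X @ coef) ** 2))
print(f"fitted {coef.size} coefficients, in-sample RMSE = {rmse:.4f}")
```

Because the powers of $H$ are strongly correlated with each other, the individual coefficients of such a fit can be unstable even when the fitted curve is good, which is the multicollinearity issue raised in the Introduction.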

Model Development
We first clustered the measured displacement data obtained from each measuring point using the Gaussian mixture model (GMM), improved by iterative self-organizing data analysis (ISODATA). The displacement data of the 24 measuring points selected for the case study were classified into five groups. We then used the random coefficient model to fit the data of each class.

Clustering of the Monitoring Data Based on ISODATA-GMM
As introduced in Section 2, the displacement is mainly induced by three components: the water pressure component $\delta_H$, the temperature component $\delta_T$ and the aging component $\delta_\theta$. To build a clustering criterion that represents the spatial and temporal characteristics of the measuring points, we discuss these three factors in turn. The water pressure component $\delta_H$ and the temperature component $\delta_T$ mainly depend on the location of the measuring point and the geometrical size of the dam. For concrete dams, the spatial relation between measuring points can be represented by the distance $d$ from each point to the dam foundation. The temporal characteristic of a measuring point mainly affects the aging component $\delta_\theta$. We therefore first separated the aging component $\delta_\theta$ from the measured time series. The temporal characteristic can be described by two factors: the maximum absolute value of the aging sequence, $\lambda$, and the degree of convergence of the data series, $\xi$, which is computed from the aging term coefficients $c_1$ and $c_2$. Therefore, we use $d$, $\lambda$ and $\xi$ as the clustering criteria that represent the spatial and temporal characteristics of the measuring points.
GMM-based clustering assumes that the data come from several sub-datasets that are modelled separately, and that the whole dataset is a mixture of these sub-datasets. The resulting model is a finite mixture model. When the data are multivariate continuous observations, the parametrized component density is usually a multidimensional Gaussian density.
For a one-dimensional dataset, we assume that the probability distribution of a random variable $x$ is a mixture of two Gaussian distributions, as described in Equation (6):
$$P(x \mid \mu_1, \mu_2, \sigma) = \sum_{k=1}^{2} p_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu_k)^2}{2\sigma^2}\right) \tag{6}$$
where $k = 1$ and $k = 2$ label the two Gaussian distributions; the prior probabilities are $\{p_1 = 1/2, p_2 = 1/2\}$; $\{\mu_k\}$ are the means and $\sigma$ is the common standard deviation of the two Gaussian distributions. We write $\theta \equiv \{\{\mu_k\}, \sigma\}$ for these parameters. The dataset $\{x_n\}_{n=1}^{N}$, which contains $N$ points, is assumed to be an independent sample from this distribution, and $k_n$ denotes the unknown class tag of the $n$th point.
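When $\{\mu_k\}$ and $\sigma$ are known, the posterior probability of the class tag $k_n$ of the $n$th point can be written as:
$$P(k_n = k \mid x_n, \theta) = \frac{p_k \exp\left(-(x_n-\mu_k)^2 / 2\sigma^2\right)}{\sum_{k'=1}^{2} p_{k'} \exp\left(-(x_n-\mu_{k'})^2 / 2\sigma^2\right)} \tag{7}$$
When $\{\mu_k\}$ is unknown and $\sigma$ is known, $\{\mu_k\}$ can be inferred from the data $\{x_n\}_{n=1}^{N}$. We derive an iterative algorithm for $\{\mu_k\}$ that maximizes the likelihood:
$$L = \prod_{n=1}^{N} P(x_n \mid \{\mu_k\}, \sigma) \tag{8}$$
The derivative of the natural logarithm of the likelihood $L$ with respect to $\mu_k$ is:
$$\frac{\partial}{\partial \mu_k} \ln L = \sum_{n} p_{k|n} \frac{x_n - \mu_k}{\sigma^2} \tag{9}$$
where $p_{k|n} \equiv P(k_n = k \mid x_n, \theta)$ is the posterior probability given in Equation (7). Ignoring the terms in $\frac{\partial}{\partial \mu_k} P(k_n = k \mid x_n, \theta)$, the second derivative with respect to $\mu_k$ can be approximated as:
$$\frac{\partial^2}{\partial \mu_k^2} \ln L \approx -\sum_{n} p_{k|n} \frac{1}{\sigma^2} \tag{10}$$
Then the initial $\mu_1, \mu_2$ are updated to $\mu_1', \mu_2'$ using approximate Newton-Raphson steps:
$$\mu_k' = \mu_k - \frac{\partial \ln L / \partial \mu_k}{\partial^2 \ln L / \partial \mu_k^2} = \frac{\sum_n p_{k|n} x_n}{\sum_n p_{k|n}} \tag{11}$$
We now extend the method to a multidimensional dataset (multiple Gaussian distributions). The Gaussian mixture density can be written as:
$$P(\mathbf{x}) = \sum_{k} \pi_k \prod_{i=1}^{I} \frac{1}{\sqrt{2\pi}\,\sigma_i^{(k)}} \exp\left(-\frac{\left(x_i - \mu_i^{(k)}\right)^2}{2\left(\sigma_i^{(k)}\right)^2}\right) \tag{12}$$
where $k$ is the index of the Gaussian distribution; $i$ is the index of the data dimension; $n$ is the index of the data sequence; $I$ is the total number of data dimensions; $\pi_k$ is the weight; $\mu_i^{(k)}$ is the mean of the Gaussian distribution; $\left(\sigma_i^{(k)}\right)^2$ is the variance of the Gaussian distribution; and $x_i^{(n)}$ is the data point. The iterative formula of $\mu_i^{(k)}$ has been presented in Equation (11). With $p_{k|n}$ now evaluated from the density in Equation (12), the iterative formulas of the variance $\left(\sigma_i^{(k)}\right)^2$ and the weight $\pi_k$ are as follows:
$$\left(\sigma_i^{(k)}\right)^2 = \frac{\sum_n p_{k|n} \left(x_i^{(n)} - \mu_i^{(k)}\right)^2}{\sum_n p_{k|n}} \tag{13}$$
$$\pi_k = \frac{1}{N} \sum_{n} p_{k|n} \tag{14}$$
Once the iteration has converged, the GMM clustering classifies the dataset into several classes. However, GMM clustering still has some defects. The number of classes and the number of data points in each class are unknown before clustering; hence, the iteration may produce a class with only one or two data points, which may cause the final results to diverge.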
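The one-dimensional mean update described above can be sketched in a few lines. This is a minimal illustration under the stated assumptions of two components, equal priors and a known shared $\sigma$; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def responsibilities(x, mu, sigma):
    """Posterior P(k_n = k | x_n, theta) for equal priors and a shared sigma."""
    # Log of the unnormalised Gaussian densities, shape (N, K).
    log_p = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def update_means(x, mu, sigma):
    """One approximate Newton-Raphson step: mu_k <- sum_n p_k|n x_n / sum_n p_k|n."""
    p = responsibilities(x, mu, sigma)
    return (p * x[:, None]).sum(axis=0) / p.sum(axis=0)

# Synthetic one-dimensional sample drawn from two Gaussians (illustrative only).
rng = np.random.default_rng(1)
sigma = 1.0
x = np.concatenate([rng.normal(-2.0, sigma, 200), rng.normal(3.0, sigma, 200)])

mu = np.array([-0.5, 0.5])                         # initial guesses
for _ in range(50):
    mu = update_means(x, mu, sigma)
print("estimated means:", mu)
```

Each pass recomputes the posterior class probabilities with the current means and then replaces every mean by the probability-weighted average of the data, so no step size needs to be chosen.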
To solve this problem, we introduce iterative self-organizing data analysis (ISODATA), which performs the following operations: (a) split a class into two when its variance is too large, (b) delete a class when its number of samples falls below a specified value, and (c) merge two classes when they are too close. Figure 2 shows the flow chart of ISODATA.
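The three ISODATA operations can be illustrated with a single adjustment pass over hard cluster labels. This is a simplified sketch, not the authors' implementation: the function name, the split rule (one standard deviation either side of the mean along the widest dimension) and the unweighted merge are assumptions, and the thresholds default to the values quoted for the case study.

```python
import numpy as np

def isodata_adjust(X, labels, n_min=2, var_max=3.0, d_min=0.1):
    """One ISODATA-style pass over hard cluster labels; returns new centres."""
    centres = []
    for k in np.unique(labels):
        pts = X[labels == k]
        if len(pts) < n_min:                       # (b) delete a class that is too small
            continue
        mean, var = pts.mean(axis=0), pts.var(axis=0)
        if var.max() > var_max:                    # (a) split along the widest dimension
            step = np.zeros_like(mean)
            step[var.argmax()] = np.sqrt(var.max())
            centres += [mean - step, mean + step]
        else:
            centres.append(mean)
    merged = []                                    # (c) merge centres closer than d_min
    for c in centres:
        for i, m in enumerate(merged):
            if np.linalg.norm(c - m) < d_min:
                merged[i] = 0.5 * (c + m)          # unweighted average (simplification)
                break
        else:
            merged.append(c)
    return np.array(merged)

# Illustrative data: two nearby classes (0 and 2) and a singleton class (1).
X_demo = np.array([[0.00, 0.00], [0.02, 0.00], [0.00, 0.02], [0.02, 0.02],
                   [0.50, 0.50],
                   [0.05, 0.00], [0.07, 0.00], [0.05, 0.02], [0.07, 0.02]])
labels_demo = np.array([0, 0, 0, 0, 1, 2, 2, 2, 2])
centres_demo = isodata_adjust(X_demo, labels_demo)
print(centres_demo)
```

In a full ISODATA-GMM loop, the points would be reassigned to the adjusted centres and the GMM parameters re-estimated before the next pass.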

Random Coefficient Model
As shown in Figure 3, the monitoring data are two-dimensional, containing both time series data and cross-sectional data. Each panel represents the cross-section displacement data at a certain time, and each grid cell on the panel stands for a monitoring point. Equation (15) expresses the regression model of a panel whose coefficients do not vary with time:
$$y_{it} = \sum_{k=1}^{K} \beta_{ki} x_{kit} + u_{it} = \sum_{k=1}^{K} \left(\beta_k + \gamma_{ki}\right) x_{kit} + u_{it} \tag{15}$$
where $y_{it}$ is the two-dimensional dam displacement data; $x_{kit}$ is the two-dimensional data of the explanatory variables; $t$ is the time index; $i$ is the cross-section index; $k$ is the explanatory variable index and $K$ is the number of explanatory variables; $\beta_{ki}$ is independent of time and can be divided into $\beta_k$ and $\gamma_{ki}$; $\gamma_{ki}$ is the deviation of the individual coefficient from the common mean value $\beta_k$; and $u_{it}$ is a random interference term. Ref. [20] treated $\beta_i = \beta + \gamma_i$ as a random variable under the assumptions stated in Equation (16). Stacking the $NT$ observations gives the model in matrix form (Equation (17)), where $N$ is the number of panels and $T$ is the number of data points in each panel. The covariance matrix of the compound error term $X\gamma + u$ is block diagonal, with one diagonal block for each $i$. According to [20], the estimate of $\beta$ from OLS is biased. Provided that $\frac{1}{NT} X'X$ converges to a non-zero constant matrix, we can obtain a consistent but inefficient estimate. The best linear unbiased estimator of $\beta$ is the generalized least squares (GLS) estimator:
$$\hat{\beta}_{GLS} = \left(X' \Phi^{-1} X\right)^{-1} X' \Phi^{-1} y$$
where $\Phi$ denotes the covariance matrix of the compound error term. The variance of the estimator is:
$$\mathrm{Var}\left(\hat{\beta}_{GLS}\right) = \left(X' \Phi^{-1} X\right)^{-1}$$
$\hat{\beta}_{GLS}$ follows an asymptotic normal distribution and is the efficient estimator of $\beta$. The random coefficient model thus constrains the explanatory variables' coefficients to follow asymptotic normal distributions instead of being free variables, and hence represents the correlation between adjacent monitoring points.
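A GLS estimate of the common coefficient vector can be sketched as follows. The sketch assumes the standard random coefficient setting in which the covariance block of monitoring point $i$ is $X_i \Delta X_i' + \sigma_i^2 I$, with $\Delta$ the covariance of the coefficient deviations and $\sigma_i^2$ the noise variance; this block form, the function name and the synthetic data are assumptions for illustration, and both $\Delta$ and $\sigma_i^2$ are treated as known here although in practice they must be estimated.

```python
import numpy as np

def random_coef_gls(X_list, y_list, Delta, sigma2):
    """GLS estimate of the mean coefficient vector and its covariance."""
    K = X_list[0].shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for X, y, s2 in zip(X_list, y_list, sigma2):
        # Covariance block of the compound error for this monitoring point.
        Phi = X @ Delta @ X.T + s2 * np.eye(X.shape[0])
        W = np.linalg.solve(Phi, X)                # Phi^{-1} X
        A += X.T @ W                               # X' Phi^{-1} X
        b += W.T @ y                               # X' Phi^{-1} y (Phi is symmetric)
    cov = np.linalg.inv(A)
    return cov @ b, cov

# Synthetic panel: N points, T time frames, K explanatory variables (illustrative).
rng = np.random.default_rng(2)
N, T, K = 6, 50, 3
beta = np.array([1.0, -2.0, 0.5])
Delta = 0.01 * np.eye(K)
sigma2 = np.full(N, 0.04)
X_list, y_list = [], []
for i in range(N):
    X = rng.normal(size=(T, K))
    gamma = rng.multivariate_normal(np.zeros(K), Delta)   # point-specific deviation
    u = rng.normal(scale=np.sqrt(sigma2[i]), size=T)
    X_list.append(X)
    y_list.append(X @ (beta + gamma) + u)

beta_hat, cov = random_coef_gls(X_list, y_list, Delta, sigma2)
print("beta_hat:", beta_hat)
```

Because all points in a class share one mean coefficient vector, the number of free parameters grows with the number of explanatory variables rather than with the number of monitoring points.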
The distribution density associated with a monitoring point strongly depends on its features, such as its location. Hence, we cluster the measuring points based on their spatial and temporal characteristics. Using the ISODATA-GMM method introduced in Section 3.1, measuring points with similar spatial and temporal characteristics are classified into the same group. The coefficients within one cluster can then be considered to follow the same normal distribution.

Data Sets
We selected the concrete dam of the Jinping-I Hydropower Station as an example to validate the model. The station is located on the Yalong River in China (Figure 4). Its main function is generating electricity, with a maximum capacity of 3600 MW. Another function is flood control: the gross capacity of the reservoir and the flood regulation storage capacity are $77.6 \times 10^8$ m$^3$ and $49.1 \times 10^8$ m$^3$, respectively. The dam is at present the world's tallest arch dam. It is a double-curved arch dam with a height of 305 m, a crest width of 16 m, a bottom thickness of 63 m and a dam volume of $474 \times 10^6$ m$^3$ [21,22]. Water storage started on 30 November 2012, while the construction of the dam body was completed in June 2013. During this period, the water level was fairly low and hence the associated dam deformation was negligible. The water level reached its normal value on 23 August 2014.
In this study, we selected radial displacement data (positive towards downstream, negative towards upstream) from 16 June 2013 to 25 August 2015 for analysis. The data were measured by plumb lines (PL) and inverted plumb lines (IP) installed at the dam crest and in the dam body. Because some measuring points had not yet been installed during this period, we selected 24 measuring points. Figure 5 shows the distribution map of the measuring points selected in this study. These 24 points are distributed along six perpendiculars on the same cross section. Data were collected more frequently during the flood season than in other periods: during the storage period and the flood season, the radial displacement data were collected three times each day, whereas in other periods they were collected once a week. Where three readings were available for one day, we took the mean of the three values. Where only one reading was available in a week, we limited the number of missing data in the panel using the generalized least squares method, and finally obtained 274 validated time frames. The time variation of the water level and of the displacement data for all the measuring points is shown in Figure 6. The displacement at every measuring point clearly depends strongly on the water level, which also indicates the importance of taking the temporal characteristics into consideration in data clustering. In addition, the radial displacement series contain many noisy data, especially during the periods from January 2014 to May 2014 and from April 2015 to August 2015. The noisy data commonly come from measurement errors or human errors and may reduce the accuracy of the prediction; on the other hand, the prediction results at the noisy points can serve as an indicator of whether the model is over-fitting.
As shown in Figure 7, the annual variation of the radial displacement at each measurement point in 2014 (from 1 January 2014 to 31 December 2014) is strongly related to its spatial location. More specifically, the variation at a point on the dam depends on its distance to the dam foundation. The displacement at the marginal positions of the cross section is significantly smaller than that at the central positions. Therefore, the distance $d$ from the measuring point to the dam foundation was selected as the spatial indicator in the clustering criteria.
The datasets were divided into two groups: 16 June 2013 to 15 June 2015 as the fitting dataset, and 16 June 2015 to 28 September 2015 as the testing dataset. We first used the fitting dataset to develop the prediction model, and then used the testing dataset to check the prediction capacity of the developed model.
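The daily averaging of repeated readings and the date-based split into fitting and testing data can be sketched with the Python standard library. The function names and the sample readings are hypothetical; only the split date of 16 June 2015 comes from the text.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

def daily_means(readings):
    """Average all readings that share a calendar day.

    readings: iterable of (datetime, value) pairs.
    """
    by_day = defaultdict(list)
    for stamp, value in readings:
        by_day[stamp.date()].append(value)
    return {day: mean(vals) for day, vals in sorted(by_day.items())}

def split_by_date(series, first_test_day):
    """Split a {date: value} series into fitting and testing parts."""
    fit = {d: v for d, v in series.items() if d < first_test_day}
    test = {d: v for d, v in series.items() if d >= first_test_day}
    return fit, test

# Hypothetical readings: three on one day, one on the next.
readings = [
    (datetime(2015, 6, 15, 8), 1.0),
    (datetime(2015, 6, 15, 14), 2.0),
    (datetime(2015, 6, 15, 20), 3.0),
    (datetime(2015, 6, 16, 8), 4.0),
]
series = daily_means(readings)
fit_part, test_part = split_by_date(series, date(2015, 6, 16))
print(fit_part, test_part)
```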

Clustering Results
As introduced in Section 3.1, we clustered the monitoring data obtained from the 24 measuring points based on their spatial and temporal characteristics. We used the distance $d$ from the measuring point to the dam foundation as the indicator of the spatial characteristics. The temporal characteristics are represented by two indicators: the maximum absolute value of the aging sequence, $\lambda$, and the degree of convergence of the data series, $\xi$. As the first step of the clustering, we calculated the three indicators $d$, $\lambda$ and $\xi$ for each measuring point as the criteria. The values of $d$, $\lambda$ and $\xi$ are shown in Table 1. The 24 measuring points were classified into five groups. Figure 9 maps the measuring points together with their clustering results. The classification roughly corresponds to spatial location: e.g., all the measuring points in Class 1 are located in the edge area of the dam, and all the measuring points in Class 3 are located in the central part (see Figure 9). Of course, the results do not strictly depend on spatial location because of the influence of the temporal indicators. For instance, one of the measuring points in Class 4 is located on the edge while the other three are in the central area. However, these four points in Class 4 are located in adjoining areas. Therefore, the clustering results can be considered to represent the spatial characteristics of the measuring points. In addition, according to the indicator $d$ in Table 1, the intervals of $d$ in the five classes are [7,40] …
Figure 10 shows the clustering results in terms of the temporal indicators ($\lambda$ and $\xi$). It is interesting to note that the measuring points in Class 5 (PL11-2 and PL16-4) and in Class 4 (IP11-1, PL9-3, PL9-4 and PL11-5) each gather around centres far away from the other points. This means that the temporal similarities of the measuring points in Classes 4 and 5 are more significant than those in Classes 1, 2 and 3, in contrast to the spatial similarities, which are significantly stronger in Classes 1, 2 and 3 than in Classes 4 and 5. In general, the clustering model takes the effects of both temporal and spatial factors into consideration.

Predicting Results
After classifying the measuring points into five classes by clustering analysis based on the ISODATA-GMM method, we developed a random coefficient model for each class. To model the monitoring data, we selected explanatory variables related to the upstream water level, temperature and age: $H$, $H^2$, $H^3$, $H^4$, $\sin\frac{2\pi j t}{365}$, $\cos\frac{2\pi j t}{365}$, $t$ and $\ln t$, where $H$ is the upstream water level and $t$ is the time. The water level effect is expressed by several powers of $H$, as in the statistical model. The temperature effect is represented by trigonometric functions of time, assuming that the temperature follows the same pattern each year. The aging component is described directly by the time $t$ and its natural logarithm.
Using the ISODATA-GMM method and the random coefficient model, we fitted the displacement data from 16 June 2013 to 15 June 2015 to develop the prediction model. We then validated the model with the dataset from 16 June 2015 to 25 August 2015. Figure 11 shows the fitting and forecast results for the seven measuring points in Class 1. The modelling dataset lies in the white area and the testing dataset in the blue area. The red dots represent the measured data and the black lines are the fitted and predicted results. Even though the measured data always contain some noisy points, the predicted data fit the measured data well on the whole. The fitting results for the other 17 measuring points in Classes 2, 3, 4 and 5 are illustrated in Appendix A.
After developing the prediction model, we evaluated its performance and compared it with the statistical model. We used the correlation coefficient $R$ and the residual standard deviation $s$ as the criteria of model performance. In their expressions, $\hat{y}_i$ is the series of fitted data; $y_i$ is the series of measured data; $\bar{y}$ is the mean of the measured data series; and $n$ is the number of measured data. $R$ and $s$ quantify the strength of the relationship between the measured dataset and the predicted dataset. The $R$ and $s$ of each measuring point's testing dataset (from 16 June 2015 to 28 September 2015) are listed in Table 2. Generally, the model can be considered validated once $R$ is above 0.9. Here, $R$ ranges from 0.958 to 0.999 over all the monitoring points, which represents a fairly good fit between the predicted and the measured data. The maximum $R$ is 0.999, for the measuring points PL9-4 and PL16-4, which means that these two datasets are fitted best by the model. The value of $s$ ranges from 0.121 to 1.344.
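The two criteria can be computed as in the sketch below. Since the exact expressions are not reproduced here, the sketch assumes that $R$ is the Pearson correlation between measured and predicted values and that $s$ uses an $n - 1$ denominator; both choices, the function names and the sample values are assumptions.

```python
import numpy as np

def correlation_coefficient(y, y_hat):
    """Pearson correlation between measured and predicted series (assumed form of R)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.corrcoef(y, y_hat)[0, 1]

def residual_std(y, y_hat, ddof=1):
    """Residual standard deviation with an n - ddof denominator (assumed form of s)."""
    r = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.sqrt(np.sum(r ** 2) / (r.size - ddof))

# Small illustrative series (not data from the paper).
y_meas = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
R = correlation_coefficient(y_meas, y_pred)
s = residual_std(y_meas, y_pred)
print(f"R = {R:.4f}, s = {s:.4f}")
```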

Comparison with the Statistical Model
We then compared our model with the statistical model introduced in Section 2. To evaluate the prediction performance of the two models, the correlation coefficients $R$ and the residual standard deviations $s$ calculated from the testing dataset are presented in Figure 12. The random coefficient model with ISODATA-GMM clustering clearly performs better than the statistical model: of the 24 measuring points modelled, the $R$ of the random coefficient model is larger than that of the statistical model for 22. For the measuring points PL9-3 and PL16-3, the statistical model performs better than our model. In the fitting results for PL9-3 in Figure A3c, the deviation of the predicted data from the measured data is much more significant than in the other data series. It is interesting to note that, for the statistical model, the values of $R$ for PL16-3 and PL9-3 are lower than for the other measuring points. In addition, for the statistical model, the correlation coefficients $R$ are above 0.95 for all the measuring points. Hence, even though it is less accurate than the machine learning model, the statistical model remains valid for predicting dam displacement in common situations.

Limitations
One limitation of the current model is that the fitting and prediction results are vulnerable to noisy data. However, an effective pre-processing technique that reduces the noise could overcome this defect. Another issue with the current model is that the correctness of the monitoring data used for modelling must be guaranteed. Unlike models established for each measuring point separately, the random coefficient model analyses all the data series in one class simultaneously and considers the relations between the data series at different measuring points. Therefore, a noisy or erroneous data series at one measuring point may affect the prediction results at other measuring points.

Conclusions
Dam displacement monitoring is one of the most efficient methods to manage and forecast the safety of a dam. As the monitoring points are limited in most dams, researchers and engineers have commonly modelled the displacement data at different monitoring points individually and ignored the correlations among the points. However, more and more ultra-high dams have been constructed in recent years, in which the uncertainty and multicollinearity increase significantly with the number of monitoring points laid in the dam.
With the objective of solving the multicollinearity problem in commonly used models, we built a random coefficient model of multi-dimensional data in this paper. It models multiple points simultaneously by making the coefficient of each explanatory variable at different points follow the same asymptotic normal distribution. Measuring points following the same normal distribution are supposed to have similar spatial and temporal characteristics.
The second contribution is taking the correlations among the data at different measuring points into account, by classifying the measuring points with a Gaussian mixture model according to their structural attributes and temporal characteristics. We selected the distance from the measuring point to the dam foundation ($d$) as the spatial indicator, and the maximum absolute value of the aging sequence ($\lambda$) and the degree of convergence of the data series ($\xi$) as the temporal indicators. The Gaussian mixture model is highly flexible (i.e., the shape of the multidimensional Gaussian distribution can fit the sample points well), which carries a high risk of over-fitting and of falling into a local optimum. To search for the optimal solution in a wider space, we introduced the iterative self-organizing data analysis method to improve the Gaussian mixture model's annealing ability.
In this study, we validated the model using the radial displacement data of the concrete arch dam of the Jinping-I Hydropower Station as an example. We modelled a dataset of 24 measuring points and evaluated the model using the correlation coefficient ($R$) and the residual standard deviation ($s$). The predictions fit the monitoring data well, with correlation coefficients between 0.958 and 0.999 for all the measuring points. We then compared our model with the statistical model and found that our model performs better at 22 of the 24 measuring points.
Using the clustering algorithm, the correlation between the measuring points can be considered when evaluating the displacement of the dam, which significantly improves the accuracy of the prediction model. As for the perspectives of this research, besides dam displacement data, many other kinds of monitoring data exist in hydraulic engineering, such as crack monitoring data and slope deformation data. In these monitoring projects, correlation also exists between the data at different measuring points. Therefore, we expect to apply the model to further structure monitoring projects. In addition, the current model takes the spatial characteristics of the measuring points into account; however, it cannot yet predict the data at one point from the measured data of adjacent points. Hence, another future direction is to predict the displacement at an arbitrary point on the dam from the measured data at a limited number of monitoring points, by combining the finite element method with the prediction model based on the machine learning method.

Figure 2. Flow chart of the ISODATA model.

Figure 3. Schematic diagram of the temporal and spatial representation of the monitoring data.

Figure 4. A sketch map of the geographical location of Jinping-I Hydropower Station and its design drawing.

Figure 5. Distribution map of the measuring points.

Figure 6. Time evolution of the water level and the radial displacement (positive towards downstream) of each measuring point (data provided by Jinping-I Hydropower Station with permission).

Figure 7. Annual variation of the cross section's radial displacement in 2014.
The spatial indicator $d$ ranges from 7 to 175, the temporal indicator $\lambda$ ranges from 0 to 19.93, and the values of $\xi$ lie in the range of 0 to 3.49. To eliminate the dimensional influence between different indicators, we normalized the values of the indicators $d$, $\lambda$ and $\xi$ to the range 0 to 1. Then, we set the initial parameters of the clustering. The initial class number was set to 4; the initial weight parameter $\pi_k$ was 0.25; the initial variance $\sigma_i^{(k)}$ was 1; the minimum element number $N_{min}$ was 2; the maximum allowable variance $\sigma_{max}$ was 3; and the minimum allowable distance $d_{min}$ was 0.1. The clustering results obtained with the ISODATA-GMM method from the values of $d$, $\lambda$ and $\xi$ are shown in Figure 8.
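The normalization of the three indicators to the range 0 to 1 can be sketched as a column-wise min-max scaling. The function name and the three sample rows are hypothetical (the rows only reuse the quoted minimum and maximum of each indicator, not the values of Table 1); the initial clustering parameters are those stated above, collected in a dictionary whose key names are assumptions.

```python
import numpy as np

def min_max_normalise(features):
    """Scale each column to [0, 1]; constant columns are mapped to 0."""
    features = np.asarray(features, dtype=float)
    lo, hi = features.min(axis=0), features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)         # avoid division by zero
    return (features - lo) / span

# Columns: d, lambda, xi. Illustrative rows, not the measured indicators.
indicators = np.array([
    [7.0, 0.0, 0.0],
    [175.0, 19.93, 3.49],
    [91.0, 9.965, 1.745],
])
Z = min_max_normalise(indicators)
print(Z)

# Initial clustering parameters as quoted in the text (key names are hypothetical).
init_params = {
    "n_classes": 4,
    "weight": 0.25,
    "variance": 1.0,
    "n_min": 2,
    "var_max": 3.0,
    "d_min": 0.1,
}
```

After this scaling, a distance threshold such as $d_{min} = 0.1$ has a comparable meaning along all three indicator axes.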

Figure 8. Classification of the measuring points after clustering.

Figure 9. Map of the measuring points indicating the clustering results.

Figure 10. Clustering results of the measuring points indicated by temporal characteristics.

Figure 11. The fitting and predicting results of Class 1: (a) IP 13-2; (b) IP 16-1; (c) PL 19-5; (d) PL 19-4; (e) PL 9-5; (f) PL 5-4; (g) PL 5-3. The red dots represent the measured data, and the black lines are the fitted and predicted data. The model was developed with the data in the white area and validated by the data in the blue area. The positive direction is downstream.

Figure 12. Comparison of (a) R and (b) s for the random coefficient model and the statistical model.

Table 1. Indicators $d$, $\xi$ and $\lambda$ calculated based on the ISODATA-GMM clustering method.
As we can see from Table 1, the indicators are represented at different scales.

Table 2. Correlation coefficient R and residual standard deviation s of each measuring point.