We first clustered the measured displacement data obtained from each measuring point using a Gaussian Mixture Model (GMM) and improved the model with the Iterative Self-Organizing Data Analysis (ISODATA) algorithm. The displacement data of the 24 measuring points selected for the case study were classified into five groups. We then used the random coefficient model to fit the data of each class.

#### 3.1. Clustering of the Monitoring Data Based on ISODATA-GMM

As introduced in Section 2, the displacement is mainly induced by three components: the water pressure component ${\delta}_{H}$, the temperature component ${\delta}_{T}$ and the aging component ${\delta}_{\theta}$. To build a clustering criterion that represents the spatial and temporal characteristics of the measuring points, we have to discuss these three factors separately. The water pressure component ${\delta}_{H}$ and the temperature component ${\delta}_{T}$ mainly depend on the location of the measuring point and the geometrical size of the dam. For concrete dams, the spatial relations between the measuring points can be represented by each point's distance to the dam foundation $d$. The temporal characteristic of a measuring point mainly affects the aging component ${\delta}_{\theta}$. We first separated the aging component ${\delta}_{\theta}$ from the measured time series. The temporal characteristic can be described by two factors: one is the maximum absolute value of the aging sequence $\lambda$, and the other is the degree of convergence of the data series $\xi$, expressed by $\xi =\left|{\displaystyle \frac{{c}_{1}}{{c}_{2}}}\right|$, where ${c}_{1}$ and ${c}_{2}$ are the aging term coefficients. Therefore, we use $d$, $\lambda$ and $\xi$ as the clustering criteria to represent the spatial and temporal characteristics of the measuring points.
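The three criteria can be collected into one feature vector per measuring point. The sketch below assumes the aging component and its coefficients have already been separated out; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def clustering_features(aging_series, c1, c2, dist_to_foundation):
    """Return the feature vector (d, lambda, xi) for one measuring point.

    aging_series: separated aging component delta_theta as a 1-D array;
    c1, c2: fitted aging-term coefficients; dist_to_foundation: distance d.
    All names here are illustrative assumptions.
    """
    d = dist_to_foundation                 # spatial feature
    lam = np.max(np.abs(aging_series))     # max absolute value of the aging sequence
    xi = abs(c1 / c2)                      # degree of convergence xi = |c1 / c2|
    return np.array([d, lam, xi])

features = clustering_features(np.array([0.1, 0.4, -0.6, 0.55]),
                               c1=1.2, c2=3.0, dist_to_foundation=25.0)
```

Each point's feature vector then serves as one sample for the ISODATA-GMM clustering below.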

Gaussian Mixture Model (GMM) based clustering assumes that the data come from several sub-datasets which are modelled separately, and that the whole dataset is a mixture of these sub-datasets. The resulting model is a finite mixture model. When the data are multivariate continuous observations, the parametrized component density is usually a multidimensional Gaussian density.

For a one-dimensional dataset, we assume that the probability distribution of a random variable $x$ follows a mixture of two Gaussian distributions, as described in Equation (6):

where $k=1$ and $k=2$ index the two Gaussian distributions; the $k$th prior probability is $\left\{{p}_{1}=1/2,{p}_{2}=1/2\right\}$; $\left\{{\mu}_{k}\right\}$ and $\sigma$ are the means and the common standard deviation of the two Gaussian distributions, respectively. We write $\theta \equiv \left\{\left\{{\mu}_{k}\right\},\sigma \right\}$ to collect these parameters.

The dataset ${\left\{{x}_{n}\right\}}_{n=1}^{N}$, which contains $N$ points, is assumed to be an independent sample from this distribution. ${k}_{n}$ denotes the unknown class tag of the $n$th point.

In the case that $\left\{{\mu}_{k}\right\}$ and $\sigma$ are known, the posterior probability of the class tag ${k}_{n}$ of the $n$th point can be written as:
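This posterior is simply Bayes' rule applied to the two weighted component densities. A minimal sketch (the function and argument names are our own, not from the paper):

```python
import numpy as np

def posterior_class_tags(x, mu, sigma, priors=(0.5, 0.5)):
    """Posterior P(k_n = k | x_n, theta) for a two-component 1-D mixture.

    Assumes known means mu = (mu1, mu2), a shared standard deviation sigma,
    and the prior probabilities p_k; names are illustrative.
    """
    x = np.asarray(x, dtype=float)[:, None]        # shape (N, 1)
    mu = np.asarray(mu, dtype=float)[None, :]      # shape (1, 2)
    p = np.asarray(priors, dtype=float)[None, :]
    # Unnormalised p_k * N(x_n | mu_k, sigma^2); the shared normalising
    # constant of the Gaussians cancels in the ratio below.
    dens = p * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return dens / dens.sum(axis=1, keepdims=True)  # rows sum to 1

resp = posterior_class_tags([0.0, 1.0], mu=(0.0, 1.0), sigma=0.5)
```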

If $\left\{{\mu}_{k}\right\}$ is unknown and $\sigma$ is known, we may infer $\left\{{\mu}_{k}\right\}$ from the data series ${\left\{{x}_{n}\right\}}_{n=1}^{N}$. We hence derive an iterative algorithm for $\left\{{\mu}_{k}\right\}$ that maximizes the likelihood:

The derivative of the natural logarithm of the likelihood $L$ with respect to $\left\{{\mu}_{k}\right\}$ is:

where ${p}_{k|n}\equiv P\left({k}_{n}=k|{x}_{n},\theta \right)$ is the posterior probability of the class tag (see Equation (7)). Ignoring the terms in $\frac{\partial}{\partial {\mu}_{k}}P\left({k}_{n}=k|{x}_{n},\theta \right)$, the second derivative with respect to $\left\{{\mu}_{k}\right\}$ can be approximated as:

Then, the initial ${\mu}_{1}$, ${\mu}_{2}$ are iterated to ${\mu}_{1}^{\prime}$, ${\mu}_{2}^{\prime}$ using the approximate Newton–Raphson steps:
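Under the approximation above (dropping the derivative of the responsibilities), each Newton–Raphson step reduces to a responsibility-weighted mean of the data. A sketch under that reading; the function name and the synthetic data are illustrative:

```python
import numpy as np

def update_means(x, mu, sigma, priors=(0.5, 0.5), n_iter=50):
    """Iterate the approximate Newton-Raphson step for the two means."""
    x = np.asarray(x, dtype=float)
    mu = np.array(mu, dtype=float)
    for _ in range(n_iter):
        # Responsibilities p_{k|n} given the current means
        d = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        d = d * np.asarray(priors)
        r = d / d.sum(axis=1, keepdims=True)
        # Approximate step: mu_k' = sum_n p_{k|n} x_n / sum_n p_{k|n}
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return mu

# Toy data: two well-separated groups of points
data = np.concatenate([np.full(20, -2.0), np.full(20, 2.0)])
mu_hat = update_means(data, mu=(-1.0, 1.0), sigma=1.0)
```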

We now extend the model to the multidimensional dataset (a mixture of multiple Gaussian distributions). The Gaussian mixture density can be written as:

where $k$ is the index of the Gaussian component; $i$ is the index of the data dimension; $n$ is the index in the data sequence; $I$ is the total number of data dimensions; ${\pi}_{k}$ is the component weight; ${\mu}_{i}^{\left(k\right)}$ is the mean of the Gaussian distribution; ${\sigma}_{i}^{\left(k\right)}$ is the variance of the Gaussian distribution; ${x}_{i}^{\left(n\right)}$ is the data point. The iterative formula of ${\mu}_{i}^{\left(k\right)}$ has been presented in Equation (11). The iterative formulas of the variance ${\sigma}_{i}^{\left(k\right)}$ and the weight ${\pi}_{k}$ are as follows:

Once the iteration converges, GMM clustering classifies the dataset into several classes.
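The mean, variance and weight updates above have the same form as one pass of the expectation–maximization algorithm for a diagonal-covariance Gaussian mixture. A sketch under that reading; the function name and toy data are illustrative:

```python
import numpy as np

def em_step(X, mu, var, pi):
    """One update pass for a diagonal-covariance Gaussian mixture.

    X: (N, I) data; mu, var: (K, I) means and variances sigma_i^(k);
    pi: (K,) weights. After computing responsibilities, each parameter
    is refreshed as a responsibility-weighted statistic.
    """
    N = X.shape[0]
    # Log density of each point under each component (diagonal covariance)
    log_d = -0.5 * (((X[:, None, :] - mu[None]) ** 2) / var[None]
                    + np.log(2 * np.pi * var[None])).sum(axis=2)
    log_d += np.log(pi)[None, :]
    r = np.exp(log_d - log_d.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)          # responsibilities, shape (N, K)
    Nk = r.sum(axis=0)                         # effective counts per component
    mu_new = (r.T @ X) / Nk[:, None]
    var_new = (r.T @ (X ** 2)) / Nk[:, None] - mu_new ** 2
    pi_new = Nk / N
    return mu_new, var_new, pi_new

# Toy data: two tight groups in 2-D
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
mu1, var1, pi1 = em_step(X, np.array([[0.5, 0.5], [4.5, 4.5]]),
                         np.ones((2, 2)), np.array([0.5, 0.5]))
```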

However, GMM clustering still has some defects. The number of classes and the number of data points in each class are unknown before clustering; hence, the iteration may produce a class containing only one or two data points, which may cause the final results to diverge.

To solve this problem, we introduce the Iterative Self-Organizing Data Analysis (ISODATA) algorithm to realize the following functions: (a) split a class into two when its variance is too large, (b) delete a class when its number of samples falls below a specified threshold, and (c) merge two classes when their centers are too close.
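A minimal sketch of these three adjustments over a list of clusters; the threshold values, split rule and names below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def isodata_adjust(clusters, max_var=4.0, min_size=3, min_dist=1.0):
    """One pass of the three ISODATA adjustments.

    clusters: list of (N_i, I) arrays of feature vectors.
    """
    # (b) delete clusters with too few samples
    clusters = [c for c in clusters if len(c) >= min_size]
    # (a) split a cluster whose largest per-dimension variance is too big,
    #     here by cutting at the median of that dimension
    out = []
    for c in clusters:
        v = c.var(axis=0)
        dim = int(np.argmax(v))
        if v[dim] > max_var and len(c) >= 2 * min_size:
            mask = c[:, dim] > np.median(c[:, dim])
            out += [c[mask], c[~mask]]
        else:
            out.append(c)
    # (c) merge a cluster with the first other cluster whose centre is too close
    merged = []
    while out:
        c = out.pop()
        for j, other in enumerate(out):
            if np.linalg.norm(c.mean(axis=0) - other.mean(axis=0)) < min_dist:
                c = np.vstack([c, out.pop(j)])
                break
        merged.append(c)
    return merged

# Toy input: two near-identical clusters (merged) and one singleton (deleted)
result = isodata_adjust([np.zeros((5, 2)), np.zeros((5, 2)) + 0.1,
                         np.full((1, 2), 10.0)])
```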

Figure 2 shows the flow chart of ISODATA.

#### 3.2. Random Coefficient Model

As shown in Figure 3, the monitoring data are two-dimensional, containing both time series data and cross-sectional data. The data on one panel represent the cross-section displacement data at a certain time, and each grid on the panel stands for a monitoring point. The monitoring data of the dam's cross section at an indicated time can therefore be considered as a two-dimensional panel.
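To make this panel structure concrete, the sketch below simulates such data under the random-coefficient assumption $\beta_i = \beta + \gamma_i$ used later in this section; all dimensions and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 5, 100, 2                   # panels, observations per panel, regressors
beta = np.array([1.0, -0.5])          # common mean coefficient vector beta
Delta = 0.05 * np.eye(K)              # covariance of the deviations gamma_i

X, Y = [], []
for i in range(N):
    gamma_i = rng.multivariate_normal(np.zeros(K), Delta)  # individual deviation
    X_i = rng.normal(size=(T, K))                          # explanatory variables x_kit
    u_i = 0.1 * rng.normal(size=T)                         # random interference term u
    Y.append(X_i @ (beta + gamma_i) + u_i)                 # beta_i = beta + gamma_i
    X.append(X_i)
```

Pooled least squares on the stacked data recovers $\beta$ on average, but it ignores the block structure of the error covariance, which is what motivates the generalized least squares estimator discussed below.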

Here, Equation (15) expresses the regression of a panel whose coefficients do not vary with time:

where ${y}_{it}$ is the two-dimensional dam displacement data; ${x}_{kit}$ is the two-dimensional data of the explanatory variables; $t$ is the time index; $i$ is the cross section index; $k$ is the explanatory variable index; ${\beta}_{ki}$ is independent of time and can be divided into ${\beta}_{k}$ and ${\gamma}_{ki}$; $\beta ={({\beta}_{1},\cdots ,{\beta}_{K})}^{{}^{\prime}}$ is the common mean coefficient vector; $\gamma ={({\gamma}_{1i},\cdots ,{\gamma}_{Ki})}^{{}^{\prime}}$ is the deviation of the individual coefficients from the common mean; $u$ is a random interference term. Ref. [20] assumed that ${\beta}_{i}=\beta +{\gamma}_{i}$ is a random variable and deduced the following assumptions in Equation (16):

By stacking the $NT$ observations, we can write the equation in matrix form (Equation (17)):

where $u={({u}_{1}^{\prime},\cdots ,{u}_{N}^{\prime})}^{\prime}$, $\gamma ={({\gamma}_{1}^{\prime},\cdots ,{\gamma}_{N}^{\prime})}^{\prime}$, $N$ is the number of panels, and $T$ is the number of observations in each panel. The covariance matrix of the compound error term $\tilde{X}\gamma +u$ is block diagonal, and its $i$-th diagonal block is ${\psi}_{i}={X}_{i}\Delta {X}_{i}^{\prime}+{\sigma}_{i}^{2}{I}_{T}$. According to [20], the OLS estimator of $\beta$ is biased; however, once $\frac{1}{NT}{X}^{\prime}X$ converges to a non-zero constant matrix, OLS still yields a consistent but inefficient estimate. The optimal linear unbiased estimator of $\beta$ is the generalized least squares estimator:

The variance of the estimator is:

${\widehat{\beta}}_{GLS}$ follows an asymptotic normal distribution and is the efficient estimator of $\beta$. The random coefficient model constrains the coefficients of the explanatory variables to follow asymptotic normal distributions instead of treating them as free variables, and hence captures the correlation between adjacent monitoring points.
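A sketch of this GLS estimator, assuming $\Delta$ and ${\sigma}_{i}^{2}$ are known (in practice they would themselves be estimated); the function name and the simulation values are illustrative:

```python
import numpy as np

def gls_random_coeff(X_list, y_list, Delta, sigma2_list):
    """GLS estimate of beta for the random coefficient model.

    Uses Psi_i = X_i Delta X_i' + sigma_i^2 I_T as the i-th diagonal
    block of the compound-error covariance. Returns (beta_hat, cov),
    where cov is the variance of the estimator.
    """
    K = X_list[0].shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for X_i, y_i, s2 in zip(X_list, y_list, sigma2_list):
        T = X_i.shape[0]
        Psi_inv = np.linalg.inv(X_i @ Delta @ X_i.T + s2 * np.eye(T))
        A += X_i.T @ Psi_inv @ X_i
        b += X_i.T @ Psi_inv @ y_i
    cov = np.linalg.inv(A)
    return cov @ b, cov

# Illustrative simulation: 5 panels with beta_i = beta + gamma_i
rng = np.random.default_rng(1)
Delta = 0.05 * np.eye(2)
beta_true = np.array([1.0, -0.5])
X_list, y_list = [], []
for _ in range(5):
    Xi = rng.normal(size=(200, 2))
    bi = beta_true + rng.multivariate_normal(np.zeros(2), Delta)
    y_list.append(Xi @ bi + 0.1 * rng.normal(size=200))
    X_list.append(Xi)
beta_hat, cov = gls_random_coeff(X_list, y_list, Delta, [0.01] * 5)
```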

The distribution density of a monitoring point strongly depends on its features, such as its location. Hence, we cluster the measuring points according to their spatial and temporal characteristics. Using the ISODATA-GMM method introduced in Section 3.1, measuring points with similar spatial and temporal characteristics are classified into the same group; the coefficients within the same cluster can then be considered to follow the same normal distribution.