Article

A Possible World-Based Fusion Estimation Model for Uncertain Data Clustering in WBNs

1 Department of Electronic and Information Engineering, Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing Jiaotong University, Beijing 100044, China
2 School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China
3 Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China
4 Department of Electrical Engineering, National Dong Hwa University, Hualien 97401, Taiwan
5 School of Information Engineering, Beijing Institute of Petrochemical Technology, Beijing 102617, China
* Author to whom correspondence should be addressed.
Sensors 2021, 21(3), 875; https://doi.org/10.3390/s21030875
Submission received: 15 December 2020 / Revised: 24 January 2021 / Accepted: 25 January 2021 / Published: 28 January 2021

Abstract:
In data clustering, the measured data are usually regarded as uncertain data. As a probability-based clustering technique, the possible world model can easily cluster uncertain data. However, possible world methods must satisfy two conditions: the data of the different possible worlds must be determined, together with their corresponding probabilities of occurrence. Most existing methods make multiple measurements and treat each measurement as the deterministic data of one possible world. In this paper, a possible world-based fusion estimation model is proposed, which converts the deterministic data into probability distributions by means of an estimation algorithm, so that the corresponding probabilities follow naturally. Further, in the clustering stage, the Kullback–Leibler divergence is introduced to describe the relationships between the probability distributions of different possible worlds. Then, an application in wearable body networks (WBNs) is given, and some interesting conclusions are drawn. Finally, simulations show better performance when the relationships between features in the measured data are more complex.

1. Introduction

Clustering is a kind of machine learning technology that puts similar objects into the same cluster. Clustering techniques play an important role in many areas, such as health care and action recognition in the medical domain [1,2], behavior surveillance and battlefield prediction in the military field [3,4], and resource and information management in the communications field [5,6]. Many clustering methods have been proposed; according to the clustering criterion, they can be divided into three principal types: distance-based, density-based, and connectivity-based [7,8].
Most clustering methods focus on deterministic data. Unfortunately, almost all clustering data are collected by measuring equipment, which introduces measurement errors. In this case, uncertain data describe the measurements better. To acquire better and more appropriate results, fusion estimation methods such as Bayes-based [9], Kalman-based [10], or artificial intelligence-based [11,12] methods are commonly used to estimate the measurements.
Fusion estimation is a technology that uses the computing power of the data acquisition equipment to de-noise the measurement data and remove redundancy according to certain rules. It focuses on mining data information, designing corresponding estimation algorithms, and improving data accuracy. In this technology, the measurement data are first de-noised and then fused along the time series to obtain an accurate description of the uncertain data. Finally, the uncertain data are clustered to obtain the final processing result.
Many methods have been proposed to deal with uncertain data in recent years [13,14,15]. Among these methods, possible world-based methods have been demonstrated to be efficient and reasonable. Possible world-based clustering methods consider all the possibilities of the uncertain data and fuse them into the final clustering result. This kind of method usually exhibits good performance. On the other hand, uncertain data can be represented by a probability distribution in most cases. Therefore, the Kullback–Leibler divergence (KL divergence) [16] is used to describe the similarity of two probability distributions.
In practice, the accuracy of different acquisition equipment differs, which manifests as differences in data uncertainty. Existing possible world-based algorithms handle these differences in uncertainty in a relatively simple way. In this paper, the variance, an important statistic of data uncertainty, is introduced into the possible world model to study its role in improving accuracy. Then, a possible world-based fusion estimation model (PWFEM) for uncertain data is presented, which includes two methods based on different distance formulas: when the variance of the uncertain data is small, the numerical distance-based method (PWFEM-nd) is employed; when the variance is prominent, the probabilistic distance-based method (PWFEM-pd) is employed. Then, the application in wearable body networks (WBNs) is introduced, and the specific derivation formulas are given for the different distance formulas. Finally, the simulations show the good performance of the proposed model.
The rest of the paper is organized as follows. In Section 2, the related works are introduced. In Section 3, the preliminaries are introduced, and some definitions and assumptions are given. The theoretical derivation of the PWFEM is given in Section 4. In Section 5, the simulations examine the performance of the PWFEM. Finally, conclusions are given in Section 6.

2. Related Works

In this section, the processing technologies of uncertain data are introduced in detail. The collected data that come from acquisition equipment contain noise, which means the collected data contain great uncertainty. Therefore, it is necessary to perform fusion estimation processing on the data first, and use the rules and redundancy of the data itself to improve the data accuracy and reduce the uncertainty of the data.
Commonly used fusion estimation algorithms include the Bayes filter (BF) [17], Kalman filter (KF) [18], extended Kalman filter (EKF) [19], unscented Kalman filter (UKF) [20], and particle filter (PF) [21]. The BF and KF are estimators for linear systems: the BF can, in theory, estimate data with an arbitrary noise distribution, and the KF is the special case of the BF in which the noise is Gaussian white noise. The EKF, UKF, and PF are estimators for nonlinear systems: the EKF targets weakly nonlinear systems, while the UKF targets strongly nonlinear systems at a higher computational cost. The PF works directly with the conditional probability density, which is approximated using the EKF or UKF; its estimation precision is higher than that of the EKF or UKF alone, but its computational cost is much higher as well.
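As a minimal, illustrative sketch of the filtering step, a scalar Kalman filter is shown below; the random-walk state model and the noise variances q and r are assumptions chosen for illustration, not values taken from the cited works. Each estimate together with its error variance defines the Gaussian probability data that are later used to build a possible world.

import numpy as np

def kalman_1d(measurements, q=1e-3, r=1e-1, x0=0.0, p0=1.0):
    # Scalar Kalman filter for a random-walk state x_t = x_{t-1} + w_t,
    # observed as z_t = x_t + v_t, with Var(w) = q and Var(v) = r.
    x, p = x0, p0
    means, variances = [], []
    for z in measurements:
        # Predict
        x_pred, p_pred = x, p + q
        # Update
        k = p_pred / (p_pred + r)          # Kalman gain
        x = x_pred + k * (z - x_pred)
        p = (1.0 - k) * p_pred
        means.append(x)
        variances.append(p)
    return np.array(means), np.array(variances)

# Example: noisy measurements of a constant value 5.0
z = 5.0 + np.random.normal(0.0, 0.3, size=50)
mean, var = kalman_1d(z)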
In [22], the authors argued that two possible world-based clustering algorithms suffered from the following issues: (1) they dealt with each possible world independently and ignored the consistency principle across different possible worlds; (2) they required an extra post-processing procedure to obtain the results, which meant that effectiveness was highly dependent on the post-processing method, and their efficiencies were also not very good. In order to solve the problems above, Liu et al. proposed a possible world-based consistency learning model that considered the consistency principle during the clustering/classification procedure and thus could achieve satisfactory performance.
The possible world-based consistency learning model for clustering uncertain data (PWCLU) was proposed in [22], which holds that the clustering results in each possible world are consistent. Several types of equipment were used to collect the same data; the data from each piece of equipment were considered to belong to one possible world, and the probability of each possible world was regarded as equal. The authors only gave an algorithm for dealing with a finite number of possible worlds.
On the other hand, clustering algorithms usually require a method to describe the distance between two datasets. In uncertain data, the distance can be expressed as a probability distribution in most cases. Therefore, a method of describing the distance between probability distributions is required. Sinkkonen and Kaski [23] studied the problem of learning groups or categories that were local in the continuous primary space but homogeneous according to the distributions of an associated auxiliary random variable over a discrete auxiliary space. In their model, Kullback–Leibler divergence was used to calculate the distance between two probability distributions.
In this paper, a possible world-based fusion estimation model (PWFEM) is proposed for clustering uncertain data. The proposed model removes the consistency-principle assumption of [22]. Moreover, two PWFEM-based methods are given. One generalizes PWCLU to continuous possible worlds and is based on numerical distance; it is therefore called PWFEM-nd. The other is based on the distance between probability distributions and is named PWFEM-pd. Then, an application in WBNs is discussed. Two specific distance functions, corresponding to the numerical distance and the probability distribution distance, respectively, are introduced to prove that PWFEM-nd is equivalent to PWFEM-pd under certain circumstances. Finally, the simulations are discussed; they show the good performance of the models.

3. Preliminaries

In this section, some necessary definitions and assumptions are given for possible world and Kullback–Leibler divergence; the assumptions of independence for each component of the datasets and the structure of the data are also given.

3.1. Definition of Possible World

Let $O \subseteq \mathbb{R}^{N \times n}$, $O = \{O_1, O_2, \ldots, O_n\}$ be an uncertain dataset, where $O$ is not deterministic data but a probability distribution. If $O$ is a discrete probability distribution, $pw$ is one of the possibilities of the uncertain data $O$, which can be written as $pw = \{O_1^{pw}, O_2^{pw}, \ldots, O_n^{pw}\}$; this is deterministic data with probability $P(pw)$. If $O$ is a continuous probability distribution, $O$ can be described by a probability density function $f(pw)$, where $pw$ is the value of the random variable $O$. Then,
$$\int_D f(pw)\, \mathrm{d}(pw) = 1$$

3.2. Definition of Kullback–Leibler Divergence

Let $p(x)$ and $q(x)$ be two distributions of the random variable $X$; then the Kullback–Leibler divergence of $p(x)$ and $q(x)$ is:
$$d_{KL}\left(p(x), q(x)\right) = \int_{-\infty}^{+\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) \mathrm{d}x$$

3.3. Some Assumptions

Assumption 1. Almost all possible worlds exhibit the same class labels and cluster structures; they exhibit different class labels and cluster structures only with small probabilities.
Assumption 2. In Section 5, it is assumed that $\forall x_i, x_j \in X$, $x_i + x_j$ is also Gaussian distributed.
Assumption 3. In Section 5, it is assumed that the wearable nodes keep a stable state to collect the data all the time. Therefore, the covariance matrix will not change.

4. Possible World-Based Fusion Estimation Model (PWFEM)

In this section, the details of the PWFEM are introduced in three parts. The first part is the introduction of data fusion estimation. The second part is the introduction of the calculation process of the distribution distance. The third part introduces the clustering method based on the possible world.

4.1. Data Fusion Estimation

The collected data can be divided into two types: filterable data and high-accuracy data. Without loss of generality, it is assumed that the measurement data at time t are:
$$M_t = \left[z_1^f, z_2^f, \ldots, z_q^f, z_1^a, z_2^a, \ldots, z_s^a\right]_t$$
where $M_t^f = \left[z_1^f, z_2^f, \ldots, z_q^f\right]_t$ are the filterable data, and $M_t^a = \left[z_1^a, z_2^a, \ldots, z_s^a\right]_t$ are the high-accuracy data.
Corresponding to the possible world, the filterable data are probabilistic data, while the high-accuracy data are numeric data. It is assumed that the format of the clustering data in a possible world at time t is:
$$X_t = \left[x_1^p, x_2^p, \ldots, x_h^p, x_1^n, x_2^n, \ldots, x_s^n\right]_t$$
where $X_t^p = \left[x_1^p, x_2^p, \ldots, x_h^p\right]_t$ are the probability data, and $X_t^n = \left[x_1^n, x_2^n, \ldots, x_s^n\right]_t$ are the numeric data.
In most cases, the filterable data can be obtained according to the Kalman-based filter. The high accuracy data can be converted to filterable data by the Gaussian distribution, whose expectation is zero and whose variance is small. The details are as follows.
The measurement data are first converted to the clustering data by the following formulas:
If the filterable data satisfy the following state function and measurement function:
$$\begin{cases} X_t^p = f\left(X_{t-1}^p\right) + \omega_{t/t-1} \\ M_t^f = g\left(X_t^p\right) + \upsilon_t \end{cases}$$
an appropriate filter algorithm can be used to solve the functions above. If the result is $\hat{X}_t^p$, the probability data can be written as $\hat{X}_t^p + \omega_{t/t-1}$.
Similarly, the numerical data can be written as $X_t^n = M_t^a + \omega_t^a$, where $\omega_t^a$ is a Gaussian distribution with zero mean and small variance.
Then, we have:
$$X_t = \begin{bmatrix} \hat{X}_t^p + \omega_{t/t-1} \\ M_t^a + \omega_t^a \end{bmatrix} = \begin{bmatrix} \hat{X}_t^p \\ M_t^a \end{bmatrix} + \begin{bmatrix} \omega_{t/t-1} \\ \omega_t^a \end{bmatrix} = \hat{X}_t + \Omega_t$$
where $\hat{X}_t = \begin{bmatrix} \hat{X}_t^p \\ M_t^a \end{bmatrix}$ and $\Omega_t = \begin{bmatrix} \omega_{t/t-1} \\ \omega_t^a \end{bmatrix}$. Moreover, we let $\Omega_t = \begin{bmatrix} \omega^p \\ \omega^a \end{bmatrix}$, which is a time-invariant Gaussian distribution. Therefore, according to Assumption 2, the multivariate Gaussian distribution of $X_t$ can be written as follows:
$$f(x) = \frac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\left(-\frac{(x - \mu_x)^T \Sigma^{-1} (x - \mu_x)}{2}\right)$$
where $l = h + s$, $\mu_x = [E(x_i)]_{i=1}^{l}$, and
$$\Sigma = [\sigma_{ij}]_{l \times l}, \quad \sigma_{ij} = \sqrt{D(x_i) \cdot D(x_j)}$$
Based on the above, the structure of clustering data can be confirmed. Then, the distance-based functions need to be confirmed.
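To make the construction above concrete, the following sketch assembles one clustering datum (mean and covariance) from a hypothetical filter output and a high-accuracy measurement; the block-diagonal covariance (no cross-correlation between the two parts) and the artificial variance eps are simplifying assumptions made only for illustration.

import numpy as np

def build_clustering_datum(x_hat_p, P_p, m_a, eps=1e-4):
    # Assemble X_t = [x_hat_p; m_a] with a block-diagonal covariance:
    # the filter covariance P_p for the filterable part and a small
    # artificial variance eps * I for the high-accuracy part.
    x_hat_p = np.atleast_1d(np.asarray(x_hat_p, dtype=float))
    m_a = np.atleast_1d(np.asarray(m_a, dtype=float))
    mu = np.concatenate([x_hat_p, m_a])               # mean of X_t
    cov = np.block([
        [np.atleast_2d(P_p), np.zeros((len(x_hat_p), len(m_a)))],
        [np.zeros((len(m_a), len(x_hat_p))), eps * np.eye(len(m_a))],
    ])
    return mu, cov

mu, cov = build_clustering_datum([1.2, 0.7], np.diag([0.05, 0.02]), [36.6])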

4.2. Distance Calculation Method Based on KL Divergence

Almost all clustering algorithms need to calculate distances. In the PWFEM, there are two types of data: filterable and high-accuracy. For the high-accuracy data, the Euclidean distance can be used, while the KL divergence can be used to process the filterable data. In this section, the distance calculation method based on the KL divergence is introduced in detail.
KL divergence analyzes the degree of difference between two distributions from the perspective of information entropy. Assume that $p(x)$ and $q(x)$ are two distributions of the random variable $X$; then the KL divergence is:
$$KL(p \,\|\, q) = \int_{-\infty}^{+\infty} p(x) \log\frac{p(x)}{q(x)}\, \mathrm{d}x$$
The calculation formula in the discrete case is:
$$KL(p \,\|\, q) = \sum_{i=1}^{n} p(x_i) \log\frac{p(x_i)}{q(x_i)}$$
Assume that the probability distributions are Gaussian, $P \sim N(\mu_1, \Sigma_1)$ and $Q \sim N(\mu_2, \Sigma_2)$, and that the dimension of the data is $n$. Then, the KL divergence is:
$$KL(P \,\|\, Q) = \int_{-\infty}^{+\infty} p(x) \log\frac{p(x)}{q(x)}\, \mathrm{d}x = E_p\left[\log p(x) - \log q(x)\right]$$
Plugging $P \sim N(\mu_1, \Sigma_1)$ and $Q \sim N(\mu_2, \Sigma_2)$ into (10) gives:
$$KL(P \,\|\, Q) = \frac{1}{2} E_P\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right] = \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} E_P\left[(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right] + \frac{1}{2} E_P\left[(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right]$$
where
$$E_P\left[(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right] = E_P\left[\mathrm{tr}\left(\Sigma_1^{-1}(x-\mu_1)(x-\mu_1)^T\right)\right] = \mathrm{tr}\left[E_P\left(\Sigma_1^{-1}(x-\mu_1)(x-\mu_1)^T\right)\right] = \mathrm{tr}\left[\Sigma_1^{-1} E_P\left((x-\mu_1)(x-\mu_1)^T\right)\right] = n$$
and
$$E_P\left[(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)\right] = E_P\left[\mathrm{tr}\left(\Sigma_2^{-1}(x-\mu_2)(x-\mu_2)^T\right)\right] = \mathrm{tr}\left[\Sigma_2^{-1} E_P\left((x-\mu_2)(x-\mu_2)^T\right)\right] = \mathrm{tr}\left[\Sigma_2^{-1} E_P\left(x x^T - x\mu_2^T - \mu_2 x^T + \mu_2\mu_2^T\right)\right] = \mathrm{tr}\left[\Sigma_2^{-1}\left(\Sigma_1 + \mu_1\mu_1^T - \mu_1\mu_2^T - \mu_2\mu_1^T + \mu_2\mu_2^T\right)\right] = \mathrm{tr}\left[\Sigma_2^{-1}\Sigma_1 + \Sigma_2^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^T\right] = \mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2)$$
Finally, we have
$$KL(P \,\|\, Q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2)\right]$$
Moreover, if $\Sigma_1 = \Sigma_2 = \Sigma$, we get:
$$d_{KL}(i, j) = KL(P \,\|\, Q) = \frac{1}{2}(\mu_j - \mu_i)^T \Sigma^{-1} (\mu_j - \mu_i)$$
In this way, the distance between two probability distributions is obtained. Then, the clustering method based on the possible world can be used.
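A direct NumPy implementation of the closed-form expression (14) might look as follows; when the covariances are equal, it reduces to the Mahalanobis-type distance of (15). This is a sketch rather than the authors' code.

import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    # KL(P || Q) for P = N(mu1, cov1) and Q = N(mu2, cov2), following Eq. (14).
    n = len(mu1)
    cov2_inv = np.linalg.inv(cov2)
    diff = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(cov2) / np.linalg.det(cov1))
                  - n
                  + np.trace(cov2_inv @ cov1)
                  + diff @ cov2_inv @ diff)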

4.3. The Clustering Method Based on the Possible World

In [22], the authors used an adaptive, local-structure learning method to calculate the consensus affinity matrix. In their model, the collected numerical data are used to match the probability density function (PDF) of the uncertain data. However, the authors give no algorithm for the case where the PDF is given directly. Moreover, the proposed method needs a sizable quantity of data. In this paper, Assumption 1 is proposed instead of the consistency principle.
According to Assumption 1 above, the probability of each possible world should be considered when calculating the consensus affinity matrix. Then, the objective function is shown as follows:
$$\min \sum_{j=1}^{n} d_{ij}^{pw} s_{ij}^{pw} + \alpha \sum_{j=1}^{n} \left(s_{ij}^{pw}\right)^2 \quad \mathrm{s.t.} \quad S_i^{pw} = \left[s_{1i}^{pw}, s_{2i}^{pw}, \ldots, s_{ni}^{pw}\right]^T,\ \left(S_i^{pw}\right)^T \cdot \mathbf{1}_{n \times 1} = 1,\ 0 \le s_{ij}^{pw} \le 1$$
where $d_{ij}^{pw}$ is a distance function between $O_i^{pw}$ and $O_j^{pw}$, and $S_i^{pw} = \left[s_{1i}^{pw}, s_{2i}^{pw}, \ldots, s_{ni}^{pw}\right]^T$ is the normalized distance vector for one of the possible worlds ($pw$).
Moreover, let the number of effective (nonzero) entries of $S_i^{pw}$ be
$$t = \sum_{j=1}^{n} \mathrm{sgn}\left(s_{ij}^{pw}\right)$$
According to the conclusion of [22], $t$ can be adjusted through $\alpha$, and the optimization result is
$$s_{ij}^{pw} = \frac{1}{t} + \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{t} \hat{d}_{is}^{pw}}{t} - \hat{d}_{ij}^{pw}\right)$$
where $\hat{D}_i^{pw} = \left[\hat{d}_{1i}^{pw}, \hat{d}_{2i}^{pw}, \ldots, \hat{d}_{ni}^{pw}\right]^T$ is $D_i^{pw}$ reordered from smallest to largest.
According to the formulas above, extra information about the classes is required to confirm $t$. If there is no such information, it is set as $t = n$; that is:
$$s_{ij}^{pw} = \frac{1}{n} + \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} d_{is}^{pw}}{n} - d_{ij}^{pw}\right)$$
Finally, an optimized normalized distance matrix $S^*$ is needed for clustering the training set, which satisfies the following optimization model:
$$\min E\left(\left\|S - S^{pw}\right\|_F^2\right) \quad \mathrm{s.t.} \quad (S_i)^T \cdot \mathbf{1}_{n \times 1} = 1,\ 0 \le s_{ij} \le 1$$
where $S_i = [s_{1i}, s_{2i}, \ldots, s_{ni}]^T$ and $S = [S_1, S_2, \ldots, S_n]^T$.
According to the objective function (20),
$$E\left(\left\|S - S^{pw}\right\|_F^2\right) = E\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\left(s_{ij} - s_{ij}^{pw}\right)^2\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} E\left(s_{ij} - s_{ij}^{pw}\right)^2$$
On the other hand, according to (19), we have
$$s_{ij} - s_{ij}^{pw} = s_{ij} - \frac{1}{n} - \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} d_{is}^{pw}}{n} - d_{ij}^{pw}\right)$$
Therefore,
$$E\left(s_{ij} - s_{ij}^{pw}\right)^2 = E\left(s_{ij} - \frac{1}{n} - \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} d_{is}^{pw}}{n} - d_{ij}^{pw}\right)\right)^2$$
According to the properties of expectation and variance,
$$E(X^2) = E^2(X) + D(X), \quad E(aX + b) = aE(X) + b, \quad D(aX + b) = a^2 D(X),$$
Equation (23) can be reduced to:
$$E\left(s_{ij} - s_{ij}^{pw}\right)^2 = \left(s_{ij} - \frac{1}{n} - \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} E\left(d_{is}^{pw}\right)}{n} - E\left(d_{ij}^{pw}\right)\right)\right)^2 + \frac{1}{4\alpha^2} D\left(\frac{\sum_{s=1}^{n} d_{is}^{pw}}{n} - d_{ij}^{pw}\right)$$
Obviously, this is equivalent to the following optimization model:
$$\min \sum_{i=1}^{n}\sum_{j=1}^{n}\left(s_{ij} - \frac{1}{n} - \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} E\left(d_{is}^{pw}\right)}{n} - E\left(d_{ij}^{pw}\right)\right)\right)^2$$
The optimal solution of this model is obtained easily:
$$s_{ij} = \frac{1}{n} + \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} E\left(d_{is}^{pw}\right)}{n} - E\left(d_{ij}^{pw}\right)\right), \quad i = 1, 2, \ldots, n$$
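As an illustration, the closed-form solution above can be evaluated row by row from a matrix of expected distances. In the sketch below, the clipping to [0, 1] is an added safeguard for the constraint 0 ≤ s_ij ≤ 1 and is not part of the closed form itself.

import numpy as np

def affinity_from_expected_distances(D, alpha=1.0):
    # D[i, j] holds the expected distance E(d_ij); the affinity is
    # s_ij = 1/n + (mean_s E(d_is) - E(d_ij)) / (2 * alpha).
    n = D.shape[0]
    S = 1.0 / n + (D.mean(axis=1, keepdims=True) - D) / (2.0 * alpha)
    return np.clip(S, 0.0, 1.0)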
Now, another understanding of a possible world is presented. Let us review the definition of the possible world. The construction of an uncertain dataset and its PDF $f(pw)$ are known. Then, if the number of data objects is finite, which is assumed to be $\{O_i\}_{i=1}^{n}$, the edge probability density function (EPDF) for the $i$th object is:
$$f_i(O_i) = \int_{D_{pw}/D_i} f(pw)\, \mathrm{d}(pw / O_i)$$
Moreover, if the dimensions of $O_i$ ($i = 1, 2, \ldots, n$) are finite, which are assumed to be $\{o_{ij}\}_{j=1}^{n}$, the EPDF for the $j$th dimension of $O_i$ is:
$$f_{ij}(o_{ij}) = \int_{D_i/D_{ij}} f_i(O_i)\, \mathrm{d}(O_i / o_{ij})$$
Here, it is assumed that $\mathrm{distance}(O_i, O_j)$ is the distance between the random variables $O_i$ and $O_j$. Then, the consensus affinity matrix $S$ can be obtained according to the following formula:
$$\min \sum_{j=1}^{n} d_{ij} s_{ij} + \alpha \sum_{j=1}^{n} s_{ij}^2 \quad \mathrm{s.t.} \quad S_i = [s_{1i}, s_{2i}, \ldots, s_{ni}]^T,\ (S_i)^T \cdot \mathbf{1}_{n \times 1} = 1,\ 0 \le s_{ij} \le 1$$
where $d_{ij} = g(\mathrm{distance}(O_i, O_j))$ and $S = [S_1, S_2, \ldots, S_n]^T$.
Then, according to the analysis above, if there is no extra information about the classes, the optimal solution of the objective function (15) is:
$$s_{ij} = \frac{1}{n} + \frac{1}{2\alpha}\left(\frac{\sum_{s=1}^{n} d_{is}}{n} - d_{ij}\right)$$
Compared with (12), the distribution itself is used instead of the expectation of the point distance. Therefore, (12) is appropriate for possible worlds that include fewer and simpler random variables, while (16) is, in theory, appropriate for possible worlds with complex random variables.
So far, when the distance-based function is confirmed, the optimization consensus affinity matrix S for the all possible worlds can be worked out.
According to the calculations above, the closer two data objects are, the larger sij is. Therefore, the value of sij may have no use when sij < p (distance threshold). Then, the matrix S may need to be pruned to remove the meaningless sij. This pruning is divided into two steps: removing and normalization. In the removing step, the meaningless values are replaced by 0. In the normalization step, the meaningful value is recalculated to keep the equation:
$$\sum_{i=1}^{n} s_{ij} = 1, \quad j = 1, 2, \ldots, n$$
The following Algorithm 1 shows the processing of pruning:
Algorithm 1 for Matrix Pruning:
Input: the matrix S ∈ R^{n×n} and the pruning threshold p
The processing:
Removing step:
For i = 1 to n
  For j = 1 to n
    If s_ij < p
      s_ij = 0
    End if
  End for
End for
Normalization step:
For i = 1 to n
  sum_i = ∑_{j=1}^{n} s_ji
  For j = 1 to n
    s_ji = s_ji / sum_i
  End for
End for
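A compact NumPy version of Algorithm 1 is sketched below; the guard against all-zero columns is an added safeguard and is not part of the algorithm itself.

import numpy as np

def prune_affinity(S, p):
    # Removing step: zero out entries below the threshold p.
    S = np.where(S < p, 0.0, S)
    # Normalization step: rescale each column so that its entries sum to 1.
    col_sums = S.sum(axis=0)
    col_sums[col_sums == 0.0] = 1.0       # avoid division by zero
    return S / col_sums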
Moreover, in spectral analysis, if a nonnegative affinity matrix $S$ is given, the corresponding Laplacian matrix $L_s$ can be calculated as $L_s = D_s - \frac{S^T + S}{2}$, where $D_s$ is a diagonal matrix whose $i$th diagonal element is $\sum_{j=1}^{n} \frac{s_{ij} + s_{ji}}{2}$. The Laplacian matrix $L_s$ has the following important property [24].
Theorem 1.
Let S be a nonnegative affinity matrix; then, the multiplicity k of the eigenvalue 0 of the Laplacian matrix Ls is equal to the number of connected components in the graph associated with the affinity matrix S.
It is assumed that the eigenvalues of the Laplacian matrix $L_s$, denoted $\{\sigma_i\}_{i=1}^{n}$, are ordered from smallest to largest. According to the properties of the Laplacian matrix $L_s$, we have the following conclusion:
$$0 = \sigma_1 \le \sigma_2 \le \cdots \le \sigma_n$$
If the number of clusters $k$ is unknown, a threshold $Th$ is set to decide $k$, which satisfies:
$$\sigma_k \le Th \le \sigma_{k+1}$$
Finally, the eigenvectors of eigenvalues σ1 to σk comprise the matrix URn×k. The k-means clustering algorithm is used to cluster the row of matrix U. The clustering result is that of the training set. The Algorithm 2 for processing S is shown as follows.
Algorithm 2 for processing S:
Input: the matrix S ∈ R^{n×n} and the clustering threshold Th
The processing:
L_s = D_s − (S^T + S) / 2
{σ_i}_{i=1}^{n} is the set of eigenvalues of L_s, ordered as 0 = σ_1 ≤ σ_2 ≤ … ≤ σ_n
{υ_i}_{i=1}^{n} is the corresponding set of eigenvectors of L_s
If σ_r ≤ Th ≤ σ_{r+1}
  k = r
End if
U = [υ_1, υ_2, …, υ_k]
Cluster the rows of matrix U according to the k-means method. These are also the clustering results for the training set. Therefore, the clusters {C_i}_{i=1}^{k} and the numbers of cluster members {n_i}_{i=1}^{k} are obtained.
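A sketch of Algorithm 2 is given below, assuming NumPy and scikit-learn are available; the rule σ_k ≤ Th ≤ σ_{k+1} is implemented by counting the eigenvalues that do not exceed Th.

import numpy as np
from sklearn.cluster import KMeans

def cluster_from_affinity(S, th):
    # Build the Laplacian L_s = D_s - (S + S^T) / 2.
    W = (S + S.T) / 2.0
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)              # ascending eigenvalues
    k = max(int(np.searchsorted(eigvals, th, side='right')), 1)
    U = eigvecs[:, :k]                                # first k eigenvectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels, k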

4.4. Updating

After clustering the training set, the data in the test set should be put into the clusters determined above. Firstly, the test set is given as follows:
The test set: $O' = \{O_i\}_{i=n+1}^{n+p}$, where $O_i = [o_{1i}, o_{2i}, \ldots, o_{ni}]^T$ is a data object.
The clustering updating algorithm for the test set is divided into two steps: clustering and updating. The details are shown in the following Algorithm 3:
Algorithm 3 for Clustering Updating:
Input: the centers of the clusters {C_i}_{i=1}^{k} and the numbers of cluster members {n_i}_{i=1}^{k} of the training set, and the test set O′ = {O_i}_{i=n+1}^{n+p}.
The processing:
Clustering step:
{C′_i}_{i=1}^{k} = {C_i}_{i=1}^{k}
For i = n + 1 to n + p
  [d_ij]_{j=1}^{k}, d_ij = distance(O_i, C_j)
  [d′_ij]_{j=1}^{k}, d′_ij = distance(O_i, C′_j)
  cluster_i = argmin_j d_ij, cluster′_i = argmin_j d′_ij
  If cluster_i = cluster′_i
    O_i belongs to cluster_i
  Else if d_{i,cluster_i} / d′_{i,cluster_i} ≤ d_{i,cluster′_i} / d′_{i,cluster′_i}
    O_i belongs to cluster_i
  Else
    O_i belongs to cluster′_i
  End if
End for
Centers updating step:
For i = n + 1 to n + p
  If O_i belongs to cluster_i
    C′_{cluster_i} = (n_{cluster_i} · C′_{cluster_i} + O_i) / (n_{cluster_i} + 1)
    n_{cluster_i} = n_{cluster_i} + 1
  End if
End for
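The centers-updating idea can be sketched as follows. This simplified version keeps a single running set of centers and omits the consistency check against the original training centers, so it is an illustration of the incremental mean update rather than a full implementation of Algorithm 3; the distance argument is any callable, such as the KL-based distance above.

import numpy as np

def assign_and_update(O_test, centers, counts, distance):
    # Assign each test object to its nearest running center and move
    # that center toward the new member: C_j <- (n_j * C_j + O_i) / (n_j + 1).
    centers = [np.asarray(c, dtype=float) for c in centers]
    counts = list(counts)
    labels = []
    for o in O_test:
        o = np.asarray(o, dtype=float)
        j = int(np.argmin([distance(o, c) for c in centers]))
        labels.append(j)
        centers[j] = (counts[j] * centers[j] + o) / (counts[j] + 1)
        counts[j] += 1
    return labels, centers, counts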

5. Simulations

In this section, comparisons with three state-of-the-art uncertain data clustering algorithms are conducted on real benchmark datasets. Moreover, an uncertain dataset that obeys the multivariate Gaussian distribution is generated, and the parameters in the PWFEM model are discussed.
In the comparisons, six common real benchmark datasets from 'http://archive.ics.uci.edu/ml/' are employed for the simulation; their details are shown in Table 1.
These datasets were originally established as collections of data with determinate values. We then followed the method in [27] to generate uncertainty in these datasets; the generation method is shown in Algorithm 4:
Algorithm 4 The Generation Method from Numerical Data to Uncertain Data (Gaussian Type).
Input: the numerical data a = [a_1, a_2, …, a_n]^T and the standard deviation of each attribute [σ_1, σ_2, …, σ_n]
Output: the corresponding uncertain data ua = [ua_1, ua_2, …, ua_n]^T
For i = 1 to n
  x = random, 0 < x ≤ 1
  ua_i = (1 / (√(2π) · σ_i)) · exp(−(x − a_i)² / (2σ_i²))
End for
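Reading Algorithm 4 literally, the generation step could be sketched as follows; the per-attribute standard deviation σ_i and the uniform draw of x come from the algorithm, while the random number generator and its seeding are implementation choices.

import numpy as np

def to_uncertain_gaussian(a, sigmas, seed=None):
    # For each attribute a_i, draw x in (0, 1] and evaluate the
    # Gaussian density N(a_i, sigma_i^2) at x.
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    x = rng.uniform(0.0, 1.0, size=a.shape)
    return np.exp(-(x - a) ** 2 / (2.0 * sigmas ** 2)) / (np.sqrt(2.0 * np.pi) * sigmas)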

5.1. The Clustering Accuracy

In this part, two widely used evaluation metrics, accuracy (ACC) and normalized mutual information (NMI), are adopted to compare the different clustering algorithms. The proposed clustering algorithms, PWFEM-nd and PWFEM-pd, are compared with three state-of-the-art uncertain data clustering algorithms: UK-means [26], REP [27], and PWCLU. Each clustering algorithm was run 100 times, and the maximum, minimum, mean value, and variance of the ACC were calculated for each algorithm. The comparisons were simulated for two cases: in case 1, the real mean value and variance are known; in case 2, finite measurement results obeying the given PDF are used instead.
For the proposed model to be executed properly, the exact values of the expectation and covariance need to be known. However, the datasets used in this simulation do not provide those values. Therefore, approximate values were calculated according to the following formula:
$$E = \bar{X} \quad \text{and} \quad Cov = \mathrm{Cov}(X)$$
where $X = \{x_i\}_{i=1}^{n}$ is the dataset.
The comparisons of ACC for each algorithm in case 1 are shown in Table 2.
As shown in Table 2, in the datasets of wine and glass, the PWFEM-nd shows the best performance with maximum, minimum, and mean values. Unfortunately, it shows the worst performances with those values in the datasets of iris, Ecoli and PhishingData. As for the proposed PWFEM-pd, it shows the best performances with maximums in all datasets except wine and glass.
According to their respective algorithms, there may be plenty of reasons for the results above. Some analyses that have high probabilities are presented next.
Firstly, it is important to note that UK-means, REP, PWCLU, and PWFEM-nd use the mean value only. Therefore, their variances are zero, which means their clustering results never change over the 100 iterations. Only PWFEM-pd uses the variance of the uncertain data.
Secondly, UK-means clusters the dataset directly, while REP, PWCLU, PWFEM-nd, and PWFEM-pd cluster the dataset indirectly. Here, REP, PWCLU, PWFEM-nd, and PWFEM-pd use the model based on the possible world. Moreover, PWCLU uses the Euclidean distance (‖∙‖2), PWFEM-nd uses the cosine similarity, and PWFEM-pd uses the Kullback–Leibler divergence. Compared with PWCLU, PWFEM-nd combines the distributions of each component in a datum. Moreover, PWFEM-pd calculates the distance between distributions directly, while PWCLU and PWFEM-nd reduce the distributions to particular statistics (mean value and variance). Therefore, the clustering accuracy of PWFEM-pd may be higher than that of PWCLU in most cases. Moreover, PWFEM-pd can be regarded as clustering with different, randomly obtained covariances; if a covariance close to the true covariance is acquired, a high clustering accuracy is obtained.
For a clearer view of the changing of clustering accuracy with different covariances, see Figure 1.
As shown in Figure 1, the ACC of PWFEM-pd is sensitive to the covariance of the uncertain data. On the other hand, the impacts caused by covariances from different datasets lead to different results. Obviously, in Figure 1a,c,d,f, the ACC is highly dependent on the covariance. In Figure 1e, the ACC is divided into two parts: one is around 0.51 and the other is around 0.34, when different covariances are given. Moreover, in Figure 1b, the ACC is stable around 0.5 most times with the changing of covariance.
According to the analysis above, the ACC is sensitive to the covariance only for the proposed models, PWFEM-nd and PWFEM-pd. Next, variation of the mean values is added to the simulations. Therefore, the simulation results for case 2, which uses the generation method proposed at the beginning of this section, are given; the results are shown in Table 3 and Figure 2.
As shown in Table 3, when combining the maximum and minimum values, the clustering results of all clustering methods change. This means that all the clustering methods are sensitive to the mean value, and the sensitivity varies from method to method. Obviously, the fluctuation ranges of all clustering methods on Iris and Glass are the most drastic. On the other hand, the clustering accuracy of the PWFEM-pd algorithm is always higher than that of PWFEM-nd, but its stability is lower. Besides, compared with Table 2, the NMI values are lower than the ACC values for the same dataset, which means that in the clustering results of the model, the accuracy of each class is inconsistent: some categories have high precision and some have low precision.
For a clearer view of the changing of clustering accuracy with different covariances and mean values, see Figure 2.
As shown in Figure 2, the PWFEM-pd has a similar fluctuation as that shown in Figure 1. Unfortunately, this clustering method is sensitive to both mean value and covariance. Therefore, it is hard to distinguish the main reason. Next, the remaining four clustering methods are discussed.
Firstly, similar to the conclusion drawn from Table 2, all clustering methods show drastic fluctuation in Figure 2c,d. UK-means and CK-means show drastic fluctuation in Figure 2a and are stable in Figure 2b,e,f. PWCLU is stable in Figure 2a,b,e,f. PWFEM-nd is stable in Figure 2a,b,f, while it is stable at two ranges in Figure 2e.
According to the analysis above, the situations of the proposed methods are clearer. However, the variation tendency with the mean value and covariance are not clear. Therefore, a specific dataset was generated to investigate the above issues.

5.2. The Simulation with a Specific Dataset

In this part, a specific dataset is generated to analyze the impacts of mean value and covariance. The generated dataset consisted of two dimensions, and the number of data points was set at 1000. It was divided into three clusters, whose centers were [0, 0], [100, 0], and [0, 100]. The distance between the datum and its center was randomly distributed in [0, r]. The variance for each dimension was σi(i = 1, 2). Moreover, it was set as σ1 = σ2 = σ. The correlation coefficient of these two dimensions was ρ. Therefore, the covariance of this dataset was:
$$\begin{bmatrix} \sigma & \rho\sigma \\ \rho\sigma & \sigma \end{bmatrix}$$
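A sketch of a generator for this dataset is given below; how the points are spread within radius r of their centers (uniform in radius and angle) is an assumption, since the text only states that the distance between a datum and its center is randomly distributed in [0, r].

import numpy as np

def make_synthetic(n=1000, r=20.0, sigma=2.0, rho=0.0, seed=0):
    # Three clusters centered at [0, 0], [100, 0], and [0, 100]; each point
    # lies within radius r of its center; sigma and rho define the covariance.
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
    labels = rng.integers(0, 3, size=n)
    radii = rng.uniform(0.0, r, size=n)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    offsets = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    X = centers[labels] + offsets
    cov = np.array([[sigma, rho * sigma], [rho * sigma, sigma]])
    return X, labels, cov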
Next, the parameters r, σ, and ρ are discussed.
In this simulation, σ = 2, ρ = 0, and r ranged from 1 to 100. As shown in Figure 3, the ACCs of all methods were 1 until r reached about 50 and then decreased with increasing r. These simulation results are in accordance with common sense.
On the other hand, if ρ = 0 and r is fixed, σ can vary the distance between the data evenly. Therefore, it cannot affect the clustering results, and the simulation proves it. Because the ACC curves of all methods are lines parallel to the X-axis, the figure was omitted.
Finally, the simulation for ρ is discussed with σ = 2, r = 20, 40, 60, and 80, and −1 < ρ < 1. The simulation results are shown in Figure 4.
As shown in Figure 4, when r < 50, the ACCs are stable for all methods with −1 < ρ < 1. This is because the cluster structure is prominent in this condition, so the effect of ρ on the clustering result is weak. Moreover, when r > 50, the ACCs of UK-means, CK-means, PWCLU, and PWFEM-nd show significant changes in the intervals (−1, −0.7) and (0.7, 1). In these two intervals, ρ makes the data points even messier; therefore, the ACCs of the clustering results decrease if the data points are not processed. On the other hand, the ACCs become stable when −0.7 < ρ < 0.7; in this range, the effect of ρ on the clustering results is weak.

6. Conclusions

In this paper, a possible world-based fusion estimation model for uncertain data is proposed. It includes two methods, PWFEM-nd and PWFEM-pd. PWFEM-nd takes a data perspective and uses a bottom-up method to cluster the data, whereas PWFEM-pd clusters the uncertain data directly. Both methods rely on the probability density distribution of the uncertain data. We performed simulations and confirmed that the proposed methods show better performance in terms of clustering accuracy; this accuracy is highly dependent on the accuracy of the covariance.
The discussion in the last section is incomplete; the problem obviously becomes more complex as the dimension increases, and only some simple conclusions are given by the simulations. In addition, the exact covariance is usually not available in actual scenarios. In any case, the proposed methods provide a new way to treat uncertain data clustering, and the issues mentioned above will be addressed in future work.

Author Contributions

C.L.: Manuscript writing and data analysis. Z.Z.: Algorithm research and design, manuscript revising. W.W.: Algorithm for discussion. H.-C.C.: Algorithm for discussion. X.L.: The data collection and manuscript revising. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (grant number 2018YFC0831304) and the National Natural Science Foundation of China (grant number 61772064).

Institutional Review Board Statement

The study did not require ethical approval.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are openly available in “http://archive.ics.uci.edu/ml/index.php”, reference numbers are [25,26].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Delias, P.; Doumpos, M.; Grigoroudis, E.; Manolitzas, P.; Matsatsinis, N. Supporting healthcare management decisions via robust clustering of event logs. Knowl. Based Syst. 2015, 84, 203–213.
  2. Gaglio, S.; Re, G.L.; Morana, M. Human Activity Recognition Process Using 3-D Posture Data. IEEE Trans. Hum. Mach. Syst. 2017, 45, 586–597.
  3. Waller, L.A.; Turnbull, B.W.; Clark, L.C.; Nasca, P. Chronic disease surveillance and testing of clustering of disease and exposure: Application to leukemia incidence and TCE-contaminated dumpsites in upstate New York. Environmetrics 1992, 3, 281–300.
  4. Matthews, G.; Warm, J.S.; Shaw, T.H.; Finomore, V.S. Predicting battlefield vigilance: A multivariate approach to assessment of attentional resources. Ergonomics 2014, 57, 856–875.
  5. Sun, W.; Yuan, D.; Ström, E.G.; Brännström, F. Cluster-Based Radio Resource Management for D2D-Supported Safety-Critical V2X Communications. IEEE Trans. Wirel. Commun. 2015, 15, 1.
  6. Zagouras, A.; Pedro, H.T.C.; Coimbra, C.F.M. Clustering the solar resource for grid management in island mode. Sol. Energy 2014, 110, 507–518.
  7. Li, M.; Xu, D.; Zhang, D.; Zou, J. The seeding algorithms for spherical k-means clustering. J. Glob. Optim. 2019, 76, 695–708.
  8. Lu, H.; Zhang, R.; Li, S.; Li, X. Spectral Segmentation via Midlevel Cues Integrating Geodesic and Intensity. IEEE Trans. Cybern. 2013, 43, 2170–2178.
  9. Sokoloski, S. Implementing a Bayes Filter in a Neural Circuit: The Case of Unknown Stimulus Dynamics. Neural Comput. 2017, 29, 2450–2490.
  10. Sinopoli, B.; Schenato, L.; Franceschetti, M.; Poolla, K.; Jordan, M.I.; Sastry, S.S. Kalman filtering with intermittent observations. IEEE Trans. Autom. Control 2004, 49, 1453–1464.
  11. Zhang, W.; Zhang, Z.; Zeadally, S.; Chao, H.C.; Leung, V. CMASM: A Multiple-algorithm Service Model for Energy-delay Optimization in Edge Artificial Intelligence. IEEE Trans. Ind. Inform. 2019, 15, 4216–4224.
  12. Zhang, W.; Zhang, Z.; Wang, L.; Chao, H.C.; Zhou, Z. Extreme learning machines with expectation kernels. Pattern Recognit. 2019, 96, 1–13.
  13. Chau, M.; Cheng, R.; Kao, B.; Ng, J. Uncertain data mining: An example in clustering location data. In Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Singapore, 9–12 April 2006.
  14. Kriegel, H.P.; Pfeifle, M. Hierarchical density-based clustering of uncertain data. In Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, USA, 27–30 November 2005.
  15. Volk, P.B.; Rosenthal, F.; Hahmann, M.; Habich, D.; Lehner, W. Clustering Uncertain Data with Possible Worlds. In Proceedings of the 25th International Conference on Data Engineering, Shanghai, China, 29 March–2 April 2009.
  16. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  17. Garcia, E.; Hausotte, T.; Amthor, A. Bayes filter for dynamic coordinate measurements: Accuracy improvement, data fusion and measurement uncertainty evaluation. Meas. J. Int. Meas. Confed. 2013, 46, 3737–3744.
  18. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
  19. Costa, P.J. Adaptive model architecture and extended Kalman-Bucy filters. IEEE Trans. Aerosp. Electron. Syst. 1994, 30, 525–533.
  20. Julier, S.J.; Uhlmann, J.K.; Durrant-Whyte, H.F. A New Approach for Filtering Nonlinear Systems. In Proceedings of the American Control Conference, Seattle, WA, USA, 21–23 June 1995.
  21. Haykin, S. Kalman Filtering and Neural Networks; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2001.
  22. Liu, H.; Zhang, X.; Zhang, X. Possible world-based consistency learning model for clustering and classifying uncertain data. Neural Netw. 2018, 102, 48–66.
  23. Sinkkonen, J.; Kaski, S. Clustering Based on Conditional Distributions in an Auxiliary Space. Neural Comput. 2014, 14, 217–239.
  24. Luxburg, U.V. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416.
  25. Dua, D.; Graff, C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019.
  26. Abdelhamid, N.; Ayesh, A.; Thabtah, F. Phishing detection based Associative Classification data mining. Expert Syst. Appl. 2014, 41, 5948–5959.
  27. Gullo, F.; Tagarelli, A. Uncertain centroid based partitional clustering of uncertain data. Proc. VLDB Endow. 2012, 5, 610–621.
Figure 1. ACC with different clustering algorithms for 100 iterations in case 1.
Figure 2. NMI with different clustering algorithms for 100 iterations in case 2.
Figure 3. Change of ACCs with r values from 1 to 100.
Figure 4. Change of ACCs with ρ from −1 to 1.
Table 1. Details of the adopted datasets [25].

Dataset             Objects   Attributes   Classes
Iris                150       4            3
Wine                178       13           3
Glass               214       9            6
Ecoli               327       7            5
Waveform            5000      21           3
PhishingData [26]   1353      9            3
Table 2. Accuracy (ACC) for each algorithm in case 1.

Dataset        Metric     UK-Means   REP      PWCLU    PWFEM-nd   PWFEM-pd
Iris           Max        0.8800     0.8133   0.8133   0.8133     0.8533
Iris           Min        0.5533     0.5533   0.5400   0.4867     0.5200
Iris           Mean       0.7244     0.6994   0.6869   0.7181     0.7602
Iris           Variance   0.0022     0.0021   0.0016   0.0017     0.0028
Wine           Max        0.7022     0.7022   0.5730   0.7079     0.9607
Wine           Min        0.7022     0.7022   0.5730   0.6966     0.3202
Wine           Mean       0.7022     0.7022   0.5730   0.6989     0.8999
Wine           Variance   0          0        0        0          0.0173
Glass          Max        0.8333     0.7619   0.8618   0.7905     0.9286
Glass          Min        0.2476     0.6000   0.2571   0.2286     0.3095
Glass          Mean       0.7239     0.7078   0.7588   0.6489     0.7818
Glass          Variance   0.0191     0.0010   0.0204   0.0173     0.0537
Ecoli          Max        0.5327     0.4953   0.5374   0.5234     0.5421
Ecoli          Min        0.3458     0.2056   0.4065   0.3318     0.4299
Ecoli          Mean       0.4422     0.4025   0.4905   0.4527     0.4634
Ecoli          Variance   0.0012     0.0035   0.0011   0.0014     0.0009
Waveform       Max        0.5291     0.4006   0.7003   0.5199     0.7156
Waveform       Min        0.3180     0.2324   0.3945   0.4006     0.4801
Waveform       Mean       0.4350     0.3177   0.5403   0.4445     0.5706
Waveform       Variance   0.0014     0.0013   0.0025   0.0006     0.0038
PhishingData   Max        0.5639     0.4560   0.5647   0.5188     0.6061
PhishingData   Min        0.4664     0.3585   0.4568   0.4508     0.4797
PhishingData   Mean       0.5183     0.4218   0.5027   0.4910     0.5719
PhishingData   Variance   0.0005     0.0004   0.0004   0.0002     0.0010
Table 3. NMI for each algorithm.

Dataset        Metric     UK-Means   REP      PWCLU    PWFEM-nd   PWFEM-pd
Iris           Max        0.7854     0.6809   0.6716   0.6700     0.7396
Iris           Min        0.2694     0.3898   0.2871   0.2213     0.3162
Iris           Mean       0.5374     0.5245   0.4834   0.5295     0.5927
Iris           Variance   0.0050     0.0027   0.0031   0.0033     0.0054
Wine           Max        0.4946     0.4946   0.3184   0.5389     0.9551
Wine           Min        0.4946     0.4946   0.3184   0.5136     0.3146
Wine           Mean       0.4946     0.4946   0.3184   0.5209     0.8803
Wine           Variance   0          0        0        0          0.0198
Glass          Max        0.7001     0.6171   0.7522   0.6288     0.8671
Glass          Min        0.0997     0.4028   0.1643   0.0320     0.2233
Glass          Mean       0.5511     0.5250   0.6094   0.4258     0.6911
Glass          Variance   0.0196     0.0019   0.0223   0.0204     0.0672
Ecoli          Max        0.6544     0.6544   0.7064   0.7125     0.7309
Ecoli          Min        0.3731     0.3731   0.5199   0.4679     0.3639
Ecoli          Mean       0.4988     0.4988   0.6354   0.5629     0.5569
Ecoli          Variance   0.0050     0.0050   0.0034   0.0054     0.0079
Waveform       Max        0.3247     0.2548   0.4645   0.3104     0.4895
Waveform       Min        0.1195     0.1286   0.1282   0.1919     0.2545
Waveform       Mean       0.2112     0.1909   0.3244   0.2392     0.3558
Waveform       Variance   0.0017     0.0008   0.0022   0.0005     0.0025
PhishingData   Max        0.2517     0.1636   0.2416   0.2200     0.3190
PhishingData   Min        0.1559     0.0594   0.1452   0.1313     0.1804
PhishingData   Mean       0.2088     0.1050   0.1880   0.1760     0.2803
PhishingData   Variance   0.0004     0.0004   0.0005   0.0003     0.0008
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
