In a nonparametric way, we first partition all points into two groups, boundary and interior points, which are used to assess the boundary matching degree and the connectivity degree. By integrating these two quantities, a novel clustering evaluation index can be formed.
3.1. Boundary Matching and Connectivity
In existing clustering analysis, the density of a point is computed by counting the number of points in the point's neighborhood with a specified radius. However, the computed density takes only discrete integer values such as 1, 2, …, so many points share the same density and are thus indistinguishable. Moreover, the defined density may be greatly affected by the specified radius.
In this study, we first define a nonparametric density to find all boundary and interior points in any dataset. Assume X = {x1, x2, …, xn} is a dataset in a D-dimensional space RD. For any data point xk ∈ X, its m nearest neighbors are denoted as xk,1, xk,2, …, xk,m, with distances d(xk, xk,1), d(xk, xk,2), …, d(xk, xk,m), where m is the integer part of 2Dπ and k = 1, 2, …, n. Here, 2D reflects that one interval in each dimension of RD is measured by its two endpoints, and π is a conversion coefficient arising when the m points are enclosed by a spherical neighborhood, as used in existing density computations. Therefore, the density of any data point xk in X is defined as in (10).
Definition 1. Boundary point and interior point. A point is called a boundary point or an interior point if half of its m nearest neighbors have a higher or a lower (or equal) density than its own density, respectively.
The proposed notions of boundary and interior points have the following two characteristics.
(1) Certainty. Unlike the density used in other existing algorithms, the proposed density is fixed and unique for any point, which reduces the uncertainty in the clustering process. In fact, the clustering results of other algorithms may change greatly as the number of neighbors used for computing the density increases or decreases. Note that effectively estimating the number of neighbors has long been a difficult task, and it remains unsolved so far [35].
(2) Locality. Whether a point is a boundary or interior point is determined only by its m nearest neighbors, so the separation of boundary and interior points exhibits local characteristics. In contrast, DBSCAN uses a global density to distinguish boundary from interior points, which, in a density-skewed dataset, can even cause an entire cluster to be regarded entirely as boundary points (see Figure 4). Therefore, the proposed density is a more reasonable local notion.
Figure 4a shows a density-skewed dataset with three clusters of large, medium, and small density, respectively. The red points in Figure 4a,b represent the boundary points computed by DBSCAN and by (10), respectively. DBSCAN finds no interior points in the cluster with the lowest density; in comparison, owing to the locality of (10), the red boundary points and blue interior points are distributed more reasonably: the interior points are located at the center of each cluster and surrounded by boundary points, while the boundary points outline the shape and structure of each cluster. In particular, after the boundary points are removed from the dataset, the separation of the clusters is greatly enhanced (see Figure 4c). Therefore, the real number of clusters can be determined more easily by any clustering algorithm.
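To make the construction concrete, the following sketch implements the density computation and the boundary/interior split of Definition 1. Since Eq. (10) is not reproduced in this excerpt, the density is assumed here, purely for illustration, to be the reciprocal of the mean distance to the m nearest neighbors; the helper names `knn_density` and `split_boundary_interior` are our own.

```python
import numpy as np

def knn_density(X, m):
    """Density of each point, assumed here to be the reciprocal of the
    mean distance to its m nearest neighbors (a stand-in for Eq. (10),
    whose exact form is not reproduced in the text)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(D, axis=1)[:, 1:m + 1]   # skip column 0 (the point itself)
    mean_d = np.take_along_axis(D, nn, axis=1).mean(axis=1)
    return 1.0 / mean_d, nn

def split_boundary_interior(X):
    """Definition 1: a point is a boundary point if at least half of its
    m nearest neighbors have a higher density; otherwise it is interior.
    Returns (boundary_flags, densities)."""
    m = int(2 * X.shape[1] * np.pi)          # m = integer part of 2*D*pi
    rho, nn = knn_density(X, m)
    n_higher = (rho[nn] > rho[:, None]).sum(axis=1)
    return n_higher >= m / 2.0, rho
```

Under this assumed density, the point of globally maximal density is always classified as interior and the point of minimal density as boundary, which matches the intuition behind Definition 1.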
In graph theory, a cluster is defined as a group of points that connect to each other [36,37]. To assess the connectivity of the points in a dataset X, we calculate the density of all points in X and sort them in order of increasing density. Assume xmax is the point with the highest density in X; a connecting rule among points is then defined as follows. For any point xk ∈ X, the next point xk+1 is the nearest neighbor of xk that has a higher density than xk. Subsequently, this step is repeated until the point xmax is reached.
Definition 2. Chain. A chain is a subset of X that starts with any data point xi in X and stops at xmax based on the above connecting rule.
There is a unique chain from any point in X, since the nearest neighbor of each point is unique. The above steps are repeated until every point in X has been visited. Consequently, the n points in X correspond to n chains, denoted as S1, S2, …, Sn. The largest distance between adjacent points in the t-th chain St is denoted as dis(St), t = 1, 2, …, n.
Figure 5 shows all chains in two datasets with different characteristics, where the arrows point from low-density to high-density points and the green dotted circles denote the points with maximum densities. Figure 5 shows that dis(St) is small when a chain is entirely contained in one cluster but becomes abnormally large when a chain bridges two clusters. In view of the notion of a chain, we further define the connectivity of X as follows.
Definition 3. Connectivity. Let S1, S2, …, Sn be the n chains in X; then the connectivity of all points in X is defined as in (11).
In line with graph theory, the value of con(X) indicates the degree of compactness of a dataset. It can reflect whether a chain is contained in a cluster, as explained and illustrated in the next section. In this paper, we use the notions of boundary matching degree and connectivity degree to assess the clustering results obtained by any clustering algorithm.
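The connecting rule and the quantities dis(St) can be sketched as follows. Since the exact form of Eq. (11) is not shown in this excerpt, con(X) is taken here, as an assumption, to be the mean of the dis(St) values over all n chains; the function names are our own.

```python
import numpy as np

def build_chain(start, D, rho):
    """Connecting rule: from each point, step to its nearest neighbor
    among the points of strictly higher density, stopping at xmax."""
    chain = [start]
    k = start
    while rho[k] < rho.max():
        denser = np.where(rho > rho[k])[0]
        k = denser[np.argmin(D[k, denser])]   # nearest denser point
        chain.append(k)
    return chain

def chain_max_gaps(X, rho):
    """dis(S_t): the largest distance between adjacent points of the
    chain started at each of the n points."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    gaps = []
    for t in range(len(X)):
        chain = build_chain(t, D, rho)
        gaps.append(max((D[a, b] for a, b in zip(chain, chain[1:])),
                        default=0.0))         # xmax itself has a trivial chain
    return np.array(gaps)

def con(X, rho):
    """Connectivity of X, taken here as the mean of dis(S_t) over all
    chains (a stand-in for Eq. (11), whose exact form may differ)."""
    return float(chain_max_gaps(X, rho).mean())
```

On a dataset of two well-separated groups, chains starting in the group that does not contain xmax must bridge the gap between the groups, so their dis(St) values are abnormally large, as described for Figure 5.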
3.2. Clustering Evaluation Based on Boundary and Interior Points
Once a clustering algorithm has partitioned a dataset X into c disjoint clusters, i.e., X = C1 ∪ C2 ∪ … ∪ Cc, we substitute X by C1, C2, …, Cc, respectively, and find their boundary points using (10). The set of boundary points in Ck is denoted as BCk, while the set of boundary points in X is BX. A boundary-point-matching index is defined as in (12).
Equation (12) measures the matching degree of the boundary points between the entire dataset X and the c disjoint clusters. Mathematically, it is clear that the value of bou(c) must fall in the interval [0, 1]. The following example illustrates the cases in which it is smaller than or equal to 1.
Figure 6a–c show the boundary and interior points determined by (10) when the C-means algorithm is performed with c smaller than, equal to, and larger than 6, respectively, where the interior points are the points remaining after the boundary points are removed. The red boundary points at c = 3 and c = 6 are similar to those in X, but the boundary points at c = 10 differ from those in X.
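A minimal sketch of the boundary-point-matching index follows, assuming (12) to be the fraction of the clusters' boundary points that also belong to BX; the exact formula is not reproduced in this excerpt. This assumed form equals 1 when the clusters' boundary points coincide with those of X and drops below 1 when over-partitioning creates extra boundary points, matching the behavior described here.

```python
def bou(boundary_X, boundary_clusters):
    """Boundary-point-matching index: assumed here to be the fraction of
    the clusters' boundary points that are also boundary points of the
    whole dataset (a stand-in for Eq. (12)).

    boundary_X        -- set of indices of points in B_X
    boundary_clusters -- list of sets B_C1, ..., B_Cc
    """
    union = set().union(*boundary_clusters)
    return len(boundary_X & union) / len(union)
```

For example, if BX = {0, 1, 2} and splitting a cluster adds the spurious boundary point 3, the index falls to 3/4.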
Figure 6d shows that the values of bou(c) are nearly unchanged when c < 6 but decrease quickly when c > 6. When the number of clusters c is smaller than the actual one, usually no single cluster is assigned two cluster centers. Therefore, the set of boundary points of all partitioned clusters is consistent with that of the entire set X, and bou(c) = 1. When c is larger than the actual number of clusters, there is at least one cluster whose number of boundary points increases; thus, bou(c) < 1. It can be seen that the values of bou(c) are helpful in finding the real number of clusters for any clustering algorithm. Alternatively, we can regard any cluster in
C1, C2, …, Cc as an independent dataset like X; accordingly, xmax in X becomes the point with maximal density in each of C1, C2, …, Cc, respectively. Thereby, we assess the connectivity among points according to (11) for c = 1, 2, …, cmax. The connectivity of X is thus reduced to the connectivity of each cluster. As c increases, the connectivity is enhanced, since the number of maximal inter-cluster distances in the chains decreases.
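The per-cluster reduction of (11) can be sketched as follows, with each cluster treated as an independent dataset that has its own density maximum; the aggregation over clusters (here a mean) and the function names are our assumptions.

```python
import numpy as np

def cluster_connectivity(P, rho):
    """Mean over a cluster's points of dis(S_t), where each chain steps
    to the nearest strictly denser point within the cluster until the
    cluster's own density maximum is reached (per-cluster analogue of (11))."""
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    gaps = []
    for k in range(len(P)):
        g = 0.0
        while rho[k] < rho.max():
            denser = np.where(rho > rho[k])[0]
            j = denser[np.argmin(D[k, denser])]
            g = max(g, D[k, j])
            k = j
        gaps.append(g)
    return float(np.mean(gaps))

def con_of_partition(X, rho, labels):
    """con(c): average the per-cluster connectivities over the c clusters
    (the exact aggregation in the paper may differ)."""
    return float(np.mean([cluster_connectivity(X[labels == c], rho[labels == c])
                          for c in np.unique(labels)]))
```

Splitting two well-separated groups into their own clusters removes the bridging chain segments, so con decreases, consistent with the enhancement of connectivity described above.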
Figure 7a–c show the connectivity calculated using (11) and C-means when c equals 3, 6, and 10, respectively. This value becomes smaller when c < 6 but tends to be flat when c > 6, as shown in Figure 7d. Consequently, there is an inflection point on the curve in Figure 7d.
As c increases, the curves calculated according to (11) and (12) vary in opposite tendencies. It is expected that the real number of clusters is encountered at c*, where the curve of (11) turns from fast-changing to flat and that of (12) turns from slow-changing to fast-varying.
Considering that the variations of bou(c) and con(c) can be measured mathematically by the curvature radius, we define a novel validity index based on bou(c) for the boundary points and con(c) for the interior points, respectively. By combining (11) and (12), we define the function F(c) in (13),
where the symbol Δ denotes a second-order difference operator applied to bou(c) and con(c), aiming to locate the maximal inflection points on the curves of bou(c) and con(c), respectively. The optimal number of clusters c* for any dataset is computed as the maximizer of (13).
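The selection of c* can be sketched as follows. Since Eq. (13) is not reproduced in this excerpt, F(c) is assumed here to combine the second-order differences as Δ²con(c) − Δ²bou(c), which peaks at the knee where con flattens and bou starts to fall; this is only one plausible reading of (13), and the function name is our own.

```python
import numpy as np

def optimal_c(bou_vals, con_vals, c_values):
    """Pick c* as the argmax of F(c), with F(c) assumed to be
    delta2(con)(c) - delta2(bou)(c) (one plausible reading of Eq. (13)).
    The second-order difference needs both neighbours, so the endpoints
    of the c range are excluded."""
    b = np.asarray(bou_vals, dtype=float)
    v = np.asarray(con_vals, dtype=float)
    d2b = b[:-2] - 2 * b[1:-1] + b[2:]   # second-order difference of bou(c)
    d2v = v[:-2] - 2 * v[1:-1] + v[2:]   # second-order difference of con(c)
    F = d2v - d2b
    return c_values[1 + int(np.argmax(F))]
```

On curves shaped like those in Figures 6d and 7d (bou flat then falling, con falling then flat, with the knee at the true cluster count), this assumed F attains its maximum at the knee.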
The proposed validity index has the following characteristics.
- (1)
Complementarity. The mathematical curvature and difference can reflect the varying tendency of the curves in (13), and thereby the real number of clusters c* can be found. When c < c*, R1(c) is nearly equal to R1(c*), since the set of boundary points is approximately unchanged, but R2(c) < R2(c*), since the number of center points successively increases; in sum, F(c) < F(c*). Conversely, when c > c*, R1(c) successively decreases while R2(c) tends to be flat; therefore, F(c) < F(c*). Consequently, (13) attains its maximum at c = c*.
- (2)
Monotonicity. Assume that c* is the real number of clusters. When c takes two values c1 and c2 satisfying c1 < c2 < c*, then F(c1) < F(c2) < F(c*); conversely, when c* < c1 < c2, F(c*) > F(c1) > F(c2). Hence, F(c) consists of two monotone branches on the two sides of c*. Therefore, for any two values c1 and c2 with F(c1) < F(c2), the clustering result at c2 can be regarded as the better one; usually, a larger value of F(c) indicates a better clustering result.
- (3)
Generalization. Equation (13) is applicable to any clustering algorithm, provided only that the clustering results are available and the corresponding numbers of clusters are taken as the variable of F(c). In particular, a group of clustering results may come from different clustering algorithms and parameter settings, since any two clustering results are comparable according to the above monotonicity; for example, one clustering result with c1 may come from C-means and the others from DBSCAN, and so on. Equation (13) can thus evaluate the results of any clustering algorithm and parameter in a trial-and-error way. In comparison, the existing validity indices mainly work for a specific algorithm and for the number-of-clusters parameter, since the cluster centers in them have to be defined, especially for the C-means algorithm.
Hereafter, the cluster validity index of (13) based on boundary and interior points is called CVIBI. The evaluation process for any clustering result based on CVIBI is listed in Algorithm 2.
Algorithm 2. Evaluating Process Based on CVIBI
Input: a dataset X ⊂ RD containing n points and clustering results from any clustering algorithm at c = 1, 2, …, cmax.
Output: the suggested number of clusters.
Steps:
1. Calculate the density of each point in X according to (10);
2. Partition X into boundary and interior points;
3. Input the clustering results at c = 1, 2, …, cmax;
4. Partition each cluster into boundary and interior points;
5. Compute the values of bou(c) and con(c) at c = 1, 2, …, cmax;
6. Solve for the optimal value of (13);
7. Suggest the optimal number of clusters;
8. Stop.