Abstract
OPTICS is a state-of-the-art algorithm for visualizing the density-based clustering structure of multi-dimensional datasets. However, OPTICS requires iterative distance computations for all objects and thus runs in O(n²) time, making it unsuitable for massive datasets. In this paper, we propose constrained OPTICS (C-OPTICS) to quickly create density-based clustering structures that are identical to those produced by OPTICS. C-OPTICS uses a bi-directional graph structure, which we refer to as the constraint graph, to reduce unnecessary distance computations of OPTICS. Thus, C-OPTICS achieves a short running time for creating density-based clustering structures. Experimental evaluations with synthetic and real datasets show that C-OPTICS significantly improves the running time in comparison to existing algorithms, such as OPTICS, DeLi-Clu, and Speedy OPTICS (SOPTICS), while guaranteeing the quality of the density-based clustering structures.
1. Introduction
Clustering is a data mining technique that groups data objects based on similarity [1]. The groups can provide important insights for a broad range of applications [2,3,4,5,6,7,8], such as superpixel segmentation for image clustering [2,3], brain cancer detection [4], wireless sensor networks [5,6], pattern recognition [7,8], and others. Clustering algorithms can be classified into centroid-, hierarchy-, model-, graph-, density-, and grid-based algorithms [9]. Many algorithms address various clustering issues, including scalability, noise handling, dealing with multi-dimensional datasets, the ability to discover clusters with arbitrary shapes, and minimal dependency on domain knowledge for determining certain input parameters [10].
Among clustering algorithms, density-based clustering algorithms can discover arbitrarily shaped clusters and noise in datasets. Furthermore, density-based clustering algorithms do not require the number of clusters as an input parameter. Instead, clusters are defined as dense regions separated by sparse regions and are formed by growing along the inter-connectivity between objects. Density-based spatial clustering of applications with noise (DBSCAN) [11] is a well-known density-based clustering algorithm. To define the dense regions that serve as clusters, DBSCAN requires two parameters: ε, which represents the radius of the neighborhood of an observed object, and MinPts, which is the minimum number of objects in the ε-neighborhood of an observed object. Let D be a set of multi-dimensional objects and let the set of ε-neighbors of an object p be N_ε(p). Here, DBSCAN implements two rules:
- An object p is an ε-core object if |N_ε(p)| ≥ MinPts;
- If p is an ε-core object, all objects in N_ε(p) should appear in the same cluster as p.
The process of DBSCAN is simple. First, an arbitrary ε-core object is added to an empty cluster. Second, the cluster grows as follows: for every ε-core object p in the cluster, all objects of N_ε(p) are added to the cluster. This process is repeated until the size of the cluster no longer increases. However, DBSCAN cannot easily select appropriate input parameters to form suitable clusters because the input parameters depend on prior knowledge, such as the distribution of objects and the ranges of the dataset. Moreover, DBSCAN cannot find clusters of differing densities. Figure 1 demonstrates this limitation of DBSCAN in a two-dimensional dataset for a fixed MinPts. With a small radius ε₁, a certain object is an ε-core object and forms a cluster containing its ε₁-neighbors, while some other objects remain noise objects. Here, a noise object is an object that is not included in any cluster; in other words, the objects that cannot reach ε-core objects in the clusters are defined as noise objects. With a larger radius ε₂, another object becomes an ε-core object and forms a second cluster, yet at least one object still remains a noise object. As shown in the example in Figure 1, input parameter selection in DBSCAN is problematic.
Figure 1.
Two-dimensional clustering example demonstrating the problem of density-based spatial clustering of applications with noise (DBSCAN) for the selection of the input parameters ε and MinPts.
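As a brief illustration of the two DBSCAN rules above, the following sketch tests whether an object is an ε-core object and expands a cluster from it. It is only a minimal sketch under simple assumptions (an in-memory list of points, Euclidean distance, linear scans); the class and method names such as `DbscanSketch` and `epsNeighbors` are our own and not taken from the paper.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the two DBSCAN rules; all names are illustrative.
final class DbscanSketch {
    static double dist(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // N_eps(p): all objects within radius eps of p (linear scan for clarity).
    static List<double[]> epsNeighbors(List<double[]> data, double[] p, double eps) {
        List<double[]> result = new ArrayList<>();
        for (double[] q : data) {
            if (dist(p, q) <= eps) {
                result.add(q);
            }
        }
        return result;
    }

    // Rule 1: p is an eps-core object if |N_eps(p)| >= minPts.
    static boolean isCore(List<double[]> data, double[] p, double eps, int minPts) {
        return epsNeighbors(data, p, eps).size() >= minPts;
    }

    // Rule 2: grow a cluster from a core object by repeatedly adding the
    // eps-neighborhoods of every core object already in the cluster.
    static List<double[]> expandCluster(List<double[]> data, double[] seed, double eps, int minPts) {
        List<double[]> cluster = new ArrayList<>();
        Deque<double[]> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            double[] p = frontier.poll();
            if (cluster.contains(p)) {
                continue;
            }
            cluster.add(p);
            if (isCore(data, p, eps, minPts)) {
                frontier.addAll(epsNeighbors(data, p, eps));
            }
        }
        return cluster;
    }
}
```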
To address this disadvantage of DBSCAN, a method for ordering points to identify the clustering structure, called OPTICS [12], was proposed. Like DBSCAN, OPTICS requires two input parameters, ε and MinPts, and it finds clusters of differing densities by creating a reachability plot. Here, the reachability plot represents an ordering of the dataset with respect to its density-based clustering structure. To create the reachability plot, OPTICS forms a linear order of the objects in which spatially close objects become neighbors [13]. Figure 2 shows the reachability plot for a sample dataset D under given values of ε and MinPts. The horizontal axis of the reachability plot enumerates the objects in the linear order, while the vertical bars display the reachability distance (see Definition 2), which is the minimum distance at which an object can be included in a cluster. The reachability distances of some objects are infinite; an infinite reachability distance results when the reachability distance of an object is undefined because the distance value is greater than the given ε. OPTICS does not provide clustering results explicitly, but the reachability plot shows the clusters for any threshold ε′ ≤ ε. For example, with a small threshold ε′, a first cluster containing a few tightly packed objects is found. With a slightly larger ε′, a second cluster is found. As ε′ grows larger, the first and second clusters continue to expand, and a third cluster is found.
Figure 2.
A reachability plot of an example dataset (for given ε and MinPts).
As demonstrated in Figure 2, OPTICS addresses the limitations of input parameter selection in DBSCAN. However, OPTICS requires distance computations for all pairs of objects to create a reachability plot. In other words, OPTICS first computes the ε-neighborhood of each object to identify ε-core objects, and then computes the reachability distances at which each object can be reached from the ε-core objects. Thus, OPTICS runs in O(n²) time, where n is the number of objects in the dataset [14]. Therefore, OPTICS is unsuitable for massive datasets. Prior studies have proposed many algorithms to reduce the running time of OPTICS, such as DeLi-Clu [15] and SOPTICS [16]. These algorithms improve the running time of OPTICS but have their own limitations, such as dependence on the number of dimensions and deformation of the reachability plot.
This paper focuses on improving OPTICS by addressing its quadratic time complexity. To do this, we propose a fast algorithm called constrained OPTICS (C-OPTICS). C-OPTICS uses a novel bi-directional graph structure, called the constraint graph, whose vertices correspond to the cells that partition a given dataset. In the constraint graph, vertices are linked by edges when the distance between them is less than ε, and constraints are assigned as the weight of each edge. The main feature of C-OPTICS is that it only computes the reachability distances of the objects that satisfy the constraints, which reduces unnecessary distance computations when creating a reachability plot. We evaluated the performance of C-OPTICS through experiments against the OPTICS, DeLi-Clu, and SOPTICS algorithms. The experimental results show that C-OPTICS significantly reduces the running time compared with the OPTICS, DeLi-Clu, and SOPTICS algorithms and guarantees a reachability plot identical to that of OPTICS.
The rest of the paper is organized as follows: Section 2 provides an overview of OPTICS, including its limitations, and describes related studies that have been performed to improve OPTICS. Section 3 defines the concepts of C-OPTICS and describes the algorithm. Section 4 presents an evaluation of C-OPTICS based on the results of experiments with synthetic and real datasets. Section 5 summarizes and concludes the paper.
2. Related Work
This section focuses on OPTICS and related algorithms proposed to address the quadratic time complexity problem of OPTICS. Section 2.1 presents the concepts of OPTICS and discusses unnecessary distance computations in OPTICS. Section 2.2 describes the existing algorithms proposed to improve the running time of OPTICS. Details of all the symbols used in this paper are defined in Table 1.
Table 1.
The notations.
2.1. OPTICS
The well-known hierarchical density-based algorithm OPTICS visualizes the density-based clustering structure. Section 2.1.1 reviews the definitions of the naïve algorithm used to compute a reachability plot. Section 2.1.2 demonstrates the quadratic time complexity of the naïve algorithm and its unnecessary distance computations.
2.1.1. Definitions
Let D be a set of n objects in the d-dimensional space R^d. Here, the Euclidean distance between two objects p and q is denoted by dist(p, q). OPTICS creates a reachability plot based on the concepts defined below.
Definition 1
(Core distance of an object p) [12]. Let N_ε(p) be the ε-neighborhood of p and let MinPts-dist(p) be the distance between p and its MinPts-th nearest neighbor. The core distance of p, core-dist(p), is then defined by Equation (1):

core-dist(p) = UNDEFINED if |N_ε(p)| < MinPts, and MinPts-dist(p) otherwise. (1)
Note that core-dist(p) is the minimum radius ε′ at which p qualifies as an ε′-core object for DBSCAN. For example, for the sample dataset in Figure 1, the core distance of an object is the distance between that object and its MinPts-th nearest neighbor.
Definition 2
(Reachability distance of an object p with respect to an object q) [12]. Let p and q be objects from dataset D; the reachability distance of p with respect to q, reach-dist(p, q), is defined by Equation (2):

reach-dist(p, q) = UNDEFINED if |N_ε(q)| < MinPts, and max(core-dist(q), dist(q, p)) otherwise. (2)
Intuitively, when q is an ε-core object, the reachability distance of p with respect to q is the minimum distance such that p is directly density-reachable from q. Thus, the minimum reachability distance of an object is the smallest distance at which it can be contained in a cluster. To create a reachability plot, a linear order of objects is formed by repeatedly selecting, as the next object, the object with the smallest reachability distance to an already observed object. Here, the linear order of objects represents the order of interconnection between objects by density in the dataset. Accordingly, the reachability plot shows the reachability distance of each object in the order in which the object was processed.
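Definitions 1 and 2 translate directly into code. The sketch below (illustrative names, not the authors' implementation) computes the core distance as the distance to the MinPts-th nearest neighbor when at least MinPts objects lie within ε, and the reachability distance as max(core-dist(q), dist(q, p)); `Double.POSITIVE_INFINITY` stands in for UNDEFINED.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Definitions 1 and 2; infinity plays the role of UNDEFINED.
final class ReachabilitySketch {
    static double dist(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // core-dist(p): distance to the MinPts-th nearest neighbor inside the eps-neighborhood,
    // or UNDEFINED (infinity) if p is not an eps-core object.
    static double coreDistance(List<double[]> data, double[] p, double eps, int minPts) {
        List<Double> dists = new ArrayList<>();
        for (double[] q : data) {
            double d = dist(p, q);
            if (d <= eps) {
                dists.add(d);
            }
        }
        if (dists.size() < minPts) {
            return Double.POSITIVE_INFINITY; // p is not an eps-core object
        }
        dists.sort(null);
        return dists.get(minPts - 1); // MinPts-th object of N_eps(p), counting p itself
    }

    // reach-dist(p, q): max of core-dist(q) and dist(q, p); UNDEFINED if q is not a core object.
    static double reachabilityDistance(List<double[]> data, double[] p, double[] q,
                                       double eps, int minPts) {
        double coreDist = coreDistance(data, q, eps, minPts);
        if (Double.isInfinite(coreDist)) {
            return Double.POSITIVE_INFINITY;
        }
        return Math.max(coreDist, dist(q, p));
    }
}
```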
2.1.2. Computation
OPTICS first finds all ε-core objects in the dataset in O(n²) time and then computes the minimum reachability distance of every object in O(n²) time to create a reachability plot; this is still the best known time complexity. The ε-core objects can be found more quickly using spatial indexing structures, such as the R*-tree [17], that optimize the range queries used to obtain ε-neighborhoods. However, the reachability plot is still created in O(n²) time because computing the reachability distances of all objects to form the linear order is not accelerated by the spatial indexing structure. In other words, OPTICS has quadratic time complexity because each object computes reachability distances with respect to all ε-core objects in the dataset. However, only the minimum reachability distance of each object is displayed in the reachability plot (see Figure 2). That is, all reachability distance computations other than those identifying the reachability distance displayed in the plot are unnecessary.
Figure 3 shows an example of unnecessary reachability distance computations for the sample dataset under given ε and MinPts. First, the distances between an object p and all sample objects are computed to determine whether p is a core object. Next, the distance between p and its MinPts-th nearest neighbor is computed to obtain the core distance of p according to Definition 1. Subsequently, the reachability distances of all objects contained in N_ε(p) are computed according to Definition 2, as shown in Figure 3a. This process is repeated for all sample objects, as shown in Figure 3b. However, as shown in Figure 3c, not all reachability distances are required to create the reachability plot. For example, an object may be reachable from two different core objects, with a reachability distance of 1.29 from one and 1.37 from the other. Considering that only the minimum reachability distance of each object appears in the reachability plot, the larger reachability distance computation is unnecessary because it does not contribute to creating the reachability plot.
Figure 3.
Visualization of the reachability distances for the sample dataset: (a) all reachability distances for an object p; (b) all reachability distances between sample objects; (c) unnecessary distance computations between sample objects.
2.2. Existing Work
This subsection describes the algorithms proposed to address the quadratic time complexity of OPTICS. Researchers have proposed new indexing structures and approximate reachability plots to reduce the number of core distance and reachability distance computations. In addition, some researchers have proposed algorithms that visualize other hierarchical density-based clustering structures. We can classify these algorithms roughly into three categories: equivalent algorithms whose results are identical to those of OPTICS, approximate algorithms, and other hierarchical density-based algorithms.
Among the equivalent algorithms, DeLi-Clu [15] optimized the range queries used to compute the core distance and reachability distance of each object in a dataset using a spatial indexing structure, in this case a variant of the R*-tree. In particular, the proposed self-join R*-tree query significantly reduces the total number of distance computations. Furthermore, DeLi-Clu discards the parameter ε, as the join process automatically stops when all objects are connected, without computing all pair-wise distances. However, as the dimensionality of the dataset increases, the limitations of the R*-tree arise in the form of overlapping distance computations, which significantly degrades the performance of DeLi-Clu. Extended DBSCAN and OPTICS algorithms were proposed by Brecheisen et al. [18]. These two algorithms improved the performance of DBSCAN and OPTICS by applying a multi-step query processing paradigm. Specifically, the core distance and reachability distance were quickly computed by replacing range queries with nearest neighbor queries. In addition, complicated distance computations were performed only at the steps essential to creating a reachability plot, reducing wasted computations. However, the running times of the extended DBSCAN and OPTICS in [18] are not suitable for massive datasets. Other exact approaches are based on graphics processing units (GPUs) or on multi-core distributed processing. One study [13] introduced an extensible parallel OPTICS algorithm on a 40-core shared memory machine; the similarities between OPTICS and Prim’s minimum spanning tree algorithm were used to extract clusters in parallel from the shared memory. Other work [19] introduced G-OPTICS, which improves the scalability of OPTICS using a GPU. G-OPTICS significantly reduces the running time of OPTICS by processing the iterative computations required to form a reachability plot in parallel through the GPU and shared memory.
Approximate algorithms simplify and reduce the complex distance computations of OPTICS. An algorithm was proposed to compress the dataset into data bubbles containing only key information [20]. The number of distance computations is determined by the data compression ratio, allowing the scalability to be improved. However, if the data compression ratio exceeds a certain level, a reachability plot identical to that of OPTICS cannot be guaranteed: the higher the compression ratio, the larger the number of objects abstracted into a single representative object. GridOPTICS, which transforms the dataset into a grid structure to create a reachability plot, was also proposed [21]. GridOPTICS initially creates a grid structure for a given dataset, after which it forms a reachability plot for the grid cells and then assigns the objects in the dataset to the formed reachability plot. However, clustering quality is not guaranteed, given that the form of the reachability plot depends on the size of the grid cells. Another method is SOPTICS [16], which applies random projection to improve the running time of OPTICS. After partitioning a dataset into subsets by performing random projection according to a pre-defined criterion, it quickly creates a reachability plot using a new density estimation metric based on average distance computations over sampled data. However, SOPTICS introduces many deformations into the reachability plot because random projection maps the objects into lower dimensions.
Other hierarchical density-based clustering algorithms have been continuously studied. A runt-pruning algorithm that analyzes the minimum spanning tree of the sampled dataset and then uses nearest neighbor-based density estimation to define the density levels of each object to create a cluster tree was proposed in [22]. Based on a runt test for multi-modality [23], the runt-pruning algorithm creates a cluster tree by repeating the task of partitioning a dense cluster into two connected components. HDBSCAN, which constructs a graph structure with a mutual reachability distance between two objects, was also proposed in [24]. It discovers a minimum spanning tree and then forms a hierarchical clustering structure. In addition to hierarchical clustering, an algorithm for producing flat partitioning from the hierarchical solution was introduced in [25].
3. Constrained OPTICS (C-OPTICS)
3.1. Overview
We now explain the proposed algorithm, called C-OPTICS. The goal of C-OPTICS is to reduce the total number of distance computations required to create a reachability plot, because these computations dominate the running time of OPTICS, as explained in Section 2.1.2. To achieve this goal, C-OPTICS proceeds in three steps: (a) a partitioning step, (b) a graph construction step, and (c) a plotting step. For convenience of explanation, Figure 4 shows the process of C-OPTICS for the two-dimensional dataset in Figure 1. In the partitioning step, we partition the dataset into identical cells. Figure 4a shows the result of partitioning, where empty cells are discarded. In the graph construction step, as shown in Figure 4b, we construct a constraint graph that consists of vertices corresponding to the cells obtained in the partitioning step and edges linking vertices according to constraints. In the plotting step, we compute the reachability distance of each object while traversing the constraint graph. Figure 4c shows the reachability distance and linear order of each object.
Figure 4.
The overall process of C-OPTICS for the two-dimensional dataset: (a) partitioning step, (b) graph construction step, and (c) plotting step.
C-OPTICS can obtain ε-neighbors by finding the adjacent cells of an arbitrary object. Thus, C-OPTICS reduces the number of distance computations needed to find ε-core objects. In addition, C-OPTICS only computes the reachability distances of the objects that satisfy the constraints in the constraint graph. Consequently, C-OPTICS reduces the total number of distance computations required to create a reachability plot. We explain the partitioning step in Section 3.2, the graph construction step in Section 3.3, and the plotting step in Section 3.4.
3.2. Partitioning Step
This subsection describes the data partitioning step of C-OPTICS. We partition a given dataset D into cells of identical size with a diagonal length ε. Definition 3 defines the unit cell used in the partitioning step.
Definition 3
(Unit cell, UC). We define a UC as a d-dimensional hypercube with a diagonal length ε and straight sides of equal length. A UC is a square with a diagonal length ε in two dimensions, and in three dimensions a UC is a cube with a diagonal length ε.
All UCs have two main features. First, each UC has a unique identifier that encodes its location in the dataset. We use the coordinate values of a UC to obtain this unique identifier. Here, the coordinate value of a UC in a dimension is the index of the cell along that dimension, so a UC has one coordinate value per dimension of the dataset. Thus, combining the coordinate values of a UC over all dimensions yields a unique identifier from which the location information can be recovered. The location information allows us to quickly find the adjacent UCs within a radius ε of an arbitrary UC. Through the partitioning step, we can reduce the number of distance computations needed to find the ε-neighbors of each object, because the number of objects in the cells within a radius ε is smaller than the number of objects in the entire dataset. We can further simplify the process of constructing a constraint graph in a later step. Second, each UC has a d-dimensional minimum bounding rectangle (MBR) that encloses the contained objects. We use the MBRs to compute the distance between UCs when finding adjacent UCs. We obtain all coordinates of an MBR from the maximum and minimum values of each dimension over the contained objects. If a UC contains only one object, the object itself becomes the MBR.
Example 1.
Figure 5 shows an example of the partitioning step for the sample dataset D under a given ε. Figure 5a shows each object in a two-dimensional universe. Figure 5b shows the partitioning of D into UCs, where each UC has the same diagonal length ε. Figure 5c shows the empty and non-empty UCs; note that we do not consider the empty UCs. Further, we obtain the unique identifier of each UC as follows. To compute the unique identifier of a UC, we first obtain its coordinate values in each dimension. As shown in Figure 5c, the coordinate values of one UC are (2, 1) in the first and second dimensions, respectively. Considering that a unique identifier of a UC is obtained by combining its coordinate values, the unique identifier of this UC is 21. The result of partitioning is shown in Figure 5d, where D is partitioned into seven UCs, with each UC having an MBR.
Figure 5.
Partitioning the sample dataset D into UCs (for a given ε): (a) the sample dataset D, (b) the UCs that partition D, (c) the non-empty UCs (white area) and their unique identifiers, and (d) the result of the partitioning step.
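The cell identifier described above can be sketched as follows, under the assumption (consistent with Definition 3) that each straight side of a unit cell has length ε/√d so that the diagonal is exactly ε. The floor-division indexing and the underscore-separated encoding of the identifier are our own illustrative choices; the paper's 2-D example simply concatenates the values, e.g., (2, 1) becomes 21.

```java
import java.util.Arrays;

// Sketch: map an object to the unit cell (UC) that contains it.
// Side length eps / sqrt(d) gives a diagonal of exactly eps (Definition 3).
final class UnitCellSketch {
    // Per-dimension coordinate values (cell indices) of the UC containing point p.
    static int[] cellCoordinates(double[] p, double eps) {
        double side = eps / Math.sqrt(p.length);
        int[] coords = new int[p.length];
        for (int i = 0; i < p.length; i++) {
            coords[i] = (int) Math.floor(p[i] / side);
        }
        return coords;
    }

    // Unique identifier obtained by combining the coordinate values of all dimensions;
    // a separator is used here to keep the encoding unambiguous.
    static String cellIdentifier(double[] p, double eps) {
        int[] coords = cellCoordinates(p, eps);
        StringBuilder id = new StringBuilder();
        for (int c : coords) {
            if (id.length() > 0) {
                id.append('_');
            }
            id.append(c);
        }
        return id.toString();
    }

    public static void main(String[] args) {
        double[] p = {2.3, 1.7};
        System.out.println(Arrays.toString(cellCoordinates(p, 1.5))); // [2, 1]
        System.out.println(cellIdentifier(p, 1.5));                   // 2_1
    }
}
```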
3.3. Graph Construction Step
In the graph construction step, we construct a constraint graph from the UCs obtained in the data partitioning step. By constructing a constraint graph, we can reduce the number of distance computations needed to obtain the minimum reachability distance of each object. To do this, we first use the UCs obtained in the data partitioning step as the vertices of the constraint graph. Here, the unique identifier of each UC is used as the unique identifier of the corresponding vertex. Thus, we can quickly find adjacent vertices because we preserve the location information of the UC for each vertex; we also preserve the MBR of each UC. Then, we connect the vertices to define the edges. Here, we connect two vertices according to two properties: (i) the two vertices must be adjacent to each other within a radius ε, and (ii) the edge between the two vertices must have constraints. Because the first property lets us quickly find the ε-neighborhood of each object in the dataset, we can obtain the objects that are within a radius ε without computing distances to all objects. We first explain the maximum reachability distance, which is given in Definition 4.
Definition 4
(Maximum reachability distance, MRD). We define the MRD as the range of cells that a vertex can reach within a radius ε. Here, the unique identifier of a vertex is the combination of its coordinate values. Thus, in d dimensions, the MRD represents a range of ±⌈√d⌉ in the coordinate value of each dimension, because a radius ε is √d times the length of a straight side of the UC that corresponds to the vertex.
Figure 6 shows an example of the MRD of a vertex in two-dimensional space. We can decompose the unique identifier of this vertex into its two coordinate values, which are (3, 3). Considering that the MRD of a vertex in a two-dimensional dataset represents a range of +2 and −2 in the coordinate value of each dimension according to Definition 4, the reachable range is 1 to 5 in the first dimension and 1 to 5 in the second dimension. As a result, the MRD of the vertex is the gray area shown in Figure 6. Through the MRD, we can restrict the search for adjacent vertices to this range and thus avoid unnecessary distance computations between all pairs of vertices. In Figure 6, two adjacent vertices lie within the MRD of the vertex.
Figure 6.
An example of the maximum reachability distance (MRD) in two-dimensional space.
Additionally, we use the MBR of each vertex to compute the distance between two vertices. We can determine the adjacent vertices more accurately by computing this distance, because the adjacent vertices obtained from the MRD of an arbitrary vertex may not actually lie within the radius ε; the range of the MRD and the radius ε do not precisely match. Thus, we compute the minimum distance between the MBRs of two vertices to find the adjacent vertices within the radius ε. Equation (3) gives the distance between two vertices v_i and v_j:

dist(v_i, v_j) = MinDist(MBR_i, MBR_j), (3)
where dist(v_i, v_j) is the distance between v_i and v_j, MinDist is the minimum Euclidean distance between two rectangles, and MBR_i is the MBR of the unit cell corresponding to v_i. Here, if dist(v_i, v_j) ≤ ε, v_j is an adjacent vertex of v_i, and vice versa.
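The sketch below combines the two adjacency tests just described: a coarse filter that keeps only vertices whose cell coordinates differ by at most ⌈√d⌉ in every dimension (the MRD of Definition 4), followed by the exact test of Equation (3), implemented here as the standard minimum distance between two axis-aligned rectangles. Class and method names are our own placeholders.

```java
// Sketch of the adjacency tests used in the graph construction step.
// A vertex is represented here only by its cell coordinates and its MBR
// (per-dimension lower and upper bounds); names are illustrative.
final class AdjacencySketch {
    // Coarse filter (Definition 4): candidate vertices must lie within
    // +/- ceil(sqrt(d)) cells of the vertex in every dimension, because the
    // radius eps is sqrt(d) times the side length of a unit cell.
    static boolean withinMrd(int[] coordsA, int[] coordsB) {
        int range = (int) Math.ceil(Math.sqrt(coordsA.length));
        for (int i = 0; i < coordsA.length; i++) {
            if (Math.abs(coordsA[i] - coordsB[i]) > range) {
                return false;
            }
        }
        return true;
    }

    // Exact test (Equation (3)): minimum Euclidean distance between two MBRs.
    // Returns 0 when the rectangles overlap in every dimension.
    static double minDist(double[] lowA, double[] highA, double[] lowB, double[] highB) {
        double sum = 0.0;
        for (int i = 0; i < lowA.length; i++) {
            double gap = Math.max(0.0, Math.max(lowB[i] - highA[i], lowA[i] - highB[i]));
            sum += gap * gap;
        }
        return Math.sqrt(sum);
    }

    // Two vertices are adjacent when they pass the coarse MRD filter and
    // their MBRs are within eps of each other.
    static boolean adjacent(int[] coordsA, double[] lowA, double[] highA,
                            int[] coordsB, double[] lowB, double[] highB, double eps) {
        return withinMrd(coordsA, coordsB)
                && minDist(lowA, highA, lowB, highB) <= eps;
    }
}
```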
Next, we define the constraints used to link vertices that satisfy the second property. Recall from Section 1 that the constraint graph only links vertices when the distance between them is less than ε. These pairs of vertices become edges, and the constraints are assigned as the weights of the edges. Here, a constraint is defined as a subset of ε-core objects that can determine the minimum reachability distances of objects in a different vertex. In the plotting step, we only compute the reachability distances of the objects that satisfy the constraints and thus reduce unnecessary distance computations. A constraint is satisfied if the observed object belongs to the subset of ε-core objects designated by the constraint. We first explain the linkage constraint, defined in Definition 5, which specifies the ε-core objects corresponding to a constraint.
Definition 5.
(Linkage constraint, LC_{i→j}). For two vertices v_i and v_j, consider the ε-core objects in v_i and let the MBR of vertex v_j be MBR_j. We define the linkage constraint from v_i to v_j, LC_{i→j}, as a subset of the ε-core objects in v_i. Here, when c is the ε-core object of v_i closest to MBR_j and f is the corner of MBR_j farthest from c, the ε-core objects included in LC_{i→j} should be closer to MBR_j than c is to f.
We link two vertices when the linkage constraint between them is defined according to Definition 5. Thus, we can reduce unnecessary distance computations by pruning again the adjacent vertices obtained through the first property. Furthermore, the linkage constraints guarantee the quality of the reachability plot while reducing the number of distance computations. We prove this in Lemma 1.
Lemma 1.
Let the linkage constraint from v_i to v_j be LC_{i→j} for two linked vertices v_i and v_j in the constraint graph. Then, the minimum reachability distances of all objects contained in v_j with respect to the ε-core objects of v_i are determined by a (p, q) pair with p ∈ LC_{i→j} and q ∈ v_j, that is, by dist(p, q) and the core distance core-dist(p). In other words, the reachability distances of all objects in v_j are determined by the ε-core objects in LC_{i→j}.
Proof.
For a given dataset D, let p ∈ LC_{i→j} and p′ ∉ LC_{i→j} be ε-core objects in v_i, and let q be an object in v_j. In this case, the reachability distance of q is determined by p when reach-dist(q, p) ≤ reach-dist(q, p′). Conversely, p′ cannot determine the minimum reachability distance of q because, by Definition 5, dist(p′, q) ≥ dist(c, f) ≥ dist(c, q) holds in all cases, even when p′ is an ε-core object. Therefore, Lemma 1 is obviously true. □
Additionally, we define the state of a vertex. Through the states of vertices, we can find ε-core objects in the dataset without computing distances between objects. We define the state of a vertex based on the rules of DBSCAN and the UCs obtained in the data partitioning step. As described in Section 1, if the number of objects contained in the ε-neighborhood of an arbitrary object is greater than or equal to MinPts, then the object is an ε-core object. Here, because the diagonal length of a UC is ε, all objects in a UC must be ε-core objects when the number of objects contained in the UC is greater than or equal to MinPts. Accordingly, we define three states for each vertex: core, uncertain, and noise. More precisely, the state of a vertex is defined as follows:
Definition 6.
(The state of a vertex, SV). Let the set of adjacent vertices contained in the MRD of a vertex v be V_adj and let an adjacent vertex be v_adj ∈ V_adj. The state of v, SV(v), which determines whether the distances of the contained objects need to be computed, satisfies the following conditions:
- 1. Let |v| be the number of objects contained in v. If |v| ≥ MinPts, SV(v) is core;
- 2. If condition 1 is not satisfied and the total number of objects in v and its adjacent vertices in V_adj is greater than or equal to MinPts, SV(v) is uncertain;
- 3. If neither condition 1 nor condition 2 is satisfied, SV(v) is noise.
Again, we note that objects contained in the same vertex must be ε-neighbors of each other. Thus, when the SV of a vertex v is core, all objects contained in v are ε-core objects, because |N_ε(p)| of every object p in v is always greater than or equal to MinPts. Conversely, if the SV of v is noise, all objects contained in v are noise objects and are not considered any further.
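A minimal sketch of Definition 6 follows, assuming a vertex stores the number of objects in its unit cell and references to its adjacent vertices within the MRD; the enum labels mirror the three states named above but, like all identifiers here, are our own placeholders.

```java
import java.util.List;

// Sketch of Definition 6: determine the state of a vertex from object counts alone,
// without any distance computations. Names are illustrative.
final class VertexStateSketch {
    enum State { CORE, UNCERTAIN, NOISE }

    static final class Vertex {
        final int objectCount;        // number of objects in the corresponding UC
        final List<Vertex> adjacent;  // adjacent vertices within the MRD

        Vertex(int objectCount, List<Vertex> adjacent) {
            this.objectCount = objectCount;
            this.adjacent = adjacent;
        }
    }

    static State stateOf(Vertex v, int minPts) {
        // Condition 1: the UC alone already holds MinPts objects, so every object
        // in it is an eps-core object (the UC diagonal is eps).
        if (v.objectCount >= minPts) {
            return State.CORE;
        }
        // Condition 2: together with its adjacent vertices the count may still
        // reach MinPts, so per-object checks are required later.
        int total = v.objectCount;
        for (Vertex a : v.adjacent) {
            total += a.objectCount;
        }
        if (total >= minPts) {
            return State.UNCERTAIN;
        }
        // Condition 3: even the whole neighborhood is too sparse; all objects are noise.
        return State.NOISE;
    }
}
```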
Example 2.
Figure 7 shows the step-by-step construction of the constraint graph G for the sample dataset under given ε and MinPts, using a dataset identical to that shown in Figure 5. The example starts with the partitioned dataset, as shown in Figure 7a. First, as shown in Figure 7b, each UC is mapped to a vertex of G, and the unique identifier of each UC is used as the unique identifier of the vertex. Second, a vertex is selected by the input sequence of the objects; in our case, the vertex containing the first object is selected, as shown in Figure 7c. To find the adjacent vertices of this vertex, its MRD (the gray area) is computed using Definition 4. Here, two vertices are found as adjacent vertices. Next, the state of the selected vertex is obtained according to Definition 6; all objects in it are ε-core objects, hence its SV is core, depicted as a doubled circle. Third, as shown in Figure 7d, the linkage constraints to the two adjacent vertices are obtained using Definition 5. For example, consider one adjacent vertex and its MBR. We first obtain the ε-core objects of the selected vertex, but we can skip this step because its SV is core. Next, the ε-core object closest to the MBR of the adjacent vertex is obtained. Because the adjacent vertex contains only one object, the corner of its MBR farthest from this ε-core object is the object itself (blue star). Thus, the linkage constraint contains only this closest ε-core object, because no other ε-core object is closer to the MBR. Then, because the constraint between the two vertices is defined, they are linked by an edge. Figure 7e shows the linkage constraint and state of the adjacent vertex. This vertex contains only one object; thus, its SV is uncertain (single circle) due to an adjacent vertex that contains four objects. Because a linkage constraint is also defined in the opposite direction, the edge becomes bidirectional. On the other hand, another vertex contains only one object and has no adjacent vertex; thus, its state is noise, depicted as a dotted circle. When all vertices have been processed, the constraint graph G is constructed, as shown in Figure 7f.
Figure 7.
Construction of the constraint graph G for the sample dataset (given ε and MinPts): (a) the sample dataset partitioned into non-empty UCs, (b) the vertices corresponding to the UCs, (c) the MRD and SV of a selected vertex, (d) edges in G weighted with linkage constraints, (e) an edge in G weighted with a linkage constraint, and (f) the constraint graph G.
3.4. Plotting Step
In the plotting step, we traverse the constraint graph to generate a reachability plot as in OPTICS. Recall from Section 2.1.2 that OPTICS computes reachability distances for all pairs of objects to create a reachability plot, even though only one reachability distance is required for each object. Through the constraint graph, we can skip the reachability distance computations that do not contribute to the reachability plot, because the linkage constraints identify the reachability distance of each object that is required for the plot. Thus, we reduce unnecessary distance computations by only computing distances between objects contained in vertices that are linked to each other, and further by only computing the reachability distances of objects that satisfy the linkage constraints between the linked vertices. To create the reachability plot, we traverse the constraint graph with the following rules: (i) if two objects are contained in the same vertex, their distance is computed; (ii) if two objects are contained in different vertices, the reachability distance is computed only when the linkage constraint between the two vertices is satisfied; (iii) the object with the closest reachability distance to the target object becomes the next target object; (iv) if no object is reachable, the next target object is selected by the input order of the objects. For clarity, we provide the pseudocode, which we refer to as the C-OPTICS procedure (Algorithm 1).
| Algorithm 1: C-OPTICS |
| Input: (1) D: the input dataset (2) G: the constraint graph for D |
| Output: (1) R: a reachability plot |
| Algorithm: |
|
Algorithm 1 shows the C-OPTICS procedure, which creates a reachability plot by traversing the constraint graph and plotting the reachability distance of each object. The inputs of C-OPTICS are the dataset D and the constraint graph G; the output is the reachability plot R. In line 1, R, a list structure representing the reachability plot, and PQ, a priority queue structure, are initialized. Here, PQ determines the order in which the constraint graph is traversed. In line 2, a target object p is selected by the input order of the objects. In line 3, the algorithm checks whether p has been processed. If p has not been processed, p is set to the processed state in line 4. Then, p is inserted into R with an infinite reachability distance in line 5. In line 6, the vertex v containing p is obtained. If the state of v is noise, then in line 7, p is determined to be a noise object and the algorithm selects the next target object. If v is not a noise vertex, the algorithm checks whether p is an ε-core object; if it is not, p is determined to be a noise object. In the opposite case, in line 8, the Update procedure is called to discover the neighbors of p, after which PQ is updated. In lines 9–14, the objects in PQ are processed repeatedly. The object q with the highest priority in PQ is selected as the next target object in line 10. Then, q is set to the processed state in line 11. Next, in line 12, q is inserted into R with its computed reachability distance, and the vertex containing q is obtained. In lines 13 and 14, PQ is updated by the Update procedure when this vertex is not a noise vertex and q is an ε-core object. This process is repeated until PQ is empty. When PQ is empty, an unprocessed object is selected as the next target object, and all of the above processes are repeated. When all objects have been processed, R is output; listing the objects in R yields the reachability plot.
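Since the pseudocode itself is not reproduced above, the following Java sketch restates the traversal described in the preceding paragraph. Every type and method name (`Entry`, `ConstraintGraph`, `update`, and so on) is our own placeholder rather than the authors' code, and the priority queue uses lazy re-insertion instead of a decrease-key operation, which is a simplification of the described update.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of Algorithm 1 as described in the text; every type below is a placeholder.
final class COpticsSketch {
    static final class Entry implements Comparable<Entry> {
        final Object point;
        final double reachDist;
        Entry(Object point, double reachDist) { this.point = point; this.reachDist = reachDist; }
        public int compareTo(Entry other) { return Double.compare(reachDist, other.reachDist); }
    }

    interface ConstraintGraph {
        Object vertexOf(Object point);          // vertex (cell) containing a point
        boolean isNoiseVertex(Object vertex);   // state of the vertex is noise
        boolean isCore(Object point);           // point is an eps-core object
        // Push unprocessed neighbors of 'point' that satisfy the linkage constraints
        // into the queue with their reachability distances (Algorithm 2).
        void update(Object point, Object vertex, PriorityQueue<Entry> queue, Set<Object> processed);
    }

    static List<Entry> run(List<Object> dataset, ConstraintGraph graph) {
        List<Entry> plot = new ArrayList<>();               // R: the reachability plot
        PriorityQueue<Entry> queue = new PriorityQueue<>(); // PQ: traversal order
        Set<Object> processed = new HashSet<>();

        for (Object p : dataset) {                          // target objects in input order
            if (processed.contains(p)) continue;
            processed.add(p);
            plot.add(new Entry(p, Double.POSITIVE_INFINITY));
            Object v = graph.vertexOf(p);
            if (graph.isNoiseVertex(v) || !graph.isCore(p)) continue; // p is a noise object
            graph.update(p, v, queue, processed);

            while (!queue.isEmpty()) {                      // expand the current cluster
                Entry next = queue.poll();
                if (processed.contains(next.point)) continue;
                processed.add(next.point);
                plot.add(next);
                Object vq = graph.vertexOf(next.point);
                if (!graph.isNoiseVertex(vq) && graph.isCore(next.point)) {
                    graph.update(next.point, vq, queue, processed);
                }
            }
        }
        return plot;                                        // listing plot gives the reachability plot
    }
}
```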
Algorithm 2 shows the Update procedure, which determines whether PQ is updated according to the state of each vertex and the linkage constraints. The inputs of Update are the target object p, the vertex v containing p, and the priority queue PQ; the output is the updated PQ. In line 1, N_ε(p) is obtained. In line 2, an object q in N_ε(p) is selected. In line 3, the processing state of q is checked; when q has been processed, a new object is selected. Otherwise, in line 4, the vertex containing q is obtained. In line 5, p is checked as to whether it satisfies the linkage constraint from v to the vertex containing q. If p satisfies the linkage constraint, the reachability distance of q with respect to p is computed and PQ is updated. In the opposite case, the reachability distance of q is not computed and PQ is thus not updated.
| Algorithm 2: Update |
| Input: (1) p: a target object of dataset D (2) v: a target vertex of constraint graph G (3) PQ: a priority queue Output: (4) PQ: the updated priority queue |
| Algorithm: |
|
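The Update step described above can be sketched as follows, reusing the placeholder `Entry` type and queue from the previous sketch; the linkage-constraint check and the reachability-distance computation are left abstract because their concrete forms follow Definitions 2 and 5, and all identifiers remain illustrative.

```java
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of Algorithm 2 (Update) as described in the text; names are illustrative.
final class UpdateSketch {
    interface Neighborhood {
        List<Object> epsNeighbors(Object p);                              // N_eps(p)
        Object vertexOf(Object q);                                        // vertex containing q
        boolean satisfiesLinkageConstraint(Object p, Object from, Object to); // p in LC(from -> to)?
        double reachabilityDistance(Object q, Object p);                  // Definition 2
    }

    static void update(Neighborhood nb, Object p, Object v,
                       PriorityQueue<COpticsSketch.Entry> queue, Set<Object> processed) {
        for (Object q : nb.epsNeighbors(p)) {             // lines 1-2: scan N_eps(p)
            if (processed.contains(q)) continue;          // line 3: skip processed objects
            Object vq = nb.vertexOf(q);                   // line 4: vertex containing q
            if (v.equals(vq)                              // same vertex: always compute (rule i)
                    || nb.satisfiesLinkageConstraint(p, v, vq)) { // line 5: linkage constraint
                double rd = nb.reachabilityDistance(q, p);
                queue.add(new COpticsSketch.Entry(q, rd)); // update PQ
            }
            // otherwise the reachability distance of q is not computed here
        }
    }
}
```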
Example 3.
Figure 8 shows the step-by-step clustering process of the plotting step for the sample dataset. This example assumes given values of ε and MinPts, and the constraint graph G has been created as shown in Figure 8a. First, a target object is selected by the input sequence of the objects; the selected object is shown as a red dot in Figure 8b. Second, the vertex containing the target object is obtained (blue circle) and the ε-neighborhood of the target object is obtained (blue dots). Because the linkage constraint to one adjacent vertex is satisfied, the object contained in that vertex is also a neighbor of the target object. Conversely, because the linkage constraint to another adjacent vertex is not satisfied, the two objects contained in that vertex are not neighbors of the target object. Third, as shown in Figure 8c, the reachability distance of each object in the ε-neighborhood is computed for the target object according to Definition 2. Fourth, the object having the closest reachability distance to the target object is selected as the next target object, as shown in Figure 8d (red dot), and the above process is repeated for it. Note that only objects contained in the same vertex are considered in this step because no linkage constraint is satisfied. Figure 8e shows the process for a later target object: although it satisfies a linkage constraint, the two objects in the linked vertex are not its neighbors because their distances are greater than ε. Thus, no next target object is selected from the queue; in this case, the next target object is selected by the input order of the unprocessed objects. The above process is repeated for all unprocessed objects, creating a unidirectional graph structure that represents the reachability plot, as shown in Figure 8f.
Figure 8.
An example of the plotting step for the sample dataset (given ε and MinPts): (a) the constraint graph G for the sample dataset, (b) the ε-neighborhood obtained by using the linkage constraints of G, (c) the reachability distances of all objects in the ε-neighborhood of the first target object, (d) the reachability distances of all objects in the ε-neighborhood of the next target object, (e) a target object whose linked neighbors lie farther than ε, and (f) a unidirectional graph structure that represents the reachability plot.
4. Performance Evaluation
This section presents the experiments designed to evaluate C-OPTICS. The main purpose of these experiments is to provide experimental evidence that C-OPTICS alleviates the quadratic time complexity of OPTICS and that it outperforms the current state-of-the-art algorithms. Section 4.1 presents the meta-information pertaining to the experiments. Afterward, Section 4.2 reports the quality of the reachability plot created by C-OPTICS and the running time achieved by C-OPTICS compared to the state-of-the-art algorithms.
4.1. Experimental Setup
This subsection describes the meta-information set up to perform the experiment. The experiments were run on a machine with a single core (Intel Core i7-8700 3.20 GHz CPU) and 48 GB of memory. The operating system installed is Windows 10 x64. All algorithms used in our experiments were implemented in the Java programming language. Moreover, the maximum Java heap memory in the JVM environment was set to 48 GB. Section 4.1.1 describes the datasets used in the experiments. Section 4.1.2 describes the existing algorithms which are compared with C-OPTICS. Section 4.1.3 describes the approach used to evaluate the clustering quality (accuracy of the reachability plot) of the algorithms.
4.1.1. Datasets
We conducted experiments with three real datasets and two synthetic datasets. The real datasets are termed HT, Household, and PAMAP2, and were obtained from the UCI Machine Learning Repository [26]. First, HT is a ten-dimensional dataset that collects measured values of home sensors that monitor the temperature, humidity, and concentration levels of various gases, produced in another project [27]. Second, Household is a seven-dimensional dataset that collects the measured values of active energy consumed by each electronic product in the home. Third, PAMAP2 is a four-dimensional dataset that collects the measured values of three inertial forces and the heart rates for 18 physical activities, produced during another project [28]. The synthetic datasets are referred to here as BIRCH2 and Gaussian. First, BIRCH2 is a synthetic dataset for clustering benchmarks produced in earlier work [29]; we extended BIRCH2 to one million instances in seven dimensions to evaluate the dimensionality and scalability of the algorithms. Second, Gaussian is a synthetic dataset for benchmarking the clustering quality and the running time of the algorithms according to the number of dimensions; it has a minimum of ten dimensions and a maximum of 50 dimensions. Table 2 presents the properties of all datasets, including their sizes and dimensions. In addition, each dataset was sampled at various sizes to assess the scalability of the algorithms.
Table 2.
Meta-information of the datasets.
4.1.2. Competing Algorithms
We compared C-OPTICS with the three state-of-the-art algorithms, each of which is representative in a unique sense, as explained below:
- OPTICS: The naïve algorithm [12] with the spatial indexing structure R*-tree to improve range queries;
- DeLi-Clu: A state-of-the-art algorithm that quickly creates an exact reachability plot by improving the single-linkage approach [15];
- SOPTICS: A fast OPTICS algorithm that achieves sub-quadratic time complexity by resorting to approximation using random projection [16].
Our comparisons with the above algorithms had different purposes. The comparisons with OPTICS and DeLi-Clu, which create an exact reachability plot, focus on evaluating the running time of C-OPTICS. As can be observed from the experimental results in Section 4.2, C-OPTICS is superior in terms of running time to both algorithms in all cases.
The comparison with SOPTICS represents the assessments of the clustering quality (accuracy of the reachability plot) and the running time. Here, the main purpose is to demonstrate two phenomena through experiments. First, the reachability plot created by C-OPTICS is robust and accurate in all cases, unlike that by SOPTICS. Second, C-OPTICS outperforms SOPTICS in terms of running time. These outcomes are verified in experimental results on the three real datasets and two synthetic datasets introduced in Section 4.1.1.
All four algorithms, including C-OPTICS, run in a single-threaded environment, with the selected parameter settings for each algorithm differing depending on the dataset. Table 3 lists the parameters of each algorithm and their search ranges.
Table 3.
Parameters and search ranges for the three compared algorithms.
4.1.3. Clustering Quality Metrics
We assess the clustering quality of the algorithms with two approaches. The first seeks to present the reachability plots in full to enable a direct visual comparison. This approach, however, fails to quantify the degree of similarity. In addition, the larger the dataset size, the more difficult it is to observe the difference between the reachability plots. To remedy this defect, we use the adjusted Rand index (ARI) with 30-fold cross-validation.
ARI is a metric that evaluates clustering quality based on the degree of similarity through all pair-wise comparisons between extracted clusters [30,31,32]. ARI takes values up to 1, where 1 represents completely identical clusterings and values near 0 indicate agreement no better than chance. However, the reachability plot visualizes only the hierarchy of clusters without extracting the clusters themselves. To measure the ARI of a reachability plot, we extract clusters from the reachability plot by defining thresholds ε′ ≤ ε. Thus, the ARI is computed by comparing the clusters of OPTICS with the clusters of each algorithm for the same ε′. In other words, the reachability plot of OPTICS is used as the ground truth to evaluate SOPTICS and C-OPTICS. In order to demonstrate the robustness of the clustering quality, the experiment is repeated and the computed minimum and maximum ARIs are compared.
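For reference, the ARI used here can be written in the standard Hubert–Arabie form [31], where n_{ij} is the number of objects shared by cluster i of one partition and cluster j of the other, a_i and b_j are the row and column sums of that contingency table, and n is the total number of objects:

```latex
\mathrm{ARI} =
\frac{\sum_{ij}\binom{n_{ij}}{2}
      - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}
     {\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right]
      - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}
```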
4.2. Experimental Results
In this subsection, we present experimental evidence of the robustness of the clustering quality and the superior computational efficiency of C-OPTICS in a comparison with the three state-of-the-art algorithms OPTICS, DeLi-Clu, and SOPTICS. Section 4.2.1 describes the experimental results on the clustering quality of C-OPTICS and SOPTICS based on the two approaches used to evaluate the clustering quality introduced in Section 4.1.3. Section 4.2.2 presents the results of the experiments on scalability and dimensionality to evaluate the computational efficiency of C-OPTICS.
4.2.1. Clustering Quality
In this subsection, we present the robustness of the clustering quality of C-OPTICS through a comparison with OPTICS and SOPTICS. Here, DeLi-Clu is excluded from the evaluation of clustering quality because DeLi-Clu creates a reachability plot identical to that of OPTICS. Note that the clustering quality of each algorithm is evaluated with OPTICS as the ground truth, as mentioned in Section 4.1.3. Following the two evaluation approaches of Section 4.1.3, we first compare the reachability plots directly.
Figure 9 presents a visual comparison of the reachability plots for PAMAP2. Figure 9a–c show the reachability plots of OPTICS, C-OPTICS, and SOPTICS, respectively. The reachability plots of the algorithms look similar; however, many differences are observed for a given threshold ε′. Figure 9a,b show that four identical clusters are extracted from OPTICS and C-OPTICS. On the other hand, Figure 9c shows that, unlike OPTICS, only two clusters are extracted from SOPTICS. This result is due to the accumulation of deformations in the dataset caused by repeated random projection, which creates a reachability plot different from that of OPTICS. Conversely, C-OPTICS creates a reachability plot identical to that of OPTICS, as shown in Figure 9b, because it identifies the essential distance computations needed to guarantee an exact reachability plot.
Figure 9.
Visual comparison of reachability plots for PAMAP2: (a) OPTICS; (b) C-OPTICS; (c) SOPTICS.
Figure 10 shows the ARI with respect to the value of ε for each algorithm. Here, each dot on a curve gives the minimum ARI and is associated with a vertical bar that indicates the corresponding maximum ARI. This experiment compared the influence of ε on the ARI of each algorithm for the three real datasets HT, Household, and PAMAP2 and the synthetic dataset BIRCH2. Figure 10 shows that C-OPTICS achieves an ARI of 1 for all datasets: the linkage constraints guarantee that the reachability distance of each object is identical to that of OPTICS, so the reachability plot formed by C-OPTICS is identical to that of OPTICS. Furthermore, these results show that C-OPTICS is robust and does not depend on ε. SOPTICS creates different reachability plots for all datasets, and its maximum and minimum ARI are influenced by the value of ε. In most cases, when ε is small, the difference between the minimum and maximum ARI is large, because the smaller the value of ε, the more random projections are performed and the more deformations can accumulate. Figure 11 shows the ARI with respect to the number of dimensions for the Gaussian dataset. This experiment focuses on the dependence of the ARI of each algorithm on the number of dimensions. According to Figure 11, C-OPTICS maintains an ARI of 1 even as the number of dimensions increases. In other words, C-OPTICS creates a reachability plot identical to that of OPTICS regardless of the number of dimensions, since the strategy that improves the computational efficiency of C-OPTICS does not depend on the number of dimensions. However, as the number of dimensions increases, the minimum ARI of SOPTICS decreases. This occurs because the amount of random projection performed by SOPTICS grows with the number of dimensions, just as it does with small ε. Thus, the difference between the minimum and maximum ARI becomes large, as shown in Figure 11.
Figure 10.
Clustering quality quantified by adjusted Rand index (ARI) vs. ε: (a) HT; (b) Household; (c) PAMAP2; (d) BIRCH2.
Figure 11.
Clustering quality quantified by ARI vs. dimensionality for the Gaussian dataset.
4.2.2. Computational Efficiency
This subsection evaluates the computational efficiency of C-OPTICS and the three state-of-the-art algorithms using the three real datasets HT, Household, and PAMAP2 and the two synthetic datasets BIRCH2 and Gaussian (50 dimensions). Although there are distributed algorithms that improve the running time of OPTICS, this paper excludes them because we do not consider a distributed environment. Figure 12 shows the comparison of running times for each algorithm according to the sampling ratios of the two synthetic datasets. Note that the y-axis is presented on a logarithmic scale. In addition, if the running time of an algorithm exceeded the time limit, it does not appear on the graph; this indicates that the algorithm did not terminate within 24 h and was therefore not considered in further experiments. As shown in Figure 12a, at a sampling rate of 5% for BIRCH2, C-OPTICS improved the running time fivefold over OPTICS. As the sampling rate increases (i.e., as the size of the dataset increases), C-OPTICS improves the running time by up to 25 times over OPTICS, and by up to eight times over SOPTICS, the fastest of the other algorithms. Figure 12b shows the running time comparison for the 50-dimensional Gaussian dataset. Here, C-OPTICS shows an improvement of up to 100 times over OPTICS, as the efficiency of the spatial indexing structure decreases significantly at high dimensions; a similar trend holds in the comparison with DeLi-Clu. SOPTICS, which does not use a spatial indexing structure, achieves a running time similar to that of C-OPTICS because it is not influenced by the number of dimensions. Similar trends were observed in the experiments on the three real datasets, as shown in Figure 13. As the sampling rate increased, the running time of C-OPTICS improved significantly compared to that of OPTICS. For the HT dataset in Figure 13a, C-OPTICS shows an improvement of as much as 50 times over OPTICS and up to nine times over SOPTICS. For the Household dataset in Figure 13b, C-OPTICS improves on OPTICS by up to 150 times. Conversely, on this dataset C-OPTICS shows a running time nearly identical to that of SOPTICS, which corresponds to a near-worst case for C-OPTICS in which most of the objects are contained in a few vertices. Nevertheless, it is important to note that C-OPTICS still shows an improvement over SOPTICS. Likewise, for the PAMAP2 dataset in Figure 13c, C-OPTICS clearly outperforms the other algorithms. These experimental results show that C-OPTICS outperforms the other algorithms in terms of computational efficiency.
Figure 12.
Running time vs. sampling rate for synthetic datasets: (a) BIRCH2; (b) Gaussian (50).
Figure 13.
Running time vs. sampling rate for real datasets: (a) HT; (b) Household; (c) PAMAP2.
To explain the running time differences among the algorithms more directly, we compared the reduction in the total number of distance computations of each algorithm relative to OPTICS. Figure 14 shows the experimental results for all datasets, where it can be observed that the total number of distance computations of C-OPTICS is significantly reduced. Again, note that the y-axis is presented on a logarithmic scale. C-OPTICS reduces the distance computations by more than ten times compared to OPTICS in all cases. This occurs because C-OPTICS identifies and excludes unnecessary distance computations based on the constraint graph.
Figure 14.
Comparison of the total number of distance computations.
We conducted experiments on Gaussian datasets of various dimensions to evaluate the dimensionality behavior of C-OPTICS experimentally. In general, the running time increases with the number of dimensions because the cost of each distance computation is proportional to the dimensionality. Figure 15 shows that, like SOPTICS, C-OPTICS exhibits an approximately linear increase in running time with respect to the number of dimensions. In contrast, OPTICS and DeLi-Clu show an exponential increase in running time as the number of dimensions increases, because the efficiency of their spatial indexing structures degrades. As a result, C-OPTICS significantly improves scalability and addresses the quadratic time complexity of OPTICS while guaranteeing the quality of the reachability plot.
Figure 15.
Running time vs. dimensionality for the Gaussian (50d) dataset.
5. Conclusions
In this paper, we proposed C-OPTICS, which improves the running time of OPTICS by reducing unnecessary distance computations to address the quadratic time complexity of OPTICS. C-OPTICS partitions a d-dimensional dataset into unit cells with an identical diagonal length ε and constructs a constraint graph. Subsequently, C-OPTICS only computes, for each object, the reachability distance that appears in the reachability plot, through the linkage constraints in the constraint graph.
We conducted experiments on synthetic and real datasets to confirm the scalability and efficiency of C-OPTICS. C-OPTICS outperformed the state-of-the-art algorithms, and the experimental results show that it addresses the quadratic time complexity of OPTICS. Specifically, the running time with respect to the data size is improved by as much as 102 times over DeLi-Clu, and by up to nine times over SOPTICS, which creates an approximate reachability plot. We also conducted experiments on dimensionality; these results show that C-OPTICS has robust clustering quality and linear time complexity regardless of the number of dimensions.
Future research can consider methods by which the proposed algorithm can be improved. For example, C-OPTICS could construct the constraint graph without depending on the radius ε, which would provide a solution to its worst case. In addition, C-OPTICS could be extended with GPU-based parallel processing to accelerate the construction of the constraint graph.
Author Contributions
J.-H.K. and J.-H.C. designed the algorithm. J.-H.K. performed the bibliographic review and writing of the draft and developed the proposed algorithm. A.N. shared his expertise with regard to the overall review of this paper. A.N., K.-H.Y., and W.-K.L. supervised the entire process.
Funding
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A3B03035729). This work was also supported by National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2018R1A2B6009188).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Wang, Z.; Yu, Z.; Chen, C.P.; You, J.; Gu, T.; Wong, H.S.; Zhang, J. Clustering by local gravitation. IEEE T. Cybern. 2017, 48, 1383–1396. [Google Scholar] [CrossRef] [PubMed]
- Li, Z.; Chen, J. Superpixel segmentation using linear spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1356–1363. [Google Scholar]
- Fang, Z.; Yu, X.; Wu, C.; Chen, D.; Jia, T. Superpixel Segmentation Using Weighted Coplanar Feature Clustering on RGBD Images. Appl. Sci. 2018, 8, 902. [Google Scholar] [CrossRef]
- Torti, E.; Florimbi, G.; Castelli, F.; Ortega, S.; Fabelo, H.; Callicó, G.; Marrero-Martin, M.; Leporati, F. Parallel K-Means clustering for brain cancer detection using hyperspectral images. Electronics 2018, 7, 283. [Google Scholar] [CrossRef]
- Han, C.; Lin, Q.; Guo, J.; Sun, L.; Tao, Z. A Clustering Algorithm for Heterogeneous Wireless Sensor Networks Based on Solar Energy Supply. Electronics 2018, 7, 103. [Google Scholar] [CrossRef]
- Al-Shalabi, M.; Anbar, M.; Wan, T.C.; Khasawneh, A. Variants of the low-energy adaptive clustering hierarchy protocol: Survey, issues and challenges. Electronics 2018, 7, 136. [Google Scholar] [CrossRef]
- Panapakidis, I.P.; Michailides, C.; Angelides, D.C. Implementation of Pattern Recognition Algorithms in Processing Incomplete Wind Speed Data for Energy Assessment of Offshore Wind Turbines. Electronics 2019, 8, 418. [Google Scholar] [CrossRef]
- Zhang, T.; Haider, M.; Massoud, Y.; Alexander, J. An Oscillatory Neural Network Based Local Processing Unit for Pattern Recognition Applications. Electronics 2019, 8, 64. [Google Scholar] [CrossRef]
- Yaohui, L.; Zhengming, M.; Fang, Y. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. Knowl. Based Syst. 2017, 133, 208–220. [Google Scholar] [CrossRef]
- Zaiane, O.R.; Foss, A.; Lee, C.H.; Wang, W. On data clustering analysis: Scalability, constraints, and validation. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan, 6–8 May 2002; pp. 28–39. [Google Scholar]
- Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
- Ankerst, M.; Breunig, M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60. [Google Scholar]
- Patwary, M.A.; Palsetia, D.; Agrawal, A.; Liao, W.K.; Manne, F.; Choudhary, A. Scalable parallel OPTICS data clustering using graph algorithmic techniques. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 17–22 November 2013; pp. 49–60. [Google Scholar]
- Gunawan, A.; de Berg, M. A Faster Algorithm for DBSCAN. Master’s Thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2013. [Google Scholar]
- Achtert, E.; Böhm, C.; Kröger, P. DeLi-Clu: Boosting robustness, completeness, usability, and efficiency of hierarchical clustering by a closest pair ranking. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 9–12 April 2006; pp. 119–128. [Google Scholar]
- Schneider, A.; Vlachos, M. Scalable density-based clustering with quality guarantees using random projections. Data Min. Knowl. Discov. 2017, 31, 972–1005. [Google Scholar] [CrossRef]
- Beckmann, N.; Kriegel, H.P.; Schneider, R.; Seeger, B. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, USA, 23–25 May 1990; pp. 322–331. [Google Scholar]
- Brecheisen, S.; Kriegel, H.P.; Pfeifle, M. Multi-step density-based clustering. Knowl. Inf. Syst. 2006, 9, 284–308. [Google Scholar] [CrossRef][Green Version]
- Lee, W.; Loh, W.K. G-OPTICS: Fast ordering density-based cluster objects using graphics processing units. Int. J. Web Grid Serv. 2018, 14, 273–287. [Google Scholar] [CrossRef]
- Breunig, M.M.; Kriegel, H.P.; Sander, J. Fast hierarchical clustering based on compressed data and optics. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 13–16 September 2000; pp. 232–242. [Google Scholar]
- Vágner, A. The GridOPTICS clustering algorithm. Intell. Data Anal. 2016, 20, 1061–1084. [Google Scholar] [CrossRef]
- Stuetzle, W. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 2003, 20, 25–47. [Google Scholar] [CrossRef]
- Hartigan, J.A.; Mohanty, S. The runt test for multimodality. J. Classif. 1992, 9, 63–70. [Google Scholar] [CrossRef]
- Campello, R.J.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 2015, 10, 5–56. [Google Scholar] [CrossRef]
- Bryant, A.; Cios, K. RNN-DBSCAN: A density-based clustering algorithm using reverse nearest neighbor density estimation. IEEE Trans. Knowl. Data Eng. 2018, 30, 1109–1121. [Google Scholar] [CrossRef]
- Blake, C.; Merz, C. UCI Repository of Machine Learning Database; UCI: Irvine, CA, USA, 1998. [Google Scholar]
- Huerta, R.; Mosqueiro, T.; Fonollosa, J.; Rulkov, N.F.; Rodriguez-Lujan, I. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemom. Intell. Lab. Syst. 2016, 157, 169–176. [Google Scholar] [CrossRef]
- Reiss, A.; Stricker, D. Introducing a new benchmarked dataset for activity monitoring. In Proceedings of the International Symposium on Wearable Computers, Boston, MA, USA, 11–15 November 2012; pp. 108–109. [Google Scholar]
- Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, QC, Canada, 4–6 June 1996; pp. 103–114. [Google Scholar]
- Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
- Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clustering comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).