A Novel 2D Clustering Algorithm Based on Recursive Topological Data Structure

Abstract: In the field of data science and data mining, the problem of clustering features and determining the optimal number of clusters is still under research. This paper presents a new 2D clustering algorithm based on a mathematical topological theory that uses a pseudometric space and takes into account the local and global topological properties of the data to be clustered. Taking into account the cluster symmetry property, from a metric and mathematical-topological point of view, the analysis was carried out only in the positive region, reducing the number of calculations in the clustering process. The new clustering theory is inspired by the thermodynamic principle of energy. The proposed model is based on the interaction of particles, defined by measuring a homogeneous-energy criterion. Based on the energy concept, both the general and local topologies are recursively taken into account for clustering. The effect that integrating a new element into a cluster has on the homogeneous-energy criterion is analyzed. If the new element does not alter the homogeneous-energy of a group, then it is added; otherwise, a new cluster is created. The mathematical-topological theory and the results of its application on public benchmark datasets are presented.


Introduction
Interest in clustering and classification has increased significantly owing to the large amount of digital information available on the internet, especially on social media. The task of grouping information that shares common characteristics or meaning, and its subsequent classification, is a cornerstone in areas such as data science, data mining, and pattern recognition. Despite the important advances made in clustering and classification algorithms, the problem still attracts the attention of researchers.
Clustering algorithms are commonly categorized according to their clustering paradigm and to the amount and dimension of the data to be handled. Several review papers have been reported in the literature. Saxena et al. [1] classified them into two main groups, i.e., hierarchical and partitional algorithms. The hierarchical algorithms are further subdivided into agglomerative and divisive algorithms, while partitional algorithms are subdivided into density-based, distance-based, and model-based algorithms.
Another review paper is presented by Dong et al. [2], where the main attention was paid to analyzing the clustering algorithms from the supervised and unsupervised perspectives. Nevertheless, the most widely accepted classification is that of [3], in which the authors divided the algorithms into Partitional, Hierarchical, Density-Based, Grid-Based, Spectral, Gravitational, Neural Network-Based, and Evolutionary-Based. We adopt a similar classification, described in detail in Section 2.
In this research paper, a new 2D clustering algorithm is presented. The proposal is based on the energy and homogeneity of topological theory. Taking into account the cluster symmetry property, from a metric and mathematical-topological point of view, the analysis is done only in the positive region, reducing the number of calculations in the clustering process. The proposal works according to the homogeneous-energy effect that is calculated when a cluster receives a new element. If the new element does not alter the local homogeneous-energy of the cluster, then it is added to the cluster. Otherwise, a new cluster is generated. The algorithm terminates when every element has been assigned to a cluster.
Our approach was developed to work in two-dimensional space, and it was tested and validated using the two-dimensional Shape datasets, widely used in the data clustering literature, from the Machine Learning group of the School of Computing, University of Eastern Finland [4]. The Shape dataset was chosen because of its levels of complexity: spherically separable point distributions (R15, D31), embedded classes (Jain, Flame, and Compound), and complex distributions (Spiral and Pathbased). Such data distributions are difficult for a single clustering algorithm to cluster correctly, which the proposed methodology achieves.
The rest of the paper is organized as follows. In Section 2, state of the art is presented. The proposal is presented in Section 3. Topology-based theory is given in Section 4. Section 5 presents the methodology. Experimental results and discussion are given in Section 6. Finally, conclusions and perspectives are given in Section 7.

State of the Art
Clustering is an unsupervised learning method that classifies unlabeled data objects into several groups based on the similarities among them. Its main characteristic is that prior knowledge of the data is not required. Clustering algorithms are used extensively in data science and data mining, where the objective is to group the information that has common characteristics and to define the optimal number of groups. Figure 1 shows the common families of clustering algorithms: Partitional, Hierarchical, Density-Based, Grid-Based, Spectral, Gravitational, Neural Network-Based, and Evolutionary Clustering [3]. There is another clustering technique based on semantic definition [5], but it has the disadvantage that it works for supervised clustering only. The main works related to clustering algorithms are presented in [6][7][8][9][10][11][12][13][14]. The advantages and disadvantages of Partitional, Hierarchical, Density-Based, Spectral, Gravitational, and Evolutionary Clustering are summarized in Table 1.
Although a wide range of clustering algorithms is available, none of the aforementioned algorithms is self-sufficient for all types of clustering problems. Published clustering methodologies have been conceived for each type of database to be clustered. The reported algorithms do not take into account the topology of the data, neither local nor global, which confines them to working with local densities or general distance criteria. This proposal defines a homogeneous-energy measure that takes into account the local and global topological properties of the data to be clustered (see Sections 3 and 4).
Table 1. Clustering algorithms with their merits and demerits.
Easy and simple implementation.
Not appropriate for non-hyperspherical clusters.
Optimization problem (non-guaranteed minimum value).
Sensitive to noise and cluster initialization.
Trouble finding the optimal number of groups automatically.
Less robust to noise.
Automatically finds the number of clusters.
Computationally heavy. Main disadvantage is defining a priori density function.
Automatic data quantity management.
Preserving time-frequency correspondence.
Experience to analyze in the frequency domain.
Representation in the complex plane. Computationally heavy.
Optimization problem (non-guaranteed minimum value).
Automatically finds the number of clusters. Suitable for uninformed cluster shape.
Optimization problem (non-guaranteed minimum value).
Our proposal is an attempt to overcome the main limitations of typical clustering algorithms. With the topological model based on a mathematical pseudometric, unlike partitional and gravitational algorithms, prior knowledge of the data is no longer needed. Thus, it is not necessary to define any density function, and the restriction to hyper-spherical separation functions is avoided. Moreover, because the method is based on the algebra of sets, it does not occupy excessive memory as the density-based, spectral, and gravitational algorithms do, and the calculations and results are obtained faster. This research paper evaluates the proposed clustering on synthetic benchmark datasets.

Proposal
There are many clustering algorithms focused on grouping elements according to their feature distribution. As presented in the previous section, each family of clustering algorithms (Partitional, Hierarchical, Density-Based, and so on) solves a particular clustering problem. Our proposal takes into account both the local and the global topology of the dataset, which makes it more general and able to cluster different types of data. The measure of similarity between the elements to be clustered is defined via a pseudometric, and the clustering criterion is based on an affinity function and a homogeneous-energy state. The affinity function measures how close one element is to another, while the homogeneous-energy function determines, for a closed system, the equilibrium both locally and globally. Local and global homogeneous-energies are reduced to a minimum value in equilibrium. Therefore, if we associate the elements with a homogeneous-energy function, which is obtained in principle from the affinity function, a group will be formed by common elements as long as they are kept below their homogeneous-energy level.
At the beginning, r-representatives are selected (either randomly or sequentially) from all the elements of the database to form a subset of elements representing each group. Subsequently, the affinity of the r-representatives to the rest of the elements in the database is calculated. An element is assigned to a group if the homogeneous-energy level of the group stays low or unchanged. Otherwise, a new group and a new representative are generated.
The representatives are a subset of a given group. A new element can be added to a given group if there is an affinity between that new element and the group of representatives that keeps the homogeneous-energy level of the group low or unchanged.
If the homogeneous-energy of a given group is drastically altered or rises above the level, then the new element is not related to this group and, as a consequence, it is rejected. A new group is created in which the elements are labeled as "others" because they do not have similar properties to each other.
Because the homogeneous-energy takes into account both the local and global topologies, the proposal overcomes the disadvantages listed in Table 1 (Section 2).

Theoretical Fundamentals of a Topological-Pseudometric-Based Clustering
In the context of mathematical topology, six definitions, a proposition, a theorem, and an example are provided. Definition 1 introduces the pseudometric. Definition 2 deals with set theory, i.e., empty sets and subsets. Definition 3 applies the pseudometric to both representative and database elements. Definition 4 introduces the energy function. Definition 5 deals with the pseudometric around a topological neighborhood. Finally, Definition 6 measures the pseudometric to the representatives and to the subsets of groups. The example shows how the energy varies for the same representative element when its neighbors are topologically distributed in different ways.
Definition 1. Let X be a non-empty set. The Cartesian product X × Y of sets X and Y is the set of all ordered pairs (x, y) with x ∈ X and y ∈ Y.
A pseudometric space (X, d) is a set X together with a non-negative real-valued function d : X × X → [0, ∞) (called a pseudometric function) such that, for every x, y, z ∈ X: (i) d(x, x) = 0; (ii) d(x, y) = d(y, x); (iii) d(x, z) ≤ d(x, y) + d(y, z).
Unlike a metric space, points in a pseudometric space need not be distinguishable; that is, one may have d(x, y) = 0 for distinct points x ≠ y.
The pseudometric topology is induced by the open balls B_r(x) = {y ∈ X : d(x, y) < r}, with x ∈ X and r ≥ 0, which form a basis for the topology. A topological space is said to be pseudometrizable if it can be given a pseudometric such that the pseudometric topology coincides with the given topology on the space [61].
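As an illustration of Definition 1, the following minimal Python sketch checks the pseudometric conditions (d(x, x) = 0, symmetry, and the triangle inequality) on a finite sample of points. The function names and the example function `first_coordinate_distance` are assumptions made for illustration; they are not part of the original method.

```python
from itertools import product


def is_pseudometric(points, d, tol=1e-12):
    """Check the pseudometric conditions for d on a finite sample of points."""
    for x, y, z in product(points, repeat=3):
        if d(x, y) < 0:                         # non-negativity
            return False
        if abs(d(x, x)) > tol:                  # d(x, x) = 0
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True


def first_coordinate_distance(p, q):
    """Hypothetical pseudometric: compare 2D points by their first coordinate only."""
    return abs(p[0] - q[0])


sample = [(0.0, 0.0), (0.0, 5.0), (1.0, 2.0), (3.0, -1.0)]
print(is_pseudometric(sample, first_coordinate_distance))
# Two distinct points at distance zero: allowed in a pseudometric space.
print(first_coordinate_distance((0.0, 0.0), (0.0, 5.0)))
```

The check is only a finite-sample test, so a `True` result does not prove the conditions on the whole space; it is meant as a quick sanity check of a candidate function.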

Definition 2. An exact cover C(X) of a set X is a family C(X) = {C_i | i ∈ I} of nonempty subsets of X such that the following conditions are satisfied:
• For each i ∈ I, C_i ⊂ X.
• For i, j ∈ I, i ≠ j implies C_i ∩ C_j = ∅.
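The conditions of Definition 2 can be checked directly on finite sets. The sketch below is a hedged illustration: the function name `is_exact_cover` is hypothetical, and the final covering check (the union of the family equals X) is an assumption implied by the word "cover" rather than a condition listed above.

```python
def is_exact_cover(X, family):
    """Check that `family` consists of nonempty, pairwise disjoint subsets of X.

    Assumption: the union of the family is also required to equal X.
    """
    X = set(X)
    seen = set()
    for C in family:
        C = set(C)
        if not C or not C <= X:   # each C_i is a nonempty subset of X
            return False
        if seen & C:              # C_i and C_j are disjoint for i != j
            return False
        seen |= C
    return seen == X              # assumed covering condition


print(is_exact_cover({1, 2, 3, 4}, [{1, 2}, {3}, {4}]))
print(is_exact_cover({1, 2, 3, 4}, [{1, 2}, {2, 3}, {4}]))
```

In clustering terms, each C_i plays the role of one cluster, so the check expresses that every element belongs to exactly one cluster.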
Definition 3. Let (X, d) be a pseudometric space and A be a subset of X. If an element x* ∈ A satisfies the condition:
for every z ∈ A, then x* is called a representative of A. The set of representatives of A is denoted by R(A). The distance between a point x ∈ X and a set A can be defined as:
Lemma 1. Let (X, d) be a pseudometric space. For each finite set A ∈ 2^X, the set of representatives R(A) is a non-empty set.
Proof. For the set A = {a_i : i = 1, ..., n}, consider
for each x ∈ A. This proves that a_i* is a representative of A.
where x * ∈ E(A). It should be noted that the energy E(A) of a set A is independent of the choice of the representatives of A. By definition, if x 1 and x 2 are representative of A, then, the following condition must be satisfied: 1

A point p preserves the energy of A if E(A ∪ {p}) ≤ E(A).
Example 1. The energy can vary among subsets that share the same representative. Let X, Y, and Z be sets in R^2 with the Euclidean metric (see Figure 2). Elements are marked with dots and the representative with a star. In X, the elements are distributed on two concentric circles with radii 1 and 2. The subsets Y and Z are obtained by removing one circle from X (see Figure 2 (Y) and (Z)). The energies of X, Y, and Z are 1.5, 2, and 1, respectively. The energy therefore changes when passing to subsets; in particular, in this example Y ⊆ X does not imply E(Y) ≤ E(X).
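The values in Example 1 can be reproduced numerically under explicit assumptions. In the sketch below, the representative is assumed to be the element with the smallest total distance to the set, and the energy is assumed to be the mean distance from the representative to the other elements. Both rules, as well as the choice of eight points per circle and all function names, are assumptions made for illustration; they are chosen because they yield the stated values 1.5, 2, and 1, and they may differ from the formal definitions.

```python
import math


def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def representative_index(A, d):
    """Assumed rule: index of the element with the smallest total distance to A."""
    return min(range(len(A)), key=lambda i: sum(d(A[i], a) for a in A))


def energy(A, d):
    """Assumed energy: mean distance from the representative to the other elements."""
    r = representative_index(A, d)
    others = [a for i, a in enumerate(A) if i != r]
    if not others:
        return 0.0
    return sum(d(A[r], a) for a in others) / len(others)


def circle(radius, n=8):
    """n evenly spaced points on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


center = (0.0, 0.0)
inner, outer = circle(1.0), circle(2.0)
X = [center] + inner + outer
Y = [center] + outer   # X without the inner circle
Z = [center] + inner   # X without the outer circle

for name, A in (("X", X), ("Y", Y), ("Z", Z)):
    print(name, round(energy(A, euclidean), 6))
```

Up to floating-point rounding, the three energies come out as 1.5, 2, and 1, so the subset Y has a higher energy than X even though Y ⊆ X.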
One can construct a sequence {A_j : j = 1, ..., n} that satisfies E(A_m) ≤ E(A_n) for n ≤ m and such that A_j → st(A, λ). Therefore E(st(A, λ)) ≤ E(A).
Theorem 1. Let (X, d) be a pseudometric space, and δ ≥ 0. There exists an exact cover C(X) = D ∪ {C_i : i ∈ I} of X where {C_i : i ∈ I} is a δ-exact cover of X \ D, and the element D does not preserve the energy of C_j for each j ∈ I.

Proof. We construct a family of sets
which satisfies the following properties for each i, j = 1, ..., n.
Stand x 1 , x 2 ∈ X. On the condition d(x 1 , On the other hand, taking n and m to be two integers, and without loss of generality, next condition d(x m , x n ) < δ is fulfilled for m ≤ n. Let C 1 = {x m , x n } and D 1 = {x 1 , x 2 , . . . , x m−1 , x m+1,...,x n−1 }, Subsequently, take C 1 = st(C 1 , δ) and considering F 1 = D 1 ∩ {C 1 }. Taking into account the proposition 1 then inequalities stand: consequently, the condition C 1 = st(C 1 , δ) is satisfied. It has been shown that properties (i), and (ii) are completely satisfied.

Methodology
The proposed clustering methodology is shown in Figure 3. Its main steps are described below.

1. The datasets are read: The datasets are read as ASCII files, where the first column holds the x-axis values and the second column holds the y-axis values.

2. The Manhattan distance is calculated over all the data: The distance between the elements is measured with the Manhattan distance, here in two dimensions.

3. The local and general pseudometrics are evaluated: Based on the pseudometric of Definitions 1-3, the local and general topologies are taken into account in order to measure the cluster energy defined in Definition 4.

4. The appropriate cluster is chosen: If the energy of a cluster is not affected by the new element, the element is integrated into that cluster; otherwise, a new cluster is generated.

5. The homogeneous-energy of the clusters is evaluated: At this stage, the energy of each cluster is recalculated in order to update it.

If all the elements are assigned to a cluster, then the algorithm ends.
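Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the LabVIEW implementation used in the experiments: the function names are hypothetical, the file is assumed to be whitespace-separated, and any column after the second (for example a class label) is assumed to be ignored.

```python
import io


def read_points(stream):
    """Parse whitespace-separated rows; the first two columns are x and y."""
    points = []
    for line in stream:
        fields = line.split()
        if len(fields) >= 2:
            points.append((float(fields[0]), float(fields[1])))
    return points


def manhattan(p, q):
    """Two-dimensional Manhattan distance."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def distance_matrix(points, d=manhattan):
    """Pairwise distances over the whole dataset."""
    return [[d(p, q) for q in points] for p in points]


# Hypothetical three-row file standing in for a dataset on disk.
sample = io.StringIO("1.0 2.0 1\n4.0 6.0 1\n1.0 2.5 2\n")
pts = read_points(sample)
D = distance_matrix(pts)
print(D[0][1])   # |1 - 4| + |2 - 6| = 7.0
```

For a real file, `read_points(open(path))` would take the place of the in-memory `io.StringIO` sample. The full matrix needs memory quadratic in the number of points, which is acceptable for the small 2D benchmark sets considered here.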

Datasets
In order to test the proposed algorithm and its robustness in the automatic generation of clusters, synthetic datasets of two-dimensional points were used. The datasets were taken from the Machine Learning group of the School of Computing, University of Eastern Finland [4]. They are shown in Figure 4, and their characteristics are given in Table 2. Eight datasets were tested with the topological algorithm: Aggregation, Compound, Pathbased, Spiral, D31, R15, Jain, and Flame. The Shape dataset was chosen because of its levels of complexity: spherically separable point distributions (R15, D31), embedded classes (Jain, Flame, and Compound), and complex distributions (Spiral and Pathbased). Such data distributions are difficult for a single clustering algorithm to cluster correctly; with our topological-pseudometric-based clustering proposal, it is possible.

Metric and Distance
Consider the set Γ of points (P). The points correspond to a synthetic dataset (see Figure 4). According to Section 4, a pseudometric ρ is defined for two given points as follows:
A pseudometric has thus been built on the space of the points, so Theorem 1 guarantees that clusters can always be created. The Manhattan distance is used in all the tests.

Add the Object into the Suitable Cluster
Starting from the representatives of each cluster, the new homogeneous-energy measure is calculated for each new element to be added to the cluster. The homogeneities are calculated using Theorem 1 and Proposition 1 (see Section 4) for each representative. The new element is then assigned to the cluster whose homogeneous-energy it does not affect. Otherwise, a new cluster is created and the new element is taken as its representative.
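The assignment step can be sketched as follows, again under stated assumptions. The energy is assumed to be the mean Manhattan distance from the representative (taken as the element with the smallest total distance to its cluster) to the other members, and a hypothetical tolerance `delta` stands in for the δ parameter of Theorem 1: a point is accepted when the energy after insertion does not exceed the larger of the current energy and `delta`. This acceptance rule is a simplification introduced for the example, not the criterion derived from Theorem 1 and Proposition 1.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def energy(cluster, d=manhattan):
    """Assumed energy: mean distance from the representative to the other members."""
    r = min(range(len(cluster)),
            key=lambda i: sum(d(cluster[i], a) for a in cluster))
    others = [a for i, a in enumerate(cluster) if i != r]
    if not others:
        return 0.0
    return sum(d(cluster[r], a) for a in others) / len(others)


def assign(point, clusters, d=manhattan, delta=1.5):
    """Add `point` to the admissible cluster with the lowest resulting energy,
    or open a new cluster with `point` as its representative."""
    best, best_energy = None, None
    for cluster in clusters:
        new_energy = energy(cluster + [point], d)
        # Assumed admissibility rule: energy preserved up to the tolerance delta.
        if new_energy <= max(energy(cluster, d), delta):
            if best is None or new_energy < best_energy:
                best, best_energy = cluster, new_energy
    if best is None:
        clusters.append([point])
    else:
        best.append(point)
    return clusters


clusters = []
for p in [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (1, 1)]:
    assign(p, clusters)
print([len(c) for c in clusters])   # [4, 2]
```

With these six points and `delta=1.5`, the four points near the origin form one cluster and the two points near (10, 10) form another. Without a tolerance, a singleton cluster (energy 0 under the assumed definition) could never accept a second distinct point, which is why some δ-like slack is needed in this simplified version.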

Experimental Results
Experiments on the D-Dimension were conducted on the Shape dataset in order to test the efficiency of the clusterings produced by our topological algorithm on a varied collection of synthetic datasets (see Table 2). The goal of this set of experiments is to show how topological clustering can improve the quality and robustness of clustering on widely used benchmark datasets. Eight synthetic datasets were used, ranging from separate distributions and compact classes to spiral or circular distributions, which are extremely complex to cluster: Aggregation, Compound, Pathbased, Spiral, D31, R15, Jain, and Flame. The eight datasets were divided into three groups according to their distribution shape and difficulty of grouping. Clustering error is defined as [62]:
• Easy: The proposed algorithm works very well for the Jain, D31, R15, and Aggregation datasets (see Figure 5a-d). The clustering error is less than 2%. The remaining clustering problem lies in the Aggregation dataset (in the union of the classes on the right side; see Figure 5d).
• Medium: Results are good for the Flame and Compound datasets (see Figure 5e,f). The clustering error is less than 5%.
• Complex: For the third group, the results for the Pathbased and Spiral datasets are good (see Figure 5g,h). The clustering error is less than 10% for the Pathbased dataset and 0% for the Spiral dataset.
The proposed algorithm was implemented in LabVIEW, which is oriented toward working easily with hardware and obtaining accurate and fast measurements and results. LabVIEW makes it possible to run the whole algorithm, from reading the datasets from files to the graphical display of the information.
The algorithm begins by taking representatives randomly from the whole dataset. The distances among the representatives are used to define the parameter δ as an intermediate value: δ is initially set at 0.2% of the distance between two representatives. After the initial parameters are set, the pseudometric grouping theorem is applied, which yields k groups. We impose the criterion that groups containing less than 10 percent of the entire dataset are annulled, and their elements are re-integrated into the set Q. In each subsequent iteration, the δ value is increased by 0.2%. The algorithm ends when the set Q is empty.
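Two pieces of this iteration scheme are easy to isolate: the δ schedule and the annulment of small groups. The sketch below is illustrative only. The function names are hypothetical, and the 0.2% increase is assumed to be additive, that is, 0.2% of the distance between the two representatives is added at every iteration; a multiplicative reading would also be consistent with the text.

```python
def delta_for_iteration(rep_distance, iteration, step=0.002):
    """Assumed schedule: 0.2% of the representative distance, plus 0.2% per iteration."""
    return rep_distance * step * (iteration + 1)


def annul_small_groups(groups, total, min_fraction=0.10):
    """Keep groups holding at least `min_fraction` of the dataset.

    Elements of the annulled groups are returned as the set Q, to be
    processed again in the next iteration.
    """
    kept, Q = [], []
    for group in groups:
        if len(group) < min_fraction * total:
            Q.extend(group)
        else:
            kept.append(group)
    return kept, Q


# Hypothetical grouping of 20 elements identified by their indices.
groups = [list(range(0, 12)), [12], list(range(13, 20))]
kept, Q = annul_small_groups(groups, total=20)
print([len(g) for g in kept], Q)   # [12, 7] [12]
print(delta_for_iteration(50.0, 0), delta_for_iteration(50.0, 4))
```

In a complete loop, the grouping step would be re-run on Q with the enlarged δ until Q is empty. A practical implementation would probably also cap the number of iterations so that isolated points cannot keep the loop running indefinitely; that safeguard is a suggestion, not part of the described procedure.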
Considering local and global topologies has allowed us to define a more robust algorithm than those reported in the literature. Better results have been obtained in this research work for different characteristics of the databases to be clustered.
The proposal overcomes the problems reported for existing algorithms and removes two limitations: (1) the need for prior knowledge of the data and (2) the restriction to spherical separator functions. Another important advantage of the proposed algorithm is its use of the algebra of sets, which helps to obtain the results faster and without the excessive memory consumption of density-based, spectral, and gravitational algorithms.

Conclusions
A new 2D clustering algorithm based on a mathematical topological theory was presented in this research paper. The proposed theory of a pseudometric-based clustering model and its application to synthetic datasets worked as expected. Thus, this new method based on topology theory has worked successfully for the clustering of both easy and complex datasets. The proposal also takes into account the local and global topological properties of the data to be clustered through the definition of a homogeneous-energy measure.
The proposal overcomes the problems reported for existing algorithms without the need for (1) prior knowledge of the data or (2) the restriction to spherical separator functions. Another advantage is that, since the proposed algorithm is based on the algebra of sets, the results are computed faster and without excessive memory consumption.
The theoretical development also yields a theorem (Theorem 1) that can be applied in any space on which a pseudometric is defined.
Based on the results obtained, the clustering of n-dimensional databases will be explored, as well as the application of the proposal to large databases.