Abstract
This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. The indicators averaged over the entire ensemble are treated as characteristics of the ensemble. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds to the maximum of the optimal probability distribution. This method is developed for binary clustering as a basic procedure, and its extension to t-ary clustering is considered. Some illustrative examples of entropy-randomized clustering are given.
Keywords:
randomized clustering; Boltzmann and Fermi entropies; indicator matrix; binary clustering; t-ary clustering; finite-dimensional and functional problems
MSC:
62H30; 68T10
1. Introduction
Cluster analysis of different objects is a branch of machine learning in which labels provided by a teacher (supervisor) are replaced by internal characteristics of the objects or external characteristics of the clusters. The internal characteristics include the distances between objects within a cluster [1,2] and the similarity of objects [3]; among the external characteristics, we mention the distances between clusters [4]. As a mathematical problem, clustering has no universal statement, so clustering algorithms are often heuristic [5,6].
A highly developed area of research is the cluster analysis of large text corpora. As a rule, latent features are first detected using latent semantic analysis [7] and are then used for clustering [8,9]. Recent works have also adopted the concept of ensemble clustering [10,11].
Most clustering algorithms involve the distances between objects, measured in an accepted metric, and enumerative search algorithms with heuristic control [12]. Clustering results depend significantly on the metric, so it is very important to quantify the quality of clustering [13,14,15].
This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution [16]. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. Since clusters are treated as random objects, the indicators averaged over the entire ensemble are regarded as characteristics of the ensemble. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds in size and composition to the maximum of the optimal probability distribution.
The optimal distribution of clusters is obtained with the randomized maximum entropy estimation (MEE) method described in [16]. This method has proved effective in many machine learning and data mining problems; among other things, it gives rise to the problem of entropy-randomized clustering. This article presents that problem in more detail, including a proof of convergence of the multiplicative algorithm and logical schemes of the clustering procedures.
2. An Indicator of Data Matrices
Consider a set of n objects characterized by row vectors from the feature space . Using these vectors, we construct the following n-row matrix:
Let the distance between the ith and jth rows be defined as
where denotes an appropriate metric in the feature space . Next, we construct the distance matrix
We introduce an indicator of the matrix as the average value of the elements of the distance matrix :
Below, the objects will be included in clusters depending on the distances in (2). Therefore, the important characteristics of the matrix are the minimum and maximum elements of the distance matrix :
Note that the elements of the distance matrices of the clusters belong to the interval bounded by these minimum and maximum values.
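For concreteness, the following Python sketch computes the distance matrix, the matrix indicator, and the minimum and maximum distances. The Euclidean metric and all names are illustrative assumptions (the text only requires "an appropriate metric"), and averaging over the off-diagonal pairs is a modeling choice.

```python
import numpy as np

def distance_matrix(F, metric=None):
    """Pairwise distances between the rows of the data matrix F (n x m).
    The Euclidean metric is used here as an illustrative choice."""
    if metric is None:
        metric = lambda x, y: np.linalg.norm(x - y)
    n = F.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = metric(F[i], F[j])
    return D

def indicator(D):
    """Indicator of a matrix: the average of its off-diagonal distances
    (the averaging convention over pairs i < j is an assumption)."""
    iu = np.triu_indices(D.shape[0], k=1)
    return D[iu].mean()

def distance_range(D):
    """Minimum and maximum off-diagonal elements of the distance matrix."""
    iu = np.triu_indices(D.shape[0], k=1)
    return D[iu].min(), D[iu].max()
```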
3. Randomized Binary Clustering
The binary clustering problem is to distribute the n objects between two clusters of sizes s and n − s, respectively.
It is required to find the size and composition of the cluster.
For each fixed cluster size s, the clustering procedure consists of selecting a submatrix of s rows from the original data matrix. Once this submatrix is selected, the remaining rows form the complementary matrix, and the set of their numbers forms the other cluster.
Clearly, such a submatrix can be formed from the rows of the original matrix in as many ways as there are s-combinations of a set of n elements. For each of these choices, the complementary matrix is formed in the corresponding way.
According to the principle of randomized binary clustering, the matrix is a random object and its particular images are the realizations of this object. The sets of its elements and the number of rows s are therefore random.
A realization of the random object is a set of row vectors from the original matrix:
We renumber this set as follows:
Thus, the randomization procedure yields a finite ensemble of the form
Recall that the matrices in this ensemble are random. Hence, we assume the existence of probabilities for realizing the ensemble elements, where s and k denote the cluster size and cluster realization number, respectively:
Then, the randomized binary clustering problem reduces to determining a discrete probability distribution , which is appropriate in some sense.
Let such a function be obtained; according to the general variational principle of statistical mechanics, the realized matrix will be
This matrix corresponds to the most probable cluster of objects with the numbers
The other cluster consists of the remaining objects with the numbers
Generally speaking, there are many such clusters but they all contain the same objects.
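To fix ideas, here is a minimal Python sketch of how the ensemble of cluster realizations for a given size s can be enumerated; the function names are illustrative.

```python
from itertools import combinations
from math import comb

def cluster_ensemble(n, s):
    """All realizations of a random cluster of size s out of n objects:
    each realization k is a tuple of s row indices, and the complementary
    cluster consists of the remaining indices."""
    return list(combinations(range(n), s))

# The ensemble size is the binomial coefficient C(n, s); for example,
# C(6, 3) = 20 realizations.
assert len(cluster_ensemble(6, 3)) == comb(6, 3)
```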
4. Entropy-Optimal Distribution
Consider the cluster of size the associated matrix
and the distance matrix
We define the matrix indicator in (4) for the cluster as
Since the matrices are treated as random objects, their ensemble has a probability distribution. We introduce the average indicator in the form
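Since the extracted formula is unavailable, the following display uses assumed notation: P(s, k) is the probability of realization k of a cluster of size s, and I_{s,k} is the indicator of that realization.

```latex
% Assumed notation: P(s,k) is the probability of realization k of a
% cluster of size s, and I_{s,k} is the indicator of that realization.
\bar{I} \;=\; \sum_{s}\sum_{k=1}^{C_{n}^{s}} P(s,k)\, I_{s,k}
```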
5. Parametrical Problems (18)–(20)
We treat Equations (18)–(20) as finite-dimensional: the objective function (entropy) and the constraints both depend on the finite-dimensional vector composed of the values of the two-dimensional probability distribution :
The dimension of this vector is
The constraints in (19) can be omitted by considering the Fermi entropy [18] as the objective function. Performing standard transformations, we arrive at a finite-dimensional entropy-linear programming problem [19] with the form
To solve this problem, we employ the Karush–Kuhn–Tucker theorem [20], expressing the optimality conditions in terms of Lagrange multipliers and a Lagrange function. For Equation (23), the Lagrange function has the form
The optimality conditions for the saddle point of the Lagrange function in (24) are written as
The first condition in (25) is analytically solvable with respect to the components of the vector :
The second condition in (25) yields the inequalities
and the condition in (26) yields the following equations:
The non-negative solution of these inequalities and equations can be found using a multiplicative algorithm [19] with the form
Here, is a parameter assigned based on the -convergence conditions of the iterative process in (30).
The algorithm in (30) is said to be -convergent if there exists a set in the space and scalars and such that, for all and this algorithm converges to the solution of Equation (30), and the rate of convergence in the neighborhood of is linear.
Proof.
Consider an auxiliary system of differential equations obtained from (30) as :
First, we have to establish its stability in the large, i.e., under any initial deviations in the space
Second, we have to demonstrate that the algorithm in (30) is a Euler difference scheme for Equation (31) with an appropriate value
Let us describe some details of the proof. We define the following function in :
The function is strictly convex in . Its Hessian is
Hence, this function attains its minimum at the stationary point and serves as a Lyapunov function for the system in (31), which establishes stability in the large.
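The multiplicative iteration in (30) is not reproduced verbatim in the extracted text. The following Python sketch shows one generic multiplicative scheme for the dual variable of an entropy problem with a single constraint on the averaged indicator; it is an illustrative assumption consistent with the exponential form of the entropy-optimal distribution, not the paper's exact formula.

```python
import numpy as np

def multiplicative_dual_iteration(indicators, target, gamma=0.1, tol=1e-10, max_iter=20000):
    """Generic multiplicative iteration for the dual variable of the problem
        maximize entropy of p  subject to  sum_k p_k * indicators_k = target.
    The entropy-optimal distribution has the exponential form
    p_k ~ z**indicators_k with z = exp(-lambda) > 0; the multiplier z is
    corrected multiplicatively until the constraint residual vanishes.
    This is a sketch, not a verbatim reproduction of Equation (30)."""
    indicators = np.asarray(indicators, dtype=float)
    z = 1.0
    for _ in range(max_iter):
        w = z ** indicators            # unnormalized probabilities
        p = w / w.sum()
        achieved = p @ indicators      # current value of the averaged indicator
        if abs(achieved - target) < tol:
            break
        # multiplicative correction: shrink z if the average is too large,
        # grow it otherwise (the achieved average is monotone in z)
        z *= (target / achieved) ** gamma
    return p, z
```

With a sufficiently small step parameter gamma and a target value lying between the minimal and maximal indicators, this fixed-point scheme converges to the unique dual solution, in line with the convergence statement above.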
By the general principle of statistical mechanics, the realized cluster corresponds to the maximum of the probability distribution:
5.1. Randomized Binary Clustering Algorithms
This section presents the clustering procedures as logical schemes (step-by-step algorithms); a minimal Python sketch follows each scheme.
5.1.1. Algorithm with a Given Cluster Size s
1. Calculating the numerical characteristics of the data matrix.
2. Forming the matrix ensemble:
   (a) forming the correspondence table;
   (b) constructing the matrices;
   (c) calculating the elements of the distance matrices in (15);
   (d) calculating the indicator of each matrix in the ensemble.
3. Determining the Lagrange multipliers for the finite-dimensional problem:
   (a) specifying the initial values for the Lagrange multipliers;
   (b) iterating the multiplicative algorithm in (30) until convergence;
   (c) determining the optimal probability distribution;
   (d) determining the most probable cluster;
   (e) determining the complementary cluster.
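As a complement to the logical scheme above, here is a compact Python sketch of the whole procedure for a fixed cluster size s. It reuses distance_matrix, indicator, and multiplicative_dual_iteration from the earlier sketches, and the constraint level (the indicator of the full data matrix) is an assumption.

```python
import numpy as np
from itertools import combinations

def binary_clustering_fixed_s(F, s):
    """Entropy-randomized binary clustering with a given cluster size s (sketch).
    The ensemble of all s-subsets is enumerated, the indicator of each
    realization is computed, and the entropy-optimal distribution over the
    realizations is obtained from the multiplicative iteration."""
    n = F.shape[0]
    D = distance_matrix(F)
    target = indicator(D)                               # constraint level (assumed)

    clusters = list(combinations(range(n), s))          # ensemble of realizations
    I = np.array([indicator(D[np.ix_(c, c)]) for c in clusters])

    p, _ = multiplicative_dual_iteration(I, target)     # entropy-optimal distribution
    k_star = int(np.argmax(p))                          # most probable realization
    cluster = set(clusters[k_star])
    complement = set(range(n)) - cluster
    return cluster, complement, p
```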
5.1.2. Algorithm with an Unknown Cluster Size
1. Applying step 1 of the algorithm from Section 5.1.1.
2. Organizing a loop with respect to the cluster size:
   (a) applying step 2 of the algorithm from Section 5.1.1;
   (b) applying step 3 of the algorithm from Section 5.1.1;
   (c) storing the resulting optimal distribution in memory;
   (d) calculating the conditionally maximum value of the entropy;
   (e) storing this value in memory;
   (f) if the cluster size has not reached its upper bound, returning to step 2a;
   (g) determining the maximum element of the array of stored entropy values;
   (h) extracting the corresponding probability distribution;
   (i) executing steps 3d and 3e of the algorithm from Section 5.1.1.
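A minimal sketch of the outer loop over the unknown cluster size follows. It reuses binary_clustering_fixed_s from the previous sketch; the admissible size range is an illustrative assumption, while the selection criterion (the maximal conditionally optimal entropy) follows the scheme above.

```python
import numpy as np

def binary_clustering_unknown_s(F, s_min=2, s_max=None):
    """Loop over candidate cluster sizes, store the conditionally maximal
    entropy for each size, and keep the size (and distribution) with the
    largest entropy. Note that the ensemble size grows combinatorially
    with n, so this brute-force sketch is only feasible for small n."""
    n = F.shape[0]
    if s_max is None:
        s_max = n - 2            # leave at least two objects in the other cluster (assumption)
    best = None
    for s in range(s_min, s_max + 1):
        cluster, complement, p = binary_clustering_fixed_s(F, s)
        entropy = -np.sum(p * np.log(p + 1e-300))   # conditionally maximal entropy for this s
        if best is None or entropy > best[0]:
            best = (entropy, s, cluster, complement, p)
    return best
```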
6. Functional Problems (18)–(20)
Consider a parametric family of constrained entropy maximization problems of the form
The solutions of (22) and (23) coincide for some (unknown) value of the parameter. It can be determined by solving Equation (23) over a range of parameter values and recording the values of the entropy functional; the maximum value corresponds to the desired parameter value.
Let us turn to Equation (23) with a fixed value of the parameter. It belongs to the class of Lyapunov-type problems [21]. We define a Lagrange functional as
Using the technique of Gâteaux derivatives, we obtain the stationarity conditions for the functional in (24) in the primal (functional) and dual (scalar) variables; for details, see [22,23]. The resulting optimal distribution parameterized by the Lagrange multiplier is given by
The Lagrange multiplier satisfies the equation
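The extracted formulas are unavailable here; in assumed notation, the stationarity conditions of such Lyapunov-type entropy problems lead to a distribution of exponential (Boltzmann) form and a scalar equation for the Lagrange multiplier, for example:

```latex
% Assumed notation; the exponential form is the generic solution of entropy
% maximization under one linear constraint on the averaged indicator \bar{I}.
P^{*}(s,k \mid \lambda) \;=\;
  \frac{\exp\{-\lambda\, I_{s,k}\}}{\sum_{s'}\sum_{k'} \exp\{-\lambda\, I_{s',k'}\}},
\qquad
\sum_{s}\sum_{k} P^{*}(s,k \mid \lambda)\, I_{s,k} \;=\; \bar{I}.
```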
The solution of this equation belongs to and depends on the parameter. Hence, the value of the entropy functional also depends on it. We choose
The randomized binary clustering procedure can be repeated (t − 1) times to form t clusters. At each stage, two new clusters are generated from the remaining objects of the previous stage.
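A sketch of this t-ary extension by repeated binary splitting is given below; binary_clustering_unknown_s is reused from the earlier sketch, and the stopping rule for very small remainders as well as the index bookkeeping are illustrative assumptions.

```python
def t_ary_clustering(F, t):
    """t-ary clustering by (t - 1) successive binary splits, as described above.
    At each stage the remaining objects are split into an entropy-optimal
    cluster and a remainder; the final remainder becomes the last cluster."""
    remaining = list(range(F.shape[0]))
    clusters = []
    for _ in range(t - 1):
        if len(remaining) < 4:          # too few objects left to split (assumption)
            break
        _, _, local_cluster, local_complement, _ = binary_clustering_unknown_s(F[remaining])
        # map local indices of the sub-problem back to the original object numbers
        clusters.append([remaining[i] for i in sorted(local_cluster)])
        remaining = [remaining[i] for i in sorted(local_complement)]
    clusters.append(remaining)          # the last cluster
    return clusters
```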
7. Illustrative Examples
Consider the binary clustering of iris flowers using Fisher’s Iris dataset; here, each flower is described by two features, the petal width and the petal length. The dataset contains this feature information for three species, “setosa” (1), “versicolor” (2), and “virginica” (3), with 50 two-dimensional data points per species. Below, we study species 1 and 2, taking 10 data points of each.
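The data subset used in the examples can be reproduced, for instance, with scikit-learn; which 10 points per species the paper uses is not specified, so taking the first 10 of each class is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
# Feature order in scikit-learn: sepal length, sepal width, petal length, petal width;
# the last two columns are the petal features used in the examples.
petal = X[:, 2:4]

# 10 points of "setosa" (class 0) and 10 points of "versicolor" (class 1);
# the particular choice of the first 10 samples per class is an assumption.
F = np.vstack([petal[y == 0][:10], petal[y == 1][:10]])
labels_true = np.array([0] * 10 + [1] * 10)
```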
Example 1.
The data matrix contains the numerical values of the two features for types 1 and 2; see Table 1.
Table 1.
Data matrix.
Figure 1 shows the arrangement of the data points on the plane.
Figure 1.
Data points on the two-dimensional plane.
First, we apply the algorithm described in Section 5.1.
The minimum and maximum elements are
The data matrix indicator is Let .
The ensemble of possible clusters has the size . The cluster with number has the form . The distance matrix is presented in Table 2.
Table 2.
Distance matrix for cluster .
The indicator of the matrix corresponding to the cluster is
The indicators for the clusters are shown in Figure 2.
Figure 2.
Indicators for .
The entropy-optimal probability distribution for has the form
The cluster with the maximum probability is numbered by
The cluster consists of the following data points: .
The arrangement of the clusters and is shown in Figure 3.
Figure 3.
Randomized clustering results.
A direct comparison with Figure 1 indicates a perfect match of 10/10: no clustering errors.
Example 2.
Consider another data matrix from the same dataset (Table 3).
Table 3.
Data matrix.
Figure 4 shows the arrangement of the data points.
Figure 4.
Data points on the two-dimensional plane.
As in Example 1, we apply the same algorithm.
We construct the distance matrix and find the minimum and maximum elements:
Let .
The ensemble of possible clusters has the size . The indicators for the clusters are shown in Figure 5.
Figure 5.
Indicators for .
The entropy-optimal probability distribution for has the form
The cluster with the maximum probability is numbered by
It consists of the following data points: The cluster consists of the following data points:
The arrangement of the clusters and is shown in Figure 6. A direct comparison with Figure 4 indicates a match of 8/10.
Figure 6.
Randomized clustering results.
8. Discussion and Conclusions
This paper has developed a novel concept of clustering. Its fundamental difference from conventional approaches is the generation of an ensemble of random clusters characterized by indicators, i.e., inter-object distances averaged within each cluster and then over the entire ensemble. Random clusters are parameterized by the number of objects s and their set. Therefore, the ensemble’s characteristic is the probability distribution of the clusters in the ensemble, which depends on s and k. A generalized variational principle of statistical mechanics has been proposed to find this distribution; it consists in the conditional maximization of the Boltzmann–Shannon entropy. Algorithms for solving the finite-dimensional and functional optimization problems have been developed.
An advantage of the novel randomized clustering method is its complete algorithmization, independent of the properties of the clustered data. Existing clustering methods all involve, to a greater or lesser extent, various empirical techniques related to data properties.
However, the method requires considerable computational resources to form the ensemble of random clusters and compute their indicators.
Author Contributions
Conceptualization, Y.S.P.; Data curation, A.Y.P.; Methodology, Y.S.P., A.Y.P., and Y.A.D.; Software, A.Y.P. and Y.A.D.; Supervision, Y.S.P.; Writing—original draft, Y.S.P., A.Y.P. and Y.A.D. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Ministry of Science and Higher Education of the Russian Federation, project no. 075-15-2020-799.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Mandel, I.D. Klasternyi Analiz (Cluster Analysis); Finansy i Statistika: Moscow, Russia, 1988. [Google Scholar]
- Zagoruiko, N.G. Kognitivnyi Analiz Dannykh (Cognitive Data Analysis); GEO: Novosibirsk, Russia, 2012. [Google Scholar]
- Zagoruiko, N.G.; Barakhnin, V.B.; Borisova, I.A.; Tkachev, D.A. Clusterization of Text Documents from the Database of Publications Using FRiS-Tax Algorithm. Comput. Technol. 2013, 18, 62–74. [Google Scholar]
- Jain, A.; Murty, M.; Flynn, P. Data Clustering: A Review. ACM Comput. Surv. 1999, 31, 264–323. [Google Scholar] [CrossRef]
- Vorontsov, K.V. Lektsii po Algoritmam Klasterizatsii i Mnogomernomu Shkalirovaniyu (Lectures on Clustering Algorithms and Multidimensional Scaling); Moscow State University: Moscow, Russia, 2007. [Google Scholar]
- Lescovec, J.; Rajaraman, A.; Ullman, J. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
- Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
- Zamir, O.E. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis, University of Washington, Seattle, WA, USA, 1999. [Google Scholar]
- Cao, G.; Song, D.; Bruza, P. Suffix-Tree Clustering on Post-retrieval Documents Information; University of Queensland: Brisbane, QLD, Australia, 2003. [Google Scholar]
- Huang, D.; Wang, C.D.; Lai, J.H.; Kwoh, C.K. Toward multidiversified ensemble clustering of high-dimensional data: From subspaces to metrics and beyond. IEEE Trans. Cybern. 2021. [Google Scholar] [CrossRef] [PubMed]
- Khan, I.; Luo, Z.; Shaikh, A.K.; Hedjam, R. Ensemble clustering using extended fuzzy k-means for cancer data analysis. Expert Syst. Appl. 2021, 172, 114622. [Google Scholar] [CrossRef]
- Jain, A.; Dubes, R. Clustering Methods and Algorithms; Prentice-Hall: Hoboken, NJ, USA, 1988. [Google Scholar]
- Pal, N.R.; Biswas, J. Cluster Validation Using Graph Theoretic Concept. Pattern Recognit. 1997, 30, 847–857. [Google Scholar] [CrossRef]
- Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On Clustering Validation Techniques. J. Intell. Inf. Syst. 2001, 17, 107–145. [Google Scholar] [CrossRef]
- Han, J.; Kamber, M.; Pei, J. Data Mining Concept and Techniques; Morgan Kaufmann Publishers: Burlington, MA, USA, 2012. [Google Scholar]
- Popkov, Y.S. Randomization and Entropy in Machine Learning and Data Processing. Dokl. Math. 2022, 105, 135–157. [Google Scholar] [CrossRef]
- Popkov, Y.S.; Dubnov, Y.A.; Popkov, A.Y. Introduction to the Theory of Randomized Machine Learning. In Learning Systems: From Theory to Practice; Sgurev, V., Piuri, V., Jotsov, V., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 199–220. [Google Scholar] [CrossRef]
- Popkov, Y.S. Macrosystems Theory and Its Applications (Lecture Notes in Control and Information Sciences Vol 203); Springer: Berlin, Germany, 1995. [Google Scholar]
- Popkov, Y.S. Multiplicative Methods for Entropy Programming Problems and their Applications. In Proceedings of the 2010 IEEE International Conference on Industrial Engineering and Engineering Management, Xiamen, China, 29–31 October 2010; pp. 1358–1362. [Google Scholar] [CrossRef]
- Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987. [Google Scholar]
- Ioffe, A.D.; Tikhomirov, V.M. Teoriya Ekstremalnykh Zadach (Theory of Extremal Problems); Nauka: Moscow, Russia, 1974. [Google Scholar]
- Tikhomirov, V.M.; Alekseev, V.M.; Fomin, S.V. Optimal Control; Nauka: Moscow, Russia, 1979. [Google Scholar]
- Popkov, Y.; Popkov, A. New methods of entropy-robust estimation for randomized models under limited data. Entropy 2014, 16, 675–698. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).





