Stochastic Approximate Algorithms for Uncertain Constrained K-Means Problem

Jianguang Lu; Juan Tang; Bin Xing; Xianghong Tang

doi:10.3390/math10010144

,

and

¹

State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China

²

Chongqing Innovation Center of Industrial Big-Data Co., Ltd., Chongqing 400707, China

³

School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China

⁴

Institute of Computing Science and Technology, Guangzhou University, Guangzhou 510006, China

Mathematics2022, 10(1), 144;https://doi.org/10.3390/math10010144

This article belongs to the Special Issue Advanced Mathematical Methods in Intelligent Multimedia: Security and Applications

Version Notes

Order Reprints

Abstract

The k-means problem has been paid much attention for many applications. In this paper, we define the uncertain constrained k-means problem and propose a

(1 + ϵ)

-approximate algorithm for the problem. First, a general mathematical model of the uncertain constrained k-means problem is proposed. Second, the random sampling properties of the uncertain constrained k-means problem are studied. This paper mainly studies the gap between the center of random sampling and the real center, which should be controlled within a given range with a large probability, so as to obtain the important sampling properties to solve this kind of problem. Finally, using mathematical induction, we assume that the first

j - 1

cluster centers are obtained, so we only need to solve the j-th center. The algorithm has the elapsed time

O ({(\frac{1891 e k}{ϵ^{2}})}^{8 k / ϵ} n d)

, and outputs a collection of size

O ({(\frac{1891 e k}{ϵ^{2}})}^{8 k / ϵ} n)

of candidate sets including approximation centers.

Keywords:

stochastic approximate algorithms; uncertain constrained k-means; approximation centers

1. Introduction

The k-means problem has received much attention in the past several decades. The k-means problems consists of partitioning a set P of points in d-dimensional space

R^{d}

into k subsets

P_{1}, \dots, P_{k}

such that

\sum_{i = 1}^{k} \sum_{p \in P_{i}} | | p - c_{i} {| |}^{2}

is minimized, where

c_{i}

is the center of

P_{i}

, and

| | p - q | |

is the distance between two points of p and q. The k-means problem is one of the classical NP-hard problems, and has been paid much attention in the literature [1,2,3].

For many applications, each cluster of the point set may satisfy some additional constraints, such as chromatic clustering [4], r-capacity clustering [5], r-gather clustering [6], fault tolerant clustering [7], uncertain data clustering [8], semi-supervised clustering [9], and l-diversity clustering [10]. The constrained clustering problems was studied by Ding and Xu, who presented the first unified framework in [11]. Given a point set

P \subseteq R^{d}

, and a positive integer k, a list of constraints

L

, the constrained k-means problem is to partition P into k clusters

P = {P_{1}, \dots, P_{k}}

, such that all constraints in

L

are satisfied and

\sum_{P_{i} \in P} \sum_{x \in P_{i}} | | x - c (P_{i}) {| |}^{2}

is minimized, where

c (P_{i}) = \frac{1}{| P_{i} |} \sum_{x \in P_{i}} x

denotes the centroid of

P_{i}

.

In recent years, particular research has been focused on the constrained k-means problem. Ding and Xu [11] showed the first polynomial time approximation scheme with running time

O (2^{p o l y (k / ϵ)} {(log n)}^{k} n d)

for the constrained k-means problem, and obtained a collection of size

O (2^{p o l y (k / ϵ)} {(log n)}^{k + 1})

of candidate approximate centers. The existing fastest approximation schemes for the constrained k-means problem takes

O (2^{O (k / ϵ)} n d)

time [12,13], which was first shown by Bhattacharya, Jaiswai, and Kumar [12]. Their algorithm gives a collection of size

O (2^{O (k / ϵ)})

of candidate approximate centers. In this paper, we propose the uncertain constrained k-means problem, which supposes that all points are random variables with probabilistic distributions. We present a stochastic approximate algorithm for the uncertain constrained k-means problem. The uncertain constrained k-means problem can be regarded as a generalization of the constrained k-means problem. We prove the random sampling properties of the uncertain constrained k-means problem, which are fundamental for our proposed algorithm. By applying random sampling and mathematical induction, we propose a stochastic approximate algorithm with lower complexity for the uncertain constrained k-means problem.

This paper is organized as follows. Some basic notations are given in Section 2. Section 3 provides an overview of the new algorithm for the uncertain constrained k-means problem. In Section 4, we discuss the detailed algorithm for the uncertain constrained k-means problem. In Section 5, we investigate the correctness, success probability, and running time analysis of the algorithm. Section 6 concludes this paper and gives possible directions for future research.

2. Preliminaries

Definition 1

(Uncertain constrained k-means problem). Given a random variable set

X \subseteq R^{d}

, the probability density function

f_{X} (s)

for every random variable

X \in X

, a list of constraints

L

, and a positive integer k, the uncertain constrained k-means problem is to partition

X

into k clusters

X = {X_{1}, \dots, X_{k}}

, such that all constraints in

L

are satisfied and

\sum_{X_{i} \in X} \sum_{X \in X_{i}} \int^{R^{d}} | | s - c (X_{i}) {| |}^{2} f_{X} (s) d s

is minimized, where

c (X_{i}) = \frac{1}{| X_{i} |} \sum_{X \in X_{i}} \int^{R^{d}} s f_{X} (s) d s

denotes the centroid of

X_{i}

.

Definition 2

([13]). Let

X

be a set of random variables in

R^{d}

,

f_{X} (s)

be probability density function for every random variable

X \in X

, and

q \in R^{d}

and P be a set of points in

R^{d}

,

p \in P

.

Define $f_{2} (q, X) = \sum_{X \in X} \int^{R^{d}} | | s - {q | |}^{2} f_{X} (s) d s$ .
Define $c (X) = \frac{1}{| X |} \sum_{X \in X} \int^{R^{d}} s f_{X} (s) d s$ .
Define $d i s t (X, P) = m i n_{p \in P} \int^{R^{d}} | | s - p | | f_{X} (s) d s$ .

Definition 3

([13]). Let

X

be a set of random variables in

R^{d}

,

f_{X} (s)

be the probability density function for every random variable

X \in X

, and

X_{1}, \dots, X_{k}

be a partition of

X

.

Define $m_{j} = c (X_{j})$ .
$β_{j} = \frac{| X_{j} |}{| X |}$ .
Define $σ_{j} = \sqrt{\frac{f_{2} (m_{j}, X_{j})}{| X_{j} |}}$ .
Define
$O P T_{k} (X) = \sum_{j = 1}^{k} \sum_{X \in X_{j}} \int^{R^{d}} | | s - c (X_{j}) {| |}^{2} f_{X} (s) d s = \sum_{j = 1}^{k} f_{2} (m_{j}, X_{j})$ .
Define $σ_{o p t} = \sqrt{\frac{O P T_{k} (X)}{| X |}} = \sqrt{\sum_{i = 1}^{k} β_{i} σ_{i}^{2}}$ .

Lemma 1.

For any point

x \in R^{d}

and a random variable set

X \subseteq R^{d}

,

f_{2} (x, X) = f_{2} (c (X), X) + | X | | | c (X) - {x | |}^{2}

.

Proof.

Let

f_{X} (s)

be the probability density function for every random variable

X \in X

.

\begin{matrix} (1) & f_{2} (x, X) & = \sum_{X \in X} \int^{R^{d}} | | s - {x | |}^{2} f_{X} (s) d s \\ (2) & = \sum_{X \in X} \int^{R^{d}} | | s - c (X) + c (X) - {x | |}^{2} f_{X} (s) d s \\ (3) & = \sum_{X \in X} \int^{R^{d}} | | s - {c (X) | |}^{2} f_{X} (s) d s + \sum_{X \in X} \int^{R^{d}} | | c (X) - {x | |}^{2} f_{X} (s) d s \\ (4) & = f_{2} (c (X), X) + | | c (X) - {x | |}^{2} \sum_{X \in X} \int^{R^{d}} f_{X} (s) d s \\ (5) & = f_{2} (c (X), X) + | X | | | c (X) - {x | |}^{2} . \end{matrix}

The (3) equality follows from the fact that

\sum_{X \in X} \int^{R^{d}} (s - c (X)) f_{X} (s) d s = 0

. □

Lemma 2.

Let

X

be a set of random variables in

R^{d}

and

f_{X} (s)

be the probability density function for every random variable

X \in X

. Assume that

T

is a set of random variables obtained by sampling random variables from

X

uniformly and independently. For ∀

δ > 0

, we have:

P r (| | c (T) - c (X) | |^{2} > \frac{1}{δ | T |} σ^{2}) < δ,

(6)

where

σ^{2} = \frac{1}{| X |} \sum_{X \in X} \int^{R^{d}} | | s - {c (X) | |}^{2} f_{X} (s) d s

.

Proof.

First, observe that

E (c (T)) = c (X), E (| | c (T) - {c (X) | |}^{2}) = \frac{1}{| T |} σ^{2}

(7)

where

σ^{2} = \frac{1}{| X |} \sum_{X \in X} \int^{R^{d}} | | s - {c (X) | |}^{2} f_{X} (s) d s

. Then apply the Markov inequality to obtain the following.

P r (| | c (T) - c (X) | |^{2} > \frac{1}{δ | T |} σ^{2}) < δ .

(8)

□

Lemma 3.

Let

Q

be a set of random variables in

R^{d}

,

f_{X} (s)

be the probability density function for every random variable

X \in Q

, and

Q_{1}

be an arbitrary subset of

Q

with

α | Q |

random variables for some

0 < α \leq 1

. Then

| | c (Q) - c (Q_{1}) | | \leq \sqrt{\frac{1 - α}{α}} σ

, where

σ^{2} = \frac{1}{| Q |} \sum_{X \in Q} \int^{R^{d}} | | s - {c (Q) | |}^{2} f_{X} (s) d s

.

Proof.

Let

Q_{2} = Q \ Q_{1}

. By Lemma 1, we have the following two equalities.

f_{2} (c (Q), Q_{1}) = f_{2} (c (Q_{1}), Q_{1}) + | Q_{1} | | | c (Q_{1}) - {c (Q | |}^{2},

(9)

f_{2} (c (Q), Q_{2}) = f_{2} (c (Q_{2}), Q_{2}) + | Q_{2} | | | c (Q_{2}) - {c (Q | |}^{2} .

(10)

Let

L = | | c (Q_{1}) - c (Q_{2}) | |

. By the definition of the mean point, we have:

c (Q) = \frac{1}{| Q |} \sum_{X \in Q} \int^{R^{d}} s f_{X} (s) d s = \frac{1}{| Q |} (| Q_{1} | c (Q_{1}) + | Q_{2} | c (Q_{2})) .

(11)

Thus, the three points

{c (Q), c (Q_{1}), c (Q_{2})}

are collinear, while

| | c (Q_{1}) - c (Q) | | = (1 - α) L

and

| | c (Q_{2}) - c (Q) | | = α L

. Meanwhile, by the definition of

σ

, we have

σ^{2} = \frac{1}{| Q |} (\sum_{X \in Q_{1}} \int^{R^{d}} | | s - {c (Q) | |}^{2} f_{X} (s) d s + \sum_{X \in Q_{2}} \int^{R^{d}} | | s - {c (Q) | |}^{2} f_{X} (s) d s)

. Combining Equality (9) and Equality (10), we have:

\begin{matrix} (12) & σ^{2} & \geq \frac{1}{| Q |} (| Q_{1} | | | c (Q_{1}) - {c (Q | |}^{2} + | Q_{2} | | | c (Q_{2}) - {c (Q | |}^{2}) \\ (13) & = α {((1 - α) L)}^{2} + (1 - α) {(α L)}^{2} \\ (14) & = α (1 - α) L^{2} . \end{matrix}

Thus, we have

L \leq \frac{σ}{\sqrt{α (1 - α)}}

, which means that

| | c (Q) - c (Q_{1}) | | = (1 - α) L \leq \sqrt{\frac{1 - α}{α}} σ

. □

Lemma 4

([12]). For any

x, y, z \in R^{d}

, then

| | x - {z | |}^{2} \leq 2 | | x - {y | |}^{2} + 2 | | y - {z | |}^{2}

.

Theorem 1

([14]). Let

X_{1}, \dots, X_{s}

be s, an independent random

0 - 1

variable, where

X_{i}

takes 1 with a probability of at least p for

i = 1, \dots, s

. Let

X = \sum_{i = 1}^{s} X_{i}

. Then, for any

δ > 0

,

P r (X < (1 - δ) p s) < e^{- \frac{1}{2} δ^{2} p s}

.

3. Overview of Our Method

In this section, we first introduce the main idea of our methodology to solve the uncertain constrained k-means problem.

Considering the optimal partition

X = {X_{1}, \dots, X_{k}} (| X_{1} | \geq \dots \geq | X_{k} |)

of

X

, since

| X_{1} | / | X | \geq 1 / k

, if we could sample a set

S

of size

O (k / ϵ)

from

X

uniformly and independently, then at least

O (1 / ϵ)

random variables in

S

are from

X_{1}

with a certain probability. All subsets of

S

of size

O (1 / ϵ)

could be enumerated to discover the approximate center of

X_{1}

.

We assume that

C_{j - 1} = {c_{1}, \dots, c_{j - 1}}

is the set including approximate centers of the

X_{1}, \dots, X_{j}

. Let

B_{j} = {X \in X | d i s t (X, C_{j - 1}) = m i n_{c \in C_{j - 1}} \int^{R^{d}} | | s - c | | f_{X} (s) d s \leq r_{j}}

, where

r_{j} = \sqrt{\frac{ϵ}{40 β_{j} k}} σ_{o p t}

. The set

X_{j}

is divided into two parts:

X_{j}^{o u t}

and

X_{j}^{i n}

, where

X_{j}^{o u t} = X_{j} \ B_{j}

and

X_{j}^{i n} = X_{j} \cap B_{j}

. For each random variable X, let

\tilde{X}

be the nearest point (particular random variable) in

C_{j - 1}

to X. Let

{\tilde{X}}_{j}^{i n} = {\tilde{X} | X \in X_{j}^{i n}}

, and

{\tilde{X}}_{j} = {\tilde{X}}_{j}^{i n} \cup X_{j}^{o u t}

.

If most of the random variables of

X_{j}

are in

X_{j}^{i n}

, our idea is to use the center of

{\tilde{X}}_{j}^{i n}

to approximate the center of

X_{j}

. The center of

{\tilde{X}}_{j}^{i n}

is found based on

C_{j - 1}

. If most of the random variables of

X_{j}

are in

X_{j}^{o u t}

, our ideal is to replace the center of

X_{j}

with the center of

{\tilde{X}}_{j}

. For seeking out the approximate center of

{\tilde{X}}_{j}

, we should find out a subset

S^{'}

by uniformly sampling from

{\tilde{X}}_{j}

. However, the set

X_{j}^{o u t}

is unknown. We need to find the set

S^{'} \cap X_{j}^{o u t}

. We apply a branching strategy to find a set

Q

such that

X \ B_{j} \subseteq Q

, and

| Q | < 2 | X \ B_{j} |

. Then, a random variables set

S

is obtained by sampling random variables from

Q

independently and uniformly. And the set

X \ B_{j} \subseteq Q

can be replaced by a subset

S^{*}

of

S

from

X_{j}^{o u t}

. Based on

S^{*}

and

{\tilde{X}}_{j}^{i n}

, the approximation center of

{\tilde{X}}_{j}

could be obtained. Therefore, the algorithm presented in this paper outputs a collection of size

O ({(\frac{1891 e k}{ϵ^{2}})}^{8 k / ϵ} n)

of candidate sets containing approximation centers, and has the running time

O ({(\frac{1891 e k}{ϵ^{2}})}^{8 k / ϵ} n d)

.

4. Our Algorithm cMeans

Given an instance

(X, k, L)

of the uncertain constrained k-means problem,

X = {X_{1}, \dots, X_{k}}

denotes an optimal partition of

(X, k, L)

. There exist six parameters (

ϵ

,

Q

, g, k, C, U) in our

cMeans

, where

ϵ \in (0, 1]

is the approximate factor,

Q

is the input random variable set, g is the number of centers, k is the number of the clusters, C is the set of approximate cluster centers, and U is a collection of candidate sets including the approximate center. Let

M = \frac{6}{ϵ}, N = \frac{79, 380 k}{ϵ^{3}}

, where M is the size of subsets of the sampling set and N is the size of the sampling set. Without loss of generality, assume that values of M and N are integers.

We use the branching strategy to seek out the approximate centers of clusters in

X

. There exist two branches in our algorithm

cMeans

, which can be seen in Figure 1. On one branch, a size N set

S_{1}

is obtained by sampling from

Q

uniformly and independently;

S_{2}

is constructed by

S_{1}

and M copies of each point in C. Moreover, we consider each subset

S^{'}

of size M of

S_{2}

, and the centroid c of

S^{'}

is solved to represent the approximate center of

X_{k - g + 1}

, and our algorithm

cMeans

(

ϵ, Q, g - 1, k, C \cup {c}, U

) is used to obtain the remaining

g - 1

cluster centers.

Figure 1. Flow chart of our algorithm

cMeans

.

On the other branch, for each random variable

X \in Q

, we calculate the distance between X and C first. H denotes the set of all distances of random variables in

X

to C, where H is a multi-set. We should obtain the median value m for all values in H, which is the

⌊ | H | / 2 ⌋

-th element if all of the values in H are sorted. In the second branch,

Q

is divided into two parts,

Q^{'}

and

Q^{″}

, based on m such that for

\forall X^{'} \in Q^{'}

,

X^{″} \in Q^{″}

,

d i s t (X^{'}, C) \leq d i s t (X^{″}, C)

, where

| Q^{'} | = ⌈ \frac{| Q |}{2} ⌉

,

| Q^{″} | = ⌊ \frac{| Q |}{2} ⌋

. Subroutine

cMeans

(

ϵ, Q^{″}, g, k, C, U

) is used to obtain the remaining g cluster centers. Therefore, we present the specific algorithm for seeking out a collection of candidate sets in the Algorithm 1.

Algorithm 1:

cMeans

(

ϵ, Q, g, k, C, U

)

5. Analysis of Our Algorithm $cMeans$

We investigate the success probability, correctness, and time complexity analysis of the algorithm

cMeans

in this section.

Lemma 5.

There exists a candidate set, with a probability of at least

1 / 12^{k}

, including the approximate center

C_{k} = {c_{1}, \dots, c_{k}}

in U satisfying

| | m_{j} - c_{j} {| |}^{2} \leq \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2} (1 \leq j \leq k)

.

The following Lemmas from Lemma 6 to 16 are used to prove Lemma 5. We prove Lemma 5 via induction on j. For

j = 1

, we can obtain

β_{1} \geq 1 / k

easily, and prove the success probability first.

Lemma 6.

In the process of finding

c_{1}

in our algorithm

cMeans

, by sampling a set of

79,380 k / ϵ^{3}

random variables from

X

independently and uniformly, denoted by

S_{1}

, the probability that at least

6 / ϵ

random variables in

S_{2}

are from

X_{1}

is at least

1 / 2

.

Proof.

In our algorithm

cMeans

, we assume that

S_{1} = S_{1}, \dots, S_{N}

, where

N = 79,380 k / ϵ^{3}

. Let

x_{1}^{^{'}}, \dots, x_{N}^{^{'}}

be the corresponding random variables of elements in

S_{1}

. If

S_{i} \in X_{1}

, then

x_{i}^{^{'}} = 1

. Otherwise

x_{i}^{^{'}} = 0

. It is known easily that

P r [S_{i} \in X_{1}] \geq \frac{1}{k}

. Let

x = \sum_{i = 1}^{N} x_{i}^{^{'}}, u = \sum_{i = 1}^{N} E (x_{i}^{^{'}})

. We obtain that

u \geq 79,380 k / ϵ^{3}

. Then,

\begin{matrix} (15) & P r [x > \frac{6}{ϵ}] & = 1 - P r [x \leq \frac{6}{ϵ}] \\ (16) & = 1 - P r [x \leq \frac{6 ϵ^{2}}{79,380} \frac{79,380}{ϵ^{3}}] \\ (17) & \geq 1 - P r [x \leq \frac{ϵ^{2}}{13,230} u] \\ (18) & \geq 1 - e^{- \frac{{(1 - \frac{ϵ^{2}}{13,230})}^{2} u}{2}} \\ (19) & \geq 1 - e^{- \frac{{(1 - \frac{ϵ^{2}}{13,230})}^{2} \frac{79,380}{ϵ^{3}}}{2}} \\ (20) & \geq 1 - e^{- \frac{{(1 - \frac{1}{13,230})}^{2} \cdot 79,380}{2}} \\ (21) & \geq \frac{1}{2} . \end{matrix}

□

From Lemma 6, an

S^{*}

with size

6 / ϵ

of

S_{2}

can be obtained, and the probability that all points in

S^{*}

are from

X_{1}

is at least

1 / 2

. Let

c_{1}

denote the centroid of

S^{*}

, and

δ = 5 / 6

. For

| S^{*} | = 6 / ϵ

, by Lemma 2, we conclude that

| | m_{1} - c_{1} {| |}^{2} \leq \frac{1}{5} ϵ σ_{1}^{2}

holds with a probability of at least

1 / 6

. Then, the probability that a subset

S^{*}

of size

6 / ϵ

of

S_{2}

can be found such that

| | m_{1} - c_{1} {| |}^{2} \leq \frac{1}{5} ϵ σ_{1}^{2} \leq \frac{9}{10} ϵ σ_{1}^{2} + \frac{1}{10 β_{1} k} ϵ σ_{o p t}^{2}

holds is at least

1 / 12

. Therefore, we conclude that Lemma 5 holds for

j = 1

.

Moreover, we assume that for

j \leq j_{0} (1 \leq j_{0})

, Lemma 5 holds with a probability of at least

1 / 12^{j}

. Considering the case

j = j_{0} + 1

, we prove Lemma 5 by the following two cases: (1)

| X_{j}^{o u t} | \leq \frac{ϵ}{49} β_{j} n

; (2)

| X_{j}^{o u t} | > \frac{ϵ}{49} β_{j} n

.

5.1. Analysis for Case 1: $| X_{j}^{o u t} | \leq \frac{ϵ}{49} β_{j} n$

Since

| X_{j}^{o u t} | \leq \frac{ϵ}{49} β_{j} n

, most of the random variables of

X_{j}

are in

B_{j}

. Our idea is to replace the center of

X_{j}

with the center of

{\tilde{X}}_{j}^{i n}

. Thus, we need to find the approximate center

c_{j}

of

{\tilde{X}}_{j}^{i n}

and the bound distance

| | m_{j} - c_{j} | |

. We divide the distance

| | m_{j} - c_{j} | |

into the following three parts:

| | m_{j} - m_{j}^{i n} | |

,

| | m_{j}^{i n} - {\tilde{m}}_{j}^{i n} | |

, and

| | {\tilde{m}}_{j}^{i n} - c_{j} | |

. We first study the distance between

m_{j}

and

m_{j}^{i n}

.

Lemma 7.

| | m_{j} - m_{j}^{i n} | | \leq \sqrt{\frac{ϵ}{48}} σ_{j}

.

Proof.

Since

| X_{j} | = β_{j} n

and

| X_{j}^{o u t} | \leq \frac{ϵ}{49} β_{j} n

, the proportion of

X_{j}^{i n}

in

X_{j}

is at least

1 - \frac{ϵ}{49}

. By Lemma 3,

| | m_{j} - m_{j}^{i n} | | \leq \sqrt{\frac{ϵ / 49}{1 - ϵ / 49}} σ_{j} \leq \sqrt{\frac{ϵ}{48}} σ_{j}

. □

Lemma 8.

| | m_{j}^{i n} - {\tilde{m}}_{j}^{i n} | | \leq r_{j}

.

Proof.

Since

m_{j}^{i n} = \frac{1}{| X_{j}^{i n} |} \sum_{X \in X_{j}^{i n}} \int^{R^{d}} s f_{X} (s) d s

, and

{\tilde{m}}_{j}^{i n} = \frac{1}{| X_{j}^{i n} |} \sum_{X \in X_{j}^{i n}} \tilde{X}

, we can obtain the following:

\begin{matrix} (22) & | | m_{j}^{i n} - {\tilde{m}}_{j}^{i n} | | & = | | \frac{1}{| X_{j}^{i n} |} \sum_{X \in X_{j}^{i n}} \int^{R^{d}} s f_{X} (s) d s - \frac{1}{| X_{j}^{i n} |} \sum_{X \in X_{j}^{i n}} \tilde{X} | | \\ (23) & = \frac{1}{| X_{j}^{i n} |} | | \sum_{X \in X_{j}^{i n}} \int^{R^{d}} (s - \tilde{X}) f_{X} (s) d s | | \\ (24) & \leq \frac{1}{| X_{j}^{i n} |} \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | s - \tilde{X} | | f_{X} (s) d s \\ (25) & \leq \frac{1}{| X_{j}^{i n} |} \sum_{X \in X_{j}^{i n}} r_{j} \\ (26) & = r_{j} . \end{matrix}

□

Lemma 9.

f_{2} ({\tilde{m}}_{j}^{i n}, {\tilde{X}}_{j}^{i n}) \leq 2 | X_{j}^{i n} | r_{j}^{2} + 2 f_{2} (m_{j}, X_{j}^{i n}) - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2}

.

Proof.

Since

| {\tilde{X}}_{j}^{i n} | = | X_{j}^{i n} |

, by 1, we have

f_{2} (m_{j}, {\tilde{X}}_{j}^{i n}) = f_{2} ({\tilde{m}}_{j}^{i n}, {\tilde{X}}_{j}^{i n}) + | X_{j}^{i n} | | | {\tilde{m}}_{j}^{i n} - m_{j} | |

. Then,

\begin{matrix} (27) & f_{2} ({\tilde{m}}_{j}^{i n}, {\tilde{X}}_{j}^{i n}) & = f_{2} (m_{j}, {\tilde{X}}_{j}^{i n}) - | X_{j}^{i n} | | | {\tilde{m}}_{j}^{i n} - m_{j} {| |}^{2} \\ (28) & = \sum_{X \in X_{j}^{i n}} | | \tilde{X} - m_{j} {| |}^{2} - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (29) & = \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | \tilde{X} - m_{j} {| |}^{2} f_{X} (s) d s - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (30) & = \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | \tilde{X} - s + s - m_{j} {| |}^{2} f_{X} (s) d s - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (31) & \leq \sum_{X \in X_{j}^{i n}} \int^{R^{d}} (2 | | \tilde{X} - {s | |}^{2} + 2 | | s - m_{j} {| |}^{2}) f_{X} (s) d s - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (32) & \leq 2 | X_{j}^{i n} | r_{j}^{2} + 2 \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | s - m_{j} {| |}^{2} f_{X} (s) d s - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (33) & = 2 | X_{j}^{i n} | r_{j}^{2} + 2 f_{2} (m_{j}, X_{j}^{i n}) - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \end{matrix}

□

Lemma 10.

In the process of finding

c_{j}

in our algorithm

cMeans

, for the set

S_{2}

in step 5, a subset

S^{*}

of size

6 / ϵ

of

S_{2}

can be obtained such that all random variables in

S^{*}

are from

{\tilde{X}}_{j}^{i n}

. Let

c_{j}

be the centroid of

S^{*}

. Then, the inequality

| | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \leq \frac{2}{5} ϵ r_{j}^{2} + \frac{49}{120} ϵ σ_{j}^{2} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2}

holds with a probability of at least

1 / 6

.

Proof.

For each point

p \in C_{j - 1}

,

6 / ϵ

copies of p are added to

S_{2}

in step 9 in our algorithm

cMeans

. Thus, a subset

S^{*}

of size

6 / ϵ

of

S_{2}

can be obtained such that all random variables in

S^{*}

are from

{\tilde{X}}_{j}^{i n}

. Let

δ = 5 / 6

. Since

| S^{*} | = 6 / ϵ

, by Lemma 2,

| | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \leq \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}^{i n}, {\tilde{X}}_{j}^{i n})}{| X_{j}^{i n} |}

holds with a probability of at least

1 / 6

. Assume that

| | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \leq \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}^{i n}, {\tilde{X}}_{j}^{i n})}{| X_{j}^{i n} |}

. Then,

\begin{matrix} (34) & | | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} & \leq \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}^{i n}, {\tilde{X}}_{j}^{i n})}{| X_{j}^{i n} |} \\ (35) & \leq \frac{1}{5} ϵ \frac{2 | X_{j}^{i n} | r_{j}^{2} + 2 f_{2} (m_{j}, X_{j}^{i n}) - | X_{j}^{i n} | | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2}}{| X_{j}^{i n} |} \\ (36) & = \frac{2}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ \frac{f_{2} (m_{j}, X_{j}^{i n})}{| X_{j}^{i n} |} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (37) & \leq \frac{2}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ \frac{f_{2} (m_{j}, X_{j})}{| X_{j} | - | X_{j}^{o u t} |} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (38) & \leq \frac{2}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ \frac{β_{j} n σ_{j}^{2}}{(1 - ϵ / 49) β_{j} n} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} \\ (39) & \leq \frac{2}{5} ϵ r_{j}^{2} + \frac{49}{120} ϵ σ_{j}^{2} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} . \end{matrix}

□

Lemma 11.

If

c_{j}

satisfies

| | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \leq \frac{2}{5} ϵ r_{j}^{2} + \frac{49}{120} ϵ σ_{j}^{2} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2}

, then

| | m_{j} - c_{j} {| |}^{2} \leq \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2}

.

Proof.

Assume that

c_{j}

satisfies

| | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \leq \frac{2}{5} ϵ r_{j}^{2} + \frac{49}{120} ϵ σ_{j}^{2} - \frac{1}{5} ϵ | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2}

. Then,

\begin{matrix} (40) & | | m_{j} - c_{j} {| |}^{2} & = | | m_{j} - {\tilde{m}}_{j}^{i n} + {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \\ (41) & \leq 2 | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} + 2 | | {\tilde{m}}_{j}^{i n} - c_{j} {| |}^{2} \\ (42) & \leq (2 - \frac{2}{5} ϵ) | | m_{j} - {\tilde{m}}_{j}^{i n} {| |}^{2} + \frac{4}{5} ϵ r_{j}^{2} + \frac{49}{60} ϵ σ_{j}^{2} \\ (43) & \leq (2 - \frac{2}{5} ϵ) | | m_{j} - m_{j}^{i n} + m_{j}^{i n} - {\tilde{m}}_{j}^{i n} {| |}^{2} + \frac{4}{5} ϵ r_{j}^{2} + \frac{49}{60} ϵ σ_{j}^{2} \\ (44) & \leq (2 - \frac{2}{5} ϵ) (2 | | m_{j} - m_{j}^{i n} {| |}^{2} + 2 | | m_{j}^{i n} - {\tilde{m}}_{j}^{i n} {| |}^{2}) + \frac{4}{5} ϵ r_{j}^{2} + \frac{49}{60} ϵ σ_{j}^{2} \\ (45) & \leq (2 - \frac{2}{5} ϵ) (\frac{1}{24} ϵ σ_{j}^{2} + 2 r_{j}^{2}) + \frac{4}{5} ϵ r_{j}^{2} + \frac{49}{60} ϵ σ_{j}^{2} \\ (46) & \leq \frac{9}{10} ϵ σ_{j}^{2} + 4 r_{j}^{2} \\ (47) & = \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2} . \end{matrix}

□

5.2. Analysis for Case 2: $| X_{j}^{o u t} | > \frac{ϵ}{49} β_{j} n$

Let

{\tilde{X}}_{j} = {\tilde{X}}_{j}^{i n} \cup X_{j}^{o u t}

, and

{\tilde{m}}_{j}

denote the centroid of

{\tilde{X}}_{j}

. Our idea is to replace the center of

X_{j}

with the center of

{\tilde{X}}_{j}

. But it is difficult to seek out the center of

{\tilde{X}}_{j}

. Thus, we try to find an approximate center

c_{j}

of

{\tilde{X}}_{j}

.

Lemma 12.

\frac{| X_{j}^{o u t} |}{| X \ B_{j} |} \geq \frac{ϵ^{2}}{3969 k}

.

Proof.

\begin{matrix} (48) & \frac{| X_{j}^{o u t} |}{| X \ B_{j} |} & = \frac{| X_{j}^{o u t} |}{\sum_{i = 1}^{j - 1} | X_{i} \ B_{j} | + | X_{j}^{o u t} | + \sum_{i = j + 1}^{k} | X_{i} \ B_{j} |} \\ (49) & \geq \frac{| X_{j}^{o u t} |}{\sum_{i = 1}^{j - 1} \frac{f_{2} (c_{i}, X_{i})}{r_{j}^{2}} + | X_{j}^{o u t} | + \sum_{i = j + 1}^{k} | X_{i} |} \\ (50) & \geq \frac{| X_{j}^{o u t} |}{\sum_{i = 1}^{j - 1} \frac{f_{2} (m_{i}, X_{i}) + | X_{i} | | | m_{i} - c_{i} {| |}^{2}}{r_{j}^{2}} + | X_{j}^{o u t} | + \sum_{i = j + 1}^{k} | X_{i} |} \\ (51) & \geq \frac{| X_{j}^{o u t} |}{\frac{(1 + ϵ) n σ_{o p t}^{2}}{r_{j}^{2}} + | X_{j}^{o u t} | + \sum_{i = j + 1}^{k} | X_{i} |} \\ (52) & \geq \frac{| X_{j}^{o u t} |}{\frac{40 (1 + ϵ) k β_{j} n}{ϵ} + | X_{j}^{o u t} | + (k - j) β_{j} n} \\ (53) & \geq \frac{\frac{ϵ}{49} β_{j} n}{\frac{40 (1 + ϵ) k β_{j} n}{ϵ} + \frac{ϵ}{49} β_{j} n + (k - j) β_{j} n} \\ (54) & \geq \frac{ϵ^{2}}{(80 k + k) 49 + (ϵ - 49 j) ϵ} \\ (55) & \geq \frac{ϵ^{2}}{3969 k} \end{matrix}

□

Lemma 13.

| | m_{j} - {\tilde{m}}_{j} | | \leq r_{j}

.

Proof.

\begin{matrix} (56) & | | m_{j} - {\tilde{m}}_{j} | | & = | | \frac{1}{| X_{j} |} \sum_{X \in X_{j}} \int^{R^{d}} s f_{X} (s) d s - \frac{1}{| X_{j} |} (\sum_{X \in X_{j}^{i n}} \tilde{X} + \sum_{X \in X_{j}^{o u t}} \int^{R^{d}} s f_{X} (s) d s) | | \\ (57) & = \frac{1}{| X_{j} |} | | \sum_{X \in X_{j}^{i n}} \int^{R^{d}} (s - \tilde{X}) f_{X} (s) d s | | \\ (58) & = \frac{1}{| X_{j} |} \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | s - \tilde{X} | | f_{X} (s) d s \\ (59) & \leq \frac{1}{| X_{j} |} \sum_{X \in X_{j}^{i n}} r_{j} \\ (60) & = \frac{| X_{j}^{i n} |}{| X_{j} |} r_{j} \\ (61) & \leq r_{j} \end{matrix}

□

Lemma 14.

f_{2} ({\tilde{m}}_{j}, {\tilde{X}}_{j}) \leq 2 f_{2} (m_{j}, X_{j}) + 4 β_{j} n r_{j}^{2}

.

Proof.

\begin{matrix} (62) & f_{2} ({\tilde{m}}_{j}, {\tilde{X}}_{j}) & = \sum_{X \in X_{j}^{i n}} | | \tilde{X} - {\tilde{m}}_{j} {| |}^{2} + \sum_{X \in X_{j}^{o u t}} \int^{R^{d}} | | s - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s \\ (63) & = \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | \tilde{X} - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s + \sum_{X \in X_{j}^{o u t}} \int^{R^{d}} | | s - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s \\ (64) & = \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | \tilde{X} - s + s - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s + \sum_{X \in X_{j}^{o u t}} \int^{R^{d}} | | s - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s \\ (65) & \leq \sum_{X \in X_{j}^{i n}} \int^{R^{d}} (2 | | \tilde{X} - {s | |}^{2} + 2 | | s - {\tilde{m}}_{j} {| |}^{2}) f_{X} (s) d s + \sum_{X \in X_{j}^{o u t}} \int^{R^{d}} | | s - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s \\ (66) & \leq 2 \sum_{X \in X_{j}^{i n}} \int^{R^{d}} | | \tilde{X} - {s | |}^{2} f_{X} (s) d s + 2 \sum_{X \in X_{j}^{o u t}} \int^{R^{d}} | | s - {\tilde{m}}_{j} {| |}^{2} f_{X} (s) d s \\ (67) & \leq 2 | X_{j}^{i n} | r_{j}^{2} + 2 f_{2} ({\tilde{m}}_{j}, X_{j}) \\ (68) & = 2 | X_{j}^{i n} | r_{j}^{2} + 2 f_{2} (m_{j}, X_{j}) + 2 | X_{j} | | | m_{j} - {\tilde{m}}_{j} {| |}^{2} \\ (69) & \leq 2 f_{2} (m_{j}, X_{j}) + 4 β_{j} n r_{j}^{2} \end{matrix}

□

Lemma 15.

In the process of finding

c_{j}

in our algorithm

cMeans

, we assume that

Q

satisfies

X \ B_{j} \subseteq Q

and

| Q | < 2 | X \ B_{j} |

. For the set

S_{2}

in step 5, a subset

S^{*}

of size

6 / ϵ

of

S_{2}

can be obtained such that all random variables in

S^{*}

are from

{\tilde{X}}_{j}^{i n}

with a probability of

1 / 2

. Let

c_{j}

denotes the centroid of

S^{*}

. Then, the inequality

| | {\tilde{m}}_{j} - c_{j} {| |}^{2} \leq \frac{4}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ σ_{j}^{2}

holds with a probability of at least

1 / 6

.

Proof.

In our algorithm

cMeans

, we assume that

S_{1} = S_{1}, \dots, S_{N}

, where

N = 79,380 k / ϵ^{3}

. Let

x_{1}^{^{'}}, \dots, x_{N}^{^{'}}

be the corresponding random variables of elements in

S_{1}

. If

S_{i} \in X_{j}^{o u t}

, obtain

x_{i}^{^{'}} = 1

, or else

x_{i}^{^{'}} = 0

. It is known easily that

P r [S_{i} \in X_{j}^{o u t}] \geq \frac{ϵ^{2}}{7938 k}

by Lemma 12. Let

x = \sum_{i = 1}^{N} x_{i}^{^{'}}, u = \sum_{i = 1}^{N} E (x_{i}^{^{'}})

. We obtain that

u \geq 10 / ϵ

, and

\begin{matrix} (70) & P r [x > \frac{6}{ϵ}] & = 1 - P r [x \leq \frac{6}{ϵ}] \\ (71) & \geq 1 - P r [x \leq \frac{3}{5} u] \\ (72) & \geq 1 - e^{- \frac{{(1 - \frac{3}{5})}^{2} u}{2}} \\ (73) & \geq 1 - e^{- \frac{{(1 - \frac{3}{5})}^{2} \frac{10}{ϵ}}{2}} \\ (74) & \geq 1 - e^{- \frac{4}{5}} \\ (75) & \geq \frac{1}{2} . \end{matrix}

Then, the probability that at least

6 / ϵ

random variables in

S_{1}

are from

X_{j}^{o u t}

is at least

1 / 2

. Since

S_{2} = S_{1} \cup {6 / ϵ

copies of each point in

C}

, a subset

S^{*}

of size

6 / ϵ

of

S_{2}

can be obtained, and the probability that all random variables in

S^{*}

are from

{\tilde{X}}_{j}^{i n}

is at least

1 / 2

. Let

c_{j}

denote the centroid of

S^{*}

and

δ = 5 / 6

. For

| S^{*} | = 6 / ϵ

and

| w i d e t i l d e X_{j} | = | X_{j} |

, by Lemma 2,

| | {\tilde{m}}_{j} - c_{j} {| |}^{2} \leq \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}, {\tilde{X}}_{j})}{| {\tilde{X}}_{j} |} = \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}, {\tilde{X}}_{j})}{| X_{j} |}

holds with a probability of at least

1 / 6

. Assume that

| | {\tilde{m}}_{j} - c_{j} {| |}^{2} \leq \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}, {\tilde{X}}_{j})}{| X_{j} |}

. Then,

| | {\tilde{m}}_{j} - c_{j} {| |}^{2} \leq \frac{ϵ}{5} \frac{f_{2} ({\tilde{m}}_{j}, {\tilde{X}}_{j})}{| X_{j} |} \leq \frac{ϵ}{5} \frac{2 f_{2} (m_{j}, X_{j}) + 4 β_{j} n r_{j}^{2}}{| X_{j} |} \leq \frac{4}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ σ_{j}^{2} .

(76)

□

Lemma 16.

If

c_{j}

satisfies

| | {\tilde{m}}_{j} - c_{j} {| |}^{2} \leq \frac{4}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ σ_{j}^{2}

, then

| | m_{j} - c_{j} {| |}^{2} \leq \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2}

.

Proof.

Assume that

c_{j}

satisfies

| | {\tilde{m}}_{j} - c_{j} {| |}^{2} \leq \frac{4}{5} ϵ r_{j}^{2} + \frac{2}{5} ϵ σ_{j}^{2}

. Then,

\begin{matrix} (77) & | | m_{j} - c_{j} {| |}^{2} & = | | m_{j} - {\tilde{m}}_{j} + {\tilde{m}}_{j} - c_{j} {| |}^{2} \\ (78) & \leq 2 | | m_{j} - {\tilde{m}}_{j} {| |}^{2} + 2 | | {\tilde{m}}_{j} - c_{j} {| |}^{2} \\ (79) & \leq 2 r_{j}^{2} + \frac{8}{5} ϵ r_{j}^{2} + \frac{4}{5} ϵ σ_{j}^{2} \\ (80) & = \frac{4}{5} ϵ σ_{j}^{2} + (2 + \frac{8}{5} ϵ) r_{j}^{2} \\ (81) & \leq \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2} . \end{matrix}

□

Lemma 17.

Given an instance

(X, k, L)

of the uncertain constrained k-means problem, where the size of

X

is n, for

\forall ϵ \in (0, 1], k \geq 2

, we assume that by using our algorithm

cMeans

(ϵ,

X

, k, C, U) (C and U are initialized as empty sets), a collection U of candidate sets including approximate centers is obtained. If there exists a set

C_{k} = {c_{1}, \dots, c_{k}}

in U satisfying that

| | m_{j} - c_{j} {| |}^{2} \leq \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2} (1 \leq j \leq k)

, then

C_{k}

is a

(1 + ϵ)

-approximation for the uncertain constrained k-means problem.

Proof.

Assume that

C_{k} = c_{1}, \dots, c_{k}

is a set in U satisfying that

| | m_{j} - c_{j} {| |}^{2} \leq \frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2} (1 \leq j \leq k)

. Then,

\begin{matrix} (82) & \sum_{j = 1}^{k} f_{2} (c_{j}, X_{j}) & = \sum_{j = 1}^{k} (f_{2} (m_{j}, X_{j}) + | X_{j} | | | m_{j} - c_{j} {| |}^{2}) \\ (83) & \leq \sum_{j = 1}^{k} (f_{2} (m_{j}, X_{j}) + β_{j} n (\frac{9}{10} ϵ σ_{j}^{2} + \frac{1}{10 β_{j} k} ϵ σ_{o p t}^{2})) \\ (84) & \leq \sum_{j = 1}^{k} (f_{2} (m_{j}, X_{j}) + \frac{9}{10} ϵ n \sum_{j = 1}^{k} β_{j} σ_{j}^{2} + \frac{1}{10} ϵ n σ_{o p t}^{2} \\ (85) & \leq \sum_{j = 1}^{k} (f_{2} (m_{j}, X_{j}) + \frac{9}{10} ϵ n σ_{o p t}^{2} + \frac{1}{10} ϵ n σ_{o p t}^{2} \\ (86) & = (1 + ϵ) \cdot O P T_{k} (P) . \end{matrix}

□

5.3. Time Complexity Analysis

We analyze the time complexity for our algorithm

cMeans

in this section.

Lemma 18.

The time complexity of our algorithm

cMeans

is

O (4^{k} {(\frac{13,231 e k}{ϵ^{2}})}^{6 k / ϵ} \frac{1}{ϵ} n d)

.

Proof.

Let

a = C_{N + k M}^{M}

, which

N = \frac{79,380 k}{ϵ^{3}}

,

M = \frac{6}{ϵ}

. By the Stirling formula,

C_{N + k M}^{M} \leq \frac{{(N + k M)}^{M}}{M!} \approx O ({(e \frac{N + k M}{M})}^{M}) = O ({(\frac{13,231 e k}{ϵ^{2}})}^{\frac{6}{ϵ}}) .

In our algorithm

cMeans

, steps 5–9 have a run time of

O (k / ϵ^{3})

, step 11 have a run time of

O (d / ϵ)

, and steps 13–16 have a run time of

O (k n d)

. Let

T (n, g)

denote the time complexity of algorithm

cMeans

, where g is the number of cluster centers, and n is the size of

Q

.

If

g = 0

,

T (n, 0) = O (1)

. When

n = 1

,

T (1, g) = a (T (1, g - 1) + O (d / ϵ)) + O (k / ϵ^{3})

. Because

a > k / ϵ^{3}

,

T (1, g) = a (T (1, g - 1) + O (d / ϵ)) \leq a^{g} \cdot T (1, 0) + g \cdot a^{g} \cdot O (d / ϵ) = O (g \cdot a^{g} \cdot d / ϵ)

. Therefore,

T (1, g) \leq O (4^{g} {(\frac{13,231 e k}{ϵ^{2}})}^{6 g / ϵ}) \frac{1}{ϵ} d

, where

e = 2.7183

.

For

\forall n \geq 2

and

g \geq 1

, the recurrence of

T (n, g)

could be obtained as follows:

T (n, g) = a \cdot T (n, g - 1) + T (⌊ \frac{n}{2} ⌋, g) + a \cdot O (\frac{d}{ϵ}) + O (\frac{k}{ϵ^{3}}) + O (k n d) .

Because

a > k / ϵ^{3}

, two constants

b_{1}

and

b_{2}

with

b_{1} \geq 1

and

b_{2} \geq 1

could be obtained to arrive at the following recurrence.

T (n, g) \leq a \cdot T (n, g - 1) + T (⌊ \frac{n}{2} ⌋, g) + a \cdot b_{1} \cdot \frac{d}{ϵ} + b_{2} \cdot k n d .

Now we claim that

T (n, g) \leq b_{1} \cdot b_{2} \cdot \frac{1}{ϵ} \cdot a^{g} \cdot 2^{2 g} \cdot n d - b_{1} \cdot \frac{d}{ϵ}

. If

g = 0

, then

T (n, 0) = O (1)

. If

g \geq 1, n = 1

, then

T (1, g) \leq O (4^{g} {(\frac{13,231 e k}{ϵ^{2}})}^{6 g / ϵ}) \frac{1}{ϵ} d

, and the claim holds. Suppose that if

\forall n_{1} \geq 0, \forall g > g_{1}

, the claim holds for

T (n_{1}, g_{1})

, and if

\forall 0 < n_{2} < n, \forall g_{2}

, the claim holds for

T (n_{2}, g_{2})

. We need to prove that:

\begin{matrix} b_{1} \cdot b_{2} \cdot \frac{1}{ϵ} \cdot a^{g} \cdot 2^{2 g} \cdot n d - b_{1} \cdot \frac{d}{ϵ} \geq a (b_{1} \cdot b_{2} \cdot \frac{1}{ϵ} \cdot a^{(} g - 1) \cdot 2^{2 (g - 1)} \cdot n d - b_{1} \cdot \frac{d}{ϵ}) \\ + b_{1} \cdot b_{2} \cdot \frac{1}{ϵ} \cdot a^{g} \cdot 2^{2 g} \cdot ⌊ \frac{n}{2} ⌋ d - b_{1} \cdot \frac{d}{ϵ} + a \cdot b_{1} \cdot \frac{d}{ϵ} + b_{2} \cdot k n d . \end{matrix}

The above formula can be simplified as

\frac{1}{4 ϵ} \cdot b_{1} \cdot a^{g} 2^{2 g} \geq k

, which holds for

\forall g \geq 1

. For

a = {(\frac{13,231 e k}{ϵ^{2}})}^{6 / ϵ}

,

T (n, k) = O (4^{k} {(\frac{13,231 e k}{ϵ^{2}})}^{6 k / ϵ} \frac{1}{ϵ} n d)

. □

Thus, we can obtain the following Theorem 2.

Theorem 2.

Given an instance

(X, k, L)

of the uncertain constrained k-means problem, where the size of

X

is n, for

\forall ϵ \in (0, 1], k \geq 2

, by using our algorithm

cMeans (ϵ, X, k, C, U)

, a collection U of candidate sets including approximate centers can be obtained with a probability of at least

1 / 12^{2}

such that U includes at least one candidate set including approximate centers that is a

(1 + ϵ)

-approximation for the uncertain constrained k-means problem, and the time complexity of our algorithm

cMeans

is

O (4^{k} {(\frac{13,231 e k}{ϵ^{2}})}^{6 k / ϵ} \frac{1}{ϵ} n d)

.

6. Conclusions

In this paper, we defined the uncertain constrained k-means problem first, and then presented a stochastic approximate algorithm for the problem in detail. We proposed a general mathematical model of the uncertain constrained k-means problem, and studied the random sampling properties, which are very important to deal with the uncertain constrained k-means problem. By applying a random sampling technique, we obtained a

(1 + ϵ)

-approximate algorithm for the problem. Then, we investigated the success probability, correctness and time complexity analysis of our algorithm

cMeans

, whose running time is

O (4^{k} {(\frac{13,231 e k}{ϵ^{2}})}^{6 k / ϵ} \frac{1}{ϵ} n d)

. However, there also exists a big gap between the current algorithms for the uncertain constrained k-means problem and the practical algorithms for the problem, which has been mentioned in [13] similarly.

We will try to explore a much more practical algorithm for the uncertain constrained k-means problem in future. It is known that the 2-means problem is the smallest version of the k-means problem, and remains NP-hard. The approximation schemes for the 2-means problem can be generalized to solve the k-means problem. Due to the particularity of the uncertain constrained 2-means problem, we will study approximation schemes for the uncertain constrained 2-means problem and reduce the algorithm complexity of approximation schemes for the uncertain constrained k-means problem through approximation schemes of the uncertain constrained 2-means problem. Additionally, we will apply the proposed algorithm to some practical problems in the future.

Author Contributions

J.L. and J.T. contributed to supervision, methodology, validation and project administration. B.X. and X.T. contributed to review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Science and Technology Foundation of Guizhou Province ([2021]015), in part by the Open Fund of Guizhou Provincial Public Big Data Key Laboratory (2017BDKFJJ019), in part by the Guizhou University Foundation for the introduction of talent ((2016) No. 13), in part by the GuangDong Basic and Applied Basic Research Foundation (No. 2020A1515110554), and in part by the Science and Technology Program of Guangzhou (No. 202002030138), China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Feldman, D.; Monemizadeh, M.; Sohler, C. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry, SoCG, Gyeongju, Korea, 6–8 June 2007; pp. 11–18. [Google Scholar]
Ostrovsky, R.; Rabani, Y.; Schulman, L.J.; Swamy, C. The effectiveness of lloyd-type methods for the k-means problem. J. ACM 2012, 59, 28:1–28:22. [Google Scholar] [CrossRef]
Jaiswal, R.; Kumar, A.; Sen, S. A simple D²-sampling based PTAS for k-means and other clustering problems. Algorithmica 2014, 71, 22–46. [Google Scholar] [CrossRef]
Arkin, E.M.; Diaz-Banez, J.M.; Hurtado, F.; Kumar, P.; Mitchell, J.S.; Palop, B.; Perez-Lantero, P.; Saumell, M.; Silveira, R.I. Bichromatic 2-center of pairs of points. Comput. Geom. 2015, 48, 94–107. [Google Scholar] [CrossRef]
Yhuller, S.; Sussmann, Y.J. The capacitated k-center problem. SIAM J. Discrete Math. 2000, 13, 403–418. [Google Scholar]
Har-Peled, S.; Raichel, B. Net and prune: A linear time algorithm for Euclidean distance problems. J. ACM 2015, 62, 4401–4435. [Google Scholar] [CrossRef]
Swamy, C.; Shmoys, D.B. Fault-tolerant facility location. ACM Trans. Algorithms 2008, 4, 1–27. [Google Scholar] [CrossRef]
Xu, G.; Xu, J. Efficient approximation algorithms for clustering point-sets. Comput. Geom. 2010, 43, 59–66. [Google Scholar] [CrossRef][Green Version]
Valls, A.; Batet, M.; Lopez, E.M. Using expert’s rules as background knowledge in the clusdm methodology. Eur. J. Oper. Res. 2009, 195, 864–875. [Google Scholar] [CrossRef]
Li, J.; Yi, K.; Zhang, Q. Clustering with deversity. In Proceedings of the 37th International Colloquium on Automata, Languages and Programming, ICALP, Bordeaux, France, 6–10 July 2010; pp. 188–200. [Google Scholar]
Ding, H.; Xu, J. A unified framework for clustering constrained data without locality property. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, San Diego, CA, USA, 4–6 January 2015; pp. 1471–1490. [Google Scholar]
Bhattacharya, A.; Jaiswal, R.; Kumar, A. Faster algorithms for the constrained k-means problem. Theory Comput. Syst. 2018, 62, 93–115. [Google Scholar] [CrossRef]
Feng, Q.; Hu, J.; Huang, N.; Wang, J. Improved PTAS for the constrained k-means problem. J. Comb. Optim. 2019, 37, 1091–1110. [Google Scholar] [CrossRef]
Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30. [Google Scholar] [CrossRef]

Figure 1. Flow chart of our algorithm

cMeans

.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Stochastic Approximate Algorithms for Uncertain Constrained K-Means Problem

Abstract

1. Introduction

2. Preliminaries

3. Overview of Our Method

4. Our Algorithm cMeans

5. Analysis of Our Algorithm $cMeans$

5.1. Analysis for Case 1: $| X_{j}^{o u t} | \leq \frac{ϵ}{49} β_{j} n$

5.2. Analysis for Case 2: $| X_{j}^{o u t} | > \frac{ϵ}{49} β_{j} n$

5.3. Time Complexity Analysis

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Stochastic Approximate Algorithms for Uncertain Constrained K-Means Problem

Abstract

1. Introduction

2. Preliminaries

3. Overview of Our Method

4. Our Algorithm cMeans

5. Analysis of Our Algorithm cMeans

5.1. Analysis for Case 1: | X j o u t | ≤ ϵ 49 β j n

5.2. Analysis for Case 2: | X j o u t | > ϵ 49 β j n

5.3. Time Complexity Analysis

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

5. Analysis of Our Algorithm $cMeans$

5.1. Analysis for Case 1: $| X_{j}^{o u t} | \leq \frac{ϵ}{49} β_{j} n$

5.2. Analysis for Case 2: $| X_{j}^{o u t} | > \frac{ϵ}{49} β_{j} n$