Stochastic Approximate Algorithms for the Uncertain Constrained k-Means Problem

Abstract: The k-means problem has received much attention in many applications. In this paper, we define the uncertain constrained k-means problem and propose a (1 + ε)-approximate algorithm for it. First, a general mathematical model of the uncertain constrained k-means problem is proposed. Second, the random sampling properties of the uncertain constrained k-means problem are studied. We mainly study the gap between the center of a random sample and the true center, which should be controlled within a given range with large probability, so as to obtain the sampling properties needed to solve this kind of problem. Finally, using mathematical induction, we assume that the first j − 1 cluster centers have been obtained, so that only the j-th center remains to be solved for. The algorithm has elapsed time O((1891ek/ε²)^(8k/ε) nd), and outputs a collection of size O((1891ek/ε²)^(8k/ε) n) of candidate sets including approximate centers.


Introduction
The k-means problem has received much attention in the past several decades. The k-means problem consists of partitioning a set P of points in d-dimensional space R^d into k subsets P_1, . . . , P_k such that ∑_{i=1}^k ∑_{p∈P_i} ||p − c_i||² is minimized, where c_i is the center of P_i, and ||p − q|| is the distance between two points p and q. The k-means problem is one of the classical NP-hard problems, and has been paid much attention in the literature [1][2][3].
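As a concrete reference point, the objective above can be evaluated directly for a fixed partition. The following sketch, with a hypothetical helper `kmeans_cost`, computes ∑_i ∑_{p∈P_i} ||p − c_i||² where each c_i is the centroid of its cluster:

```python
import numpy as np

def kmeans_cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    cost = 0.0
    for i in range(k):
        cluster = points[labels == i]
        if len(cluster) == 0:
            continue
        centroid = cluster.mean(axis=0)  # c_i, the centroid of P_i
        cost += ((cluster - centroid) ** 2).sum()
    return cost

# Two well-separated 1-d clusters: each contributes 0.25 + 0.25 = 0.5.
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
print(kmeans_cost(pts, labels, 2))  # → 1.0
```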
For many applications, each cluster of the point set may satisfy some additional constraints, such as chromatic clustering [4], r-capacity clustering [5], r-gather clustering [6], fault tolerant clustering [7], uncertain data clustering [8], semi-supervised clustering [9], and l-diversity clustering [10]. Constrained clustering problems were studied by Ding and Xu, who presented the first unified framework in [11]. Given a point set P ⊆ R^d, a positive integer k, and a list of constraints L, the constrained k-means problem is to partition P into k clusters P = {P_1, . . . , P_k}, such that all constraints in L are satisfied and ∑_{P_i∈P} ∑_{x∈P_i} ||x − c(P_i)||² is minimized, where c(P_i) = (1/|P_i|) ∑_{x∈P_i} x denotes the centroid of P_i.
In recent years, particular research has been focused on the constrained k-means problem. Ding and Xu [11] showed the first polynomial time approximation scheme with running time O(2^poly(k/ε) (log n)^k nd) for the constrained k-means problem, and obtained a collection of size O(2^poly(k/ε) (log n)^(k+1)) of candidate approximate centers. The existing fastest approximation scheme for the constrained k-means problem takes O(2^O(k/ε) nd) time [12,13], which was first shown by Bhattacharya, Jaiswal, and Kumar [12]. Their algorithm gives a collection of size O(2^O(k/ε)) of candidate approximate centers. In this paper, we propose the uncertain constrained k-means problem, which supposes that all points are random variables with probabilistic distributions. We present a stochastic approximate algorithm for the uncertain constrained k-means problem, which can be regarded as a generalization of the constrained k-means problem. We prove the random sampling properties of the uncertain constrained k-means problem, which are fundamental for our proposed algorithm. By applying random sampling and mathematical induction, we propose a stochastic approximate algorithm with lower complexity for the uncertain constrained k-means problem.
This paper is organized as follows. Some basic notations are given in Section 2. Section 3 provides an overview of the new algorithm for the uncertain constrained k-means problem. In Section 4, we discuss the detailed algorithm for the uncertain constrained k-means problem. In Section 5, we investigate the correctness, success probability, and running time analysis of the algorithm. Section 6 concludes this paper and gives possible directions for future research.

Preliminaries
Definition 1 (Uncertain constrained k-means problem). Given a set X ⊆ R^d of random variables, the probability density function f_X(s) for every random variable X ∈ X, a list of constraints L, and a positive integer k, the uncertain constrained k-means problem is to partition X into k clusters X = {X_1, . . . , X_k}, such that all constraints in L are satisfied and ∑_{j=1}^k ∑_{X∈X_j} E[||X − c(X_j)||²] is minimized, where c(X_j) denotes the centroid of X_j.
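When the distributions are known, each expectation in the objective decomposes as E||X − c||² = ||E[X] − c||² + tr(Var[X]). A minimal sketch, assuming independent Gaussian variables and a hypothetical helper `expected_cost`, evaluates the objective of Definition 1 for a fixed partition and fixed centers:

```python
import numpy as np

def expected_cost(mus, variances, labels, centers):
    """Objective for a fixed partition: for each Gaussian X with mean mu and
    componentwise variances v, E||X - c||^2 = ||mu - c||^2 + sum(v)."""
    cost = 0.0
    for mu, v, j in zip(mus, variances, labels):
        cost += np.sum((np.asarray(mu) - np.asarray(centers[j])) ** 2) + np.sum(v)
    return cost

# Two 1-d Gaussians N(0,1) and N(2,1) in one cluster with center 1:
# squared mean shifts 1 + 1, plus variances 1 + 1.
print(expected_cost([[0.0], [2.0]], [[1.0], [1.0]], [0, 0], [[1.0]]))  # → 4.0
```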

Definition 2 ([13]). Let X be a set of random variables in R^d, f_X(s) be the probability density function for every random variable X ∈ X, q ∈ R^d be a point, and P be a set of points in R^d with p ∈ P.

Definition 3 ([13]). Let X be a set of random variables in R^d, f_X(s) be the probability density function for every random variable X ∈ X, and X_1, . . . , X_k be a partition of X.
• Define m_j = c(X_j).
Proof. Let f X (s) be the probability density function for every random variable X ∈ X .
Equality (3) follows from the fact that ∑_{X∈X} (E[X] − c(X)) = 0.

Lemma 2. Let X be a set of random variables in R^d and f_X(s) be the probability density function for every random variable X ∈ X. Assume that T is a set of random variables obtained by sampling random variables from X uniformly and independently. For any δ > 0, with probability at least 1 − δ,

||c(T) − c(X)||² ≤ (1/(δ|T|)) (σ²(X) + v(X)),

where σ²(X) = (1/|X|) ∑_{X∈X} ||E[X] − c(X)||² is the average squared distance of the expectations to the centroid, and v(X) = (1/|X|) ∑_{X∈X} tr(Var[X]) is the average variance. The proof computes E[||c(T) − c(X)||²] = (σ²(X) + v(X))/|T| and then applies the Markov inequality.
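The Markov-inequality step can be checked empirically: the expected squared gap between the sample centroid and the true centroid shrinks like 1/|T|, so the event that the gap stays below that expectation divided by δ has probability at least 1 − δ. A minimal simulation, treating realized points as stand-ins for the random variables:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 2))            # ground set of points
c_X = X.mean(axis=0)                        # true centroid c(X)
var_X = ((X - c_X) ** 2).sum(axis=1).mean() # mean squared distance to c(X)

T_size, delta, trials = 50, 0.25, 2000
bound = var_X / (delta * T_size)            # Markov-style bound, holds w.p. >= 1 - delta
hits = 0
for _ in range(trials):
    T = X[rng.choice(len(X), T_size, replace=True)]  # uniform independent sample
    if ((T.mean(axis=0) - c_X) ** 2).sum() <= bound:
        hits += 1
print(hits / trials)  # empirically well above 1 - delta = 0.75
```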
Lemma 3. Let Q be a set of random variables in R^d, f_X(s) be the probability density function for every random variable X ∈ Q, Q_1 be an arbitrary subset of Q with α|Q| random variables (0 < α < 1), and Q_2 = Q \ Q_1. Then ||c(Q) − c(Q_2)||² ≤ (α/(1 − α)) · (1/|Q|) ∑_{X∈Q} E[||X − c(Q)||²].

Proof. Let Q_2 = Q \ Q_1. By Lemma 1, we have the following two equalities.

Overview of Our Method
In this section, we first introduce the main idea of our methodology to solve the uncertain constrained k-means problem.
Considering the optimal partition X = {X_1, . . . , X_k}, assume that X_1 is the largest cluster, so that |X_1| ≥ |X|/k. If a set S of O(k/ε) random variables is sampled from X uniformly and independently, then at least O(1/ε) random variables in S are from X_1 with a certain probability. All subsets of S of size O(1/ε) could be enumerated to discover the approximate center of X_1.
We assume that C_{j−1} = {c_1, . . . , c_{j−1}} is the set including the approximate centers of the first j − 1 clusters, and that B_j is the set of random variables of X that are close to C_{j−1}. The set X_j is divided into two parts, X_j^out and X_j^in, where X_j^out = X_j \ B_j and X_j^in = X_j ∩ B_j. For each random variable X, let X̃ be the nearest point (particular random variable) in C_{j−1} to X. Let X̃_j^in = {X̃ | X ∈ X_j^in}, and X̃_j = X̃_j^in ∪ X_j^out. If most of the random variables of X_j are in X_j^in, our idea is to use the center of X_j^in to approximate the center of X_j; the center of X_j^in is found based on C_{j−1}. If most of the random variables of X_j are in X_j^out, our idea is to replace the center of X_j with the center of X̃_j. To seek out the approximate center of X̃_j, we should find a subset S by uniformly sampling from X̃_j. However, the set X_j^out is unknown. We need to find the set S ∩ X_j^out. We apply a branching strategy to find a set Q such that X \ B_j ⊆ Q and |Q| < 2|X \ B_j|. Then, a random variable set S is obtained by sampling random variables from Q independently and uniformly, and the portion of S coming from X_j^out can be represented by a subset S* of S. Based on S* and X̃_j^in, the approximate center of X̃_j can be obtained. Therefore, the algorithm presented in this paper outputs a collection of size O((1891ek/ε²)^(8k/ε) n) of candidate sets containing approximate centers, and has running time O((1891ek/ε²)^(8k/ε) nd).
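The branching strategy that produces Q can be pictured as repeated halving: discard the half of Q nearest to the current centers, so that some branch along the way retains all of X \ B_j while at most doubling its size. A sketch, with the hypothetical helper `halving_branches` and squared Euclidean distances:

```python
import numpy as np

def halving_branches(Q, centers):
    """Enumerate candidate sets produced by repeatedly discarding the half of Q
    nearest to the current centers (a sketch of the branching step; `centers`
    plays the role of C_{j-1}).  One branch in the returned sequence keeps all
    far-away variables while at most doubling their number."""
    branches = [Q]
    while len(Q) > 1:
        # distance of each variable to its nearest current center
        d = np.min(((Q[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        m = np.median(d)
        Q = Q[d > m]          # keep the far half; the near half lies in B_j
        branches.append(Q)
    return branches

# Eight 2-d points at increasing distance from a single center at the origin:
Q = np.array([[float(i), 0.0] for i in range(1, 9)])
centers = np.array([[0.0, 0.0]])
print([len(b) for b in halving_branches(Q, centers)])  # → [8, 4, 2, 1]
```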

Our Algorithm cMeans
Given an instance (X, k, L) of the uncertain constrained k-means problem, X = {X_1, . . . , X_k} denotes an optimal partition of (X, k, L). There are six parameters (ε, Q, g, k, C, U) in our cMeans, where ε ∈ (0, 1] is the approximation factor, Q is the input random variable set, g is the number of centers still to be found, k is the number of clusters, C is the set of approximate cluster centers, and U is a collection of candidate sets including the approximate centers. Let M = 6/ε and N = 79,380k/ε³, where M is the size of the subsets of the sampling set and N is the size of the sampling set. Without loss of generality, assume that the values of M and N are integers. We use a branching strategy to seek out the approximate centers of the clusters in X. There are two branches in our algorithm cMeans, which can be seen in Figure 1. On one branch, a set S_1 of size N is obtained by sampling from Q uniformly and independently; S_2 is constructed from S_1 and M copies of each point in C. Moreover, we consider each subset S of size M of S_2; the centroid c of S is computed to represent the approximate center of X_{k−g+1}, and our algorithm cMeans(ε, Q, g − 1, k, C ∪ {c}, U) is used to obtain the remaining g − 1 cluster centers. On the other branch, for each random variable X ∈ Q, we first calculate the distance between X and C. H denotes the multi-set of all distances of random variables in Q to C. We then obtain the median value m of H, which is the ⌈|H|/2⌉-th element when the values in H are sorted. In this second branch, Q is divided into two parts, Q′ and Q″, based on m, such that Q′ contains the random variables whose distance to C is at most m and Q″ contains the rest; subroutine cMeans(ε, Q″, g, k, C, U) is used to obtain the remaining g cluster centers. Therefore, we present the specific algorithm for seeking out a collection of candidate sets in Algorithm 1.
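The two-branch recursion can be sketched as follows. This is a hypothetical, heavily scaled-down surrogate (the constants M and N are replaced by tiny stand-ins so the sketch actually runs, and the helper names `c_means` and `halving` are assumptions, not the paper's implementation):

```python
import itertools
import numpy as np

def c_means(eps, Q, g, k, C, U, rng):
    """Sketch of the two-branch recursion: branch 1 enumerates size-M subsets of
    a padded sample and recurses with g-1 centers; branch 2 discards the half of
    Q nearest to C and recurses with the same g.  M, N are tiny stand-ins for
    6/eps and 79380k/eps^3."""
    M, N = 2, 8
    if g == 0:
        U.append(list(C))                 # one complete candidate set of k centers
        return
    # Branch 1: sample S1 from Q uniformly, pad with M copies of each found center.
    S1 = Q[rng.choice(len(Q), min(N, len(Q)), replace=True)]
    S2 = np.vstack([S1] + [np.repeat(c[None, :], M, axis=0) for c in C]) if C else S1
    for idx in itertools.combinations(range(len(S2)), M):
        c = S2[list(idx)].mean(axis=0)    # centroid of the M-subset as a candidate center
        c_means(eps, Q, g - 1, k, C + [c], U, rng)
    # Branch 2: keep only the half of Q farther than the median distance to C.
    if C and len(Q) > 1:
        d = np.min(((Q[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(-1), axis=1)
        Q_far = Q[d > np.median(d)]
        if len(Q_far) > 0:
            c_means(eps, Q_far, g, k, C, U, rng)

rng = np.random.default_rng(0)
U = []
c_means(0.5, rng.normal(size=(16, 2)), 2, 2, [], U, rng)
print(len(U) > 0)  # a nonempty collection of 2-center candidate sets
```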

Analysis of Our Algorithm cMeans
We investigate the success probability, correctness, and time complexity analysis of the algorithm cMeans in this section.
The following lemmas, Lemmas 6 to 16, are used to prove Lemma 5. We prove Lemma 5 by induction on j. For j = 1, we can obtain β_1 ≥ 1/k easily, and we prove the success probability first.

Lemma 6. In the process of finding c_1 in our algorithm cMeans, by sampling a set of 79,380k/ε³ random variables from X independently and uniformly, denoted by S_1, the probability that at least 6/ε random variables in S_2 are from X_1 is at least 1/2.
5.1. Analysis for Case 1: |X_j^out| ≤ (ε/49)β_j n

Since |X_j^out| ≤ (ε/49)β_j n, most of the random variables of X_j are in B_j. Our idea is to replace the center of X_j with the center of X_j^in. Thus, we need to find the approximate center c_j of X_j^in and bound the distance ||m_j − c_j||. We divide the distance ||m_j − c_j|| into the following three parts: ||m_j − m_j^in||, ||m_j^in − m̃_j^in||, and ||m̃_j^in − c_j||. We first study the distance between m_j and m_j^in.
Proof. Since |X_j| = β_j n and |X_j^out| ≤ (ε/49)β_j n, the proportion of X_j^in in X_j is at least 1 − ε/49. By Lemma 3, the claimed bound on ||m_j − m_j^in|| follows.

Lemma 10. In the process of finding c_j in our algorithm cMeans, for the set S_2 in step 5, a subset S* of size 6/ε of S_2 can be obtained such that all random variables in S* are from X̃_j^in. Let c_j be the centroid of S*. Then, the inequality ||m̃_j^in − c_j||² ≤ (2ε/5) r_j² + (49ε/120) σ_j² − (1/5)||m_j − m_j^in||² holds with a probability of at least 1/6.

Proof. For each point p ∈ C_{j−1}, 6/ε copies of p are added to S_2 in step 9 of our algorithm cMeans. Thus, a subset S* of size 6/ε of S_2 can be obtained such that all random variables in S* are from X̃_j^in. Let δ = 5/6. Since |S*| = 6/ε, by Lemma 2, the claimed bound on ||m̃_j^in − c_j||² holds with a probability of at least 1/6.

Proof. Assume that c_j satisfies ||m̃_j^in − c_j||² ≤ (2ε/5) r_j² + (49ε/120) σ_j² − (1/5)||m_j − m_j^in||². Then, combining this with the bounds on ||m_j − m_j^in|| and ||m_j^in − m̃_j^in|| gives the desired bound on ||m_j − c_j||.

5.2. Analysis for Case 2: |X_j^out| > (ε/49)β_j n

Let X̃_j = X̃_j^in ∪ X_j^out, and let m̃_j denote the centroid of X̃_j. Our idea is to replace the center of X_j with the center of X̃_j. But it is difficult to seek out the center of X̃_j. Thus, we try to find an approximate center c_j of X̃_j.
Lemma 15. In the process of finding c_j in our algorithm cMeans, we assume that Q satisfies X \ B_j ⊆ Q and |Q| < 2|X \ B_j|. For the set S_2 in step 5, a subset S* of size 6/ε of S_2 can be obtained such that all random variables in S* are from X̃_j with a probability of at least 1/2. Let c_j denote the centroid of S*. Then, the inequality ||m̃_j − c_j||² ≤ (4ε/5) r_j² + (2ε/5) σ_j² holds with a probability of at least 1/6.

Proof. In our algorithm cMeans, we assume that S_1 = {s_1, . . . , s_N}, where N = 79,380k/ε³. Let x_1, . . . , x_N be the indicator random variables of the elements in S_1: x_i = 1 if s_i ∈ X_j^out, and x_i = 0 otherwise. It is easily seen that Pr[s_i ∈ X_j^out] ≥ ε²/(7938k) by Lemma 12. Let u = E[∑_{i=1}^N x_i]. We obtain that u ≥ 10/ε, and, by a Chernoff bound, the probability that at least 6/ε random variables in S_1 are from X_j^out is at least 1/2. Since S_2 = S_1 ∪ {6/ε copies of each point in C}, a subset S* of size 6/ε of S_2 can be obtained such that all random variables in S* are from X̃_j with probability at least 1/2. Let c_j denote the centroid of S* and δ = 5/6. Since |S*| = 6/ε and |X̃_j| = |X_j|, the claimed inequality holds with a probability of at least 1/6.
Proof. Assume that c_j satisfies ||m̃_j − c_j||² ≤ (4ε/5) r_j² + (2ε/5) σ_j². Then, the desired bound on ||m_j − c_j|| follows.

Lemma 17. Given an instance (X, k, L) of the uncertain constrained k-means problem, where the size of X is n, for any ε ∈ (0, 1] and k ≥ 2, we assume that by using our algorithm cMeans(ε, X, k, k, C, U) (C and U are initialized as empty sets), a collection U of candidate sets including approximate centers is obtained. If there exists a set C_k = {c_1, . . . , c_k} in U satisfying ||m_j − c_j||² ≤ (9ε/10) σ_j² + (ε/(10β_j k)) σ_opt² for 1 ≤ j ≤ k, then C_k is a (1 + ε)-approximation for the uncertain constrained k-means problem.

Time Complexity Analysis
We analyze the time complexity for our algorithm cMeans in this section.
In our algorithm cMeans, steps 5-9 have a running time of O(k/ε³), step 11 has a running time of O(d/ε), and steps 13-16 have a running time of O(knd). Let T(n, g) denote the time complexity of algorithm cMeans, where g is the number of cluster centers and n is the size of Q.
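The recursion tree behind T(n, g) can be counted numerically. Assuming the two branches give a recurrence of the shape T(n, g) = B·T(n, g − 1) + T(n/2, g) + (per-call work), where B is a hypothetical stand-in for the number of size-M subsets enumerated in branch 1, a toy counter of cMeans invocations looks like this:

```python
def num_calls(n, g, B):
    """Count invocations under T(n, g) = B*T(n, g-1) + T(n/2, g) + O(...).
    B is a surrogate for the number of enumerated size-M subsets of S_2."""
    if g == 0 or n == 0:
        return 1
    return 1 + B * num_calls(n, g - 1, B) + num_calls(n // 2, g, B)

print(num_calls(16, 1, 3))  # → 21: the halving chain 16, 8, 4, 2, 1, each branching B times
```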
Thus, we can obtain the following Theorem 2.

Theorem 2. Given an instance (X, k, L) of the uncertain constrained k-means problem, where the size of X is n, for any ε ∈ (0, 1] and k ≥ 2, by using our algorithm cMeans(ε, X, k, k, C, U), a collection U of candidate sets including approximate centers can be obtained with a probability of at least 1/12, such that U includes at least one candidate set of approximate centers that is a (1 + ε)-approximation for the uncertain constrained k-means problem, and the time complexity of our algorithm cMeans is O(4^k (13231ek/ε²)^(6k/ε+1) nd).

Conclusions
In this paper, we first defined the uncertain constrained k-means problem, and then presented a stochastic approximate algorithm for it in detail. We proposed a general mathematical model of the uncertain constrained k-means problem, and studied its random sampling properties, which are very important for dealing with the problem. By applying a random sampling technique, we obtained a (1 + ε)-approximate algorithm for the problem. Then, we analyzed the success probability, correctness, and time complexity of our algorithm cMeans, whose running time is O(4^k (13231ek/ε²)^(6k/ε+1) nd). However, a large gap still exists between the current algorithms for the uncertain constrained k-means problem and practical algorithms for the problem, as has been similarly noted in [13].
We will try to explore a much more practical algorithm for the uncertain constrained k-means problem in the future. It is known that the 2-means problem is the smallest version of the k-means problem and remains NP-hard. Approximation schemes for the 2-means problem can be generalized to solve the k-means problem. Due to the particularity of the uncertain constrained 2-means problem, we will study approximation schemes for it, and use them to reduce the complexity of approximation schemes for the uncertain constrained k-means problem. Additionally, we will apply the proposed algorithm to some practical problems in the future.
Author Contributions: J.L. and J.T. contributed to supervision, methodology, validation and project administration. B.X. and X.T. contributed to review and editing. All authors have read and agreed to the published version of the manuscript.