Optimizing MSE for Clustering with Balanced Size Constraints

Clustering is to group data so that the observations in the same group are more similar to each other than to those in other groups. k-means is a popular clustering algorithm in data mining. Its objective is to optimize the mean squared error (MSE). The traditional k-means algorithm is not suitable for applications where the sizes of clusters need to be balanced. Given n observations, our objective is to optimize the MSE under the constraint that the observations need to be evenly divided into k clusters. In this paper, we propose an iterative method for the task of clustering with balanced size constraints. Each iteration can be split into two steps, namely an assignment step and an update step. In the assignment step, the data are evenly assigned to each cluster. The balanced assignment task here is formulated as an integer linear program (ILP), and we prove that the constraint matrix of this ILP is totally unimodular. Thus the ILP is relaxed as a linear program (LP) which can be efficiently solved with the simplex algorithm. In the update step, the new centers are updated as the centroids of the observations in the clusters. Assuming that there are n observations and the algorithm needs m iterations to converge, we show that the average time complexity of the proposed algorithm is O(mn1.65)–O(mn1.70). Experimental results indicate that, comparing with state-of-the-art methods, the proposed algorithm is efficient in deriving more accurate clustering.


Introduction
In the fields of data mining, machine learning, and social science, the most widely studied and fundamental method is clustering.Clustering is to partition observations into clusters so that similar observations are put together while dissimilar observations are separated [1].There are sophisticated clustering algorithms.Unfortunately, they are not suitable for applications that impose the equally large cluster sizes as a constraint.For example, in marketing campaigns, the given customers need to be partitioned into clusters so that each cluster is allotted to a salesman.For the purpose of fairness and efficiency, each salesman should have the same workload [2].In [3], the authors considered organizing sensor nodes into clusters of similar size so that the workloads are distributed evenly on all the master nodes.There are also other examples like circuit designing [4], document clustering [5], and image searching [6].The balanced size constraints are introduced into clustering because of the application requirements rather than the actual distribution of the data.Moreover, according to [7], the balance of clusters is an important issue in clustering.The lack of size constraints in traditional clustering algorithms could lead to extremely unbalanced clustering.
With the demands of the balanced size-constraint clustering, many approaches have been proposed in the last decades.Most of the size-constraint clustering methods are based on soft size constraints.They could ease the problem of extremely unbalanced clustering caused by traditional clustering algorithms.However, they are not suitable for the applications where the observations are required to follow a strictly even distribution.Although there are plenty of such application requirements, only few studies have been dedicated to the hard size-constraint problem.More importantly, existing clustering methods based on hard size constraints can be hardly used due to the low accuracy or efficiency.
In this paper, we propose a novel approach to solve the problem of clustering with hard balance size constraints.Given a set of n observations, the task is to find a scheme to partition them into k clusters so that the sizes of all the clusters are approximately the same (±1) while the mean squared error (MSE) is minimized.Similar to the k-means algorithm, our proposed method is a gradient descent solution which runs in an iterative way.Each iteration consists of two steps, namely an assignment step and an update step.In the assignment step, the observations are evenly assigned to different clusters.The balanced assignment task here is modeled as an integer linear program (ILP).We prove that the corresponding constraint matrix is totally unimodular.Thus the ILP can be relaxed as a linear program (LP) which can be efficiently solved with the simplex algorithm [8].In the update step, the task is relatively simpler: update the new centers as the centroids of the observations assigned to the clusters.Assuming that there are n observations and the algorithm needs m iterations to converge, we show that the average time complexity of the proposed algorithm is O(mn 1.65 )-O(mn 1.70 ), which is much better than the O(mn 3 ) method based on the Hungarian algorithm [9].
To evaluate the proposed method, experiments have been conducted.Both real and synthetic datasets are involved.The number of observations n ranges from 150 to 5000.The number of clusters k takes its value from the set of {3, 9, 21, 45, 93}, which covers both the situations when the number of observations could and could not be divisible by the number of clusters.We do not choose very large cluster numbers, because according to the rule of thumb [10], the number of clusters k for data of size n should be less than √ n/2.The evaluation criteria include the cost and running time of the balanced assignment algorithm as well as the number of iterations, MSE, and running time of the balanced clustering algorithm.From the experimental results, it can be concluded that the proposed method is a practical solution for the task of clustering with balanced size constraints.
The rest of this paper is organized as follows.Section 2 reviews some related work.Section 3 covers the details about our proposed balanced clustering algorithm.In Section 4, the experimental settings and results are given.The discussion is provided in Section 5. Finally, Section 6 concludes the paper and presents future directions of research.

Related Work
Since the problem of clustering with balanced size constraints is a special form of the size-constraint clustering, we will briefly review some studies related to clustering with size constraints.The constraints can be divided into two categories, namely soft size constraints and hard size constraints.The former ones refer to constraints that are not guaranteed to be satisfied, while the latter ones are compulsory conditions that must be satisfied during clustering process [11].

Soft Constraints
More work has been published about soft rather than hard constraints.Ref. [7] proposed frequency sensitive competitive learning (FSCL) for balanced clustering.A multiplicative term and an additive term are introduced into the objective function in order to penalize larger clusters.Thus larger clusters are less likely to win points so that the clustering will be balanced.Similarly, Ref. [6] proposed to apply a penalty mechanism in FSCL that only includes an additive term.Ref. [12] presented a scalable clustering algorithm that satisfies the balance constraints which runs in O(kNlogN) time.The algorithm first clusters the sampled data points, then populates the clusters with the unsampled data points while simultaneously keeping the balanced size constraints.The ratio-cut algorithm that solves an objective function which embodies both min-cut and equipartition was described in [13].
Ref. [14] suggested a size regularized cut (SRcut).The sizes of clusters are utilized as prior knowledge in the clustering process.A regularization term measuring the balances of clusters is added to the cost function.The results suggest that it outperforms the normalized cut which impose indirect size constraints [15].Ref. [16] proposed to incorporate the balance condition in the cost function, and the problem is solved by finding a partition close to the given partition.In [17], exclusive lasso is applied in the balanced k-means and min-cut method to represent the size constraints.The results suggest improved performance.Clustering with soft size constraints could ease the problem of extremely unbalanced clustering, yet it is not able to deal with applications that impose hard constraints on the sizes of clusters.In this paper, our focus is on the clustering with hard size constraints.

Hard Constraints
On the other hand, there are few studies of clustering with hard size constraints, although the topic is highly relevant in practice.The method in [9] presented a new direction to deal with this problem.It translates the k-means assignment step into the average assignment problem that is to evenly assign n observations to k clusters with minimum cost.The average assignment problem is then modeled as a bipartite matching problem that is readily solved by the Hungarian algorithm.The method works well when the number of observations n is divisible by the number of clusters k.However, it is awkward when n is not divisible by k because it is not known that which clusters should be assigned to n/k observations.Ref. [18] proposed a method for clustering with hard size constraints.It first applies the k-means algorithm to the data so that the partition A without any size constraints is obtained.Then the partition B with given size constraints is calculated by maximizing its agreement with partition A. The optimization problem is finally modeled as an ILP, which can be easily solved with an existing solver.The algorithm is quite convenient and efficient to implement.The experimental results also indicate that incorporating the size constraints improves the accuracy of the clustering algorithm.However, the algorithm fails to consider the similarity between observations in the mathematical model, which leads to a less optimized MSE.Refs.[19,20] presented a balanced k-means algorithm to solve the problem of partitioning areas for large scale vehicle routing.The traditional k-means algorithm is applied to divide the whole dataset into several areas at the first step.Then a heuristic algorithm is developed to adjust the clusters so that the balanced size constraints are satisfied.However, the heuristic algorithm can hardly guarantee the quality of clustering.
In case of local solutions with clusters that contain few observations or even empty clusters, a clustering method was proposed in [21] where it imposes lower bounds on the sizes of clusters.The problem is modeled as an ILP which is then transferred into a minimum cost flow (MCF) linear network optimization problem.An algorithm specifically tailored to network optimization is then applied to solve the problem.Refs.[22,23] put lower bounds and upper bounds on the sizes of clusters, respectively, and adapt heuristic algorithms to solve the size-constraint clustering.However, the objectives of these studies differ from what we are going to accomplish in this paper.
According to the above discussion, most of the current researches on clustering with size constraints are based on soft constraints.Few contributions have been dedicated to the problem of clustering with hard size constraints, and they are not competent for the problem.To fill the gap, in this paper we propose a novel method that is based on ILP to deal with the problem of clustering with hard size constraints.

Problem Formulation
Minimizing the Mean Squared Error (MSE) is one of the most common standards in clustering algorithms.In this paper, we are trying to optimize the MSE subject to hard balanced size constraints, i.e., all the clusters should have approximately the same number of observations (±1).Given n observations, the objective is to divide them into k clusters, and each cluster contains n/k to n/k observations.So our problem can be formulated as: Here, o i is the i-th observation, µ j represents the set of data in the j-th cluster, µ j denotes the number of observations in the j-th cluster, c j represents the center of the j-th cluster, and o i − c j 2 stands for the squared distance between o i and c j .
Let p denote the partition matrix, where p i,j = 1 indicates that observation o i belongs to cluster j, and p i,j = 0 indicates that o i does not belong to cluster j.Obviously, summing each row of p equals 1 since one observation can only be assigned to one cluster.Summing each column of p would be the number of observations in the corresponding cluster, which in the balanced case would be either n/k or n/k .In this way, the problem shown in Equation ( 1) can be reformulated as:

Proposed Solution
Similar to the k-means algorithm, the gradient descent method, which runs in an iterative way, is applied to solve the above problem.Given n observations, we derive the initial k centers by k-means++ algorithm [24].Then the algorithm proceeds by 2 steps, namely an assignment step and an update step.
In the assignment step, we need to evenly assign the observations to different clusters according to the distance between the observations and the cluster centers so that the MSE is minimized.The objective is to minimize E(c, p) with respect to p while holding c fixed (known).Therefore, Equation (2) becomes an ILP as shown in Equation ( 3), where u j , v j and g i,j are slack variables used to eliminate inequalities.Z is the set of integers.In the next subsection we will prove that the integer constraints can be removed from this ILP so that the ILP becomes an LP which can be efficiently solved with the simplex algorithm [8].
A concrete example of the above problem can be illustrated as a bipartite graph shown in Figure 1.Five observations need to be evenly assigned to two clusters, i.e., each cluster must be assigned to two or three observations.The weight attached to each edge w i,j is the squared distance between observation o i and cluster center c j , namely o i − c j 2 .The green solid lines shown in Figure 1   Once the observations are assigned, the new centers need to be updated so that the MSE is minimized.The objective is to minimize E(c, p) with respect to c while holding p fixed (known).Since p is fixed, we can remove all the constraints in Equation ( 2).Then the task of the update step becomes an optimization problem without any constraints as shown in Equation (4).
The minimum value of E can be achieved when In summary, the proposed method can be detailed as Algorithm 1.A concrete example of our proposed algorithm can be found in Figure 2.  Our algorithm is guaranteed to converge, but not always to the global optimum due to the non-convexity of the objective function.Assuming that p (t) , c (t) are the partition matrix and the centers at the end of the t-th iteration, p (t) must satisfy the constraints in Equation (3).In the assignment step of (t + 1)-th iteration, we evenly assign all the observations to different clusters so that E(c (t) , p) is minimized, i.e., we optimize E(c (t) , p) with respect to p under the constraints in Equation (3) to get p (t+1) .Thus we have E(c (t) , p (t+1) ) ≤ E(c (t) , p (t) ).In the update step, the new centers are updated, which further optimize the objective function.Thus we have E(c (t+1) , p (t+1) ) ≤ E(c (t) , p (t+1) ).The value of E monotonically decreases during the course of our algorithm, so it converges.

Time Complexity
Each iteration of the proposed algorithm consists of two steps, namely an assignment step and an update step.According to Equation ( 5), the update step takes O(n) time, where n is the number of observations.Solving the ILP in the assignment step on the other hand is relatively more difficult.Typically, to handle the general ILP shown in Equation ( 6), the simplex algorithm needs to be executed multiple (at most exponential) times in a branch-and-bound way.However, according to [25], if the vector b is integral and the constraint matrix A is totally unimodular (a matrix is totally unimodular if and only if the determinant of every square sub-matrix is 0, 1, or −1), the solution to the relaxed LP (without integer constraints) is guaranteed to be integral, i.e., the integer constraints can be simply removed so that the ILP can be transformed into an LP which can be efficiently solved by the simplex algorithm.
Theorem 1.The constraint matrix of the ILP for the balanced assignment problem shown in Equation ( 3) is totally unimodular.
Proof.Putting Equation (3) into the form of Equation ( 6), we have the variables.
x = x x x and x represent the decision variables and slack variables, respectively.
Correspondingly, the constraint matrix can be derived.
Each row of matrix A contains the coefficients for each set of constraints shown in Equation (3).(A 1 A 1 ) are the coefficients for the first set of constraints ∑ n i=1 p i,j ) are the coefficients for the final set of constraints p i,j T correspond to the coefficients for the decision and slack variables, respectively.
Here, I k×k is the identity matrix of size k × k. 0 k×k is the zero matrix of size k × k.
According to Lemma 4, (A 2 A 3 ) T is a totally unimodular matrix, as every column has two non-zero entries being ±1, and the sum of each column equals 0.
According to Lemma 2 and Lemma 3, (A 1 A 2 A 3 ) T is totally unimodular, because multiplying A 1 with-1 results in a duplicate of A 2 , and duplication does not affect total unimodularity.
According to Lemma 1, (A 1 A 2 A 3 A 4 ) T is totally unimodular, because every row of A 4 contains exactly one non-zero entry which is 1.Inserting A 4 to (A 1 A 2 A 3 ) T would still result in a totally unimodular matrix.
According to Lemma 1, A is totally unimodular, because every column of (A 1 A 2 A 3 A 4 ) T contains exactly one non-zero entry which is 1.
T leads to a totally unimodular matrix.
All the lemmas can be found in [26].According to the above description, the constraint matrix A for the ILP shown in Equation ( 3) is a totally unimodular matrix.

Lemma 1. The constraint matrix A remains totally unimodular if inserting a column (row) with only one non-zero entry being ±1.
Lemma 2. The constraint matrix A remains totally unimodular if multiplying a column (row) with −1.

Lemma 3. The constraint matrix A remains totally unimodular if duplicating a column (row).
Lemma 4. The constraint matrix A is totally unimodular if it has at most two non-zero entries being ±1 in each column (row), and, for every column (row) with two non-zero entries, the sum of the column (row) is 0.
The constraint matrix A in our problem is totally unimodular.In addition, the vector b is integral so that the linear program shown in Equation ( 3) can be simply solved by the simplex algorithm.The worst time complexity of the simplex algorithm is exponential.Fortunately, the simplex algorithm is remarkably efficient in practice, and it has polynomial time complexity in the average case [27,28].More specifically, the average time complexity of the simplex algorithm is T = βs α td 0.33 [29].Here, s and t are the numbers of constraints and variables in the linear program, respectively.d is the percentage of non-zero entries in the constraint matrix.α and β are the two unknowns we need to find out.
To evaluate α and β, we collected 240 different sets of data for regression of the form T/(td 0.33 ) = βs α .All the data are randomly generated.The number of observations n varies from 100 to 100,000.The number of clusters k varies from 3 to 100.It is found that α is in the range (0.43, 0.46), β is in the range (2.09 × 10 −6 , 3.46 × 10 −5 ).The parameters related to the goodness of fit are SSE = 8.38 × 10 −6 , R − square = 0.95, Adjusted R − square = 0.95. Figure 3 illustrates the result of the regression.Furthermore, in our case, we have s = n + 2k + nk, t = 2nk + 2k, and d = (5nk + 2k)/(st).By placing everything together into T = βs α td 0.33 , we have T = O((nk) 1.1 )-O((nk) 1.13 ).Considering the maximum value of k is √ n/2 [10], the average time complexity of each assignment step is approximately O(n 1.65 )-O(n 1.70 ).Assuming that there are m iterations, the average time complexity of the balanced clustering algorithm is O(mn 1.65 )-O(mn 1.70 ).

Experimental Settings
In order to evaluate the performance of our proposed approach, both synthetic and real datasets are explored in our experiments.The synthetic datasets are generated from the standard uniform distribution on the interval (0, 100).The real datasets are from the UCI machine learning repository [30].Table 1 shows the details of the datasets in our experiments.We compare the balanced clustering performances between our proposed method and state-of-the-art methods which include the balanced clustering method presented by [9] and the size-constraint clustering method proposed in [18].Notice that for each run, all the algorithms start with the same set of initial centers derived from the k-means++ algorithm so that the comparison is objective.In addition, since [9] and we both model each iteration as a balanced assignment problem, the performances of the balanced assignment algorithms are also compared to provide a deeper insight.Notice that the approach in [18] is not an iterative approach, so it is not included in this comparison.
For the evaluation of the balanced assignment algorithm, we consider the cost and running time (measured in seconds).For the evaluation of the balanced clustering algorithm, we record the number of iterations, MSE, and running time (measured in seconds).It should be noted that we do not record the number of iterations for the method in [18] as it is not an iterative method.Furthermore, to evaluate the stability of the clustering algorithms, the coefficient of variation for the number of iterations, MSE and running time are collected.All the metrics mentioned above are calculated and averaged over 10 runs.For the sake of compactness, we round all the metrics to two decimal places.In addition, all trailing zeros are omitted.The best results are highlighted with the bold format in all the tables.
The number of clusters for all the data in our experiments takes its value from the set of {3, 9, 21, 45, 93}.We do not choose very large cluster number, as according to the rule of thumb [10], the number of clusters k for data of size n should be less than √ n/2.For the datasets involved in our experiments, the largest number of observations is 5000, and √ 5000/2 = 50 so that the range of k is enough for us to evaluate the clustering algorithm.
All the algorithms involved in this paper are implemented in MATLAB which ran on an Intel i7-7700HQ 2.8 GHz processor with 16 GB memory.The code can be downloaded from https://github.com/IGGIUJS/BalancedClustering.

Balanced Assignment
From Tables 2 and 3, we can observe two differences.(1) Although occasionally the assignment method in [9] produces equal cost, more frequently, the proposed method achieves a better cost for the balanced assignment problem.(2) Our proposed assignment method takes much less time than the method in [9].

Balanced Clustering
From Tables 4 and 5, we can observe three differences.(1) Compared with the method in [18], the proposed method achieves much lower MSE with bearable loss of efficiency.(2) Compared with the method in [9], the proposed method leads to a better MSE, while at the same time, it is much faster.
(3) The proposed method generally converges with fewer iterations than the method in [9].
The coefficient of variation (C.V. = standard deviation/mean) is explored to evaluate the stability of the proposed algorithm.From Tables 6 and 7, we can observe two differences.(1) Our method is as stable as the method in [9].In fact, the standard deviations of these measures in our method are mostly lower than that in the method of [9], and the means of these measures in our method are even lower (better).( 2) Compared with the method in [18], the running time of the proposed method is less stable while the MSE is more stable.  The method in [9]. 2 The method in [18].  1 The method in [9]. 2 The method in [18].  1 The method in [9]. 2 The method in [18].

Discussion
As we can see in Tables 4 and 5, in most cases, our method produces a lower MSE than the method based on the Hungarian algorithm [9].However, there are situations when the method in [9] achieves equal or even better MSE, such as the cases when dividing the dataset R2 into 3 clusters, dividing the dataset Iris into 3 clusters, and dividing R5 into 45 clusters.There are two main reasons for this.(1) When the number of observations n is divisible by the number of clusters k, the Hungarian algorithm is more likely to get the same result as our assignment method and this will lead to an equal MSE.(2) From Tables 2 and 3, we can see that the proposed method always produces a balanced assignment with equal or lower cost.However, it is an iterative method and does not always converge to the global optimum.
When the number of observations n is not divisible by the number of clusters k, the sizes of clusters vary from n/k to n/k .The proposed method could adaptively determine the size of each cluster during the process of optimization, while for the method based on the Hungarian algorithm, the sizes of clusters must be pre-settled.The awkward situation is that it is not known which clusters should have size n/k , and which clusters should have size n/k in advance.Thus, theoretically, in each assignment step, the Hungarian algorithm needs to be executed C n n%k times in order to get the optimal assignment, where % is the modulus operator.
Our proposed method is suitable for applications which impose upper and (or) lower bounds on the sizes of clusters.The balanced size-constraint clustering in this paper is actually a special case when the upper bounds are n/k and the lower bounds are n/k .It is straightforward to see that the constraint matrix is totally unimodular when there are both upper bounds and lower bounds on the sizes of clusters.In situations when only upper (lower) bounds are available, the constraint matrix would still be totally unimodular.We will skip the proof here, as it is similar to the proof to Theorem 1.

Conclusions
In this paper, we propose a novel approach to address the issue of balanced clustering which has many applications in practice.Each iteration of the proposed algorithm consists of two steps, namely an assignment step and an update step.In the assignment step, observations are evenly assigned to different clusters, and the problem is modeled as an ILP.We prove that the constraint matrix of the ILP is totally unimodular so that the ILP can be efficiently solved with the simplex algorithm.In the update step, the new centers are updated as the centroids of the observations in the clusters.The regression result suggests that the average time complexity of the proposed method is O(mn 1.65 )-O(mn 1.70 ), which is much faster than the O(mn 3 ) method based on the Hungarian algorithm.Experiments on both synthetic and real data validate that the proposed method achieves both high quality and efficiency in the task of clustering with balanced size constraints.
There are several issues we can consider in our future work.For instance, the proposed method could be adapted into a general size-constraint clustering algorithm, and the clustering accuracy could be evaluated.Moreover, the proposed algorithm is a variant of the k-means algorithm.The hard size constraints are incorporated into the objective function of the k-means algorithm.Similarly, the hard constraints could be applied to other clustering algorithms, such as spectral clustering.Furthermore, in addition to the cluster-level constraints considered in this paper, instance-level constraints could be studied.For example, the observations o i and o j must (or must not) be in the same cluster.Finally, it is interesting to investigate the applications of the proposed method.
indicate a possible solution where o 1 , o 3 , and o 4 are assigned to c 1 , while o 2 and o 5 are assigned to c 2 .

Figure 1 .
Figure 1.The bipartite graph (the green solid lines indicate a possible solution).

Algorithm 1 : repeat 3 : 4 :
Clustering with balanced size constraints Input: Observations O = {o 1 , o 2 , ..., o n }, number of clusters k Output: Labels of the observations.1: Initialize k centers using the k-means++ algorithm; 2Assignment step: assign the observations to k clusters by solving Equation (3); Update step: update the new centers of the clusters by solving Equation (5); 5: until There is no more change to the partition matrix; 6: return Labels of the observations.

Figure 2 .
Figure 2.An example of our proposed method: (a) The data points.(b) The result after the first iteration.(c) The result after the second iteration.(d) The result after the last iteration.

Figure 3 .
Figure 3.The result of the regression.

Table 1 .
A brief summary of experimental datasets.

Table 2 .
Cost and running time for balanced assignment (synthetic data).

Table 3 .
Cost and running time for balanced assignment (real data).

Table 4 .
Mean squared error (MSE), running time, and iteration count for balanced clustering (synthetic data).

Table 5 .
MSE, running time, and iteration count for balanced clustering (real data).

Table 6 .
Coefficient of variation for MSE, running time, and iteration count (synthetic data).

Table 7 .
Coefficient of variation for MSE, running time, and iteration count (real data).