Fast Fusion Clustering via Double Random Projection

In unsupervised learning, clustering is a common starting point for data processing. The convex or concave fusion clustering method is a novel approach that is more stable and accurate than traditional methods such as k-means and hierarchical clustering. However, the optimization algorithm used with this method can be slowed down significantly by the complexity of the fusion penalty, which increases the computational burden. This paper introduces a random projection ADMM algorithm based on the Bernoulli distribution and develops a double random projection ADMM method for high-dimensional fusion clustering. These new approaches significantly outperform the classical ADMM algorithm due to their ability to significantly increase computational speed by reducing complexity and improving clustering accuracy by using multiple random projections under a new evaluation criterion. We also demonstrate the convergence of our new algorithm and test its performance on both simulated and real data examples.


Introduction
Clustering is a pivotal technique in unsupervised learning, applied extensively across various scientific and technological fields that handle large datasets.Clustering also plays a crucial role in data labelling, which sets the stage for the application of artificial intelligence and machine learning models [1,2] on the organized data to perform predictive analytics and classification tasks.Traditional clustering algorithms like k-means, Gaussian mixture models, and hierarchical clustering often face stability challenges due to their concave optimization formulations, which can lead to variability in results due to factors such as initial conditions or data outliers [3][4][5].Recent advancements in convex or concave fusion methods have shown promise in enhancing stability, achieving more consistent global or local optimality and reliable estimation of cluster centers and counts through sparseinducing penalties on pairwise centers [6][7][8][9].For clustering high-dimensional data, the data can be mapped into a high-dimensional feature space (kernel space) for processing [10], or clustering can be achieved by optimizing a smooth and continuous objective function that is based on robust statistics [11].This paper introduces a comprehensive empirical validation of these methods across simulation studies and real data analysis, detailing their improved stability over traditional methods and the practical implications of these advancements.
In fusion clustering, p-dimensional observations X i , i = 1, . . ., n are each parameterized by their own centroid µ i .These centroids are estimated under the assumption that all observations can be grouped into K clusters G 1 , . . ., G K , such that for i ∈ G k , µ i = ρ k , where ρ k represents the cluster center for observations in cluster G k .Fusion clustering aims to concurrently estimate the cluster centroids ρ k and the partitions G k by minimizing the following objectives 1 The penalty function p λ (∥ • ∥τ) is used to control the complexity of the model, and it is determined by the tuning parameter λ.The form of the norm used is represented by ∥ • ∥τ.This penalty function is typically used in fusion clustering to encourage sparsity in the estimated cluster centroids.The penalty function p λ (∥ • ∥τ) controls the complexity of the model and is determined by the tuning parameter λ.The norm used is ∥ • ∥τ.The penalty function is typically used in fusion clustering to promote sparsity in cluster centroids.
While robust, convex and concave fusion clustering methods are computationally demanding with a O(n 2 p) complexity, which can limit their practicality in scenarios involving large sample sizes n and high-dimensional datasets p.This article proposes a strategy for overcoming this limitation using random projection techniques [21][22][23][24].The approach involves the construction of a random diagonal matrix whose diagonal elements are sourced from a binary distribution.This matrix is then projected onto the pairwise component of the fusion method.By doing so, the number of pairwise differences between individual centroids, ∥µ i − µ j ∥, is substantially reduced.This reduction not only decreases the computational load but also maintains the integrity of the clustering process, enhancing the algorithm's scalability without excessively increasing the operational overhead.We provide empirical evidence demonstrating that this method significantly reduces the computational time while preserving the clustering quality, as shown in our simulation section.
In unsupervised learning, rapid clustering processes are crucial for handling large datasets efficiently.Our study introduces a novel approach to fusion clustering to enhance computational speed without compromising accuracy.Our contributions are summarized as follows: (1) We propose using random projection techniques to simplify the fusion aspect of clustering, effectively diminishing the pairwise centroids discrepancies and significantly boosting computational efficiency by minimizing the fusion step's complexity.(2) We have developed a novel double recursive random projection ADMM method designed for efficient high-dimensional fusion clustering, improving the accuracy of clustering.
In the remainder of this paper, the proposed new ADMM algorithm will be described in Section 2. This section will also include an analysis of the computational complexity and convergence of the algorithm.It will also include a strategy for improving cluster accuracy.The finite-sample properties of the proposed new ADMM algorithm will be evaluated through simulation studies in Section 3, and the method will be demonstrated using a real data example in Section 4. Concluding remarks will be presented in Section 5, and technical proofs will be provided in the Appendices A and B.

Methodology
To improve convex or concave fusion clustering efficiency, we propose an extension of the classical ADMM algorithm based on a random projection called RP-ADMM.A random projection can significantly reduce the time and computational resources needed to analyze high-dimensional data, making it suitable for large datasets and real-time processing.In this section, we will discuss the RP-ADMM algorithm's computational complexity and convergence.

Random Projection Based ADMM
Previous ADMM algorithms for convex or concave fusion clustering [6,8] have suffered from a high computational burden due to the need to consider all n(n − 1)/2 pairwise differences between individual centroids.This is represented by the fusion matrix , where e i is the ith unit vector with a 1 in the ith position and 0s elsewhere, and e i − e j can be interpreted as the difference between the ith and jth individual centroids.The computational complexity of this approach is O(n 2 ), which becomes infeasible for large sample sizes n.

Bernoulli distribution-based random projections ADMM
It is worth noting that pairwise differences between individual centroids can be deduced from other differences.For example, if we know that µ 1 − µ 2 = 0 and µ 2 − µ 3 = 0, we can conclude that µ 1 − µ 3 = 0.This means that it may be unnecessary to consider the row e 1 − e 3 in E .To reduce the computational burden of convex or concave fusion clustering, we propose a random projection approach.This only considers a small subset of the n(n − 1)/2 pairwise differences between individual centroids.This is achieved by generating indicators π ij from a Bernoulli distribution with probability α.We then form a random matrix Π, which is a diagonal matrix with diagonal elements the difference between µ i and µ j is taken into account; if π ij = 0, it is not considered.The probability α controls the size of the subset of pairwise differences considered.The matrix ΠE can be seen as a projection of E onto a sparse matrix.This is with about n(n − 1)(1 − α)/2 rows being zero vectors and about n(n − 1)α/2 ones being nonzero vectors.This projection is based on a Bernoulli distribution.Finally, we form a new fusion matrix Ω by deleting the rows of zero vectors in ΠE .The new fusion matrix is given by Ω = (Ω 1 , • • • , Ω κ ) T , where Ω j , j = 1, • • • , κ, denotes jth row vector of Ω.
• For the TLP [8] with a > 1, Through some algebra, the problem of ( 9) is equivalent to the minimization of the function h(µ, ϕ (m+1) , η (m+1) ), which has the from Under the given value of ϕ (m+1) , η (m+1) , the updated µ (m+1) are where I n is n × n identity matrix.µ (m+1) and ϕ (m+1) are updated according to the random projection ADMM iterative algorithm ( 5)-( 7) until the input of some convergence criteria, such as both dual and primal residuals being close to zero [28] in our practice.The convergence time of ADMM is highly related to the penalty parameter φ.A poor selection of φ can result in a slow convergence for the ADMM algorithm [29] and thus RP-ADMM.
In this paper, we fix φ = 1 throughout for simplicity.

Algorithm 1 RP-ADMM for fusion clustering
, η (0) ; tuning parameter, λ Output: an estimate of µ for m = 0, 1, 2, • • • do compute ϕ (m+1) using (5) compute η (m+1) using ( 6) compute µ (m+1) using ( 7) if convergence criterion is met, then Stop and denote the last iteration by µ(λ), else m = m + 1 end if end for Practically, we would not want to conduct the RP-ADMM updates comprehensively until convergence to save computing time in the first iterations.Another trick is to adopt the initial values of subsequent convex relaxations as optimal values from the previous relaxed convex problem, which significantly reduces the number of RP-ADMM iterations.

Selection of Optimal Tuning Parameter
For a given λ, the converging value µ(λ) of the above RP-ADMM procedure is defined as µ(λ) = argmin µ ℓ p (µ; λ), (11) where ℓ p (µ; λ) is defined in (2) and the optimal value of λ can be selected via a properly constructed data-driven criterion.In particular, we partition the support of λ into a grid of and for each λ j , we compute a solution path of µ(λ j ) and obtain K(λ j ) distinct cluster centroids { ρ 1 (λ j ), . . ., ρ K(λ j ) (λ j )}, The optimal λ is selected by minimizing a data-driven BIC, i.e., λ = argmin λ j ;j=1,...,J BIC(λ j ), where Subsequently, we obtain the estimator µ = µ( λ), and the individuals can be separated into K = K( λ) clusters accordingly, i.e., G k = {i : Other methods for tuning parameters in clustering, such as generalized degrees of freedom with generalized cross-validation [8] and stability-based cross validation [25,30] can provide good results but may require extensive computation or the specification of a hyperparameter perturbation size [8].In contrast, the proposed BIC is easy to compute and performs well in estimating cluster centroids and the true number of clusters (K). Figure 1 shows the change in BIC values against log(λ) and the cluster number of the simulation.Across all cases with different values of n and p, we observe that BIC(λ) decreases as the value of log(λ) increases.With recovering the true cluster number K = 3, BIC( λ) reaches a minimum at the optimal λ.Moreover, when log(λ) keeps increasing, the cluster centroids are continuously integrated, and BIC(λ) is enlarged.However, further research is needed to fully prove the consistency of the BIC in combination with the objective function ( 2).

Recursive RP-ADMM and Cluster Matrix
In the above cluster analysis, the effect of randomness on the clustering results was not considered.However, empirical analysis has shown that the impact of this randomness on the estimated cluster centers and numbers is minimal (i.e., ρ k 's and K's).However, the impact on the final partitioning results (i.e., which observations are grouped into a single cluster) can be significant.In response to this, we propose the Recursive RP-ADMM (RRP-ADMM) procedure, which performs multiple RP-ADMM cluster analyses by generating M random matrices (i.e., Ω m 's, m = 1, • • • , M) and repeatedly conducting the analysis.
Once the multiple RP-ADMM cluster analyses have been completed, we must summarize the results.We define a n × n symmetric cluster matrix C where C ij = 1 denotes that the ith and jth observations belong to the same cluster; otherwise, C ij = 0 .Another n × n symmetric matrix D is introduced, with element D ij representing the relative frequency of the ith and jth observations belonging to the same cluster over the M independent RP-ADMM clustering procedures.The decision of whether the ith and jth observations should be grouped into a single cluster or not can then be treated as a classification problem, with the two possible class labels being 1 (belong to the same cluster) or 0 (do not belong to the same cluster).We can use an indicator function to transform the relative frequency into class labels and generate an estimator for the cluster matrix C, i.e., where 1 (•) denotes the indicator function.We summarize the above procedure in Algorithm 2. This transformation can be understood as a voting-based aggregation strategy, similar to the one proposed by [31], which aims to reduce misclassification errors and improve the accuracy of the clustering.To evaluate the accuracy of the clustering results, we define a new measure called the similarity index (SI) between two data clusterings: Like the Rand Index (RI) measure [32], the newly introduced evaluation criterion can be seen as a measure of the percentage of correct decisions made by some algorithm.The SI values also range from 0 to 1, with lower values indicating better algorithm performance.

Algorithm 2 RRP-ADMM for fusion clustering
Input: data X 1 , • • • , X n ; M; Initialize µ (0) , η (0) ; tuning parameter, λ Output: an estimate of µ for m = 0, 1, • • • , M do compute µ (m) using RP-ADMM end for while 1 ≤ i ≤ n do compute D ij and C ij from (13) end while The classical convex or concave fusion clustering procedure in (1) requires O(n 2 p) operations and O(n 2 p + np) of storage for a single round of ADMM updates with primal and dual residual calculations, because all pairs of centroids are shrunk together in this method.
The RP-ADMM algorithm significantly improves computational efficiency compared to classical ADMM algorithm.It requires only O(κ p + np) of storage, compared to O(n 2 p + np) for the classical ADMM algorithm, because the variables η and ϕ have only κ columns rather than n(n − 1)/2.Additionally, the RP-ADMM algorithm requires only O(κ p) operations for its most computationally demanding step, in comparison to O(n 2 p) for the classical ADMM algorithm.The RP-ADMM algorithm also requires O(κn) operations to conduct Cholesky factorization in every iteration, in comparison to O(n 3 ) for the classical ADMM algorithm.This efficient Cholesky factorization is computed only once and reused across repeated RP-ADMM updates.
At the end of this subsection, we will demonstrate the convergence of the RP-ADMM algorithm by showing that the sequence generated by the algorithm contains a subsequence that converges to a stationary point.Lemma 1.Let {µ (m) , ϕ (m) , η (m) } ∞ k=1 be the sequence generated by Algorithm 1, then for some constant c > 0, In order to prove that the sequence {µ (m) , ϕ (m) , η (m) } ∞ k=1 is convergent, we need to assume that ϕ (m) is bounded and ψ∥η (m+1)−η (m) ∥ → 0 which are often observed in numerical tests.

Simulation
In this part of the study, simulation experiments were conducted to compare the performance of the extended and classical ADMM clustering algorithms in terms of computational time and clustering accuracy, using the evaluation criterion in (14).The Lasso-based fusion method often leads to the formation of dense clusters with a minor penalty for small differences in ∥ϕ j ∥, which can result in the formation of many spurious clusters with very small differences among them [6].In contrast, the concave penalty method tends to produce a clear cluster structure and a well-defined number of clusters [8].Therefore, in this study, we focus on the MCP-based fusion method [27] which compares the conventional ADMM's clustering performance and the proposed new ADMM algorithm.

Low-Dimensional Setting
In this part, we evaluated the clustering performance of the classical ADMM, RP-ADMM, and RRP-ADMM algorithms on low-dimensional synthetic data generated from three overlapping convex clusters with the same spherical shape in some number of dimensions p and sample size n.The synthetic data were generated from three populations and Σ = (σ kj ) p×p with σ jj = 1 and σ kj = 0.1 |k−j| for k ̸ = j.This setting was chosen deliberately to allow overlap in the sample sets generated from clusters proximal to each other, thereby increasing the complexity of the clustering task.As illustrated in Figure 2c, the clustering performance using a single random projection (RP-ADMM) was suboptimal, indicating challenges with cluster separability under this setup.Conversely, Figure 2b demonstrates that recursive random projection (RRP-ADMM) significantly improved clustering results.The recursive times for the RP-ADMM and RRP-ADMM algorithms were set to M = 10.To evaluate the accuracy of the RP-ADMM, relax-and-split approach [33] (RS-ADMM) and RRP-ADMM algorithms in recovering the true cluster matrix, we generated a random sample of n = 60 observations with 1-20 drawn from P 1 , 21-40 drawn from P 2 , and 41-60 drawn from P 3 , and set the number of dimensions to p = 5.The probability α of generating a 1 in the random matrix was set to α = c log(n) n , where c controls the probability size.The level plots in Figure 2 use colour to visualize the values of 1's and 0's in the cluster matrix.The results show that both RP-ADMM and RRP-ADMM can accurately recover the true cluster matrix, with RRP-ADMM showing more accurate gradation than the true cluster matrix.Single random projection (RP-ADMM) can cause high variance in clustering outcomes due to the randomness of the sampling process.To mitigate this issue, we have adopted the voting-based pooling technique [31], which reduces variance by averaging results from recursive random projection (RRP-ADMM).
To further evaluate the performance of the algorithms, we calculated the values of the index SI defined in (14) after 100 replicates under different c choices.We depicted the results as boxplots in Figure 3.These results show that RRP-ADMM consistently improves clustering accuracy compared to RP-ADMM, as evidenced by the smaller median and standard error of SI values.Next, we will compare the performance of classical ADMM and RRP-ADMM in terms of computation time per iteration and the SI after 100 trials.The sample size is varied with n = 60, 150, 240, 360 points and α = 4 log(n) n , while p = 2 is kept constant.In this study, we have limited the number of points to 360, as the classical ADMM algorithm requires a significant amount of computation time for a single realization with more points.We will also compare the performance of the Similarity Index (SI) and Rand Index (RI) in evaluating the clustering results.Therefore, we should calculate the partitioning structure of all points based on the estimated cluster matrix graph.This process involves first identifying the point a 1 with the most neighbors and aggregating the connected points with point a 1 as cluster 1, then finding the second point a 2 with the most edges to form cluster 2, and repeating this process until there are no more points remaining.
Table 1 shows the mean values of the SI, RI, and the consumed time in seconds for different sample sizes under different methods after 100 replicates.Based on the data in Table 1, we can observe the following: (i) The proposed RRP-ADMM significantly reduces the time required for convex or concave fusion clustering, especially when the sample size increases.(ii) RRP-ADMM produces smaller SI and larger RI values, possibly due to the voting-based pooling technique improving cluster accuracy.(iii) As the sample size increases, the SI and RI values decrease.The boxplots in Figures 4 and 5 demonstrate the superiority of the RRP-ADMM algorithm over the classical ADMM algorithm in terms of both the SI values and the square root of run time, as seen in the results obtained from 100 replicates with four different sample sizes.These results further reinforce our belief in the effectiveness of the RRP-ADMM algorithm.

High-Dimensional Setting
In this part, we investigate using the double random projection-based alternating direction method of multiplier (DRP-ADMM and DRRP-ADMM) algorithms for clustering high-dimensional data sets.We employ a recursive Gaussian distribution-based random projection strategy in the first step to mitigate the impact of randomness on cluster results.Since the classical ADMM algorithm is computationally intensive in high-dimensional settings, we focus on evaluating the performance of the DRP-ADMM and DRRP-ADMM algorithms with recursive times M = 9, using three Gaussian random projections in the outer layer and three binary random projections in the inner layer.The simulated data sets consist of two overlapping convex clusters with the same spherical shape.They are generated using a population P k = N (ρ k , Σ), k = 1, 2 with ρ 1 = 1 p , ρ 2 = −1 p .Furthermore, Σ = (σ kj ) p×p with σ jj = 1 and σ kj = 0.1 |k−j| for k ̸ = j.We consider four high-dimensional cases with p = 1000, 2000, 3000, 5000 and a fixed sample size of n = 100.
We evaluate the accuracy of the DRP-ADMM and DRRP-ADMM algorithms in recovering the true cluster matrix.To do this, we first generate a Gaussian random matrix R with dimensions p × q in the first projection.The elements of R correspond to N (0, 1/ √ q).
We set q = ⌈ κ ε 2 /2−ε 3 /3 log(n)⌉ with ε = 1 and κ = 5 6 .See [21,23] for the number of projections.In the second step, we generate a diagonal binary random matrix with probability α = 4 log(n) n of equaling one.Then, we calculate the values of the SI index defined in Equation ( 14) and plot the results as boxplots in Figure 6 after 100 replicates for different values of p.The results show that the DRRP-ADMM algorithm consistently outperforms the DRP-ADMM algorithm regarding the median and standard error of the SI values for all values of p, indicating that the DRRP-ADMM algorithm improves clustering accuracy.

Real Data Analysis
In this study, we use the DrivFace dataset to demonstrate the effectiveness of our proposed clustering procedure.The DrivFace database consists of n = 606 images of 640,480 pixels each, captured from four drivers (two women and two men) over different days and containing p = 17 facial features such as glasses and beards.Each driver's images containing similar facial features can be grouped into one cluster, resulting in a total of K = 4 clusters as shown in Figure 7a.Firstly, we know the true labels of the dataset; that is, there are four clusters, and we also know which observations belong to the common cluster.Secondly, because the similarity among observations in the pictures is very high across different clusters, it is challenging to separate them.Therefore, we can use this dataset to evaluate our proposed clustering method.Due to the large sample size of the DrivFace dataset, we do not use the classical ADMM algorithm, which would require 606 × (606 − 1) × 17/2 operations in a single ADMM iteration.Instead, we first scale the samples by each feature and apply the RP-ADMM procedure to estimate individual centers using a grid of λ values.We plot the f usiongrams of four selected variables in Figure 8, and the scrutiny of Figure 8a implies that some outlying points (influential points) cause the clusters to be dense.We then remove these 55 points and plot a new f usiongram in Figure 8b.The optimal λ value, as determined by the developed BIC criterion in Equation ( 12), is 1.38, indicating that the estimated number of clusters is four, the same as the number of drivers.We apply the proposed RRP-ADMM algorithm with a Bernoulli-distribution-based random projection procedure to further improve the cluster accuracy using α = 10 log(n)/n and a recursive number M = 20.Using the estimated optimal tuning parameter of 1.38, we obtain the estimated cluster matrix in Figure 7b, which closely resembles the true cluster matrix in Figure 7a.The calculated similarity index (SI) value is 0.098.Moreover, the value of Adjusted Rand Index (ARI) is 0.672.

Conclusions
We propose using the recursive random projection-based ADMM (RRP-ADMM) method to improve the speed and accuracy of convex and nonconvex fusion clustering.In simulations and real data examples, the RRP-ADMM method demonstrates superior performance in fast calculation and accurate clustering results.The RRP-ADMM algorithm is scalable and can be applied to deal with heterogeneous issues in any setting that involves fusion techniques.
However, some challenges still need to be addressed in this field.One challenge is efficiently transforming the cluster matrix graph into the target partitioning structure and determining the optimal number of clusters.Another challenge is using prior information about which points are more likely to be integrated into a single cluster to reduce the number of pairwise comparisons.Additionally, a further study is needed to determine the theoretical probability of achieving a probability of one in binary random projection.Another future research direction involves performing clustering simultaneously with feature selection, using techniques such as incorporating feature weights [34] or introducing sparsity [14].
of Canada (NSERC), and Linglong Kong was also partially supported by grants from the Canada Research Chair program from NSERC.

Figure 2 .
The level plots of cluster matrix including the true one in the left panel, estimators calculated from RRP-ADMM and RP-ADMM in the middle and right panels, respectively.

Figure 3 .
Figure 3. Boxplots of SI values through RP-ADMM and RRP-ADMM algorithms, respectively, under four choices of c after 100 replicates.

Figure 4 .
Figure 4. Boxplots of SI values through classical ADMM and RRP-ADMM algorithms, respectively, under four choices of sample sizes n after 100 replicates.

Figure 5 .
Figure 5. Boxplots of the square root of the run time through classical ADMM and RRP-ADMM algorithms, respectively, under four choices of sample sizes n after 100 replicates.

( a ) 7 .
True cluster matrix (b) Estimated cluster matrix Figure True (a) and estimated (b) cluster matrix in DrivFace data.

( a )Figure 8 .
Figure 8.The above f usiongrams are plotted from 4 selected variables in DrivFace data before (left panel) and after (right panel) deleting the influence points, respectively.

Table 1 .
The mean values of Similarity index (SI), Rand Index (RI) and run time in seconds against different sample sizes and different methods after 100 replicates.
[32]: 'SI' represents the similarity index defined in (14), 'RI' denotes Rand Index[32].TIME is the required time in seconds in a single round of ADMM.