Divergence-Based Locally Weighted Ensemble Clustering with Dictionary Learning and L2,1-Norm

Accurate clustering is a challenging task with unlabeled data. Ensemble clustering aims to combine sets of base clusterings to obtain a better and more stable clustering and has shown its ability to improve clustering accuracy. Dense representation ensemble clustering (DREC) and entropy-based locally weighted ensemble clustering (ELWEC) are two typical methods for ensemble clustering. However, DREC treats each microcluster equally and hence ignores the differences between microclusters, while ELWEC conducts clustering on clusters rather than microclusters and ignores the sample–cluster relationship. To address these issues, a divergence-based locally weighted ensemble clustering with dictionary learning (DLWECDL) is proposed in this paper. Specifically, the DLWECDL consists of four phases. First, the clusters from the base clusterings are used to generate microclusters. Second, a Kullback–Leibler divergence-based ensemble-driven cluster index is used to measure the weight of each microcluster. With these weights, an ensemble clustering algorithm with dictionary learning and the L2,1-norm is employed in the third phase. Meanwhile, the objective function is solved by optimizing four subproblems and a similarity matrix is learned. Finally, a normalized cut (Ncut) is used to partition the similarity matrix and the ensemble clustering results are obtained. In this study, the proposed DLWECDL was validated on 20 widely used datasets and compared to other state-of-the-art ensemble clustering methods. The experimental results demonstrated that the proposed DLWECDL is a very promising method for ensemble clustering.


Introduction
For a long time, clustering has been widely studied as an important technology for machine learning [1][2][3][4]. However, due to the lack of prior knowledge, i.e., pre-label training, the accuracy of clustering algorithms is much lower than that of supervised learning methods. Traditional single clustering methods, such as k-means, balanced iterative reducing and clustering using hierarchies (BIRCH), density-based spatial clustering of applications with noise (DBSCAN), etc., cannot usually achieve good clustering results for complex data [5,6]. Encouraged by the accuracy improvement effects of ensemble learning methods, many researchers have begun to study clustering ensemble algorithms. Clustering ensembles learn from multiple base clustering results to obtain consensus results, which can greatly improve the clustering accuracy without the need for prior knowledge [7][8][9][10][11][12].
The focuses of ensemble clustering methods are either the selection of base clustering or ensemble methods [13]. The selection of base clustering has two influences on the consensus results: accuracy and diversity. Higher accuracy usually leads to the lower diversity of the base clustering, while higher diversity results in the lower accuracy of the base clustering [14]. Therefore, balancing these two factors is key in the selection of base clustering. Ensemble methods aim to learn more robust consensus results by mining more effective information from the base clustering sets. Essentially, ensemble methods mine more inner information from the base clusterings. Although there are many robust ensemble methods, it is difficult to identify which ensemble method outperforms the others on a given dataset due to the randomness of the base clustering selection and the diversity of datasets.
Generally speaking, the most commonly used representative methods for mining this information from base clusterings include (1) co-association (CA) matrices, which represent the mutual relationships between samples in the base clustering sets, i.e., relationships at the sample level, (2) cluster-cluster (CC) matrices, which indicate the relationships between clusters in base clustering sets, i.e., relationships at the cluster level, and (3) sample-cluster matrices, which represent the relationships between samples and clusters in base clustering sets, i.e., relationships at the sample-cluster level. Both CA and CC matrices can be calculated from sample-cluster matrices. CA matrices reveal the probability that samples belong to the same class: the larger the value of X_ij in a CA matrix, the greater the possibility that samples i and j are of the same class. Some methods aim to retain or learn reliable samples in CA matrices and then seek consensus results [10,14]. For example, Jia et al. proposed an effective self-enhancement framework for CA matrices to improve the ensemble clustering results, through which high-confidence information was extracted from base clusterings [15]. CC matrices reveal the similarities between clusters; they cannot be used for ensembles alone due to the lack of effective information, so they have to be combined with other valid information to perform accurate clustering. Therefore, some researchers have used CC matrices to calculate similarities and then mapped them as weights onto CA matrices or sample-cluster matrices [11,16]. Sample-cluster matrices are the original matrices in base clustering sets and retain the most complete information. Some methods choose to explore the hidden information in these original matrices [11].
For example, based on sample-cluster matrices, the dense representation ensemble clustering (DREC) method introduces microcluster representation, reduces the amount of data, retains the effective information from sample-cluster matrices to the greatest extent and then performs dense representation clustering, which not only improves the time performance but also explores the hidden effective information to the greatest extent [13]. Huang et al. pointed out that the differences between microclusters also play important roles in ensemble clustering [17]. However, the DREC method ignores the differences between microclusters. Moreover, it does not reveal the underlying structures in sample-cluster matrices well. Entropy-based locally weighted ensemble clustering (ELWEC) has been demonstrated as being effective in improving clustering accuracy [18]. The key reason for this is the adoption of the idea of mapping entropy-based local weights to clustering. However, the ELWEC method measures the weights of clusters rather than microclusters and ignores sample-cluster relationships, thereby limiting the clustering performance to some extent. Very recently, the Markov process [19], a growing tree model [20], a low-rank tensor approximation [21] and an equivalence granularity [22] have been applied to ensemble clustering to achieve better clustering results.
Motivated by the above analysis, a divergence-based locally weighted ensemble clustering with dictionary learning (DLWECDL) is proposed in this paper. The idea of local weights was introduced to the DLWECDL. Different from the entropy-based local weights of clusters in ELWEC, this study used the divergence-based local weights of microclusters for ensemble clustering. Specifically, low-rank representation, the L 2,1 -norm and dictionary learning were applied to design the objective function and the corresponding constraints. We used the augmented Lagrange multiplier (ALM) with alternating direction minimization (ADM) strategy for the optimization of the objective function. Extensive experiments on real datasets demonstrated the effectiveness of our proposed method.
The main contributions of this paper are summarized as follows: (1) The proposal of a Kullback-Leibler divergence-based weighted method to better reveal relationships between clusters; (2) The use of low-rank representation instead of dense representation to better explore hidden effective information and low-rank structures of original matrices; (3) The application of the L 2,1 -norm to noise to improve robustness; (4) The introduction of adaptive dictionary learning to better learn low-rank structures; (5) Extensive experiments to demonstrate that the proposed DLWECDL can significantly outperform other state-of-the-art approaches.
The rest of this paper is organized as follows. Section 2 reviews related works on ensemble clustering. The proposed ensemble clustering method is described in detail in Section 3. The experimental settings and results are analyzed and discussed in Section 4. Finally, Section 5 concludes the paper and provides our recommendations for future work.

Ensemble Clustering
The goal of ensemble clustering is to find consensus results based on M base clusterings. To obtain good consensus results, two questions naturally arise. The first question is the selection of the base clusterings, which should ensure not only the diversity of the base clusterings but also their quality or accuracy. Existing studies have proposed some methods that take into account the diversity and quality of base clusterings [23,24]. The second question is the ensemble method, which is roughly divided into two categories: similarity matrix-based learning and graph-based learning. Similarity matrix construction is a core problem in various clustering methods. In ensemble clustering, similarity matrices are obtained by exploring sample-sample, cluster-cluster and sample-cluster relationship matrices; spectral clustering is then used to obtain the final clustering results.
Based on similarity matrices, our method follows the dense representation ensemble clustering framework: it finds microclusters and then performs representation learning at the microcluster level. However, treating all microclusters equally does not work well because different microclusters contain different numbers of samples. Therefore, we designed a local weight-based microcluster ensemble method and used a new low-rank representation clustering method. Inspired by the ALRR method [25], we introduced the L2,1-norm and adaptive dictionary learning to the new low-rank representation method.

Microcluster Representatives
Our approach starts by finding microcluster representatives to simplify the problem. A sample-cluster matrix needs to be reconstructed before looking for these microcluster representatives. Figure 1 is an example that illustrates our definition of a microcluster, where C i is the i-th base clustering, X j represents the j-th sample and the numbers 1-7 in the heading of the full data matrix are the global renamed cluster IDs. We reconstructed the original base clustering results to obtain the full data matrix, in which we observed that the information in samples X 1 and X 2 was completely consistent. Therefore, we grouped X 1 and X 2 into the same microcluster and chose either X 1 or X 2 as the microcluster representative.
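For illustration, the microcluster construction described above can be sketched in a few lines of Python. This is an illustrative sketch only (not the implementation used in our experiments); it assumes the base clustering results are stacked as a sample-by-clustering label matrix, with globally renamed cluster IDs as in Figure 1:

```python
import numpy as np

def microcluster_representatives(labels):
    """Group samples whose rows in the sample-cluster label matrix are identical.

    labels: (n_samples, n_base_clusterings) array of base-clustering labels.
    Returns (rep_idx, assignment): the index of one representative sample per
    microcluster and, for each sample, the microcluster it belongs to.
    """
    # np.unique over rows finds samples with fully consistent label vectors
    _, rep_idx, assignment = np.unique(
        labels, axis=0, return_index=True, return_inverse=True)
    return rep_idx, assignment

# Toy example mirroring Figure 1: samples 0 and 1 agree in every base
# clustering, so they fall into the same microcluster.
L = np.array([[0, 1, 0],
              [0, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])
reps, assign = microcluster_representatives(L)
```

Here `reps` picks one representative per microcluster (either X_1 or X_2 may represent the first microcluster), and `assign` records the microcluster membership used later for label mapping.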

Information Entropy-Based Locally Weighted Method
The information entropy-based locally weighted method mainly explores the uncertainty of each cluster [18]. It uses entropy to measure the uncertainty of each cluster and then determines the weight of each cluster using a monotonically decreasing function: the more stable the cluster, the smaller its uncertainty and the larger its weight. However, it cannot guarantee consistent final weights for similar clusters, while completely different clusters may end up with consistent weights.
We used the locally weighted method for microclusters: we calculated the weight of each cluster in each base clustering and then applied it to the microclusters. The weights were measured using the ensemble-driven cluster index (ECI).
Taking the clusters in the i-th base clustering π_i as an example, the weights were calculated according to Equation (3), in which θ is a control parameter and |C_i| represents the number of samples in C_i; the entropy of each cluster is mapped to its weight through a monotonically decreasing exponential function. After obtaining the ECI weight of each cluster, we applied the weights to the selected representative microcluster matrix to obtain the final data matrix.
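As a sketch of how such entropy-based ECI weights can be computed, the following follows the LWEA-style formulation (cluster uncertainty as the entropy of how its samples are split by the other base clusterings, mapped through exp(-H/(θM))); the exact normalization here is an assumption, not the paper's verbatim Equation (3):

```python
import numpy as np

def eci_weights(labels, theta=0.5):
    """Entropy-based ensemble-driven cluster index (ECI) weights, sketched.

    labels: (n_samples, M) base-clustering label matrix.
    Returns a dict {(m, c): weight} for cluster c of base clustering m.
    """
    n, M = labels.shape
    weights = {}
    for m in range(M):
        for c in np.unique(labels[:, m]):
            members = labels[:, m] == c
            H = 0.0
            for m2 in range(M):
                if m2 == m:
                    continue
                # distribution of this cluster's samples over clustering m2
                _, counts = np.unique(labels[members, m2], return_counts=True)
                p = counts / counts.sum()
                H += -(p * np.log2(p)).sum()
            # monotonically decreasing map: stable cluster -> weight near 1
            weights[(m, c)] = np.exp(-H / (theta * M))
    return weights
```

A cluster whose samples are never split by any other base clustering has zero entropy and receives the maximum weight of 1.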

Dense Representation Ensemble Clustering
The concept of microclusters was introduced in the DREC method. The scale of the ensemble clustering problem is reduced using the "slim-down strategy", similarity matrices are then obtained using the dense representation method and the final segmentation is obtained by applying the Ncut algorithm. Because of the microclusters, the DREC method improves time efficiency and preserves more of the original information. However, the DREC method treats the "shrunk" samples equally, which ignores the fact that different microclusters represent different numbers of samples. At the same time, although the DREC method considers the influence of noise, it fails to consider the selection of the base clusterings, which also leads to instability in the final results when the base clusterings are selected randomly.

Divergence-Based Locally Weighted Ensemble Clustering with Dictionary Learning (DLWECDL)
The goal of ensemble clustering is to learn consistent results based on M base clusterings. In ensemble clustering, the key is to explore the effective information in base clustering sets. The effective information in base clustering sets is hidden within three common manifestations, namely sample-sample relational representation, sample-cluster relational representation and cluster-cluster relational representation. We believe that good consensus results can be obtained when all of the valid information from the three representations can be fully utilized. Sample-cluster relationship matrices are key to linking these three representations because they can be used to calculate the remaining two representations. Therefore, we took the sample-cluster relational representation as the base representation and used it as the data matrix for our method. It was the original representation of our base clustering set.

Divergence-Based Locally Weighted Method
The information entropy-based locally weighted method mainly considers the uncertainty between clusters. We introduced the Kullback-Leibler (KL) divergence, which is widely used to measure the differences between distributions; when two distributions are exactly the same, the KL divergence is 0. Considering the good performance of KL divergence in some clustering methods over recent years, we introduced it as a measure of local weights. Since p(π_i, π_j^m) and p(π_j^m, π_i) do not have clear probabilistic interpretations, the KL divergence results here were not guaranteed to always be greater than 0. After obtaining the KL divergence, we used the ECI entropy mapping function to obtain the new KL divergence weights.
To better illustrate the advantages of KL divergence weighting, an example is presented in Figure 2, where C_i represents the i-th base clustering result, π_i^j denotes the j-th cluster in the i-th base clustering and the numbers 1-12 in the circles are the sample indices. As shown in Table 1, we compared the results of the inter-cluster entropy calculation and the KL divergence calculation, where R represents the ratio of the maximum number of samples in the stable subsets to the number of samples contained in the cluster. For example, Samples 1, 2 and 3 were assigned to π_1^1, π_2^1 and π_3^1 in the base clusterings C_1, C_2 and C_3, respectively. The three samples were classified into the same class in different base clustering results, which meant that the most stable subset of each of π_1^1, π_2^1 and π_3^1 was {Sample 1, Sample 2, Sample 3}. Therefore, the R values for clusters π_1^1, π_2^1 and π_3^1 were 3/3 = 1, 3/5 = 0.6 and 3/4 = 0.75, respectively. It can be observed from Table 1 that the R values of π_1^3, π_1^4 and π_2^3 were consistent but their entropy values were quite different, which led to inconsistent weights. The same situation occurred in π_2^2 and π_3^2. The KL divergence method reduced the gaps between clusters with the same R values as much as possible so that the weights were as consistent as possible.
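The basic property exploited by this weighting, namely that KL divergence vanishes for identical distributions and grows with their difference, can be illustrated directly. The following is a generic sketch of discrete KL divergence (with smoothing against empty bins), not the paper's exact cluster-level quantities:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) between discrete distributions.

    eps guards against log(0); when p == q the divergence is exactly 0.
    Note that KL divergence is not symmetric in its arguments.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

For proper probability distributions this quantity is nonnegative; as noted above, the empirical quantities used in our weighting are not strict probabilities, which is why nonnegativity is not guaranteed in our setting.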

L 2,1 -Norm Subspace Clustering of Adaptive Dictionaries
After obtaining the final data matrix, we developed a new subspace clustering method. Unlike dense representation, we explored similarity matrices using low-rank representation, which incorporated an adaptive dictionary learning strategy and employed a new regularization term, i.e., the L 2,1 -norm.
The original low-rank subspace clustering formulation that explores similarity matrices is min_{Z,E} ||Z||_* + λ||E||_{2,1} s.t. X = DZ + E, where λ is a regularization parameter, X represents the data matrix, D is the dictionary, Z is the low-rank representation coefficient matrix, E is the noise and ||·||_* and ||·||_{2,1} represent the nuclear norm and the L2,1-norm of a matrix, respectively. The original low-rank representation method uses the data X itself as the dictionary D. On this basis, many low-rank representation subspace clustering algorithms have been further proposed and the adaptive dictionary learning low-rank representation [25] problem can be formulated as min_{Z,E,P} ||Z||_* + λ||E||_F^2 s.t. X = P^T XZ + E, P^T XX^T P = I_d, where ||·||_F is the famous Frobenius norm, which was used here for computational convenience because many closed-form solutions that are based on this norm can greatly improve time efficiency. In order to eliminate the arbitrary scaling factor in the process of dictionary learning, the dictionary D was replaced by P^T X. To take into account the advantages of both dictionary learning and noise immunity, our method was formulated as min_{Z,E,P} ||Z||_* + λ||E||_{2,1} s.t. X = P^T XZ + E, P^T XX^T P = I_d (Problem (7)), where P denotes a low-dimensional projection matrix and I_d is the identity matrix. The proposed method not only retains dictionary learning in low-rank representation, i.e., learning better and more orthogonal dictionaries, but also adopts the L2,1-norm to make it more robust to noise. A widely accepted theory is that high-dimensional data are determined by low-dimensional structures. The low-rank matrix Z that was obtained according to the objective function contained the angle information between the data samples. We performed SVD decomposition on Z and obtained H = UΣ^{1/2}. We then used H to obtain the final similarity matrix W.
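The post-processing of Z described above can be sketched as follows. This is an illustrative sketch: forming H = UΣ^{1/2} follows the text, while the absolute value and the sharpening exponent 2α are common LRR-style conventions and are assumptions here:

```python
import numpy as np

def similarity_from_z(Z, alpha=2):
    """Build a similarity matrix W from the low-rank coefficient matrix Z.

    Skinny SVD of Z gives Z = U S V^T; H = U S^{1/2} captures the angle
    information between samples, and W measures the angle between rows of H,
    sharpened by a positive integer exponent alpha.
    """
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    H = U * np.sqrt(s)                      # H = U Sigma^{1/2}
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # guard against zero rows
    Hn = H / norms
    # cosine of the angle between h_i and h_j, raised to 2*alpha
    W = np.abs(Hn @ Hn.T) ** (2 * alpha)
    return W
```

The resulting W is symmetric with entries in [0, 1] and is the matrix later partitioned by Ncut.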
The detailed steps of the proposed DLWECDL are described in Algorithm 1 and a flowchart is shown in Figure 3. It should be noted that α in Algorithm 1 is a positive integer parameter and h_i and h_j are the i-th and j-th rows of the matrix H, respectively.

Algorithm 1: Divergence-based locally weighted ensemble clustering with dictionary learning (DLWECDL).
1. Generate the microclusters from the M base clusterings and select their representatives.
2. Compute the KL divergence-based ECI weight of each cluster.
3. Apply the weights to the representative microcluster matrix to obtain the final data matrix.
4. Solve Problem (7) to obtain Z, compute H and build the similarity matrix W.
5. Perform Ncut to partition the similarity matrix W.
6. Obtain consensus result S by microcluster representative label mapping.

As shown in Figure 3, DLWECDL first introduces microclusters to reduce the amount of data, which reduces data redundancy and improves time efficiency. Then, DLWECDL performs local weighting on the simplified dataset. Two weighting methods, namely entropy-based and KL divergence-based weighting, are used to better represent the microclusters. Theoretically, the entropy-based weighting method focuses more on the uncertainty of the clusters themselves, while the KL divergence-based method focuses more on the relative uncertainty, i.e., the differences between clusters. This also means that datasets with more diverse base clusterings may be more suitable for the KL divergence-based weighting method. The third step uses low-rank representation with dictionary learning and the L2,1-norm to explore deep structures. After the Ncut method is used to partition the data, the labels of the reduced dataset need to be mapped back to the full dataset because of the introduction of the microclusters.
To demonstrate the feasibility and effectiveness of the proposed algorithm more intuitively, an example on a 2D synthetic dataset is presented in Figure 4. In the example, k-means clustering algorithms with different ks were performed 20 times. Their outputs were used to generate the microclusters, from which a matrix of the KL divergence weights was obtained. Then, low-rank representation with adaptive dictionary learning and the L 2,1norm was applied to the weighted matrix to obtain an affinity matrix and the corresponding labels for the microclusters. Finally, the labels were mapped to obtain the final results of the proposed DLWECDL. In Figure 4, the microclusters, KL divergence weights, affinity matrix and labels are the intermediate data of the proposed DLWECDL.

Optimization Method
For Problem (7), we employed the augmented Lagrange multiplier (ALM) with alternating direction minimization (ADM) strategy for optimization [26]. The auxiliary variable J needed to be introduced here. The augmented Lagrangian function is as follows: where Y 1 and Y 2 are Lagrange multipliers and µ is a penalty parameter. According to the ADM strategy [26], we divided the objective into several subproblems that could be efficiently optimized.

Subproblem J
To update J, we needed to solve the following problem: Problem (9) had a popular closed-form solution, which was obtained using singular value thresholding, i.e., an SVD decomposition followed by soft-thresholding of the singular values. It was consistent with the first solution step of the LRR method.
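As an illustration of this closed-form solution, the standard singular value thresholding operator can be implemented in a few lines; this is a generic sketch, with τ standing in for the 1/µ-type threshold that appears in the actual subproblem:

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: the closed-form minimizer of
    tau * ||J||_* + 0.5 * ||J - A||_F^2.

    Soft-thresholds the singular values of A by tau.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt
```

Singular values below the threshold are set to zero, which is what drives the solution toward low rank.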

Subproblem Z
To update Z, we needed to solve the following problem: Since Problem (10) was unconstrained, we could take the derivative with respect to Z directly. We obtained the derivative of Problem (10) as follows: Letting ∂L/∂Z = 0, we could then obtain the solution for Z as follows:

Subproblem E
To update E, we needed to solve the following problem: As with Problem (9), Problem (13) also had a closed-form solution. We calculated E using Lemma 1.

Lemma 1.
Let Q = [q_1, q_2, ..., q_i, ...] be a given matrix. If W* is the optimal solution to min_W λ||W||_{2,1} + (1/2)||W − Q||_F^2, then the i-th column of W* is w_i* = ((||q_i||_2 − λ)/||q_i||_2) q_i if ||q_i||_2 > λ and w_i* = 0 otherwise.
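A direct implementation of this column-wise shrinkage can be sketched as follows; the threshold λ here stands for the penalty ratio that appears in the actual E subproblem:

```python
import numpy as np

def l21_prox(Q, lam):
    """Column-wise shrinkage (Lemma 1): minimizer of
    lam * ||W||_{2,1} + 0.5 * ||W - Q||_F^2.

    Each column q_i is scaled by max(||q_i|| - lam, 0) / ||q_i||;
    columns with small norm are zeroed out entirely.
    """
    W = np.zeros_like(Q, dtype=float)
    norms = np.linalg.norm(Q, axis=0)
    keep = norms > lam
    W[:, keep] = Q[:, keep] * (norms[keep] - lam) / norms[keep]
    return W
```

Zeroing whole columns is what makes the L2,1-norm suitable for modeling sample-specific noise.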

Subproblem P
To update P, we needed to solve the following problem: Considering that Problem (16) was a constrained problem, we introduced Lemma 2 to solve it.

Lemma 2.
Given the objective function min_R ||Q − GR||_F^2 s.t. R^T R = RR^T = I, the optimal solution is R = UV^T, where U and V are the left and right singular vector matrices from the SVD decomposition of G^T Q, respectively.
We transformed Problem (16) to obtain the following results:

s.t. P^T XX^T P = I_d
Going one step further, let X^T P = R; then, according to Lemma 2, we could obtain the equation X^T P = UV^T. We then only needed to calculate the inverse of the data matrix to obtain the solution to Problem (16). The detailed optimization algorithm for DLWECDL is shown in Algorithm 2.
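Lemma 2 (the orthogonal Procrustes solution used in this subproblem) can be checked numerically with a short sketch; this is illustrative only:

```python
import numpy as np

def procrustes(G, Q):
    """Lemma 2: the orthogonal R minimizing ||Q - G R||_F with R^T R = I
    is R = U V^T, where G^T Q = U S V^T is the SVD."""
    U, _, Vt = np.linalg.svd(G.T @ Q)
    return U @ Vt

# Sanity check: if Q = G R_true for an orthogonal R_true (and G^T G is
# full rank), the recovered R equals R_true.
G = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
R_true = np.array([[0.0, -1.0],
                   [1.0,  0.0]])   # a 90-degree rotation
R = procrustes(G, G @ R_true)
```

The recovered R is orthogonal by construction, which is what preserves the dictionary constraint during the P update.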

Differences between Our Approach and Other Ensemble Clustering Methods
As mentioned in the Introduction, our method introduces the theory of microclusters in order to reduce the dataset size. The divergence weights are then calculated and applied to the microclusters. Finally, a low-rank representation is performed to obtain a similarity matrix. Compared to other existing advanced methods, our method has a number of differences and advantages, mainly in the following aspects: (1) Differences in the data matrix. Some methods perform ensemble algorithms based on co-association (CA) matrices [10,27], but CA matrices focus on instance-level relationships and ignore the relationships between clusters. Our method is based on instance-cluster data matrices, although the DREC [13], PTA-CL [17] and CESHL [11] methods also use data matrices that are similar to ours. Among these methods, CESHL does not introduce microclusters and its time efficiency is low, while DREC fails to consider the differences between microclusters. Our method makes up for these shortcomings.
It is worth pointing out that although the PTA-CL method considers the differences between microclusters, it does not explore their deep structures. (2) Differences in the weighting methods. The LWEC method is based on the entropy-based weighting method [18]. As shown in Section 3.1, the entropy-based weighting method cannot guarantee consistent weights among similar clusters. Therefore, our method uses KL divergence-based weighting to alleviate this contradiction to a certain extent. Some other weighting methods focus on cluster-level similarities and then map these similarities to the instance level [16].
(3) Differences in the low-rank representation. The existing low-rank representation-based ensemble methods all treat the original data directly as a dictionary [28,29].
Considering that good dictionaries are crucial to the learning of similarity matrices, our method uses novel low-rank representation with dictionary learning constraints.

Datasets and Evaluation Methods
In this section, we present the setup and results of our extensive experiments to validate the proposed algorithm on 20 real datasets. Information about the datasets is listed in Table 2. Although there are various metrics for evaluating clustering performance, we chose three of them, namely accuracy (ACC), normalized mutual information (NMI) and adjusted rand index (ARI), to evaluate the proposed approach because of their simplicity, popularity and robustness to changes in labeling [18,30].
ACC is the score that is obtained by matching the ground truth labels. Since the labels that are assigned by clustering methods may be inconsistent with the ground truth labels, the Hungarian algorithm is generally used for label alignment when calculating ACC, which can be formulated as follows: where y_j represents the ground truth label of sample x_j, f is the optimal label mapping, δ(y_j, f(π(x_j))) = 1 when y_j = f(π(x_j)) and δ(y_j, f(π(x_j))) = 0 otherwise. As a measure of the mutual information between the clustering results and the ground truth labels [31], NMI is defined as follows: where the cluster c_p in the clustering results and the cluster c_q in the ground truth labels contain n_p and n_q instances, respectively. ARI is an improved version of the rand index (RI) that reflects the degree of overlap between clustering results and ground truth labels [32], which can be defined as follows: where the clustering results and the ground truth labels contain k and k′ clusters, respectively, N_{i,j} is the number of common instances in cluster c_i in the clustering results and cluster p_j in the ground truth labels and N_{c_i} and N_{p_j} are the numbers of instances in clusters c_i and p_j, respectively.
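The ACC computation can be sketched as follows. For a self-contained illustration, this sketch matches labels by exhaustive search over permutations, which gives the same result as the Hungarian algorithm used in practice but is only tractable for small numbers of clusters:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly clustered under the best
    one-to-one relabeling of the predicted clusters.

    Exhaustive over label permutations (fine for small k); the Hungarian
    algorithm performs the same matching in polynomial time.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    best = 0
    for perm in permutations(range(k)):
        mapped = np.array([perm[p] for p in y_pred])
        best = max(best, int((mapped == y_true).sum()))
    return best / len(y_true)
```

For example, a prediction that merely swaps the names of two otherwise perfect clusters still scores an ACC of 1.0.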
The definitions of these three evaluation indicators show that the greater the indicator values, the better the method.

Experimental Settings
Each of the selected datasets contained 100 base clustering results, from which we randomly selected 20 to evaluate the ensemble clustering in each run. There were two main hyperparameters in the proposed approach, namely θ in (3) and λ in Problem (7). We used the grid search method to optimize the hyperparameters with all of the data in each dataset using the set of {0.2 : 0.1 : 2} for θ and {0.01, 0.1, 1, 10, 100, 200, 500} for λ. Note that these hyperparameters could also be optimized using evolutionary algorithms, as in many practical applications [33][34][35][36]. Additionally, the true number of classes in each dataset was also used as the input of the proposed approach. For each dataset, we ran the experiments 20 times and then reported the average results.

Experimental Results
We carried out a large number of repeated experiments and obtained the average results according to the optimal parameter range. We also compared our method to the following models:
• DREC [13], which introduces microclusters to reduce the amount of data and is a dense representation-based method;
• LWGP and LWEA [18], which both use locally weighted methods (LWGP is based on graph partitioning and LWEA is based on hierarchical clustering);
• MCLA [37], which is a clustering ensemble method that is based on hypergraph partitioning;
• PTA-CL [17], which introduces microclusters, explores probabilistic trajectories based on random walks and then uses complete-linkage hierarchical agglomerative clustering;
• CESHL [11], which is a clustering ensemble method based on structured hypergraph learning;
• SPCE [10], which introduces a self-paced learning method to learn consensus results from base clusterings;
• TRCE [27], which is a multi-graph learning clustering ensemble method that considers tri-level robustness.
Note that the proposed DLWECDL used divergence-based local weights for ensemble clustering. We also replaced the divergence-based local weights in DLWECDL with entropybased local weights but kept the other components unchanged in another algorithm, called ELWECDL, for comparison.
The NMI values of the proposed DLWECDL method and the other selected methods are listed in Table 3, where the best and second best values are shown in bold. From this table, it can be seen that the proposed DLWECDL method achieved the best or second best result in 16 out of the 20 cases, followed by the SPCE and ELWECDL methods (best or second best in 8 out of the 20 cases). TRCE achieved the best or second best result three times, meaning it ranked fourth among the ten methods. DREC, LWGP, LWEA, PTA-CL and CESHL performed poorly, each achieving the best or second best result only once. MCLA did not achieve the best or second best value for any of the 20 datasets. On average, DLWECDL and ELWECDL improved the NMI values by 7.36% and 6.12%, respectively, compared to the other eight ensemble clustering models. Therefore, these results demonstrated that the proposed DLWECDL significantly outperformed the other selected ensemble clustering methods in terms of NMI.

The ARI values of the ensemble clustering methods that are shown in Table 4 offered the following findings: (1) the DLWECDL method achieved the best or second best results 17 times, meaning that it ranked first among all of the ensemble clustering methods once again; (2) the DLWECDL method was followed by the ELWECDL method, which achieved the best or second best results for nine datasets; (3) the rest of the methods only achieved the best or second best values three times or less and, especially, both MCLA and TRCE failed to achieve the best or second best values for any of the datasets; (4) on average, the ARI values of DLWECDL and ELWECDL improved by 15.11% and 12.49%, respectively, compared to the other models. These findings confirmed that the proposed DLWECDL method was superior to the other selected methods in terms of ARI.

We further ran DREC, ELWECDL and DLWECDL on eight datasets (Wine, Caltech20, Caltech101, Control, FCT, ISOLET, LS and SPF).
Each ensemble clustering method was run 20 times on each dataset and the accuracy values are plotted in Figure 5. We found that the ELWECDL and DLWECDL methods achieved much higher accuracy than the DREC method in almost all cases. Meanwhile, the DLWECDL method was advantageous over the ELWECDL method in most cases, which indicated that the divergence-based local weights were better than the entropy-based local weights for ensemble clustering.

Impact of Hyperparameters
For the proposed ensemble clustering algorithm, there are two main hyperparameters, i.e., λ in Problem (7) and θ in the ECI. According to our extensive experiments, we found that λ had little effect on the final clustering results. The reason for this is the fact that low-rank structures are mainly explored using low-rank subspace clustering methods and ||Z||_* dominated Problem (7), as confirmed by Chen et al. [25]. For the weight parameter θ, we found that it had a large influence on the final results and that the optimal value of θ was related to the random selection of base clusterings in each experiment. According to our experience, the optimal weight parameter was in the range of 0.2-2. We selected some other datasets and repeated the experiments another 50 times. The θ values that corresponded to the maximum NMI values are shown in Figure 6. As shown in Figure 6a, in the first run of the experiment on the Zoo dataset, the corresponding optimal θ value was 1.2, which became 1.3 in the second run. In the subsequent experimental runs, θ did not have a fixed optimal value. The other datasets showed this same trend, which indicated that the parameter θ in our method was associated with the data matrix, i.e., we could not fix the weight parameter θ, even within the same dataset. This was mainly due to the problem of base clustering set selection.

Running Time
We compared the running time of the selected algorithms on 10 datasets, as shown in Table 5. As can be seen from the table, the time efficiency of the DLWECDL algorithm was not good because many iterations were performed while looking for the low-rank representation. In order to reduce the number of iterations, we could adjust the learning rate, i.e., ρ, within an appropriate range, as long as the loss function was reasonably reduced. By adjusting the ρ value, we could keep the number of iterations below 10, thereby improving the time performance of the algorithm. As can be seen from the table, after we increased the ρ value, the running time of DLWECDL became less than that of DREC [13].

Discussion
As analyzed in Section 3.3.5, our method is different from the other selected ensemble clustering methods in several aspects. Among them, the CESHL, DREC, PTA-CL and PTGP methods are based on the same data matrix as ours, while TRCE and SPCE are based on CA matrices. DREC, PTA-CL and PTGP all introduce microclusters to reduce the amount of data, while CESHL uses all data matrices directly. The superiority of our method over these methods mainly stems from the idea of the weighting and low-rank representation methods.
The KL divergence-based weighting method measures the differences between clusters, which alleviates the problem of the significant weight differences between similar clusters in ELWEC. Currently, DREC treats all microclusters equally and fails to consider the differences between microclusters. Although PTA-CL, PTGP and CESHL consider the differences between microclusters or clusters, none of them apply low-rank representation, i.e., they offer an insufficient exploration of the underlying information within data matrices. Moreover, CESHL is limited by the scale of the data, which leads to lower time efficiency.
Clustering ensemble methods based on low-rank representation, such as RSEC and NRSEC, are based on CA matrices and focus on instance-level relationships. They also all use the original data, i.e., the CA matrices, directly as dictionaries, although the L2,1-norm is applied to consider the influence of noise. In comparison, the advantages of dictionary learning in our method are more obvious.

Conclusions
In this paper, we proposed a new weighting method and a new low-rank representation method with adaptive dictionary learning. The new weighting method was able to mine more effective cluster-cluster relationships. We mapped these inter-cluster relationships into a representative microcluster matrix, i.e., we used the microcluster-cluster matrix as a new data matrix, and added new effective information while retaining the original matrix information to the greatest possible extent. Furthermore, methods based on low-rank representation with adaptive dictionary learning have been shown to be effective and we used the more reasonable L2,1-norm to enhance robustness. Our experimental results demonstrated the effectiveness of our proposed method. On average, the proposed DLWECDL improved the NMI and ARI values by 7.36% and 15.11%, respectively, compared to the other selected SOTA ensemble clustering models. However, due to the influence of the random selection of base clusterings, we could not obtain a fixed optimal weight parameter that matched all possible base clustering combinations, even within the same dataset. Through our extensive experiments, we obtained an empirical range of weight parameters. The selection of the optimal combination of base clusterings within a dataset to obtain a pre-determined optimal weight parameter is our next research direction.