Improved Selective Deep-Learning-Based Clustering Ensemble

Abstract: Clustering ensemble integrates multiple base clustering results to improve the stability and robustness of a single clustering method. It consists of two principal steps: a generation step, which creates the base clusterings, and a consensus function, which integrates all clusterings obtained in the generation step. However, most of the existing base clustering algorithms used in the generation step are shallow clustering algorithms such as k-means. These shallow clustering algorithms do not work well, or even fail, when dealing with large-scale, high-dimensional unstructured data. The emergence of deep clustering algorithms provides a way to address this challenge. Deep clustering combines deep representation learning with unsupervised clustering to handle complex high-dimensional data, and it has achieved excellent performance in many fields. In light of this, we introduce deep clustering into clustering ensemble and propose an improved selective deep-learning-based clustering ensemble algorithm (ISDCE). ISDCE exploits a deep clustering algorithm with different initialization parameters to generate multiple diverse base clusterings. Next, ISDCE constructs ensemble quality and diversity evaluation metrics of base clusterings to select higher-quality and rich-diversity candidate base clusterings. Finally, a weighted graph partition consensus function is utilized to aggregate the candidate base clusterings to obtain a consensus clustering result. Extensive experimental results on various types of datasets demonstrate that ISDCE performs significantly better than existing clustering ensemble approaches.


Introduction
With the explosive growth of 5G, big data have penetrated into every aspect of daily life [1]. These data typically present large scales, high dimensions, and complex structures [2]. How to mine valuable information from these complex data has become an urgent challenge [3]. Clustering analysis [4] is an essential technique in many fields of research that involve processing multidimensional data, such as pattern recognition, data retrieval, and bioinformatics. It is normally considered an unsupervised method for grouping data based on similarity [5]. Traditional clustering methods, such as k-means [6], spectral clustering [7], and Gaussian mixture clustering [8], have achieved good performance in various fields. However, most traditional clustering algorithms are only able to exploit shallow features of the data and cannot excavate the interdependence of complex data features in latent space [9]. On the other hand, no single clustering algorithm is suitable for all datasets, so various clustering algorithms have been proposed and improved. To address this problem, the concept of clustering ensemble (consensus clustering) [10] was introduced, inspired by the success of supervised classifier ensembles. Clustering ensemble has been applied in many fields [11] and combines a set of clusterings into a final consensus clustering [12].
Generally, the clustering ensemble method consists of two steps: generation and consensus function. Generation is the first step, in which the set of base clusterings is generated. In clustering ensemble, an appropriate generation process is very important because the final consensus result will be affected by the initial clusterings obtained in this step. The consensus function is the second step, and plays a major role in the clustering ensemble. It is a great challenge to define an appropriate consensus function to improve the results of single clustering algorithms. However, most of the existing base clustering algorithms used in the generation step of clustering ensemble methods are traditional shallow clustering algorithms. It is difficult for these traditional clustering algorithms to exploit deep features of the data and excavate the interdependence of complex data features in latent space [5]. Therefore, the performance of these shallow clusterings after aggregation will also be limited.
In recent years, the emergence and development of deep clustering methods [13], which simultaneously learn cluster assignments and feature representations using deep neural networks, have provided a way to address complex data clustering. Deep clustering typically learns a mapping from the original data space to a lower-dimensional feature space and iteratively optimizes the clustering objective in that feature space. Existing deep clustering algorithms can be roughly divided into three categories based on different deep representation layers [14]: autoencoder-based (AE-based) deep clustering, variational-autoencoder-based (VAE-based) deep clustering, and generative-adversarial-network-based (GAN-based) deep clustering. We elaborate on these deep clustering methods in the related work. Compared to traditional shallow clustering methods, existing deep clustering algorithms have achieved excellent performance in clustering complex high-dimensional data [15,16]. However, existing deep clustering methods are very sensitive to network parameters and hyperparameters, so the clustering results fluctuate greatly and are not robust enough [17].
To address the aforementioned problems in clustering ensemble and deep clustering, we propose a clustering ensemble method called improved selective deep clustering ensemble (ISDCE). It combines the idea of clustering ensemble with deep clustering and incorporates a selective strategy. ISDCE can be divided into two phases: the deep clustering generation phase and the selective clustering ensemble phase. In the deep clustering generation phase, unsupervised deep autoencoder networks with different initializations are trained with the non-clustering loss, and then k-means is applied to initialize the cluster centroids. The similarity between the deep low-dimensional embeddings extracted by the autoencoder and the clustering centroids is calculated as the soft cluster assignment through Student's t-distribution. The KL divergence between the soft cluster assignment and an auxiliary distribution is used as the clustering loss. Here, the auxiliary distribution is computed by normalizing the square of the soft cluster assignment. Finally, the clustering loss is jointly and iteratively optimized to obtain multiple deep clustering results. In the selective clustering ensemble phase, the different deep base clusterings are first evaluated. Considering that both the quality and diversity of base clusterings affect the final ensemble performance, we construct ensemble quality and diversity evaluation metrics of base clusterings. The base clusterings with higher quality and diversity are selected as ensemble candidates. Meanwhile, for the consensus function, we consider the local diversity of clusters within the same base clusterings and construct an entropy-based criterion to measure the reliability of clusters. A weighted graph partition consensus function is utilized to efficiently aggregate the candidate base clusterings. The final consensus clustering result is obtained by using Tcut [18] for graph partitioning.
The main contributions of our work are summarized as follows:
• ISDCE is able to select higher-quality and rich-diversity base clusterings to improve ensemble performance. In addition, ISDCE measures the reliability of clusters and the local diversity of clusters within the same base clusterings to further improve the integration performance.
• Extensive experimental results on various types of datasets confirm that ISDCE performs significantly more robustly and better than existing clustering ensemble approaches.
The remainder of this paper is organized as follows: we review related work on deep clustering and ensemble clustering in detail in Section 2. Then, we elaborate on our proposed ISDCE methodology in Section 3. Section 4 presents the experimental results. Finally, Section 5 presents the conclusions.

Related Work
In this paper, we propose an improved selective deep-learning-based clustering ensemble algorithm that is closely related to three branches of research: clustering ensemble, deep clustering, and selection ensemble. The following subsections introduce the related work in these three fields.

Clustering Ensemble
Existing clustering ensemble methods can be divided into three main categories [12]: pair-wise-similarity-based approaches, median-partitioning-based approaches, and graph- and hypergraph-partitioning-based approaches. We elaborate on some of the popular and classical clustering ensemble algorithms of recent years.
PTA [19] considers clustering ensemble based on sparse graph representation and probabilistic trajectory analysis. The problem of link uncertainty is solved using the elite neighbor selection strategy, and a similarity measure based on probabilistic trajectories is constructed using the K-elite neighbor sparse graph. PTA constructs a K-elite neighbor sparse graph to refine the local links using the random walk information, and derives the similarity based on the probabilistic trajectories by capturing global structural information through the random walk trajectories. Finally, two consensus functions, namely probabilistic trajectory accumulation and probabilistic-trajectory-based graph partitioning, are further proposed. ECFG [20] was proposed as a new ensemble clustering approach, termed ensemble clustering using a factor graph. It introduces super-object representation to facilitate the computation of ensemble processes and solves the optimization problem using an efficient solver based on the factor graph technique. WSCE [21] was developed as a new clustering ensemble framework, utilizing some concepts from the community detection domain and graph-based clustering to combine differently evaluated individual clustering results without a thresholding process. It introduces two-kernel spectral clustering to generate graph-based individual clustering results and normalized modularity measures to provide diversity estimates. SECWK [22] was proposed as a spectral ensemble clustering method that retains the advantages of co-association matrices in information integration, but with higher operational efficiency. The time and space complexity of spectral ensemble clustering are significantly reduced by identifying the equivalence between spectral ensemble clustering and weighted k-means. It is a promising candidate for big data clustering.
LWEA [23] is considered to be the most popular algorithm among the weighted clustering ensemble methods proposed in recent years. It considers the local diversity of clusters and utilizes uncertainty estimation and locally weighted strategies. The uncertainty of each cluster is estimated by considering the cluster labels in the whole ensemble through an entropy criterion. An ensemble-driven cluster validity measure was introduced, and a locally weighted co-association matrix was proposed as a summary of the ensemble of different clusters. Two new consensus functions were proposed by exploiting the local diversity in the ensembles. The authors in [24] observed that the co-association matrix may be dominated by poor base clusterings in clustering ensemble, resulting in inferior performance. They therefore introduced low-rank tensor approximation (LRTA) to clustering ensemble and exploited the low-rankness of a three-dimensional tensor formed by the coherent-link matrix and the co-association matrix. LRTA formulates the algorithm as a convex constrained optimization problem and solves it efficiently. ECCMS [25] was introduced as a novel, effective co-association matrix self-enhancement model for ensemble clustering that improves the traditional co-association matrix. It exploits the high-confidence information to form a sparse high-confidence matrix while simultaneously denoising erroneous connections. Technically, ECCMS is formulated as a symmetric constrained convex optimization problem, which is efficiently solved by an alternating iterative algorithm. The authors in [26] considered that conventional clustering ensemble methods may often be misguided by unreliable samples due to the lack of labels. Therefore, they integrated the active clustering ensemble method and proposed a self-paced learning framework (SPACE). It estimates the difficulty of samples in order to identify unreliable data and applies the easy data to the ensemble.

Deep Clustering
The emergence and development of deep clustering methods provide a technique that exploits the commonalities of unsupervised deep networks and clustering to simultaneously learn feature representations and cluster assignments. Deep clustering typically involves learning a mapping from the original data space to a low-dimensional feature space and iteratively optimizing the clustering objective in that feature space [14].
DEC [27] was one of the earliest deep clustering methods. It was inspired by deep learning for computer vision and extended it to unsupervised data clustering. It has been cited in more than 2000 articles and has attracted much attention in the field. DEC uses a pre-trained stacked denoising autoencoder to learn deep feature representations, and then defines a probability distribution based on the centroids and minimizes its KL divergence to an auxiliary target distribution in order to improve the clustering assignment and the feature representations at the same time. IDEC [28] is an improvement on DEC, and argues that the clustering loss defined by DEC corrupts the feature space and leads to unrepresentative features. Therefore, it adds back the decoder to jointly optimize the reconstruction error and the clustering loss. By retaining the decoder of the denoising autoencoder, it learns clustering features that preserve local structure. DCEC [29] is the convolutional autoencoder version of IDEC, and takes advantage of convolutional neural networks to better extract hierarchical embedding features. It is superior to stacked autoencoders because it incorporates spatial relationships between pixels in images. DFKM [30] is a novel deep clustering method with an adaptive loss function and entropy regularization that employs fuzzy k-means and uses fuzzy information to represent a clear deep clustering structure. To enhance the robustness of the model, DFKM utilizes a robust loss function with adaptive weights. GFDC [31] utilizes a deep convolutional network as the feature generator and a graph convolutional network with a softmax layer to perform the clustering assignment. It constructs a topological graph to express the spatial relationship of features. These AE-based methods have the advantage of being easy to implement and extend, but they are sensitive to network initialization parameters and hyperparameters, and they also require the number of layers in the deep network structure to be limited.
GMVAE [32] assumes that observations are generated from a multi-modal prior distribution and constructs an inference model that can be directly optimized using reparameterization techniques. VaDE [33] embeds the probabilistic clustering problem into a variational autoencoder framework. It models the data generation process through a GMM model and a neural network. VaDE is optimized by maximizing an evidence lower bound on the log-likelihood of the data using a stochastic gradient variational Bayesian estimator and a reparameterization technique. The main difference between GMVAE and VaDE is that the assumed sample-generating model is different; GMVAE is somewhat more complex than VaDE, with poorer empirical results. DCVA [34] considers that the latent space of the autoencoder does not pursue the same clustering goals as k-means or the Gaussian mixture model. Therefore, it introduces a variational autoencoder probabilistic approach that represents distances in the latent space in terms of KL divergence and uses probability distributions as inputs instead of points in the latent space. Finally, the latent space is clustered using a Bayesian Gaussian mixture model. These VAE-based methods have the advantage of being able to generate samples with reasonable theoretical guarantees, but have the disadvantage of high computational complexity.
Witnessing the great success of generative adversarial networks (GANs) [36] in estimating complex data distributions, DAC [35] introduces GANs to deep clustering. It matches the aggregated posterior of the latent representation with a Gaussian mixture distribution and optimizes three objectives. ClusterGAN [37] argues that the cluster structure may not be well preserved in the GAN latent space. It utilizes discrete and continuous latent variables and co-trains the GAN with an inverse mapping network and a clustering loss, so that the distance geometry of the projection space mirrors the distance geometry of the variables. These GAN-based methods may suffer from mode collapse and difficulties in the convergence of the algorithm.

Selection Ensemble
Selection ensemble (or ensemble pruning) [38] is a variant of clustering ensemble that selects an appropriate subset of base clusterings to form a smaller ensemble that performs better than the ensemble of all base clusterings.
Hadjitodorov et al. [39] adopted the ARI measure to evaluate the quality and diversity of the ensemble. They constructed four variants of ARI to measure diversity. They showed that the median diversity selections are usually significantly better than a randomly chosen ensemble. Fern et al. [40] studied how to consider the diversity and quality of the ensemble simultaneously. They showed that combining quality and diversity in the selection ensemble can produce better results, using the sum of the normalized mutual information as the measure. Jia et al. [41] proposed an improved selective clustering method, called a selective spectral clustering ensemble algorithm based on the bagging technique. The base clusterings were generated by employing spectral clustering with random initialization. Then, the bagging technique was applied to rank and evaluate the component clusterings. Based on this ranking, the candidate ensemble was selected for the final solution.

Improved Selective Deep Clustering Ensemble
The framework of the ISDCE method is illustrated in Figure 1. As can be seen in Figure 1, similar to the classical clustering ensemble procedure, ISDCE contains two phases: a deep clustering generation step and a selective clustering ensemble step. In the deep clustering generation step, the blue and green network structures in Figure 1 represent the network framework for deep autoencoder clustering. We first trained S deep autoencoder clusterings on the original input data, shown as the blue network structures $DAE_1$ to $DAE_S$ in Figure 1. The selective clustering ensemble step consists of two main modules: selective strategy and consensus ensemble. Considering that a partial ensemble may outperform the full ensemble, we incorporated the selective strategy into the DAE and constructed ensemble quality and diversity evaluation metrics of base clusterings to select M higher-quality and rich-diversity candidate clusterings (shown as the green network structures $DAE_{j_1}$ to $DAE_{j_M}$ in Figure 1). In the consensus ensemble step, we considered the local diversity of clusters within the same base clusterings. An entropy-based criterion was utilized to estimate the local uncertainty of different ensemble members. Considering that graph-based consensus functions can be applied to larger-scale data, we constructed a weighted graph partition consensus function to efficiently aggregate the candidate clusterings. The final consensus clustering result was obtained by using Tcut [18] for graph partitioning. The details of each module are described below.

Deep Clustering Generation
The framework of deep autoencoder clustering is shown in Figure 2. Deep autoencoder clustering consists of a fully connected autoencoder and a clustering layer, which is connected to the autoencoder embedding layer. In Figure 2, $x_i$ represents the $i$-th sample of the original input data, $\dot{x}_i$ indicates the $i$-th reconstructed sample, and $z_i$ represents the deep low-dimensional embedding representation of the $i$-th sample. Given the original input data $X = \{x_i\}_{i=1}^{n}$, deep autoencoder clustering first learns a mapping from the original data $x_i$ to a lower-dimensional embedding representation $z_i$. Then, it maps each embedding point $z_i$ of the input sample to a soft cluster label $q$ using a clustering layer. The non-clustering loss $L_r$ of deep autoencoder clustering consists of the fully connected autoencoder reconstruction loss, computed as:

$$L_r = \sum_{i=1}^{n} \left\| x_i - \dot{x}_i \right\|_2^2$$

Then, Student's t-distribution is used to measure the similarity between the deep representation $z_i$ and the clustering centroid $\mu_j$. The clustering centroids are generated by k-means and used as trainable weights. The clustering layer maps each embedding point $z_i$ to a soft cluster assignment $Q = \{q_{ij}\}$ as follows:

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}$$

Here, $\alpha$ is the degree of freedom of the Student's t-distribution. Since it is not possible to cross-validate $\alpha$ on a validation set in an unsupervised setting, learning this parameter is redundant, and we set $\alpha = 1$ in all later experiments [27]. $q_{ij}$ represents the probability of assigning the $i$-th sample to the $j$-th cluster, and $\mu_j$ is the $j$-th cluster centroid obtained from the k-means initialization. After that, an auxiliary target distribution $P = \{p_{ij}\}$ is constructed to help iteratively optimize the cluster assignment. We used the KL divergence between the soft assignment $Q$ and the auxiliary distribution $P$ as the clustering loss.
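The soft assignment step can be illustrated with a minimal pure-Python sketch. It assumes the Student's t kernel in the form used by DEC [27]; the function name `soft_assign` and the toy embeddings are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
def soft_assign(embeddings, centroids, alpha=1.0):
    """Return Q, where Q[i][j] is the probability of assigning point i to cluster j.

    Assumption: Student's t kernel as in DEC [27], with alpha degrees of freedom.
    """
    Q = []
    for z in embeddings:
        kernel = []
        for mu in centroids:
            # Squared Euclidean distance between the embedding and the centroid.
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            kernel.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(kernel)
        # Normalize over clusters so that each row of Q sums to 1.
        Q.append([value / total for value in kernel])
    return Q


# Toy example (hypothetical data): two points near the first centroid, one near the second.
embeddings = [[0.0, 0.0], [0.1, -0.1], [3.0, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
Q = soft_assign(embeddings, centroids)
```

Because the kernel has heavy tails, points far from every centroid still receive non-degenerate probabilities, which keeps the later KL-based updates well defined.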
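Here, unlike general autoencoder clustering, the clustering loss was used to scatter the embedded points $z$, while the reconstruction loss makes sure that the embedded space preserves the local structure of the data-generating distribution. The clustering loss $L_c$ was computed as follows:

$$L_c = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where the auxiliary target distribution $P$ is obtained by normalizing $Q$ squared:

$$p_{ij} = \frac{q_{ij}^2 / \sum_{i} q_{ij}}{\sum_{j'} \left( q_{ij'}^2 / \sum_{i} q_{ij'} \right)}$$

The gradients of $L_c$ for each cluster centroid $\mu_j$ and for each embedding point $z_i$ were subsequently calculated as:

$$\frac{\partial L_c}{\partial \mu_j} = -\frac{\alpha + 1}{\alpha} \sum_{i} \left(1 + \frac{\|z_i - \mu_j\|^2}{\alpha}\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j)$$

$$\frac{\partial L_c}{\partial z_i} = \frac{\alpha + 1}{\alpha} \sum_{j} \left(1 + \frac{\|z_i - \mu_j\|^2}{\alpha}\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j)$$

The gradient of $L_c$ with respect to $z_i$ was passed to the deep network and used in standard backpropagation to compute the parameter gradients. Finally, to obtain the final cluster assignment, the iteration was stopped when the maximum number of iterations was reached or when the change in the cluster assignment between two consecutive optimization iterations was less than a threshold. To show the deep autoencoder clustering computation clearly, the specific steps are summarized in Algorithm 1.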
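As an illustration of the clustering loss, the following sketch computes an auxiliary target distribution and the KL divergence from a given soft assignment. It assumes the DEC-style target [27], in which the squared assignments are divided by the soft cluster frequencies before row normalization; the function names and the toy matrix `Q` are hypothetical.

```python
import math


def target_distribution(Q):
    """Sharpen Q into P (assumed DEC-style: square, divide by cluster frequency, renormalize)."""
    n_clusters = len(Q[0])
    # Soft cluster frequencies f_j = sum_i q_ij.
    freq = [sum(row[j] for row in Q) for j in range(n_clusters)]
    P = []
    for row in Q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def kl_clustering_loss(P, Q, eps=1e-12):
    """KL(P || Q) summed over all samples and clusters; eps guards against log(0)."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p_row, q_row in zip(P, Q)
        for p, q in zip(p_row, q_row)
    )


# Toy soft assignment for three samples and two clusters (hypothetical values).
Q = [[0.6, 0.4], [0.9, 0.1], [0.2, 0.8]]
P = target_distribution(Q)
loss = kl_clustering_loss(P, Q)
```

In this toy case every row of `P` puts more mass on the already dominant cluster than the corresponding row of `Q`, which is the sharpening effect that drives the iterative refinement.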
The whole deep clustering network objective is defined as:

$$L = L_r + \gamma L_c$$

where $L_r$ and $L_c$ are the reconstruction loss and the clustering loss, respectively, and $\gamma > 0$ is a coefficient that controls the degree of distorting the embedded space. In subsequent experiments, the gamma value is usually set to 0.

In the selective strategy, we first constructed the ensemble quality and diversity evaluation metrics of the ensemble $\Pi = \{\pi_1, \pi_2, \ldots, \pi_S\}$. Here, $\pi_r = \{C^r_1, C^r_2, \ldots, C^r_{n^r}\}$ denotes the $r$-th base clustering in $\Pi$, $C^r_{n^r}$ indicates the $n^r$-th cluster of the $r$-th base clustering, and $n^r$ is the number of clusters in $\pi_r$.
With regard to ensemble quality evaluation, for a given ensemble $\Pi = \{\pi_1, \pi_2, \ldots, \pi_S\}$, we constructed a summational adjusted Rand index (SARI) as a means of evaluating the quality of each base clustering $\pi_r$:

$$SARI(\pi_r, \Pi) = \sum_{i=1}^{S} ARI(\pi_r, \pi_i)$$

Here, the SARI measures the degree of consistency between a particular base clustering $\pi_r$ and the overall trend contained in the ensemble $\Pi$. The larger the value of SARI, the higher the quality of $\pi_r$. ARI is the adjusted Rand index, which is a common external clustering metric.
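A minimal sketch of the quality score is given below. It assumes that SARI sums the ARI between one base clustering and every member of the ensemble (including itself, which only adds a constant and does not change the ranking). The ARI is implemented from its standard pair-counting definition; the function names and the toy ensemble are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard pair-counting ARI between two label vectors of equal length."""
    n = len(labels_a)
    # Pairs of objects that are together in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:
        # Degenerate case (e.g. both clusterings are trivial): treat as perfect agreement.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def sari(r, ensemble):
    """Assumed SARI: sum of ARI between base clustering r and all ensemble members."""
    return sum(adjusted_rand_index(ensemble[r], other) for other in ensemble)


# Toy ensemble of four base clusterings over six objects (hypothetical labels).
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first, with permuted labels
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],  # disagrees with the others
]
scores = [sari(r, ensemble) for r in range(len(ensemble))]
```

In this toy ensemble the last clustering receives the lowest score, which matches the intuition that low-SARI members behave like outliers.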
With regard to ensemble diversity evaluation, for a given ensemble $\Pi = \{\pi_1, \pi_2, \ldots, \pi_S\}$, we constructed the pair-wise normalized mutual information (PNMI), which aggregates $NMI(\pi_i, \pi_j)$ over the pairs of base clusterings in $\Pi$, to evaluate the diversity of the ensemble. NMI is the normalized mutual information, an external metric used in clustering to measure the degree of similarity between two clustering results. Here, the smaller the value of PNMI, the richer the diversity of the ensemble $\Pi$.
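The diversity measure can be sketched as follows. Two points here are assumptions rather than details taken from the text: the NMI is normalized by the geometric mean of the two entropies, and PNMI is taken as the mean NMI over all unordered pairs of base clusterings. The function names and toy ensembles are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a label vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI with geometric-mean normalization (an assumed normalization choice)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(n * c / (count_a[a] * count_b[b])) for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        # A single-cluster partition carries no information.
        return 1.0 if h_a == h_b else 0.0
    return mutual_info / sqrt(h_a * h_b)


def pnmi(ensemble):
    """Assumed PNMI: mean NMI over all unordered pairs; smaller means more diverse."""
    pairs = list(combinations(ensemble, 2))
    return sum(nmi(a, b) for a, b in pairs) / len(pairs)


# Hypothetical ensembles: one redundant, one maximally diverse.
redundant = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
diverse = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
pnmi_redundant = pnmi(redundant)
pnmi_diverse = pnmi(diverse)
```

The redundant ensemble scores close to 1 and the diverse one close to 0, so selecting members to reduce this value favors complementary partitions.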
After defining the ensemble quality and diversity evaluation criteria, a certain number of higher-quality and rich-diversity candidate base clusterings need to be selected from the ensemble $\Pi = \{\pi_1, \pi_2, \ldots, \pi_S\}$. Three selection strategies can be used here:
(1) Quality Strategy (QS). For a given ensemble $\Pi = \{\pi_1, \pi_2, \ldots, \pi_S\}$, the QS computes $SARI(\pi_r, \Pi)$ for all the base clusterings in the ensemble $\Pi$ and ranks them in descending order to select the top $M$ ($M \le S$) base clusterings with the highest SARI values as the candidate ensemble. In general, base clusterings with higher SARI values show more consistency with the overall trend. Base clusterings with lower SARI values can be considered outliers in the ensemble, and including them in the ensemble may be unfavorable.
(2) Diversity Strategy (DS). The DS is a strategy that seeks to maximize ensemble diversity. For a given ensemble $\Pi = \{\pi_1, \pi_2, \ldots, \pi_S\}$, we select $M$ ($M \le S$) base clusterings to minimize the PNMI. We can view this objective as the problem of finding a weight-maximizing subgraph, where each vertex is a base clustering and the weight of the edge between $\pi_i$ and $\pi_j$ is $1 - NMI(\pi_i, \pi_j)$. However, this problem is NP-hard. Therefore, we approximate the solution of the problem using a simple greedy strategy. First, the highest-quality base clustering according to SARI is selected to form a new ensemble $E$, and then base clusterings from the ensemble $\Pi$ are added to $E$ one at a time so as to minimize the PNMI value. This process is repeated until the number of base clusterings in the ensemble $E$ reaches $M$.
(3) Balance Strategy (BS). The BS is a combination of the above two strategies. The two metrics SARI and PNMI are combined to form a new metric, the balance strategy index (BSI), in which $\beta$ is an adjustment factor used to control the trade-off between quality and diversity. Similarly, this joint metric BSI is optimized greedily using a DS-like approach.
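The quality and diversity strategies can be sketched in a few lines, assuming the SARI scores and the pairwise NMI matrix have already been computed. The greedy rule below starts from the single highest-SARI clustering and repeatedly adds the clustering with the smallest summed NMI to the members already chosen; this is one plausible reading of the greedy procedure, and the function names and toy values are hypothetical. The balance strategy is not sketched because the exact form of BSI is not given here.

```python
def quality_strategy(sari_scores, m):
    """QS: indices of the m base clusterings with the highest SARI, best first."""
    order = sorted(range(len(sari_scores)), key=lambda r: sari_scores[r], reverse=True)
    return order[:m]


def diversity_strategy(sari_scores, nmi_matrix, m):
    """DS (greedy sketch): seed with the best-SARI clustering, then add the least similar ones."""
    n = len(sari_scores)
    selected = [max(range(n), key=lambda r: sari_scores[r])]
    while len(selected) < m:
        remaining = [r for r in range(n) if r not in selected]
        # Adding r raises the total pairwise NMI by its summed NMI to the current members,
        # so the smallest increment keeps the PNMI of the selected set as low as possible.
        best = min(remaining, key=lambda r: sum(nmi_matrix[r][s] for s in selected))
        selected.append(best)
    return selected


# Hypothetical SARI scores and symmetric pairwise NMI matrix for four base clusterings.
sari_scores = [2.9, 3.0, 1.2, 2.0]
nmi_matrix = [
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.1],
    [0.5, 0.4, 0.1, 1.0],
]
qs_candidates = quality_strategy(sari_scores, 2)
ds_candidates = diversity_strategy(sari_scores, nmi_matrix, 3)
```

On these toy values QS keeps the two near-duplicate high-quality clusterings, whereas DS keeps the best one and then prefers the members that are least similar to it.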

Consensus Ensemble
In the consensus ensemble of ISDCE, we considered cluster-wise diversity inside the same base clusterings to enhance the consensus performance [23]. We adopted the local weighting idea to evaluate the reliability of clusters. Given the ensemble $\Pi = \{\pi_1, \ldots, \pi_M\}$, where $\pi_m = \{C^m_1, \ldots, C^m_{n^m}\}$ denotes the $m$-th base clustering in $\Pi$ and $C_i$ indicates the $i$-th cluster, the uncertainty of cluster $C_i$ with respect to $\pi_m$ is computed as follows:

$$H^m(C_i) = -\sum_{j=1}^{n^m} p(C_i, C^m_j) \log_2 p(C_i, C^m_j)$$

where

$$p(C_i, C^m_j) = \frac{|C_i \cap C^m_j|}{|C_i|}$$

Here, $n^m$ is the number of clusters in $\pi_m$, $C^m_j$ is the $j$-th cluster in $\pi_m$, and $|C_i|$ indicates the number of objects in $C_i$.
Therefore, the uncertainty (or entropy) of $C_i$ with respect to the ensemble $\Pi$ can be calculated as follows:

$$H^{\Pi}(C_i) = \sum_{m=1}^{M} H^m(C_i)$$

where $M$ is the number of base clusterings in $\Pi$.
Then, we constructed an ensemble cluster index (ECI) to compute the reliability of clusters, which is defined as follows:

$$ECI(C_i) = \exp\left(-\frac{H^{\Pi}(C_i)}{\theta \cdot M}\right) \tag{14}$$

It is easy to see from the above equation that $ECI(C_i) \in (0, 1]$ and that the smaller the uncertainty of the cluster, the larger the ECI value. $\theta > 0$ is a coefficient that adjusts for the effects of cluster uncertainty. Then, based on ECI, we exploited the weighted graph partitioning consensus function. We defined the weighted graph as $G = (V, L)$, built on the original data $X$, and computed the link weight between two nodes $v_i$ and $v_j$ from the ECI values.

The time complexity of ISDCE can be roughly divided into two parts: deep clustering and clustering ensemble. The time complexity of the deep clustering part is $O(n d_h k + n D^2)$, where $d_h$, $k$, and $D$ are, respectively, the dimension of the embedding layer, the number of clusters, and the maximum number of neurons in the hidden layers. The time complexity of the clustering ensemble part is $O(M n^m + k l n + k (n^m)^2)$, where $n^m$ is the number of clusters in $\pi_m$ and $l$ is the average number of links connecting to a node in the graph.
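The entropy-based reliability of a cluster can be illustrated with the sketch below. It assumes the locally weighted formulation of LWEA [23]: the entropy of a cluster with respect to each base clustering uses base-2 logarithms, the entropies are summed over the ensemble, and ECI is the exponential of the negative sum scaled by θ·M. Clusters are represented as Python sets of object indices; the function names and toy ensemble are hypothetical.

```python
from math import exp, log2


def cluster_entropy(cluster, base_clustering):
    """Uncertainty of `cluster` (a set of object ids) with respect to one base clustering."""
    h = 0.0
    for other in base_clustering:
        # Fraction of the cluster's objects that fall into this cluster of the base clustering.
        p = len(cluster & other) / len(cluster)
        if p > 0.0:
            h -= p * log2(p)
    return h


def eci(cluster, ensemble, theta=0.4):
    """Assumed ECI: exp(-H / (theta * M)), with H the entropy summed over the M base clusterings."""
    total_entropy = sum(cluster_entropy(cluster, pi) for pi in ensemble)
    return exp(-total_entropy / (theta * len(ensemble)))


# Hypothetical ensemble of three base clusterings over six objects.
ensemble = [
    [{0, 1, 2}, {3, 4, 5}],
    [{0, 1, 2}, {3, 4}, {5}],
    [{0, 1}, {2, 3}, {4, 5}],
]
stable = eci({0, 1, 2}, ensemble)   # split by only one base clustering
unstable = eci({2, 3}, ensemble)    # split by two base clusterings
```

A cluster that every base clustering keeps intact has zero entropy and therefore an ECI of exactly 1, while clusters that are repeatedly split receive smaller weights.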

Experiments
In this section, to verify the effectiveness of our proposed ISDCE algorithm, we conducted a series of experiments on 7 datasets to compare ISDCE against 11 clustering ensemble algorithms.

Datasets and Evaluation Measures
We conducted experiments on three widely used UCI datasets (http://archive.ics.uci.edu/ (accessed on 3 November 2023)), two classical image datasets (https://github.com/zhoujielaoyu/2018-NC-DREC (accessed on 3 November 2023)), and two real-world biological datasets (https://github.com/BinPro/CONCOCT/tree/develop (accessed on 3 November 2023)). A summary of the statistics of these datasets can be found in Table 1.
• Wine: The Wine dataset [43] contains the results of a chemical analysis of wines grown in the same region, which determined the quantities of 13 constituents found in each of the three types of wines.
• MNIST: The MNIST dataset is a well-known set of image data of 70,000 handwritten digits (10 class labels) with 784 pixels. Mnist5 is a subset of MNIST [44].
• Strain: The Strain dataset is a set of synthetic mock metagenome data [45]. This dataset was constructed to investigate the impact of strain-level variation on clustering.
• Species: The Species dataset is also a set of synthetic mock metagenome data [45]. It was designed to resolve species-level variation in a complex community.
To evaluate the quality of the clustering result, we adopted normalized mutual information (NMI) and adjusted Rand index (ARI) as evaluation measures. Greater values of these metrics represent better clustering performance. A detailed introduction to these measures can be found in [46].

Experimental Settings and Clustering Performance Comparison
In this section, we compare the proposed ISDCE algorithm with 11 comparison algorithms, i.e., WSCE [21], PTAAL, PTACL, PTASL [19], PTGP [19], SECWK [22], LWEA [23], LWGP [23], LRTA [24], ECCMS [25], and SPACE [26]. For the 11 comparison algorithms, the parameters were set as suggested by their corresponding papers. The total number of base clusterings is 50. The number of base clusterings involved in the ensemble is 20. For each algorithm, the final number of clusters was set to the true number of categories of the dataset. Each algorithm was run 20 times, and the mean and variance of the NMI and ARI over the 20 runs were used as the evaluation results.
For our proposed ISDCE algorithm, the initial ensemble size is S = 50 and the number of selected base clusterings M is in the range of [10, S/2]. The number of clusters k was set to the true number of categories of the dataset. The parameter θ, which adjusts for the effects of cluster uncertainty, was set uniformly to 0.4 (a range of [0.2, 1] is recommended). Here, we experimented with three selective strategies (quality strategy, diversity strategy, and balance strategy) in ISDCE, where the adjustment factor is β = 0.5 in the balance strategy. Regarding the parameter settings for deep base clustering, we used randomly initialized network parameters. The optimizer is uniformly Adam and the activation function is ReLU. The corresponding autoencoder network structure for the experimental data is: Cars-

The clustering performance comparison results are shown in Tables 2-5. For each dataset, we ran each method 20 times and computed the average and standard deviation of the evaluation measures. All the bolded results in Tables 2-5 are statistically superior to the other methods (according to a pair-wise t-test at the 95% confidence level). ISDCEQS denotes that ISDCE uses the quality strategy for the ensemble, ISDCEDS indicates that ISDCE adopts the diversity strategy, and ISDCEBS represents that ISDCE employs the balance strategy.
Tables 2 and 3 show the clustering performance on the three UCI datasets. For each dataset, the highest three scores are highlighted in bold. As can be seen from the NMI results, the three strategies of ISDCE achieved the highest scores, except on the Cars dataset. SPACE achieved the best NMI on the Cars dataset, but its ARI value is clearly lower than that of ISDCE in Table 3. This shows that the ISDCE method, especially ISDCEDS, has better clustering performance and better robustness. Except for the Cars dataset, the three strategies of ISDCE achieved the best ARI scores on the other two datasets. ISDCEDS achieved the best ARI on the Cars dataset, but ISDCEQS and ISDCEBS do not seem to be good enough. The reason why ISDCEDS has a much larger value than the other methods may be that, for the Cars data, rich ensemble diversity plays the main role in the integration process. ISDCEDS utilizes the rich diversity of base partitions to generate better consensus clustering results. LWEA, LWGP, and ECCMS perform better than ISDCEQS and ISDCEBS, while they perform far worse than ISDCEDS. In the comparison experiments on these three small UCI datasets, we observe that all methods, except for LRTA and SPACE, obtain clustering results in less than 1 s. This indicates that these methods can obtain the final ensemble results quickly at a small data scale.
Tables 4 and 5 show the clustering performance on two image datasets and two biological datasets. "timeout" means that no clustering result was obtained within a long running time. "error" means that, for the WSCE method, the eigs function failed because dnaupd did not find any eigenvalues with sufficient accuracy. As can be seen from the NMI results in Table 4, the three strategies of ISDCE achieved the best NMI scores on all image and biological datasets. The ARI results in Table 5 show that the three strategies of ISDCE also obtained the best scores on all image and biological datasets.
The overall results from Tables 2-5 demonstrate the excellent clustering performance of our ISDCE method. Due to the large number of comparison methods and experimental datasets involved, this paper does not report the execution time of each algorithm. However, from the experimental results on three image datasets with gradually increasing sample sizes, we observe that the SECWK algorithm is the fastest, followed by LWGP and our ISDCE. The LRTA method is the slowest. In addition, the running time of WSCE, LRTA, ECCMS, and SPACE grows exponentially with the number of data samples, so these methods fail to produce final clustering results on the MNIST data even after a long running time. The execution times of the remaining methods are of the same order of magnitude. In the following, we conduct a series of experiments to verify the validity of each component of our ISDCE approach.

ISDCE Component Ablation Experiment
In this section, to validate the ensemble effectiveness of the ISDCE algorithm, we compared ISDCE with a single deep autoencoder clustering algorithm. To demonstrate the validity of the selection strategy, we also ran ISDCE without a selection strategy (ISDCE_noSS). The experimental results are shown in Tables 6 and 7. From Tables 6 and 7, we conclude that combining the ensemble idea and the selection strategy in the ISDCE algorithm significantly improves the clustering performance. The clustering performance of ISDCE_noSS is better than that of the single deep autoencoder clustering, while the performance of ISDCE is almost always better than that of ISDCE_noSS and its clustering results are more stable.
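To make the difference between ISDCE and ISDCE_noSS concrete, the selection step can be sketched as follows. This is only an illustration under loud assumptions: the per-clustering `quality` and `diversity` scores are taken as given inputs (their definitions are not reproduced here), the balance strategy is assumed to be the convex combination β·quality + (1 − β)·diversity, and `select_top_m` is a hypothetical helper name. The actual ISDCE scoring may differ.

```python
def select_top_m(quality, diversity, m, strategy="balance", beta=0.5):
    """Return the (sorted) indices of the m selected base clusterings.

    quality, diversity: one precomputed score per base clustering (assumed inputs).
    strategy: "quality", "diversity", "balance", or "none" (no selection).
    beta: adjustment factor of the assumed balance score.
    """
    if len(quality) != len(diversity):
        raise ValueError("quality and diversity must have the same length")
    if strategy == "none":
        # Ablation variant without selection: keep every base clustering.
        return list(range(len(quality)))
    if strategy == "quality":
        scores = list(quality)
    elif strategy == "diversity":
        scores = list(diversity)
    elif strategy == "balance":
        # Assumed form of the balance score; not taken from the paper.
        scores = [beta * q + (1 - beta) * d for q, d in zip(quality, diversity)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Rank base clusterings by descending score and keep the top m.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:m])


# Toy scores for four hypothetical base clusterings.
quality = [0.9, 0.2, 0.6, 0.7]
diversity = [0.3, 0.9, 0.7, 0.1]
selected = select_top_m(quality, diversity, m=2, strategy="balance", beta=0.5)
```

With `strategy="none"` the function returns all indices, which corresponds to feeding every base clustering to the consensus function as in the ablation variant without selection.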

ISDCE t-SNE Visualization
In this section, to verify that ISDCE achieves good clustering results, we used t-SNE to visualize the clustering results of the initial embedding and the final embedding on the Iris and Mnist5 datasets. The visualization results are shown in Figures 3-6. The horizontal and vertical axes, tsne_1 and tsne_2, are the coordinates of the initial and final embeddings after reduction to 2D features by t-SNE. The number of clusters visualized is the true number of categories in the dataset. Points of different colors represent samples in different clusters. It can be seen that the final embedding, obtained after joint optimization with the clustering loss, is more cluster-friendly: it forms denser clusters with less mixing between clusters. For the Iris data, only one class is linearly separable from the others, while the other two classes are not linearly separable from each other. Therefore, the t-SNE visualization of the Iris results contains two clusters that are hard to distinguish. Nevertheless, according to the two clustering evaluation measures, ISDCE achieved good performance on these data.

Table 1. Statistics of the datasets.

Table 2. Average NMI results of the comparison methods and ISDCE on three UCI datasets (the best three scores in each column are highlighted in bold).

Table 3. Average ARI results of the comparison methods and ISDCE on three UCI datasets (the best three scores in each column are highlighted in bold).

Table 4. Average NMI results of the comparison methods and ISDCE on two image and two biological datasets (the best three scores in each column are highlighted in bold).

Table 5. Average ARI results of the comparison methods and ISDCE on two image and two biological datasets (the best three scores in each column are highlighted in bold).

Table 6. Average NMI results of the ISDCE component ablation experiment.

Table 7. Average ARI results of the ISDCE component ablation experiment.