Multi-Modal 3D Shape Clustering with Dual Contrastive Learning

Abstract: 3D shape clustering is developing into an important research subject with the wide applications of 3D shapes in the computer vision and multimedia fields. Since 3D shapes generally take on various modalities, how to comprehensively exploit the multi-modal properties to boost clustering performance has become a key issue for the 3D shape clustering task. Taking into account the advantages of multiple views and point clouds, this paper proposes the first multi-modal 3D shape clustering method, named the dual contrastive learning network (DCL-Net), to discover the clustering partitions of unlabeled 3D shapes. First, by simultaneously performing cross-view contrastive learning within the multi-view modality and cross-modal contrastive learning between the point cloud and multi-view modalities in the representation space, a representation-level dual contrastive learning module is developed, which aims to capture discriminative 3D shape features for clustering. Meanwhile, an assignment-level dual contrastive learning module is designed by further ensuring the consistency of clustering assignments within the multi-view modality, as well as between the point cloud and multi-view modalities, thus obtaining more compact clustering partitions. Experiments on two commonly used 3D shape benchmarks demonstrate the effectiveness of the proposed DCL-Net.


Introduction
With the development of 3D scanning and modeling technology, 3D shapes have been widely employed in various applications of computer vision and multimedia fields, such as 3D printing, model retrieval, augmented reality, etc. [1][2][3]. How to effectively analyze large numbers of 3D shapes has become a research hotspot. In recent years, owing to the advanced development of deep learning, a series of deep 3D shape classification methods [4][5][6] have obtained satisfactory results. However, the success of deep neural networks critically relies on large-scale human-annotated data [7][8][9], which requires a laborious data annotation procedure. Under these circumstances, clustering has received increasing attention due to its powerful ability to divide massive amounts of unlabeled data [10,11]. Exploring effective 3D shape clustering methods has become a promising approach to overcome the above obstacle.
In practical application scenarios, 3D shapes are generally represented by different modalities due to the diversity of acquisition devices. As two popular 3D modalities, point clouds and multiple views are produced by 3D scanners and RGB cameras respectively, which have the advantages of flexible acquisition and low costs [12]. Specifically, point clouds describe 3D shapes with a series of disordered points, and the positional arrangement of those points preserves the spatial geometry of the 3D shapes [13]. Different from point clouds, multiple views are formed by a series of 2D images corresponding to different camera angles [14,15]. They contain rich visual information of 3D shapes, such as texture and color [16]. Since point clouds and multiple views describe 3D shapes from different perspectives, effectively exploiting the multi-modal properties is conducive to capturing more discriminative descriptions of 3D shapes and better revealing compact 3D shape clustering partitions.
Recently, contrastive learning has shown great success in unsupervised representation learning [17]. The core idea of contrastive learning is to maximize the representation similarities of positives while minimizing those of negatives, thus capturing more effective representations of data. Driven by this, some unsupervised 3D shape representation learning methods [18,19] successfully extract better cross-modal 3D shape representations by performing contrastive learning among different 3D modalities. However, since the above methods lack clustering-oriented learning objectives, the performance is usually limited when directly applying traditional clustering algorithms to the learned representations. In order to learn cross-modal representations that are suitable for clustering, several previous works [20,21] have integrated contrastive learning into multi-modal clustering for text and image data. By maximizing the similarities among the representations or the clustering assignments of different modalities in a contrastive learning manner, these methods have achieved encouraging results. Nonetheless, no existing work has focused on the multi-modal 3D shape clustering task. For the point clouds and multiple views of 3D shapes, in addition to the inter-modal correlations, different views within the multi-view modality also describe different local appearances of 3D shapes from particular angles. Therefore, how to jointly explore the inter-view correlations within the multi-view modality and the inter-modal correlations between the point cloud and multi-view modalities during the learning procedure of contrastive clustering remains a challenging issue for the multi-modal 3D shape clustering task.
To address the above issue, this paper proposes a dual contrastive learning network (DCL-Net) for multi-modal 3D shape clustering. The key motivation behind our design involves two aspects. Firstly, for a 3D shape, different views within the multi-view modality contain diverse appearances from different perspectives. Meanwhile, the point cloud and multi-view modalities mainly focus on the geometric and visual information of 3D shapes, respectively. Simultaneously exploring the cross-view consistent representations of different views, as well as the cross-modal consistent representations of the point cloud and multi-view modalities, contributes to obtaining more discriminative 3D shape descriptions. Secondly, different views within the multi-view modality and the corresponding point cloud of the same 3D shape all share consistent semantics. In addition to learning consistent representations, exploring the cross-view and cross-modal consistent clustering assignments is beneficial to boosting the robustness of the 3D shape features for clustering, thus further enhancing the compactness of clustering partitions. Therefore, by simultaneously performing cross-view contrastive learning within the multi-view modality and cross-modal contrastive learning between the point cloud and multi-view modalities at both the representation and clustering assignment levels, a representation-level dual contrastive learning module and an assignment-level dual contrastive learning module are developed in the proposed method. The key contributions of this paper are as follows: (1) A dual contrastive learning network for multi-modal 3D shape clustering is proposed to discover the underlying clustering partitions of unlabeled 3D shapes. 
To the best of our knowledge, this is the first deep multi-modal 3D shape clustering method; (2) By simultaneously ensuring the representation consistency within the multi-view modality and between the point cloud and multi-view modalities, a representation-level dual contrastive learning module is proposed to capture discriminative 3D shape features for clustering; (3) To further boost the compactness of clustering partitions, an assignment-level dual contrastive learning module is proposed to simultaneously capture consistent clustering assignments within the multi-view modality and between the point cloud and multi-view modalities; (4) Experimental results on two widely used 3D shape benchmark datasets are presented to demonstrate the superior clustering performance of the proposed DCL-Net.
The remainder of this paper is organized as follows. Section 2 briefly describes the three aspects of current research that are most relevant to the proposed method. Section 3 introduces the proposed DCL-Net in detail. Section 4 presents a series of experimental results and analyses. Section 5 summarizes the conclusions.

Unsupervised 3D Shape Feature Learning
Due to the rapid growth of 3D shapes, significant progress has been made in unsupervised 3D shape feature learning. Many research works [22][23][24][25][26][27][28] have been proposed to learn 3D shape features from various 3D modalities, such as multiple views, point clouds, meshes, and voxels. For instance, Zhao et al. [23] proposed an autoencoder-based 3D point capsule network to extract 3D shape features via point cloud reconstruction. The method in [26] extracted structure-preserving 3D shape features by effectively encoding the local geometry structures of 3D meshes. Han et al. [28] proposed a recurrent neural network (RNN) architecture to learn global 3D features via the multiple view inter-prediction task. Furthermore, considering the multi-modal characteristics of 3D shapes, several cross-modal learning methods [18,19,29,30] have been proposed to boost the quality of 3D shape features by adequately exploiting information from different modalities. For example, Wu et al. [29] proposed a 3D generative adversarial network to capture 3D shape features by reconstructing 3D voxels from 2D images. Girdhar et al. [30] introduced a view encoder into the voxel autoencoder network to learn robust 3D shape features based on image-to-voxel generation. Nonetheless, the above methods are not oriented toward clustering tasks, thus it is difficult to ensure that the learned features are suitable for clustering.

Deep Multi-Modal Clustering
Multi-modal clustering aims to capture consistent underlying category partitions from multi-modal inputs, such as text [31], images [32], videos [33][34][35], etc. Due to the powerful feature extraction capability of deep neural networks [36,37], a number of deep multi-modal clustering methods [38][39][40][41][42] have been proposed over recent years. Ngiam et al. [38] introduced a deep autoencoder network to extract consistent representations across different modalities and obtained promising results in speech and vision tasks. Andrew et al. [39] adopted deep canonical correlation analysis (DCCA) to learn cross-modal consistent representations by maximizing the correlations between multi-modal features. Abavisani et al. [40] utilized multiple parallel autoencoders and a shared self-expression layer [43] to capture a joint cross-modal affinity matrix for clustering. The method in [42] adopted deep autoencoders to explore multi-modal shared representations while introducing adversarial training to disentangle the latent space. Zhou et al. [43] designed an adversarial network with an attention mechanism to learn cross-modal consistent representations for clustering. In summary, the current deep multi-modal clustering methods have made remarkable progress. However, the existing works have not focused on the multi-modal 3D shape clustering task. How to sufficiently exploit the advantages of deep learning to design an effective multi-modal 3D shape clustering method still needs to be further investigated.

Contrastive Learning
As a powerful approach to unsupervised representation learning, contrastive learning has attracted increasing research attention and several contrastive learning-based works [44][45][46][47][48] have recently emerged. He et al. [45] proposed a momentum contrastive method to facilitate unsupervised representation learning by regarding contrastive learning as a dictionary lookup and building a dynamic dictionary. Chen et al. [46] effectively simplified the framework in [47] by adopting a Siamese network with a prediction head while introducing powerful data augmentations to boost the quality of the learned features. Tian et al. [48] employed two asymmetric networks with an interactive prediction mechanism to learn image representations and avoided model collapse without negative samples. Motivated by the success of contrastive learning in unsupervised representation learning, several methods [20,21,49] have applied contrastive learning to multi-modal learning tasks. For example, Xu et al. [20] explored the common semantics of different modalities using feature contrastive learning and label contrastive learning. Trosten et al. [21] introduced contrastive learning to align multi-modal representations and achieved effective improvements in clustering performance. Although the above methods have achieved promising results, the existing explorations of multi-modal contrastive clustering have mainly focused on text- and image-related tasks. In contrast, this paper effectively utilizes the characteristics of multiple views and point clouds, and delivers a novel contrastive learning-based multi-modal 3D shape clustering method.

Architecture of DCL-Net
The overview architecture of the proposed DCL-Net is illustrated in Figure 1.
Let {I^v_i, P_i}, v = 1, 2, ..., V, i = 1, 2, ..., N denote a 3D shape dataset with N shapes, where I^v_i denotes the v-th view in the multiple views of the i-th 3D shape and P_i denotes the corresponding point cloud of the i-th 3D shape. As shown in the figure, the proposed DCL-Net includes a multi-modal feature extractor, a representation-level dual contrastive learning (RDCL) module, and an assignment-level dual contrastive learning (ADCL) module. Taking two views from different angles and the corresponding point cloud as inputs, the multi-modal feature extractor is adopted to extract the view and point cloud features. Afterward, the RDCL module is designed to capture more discriminative 3D shape features by simultaneously applying cross-view contrastive learning within the multi-view modality and cross-modal contrastive learning between the point cloud and multi-view modalities in the representation space. Moreover, by effectively applying cross-view and cross-modal contrastive learning to the clustering assignments, the ADCL module is designed to ensure clustering assignment consistency among different views and the corresponding point cloud, thus further boosting the 3D shape clustering performance. Finally, the clustering results are obtained from the soft labels predicted by the ADCL module.

Representation-Level Dual Contrastive Learning
For each 3D shape, different views within the multi-view modality and the corresponding point cloud share consistent semantics while containing complementary 3D shape information across both views and modalities. Simultaneously performing cross-view contrastive learning within the multi-view modality and cross-modal contrastive learning between the point cloud and multi-view modalities for consistent representations is conducive to capturing more discriminative 3D shape information from a comprehensive understanding of 3D shapes. To this end, an RDCL module that adopts both cross-view and cross-modal contrastive representation learning was developed to capture discriminative 3D shape features for clustering.
First, for the i-th 3D shape in a given mini-batch of n shapes, two views I^{v1}_i and I^{v2}_i are arbitrarily chosen from the V views, thereby providing more cross-view and cross-modal combinations for the network learning. After that, to capture latent 3D shape features from the inputs, the view encoder maps I^{v1}_i and I^{v2}_i into the view features F^{v1}_i and F^{v2}_i respectively, while the point cloud encoder is responsible for mapping P_i into the point cloud feature F^p_i. The mapping processes are calculated as follows:

F^{v1}_i = f(I^{v1}_i; θ_I), F^{v2}_i = f(I^{v2}_i; θ_I), F^p_i = f(P_i; θ_P),

where θ_I and θ_P denote the parameters of the view encoder and the point cloud encoder, respectively. Afterward, to effectively ensure the discrimination of the learned 3D shape features, a representation projection head g_φ(·) is adopted in the RDCL module to further project the learned features into the representation space:

Z^u_i = g_φ(F^u_i), u ∈ {v1, v2, p}.

Considering the semantic consistency among different views and the corresponding point cloud, the multiple view representations and the point cloud representation of the same 3D shape should maintain higher similarities than those of different 3D shapes. Therefore, the view representations and point cloud representation of the same 3D shape need to be taken as positives to be pulled together, while those of different 3D shapes need to be regarded as negatives to be pushed apart. In view of this, a representation-level cross-view contrastive loss L_RCV and a representation-level cross-modal contrastive loss L_RCM are simultaneously applied to the representation space to ensure both the cross-view representation consistency within the multi-view modality and the cross-modal representation consistency between the point cloud and multi-view modalities. Specifically, the representation-level cross-view contrastive loss for the input view I^{v1}_i is calculated as follows:

L^{RCV}_{i,v1} = -log [ exp(sim(Z^{v1}_i, Z^{v2}_i)/τ_R) / ( exp(sim(Z^{v1}_i, Z^{v2}_i)/τ_R) + Σ_{j≠i} ( exp(sim(Z^{v1}_i, Z^{v1}_j)/τ_R) + exp(sim(Z^{v1}_i, Z^{v2}_j)/τ_R) ) ) ],

where sim(·, ·) denotes the cosine similarity, and τ_R is the representation-level temperature parameter and is generally set to 0.5.
By computing the representation-level cross-view contrastive loss for each arbitrarily chosen view in the mini-batch, L_RCV can be expressed in the form of:

L_RCV = (1/2n) Σ_{i=1}^{n} ( L^{RCV}_{i,v1} + L^{RCV}_{i,v2} ).

Similarly, the representation-level cross-modal contrastive loss for the view I^{v1}_i is calculated as follows:

L^{RCM1}_i = -log [ exp(sim(Z^{v1}_i, Z^p_i)/τ_R) / ( exp(sim(Z^{v1}_i, Z^p_i)/τ_R) + Σ_{j≠i} ( exp(sim(Z^{v1}_i, Z^{v1}_j)/τ_R) + exp(sim(Z^{v1}_i, Z^p_j)/τ_R) ) ) ].

Note that, to avoid the bias of the point cloud representation toward a particular view, the cross-modal contrastive loss for the i-th shape is calculated both between Z^p_i and Z^{v1}_i and between Z^p_i and Z^{v2}_i, which are denoted as L^{RCM1}_i and L^{RCM2}_i, respectively. Therefore, the representation-level cross-modal contrastive loss L_RCM is calculated as follows:

L_RCM = (1/2n) Σ_{i=1}^{n} ( L^{RCM1}_i + L^{RCM2}_i ).

By combining L_RCV and L_RCM, the overall loss of the RDCL module is expressed as:

L_RDCL = L_RCV + L_RCM.

Under the constraint of the representation-level dual contrastive loss L_RDCL, the network is encouraged to distinguish different 3D shapes according to the cross-modal and cross-view consistent representations, thus effectively promoting the discrimination of the extracted 3D shape features for clustering.
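The cross-view and cross-modal objectives above are instance-level contrastive losses over the projected representations. Assuming the standard NT-Xent form, which matches the description of the positives, negatives, and temperature τ_R, a minimal NumPy sketch of such a loss might look as follows; the function name and batch construction are illustrative, not the authors' code (the released implementation would operate on PyTorch tensors):

```python
import numpy as np

def nt_xent(Za, Zb, tau=0.5):
    """Contrastive loss between two aligned sets of representations Za, Zb of
    shape (n, d): Za[i] and Zb[i] come from the same 3D shape (positive pair),
    while all other pairs in the mini-batch act as negatives."""
    Za = Za / np.linalg.norm(Za, axis=1, keepdims=True)   # cosine similarity via
    Zb = Zb / np.linalg.norm(Zb, axis=1, keepdims=True)   # L2-normalized vectors
    Z = np.concatenate([Za, Zb], axis=0)                  # (2n, d)
    sim = np.exp(Z @ Z.T / tau)                           # all pairwise similarities
    np.fill_diagonal(sim, 0.0)                            # exclude self-pairs
    pos = np.exp(np.sum(Za * Zb, axis=1) / tau)           # the n positive-pair terms
    pos = np.concatenate([pos, pos])                      # anchor each of the 2n samples
    return float(np.mean(-np.log(pos / sim.sum(axis=1))))
```

In the RDCL module, this form would be applied both between the two view representations (for L_RCV) and between each view and the point cloud representation (for L_RCM).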

Assignment-Level Dual Contrastive Learning
The ADCL module was designed to simultaneously ensure cross-view clustering assignment consistency within the multi-view modality and cross-modal clustering assignment consistency between the point cloud and multi-view modalities, thus further boosting the compactness of the learned 3D shape features for clustering. Specifically, for the extracted view and point cloud features F^{v1}_i, F^{v2}_i, and F^p_i of the i-th 3D shape, the ADCL module further maps them into soft labels using an assignment projection head h_ψ(·) with the following process:

Y^u_i = h_ψ(F^u_i), u ∈ {v1, v2, p}.

Let Y^u = [Y^u_1, Y^u_2, . . . , Y^u_n] denote the outputs of the assignment projection head for the n 3D shapes in the mini-batch and G^u denote the transposed matrix of Y^u, where u ∈ {v1, v2, p}. In G^u, the i-th column vector denotes the soft label of the i-th 3D shape and the k-th row vector denotes the clustering assignment distribution of cluster k.
Considering that both the different views within the multi-view modality and the corresponding point cloud of the same 3D shape contain consistent semantics, the obtained clustering assignment distributions should be similar within the multi-view modality and between the point cloud and multi-view modalities. Namely, the indices of the 3D shapes assigned to a particular cluster should be consistent across the different views and the corresponding point cloud. To this end, an assignment-level cross-view contrastive loss L_ACV and an assignment-level cross-modal contrastive loss L_ACM are simultaneously applied to the clustering assignments. The assignment-level cross-view contrastive loss for G^{v1}_k is calculated as follows:

L^{ACV}_{k,v1} = -log [ exp(sim(G^{v1}_k, G^{v2}_k)/τ_A) / ( exp(sim(G^{v1}_k, G^{v2}_k)/τ_A) + Σ_{l≠k} ( exp(sim(G^{v1}_k, G^{v1}_l)/τ_A) + exp(sim(G^{v1}_k, G^{v2}_l)/τ_A) ) ) ],

where τ_A denotes the assignment-level temperature parameter and is generally set to 1.0, and G^{v1}_k and G^{v1}_l are the k-th and l-th row vectors of G^{v1}, respectively. Then, L_ACV is calculated by computing the assignment-level cross-view contrastive loss for each cluster:

L_ACV = (1/2c) Σ_{k=1}^{c} ( L^{ACV}_{k,v1} + L^{ACV}_{k,v2} ),

where c denotes the number of clusters. Similar to the representation-level cross-modal contrastive loss, L^{ACM}_k is also obtained by computing L^{ACM1}_k between G^p_k and G^{v1}_k and L^{ACM2}_k between G^p_k and G^{v2}_k. In this way, the bias of the consistent clustering assignments toward a particular view is effectively removed, thus further boosting the robustness of the clustering assignments. The assignment-level cross-modal contrastive loss for G^{v1}_k is calculated as follows:

L^{ACM1}_k = -log [ exp(sim(G^{v1}_k, G^p_k)/τ_A) / ( exp(sim(G^{v1}_k, G^p_k)/τ_A) + Σ_{l≠k} ( exp(sim(G^{v1}_k, G^{v1}_l)/τ_A) + exp(sim(G^{v1}_k, G^p_l)/τ_A) ) ) ].

Then, L_ACM is naturally expressed as:

L_ACM = (1/2c) Σ_{k=1}^{c} ( L^{ACM1}_k + L^{ACM2}_k ).

Afterward, the overall assignment-level dual contrastive loss L_ADCL of the ADCL module is obtained by summing L_ACV and L_ACM:

L_ADCL = L_ACV + L_ACM.

As can be seen from the above formulation, L_ADCL simultaneously maximizes the cross-view and cross-modal assignment consistency of the same cluster, while minimizing that of different clusters. 
This effectively enhances the intra-cluster compactness and inter-cluster separation, thus further boosting the multi-modal 3D shape clustering performance.
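Assuming the assignment-level objective applies the same contrastive form to per-cluster assignment distributions, i.e., to the rows of the transposed soft-label matrices described above, a schematic NumPy sketch could look as follows; the helper names are hypothetical and the actual implementation may differ:

```python
import numpy as np

def softmax(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def assignment_contrast(Ya, Yb, tau=1.0):
    """Assignment-level contrastive loss between soft-label matrices Ya, Yb
    of shape (n, c). The k-th columns (cluster assignment distributions over
    the mini-batch) of the two inputs form a positive pair; columns of
    different clusters act as negatives."""
    Ga = Ya.T / np.linalg.norm(Ya.T, axis=1, keepdims=True)  # (c, n), rows = clusters
    Gb = Yb.T / np.linalg.norm(Yb.T, axis=1, keepdims=True)
    G = np.concatenate([Ga, Gb], axis=0)                     # (2c, n)
    sim = np.exp(G @ G.T / tau)                              # cluster-wise similarities
    np.fill_diagonal(sim, 0.0)                               # exclude self-pairs
    pos = np.exp(np.sum(Ga * Gb, axis=1) / tau)              # matched-cluster terms
    pos = np.concatenate([pos, pos])
    return float(np.mean(-np.log(pos / sim.sum(axis=1))))
```

The same function would be applied between the two view assignment matrices (for L_ACV) and between each view and the point cloud assignment matrix (for L_ACM).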
Finally, the clustering results are easily obtained from the predicted soft labels using the following formula:

q_i = argmax_k Y_{i,k},

where q_i is the final predicted label for the i-th 3D shape and Y_{i,k} denotes the k-th element of its predicted soft label.
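Concretely, each 3D shape receives the index of the largest entry of its predicted soft label; a tiny illustration with made-up soft labels:

```python
import numpy as np

# hypothetical soft labels for 3 shapes over c = 4 clusters (illustrative values)
Y = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.05, 0.05, 0.80, 0.10],
              [0.20, 0.60, 0.10, 0.10]])
q = Y.argmax(axis=1)   # predicted cluster index per 3D shape
print(q)               # [0 2 1]
```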

Implementation Details
In the proposed DCL-Net, in addition to the designed representation-level dual contrastive loss L_RDCL and the assignment-level dual contrastive loss L_ADCL, an additional regularization loss L_RL [20] is imposed on the predicted soft labels from the ADCL module, so as to avoid trivial solutions in deep clustering. Therefore, the total loss of the DCL-Net is calculated by summing the representation-level dual contrastive loss, the assignment-level dual contrastive loss, and the regularization loss:

L = L_RDCL + λ_1 L_ADCL + λ_2 L_RL,

where λ_1 ≥ 0 and λ_2 ≥ 0 are trade-off parameters that balance the roles of the different loss terms. For the multi-modal feature extractor in the DCL-Net, ResNet18 [50] and PointNet [51] are adopted as the encoder networks for the selected views and point clouds, respectively. For the RDCL module, a multi-layer perceptron (MLP) with the dimensions of 512-128-128 is utilized as the representation projection head. For the ADCL module, another MLP followed by a Softmax operation is utilized as the assignment projection head, in which the dimensions are set to 512-128-c.
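As a rough sketch of these components, the two projection heads and the weighted total objective might be wired up as follows; the toy MLP construction, helper names, and the exact placement of the trade-off weights are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def mlp(dims, rng):
    """Build a toy MLP as a list of weight matrices with the given layer sizes."""
    return [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    for i, W in enumerate(layers):
        x = x @ W
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU between hidden layers
    return x

rng = np.random.default_rng(0)
c = 10                             # e.g., the number of ModelNet10 classes
g_phi = mlp([512, 128, 128], rng)  # representation projection head (RDCL)
h_psi = mlp([512, 128, c], rng)    # assignment projection head (ADCL)

F = rng.normal(size=(4, 512))      # backbone features for a toy mini-batch
Z = forward(g_phi, F)              # representations, shape (4, 128)
logits = forward(h_psi, F)
logits -= logits.max(axis=1, keepdims=True)                     # stable Softmax
Y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # soft labels

def total_loss(L_RDCL, L_ADCL, L_RL, lam1=1.0, lam2=5.0):
    # weighted sum of the three loss terms; the lambda placement is assumed
    return L_RDCL + lam1 * L_ADCL + lam2 * L_RL
```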

Experimental Setup
The proposed DCL-Net was implemented on the PyTorch platform using a GeForce GTX 1080 Ti GPU and an Intel i7-8700K processor at 3.70 GHz. During the network training phase, Adam [52] was adopted as the optimizer and the learning rate was set to 1.0 × 10^-4. The batch size was set to 128 for all of the experiments and the trade-off parameters λ_1 and λ_2 were fixed to 1 and 5, respectively. After the training phase, the final clustering results were obtained from the predicted soft labels as described in Section 3.
To evaluate the clustering performance of the proposed DCL-Net, experiments were conducted on two widely used 3D shape benchmark datasets: ModelNet10 [53] and ModelNet40 [53]. The ModelNet10 dataset consists of 4,899 3D CAD models from 10 classes, while the ModelNet40 dataset includes 12,311 3D CAD models from 40 classes. Following the experimental settings of [12], 1,024 points from the surface of each CAD model were sampled to form the point clouds and twelve 2D views of each CAD model were rendered to obtain multiple views. Note that this paper focused on unsupervised multi-modal 3D shape clustering, thus the class labels were not provided in all of the experiments.
Following the previous clustering work [54], four commonly used evaluation metrics, i.e., accuracy (ACC), normalized mutual information (NMI), adjusted rand index (ARI), and F-score, were employed to evaluate the clustering performance of the proposed DCL-Net and comparison methods. Different metrics were used to measure the consistency between the predicted labels and the ground truth labels from different perspectives. Specifically, the ACC represented the proportion of correctly predicted samples in the total samples. The NMI was used as a normalized measure of the correlations between the distributions of the predicted labels and the ground truth labels. The ARI was a modified version of RI [55] and indicated the distribution correlations between the predicted labels and the ground truth labels. Finally, the F-score was the harmonic mean of the precision and recall, where precision and recall represented the fraction of correctly predicted samples in the total positive predictions and the actual positives, respectively. For all of these metrics, higher values indicated a better clustering performance.
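Since predicted cluster indices are arbitrary, the ACC is computed after matching predicted clusters to ground-truth classes. A brute-force NumPy sketch for a small number of clusters (real implementations typically use the Hungarian algorithm, e.g., scipy.optimize.linear_sum_assignment, for realistic cluster counts) might look like:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred, n_clusters):
    """ACC: best fraction of correctly assigned samples over all one-to-one
    mappings from predicted cluster indices to ground-truth class indices.
    Brute force over permutations; only feasible for small n_clusters."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    best = 0.0
    for perm in permutations(range(n_clusters)):
        mapped = np.array([perm[p] for p in y_pred])   # relabel predictions
        best = max(best, float(np.mean(mapped == y_true)))
    return best
```

For example, a clustering that swaps the labels of two classes but groups every sample correctly still achieves ACC = 1.0 under the best mapping.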

Comparison Results
To demonstrate the performance of the proposed DCL-Net, several existing multi-modal clustering methods were adopted for comparison, including DMSC [40], EAMC [43], and CoMVC [21]. Note that the selected methods were not designed for the multi-modal 3D shape clustering task, so they could not be directly compared with the proposed method. To this end, they were extended as DMSC*, EAMC*, and CoMVC* to adapt to the multi-modal 3D shape clustering task. Specifically, the feature extractor corresponding to the input point cloud was replaced with PointNet [51], which was consistent with the proposed DCL-Net. Then, the point cloud and an arbitrarily selected view of each 3D shape were fed into the corresponding feature extractors of the different modalities to obtain the final clustering results. Additionally, to ensure the reliability of the experimental results, all of the experiments were repeated ten times with random initializations to reduce the effects of randomness, and the mean results of the repeated experiments are reported in this paper.
The quantitative comparison results on the ModelNet10 and ModelNet40 datasets are shown in Table 1. As shown in the table, the proposed DCL-Net achieved a better clustering performance than the comparison methods on both the ModelNet10 and ModelNet40 datasets, which effectively proved the superiority of the proposed method. In particular, the DCL-Net significantly outperformed CoMVC* and EAMC* by large margins. Even compared to the advanced method DMSC*, the proposed method achieved performance improvements of 2.94%, 9.27%, 2.20%, and 1.68% for the ACC, NMI, ARI, and F-score metrics on ModelNet10 and performance improvements of 6.04%, 3.99%, 5.25%, and 4.66% for the ACC, NMI, ARI, and F-score metrics on ModelNet40. This was mainly because the comparison methods were proposed for general multi-modal clustering, and directly transferring them to the 3D shape clustering task failed to leverage the characteristics of 3D shapes, thus resulting in unsatisfactory clustering performances. In contrast, the proposed method took full advantage of the inter-view correlations within the multi-view modality, as well as the inter-modal correlations between the point cloud and multi-view modalities, and developed a dual contrastive learning network. By jointly exploring the cross-view and cross-modal consistent representations and clustering assignments, the proposed method was more suitable for the multi-modal 3D shape clustering task. Additionally, it is worth mentioning that the clustering accuracy of the comparison methods dropped by more than 20% when the benchmark dataset was changed from ModelNet10 to ModelNet40. The main reason for this was that ModelNet40 contains more classes and a more imbalanced data distribution than ModelNet10, making it more challenging for the multi-modal clustering task.
Nevertheless, the proposed method obtained values of 61.20%, 72.46%, 57.61%, and 60.22% for the ACC, NMI, ARI, and F-score metrics on the ModelNet40 dataset respectively, which further proved the robustness of our method. To further evaluate the superiority of the proposed DCL-Net over the comparison methods, t-SNE [56] visualizations of the 3D shape features utilized for clustering in the different methods were provided on the ModelNet10 dataset. The visualization results are shown in Figure 2, in which the different colors indicate different classes. As shown in Figure 2a-c, the features extracted by the comparison methods were quite dispersed and the boundaries between the different classes were inconspicuous. By contrast, the proposed DCL-Net provided clearer and more compact clustering partitions, which further demonstrated the effectiveness of the proposed method.

Evaluation of the RDCL and ADCL Modules
In the proposed method, a representation-level dual contrastive learning module and an assignment-level dual contrastive learning module were developed to discover the clustering partitions of unlabeled 3D shapes by jointly learning consistent 3D shape representations and clustering assignments. To validate the effectiveness of the two modules, evaluations were conducted on the ModelNet10 and ModelNet40 datasets. The results are shown in Table 2, in which "w/o ADCL" indicates the proposed method without the ADCL module and "w/o RDCL" indicates the proposed method without the designed RDCL module. When the ADCL module was removed, the clustering results could not be directly predicted by the model, hence the k-means algorithm [57] was introduced to perform clustering on the consistent 3D shape representations obtained by the RDCL. As shown in the table, the performance of the "w/o ADCL" method dropped significantly on both datasets compared to the DCL-Net, and the ACC value dropped sharply by more than 10% on ModelNet10.
This was mainly because the removal of the ADCL module disconnected the representation learning from the clustering procedure, so the obtained 3D shape features were irrelevant to the subsequent clustering. Similarly, the "w/o RDCL" method also obtained unsatisfactory clustering performances on both datasets. The main reason was that removing the RDCL module made it difficult to ensure the feature consistency between different views and the corresponding point cloud, thus damaging the intra-cluster compactness and inter-cluster separation of the latent 3D shape features for clustering. Comparatively, the clustering results of the DCL-Net were consistently improved on both datasets when using the RDCL and ADCL modules simultaneously. This sufficiently reflected the significance of both the representation-level dual contrastive learning and the assignment-level dual contrastive learning for multi-modal 3D shape clustering.

Evaluation of the Cross-View and Cross-Modal Contrastive Learning
By adequately exploiting the characteristics of 3D shapes, the proposed method simultaneously performed cross-view contrastive learning within the multi-view modality and cross-modal contrastive learning between the point cloud and multi-view modalities, so as to better explore the consistent semantic information of 3D shapes. To evaluate the effectiveness of the cross-view and cross-modal contrastive learning, validation experiments were conducted on the ModelNet10 and ModelNet40 datasets. The experimental results are reported in Table 3, in which "w/o cross-modal contrastive learning" denotes removing the point cloud branch and only combining the cross-view contrastive losses for the network constraints, and "w/o cross-view contrastive learning" denotes removing one of the view branches and constraining the network via the cross-modal contrastive losses of the remaining view branch with the point cloud branch. As shown in the table, the "w/o cross-modal contrastive learning" and "w/o cross-view contrastive learning" methods achieved only limited clustering performance on the two datasets compared to the DCL-Net. The main reason was that removing either the cross-view contrastive learning or the cross-modal contrastive learning prevented the network from exploring the consistent semantics from more comprehensive 3D shape information. Specifically, when the cross-modal contrastive learning was removed, the network was incapable of perceiving the spatial geometry of 3D shapes, which made it challenging to explore discriminative 3D shape descriptions from harder contrastive positives. Similarly, when the cross-view contrastive learning was removed, the network could not observe the richer visual information of the 3D shape from different angles, thus failing to ensure the compactness of the view features of the same 3D shape and misleading the extraction of consistent information. 
Therefore, both the cross-view and cross-modal contrastive learning adopted in the proposed DCL-Net were crucial for the 3D shape clustering.

Conclusions
3D shape clustering has become a promising research topic in the computer vision and multimedia fields due to its powerful ability to divide unlabeled 3D shape data. However, little effort has been put into solving the 3D shape clustering task in previous works. To this end, a novel DCL-Net for 3D shape clustering was proposed in this paper. Taking full advantage of the data characteristics of multiple views and point clouds, the proposed DCL-Net is the first deep multi-modal 3D shape clustering method. Specifically, a representation-level dual contrastive learning module was first designed to extract discriminative 3D shape features for clustering by ensuring cross-view representation consistency within the multi-view modality, as well as cross-modal representation consistency between the point cloud and multi-view modalities. Meanwhile, by simultaneously performing cross-view and cross-modal contrastive learning at the clustering assignment level, an assignment-level dual contrastive learning module was designed to further obtain consistent clustering assignments based on the robust learned 3D shape features. Under the joint effects of the two modules, the proposed DCL-Net is able to sufficiently exploit the consistency and complementarity within the multi-view modality, as well as between the point cloud and multi-view modalities, thus obtaining more compact category partitions. As the first attempt at solving the multi-modal 3D shape clustering task, the proposed DCL-Net achieved remarkable performance on two widely used 3D shape benchmark datasets, and may inspire further investigations in future unsupervised 3D shape analysis research.