# An End-to-End, Multi-Branch, Feature Fusion-Comparison Deep Clustering Method


## Abstract


## 1. Introduction

- We propose a new end-to-end, multi-branch, feature fusion-comparison deep clustering method. Contrastive learning accomplishes the prior representation learning, while the feature-extraction network fuses aggregated information across multiple branches. In the contrastive representation-learning stage, instance samples are compared through the clustering centers, extracting semantically meaningful feature representations. Representation learning and clustering are combined for joint training and iterative optimization.
- We design a new multi-branch feature-aggregation method. The feature channels are divided into sub-features, and a three-branch structure learns cross-dimensional spatial-channel information and weighted receptive-field spatial features. The branches exchange information across dimensions, aggregating the sub-features and establishing both long-term and short-term dependencies.
- We design a clustering-oriented contrastive representation-learning strategy. Unsupervised contrastive representation learning and clustering are optimized jointly, mitigating the error propagation that multi-stage deep clustering pipelines suffer from. Over successive iterations, training extracts clustering-oriented feature representations, improving the model's clustering ability.
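The cluster-center comparison described in the first contribution can be sketched as a swapped-prediction loss over a shared set of prototypes, in the spirit of SwAV [17]: two augmented views of the same samples are softly assigned to the clustering centers, and each view must predict the other's assignment. This is a simplified numpy illustration, not the paper's exact objective; the temperature value and the cosine-similarity prototype handling are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def swapped_prediction_loss(z1, z2, prototypes, temperature=0.1):
    """Compare two views of the same samples through shared cluster centers
    (prototypes) instead of pairwise instance contrast: each view's soft
    assignment over the centers must be predictable from the other view."""
    # L2-normalize embeddings and prototypes so dot products are cosines.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    p1 = softmax(z1 @ c.T / temperature)  # soft assignments of view 1
    p2 = softmax(z2 @ c.T / temperature)  # soft assignments of view 2
    # Swapped prediction: view 1 predicts view 2's code and vice versa.
    eps = 1e-12
    loss = -0.5 * (np.sum(p2 * np.log(p1 + eps), axis=1)
                   + np.sum(p1 * np.log(p2 + eps), axis=1)).mean()
    return loss
```

Consistent views of the same sample (assignments that agree) yield a low loss, while views assigned to different centers yield a high one, which is what drives the embeddings toward cluster-friendly structure.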

## 2. Related Work

## 3. Materials and Methods

#### 3.1. Contrast Deep Clustering

#### 3.2. Multi-Branch Feature Aggregation

#### 3.3. Objective Function

## 4. Experiments

#### 4.1. Dataset

- CIFAR-10 [33] contains 60,000 color images, of which 50,000 are training images and 10,000 are test images. Each image is a 32 × 32 three-channel RGB image depicting a real-world object from one of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
- CIFAR-100/20 [33] also contains 60,000 color images (50,000 training, 10,000 test), each a 32 × 32 three-channel RGB image. CIFAR-100 has 100 fine-grained categories that group into 20 superclasses of 5 categories each. Its category hierarchy is therefore finer and richer than CIFAR-10's, which is more conducive to network learning.
- STL10 [34] is a commonly used benchmark in the unsupervised domain. It consists of 113,000 RGB images at 96 × 96 resolution, comprising 105,000 training images and 8,000 test images.

#### 4.2. Evaluation Metrics
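The experiments report three standard clustering metrics: accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). NMI and ARI are available directly in scikit-learn (`normalized_mutual_info_score`, `adjusted_rand_score`); ACC requires the best one-to-one mapping between predicted clusters and ground-truth classes, conventionally found with the Hungarian algorithm. A minimal sketch, assuming scipy is available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy (ACC): fraction of samples correctly
    labeled under the best one-to-one cluster-to-class mapping."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_pred.max(), y_true.max()) + 1
    # Contingency matrix: w[i, j] = #samples predicted as cluster i with label j.
    w = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        w[p, t] += 1
    # Hungarian algorithm on -w maximizes the total matched samples.
    row_ind, col_ind = linear_sum_assignment(-w)
    return w[row_ind, col_ind].sum() / y_true.size
```

For example, predictions `[1, 1, 0, 0]` against labels `[0, 0, 1, 1]` score an ACC of 1.0, since the mapping 1→0, 0→1 matches every sample.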

#### 4.3. Experimental Settings

#### 4.4. Comparative Experiment

#### 4.5. Empirical Analysis

#### 4.5.1. Visualization of Cluster Semantics

#### 4.5.2. Ablation Study

#### 4.6. Comparative Study

#### 4.7. Parameter Sensitivity

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Li, F.; Qiao, H.; Zhang, B.; Xi, X. Discriminatively Boosted Image Clustering with Fully Convolutional Auto-Encoders. Pattern Recognit. **2018**, 83, 161–173.
2. von Luxburg, U. A Tutorial on Spectral Clustering. Stat. Comput. **2007**, 17, 395–416.
3. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep Clustering for Unsupervised Learning of Visual Features; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11218, pp. 139–156.
4. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning; PMLR, 2020; Volume 119, pp. 1597–1607.
5. Xu, J.; Tang, H.; Ren, Y.; Peng, L.; Zhu, X.; He, L. Multi-Level Feature Learning for Contrastive Multi-View Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16030–16039.
6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv **2014**, arXiv:1406.2661.
7. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv **2014**, arXiv:1411.1784.
8. Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Van Gool, L. SCAN: Learning to Classify Images without Labels. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 268–285.
9. Chen, C.; Lu, H.; Wei, H.; Geng, X. Deep Subspace Image Clustering Network with Self-Expression and Self-Supervision; Springer: Berlin/Heidelberg, Germany, 2022; Volume 53, pp. 4859–4873.
10. Yang, X.; Deng, C.; Zheng, F.; Yan, J.; Liu, W. Deep Spectral Clustering Using Dual Autoencoder Network. arXiv **2019**, arXiv:1904.13113.
11. Niu, C.; Shan, H.; Wang, G. SPICE: Semantic Pseudo-Labeling for Image Clustering. IEEE Trans. Image Process. **2022**, 31, 7264–7278.
12. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735.
13. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv **2020**, arXiv:2003.04297.
14. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. arXiv **2016**, arXiv:1511.06335.
15. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
16. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv **2018**, arXiv:1807.03748.
17. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. Adv. Neural Inf. Process. Syst. **2020**, 33, 9912–9924.
18. Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.D.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. Adv. Neural Inf. Process. Syst. **2020**, 33, 21271–21284.
19. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753.
20. Chen, X.; Xie, S.; He, K. An Empirical Study of Training Self-Supervised Vision Transformers. arXiv **2021**, arXiv:2104.02057.
21. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv **2021**, arXiv:2104.14294.
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. **2017**, 30, 5998–6008.
23. Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 1753–1759.
24. Mukherjee, S.; Asnani, H.; Lin, E.; Kannan, S. ClusterGAN: Latent Space Clustering in Generative Adversarial Networks. arXiv **2019**, arXiv:1809.03627.
25. Hu, J.; Zhang, Y.; Zhao, D.; Yang, G.; Chen, F.; Zhou, C.; Chen, W. A Robust Deep Learning Approach for the Quantitative Characterization and Clustering of Peach Tree Crowns Based on UAV Images. IEEE Trans. Geosci. Remote Sens. **2022**, 60, 1–13.
26. Li, Y.; Hu, P.; Liu, Z.; Peng, D.; Zhou, J.T.; Peng, X. Contrastive Clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8547–8555.
27. Zhong, H.; Wu, J.; Chen, C.; Huang, J.; Deng, M.; Nie, L.; Lin, Z.; Hua, X.S. Graph Contrastive Clustering. arXiv **2021**, arXiv:2104.01429.
28. Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. arXiv **2013**, arXiv:1306.0895.
29. Peyré, G.; Cuturi, M. Computational Optimal Transport. arXiv **2019**, arXiv:1803.00567.
30. Hu, Q.; Zhang, L.; Zhang, D.; Pan, W.; An, S.; Pedrycz, W. Measuring Relevance between Discrete and Continuous Features Based on Neighborhood Mutual Information. Expert Syst. Appl. **2011**, 38, 10737–10750.
31. Ntelemis, F.; Jin, Y.; Thomas, S.A. Information Maximization Clustering via Multi-View Self-Labelling. Knowl.-Based Syst. **2022**, 250, 109042.
32. Ouyang, D.; He, S.; Zhan, J.; Guo, H.; Huang, Z.; Luo, M.; Zhang, G. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv **2023**, arXiv:2305.13563.
33. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Technical Report, 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 2 September 2024).
34. Coates, A.; Ng, A.Y.; Lee, H. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 215–223.
35. Znalezniak, M.; Rola, P.; Kaszuba, P.; Tabor, J.; Smieja, M. Contrastive Hierarchical Clustering. arXiv **2023**, arXiv:2303.03389.
36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.

**Figure 1.** Multi-branch, feature fusion-comparison deep clustering architecture. First, a convolutional neural network integrating the multi-branch feature-extraction strategy extracts high-dimensional representations; the Conv Block and the EAR Block interact iteratively, and the EAR Block uses a three-branch structure to capture spatial-channel features and weighted receptive-field spatial features across different dimensions. Then, different transformed instances of the same data sample are mapped to the clustering centers in the feature space, and the comparison of instance samples is completed through those centers. Finally, clustering is performed based on the embedded feature vector z.
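The caption's final step, clustering based on the embedded feature vector z, can be sketched as a nearest-center assignment. This is an illustrative simplification; the cosine-similarity argmax used here is an assumption, not the paper's exact rule.

```python
import numpy as np

def assign_clusters(z, centers):
    """Label each embedded feature vector z with the index of its most
    similar clustering center (cosine similarity argmax)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return np.argmax(z @ c.T, axis=1)

# Hypothetical example: two embeddings against three axis-aligned centers.
labels = assign_clusters(np.array([[0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]), np.eye(3))
print(labels)  # → [1 0]
```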

**Figure 5.** Performance for different values of G; the horizontal axis indicates the value of G.

| Dataset Name | Total Samples | Clusters | Type | Size |
|---|---|---|---|---|
| CIFAR-10 | 60,000 | 10 | Color object image | 32 × 32 |
| CIFAR-100/20 | 60,000 | 20/100 | Color object image | 32 × 32 |
| STL10 | 113,000 | 10 | Color object image | 96 × 96 |

**Table 2.** Clustering performance on three object-image benchmarks; all results were obtained by training on an NVIDIA RTX 4090.

| Model | CIFAR-10 ACC | CIFAR-10 NMI | CIFAR-10 ARI | CIFAR-100/20 ACC | CIFAR-100/20 NMI | CIFAR-100/20 ARI | STL10 ACC | STL10 NMI | STL10 ARI |
|---|---|---|---|---|---|---|---|---|---|
| K-means | 22.2 | 7.5 | 4.6 | 14.2 | 8.2 | 2.6 | 22.5 | 12.7 | 6.1 |
| CC [26] | 78.9 | 70.4 | 63.7 | 42.8 | 43.0 | 26.5 | 85.0 | 76.3 | 72.5 |
| SCAN [8] | 87.2 | 78.2 | 75.3 | 46.7 | 45.9 | 29.0 | 75.0 | 66.0 | 58.7 |
| CoHiClust [35] | 83.0 | 75.3 | 70.1 | 45.0 | 41.0 | 28.0 | 69.0 | 60.7 | 52.5 |
| IMC-SwAV [31] | 89.3 | 81.4 | 79.2 | 49.3 | 51.2 | 34.5 | 81.4 | 71.9 | 67.4 |
| SwEAC (avg) | 89.6 ± 0.4 | 81.8 ± 0.5 | 79.8 ± 0.6 | 51.0 ± 0.5 | 52.1 ± 0.4 | 35.7 ± 0.7 | 83.3 ± 0.3 | 73.1 ± 0.3 | 68.5 ± 0.5 |
| SwEAC (best) | 90.1 | 82.3 | 80.7 | 51.5 | 52.8 | 36.5 | 83.6 | 73.4 | 69.0 |

| Dataset | Method | ACC | NMI | ARI |
|---|---|---|---|---|
| CIFAR-10 | SwEAC (ResNet) | 89.8 | 82.2 | 80.6 |
| CIFAR-10 | SwEAC (EAR) | 90.1 | 82.3 | 80.7 |
| CIFAR-100/20 | SwEAC (ResNet) | 49.0 | 50.4 | 33.9 |
| CIFAR-100/20 | SwEAC (EAR) | 51.5 | 52.8 | 36.5 |
| STL10 | SwEAC (ResNet) | 83.5 | 73.0 | 68.9 |
| STL10 | SwEAC (EAR) | 83.6 | 73.4 | 69.0 |

| Methods | ACC | NMI | ARI |
|---|---|---|---|
| SwEAC-kmeans | 65.5 | 68.8 | 42.7 |
| SwEAC-sc | 73.7 | 75.4 | 61.5 |
| SwEAC | 89.9 | 82.3 | 80.6 |

| G | ACC | NMI | ARI |
|---|---|---|---|
| G = 4 | 50.0 | 51.0 | 34.6 |
| G = 8 | 51.5 | 52.5 | 36.5 |
| G = 16 | 49.6 | 50.9 | 34.3 |
| G = 32 | 47.8 | 50.9 | 33.6 |
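G above is the number of groups into which the feature channels are divided before the three-branch module processes each sub-feature; the table suggests G = 8 works best on CIFAR-100/20. A minimal numpy sketch of the regrouping step (the shapes and the fold-into-batch layout are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def group_channels(x, G):
    """Split C channels of a (B, C, H, W) feature map into G sub-feature
    groups by folding the group axis into the batch axis."""
    B, C, H, W = x.shape
    assert C % G == 0, "channel count must be divisible by G"
    return x.reshape(B * G, C // G, H, W)

# Hypothetical feature map: batch of 2, 64 channels, 8 x 8 spatial size.
feats = np.zeros((2, 64, 8, 8), dtype=np.float32)
print(group_channels(feats, 8).shape)  # → (16, 8, 8, 8)
```

Each of the resulting sub-feature maps can then be attended to independently, which is what makes G a sensitivity parameter: too many groups leave too few channels per group to carry useful information.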

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Li, X.; Yang, H.
An End-to-End, Multi-Branch, Feature Fusion-Comparison Deep Clustering Method. *Mathematics* **2024**, *12*, 2749.
https://doi.org/10.3390/math12172749

**AMA Style**

Li X, Yang H.
An End-to-End, Multi-Branch, Feature Fusion-Comparison Deep Clustering Method. *Mathematics*. 2024; 12(17):2749.
https://doi.org/10.3390/math12172749

**Chicago/Turabian Style**

Li, Xuanyu, and Houqun Yang.
2024. "An End-to-End, Multi-Branch, Feature Fusion-Comparison Deep Clustering Method" *Mathematics* 12, no. 17: 2749.
https://doi.org/10.3390/math12172749