View-GFN: A Novel View-Based Graph Convolution and Sampling Fusion Network for 3D Shape Recognition
Abstract
1. Introduction
2. Related Work
2.1. Multi-View 3D Shape Recognition
2.2. Graph Construction and Relational Modeling
2.3. Hierarchical Graph Coarsening and Pooling
3. Methodology
3.1. Overview
3.2. Initial Feature Extraction
3.3. Cluster Assignment Based View Sampling
3.4. Graph Convolution and Sampling Fusion Module (CSF)
3.4.1. Feature Embedding
3.4.2. Assignment Matrix Generation
3.4.3. Multi-Scale Fusion and Receptive Field Analysis
3.5. Hierarchical Network Architecture and Loss Function
4. Experiments and Results Analysis
4.1. Experimental Setup
4.1.1. Datasets and Evaluation Metrics
- ModelNet40 [24]: This dataset contains 12,311 3D CAD models across 40 categories, with 9843 models used for training and 2468 for testing. Following the standard protocol, we render either 20 views (from the vertices of a dodecahedron) or 12 views (from a circular trajectory at an elevation of 30°) for each 3D object.
- RGB-D [25]: A real-world dataset comprising 300 household objects across 51 categories. We adopt a 10-fold cross-validation strategy for evaluation on this dataset.
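
The two ModelNet40 camera configurations described above can be sketched as follows. This is a minimal illustration, not the rendering pipeline used in the paper: the function names, the unit camera radius, and the choice of z as the up axis are all assumptions made for the example.

```python
import numpy as np

def circular_viewpoints(n_views=12, elevation_deg=30.0, radius=1.0):
    """Camera positions evenly spaced in azimuth at a fixed elevation (z is up)."""
    azimuth = np.deg2rad(np.arange(n_views) * 360.0 / n_views)
    elevation = np.deg2rad(elevation_deg)
    x = radius * np.cos(elevation) * np.cos(azimuth)
    y = radius * np.cos(elevation) * np.sin(azimuth)
    z = np.full(n_views, radius * np.sin(elevation))
    return np.stack([x, y, z], axis=1)

def dodecahedron_viewpoints(radius=1.0):
    """The 20 vertices of a regular dodecahedron, scaled onto a sphere."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0   # golden ratio
    inv = 1.0 / phi
    # 8 cube vertices (+-1, +-1, +-1)
    verts = [(sx, sy, sz) for sx in (-1.0, 1.0)
                          for sy in (-1.0, 1.0)
                          for sz in (-1.0, 1.0)]
    # 12 vertices from the cyclic permutations of (0, +-1/phi, +-phi)
    for a in (-inv, inv):
        for b in (-phi, phi):
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    verts = np.array(verts)
    return radius * verts / np.linalg.norm(verts, axis=1, keepdims=True)

ring = circular_viewpoints()        # shape (12, 3), all at 30 degrees elevation
dodeca = dodecahedron_viewpoints()  # shape (20, 3), all at unit distance
```

Each returned row is a camera position; in a typical setup the camera would look toward the object centered at the origin.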
4.1.2. Implementation Details
4.2. Comparison with State-of-the-Art Methods
4.3. Robustness to View Quantity
4.4. Shape Retrieval Performance
4.5. Ablation Study
- Soft-clustering vs. Hard Sampling: The instance accuracy of View-GFN-FPS drops by 1.3 percentage points (from 97.8% to 96.5%), indicating that soft-clustering based on the assignment matrix retains discriminative information better than Farthest Point Sampling.
- CSF Synchronous Fusion: View-GFN-SEP performs almost on par with the full model (instance accuracy is only 0.1% lower), but incurs a significant increase in parameters. This demonstrates that the CSF module substantially reduces model complexity while maintaining high accuracy.
- AM Initialization Strategy: The instance accuracies of View-GFN-A1 (local adjacency) and View-GFN-A2 (coordinate encoding) are 0.4 and 0.3 percentage points lower than the full model, respectively. This supports our choice of a global connectivity prior and predefined initial values.
- Impact of GCN Layers (K) and Cluster Scales: Beyond the core components analyzed in Table 4, we also searched over two architectural hyperparameters: the number of stacked GCN layers (K) within the CSF module and the hierarchical cluster node scales. Empirically, an intermediate value of K yields the best trade-off. Fewer layers capture insufficient structural information, leading to suboptimal feature aggregation, while more layers cause oversmoothing, which degrades recognition accuracy. For the graph coarsening scale, the adopted gradual hierarchical reduction balances topological preservation, information compression, and memory efficiency. More aggressive coarsening strategies (e.g., directly down-sampling from 20 to 5) result in a severe loss of discriminative geometric details.
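
The contrast between soft-clustering and hard sampling in the first ablation can be made concrete with a small sketch. It uses generic DiffPool-style pooling [22] and a textbook Farthest Point Sampling routine, not the CSF module itself. The random logits stand in for a learned assignment matrix, and the sizes (20 views, 5 clusters, 8 feature channels), the fully connected adjacency, and all function names are illustrative assumptions.

```python
import numpy as np

def soft_cluster_pool(X, A, logits):
    """DiffPool-style coarsening: every input node contributes to every cluster."""
    # Row-wise softmax turns the (N x M) logits into a soft assignment matrix S.
    shifted = logits - logits.max(axis=1, keepdims=True)
    S = np.exp(shifted)
    S /= S.sum(axis=1, keepdims=True)
    X_coarse = S.T @ X        # (M x D) cluster features
    A_coarse = S.T @ A @ S    # (M x M) cluster adjacency
    return X_coarse, A_coarse, S

def farthest_point_sampling(coords, m, start=0):
    """Hard sampling: keep m nodes, each farthest from those already chosen."""
    chosen = [start]
    dist = np.linalg.norm(coords - coords[start], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(coords - coords[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(0)
N, M, D = 20, 5, 8                   # illustrative sizes only
X = rng.normal(size=(N, D))          # stand-in view features
A = np.ones((N, N))                  # fully connected view graph (assumption)
coords = rng.normal(size=(N, 3))     # stand-in view positions
logits = rng.normal(size=(N, M))     # stand-in for a learned assignment

X_soft, A_soft, S = soft_cluster_pool(X, A, logits)
idx = farthest_point_sampling(coords, M)
X_hard = X[idx]                      # features of the other N - M views are dropped
```

In the soft variant each coarse feature is a weighted mixture of all 20 view features, whereas the hard variant simply discards the views that were not selected. This is one plausible reading of why the FPS configuration loses accuracy.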
4.6. Sensitivity Analysis of Hyperparameters
4.7. Qualitative Analysis and Failure Cases
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Tu, T.; Chen, P.; Zhang, L. ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 6996–7007.
2. Xu, C.; Wu, B.; Hou, J.; Tsai, S.; Li, R.; Wang, J.; Zhan, W.; He, Z.; Vajda, P.; Keutzer, K.; et al. NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3621–3631.
3. Ding, D.; Wang, Z.; Xiong, H. Robust point cloud classification via semantic and structural modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 15078–15087.
4. Ben-Shabat, Y.; Gould, S. 3DInAction: Understanding Human Actions in 3D Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 19978–19987.
5. Chen, Y.; Liu, S.; Shen, X. Learnable Skeleton-Aware 3D Point Cloud Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 18101–18111.
6. Li, Z.; Xu, C.; Leng, B. Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 25368–25377.
7. Li, J.; Wang, J.; Chen, J.; Xu, T. Towards Robust Point Cloud Recognition with Sample-Adaptive Auto-Augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3003–3017.
8. Zhang, L.; Wang, Y.; Liu, H. Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 22–26 June 2025; IEEE: New York, NY, USA, 2025; pp. 20135–20144.
9. Liu, H.; Zhang, L.; Wang, Y. Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 22–26 June 2025; IEEE: New York, NY, USA, 2025; pp. 28745–28754.
10. Xiong, S.; Kasaei, H. LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition. In Proceedings of the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Eindhoven, The Netherlands, 25–29 August 2025; IEEE: New York, NY, USA, 2025; pp. 141–148.
11. Ma, S.; Dong, Z.; Cong, R.; Kwong, S.; Shao, X. Proto-FG3D: Prototype-based Interpretable Fine-Grained 3D Shape Classification. arXiv 2025, arXiv:2505.17666.
12. Ma, X.; Bai, J.; Su, Z.; Wang, Y. PVSTrans: Patch-View-Shape Progressive Interaction Transformer for 3D Shape Recognition. Inf. Process. Manag. 2026, 63, 104279.
13. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; IEEE: New York, NY, USA, 2015; pp. 945–953.
14. Feng, Y.; Zhang, Z.; Zhao, X.; Ji, R.; Gao, Y. GVCNN: Group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 264–272.
15. Yu, T.; Meng, J.; Yuan, J. Multi-view harmonized bilinear network for 3D object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 186–194.
16. Kanezaki, A.; Matsushita, Y.; Nishida, Y. RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 5010–5019.
17. Han, Z.; Shang, M.; Liu, Z.; Vong, C.M.; Liu, Y.S.; Zwicker, M.; Han, J.; Chen, C.P. SeqViews2SeqLabels: Learning 3D global features via aggregating sequential views by RNN with attention. IEEE Trans. Image Process. 2018, 28, 658–672.
18. Wei, X.; Yu, R.; Sun, J. View-GCN: View-based graph convolutional network for 3D shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; IEEE: New York, NY, USA, 2020; pp. 1850–1859.
19. Xu, M.; Chen, H.; Wang, Z. PAGNet: Path aggregation graph network for multi-view 3D shape recognition. Knowl.-Based Syst. 2021, 229, 107338.
20. Gao, H.; Ji, S. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 2083–2092.
21. Lee, J.; Lee, I.; Kang, J. Self-attention graph pooling. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019.
22. Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 4800–4810.
23. Bianchi, F.M.; Grattarola, D.; Alippi, C. Spectral clustering with graph neural networks for graph pooling. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; pp. 874–883.
24. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 1912–1920.
25. Lai, K.; Bo, L.; Ren, X.; Fox, D. A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; IEEE: New York, NY, USA, 2011; pp. 1817–1824.
26. Su, J.-C.; Gadelha, M.; Wang, R.; Maji, S. A deeper look at 3D shape classifiers. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
27. Jiang, J.; Bao, D.; Chen, Z.; Zhao, X.; Gao, Y. MLVCNN: Multi-loop-view convolutional neural network for 3D shape retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Association for Computing Machinery: New York, NY, USA, 2019; Volume 33, pp. 8513–8520.
28. Xu, L.; Cui, Q.; Xu, W.; Chen, E.; Tong, H.; Tang, Y. Walk in views: Multi-view path aggregation graph network for 3D shape analysis. Inf. Fusion 2024, 103, 102131.
29. Cheng, Y.; Cai, R.; Zhao, X.; Huang, K. Convolutional Fisher kernels for RGB-D object recognition. In Proceedings of the 2015 International Conference on 3D Vision (3DV), Lyon, France, 19–22 October 2015; IEEE: New York, NY, USA, 2015; pp. 135–143.
30. Rahman, M.M.; Tan, Y.; Xue, J.; Lu, K. RGB-D object recognition with multimodal deep convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 991–996.
31. Asif, U.; Bennamoun, M.; Sohel, F.A. A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2051–2065.
32. Li, J.; Liu, Z.; Li, L.; Lin, J.; Yao, J.; Tu, J. Multi-view convolutional vision transformer for 3D object recognition. J. Vis. Commun. Image Represent. 2023, 95, 103906.



| Method | Backbone | Type | Views | Inst. Acc. (%) | Class Acc. (%) | Params (M) | Time (s/epoch) |
|---|---|---|---|---|---|---|---|
| MVCNN-new [26] | VGG-M | Aggreg. | 12 | 95.0 | 92.4 | – | – |
| GVCNN [14] | GoogLeNet | Grouping | 12 | 93.1 | 90.7 | – | – |
| MHBN [15] | VGG-M | Bilinear | 6 | 94.7 | 93.1 | – | – |
| RotationNet [16] | AlexNet | View Opt. | 20 | 97.4 | 96.8 | – | – |
| MLVCNN [27] | ResNet-18 | Multi-loop | 36 | 94.2 | – | – | – |
| MVPNet [28] | ResNet-18 | Path Agg. | 20 | 97.9 | 96.8 | – | – |
| View-GCN [18] | ResNet-18 | Graph Net. | 20 | 97.6 | 96.5 | 33.9 | 62.6 |
| PVSTrans [12] | ViT-B | Transf. | 20 | 97.9 | 97.2 | 86.0 | – |
| View-GFN (Ours) | ResNet-18 | Graph Fus. | 20 | 97.8 | 96.5 | 17.0 | 33.4 |
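
The "Inst. Acc." and "Class Acc." columns above can differ when classes are imbalanced. The sketch below assumes the definitions commonly used for ModelNet40: instance accuracy is the fraction of all test shapes classified correctly, and class accuracy is the unweighted mean of the per-class accuracies. The function names and the toy labels are hypothetical.

```python
import numpy as np

def instance_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def class_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Toy example: class 0 has six samples, class 1 has two.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]

print(instance_accuracy(y_true, y_pred), class_accuracy(y_true, y_pred))
# -> 0.875 0.75
```

A single error in the small class costs one eighth of the instance accuracy but one quarter of the class accuracy, which is why the two columns are reported separately.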

| Method | Backbone | Views | Inst. Acc. (%) | Params (M) | Time (s/epoch) |
|---|---|---|---|---|---|
| MVCNN [13] | VGG-M | 12 | 86.1 | – | – |
| CFK [29] | – | 120 | 86.8 | – | – |
| MMDCNN [30] | VGG-M | 120 | 89.5 | – | – |
| MDSICNN [31] | VGG-M | 120 | 89.9 | – | – |
| View-GCN [18] | ResNet-18 | 12 | 94.3 | 22.7 | 1.1 |
| View-GFN (Ours) | ResNet-18 | 12 | 94.1 | 17.0 | 0.5 |
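
As a worked check on the efficiency columns, the relative savings of View-GFN over View-GCN [18] can be computed directly from the two comparison tables above. The helper name is hypothetical, and the numbers are copied from the View-GCN and View-GFN rows of the tables.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# (View-GCN, View-GFN) pairs taken from the tables above
modelnet40 = {"params_M": (33.9, 17.0), "time_s_per_epoch": (62.6, 33.4)}
rgbd = {"params_M": (22.7, 17.0), "time_s_per_epoch": (1.1, 0.5)}

for name, table in (("ModelNet40", modelnet40), ("RGB-D", rgbd)):
    for key, (baseline, ours) in table.items():
        print(f"{name} {key}: {relative_reduction(baseline, ours):.1f}% lower")
# ModelNet40 params_M: 49.9% lower
# ModelNet40 time_s_per_epoch: 46.6% lower
# RGB-D params_M: 25.1% lower
# RGB-D time_s_per_epoch: 54.5% lower
```

On ModelNet40 this corresponds to roughly half the parameters and a little over half the training time per epoch, at an instance accuracy 0.2 percentage points higher (97.8% vs. 97.6%).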

| Method | mAP (%) |
|---|---|
| GVCNN [14] | 85.7 |
| MVCVT [32] | 95.4 |
| MLVCNN [27] | 92.8 |
| MVPNet [28] | 97.4 |
| View-GFN (Ours) | 97.8 |
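
The retrieval metric above, mean average precision (mAP), can be sketched as follows. This assumes the standard definition in which each query ranks the full gallery and a retrieved shape counts as relevant when it shares the query's category. The function names and the toy rankings are hypothetical, and the evaluation protocol of the paper may differ in detail.

```python
import numpy as np

def average_precision(relevant):
    """AP for one query; `relevant` holds 0/1 relevance flags in ranked order."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    hits = np.cumsum(relevant)
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    # Average the precision values at the ranks where a relevant item appears.
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(ranked_relevance):
    """Mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in ranked_relevance]))

# Two toy queries, each with a ranked list of four retrieved shapes.
queries = [[1, 0, 1, 0], [0, 1, 1, 1]]
print(round(mean_average_precision(queries), 4))
# -> 0.7361
```

The first query scores (1/1 + 2/3) / 2 and the second (1/2 + 2/3 + 3/4) / 3, so ranking relevant shapes earlier raises the score.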

| Configuration | Inst. Acc. (%) | Class Acc. (%) | Description |
|---|---|---|---|
| View-GFN-FPS | 96.5 | 95.2 | Replace soft-clustering with Farthest Point Sampling (FPS) |
| View-GFN-SEP | 97.7 | 96.5 | Decouple feature embedding and assignment matrix generation |
| View-GFN-A1 | 97.4 | 96.2 | AM considers only 3 nearest neighbor nodes |
| View-GFN-A2 | 97.5 | 96.1 | AM initialized with view coordinate encoding |
| View-GFN (Full) | 97.8 | 96.5 | Full model (Global AM + Soft-clustering + CSF) |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Pang, M.; Jiao, J.; Zhang, Y. View-GFN: A Novel View-Based Graph Convolution and Sampling Fusion Network for 3D Shape Recognition. Appl. Sci. 2026, 16, 5629. https://doi.org/10.3390/app16115629

