View-GFN: A Novel View-Based Graph Convolution and Sampling Fusion Network for 3D Shape Recognition

Pang, Min; Jiao, Jichao; Zhang, Yingjian

doi:10.3390/app16115629

by Min Pang ¹,²,*, Jichao Jiao ² and Yingjian Zhang ²

¹ School of Electronic Engineering, Beijing University of Posts and Telecommunications, No. 10 Xitucheng Road, Haidian District, Beijing 100876, China

² China Research Institute of Radiowave Propagation, No. 33 Xianshan East Road, Chengyang District, Qingdao 266107, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5629; https://doi.org/10.3390/app16115629

Submission received: 9 April 2026 / Revised: 2 May 2026 / Accepted: 4 May 2026 / Published: 4 June 2026

Abstract

Three-dimensional (3D) shape recognition is a fundamental task in computer vision, where view-based methods have recently achieved state-of-the-art performance. However, effectively capturing and exploiting the rich geometric correspondences between different views remains a key challenge, as such information is crucial for accurate shape representation. Existing methods often fall short in explicitly modeling these structured correlations, which limits their ability to fully leverage discriminative shape information. To address this limitation, we propose a novel View-based Graph Convolution and Sampling Fusion Network (View-GFN). View-GFN employs a hierarchical architecture that progressively coarsens the view-graph to learn multi-scale features. In this structure, views are treated as graph nodes, and a predefined-value strategy is introduced to initialize the adjacency matrix (AM) for constructing initial node correlations. For effective graph coarsening, we develop a novel view down-sampling method based on a cluster assignment matrix. Furthermore, a Graph Convolution and Sampling Fusion (CSF) module is designed to seamlessly integrate deep feature embeddings with the topological information derived from view down-sampling. Extensive experiments on benchmark datasets, including ModelNet40 and RGB-D, demonstrate that View-GFN achieves strong performance, performing on par with established baseline methods while reducing the number of model parameters by nearly 50% compared to the baseline View-GCN. These results validate the effectiveness of our hierarchical fusion strategy in capturing multi-view geometric information both efficiently and robustly.

Keywords: 3D shape recognition; view-based methods; graph neural networks; hierarchical graph coarsening; multi-scale fusion

1. Introduction

Humans perceive the world in three dimensions; therefore, understanding and parsing 3D objects is a fundamental and crucial task in computer vision. In recent years, 3D shape recognition has emerged as a highly active research direction, playing a pivotal role in practical applications such as autonomous driving, robotic perception, and virtual reality. With the rapid advancement of deep learning technologies, researchers have proposed numerous innovative architectures for 3D shape recognition. Depending on the underlying 3D data representations, these approaches can be broadly categorized into three main paradigms: voxel-based [1,2], point cloud-based [3,4,5,6,7,8,9], and multi-view-based methods [10,11,12].

Specifically, voxel-based methods extract features by discretizing the 3D space into regular grids [1,2]. Point cloud-based methods represent targets as unordered sets of spatial points and process them directly with geometric aggregation networks [3,4]. Recent studies have further introduced skeleton-aware sampling [5], zero-shot geometry-driven aggregation [6], sample-adaptive auto-augmentation [7], and multi-scale topological networks [8], improving recognition robustness. Multi-view-based methods, in contrast, project 3D objects into a sequence of two-dimensional (2D) images and integrate the per-view features into a global descriptor. Compared with the former two paradigms, view-based methods are generally highly competitive in 3D recognition tasks, because they acquire comprehensive geometric and textural information from different perspectives and can reuse mature pre-trained image networks. Recent work in this paradigm includes progressive interaction transformers [12], lightweight multi-view convolutional-vision models [10], and prototype-based interpretable architectures for fine-grained shape classification [11].

Despite this progress, efficiently modeling the complex geometric correspondences across views remains a core bottleneck. Early multi-view fusion strategies (e.g., view pooling) typically treat all input views equally, ignoring inherent spatial correlations. To introduce relational modeling, Graph Convolutional Network (GCN)-based methods (e.g., View-GCN) treat views as nodes for message passing. However, these methods exhibit notable practical deficiencies. First, existing methods predominantly use “hard sampling” during graph coarsening. Directly discarding lower-ranked view nodes leads to irreversible feature loss and destroys the global manifold topology of 3D objects. Second, current graph constructions over-rely on predefined rigid viewpoint coordinates for initializing the adjacency matrix (AM). This mechanism lacks a global connectivity prior, reducing robustness to viewpoint fluctuations. Finally, node feature updating and topological evolution are decoupled, restricting the network from exploiting deep discriminative information.

To overcome these limitations, this paper proposes a novel View-based Graph Convolution and Sampling Fusion Network (View-GFN). Unlike traditional methods that rely on fixed spatial coordinates, we propose an AM initialization strategy with a global connectivity prior to endow the initial graph with a global receptive field. To preserve geometric topology, we replace traditional selective dropping with a hierarchical graph coarsening method based on a clustering assignment matrix. By softly mapping semantically similar features into super-nodes, we eliminate redundant information while retaining topological integrity. Furthermore, we design a Graph Convolution and Sampling Fusion (CSF) module to seamlessly integrate deep local feature embeddings with the coarsened macroscopic structure.

To explicitly delineate the methodological differences between our View-GFN and existing view-based graph networks, we summarize three fundamental shifts in our design. (1) Initialization: We transition from coordinate-dependent static graph construction to a fully dense initialization with a global connectivity prior, eliminating the reliance on rigid camera positions. (2) Coarsening: We replace destructive hard-node dropping with a structure-preserving soft-clustering mechanism to safeguard the 3D manifold topology. (3) Architecture: We move from decoupled feature updating and pooling stages to a unified CSF module, which performs simultaneous feature embedding and structural coarsening to significantly reduce parameter overhead.

Our main contributions can be explicitly summarized into three distinct structural novelties:

(1) A coordinate-free dense adjacency matrix (AM) initialization strategy: We adopt a complete graph prior with predefined values, which directly equips shallow graph convolutions with a global receptive field and eliminates the model’s dependency on rigid physical camera coordinates.

(2) A soft-clustering mechanism for view down-sampling: Unlike traditional hard-node dropping, our cluster-assignment approach softly aggregates semantically similar features, effectively preserving the discriminative geometric topology of 3D objects.

(3) A unified Graph Convolution and Sampling Fusion (CSF) module: We seamlessly integrate feature embedding and structural graph coarsening within a single branch. This design eliminates the massive parameter overhead inherent in traditional two-stage methods, providing a highly lightweight and efficient alternative.

Extensive experiments on ModelNet40 and RGB-D demonstrate that View-GFN achieves robust performance, reaching a competitive instance accuracy of 97.8% on ModelNet40 while reducing model parameters by nearly 50% compared to the baseline View-GCN. This validates the practical value of our hierarchical fusion strategy. Importantly, our core contribution lies not merely in pushing the saturated accuracy boundaries on synthetic benchmarks, but in achieving top-tier performance with significantly enhanced architectural efficiency. Furthermore, evaluations on the challenging real-world RGB-D dataset demonstrate the model's strong generalization against background noise and varying camera trajectories, making it well suited to practical deployments.

2. Related Work

In this section, we provide a systematic review of the literature closely related to our work from three perspectives: multi-view 3D shape recognition, graph construction and relational modeling, and hierarchical graph coarsening and pooling.

2.1. Multi-View 3D Shape Recognition

Converting 3D objects into 2D projections to leverage mature 2D convolutional neural networks (CNNs) for discriminative feature extraction has become a core paradigm in the field of 3D shape analysis. As a pioneering work, MVCNN [13] introduced a view pooling strategy that aggregates multi-view features via element-wise maximum operation to generate global shape descriptors. This work laid the foundation for multi-view methods, enabling 3D recognition tasks to fully benefit from 2D network models pre-trained on large-scale image datasets such as ImageNet.

Subsequent studies have pursued improvements in fusion strategies and feature extraction. GVCNN [14] introduced a group-view convolutional approach that partitions views into different groups based on feature similarity. MHBN [15] proposed harmonized bilinear pooling to capture second-order statistics across cross-view image patches. Furthermore, several works have focused on viewpoint optimization and sequence modeling. For instance, RotationNet [16] treats viewpoints as latent variables for joint optimization, achieving simultaneous improvement in both classification and pose estimation. Methods based on RNNs or LSTMs [17] attempt to capture spatial evolution patterns across view sequences using temporal models.

Despite significant progress, most of these methods rely on simple pooling operations or sequential aggregation, treating each view as an isolated image sample. This paradigm fails to explicitly establish structured topological relationships between views, thereby overlooking the rich geometric correspondence information embedded across different perspectives. This limitation motivated the introduction of graph neural networks (GNNs) into the multi-view domain. View-GCN [18] represents the first attempt to explicitly treat views as graph nodes and perform message passing through graph convolution, opening new directions for graph-driven multi-view fusion research. Recently, to further address the diverse challenges in 3D recognition, novel paradigms have emerged. For instance, LM-MCVT [10] explores lightweight multimodal fusion optimized for few-view scenarios, highlighting the ongoing demand for deployment efficiency. Meanwhile, Proto-FG3D [11] pioneers prototype-based interpretable architectures for fine-grained 3D classification, pushing the boundaries of detail-oriented shape understanding. Complementary to these specific applications, our View-GFN focuses on maximizing the geometric fidelity and parameter efficiency of graph structures under standard dense multi-view settings (e.g., 20 views).

2.2. Graph Construction and Relational Modeling

The performance of GNNs heavily depends on the quality of the initial graph topology. In multi-view 3D recognition, defining appropriate node adjacency relationships for a set of views constitutes a fundamental challenge. Existing graph-based methods primarily adopt two initialization strategies. The first is geometry-driven static graph construction, exemplified by View-GCN [18], which initializes the adjacency matrix (AM) using the physical 3D coordinates of camera viewpoints via the K-nearest neighbors (KNN) algorithm. While this approach introduces spatial priors, its fixed graph structure fails to reflect the dynamic semantic evolution of view relationships and exhibits high sensitivity to variations in the number of input views. The second strategy is semantic-driven dynamic graph construction. For example, Xu et al. [19] proposed a path aggregation graph network that dynamically constructs a view-relation graph by computing semantic correlations between view features. While this approach captures deep semantic relationships, it typically involves expensive pairwise similarity computation, incurring significant computational overhead.

Different from these methods, this paper proposes an AM initialization strategy based on a global connectivity prior. In contrast to methods relying on local geometric constraints [18] or high-overhead dynamic feature dependencies [19], our approach constructs a densely connected initial topology using pre-defined values. This design endows graph convolution with a global receptive field at shallow layers and eliminates dependence on static viewpoint coordinates, enabling the model to adaptively learn cross-view long-range dependencies while demonstrating inherent robustness to fluctuations in the number of input views.

2.3. Hierarchical Graph Coarsening and Pooling

Graph pooling is a fundamental technique for learning multi-scale graph representations. Existing methods can be broadly categorized into two families: node dropping and soft pooling. Representative node dropping methods, such as gPool [20] and SAGPool [21], learn scalar scores for nodes and deterministically retain high-scoring nodes. Despite their efficiency, this heuristic hard sampling mechanism exhibits notable limitations in 3D multi-view tasks. As seen in View-GCN [18], directly discarding view nodes may lead to irreversible loss of long-tail discriminative geometric features and disrupt the global manifold topology of 3D objects. Moreover, the selection process is often decoupled from feature extraction, limiting the effectiveness of end-to-end optimization.

On the other hand, general-purpose soft pooling methods such as DiffPool [22] and MinCutPool [23] introduce mapping mechanisms based on cluster assignment. However, these methods are primarily designed for generic graph data. When applied to densely connected multi-view graphs, their computational complexity often grows quadratically with the number of nodes, and they fail to utilize geometric priors specific to 3D vision tasks.

To tackle these challenges, we propose a hierarchical multi-view graph coarsening method based on a cluster assignment matrix. Our approach smoothly aggregates semantically similar view features into super-nodes through a learnable soft assignment mechanism, achieving dimensionality reduction while maximally preserving critical geometric topological properties. Building upon this, we design a graph convolution and sampling fusion (CSF) module that jointly optimizes feature embedding and topological evolution within a unified framework. This design effectively mitigates discriminative information loss from a representational perspective and eliminates the error accumulation inherent in traditional two-stage methods from an architectural standpoint.

In contrast to these existing techniques, the proposed CSF module offers several distinct advantages. Compared with node-dropping methods (e.g., gPool [20] and SAGPool [21]), which rely on “hard-sampling” that risks losing discriminative geometric features, our CSF module utilizes soft-clustering to combine rather than discard features, thereby better preserving the manifold topology. Unlike DiffPool [22], which requires a separate, heavy GNN branch solely for computing the assignment matrix, our CSF module synchronizes feature embedding and assignment generation within a single branch. This design avoids redundant parameters and significantly enhances computational efficiency. Furthermore, while attention-based fusion focuses primarily on feature recalibration, the CSF module inherently reduces the graph scale through structural coarsening, enabling effective multi-scale hierarchical representation learning.

3. Methodology

3.1. Overview

In this section, we introduce View-GFN, a novel hierarchical graph fusion network for three-dimensional (3D) shape recognition. The network adopts a multi-stage abstraction architecture designed to capture multi-scale geometric features through progressive graph coarsening. Each level of the hierarchy defines a view-graph denoted as $G^{l} = (V^{l}, E^{l})$.

The initial view-graph at the first level is constructed based on $M$ input views, where each view corresponds to a node in the graph. To define the initial correlations between nodes, we propose an initialization strategy based on a global connectivity prior. In contrast to traditional methods that rely on unstable viewpoint coordinates or local K-Nearest Neighbor (KNN) constraints, we initialize the adjacency matrix $A^{1} \in \mathbb{R}^{M \times M}$ as that of a complete graph:

$$A_{ij}^{1} = \begin{cases} 1, & i \neq j \\ 0, & i = j \end{cases} \tag{1}$$

We opt for a static complete graph prior over alternative constructions (e.g., sparse k-NN graphs or dynamically learned connectivity) due to the specific scale of multi-view 3D recognition. Typically, the input consists of only $M = 12$ or 20 views. At this scale, the adjacency matrix has at most $20 \times 20 = 400$ entries (380 of them non-zero under Equation (1)), making the $O(M^{2})$ computational and memory cost negligible. In contrast, constructing a sparse k-NN graph requires calculating pairwise coordinate distances, and learned connectivity requires computing dynamic weights at each iteration. For such a small number of nodes, these dynamic computations introduce unnecessary overhead. Therefore, our complete graph initialization provides a global receptive field at zero extra computational cost for edge generation, enabling shallow graph convolutions to facilitate global information interaction at the early stages of feature learning.
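As a minimal illustration of Equation (1), the following Python sketch builds the complete-graph adjacency matrix for a given number of views. The function name `complete_graph_adjacency` is a hypothetical helper introduced here for illustration and is not part of the authors' implementation.

```python
def complete_graph_adjacency(num_views):
    """Adjacency matrix of Eq. (1): 1 off the diagonal, 0 on the diagonal."""
    return [[0.0 if i == j else 1.0 for j in range(num_views)]
            for i in range(num_views)]


# Illustrative setting with M = 20 views.
A1 = complete_graph_adjacency(20)
num_entries = len(A1) * len(A1[0])          # total entries of the M x M matrix
num_nonzero = sum(sum(row) for row in A1)   # off-diagonal (connected) entries
```

No distance computation or learned weighting is involved, which is the sense in which this initialization adds no cost for edge generation.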

The overall architecture of View-GFN is illustrated in Figure 1. The network consists of a feature extraction module followed by three cascaded Graph Convolution and Sampling Fusion (CSF) modules. Each CSF module concurrently performs feature embedding and assignment matrix generation, enabling the hierarchical evolution of the graph structure through differentiable soft-clustering operations.

3.2. Initial Feature Extraction

Given a sequence of multi-view images of a 3D object $I = \{I_{1}, I_{2}, \dots, I_{M}\}$, we employ ResNet-18, pre-trained on ImageNet and fine-tuned on the target dataset, as the backbone network for initial feature extraction. Each view image $I_{i}$ is mapped to a $c_{0}$-dimensional discriminative feature vector. These vectors constitute the initial node feature matrix $X^{1} \in \mathbb{R}^{m_{1} \times c_{0}}$ for the first-level graph, where $m_{1} = M$ represents the initial number of nodes.

3.3. Cluster Assignment Based View Sampling

To achieve hierarchical compression of the graph structure, we aggregate the $m_{l}$ nodes at level $l$ into $m_{l+1}$ super-nodes at level $l+1$, such that $m_{l+1} < m_{l}$. This process is realized by learning a Cluster Assignment Matrix $S^{l} \in \mathbb{R}^{m_{l} \times m_{l+1}}$.

Each row of $S^{l}$ represents the assignment probability of a node in the current level to each super-node in the next level, satisfying the constraint $\sum_{j} S_{ij}^{l} = 1$. Utilizing $S^{l}$, the node feature matrix $X^{l+1}$ and the adjacency matrix $A^{l+1}$ for the next level are computed as follows:

$$X^{l+1} = (S^{l})^{T} Z^{l} \in \mathbb{R}^{m_{l+1} \times c_{l}} \tag{2}$$

$$A^{l+1} = (S^{l})^{T} A^{l} S^{l} \in \mathbb{R}^{m_{l+1} \times m_{l+1}} \tag{3}$$

where $Z^{l} \in \mathbb{R}^{m_{l} \times c_{l}}$ represents the enhanced node embeddings output by the CSF module. The theoretical foundation of this soft-assignment mechanism stems from spectral clustering and graph coarsening theories. From a graph-theoretic perspective, $S^{l}$ acts as a differentiable low-pass filter that softly aggregates nodes that are both topologically proximal and semantically similar, thereby preserving as much of the object's manifold structure as possible during dimensionality reduction. While this soft-assignment mechanism draws conceptual inspiration from general graph pooling methods such as DiffPool [22] and MinCutPool [23], we specifically adapt it for densely connected multi-view graphs. Instead of deploying a separate, heavy auxiliary GNN branch to compute the assignment matrix, which is computationally prohibitive for dense view-graphs, our approach generates $S^{l}$ in a lightweight manner. This adaptation is tailored to the efficiency requirements of 3D vision tasks, achieving robust clustering without a large memory overhead.
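The coarsening step of Equations (2) and (3) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the helpers `transpose`, `matmul`, and `coarsen` and the toy values of `S`, `Z`, and `A` are hypothetical and chosen only to make the arithmetic easy to follow. In the network itself, `S` is learned rather than fixed.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def coarsen(S, Z, A):
    """Apply Eq. (2) and Eq. (3): X_next = S^T Z, A_next = S^T A S."""
    St = transpose(S)
    X_next = matmul(St, Z)
    A_next = matmul(matmul(St, A), S)
    return X_next, A_next


# Toy example: 4 view nodes softly assigned to 2 super-nodes.
S = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]   # each row sums to 1
Z = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]   # node embeddings (c_l = 2)
A = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]  # Eq. (1)

X_next, A_next = coarsen(S, Z, A)
```

Because the second node is split evenly between the two super-nodes, its features contribute to both rows of `X_next` rather than being dropped, which is the behaviour that distinguishes soft assignment from hard node selection.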

3.4. Graph Convolution and Sampling Fusion Module (CSF)

The detailed internal architecture of the CSF module is illustrated in Figure 2. It unifies the feature embedding process with the generation of the assignment matrix required for subsequent view down-sampling. By jointly optimizing node feature updating and graph coarsening within a unified operation, the CSF module eliminates the error accumulation and heavy parameter overhead inherent in traditional two-stage or dual-branch pooling architectures.

3.4.1. Feature Embedding

At level $l$, given the node features $X^{l}$ and the adjacency matrix $A^{l}$, we utilize $K$ stacked layers of graph convolutions for feature updating. To ensure numerical stability, each graph convolution layer uses the symmetrically normalized adjacency matrix and is defined as follows:

$$F_{k}^{l} = \sigma\left( (\tilde{D}^{l})^{-1/2} \tilde{A}^{l} (\tilde{D}^{l})^{-1/2} H_{k}^{l} W_{k}^{l} \right) \tag{4}$$

where $\tilde{A}^{l} = A^{l} + I$ is the adjacency matrix with self-loops, $\tilde{D}^{l}$ is the degree matrix of $\tilde{A}^{l}$, $H_{k}^{l}$ is the input to the $k$-th layer with $H_{1}^{l} = X^{l}$, $W_{k}^{l}$ is the weight matrix, and $\sigma$ denotes the activation function. We define the output of the final convolution layer as the high-level semantic embedding matrix, i.e., $Z^{l} = F_{K}^{l}$.
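A single layer of Equation (4) can be sketched as follows. The text does not name the activation function, so ReLU is assumed here purely for illustration. The helper names `matmul` and `gcn_layer` and the toy inputs are likewise hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def gcn_layer(A, H, W):
    """One graph convolution of Eq. (4) with an assumed ReLU activation."""
    n = len(A)
    # Add self-loops: A_tilde = A + I.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Node degrees of A_tilde (the diagonal of D_tilde).
    deg = [sum(row) for row in A_tilde]
    # Symmetric normalization: D^{-1/2} A_tilde D^{-1/2}.
    A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    out = matmul(matmul(A_hat, H), W)
    return [[max(0.0, v) for v in row] for row in out]  # ReLU (assumption)


# Toy example: 3 fully connected views with 2-dimensional features.
A = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
H = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
W_identity = [[1.0, 0.0], [0.0, 1.0]]
F = gcn_layer(A, H, W_identity)
```

With a complete graph plus self-loops, every normalized edge weight equals $1/M$, so one layer with an identity weight matrix replaces each node feature by the mean over all views. This is the global receptive field that the complete-graph prior provides to shallow layers.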

3.4.2. Assignment Matrix Generation

The generation of the assignment matrix $S^{l}$ is performed synchronously with feature embedding. To fuse multi-scale information, we concatenate the intermediate outputs of each GCN layer and generate assignment weights through a non-linear mapping network:

$$S^{l} = \mathrm{softmax}\left( \mathrm{MLP}\left( \mathrm{Concat}(F_{1}^{l}, \dots, F_{K}^{l}) \right) \right) \tag{5}$$

By incorporating outputs from multiple neighborhood scales, the generated $S^{l}$ is capable of perceiving neighborhood relationships across different levels, ensuring the structural fidelity of the clustering results in the topological space.
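The structure of Equation (5) can be illustrated with a small sketch. The text does not specify the MLP's depth or width, so a single linear map `W` stands in for it here as a simplifying assumption. The names `softmax`, `matmul`, and `assignment_matrix` and the toy inputs are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def assignment_matrix(layer_outputs, W):
    """Eq. (5) with a single linear map standing in for the MLP (assumption)."""
    num_nodes = len(layer_outputs[0])
    # Concatenate, per node, the features produced by every GCN layer.
    concat = [[v for F in layer_outputs for v in F[i]] for i in range(num_nodes)]
    logits = matmul(concat, W)
    # Row-wise softmax so that each node's assignments sum to 1.
    return [softmax(row) for row in logits]


# Toy example: K = 2 layers, 3 nodes, 2 features per layer, 2 super-nodes.
F1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F2 = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # shape (K * c) x m_next
S = assignment_matrix([F1, F2], W)
```

The row-wise softmax is what enforces the constraint $\sum_{j} S_{ij}^{l} = 1$ stated in Section 3.3.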

3.4.3. Multi-Scale Fusion and Receptive Field Analysis

The CSF module constructs a hierarchical representation of the current level through a joint optimization mechanism. We perform mixed pooling on the features $F_{k}^{l}$ from each layer and concatenate them to obtain the output feature vector $O^{l} \in \mathbb{R}^{2Kc_{l}}$, where $c_{l}$ denotes the feature dimension at level $l$:

$$O^{l} = \mathrm{Concat}\left( \left[ \mathrm{MaxPool}(F_{k}^{l}) \,\|\, \mathrm{AvgPool}(F_{k}^{l}) \right]_{k=1}^{K} \right) \tag{6}$$

This fusion strategy effectively enforces the preservation of full-spectrum signals, ranging from local geometric details (shallow features) to macroscopic topological contours (deep features), thereby effectively mitigating the information bottleneck caused by limited perspectives in traditional methods.

Furthermore, this multi-scale concatenation inherently addresses the potential oversmoothing effect commonly observed in densely connected GNNs. By explicitly concatenating the intermediate outputs from every GCN layer, the operation acts as dense skip connections (conceptually similar to Jumping Knowledge Networks). It forces the final representation to retain shallow, highly discriminative local features, preventing them from being smoothed out by recursive neighborhood aggregation. Combined with our shallow three-level hierarchical architecture, this strategy ensures robust feature discrimination across layers without suffering from oversmoothing degradation.
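The mixed pooling of Equation (6) can be sketched as follows, assuming that both pooling operations reduce over the node dimension so that each layer contributes $2c_{l}$ values. The function name `mixed_pool` and the toy feature matrices are hypothetical.

```python
def mixed_pool(layer_outputs):
    """Eq. (6): per layer, concatenate max- and average-pooling over nodes."""
    descriptor = []
    for F in layer_outputs:
        columns = list(zip(*F))  # one tuple per feature channel
        descriptor.extend(max(col) for col in columns)              # MaxPool
        descriptor.extend(sum(col) / len(col) for col in columns)   # AvgPool
    return descriptor


# Toy example: K = 2 layers, 2 nodes, c_l = 2 channels.
F1 = [[1.0, 4.0], [3.0, 2.0]]
F2 = [[0.0, 6.0], [2.0, 2.0]]
O_l = mixed_pool([F1, F2])
```

The resulting vector has length $2Kc_{l}$ (here 8), and the shallow-layer statistics from `F1` sit alongside those of `F2` instead of being overwritten by them.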

3.5. Hierarchical Network Architecture and Loss Function

To effectively capture both fine-grained local details and macroscopic topological structures, View-GFN employs a three-level hierarchical cascaded architecture. In our standard implementation, the initial graph at the first level is constructed from $m_{1} = 20$ (or 12) input views. As the network deepens, the graph undergoes progressive down-sampling through successive CSF modules. Specifically, the initial graph is first coarsened from $m_{1}$ nodes to $m_{2}$ super-nodes at the second level, and is subsequently condensed to $m_{3}$ super-nodes at the third level (e.g., following the optimal empirical reduction scale of $20 \to 10 \to 5$). This step-by-step topological evolution allows the network to gradually extract highly abstract and semantic representations of the 3D object.

To construct the final global 3D shape descriptor for classification, we aggregate the features across all hierarchical levels. Consistent with the pipeline illustrated in Figures 1 and 2, at each hierarchy level $l \in \{1, 2, 3\}$, the CSF module outputs a pooled multi-scale feature vector $O^{l} \in \mathbb{R}^{2Kc_{l}}$ (derived via the concatenated max and average pooling operations). The global 3D shape descriptor $O$ is generated by directly concatenating these level-specific features:

$$O = [O^{1} \,\|\, O^{2} \,\|\, O^{3}] \tag{7}$$

This comprehensive descriptor O, which seamlessly fuses shallow fine-grained geometric details with deep macroscopic topological abstractions, is subsequently fed into the final Multi-Layer Perceptron (MLP) to predict the 3D shape category.
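For reference, the length of the global descriptor follows directly from Equations (6) and (7). The numerical values below are illustrative assumptions only, since the text does not state $K$ or the per-level widths $c_{l}$.

```latex
% Length of the concatenated descriptor O = [O^1 || O^2 || O^3]
\dim O = \sum_{l=1}^{3} 2 K c_{l} = 2K \,(c_{1} + c_{2} + c_{3})

% Hypothetical example: K = 2 layers per CSF module, c_1 = c_2 = c_3 = 512
\dim O = 2 \cdot 2 \cdot (512 + 512 + 512) = 6144
```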

To guide $S^{l}$ in learning reasonable topological clusters, we introduce a Link Prediction Loss as an auxiliary objective:

$$L_{\mathrm{total}} = L_{\mathrm{CE}} + \gamma \sum_{l=1}^{2} \left\| A^{l} - S^{l} (S^{l})^{T} \right\|_{F}^{2} \tag{8}$$

This loss term constrains the super-nodes to reconstruct the adjacency relationships of the original graph as closely as possible. It acts as a structural regularizer intended to discourage degenerate clustering solutions, such as all nodes collapsing into a single super-node. By encouraging the super-nodes to preserve the topological diversity and manifold structure of the original input views, this auxiliary loss improves the stability of end-to-end training and promotes the topological consistency and structural fidelity of the soft-clustering process. The final descriptor O is then fed into a fully connected classifier for shape recognition.
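The auxiliary term of Equation (8) is a squared Frobenius norm and can be computed as in the sketch below. The helper names `link_prediction_term` and `total_loss`, the toy matrices, and the cross-entropy value of 0.7 are hypothetical. Only the default $\gamma = 0.1$ is taken from the implementation details.

```python
def link_prediction_term(A, S):
    """Squared Frobenius norm || A - S S^T ||_F^2 for one level of Eq. (8)."""
    n = len(A)
    # Reconstruction S S^T: inner products between the assignment rows.
    recon = [[sum(a * b for a, b in zip(S[i], S[j])) for j in range(n)]
             for i in range(n)]
    return sum((A[i][j] - recon[i][j]) ** 2 for i in range(n) for j in range(n))


def total_loss(ce_loss, adjacencies, assignments, gamma=0.1):
    """Eq. (8): cross-entropy plus the weighted sum of per-level terms."""
    reg = sum(link_prediction_term(A, S)
              for A, S in zip(adjacencies, assignments))
    return ce_loss + gamma * reg


# Toy example: one level with 4 nodes and 2 super-nodes.
A = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
S = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
term = link_prediction_term(A, S)
loss = total_loss(0.7, [A], [S])  # 0.7 is a made-up cross-entropy value
```

In the full model the sum runs over the two levels at which an assignment matrix is produced, so `adjacencies` and `assignments` would each hold two matrices.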

4. Experiments and Results Analysis

In this section, we evaluate the performance of the proposed View-GFN through extensive experiments on benchmark datasets. We first introduce the experimental setup, followed by a comprehensive analysis of the model from multiple dimensions, including classification accuracy, robustness to view quantity, shape retrieval capabilities, and an ablation study.

4.1. Experimental Setup

4.1.1. Datasets and Evaluation Metrics

ModelNet40 [24]: This dataset contains 12,311 3D CAD models across 40 categories, with 9843 models used for training and 2468 for testing. Following the standard protocol, we render either 20 views (from the vertices of a dodecahedron) or 12 views (from a circular trajectory at an elevation of 30°) for each 3D object.
RGB-D [25]: A real-world dataset comprising 300 household objects across 51 categories. We adopt a 10-fold cross-validation strategy for evaluation on this dataset.

Evaluation Metrics: The primary metrics include Instance Accuracy (the ratio of correctly classified samples to the total number of samples), Class Accuracy (the arithmetic mean of accuracies across all classes), mean Average Precision (mAP, used for the retrieval task), the number of Parameters (Params), and Training Time (forward and backward propagation time per epoch).
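The difference between the two accuracy metrics can be made concrete with a short sketch. The function names and the toy labels are hypothetical and serve only to show that the two metrics diverge when classes are imbalanced.

```python
def instance_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_accuracy(y_true, y_pred):
    """Arithmetic mean of the per-class accuracies."""
    per_class = {}
    for t, p in zip(y_true, y_pred):
        correct, total = per_class.get(t, (0, 0))
        per_class[t] = (correct + int(t == p), total + 1)
    return sum(c / n for c, n in per_class.values()) / len(per_class)


# Toy example: class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
inst_acc = instance_accuracy(y_true, y_pred)   # 4 of 6 samples correct
cls_acc = class_accuracy(y_true, y_pred)       # mean of 3/4 and 1/2
```

Instance accuracy weights every sample equally, whereas class accuracy weights every category equally, so errors on rare categories lower the latter more.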

4.1.2. Implementation Details

We employ ResNet-18, pre-trained on ImageNet, as the initial feature extractor. View-GFN consists of three hierarchical levels with node scales set to $m_{1} = 20$, $m_{2} = 10$, and $m_{3} = 5$, respectively. The network is trained using the SGD optimizer with a momentum of 0.9, a weight decay of 0.01, and a batch size of 20. During the fine-tuning of the feature extraction network, the initial learning rate is set to 0.01 and halved every 10 epochs. When training the entire network, the initial learning rate is set to 0.001 with a cosine annealing schedule. The balancing hyperparameter $\gamma$ for the auxiliary loss is set to 0.1 by default. All experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. Furthermore, to ensure a fair and consistent comparison with prior state-of-the-art methods, all reported quantitative results for our method represent the best accuracy achieved during training, rather than the mean over multiple runs.
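The two learning-rate schedules described above can be written out explicitly. This is a sketch under stated assumptions: the total number of epochs is not given in the text, so `total_epochs = 30` is a placeholder. The cosine schedule is assumed to decay to zero using the standard half-cosine form. The function names are hypothetical.

```python
import math


def step_lr(epoch, base_lr=0.01, step=10, factor=0.5):
    """Backbone fine-tuning: halve the learning rate every 10 epochs."""
    return base_lr * factor ** (epoch // step)


def cosine_lr(epoch, total_epochs, base_lr=0.001, min_lr=0.0):
    """Full-network training: standard cosine annealing (min_lr assumed 0)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


total_epochs = 30  # placeholder value, not stated in the text
backbone_schedule = [step_lr(e) for e in range(total_epochs)]
full_schedule = [cosine_lr(e, total_epochs) for e in range(total_epochs)]
```

Both functions start at the initial rates quoted above (0.01 and 0.001, respectively) and differ only in how they decay.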

To verify the statistical stability of our method against random initializations, we conducted additional robustness checks over 5 random seeds on the ModelNet40 dataset. These runs yielded a tight instance accuracy distribution of $97.8\% \pm 0.1\%$, indicating that the soft-clustering mechanism and the overall architecture are stable. Consistent with standard practice in the field and to ensure fair comparisons with baseline methods, the quantitative results reported in the subsequent main tables represent the best accuracy achieved.

4.2. Comparison with State-of-the-Art Methods

We compare View-GFN with various representative 3D shape recognition methods. Table 1 summarizes the classification results and model efficiency on the ModelNet40 dataset.

Analysis: As shown in Table 1, View-GFN achieves an instance accuracy of 97.8%, slightly higher than the baseline View-GCN (97.6%). To validate our method against the latest advancements, we also include a direct comparison with the recent state-of-the-art Transformer-based architecture, PVSTrans [12]. While PVSTrans achieves a marginally higher instance accuracy of 97.9%, its parameter size (86.0 M) is roughly five times that of our View-GFN (17.0 M). Against this attention-heavy model, View-GFN maintains a highly competitive accuracy while being far more parameter-efficient. Furthermore, the nearly 50% parameter reduction compared to View-GCN (33.9 M), along with a roughly 47% reduction in training time per epoch (33.4 s vs. 62.6 s), demonstrates that our CSF module and soft-clustering mechanism offer a lightweight and efficient alternative for multi-view 3D recognition, compressing redundant information while retaining discriminative geometric features.

4.3. Robustness to View Quantity

To validate the effectiveness of the “global connectivity prior AM initialization,” we test the model using 12 input views on the real-world RGB-D dataset. The comparison results are presented in Table 2.

Analysis: On the real-world RGB-D dataset, View-GFN achieves a stable accuracy of $94.1\% \pm 0.3\%$ (over 10-fold cross-validation) with 12 views, which is comparable to View-GCN (94.3%), but with significantly fewer parameters (17.0 M vs. 22.7 M) and approximately 54.5% less training time (0.5 s vs. 1.1 s). This represents a substantial improvement over MVCNN (86.1%), which also uses 12 views. When interpreting these results, it is important to account for differences in network capacity and input settings. As detailed in Table 2, some earlier methods such as CFK, MMDCNN, and MDSICNN rely on 120 input views to achieve their respective accuracies, which significantly increases the computational and memory overhead. These comparisons are therefore not strictly like-for-like: View-GFN uses only 12 views and is built upon a lightweight ResNet-18 backbone. Despite using a fraction of the input views and a lighter backbone, View-GFN achieves highly competitive performance (94.1%), substantially outperforming the 120-view methods. This underscores the parameter efficiency and robust geometric representation ability of the proposed architecture.

4.4. Shape Retrieval Performance

Table 3 reports the retrieval performance on the ModelNet40 dataset, evaluated using mean Average Precision (mAP).

Analysis: View-GFN achieves an mAP of 97.8% in the retrieval task, outperforming MVPNet (97.4%), MVCVT (95.4%), and MLVCNN (92.8%), and significantly surpassing the earlier GVCNN (85.7%). This suggests that the multi-scale features extracted by the CSF module are not only discriminative for classification but also cluster well semantically, enabling the generation of high-quality global shape descriptors.
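For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP for retrieval. It assumes that each query returns a full ranking of the gallery, so every relevant shape appears somewhere in the list; the exact evaluation protocol used for Table 3 is not specified here, and the function names and toy rankings are illustrative only.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """mAP: average of the per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Toy rankings (True = retrieved shape shares the query's class).
toy_queries = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
toy_map = mean_average_precision(toy_queries)
print(f"toy mAP = {toy_map:.4f}")
```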

4.5. Ablation Study

Table 4 dissects the contribution of each core component to the overall performance based on the ModelNet40 dataset with 20 views.

Analysis & Conclusions:

Soft-clustering vs. Hard Sampling: The instance accuracy of View-GFN-FPS drops by 1.3 percentage points (from 97.8% to 96.5%), indicating that soft-clustering based on the assignment matrix retains discriminative information better than Farthest Point Sampling.
CSF Synchronous Fusion: View-GFN-SEP performs almost on par with the full model (instance accuracy is only 0.1 percentage points lower), but incurs a significant increase in parameters. This demonstrates that the CSF module substantially reduces model complexity while maintaining high accuracy.
AM Initialization Strategy: The instance accuracies of View-GFN-A1 (local adjacency) and View-GFN-A2 (coordinate encoding) are 0.4 and 0.3 percentage points lower than the full model, respectively. This supports the use of our global connectivity prior with predefined initial values.
Impact of GCN Layers (K) and Cluster Scales: Beyond the core components analyzed in Table 4, we also conducted empirical searches over two architectural hyperparameters: the number of stacked GCN layers (K) within the CSF module and the hierarchical cluster node scales. We observed that setting $K = 2$ yields the best trade-off. Using only $K = 1$ captures insufficient structural information, leading to suboptimal feature aggregation, while $K \geq 4$ causes the model to suffer from oversmoothing, which degrades the overall recognition accuracy. Regarding the graph coarsening scale, our tests indicate that the adopted hierarchical reduction of $20 \to 10 \to 5$ balances topological preservation against information compression and memory efficiency. More aggressive coarsening strategies (e.g., directly down-sampling from 20 to 5) result in a severe loss of discriminative geometric details.
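To make the $20 \to 10 \to 5$ schedule concrete, the sketch below coarsens a view-graph twice with a soft assignment matrix. It assumes the DiffPool-style update $X' = S^{\top} X$, $A' = S^{\top} A S$ with a row-wise softmax over cluster logits; the exact formulation in the methodology section may differ. The random logits stand in for the output of a learned assignment branch, and the all-ones adjacency is only a placeholder for a fully connected prior.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(logits: Matrix) -> Matrix:
    """Row-wise softmax, so each view distributes unit mass over the clusters."""
    out = []
    for row in logits:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def coarsen(features: Matrix, adjacency: Matrix, logits: Matrix):
    """One coarsening step: n nodes -> m super-nodes via soft assignment S (n x m)."""
    s = softmax_rows(logits)
    s_t = transpose(s)
    pooled_features = matmul(s_t, features)                # (m, d)
    pooled_adjacency = matmul(matmul(s_t, adjacency), s)   # (m, m)
    return pooled_features, pooled_adjacency


rng = random.Random(0)
num_views, feat_dim = 20, 4
x = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_views)]
a = [[1.0] * num_views for _ in range(num_views)]  # placeholder dense prior

for target in (10, 5):  # hierarchical schedule 20 -> 10 -> 5
    # Hypothetical stand-in for learned assignment logits.
    logits = [[rng.gauss(0, 1) for _ in range(target)] for _ in range(len(x))]
    x, a = coarsen(x, a, logits)
    print(f"nodes: {len(x)}, adjacency: {len(a)}x{len(a[0])}")
```

Because every row of $S$ sums to one, the total edge weight of the graph is preserved across both steps, which is one way to see that soft assignment redistributes connectivity rather than discarding nodes.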

4.6. Sensitivity Analysis of Hyperparameter $γ$

To evaluate the robustness of the proposed View-GFN, we conducted a sensitivity analysis on the auxiliary loss hyperparameter $\gamma$ defined in Equation (8). We evaluated the model’s performance on the ModelNet40 dataset by varying $\gamma$ within the set $\{0.01, 0.05, 0.1, 0.5, 1.0\}$.

The empirical results demonstrate that the model achieves its peak instance accuracy of 97.8% when $\gamma$ is set to 0.1. When $\gamma$ is excessively small (e.g., 0.01), the assignment matrix lacks sufficient structural regularization, leading to a slight performance degradation to 97.3%. Conversely, when $\gamma$ is too large (e.g., 1.0), the auxiliary link prediction loss begins to dominate the primary cross-entropy classification loss, which hinders the convergence of the main task and reduces the accuracy to 96.9%. For intermediate values such as 0.05 and 0.5, the model maintains highly competitive accuracies of 97.6% and 97.4%, respectively. Overall, the accuracy varies by only 0.9 percentage points across a two-order-of-magnitude range of $\gamma$, which indicates that View-GFN is robust to the choice of this hyperparameter.
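The role of $\gamma$ can be illustrated with a small numerical sketch. It assumes that the total objective has the form $\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \gamma\,\mathcal{L}_{\mathrm{link}}$ and that the link prediction term is the Frobenius norm $\lVert A - S S^{\top} \rVert_F$, as in DiffPool; the function names and the toy graph are illustrative and are not taken from Equation (8) itself.

```python
import math
from typing import List

Matrix = List[List[float]]


def cross_entropy(logits: List[float], label: int) -> float:
    """Softmax cross-entropy for a single sample, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def link_prediction_loss(adjacency: Matrix, assignment: Matrix) -> float:
    """Assumed auxiliary term: Frobenius norm of A - S S^T."""
    n = len(adjacency)
    total = 0.0
    for i in range(n):
        for j in range(n):
            recon = sum(p * q for p, q in zip(assignment[i], assignment[j]))
            total += (adjacency[i][j] - recon) ** 2
    return math.sqrt(total)


def total_loss(logits, label, adjacency, assignment, gamma: float = 0.1) -> float:
    """Classification loss plus gamma-weighted structural regularizer."""
    return cross_entropy(logits, label) + gamma * link_prediction_loss(adjacency, assignment)


# Toy graph: two tightly connected pairs of views.
A = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
S_good = [[1, 0], [1, 0], [0, 1], [0, 1]]   # assignment matching the structure
S_flat = [[0.5, 0.5]] * 4                    # uninformative assignment

for gamma in (0.01, 0.1, 1.0):
    good = total_loss([0.0, 0.0], 0, A, S_good, gamma)
    flat = total_loss([0.0, 0.0], 0, A, S_flat, gamma)
    print(f"gamma={gamma}: structured={good:.3f}, uninformative={flat:.3f}")
```

In this toy setting the penalty on the uninformative assignment grows linearly with $\gamma$, which mirrors the trade-off described above: a very small weight barely regularizes the assignment matrix, while a large weight lets the auxiliary term outweigh the classification loss.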

4.7. Qualitative Analysis and Failure Cases

To better understand the internal mechanism of the proposed View-GFN, we explicitly visualize the hierarchical graph coarsening process and analyze both successful recognitions and typical failure cases.

Visualization of Graph Coarsening: As illustrated in Figure 3, the soft-assignment matrix effectively groups semantically similar and topologically adjacent view images into super-nodes. For example, multiple viewpoints capturing the legs of a chair are smoothly aggregated into a single localized super-node. This mechanism successfully preserves the structural integrity and geometric manifold of the object while significantly reducing the graph scale without relying on hard-dropping.

Analysis of Failure Cases: While View-GFN demonstrates robust recognition capabilities across most categories, we conducted an explicit investigation into its failure cases to identify the boundaries of our current model. We observed that misclassifications predominantly occur with highly symmetric and texture-less objects. For instance, the model occasionally confuses a rectangular “desk” with a “table” or a “bed”. Because the multi-view 2D projections of these symmetric objects are nearly identical across most viewpoints and lack distinct geometric variances, the soft-clustering mechanism can sometimes produce ambiguous super-nodes, leading to misclassification. Future work could potentially incorporate cross-modal features (e.g., point cloud normals) to help disambiguate such highly symmetric shapes.

Limitations and Mitigation Strategies: Beyond the geometric failure cases discussed above, View-GFN faces certain architectural limitations regarding scalability. Our current dense adjacency matrix initialization and soft-clustering mechanism are optimized for standard multi-view settings (e.g., $M = 12$ or 20 views). However, as the number of input views scales up drastically (e.g., $M \geq 100$ views from continuous video sequences), the $O(M^{2})$ memory footprint for storing the dense graph and computing the assignment matrix becomes the primary computational bottleneck. To mitigate this in massive-view environments, we propose two concrete strategies for future exploration: (1) Pre-sampling: employing a lightweight heuristic such as Farthest Point Sampling or structural pre-clustering to reduce the initial view set to an active subset of approximately 30 nodes before feeding them into View-GFN; and (2) Sparse Attention: replacing the complete dense prior with a sparse, local-windowed attention mechanism (similar to Swin Transformer) to limit the complexity from $O(M^{2})$ to a linear $O(M)$ scale within local neighborhoods. These directions will be further explored in our future work to enhance the model’s applicability to high-resolution video streams.
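The two mitigation strategies can be sketched as follows. The example assumes 120 hypothetical camera positions on a circle, applies greedy Farthest Point Sampling to keep 30 of them, and compares the number of stored adjacency entries for a dense graph with that of a ring-shaped local window. The window model only counts entries of a banded adjacency pattern; it is not an implementation of windowed attention, and all names and sizes are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def farthest_point_sampling(points: Sequence[Point], k: int) -> List[int]:
    """Greedy FPS: start at index 0, then repeatedly add the point farthest from the chosen set."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Track each point's distance to its nearest selected point.
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


def dense_entries(m: int) -> int:
    """Entries in a dense M x M adjacency matrix."""
    return m * m


def windowed_entries(m: int, window: int) -> int:
    """Entries when each view links only to itself and `window` neighbors on each side of a ring."""
    return m * min(m, 2 * window + 1)


# Hypothetical capture: 120 viewpoints evenly spaced on a unit circle.
M = 120
cameras = [(math.cos(2 * math.pi * i / M), math.sin(2 * math.pi * i / M)) for i in range(M)]

subset = farthest_point_sampling(cameras, 30)
print(f"dense graph, all views:   {dense_entries(M)} entries")
print(f"dense graph, 30 sampled:  {dense_entries(len(subset))} entries")
print(f"windowed graph, window=2: {windowed_entries(M, 2)} entries")
```

Pre-sampling keeps the dense formulation but shrinks $M$, whereas the windowed pattern keeps all views and grows linearly in $M$ for a fixed window size.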

5. Conclusions

This paper presented View-GFN, an efficient multi-view graph convolution network for 3D shape recognition. At its core, we introduced a unified Graph Convolution and Sampling Fusion (CSF) module, positioned as a streamlined structural refinement, alongside a soft-clustering mechanism. This single-branch design eliminates the redundant parameters typical of dual-branch pooling structures. Furthermore, adopting a global connectivity prior for adjacency matrix initialization is a simple yet effective design choice that frees the network from rigid dependencies on physical camera coordinates. Experimental results demonstrate that View-GFN achieves a competitive recognition accuracy of 97.8% on ModelNet40 while reducing model parameters by nearly 50% compared to the baseline View-GCN. Rather than solely pursuing marginal accuracy gains on saturated synthetic benchmarks, our design prioritizes architectural efficiency and real-world applicability. The evaluations on the RGB-D dataset further indicate that the model generalizes beyond clean CAD models and remains robust to real-world noise, which facilitates deployment in resource-constrained practical scenarios. Although the dense connectivity prior is highly effective under standard settings, it may introduce $O(M^{2})$ memory bottlenecks when scaling to extreme view counts (e.g., $M > 100$ from continuous video sequences). To support massive-view environments, future work will explore two concrete mitigation strategies: (1) a pre-sampling step using Farthest Point Sampling to reduce the initial node set to a manageable size, and (2) replacing the complete graph with sparse, local-windowed attention mechanisms to achieve linear $O(M)$ complexity. Additionally, we plan to extend this unified fusion framework to more complex 3D scene understanding and autonomous robotic perception tasks.

Author Contributions

Conceptualization, M.P. and J.J.; methodology, M.P. and J.J.; software, M.P.; validation, M.P., J.J. and Y.Z.; formal analysis, M.P.; investigation, M.P.; resources, M.P.; data curation, M.P.; writing—original draft preparation, M.P.; writing—review and editing, M.P., J.J. and Y.Z.; visualization, M.P.; supervision, M.P. and J.J.; project administration, M.P.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tu, T.; Chen, P.; Zhang, L. ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 6996–7007.
2. Xu, C.; Wu, B.; Hou, J.; Tsai, S.; Li, R.; Wang, J.; Zhan, W.; He, Z.; Vajda, P.; Keutzer, K.; et al. NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3621–3631.
3. Ding, D.; Wang, Z.; Xiong, H. Robust point cloud classification via semantic and structural modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 15078–15087.
4. Ben-Shabat, Y.; Gould, S. 3DInAction: Understanding Human Actions in 3D Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 19978–19987.
5. Chen, Y.; Liu, S.; Shen, X. Learnable Skeleton-Aware 3D Point Cloud Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 18101–18111.
6. Li, Z.; Xu, C.; Leng, B. Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 25368–25377.
7. Li, J.; Wang, J.; Chen, J.; Xu, T. Towards Robust Point Cloud Recognition with Sample-Adaptive Auto-Augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3003–3017.
8. Zhang, L.; Wang, Y.; Liu, H. Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 22–26 June 2025; IEEE: New York, NY, USA, 2025; pp. 20135–20144.
9. Liu, H.; Zhang, L.; Wang, Y. Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 22–26 June 2025; IEEE: New York, NY, USA, 2025; pp. 28745–28754.
10. Xiong, S.; Kasaei, H. LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition. In Proceedings of the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Eindhoven, The Netherlands, 25–29 August 2025; IEEE: New York, NY, USA, 2025; pp. 141–148.
11. Ma, S.; Dong, Z.; Cong, R.; Kwong, S.; Shao, X. Proto-FG3D: Prototype-based Interpretable Fine-Grained 3D Shape Classification. arXiv 2025, arXiv:2505.17666.
12. Ma, X.; Bai, J.; Su, Z.; Wang, Y. PVSTrans: Patch-View-Shape Progressive Interaction Transformer for 3D Shape Recognition. Inf. Process. Manag. 2026, 63, 104279.
13. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; IEEE: New York, NY, USA, 2015; pp. 945–953.
14. Feng, Y.; Zhang, Z.; Zhao, X.; Ji, R.; Gao, Y. GVCNN: Group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 264–272.
15. Yu, T.; Meng, J.; Yuan, J. Multi-view harmonized bilinear network for 3D object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 186–194.
16. Kanezaki, A.; Matsushita, Y.; Nishida, Y. RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 5010–5019.
17. Han, Z.; Shang, M.; Liu, Z.; Vong, C.M.; Liu, Y.S.; Zwicker, M.; Han, J.; Chen, C.P. SeqViews2SeqLabels: Learning 3D global features via aggregating sequential views by RNN with attention. IEEE Trans. Image Process. 2018, 28, 658–672.
18. Wei, X.; Yu, R.; Sun, J. View-GCN: View-based graph convolutional network for 3D shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; IEEE: New York, NY, USA, 2020; pp. 1850–1859.
19. Xu, M.; Chen, H.; Wang, Z. PAGNet: Path aggregation graph network for multi-view 3D shape recognition. Knowl.-Based Syst. 2021, 229, 107338.
20. Gao, H.; Ji, S. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 2083–2092.
21. Lee, J.; Lee, I.; Kang, J. Self-attention graph pooling. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019.
22. Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 4800–4810.
23. Bianchi, F.M.; Grattarola, D.; Alippi, C. Spectral clustering with graph neural networks for graph pooling. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; pp. 874–883.
24. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 1912–1920.
25. Lai, K.; Bo, L.; Ren, X.; Fox, D. A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; IEEE: New York, NY, USA, 2011; pp. 1817–1824.
26. Su, J.-C.; Gadelha, M.; Wang, R.; Maji, S. A deeper look at 3D shape classifiers. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
27. Jiang, J.; Bao, D.; Chen, Z.; Zhao, X.; Gao, Y. MLVCNN: Multi-loop-view convolutional neural network for 3D shape retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Association for Computing Machinery: New York, NY, USA, 2019; Volume 33, pp. 8513–8520.
28. Xu, L.; Cui, Q.; Xu, W.; Chen, E.; Tong, H.; Tang, Y. Walk in views: Multi-view path aggregation graph network for 3D shape analysis. Inf. Fusion 2024, 103, 102131.
29. Cheng, Y.; Cai, R.; Zhao, X.; Huang, K. Convolutional Fisher kernels for RGB-D object recognition. In Proceedings of the 2015 International Conference on 3D Vision (3DV), Lyon, France, 19–22 October 2015; IEEE: New York, NY, USA, 2015; pp. 135–143.
30. Rahman, M.M.; Tan, Y.; Xue, J.; Lu, K. RGB-D object recognition with multimodal deep convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 991–996.
31. Asif, U.; Bennamoun, M.; Sohel, F.A. A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2051–2065.
32. Li, J.; Liu, Z.; Li, L.; Lin, J.; Yao, J.; Tu, J. Multi-view convolutional vision transformer for 3D object recognition. J. Vis. Commun. Image Represent. 2023, 95, 103906.

Figure 1. The overall architecture of the proposed View-GFN. The framework takes multi-view images as input, extracts initial features via a CNN backbone, and processes them through a hierarchical structure consisting of cascaded CSF modules and view sampling modules. The final pooled features are concatenated to generate output scores.

Figure 2. The detailed architecture of the Graph Convolution and Sampling Fusion (CSF) module. It demonstrates the joint optimization of multi-scale feature embeddings (via stacked GCN layers and mixed pooling) and the generation of the assignment matrix.

Figure 3. Visualization of the hierarchical graph coarsening process. The soft-assignment matrix effectively groups semantically similar and topologically adjacent view images (e.g., four viewpoints capturing the legs of a chair) into a single localized super-node, preserving the structural integrity of the object while reducing the graph scale without hard-dropping.

Table 1. Classification accuracy and model complexity comparison on ModelNet40.

Method	Backbone	Type	Views	Inst. Acc. (%)	Class Acc. (%)	Params (M)	Time (s/epoch)
MVCNN-new [26]	VGG-M	Aggreg.	12	95.0	92.4	–	–
GVCNN [14]	GoogLeNet	Grouping	12	93.1	90.7	–	–
MHBN [15]	VGG-M	Bilinear	6	94.7	93.1	–	–
RotationNet [16]	AlexNet	View Opt.	20	97.4	96.8	–	–
MLVCNN [27]	ResNet-18	Multi-loop	36	94.2	–	–	–
MVPNet [28]	ResNet-18	Path Agg.	20	97.9	96.8	–	–
View-GCN [18]	ResNet-18	Graph Net.	20	97.6	96.5	33.9	62.6
PVSTrans [12]	ViT-B	Transf.	20	97.9	97.2	86.0	–
View-GFN (Ours)	ResNet-18	Graph Fus.	20	97.8	96.5	17.0	33.4

Table 2. Classification accuracy on the RGB-D dataset.

Method	Backbone	Views	Inst. Acc. (%)	Params (M)	Time (s/epoch)
MVCNN [13]	VGG-M	12	86.1	–	–
CFK [29]	–	120	86.8	–	–
MMDCNN [30]	VGG-M	120	89.5	–	–
MDSICNN [31]	VGG-M	120	89.9	–	–
View-GCN [18]	ResNet-18	12	94.3	22.7	1.1
View-GFN (Ours)	ResNet-18	12	94.1	17.0	0.5

Note: Methods such as CFK, MMDCNN, and MDSICNN employ 120 views, whereas View-GFN, View-GCN, and MVCNN utilize only 12 views. Despite the significantly fewer input views, View-GFN achieves highly competitive accuracy with substantially fewer parameters, underscoring its superior view utilization efficiency. “–” indicates that the corresponding information is not available from the original publication.

Table 3. Retrieval task performance comparison on ModelNet40 (mAP).

Method	mAP (%)
GVCNN [14]	85.7
MVCVT [32]	95.4
MLVCNN [27]	92.8
MVPNet [28]	97.4
View-GFN (Ours)	97.8

Table 4. Ablation study of View-GFN core components (ModelNet40, 20 views).

Configuration	Inst. Acc. (%)	Class Acc. (%)	Description
View-GFN-FPS	96.5	95.2	Replace soft-clustering with Farthest Point Sampling (FPS)
View-GFN-SEP	97.7	96.5	Decouple feature embedding and assignment matrix generation
View-GFN-A1	97.4	96.2	AM considers only 3 nearest neighbor nodes
View-GFN-A2	97.5	96.1	AM initialized with view coordinate encoding
View-GFN (Full)	97.8	96.5	Full model (Global AM + Soft-clustering + CSF)

