NGD-Transformer: Navigation Geodesic Distance Positional Encoding with Self-Attention Pooling for Graph Transformer on 3D Triangle Mesh

Following the significant success of the transformer in NLP and computer vision, this paper attempts to extend it to the 3D triangle mesh. The aim is to determine the shape’s global representation using the transformer and capture the inherent manifold information. To this end, this paper proposes a novel learning framework named Navigation Geodesic Distance Transformer (NGD-Transformer) for 3D meshes. Specifically, this approach combines farthest point sampling with the Voronoi segmentation algorithm to spawn uniform and non-overlapping manifold patches. However, the number of vertices in these patches is inconsistent. Therefore, self-attention graph pooling is employed to sort the vertices on each patch and screen out the most representative nodes, which are then reorganized according to their scores to generate tokens and their raw feature embeddings. To better exploit the manifold properties of the mesh, this paper further proposes a novel positional encoding called navigation geodesic distance positional encoding (NGD-PE), which encodes the geodesic distance between vertices in a relative and spatially symmetric manner. Subsequently, the raw feature embeddings and positional encodings are summed as input embeddings and fed to the graph transformer encoder to determine the global representation of the shape. Experiments on several datasets were conducted, and the experimental results show the excellent performance of our proposed method.


Introduction
Deep learning has made significant progress in many fields over the past few years. There are multiple types of deep neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention-based networks. Among them, attention-based networks have undergone rapid development in recent years, especially transformers, a prominent attention-based network that has achieved significant success in the last two to three years in NLP [1][2][3], computer vision [4][5][6][7][8], and other fields.
Inspired by these milestone works, this paper seeks to extend the transformer to the 3D triangle mesh. Point-based transformers [9][10][11][12] have recently shown excellent performance in point cloud classification, segmentation, and other 3D shape tasks. Although applying them directly to 3D triangle meshes is straightforward, point-based transformers operate in 3D Euclidean space and lack information on the topological connections between vertices.
An alternative is to represent a 3D triangle mesh as a graph and use powerful graph-based transformers [13][14][15][16][17] to learn the shape. Nevertheless, in addition to the vertex and edge information possessed in graph form, the 3D triangle mesh also contains rich intrinsic manifold properties. Therefore, a vital issue is how to preserve and capture the manifold information on the surface as much as possible while learning the global association map between the input tokens and the global representation of the shape. This paper presents a novel 3D triangle mesh learning framework named NGD-Transformer to solve the above problem. The proposed method is based on the graph-based transformer, and it aims to keep and capture as much intrinsic manifold information of the 3D mesh as possible. Specifically, this paper combines farthest point sampling (FPS) with the Voronoi segmentation algorithm to generate manifold-preserved, uniform, and non-overlapping mesh patches. However, the number of vertices differs from patch to patch. Therefore, for the follow-up processing, self-attention graph pooling (SAG pooling) [18] is employed to sort the vertices on each patch and screen out the most representative nodes. These nodes are then reorganized according to their scores to create tokens and raw feature embeddings.
To better exploit the manifold properties of the 3D triangle mesh, we further propose a novel positional encoding called navigation geodesic distance positional encoding (NGD-PE), which encodes the geodesic distances between vertices in a relative manner. The idea is derived from star-map navigation [19], in which the global position of an object is determined by the relative distance between the object and the navigation-star sequences. Thus, multiple navigation vertices with uniform and semi-symmetrical distribution on the mesh are generated and used as position references for the remaining vertices. Subsequently, raw feature embeddings and positional encodings are merged as input embeddings and fed to the graph transformer encoder to determine the global representation. The global features are then aggregated by a novel up-pooling operation, which is a structure symmetrical to the graph-pooling operation. The aggregated global features are then fused with the local features determined from the graph neural networks (GNNs). Based on the subsequent tasks, this paper also designs classifier and semantic segmentation networks to complete the shape classification and semantic segmentation tasks, respectively.
The main contributions of this paper are as follows.
1. A novel 3D triangle mesh learning framework, named NGD-Transformer, is proposed. This framework can effectively learn and fuse the local and global information of the mesh.
2. We propose a token-generation algorithm that combines FPS and Voronoi segmentation to generate manifold-preserved, uniform, and non-overlapping patches and spawn tokens and their representations via SAG pooling.
3. We further propose a novel geodesic relative positional encoding method called navigation geodesic distance positional encoding (NGD-PE), which encodes the geodesic distance between the vertices relatively for the transformer.

Transformer
Transformers first made a breakthrough in NLP and computer vision. We refer the reader to [1][2][3][4][5][6][7][8] for detailed information. For transformers applied to 3D shapes, one of the major research topics is the point-based transformers [9][10][11][12][20][21][22]. The main disadvantage of these methods is that they are based on the 3D Euclidean space, lacking information on the topological connections between vertices. An alternative is to design a transformer based on the 3D triangle mesh manifold space. For example, Sarasua et al. [23] proposed TransforMesh, which first combined the transformer and mesh for the longitudinal modeling of anatomical meshes. However, handling the edge connection information is problematic for this method. To solve this issue, some researchers process 3D triangle mesh with graph-based transformers. For example, Lin et al. [24] and He et al. [25] proposed Mesh Graphormer and GET based on graphs for 3D pose transfer tasks. However, currently, such studies are still rare. One of the reasons for this may be that graph-based transformers are primarily made for general graph structures, while 3D triangle mesh contains rich intrinsic manifold information.
Among the works mentioned above, MeshMAE [26] is the most relevant study to the proposed method. The main difference between MeshMAE and our proposed model is that we employ the manifold preserved segmentation algorithm to generate uniform and non-overlapping patches without a remeshing operation. In addition, this paper proposes a novel geodesic relative positional encoding method, NGD-PE, for learning geodesic information.

Mesh Sampling
Currently, mesh sampling methods are mainly divided into clustering-based [27,28] and shape sampling-based [29][30][31][32] methods. The main drawback of clustering-based approaches is that they destroy the manifold topology of the shape and lead to mesh degeneration. Many scholars use shape-sampling methods to preserve the manifold nature of 3D shapes. For instance, Ranjan et al. [29] sampled the mesh with the Quadric Error Metrics (QEM) criterion, but this approach is prone to producing degenerate meshes. Schult et al. [32] proposed a vertex-clustering method, but it tends to produce non-uniform triangles. Thus, this paper combines the manifold FPS algorithm and the Voronoi segmentation algorithm to generate uniform, non-overlapping, and manifold patches.

Positional Encoding
A critical step in the transformer is positional encoding, which represents the positional interactions between tokens. Currently, it can be divided into two categories: absolute encoding [3,7] and relative encoding [1,17,33,34]. Current 3D mesh transformers mainly use relative positional encoding based on the graph. For example, Kreuzer et al. [14] extended spectral encoding to the graph transformer by adding pre-computed Laplacian eigenvectors to the raw feature embedding to form the input embeddings. Another approach is to use edge weights as relative encodings [14,34,35]. However, the works mentioned above rarely design positional encodings starting from the 3D manifold surface. In this paper, we present a novel geodesic relative positional encoding method, named NGD-PE, which encodes the geodesic distance between the vertices in a relative manner for the transformer.

Preliminary
Generally, a 3D mesh can be expressed as a graph structure $G = (V, E, F)$, where $V$ is the set of vertices, $E$ is the set of edges, and $F$ is the set of faces. The graph can also be represented by various weight matrices, such as the Euclidean distance matrix, degree matrix, and Laplacian matrix. A vertex feature function is defined by assigning a scalar or vector to each vertex. Given a shape with $N$ vertices, each of which has feature dimension $D$, the input embedding can then be represented as a matrix $X \in \mathbb{R}^{N \times D}$. A graph neural network can be viewed as a nonlinear functional mapping between the graph $G$ and the output domain, namely $Y = f(G) = f(V, E, F)$. Low-level features (e.g., normal vector, curvature, geodesic distance) can be obtained from $G$ and represented as the vertex feature function $X$, and the edge list $E$ can usually be represented by an adjacency matrix $A$. Thus, in many neural networks, the mapping is written as $Y = f(X, A)$. The output $Y$ can take several forms. For example, if $Y$ is a category-probability vector, then the neural network is a classification network; if $Y$ is a class-probability matrix, then the neural network is a segmentation network.
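As a minimal sketch of the graph view described here, the following Python derives the edge set and adjacency matrix of a triangle mesh from its face list. The function name `mesh_to_graph` and the toy tetrahedron are illustrative assumptions and do not come from the original implementation.

```python
def mesh_to_graph(num_vertices, faces):
    """Return (edges, adjacency) for a triangle mesh given as a face list."""
    edges = set()
    for a, b, c in faces:
        # Each triangle contributes its three undirected edges.
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    adjacency = [[0] * num_vertices for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u][v] = 1
        adjacency[v][u] = 1
    return sorted(edges), adjacency


# Toy example (hypothetical data): a tetrahedron with 4 vertices and 4 faces.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
edges, A = mesh_to_graph(4, faces)
degrees = [sum(row) for row in A]
```

In the tetrahedron every vertex is connected to the other three, so the adjacency matrix is symmetric with a zero diagonal.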

Algorithm of the Proposed Method
In this section, we introduce the components of our framework. The overall flowchart is shown in Figure 1. The framework consists of three modules: local feature learning, global representation learning, and application. The global representation learning module has four parts: patch split, token generation, NGD-PE calculator, and graph transformer encoder. The aim was to change the basic architecture of the graph transformer as little as possible. Therefore, we mainly adapted the transformer to the 3D shape in terms of input embedding and positional embedding. Specifically, the input embedding format required by the transformer was produced by the patch split and token generation modules, and the required positional-embedding format was generated by the NGD-PE calculator. Finally, the global association of the above features was determined through the graph transformer encoder. This method also adopted the idea of feature fusion, combining local feature learning with global representation learning to generate richer features.
The pipeline proceeds as follows:
(1) We use the manifold-preserved segmentation algorithm to divide the shape into a sequence of uniform patches. Each patch, however, has a varying number of vertices.
(2) SAG pooling is then employed to sort the vertices on each patch and screen out the more representative nodes.
(3) The nodes are reorganized according to their scores to generate tokens and raw feature embeddings.
(4) To efficiently capture the geodesic distance information, the NGD-PE for each vertex is computed.
(5) NGD-PEs are fused with the raw feature embeddings to form input embeddings, which are further fed to the transformer encoder for learning.
(6) The learned embeddings can be used for tasks such as shape classification and shape semantic segmentation.

Token Generation
In this study, we combined FPS and the Voronoi segmentation algorithm to spawn patches. The proposed method randomly selected a vertex and then used the FPS algorithm to sample a set of uniform seeds on the manifold surface; the number of seeds is N. After obtaining the initial seeds, each vertex on the surface was labeled using the Voronoi segmentation algorithm. For the i-th vertex, its patch label is
$$\arg\min_{j} \rho_{ij} \tag{1}$$
where $\rho_{ij}$ is the geodesic distance between the i-th vertex and the j-th seed.
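The seed sampling and labeling step can be sketched as follows. This is a minimal illustration under stated assumptions: geodesic distances are approximated by shortest paths along weighted mesh edges (Dijkstra), the start vertex is fixed rather than random so the result is reproducible, and the names `dijkstra` and `fps_voronoi` are hypothetical. The source does not specify which geodesic solver was used.

```python
import heapq


def dijkstra(adj, source):
    """Shortest-path distances from source; adj maps vertex -> [(neighbor, weight)]."""
    dist = {v: float("inf") for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def fps_voronoi(adj, num_seeds, start=0):
    """Farthest point sampling of seeds, then Voronoi labeling as in Eq. (1)."""
    seeds = [start]
    seed_dist = [dijkstra(adj, start)]
    nearest = dict(seed_dist[0])  # distance from each vertex to its closest seed
    while len(seeds) < num_seeds:
        nxt = max(adj, key=lambda v: nearest[v])  # vertex farthest from all seeds
        seeds.append(nxt)
        d = dijkstra(adj, nxt)
        seed_dist.append(d)
        for v in adj:
            nearest[v] = min(nearest[v], d[v])
    # Patch label of vertex v: index j of the seed minimizing rho_vj.
    labels = {v: min(range(len(seeds)), key=lambda j: seed_dist[j][v]) for v in adj}
    return seeds, labels


# Toy example (hypothetical data): a path graph 0-1-2-3-4-5-6 with unit edge lengths.
adj = {i: [] for i in range(7)}
for i in range(6):
    adj[i].append((i + 1, 1.0))
    adj[i + 1].append((i, 1.0))
seeds, labels = fps_voronoi(adj, num_seeds=3, start=0)
```

On the toy path, the seeds land at the two ends and the middle, and every vertex receives exactly one label, so the resulting patches do not overlap.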
In this way, the original model is divided into N individual non-overlapping patches: every vertex belongs to exactly one patch, and the patches together cover the whole mesh. If the seed vertices are triangulated with the Delaunay algorithm, it can be seen that the topology of the whole shape is well preserved. Changing the number of sampling points generates shape representations at different scales (see Figure 2). Thus, the proposed method splits the entire mesh into multiple uniform, manifold-preserved, non-overlapping patches.
Patches can be treated as sub-meshes, but they differ in the number of vertices. To satisfy the requirement of a unified input embedding shape, the proposed method selects the most representative nodes in each patch: only K nodes from each patch are retained using the SAG pooling algorithm. For each patch, the features of the selected vertices form a $K \times d$ matrix, where $d$ is the dimension of the raw feature embeddings. We further extract the j-th ranked vertex from each patch and combine them to form the j-th token sequence, expressed as a token embedding. In this way, K individual matrices are created, one per token sequence. Subsequently, K graph transformers are designed to learn these K embeddings. After these steps, the features generated by the transformers can be returned directly to the original mesh through the pooling indices. Figure 3 shows the procedure for this part. In this process, the SAG pooling operation and the up-pooling operation are symmetrical in the feature space transformation. In other words, SAG pooling converts graph embeddings to token embeddings, while up-pooling converts token embeddings to graph embeddings.
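The token-generation step can be illustrated with a short sketch. It assumes the per-vertex scores are already available (in SAG pooling they are produced by a graph convolution, which is not reproduced here) and that every patch has at least K vertices. The function name `build_token_sequences` and the toy data are hypothetical.

```python
def build_token_sequences(labels, scores, features, num_patches, k):
    """Keep the top-k vertices per patch and regroup them into k token sequences.

    labels[v]   : patch index of vertex v
    scores[v]   : pooling score of vertex v (assumed given)
    features[v] : raw feature vector of vertex v
    """
    patches = [[] for _ in range(num_patches)]
    for v, p in enumerate(labels):
        patches[p].append(v)
    # Rank the vertices of each patch by descending score and keep the first k.
    ranked = [sorted(p, key=lambda v: -scores[v])[:k] for p in patches]
    # Token sequence j collects the j-th ranked vertex of every patch.
    index_seqs = [[ranked[p][j] for p in range(num_patches)] for j in range(k)]
    token_seqs = [[features[v] for v in seq] for seq in index_seqs]
    return index_seqs, token_seqs


# Toy example (hypothetical data): 6 vertices in 2 patches, keep K = 2 per patch.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.9, 0.5, 0.1, 0.4, 0.8]
features = [[float(v)] for v in range(6)]
index_seqs, token_seqs = build_token_sequences(labels, scores, features, 2, 2)
```

The returned `index_seqs` play the role of the pooling indices: they record which mesh vertex each token came from, so learned token features can later be written back to the mesh.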

Navigation Geodesic Distance
In measurement theory, a point can be located by at least three non-collinear reference vertices. This principle is widely used in GPS and star-map navigation [19]. To increase positioning accuracy, the navigation vertices should generally be evenly and widely distributed. Vertices generated by FPS are observed to satisfy this criterion. Therefore, the FPS algorithm is used to generate a sequence of navigation vertices; the number of navigation vertices is $N_u$. The seeds from the patch split module can be reused, or the navigation vertices can be resampled. The navigational geodesic distance of vertex i is defined in terms of $\rho_{ik}$, the geodesic distance between the i-th vertex and the k-th navigation vertex. The NGD feature distance $d(i, j)$ between the i-th vertex and the j-th vertex is further defined from these navigational geodesic distances, with a decay parameter $\alpha > 0$ and values in the range from 0 to 1.
Clearly, $d(i, j)$ satisfies the symmetry property, namely $d(i, j) = d(j, i)$, which guarantees the symmetry of the metric between two nodes. The NGD feature distance also indicates the geodesic proximity of two vertices on a surface. In general, the smaller the geodesic distance between two vertices, the larger the NGD feature distance value. The distribution of the NGD feature distance between a given vertex and the remaining vertices decays outward and is spatially isotropic and symmetrical (see Figure 4). The parameter $\alpha$ controls the decay rate of $d(i, j)$: the smaller the value of $\alpha$, the slower the decay.
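To make the stated properties concrete (symmetry, values between 0 and 1, larger values for closer vertices, slower decay for smaller α), the sketch below uses one possible functional form: an exponential of the negative mean absolute difference between two vertices' navigation-distance vectors. This form is an assumption chosen only because it exhibits those properties; it is not claimed to be the formula used in the paper. The name `ngd_feature_distance` and the toy vectors are hypothetical.

```python
import math


def ngd_feature_distance(g_i, g_j, alpha=1.0):
    """Assumed form: exp(-alpha * mean |g_i[k] - g_j[k]|) over navigation vertices k.

    g_i, g_j: geodesic distances from vertices i and j to each navigation vertex.
    """
    gap = sum(abs(a - b) for a, b in zip(g_i, g_j)) / len(g_i)
    return math.exp(-alpha * gap)


# Toy example (hypothetical data): a path 0-1-2 with navigation vertices 0 and 2.
g = [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]]
d01 = ngd_feature_distance(g[0], g[1])
d02 = ngd_feature_distance(g[0], g[2])
d01_slow = ngd_feature_distance(g[0], g[1], alpha=0.5)
```

Under this assumed form, a vertex compared with itself yields 1, the nearer vertex 1 scores higher than the farther vertex 2, and halving α raises the value for the same pair, which corresponds to a slower decay.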
In Figure 4, the first row represents the navigation vertices gathered through different sampling algorithms. The upper left is the FPS sampling algorithm, the upper right is the random sampling algorithm, and the red vertices are the navigation vertices. The second row describes the navigational geodesic distance of a given vertex i from the remaining vertices in shape; the red color indicates a larger NGD feature distance, the blue color indicates a smaller NGD feature distance, and the white vertices represent the given vertex i.
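The NGD feature distances of each vertex with respect to the set of navigation vertices are computed to form a real-valued matrix with one row per vertex. For the j-th token in the l-th token sequence, its global vertex identifier is $g_{l,j}$, and its input embedding is generated by summing its raw feature embedding and the positional encoding of vertex $g_{l,j}$ taken from this matrix.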
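A small sketch of how a token's input embedding could be formed by summation, assuming the positional-encoding matrix has one row per mesh vertex and as many columns as the raw embedding dimension. The names `input_embeddings` and `pe_matrix` and the toy numbers are hypothetical.

```python
def input_embeddings(token_indices, raw_embeddings, pe_matrix):
    """Sum each token's raw embedding with the positional-encoding row of its vertex."""
    out = []
    for g, x in zip(token_indices, raw_embeddings):
        pe = pe_matrix[g]  # row looked up by the global vertex identifier
        if len(pe) != len(x):
            raise ValueError("positional encoding and raw embedding sizes differ")
        out.append([a + b for a, b in zip(x, pe)])
    return out


# Toy example (hypothetical data): 4 vertices, 2 navigation vertices, 2 tokens.
pe_matrix = [[1.0, 0.25], [0.5, 0.5], [0.25, 1.0], [0.125, 0.125]]
tokens = [2, 0]                      # global vertex identifiers of the tokens
raw = [[1.0, 1.0], [0.5, 0.5]]       # raw feature embeddings of the tokens
emb = input_embeddings(tokens, raw, pe_matrix)
```

The size check reflects why an element-wise sum requires the number of navigation vertices to match the raw embedding dimension.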

Graph Transformer Encoder
We adopted the graph transformer frameworks [15,17] for learning. Figure 5 shows the block diagram of our graph transformer encoder. Since $\varphi_{ij}$ contains the relative position information of vertices i and j, it can also be taken as an edge feature. Given the input embeddings of the l-th token sequence, the multi-head attention coefficients between token j and token i are computed with the exponential scaled dot-product function $\langle q, k \rangle = \exp\left(q^{\top} k / \sqrt{d}\right)$, where $d$ is the hidden size of each head. Following the graph multi-head attention, the messages from each token j to token i are aggregated, and the outputs of the C attention heads are combined by the concatenation operation $\|$.
As with GAT, when the graph transformer is applied to the last output layer, the multi-head outputs are averaged and the non-linear transformation is removed. We finally stack multiple graph transformer layers to form an encoder module. The output of the graph transformer encoder can be represented as K individual matrices. For the j-th token in the i-th token sequence, the proposed method adopts the pooling-indices technique to restore its learned feature to the original mesh, where $i = 1, 2, \ldots, K$ and $j$ runs over the tokens of the sequence.
Figure 5. Block diagram of graph transformer encoder.
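The attention step can be sketched for a single head as follows. Several simplifications are assumed and are not taken from the paper: the learned query, key, value, and edge projections are replaced by the identity, the edge features are supplied as vectors of the same size as the token embeddings, and the edge term is added to both keys and messages. The name `edge_attention_head` and the toy input are hypothetical.

```python
import math


def edge_attention_head(h, e):
    """One attention head over n tokens with additive edge features.

    h[i]    : embedding of token i (length d)
    e[i][j] : edge feature vector for the pair (i, j) (length d)
    """
    n, d = len(h), len(h[0])
    out = []
    for i in range(n):
        # Unnormalized weight: exp(q_i . (k_j + e_ij) / sqrt(d)), with q = k = h here.
        weights = []
        for j in range(n):
            key = [h[j][c] + e[i][j][c] for c in range(d)]
            dot = sum(h[i][c] * key[c] for c in range(d))
            weights.append(math.exp(dot / math.sqrt(d)))
        total = sum(weights)
        alpha = [w / total for w in weights]  # attention coefficients, sum to 1
        # Aggregate the messages v_j + e_ij, with v = h here.
        out.append([
            sum(alpha[j] * (h[j][c] + e[i][j][c]) for j in range(n))
            for c in range(d)
        ])
    return out


# Toy example (hypothetical data): two one-hot tokens, all edge features zero.
h = [[1.0, 0.0], [0.0, 1.0]]
e = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
out = edge_attention_head(h, e)
```

With several heads, each head would use its own projections and the per-token outputs would be concatenated, or averaged in the last layer as described in the text.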
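Restoring token features to the mesh through pooling indices can be sketched as a scatter operation. How vertices that were not selected as tokens are filled is not stated in the source; leaving them at zero is an assumption of this sketch, and the name `up_pool` and the toy data are hypothetical.

```python
def up_pool(index_seqs, learned_seqs, num_vertices, dim):
    """Scatter learned token features back to their original mesh vertices."""
    # Assumption: vertices that were not kept as tokens stay at zero.
    restored = [[0.0] * dim for _ in range(num_vertices)]
    for indices, tokens in zip(index_seqs, learned_seqs):
        for g, feat in zip(indices, tokens):
            restored[g] = list(feat)
    return restored


# Toy example (hypothetical data): 2 token sequences over a 6-vertex mesh.
index_seqs = [[1, 5], [2, 4]]                    # pooling indices per sequence
learned = [[[1.0], [5.0]], [[2.0], [4.0]]]       # encoder outputs per sequence
mesh_features = up_pool(index_seqs, learned, num_vertices=6, dim=1)
```

This mirrors the pooling step: the indices recorded when tokens were selected are the same ones used to write features back.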

Networks
NGD-Transformer is applied to shape classification and semantic segmentation tasks, and a corresponding network structure is designed for each of the two tasks.

Classification Network
After passing through the transformer module, the embedding representation of the shape, given by the K output matrices of the encoder, is reshaped into a single matrix. Subsequently, the proposed method performs feature mapping through a set of MLPs and extracts global features through global pooling. The features are then transformed into the target category number via the fully connected network. The network block diagram is shown in Figure 6.

Semantic Segmentation Network
The output embeddings are firstly projected onto the global manifold structure through the up-pooling operation. Next, we combine the mapped features and the local features obtained from the GNN network to generate a new representation of each vertex following feature transformation through a set of MLPs. For the final results, we further post-process and optimize them using the conditional random field (CRF) [36]. The network block diagram is shown in Figure 7.

Experimental Setting
The PyTorch Geometric library was used to implement our method. The networks were trained and tested on a desktop computer with an NVIDIA GeForce RTX 3060 GPU. To increase the training speed, the uniform remeshing algorithm described by Huang et al. [37] was used to adjust the number of vertices to between 2048 and 8192.
The input features of our network have 38 dimensions: 16 wave kernel signature (WKS) features [38], 16 spin image (SI) features [39], 3 coordinate features, and 3 normal vector features. These comprise 32 intrinsic features and 6 extrinsic features. The input features were first transformed through a three-layer MLP network.
For the patch split module, the patch number was set between 48 and 96, roughly the square root of the vertex number of the 3D mesh. For the NGD-PE calculation module, the number of navigation vertices needed to be consistent with the dimension of the raw feature embeddings. The parameter α was set to its default value of 1. For the token generation module, the number of retained vertices, K, ranged from 4 to 16, and the pooling method adopted a graph attention convolution as its GNN with 2 to 4 attention heads, which retained enough information without increasing the computational burden.
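As a quick arithmetic check of the square-root rule against the remeshed vertex counts of 2048 to 8192:

$$\sqrt{2048} \approx 45.3, \qquad \sqrt{8192} \approx 90.5,$$

which is consistent with the stated patch range of 48 to 96.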
For the graph transformer module, we built the association matrix and edge features for any two tokens and used φij as the edge feature. The number of layers of the transformer encoder was set as 3.
We simultaneously used a GNN network, the ChebNet [40], to learn the local features of the shape. The above is the experimental setting for the public part, and the setting for the application network is explained in detail below.

Shape Classification Network
In this section, our classifier performance is evaluated. For the classification task, the evaluation was conducted using ModelNet40 [41] and ModelNet10. ModelNet40 is a large database for shape classification. It consists of 12,311 CAD models, which are divided into 9843 training models and 2468 testing models. ModelNet10 contains 4899 CAD models and is a subset of ModelNet40. We manually cleaned this subset of models and aligned their orientations. The shapes were also converted into watertight manifold meshes using the method mentioned in [37].
Our network structure is shown in Figure 6. After the public network part, global pooling was used to obtain the feature representation of the overall shape, and this feature was then input into a three-layer MLP network. The output of the classifier was the category probability vector. The log_softmax function was used for the cost function, and the network was optimized with the Adam optimizer, with a learning rate of $8 \times 10^{-4}$ and a momentum of 0.9. The network was trained for 100 epochs. Finally, the classification accuracy was used as the evaluation indicator.
The results, which are shown in Table 1, indicate that our method achieves the best results on the MN10 dataset. The best results and unreported results for each dataset are labeled in bold font and "-", respectively.
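The text names log_softmax as the cost function without giving the full loss. A common reading, assumed here rather than confirmed by the source, is a negative log-likelihood loss applied to log-softmax outputs. The sketch below shows that combination for a single sample; the function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of class scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class (assumed loss form)."""
    return -log_probs[target]


# Toy example (hypothetical data): two classes with equal scores.
log_probs = log_softmax([0.0, 0.0])
loss = nll_loss(log_probs, target=0)
```

With two equally scored classes each probability is one half, so the loss equals the natural logarithm of 2.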

Shape Semantic Segmentation Network
In this section, we evaluate our semantic segmentation performance. For this task, the evaluation was conducted using the PSB [45] and COSEG [46] datasets.
The Princeton Segmentation Benchmark (PSB) contains 19 categories, each including 20 shapes. The shapes in each category are segmented into 2 to 11 parts.
COSEG has eight scan categories and three synthetic categories, which are mainly used for shape semantic segmentation. The number of shapes varies considerably among the eight scan categories, and the three synthetic categories contain relatively more models: Alien (200 meshes), Vases (300 meshes), and Chairs (400 meshes).
Our semantic segmentation framework is shown in Figure 7. After the public network output the global features, these were fused with the local features and passed into a three-layer MLP network whose output was the category-probability vector of each vertex. Finally, the results were optimized with CRF. The log_softmax function was used for the cost function, and the network was optimized with the Adam optimizer, with a learning rate of $8 \times 10^{-4}$ and a momentum of 0.92. The network was trained for 50 epochs.
There are many definitions of segmentation accuracy. In this paper, the definition presented by Kalogerakis et al. [36] was applied, according to which the segmentation accuracy is defined as the ratio of the area of the correctly labeled faces to the total area of all the faces of the surface. Tables 2 and 3 report the segmentation accuracy for the PSB categories and for the large COSEG datasets, respectively. The bold numbers indicate the best results for each dataset, and '-' indicates results that were not reported. On the PSB dataset, our proposed method achieves the best results in 8 out of 19 categories. Figure 8 shows some of the segmentation results of our algorithm on the PSB and COSEG datasets.
Table 3. Segmentation accuracy of the large COSEG dataset. MeshCNN [30], PD-MeshNet [49], and LaplacianNet [27] are compared in the first three rows.
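The area-weighted segmentation accuracy of Kalogerakis et al. [36] used in this section can be sketched as follows. The function names and the toy mesh are illustrative assumptions; labels are taken per face.

```python
import math


def triangle_area(p, q, r):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    cx = u[1] * v[2] - u[2] * v[1]
    cy = u[2] * v[0] - u[0] * v[2]
    cz = u[0] * v[1] - u[1] * v[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def segmentation_accuracy(vertices, faces, predicted, ground_truth):
    """Area of correctly labeled faces divided by the total face area."""
    total = 0.0
    correct = 0.0
    for f, (a, b, c) in enumerate(faces):
        area = triangle_area(vertices[a], vertices[b], vertices[c])
        total += area
        if predicted[f] == ground_truth[f]:
            correct += area
    return correct / total


# Toy example (hypothetical data): one face of area 2 and one face of area 1.
vertices = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 1, 2), (0, 1, 3)]
acc = segmentation_accuracy(vertices, faces, predicted=[0, 1], ground_truth=[0, 0])
```

Because the measure is weighted by area, mislabeling the smaller face costs one third of the score rather than one half, as it would under a per-face count.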

Ablation Study
The large COSEG dataset was used to evaluate the components in our network. The results of the evaluation are shown in Table 3.

NGD-PE and Lap-PE
We compared the segmentation results of Lap-PE [14] and NGD-PE. Table 3 shows that our NGD-PE achieved better results than Lap-PE. Furthermore, the performance improvement was limited once the value of $N_u$ in NGD-PE exceeded a certain value. We also conducted experiments on the selection of α and achieved excellent results when α was in the range of 0.5 to 2.

SAG Pooling vs. Top-K Pooling
We compared the two pooling algorithms, SAG pooling and Top-K pooling. SAG pooling has one more graph convolution operation than Top-K pooling. Table 3 shows that SAG pooling performed better than Top-K pooling. We also selected different K values to observe their effects on performance. The performance improvement was relatively limited once K exceeded a certain value.

Conclusions and Discussion
This paper presented a novel transformer framework named NGD-Transformer for 3D triangle meshes. The framework bridges some critical gaps in the previous studies. First, the manifold-preserved segmentation algorithm was used to split the shape into a sequence of uniform, non-overlapping patches. Second, the SAG pooling algorithm was used to generate tokens and raw feature embeddings. Third, to effectively capture the geodesic distance information, we designed NGD-PE to represent geodesic interactions between tokens. Finally, the subsequent neural networks were constructed for the classification and semantic segmentation tasks, respectively.
We also conducted experiments on multiple datasets to test our proposed method. The experiments showed that our proposed network performs well in both classification and semantic segmentation. An ablation study on the NGD-PE module and the SAG pooling module was further conducted. NGD-PE was found to achieve better results than Lap-PE, and setting an appropriate number of sampling vertices and attenuation coefficient can further improve the segmentation results. In addition, learning the input features through a GNN before Top-K pooling, that is, using SAG pooling, can further improve the segmentation performance, and the improvement stabilizes when K is greater than a certain value. We believe that this performance improvement mainly benefits from learning the manifold properties of shapes with the graph transformer network structure.
This improved segmentation performance allows our method to be further applied to shape deformation and shape animation, which often need to accurately identify the specific parts in the shape. The improvement of the classification performance enables our method to be applied to shape retrieval and to retrieve 3D shapes more accurately. Multiple extensions can be further explored. For example, our network can be used for shape completion. Our network divides shapes into multiple patches. If some patches are missing, our network can predict and complete the remaining patches combined with the mask mechanism. Our method can also be applied to shape correspondence, through which it can establish a patch-level correspondence through the analysis of the correlation between patches. Furthermore, since the transformer has been widely used in fields such as NLP and computer vision, our framework can be combined with NLP or computer vision to perform cross-modal learning.

Conflicts of Interest:
We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work; there is no professional or other personal interest of any nature or kind in any product, service and company that could be construed as influencing the position presented in, or the review of, the manuscript here entitled.