Article

EGNet: 3D Semantic Segmentation Through Point–Voxel–Mesh Data for Euclidean–Geodesic Feature Fusion

1 School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
2 Jilin Provincial International Joint Research Center of Brain Informatics and Intelligence Science, Changchun 130022, China
3 Zhongshan Institute of Changchun University of Science and Technology, Zhongshan 528437, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(24), 8196; https://doi.org/10.3390/s24248196
Submission received: 28 October 2024 / Revised: 18 December 2024 / Accepted: 20 December 2024 / Published: 22 December 2024
(This article belongs to the Section Sensor Networks)

Abstract

With the advancement of service robot technology, the demand for higher boundary precision in indoor semantic segmentation has increased. Traditional methods of extracting Euclidean features using point cloud and voxel data often neglect geodesic information, reducing boundary accuracy for adjacent objects and consuming significant computational resources. This study proposes a novel network, the Euclidean–geodesic network (EGNet), which uses point cloud–voxel–mesh data to characterize detail, contour, and geodesic features, respectively. The EGNet performs feature fusion through Euclidean and geodesic branches. In the Euclidean branch, the features extracted from point cloud data compensate for the detail features lost by voxel data. In the geodesic branch, geodesic features from mesh data are extracted using inter-domain fusion and aggregation modules. These geodesic features are then combined with contextual features from the Euclidean branch, and the trajectory map recorded during mesh simplification is used for up-sampling to produce the final semantic segmentation results. The ScanNet and Matterport3D datasets were used to demonstrate the effectiveness of the EGNet through visual comparisons with other models. The results demonstrate the effectiveness of integrating Euclidean and geodesic features for improved semantic segmentation. This approach can inspire further research combining these feature types for enhanced segmentation accuracy.

1. Introduction

As home service robot technology advances, the demand for three-dimensional (3D) indoor scene understanding has increased. While an outstanding level of performance in robotic object grasp detection has been achieved, it is essential to identify the spatial layout of the interior environment, the type of object, and the positional relationship between objects [1,2]. Therefore, 3D scene understanding for indoor environments has become increasingly significant, with this study focusing on the 3D semantic segmentation of indoor scenes.
Three-dimensional point cloud semantic segmentation assigns semantic labels to individual points, linking each point to a label that represents a real category. However, point cloud data are characterized by their disordered distribution, uneven density, and sparsity, which makes it challenging to apply traditional convolutional methods effectively. Consequently, scholars have proposed voxelization, a method that transforms point clouds into dense, normalized data, enabling the use of standard 3D convolutions for segmentation [3,4]. Nevertheless, these methods often lose intricate shape details while retaining the sparsity of the point clouds. Liu et al. [5] proposed supplementing point cloud features with voxel data and incorporating multimodal data to improve local refinement. By using Euclidean convolutions, both point cloud and voxel data can bridge small gaps, facilitating the propagation and accumulation of neighborhood information between spatially adjacent objects. As shown in Figure 1a, objects such as windows and walls are typically detected beside a curtain by learning the Euclidean local point neighborhood. Although these methods have shown some results in the Euclidean domain, constructing topological structures of point sets solely from the Euclidean distance between points does not exploit surface information. Because various objects are spatially close to each other, semantic segmentation models may mistakenly recognize unconnected objects as connected within local neighborhoods. The mesh data structure approximates an object’s surface by a set of two-dimensional polygons in 3D space. A mesh can separate the boundaries of complex objects, distinguish between easily confused front–back topologies, and provide an intrinsic connectivity and information representation of the point set. By utilizing mesh data, curtains can be better distinguished from other objects, as shown in Figure 1b. By capturing the path lengths along the object’s surface topology, even spatially adjacent objects can be accurately distinguished by the difference in the length of their surface paths, enabling efficient boundary segmentation.
This study proposes a Euclidean–geodesic network (EGNet) that contains Euclidean and geodesic branches. The Euclidean branch extracts the detailed features of the point cloud data and the contour features of its voxelization. The geodesic branch extracts geodesic features from mesh data using graph message propagation methods and fuses these with contextual features extracted from the Euclidean branch for point cloud semantic segmentation. The primary challenge addressed in this study is how to effectively combine the features of point, voxel, and mesh data. Consequently, we propose a self-attention module and a cross-domain attention module based on the feature fusion method from reference [6]. The self-attention module uses the unified message passing model to focus attention on the vertex features of the mesh data. The cross-domain attention module uses the same model to achieve attention fusion: the features of the Euclidean branch are fused with the vertex features of the mesh data to integrate point cloud, voxel, and mesh data. The EGNet is evaluated on two large-scale real-world datasets, ScanNet and Matterport3D, and achieves satisfactory performance. Notably, the accuracy of some results even surpasses that of the manually annotated labels.

2. Related Work

2.1. Method Based on Euclidean Feature

Many researchers [7,8,9,10] have proposed point-based approaches. However, a significant portion of the processing time, approximately 90%, is spent on converting irregular data into structured neighborhood representations rather than on actual feature extraction. PtV2 [11] uses a fractal geometry approach to record the mapping generated through serialization to transform unstructured and irregular point clouds into regular and serialized data. While this method yields better results, it leads to the loss of boundary information. Another method divides the 3D space into regular voxels before the regular convolution of the voxels [12]. Although these approaches improve the efficiency of data access, voxelization incurs a loss of information and high computational and memory overheads when handling large-scale scenarios. To further improve the efficiency of 3D deep learning, Liu et al. proposed a method based on point–voxel convolution [5]. This method consists of two parts: the point-based branch, which processes each point directly, and the voxel-based branch, which fuses the voxelization results of the point-based branch before performing the convolution. Although it performs well for understanding small regions, it is not applicable to large-scale scenes. Therefore, a sparse convolution (SC) [13] is applied to the point–voxel fusion network to address the complexities of voxel calculations.

2.2. Method Based on Geodesic Feature

The geodesic distance measures the shortest path between any two vertices, calculated in two ways: the first method uses the inherent shape information of the object’s surface [14,15] and the second provides an approximate solution by constraining the shortest length of the path on graph data [16,17]. Both methods utilize the surface information of mesh data to construct graph data structures and extract features from these structures through a graph convolution. This approach also influenced our design philosophy and motivated the development of the EGNet. We perform element-wise multiplication between the weight matrix in the attention mechanism and the feature matrix constructed from the graph data, incorporating regularization to extract the relevance between two connected points. If the points belong to different labels, the corresponding elements in the weight matrix should approach zero.

2.3. Method Based on Geodesic Features and Euclidean Features

To simultaneously extract the features from point cloud and mesh data, many scholars have proposed solutions that fuse geodesic and Euclidean features. The DualConvMesh-Net model [18] uses a radius graph to define the concept of neighborhoods for vertices in the Euclidean space. Euclidean and geodesic features are extracted by using graph convolutions and dynamic graph convolutions, respectively, and subsequently, the resulting feature map is concatenated. The DualConvMesh-Net combines the benefits of 3D surface meshes and a Euclidean graph convolution on 3D vertices in the spatial domain. The VMNet [19] uses voxelization and SC to extract the Euclidean spatial features in the Euclidean module, and the voxel features are projected onto the mesh features by a linear interpolation for further geodesic convolution. In our method, a multi-layer perceptron (MLP) is applied to each point data to extract high-frequency features, compensating for the information loss caused by voxelization.

3. Methodology

We propose an indoor point cloud semantic segmentation network named the EGNet, which processes three types of 3D representations: points, voxels, and meshes. As depicted in Figure 2, the network consists of two branches: the upper one is the Euclidean branch and the lower one is the geodesic branch.
In the Euclidean branch, we use a feature extractor with a U-Net-like structure to extract Euclidean features from voxels. Additionally, inspired by the PointNet++ structure, we incorporate a point-based multilayer perceptron (MLP) to extract features from individual points and capture high-frequency features (i.e., fine details). However, dense sampling of the point cloud in certain areas can cause the network to over-weight these densely sampled regions. Consequently, the MLP focuses on extracting high-frequency features, while a combination of the SC and the submanifold convolution [20] extracts low-frequency features; sparse deconvolution is then applied to aggregate the high- and low-frequency features, enhancing the model’s segmentation efficiency. In the geodesic branch, vertex clustering (VC) and the quadric error metric (QEM) are used to simplify the mesh, which represents non-Euclidean data. This simplification records trajectory maps for the upsampling process. Subsequently, the mesh is transformed into a graph structure, in which a cross-domain attention module aggregates the features extracted from the Euclidean branch with those from the self-domain attention module. Finally, the aggregated features are processed by the self-domain attention module and upsampled further. Inspired by reference [19], this process is repeated six times to achieve satisfactory performance. In the geodesic branch, the self-domain attention module efficiently aggregates the vertices of the original mesh, and the features of the mesh vertices are fused with the sparse vertex features from the Euclidean branch via the cross-domain attention module, as depicted in Figure 2.

3.1. Euclidean Branch

(1) MLP based on point: Sparse voxel branches cannot model fine single-point features effectively. Inspired by the work of Tang et al. [9], we improve the method for combining the point features and voxel features of point cloud data. The MLP is used to extract the features of a single point and provide high-resolution point information to compensate for the information missing from coarse-grained voxels. We employ voxelization and devoxelization to achieve data consolidation during fusion.
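To make the point-based branch concrete, the following is a minimal PyTorch sketch of a shared point-wise MLP; the class name, channel sizes, and the six-dimensional input (xyz plus RGB) are illustrative assumptions rather than the network's exact configuration.

```python
import torch
import torch.nn as nn

class PointMLP(nn.Module):
    """Hypothetical sketch of a shared point-wise MLP: every point is
    transformed independently, so no neighborhood structure is needed."""
    def __init__(self, in_channels=6, hidden=32, out_channels=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_channels),
        )

    def forward(self, points):          # points: (N, in_channels), e.g., xyz + rgb
        return self.mlp(points)         # per-point high-resolution features: (N, out_channels)

# Example: per-point features for 10,000 points.
point_feats = PointMLP()(torch.randn(10000, 6))
```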
(2) Voxelization: The point cloud tensor is expressed as $C = \{P_c, F_c\}$, where $P_c = (x_i, y_i, z_i)$ defines the 3D coordinates of the point cloud and $F_c$ represents the feature vectors of the points [21]. The sparse tensor is described as $S = (\{P_s, F_s\}, v)$, where $P_s$ is the ratio of the 3D coordinates of the point cloud to the voxel size, $F_s$ is the feature tensor of the grid, and $v$ is the voxel resolution.
In the voxel branch, we rasterize the point cloud and convert the point cloud tensor $C$ into the sparse tensor $I$ as the network input. Equations (1) and (2) describe the voxelization:
$$P_s = \left(\tilde{x}_i, \tilde{y}_i, \tilde{z}_i\right) = \left(\mathrm{floor}\left(\frac{x_i}{v}\right), \mathrm{floor}\left(\frac{y_i}{v}\right), \mathrm{floor}\left(\frac{z_i}{v}\right)\right) \tag{1}$$
$$F_s = \frac{1}{N_m} \sum_{k=1}^{n} \mathbb{1}\left[P_s\right] \cdot F_c \tag{2}$$
where $\mathbb{1}[\cdot]$ is a binary indicator of whether $P_c$ is included in the grid, $N_m$ is the normalization coefficient representing the number of points located in the m-th non-zero voxel, and the floor function rounds down.
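As a concrete illustration of Equations (1) and (2), the sketch below floors the point coordinates by the voxel size and averages the features of all points falling into the same voxel. It is a plain PyTorch approximation for clarity, not the sparse-tensor implementation used in the network.

```python
import torch

def voxelize(coords, feats, voxel_size):
    """Illustrative voxelization following Equations (1)-(2)."""
    # Equation (1): integer voxel coordinates P_s.
    vox = torch.floor(coords / voxel_size).long()                    # (N, 3)
    # Group points that share the same voxel coordinate.
    uniq, inverse = torch.unique(vox, dim=0, return_inverse=True)    # (M, 3), (N,)
    # Equation (2): per-voxel feature F_s = sum of member features / N_m.
    sums = torch.zeros(uniq.shape[0], feats.shape[1]).index_add_(0, inverse, feats)
    counts = torch.zeros(uniq.shape[0]).index_add_(0, inverse, torch.ones(len(feats)))
    return uniq, sums / counts.unsqueeze(1)

coords, feats = torch.rand(1000, 3) * 4.0, torch.rand(1000, 6)       # toy scene
voxel_coords, voxel_feats = voxelize(coords, feats, voxel_size=0.05)
```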
(3) Devoxelization: In the devoxelization process, inspired by [17], we use trilinear interpolation at each layer to transform the eight neighboring voxel rasters into a point cloud tensor; the features of each point are interpolated from the features of its eight neighboring voxels. Note that if the devoxelization simply converted the voxelized data back into point cloud data, the resulting points would form a coarse 3D grid, causing information loss. One option to avoid this is to increase the voxel grid resolution; however, this would cause memory overflow. Therefore, we use the original 3D point cloud as nodes, following the model-inference approach used in [17]. For each point $O_i = \left(P_i^x, P_i^y, P_i^z\right)$, we define a random variable $x_i$ to denote the data category (semantic information). We also define $L_i(x_i)$ to be the scores (logits) associated with the probability distribution of $x_i$, where $L_i$ is computed as shown in Equation (3):
$$L_i\left(x_i\right) = \sum_{n=1}^{8} w_{i,n} L_{i,n}\left(x_{i,n}\right) \tag{3}$$
That is, $L_i$ is the weighted sum of the scores of the eight spatially nearest neighboring voxels $V_{i,n}$, $n \in \{1, \dots, 8\}$.
Our approach differs from that of [17] with respect to the computation of $w_{i,n}$: the goal of [17] is to implement the gradient feedback of conditional random fields (CRFs) in backpropagation by using $w_{i,n}$. In the original literature, the word “splat” is used to define the process of “feeding back” the 3D-FCNN scores to be used as unary terms in the CRF. The weight in [17] is computed as $w_{i,n} = \prod_{s \in \{x, y, z\}} \left(1 - \left|P_i^{s} - P_{i,n}^{s}\right| / V\right)$, where $V$ denotes the size of the voxel. We instead compute $w_{i,n}$ as shown in Equation (4):
$$w_{i,n} = \frac{1}{S_{xyz}} \left(1 - \frac{\left\| P_i - P_{i,n} \right\|}{V}\right) \tag{4}$$
where $S_{xyz}$ is the size of the voxel and $V$ denotes the distance in voxel space. We use the inverse of the distance as a weight, implying that a voxel center $P_{i,n}$ closer to the target point $P_i$ has a greater impact on the category score of the target point, because voxel centers closer to the target point are more representative of its features or category. This operation increases the difference between the weight vectors of differently labeled points located at object edges, which is more conducive to boundary segmentation.
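The sketch below illustrates Equations (3) and (4) for a single point: the logits of the eight neighboring voxels are combined with weights that increase as the voxel center approaches the point. The variable names, the clamp on negative weights, and the toy inputs are assumptions added for the example.

```python
import torch

def devoxelize_scores(point, neighbor_centers, neighbor_logits, voxel_size, s_xyz=8):
    """Illustrative per-point score following Equations (3)-(4)."""
    # Equation (4): weights shrink with the distance to each voxel center
    # (clamped at zero here purely to keep the toy example well behaved).
    dists = torch.norm(neighbor_centers - point, dim=1)              # (8,)
    w = (1.0 - dists / voxel_size).clamp(min=0) / s_xyz              # (8,)
    # Equation (3): weighted sum of the eight neighboring voxel logits.
    return (w.unsqueeze(1) * neighbor_logits).sum(dim=0)             # (num_classes,)

point = torch.tensor([0.12, 0.40, 0.33])
centers = point + (torch.rand(8, 3) - 0.5) * 0.05                    # eight nearby voxel centers
logits = torch.randn(8, 20)                                          # scores for 20 classes
scores = devoxelize_scores(point, centers, logits, voxel_size=0.05)
```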
(4) Combination of sparse convolution and submanifold sparse convolution: Generally, sparse tensors are convolved in a certain access order. However, the conventional convolution operation is not adapted to the sparsity of the input data, resulting in the “submanifold dilation problem”. Submanifold sparse convolution (SSC) [20] addresses submanifold dilation by restricting the convolutional output to the set of active input sites so that the sparsity stays constant across layers. However, SSC processes each active site separately, which limits the network’s ability to exploit correlations with neighborhood information. Therefore, our method combines SC(m, n, f, s = 2) and SSC(m, n, f, s = 1) to build a U-Net-based convolutional network, where m is the number of input features, n is the number of output features, f is the filter size, and s is the stride. For an input of size l, the size of the output feature map is (l − f + s)/s.
Given that the method in [22] can only perform matrix multiplication for a single kernel offset at a time, we construct a hash table and a rulebook for the active input sites in the sparse tensor I, inspired by TorchSparse [23], to achieve parallel computation on the GPU. The hash table contains the index and position tuples of all active input sites, while the rulebook records the input positions of the active sites and the output positions generated by convolution with each kernel offset. SC and SSC are used to perform neighborhood feature aggregation on the sparse tensors. We parallelize the kernel mapping operations on the GPU based on the created hash table. The voxelized coordinate of each point in the point tensor is used to look up its index in the sparse tensor, which is more efficient than a sequential search.
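A simplified, pure-Python illustration of this bookkeeping is given below: active coordinates are hashed into a dictionary, and for every kernel offset the rulebook stores the (input index, output index) pairs whose neighbor is also active, as in submanifold convolution. This is only a sketch of the idea, not the GPU-parallel TorchSparse kernels.

```python
import torch

def build_kernel_map(active_coords, kernel_offsets):
    """Illustrative hash table + rulebook for active sites of a sparse tensor."""
    # Hash table: coordinate tuple -> index of the active input site (O(1) lookup).
    table = {tuple(c.tolist()): i for i, c in enumerate(active_coords)}
    rulebook = {tuple(k.tolist()): [] for k in kernel_offsets}
    for out_idx, c in enumerate(active_coords):
        for k in kernel_offsets:
            in_idx = table.get(tuple((c + k).tolist()))
            if in_idx is not None:                     # neighbor is an active site
                rulebook[tuple(k.tolist())].append((in_idx, out_idx))
    return rulebook

coords = torch.tensor([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]])  # active voxels
offsets = torch.tensor([[dx, dy, dz] for dx in (-1, 0, 1)
                        for dy in (-1, 0, 1) for dz in (-1, 0, 1)])  # 3x3x3 kernel
rulebook = build_kernel_map(coords, offsets)
```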

3.2. Geodesic Branch

(1) Mesh simplification: Mesh simplification minimizes the number of triangular faces while preserving the model’s geometric details and texture components. Existing methods include VC [24], the edge-collapse quadric error metric (QEM) [25], wavelet discretization [26], and a combination of VC and QEM [19]. Our experiments compare these methods and reveal that edge collapse yields the best performance. Although the literature [19] shows that directly applying the QEM method produces high-frequency noise and that combining clustering and edge collapse is optimal, we have already extracted the high-frequency features from the point cloud data using the MLP network in the Euclidean branch. Therefore, we do not need the high-resolution features embedded in the mesh data prior to simplification. Instead, we employ an attention mechanism during feature aggregation to select the high-frequency features that best match the geodesic features of the mesh data. This mechanism filters the high-frequency noise in the mesh data, achieving an effect similar to that of the VC method. Because we only extract geodesic features from the mesh data, we also do not need to consider whether the collapsed edges depend on a specific task [27].
Furthermore, trajectory maps are saved to track pooling between the various mesh levels. In practice, we set up seven levels to enrich the multi-resolution information, as shown in Figure 3. A trajectory map records vertex connectivity between pooled and non-pooled vertices, facilitating quick searches between neighboring mesh levels. By tracking these simplification contractions, we generate the pooled trajectory maps used to handle the multi-resolution hierarchy, as sketched below.
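The sketch below shows how such a trajectory (trace) map can drive pooling between two mesh levels, assuming the map stores, for every fine-level vertex, the index of the coarse-level vertex it collapses into; the tensor shapes and the averaging rule are illustrative.

```python
import torch

def pool_by_trace(vertex_feats, trace_map, num_coarse):
    """Average fine-level vertex features into their coarse-level targets."""
    sums = torch.zeros(num_coarse, vertex_feats.shape[1])
    sums.index_add_(0, trace_map, vertex_feats)
    counts = torch.zeros(num_coarse).index_add_(0, trace_map, torch.ones(len(trace_map)))
    return sums / counts.clamp(min=1).unsqueeze(1)

fine_feats = torch.randn(12, 16)                                 # 12 fine-level vertices
trace = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])       # fine -> coarse vertex index
coarse_feats = pool_by_trace(fine_feats, trace, num_coarse=6)
# Unpooling in the decoder can simply index back: coarse_feats[trace].
```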
(2) Self-domain Attention: The adjacency matrix $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ describes the graph $G$, and the diagonal matrix $D$ is defined as $D = \mathrm{diag}(d_1, d_2, \dots, d_n)$ with $d_i = \sum_{j} a_{ij}$, where $d_i$ denotes the degree of vertex $i$. The term $D^{-1}A$ represents the graph features of the geodesic branch or of the Euclidean branch.
To effectively aggregate the vertex features of the mesh, we construct the self-domain attention module in the geodesic branch by referring to reference [28]. The attention weight $\alpha_{ij}^{(l)}$ for each edge from vertex $j$ to vertex $i$ in the mesh data is computed using the following formulas:
$$q_i^{(l)} = W_q^{(l)} \left(D^{-1}A\right)_{\mathrm{geo}}^{(l)} + b_q^{(l)}$$
$$k_i^{(l)} = W_k^{(l)} \left(D^{-1}A\right)_{\mathrm{geo}}^{(l)} + b_k^{(l)}$$
$$\alpha_{ij}^{(l)} = \frac{\left\langle q_i^{(l)}, k_j^{(l)} \right\rangle}{\sum_{u \in N(i)} \left\langle q_i^{(l)}, k_u^{(l)} \right\rangle}$$
where $\left\langle q, k \right\rangle = \exp\left(\frac{q^{\top} k}{\sqrt{d}}\right)$; $q_i$, $k_i$, and $v_i$ correspond to the $Q$, $K$, and $V$ of conventional attention, respectively, and $d$ is the output dimension. Message aggregation is conducted with the obtained attention weights $\alpha_{ij}$ as follows:
$$v_j^{(l)} = W_v^{(l)} \left(D^{-1}A\right)_{\mathrm{geo}}^{(l)} + b_v^{(l)}$$
$$\hat{h}_i^{(l+1)} = \sum_{j \in N(i)} \alpha_{ij}^{(l)} v_j^{(l)}$$
To prevent excessive smoothing of network weights, we add a gated residual connection [29].
The formulas are as follows:
$$r_i^{(l)} = W_r^{(l)} \left(D^{-1}A\right)_{\mathrm{geo}}^{(l)} + b_r^{(l)}$$
$$\beta_i^{(l)} = \mathrm{sigmoid}\left(W_g^{(l)} \left[\hat{h}_i^{(l+1)};\, r_i^{(l)};\, \hat{h}_i^{(l+1)} - r_i^{(l)}\right]\right)$$
$$\left(D^{-1}A\right)^{(l+1)} = \mathrm{ReLU}\left(\mathrm{LayerNorm}\left(\left(1 - \beta_i^{(l)}\right) \hat{h}_i^{(l+1)} + \beta_i^{(l)} r_i^{(l)}\right)\right)$$
If the self-domain attention module is used as the last output layer, the ReLU and LayerNorm transformations are removed, using the following formula:
$$\left(D^{-1}A\right)^{(l+1)} = \left(1 - \beta_i^{(l)}\right) \hat{h}_i^{(l+1)} + \beta_i^{(l)} r_i^{(l)}$$
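For clarity, the following is a compact PyTorch sketch of the self-domain attention block described by the formulas above, using an edge list and scatter-style aggregation. The layer names, the exponential attention kernel, and the simple edge handling are simplifications of the idea, not the exact implementation.

```python
import torch
import torch.nn as nn

class SelfDomainAttention(nn.Module):
    """Sketch: per-edge attention over mesh vertices plus a gated residual."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.r = nn.Linear(dim, dim)
        self.gate = nn.Linear(3 * dim, 1)
        self.norm = nn.LayerNorm(dim)
        self.d = dim

    def forward(self, h, edge_index):     # h: (N, dim); edge_index: (2, E), rows = (src j, dst i)
        j, i = edge_index
        q, k, v, r = self.q(h), self.k(h), self.v(h), self.r(h)
        # <q_i, k_j> = exp(q_i . k_j / sqrt(d)), normalized over the neighbors of i.
        logits = torch.exp((q[i] * k[j]).sum(-1) / self.d ** 0.5)            # (E,)
        denom = torch.zeros(h.size(0)).index_add_(0, i, logits)
        alpha = logits / denom[i]
        # Message aggregation: h_hat_i = sum_j alpha_ij * v_j.
        h_hat = torch.zeros_like(h).index_add_(0, i, alpha.unsqueeze(-1) * v[j])
        # Gated residual connection followed by LayerNorm and ReLU.
        beta = torch.sigmoid(self.gate(torch.cat([h_hat, r, h_hat - r], dim=-1)))
        return torch.relu(self.norm((1 - beta) * h_hat + beta * r))

h = torch.randn(6, 32)                                        # 6 mesh vertices, 32-d features
edges = torch.tensor([[0, 1, 2, 3, 4, 5], [1, 0, 3, 2, 5, 4]])
out = SelfDomainAttention(32)(h, edges)
```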
(3) Cross-domain Attention Module: To fuse the mesh data features with the features extracted by the Euclidean branch, we designed the cross-domain attention module, which mirrors the structure of the self-domain attention module; the difference lies in the input features. First, when calculating the attention weight $\alpha_{ij}^{(l)}$, we use the Euclidean feature $\left(D^{-1}A\right)_{\mathrm{euc}}^{(l)}$ to compute $k_i^{(l)}$, identifying the Euclidean features that correspond to the current geodesic features:
$$k_i^{(l)} = W_k^{(l)} \left(D^{-1}A\right)_{\mathrm{euc}}^{(l)} + b_k^{(l)}$$
Second, when incorporating the gated residual connection, the Euclidean features $\left(D^{-1}A\right)_{\mathrm{euc}}^{(l)}$ are used:
$$r_i^{(l)} = W_r^{(l)} \left(D^{-1}A\right)_{\mathrm{euc}}^{(l)} + b_r^{(l)}$$
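Under the same assumptions, the cross-domain variant can be sketched by reusing the layers above: the keys and the residual path are computed from the Euclidean-branch features, while the queries and values stay on the geodesic features. This extends the SelfDomainAttention sketch and is not the exact implementation.

```python
import torch

class CrossDomainAttention(SelfDomainAttention):    # reuses the sketch defined above
    """Sketch: keys and residual come from Euclidean features h_euc,
    queries and values from geodesic features h_geo (same dimensionality assumed)."""
    def forward(self, h_geo, h_euc, edge_index):
        j, i = edge_index
        q, v = self.q(h_geo), self.v(h_geo)
        k, r = self.k(h_euc), self.r(h_euc)         # Euclidean features enter here
        logits = torch.exp((q[i] * k[j]).sum(-1) / self.d ** 0.5)
        denom = torch.zeros(h_geo.size(0)).index_add_(0, i, logits)
        alpha = logits / denom[i]
        h_hat = torch.zeros_like(h_geo).index_add_(0, i, alpha.unsqueeze(-1) * v[j])
        beta = torch.sigmoid(self.gate(torch.cat([h_hat, r, h_hat - r], dim=-1)))
        return torch.relu(self.norm((1 - beta) * h_hat + beta * r))

h_geo, h_euc = torch.randn(6, 32), torch.randn(6, 32)
edges = torch.tensor([[0, 1, 2, 3, 4, 5], [1, 0, 3, 2, 5, 4]])
fused = CrossDomainAttention(32)(h_geo, h_euc, edges)
```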

4. Experiment Results and Analyses

4.1. Datasets

To demonstrate the validity and robustness of our proposed model, the EGNet, we conducted experiments on two benchmarks: ScanNet v2 and Matterport3D [6,29].
ScanNet v2 [29] updates the annotations of ScanNet, achieving a surface coverage of 90%. This large-scale dataset comprises 2.5 million RGB-D images collected from 1513 scans across 707 different indoor scenes. The dataset’s annotations include camera poses, textured meshes, dense object-level semantic segmentation, and aligned computer-aided design (CAD) models, which are valuable for scene understanding. ScanNet v2 surpasses previous RGB-D datasets by more than one order of magnitude in size. To evaluate the effectiveness of the EGNet, we performed an ablation study using the ScanNet validation set.
Matterport3D [6] is an RGB-D dataset that includes 10,800 panoramic views and 194,400 RGB-D images from 90 building-scale scenes. The dataset comprehensively provides labels for the walls, floors, ceilings, doors, and windows of each house in a 3D mesh. It is divided into training, validation, and test splits of 61, 11, and 18 scenes, respectively. Additionally, we report the average category accuracy scores for 21 categories on the test set.

4.2. Implementation Detail

The EGNet was trained and tested on a single 32 GB GV100GL GPU using Python 3.6 and PyTorch 1.4 in a CUDA 10.1 and Ubuntu 18.04 environment. During training, we employed cross-entropy as the loss function for both modules, summing the individual losses to obtain the total loss. We used the stochastic gradient descent (SGD) optimizer with a poly scheduler to minimize the loss, with the initial learning rate set to 0.1 and the power to 0.9. The maximum number of iterations was set to 80,000; once this limit was reached, the learning rate decayed to its final value. Considering the hardware limitations, the batch size was set to six.
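A minimal sketch of this optimizer setup (SGD, poly schedule with power 0.9, 80,000 iterations, initial learning rate 0.1, summed cross-entropy losses) is shown below; the placeholder model, toy batch, and loop length are illustrative only.

```python
import torch

model = torch.nn.Linear(16, 20)                      # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
max_iter, power = 80_000, 0.9
# Poly schedule: scale the base LR by (1 - iter / max_iter) ** power each step.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - min(it, max_iter) / max_iter) ** power)

for it in range(10):                                 # placeholder training loop
    x, y = torch.randn(6, 16), torch.randint(0, 20, (6,))   # batch size six, as above
    loss = criterion(model(x), y)                    # per-module losses would be summed here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```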
To improve the robustness of the model, we trained the network without cropping the dataset using a seven-level mesh simplification process. This included random edge dropping, color dithering, and random scaling at each level of the mesh. Specifically, we applied VC on the input grid at levels 1 and 2 based on a unit length of two cm, while the QEM was used for the remaining five levels to reduce the number of vertices to reach the 30% simplification target. The final data input to the network was organized as a dictionary containing eight levels of vertices, seven levels of trajectories, raw colors, labels, and additional features.
To evaluate the performance of the EGNet in the task of semantic segmentation, we validated the effectiveness of our proposed method on both the ScanNet v2 and Matterport3D datasets. On the ScanNet v2 dataset, we adopted two commonly used evaluation metrics: the mean average precision (mAP) and the mean class intersection over union (mIoU). On the Matterport3D dataset, we employed the mAcc metric, reporting the mAcc over 20 categories. The mAP measures the detection accuracy and coverage of the model across different semantic categories, combining both precision and recall. The mIoU evaluates the average overlap between the predicted regions and the ground-truth regions for each semantic category. The mAcc assesses the classification accuracy of the prediction results compared to the true results at each point.
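For reference, the sketch below computes mIoU, mAcc, and OA from a confusion matrix; it is a generic implementation of these standard metrics rather than the benchmarks' official evaluation scripts, and the class count and random labels are placeholders.

```python
import torch

def segmentation_metrics(pred, target, num_classes):
    """mIoU, mAcc, and OA from 1-D integer label tensors of equal length."""
    # Confusion matrix: rows = ground truth, columns = prediction.
    conf = torch.bincount(target * num_classes + pred,
                          minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = conf.diag().float()
    union = conf.sum(0).float() + conf.sum(1).float() - tp          # gt + pred - tp
    per_class_iou = tp / union.clamp(min=1)
    per_class_acc = tp / conf.sum(1).float().clamp(min=1)
    oa = tp.sum() / conf.sum().float()
    return per_class_iou.mean().item(), per_class_acc.mean().item(), oa.item()

pred = torch.randint(0, 20, (100_000,))
target = torch.randint(0, 20, (100_000,))
miou, macc, oa = segmentation_metrics(pred, target, num_classes=20)
```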

4.3. Experimental Results and Analysis

The performance of the proposed network for indoor scene segmentation was evaluated through experiments conducted on the ScanNet V2 dataset; Table 1 presents the results. Our method achieved a mean intersection over union (mIoU) of 73.3% on the validation set and 74.1% on the test set. Compared with DCM-Net, the mIoU performance improved by 8.3%; compared with the leading sparse convolution method (i.e., SparseConvNet), the mIoU increased by 1.6%. The proposed method outperformed RFCR+KPConv on the test set by 3.8%, as it employs a cross-modal feature complementation strategy, directly fuses features at the semantic level, avoids over-reliance on local features, and effectively models complex geometric and topological structures. While the AF-GCN method optimizes the traditional UNet architecture by combining graph convolutional networks (GCNs) and geometric attention modules (GAFs), it has certain disadvantages in computational complexity, model complexity, and handling large-scale data. Although our method’s performance on the validation set is 0.1% lower, it outperforms AF-GCN by 2.3% on the test set, fully demonstrating the superiority of our method in terms of computational efficiency and global geometric structure modeling. Although the PointTransformer series achieved better results on the ScanNet dataset, its neglect of the geodesic information on object surfaces may limit its performance in tasks requiring a high boundary segmentation accuracy. For example, in robotic arm-grasping path-planning tasks, the boundary is often used to determine the homogeneous transformation matrix of the gripper for a grasp pose. As shown in Table 2, the proposed method demonstrated superior mAP performance, indicating its capability to meet the requirements for a high boundary segmentation accuracy in tasks such as the spatial positioning of operational targets for indoor wheeled robots. Figure 4 shows the results of our model on the ScanNet V2 dataset. Additionally, we compared it with other methods, demonstrating its effectiveness in predicting challenging categories, such as storage cabinets and curtains, and in edge segmentation accuracy.
Additionally, Table 3 presents the overall evaluation results of the EGNet on the Matterport3D benchmark, showing the mean accuracy (mAcc) across 20 categories. Our model’s overall mean accuracy (omAcc) is 0.3% higher than that of the previous state-of-the-art methods. Although the improvement is modest, our method effectively produced accurate predictions even in cases in which the original annotations might have been incorrect or missing. Figure 5 shows a comparison with other methods. Our method demonstrates a superior accuracy and edge segmentation precision compared to the MinkowskiNet method on the Matterport3D dataset, and it can correctly identify targets even with missing or erroneous annotations.
In robotic arm-grasping path planning, accurately identifying target boundaries is crucial for determining the homogeneous transformation matrix of the gripper, thereby ensuring precise grasping poses. Figure 6 shows a detailed visualization of local features in the ScanNet validation set, with key differences highlighted in yellow bounding boxes. The first example focuses on a kitchen corner, for which the EGNet demonstrates smoother and more refined segmentation results compared to the ground-truth annotations. The second and third examples shift to bedroom scenes. Even in cases of blurred boundaries, our method accurately distinguishes cabinets from sofas. Notably, the third example reveals a significant improvement in recognizing the door, in which discrepancies are evident. Furthermore, our method correctly identifies a shelf misannotated as part of the floor, treating it as a distinct object. These findings hold significant practical value for grasping tasks, as clear segmentation boundaries are critical for improving the accuracy of target boundary localization in robotic arm operations.

4.4. Ablation Study

In this section, we conduct ablation experiments on the Scannet V2 for network components to further highlight the significance of building modules in the EGNet. In all the experiments, we ensured consistent parameters throughout the study.
To validate our approach, we conducted an ablation study based on the preliminary assessment of combining the data from the two modules. We compared the EGNet to two baseline networks: “Euc Only”, a U-Net structure based on a sparse convolution operating on voxels, and “Geo Only”, a network with an identical structure based on a graph convolution of multi-level mesh simplification information.
Table 4 demonstrates that using the two branches in parallel significantly enhanced the network’s segmentation performance. In our experiments, both branches were implemented with a U-Net structure; they differed mainly in their number of parameters. To mitigate this difference, we increased the number of channels and layers in the geodesic branch, narrowing the parameter gap to a factor of approximately 1.5. Ultimately, the two modules showed comparable overall performance, with the “Geo Only” module outperforming the “Euc Only” module by 1.3% in mIoU. By combining the two branches, we achieved an mIoU of 73.3%, an mAcc of 80.4%, and an OA of 90.6%.
Additionally, we compared the proposed model with other module configurations. The “Euc Point” module extended the “Euc Only” module with a point-based branch by transforming and combining the information contained in voxels and points. It compensated for the voxelization-related information error and allowed the model to concentrate on the intricate details of the interior space. When the mesh features extracted by the self-domain attention module are fused with the Euclidean branch solely through weight matrices, the gradients generated during backpropagation are not sufficient to effectively reduce the erroneous weights (i.e., the deviation of the probability distribution) output by the Euclidean branch, as illustrated in the fourth column of Figure 7. As shown in Figure 7, integrating the sparse vertex features from the Euclidean branch through our proposed cross-domain attention module addresses this issue and yields smoother, more accurate edge segmentation. Furthermore, the introduction of the original mesh vertices reduced the classification error rate caused by extracting Euclidean features solely from voxels, because the probability distribution of points in the devoxelization process is optimized during backpropagation. The results in Table 4 highlight the effectiveness of this improvement: compared with the “Euc Only” module, the mIoU improved by 1.8%, the mAcc by 1.4%, and the OA by 0.4%. The “Geo Self” module is an enhancement of the “Geo Only” module, utilizing a self-attention mechanism to effectively aggregate geodesic features from the mesh and improve local information. The “Geo Self” module achieved an mIoU of 70.9%, an mAcc of 79.3%, and an OA of 89.4%, as shown in Table 4. While its performance was only slightly superior to that of the “Geo Only” module, a further 2.4% mIoU improvement was achieved through the fusion block. This block integrated information from the Euclidean module into the geodesic module, enabling the adaptive fusion of features from the two domains.

4.5. Model Efficiency

Our proposed model has 44.3 million parameters. With a batch size of one, the training phase has a latency of 326 ms and a memory usage of 4.7 GB; during inference, the latency is reduced to 49 ms with 1.5 GB of memory. On the NVIDIA Orin edge computing device, the inference latency is 300 ms. The device is equipped with 512 Volta-architecture tensor computation units, an eight-core ARM-based CPU, and 32 GB of memory. This configuration efficiently supports tasks such as scene comprehension, target edge detection, and robotic arm-grasping path planning.
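A sketch of how such figures might be obtained is shown below, covering parameter count, averaged inference latency, and peak GPU memory; the helper name, warm-up scheme, and CUDA assumption are illustrative rather than the exact profiling protocol used.

```python
import time
import torch

def profile(model, example_input, warmup=5, runs=20):
    """Report (parameters in millions, mean latency in ms, peak GPU memory in GB)."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    model, example_input = model.eval().cuda(), example_input.cuda()   # CUDA assumed
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        for _ in range(warmup):                      # warm-up runs are excluded from timing
            model(example_input)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(example_input)
        torch.cuda.synchronize()
    latency_ms = (time.time() - start) / runs * 1000
    mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return params_m, latency_ms, mem_gb

# Example (requires a GPU): profile(torch.nn.Linear(16, 20), torch.randn(1, 16))
```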

5. Conclusions

This study proposes a semantic segmentation network named the EGNet for indoor scenes with mesh data as input. It contains Euclidean and geodesic branches. The Euclidean branch extracts detailed features and voxelized contour features from point cloud data, while the geodesic branch extracts geodesic features from mesh data by graph message propagation. The features of point, voxel, and mesh data are fused by self-attention and cross-domain attention modules. The Matterport3D and ScanNet v2 datasets are used to demonstrate the effectiveness of the EGNet. The results indicate that the fusion is effective for semantic segmentation. This study informs the research on the potential of integrating both Euclidean and geodesic features in semantic segmentation. The EGNet has broad application prospects, especially in robotic arm-grasping path planning tasks. These tasks typically require the precise boundary identification of the target to determine the homogeneous transformation matrix of the gripper, ensuring an accurate grasping pose. Given the EGNet’s outstanding performance in boundary segmentation accuracy, it can meet the high-precision requirements for the spatial localization of operational targets in indoor environments.

Author Contributions

Conceptualization, Q.L. and Y.S.; methodology, Q.L., Y.S. and H.Z.; software, Y.S.; validation, Y.S., X.J. and Y.W.; formal analysis, Y.S.; investigation, Y.W. and H.Z.; resources, Q.L.; data curation, Y.S., Y.W. and H.Z.; writing—original draft preparation, Q.L. and Y.S.; writing—review and editing, Y.S. and H.Z.; visualization, Y.W.; supervision, D.Z.; project administration, Q.L.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research was financially supported by the Science and Technology Development Plan Project of Jilin Province, China (No. 20240101344JC, 20230203098SF) and the Zhongshan Public Welfare Science and Technology Research Project (No. 2023B2015).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our code is available at: https://github.com/ritajin6/pvm-net/tree/main, accessed on 19 December 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ding, R.; Yang, J.; Xue, C.; Zhang, W.; Bai, S.; Qi, X. Lowis3d: Language-driven open-world instance-level 3D scene understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8517–8533. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, T.; Gan, V.J.L. Multi-view stereo for weakly textured indoor 3D reconstruction. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 1469–1489. [Google Scholar] [CrossRef]
  3. Li, S.; Li, H. Regional-to-local point-voxel transformer for large-scale indoor 3D point cloud semantic segmentation. Remote Sens. 2023, 15, 4832. [Google Scholar] [CrossRef]
  4. Jhaldiyal, A.; Chaudhary, N. Semantic segmentation of 3D lidar data using deep learning: A review of projection-based methods. Appl. Intell. 2023, 53, 6844–6855. [Google Scholar] [CrossRef]
  5. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3D deep learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 965–975. [Google Scholar]
  6. Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D Data in Indoor Environments. arXiv 2017, arXiv:1709.06158. [Google Scholar]
  7. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 5105–5114. [Google Scholar]
  8. Jiang, M.; Wu, Y.; Zhao, T.; Zhao, Z.; Lu, C. Pointsift: A sift-like network module for 3D point cloud semantic segmentation. arXiv 2018, arXiv:1807.00652. [Google Scholar]
  9. Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching efficient 3D architectures with sparse point-voxel convolution. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 685–702. [Google Scholar]
  10. Xu, M.; Ding, R.; Zhao, H.; Qi, X. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3172–3181. [Google Scholar] [CrossRef]
  11. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point transformer v2: Grouped vector attention and partition-based pooling. Adv. Neural Inf. Process. Syst. 2022, 35, 33330–33342. [Google Scholar]
  12. Wu, P.; Gu, L.; Yan, X.; Xie, H.; Wang, F.L.; Cheng, G.; Wei, M. PV-RCNN++: Semantical point-voxel feature interaction for 3D object detection. Vis. Comput. 2023, 39, 2425–2440. [Google Scholar] [CrossRef]
  13. Park, J.; Kim, C.; Kim, S.; Jo, K. PCSCNet: Fast 3D semantic segmentation of LiDAR point cloud for autonomous car using point convolution and sparse convolution network. Expert Syst. Appl. 2023, 212, 118815. [Google Scholar] [CrossRef]
  14. Agathos, A.; Azariadis, P. Optimal Point-to-Point geodesic path generation on point clouds. Comput.-Aided Des. 2023, 162, 103552. [Google Scholar] [CrossRef]
  15. Shao, Y.; Chen, J.; Gu, X.; Lu, J.; Du, S. A novel curved surface profile monitoring approach based on geometrical-spatial joint feature. J. Intell. Manuf. 2024, 1–23. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Chen, J.; Ma, X.; Wang, G.; Bhatti, U.A.; Huang, M. Interactive medical image annotation using improved Attention U-net with compound geodesic distance. Expert. Syst. Appl. 2024, 237, 121282. [Google Scholar] [CrossRef]
  17. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. Segcloud: Semantic segmentation of 3D point clouds. In Proceedings of the International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 537–547. [Google Scholar] [CrossRef]
  18. Schult, J.; Engelmann, F.; Kontogianni, T.; Leibe, B. Dualconvmesh-net: Joint geodesic and euclidean convolutions on 3D meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8609–8619. [Google Scholar] [CrossRef]
  19. Hu, Z.; Bai, X.; Shang, J.; Zhang, R.; Dong, J.; Wang, X.; Sun, G.; Fu, H.; Tai, C.L. Vmnet: Voxel-mesh network for geodesic-aware 3D semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15468–15478. [Google Scholar] [CrossRef]
  20. Graham, B.; Engelcke, M.; Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar] [CrossRef]
  21. Maturana, D.; Scherer, S. Voxnet: A 3D convolutional neural network for real-time object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Hamburg, Germany, 28 September–3 October 2015; Volume 2015, pp. 922–928. [Google Scholar] [CrossRef]
  22. Choy, C.; Gwak, J.; Savarese, S. 4D spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3070–3079. [Google Scholar] [CrossRef]
  23. Tang, H.; Liu, Z.; Li, X.; Lin, Y.; Han, S. TorchSparse: Efficient point cloud inference engine. Proc. Mach. Learn. Syst. 2022, 4, 302–315. [Google Scholar]
  24. Yang, H.; Huang, S.; Wang, R. Efficient roof vertex clustering for wireframe simplification based on the extended multiclass twin support vector machine. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6501405. [Google Scholar] [CrossRef]
  25. Li, J.; Chen, D.; Hu, F.; Wang, Y.; Li, P.; Peethambaran, J. Shape-preserving mesh decimation for 3D building modeling. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103623. [Google Scholar] [CrossRef]
  26. Eldar, Y.C.; Bolcskei, H. Block-sparsity: Coherence and efficient recovery. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; Volume 2009, pp. 2885–2888. [Google Scholar] [CrossRef]
  27. Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; Cohen-Or, D. Meshcnn: A network with an edge. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  28. Shi, Y.; Huang, Z.; Feng, S.; Zhong, H.; Wang, W.; Sun, Y. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv 2020, arXiv:2009.03509. [Google Scholar]
  29. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Niessner, M. Scannet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443. [Google Scholar] [CrossRef]
  30. Dai, A.; Nießner, M. 3dmv: Joint 3D-multi-view prediction for 3D semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 458–474. [Google Scholar] [CrossRef]
  31. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Montréal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 828–838. [Google Scholar]
  32. Wang, J.; Sun, B.; Lu, Y. Mvpnet: Multi-view point regression networks for 3D object reconstruction from a single image. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8949–8956. [Google Scholar] [CrossRef]
  33. Wu, W.; Qi, Z.; Fuxin, L. Pointconv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9613–9622. [Google Scholar] [CrossRef]
  34. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6410–6419. [Google Scholar] [CrossRef]
  35. Lei, H.; Akhtar, N.; Mian, A. Spherical kernel for efficient graph convolution on 3D point clouds. arXiv 2019, arXiv:1909.09287. [Google Scholar]
  36. Gong, J.; Xu, J.; Tan, X.; Song, H.; Qu, Y.; Xie, Y.; Ma, L. Omni-supervised point cloud segmentation via gradual receptive field component reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11673–11682. [Google Scholar]
  37. Yue, G.; Xiao, R.; Zhao, Z.; Li, C. AF-GCN: Attribute-fusing graph convolution network for recommendation. IEEE Trans. Big Data 2022, 9, 597–607. [Google Scholar] [CrossRef]
  38. Wang, C.; Jiang, L.; Wu, X.; Tian, Z.; Peng, B.; Zhao, H.; Jia, J. Groupcontrast: Semantic-aware self-supervised representation learning for 3D understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4917–4928. [Google Scholar] [CrossRef]
  39. Wu, X.; Jiang, L.; Wang, P.S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point transformer V3: Simpler faster stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4840–4851. [Google Scholar] [CrossRef]
  40. Su, H.; Jampani, V.; Sun, D.; Maji, S.; Kalogerakis, E.; Yang, M.H.; Kautz, J. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2530–2539. [Google Scholar] [CrossRef]
  41. Dai, A.; Ritchie, D.; Bokeloh, M.; Reed, S.; Sturm, J.; Niessner, M. Scancomplete: Large-scale scene completion and semantic segmentation for 3D scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4578–4587. [Google Scholar] [CrossRef]
  42. Tatarchenko, M.; Park, J.; Koltun, V.; Zhou, Q.Y. Tangent convolutions for dense prediction in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3887–3896. [Google Scholar] [CrossRef]
Figure 1. The yellow point on the curtain serves as the focal point for all color shades that represent the distance within the neighborhood. The shades of color represent the Euclidean distance between each point and the focal point in point cloud data (blue), shown in (a). The shades of color represent the path length between each point and the focal point (red) in 3D mesh data, shown in (b). The data structures of each are displayed in the yellow bounding boxes.
Figure 2. EGNet architecture. In the Euclidean branch, we use a feature extractor similar to U-Net structure for extracting Euclidean features from voxels to capture fine features. Inspired by PointNet++ structure, we incorporate a point-based MLP into the Euclidean branch. In the geodesic branch, the self-domain attention module is used to effectively aggregate the vertices of the original mesh. The features of the mesh vertices are fused with the features of sparse vertices from Euclidean branch using the cross-domain attention module.
Figure 3. Mesh simplification. Mesh_l0 to Mesh_l3 is part of the mesh simplification process, with the yellow label indicating the trajectory map of a point from Mesh_l0 to Mesh_l1.
Figure 4. Results of the ScanNet v2 validation. We have highlighted the main differences with yellow bounding boxes. Observing the segmentation results of the door in the first instance, our method demonstrates more accurate boundary segmentation. In the second instance, despite the poor quality of the environmental scan, although the MinkowskiNet method also identified the unannotated bed, a closer inspection reveals that our method provides a clearer segmentation boundary between the bed and the desk.
Figure 5. Visualization of Matterport3D. We have highlighted the main differences with yellow bounding boxes. In the first example, our method shows almost no errors compared to the ground-truth labels and achieves more accurate segmentation regions than other methods. Additionally, our method successfully identifies the flowerpot (other furniture) on the table. In the second example, despite errors in the ground-truth labels, our method achieves more accurate target classification compared to other methods.
Figure 6. Detailed local regions from the ScanNetV2 validation set, with key differences marked by yellow bounding boxes. The first example focuses on a kitchen corner, where EGNet produces smoother segmentation results compared to the ground-truth annotations. The second and third examples depict bedroom scenes, for which our method successfully distinguishes cabinets from sofas even in areas with blurred boundaries. Notably, in the third example, there was a significant improvement in door recognition, and our method correctly classifies a shelf that had been mislabeled as part of the floor.
Figure 7. Visualization results of the ablation study. The data in the third column show that the accuracy of edge segmentation is significantly improved with the introduction of geodesic branching. The effectiveness of our proposed cross-domain attention module is demonstrated by comparing the results of the data in the fourth and fifth columns.
Table 1. Results of semantic segmentation on the ScanNet v2 dataset.
Method                        Year    Val     Test
PointNet++ [7]                2017    53.5    55.7
3DMV [30]                     2018    -       48.4
PointCNN [31]                 2018    -       45.8
SparseConvNet [20]            2018    69.3    72.5
MVPNet [32]                   2019    -       64.1
PointConv [33]                2019    61.0    66.6
KPConv [34]                   2019    69.2    68.6
MinkowskiNet [22]             2019    72.2    73.6
SPH3D-GCN [35]                2019    -       61.8
DCM-Net [18]                  2020    -       65.8
RFCR+KPConv [36]              2021    -       70.3
PointTransformer V2 [11]      2022    75.4    74.2
AF-GCN [37]                   2023    73.4    71.8
GroupContrast [38]            2024    75.7    -
PointTransformer V3 [39]      2024    77.5    77.9
Ours                          2024    73.3    74.1
Table 2. Results of instance segmentation on the ScanNet v2 dataset.
Method                        mAP@25    mAP@50    mAP
MinkowskiNet [22]             72.8      56.9      36.0
PointTransformer V2 [11]      76.3      60.0      38.3
Ours                          76.1      60.7      40.2
Table 3. Matterport3D test category results.
Category         PointNet++ [7]  SplatNet [40]  ScanComplete [41]  TangentConv [42]  3DMV [30]  DCM-Net [18]  VMNet [19]  Ours
omAcc            43.8            26.7           44.9               46.8              56.1       66.2          67.2        67.5
Wall             80.1            90.8           79.0               56.0              79.6       78.4          85.9        84.6
Floor            81.3            95.7           95.9               87.7              95.5       93.6          94.4        93.8
Cab              34.1            30.3           31.9               41.5              59.7       64.5          56.2        56.1
Bed              71.8            19.9           70.4               73.6              82.3       89.5          89.5        89.5
Chair            59.7            77.6           68.7               60.7              70.5       70.0          83.7        83.5
Sofa             63.5            36.9           41.4               69.3              73.3       85.3          70.0        69.7
Table            58.1            19.8           35.1               38.1              48.5       46.1          54.0        53.2
Door             49.6            33.6           32.0               55.0              64.3       81.3          76.7        75.4
Wind             28.7            15.8           37.5               30.7              55.7       63.4          63.2        63.1
Bookshelf        1.1             15.7           17.5               33.9              8.3        43.7          44.6        45.0
Image            34.3            0              27.0               50.6              55.4       73.2          72.1        72.0
Counter          10.1            0              37.2               38.4              34.8       39.9          29.1        30.2
Desk             0               0              11.8               19.7              2.4        47.9          38.4        48.3
Window           68.8            12.3           50.4               48.0              80.1       60.3          79.7        67.7
Ceiling          79.3            75.7           97.6               45.1              94.8       89.3          94.5        82.0
Refrigerator     0               0              0.1                22.6              4.7        65.8          47.6        64.7
Bathtub          29.0            0              15.7               35.9              54.0       43.7          80.1        83.5
Toilet           70.4            10.4           74.9               50.7              71.1       86.0          85.0        77.5
Sink             29.4            4.1            44.4               49.3              47.5       49.6          49.2        51.0
Shower           62.1            20.3           53.5               56.4              76.7       87.5          88.0        83.4
Other Furniture  8.5             1.7            21.8               16.6              19.9       31.1          29.0        43.4
The bold ones are the best results, and the underlined ones are the second-best results.
Table 4. Ablation study for models with different inputs.
Model        Point    Self-Domain    Cross-Domain    mIoU    mAcc    OA
Euc Only     -        -              -               68.3    77.2    88.6
Euc Point    ✓        -              -               70.1    78.6    89.0
Geo Only     -        -              -               69.6    77.9    88.9
Geo Self     -        ✓              -               70.9    79.3    89.4
Ours         ✓        ✓              ✓               73.3    80.4    90.6
