Article

GMAFNet: Gated Mechanism Adaptive Fusion Network for 3D Semantic Segmentation of LiDAR Point Clouds

Xiangbin Kong, Weijun Wu, Minghu Wu, Zhihang Gui, Zhe Luo and Chuyu Miao
1 School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068, China
2 Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4917; https://doi.org/10.3390/electronics14244917
Submission received: 26 October 2025 / Revised: 27 November 2025 / Accepted: 6 December 2025 / Published: 15 December 2025

Abstract

Three-dimensional semantic segmentation plays a crucial role in advancing scene understanding in fields such as autonomous driving, drones, and robotic applications. Existing studies usually improve prediction accuracy by fusing data from vehicle-mounted cameras and vehicle-mounted LiDAR. However, current semantic segmentation methods face two main challenges: first, they often directly fuse 2D and 3D features, leading to the problem of information redundancy in the fusion process; second, there are often issues of image feature loss and missing point cloud geometric information in the feature extraction stage. From the perspective of multimodal fusion, this paper proposes a point cloud semantic segmentation method based on a multimodal gated attention mechanism. The method comprises a feature extraction network and a gated attention fusion and segmentation network. The feature extraction network utilizes a 2D image feature extraction structure and a 3D point cloud feature extraction structure to extract RGB image features and point cloud features, respectively. Through feature extraction and global feature supplementation, it effectively mitigates the issues of fine-grained image feature loss and point cloud geometric structure deficiency. The gated attention fusion and segmentation network increases the network’s attention to important categories such as vehicles and pedestrians through an attention mechanism and then uses a dynamic gated attention mechanism to control the respective weights of 2D and 3D features in the fusion process, enabling it to solve the problem of information redundancy in feature fusion. Finally, a 3D decoder is used for point cloud semantic segmentation. Experiments are conducted on the large-scale SemanticKITTI and nuScenes point cloud datasets.

1. Introduction

Multimodal feature fusion has emerged as a cornerstone technique for point cloud semantic segmentation, attracting intensive research interest within the point cloud processing community. Its goal is to fuse feature information from point clouds and other modalities to predict the semantic category of each point in the point cloud data. Due to its excellent performance, it is widely applied in numerous fields. In the field of autonomous driving [1,2,3,4,5,6], vehicles acquire multimodal data through sensors such as LiDAR and RGB cameras. They process and fuse this multimodal data to accurately recognize and understand the surrounding environment, and perform actions like speed adjustment or evasion of pedestrians and vehicles in advance during driving to ensure driving safety. In the field of robotics [7,8,9], robots integrate information obtained from multiple different modal sensors to construct 3D maps of the environment. These maps contain spatial location and object semantic information, enabling robots to accurately identify and avoid obstacles. In the field of satellite remote sensing [10,11,12], remote sensing sensors mounted on satellites capture multimodal remote sensing images. By combining information from different modalities, ground covers such as forests, buildings, and croplands can be distinguished, which plays a crucial role in environmental pollutant monitoring, disaster assessment, and emergency response. Therefore, multimodal feature fusion-based point cloud semantic segmentation has become a current research focus.
Despite the significant progress in multimodal feature fusion-based point cloud semantic segmentation, existing methods still face several challenges. First, current semantic segmentation methods often directly fuse 2D and 3D features, leading to information redundancy in the fusion process. Second, there are issues of image feature loss and missing point cloud geometric information in the feature extraction stage. For example, 2D feature extraction networks often fail to retain fine-grained texture and edge information, while 3D point cloud feature extraction networks tend to lose geometric structure information. In complex scenes, due to the aforementioned limitations, small-scale targets (e.g., pedestrians and vehicles) are difficult to segment accurately.
To address these challenges, this paper proposes a multimodal gated mechanism point cloud semantic segmentation network, GMAFNet (as shown in Figure 1). The network consists of a multimodal feature extraction network and a gated attention fusion and segmentation network. The multimodal feature extraction network leverages existing architectures: the 2D feature extraction network uses the RepGhost module, and the 3D feature extraction network uses the PV-RCNN framework. These architectures are chosen for their efficiency and effectiveness in extracting rich and accurate features from RGB images and point clouds, respectively.
Specifically, the RepGhost module is utilized to extract appearance and texture features from RGB images. It employs feature reuse and regularization techniques to reduce the number of model parameters and alleviate overfitting, thereby enhancing the network’s ability to extract 2D features and reducing feature information loss. The PV-RCNN framework is employed for 3D feature extraction. It combines the advantages of 3D voxel convolution and PointNet to achieve efficient point cloud processing. It uses voxel-to-key point scene encoding and key point-to-grid RoI feature abstraction to capture rich contextual information and improve localization accuracy.
The gated attention fusion and segmentation network is the core contribution of this paper. It introduces a dynamic gated attention mechanism to control the weights of 2D and 3D features during the fusion process, thereby reducing information redundancy and enhancing the network’s focus on important information such as vehicles and pedestrians. This mechanism effectively improves the accuracy of point cloud semantic segmentation.
In summary, the main contributions of this paper are as follows:
  • We propose a dynamic gated mechanism semantic segmentation network (GMAFNet) that fuses 2D image features and LiDAR point cloud features. By utilizing a dynamic gated mechanism, the network enhances attention to important information such as vehicles and pedestrians, reduces issues like information redundancy during the fusion process, and improves the overall performance of the system.
  • We introduce the 2D feature extraction network (RepGhost) and the 3D feature extraction network (PV-RCNN) to extract multi-scale features. This approach provides richer and more accurate feature representations for subsequent semantic segmentation tasks.
  • Through experiments on multiple datasets (such as SemanticKITTI and nuScenes), we demonstrate that GMAFNet significantly improves matching accuracy, robustness, and generalization ability compared with existing advanced methods.

2. Related Work

Multimodal feature fusion-based point cloud semantic segmentation methods enhance the accuracy of point cloud semantic segmentation by integrating information from multiple modalities such as point clouds and RGB images, and leveraging the complementarity of information across each modality. RGB semantic segmentation sits at the core of this pipeline: its objective is to classify every pixel into a semantic class, delivering dense 2D labels. The seminal Fully Convolutional Network (FCN) [13] first cast this problem as an end-to-end dense prediction task, converting image-classification backbones into pixel-wise predictors. Subsequent work has systematically enhanced FCN by fusing multi-scale cues—via pyramid pooling or cross-scale attention—an avenue that has repeatedly pushed accuracy upward [14]. In the realm of depth cameras, Fang, K. et al. have made remarkable contributions [15]. LiDAR, as a powerful sensor, plays a critical role in machine vision, particularly in addressing complex scenarios. Among relevant advancements, PointNet [16] pioneered the application of Multi-Layer Perceptrons (MLPs) for processing raw 3D point clouds in semantic segmentation tasks. PointNet++ [17] further introduced a multi-scale sampling mechanism for the aggregation of global and local features.
Multi-sensor approaches aim to synergistically integrate information from two complementary sensors while leveraging the inherent capabilities of both cameras and LiDAR [18]. The PointPainting method [19] projects RGB images into the point cloud space via spherical projection, thereby enhancing the performance of semantic segmentation networks. Meyer et al. [20] employed CNN backbones to extract features from both modalities, which resulted in improved segmentation accuracy for distant objects and small-sized objects. The RangeLVDet network [21] utilizes pre-trained models to extract semantic features from RGB images and designs a 2D CNN to capture geometric features from range views; these two types of features are then fused from the perspective of point clouds. Panoptic-FusionNet [22] enhances feature maps through a dedicated feature fusion module, which ensures precise geometric alignment of features extracted from the two aforementioned sensors. The authors construct look-up tables that enable accurate alignment of point-voxel-pixel correspondences across multiple scales, and feature map fusion is achieved by querying these tables. MFSA-Net [23] is a semantic segmentation network based on camera-LiDAR cross-attention fusion with fast neighbor feature aggregation, which is particularly well suited for large-scale semantic segmentation tasks in complex environments. Recently, several new neural network models have been proposed for the semantic segmentation of fused point clouds. These methods mostly utilize attention and other specialized mechanisms to improve feature extraction. For example, the adaptive point-pixel fusion network (APPFNet) proposed by Wu et al. in 2024 reports higher mIoU on the SemanticKITTI and nuScenes datasets through neighborhood feature aggregation [24]. In 2025, Bi et al. proposed a multi-scale sparse convolution and point convolution adaptive fusion method for point cloud semantic segmentation [25].
Despite the significant progress made in these studies, fused point cloud semantic segmentation still faces some challenges. In the 2D feature extraction stage, existing methods often employ downsampling or pooling operations when processing high-resolution images, leading to the neglect or loss of some detailed features in the images, resulting in the issue of fine-grained image feature loss. For example, when using traditional convolutional neural networks (CNNs) for feature extraction, the spatial resolution of the image gradually decreases with the increase in the number of network layers, and some local textures and edge information may not be effectively retained. In the 3D feature extraction stage, existing methods may fail to fully capture the geometric structural information of point clouds, such as local shapes and curvatures. This is because point cloud data are sparse and irregular, making it difficult to directly apply traditional grid-based methods to point cloud feature extraction.
Compared to existing solutions, the proposed “multimodal feature extraction network” in this paper has the following fundamental differences: First, in 2D feature extraction, the RepGhost module is adopted, which utilizes feature reuse and regularization techniques to reduce the number of model parameters, mitigate overfitting, and enhance the network’s ability to extract 2D features, thereby reducing feature information loss. Second, in 3D feature extraction, a combination of voxelization and farthest point sampling strategies is used, where the global geometric features obtained from the farthest point sampling branch are used to supplement the features obtained from the voxel branch, addressing the issue of lost edge features in point clouds. Moreover, the gated attention mechanism used in this paper differs from those in related works. While attention mechanisms in related works mainly focus on enhancing the feature representation of specific modalities, the dynamic gated attention fusion module in this paper not only enhances the network’s attention to important categories (such as vehicles and pedestrians) but also effectively addresses the problem of redundant information in feature fusion by dynamically adjusting the weights of 2D and 3D features during the fusion process, thereby improving the accuracy of point cloud semantic segmentation.

3. Method

This section provides an overview of the technical workflow of GMAFNet and elaborates on its core modules in detail. Section 3.1 discusses the input preprocessing, followed by Section 3.2, which provides an overview of the network’s overall architecture. The latter is further broken down into three core components: 2D feature extraction (Section 3.3), 3D feature extraction (Section 3.4), and the dynamic gated attention fusion module (Section 3.5).

3.1. Input Processing

Following prevailing multi-sensor fusion practice, we map the 3D points to the image plane through perspective projection [26]. As shown in Figure 2, perspective projection maps points in 3D space to a 2D plane by simulating the imaging principle of human eyes or cameras. Its core idea is as follows: objects closer to the observer appear larger, while those farther away appear smaller. This projection method can generate a strong sense of depth, making the 2D image look more realistic and natural. Specifically, given a LiDAR point cloud $P = \{ p_i \}_{i=1}^{N} \in \mathbb{R}^{N \times 3}$, each 3D point $p_i = (x_i, y_i, z_i) \in \mathbb{R}^3$ is projected to a point $\hat{p}_i = (u_i, v_i) \in \mathbb{R}^2$ on the image plane. The mapping relationship between the LiDAR point $p_i$ and the planar point $\hat{p}_i$ is as follows:
$[u_i, v_i, 1]^T = \frac{1}{z_i} \, K \, T \, [x_i, y_i, z_i, 1]^T$  (1)
Here, $K \in \mathbb{R}^{3 \times 4}$ and $T \in \mathbb{R}^{4 \times 4}$ represent the intrinsic and extrinsic matrices of the camera, respectively, and $z_i$ denotes the depth of the point in the camera coordinate system. Both $K$ and $T$ are directly provided in KITTI.
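For clarity, the following is a minimal NumPy sketch of the projection in Equation (1), assuming an N × 3 point array and KITTI-style $K$ (3 × 4) and $T$ (4 × 4) matrices; the function name is illustrative, and filtering of points behind the camera is omitted here but would normally be applied.

```python
# Sketch of Equation (1): project LiDAR points onto the image plane.
import numpy as np

def project_points(points, K, T):
    """points: (N, 3) LiDAR points; K: (3, 4) intrinsics; T: (4, 4) extrinsics.
    Returns (N, 2) pixel coordinates and (N,) depths in the camera frame."""
    N = points.shape[0]
    homo = np.hstack([points, np.ones((N, 1))])   # homogeneous coordinates, (N, 4)
    cam = T @ homo.T                              # points in the camera coordinate system, (4, N)
    uvw = K @ cam                                 # projected homogeneous pixel coordinates, (3, N)
    z = uvw[2]                                    # depth z_i of each point
    uv = (uvw[:2] / z).T                          # divide by depth -> (N, 2) pixel coordinates (u_i, v_i)
    return uv, z
```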

3.2. Network Overview

GMAFNet adopts 2DPASS [1] as its basic architecture. 2DPASS effectively integrates information from 2D images into the semantic segmentation of 3D point clouds through knowledge distillation. In the inference phase, the 3D branch of 2DPASS can operate independently without relying on 2D images, which endows it with high inference efficiency in practical applications. This method can reduce the dependence on 3D annotated data to a certain extent; by leveraging the semantic information of 2D images to assist the segmentation of 3D point clouds, it is particularly suitable for tasks involving the fusion of 2D images and 3D point clouds. Our goal is to improve prediction accuracy by integrating information from surrounding point clouds.
The multimodal design of 2DPASS [1] confirms that 2D imagery can indeed reinforce sparse 3D features, an observation that aligns with our goal of boosting accuracy in distant, LiDAR-scarce regions. Nevertheless, the pipeline still falls short in two respects. First, the direct concatenation fusion in 2DPASS ignores information redundancy. The unfiltered redundant features carried through fusion contain a large amount of random noise from the dataset (such as sensor errors and annotation deviations); this noise is propagated repeatedly along with the redundant information and is mistakenly treated by the model as effective features, so the model learns dataset noise rather than the true patterns, which in turn degrades its accuracy.
Second, although sparse convolution can efficiently process point cloud data, the detection of small targets remains a challenge in long-range target detection. Small targets contain only a small number of sampling points in the point cloud, so sparse convolution methods are prone to missed detections or false judgments during the detection process. To address these challenges, we propose an optimized multimodal gated attention mechanism for point cloud semantic segmentation.
Figure 3 illustrates that GMAFNet proceeds through four stages: data preprocessing and augmentation, 3D encoding and 2D encoding, decoding, and classifying. In the data preprocessing stage, this study employs the perspective projection method to map 3D point cloud data onto the 2D image plane. By virtue of perspective projection, the positional differences between 2D image coordinates and their corresponding 3D points can be determined. Subsequently, this difference is utilized to establish the correspondence between image pixels and point cloud points, thereby endowing the 2D image with spatial positional information derived from the 3D point cloud. The image coordinates $(u_i, v_i)$ are obtained through Equation (1), and these coordinates are then used to retrieve the corresponding depth $z_i$ from the point cloud, ultimately establishing the mapping relationship between the image and the point cloud for subsequent fusion. Additionally, based on the experimental conclusions of 2DPASS [1], images are randomly cropped into $I_c \in \mathbb{R}^{3 \times 480 \times 320}$. Trimming the images to these dimensions markedly lowers computational overhead and speeds up training while incurring only a negligible drop in accuracy. In the 3D encoder and 2D encoder stage, the 2D encoder adopts the RepGhost lightweight network [27] to extract appearance and texture features from RGB images. Specifically, the cropped patches are fed into the RepGhost encoder, where the lightweight backbone extracts multi-scale 2D features denoted $F_{2D}$. The 3D encoder is responsible for extracting point cloud features. Specifically, the 3D encoder is based on a voxelization strategy and a farthest point sampling strategy, and uses the global geometric features obtained by the farthest point sampling branch to supplement the features obtained by the voxel branch, alleviating the problem of missing edge features of point clouds. The voxelized point cloud undergoes four 3D sparse convolution stages to obtain the point cloud features $F_{3D}^{S}$. After each 3D sparse convolution operation, the VSA module is used to extract key feature points, which are then integrated into the farthest-sampled points to obtain the key point features $F_{3D}^{V}$. The point cloud features $F_{3D}$ are obtained through fusion via the RoI-grid pooling module [28]. Finally, the dynamic gated attention fusion module is used to merge $F_{2D}$ and $F_{3D}$ to generate the fused features $F_{fusion}$. In the decoder and classifier stage, the fused features $F_{fusion}$ at various scales are upsampled. The decoded features are subsequently processed by a multi-scale fusion layer followed by an MLP classifier to produce the final semantic predictions.

3.3. RepGhost Module

RGB images contain rich visual information, such as the color, edge details, and texture of objects. The 2DPASS model adopts the ResNet network for 2D feature extraction; however, this network suffers from issues including a large number of parameters and proneness to overfitting. To address these issues, this paper employs the RepGhost module [27], as shown in Figure 4a, which utilizes feature reuse and regularization techniques to reduce the number of model parameters, alleviate overfitting, enhance the network’s ability to extract 2D features, and reduce the loss of feature information. Compared with the Ghost module, the RepGhost module achieves efficient feature reuse and lightweight design through structural reparameterization technology, significantly reducing the amount of computation and parameters while maintaining efficient feature extraction capability. Specifically, the RepGhost module removes traditional concatenation operations and replaces them with addition operations. It shifts the ReLU activation function from after the depthwise convolution layer to after the addition operation, and introduces a batch normalization (BN) layer in the identity mapping branch to provide nonlinear features during training. As shown in Figure 4a, the input RGB image $I_c \in \mathbb{R}^{3 \times 480 \times 320}$ is processed through a 1 × 1 convolution layer to obtain features of size 7 × 480 × 320 and then undergoes four RepGhost modules for feature extraction to generate the 2D features $F_{2D}$.
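As a concrete illustration of this structure, the following PyTorch sketch shows a RepGhost-style block with add-based fusion, a BN-only identity branch, and ReLU applied after the addition; the channel sizes and layer choices are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a RepGhost-style block (training-time form).
import torch
import torch.nn as nn

class RepGhostBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # primary 1x1 convolution branch
        self.primary = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))
        # "cheap" depthwise convolution branch (feature reuse)
        self.cheap = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1,
                                             groups=channels, bias=False),
                                   nn.BatchNorm2d(channels))
        # BN-only identity branch, as described above
        self.identity = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.primary(x)
        # fusion by addition instead of concatenation; ReLU moved after the addition
        return self.relu(self.cheap(y) + self.identity(y))
```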

3.4. PV-RCNN 3D Feature Extraction Framework

Point cloud data exhibit an uneven distribution of points, with higher point density in some regions and lower density in others. This uneven distribution gives point cloud data an overall sparse nature. Such a characteristic leads to indistinct local features and poor performance in feature extraction.
The PV-RCNN framework [28] achieves efficient point cloud processing by combining the advantages of 3D voxel convolution and PointNet [16]. This framework consists of two main steps: voxel-to-key point scene encoding and key point-to-grid RoI feature abstraction. Compared with voxel-based methods, it has higher localization accuracy. Compared with point-based methods, it has a better ability to capture contextual information.
Specifically, the original point cloud is voxelized by establishing a 3D Cartesian coordinate system to subdivide the 3D space into equally spaced voxel blocks, as shown in Figure 5. Assuming the voxelization resolution is $r_x, r_y, r_z$, for a point $p_i = (x_i, y_i, z_i)$, the corresponding voxel index can be obtained through Equation (2):
$V_i = \left( \mathrm{int}\!\left(\frac{x_i}{r_x}\right),\ \mathrm{int}\!\left(\frac{y_i}{r_y}\right),\ \mathrm{int}\!\left(\frac{z_i}{r_z}\right) \right)$  (2)
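A short NumPy sketch of this voxel-index computation is given below; the function name and the use of floor rounding (in place of the truncation written in Equation (2)) are illustrative assumptions.

```python
# Sketch of Equation (2): map each point to an integer voxel index.
import numpy as np

def voxel_indices(points, r_x, r_y, r_z):
    """points: (N, 3) array; r_x, r_y, r_z: voxel sizes along each axis.
    Returns (N, 3) integer voxel indices."""
    sizes = np.array([r_x, r_y, r_z])
    return np.floor(points / sizes).astype(np.int64)   # per-axis index V_i
```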
The sparse convolutional network is employed for multiple rounds of feature extraction and downsampling to obtain the feature $F_{3D}^{S}$. The framework we adopt first aggregates the voxels from multiple neural layers representing the entire scene into a small number of key points, which serve as a bridge between the 3D voxel CNN feature encoder and the proposal refinement network. Specifically, we use the Farthest Point Sampling (FPS) algorithm to sample a small set of key points $\{ p_1, \ldots, p_n \}$ from the point cloud $P$, as given by Equations (3) and (4):
$d(q, S) = \min_{p \in S} \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}$  (3)
$p_i = \arg\max_{p \in P \setminus S_{i-1}} d(p, S_{i-1})$  (4)
Here, $d(q, S)$ denotes the distance from point $q$ to the point set $S$, $S_{i-1}$ represents the set of the first $i-1$ key points that have already been selected, and $P \setminus S_{i-1}$ indicates the remaining points that have not yet been selected. The shortest distance to any point in $S$ is taken as the distance from point $q$ to the point set $S$. The point $p_i$ that is farthest from the already-selected set $S_{i-1}$ is selected and designated as a key point, which helps to better preserve the global structure of the point cloud. For the KITTI dataset, $n = 2048$. This strategy encourages the key points to be evenly distributed around non-empty voxels and to represent the entire scene. The Voxel Set Abstraction (VSA) module is used at each layer of feature extraction in the sparse convolutional network to extract multi-scale features, generating multi-scale semantic features. The feature $F_{3D}^{V}$ of each key point is composed of voxel features from different layers and the original point cloud features.
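The following compact NumPy implementation of farthest point sampling follows Equations (3) and (4); it is an illustrative sketch rather than the authors' code (for SemanticKITTI, n_samples would be 2048 as noted above).

```python
# Sketch of farthest point sampling (Equations (3)-(4)).
import numpy as np

def farthest_point_sampling(points, n_samples):
    """points: (N, 3) array; returns (n_samples, 3) key points spread over the cloud."""
    N = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(N, np.inf)                 # d(q, S): distance of each point to the selected set
    selected[0] = 0                           # start from an arbitrary first point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.linalg.norm(diff, axis=1))   # update d(q, S_{i-1})
        selected[i] = int(np.argmax(dist))    # pick the point farthest from the selected set
    return points[selected]
```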
For the features $F_{3D}^{S}$ and $F_{3D}^{V}$, we adopt grid-based uniform sampling, where each key point aggregates the features of voxels within a certain distance around it. Each grid point contains multiple receptive fields to capture rich contextual information. Key point features are aggregated through multiple receptive fields to generate features for each grid point, and then the features of all grid points are combined to obtain the 3D feature $F_{3D}$, which is used for proposal confidence prediction and position refinement.

3.4.1. Voxel Set Abstraction Module

This module, referred to as VSA for short, is used to extract multi-scale semantic features from 3D convolution features. The aggregation method proposed in PointNet++ [17] is adopted here. However, in PointNet++, each key point aggregates the features of points within a certain distance in the surrounding original point cloud, whereas in VSA, each key point aggregates the features of voxels within a certain distance around it. After aggregation, the features are fed into a structure similar to PointNet [16] for feature extraction. It can be expressed by the following formula:
$F_{3D}^{V} = \max \left\{ G \left( M \left( S_i^{(l_k)} \right) \right) \right\}$  (5)
Here, M denotes the sampling and aggregation process (sampling is imposed by threshold constraints, i.e., the number of aggregated voxels is limited), and G represents the multi-layer perceptron for feature extraction. Similar to PointNet++ [17], we set multiple distances for the aforementioned VSA module in aggregating surrounding voxels. The module is applied across convolution layers at different levels, and finally, their features are concatenated. The features obtained here not only contain the information from the sparse convolution sampling process but also the information from local aggregation via PointNet. Due to the expanded receptive field, the position information of each key point itself is also preserved.
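To make the aggregation in Equation (5) concrete, the sketch below gathers the voxel features within a radius of one key point (the role of M), applies a small MLP (the role of G), and max-pools the result; the tensor shapes, radius handling, and MLP width are illustrative assumptions.

```python
# Sketch of VSA-style aggregation for a single key point (Equation (5)).
import torch
import torch.nn as nn

def vsa_aggregate(keypoint, voxel_centers, voxel_feats, mlp, radius):
    """keypoint: (3,); voxel_centers: (V, 3); voxel_feats: (V, C). Returns an aggregated feature."""
    d = torch.norm(voxel_centers - keypoint, dim=1)
    neigh = voxel_feats[d < radius]                     # M: sample voxel features within the radius
    if neigh.numel() == 0:
        return torch.zeros(mlp[-1].out_features)        # empty neighbourhood -> zero feature
    return mlp(neigh).max(dim=0).values                 # G (MLP) followed by max-pooling

# Example MLP with illustrative widths (input channel count C = 32).
mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))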

3.4.2. RoI-Grid Pooling via Set Abstraction

We employ the RoI-grid pooling module to integrate the key point features with the points within the RoI region. These features are utilized for subsequent object detection, confidence prediction, and semantic segmentation. This part is quite similar to the preceding VSA module. A grid of 6 × 6 × 6 = 216 points is selected from each proposal; these are referred to as grid points. Each grid point is treated as a center point, and the method of PointNet++ is utilized to aggregate the key point features within a certain surrounding distance (similarly, multiple distance thresholds are adopted to obtain multi-scale information). Meanwhile, coordinate information is concatenated to each key point feature, where the value of this coordinate information is the coordinate difference between the key point and the corresponding center point. This step can be expressed by the following formula:
$F_{3D} = \left\{ \left[ F_{3D}^{V} ;\; p_j - g_j \right]^T \;\middle|\; \left\| p_j - g_j \right\| < r,\; p_j \in K,\; F_{3D}^{V} \in \tilde{F} \right\}$  (6)
Here, $F_{3D}$ denotes the final set of generated features, where each element is a concatenated feature vector. $[ F_{3D}^{V} ; p_j - g_j ]^T$ represents the concatenation operation, which concatenates the key point feature $F_{3D}^{V}$ with the relative coordinate $p_j - g_j$ into a single vector. Here, $r$ is a distance threshold used to limit the RoI-grid pooling process, ensuring that only key points $p_j$ within a distance $r$ from the grid center $g_j$ are included in the feature aggregation range. The key point set $K$ contains 2048 key points. In Equation (6), $\tilde{F}$ denotes the set of key point features, with each key point having a corresponding feature vector. Then, the features obtained from the 216 center points are transformed through a two-layer MLP, and the resulting features represent the proposal. Benefiting from the rich information of key points, the feature information obtained from each proposal contains abundant contextual information and has a flexible receptive field.
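A minimal PyTorch sketch of the gather step in Equation (6) is shown below; the shapes and function name are illustrative, not the authors' implementation.

```python
# Sketch of the RoI-grid gather in Equation (6) for a single grid point g.
import torch

def roi_grid_gather(grid_point, keypoints, keypoint_feats, radius):
    """grid_point: (3,); keypoints: (K, 3); keypoint_feats: (K, C).
    Returns (M, C + 3): key-point features within radius r, concatenated with relative coordinates."""
    rel = keypoints - grid_point                     # p_j - g_j
    mask = torch.norm(rel, dim=1) < radius           # keep only key points within distance r
    return torch.cat([keypoint_feats[mask], rel[mask]], dim=1)
```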

3.5. Dynamic Gated Attention Fusion Module

The gating attention mechanism is used to control the inflow and outflow of information in the module. It scores each feature vector in the feature map, judges the validity of information based on the scores, aggregates effective information, and filters out redundant information.
There is abundant complementary information between RGB images and 3D point clouds. Direct feature fusion through feature concatenation cannot effectively utilize the complementary information between the two modalities; instead, it will generate redundant information and introduce noise, leading to poor performance in point cloud semantic segmentation. To address this issue, this paper proposes a dynamic gated attention fusion module, which employs a gated attention mechanism to filter out the complementary information from the two modalities and reduce information redundancy. Finally, cross-entropy loss is utilized, and the fused results are employed to guide the 3D network, thereby enhancing the performance of the 3D network and improving the accuracy of point cloud semantic segmentation. The specific structure is shown in Figure 6.
Specifically, the input 3D feature $F_{3D}$ and 2D feature $F_{2D}$ are concatenated along the feature dimension to obtain $F_l^{cat}$, as shown in Formula (7).
$F_l^{cat} = \mathrm{concat}(F_{2D}, F_{3D})$  (7)
The concatenated feature $F_l^{cat}$ undergoes feature integration and dimension transformation via an MLP, reducing its dimension to the same as that of $F_{2D}$ and $F_{3D}$. Then, the Sigmoid activation function is used to map all parameters in the feature matrix to the range [0, 1]: the closer a value is to 1, the more the 2D features are trusted; the closer it is to 0, the more the 3D features are trusted. The weight matrix Gate is thereby obtained, as shown in Formula (8).
$\mathrm{Gate} = \mathrm{Sigmoid}(\mathrm{MLP}(F_l^{cat}))$  (8)
The gating mechanism is then used to fuse the features $F_{2D}$ and $F_{3D}$. First, the obtained weight matrix Gate is multiplied element-wise with the 2D feature $F_{2D}$. Then, Gate is subtracted from an all-ones matrix, and the result is multiplied element-wise with $F_{3D}$. Finally, the two products are added to obtain the fused result $F_l^{fuse}$.
$F_l^{fuse} = F_{2D} \odot \mathrm{Gate} + F_{3D} \odot (1 - \mathrm{Gate})$  (9)
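The gating computation in Formulas (7)-(9) can be sketched in PyTorch as follows, assuming $F_{2D}$ and $F_{3D}$ share the same feature dimension and that the gating is applied element-wise; the layer widths are illustrative assumptions.

```python
# Minimal sketch of the dynamic gated fusion (Formulas (7)-(9)).
import torch
import torch.nn as nn

class DynamicGatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # MLP that maps the concatenated feature back to the common feature dimension
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f2d, f3d):
        gate = torch.sigmoid(self.mlp(torch.cat([f2d, f3d], dim=-1)))   # Formulas (7)-(8)
        return f2d * gate + f3d * (1.0 - gate)                          # Formula (9), per-element gating
```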
Two classifiers composed of linear layers are used to predict the category scores of the 3D feature $F_{3D}$ and the fused result $F_l^{fuse}$, obtaining $S_l^{3D}$ and $S_l^{2D3D}$. These are given by Formulas (10) and (11).
$S_l^{2D3D} = \mathrm{Linear}(F_l^{fuse})$  (10)
$S_l^{3D} = \mathrm{Linear}(F_{3D})$  (11)
The obtained scores $S_l^{2D3D}$ and $S_l^{3D}$ are used with the KL divergence as the distillation loss for knowledge distillation, and the whole network shares a total loss $L_{total} = L_{seg} + \lambda \cdot L_{xM}$. Here, $L_{seg}$ represents the cross-entropy loss between the segmentation head’s output $S_l^{2D3D}$ from $F_l^{fuse}$ and the ground truth, $L_{xM}$ denotes the distillation loss, $L_{total}$ indicates the total loss, and $\lambda$ is derived from the aforementioned Gate. This process is expressed by Formula (12).
$L_{xM} = D_{KL}\left( S_l^{2D3D} \,\|\, S_l^{3D} \right)$  (12)
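The combined objective can be sketched as follows; the scalar weight lam stands in for the gate-derived $\lambda$ mentioned above, and the logit shapes are illustrative assumptions.

```python
# Sketch of the total loss: cross-entropy segmentation loss plus KL-divergence distillation (Formula (12)).
import torch
import torch.nn.functional as F

def total_loss(s_fused, s_3d, labels, lam):
    """s_fused, s_3d: (N, num_classes) logits of the fused (teacher) and 3D-only (student) branches;
    labels: (N,) ground-truth class indices; lam: scalar weight for the distillation term."""
    l_seg = F.cross_entropy(s_fused, labels)
    # D_KL(teacher || student): target = fused scores, input = log-probabilities of the 3D branch
    l_xm = F.kl_div(F.log_softmax(s_3d, dim=-1),
                    F.softmax(s_fused, dim=-1),
                    reduction="batchmean")
    return l_seg + lam * l_xm
```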
Similar to 2DPASS, GMAFNet also employs a teacher-student model framework. During training, it leverages 2D image information to obtain more accurate fused features. However, during inference, the 2D network is entirely removed, and only the 3D network is utilized, while retaining the precision gains achieved during the fusion phase. The fused score $S_l^{2D3D}$ derived from multimodal fusion is taken as the result of the teacher model, while the 3D score $S_l^{3D}$ obtained solely from 3D features is treated as the result of the student model. Knowledge distillation is performed in this way to improve the performance of the 3D backbone.

4. Experimental Analysis

This section provides a comprehensive overview of the experimental results of GMAFNet, including the introduction to the datasets (Section 4.1), the experimental environment (Section 4.2), comparisons of experimental results among different networks (Section 4.3), and ablation studies (Section 4.4).

4.1. Datasets

This paper systematically evaluates GMAFNet on two publicly available datasets: SemanticKITTI [29] and nuScenes [30]. Released by the University of Bonn, SemanticKITTI is specifically designed for autonomous driving and robotics scenarios and provides about 43,000 LiDAR point cloud frames with point-wise annotations. Sequences 00–10 (approximately 21,000 frames) are used for training and validation, with sequence 08 typically held out for validation, while sequences 11–21 serve as the test set. The dataset covers 19 typical on-road categories (such as vehicles, pedestrians, and traffic signs), ensuring the standardization and reproducibility of the experiments. In contrast, the nuScenes dataset supplements multi-modal data, including a 360° LiDAR and six surround-view cameras. It comprises 1000 driving recordings, each lasting 20 s (roughly 5.5 h in total), offering more diverse scenarios (see Table 1). Through cross-validation on these two datasets, this paper fully verifies the stability of GMAFNet under various environmental conditions, providing empirical support for its deployment in real-world vehicles.

4.2. Training and Evaluation

Training and testing were conducted on an Nvidia RTX 4090. The experimental environment adopted is as follows: Python 3.8, PyTorch 1.8.1, CUDA 10.2, Ubuntu 20.04, Nvidia TITAN XP, and spconv 2.1.16.
GMAFNet employs the Stochastic Gradient Descent (SGD) optimizer for end-to-end single-stage training. During training, the batch size is set to 2, the initial learning rate is set to 0.24, and training is conducted for 40 epochs. To enhance the generalization ability of the model, scaling factors are randomly selected from a uniform distribution between 0.95 and 1.1 during training. To evaluate the performance of GMAFNet, the mean Intersection over Union (mIoU) is used as the primary evaluation metric. The mIoU is the average of the Intersection over Union (IoU) across all categories and provides a comprehensive reflection of the model’s overall performance in multi-class segmentation tasks.
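For reference, the mIoU used here can be computed from a per-class confusion matrix as in the following minimal sketch; the function name and matrix convention are illustrative, not the actual evaluation code.

```python
# Sketch of the mIoU metric: average per-class Intersection over Union.
import numpy as np

def mean_iou(conf_matrix):
    """conf_matrix: (C, C) counts with rows = ground truth, columns = prediction."""
    tp = np.diag(conf_matrix).astype(float)          # true positives per class
    fp = conf_matrix.sum(axis=0) - tp                # false positives per class
    fn = conf_matrix.sum(axis=1) - tp                # false negatives per class
    iou = tp / np.maximum(tp + fp + fn, 1)           # per-class IoU, avoiding division by zero
    return iou.mean()
```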

4.3. Comparative Results

For the SemanticKITTI dataset, we compare the proposed GMAFNet with RandLA-Net [31], RangeNet++ [32], PointTransformer [33], PolarNet [34], PTv2 [35], PointPainting [19], RGBAL [18], PMF [36], SqueezeSegV3 [37], SalsaNext [38], SPVNAS [39], APPFNet [24] and 2DPASS [1]. Quantitative results on SemanticKITTI are reported in Table 2.
The semantic segmentation results of the SemanticKITTI dataset using various algorithms are presented in Table 2. Under the same experimental environment, the proposed method achieves the highest mIoU compared with recent point cloud semantic segmentation methods, demonstrating the advantages of the semantic segmentation approach proposed in this paper. For categories such as trucks, motorcycles, and pedestrians, the proposed method exhibits higher segmentation accuracy, which is mainly attributed to improvements in both the feature extraction and feature fusion components. The 2D and 3D feature extraction networks reduce the loss of fine-grained image features and point cloud geometric representations during feature extraction, enhancing the capability to extract features of small independent objects such as vehicles and pedestrians. The cross-self-attention feature enhancement module and dynamic gated attention fusion module effectively control the weights of the two types of features in the fusion process, addressing the issue of feature redundancy. Consequently, the semantic segmentation accuracy of vehicles and pedestrians is improved. However, because the cross-self-attention feature enhancement module reduces attention to categories such as roads and vegetation, the accuracy of these categories is slightly lower compared with the 2DPASS model. Overall, the proposed method achieves a 1.7% improvement in mIoU compared with the 2DPASS model, yielding favorable point cloud semantic segmentation results.
To verify the generalization ability of the proposed method, experimental validation was conducted on the nuScenes dataset, with comparisons made against RangeNet++ [32], PMF [36], and 2DPASS [1]. The experimental results are shown in Table 3.
The table shows that in the nuScenes dataset, GMAFNet achieves a 1.3% improvement compared to the 2DPASS method. The experimental results on both datasets further demonstrate that GMAFNet ranks among the top current point cloud semantic segmentation methods. Moreover, GMAFNet’s performance on these two distinct datasets also validates its adaptability and scalability.
In addition, to analyze the training and inference speeds of the model, we tested the inference speed of the model under the same data size and also recorded the time taken to train both models for 40 epochs. The specific experimental results are shown in Table 4 and Table 5.
The experimental results show that, compared with the backbone network, the overall complexity of the model has increased due to the introduction of the farthest point sampling branch in the feature extraction network and the addition of the dynamic gated attention mechanism in the feature fusion network. Compared with 2DPASS, the number of parameters remains unchanged (33.2 M vs. 33.2 M), but the training time increases by 32% (172 h vs. 130 h) and the inference latency increases by 2.5% (162 ms vs. 158 ms). Although the inference speed decreases slightly, the model still maintains high accuracy among mainstream networks, which demonstrates the favorable overall performance of GMAFNet. As shown in Figure 7, compared with mainstream networks, GMAFNet produces fewer prediction errors.

4.4. Ablation Experiment

To verify the effectiveness of each module and structure proposed in this paper, ablation experiments on the aforementioned methods were conducted using the SemanticKITTI point cloud semantic segmentation dataset. The experimental results are presented in Table 6.
In Table 6, we use the 2DPASS model as the basic framework. When the RepGhost module is adopted as the 2D feature extraction structure, the segmentation accuracy is improved by 0.3% over the basic framework. After adding the VSA module to extract multi-scale semantic features, the accuracy of semantic segmentation increases by a further 0.5%. Subsequently, with the application of RoI-grid pooling, the segmentation accuracy is enhanced by another 0.2%. Finally, using the proposed gated fusion module to adaptively adjust the weights of 2D and 3D features during the fusion process adds a further 0.7%, bringing the total improvement to 1.7% over the basic framework. It can be concluded that each module and structure proposed in this method further enhances the semantic segmentation performance of the network, which verifies the effectiveness of the proposed method.

5. Conclusions

This paper proposes a multimodal gated mechanism point cloud semantic segmentation network, GMAFNet. A multimodal feature extraction network is adopted to effectively extract 3D and 2D features, reducing the loss of fine-grained image features and point cloud geometric representations during the feature extraction process. A dynamic gated attention fusion module is introduced to make the network pay more attention to important information such as vehicles and pedestrians, control the weights of features from different modalities in the fusion process, and solve the problem of information redundancy, thus achieving higher point cloud semantic segmentation accuracy. The experimental results show that a segmentation accuracy of 66.1% is achieved on the SemanticKITTI dataset, which is 1.7% higher than that of the baseline network. Significant improvements are observed in important targets such as pedestrians and vehicles, fully demonstrating that this semantic segmentation method has certain advantages and provides reference value for the research of other 3D point cloud semantic segmentation methods. GMAFNet also has certain limitations. The voxelization of point clouds can lead to the loss of some global features. Although there is supplementation from key point features, the recognition of large objects (such as buildings) by GMAFNet is somewhat reduced.
In the future, our core vision is to optimize and extend GMAFNet in a multi-dimensional and in-depth manner. This not only involves the technical iteration of the model itself but also encompasses exploring its diversified application prospects in key intelligent fields such as unmanned aerial vehicles (UAVs), autonomous driving, and robotics. We are committed to continuously improving the prediction accuracy and inference speed of GMAFNet, as well as enhancing its robustness in complex environments, to address more challenging tasks.

Author Contributions

Conceptualization, X.K.; Methodology, X.K.; Validation, W.W.; Formal analysis, M.W.; Investigation, Z.G.; Resources, Z.L.; Data curation, C.M.; Writing—original draft, W.W.; Writing—review & editing, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is derived from publicly available datasets. The SemanticKITTI dataset can be accessed at https://www.semantic-kitti.org/. The nuScenes dataset can be accessed at https://www.nuscenes.org/.

Acknowledgments

The authors wish to thank the editor and reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 677–695. [Google Scholar]
  2. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2022, 23, 722–739. [Google Scholar] [CrossRef]
  3. Zhang, J.; Zhao, X.; Chen, Z.; Lu, Z. A Review of Deep Learning-Based Semantic Segmentation for Point Cloud. IEEE Access 2019, 7, 179118–179133. [Google Scholar] [CrossRef]
  4. Rong, M.; Cui, H.; Shen, S. Efficient 3D Scene Semantic Segmentation via Active Learning on Rendered 2D Images. IEEE Trans. Image Process. 2023, 32, 3521–3535. [Google Scholar] [CrossRef] [PubMed]
  5. Alokasi, H.; Ahmad, M.B. Deep Learning-Based Frameworks for Semantic Segmentation of Road Scenes. Electronics 2022, 11, 1884. [Google Scholar] [CrossRef]
  6. Xu, X.; Liu, J.; Liu, H. Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation. Electronics 2023, 12, 3943. [Google Scholar] [CrossRef]
  7. Kolhatkar, C.; Wagle, K. Review of SLAM Algorithms for Indoor Mobile Robot with LIDAR and RGB-D Camera Technology. In Innovations in Electrical and Electronic Engineering; Favorskaya, M.N., Mekhilef, S., Pandey, R.K., Singh, N., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; Volume 661, pp. 397–409. ISBN 978-981-15-4691-4. [Google Scholar]
  8. Teso-Fz-Betoño, D.; Zulueta, E.; Sánchez-Chica, A.; Fernandez-Gamiz, U.; Saenz-Aguirre, A. Semantic Segmentation to Develop an Indoor Navigation System for an Autonomous Mobile Robot. Mathematics 2020, 8, 855. [Google Scholar] [CrossRef]
  9. Xue, J.; Dai, Y.; Wang, Y.; Qu, A. Multiscale Feature Extraction Network for Real-Time Semantic Segmentation of Road Scenes On the Autonomous Robot. Int. J. Control Autom. Syst. 2023, 21, 1993–2003. [Google Scholar] [CrossRef]
  10. Diab, A.; Kashef, R.; Shaker, A. Deep Learning for LiDAR Point Cloud Classification in Remote Sensing. Sensors 2022, 22, 7868. [Google Scholar] [CrossRef]
  11. Liu, W.; Wang, H.; Qiao, Y.; Zhang, H.; Yang, J. DLAFNet: Direct LiDAR-Aerial Fusion Network for Semantic Segmentation of 2-D Aerial Image and 3-D LiDAR Point Cloud. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 1864–1875. [Google Scholar] [CrossRef]
  12. Hu, X.; Li, D. Research on a Single-Tree Point Cloud Segmentation Method Based on UAV Tilt Photography and Deep Learning Algorithm. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4111–4120. [Google Scholar] [CrossRef]
  13. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Silver Spring, MD, USA, 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  14. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  15. Fang, K.; Xu, K.; Wu, Z.; Huang, T.; Yang, Y. Three-Dimensional Point Cloud Segmentation Algorithm Based on Depth Camera for Large Size Model Point Cloud Unsupervised Class Segmentation. Sensors 2023, 24, 112. [Google Scholar] [CrossRef] [PubMed]
  16. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 77–85. [Google Scholar]
  17. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  18. El Madawi, K.; Rashed, H.; El Sallab, A.; Nasr, O.; Kamel, H.; Yogamani, S. RGB and LiDAR Fusion Based 3D Semantic Segmentation for Autonomous Driving. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7–12. [Google Scholar]
  19. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 4604–4612. [Google Scholar]
  20. Meyer, G.P.; Charland, J.; Hegde, D.; Laddha, A.; Vallespi-Gonzalez, C. Sensor Fusion for Joint 3D Object Detection and Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1230–1237. [Google Scholar]
  21. Zhang, Z.; Liang, Z.; Zhang, M.; Zhao, X.; Li, H.; Yang, M.; Tan, W.; Pu, S. RangeLVDet: Boosting 3D Object Detection in LIDAR With Range Image and RGB Image. IEEE Sens. J. 2022, 22, 1391–1403. [Google Scholar] [CrossRef]
  22. Song, H.; Cho, J.; Ha, J.; Park, J.; Jo, K. Panoptic-FusionNet: Camera-LiDAR Fusion-Based Point Cloud Panoptic Segmentation for Autonomous Driving. Expert Syst. Appl. 2024, 251, 123950. [Google Scholar] [CrossRef]
  23. Duan, Y.; Meng, L.; Meng, Y.; Zhu, J.; Zhang, J.; Zhang, J.; Liu, X. MFSA-Net: Semantic Segmentation With Camera-LiDAR Cross-Attention Fusion Based on Fast Neighbor Feature Aggregation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19627–19639. [Google Scholar] [CrossRef]
  24. Wu, Z.; Zhang, Y.; Lan, R.; Qiu, S.; Ran, S.; Liu, Y. APPFNet: Adaptive Point-Pixel Fusion Network for 3D Semantic Segmentation with Neighbor Feature Aggregation. Expert Syst. Appl. 2024, 251, 123990. [Google Scholar] [CrossRef]
  25. Bi, Y.; Liu, P.; Zhang, T.; Shi, J.; Wang, C. Multi-Scale Sparse Convolution and Point Convolution Adaptive Fusion Point Cloud Semantic Segmentation Method. Sci. Rep. 2025, 15, 4372. [Google Scholar] [CrossRef]
  26. Zhao, H.; Zhou, A. A Dual Projection Method for Semantic Segmentation of Large-Scale Point Clouds. Vis. Comput. 2025, 41, 9107–9126. [Google Scholar] [CrossRef]
  27. Chen, C.; Guo, Z.; Zeng, H.; Xiong, P.; Dong, J. RepGhost: A Hardware-Efficient Ghost Module via Re-Parameterization. arXiv 2024, arXiv:2211.06088. [Google Scholar]
  28. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. Int. J. Comput. Vis. 2021, 131, 531–551. [Google Scholar] [CrossRef]
  29. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9296–9306. [Google Scholar]
  30. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11618–11628. [Google Scholar]
  31. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11105–11114. [Google Scholar]
  32. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. RangeNet ++: Fast and Accurate LiDAR Semantic Segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4213–4220. [Google Scholar]
  33. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.S.; Koltun, V. Point Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  34. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 9598–9607. [Google Scholar]
  35. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point Transformer V2: Grouped Vector Attention and Partition-Based Pooling. arXiv 2022, arXiv:2210.05666. [Google Scholar] [CrossRef]
  36. Zhuang, Z.; Li, R.; Jia, K.; Wang, Q.; Li, Y.; Tan, M. Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16260–16270. [Google Scholar]
  37. Xu, C.; Wu, B.; Wang, Z.; Zhan, W.; Vajda, P.; Keutzer, K.; Tomizuka, M. SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation. arXiv 2020, arXiv:2003.03653. [Google Scholar]
  38. Cortinhal, T.; Tzelepis, G.; Aksoy, E.E. SalsaNext: Fast, Uncertainty-Aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving. In Proceedings of the 15th International Symposium on Visual Computing, San Diego, CA, USA, 5–7 October 2020. [Google Scholar]
  39. Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer Nature: Cham, Switzerland, 2020; pp. 685–702. [Google Scholar]
Figure 1. Prediction-result comparison: our method versus 2DPASS. The red boxes highlight incorrect predictions. The ground truth originates from the official manual annotations of SemanticKITTI.
Figure 2. LiDAR point cloud after perspective projection processing. The upper part shows the camera view, and the lower part shows the LiDAR view after perspective projection.
Figure 3. Architecture sketch. GMAFNet employs two parallel encoders—one dedicated to 3D geometry, the other to RGB appearance—to extract modality-specific features. 3DSC: 3D Voxel Convolution. FPS: Farthest Point Sampling Algorithm. VSA: Voxel Set Abstraction Module. RoI-grid Pooling: RoI-grid Pooling via Set Abstraction. DGAF: Dynamic Gated Attention Fusion Module.
Figure 4. (a) Composition of the RepGhost Module. (b) 2D Feature Extraction Network.
Figure 5. Three-dimensional Feature Extraction Network. The upper half shows point cloud voxelization and subsequent feature extraction, while the lower half displays key point extraction.
Figure 6. Dynamic gated attention fusion module. The fusion of 2D and 3D features is achieved through a gating mechanism.
Figure 7. Visual comparison of model predictions on SemanticKITTI; red boxes highlight the most prominent errors.
Table 1. Dataset statistics used in experiments.
Datasets | Split Details | Total Scans | Classes | Purpose
SemanticKITTI | 22 sequences | 43,551 | 19 | Training & testing
nuScenes | 1000 scenes | 40,062 | 16 | Generalization validation
Table 2. Comparison on the SemanticKITTI test set.
Method | Car | Bicycle | Motorcycle | Truck | Other-Vehicle | Person | Bicyclist | Road | Parking | Sidewalk | Other-Ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-Sign | mIoU (%)
RandLA-Net | 92 | 8 | 12.8 | 74.8 | 46.7 | 52.3 | 46 | 93.4 | 32.7 | 78.4 | 0.1 | 84 | 43.5 | 83.7 | 57.3 | 73.1 | 48 | 27.3 | 50
RangeNet++ | 89.4 | 26.5 | 48.4 | 33.9 | 26.7 | 54.8 | 69.4 | 92.9 | 37 | 69.9 | 0 | 83.4 | 51 | 83.3 | 54 | 69.1 | 49.1 | 34 | 51.2
PointTransformer | 94 | 0 | 31.1 | 73.8 | 43.5 | 52.7 | 43.2 | 94.9 | 31.6 | 75.2 | 0 | 84 | 41.5 | 82.7 | 54.3 | 69.1 | 46 | 29.3 | 49.8
PolarNet | 90.9 | 41.1 | 48.1 | 54.8 | 51.7 | 67.5 | 54.3 | 94.3 | 43.5 | 78.5 | 0 | 80.3 | 52.9 | 83.5 | 55.4 | 71.1 | 47.8 | 32.8 | 55.2
PTv2 | 94 | 37.1 | 32 | 74.1 | 43.9 | 53.1 | 44.2 | 93.9 | 36.6 | 77.2 | 0 | 85.1 | 45.5 | 86.7 | 52.3 | 71.6 | 46.1 | 31.1 | 52.9
SqueezeSegV3 | 87.1 | 34.3 | 48.6 | 47.5 | 47.1 | 58.1 | 53.8 | 95.3 | 43.1 | 78.2 | 0.3 | 78.9 | 53.2 | 82.3 | 55.5 | 70.4 | 46.3 | 33.2 | 53.3
PointPainting | 94.7 | 17.7 | 35 | 28.8 | 55 | 59.4 | 63.6 | 95.3 | 39.9 | 77.6 | 0.4 | 87.5 | 55.1 | 87.7 | 67 | 72.9 | 61.8 | 36.5 | 54.5
RGBAL | 87.3 | 36.1 | 26.4 | 64.6 | 54.6 | 58.1 | 72.7 | 95.1 | 45.6 | 77.5 | 0.8 | 78.9 | 53.4 | 84.3 | 61.7 | 72.9 | 56.1 | 41.5 | 56.2
SalsaNext | 90.5 | 44.6 | 49.6 | 86.3 | 54.6 | 74.0 | 81.4 | 93.4 | 40.6 | 69.1 | 0 | 84.6 | 53.0 | 83.6 | 64.3 | 64.2 | 54.4 | 39.8 | 59.4
SPVNAS | 96.5 | 44.8 | 63.1 | 59.9 | 64.3 | 72.0 | 86.0 | 93.9 | 42.4 | 75.9 | 0 | 88.8 | 59.1 | 88.0 | 67.5 | 73.0 | 63.5 | 44.3 | 62.3
PMF | 95.4 | 47.8 | 62.9 | 68.4 | 75.2 | 78.9 | 71.6 | 96.4 | 43.5 | 80.5 | 0.1 | 88.7 | 60.1 | 88.6 | 72.7 | 75.3 | 65.5 | 43 | 63.9
2DPASS | 97 | 49.1 | 65.2 | 67.2 | 74.3 | 79.5 | 73.4 | 97.2 | 45.3 | 79.5 | 0.9 | 88.6 | 61.2 | 89.6 | 72.9 | 74.3 | 65.4 | 43.1 | 64.4
APPFNet | 97.2 | 51.9 | 75.2 | 69.2 | 73.1 | 79.3 | 84.7 | 95.2 | 43.3 | 75.6 | 1.6 | 87.9 | 60.1 | 88.9 | 70.5 | 73.5 | 69.1 | 45.2 | 65.8
GMAFNet (Ours) | 95.4 | 52.6 | 78.2 | 67 | 62.8 | 82.5 | 90.1 | 96.4 | 42.7 | 77.4 | 1.1 | 86.2 | 60.3 | 85.7 | 71.3 | 72.6 | 65.5 | 54.8 | 66.1
Table 3. Comparison on the nuScenes test set; the bold numbers indicate the best results.
Method | Bus | Car | Engineer-Vehicle | Motorcycle | Pedestrian | Traffic-Cone | Trailer | Truck | Driveable | Other-Flat | Sidewalk | Terrain | Construction | Vegetation | mIoU (%)
RangeNet++ | 77.2 | 80.9 | 30.2 | 66.8 | 69.6 | 52.1 | 54.2 | 72.3 | 94.1 | 66.6 | 63.5 | 70.1 | 83.1 | 79.8 | 65.5
PMF | 89.8 | 92.1 | 54 | 77.7 | 80.5 | 70.9 | 64.6 | 82.9 | 95.5 | 73.3 | 73.6 | 74.8 | 89.4 | 87.7 | 76.7
2DPASS | 91.3 | 93.8 | 51.3 | 78 | 78.9 | 64.9 | 62.1 | 84.4 | 96.8 | 71.6 | 76.4 | 75.4 | 90.5 | 87.4 | 76.5
GMAFNet (Ours) | 96 | 93.7 | 58.1 | 83.9 | 81.1 | 60.4 | 73.6 | 88.9 | 96.5 | 71.9 | 75.4 | 75.1 | 88.6 | 87 | 77.8
Table 4. Comparison of training time, inference time, and inference accuracy between GMAFNet and baseline models under the GeForce RTX 3090 environment.
Method | Params (M) | Inference Time (ms) | mIoU (%) | Training Time (h)
2DPASS | 33.2 | 158 | 64.4 | 130
GMAFNet (Ours) | 33.2 | 162 | 66.1 | 172
Table 5. Impact of Different Components on Training Time.
No. | RepGhost | VSA | RoI-Grid Pooling | DGAF | Training Time (h)
1 | × | × | × | × | 130
2 | ✓ | × | × | × | 136
3 | ✓ | ✓ | × | × | 148
4 | ✓ | ✓ | ✓ | × | 158
5 | ✓ | ✓ | ✓ | ✓ | 172
Table 6. Ablation studies on the SemanticKITTI test set.
2DPASS | RepGhost | VSA | RoI-Grid Pooling | DGAF | mIoU (%)
✓ | × | × | × | × | 64.4
✓ | ✓ | × | × | × | 64.7
✓ | ✓ | ✓ | × | × | 65.2
✓ | ✓ | ✓ | ✓ | × | 65.4
✓ | ✓ | ✓ | ✓ | ✓ | 66.1


