Article

S-FusionNet: A Multi-Modal Semi-Fusion 3D Object Detection Network

School of Mechanical and Electrical Engineering, Changchun University of Science and Technology, Changchun 130022, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 2008; https://doi.org/10.3390/electronics14102008
Submission received: 8 April 2025 / Revised: 13 May 2025 / Accepted: 13 May 2025 / Published: 15 May 2025

Abstract

In the projection-driven multi-modal 3D object detection task, the data projection process has extremely high computational complexity, which restricts the efficiency of the detection network. In addition, traditional projection interpolation methods have certain limitations. To improve voxel projection efficiency and explore a projection interpolation method that can enhance detection accuracy, we propose a voxel projection optimized positioning strategy and an independent projection interpolation method, neighborhood-enhanced feature interpolation. Meanwhile, we propose a new 3D object detection network, S-FusionNet, based on multi-modal semi-fusion. With the optimized positioning strategy, the inference speed increases from 6.7 FPS to 10.78 FPS. On top of the optimized positioning strategy, at the cost of an additional 6.1 ms of network time, the neighborhood-enhanced feature interpolation method improves the detection accuracy of the “Pedestrian” category at the “Moderate” and “Hard” levels by 2.18% and 2.25%, respectively. It also improves the detection accuracy of the “Car” and “Cyclist” categories at the “Moderate” level by 1.36% and 1.3%, respectively. We also verify the stability and generalization ability of the proposed semi-fusion network S-FusionNet through robustness experiments.

1. Introduction

In the field of three-dimensional (3D) object detection, processing point cloud data through neural networks to achieve accurate identification, positioning, and classification of objects is a crucial technology. It plays an irreplaceable and important role in promoting the development and progress of many important industries such as robotics and autonomous driving. However, point cloud data are sparse, contain limited feature information, and lack intuitive texture attributes. Moreover, the point density of distant objects in the scene is low, resulting in a certain number of missed detections and false detections in the detection results of small objects [1,2,3,4,5], making it difficult for the system to fully and accurately perceive the 3D environment.
To overcome these challenges, researchers combine image data with rich visual information [6] and LiDAR data, which can provide accurate distance information [7,8,9]. This multi-modal data fusion not only realizes information complementarity between different data sources but also systematically provides more comprehensive and rich feature representations. Through this fusion, the accuracy of object identification and classification has been significantly improved, which is crucial for accurately perceiving and understanding 3D scenes [10,11,12,13].
In the field of multi-modal 3D object detection, there are inherent differences in continuity and discreteness among the different types of data used for fusion. Effectively combining the data collected by different sensors, properly addressing the data mismatch problem, overcoming the slow inference speed caused by projection operations during the matching process of point cloud data and image data, and correcting spatial offset due to projection errors have become key technical challenges that cannot be ignored in multi-modal 3D object detection.
In research on multi-modal 3D object detection, projection-based multi-modal 3D object detection methods leverage projection matrices to achieve an efficient and in-depth combination of point cloud features and image features. Specifically, this technique first accurately maps point cloud data onto the image plane. Since the projection of point cloud data on the image plane is often non-discrete, specialized interpolation techniques are required to obtain the image features corresponding to 3D points. Common interpolation techniques include bilinear interpolation [14] and voxel center point projection [15]. Bilinear interpolation [14] estimates the value of the target point using a bilinear function by considering the values of the four adjacent points around the target point. Voxel center point projection [15] performs projection and feature estimation based on the center point of the voxel. After obtaining the corresponding image features, a corresponding fusion strategy is employed to fuse them with the point cloud features. Interpolation methods play a crucial role in the entire projection-based multi-modal 3D object detection process. They serve as the key bridge connecting point cloud data and image data and are an important means for effectively fusing the features of two different modalities. An appropriate interpolation method can maximally preserve and integrate the useful information of the two types of data, reducing errors and information loss caused by projection and data conversion, thereby improving the accuracy and reliability of multi-modal 3D object detection.
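To make the bilinear variant concrete, the following is a minimal sketch of bilinear projection interpolation on an image feature map, assuming features of shape (C, H, W) and continuous pixel coordinates already obtained from a LiDAR-to-camera projection; the function name and tensor layout are illustrative choices, not code from the cited works.

```python
import torch

def bilinear_sample(feat_map: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Sample image features at continuous pixel locations.

    feat_map: (C, H, W) image feature map.
    uv:       (N, 2) projected point coordinates (u = column, v = row).
    returns:  (N, C) features, a weighted combination of the 4 neighboring pixels.
    """
    C, H, W = feat_map.shape
    u, v = uv[:, 0].float(), uv[:, 1].float()
    u0, v0 = u.floor().long(), v.floor().long()
    u1, v1 = u0 + 1, v0 + 1
    # Clamp indices so that border points stay inside the image.
    u0c, u1c = u0.clamp(0, W - 1), u1.clamp(0, W - 1)
    v0c, v1c = v0.clamp(0, H - 1), v1.clamp(0, H - 1)
    # Bilinear weights from the fractional offsets.
    wu1, wv1 = u - u0.float(), v - v0.float()
    wu0, wv0 = 1.0 - wu1, 1.0 - wv1
    f = feat_map.permute(1, 2, 0)  # (H, W, C) for easy indexing
    out = (f[v0c, u0c] * (wv0 * wu0).unsqueeze(1)
         + f[v0c, u1c] * (wv0 * wu1).unsqueeze(1)
         + f[v1c, u0c] * (wv1 * wu0).unsqueeze(1)
         + f[v1c, u1c] * (wv1 * wu1).unsqueeze(1))
    return out  # (N, C)
```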
The methods proposed in references [10,11,16,17,18,19,20,21] use bilinear interpolation to fuse the features of different modalities. However, as shown in Figure 1, there is a significant difference in density between point cloud data and image data. Point cloud data cannot fully cover all pixels in an image, because of which the bilinear projection interpolation-based method has certain drawbacks. Specifically, first, this method only averages the feature information of four adjacent pixels, resulting in limited feature information and a higher likelihood of losing image feature information corresponding to the points to be interpolated. Second, since point cloud data and image data are collected by different sensors from different perspectives, errors are very likely to be introduced during the alignment and matching process. For example, references [19,22] project the point cloud onto multiple views and then directly concatenate and fuse the projected image features and RGB image features. However, this fusion method fails to achieve an in-depth fusion of the point cloud and image features and ignores the geometric correspondence between the point cloud and the image. Methods such as F-PointNet [23,24] first perform 2D object detection on RGB images to generate 2D bounding boxes and then use the point cloud network to estimate 3D bounding boxes based on the 2D bounding boxes. The entire detection process fails to fully fuse point cloud data and image features.
The Focal Sparse [15] model, a typical representative of the voxel center point projection method, projects the center point of each voxel onto the image features. Compared with the bilinear interpolation method, which only processes the feature information of four pixels, voxel center point projection significantly reduces computational complexity. However, this method also has obvious limitations. It may not fully capture all the features inside the voxel, which in turn affects the detection accuracy to some extent.
In response to the problems in the above literature, based on the Focals Conv-F network [15], we propose a multi-modal-based semi-fusion 3D object detection network, namely S-FusionNet. This network only uses image features to assist the voxel feature extraction task and does not participate in object classification and regression tasks. This network architecture, which lies between single-modality and multi-modality object detection, fully exploits the advantages of both. It effectively addresses the issue of insufficient feature information in pure LiDAR data and alleviates the additional computational resource consumption in multi-modal 3D object detection due to projection alignment. In S-FusionNet, we propose a voxel projection optimization positioning strategy and an autonomous projection interpolation method—the neighborhood-enhanced feature interpolation method. The core of the voxel projection optimization positioning strategy is to advance the fusion position of LiDAR features and image features before the feature extraction network, that is, to perform fusion at the level of the original voxel features, rather than after the features are expanded through the convolutional network. In this way, the number of projections between different modalities of data is significantly reduced, effectively improving computational efficiency. However, using shallow voxel features for fusion inevitably has a certain impact on the detection accuracy of the network.
To improve the correctness of projection and the accuracy of feature representation, we researched and designed an adaptive projection method—the neighborhood-enhanced feature interpolation method. This method first samples the 3 × 3 image features around the point to be interpolated. By calculating the correlation weights between the point to be interpolated and its surrounding neighborhood, and then performing a weighted sum of the 9 neighborhoods including the interpolation point and taking the average, the image features corresponding to the point to be interpolated are obtained.
To comprehensively evaluate the performance and reliability of the S-FusionNet network, we systematically carried out ablation studies and robustness tests. These experiments were conducted on three datasets (namely KITTI, nuScenes, and VoD) and compared the experimental results with those of three benchmark models (PV-RCNN, Focals Conv-F, and CenterPoint).

2. Related Work

Based on projection interpolation technology, a variety of multi-modal 3D object detection methods have emerged, with the most commonly adopted approach being the combination of LiDAR and camera data sources. In this field, the key elements that affect detection performance encompass several aspects, including the position of fusion, the type of input data, and the fusion strategy, among others. The following sections will sequentially analyze these three key elements and their representative research achievements.

2.1. Fusion Position

At the forefront of research on multi-modal 3D object detection, fusion strategies are primarily categorized into three types based on the timing of fusion: early fusion, intermediate fusion, and late fusion [25,26,27,28,29,30,31,32].
Early fusion strategies [20,33] first independently extract features from image and point cloud data, followed by the integration of these features at the initial stage of feature extraction or directly at the data input level. This approach emphasizes the deep interaction and complementarity between features of different modalities before the prediction process, which are then input into the prediction module for tasks such as object classification and regression. Early fusion strategies have gained widespread recognition and application in multi-modal 3D object detection tasks due to their significant improvement in detection accuracy and robustness.
Intermediate fusion strategies focus on combining the prediction results of images with point cloud data, as shown in the literature [8,34,35,36]. In this strategy, the prediction results of the image serve as prior information, adding richness to the feature information of the point cloud data, which are then used as input for a 3D network to further extract 3D features and eventually enter the prediction stage. For instance, F-PointNet [34] is among the first studies to propose an intermediate fusion strategy. However, these methods are somewhat limited by the accuracy of 2D detectors and have not achieved full information exchange and complementarity between the two data types.
Late fusion strategies involve using specific networks for detection of each input modality separately and then integrating the unimodal prediction results at the prediction stage through a fusion network, as described in the literature [12,37,38]. The advantage of this strategy is that it avoids the alignment problem between different modal data, but it also loses the advantage of using deeply fused features for object detection.
In summary, early fusion strategies occur at the feature extraction stage, where the feature extraction of 2D images and point clouds is performed in parallel. Intermediate fusion strategies involve a series connection of 2D image and point cloud feature extraction, where 2D image features are first extracted and predicted, and then the prediction results are fused with 3D point cloud data and input into a 3D neural network for feature extraction and prediction. Late fusion occurs after completing 2D and 3D predictions. These three fusion strategies each have their own characteristics, with both advantages and limitations. Researchers can choose the most suitable fusion method based on the specific requirements of the task and the data characteristics.
The research in this paper adopts an early fusion strategy that enables information complementarity. In response to the alignment issues and time-consuming computation problems between different modalities of data in this fusion strategy, the network is carefully designed and adjusted.

2.2. Input Data Types

In the field of multi-modal 3D object detection, the selection of input data types is crucial, with the combination of LiDAR and camera being the most common. LiDAR data can take various forms, including raw point clouds, voxel grids, and projections of point clouds on bird’s eye view (BEV) or range view (RV). Based on the differences in input data types, fusion algorithms are primarily categorized into three types.
(1)
3D Object Detection Based on BEV/RV + Camera
In the early research on multi-modal 3D object detection, the strategy of projecting point cloud data onto BEV or RV and combining it with image data captured by a camera has received extensive attention. This approach not only enriches the diversity of input information but also addresses the challenge of 2D networks processing 3D point cloud data. For instance, the MV3D model proposed by Chen et al. [39] uses the front view (FV) and BEV of point clouds as inputs and achieves feature fusion through a region-based fusion network. Ku et al. [40] proposed the AVOD method to address the challenge of small object detection, which combines the BEV and multi-view features of images through an advanced RPN network to generate more accurate detection boxes. The FuseSeg model proposed by Sun et al. [41] effectively reduces information loss during projection by establishing a point-by-point correspondence between the range view (RV) of point clouds and RGB image features.
The MMF model [17] uses a fusion module to extract the BEV feature maps of point clouds, image feature maps, and pseudo-LiDAR point clouds, combining these three feature maps as inputs for 3D object detection. Wang et al. [42] proposed the multiview adaptive fusion network (MVAF-Net), which also adopts these three input methods, extracting semantic features of images, voxelized features of point clouds, and distance map features, and adaptively evaluating the importance weights of these features through an attention module to achieve adaptive fusion, thereby improving detection accuracy.
However, when point cloud data are converted to BEV and RV representations, a significant amount of useful information about the targets in the original scene may be lost. During this conversion, incorrect projection prior knowledge may be introduced into the detection network, negatively affecting performance. Therefore, retaining more original information while reducing the negative impact of projection conversion is a key challenge in research based on point cloud data projection onto BEV/RV+Camera.
(2)
3D Object Detection Based on Raw Point Cloud and Camera
Point projection-based 3D object detection methods use transformation matrices [43] to project image features onto the original point cloud, establishing the correlation between LiDAR and image pixels and enhancing the representativeness of point cloud data features through fusion strategies. For example, the point fusion method proposed by Xu et al. [33] serially fuses global image features with point cloud features extracted by PointNet [2], but this method does not fully consider the continuity of point cloud data coordinates and the discreteness of image data coordinates. The EPNet network proposed by Huang et al. [19] establishes a mapping between point features and camera image features at the fusion layer, addressing the issues of the point fusion network.
3D object detection methods based on original point clouds and cameras improve detection performance while also facing challenges such as computational resource consumption and data fusion issues.
(3)
3D Object Detection Based on Voxel and Image Features
3D object detection methods that use voxel and image features as inputs first convert the point cloud into a voxel grid, then accurately calibrate and project the voxel features onto the image plane, and achieve feature fusion through a fusion network. The MVX-Net proposed by Sindagi et al. [16] is a representative achievement in this field. The 3D-CVF method proposed by Yoo et al. [18] constructs smooth LiDAR-camera features through automatic projection and performs intelligent fusion based on the contribution of each modality to the detection task. VPF-Net [44] introduces an innovative architecture that effectively bridges the resolution gap between point cloud and image data through the alignment and aggregation of “virtual” points. However, the Focals Conv [15] method projects only the center point of each voxel grid onto image features, leading to a loss of detailed information within the voxels.
Compared to methods based on BEV/RV or raw point cloud fusion with camera, methods based on voxel grid and camera fusion are more popular, as they can effectively utilize the complementary information from different sensors and overcome some limitations of other methods. As a regularization representation of irregular point clouds, the voxel grid can retain the original information in three-dimensional space within the voxels, reduce projection errors, and simultaneously lower the demand for computational resources.
This paper’s research selects the voxel grid and camera fusion strategy, which is easier to align spatially, to explore its application potential in 3D object detection.

2.3. Fusion Strategy

The fusion strategy is the specific method used to achieve the integration of different modal input data after determining the type of input data and the stage of fusion. According to the level of fusion, it can be divided into pixel-level fusion, decision-level fusion, and feature-level fusion [32,45,46,47].
(1)
Pixel-Level Fusion Strategy
The pixel-level fusion strategy involves concatenating the original spatial coordinates of point cloud data with the RGB values of the corresponding pixels in the image, representing the lowest level of fusion. Due to the discreteness of image pixel positions and the continuity of 3D point coordinates in point cloud data, this strategy overlooks the difference in data density between point clouds and image data, leading to a lack of corresponding 3D points when projecting the 3D points in the point cloud onto the image. Moreover, this strategy requires a network capable of handling the feature extraction of different types of data simultaneously, which is not ideal.
(2)
Decision Fusion Strategy
The decision-level fusion strategy [12,34,36,39,40,48,49] first independently uses image data and point cloud data for 3D object detection and then aligns the features of the two preliminary regions of interest (RoIs) using a projection matrix. Although this early multi-modal detection strategy is simple, its performance is usually limited due to its reliance on 2D images for 3D object detection, which can lead to misalignment and fusion of incorrect features, thus affecting the detection accuracy and robustness of the network architecture.
(3)
Feature-Level Fusion Strategy
Feature-level fusion 3D object detection methods are mainly divided into point-projection feature fusion [20,35,50,51,52,53,54,55] and voxel-projection feature fusion [15,16,17,18,33,56,57,58,59,60]. These methods perform fusion at the point cloud feature extraction stage, with a focus on the fusion strategy of point cloud features and image features.
Point-projection-based 3D object detection methods typically perform fusion in the early stages. PointPainting [20] enhances LiDAR points by appending segmentation scores. MVP [50] builds on PointPainting, using image instance segmentation and a projection matrix to establish alignment between segmentation masks and point clouds. MVP randomly selects pixels within each range and links these pixels to the nearest neighbors in the point cloud for sampling. However, due to occlusion issues in the image domain, mapping 3D points to occluded image regions may result in invalid image information being obtained.
Voxel-projection-based 3D object detection methods first voxelize the original point cloud data into regular voxel grids. Each 3D point in the voxel is then projected onto image features, a method we call point-level projection; the image features obtained by projecting the voxel feature center points are taken as the image features of all data points within the voxel. This projection method is called voxel center point projection. After obtaining the corresponding image features, the image features are combined with the voxel features through fusion strategies (such as summation, concatenation, or attention mechanisms). This method addresses the issue of empty voxels caused by the sparsity of LiDAR data by integrating image information.
A typical representative of point-level projection is the MVX-Net [16] model, where each voxel feature includes the voxel’s own features, the 3D features of each point in the voxel, and the matched RGB image features of each point. Image features are concatenated with the feature vector of each 3D point to achieve point-level feature fusion.
A typical representative of voxel-centric projection is the Focal Sparse [15] model, which focuses on learning important information through Focal Sparse Conv and proposes a multi-modal Focal Sparse-fusion framework based on this. In this framework, image features corresponding to the voxel center point are obtained through projection and fused with voxel features through a summation strategy for predicting the importance map of the cube.
Compared to point-level projection, voxel-centric projection reduces computational complexity. However, it may lose some detailed information and fail to capture all feature changes within the voxel. To address this issue, we propose a new projection interpolation method, the neighborhood-enhanced feature interpolation method. This method exploits the interrelationship between adjacent image pixels, calculates weights through correlation, and automatically optimizes the projection to enrich the image features corresponding to voxel features and mitigate alignment errors. It not only enriches the expressiveness of voxel features and improves the accuracy of fusion but also enhances the robustness of multi-modal 3D object detection.

3. Method

3.1. Multi-Modal Semi-Fusion Network (S-FusionNet)

The proposed S-FusionNet network framework is illustrated in Figure 2. This network architecture draws on the design principles of VoxelNet [9], PV-RCNN [61], CenterPoint [62], Focal Sparse [15], and Voxel R-CNN [63], constructing a composite system comprising a single-view feature extraction module, projection-enhanced region-based convolutional neural networks (PE-CNNs), an object classification and parameter regression network, and a detection head. The single-view feature extraction module is briefly described in Section 3.2. The specific workflow of the PE-CNN is detailed in Section 3.3. The object classification and parameter regression network and the detection head are briefly introduced in Section 3.7.

3.2. Single-View Feature Extraction Module

(1)
Voxelization Feature Encoding
In the crucial step of voxelized feature encoding, we draw on the point cloud data voxelization strategy of Focal Sparse [15]. Specifically, we perform spatial cropping on the point cloud data, cropping it along the x, y, and z axes to the range [0, 70.4 m] × [−40 m, 40 m] × [−3 m, 1 m]. Then, we voxelize the cropped point cloud with a voxel size of 0.05 m × 0.05 m × 0.1 m.
During the voxel feature extraction process, we sum up the features of all points within each voxel. Then, we divide the obtained total point cloud features by the number of points in the voxel. The resulting average feature vector is the feature vector of each voxel. To prevent the error of dividing by zero during the mean calculation, we set a minimum number of points limit; that is, the number of points in a voxel must be no less than 1.
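As an illustration of this encoding step, the following is a minimal sketch of range cropping followed by per-voxel mean pooling, using NumPy and the ranges stated above; the helper name and array layout are assumptions, and only non-empty voxels are produced, which naturally enforces the minimum of one point per voxel.

```python
import numpy as np

def voxelize_mean(points: np.ndarray,
                  pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0),
                  voxel_size=(0.05, 0.05, 0.1)):
    """Crop points to the configured range and average point features per voxel.

    points: (N, 4) array of (x, y, z, intensity).
    returns: (voxel_coords, voxel_features), where each voxel feature is the
             mean of the points falling inside it (non-empty voxels only).
    """
    xmin, ymin, zmin, xmax, ymax, zmax = pc_range
    mask = ((points[:, 0] >= xmin) & (points[:, 0] < xmax) &
            (points[:, 1] >= ymin) & (points[:, 1] < ymax) &
            (points[:, 2] >= zmin) & (points[:, 2] < zmax))
    pts = points[mask]
    # Integer voxel indices along x, y, z.
    coords = np.floor((pts[:, :3] - np.array([xmin, ymin, zmin])) /
                      np.array(voxel_size)).astype(np.int64)
    # Group points by voxel: sum the features, then divide by the point count.
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=uniq.shape[0]).astype(np.float32)
    feats = np.zeros((uniq.shape[0], pts.shape[1]), dtype=np.float32)
    for ch in range(pts.shape[1]):
        feats[:, ch] = np.bincount(inverse, weights=pts[:, ch],
                                   minlength=uniq.shape[0]) / counts
    return uniq, feats
```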
(2)
Image Feature Encoding
For RGB image feature extraction, we follow the approach used in Focal Sparse [15], employing lightweight ResNet50 [64] as the backbone network. Through two downsampling steps, each containing 5 layers, we progressively extract features from 5 levels, with the lower levels preserving details and the higher levels capturing semantic information. The first downsampling step has a stride of 1, maintaining the resolution, and outputs 64-dimensional features. After upsampling with a stride of 1, we obtain 128-dimensional features. The second downsampling step has a stride of 2, reducing the size by half to capture higher-level semantic information, and outputs 128-dimensional features. Following upsampling with a stride of 2, we again obtain 128-dimensional features. By concatenating the features from the two upsamplings, we obtain 256-dimensional image features. These are then reduced to 16-dimensional 2D features for fusion with the 3D features.
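The following structural sketch illustrates how a full-resolution branch and a half-resolution branch could be upsampled, concatenated into 256 channels, and reduced to 16 channels for fusion with the voxel features; the exact layers of the lightweight ResNet50 backbone are not reproduced here, and the module below is an illustrative stand-in rather than the actual encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageFeatureEncoder(nn.Module):
    """Illustrative two-branch image encoder: a full-resolution branch and a
    half-resolution branch whose outputs are concatenated and reduced to 16-D."""

    def __init__(self):
        super().__init__()
        self.branch1 = nn.Sequential(                 # stride 1, keeps resolution
            nn.Conv2d(3, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU())
        self.branch2 = nn.Sequential(                 # stride 2, halves resolution
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU())
        self.reduce = nn.Conv2d(256, 16, 1)           # 256-D -> 16-D for fusion

    def forward(self, img):                           # img: (B, 3, H, W)
        f1 = self.branch1(img)                        # (B, 128, H, W)
        f2 = self.branch2(f1)                         # (B, 128, H/2, W/2)
        f2_up = F.interpolate(f2, size=f1.shape[-2:], mode="bilinear",
                              align_corners=False)    # back to full resolution
        fused = torch.cat([f1, f2_up], dim=1)         # (B, 256, H, W)
        return self.reduce(fused)                     # (B, 16, H, W)
```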

3.3. Projection-Enhanced Convolutional Neural Networks (PE-CNNs)

Figure 3 shows the detailed structure of the PE-CNN framework, which consists of four key components: a Stem module, an Accelerated Enhanced Focal Sparse Convolution Fusion Module (AE-FocalsConv-F), a feature extraction network, and a Conv_out layer. Among them, the image features are only used to combine with the primary voxel features to screen out important positions in the scene, where feature extraction is performed to enhance feature extraction ability.
In the Stem module, the voxel features are elevated from 4 dimensions to 16 dimensions through a Subm sparse convolution with a kernel size of 3 and padding of 1. The resulting feature dimension matches that of the image features, which lays the foundation for the subsequent summation fusion. The AE-FocalsConv-F module sums and fuses the 16-dimensional voxel features obtained in the Stem stage with the 16-dimensional image features obtained through neighborhood-enhanced feature projection interpolation, and uses the result as the input of the feature extraction network. In the AE-FocalsConv-F module, we propose two innovations, corresponding to the voxel projection optimization positioning strategy in Section 3.4 and the neighborhood-enhanced feature interpolation method in Section 3.5, respectively. The feature extraction backbone network includes four stages, namely Stage 1–Stage 4, and a Conv_out layer for feature dimension elevation, which are introduced in detail in Section 3.6.
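A minimal sketch of the summation fusion performed at this stage is shown below; a dense linear layer stands in for the submanifold sparse convolution of the Stem module, and the sparse-tensor bookkeeping of the real implementation is omitted.

```python
import torch
import torch.nn as nn

class StemFusion(nn.Module):
    """Toy stand-in for the Stem + summation fusion: lift 4-D voxel features to
    16-D, then add the 16-D image features gathered for each voxel."""

    def __init__(self):
        super().__init__()
        # Stand-in for the submanifold sparse convolution (kernel 3, padding 1)
        # that raises voxel features from 4 to 16 channels.
        self.stem = nn.Linear(4, 16)

    def forward(self, voxel_feats, image_feats):
        # voxel_feats: (N_voxels, 4); image_feats: (N_voxels, 16)
        lifted = self.stem(voxel_feats)      # (N_voxels, 16)
        return lifted + image_feats          # summation fusion, same shape
```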

3.4. Voxel Projection Optimization Localization Strategy

During the in-depth exploration of the fusion mechanism between voxel features and image features, the order of fusion is a non-negligible factor affecting the number of voxels. The following is a detailed analysis of the changes in the number of voxels under two different operational sequences:
(1)
Fusion of image features followed by sparse convolution: When this strategy is employed, image features are fused with voxel features on the existing voxel structure, a process that does not result in an increase or decrease in the number of voxels. Additionally, the total number of projection calculations remains unchanged, thus ensuring the consistency and stability of the computational process.
(2)
Convolution of voxel features followed by fusion of image features: When voxel features are first subjected to sparse convolution, it may lead to a change in the number of voxels, especially when the convolution operation introduces a dilation effect. This dilation effect can cause an increase in the number of voxels, and in the subsequent fusion stage, the demand for projection calculations will also increase due to the involvement of more voxels, which could potentially double the consumption of computational resources.
In the literature [15], the original FocalsConv-F module is placed at the end of Stage 1, as indicated by the blue dashed box in Figure 4. In this module, LiDAR features first undergo convolutional dilation operations through submanifold sparse convolution blocks and focal sparse convolution blocks, and are then projected onto the image feature map to obtain the corresponding image features, which are summed for fusion. In our experiments, we found that this process involves as many as 552,244 projection calculations, a substantial computational load that poses a challenge for applications with high real-time requirements. We moved the FocalsConv-F module to before Stage 1, performing projection on the original voxel features without dilation, which reduces the number of projection calculations to 15,463, roughly a 36-fold reduction in projection computation. The adjusted position of the FocalsConv-F module is after the Stem module and before Stage 1, as represented by the red solid box. This optimization strategy is named the Voxel Projection Optimization Localization (VPOL) strategy. For experimental comparison purposes, the FocalsConv-F module using the VPOL strategy is named the Accelerated Focal Sparse Convolution Fusion Module (A-FocalsConv-F).
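For illustration, the following sketch shows the voxel-center-to-image projection that the VPOL strategy performs before Stage 1; it assumes a combined 3 × 4 LiDAR-to-image projection matrix and omits calibration details, so it should be read as a simplified illustration rather than the exact implementation.

```python
import torch

def project_voxel_centers(voxel_centers: torch.Tensor,
                          lidar_to_img: torch.Tensor) -> torch.Tensor:
    """Project voxel center points into the image plane.

    voxel_centers: (N, 3) centers in LiDAR coordinates. Under the VPOL strategy,
                   N is the number of original non-empty voxels (about 15k in our
                   experiments) rather than the dilated set produced after sparse
                   convolution (about 550k).
    lidar_to_img:  (3, 4) combined calibration/projection matrix.
    returns:       (N, 2) pixel coordinates (u, v).
    """
    ones = torch.ones(voxel_centers.shape[0], 1, device=voxel_centers.device)
    homo = torch.cat([voxel_centers, ones], dim=1)          # (N, 4) homogeneous
    proj = homo @ lidar_to_img.t()                          # (N, 3)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)         # perspective divide
    return uv
```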
However, when the fusion module is located in the shallow layer of the network and only the shallow features of voxels are fused with image features, the semantic information contained is relatively scarce. The fused features struggle to precisely capture complex semantic information. When the model faces complex scenarios and diverse targets, the accuracy of classification and recognition drops significantly. Moreover, the image features obtained using the voxel center point projection method are also insufficient, which directly affects the accuracy of the detection network.
To obtain richer and more accurate image features and thereby compensate for the impact of shallow voxel features on detection performance, we propose a neighborhood-enhanced feature interpolation method to replace the voxel center point projection method in the FocalsConv-F module. Accordingly, we rename the A-FocalsConv-F fusion module the Accelerated Enhanced Focal Sparse Convolution Fusion Module (AE-FocalsConv-F).

3.5. Neighborhood-Enhanced Feature Interpolation Method

The voxel center point projection interpolation method used in the literature [15] maps the center points of voxels to the image plane and fuses the single image feature at that position with the voxel feature through summation, thereby generating new voxel features. Although this projection interpolation method is efficient, it does not consider the inherent difference in density between point cloud data and image data, as shown in Figure 1. Image features typically consist of discrete pixel values, whereas point cloud features are continuous spatial coordinates. Each point in the point cloud may not have a direct corresponding pixel in the image, or a single pixel may correspond to multiple points in the point cloud, which can lead to spatial misalignment during fusion. Therefore, such a simple feature fusion strategy is insufficient to fully leverage the potential of multi-modal data, resulting in a decline in 3D object detection performance.
To overcome these deficiencies and explore a more effective projection interpolation method, this paper proposes a novel projection interpolation technique named the neighborhood-enhanced feature interpolation (NE interpolation) method. It is an automatic projection method, as shown in Figure 5.
By expanding the neighborhood calculation range of the sampling points, the new projection interpolation technique avoids information loss due to insufficient sampling. Additionally, cosine similarity is used to weight each neighborhood point, adjusting the importance of the features. The 9 weighted feature vectors are stacked and merged, followed by an averaging operation to obtain the image feature of the point to be interpolated. The entire process includes the following three steps:
(1)
Neighborhood Sampling and Feature Extraction: Traverse the 3 × 3 neighborhood coordinates centered on the point P to be interpolated. Remove invalid coordinates that exceed the image size and fill them with zeros. Extract the features corresponding to the valid coordinates from the image feature map.
(2)
Feature Weighting: Calculate the cosine similarity between the features of each neighborhood point and the feature of the point to be interpolated (a cosine value closer to 1 indicates that the two feature vectors are more related). Weight each neighborhood feature. Enhance features similar to the center point and suppress irrelevant neighborhood features. The formula for cosine similarity is shown in Equation (1).
(3)
Feature Stacking: Combine the nine weighted neighborhood features along the new dimension to form a tensor of [9, N, C], where N represents the number of points in the point cloud and C represents the number of channels of the features. Take the mean along the neighborhood dimension to obtain the final aggregated feature, which is then summed and fused with the voxel feature to serve as the training input.
The formula for calculating the similarity between neighborhoods is as shown in Equation (1):
$$\cos(\theta_i) = \frac{f(Q_9)\cdot f(Q_i)}{\lVert f(Q_9)\rVert \, \lVert f(Q_i)\rVert}, \qquad i = 1,\ldots,9 \tag{1}$$
where f(Q_i) represents the pixel feature of the i-th neighborhood point and f(Q_9) is the pixel feature at the location of the point to be interpolated.
The advantages of this method are as follows: first, it addresses the mismatch between the discreteness of image features and the continuity of point cloud data; second, by expanding the neighborhood calculation range of the sampling points, it avoids information loss due to insufficient sampling; third, through similarity weighting, this method can automatically highlight important features while ignoring unimportant or redundant ones, thus enhancing the detection capabilities of the model.
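A minimal sketch of the three steps above is given below; it assumes the projected voxel centers are rounded to the nearest pixel, treats the center of the 3 × 3 window as the reference feature for the cosine weights in Equation (1), and zero-fills neighbors that fall outside the image, as described in step (1). The returned features would then be summed with the 16-dimensional voxel features from the Stem module.

```python
import torch
import torch.nn.functional as F

def ne_interpolation(feat_map: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Neighborhood-enhanced feature interpolation (minimal sketch).

    feat_map: (C, H, W) image feature map.
    uv:       (N, 2) projected voxel-center pixel coordinates (u = col, v = row).
    returns:  (N, C) image features for the points to be interpolated.
    """
    C, H, W = feat_map.shape
    u = uv[:, 0].float().round().long()
    v = uv[:, 1].float().round().long()

    neigh_feats = []
    for dv in (-1, 0, 1):            # traverse the 3x3 neighborhood
        for du in (-1, 0, 1):
            uu, vv = u + du, v + dv
            valid = (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
            f = torch.zeros(u.shape[0], C, device=feat_map.device)
            f[valid] = feat_map[:, vv[valid], uu[valid]].t()   # (n_valid, C)
            neigh_feats.append(f)                              # invalid -> zeros
    neigh = torch.stack(neigh_feats, dim=0)                    # (9, N, C)

    center = neigh[4]                                          # center pixel feature
    # Cosine similarity of each neighbor to the center feature, as in Eq. (1).
    weights = F.cosine_similarity(neigh, center.unsqueeze(0).expand_as(neigh), dim=2)
    weighted = neigh * weights.unsqueeze(-1)                   # weight each neighbor
    return weighted.mean(dim=0)                                # average over the 9
```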

3.6. Feature Extraction Network

Next, we introduce the backbone network for feature extraction, which consists of a feature extraction module and a Conv_out layer for feature dimension elevation, as shown in Figure 3. The feature extraction module contains four stages: Stage 1, Stage 2, Stage 3, and Stage 4. In each stage, the Reg block, Subm block, and Focals Conv block represent three types of sparse convolution blocks: conventional sparse convolution [14], submanifold sparse convolution [65], and focal sparse convolutions [15], respectively. In this architecture, {c0, c1, c2, c3, c4} represent the output channel numbers of each stage, sequentially {16, 16, 32, 64, 64}, while {n1, n2, n3, n4} indicate the number of submanifold blocks, respectively {1, 2, 2, 2}.
As shown in Figure 6, Stage 1 consists of one Subm block and one Focals convolution. The input and output of the Subm convolution are (16, 16), with a kernel size of 3 and padding of 1. Due to the sparsity of point cloud data, the number of adjacent points varies for each point, so a fixed stride is not set. The input and output dimensions of the Focals convolution are (16, 16), with a stride of 1. As it is a dynamic adaptive convolution, the kernel size adjusts automatically based on the characteristics of the input data, so no fixed kernel size or padding is set, keeping the input and output of the convolution the same. Stage 2 is composed of one Reg convolution, two Subm convolutions, and one Focals convolution. The input and output feature dimensions of the Reg convolution are (16, 32), with a kernel size of 3, a stride of 2, and padding of 1. The input and output feature dimensions of the two Subm convolutions are (32, 32), with a kernel size of 3 and padding of 1. The feature dimensions of the Focals convolution are (32, 32), with a stride of 2. Stage 3 consists of one Reg convolution, two Subm convolutions, and one Focals convolution. The input and output feature dimensions of the Reg convolution are (32, 64), with a kernel size of 3, a stride of 2, and padding of 1. The input and output feature dimensions of the two Subm convolutions are (64, 64), with a kernel size of 3 and padding of 1. The feature dimensions of the Focals convolution are (64, 64), with a stride of 4. Stage 4 is composed of one Reg convolution and two Subm convolutions. The input and output feature dimensions of the Reg convolution are (64, 64), with a kernel size of 3, a stride of 2, and padding of (0, 1, 1), that is, no zero padding along the depth dimension and 1 unit of zero padding along the height and width dimensions, which preserves the height and width of the feature map after convolution. The input and output feature dimensions of the two Subm convolutions are (64, 64), with a kernel size of 3 and padding of 1.
Finally, through a regular sparse convolution layer, Conv_out, with a convolutional kernel size of (3, 1, 1) and a stride of (2, 1, 1) in the depth, height, and width dimensions, the features are elevated to 128 dimensions for the object classification and parameter regression tasks.
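The stage layout described above can be summarized in the following configuration table; this is a descriptive summary of the text rather than the actual network definition, and the dynamic kernel behavior of the focal sparse convolutions is not captured here.

```python
# Stage configuration of the feature extraction backbone (summary of the text).
# "Reg" is a regular sparse convolution, "Subm" a submanifold sparse convolution,
# and "Focals" a focal sparse convolution.
BACKBONE_CFG = {
    "stage1": {"channels": (16, 16), "blocks": ["Subm", "Focals"],
               "focals_stride": 1},
    "stage2": {"channels": (16, 32), "blocks": ["Reg", "Subm", "Subm", "Focals"],
               "reg_stride": 2, "focals_stride": 2},
    "stage3": {"channels": (32, 64), "blocks": ["Reg", "Subm", "Subm", "Focals"],
               "reg_stride": 2, "focals_stride": 4},
    "stage4": {"channels": (64, 64), "blocks": ["Reg", "Subm", "Subm"],
               "reg_stride": 2, "reg_padding": (0, 1, 1)},
    # Final regular sparse convolution that lifts the features to 128 channels.
    "conv_out": {"channels": (64, 128), "kernel": (3, 1, 1), "stride": (2, 1, 1)},
}
```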

3.7. Object Classification and Parameter Regression Network

The design of the object classification and parameter regression network is inspired by the Voxel R-CNN [63] network. It begins by compressing the 3D voxel features along the z-axis to create a bird’s-eye view (BEV) feature map. A 2D feature extraction network with convolution kernels of size 3 is used to downsample the BEV feature map and extract feature maps, which are then upsampled and concatenated along the channel dimension to obtain a 256-dimensional feature map. A convolution with a kernel size of 1 is used to obtain coarse-grained regions of interest (RoIs). The 3D candidate regions obtained from the region proposal network (RPN) are subjected to voxel RoI pooling on the last two layers of the 3D feature extraction network, which are downsampled by factors of 4 and 8, respectively, to extract the RoI features of the voxels. Each RoI is partitioned into 6 × 6 × 6 sub-voxels; for each sub-voxel, the 16 surrounding voxels are identified, and their features are aggregated with a PointNet-style module to obtain the sub-voxel feature. The final fused 128-dimensional RoI features are obtained through max pooling and input into the detection head for further refinement of the bounding box predictions.
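As a simplified illustration of the RoI partitioning step, the sketch below generates the 6 × 6 × 6 sub-voxel centers inside an axis-aligned RoI; box rotation and the subsequent neighbor lookup and PointNet aggregation are omitted, and the helper name is hypothetical.

```python
import torch

def roi_grid_points(roi: torch.Tensor, grid_size: int = 6) -> torch.Tensor:
    """Generate grid_size**3 sub-voxel centers inside an RoI.

    roi: (6,) tensor (cx, cy, cz, dx, dy, dz), the box center and size
         (the heading angle is omitted in this simplified sketch).
    returns: (grid_size**3, 3) grid point coordinates.
    """
    cx, cy, cz, dx, dy, dz = roi.tolist()
    # Normalized offsets at the center of each sub-voxel, in (-0.5, 0.5).
    steps = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size - 0.5
    gx, gy, gz = torch.meshgrid(steps, steps, steps, indexing="ij")
    offsets = torch.stack([gx * dx, gy * dy, gz * dz], dim=-1).reshape(-1, 3)
    return offsets + torch.tensor([cx, cy, cz])
```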

4. Experimental Section

4.1. Datasets

KITTI: We conducted the main experiments on our proposed S-FusionNet network using the KITTI [66] 3D object detection dataset. This dataset consists of 3717 training samples, 3769 validation samples, and 7518 test samples. The primary detection targets are classified into three categories: Car, Pedestrian, and Cyclist. Based on factors such as object size, occlusion, and truncation, these targets are further divided into three difficulty levels: Easy, Moderate, and Hard.
The KITTI dataset employs different intersection over union (IoU) thresholds for objects of various categories and difficulty levels. Specifically, for the “Car” category, the IoU thresholds corresponding to the three difficulty levels are set at (0.7, 0.7, 0.7). For the smaller object categories of “Pedestrian” and “Cyclist”, the IoU thresholds across the three difficulty levels are uniformly set at (0.5, 0.5, 0.5).
The evaluation metrics include detection precision (average precision, AP) and mean average precision (mAP). The average precision is calculated using 40 recall positions, adhering to the unified standards set by the official KITTI performance evaluation.
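For reference, the 40-recall-point average precision can be approximated as in the sketch below; this is a simplified illustration of the metric, not the official KITTI evaluation code.

```python
import numpy as np

def average_precision_r40(recall: np.ndarray, precision: np.ndarray) -> float:
    """Approximate AP over 40 equally spaced recall positions (KITTI R40 style).

    recall, precision: arrays describing the PR curve, ordered by recall.
    """
    ap = 0.0
    for r in np.linspace(1.0 / 40.0, 1.0, 40):
        # Interpolated precision: the maximum precision at recall >= r.
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 40.0
    return float(ap)
```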
Data Preprocessing: For the training dataset, we performed redundant data filtering, random shuffling, and data padding, along with a series of data augmentation operations to enhance data diversity, including random horizontal flipping, global rotation, scaling, and translation [67].
First, we filtered the data to retain regions consistent with the voxel size, with the point cloud range set to [0, −40, −3, 70.4, 40, 1]. Additionally, we randomly shuffled the training data to improve its robustness. For data augmentation, we applied a series of transformations to the point cloud data, including random horizontal flipping (with a probability of 0.5), random rotation within [−π/4, π/4], random scaling (with a scaling factor between [0.95, 1.05]), and random translation with a standard deviation of 0.2 along the X, Y, and Z axes.
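A minimal sketch of these global augmentations is shown below, assuming points of shape (N, 3+) and the stated parameter ranges; in practice the same flip, rotation, scaling, and translation must also be applied to the ground-truth boxes, which is omitted here.

```python
import numpy as np

def augment_point_cloud(points: np.ndarray, rng=np.random) -> np.ndarray:
    """Apply the global augmentations described in the text to (N, 3+) points."""
    pts = points.copy()
    # Random horizontal flip (probability 0.5): mirror the y-coordinate.
    if rng.rand() < 0.5:
        pts[:, 1] = -pts[:, 1]
    # Random global rotation around the z-axis within [-pi/4, pi/4].
    theta = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ rot.T
    # Random global scaling within [0.95, 1.05].
    pts[:, :3] *= rng.uniform(0.95, 1.05)
    # Random global translation with standard deviation 0.2 along x, y, z.
    pts[:, :3] += rng.normal(scale=0.2, size=3)
    return pts
```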
The RGB images in the training set were normalized and randomly scaled within the range of (640 × 192) to (2560 × 768) while maintaining the aspect ratio.
For the test dataset, we applied only the same region filtering and padding as in the training phase to avoid significant data alterations, ensuring an objective evaluation of the model’s performance on new data. The RGB images in the test set were resized to a fixed input scale of (1280 × 384) while preserving the same aspect ratio as the training data.

4.2. Training Setup

The entire experiment was completed on a single RTX 3090 GPU using the deep learning framework PyTorch 2.2.0. The training process employed the adaptive moment estimation (Adam) optimizer [68], with the weight decay parameter set to 0.01 and the learning rate set to 0.01, and a total of 80 epochs were trained with a batch size of 8.
The hardware configuration of this study includes a 16-core VCPU Intel(R) Xeon(R) Platinum 8474C processor, accompanied by one NVIDIA GeForce RTX 3090 GPU with a video memory capacity of 24 GB. The software environment consists of the Ubuntu 20.04 operating system, the PyTorch 1.10.0 deep learning framework, the Python 3.8 programming language, and CUDA 11.3. For the model training, the Adam optimizer [68] was employed. The batch size was set to 4, the weight decay rate was 0.01, the momentum value was 0.9, and the maximum number of iterations was 80.

4.3. Experimental Results

4.3.1. Performance on KITTI Validation Set

On the KITTI 3D object validation set, we comprehensively evaluated the detection performance of S-FusionNet on three key categories, namely “Car”, “Pedestrian”, and “Cyclist”, and compared its performance with other current advanced multi-modal 3D detection methods. The results are shown in Table 1. When performing calculations on only a single graphics card, that is, with limited computing resources, the 3D mAP of the S-FusionNet network still exceeded that of the EPNet method by 4%, which fully demonstrates its excellent detection ability.
Although our detection results for the “Pedestrian” category are 1.03% and 1.46% lower than those of the F-PointNet method at the “Easy” and “Mod” difficulty levels, respectively, at the more challenging “Hard” level, our detection results are 2.34% higher than those of the F-PointNet method. In terms of the 3D mAP index, S-FusionNet far outperforms the F-PointNet method by 9.37%.

4.3.2. Performance on KITTI Test Set

To fully verify the generalization ability of the S-FusionNet method, we deployed it on the KITTI online server and evaluated the detection performance of three key categories, namely “Car”, “Pedestrian”, and “Cyclist”, on the 3D detection benchmark of the KITTI test set. We also compared it with other excellent multi-modal detection methods, and the results are shown in Table 2.
In the “Car” category, the detection accuracy of the S-FusionNet network at the three difficulty levels is 0.42%, 0.93%, and 0.69% lower than that of Focals Conv-F, respectively. In the “Pedestrian” category, the S-FusionNet network outperforms the other methods. In the “Cyclist” category, the detection accuracy at the easy and moderate difficulty levels is 2.62% and 0.90% lower than that of F-ConvNet, respectively. However, the detection accuracy at the hard difficulty level is 2.72% higher than that of the F-ConvNet network, and the detection accuracy in terms of 3D mAP is 2.38% higher than that of the F-ConvNet network.
These experimental results clearly indicate that the S-FusionNet method has a certain generalization ability when dealing with new scenarios.
In this study, we conducted a comparative analysis of the detection performance in the “Car” category for the top five published multi-modal methods on the KITTI test set, as shown in Table 3. The ranking of these methods is based on the detection results at the “Mod” difficulty level. Although there is a certain gap in detection accuracy between our method, S-FusionNet, and the aforementioned methods, we adopted a new fusion strategy, the multi-modal semi-fusion method, and carried out theoretical explorations and relevant technical attempts in the field of multi-modal fusion.
In addition, we also evaluated the overall performance of the S-FusionNet method on the KITTI test set, which includes the precision–recall (PR) curves for 3D object detection, bird’s-eye view (BEV), and direction estimation tasks. These curves intuitively reveal the performance of the model on “Car” targets at different difficulty levels, as shown in Figure 7.

4.4. Visual Analysis

In this section, we conduct a visual analysis of the prediction results of the S-FusionNet network on the test set, aiming to comprehensively demonstrate the detection performance of this model on three major categories of targets: “Car”, “Pedestrian”, and “Cyclist”. To make the detection results more intuitive and understandable, we specifically select three scenarios for each category. Figure 8, Figure 9, and Figure 10, respectively, present the detection situations of “Car”, “Pedestrian”, and “Cyclist” targets clearly.
Through a detailed analysis of the visual images of these different scenarios, we draw the following conclusions:
(1)
The S-FusionNet network can successfully detect the vast majority of targets. By comparing with the image detection results, we find that there are no missed detections in the three key categories of “Car”, “Pedestrian”, and “Cyclist”, which shows its efficient detection ability.
(2)
In terms of car detection, the algorithm demonstrates high accuracy. Even in complex environments with extremely dim light or occlusion, the cars in Figure 8a and Figure 9a are successfully detected.
(3)
In terms of pedestrian detection, most pedestrians are accurately identified. However, owing to the sparsity of LiDAR data, small targets contain relatively little point cloud information. As a result, some small debris is misjudged as pedestrians, which affects the accuracy of pedestrian detection. This remains a difficult problem to overcome in 3D object detection tasks.
(4)
In terms of cyclist detection, although there are relatively few samples of this category, the visualized scenarios show neither missed nor false detections.

4.5. Ablation Experiment

4.5.1. VPOL Strategy

To explore the potential impact of the voxel projection optimization positioning strategy on the algorithm’s inference speed and performance, we conducted a comparative experimental study on the KITTI validation set.
In Table 4, ‘×’ represents the Focals Conv-F [15] without the introduction of the VPOL strategy, in which case the fusion module is located after the first stage; ‘✓’ indicates the introduction of the VPOL strategy, where the fusion module is located after the Stem module and before the first stage. Both groups of experiments adopted the voxel center point projection interpolation method.
The experimental results show that when the network does not adopt the VPOL strategy, the inference speed is 6.7 FPS. After adopting the VPOL strategy, the inference speed increases to 10.78 FPS, an improvement of 4.08 FPS. The network architecture using the VPOL strategy effectively reduces the model’s inference time by reducing the number of voxel projection calculations.
However, when comparing the detection accuracy of the two groups, it is found that the performance of the network model optimized for speed has declined to a certain extent in the three detection categories of “Car”, “Pedestrian”, and “Cyclist” compared with the model without the VPOL strategy. In-depth analysis reveals that this is mainly because the fusion module is located in the shallow layer of the network, and the network only utilizes the shallow-layer features of voxels. There are limitations in describing the complexity and details of targets, and the impact on the detection performance of small targets, especially “Pedestrian”, is most obvious.
To make up for the negative impact of the insufficient information of shallow-layer features on network performance, we further introduced the neighborhood-enhanced feature interpolation method to replace the center point projection interpolation method used in the network model, as shown in Table 5.

4.5.2. Neighborhood-Enhanced Feature Interpolation Module

The core objective of the neighborhood-enhanced feature interpolation method proposed in this paper is to alleviate the issue of decreased detection performance that occurs when the VPOL strategy is adopted and the fusion module is integrated into the shallow layers of the network. Additionally, it aims to overcome the limitations of existing projection interpolation methods. As shown in Table 5, the A-FocalsConv-F module combines the VPOL strategy with the traditional voxel center point projection interpolation method, while the AE-FocalsConv-F module employs the VPOL strategy along with the neighborhood-enhanced feature interpolation method proposed in this study. The experimental data indicate that although the neighborhood-enhanced feature interpolation method leads to a decrease of 0.67 FPS in the model’s inference speed compared to the voxel center point projection interpolation method, the detection accuracy improves for the three categories of “Car”, “Pedestrian”, and “Cyclist”. In particular, for the “Pedestrian” category at the “Moderate” and “Hard” levels, significant performance improvements of 2.18% and 2.25% are achieved, respectively.
The experimental results verify that the neighborhood-enhanced feature interpolation method can effectively explore and utilize image features, thus improving detection precision while maintaining the model’s inference speed.

4.5.3. Semi-Fusion Network

Building on the semi-fusion network, the full-fusion network applies image features not only to assist voxel feature extraction but also to the object classification and regression tasks. As shown by the experimental results in Table 6, although the full-fusion network integrates richer image features, it does not significantly improve detection performance. Specifically, in the “Car” and “Pedestrian” categories, the detection accuracy of the semi-fusion network is better than that of the full-fusion network. In the “Cyclist” category, the semi-fusion network is 3.04%, 0.42%, and 1.05% lower than the full-fusion network at the three difficulty levels, respectively. However, the training time required by the full-fusion network is 9.5 times that of the semi-fusion network. This indicates that the semi-fusion network can iterate and be optimized more quickly, effectively reducing unnecessary computational complexity.

4.6. The Scalability of the S-FusionNet Network

In the S-FusionNet network, by introducing or excluding the AE-FocalsConv-F module, the network gains the capability to handle both multi-modal input (LiDAR + RGB) and single LiDAR input. As shown in Table 7, the “✓” row indicates the inclusion of the AE-FocalsConv-F module in the network, while the “×” row denotes its exclusion. The experimental results verify the flexibility and scalability of the S-FusionNet network.
As shown in Table 7, the introduction of the AE-FocalsConv-F module significantly improves detection accuracy for the “Car” and “Pedestrian” categories. Particularly for the “Pedestrian” category under “Hard” difficulty levels, detection accuracy increases by 5.02% when small targets are severely occluded. However, detection performance declines for the “Cyclist” category. This is primarily attributed to (1) the irregular shapes and sizes of cyclists, combined with limited training samples (Car: 14,357; Pedestrian: 2207; Cyclist: 743), leading to insufficient feature learning; (2) the diverse postures and orientations of cyclists, which make image-derived features less accurate compared to LiDAR point cloud data; (3) image features being susceptible to lighting variations and shadows, with cyclists often occluded by other objects—factors that introduce noise in image-based detection but are less prominent in LiDAR point clouds. Therefore, fused image features may degrade cyclist detection performance compared to using LiDAR alone.
Regarding inference time, the AE-FocalsConv-F module introduces an additional 27.7 ms latency to S-FusionNet. Although this increases inference delay, the accompanying improvement in detection accuracy makes this trade-off worthwhile from the perspective of autonomous driving safety requirements.

4.7. Robustness Testing

We evaluate the robustness of the S-FusionNet network from two aspects: stability and generalization. For the stability test, changes in angular error are introduced to assess the network’s ability to handle targets from different perspectives, ensuring that the network can maintain stable and accurate detection performance when the target angle changes. The generalization test is carried out on different benchmark frameworks and datasets to examine the adaptability of the algorithm under different environments and data distributions, thereby confirming its learning and generalization ability.
(1)
Stability
Angular error is an important indicator for evaluating the stability and adaptability of the model. It is mainly used to simulate the different pitch angles, roll angles, and yaw angles of the targets in actual scenarios due to the slope of the road, turning, or the movement of the targets. By adjusting these angles and observing the detection results of the model, the stability of the algorithm can be evaluated.
In the experiment, multiple angular error thresholds were set, specifically 0.05 degrees, 0.1 degrees, and 0.2 degrees. These thresholds represent the rotation operations of the target objects in three-dimensional space, resulting in a certain degree of rotation deviation on the x-axis, y-axis, and z-axis, that is, the variables of the pitch angle, roll angle, and yaw angle. As shown in Table 8, on the KITTI dataset, taking Focals Conv-F as the benchmark, the performance tests of the A-FocalsConv-F model applying the VPOL strategy and the AE-FocalsConv-F model integrating the neighborhood-enhanced feature interpolation method were carried out and compared under different angular error conditions.
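The sketch below illustrates one way such an angular perturbation could be applied, composing small pitch, roll, and yaw rotations and applying them to target points; the exact perturbation convention used in the experiments is not specified in detail, so this is an assumption based on the description above.

```python
import numpy as np

def rotation_matrix(pitch: float, roll: float, yaw: float) -> np.ndarray:
    """Compose a 3x3 rotation from pitch (x), roll (y), and yaw (z) in degrees."""
    px, ry, yz = np.deg2rad([pitch, roll, yaw])
    rx = np.array([[1, 0, 0],
                   [0, np.cos(px), -np.sin(px)],
                   [0, np.sin(px),  np.cos(px)]])
    rym = np.array([[ np.cos(ry), 0, np.sin(ry)],
                    [0, 1, 0],
                    [-np.sin(ry), 0, np.cos(ry)]])
    rz = np.array([[np.cos(yz), -np.sin(yz), 0],
                   [np.sin(yz),  np.cos(yz), 0],
                   [0, 0, 1]])
    return rz @ rym @ rx

def perturb_points(points: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate (N, 3) target points by the same small angle on all three axes."""
    return points @ rotation_matrix(angle_deg, angle_deg, angle_deg).T
```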
Table 8 provides a detailed comparison of the stability performance of the A-FocalsConv-F and AE-FocalsConv-F models under angular errors of 0 degrees, 0.05 degrees, 0.1 degrees, and 0.2 degrees. When the angular error is 0 degrees, it means that there is no angular error. The performance increment (Delta) reflects the comparison of the model’s stable performance under different angular errors with that under no errors.
First, we analyze the performance changes of the two models across the three categories at an angular error of 0.05 degrees. The A-FocalsConv-F model shows a performance improvement in the “Pedestrian” category under this perturbation, which indicates a certain adaptability to angular changes rather than stability as such. The AE-FocalsConv-F model shows a similar but smaller change in the “Pedestrian” category at the same angular error. Comparing the performance increments Delta(b − a) and Delta(B − A) of the two models, the detection performance of the AE-FocalsConv-F model changes less across all three categories, indicating that its stability is superior to that of the A-FocalsConv-F model.
When the angular error increases from 0.05 to 0.1 degrees, the fluctuations in detection performance grow. Comparing the increments Delta(c − a) and Delta(C − A) of the two models, the AE-FocalsConv-F model fluctuates less and still maintains a high level of stability.
When the angular error threshold is further increased to 0.2 degrees, a comparison of the variables Delta(d-a) and Delta(D-A) between the two models shows that the AE-FocalsConv-F model exhibits relatively low sensitivity to angular errors.
This means that, in practical applications, the AE-FocalsConv-F model, which incorporates both the Voxel Projection Optimization Localization (VPOL) strategy and the neighborhood-enhanced feature interpolation method, can maintain reliable and adaptive behavior over a wider range of angular errors, demonstrating a certain degree of environmental adaptability.
(2) Generalization
By testing on different benchmark frameworks and datasets, we evaluate how well the S-FusionNet network adapts to different environments and data distributions. Three combinations are examined: “PV-RCNN + KITTI”, “Focals Conv-F + VoD”, and “CenterPoint + nuScenes”.
a. PV-RCNN + KITTI
On the KITTI validation set, we conducted generalization experiments on the reproduced benchmark model PV-RCNN* and the S-FusionNet network. The experiments were run on an RTX 3090 GPU with the PyTorch deep learning framework and trained with the ADAM-OneCycle optimizer, using a weight decay of 0.01, an initial learning rate of 0.01, a batch size of 4, and 30 training epochs. The test evaluated the detection accuracy (AP) of the “Car”, “Pedestrian”, and “Cyclist” categories in both 3D and BEV. As shown in Table 9, the overall detection performance of the S-FusionNet network is superior to that of the PV-RCNN* network.
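As a reference for reproduction, the optimizer setup described above can be approximated in plain PyTorch as follows. This is a sketch only: OpenPCDet’s “adam_onecycle” schedule differs in detail from torch’s built-in OneCycleLR, and `model` and `steps_per_epoch` are placeholders.

```python
import torch

def build_optimization(model: torch.nn.Module, steps_per_epoch: int, epochs: int = 30):
    """Approximate the reported setup: Adam, LR 0.01, weight decay 0.01, one-cycle schedule."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.01, epochs=epochs, steps_per_epoch=steps_per_epoch)
    return optimizer, scheduler
```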
b. Focals Conv-F + VoD
On the millimeter-wave radar dataset View-of-Delft (VoD) [73], generalization experiments were conducted on the reproduced benchmark model FocalsConv-F* and the proposed S-FusionNet network. The experiments were run on an RTX 3090 GPU and trained with the ADAM optimizer, using a weight decay of 0.01, an initial learning rate of 0.01, a batch size of 8, and 30 training epochs. Since the VoD official website does not provide ground truth for the test set, this study evaluated the three main categories of Car, Cyclist, and Pedestrian on the training set (5139 samples) and the validation set (1296 samples). During evaluation, the IoU threshold for the Car category was set to 0.5, while the IoU thresholds for the Cyclist and Pedestrian categories were set to 0.25. To evaluate model performance comprehensively, two evaluation regions were adopted: the entire annotated area and the driving corridor area. The entire annotated area covers all annotated regions in the dataset and is applicable to multiple autonomous driving tasks, such as object detection, tracking, and behavior prediction. The driving corridor area focuses on the region around the vehicle’s driving path, i.e., the areas in front of and on both sides of the vehicle, which are particularly important for real-time object detection and path planning.
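The evaluation settings described above, namely the class-wise IoU thresholds and the driving corridor region, can be summarized as in the sketch below. The corridor bounds are illustrative placeholders, not the official VoD values.

```python
# Class-wise IoU thresholds used for the VoD evaluation described above.
IOU_THRESHOLDS = {"Car": 0.5, "Pedestrian": 0.25, "Cyclist": 0.25}

def in_driving_corridor(box_center, x_range=(0.0, 25.0), y_range=(-4.0, 4.0)) -> bool:
    """Return True if a box center (x forward, y lateral) lies in the assumed corridor.

    The ranges are placeholders chosen only to illustrate restricting evaluation
    to the area in front of and beside the ego vehicle.
    """
    x, y, _ = box_center
    return x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
```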
As shown in Table 10, the S-FusionNet network also performs well on the VoD dataset. Although the proposed multi-modal semi-fusion network S-FusionNet scores slightly lower than the multi-modal full-fusion network FocalsConv-F in some categories, this is because FocalsConv-F also introduces image features in the object prediction stage, whereas S-FusionNet uses only unimodal voxel features at that stage. In return, S-FusionNet reduces latency by 27 ms. This experiment verifies the generalization and wide applicability of the S-FusionNet network for the 3D object detection task.
c. CenterPoint + nuScenes
On the nuScenes dataset [74], the generalization ability of the S-FusionNet network was examined and compared with the LiDAR-only benchmark model CenterPoint and its reproduced version CenterPoint*. The results are shown in Table 11. The nuScenes dataset contains 1000 complex driving scenarios, of which 700 are used for training, 150 for validation, and 150 for testing, and it covers 10 detection categories. The experiment was run on an RTX 3090 GPU and trained with the Adam optimizer, using a weight decay of 0.01, an initial learning rate of 0.003, a batch size of 2, and 30 training epochs. The results show that the overall detection performance of the proposed S-FusionNet network improves, with the mean average precision (mAP) and the nuScenes detection score (NDS) increasing by 5.07% and 3.14%, respectively.
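For clarity, the NDS metric reported here combines mAP with five true-positive error terms, following the standard nuScenes definition. The small helper below shows the computation; the error values in the example are illustrative, not results reported in this paper.

```python
def nuscenes_detection_score(mAP: float, tp_errors: dict) -> float:
    """Standard nuScenes NDS: (1/10) * [5 * mAP + sum(1 - min(1, mTP))]."""
    assert len(tp_errors) == 5, "expects mATE, mASE, mAOE, mAVE, mAAE"
    tp_term = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return (5.0 * mAP + tp_term) / 10.0

# Example with illustrative error values only.
print(nuscenes_detection_score(0.6489, {"mATE": 0.30, "mASE": 0.25,
                                        "mAOE": 0.35, "mAVE": 0.30, "mAAE": 0.20}))
```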

5. Conclusions

In the projection-based multi-modal 3D object detection task, matching point cloud and image data slows inference because of the projection operation, and existing projection interpolation methods have significant limitations. To address these issues, this paper proposes the voxel projection optimization localization strategy and the neighborhood-enhanced feature interpolation method. At the same time, a multi-modal semi-fusion network, S-FusionNet, is designed, which combines the speed advantage of unimodal 3D object detection networks with the strong feature extraction capability of multi-modal 3D object detection.
To verify the effectiveness of the proposed methods, we conducted ablation experiments to verify the role of each module in overall performance improvement. Additionally, the S-FusionNet network model was submitted to the KITTI server for generalization capability testing. We focused on evaluating the performance of the S-FusionNet network in the three key categories of “Car”, “Pedestrian”, and “Cyclist” and compared it with the current state-of-the-art multi-modal 3D detection methods.
Finally, the robustness of the S-FusionNet network was examined through a series of experiments covering stability and generalization. Stability was tested by setting angular error thresholds of 0.05, 0.1, and 0.2 degrees. Generalization was tested under different environments and data distributions using the combinations “PV-RCNN+KITTI”, “Focals Conv-F+VoD”, and “CenterPoint+nuScenes”. The experimental results show that, even in changeable and unstable detection environments, the S-FusionNet network, which employs both the voxel projection optimization localization strategy and the neighborhood-enhanced feature interpolation method, maintains good detection performance.
The code and models will be made available at https://github.com/baowenzhang/S-FusionNet (accessed on 12 May 2025).

Author Contributions

Conceptualization, B.Z.; methodology, B.Z.; software, B.Z.; validation, G.C., C.S. and B.Z.; formal analysis, B.Z.; investigation, B.Z.; resources, G.C.; data curation, B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, B.Z.; visualization, B.Z.; supervision, G.C.; project administration, G.C.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

VPOL: Voxel Projection Optimization Localization
PE-RCNN: Projection-Enhanced RCNN
A-FocalsConv-F: Accelerated Focal Sparse Convolution Fusion
AE-FocalsConv-F: Accelerated Enhanced Focal Sparse Convolution Fusion Module
NE Interpolation: Neighborhood-Enhanced Feature Interpolation
S-FusionNet: Multi-Modal Semi-Fusion Network

References

  1. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From Points to Parts: 3D Object Detection from Point Cloud with Part-Aware and Part-Aggregation Network. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2647–2664. [Google Scholar] [CrossRef]
  2. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  3. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  4. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  5. Liu, Z.; Hu, H.; Cao, Y.; Zhang, Z.; Tong, X. A closer look at local aggregation operators in point cloud analysis. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIII 16; Springer: Cham, Switzerland, 2020; pp. 326–342. [Google Scholar]
  6. Chen, X.; Kundu, K.; Zhu, Y.; Ma, H.; Fidler, S.; Urtasun, R. 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1259–1272. [Google Scholar] [CrossRef] [PubMed]
  7. Luo, W.; Yang, B.; Urtasun, R. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3569–3577. [Google Scholar]
  8. Yang, B.; Luo, W.; Urtasun, R. Pixor: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
  9. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  10. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F.; Zhou, B.; Zhao, H. Autoalign: Pixel-instance feature aggregation for multi-modal 3D object detection. arXiv 2022, arXiv:2201.06493. [Google Scholar]
  11. Wan, R.; Zhao, T.; Zhao, W. PTA-Det: Point Transformer Associating Point Cloud and Image for 3D Object Detection. Sensors 2023, 23, 3229. [Google Scholar] [CrossRef]
  12. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10386–10393. [Google Scholar]
  13. Ali, W.; Abdelkarim, S.; Zidan, M.; Zahran, M.; El Sallab, A. Yolo3d: End-to-end real-time 3D oriented object bounding box detection from lidar point cloud. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8 September 2018. [Google Scholar]
  14. Gribbon, K.T.; Bailey, D.G. A novel approach to real-time bilinear interpolation. In Proceedings of the DELTA 2004, Second IEEE International Workshop on Electronic Design, Test and Applications, Perth, Australia, 28–30 January 2004; pp. 126–131. [Google Scholar]
  15. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437. [Google Scholar]
  16. Sindagi, V.A.; Zhou, Y.; Tuzel, O. Mvx-net: Multimodal voxelnet for 3D object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
  17. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7345–7353. [Google Scholar]
  18. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3D object detection. In Computer Vision–ECCV 2020, Proceedings of 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 720–736. [Google Scholar]
  19. Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3D object detection. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16; Springer: Cham, Switzerland, 2020; pp. 35–52. [Google Scholar]
  20. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar]
  21. Deng, J.; Zhou, W.; Zhang, Y.; Li, H. From multi-view to hollow-3D: Hallucinated hollow-3D R-CNN for 3D object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4722–4734. [Google Scholar] [CrossRef]
  22. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. Density-based spatial clustering of applications with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  23. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef] [PubMed]
  24. Suhr, J.K.; Jang, J.; Min, D.; Jung, H.G. Sensor fusion-based low-cost vehicle localization system for complex urban environments. IEEE Trans. Intell. Transp. Syst. 2016, 18, 1078–1086. [Google Scholar] [CrossRef]
  25. Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
  26. Wang, Y.; Mao, Q.; Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Li, H.; Zhang, Y. Multi-modal 3D object detection in autonomous driving: A survey. Int. J. Comput. Vis. 2023, 131, 2122–2152. [Google Scholar] [CrossRef]
  27. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  28. Singh, A. Surround-view vision-based 3D detection for autonomous driving: A survey. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 3235–3244. [Google Scholar]
  29. Singh, A. Transformer-based sensor fusion for autonomous driving: A survey. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 3312–3317. [Google Scholar]
  30. Wang, X.; Li, K.; Chehri, A. Multi-sensor fusion technology for 3D object detection in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. 2023, 25, 1148–1165. [Google Scholar] [CrossRef]
  31. Peng, Y.; Qin, Y.; Tang, X.; Zhang, Z.; Deng, L. Survey on image and point-cloud fusion-based object detection in autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22772–22789. [Google Scholar] [CrossRef]
  32. Wang, L.; Zhang, X.; Song, Z.; Bi, J.; Zhang, G.; Wei, H.; Tang, L.; Yang, L.; Li, J.; Jia, C. Multi-modal 3D object detection in autonomous driving: A survey and taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798. [Google Scholar] [CrossRef]
  33. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253. [Google Scholar]
  34. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  35. Shin, K.; Kwon, Y.P.; Tomizuka, M. Roarnet: A robust 3D object detection based on region approximation refinement. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2510–2515. [Google Scholar]
  36. Wang, Z.; Jia, K. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1742–1749. [Google Scholar]
  37. Fayyad, J.; Jaradat, M.A.; Gruyer, D.; Najjaran, H. Deep Learning Sensor Fusion for Autonomous Vehicle Perception and Localization: A Review. Sensors 2020, 20, 4220. [Google Scholar] [CrossRef] [PubMed]
  38. Asvadi, A.; Garrote, L.; Premebida, C.; Peixoto, P.; Nunes, U.J. Multimodal vehicle detection: Fusing 3D-LIDAR and color camera data. Pattern Recognit. Lett. 2018, 115, 20–29. [Google Scholar] [CrossRef]
  39. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  40. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  41. Sun, Y.; Zuo, W.; Yun, P.; Wang, H.; Liu, M. FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Trans. Autom. Sci. Eng. 2020, 18, 1000–1011. [Google Scholar] [CrossRef]
  42. Wang, G.; Tian, B.; Zhang, Y.; Chen, L.; Cao, D.; Wu, J. Multi-view adaptive fusion network for 3D object detection. arXiv 2020, arXiv:2011.00652. [Google Scholar] [CrossRef]
  43. Strecha, C.; Von Hansen, W.; Van Gool, L.; Fua, P.; Thoennessen, U. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  44. Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Mao, Q.; Li, H.; Zhang, Y. Vpfnet: Improving 3D object detection with virtual point based lidar and stereo data fusion. IEEE Trans. Multimed. 2022, 25, 5291–5304. [Google Scholar] [CrossRef]
  45. Waltz, E. Multisensor Data Fusion; Artech House: Norwood, MA, USA, 1990. [Google Scholar]
  46. Hall, D.L.; Llinas, J. An introduction to multisensor data fusion. Proc. IEEE 1997, 85, 6–23. [Google Scholar] [CrossRef]
  47. Hall, D.; Llinas, J. Multisensor Data Fusion; CRC Press: Boca Raton, FL, USA, 2001. [Google Scholar]
  48. Chen, C.; Fragonara, L.Z.; Tsourdos, A. RoIFusion: 3D object detection from LiDAR and vision. IEEE Access 2021, 9, 51710–51721. [Google Scholar] [CrossRef]
  49. Pang, S.; Morris, D.; Radha, H. Fast-CLOCs: Fast camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 187–196. [Google Scholar]
  50. Meyer, G.P.; Charland, J.; Hegde, D.; Laddha, A.; Vallespi-Gonzalez, C. Sensor fusion for joint 3D object detection and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  51. Wang, C.; Ma, C.; Zhu, M.; Yang, X. Pointaugmenting: Cross-modal augmentation for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11794–11803. [Google Scholar]
  52. Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. Fusionpainting: Multimodal fusion with adaptive attention for 3D object detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 3047–3054. [Google Scholar]
  53. Nabati, R.; Qi, H. CenterFusion: Center-based radar and camera fusion for 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1527–1536. [Google Scholar]
  54. Simon, M.; Amende, K.; Kraus, A.; Honer, J.; Samann, T.; Kaulbersch, H.; Milz, S.; Michael Gross, H. Complexer-yolo: Real-time 3D object detection and tracking on semantic point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  55. Meyer, G.P.; Laddha, A.; Kee, E.; Vallespi-Gonzalez, C.; Wellington, C.K. Lasernet: An efficient probabilistic 3D object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12677–12686. [Google Scholar]
  56. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
  57. Song, Z.; Zhang, G.; Xie, J.; Liu, L.; Jia, C.; Xu, S.; Wang, Z. Voxelnextfusion: A simple, unified and effective voxel fusion framework for multi-modal 3D object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  58. Li, Y.; Qi, X.; Chen, Y.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Voxel field fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1120–1129. [Google Scholar]
  59. Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 22014–22024. [Google Scholar]
  60. Wang, Z.; Zhan, W.; Tomizuka, M. Fusing bird view lidar point cloud and front view camera image for deep object detection. arXiv 2017, arXiv:1711.06703. [Google Scholar] [CrossRef]
  61. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  62. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  63. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; pp. 1201–1209. [Google Scholar]
  64. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  65. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar]
  66. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  67. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  68. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  69. Wu, H.; Wen, C.; Shi, S.; Wang, C. Virtual Sparse Convolution for Multimodal 3D Object Detection. arXiv 2023, arXiv:2303.02314. [Google Scholar]
  70. Dong, Z.; Ji, H.; Huang, X.; Zhang, W.; Zhan, X.; Chen, J. PeP: A Point enhanced Painting method for unified point cloud tasks. arXiv 2023, arXiv:2310.07591. [Google Scholar]
  71. Dong Hoang, H.A.; Yoo, M. 3ONet: 3-D Detector for Occluded Object Under Obstructed Conditions. IEEE Sens. J. 2023, 23, 18879–18892. [Google Scholar] [CrossRef]
  72. Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-Equivariant 3D Object Detection for Autonomous Driving. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2795–2802. [Google Scholar]
  73. Palffy, A.; Pool, E.; Baratam, S.; Kooij, J.F.; Gavrila, D.M. Multi-class road user detection with 3+1D radar in the View-of-Delft dataset. IEEE Robot. Autom. Lett. 2022, 7, 4961–4968. [Google Scholar] [CrossRef]
  74. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. Nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11618–11628. [Google Scholar]
Figure 1. Schematic illustration of point cloud projection onto an image.
Figure 2. Illustration of the architecture of S-FusionNet.
Figure 3. Illustration of the architecture of PE-CNNs.
Figure 4. Process of the VPOL strategy.
Figure 5. Neighborhood-enhanced feature interpolation method.
Figure 6. Feature extraction network detailed structure.
Figure 7. Precision–recall (PR) curves of the S-FusionNet method for 3D object detection, bird’s-eye view (BEV), and orientation estimation tasks.
Figure 8. The qualitative results of “Car” detection in three different scenarios on the KITTI test set. (a), (b), and (c) respectively represent three different scenarios.
Figure 9. The qualitative results of “Pedestrian” detection in three different scenarios on the KITTI test set. (a), (b), and (c) respectively represent three different scenarios.
Figure 10. The qualitative results of “Cyclist” detection in three different scenarios on the KITTI test set. (a), (b), and (c) respectively represent three different scenarios.
Table 1. Comparative results of the KITTI 3D object validation set. L denotes LiDAR data, and R denotes RGB image data. The best detection accuracy is indicated in bold black font. All AP values are AP3D (%).

| Method (L+R) | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard | 3D mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AVOD-FPN [40] | 84.41 | 74.44 | 68.65 | - | 58.80 | - | - | 49.70 | - | - |
| PTA-Det [11] | 84.72 | 74.45 | 69.86 | 60.84 | 52.48 | 45.11 | 72.43 | 49.17 | 46.75 | 61.76 |
| PointFusion [33] | 77.92 | 63.00 | 53.27 | 33.36 | 28.04 | 23.38 | 49.34 | 29.42 | 26.98 | 42.75 |
| F-PointNet [34] | 83.76 | 70.92 | 63.65 | 70.00 | 61.32 | 53.59 | 77.15 | 56.49 | 53.37 | 65.58 |
| CLOCs [12] | 89.49 | 79.31 | 77.36 | 62.88 | 56.20 | 50.10 | 87.57 | 67.92 | 63.67 | 70.50 |
| EPNet [19] | 88.76 | 78.65 | 78.32 | 66.74 | 59.29 | 54.82 | 83.88 | 65.50 | 62.70 | 70.96 |
| 3D-CVF [18] | 89.20 | 80.05 | 73.11 | - | - | - | - | - | - | - |
| Voxel R-CNN [63] | 92.38 | 85.29 | 82.86 | - | - | - | - | - | - | - |
| FocalsConv-F [15] | 92.26 | 85.32 | 82.95 | - | - | - | - | - | - | - |
| Ours | 92.87 | 85.46 | 83.13 | 67.77 | 60.75 | 55.93 | 87.75 | 72.44 | 68.57 | 74.97 |
Table 2. Comparative results of the KITTI 3D object test set. L denotes LiDAR data, and R denotes RGB image data. The best detection accuracy is indicated in bold black font. All AP values are AP3D (%).

| Method (L+R) | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard | 3D mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MV3D [39] | 74.97 | 63.63 | 54.00 | - | - | - | - | - | - | - |
| ContFuse [56] | 83.68 | 68.78 | 61.67 | - | - | - | - | - | - | - |
| MMF [16] | 88.40 | 77.43 | 70.22 | - | - | - | - | - | - | - |
| EPNet [19] | 89.81 | 79.28 | 74.59 | - | - | - | - | - | - | - |
| 3D-CVF [18] | 89.20 | 80.05 | 73.11 | - | - | - | - | - | - | - |
| CLOCs [12] | 88.94 | 80.67 | 77.15 | - | - | - | - | - | - | - |
| AVOD-FPN [40] | 83.07 | 71.76 | 65.73 | 50.46 | 42.27 | 39.04 | 63.76 | 50.55 | 44.93 | 56.84 |
| F-PointNet [34] | 82.19 | 69.79 | 60.59 | 50.53 | 42.15 | 38.08 | 72.27 | 56.12 | 49.01 | 57.86 |
| F-ConvNet [36] | 87.36 | 76.39 | 66.69 | 52.16 | 43.38 | 38.80 | 81.98 | 65.07 | 56.54 | 63.15 |
| FocalsConv-F [15] | 90.55 | 82.28 | 77.59 | - | - | - | - | - | - | - |
| Ours | 90.13 | 81.35 | 76.90 | 52.78 | 44.22 | 41.62 | 79.36 | 64.17 | 59.26 | 65.53 |
Table 3. On the KITTI 3D test set, we comprehensively compared the S-FusionNet model with cutting-edge methods in the dataset’s ranking list.

| Method | Multi-Modal | Runtime (ms) | Car-AP3D Easy (%) | Car-AP3D Mod (%) | Car-AP3D Hard (%) | mAP (%) |
| --- | --- | --- | --- | --- | --- | --- |
| VirConv-S [69] | Full-Fusion | 90 | 92.48 | 87.20 | 82.45 | 87.38 |
| PEP [70] | Full-Fusion | 100 | 91.77 | 86.72 | 82.57 | 87.02 |
| VirConv-T [69] | Full-Fusion | 90 | 92.54 | 86.25 | 81.24 | 86.67 |
| 3ONet [71] | Full-Fusion | 100 | 92.03 | 85.47 | 78.64 | 85.38 |
| TED [72] | Full-Fusion | 100 | 91.61 | 85.28 | 80.68 | 85.86 |
| S-FusionNet | Semi-Fusion | 98.9 | 90.13 | 81.35 | 76.90 | 82.79 |
Table 4. Ablation study of the VPOL strategy module in AP3D (R40) on the KITTI validation set. Here, ‘×’ indicates that the VPOL strategy is not introduced, and ‘✓’ indicates that the VPOL strategy is introduced. All AP values are AP3D (%).

| VPOL Strategy | Stage | FPS | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| × | (1,) | 6.7 | 92.96 | 85.52 | 84.90 | 71.84 | 64.73 | 59.87 | 88.63 | 71.72 | 68.63 |
| ✓ | (Stem,) | 10.78 | 92.51 | 84.10 | 82.62 | 67.09 | 58.57 | 53.68 | 87.68 | 71.14 | 68.26 |
| Delta | | +4.08 | −0.45 | −1.42 | −2.28 | −4.75 | −6.16 | −6.19 | −0.95 | −0.58 | −0.37 |
Table 5. Ablation study of the neighborhood-enhanced feature interpolation method in AP3D (R40) on the KITTI validation set. All AP values are AP3D (%).

| Method | FPS | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard | 3D mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A-FocalsConv-F | 10.78 | 92.51 | 84.10 | 82.62 | 67.09 | 58.57 | 53.68 | 87.68 | 71.14 | 68.26 | 73.96 |
| AE-FocalsConv-F | 10.11 | 92.87 | 85.46 | 83.13 | 67.77 | 60.75 | 55.93 | 87.75 | 72.44 | 68.57 | 74.96 |
| Delta | −0.67 | +0.36 | +1.36 | +0.51 | +0.68 | +2.18 | +2.25 | +0.07 | +1.3 | +0.31 | +1 |
Table 6. Ablation experiments on the full-fusion/semi-fusion network design in AP3D (R40) on the KITTI validation set. All AP values are AP3D (%).

| Method | Training Time (h) | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full-Fusion | 43.75 | 92.67 | 85.07 | 82.88 | 65.16 | 58.21 | 53.16 | 90.79 | 72.86 | 69.62 |
| Semi-Fusion | 39.40 | 92.87 | 85.46 | 83.13 | 67.77 | 60.75 | 55.93 | 87.75 | 72.44 | 68.57 |
| Delta | +4.35 | +0.2 | +0.39 | +0.25 | +2.61 | +2.54 | +2.77 | −3.04 | −0.42 | −1.05 |
Table 7. Experiments on the flexibility and scalability of the S-FusionNet network. All AP values are AP3D (%).

| Input (AE-FocalsConv-F) | Time (ms) | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard | 3D mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lidar (×) | 71.2 | 92.41 | 82.85 | 80.28 | 64.74 | 56.76 | 50.91 | 89.31 | 74.59 | 70.01 | 73.54 |
| Lidar + RGB (✓) | 98.9 | 92.87 | 85.46 | 83.13 | 67.77 | 60.75 | 55.93 | 87.75 | 72.44 | 68.57 | 74.96 |
| Delta | −27.7 | +0.46 | +2.61 | +2.85 | +3.03 | +3.99 | +5.02 | −1.56 | −2.15 | −1.44 | +1.42 |
Table 8. The influence on the model’s stability when the angular error is 0.05 degrees, 0.1 degrees, and 0.2 degrees. All AP values are AP3D (%).

| Method | Angular Error (°) | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A-FocalsConv-F | a. (0, 0, 0) | 92.51 | 84.10 | 82.62 | 67.09 | 58.57 | 53.68 | 87.68 | 71.14 | 68.26 |
| | b. (0.05, 0.05, 0.05) | 92.40 | 82.30 | 81.98 | 67.68 | 59.46 | 54.94 | 87.36 | 70.92 | 67.77 |
| | c. (0.1, 0.1, 0.1) | 89.81 | 78.93 | 77.03 | 67.82 | 59.39 | 54.19 | 88.08 | 71.04 | 67.14 |
| | d. (0.2, 0.2, 0.2) | 82.11 | 62.70 | 61.79 | 65.68 | 57.19 | 52.02 | 83.55 | 63.12 | 59.32 |
| | Delta (b − a) | −0.11 | −1.8 | −0.64 | +0.59 | +0.89 | +1.26 | −0.32 | −0.22 | −0.49 |
| | Delta (c − a) | −2.7 | −5.17 | −5.59 | +0.73 | +0.82 | +0.51 | +0.4 | −0.1 | −1.12 |
| | Delta (d − a) | −10.40 | −21.40 | −20.83 | −1.41 | −1.38 | −1.66 | −4.13 | −8.02 | −8.94 |
| AE-FocalsConv-F | A. (0, 0, 0) | 92.87 | 85.46 | 83.13 | 67.77 | 60.75 | 55.93 | 87.75 | 72.44 | 68.57 |
| | B. (0.05, 0.05, 0.05) | 92.79 | 85.36 | 82.97 | 68.53 | 61.37 | 56.49 | 87.89 | 71.50 | 68.03 |
| | C. (0.1, 0.1, 0.1) | 92.04 | 82.09 | 79.91 | 68.20 | 60.94 | 56.17 | 87.32 | 70.71 | 67.41 |
| | D. (0.2, 0.2, 0.2) | 84.68 | 67.92 | 66.95 | 66.35 | 59.04 | 54.05 | 85.18 | 65.10 | 61.88 |
| | Delta (B − A) | −0.08 | −0.1 | −0.16 | +0.76 | +0.62 | +0.56 | +0.14 | −0.94 | −0.54 |
| | Delta (C − A) | −0.83 | −3.37 | −3.22 | +0.43 | +0.19 | +0.24 | −0.43 | −1.73 | −1.16 |
| | Delta (D − A) | −8.19 | −17.54 | −16.18 | −1.42 | −1.71 | −1.88 | −2.57 | −7.34 | −6.69 |
Table 9. The generalization ability of the S-FusionNet network is tested on the benchmark framework PV-RCNN using the KITTI validation set. All values are AP (%).

| Method | View | Car Easy | Car Mod | Car Hard | Ped. Easy | Ped. Mod | Ped. Hard | Cyc. Easy | Cyc. Mod | Cyc. Hard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PV-RCNN* | 3D | 90.89 | 82.67 | 80.11 | 54.46 | 48.37 | 43.13 | 86.57 | 70.94 | 65.33 |
| S-FusionNet | 3D | 92.05 | 82.82 | 82.35 | 64.29 | 56.97 | 53.13 | 87.71 | 72.84 | 68.76 |
| PV-RCNN* | BEV | 93.16 | 89.38 | 88.28 | 59.39 | 52.99 | 48.69 | 88.76 | 73.46 | 68.82 |
| S-FusionNet | BEV | 95.21 | 90.24 | 88.34 | 69.30 | 62.03 | 58.47 | 89.58 | 75.84 | 71.96 |
Table 10. The generalization ability of the S-FusionNet network is tested on the benchmark framework FocalsConv-F using the VoD validation set.

| Method | Time (ms) | Entire Area Car | Entire Area Ped. | Entire Area Cyc. | Entire Area mAP | Corridor Car | Corridor Ped. | Corridor Cyc. | Corridor mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FocalsConv-F* | 178 | 61.83 | 49.53 | 65.86 | 59.07 | 90.64 | 64.00 | 82.93 | 79.19 |
| S-FusionNet | 151 | 59.65 | 50.75 | 67.07 | 59.16 | 90.06 | 66.48 | 86.55 | 81.03 |
| Delta | −27 | −2.18 | +1.22 | +1.21 | +0.11 | −0.58 | +2.48 | +3.62 | +1.84 |
Table 11. The generalization ability of the S-FusionNet network is tested on the benchmark framework CenterPoint using the nuScenes validation set.

| Method | mAP | NDS | Car | Truck | C.V. | Bus | Trailer | B.R. | M.T. | Bicycle | Ped. | T.C. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CenterPoint* | 59.82 | 67.21 | 86.55 | 58.17 | 16.32 | 73.43 | 38.31 | 67.15 | 60.38 | 42.89 | 84.35 | 70.65 |
| S-FusionNet | 64.89 | 70.35 | 88.07 | 60.71 | 24.98 | 74.29 | 42.96 | 68.30 | 72.05 | 55.42 | 84.96 | 77.13 |
| Delta | +5.07 | +3.14 | +1.52 | +2.54 | +8.66 | +0.86 | +4.65 | +1.15 | +11.67 | +12.53 | +0.61 | +6.48 |