Article

PRNet: 3D Object Detection Network-Based on Point-Region Fusion

1 Department of Electrical and Electronic Engineering, University of Liverpool, Liverpool L3 5UB, UK
2 College of Information Engineering, East China Jiaotong University, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3759; https://doi.org/10.3390/app15073759
Submission received: 23 February 2025 / Revised: 23 March 2025 / Accepted: 26 March 2025 / Published: 29 March 2025

Abstract

Object detection is a pivotal task in autonomous driving, where reliance on single-modality information often proves inadequate for high-precision detection. Current point-cloud-based object detection networks identify objects with dense point clouds effectively, but they struggle with detection tasks involving sparser point clouds. To address this issue, this paper proposes a 3D object detection network based on the fusion of point clouds and images. The network employs a fusion module named PRF (Point-Region Fusion), which uses the K-Nearest Neighbors (KNN) algorithm to find the K nearest points for each point cloud feature, gathers the corresponding regional features from the image feature maps, aggregates them, and fuses them with the point cloud features. The designed Image Feature Fusion module (IF-Fusion) fuses image feature maps of varying sizes in a pairwise manner, which is key to preserving the features of small objects and enriching the point cloud features. In evaluations on the KITTI benchmark, the proposed method surpasses prior fusion networks in detection accuracy, achieving 91.95%, 81.10%, and 78.08% on the easy, moderate, and hard difficulty levels, respectively.

1. Introduction

As the cornerstone of future transportation systems, autonomous driving technology places ever-growing demands on vehicle perception and environmental understanding, within which 3D object detection plays a decisive role [1,2,3]. The rise of autonomous driving has sparked an urgent need for efficient and accurate object detection methods. In complex road environments, object detection requires not only real-time performance but also reliable recognition of objects of various sizes and densities.
Cameras are the standard data acquisition sensors for both 2D and 3D object detection, providing RGB images rich in texture information [4,5,6]. Although these images offer high resolution, they lack depth perception, a limitation that substantially hampers precise three-dimensional localization [7,8]. Point cloud data, acquired by LiDAR, is a common representation of 3D scenes and plays an important role in object detection. LiDAR-based 3D object detection has received much attention because it provides accurate depth information, which is essential for precise spatial positioning. However, LiDAR data often suffers from sparsity [9], especially in complex environments, which limits its effectiveness in detecting small or distant objects [10]. Small object detection and feature extraction are also a challenge in the field of 3D object detection: small objects often have a low signal-to-noise ratio and fuzzy appearance, which makes them more likely to be misclassified or missed. Existing methods struggle to identify and locate small objects because they fail to extract and exploit the key feature information of small objects effectively.
Conventional methods of fusion often prove inadequate in addressing these challenges, as they neither sufficiently enhance the representational capabilities of point clouds nor effectively maintain and amplify the feature information of small objects during the fusion process. In response to these challenges, this paper introduces a 3D object detection network based on point cloud and RGB images. The network utilizes point cloud data from a 64-beam LiDAR and RGB images from a monocular camera to achieve vehicle detection. The principal contributions of this study are as follows:
(1)
This study introduces a novel point-region fusion module (PRF) designed to integrate region-specific features from images with corresponding point clouds, thereby enhancing the performance of the fusion process.
(2)
This study develops an image feature fusion module (IF-Fusion), which ingeniously combines image feature maps of various sizes. This approach is specifically aimed at preserving the features of small objects while augmenting the overall expressive capability of the point cloud features.
(3)
Experimental results based on the KITTI benchmark demonstrate that the proposed method achieves significant advancements over previous fusion networks. This underscores the efficacy and innovative nature of the proposed approach in the realm of point cloud object detection.
In this paper, Section 2 discusses the related work in the field of 3D object detection. Section 3 presents the comprehensive framework of the network. Section 4 details the experimental protocols, conducts ablation studies and analyses the results. Finally, Section 5 concludes this study and outlines projections for future research directions.

2. Related Work

2.1. Methods Based on Single-Modality

The point cloud offers a fully three-dimensional reconstruction of the scene, providing rich geometric, shape and scale information [11]. In image-based 3D object detection, networks fall into two categories: 3D object detection networks based on result mapping [12,13,14,15,16,17] and 3D object detection networks based on feature mapping [18,19,20]. Networks based on result mapping typically first predict semantic information from images, such as two-dimensional bounding boxes, three-dimensional sizes, and distances of the objects. They then exploit the constraint relationships between 2D and 3D space to map these predictions into three-dimensional space, thereby obtaining the three-dimensional bounding boxes of the objects. Deep3DBox, proposed by Mousavian et al. [12], uses the constraint that the perspective projection of the three-dimensional bounding box should touch at least one edge of the two-dimensional bounding box to estimate the pose and size of the three-dimensional box and realize 3D object detection. Such networks have the advantage of fast processing, but their estimation of 3D coordinates is very limited. Networks based on feature mapping generally first map 2D image features to 3D space and then perform 3D object detection on the resulting 3D spatial features. Several researchers [18,19] start by estimating the depth of images and then use this depth information to project the corresponding pixels into 3D space, creating a pseudo point cloud that is subsequently used for 3D object detection. Compared with the former category, these networks provide richer spatial structure information and improve detection accuracy and robustness. However, depth estimation is still not accurate enough and introduces a large computational burden.
Several researchers have exclusively utilized LiDAR point clouds for 3D object detection, considerably enhancing task precision. They employ point-based representation learning strategies [21,22,23], which directly derive features from the raw point clouds. Conversely, other researchers transform these point clouds into standard 2D grid representations [24,25,26] or 3D voxel representations [27,28,29,30,31,32,33]. Shi et al. [34] employed PointNet++ as their feature extraction network. Using this, they generated Regions of Interest (ROIs) and conducted parameter regression for accomplishing the 3D object detection task. Li et al. [35] projected LiDAR point clouds onto a front-view perspective to acquire a 2D grid. However, this front-view 2D grid encounters significant occlusion and overlap challenges. Yang et al. [36,37] introduced HDNet and PIXOR, which project LiDAR point clouds onto a bird’s-eye view (BEV) to create a 2D grid. Zhou et al. [27] pioneered VoxelNet, an end-to-end trainable deep neural network. This network transforms LiDAR point clouds into uniformly distributed voxels in 3D space, utilizing conventional 3D CNNs for feature extraction.
Most point-based 3D object detection networks rely on PointNet++ [38], but PointNet++ cannot encode the complex geometric information of a point's neighborhood, and the importance of each point in the neighborhood is ignored during encoding, leading to feature redundancy or loss. Grid-based 3D object detection networks lose information along the projection dimension when projecting the point cloud, so their accuracy is usually not high. Voxel-based 3D object detection networks achieve detection accuracy at the high end among 3D object detection networks. However, they lose point cloud information during voxelization and inevitably lose further information when the input voxels are downsampled during feature extraction.

2.2. Methods Based on Multi-Modality Fusion

Methods based on visual imagery predominantly provide textural information but lack depth, whereas approaches based on LiDAR point clouds offer spatial geometric data but miss textural details. Textural data plays a crucial role in object detection and classification, while depth information is instrumental in estimating the spatial position of objects. Integrating image and LiDAR point cloud data to improve overall performance is therefore a prominent research trend in 3D object detection. Qi et al. introduced F-PointNet [39], which uses image data to generate high-quality 2D candidate boxes; these are then projected into 3D space to extract point cloud features and produce 3D candidate boxes. Chen et al. developed MV3D [40], which uses point clouds to construct corresponding front views and bird's-eye views that, together with RGB images, serve as inputs to three distinct feature extraction networks; 3D object detection, classification, and bounding box regression are achieved by integrating the feature maps from all three sources. Diverging from MV3D, Ku et al. introduced AVOD [41], which uses only image data and bird's-eye views generated from encoded point cloud data as network inputs. MVX-Net [42] employs two straightforward yet effective early fusion methods, Point-Fusion and Voxel-Fusion, blending the textural information of visual imagery with the spatial geometric information of point clouds to achieve high-precision object detection. ContFuse [10] introduces an innovative approach that identifies the points in camera images corresponding to each point cloud point (and its nearest neighbors) and merges the interpolated camera features with each point, achieving a point-wise correspondence. CrossFusion [43] employs a bidirectional fusion strategy, integrating point clouds and images at the anchor level. Whereas ContFuse performs point-level fusion, CrossFusion performs anchor-level fusion, which can lead to feature misalignment. The objective of this paper is to design a point-based 3D object detection network that enhances the feature representation capability of point clouds by fusing the corresponding image region features with the point cloud.

3. PRNet Framework

In this section, the architecture and implementation of a 3D object detection network based on point-region fusion are detailed. As shown in Figure 1, the network employs PointNet++ as its backbone for processing point cloud data, incorporating 4 set abstraction (SA) and 4 feature propagation (FP) modules. For image data, a straightforward convolutional neural network serves as the backbone, generating 4 feature maps of varying sizes. The first 4 Point-Region Fusion (PRF) modules merge features from the point cloud and images. Subsequently, IF-Fusion integrates the image feature maps, which are then combined with the output of the final FP module through a further PRF module. The resultant point cloud features are used to segment the point cloud; the foreground points identified in this segmentation are used to generate 3D proposal boxes, which are refined to yield the final detection results.

3.1. Feature Extractors for Point Cloud and Image

In this paper, the study selects a point cloud range of [−40, 40] × [0, 70] meters, aligning with the camera’s field of view. Subsequently, the study normalizes the number of points to 4096. While some networks transform point cloud data into formats like voxels or bird’s eye view (BEV), the study opts to utilize raw point clouds as input to retain the most authentic point cloud information. For point cloud processing, PointNet++ is chosen as the backbone of this paper. PointNet++ is a neural network architecture predicated on point set processing, adept at capturing both global and local information within point clouds through set operations. The network’s backbone is structured with 4 Set Abstraction (SA) modules and 4 Feature Propagation (FP) modules. The SA modules are tasked with down-sampling to extract global features, whereas the FP modules facilitate level-by-level feature propagation via interpolation and concatenation. This design strategy allows the network to more effectively manage issues of uneven point cloud density and sparsity, thereby enhancing the robustness of object detection. Concerning image data, this study employs a straightforward convolutional neural network as the backbone. This network generates four feature maps of varying sizes, with each map encompassing contextual information across different scales.
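As an illustration of this preprocessing step, the following is a minimal sketch, assuming the KITTI LiDAR convention (x forward, y lateral), an (N, 4) array of (x, y, z, intensity) values, and random sampling with replacement when fewer than 4096 points remain; the paper does not specify the sampling rule.

```python
import numpy as np

def crop_and_sample(points, num_points=4096):
    """Crop a LiDAR point cloud to the camera field of view and normalize its size.

    points: (N, 4) array of (x, y, z, intensity); x is forward, y is lateral (assumed).
    Returns an array of exactly `num_points` points.
    """
    # Keep points inside [-40, 40] m laterally and [0, 70] m ahead of the sensor.
    mask = (points[:, 1] >= -40) & (points[:, 1] <= 40) & \
           (points[:, 0] >= 0) & (points[:, 0] <= 70)
    cropped = points[mask]

    # Normalize the number of points by random (sub)sampling.
    replace = cropped.shape[0] < num_points
    idx = np.random.choice(cropped.shape[0], num_points, replace=replace)
    return cropped[idx]
```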

3.2. PRF Module

To enhance the integration of point cloud and image data, the study introduces the point-region fusion (PRF) module, a critical component within our network. This module encompasses four primary functions: projection, point searching, sampling, and fusion, as shown in Figure 2. The functionalities are delineated as follows:
The first step is projection. In this process, a point $p$ from the point cloud is projected onto the plane of the camera image. Each point is expressed in homogeneous coordinates as $p = (x, y, z, 1)^{T}$. Applying the mapping matrix $M$, the corresponding point $p'$ on the image plane is obtained:
$$p' = M \times p$$
Here, $M$ is a $3 \times 4$ mapping matrix, the point $p$ in the point cloud is a four-dimensional vector in homogeneous coordinates, and the point $p'$ on the image plane is a three-dimensional vector in homogeneous coordinates.
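The following is a minimal sketch of this projection, assuming that $M$ is the composed $3 \times 4$ camera projection matrix (as in KITTI-style calibration) and adding an illustrative check for points behind the camera:

```python
import numpy as np

def project_points(points_xyz, M):
    """Project 3D points onto the image plane with a 3x4 mapping matrix M.

    points_xyz: (N, 3) array of point cloud coordinates.
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera.
    """
    N = points_xyz.shape[0]
    # Homogeneous coordinates p = (x, y, z, 1)^T.
    p_hom = np.hstack([points_xyz, np.ones((N, 1))])   # (N, 4)
    p_img = (M @ p_hom.T).T                            # (N, 3), p' = M p
    # Points behind the camera have a non-positive third component.
    in_front = p_img[:, 2] > 1e-6
    uv = p_img[:, :2] / p_img[:, 2:3]                  # perspective divide
    return uv, in_front
```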
The second step is point searching. After feature extraction through the set abstraction (SA) modules, the study obtains a set of points with feature representations, denoted $P = \{p_1, p_2, \ldots, p_n\}$. For each point $p_i$, the study employs the K-Nearest Neighbors (KNN) algorithm to identify its $K$ closest neighboring points, represented as $N_k(p_i)$:
$$N_k(p_i) = \mathrm{KNN}(p_i, k)$$
Here, $N_k(p_i)$ denotes the $K$ closest neighboring points of the point $p_i$. A diagram of KNN is shown in Figure 3.
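A minimal sketch of this neighbor search in PyTorch is given below; it uses a brute-force distance matrix for clarity, whereas the implementation described in Section 3.4 uses a KD-Tree:

```python
import torch

def knn_neighbors(points, k):
    """Return the indices of the k nearest neighbors of every point.

    points: (n, 3) tensor of point coordinates after the SA stage.
    Returns an (n, k) tensor of neighbor indices (the point itself excluded).
    """
    dists = torch.cdist(points, points)                  # (n, n) pairwise distances
    dists.fill_diagonal_(float("inf"))                   # do not count a point as its own neighbor
    _, idx = torch.topk(dists, k, dim=1, largest=False)  # k smallest distances per point
    return idx
```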
The third step is sampling. In this phase, the image features at the projected locations are sampled. For each point $p_i$ in the point cloud and its associated set of neighboring points $N_k(p_i)$, the study defines a sampling function $S$ to extract the corresponding features from the image:
$$R_i = S(\mathrm{Image}, M, N_k(p_i))$$
Here, $\mathrm{Image}$ denotes the input image data, and $R_i$ refers to the region of interest, or feature area, associated with point $p_i$ and its neighboring points $N_k(p_i)$, derived by sampling on the image.
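A minimal sketch of the sampling function $S$ is given below, under the assumptions that the image feature map is a (C, H, W) tensor, that projected pixel coordinates are rescaled by the feature-map stride, and that nearest-pixel gathering is used; the paper does not state the interpolation scheme:

```python
import torch

def sample_region_features(feat_map, uv, neighbor_idx, stride=4):
    """Gather image features for each point's k nearest neighbors.

    feat_map:     (C, H, W) image feature map.
    uv:           (n, 2) projected pixel coordinates of the n points (p' = M p).
    neighbor_idx: (n, k) indices from the KNN step.
    stride:       assumed downsampling factor between image and feature map.
    Returns R of shape (n, k, C): the region features for every point.
    """
    C, H, W = feat_map.shape
    # Pixel coordinates of each point's neighbors, rescaled to feature-map resolution.
    uv_nb = uv[neighbor_idx] / stride                    # (n, k, 2)
    u = uv_nb[..., 0].round().long().clamp(0, W - 1)
    v = uv_nb[..., 1].round().long().clamp(0, H - 1)
    # Nearest-pixel gather from the feature map.
    R = feat_map[:, v, u]                                # (C, n, k)
    return R.permute(1, 2, 0)                            # (n, k, C)
```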
The fourth step is fusion. In this final step, the image features are fused with the point cloud features. A convolution operation $C$ and a pooling operation $P$ are employed to extract features $F_i$ from the sampled region $R_i$. These extracted features are then integrated with the point cloud features $L_i$ acquired from the set abstraction (SA) module; an attention mechanism $A$ enables a weighted fusion of these features:
$$F_i = P(A(C(R_i)))$$
$$F_{final} = \mathrm{Concat}(F_i, L_i)$$
Here, the feature $F_i$ is obtained by applying the convolution operation $C$, weighting with the attention mechanism $A$, and then pooling with $P$; $L_i$ represents the intrinsic point cloud features of the point $p_i$. The final fused feature $F_{final}$ is then used in the ensuing object detection and segmentation tasks. The structure of this step is shown in Figure 4.
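A minimal sketch of this fusion step is given below, under several assumptions not specified in the text: the convolution $C$ is a 1 × 1 convolution over the neighbor features, the attention $A$ is a learned per-neighbor weighting normalized with a softmax, the pooling $P$ is a max over the $K$ neighbors, and the channel widths are illustrative rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class PRFFusion(nn.Module):
    """Fuse per-point region features R_i with the point features L_i from the SA module."""

    def __init__(self, img_channels=64):
        super().__init__()
        self.conv = nn.Conv1d(img_channels, img_channels, kernel_size=1)  # C: feature transform
        self.attn = nn.Conv1d(img_channels, 1, kernel_size=1)             # A: per-neighbor score

    def forward(self, R, L):
        """R: (n, k, C_img) region features; L: (n, C_pt) point features.
        Returns F_final = Concat(F_i, L_i) of shape (n, C_img + C_pt)."""
        x = self.conv(R.permute(0, 2, 1))          # (n, C_img, k)
        w = torch.softmax(self.attn(x), dim=-1)    # (n, 1, k) attention weights over neighbors
        x = x * w                                  # weighted neighbor features
        F_i = x.max(dim=-1).values                 # P: pool over the k neighbors -> (n, C_img)
        return torch.cat([F_i, L], dim=1)          # F_final
```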
The PRF module’s strength is its capacity to effectively integrate data from point clouds and images, capitalizing on their distinct characteristics across various modalities. This integration is enhanced by the incorporation of the k-nearest neighbors (KNN) algorithm and an attention mechanism, enabling the module to more astutely discern the contextual nuances surrounding the object. This enhances the accuracy and robustness of object detection. Within the entire network architecture, the application of the PRF module significantly improves the precision and completeness of the fusion between point cloud and image data, thereby bolstering the effectiveness of 3D object detection endeavors.

3.3. IF-Fusion

To optimally combine image features and augment the capability to represent small objects, the study introduces the Image Feature Fusion (IF-Fusion) module. This module is specifically designed to address the challenge of information loss encountered in the fusion of features across multiple scales and enhance the network’s ability to detect and accurately represent small objects. The overall structure of IF-Fusion is shown in Figure 5.
Initially, the process involves extracting features. For this purpose, a convolutional neural network f C N N is employed to analyze the input image I. This results in the generation of four distinct feature maps at varying scales, specifically maps F 1 , F 2 , F 3 and F 4 . The scale factor between each layer in the IF-Fusion module is set to 2, meaning that each subsequent feature map is downsampled by a factor of 2 compared to the previous layer. Here, F i is used to represent the feature map corresponding to the ith layer.
$$F_i = f_{CNN}(I), \quad i \in \{1, 2, 3, 4\}$$
Here, $F_i$ denotes the feature map of the $i$th layer, i.e., the output of the neural network $f_{CNN}$ applied to the input image $I$ at a given depth, capturing a spectrum of visual features from basic to advanced levels.
Subsequently, the process entails performing pairwise fusion. This is achieved by executing a straightforward concatenation operation on each pair of feature maps ( i , j ) . This method results in the creation of six intermediate fusion feature maps F i j , thereby efficiently integrating the feature information across different scales:
$$F_{ij} = \mathrm{Concat}(F_i, F_j), \quad i < j, \; i, j \in \{1, 2, 3, 4\}$$
In the final step, all six intermediate fusion feature maps F i j undergo a cascade fusion process via another fusion function G . This results in the production of the ultimate fused feature map F :
$$F = G(F_{12}, F_{13}, F_{14}, F_{23}, F_{24}, F_{34})$$
The optimization of feature representation is further enhanced through additional convolution and pooling operations, aiming to minimize information loss.
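A minimal sketch of IF-Fusion is given below, under stated assumptions: the coarser map in each pair is bilinearly upsampled before concatenation (the paper does not specify how the size mismatch between scales is handled), the cascade function $G$ is realized as upsampling all six intermediate maps to the finest resolution followed by concatenation and a 1 × 1 convolution, and the channel widths are illustrative.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F


class IFFusion(nn.Module):
    """Pairwise fusion of four multi-scale feature maps followed by a cascade fusion G."""

    def __init__(self, channels=(32, 64, 128, 256), out_channels=128):
        super().__init__()
        pair_channels = sum(channels[i] + channels[j]
                            for i, j in itertools.combinations(range(4), 2))
        self.cascade = nn.Conv2d(pair_channels, out_channels, kernel_size=1)  # G

    def forward(self, feats):
        """feats: list [F1, F2, F3, F4]; each map has half the spatial size of the previous one."""
        target = feats[0].shape[-2:]          # finest resolution
        pairs = []
        for i, j in itertools.combinations(range(4), 2):
            fi, fj = feats[i], feats[j]
            # Upsample the coarser map of the pair so the two maps can be concatenated.
            fj_up = F.interpolate(fj, size=fi.shape[-2:], mode="bilinear", align_corners=False)
            pairs.append(torch.cat([fi, fj_up], dim=1))   # F_ij
        # Cascade fusion: bring all six F_ij to the finest resolution and merge.
        pairs = [F.interpolate(p, size=target, mode="bilinear", align_corners=False)
                 for p in pairs]
        return self.cascade(torch.cat(pairs, dim=1))      # final fused map
```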
A significant strength of IF-Fusion is its ability to fully exploit feature information across various scales via a multi-stage fusion approach, notably improving the network’s sensitivity to small objects. The cascaded fusion structure employed in IF-Fusion notably mitigates information loss while simultaneously bolstering the network’s capacity to resist interference. This enhancement contributes significantly to the precision and robustness of 3D object detection. The incorporation of the IF-Fusion module plays a crucial role in elevating the overall performance of the network.

3.4. Computational Complexity Analysis

The main computation of the network is concentrated in the PRF module and IF-Fusion. The computational complexity of the projection operation is $O(N)$, where $N$ is the number of points in the point cloud. A KD-Tree is used in the KNN algorithm, so the computational complexity of point searching is $O(N \log N)$. The computational complexity of the sampling operation is $O(N \times K)$, where $K$ is the number of nearest neighbors of each point. The computational complexity of the fusion operation is $O(N \times K \times C)$, where $C$ is the number of feature channels. The computational complexity of each layer of the PRF module is therefore $O(N) + O(N \log N) + O(N \times K) + O(N \times K \times C)$. The computational complexity of IF-Fusion is $O(L \times H \times W \times C \times C') + O(M \times H \times W \times C) + O(6 \times H \times W \times C)$, where $O(L \times H \times W \times C \times C')$ is the complexity of feature extraction, $O(M \times H \times W \times C)$ is the complexity of pairwise fusion, and $O(6 \times H \times W \times C)$ is the complexity of cascade fusion. Here $H \times W$ is the spatial size of the feature map, $C$ is the number of input channels, $C'$ is the number of output channels, and $L$ is the number of layers of the convolutional neural network.
The total computational cost of the PRF module is approximately 4.5 million FLOPs, and that of the IF-Fusion module is approximately 1541 million FLOPs. Because the CPU handles sequential tasks efficiently, the PRF module is computed on the CPU, while IF-Fusion is computed on the GPU. The inference time is about 90 ms per frame on two NVIDIA GTX 1080Ti GPUs (NVIDIA, Santa Clara, CA, USA).
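The following back-of-the-envelope sketch shows how these complexity terms scale, using assumed values (N = 4096, K = 4, and illustrative channel widths and feature-map sizes) rather than the paper's actual configuration, so the printed figures are indicative only:

```python
import math

# Assumed, illustrative values -- not the configuration reported in the paper.
N, K, C = 4096, 4, 128            # points, neighbors, fusion channels
H, W = 94, 311                    # feature-map size (e.g. a quarter-resolution KITTI image)
C_in, C_out, L_layers, M_pairs = 32, 64, 4, 6

prf_per_layer = (N                      # projection            O(N)
                 + N * math.log2(N)     # KD-Tree KNN search    O(N log N)
                 + N * K                # sampling              O(N x K)
                 + N * K * C)           # fusion                O(N x K x C)

if_fusion = (L_layers * H * W * C_in * C_out   # feature extraction  O(L x H x W x C x C')
             + M_pairs * H * W * C_in          # pairwise fusion     O(M x H x W x C)
             + 6 * H * W * C_in)               # cascade fusion      O(6 x H x W x C)

print(f"PRF (per layer): ~{prf_per_layer / 1e6:.2f} M ops")
print(f"IF-Fusion:       ~{if_fusion / 1e6:.1f} M ops")
```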

4. Experiment and Result

The proposed network is trained and tested on a personal computer with two NVIDIA GTX 1080 Ti GPUs. The network comprises the PRF-based point cloud-image feature fusion and the IF-Fusion multi-scale image feature fusion. Figure 6 shows the final output. This study divides the 7481 training frames of the KITTI dataset [4] into a training set and a validation set at approximately a 1:1 ratio. For assessment, this study follows the easy, moderate, and hard difficulty classifications defined by KITTI. The network is optimized using the Adaptive Moment Estimation (Adam) algorithm. The initial learning rate is set to 0.002, with a weight decay of 0.001 and a momentum factor of 0.9. The model is trained end-to-end for approximately 50 epochs with a batch size of 4. The balancing weight λ in the loss function is set to 5.
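A minimal sketch of this training setup is given below, assuming a PyTorch loop and interpreting the stated momentum factor of 0.9 as Adam's first-moment coefficient β1; `model` and `train_loader` are placeholders for the PRNet network and the KITTI data loader.

```python
import torch

def train_prnet(model, train_loader, epochs=50):
    """Training-loop sketch following the hyperparameters reported above."""
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=0.002,                    # initial learning rate
        betas=(0.9, 0.999),          # 0.9 interpreted as the stated momentum factor
        weight_decay=0.001,
    )
    for epoch in range(epochs):      # trained end-to-end for ~50 epochs
        for batch in train_loader:   # DataLoader built with batch_size=4
            optimizer.zero_grad()
            loss = model(batch)      # total loss; balancing weight lambda = 5 applied inside
            loss.backward()
            optimizer.step()
```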
The experiment consists of four parts. First, the study conducts trials on the challenging KITTI dataset, comparing quantitative results with the most advanced networks and visualizing the results by projecting the 3D bounding boxes onto 2D images. Second, different values of the parameter k within PRF are tested to find the most effective setting. Third, the study evaluates IF-Fusion independently, demonstrating its effectiveness in detecting small objects. Finally, ablation studies are presented to evaluate the contribution of each module.

4.1. 3D Detection

For the final 3D detection results, this paper uses two metrics to measure the precision of 3D localization and 3D bounding box detection. For 3D localization, this paper projects the three-dimensional boxes onto the ground plane to obtain a bird’s eye view of bounding boxes. This paper calculates the Average Precision of Bird’s Eye View (APBEV) bounding boxes. For 3D bounding box detection, this paper uses the Average Precision (AP3D) metric to evaluate the complete 3D bounding boxes. In the context of 3D object detection, AP3D evaluates the precision of fully 3D bounding boxes by considering their location, size, and orientation in three-dimensional space, providing a comprehensive measure of detection accuracy. Complementing this, APBEV assesses detection performance by projecting 3D boxes onto the ground plane, focusing on the accuracy of object localization from a bird’s-eye perspective, which is critical for tasks like autonomous driving.
When evaluating with AP3D and APBEV, an IoU threshold of 0.7 is set for the car class. On the test set, as shown in Table 1, the network in this paper achieves the best results on most metrics compared with the state-of-the-art networks.
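A minimal sketch of the BEV IoU used in the APBEV matching criterion is given below, assuming boxes are described by their ground-plane footprint (x, z, length, width, yaw) and using shapely for the rotated-rectangle intersection; the 0.7 threshold corresponds to the car-class setting above.

```python
import numpy as np
from shapely.geometry import Polygon

def bev_corners(x, z, length, width, yaw):
    """Corners of a rotated box footprint on the ground plane."""
    dx, dz = length / 2.0, width / 2.0
    corners = np.array([[ dx,  dz], [ dx, -dz], [-dx, -dz], [-dx,  dz]])
    rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                    [np.sin(yaw),  np.cos(yaw)]])
    return corners @ rot.T + np.array([x, z])

def bev_iou(box_a, box_b):
    """IoU of two BEV boxes, each given as (x, z, length, width, yaw)."""
    pa, pb = Polygon(bev_corners(*box_a)), Polygon(bev_corners(*box_b))
    inter = pa.intersection(pb).area
    union = pa.area + pb.area - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive for the car class when IoU >= 0.7.
print(bev_iou((0.0, 10.0, 4.0, 1.8, 0.0), (0.2, 10.1, 4.0, 1.8, 0.0)) >= 0.7)
```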

4.2. The Effect of PRF Module

In the Point-Region Fusion (PRF) module, the parameter k represents the number of nearest neighbors considered around each point cloud data point during the K-Nearest Neighbors (KNN) algorithm. This parameter is crucial for the point cloud and image fusion process as it directly affects the fusion module’s ability to capture local structures and features. When the value of k is small, the PRF module may only consider very local areas, potentially leading to fusion features that are not robust enough. The lack of sufficient neighboring points to construct a stable local environment might make the fused features highly sensitive to noise or deviations in individual points, thereby affecting the final detection performance. Conversely, if the value of k is set large, each point will consider a more extensive local area. This leads to spatially smoother fusion features that can resist noise and sparsity better. However, a k value that is too large can also introduce issues, such as blending irrelevant features, especially near the edges of objects where points from different objects might be erroneously considered together, thus reducing detection precision. Moreover, a large k value also increases computational complexity, impacting the efficiency of the algorithm.
We compared the results for k_neighbors = 0 (“no” PRF), 3, 4, and 5, within the limits supported by the hardware. The comparison shows that PRF clearly improves detection performance. With k_neighbors = 4, the average AP3D increases by 1.53%; with k_neighbors = 5, the average APBEV increases by 0.84%. The specific results are given in Table 2, and the effect of the PRF module is shown in Figure 7.

4.3. The Effect of IF-Fusion

This paper analyzes the detection results of the baseline model and the model integrated with the IF-Fusion module, paying special attention to the performance in detecting small objects and complex scenes. Both quantitative results and qualitative assessments demonstrate that the IF-Fusion module effectively enhances the performance of 3D object detection, especially in handling small objects and complex scenes. Although there is a slight increase in the demand for computational resources, considering the significant improvement in accuracy, integrating the IF-Fusion module is justified and worthwhile. The effect of the IF-Fusion module is shown in Figure 8.

4.4. Ablation Experiment

To gain a deeper understanding of the contribution of PRF and IF-Fusion modules to the performance of the 3D object detection model in this paper, we design a series of ablation experiments, as shown in Table 3. The purpose is to evaluate the effect of each module individually and their combined influence when integrated. After incorporating the PRF module, we observe significant improvements in AP3D and APBEV, indicating that PRF plays a crucial role in enhancing the model’s capability to capture local features. When applying only the IF-Fusion module, the model performs better in handling small and distant objects. This validates the effectiveness of IF-Fusion in multi-scale fusion and information preservation. In the configuration that uses both PRF and IF-Fusion modules, we achieve the highest AP3D and APBEV among all setups. This demonstrates that the combination of the two modules can complement each other, providing a more comprehensive feature representation than when used individually.

5. Conclusions

In this paper, we proposed PRNet, a novel 3D object detection network that leverages the fusion of point cloud and image data to address the challenges of detecting objects with sparse point clouds and small objects. The key contributions of our work include the introduction of the Point-Region Fusion (PRF) module, which effectively integrates local point cloud features with regional image features, and the Image Feature Fusion (IF-Fusion) module, which enhances the network’s ability to detect small objects through multi-scale feature fusion. Experimental results on the KITTI benchmark demonstrate that our approach achieves state-of-the-art performance, particularly in scenarios involving sparse point clouds and complex environments. The ablation studies further validate the effectiveness of both the PRF and IF-Fusion modules, highlighting their complementary roles in improving detection accuracy. Our method not only addresses the limitations of traditional single-modality approaches but also provides a robust framework for multi-modal data fusion in 3D object detection.
However, there are some limitations to this article’s network. For instance, the fusion process may still lose some fine-grained information, especially for distant objects. Additionally, the computational complexity of the KNN algorithm in the PRF module could be a bottleneck for real-time applications. Future work will focus on addressing these limitations by exploring more efficient fusion strategies and extending the evaluation to larger datasets.
There are several promising directions for future research. First, more advanced fusion strategies could be explored to further reduce information loss and improve the integration of point cloud and image features. Second, extending the evaluation to larger and more diverse datasets would help validate the generalizability of our approach. Finally, the practical deployment of our method in real-world applications, such as autonomous driving and robotic navigation, presents unique challenges that warrant further investigation.

Author Contributions

Conceptualization, Y.F. and H.H.; Methodology, Y.G.; software, Y.F.; Validation, Y.F. and Y.G.; Formal analysis, H.H.; Investigation, H.H.; Resources, Y.G.; Data curation, Y.G.; Writing—original draft preparation, Y.G. and H.H.; Writing—review and editing, Y.F.; Visualization, Y.G.; Supervision, H.H.; Project administration, Y.F.; Funding acquisition, Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China under Grant Nr.: 61961020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code is available at https://github.com/JackKu0/PRNet (accessed on 10 March 2025). We use the open-source KITTI dataset, which is available at https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d (accessed on 10 March 2025).

Conflicts of Interest

We declare that we do not have any commercial or associative interests that represent conflicts of interest in connection with the submitted work.

References

  1. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 3354–3361. [Google Scholar]
  2. Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep learning for lidar point clouds in autonomous driving: A review. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3412–3432. [Google Scholar] [CrossRef] [PubMed]
  3. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
  4. Ataer-Cansizoglu, E.; Taguchi, Y.; Ramalingam, S.; Garaas, T. Tracking an RGB-D camera using points and planes. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2 December 2013; pp. 51–58. [Google Scholar]
  5. Kim, H.; Kim, J.; Nam, H.; Park, J.; Lee, S. Spatiotemporal Texture Reconstruction for Dynamic Objects Using a Single RGB-D Camera. Comput. Graph. Forum 2021, 40, 523–535. [Google Scholar]
  6. Wang, Z.; Zhan, W.; Tomizuka, M. Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: New York, NY, USA, 2018; pp. 1–6. [Google Scholar]
  7. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156. [Google Scholar]
  8. Chen, X.; Kundu, K.; Zhu, Y.; Fidler, S.; Urtasun, R. 3d object proposals using stereo imagery for accurate object class detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1259–1272. [Google Scholar] [PubMed]
  9. Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In Computer Vision–ECCV 2020, 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XV 16; Springer International Publishing: New York, NY, USA, 2020; pp. 35–52. [Google Scholar]
  10. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
  11. Drobnitzky, M.; Friederich, J.; Egger, B.; Zschech, P. Survey and systematization of 3D object detection models and methods. Vis. Comput. 2024, 40, 1867–1913. [Google Scholar] [CrossRef]
  12. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082. [Google Scholar]
  13. Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. Gs3d: An efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1019–1028. [Google Scholar]
  14. Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T.K. Geometry-based distance decomposition for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15172–15181. [Google Scholar]
  15. Cai, Y.; Li, B.; Jiao, Z.; Li, H.; Zeng, X.; Wang, X. Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10478–10485. [Google Scholar]
  16. Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving into localization errors for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4721–4730. [Google Scholar]
  17. Liu, X.; Xue, N.; Wu, T. Learning auxiliary monocular contexts helps monocular 3D object detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1810–1818. [Google Scholar] [CrossRef]
  18. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv 2019, arXiv:1906.06310. [Google Scholar]
  19. Qian, R.; Garg, D.; Wang, Y.; You, Y.; Belongie, S.; Hariharan, B.; Campbell, M.; Weinberger, K.Q.; Chao, W.L. End-to-end pseudo-lidar for image-based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5881–5890. [Google Scholar]
  20. Guo, X.; Shi, S.; Wang, X.; Li, H. Liga-stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3153–3163. [Google Scholar]
  21. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
  22. Li, J.; Sun, Y.; Luo, S.; Zhu, Z.; Dai, H.; Krylov, A.S.; Ding, Y.; Ling, S. P2v-rcnn: Point to voxel feature learning for 3d object detection from point clouds. IEEE Access 2021, 9, 98249–98260. [Google Scholar] [CrossRef]
  23. Li, J.; Luo, S.; Zhu, Z.; Dai, H.; Krylov, A.S.; Ding, Y.; Shao, L. 3D IoU-Net: IoU guided 3D object detector for point clouds. arXiv 2020, arXiv:2004.04962. [Google Scholar]
  24. Sun, P.; Wang, W.; Chai, Y.; Elsayed, G.; Bewley, A.; Zhang, X.; Sminchisescu, C.; Anguelov, D. Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5725–5734. [Google Scholar]
  25. Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. Rangedet: In defense of range view for lidar-based 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2918–2927. [Google Scholar]
  26. Liang, Z.; Zhang, Z.; Zhang, M.; Zhao, X.; Pu, S. Rangeioudet: Range image based real-time 3d object detector optimized by intersection over union. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7140–7149. [Google Scholar]
  27. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  28. Wang, Y.; Fathi, A.; Kundu, A.; Ross, D.A.; Pantofaru, C.; Funkhouser, T.; Solomon, J. Pillar-based object detection for autonomous driving. In Computer Vision–ECCV 2020, 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16; Springer International Publishing: New York, NY, USA, 2020; pp. 18–34. [Google Scholar]
  29. Kuang, H.; Wang, B.; An, J.; Zhang, M.; Zhang, Z. Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LIDAR point clouds. Sensors 2020, 20, 704. [Google Scholar] [CrossRef] [PubMed]
  30. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar]
  31. Zheng, W.; Tang, W.; Jiang, L.; Fu, C.W. SE-SSD: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14494–14503. [Google Scholar]
  32. Xu, Q.; Zhong, Y.; Neumann, U. Behind the curtain: Learning occluded shapes for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2893–2901. [Google Scholar] [CrossRef]
  33. Qian, R.; Lai, X.; Li, X. BADet: Boundary-aware 3D object detection from point clouds. Pattern Recognit. 2022, 125, 108524. [Google Scholar] [CrossRef]
  34. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  35. Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3d lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
  36. Yang, B.; Liang, M.; Urtasun, R. Hdnet: Exploiting hd maps for 3d object detection. In Proceedings of the Conference on Robot Learning, PMLR. Zürich, Switzerland, 29–31 October 2018; pp. 146–155. [Google Scholar]
  37. Yang, B.; Luo, W.; Urtasun, R. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
  38. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  39. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  40. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  41. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: New York, NY, USA, 2018; pp. 1–8. [Google Scholar]
  42. Sindagi, V.A.; Zhou, Y.; Tuzel, O. Mvx-net: Multimodal voxelnet for 3d object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 7276–7282. [Google Scholar]
  43. Hong, D.S.; Chen, H.H.; Hsiao, P.Y.; Fu, L.C.; Siao, S.M. CrossFusion net: Deep 3D object detection based on RGB images and point clouds in autonomous driving. Image Vis. Comput. 2020, 100, 103955. [Google Scholar]
Figure 1. The overall frame of PRNet structure.
Figure 2. PRF module structure.
Figure 3. Diagram of KNN.
Figure 4. Structure of the fourth step: fusion.
Figure 5. IF-Fusion module structure.
Figure 6. Detection result diagram. Predicted cars are shown with green frames and ground-truth boxes with red frames. The first score is the confidence score of the 3D prediction box, and the second score is the Intersection over Union (IoU) between the 3D predicted box and the ground truth.
Figure 7. Effect of the PRF module. In each set of images, the left image is the result of the network with k_neighbors = 0 (“no” PRF) in the PRF module, and the right image is the result with k_neighbors = 4. The PRF module is particularly effective when the object is occluded or similar in appearance to the background.
Figure 8. Comparison of small object detection results. In each set of images, the left image is the result of the network with IF-Fusion added, and the right image is the result of the network without IF-Fusion added.
Table 1. Comparison with state-of-the-art methods on the KITTI test set for car 3D detection. The best scores are highlighted in bold.

Method | AP3D (%) Easy | AP3D (%) Moderate | AP3D (%) Hard | APBEV (%) Easy | APBEV (%) Moderate | APBEV (%) Hard
F-PointNet | 81.20 | 70.39 | 62.19 | 88.70 | 84.00 | 75.33
F-ConvNet | 85.88 | 76.51 | 68.08 | 89.69 | 83.08 | 74.56
MV3D | 71.09 | 62.35 | 55.12 | 86.02 | 76.90 | 68.49
AVOD-FPN | 81.94 | 71.88 | 66.38 | 88.53 | 83.79 | 77.90
ContFuse | 82.54 | 66.22 | 64.04 | 88.81 | 85.83 | 77.33
CrossFusion | 83.20 | 74.50 | 67.01 | 88.39 | 86.17 | 78.23
CLOCs | 87.50 | 76.68 | 71.20 | 92.60 | 88.99 | 81.74
PI-RCNN | 84.59 | 75.82 | 68.39 | - | - | -
VPFNet | 88.51 | 80.97 | 76.74 | - | - | -
SFD | 91.73 | 84.76 | 77.92 | 95.64 | 91.85 | 86.83
Ours | 91.95 | 80.56 | 78.08 | 95.75 | 88.83 | 88.68
Table 2. Detection results on the KITTI test set when different k_neighbors are set in PRF. The best scores are highlighted in bold.

PRF | k_neighbors | AP3D (%) Easy | AP3D (%) Moderate | AP3D (%) Hard | AP3D (%) Average | APBEV (%) Easy | APBEV (%) Moderate | APBEV (%) Hard | APBEV (%) Average
No | - | 88.82 | 78.62 | 76.67 | 81.37 | 94.76 | 87.25 | 85.40 | 89.14
Yes | 3 | 90.70 | 79.68 | 77.24 | 82.54 (+1.17) | 95.15 | 88.40 | 86.28 | 89.94 (+0.80)
Yes | 4 | 91.06 | 79.94 | 77.69 | 82.90 (+1.53) | 94.69 | 88.33 | 86.34 | 89.79 (+0.65)
Yes | 5 | 90.88 | 79.84 | 77.65 | 82.79 (+1.42) | 95.11 | 88.40 | 86.43 | 89.98 (+0.84)
Table 3. Ablation study. The best scores are highlighted in bold. “√” indicates that the module is added, and “×” indicates that the module is not added.

Sensor | PRF | IF-Fusion | AP3D (%) Easy | AP3D (%) Moderate | AP3D (%) Hard | APBEV (%) Easy | APBEV (%) Moderate | APBEV (%) Hard
Lidar | × | × | 88.82 | 78.62 | 76.67 | 94.76 | 87.25 | 85.40
Lidar + Image | √ | × | 90.70 | 79.68 | 77.24 | 95.15 | 88.40 | 86.28
Lidar + Image | × | √ | 90.70 | 81.10 | 79.00 | 94.79 | 88.08 | 86.10
Lidar + Image | √ | √ | 91.95 | 80.56 | 78.08 | 95.75 | 88.83 | 86.72