Article

Focusing 3D Small Objects with Object Matching Set Abstraction

Lei Guo, Ningdong Song, Jindong Hu, Huiyan Han, Xie Han and Fengguang Xiong
1 Shanxi Key Laboratory of Machine Vision and Virtual Reality, North University of China, Taiyuan 030051, China
2 School of Computer Science and Technology, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4121; https://doi.org/10.3390/app15084121
Submission received: 9 February 2025 / Revised: 22 March 2025 / Accepted: 3 April 2025 / Published: 9 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Currently, 3D object detection methods often fail to detect small objects because small objects provide fewer effective points. Reducing the loss of point information during representation learning is therefore a significant challenge. To this end, we propose an effective 3D detection method with object matching set abstraction (OMSA). We observe that key points are lost during feature learning with multiple set abstraction layers, especially during downsampling and querying. Therefore, we present a novel sampling module, named focus-based sampling, which raises the sampling probability of small objects. In addition, we design a multi-scale cube query that matches small objects with close geometric alignment. Our comprehensive experimental evaluations on the KITTI 3D benchmark demonstrate significant performance improvements in 3D object detection. Notably, the proposed framework exhibits competitive detection accuracy for small objects (pedestrians and cyclists). Through an ablation study, we verify that each module contributes to the performance enhancement and demonstrate the robustness of the method to the balance factor.

1. Introduction

The domain of 3D object detection has witnessed significant research advancements recently, driven by its pivotal role in emerging technologies such as autonomous navigation systems, intelligent robotic operations, and immersive augmented reality environments. Over the last decade, 3D object detection has achieved remarkable progress, propelled by the strong representation capacity of deep learning and the availability of diverse large-scale datasets. Nevertheless, the detection of small, low-resolution 3D objects remains a persistent challenge for cutting-edge algorithms, primarily because such objects yield too few distinctive feature points in LiDAR point clouds, a critical bottleneck affecting detection reliability in real-world deployments.
Contemporary 3D object detection architectures are conventionally categorized into two principal paradigms. The first comprises grid-based frameworks that convert unstructured 3D point clouds into standardized grid formats through geometric regularization, typically voxelized volumetric partitions, pillar-structured vertical columns, or compressed bird’s-eye-view (BEV) projection planes with optimized spatial quantization parameters [1]. Then, 2D convolution or 3D sparse convolution is employed for feature extraction. Despite the notable performance of these approaches, a significant limitation lies in the quantization inaccuracies inherently introduced during 3D-2D conversion or voxelization. To circumvent this issue, a second category of methods, point-based methods, has been developed that focuses on direct feature extraction from unprocessed point cloud data, bypassing intermediate representation conversions. Leveraging learning techniques such as PointNet [2] and its variants, point-based methods avoid quantization errors. Set abstraction (SA) layers are cascaded to achieve representation learning; however, native SA layers are biased toward feature learning of normal-sized objects.
The general module of the SA layer contains three components: query, point downsampling, and MLP. Typically, context points are captured by a ball query. Yet the geometric shape of the ball differs significantly from that of the object, so key points may be ignored and noise points introduced. Key information is extracted by the downsampling operation; however, the number of points associated with small objects is considerably lower than that of large objects. When subjected to repeated downsampling, critical information on small objects is substantially diminished, and the corresponding detection performance decreases significantly.
Aiming at the issues of key point feature loss during downsampling and geometric shape mismatch in the query, we present an efficient semantically guided set abstraction module for 3D object detection. This block explicitly boosts the representation learning for small objects while preserving the detection performance for normal-sized objects. First, focus-based sampling with a comprehensive score is designed with the help of semantic information to raise the sampling probability of key points on small objects. Second, a flexible cube query is carried out to better capture the nearest neighboring points and reduce the introduction of irrelevant points. Our approach exhibits competitive performance on the KITTI 3D object detection dataset, and the ablation study shows that each proposed block enhances performance. The contributions are threefold:
1. We design a cube query set abstraction to achieve fine-grained representation learning for small objects. This module employs multi-scale cubes to extract multi-level features and reduces the introduction of irrelevant noise simultaneously.
2. We propose focus-based sampling to reduce feature loss during downsampling. This block utilizes semantic information to enhance the sampling probability of key points in small objects.
3. Extensive experimental evaluations conducted on the KITTI 3D object detection benchmark demonstrate the effectiveness of our method in improving the detection performance for small objects.
This paper is structured as follows: Section 2 reviews related work, Section 3 details the proposed OMSA, Section 4 presents experimental results, and Section 5 concludes the study.

2. Related Work

2.1. Three-Dimensional Object Detection

Contemporary approaches to 3D object detection primarily adopt two distinct paradigms: grid-based and point-based methodologies [1,3]. The first paradigm converts unstructured point cloud data into structured features, commonly through BEV projections or volumetric voxel grids. These regularized representations then allow conventional 2D or 3D CNNs to be applied. VoxelNet [4] is the first grid-based work; it transforms point clouds into voxels and utilizes a 3D CNN to detect objects. SECOND [5] employs a sparse-convolution-based backbone, significantly increasing the speed of training and inference. VP-Net treats voxels as discrete points and leverages the intrinsic structural relationships within objects to enhance feature extraction [6]. PointPillars produces BEV images, a kind of pseudo-image feature map, accelerating the quantization process and reducing the complexity of the detection model [7]. Yin et al. first find object centers in the flattened grid-based features and then boost the local representations [8]. BEV-SAN introduces a slice attention module designed to leverage semantic information and enhance discriminative feature representation across distinct height levels [9]. CluB strengthens the representation at both the feature and query levels by adding an auxiliary cluster-based branch [10]. Li et al. use a conditional random field to improve instance-level features [11]. Nonetheless, this type of method has an inherent drawback: it inevitably introduces errors during the quantization process.
The second paradigm learns representations directly from the unprocessed point cloud data [12]. These methods compute point-wise representations for small regions via the PointNet family and predict the 3D objects from these representations. PointNet++ utilizes a multi-level feature learning architecture, which incorporates farthest point sampling (FPS) for key point selection, a ball query for building local regions, and PointNet for extracting features within these regions [13]. In most point-based methods, downsampling is embedded in the backbone to lower the computational complexity. However, these semantics-agnostic methods discard many points in foreground objects. As a pioneering framework in 3D object detection, 3DSSD introduces a fusion of Distance FPS (D-FPS) and Feature FPS (F-FPS), thereby improving the feature representation by incorporating semantic information and significantly boosting detection accuracy [14]. FSD groups the points into instances with the guidance of semantic information to boost feature learning [15].

2.2. Small Object Detection

Detecting small objects is a crucial yet challenging problem in both 2D and 3D object detection. General-purpose object detectors currently struggle to achieve satisfactory results on small objects. This challenge arises primarily from the limited representations and the inherent algorithmic bias toward normal-sized objects. To address this, a series of strategies for small object detection has been proposed to alleviate information loss and prediction bias. These strategies can be divided into sample-oriented strategies, feature enhancement strategies, and weighting strategies [16]. The first aims to alleviate the sample scarcity issue by raising the sampling probability of small objects [17,18,19]. The second employs low-level learned representations or super-resolution to strengthen the corresponding features [20,21,22]. The third utilizes an attention mechanism to rectify the prediction bias [23,24,25].
In recent years, small object detection in the 3D field has also garnered significant attention. Semantic information is injected into the sampling process by adding a segmentation head [26,27], which enhances the sampling probability of foreground points and boosts the detection performance. Motivated by these works, we add another segmentation head specifically for small objects to increase their sampling probability. PSA-Det3D presents a pillar query to sample the key points [28]. Compared with the ball query, the pillar is more consistent with the geometric characteristics of the objects and reduces the introduction of noise. However, this method only considers the x and y axes when computing distances, which admits more unrelated neighboring points along the z axis. Therefore, we additionally consider the z axis when calculating distances and propose the cube query strategy. Overall, our approach integrates a sample-oriented strategy and a weighting strategy to enhance the detection performance.

2.3. Set Abstraction Algorithms

The set abstraction algorithm contains sampling, query, and feature learning. FPS is usually employed to select the key points from the raw point clouds. This method mines key points according to distance alone, so the mined points contain a considerable number of background points. To mitigate this effect, Li et al. propose a weighted FPS with Euclidean distance to raise the sampling probability of foreground points [29]. To further reduce the introduction of irrelevant points that do not belong to the corresponding object, semantic supervision is employed in subsequent work; 3DSSD develops a Feature-FPS strategy that preserves object points using feature distances [14]. In contrast to 3DSSD, SASA presents a more direct heuristic strategy to identify the foreground points [26]. Specifically, a binary segmentation module is added to estimate point-wise foreground scores. This method only distinguishes foreground from background and neglects small objects. Motivated by this research, we add another segmentation module designed specifically to distinguish small objects, so the probability of sampling small objects is significantly enhanced. A ball query is typically used to mine context for the key points, but the ball poorly matches the geometry of the objects [30,31]. Thus, some key points are lost and some irrelevant points are introduced, ultimately lowering the detection performance. To address this issue, we propose a geometry-prior-based query, i.e., the cube query. Unlike the ball, the cube aligns more closely with objects such as pedestrians, bicycles, and cars, enabling accurate feature extraction.

3. Methodology

The presence of small objects significantly affects the performance of 3D object detection. This work proposes a point-wise methodology for object detection. The method contains a representation learning block and a prediction head. In the first block, SA modules are used to carry out structured feature learning, significantly reducing the computational cost. Typically, an SA module employs task-agnostic schemes, such as random sampling or FPS, for downsampling and utilizes a ball query to extract the nearest neighboring points. First, small objects have significantly fewer points than large objects, so their feature learning is greatly affected by downsampling. Second, the geometric difference between the ball and the objects (pedestrians and cyclists) is large, inevitably introducing more noise points. Therefore, our target is to learn strong features in the representation learning process and reduce information loss. We introduce an efficient point-based detector utilizing object matching set abstraction, as depicted in Figure 1. The representation learning block extracts structured features from the point clouds. Subsequently, the features are fed to the detection and segmentation modules. The detection head outputs predictions, and the segmentation head provides semantic guidance for feature learning. This work focuses on the representation learning block. The block mainly comprises three components: (a) focus-based sampling, (b) cube query, and (c) MLP. Focus-based sampling raises the sampling probability of informative points on small objects. The cube query leverages geometric priors to extract spatial points that conform more closely to the objects’ morphological characteristics. The MLP employs stacked 1 × 1 convolutional layers with BatchNorm to encode geometric patterns in local neighborhoods, providing invariance to point permutation and robustness to density variations via hierarchical multi-scale feature aggregation. Note that the number of SA layers is greater than or equal to two.
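To make this structure concrete, the sketch below outlines how one such set abstraction layer could be organized, with the sampling and grouping operations passed in as callables (they are sketched in Sections 3.1 and 3.2); all class, argument, and channel names are our own illustrative assumptions, not the released code.

```python
import torch.nn as nn

class OMSASetAbstraction(nn.Module):
    """Schematic OMSA layer: focus-based sampling -> multi-scale cube query -> shared MLP."""

    def __init__(self, num_keypoints, cube_sizes, mlp_channels, sample_fn, group_fn):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.cube_sizes = cube_sizes          # multi-scale cube half edge lengths l
        self.sample_fn = sample_fn            # e.g., focus-based sampling (Section 3.1)
        self.group_fn = group_fn              # e.g., cube query (Section 3.2)
        # Shared MLP: stacked 1x1 convolutions with BatchNorm, as described above.
        layers = []
        for c_in, c_out in zip(mlp_channels[:-1], mlp_channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, xyz, fg_scores, small_scores):
        # (a) Focus-based sampling: keep points that are both spatially spread
        #     out and likely to belong to (small) foreground objects.
        idx = self.sample_fn(xyz, fg_scores, small_scores, self.num_keypoints)
        keypoints = xyz[idx]
        # (b) Multi-scale cube query: group neighbours whose per-axis distance
        #     to each key point is below the cube size l.
        groups_per_scale = [self.group_fn(xyz, keypoints, l) for l in self.cube_sizes]
        # (c) The shared MLP plus max pooling over each group would follow here
        #     (gathering grouped coordinates/features into a dense tensor is omitted).
        return keypoints, groups_per_scale
```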

3.1. Focus-Based Sampling

The point clouds of small objects often exhibit obvious sparse characteristics. The key points of small objects cannot be sampled effectively by FPS, and their features cannot be learned well, because semantic information is neglected. Semantics-guided FPS assigns higher weights to the foreground points [26], significantly enhancing their sampling probability. Inspired by this work, a comprehensive focus-based sampling (FocS) approach is presented, which incorporates an additional MLP to calculate scores for small objects, further increasing their sampling probability. Suppose we have selected $M$ points $x_1, x_2, \ldots, x_M$. For each 3D point $x_i$, we determine the shortest distance $d_i \in \mathbb{R}$ between it and the remaining points within this subset, as shown in Figure 2. Subsequently, we design a comprehensive score $\mathrm{score}_i$ to evaluate each point:
$$\mathrm{score}_i = s_F + \lambda\, s_{so}$$
where $s_F$ is the score of the foreground, $s_{so}$ is the score of the small object, and $\lambda$ is the balance factor. The rectified focus-based distance $d_i^{\mathrm{FocS}}$ is as follows:
$$d_i^{\mathrm{FocS}} = \mathrm{score}_i \cdot d_i$$
where $\mathrm{score}_i$ serves as the correction factor. Subsequently, we perform FPS with the focus-based distance. Such a design effectively raises the sampling probability of points on small objects, mitigating the information degradation of small objects throughout successive downsampling operations and consequently enhancing the model’s capability.
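For concreteness, the following minimal sketch implements FPS driven by the rectified focus-based distance; it assumes the per-point scores $s_F$ and $s_{so}$ have already been predicted by auxiliary heads, and the function name, tensor shapes, and seed choice are illustrative assumptions rather than the released implementation.

```python
import torch

def focus_based_sampling(xyz, fg_scores, small_scores, num_samples, lam=1.0):
    """Minimal sketch of FocS: farthest point sampling driven by the rectified
    distance d_i^FocS = score_i * d_i, with score_i = s_F + lambda * s_so.

    xyz:          (N, 3) point coordinates
    fg_scores:    (N,) foreground scores s_F
    small_scores: (N,) small-object scores s_so
    Returns the indices of the selected key points.
    """
    n = xyz.shape[0]
    score = fg_scores + lam * small_scores            # comprehensive score
    selected = torch.zeros(num_samples, dtype=torch.long)
    min_dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()       # arbitrary seed point

    for i in range(num_samples):
        selected[i] = farthest
        d = torch.norm(xyz - xyz[farthest], dim=-1)   # distance to the newest key point
        min_dist = torch.minimum(min_dist, d)         # shortest distance to the selected set
        # Points with a large focus-based distance (far away AND high score,
        # i.e., likely small-object foreground) are picked next.
        farthest = torch.argmax(score * min_dist).item()
    return selected
```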

3.2. Cube Query

In a ball query, the points within a spherical range are regarded as the neighboring points. However, the shape of the ball differs significantly from that of the objects, so informative points may be missed while noise points are included. Motivated by this, we design a cube query to acquire neighboring points with fine-grained 3D information. The comparison is outlined in Figure 3.
As shown in Figure 3, the geometric shapes of the ball and pedestrians are quite different, which makes it easy to ignore key points or introduce more noise. In addition, the pillar query [28] considers the entire z-axis range, leading to more noise injection. In contrast, the cube query can adjust its length, alleviating the impact of irrelevant points on learning. Concretely, we employ FocS to obtain the key points, and a cube-based grouping operation, i.e., the cube query, is utilized to partition the points:
$$GA(k) = \{\, p \in P \mid D_X(p,k) < l,\ D_Y(p,k) < l,\ D_Z(p,k) < l \,\}, \quad k \in K$$
$$D_X(p,k) = \lvert p(x) - k(x) \rvert$$
$$D_Y(p,k) = \lvert p(y) - k(y) \rvert$$
$$D_Z(p,k) = \lvert p(z) - k(z) \rvert$$
where $GA(k)$ denotes the grouping results array; $P = \{P_j\}_{j=1}^{N}$ represents the input points; $K = \{K_j\}_{j=1}^{N_m}$ is the set of key points; $N_m$ is the number of sampled points; and $D_X(p,k)$, $D_Y(p,k)$, and $D_Z(p,k)$ represent the distances along the x, y, and z axes, respectively.
Similar to the ball query, we can obtain multi-scale neighboring points by setting different values of $l$, thereby promoting global feature learning. Compared with the ball query, the presented geometry-aware query operation can effectively capture the object-related points; we find that the results of the cube query are slightly better than those of the ball query, as shown in Table 1. The cube query exhibits a closer alignment with the geometric characteristics of pedestrians and cyclists, facilitating accurate feature extraction for small objects.
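The following minimal sketch implements the cube query exactly as defined above; the function name, the group-size cap, and the example cube sizes in the trailing comment are illustrative assumptions rather than the released implementation.

```python
import torch

def cube_query(points, keypoints, l, max_samples=32):
    """Group each key point k with the points p satisfying
    |p(x)-k(x)| < l, |p(y)-k(y)| < l and |p(z)-k(z)| < l,
    i.e., an axis-aligned cube of half edge length l centred at k.

    points:    (N, 3) input points P
    keypoints: (M, 3) sampled key points K
    Returns a list of index tensors, one group per key point.
    """
    groups = []
    for k in keypoints:
        per_axis = (points - k).abs()                 # D_X, D_Y, D_Z for every point
        inside = (per_axis < l).all(dim=-1)           # inside the cube centred at k
        idx = torch.nonzero(inside, as_tuple=False).squeeze(-1)
        groups.append(idx[:max_samples])              # cap the group size, as in a ball query
    return groups

# Multi-scale grouping simply repeats the query with several cube sizes
# (the values below are illustrative) and concatenates the per-scale features:
# groups_fine, groups_coarse = cube_query(pts, kps, 0.4), cube_query(pts, kps, 0.8)
```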

4. Experiments

4.1. Implementation Details

Experiments were conducted on the KITTI benchmark for 3D object detection [32]. The dataset includes car, pedestrian, and cyclist classes, categorized into easy, moderate, and hard levels based on bounding box height, occlusion, and truncation. It contains 7481 training and 7518 testing samples, with the training set further divided into 3712 samples for training and 3769 for validation. The primary evaluations were conducted on the test and validation sets, while ablation studies were performed on the validation set. Our method is built on SASA using the OpenPCDet framework. Training was performed on an Nvidia RTX 3090 GPU (Nvidia Corporation, Santa Clara, CA, USA) for 80 epochs, with Average Precision (AP) on moderate-level instances as the primary metric. Three SA layers were employed in our method.
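Since AP is the primary metric and the results in Section 4.3 are reported at 40 recall positions, the following simplified sketch illustrates how an AP|R40-style score can be computed from a ranked detection list; it omits KITTI's per-class IoU thresholds, difficulty filtering, and don't-care handling, and the function name and interface are ours rather than part of the official evaluation kit.

```python
import numpy as np

def ap_r40(tp_flags, num_gt):
    """Simplified AP|R40: precision averaged over 40 equally spaced recall
    positions (1/40, 2/40, ..., 1.0).

    tp_flags: bool array over detections sorted by descending confidence,
              True where a detection matches a previously unmatched ground truth.
    num_gt:   number of ground-truth boxes of the evaluated class/difficulty.
    """
    tp_flags = np.asarray(tp_flags, dtype=bool)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(~tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)

    # Make precision monotonically non-increasing (standard interpolation).
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])

    recall_points = np.linspace(1.0 / 40, 1.0, 40)
    sampled = [precision[recall >= r].max() if np.any(recall >= r) else 0.0
               for r in recall_points]
    return float(np.mean(sampled))
```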

4.2. Ablation Studies

We examine each component’s contribution and the impact of the balance factor λ through ablation studies and provide an in-depth analysis in this section.
Effect of each module. We examine how each core component of our method (focus-based sampling, the cube query, and the number of SA layers) affects detection performance. The corresponding results are given in Table 1. The ball query, i.e., SASA in this work, is employed as the baseline; it obtains moderate-level APs of 82.94%, 57.53%, and 72.02% for cars, pedestrians, and cyclists, respectively, so the baseline achieves lower AP on small objects than on normal-sized objects. With FocS, model A provides +0.57% and +0.48% improvements in AP for the pedestrian and cyclist classes at the moderate level. Replacing the ball query with the cube query in model B, the APs of pedestrians and cyclists reach 59.09%/58.92%/52.26% and 92.27%/73.13%/67.20% at the easy/moderate/hard levels, respectively. The integration of FocS with the pillar query in model C yields notable improvements in pedestrian detection, while the performance in the other categories remains comparable. To further validate our design, we constructed model D with only two SA layers. This configuration achieves moderate-level APs of 59.90% and 71.74% for pedestrians and cyclists, and the degradation demonstrates the necessity of incorporating three SA layers, as the additional layer enhances the representational capacity. The combined implementation of the two modules demonstrates competitive detection accuracy, reaching 82.81% AP for cars, 62.47% AP for pedestrians, and 73.24% AP for cyclists at the moderate level. These results show that our presented block, i.e., OMSA, improves performance significantly, indicating that the two components help reduce information loss and strengthen the corresponding feature learning.
Effect of balance factor λ. We evaluate the impact of λ over the range 0 to 4, as shown in Figure 4. When λ is set to 0, the small-object term is removed and the focus-based sampling module is effectively disabled. Our experimental results demonstrate that the AP for small objects surpasses the baseline (λ = 0) when λ lies between 0.5 and 2.5, so we adopt λ values within this interval. Furthermore, this experiment empirically validates the effectiveness of the module.

4.3. Main Experiments

To demonstrate the proposed approach’s efficacy, comprehensive comparative experiments were conducted against recent representative techniques, with evaluations performed on both the validation and test datasets.
Performance on the validation dataset. The quantitative evaluation results at 40 recall positions on the validation subset are presented in Table 2 and Table 3. At the moderate level, our method achieves APs of 82.81%, 62.47%, and 73.24% for cars, pedestrians, and cyclists, respectively. For cars, our method achieves performance comparable to the two-stage methods while retaining the relatively short inference time of a one-stage method, and it exceeds the other one-stage methods. For the pedestrian and cyclist categories, our method surpasses the others by over 3.56% and 2.00%, respectively, at the moderate level, and it also beats the other one-stage methods at the easy and hard levels. We therefore conclude that our method exhibits excellent robustness under different occlusion and truncation conditions. The experimental results show that our one-stage approach significantly improves small object detection while maintaining good performance for car detection.
Performance on the test dataset. Table 4 and Table 5 summarize the performance of our method and several representative approaches on the KITTI test dataset. For small objects such as pedestrians, our method achieves APs of 52.26%, 50.41%, and 38.20% at the three levels, respectively. As shown in Table 4 and Table 5, our method beats all voxel-based approaches. This superior performance is largely due to the approach’s ability to directly process raw point cloud data, which incurs less information loss; in contrast, quantization error degrades the voxel-based methods, especially for small objects. Compared with the voxel-point-based method PV-RCNN, our approach performs slightly worse overall; however, PV-RCNN needs to integrate two types of representations and possesses a relatively complex model, whereas the proposed framework extracts features directly from the raw point cloud, which facilitates real-time implementation in practical scenarios. Among the point-based methods, our method generally outperforms the one-stage methods for small objects. F-ConvNets and ASCNet achieve better results on cyclists, but these two methods have slow inference speeds because of their two-stage prediction. We also achieve close results for cars. The experimental outcomes indicate the superior performance of the proposed methodology, which pays closer attention to small objects.

4.4. Qualitative Analysis

To enhance visual clarity, Figure 5 presents snapshots of the results obtained by our method on the validation set, showcasing three distinct scenes. In the figure, red bounding boxes denote the ground truth, while green bounding boxes indicate the predictions generated by our approach. The RGB images corresponding to the point clouds in Figure 5a, Figure 5c, and Figure 5e are displayed in Figure 5b, Figure 5d, and Figure 5f, respectively. Our approach successfully identifies most objects in a variety of scenarios. As shown in Figure 5d,f, the method even detects small objects that human annotators inadvertently omitted during annotation. This observation highlights the robustness of the method in challenging detection tasks, validates its effectiveness as a data-driven approach, and demonstrates its generalization ability.

5. Conclusions

Our work introduces a novel block, OMSA, for 3D object detection. To mitigate the sparsity issue of small objects, we propose a focus-based sampling block that, guided by semantic information, selects additional key points to enhance detection performance. The multi-scale cube query thoroughly extracts geometric information, thereby strengthening feature learning. The thorough experimental results reveal the efficacy of OMSA and verify that the proposed block enhances representation learning for small objects during the downsampling stage. One limitation of our method is a slight decline in performance on normal-sized objects due to the imbalanced sampling in our block. In the future, we plan to investigate a method that can simultaneously boost the performance of both normal-sized and small objects.

Author Contributions

Methodology, L.G., F.X. and N.S.; software, N.S. and J.H.; validation, H.H. and X.H.; writing—original draft preparation, L.G. and F.X.; writing—review and editing, F.X. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shanxi Province (Nos. 202203021212138 and 202303021211153) and the Foundation of Shanxi Key Laboratory of Machine Vision and Virtual Reality (No. 447-110103). The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is within the article. The code is available at https://gitee.com/ddguoll/CubeQury (accessed on 8 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  2. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
  3. Drobnitzky, M.; Friederich, J.; Egger, B.; Zschech, P. Survey and systematization of 3D object detection models and methods. Vis. Comput. 2024, 40, 1867–1913. [Google Scholar] [CrossRef]
  4. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  5. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  6. Song, Z.; Wei, H.; Jia, C.; Xia, Y.; Li, X.; Zhang, C. VP-net: Voxels as points for 3D object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5701912. [Google Scholar]
  7. Chen, Y.; Yu, Z.; Chen, Y.; Lan, S.; Anandkumar, A.; Jia, J.; Alvarez, J.M. Focalformer3D: Focusing on hard instance for 3D object detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 8360–8371. [Google Scholar]
  8. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11779–11788. [Google Scholar]
  9. Chi, X.; Liu, J.; Lu, M.; Zhang, R.; Wang, Z.; Guo, Y.; Zhang, S. BEV-SAN: Accurate bev 3D object detection via slice attention networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17461–17470. [Google Scholar]
  10. Wang, Y.; Deng, J.; Hou, Y.; Li, Y.; Zhang, Y.; Ji, J.; Ouyang, W.; Zhang, Y. CluB: Cluster Meets BEV for LiDAR-Based 3D Object Detection. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2024; Volume 36. [Google Scholar]
  11. Li, Z.; Lan, S.; Alvarez, J.M.; Wu, Z. BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 20113–20123. [Google Scholar]
  12. Li, J.; Luo, C.; Yang, X. PillarNeXt: Rethinking network designs for 3D object detection in LiDAR point clouds. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17567–17576. [Google Scholar]
  13. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  14. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11037–11045. [Google Scholar]
  15. Fan, L.; Yang, Y.; Wang, F.; Wang, N.; Zhang, Z. Super sparse 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12490–12505. [Google Scholar] [CrossRef] [PubMed]
  16. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef] [PubMed]
  17. Bosquet, B.; Cores, D.; Seidenari, L.; Brea, V.M.; Mucientes, M.; Del Bimbo, A. A full data augmentation pipeline for small object detection based on generative adversarial networks. Pattern Recognit. 2023, 133, 108998. [Google Scholar] [CrossRef]
  18. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164. [Google Scholar] [CrossRef]
  19. Li, Z.; Wang, Y.; Zhang, Y.; Gao, Y.; Zhao, Z.; Feng, H.; Zhao, T. Context Feature Integration and Balanced Sampling Strategy for Small Weak Object Detection in Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6009105. [Google Scholar] [CrossRef]
  20. Cui, L.; Lv, P.; Jiang, X.; Gao, Z.; Zhou, B.; Zhang, L.; Shao, L.; Xu, M. Context-aware block net for small object detection. IEEE Trans. Cybern. 2020, 52, 2300–2313. [Google Scholar] [CrossRef] [PubMed]
  21. Gao, T.; Niu, Q.; Zhang, J.; Chen, T.; Mei, S.; Jubair, A. Global to Local: A Scale-Aware Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615614. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Zhang, H.; Huang, Q.; Han, Y.; Zhao, M. DsP-YOLO: An anchor-free network with DsPAN for small object detection of multiscale defects. Expert Syst. Appl. 2024, 241, 122669. [Google Scholar] [CrossRef]
  23. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713. [Google Scholar] [CrossRef]
  24. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar] [CrossRef] [PubMed]
  25. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6614–6619. [Google Scholar] [CrossRef]
  26. Chen, C.; Chen, Z.; Zhang, J.; Tao, D. SASA: Semantics-augmented set abstraction for point-based 3D object detection. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI-22), Online, 22 February–1 March 2022; Volume 36, pp. 221–229. [Google Scholar]
  27. Jhong, S.Y.; Chen, Y.Y.; Hsia, C.H.; Wang, Y.Q.; Lai, C.F. Density-Aware and Semantic-Guided Fusion for 3D Object Detection using LiDAR-Camera Sensors. IEEE Sens. J. 2023, 23, 22051–22063. [Google Scholar] [CrossRef]
  28. Huang, Z.; Zheng, Z.; Zhao, J.; Hu, H.; Wang, Z.; Chen, D. PSA-Det3D: Pillar set abstraction for 3D object detection. Pattern Recognit. Lett. 2023, 168, 138–145. [Google Scholar] [CrossRef]
  29. Li, X.; Wang, C.; Zeng, Z. WS-SSD: Achieving faster 3D object detection for autonomous driving via weighted point cloud sampling. Expert Syst. Appl. 2024, 249, 123805. [Google Scholar] [CrossRef]
  30. Huang, Z.; Wang, Y.; Tang, X.; Sun, H. Boundary-aware set abstraction for 3D object detection. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–7. [Google Scholar]
  31. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  32. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  33. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  34. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10526–10535. [Google Scholar]
  35. Jiang, T.; Song, N.; Liu, H.; Yin, R.; Gong, Y.; Yao, J. VIC-Net: Voxelization information compensation network for point cloud 3D object detection. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13408–13414. [Google Scholar]
  36. Tong, G.; Peng, H.; Shao, Y.; Yin, Q.; Li, Z. ASCNet: 3D object detection from point cloud based on adaptive spatial context features. Neurocomputing 2022, 475, 89–101. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3D lidar point clouds. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18931–18940. [Google Scholar]
  38. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. IPOD: Intensive point-based object detector for point cloud. arXiv 2018, arXiv:1812.05276. [Google Scholar]
  39. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  40. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from RGB-D data. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  41. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12689–12697. [Google Scholar]
  42. Wang, Z.; Jia, K. Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1742–1749. [Google Scholar]
  43. Zhou, D.; Fang, J.; Song, X.; Liu, L.; Yin, J.; Dai, Y.; Li, H.; Yang, R. Joint 3D instance segmentation and object detection for autonomous driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1836–1846. [Google Scholar]
  44. Zhang, Y.; Huang, D.; Wang, Y. PC-RGNN: Point cloud completion and graph neural network for 3D object detection. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI-21), Online, 2–9 February 2021; Volume 35, pp. 3430–3437. [Google Scholar]
Figure 1. The framework of our approach OMSA.
Figure 2. The diagram of the focus-based sampling.
Figure 3. Comparison of the different methods. (a) Ball query; (b) pillar query; (c) cube query.
Figure 4. The effect of the balance factor λ.
Figure 5. Snapshots of detection results from our method on the validation set are shown. Figure 5a,c,e are the visualization of detection results in point clouds, and Figure 5b,d,f are the corresponding results in the RGB images. The results are projected onto both the point clouds and the corresponding images. Note that the red bounding boxes represent the ground truth, while the green ones are predictions made by our method.
Table 1. Effect of each presented module.

| Method | Car (Easy/Mod/Hard) | Pedestrian (Easy/Mod/Hard) | Cyclist (Easy/Mod/Hard) |
| --- | --- | --- | --- |
| Baseline (Ball Query) | 91.18 / 82.94 / 80.34 | 60.65 / 57.53 / 51.30 | 91.10 / 72.02 / 67.64 |
| Model A (FocS + Ball Query) | 90.37 / 80.49 / 77.75 | 62.02 / 58.10 / 50.96 | 92.16 / 72.50 / 66.96 |
| Model B (Cube Query) | 91.09 / 82.72 / 79.75 | 59.09 / 58.92 / 52.26 | 92.27 / 73.13 / 67.20 |
| Model C (FocS + Pillar Query) | 88.66 / 78.97 / 78.24 | 66.63 / 60.58 / 55.01 | 88.67 / 72.19 / 67.77 |
| Model D (Two SA Layers) | 91.25 / 82.68 / 80.15 | 66.93 / 59.90 / 54.03 | 91.46 / 71.74 / 67.03 |
| OMSA (FocS + Cube Query) | 91.53 / 82.81 / 80.25 | 69.59 / 62.47 / 56.77 | 92.41 / 73.24 / 68.85 |
Table 2. Experimental results on the KITTI validation dataset on the car category.

| Method | Representation Format | Type | Car (Easy/Mod/Hard) |
| --- | --- | --- | --- |
| VoxelNet [4] | Voxel | One-stage | 81.97 / 65.46 / - |
| SECOND [5] | Voxel | One-stage | 87.43 / 76.48 / 69.10 |
| PointRCNN [33] | Point | Two-stage | 88.88 / 78.63 / 77.38 |
| PV-RCNN [34] | Voxel + Point | Two-stage | 92.57 / 84.83 / 82.69 |
| 3DSSD [14] | Point | One-stage | 89.71 / 79.45 / 78.67 |
| VIC-Net [35] | Voxel + Point | Two-stage | 89.58 / 84.40 / 78.86 |
| ASCNet [36] | Point | Two-stage | - / 83.33 / - |
| IA-SSD [37] | Point | One-stage | - / 79.57 / - |
| OMSA | Point | One-stage | 91.53 / 82.81 / 80.25 |
Table 3. Experimental results on the KITTI validation dataset on the pedestrian and cyclist categories.

| Method | Representation Format | Type | Pedestrian (Easy/Mod/Hard) | Cyclist (Easy/Mod/Hard) |
| --- | --- | --- | --- | --- |
| VoxelNet [4] | Voxel | One-stage | 57.86 / 53.42 / 48.87 | 67.17 / 47.65 / 45.11 |
| 3DSSD [14] | Point | One-stage | 54.64 / 44.27 / 40.23 | 82.48 / 64.10 / 56.90 |
| IA-SSD [37] | Point | One-stage | - / 58.91 / - | - / 71.24 / - |
| OMSA | Point | One-stage | 69.59 / 62.47 / 56.77 | 92.41 / 73.24 / 68.85 |
Table 4. Experimental results on the KITTI test dataset on the car category.

| Method | Representation Format | Type | Car (Easy/Mod/Hard) |
| --- | --- | --- | --- |
| VoxelNet [4] | Voxel | One-stage | 77.47 / 65.11 / 57.73 |
| SECOND [5] | Voxel | One-stage | 84.65 / 75.96 / 68.71 |
| IPOD [38] | Point | Two-stage | 79.75 / 72.57 / 66.33 |
| AVOD + Feature Pyramid [39] | Point + RGB | Two-stage | 81.94 / 71.88 / 66.38 |
| Frustum PointNets [40] | Point + RGB | Two-stage | 81.20 / 70.39 / 62.19 |
| PointPillars [41] | Voxel | One-stage | 82.58 / 74.31 / 68.99 |
| PointRCNN [33] | Point | Two-stage | 86.96 / 75.64 / 70.70 |
| F-ConvNets [42] | Point | Two-stage | 85.88 / 76.51 / 68.08 |
| PV-RCNN [34] | Voxel + Point | Two-stage | 90.25 / 81.43 / 76.82 |
| 3DSSD [14] | Point | One-stage | 88.36 / 79.57 / 74.55 |
| Joint [43] | Point | Two-stage | 87.74 / 78.96 / 74.30 |
| PC-RGNN [44] | Point | Two-stage | 89.13 / 79.90 / 75.54 |
| ASCNet [36] | Point | Two-stage | 88.48 / 81.67 / 76.93 |
| IA-SSD [37] | Point | One-stage | 88.34 / 80.13 / 75.04 |
| OMSA | Point | One-stage | 87.28 / 76.18 / 71.81 |
Table 5. Experimental results on the KITTI test dataset on the pedestrian and cyclist categories.

| Method | Representation Format | Type | Pedestrian (Easy/Mod/Hard) | Cyclist (Easy/Mod/Hard) |
| --- | --- | --- | --- | --- |
| VoxelNet [4] | Voxel | One-stage | 39.48 / 33.69 / 31.51 | 61.22 / 48.39 / 44.37 |
| SECOND [5] | Voxel | One-stage | 45.31 / 35.52 / 33.14 | 75.83 / 60.82 / 53.67 |
| IPOD [38] | Point | Two-stage | 56.92 / 44.68 / 42.39 | 71.40 / 53.46 / 48.34 |
| AVOD + Feature Pyramid [39] | Point + RGB | Two-stage | 50.80 / 42.81 / 40.88 | 64.00 / 52.18 / 46.61 |
| Frustum PointNets [40] | Point + RGB | Two-stage | 51.21 / 44.89 / 40.23 | 71.96 / 56.77 / 50.39 |
| PointPillars [41] | Voxel | One-stage | 51.45 / 41.92 / 38.89 | 77.10 / 58.65 / 51.92 |
| PointRCNN [33] | Point | Two-stage | 47.98 / 39.37 / 36.01 | 74.96 / 58.82 / 52.53 |
| F-ConvNets [42] | Point | Two-stage | 52.37 / 45.61 / 41.49 | 79.58 / 64.68 / 57.03 |
| PV-RCNN [34] | Voxel + Point | Two-stage | 52.17 / 43.29 / 40.29 | 78.60 / 63.71 / 57.65 |
| ASCNet [36] | Point | Two-stage | 42.00 / 35.76 / 33.69 | 78.41 / 65.10 / 57.87 |
| IA-SSD [37] | Point | One-stage | 46.51 / 39.03 / 35.60 | 78.35 / 61.94 / 55.70 |
| OMSA | Point | One-stage | 52.26 / 50.41 / 38.20 | 79.53 / 62.30 / 55.84 |