4.3.1. Comparison Experiments
- (1) Comparison Experiments on the Reconstructed Datasets FloW-BEV and WaterScenes-BEV
To evaluate the efficacy of the proposed 3D object detection algorithm RCF-Free in surface environments, a comparative analysis of detection performance was conducted on the self-constructed dataset FloW-BEV. The results are summarized in Table 3.
CBR is adopted as the baseline model in this study. As illustrated in Table 3, RCF-Free achieves the highest performance across both evaluation metrics on the validation set of the FloW-BEV dataset. Specifically, compared to FE-YOLOv5n, YOLOv9s-Hungarian, MFNet, and CBR, RCF-Free exhibits improvements in
of 33.1%, 36.0%, 29.5%, and 3.6%, respectively. Corresponding gains in mIOU are 17.1%, 17.2%, 2.3%, and 2.2%. These results underscore the superior capability of RCF-Free in detecting floating objects on water surfaces.
It is also observed that the expanded pixel strategy used in FE-YOLOv5n contributes to certain performance improvements in such scenarios, yielding noticeably higher accuracy than the YOLOv9s-Hungarian approach. However, conventional decision-level fusion methods still exhibit limitations. The extracted point cloud bounding boxes tend to be coarse, potentially incorporating adjacent points erroneously and thus reducing . In contrast, both the CBR algorithm and RCF-Free show substantially better performance in terms of .
Even in cases of misdetection, the point cloud boxes obtained through decision-level fusion still partially overlap with ground truth boxes. Therefore, for small objects, RCF-Free yields a notable improvement in , with a more moderate yet clear increase in mIOU. Overall, the proposed method exhibits superior comprehensive performance.
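The point that partially overlapping boxes still contribute to mIOU can be illustrated with a minimal sketch. It assumes axis-aligned BEV rectangles and predictions that are already matched one-to-one with ground truth boxes; the exact mIOU definition and matching rule used in the experiments are not stated in this section, so the names `bev_iou` and `mean_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in BEV coordinates.
Box = Tuple[float, float, float, float]


def bev_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned BEV rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs: Sequence[Tuple[Box, Box]]) -> float:
    """Average IoU over already-matched (prediction, ground truth) pairs."""
    if not pairs:
        return 0.0
    return sum(bev_iou(pred, gt) for pred, gt in pairs) / len(pairs)
```

Under this assumed definition, a coarse box that only partly covers its target still yields a non-zero IoU, which is consistent with mIOU varying less across methods than a thresholded detection metric.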
In addition, to test the sensitivity of the model to random initialization, we removed the fixed random seed in PyTorch and conducted 8 complete training runs, selecting the best model from each run. In the test results, the average is 60.5%, with a maximum of 61.3% and a minimum of 59.8%. The average mIOU is 47.2%, with a maximum of 48.1% and a minimum of 46.3%. The standard deviation of is 0.5%, and the standard deviation of mIOU is 0.7%. We also calculated the standard deviation on WaterScenes-BEV and DAIR-V2X-I.
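The run-to-run statistics reported for the seed-sensitivity test can be computed with a short helper such as the sketch below. The function name `summarize_runs` is hypothetical, and whether the reported standard deviation is the population or the sample estimate is not stated, so both are exposed through a flag.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float], sample: bool = False) -> Dict[str, float]:
    """Summarize one metric over repeated training runs.

    `scores` holds one value per run (for example, the best checkpoint
    of each run). `sample=True` uses the n-1 estimator; the default is
    the population standard deviation. Which one the reported figures
    use is an assumption.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    std = statistics.stdev(scores) if sample else statistics.pstdev(scores)
    return {
        "mean": statistics.fmean(scores),
        "std": std,
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder values for illustration only; these are NOT the per-run
# results of the experiments, which are not listed in the text.
placeholder_scores = [60.0, 61.0, 60.5, 60.5]
summary = summarize_runs(placeholder_scores)
```

With only 8 runs, the sample and population estimates differ noticeably, so stating which one is reported makes the figures easier to compare across datasets.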
Table 4 presents a comparative assessment of RCF-Free against other detection methods on the reconstructed WaterScenes-BEV dataset. The results affirm the high accuracy of RCF-Free in detecting water vessels. It significantly outperforms decision-level fusion methods (FE-YOLOv5n, YOLOv9s-Hungarian, MFNet) and the point-cloud-based detector PointPillars, with improvements of 64.1%, 63.0%, 62.5%, and 22.7%, respectively. This substantial gap underscores the challenge that sparse, noisy maritime radar point clouds pose for geometry-only or late-fusion approaches. By leveraging high-quality image features enhanced with point cloud information through its fusion architecture, RCF-Free also surpasses the vision-only baseline CBR by 1.9% in
and 2.4% in mIOU, demonstrating the clear benefit of multimodal integration.
Including state-of-the-art BEV detection methods in the comparison is also informative. PointPillars, designed for denser LiDAR point clouds, performs poorly in this sparse radar scenario. Both BEVFormer and BEVDepth, which rely on precise calibration, achieve strong results. Notably, BEVFormer exceeds RCF-Free’s score, while BEVDepth performs slightly worse, a trend opposite to their ranking on the DAIR-V2X autonomous driving dataset. This discrepancy can be attributed to the domain-specific optimizations of these methods. BEVDepth’s design, particularly its depth estimation module (LSS), is heavily optimized for the structural and depth distribution priors of ground-based autonomous driving scenes (e.g., cars, roads). The aquatic environment, with its flat surface, different target size distributions, and unique clutter patterns, represents a significant domain shift in which these priors may not hold, leading to suboptimal performance. In contrast, BEVFormer’s transformer-based view transformation mechanism generalizes better. Nevertheless, RCF-Free, without requiring precise calibration parameters, achieves performance on par with these calibration-dependent state-of-the-art methods, which underlines its practical value for real-world aquatic deployment, where calibration is often unreliable.
- (2) Comparison Experiments on the Public Dataset DAIR-V2X-I
To further evaluate the effectiveness and generalization capability of the proposed algorithm, we compared RCF-Free with several state-of-the-art 3D object detection methods on the public autonomous driving dataset DAIR-V2X-I. The results are summarized in Table 5.
As shown in Table 5, RCF-Free achieves improvements of 1.3%, 1.1%, and 1.1% over the baseline CBR across the easy, moderate, and hard difficulty levels, respectively. Under the weak-calibration setting, RCF-Free ranks third overall among all methods in all three difficulty categories. It outperforms calibration-dependent radar-based methods by margins of at least 1.8%, 7.2%, and 7.2% across the three levels. Compared to multimodal methods that require calibration, the improvements are 2.3%, 7.5%, and 7.4%, respectively. Relative to vision-only methods relying on calibration, RCF-Free surpasses ImVoxelNet by 29.1%, 23.6%, and 23.6%, and BEVFormer by 11.9%, 10.5%, and 10.5%.
These results affirm that although RCF-Free is specifically designed for the unique challenges of aquatic environments, it generalizes effectively to terrestrial autonomous driving scenarios and retains robust detection capability. This performance should be read in context: the absolute accuracy of RCF-Free remains below that of the latest calibration- and depth-supervised methods (e.g., BEVDepth, BEVHeight, and CUDA-V2XFusion), and this gap reflects a fundamental design trade-off. Methods like BEVDepth and CUDA-V2XFusion achieve superior performance by leveraging precise calibration and computationally intensive depth estimation networks, which incur high inference costs and pose challenges for real-time edge deployment. In contrast, RCF-Free forgoes this dependency and operates without accurate calibration parameters as input, thereby eliminating the associated deployment complexity and cost as well as the risk of performance degradation from calibration drift.
Furthermore, this practicality is reflected in the model’s efficiency. The parameter count of RCF-Free is only about 2% larger than that of the CBR baseline (∼410 M vs. ∼400 M), and it is significantly more compact than networks incorporating heavy depth estimation modules (e.g., BEVHeight, ∼880 M). This compactness, combined with its weak-calibration design, makes RCF-Free particularly suitable for scalable and reliable deployment on resource-constrained platforms like USVs, where maintaining precise calibration is often impractical.
In summary, the comparison in Table 5 and the accompanying analysis serve to clearly position our contribution. RCF-Free is not designed to outperform all methods in idealized, calibrated settings but to deliver a robust, efficient, and readily deployable perception solution for calibration-constrained environments where the state-of-the-art, calibration-dependent methods cannot be reliably applied.
4.3.2. Ablation Experiment
To systematically validate the effectiveness of each proposed component, a comprehensive ablation study was conducted. As the baseline model CBR lacks a point cloud branch, the conventional approach of “removing a module” is not directly applicable for evaluating the multimodal fusion. Therefore, we designed the following model variants. RCF-Free-RI incorporates the BEV-Point encoder for processing millimeter-wave radar point clouds and the Triple-Path Cross-View Fusion module for multimodal feature integration. RCF-Free-RI-C incorporates the BEV-Point encoder but omits the Triple-Path Cross-View Fusion module, instead adopting a simple feature concatenation similar to that of CBR to fuse point cloud features before the detection head; this variant tests the necessity of the proposed fusion architecture. RCF-Free-RI-MHA extends RCF-Free-RI with a standard Multi-Head Attention (MHA) mechanism in place of the Mobile Self-Attention Module (MAM), to benchmark the efficiency advantage of the MAM. RCF-Free denotes the complete proposed model, which adds the MAM to RCF-Free-RI.
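The variants described above can be summarized as a small configuration table. The sketch below is only an illustrative reading of the text: the class `VariantConfig`, its field names, and the string labels are hypothetical and do not come from the authors' code, and the RCF-Free-RI-MHA entry assumes that MHA occupies the position the MAM has in the complete model.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class VariantConfig:
    name: str
    use_point_encoder: bool       # BEV-Point encoder for radar point clouds
    point_fusion: Optional[str]   # None, "concat", or "triple_path"
    attention: Optional[str]      # None, "mha", or "mam"


VARIANTS: Dict[str, VariantConfig] = {
    v.name: v
    for v in [
        VariantConfig("CBR", False, None, None),  # baseline, no point cloud branch
        VariantConfig("RCF-Free-RI-C", True, "concat", None),
        VariantConfig("RCF-Free-RI", True, "triple_path", None),
        # Assumption: MHA takes the place MAM has in the full model.
        VariantConfig("RCF-Free-RI-MHA", True, "triple_path", "mha"),
        VariantConfig("RCF-Free", True, "triple_path", "mam"),
    ]
}


def changed_fields(a: VariantConfig, b: VariantConfig) -> Dict[str, Tuple[object, object]]:
    """Return the settings that differ between two variants (name excluded)."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k != "name" and da[k] != db[k]}
```

Calling `changed_fields` on two variants shows which single factor a given comparison isolates; for example, RCF-Free-RI-C versus RCF-Free-RI differs only in the fusion strategy.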
Results are summarized in Table 6, and we analyze them from three perspectives: accuracy, robustness, and efficiency. The comparison between RCF-Free-RI-C and the baseline CBR is particularly revealing. On the FloW-BEV dataset, the CBR-style concatenation fusion brings only a slight performance improvement (+0.3%). More critically, on the DAIR-V2X dataset, this variant causes a significant performance drop (a 3.5% decrease at the Easy level). This indicates that in scenarios with larger-scale or richer point clouds (e.g., DAIR-V2X), a crude fusion strategy can introduce noise or lead to feature competition between modalities, thereby degrading the performance of the original visual model. In contrast, RCF-Free-RI, equipped with our proposed Triple-Path Cross-View Fusion module, achieves consistent improvements across all datasets. This supports the conclusion that the cross-view attention mechanism, rather than the mere addition of point cloud features, is what makes the multimodal fusion effective and beneficial.
A decomposition of each module’s contribution is evident on the FloW-BEV dataset. RCF-Free-RI achieves a +2.6% gain in over CBR, attributable primarily to the effective representation of sparse point clouds by the BEV-Point encoder and the high-quality feature integration enabled by the Triple-Path Cross-View Fusion. Building upon this, incorporating the MAM to form the complete RCF-Free model delivers a further +1.0% improvement. This verifies that the MAM, by enhancing spatial contextual understanding, further refines the front-view visual features for more precise alignment with BEV features. A consistent trend of incremental gains from each module is also observed on the more complex WaterScenes-BEV dataset.
We further analyzed the model’s efficiency on the FloW-BEV dataset. Introducing the multimodal fusion core (RCF-Free-RI) inevitably increases computational cost while delivering significant accuracy gains, with latency rising from 5.1 ms to 7.9 ms. A key finding concerns the efficiency of our proposed lightweight MAM versus standard Multi-Head Attention (MHA). Comparing RCF-Free-RI-MHA and RCF-Free, both achieve comparable accuracy, but RCF-Free exhibits lower latency and higher FPS. This confirms that the MAM maintains powerful contextual modeling capabilities while offering superior computational efficiency. Ultimately, the complete RCF-Free model delivers the best detection accuracy with a latency of approximately 8.2 ms, demonstrating its strong potential for real-time deployment.
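The latency figures quoted above can be related to throughput with a simple conversion. The sketch below assumes single-sample inference with no batching or pipelining, so that FPS is approximately 1000 divided by the latency in milliseconds; the FPS values in Table 6 may have been measured differently, and the helper names are hypothetical.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Approximate frames per second for a per-frame latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def relative_increase(baseline: float, new: float) -> float:
    """Fractional increase of `new` over `baseline` (0.5 means +50%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Latencies quoted in the text (ms): CBR, RCF-Free-RI, RCF-Free.
latencies_ms = {"CBR": 5.1, "RCF-Free-RI": 7.9, "RCF-Free": 8.2}
approx_fps = {name: fps_from_latency_ms(ms) for name, ms in latencies_ms.items()}
overhead = relative_increase(latencies_ms["CBR"], latencies_ms["RCF-Free"])
```

Under this assumption, the quoted latencies correspond to roughly 196, 127, and 122 FPS, and the complete model adds about 61% latency relative to CBR while remaining well above typical camera frame rates.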
The practical significance of these results extends beyond the absolute metrics. Although the absolute improvement from CBR to RCF-Free on WaterScenes-BEV (+1.9%) may appear modest, the geometric and motion information provided by radar point clouds cannot be obtained from images alone in real-world aquatic scenarios. This improvement often translates to the system’s ability to avoid severe missed or false detections when the visual sensor is compromised by glare, fog, or nighttime conditions, which is paramount for USV safety. Therefore, the ablation study validates not just a numerical increase but, more importantly, an enhancement in perception reliability and redundancy in complex environments.