4.4. Ablation Experiments
We first conducted an ablation study on the hyperparameter α in the SA-IoU loss function by evaluating five different settings. The results show that the model achieves optimal performance when α = 0.5 (and hence 1 − α = 0.5), with an mAP50 of 94.9% and an mAP50–95 of 59.5%. As shown in Table 3, although precision and recall fluctuate slightly across settings, the configuration with α = 0.5 yields the most balanced performance on all metrics, indicating that this value provides the best trade-off between classification capability and localization accuracy. In this section, bolded values in all tables indicate the best results.
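For reference, and stated here only as a shorthand for the formulation given earlier in the paper (the symbols L_shape and L_IoU are illustrative), the weighting studied in this ablation can be read as a linear combination of a shape-alignment term and the standard overlap term:

L_SA-IoU = α · L_shape + (1 − α) · L_IoU,  α ∈ [0, 1],

so that α = 0.5 weights shape alignment and box overlap equally, which matches the balanced behavior observed in Table 3.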
To evaluate the effectiveness of each proposed module, we conduct comprehensive ablation experiments on the ODinMJ dataset. The results, covering both detection performance and model complexity, are summarized in Table 4 and Table 5.
Experiment A serves as the baseline and adopts the improved dual-backbone YOLOv8n structure, achieving a precision of 95.4%, recall of 92.5%, mAP50 of 96.7%, and mAP50–95 of 69.9%.
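For clarity, the following is a minimal sketch of such a dual-backbone layout; the names (DualBackbone, conv_block), channel widths, and number of stages are illustrative assumptions and do not reproduce the exact RTMF-Net configuration.

```python
# Minimal sketch of a dual-backbone layout (illustrative only; the real RTMF-Net
# backbones, channel widths, and fusion points follow the paper, not this code).
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=2):
    # Simple Conv-BN-SiLU stage standing in for one backbone level.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class DualBackbone(nn.Module):
    """Two parallel branches (visible + infrared) producing per-level feature pairs."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        self.rgb = nn.ModuleList()
        self.ir = nn.ModuleList()
        c_prev_rgb, c_prev_ir = 3, 1  # RGB image and single-channel thermal image
        for c in channels:
            self.rgb.append(conv_block(c_prev_rgb, c))
            self.ir.append(conv_block(c_prev_ir, c))
            c_prev_rgb = c_prev_ir = c

    def forward(self, rgb, ir):
        feats = []  # list of (rgb_feat, ir_feat) pairs handed to the fusion stage
        for rgb_stage, ir_stage in zip(self.rgb, self.ir):
            rgb, ir = rgb_stage(rgb), ir_stage(ir)
            feats.append((rgb, ir))
        return feats

# Example: paired 640x640 inputs produce three feature pairs at strides 2, 4, and 8.
pairs = DualBackbone()(torch.randn(1, 3, 640, 640), torch.randn(1, 1, 640, 640))
```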
In Experiment B, the LPConv module is introduced into the infrared branch to enhance the modeling of fine-grained texture information in thermal images. This leads to a notable increase in both detection precision and robustness, raising the mAP50 to 97.8% and the mAP50–95 to 72.7%.
Experiment C replaces the standard C2f module in the visible-light branch with the proposed C2f_FEM, which strengthens multi-scale and shape-aware representations. As a result, the model achieves an mAP50 of 97.7% and mAP50–95 of 72.8%.
Experiment D incorporates the proposed Weighted Denoising Fusion Module during the feature fusion stage. By adaptively suppressing background noise and recalibrating modality contributions, the model further improves to an mAP50 of 97.9% and mAP50–95 of 73.1%.
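As a rough illustration of how such a fusion step can adaptively reweight modalities and smooth noisy responses, the sketch below gates the two feature maps with learned weights and applies a small convolutional denoiser; the class name WeightedDenoiseFusion and its internal structure are assumptions for illustration, not the paper's exact module.

```python
# Illustrative weighted fusion with learned modality weights and a light denoising
# convolution; the actual Weighted Denoising Fusion Module may differ in structure.
import torch
import torch.nn as nn

class WeightedDenoiseFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Per-modality weights predicted from globally pooled statistics of both inputs.
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 2),
            nn.Softmax(dim=-1),
        )
        # 3x3 convolution acting as a simple learned denoiser on the fused map.
        self.denoise = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, f_rgb, f_ir):
        stats = torch.cat([f_rgb.mean(dim=(2, 3)), f_ir.mean(dim=(2, 3))], dim=1)
        w = self.gate(stats)  # shape (B, 2): relative modality contributions
        fused = w[:, 0, None, None, None] * f_rgb + w[:, 1, None, None, None] * f_ir
        return self.denoise(fused)

fused = WeightedDenoiseFusion(64)(torch.randn(2, 64, 80, 80), torch.randn(2, 64, 80, 80))
```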
Experiment E explores the use of SA-IoU as the training loss function to enhance shape-level discrimination. Although the loss modification does not alter the inference architecture, the network benefits from stronger shape alignment during training, resulting in an mAP50 of 97.1% and mAP50–95 of 71.1%.
Experiments F, G, and H progressively integrate the aforementioned modules to assess their combined effects. Experiment F, which fuses LPConv and C2f_FEM, achieves an mAP50 of 98.0% and mAP50–95 of 73.9%. Further inclusion of the Weighted Denoising Fusion Module in Experiment G improves these values to 98.3% and 75.0%, respectively. When all proposed modules are utilized in Experiment H, the model reaches the best performance with a precision of 97.4%, recall of 96.3%, mAP50 of 98.7%, and mAP50–95 of 75.3%, demonstrating the complementary nature and cumulative benefits of the proposed improvements.
In terms of model complexity, the baseline model in Experiment A has 4.3 million parameters, 12.3 GFLOPs, and an inference time of 0.5 ms. Introducing LPConv in Experiment B leaves the parameter count unchanged while slightly reducing GFLOPs to 12.1 and adding only minimal inference delay. The C2f_FEM module in Experiment C maintains comparable complexity. Experiment D shows a moderate increase in parameters and GFLOPs due to the addition of the Weighted Denoising Fusion Module but remains within an efficient range. Since SA-IoU operates only during training, Experiment E retains the same complexity as the baseline. The integration of multiple modules in Experiments F to H demonstrates that the proposed architecture remains efficient even in its most complete form. Specifically, Experiment H yields the highest accuracy while keeping the parameter count at 4.2 M, GFLOPs at 11.6, and inference time within 0.9 ms, validating the practicality of the overall design.
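For readers who wish to reproduce figures of this kind, the helpers below show a common way to obtain parameter counts and average inference latency in PyTorch; GFLOPs are usually measured with an external profiler such as thop or fvcore. This is a generic sketch rather than the authors' exact measurement script.

```python
# Generic complexity measurement: parameter count from the module's parameters and
# latency averaged over repeated, GPU-synchronized forward passes.
import time
import torch

def count_parameters(model):
    # Millions of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_latency(model, inputs, warmup=50, runs=200):
    model.eval()
    for _ in range(warmup):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3  # milliseconds per forward pass
```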
We also conducted ablation experiments on the LLVIP dataset to further validate the effectiveness of each proposed module. The evaluation results are presented in Table 6 (detection accuracy) and Table 7 (model complexity).
Experiment A serves as the baseline, utilizing the dual-branch YOLOv8n without any of the proposed improvements. It achieves a Precision of 93.9%, Recall of 87.6%, mAP50 of 94.4%, and mAP50–95 of 57.8%, providing a reference point for subsequent enhancements.
Experiment B introduces the LPConv module into the infrared branch. This module enhances the modeling of fine-grained thermal textures, improving the network’s sensitivity to subtle heat signatures. Consequently, Recall increases to 89.2% and mAP50–95 reaches 60.1%, marking a 2.3% gain over the baseline.
Experiment C replaces the standard C2f module in the visible-light branch with the proposed C2f_FEM module, designed to better capture multi-scale and shape-aware features. This yields a Precision of 95.0%, Recall of 89.1%, and mAP50–95 of 60.3%.
Experiment D incorporates the Weighted Denoising Fusion Module during the multi-modal feature fusion phase. By adaptively emphasizing informative cross-modal features and suppressing background noise, the module improves Recall to 89.6% and mAP50–95 to 60.5%.
Experiment E replaces the conventional CIoU loss with the SA-IoU loss function to strengthen shape-level alignment and perception. While not changing the network architecture, this modification enhances training effectiveness, resulting in an mAP50–95 of 59.5%.
Experiment F combines LPConv and C2f_FEM, enhancing both thermal and visible branches simultaneously. This synergistic configuration improves Precision and Recall to 95.1% and 90.2%, respectively, with mAP50–95 reaching 61.2%.
Experiment G adds Weighted Denoising Fusion Module on top of the dual-branch enhancements, further boosting multi-modal interaction and yielding a Precision of 95.5%, Recall of 90.8%, and mAP50–95 of 61.4%.
Finally, Experiment H integrates all proposed modules (LPConv, C2f_FEM, the Weighted Denoising Fusion Module, and SA-IoU) into the full architecture. This complete configuration achieves the best overall performance, with a Precision of 95.3%, Recall of 91.3%, mAP50 of 95.7%, and mAP50–95 of 61.3%, demonstrating the effectiveness and complementarity of the proposed components.
In terms of model complexity and inference efficiency, Table 7 summarizes the number of parameters, GFLOPs, and inference time for each experiment. The baseline model in Experiment A contains 4.3 M parameters and 11.7 GFLOPs, with an inference time of 0.4 ms. Adding the LPConv module in Experiment B slightly increases GFLOPs to 12.1 and doubles the inference time to 0.8 ms. Experiment C reduces the parameters to 4.0 M and lowers GFLOPs to 11.1 while maintaining a moderate inference time. Adding the Weighted Denoising Fusion Module in Experiment D raises the parameter count slightly to 4.4 M with small increases in GFLOPs and inference time. Using the SA-IoU loss in Experiment E does not affect complexity or speed. The combined modules in Experiments F to H keep the parameters around 4.2 M, GFLOPs near 11.6, and inference times between 0.7 and 0.8 ms, demonstrating a good balance between accuracy and efficiency.
To thoroughly assess the effectiveness and superiority of the proposed method, several representative state-of-the-art object detection models, including YOLOv5s, YOLOv10n, and SuperYOLO, are selected for comparison, with the results summarized in Table 8. The improved model, RTMF-Net, significantly outperforms the comparison methods in detection accuracy, achieving an mAP50 of 98.7%, the highest among all models. Meanwhile, RTMF-Net also performs well in terms of computational complexity, model size, and inference speed: it requires only 4.2 million parameters and 11.6 GFLOPs of computation, and achieves the lowest inference time of 0.8 ms, substantially lower than the other models. These results highlight its highly lightweight design. In summary, RTMF-Net effectively reduces computational overhead while maintaining outstanding detection performance. The mAP comparison curves are provided in Figure 7.
The generalization capability and practical applicability of the proposed model were assessed through comparative experiments on the representative multimodal dataset LLVIP, with the results summarized in Table 9. Although RTMF-Net shows slightly lower detection accuracy than some more complex models, it offers significant advantages in model size and computational complexity. RTMF-Net has the lowest parameter count and computational cost among all compared models, greatly reducing dependence on hardware resources and exhibiting excellent lightweight characteristics and deployment friendliness. Compared with the best-performing model, U2Fusion, RTMF-Net's mAP is only about 1.0% lower, while its parameter count and computational load are substantially smaller. This highlights RTMF-Net's ability to maintain high detection performance while effectively controlling model complexity. These results further confirm RTMF-Net's superiority in balancing accuracy and efficiency, demonstrating strong generalization ability and promising deployment potential in practice.
Moreover, to further evaluate detection quality across different object sizes and detection thresholds, we adopt the COCO-style metrics on both the LLVIP and ODinMJ datasets.
On the LLVIP dataset, the model achieves an overall Average Precision (AP) of 56.7% under the IoU = 0.50:0.95 metric, with APs of 91.4% and 63.6% at IoU thresholds of 0.5 and 0.75, respectively. Regarding object sizes, the AP for large objects reaches 58.2%, while medium objects yield an AP of 22.0%. It is worth noting that the AP for small objects is reported as −1.0, indicating that the validation set contains no small-sized objects, so no valid AP can be computed at that scale. The detailed results are presented in Table 10.
In contrast, the performance on the ODinMJ dataset shows a substantial improvement across all metrics. The model achieves an overall AP of 69.0% and performs strongly at the standard IoU thresholds, with 94.6% AP at IoU = 0.5 and 78.3% at IoU = 0.75. Notably, detection results for small, medium, and large objects all show significant gains, with APs of 56.6%, 72.7%, and 75.8%, respectively, demonstrating improved robustness in multi-scale object detection. The results on the ODinMJ dataset are presented in Table 11.
For recall, the LLVIP dataset yields an overall AR of 29.6% at maxDets = 1 and 63.1% at maxDets = 10 or 100. Meanwhile, the ODinMJ dataset presents a higher AR of 38.3% at maxDets = 1, and a consistent 72.9% at maxDets = 10 and 100. The ODinMJ recall performance is notably stronger across all object scales—AR values reach 61.6%, 76.3%, and 79.2% for small, medium, and large objects, respectively.
These results collectively demonstrate that the proposed method maintains superior scalability and robustness across datasets, particularly excelling in challenging multi-scale detection scenarios on ODinMJ.
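The COCO-style numbers above are typically produced with the standard pycocotools evaluator, as sketched below; the annotation and detection file names are placeholders. COCOeval reports −1 for size ranges that contain no ground-truth instances, which is why the small-object AP on LLVIP appears as −1.0.

```python
# Sketch of a standard COCO-style evaluation with pycocotools (file names are
# placeholders). evaluator.stats holds the 12 summary values printed by summarize().
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val.json")   # ground-truth annotations
coco_dt = coco_gt.loadRes("detections_val.json")   # model detections in COCO format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP50, AP75, AP_S/M/L, AR@1/10/100, AR_S/M/L
ap, ap50, ap75, ap_small, ap_medium, ap_large = evaluator.stats[:6]
```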
To further validate the effectiveness of the proposed lightweight design, we conduct a comparative study with other representative lightweight backbones, namely MobileNetV3 and ShuffleNetV2. As shown in Table 12, our model has 4.2 million parameters, which is larger than ShuffleNetV2 with 0.64 million and MobileNetV3 with 0.23 million, yet its computational complexity remains moderate at 11.6 GFLOPs. In addition, the inference time of our model is 0.8 ms, only slightly higher than ShuffleNetV2 at 0.6 ms and MobileNetV3 at 0.4 ms. These results demonstrate that the proposed design achieves a balanced trade-off between model size, computational cost, and detection performance, highlighting its suitability for real-time applications.