1. Introduction
With the vigorous development of global maritime trade and the growing demand for maritime security, ship detection technology has become a research frontier in the field of maritime transportation. Its accuracy and efficiency directly affect the practical effectiveness of fields such as intelligent shipping and maritime law enforcement. However, infrared images in complex marine environments face numerous challenges; for example, complex textures formed by sea waves lead to severe background interference, and long-distance or small ships exhibit weak features, unclear contours, and minimal grayscale differences from the background. These overlapping issues severely restrict the detection robustness of existing algorithms, highlighting an urgent need for solutions that combine high accuracy with lightweight properties [
1]. To address the aforementioned challenges, developing a high-precision and lightweight infrared ship detection algorithm adaptable to complex sea conditions holds significant importance.
Existing lightweight architectures such as YOLOv11n, despite their advantages in inference speed, still have limitations in infrared ship detection tasks: insufficient edge feature extraction capability easily leads to missed detection of weak targets; information redundancy during the downsampling process increases computational burden; and the task coupling issue in the detection head reduces the model’s adaptability to complex scenarios. To this end, this study takes YOLOv11n as the basic architecture and, aiming to address the above limitations, constructs an efficient and lightweight infrared ship detection algorithm called DSEE-YOLO (Dynamic Ship Edge-Enhanced YOLO) through three module innovations combined with model pruning and distillation technologies. The specific innovations of this work are as follows:
1. To address blurred contours in infrared ship detection, we replace the C3k2 module's bottleneck with the MultiScaleEdgeFusion module. This enhances edge information and improves weak-target discriminability through multi-scale feature fusion, without added computational cost.
2. To resolve information redundancy in multi-scale feature extraction, depthwise separable convolution (DSConv) is used to reconstruct the downsampling path, decoupling spatial convolution from channel fusion. This compresses parameters and computational load while retaining key features.
3. The DyTaskHead dynamically aligns task-specific features to enhance downstream task interactions. It employs depthwise separable convolution to significantly reduce parameter redundancy, thereby improving adaptability to complex scenarios while making the model more lightweight.
4. Redundant pruning is performed on the model integrated with the above modules. With the pruned model as the student network and the unpruned model as the teacher network, self-distillation is implemented to further improve model accuracy, ultimately forming the DSEE-YOLO algorithm.
The remaining structure of this paper is organized as follows:
Section 2 reviews the progress of related research;
Section 3 elaborates on the model architecture of DSEE-YOLO and its core improvement methods;
Section 4 verifies the algorithm’s performance through comparative experiments and ablation experiments;
Section 5 discusses the model’s advantages and applicable scenarios in depth with visualization results;
Section 6 summarizes the full text and proposes future work.
2. Related Work
In infrared ship target detection, traditional methods rely on manually designed features such as HOG [
2] and SIFT [
3] combined with classical machine learning classifiers like SVMs [
4] and AdaBoost [
5]. However, these approaches exhibit obvious limitations. In infrared images, the temperature difference between ships and the background is weak, edges are blurred, and ships are easily disturbed by pseudo-heat sources such as spray and foam, making it difficult for manual features to accurately depict the thermal distribution characteristics of ships and leading to a significant reduction in distinguishability.
In contrast, deep learning-based detection techniques significantly improve detection performance by enabling convolutional neural networks to automatically extract multi-scale semantic features. Among them, two-stage detection algorithms (e.g., Faster R-CNN [
6]) have achieved breakthroughs in detection accuracy by leveraging an RPN (Region Proposal Network) to generate candidate boxes with adaptive IoU thresholds. Single-stage detectors adopt a feature pyramid fusion strategy, which effectively reduces the missed detection rate for small targets while ensuring real-time performance, thus providing core technical support for the construction of all-weather anti-interference infrared maritime monitoring systems.
In the field of infrared ship detection, deep learning-based single-stage object detection algorithms continue to optimize detection performance in complex sea surface scenarios. In 2017, YOLOv2 [
7] reduced the missed detection of small targets by virtue of DarkNet19 and anchor box optimization, while RetinaNet [
8] suppressed the interference of ocean wave noise using focal loss. In 2018, YOLOv3 [
9] improved the AP value for long-distance ship recognition with a three-level feature pyramid. In 2020, YOLOv4 [
10] integrated PANet and mosaic augmentation, maintaining a high mAP even under low-visibility conditions, while YOLOv5 achieved real-time high-precision detection through adaptive anchor boxes and SPPF. In 2022, YOLOv6 [
11] introduced RepVGG and anchor-free mechanisms to reduce the missed detection rate, while YOLOv7 [
12] enhanced the recognition capability of occluded targets by means of E-ELAN. In 2023, YOLOv8 introduced C2f to enhance multi-scale performance, while, in 2024, YOLOv9 [
13] optimized contour detection using deformable convolution, and the C2f-Faster module of YOLOv10 [
14] compressed the number of parameters. The latest SOTA (state-of-the-art) algorithm, YOLOv11, reduces the number of parameters through the C3k2 module, focuses on weak signals by integrating the C2PSA attention mechanism, and greatly reduces the missed detection rate with a decoupled detection head and an improved CIoU.
The iterative upgrades of the aforementioned algorithms have not only laid a solid foundation for improving ship detection performance but also driven researchers to explore optimization paths in multiple dimensions to address the complexity of the sea surface environment.
In terms of feature optimization and fusion, researchers have been continuously improving the discriminative ability of models through feature enhancement and structured fusion. Chen et al. [
15] introduced the FT algorithm into YOLOv3 to strengthen semantic features and adopted Soft-NMS to optimize box-screening. Chen et al. [
16] proposed CSD-YOLO, which focuses on key information and fuses the multi-scale features of SAR through the SAS-FPN module. Li et al. [
17] designed a feature-focused diffusion pyramid, and combined the ADown module with an improved C2f structure to enhance the features of the central region. Zhang et al. [
18] embedded a multi-scale residual module into YOLOv7-tiny to improve adaptability to complex water surface environments. Wang et al. [
19] constructed NST-YOLO11, which utilizes MSDA (multi-scale dynamic attention) to fuse the global semantic capture ability of ViT. Such studies echo earlier efforts: Liu et al. [
20] (PJ-YOLO) integrated prior knowledge; Zhou et al. [
21] (YOLO-SWD) designed a feature compensation mechanism; and Ha et al. [
22] (YOLO-SR) implemented feature recalibration. These works collectively improved feature quality but commonly encountered issues such as increased computational costs [
16,
20,
21] or false detections in complex scenarios [
17,
18].
Targeting lightweight design improvements, model efficiency optimization has become a key aspect for practical deployment. Although Zhao et al. [
23] did not directly make the model lightweight, they indirectly improved efficiency through K-means++ clustering of anchor boxes. Yue et al. [
24] combined MobileNetv2 with YOLOv4 and achieved compression based on channel pruning. Guo et al. [
25] improved the lightweight backbone network of the YOLO model to realize cross-scale feature fusion and reduce convolution computation, but this method suffers from missed target detection. Du et al. [
26] improved YOLOv8 and added an attention mechanism to achieve lightweight traffic sign detection. These studies complement the efforts of Shen et al. [
27] (who streamlined the feature pyramid in YOLO-LPSS) and Sanikommu et al. [
28] (focused on edge computing deployment), collectively addressing efficiency bottlenecks. However, they still struggle to avoid the accuracy losses caused by lightweight design, such as the degradation in large-target detection [
27] and insufficient sensitivity to small targets [
18,
26].
In terms of multi-scale and scale adaptation, optimization strategies addressing the large variation in ship scales have continued to evolve. Ma et al. [
29] improved SP-YOLOv8s and retained fine-grained features to enhance tiny-object accuracy. Such studies form a technical closed loop with the works of Yuan et al. [
30] (AM YOLO for adaptive multi-scale performance) and Huang et al. [
31] (ADV-YOLO to optimize SAR multi-scale expression). Zhang et al. [
32] focused on the detection of hidden suspicious objects in terahertz images through multi-scale detection. However, the missed detection of small targets remains a common shortcoming [
23,
29,
30,
31], especially in low-resolution infrared or SAR images [
20,
22,
31].
In terms of robustness in complex scenarios, research has also advanced significantly. These efforts synergize with early SAR-specific studies (Ha et al. [
22], Huang et al. [
31]) and remote sensing scene research (Shi et al. [
33]). However, cross-scene generalization—such as from visible light to infrared—remains a challenge [
18,
19,
31].
In summary, iterative advancements and multi-directional optimizations in infrared ship detection have laid a solid foundation for practical use, but existing architectures like YOLOv11 still have limitations. To address these limitations, we propose DSEE-YOLO, an efficient lightweight algorithm based on YOLOv11n to provide a novel effective solution for infrared ship detection.
3. Methodology
This section elaborates on the technical details of the proposed DSEE-YOLO algorithm. The algorithm builds on the YOLOv11n framework and is developed by integrating three innovative modules together with pruning and self-distillation techniques. The overall architecture of DSEE-YOLO is shown in
Figure 1.
3.1. C3k2_MultiScaleEdgeFusion Module
In infrared ship detection, thermal reflection from ocean waves and diffuse reflection from clouds or fog reduce the thermal contrast between ships and the background. This attenuation induces grayscale blurring at target edges, impeding traditional algorithms’ ability to capture subtle variations and leading to localization errors. A critical challenge arises from high-frequency noise that mimics genuine edges; such noise is frequently misclassified, increasing the false-positive rate. Since edge information is pivotal for precise localization and recognition and its integrity and reliability are compromised by these factors, targeted edge enhancement becomes essential for improving detection efficacy.
Infrared ship detection is fundamentally challenged by low signal-to-noise ratios and blurred edges. To address this from a theoretical standpoint, we design the MultiScaleEdgeFusion module based on scale-space theory. Real structural edges in infrared images exhibit scale invariance: they persist across multiple spatial scales. By contrast, noise and background clutter, such as wave crests and thermal reflections, are usually transient and scale-specific, appearing prominently only at particular scales. Traditional convolutional layers with fixed receptive fields struggle to distinguish the two: small kernels capture fine-grained details but are easily corrupted by noise, while large kernels improve noise resistance at the cost of blurring fine edges and losing small-target details. Our MultiScaleEdgeFusion module resolves this dilemma with a multi-scale parallel processing strategy.
We integrate the aforementioned MultiScaleEdgeFusion module as a sub-module into the original C3k2 module to further improve the overall performance of the model through edge enhancement. Specifically, the high-frequency residual, which reflects high-frequency edge information, is obtained by extracting the difference signal $F_{hf} = F - \mathrm{AvgPool}_{3\times 3}(F)$ between the original features $F$ and the blurred features after 3 × 3 average pooling. Then, a learnable convolution layer is used to generate an edge weight matrix $W$:

$$W = \sigma(K * F_{hf} + b) \quad (1)$$

so as to selectively enhance real edges and suppress thermal noise. Meanwhile, feature compression, edge enhancement, and upsampling restoration are performed at four scales (3 × 3, 6 × 6, 9 × 9, and 12 × 12), followed by fusion. This not only improves the detection capability for ships of different sizes but also suppresses noise by fusing the edge responses of different scales at the same position: the true edges of ships exhibit high responses at both the 3 × 3 and 12 × 12 scales and are thus retained, whereas noise, which shows a high response only at a specific scale, is weakened by its low responses at the other scales. Combined with the edge weight matrix, the proportion of weight assigned to pseudo-edges is further reduced.
Herein, $\sigma$ represents the Sigmoid function, $W$ denotes the edge weight matrix, $F_{hf}$ stands for the high-frequency residual, $b$ is the basic activation threshold for infrared edges, $K$ refers to the convolution kernel weight, and $*$ is the convolution operator.
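To make Equation (1) concrete, the following is a minimal PyTorch sketch of an edge-enhancement step built from these definitions. The class and variable names, the use of a 1 × 1 convolution for $K$, and the gated re-injection of the residual at the end are our illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class EdgeEnhancer(nn.Module):
    """Minimal sketch of the edge-enhancement step of Equation (1)."""
    def __init__(self, channels: int):
        super().__init__()
        # 3x3 average pooling produces the blurred features
        self.blur = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # learnable layer computing K * F_hf + b; the bias b plays the role
        # of the basic activation threshold for infrared edges
        self.weight_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_hf = x - self.blur(x)                    # high-frequency residual F_hf
        w = self.sigmoid(self.weight_conv(f_hf))   # edge weight matrix W, Eq. (1)
        return x + w * f_hf                        # enhance real edges, suppress noise
```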
To verify the effectiveness of the multi-scale strategy, this study designs multi-scale selection comparative experiments (see
Table 1 for results). The experimental results show that, under different single-scale settings, although the model performance exhibits fluctuations, the overall differences are not significant. Under the dual-scale fusion condition, most combinations fail to systematically outperform the optimal single-scale configuration, indicating that simply increasing the number of scales cannot effectively improve the model’s generalization performance. However, when more scale information is introduced, the model performance is significantly improved. In particular, the four-scale fusion combination achieves the highest values in terms of both recall rate and mAP@0.50, and its mAP@0.50:0.95 also outperforms all comparative settings. The results demonstrate that multi-scale feature fusion can enhance the model’s ability to represent targets of different sizes and reduce detection biases caused by scale variance, thereby improving detection accuracy and robustness.
As shown in
Figure 2, the MultiScaleEdgeFusion sub-module operates as follows (a code sketch of the full pipeline follows the list):
Multi-Scale Sampling: Input features undergo dimensionality reduction via parallel adaptive pooling layers of varying sizes.
Scale-Specific Feature Extraction: After channel compression (1 × 1 convolution), 3 × 3 group convolution extracts scale-specific features, preserving scale differences with reduced computation.
Edge Enhancement: Each output is processed by an independent EdgeEnhancer module to strengthen contours, then upsampled to the original resolution.
Detail Retention: Local details from the original input are preserved via 3 × 3 convolution.
Feature Fusion: Enhanced edge features from all branches and local details are channel-concatenated, integrated by 3 × 3 fusion convolution, outputting an optimized feature map with enhanced edge response and preserved spatial details.
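The following minimal PyTorch sketch assembles the five steps above into one module. It reuses the EdgeEnhancer sketch given after Equation (1); the pool sizes (3, 6, 9, 12) follow the paper, while the channel compression ratio, depthwise grouping, and bilinear interpolation mode are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEdgeFusion(nn.Module):
    """Sketch of the five-step pipeline; hyperparameters other than the
    pool sizes are assumed, not taken from the released model."""
    def __init__(self, channels: int, scales=(3, 6, 9, 12)):
        super().__init__()
        mid = channels // 4                                    # compression ratio (assumed)
        self.branches = nn.ModuleList(nn.ModuleDict({
            "pool": nn.AdaptiveAvgPool2d(s),                   # 1. multi-scale sampling
            "compress": nn.Conv2d(channels, mid, 1),           # 2. 1x1 channel compression
            "extract": nn.Conv2d(mid, mid, 3, padding=1,
                                 groups=mid),                  # 3x3 group conv (depthwise grouping assumed)
            "edge": EdgeEnhancer(mid),                         # 3. per-scale edge enhancement
        }) for s in scales)
        self.local = nn.Conv2d(channels, mid, 3, padding=1)    # 4. detail retention
        self.fuse = nn.Conv2d(mid * (len(scales) + 1),
                              channels, 3, padding=1)          # 5. fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = [F.interpolate(b["edge"](b["extract"](b["compress"](b["pool"](x)))),
                              size=(h, w), mode="bilinear", align_corners=False)
                for b in self.branches]                        # upsample back to input size
        outs.append(self.local(x))                             # original local details
        return self.fuse(torch.cat(outs, dim=1))               # channel-concat and fuse
```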
As shown in the ablation experiment results in
Table 2, the C3k2_MultiScaleEdgeFusion-YOLO model achieves a 1.6% increase in Precision, a 2.0% increase in Recall, a 1.4% increase in mAP@0.50, and a 1.3% increase in mAP@0.50:0.95, with almost no change in the number of parameters and the computational load. These results indicate that the introduction of this module effectively reduces the missed detection rate and false detection rate.
3.2. DS_ADown Module
Despite structural refinements reducing the parameters in YOLOv11n, its computational load on mobile devices remains substantial. Crucially, the downsampling module dominates computational costs. While the ADown module balances efficiency and accuracy in general scenarios, it struggles with infrared ship detection’s low-contrast, weak-texture characteristics, and thus is our primary optimization target.
To this end, this study proposes the DS_ADown module, which achieves optimization through a dual-path feature fusion architecture and lightweight design. Concurrently, DSConv is introduced, decomposing standard convolution into depthwise and pointwise convolution layers. This maintains the feature representation capability while reducing computational complexity, thereby eliminating feature redundancy. Traditional downsampling methods cause information loss that is particularly detrimental in low-contrast infrared images. The DS_ADown module uses DSConv to decouple spatial and channel processing, reducing redundancy while preserving the high-frequency features essential for distinguishing ships from wave clutter. The module thus delivers a lightweight solution for ship monitoring in complex maritime conditions, and the detailed network architecture is illustrated below.
As shown in
Figure 3, the input feature map $X$ is first downsampled via the average pooling layer and then split into $X_1$ and $X_2$ along the channel dimension. The $X_1$ branch extracts local details through DSConv and enhances nonlinearity via SiLU activation, while the $X_2$ branch amplifies high-frequency features through max pooling followed by a 1 × 1 convolution. Finally, the outputs of the two paths are channel-concatenated to fuse global semantics with local textures, effectively alleviating the information loss inherent in traditional downsampling.
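To make the savings from DSConv concrete: a standard 3 × 3 convolution mapping 256 channels to 256 channels needs 3 · 3 · 256 · 256 ≈ 590K weights, whereas the depthwise (3 · 3 · 256 = 2304) plus pointwise (256 · 256 = 65,536) decomposition needs about 68K, roughly an 8.7× reduction. Below is a minimal PyTorch sketch of the dual-path structure just described, modeled on the ADown layout; the exact kernel sizes, strides, and channel splits of the released model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv(nn.Module):
    """Depthwise-separable convolution: spatial (depthwise) and channel
    (pointwise) processing are decoupled, with SiLU nonlinearity."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class DS_ADown(nn.Module):
    """Sketch of the DS_ADown dual-path downsampling of Figure 3."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half = c_in // 2
        self.branch1 = DSConv(half, c_out // 2, k=3, s=2)   # X1: local details
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)    # X2: amplify high-freq
        self.branch2 = nn.Conv2d(half, c_out // 2, 1)       # 1x1 channel fusion

    def forward(self, x):
        x = F.avg_pool2d(x, 2, stride=1, padding=0)         # smoothing before split
        x1, x2 = x.chunk(2, dim=1)                          # split along channels
        y1 = self.branch1(x1)                               # DSConv + SiLU path
        y2 = self.branch2(self.pool(x2))                    # max-pool + 1x1 path
        return torch.cat((y1, y2), dim=1)                   # fuse semantics + texture
```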
Therefore, through joint optimization of the lightweight architecture and the feature decoupling design, DS_ADown effectively addresses ADown’s generalization constraints and computational redundancy in infrared ship detection. Selective deployment of DS_ADown in backbone or neck components enables a balanced accuracy–weight trade-off. Experiments (
Table 3) demonstrate that DS_ADown reduces parameters by 24.05% and computations by 20.63%, incurring negligible mAP degradation.
3.3. DyTaskHead Detection Head
In traditional object detection, classification and regression share the feature space, leading to conflicts in feature requirements. Although the decoupled detection head of YOLOv11n alleviates this issue using parallel branches, static feature allocation struggles to adapt to dynamic demands, making classification susceptible to interference in complex backgrounds. The limitations of such static decoupling are particularly prominent in infrared ship detection.
In infrared detection, classification and regression tasks compete for shared features, especially under a low signal-to-noise ratio. DyTaskHead introduces dynamic feature alignment and spatial attention to decouple these tasks, allowing the model to adaptively focus on semantic vs. geometric features, which is critical for handling blurred and occluded targets.
To address the aforementioned challenges, this paper proposes DyTaskHead, a feature-decoupled detection head optimized via dynamic task decomposition and spatial adaptive alignment. As shown in
Figure 4, its architecture comprises the following (a schematic code sketch follows the list):
Feature Preprocessing: Multi-scale inputs (P3, P4, P5) are downsampled to a uniform resolution via the feature pyramid network and then processed by a shared convolutional layer incorporating DSConv, achieving feature fusion at lightweight cost.
Task Decomposition: Features are decoupled into classification and regression branches. The classification branch enhances semantic representation through channel attention and spatial dynamic weighting; the regression branch focuses on geometric structures using deformable convolution, specifically implementing offset and mask learning via DyDCNv2.
Dynamic Alignment: This stage comprises localization and classification components. For localization, spatial convolution generates offsets and masks to refine feature sampling. For classification, spatial attention calibrates feature responses.
Output: The regression branch decodes box coordinates with distribution focal loss while the classification branch outputs category probabilities, completing the feature processing pipeline.
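As a schematic illustration of the decomposition above, the following PyTorch sketch shows one pyramid level with a shared DSConv-style stem, an attention-calibrated classification branch, and an offset-learning regression branch built on torchvision's DeformConv2d. The channel widths, the attention layout, and the simplifications (no mask modulation, no multi-level unification, no DFL decoding) are our assumptions, not the exact DyDCNv2 design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DyTaskHeadSketch(nn.Module):
    """Simplified single-level sketch of DyTaskHead's task decomposition."""
    def __init__(self, c: int, num_classes: int, reg_max: int = 16):
        super().__init__()
        # shared lightweight stem (depthwise + pointwise, DSConv-style)
        self.shared = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        # classification branch: channel attention + spatial dynamic weighting
        self.ch_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.sp_att = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())
        self.cls_out = nn.Conv2d(c, num_classes, 1)
        # regression branch: deformable sampling with learned offsets
        self.offset = nn.Conv2d(c, 2 * 3 * 3, 3, padding=1)  # (dy, dx) per 3x3 tap
        self.dcn = DeformConv2d(c, c, 3, padding=1)
        self.reg_out = nn.Conv2d(c, 4 * reg_max, 1)          # DFL distribution bins

    def forward(self, x):
        f = self.shared(x)
        cls_feat = f * self.ch_att(f) * self.sp_att(f)       # semantic calibration
        reg_feat = self.dcn(f, self.offset(f))               # geometry-aware sampling
        return self.cls_out(cls_feat), self.reg_out(reg_feat)
```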
This design achieves task decoupling and collaborative optimization while reducing parameters and maintaining lightweight adaptability. DyTaskHead enhances detection performance through three key elements: dynamic task decomposition, spatial adaptive alignment, and DSConv-based lightweight feature sharing. As shown in
Table 4, the experiments demonstrate that, with only 7.6M parameters, the model achieves 91.5% mAP@0.50 on the IRShip dataset, surpassing the baseline by 1.7%. This delivers an efficient framework for dynamic task-aligned detection head design.
3.4. DSEE-YOLO: Model Optimization via Pruning and Distillation
To address the computational power constraints faced by infrared ship detection models in practical deployment, pruning technology can reduce model complexity while ensuring detection accuracy by removing redundant parameters and channels. Although the original Fused-YOLO has achieved a certain degree of reduction in terms of parameters and computational load, this experiment still adopts the LAMP (Layer-Adaptive Sparsity for the Magnitude-Based Pruning) [
34] pruning method for further optimization: redundant computations and parameters are removed, and the network is re-trained to recover optimal weights. Consistent with its name, LAMP scores weights by their layer-normalized squared magnitudes, making importance comparable across layers, and gradually removes the channels with the least impact on the model output. The results verify that pruning sharpens the model's focus on key features, achieving compression without performance loss.
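To ground the scoring rule, here is a minimal, unstructured sketch of LAMP scoring and one-shot global pruning in PyTorch. The function names are ours, and the channel-level grouping and iterative re-training used in this study are intentionally omitted.

```python
import torch
import torch.nn as nn

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Per-weight LAMP score [34]: each squared magnitude is normalized by
    the sum of squared magnitudes of all weights in the same layer that are
    at least as large, so scores are comparable across layers."""
    w2 = weight.detach().flatten().pow(2)
    sorted_w2, idx = torch.sort(w2)                 # ascending magnitude
    suffix = sorted_w2.flip(0).cumsum(0).flip(0)    # sum of itself + larger weights
    scores = torch.empty_like(w2)
    scores[idx] = sorted_w2 / suffix
    return scores.view_as(weight)

def global_lamp_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """One-shot sketch: pool LAMP scores over all conv layers, pick a single
    global threshold, and zero the lowest-scoring weights in place."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    all_scores = torch.cat([lamp_scores(m.weight).flatten() for m in convs])
    k = int(sparsity * all_scores.numel())
    threshold = all_scores.sort().values[k]         # global cutoff
    for m in convs:
        mask = (lamp_scores(m.weight) > threshold).float()
        m.weight.data.mul_(mask)                    # apply pruning mask
```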
Given that Fused-YOLO has already achieved favorable feature extraction and detection performance through preliminary optimization, the pruned lightweight model, despite its streamlined parameters, still has room for improvement in its feature representation capability in complex scenarios. To address this, this study conducts self-distillation with Fused-YOLO as the teacher model and the pruned model as the student model. By transferring the key feature knowledge and decision logic from the teacher model, the student model can absorb the detection experience of the teacher while maintaining low complexity, thereby achieving a balance between lightweight properties and high performance.
As shown in
Figure 5, this experiment adopts a combination of BCKD (Bridging Cross-Task Protocol Inconsistency for Knowledge Distillation) [
35] and logical distillation. Logical distillation focuses on transferring the teacher model's decision-making logic: it captures the correlation between target classification and localization in infrared ship detection and helps the student learn the teacher's judgment on ambiguous targets. BCKD self-distillation transfers the teacher's ability to recognize weak thermal signatures and ambiguous edges, enhancing the student's sensitivity to low-contrast targets without increasing computational cost. BCKD is well suited to the target blurring that thermal radiation causes in infrared ship imagery: its weighting strategy strengthens the learning of hard samples, and it transfers scale and thermal-radiation cues from the teacher to cope with large scale variations and strong background interference.
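The sketch below shows, in schematic form, how the two distillation signals can be combined: a BCKD-style term that treats teacher class probabilities as soft binary targets with hard-sample weighting (following the spirit of [35]), and a logit term on the box distributions. The temperature, the loss weights, and the matching protocol are illustrative assumptions; the paper's exact implementation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_cls, teacher_cls, student_box, teacher_box,
                 temp: float = 2.0, w_cls: float = 1.0, w_box: float = 1.0):
    """Schematic self-distillation loss (teacher = unpruned Fused-YOLO,
    student = pruned model); all weights here are assumptions."""
    # BCKD-style classification term: teacher probabilities act as soft
    # binary targets, with larger weights on hard samples (big t-s gaps)
    t_prob = torch.sigmoid(teacher_cls / temp)
    s_prob = torch.sigmoid(student_cls / temp)
    sample_w = (t_prob - s_prob).abs().detach()       # hard-sample weighting
    cls_term = F.binary_cross_entropy(s_prob, t_prob, weight=sample_w)
    # logical (logit) distillation on the box distribution (DFL bins)
    box_term = F.kl_div(F.log_softmax(student_box / temp, dim=-1),
                        F.softmax(teacher_box / temp, dim=-1),
                        reduction="batchmean") * temp ** 2
    return w_cls * cls_term + w_box * box_term
```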
As shown in
Table 5, DSEE-YOLO incorporating BCKD self-distillation achieves improvements of 1.1–2.6% in Precision, Recall, mAP@0.50, and mAP@0.50:0.95 compared to Fused-YOLO. Meanwhile, its parameter count and FLOPs remain unchanged, confirming that BCKD self-distillation improves detection accuracy at no extra cost.
5. Visualization and Discussion
5.1. Visualizing the Decoupling Advantage
To move beyond evaluation that relies solely on numerical indicators and to probe the intrinsic mechanism of DyTaskHead's decoupled design, we conducted a detailed comparative analysis of feature visualizations. This section intuitively reveals the root cause of the performance advantages by comparing the feature maps of DyTaskHead with those of YOLOv11n's coupled head. This subsection focuses on
Figure 12 for visualization analysis.
As shown in
Figure 13, the differences are even more pronounced in the regression task. The regression feature maps of DyTaskHead exhibit clear structural and edge-related features, indicating that its regression branch focuses on extracting geometric information that is crucial for localization. In contrast, the regression features of YOLOv11n appear blurry and smooth with weak feature responses, struggling to support accurate bounding box localization.
As shown in
Figure 14, in the classification task, the feature maps generated by DyTaskHead exhibit higher contrast and stronger feature activation, with different channels focusing on different visual patterns. This indicates that its classification branch is adept at extracting discriminative semantic information. By contrast, the feature maps of YOLOv11n are relatively dull and homogeneous, indicating that the shared feature layer struggles to learn features specialized for classification.
In summary, the visualization analysis clearly reveals the advantageous mechanism of DyTaskHead; its decoupled design allows the classification and regression branches to evolve independently, focusing on high-level semantic features and low-level geometric features, respectively, thereby effectively avoiding task conflicts. In contrast, the coupled head of YOLOv11n causes inter-task interference due to feature sharing, which limits its performance ceiling. This intuitive comparison not only confirms the effectiveness of the DyTaskHead design but also provides a reasonable explanation for its performance advantages from a feature perspective.
5.2. Ship Detection Comparison in Representative Maritime Scenarios
5.2.1. Comparison of Detection Performance of DSEE-YOLO and YOLOv11n
To further analyze the effectiveness of ship detection in complex scenarios and explore the trends of false detections and missed detections, samples from representative maritime scenarios were selected to analyze the comprehensive performance of DSEE-YOLO, as shown in
Figure 15. The visualization results in
Figure 15 intuitively demonstrate the differences in ship detection performance between DSEE-YOLO and the benchmark model YOLOv11n on the IRShip test set through eight groups of comparative samples. The experimental results indicate that DSEE-YOLO exhibits significant performance advantages over YOLOv11n in ship detection tasks. Specifically, in
Figure 15a,f, DSEE-YOLO effectively corrects the false detections of YOLOv11n and accurately identifies ship targets.
Figure 15c,d show that DSEE-YOLO has higher detection precision, with more accurate localization and recognition of ships. In the scenario with target occlusion in
Figure 15h, DSEE-YOLO still achieves high-precision detection, while, in
Figure 15b,e,g, DSEE-YOLO avoids the missed detections of YOLOv11n and successfully identifies all ship targets. In summary, DSEE-YOLO outperforms YOLOv11n in terms of detection accuracy, false detection correction capability, and handling of complex scenarios such as occlusion, significantly enhancing the reliability and robustness of ship detection.
5.2.2. Comparison of Detection Performance of DSEE-YOLO and Other Algorithms
As shown in
Figure 16 and
Figure 17, we conducted a visual comparison of DSEE-YOLO and other mainstream or advanced detection algorithms in identical scenarios to intuitively evaluate detection performance. Specifically, in the cluttered port scenario depicted in
Figure 16, both DSEE-YOLO and YOLOv5-ODConvNeXt yield high-confidence detections. Although some missed or false detections remain in challenging areas, the two models demonstrate superior overall performance compared to others, with YOLOv5n also showing competence in such tasks. For near-shore ship recognition, shown in
Figure 17, almost all algorithms achieve high-confidence results; nevertheless, DSEE-YOLO and YOLOv5-ODConvNeXt again deliver the most reliable detections with the highest confidence scores. In summary, the visual evidence confirms that DSEE-YOLO achieves robust, state-of-the-art performance across diverse maritime environments.
5.3. Discussion on the Limitations of DSEE-YOLO
It should be noted that, although the DSEE-YOLO model and the IRShip v1.0 dataset proposed in this study achieve promising results in infrared maritime target detection, several limitations remain. Firstly, for ship targets with extremely small pixel sizes, the model's missed detection rate in low-contrast environments remains relatively high compared with visible-light detection under the same conditions. Secondly, the IRShip v1.0 dataset lacks samples from special scenarios, such as ships in high sea states, open-sea scenes with tropical storms, and various types of transport ships; owing to the difficulty of sample acquisition, its representativeness is still limited. Thirdly, the DSEE-YOLO model generalizes insufficiently under extreme meteorological conditions, and the detection frame rate drops significantly in high sea states in particular. The root cause is that such scenarios are not covered in training, and existing data augmentation methods fail to simulate real-world nonlinear interference.
To address these issues, targeted breakthroughs can be pursued in future work. For the missed detection of extremely small targets under low contrast, a dedicated feature enhancement module for extremely small targets can be added, combined with cross-layer feature fusion and supported by high-resolution infrared sensors and contrast preprocessing. For the lack of dataset samples, diffusion models can be used to generate simulated data for special scenarios, while collaborating with multiple parties to collect real data on special ships to expand the dataset. For the insufficient generalization and low frame rate under extreme meteorological conditions, domain-adaptive learning can be adopted to adapt to new scenarios, the model can be further lightweighted to improve the frame rate, and radar and SAR data can be fused to build a multi-modal framework.
In addition, in practical maritime environments, dense fog, heavy rain, and sea clutter tend to reduce the accuracy of infrared ship detection. Currently, the performance of DSEE-YOLO under extreme conditions can still be optimized. In the future, we can enhance the model’s robustness in harsh environments by adjusting the model with real-time meteorological data and designing anti-interference preprocessing modules, so as to meet the needs of all-weather maritime surveillance.