1. Introduction
With the vigorous development of global maritime trade and the growing demand for maritime security, ship detection technology has become a research frontier in the field of maritime transportation. Its accuracy and efficiency directly affect the practical effectiveness of fields such as intelligent shipping and maritime law enforcement. However, infrared images in complex marine environments face numerous challenges; for example, complex textures formed by sea waves lead to severe background interference, and long-distance or small ships exhibit weak features, unclear contours, and minimal grayscale differences from the background. These overlapping issues severely restrict the detection robustness of existing algorithms, highlighting an urgent need for solutions that combine high accuracy with lightweight properties [
1]. To address the aforementioned challenges, developing a high-precision and lightweight infrared ship detection algorithm adaptable to complex sea conditions holds significant importance.
Existing lightweight architectures such as YOLOv11n, despite their advantages in inference speed, still have limitations in infrared ship detection tasks: insufficient edge feature extraction capability easily leads to missed detection of weak targets; information redundancy during the downsampling process increases computational burden; and the task coupling issue in the detection head reduces the model’s adaptability to complex scenarios. To this end, this study takes YOLOv11n as the basic architecture and, aiming to address the above limitations, constructs an efficient and lightweight infrared ship detection algorithm called DSEE-YOLO (Dynamic Ship Edge-Enhanced YOLO) through three module innovations combined with model pruning and distillation technologies. The specific innovations of this work are as follows:
1. To address blurred contours in infrared ship detection, we replace the C3k2 module's bottleneck with the MultiScaleEdgeFusion module. This enhances edge information and improves weak-target discriminability through multi-scale feature fusion, without added computational cost.
2. To resolve information redundancy in multi-scale feature extraction, depthwise separable convolution (DSConv) is used to reconstruct the downsampling path, decoupling spatial convolution from channel fusion. This compresses parameters and computational load while retaining key features.
3. The DyTaskHead dynamically aligns task-specific features to enhance downstream task interactions. It employs depthwise separable convolution to significantly reduce parameter redundancy, thereby improving adaptability to complex scenarios while making the model more lightweight.
4. Redundant pruning is performed on the model integrated with the above modules. With the pruned model as the student network and the unpruned model as the teacher network, self-distillation is implemented to further improve model accuracy, ultimately forming the DSEE-YOLO algorithm.
The remaining structure of this paper is organized as follows:
Section 2 reviews the progress of related research;
Section 3 elaborates on the model architecture of DSEE-YOLO and its core improvement methods;
Section 4 verifies the algorithm’s performance through comparative experiments and ablation experiments;
Section 5 discusses the model’s advantages and applicable scenarios in depth with visualization results;
Section 6 summarizes the full text and proposes future work.
2. Related Work
In infrared ship target detection, traditional methods rely on manually designed features such as HOG [
2] and SIFT [
3] combined with classical machine learning classifiers like SVMs [
4] and AdaBoost [
5]. However, these approaches exhibit obvious limitations. In infrared images, the temperature difference between ships and the background is weak, edges are blurred, and ships are easily disturbed by pseudo-heat sources such as spray and foam, making it difficult for manual features to accurately depict the thermal distribution characteristics of ships and leading to a significant reduction in distinguishability.
In contrast, deep learning-based detection techniques significantly improve detection performance by enabling convolutional neural networks to automatically extract multi-scale semantic features. Among them, two-stage detection algorithms (e.g., Faster R-CNN [
6]) have achieved breakthroughs in detection accuracy by leveraging an RPN (Region Proposal Network) to generate candidate boxes with adaptive IoU thresholds. Single-stage detectors adopt a feature pyramid fusion strategy, which effectively reduces the missed detection rate for small targets while ensuring real-time performance, thus providing core technical support for the construction of all-weather anti-interference infrared maritime monitoring systems.
In the field of infrared ship detection, deep learning-based single-stage object detection algorithms continue to optimize detection performance in complex sea surface scenarios. In 2017, YOLOv2 [
7] reduced the missed detection of small targets by virtue of DarkNet19 and anchor box optimization, while RetinaNet [
8] suppressed the interference of ocean wave noise using focal loss. In 2018, YOLOv3 [
9] improved the AP value for long-distance ship recognition with a three-level feature pyramid. In 2020, YOLOv4 [
10] integrated PANet and mosaic augmentation, maintaining a high mAP even under low-visibility conditions, while YOLOv5 achieved real-time high-precision detection through adaptive anchor boxes and SPPF. In 2022, YOLOv6 [
11] introduced RepVGG and anchor-free mechanisms to reduce the missed detection rate, while YOLOv7 [
12] enhanced the recognition capability of occluded targets by means of E-ELAN. In 2023, YOLOv8 introduced C2f to enhance multi-scale performance, while, in 2024, YOLOv9 [
13] optimized contour detection using deformable convolution, and the C2f-Faster module of YOLOv10 [
14] compressed the number of parameters. The latest SOTA (state-of-the-art) algorithm, YOLOv11, reduces the number of parameters through the C3k2 module, focuses on weak signals by integrating the C2PSA attention mechanism, and greatly reduces the missed detection rate with a decoupled detection head and an improved CIoU.
The iterative upgrades of the aforementioned algorithms have not only laid a solid foundation for improving ship detection performance but also driven researchers to explore optimization paths in multiple dimensions to address the complexity of the sea surface environment.
In terms of feature optimization and fusion, researchers have been continuously improving the discriminative ability of models through feature enhancement and structured fusion. Chen et al. [
15] introduced the FT algorithm into YOLOv3 to strengthen semantic features and adopted Soft-NMS to optimize box-screening. Chen et al. [
16] proposed CSD-YOLO, which focuses on key information and fuses the multi-scale features of SAR through the SAS-FPN module. Li et al. [
17] designed a feature-focused diffusion pyramid, and combined the ADown module with an improved C2f structure to enhance the features of the central region. Zhang et al. [
18] embedded a multi-scale residual module into YOLOv7-tiny to improve adaptability to complex water surface environments. Wang et al. [
19] constructed NST-YOLO11, which utilizes MSDA (multi-scale dynamic attention) to fuse the global semantic capture ability of ViT. Such studies echo earlier efforts: Liu et al. [
20] (PJ-YOLO) integrated prior knowledge; Zhou et al. [
21] (YOLO-SWD) designed a feature compensation mechanism; and Ha et al. [
22] (YOLO-SR) implemented feature recalibration. These works collectively improved feature quality but commonly encountered issues such as increased computational costs [
16,
20,
21] or false detections in complex scenarios [
17,
18].
Targeting lightweight design improvements, model efficiency optimization has become a key aspect for practical deployment. Although Zhao et al. [
23] did not directly make the model lightweight, they indirectly improved efficiency through K-means++ clustering of anchor boxes. Yue et al. [
24] combined MobileNetv2 with YOLOv4 and achieved compression based on channel pruning. Guo et al. [
25] improved the lightweight backbone network of the YOLO model to realize cross-scale feature fusion and reduce convolution computation, but this method suffers from missed target detection. Du et al. [
26] improved YOLOv8 and added an attention mechanism to achieve lightweight traffic sign detection. These studies complement the efforts of Shen et al. [
27] (who streamlined the feature pyramid in YOLO-LPSS) and Sanikommu et al. [
28] (focused on edge computing deployment), collectively addressing efficiency bottlenecks. However, they still struggle to avoid the accuracy losses caused by lightweight design, such as the degradation in large-target detection [
27] and insufficient sensitivity to small targets [
18,
26].
In terms of multi-scale and scale adaptation, optimization strategies addressing the large variation in ship scales have continued to evolve. Ma et al. [
29] improved SP-YOLOv8s and retained fine-grained features to enhance tiny-object accuracy. Such studies form a technical closed loop with the works of Yuan et al. [
30] (AM YOLO for adaptive multi-scale performance) and Huang et al. [
31] (ADV-YOLO to optimize SAR multi-scale expression). Zhang et al. [
32] focused on the detection of hidden suspicious objects in terahertz images through multi-scale detection. However, the missed detection of small targets remains a common shortcoming [
23,
29,
30,
31], especially in low-resolution infrared or SAR images [
20,
22,
31].
In terms of robustness in complex scenarios, research has also advanced significantly. These efforts synergize with early SAR-specific studies (Ha et al. [
22], Huang et al. [
31]) and remote sensing scene research (Shi et al. [
33]). However, cross-scene generalization—such as from visible light to infrared—remains a challenge [
18,
19,
31].
In summary, iterative advancements and multi-directional optimizations in infrared ship detection have laid a solid foundation for practical use, but existing architectures like YOLOv11 still have limitations. To address these limitations, we propose DSEE-YOLO, an efficient lightweight algorithm based on YOLOv11n to provide a novel effective solution for infrared ship detection.
3. Methodology
This section elaborates on the technical details of the proposed DSEE-YOLO algorithm. The algorithm builds on the YOLOv11n framework and is developed by integrating three innovative modules together with pruning and self-distillation techniques. The overall architecture of DSEE-YOLO is shown in
Figure 1.
3.1. C3k2_MultiScaleEdgeFusion Module
In infrared ship detection, thermal reflection from ocean waves and diffuse reflection from clouds or fog reduce the thermal contrast between ships and the background. This attenuation induces grayscale blurring at target edges, impeding traditional algorithms’ ability to capture subtle variations and leading to localization errors. A critical challenge arises from high-frequency noise that mimics genuine edges; such noise is frequently misclassified, increasing the false-positive rate. Since edge information is pivotal for precise localization and recognition and its integrity and reliability are compromised by these factors, targeted edge enhancement becomes essential for improving detection efficacy.
Infrared ship detection is fundamentally challenged by low signal-to-noise ratios and blurred edges. To address this from a theoretical standpoint, we design the MultiScaleEdgeFusion module based on scale-space theory. Real structural edges in infrared images exhibit scale invariance: they persist across multiple spatial scales. By contrast, noise and background clutter, such as wave crests and thermal reflections, are usually transient and scale-specific, appearing prominently only at particular scales. Traditional convolutional layers with fixed receptive fields struggle to distinguish the two: small kernels capture fine-grained details but are easily corrupted by noise, while large kernels improve noise resistance at the cost of blurring fine edges and losing small-target details. Our MultiScaleEdgeFusion module resolves this dilemma with a multi-scale parallel processing strategy.
We integrate the aforementioned MultiScaleEdgeFusion module as a sub-module into the original C3k2 module to further improve the overall performance of the model through edge enhancement. Specifically, the high-frequency residual, which reflects high-frequency edge information, is obtained by extracting the difference signal $F_{hf} = F - \mathrm{AvgPool}_{3\times 3}(F)$ between the original features $F$ and the blurred features after 3 × 3 average pooling. Then, a learnable convolution layer is used to generate an edge weight matrix $W$:

$$W = \sigma(K * F_{hf} + b) \quad (1)$$

so as to selectively enhance real edges and suppress thermal noise. Meanwhile, feature compression, edge enhancement, and upsampling restoration are performed at four scales (3 × 3, 6 × 6, 9 × 9, and 12 × 12), followed by fusion. This not only improves the detection capability for ships of different sizes but also suppresses noise by fusing the edge responses of different scales at the same position: the true edges of ships exhibit high responses at both the 3 × 3 and 12 × 12 scales and are thus retained, whereas noise, which shows a high response only at a specific scale, is weakened by its low responses at the other scales. Combined with the edge weight matrix, the proportion of weight assigned to pseudo-edges is further reduced.
Herein, $\sigma$ represents the Sigmoid function, $W$ denotes the edge weight matrix, $F_{hf}$ stands for the high-frequency residual, $b$ is the basic activation threshold for infrared edges, $K$ refers to the convolution kernel weight, and $*$ is the convolution operator.
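To make Equation (1) concrete, the following is a minimal PyTorch sketch of an edge-enhancement step built from these definitions. The class and variable names, the use of a 1 × 1 convolution for $K$, and the gated re-injection of the residual at the end are our illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class EdgeEnhancer(nn.Module):
    """Minimal sketch of the edge-enhancement step of Equation (1)."""
    def __init__(self, channels: int):
        super().__init__()
        # 3x3 average pooling produces the blurred features
        self.blur = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # learnable layer computing K * F_hf + b; the bias b plays the role
        # of the basic activation threshold for infrared edges
        self.weight_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_hf = x - self.blur(x)                    # high-frequency residual F_hf
        w = self.sigmoid(self.weight_conv(f_hf))   # edge weight matrix W, Eq. (1)
        return x + w * f_hf                        # enhance real edges, suppress noise
```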
To verify the effectiveness of the multi-scale strategy, this study designs multi-scale selection comparative experiments (see
Table 1 for results). The experimental results show that, under different single-scale settings, although the model performance exhibits fluctuations, the overall differences are not significant. Under the dual-scale fusion condition, most combinations fail to systematically outperform the optimal single-scale configuration, indicating that simply increasing the number of scales cannot effectively improve the model’s generalization performance. However, when more scale information is introduced, the model performance is significantly improved. In particular, the four-scale fusion combination achieves the highest values in terms of both recall rate and mAP@0.50, and its mAP@0.50:0.95 also outperforms all comparative settings. The results demonstrate that multi-scale feature fusion can enhance the model’s ability to represent targets of different sizes and reduce detection biases caused by scale variance, thereby improving detection accuracy and robustness.
As shown in
Figure 2, the MultiScaleEdgeFusion sub-module operates as follows (a code sketch of the full pipeline follows the list):
Multi-Scale Sampling: Input features undergo dimensionality reduction via parallel adaptive pooling layers of varying sizes.
Scale-Specific Feature Extraction: After channel compression (1 × 1 convolution), 3 × 3 group convolution extracts scale-specific features, preserving scale differences with reduced computation.
Edge Enhancement: Each output is processed by an independent EdgeEnhancer module to strengthen contours, then upsampled to the original resolution.
Detail Retention: Local details from the original input are preserved via 3 × 3 convolution.
Feature Fusion: Enhanced edge features from all branches and local details are channel-concatenated, integrated by 3 × 3 fusion convolution, outputting an optimized feature map with enhanced edge response and preserved spatial details.
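The following minimal PyTorch sketch assembles the five steps above into one module. It reuses the EdgeEnhancer sketch given after Equation (1); the pool sizes (3, 6, 9, 12) follow the paper, while the channel compression ratio, depthwise grouping, and bilinear interpolation mode are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEdgeFusion(nn.Module):
    """Sketch of the five-step pipeline; hyperparameters other than the
    pool sizes are assumed, not taken from the released model."""
    def __init__(self, channels: int, scales=(3, 6, 9, 12)):
        super().__init__()
        mid = channels // 4                                    # compression ratio (assumed)
        self.branches = nn.ModuleList(nn.ModuleDict({
            "pool": nn.AdaptiveAvgPool2d(s),                   # 1. multi-scale sampling
            "compress": nn.Conv2d(channels, mid, 1),           # 2. 1x1 channel compression
            "extract": nn.Conv2d(mid, mid, 3, padding=1,
                                 groups=mid),                  # 3x3 group conv (depthwise grouping assumed)
            "edge": EdgeEnhancer(mid),                         # 3. per-scale edge enhancement
        }) for s in scales)
        self.local = nn.Conv2d(channels, mid, 3, padding=1)    # 4. detail retention
        self.fuse = nn.Conv2d(mid * (len(scales) + 1),
                              channels, 3, padding=1)          # 5. fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = [F.interpolate(b["edge"](b["extract"](b["compress"](b["pool"](x)))),
                              size=(h, w), mode="bilinear", align_corners=False)
                for b in self.branches]                        # upsample back to input size
        outs.append(self.local(x))                             # original local details
        return self.fuse(torch.cat(outs, dim=1))               # channel-concat and fuse
```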
As shown in the ablation experiment results in
Table 2, the C3k2_MultiScaleEdgeFusion-YOLO model achieves a 1.6% increase in Precision, a 2.0% increase in Recall, a 1.4% increase in mAP@0.50, and a 1.3% increase in mAP@0.50:0.95, with almost no change in the number of parameters and the computational load. These results indicate that the introduction of this module effectively reduces the missed detection rate and false detection rate.
3.2. DS_ADown Module
Despite structural refinements reducing the parameters in YOLOv11n, its computational load on mobile devices remains substantial. Crucially, the downsampling module dominates computational costs. While the ADown module balances efficiency and accuracy in general scenarios, it struggles with infrared ship detection’s low-contrast, weak-texture characteristics, and thus is our primary optimization target.
To this end, this study proposes the DS_ADown module, which achieves optimization through a dual-path feature fusion architecture and lightweight design. Concurrently, DSConv is introduced, decomposing standard convolution into depthwise and pointwise convolution layers. This maintains the feature representation capability while reducing computational complexity, thereby eliminating feature redundancy. Traditional downsampling methods cause information loss that is particularly detrimental in low-contrast infrared images. The DS_ADown module uses DSConv to decouple spatial and channel processing, reducing redundancy while preserving the high-frequency features essential for distinguishing ships from wave clutter. The module thus delivers a lightweight solution for ship monitoring in complex maritime conditions, and the detailed network architecture is illustrated below.
As shown in
Figure 3, the input feature map $X$ is first downsampled via the average pooling layer and then split into $X_1$ and $X_2$ along the channel dimension. The $X_1$ branch extracts local details through DSConv and enhances nonlinearity via SiLU activation, while the $X_2$ branch amplifies high-frequency features through max pooling followed by a 1 × 1 convolution. Finally, the outputs of the two paths are channel-concatenated to fuse global semantics with local textures, effectively alleviating the information loss inherent in traditional downsampling.
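To make the savings from DSConv concrete: a standard 3 × 3 convolution mapping 256 channels to 256 channels needs 3 · 3 · 256 · 256 ≈ 590K weights, whereas the depthwise (3 · 3 · 256 = 2304) plus pointwise (256 · 256 = 65,536) decomposition needs about 68K, roughly an 8.7× reduction. Below is a minimal PyTorch sketch of the dual-path structure just described, modeled on the ADown layout; the exact kernel sizes, strides, and channel splits of the released model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv(nn.Module):
    """Depthwise-separable convolution: spatial (depthwise) and channel
    (pointwise) processing are decoupled, with SiLU nonlinearity."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class DS_ADown(nn.Module):
    """Sketch of the DS_ADown dual-path downsampling of Figure 3."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half = c_in // 2
        self.branch1 = DSConv(half, c_out // 2, k=3, s=2)   # X1: local details
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)    # X2: amplify high-freq
        self.branch2 = nn.Conv2d(half, c_out // 2, 1)       # 1x1 channel fusion

    def forward(self, x):
        x = F.avg_pool2d(x, 2, stride=1, padding=0)         # smoothing before split
        x1, x2 = x.chunk(2, dim=1)                          # split along channels
        y1 = self.branch1(x1)                               # DSConv + SiLU path
        y2 = self.branch2(self.pool(x2))                    # max-pool + 1x1 path
        return torch.cat((y1, y2), dim=1)                   # fuse semantics + texture
```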
Therefore, through joint optimization of the lightweight architecture and the feature decoupling design, DS_ADown effectively addresses ADown’s generalization constraints and computational redundancy in infrared ship detection. Selective deployment of DS_ADown in backbone or neck components enables a balanced accuracy–weight trade-off. Experiments (
Table 3) demonstrate that DS_ADown reduces parameters by 24.05% and computations by 20.63%, incurring negligible mAP degradation.
3.3. DyTaskHead Detection Head
In traditional object detection, classification and regression share the feature space, leading to conflicts in feature requirements. Although the decoupled detection head of YOLOv11n alleviates this issue using parallel branches, static feature allocation struggles to adapt to dynamic demands, making classification susceptible to interference in complex backgrounds. The limitations of such static decoupling are particularly prominent in infrared ship detection.
In infrared detection, classification and regression tasks compete for shared features, especially under a low signal-to-noise ratio. DyTaskHead introduces dynamic feature alignment and spatial attention to decouple these tasks, allowing the model to adaptively focus on semantic vs. geometric features, which is critical for handling blurred and occluded targets.
To address the aforementioned challenges, this paper proposes DyTaskHead, a feature-decoupled detection head optimized via dynamic task decomposition and spatial adaptive alignment. As shown in
Figure 4, its architecture comprises the following (a schematic code sketch follows the list):
Feature Preprocessing: Multi-scale inputs (P3, P4, P5) are downsampled to a uniform resolution via the feature pyramid network and then processed by a shared convolutional layer incorporating DSConv, achieving feature fusion at lightweight cost.
Task Decomposition: Features are decoupled into classification and regression branches. The classification branch enhances semantic representation through channel attention and spatial dynamic weighting; the regression branch focuses on geometric structures using deformable convolution, specifically implementing offset and mask learning via DyDCNv2.
Dynamic Alignment: This stage comprises localization and classification components. For localization, spatial convolution generates offsets and masks to refine feature sampling. For classification, spatial attention calibrates feature responses.
Output: The regression branch decodes box coordinates with distribution focal loss while the classification branch outputs category probabilities, completing the feature processing pipeline.
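As a schematic illustration of the decomposition above, the following PyTorch sketch shows one pyramid level with a shared DSConv-style stem, an attention-calibrated classification branch, and an offset-learning regression branch built on torchvision's DeformConv2d. The channel widths, the attention layout, and the simplifications (no mask modulation, no multi-level unification, no DFL decoding) are our assumptions, not the exact DyDCNv2 design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DyTaskHeadSketch(nn.Module):
    """Simplified single-level sketch of DyTaskHead's task decomposition."""
    def __init__(self, c: int, num_classes: int, reg_max: int = 16):
        super().__init__()
        # shared lightweight stem (depthwise + pointwise, DSConv-style)
        self.shared = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        # classification branch: channel attention + spatial dynamic weighting
        self.ch_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.sp_att = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())
        self.cls_out = nn.Conv2d(c, num_classes, 1)
        # regression branch: deformable sampling with learned offsets
        self.offset = nn.Conv2d(c, 2 * 3 * 3, 3, padding=1)  # (dy, dx) per 3x3 tap
        self.dcn = DeformConv2d(c, c, 3, padding=1)
        self.reg_out = nn.Conv2d(c, 4 * reg_max, 1)          # DFL distribution bins

    def forward(self, x):
        f = self.shared(x)
        cls_feat = f * self.ch_att(f) * self.sp_att(f)       # semantic calibration
        reg_feat = self.dcn(f, self.offset(f))               # geometry-aware sampling
        return self.cls_out(cls_feat), self.reg_out(reg_feat)
```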
This design achieves task decoupling and collaborative optimization while reducing parameters and maintaining lightweight adaptability. DyTaskHead enhances detection performance through three key elements: dynamic task decomposition, spatial adaptive alignment, and DSConv-based lightweight feature sharing. As shown in
Table 4, the experiments demonstrate that, with only 7.6M parameters, the model achieves 91.5% mAP@0.50 on the IRShip dataset, surpassing the baseline by 1.7%. This delivers an efficient framework for dynamic task-aligned detection head design.
3.4. DSEE-YOLO: Model Optimization via Pruning and Distillation
To address the computational power constraints faced by infrared ship detection models in practical deployment, pruning technology can reduce model complexity while ensuring detection accuracy by removing redundant parameters and channels. Although the original Fused-YOLO has achieved a certain degree of reduction in terms of parameters and computational load, this experiment still adopts the LAMP (Layer-Adaptive Sparsity for the Magnitude-Based Pruning) [
34] pruning method for further optimization: redundant computations and parameters are removed, and the network is re-trained to recover optimal weights. Consistent with its name, LAMP scores weights by their layer-normalized squared magnitudes, making importance comparable across layers, and gradually removes the channels with the least impact on the model output. The results verify that pruning sharpens the model's focus on key features, achieving compression without performance loss.
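To ground the scoring rule, here is a minimal, unstructured sketch of LAMP scoring and one-shot global pruning in PyTorch. The function names are ours, and the channel-level grouping and iterative re-training used in this study are intentionally omitted.

```python
import torch
import torch.nn as nn

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Per-weight LAMP score [34]: each squared magnitude is normalized by
    the sum of squared magnitudes of all weights in the same layer that are
    at least as large, so scores are comparable across layers."""
    w2 = weight.detach().flatten().pow(2)
    sorted_w2, idx = torch.sort(w2)                 # ascending magnitude
    suffix = sorted_w2.flip(0).cumsum(0).flip(0)    # sum of itself + larger weights
    scores = torch.empty_like(w2)
    scores[idx] = sorted_w2 / suffix
    return scores.view_as(weight)

def global_lamp_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """One-shot sketch: pool LAMP scores over all conv layers, pick a single
    global threshold, and zero the lowest-scoring weights in place."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    all_scores = torch.cat([lamp_scores(m.weight).flatten() for m in convs])
    k = int(sparsity * all_scores.numel())
    threshold = all_scores.sort().values[k]         # global cutoff
    for m in convs:
        mask = (lamp_scores(m.weight) > threshold).float()
        m.weight.data.mul_(mask)                    # apply pruning mask
```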
Given that Fused-YOLO has already achieved favorable feature extraction and detection performance through preliminary optimization, the pruned lightweight model, despite its streamlined parameters, still has room for improvement in its feature representation capability in complex scenarios. To address this, this study conducts self-distillation with Fused-YOLO as the teacher model and the pruned model as the student model. By transferring the key feature knowledge and decision logic from the teacher model, the student model can absorb the detection experience of the teacher while maintaining low complexity, thereby achieving a balance between lightweight properties and high performance.
As shown in
Figure 5, this experiment adopts a combination of BCKD (Bridging Cross-Task Protocol Inconsistency for Knowledge Distillation) [
35] and logical distillation. Logical distillation focuses on transferring the teacher model's decision-making logic: it captures the correlation between target classification and localization in infrared ship detection and helps the student learn the teacher's judgment on ambiguous targets. BCKD self-distillation transfers the teacher's ability to recognize weak thermal signatures and ambiguous edges, enhancing the student's sensitivity to low-contrast targets without increasing computational cost. BCKD is well suited to the target blurring that thermal radiation causes in infrared ship imagery: its weighting strategy strengthens the learning of hard samples, and it transfers scale and thermal-radiation cues from the teacher to cope with large scale variations and strong background interference.
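The sketch below shows, in schematic form, how the two distillation signals can be combined: a BCKD-style term that treats teacher class probabilities as soft binary targets with hard-sample weighting (following the spirit of [35]), and a logit term on the box distributions. The temperature, the loss weights, and the matching protocol are illustrative assumptions; the paper's exact implementation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_cls, teacher_cls, student_box, teacher_box,
                 temp: float = 2.0, w_cls: float = 1.0, w_box: float = 1.0):
    """Schematic self-distillation loss (teacher = unpruned Fused-YOLO,
    student = pruned model); all weights here are assumptions."""
    # BCKD-style classification term: teacher probabilities act as soft
    # binary targets, with larger weights on hard samples (big t-s gaps)
    t_prob = torch.sigmoid(teacher_cls / temp)
    s_prob = torch.sigmoid(student_cls / temp)
    sample_w = (t_prob - s_prob).abs().detach()       # hard-sample weighting
    cls_term = F.binary_cross_entropy(s_prob, t_prob, weight=sample_w)
    # logical (logit) distillation on the box distribution (DFL bins)
    box_term = F.kl_div(F.log_softmax(student_box / temp, dim=-1),
                        F.softmax(teacher_box / temp, dim=-1),
                        reduction="batchmean") * temp ** 2
    return w_cls * cls_term + w_box * box_term
```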
As shown in
Table 5, DSEE-YOLO incorporating BCKD self-distillation achieves improvements of 1.1–2.6% in Precision, Recall, mAP@0.50, and mAP@0.50:0.95 compared to Fused-YOLO. Meanwhile, its parameter count and FLOPs remain unchanged, confirming that BCKD self-distillation improves detection accuracy at no extra cost.
5. Visualization and Discussion
5.1. Visualizing the Decoupling Advantage
To move beyond evaluation that relies solely on numerical indicators and to probe the intrinsic mechanism of DyTaskHead's decoupled design, we conducted a detailed comparative analysis of feature visualizations. This section intuitively reveals the root cause of the performance advantages by comparing the feature maps of DyTaskHead with those of YOLOv11n's coupled head. This subsection focuses on
Figure 12 for visualization analysis.
As shown in
Figure 13, the differences are even more pronounced in the regression task. The regression feature maps of DyTaskHead exhibit clear structural and edge-related features, indicating that its regression branch focuses on extracting geometric information that is crucial for localization. In contrast, the regression features of YOLOv11n appear blurry and smooth with weak feature responses, struggling to support accurate bounding box localization.
As shown in
Figure 14, in the classification task, the feature maps generated by DyTaskHead exhibit higher contrast and stronger feature activation, with different channels focusing on different visual patterns. This indicates that its classification branch is adept at extracting discriminative semantic information. By contrast, the feature maps of YOLOv11n are relatively dull and homogeneous, indicating that the shared feature layer struggles to learn features specialized for classification.
In summary, the visualization analysis clearly reveals the advantageous mechanism of DyTaskHead; its decoupled design allows the classification and regression branches to evolve independently, focusing on high-level semantic features and low-level geometric features, respectively, thereby effectively avoiding task conflicts. In contrast, the coupled head of YOLOv11n causes inter-task interference due to feature sharing, which limits its performance ceiling. This intuitive comparison not only confirms the effectiveness of the DyTaskHead design but also provides a reasonable explanation for its performance advantages from a feature perspective.
5.2. Ship Detection Comparison in Representative Maritime Scenarios
5.2.1. Comparison of Detection Performance of DSEE-YOLO and YOLOv11n
To further analyze the effectiveness of ship detection in complex scenarios and explore the trends of false detections and missed detections, samples from representative maritime scenarios were selected to analyze the comprehensive performance of DSEE-YOLO, as shown in
Figure 15. The visualization results in
Figure 15 intuitively demonstrate the differences in ship detection performance between DSEE-YOLO and the benchmark model YOLOv11n on the IRShip test set through eight groups of comparative samples. The experimental results indicate that DSEE-YOLO exhibits significant performance advantages over YOLOv11n in ship detection tasks. Specifically, in
Figure 15a,f, DSEE-YOLO effectively corrects the false detections of YOLOv11n and accurately identifies ship targets.
Figure 15c,d show that DSEE-YOLO has higher detection precision, with more accurate localization and recognition of ships. In the scenario with target occlusion in
Figure 15h, DSEE-YOLO still achieves high-precision detection, while, in
Figure 15b,e,g, DSEE-YOLO avoids the missed detections of YOLOv11n and successfully identifies all ship targets. In summary, DSEE-YOLO outperforms YOLOv11n in terms of detection accuracy, false detection correction capability, and handling of complex scenarios such as occlusion, significantly enhancing the reliability and robustness of ship detection.
5.2.2. Comparison of Detection Performance of DSEE-YOLO and Other Algorithms
As shown in
Figure 16 and
Figure 17, we conducted a visual comparison of DSEE-YOLO and other mainstream or advanced detection algorithms in identical scenarios to intuitively evaluate detection performance. Specifically, in the cluttered port scenario depicted in
Figure 16, both DSEE-YOLO and YOLOv5-ODConvNeXt yield high-confidence detections. Although some missed or false detections remain in challenging areas, the two models demonstrate superior overall performance compared to others, with YOLOv5n also showing competence in such tasks. For near-shore ship recognition, shown in
Figure 17, almost all algorithms achieve high-confidence results; nevertheless, DSEE-YOLO and YOLOv5-ODConvNeXt again deliver the most reliable detections with the highest confidence scores. In summary, the visual evidence confirms that DSEE-YOLO achieves robust, state-of-the-art performance across diverse maritime environments.
5.3. Discussion on the Limitations of DSEE-YOLO
It should be noted that, although the DSEE-YOLO model and the IRShip v1.0 dataset proposed in this study achieve promising results in infrared maritime target detection, several limitations remain. Firstly, for ship targets with extremely small pixel sizes, the model's missed detection rate in low-contrast environments remains relatively high compared with visible-light detection under the same conditions. Secondly, the IRShip v1.0 dataset lacks samples from special scenarios, such as ships in high sea states, open-sea scenes with tropical storms, and various types of transport ships; owing to the difficulty of sample acquisition, its representativeness is still limited. Thirdly, the DSEE-YOLO model generalizes insufficiently under extreme meteorological conditions, and the detection frame rate drops significantly in high sea states in particular. The root cause is that such scenarios are not covered in training, and existing data augmentation methods fail to simulate real-world nonlinear interference.
To address these issues, targeted breakthroughs can be pursued in future work. For the missed detection of extremely small targets under low contrast, a dedicated feature enhancement module for extremely small targets can be added, combined with cross-layer feature fusion and supported by high-resolution infrared sensors and contrast preprocessing. For the lack of dataset samples, diffusion models can be used to generate simulated data for special scenarios, while collaborating with multiple parties to collect real data on special ships to expand the dataset. For the insufficient generalization and low frame rate under extreme meteorological conditions, domain-adaptive learning can be adopted to adapt to new scenarios, the model can be further lightweighted to improve the frame rate, and radar and SAR data can be fused to build a multi-modal framework.
In addition, in practical maritime environments, dense fog, heavy rain, and sea clutter tend to reduce the accuracy of infrared ship detection. Currently, the performance of DSEE-YOLO under extreme conditions can still be optimized. In the future, we can enhance the model’s robustness in harsh environments by adjusting the model with real-time meteorological data and designing anti-interference preprocessing modules, so as to meet the needs of all-weather maritime surveillance.