Article

EISD-YOLO: Efficient Infrared Ship Detection with Param-Reduced PEMA Block and Dynamic Task Alignment

1 College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
2 Anhui Province Key Laboratory of Electronic Environment Intelligent Perception and Control, Hefei 230037, China
3 Advanced Laser Technology Laboratory of Anhui Province, Hefei 230037, China
4 School of Electronic and Information Engineering, Anhui Agriculture University, Hefei 230009, China
* Authors to whom correspondence should be addressed.
Photonics 2025, 12(11), 1044; https://doi.org/10.3390/photonics12111044
Submission received: 14 September 2025 / Revised: 13 October 2025 / Accepted: 16 October 2025 / Published: 22 October 2025
(This article belongs to the Special Issue Technologies and Applications of Optical Imaging)

Abstract

Ship detection is critical for maritime traffic management, yet infrared imaging in complex marine environments encounters significant challenges, such as strong background interference and weak target features. We therefore propose EISD-YOLO (Efficient Infrared Ship Detection-YOLO), a high-performance lightweight algorithm specifically designed for infrared ship detection. The algorithm aims to improve detection accuracy while simultaneously reducing model parameters and enhancing computational efficiency. It integrates three core architectural innovations: first, we optimized the backbone C3k2 module by replacing the traditional bottleneck with a PEMA block to significantly reduce the parameter count; second, we integrated a lightweight DS_ADNet module, using depth-wise separable convolution to reduce parameters and alleviate computational load while maintaining robust feature representation; and third, we adopted the DyTAHead detection head, which integrates classification and localization features through dynamic task alignment, thereby achieving robust performance in complex infrared ship detection scenarios. The experimental results on the IRShip dataset demonstrate that, compared with YOLOv11n, EISD-YOLO reduced the parameters by 48.83%, while mAP@0.50, precision, and recall all increased by 1.2%. This counters the conventional expectation that lightweight models inevitably sacrifice accuracy. Additionally, the model size was reduced from 10.1 MB to 5.7 MB, which highlights its enhanced computational efficiency and practical applicability in maritime deployment scenarios.

1. Introduction

With the rapid development of electromagnetic stealth and camouflage technologies, the echo feature recognition capability of traditional active radar and the spectral resolution of visible-light imaging technology are encountering severe challenges, which are leading to a significant decline in detection performance for maritime ships. Against this backdrop, infrared detection technology, which leverages the universality of black-body radiation characteristics and the all-weather adaptability of thermal imaging, has become a core sensing component of modern maritime early warning and defense systems, and its strategic status in marine monitoring has become increasingly prominent [1]. In recent years, as a specific application of this technology, infrared surveillance camera systems have been widely used in maritime traffic monitoring because of their high cost-effectiveness, flexible deployment, and real-time imaging capability. Compared with satellite remote sensing imaging systems, these systems can achieve accurate positioning and category labeling of maritime ships with meter-level resolution, while also ensuring the continuous robustness of the system even under complex environmental conditions.
However, the performance of infrared surveillance camera systems depends heavily on the accuracy of backend detection models. In maritime scenarios, the scale variation of ship targets and dynamic environmental interference impose high requirements on the adaptability of detection models, and existing models have not fully met this application demand. Specifically, these models exhibit significant scale sensitivity when processing infrared maritime ship targets, accompanied by a high missed detection rate for small targets and an increased false alarm rate under low-signal-to-noise-ratio (SNR) conditions. Furthermore, hardware constraints, real-time requirements, and the uniqueness of deployment environments in maritime detection scenarios pose additional challenges to detection algorithms [2]. To address the challenges of multi-scale missed detection, false alarms under low SNR, and hardware adaptation, recent studies have mostly enhanced robustness by improving the YOLO architecture. For example, Chen et al. proposed RER-YOLO to optimize small-target recognition [3], whereas Zhang et al. designed SMR-YOLO to improve the noise suppression capability [4]. These studies verified the effectiveness of YOLO improvements tailored to infrared scenarios; however, most focused on a single performance metric and have not yet achieved an optimal balance between lightweight architecture and high accuracy.
Therefore, in response to the hardware constraints and real-time requirements of maritime infrared detection, the development of an infrared ship detection algorithm that integrates a lightweight architecture with high accuracy is crucial. This study will help to advance the development of intelligent port systems and improve scheduling efficiency, thereby providing technical support for civil applications such as maritime security scenario awareness, intelligent ship traffic management, and maritime search and rescue operations.
To address the aforementioned technical challenges, we propose algorithmic innovations based on the YOLOv11 framework. Specifically, to enhance the detection performance of infrared ship targets in complex marine environments, we propose a C3k2-based feature extraction module, a lightweight downsampling module (DS_ADNet) incorporating ADown, and a dynamic task alignment head (DyTAHead), a detection head built on a task-decomposition architecture. We integrate these improvements into the YOLOv11 framework for oriented bounding-box target detection tasks. The main contributions of this study are summarized as follows:
  • We propose a lightweight feature extraction module, PEMA-C3k2, whose core is the replacement of the bottleneck structure of the original C3k2 with partial efficient multi-scale attention (PEMA) blocks. This module is composed of partial convolution and the efficient multi-scale attention (EMA) mechanism, and it can achieve model lightweighting through parameter reduction while maintaining detection accuracy.
  • To address information redundancy in multi-scale feature extraction, we propose an improved DS_ADNet module. The integration of DSConv into the feature extraction pipeline enables this module to decouple spatial convolution and channel fusion, thereby improving feature representation efficiency.
  • We introduce a dynamic task alignment detection head, DyTAHead, to enhance feature interaction in downstream tasks. Combined with DSConv, it significantly reduces parameter redundancy. DyTAHead enables high-accuracy detection without increasing the model parameters.
The remainder of this paper is organized as follows: In Section 2, we review related work in infrared target detection and neural network architectures. In Section 3, we detail the proposed Efficient Infrared Ship Detection-YOLO (EISD-YOLO) framework, including the design of the PEMA-C3k2 feature extraction module, DS_ADNet downsampling unit, and DyTAHead detection architecture. In Section 4, we present comparative experiments and ablation studies to validate the effectiveness of the proposed improvements. Additionally, we discuss the experimental results, and we visually illustrate the model’s advantages in complex maritime scenarios. Finally, in Section 5, we summarize the study.

2. Related Work

In the field of infrared ship target detection, traditional methods rely on handcrafted features (e.g., HOG [5] and SIFT [6]) combined with classical machine learning classifiers such as SVM [7] and AdaBoost [8]. However, these approaches have notable limitations in complex maritime environments. First, their sliding-window mechanism requires the exhaustive traversal of high-resolution images to generate redundant candidate regions, which results in suboptimal computational efficiency. Moreover, infrared imagery often has minimal target–background temperature contrast, blurred boundaries, and interference from pseudo-thermal sources (e.g., wave spray). Handcrafted features are ineffective at capturing these characteristics, which weakens feature discriminability. Additionally, infrared radiation’s optical properties exacerbate this issue: extremely low target–background temperature differences (as low as 1–2 K) render gradient-dependent features such as HOG ineffective, whereas low resolution in long-wave infrared cameras blurs boundaries and undermines SIFT features’ stability [9]. Second, these methods suffer high false alarm rates. Sea clutter-induced fluctuations in background radiation intensity are difficult to model using manual feature engineering, and infrared signal jitter from atmospheric turbulence [10] makes fixed-rule handcrafted features unable to adapt to non-stationary optical interference, which further elevates the false alarm rate. In contrast, deep learning-based detection technologies use convolutional neural networks (CNNs) to automatically extract hierarchical multi-scale semantic features, thereby outperforming traditional methods significantly.
The improvement in performance is primarily attributed to innovations in the architectural design of deep learning detectors, which can be broadly classified into two mainstream approaches: one-stage and two-stage methods. Two-stage detectors, such as Faster R-CNN [11], use region proposal networks with adaptive intersection-over-union (IoU) thresholds to generate high-quality candidate boxes, thereby achieving breakthroughs in detection accuracy. In contrast, one-stage detectors use feature pyramid fusion strategies to reduce small-target miss rates while preserving real-time inference capabilities, which provides critical technical support for the development of all-weather anti-jamming infrared maritime monitoring systems.
In the infrared ship detection domain, the development of deep learning-based one-stage object detectors has been primarily motivated by the challenges posed by complex sea-surface scenarios, and one-stage detectors have emerged in response to these demands. Among their early representatives, YOLOv1 pioneered real-time object detection; however, it exhibited insufficient accuracy under thermal signals with low contrast [12]. Subsequently, SSD introduced multi-scale anchor boxes to improve recall rates; however, its performance suffered from increased false alarms caused by specular reflection interference [13]. YOLOv2 significantly reduced small-target miss rates through anchor box clustering and improved feature extraction [14], whereas YOLOv3 used a feature pyramid network to enhance long-range ship detection accuracy [15]. Subsequent versions focused on environmental robustness: YOLOv4 maintained high mAP under low-visibility conditions via architecture enhancements [16], and YOLOv6 reduced false alarms in sea clutter using an anchor-free design [17]. Recent innovations have advanced fine-grained recognition: YOLOv7 improved the detection of occluded targets through hierarchical feature learning [18], YOLOv9 enhanced ship contour detection using deformable convolutions [19], and YOLOv10 achieved model compression down to 3.8 M parameters via its C2f-Faster module [20]. The state-of-the-art YOLOv11 further reduced miss rates relative to YOLOv8 by incorporating dedicated attention mechanisms for weak thermal signals and decoupled detection heads.
The continuous evolution of the aforementioned YOLO series of algorithms has not only achieved robust adaptability in the field of infrared ship detection but also established a solid technical foundation for addressing a wider range of marine target detection scenarios. In the marine environment, beyond the challenge of infrared ship detection, challenges exist in the accurate identification of diverse types of targets under complex sea conditions. Because of the YOLO-series algorithms’ strengths in processing speed, detection accuracy, and environmental adaptability, they have become a preferred solution for many researchers seeking to enhance and optimize marine target detection systems.
Numerous researchers have also made improvements based on the YOLO series of algorithms. For example, Yao et al. improved YOLOv5 via fuzzy C-means clustering, thereby enhancing close-object localization and aiding in dense scenario detection [21]; however, their method focuses on visible-light or short-range targets and does not adapt to the weak thermal signals of long-range infrared ships. Chen et al. enhanced YOLOv3 by replacing NMS with Soft-NMS and adding feature fusion, thereby boosting detection robustness without introducing additional computational costs, despite still relying on the base model’s parameters [22]. Zhao et al. proposed YOLO-Marine, an adaptation of YOLOv4, which integrates a hybrid attention module, mosaic augmentation, and K-means++ anchor re-clustering, thereby improving small-target detection performance [23]. Yue et al. developed a lightweight detector by fusing MobileNetv2 with YOLOv4, along with sparse training and channel pruning; however, the model was insufficiently lightweight, and it yielded approximately 3% accuracy degradation and limited robustness in complex environments [24]. Chen et al. used SAS-FPN to enhance multi-scale SAR ship detection in CSD-YOLO; however, the model was relatively large and heavily reliant on GPU acceleration [25]. Guo et al. optimized YOLO for cross-scale fusion, but they encountered missed detections because of inadequate contextual information [26]. Ma et al.’s SP-YOLOv8s improved the detection of tiny objects via fine-grained feature enhancement; however, it captured limited global contextual information and was still not sufficiently lightweight at 15 MB [27]. Zhang et al.’s lightweight YOLOv7-tiny-based model with multi-scale residual modules performed well in complex maritime environments but exhibited performance degradation when detecting small ships [28]. Li et al. integrated FFDPN, ADown, PKINet, and modified C2f in YOLOv8-ship to enhance central features, thereby significantly improving robustness and inland vessel detection under harsh weather conditions, but significantly increasing the inference time [29]. Du et al. enhanced YOLOv8 for the detection of small traffic signs by introducing a novel attention mechanism; however, the training process required complex knowledge distillation [30]. Wang et al. incorporated MSDA and integrated ViT with YOLOv11 in NST-YOLO11, thereby boosting arbitrary-direction ship detection’s mAP under complex sea clutter; however, high computational demand limited its deployment on edge devices [31]. Zhang et al. proposed CA-YOLO, a YOLO-based improvement that adopts cross-attention to empower biomimetic localization tasks, further verifying that attention mechanisms can effectively enhance models’ adaptability to specific positioning requirements, though its focus on biomimetic scenarios differs from the infrared ship detection task in this study [32].
In recent years, research on infrared ship detection based on YOLO has made progress in either accuracy or lightweight design individually. However, existing models have three major limitations: first, most lightweight strategies are merely local adjustments, such as YOLO-Marine [23], which only optimizes anchors; second, the application of attention mechanisms remains at the level of performance enhancement (e.g., the MSDA in NST-YOLO11 [31]) and is not coordinated with the goal of lightweight design; third, the detection head fails to address the dynamic adaptation issue between classification and regression tasks, such as YOLOv8-ship [29], which still has localization errors.
The innovation of EISD-YOLO is not a mere stacking of technologies but a systematic solution tailored to the three core requirements (lightweight design, real-time performance, and robustness) of shipborne infrared detection; it solves the problem of feature extraction efficiency through PEMA-C3k2, addresses the issue of information loss during downsampling via DS_ADNet, and resolves the problem of task conflict adaptation with DyTAHead. These three modules form a closed-loop system, rather than the optimization for a single problem seen in existing models.
To summarize, although YOLO-series models exhibit excellent performance in infrared ship detection, their development has always been limited by the inherent trade-off between accuracy and lightweight design: models that pursue high accuracy are difficult to deploy because of their complex structures, whereas lightweight designs often sacrifice detection performance for low-contrast small and weak targets. Furthermore, existing methods lack targeted optimization for the unique low contrast and thermal clutter interference of infrared images, and complex models are also prone to generalization issues. To break through this bottleneck, we propose a novel solution. The results demonstrate that it successfully reduced the parameter count by 48.83% while increasing key metrics such as mAP@0.50, precision (P), and recall (R) by approximately 1.2% each. Thus, the proposed solution provides a better balance of real-time and accurate detection in complex marine environments.

3. Methodology

In this study, we introduce structural modifications to the YOLOv11n framework and propose EISD-YOLO, a more lightweight and efficient object detection algorithm tailored to infrared ship imagery. Figure 1 illustrates the overall network architecture. In this section, we elaborate on the architectural optimizations of these modules, detailing their design principles and technical contributions to detection performance.

3.1. PEMA-C3k2 Module

To address the limitation of traditional CNNs in extracting discriminative features under low-SNR conditions, the C3k2 module adopted by YOLOv11, as an important feature extraction component improved from the traditional C3 module, enhances the feature representation ability in complex backgrounds through structural innovations. However, the C3k2 module still has limitations in practical applications. Its ability to capture the features of small targets requires enhancement, and the use of larger convolution kernels lacks flexibility when focusing on subtle features. Existing lightweight methods (MobileNetV3 [33] and GhostNet [34]) can reduce the computational load through specialized architectures, but they often sacrifice multi-scale feature expression ability, which leads to a decline in detection accuracy in complex scenarios.
Against this backdrop, to enhance the lightweight nature without compromising (and even strengthening) the feature expression capability, we propose a new solution: a novel architecture centered on the PEMA block. The PEMA block uses Partial_conv3 to partition input features into channel-wise subspaces by processing only 1/4 of the channels to substantially reduce computational redundancy. Meanwhile, it integrates an EMA mechanism to dynamically recalibrate channel weights, which compensates for the potential loss of global semantics caused by partial convolution. As illustrated in Figure 2, the PEMA block comprises two core components: Partial_conv3 and the EMA mechanism. The module is initiated with a channel-splitting strategy aimed at reducing computational overhead: after channel dimension adjustment via 1 × 1 convolution, input features are divided into two sub-branches. In the Partial_conv3 branch, 25% of the channels undergo 3 × 3 convolution, and the remaining 75% are directly forwarded as identity connections. The branch outputs are concatenated and processed through a lightweight multi-layer perceptron for dimensionality compression, followed by dynamic weight recalibration via the EMA mechanism. Finally, the refined features are fused with the original input through residual connection.
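For concreteness, a minimal PyTorch sketch of the PEMA block as described above is given below. It is not our exact implementation: the class and parameter names (PartialConv3, PEMABlock, partial_ratio), the MLP width, and the placeholder for the EMA attention module are illustrative assumptions; only the 25%/75% channel split, the lightweight MLP, the EMA recalibration, and the residual fusion follow the description.

```python
import torch
import torch.nn as nn

class PartialConv3(nn.Module):
    """3x3 convolution applied to the first 1/4 of the channels; the remaining 3/4 pass through as identity."""
    def __init__(self, channels, partial_ratio=0.25):
        super().__init__()
        self.c_part = int(channels * partial_ratio)
        self.conv = nn.Conv2d(self.c_part, self.c_part, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.c_part], x[:, self.c_part:]   # split: 25% convolved, 75% identity
        return torch.cat((self.conv(x1), x2), dim=1)

class PEMABlock(nn.Module):
    """Partial convolution + lightweight MLP + EMA-style channel recalibration, fused by a residual link."""
    def __init__(self, channels, ema=None):
        super().__init__()
        self.adjust = nn.Conv2d(channels, channels, 1, bias=False)   # 1x1 channel-dimension adjustment
        self.pconv = PartialConv3(channels)
        self.mlp = nn.Sequential(                                    # lightweight MLP: compress then restore channels
            nn.Conv2d(channels, channels // 2, 1), nn.SiLU(),
            nn.Conv2d(channels // 2, channels, 1))
        self.ema = ema if ema is not None else nn.Identity()         # plug an EMA attention module in here

    def forward(self, x):
        y = self.adjust(x)
        y = self.pconv(y)
        y = self.mlp(y)
        y = self.ema(y)          # dynamic channel-weight recalibration
        return x + y             # residual fusion with the original input
```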
The PEMA-C3k2 module serves as a multi-scale feature fusion unit that is improved based on the cross-stage partial architecture. The input feature X first undergoes channel compression via a CBS unit, followed by parallel decomposition through a split operation. When c3k = True, the path uses stacked C3k units, which each integrate cascaded deformable convolutions to model local geometric variations. Conversely, when c3k = False, the path switches to a series of PEMA blocks.
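A corresponding sketch of how the PEMA block can replace the bottleneck inside a C3k2-style cross-stage partial wrapper is shown below. The split/stack/concatenate pattern follows the C2f/C3k2 convention of YOLOv11; the channel ratios and the C3k unit (assumed available from the baseline implementation) are illustrative, not the exact configuration.

```python
class PEMA_C3k2(nn.Module):
    """CSP-style fusion unit: CBS compression, channel split, a stack of PEMA blocks (c3k=False)
    or C3k units (c3k=True), then concatenation and a 1x1 fusion convolution."""
    def __init__(self, c_in, c_out, n=2, c3k=False):
        super().__init__()
        c_hid = c_out // 2
        self.cbs1 = nn.Sequential(nn.Conv2d(c_in, 2 * c_hid, 1, bias=False),
                                  nn.BatchNorm2d(2 * c_hid), nn.SiLU())
        block = C3k if c3k else PEMABlock          # C3k assumed to come from the YOLOv11 baseline code
        self.blocks = nn.ModuleList(block(c_hid) for _ in range(n))
        self.cbs2 = nn.Sequential(nn.Conv2d((2 + n) * c_hid, c_out, 1, bias=False),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = list(self.cbs1(x).chunk(2, dim=1))     # parallel decomposition via split
        for m in self.blocks:
            y.append(m(y[-1]))                     # stack blocks on the second branch
        return self.cbs2(torch.cat(y, dim=1))      # fuse all intermediate features
```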
To further verify the core role of the EMA mechanism in the PEMA module, we independently investigated the necessity of adding the EMA mechanism through ablation experiments. As shown in Table 1, although the precision of the model slightly increased after removing the EMA mechanism from PEMA-C3k2, this change led to a significant 1.4% decrease in recall. This indicates that PEMA-C3k2 adopted a more conservative detection strategy after the EMA mechanism was removed: it only predicted targets with extremely high confidence. While this approach reduced false positives, it came at the cost of increased false negatives. More importantly, the full PEMA-C3k2 (with the EMA mechanism retained) outperformed the EMA-free version by 0.9% in mAP@0.50 and 0.6% in mAP@0.50:0.95. This fully demonstrates that the EMA mechanism effectively enhances the overall performance and robustness of the model. By strengthening the model’s ability to focus on key features, the mechanism enables the model to more effectively identify various types of targets, including hard samples in particular, while maintaining high precision. In turn, this achieves a simultaneous improvement in both recall and mAP values.
The PEMA-C3k2 module can significantly enhance the feature extraction capability of the backbone network, which enables it to capture richer, clearer, and more discriminative features from complex infrared backgrounds. As shown in Figure 3, the first row of the corresponding feature maps presents the output feature maps of PEMA-C3k2, while the second row shows the output of the original C3k2 network. A comparison between the two rows intuitively highlights the advantages of the proposed design.
In terms of feature diversity, PEMA-C3k2’s feature maps have distinct colors like cyan, blue, green, and yellow; this shows that the network activates various differentiated features. For example, the yellow–green regions in its first map align accurately with dense ship clusters, and different channels in the second map focus on sea-surface ripples, boundary lines, and potential spots separately. In contrast, the original C3k2’s feature maps are almost all dark blue and purple; this monotonous color comes from severe feature homogenization in ordinary convolutional layers, where many filters learn similar or redundant features, leading to low information diversity. In terms of detail preservation, PEMA-C3k2 has clear advantages: its feature maps respond sharply to details. For example, the second figure shows that it fully preserves and enhances horizontal sea-surface ripples, even allowing faint identification of tiny sea spray spots. This strong detail-capturing ability is critical for distinguishing small or long-distance ships in complex sea clutter. In contrast, the original C3k2’s feature maps are generally blurry with low contrast; though sea surface ripples exist in the second figure, they are very faint, and much detail is lost. These weak signals may be filtered as noise in later network layers, raising the risk of missed target detection. Thus, PEMA-C3k2 effectively prevents key target information from blurring in the network’s shallow layers and significantly boosts the network’s perception of small targets and complex scenarios. Furthermore, PEMA-C3k2 performs better in target–background discrimination: its feature maps have strong local response points or regions, which mostly represent the network’s high responses to potential ship targets or key structures, making it easier to distinguish targets from backgrounds in the feature space. In contrast, the original C3k2 has weak, scattered feature responses that fail to form obvious focused regions. This mixes target and background features, resulting in very low discriminability, and posing great challenges to subsequent classification and regression networks. In summary, PEMA-C3k2 effectively suppresses background interferences like sea clutter and cloud layers while enhancing target region responses, greatly improving feature SNR and laying a foundation for simpler, more accurate target detection.
The experimental results also serve as evidence that the PEMA-C3k2 module can significantly enhance the feature extraction capability of the backbone network. As shown in Table 1, the PEMA-C3k2 module reduced FLOPs while achieving a 0.9% improvement in mAP@0.50, thereby outperforming the baseline C3k2 architecture in terms of both efficiency and accuracy.

3.2. DS_ADNet Module

YOLOv11 systematically optimizes detection accuracy, inference speed, and model lightweight requirements in complex scenarios. Although YOLOv11 achieves parameter compression through structural optimization, its computational load remains considerable in specific scenarios and during practical measurements on mobile processors. Traditional CNNs have significant feature redundancy during feature encoding, where a portion of channels in the high-dimensional feature tensors output by standard convolutional layers fails to generate effective gradient updates during subsequent network propagation. Such ineffective feature transmission constitutes redundant forward computational overhead.
To address the aforementioned challenges, particularly to enhance downsampling efficiency, we propose a targeted improvement scheme. In this study, we improve the ADown module, a lightweight convolutional component designed for downsampling in YOLOv9. The ADown module can significantly reduce model complexity and accelerate inference while retaining critical details when reducing the feature map resolution, thereby ensuring accurate detection of targets. However, these advantages are difficult to realize fully under the unique characteristics of infrared images, such as low contrast and weak texture.
To address the limitations of the ADown module, we propose a novel module called DS_ADNet. The core innovation of DS_ADNet is the integration of a dual-path feature fusion architecture with a lightweight design concept, which effectively addresses the aforementioned challenges through the collaborative optimization of parallel pooling strategies and DSConv. Specifically, based on dual-path feature fusion, DS_ADNet introduces DSConv to further eliminate coupling redundancy between spatial and channel features. This design decomposes the traditional convolution kernel into a combination of two serial operations: a DWConv kernel in the spatial dimension, $W_{\text{depth}} \in \mathbb{R}^{k \times k \times C}$, which is responsible for extracting local spatial patterns (e.g., geometric features of ship edges); and a pointwise convolution kernel in the channel dimension, $W_{\text{point}} \in \mathbb{R}^{1 \times 1 \times C \times C}$, which performs cross-channel semantic fusion. Through this decomposed structure, DS_ADNet achieves the decoupling of spatial and channel information.
$$\mathrm{Params}_{\mathrm{DSConv}} = (k_h \times k_w \times C_{in}) + (1 \times 1 \times C_{in} \times C_{out})$$
$$\mathrm{Params}_{\mathrm{Conv}} = k_h \times k_w \times C_{in} \times C_{out}$$
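As a quick numerical check of the two parameter formulas above (the channel counts below are chosen purely for illustration):

```python
def dsconv_params(k, c_in, c_out):
    """Depthwise k x k part (k*k*C_in) plus pointwise 1x1 part (C_in*C_out)."""
    return k * k * c_in + c_in * c_out

def conv_params(k, c_in, c_out):
    """Standard convolution: k*k*C_in*C_out."""
    return k * k * c_in * c_out

# Illustrative example: a 3x3 layer mapping 128 -> 128 channels.
k, c_in, c_out = 3, 128, 128
print(dsconv_params(k, c_in, c_out))   # 17,536 parameters
print(conv_params(k, c_in, c_out))     # 147,456 parameters -> roughly an 8.4x reduction
```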
As shown in Figure 4, the input feature X is first dimensionally reduced to $H/2 \times W/2$ through an Avgpool layer with a stride of 2, and then split into $X_1$ ($C_2$ channels) and $X_2$ ($C_1$ channels) along the channel dimension. Specifically, the $X_1$ branch extracts local details using DSConv and enhances nonlinearity through the SiLU activation function. The $X_2$ branch strengthens high-frequency features through Maxpool and DSConv. Finally, the dual-path outputs are fused via channel concatenation, which integrates global semantics with local textures and effectively mitigates the information loss commonly associated with traditional downsampling methods.
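The following is a minimal PyTorch sketch of this dual-path layout, written from the description above and Figure 4. The pooling kernel sizes, the even channel split, and the normalization choices are assumptions made to keep the snippet self-contained, not our exact configuration.

```python
class DSConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution (depthwise separable convolution)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return self.bn(self.pw(self.dw(x)))

class DS_ADNet(nn.Module):
    """Dual-path downsampling: AvgPool halves the resolution, then a DSConv+SiLU path and a
    MaxPool+DSConv path process the two channel halves and are fused by concatenation."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_in // 2
        self.avg = nn.AvgPool2d(2, stride=2)                                   # H x W -> H/2 x W/2
        self.branch1 = nn.Sequential(DSConv(c_half, c_out // 2), nn.SiLU())    # local-detail path
        self.pool2 = nn.MaxPool2d(3, stride=1, padding=1)                      # high-frequency emphasis
        self.branch2 = DSConv(c_in - c_half, c_out - c_out // 2)

    def forward(self, x):
        x = self.avg(x)
        x1, x2 = x.chunk(2, dim=1)                                             # channel split into the two paths
        return torch.cat((self.branch1(x1), self.branch2(self.pool2(x2))), dim=1)
```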
While significantly reducing the parameters, DS_ADNet maintains or even optimizes feature extraction quality. It does not sacrifice performance for a lightweight design, as shown in Figure 5a,b. The right first row presents the output of DS_ADNet, while the right second row shows the output of the original large-scale convolutional network. The original network’s feature maps are dense and complex, filled with numerous bright yellow–green dots, lines, and textures. This is because ordinary convolutions have massive parameters and fixed receptive fields, which fail to screen information effectively and, thus, lead to their capture of all potential information, including a large amount of redundant information. By contrast, DS_ADNet’s feature maps are more concise and sparse, with fewer and less dense bright feature points. Its lightweight design enables precise feature screening, so it only generates strong responses to prominent critical features. Even so, DS_ADNet still accurately retains the feature activation of key targets. For example, the core region of the ship target in the original image remains strongly and clearly responsive in DS_ADNet’s feature maps. This proves that the model’s lightweight design achieves precise slimming rather than performance degradation. In summary, the simplicity of DS_ADNet’s feature maps intuitively verifies that parameter reduction is remarkably effective, as redundant features are eliminated. More importantly, the model shows higher attention to key target regions like ships. Its ability to extract core features remains intact, and it captures critical information accurately. This indicates that our improvement reduces redundant parameters and computational complexity while preserving the network’s core efficiency.
The intuitive comparison of the aforementioned feature maps clearly confirms the lightweight advantages of DS_ADNet. Moreover, the specific effects of this advantage can be further verified through more rigorous quantitative experimental data. The results of the subsequent ablation experiments show that the DS_ADNet module reduced the model parameters by 24.05% and the computational load by 20.63% while maintaining an almost minimal decrease in mAP@0.50. Such significant efficiency improvements indicate that the DS_ADNet module successfully balances model complexity and detection accuracy.

3.3. DyTAHead Detection Head

In traditional object detection frameworks, classification and regression tasks share the same feature space, which fundamentally arises from uncontrollable coupling between feature representation spaces and may lead to potential conflicts in feature representation. From the perspective of task requirements, classification tasks rely on high-level semantic features, which are specifically global features such as object texture and color, for category discrimination. In contrast, regression tasks require geometrically sensitive localization features such as local geometric details, including edge positions and angles, to enable accurate coordinate prediction. This difference in feature requirements is particularly evident in complex scenarios. Although the decoupled head adopted in YOLOv11 alleviates this issue using parallel branches, its static feature allocation strategy has limitations.
Therefore, we propose DyTAHead, which achieves feature decoupling optimization for classification and regression tasks through a dynamic task decomposition and spatial adaptive alignment mechanism. The classification pathway uses global feature aggregation and channel attention mechanisms; meanwhile, the regression pathway introduces deformable convolutions to dynamically capture local geometric features, by generating sampling point coordinate offsets through an independent offset prediction network to adapt convolution kernels to target deformations, as illustrated in Figure 6. This design allows both pathways to share basic visual features (e.g., edges and corners) while performing task-specific processing at higher levels.
Building on the achievement of feature decoupling between classification and regression tasks, we further enhance task synergy and the model’s discriminative capability by optimizing the internal structure of the detection head. Specifically, a customized task consistency framework is integrated into the detection head. This framework uses multi-level convolutional layers (including grouped convolution Conv_GN and DSConv) to effectively extract features. This design not only significantly enhances the model’s ability to learn discriminative features from multi-target scenarios but also substantially reduces the number of parameters. Furthermore, the sigmoid activation function (Equation (3)) is used to introduce nonlinearity, which effectively reduces the network complexity and the computational load of the module, as shown in Equation (4), where $\mathrm{conv}_i$ denotes the i-th convolutional layer and $\delta$ represents the sigmoid activation function. The shared encoding layer is constructed using DSConv. Compared with standard convolution, the number of parameters is reduced to $\frac{1}{C_{out}} + \frac{1}{k^{2}}$ of the original (where k is the convolution kernel size).
$$\delta = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
$$X_i^{\mathrm{out}} = \begin{cases} \delta\left(\mathrm{conv}_i\left(X_0^{\mathrm{in}}\right)\right), & i = 1 \\ \delta\left(\mathrm{conv}_i\left(X_{i-1}^{\mathrm{out}}\right)\right), & i > 1 \end{cases}, \quad i \in \{1, 2, \ldots, N\}$$
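A small sketch of the cascaded encoding in Equation (4) is shown below; the number of layers, channel width, and the use of grouped convolution with GroupNorm as a stand-in for Conv_GN are assumptions for illustration only.

```python
class SharedEncoder(nn.Module):
    """Cascaded encoding layers following Equation (4): each stage applies conv_i and then the sigmoid gate."""
    def __init__(self, channels=64, num_layers=2, groups=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),  # Conv_GN stand-in
                nn.GroupNorm(16, channels))
            for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:          # X_i^out = sigmoid(conv_i(X_{i-1}^out)), with X_0 the shared input
            x = torch.sigmoid(layer(x))
        return x
```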
In the design of DyTAHead, we address the challenge of feature coupling between classification and regression tasks in object detection, thereby achieving the unification of dynamic task alignment and spatial adaptivity. First, we propose a dual-task decomposition mechanism to decouple shared features via an independent module. The classification branch uses global Avgpool to extract contextual semantic information, acquires task-interactive features from each convolutional layer through a feature extractor, and conducts dynamic feature selection to adapt to target deformation, occlusion, and background interference in complex scenarios while enhancing category-discriminative features via channel attention weights. The regression branch focuses on local details: it improves on the fixed offset prediction of traditional deformable convolution (DCNv2) by introducing a joint learning mechanism for offsets and dynamic weights, using a spatial convolutional layer to generate both 3 × 3 convolution kernel coordinate offsets and pointwise importance masks, enabling adaptive adjustment of sampling areas to boost boundary localization accuracy, especially for targets with extreme aspect ratios. For the generator’s spatial offset mechanism, an additional 3 × 3 convolutional layer generates offsets (determining sampling point positions) and modulation scalars (modulating feature weights); these are integrated into DCNv2 to let the convolution kernel dynamically adjust its location based on input feature map content, with sampling point positions guided by the task alignment mechanism to better accommodate ship shape and pose variations. At the semantic alignment level, we design a spatial attention mask to calibrate the classification features. A spatial attention map is generated through cascaded 1 × 1 and 3 × 3 convolution layers, which effectively suppresses background noise and highlights foreground semantics. To balance model efficiency and representation capability, general features are extracted at the lower layers using Conv_GN and DSConv, and group normalization is applied to enhance the stability of small-batch training. At the higher layers, multi-level residual features are fused through feature concatenation to construct a task-specific representation space while preserving rich detail information.
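To illustrate the classification-path mechanics described above, a compact sketch follows: global average pooling feeds a channel-attention gate, and a cascaded 1 × 1 / 3 × 3 convolution pair produces the spatial attention mask. The reduction ratio and layer widths are illustrative assumptions rather than our exact settings.

```python
class ClassificationPath(nn.Module):
    """Sketch of the DyTAHead classification path: channel attention from global context,
    followed by a spatial attention mask that calibrates the classification features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_attn = nn.Sequential(                       # dynamic channel selection
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_mask = nn.Sequential(                       # 1x1 then 3x3 conv -> spatial attention map
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, shared_feat):
        cls_feat = shared_feat * self.channel_attn(shared_feat)  # suppress task-irrelevant channels
        return cls_feat * self.spatial_mask(cls_feat)            # highlight foreground, damp background
```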
As shown in Figure 7, the multi-scale input feature maps P3, P4, and P5 are downsampled to the same size through the feature pyramid network and then aligned in spatial dimensions. In DyTAHead, feature fusion and lightweight processing are first performed through a shared convolution layer to generate a unified representation. The shared convolution layer includes a 3 × 3 convolution enhanced by GroupNorm and a DSConv. The fused features are then passed through a task decomposition module, which dynamically splits the joint features into a classification branch and a regression branch. The classification branch uses a channel attention mechanism to filter key features and calculates spatial attention weights via a dynamic weight generator composed of 1 × 1 convolution, ReLU activation, and 3 × 3 convolution, which enables point-wise dynamic feature screening in the feature maps. For the regression branch, a DCNv2 is used to learn feature offsets and modulation masks, enabling the convolution kernel to adaptively adjust to changes in target morphology. The features from the two branches are processed separately by the classification convolution head and regression convolution head, which each incorporate a learnable scaling factor. Ultimately, the model outputs classification confidence scores and refined bounding-box coordinates. The entire structure reduces parameter redundancy through shared convolutions, achieves feature decoupling between classification and localization via task decomposition, and strengthens task collaboration through a dynamic alignment mechanism. This design not only preserves the lightweight property of the model but also substantially enhances the adaptability of the detection head to different tasks.
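The regression path can likewise be sketched with torchvision's modulated deformable convolution, as shown below. The offset/mask layout for a 3 × 3 kernel (18 offset channels, 9 mask channels) follows the standard DCNv2 convention; the head width and the four-channel box output are illustrative assumptions, not the exact DyTAHead configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RegressionPath(nn.Module):
    """Sketch of the DyTAHead regression path: a 3x3 layer predicts per-position offsets and
    modulation masks, which drive a modulated deformable convolution over the shared features."""
    def __init__(self, channels):
        super().__init__()
        # 18 = 2 coordinates x 3x3 sampling points; 9 = one modulation scalar per sampling point
        self.offset_mask = nn.Conv2d(channels, 18 + 9, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)
        self.reg_head = nn.Conv2d(channels, 4, 1)               # illustrative 4-value box regression output

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :18], om[:, 18:].sigmoid()          # offsets move sampling points; mask re-weights them
        feat = self.dcn(x, offset, mask)
        return self.reg_head(feat)
```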
To further verify the independent roles of the core components in DyTAHead, we conducted ablation experiments on these core components individually, with the specific results and analysis presented below. The results are shown in Table 2. We used the original YOLOv11n as the baseline; after introducing the complete DyTAHead, the model maintained high efficiency in terms of parameter count and computational complexity. Meanwhile, its mAP@0.50 and mAP@0.50:0.95 increased to 91.5% and 63.8%, respectively, achieving optimal comprehensive performance.
First, when the DCNv2 component is removed, the model’s computational load drops to 7.3 G, but all performance metrics show a significant decline, with mAP@0.50:0.95 decreasing notably by 1.1%. This indicates that by enhancing the model’s adaptability to geometrically deformed targets, DCNv2 effectively improves regression and localization accuracy, and the performance gains that it brings far outweigh the increased computational overhead. Second, when the spatial probability modulation module is removed, precision increases to 91.9%, but recall drops sharply to 82.6%, leading to a 1.2% decrease in mAP@0.50:0.95. This confirms that this module can dynamically suppress the classification confidence of low-quality prediction boxes and filter out high-confidence but unreliable detection results. While it slightly sacrifices precision, it significantly improves recall, thereby optimizing the model’s overall performance. In contrast, the increased precision observed after the module’s removal actually comes at the cost of a severe loss in recall, ultimately resulting in a decline in comprehensive performance. Finally, when the task decomposition module is removed, the model’s parameter count slightly increases compared to the complete DyTAHead, and mAP@0.50:0.95 decreases by 1.2%. This confirms the value of this module: by explicitly modeling task-specific features, it can reduce network parameters while enhancing feature expression capability, making it a key component that balances model efficiency and performance.
In summary, ablation experiments fully demonstrate that each component in DyTAHead has a unique and irreplaceable function. The synergistic effect of these components enables the dynamic alignment of classification and regression tasks, ultimately allowing the model to achieve the optimal balance between accuracy and efficiency.
DyTAHead improves detection performance through three core innovative designs, and the effectiveness of these designs has been validated by previous ablation experiments. First, we used a task decoupling module to separate classification and regression features, which alleviates task conflicts, and this is also the core reason for performance degradation after removing this module. Second, we proposed a dynamic alignment mechanism that integrates deformable convolution with spatial probability maps to enhance adaptability to multi-scale targets, and the synergistic value of the two has been verified by previous individual ablation experiments. Third, lightweight shared encoding ensures a balance between computational efficiency and feature representation capability. The experimental results demonstrate that the model achieved 91.5% mAP@0.50 on IRShip with 7.6 M parameters, which is an improvement of 1.7% over the baseline. This provides an efficient and effective solution for the design of dynamic task-aligned detection heads.

4. Experimental Results

In this section, we perform a comprehensive analysis of the dataset used in the experiments, present the detailed training configurations of the network model (including fundamental parameters and evaluation metrics used to assess the detection performance), and describe extensive comparative experiments and ablation studies. It is worth noting that we did not use pre-trained weights in any of the experiments, and we strictly maintained the consistency of the experimental parameters.

4.1. Experimental Setup

The experimental platform leveraged hardware consisting of an Intel(R) Xeon(R) Gold 5317 CPU @ 3.00 GHz, 64 GB of RAM, and two NVIDIA GeForce RTX 3090 GPUs. The PyTorch framework (version 2.3.1) was used, with parameter settings inherited from the default configuration of YOLOv11, including adaptive anchor boxes and mosaic data augmentation. Table 3 provides an overview of the training parameter settings used in the experiments. Notably, all images in the experiments were uniformly processed using the letterbox method and resized to 640 × 640 for training.
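For reference, a minimal sketch of the letterbox preprocessing mentioned above is given below; it is not the exact Ultralytics implementation, and the gray padding value of 114 and centered placement are common conventions assumed here.

```python
import cv2
import numpy as np

def letterbox(image, new_size=640, pad_value=114):
    """Resize with preserved aspect ratio, then pad to a square new_size x new_size canvas.
    Assumes a 3-channel image; returns the scale and offsets needed to map boxes back."""
    h, w = image.shape[:2]
    scale = new_size / max(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=np.uint8)
    top = (new_size - resized.shape[0]) // 2
    left = (new_size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas, scale, (left, top)
```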

4.2. Dataset IRShip v1.0

In this study, we constructed IRShip v1.0, which is an annotated dataset specifically designed for infrared ship detection in maritime environments. It integrates publicly available data captured by professional InfiRay® equipment and data collected by our research group through approved on-site acquisitions.
It integrates two types of public data with field-collected images, forming a total of 10,013 rigorously annotated samples that span six ship types. The first type of public data is InfiRay® data, which is publicly accessible. The second type includes partial samples from the self-built infrared marine dataset of Shandong University, and official authorization has been obtained for the use of this dataset. Regarding the public data, the InfiRay Infrared open platform data mainly covers multi-scale ship targets in nearshore and offshore scenarios, while the samples authorized by Shandong University supplement the scenarios of medium-range ship imaging, further enriching the diversity of target scales. In contrast, the on-site data focuses on long-distance small targets, while the public data as a whole fully covers multi-scale targets. On-site data collection was conducted using a FLIR T630 3.0 infrared thermal imager (spectral range: 8–14 μm) in representative coastal environments near Anhui Province, including ports, offshore areas, and waterways, under conditions such as clear days, nights, and light fog; the images collected on-site account for approximately 20% of the entire IRShip v1.0 dataset.
As shown in Figure 8, the dominant resolution of 640 × 480 accounts for the highest proportion, at 42.02%. Notably, two resolutions close to the 1:1 ratio, 384 × 288 and 300 × 300, account for 18.68% and 18.94%, respectively, with a combined share of 37.62%. This squaring trend may be related to the popularization of mobile terminal image acquisition devices. The higher-definition specifications of 1280 × 1024 and the square resolution of 1024 × 1024 form the second echelon, accounting for 13.41% and 4.67%, respectively, which reflects the presence of professional image processing requirements. The remaining resolutions collectively account for less than 1%.
All ship targets were annotated with oriented bounding boxes (OBBs) using a customized LabelImg tool, ensuring a tight fit to hull contours. Annotation was performed by three trained annotators, with a two-stage quality control process: cross-checking 10% of samples between annotators, and random inspection of 20% by a senior researcher. The inter-annotator consistency met the high reliability standards.
After data augmentation, the IRShip v1.0 dataset contained 16,540 images, which each underwent rigorous annotation and review. The dataset was divided into a training set (including 14,370 images after data augmentation), validation set (1717 images), and test set (1325 images) via random sampling, which ensured a balanced distribution across different categories, resolutions, and scenarios.

4.3. Experimental Evaluation Metrics

To verify the effectiveness of model improvement, we used classic indicators such as Parameters, Precision, Recall, mAP@0.50, mAP@0.50:0.95, and FLOPs to comprehensively evaluate the model’s performance. The calculation methods for the relevant parameters are shown in Equations (5)–(8).
Precision (P) represents the proportion of correctly identified positive samples among all samples predicted as positive.
Recall (R) refers to the proportion of correctly identified positive samples among all actual positive samples.
Mean average precision (mAP) serves as a comprehensive measure of the average detection accuracy across different target categories. In multi-category detection, the average precision (AP) is first computed for each individual category, and then the mean of these APs is computed to obtain the mAP. Specifically, mAP@0.50 evaluates detection accuracy at an IoU threshold of 0.50, while mAP@0.50:0.95 computes the average mAP across a range of IoU thresholds from 0.50 to 0.95 (in 0.05 increments), which provides a more comprehensive assessment of detection performance under varying localization precision requirements.
$$P = \frac{TP}{TP + FP} \times 100\%$$
$$R = \frac{TP}{TP + FN} \times 100\%$$
$$AP = \int_0^1 P \, \mathrm{d}R$$
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N}$$
where N represents the number of target categories in the dataset; true positive (TP) indicates that the prediction is positive and the actual scenario is also positive, which means that the network model correctly detects the actual presence of a ship; false positive (FP) means that the prediction is positive, whereas the actual scenario is negative, i.e., the model mistakenly identifies an object that is not a ship as a ship; and false negative (FN) implies that the prediction is negative but the actual scenario is positive, i.e., the model fails to detect the actual presence of a ship target.
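A small numerical sketch of Equations (5)–(8) follows; the all-points interpolation used to approximate the AP integral is one common convention and is assumed here, not prescribed by the text.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall per Equations (5) and (6), returned as percentages."""
    p = tp / (tp + fp) * 100.0
    r = tp / (tp + fn) * 100.0
    return p, r

def average_precision(recalls, precisions):
    """Approximate Equation (7) by integrating precision over recall (recalls sorted ascending)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]            # make the PR envelope monotonically non-increasing
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))      # area under the interpolated curve

# mAP (Equation (8)) is then the mean of the per-class AP values:
# mAP = sum(ap_per_class) / num_classes
```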
The number of Parameters, floating-point operations (FLOPs), and model size are metrics used to thoroughly evaluate the lightweight characteristics of the proposed EISD-YOLO. FLOPs measures the computational complexity of a model, indicating the number of floating-point operations executed during inference. The parameter count refers to the total number of trainable parameters in the model, which serves as a key indicator of model complexity that directly affects storage requirements, computational burden, and training/inference speeds. The parameters and FLOPs of a convolutional layer are expressed as follows:
$$\mathrm{Params} = (k_h \times k_w \times C_{in}) \times C_{out}$$
$$\mathrm{FLOPs} = (k_h \times k_w \times C_{in} \times C_{out}) \times (H \times W)$$
where $k_h$ and $k_w$ denote the height and width of the kernel, respectively; $C_{in}$ and $C_{out}$ represent the number of input and output channels, respectively; and H and W represent the dimensions of the feature map. Parameters refers to the learnable variables of the model, which are continuously optimized during training to minimize the loss. FLOPs quantifies the floating-point operations performed during inference, which serve as an indicator of computational complexity or the cost of specific operations. Model-size reflects the model’s complexity and the required storage space. Increases in Parameters, FLOPs, and Model-size generally lead to higher hardware resource consumption.
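A quick arithmetic check of the two expressions above (the layer dimensions below are chosen only for illustration):

```python
def conv_layer_params(k_h, k_w, c_in, c_out):
    """Trainable parameters of a convolutional layer (bias omitted)."""
    return k_h * k_w * c_in * c_out

def conv_layer_flops(k_h, k_w, c_in, c_out, h, w):
    """Floating-point operations for one forward pass over an H x W output map."""
    return k_h * k_w * c_in * c_out * h * w

# Illustrative values: a 3x3 layer mapping 64 -> 128 channels on an 80 x 80 feature map.
print(conv_layer_params(3, 3, 64, 128))          # 73,728 parameters
print(conv_layer_flops(3, 3, 64, 128, 80, 80))   # 471,859,200 FLOPs (~0.47 GFLOPs)
```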

4.4. Ablation Experiment

Compared with the baseline YOLOv11 network, EISD-YOLO incorporates several architectural improvements. In our experiments, we evaluated how each enhancement affected the model performance and the mutual interactions between enhancements. Through ablation studies conducted on the IRShip dataset, we assessed the impact of replacing the DS_ADNet and PEMA-C3k2 modules in the backbone and neck, in addition to introducing the DyTAHead detection head. These studies consisted of eight sub-experiments, which each involved different combinations of the proposed improvements on the IRShip dataset.
The overall ablation experiment results are presented in Table 4. On the IRShip dataset, the introduction of DS_ADNet reduced the parameter count by 24.05% and GFLOPs by 20.63% compared with YOLOv11n, while maintaining nearly the same level of model accuracy. This demonstrates the effectiveness of the DS_ADNet module in terms of the lightweight design, which enabled input feature maps to be mapped to a high-dimensional nonlinear feature space without increasing computational overhead or the number of parameters. Second, based on the original YOLOv11n, the replacement of the C3k2 module in the backbone with PEMA-C3k2 and in the neck with C3k2-Faster reduced the parameters by 11.21%, increased P by 1.4%, and improved mAP@0.50 by 0.9%. This confirms that PEMA-C3k2 achieved superior performance over the original C3k2 module in terms of both feature representation and computational efficiency. Additionally, integrating the proposed DyTAHead detection head into YOLOv11n resulted in a 1.7% increase in mAP@0.50, a 14.01% reduction in parameters, and a 1.9% improvement in precision, which was the highest accuracy gain among all of the modifications. These three enhancements collectively contributed to the overall performance improvement of EISD-YOLO, with each component demonstrating distinct and complementary advantages in terms of lightweight design and detection accuracy.
In summary, when we integrated all three optimizations, EISD-YOLO achieved a 1.2% increase in mAP@0.50, a 43.56% reduction in the total model size, and a 48.83% decrease in the number of parameters compared with YOLOv11n. This demonstrates that the EISD-YOLO model successfully maintained high detection accuracy while achieving significant lightweight performance.
As shown in the training process curve of the ablation experiment in Figure 9, the successively introduced network modules exhibited notable differences in four core indicators. YOLOv11n converged rapidly at the initial stage in terms of precision and mAP@0.50, but its recall rate and comprehensive indicator mAP@0.50:0.95 exhibited limited growth in the later stage. In contrast, the improved model EISD-YOLO demonstrated comprehensive advantages with the dynamic scale fusion module. After 200 training epochs, EISD-YOLO significantly outperformed the baseline model and maintained the lowest fluctuation range across all indicators, which enabled it to achieve balance and optimality in comprehensive metrics.
We validated EISD-YOLO’s ability to balance accuracy and efficiency through the PR curve comparison shown in Figure 10. Notably, within the high-recall interval of 0.8–0.9, EISD-YOLO exhibited only a 1.2% precision fluctuation, thereby significantly outperforming YOLO-PEMA-C3k2 and YOLO-DS_ADNet in terms of stability. Although YOLO-DyTAHead achieved the best results for many metrics, its parameter count and computational load remained relatively high. In contrast, after the incorporation of PEMA-C3k2 and DS_ADNet, EISD-YOLO struck a balance between parameter count and accuracy. The model’s parameter count was reduced to 59.51% of that of YOLO-DyTAHead, accompanied by improved inference speed, and it maintained high precision thresholds (≥0.85) in extreme detection scenarios, with recall rates exceeding 0.9. The local magnification subplots demonstrate that this scheme successfully overcomes the defect of traditional models’ sharp precision degradation under high-recall conditions while ensuring detection sensitivity. Its balanced characteristics demonstrate unique advantages in industrial-grade embedded device deployment, thereby offering a novel technical solution for real-time object detection under resource-constrained conditions.
To further analyze ship detection performance in complex scenarios and investigate the trends of false positives and false negatives, we selected representative samples from diverse scenarios to evaluate the comprehensive performance of EISD-YOLO, as illustrated in Figure 11. The visual results in Figure 11 intuitively demonstrate the differences in ship detection performance between EISD-YOLO and the baseline model YOLOv11n on the IRShip test set using 14 groups of comparative samples. The experimental results demonstrate that the proposed EISD-YOLO model significantly improved detection accuracy and robustness under complex sea-surface backgrounds, specifically regarding three aspects. As shown in groups (a), (b), and (j) in Figure 11, the baseline model generated false positives in scenarios with cloud or fog interference, whereas our model accurately identified targets through enhanced feature extraction. Groups (c), (e), (i), (k), (l), (m), and (h) confirm that our model preserved accurate positioning capabilities in low-contrast environments. Group (g) demonstrates that EISD-YOLO achieved higher detection success rates than the baseline model in multi-target overlapping scenarios. Meanwhile, groups (d), (n), and (f) indicate that the improved feature extraction module substantially reduced the false negative rate.
In summary, EISD-YOLO not only reduced the false positive rate for small targets and the false negative rate for distant small targets but also achieved overall improvements in small-target detection accuracy. By integrating the PEMA-C3k2 and DS_ADNet modules, EISD-YOLO enhanced the network’s feature extraction and fusion capabilities, which enabled the more effective separation of target features from background clutter. This enabled the model to accurately predict the positions of occluded ships and those with similar hull structures, thereby avoiding detection performance degradation caused by mutual feature interference. Collectively, these advancements demonstrate that EISD-YOLO significantly outperformed YOLOv11n in infrared ship target detection.

4.5. Comparative Experiment

4.5.1. Comparison of Different Backbone Networks

In the optimization of the feature extraction network, as shown in Table 5, the proposed PEMA-C3k2 module demonstrated significant advantages while maintaining an input resolution of 640 × 640. Compared with the baseline model YOLOv11n, it reduced the parameter count by 11.3%, decreased the computational load by 7.9%, and achieved an optimal detection accuracy of 90.7% mAP@0.50, which is 0.9 percentage points higher than that of the original C3k2 module. These results validate the collaborative optimization mechanism of multi-scale feature enhancement and parameter simplification.
The comparative experimental results for the improvement of the C3k2 module are presented in Table 5. The final YOLOv11-PEMA-C3k2 model selected achieved superior performance for the two core indicators of the recall rate and mAP@0.50, reaching 83.3% and 90.7%, respectively. Meanwhile, it enabled efficient deployment, with a computational complexity of 2.29M parameters and 5.8G FLOPs. Compared with the baseline model YOLOv11n, this solution increased recall by 0.4 percentage points and mAP@0.50 by 0.9 percentage points, whereas it reduced the number of parameters by 11.3% and FLOPs by 7.9%. In a horizontal comparison with other improved models, although introducing WTConv into C3k2 achieved the highest precision of 91.6%, its recall rate of 82.6% and computational load of 6.2G FLOPs were slightly inferior to those of the PEMA solution. The YOLOv11-C3k2-Star-CAA variant, which fuses the StarBlock and CAA modules, attained the highest precision among all models (92.1%). However, its parameter count and FLOPs significantly increased to 3.03M and 8.1G FLOPs, respectively. Additionally, its recall rate of 81.1% and mAP@0.50 of 90.2% remained lower than those of the PEMA solution. The C3k2-AdditiveBlock achieved a higher recall rate of 83.1% but incurred significant increases in parameters (2.63M) and FLOPs (6.6G). The C3k2-GhostDynamicConv, with the lowest parameter count of only 2.23M, exhibited performance limitations in recall (80.6%) and mAP@0.50 (89.5%).
It is worth noting that C3k2-Faster and PEMA-C3k2 both exhibited a low computational load of 5.8G FLOPs, but the former’s recall rate of 81.9% and mAP@0.50 of 89.8% were significantly lower. Considering the requirements of precision, efficiency, and lightweight design comprehensively, YOLOv11-PEMA-C3k2 achieved a multi-dimensional performance balance and emerged as the preferred solution that integrated detection accuracy with engineering implementation.
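The reported reductions can be reproduced directly from the Table 5 values. The following minimal Python sketch (the variable names are ours, for illustration only) derives the 11.3% parameter reduction and the 7.9% FLOPs reduction of YOLOv11-PEMA-C3k2 relative to YOLOv11n:

```python
# Values taken directly from Table 5 (parameters in M, FLOPs in G).
baseline = {"params_m": 2.582347, "flops_g": 6.3}    # YOLOv11n
pema_c3k2 = {"params_m": 2.290115, "flops_g": 5.8}   # YOLOv11-PEMA-C3k2

def reduction(before: float, after: float) -> float:
    """Relative reduction in percent."""
    return (before - after) / before * 100.0

print(f"Parameter reduction: {reduction(baseline['params_m'], pema_c3k2['params_m']):.1f}%")  # ~11.3%
print(f"FLOPs reduction:     {reduction(baseline['flops_g'], pema_c3k2['flops_g']):.1f}%")    # ~7.9%
```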
Further validation of the training results is provided by the training curves in Figure 12. Specifically, the precision curve in Figure 12a shows that YOLO-PEMA-C3k2 maintained stability above 0.85 throughout the mid-to-late training stages, which indicates superior control of false positive rates. The recall curve in Figure 12b shows that the model reached its peak after 200 epochs, which verifies significant improvements in reducing false negatives. In terms of comprehensive performance metrics, the curves in Figure 12c,d collectively demonstrate performance improvements, despite reductions in both parameter count and computational load. Notably, the model exhibited rapid convergence during the early training stages, with all metric curves showing smaller fluctuations compared with other variants. This indicates that the improved network architecture effectively enhanced the training stability. Compared with variants that sacrificed computational resources or parameter count, YOLO-PEMA-C3k2 demonstrates distinct comprehensive advantages. The experimental results confirm that the YOLO-PEMA-C3k2 module significantly enhanced detection robustness in complex scenarios while preserving the lightweight characteristics of C3k2, thereby validating its superior comprehensive performance.
The PEMA-C3k2 module achieved the best trade-off between a precision of 90.9% and a recall of 83.3% while maintaining a computational efficiency of 5.8 G FLOPs. Its mAP@0.50:0.95 of 62.0% matched that of the baseline model. In conclusion, the experimental results demonstrate that the PEMA-C3k2 module was more suitable for ship detection tasks with higher recall requirements.

4.5.2. Comparative Analysis of Various Detection Heads

In terms of detection head architecture innovation, the experimental data in Table 6 indicate that DyTAHead achieved a breakthrough in balancing accuracy and efficiency. With a parameter count of 2.22 M and computational cost of 7.9 G FLOPs, DyTAHead increased mAP@0.50:0.95 to 63.8%, which is a 1.8-percentage-point increase over the original detection head.
As shown in Table 6, the DyTAHead detection head demonstrates significant comprehensive performance advantages in object detection tasks. Compared with the baseline model YOLOv11n, DyTAHead achieved a detection accuracy of mAP@0.50 = 91.5% and recall rate of 85.7% through a dynamic task alignment mechanism, while maintaining a parameter count of 2.22 M to ensure parameter efficiency. This represents improvements of 1.7 and 2.8 percentage points for mAP@0.50 and recall rate, respectively, compared with the baseline. Notably, this improved scheme reduced the parameter count by 14% while increasing mAP@0.50:0.95 to 63.8%, significantly outperforming other improved detection heads such as LADH (61.7%) and EfficientHead (62.2%). Although its FLOPs increased to 7.9 G, which represents a 25% increase from the baseline’s 6.3 G, its accuracy gain was significantly higher than that of other high-computation schemes; for example, Dyhead, with 7.4 G FLOPs, only achieved an mAP@0.50:0.95 of 62.4%. The experimental data validate the adaptive advantages of the dynamic feature fusion mechanism in complex scenarios, particularly regarding the enhancement of feature representation for multi-scale targets. This makes the model an optimal solution for balancing accuracy and efficiency, which ultimately supports decision-making in practical deployment scenarios.
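DyTAHead's dynamic task alignment follows the task-aligned learning idea of TOOD [39], in which classification confidence and localization quality are combined into a single alignment score used to reweight training samples. The sketch below illustrates only this general TOOD-style metric t = s^α·u^β with commonly used exponents; it is not the exact formulation or weighting used inside DyTAHead:

```python
import torch

def task_alignment_score(cls_score: torch.Tensor, iou: torch.Tensor,
                         alpha: float = 1.0, beta: float = 6.0) -> torch.Tensor:
    """TOOD-style alignment metric t = s^alpha * u^beta.

    alpha/beta are illustrative defaults; DyTAHead's own weighting is defined
    in the method section of this paper and is not reproduced here.
    """
    return cls_score.pow(alpha) * iou.pow(beta)

# Toy example: two candidate predictions for one ground-truth ship.
cls_score = torch.tensor([0.80, 0.60])   # classification confidence
iou = torch.tensor([0.55, 0.85])         # localization quality (IoU with the ground truth)
t = task_alignment_score(cls_score, iou)
print(t)  # the well-localized candidate dominates despite its lower class score
```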
To comprehensively validate the effectiveness of the improved detection head structures, we conducted comparative analyses among seven mainstream detection heads, including YOLOv11-Dyhead, YOLOv11-EfficientHead, and YOLOv11-LSCD. As shown in Figure 13, the precision curve in Figure 13a shows that YOLO-DyTAHead stably maintained a value above 0.92 after training convergence, which significantly outperformed the baseline YOLOv11n. The recall curve in Figure 13b shows that this model exhibited the best false negative suppression capability, with a recall rate peaking at 85.7%. In terms of comprehensive evaluation metrics, the mAP@0.50 curve in Figure 13c and the mAP@0.50:0.95 curve in Figure 13d show that YOLO-DyTAHead led all other models, particularly for the stricter mAP@0.50:0.95 metric. Notably, while incurring only limited increases in computational load, this detection head achieved superior performance and stability across all metric curves compared with traditional dynamic detection heads, thereby ultimately achieving significant improvements in comprehensive performance.

4.5.3. Comparison of Different Detection Algorithms

For the target detection task, we systematically compared 12 mainstream detection algorithms. These algorithms spanned four major categories: two-stage detectors, single-stage detectors, transformer-based models, and lightweight models. We focused the evaluation on multi-dimensional metrics, including accuracy, inference speed, computational cost, and scene adaptability. In the classification based on model architecture, Faster R-CNN served as the benchmark among the two-stage detectors. Its derivative, Cascade R-CNN, gradually increases the IoU threshold through a three-stage cascade detection head, thereby iteratively refining the localization accuracy of the candidate bounding boxes. The single-stage detectors included lightweight versions of the YOLO series, such as YOLOv3-Tiny, YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv12n, in addition to the TOOD model based on task-aligned learning. Among the transformer-based models, we selected the classic DETR and its improved version Dab-DETR; the latter formulates queries as dynamic anchor boxes that are refined layer by layer to enhance localization. EfficientNet-B3, as a lightweight backbone network, replaces ResNet-50 for cross-architecture comparison. We conducted in-depth experiments on the IRShip dataset to comprehensively evaluate the performance of EISD-YOLO. We designed these experiments to demonstrate the effectiveness of our EISD-YOLO model in the ship detection task. The results are presented in Table 7.
As shown in Figure 14, the EISD-YOLO model demonstrated notable advantages in terms of comprehensive performance and accuracy metrics in the comparative experiments using different object detection models. The histogram compares the performance of various models, including Dab-DETR, DETR, Faster-RCNN, EfficientNet, TOOD, YOLOv12n, YOLOv8n, YOLOv5n, ATSS, YOLOv3-tiny, and EISD-YOLO, on four core performance indicators. Most notably, EISD-YOLO achieved 91.0% for the mAP@0.50 metric (purple bar in the chart), the highest value among all compared models, which indicates a clear lead in target positioning accuracy. For the stricter mAP@0.50:0.95 metric (yellow bar in the figure), which reflects the comprehensive detection capability of a model, EISD-YOLO reached 62.7%, the best result among the lightweight detectors and competitive with much larger models, illustrating its robustness under varying IoU threshold requirements. In summary, the results in the figure show that EISD-YOLO delivered the best overall performance in this experiment when accuracy is considered together with model size and computational cost.
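As a reminder of why mAP@0.50:0.95 is the stricter of the two metrics, it averages the AP obtained at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, so a detector must localize accurately even at tight thresholds. A minimal sketch of this averaging, with placeholder (not measured) per-threshold AP values:

```python
import numpy as np

# Ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 (COCO convention).
iou_thresholds = np.arange(0.50, 0.96, 0.05)

# Placeholder per-threshold AP values for illustration only; AP typically drops
# as the IoU threshold tightens.
ap_per_iou = np.array([0.91, 0.90, 0.88, 0.85, 0.81, 0.76, 0.68, 0.57, 0.42, 0.22])

map_50 = ap_per_iou[0]          # mAP@0.50 uses only the loosest threshold
map_50_95 = ap_per_iou.mean()   # mAP@0.50:0.95 averages across all ten thresholds
print(f"mAP@0.50 = {map_50:.3f}, mAP@0.50:0.95 = {map_50_95:.3f}")
```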
These results clearly demonstrate the feasibility of EISD-YOLO and its excellent detection performance, which make it well suited for real-time monitoring on devices with limited resources. The EISD-YOLO model exhibited outstanding performance on the IRShip dataset, leading in terms of accuracy, parameter count, and computational efficiency. This indicates that EISD-YOLO has great potential in object detection tasks, particularly in scenarios where computational resources are limited.

4.6. Verification of EISD-YOLO’s Generalization Capability on Different Subsets of the IRShip Dataset

To address concerns about the generalization capability of the EISD-YOLO model under existing conditions, this section conducts a targeted analysis. Because publicly available infrared ship datasets are extremely scarce, verifying generalization across multiple independent external datasets is not currently feasible. Therefore, we used the three heterogeneous subsets of the self-constructed IRShip dataset for in-domain generalization verification; the diversity of sources within this dataset partially compensates for the lack of external public data and reflects the model's adaptability to different data distributions.
The three heterogeneous subsets of the IRShip dataset include public data from InfiRay Technology (covering standardized scenarios), a maritime dataset constructed by Shandong University (in relatively simple nearshore scenarios), and field-collected data obtained by our team (containing real-world interferences such as waves and occlusions). In the experiment, we directly used the pre-trained EISD-YOLO model to perform inference on each subset, and we evaluated its performance using core metrics including mAP, precision, and recall, with the specific results presented in Table 8.
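For reproducibility, the per-subset evaluation can be run with the standard Ultralytics validation interface. The sketch below is illustrative only: the weight file and the dataset YAML names are placeholders rather than the actual file names used in our experiments.

```python
from ultralytics import YOLO

# Pre-trained EISD-YOLO weights (hypothetical path, used only for illustration).
model = YOLO("eisd_yolo_best.pt")

# Hypothetical dataset configuration files, one per IRShip subset.
subsets = ["irship_infiray.yaml", "irship_shandong.yaml", "irship_field.yaml"]

for subset_yaml in subsets:
    # Run validation on the subset and collect the standard detection metrics.
    metrics = model.val(data=subset_yaml, imgsz=640, batch=32)
    print(subset_yaml,
          f"P={metrics.box.mp:.3f}",
          f"R={metrics.box.mr:.3f}",
          f"mAP50={metrics.box.map50:.3f}",
          f"mAP50-95={metrics.box.map:.3f}")
```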
On the InfiRay Public Infrared Dataset, EISD-YOLO outperforms YOLOv11n in all metrics: it achieves a precision of 90.6%, a recall of 83.7%, an mAP@0.50 of 91.1%, and an mAP@0.50:0.95 of 60.9%, compared with 90.0%, 81.5%, 89.3%, and 58.3% for YOLOv11n. This indicates that EISD-YOLO offers stronger detection accuracy on standardized data, laying a solid foundation for its generalization capability. On the Shandong University Maritime Dataset, both models achieve near-saturated performance, with precision and recall of about 99.9%, an mAP@0.50 of 99.5%, and an mAP@0.50:0.95 of roughly 94% (Table 8), because the scenes in this subset are relatively simple. On the field-collected dataset, EISD-YOLO exhibits better robustness to differences in real-world data distributions: its mAP@0.50 reaches 91.6% and its mAP@0.50:0.95 reaches 66.3%, confirming its superior generalization capability in practical scenarios.

4.7. Discussion

4.7.1. Comparison with Other Related Works

Infrared ship detection, as an important branch of optical imaging target detection in photonics, has attracted extensive attention. Compared with the results of existing studies, those of EISD-YOLO demonstrate advantages regarding three aspects:
First, EISD-YOLO achieved breakthrough progress in balancing accuracy and efficiency. Wang et al. proposed PPGS-YOLO, which focuses on lightweight detection in offshore dense-obstacle scenarios, but it did not achieve simultaneous gains in accuracy and reductions in parameters [41]. Liu et al. developed PJ-YOLO, which improves detection accuracy through prior-knowledge fusion, but its parameter count is as high as 28 M [42]. Guo et al. designed YOLO-IRS, which adopts a self-attention mechanism to handle complex backgrounds, but at the cost of a 30% increase in parameters [43]. In contrast, EISD-YOLO, through the EMA mechanism of the PEMA-C3k2 module and the depth-wise separable convolution of DS_ADNet, reduced the parameters by 48.83% and compressed the model size by 43.56% while increasing mAP@0.50 by 1.2%, achieving a collaborative breakthrough in lightweight design and accuracy.
Second, EISD-YOLO innovatively addresses the conflict between classification and localization features. Jiang et al. proposed an infrared ship task alignment method that adopts a static strategy with fixed loss weights (1:1) [44], whereas the DyTAHead detection head adaptively adjusts weights according to target sizes, thereby overcoming the adaptability limitations of traditional task-alignment designs in infrared scenarios. Targeting the time-varying characteristics of infrared photon energy distribution, this mechanism achieved higher task collaboration efficiency within a lightweight framework through task decomposition and spatially adaptive alignment.
Third, EISD-YOLO exhibited excellent robustness in complex scenarios. Zhan et al. developed EGISD-YOLO, which improves small-target detection through edge guidance, but with a 30% increase in parameters [45]. Mo et al. designed a video trajectory feature extraction method that relies on temporal analysis to cope with dynamic background interference [46]. In contrast, EISD-YOLO, through the dual-path pooling design of DS_ADNet, achieved sea-wave noise suppression in single-frame images, thereby effectively overcoming the non-uniform interference of infrared radiation on the sea surface. While reducing the parameters by 48.83%, it maintained simultaneous improvements in precision and recall, and it reduced both false detection and missed detection rates.
In summary, EISD-YOLO achieved three major breakthroughs through the synergy of its three modules: First, it broke the consensus that “lightweighting inevitably sacrifices accuracy.” Compared with the algorithms in references [41,42,43], it reduced the parameters by 48.83% while increasing mAP@0.50 by 1.2%. Second, it pioneered a dynamic task alignment mechanism, which significantly improved small-target detection performance compared with the static method in reference [44], thereby providing a new paradigm for low-contrast scenarios. Third, it enhanced robustness in complex scenarios, as the 5.7 MB model achieved detection performance comparable to or better than that of much larger models, thereby addressing the deployment challenges of references [46,47] in resource-constrained scenarios. These breakthroughs not only enrich the library of efficient optical detection methods in the field of photonics but also address the real-time processing bottleneck of shipborne infrared systems, providing a standardized solution for maritime engineering applications.
The lightweight performance of EISD-YOLO is mainly driven by the DS_ADNet downsampling module and PEMA-C3k2 feature extraction module. The introduction of DS_ADNet alone reduces the parameter count by 24.05% and FLOPs by 20.63% compared with YOLOv11n, while replacing the original C3k2 with PEMA-C3k2 cuts the parameters by an additional 11.21% and maintains efficiency by leveraging partial convolution and multi-scale attention. In terms of accuracy enhancement, the core contributors are the DyTAHead detection head and the EMA mechanism in PEMA-C3k2. DyTAHead improves mAP@0.50 by 1.7% and recall by 2.8% through dynamic task alignment and deformable convolution, and removing the EMA mechanism from PEMA-C3k2 leads to a 1.4% drop in recall and a 0.9% reduction in mAP@0.50 (Table 1), confirming its role in preserving global semantic features. It is the synergistic effect of these three modules that enables EISD-YOLO to reduce the parameters by 48.83% and increase mAP@0.50 by 1.2% simultaneously, breaking the traditional trade-off between lightweight design and detection accuracy.
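The parameter savings of the depth-wise separable design used in DS_ADNet can be illustrated by comparing a standard stride-2 3 × 3 convolution with a depth-wise plus point-wise pair. The sketch below uses example channel widths and does not reproduce the actual DS_ADNet module (which additionally includes the dual-path pooling branch); it only demonstrates the general principle behind the savings:

```python
import torch
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

c_in, c_out = 128, 256  # example channel widths, not the paper's exact configuration

# Standard stride-2 3x3 convolution, as used by a plain downsampling layer.
standard = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

# Depth-wise separable alternative: a per-channel 3x3 depth-wise convolution
# followed by a 1x1 point-wise projection to the target channel width.
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=3, stride=2, padding=1, groups=c_in),
    nn.Conv2d(c_in, c_out, kernel_size=1),
)

x = torch.randn(1, c_in, 80, 80)
assert standard(x).shape == separable(x).shape  # both halve the spatial resolution
print(f"standard : {param_count(standard):,} params")   # 3*3*128*256 + 256  = 295,168
print(f"separable: {param_count(separable):,} params")  # roughly an order of magnitude fewer
```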
However, EISD-YOLO still has limitations: First, the dataset insufficiently covers samples from extreme environments (e.g., ice-covered sea surfaces) and special targets (e.g., military ships), which may lead to a decline in detection capability during generalization. Second, its single-modal design does not support cross-modal fusion, which results in reduced accuracy when migrated to multi-modal scenarios. Third, DyTAHead still has positioning errors for ships with extreme attitudes, and the suppression of false targets under high sea conditions relies on static thresholds (which are not fully adapted to the related scenario challenges in references [47,48]). In the future, these limitations could be overcome through cross-modal alignment and dynamic threshold optimization.

4.7.2. Limitations and Future Work

Although EISD-YOLO has achieved positive results, its design still has inherent limitations, which also clarify directions for future research:
First, the trade-off between performance and complexity. While EISD-YOLO significantly reduces the parameter count, its core module DyTAHead introduces a certain degree of computational overhead (Table 2 and Table 6). Compared with YOLOv11n, the FLOPs of both the complete EISD-YOLO model and the DyTAHead module increased. This trade-off is reflected in the following: higher detection accuracy and recall are obtained through operations such as dynamic task alignment and deformable convolution; however, on edge devices with extremely limited resources, although the model size is small, its inference speed may not be optimal. In future work, we will further streamline DyTAHead via neural architecture search or more efficient dynamic mechanisms, aiming to reduce FLOPs while maintaining its performance advantages.
Second, the detection of small and long-distance targets requires further optimization. Although the DS_ADNet and PEMA-C3k2 modules enhance the ability to extract detailed features, the detection performance of EISD-YOLO still degrades for extremely small or long-distance ship targets that occupy only a few pixels. Such targets have weak features in infrared images and are easily confused with background noise, making it difficult for the model to generate high-confidence detection boxes and, thus, leading to missed detections. To address this issue, future research will explore dedicated detection strategies for extremely small targets, such as integrating higher-resolution shallow feature maps into the FPN, designing loss functions for ultra-small targets, or introducing GANs to enhance small-target features.
Third, there is an inherent performance gap compared with large-scale or task-specialized models. The core goal of this study is to achieve optimal performance under lightweight constraints; therefore, the upper performance limit of EISD-YOLO cannot match that of large-scale models with large parameter counts and high computational budgets (e.g., NST-YOLO11 [31]) or of models highly specialized for specific tasks. As shown in Table 7, although EISD-YOLO performs excellently among lightweight models, large two-stage detectors such as Cascade R-CNN achieve better results in strict localization metrics like mAP@0.50:0.95. This reflects the inherent trade-off between model capacity and expressive ability. For scenarios with sufficient resources where ultimate accuracy is prioritized, a feasible direction is to explore the effectiveness of the proposed modules on larger models, or to adopt knowledge distillation to enable the lightweight EISD-YOLO to learn from a large-scale teacher model, thereby approaching the teacher model's performance.
Fourth, verification of real-world deployment needs to be supplemented. The lightweight characteristics of EISD-YOLO theoretically make it suitable for resource-constrained environments. However, the current study only supports the claim of deployment suitability through design, and it lacks empirical benchmarking on embedded hardware (e.g., NVIDIA Jetson devices). The existing evaluation is missing data on practical metrics such as FPS, latency, and power consumption on edge platforms. In the near future, we plan to conduct extensive deployment testing on such devices to verify the operational efficiency of the model in real-world edge scenarios.
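The planned on-device benchmarking could follow the simple timing protocol sketched below; the weight path and iteration counts are placeholders, and the actual figures will depend on the target hardware, input resolution, and export format (PyTorch, ONNX, TensorRT):

```python
import time
import numpy as np
from ultralytics import YOLO

# Hypothetical weight file name; replace with the exported EISD-YOLO weights.
model = YOLO("eisd_yolo_best.pt")
dummy = np.zeros((640, 640, 3), dtype=np.uint8)  # synthetic 640x640 test frame

# Warm-up runs so that one-time initialization cost is excluded from the timing.
for _ in range(10):
    model.predict(dummy, imgsz=640, verbose=False)

n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict(dummy, imgsz=640, verbose=False)
elapsed = time.perf_counter() - start

print(f"mean latency: {1000 * elapsed / n:.1f} ms  |  throughput: {n / elapsed:.1f} FPS")
```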

5. Conclusions

We proposed a lightweight and high-precision deep learning model, EISD-YOLO, for infrared ship detection. It improved on the baseline YOLOv11n model through three key optimizations: First, we designed a new backbone network, PEMA-C3k2, which significantly reduced the number of parameters while maintaining detection accuracy. Second, we introduced an improved downsampling module, DS_ADNet, to reduce the model parameters and computational complexity, eliminate redundant structures, and effectively preserve accuracy. Third, we proposed a novel DyTAHead detection head that adopts a task branching strategy to address the task conflict between classification and localization in the YOLOv11n detection head, thereby achieving efficient detection. The experimental results on the self-constructed IRShip dataset demonstrate that EISD-YOLO outperformed YOLOv11n in terms of accuracy, model size, parameter count, and computational complexity. Specifically, EISD-YOLO achieved an mAP@0.50 of 91.0%, which was 1.2% higher than the baseline. Moreover, it reduced the number of parameters by 48.83%, decreased the model size by 43.56%, and broke the inherent trade-off between accuracy and lightweight design. Its lightweight characteristics facilitate its deployment in resource-constrained environments. These results verify the significant advantages of EISD-YOLO in infrared image ship detection tasks, offer valuable insights for researchers in the field of ship detection, and improve inference performance in practical applications. Nevertheless, this study has limitations: infrared ship detection is significantly affected by environmental factors and target diversity, and the current model still has room for improvement in balancing computational efficiency and accuracy. Future research will focus on the optimization of the backbone network, integration of knowledge distillation technology, and improvement of small-target detection performance to better address the practical requirements of resource-limited maritime environments.

Author Contributions

Conceptualization, S.W., Y.F., J.C., H.T. and C.Z.; data curation, S.W., W.J., H.T. and C.Z.; formal analysis, Y.F., W.J., H.T. and C.Z.; funding acquisition, Y.F., W.J. and L.L.; investigation, Y.F., W.J., H.T. and C.Z.; methodology, Y.F. and W.J.; project administration, Y.F., W.J. and L.L.; resources, J.C. and L.L.; software, S.W., J.C. and L.L.; supervision, S.W., J.C., W.J. and L.L.; validation, S.W. and J.C.; writing—original draft, S.W.; writing—review and editing, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Projects of the Foundation Strengthening Program, grant number 2023-JJ-0604.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kou, R.; Wang, C.; Zhang, Y.; Tang, P.; Huang, F.; Fu, Q. Review of Performance Evaluation Methods for Infrared Detection Systems. Infrared Technol. 2024, 12, 1411–1417. (In Chinese) [Google Scholar]
  2. Ji, W.; Qi, L.; Xing, P.; Yang, G. Analysis of the Influence of Cloud Occlusion on the Infrared Transmittance over the Sea. J. Atmos. Environ. Opt. 2021, 2, 88–97. (In Chinese) [Google Scholar]
  3. Chen, T.; Cai, C.; Zhang, J.; Dong, Y.; Yang, M.; Wang, D.; Yang, J.; Liang, C. RER-YOLO: Improved Method for Surface Defect Detection of Aluminum Ingot Alloy Based on YOLOv5. Opt. Express 2024, 32, 8763–8777. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, Y.; Chen, H.; Ge, Z.; Jiang, Y.; Ge, H.; Zhao, Y.; Xiong, H. SMR–YOLO: Multi-Scale Detection of Concealed Suspicious Objects in Terahertz Images. Photonics 2024, 11, 778. [Google Scholar] [CrossRef]
  5. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  6. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 2, 91–110. [Google Scholar] [CrossRef]
  7. Hearst, M.A. Support Vector Machines. IEEE Intell. Syst. 1998, 4, 18–28. [Google Scholar] [CrossRef]
  8. Freund, Y.; Schapire, R.E. Experiments with a New Boosting Algorithm. In Proceedings of the Machine Learning: Proceedings of the Thirteenth International Conference, Bari, Italy, 3–6 July 1996; pp. 148–156. [Google Scholar]
  9. Li, X.; Wu, Q.; Wang, Y. Binocular Vision Calibration Method for a Long-Wavelength Infrared Camera and a Visible Spectrum Camera with Different Resolutions. Opt. Express 2021, 29, 3855–3872. [Google Scholar] [CrossRef]
  10. Jiang, G.-Y.; Li, Y.-P.; Li, X.-H.; Zhang, W.-D.; Wan, Z.-A.; Zhu, Q.-M.; Gong, P.-F.; Zhang, S. Performance of Ship-Based QKD Under the Influence of Sea-Surface Atmospheric Turbulence. Photonics 2025, 12, 340. [Google Scholar] [CrossRef]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  15. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  17. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  18. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  19. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar]
  20. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  21. Li, Y.; Chen, X.; Rao, P.; Zhang, S.; Liu, G. A Resolution and Localization Algorithm for Closely-Spaced Objects based on Improved YOLOv5 Joint Fuzzy C-Means Clustering. IEEE Photonics J. 2024, 16, 1–14. [Google Scholar] [CrossRef]
  22. Chen, L.; Li, B.; Qi, L. Research on YOLOv3 Ship Target Detection Algorithm Fusing Image Saliency. Softw. Guide 2020, 10, 146–151. (In Chinese) [Google Scholar]
  23. Zhao, Y.; Guo, H.; Jiao, H.; Zhang, J. Application of YOLOv4 Fusing Hybrid-Domain Attention in Ship Detection. Comput. Mod. 2021, 9, 75–82. (In Chinese) [Google Scholar]
  24. Yue, T.; Yang, Y.; Niu, J.-M. A Light-Weight Ship Detection and Recognition Method Based on YOLOv4. In Proceedings of the 2021 4th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Changsha, China, 26–28 March 2021; pp. 661–670. [Google Scholar]
  25. Chen, Z.; Liu, C.; Filaretov, V.F.; Yukhimets, D.A. Multi-Scale Ship Detection Algorithm Based on YOLOv7 for Complex Scene SAR Images. Remote Sens. 2023, 15, 2071. [Google Scholar] [CrossRef]
  26. Guo, Y.; Lu, Y.; Liu, R.W. Lightweight deep network-enabled real-time low-visibility enhancement for promoting vessel detection in maritime video surveillance. J. Navig. 2022, 75, 230–250. [Google Scholar] [CrossRef]
  27. Ma, M.; Pang, H. SP-YOLOv8s: An Improved YOLOv8s Model for Remote Sensing Image Tiny Object Detection. Appl. Sci. 2023, 13, 8161. [Google Scholar] [CrossRef]
  28. Zhang, L.; Du, X.; Zhang, R.; Zhang, J. A Lightweight Detection Algorithm for Unmanned Surface Vehicles Based on Multi-Scale Feature Fusion. J. Mar. Sci. Eng. 2023, 11, 1392. [Google Scholar] [CrossRef]
  29. Li, C.; Cai, Y.; Hu, J.; Zhan, W. Research on Ship Target Detection Algorithm under Severe Weather Conditions Based on Improved YOLOv8. Mod. Electron. Tech. 2025, 48, 77–82. (In Chinese) [Google Scholar]
  30. Zhang, Y.; Chen, W.; Li, S.; Liu, H.; Hu, Q. YOLO-Ships: Lightweight ship object detection based on feature enhancement. J. Vis. Commun. Image Represent. 2024, 101, 104170. [Google Scholar] [CrossRef]
  31. Huang, Y.; Wang, D.; Wu, B.; An, D. NST-YOLO11: ViT Merged Model with Neuron Attention for Arbitrary-Oriented Ship Detection in SAR Images. Remote Sens. 2024, 16, 4760. [Google Scholar] [CrossRef]
  32. Zhang, Z.; Zhao, Q.; Li, X.; Wang, C.; Zhu, G.; Zhang, Y.; Huo, Y.; Yu, H.; Zhang, Y. CA-YOLO: Cross Attention Empowered YOLO for Biomimetic Localization. IEEE Trans. Circuits Syst. Video Technol. 2025. Early Access. [Google Scholar] [CrossRef]
  33. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  34. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar]
  35. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1483–1498. [Google Scholar] [CrossRef]
  36. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  37. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
  38. Tan, M.; Le, Q.V. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  39. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  40. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection. arXiv 2019, arXiv:1912.02424. [Google Scholar]
  41. Wang, Y.; Wang, B.; Fan, Y. PPGS-YOLO: A lightweight algorithms for offshore dense obstruction infrared ship detection. Infrared Phys. Technol. 2025, 145, 105736. [Google Scholar] [CrossRef]
  42. Liu, Y.; Li, C.; Fu, G. PJ-YOLO: Prior-Knowledge and Joint-Feature-Extraction Based YOLO for Infrared Ship Detection. J. Mar. Sci. Eng. 2025, 13, 226. [Google Scholar] [CrossRef]
  43. Guo, L.; Wang, Y.; Guo, M.; Zhou, X. YOLO-IRS: Infrared Ship Detection Algorithm Based on Self-Attention Mechanism and KAN in Complex Marine Background. Remote Sens. 2025, 17, 20. [Google Scholar] [CrossRef]
  44. Jiang, J.; Zhang, L.; Liu, K.; Yan, W.; Wang, M. Infrared ship target detection method based on task alignment learning. Syst. Eng. Electron. 2025, 47, 34–40. [Google Scholar]
  45. Zhan, W.; Zhang, C.; Guo, S.; Guo, J.; Shi, M. EGISD-YOLO: Edge Guidance Network for Infrared Ship Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10097–10107. [Google Scholar] [CrossRef]
  46. Mo, W.; Pei, J. Moving ships detection via the trajectory feature extraction from spatiotemporal slices of infrared maritime videos. Infrared Phys. Technol. 2024, 143, 105591. [Google Scholar] [CrossRef]
  47. Deng, H.; Zhang, Y. FMR-YOLO: Infrared Ship Rotating Target Detection Based on Synthetic Fog and Multiscale Weighted Feature Fusion. IEEE Trans. Instrum. Meas. 2024, 73, 1–17. [Google Scholar] [CrossRef]
  48. Yuan, J.; Cai, Z.; Wang, S.; Kong, X. A Multitype Feature Perception and Refined Network for Spaceborne Infrared Ship Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the EISD-YOLO network model.
Figure 2. Structure of the PEMA block module, * denotes the convolution operation.
Figure 3. Comparison diagram of feature visualization between PEMA-C3k2 and C3k2.
Figure 4. Structure diagram of the DS_ADNet module.
Figure 5. Comparison diagram of feature visualization between DS_ADNet and Conv, (a) DS_ADNet Feature Visualization; (b) Ordinary Convolution Feature Visualization.
Figure 6. Structural diagram of the DCNv2 module.
Figure 7. Structure diagram of the DyTAHead module.
Figure 8. The category distribution of inshore ships in the IRShip dataset.
Figure 9. The training processes of different networks in the ablation experiment: (a) precision curve; (b) recall curve; (c) mAP@0.50 curve; (d) mAP@0.50:0.95 curve.
Figure 10. Ablation experiment PR curve comparison.
Figure 11. Detection results of ships on the IRShip test set. (a) False Detection Correction; (b) False Detection Correction; (c) Detection Accuracy Improvement; (d) Missed Detection Correction; (e) Detection Accuracy Improvement; (f) Missed Detection Correction; (g) Multi-target Missed Detection Correction; (h) Detection Accuracy Improvement; (i) Detection Accuracy Improvement; (j) False Detection Correction; (k) Detection Accuracy Improvement; (l) Detection Accuracy Improvement; (m) Detection Accuracy Improvement; (n) Missed Detection Correction.
Figure 12. The training processes of different C3k2 backbone networks in the comparison experiment: (a) precision curve; (b) recall curve; (c) mAP@0.50 curve; (d) mAP@0.50:0.95 curve.
Figure 13. The training processes of different detection heads in the comparison experiment: (a) precision curve; (b) recall curve; (c) mAP@0.50 curve; (d) mAP@0.50:0.95 curve.
Figure 14. Comparison experiment results.
Table 1. Comparison of ablation experiments on the EMA attention mechanism.

Methods | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%)
ours | 90.9 | 83.3 | 90.7 | 62.0
Without EMA | 91.3 | 81.9 | 89.8 | 61.4
Table 2. Analysis of ablation experiments on core components of DyTAHead.

Methods | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%) | Params (M) | FLOPs (G)
YOLO11-DyTAHead | 91.5 | 85.7 | 91.5 | 63.8 | 2.200204 | 7.9
Without DCNv2 | 90.8 | 84.0 | 91.1 | 62.7 | 2.200204 | 7.3
Without ClsProb | 91.9 | 82.6 | 90.9 | 62.6 | 2.200204 | 7.8
Without TaskDecomposition | 92.0 | 83.7 | 90.9 | 62.6 | 2.216716 | 8.1
Table 3. Training parameter settings of the experiment.

Name | Configuration
Baseline Model | YOLOv11n
Batch Size | 32
Data Enhancement | Mosaic
Learning Rate | 0.01
Input Resolution | 640 × 640
Table 4. Ablation experiments on IRShip.

Dataset | PEMA-C3k2 | DS_ADNet | DyTAHead | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%) | Model Size (MB) | Params (M) | FLOPs (G)
IRShip | – | – | – | 89.6 | 82.9 | 89.8 | 62.0 | 10.1 | 2.5823470 | 6.3
IRShip | ✓ | – | – | 91.0 | 83.2 | 90.5 | 62.4 | 9.0 | 2.2929790 | 5.9
IRShip | – | ✓ | – | 90.8 | 80.9 | 89.2 | 61.4 | 7.7 | 1.9607790 | 5.0
IRShip | – | – | ✓ | 91.5 | 85.7 | 91.5 | 63.8 | 9.1 | 2.2200204 | 7.9
IRShip | ✓ | ✓ | – | 91.7 | 81.1 | 89.6 | 61.3 | 6.8 | 1.6714110 | 4.5
IRShip | ✓ | – | ✓ | 90.9 | 84.0 | 91.2 | 62.7 | 8.3 | 1.9769160 | 7.2
IRShip | – | ✓ | ✓ | 91.8 | 84.0 | 90.8 | 63.4 | 7.3 | 1.7507960 | 6.7
IRShip | ✓ | ✓ | ✓ | 90.8 | 84.1 | 91.0 | 62.7 | 5.7 | 1.3211720 | 5.8
Table 5. Comparative experiments for the improvement of the C3k2 module.

Methods | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%) | Params (M) | FLOPs (G)
YOLOv11n | 89.6 | 82.9 | 89.8 | 62.0 | 2.582347 | 6.3
YOLOv11-C3k2-WTConv | 91.6 | 82.6 | 90.5 | 62.9 | 2.521291 | 6.2
YOLOv11-C3k2-Faster | 91.3 | 81.9 | 89.8 | 61.4 | 2.288195 | 5.8
YOLOv11-C3k2-OREPA | 91.1 | 81.4 | 89.9 | 61.3 | 2.582347 | 6.3
YOLOv11-C3k2-AdditiveBlock | 91.3 | 83.1 | 90.6 | 62.3 | 2.625507 | 6.6
YOLOv11-C3k2-GhostDynamicConv | 90.6 | 80.6 | 89.5 | 61.0 | 2.226827 | 5.4
YOLOv11-C3k2-DAttention | 90.6 | 82.9 | 90.5 | 62.1 | 2.617355 | 6.3
YOLOv11-C3k2-Star-CAA | 92.1 | 81.1 | 90.2 | 62.2 | 3.033891 | 8.1
YOLOv11-PEMA-C3k2 | 90.9 | 83.3 | 90.7 | 62.0 | 2.290115 | 5.8
Table 6. Comparative experiments on different detection head improvements.

Methods | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%) | Params (M) | FLOPs (G)
YOLOv11n | 89.6 | 82.9 | 89.8 | 62.0 | 2.582347 | 6.3
YOLOv11-Dyhead | 90.9 | 83.1 | 90.7 | 62.4 | 3.099207 | 7.4
YOLOv11-EfficientHead | 91.0 | 83.2 | 90.3 | 62.2 | 2.312139 | 5.1
YOLOv11-LSCD | 91.0 | 84.1 | 90.4 | 62.2 | 2.420492 | 5.6
YOLOv11-LADH | 91.1 | 82.2 | 90.1 | 61.7 | 2.281547 | 5.2
YOLOv11-LSDECD | 91.2 | 83.2 | 90.6 | 62.5 | 2.260300 | 6.0
YOLOv11-LSCSBD | 88.4 | 81.7 | 89.3 | 62.2 | 2.457228 | 6.2
YOLOv11-DyTAHead | 91.5 | 85.7 | 91.5 | 63.8 | 2.2200204 | 7.9
Table 7. Comparative experiments on IRShip.

Models | Framework | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%) | Params (M) | FLOPs (G)
Cascade-RCNN [35] | ResNet50+FPN | 90.4 | 75.4 | 85.9 | 64.5 | 69.15200 | 121.856
Dab-DETR [36] | Transformer | 88.4 | 82.7 | 82.0 | 45.7 | 43.70200 | 43.099
DETR [37] | Transformer | 85.5 | 76.4 | 81.7 | 52.5 | 36.81900 | 38.107
Faster-RCNN [11] | ResNet50+FPN | 90.2 | 73.2 | 84.7 | 62.4 | 41.34800 | 90.898
EfficientNet [38] | Efficientnet-b3 | 83.2 | 85.9 | 87.8 | 56.0 | 18.33900 | 54.230
TOOD [39] | ResNet50 | 87.1 | 79.8 | 87.6 | 59.1 | 32.01800 | 78.837
ATSS [40] | ResNet50 | 90.6 | 79.2 | 88.0 | 63.2 | 32.11300 | 80.475
YOLOv12n | YOLOv12n | 88.4 | 81.0 | 88.1 | 59.7 | 2.518971 | 5.900
YOLOv11n | YOLOv11n | 89.6 | 82.9 | 89.8 | 62.0 | 2.582347 | 6.300
YOLOv8n | YOLOv8n | 91.0 | 83.4 | 90.0 | 62.1 | 3.005843 | 8.100
YOLOv5n | YOLOv5n | 89.0 | 77.0 | 86.8 | 58.7 | 1.760518 | 4.100
YOLOv3-tiny | DarkNet-53 | 79.3 | 70.7 | 77.0 | 45.9 | 61.52400 | 77.449
EISD-YOLO | YOLOv11n | 90.8 | 84.1 | 91.0 | 62.7 | 1.321172 | 5.800
Table 8. Comparative experiment results on the IRShip dataset.

Dataset | Models | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.50:0.95 (%)
InfiRay Public Dataset | YOLOv11n | 90.0 | 81.5 | 89.3 | 58.3
InfiRay Public Dataset | EISD-YOLO | 90.6 | 83.7 | 91.1 | 60.9
Shandong University Maritime Dataset | YOLOv11n | 99.9 | 99.9 | 99.5 | 94.2
Shandong University Maritime Dataset | EISD-YOLO | 99.9 | 99.9 | 99.5 | 94.1
Field-Collected Dataset | YOLOv11n | 75.5 | 89.9 | 89.9 | 66.1
Field-Collected Dataset | EISD-YOLO | 74.3 | 89.7 | 91.6 | 66.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
