Ship Target Detection Method Based on Feature Fusion and Bi-Level Routing Attention

Zuo, Danfeng; Qi, Liang; Ni, Hao; Song, Song; Li, Haifeng; Wang, Xinwen

doi:10.3390/sym18050729

by Danfeng Zuo ¹, Liang Qi ¹,²,*, Hao Ni ¹, Song Song ¹, Haifeng Li ¹ and Xinwen Wang ¹

¹ School of Automation, Jiangsu University of Science and Technology, Zhenjiang 212100, China

² Jiangsu Shipbuilding and Ocean Engineering Design and Research Institute, Zhenjiang 212100, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(5), 729; https://doi.org/10.3390/sym18050729

Submission received: 18 March 2026 / Revised: 20 April 2026 / Accepted: 22 April 2026 / Published: 24 April 2026

(This article belongs to the Section Computer)

Abstract

Ship target detection is a prerequisite for achieving automated monitoring in ship detection systems. To address the challenge of accurately detecting ship targets in complex water environments, this study proposes a ship target detection method based on an improved YOLOv11 framework. To enhance the model’s ability to perceive and fuse features across multiple scales and in complex backgrounds, an Iterative Attention Feature Fusion (iAFF) module and a BiFormer module are integrated at the end of the backbone network. The iAFF module iteratively optimizes multi-scale features through a two-stage attention mechanism, effectively focusing on key target regions, thereby improving the model’s detection capability for small, medium-sized, and occluded ships. The BiFormer module leverages its Bi-level Routing Attention (BRA) mechanism to enhance the modeling of global semantic information while reducing computational complexity, mitigating false detections caused by occlusions among ship targets, and consequently improving detection precision. This study employs the Minimum Point Distance Intersection over Union (MPDIoU) loss function, which more comprehensively measures the similarity between predicted and ground-truth bounding boxes by optimizing the distances of their key geometric points, effectively enhancing the accuracy of bounding box regression. Experimental results show that the proposed model achieved 93.96% mAP, 92.93% recall, and 94.97% precision on a self-built ship dataset, surpassing mainstream detection algorithms including YOLOv11 in multiple metrics. The model has only 2.90 M parameters, achieving a good balance between accuracy and efficiency. This provides an accurate and efficient solution for intelligent ship supervision.

Keywords: ship target detection; YOLOv11; feature fusion

1. Introduction

Ships, as the core carriers of water transportation, are crucial for ensuring maritime transport and safeguarding national maritime rights and interests. With the deepening integration of global trade, waterway transportation networks are becoming increasingly dense, posing unprecedented complex challenges to water safety supervision and navigation security. Building an intelligent waterborne ship detection system has become an urgent need to enhance water governance capabilities.

Given the continuous development and innovation of science and technology, artificial intelligence and automatic control technologies in water monitoring systems are also being increasingly applied. As governments worldwide pay more attention to ship safety, regulatory agencies’ requirements for ship monitoring are becoming higher, which is also driving ship supervision towards intelligence and automation. The premise of efficient and intelligent supervision is the accurate detection of ship targets. Ship detection technology plays an irreplaceable core role in preventing traffic accidents and ensuring waterway safety. Facing complex and variable waterway environments, frequent accidents such as ship collisions and groundings not only cause significant economic losses and ecological damage but may also threaten human safety [1]. With the expansion of global shipping scale and marine resource development, ship detection faces multiple challenges. Complex and variable environmental interference requires detection algorithms to possess strong environmental adaptability and multi-scale target capture capability [2].

Current ship detection technologies primarily focus on synthetic aperture radar (SAR), infrared imaging, and visible-light images [3]. SAR images offer extensive coverage and are less susceptible to interference from lighting and weather conditions, making them suitable for long-distance ship detection. However, this type of imagery cannot capture the color and texture of ships, which limits fine-grained identification [4]. Infrared imaging holds advantages in nighttime, low-light, and harsh weather conditions, enabling ship detection through temperature differences, but its detail representation is limited [5]. Visible-light images are prioritized in many applications due to their high resolution, rich detail representation, cost advantages, and mature technology. Although affected by lighting and weather, visible-light images can realistically reproduce a ship’s color, contours, and texture, providing precise monitoring and identification features. This makes them particularly suitable for scenarios such as coastal surveillance and inland waterway traffic management [6].

Object detection is an important branch of computer vision and image processing. According to the development trajectory of object detection technology [7], it can be classified into two main categories: traditional detection algorithms and deep learning-based object detection algorithms. For a long time, traditional object detection algorithms, which are implemented based on background modeling, have been relatively simple to execute and require relatively few resources. As a result, they once became a hot research topic among scholars both domestically and internationally. Some algorithms with strong robustness and relatively high portability have also supported ship recognition and tracking tasks, such as background subtraction [8], frame difference method [9], and optical flow method [10]. However, this type of moving target detection for fixed backgrounds cannot adapt to ship recognition and tracking tasks with changing backgrounds. In recent years, deep learning technology, due to its excellent feature extraction capabilities in image data, has been widely applied to various computer vision tasks [11].

Visible-light images hold significant value in object detection applications due to their advantages of intuitive visualization and low cost. Such images typically possess high spatial resolution, providing rich local detail information, which is beneficial for subsequent analysis and understanding. In ship detection tasks, traditional methods are primarily based on visual saliency or perceptual principles. Models such as the Itti model [12], FT model [13], SR model [14], AC model [13], and GBVS model [15] are widely used for extracting ship regions from marine backgrounds. Simultaneously, the Histogram of Oriented Gradients (HOG) descriptor is also a commonly used feature representation method [16]. However, the performance of such methods is generally limited in complex background environments, and their detection performance is often unsatisfactory. In contrast, deep learning methods based on Convolutional Neural Networks (CNNs) have demonstrated significant advantages in visible-light image ship detection. They achieve fine-grained characterization of targets through their powerful model representation capabilities, thereby significantly improving detection accuracy.

Object detection algorithms can be divided into two categories: traditional machine learning algorithms and deep learning-based algorithms. Traditional methods utilize manually designed features, which rely on prior knowledge or assumptions, leading to relatively poor robustness to uncertainty. On the other hand, deep learning-based algorithms use CNNs to learn image features, thereby providing higher detection speed and accuracy. Deep learning object detection algorithms can be further divided into multi-stage algorithms and single-stage algorithms. Algorithms such as R-CNN [17], Fast R-CNN [18], and Faster R-CNN [19] represent the multi-stage approach [20]. This method uses pre-generated image candidate regions to effectively extract region features and filter out background noise, thus providing higher detection accuracy. In contrast, single-stage methods directly extract features from the entire image to predict the location and category of objects, offering faster inference speeds. Common single-stage methods include the YOLO series [21], SSD [22], DSSD [23], and FSSD [24]. The real-time object detection algorithm YOLO performs classification while simultaneously predicting the bounding boxes of detected objects, integrating multiple steps through a single neural network.

Integrating deep learning algorithms into ship detection tasks has significantly enhanced ship localization capability and supervision efficiency. To address detection challenges in complex environments, multiple studies have proposed targeted technical improvements: Lim [25] proposed a target detection algorithm based on context and attention mechanisms. This algorithm pays more attention to capturing small targets in images and incorporates contextual information at the target layer, improving the detection performance of small targets under certain conditions. Wang [26] improved the original neck structure in YOLOv5 by adopting a top–down and bottom–up weighted Bidirectional Feature Pyramid Network (BiFPN), which enhanced feature extraction capability and solved the problem of significant target scale variation in the dataset. Zhang [27] introduced an Enhanced Feature Extraction (EFE) module into the backbone network of YOLOv5 and integrated a Receptive Field Block (RFB) module to better capture global information and rich contextual information, achieving the fusion of different features. Li [28] improved the backbone network of YOLOv8 by using a multi-head attention mechanism, which enhanced the network’s ability to extract diverse features. Tian [29] proposed a Multi-scale, Multi-level Enhanced Feature (MMEF) representation method to improve the accuracy of ship target recognition in complex environments. Zhao [30] designed a fused attention mechanism, which significantly enhanced the model’s performance on multi-scale datasets. Xie [31] proposed an LBA-MCNet network that integrates localization, balance, and affinity modeling. It enhances edge localization through the EFABA module and models global context with the GDAL module. The method outperforms 28 state-of-the-art approaches on three public remote sensing datasets, significantly improving salient object detection performance. 
To address the challenge of landslide information extraction being susceptible to interference from bare land and vegetation in complex backgrounds, Xie [32] proposed a dual-branch multi-scale context feature extraction network based on context association characteristics. By effectively integrating multi-scale contextual information through the TMCFM module and a deeply supervised classifier, the method significantly improves the IoU accuracy for landslide extraction. To address the challenges in cross-view person search, such as susceptibility to occlusions, dense crowds, and environmental variations, Zhu [33] proposed a cross-view intelligent person search method based on multi-feature constraints. By constructing the Global-Local Context-Aware module and the Semantic Complementarity and Feature Aggregation module, the method imposes feature constraints across multiple dimensions, including spatial, identity, and detection confidence, significantly improving search accuracy. To address the challenges in concrete crack detection under complex environments, such as susceptibility to background interference and blurred edges, Song [34] proposed a visual attention-guided detection method. The approach enhances global structural modeling, edge alignment accuracy, and background suppression capability through three core modules, i.e., EGCConv, DyAdSam, and VAF, achieving state-of-the-art detection performance on multiple public datasets.

In response to the challenges of visible-light image quality degradation and target feature blurring under complex weather conditions, researchers in recent years have proposed a variety of targeted object detection algorithms to enhance the robustness of models in adverse environments. Zhang [35] introduces a specialized Ship Detector for maritime environments, which leverages realistic simulated adverse weather point cloud data and a dual-branch sparse convolution network to address weather robustness and geometric feature preservation in marine ship detection. Liu [36] proposed a novel image-adaptive YOLO (IA-YOLO) framework. Through the joint training of a differentiable image processing module and a convolutional neural network parameter predictor, this framework achieves adaptive enhancement of images under varying weather conditions, thereby effectively improving object detection performance in adverse weather environments. Hu [37] proposed a dual-teacher feature alignment framework, which guides object detection under adverse weather conditions by leveraging clear image features rather than pixel-level alignment, significantly improving the model’s detection performance across multiple complex weather scenarios while maintaining inference efficiency. Zhang [38] proposed a novel object detector that tackles performance degradation in adverse weather by introducing an Independence Learning Module and a Domain-Aware Module, which, respectively, decouple spurious correlations at the instance-level and image-level, demonstrating robust effectiveness across diverse benchmark datasets. Ogino [39] proposed the ERUP-YOLO framework, which utilizes differentiable image preprocessing and domain-agnostic data augmentation to achieve superior object detection performance over conventional methods across various adverse weather conditions. 
Liu [40] proposed YOLO-FOD, a lightweight real-time detection model for adverse weather conditions, which integrates the OSDBBELAN module, SPD-DSC structure, GISM-DSC module, and F-EIoU loss function to significantly improve detection accuracy in complex weather environments while maintaining high computational efficiency. The aforementioned methods still exhibit certain limitations. On the one hand, they heavily rely on high-quality clear-weather data during model training and feature alignment. On the other hand, the integration of multiple modules significantly increases computational costs and model complexity, which may hinder their practical deployment on resource-constrained edge devices.

To address the key challenges in ship detection, such as interference from complex backgrounds, significant scale variations, and occlusions, this study proposes and validates a lightweight module integration and optimization scheme tailored for ship detection in complex water environments. We integrate the iAFF module, the BiFormer vision transformer with Bi-level Routing Attention (BRA), and the MPDIoU loss function into the YOLOv11 framework. This integration is not a simple combination of modules but is based on a systematic breakdown of the challenges: the iAFF module enhances feature representation for multi-scale and occluded targets, the BiFormer module strengthens global contextual awareness to suppress interference from complex backgrounds, and the MPDIoU loss function improves bounding box regression accuracy for ship targets, which often exhibit specific aspect ratios. Together, these three components form a cohesive system that addresses the core difficulties in ship detection. Experiments on a self-built dataset demonstrate that this scheme achieves a high detection accuracy of 93.96% mAP with a lightweight architecture of only 2.90 M parameters, significantly outperforming the original YOLOv11 and several mainstream detectors. With its high accuracy, efficiency, and lightweight architecture, this work offers a potential technical pathway towards real-time intelligent ship monitoring on edge computing platforms. The principal contributions of this work are as follows:

(1): Constructed a Representative Self-built Ship Dataset: The study focuses on the East China Sea and the lower reaches of the Yangtze River, establishing a custom ship dataset comprising approximately ten thousand images. This dataset covers nine target categories, including cargo ships, cruise ships, military vessels, and sailboats. It encompasses typical ship types and complex navigation scenarios, featuring high-density traffic and variable backgrounds.
(2): Proposed a Lightweight and Efficient Network Model Based on an Improved YOLOv11 Framework: Using YOLOv11 as the baseline model, the study introduces three core enhancements. The proposed framework assigns each module distinct and complementary functions, forming a progressive optimization pipeline. The iAFF module acts as a feature optimizer, deployed at the end of the backbone network. It dynamically selects and fuses multi-scale features through its iterative attention mechanism. Its core function is to enhance the discriminative features of targets, especially for small-scale and occluded objects, while effectively suppressing background noise interference. This process provides a more discriminative primary feature representation for subsequent processing stages. The BiFormer module serves as a context modeler, receiving features preprocessed by iAFF. Its adopted BRA mechanism can efficiently capture long-range dependencies and global semantic relationships between features with relatively low computational complexity. Building upon the optimized features provided by iAFF, this module can deeply analyze scene context information. For example, it accurately distinguishes real ship targets from interferences such as waves and reflections, or understands the spatial relationships among densely arranged ships, thereby significantly reducing the false detection rate caused by complex backgrounds and inter-target occlusion. The preceding feature optimization by iAFF effectively enhances the allocation efficiency of BiFormer’s attention resources. The MPDIoU loss function functions as a geometry optimizer, operating at the network output layer during the training phase. Based on the high-quality semantic and contextual features jointly learned by iAFF and BiFormer, this function directly optimizes the bounding box regression process by minimizing the Euclidean distance between the key corners of the predicted bounding boxes and the ground truth boxes. 
This design, sensitive to geometric properties, is particularly suitable for the elongated shapes common to ships, enabling higher-precision pixel-level localization. In summary, iAFF and BiFormer collaboratively improve the model’s reliability in target recognition and preliminary localization, while MPDIoU further ensures the geometric accuracy of bounding box regression.
(3): Achieved Comprehensive Performance Improvements on the Custom Dataset: The experimental results demonstrate that the improved model proposed in this paper surpasses mainstream detection algorithms across several key metrics. Specifically, on the custom dataset, the model achieves 93.96% mAP, 92.93% Recall, and 94.97% Precision, with a parameter count of merely 2.90 M, striking a good balance between accuracy and efficiency. Ablation studies further validate the effectiveness of each proposed module.

2. Materials and Methods

2.1. Network Framework

The overall network architecture of this study is shown in Figure 1. This network uses YOLOv11 as the baseline model, replaces the original simple concatenation module in the network with the iAFF module, and integrates the C2PSA and BiFormer structures at the end of the backbone network. The iAFF module enhances the feature representation and localization capability for multi-scale ship targets through its iterative cross-layer feature screening and fusion mechanism. This is particularly beneficial for detecting small, medium-sized, and occluded ships under wave interference and cluttered background conditions. The BiFormer module is introduced to model long-range dependencies and extract contextual information, aiming to enhance the perception of global semantic relationships among ship targets. Its design objective is to achieve improved detection accuracy and robustness while maintaining inference efficiency.

2.2. Iterative Attentional Feature Fusion

In ship target detection, the precise detection and feature fusion of multi-scale ship targets in complex marine environments pose a significant challenge. Marine scenes often suffer from background interference, large variations in target size, and complex conditions such as lighting and occlusion. Traditional methods struggle to adaptively distinguish key information during feature fusion. The iterative attentional feature fusion (iAFF) module addresses this by employing a dual-stage attention mechanism that iteratively optimizes the multi-scale feature fusion process. This enables the model to dynamically weight and focus on salient features at different scales, enhancing the perception capability for small and occluded targets, thereby improving detection accuracy and robustness. The structural diagram of the iAFF module is shown in Figure 2 below.

As illustrated in Figure 2a, the iAFF module processes two input feature maps, X and Y, in parallel. Initially, the input features are preliminarily fused, after which the MS-CAM is employed to generate an attention weight map for adaptively weighting the fused features. The module further utilizes an iterative mechanism to progressively refine the feature fusion process, effectively addressing inconsistencies that may arise when integrating features of different scales or semantic levels. From the perspective of structural symmetry, the processing paths for X and Y in the iAFF module exhibit mirror symmetry, with the generation, weighting, and fusion of the attention mask demonstrating clear symmetrical design in both vertical and horizontal orientations.

As shown in Figure 2b, the adopted MS-CAM module also displays distinct symmetrical characteristics. It processes the input feature Z through dual parallel branches: one extracts the global channel context g(Z), while the other captures local detailed features L(Z). The processing flows of the two branches are highly symmetrical in both structure and operation type, differing only slightly in the initial processing steps. Finally, g(Z) and L(Z) are summed and activated by a Sigmoid function to produce an attention weight map M(Z) with values ranging from 0 to 1, thereby dynamically enhancing important features and suppressing redundant information.

In the specific context of ship target detection, the contribution of MS-CAM is particularly crucial. Confronted with challenges such as significant scale variations among ship targets, mutual occlusion, and strong background interference in complex maritime environments, this module enables the model to focus more on the critical discriminative features of small targets and occluded regions when fusing features from different levels. As the core component of the iAFF module, MS-CAM is repeatedly invoked during iterative optimization, continuously guiding the fusion process to prioritize the most valuable feature information. This significantly enhances the model’s ability to represent features of multi-scale and occluded targets, ultimately improving the robustness and accuracy of detection.
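The two-stage fusion described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each MS-CAM branch is reduced to a single 1 × 1 (pointwise) channel-mixing layer with no bottleneck or batch normalization, feature maps are stored as `feat[channel][position]` lists, and the names `ms_cam`, `fuse`, and `iaff` are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def pointwise(weights, feat):
    """1x1 convolution: mix channels independently at every spatial position."""
    positions = len(feat[0])
    return [
        [sum(row[c] * feat[c][p] for c in range(len(feat))) for p in range(positions)]
        for row in weights
    ]


def ms_cam(z, w_local, w_global):
    """Attention map M(Z) = sigmoid(L(Z) + g(Z)), with values in (0, 1)."""
    local = pointwise(w_local, z)                    # L(Z): per-position detail
    pooled = [[sum(ch) / len(ch)] for ch in z]       # global average pooling
    glob = pointwise(w_global, pooled)               # g(Z): one value per channel
    return [
        [sigmoid(local[c][p] + glob[c][0]) for p in range(len(z[0]))]
        for c in range(len(z))
    ]


def fuse(x, y, mask):
    """Soft selection between X and Y: M * X + (1 - M) * Y."""
    return [
        [m * a + (1.0 - m) * b for m, a, b in zip(m_row, x_row, y_row)]
        for m_row, x_row, y_row in zip(mask, x, y)
    ]


def iaff(x, y, stage1, stage2):
    """Two-stage (iterative) attentional fusion of feature maps X and Y.

    stage1 and stage2 are (w_local, w_global) pairs of C x C weight matrices.
    """
    summed = [[a + b for a, b in zip(x_row, y_row)] for x_row, y_row in zip(x, y)]
    initial = fuse(x, y, ms_cam(summed, *stage1))    # preliminary fusion
    return fuse(x, y, ms_cam(initial, *stage2))      # refined fusion
```

Because the mask lies strictly between 0 and 1, every output value is a convex combination of the corresponding X and Y values; with all-zero weights the mask is 0.5 everywhere and the module reduces to a plain average of the two inputs.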

2.3. BiFormer

The core innovation of BiFormer lies in its introduction of the BRA mechanism. This mechanism significantly improves computational efficiency and enhances the model’s content awareness through a hierarchical processing strategy. Unlike traditional attention mechanisms that perform exhaustive token-to-token pairwise interactions across all spatial positions, BiFormer adopts a more intelligent routing approach. It first filters key-value pairs at a coarse region level, rapidly eliminating a large amount of irrelevant background information, thereby retaining only a few valuable candidate regions. Subsequently, the model performs fine-grained token-to-token attention computation on these candidate regions, ensuring that attention resources are concentrated in the most informative areas. This hierarchical routing mechanism not only greatly reduces computational complexity and memory consumption but also makes the attention operation more flexible and dynamically content-adaptive. These characteristics are designed to facilitate accurate and efficient detection in complex scenarios.

The structural diagram of the BRA is shown in Figure 3 below. First, the input image $I \in \mathbb{R}^{H \times W \times C}$ undergoes an overlapping patch embedding layer, followed by initial feature transformation. The resulting feature map is then divided into N × N non-overlapping regions and reorganized into a regional token sequence, represented as the tensor $I^{r} \in \mathbb{R}^{N^{2} \times \frac{HW}{N^{2}} \times C}$. Here, $I^{r}$ effectively serves as the foundational token representation for the subsequent attention mechanism. Subsequently, the Query (Q), Key (K), and Value (V) matrices are derived by applying linear projections to this regional token sequence:

Q = I^{r} W^{q}, \quad K = I^{r} W^{k}, \quad V = I^{r} W^{v}    (1)

where $W^{q}, W^{k}, W^{v} \in \mathbb{R}^{C \times C}$ are the projection weights for query, key, and value, respectively.
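The coarse-to-fine routing built on Equation (1) can be illustrated with a small pure-Python sketch. It follows the general BRA recipe as we understand it from the BiFormer literature (region descriptors obtained by averaging tokens, top-k region selection, then token-level attention over the gathered regions) and is not the authors' code. It assumes a single attention head and omits any local context enhancement term; the function names and the `tokens[region][token][channel]` layout are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(tokens, weight):
    """Apply a C x C linear projection to every token (Equation (1))."""
    cols = range(len(weight[0]))
    return [
        [[sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in cols] for tok in region]
        for region in tokens
    ]


def region_mean(tokens):
    """Average the tokens of each region into one coarse descriptor."""
    channels = range(len(tokens[0][0]))
    return [[sum(tok[c] for tok in region) / len(region) for c in channels] for region in tokens]


def route(q, k, top_k):
    """Coarse level: for each query region, keep the top_k most related regions."""
    q_r, k_r = region_mean(q), region_mean(k)
    routes = []
    for q_vec in q_r:
        scores = [dot(q_vec, k_vec) for k_vec in k_r]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routes.append(ranked[:top_k])
    return routes


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bra(tokens, w_q, w_k, w_v, top_k):
    """Fine level: token-to-token attention restricted to the routed regions."""
    q, k, v = project(tokens, w_q), project(tokens, w_k), project(tokens, w_v)
    routes = route(q, k, top_k)
    scale = 1.0 / math.sqrt(len(q[0][0]))
    output = []
    for r, region in enumerate(q):
        keys = [tok for j in routes[r] for tok in k[j]]      # gathered keys
        values = [tok for j in routes[r] for tok in v[j]]    # gathered values
        region_out = []
        for query in region:
            attn = softmax([dot(query, key) * scale for key in keys])
            region_out.append(
                [sum(a * val[c] for a, val in zip(attn, values)) for c in range(len(values[0]))]
            )
        output.append(region_out)
    return output, routes
```

Each query token attends to `top_k` times the tokens-per-region count rather than to every token in the feature map, which is where the reduction in computation relative to full attention comes from.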

The BiFormer attention mechanism employs BRA as its fundamental building block and introduces a four-level pyramid structure, as illustrated in Figure 4. In this architecture, the initial stage first utilizes an overlapping patch embedding method. Subsequently, patch merging modules are applied in the second to fourth stages to reduce spatial resolution while concurrently increasing the number of channels. Next, a series of n consecutive BiFormer blocks are introduced to transform the features. This design aims to comprehensively leverage overlapping patch embedding, patch merging modules, and BiFormer blocks across multiple stages, effectively capturing multi-scale, multi-channel feature information and enhancing the model’s abstraction and representation capabilities for input data.

Within each BiFormer block, a 3 × 3 depthwise convolution is employed to implicitly encode relative positional information. Subsequently, a BRA module with an expansion ratio of e and a two-layer Multi-Layer Perceptron (MLP) module are sequentially applied. The BRA module is used to model cross-position relationships, while the two-layer MLP handles per-position embedding. To achieve a lightweight design, BiFormer adopts a parameter-sharing strategy, meaning that multiple positions or channels share the same set of weights. This parameter sharing effectively reduces the number of model parameters, making it more compact and facilitating deployment on resource-constrained devices.

As shown in Figure 4, the structure of the BiFormer module exhibits clear hierarchical symmetry and intra-module symmetrical design. Its basic constituent units are consistent, reflecting a symmetrical pattern of repeated structures within each stage. The operations at each layer are distributed symmetrically in an up–down manner, and both the attention and feedforward network modules maintain symmetrical arrangements in terms of their positions and connectivity patterns.

Given the emphasis on lightweight design in the proposed model, the BiFormer-T variant is deliberately adopted. Its architecture features a channel width of 64 and a block configuration of [2, 2, 8, 2], resulting in a total of 14 BiFormer blocks. To adapt to the requirements of object detection, the region partition factor N is set to 16. Additionally, the number of routing regions k is configured as 1, 4, 16, and N² across the four successive stages, respectively.

2.4. MPDIoU Loss Function

In the context of bounding box regression, widely employed loss functions encompass DIoU, CIoU, and SIoU. DIoU optimizes only the center-point distance while neglecting the shape of the bounding box. Although the aspect ratios of ships are concentrated overall, the diversity in their specific values implies that the model needs to learn to regress a variety of width and height values. DIoU’s insensitivity to changes in bounding box shape hinders its ability to handle this regression task effectively—where aspect ratios, though generally consistent, still exhibit variation—potentially leading to imprecise shape matching in predicted boxes. CIoU builds upon DIoU by incorporating a penalty term for aspect ratio consistency. However, its measure of aspect ratio is ambiguous and may fail when bounding boxes share the same aspect ratio but differ in actual size. SIoU introduces an angular loss term, which operates under the core assumption that the angle between the line connecting the centers of the target and predicted boxes and the horizontal axis should be small. This mechanism works well for targets that are close to square, as the direction of the centerline remains stable. For ships, which typically exhibit a stable deviation from a 1:1 aspect ratio and are elongated in shape, even slight shifts in the bounding box can cause significant variation in the orientation of the centerline. This makes the angular penalty overly sensitive and unstable in such cases.

Therefore, traditional IoU loss functions cannot effectively distinguish between two bounding boxes that have the same overlapping area but different relative positions, or that share the same aspect ratio but have completely different actual widths and heights. This limitation reduces the convergence speed and accuracy of bounding box regression. The MPDIoU loss function addresses these shortcomings. Figure 5 below shows the schematic diagram of MPDIoU. It measures the similarity between the predicted bounding box and the ground truth bounding box by minimizing the distance between their top-left and bottom-right corner points. This approach comprehensively considers overlap area, center point distance, and deviations in width and height, enabling it to effectively handle cases where the predicted box and the ground truth box share the same aspect ratio but have different width and height values.

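The formula for MPDIoU is defined as shown in Equations (2)–(5), where $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ denote the top-left and bottom-right corners of the predicted (prd) and ground-truth (gt) boxes, and w and h denote the width and height of the input image, following the original MPDIoU formulation.

d_{1}^{2} = (x_{1}^{prd} - x_{1}^{gt})^{2} + (y_{1}^{prd} - y_{1}^{gt})^{2}    (2)

d_{2}^{2} = (x_{2}^{prd} - x_{2}^{gt})^{2} + (y_{2}^{prd} - y_{2}^{gt})^{2}    (3)

MPDIoU = IoU - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}}    (4)

MPDIoU_{Loss} = 1 - MPDIoU    (5)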

Equation (5) shows that the MPDIoU loss decreases linearly as the MPDIoU measure increases. Thus, a higher MPDIoU yields a lower loss, meaning the model’s predicted bounding box is closer to the ground truth box.
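Equations (2)–(5) translate directly into code. The sketch below is a plain-Python illustration rather than the training implementation used in the paper; boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixel coordinates, and `img_w` and `img_h` stand for the normalizing width and height in Equation (4), taken here to be the input image size.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mpdiou(pred, gt, img_w, img_h):
    """Equation (4): IoU penalized by the normalized squared corner distances."""
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2   # top-left, Eq. (2)
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2   # bottom-right, Eq. (3)
    norm = img_w ** 2 + img_h ** 2
    return iou(pred, gt) - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred, gt, img_w, img_h):
    """Equation (5)."""
    return 1.0 - mpdiou(pred, gt, img_w, img_h)
```

As a quick check, take a ground truth box `(10, 10, 50, 30)` in a 100 × 100 image and two predictions, `(20, 15, 40, 25)` and `(10, 10, 30, 20)`. Both lie fully inside the ground truth and have the same area, so plain IoU assigns them the same score, while `mpdiou` returns different values because their corner offsets differ.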

3. Results

3.1. Experiment Settings

The main experimental configuration is as follows: an Intel(R) Core(TM) i5-9300H CPU @ 2.40 GHz, 8 GB of RAM, and an NVIDIA GeForce GTX 1660 Ti GPU. The software environment includes the 64-bit Windows 10 operating system, the PyTorch 1.5.0 deep learning framework, CUDA 12.4, the cuDNN 7.6.5 acceleration package, PyCharm 2020.3.2, and GDAL library 3.8.4. For training, this study employs the Adaptive Moment Estimation algorithm (Adam) [41] to update the network parameters. The Adam algorithm calculates estimates of the first moment, which is the mean, and the second moment, which is the uncentered variance, of the gradients. This process adapts an independent learning rate for each parameter. Its core concept combines the advantages of both the Momentum and the RMSProp optimization methods. This enables excellent performance when dealing with sparse gradients or non-stationary objective functions and facilitates efficient convergence. In the training configuration of this study, the first moment decay coefficient is set to 0.9, the second moment decay coefficient to 0.999, and the numerical stability term ϵ to 10⁻⁸. The initial learning rate is set to 1 × 10⁻³ and decayed following a cosine annealing schedule. The model was trained for 200 epochs with a batch size of 16. During training, we applied a standard set of data augmentation techniques including Mosaic [42], random horizontal flipping, random rotation (±10 degrees), and adjustments to hue, saturation, and value to improve generalization. The characteristic of Adam to dynamically adjust the step size for each parameter allows for the rapid and stable optimization of our proposed lightweight ship detection model. It effectively addresses the optimization challenges posed by complex backgrounds and targets of varying scales, serving as a crucial technical guarantee for the model to achieve a balance between high accuracy and efficiency. All remaining parameters were set to the default values specified in YOLOv11. 
The final model for testing was selected based on the highest mAP achieved on the separate validation set. To guarantee a fair and reproducible comparison, every baseline model was both trained and evaluated using identical hardware and software environments, in strict accordance with a consistent training scheme. This unified scheme, encompassing the identical number of epochs, optimizer configuration, learning rate schedule, batch size, and data augmentation pipeline detailed above, has been shown to effectively facilitate superior convergence in lightweight object detection networks.
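To make the optimizer settings above concrete, the following minimal sketch writes out one Adam update and a cosine-annealed learning rate in plain Python. It is an illustration under stated assumptions rather than the training code used in this study: the function names `cosine_lr` and `adam_step` are hypothetical, the minimum learning rate of 0 is assumed (the text does not state it), and the schedule is assumed to be stepped once per epoch.

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate for a given epoch.

    lr_max and total_epochs follow the text; lr_min = 0 is an assumption.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment (uncentered variance) estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First update of a toy parameter: the step size is close to the learning rate
# because m_hat / sqrt(v_hat) is approximately sign(grad) at t = 1.
p, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=cosine_lr(0))
```

In this sketch the learning rate starts at 1 × 10⁻³, passes through half that value at epoch 100, and reaches the assumed minimum at epoch 200.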

3.2. Datasets

This study selected the East China Sea and the lower reaches of the Yangtze River basin in China as the study areas, since both contain typical ship types and navigation scenarios. A custom ship dataset comprising approximately 10,000 images was created. The East China Sea is one of China’s busiest water traffic areas, covering international shipping lanes, fishing grounds, and port clusters, with a sea area of about 770,000 square kilometers. The region includes various ship types such as merchant vessels, fishing boats, cruise ships, and engineering vessels, and is characterized by high navigation density and complex environments. The lower reaches of the Yangtze River basin, which serve as the golden waterway for inland river shipping, have a total channel length exceeding 1000 km, with world-class ports like Shanghai Port and Ningbo-Zhoushan Port along its banks. The primary ship types in this area include inland river cargo ships, container ships, tugboats, and passenger ferries, as shown in Figure 6. These two regions feature a rich variety of ship types, diverse navigation scenarios, and dense spatial distribution.

Data collection employed a multi-platform, multi-temporal approach over an 11-month period. Fixed coastal monitoring stations at Shanghai Port, Ningbo-Zhoushan Port, and key locations along the Yangtze River conducted continuous imaging. Mobile collection platforms included three research vessels equipped with stabilized camera systems that performed regular patrols in the East China Sea. Temporal coverage spanned all four seasons and different times of the day to capture variations in lighting and atmospheric conditions. Additionally, real ship images captured from multiple perspectives under different weather conditions were incorporated. Image resolutions ranged from 650 × 650 to 1920 × 1080 pixels. The dataset was annotated using the Customized Annotation and Verification Tool (CAVT) [43], specifically developed for maritime object recognition tasks. CAVT is a dedicated annotation tool designed to assist in vision tasks, enhancing the efficiency and accuracy of image and video data annotation. It supports various types of annotation tasks, including object detection, image segmentation, and keypoint labeling, enabling researchers and developers to prepare high-quality annotated data for training deep learning models. This significantly improves the quality and efficiency of data annotation for large-scale image and video datasets.

Due to the generally weakened visual contrast between ship targets and the background under complex weather conditions, this study adopts the Contrast Limited Adaptive Histogram Equalization (CLAHE) [44] technique for image enhancement processing. This approach aims to improve the identifiability of ship targets in various meteorological environments, enhance feature representation, and thereby provide more discriminative visual inputs for subsequent detection models. CLAHE is a preprocessing technique used for image enhancement. It works by dividing the image into several local regions and independently performing histogram equalization on each, while limiting the contrast increase in each region. This effectively enhances the local details and overall contrast of the image, while avoiding the noise amplification and over-enhancement issues that may arise with traditional histogram equalization. In computer vision tasks, CLAHE is commonly used to process images with uneven illumination or low contrast (such as surveillance videos), as it can improve image quality, enhance texture features, and provide clearer input data for subsequent target detection, segmentation, or classification models. The results of CLAHE image enhancement are shown in Figure 7.
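The contrast-limiting idea described above can be sketched as follows. This is a simplified illustration, not the preprocessing code used in this study: it applies the clip-and-redistribute step to a single tile only, whereas full CLAHE computes one mapping per tile and interpolates between neighboring tiles. The function name `clipped_equalize` is hypothetical, and the clip limit is expressed here as an absolute pixel count per histogram bin, which is an assumption of this sketch.

```python
def clipped_equalize(tile, clip_limit=4, levels=256):
    """Contrast-limited histogram equalization of ONE tile.

    tile: list of rows of integer intensities in [0, levels - 1].
    clip_limit: maximum count allowed per histogram bin (assumed absolute count).
    """
    pixels = [p for row in tile for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Clip every bin at clip_limit and spread the clipped excess uniformly,
    # which bounds the slope of the mapping and hence the contrast gain.
    excess = sum(max(h - clip_limit, 0) for h in hist)
    hist = [min(h, clip_limit) + excess / levels for h in hist]
    # The cumulative distribution of the clipped histogram gives the lookup table.
    total = float(len(pixels))
    lut, running = [], 0.0
    for h in hist:
        running += h
        lut.append(round((levels - 1) * running / total))
    return [[lut[p] for p in row] for row in tile]

# A low-contrast 2 x 2 tile is stretched over a much wider intensity range.
out = clipped_equalize([[100, 101], [102, 103]], clip_limit=1)
```

In practice a library routine would normally be used; OpenCV, for example, exposes `cv2.createCLAHE(clipLimit=..., tileGridSize=...)`, although the text does not state which implementation or parameters were applied here.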

The dataset covers nine categories of ship targets (cruise ship, freight boat, inflatable boat, sailboat, speedboat, motorboat, fishing boat, tugboat, and warship) and is divided into training, validation, and test sets in an 8:1:1 ratio. The dataset thus combines scene complexity, data diversity, and standardized annotation, which makes it suitable as a training and evaluation benchmark for ship target detection algorithms. The characteristics of the dataset are illustrated in Figure 8.

The dataset exhibits distinct and challenging characteristics that reflect real-world maritime scenarios. As illustrated in Figure 8, the bounding box size distribution (a) reveals a wide range of scales, indicating significant variations in the distances and sizes of ships, thereby posing a multi-scale detection challenge. The target location heatmap (b) shows that ships are predominantly concentrated near the center of the image with a tendency to cluster, suggesting high-density navigation areas and potential occlusion between targets. Furthermore, the aspect ratio distribution (c) demonstrates that most ships have elongated shapes, with an average aspect ratio of approximately 2.12 and a long-tailed distribution extending to higher values, which implies that conventional anchor designs may not adequately cover the diversity of ship geometries. Collectively, these features highlight the dataset’s complexity and its suitability for evaluating robust ship detection algorithms under realistic, cluttered, and scale-variant conditions.

In addition, this work utilizes the public SeaShips dataset [45], which was collected from a coastal surveillance system deployed around Hengqin Island in Zhuhai, China, and consists of 7000 high-resolution (1920 × 1080) images. The dataset covers six ship categories, i.e., ore carriers, bulk carriers, container ships, general cargo ships, fishing boats, and passenger ships, and has been divided into training, validation, and test sets in an 8:1:1 ratio. The images were captured across different seasons and time periods, encompassing varied lighting, weather, and sea conditions, thereby exhibiting strong scene diversity. However, some of the categories are visually similar to one another, and the sample sizes for some categories are relatively limited, which poses challenges for fine-grained classification by the model.
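Both datasets are partitioned 8:1:1 into training, validation, and test sets. A minimal sketch of such a partition is given below; the helper name `split_811`, the random shuffle, and the fixed seed are assumptions for illustration, since the text does not describe how images were assigned to each subset.

```python
import random

def split_811(items, seed=0):
    """Shuffle items and split them into train/val/test at an 8:1:1 ratio.

    The shuffle and the seed are assumptions; only the ratio comes from the text.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Image indices stand in for file names: about 10,000 custom images, 7000 SeaShips images.
custom_train, custom_val, custom_test = split_811(range(10000))
sea_train, sea_val, sea_test = split_811(range(7000))
```

With these sizes the sketch yields 8000/1000/1000 images for the custom dataset and 5600/700/700 for SeaShips.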

3.3. Evaluation Indexes

In binary classification, a True Positive (TP) is a sample that the model predicts as positive and that is actually positive; a False Positive (FP) is predicted as positive but is actually negative; a False Negative (FN) is predicted as negative but is actually positive; and a True Negative (TN) is predicted as negative and is actually negative. The evaluation metrics used here are Precision, Recall, and mean Average Precision (mAP). Precision is the proportion of correctly detected ships among all detections classified as ships. Recall is the proportion of actual ships that the model successfully detects. They are defined in Equations (6) and (7), respectively.

precision = \frac{TP}{TP + FP}

(6)

recall = \frac{TP}{TP + FN}

(7)
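Equations (6) and (7) translate directly into code. The short sketch below uses a hypothetical helper, `precision_recall`, and made-up counts chosen only to illustrate the arithmetic; they are not results from this study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, following Equations (6) and (7)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Illustrative counts only: 90 correct detections, 10 false alarms, 30 missed ships.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```

For these counts the precision is 90/100 and the recall is 90/120, which shows how a detector can produce few false alarms while still missing a quarter of the targets.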

The Precision–Recall (PR) curve is a curve plotted with Recall on the horizontal axis and Precision on the vertical axis. Its core function is to comprehensively reflect the model’s performance across all confidence thresholds. Average Precision (AP) is the area under the PR curve for a single class, with a value range of 0 to 1. The closer the AP is to 1, the better the detection performance for that class. mAP is the final score for the overall performance of the model, calculated as the arithmetic mean of the AP values for all classes, as shown in Equation (8).

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}

(8)

where N is the total number of categories in the detection task.
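As a sketch of how AP and Equation (8) can be evaluated, the code below accumulates precision and recall over detections ranked by confidence and integrates the PR curve. The function names are hypothetical, and the use of the monotone precision envelope (all-point interpolation) is an assumption, since the text does not state which interpolation rule was applied.

```python
def average_precision(is_tp, num_gt):
    """Area under the PR curve for one class.

    is_tp: True/False flags for detections sorted by descending confidence.
    num_gt: number of ground-truth objects of this class.
    """
    tp = fp = 0
    recalls, precisions = [], []
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall
    # (assumed all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """Arithmetic mean of per-class AP values, as in Equation (8)."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy example: three ranked detections for a class with two ground-truth ships.
ap_toy = average_precision([True, False, True], num_gt=2)
map_toy = mean_average_precision([ap_toy, 1.0])
```

Which matches count as true positives depends on an IoU threshold between predicted and ground-truth boxes; that matching step is outside the scope of this sketch.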

3.4. Comparative Experiment

3.4.1. Detection Performance

This study trained classic object detection algorithms on the custom training set and compared them with the algorithm proposed in this study. The results are shown in Table 1, and a visual comparison of some algorithms is presented in Figure 9.

As indicated in Table 1, the model proposed in this study performs favorably across several key metrics. It achieves an mAP of 93.96%, a recall of 92.93%, and a precision of 94.97%, outperforming the listed comparison algorithms, including Faster R-CNN, CenterNet, and recent YOLO variants, on these measures. Furthermore, the model maintains a relatively low parameter count of 2.90 M, which is comparable to or lower than that of the lightweight YOLOv8n and substantially lower than those of Faster R-CNN and CenterNet. These results suggest a promising balance between detection accuracy and model efficiency in the proposed design.

Meanwhile, experimental results indicate that the training time per epoch for the proposed algorithm is approximately 8–9 min, with a peak memory usage of about 4 GB. The inference time for a single image is around 15 ms, corresponding to an inference FPS of approximately 64.17. Compared to YOLOv10 (FPS: 54.95) and YOLOv11 (FPS: 59.22), the proposed algorithm demonstrates a noticeable improvement in inference speed, which verifies the effectiveness of the model’s lightweight design.

To statistically validate whether the performance improvement of the proposed model over the baseline model is significant, a paired t-test was conducted in this study. Under the same training and testing dataset splits, both the baseline model and the proposed model were independently trained and evaluated 10 times using different random seeds, and the mAP on the test set was recorded for each run. The null hypothesis of the paired t-test was that there is no significant difference in the mean mAP between the two models, with the significance level set at α = 0.05. In 10 independent experimental runs, the mean mAP of the proposed model was (93.96 ± 0.15)%, which is significantly higher than the (91.56 ± 0.10)% achieved by the baseline model. The test result showed a p-value of less than 0.001, which is well below the significance threshold. Therefore, we reject the null hypothesis at the α = 0.05 significance level, and statistical evidence suggests that the proposed model significantly outperforms the baseline model in terms of mean average precision.
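The paired t-test described above can be sketched with the standard library alone. The per-run mAP values below are hypothetical placeholders, not the measurements from this study, and the helper name `paired_t` is likewise an assumption. The decision is made against the two-sided critical value of Student's t distribution for 9 degrees of freedom at α = 0.05, which is 2.262.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1

# Hypothetical per-seed test mAP values (percent) for 10 runs; illustration only.
proposed = [93.8, 94.1, 93.9, 94.0, 94.2, 93.7, 93.9, 94.1, 94.0, 93.9]
baseline = [91.5, 91.6, 91.4, 91.7, 91.6, 91.5, 91.6, 91.5, 91.7, 91.5]

t_stat, dof = paired_t(proposed, baseline)
T_CRIT = 2.262  # two-sided critical value for alpha = 0.05 with 9 degrees of freedom
reject_null = abs(t_stat) > T_CRIT
```

Pairing the runs by seed removes seed-to-seed variation that is common to both models, which is why the test operates on the per-run differences. An exact p-value can be obtained from a statistics package, for example `scipy.stats.ttest_rel`.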

As shown in Figure 9, the comparative models in Figure 9a–e exhibit varying degrees of missed detections across different test scenarios, and the confidence scores for correctly detected targets are generally low. Simultaneously, evident false detection issues occur in scenes with complex backgrounds. Notably, although the YOLOv11 model in Figure 9f can largely achieve complete target detection, its detection confidence remains low, and the co-occurrence of missed and false detections persists, indicating that its detection stability under complex background conditions needs improvement. In contrast, the detection results of the improved model from this study, shown in Figure 9g, fully validate the effectiveness of the proposed method: the model successfully achieves accurate multi-target detection under complex background interference without generating any false detections, significantly enhancing detection confidence and localization accuracy. This thoroughly demonstrates the robustness and engineering application value of the model proposed in this study in complex waterway environments.

To systematically evaluate the robustness and generalization capability of the proposed algorithm, this study conducts comprehensive training and testing on the public SeaShips dataset, followed by a comparison with several representative object detection algorithms. The detailed performance comparison is summarized in Table 2.

Table 2 presents the experimental results of different methods on the SeaShips dataset. The results demonstrate that the proposed method outperforms most of the compared algorithms in terms of precision, recall, and mAP. Specifically, it achieves the highest scores in both recall and precision, while securing the second-highest mAP. Compared with the YOLOv11 baseline model, the proposed method exhibits an improvement of 1.25% in mAP, 1.48% in recall, and 1.04% in precision.

Although the last column of Figure 9 partially demonstrates the detection capability of the proposed algorithm in certain challenging scenarios (e.g., small and occluded targets), the visualization remains insufficient. Therefore, additional experiments under specified conditions have been conducted, and the corresponding visualization results are presented in Figure 10.

Figure 10 visualizes the detection results of the proposed model across multiple challenging maritime scenarios, further validating its robustness and generalization capability. The first row demonstrates the model’s precise detection ability for small targets (such as distant sailboats) in scenes with significant scale variations, effectively capturing their features. The second row illustrates the model’s stable detection performance for mid-to-long-range targets (e.g., the cargo ship in the image). Despite the targets occupying a small proportion of the image with limited details, the model can still accurately localize them. The third row tests the model’s performance in high-density, small-target cluster scenarios (such as densely moored sailboats), showing that the model can reasonably distinguish between adjacent targets. The fourth row presents the model’s capability in handling partially occluded targets. When ships are partially obscured by piers, other vessels, or waves, the model can still infer the complete presence of the target based on the visible parts and perform localization. Collectively, these visualization results indicate that the improved model proposed in this study can effectively address common challenges in practical maritime surveillance, including small targets, distant targets, dense targets, and occluded targets.

While the proposed method achieves notable performance improvements, the visualization results reveal remaining limitations in extremely complex scenarios, indicating room for further refinement. Specifically, in high-density small object scenes (row 3), the model exhibits instances of false-positive merging for adjacent, highly similar targets. This suggests that the current multi-scale feature fusion and context modeling mechanisms still have room for improvement in distinguishing subtle inter-instance differences when target density is exceptionally high. Furthermore, in severely occluded scenarios (row 4), some predicted bounding boxes show localization drift or size inaccuracies, reflecting that the robustness of the bounding box regression module requires strengthening when dealing with incomplete target features. These observations clearly indicate that, despite the progress made on primary performance metrics, the current approach needs further optimization for adaptation to extreme real-world conditions to achieve more comprehensive and reliable detection performance suitable for practical deployment.

3.4.2. Ablation Study

To verify the impact of each module in this study on the model’s ship target detection performance, five sets of ablation experiments were designed on the custom dataset. The comparison results are shown in Table 3.

Based on the ablation results in Table 3, each component of the proposed method contributes to detection performance. Progressively introducing the iAFF module, the BiFormer module, and the MPDIoU loss function on top of the baseline model yields continuous improvement.

First, the introduction of the iAFF module brings a significant initial performance gain to the model. As shown in Experiment 1 of Table 3, integrating the iAFF module alone increases the model’s mAP to 92.14%, an improvement of 0.58 percentage points over the baseline. The core mechanism of this module lies in its iterative attentional feature fusion process. Through its multi-scale channel attention (MS-CAM), it dynamically recalibrates and fuses features from different levels of the backbone network. In the context of ship detection, this effectively enhances the perception of small ships, occluded targets, and distant objects that typically have weaker feature responses, while initially suppressing background noise such as water ripples and light reflections. This provides a more discriminative primary feature representation for subsequent processing. The improvement in recall from 90.84% to 91.28% also corroborates its enhanced ability to capture targets, especially challenging instances.

Subsequently, introducing the BiFormer module on the features optimized by iAFF drives a further leap in model performance. The results of Experiment 2 show that the mAP reaches 92.89%, representing a 0.75 percentage point increase compared to the model using only iAFF. The contribution of the BiFormer module stems primarily from its BRA mechanism. It can efficiently model global semantic relationships across the image on the feature maps purified by iAFF, with relatively low computational cost. This is crucial for ship detection, as it allows the model to leverage contextual scene information to resolve ambiguities and better understand the spatial relationships among densely arranged vessels. This helps reduce false detections caused by complex backgrounds and close-range occlusion, which aligns with the significant improvement in precision from 82.76% in Experiment 1 to 93.64% in Experiment 2.

Furthermore, integrating the MPDIoU loss function into the training framework also brings considerable performance improvement. Using MPDIoU on the basis of iAFF (Experiment 3) increases the mAP by 0.47 percentage points to 92.61%. The innovation of MPDIoU lies in measuring similarity by minimizing the distance between the top-left and bottom-right corners of the predicted and ground-truth bounding boxes. This design enables it to more accurately guide the model in learning the specific aspect ratios and orientations common to ship targets, thereby directly optimizing the accuracy of bounding box regression and achieving tighter pixel-level localization.
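The corner-distance idea behind MPDIoU can be sketched as below. The code follows the commonly cited formulation, in which the squared distances between corresponding top-left and bottom-right corners are normalized by the squared diagonal of the input image and subtracted from the IoU. The function name `mpdiou_loss` and the box format (x1, y1, x2, y2) are assumptions of this sketch, and it is not the training code used in this study.

```python
def mpdiou_loss(pred, gt, img_w, img_h):
    """MPDIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    img_w and img_h are the input image width and height used for normalization.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # Intersection over union of the two boxes.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distances between matching corners.
    d1_sq = (px1 - gx1) ** 2 + (py1 - gy1) ** 2   # top-left corners
    d2_sq = (px2 - gx2) ** 2 + (py2 - gy2) ** 2   # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou

# A prediction shifted by (5, 5) pixels relative to the ground truth in a 100 x 100 image.
loss_shifted = mpdiou_loss((0, 0, 10, 10), (5, 5, 15, 15), img_w=100, img_h=100)
loss_perfect = mpdiou_loss((0, 0, 10, 10), (0, 0, 10, 10), img_w=100, img_h=100)
```

Because the two corner terms remain non-zero whenever the boxes differ, the loss still provides a gradient when the boxes do not overlap and the IoU term alone would be flat.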

Finally, integrating iAFF, BiFormer, and MPDIoU into a unified framework yields the best overall performance (Experiment 4). The mAP improvement of 2.4 percentage points over the baseline exceeds the sum of the gains attributed to the individual components above. This result suggests that, owing to their complementary design objectives, the modules reinforce one another when integrated: the optimized multi-scale features provided by iAFF may offer more favorable inputs for BiFormer to perform efficient global context modeling. Meanwhile, the enhanced feature discriminability and classification confidence brought by these two modules may give the MPDIoU loss function gradient signals more conducive to optimization, thereby further refining the accuracy of bounding box regression. This systematic and targeted combination of modules forms the intrinsic rationale for how our method balances accuracy and efficiency.
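The comparison between the combined gain and the individual gains can be checked with the mAP values quoted in this section. The sketch below is simple bookkeeping; the baseline value of 91.56% is the one reported in Section 3.4.1, and the BiFormer and MPDIoU gains are measured relative to the iAFF-only model, as in the text, rather than as standalone additions to the baseline.

```python
# mAP values (percent) quoted in the text for the ablation configurations.
map_baseline = 91.56        # baseline model
map_iaff = 92.14            # Experiment 1: baseline + iAFF
map_iaff_biformer = 92.89   # Experiment 2: + BiFormer
map_iaff_mpdiou = 92.61     # Experiment 3: iAFF + MPDIoU
map_full = 93.96            # Experiment 4: all three components

gain_iaff = map_iaff - map_baseline
gain_biformer = map_iaff_biformer - map_iaff
gain_mpdiou = map_iaff_mpdiou - map_iaff

sum_of_gains = gain_iaff + gain_biformer + gain_mpdiou   # about 1.80 points
total_gain = map_full - map_baseline                     # about 2.40 points
surplus = total_gain - sum_of_gains                      # about 0.60 points
```

Under this reading, roughly 0.6 percentage points of the total improvement are not accounted for by the individual steps, which is consistent with the interpretation that the components reinforce one another.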

According to the original training scheme, the model presented in this paper was trained on a self-made dataset using various loss functions, including GIoU, DIoU, CIoU, EIoU, and MPDIoU. The results are shown in Table 4.

The MPDIoU loss function achieved the best performance in the model, with an mAP of 93.96%, surpassing GIoU, CIoU, DIoU, and EIoU by 2.27, 2.09, 3.43, and 1.64 percentage points, respectively. In terms of recall, MPDIoU reached 92.93%, also outperforming the other loss functions, with corresponding margins of 2.12, 3.64, 1.58, and 2.52 percentage points. Additionally, its precision was 94.97%, exceeding that of GIoU, CIoU, and DIoU by 3.60, 1.33, and 5.21 percentage points, respectively. Taken together, these results indicate that MPDIoU is better suited to this model and task than the other loss functions compared.

4. Conclusions

This study proposes and validates a lightweight module integration and optimization scheme designed for complex aquatic environments, aiming to address the challenges that complex backgrounds pose for ship target detection. The scheme integrates three functionally complementary components into the YOLOv11 framework: the Iterative Attentional Feature Fusion (iAFF) module, the BiFormer module with its Bi-level Routing Attention (BRA) mechanism, and the Minimum Point Distance Intersection over Union (MPDIoU) loss function. Together they provide a phased yet integrated solution to key challenges such as multi-scale target representation under complex backgrounds, global context perception, and geometric precision in bounding box regression. The experiments indicate that this integrated design effectively leverages the strengths of each component.

This study proposes a solution for intelligent ship supervision in complex water surface scenarios. Experimental results demonstrate that the proposed model achieves 93.96% mAP, 92.93% recall, and 94.97% precision on a self-built dataset, with a parameter count of only 2.90 M. The proposed model demonstrates favorable performance in terms of accuracy, efficiency, and model size.

However, this study also has certain limitations:

(1): Although this study has achieved modular design and lightweight improvements for ship target detection at the algorithmic level, it has not yet evolved into a complete, robust, and real-time integrated system for ship target detection under complex weather conditions on water. Future research should focus on designing and implementing an end-to-end system that integrates weather-adaptive perception, real-time image enhancement, multi-target detection, and visual interaction. This system should feature a user-friendly graphical interface and be deployable directly in port monitoring centers, shipborne sensing units, or shore-based intelligent systems, providing real-time technical support for practical applications such as maritime traffic supervision, intelligent ship collision avoidance, autonomous navigation, and unmanned vessel docking. Although the current model has undergone certain lightweight optimizations in terms of parameter count and computational load, its inference efficiency, memory footprint, power consumption, and thermal performance on edge devices with strictly limited computational resources still require further fine-tuning. Targeted efforts in model pruning, quantization, compilation optimization, and hardware adaptation are needed, along with the establishment of a performance-versus-power trade-off assessment framework, to truly meet the demands of future engineering and large-scale deployment.
(2): Current research primarily focuses on the perception level of ship targets and has not yet deeply explored high-level semantic understanding. With the rapid development of vision-language multimodal large models, new possibilities have emerged for endowing ship intelligent perception systems with deeper cognitive and analytical capabilities. Future work should actively explore a deep integration paradigm between ship detection systems and multimodal large models. This can be achieved by fine-tuning general vision-language large models with domain-specific maritime data to construct domain-specific vision-language models equipped with maritime expertise, supplemented with a professional knowledge base covering ship types, navigation rules, maritime regulations, and risk case studies. Building on this foundation, the system will not only be capable of detecting ship targets but will also further identify nuanced ship states, behavioral patterns, and interactive relationships, while automatically generating structured semantic descriptive reports or multimodal risk warning information. This will fundamentally transform the operational mode of maritime monitoring systems, shifting from a traditionally passive, image perception and information listing-based monitoring approach to a comprehensive intelligent system that integrates active perception, deep understanding, intelligent decision-making, and forward-looking early warning. Ultimately, such a system will not only provide clear observation of maritime situations but also enable a profound understanding of their complex implications, allowing for the anticipation of future risks and thereby substantially enhancing the proactiveness and intelligence level of maritime safety supervision.

Author Contributions

Conceptualization, D.Z. and L.Q.; methodology, D.Z.; software, D.Z.; validation, D.Z., H.N., S.S., H.L. and X.W.; formal analysis, H.N. and S.S.; investigation, L.Q.; resources, H.L.; data curation, H.L.; writing—original draft preparation, D.Z.; writing—review and editing, L.Q.; visualization, X.W.; supervision, L.Q.; project administration, L.Q.; funding acquisition, L.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Jiangsu Provincial Department of Education, grant number SJCX24_2491.

Data Availability Statement

The SeaShips dataset used in this study is publicly available at https://github.com/jiaming-wang/SeaShips (accessed on 14 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Huan, Y.; Chen, L.; Liu, B.; Wang, W. Research on ship detection technology based on improved YOLOv5. In Proceedings of the 2023 7th International Conference on Machine Vision and Information Technology (CMVIT), Xiamen, China, 24–26 March 2023; IEEE: Piscataway, NJ, USA, 2023.
Song, W.; Yan, D.; Yan, J.; Hu, C.; Wu, W.; Wang, X. Ship detection and identification in SDGSAT-1 glimmer images based on the glimmer YOLO model. Int. J. Digit. Earth 2023, 16, 4687–4706.
Ezzeddini, L.; Affes, N.; Ktari, J.; Frikha, T.; Ben Halima, R.; Hamam, H. Smart Maritime Surveillance: Leveraging YOLO Detection and Blockchain traceability for Vessel Monitoring. J. Inf. Assur. Secur. 2025, 19, 233–248.
Lan, K.; Jiang, X.; Ding, X.; Lin, H.; Chan, S. High-Efficiency and High-Precision Ship Detection Algorithm Based on Improved YOLOv8n. Mathematics 2024, 12, 1072.
Wang, Y.; Zhang, S.; Xu, J.; Cheng, Z.; Du, G. YOLO-StarLS: A Ship Detection Algorithm Based on Wavelet Transform and Multi-Scale Feature Extraction for Complex Environments. Symmetry 2025, 17, 1116.
Hui, Z.F.; Li, P.L.; Shen, L.; Shen, H.; Sui, J.; Zhang, S. Research on Target Detection and Statistics Method for Fishing Port Vessels Entering and Leaving the Port Based on Improved YOLOv8. J. Dalian Ocean. Univ. 2024, 39, 498–505.
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
Zaurin, R.; Khuc, T.; Catbas, F.N. Hybrid Sensor-Camera Monitoring for Damage Detection: Case Study of a Real Bridge. J. Bridge Eng. 2016, 21, 934–942.
Srinivas, Y.; Ganivada, A. A modified inter-frame difference method for detection of moving objects in videos. Int. J. Inf. Technol. 2024, 17, 749–754.
Xin, J.; Cao, X.; Xiao, H.; Liu, T.; Liu, R.; Xin, Y. Infrared Small Target Detection Based on Multiscale Kurtosis Map Fusion and Optical Flow Method. Sensors 2023, 23, 1660.
Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020, 128, 261–318.
Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259.
Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
Hou, X.; Zhang, L. Saliency Detection: A Spectral Residual Approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007.
Schölkopf, B.; Platt, J.; Hofmann, T. Graph-Based Visual Saliency. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference; MIT Press: Cambridge, MA, USA, 2007; pp. 545–552.
Qi, S.; Ma, J.; Lin, J.; Li, Y.; Tian, J. Unsupervised Ship Detection Based on Saliency and S-HOG Descriptor from Optical Satellite Images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1451–1455.
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 91–99.
Xie, F.; Zhu, D.J. Survey on Deep Learning Object Detection. Comput. Syst. Appl. 2022, 31, 1–12.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision – ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37.
Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659.
Li, Z.; Yang, L.; Zhou, F. FSSD: Feature Fusion Single Shot Multibox Detector. arXiv 2017, arXiv:1712.00960.
Lim, J.-S.; Astrid, M.; Yoon, H.-J.; Lee, S.-I. Small Object Detection using Context and Attention. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAICIC), Jeju Island, Republic of Korea, 13–16 April 2021.
Wang, J.; Pan, Q.; Lu, D.; Zhang, Y. An Efficient Ship-Detection Algorithm Based on the Improved YOLOv5. Electronics 2023, 12, 3600.
Zhang, J.; Li, Y.; Wan, G.; Jiang, M.; Huang, Z.; Tao, X.; Chen, J.; Chu, D. Small Target Detection Algorithm for UAV Based on Improved YOLOv5. In Proceedings of the 8th International Conference on Signal and Image Processing, Wuxi, China, 8–10 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 78–82.
Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato Maturity Detection and Counting Model Based on MHSA-YOLOv8. Sensors 2023, 23, 6701.
Tian, Y.; Meng, H.; Yuan, F. Multiscale and Multilevel Enhanced Features for Ship Target Recognition in Complex Environments. IEEE Trans. Ind. Inform. 2024, 20, 4640–4650.
Zhao, L.; Ning, F.; Xi, Y.; Liang, G.; He, Z.; Zhang, Y. MSFA-YOLO: A Multi-Scale SAR Ship Detection Algorithm Based on Fused Attention. IEEE Access 2024, 12, 24554–24568.
Xie, Y.; Liu, S.; Chen, H.; Cao, S.; Zhang, H.; Feng, D.; Wan, Q.; Zhu, J.; Zhu, Q. Localization, balance and affinity: A stronger multifaceted collaborative salient object detector in remote sensing images. arXiv 2024, arXiv:2410.23991.
Xie, Y.; Zhan, N.; Zhu, J.; Xu, B.; Chen, H.; Mao, W.; Luo, X.; Hu, Y. Landslide Extraction from Aerial Imagery Considering Context Association Characteristics. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103950.
Zhu, J.; Zhang, J.; Chen, H.; Xie, Y.; Gu, H.; Lian, H. A Cross-View Intelligent Person Search Method Based on Multi-Feature Constraints. Int. J. Digit. Earth 2024, 17, 2346259.
Song, W.; Zhao, Y.; Tu, J.; Chen, M.; Xie, Y.; Cui, X. A Visual Attention-Guided Approach for Concrete Crack Detection in Complex Environments. Eng. Appl. Artif. Intell. 2026, 173, 114439.
Zhang, Q.; Wang, L.; Meng, H.; Zhang, Z.; Yang, C. Ship Detection in Maritime Scenes under Adverse Weather Conditions. Remote Sens. 2024, 16, 1567.
Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions. arXiv 2021, arXiv:2112.08088. [Google Scholar]
Hu, R.; Zheng, H.; Ye, S.; Qing, L.; Chen, H. A Lightweight Framework for Robust Object Detection in Adverse Weather Based on Dual-Teacher Feature Alignment. Neurocomputing 2026, 671, 132726. [Google Scholar] [CrossRef]
Zhang, Y.; Xuan, S.; Li, Z. Robust Object Detection in Adverse Weather with Feature Decorrelation via Independence Learning. Pattern Recognit. 2026, 169, 111790. [Google Scholar] [CrossRef]
Ogino, Y.; Shoji, Y.; Toizumi, T.; Ito, A. ERUP-YOLO: Enhancing Object Detection Robustness for Adverse Weather Condition by Unified Image-Adaptive Processing. arXiv 2024, arXiv:2411.02799. [Google Scholar]
Liu, Y.; Yuan, T.; Ren, A.; Kuo, Y.; Xiong, X. YOLO-FOD: Lightweight Object Detection Based on Multibranch and Multiscale Feature Fusion for Adverse Weather. Neurocomputing 2026, 659, 131778. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
Sekachev, B.; Zhavoronkov, A.; Manovich, N. Computer Vision Annotation Tool. 2019. Available online: https://github.com/opencv/cvat (accessed on 20 April 2024).
Mishra, A.; Gupta, M.; Sharma, P. Enhancement of Underwater Images Using Improved CLAHE. In Proceedings of the 2018 International Conference on Advanced Computation and Telecommunication (ICACAT), Bhopal, India, 28–29 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimed. 2018, 20, 2593–2604. [Google Scholar] [CrossRef]

Figure 1. The structure of the improved YOLOv11 network.

Figure 2. The structure of the iAFF module and MS-CAM module.

Figure 3. The structure of the BRA module.

Figure 4. The structure of the BiFormer module.

Figure 5. Schematic diagram of MPDIoU.

Figure 6. Sample images of the ship dataset. Note: Chinese characters visible in images are original timestamp watermarks from the dataset.

Figure 7. Results of CLAHE image enhancement. Note: Chinese characters visible in images are original timestamp watermarks from the dataset.

Figure 8. Statistical characteristics of the dataset. (a) Bounding box size distribution. (b) Heatmap of target locations. (c) Distribution of bounding box aspect ratios.

Figure 9. Comparison of model detection results.

Figure 10. Detection performance in challenging scenarios.

Table 1. Comparison of detection algorithms on the self-built ship dataset.

Method	mAP (%)	Recall (%)	Precision (%)	Parameters (M)	FLOPs (G)
Faster R-CNN	82.18	77.35	83.67	40.63	207
CenterNet	88.19	79.61	89.31	45.21	50
YOLOv5	88.32	83.66	84.97	7.04	24.0
YOLOv8	88.54	88.55	85.16	3.01	28.6
YOLOv10	90.13	89.35	91.36	2.72	21.6
YOLOv11	91.56	90.84	91.39	2.59	21.5
Ours	93.96	92.93	94.97	2.90	7.9
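
To make the margins in Table 1 explicit, the short sketch below computes the absolute differences between the proposed model and the YOLOv11 baseline. The numbers are copied from the table; the dictionary layout and the function name `gains` are illustrative only and do not come from the paper.

```python
# Values copied from Table 1; the names below are illustrative, not from the paper.
TABLE1 = {
    "YOLOv11": {"mAP": 91.56, "recall": 90.84, "precision": 91.39, "params_M": 2.59},
    "Ours": {"mAP": 93.96, "recall": 92.93, "precision": 94.97, "params_M": 2.90},
}


def gains(table, model="Ours", baseline="YOLOv11"):
    """Return model minus baseline for every column, rounded to two decimals."""
    return {
        key: round(table[model][key] - table[baseline][key], 2)
        for key in table[baseline]
    }


if __name__ == "__main__":
    for name, delta in gains(TABLE1).items():
        print(f"{name}: {delta:+.2f}")
```

With the Table 1 values this gives gains of 2.40, 2.09, and 3.58 percentage points in mAP, recall, and precision, at a cost of 0.31 M additional parameters.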

Table 2. Performance comparison of different algorithms on the SeaShips test set.

Method	mAP (%)	Recall (%)	Precision (%)
Faster R-CNN	89.68	88.34	89.96
CenterNet	90.17	90.04	92.17
YOLOv5	95.52	97.63	96.83
YOLOv8	97.82	96.35	97.61
YOLOv10	94.17	95.44	96.37
YOLOv11	95.68	96.34	96.61
Ours	96.93	97.82	97.89

Table 3. Ablation results for each module.

Method	iAFF	Biformer	MPDIoU	mAP (%)	Recall (%)	Precision (%)	Parameters (M)
Baseline	-	-	-	91.56	90.84	91.39	2.59
1	√	-	-	92.14	91.28	82.76	2.95
2	√	√	-	92.89	91.75	93.64	2.60
3	√	-	√	92.61	92.46	94.26	2.94
4	√	√	√	93.96	92.93	94.97	2.90

Table 4. Ablation study on different bounding box regression loss functions for the proposed model.

Method	mAP (%)	Recall (%)	Precision (%)
$L_{\mathrm{GIoU}}$	91.69	90.81	91.37
$L_{\mathrm{CIoU}}$	91.87	89.29	93.64
$L_{\mathrm{DIoU}}$	90.53	91.35	89.76
$L_{\mathrm{EIoU}}$	92.32	90.41	95.22
$L_{\mathrm{MPDIoU}}$	93.96	92.93	94.97
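
As a concrete reference for the MPDIoU row, the sketch below computes the loss for a single pair of boxes. It assumes the commonly cited MPDIoU formulation: IoU minus the squared distances between the top-left corners and between the bottom-right corners, each normalized by the squared diagonal of the input image. The `(x1, y1, x2, y2)` box layout, the `eps` term, and the function names are assumptions made for illustration, not the authors' implementation.

```python
def mpdiou(pred, gt, img_w, img_h, eps=1e-9):
    """MPDIoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumed form: IoU - d1^2 / (w^2 + h^2) - d2^2 / (w^2 + h^2), where d1 and d2
    are the distances between the top-left and the bottom-right corners, and
    w, h are the input image width and height.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h

    # Union area; eps guards against division by zero for degenerate boxes.
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # Squared corner distances, normalized by the squared image diagonal.
    d1_sq = (px1 - gx1) ** 2 + (py1 - gy1) ** 2
    d2_sq = (px2 - gx2) ** 2 + (py2 - gy2) ** 2
    norm = img_w ** 2 + img_h ** 2
    return iou - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred, gt, img_w, img_h):
    """Loss form used for regression: 1 - MPDIoU."""
    return 1.0 - mpdiou(pred, gt, img_w, img_h)
```

Under this formulation the corner-distance terms stay non-zero when the boxes do not overlap, so the loss still varies with the box position in that case, whereas plain IoU is constant at zero.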

