Article

YOLOv11-MSE: A Multi-Scale Dilated Attention-Enhanced Lightweight Network for Efficient Real-Time Underwater Target Detection

1 National Key Laboratory of Equipment State Sensing and Smart Support, Changsha 410073, China
2 College of Intelligent Science and Technology, National University of Defense Technology, Changsha 410073, China
3 Hunan Provincial Key Laboratory of Ultra-Precision Machining Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Mar. Sci. Eng. 2025, 13(10), 1843; https://doi.org/10.3390/jmse13101843
Submission received: 8 August 2025 / Revised: 2 September 2025 / Accepted: 19 September 2025 / Published: 23 September 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Underwater target detection is a critical technology for marine resource management and ecological protection, but its performance is often limited by complex underwater environments, including optical attenuation, scattering, and dense distributions of small targets. Existing methods have significant limitations in feature extraction efficiency, robustness in class-imbalanced scenarios, and computational complexity. To address these challenges, this study proposes a lightweight adaptive detection model, YOLOv11-MSE, which optimizes underwater detection performance through three core innovations. First, a multi-scale dilated attention (MSDA) mechanism is embedded into the backbone network to dynamically capture multi-scale contextual features while suppressing background noise. Second, a Slim-Neck architecture based on GSConv and VoV-GSCSPC modules is designed to achieve efficient feature fusion via hybrid convolution strategies, significantly reducing model complexity. Finally, an efficient multi-scale attention (EMA) module is introduced in the detection head to reinforce key feature representations and suppress environmental noise through cross-dimensional interactions. Experiments on the underwater detection dataset (UDD) demonstrate that YOLOv11-MSE outperforms the baseline model YOLOv11, achieving a 9.67% improvement in detection precision and a 3.45% increase in mean average precision (mAP50) while reducing computational complexity by 6.57%. Ablation studies further validate the synergistic optimization effects of each module, particularly in class-imbalanced scenarios where detection precision for rare categories (e.g., scallops) is significantly enhanced, with precision and mAP50 improving by 60.62% and 10.16%, respectively. This model provides an efficient solution for edge computing scenarios, such as underwater robots and ecological monitoring, through its lightweight design and high underwater target detection capability.

1. Introduction

Covering 71% of the Earth’s surface, the ocean plays a vital role in global sustainable development, economic growth, and national security [1]. According to the United Nations, marine ecosystems contribute $2.5 trillion annually to the global economy and support the livelihoods of 3 billion people. With the implementation of the United Nations Decade of Ocean Science for Sustainable Development (2021–2030), deep-sea observation network construction has become a focal area of international marine technological competition. The number of underwater robotic devices, serving as critical infrastructure, has experienced exponential growth over the past five years, with global deployments exceeding 15,000 units, supporting core applications such as seabed resource exploration, ecological monitoring, and disaster early warning (NOAA, 2024). Underwater target detection has emerged as a key technology for marine resource development and sustainable resource management. However, the 2024 Gulf of Mexico seabed facility leak accident highlighted significant deficiencies in existing detection technologies—the traditional system’s false-negative rate for targets smaller than 20 cm in diameter reached 43%, directly threatening marine environmental security. This underscores the strategic value of underwater object detection (UOD) technology in achieving Sustainable Development Goals (SDGs) and the urgency of technological innovation.
At present, underwater target detection methods are usually divided into two categories: traditional target detection methods and deep learning-based target detection methods [2]. In traditional target detection, sonar equipment (such as side-scan sonar and multi-beam sonar) is often used to obtain underwater target images, combined with optical image processing technology, and the target shape is analyzed from the echo signal. In recent years, the rapid development of deep learning has driven a paradigm shift in UOD, and a variety of machine learning and deep learning techniques have been applied to underwater real-time target detection. However, the complexity of the underwater environment poses a serious challenge to detection technology [3]. In real underwater environments, optical interference is relatively severe: light attenuation, scattering, and color distortion lead to low image contrast, making it difficult to distinguish targets from the background [4]. In addition, most marine organisms occupy less than 1.65% of the image area, and their dense distribution easily causes target overlap and missed detections. The dense distribution of small targets is also significant: UDD statistics show that 35% of targets are smaller than 32 × 32 pixels, and the spatial distribution density reaches 4.7 pixels/square meter [5]. Furthermore, edge devices such as underwater robots are limited by computing power and need to strike a balance between accuracy and efficiency. Existing algorithms, such as YOLOv8 and Faster R-CNN, are 35% to 45% less accurate in underwater scenes than in land applications, mainly because they are unable to handle low-contrast images and small targets [6]. For example, in low-light conditions, scallops, a key indicator of ocean health, are often misclassified as rocks or trash, leading to detection errors.
To cope with the challenges of a highly complex and changeable underwater environment, underwater target detection algorithms need strong adaptability. Researchers have proposed many image-processing algorithms to restore and enhance underwater images [2,7]. For underwater images, Fu et al. proposed a single-image enhancement method based on Retinex theory. This method is superior to traditional de-fogging algorithms and fusion methods in terms of color naturalness, contrast, and edge sharpness. However, due to its insufficient robustness in complex scenes and high computational complexity, it is difficult to meet real-time application requirements [8]. Yu et al. proposed the TR-YOLOv5s network together with a downsampling principle and introduced an attention mechanism into the method. Automatic target recognition (ATR) and target positioning were realized through the integration of the transformer module and YOLOv5s (TR-YOLOv5s), and the detection rate was significantly better than the original YOLOv5s, with a 12.5% mAP improvement [9]. Cai et al. used MobileNetv1 to replace the original Darknet-53 backbone network of YOLOv3. Through theoretical receptive field analysis and dynamic selection of feature maps to match fish targets of different sizes, the detection rate of small and medium targets was significantly improved, achieving 78.63% AP (a 3.06% increase) on real farm data with an inference speed of 76 ms/frame [10]. Liu et al. improved the YOLOv8 algorithm and introduced the MarineYOLO network, which significantly improved the accuracy of underwater small target detection on the RUOD and URPC datasets, increasing the average precision (AP) by 12.2% and 16.8%, respectively [11]. Wang et al. proposed a collaborative online target detection method composed of multiple autonomous underwater vehicles (multi-AUVs) equipped with side-scan sonar (SSS) sensors, which can achieve real-time and efficient underwater target detection and localization [12]. Chen et al. optimized the YOLOv8n architecture using a GAN to generate extended datasets, combined with CoTAttention and SEAM modules to optimize feature interaction and boundary fitting, and improved detection efficiency in complex underwater scenes [13]. Yang et al. proposed FishDet-YOLO to enhance gradient flow information and long-distance dependency modeling through an image enhancement module (UEM) and a Mamba-C2f structure, addressing complex underwater environments and fish diversity. However, the model has high computational complexity, and its adaptability to extreme environments still needs optimization [14]. More broadly, existing algorithms over-rely on the original quality of underwater images, and when enhancement techniques are overused, key details may be weakened or eliminated, which harms the accuracy of target recognition [15,16].
More recently, research has shifted to the development of specialized models and algorithms tailored specifically for the detection of underwater targets. The YOLOv8-CPG model proposed by Zhang et al. significantly reduces the complexity of the model through the CIB module, PSA attention mechanism, and Gold-YOLO neck structure. At the same time, the WIoU v3 loss function is introduced to balance the gradient gain between extreme samples and ordinary samples and improve the training stability. The model improved accuracy by 2.76%, recall rate by 2.06%, and mAP50-95 by 3.55% on the CoopKnowledge dataset. However, this model has insufficient detection performance for some categories (such as sea cucumbers), and the experiment relies on high-performance GPUs [17]. Feng and Jin proposed CEH-YOLO, a model that combines a higher-order deformable attention (HDA) module to enhance spatial feature extraction and interaction by prioritizing key regions in the model. In addition, the enhanced spatial pyramid fast pooling (ESPPF) module is integrated to enhance the extraction of target attributes (such as color and texture), with an AP50 of 66.6% on the TrashCan dataset, significantly better than Faster R-CNN (55.3%) and YOLOv7 (43.4%) [6]. CEH-YOLO has excellent detection performance in scenes with small or overlapping targets, but it is susceptible to complex background interference and false detection. The GCP-YOLO model, proposed by Gao et al., improves upon YOLOv7 by adopting GhostNetV2 instead of the CBS module, embedding Coordinate Attention (CA), and utilizing network pruning. This approach achieves a strong detection performance while cutting computational costs to 25.3 GFLOPs [18]. However, existing approaches either fail to fully address the unique challenges of the underwater environment or introduce significant computational overhead that hinders their practical deployment [19].
YOLOv11 achieved a 54.5% mAP (YOLOv11x) on the COCO dataset by introducing the C3k2 module and the C2PSA attention mechanism, with a 22% reduction in parameter count and a faster inference speed. However, its direct application to underwater object detection faces key limitations, for instance, performance degradation under complex optical interference, feature confusion among densely distributed small underwater targets, and excessive computational complexity for edge deployment. To overcome these challenges, the network structure of YOLOv11 was modified based on YOLOv11s to further improve underwater detection accuracy, and a novel model, YOLOv11-MSE, is proposed in this study. The model is optimized for underwater scenarios through three key innovations: (1) multi-scale dilated attention (MSDA) is introduced into the existing C2PSA module of the backbone network to enhance contextual feature extraction while minimizing computational redundancy; (2) a Slim-Neck is constructed, which reduces the complexity of the detection model and speeds up detection without losing accuracy; and (3) the detection head is enhanced by integrating the EMA module, which uses cross-dimensional interaction to capture pixel-level relationships and improves the detection of small underwater targets. The main contributions of this paper are as follows:
  • The MSDA mechanism is combined with the backbone and integrated into the C2PSA module of YOLOv11. This arrangement can help the model improve the detection ability of multi-scale targets, enhance the focus on key areas, suppress the interference of complex background information, and enhance multi-scale feature extraction.
  • GSConv and VoV-GSCSPC modules are introduced to build the Slim-Neck. By applying depthwise separable convolution (DWConv) and a fixed channel ratio, the efficiency of convolution operations is improved while the feature map information of different stages is fused more effectively. The detection model further reduces complexity without losing accuracy and speeds up detection, making it more suitable for practical detection tasks requiring fast and lightweight networks.
  • EMA is used to enhance the detection head. First, EMA is integrated; then, channel dimension and batch dimension are reorganized, interdimensional interaction is used to capture pixel-level relationships, and feature loss in lightweight design is compensated for through channel attention recalibration to improve rare small-target detection.
A large number of experiments on the underwater detection dataset (UDD) show that YOLOv11-MSE achieves a mean average precision (mAP50) of 68.9% at the 0.5 IoU threshold, which is 3.45% higher than YOLOv11, while reducing GFLOPs by 6.57%. These advances set a new benchmark for underwater target detection while achieving a balance between accuracy, efficiency, and adaptability. Section 1 summarizes the core challenges of underwater target detection and the limitations of mainstream methods (e.g., the YOLO series and traditional image enhancement) that motivated our work. A more systematic and in-depth analysis of specialized studies, including recent advances in underwater target detection and small target detection, is presented in Section 2.

2. Related Work

2.1. Underwater Target Detection

The core challenge of underwater target detection lies in the complex optical environment and target diversity. Early studies used image enhancement techniques, such as Retinex theory, to improve image quality, but they relied on hand-crafted features and lacked real-time performance [8]. In recent years, deep learning models have gradually become mainstream. Ni et al. developed a new deep learning-based network model, YOLO-D, to detect underwater targets in real time. It uses an adaptive kernel convolution module (AKConv) to dynamically adjust the sampling shape and size and optimize the detection of target features at various scales and regions. The integration of a context aggregation network (CONTAINER) and a bidirectional feature pyramid network (BiFPN) promotes effective cross-scale feature fusion and improves the detection accuracy of multi-scale and blurred targets [20]. Lu et al. proposed a new model, AquaYOLO, introducing a dynamic selection aggregation module (DSAM) and context-aware feature selection (CAFS) to optimize multi-scale feature fusion while retaining edge information through residual blocks in noisy and low-resolution sonar images. The model achieved an mAP50 of 94.9% on the UATD dataset [21]. Xu et al. combined the Mean Teacher semi-supervision strategy with YOLOv8 and proposed the SUD-YOLO framework. With only 10% of the data labeled on the DUO dataset, mAP reached 50.8%, which was 11.0% higher than that of fully supervised YOLOv8 [22]. Zhou et al. proposed spatial residual blocks and BSR5 backbone networks, redefined the network solution process, and introduced the SkipCut mechanism to optimize parameters and gradient distribution. Experiments showed that, compared with RT-DETR (ResNet-101), BSR5-DETR had an AP increase of 1.3–2.7% and a parameter reduction of 41.6–6.6%; the BSR5-YOLO series also performed well. However, this method still has shortcomings in improving the performance of large models and detecting small targets [23]. To address strong background interference and weak object perception in underwater detection, Xin Shen et al. proposed UCOEA, an unsupervised clustering-based efficient attention mechanism using K-means and EM algorithms, which improved mAP0.5 by 1.6% to 82.2% on YOLOv8-S and released the DLMU2024 dataset with 2500 images. However, the model remains sensitive to initialization and has relatively high computational complexity [24].
The collaborative optimization of image enhancement and detection algorithms has become a research hotspot in recent years. Zhuang et al. proposed an enhancement method based on the Bayesian Retinex framework to decompose the problem into an alternatingly optimized denoising subproblem through multi-step prior modeling. Experiments showed that the method achieved a UIQM of 4.47 (optimal) on 50 test images, and the number of recovered visible edges increased by 300% (such as in the “Turtle” image). Moreover, the processing speed (0.14 s) was significantly better than mainstream deep learning methods [25]. Sarkar et al. proposed the UICE-MIRNet model. The authors extracted low-color regions, corrected colors through the UI-CEB module, and used the FSFB and SCAB modules to optimize feature fusion and selection. On the Brackish and Trash-ICRA19 datasets, UIQM increased to 0.98 and 0.85, respectively, while mAP reached 84.41% and 83.09% [26]. Manimurugan et al. proposed the HLASwin-T-ACoat-Net model, combined with CLAHE image enhancement, the Swin Transformer local perception module (LAB), and improved Coat-Net. By optimizing multi-scale feature fusion and dynamic background adaptability, it effectively solved the degradation problem of underwater images [27]. Focused on poor underwater imaging conditions, Jun Wang et al. proposed B-YOLOX-S, a lightweight detector leveraging Poisson fusion and wavelet transform-based style transfer for data augmentation and image restoration, and combined it with a BIFFN-S feature fusion module and EIoU loss to address issues of small, overlapping targets and domain shift. The model improved mAP by 5.05% compared to YOLOX-S on URPC2020. However, conducting detection in highly complex waters remains challenging [28]. Wenling Wang et al. addressed the issues of image blurring and color shifting caused by water scattering in underwater object detection by proposing UDINO based on DINO, which incorporates a multi-scale high-frequency information enhancement module (MHFIEM) and a multi-scale gated channel information refinement module (MGCIRM). The model achieved AP scores of 62.0, 51.1, and 26.3 on RUOD, UODD, and UDD, respectively, significantly outperforming existing methods, though it faces limitations in computational efficiency and adaptability to extreme environments [29]. Although these methods are effective in specific scenarios, there are still some common problems, such as over-reliance on image enhancement technology, resulting in partial detail loss, high computational complexity, and difficulty in deploying edge devices.

2.2. Small Target Detection

Small target detection is another core problem in underwater scenes. Existing methods mainly improve performance through multi-scale feature fusion and attention mechanisms. For example, Zheng and Yu improved YOLOv7-tiny, designed an efficient reparameterized multi-scale fusion (RMF) module and a clustering and distribution feature pyramid network (GDFPN), and introduced ShaP-IOU to further improve the detection performance of small underwater targets [30]. By integrating the CSPSL module, VKConv, and SPPFMS, Li et al. applied the improved model to resource-constrained devices with only 3.0 M parameters, 8.2 G of computation, and a 10.9 ms inference time [31]. Liu et al. introduced the RepGhost module, which realizes parameter reuse; GFPN was introduced to enhance multi-scale feature fusion, and CLLAHead, combined with a cross-layer local attention mechanism, was used to optimize the detection head. Real-time performance on edge deployments reached 58 FPS. However, missed detections still occur in complex backgrounds and low-resolution scenes, so the robustness of feature extraction needs further optimization [32].
In addition, many researchers have optimized small target detection capability by improving the backbone network. Dynamic YOLO, proposed by Chen and Er, uses a lightweight backbone based on DCNv3 extended with a decoupled detection head, and optimizes small target detection through dynamic ReLU and deformable convolution to align the classification and localization tasks [33]. Zhang et al. proposed a lightweight underwater detection model based on YOLOv8, which reduced parameters by 22.52% through the FasterNet-T0 backbone network, reaching 53.18% AP on the UTDAC2020 dataset [34].

3. Proposed Method

3.1. Network Structure

In 2016, the YOLO (You Only Look Once) algorithm proposed by Redmon et al. overturned the traditional target detection paradigm and, for the first time, balanced real-time detection and high precision by transforming the detection task into a single-stage regression problem [35]. In 2024, YOLOv11 was officially unveiled at the YOLO Vision 2024 (YV24) conference, marking another major overhaul of the model series. As the newest member of the YOLO family, YOLOv11 uses Dynamic Sparse Convolution and a Hierarchical Decoupled Head architecture. It achieved an average precision (AP) of 68.9% on the MS COCO dataset and increased the inference speed to 165 frames per second (FPS), significantly better than previous-generation models such as YOLOv10 [36]. However, despite YOLOv11's excellent performance in general-purpose scenarios, its direct application to underwater target detection still faces significant challenges, such as performance degradation under complex optical interference, feature confusion among densely distributed small targets, high computational cost, and insufficient real-time performance.
These limitations highlight the need for customized improvements for the underwater environment. To address these problems, this study improves the YOLOv11 network structure in three respects. First, the MSDA mechanism is combined with the backbone, and the C2PSA_MSDA module is proposed to replace the original C2PSA module, which improves the detection of multi-scale targets and the overall detection and classification ability of the model. Second, considering that the introduction of an attention mechanism may make the model more complex, this study introduces the lightweight convolution GSConv based on the Slim-Neck concept to reduce computational complexity and constructs the VoV-GSCSPC module to effectively fuse the information of feature maps at different stages without loss of network accuracy. Third, the model is tailored to the underwater target detection task: EMA is fused with the detection head to reduce computational overhead while retaining the key information of each channel, improving the model's ability to process features. The improved model structure is shown in Figure 1.

3.2. C2PSA_MSDA Module

When optimizing underwater target detection models for complex environments, the adaptability of the model to targets of different scales becomes a core challenge, especially when the model must simultaneously identify targets with large scale differences, such as sea urchins, scallops, and various fish species. Multi-scale dilated attention exploits the sparsity of the self-attention mechanism at different scales and applies different dilation rates across heads, enlarging the receptive field and strengthening global modeling. This provides more powerful context understanding for underwater target detection and is especially suited to multi-scale target localization and classification in complex scenes. However, the introduction of a multi-scale attention mechanism may cause a quadratic increase in computational complexity and memory usage, which limits the practical applicability of the model to some extent. The multi-scale dilated attention (MSDA) mechanism allows standard Vision Transformers (ViTs) to pursue global context while reducing computational complexity [37]. Unlike general attention mechanisms, MSDA is a multi-scale dilated local attention mechanism. Its core idea is to split the channels of the feature map into multiple heads; each head processes a different feature subset in parallel with a different dilation rate, making full use of the sparsity of self-attention at different scales and thus reducing computational redundancy while maintaining performance. Its working principle is shown in Figure 2 below.
The main steps of MSDA include generating the query (Q), key (K), and value (V); calculating the attention score; and finally summing the values weighted according to the attention score. The core formula for the operation of MSDA is as follows:
The input feature map is $X \in \mathbb{R}^{B \times C \times H \times W}$, where B is the batch size, C is the number of channels, H is the feature map height, and W is the feature map width. The query Q, key K, and value V are generated by linear projection, as shown in Equation (1):
$$Q, K, V = \mathrm{Linear}(X)$$
Divide the channels into n independent heads, so that each head has $C/n$ channels, as shown in Equation (2):
$$Q_i, K_i, V_i = \mathrm{Split}(Q, K, V, \text{axis} = 1, \text{num\_splits} = n)$$
Perform the SWDA operation on each head using the dilation rate $r_i$ to extract the keys and values of the local region, as shown in Equations (3) and (4):
$$K_{r_i} = \mathrm{Unfold}(K_i, \text{kernel\_size} = 3, \text{dilation} = r_i)$$
$$V_{r_i} = \mathrm{Unfold}(V_i, \text{kernel\_size} = 3, \text{dilation} = r_i)$$
Calculate the attention scores and the weighted sum, as shown in Equations (5) and (6):
$$\mathrm{Attn}_i = \mathrm{Softmax}\!\left(\frac{Q_i K_{r_i}^{T}}{\sqrt{d}}\right)$$
$$h_i = \mathrm{Attn}_i V_{r_i}$$
where $d = C/n$ is the dimension of each head.
Finally, the outputs of all heads $\{h_i\}_{i=1}^{n}$ are concatenated, and the features are aggregated through a linear layer, as shown in Equation (7):
$$X_{\mathrm{MSDA}} = \mathrm{Linear}(\mathrm{Concat}(h_1, h_2, \ldots, h_n))$$
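To make the computation in Equations (1)–(7) concrete, the following minimal PyTorch-style sketch implements the per-head dilated local attention described above. It is an illustrative reimplementation under stated assumptions (the class and variable names such as MSDASketch are ours, not the authors' code); the dilation rates [1, 2, 3] match the setting used later in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSDASketch(nn.Module):
    """Simplified multi-scale dilated attention (Eqs. 1-7); illustrative only."""
    def __init__(self, channels, dilations=(1, 2, 3), kernel_size=3):
        super().__init__()
        assert channels % len(dilations) == 0
        self.dilations = dilations
        self.k = kernel_size
        self.head_dim = channels // len(dilations)
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)   # Eq. (1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)      # Eq. (7)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)                         # Eq. (1)
        heads = []
        for i, r in enumerate(self.dilations):                        # Eq. (2): channel split
            sl = slice(i * self.head_dim, (i + 1) * self.head_dim)
            qi, ki, vi = q[:, sl], k[:, sl], v[:, sl]
            d = self.head_dim
            pad = r * (self.k - 1) // 2
            # Eqs. (3)-(4): gather a 3x3 dilated neighbourhood for every pixel
            ku = F.unfold(ki, self.k, dilation=r, padding=pad)         # B, d*9, H*W
            vu = F.unfold(vi, self.k, dilation=r, padding=pad)
            ku = ku.view(B, d, self.k * self.k, H * W)
            vu = vu.view(B, d, self.k * self.k, H * W)
            qi = qi.reshape(B, d, 1, H * W)
            # Eq. (5): attention over the 9 dilated neighbours of each query pixel
            attn = (qi * ku).sum(dim=1, keepdim=True) / d ** 0.5       # B, 1, 9, H*W
            attn = attn.softmax(dim=2)
            hi = (attn * vu).sum(dim=2)                                # Eq. (6): B, d, H*W
            heads.append(hi.view(B, d, H, W))
        return self.proj(torch.cat(heads, dim=1))                      # Eq. (7)
```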
In this study, the MSDA mechanism is introduced into the backbone and the C2PSA_MSDA module is proposed, which effectively aggregates semantic information at different scales within the attention receptive field and reduces the redundancy of the self-attention mechanism without complicated operations or extra computational cost. The C2PSA_MSDA module structure is shown in Figure 3.
Following YOLOv11's feature pyramid module, the newly introduced C2PSA module aims to enhance spatial attention in feature maps, allowing the model to focus more effectively on key areas in the image. The C2PSA module enables YOLOv11 to focus on specific areas of interest through spatial feature aggregation, thereby improving the detection accuracy of targets of different sizes and positions. However, the traditional C2PSA module has limitations in adapting to targets of different scales. To solve this problem, the MSDA mechanism is introduced as a core component of the C2PSA_MSDA module. MSDA differs from the traditional attention mechanism by splitting the channels of the feature map into different heads and applying a different dilation rate and receptive field inside each head. The dilation rates of MSDA are set to [1, 2, 3] empirically, covering small, medium, and large receptive fields to capture multi-scale context information. This strategy enables the model to capture image features at multiple scales, which are then integrated and fed into a linear layer for feature aggregation. The design enables the model to understand the image at different scales, thereby improving the overall understanding of the image content. With this approach, MSDA can capture not only local details but also contextual information over a wider area, enhancing the expressiveness of the model.

3.3. Slim-Neck Structure

Although the introduction of an attention mechanism can significantly improve the model's ability to detect targets at different scales and focus on key areas, this operation is often accompanied by an increase in model complexity. To address this challenge and reduce the complexity of the model as much as possible without sacrificing detection accuracy, this study integrates GSConv (Gather-and-Scatter Convolution) into the neck structure of the model and adopts a one-shot aggregation strategy based on GSbottleneck, constructing a cross-stage partial network (GSCSP) called the VoV-GSCSP module. The resulting Slim-Neck, composed of GSConv and VoV-GSCSP, simplifies the calculation process and network architecture while maintaining high detection accuracy. The VoV-GSCSP module builds a path aggregation feature pyramid network to achieve deep integration of multi-scale information [38]. In the backbone of convolutional neural networks (CNNs), input images usually undergo a similar transformation process: spatial information is gradually transferred to channel information. In this process, the compression of the spatial dimension (width and height) of the feature map and the expansion of the channel dimension may lead to the loss of semantic information. Channel-dense convolution can preserve the hidden connections between channels to the maximum extent, while channel-sparse convolution completely splits these connections. GSConv is designed to maintain these connections as much as possible while having a low time complexity [39]. The calculation process can be decomposed as shown in Equations (8)–(10):
$$X_{\mathrm{DSC}} = \mathrm{DWConv}(X) * W_{pw}$$
$$X_{\mathrm{SC}} = \mathrm{Conv}(X)$$
$$\mathrm{GSConv} = \sigma(\lambda_1 X_{\mathrm{DSC}} + \lambda_2 X_{\mathrm{SC}})$$
Here, DSC denotes depthwise separable convolution, SC denotes standard convolution, DWConv denotes depthwise convolution, $W_{pw}$ is the pointwise convolution kernel, $\lambda_1$ and $\lambda_2$ are learned channel attention coefficients, and $\sigma$ is a Swish/Mish nonlinear activation function. This two-branch structure reduces the theoretical computation (FLOPs) by about 60% compared to standard convolution, while the SC branch compensates for the feature loss of the DSC branch.
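As a concrete illustration, a minimal PyTorch-style GSConv block might look as follows. It follows the standard-convolution plus depthwise branch with concatenation and channel shuffle (the flow shown in Figure 4 below); the learned weighted fusion of Equations (8)–(10) could replace the concatenation, but the concatenate-and-shuffle form is sketched here. The class name, the 5 × 5 depthwise kernel, and the SiLU activation are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """GSConv-style block (cf. Figure 4): standard conv, depthwise conv on its
    output, concatenation, then channel shuffle. Illustrative sketch only."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(                     # standard ("channel-dense") branch
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(                     # depthwise ("channel-sparse") branch
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.dense(x)
        y2 = self.cheap(y1)
        y = torch.cat((y1, y2), dim=1)                  # splice the two branches
        # channel shuffle: interleave dense and cheap channels
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```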
Figure 4 shows the data processing flow of GSConv in the convolutional neural network. The input feature map is first passed through a standard convolution layer, followed by a depthwise separable convolution layer. Finally, the outputs of these two convolutional layers are concatenated, and the concatenated feature maps are shuffled to optimize the feature representation across channels. The GSConv, GSbottleneck, and VoV-GSCSP modules are further examined in this study, and their structures are shown in Figure 4. The GSbottleneck module aims to enhance the network's ability to process features at bottlenecks, strengthening the model's learning capacity by stacking GSConv modules. The VoV-GSCSP module uses different structural design schemes to improve feature utilization efficiency and network performance. However, when the model is deployed on platforms with limited computing resources, such as underwater robots and underwater monitoring equipment, problems such as high memory consumption and slow computation arise. Therefore, the VoV-GSCSPC module is further developed and proposed in this study. Based on VoV-GSCSP, depthwise separable convolution (DWConv) and a fixed channel ratio are introduced, which significantly reduces the number of floating-point operations (FLOPs) and parameters; this lightweight characteristic provides an efficient solution for edge computing. The technique enables real-time target detection in resource-constrained underwater environments, expanding the deployment range of the YOLOv11 model. The calculation process is shown in Equations (11)–(13):
$$X_{\mathrm{split}} = \mathrm{Split}(X_{\mathrm{in}})$$
$$X_{\mathrm{bottleneck}} = \mathrm{Stack}\left[\mathrm{GSBottleneckC}\right]_{n}\left(X_{\mathrm{split}}^{(1)}\right)$$
$$X_{\mathrm{out}} = \mathrm{Concat}\left(X_{\mathrm{bottleneck}}, X_{\mathrm{split}}^{(2)}\right) * W_{\mathrm{fusion}}$$
The split operation divides the input channels into two parts, a main branch $X_{\mathrm{split}}^{(1)}$ and a second branch $X_{\mathrm{split}}^{(2)}$. The main branch $X_{\mathrm{split}}^{(1)}$ is processed by n stacked GSBottleneckC blocks; the processed main-branch features $X_{\mathrm{bottleneck}}$ are concatenated with the second branch $X_{\mathrm{split}}^{(2)}$, and feature fusion is finally completed through a 1 × 1 convolution (weight $W_{\mathrm{fusion}}$), yielding the output $X_{\mathrm{out}}$.
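The split, stacked-bottleneck, concatenation, and 1 × 1 fusion steps of Equations (11)–(13) can be sketched as below, reusing the GSConvSketch block from the previous listing. The GSBottleneckC composition, the 1:1 split ratio, and the class names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class GSBottleneckC(nn.Module):
    """Lightweight bottleneck built from two stacked GSConv-style blocks (assumed)."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(GSConvSketch(c, c, k=3), GSConvSketch(c, c, k=3))

    def forward(self, x):
        return self.block(x)

class VoVGSCSPCSketch(nn.Module):
    """Sketch of the VoV-GSCSPC flow in Eqs. (11)-(13): split the channels, run n
    GSBottleneckC blocks on one branch, concatenate with the untouched branch,
    then fuse with a 1x1 convolution. Assumes an even number of input channels."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_in // 2
        self.bottlenecks = nn.Sequential(*[GSBottleneckC(c_half) for _ in range(n)])
        self.fuse = nn.Conv2d(c_in, c_out, kernel_size=1)        # W_fusion in Eq. (13)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                               # Eq. (11): channel split
        y1 = self.bottlenecks(x1)                                # Eq. (12)
        return self.fuse(torch.cat((y1, x2), dim=1))             # Eq. (13)
```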
Figure 5 illustrates the core of the Slim-Neck concept, which aims to reduce computational complexity and inference delay while ensuring that the accuracy of the model is not compromised. Through this modular design strategy, the network architecture can be flexibly constructed to meet the requirements of specific tasks.

3.4. EMA Module

To pursue a lightweight design, Slim-Neck simplifies the model through GSConv and a lightweight bottleneck structure. However, the underwater environment is characterized by low illumination, water scattering, and blurred targets, and the features of small targets are already weak. After Slim-Neck is introduced, the number of channels is reduced and the convolution operations are simplified, which may lead to insufficient extraction of detailed features, such as the edges of underwater targets, and the loss of key information. This is an important challenge for underwater target detection. To compensate for this loss of information and strengthen feature extraction, an efficient multi-scale attention (EMA) module is integrated with the Detect head in this study. The EMA multi-scale attention mechanism focuses on the key features of underwater targets through attention weight allocation. At the same time, the EMA mechanism can learn to suppress underwater noise patterns: by identifying noise patterns in multi-scale features, assigning them low attention weights, and retaining true target features, it alleviates the reduced noise robustness caused by the lower complexity of Slim-Neck and improves the robustness of the model in complex underwater environments. The EMA module is designed to retain information in every channel while reducing computational overhead. It makes spatial semantic features evenly distributed within each feature group by reshaping part of the channels into the batch dimension and grouping the channel dimension into multiple sub-features [40]. In addition, the EMA module recalibrates the channel weights in each parallel branch by encoding global information and captures pixel-level pairwise relationships through cross-dimensional interaction.
Figure 6 shows the core operating principle of the EMA module. The main step is to divide the input feature map $X \in \mathbb{R}^{C \times H \times W}$ into G groups, with each group being $X_g \in \mathbb{R}^{C/G \times H \times W}$.
The 1 × 1 branch generates channel attention weights $W_{1 \times 1}$ by 1D global average pooling and 1 × 1 convolution, and the 3 × 3 branch extracts multi-scale spatial features $F_{3 \times 3}$ by 3 × 3 convolution.
Then, the outputs of the two branches are processed by 2D global average pooling to obtain the global feature vectors $Z_{1 \times 1}$ and $Z_{3 \times 3}$, as shown in Equations (14) and (15):
$$Z_{1 \times 1} = \frac{1}{H \times W} \sum_{i,j} F_{1 \times 1}(i, j)$$
$$Z_{3 \times 3} = \frac{1}{H \times W} \sum_{i,j} F_{3 \times 3}(i, j)$$
The attention map M is then calculated by a matrix dot product to capture pixel-level relationships, as shown in Equation (16):
$$M = \mathrm{Softmax}\left(Z_{1 \times 1} Z_{3 \times 3}^{T}\right)$$
where Softmax normalizes the attention weights.
Finally, the attention map M is multiplied with the original feature X to obtain the enhanced feature, as shown in Equation (17):
$$X_{\mathrm{out}} = X \times M$$
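The grouped two-branch computation of Equations (14)–(17) can be sketched as follows. This is a simplified, illustrative reading of the equations (the class name EMASketch, the sigmoid gating of the 1 × 1 branch, and the choice of eight groups are our assumptions), not the reference EMA implementation.

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Simplified efficient multi-scale attention following Eqs. (14)-(17)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        cg = channels // groups
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)             # 1x1 branch
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)  # 3x3 branch

    def forward(self, x):
        b, c, h, w = x.shape
        xg = x.view(b * self.g, c // self.g, h, w)       # regroup channels into the batch dim
        # two parallel branches: gated 1x1 branch and 3x3 spatial branch
        pooled_h = xg.mean(dim=3, keepdim=True)          # 1D average pooling along W
        pooled_w = xg.mean(dim=2, keepdim=True)          # 1D average pooling along H
        f1 = self.conv1x1(xg * torch.sigmoid(pooled_h) * torch.sigmoid(pooled_w))
        f3 = self.conv3x3(xg)
        # Eqs. (14)-(15): 2D global average pooling of each branch
        z1 = f1.mean(dim=(2, 3))                         # (b*g, c/g)
        z3 = f3.mean(dim=(2, 3))
        # Eq. (16): cross-branch attention map via matrix dot product
        m = torch.softmax(z1.unsqueeze(2) @ z3.unsqueeze(1), dim=-1)  # (b*g, c/g, c/g)
        # Eq. (17): re-weight the grouped features with the attention map
        out = torch.einsum('bij,bjhw->bihw', m, xg)
        return out.view(b, c, h, w)
```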

4. Experiments and Discussion

4.1. Dataset

4.1.1. Overview of the Dataset

This study uses the underwater detection dataset (UDD), a benchmark dataset for open-sea underwater target detection developed by Dalian University of Technology in 2020. It comprises 2227 high-resolution 4K (3840 × 2160 pixels) images captured under real ocean conditions, annotated with three benthic species: sea cucumbers, sea urchins, and scallops. As the first public dataset to achieve centimeter-level annotation accuracy in complex open-ocean environments, UDD provides high-quality data for validating underwater detection models [5].
Notable features of UDD include excellent spatial resolution of up to 3840 × 2160 pixels, which preserves the fine morphological characteristics of marine life. Ecological validity is achieved through natural underwater light conditions and real biological distribution. It is worth noting that there is a significant inter-class imbalance in this dataset, with sea urchins accounting for 90.5% and sea cucumbers and scallops accounting for 7.6% and 1.9%, respectively. In addition, more than 90% of the targets occupy an image area < 1.654%, which poses a major challenge for small target detection.

4.1.2. Image and Target Characteristics

This study uses the UDD to train the model. This dataset contains 2227 images, and each image has the three types of underwater targets precisely marked with bounding boxes. These bounding boxes provide detailed information about the position, size, and category of the targets. In addition, this study randomly divides the above dataset into a training set and a validation set in an 8:2 ratio.
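A minimal sketch of the random 8:2 split used here (the image id list and the random seed are hypothetical, not the exact split script used in the experiments):

```python
import random

def split_dataset(image_ids, train_ratio=0.8, seed=0):
    """Randomly split image ids into training and validation subsets (8:2 here)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# e.g. for the 2227 UDD images (hypothetical id list)
train_ids, val_ids = split_dataset(range(2227))
print(len(train_ids), len(val_ids))   # 1781 446
```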
Three types of underwater organisms are labeled in the UDD, namely sea cucumber, sea urchin, and scallop, as shown in Figure 7.
The characteristics and detection challenges of the three types of target images are shown in Table 1.
Through the analysis of the three types of targets, the common challenges of underwater target detection can be summarized as follows:
(1) Optical environment interference: The attenuation and scattering of underwater light cause image blurring, color distortion, and reduced target feature recognition, which affects the performance of the YOLOv11 detection algorithm.
(2) Small targets and dense distribution: The three types of targets generally have the characteristics of small size and high aggregation. Existing detection networks have insufficient perception ability for small target features, and the distinction and positioning of targets in dense scenes are technical difficulties.
(3) Computational resource constraints: The edge devices carried by underwater robots have limited computing power. The real-time deployment of high-precision detection models requires a balance between model complexity and inference efficiency, which results in higher requirements for the design of lightweight networks.

4.2. Implementation Details

This research experiment is based on the Pytorch deep learning framework and runs in an Anaconda virtual environment. Table 2 shows the environmental configuration of the experiment. Table 3 shows the main hyperparameter settings of the experiment.

4.3. Evaluation Metrics

In this study, precision, recall, mAP, and GFLOPs were used to evaluate the model and verify the practical application effect of the YOLOv11-MSE model proposed in this paper.

4.3.1. Precision

Precision represents the proportion of true positives in the samples predicted by the model to be positive, reflecting the reliability of the test results. Its expression is shown in Equation (18):
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
TP (true positive) indicates the number of positive samples correctly detected, and FP (false positive) indicates the number of negative samples incorrectly detected as positive samples. In the underwater target detection scenario, high accuracy can effectively reduce the risk of misoperation of the robot grasping system.

4.3.2. Recall

Recall rate measures the ability of the model to cover actual positive samples, and its expression is shown in Equation (19):
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
FN (false negative) indicates the number of positive samples missed. For underwater intensive target detection tasks, increasing the recall rate can help reduce the loss of missed targets.

4.3.3. AP

AP (average precision) comprehensively evaluates the detection performance through the area under the integral precision–recall curve, and its expression is shown in Equation (20):
$$AP = \int_{0}^{1} P(r)\, dr$$
where $P(r)$ represents the precision value when the recall is r. This index effectively overcomes the limitation of single-threshold evaluation and is especially suitable for performance evaluation of class-imbalanced datasets.

4.3.4. mAP

Mean average precision (mAP) is a comprehensive evaluation indicator that reflects the precision, recall, and AP of a detection task. Its expression is shown in Equation (21), where N is the number of target categories:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
The value range of mAP is [0, 1]; the higher the value, the better the comprehensive detection performance of the model. As the gold standard in the target detection field, this index can effectively reflect the robustness of the model in the detection process. mAP50 denotes the mAP at an IoU threshold of 0.50, while mAP50-95 averages AP over IoU thresholds from 0.50 to 0.95, usually in increments of 0.05.
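For reference, Equations (18)–(21) can be computed as in the following sketch; the all-point interpolation of the precision-recall curve and the toy values are illustrative assumptions, not the exact evaluation code used in the experiments.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (18)-(19): precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Eq. (20): area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # make the precision envelope monotonically non-increasing before integrating
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    """Eq. (21): mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# toy example with three classes (values are illustrative, not UDD results)
aps = [average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.7, 0.6]))
       for _ in range(3)]
print(mean_average_precision(aps))
```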

4.3.5. GFLOPs

GFLOPs (giga floating-point operations) is a key measure of the computational complexity of a neural network model, representing the billions of floating-point operations required for a single forward pass. A higher GFLOPs value indicates that the model requires more computation during execution and therefore more hardware resources, which may adversely affect training speed and inference efficiency.
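As a rough illustration of what this metric counts, the per-layer operation count of a convolution can be estimated as below; the layer sizes are hypothetical, and counting conventions (e.g., whether multiplies and adds are counted separately) vary between profiling tools.

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out, groups=1):
    """Approximate floating-point operation count for one Conv2d layer (bias ignored)."""
    kernel_ops = k * k * (c_in // groups)
    return 2 * kernel_ops * c_out * h_out * w_out   # x2: one multiply and one add

# e.g. a hypothetical 3x3 convolution with 64->128 channels on a 160x160 feature map
flops = conv2d_flops(64, 128, 3, 160, 160)
print(f"{flops / 1e9:.2f} GFLOPs")   # ~3.77 GFLOPs for this single layer
```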

4.4. Quantitative Analysis

In this study, the YOLOv11-MSE model proposed by us was first compared with other networks in the YOLO series, including YOLOv5, YOLOv6, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12. The experimental results of the different models are shown in Table 4 and Figure 8.
This study systematically compares the performance of the YOLO series models on the UDD to verify the effectiveness of the improvements made to the YOLOv11-MSE model. As shown in Table 4, YOLOv11-MSE achieves the best results in several key indicators, including accuracy, recall rate, mAP50, and GFLOPs. Comparing the GFLOPs of different YOLO versions, YOLOv6 has a GFLOPs of 44, indicating a relatively high complexity. In contrast, YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12 perform significantly better than YOLOv6 in terms of GFLOPs. Specifically, YOLOv5 has a GFLOPs of 23.8, YOLOv8 has 28.4, YOLOv9 has 26.7, and YOLOv10 has 24.5.
YOLOv11 achieves a GFLOPs of 21.3, while YOLOv12 has a slightly lower complexity at 21.2 GFLOPs. As shown in Figure 8, YOLOv11 outperforms other YOLO variants (YOLOv5, YOLOv6, YOLOv8, YOLOv9, and YOLOv10) across precision, recall, and mAP50, only trailing YOLOv12 marginally in accuracy and GFLOPs. Thus, YOLOv11 is selected as the baseline for research. After applying the improvement methods proposed in Section 3 of this study to YOLOv11, it achieves the best results in accuracy, recall rate, and mAP50 among all networks while significantly reducing the model complexity by 6.57% compared to YOLOv11, demonstrating that the model has broken through the accuracy–efficiency balance bottleneck of the traditional YOLO architecture through innovative module design.
To objectively evaluate the detection performance advantages of the proposed model in class-imbalanced scenarios, this study conducts a horizontal comparison experiment on three types of underwater targets, with a focus on the breakthrough progress in small target detection. The evaluation indicators include precision, recall, mAP50, and mAP50-95. The comparison results of the models’ detection performance on different targets are shown in Figure 9.
As shown in Figure 9, the model proposed in this paper demonstrates detection advantages for specific targets in multiple indicators. Our model achieves significant breakthroughs in detecting small targets in class-imbalanced scenarios, demonstrating superior adaptability to complex underwater environments. This is particularly evident in scallop detection, where it substantially outperforms the baseline YOLOv11. Specifically, for the rare scallop category (constituting only 1.9% of the dataset), our model dramatically reduces both false positives and false negatives, achieving a 60.62% increase in precision and a 10.16% improvement in mAP50 over YOLOv11. In sea cucumber detection (7.6% of data), the Slim-Neck and attention mechanisms contribute to a 14.2% improvement in recall compared to the best competitor, significantly reducing missed detections in complex backgrounds. This highlights the model’s enhanced capability to locate elusive targets, though distinguishing them from similar-looking backgrounds remains a challenge, as reflected by the precision metric. For the dominant sea urchin category (90.5%), the model maintains competitive detection performance (mAP50-95) while achieving a 6.57% reduction in computational complexity (GFLOPs). This effectively demonstrates the successful precision–efficiency trade-off of our lightweight design. Notably, this model demonstrates significant generalization ability under extreme class imbalance conditions. Taking scallops as an example, the mAP50 improvement exceeds that of other categories by an order of magnitude, proving that the multi-scale attention mechanism and cross-dimensional interaction strategy can effectively capture weak features and suppress background interference. Additionally, as shown in Table 4, compared with other versions of YOLO models, the Slim-Neck structure reduces the computational complexity by 19.1% while only causing a slight decrease of 0.012 in mAP50-95, highlighting the superiority of the lightweight design. In summary, this study achieves a leap in the performance of small target detection through modular innovation, especially in low-proportion and high-difficulty targets (such as scallops), providing a new solution for real-time detection in complex underwater scenes. Future work will focus on the design of dynamic loss functions to further optimize the gradient balance between majority and minority classes and enhance the overall robustness of the model.

4.5. Ablation Experiments

To validate the contribution of each proposed module, we conducted extensive ablation studies using YOLOv11s as the baseline. We incrementally integrated the C2PSA_MSDA, Slim-Neck, and EMA modules, both individually and in combination, to form the final YOLOv11-MSE model. The study first optimized the C2PSA module in the YOLOv11 backbone network while keeping other parts unchanged to ensure the purity of the experimental results. This process was repeated for all improvement methods, and these methods were integrated to form the final YOLOv11-MSE model to further verify its effectiveness. The specific experimental results are shown in Table 5 and Figure 10.
According to Figure 10, all models exhibit some oscillation in precision curves, a common challenge in underwater scenarios due to factors like low contrast, background clutter, and extreme class imbalance, which complicate feature learning for small targets. As shown in Table 5, adding the C2PSA_MSDA module (Exp2) alone improves precision, recall, mAP50, and mAP50-95 by 3.46%, 0.33%, 1.65%, and 0.3% (absolute increase), respectively, demonstrating its effectiveness in enhancing multi-scale feature representation.
Through the cascade design of the GSConv and VoV-GSCSPC modules, Slim-Neck achieves a 19.1% reduction in GFLOPs from 21.3 to 19.5 while keeping the mAP50-95 decrease at no more than 0.012, and its lightweight efficiency exceeds that of similar schemes, which verifies the effectiveness of the structural lightweight design. After the EMA module is added (Exp4), recall increases from 0.61 to 0.695, an increase of 12.23%, and mAP50 reaches 0.686, indicating that Exp4 enhances model training stability and generalization ability; however, precision decreases from 0.693 to 0.591. This suggests that EMA enhances the detection of difficult samples (such as sea cucumbers and scallops) through feature recalibration but may amplify background interference. At the same time, according to Figure 11, after adding C2PSA_MSDA, Slim-Neck, and EMA together, detection performance improves markedly, with mAP50 increasing from 0.666 to 0.689, an increase of 3.45%, and precision significantly increasing by 9.67%. In addition, the GFLOPs of the model are compressed by 6.57%. The MSDA module primarily contributes to gains in scallop detection by expanding the receptive field without sacrificing resolution, allowing the network to capture more contextual information around these small targets. This is particularly effective for scallops that are partially obscured or blend with the seabed, as the multi-scale context helps distinguish them from the background. The EMA module builds upon this by performing cross-dimensional interaction to emphasize the most discriminative features of scallops (e.g., their distinctive elliptical shape and texture patterns visible in high-resolution images) and attenuate irrelevant background noise. This synergistic effect explains the dramatic 60.62% improvement in precision for scallops: MSDA finds more true positives, and EMA then helps validate them, reducing false alarms. In summary, the ablation studies confirm that YOLOv11-MSE effectively addresses key challenges in underwater detection: mitigating feature loss in small targets, reducing false detections in class-imbalanced data, and maintaining accuracy under strict lightweight constraints.

4.6. Model Loss Chart Comparison

To further verify the optimization effect of the improved method in this study on model convergence, Figure 12 presents the loss change curve between the YOLOv11 model and the improved model in this study during the training process. The comparison shows that the loss value of the YOLOv11 model is higher than that of the improved model in this study in both the training stage and the verification stage. Specific data show that the final loss value of the model in this study on the training set and the verification set is 1.18515 and 1.32397, respectively, while the loss value of the YOLOv11 model on the training set and the verification set is 1.18886 and 1.33960. The results fully show that the improved model achieves significant improvement in training stability and convergence performance.

4.7. Visualization Results

To directly verify the effectiveness of the proposed method compared with other models, a qualitative analysis was conducted. This study selected four representative detection images from the test set as cases. Figure 13 shows how the newer YOLO series models perform in underwater target detection tasks. Note that the seven rows in the figure correspond to the original image and the prediction results of YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12, and the model in this study.
To verify the detection efficiency of the improved model in practical applications, a series of experiments was performed using the test set retained from the dataset to evaluate the adaptability of the improved algorithm under dynamic hydrological conditions and underwater environments with different visibility. The core purpose of these experiments was to evaluate the performance of the algorithm in a real-world environment. The performance comparison between the baseline model and the improved algorithm is shown in Figure 13, where the yellow box identifies the starfish, the blue box identifies the sea cucumber, the cyan box identifies the sea urchin, and the white box identifies the scallop. In this study, the target detection performance of six models (YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12, and YOLOv11-MSE) was evaluated. Under bright and clear lighting conditions, several sea urchins and sea cucumbers were located at the edge of the image, as shown in the image on the left. YOLOv11 failed to identify these organisms, while YOLOv11-MSE successfully detected them with significantly improved confidence, showing strong edge detection capabilities. In the blurry, dark environment, shown in the third column of images, the low underwater visibility caused YOLOv11 to misjudge the end of a piece of seaweed as a sea cucumber and a black shadow at the edge as a sea urchin. Other YOLO series models also showed serious misjudgments, failing to distinguish seaweed from sea cucumbers with similar colors and shapes; YOLOv8 in particular produced six misjudgments. YOLOv11-MSE successfully detected the sea cucumbers, demonstrating the effectiveness of its improved attention mechanism under blurry, low-light conditions. In the images on the far right, although not as dark as those in the third column, target detection was still affected to some extent. For the scallops buried in the seabed sediment, YOLOv8, YOLOv11, and YOLOv12 all failed to detect them and misjudged the seaweed as sea cucumbers, while YOLOv11-MSE identified and located the targets more accurately.
In summary, the YOLOv11-MSE model is slightly better than other YOLO series models in terms of overall detection accuracy. In addition, the YOLOv11-MSE demonstrates excellent performance when detecting organisms in challenging environments, such as low resolution and fuzzy dark light, while maintaining high accuracy.

5. Conclusions

This study successfully addressed the challenges of low contrast, small targets, and class imbalance in underwater object detection by proposing a lightweight and efficient model, YOLOv11-MSE. The core contributions and conclusions are as follows. The proposed multi-scale dilated attention module (C2PSA_MSDA) effectively enhances multi-scale contextual feature extraction, which is critical for detecting small and blurred underwater targets. The designed Slim-Neck structure, based on GSConv and VoV-GSCSPC, significantly reduces computational complexity while maintaining competitive accuracy, offering a practical solution for real-time deployment on edge devices. The integration of the EMA mechanism in the detection head strengthens feature representation and noise suppression, particularly improving the recall of challenging, rare categories. Comprehensive evaluations on the UDD demonstrate that YOLOv11-MSE achieves state-of-the-art performance among YOLO variants. Our model attained an mAP50 of 68.9%, surpassing YOLOv11 by 3.45%, while simultaneously reducing computational cost by 6.57%. Notably, for the extremely rare scallop category (1.9%), precision and mAP50 were dramatically improved by 60.62% and 10.16%, respectively.
Despite these advances, limitations remain. First, performance under extreme class imbalance is still imperfect, with relatively low recall for underrepresented species such as scallops. Second, although the attention mechanism introduced in this study enables the model to adapt to multi-scale targets, it also increases model complexity. Therefore, we envisage exploring semi-supervised learning in future work to utilize unlabeled underwater data and further optimizing edge deployment using hybrid quantization techniques. Designing more lightweight attention mechanisms or methods to enhance the model's ability to adapt to multi-scale targets would provide a solid foundation for applications such as marine robotics, environmental monitoring, and aquaculture management, bridging the gap between theoretical innovation and practical deployment.

Author Contributions

Conceptualization, Z.Y.; Methodology, Z.Y.; Validation, D.L.; Formal analysis, Z.Y.; Investigation, X.P.; Data curation, D.L.; Writing—original draft, Z.Y.; Visualization, D.L.; Supervision, F.S.; Project administration, F.S.; Funding acquisition, X.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Laboratory Foundation (WDZC20245250303), the National Natural Science Foundation of China (52305594), the Natural Science Foundation of Hunan Province (2024JJ6460), and the China Postdoctoral Science Foundation Grant (2024M754299).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The YOLOv11-MSE network architecture.
Figure 2. The MSDA operating principle diagram.
Figure 3. The C2PSA_MSDA module network structure.
Figure 4. The GSConv structure.
Figure 5. The Slim-Neck network architecture.
Figure 6. The EMA structure.
Figure 7. Samples of the UDD.
Figure 8. Schematic diagram showing comparative results of YOLO family networks.
Figure 9. Detection performance comparison of the YOLO series: (a) precision, (b) recall, (c) mAP50, (d) mAP50-95.
Figure 10. Results of model performance in ablation experiments: (a) precision, (b) recall, (c) mAP50, (d) mAP50-95.
Figure 11. Relationship between mAP50 and GFLOPs in the ablation experiment.
Figure 12. Model loss comparison chart: (a) Train_loss, (b) Val_loss.
Figure 13. Comparison of model test results.
Table 1. Image characteristics and detection challenges of the UDD.

| Category | Characteristics | Detection Challenges |
|---|---|---|
| Sea cucumbers | Long strips; color similar to seafloor rocks | Easily blends into the background; small target size leads to missed detections |
| Sea urchins | Black spiny appearance; visual contrast with sandy background; aggregated distribution | Densely distributed; targets overlap; boundaries are fuzzy, making detection boxes hard to localize |
| Scallops | Shell-shaped; various opening/closing states and occlusion conditions; distribution is scattered or clustered | Irregular shape and complex occlusion; small scallops easily lose detail; dataset classes are imbalanced |
Table 2. Experimental environment configuration.

| Environment Configuration | Parameter |
|---|---|
| Operating system | Windows 11 |
| CPU | AMD Ryzen 7 5800H with Radeon Graphics |
| GPU | NVIDIA GeForce RTX 3060 (6 GB) |
| Development environment | PyCharm 2024.3.1.1 |
| Language | Python 3.10.16 |
| Operating platform | PyTorch 2.5.0 + CUDA 11.8 |
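Before attempting to reproduce the experiments, it can help to confirm that the local software stack roughly matches Table 2; the minimal sketch below assumes a standard PyTorch installation and simply prints the versions actually in use.

```python
# A minimal sketch to confirm the local stack roughly matches Table 2 before
# reproducing the experiments (assumes a standard PyTorch installation).
import platform
import torch

print("Python:", platform.python_version())          # expected 3.10.16
print("PyTorch:", torch.__version__)                 # expected 2.5.0 (+cu118 build)
print("CUDA runtime:", torch.version.cuda)           # expected 11.8
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))     # expected NVIDIA GeForce RTX 3060
```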
Table 3. Hyperparameter settings.

| Hyperparameter | Value |
|---|---|
| Epochs | 100 |
| Batch size | 16 |
| Learning rate | 0.01 |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
| Input image size | 640 |
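As an illustration of how the settings in Table 3 map onto a typical training run, the sketch below uses the ultralytics training API; the dataset configuration file udd.yaml is a hypothetical placeholder, and the stock YOLOv11 model configuration stands in for the YOLOv11-MSE configuration, which is not reproduced in this listing.

```python
# A minimal sketch (assumptions: the ultralytics package; "udd.yaml" is a
# hypothetical dataset configuration pointing at the UDD train/val/test splits;
# the stock YOLOv11 config is used as a stand-in for the modified network).
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")  # swap in the YOLOv11-MSE configuration if available
model.train(
    data="udd.yaml",       # hypothetical dataset config
    epochs=100,            # Table 3: Epochs
    batch=16,              # Table 3: Batch size
    lr0=0.01,              # Table 3: Learning rate (initial)
    momentum=0.937,        # Table 3: Momentum
    weight_decay=0.0005,   # Table 3: Weight decay
    imgsz=640,             # Table 3: Input image size
)
```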
Table 4. Benchmarking results of the YOLO family networks.

| Model | Precision | Recall | mAP50 | mAP50-95 | GFLOPs |
|---|---|---|---|---|---|
| YOLOv5 | 0.714 | 0.556 | 0.653 | 0.281 | 23.8 |
| YOLOv6 | 0.608 | 0.609 | 0.622 | 0.269 | 44 |
| YOLOv8 | 0.737 | 0.55 | 0.637 | 0.284 | 28.4 |
| YOLOv9 | 0.702 | 0.586 | 0.65 | 0.278 | 26.7 |
| YOLOv10 | 0.619 | 0.606 | 0.634 | 0.275 | 24.5 |
| YOLOv11 | 0.693 | 0.61 | 0.666 | 0.299 | 21.3 |
| YOLOv12 | 0.704 | 0.533 | 0.632 | 0.283 | 21.2 |
| Our model | 0.76 | 0.618 | 0.689 | 0.294 | 19.9 |
Table 5. Results of the ablation experiment.

| NO | C2PSA-MSDA | Slim-Neck | EMA | Precision | Recall | mAP50 | mAP50-95 | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| exp1 (Baseline) |  |  |  | 0.693 | 0.61 | 0.666 | 0.299 | 21.3 |
| exp2 | ✓ |  |  | 0.717 | 0.612 | 0.677 | 0.302 | 21.4 |
| exp3 |  | ✓ |  | 0.686 | 0.62 | 0.676 | 0.287 | 19.5 |
| exp4 |  |  | ✓ | 0.591 | 0.695 | 0.686 | 0.301 | 21.7 |
| exp5 | ✓ | ✓ |  | 0.687 | 0.639 | 0.68 | 0.296 | 19.5 |
| exp6 | ✓ |  | ✓ | 0.719 | 0.58 | 0.669 | 0.292 | 21.7 |
| exp7 |  | ✓ | ✓ | 0.689 | 0.609 | 0.667 | 0.289 | 19.8 |
| exp8 | ✓ | ✓ | ✓ | 0.76 | 0.618 | 0.689 | 0.294 | 19.9 |
