1. Introduction
Marine environments dominate the Earth’s surface area and host abundant resources, including petroleum, natural gas, minerals, and rich marine biodiversity. Exploring the ocean not only advances scientific inquiry but also facilitates the sustainable exploitation of these resources. However, the physiological constraints of human operators render underwater tasks significantly more challenging than surface-level operations. Underwater object detection has therefore received increasing scholarly attention, as it leverages advanced computational techniques for the automated identification and precise localization of submerged entities, including marine organisms, shipwrecks, archaeological artifacts, divers, and underwater infrastructure. Accurate detection is instrumental in applications such as resource exploration, aquaculture management, and marine ecosystem conservation, in addition to its strategic significance in sectors such as deep-sea resource extraction, commercial fisheries, and military surveillance.
Despite substantial progress in deep learning-based object detection, underwater environments present a unique set of unresolved challenges. The propagation of light underwater is heavily influenced by attenuation and scattering, leading to diminished contrast and spatial feature distortions that vary with depth and viewing angle. Additionally, the inherent scale variability of submerged targets, ranging from minute aquatic organisms to large shipwrecks, complicates the maintenance of spatial consistency across feature hierarchies. Furthermore, marine organisms exhibit dynamic postural variations, so even intra-class objects display significant morphological changes, further exacerbating detection difficulties. Consequently, robust spatial feature processing remains a critical factor in enhancing detection performance under these conditions.
To mitigate these limitations, this paper introduces a composite YOLOv11n-based framework designed to improve the fusion of local and global feature representations, thereby enhancing detection robustness and precision. Local feature extraction emphasizes intricate target characteristics, such as texture and edge details, whereas global feature modeling captures macroscopic attributes, including scale and color distributions. Additionally, global contextual awareness is incorporated to account for the environmental backdrop surrounding the detected object. By synergistically integrating these components, the proposed model aspires to fortify the accuracy of underwater object detection against the adversities posed by complex aquatic environments.
The enhanced framework, YOLO-DAFS, incorporates novel components: DualBottleneck, focal modulation, ADown, and ghost depth-separable convolution (GDConv). This introduction evaluates the advantages of these modules over existing alternatives, highlighting their suitability for underwater environments.
Firstly, the DualBottleneck module enhances feature extraction in the C3k2 block through a lightweight DualConv design, integrating 3 × 3 grouped convolutions and 1 × 1 pointwise convolutions to capture local spatial features (e.g., edges and textures) and global channel information, respectively. This approach ensures a balance between detail sensitivity and scale adaptability, effectively addressing light-attenuation-induced feature loss and accommodating diverse underwater target scales, from small sea urchins to large wrecks. Compared to attention-based bottlenecks, such as SE-Net and CBAM, DualBottleneck avoids computationally intensive global pooling and complex convolution. Compared to transformer-based bottlenecks, such as multi-head self-attention structures, it eliminates the need for large-scale matrix computations, offering lower complexity and enhanced robustness for small targets and low-contrast scenes.
Secondly, the C2PSF module leverages focal modulation to replace the original C2PSA block, bolstering multiscale feature representation and global context modeling. By employing multiscale depthwise convolutions, gated aggregation, and element-wise affine transformations, focal modulation dynamically integrates visual contexts across short-to-long ranges, preserving both local details and global structures. This design excels in distinguishing targets from complex underwater backgrounds and adapts to dynamic lighting variations. Compared with attention-free transformers and MLP-Mixers, focal modulation avoids two problems: the serialized global modeling of attention-free transformers may reduce sensitivity to fuzzy edges, and the fixed-patch processing of MLP-Mixers adapts poorly to drastic changes in target scale. This ensures heightened sensitivity to small underwater targets.
Thirdly, the ADown module optimizes feature downsampling by enhancing detail preservation while significantly reducing the computational overhead. Ablation studies (
Table 1, UTDAC2020 dataset) demonstrate that ADown reduces the parameters and FLOPs by approximately 14%, lowering the parameters from 3.677 M to 3.167 M and FLOPs from 8.5 GFLOPs to 7.1 GFLOPs. For instance, on a Jetson Nano with ~0.5 TFLOPS, the 14% FLOPs reduction theoretically shortens inference time from 17 ms (58.8 FPS) to 14.2 ms (70.4 FPS), yielding a ~19.7% FPS increase, though real-world gains may be tempered by memory access and I/O overheads. Similarly, the parameter reduction mitigates memory bandwidth constraints, reducing cache misses and data transfer latencies, further boosting FPS.
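As a quick sanity check of this arithmetic, the short snippet below reproduces the latency and FPS figures under the stated assumptions: a purely compute-bound model running on a device with a sustained throughput of roughly 0.5 TFLOPS, with memory access and I/O overheads ignored.

```python
# Back-of-the-envelope latency estimate for a compute-bound model on a
# device with ~0.5 TFLOPS sustained throughput (assumption from the text).
device_flops_per_s = 0.5e12

for label, gflops in [("YOLOv11n baseline", 8.5), ("with ADown", 7.1)]:
    latency_s = (gflops * 1e9) / device_flops_per_s
    print(f"{label}: {latency_s * 1e3:.1f} ms -> {1.0 / latency_s:.1f} FPS")
# Prints ~17.0 ms (58.8 FPS) and ~14.2 ms (70.4 FPS), i.e. a ~19.7% FPS gain.
```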
Lastly, the DyHead-GDC module optimizes the detection head by integrating ghost convolution and depthwise separable convolution, forming a lightweight GDConv structure. Compared to the inverted residual blocks of MobileNet, GDConv mitigates the limited-receptive-field issue through ghost feature generation, offering richer feature diversity. Unlike the channel shuffle of ShuffleNet, which risks feature inconsistency due to randomization, GDConv ensures targeted feature enhancement. Relative to the MBConv with Squeeze-and-Excitation used in EfficientNet, GDConv preserves spatial information, improving the detection of small or blurred targets.
In summary, the DualBottleneck, focal modulation, ADown, and GDConv modules collectively enable YOLO-DAFS to achieve better performance in underwater object detection. By addressing low contrast, scale variability, and background complexity with lightweight, efficient, and robust designs, these components outperform attention-based, transformer-based, and other lightweight convolution alternatives. YOLO-DAFS not only delivers high accuracy and real-time processing but also supports practical applications in marine biology monitoring, underwater rescue, and archaeological exploration, offering a novel perspective for advancing underwater vision systems.
2. Related Work
Computer vision systems designed for visual recognition address two primary tasks: identifying the objects present in an image and precisely determining their spatial coordinates. Object detection combines both, extracting target features from images or videos to enable precise localization. Over the past few decades, object detection has progressed through two primary phases: traditional algorithms and deep learning-based approaches.
The traditional algorithms for object detection primarily rely on handcrafted features, and the overall process comprises three fundamental steps. First, regions of interest (ROIs) are selected, identifying areas that are likely to contain objects. Second, feature extraction is performed on these candidate regions. Finally, the extracted features are used for object classification and localization. Viola et al. [
1] proposed the Viola–Jones (VJ) detector, which employs a sliding window technique to determine the presence of a target within a given region. However, it is computationally expensive and suffers from high time complexity.
Deep learning-based object detection algorithms fall into two main types: single-stage and two-stage methods. Single-stage algorithms directly generate final predictions through a single forward pass. The input image is first preprocessed, then features are extracted through convolutional operations. These extracted features are fed into the detection head, where object localization and classification are performed to obtain the final detection results. Representative single-stage algorithms include the SSD [
2] and YOLO [
3] series. Conversely, two-stage algorithms operate in a sequential manner. First, region proposal networks generate candidate proposals for potential object locations, which are subsequently mapped to a fixed size. Second, a feature extraction network is used to extract target features from the candidate regions, after which object classification and localization are performed. Lastly, non-maximum suppression (NMS) is applied to eliminate redundant bounding boxes, providing the final detection results. R-CNN [
4] and Fast R-CNN [
5] are the representative two-stage algorithms.
With the continuous advancement of object detection methods, researchers have been striving to enhance detection accuracy in complex environments. In underwater image detection tasks, feature fusion and many attention mechanisms are widely employed to enhance the extraction of feature information. Ranjith Dinakaran et al. [
6] proposed a cascaded framework, named DCGAN + SSD, which combines SSD with a deep convolutional generative adversarial network (DCGAN) for underwater object detection. While this model achieved higher detection accuracy than the original SSD, its inference speed was inadequate for real time detection in autonomous underwater vehicles (AUVs). Other researchers, such as Duo et al. [
7], leveraged convolutional neural networks (CNNs) for marine data reconstruction and classification. Zhang Lin et al. [
8] developed an improved algorithm based on Faster R-CNN for detecting fish school locations to assist in underwater device path planning and tracking. However, Faster R-CNN primarily relies on local convolution operations, limiting its ability to capture global contextual information. CNNs extract features mainly through convolution and pooling operations, which inherently restricts their receptive field and limits their capability to capture fine details across multiple scales. Song et al. [
9] introduced a two-stage detection method, the Boosting R-CNN. This method incorporates a novel regional proposal network, RetinaRPN, along with a probabilistic inference pipeline and boosting reweighting. While this method demonstrated outstanding performance in detecting blurred and low-contrast underwater images, it still faced challenges in effectively modeling global features, such as spatial relationships and contextual information between objects. Jia et al. [
10] enhanced the structure of the EfficientDet detector and proposed a novel marine organism detection model, EfficientDet-Revised (EDR). By integrating an MBConvBlock with a channel shuffle module, this approach facilitated information exchange across feature channels, eliminating the fully connected layers in the attention module and replacing them with convolutional operations to reduce network parameters. This optimization enabled feature fusion across multiple scales and enhanced the model’s ability to extract features from diverse objects. However, EDR suffered from low-resolution high-level features, leading to suboptimal detection performance for tiny and densely packed objects. Dai et al. [
11] proposed the ERL-Net method, which incorporated an edge-guided attention module to capture salient boundaries and proposed a multiscale feature aggregation module along with a wide, yet asymmetric, receptive field. While ERL-Net demonstrated strong performance in underwater object detection tasks, the asymmetric receptive field increased computational complexity, and the multiscale feature aggregation process potentially led to the loss of local details in small objects and interference between targets. Given the high accuracy and fast inference speed, YOLO-based algorithms have also been widely applied to marine object detection. Traditional YOLO models adopt FPN/PAN structures and anchor-based mechanisms, which often give rise to the loss of small object information. To address this limitation, we selected YOLOv11, which employs an anchor-free mechanism, eliminating the conventional anchor box design. Instead, it directly predicts the center coordinates, width, height, and class probability of target objects for detection. Furthermore, YOLOv11 optimizes feature fusion with its BiFPN/PAN structure, which selectively enhances key information through learnable feature-weighting mechanisms.
As object detection technology continues to evolve, researchers have concentrated on improving model accuracy in complex environments. Attention mechanisms are frequently integrated into networks to enhance the target feature extraction, enabling the model to focus on important semantic information, thereby improving both computational efficiency and accuracy. Hu et al. [
12] proposed SE-Net, which enhances inter-channel dependencies through the SE module, adaptively recalibrating channel feature responses. This improvement strengthens the network’s feature representation capabilities while maintaining effective generalization. However, SE-Net only considers channel attention while ignoring spatial information, resulting in a limited detection capability for object position and shape. Liu et al. [
13] introduced the global attention mechanism (GAM) module, which mitigates information loss and enhances global interaction representation. By introducing a 3D arrangement strategy, GAM preserves cross-dimensional information and utilizes a multi-layer perceptron (MLP) to reinforce dependencies between channel and spatial dimensions. Although GAM emphasizes global contextual information, this focus may come at the expense of local feature representation. Woo et al. [
14] proposed the convolutional block attention module (CBAM), which effectively addresses some limitations of previous attention mechanisms by incorporating both channel and spatial attention. The channel attention module assigns weights to channel features to highlight “important” features, while the spatial attention module assigns weights to spatial features to emphasize their “position”. This design enables the network to better concentrate on key information in images, leading to more accurate feature extraction and improved detection performance. However, CBAM employs global average pooling and convolution operations to compute attention weights, which increases the number of parameters and may result in the loss of small object details. Wang et al. [
15] introduced the efficient channel attention (ECA) module, which utilizes a local cross-channel interaction strategy using 1D convolution to efficiently compute channel attention. ECA adaptively selects the convolution kernel size, significantly reducing computational complexity while maintaining performance. However, the limited size of local convolutions restricts ECA’s ability to capture complex global features, making it less effective in modeling global information. Additionally, ECA primarily focuses on channel attention while neglecting spatial information. Ouyang et al. [
16] proposed the efficient multiscale attention (EMA) module, which reshapes part of the channels into the batch dimension and partitions them into multiple sub-features to ensure a uniform distribution of spatial semantic features within each feature group. This approach preserves channel features while reducing computational resource wastage. Furthermore, EMA recalibrates channel weights in each parallel branch, encodes global feature information, and captures pixel-level relationships through cross-dimensional interaction. However, EMA relies mainly on global information, resulting in weaker modeling of local spatial features. Si et al. [
17] introduced the spatial and channel synergistic attention (SCSA) module, which consists of two components: shared multi-semantic spatial attention (SMSA) and progressive channel self-attention (PCSA). The former integrates multi-semantic information and employs a progressive compression strategy to inject discriminative spatial priors into the channel self-attention of the latter. This process effectively guides channel recalibration and alleviates semantic discrepancies between different feature levels.
The proposed YOLO-DAFS effectively improves the network’s local spatial feature processing and global feature representation through the integration of a lightweight dual-kernel convolution, a dynamic detection head, a lightweight downsampler, an attention-free focal modulation module, and a spatial-channel synergistic attention module. This design balances local and global feature extraction while considering computational efficiency. The model demonstrates strong classification and localization performance for object detection on the UTDAC2020 and DUO datasets. YOLO-DAFS achieves a real-time detection speed of 217 FPS, meeting critical real-time processing requirements, and has a model size of 6.5 M. Additionally, its computational complexity remains moderate at 7.1 GFLOPs, only 0.5 GFLOPs more than YOLOv11n. These characteristics make it well suited for integration into underwater detection systems.
3. Methods
3.1. YOLOv11 Network
This study employs YOLOv11n as the benchmark model and introduces a modified network based on it, referred to as YOLO-DAFS.
3.1.1. Backbone
The backbone of YOLOv11 consists of CBS, C3k2, SPPF, and C2PSA modules. The CBS module comprises a standard convolutional layer (Conv), batch normalization (BN), and the SiLU activation. The C3k2 module, inspired by the ELAN structure in YOLOv7, improves gradient flow through additional inter-layer connections, enhancing feature representation. The SPPF module applies dimensionality reduction to the input feature maps via a convolutional layer followed by three max-pooling layers, before fusing and upsampling the feature maps to retain multiscale target information. SPPF is derived from the SPP structure but implemented differently: SPP uses parallel multiscale pooling, whereas SPPF achieves an equivalent effect by serially repeating the same pooling kernel, improving computational efficiency by about 30% while maintaining accuracy. The C2PSA module extends the C2 module by incorporating a pointwise spatial attention (PSA) block, which uses a multi-head attention mechanism and a feedforward neural network to improve feature extraction and contextual awareness.
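The serial-pooling equivalence is easiest to see in code. The sketch below follows the widely used Ultralytics-style SPPF layout (a single 5 × 5 max-pool applied three times in series reproduces the effective 5/9/13 receptive fields of SPP’s parallel pools); the small Conv helper is included only to make the example self-contained.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """CBS block: Conv2d + BatchNorm + SiLU, as described above."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Serial re-use of one 5x5 max-pool approximates SPP's parallel 5/9/13 pools."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2                       # channel reduction before pooling
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)  # fuse the four concatenated branches
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)    # effective 5x5 receptive field
        y2 = self.m(y1)   # effective 9x9
        y3 = self.m(y2)   # effective 13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```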
3.1.2. Neck
The neck of YOLOv11 network utilizes the path aggregation network (PANet), a bidirectional feature fusion framework that integrates both top-down and bottom-up pathways. Higher-level features capture rich semantic information, while lower-level features preserve detailed spatial information. PANet’s fusion mechanism effectively combines these features, ensuring the final output retains both the spatial and semantic content.
3.1.3. Head
The detection head of YOLOv11 adopts a decoupled structure, incorporating two additional depthwise separable convolution modules into the classification–detection head of YOLOv8. The detection task in YOLO consists of object localization, which focuses on boundary information, and classification, which emphasizes texture features. The decoupled head extracts coordinate and classification information through independent network branches, enabling separate learning and output processing. The use of depthwise separable convolution reduces both the parameter count and the computational complexity while improving the model’s generalization ability.
3.2. Enhanced YOLOv11 Network
The YOLOv11 algorithm includes five versions: n, s, m, l, and x. This study selects YOLOv11n as the benchmark model due to its low parameter count and fast detection speed. Based on this, an improved YOLOv11n-based underwater object detection algorithm is proposed. The enhancements include integrating the DualConv [
18] module into the C3k module’s bottleneck to refine the backbone; incorporating ghost convolution and depthwise separable convolution into the dynamic head [
19] to replace the original YOLOv11 detection head; introducing the lightweight feature extraction module ADown to replace the convolutional modules used for feature extraction and downsampling in the backbone; and combining focal modulation [
20] with the C2 module to form the C2PSF. Additionally, the SCSA attention module is added before the detection head. The improved YOLOv11 network structure is illustrated in
Figure 1.
3.2.1. C3k2_Dual
The underwater environment is typically characterized by complex lighting conditions, scattering, and light attenuation, which result in low contrast between objects and backgrounds, making it difficult to distinguish color or texture features and resulting in false positives and missed detections. Additionally, underwater objects exhibit various scales, which presents a challenge for detection.
To boost the efficiency of information processing and feature extraction in the backbone network, the lightweight DualConv with dual convolution kernels is integrated into the C3k2 module. The convolution filter structure of the DualConv is depicted in
Figure 2, where M denotes the number of input channels, N the total number of convolution filters, and G the number of convolution groups. A subset of the convolution kernels in the DualConv simultaneously performs 3 × 3 and 1 × 1 convolutions, while the remaining kernels perform only the 1 × 1 convolution. The N convolution filters are partitioned into G groups, each tasked with processing the entire input feature map. In each group, M/G input channels are processed by both 3 × 3 and 1 × 1 convolution kernels, while the remaining (M − M/G) channels are subjected exclusively to 1 × 1 convolution kernel operations. This architecture effectively merges the principles of 3 × 3 group convolutions with 1 × 1 pointwise convolutions. The former focus on extracting spatial features from localized regions of the feature map, emphasizing the shape and details of underwater organisms and enhancing local features, while efficiently organizing the convolution kernels to reduce model complexity and maintain the network’s representational ability. The latter serve as a global feature aggregator at the channel level, performing global channel information aggregation to effectively detect objects of varying scales.
Figure 3 illustrates the structures of the C3k, DualBottleneck, and bottleneck in C3k2-Dual.
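For concreteness, the following is a minimal PyTorch sketch of the DualConv filter and the DualBottleneck structure described above, written from that description rather than from the released training code; the normalization/activation placement and the group number g = 4 are assumptions, and g must divide the channel counts.

```python
import torch.nn as nn

class DualConv(nn.Module):
    """Parallel 3x3 grouped convolution (local spatial detail) and 1x1 pointwise
    convolution (channel-level aggregation); their outputs are summed."""
    def __init__(self, c_in, c_out, stride=1, g=4):
        super().__init__()
        # grouped 3x3 path: spatial features from local regions
        self.gc = nn.Conv2d(c_in, c_out, 3, stride, 1, groups=g, bias=False)
        # pointwise 1x1 path: global channel-wise information
        self.pwc = nn.Conv2d(c_in, c_out, 1, stride, 0, bias=False)

    def forward(self, x):
        return self.gc(x) + self.pwc(x)

class DualBottleneck(nn.Module):
    """Sketch of the bottleneck used in C3k2-Dual: two DualConv stages with an
    optional residual shortcut (structure as in Figure 3, details assumed)."""
    def __init__(self, c, g=4, shortcut=True):
        super().__init__()
        self.block = nn.Sequential(
            DualConv(c, c, g=g), nn.BatchNorm2d(c), nn.SiLU(),
            DualConv(c, c, g=g), nn.BatchNorm2d(c), nn.SiLU(),
        )
        self.add = shortcut

    def forward(self, x):
        return x + self.block(x) if self.add else self.block(x)
```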
3.2.2. DyHead-GDC
In object detection tasks, images often contain objects of different scales, and objects may appear in arbitrary shapes and positions within the image. Moreover, these objects may be represented through multiple modalities. Consequently, an effective object detection head must satisfy three key requirements to cope with these challenges. This study proposes a novel design, termed DyHead-GDC, which integrates ghost depth-separable convolution (GDC) with the DyHead framework introduced by Xiyang Dai et al. This design not only reduces the parameter count and floating-point computations of DyHead but also maintains the detection accuracy. The dynamic detection head synergistically incorporates multiple attention mechanisms, operating across task-aware output channels, scale-aware feature hierarchies, and spatially aware locations, thereby enhancing its capacity to capture both localized details and global contextual features of objects. The elaborate structure of DyHead-GDC is depicted in
Figure 4, where the modulated DeformConv, spatial attention, and DyReLU activations correspond to the scale-, spatial-, and task-aware attention modules, respectively. The scale-aware attention module functions along the feature layer dimension, dynamically adjusting the weights assigned to multiscale features. This adaptability enables the detection head to more effectively select appropriate feature layers for objects of varying scales, thereby improving the representation of multiscale objects with an emphasis on global feature extraction. The spatial-aware attention module operates within the spatial dimension, leveraging deformable convolutions to dynamically adjust sampling points. This mechanism allows the network to prioritize critical regions of the feature map during training, placing a greater emphasis on local feature enhancement. Meanwhile, the task-aware attention module operates along the channel dimension, enabling distinct channels to adaptively specialize in tasks such as object classification and bounding box regression.
The GDConv module proposed in this work integrates ghost convolution [
21] with depth-separable convolution [
22] to effectively reduce both the parameters and the computational complexity of the detection head. The ghost depth-separable convolution approach partitions the convolution process into two stages: initially, a standard convolution is employed to produce a small set of primary features; subsequently, depth-separable convolution is applied to these primary features to generate additional ghost features. The primary and ghost features are then concatenated, followed by batch normalization and an activation function to yield the final output features. By leveraging cost-efficient convolutions to generate an increased number of feature maps, this method significantly alleviates the computational burden and reduces the number of model parameters. Experimental evaluations demonstrate that the incorporation of ghost depth-separable convolution in DyHead-GDC results in a reduction of approximately 206,000 parameters and 1.2 GFLOPs compared to the original DyHead, while preserving equivalent detection accuracy. The architectural details of the GDConv module are illustrated in
Figure 4.
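A minimal sketch of the GDConv idea just described is given below; the 1:1 split between primary and ghost channels and the SiLU activation are assumptions made for illustration, while the exact layout follows Figure 4.

```python
import torch
import torch.nn as nn

class GDConv(nn.Module):
    """Ghost depthwise-separable convolution sketch: a standard convolution
    produces a small set of primary features, a depthwise-separable convolution
    derives cheap 'ghost' features from them, and the two are concatenated,
    followed by batch normalization and an activation."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        c_primary = c_out // 2                       # assumed 1:1 primary/ghost split
        self.primary = nn.Conv2d(c_in, c_primary, k, stride, k // 2, bias=False)
        self.ghost = nn.Sequential(                  # depthwise 3x3 + pointwise 1x1
            nn.Conv2d(c_primary, c_primary, 3, 1, 1, groups=c_primary, bias=False),
            nn.Conv2d(c_primary, c_out - c_primary, 1, 1, 0, bias=False),
        )
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        primary = self.primary(x)                    # small set of main features
        ghost = self.ghost(primary)                  # cheap ghost features
        return self.act(self.bn(torch.cat((primary, ghost), dim=1)))
```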
3.2.3. ADown
In the process of extracting object features in YOLO networks, multiple downsampling operations are typically performed. This can lead to a weakening of the object’s feature information. Therefore, an efficient downsampling module is crucial for the network’s feature extraction. This paper adopts the ADown [
23,
24] module as a substitute for the conventional backbone feature extraction and downsampling processes. The core objective of this module is to execute downsampling while simultaneously amplifying the receptive field of the feature maps, thereby enabling the capture of information across a broader spatial extent. This approach not only reduces the parameters of the model but also enhances its accuracy. It utilizes two different paths to process the input data.
One pathway applies spatial downsampling to half of the input channels followed by convolution, which aids in preserving richer contextual information;
The other pathway performs max pooling on the remaining half of the channels before applying convolution, enabling the extraction of localized maximum feature values.
The ADown module’s architecture is illustrated in
Figure 5. This configuration enables the network to capture finer details during the downsampling process. By employing convolution kernels of varying sizes and diverse pooling techniques, the module enhances the richness of the feature representation. Ultimately, the outputs from the two branches are concatenated, yielding a comprehensive feature representation that effectively combines information derived from distinct receptive fields.
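A minimal sketch of this dual-path structure is shown below, assuming the YOLOv9-style ADown layout (initial 2 × 2 average pooling, a strided 3 × 3 convolution on one half of the channels, and a 3 × 3 max pool followed by a 1 × 1 convolution on the other half); the exact kernel sizes in Figure 5 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_silu(c_in, c_out, k, s, p):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class ADown(nn.Module):
    """Dual-path downsampling sketch: one half of the channels goes through a
    strided 3x3 convolution (context-preserving path), the other half through a
    max pool and a 1x1 convolution (local-maximum path), then concatenation."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.cv1 = conv_bn_silu(c_in // 2, c_out // 2, 3, 2, 1)
        self.cv2 = conv_bn_silu(c_in // 2, c_out // 2, 1, 1, 0)

    def forward(self, x):
        x = F.avg_pool2d(x, 2, 1, 0)              # light spatial smoothing first
        x1, x2 = x.chunk(2, dim=1)                # split channels between the paths
        x1 = self.cv1(x1)                         # strided conv: richer context
        x2 = self.cv2(F.max_pool2d(x2, 3, 2, 1))  # keep strongest local responses
        return torch.cat((x1, x2), dim=1)
```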
3.2.4. Attention Module
In object detection tasks, it is imperative for the model to effectively capture both local, fine-grained features and global contextual information. In scenarios with intricate background environments, an inadequate comprehension of the global context or insufficient modeling of global information can impede the model’s capability to differentiate between target objects and the background or neighboring objects. This deficiency may lead to the erroneous classification of background regions as objects or the loss of critical object features within a complex surrounding background, consequently diminishing the detection accuracy.
Figure 6 depicts the architectural details of the C2PSF and focal modulation modules. This paper replaces the existing attention mechanism with the focal modulation module, combining it with the C2 structure to form the C2PSF module. This modification aims to bolster the model’s multiscale representation capacity, feature extraction proficiency, and ability to process global contextual information, particularly for detecting small objects and handling complex background scenarios. The focal modulation mechanism consists of three key components: focal contextualization, gated aggregation, and element-wise affine transformation. Focal contextualization is accomplished through a sequence of depthwise convolutional layers that encode visual context across a spectrum of spatial ranges, from short to long distances. These layers effectively extract visual features at varying scales, enabling the network to comprehend multiple levels of image content. Gated aggregation selectively gathers contextual information for each query token and incorporates it into the modulator via a gating mechanism, allowing the model to efficiently consolidate pertinent contextual features. This step enhances the model’s capability to differentiate objects from the background while preserving sensitivity to local details and improving its grasp of the global structure. The element-wise affine transformation embeds modulation information into the query, enabling the model to dynamically adjust query tokens based on contextual cues. This refinement strengthens the model’s capacity to capture and represent critical visual features, thereby elevating detection performance in intricate or dynamic visual settings.
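A simplified sketch of the focal modulation core used inside C2PSF is given below, following the published FocalNet formulation; the number of focal levels, the kernel growth factor, and the omission of normalization and dropout are simplifications rather than the exact C2PSF configuration.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Focal modulation sketch: hierarchical depthwise convolutions build
    short-to-long-range context, gates aggregate it per query token, and the
    resulting modulator rescales the query element-wise."""
    def __init__(self, dim, focal_window=3, focal_level=2, focal_factor=2):
        super().__init__()
        self.focal_level = focal_level
        self.f = nn.Linear(dim, 2 * dim + (focal_level + 1))  # query, context, gates
        self.h = nn.Conv2d(dim, dim, kernel_size=1)            # modulator projection
        self.proj = nn.Linear(dim, dim)                        # output projection
        self.act = nn.GELU()
        self.focal_layers = nn.ModuleList()
        for level in range(focal_level):
            k = focal_factor * level + focal_window            # growing kernel size
            self.focal_layers.append(nn.Sequential(
                nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=False),
                nn.GELU(),
            ))

    def forward(self, x):                                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.f(x), (C, C, self.focal_level + 1), dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                          # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)                      # (B, L+1, H, W)
        ctx_all = 0
        for level, layer in enumerate(self.focal_layers):
            ctx = layer(ctx)                                   # short- to long-range context
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]  # gated aggregation
        ctx_global = self.act(ctx.mean(dim=(2, 3), keepdim=True))
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_level:]
        modulator = self.h(ctx_all)                            # element-wise modulator
        out = q * modulator.permute(0, 2, 3, 1)                # modulate the query
        return self.proj(out)
```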
Furthermore, the SCSA module is incorporated between the feature pyramid and the detection head. The SCSA module comprises two sub-modules, SMSA and PCSA, which are integrated in a sequential, modular fashion. The SMSA sub-module extracts comprehensive semantic features from the multiscale spatial information provided by the feature pyramid, employing a progressive compression strategy to infuse this spatial information into the PCSA sub-module. Subsequently, PCSA efficiently computes inter-channel similarities, mitigating the semantic disparities among the sub-features derived from SMSA. This collaborative mechanism enhances the model’s feature representation capacity, thus improving the detection accuracy in complex scenarios. The architectural design of the SCSA module is illustrated in
Figure 7.
Focal modulation replaces traditional self-attention mechanisms to achieve multiscale contextual fusion with linear computational complexity, thereby addressing scale variations of underwater targets.
SCSA focuses on cross-channel relationship modeling and spatial detail enhancement, specifically designed to mitigate background interference and small object omission.
The two modules employ a stage-wise processing strategy to prevent functional overlap: C2PSF performs multiscale context fusion within the backbone, while SCSA refines spatial and channel information immediately before the detection head.
4. Experiment
4.1. Datasets
To demonstrate the efficacy of the proposed approach, we performed comprehensive experiments on the UTDAC2020, DUO, and RUOD datasets. The datasets were divided into training, testing, and validation sets in a ratio of 8:1:1.
The UTDAC2020 dataset comprises 7383 annotated images spanning four marine species categories: sea urchins, starfish, sea cucumbers, and scallops. The dataset exhibits significant resolution diversity, with image dimensions ranging from high-definition (3840 × 2160, 2560 × 1440) to standard resolutions (2048 × 1536, 1920 × 1080) and smaller formats (720 × 405, 704 × 576, 586 × 480). The detailed information of the dataset is shown in
Figure 8.
The DUO dataset contains 7782 annotated underwater images representing four distinct marine organism categories: needle urchins, sea cucumbers, scallops, and starfish. The image collection spans multiple resolution standards, including ultra-high definition (3840 × 2160), full HD (1920 × 1080), and various lower resolutions (ranging from 720 × 405 to 586 × 480 pixels). The detailed information of the dataset is shown in
Figure 9.
The RUOD dataset contains 14,000 annotated underwater images representing 10 categories of marine life: sea cucumbers, sea urchins, scallops, starfish, fish, corals, divers, cuttlefish, turtles, and jellyfish. The images are captured across a variety of resolution standards, including ultra HD (3840 × 2160) and full HD (1920 × 1080), with more than 3048 distinct resolutions in total; only several common resolutions are listed in Figure 10. The detailed information of the dataset is shown in
Figure 10.
4.2. Experimental Setting
In this study, the experimental configuration is specified as follows: the operating system is Windows, the processor is an Intel® Core™ i9-14900KF (Santa Clara, CA, USA), the graphics card is an NVIDIA GeForce RTX 4070 Ti SUPER (16 GB; Santa Clara, CA, USA), and the memory is 64 GB. The proposed methods were implemented in Python 3.9.20 and experimentally validated using PyTorch 2.3.1 with CUDA 12.1. Given GPU memory constraints, we configured a batch size of 16. This study employs the stochastic gradient descent (SGD) optimizer with momentum because it uses only one or a small number of samples at a time to update the model, resulting in fast computation and low memory consumption. At the same time, the randomness of SGD gradient updates helps the model avoid local optima and improves its generalization ability.
To mitigate potential model overfitting, we conducted preliminary experiments to determine the optimal number of training epochs for each dataset. Based on the observed convergence behavior of the loss function, we fixed the training duration at 100 epochs for both the UTDAC2020 and DUO datasets, as further training beyond this point did not yield significant improvements. Furthermore, Mosaic data augmentation was employed, providing an additional safeguard against overfitting.
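A hedged sketch of this training setup using the Ultralytics API is shown below; the configuration file names are placeholders rather than files associated with this work, and the momentum value is an assumed default rather than a figure reported in the text.

```python
# Sketch of the training configuration described above (Ultralytics API).
# "yolo-dafs.yaml" and "utdac2020.yaml" are hypothetical placeholder names.
from ultralytics import YOLO

model = YOLO("yolo-dafs.yaml")   # custom model definition derived from YOLOv11n
model.train(
    data="utdac2020.yaml",       # dataset split 8:1:1 into train/test/val
    epochs=100,                  # fixed after preliminary convergence checks
    batch=16,                    # limited by GPU memory (16 GB RTX 4070 Ti SUPER)
    optimizer="SGD",             # SGD with momentum, as stated in the text
    momentum=0.937,              # assumed default momentum value
    mosaic=1.0,                  # Mosaic augmentation enabled
    device=0,
)
```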
4.3. Experimental Metrics
In this study, the evaluation metrics employed include the mean average precision (mAP), the number of parameters (Params), and Giga floating-point operations (GFLOPs). These metrics collectively gauge the overall efficacy of the detection approach across all categories. For the assessment of mAP, two distinct Intersection over Union (IoU) threshold settings were utilized: mAP50 and mAP50:95.
To compute mAP, precision (P) and recall (R) need to be calculated, and the mathematical formulation can be expressed as follows:
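In their standard form, consistent with the term definitions that follow, precision and recall are computed as
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$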
where TP (true positive) indicates how many positive samples are accurately identified, FP (false positive) corresponds to the number of negative samples falsely classified as positive, TN (true negative) represents the count of negative samples accurately identified, and FN (false negative) denotes the number of positive samples mistakenly classified as negative.
The formula for the mAP is given as follows:
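In standard form, with the per-category AP taken as the area under the precision–recall curve, this is
$$AP_i = \int_{0}^{1} P_i(R)\,\mathrm{d}R, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i.$$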
In Equation (4), APᵢ denotes the average precision (AP) for the i-th category, with N representing the total number of categories within the dataset. The AP quantifies the model’s performance for one specific category, whereas the mAP is computed as the average of the AP values across all categories, offering a unified metric to evaluate the model’s overall effectiveness.
The Params refers to the total number of trainable parameters in the network, representing the computational space complexity. The FLOPs count measures the computational time complexity, indicating the total number of floating-point operations required by the model, expressed here in billions (GFLOPs). This should be distinguished from GFLOPS, a hardware throughput measure denoting billions of floating-point operations executed per second; numerically, one GFLOPS corresponds to one GFLOP performed in one second.
The formulas for computing the Params and floating-point operations (FLOPs) for a given convolutional layer are as follows:
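In their common per-layer form (bias terms omitted; conventions that count a multiply–accumulate as two operations differ only by a constant factor), these are
$$\mathrm{Params} = K_h \times K_w \times C_{in} \times C_{out},$$
$$\mathrm{FLOPs} = K_h \times K_w \times C_{in} \times C_{out} \times H \times W.$$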
In Equation (5), K represents the kernel size of the convolution, where h and w denote the height and width of the kernel, respectively. C represents the number of channels, with in and out indicating the number of input and output channels, respectively. In Equation (6), H and W denote the height and width of the input tensor.
4.4. Results
To examine the efficiency of the proposed algorithm, we performed ablation studies on the YOLO-DAFS model utilizing the UTDAC2020 and DUO datasets. Additionally, we benchmarked our approach against cutting-edge methods using the same UTDAC2020 and DUO datasets. Through comprehensive experimentation, we substantiated the effectiveness of the proposed methodology.
4.4.1. Ablation Experiment
We conducted ablation experiments on both the UTDAC2020 and DUO datasets to evaluate the efficacy of the proposed enhancements. As presented in
Table 1 and
Table 2, method A denotes the integration of the C3k2-Dual module, B signifies the adoption of the DyHead-GDC detection head, C indicates the incorporation of the ADown module, and D corresponds to the implementation of the C2PSF module alongside the SCSA attention mechanism.
Figure 11 shows the comparison of the precision-recall curves of YOLO-DAFS on the UTDAC2020 and DUO datasets.
To rigorously evaluate the individual and joint effects of C2PSF and SCSA, we conducted controlled experiments on the UTDAC2020 and DUO datasets by gradually integrating each module into YOLOv11n + A + B + C. The relevant ablation results are shown in
Table 3. When deployed alone, C2PSF mainly performs multiscale context fusion, but mAP50 showed a slight degradation of 0.5% and 0.5% on the two datasets, and mAP50:95 degraded by 0.1% and 0.8%, respectively. When deployed alone, SCSA mainly mitigates background interference and small-target misses and reduces semantic differences between sub-features, but mAP50 decreased by 0.4% and 0.9% on the two datasets, and mAP50:95 decreased by 0% and 0.8%, respectively. However, their combination showed synergistic performance: the full model (A + B + C + D) achieved mAP50:95 gains of 0.7% and 0.2%, confirming that C2PSF’s multiscale context normalization and SCSA’s spatial-channel refinement are complementary rather than overlapping. In practice, the backbone features processed by C2PSF provide scale-invariant priors for SCSA’s subsequent noise suppression and detail recovery ahead of the detection head. This pipeline specialization is critical for underwater scenes, further validating the effectiveness of the staged attention design.
4.4.2. Visualization
The qualitative comparison of visualization outcomes is depicted in
Figure 12. In this figure, (a) illustrates the ground truth images, featuring bounding boxes and category annotations that are either manually or algorithmically generated to assess the precision of the detection algorithm. Subfigure (b) displays the detection outcomes of the YOLOv11n algorithm, whereas (c) showcases the detection results of the YOLO-DAFS algorithm (our proposed method).
From the visual analysis, it is evident that YOLO-DAFS surpasses YOLOv11n in detection accuracy for both single-object and multi-object scenarios, exhibiting greater robustness and consistency. Conversely, YOLOv11n demonstrates instances of false positives and missed detections in specific cases, while YOLO-DAFS achieves more precise target identification and a reduced false positive rate. Furthermore, under challenging environmental conditions—such as substantial lighting variations, significant occlusions, or pronounced background interference—YOLO-DAFS sustains superior detection accuracy. In contrast, YOLOv11n proves more vulnerable to such perturbations, resulting in less reliable detection performance. These advancements underscore the efficacy of YOLO-DAFS in addressing complex detection challenges.
To further elucidate the performance improvements of the proposed method in object detection tasks, we employed the LayerCAM algorithm for a heatmap-based visualization analysis, comparing the attention distributions of YOLO-DAFS and YOLOv11n.
LayerCAM leverages gradient information from feature layers to produce highly interpretable heatmaps, emphasizing the regions of focus within the model and facilitating an in-depth evaluation of enhancements in detection accuracy. The visualization results are presented in
Figure 13, where (a) depicts the ground truth images, annotated with manually or algorithmically defined bounding boxes and category labels for accuracy assessment. Subfigure (b) illustrates the heatmap visualization derived from the three-layer feature pyramid output of YOLOv11n using LayerCAM, while (c) presents the corresponding heatmap visualization for the three-layer feature pyramid output of YOLO-DAFS. The analysis reveals that YOLO-DAFS demonstrates a more focused attention distribution, precisely encompassing the target regions, thereby outperforming YOLOv11n. In contrast, YOLOv11n occasionally exhibits inadequate attention to target areas or erroneous focus on background elements in certain instances. These findings underscore the enhanced feature extraction and target discrimination capabilities of YOLO-DAFS, which enable more effective object localization and contribute to improved detection accuracy and reliability.
4.4.3. Comparison with Other Methods
To thoroughly assess the efficacy of the proposed method, we conducted comparisons with several established approaches using the UTDAC2020 and DUO datasets. As detailed in
Table 4, the proposed method exhibits competitive performance across multiple metrics, including mAP50, mAP50:95, model parameters, and floating-point operations (FLOPs), evaluated on both datasets.
In order to comprehensively verify the generalization ability of the model, this study uses an additional independent dataset, RUOD. This dataset not only adds new species categories not covered by the previous two datasets but also expands the diversity of the test environment through newly constructed collection scenarios. This rigorously designed validation framework provides a more reliable benchmark for evaluating the adaptability of models to real, complex scenarios.
The results of YOLO-DAFS on these datasets highlight its proficiency in underwater object detection. As evidenced in
Table 4, the proposed model not only delivers high detection accuracy but also retains an advantage in terms of model efficiency. This benefit can be attributed to the lightweight network architecture inherited from the original YOLO framework. In addition, the performance on the RUOD dataset also verifies the generalization ability of the proposed method beyond the UTDAC2020 and DUO datasets.
5. Conclusions
Research on underwater object detection holds substantial value for marine environment and resource development, conservation efforts, and biodiversity preservation. It empowers scientists and engineers to monitor marine organisms, underwater terrain, and environmental changes with greater efficiency and precision. Moreover, underwater object detection is pivotal in marine archaeology and military applications. In archaeological contexts, it facilitates the accurate identification of historical relics and artifacts, while in military scenarios, it enhances the effectiveness and security of underwater combat and defense systems.
In this study, we introduced a composite-enhanced model, YOLO-DAFS, built upon the YOLOv11 framework. This model incorporates a lightweight dual convolution mechanism into the C3k2 module, where one convolutional kernel captures spatial features and another aggregates channel information. A group convolution strategy was employed to optimize the arrangement of convolutional kernels, reducing the model complexity while maintaining robust network representation capabilities. Additionally, DyHead-GDC was implemented to improve the model’s perception across the scale, spatial, and task dimensions, thereby elevating the detection accuracy; combining DyHead with GDConv for lightweight processing reduces the number of parameters and the computational requirements while maintaining the performance of the dynamic detection head. A lightweight downsampling module, ADown, was introduced, combining convolutional kernels of varying sizes with pooling operations to enhance feature representation. Within the attention mechanism, the C2PSA module was replaced by a focal modulation-based C2PSF module, and the SCSA module was integrated prior to the detection head, further refining the model’s performance in visual tasks. The collaborative design of C2PSF and SCSA helps address the key challenges of underwater object detection: C2PSF handles object size changes through multiscale feature fusion, and SCSA enhances small object detection and suppresses background interference through joint spatial-channel attention. Ablation experiments show that the performance improvement when the two modules work together is significantly better than when either is used alone, verifying their complementarity. Experimental evaluations on the UTDAC2020 and DUO underwater datasets demonstrate that YOLO-DAFS surpasses the baseline YOLO model. In terms of lightweight optimization, the proposed approach refines the network architecture, achieving performance gains without a substantial increase in the parameter count. By optimizing weight-sharing mechanisms, streamlining feature extraction processes, and minimizing redundant parameters, the model incurs only an approximate increase of 0.6 M parameters and 0.5 GFLOPs compared to YOLOv11n, while delivering markedly improved detection performance.
In comparison to traditional detection methods, the lightweight YOLO-DAFS model maintains high accuracy with a reduced computational overhead, making it particularly advantageous for marine biology researchers studying marine organisms and their habitats. It also supports underwater rescue operations by enabling the rapid detection and localization of individuals requiring assistance. The model’s lightweight design renders it highly suitable for deployment on embedded edge computing platforms, such as Unmanned underwater vehicles (UUVs) and AUVs, thereby contributing significantly to marine exploration and monitoring tasks.
Despite its strong performance in underwater object detection, certain limitations remain. The detection accuracy may decline when underwater backgrounds closely resemble target objects or when dataset class distributions are imbalanced. Future work will explore the integration of pruning and knowledge distillation techniques to further reduce the model parameters and computational demands, aiming to deploy the compressed model on embedded devices and achieve a more efficient lightweight detection network. Additionally, advanced deep learning methodologies will be investigated to bolster the model’s robustness in complex underwater settings.
The proposed YOLO-DAFS model is poised to advance marine biology research by assisting scientists in detecting and analyzing marine organisms and their habitats. It also offers practical utility in underwater rescue missions by enabling the swift identification and localization of individuals in distress, thereby broadening its real-world applicability.
6. Prospect
While the proposed method yields encouraging outcomes in underwater object detection, opportunities for refinement persist. The model size has expanded by approximately 1.2 M parameters relative to the baseline model. In the future, we aim to further enhance the model by implementing techniques such as pruning and knowledge distillation to achieve greater lightweight efficiency. In addition, we plan to conduct embedded deployment tests of the pruned and knowledge-distilled model in environments with different turbidity, illumination, and color aberration conditions. Moreover, we plan to investigate more effective and resource-efficient algorithms and methodologies to optimize detection performance in increasingly challenging underwater environments, and we are optimistic about the potential of these efforts. We also intend to further test the generalization ability of the model to ensure its effectiveness.
Dependence on RGB visual data alone imposes constraints in complex underwater settings. A prominent future direction involves multimodal data fusion to bolster detection capabilities in low-visibility conditions. For example, combining RGB data with sonar (acoustic imaging) could enhance perception in murky waters, while integrating RGB with polarized light could improve the resolution of underwater target textures and edges.
In underwater applications, such as marine organism monitoring, seabed exploration, and shipwreck investigations, real-time object detection and tracking are essential. Potential enhancements include the integration of multi-object tracking (MOT) algorithms, such as ByteTrack, to improve the tracking reliability in underwater contexts. The adoption of transformer-based tracking could leverage historical frame data to refine detection outcomes. Additionally, optimizing real-time inference through low-power solutions, such as TensorRT and ONNX acceleration, could enhance the computational efficiency and enable low-latency object detection on underwater platforms.
The future of underwater object detection is expected to progress toward lightweight, intelligent, and multimodal fusion frameworks, incorporating adaptive feature extraction, data augmentation, and efficient tracking technologies. These developments will facilitate the broader deployment of underwater drones and intelligent monitoring systems. As computational hardware evolves and open-source ecosystems advance, underwater object detection will assume a pivotal role in marine ecosystem surveillance, underwater archaeology, and AUVs, offering robust support for ocean exploration and conservation efforts.