Article

WDS-YOLO: A Marine Benthos Detection Model Fusing Wavelet Convolution and Deformable Attention

Key Laboratory of Fisheries Information, Ministry of Agriculture and Rural Affairs, College of Information Technology, Shanghai Ocean University, Hucheng Ring Road 999, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3537; https://doi.org/10.3390/app15073537
Submission received: 26 February 2025 / Revised: 18 March 2025 / Accepted: 21 March 2025 / Published: 24 March 2025
(This article belongs to the Section Marine Science and Engineering)

Abstract

Accurate marine benthos detection is a technical prerequisite for underwater robots to achieve automated fishing. Considering the challenges of poor underwater imaging conditions during the actual fishing process, where small objects are easily occluded or missed, we propose WDS-YOLO, an advanced model designed for marine benthos detection, built upon the YOLOv8n architecture. First, a convolutional module incorporating the wavelet transform was used to enhance the backbone network, expanding the model's receptive field and strengthening its feature extraction ability for marine benthos objects under low-visibility conditions. Second, we designed the DASPPF module by integrating deformable attention, which dynamically adjusts the attention domain to enhance feature relevance to targets, reducing interference from irrelevant information and better adapting to marine benthos shape variations. Finally, the SF-PAFPN feature fusion structure was designed to enhance the model's ability to detect smaller objects while mitigating false positives and missed detections. The experimental results demonstrated that the proposed method achieved 85.6% mAP@50 on the URPC dataset, a 2.1 percentage point improvement over the YOLOv8n model. Furthermore, it outperformed several mainstream underwater object detection algorithms while achieving a detection speed of 104.5 fps. These results offer significant technical guidance for advancing intelligent fishing systems powered by underwater robotic technologies.

1. Introduction

In recent years, the demand for nutrient-rich marine benthos, including holothurian, echinus, scallop and starfish, has surged, driving the rapid development of the aquaculture industry. However, current marine benthos fishing primarily relies on manual operations, which suffer from high costs, low efficiency, and significant safety risks [1]. In this large-scale industry, the use of underwater robots for automated fishing has emerged as a future trend [2,3], offering cost reduction and enhanced safety and productivity, as well as marine ecosystem protection. A prerequisite for intelligent fishing technology is the accurate detection of marine benthos in complex underwater environments, where holothurians visually blend with sandy substrates, echinus densely aggregate with overlapping spines, scallops are partially buried under marine sediments, and starfish display diverse polymorphic features. These unique underwater complexities challenge conventional computer vision systems; thus, research on object detection methods for marine benthos holds significant research implications and practical application value.
The field of object detection has progressed through three main paradigms: two-stage, one-stage, and transformer-based methods, each tackling unique challenges in accuracy, efficiency, and scalability. Two-stage detectors [4,5], such as Faster R-CNN [6] and Cascade R-CNN [7], attain high precision by initially generating region proposals and subsequently refining them through classification and regression. However, their dependence on proposal generation and multi-stage processing leads to substantial computational overhead, restricting their applicability in real-time applications. Transformer-based methods, exemplified by RT-DETR [8], utilize self-attention mechanisms to model the global context and eliminate handcrafted components such as anchors. Despite their superior performance in complex scenarios, these models often demand extensive computational resources and large-scale training data, presenting challenges for deployment on resource-constrained devices.
In contrast, one-stage detectors prioritize efficiency and simplicity, making them ideal for real-time applications. SSD [9] pioneered this paradigm by directly predicting bounding boxes and class scores from multi-scale feature maps, optimizing the trade-off between speed and precision. The YOLO series [10,11,12,13,14,15,16,17,18,19,20] has been at the forefront of one-stage detectors: YOLOv3 introduced multi-scale predictions with Darknet-53, while YOLOv7 enhanced feature extraction through E-ELAN networks and dynamic re-parameterized convolutions. YOLOv8 further advanced the field by adopting an anchor-free detection head and the C2f module, enabling multi-task integration with state-of-the-art performance in detection, segmentation, and classification. These innovations highlight the advantages of single-stage methods, including reduced computational complexity, faster inference speeds, and easier deployment on edge devices.
Recent advancements have seen the development of numerous novel algorithms aimed at tackling the complexities of underwater object detection, including image blur, color distortion, dense small objects, and dynamic environmental interference. Chen et al. [21] developed the SWIPENet+IMA framework, which enhances small object detection through high-resolution super-feature maps and reduces the impact of heterogeneous noise in underwater images using the Inverse Multi-class Adaboost algorithm. Lin et al. [22] designed the SA-FPN, which uses a special backbone subnetwork to extract fine-grained features and incorporates Soft Non-Maximum Suppression (Soft-NMS) to optimize dense bounding box filtering. Xu et al. [23] proposed a multi-scale feature pyramid and context enhancement mechanism, which fuses semantic information through a top-down up-sampling pathway, significantly improving the robustness of multi-scale underwater object detection. Qi et al. [24] developed a two-stage detection network based on Deformable Convolution Pyramid (DCP), which uses deformable convolution to adaptively match target deformations and occlusion features, combined with a phased learning strategy to enhance multi-domain generalization capability. Liu et al. [25] embedded an adaptive convolutional kernel selection unit and a deconvolutional feature pyramid into the Faster R-CNN framework, effectively addressing the missed detection of small underwater targets. Fu et al. [26] integrated the Transformer mechanism into YOLOv4, enhancing feature discrimination in complex underwater backgrounds through self-attention mechanisms. Zhang et al. [27] designed a lightweight feature fusion module based on MobileNet v2, optimizing model deployment efficiency on low-computation devices. Liu et al. [28] enhanced YOLOv5 by incorporating the CBAM and CRFPN networks, which improved the feature representation of blurred underwater targets. Wen et al. [29] enhanced shallow feature extraction in YOLOv5s through the synergistic use of Coordinate Attention (CA) and Squeeze-and-Excitation (SE) modules. Zhang et al. [30] combined EfficientNetV2-S with Bottleneck Transformer to optimize a lightweight YOLOv5, balancing detection accuracy and computational cost. Yi et al. [31] improved the FPN structure and integrated the SENet mechanism with YOLOv7, strengthening the multi-scale feature association of small underwater objects. Zhang et al. [32] used the RTMDet backbone network and BoT3 module to enhance the context modeling capability of YOLOv5. Liu et al. [33] proposed the TC-YOLO model, which uses Transformer and Coordinate Attention to collaboratively model the spatial relationships of underwater objects. Wang et al. [34] designed lightweight ODConv and GSConv modules based on YOLOv6, reducing the computational complexity of complex underwater scenarios. Liu et al. [35] improved the YOLOv7 network architecture by optimizing the E-ELAN structure with the ACmixBlock module and skip connections, and introduced the Global Attention Mechanism (GAM) to enhance target discrimination in complex underwater environments. Zhou et al. [36] constructed Cross-Stage Multi-Branch (CSMB) and Large Kernel Spatial Pyramid (LKSP) modules in YOLOv8, improving multi-scale underwater object detection performance. Guo et al. [37] used FasterNet to optimize the feature pyramid of YOLOv8, achieving high-frame-rate real-time underwater detection. Qu et al. 
[38] designed the LEPC module and AP-FasterNet architecture, which preserves small object spatial details through Content-Aware ReAssembly (CARAFE) up-sampling. Pan et al. [39] improved YOLOv9s feature fusion based on the Dual Dynamic Token Mixer (D-Mixer), enhancing feature consistency for dynamic blurry objects. Sun et al. [40] developed an enhanced YOLOX framework by integrating MobileViT and Dual Coordinate Attention (DCA), which enhances global feature extraction efficiency through a lightweight ViT backbone network and strengthens shallow feature representation with the DCA mechanism, compressing model parameters by 49.6% while maintaining detection accuracy, significantly meeting the low-computation requirements of underwater unmanned platforms.
To address these issues, this paper proposes a marine benthos object detection algorithm, WDS-YOLO, based on YOLOv8, with three main contributions:
  • WTConv was used to expand the receptive field, enhancing the model’s ability to extract and represent features in complex underwater environments. This capability is critical for robotic systems to accurately locate partially buried scallops and starfish in sandy substrates, contributing to enhanced fishing efficiency.
  • A Deformable Attention-based Spatial Pyramid Pooling Fast (DASPPF) module was designed, which dynamically adjusts the network’s attention to objects, reducing interference from complex backgrounds in detection tasks. This module minimizes false positives caused by overlapping objects, supporting more accurate harvesting operations.
  • To address the issue of small object feature information being easily lost in deep network layers, the SF-PAFPN feature fusion module was designed. It enhances the model’s ability to fuse small object features without significantly increasing the computational load, thereby improving the detection capability for smaller marine benthos objects. This improvement has the potential to optimize resource utilization and support more efficient harvesting practices.

2. Related Works

2.1. YOLOv8 Object Detection Network

The YOLO family of models is widely recognized as one of the most frequently employed one-stage object detection frameworks. YOLOv8, released by Ultralytics in 2023, delivers both high detection accuracy and fast inference. It is available in five configurations (n, s, m, l, and x), covering a range of model sizes. Given the constrained hardware capabilities of underwater robotic devices, we selected YOLOv8n, the smallest and fastest variant, as the baseline model.
The YOLOv8 architecture consists of four fundamental components: Input, Backbone, Neck, and Head. The Input stage preprocesses the input images, applying data augmentation and other techniques to increase data diversity. The Backbone, tasked with feature extraction, comprises Conv, C2f, and SPPF modules. The Neck, consisting of a Feature Pyramid Network (FPN) [41] and a Path Aggregation Network (PAN) [42], improves the integration and use of multi-scale feature layers, yielding a more comprehensive feature representation. The detection head incorporates three layers at different scales to predict objects of varying sizes, with parallel branches for classification and regression. The network architecture of YOLOv8 is illustrated in Figure 1.

2.2. Attention Mechanism

In research on attention mechanisms for visual tasks, various methods have been developed to optimize feature representation across different dimensions. Conventional channel attention approaches, such as Squeeze-and-Excitation (SE) [43], capture inter-channel relationships by utilizing global pooling and fully connected layers, but their neglect of spatial information limits localization capabilities in complex scenes. Subsequent improvements, such as Coordinate Attention (CA) [44], attempt to embed spatial coordinate information into channel weights, while Efficient Channel Attention (ECA) [45] reduces parameters through 1D convolution. However, these methods are still constrained by static weight allocation patterns, making it difficult to adapt to geometric deformations or occlusions of targets. Spatial-channel joint attention mechanisms, such as the Convolutional Block Attention Module (CBAM) [46], integrate both channel-wise and spatial attention, but their reliance on fixed pooling operations to generate spatial weights limits their ability to model irregular targets. The parameter-free SimAM [47] module implicitly models feature importance through an energy function, but its static attention mechanism lacks flexibility in dynamic scenes.
In recent years, dynamic sparse attention mechanisms have gradually become a research focus. For example, Efficient Multi-Scale Attention (EMA) [48] enhances multi-scale feature fusion through channel grouping and cross-dimensional interaction, but its fixed grouping strategy may not adapt to dramatic changes in target scale; Bi-Level Routing Attention (BRA) [49] dynamically selects relevant regions through two-level routing, but the routing process still relies on predefined sparse patterns. In contrast, deformable attention mechanisms [50] dynamically adjust feature sampling positions through learnable spatial offsets, achieving adaptive focus on input content. The core idea is to use the network to predict offsets for feature sampling positions, enabling attention computation to bypass irrelevant background regions (such as underwater suspended particles or lighting interference) and directly capture key local features of targets. This dynamic deformation capability gives it unique advantages in complex underwater scenarios: on the one hand, through offset learning, deformable attention can compensate for geometric deformations caused by water refraction or target motion, enhancing localization robustness for blurred or distorted targets; on the other hand, its sparse attention mechanism performs intensive computation only in predicted key regions, effectively suppressing noise interference from complex underwater backgrounds. Compared with traditional attention mechanisms that rely on fixed patterns (such as CA's coordinate encoding or EMA's multi-scale grouping), the data-driven deformation characteristics of deformable attention are better suited to the non-rigid deformations, low contrast, and occlusions of underwater targets, providing a more universal feature enhancement solution for underwater detection tasks.

3. Methods

While YOLOv8 demonstrates robust performance in general object detection, its direct application to underwater marine benthos detection yields suboptimal results due to significant blurring and occlusion phenomena in aquatic environments. Persistent issues of false positives and missed detections indicate considerable room for improvement in the model. To address these limitations and enhance detection efficacy in complex underwater environments, we propose the WDS-YOLO algorithm based on YOLOv8n, which incorporates three key enhancements:
First, the wavelet-transform-based cascaded convolution module WTConv was introduced to enlarge the receptive field and strengthen the feature extraction capability of the backbone network. Second, a Deformable Attention-based SPPF (DASPPF) module was developed to dynamically capture the geometric variations of marine benthos objects while suppressing irrelevant features through deformable attention mechanisms. Finally, the original neck structure was re-engineered as SF-PAFPN by integrating the Focus module with the novel CSPOKM module, enabling effective multi-level feature fusion to boost small-object detection. The complete architecture of WDS-YOLO is illustrated in Figure 2.

3.1. Feature Extraction Module: C2f-WTConv

Owing to underwater light scattering and attenuation, captured images often suffer from color distortion and reduced contrast, which complicates feature identification and introduces additional challenges for marine benthos feature extraction. Additionally, due to their habitat characteristics, marine benthos are often concealed by protective coloration, making boundary information more important than surface color for feature extraction. This calls for a feature extraction network with an expanded receptive field. The C2f structure in YOLOv8's backbone network employs fixed-size convolutional kernels for feature extraction, limiting its ability to effectively capture the boundary information of marine benthos objects.
To address this issue, we introduced the Wavelet Convolution (WTConv) module [51], a convolutional structure based on the Wavelet Transform (WT). This module aims to expand the receptive field through hierarchical frequency decomposition while avoiding the issue of excessive parameters.
The core idea of WTConv is to apply a cascade of WTs to recursively decompose the input signal into different frequency components, perform small-kernel convolution operations on each frequency component, and finally merge the results through the Inverse Wavelet Transform (IWT). Specifically, WTConv employs the 2D Haar WT, performing depth-wise convolution on the input with four distinct filters: a low-pass filter ($f_{LL}$) captures the global low-frequency information of the image, while horizontal ($f_{LH}$), vertical ($f_{HL}$), and diagonal ($f_{HH}$) high-pass filters preserve its local high-frequency details.
By cascading the application of WT, the WTConv module can generate frequency components at multiple scales. Specifically, at each level of the wavelet decomposition, the low-frequency component from the previous level ($X_{LL}^{(i-1)}$) is further decomposed, yielding a new low-frequency component ($X_{LL}^{(i)}$) and corresponding high-frequency components ($X_{H}^{(i)}$). This hierarchical decomposition effectively emphasizes low-frequency information, thereby enhancing the model's responsiveness to such features. Then, at each level of frequency components, the WTConv module performs a small-kernel depth-wise convolution. The process is expressed as follows:
$$X_{LL}^{(i)},\, X_{H}^{(i)} = \mathrm{WT}\left(X_{LL}^{(i-1)}\right)$$
$$Y_{LL}^{(i)},\, Y_{H}^{(i)} = \mathrm{Conv}\left(W^{(i)}, \left(X_{LL}^{(i)}, X_{H}^{(i)}\right)\right)$$
where $W^{(i)}$ represents the weights of the depth-wise convolution kernel at level $i$, while $Y_{LL}^{(i)}$ and $Y_{H}^{(i)}$ represent the low-frequency and high-frequency outputs, respectively, after the convolution operation.
Since WT reduces the spatial resolution of each sub-band, small convolutional kernels can cover larger regions of the image, thereby significantly expanding the receptive field without a substantial increase in the number of parameters.
After the convolution operation, the WTConv module reconstructs the convolution results of each frequency component into the original spatial domain using the IWT. Leveraging the linear properties of the IWT, it efficiently integrates information from multi-level convolutions, preserving multi-frequency features while achieving convolution operations with large receptive fields. The equation is as follows:
$$Z^{(i)} = \mathrm{IWT}\left(Y_{LL}^{(i)} + Z^{(i+1)},\, Y_{H}^{(i)}\right)$$
where $Z^{(i)}$ represents the aggregated output from level $i$ and above.
Figure 3 illustrates an example of the WTConv module employing a 2-level WT and a 3 × 3 convolutional kernel. The 2-level wavelet decomposition can provide a sufficiently large receptive field without excessively increasing computational complexity, while the 3 × 3 convolutional kernel effectively captures local features across each frequency component.
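To make this decompose-convolve-reconstruct flow concrete, the following is a minimal single-level sketch in PyTorch, assuming orthonormal 2D Haar filters, 3 × 3 depth-wise convolutions on the sub-bands, and a parallel depth-wise convolution on the original resolution; the published WTConv module [51] additionally cascades several decomposition levels as described above and includes further refinements that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels: int) -> torch.Tensor:
    """Return (4*channels, 1, 2, 2) orthonormal Haar analysis filters (LL, LH, HL, HH)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh]).unsqueeze(1)          # (4, 1, 2, 2)
    return bank.repeat(channels, 1, 1, 1)                      # one Haar bank per channel

class SimpleWTConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.register_buffer("wt", haar_filters(channels))
        # small-kernel depth-wise convolution applied to the four sub-bands of every channel
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=kernel_size // 2, groups=4 * channels, bias=False)
        # depth-wise convolution on the original resolution (spatial branch)
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size,
                                      padding=kernel_size // 2, groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # WT: depth-wise stride-2 convolution with the Haar bank halves the resolution
        sub = F.conv2d(x, self.wt, stride=2, groups=self.channels)
        sub = self.subband_conv(sub)                            # small-kernel conv per sub-band
        # IWT: transposed convolution with the same orthonormal bank restores the resolution
        rec = F.conv_transpose2d(sub, self.wt, stride=2, groups=self.channels)
        return rec + self.spatial_conv(x)                       # merge with the spatial branch

x = torch.randn(1, 64, 80, 80)
print(SimpleWTConv(64)(x).shape)   # torch.Size([1, 64, 80, 80])
```

Because the sub-band convolutions operate at half the spatial resolution, the 3 × 3 kernels effectively cover a larger area of the input, which is the source of the enlarged receptive field discussed above.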
To enhance the model's ability to focus on the boundary details of marine benthos, we incorporated WTConv into the backbone network's C2f module, creating the C2f-WTConv block shown in Figure 4. The enhancement procedure is as follows. First, we replaced the standard convolutional layers in the original Bottleneck with WTConv, broadening the receptive field without a substantial increase in parameters and thereby strengthening the module's feature extraction capability; the resulting block is referred to as WT-Bottleneck. Second, we replaced all Bottleneck structures in the original C2f module with WT-Bottleneck, preserving the multi-branch gradient flow of the original C2f to balance feature extraction efficiency and computational cost; the enhanced module is designated C2f-WTConv. To keep the representation of early features consistent, the first two stages of the backbone network were left unchanged and continue to use the standard C2f module. As the network deepens, features require greater abstraction and discriminative power to handle complex underwater detection tasks. Consequently, we substituted the final two C2f modules in the backbone network with C2f-WTConv, allowing the network to capture more extensive contextual information during high-level feature processing and thus improving the localization and identification of marine benthos objects across diverse scales.
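The listing below is a hedged reconstruction of the WT-Bottleneck and C2f-WTConv structure described in this paragraph. It reuses SimpleWTConv from the previous listing, and the channel handling of the real Ultralytics C2f block (shortcut flags, expansion ratios) is simplified for illustration; class names are ours, not the authors' released code.

```python
import torch
import torch.nn as nn

# SimpleWTConv is the wavelet convolution sketched in the previous listing.

class WTBottleneck(nn.Module):
    """Bottleneck whose convolutions are replaced by wavelet convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.cv1 = SimpleWTConv(channels)
        self.cv2 = SimpleWTConv(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # residual connection, as in the original Bottleneck
        return x + self.act(self.cv2(self.act(self.cv1(x))))

class C2fWTConv(nn.Module):
    """C2f-style block: split the channels, run n WT-Bottlenecks, concatenate all branches."""
    def __init__(self, c_in: int, c_out: int, n: int = 2):
        super().__init__()
        c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * c, 1)
        self.blocks = nn.ModuleList(WTBottleneck(c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))          # multi-branch gradient flow of C2f
        return self.cv2(torch.cat(y, dim=1))
```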

3.2. Deformable Attention-Integrated Spatial Pyramid Pooling Fast (DASPPF) Module

Due to factors such as water flow disturbance and changes in shooting angle, marine benthos objects often exhibit diverse forms, and some have protective coloration that is difficult to distinguish from the surrounding background. Moreover, non-uniform illumination and background clutter in complex underwater scenes weaken the interdependencies among pixels, obscuring critical information about marine benthos objects. Furthermore, Convolutional Neural Networks (CNNs) extract image features through restricted receptive fields limited to local regions, leading to inadequate contextual information.
We designed the DASPPF module using deformable attention [50] to allow the detection model to disregard irrelevant background and concentrate on pertinent marine benthos features. It performs attention computation on the feature maps aggregated by the Spatial Pyramid Pooling Fast structure to capture long-range relationships, thereby strengthening the correlations among feature elements and extracting more representative features. In conventional self-attention mechanisms [52], each query must attend to all possible keys in the image space to compute attention weights, which increases model parameters and computational cost. Moreover, since most marine benthos objects occupy only a small fraction of an image, this approach inevitably introduces a large amount of extraneous information that interferes with accurate detection.
Deformable attention combines the sparse spatial sampling ability of deformable convolution with the capacity of self-attention to model data correlations effectively. Its primary advantage lies in its strong ability to express the global characteristics of the data. It can adaptively and dynamically adjust the attention domain of each query and re-weight feature importance according to input variations, enhancing the relevance of the feature representation to target objects and thereby efficiently suppressing irrelevant information and background noise. Figure 5 illustrates the principle of deformable attention.
For a specified input feature $x \in \mathbb{R}^{H \times W \times C}$, a set of reference points $p \in \mathbb{R}^{H/r \times W/r \times 2}$ is generated (where $r$ is a predetermined parameter), while the input $x$ undergoes linear projection to obtain the query tokens $q$. The queries $q$ are then processed by the offset learning network $\theta_{\mathrm{offset}}$ to compute the relevant offsets $\Delta p$ for the reference points. The deformed points are computed using the offsets and reference points, and the sampled features $\tilde{x}$ are derived by bilinear interpolation of the input features at the coordinates of the deformed points. The formulas for the calculation are as follows:
$$q = x W_q$$
$$\Delta p = s \cdot \tanh\left(\theta_{\mathrm{offset}}(q)\right)$$
$$\tilde{x} = \phi\left(x;\, p + \Delta p\right)$$
where $W_q$ represents the projection matrix, $s$ is the range parameter used to prevent excessive offsets, and $\phi(\cdot\,;\cdot)$ is the bilinear interpolation sampling function.
Subsequently, the sampled features are projected to obtain the key $\tilde{k}$ and value $\tilde{v}$ embeddings:
$$\tilde{k} = \tilde{x} W_k$$
$$\tilde{v} = \tilde{x} W_v$$
Finally, the multi-head attention output is computed by incorporating the relative position offsets $R$, yielding the output features $z \in \mathbb{R}^{H \times W \times C}$:
$$z_m = \mathrm{softmax}\!\left(\frac{q_m \tilde{k}_m^{\top}}{\sqrt{d}} + \phi(\hat{B};\, R)\right)\tilde{v}_m, \quad m = 1, 2, \ldots, M$$
$$z = \mathrm{Concat}\left(z_1, z_2, \ldots, z_M\right) W_o$$
where $z_m$ corresponds to the output generated by the $m$-th head; $M$ and $d$ denote the number of heads and the dimension of each head, respectively; $\hat{B}$ is the relative position bias; and $W_o$ is the multi-head attention output projection matrix.
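The following is a minimal, single-head PyTorch sketch of this sampling-and-attention pipeline, assuming a down-sampling factor of $r = 2$ for the reference grid and omitting the relative position bias $\hat{B}$ and the multi-head split; in WDS-YOLO the mechanism is applied to the feature maps fused by SPPF, which is not shown here, and the layer names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    def __init__(self, dim: int, r: int = 2, offset_range: float = 2.0):
        super().__init__()
        self.r, self.s = r, offset_range
        self.q_proj = nn.Conv2d(dim, dim, 1)
        self.k_proj = nn.Conv2d(dim, dim, 1)
        self.v_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Conv2d(dim, dim, 1)
        # theta_offset: predicts a 2-D offset for every reference point from the queries
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=r, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.q_proj(x)                                        # q = x W_q
        # uniform reference grid in normalised [-1, 1] coordinates, (H/r) x (W/r) points
        ys = torch.linspace(-1, 1, H // self.r, device=x.device)
        xs = torch.linspace(-1, 1, W // self.r, device=x.device)
        ref = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (H/r, W/r, 2)
        # delta_p = s * tanh(theta_offset(q)), rescaled towards normalised coordinates
        offset = self.s * torch.tanh(self.offset_net(q))          # (B, 2, H/r, W/r)
        offset = offset.permute(0, 2, 3, 1) / torch.tensor([H, W], device=x.device, dtype=x.dtype)
        pos = (ref.unsqueeze(0) + offset).flip(dims=[-1])         # grid_sample expects (x, y)
        x_sampled = F.grid_sample(x, pos, align_corners=True)     # bilinear interpolation phi
        k = self.k_proj(x_sampled).flatten(2)                     # (B, C, Ns) sampled keys
        v = self.v_proj(x_sampled).flatten(2)                     # (B, C, Ns) sampled values
        attn = torch.softmax(q.flatten(2).transpose(1, 2) @ k / C ** 0.5, dim=-1)  # (B, HW, Ns)
        z = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
        return self.out_proj(z)                                   # output projection W_o
```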
We augmented the SPPF module with deformable attention (DA). Since SPPF can capture multi-scale features, the application of the deformable attention to the fused feature maps enhances the model’s reasoning for essential marine benthos attributes such as texture, color, and shape. This facilitates the effective capture of the non-rigid morphology of marine benthos objects, allowing for greater adaptation to their form variability, thereby enhancing flexible feature perception and significantly augmenting the model’s ability to represent features of marine benthos objects.
To assess the efficacy of DASPPF, we performed comparison studies by augmenting the SPPF module with diverse attention mechanisms. The comprehensive experimental methodologies are defined in Section 4.2.5, Sub-experiment 1.

3.3. Enhanced Neck Structure: SF-PAFPN

In marine benthos detection tasks, the small size of marine benthos and the considerable distance between the camera and targets during underwater data collection mean that most objects in the acquired images are categorized as small objects. Furthermore, the habitat characteristics of marine benthos frequently lead to dense clustering, causing mutual overlap and occlusion that obscure the detailed information of small objects in images, which complicates detection and results in missed and false detections.
Recent improvements for small object detection tasks typically add a P2 small object detection layer to preserve more feature information [53,54]. However, this approach incurs substantial computational cost and prolonged post-processing time. Consequently, to focus on the detailed characteristics of smaller marine benthos objects while maintaining an optimal trade-off between accuracy and efficiency, we developed the SF-PAFPN neck structure, which effectively fuses multi-level feature information. The fused features better describe smaller, occlusion-prone marine benthos objects, helping the model learn richer feature representations and improve detection performance.
The shallow feature maps from the P2 layer are first down-sampled via the Focus module and then fused with the output of P3, with the aim of reducing resolution while retaining small object feature information more efficiently. Following this, drawing on the CSP idea and the Omni-Kernel design [55], we developed the CSPOKM module to process the fused output described above, thereby enriching the representation of small object features. The input first passes through a 1 × 1 convolutional layer, after which a Split operation directs 25% of the channels to the Omni-Kernel module, while the remaining channels undergo a 1 × 1 convolution to limit computational complexity. Finally, the two branches are concatenated and processed through a 1 × 1 convolutional layer for feature integration.
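To make this data flow concrete, the following is a hedged sketch of the Focus down-sampling and the CSPOKM split described above. The heavy branch is injected as a module argument (for example, the simplified OmniKernel sketched after the next paragraph, or nn.Identity() as a placeholder), and all class and layer names are illustrative reconstructions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Space-to-depth down-sampling: stack 2x2 neighbourhoods into channels, then a 1x1 conv."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, 1)

    def forward(self, x):
        patches = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(patches)           # halves H and W while retaining pixel information

class CSPOKM(nn.Module):
    """CSP-style split: 25% of the channels go through a heavy branch, the rest stay cheap."""
    def __init__(self, channels: int, ok_block: nn.Module):
        super().__init__()
        self.c_ok = channels // 4           # 25% of channels routed to the Omni-Kernel branch
        self.pre = nn.Conv2d(channels, channels, 1)
        self.ok_branch = ok_block           # e.g., OmniKernel(channels // 4) or nn.Identity()
        self.cheap = nn.Conv2d(channels - self.c_ok, channels - self.c_ok, 1)
        self.post = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        x = self.pre(x)
        x_ok, x_cheap = torch.split(x, [self.c_ok, x.shape[1] - self.c_ok], dim=1)
        y = torch.cat([self.ok_branch(x_ok), self.cheap(x_cheap)], dim=1)
        return self.post(y)                 # 1x1 convolution for feature integration
```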
The Omni-Kernel module, depicted in Figure 6, consists of three branches: local, large, and global. The local branch employs 1 × 1 depth-wise convolution to recover compromised local information at small scales. The large branch utilizes three large-kernel depth-wise convolutions (1 × 31, 31 × 31, and 31 × 1) to enlarge the receptive field. The global branch integrates a Dual-domain Channel Attention Module (DCAM) and a Frequency-based Spatial Attention Module (FSAM) to facilitate global perception: DCAM performs coarse dual-domain feature enhancement along the channel dimension, whereas FSAM applies fine-grained spectral enhancement in the spatial dimension, strengthening the network's capacity to exploit global information. The outputs of the three branches are combined by element-wise addition and subsequently refined with a 1 × 1 convolution to acquire feature representations ranging from global to local.
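A minimal sketch of this three-branch structure is given below. The local and large branches use the kernel sizes quoted above; the global branch is deliberately simplified to a global-average-pooling channel gate standing in for the DCAM/FSAM pair of the original module [55], whose internals are not reproduced here.

```python
import torch
import torch.nn as nn

class OmniKernel(nn.Module):
    """Simplified three-branch Omni-Kernel: local 1x1, large strip/square kernels, global gate."""
    def __init__(self, c: int):
        super().__init__()
        self.local = nn.Conv2d(c, c, 1, groups=c)                           # 1x1 depth-wise
        self.large_h = nn.Conv2d(c, c, (1, 31), padding=(0, 15), groups=c)  # 1x31 depth-wise
        self.large_s = nn.Conv2d(c, c, (31, 31), padding=15, groups=c)      # 31x31 depth-wise
        self.large_v = nn.Conv2d(c, c, (31, 1), padding=(15, 0), groups=c)  # 31x1 depth-wise
        # simplified global branch: a channel gate in place of the original DCAM/FSAM pair
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(c, c, 1)

    def forward(self, x):
        local = self.local(x)
        large = self.large_h(x) + self.large_s(x) + self.large_v(x)
        global_branch = x * self.global_gate(x)
        # element-wise addition of the three branches, refined by a 1x1 convolution
        return self.fuse(local + large + global_branch)

# illustrative wiring with the CSPOKM sketch above: 32 of 128 channels pass through Omni-Kernel
block = CSPOKM(128, OmniKernel(32))
print(block(torch.randn(1, 128, 80, 80)).shape)   # torch.Size([1, 128, 80, 80])
```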
To evaluate the effectiveness of SF-PAFPN, we conducted comparative experiments by integrating the SF-PAFPN module with the P2-based approach. The detailed experimental procedures are outlined in Section 4.2.5, Sub-experiment 2.

4. Experiments

4.1. Experiment Setup

4.1.1. Experimental Dataset

The dataset utilized in this study was obtained from the 2020 Underwater Robot Professional Contest (URPC) [56], provided by Peng Cheng Laboratory (Shenzhen, China). It consists of 7543 underwater optical images captured in natural marine environments around Dalian's Zhangzidao Island; 80% of the images are still shots and 20% were derived from keyframes extracted from low-speed cruise video. The dataset, which has been widely adopted, contains diverse marine benthos objects with varying morphological characteristics and comprises five categories: holothurian, echinus, starfish, scallop, and seaweed. As seaweed is not a marine benthos species, it was excluded from the prediction tasks in our experiments. After preprocessing, the four target categories contained the following labeled instances: holothurian (6371), echinus (28,624), starfish (9264), and scallop (13,153). The dataset was randomly split into training, validation, and test sets at a 7:1:2 ratio, with all input images resized to a consistent resolution of 640 × 640 pixels. The distribution of labeled instances across the training, validation, and test sets is detailed in Table 1, and representative samples from the dataset are shown in Figure 7.
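For completeness, a minimal sketch of the 7:1:2 random split is shown below; the directory layout and file names are illustrative assumptions, not the authors' actual preprocessing pipeline.

```python
import random
from pathlib import Path

random.seed(0)
# hypothetical dataset location; URPC images and YOLO-format labels are assumed to sit here
images = sorted(Path("URPC2020/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)          # 7:1:2 train/val/test split
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    Path(f"URPC2020/{name}.txt").write_text("\n".join(str(f) for f in files))
# Images are resized to 640x640 by the training pipeline (imgsz=640), not at this stage.
```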

4.1.2. Experiment Environment

The experimental platform configuration comprises an Intel Xeon 8255C CPU @ 2.50 GHz and NVIDIA GeForce RTX 3090 (24 GB) GPU, running on a Linux operating system with a PyTorch 1.13.1 deep learning framework, CUDA 11.7, and Python 3.8. The training parameters were configured as follows: SGD optimizer with an initial learning rate of 0.01, momentum of 0.937, weight decay of 0.0005, 200 epochs, and a batch size of 32.
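These hyperparameters map directly onto the Ultralytics training interface; the sketch below illustrates such a configuration, with the model and dataset YAML names being assumptions, since the modified WDS-YOLO architecture and data config are not distributed with Ultralytics.

```python
from ultralytics import YOLO

model = YOLO("wds-yolo.yaml")        # hypothetical model definition file for WDS-YOLO
model.train(
    data="urpc.yaml",                # hypothetical dataset config pointing at the 7:1:2 split
    imgsz=640,
    epochs=200,
    batch=32,
    optimizer="SGD",
    lr0=0.01,                        # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```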

4.2. Experiment Results

4.2.1. Ablation Experiment

To validate the effectiveness and rationale of the proposed improvements, we conducted a series of ablation studies using identical datasets, experimental platforms, and model parameters for evaluation. The detailed results are summarized in Table 2, which reports Precision (P), Recall (R), and mAP@50 (mean Average Precision at 50% IoU threshold), standard metrics for evaluating detection performance. A checkmark (√) denotes the application of each respective enhancement strategy.
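For reference, these metrics follow their standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives at the 50% IoU threshold, and $N$ is the number of categories (four in our experiments):
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad \mathrm{mAP@50} = \frac{1}{N}\sum_{c=1}^{N} AP_c$$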
As shown in Table 2, the three proposed improvements—WTConv based on wavelet transform, DASPPF, and SF-PAFPN—significantly enhanced the model’s detection performance. First, the WTConv-enhanced backbone achieved a 0.2% improvement in mAP@50, with a 13.3% reduction in parameters and a 4.2 FPS increase in detection speed. This demonstrates enhanced feature extraction capability for marine benthos objects in complex backgrounds, while achieving an optimal balance among parameter efficiency, detection speed, and accuracy. Second, the DASPPF module strengthened feature correlations in important target regions while suppressing irrelevant background areas, resulting in an mAP@50 improvement to 84.3%. Third, the SF-PAFPN module significantly improved small object detection, with a 1.2% increase in mAP@50, along with enhanced precision and recall rates.
While combining SF-PAFPN and DASPPF improved the detection accuracy, it inevitably increased the model parameters. Integrating WTConv with SF-PAFPN achieved 84.9% mAP@50, representing a 1.4% improvement over the baseline and 1.2% over WTConv alone, while reducing the parameters by 12.1%. This indicates WTConv's capability to mitigate parameter inflation from large-kernel convolutions in SF-PAFPN while further enhancing accuracy.
Finally, the WDS-YOLO model incorporating all three improvements achieved a 2.1% higher mAP@50 than YOLOv8n, with only 0.2 M additional parameters, maintaining lightweight characteristics despite a slight speed reduction. These results demonstrate the effectiveness and compatibility of the developed improvements for marine benthos tasks.

4.2.2. Comparative Experiment

We performed a comprehensive comparative assessment of mainstream object detection models in underwater settings, including the newest versions: YOLOv9-t, YOLOv10n, and YOLO11n. The evaluation metrics comprised the mean average precision (mAP@50), the number of parameters (M), and computational complexity (GFLOPs), providing a holistic assessment of each model's accuracy and efficiency. The experimental results are presented in Table 3.
Regarding detection accuracy, WDS-YOLO demonstrated superior performance, achieving an mAP@50 of 85.6%. This represents a substantial improvement of 1.0 to 7.0 percentage points compared to most models, particularly exhibiting enhanced robustness in complex underwater environments. Concerning model complexity, WDS-YOLO maintained a modest parameter count of 3.2 M, substantially lower than traditional two-stage detectors, while remaining competitive with lightweight models (e.g., YOLOv8n and YOLOv9-t).
While the computational cost of 11.4G FLOPs represents a trade-off for improved accuracy, being marginally higher than some models, WDS-YOLO maintains real-time performance for underwater detection tasks, striking an optimal balance between accuracy and efficiency. Collectively, these results demonstrate that WDS-YOLO excels in both detection accuracy and computational efficiency, making it particularly well-suited for resource-constrained underwater detection scenarios. The model’s exceptional accuracy coupled with low computational overhead underscores its remarkable generalization capability in complex underwater environments.

4.2.3. Generalization Experiment

To further assess the generalization capability of the WDS-YOLO model, a comparative study was conducted with mainstream models on the RUOD dataset [57]. The RUOD dataset comprises 14,000 images annotated with 10 categories: diver, starfish, coral, turtle, fish, echinus, holothurian, scallop, squid, and jellyfish. This comprehensive dataset covers diverse underwater scenarios featuring various marine organisms and challenging conditions such as color distortion and light interference, providing a robust platform for model performance evaluation. The experimental setup remained consistent with the previous section, and the corresponding results are summarized in Table 4.
As demonstrated in Table 4, WDS-YOLO achieved a superior mAP@50 of 84.9% on the RUOD dataset, outperforming Faster R-CNN (75.2%), YOLOv7-tiny (81.5%), and YOLOv8n (83.8%) by considerable margins. These results demonstrate that WDS-YOLO exhibits enhanced generalization capabilities in handling complex underwater scenarios. The model maintains high detection accuracy even when handling a larger number of categories, attributed to its optimized network architecture and advanced feature extraction mechanism. These enhancements enable better adaptation to the diverse characteristics of underwater environments, thereby validating the robustness and practical applicability of the proposed method.

4.2.4. Visual Comparison of Detection Results

To qualitatively assess the detection performance of the WDS-YOLO algorithm in real-world underwater conditions, we selected three distinct scenarios representing dense distributions, occlusions, and low-visibility conditions. A comparative analysis was performed with the baseline YOLOv8n model, with the results illustrated in Figure 8.
The detection results revealed that YOLOv8n produces false positives for holothurian targets in high-density scenes, whereas WDS-YOLO effectively eliminates such errors. This demonstrates that the enhanced model exhibits superior object detection capabilities and background interference resistance in complex environments.
For partially occluded marine benthos objects with incomplete contours, the original model struggles to accurately identify their features, leading to missed detections. In contrast, the proposed WDS-YOLO algorithm addresses the challenge of feature extraction in such cases, successfully identifying these objects and significantly reducing missed detection rates.
Furthermore, in low-visibility conditions with blurred imagery, YOLOv8n demonstrates limited detection capability, exhibiting numerous missed detections for holothurians and scallops. Conversely, WDS-YOLO maintains superior detection performance even in challenging underwater environments with significant background blur. These comparative analyses across three distinct scenarios substantiate that the proposed algorithm acquires more comprehensive feature representations of marine benthos objects relative to the original YOLOv8n, effectively mitigating both missed detections and false positives for occluded and blurred marine organisms.
To analyze the limitations of our improved WDS-YOLO, we selected representative error examples from the detection results, as illustrated in Figure 9. Firstly, the misclassification of white rocks as scallops and a brown rock as a holothurian indicates insufficient discriminative power in distinguishing between morphologically similar underwater objects. Secondly, the model erroneously identified shadows as echinus, suggesting vulnerability to illumination-induced artifacts in complex underwater environments. Thirdly, the partial occlusion of a holothurian by an echinus led to a missed detection, while the edge contour of the echinus was erroneously segmented as an independent target, exposing the model’s inadequacy in handling occluded objects and boundary ambiguity. To address these challenges, future work will focus on integrating a multi-scale context-aware module to enhance occlusion reasoning and on exploring adaptive spectral decomposition techniques to improve illumination robustness in dynamic underwater scenarios.

4.2.5. Sub-Experiments of Improvement Points

  • Sub-experiment 1: Comparison of various attention modules
To assess the effectiveness and feasibility of enhancing the SPPF with Deformable Attention (DA) and evaluate its performance in multi-scale feature fusion, we integrated various attention mechanisms, including SimAM, SE, CBAM, CA, EMA, ECA, and BRA, into the SPPF module. These modules were incorporated at the identical position in the model. The relevant findings are summarized in Table 5.
As shown in Table 5, the enhancement of the SPPF module using deformable attention (DA) resulted in the most significant improvement in the model’s overall performance. The precision (P), mAP@50, and mAP@50:95 metrics all outperformed those of other attention modules. This can be attributed to DA’s ability to adaptively adjust the attention domain, enabling more effective representation of the global features of marine benthos objects in complex underwater environments. Therefore, this attention mechanism is particularly suitable for marine benthos detection tasks.
  • Sub-experiment 2: Comparison of small object detection methods
To assess the effectiveness of the proposed SF-PAFPN neck structure for small-object marine benthos detection and its superiority over the current mainstream approach of adding a small-object detection layer (P2), we conducted comparative experiments based on YOLOv8n. The experimental results are presented in Table 6.
As evident from Table 6, the proposed SF-PAFPN structure demonstrates improvements in precision (P), recall (R), mAP@50, and mAP@50:95 compared with adding the small-object detection layer P2. This suggests that the SF-PAFPN structure exhibits stronger feature fusion capability for small marine benthos objects, resulting in superior detection performance. Furthermore, the SF-PAFPN structure requires fewer computational resources than the added P2 detection layer, further validating its superiority.

5. Discussion

To address key challenges in marine environments including turbid backgrounds, optical refraction distortions, and size variations of marine benthos objects, we introduced the WDS-YOLO algorithm as an effective solution for marine benthos detection.
Regarding detection accuracy, the proposed algorithm performed better in small object detection than conventional methods, as shown in Table 3. Specifically, it achieved 85.6% mAP@50, outperforming YOLOv5s (+2.7%) and YOLOv7-tiny (+3.3%). Notably, the introduced approach also achieved superior performance on the RUOD dataset, confirming its enhanced robustness for marine benthos detection.
Regarding practical deployment, the proposed model maintains operational feasibility while achieving competitive performance. WDS-YOLO maintains a compact parameter size of 3.2 M (74.2% reduction from 12.4 M) with moderate computational costs. This demonstrates an effective balance between detection performance and model efficiency. With a processing speed of 104.5 FPS (marginally below YOLOv8n’s 114.1 FPS), the model maintains a real-time capability that fulfills practical requirements for marine benthos detection under computational constraints.
Despite these advancements, two limitations require consideration. First, we observed a modest performance decrease on the RUOD dataset with increased categories and background complexity. This suggests image quality and background complexity may affect detection accuracy, motivating future research to improve robustness through image data enhancements. Second, although the current 104.5 FPS satisfies real-time requirements, computational efficiency could be improved via channel pruning or quantization while preserving accuracy. These aspects require further exploration in future studies to enhance the algorithm’s utility.

6. Conclusions

This paper proposed an enhanced YOLOv8-based marine benthos detection model, termed WDS-YOLO. The proposed model incorporates an improved backbone network utilizing WTConv to significantly strengthen feature extraction when detecting marine benthos objects in complex underwater environments. A novel DASPPF module was developed by integrating a deformable attention mechanism, which enables adaptive adjustment of the attention domain in response to input variations, thus minimizing the influence of irrelevant information and improving detection accuracy. Furthermore, the SF-PAFPN module was implemented to fuse shallow feature information, preventing the loss of small object details and improving small object detection accuracy, which in turn reduces both missed and false detections. The experimental results show that the proposed architecture, with 3.2 million parameters and 11.4 GFLOPs, achieves 85.6% mAP@50 on marine benthos detection, a 2.1 percentage point absolute improvement over the baseline YOLOv8n (83.5%), while maintaining competitive efficiency and outperforming recent models, including YOLOv10n (82.8%) and YOLOv9-t (83.9%). These results confirm its effectiveness in accurately detecting marine benthos objects in challenging environments and offer an innovative solution for automated underwater detection and robotic fishing.
Although the SF-PAFPN structure effectively utilizes shallow feature map information to boost small object detection capability and requires less computation than implementing an additional small object detection layer, its computational overhead relative to the baseline model has increased. Therefore, significant opportunities exist for further algorithm optimization. Accordingly, future research will focus on maintaining detection accuracy while minimizing computational requirements, thus improving the model’s detection speed. Additionally, considering the rapid advancements in object detection research, future work will explore state-of-the-art algorithms to further enhance the model’s performance.

Author Contributions

Conceptualization, M.C.; methodology, J.Q.; software, J.Q.; validation, M.C. and J.Q.; formal analysis, M.C. and J.Q.; resources, M.C.; data curation, M.C. and J.Q.; writing—original draft preparation, J.Q.; writing—review and editing, M.C. and J.Q.; supervision, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research and Development Planning in Key Areas of Guangdong Province, NO. 2021B0202070001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
R-CNN: Region with CNN Features
RT-DETR: Real-Time Detection Transformer
YOLO: You Only Look Once
NMS: Non-Maximum Suppression
FPN: Feature Pyramid Network
DCP: Deformable Convolution Pyramid
DASPPF: Deformable Attention-based Spatial Pyramid Pooling Fast
SE: Squeeze-and-Excitation
CA: Coordinate Attention
ECA: Efficient Channel Attention
CBAM: Convolutional Block Attention Module
EMA: Efficient Multi-Scale Attention
BRA: Bi-Level Routing Attention
WTConv: Wavelet Convolution
URPC: Underwater Robot Professional Contest

References

  1. Yu, G.; Cai, R.; Su, J.P.; Hou, M.; Deng, R. U-YOLOv7: A network for underwater organism detection. Ecol. Inform. 2023, 75, 102108. [Google Scholar]
  2. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164. [Google Scholar]
  3. Huang, H.; Tang, Q.; Li, J.; Zhang, W.; Bao, X.; Zhu, H.; Wang, G. A review on underwater autonomous environmental perception and target grasp, the challenge of robotic organism capture. Ocean. Eng. 2020, 195, 106644. [Google Scholar]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  5. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar]
  7. Cai, Z.W.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  8. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2024; pp. 16965–16974. [Google Scholar]
  9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision 2016, Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37. [Google Scholar]
  10. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  12. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  13. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  14. Zhu, X.K.; Lyu, S.C.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  15. Ultralytics: Yolov5. [EB/OL]. Available online: https://github.com/ultralytics/yolov5 (accessed on 2 November 2024).
  16. Chen, Z.; Zhang, F.; Liu, H.; Wang, L.X.; Zhang, Q.; Guo, L.L. Real-time detection algorithm of helmet and reflective vest based on improved YOLOv5. J. Real-Time Image Process. 2023, 20, 4. [Google Scholar]
  17. Wu, D.L.; Jiang, S.; Zhao, E.L.; Liu, Y.L.; Zhu, H.C.; Wang, W.W.; Wang, R.Y. Detection of Camellia oleifera fruit in complex scenes by using YOLOv7 and data augmentation. Appl. Sci. 2022, 12, 11318. [Google Scholar] [CrossRef]
  18. Jiang, K.; Xie, T.; Yan, R.; Yan, R.; Wen, X.; Li, D.; Jiang, H.B.; Jiang, N.; Feng, L.; Duan, X.L.; et al. An attention mechanism-improved YOLOv7 object detection algorithm for hemp duck count estimation. Agriculture 2022, 12, 1659. [Google Scholar] [CrossRef]
  19. Li, B.; Chen, Y.; Xu, H.; Fei, Z. Fast vehicle detection algorithm on lightweight YOLOv7-tiny. arXiv 2023, arXiv:2304.06002. [Google Scholar]
  20. Kulyukin, V.A.; Kulyukin, A.V. Accuracy vs. energy: An assessment of bee object inference in videos from on-hive video loggers with YOLOv3, YOLOv4-Tiny, and YOLOv7-Tiny. Sensors 2023, 23, 6791. [Google Scholar] [CrossRef]
  21. Chen, L.; Liu, Z.; Tong, L.; Jiang, Z.; Wang, S.; Dong, J.; Zhou, H.Y. Underwater object detection using Invert Multi-Class Adaboost with deep learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  22. Lin, W.; Zhong, J.; Liu, S.; Li, T.; Li, G. Roimix: Proposal-fusion among multiple images for underwater object detection. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2588–2592. [Google Scholar]
  23. Xu, F.; Wang, H.; Peng, J.; Fu, X. Scale-aware feature pyramid architecture for marine object detection. Neural Comput. Appl. 2021, 33, 3637–3653. [Google Scholar]
  24. Qi, S.; Du, J.; Wu, M.; Yi, H.; Tang, L.; Qian, T.; Wang, X. Underwater small target detection based on deformable convolutional pyramid. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 2784–2788. [Google Scholar]
  25. Liu, Y.; Wang, S. A quantitative detection algorithm based on improved Faster R-CNN for marine benthos. Ecol. Inform. 2021, 61, 101228. [Google Scholar]
  26. Fu, X.; Liu, Y.; Liu, Y. A case study of utilizing YOLOT based quantitative detection algorithm for marine benthos. Ecol. Inform. 2022, 70, 101603. [Google Scholar]
  27. Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on YOLO v4 and multi-scale attentional feature fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
  28. Liu, P.; Qian, W.; Wang, Y. YWnet: A convolutional block attention-based fusion deep learning method for complex underwater small target detection. Ecol. Inform. 2024, 79, 102401. [Google Scholar]
  29. Wen, G.; Li, S.; Liu, F.C.; Luo, X.; Er, M.; Mahmud, M.; Wu, T. YOLOv5s-CA: A modified YOLOv5s network with coordinate attention for underwater target detection. Sensors 2023, 23, 3367. [Google Scholar] [CrossRef]
  30. Zhang, L.; Fan, J.; Qiu, Y.; Jiang, Z.; Hu, Q.; Xing, B.W.; Xu, J.X. Marine zoobenthos recognition algorithm based on improved lightweight YOLOv5. Ecol. Inform. 2024, 80, 102467. [Google Scholar]
  31. Yi, W.; Wang, B. Research on underwater small target detection algorithm based on improved YOLOv7. IEEE Access 2023, 11, 66818–66827. [Google Scholar]
  32. Zhang, J.; Zhang, J.; Zhou, K.; Zhang, Y.; Chen, H.; Yan, X. An improved YOLOv5-based underwater object-detection framework. Sensors 2023, 23, 3693. [Google Scholar]
  33. Liu, K.; Peng, L.; Tang, S. Underwater object detection using TC-YOLO with attention mechanisms. Sensors 2023, 23, 2567. [Google Scholar] [CrossRef]
  34. Wang, J.; Li, Q.; Fang, Z.; Zhou, X.; Tang, Z.; Han, Y.; Ma, Z. YOLOv6-ESG: A lightweight seafood detection method. J. Mar. Sci. Eng. 2023, 11, 1623. [Google Scholar] [CrossRef]
  35. Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Underwater target detection based on improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar] [CrossRef]
  36. Zhou, H.; Kong, M.; Yuan, H.; Pan, Y.; Wang, X.; Chen, R.; Lu, W.; Wang, R.Z.; Yang, Q.H. Real-time underwater object detection technology for complex underwater environments based on deep learning. Ecol. Inform. 2024, 82, 102680. [Google Scholar] [CrossRef]
  37. Guo, A.; Sun, K.; Zhang, Z. A lightweight YOLOv8 integrating FasterNet for real-time underwater object detection. J. Real-Time Image Process. 2024, 21, 49. [Google Scholar] [CrossRef]
  38. Qu, S.; Cui, C.; Duan, J.; Lu, Y.; Pang, Z. Underwater small target detection under YOLOv8-LA model. Sci. Rep. 2024, 14, 16108. [Google Scholar] [CrossRef] [PubMed]
  39. Pan, W.; Chen, J.; Lv, B.; Peng, L. Optimization and Application of Improved YOLOv9s-UI for Underwater Object Detection. Appl. Sci. 2024, 14, 7162. [Google Scholar] [CrossRef]
  40. Sun, Y.; Zheng, W.; Du, X.; Yan, Z. Underwater small target detection based on YOLOX combined with MobileViT and double coordinate attention. J. Mar. Sci. Eng. 2023, 11, 1178. [Google Scholar] [CrossRef]
  41. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  42. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  44. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  45. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  47. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  48. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  49. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  50. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  51. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 363–380. [Google Scholar]
  52. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  53. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, B. Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861. [Google Scholar] [CrossRef]
  54. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An Optimized YOLOv8 Network for Tiny UAV Object Detection. Electronics 2023, 12, 3664. [Google Scholar] [CrossRef]
  55. Cui, Y.; Ren, W.; Knoll, A. Omni-Kernel Network for Image Restoration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 1426–1434. [Google Scholar]
  56. Han, Y.; Chen, L.; Luo, Y.; Ai, H.; Hong, Z.; Ma, Z.; Zhang, Y. Underwater Holothurian Target-Detection Algorithm Based on Improved CenterNet and Scene Feature Fusion. Sensors 2022, 22, 7204. [Google Scholar] [CrossRef] [PubMed]
  57. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Luo, Z. Rethinking General Underwater Object Detection: Datasets, Challenges, and Solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 network structure.
Figure 2. WDS-YOLO network structure.
Figure 3. An example of the WTConv.
Figure 4. C2f-WTConv module structure.
Figure 5. Deformable attention module structure.
Figure 6. Omni-Kernel module structure.
Figure 7. Sample images from the URPC2020 dataset.
Figure 8. Comparison of detection results. Green bounding boxes indicate correctly detected objects, red boxes denote false positives, and blue boxes represent false negatives.
Figure 9. Examples of WDS-YOLO errors: (a) image blurriness caused a rock to be detected as an object; (b) a shadow was mistaken for an echinus; (c) a holothurian was missed due to occlusion by overlapping objects. Green bounding boxes indicate correctly detected objects, red boxes denote false positives, and blue boxes represent false negatives.
Table 1. The distribution of the URPC2020 dataset across training, validation, and test sets.
Phase | Total Instances | Holothurian | Echinus | Starfish | Scallop
Training | 40,107 | 4502 | 20,356 | 6473 | 8776
Validation | 5660 | 619 | 2780 | 947 | 1314
Test | 11,645 | 1250 | 5488 | 1844 | 3063
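As a quick sanity check on the split statistics above, the following minimal Python sketch (not part of the original paper; the dictionary name and layout are introduced only for illustration) encodes the Table 1 counts and confirms that the four per-class instance counts sum to the reported total for each split.

```python
# Illustrative consistency check for the URPC2020 split counts in Table 1.
# The "splits" structure below is an assumption made for this sketch only.
splits = {
    # split name: (total, holothurian, echinus, starfish, scallop)
    "Training":   (40107, 4502, 20356, 6473, 8776),
    "Validation": (5660,   619,  2780,  947, 1314),
    "Test":       (11645, 1250,  5488, 1844, 3063),
}

for name, (total, *per_class) in splits.items():
    # Each split's per-class counts should add up to its reported total.
    assert sum(per_class) == total, f"{name}: class counts do not sum to {total}"
    print(f"{name}: {total} instances across {len(per_class)} classes")
```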
Table 2. Results of the ablation experiments.
WTConv | DASPPF | SF-PAFPN | P/% | R/% | mAP@50/% | Params/M | FPS/(f·s−1)
– | – | – | 81.5 | 77.8 | 83.5 | 3.0 | 114.1
✓ | – | – | 82.4 | 76.0 | 83.7 | 2.6 | 118.3
– | ✓ | – | 83.6 | 77.2 | 84.3 | 3.2 | 113.1
– | – | ✓ | 83.2 | 78.1 | 84.7 | 3.3 | 109.5
– | ✓ | ✓ | 82.6 | 77.2 | 84.6 | 3.5 | 107.1
✓ | – | ✓ | 82.7 | 77.9 | 84.9 | 2.9 | 105.5
✓ | ✓ | ✓ | 83.8 | 78.6 | 85.6 | 3.2 | 104.5
Table 3. Comparative experiment results for different networks.
Method | mAP@50/% | Params/M | FLOPs/G
Faster R-CNN | 78.6 | 41.4 | 239.3
Cascade R-CNN | 80.0 | 69.2 | 119.0
SSD | 75.1 | 26.3 | 63.4
ATSS | 80.7 | 32.1 | 80.5
YOLOv3 | 78.3 | 61.6 | 66.5
YOLOv5s | 83.9 | 9.1 | 23.8
YOLOv7-tiny | 84.6 | 6.0 | 13.2
YOLOv8n | 83.5 | 3.0 | 8.1
YOLOv9-t | 83.9 | 2.7 | 11.1
YOLOv10n | 82.8 | 2.7 | 8.2
YOLO11n | 80.0 | 2.6 | 6.3
RT-DETR | 83.8 | 19.9 | 57.0
WDS-YOLO | 85.6 | 3.2 | 11.4
Table 4. Comparison of model performance on the RUOD dataset.
Method | mAP@50/% | Params/M | FLOPs/G
Faster R-CNN | 75.2 | 41.4 | 239.3
YOLOv7-tiny | 81.5 | 6.0 | 13.2
YOLOv8n | 83.8 | 3.0 | 8.1
WDS-YOLO | 84.9 | 3.2 | 11.4
Table 5. Experimental results of SPPF enhanced with various attention modules.
Module | P/% | R/% | mAP@50/% | mAP@50:95/% | Params/M | FLOPs/G
SPPF | 81.5 | 77.8 | 83.5 | 48.9 | 3.0 | 8.1
SPPF + SimAM | 82.4 | 76.4 | 83.6 | 48.8 | 3.0 | 8.1
SPPF + SE | 82.1 | 75.8 | 83.3 | 48.5 | 3.0 | 8.1
SPPF + CBAM | 83.4 | 75.7 | 83.5 | 48.8 | 3.0 | 8.1
SPPF + CA | 82.5 | 76.4 | 83.4 | 48.8 | 3.0 | 8.1
SPPF + EMA | 82.4 | 76.4 | 83.6 | 48.8 | 3.0 | 8.1
SPPF + ECA | 80.4 | 77.9 | 83.9 | 48.9 | 3.0 | 8.1
SPPF + BRA | 80.7 | 77.8 | 83.6 | 48.8 | 3.2 | 8.3
SPPF + DA | 83.6 | 77.2 | 84.3 | 49.4 | 3.2 | 8.3
Table 6. Experimental results of small object detection methods.
Module | P/% | R/% | mAP@50/% | mAP@50:95/% | FLOPs/G
P2 | 81.1 | 77.6 | 83.6 | 49.2 | 12.2
SF-PAFPN | 83.2 | 78.1 | 84.7 | 49.5 | 11.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
