Article

Underwater-Yolo: Underwater Object Detection Network with Dilated Deformable Convolutions and Dual-Branch Occlusion Attention Mechanism

1. Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai 519000, China
2. College of Mechanical Engineering and Automation, Foshan University, Foshan 528200, China
3. South China Sea Marine Survey Center, Ministry of Natural Resources of the People’s Republic of China, Guangzhou 510300, China
4. Key Laboratory of Marine Environmental Survey Technology and Application, Ministry of Natural Resources of the People’s Republic of China, Guangzhou 510300, China
5. School of Electronic Information Engineering, Changchun University of Science and Technology, Weixing Road No. 7089, Changchun 130022, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Mar. Sci. Eng. 2024, 12(12), 2291; https://doi.org/10.3390/jmse12122291
Submission received: 15 November 2024 / Revised: 10 December 2024 / Accepted: 10 December 2024 / Published: 12 December 2024
(This article belongs to the Section Ocean Engineering)

Abstract

Underwater object detection is critical for marine ecological monitoring and biodiversity research, yet existing algorithms struggle in detecting densely packed objects of varying sizes, particularly in occluded and complex underwater environments. This study introduces Underwater-Yolo, a novel detection network that enhances performance in these challenging scenarios by integrating a dual-branch occlusion-handling attention mechanism (GLOAM) and a Cross-Stage Partial Dilated Deformable Convolution (CSP-DDC) backbone. The dilated deformable convolutions (DDCs) in the backbone and neck expand the receptive field, thereby improving the detection of small objects, while the deformable convolutions enhance the model’s adaptive feature extraction capabilities for unstructured objects. Additionally, the CARAFE up-sampling operator in the neck aggregates contextual information across a broader spatial domain. The GLOAM, consisting of a global branch (using a Vision Transformer to capture global features and object–background relationships) and a local branch (enhancing the detection of occluded objects through depthwise–pointwise convolutions), further optimizes performance. By incorporating these innovations, the model effectively addresses the challenges of detecting small and occluded objects in dense underwater environments. The evaluation on the CLfish-V1 dataset shows significant improvements over state-of-the-art algorithms, with an AP50 of 93.8%, an AP75 of 88.9%, and an AP-small of 76.4%, marking gains of 4.7%, 16.7%, and 6%, respectively. These results demonstrate the model’s effectiveness in complex underwater scenarios.

1. Introduction

In the 21st century, object detection technology has been extensively applied in marine exploration, research, and environmental monitoring, providing crucial support for resource development, ecological conservation, scientific inquiry, and security and defense. Among these applications, underwater object detection remains one of the most challenging tasks in computer vision due to the inherent complexities of the underwater environment. Key challenges include the imbalance in underwater image datasets, significant variations in target sizes, high-density object clusters, and the dynamic nature of underwater scenes. These factors make accurate and reliable object detection particularly difficult [1].
Traditional object detection methods often exhibit poor performance in terms of accuracy and generalization under such conditions. For instance, marine organisms’ natural camouflage and inter-class similarity further complicate detection tasks [2]. The advent of deep learning has enabled significant progress in underwater object detection, such as the development of multi-scale representations and feature pyramid networks that enhance detection precision for small and densely packed targets [3], particularly in tasks related to marine ecosystem conservation and object identification in underwater environments.
However, even advanced models such as YOLO, SSD, and Faster R-CNN face difficulties when applied to underwater tasks. These models typically predict object boundaries and categories directly from images, with YOLO excelling in detection speed via a single-shot regression mechanism, SSD enhancing multi-scale object detection through feature maps at different resolutions, and Faster R-CNN providing more precise localization via a region proposal network [4]. Despite these advancements, challenges persist in underwater settings, especially when dealing with small, densely packed objects and occlusions.
The underwater environment exacerbates the issues of small target detection and occlusion, often leading to reduced detection performance. Enhancing detection accuracy in such environments is vital for applications in marine biology, underwater navigation, and environmental monitoring. Recent innovations, such as attention-based feature fusion modules, have shown promise in addressing these challenges [5]. Thus, there is a pressing need to develop advanced algorithms capable of overcoming these unique challenges to ensure effective and reliable underwater object detection [6].
To address these challenges, this study proposes Underwater-Yolo, an advanced underwater object detection network that integrates a dual-branch occlusion-handling attention mechanism (GLOAM) and a scalable deformable feature extraction backbone (CSP-DDC). The network introduces a combination of global and local components inspired by the Vision Transformer (ViT) and a depthwise–pointwise convolution architecture as part of the GLOAM module. Additionally, the CSP-DDC layer is incorporated into the backbone to improve the detection of small and occluded targets in complex underwater environments.
The main contributions of this study are as follows:
  • Novel backbone network design: This study introduces a new backbone network that employs scalable deformable convolutions as the primary feature extraction layer. Compared with existing state-of-the-art single-stage networks, this design increases the receptive field without adding computational overhead, enabling the network to detect objects of varying scales and flexibly adapt to changes in object shape and position. This improves detection accuracy, particularly for unstructured and irregular targets.
  • Dual-branch occlusion-handling attention mechanism (GLOAM): To improve the detection of densely packed targets, the GLOAM module combines global feature extraction (using the Vision Transformer) with local feature extraction (via a depthwise–pointwise convolutional architecture). The dual-branch attention mechanism assigns appropriate weights to the global and local components, effectively handling the relationship between objects and their background while enhancing the detection of occluded objects.
  • Enhanced feature utilization with CARAFE up-sampling and Small-Object Detection Layer: To maximize the use of feature outputs from the backbone network and reduce information loss, the CARAFE up-sampling module is integrated into the detection network, enriching semantic information. Additionally, a specialized Small-Object Detection head (SOD-layer) is constructed, leveraging the GLOAM module to enhance the detection of densely clustered small targets.
Although existing advanced algorithms such as YOLO, SSD, and Faster R-CNN have made progress in underwater object detection, they still face significant performance bottlenecks when handling small objects, densely packed targets, and occlusions. The Underwater-Yolo model proposed in this study achieves breakthroughs in addressing these challenges by integrating the dual-branch occlusion-handling attention mechanism (GLOAM) and the Cross-Stage Partial Dilated Deformable Convolution (CSP-DDC) backbone network. Compared with commonly used methods in the literature, our model effectively enhances the detection of small and occluded objects, which has often been considered a difficult problem in past research. The experimental results show that Underwater-Yolo outperforms current state-of-the-art algorithms on the CLfish-V1 dataset, achieving an AP50 of 93.8%, an AP75 of 88.9%, and an AP-small of 76.4%, with improvements of 4.7%, 16.7%, and 6%, respectively.
This study demonstrates the potential of the proposed Underwater-Yolo network to address the limitations of current models, providing a more reliable and accurate solution for underwater object detection under challenging conditions.

2. Related Work

In recent years, deep learning-based underwater target detection technologies have achieved remarkable advancements. This section reviews and analyzes key research efforts in the field, focusing on general object detection, underwater small-target detection, and the challenges posed by densely occluded underwater environments.

2.1. Common Underwater Object Detection

General underwater object detection is commonly addressed with either one-stage or two-stage detectors. Zhang, M. et al. [7] introduced a lightweight underwater object detection method that combines MobileNetv2, YOLOv4, and attentional feature fusion to enhance accuracy and speed, showing good results in terms of precision and model efficiency. Similarly, Shi, P. et al. [8] developed an improved underwater biological detection algorithm based on Faster-RCNN that incorporates ResNet, BiFPN, EIoU, and K-means++ clustering.
Whether based on one-stage or two-stage target detection, these models show great potential to address major challenges such as common scale changes in underwater scenarios, but they also highlight the limitations of universal algorithms in detecting small, dense objects in complex underwater environments.

2.2. Underwater Small-Target Detection

Several effective methods have been proposed by researchers to improve small-target detection in underwater environments. Gao, J. et al. [9] introduced a path-augmented Transformer framework for underwater object detection, which addresses the challenges posed by small-scale targets in complex environments. Their method outperformed existing techniques in terms of precision, recall, F1-score, and FPS. Similarly, Chen, D. and Gou, G. [10] proposed SFDet, a small-object detection method utilizing a spatial-to-frequency attention mechanism. This approach demonstrated superior performance across four publicly available datasets.
In another contribution, Li, X. et al. [11] developed a multi-scale aggregation feature pyramid network with cornerness for underwater object detection. Their method enhanced feature extraction and small-object recall, achieving 78.90% mAP on an underwater dataset and 84.3% mAP on the PASCAL VOC dataset. Qu, S. et al. [12] introduced the YOLOv8-LA model, incorporating a Lightweight Efficient Partial Convolution (LEPC) module, the AP-FasterNet architecture, and CARAFE up-sampling. This model improved small-target detection in underwater scenarios, achieving 84.7% mAP and 189.3 FPS on the URPC2021 dataset. Chen, G. et al. [13] proposed a hybrid Transformer-based approach for underwater small-object detection. This method combined a lightweight network, feature pyramid, and test-time augmentation, achieving significant improvements in detection accuracy and efficiency with reduced model parameters. Sun, Y. et al. [14] developed a lightweight underwater small-target detection model that combines YOLOX, MobileViT, and a double-coordinate attention (DCA) mechanism. Their approach increased detection accuracy while reducing model complexity, achieving an average accuracy of 72.00% on the URPC2020 dataset. Chen, L. et al. [15] proposed SWIPENET, a deep learning-based method that uses Sample-Weighted Hyper Feature Maps and a Curriculum Multi-Class Adaboost (CMA) training paradigm. This method addresses noise issues and small-object detection, achieving competitive accuracy across multiple underwater datasets. Qi, S. et al. [16] introduced a two-stage underwater small-target detection network utilizing Deformable Convolutional Pyramid (DCP) to tackle deformation, occlusion, and varying object sizes. Their approach achieved superior detection performance and state-of-the-art results in underwater target detection tasks.
While these advancements significantly contribute to the detection of small underwater targets, challenges remain. These methods often focus on enhancing object detection under specific conditions, but further breakthroughs are required to expand receptive fields and achieve precise detection in the complex and dynamic underwater environment. Specifically, addressing the limitations of general-purpose algorithms in handling occlusion and scale variation, especially in non-structured underwater scenes, is essential to improving the robustness and efficiency of small-object detection in such environments.

2.3. Detection of Densely Occluded Underwater Targets

Ji, X. et al. [17] proposed a hybrid CNN–Transformer feature boosting and differential pyramid network (FBDPN) for underwater object detection. This approach enhances multi-scale feature learning while reducing information redundancy, leading to superior performance compared with existing methods. Wang, J. et al. [18] introduced the YOLOv5-FCDSDSE model for underwater dense–small-object detection, integrating CFnet, Dyhead, and SE attention mechanisms to improve accuracy and reduce model parameters, resulting in enhanced performance over baseline models. K. Liu et al. [19] presented a new method using TC-YOLO with attention mechanisms and optimal transport label assignment, which improved both accuracy and computational efficiency for underwater detection tasks. Similarly, Liu, K. et al. [20] proposed a similar TC-YOLO-based method with attention mechanisms and optimal transport label assignment, achieving better accuracy and reduced computational cost for underwater object detection. Li, J. et al. [21] introduced an improved CME-YOLOv5 network to detect densely spaced fish and small targets, incorporating the coordinate attention mechanism, expanded detection layers, and the EIOU loss function. This model demonstrated a 4.4% increase in mAP and a 24.6% improvement in detection performance over YOLOv5. A. Mathias et al. [22] proposed the hybrid adaptive deep SORT-YOLOv3 (HADSYv3) method for occluded underwater object tracking, combining YOLOv3 for detection and adaptive deep SORT with LSTM for tracking, which achieved improved efficiency in tracking occluded objects across various angles.
These methods demonstrate progress in tackling occlusion challenges, but they still face difficulties in effectively integrating both global context and local features, especially when handling densely occluded targets in dynamic environments. While these approaches have advanced the accuracy and efficiency of underwater object detection, they overlook the need to better manage the occlusion relationships between targets and the background, as well as the local relationships among targets themselves. Further research is needed to strike a balance between global context extraction and fine-grained local feature representation, which is crucial to optimizing detection performance under occlusion conditions.

2.4. Literature Analysis

Researchers have extensively applied deep learning techniques to underwater target detection, given the ability of these methods to extract complex features and address the challenges of underwater environments, such as low visibility, color distortion, and image degradation. This section summarizes the contributions and limitations of various methodologies, including convolutional neural networks (CNNs), YOLO, Faster R-CNN, and Transformer-based models, and their application in real-time detection and monitoring systems. Although these approaches have improved detection accuracy and robustness to noise, several challenges remain. These include high computational costs, limited availability of annotated underwater datasets, difficulties in detecting small and densely packed objects, and the need for real-time processing. Table 1 presents a detailed summary of the key contributions and limitations of the primary research on underwater target detection, highlighting the areas that require further improvement.
Despite the emergence of numerous underwater target detection algorithms in recent years, significant challenges remain, particularly when addressing densely packed targets, complex shapes, and unstructured environments. These algorithms exhibit several key limitations:
  • Limited receptive field and adaptability: The relatively small receptive fields of convolutional kernels in existing algorithms constrain their ability to adapt to variations in target shape and position. This limitation significantly reduces their effectiveness in detecting objects in complex underwater environments, where targets can vary widely in scale and form.
  • Insufficient feature extraction capability: Current algorithms struggle to effectively capture both global and local features, resulting in suboptimal performance, especially in multi-scale target detection. This issue is particularly pronounced when dealing with small or densely packed objects, as these models often fail to fully exploit the information available in feature layers, leading to inaccuracies in detection.
  • Ineffective attention mechanisms: Existing attention mechanisms face substantial difficulties in distinguishing densely packed targets, complex shapes, and unstructured environments. In clusters of densely packed objects, visual overlap often occurs, making it hard for the model to focus on individual targets. For objects with intricate shapes, current mechanisms often fail to capture the full range of shape diversity, reducing recognition accuracy. Additionally, in unstructured environments, such as natural underwater scenes, the models struggle to differentiate targets from complex and dynamic backgrounds, further complicating detection tasks.

3. Proposed Method

In the previous sections, this study reviewed the strengths and limitations of existing underwater object detection algorithms. While current one-stage and two-stage detection methods exhibit promising performance in real-time processing and small-object detection, they still encounter significant challenges when dealing with dense-target populations, complex shapes, and unstructured objects. To address these issues, this study proposes an innovative underwater object detection network, Underwater-Yolo. This network enhances detection capabilities in complex environments by incorporating a dilated deformable convolution (CSP-DDC) backbone and a dual-branch occlusion-handling attention mechanism (GLOAM). This section provides a detailed overview of the employed methodology, including the newly designed network architecture, the functionalities and implementations of the various modules, and the optimization strategies targeting small objects and dense populations. These methods not only improve the detection capabilities for objects of varying scales but also enhance the accuracy of background–target relationships in dense populations.

3.1. Overall Structure of Underwater-Yolo

The Underwater-Yolo network architecture is divided into three main components: the backbone network, the neck, and the head network. These components are interconnected to achieve efficient and accurate object detection in challenging underwater environments, as illustrated in Figure 1.
The Underwater-Yolo architecture integrates a Cross-Stage Partial Dilated Deformable Convolution (CSP-DDC) backbone with a dual-branch occlusion-handling attention mechanism (GLOAM) to address the challenges of underwater object detection. CSP-DDC innovatively merges traditional, dilated, and deformable convolutions, enhancing the receptive field and dynamically adapting to unstructured or irregular target contours. This integration facilitates multi-scale target detection without additional computational burden and improves adaptability to complex geometric shapes. The GLOAM mitigates occlusion effects by combining global and local processing flows by using a Vision Transformer (ViT) and convolutional neural networks (CNNs). The ViT captures panoramic dependencies, while the CNNs extract detailed local features. This dual-branch structure enhances the detection of complex relationships by merging global insights with local details, improving responses to occluded objects and enhancing detection outcomes in dense scenarios. For small-target detection, the architecture incorporates CARAFE up-sampling within the PANet structure to enrich semantic details in high-level feature maps, preserving essential details crucial for accuracy. A specialized Small-Object Detection layer (SOD-layer) is also integrated, ensuring the model’s adaptability to complex targets in underwater detection and enhancing overall performance. This approach optimizes the model’s capabilities for handling intricate underwater environments.

3.2. Dilated Deformable Convolution

The CSP-DDC (Cross-Stage Partial Dilated Deformable Convolution) module, as shown in Figure 2, is an advanced feature extraction module integrating deformable convolutions and dilated convolutions. By combining cross-stage partial dilations and deformable convolutions, the module aims to enhance the network’s flexibility and accuracy in capturing complex target features, particularly in the detection of objects with varying scales and deformations. The design of the CSP-DDC module focuses on improving the fusion of local and global information when processing targets of different scales. By introducing cross-stage partial dilated convolutions, the network can capture information across multiple scales without significantly increasing computational complexity. Additionally, the integration of deformable convolutions allows for adaptive adjustments to the shape of targets, significantly improving feature extraction accuracy, especially in images with deformations or complex structures.
The CSP-DDC module can be expressed as in Equation (1):
Z = \mathrm{DCN}\left( \mathrm{Conv}_{1\times 1}\left( \sum_{i=1}^{n} \mathrm{Conv}_{3\times 3,\, d_i}\left( \mathrm{Conv}_{1\times 1}(X) \right) \right) + X_1 \right)  (1)
As illustrated in Figure 2, the input feature map first undergoes a 1 × 1 convolution layer for channel compression, producing the output feature map X_1 ∈ R^(H×W×C_mid), where C_mid is typically less than C_in. Subsequently, X_1 is fed into multiple 3 × 3 dilated convolution branches, each utilizing a different dilation rate d_i, resulting in Y_i = Conv_(3×3, d_i)(X_1), where i denotes the respective branch. To prevent excessive dilation from causing a loss of local information, the dilation rates are chosen to be 1, 2, and 4. The output feature maps Y_i from each branch are fused through a 1 × 1 convolution layer and then added to the compressed input features X_1 via a residual connection, ultimately forming the output feature map Z ∈ R^(H×W×C_out), where C_out typically restores to C_in.
On the fused feature map, a deformable convolution (DCN) is further applied, which dynamically adjusts the position of the convolution kernels by using learnable offsets, producing the final output Z . This design ensures that while expanding the receptive field, the model can flexibly adapt to variations in target shape and position, thereby enhancing detection performance in complex scenarios.
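As an illustration of this computation, the following PyTorch sketch assembles a CSP-DDC-style block from standard components, following Equation (1) and the description above rather than the authors’ released code; the class name CSPDDCBlock, the channel sizes, and the use of torchvision’s DeformConv2d are assumptions made for the example.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class CSPDDCBlock(nn.Module):
    """Sketch of one CSP-DDC-style block: 1x1 compression, parallel dilated 3x3
    branches, 1x1 fusion with a residual connection, then a deformable convolution."""

    def __init__(self, c_in, c_mid, dilations=(1, 2, 4)):
        super().__init__()
        self.compress = nn.Conv2d(c_in, c_mid, kernel_size=1)              # X -> X1
        self.branches = nn.ModuleList([
            nn.Conv2d(c_mid, c_mid, kernel_size=3, padding=d, dilation=d)  # padding=d keeps H, W
            for d in dilations
        ])
        self.fuse = nn.Conv2d(c_mid, c_mid, kernel_size=1)                 # fuse branch outputs
        self.offset = nn.Conv2d(c_mid, 2 * 3 * 3, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(c_mid, c_in, kernel_size=3, padding=1)     # restore C_in channels

    def forward(self, x):
        x1 = self.compress(x)
        y = sum(branch(x1) for branch in self.branches)   # aggregate dilation rates 1, 2, 4
        z = self.fuse(y) + x1                             # residual connection with X1
        return self.dcn(z, self.offset(z))                # learnable-offset (deformable) sampling


feat = torch.randn(1, 64, 80, 80)
print(CSPDDCBlock(c_in=64, c_mid=32)(feat).shape)  # torch.Size([1, 64, 80, 80])
```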

3.3. Dual-Branch Occlusion Attention Mechanism

The GLOAM (dual-branch occlusion-handling attention mechanism) module, as shown in Figure 3, is specifically designed to address occlusion challenges in object detection. By incorporating a dual-branch attention mechanism, it separately focuses on global and local features to tackle occlusions in images. The core idea of the GLOAM is to enhance the attention on local details while preserving the global context, particularly in occluded scenarios. The global branch, utilizing the Vision Transformer (ViT), captures long-range dependencies and overall background information, while the local branch employs deep convolution to recover occluded-target features. In complex scenarios, occlusions often hinder traditional methods from effectively extracting the occluded parts of the target, but the GLOAM’s structured approach effectively handles these challenges.
In the local branch, depthwise–pointwise (DP) convolutions with kernel sizes such as 3, 5, and 7 are integrated with average pooling to facilitate multi-scale feature extraction. These features are then unified through up-sampling and elementwise addition, further enhanced by a multi-layer perceptron (MLP). The global branch employs embedded patches, normalization, multi-head attention, and MLPs to capture comprehensive global features. Outputs from both branches are combined via pointwise multiplication and addition, merging local and global information and improving the network’s detection capabilities across various scales. Dimensional alignment between layers ensures effective feature fusion, maintaining or adjusting the output feature map size as necessary, as represented mathematically in Equations (2)–(4).
W_L = \sum_{i \in \{3, 5, 7\}} \mathrm{MLP}\left( \mathrm{U}\left( \mathrm{AvgPool}\left( \mathrm{DP}\left( X;\ \mathrm{kernel\ size} = i \right) \right) \right) \right)  (2)
W_G = \mathrm{MLP}\left( \mathrm{MHA}\left( \mathrm{Norm}\left( \mathrm{Embed}(X) \right) \right) + \mathrm{Embed}(X) \right)  (3)
F_{output} = (W_G + W_L) \times X  (4)
In this context, X represents the input feature map obtained from the preceding layer’s processing, and F_output denotes the final output feature map resulting from the fusion of the local and global branches. DP denotes a depthwise–pointwise convolution, U denotes up-sampling, MHA denotes multi-head attention, and Embed denotes patch embedding. This fusion integrates multi-scale convolutional features with global contextual information, thereby enhancing the model’s feature representation capability.
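The following PyTorch sketch illustrates one way Equations (2)–(4) could be realized; the GLOAM class below, the patch size, head count, and pooling factor are illustrative assumptions and not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLOAM(nn.Module):
    """Sketch of a dual-branch (local + global) occlusion-handling attention block."""

    def __init__(self, channels, patch=8, heads=4):
        super().__init__()
        self.patch = patch
        # Local branch: depthwise-pointwise (DP) convolutions at kernel sizes 3, 5, 7
        self.dp = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
                nn.Conv2d(channels, channels, 1),
            )
            for k in (3, 5, 7)
        ])
        self.local_mlp = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.GELU(), nn.Conv2d(channels, channels, 1)
        )
        # Global branch: patch embedding + multi-head self-attention + MLP
        self.embed = nn.Conv2d(channels, channels, patch, stride=patch)
        self.norm = nn.LayerNorm(channels)
        self.mha = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.global_mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels)
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # W_L (Eq. (2)): multi-scale DP conv -> average pooling -> up-sampling -> MLP
        w_l = 0
        for dp in self.dp:
            y = F.avg_pool2d(dp(x), 2)
            y = F.interpolate(y, size=(h, w), mode="nearest")
            w_l = w_l + self.local_mlp(y)
        # W_G (Eq. (3)): embedded patches -> Norm -> MHA (+ residual) -> MLP
        tokens = self.embed(x).flatten(2).transpose(1, 2)          # (B, N, C)
        t = self.norm(tokens)
        attn, _ = self.mha(t, t, t)
        g = self.global_mlp(attn + tokens)
        g = g.transpose(1, 2).reshape(b, c, h // self.patch, w // self.patch)
        w_g = F.interpolate(g, size=(h, w), mode="nearest")
        # Fused output (Eq. (4)): (W_G + W_L) * X
        return (w_g + w_l) * x


feat = torch.randn(1, 128, 40, 40)
print(GLOAM(128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```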

3.4. SOD-Layer

The SOD-layer incorporates a series of GLOAM modules arranged in sequence, enhancing feature extraction for underwater small-target detection by integrating convolution layers such as Conv(128,1,1) and Conv(128,3,2). These layers adjust feature map dimensionality and facilitate information integration, improving the flow and coherence of data through the model.
Initially, the SOD-layer employs Conv(128,1,1) for channel compression to reduce computational demands, followed by sequential GLOAM modules, which leverage local and global attention mechanisms to discern complex interactions between targets and backgrounds. Intermediate features are further refined by using the C3_DPN_3(128,false) module, which utilizes depthwise separable convolutions for enhanced performance and efficiency. Subsequently, Conv(128,3,2) serves as a transitional layer to adjust feature map dimensions, preparing inputs for additional GLOAM modules. After processing by the GLOAM modules, the head network finalizes the detection outputs, optimizing each phase from feature extraction to detection for the precise identification of small targets under challenging underwater conditions. The network architecture employs CARAFE for up-sampling, a method distinguished by its ability to aggregate contextual information and generate adaptive convolutional kernels, effectively preserving semantic detail for complex underwater target detection tasks, as depicted in Figure 4.
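As a reference for the up-sampling step mentioned above, the snippet below sketches a simplified CARAFE-style content-aware up-sampler (kernel prediction followed by kernel-weighted reassembly); the CARAFE class, its hyper-parameters, and the implementation details are assumptions for illustration, not the exact operator used in the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CARAFE(nn.Module):
    """Simplified content-aware up-sampler: predict per-position reassembly kernels,
    then take kernel-weighted sums of k x k neighborhoods of the low-res feature map."""

    def __init__(self, channels, scale=2, k_up=5, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)
        self.encoder = nn.Conv2d(c_mid, scale * scale * k_up * k_up, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # Kernel prediction: compress -> encode -> pixel shuffle -> softmax normalization
        kernels = F.pixel_shuffle(self.encoder(self.compress(x)), s)      # (B, k*k, sH, sW)
        kernels = F.softmax(kernels, dim=1)
        # Reassembly: unfold k x k neighborhoods, map each to its s x s output positions
        patches = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)                # (B, C, sH, sW)


print(CARAFE(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 80, 80])
```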

3.5. Anchor Point Selection Strategy

Anchor boxes are pivotal in anchor-based object detection models, influencing the model’s capacity to accurately determine object position, size, and shape. YoloV5 employs anchor boxes with fixed sizes and ratios, which may not adapt effectively to varied datasets and application scenarios, such as the CLfish-V1 dataset characterized by significant scale variation and the high density of target objects. This study advocates the use of K-means clustering to automatically derive anchor box distributions from annotations, optimizing object detection across diverse conditions, as demonstrated in Figure 5.
The mathematical foundation of using Intersection over Union (IoU) as a distance metric is based on the definition of IoU to calculate the similarity between two bounding boxes. IoU measures the ratio of the intersection to the union of two bounding boxes. Specifically, if we have two bounding boxes A and B, where A represents the predicted bounding box and B represents the ground-truth bounding box, the IoU calculation Formula (5) is
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}  (5)
In this context, |A∩B| represents the intersection area of bounding boxes A and B, while |A∪B| denotes the union area of A and B. Intersection over Union (IoU) as a metric emphasizes the degree of overlap between bounding boxes, providing a more direct and effective evaluation method for object detection. Since the centroid of each cluster represents the average width and height of the bounding boxes belonging to that cluster, the corresponding anchor points can be derived from the centroids of each cluster. Given that Underwater-YOLO incorporates a detection layer specifically designed for small objects, comprising a total of four detection layers, the resulting set of anchor points should consist of four groups, as illustrated in Table 2.
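A minimal sketch of this IoU-driven K-means clustering is shown below; the helper names, the choice of 12 clusters (four detection layers × three anchors), and the random example boxes are assumptions for illustration.

```python
import numpy as np


def iou_wh(boxes, anchors):
    """IoU between (N, 2) box sizes and (K, 2) anchor sizes, top-left corners aligned."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union


def kmeans_anchors(boxes, k=12, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)        # nearest = highest IoU
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else anchors[j]
                        for j in range(k)])                       # centroid = mean (w, h)
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]              # sort anchors by area


# Example with random box sizes; 12 anchors = 4 detection layers x 3 anchors each
wh = np.abs(np.random.randn(500, 2)) * 80 + 20
print(kmeans_anchors(wh).round(2))
```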
Based on the clustering results of anchor point sizes, this study defines target categories of different sizes. The proposed model specifically addresses small target sizes, which are categorized into two groups as follows:
Tiny: the smallest size group, with pixel dimensions of approximately [39.67, 42.67], [40.67, 84.74], and [72.00, 64.00].

4. Experiment

4.1. Experimental Setup

4.1.1. Implementation Details

This study deployed the Underwater-YOLO model on an NVIDIA GeForce RTX 4090 with 24 GB of memory, using Python 3.8 and PyTorch 1.13.1. The optimizer, selectable between Adam for adaptive learning rates and SGD for scenarios necessitating momentum and Nesterov updates, was dynamically adjusted based on the batch size, with learning rates optimized through step decay or cosine decay mechanisms. This approach facilitates precise parameter settings across the various training phases (initial, frozen, and unfrozen) to enhance training efficiency and model performance in complex underwater environments, ensuring robustness and practical applicability.
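For concreteness, the snippet below sketches such an optimizer and learning-rate-schedule setup in PyTorch; the specific learning-rate values, the batch-size scaling rule, and the schedule length are illustrative assumptions rather than the exact training configuration.

```python
import torch


def build_optimizer(model, batch_size, use_adam=True, base_lr=1e-3, momentum=0.937):
    # Scale the learning rate with the batch size (assumed linear scaling rule)
    lr = base_lr * batch_size / 64
    if use_adam:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
    else:
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum, nesterov=True)
    # Cosine decay of the learning rate; a StepLR schedule could be swapped in instead
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=300)
    return opt, sched


# Example: optimizer and schedule for a small dummy model
opt, sched = build_optimizer(torch.nn.Conv2d(3, 16, 3), batch_size=16)
```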

4.1.2. Datasets

To evaluate the model’s proficiency in detecting and localizing small, densely featured underwater targets, this study developed a custom dataset, CLfish-V1, focusing on diverse fish species. As depicted in its pie chart (Figure 6), CLfish-V1 provides a statistical breakdown emphasizing small targets with dense data characteristics, with the top five categories comprising 4.8%, 4.8%, 4.2%, 3.9%, and 3.4% of the dataset, highlighting a substantial presence of densely populated small fish species. Figure 7 illustrates that many dataset samples feature small fish occupying limited pixel areas, alongside species displaying group behaviors in natural settings. This dataset is tailored for refining image analysis techniques for small- and dense-target detection, enhancing research in underwater biodiversity monitoring and ecological conservation.
The dataset used in this study was captured with three different devices: an iPhone 12 (Apple Inc., Cupertino, CA, USA), a HUAWEI Mate 50 (Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China), and a Canon DSLR camera (Canon Inc., Tokyo, Japan). These devices provided underwater images of varying resolutions. The conditions under which the images were taken are outlined in Table 3, which includes details on light intensity, water transparency, and underwater depth of field.
(1) Definition of the density feature of targets
This study defines the target-density feature of the dataset by calculating the ratio of the total pixel area of the annotated bounding boxes to the pixel area of the original image. Let A_1, A_2, …, A_n represent the pixel areas of the annotated bounding boxes and A_image denote the total pixel area of the original image. The ratio R of the total area of all annotated bounding boxes to the area of the original image can be expressed by Formula (6):
R = \frac{\sum_{i=1}^{n} A_i}{A_{image}}  (6)
This formula directly expresses the ratio of the total pixel area covered by all annotated bounding boxes to the area of the entire image, allowing for the assessment of the relative size of the bounding boxes within the image. A larger ratio indicates a higher density of targets, while a smaller ratio suggests a sparser distribution of targets.
Based on the ratio of the sum of pixel areas of the annotated bounding boxes to the area of the original image, this study defines the characteristics of dense-target populations as having a ratio of 70% or higher.
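A minimal sketch of this density criterion (Equation (6) with the 70% threshold) is given below; the function and variable names and the example areas are illustrative.

```python
def is_dense(box_areas, image_area, threshold=0.70):
    """Return (dense?, R) where R = sum(A_i) / A_image as in Equation (6)."""
    r = sum(box_areas) / image_area
    return r >= threshold, r


dense, r = is_dense([90000, 80000, 60000], image_area=640 * 480)
print(dense, round(r, 3))  # True 0.749 -> treated as a dense-target sample
```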
(2) Definition of the small feature of targets
In traditional object detection tasks, there is often a significant variation in target size. To make the detection of small targets more flexible, this study does not rely on a fixed area threshold but instead defines small targets based on a relative comparison of bounding-box areas. By comparing the area of each target with this ratio, small targets are identified.
Let the area of a target’s bounding box be A_bbox, the area of the smallest target bounding box be A_min, the area of the largest target bounding box be A_max, and the total pixel area of the image be A_total.
The ratio of the smallest bounding box area to the largest bounding-box area is calculated to dynamically set a threshold. The ratio of the smallest bounding-box area to the largest bounding-box area can be expressed as in Equation (7):
\mathrm{Area}_{ratio} = \frac{A_{min}}{A_{max}}  (7)
This ratio is used to set a relative standard that represents the size ratio of the smallest target compared with the largest target. Thus, based on a comparison of the bounding-box area and the threshold set by the ratio of the smallest and largest bounding-box areas, small targets are defined. This comparison can be expressed as in Equation (8):
A_{bbox} < \mathrm{Area}_{ratio} \times A_{total}  (8)
Through the above method, this study proposes a dynamic definition of small targets based on relative area ratios. First, the ratio of the smallest bounding-box area to the largest bounding-box area is calculated to set a dynamic threshold. Then, by comparing the bounding-box area to this threshold, it is determined whether the target is a small target. This method, based on relative size, provides greater flexibility in detecting small targets across various scales and densities and can effectively adapt to complex scenarios in small-target detection.
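The rule of Equations (7) and (8) can be expressed in a few lines; the following sketch uses illustrative names and example box areas.

```python
def small_target_mask(box_areas, image_area):
    """Flag boxes with area below (A_min / A_max) * A_total, cf. Equations (7)-(8)."""
    area_ratio = min(box_areas) / max(box_areas)   # Eq. (7): dynamic, image-relative ratio
    threshold = area_ratio * image_area            # size threshold for this image
    return [a < threshold for a in box_areas]      # Eq. (8)


areas = [900, 1500, 40000, 120000]
print(small_target_mask(areas, image_area=640 * 480))  # [True, True, False, False]
```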

4.2. Comparison of Heatmaps Between CSP-DDC and Other Backbone Networks

Underwater fish images were fed into the various backbone networks, and the feature output layers of each backbone were visualized as heatmaps, as illustrated in Figure 8.
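One common way to render such feature-map heatmaps is to average a backbone stage’s activations over channels, resize them to the image size, and normalize; the sketch below illustrates this approach as an assumption about the visualization procedure, not the authors’ exact tooling.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt


def feature_heatmap(feature, image_hw):
    """feature: (1, C, h, w) tensor taken from a backbone output layer."""
    heat = feature.mean(dim=1, keepdim=True)                       # average over channels
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)  # normalize to [0, 1]
    return heat[0, 0].detach().cpu().numpy()


# Example with a dummy backbone feature map
overlay = feature_heatmap(torch.randn(1, 256, 20, 20), (640, 640))
plt.imshow(overlay, cmap="jet")
plt.colorbar()
plt.show()
```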
The heatmaps reveal that CSPdarkNet [22] displays dispersed high-response zones for target regions, while ConvNext-tiny [23] localizes targets more densely but sometimes misses unstructured objects. In contrast, Swin-Transformer-Tiny’s [24] heatmap shows focused response areas, indicating refined regional attention, albeit with occasional imprecise boundary delineation. The CSP-DDC model’s heatmap exhibits broader, intense response zones, particularly effective in unstructured target detection, underscoring the benefits of integrating dilated deformable convolutions (DDCs), which conform convolutional kernels to object morphology, enhancing the receptive field’s dynamic adaptability.
As shown in Table 4, this study evaluated four backbone architectures on the CLfish-V1 validation set to assess their underwater fish target detection capabilities. CSP-DDC + PANet outperformed all models, achieving the highest average precision (AP50 of 0.863) and notable accuracy at higher thresholds (AP75 of 0.653), attributed to its dilated deformable convolution (DDC) technology, which adjusts the convolutional kernel to better capture small, irregularly shaped targets. This enhancement significantly improved unstructured small-object recognition, with CSP-DDC + PANet outperforming the next best model, Swin-Transformer-Tiny + PANet, by approximately 1.65% in AP50 and surpassing ConvNext-small + PANet and CSPdarknet + PANet by 6.81% and 17.26%, respectively. These results demonstrate the effectiveness of the CSP-DDC technology in enhancing detection precision for unstructured small targets, particularly through its improved receptive field and adaptive convolutional kernels.

4.3. Visual Comparison of Underwater-YOLO

In this study, Underwater-Yolo was employed to address the complex problem of underwater image recognition, and its performance was demonstrated through visual experiments across multiple scenarios. Figure 9 illustrates that the Underwater-Yolo model (CSP-DDC + SOD + GLOAM) outperforms other models.
As illustrated in Figure 9, Underwater-Yolo demonstrated superior performance in various detection challenges compared with other algorithms. Specifically, it was the only algorithm to successfully detect occluded fish (middle panels in Figure 9) and accurately identified all targets in a multi-target scenario (top panels in Figure 9), where algorithms like DETR and YoloV8-s were subject to missed detection and false positives.
Additionally, in the challenge of detecting small targets (bottom panels in Figure 9), Underwater-Yolo uniquely succeeded in identifying all small targets visible in the image, whereas other algorithms failed. This study evaluated six network architectures, namely YoloX-s [25], YoloV8-s [26], YoloV10-s [27], Resnet101+DETR [28], Resnet101+RT-DETR [29], and Underwater-Yolo (CSP-DDC + SOD + GLOAM), on their target detection efficacy using the CLfish-V1 dataset, as summarized in Table 5. Underwater-Yolo demonstrates significant advantages in Table 5, particularly in the AP50 and AP75 metrics, achieving scores of 0.938 and 0.889, respectively, which surpass those of other models. This superior performance is primarily attributed to the integration of several key techniques: depthwise separable convolutions, a specialized small-object detection layer, and a dual-branch occlusion-handling attention mechanism. The use of depthwise separable convolutions reduces the model’s computational complexity, making it more efficient in processing large or complex images. The small-object detection layer is specifically optimized for identifying small-sized targets, as reflected in its AP-small score of 0.764, which outperforms other models. Meanwhile, the dual-branch occlusion-handling attention mechanism significantly enhances the model’s ability to recognize occluded targets, a crucial feature in underwater environments, where visual occlusion is common.
In contrast, other models, such as YoloV10 and Resnet101+DETR, while performing well in the AP50 and AP75 metrics, still lag behind Underwater-Yolo. For example, YoloV10 achieves an AP50 score of 0.869, while Resnet101+DETR achieves an AP75 score of 0.649, indicating that Underwater-Yolo outperforms them in detecting targets in complex underwater scenarios. The integrated technologies, particularly those optimized for challenges unique to underwater environments, enable Underwater-Yolo to excel in small-object detection and occlusion handling, thereby significantly improving overall detection accuracy.
Underwater-Yolo has FLOPs of 26.858G and a parameter count of 9.815M, which make it significantly more computationally efficient compared with Resnet101+DETR (with FLOPs of 208.95G and a parameter count of 55.704M). To verify the detection capability of Underwater-Yolo in real time, this study conducted validation experiments, the results of which are shown in Table 6.
Table 6 shows that Underwater-Yolo achieves twice the frames per second (FPS) of Resnet101+DETR, with significantly lower GPU usage. Although the FPS of Underwater-Yolo is lower compared with that of Resnet101+RT-DETR, which is specifically designed for real-time detection, Underwater-Yolo’s lower GPU usage makes it more suitable for deployment in practical applications.

4.4. Limitation Experiments

To further evaluate the generalization ability of the Underwater-Yolo model, this study designed test experiments in different underwater environments. Several underwater scenarios were selected for testing, including color-biased, atomized, and low-contrast scenes, to investigate the feasibility and stability of the model in practical applications.
As shown in Figure 10, the different underwater scenes in the images present various potential challenges, such as insufficient lighting, image blur, and color distortion. However, Underwater-Yolo still demonstrates good detection capability across different underwater environments. Underwater-Yolo effectively addresses these challenges through its unique optimization strategies, such as dilated deformable convolutions, Small-Object Detection Layers, and the dual-branch occlusion-handling attention mechanism. This indicates that the model has strong generalization ability when facing different underwater scenarios, such as varying depth, lighting condition, and water quality.
Although Underwater-Yolo demonstrates strong generalization performance across various underwater scenes, it still faces limitations, particularly when dealing with highly dynamic and high-noise underwater images, as shown in Figure 11. Under such challenging conditions, the model may still experience false negatives (missed detection) and false positives (incorrect detection). These issues arise due to the inherent difficulties in distinguishing targets from noisy backgrounds or rapidly changing environments, where the visibility and clarity of objects are significantly reduced. The model’s reliance on fixed feature extraction techniques may struggle to handle extreme variations in lighting, motion blur, and the presence of environmental disturbances like bubbles or particles, which can degrade detection accuracy. Further improvements, such as enhanced noise-robustness mechanisms or adaptive learning strategies, could help mitigate these challenges and improve performance under more demanding underwater conditions.

4.5. Ablation Experiments

This study elucidated the algorithm optimization process by conducting ablation experiments to assess the enhancements Underwater-Yolo brings to underwater target detection accuracy. Table 7 displays results from comparative experiments, highlighting the impact of optimization and ablation on Underwater-Yolo’s performance. The experiments aimed to evaluate the contributions of three critical technical modules (CSP-DDC, GLOAM, and the SOD-layer) to the network’s efficacy. Four network variants were tested on the CLfish-V1 validation set to delineate the individual and combined influences of these technologies on target detection accuracy.
The experimental design included four model configurations: a fully functional model (comprising CSP-DDC, GLOAM, and SOD-layer), a model excluding CSP-DDC, a model omitting both CSP-DDC and GLOAM, and a baseline model that lacks any enhancement components. Each configuration was evaluated based on various performance metrics (AP50, AP75, and AP-small) to observe the specific impacts of the following different technical components:
  • CSP-DDC is specifically designed for detecting unstructured small targets, enhancing the model’s ability to detect these small targets by improving its receptive field.
  • GLOAM primarily addresses occlusion issues between targets in dense scenes, aiding the model in more accurately identifying and localizing targets in complex environments.
  • The SOD-layer is a feature extraction layer tailored for small targets, specifically aimed at enhancing the model’s recognition capabilities for small-sized objects.
As shown in Figure 12, the experimental findings confirm that Model 3, integrating CSP-DDC, GLOAM, and the SOD-layer, outperforms the other models in all the evaluated metrics, with AP50, AP75, and AP-small scores of 0.938, 0.889, and 0.764, respectively. This illustrates the crucial role of CSP-DDC in improving the detection of small targets, as its removal in Model 2 resulted in reduced performance (AP50: 0.922; AP75: 0.868; AP-small: 0.744). The further exclusion of the GLOAM in Model 1 led to more significant declines (AP50: 0.888; AP75: 0.826; AP-small: 0.709), underscoring the GLOAM’s importance in managing occlusions. The baseline model, without any specialized components, exhibited substantially lower performance, highlighting the collective enhancement these technologies bring to underwater object detection, particularly for small and occluded targets. The visualization of the ablation experiments clearly demonstrates the individual and combined impacts of CSP-DDC, GLOAM, and the SOD-layer on the Underwater-Yolo network’s performance. The heatmaps from these experiments delineate the progressive improvements in object detection capabilities across the different configurations, particularly showing how the integration of these components optimizes detection accuracy in challenging underwater scenarios, effectively tackling occlusions and enhancing target recognition.

5. Discussion

Although Underwater-Yolo shows good performance, particularly in AP50, AP75, and AP-small, several limitations remain. Despite improvements in small-object detection (AP-small = 0.764), the model still struggles in high-density target and occlusion scenarios. The introduction of the GLOAM and SOD-layer modules helps with occlusion and small-object detection, but performance can degrade in cases of dense-object overlap and occlusion. While the CSP-DDC backbone offers a good balance of performance and efficiency, challenges remain in optimizing AP-small, particularly in extremely high-density environments.
Future research should focus on improving performance in complex scenarios, such as incorporating frequency information to enhance the algorithm’s detection capabilities. Expanding the dataset to include more diverse underwater conditions, particularly in low light and turbid water, will improve the model’s generalization. Additionally, optimizing real-time performance and computational efficiency, possibly through lightweight networks or edge computing solutions, will be crucial for deployment in resource-constrained environments.

6. Conclusions

This study introduces Underwater-Yolo, an advanced underwater object detection network designed to effectively handle challenges such as densely packed targets, small objects, and complex occlusions. The network integrates a dual-branch occlusion-handling attention mechanism (GLOAM), a dilated deformable convolutional backbone (CSP-DDC), and a Small-Object Detection layer (SOD-layer) to significantly enhance detection accuracy under challenging conditions. The use of dilated deformable convolutions (DDCs) broadens the receptive field, improving the detection of small and unstructured objects. Furthermore, the CARAFE up-sampling technique aggregates contextual information to improve performance, while the GLOAM module optimizes occlusion handling by merging global and local feature attention mechanisms. Experimental evidence from the CLfish-V1 dataset shows that Underwater-Yolo surpasses existing one-stage detectors, marking substantial gains in the AP50, AP75, and AP-small metrics. The ablation studies underscore the effectiveness of CSP-DDC, the GLOAM, and the SOD-layer, affirming the architecture’s suitability for practical underwater object detection applications and suggesting avenues for future enhancements in dynamic environment adaptability and real-time processing efficiency.

Author Contributions

Data curation, Z.L. and B.Z.; formal analysis, Z.L. and B.Z.; investigation, Z.L. and B.Z.; methodology, Z.L. and B.Z.; resources, Z.L. and B.Z.; writing—original draft, D.C., Z.L., W.Z., H.L., B.Z., Z.Z. and W.F.; writing—review and editing, D.C., Z.L., W.Z., H.L., B.Z., Z.Z., W.F., J.D., X.Z. and Y.Z.; Z.L. and B.Z. are the co-first authors. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai) under grant SML2022SP101.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the exemption granted by the center director as it only uses a self-made underwater fish dataset that does not involve human or animal ethics issues in a traditional sense.

Informed Consent Statement

Not applicable.

Data Availability Statement

Due to the confidential nature of the institution’s documents, the dataset for this study cannot be disclosed to the public.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Er, M.J.; Chen, J.; Zhang, Y.; Gao, W. Research challenges, recent advances, and popular datasets in deep learning-based underwater marine object detection: A review. Sensors 2023, 23, 1990. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, X.; Fang, X.; Pan, M.; Yuan, L.; Zhang, Y.; Yuan, M.; Lv, S.; Yu, H. A marine organism detection framework based on the joint optimization of image enhancement and object detection. Sensors 2021, 21, 7205. [Google Scholar] [CrossRef] [PubMed]
  3. Peng, F.; Miao, Z.; Li, F.; Li, Z. S-FPN: A shortcut feature pyramid network for sea cucumber detection in underwater images. Expert Syst. Appl. 2021, 182, 115306. [Google Scholar] [CrossRef]
  4. Shi, P.; Xu, X.; Ni, J.; Xin, Y.; Huang, W.; Han, S. Underwater biological detection algorithm Based on improved faster-RCNN. Water 2021, 13, 2420. [Google Scholar] [CrossRef]
  5. Chen, J.; Er, M.J.; Zhang, Y.; Gao, W.; Wu, J. Novel dynamic feature fusion strategy for detection of small underwater marine object. In Proceedings of the 2022 5th International Conference on Intelligent Autonomous Systems (ICoIAS), Dalian, China, 23–25 September 2022; pp. 24–30. [Google Scholar]
  6. Yu, Y.; Zhao, J.; Gong, Q.; Huang, C.; Zheng, G.; Ma, J. Real-time underwater maritime object detection in side-scan sonar images based on transformer-YOLOv5. Remote Sens. 2021, 13, 3555. [Google Scholar] [CrossRef]
  7. Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on yolo v4 and multi-scale attentional feature fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
  8. Gao, J.; Zhang, Y.; Geng, X.; Tang, H.; Bhatti, U.A. PE-Transformer: Path enhanced transformer for improving underwater object detection. Expert Syst. Appl. 2024, 246, 123253. [Google Scholar] [CrossRef]
  9. Chen, D.; Gou, G. SFDet: Spatial to frequency attention for small-object detection in underwater images. J. Electron. Imaging 2024, 33, 023057. [Google Scholar] [CrossRef]
  10. Li, X.; Yu, H.; Chen, H. Multi-scale aggregation feature pyramid with cornerness for underwater object detection. Vis. Comput. 2024, 40, 1299–1310. [Google Scholar] [CrossRef]
  11. Qu, S.; Cui, C.; Duan, J.; Lu, Y.; Pang, Z. Underwater small target detection under YOLOv8-LA model. Sci. Rep. 2024, 14, 16108. [Google Scholar] [CrossRef]
  12. Chen, G.; Mao, Z.; Wang, K.; Shen, J. HTDet: A hybrid transformer-based approach for underwater small object detection. Remote Sens. 2023, 15, 1076. [Google Scholar] [CrossRef]
  13. Sun, Y.; Zheng, W.; Du, X.; Yan, Z. Underwater small target detection based on yolox combined with mobilevit and double coordinate attention. J. Mar. Sci. Eng. 2023, 11, 1178. [Google Scholar] [CrossRef]
  14. Chen, L.; Zhou, F.; Wang, S.; Dong, J.; Li, N.; Ma, H.; Wang, X.; Zhou, H. SWIPENET: Object detection in noisy underwater scenes. Pattern Recognit. 2022, 132, 108926. [Google Scholar] [CrossRef]
  15. Qi, S.; Du, J.; Wu, M.; Yi, H.; Tang, L.; Qian, T.; Wang, X. Underwater small target detection based on deformable convolutional pyramid. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 7–13 May 2022; pp. 2784–2788. [Google Scholar]
  16. Ji, X.; Chen, S.; Hao, L.-Y.; Zhou, J.; Chen, L. FBDPN: CNN-Transformer hybrid feature boosting and differential pyramid network for underwater object detection. Expert Syst. Appl. 2024, 256, 124978. [Google Scholar] [CrossRef]
  17. Wang, J.; Li, Y.; Wang, J.; Li, Y. An Underwater Dense Small Object Detection Model Based on YOLOv5-CFDSDSE. Electronics 2023, 12, 3231. [Google Scholar] [CrossRef]
  18. Xu, X.; Liu, Y.; Lyu, L.; Yan, P.; Zhang, J. MAD-YOLO: A quantitative detection algorithm for dense small-scale marine benthos. Ecol. Inform. 2023, 75, 102022. [Google Scholar] [CrossRef]
  19. Liu, K.; Peng, L.; Tang, S. Underwater object detection using TC-YOLO with attention mechanisms. Sensors 2023, 23, 2567. [Google Scholar] [CrossRef]
  20. Li, J.; Liu, C.; Lu, X.; Wu, B. CME-YOLOv5: An efficient object detection network for densely spaced fish and small targets. Water 2022, 14, 2412. [Google Scholar] [CrossRef]
  21. Mathias, A.; Dhanalakshmi, S.; Kumar, R. Occlusion aware underwater object tracking using hybrid adaptive deep SORT-YOLOv3 approach. Multimed. Tools Appl. 2022, 81, 44109–44121. [Google Scholar] [CrossRef]
  22. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-YOLOv4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13029–13038. [Google Scholar]
  23. Li, J.; Wang, C.; Huang, B.; Zhou, Z. ConvNeXt-backbone HoVerNet for nuclei segmentation and classification. arXiv 2022, arXiv:2202.13560. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  26. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  27. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  29. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
Figure 1. Overall network architecture of Underwater-Yolo. The symbol * denotes multiplication.
Figure 2. CSP-DDC module; pathways with different dilation rates are indicated in different colors.
Figure 3. Dual-branch occlusion-handling attention mechanism (GLOAM); the local and global branches are indicated in different colors.
Figure 4. Small-Object Detection layer (SOD-layer). The symbol * denotes multiplication.
Figure 5. Distribution of annotation boxes; each color represents the size range of one class of annotation box, and x marks the central anchor point selected for that class.
Figure 6. Data statistics of the CLfish-V1 dataset.
Figure 7. Target feature representation of the CLfish-V1 dataset: (a) characteristics of dense-target groups; (b) features of small targets. Red boxes indicate the position and size of the target annotation boxes.
Figure 8. Heatmap visualization of CSP-DDC compared with other backbone networks; darker colors indicate stronger feature extraction in the corresponding image region.
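Heatmaps of the kind shown in Figure 8 are typically produced by averaging a backbone stage's activations, normalizing them, and overlaying them on the input image. The snippet below sketches this procedure; the stand-in backbone (torchvision ResNet-50), the hooked layer, the image path, and the colormap are assumptions, not the visualization pipeline used in the paper.

```python
# Illustrative sketch: feature-map heatmap overlay from one backbone stage.
# The backbone, hooked layer, and file paths are placeholders/assumptions.
import cv2
import numpy as np
import torch
from torchvision.models import resnet50  # stand-in backbone

model = resnet50().eval()
feats = {}
model.layer3.register_forward_hook(lambda m, i, o: feats.update(out=o))

img = cv2.imread("underwater_sample.jpg")  # placeholder path (BGR, uint8)
x = torch.from_numpy(img[:, :, ::-1].copy()).permute(2, 0, 1).float()[None] / 255.0

with torch.no_grad():
    model(x)

# Channel-wise mean -> normalize to [0, 1] -> resize to the image size.
heat = feats["out"][0].mean(0).numpy()
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
heat = cv2.resize(heat, (img.shape[1], img.shape[0]))
heat = cv2.applyColorMap((heat * 255).astype(np.uint8), cv2.COLORMAP_JET)

overlay = cv2.addWeighted(img, 0.5, heat, 0.5, 0)
cv2.imwrite("heatmap_overlay.jpg", overlay)
```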
Figure 9. Visualization of detection results from Underwater-Yolo compared with other single-stage object detection networks.
Figure 10. Visualization of detection results for Underwater-Yolo in different underwater environments (a–c): (a) low-contrast scene, (b) color-biased scene, and (c) atomized scene.
Figure 11. Limitations of Underwater-Yolo: (a) a highly dynamic scene and (b) a high-noise scene.
Figure 12. Heatmap visualization for the ablation study of the various models; darker colors indicate stronger feature extraction in the corresponding image region. (1)–(3) correspond to ablation models 1–3 of the ablation experiment.
Table 1. Contributions and limitations of research on underwater target detection.

| Method | Author | Algorithm | Limitations |
|---|---|---|---|
| Common underwater object detection | Shi, P. et al. (2021) [4] | Faster-RCNN algorithm for underwater detection, utilizing ResNet, BiFPN, EIoU, and K-means++ for better feature extraction and scale integration. | Challenges in detecting small and densely packed organisms, as well as handling occlusion, due to the inherent limitations of the IoU-based bounding-box and anchor generation techniques. |
| | Zhang, M. et al. (2021) [7] | Lightweight underwater object detection method based on MobileNet v2, YOLOv4, and attentional feature fusion (AFFM). | Struggles with small-target detection and occlusion due to limited feature extraction for densely packed targets. |
| Underwater small-target detection | Gao, J. et al. (2024) [8] | Path-augmented Transformer detection framework enhances semantic details of small-scale underwater targets. | Struggles with densely occluded underwater objects, lacking robust feature selection in cluttered scenes. |
| | Chen, D. et al. (2024) [9] | SFDet employs a spatial-to-frequency domain attention mechanism optimized for small-object detection in underwater images. | Struggles with dense occlusion handling due to the focus on spatial and frequency domains. |
| | Li, X. et al. (2024) [10] | Multi-scale aggregation feature pyramid with cornerness for enhanced detection and recall of small underwater objects. | Struggles with dense occlusion, possibly due to limited contextual integration between closely packed targets. |
| | Qu, S. et al. (2024) [11] | YOLOv8-LA with LEPC module, AP-FasterNet, and CARAFE up-sampling. | Struggles with densely occluded targets despite improvements, due to the inherent challenges of detecting small, closely spaced underwater objects. |
| | Chen, G. et al. (2023) [12] | HTDet introduces a hybrid Transformer-based network with a fine-grained feature pyramid and test-time augmentation for efficient small-object detection underwater. | The model struggles with dense occlusions in dynamic marine environments due to the complex nature of underwater imagery. |
| | Sun, Y. et al. (2023) [13] | Underwater target detection model using MobileViT, YOLOX, and a new double-coordinate attention (DCA) mechanism to enhance feature extraction and improve detection accuracy. | The model still faces challenges in dense occlusion scenarios where small targets are easily lost due to complex underwater conditions. |
| | Chen, L. et al. (2022) [14] | SWIPENET with Curriculum Multi-Class Adaboost (CMA) targets underwater object detection, enhancing small-object detection and noise robustness. | The approach may struggle in environments with densely occluded targets, where distinguishing between overlapping objects and noisy backgrounds remains challenging. |
| | Qi, S. et al. (2022) [15] | The proposed Underwater Small-Target Detection (USTD) network employs a Deformable Convolutional Pyramid (DCP) and phased learning for domain generalization. | Struggles with highly occlusive environments and managing diverse, dense underwater scenarios without further adaptation. |
| Detection of densely occluded underwater targets | Ji, X. et al. (2024) [16] | FBDPN employs a CNN–Transformer hybrid, enhancing feature interaction across scales for better underwater object detection. | Struggles with handling unstructured small targets due to limited refinement between closely spaced multi-scale features. |
| | Wang, J. et al. (2023) [17] | YOLOv5-FCDSDSE, enhanced with the CFnet structure, Dyhead technology, a small-object detection layer, and an SE attention mechanism for optimized underwater object detection. | Struggles with highly unstructured environments despite improvements in scale, space, and task perception. |
| | Xu, X. et al. (2023) [18] | MAD-YOLO, an enhanced YOLOv5 with VOVDarkNet for feature extraction, AFC-PAN for feature fusion, and SimOTA for improved occlusion handling. | While effective for blurred, dense, and small-scale objects, it may struggle with extremely noisy underwater conditions or highly complex occlusions. |
| | Liu, K. et al. (2023) [19] | TC-YOLO combines YOLOv5s with Transformer self-attention, adaptive histogram equalization, and optimal transport label assignment for underwater object detection. | Potential limitations in handling unstructured small targets due to complex underwater imaging conditions and inherent algorithm constraints. |
| | Li, J. et al. (2022) [20] | An improved CME-YOLOv5 network integrates the coordinate attention mechanism and the C3CA module with expanded detection layers and the EIOU loss function to enhance detection of densely spaced fish and small targets. | The model struggles with dynamic identification and accurate detection in highly unstructured underwater environments where occlusion and density vary significantly. |
| | Mathias, A. et al. (2022) [21] | Hybrid Adaptive DeepSORT-YOLOv3 (HADSYv3) combines YOLOv3 for detection and adaptive deep SORT with LSTM for tracking occluded underwater objects. | Struggles with highly unstructured small targets due to reliance on fixed-size anchor boxes in YOLOv3 and potential scale variations. |
Table 2. Selected anchor point groups.

| Tiny | Small | Medium | Large |
|---|---|---|---|
| [39.67, 42.67] | [64.00, 118.52] | [154.00, 121.48] | [226.00, 286.22] |
| [40.67, 84.74] | [105.67, 90.07] | [141.67, 209.19] | [360.00, 226.37] |
| [72.00, 64.00] | [95.33, 157.04] | [231.00, 162.96] | [432.94, 368.59] |
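Anchor groups such as those in Table 2 are commonly derived by clustering the (width, height) pairs of the training annotations and splitting the sorted cluster centers across detection scales. The sketch below illustrates this idea with plain k-means from scikit-learn; the sample box list, the 12-cluster setting, and the Euclidean (non-IoU) distance are simplifying assumptions and may differ from the procedure actually used to obtain Table 2.

```python
# Illustrative sketch: deriving anchor groups by clustering annotated box
# sizes, as is common for YOLO-style detectors. The box list below is a
# placeholder; replace it with the real training annotations.
import numpy as np
from sklearn.cluster import KMeans

boxes_wh = np.array([[38.0, 44.0], [66.0, 120.0], [150.0, 118.0], [230.0, 280.0],
                     [41.0, 86.0], [104.0, 92.0], [140.0, 210.0], [358.0, 225.0],
                     [70.0, 63.0], [96.0, 155.0], [233.0, 160.0], [430.0, 370.0]])

kmeans = KMeans(n_clusters=12, n_init=10, random_state=0).fit(boxes_wh)
# Sort cluster centers by area so they can be split into size groups.
anchors = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_.prod(axis=1))]

# Split the sorted anchors into tiny / small / medium / large groups of three.
for name, group in zip(["tiny", "small", "medium", "large"], anchors.reshape(4, 3, 2)):
    print(name, np.round(group, 2))
```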
Table 3. Shooting parameters for the CLfish-V1 dataset. Light intensity: range of 0–255 for 8-bit images; transparency: represented as a ratio between 0 and 1; depth of field: no physical unit.

| | Light Intensity | Transparency | Depth of Field |
|---|---|---|---|
| Min | 21 | 1.0 | 0.38 |
| Max | 210 | 1.0 | 555 |
| Mean | 101 | 1.0 | 24 |
Table 4. Comparison of evaluation metrics for CSP-DDC and other backbone networks.

| | CSP-Darknet | ConvNeXt-Small | Swin-Transformer-Tiny | CSP-DDC |
|---|---|---|---|---|
| AP50 | 0.736 | 0.808 | 0.849 | 0.863 |
| AP75 | 0.506 | 0.578 | 0.625 | 0.653 |
| FLOPs | 79.963 G | 148.994 G | 79.914 G | 24.086 G |
| Parameters | 31.425 M | 53.039 M | 31.122 M | 8.706 M |
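FLOPs and parameter counts of the kind reported in Table 4 are usually obtained with a profiling utility. The sketch below uses the third-party thop package on a stand-in torchvision model with a 640 × 640 input; both the package choice and the input size are assumptions, and thop reports multiply–accumulate operations (MACs), which are often quoted as FLOPs.

```python
# Illustrative sketch: counting FLOPs (MACs) and parameters with the thop
# package. The stand-in model and 640x640 input are assumptions; substitute
# the actual detector to reproduce figures like those in Table 4.
import torch
from thop import profile
from torchvision.models import resnet50

model = resnet50()
dummy = torch.randn(1, 3, 640, 640)

macs, params = profile(model, inputs=(dummy,), verbose=False)
print(f"FLOPs (MACs): {macs / 1e9:.3f} G, Parameters: {params / 1e6:.3f} M")
```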
Table 5. Comparison of evaluation metrics for six object detection methods on the CLfish-V1 dataset.

| | YoloX-s | YoloV8-s | YoloV10-s | Resnet101+DETR | Resnet101+RT-DETR | Underwater-Yolo |
|---|---|---|---|---|---|---|
| AP50 | 0.647 | 0.841 | 0.869 | 0.864 | 0.891 | 0.938 |
| AP75 | 0.566 | 0.605 | 0.629 | 0.649 | 0.722 | 0.889 |
| AP-small | 0.571 | 0.542 | 0.650 | 0.654 | 0.704 | 0.764 |
| FLOPs | 26.92 G | 28.853 G | 21.6 G | 208.95 G | 259.60 G | 26.858 G |
| Parameters | 8.968 M | 11.173 M | 7.2 M | 55.704 M | 76.12 M | 9.815 M |
Table 6. Real-time detection performance comparison of Underwater-Yolo.

| | Resnet101+DETR | Resnet101+RT-DETR | Underwater-Yolo |
|---|---|---|---|
| FLOPs | 208.95 G | 259.60 G | 26.858 G |
| Parameters | 55.704 M | 76.12 M | 9.815 M |
| FPS | 30 | 74 | 64 |
| GPU memory | 560.45 MB | 660.15 MB | 214.53 MB |
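FPS and peak GPU memory figures such as those in Table 6 can be measured with a simple timing loop that synchronizes the GPU before and after inference. The sketch below shows one way to do this; the warm-up count, batch size of 1, 640 × 640 input, and stand-in model are assumptions, not the benchmarking protocol used for Table 6.

```python
# Illustrative sketch: measuring FPS and peak GPU memory for a detector.
# Plug in the actual model and test images to reproduce Table 6-style numbers.
import time
import torch
from torchvision.models import resnet50  # stand-in for the detector

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = resnet50().to(device).eval()
dummy = torch.randn(1, 3, 640, 640, device=device)

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    n = 100
    for _ in range(n):
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    fps = n / (time.perf_counter() - start)

print(f"FPS: {fps:.1f}")
if device.type == "cuda":
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.2f} MB")
```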
Table 7. Comparison of evaluation metrics for each model in the ablation study on the CLfish-V1 dataset. “√” indicates that the module is present, “×” that it is absent, and “↑” that higher values are better.

| Name | CSP-DDC | GLOAM | SOD-Layer | AP50↑ | AP75↑ | AP-small↑ |
|---|---|---|---|---|---|---|
| 3 | √ | √ | √ | 0.938 | 0.889 | 0.764 |
| 2 | √ | √ | × | 0.922 | 0.868 | 0.744 |
| 1 | √ | × | × | 0.888 | 0.826 | 0.709 |
| Baseline | × | × | × | 0.647 | 0.566 | 0.571 |
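The AP50, AP75, and AP-small metrics used throughout Tables 5 and 7 follow the COCO evaluation protocol and can be computed with pycocotools, as sketched below; the annotation and detection file names are placeholders, and the dataset must first be exported in COCO format.

```python
# Illustrative sketch: computing AP50, AP75, and AP-small with pycocotools.
# File names are placeholders for COCO-format ground truth and detections.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("clfish_v1_val_annotations.json")        # ground truth (placeholder path)
coco_dt = coco_gt.loadRes("underwater_yolo_dets.json")  # detections (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# COCOeval stats indices: 1 = AP@0.50, 2 = AP@0.75, 3 = AP-small (area < 32^2)
ap50, ap75, ap_small = evaluator.stats[1], evaluator.stats[2], evaluator.stats[3]
print(f"AP50={ap50:.3f}  AP75={ap75:.3f}  AP-small={ap_small:.3f}")
```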
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
