Article

AGS-YOLO: An Efficient Underwater Small-Object Detection Network for Low-Resource Environments

College of Information Engineering, Hebei University of Architecture, Zhangjiakou 075031, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(8), 1465; https://doi.org/10.3390/jmse13081465
Submission received: 9 June 2025 / Revised: 21 July 2025 / Accepted: 28 July 2025 / Published: 30 July 2025

Abstract

Detecting underwater targets is crucial for ecological evaluation and the sustainable use of marine resources. To enhance environmental protection and optimize underwater resource utilization, this study proposes AGS-YOLO, an innovative underwater small-target detection model based on YOLO11. Firstly, this study proposes AMSA, a multi-scale attention module, and optimizes the C3k2 structure to improve the detection and precise localization of small targets. Secondly, a streamlined GSConv convolutional module is incorporated to minimize the parameter count and computational load while effectively retaining inter-channel dependencies. Finally, a novel and efficient cross-scale connected neck network is designed to achieve information complementarity and feature fusion among different scales, efficiently capturing multi-scale semantics while decreasing the complexity of the model. In contrast with the baseline model, the method proposed in this paper demonstrates notable benefits for use in underwater devices constrained by limited computational capabilities. The results demonstrate that AGS-YOLO significantly outperforms previous methods in terms of accuracy on the DUO underwater dataset, with mAP@0.5 improving by 1.3% and mAP@0.5:0.95 improving by 2.6% relative to those of the baseline YOLO11n model. In addition, the proposed model also shows excellent performance on the RUOD dataset, demonstrating its competent detection accuracy and reliable generalization. This study proposes innovative approaches and methodologies for underwater small-target detection, which have significant practical relevance.

1. Introduction

The ocean, which is abundant in both biological and mineral resources, plays a pivotal and irreplaceable role in regulating the Earth’s climate and preserving global biodiversity. With the swift progress in underwater robotics, computer vision, and associated fields, underwater target detection has emerged as a critical enabler in marine resource exploration, ecological monitoring, and aquaculture [1]. However, due to the complexity of the marine environment—considering factors such as light attenuation in the water column and scattering by suspended particles, as well as underwater objects of different sizes and shapes—underwater target detection faces many challenges [2].
Current detection techniques mainly rely on sonar and optical images. While sonar-based methods are suitable for turbid or deep waters, their low resolution limits fine-grained recognition [3]. In contrast, high-resolution optical images captured with underwater cameras provide detailed textures and colors, making them ideal for precise target detection. Therefore, this work focuses on underwater target detection using optical imagery.
With technological advancements, underwater target detection methods based on optical images can be categorized into the following two main approaches: traditional computer vision algorithms and deep learning models. Traditional methods mainly rely on manually designed features to achieve target localization and identification. Conversely, target detection techniques using deep learning models can adaptively learn and extract deep target features, and they have gradually become the dominant approach for underwater target detection. Deep learning-based target detection algorithms are generally categorized into the following three types: two-stage, single-stage, and Transformer-based [4] approaches. These models have also achieved notable success in other remote sensing applications, such as crop disease detection using UAV imagery [5], further demonstrating their effectiveness in complex environments and inspiring their application in underwater scenarios. A representative two-stage algorithm is Faster R-CNN [6], which first generates candidate frames and then finely predicts the features using a classifier. Lin et al. [7] proposed RoiMix, a multi-image feature fusion strategy that enhances Faster R-CNN’s discriminative power to improve its detection accuracy under occlusion, overlap, and blurring in underwater scenes. Zeng et al. [8] integrated GANs into Faster R-CNN to enhance the detection of marine organisms in blurred, low-contrast underwater images. Typical representative single-stage algorithms include those in the YOLO series [9] and the SSD algorithm [10], which achieve real-time optimization while maintaining high detection speed by transforming the detection task into a single regression problem. Muksit et al. [11] proposed YOLO-Fish based on YOLOv3, achieving efficient fish detection on the DeepFish and OzFish datasets. Xu et al. [12] enhanced YOLOv5 by introducing a dynamic cross-scale attention module, improving the detection of low-contrast and occluded underwater targets. Yu et al. [13] proposed U-YOLOv7, which integrates channel attention, content-aware upsampling, and 3D attention mechanisms to enhance precision and robustness in complex underwater environments. Zhao et al. [14] proposed BGLE-YOLO, which has enhanced feature extraction using the EMC module, improved small-object detection performance via the BIG module, and reduced complexity due to the use of LSH and re-parameterization. Novel detection architectures based on Transformer open up new paths for underwater target detection. DETR [15], as a typical representative, adopts the encoder–decoder architecture to achieve end-to-end detection. Gao et al. [16] proposed PE-Transformer, which uses local path embedding and multi-level feature interaction to enhance the semantic representation of tiny underwater targets. These deep learning-based methods have gained significant accuracy and speed enhancements in the task of underwater target detection, thus enabling advancements in underwater target identification; however, they still face problems such as lower detection performance for small and low-visibility objects [17,18].
To overcome the aforementioned challenges, this study presents a novel detection architecture based on YOLO11n featuring multi-scale attention enhancement, high computational efficiency, and robust cross-scale feature fusion. The main contributions of this work are as follows:
(1)
Inspired by the idea of the PPA (Parallelized Patch-Aware Attention) module, the multi-scale aggregated attention module AMSA is proposed and integrated into the end of the C3k2 module. The AMSA module effectively enhances the model’s perception of tiny targets in complex underwater environments through multi-branch feature extraction, aggregating non-overlapping patches through a dual attention adaptive mechanism.
(2)
In order to solve the problem of information loss during feature space compression and channel expansion of underwater optical images, the lightweight convolutional module GSConv is introduced, which adopts structural mixing and feature compensation strategies to decrease the model’s parameter count and computational complexity while ensuring the maximum retention of the hidden dependencies among channels.
(3)
Drawing on the idea of RT-DETR, a novel and efficient lightweight cross-scale connected neck network is designed, which introduces two types of convolutional unification channels, namely, lateral convolution and input projection, to decrease the number of model parameters and the computational complexity. The information interaction between features of different sizes is enhanced via the bidirectional feature fusion structure and cross-scale linking, allowing the model to efficiently capture multi-scale semantics and enhancing the detection accuracy for targets of varying sizes.

2. Materials and Methods

2.1. YOLO11

In order to ensure that the resulting model has a sufficiently accurate and efficient real-time object detection capability, YOLO11 [19] was chosen as the benchmark model in this paper. As the latest iteration in the YOLO series, YOLO11 has made significant progress in target detection through the incorporation of multiple structural enhancements. The model not only inherits the advantages of the previous version in real-time detection but also focuses on optimizing the processing efficiency and feature representation capacity of the network architecture.
The network architecture of YOLO11, as illustrated in Figure 1, primarily consists of the following four key components: input, backbone, neck, and head. The input component is responsible for pre-processing the original image, including steps such as image scaling, normalization, and data augmentation to fit the network. The backbone network is the core of YOLO11 and is designed to extract rich feature representations from the input images. YOLO11 introduces the improved C3k2 module, which enhances the efficiency of multi-scale feature fusion by adjusting the convolutional kernel configuration. In addition, the new C2PSA feature enhancement layer, added after the SPPF module, significantly improves feature differentiation in complex scenes by establishing an inter-channel information interaction mechanism. The neck network primarily serves to further fuse and propagate the feature representations extracted by the backbone network. YOLO11 utilizes a feature pyramid network (FPN) alongside a path aggregation network (PANet) [20] to strengthen its multi-scale feature fusion ability, thereby enhancing the detection accuracy of small targets. The detection head is responsible for transforming the integrated feature information into the final detection result. YOLO11 adopts an anchor-free decoupled head design, where the classification branch uses depthwise separable convolution to reduce the computational load and the regression branch retains the standard convolutional structure to enhance localization accuracy. In addition, the Distribution Focal Loss (DFL) module further enhances the precision of bounding box regression.
Although YOLO12 [21], the most recent iteration in the YOLO series, introduces more advanced structural improvements in terms of feature extraction and detection accuracy, YOLO11 was still chosen as the benchmark model in this paper due to its lightweight design, stability, and maturity in practical deployment. YOLO11 maintains efficient inference speed while improving feature extraction and precise target detection through the improved C3k2 module and the C2PSA attention mechanism, effectively strengthening its multi-scale feature aggregation and small-object detection capabilities. Furthermore, it has a simple structure and low computational cost, is suitable for deployment on edge devices, and offers good scalability and community support, enabling further optimization and application.

2.2. Proposed Model

Despite its strong performance under conventional conditions, YOLO11 faces certain challenges in more complex applications, such as detecting densely distributed or small-scale objects. In order to overcome the inherent difficulties in underwater target detection under complex conditions, this study proposes the AGS-YOLO model, whose architectural design is shown in Figure 2. Compared with the benchmark model YOLO11, the main improvements in AGS-YOLO are implemented in the backbone and neck network.
Specifically, in the backbone part, AGS-YOLO integrates the multi-scale aggregated attention module AMSA into the end of the C3k2 module and introduces a three-branch structure. This structure is capable of aggregating features at different scales and dynamically focusing on targets in combination with non-overlapping patches and dual-attention adaptive mechanisms, which improves the model’s capability in terms of both dense and small-target detection. In addition, AGS-YOLO adopts a cross-scale fusion neck network CSFE, which unifies the channels using lateral links and input projection convolution modules to achieve efficient splicing, effectively minimizing the parameter count and computational complexity. Additionally, a symmetric cross-scale feature fusion structure enables a closed-loop feature enhancement pathway, allowing the feature map to capture both shallow local details and deep global semantics through cross-scale interactions. To tackle the problem associated with information loss that may occur during feature space compression and channel expansion in underwater optical images, AGS-YOLO introduces GSConv for lightweight convolution in the neck. This module further decreases the number of parameters and the computational burden of the model under the premise of guaranteeing the information integrity, thus improving the efficiency and practicality of the model.

2.2.1. AMSA

In the underwater target detection task, due to the tiny size of targets and the complex background noise in underwater optical images, traditional convolutional networks are prone to losing key information of the targets in the process of multiple downsampling. In YOLO11, the C3k2 module is used for feature extraction, and, although it can effectively retain the global semantic information of the target, it still lacks the ability to capture the edge textures of tiny targets and is susceptible to interference from background noise in scenes characterized by low illumination and high turbidity. To address the aforementioned issues, inspired by the PPA module [22], this study proposes the Aggregated Multi-Scale Attention Module (AMSA), which enhances the C3k2 module. The C3k2–AMSA structure is shown in Figure 3.
The C3k2 module is designed based on the CSP structure and consists of two branches with different convolutional kernel sizes to capture local details and global semantic information. Among them, the main branch extracts key features through a 3 × 3 convolutional layer, while the auxiliary branch employs 1 × 1 convolution to enhance the compactness of the feature representation. Although this structure has some advantages in fusing multi-scale information, its fixed-scale setting limits its ability to model more diverse scale targets; in particular, when considering small targets with unclear textures and blurred boundaries, the feature response strength is insufficient to form a clear detection representation. In addition, the lack of an explicit attention mechanism in C3k2 makes it susceptible to invalid information under a high noise background, reducing the discriminative ability of the feature map. By integrating the AMSA module at the output of C3k2, the weak response to the edges of small targets in complex underwater scenes is addressed through multi-scale feature extraction, thereby significantly enhancing the perception of small targets. Figure 4 shows the structure of AMSA.
The AMSA module employs a parallel three-branch feature extraction strategy comprising local, global, and single-stage convolution branches, where each branch is designed to extract features across different spatial scales and semantic levels. This strategy helps to capture the multi-scale features of objects, thus improving the accuracy of small-target detection. Given an input feature map F ∈ R^(H′×W′×C), a point-wise convolution first adjusts the channel dimension to obtain F′ ∈ R^(H′×W′×C′). The adjusted feature map then enters the three processing branches in parallel, and the local and global branches partition it using a non-overlapping patch strategy. By applying the unfold operation, the input features are partitioned into a collection of spatially contiguous patches of shape (p × p, H′/p, W′/p, C), where the difference between the two branches is controlled by the patch size parameter p. The branches aggregate and shift non-overlapping spatial patches and compute an attention matrix among them to capture both local and global feature dependencies, after which channel averaging is performed. An FFN [23] then applies a linear transformation, and an activation function is used to obtain a probability distribution over the spatial dimension, which adjusts the token weights accordingly. Letting d = (H′ × W′)/(p × p), the weighted outcome is denoted as (q_i)_{i=1}^{C′}, where q_i ∈ R^d represents the i-th output token. Finally, the detailed feature F_LG is extracted as follows:
F_LG = Σ_{i=1}^{d} sim(q_i, ξ) · (P · q_i)
where F_LG denotes F_local or F_global, ξ ∈ R^(C′) is the task embedding vector, P ∈ R^(C′×C′) is the task-specific parameter matrix, and sim(·, ·) is a cosine similarity function taking values in [0, 1], enabling dynamic screening of target-related features. To minimize the number of parameters and eliminate redundant computations, the conventional convolutional layers are replaced with a convolution branch consisting of a single 3 × 3 convolutional layer. The three branches are then fused across scales to obtain F̃, as follows:
F̃ = F_local ⊕ F_global ⊕ F_conv
where ⊕ indicates element-wise addition. The three feature maps are aligned in both spatial and channel dimensions before fusion to ensure their compatibility. A two-stage attention enhancement mechanism is then introduced for adaptive feature refinement: the target-sensitive channels are first calibrated using channel attention [24], followed by spatial attention [25] to reinforce the target region, as follows:
F_final = BN(ReLU(M_s(M_c ⊗ F̃)))
where M_c and M_s denote the one-dimensional channel attention map and the two-dimensional spatial attention map, respectively; ⊗ denotes element-wise multiplication with channel-wise broadcasting of M_c; and BN(·) and ReLU(·) denote batch normalization and ReLU activation, respectively. When the AMSA module replaces the conventional convolutional layers within the YOLO architecture, it enhances features through multi-scale extraction and attention mechanisms, effectively improving the detection of targets at different scales underwater.
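For concreteness, the following is a minimal PyTorch sketch of the three-branch idea described above (local and global patch branches plus a 3 × 3 convolution branch, fused by addition and refined with channel and spatial attention). The class names, channel widths, patch handling, and attention blocks are simplifying assumptions for illustration, not the authors’ exact AMSA implementation.

```python
# Minimal sketch of the AMSA idea; shapes assume H and W divisible by the patch sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchBranch(nn.Module):
    """Aggregate non-overlapping p x p patches and re-weight them (simplified)."""

    def __init__(self, channels: int, patch: int):
        super().__init__()
        self.patch = patch
        self.ffn = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch
        patches = F.unfold(x, kernel_size=p, stride=p)       # (B, C*p*p, L), L = (H/p)*(W/p)
        patches = patches.view(b, c, p * p, -1)              # (B, C, p*p, L)
        tokens = patches.mean(dim=2)                         # per-patch descriptors (B, C, L)
        weights = torch.softmax(self.ffn(tokens.transpose(1, 2)), dim=1)  # weights over patches
        tokens = tokens * weights.transpose(1, 2)            # re-weighted patch tokens
        out = tokens.view(b, c, h // p, w // p)               # back to a coarse spatial grid
        return F.interpolate(out, size=(h, w), mode="nearest")


class AMSA(nn.Module):
    """Three parallel branches fused by addition, then channel and spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)   # point-wise channel adjustment
        self.local_branch = PatchBranch(channels, patch=2)         # p = 2 (local)
        self.global_branch = PatchBranch(channels, patch=4)        # p = 4 (global)
        self.conv_branch = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.channel_att = nn.Sequential(                          # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(                          # CBAM-style spatial attention
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        fused = self.local_branch(x) + self.global_branch(x) + self.conv_branch(x)
        fused = fused * self.channel_att(fused)                    # channel re-weighting
        s = torch.cat([fused.mean(1, keepdim=True), fused.amax(1, keepdim=True)], dim=1)
        fused = fused * self.spatial_att(s)                        # spatial re-weighting
        return self.bn(F.relu(fused))


if __name__ == "__main__":
    print(AMSA(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```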

2.2.2. GSConv

The complex optical scattering environment and the tiny scale characteristics of underwater targets impose higher requirements on the feature extraction capability of detection networks. However, the conventional convolution module used in the YOLO11 neck network suffers from feature information loss, especially in the deep feature fusion stage, where the detailed texture features of tiny targets tend to be lost. Moreover, the large number of parameters associated with the dense convolution operation makes it difficult to deploy the model on computationally constrained platforms such as underwater robots. In order to achieve a trade-off between performance and computational cost, this study introduces the lightweight convolution module GSConv [26], whose structure is shown in Figure 5.
In order to reduce the computational cost, many lightweight models use depthwise separable convolution (DSConv) [27] to decrease the number of parameters and floating-point operations; however, DSConv separates the input image’s channel information throughout the computation, which leads to a lower feature representation capability than standard convolution. The GSConv module is based on the idea of feature compensation and employs a hybrid strategy with a two-way parallel processing mechanism to optimize feature representation. For a feature map with C1 input channels, GSConv first applies standard convolution to generate dense features with C2/2 channels, then uses DSConv for spatial feature re-extraction to obtain another set of sparse features with C2/2 channels. Subsequently, the two sets of features are fused into enhanced features with C2 channels using channel concatenation and a channel shuffle mechanism, so that the global feature information extracted by the standard convolutional path can be fully propagated into the local feature space of the depthwise separable convolutional path, thereby preserving the hidden connections among the features as much as possible. The channel shuffle mechanism enhances the nonlinear representation of the module, and the overall design allows GSConv to capture more spatial and channel features with a streamlined parameter count and minimized computational overhead. In addition, owing to the dimensionality of the feature maps, GSConv is used only in the neck network for feature fusion, while standard convolution is retained in the backbone network. When the features flow through the neck network, their channel dimensions reach their peak while the spatial dimensions are compressed to the minimum; thus, GSConv’s hybrid architecture is able to fully integrate cross-scale semantic information while avoiding the inference delay caused by deep stacking.
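The following is a minimal PyTorch sketch of this idea: a standard convolution produces half of the output channels, a depthwise convolution re-extracts spatial features from that half, and the two halves are concatenated and channel-shuffled. The kernel sizes, activation choices, and shuffle implementation are illustrative assumptions rather than the exact GSConv implementation of [26].

```python
# Minimal GSConv-style block: dense (standard) path + sparse (depthwise) path + shuffle.
import torch
import torch.nn as nn


class GSConv(nn.Module):
    def __init__(self, c1: int, c2: int, k: int = 1, s: int = 1):
        super().__init__()
        c_ = c2 // 2
        self.dense = nn.Sequential(                      # standard ("dense") path -> C2/2 channels
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        self.sparse = nn.Sequential(                     # depthwise ("sparse") path -> C2/2 channels
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.dense(x)
        s = self.sparse(d)
        y = torch.cat([d, s], dim=1)                     # (B, C2, H, W)
        # channel shuffle: interleave the dense and sparse channel groups
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    print(GSConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```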

2.2.3. CSFE

Although the YOLO11 neck network has the basic FPN feature pyramid structure, it still suffers from insufficient feature expression and computational redundancy in complex scenarios such as those involving small-target detection and dense target stacking. In this paper, a cross-scale feature fusion lightweight neck network, Cross-Scale Feature Enhancement (CSFE), is designed by drawing on the idea of multi-scale fusion in RT-DETR [28]. This further improves the model’s capability to perceive and localize small targets under complex backgrounds by reconfiguring the feature fusion paths, introducing a lightweight design, and enhancing the modular expression capability.
As illustrated in Figure 6, the backbone network generates the following three feature maps at different scales: P3, P4, and P5. Among them, P3 has the highest spatial resolution and the shallowest depth, while P5 has the lowest resolution but the richest semantic information. Figure 6a displays the neck design of YOLO11, with A4 as the intermediate feature map, while Figure 6b presents the CSFE neck structure, which generates multiple intermediate outputs, specifically Y4, Y5, Y3, and X4.
The CSFE network introduces two types of 1 × 1 convolution, namely lateral convolution [29] and input projection [30], and P5, P4, and P3 are all first projected to a unified 256 channels. This design not only decouples the channel adaptation and feature fusion processes but also significantly reduces the computational complexity through a parameter-sharing mechanism, preparing for the subsequent cross-scale interaction. The feature fusion path adopts a bi-directional closed-loop architecture combining FPN and PAN. In the top-down path, the deep high-semantic feature P5 is upsampled and fused with the mid-layer feature P4, while shallow spatial details are used to calibrate the localization of the deep semantics; conversely, in the bottom-up path, the shallow high-resolution feature P3 is downsampled through convolution, and texture details are injected into the mid-layer features to improve the model’s fine-grained sensing capability. This bi-directional interaction mechanism creates a dynamic feedback loop of cross-scale features, allowing each layer of features to capture both local details and global semantics during the iteration process. The network maintains resolution continuity through layer-wise upsampling and adopts a concatenation (rather than summation) fusion strategy to maximize the preservation of the original feature information. After the secondary enhancement of the PAN path, the final multi-scale outputs correspond to the small-, medium-, and large-target detection tasks. These improvements make the network particularly suitable for underwater environments, where object scale changes drastically with imaging distance, and its closed-loop feature enhancement path greatly improves its precision and consistency in detecting small objects.
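To make the data flow concrete, the following is a minimal PyTorch sketch of this neck: 1 × 1 projections unify P3–P5 to a common width, a top-down pass upsamples and concatenates, and a bottom-up pass downsamples and concatenates. The channel widths, the use of plain 3 × 3 convolutions in place of the C3k2-style fusion blocks, and the layer names are assumptions for illustration only.

```python
# Minimal sketch of a CSFE-style cross-scale neck (top-down + bottom-up, concatenation fusion).
import torch
import torch.nn as nn


def conv_bn_act(c1, c2, k=1, s=1):
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())


class CSFENeck(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), width=256):
        super().__init__()
        c3, c4, c5 = in_channels
        # lateral / input-projection 1x1 convs unify every scale to `width` channels
        self.proj3, self.proj4, self.proj5 = (conv_bn_act(c, width) for c in (c3, c4, c5))
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse_td4 = conv_bn_act(2 * width, width, 3)   # top-down: P5 -> P4 level
        self.fuse_td3 = conv_bn_act(2 * width, width, 3)   # top-down: P4 -> P3 level
        self.down3 = conv_bn_act(width, width, 3, s=2)     # bottom-up: P3 -> P4 level
        self.fuse_bu4 = conv_bn_act(2 * width, width, 3)
        self.down4 = conv_bn_act(width, width, 3, s=2)     # bottom-up: P4 -> P5 level
        self.fuse_bu5 = conv_bn_act(2 * width, width, 3)

    def forward(self, p3, p4, p5):
        p3, p4, p5 = self.proj3(p3), self.proj4(p4), self.proj5(p5)
        # top-down path: inject deep semantics into shallower maps
        x4 = self.fuse_td4(torch.cat([self.up(p5), p4], dim=1))
        y3 = self.fuse_td3(torch.cat([self.up(x4), p3], dim=1))
        # bottom-up path: inject shallow texture details back into deeper maps
        y4 = self.fuse_bu4(torch.cat([self.down3(y3), x4], dim=1))
        y5 = self.fuse_bu5(torch.cat([self.down4(y4), p5], dim=1))
        return y3, y4, y5  # small / medium / large detection scales


if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in ((256, 80), (512, 40), (1024, 20))]
    for y in CSFENeck()(*feats):
        print(y.shape)
```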

3. Experiments and Discussions

3.1. Experimental Environment and Configuration

Experiments were performed on a system equipped with the Windows 11 OS, and the hardware configuration comprised an NVIDIA GeForce RTX 4060Ti graphics card (16 GB of video memory) and a 12th-generation Intel Core i5-12400F processor (2.50 GHz), with 32 GB of DDR4 memory. The development environment used Python 3.12.0 with the PyTorch 2.5.0 deep learning framework and GPU acceleration based on the CUDA 12.7 parallel computing architecture. The detailed hyperparameter settings for the training models are presented in Table 1, with all experiments conducted under identical configurations.
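Assuming the custom AGS-YOLO modules were registered with the Ultralytics framework, training with the Table 1 hyperparameters could be launched roughly as follows; the model and dataset YAML file names below are placeholders, not files provided by the paper.

```python
# Sketch of a training launch with the Table 1 hyperparameters via the Ultralytics API.
from ultralytics import YOLO

model = YOLO("ags-yolo.yaml")        # hypothetical model definition containing the custom modules
model.train(
    data="duo.yaml",                 # hypothetical dataset config (image paths and class names)
    epochs=300,
    imgsz=640,
    batch=32,
    workers=4,
    optimizer="SGD",
    lr0=0.01,
    weight_decay=0.0005,
    cache=False,
)
```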

3.2. Dataset

In this study, two typical underwater target detection datasets—DUO (Detection Underwater Objects) and RUOD—were selected for the experiments. The model was independently trained and tested on both the DUO and RUOD datasets to assess its performance under different data distributions.
The DUO dataset [31] integrates underwater images accumulated over the years from the URPC Challenge, which have undergone de-duplication and label correction operations. The dataset contains a total of 7782 real underwater optical images covering four typical marine benthic organisms, namely, holothurian, echinus, scallop, and starfish, where the number of labeled instances for each target is 7887, 50,156, 1924, and 14,548, respectively, indicating a pronounced distribution imbalance among categories, as shown in Figure 7. The training set in this dataset contains 6671 images, and the test set comprises 1111 images.
As the first comprehensive benchmark for underwater object detection, the RUOD dataset [32] includes 14,000 high-resolution images with a total of 74,903 labeled targets, which systematically integrates typical marine organisms with complex environmental elements. In addition, the dataset includes three typical underwater disturbance scenarios—namely, fogging effect, angular offset, and dynamic lighting disturbance—and covers 10 target categories, including holothurian, echinus, scallops, starfish, fish, corals, divers, cuttlefish, turtles, and jellyfish, as shown in Figure 8. It consists of 9800 images for training and 4200 for testing.

3.3. Performance Evaluation

In this study, the number of parameters, GFLOPS, mAP@0.5, and mAP@0.5:0.95 are used as evaluation metrics. True positive (TP) refers to the number of samples correctly classified as positive, whereas false positive (FP) denotes the number of samples incorrectly classified as positive. True negative (TN) indicates the number of samples correctly classified as negative, and false negative (FN) denotes the number of positive samples incorrectly predicted as negative.
Precision indicates the proportion of samples predicted as positive that are truly positive, while recall indicates the proportion of true-positive predictions relative to the total number of positive samples in the dataset. These metrics are calculated using the following formulas:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP and FP denote the number of true-positive and false-positive predictions, respectively, and FN denotes the number of positive samples incorrectly predicted as negative.
AP denotes the average precision of a single class and corresponds to the area under the precision–recall curve, calculated as follows:
AP = ∫_0^1 P(r) dr
The mean average precision (mAP) represents the overall detection accuracy across all classes and is computed as the mean of the individual AP values, as follows:
mAP = (1/N) Σ_{i=1}^{N} AP_i
mAP@0.5 denotes the mean average precision across all classes when the intersection over union (IoU) threshold is set to 0.5, while mAP@0.5:0.95 represents the average precision computed over IoU thresholds ranging from 0.5 to 0.95 (inclusive).
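As a concrete illustration of these definitions, the following Python sketch computes precision and recall from TP/FP/FN counts and approximates AP as the area under a sampled precision–recall curve via trapezoidal integration. The counts and curve points are placeholders, and standard evaluations typically use the interpolated AP of the COCO protocol rather than this simplified form.

```python
# Sketch of the metric definitions above (placeholder inputs, trapezoidal AP approximation).
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Approximate AP as the area under P(r) using the trapezoidal rule."""
    order = np.argsort(recalls)
    r, p = recalls[order], precisions[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))


if __name__ == "__main__":
    print(precision_recall(tp=80, fp=10, fn=20))   # (0.888..., 0.8)
    r = np.array([0.0, 0.2, 0.5, 0.8, 1.0])        # placeholder recall samples
    p = np.array([1.0, 0.95, 0.9, 0.7, 0.5])       # placeholder precision samples
    print(average_precision(r, p))                 # area under the sampled PR curve
```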

3.4. Ablation Experiments

In this section, we detail the ablation experiments conducted to evaluate the contribution of each proposed module in AGS-YOLO. Table 2 presents the performance results on the DUO dataset, including the detection accuracy, model parameters, and computational complexity.
Starting from the baseline YOLO11n model, which achieved mAP@0.5 of 84.2% and mAP@0.5:0.95 of 65.1%, with 2.58 million parameters and 6.3 GFLOPS, we first introduce the AMSA module. This modification led to a significant improvement in accuracy, raising the mAP@0.5 to 85.3% and the mAP@0.5:0.95 to 67.4%. These gains are attributed to the AMSA module’s ability to enhance multi-scale feature extraction and focus attention on critical object regions, especially for small targets. However, this improvement comes at the cost of increased model complexity, as the parameters rose to 3.8 million and the GFLOPS increased to 10.8. Subsequently, we integrated the CSFE into the architecture. This further improved the detection performance, increasing the mAP@0.5 to 85.4% and the mAP@0.5:0.95 to 67.6%. More importantly, the parameter count decreased to 3.1 million and the GFLOPS dropped to 9.7, demonstrating that CSFE not only contributes to more effective cross-scale feature fusion but also improves model efficiency. Finally, when incorporating the lightweight GSConv module, the model maintained strong detection performance, with a slight improvement in the mAP@0.5:0.95 reaching 67.7%, while the parameter count was further reduced to 3.0 million and the GFLOPS decreased to 9.6. These results demonstrate that GSConv enhances computational efficiency without compromising accuracy.
Overall, each module makes a distinct and complementary contribution to the model. The AMSA module delivers the most significant accuracy gains, the CSFE module achieves a strong balance between performance and efficiency, and the GSConv module effectively compresses the model while maintaining precise detection. The final architecture achieves an optimal trade-off between detection accuracy and model complexity.
To gain deeper insights into the performance of the proposed model in image feature extraction, Gradient-Weighted Class Activation Mapping (Grad-CAM) [33] was utilized to generate and analyze heatmaps, thereby facilitating an intuitive understanding of the model’s attention regions. Heatmaps are widely used in deep learning image classification tasks and function similarly to infrared imaging: regions with higher “temperatures” are displayed in red, while lower-temperature areas appear blue. In neural networks, heatmaps intuitively highlight the areas of the image that the model focuses on, with red indicating regions of high attention. Heatmap visualizations for both the baseline and the proposed models across four types of underwater environments are shown in Figure 9 for comparison. The results indicate that the original YOLO11n algorithm has certain limitations in terms of image recognition. In contrast, after adopting the AGS-YOLO model, the red regions in the heatmaps become significantly larger and more concentrated, suggesting that the model can extract image features more thoroughly, making it more suitable for underwater target detection in turbid environments. It is worth noting that although the areas of some red regions decreased, the number of such regions increased. These results imply that the perceptual scope of the model has been significantly enhanced, allowing it to focus on more detailed regions of the image, thereby improving its capacity for multi-scale object detection. Among the components, the AMSA module, GSConv convolution, and the CSFE neck network play key roles in improving the model’s underwater detection accuracy. In summary, compared with YOLO11n, AGS-YOLO exhibits enhanced performance in terms of both the coverage and accuracy of detection regions, showing stronger capabilities for target perception and detection.
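For reference, a minimal sketch of the Grad-CAM computation behind such heatmaps is given below: the activations and gradients of a chosen convolutional layer are captured with hooks, channels are weighted by their spatially averaged gradients, and the result is upsampled and normalized. The choice of target layer and of the scalar score to back-propagate are assumptions; for YOLO-style detectors the score is usually derived from the detection head outputs rather than a single class logit.

```python
# Minimal Grad-CAM sketch using forward/backward hooks on a chosen layer.
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_layer, score_fn):
    """Return a (B, 1, H, W) heatmap in [0, 1] for the scalar score produced by score_fn."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = score_fn(model(image))                 # scalar quantity to explain (assumed)
        model.zero_grad()
        score.backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)        # GAP of gradients per channel
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    finally:
        h1.remove()
        h2.remove()
```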

3.5. Comparison Experiments

3.5.1. Comparative Analyses on the DUO Dataset

To evaluate the impact of the proposed enhanced model, comparative experiments were conducted on the DUO dataset, benchmarking it against several representative object detection algorithms including YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLO11n, and YOLO12n. The experimental results are presented in Table 3. Considering that differences in model size among detection algorithms may significantly affect detection performance, lightweight models with a parameter count similar to that of the proposed model were deliberately selected as baselines. This ensured a fair and valid comparison, allowing for a more objective evaluation of the performance advantages of AGS-YOLO under comparable computational resource conditions.
On the DUO dataset, the proposed AGS-YOLO model demonstrated outstanding performance across several critical evaluation metrics, achieving an mAP@0.5 of 85.5% and an mAP@0.5:0.95 of 67.7%. Compared with the original YOLO11n, the mAP@0.5 was improved by 1.3%, while the mAP@0.5:0.95 increased by 2.6%. These performance gains can be ascribed to multiple contributing factors. The AMSA module enhances its feature extraction capabilities, thereby improving detection accuracy. The GSConv module reduces model complexity while effectively preserving important features. The CSFE neck network, through efficient cross-scale connections and unified channels, captures multi-scale semantic information while reducing computational overhead. Compared to the YOLOv5n model, the mAP@0.5 and mAP@0.5:0.95 were improved by 1.9% and 3.6%, respectively. When compared with the YOLOv7-tiny model, the mAP@0.5 and mAP@0.5:0.95 increased by 0.5% and 3%, respectively, while the parameter count was reduced by 3.0M, and the computational complexity was reduced by 3.4 GFLOPS. In comparison with the YOLOv8n and YOLOv10n models, the AGS-YOLO model achieved improvements of 0.9% and 2.2% in mAP@0.5 and 2.4% and 3.1% in mAP@0.5:0.95, respectively, while maintaining a low parameter count and computational complexity. The experimental evidence demonstrates that the proposed method significantly boosts the detection precision of small objects in underwater environments while keeping additional parameters and computational demands to a minimum.
Although YOLO12n is a more recent version, our experiments showed that it does not outperform YOLO11n in the underwater small-object detection task. This may be attributed to the fact that YOLO12n emphasizes further lightweighting and structural pruning, which reduces its capacity to capture fine-grained features. In contrast, YOLO11n retains a better balance between feature expressiveness and model size, making it more suitable for detecting small and low-contrast underwater objects.
The confusion matrix shown in Figure 10 summarizes the classification outcomes of the improved model on the DUO dataset. The x-axis corresponds to actual class labels, while the y-axis corresponds to predicted labels. This visualization provides a concise and interpretable measure of model accuracy by aligning predicted and true classifications. The figure reveals that the predicted results for most targets were highly consistent with the actual categories, indicating that the model has strong recognition ability in underwater object detection tasks. Under the same experimental conditions, the AGS-YOLO model demonstrated superior detection capabilities when compared to the original YOLO11n in several categories, reflecting the significant effects of the structural improvements in terms of enhancing the discriminative ability of the model.
Figure 11 presents a comparison of the average precision performance between mainstream detection models and the proposed model on the DUO dataset. Specifically, Figure 11a shows the performance of each model under the mAP@0.5 metric, while Figure 11b illustrates the detection performance regarding the mAP@0.5:0.95. The figures clearly demonstrate that AGS-YOLO significantly outperformed other models in the YOLO series in both metrics. These results confirm that the proposed model not only maintains a lightweight structure but also exhibits superior feature extraction and object detection capabilities, thereby demonstrating the effectiveness of the improvements.

3.5.2. Comparative Analyses on the RUOD Dataset

In order to further validate the capabilities of the proposed model, comparative tests were also performed on the RUOD dataset, as shown in Table 4. AGS-YOLO demonstrated superior performance across both mAP@0.5 and mAP@0.5:0.95, attaining 86.4% and 63.5%, respectively, and surpassing the other state-of-the-art lightweight object detectors by a notable margin. While keeping the parameter count at 3.0M and the computational cost below 10 GFLOPS, AGS-YOLO achieved improvements of 1.5 and 2.0 percentage points in mAP@0.5 and mAP@0.5:0.95, respectively, compared with YOLO11n. The results fully demonstrate that the improved strategy proposed in this paper significantly boosts the network’s capability in feature extraction and target recognition without appreciably increasing its complexity or computational overhead, thus providing a good performance-to-efficiency ratio. This makes the model especially suitable for deployment in real underwater scenarios with limited computational resources. Figure 12 depicts the confusion matrix for the proposed enhanced model evaluated on the RUOD dataset. Figure 13 presents a comparative analysis of mAP (mean average precision) between mainstream models and the model proposed in this paper on the RUOD dataset. Specifically, Figure 13a illustrates the performance comparison in terms of mAP@0.5, while Figure 13b demonstrates the results for mAP@0.5:0.95. The results clearly indicate the superior detection performance of our model across both evaluation metrics.

3.5.3. Comparison with Other Advanced Mainstream Models

To further validate the effectiveness of the proposed model, we conducted comparative experiments against several state-of-the-art models, as presented in Table 5. The results on both the DUO and RUOD datasets demonstrate that AGS-YOLO achieved a favorable balance between accuracy and efficiency. Specifically, AGS-YOLO attained the highest mAP@0.5:0.95 scores of 67.7% and 63.5% on the DUO and RUOD datasets, respectively, outperforming YOLO11n and RTMD-Tiny, while maintaining a lightweight design with only 3.0M parameters and 9.6 GFLOPS. Remarkably, despite its compact size, AGS-YOLO even surpassed the more complex DETR-R50 in terms of detection accuracy, underscoring its robustness and adaptability in complex underwater environments.

3.5.4. Statistical Significance Testing and Visualization

To verify that the effectiveness of the proposed method was not due to random errors, we conducted five independent experiments comparing YOLO11n and AGS-YOLO on the DUO dataset. Paired t-tests were performed for both the mAP@0.5 and mAP@0.5:0.95 metrics, and the corresponding box plots are presented in Figure 14.
The results indicate that AGS-YOLO significantly outperformed YOLO11n on both mAP@0.5 (p < 0.001) and mAP@0.5:0.95 (p < 0.001), demonstrating the effectiveness and robustness of the proposed improvements.
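A sketch of how such a paired t-test can be computed with SciPy is shown below; the per-run mAP values are placeholders standing in for the five recorded runs, not the paper’s measured numbers.

```python
# Sketch of a paired t-test over five matched runs (placeholder values).
from scipy import stats

yolo11n_map50 = [0.841, 0.843, 0.842, 0.840, 0.844]   # placeholder per-run mAP@0.5 values
ags_yolo_map50 = [0.855, 0.856, 0.854, 0.853, 0.857]  # placeholder per-run mAP@0.5 values

t_stat, p_value = stats.ttest_rel(ags_yolo_map50, yolo11n_map50)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")  # small p-value => difference unlikely to be random
```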

3.5.5. Visualization and Analysis

Figure 15 illustrates how the proposed model performed across four distinct underwater scenarios, highlighting its adaptability to diverse aquatic conditions. Under challenging underwater environments, the naked eye struggles to detect objects submerged underwater in a timely manner as the target is often highly integrated with the background environment. However, the proposed AGS-YOLO model was able to achieve accurate detection in challenging environments, including those characterized by underwater haze, low contrast, and low-light conditions, as shown in Figure 15a–c. In addition, the model was equally accurate in identifying occluded underwater targets in dense scenes, as shown in Figure 15d. These results fully demonstrate the excellent target detection performance of AGS-YOLO in complex and harsh underwater environments.
To assess the effectiveness of the proposed model in addressing underwater object detection challenges, three images from different scenarios were selected from the DUO dataset for qualitative analysis across multiple detection models. As shown in Figure 16, with the exception of the model presented in this study, all other models suffered from missed detections, reflecting their shortcomings in terms of feature extraction capability. This further verifies that the AGS-YOLO model has stronger feature perception and target recognition capabilities in complex underwater environments.
To further validate the AGS-YOLO model’s performance in underwater target detection, three representative images from distinct scenes within the RUOD dataset were selected to qualitatively analyze each model, as illustrated in Figure 17. According to the obtained results, apart from the proposed model, the other models exhibited varying degrees of missed detections, which further verifies that AGS-YOLO has higher detection accuracy and stronger robustness in challenging underwater scenarios.

4. Discussion

Despite the promising performance of the proposed AGS-YOLO model in underwater small-object detection tasks, several limitations remain. First, the detection accuracy may degrade in scenarios with extremely dense or overlapping targets, where the spatial separation between objects is minimal. In such cases, the receptive field of the model might not be sufficient to distinguish between individual objects, leading to misdetections or false positives. Second, the model’s performance is still sensitive to the quality of the optical input data. Severe underwater conditions—such as intense turbidity, low light, or color distortion—can negatively affect the model’s ability to extract discriminative features, despite the use of attention mechanisms and multi-scale fusion. Third, although our method was shown to achieve a good balance between accuracy and efficiency, further optimization is needed for deployment in ultra-low-resource environments, such as underwater drones with limited computing and power budgets.
In future research, we plan to integrate adaptive, noise-robust enhancement modules to better handle degraded visual conditions; explore semi-supervised learning and domain adaptation techniques to alleviate the dependence on high-quality labeled datasets; and further compress the network using pruning or quantization techniques to enable its real-time deployment in embedded systems.

5. Conclusions

In this paper, a novel, high-precision target detection network architecture, called AGS-YOLO, was proposed. The model’s sensing performance was improved by embedding a multi-scale aggregated attention module, AMSA, into the backbone network. The neck structure was designed to integrate the cross-scale feature enhancement module CSFE, providing a bidirectional information transmission path to effectively capture multi-scale semantic information. Meanwhile, GSConv was incorporated to preserve critical features while keeping the number of parameters and computational overhead low. Through these improvements, the model was shown to achieve a well-balanced trade-off between precision and computational efficiency. Extensive experiments on the DUO and RUOD datasets demonstrated that AGS-YOLO outperforms several state-of-the-art models in terms of both detection accuracy and efficiency. Specifically, AGS-YOLO achieved the highest mAP@0.5:0.95 scores of 67.7% and 63.5% on DUO and RUOD, respectively, while maintaining a low parameter count and FLOPS, verifying the effectiveness of the proposed design in balancing precision and resource efficiency. The findings confirmed that the proposed structural enhancements can significantly boost the performance of YOLO-based architectures in challenging underwater environments, particularly in the detection of small and low-contrast targets. These improvements have practical implications for real-time marine applications, such as underwater exploration, ecological monitoring, and autonomous robotics.
While AGS-YOLO balances accuracy and efficiency well, its performance under extreme conditions can be further improved. Future work will focus on enhancing its robustness, the application of model compression techniques, and the incorporation of diverse training data to improve the model’s generalization ability.

Author Contributions

Conceptualization, W.S.; Methodology, W.S.; Software, W.S.; Resources, W.S., X.L., J.H., Q.Y., H.X., Y.W. and Z.X.; Writing—original draft, W.S.; Writing—review & editing, X.L., J.H., Q.Y., H.X., Y.W. and Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, P.; Qian, W.; Wang, Y. YWnet: A Convolutional Block Attention-Based Fusion Deep Learning Method for Complex Underwater Small Target Detection. Ecol. Inform. 2024, 79, 102401. [Google Scholar] [CrossRef]
  2. Chen, X.; Fan, C.; Shi, J.; Wang, H.; Yao, H. Underwater Target Detection and Embedded Deployment Based on Lightweight YOLO_GN. J. Supercomput. 2024, 80, 14057–14084. [Google Scholar] [CrossRef]
  3. Xu, S.; Zhang, M.; Song, W.; Mei, H.; He, Q.; Liotta, A. A Systematic Review and Analysis of Deep Learning-Based Underwater Object Detection. Neurocomputing 2023, 527, 204–232. [Google Scholar] [CrossRef]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  5. Shahi, T.B.; Xu, C.-Y.; Neupane, A.; Guo, W. Recent Advances in Crop Disease Detection Using UAV and Deep Learning Techniques. Remote Sens. 2023, 15, 2450. [Google Scholar] [CrossRef]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar]
  7. Li, Y.; Liu, S.; Zhu, P.; Yu, J.; Li, S. Extraction of Visual Texture Features of Seabed Sediments Using an SVDD Approach. Ocean. Eng. 2017, 142, 501–506. [Google Scholar] [CrossRef]
  8. Zeng, L.; Sun, B.; Zhu, D. Underwater Target Detection Based on Faster R-CNN and Adversarial Occlusion Network. Eng. Appl. Artif. Intell. 2021, 100, 104190. [Google Scholar] [CrossRef]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single Shot Multibox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  11. Al Muksit, A.; Hasan, F.; Emon, M.F.H.B.; Haque, M.R.; Anwary, A.R.; Shatabda, S. YOLO-Fish: A Robust Fish Detection Model to Detect Fish in Realistic Underwater Environment. Ecol. Inform. 2022, 72, 101847. [Google Scholar] [CrossRef]
  12. Xu, X.; Liu, Y.; Lyu, L.; Yan, P.; Zhang, J. MAD-YOLO: A Quantitative Detection Algorithm for Dense Small-Scale Marine Benthos. Ecol. Inform. 2023, 75, 102022. [Google Scholar] [CrossRef]
  13. Yu, G.; Cai, R.; Su, J.; Hou, M.; Deng, R. U-YOLOv7: A Network for Underwater Organism Detection. Ecol. Inform. 2023, 75, 102108. [Google Scholar] [CrossRef]
  14. Zhao, H.; Xu, C.; Chen, J.; Zhang, Z.; Wang, X. BGLE-YOLO: A Lightweight Model for Underwater Bio-Detection. Sensors 2025, 25, 1595. [Google Scholar] [CrossRef]
  15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  16. Gao, J.; Zhang, Y.; Geng, X.; Tang, H.; Bhatti, U.A. PE-Transformer: Path Enhanced Transformer for Improving Underwater Object Detection. Expert. Syst. Appl. 2024, 246, 123253. [Google Scholar] [CrossRef]
  17. Zhang, F.; Cao, W.; Gao, J.; Liu, S.; Li, C.; Song, K.; Wang, H. Underwater Object Detection Algorithm Based on an Improved YOLOv8. JMSE 2024, 12, 1991. [Google Scholar] [CrossRef]
  18. Li, C.; Anwar, S.; Hou, J.; Cong, R.; Guo, C.; Ren, W. Underwater Image Enhancement via Medium Transmission-Guided Multi-Color Space Embedding. IEEE Trans. Image Process. 2021, 30, 4985–5000. [Google Scholar] [CrossRef] [PubMed]
  19. Khanam, R.; Hussain, M. Yolov11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  20. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  21. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  22. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection. arXiv 2024, arXiv:2403.10778. [Google Scholar]
  23. Zhang, T.; Sun, X.; Zhuang, L.; Dong, X.; Gao, L.; Zhang, B.; Zheng, K. FFN: Fountain Fusion Net for Arbitrary-Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3276995. [Google Scholar] [CrossRef]
  24. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  26. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-Neck by GSConv: A Lightweight-Design for Real-Time Detector Architectures. J. Real-Time Image Proc. 2024, 21, 62. [Google Scholar] [CrossRef]
  27. Nascimento, M.G.D.; Fawcett, R.; Prisacariu, V.A. Dsconv: Efficient Convolution Operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5148–5157. [Google Scholar]
  28. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16 June 2024; pp. 16965–16974. [Google Scholar]
  29. Wang, Y.; Dong, M.; Shen, J.; Lin, Y.; Pantic, M. Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation. arXiv 2020, arXiv:2006.03708. [Google Scholar]
  30. Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena Hierarchy: Towards Larger Convolutional Language Models. In Proceedings of the International Conference on Machine Learning (PMLR), Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
  31. Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A Dataset and Benchmark of Underwater Object Detection for Robot Picking. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  32. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking General Underwater Object Detection: Datasets, Challenges, and Solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
  33. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-Cam: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  34. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R. Ultralytics/Yolov5: V3. 0. Zenodo; European Organization for Nuclear Research: Geneva, Switzerland, 2020. [Google Scholar]
  35. Ma, L.; Zhao, L.; Wang, Z.; Zhang, J.; Chen, G. Detection and Counting of Small Target Apples under Complicated Environments by Using Improved YOLOv7-Tiny. Agronomy 2023, 13, 1419. [Google Scholar] [CrossRef]
  36. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A Review on Yolov8 and Its Advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; Springer: Cham, Switzerland, 2024; pp. 529–545. [Google Scholar]
  37. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  38. Zhang, L.; Huang, L. Ship Plate Detection Algorithm Based on Improved RT-DETR. J. Mar. Sci. Eng. 2025, 13, 1277. [Google Scholar] [CrossRef]
  39. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. Rtmdet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
Figure 1. YOLO11 structure.
Figure 2. AGS-YOLO structure.
Figure 3. C3k2-AMSA structure.
Figure 4. AMSA structure. The parameter ‘p’ is assigned values of 2 and 4, corresponding to the local and global branches, respectively.
Figure 5. GSConv structure.
Figure 6. (a) The YOLO11 neck architecture; (b) the CSFE neck architecture.
Figure 7. The DUO dataset includes the following four typical types of organisms: holothurian, echinus, scallops, and starfish.
Figure 8. The RUOD dataset includes the following ten categories of targets: holothurian, echinus, scallops, starfish, fish, corals, divers, cuttlefish, turtles, and jellyfish.
Figure 9. Comparison of heatmap visualizations of detections on the DUO dataset: (a) raw image captured by the underwater optical camera showing a complex underwater scene; (b) heatmap visualization of the YOLO11n detection output; (c) heatmap visualization of the AGS-YOLO detection output proposed in this paper.
Figure 10. Confusion matrix of this paper’s model on the DUO dataset.
Figure 11. mAP average accuracy curves for various models on the DUO dataset. (a) Comparison of mAP@0.5 accuracy curves. (b) Comparison of mAP@0.5:0.95 accuracy curves.
Figure 12. Confusion matrix of this paper’s model on the RUOD dataset.
Figure 13. mAP average accuracy curves for various models on the RUOD dataset. (a) Comparison of mAP@0.5 accuracy curves; (b) Comparison of mAP@0.5:0.95 accuracy curves.
Figure 14. Boxplot comparison of YOLO11n and AGS-YOLO on the DUO dataset in terms of mAP@0.5 and mAP@0.5:0.95. (a) Boxplots for mAP@0.5; (b) boxplots for mAP@0.5:0.95.
Figure 15. Results of this paper’s model detection in four different underwater environments.
Figure 16. YOLO family of models visualizing detection results on the DUO dataset. (ac) Visualization results of different models on three different scene images.
Figure 17. YOLO family of models visualizing detection results on the RUOD dataset. (ac) Visualization results of different models on three different scene images.
Table 1. Experimental environment parameters.
Training Parameter    Value
Learning Rate         0.01
Image size            640 × 640
Epochs                300
Batch                 32
Workers               4
Optimizer             SGD
Cache                 False
Weight Decay          0.0005
Table 2. Ablation study results on the DUO dataset.
Model      AMSA   CSFE   GSConv   mAP@0.5   mAP@0.5:0.95   Parameter/M   GFLOPS
YOLO11n    ×      ×      ×        84.2      65.1           2.58          6.3
           √      ×      ×        85.3      67.4           3.8           10.8
           √      √      ×        85.4      67.6           3.1           9.7
           √      √      √        85.5      67.7           3.0           9.6
Note: ‘√’ indicates that the module is added, while ‘×’ indicates that the module is not added.
Table 3. Comparative evaluation of various models on the DUO dataset.
Model               mAP@0.5   mAP@0.5:0.95   Parameter/M   GFLOPS
YOLOv5n [34]        83.6      64.1           2.2           5.9
YOLOv7-tiny [35]    85.0      64.7           6.0           13.0
YOLOv8n [36]        84.6      65.3           2.7           6.9
YOLOv10n [37]       83.3      64.6           2.7           8.3
YOLO11n             84.2      65.1           2.6           6.3
YOLO12n             83.2      63.8           2.6           6.4
AGS-YOLO            85.5      67.7           3.0           9.6
Table 4. Comparative evaluation of various models on the RUOD dataset.
Model            mAP@0.5   mAP@0.5:0.95   Parameter/M   GFLOPS
YOLOv5n          84.1      59.7           2.2           5.9
YOLOv7-tiny      85.0      57.9           6.0           13.0
YOLOv8n          84.3      60.5           2.7           6.9
YOLOv10n         84.3      61.1           2.7           8.3
YOLO11n          84.9      61.5           2.6           6.3
YOLO12n          83.9      60.2           2.6           6.4
AGS-YOLO         86.4      63.5           3.0           9.6
Table 5. Comparative evaluation of various models on the DUO and RUOD datasets.
Model            DUO mAP@0.5   DUO mAP@0.5:0.95   RUOD mAP@0.5   RUOD mAP@0.5:0.95   Parameter/M   GFLOPS
DETR-R50 [38]    84.5          63.0               85.4           59.1                41.6          91.7
RTMD-Tiny [39]   85.8          66.7               85.7           62.0                4.8           8.1
YOLO11n          84.2          65.1               84.9           61.5                2.6           6.3
AGS-YOLO         85.5          67.7               86.4           63.5                3.0           9.6

