Article

Improvement and Optimization of Underwater Image Target Detection Accuracy Based on YOLOv8

1 School of Automation, Jiangsu University of Science and Technology, Zhenjiang 212003, China
2 School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1102; https://doi.org/10.3390/sym17071102
Submission received: 14 May 2025 / Revised: 2 July 2025 / Accepted: 6 July 2025 / Published: 9 July 2025

Abstract

The ocean encompasses the majority of the Earth’s surface and harbors substantial energy resources. Nevertheless, the intricate and asymmetrically distributed underwater environment renders existing target detection performance inadequate. This paper presents an enhanced YOLOv8s approach for underwater robot object detection to address issues of subpar image quality and low recognition accuracy. The precise measures are enumerated as follows: initially, to address the issue of model parameters, we optimized the ninth convolutional layer by substituting certain conventional convolutions with adaptive deformable convolution DCN v4. This modification aims to more effectively capture the deformation and intricate features of underwater targets, while simultaneously decreasing the parameter count and enhancing the model’s ability to manage the deformation challenges presented by underwater images. Furthermore, the Triplet Attention module is implemented to augment the model’s capacity for detecting multi-scale targets. The integration of low-level superficial features with high-level semantic features enhances the feature expression capability. The original CIoU loss function was ultimately substituted with Shape IoU, enhancing the model’s performance. In the underwater robot grasping experiment, the system shows particular robustness in handling radial symmetry in marine organisms and reflection symmetry in artificial structures. The enhanced algorithm attained a mean Average Precision (mAP) of 87.6%, surpassing the original YOLOv8s model by 3.4%, resulting in a marked enhancement of the object detection model’s performance and fulfilling the real-time detection criteria for underwater robots.

1. Introduction

The ocean is abundant in mineral and marine resources [1]. The establishment of ocean pastures is a crucial element of sustainable ocean development, contributing to the restoration and enhancement of the marine natural environment. However, due to the limitations of the marine environment, people can only rely on devices such as optical sensors, sonar [2], or underwater robots to collect underwater image information. In contrast to terrestrial settings, the intricate underwater milieu presents issues including elevated noise levels, diminished visibility, indistinct edges, reduced contrast, and color discrepancies in underwater photos, hence creating substantial obstacles for underwater target recognition endeavors [3].
In contrast to terrestrial photos, underwater images frequently display variable levels of deterioration, resulting in problems such as blurred features, color discrepancies, and diminished contrast, significantly hindering the efficacy and further endeavors of underwater robots. To mitigate the decline of image quality in specialized situations, it is essential to design customized image processing approaches that improve quality and enhance practicality. Currently, image enhancing approaches can be divided into classical algorithms and deep learning-based methods.
Classical underwater image enhancement methods mainly use manually constructed rules and mathematical calculation models to optimize the quality of underwater imaging. The core optimization direction of this type of technology focuses on correcting the brightness distribution, detail contrast, and color deviation of images, aiming to achieve accurate visual restoration of underwater scenes, thereby enhancing their application value and ensuring the stability of subsequent computer vision tasks. The classic methods for enhancing traditional underwater images include Histogram Equalization (HE) [4], Retinex algorithm [5], Dark Channel Prior (DCP) [6], and so on.
Traditional image enhancement methods have low application complexity and can effectively improve the quality of underwater images with mild degradation, because such methods are usually based on clear physical interpretability. However, they have the following drawbacks: they cannot completely eliminate noise or restore details under dim light and blue-green haze, and their physical parameters must be reconfigured whenever the environment changes. Therefore, traditional image enhancement performs well in specific scenarios, but its applicability is limited in dynamic and complex underwater conditions, and it must be optimized or fused with other methods to achieve good overall performance.
In contrast to conventional image improvement algorithms, methods for enhancing pictures based on deep learning may autonomously learn the associations of picture feature mapping via neural networks [7]. They can directly learn pixel-level mapping relationships from a large amount of paired data without relying on manually designed algorithm rules. Consequently, numerous researchers both domestically and internationally have dedicated substantial efforts to advancing this field. The image enhancement methods of deep learning are mainly divided into two types: convolutional neural networks and generative adversarial networks.
Li et al. [8] presented a streamlined end-to-end architecture for picture augmentation, emulating the creation of degraded clear image datasets in underwater settings. Tian et al. [9] designed an underwater image enhancement strategy suitable for non-uniform lighting conditions to tackle the problems of inconsistent lighting distribution and noise interference in deep-sea environments. This scheme comprehensively utilizes deformable convolutional networks to dynamically identify turbid and clear areas, separates absorption bands and scattering bands through lightweight transformers, reconstructs the physical balance between color channels, and enables strong scattering compensation for high-turbidity areas to avoid edge blurring caused by global enhancement. Wang et al. [10] constructed an underwater generative adversarial network and integrated convolutional networks with the U-Net structure to achieve feature separation and recombination. Its model is based on an unsupervised training strategy and utilizes an adversarial training process to improve the physical representation of underwater light propagation, showing significant performance improvement in image visual realism and color space restoration performance.
Therefore, this article will employ a deep learning-based underwater picture enhancement technique to develop a network architecture appropriate for intricate and damaged underwater environments, with a focus on its ability to enhance images for common problems, thus combining deep learning with underwater optical imaging theory to construct an interpretable theoretical framework.
Deep learning-based object detection is primarily categorized into one-stage and two-stage algorithms [11]. The two-stage method delineates the vicinity of the detection target, thereafter identifying and categorizing the target. The Fast R-CNN [12] technique exemplifies a two-stage approach that achieves excellent detection accuracy; nevertheless, it suffers from a slow detection speed and frequently underperforms in real-time detection applications. Single-stage object identification techniques employ concurrent feature extraction and location classification. Taking the SSD algorithm [13], YOLO [14], and FCOS [15] as examples, single-stage algorithms have relatively good detection accuracy and speed. Due to their direct prediction of classification and localization, single-stage algorithms are suitable for instantaneous detection tasks. Compared to two-stage object detection algorithms, single-stage algorithms concurrently execute object localization and classification within a unified network, eliminating the necessity for candidate region generation and hence enhancing detection speed.
In the process of operating in real water environments, underwater target detection is easily affected by light, and the images captured by underwater robots generally suffer from target blurring and contrast degradation, which puts higher demands on the environmental adaptability of detection algorithms. Liu et al. [16] first applied the single-stage algorithm SSD to real-time object detection, with the main idea of selecting different feature layers when detecting objects of different sizes. Single-stage detection relies heavily on the model structure design and sample resources during training to directly predict results. Lin et al. [17] designed the RetinaNet single-stage algorithm and incorporated the Focal Loss function to address the issue of imbalanced background categories in image samples. Chen et al. [18] employed the mean accuracy deficit to rectify the background imbalance issue. Zhang et al. [19] proposed the Adaptive Training Sample Selection strategy, which distinguishes samples during training through adaptive thresholding, thereby improving detection accuracy and algorithm robustness.
YOLO is a model used for real-time object detection, focusing on quickly identifying objects in images; U-Net is a model used for image segmentation, focusing on precise pixel-level region division; the GCN-based method is used to process graph-structured data, focusing on analyzing the relationships between nodes. By comparison, YOLO has a greater advantage in this detection direction.
With the continuous development of single-stage algorithms, YOLO series algorithms have gradually emerged in the domain of underwater object identification with higher accuracy and a faster response speed. Sung et al. [20] applied the YOLO algorithm to fish detection. Subsequently, Zhu Shiwei et al. [21] proposed a class-weighted YOLO network based on the YOLO framework, which improved the accuracy of object detection by constructing a class-weighted loss function and introducing an adaptive dimensionality clustering method for target boxes. Wu et al. [22] developed an optimized YOLOv7 network architecture, which utilizes ACmixBlock components to replace the original structure, integrates skip connections and 1 × 1 convolution design, constructs ResNet-ACmix units, and embeds a global attention mechanism; the K-means++ algorithm is applied to adjust anchor parameters, significantly improving feature recognition accuracy and model inference efficiency. In the same year, Chen et al. [23] presented the Underwater-YCC optimization algorithm based on YOLOv7. The detection framework first uses the spatial attention mechanism to analyze key feature details, then designs a multi-scale feature fusion structure to optimize fuzzy region recognition, and finally introduces a dynamic bounding box weighting strategy to improve localization accuracy. YOLOv8 has efficient feature extraction capabilities and can cope with the problems of poor underwater image quality, small targets, and complex backgrounds. Later versions of the YOLO series are better suited to high-throughput real-time inference (such as video stream analysis) or GPU-optimized environments, but under the specific requirements of underwater target detection, YOLOv8 remains a more robust choice. Therefore, many researchers have chosen YOLOv8 as the object of improvement.
To tackle the challenges of the large parameter count and computational complexity of the C2f module in YOLOv8, Hongchun Yuan et al. [24,25] developed the C2fGS module, which reduced the model's computational complexity while preserving its accuracy, thereby enhancing its applicability to underwater biological detection. Zhou Xin et al. [26] enhanced the YOLOv8 method by integrating deformable convolutional networks (DCN) into the backbone architecture and developing an atrous convolution spatial pyramid module (ASPF). Zhang et al. [27] substituted the Darknet-53 backbone network of YOLOv8 with FasterNet-T0, resulting in a significant reduction in model parameters, computational requirements, and model size, while only marginally diminishing detection accuracy; DCNv2 was introduced into the neck part and the bottleneck layer of the C2f structure to improve the detection of irregularly shaped targets. Song et al. [28] replaced the ninth convolutional layer in the YOLOv8 framework with the adaptive deformable convolution DCNv3, modified the SPPF module, and replaced the CIoU loss function with the WIoUv3 loss function. Rejin Varghese et al. [29] used EfficientNet-B4 as the backbone network and NAS-FPN as the detection head in YOLOv8; the focal loss function enhances recall and accuracy on imbalanced and noisy datasets.
This article improves the YOLOv8s model and proposes an enhanced YOLOv8s network that keeps the model lightweight and maintains detection speed while improving detection accuracy. The innovations of this article are as follows:
(1)
Adaptive deformable convolution DCNv4 replaces specific original convolutions, more effectively capturing the deformation and complex features of underwater targets while reducing the parameter count.
(2)
The Triplet Attention module is introduced to enhance the detection capability for multi-scale targets.
(3)
The CIoU loss function is replaced with Shape IoU to improve the model's overall performance.

2. Materials and Methods

2.1. Robot Experiment Platform

The ROV of this experimental platform adopts an open-frame structure, with a high center of buoyancy and a low center of gravity to ensure stability. It can meet the operational requirements of a maximum water depth of 100 m. Figure 1 shows the actual vehicle.
The body is also equipped with temperature sensors, pressure sensors, depth sensors, and other sensing devices, which not only ensure its own operational performance but also sense the working environment, thereby enhancing the adaptability and stability of underwater operations. Figure 2 shows the overall hardware schematic of the ROV. The system is divided into two parts: the surface control platform links the ground console and the remote control handle, handling both the connection to the underwater motion section and the central control function. Under normal conditions, commands are sent from the console to control the underwater operation of the ROV; in special circumstances, the ROV can be manually controlled using the remote control handle. The underwater motion section is linked and controlled by an industrial computer and a control module, and the control instructions are integrated on an embedded development board and transmitted to the various devices. The underwater robot was jointly developed by Jiangsu University of Science and Technology and Zhoushan Haizhixing Information Technology Co., Ltd. (Zhoushan, China).
The various modules of the hardware system are shown in Figure 3. The control device is responsible for the stable transmission of information between land and underwater, as well as the control signal transmission and feedback signal reception, and it makes decisions based on the underwater environment in which the ROV is located. The execution device is responsible for adjusting the ROV’s motion posture based on the current water flow situation, ensuring the ROV’s balance, and is responsible for the ROV’s power propulsion. The sensing device is responsible for sensing the working environment, collecting underwater information, and providing feedback on work status, while ensuring the normal operation of the other equipment including cameras, lighting, and other devices.
The control module is divided into two parts: the surface control module and the underwater control module. The surface computer is equipped with an NVIDIA RTX 3090 Ti graphics card to ensure the smoothness and stability of underwater work. It sends commands to the underwater control module to govern the ROV's underwater movement and operation, and it processes the information collected by the ROV, handling underwater image restoration and enhancement, underwater target detection and tracking, and other operational tasks. In addition to the computer equipment, the ground console also has a joystick system, which can send forward, backward, surfacing, and diving commands to the ROV via the joystick or remote control handle to deal with emergencies.
As shown in Figure 4, the electrical system in the cabin is mainly responsible for the motion control of the ROV and the information transmission between cameras, mechanical arms, and sensors.
The control module in the cabin uses the STM32F4327 processor, which integrates a Cortex-M7 core (with floating-point unit) operating at 180 MHz. It is built on STMicroelectronics' 90 nm process with the ART Accelerator and supports dynamic power consumption adjustment. The processor offers fast processing speed, a low price, and more than 20 interfaces, which can satisfy the requirements of the underwater control unit by connecting various hardware devices; it can operate in low-temperature environments and meets the demands of underwater information transmission.

2.2. ROV System Test

Figure 5 illustrates the schematic representation of the pool experiment. In order to prevent ROV equipment failure, cabin water inflow, abnormal communication, and other problems in the field water flow environment, the tightness test and body movement test were carried out in the artificial pool of the laboratory in advance.
Firstly, inspect the condition of the ROV’s body to ensure that its battery is fully charged and the system is functioning properly, with special attention paid to the integrity of the sealing components. Secondly, conduct underwater sealing experiments on the ROV by installing pressure sensors, temperature sensors, and video cameras on the ROV to monitor its internal status in real time. Next, place the ROV into a sealed container or water tank and ensure that all interfaces are well sealed. Subsequently, simulate the diving process of the ROV by slowly adding water to the container or tank, while observing and recording the pressure and temperature changes inside the ROV. It is crucial to conduct underwater performance testing immediately after completing the sealing test of the ROV and ensuring its good sealing performance. In the laboratory pool, we activated the ROV to test its maneuverability, stability, and athletic performance. By remote control or autonomous navigation, we observe the response speed, navigation posture, speed, and endurance of the ROV to ensure its excellent performance in practical applications. At the same time, we record and analyze test data to provide strong support for the subsequent improvement and application of the ROV. Figure 6 shows the underwater performance test results in an artificial water tank.
The ROV demonstrated exceptional performance following the sealing and motion tests in the pool. In the airtightness assessment, the ROV exhibited superior sealing efficacy, preventing any water ingress; in the motion performance evaluation, the ROV displayed steady maneuverability and rapid responsiveness and fulfilled the criteria for underwater operations regarding velocity and endurance. Consequently, we are confident that the ROV is fully equipped to meet the anticipated underwater operational specifications, and we expect excellent performance in practical applications.

2.3. Research on Object Detection Algorithm Based on YOLOv8

2.3.1. YOLOv8 Object Detection Algorithm

YOLOv8 comprises five variants ranging from nano to extra-large: v8n, v8s, v8m, v8l, and v8x. As the model size increases, its accuracy improves accordingly. Considering the particularity of underwater environments, underwater images suffer from asymmetric degradation, and the shallow feature preservation ability of YOLOv8s is more advantageous for small object detection than deeper networks. Meanwhile, the YOLOv8s model has a smaller size and is better suited to underwater communication constraints. In turbulent environments, the YOLOv8s model also has lower computational latency and a lower target tracking loss rate. This article therefore selects the YOLOv8s model for its compact size and good precision. Figure 7 illustrates the diagram of the network structure.
The backbone component primarily executes feature extraction via the Darknet-53 [30] framework and incorporates a novel C2f module for residual learning. Utilizing CSP and ELAN, an increased number of skip-layer connections and supplementary Split operations are employed to incorporate gradient variations into the feature map throughout the entire process. The Conv convolution module and C2f module are successively stacked four times, with each stacking referred to as a stage. The SPPF module utilized in YOLOv5 and other designs was implemented to standardize the vector dimensions of feature maps across various scales.
The neck component primarily performs feature fusion. YOLOv8s eliminates the 1 × 1 convolution prior to upsampling found in YOLOv5 and YOLOv6, directly upsampling the feature outputs from various stages of the backbone.
The head portion comprises three detection heads, each utilizing feature maps of varying dimensions to identify and output target objects of disparate sizes.
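As a point of reference, the sketch below shows how the stock YOLOv8s model described above can be instantiated and run with the ultralytics package. This is an illustrative snippet rather than the authors' code; the weight file name and the example image path are placeholders.

```python
# Illustrative only: load the stock YOLOv8s model (Darknet-53-style backbone with C2f and
# SPPF, feature-fusion neck, and three detection heads) and run a single prediction.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                                   # pretrained YOLOv8s weights
results = model.predict("underwater_sample.jpg", conf=0.5)   # placeholder image; confidence as in Table 1
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)           # box coordinates, class ids, confidences
```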

2.3.2. Improvement of YOLOv8s Object Detection Network

A.
Improvement of Attention Mechanism.
Triplet Attention [31] is an innovative technique that uses a three-branch structure to capture interactions between different dimensions and to compute attention weights. Unlike CBAM [32] and SENet [33], it does not depend on a large number of learnable parameters to establish the dependencies among the channels of the input tensor, as shown in Figure 8. Triplet Attention constructs interdependencies between dimensions through rotation operations and residual transformations, encoding inter-channel and spatial information with minimal computational complexity. The approach is simple and efficient and can be integrated as an auxiliary module into existing mainstream backbone architectures.
The Z-pool layer reduces the tensor along the channel dimension C by concatenating the max-pooled and average-pooled features over that dimension. This can be expressed as follows:
$$Z\text{-pool}(x) = \left[\mathrm{MaxPool}_{0d}(x),\ \mathrm{AvgPool}_{0d}(x)\right]$$
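As a concrete illustration of the Z-pool operation defined above, the following PyTorch sketch (our own minimal re-implementation, not the authors' code) concatenates channel-wise max and average pooling:

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Z-pool: concatenate max- and average-pooled maps along the channel axis (C -> 2)."""
    def forward(self, x):                          # x: (B, C, H, W)
        return torch.cat(
            (x.max(dim=1, keepdim=True)[0],        # channel-wise max pooling
             x.mean(dim=1, keepdim=True)),         # channel-wise average pooling
            dim=1)                                 # output: (B, 2, H, W)
```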
As shown in Figure 9, Triplet Attention is a three-branch module, with each branch responsible for capturing the interaction between two of the input tensor's dimensions. Specifically, the three branches capture the (C, H), (C, W), and (H, W) interactions, respectively. In the first branch, the input tensor is rotated 90 degrees counterclockwise along the H axis to obtain a tensor $\hat{x}_1$ of shape (W × H × C). Similarly, the second branch rotates along the W axis to obtain a tensor $\hat{x}_2$ of shape (H × C × W). The third branch performs no rotation and operates on the input directly. Each rotated tensor is subjected to the Z-pool operation, which reduces its first dimension to 2, yielding, for the first branch, a tensor of shape (2 × H × C). The Z-pooled tensor then passes through a convolutional layer (kernel size k × k) and a batch normalization layer to obtain an intermediate output of shape (1 × H × C), from which attention weights are generated by the sigmoid activation function. The generated attention weights are applied to the rotated tensor, which is then rotated back to its original orientation to restore the original input shape (C × H × W). The final output tensor is as follows:
$$y = \frac{1}{3}\left(\overline{\hat{x}_1\,\sigma\!\left(\psi_1(\hat{x}_1)\right)} + \overline{\hat{x}_2\,\sigma\!\left(\psi_2(\hat{x}_2)\right)} + x\,\sigma\!\left(\psi_3(\hat{x}_3)\right)\right)$$
where $\sigma$ denotes the sigmoid activation, $\psi_i$ denotes the convolution and batch normalization of the $i$-th branch, and the overline indicates rotating the branch output back to the original orientation.
Through the above steps, Triplet Attention captures the interactions between different dimensions, providing richer feature representations with low computational complexity and fewer parameters, making it easy to integrate into standard deep CNN architectures.
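Building on the Z-pool sketch above (which supplies the imports and the ZPool class), a minimal PyTorch version of the three-branch Triplet Attention module might look as follows. The 7 × 7 kernel follows the original Triplet Attention design; this is an illustrative sketch, not the exact implementation used in our network.

```python
class AttentionGate(nn.Module):
    """One branch: Z-pool -> k x k conv -> batch norm -> sigmoid -> rescale the input."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1))

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three branches capture (C, W), (C, H) and (H, W) interactions; outputs are averaged."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.branch_cw = AttentionGate(kernel_size)
        self.branch_ch = AttentionGate(kernel_size)
        self.branch_hw = AttentionGate(kernel_size)

    def forward(self, x):                                                 # x: (B, C, H, W)
        y_cw = self.branch_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # rotate along H axis and back
        y_ch = self.branch_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # rotate along W axis and back
        y_hw = self.branch_hw(x)                                          # identity (H, W) branch
        return (y_cw + y_ch + y_hw) / 3.0                                 # average per the equation above
```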
B.
Enhancement of Original Convolution
Traditional convolutional architectures such as U-Net mainly rely on fixed convolution kernels and skip connections to extract and restore features; they exhibit weak adaptability and struggle with the intricate characteristics of underwater imagery. GCN-based approaches perform well on graph-structured data, but their neighborhood aggregation mechanism lacks flexibility and is ill-suited to irregular image content. Adaptive Deformable ConvNet v4 (DCNv4) [34] effectively addresses the issue of image distortion in underwater target detection. In contrast to conventional convolution, it adds offset variables to the convolution kernel, enabling flexible changes in sampling shape and position according to image content and thus enhancing the precision of feature extraction from deformed images.
The ninth layer of YOLOv8s is located in the middle and deep layers of the backbone, with a feature map resolution of about 1/8 of the input image, which can capture medium-scale targets while preserving sufficient spatial details. In addition, deep networks begin to emphasize semantic information in this layer, but underwater target deformation still needs to retain geometric sensitivity. The dynamic sampling of DCNv4 precisely compensates for the rigid structural defects of traditional convolution. Therefore, this article chooses to replace the original ninth convolutional layer with DCNv4.
The calculation formula for standard convolution is as follows:
$$y(p_0) = \sum_{k=1}^{K} w_k \cdot x(p_0 + p_k)$$
Among them, $K$ denotes the number of sampling points, $p_k$ signifies the $k$-th location of the sampling grid, and $w_k$ indicates the projection weight of the corresponding sampling point. The calculation formula for deformable convolution is as follows:
$$y(p_0) = \sum_{k=1}^{K} w_k \cdot x(p_0 + p_k + \Delta p_k)$$
Among them, $\Delta p_k$ represents the learnable offset of the $k$-th sampling point.
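To make the offset mechanism concrete, the sketch below implements the deformable-convolution formula with torchvision's deform_conv2d: a small convolution predicts the offsets Δp_k, and the main kernel samples the input at the shifted grid positions. This illustrates the principle only; it is not the DCNv4 operator adopted later in this paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SimpleDeformConv(nn.Module):
    """Deformable-convolution sketch: offsets are predicted from the input itself."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)    # projection weights w_k
        # two offset components (dx, dy) for each of the k*k sampling points p_k
        self.offset_head = nn.Conv2d(c_in, 2 * k * k, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_head(x)                                        # learned offsets (delta p_k)
        return deform_conv2d(x, offsets, self.weight, padding=self.k // 2)

out = SimpleDeformConv(16, 32)(torch.randn(1, 16, 40, 40))                   # -> (1, 32, 40, 40)
```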
Figure 10 illustrates a comparison between a regular convolution kernel and a convolution kernel with an additional offset. Figure 10a illustrates basic convolution, Figure 10b depicts deformable convolution, while Figure 10c,d present other variations in deformable convolution.
This article substitutes DCNv4, an efficient and effective operator, for the original convolution. DCNv4 improves on DCNv3 [35] by eliminating softmax normalization and optimizing memory access, thereby greatly improving convergence and processing performance. It excels in tasks including image classification, segmentation, and generation, particularly when incorporated into generative models. Substituting DCNv4 for DCNv3 in a model can improve speed and performance by 80% without requiring further modifications. The benefits of DCNv4 in terms of speed, efficiency, and multi-task visual performance suggest that it is a crucial component for future visual models.
As shown in Figure 11, we display the relative running time based on DCNv3. DCNv4 has a significant acceleration compared to DCNv3 and surpasses other common visual operators.
C.
Optimization of loss function
The bounding box regression loss is a critical element of the detector localization branch and significantly influences object detection performance. In Figure 12, $b$ denotes the centroid of the predicted box with coordinates $(x_c, y_c)$, and $b^{gt}$ denotes the centroid of the ground truth box with coordinates $(x_c^{gt}, y_c^{gt})$. This paper adopts the Shape-IoU [36] method to address shortcomings in existing research, computing the loss with an emphasis on the shape and scale of the bounding box and thereby improving the precision of bounding box regression.
The Shape-IoU formulation is derived from the geometry shown in Figure 12:
$$IoU = \frac{\left| B \cap B^{gt} \right|}{\left| B \cup B^{gt} \right|}$$
Among them, $IoU$ [36] is the most prevalent assessment metric for target detection, with $B$ and $B^{gt}$ denoting the predicted box and the ground truth box, respectively.
$$ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}, \qquad hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$$
Among these, $scale$ denotes the scaling factor, which relates to the size of the targets in the dataset; $ww$ and $hh$ are the weight coefficients in the horizontal and vertical directions, respectively, whose values depend on the shape of the ground truth box. Here $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth box, and $w$ and $h$ denote the width and height of the anchor box. The shape-weighted distance term is defined as follows:
$$distance^{shape} = hh \times \frac{(x_c - x_c^{gt})^2}{c^2} + ww \times \frac{(y_c - y_c^{gt})^2}{c^2}$$
Among them, $distance^{shape}$ denotes the shape-weighted center distance between the predicted box and the ground truth box.
$$\Omega^{shape} = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \theta = 4$$
Among them, $\Omega^{shape}$ denotes the shape loss, which further penalizes prediction boxes with significant shape differences.
$$\omega_w = hh \times \frac{\left| w - w^{gt} \right|}{\max(w, w^{gt})}, \qquad \omega_h = ww \times \frac{\left| h - h^{gt} \right|}{\max(h, h^{gt})}$$
Combining the IoU term, the shape-weighted distance, and the shape penalty defined above, the Shape-IoU bounding box regression loss function is defined as follows:
$$L_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$$
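For clarity, the equations above can be collected into a single loss routine. The PyTorch sketch below is our own illustrative re-implementation of Shape-IoU, with boxes in (x1, y1, x2, y2) format and $c$ taken as the diagonal of the smallest enclosing box; it is not the training code of this paper.

```python
import torch

def shape_iou_loss(pred, target, scale=0.0, eps=1e-7):
    """Shape-IoU loss sketch: 1 - IoU + distance_shape + 0.5 * Omega_shape."""
    px1, py1, px2, py2 = pred.unbind(-1)
    gx1, gy1, gx2, gy2 = target.unbind(-1)
    w, h = px2 - px1, py2 - py1                    # anchor (predicted) width and height
    wg, hg = gx2 - gx1, gy2 - gy1                  # ground truth width and height

    # IoU term
    inter = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0) * \
            (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0)
    iou = inter / (w * h + wg * hg - inter + eps)

    # shape-dependent weight coefficients ww and hh
    denom = wg.pow(scale) + hg.pow(scale) + eps
    ww = 2 * wg.pow(scale) / denom
    hh = 2 * hg.pow(scale) / denom

    # shape-weighted center distance, normalized by the enclosing-box diagonal c
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    c2 = cw.pow(2) + ch.pow(2) + eps
    cx, cy = (px1 + px2) / 2, (py1 + py2) / 2
    cxg, cyg = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    dist_shape = hh * (cx - cxg).pow(2) / c2 + ww * (cy - cyg).pow(2) / c2

    # shape penalty Omega_shape with theta = 4
    omega_w = hh * (w - wg).abs() / torch.max(w, wg)
    omega_h = ww * (h - hg).abs() / torch.max(h, hg)
    omega_shape = (1 - torch.exp(-omega_w)).pow(4) + (1 - torch.exp(-omega_h)).pow(4)

    return 1 - iou + dist_shape + 0.5 * omega_shape
```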

3. Results

3.1. Data Collection and Experimental Setting

This paper employs the URPC dataset, supplemented with underwater target images collected from online resources, yielding 5543 underwater target photographs; sample images are shown in Figure 13.
The dataset encompasses several object categories, as depicted in Figure 14a; Figure 14b depicts the distribution of bounding box dimensions, indicating that most targets are quite small, with normalized sizes primarily concentrated between 0.0 and 0.2. To facilitate effective learning of the designated model on the dataset, it is divided into a training set and a validation set in a 9:1 ratio.
The experimental setting for this article is as follows: GPU: NVIDIA GeForce RTX 4060, CPU: Intel(R) Core(TM) i9-12900, Memory: 16.00 GB, Video Memory: 6 GB. The development environment is PyCharm, the programming language is Python, and the operating system and software environment comprise Windows 11, CUDA 11.7, Python 3.10, and PyTorch 1.12.1. The basic parameter configuration of this study is shown in Table 1.
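Assuming training is driven through the ultralytics API, the settings of Table 1 translate roughly into the call below; the dataset YAML name is a placeholder, and arguments not listed in Table 1 are left at library defaults.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")
model.train(
    data="urpc.yaml",        # hypothetical dataset config (train/val split 9:1)
    batch=4,                 # batch size (Table 1)
    lr0=0.01,                # initial learning rate (Table 1)
    optimizer="SGD",         # optimizer (Table 1)
    weight_decay=0.0005,     # weight attenuation factor (Table 1)
)
```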

3.2. Evaluation Indicators

This paper comprehensively evaluates the performance of object detection algorithms using average precision (AP), mean average precision (mAP), and frame rate (FPS).
Precision signifies the exactness of detection, defined as the proportion of predicted positive samples that are truly positive. The equation is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall denotes the ratio of accurately identified samples within the positive set. The equation is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The $F_1$ score takes into account both the precision and recall of the model. A high $F_1$ score indicates that the model exhibits a robust performance in both precision and recall. The formula for calculating the $F_1$ score is as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2TP}{2TP + FP + FN}$$
$AP$ denotes the average precision for a single underwater category, computed as the area under the Precision–Recall curve. The equation is as follows:
$$AP = \int_0^1 P \, dR$$
Among them, $P$ represents precision and $R$ represents the recall rate.
$mAP$ is a holistic metric that incorporates both precision and recall, representing the average of the $AP$ values across all categories; the commonly employed variant mAP@0.5 is used to assess the model's accuracy in this paper. The equation is stated as follows:
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N}$$
FPS denotes the number of frames processed per second and is utilized to assess the algorithm's detection speed.
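The evaluation metrics above reduce to a few lines of code. The sketch below (illustrative, with hypothetical counts in the example) computes precision, recall, F1, AP as the area under the Precision–Recall curve, and mAP as the mean of the per-class AP values.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(recall, precision):
    """AP as the area under the Precision-Recall curve (trapezoidal approximation of the integral)."""
    return float(np.trapz(precision, recall))      # recall values must be sorted in ascending order

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

print(precision_recall_f1(80, 10, 20))             # hypothetical counts -> (0.889, 0.800, 0.842)
```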

3.3. Results and Analysis of Experiments

3.3.1. Artificial Water Pool Experiment

Figure 15 illustrates the Precision–Recall curve of the improved model on the dataset, with a mAP@0.5 of 87.6%. Notably, sea urchins exhibit a singular and distinct color structure, achieving an AP value of 92.5%. Sea cucumbers possess a complex morphology and a coloration that closely resembles their surroundings, resulting in an AP value of 76.6%. Figure 16 illustrates the F1 curve, demonstrating an improvement relative to the original model.
The improved model validation set prediction diagram is shown in Figure 17. Compared with the annotated dataset, the improved model can identify unlabeled targets, reflecting its network feature extraction ability and adaptability.
Figure 18 presents the training and validation loss and metric curves of the improved model, further reflecting its feature extraction ability and the improvement in underwater target detection accuracy.

3.3.2. Comparison of Detection Performance of Different Object Detection Models

In exploring the challenging field of underwater target recognition, researchers have attempted and validated various advanced object detection models. Classic and popular detection methods such as Faster R-CNN, SSD, RetinaNet [17], YOLOv5, etc., have demonstrated a remarkable performance in previous studies, effectively identifying various types of targets in complex underwater environments. The results of these models have laid a solid foundation for underwater visual analysis, demonstrating the potential of machine learning technology in overcoming challenges such as underwater light attenuation, background interference, and target deformation. To assess the efficacy and benefits of the proposed approach, we conducted a comparison of the YOLOv8s model with many prevalent object detection techniques utilizing the identical dataset. The experimental data is shown in Table 2.
The data shows that the enhanced YOLOv8s methodology discussed in this article demonstrates substantial benefits in detection speed and precision for underwater target recognition tasks. Experimental validation confirms that the improved approach displays an exceptional real-time performance, with a frame rate of up to 87 FPS.

3.3.3. Ablation Experiment

This study performed ablation tests to assess the efficacy of several submodules. Ablation experiments were performed by progressively integrating different modules to evaluate their influence on overall model performance, to examine the distinct role of each module, and to confirm the effectiveness of the proposed method in this research. The experimental results are displayed in Table 3. B represents the baseline established by the YOLOv8s model using the dataset, T represents Triplet Attention, D represents deformable convolution DCNv4, and S represents Shape IoU.
The results indicate that, relative to Model 1, Model 2 shows a slight increase in floating-point operations and parameter count, while the frame rate decreases by about six FPS. Model 4 differs from Model 3 only in the loss function, resulting in minimal changes in floating-point operations and the number of parameters. Model 4 demonstrates a 2.7% improvement in mAP compared to the initial Model 1, while the FPS still meets the required norms. These results demonstrate the effectiveness of the proposed approach.

3.3.4. Analysis of Gripping Experiments and Results for an Underwater Robot Prototype

A.
Artificial aquatic experimentation
The experiment was performed in an artificial water tank, and the detection findings are illustrated in Figure 19. The experimental results show that although there are frequent interferences and overlaps in multi-target recognition, this algorithm can meet the task requirements of underwater target detection.
B.
Natural water experiment
This article chose to conduct real water detection experiments on Dengbu Island in Zhoushan City, Zhejiang Province. The waters of Zhoushan Archipelago include shallow and deep-sea areas, providing diverse detection environments suitable for underwater target detection experiments in different depths and complex environments. There are various types of sediments in the rich sediment sea area, which helps to test and verify the performance and reliability of various underwater detection equipment under different sediment conditions. Figure 20 shows the experimental map of Dengbu island in Zhoushan.
The performance enhancement of the suggested network design was assessed by comparing the modified algorithm with the original YOLOv8 method. Table 4 presents the comparative data.
Table 4 illustrates that the enhanced algorithm presented in this paper increases the recognition accuracy for each object class in actual waters compared to the original YOLOv8s. The mAP value has risen by 3.8% relative to the original YOLOv8s, and the F1 score has improved by 0.12 compared to the original. The findings indicate that the algorithm presented in this research performs effectively in actual aquatic environments and satisfies the visual requirements for ROV underwater operations.
To assess the algorithm’s detection efficacy, representative situations across several contexts, including complex underwater environments, obstructed conditions, and highly populated targets, were selected. The network was evaluated against YOLOv8s, as illustrated in Figure 21, Figure 22 and Figure 23 (the left side depicts the performance prior to enhancement, while the right side shows the performance subsequent to enhancement).
The background of Figure 21 predominantly has rocks that emulate the hue of sea cucumbers, while the white specks on the rocks resemble scallops. The background presents considerable obstacles for sea cucumber identification, as the initial YOLOv8s network experienced detection inaccuracies in intricate environments, erroneously identifying sea cucumbers that closely resemble the background hue. The optimized YOLOv8s model can proficiently address this issue. The enhanced SPP structure facilitates the efficient interaction and integration of features at several stages, hence augmenting the model’s detection accuracy. Figure 22 illustrates instances of blurriness and occlusion in the underwater image, which may result in missed objects by the original model. Nevertheless, the enhanced model excels at accurately detecting them. The deformable convolution in the updated model may more accurately conform to the object’s shape and size during sampling, hence improving its resilience and generalizability. In the presence of several congested images, as illustrated in Figure 23, the YOLOv8s network has failed to detect certain objects; however, the enhanced model exhibits a markedly greater capacity to reliably identify overlapping and small targets.

4. Discussion

This work primarily addresses the issues of blurriness, low contrast, and shape distortion in underwater environmental images, and it is innovative in comparison with recent YOLOv8-based work. Reference [29] employs EfficientNet-B4 as the backbone network to augment the model's feature extraction capabilities and enhance its adaptability to targets of varying scales, utilizing NAS-FPN as the detection head. This approach facilitates the automatic generation of a feature pyramid network, thereby improving the efficiency of feature fusion and bolstering the model's adaptability to complex environments; these enhancements are more universal and relevant to many scenarios. The literature [28] uses DCNv3 to enhance the convolution layer, whereas this paper utilizes the more advanced DCNv4 to supplant the original convolution, hence augmenting the model's efficiency and accuracy while diminishing the parameter count. The Shape IoU loss utilized in this study is more effective than WIoU loss v3 in addressing objects with significant shape variation, such as underwater oscillating plants or animals. The assessment indicates that the mAP of this paper is 87.6%, surpassing the 86.5% reported in the literature [28]. Reference [27] opted to substitute the original backbone network of YOLOv8s with FasterNet-T0 and incorporated DCNv2 and Coordinate Attention. These enhancements prioritized lightweight design and small target detection but compromised computational performance. The mAP reported in reference [27] is 84.4%, which is inferior to that presented in this paper.

5. Conclusions

This study presents an improved YOLOv8s-based methodology for target detection in underwater robotics to tackle challenges related to inadequate image quality and diminished recognition accuracy. Given the specificity of underwater images, YOLOv8s’s capacity for shallow feature preservation is more advantageous for small object recognition compared to deep networks. Consequently, YOLOv8s has been chosen as the subject of enhancement. In adjusting the model parameters, only the convolution in the ninth layer is altered, utilizing an adaptive deformable convolution DCN v4 to substitute certain original convolutions. This update alleviates the deformation problems of underwater images with a reduced parameter count. The detection capability for multi-scale targets is enhanced using Triplet Attention, and feature representation is strengthened by integrating superficial low-level attributes with advanced semantic features. Ultimately, substituting the CIoU loss function with Shape IoU enhances the model’s overall performance. The data proves that the method provided in this study exhibits higher detection efficiency in underwater target recognition in complex scenarios.
The model possesses extensive application potential in marine resource development, undersea environmental monitoring, and seabed archeology, among other areas. For instance, it can facilitate target recognition and capture by underwater robots to enable the automated extraction of marine resources; it can monitor the underwater environment and promptly detect pollution and ecological issues; and it can assist in submarine archeology by aiding scientists in the exploration and excavation of underwater artifacts.
The improved algorithm in this article has a wide range of applications in the industry. In terms of ocean resource development, diving robots can use this algorithm to identify and capture marine organisms, minerals, and other resources for automated collection, improving efficiency and safety. In terms of marine environmental monitoring, this algorithm can be used to identify the types and quantities of marine organisms, monitor the health status of marine ecology, and provide a basis for marine protection. In terms of underwater archeology, this algorithm can identify underwater artifacts such as sunken ships, ancient relics, etc., providing clues for underwater archeology.

Author Contributions

Conceptualization, Y.S. and W.C.; methodology, Y.S., Q.W. and T.F.; software, Y.S. and T.F.; validation, Q.W. and X.L.; formal analysis, Y.S. and W.C.; investigation, T.F.; resources, Y.S.; data curation, T.F.; writing—original draft preparation, Y.S. and X.L.; writing—review and editing, Y.S., Q.W. and W.C.; visualization, Q.W. and T.F.; supervision, Y.S., Q.W. and T.F.; project administration, Y.S. and X.L.; funding acquisition, Y.S. and W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially funded by the Jiangsu Provincial Industry Prospective and Key Core Technology Project (BE2021135) and partially funded by the Zhenjiang International Science and Technology Cooperation Project (GJ2020009).

Data Availability Statement

The dataset used in the paper can be downloaded here: https://openi.pcl.ac.cn/OpenOrcinus_orca/URPC2020_dataset/datasets (accessed on 10 March 2022).

Acknowledgments

The authors thank the editors and anonymous reviewers for their critical comments and suggestions for improving the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ROV             Remotely Operated Vehicle
Fast R-CNN      Fast Region-Based Convolutional Neural Network
Faster R-CNN    Faster Region-Based Convolutional Neural Network
Mask R-CNN      Mask Region-Based Convolutional Neural Network
SSD             Single Shot MultiBox Detector
YOLO            You Only Look Once
C2f             CSP Layer_2Conv
CSP             Cross Stage Partial
ELAN            Efficient Layer Aggregation Network
SPPF            Spatial Pyramid Pooling-Fast
PAN             Path Aggregation Network
FPN             Feature Pyramid Network
SENet           Squeeze-and-Excitation Network
DCNv3           Deformable ConvNet v3
DCNv4           Deformable ConvNet v4
CIoU loss       Complete Intersection over Union Loss
Shape IoU loss  Shape-Aware Intersection over Union Loss
URPC            Underwater Robot Professional Contest

References

  1. Chen, L.; Zheng, M.; Duan, S.; Luo, W.; Yao, L. Underwater target recognition based on improved YOLOv4 neural network. Electronics 2021, 10, 1634. [Google Scholar] [CrossRef]
  2. Yu, Y.; Guo, B.; Chu, S.; Li, H.; Yang, P. A review of underwater biological target detection methods based on deep learning. Shandong Sci. 2023, 36, 1–7. [Google Scholar]
  3. Xu, S.B.; Zhang, M.H.; Song, W.; Mei, H.; He, Q.; Liotta, A. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 2023, 527, 204–232. [Google Scholar] [CrossRef]
  4. Anilkumar, S.; Dhanya, P.R.; Balakrishnan, A.A.; Supriya, M.H. Algorithm for Underwater Cable Tracking Using Clahe Based Enhancement. In Proceedings of the 2019 International Symposium on Ocean Technology(SYMPOL), Ernakulam, India, 11–13 December 2019; IEEE: Ernakulam, India, 2019; pp. 129–137. [Google Scholar]
  5. Land, E.H. The retinex theory of color vision. Sci. Am. 1977, 237, 108–127. [Google Scholar] [CrossRef] [PubMed]
  6. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar] [PubMed]
  7. Bhatti, U.A.; Yu, Z.; Chanussot, J.; Zeeshan, Z.; Yuan, L.; Luo, W.; Nawaz, S.A.; Bhatti, M.A.; Ain, Q.U.; Mehmood, A. Local similarity-based spatial–spectral fusion hyperspectral image classification with deep CNN and Gabor filtering. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  8. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038–107057. [Google Scholar] [CrossRef]
  9. Tian, Y.; Xu, Y.; Zhou, J. Underwater image enhancement method based on feature fusion neural network. IEEE Access 2022, 10, 107536–107548. [Google Scholar] [CrossRef]
  10. Wang, H.; Yang, M.; Yin, G.; Dong, J. Self-adversarial generative adversarial network for underwater image enhancement. IEEE J. Ocean. Eng. 2023, 49, 237–248. [Google Scholar] [CrossRef]
  11. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  12. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  13. Kumar, A.; Srivastava, S. Object detection system based on convolution neural networks using single shot multi-box detector. Procedia Comput. Sci. 2020, 171, 2610–2617. [Google Scholar] [CrossRef]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  17. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  18. Chen, K.; Li, J.; Lin, W.; See, J.; Wang, J.; Duan, L.; Chen, Z.; He, C.; Zou, J. Towards accurate one-stage object detection with ap-loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15-20 June 2019; pp. 5119–5127. [Google Scholar]
  19. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  20. Sung, M.; Yu, S.-C.; Girdhar, Y. Vision based real-time fish detection using convolutional neural network. In Proceedings of the OCEANS 2017—Aberdeen, Aberdeen, UK, 19–22 June 2017; pp. 1–6. [Google Scholar]
  21. Zhu, S.; Hang, R.; Liu, Q. Underwater object detection based on class weighted YOLO network. J. Nanjing Norm. Univ. (Nat. Sci. Ed.) 2020, 43, 129–135. [Google Scholar]
  22. Wu, Q.; Cen, L.; Kan, S.; Zhai, Y.; Chen, X.; Zhang, H. Real-time underwater target detection based on improved YOLOv7. J. Real-Time Image Process. 2025, 22, 43–47. [Google Scholar] [CrossRef]
  23. Chen, X.; Yuan, M.; Yang, Q.; Yao, H.; Wang, H. Underwater-ycc: Underwater target detection optimization algorithm based on YOLOv7. J. Mar. Sci. Eng. 2023, 11, 995. [Google Scholar] [CrossRef]
  24. Gallagher, J. How to Train an Ultralytics YOLOv8 Oriented Bounding Box (OBB) Model. [2024-02-06]. Available online: https://blog.roboflow.com/train-yolov8-obb-model/ (accessed on 13 May 2025).
  25. Yuan, H.; Lei, T. Detection and identification of fish in electronic monitoring data of commercial fishing vessels based on improved YOLOv8. J. Dalian Ocean. Univ. 2023, 38, 533–542. [Google Scholar]
  26. Zhou, X.; Li, Y.; Wu, M.; Fan, X.; Wang, J. Improved YOLOv8 for underwater target detection. Comput. Syst. Appl. 2024, 33, 177–185. [Google Scholar]
  27. Zhang, M.; Wang, Z.; Song, W.; Zhao, D.; Zhao, H. Efficient small-object detection in underwater images using the enhanced yolov8 network. Appl. Sci. 2024, 14, 1095. [Google Scholar] [CrossRef]
  28. Song, G.; Chen, W.; Zhou, Q.; Guo, C. Underwater Robot Target Detection Algorithm Based on YOLOv8. Electronics 2024, 13, 3374. [Google Scholar] [CrossRef]
  29. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  30. Su, J.; Feng, K.K.; Liang, B.; Hou, W. CoT-YOLO underwater target detection algorithm. Computer Eng. Des. 2024, 45, 2119–2126. [Google Scholar]
  31. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. arXiv 2020, arXiv:2010.03045. [Google Scholar]
  32. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  33. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  34. Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Wang, W.; Lu, T.; Li, H.; Qiao, Y.; et al. Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [CrossRef]
  35. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  36. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
Figure 1. Physical image of ROV underwater robot.
Figure 2. ROV system framework diagram.
Figure 3. Module structure diagram.
Figure 4. Cabin electrical diagram.
Figure 5. ROV pool experiment.
Figure 6. Image transmission and robot arm operation test.
Figure 7. YOLOv8s network structure diagram.
Figure 8. Triplet Attention principle diagram.
Figure 9. Triplet Attention network structure diagram.
Figure 10. Comparison of offset convolution kernels. (a) Basic convolution; (b) deformable convolution; (c,d) variations in deformable convolution.
Figure 11. DCNv4 data comparison chart.
Figure 12. Schematic representation of Shape Intersection over Union.
Figure 13. Dataset images. (a) Sea urchin; (b) sea cucumber; (c) shell; (d) starfish.
Figure 14. Data distribution. (a) Identified objects; (b) bounding box size.
Figure 15. Precision–Recall curve.
Figure 16. F1 curve.
Figure 17. Improved model validation set prediction graph.
Figure 18. Training and validation loss and metric curves for the improved model.
Figure 19. ROV pool target detection results.
Figure 20. Natural water experiment.
Figure 21. Comparison of complex background detection.
Figure 22. Comparison of occlusion environment detection.
Figure 23. Comparison of underwater multi-class dense target detection.
Table 1. Basic parameter settings.

Parameter                    Value
Batch size                   4
Learning rate                0.01
Optimizer                    SGD
Weight attenuation factor    0.0005
Confidence threshold         0.5
Table 2. Comparison results with AP, mAP, and FPS values of other mainstream models.

Model Name          Sea Urchin AP (%)   Sea Cucumber AP (%)   Sea Star AP (%)   Scallop AP (%)   mAP (%)   FPS (Hz)
SSD                 74.7                69.9                  75.2              60.2             70.0      21
YOLOv5s             91.3                75.1                  85.0              84.4             83.9      97
RetinaNet           77.2                68.1                  78.3              61.2             71.2      26
Faster R-CNN        87.4                69.4                  80.5              61.3             74.4      12
YOLOv8s             90.1                74.7                  87.3              85.5             84.4      96
Improved YOLOv8s    92.5                76.6                  88.3              87.4             87.6      87
Table 3. Experimental results of underwater dataset ablation.

Model   B   T   D   S   mAP (%)   FPS    FLOPs (G)   Parameter Quantity (M)
1       ✓               84.9      96.3   28.4        11.1
2       ✓   ✓           85.2      90.2   31.6        14.5
3       ✓   ✓   ✓       86.3      87.9   33.2        17.6
4       ✓   ✓   ✓   ✓   87.6      87.1   33.2        17.6
Table 4. Comparison of AP, mAP, and F1 values in real waters.

Real Sea Area       Sea Urchin AP (%)   Sea Cucumber AP (%)   Sea Star AP (%)   Scallop AP (%)   mAP (%)   F1
YOLOv8s             88.1                72.4                  85.2              81.6             81.9      0.67
Improved YOLOv8s    90.6                82.4                  85.1              84.7             85.7      0.79


