Underwater-YCC: Underwater Target Detection Optimization Algorithm Based on YOLOv7

Abstract: Underwater target detection using optical images is a challenging yet promising area that has witnessed significant progress. However, fuzzy distortions and irregular light absorption in the underwater environment often lead to image blur and color bias, particularly for small targets. Consequently, existing methods have yet to yield satisfactory results. To address this issue, we propose the Underwater-YCC optimization algorithm based on You Only Look Once (YOLO) v7 to enhance the accuracy of detecting small targets underwater. Our algorithm utilizes the Convolutional Block Attention Module (CBAM) to obtain fine-grained semantic information by selecting an optimal position through multiple experiments. Furthermore, we employ the Conv2Former as the Neck component of the network for underwater blurred images. Finally, we apply the Wise-IoU, which is effective in improving detection accuracy by assigning multiple weights between high- and low-quality images. Our experiments on the URPC2020 dataset demonstrate that the Underwater-YCC algorithm achieves a mean Average Precision (mAP) of up to 87.16% in complex underwater environments.


Introduction
The ocean is the largest repository of resources on Earth, and its related industries, such as marine ranching, are constantly improving due to the rapid development of underwater equipment. A crucial step in resource extraction and utilization is detection. New technologies, such as artificial intelligence, have provided significant impetus to improve detection. While many studies on underwater target detection are based on acoustic detection methods [1], these methods are inadequate for detecting small underwater organisms because of their low sound source level, which can easily be drowned out by background noise. Additionally, the feature diversity in acoustic detection methods may not meet the demand for distinguishing small differences between underwater organisms. For this reason, optical images, which contain rich features of the target, are more suitable for detecting small targets at close range.
However, the complex underwater environment can seriously affect optical images, and in general the quality of underwater images is poor. The primary reason is the complexity and variability of underwater lighting conditions [2]. Specifically, (i) the energy attenuation of red to blue light in the chromatographic process changes from fast to slow, resulting in a blue-green tone and color distortion of underwater images; (ii) different colors scatter in water to varying degrees and in varying manners, causing loss of fine image details; (iii) real-life water bodies are often turbid, containing sediment and plankton, which degrade the imaging quality of underwater cameras and blur the images; and (iv) due to their specific habitat, underwater organisms are usually attached to mud, sand, and reefs, which are difficult to distinguish from the background, and target occlusion is also a problem due to the specificity of organism distribution. All of these factors pose significant challenges to underwater target detection. The main contributions of this paper are as follows:

• Underwater data collection poses challenges due to poor image quality and the limited number of learnable samples. To overcome these challenges, this paper adopts data-enhancement methods, including random flipping, stretching, mosaic enhancement, and mixup, to enrich the learnable samples of the model. This approach improves the generalization ability of the model and helps to prevent overfitting.

• In order to extract more comprehensive semantic information and enhance the feature extraction capability of the model, we incorporate the CBAM attention mechanism into each component of the YOLOv7 architecture. Specifically, we introduce the CBAM attention mechanism into the Backbone, Neck, and Head structures, respectively, to identify the most effective location for the attention mechanism. Our experimental results reveal that embedding the CBAM attention mechanism into the Neck structure yields the best performance, as it allows the model to capture fine-grained semantic information and detect targets more effectively.

• To enhance the ability of the model to detect objects in underwater images with poor quality, this paper introduces Conv2Former as the Neck component of the network. The Conv2Former model can effectively handle images with different resolutions and extract useful features for fusion, thereby improving the overall detection performance of the network on blurred underwater images.

• As low-quality underwater images can negatively affect the model's generalization ability, this paper introduces Wise-IoU as the bounding box regression loss function. This function improves the detection accuracy of the model by weighting the learning of samples of different qualities, resulting in more accurate localization and regression of targets in low-quality underwater images.
The paper is organized as follows. Section 2 focuses on the work related to this algorithm, with emphasis on the data enhancement approach and the YOLOv7 architecture. Section 3 introduces the proposed Underwater-YCC algorithm. In Section 4, the relevant experimental results are analyzed and discussed. Section 5 presents conclusions.

Related Work

Underwater Dataset Acquisition and Analysis
Deep-learning models with good generalization ability require a substantial amount of training data, and a lack of appropriate data can lead to poor network training. The underwater environment is considerably more complex than the terrestrial environment, requiring the use of artificial light sources to capture underwater videos. Light transmission in water is subject to absorption, reflection, scattering, and other effects, resulting in significant attenuation. As a consequence, captured underwater images have limited visibility, blurriness, low contrast, non-uniform illumination, and noise.
The URPC2020 dataset is composed of 5543 images belonging to four categories: echinus, holothurian, scallop, and starfish. To train and test the proposed algorithm, the dataset was split into training and testing sets at an 8:2 ratio, resulting in 4434 images for training and 1109 images for testing. This dataset presents a variety of complex situations, such as clustered and mutually occluding underwater creatures, uneven illumination, and motion-shot blurring, which makes it a realistic representation of the underwater environment and therefore improves the generalization ability of the model. However, the uneven distribution of samples among categories and their different resolutions pose significant challenges to the model's training. Figure 1 shows the sample information of URPC2020: Figure 1a shows the amount of data for each category, the size and number of bounding boxes, the location of the sample centroids, and the aspect ratio of the targets relative to the entire image.

Data Augmentation
Deep convolutional neural networks have demonstrated remarkable results in target detection tasks. However, these networks heavily rely on a large amount of image data for effective training, which is difficult to obtain in some domains, including underwater target detection. A detection model with high generalization ability can accurately detect and classify targets from various angles and in different states. Generalization ability can be defined as the difference in the performance of a model when evaluated on training and test data [20]. Models with weak generalization ability are prone to overfitting, and data augmentation is one of the key strategies to mitigate this issue and improve the generalization ability of the model.

Geometric Transformation
Geometric transformation alters the geometry of an image through operations such as flipping, rotating, shifting, scaling, and cropping. For orientation-insensitive tasks, flipping is one of the safest and most commonly used operations, and it does not change the size of the target. In the case of underwater target detection, the movements, morphology, and orientation of underwater creatures are uncertain, and using the flip operation for data augmentation can effectively improve the training results of the model. Horizontal and vertical flips are the two most commonly used types of flip operations, and the horizontal flip is preferred in most cases.
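As a concrete illustration of the flip operation, the following numpy sketch mirrors an image horizontally and remaps its bounding boxes; the absolute-pixel [x_min, y_min, x_max, y_max] box format is an assumption for this example, not something fixed by the dataset.

```python
import numpy as np

def horizontal_flip(image, boxes):
    """Flip an HxWxC image left-right and mirror its boxes.

    Boxes are assumed to be [x_min, y_min, x_max, y_max] in pixels;
    only the x coordinates change under a horizontal flip.
    """
    h, w = image.shape[:2]
    flipped = image[:, ::-1]                  # mirror the width axis
    boxes = np.asarray(boxes, dtype=float).copy()
    x_min = boxes[:, 0].copy()
    boxes[:, 0] = w - boxes[:, 2]             # new x_min from old x_max
    boxes[:, 2] = w - x_min                   # new x_max from old x_min
    return flipped, boxes
```

A vertical flip is the same idea applied to the y coordinates, which is why the pair of operations covers the uncertain orientations of underwater creatures so cheaply.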

Mixup Data Augmentation
The mixup method selects two random images from each batch and mixes them in a certain ratio to generate a new image used in training, without the original images participating directly in model training. It is a simple, data-independent augmentation method that constructs virtual training examples by proportionally adding two sample-label pairs [21]. The mixing of data and labels is defined as

x̃ = λx_i + (1 − λ)x_j,
ỹ = λy_i + (1 − λ)y_j,

where (x_i, y_i) and (x_j, y_j) are two randomly selected samples from the training set, y_i and y_j are one-hot label encodings, and λ ∈ [0, 1]. According to the above equations, mixup uses prior knowledge to extend the training distribution. Figure 2 shows the resulting image after performing mixup data augmentation.
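The two mixing equations above translate directly into a few lines of numpy; following the original mixup paper, λ is drawn from a Beta(α, α) distribution, and the α = 0.2 default here is an illustrative choice, not a value taken from this paper.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Blend two samples: x = λ·x_i + (1−λ)·x_j, y = λ·y_i + (1−λ)·y_j."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # λ ∈ [0, 1]
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y, lam
```

Because the labels are mixed with the same λ as the pixels, the network is trained to behave linearly between samples, which is the prior knowledge mixup injects.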

Mosaic Data Augmentation
Mosaic data augmentation mixes and cuts four randomly selected images from a dataset to obtain a new image. The result contains richer target information, which expands the training data to a certain extent and allows the network to be trained more fully on a small number of samples. Figure 3 shows the image after mosaic enhancement.
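The mixing-and-cutting step can be sketched as pasting the four images into the quadrants of one canvas around a random split point. This is a simplified sketch: bounding-box remapping is omitted, and the nearest-neighbour index resize is an illustrative shortcut rather than the resize a production pipeline would use.

```python
import numpy as np

def mosaic(images, out_size=640, rng=None):
    """Paste four images into the quadrants of a single canvas."""
    rng = np.random.default_rng() if rng is None else rng
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = rng.integers(out_size // 4, 3 * out_size // 4)  # random split point
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    quads = [(0, cy, 0, cx), (0, cy, cx, out_size),
             (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, quads):
        h, w = y1 - y0, x1 - x0
        ys = np.arange(h) * img.shape[0] // h   # nearest-neighbour resize
        xs = np.arange(w) * img.shape[1] // w
        canvas[y0:y1, x0:x1] = img[ys][:, xs]
    return canvas
```

Each output image then contains targets from four source images at once, which is what makes the technique effective on small datasets.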


Attention Mechanism
The attention mechanism can be regarded as a process of dynamic weight adjustment based on the features of the input image around the target position [22], so that the machine focuses as much as possible on the target to be detected and recognized, and the allocation of computing resources is optimized under limited computing power. Attention mechanisms play an important role in the field of computer vision, and more and more researchers optimize models by introducing them.
Attention mechanisms commonly used in the visual domain include the spatial domain, the channel domain, and the hybrid domain. The spatial domain generates a spatial mask of the same size as the feature map and then modifies the weights according to the importance of each location. The channel domain adds a weight to the information on each channel, representing the relevance of that channel to the key information; the higher the weight, the higher the relevance. Finally, the hybrid domain effectively combines channel attention and spatial attention, allowing the machine to focus on both simultaneously. Attention mechanisms can significantly improve the performance of target detection models.

Convolutional Block Attention Module
CBAM is a simple and effective feed-forward convolutional neural network attention module [23]. It combines a channel attention module with a spatial attention module, giving it superior performance compared with attention mechanisms that focus on only one dimension. Its structure diagram is shown in Figure 4. The features first pass through the channel attention module, whose output is weighted with the input features, and the weighted result then passes through the spatial attention module for final weighting to obtain the output.

The structure of the channel attention is shown in Figure 5. The input feature maps are subjected to spatial global max pooling and global average pooling, the two pooled vectors pass through a shared fully connected layer (MLP), and the results are summed and passed through a Sigmoid activation to obtain the channel attention map M_c(F):

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

The spatial attention mechanism takes the output of the channel attention module as its input, performs channel-wise global max pooling and global average pooling, concatenates the two results, reduces them to a single channel with a convolution operation, and then generates the spatial attention map M_s(F) with a Sigmoid:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))

The structure of the spatial attention is shown in Figure 6.
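The two CBAM stages can be sketched in plain numpy to make the data flow concrete. This is a simplified reading: w1 and w2 stand in for the shared two-layer MLP, and w_spatial replaces the 7×7 convolution with a per-map scalar mix of the two pooled maps, purely for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam(feat, w1, w2, w_spatial):
    """Apply a simplified CBAM to feat of shape (C, H, W)."""
    # --- channel attention: pool over space, share one MLP ---
    avg = feat.mean(axis=(1, 2))                  # AvgPool(F), shape (C,)
    mx = feat.max(axis=(1, 2))                    # MaxPool(F), shape (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP, ReLU hidden
    m_c = sigmoid(mlp(avg) + mlp(mx))             # M_c(F), shape (C,)
    feat = feat * m_c[:, None, None]
    # --- spatial attention: pool over channels, mix, squash ---
    avg_s = feat.mean(axis=0)                     # (H, W)
    max_s = feat.max(axis=0)                      # (H, W)
    m_s = sigmoid(w_spatial[0] * avg_s + w_spatial[1] * max_s)  # M_s(F)
    return feat * m_s[None]
```

Because both attention maps pass through a Sigmoid, each output activation is a damped copy of its input, which is the "dynamic weight adjustment" described above.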

YOLOv7 Network Architecture
The YOLOv7 model [24] is a state-of-the-art real-time target-detection model proposed in 2022. It is faster and more accurate than the previous YOLO series and other methods. Considering the characteristics of underwater targets, we propose an optimization algorithm based on YOLOv7 to improve the detection accuracy of underwater organisms. The network structure of YOLOv7 is shown in Figure 7.

The YOLOv7 network structure is a one-stage structure consisting of four parts: the Input Terminal, Backbone, Neck, and Head. The target image is fed into the Backbone after a series of data-enhancement operations. The Backbone section performs feature extraction on the image, the extracted features are fused and processed in the Neck module to obtain features at three sizes, and the final fused features are passed through the detection Head to obtain the output results. The Input Terminal involves operations such as data enhancement, adaptive anchor box calculation, and adaptive image scaling; here, we focus on the Backbone, Neck, and Head.


Backbone
The Backbone of the model is built using Conv1, Conv2, the ELAN module, and the D-MP module. Conv1 and Conv2 are two modules with different sizes of convolutional kernels; as shown in Figure 8, each is a convolutional layer followed by a batch normalization layer and an activation layer. Conv1 is mainly used for feature extraction, while Conv2 is equivalent to a down-sampling operation that selects the features to be extracted.

ELAN is an efficient network structure that allows the network to learn more features by controlling the longest and shortest gradient paths, and thus has better generalization capabilities. It has two branches. The first branch goes through a 1 × 1 convolution module to change the number of channels; the other branch changes the number of channels and then goes through four 3 × 3 convolution modules for feature extraction. Finally, the idea of a residual structure is introduced to superimpose the features and attain more detailed feature information. The structure is shown in Figure 9.

The D-MP module divides the input into two parts. The first branch is spatially down-sampled by MaxPool, and then the channels are compressed by a 1 × 1 convolution module. The other branch compresses the channels first and then performs a sampling operation using Conv2. Finally, the results of both branches are superimposed. The module has the same number of input and output channels with a twofold spatial resolution reduction. The structure is shown in Figure 10.
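Of the Backbone blocks above, the D-MP down-sampling step is the easiest to sketch. The numpy version below is a heavy simplification: w_a, w_b, and w_b2 are hypothetical 1 × 1 mixing matrices, the stride-2 Conv2 is reduced to strided sampling plus a channel mix, and batch normalization and activations are omitted.

```python
import numpy as np

def d_mp(feat, w_a, w_b, w_b2):
    """D-MP sketch for feat of shape (C, H, W) with even H and W.

    Branch A: 2x2 max pooling, then a 1x1 conv (w_a, C -> C/2).
    Branch B: 1x1 conv (w_b, C -> C/2), then a stride-2 step that this
    sketch reduces to strided sampling mixed by w_b2 (C/2 -> C/2).
    """
    c, h, w = feat.shape
    pooled = feat.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))
    a = np.einsum('oc,chw->ohw', w_a, pooled)           # 1x1 conv
    b = np.einsum('oc,chw->ohw', w_b, feat)[:, ::2, ::2]
    b = np.einsum('oc,chw->ohw', w_b2, b)
    return np.concatenate([a, b], axis=0)               # channels preserved
```

Concatenating the two half-channel branches is what lets the module halve the spatial size while keeping the channel count unchanged.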

Neck
The images go through the Backbone for feature extraction and then enter the Neck for feature fusion. The fusion part of YOLOv7 is similar to YOLOv5, using the traditional PAFPN structure. Three effective feature layers are obtained in the Backbone part for fusion. The features are first fused through an up-sampling operation and then through a down-sampling operation, thus obtaining feature information at different scales and allowing the network to have better robustness.

The SPPCSPC module first divides the features into two parts: one for conventional processing and the other for the SPP operation. The features in SPP are passed through four different MaxPool modules with pooling kernels of 1, 5, 9, and 13, respectively; the maximum pooling is used to obtain different receptive fields that distinguish between large and small targets. Finally, the results of the two parts are combined, reducing the amount of computation while improving detection accuracy. The module structure is shown in Figure 11.

ELAN-F is similar to the ELAN structure in the Backbone but differs in that the number of outputs in the first branch is increased by summing each output section, allowing for more efficient learning and convergence in a deeper network structure. The ELAN-F structure is shown in Figure 12.
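The multi-kernel pooling at the heart of the SPP branch can be sketched directly; stride-1 "same"-padded max pooling keeps the map size fixed while widening the receptive field with the kernel. Only the pooling branch is shown here; the surrounding 1 × 1 convolutions and the CSP merge of the two parts are omitted.

```python
import numpy as np

def maxpool_same(x, k):
    """k x k max pooling with stride 1 and 'same' padding on an (H, W) map."""
    p = k // 2
    padded = np.pad(x, p, mode='constant', constant_values=-np.inf)
    h, w = x.shape
    out = np.empty_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def spp_branch(feat_map, kernels=(5, 9, 13)):
    """Stack the identity map (kernel 1) with its pooled variants."""
    maps = [feat_map.astype(float)]
    maps += [maxpool_same(feat_map, k) for k in kernels]
    return np.stack(maps)
```

The kernel-1 branch preserves fine detail for small targets while the kernel-13 branch summarizes almost the whole map, which is how the module separates large from small targets.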

Head
In this part, YOLOv7 selects the 'IDetect' detection head with three target scales: large, medium, and small. The Head is used as the classifier and regressor of the network, and three enhanced effective feature layers are obtained through the above three parts. The information inside is used for feature-point judgment, determining whether a target corresponds to the a priori box at each feature point. The use of the RepConv module allows the structure of the model to change between training and inference, introducing the idea of a re-parameterized convolution structure, as in Figure 13. RepConv is divided into two parts. At training time it uses three branches: the top branch is a 3 × 3 convolution for feature extraction; the second branch is a 1 × 1 convolution for feature smoothing; and an Identity residual branch is added if the input and output are of equal size. These three parts are finally fused and summed. At inference time, there is only one 3 × 3 convolution, which is re-parameterized from the training module.

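The re-parameterization step that collapses the three training branches into one inference-time 3 × 3 kernel can be sketched as follows; biases and the per-branch batch-norm folding that a full implementation must also merge are omitted to keep the sketch short.

```python
import numpy as np

def reparameterize(w3, w1, identity=True):
    """Fold RepConv's training branches into a single 3x3 kernel.

    w3: (O, C, 3, 3) branch; w1: (O, C, 1, 1) branch; the identity
    branch (only valid when O == C) is a Dirac kernel that copies
    each channel through the centre tap.
    """
    o, c = w3.shape[:2]
    fused = w3.copy()
    fused[:, :, 1, 1] += w1[:, :, 0, 0]       # 1x1 lands on the centre tap
    if identity and o == c:
        idx = np.arange(o)
        fused[idx, idx, 1, 1] += 1.0          # Dirac identity kernel
    return fused
```

Because convolution is linear, the fused kernel produces exactly the same output as the three-branch sum, so the multi-branch structure costs nothing at inference time.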

Underwater-YCC Algorithm
In this section, the Underwater-YCC target detection algorithm is introduced. The main structure diagram of this algorithm is shown in Figure 14.


YOLOv7 with CBAM
In the field of target detection, there is no single rule for where adding an attention mechanism achieves the best results; the results vary from location to location. For YOLOv7, three different fusion methods were chosen for the three modules: Backbone, Neck, and Head. The first is to add the attention mechanism to the Backbone section, the part of the network where features are extracted. Fusing attention at this location can help the network extract more effective information and locate fine-grained features more easily, thus improving the overall performance of the network. The second method is to add the attention mechanism to the Neck part of the network, where features are integrated and extracted. When fusing information at different scales, adding the attention mechanism can help the network fuse more valuable information into the features to refine them. The last approach is to add the attention mechanism to the Head section, which performs feature classification as well as regression prediction; the attention mechanism is added before the three different scales of features enter and leave, to perform attention reconstruction on the feature maps and ultimately improve network performance. The three placements of the attention mechanism are shown in Figure 15.

Neck Improvement Based on Conv2Former
The introduction of the transformer has given a huge boost to the field of computer vision, demonstrating powerful performance in areas such as image segmentation and target detection. More and more researchers have proposed encoding spatial features by convolution, and Conv2Former is one of the most efficient methods for doing so. The structure of Conv2Former [25] is shown in Figure 16: it is a transformer-style convolutional network with a pyramidal structure and a different number of convolutional blocks in each of its four stages. Each stage has a different feature map resolution, and a patch-embedding block is used between two consecutive stages to reduce the resolution. The core of the method is the convolutional modulation operation, shown in Figure 17, which uses only depthwise convolutional features as weights to modulate the value representation via the Hadamard product, simplifying the self-attention mechanism and making more efficient use of large-kernel convolution. Inspired by TPH-YOLOv5 [26], Conv2Former replaces the ELAN-F convolution block in the Neck of the original YOLOv7. Compared with the original structure, Conv2Former better captures the global information and contextual semantic information of the network, obtaining rich features for the fusion operation and thereby improving network performance.
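The convolutional modulation operation can be sketched for a single channel. This is an assumption-laden illustration: the real block applies linear projections before and after the modulation, uses large depthwise kernels (e.g., 11 × 11), and operates on multi-channel tensors; here the projections are omitted and a 3 × 3 averaging kernel stands in for the depthwise convolution.

```python
def dwconv_same(x, k):
    """'Same'-padded 2D convolution of one channel by a (2r+1)x(2r+1) kernel."""
    h, w, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        s += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = s
    return out

def conv_modulation(x, k):
    """Z = A ⊙ V with A = DWConv(X) and V = X (linear projections omitted)."""
    a = dwconv_same(x, k)                        # spatial aggregation as "attention"
    return [[ai * vi for ai, vi in zip(ra, rv)]  # Hadamard (elementwise) product
            for ra, rv in zip(a, x)]

avg3 = [[1 / 9] * 3 for _ in range(3)]           # simple 3x3 averaging kernel
z = conv_modulation([[1.0, 2.0], [3.0, 4.0]], avg3)
```

Unlike self-attention, whose cost grows quadratically with the number of positions, the modulation weights come from a fixed-size convolution, which is why larger kernels can be used cheaply.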

Introduction of Wise-IoU Bounding Box Loss Function
In the field of target detection, the choice of the bounding box loss function directly affects the accuracy of the detection result. The bounding box loss function is used to optimize the error between the position of the detected object and the real object, so that the output prediction box approaches the real box as closely as possible. As the scenes and datasets encountered in practical underwater work are of poor quality, we propose the use of Wise-IoU as the bounding box loss function, thus balancing the contributions of training images of varying quality and obtaining more accurate detection results. Wise-IoU [27] introduces a quality-dependent weight on top of the traditional IoU: each anchor box is assigned a weight according to its quality, and this weight scales the IoU loss, reducing the impact of low-quality samples on the detection results. Wise-IoUv1, with a two-level attention mechanism, is first constructed based on the distance metric with the following equation:

R_Wise-IoU = exp( ((x - x_gt)^2 + (y - y_gt)^2) / ((W_g^2 + H_g^2)*) ),  L_Wise-IoUv1 = R_Wise-IoU * L_IoU

An anchor box is represented by B = [x y w h], where the values are the center coordinates and the size of the corresponding bounding box, and B_gt = [x_gt y_gt w_gt h_gt] refers to the corresponding values of the target box. W_g and H_g are the width and height of the smallest box enclosing the anchor box and the target box, and the superscript * indicates that these terms are detached from the gradient computation. R_Wise-IoU can significantly amplify the IoU loss L_IoU of an ordinary-quality anchor box, while a small L_IoU reduces R_Wise-IoU for a high-quality anchor box.
The method used in this paper applies the outlier degree β on top of Wise-IoUv1. The outlier β is used to describe the quality of anchor boxes, with a smaller outlier representing a higher-quality anchor box. A smaller gradient gain is assigned to anchor boxes with larger outliers, preventing low-quality images from harming the training results. The outlier is defined as follows:

β = L_IoU* / L̄_IoU ∈ [0, +∞)

where L̄_IoU is the running mean of L_IoU. The Wise-IoU used in this paper is defined as follows:

L_Wise-IoU = r * L_Wise-IoUv1,  r = β / (δ * α^(β - δ))  (7)

where α and δ are hyperparameters; δ makes r = 1 when β = δ, so the anchor box has the highest gradient gain when its outlier equals this fixed value. According to Equation (7), since L̄_IoU is dynamic, the criteria for dividing the anchor boxes are also dynamic, so Wise-IoU can use the best gradient gain allocation strategy at every moment and improve the positioning accuracy of the model.
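The loss described above can be sketched in plain Python. This is a simplified, non-differentiable illustration: the box format (x1, y1, x2, y2), and the hyperparameter values α = 1.9 and δ = 3 (taken from the Wise-IoU paper's v3 formulation), are assumptions; the running mean of the IoU loss is passed in as an argument rather than tracked.

```python
import math

def iou(b1, b2):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def wise_iou(pred, gt, mean_l_iou, alpha=1.9, delta=3.0):
    """Wise-IoU loss for one anchor/target pair (illustrative sketch)."""
    l_iou = 1.0 - iou(pred, gt)
    # centers of the two boxes
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # dimensions of the smallest enclosing box (treated as detached constants)
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2) / (wg ** 2 + hg ** 2))
    # outlier degree beta and non-monotonic focusing coefficient r
    beta = l_iou / mean_l_iou
    r = beta / (delta * alpha ** (beta - delta))
    return r * r_wiou * l_iou
```

For a perfect prediction the outlier β is 0, so the gradient gain r and the whole loss vanish; anchors whose IoU loss sits near the running mean (β ≈ δ) receive the largest gain, which is the non-monotonic focusing behavior described above.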

Experiments

Experimental Platform
The experimental environment of this paper is shown in Table 1.

Evaluation Metrics
In this paper, precision, recall, F1 score, and mAP are selected to evaluate the performance of the model. A positive sample that is correctly predicted is counted as a true positive (TP), and a negative sample that is correctly predicted as a true negative (TN); a negative sample wrongly predicted as positive is a false positive (FP), and a positive sample wrongly predicted as negative is a false negative (FN). The recall, precision, and F1 score are calculated as follows:

Recall = TP / (TP + FN) (10)
Precision = TP / (TP + FP)
F1 = 2 × Precision × Recall / (Precision + Recall)

AP is the area under the precision-recall (PR) curve, obtained using different combinations of precision and recall points. mAP is the mean average precision over all categories. These metrics can be expressed as:

AP = ∫ P(R) dR over R ∈ [0, 1],  mAP = (1/N) Σ AP_i for i = 1, ..., N
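The metric definitions above can be sketched directly. The numbers below are made-up illustrations, and the rectangular integration of the PR curve is a simplification of the interpolated AP used by standard detection benchmarks.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the three point metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(pr_points):
    """Area under the PR curve from (recall, precision) points sorted by
    increasing recall, using simple rectangular integration."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in pr_points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
ap_class_a = average_precision([(0.5, 1.0), (1.0, 0.6)])
ap_class_b = 0.9                         # hypothetical AP for a second class
m_ap = (ap_class_a + ap_class_b) / 2     # mAP: mean AP over categories
```

With four classes, as in the URPC2020 setting, mAP would simply average four such per-class AP values.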

Experimental Results and Analysis
The results in this section are obtained experimentally on the URPC2020 dataset. The mislabeled images in this dataset are relabeled, the overly blurred images are filtered out, and the final experimental results are obtained on the optimized dataset.

Data Augmentation
Experiments were conducted using different data augmentation methods on the original structure of YOLOv7. From Table 2, the mAP of the trained model was only 64.59% when no data augmentation was used; it increased by 4.91% and 17.38% after training with mixup and mosaic, respectively, and by 21.08% when the two augmentation methods were used together. The experimental results show that both data augmentation methods help train the model well, and that using both greatly improves the detection accuracy of the model.
The model and attention mechanism were then combined by adding CBAM at different locations in YOLOv7: the Backbone, Neck, and Head parts of the network, respectively. Table 3 shows the experimental results. Adding CBAM to the network improved its recognition accuracy, with the best result of 86.68% obtained at the Neck; both precision and recall were higher than those of the original model. The results also show that CBAM does not help in all parts of the network. In the Head part, because the model is deeper, much of the underlying semantic information has already been lost, and it is difficult to apply further attention weighting to the few remaining features, so many metrics decreased. The best embedding results are obtained in the Neck part, where attentional weighting of feature maps of different dimensions is more effective at obtaining fine-grained semantic information. This helps the network grasp the detection target and thus yields the most significant effect.

Ablation Experiments
In order to verify the effectiveness of each improved method for underwater target detection, the effect of different modules on detection results is analyzed by ablation experiments. Among them, YOLOv7_A adds CBAM to the Neck, YOLOv7_B uses Conv2Former to improve the Neck, YOLOv7_C uses Wise-IoU, YOLOv7_D uses both CBAM and Wise-IoU, and YOLOv7_E uses both Conv2Former and Wise-IoU. Underwater-YCC is the underwater target detection method proposed in this paper.
From Table 4, we can see that each modular method improves the experimental results compared to the original YOLOv7, indicating that all of the reinforcement methods used in this paper are effective for underwater detection. (1) Analyzing the results of the three single methods in experiments (a-c) shows that each optimization method improves on YOLOv7; the addition of Conv2Former improves the mAP of the network by 0.85%, meaning that the Conv2Former module can capture the global information of the network well and retain semantic information. The introduction of CBAM gives the network the ability to acquire more valuable features for fusion. The 0.88% improvement from Wise-IoU means that this method allows the network to focus more on effective features and select better weights for images of different quality. (2) The results of experiments (d,e) show that combining Wise-IoU with CBAM and with Conv2Former improves mAP by 1.17% and 1.26%, respectively, compared to YOLOv7, indicating that this bounding box loss function remains effective after adding the optimization methods. (3) Summarizing the above optimization methods, this paper proposes the Underwater-YCC optimization algorithm, which adds CBAM while using Conv2Former for Neck feature fusion, and finally uses Wise-IoU for bounding box loss regression. This model improved the mAP by 1.49% compared to the original YOLOv7. The results show that the Underwater-YCC method can perform high-quality detection in complex underwater environments. Figure 18 depicts the test results of Underwater-YCC compared with YOLOv7: Figure 18a shows the detection results of YOLOv7 and Figure 18b those of Underwater-YCC. From the figures, we can see that our proposed model detects more targets than the original model and performs better in complex underwater environments.

Target Detection Network Comparison Experiment Results
Table 5 compares the results of Underwater-YCC with classical target detection algorithms such as Faster-RCNN [28], YOLOv3, YOLOv5s, YOLOv6 [29], and YOLOv7-Tiny. The results show that although the detection time increases slightly due to the more complex structure of the model, Underwater-YCC has higher detection accuracy and is more adaptable to the complex underwater environment.

Conclusions
In this study, we addressed the challenges of false and missed detections caused by blurred underwater images and the small size of underwater creatures. To tackle these issues, we proposed an underwater target detection algorithm called Underwater-YCC based on YOLOv7. We tested our algorithm on the URPC2020 dataset, which includes underwater images of the echinus, holothurian, scallop, and starfish categories.
Our proposed algorithm leverages several techniques to improve detection accuracy. Firstly, we reorganized and relabeled the dataset to better suit our needs. Secondly, we embedded the attention mechanism in the Neck part of YOLOv7 to improve the detection ability of the model. Thirdly, we used Conv2Former to enable the network to obtain more valuable features and fuse them efficiently. Lastly, we used Wise-IoU for the bounding box regression calculation to effectively avoid the drawbacks caused by the large gap in sample quality.
Experimental results demonstrate that the Underwater-YCC algorithm achieves improved detection accuracy on the same dataset. Our approach also exhibits robustness in the presence of blurring and color bias. However, there is still ample room for improving the whole network structure, and the real-time and lightweight aspects of underwater target detection technology need further study. The proposed algorithm is promising and may serve as a starting point for future research in the field of underwater target detection.

Figure 3.
Figure 3. Use of mosaic enhancement images during training.

Figure 5.
Figure 5. Structure of the channel attention.

Figure 6.
Figure 6. Structure of the spatial attention.

Figure 7.
Figure 7. The network architecture diagram of YOLOv7. The official code divides the structure of YOLOv7 into two parts: Backbone and Head. We separated the middle feature fusion layer into a Neck to make it easier to examine the influence of the attention mechanism on detection results at different locations.

Figure 8.
Figure 8. The architecture of Conv1 and Conv2. The Conv1 convolution kernel size is 3 and stride is 1; the Conv2 convolution kernel size is 3 and stride is 2.

Figure 9.
Figure 9. The architecture of ELAN.

Figure 10.
Figure 10. The architecture of D-MP. The D-MP module divides the input into two parts. The first branch is spatially downsampled by MaxPool, after which the channels are compressed by a 1 × 1 convolution module. The other branch compresses the channels first and then performs a downsampling operation using Conv2. Finally, the results of the two branches are superimposed. The module has the same number of input and output channels while the spatial resolution is halved.

Figure 12.
Figure 12. The architecture of ELAN-F.

Figure 15.
Figure 15. Left: Incorporate an attention mechanism in the Backbone. Middle: Incorporate an attention mechanism in the Neck. Right: Incorporate an attention mechanism in the Head.

Table 1.
Experimental environment and parameters.

Table 5.
Comparison with classical target detection algorithms.