Underwater Target Detection Algorithm Based on Feature Fusion Enhancement

Underwater robots that use optical images for dynamic target detection often encounter image blurring, poor contrast, and indistinct target features. As a result, their detection performance is poor and the rate of missed detections is high. To overcome these issues, this paper proposes a feature-enhanced algorithm for underwater target detection. Based on YOLOv7, a feature enhancement module utilizing a triple-attention mechanism is developed to improve the network's feature extraction ability without increasing the computational cost or the number of parameters. Moreover, considering the impact of redundant image features on detection accuracy, the ASPPCSPC structure is built. A parallel spatial convolutional pooling structure based on the original feature pyramid fusion structure, SPPCSPC, is introduced. The GhostNet network is used to optimize its convolution module, which reduces the model's parameter quantity and optimizes the feature map. Furthermore, a Cat-BiFPN structure is designed to address the loss of fine-grained information in YOLOv7 feature fusion by adopting a weighted nonlinear fusion strategy that enhances the algorithm's adaptability. In validation on the UPRC offshore dataset, the algorithm's detection accuracy increased by 2.9% and the recall rate improved by 2.3% compared to the original YOLOv7 algorithm. In addition, the model's parameter quantity is reduced by 11.2%, and the model size is compressed by 10.9%. The experimental results confirm the validity of the proposed algorithm.


Introduction
Perceiving environmental information underwater is essential for ensuring the efficacy of marine fisheries [1,2], marine resource exploration [3,4], geological surveys [5], and underwater robot operations [6][7][8]. The visual perception of underwater robots, which is commonly incorporated in different kinds of detection and operation robots due to its low cost, rich perceptual information, and high real-time performance, is critical for achieving this goal. However, the dynamic motion of the robot and the interference of the water medium can cause underwater images to become blurry, and the targets being detected may show low contrast against their backgrounds as well as color deviation [9][10][11][12][13]. These factors limit the detection accuracy and efficiency of underwater robots, thereby compromising the reliability of the application system.
Traditional target detection methods usually rely on manually designed feature extractors to obtain feature vectors, which are subsequently used to recognize targets via machine learning classifiers [14,15]. Because the external features of targets presented underwater are greatly affected by the environment, traditional detection methods lead to poor operational accuracy and adaptability of robot systems [16,17]. Deep learning algorithms exhibit strong adaptability in selecting and extracting features, resulting in more accurate detection results. Advances in deep learning feature extraction networks and the growing amount of training data have steadily improved the ability of deep learning algorithms to attend to the inherent abstract features of targets in image object detection. This has made deep learning a research hotspot in recent years.
Deep learning target detection algorithms fall into one-stage and two-stage structural models. Specifically, the two-stage detection model processes image feature extraction and network detection recognition separately, leading to high detection accuracy. Nonetheless, the network's complex procedures and high redundancy increase the algorithm's computational cost and hinder its real-time applicability, which poses a challenge for practical applications [18,19]. In contrast, the single-stage network [20,21] employs an end-to-end structure for network design, thereby unifying detection and classification in a single regression process. This unified design simplifies the system's process, reduces the algorithm's complexity, and enhances the system's efficiency. Consequently, the one-stage network has evolved rapidly in recent years.
YOLO series algorithms [22][23][24][25] are typical examples of single-stage target detection models. They accomplish accurate detection and recognition of various targets by back-propagating objective functions, such as the deviation of the estimated center point position, the intersection-over-union ratio between prediction boxes and real boxes, and classification errors. Notably, the YOLO series is known for its fast detection speed. To maintain the balance between detection speed and accuracy, researchers often use various techniques, such as feature fusion [26,27], path aggregation methods [28], and other approaches [29], to optimize the feature extraction network and enhance its capabilities. Furthermore, researchers lower the model parameter quantity and improve real-time performance by adopting approaches such as detection head decoupling [30], convolution transformation [31], and knowledge distillation [32,33] for lightweight network structures.
Underwater images commonly exhibit characteristics such as low contrast and blurring. Conventional deep learning target detection algorithms that rely on convolutional neural networks primarily acquire the highest-level features for recognition and positioning through down-sampling. However, this approach not only fails to detect small targets but also disregards the detail and texture features of the target object. Therefore, various optimized target detection networks employ the FPN structure [34], which combines deep and shallow network features. FPN has a straightforward framework, and it is effective in small target detection. However, the fusion is limited to unidirectional information flow, leading to a low degree of fusion and inadequate information utilization. Consequently, FPN performs poorly in detecting underwater targets. To address this shortcoming, some studies [35] combine FPN with residual multiscale M-Resnet networks.
Other studies [28] utilize the bypass method to enrich feature fusion points in the path aggregation network, so that more feature information is extracted. While the abovementioned network optimization methods enhance feature extraction, feature extraction still lacks flexibility. Reference [36] proposed a bidirectional feature pyramid structure that utilizes multiple weighted fusion approaches to improve network fusion performance. Reference [37] introduced the adaptive spatial feature fusion structure (ASFF), which augments the depth of feature fusion by transmitting each layer's information to other layers.
YOLOv7 is an advanced object detection algorithm that was proposed in 2022 after integrating various network optimization methods [26]. This model leverages feature fusion and path aggregation techniques to achieve high efficiency in both the feature extraction and neck networks, giving it some of the best results among object detection algorithms. Nevertheless, for underwater images with low contrast and blur, this model still makes limited use of redundant information and extracts and applies detail features insufficiently. Moreover, the network's parameter quantity is relatively large, implying that it may not be the best fit for underwater robot deployment.
This article proposes a solution for the aforementioned problems of YOLOv7. A triple attention mechanism is incorporated into YOLOv7's ELAN feature fusion module to enhance the network's feature extraction ability. Additionally, a pyramid fusion structure with several parallel dilated convolution modules is employed to extend the network's receptive field. To optimize network parameters and reduce structural complexity, the GhostNet network is implemented. Finally, a weighted fusion structure is used to enhance the fusion method and increase network robustness.
This paper makes the following major contributions:
(1) This paper introduces a new object detection algorithm for blurry underwater images, utilizing a feature enhancement approach. Compared to YOLOv7, this algorithm provides a lighter-weight network with higher detection accuracy and is better suited for deployment on underwater robots, especially in turbid underwater environments.
(2) To enhance the network's feature extraction capabilities, the feature-enhanced fusion module integrates the triple attention mechanism. The process of its fusion is described in detail. Additionally, this module can be applied as a component in other networks.
(3) Multiple parallel dilated convolutional layers with varying sampling rates, deeply fused with GhostNet, are utilized in the network. The resolution of the feature map is not reduced, and the technique increases the network's receptive field and information redundancy. As a result, the improved utilization of redundant information enhances the detection accuracy.

The YOLOv7 Algorithm for Target Detection
Figure 1 depicts the overall structure of the YOLOv7 object detection algorithm. In the model design process, weight parameters are incorporated into the network framework, and the state-of-the-art ELAN feature fusion module is utilized to improve the network's capability to extract target feature information efficiently.
The network architecture comprises three modules: the backbone, neck, and head networks. Based on the network models of YOLOv4 and YOLOv5, the backbone network integrates the efficient layer aggregation network (ELAN) and the max pooling (MP) down-sampling module, which merges max pooling and convolution layers. The ELAN module enhances the network's capability to learn features and improves its robustness by regulating the shortest and longest gradient paths in the network. The MP down-sampling module combines the outputs of two different down-sampling structures to achieve super down-sampling, which enhances the network's receptive field and the degree of nonlinear feature transformation, making it more effective in detecting fine-grained features. In the neck network of YOLOv7, the feature maps from the backbone network and the path aggregation network (PANet) are concatenated to produce the fused feature map output. Finally, the network utilizes RepConv and coarse-to-fine guided labeling to establish the match between images and labels based on the three sets of outputs obtained.
YOLOv7 substantially increased the speed and accuracy of conventional image object detection by systematically optimizing the network for feature extraction and adopting a training strategy with an auxiliary head. However, the algorithm's effectiveness in underwater object detection, which is influenced by factors such as image blur, color bias, and turbid water quality, is still limited. The algorithm does not account for the effects of deep feature enhancement, feature redundancy, and adaptive super down-sampling on object detection, which hinders its ability to meet the requirements of practical application scenarios.

The Overall Architecture of the Network
The YOLOv7 network uses an ELAN module with an up-and-down two-branch structure to combine output feature information from deep convolution layers of varying scales, leading to enhanced feature capabilities and effective object detection at different scales. However, underwater robotic inspections often encounter blurred images due to water quality and dynamic motion, leading to missed or false detections. To address this challenge, this study proposes a triple attention mechanism that enhances the network's anti-interference ability against underwater image noise and improves its ability to extract fine-grained image features. Moreover, a multichannel parallel atrous convolution module replaces the original max pooling pyramid structure, using convolutions with different expansion rates to extend the network's receptive field and improve its ability to extract contextual information. Lastly, an optimized weighted pyramid-merging structure is implemented to improve the network's adaptiveness in extracting features for different scenarios.
The overall network architecture is depicted in Figure 2. The optimized network structure consists of the input end, the backbone network, the neck network, and the prediction network. Within the backbone network, the input image is first subjected to two simple down-sampling operations, followed by feature extraction using the efficient layer aggregation network with a triple attention module (ELAN-TA). Three down-sampling modules, which merge max pooling and convolutional layers with the ELAN-TA module, further extract feature information. Within the neck network, a multi-rate atrous convolution pooling structure processes the extracted feature information, and the output is concatenated with the output of the upper down-sampling module from the backbone network to obtain feature maps with sizes of 20 × 20, 40 × 40, and 80 × 80. The last up-sampled output is then dimensionally concatenated with the output of the previous up-sampled module after passing through a down-sampling module. Finally, the RepConv technique is applied to the three differently sized feature maps before they are sent to the prediction network for the desired output.
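The three feature-map sizes quoted above follow directly from the cumulative down-sampling stride at each output scale. The sketch below is only illustrative: the 640 × 640 input resolution and the strides of 8, 16, and 32 are assumptions taken from the usual YOLOv7 configuration and are not stated in the text.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Side length of each square feature map for the given strides.

    Both defaults are assumptions (typical YOLOv7 settings), not values
    given in the paper.
    """
    return [input_size // s for s in strides]


print(feature_map_sizes())  # [80, 40, 20]
```

Under these assumptions, the 80 × 80 map is the finest scale and is the one most relevant to small targets, while the 20 × 20 map covers the largest receptive field.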
In the current research, the YOLOv7 backbone network is implemented as the primary feature extraction network in the overall architecture. The down-sampling module is composed of convolutional layers and max pooling layers, and the weighted bidirectional feature pyramid structure is used to fuse the features. Subsequently, three attention mechanisms are added after the ELAN structure to obtain feature maps with varying resolutions. These feature maps are matched with the output channels via the neck network structure, and each output branch predicts targets of a different size. To enhance the detection capability by capturing global and local information more effectively, the original max pooling pyramid module is replaced by the multi-rate atrous convolution pooling structure. This new structure employs parallel atrous convolutional layers with sampling rates of 1, 6, 12, and 18, followed by dimension concatenation.
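To make the idea of weighted feature fusion concrete, the following minimal sketch uses the fast normalized fusion rule of the original BiFPN design, in which each input receives a non-negative learnable weight and the weights are normalized by their sum. This is an assumption made for illustration: the exact weighting used by the Cat-BiFPN structure in this paper is not specified in this section and may differ. The function name and the flat-list feature representation are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized feature vectors with normalized non-negative weights.

    features: list of flat lists of floats, all of the same length
    weights:  one learnable scalar per feature (assumed already trained)
    """
    # ReLU keeps every weight non-negative so the normalization stays stable
    w = [max(0.0, wi) for wi in weights]
    total = sum(w) + eps
    length = len(features[0])
    return [
        sum(wi * f[i] for wi, f in zip(w, features)) / total
        for i in range(length)
    ]


shallow = [1.0, 2.0]
deep = [3.0, 4.0]
print(fast_normalized_fusion([shallow, deep], [1.0, 1.0]))
```

With equal weights the result is close to the plain average of the inputs; during training the weights shift toward whichever scale is more informative.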

ELAN-TA Module
The ELAN module significantly improves feature fusion in the YOLOv7 network, and it is extensively utilized in both the backbone and neck networks. The module's internal structure is illustrated in Figure 3.
In the ELAN structure of the backbone network, the feature information is partitioned into two branches. The first branch undergoes 1 × 1 convolutional processing and is then merged with the output of the other branch, which contains three 3 × 3 convolutional layers. The merged feature information is then output to the subsequent network structure. Within the neck network, the ELAN module is modified to produce the ELAN-W structure, which increases the initial three convolutional layer outputs to five. This modification provides more information for feature fusion.
The ELAN structure significantly enhances the network's ability to merge features across diverse scales. However, noise interference and image blurring, in addition to variations in target size, restrict the algorithm's application in underwater scenarios. To address this issue, this paper introduces the triple attention (TA) [38] module based on the ELAN structure, which enhances the network performance without increasing computational complexity or the number of parameters. A detailed module structure is provided in Figure 4.
The TA module does not rely on learnable parameters to model channel and spatial attention. It builds dependency relationships between dimensions using rotation operations and residual transformations while encoding channel and spatial information at a minimal computational cost. This parameter-free attention module, which is both efficient and accurate, allows the model to estimate global and long-term dependency relationships in images, aiding it in predicting smaller and more recognizable targets. Additionally, it acquires global contextual information, captures more features, and expands the boundaries between different categories.
Figure 5 demonstrates the internal structure of TA. The triple attention module comprises three parallel branches. The top branch computes the attention weight across the channel dimension C and the spatial dimension W, while the bottom branch establishes the spatial correlation between H and W. In the first two branches, the connection between the channel dimension and a spatial dimension is established using a rotation operation, and the final output is a simple average of the outputs of all three branches.
In each of these branches, the Z-pooling layer, Z-Pool, reduces the 0th dimension of the tensor to two by concatenating the average-pooled and max-pooled features across that dimension. This allows the layer to maintain a comprehensive representation of the tensor while decreasing its depth, thereby minimizing the computational load. Equation (1) expresses this process.
$$\text{Z-Pool}(X) = \left[\mathrm{MaxPool}_{0d}(X),\ \mathrm{AvgPool}_{0d}(X)\right] \tag{1}$$
Here, 0th refers to the 0th dimension, across which the average and maximum pooling operations take place. For instance, a tensor of size (C × H × W) produces a final tensor of size (2 × H × W).
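The Z-Pool operation can be illustrated with a few lines of NumPy. This is a minimal sketch rather than the authors' implementation; the function name `z_pool` and the toy tensor are chosen here for illustration only.

```python
import numpy as np


def z_pool(x):
    """Stack the max and the mean taken over axis 0: (C, H, W) -> (2, H, W)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)


# Toy tensor with C=4, H=2, W=3
x = np.arange(24, dtype=np.float64).reshape(4, 2, 3)
pooled = z_pool(x)
print(pooled.shape)  # (2, 2, 3)
```

Whatever the size of the 0th dimension, the output always has depth two, which is why the convolution that follows stays cheap.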
Given an input tensor X ∈ R(C × H × W), the first branch rotates X 90° counterclockwise along the H axis to obtain X1, reduces it to (2 × H × C) through a Z-Pool layer, and applies a standard convolution layer with a kernel of size 7 × 7. The result is passed through a Sigmoid activation function to obtain the attention weights, which are applied to X1. Finally, the weighted tensor is rotated 90° clockwise along the H axis to restore the original shape of the input. In the second branch, the tensor X is rotated 90° counterclockwise along the W axis to obtain X2, which is reduced to (2 × C × W) through a Z-Pool layer. The reduced tensor is then convolved with a kernel of size k × k and passed through a Sigmoid activation function to obtain the weights, which are applied to X2. Finally, the weighted tensor is rotated 90° clockwise along the W axis to restore the shape of the input. The last branch reduces the channels of the input tensor X to two through Z-Pool, giving X3 of shape (2 × H × W), which is passed through a convolution layer with a k × k kernel and the Sigmoid activation function; the resulting weights are applied to X. Subsequently, the refined tensors generated by the three branches, each of shape (C × H × W), are aggregated through simple averaging.
The refined attention tensor y obtained by applying triple attention to the input tensor X ∈ R(C × H × W) can be expressed using Equation (2):
$$y = \frac{1}{3}\left(\overline{X_1\,\sigma\big(\psi_1(\text{Z-Pool}(X_1))\big)} + \overline{X_2\,\sigma\big(\psi_2(\text{Z-Pool}(X_2))\big)} + X\,\sigma\big(\psi_3(X_3)\big)\right) \tag{2}$$
Here, σ represents the Sigmoid activation function, and ψ1, ψ2, and ψ3 represent the standard two-dimensional convolutions with kernel size k in the three branches of triple attention. In other words, y can be expressed as follows in Equation (3):
$$y = \frac{1}{3}\left(\overline{X_1\,\omega_1} + \overline{X_2\,\omega_2} + X\,\omega_3\right) = \frac{1}{3}\left(\overline{y_1} + \overline{y_2} + y_3\right) \tag{3}$$
where ω1, ω2, and ω3 represent the three cross-dimensional attention weights calculated during triple attention, and the overlines on y1 and y2 denote the 90° clockwise rotation that restores the original input shape of (C × H × W).
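The three-branch computation can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each 90° rotation is modelled as an axis permutation, all three branches use k = 7, the convolution kernels are random placeholders rather than trained weights, and bias and normalization layers are omitted. All function names are hypothetical.

```python
import numpy as np


def z_pool(x):
    """(D, A, B) -> (2, A, B): max and mean over the 0th dimension."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_same(x, kernel):
    """Single-output 2D convolution with zero padding that keeps the size.

    x: (2, A, B), kernel: (2, k, k) with odd k -> output (A, B).
    """
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    a, b = x.shape[1:]
    out = np.empty((a, b))
    for i in range(a):
        for j in range(b):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out


def attention_branch(x, kernel):
    """Z-Pool -> k x k convolution -> Sigmoid, then gate x with the weights."""
    weight = sigmoid(conv2d_same(z_pool(x), kernel))
    return x * weight[None, :, :]


def triple_attention(x, kernels):
    """x: (C, H, W); kernels: three arrays of shape (2, k, k)."""
    # Branch 1: permute to (W, H, C) so the pooled map has shape (2, H, C)
    x1 = np.transpose(x, (2, 1, 0))
    y1 = np.transpose(attention_branch(x1, kernels[0]), (2, 1, 0))
    # Branch 2: permute to (H, C, W) so the pooled map has shape (2, C, W)
    x2 = np.transpose(x, (1, 0, 2))
    y2 = np.transpose(attention_branch(x2, kernels[1]), (1, 0, 2))
    # Branch 3: no rotation; the pooled map has shape (2, H, W)
    y3 = attention_branch(x, kernels[2])
    # Simple average of the three refined tensors
    return (y1 + y2 + y3) / 3.0


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 6))  # C=4, H=5, W=6
kernels = [0.1 * rng.standard_normal((2, 7, 7)) for _ in range(3)]
print(triple_attention(x, kernels).shape)  # (4, 5, 6)
```

Because each permutation is undone before averaging, the output always has the same shape as the input, so the block can be placed after an existing module without changing the surrounding layer dimensions.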
The TA module's essence lies in constructing a fusion attention mechanism that spans across channels and space, considering the feature enhancement and noise suppression of various feature channels for different points of interest in the image.This leads to enhanced discrimination between target and background regions based on spatial variations, which helps in extracting features of blurry underwater objects more accurately.

ASPPCSPC Module
The SPPCSPC module of YOLOv7 incorporates a parallel structure based on SPP to increase the network's receptive field and enhance the object detection accuracy for different scales, while also reducing computation by 50%.
The structure of the YOLOv7 SPPCSPC module is shown in Figure 6.
YOLOv7 SPPCSPC module structure is shown in Figure 6.The first branch in the aforementioned network structure processes five, nine teen, and one different objects using four different receptive fields obtained through Max-pool layers after three successive convolutional operations.Despite down-sam The first branch in the aforementioned network structure processes five, nine, thirteen, and one different objects using four different receptive fields obtained through three Maxpool layers after three successive convolutional operations.Despite down-sampling with minimal computation, the Max-pool operation cannot consider global information or extract redundant feature information effectively.Down-sampling can cause the loss of crucial positional information for small objects that are essential for accurate localization.Loss of location and boundary details during the encoding phase makes it challenging to recover them during decoding, which adversely affects small object localization and detection.
This paper addresses the aforementioned issue by using multiple parallel atrous convolutional layers with different sampling rates to replace the original Max-pool layers. This approach enhances the network's receptive field and information redundancy without decreasing the feature map resolution.
The network structure is shown in Figure 7.
The network structure includes the CBS module, following the SPPCSPC of YOLOv7. After three convolutional layers, the module extracts feature information at four different scales by using atrous convolutions with 3 × 3 kernels and dilation rates of 1, 6, 12, and 18. The feature information is then fused using the Cat concatenation module. Although this method is more demanding than max-pooling, it offers stronger feature extraction capabilities across dimensions and better reflects the target's deep-level feature information.
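The claim that atrous branches enlarge the receptive field without reducing resolution can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the feature-map size of 20 and the choice of padding equal to the dilation rate are assumptions, not settings stated in this paper.

```python
def conv_output_size(size, kernel=3, dilation=1, padding=0, stride=1):
    """Output length along one spatial axis for a standard 2-D convolution or pooling."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def atrous_branch_sizes(size, dilations=(1, 6, 12, 18), kernel=3):
    """Output size of each parallel 3 x 3 atrous branch.

    Assumption: padding is set equal to the dilation rate, which keeps a
    stride-1 3 x 3 atrous convolution size-preserving.
    """
    return [conv_output_size(size, kernel, d, padding=d) for d in dilations]


feature_map = 20  # hypothetical feature-map height/width
branch_sizes = atrous_branch_sizes(feature_map)

# For contrast, a stride-2 2 x 2 pooling step halves the resolution.
pooled_size = conv_output_size(feature_map, kernel=2, stride=2)
```

Under these assumptions every branch returns a map of the same size as its input, so the four outputs can be concatenated directly.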

To improve the overall detection speed of the network and manage the model's computational complexity, this study incorporates GhostNet [39] to optimize the ASPPCSPC module. GhostNet, a lightweight neural network structure, uses a sequence of inexpensive linear transformations to extract additional Ghost feature maps from the original features at a low cost.
The GhostNet module consists of three primary convolution components:
(1) The ordinary convolution operation compresses the input feature dimension from c to m.
(2) Layer-by-layer (depthwise) convolution is employed to extend the m-dimensional features to m × s = n dimensions, where s is the expansion factor of the inexpensive operation. The grouping convolution operation is utilized to obtain the partial feature map at the bottom of the output feature map.
(3) The channels obtained from the first convolutional part and the second group convolutional part are combined through feature mapping. The redundant feature maps of underwater organisms are generated through layer-by-layer convolution, and feature concatenation is used to produce the Ghost feature map output.
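The parameter saving of the three-step Ghost module described above can be estimated with simple arithmetic. The following is a minimal sketch under stated assumptions: bias terms are ignored, the inexpensive operation is a depthwise convolution with a d × d kernel, and the channel counts are hypothetical rather than taken from this paper.

```python
def conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, n_out, k=1, s=2, d=3):
    """Approximate weights of a Ghost module producing n_out = m * s channels.

    Assumptions: n_out is divisible by s, and each of the (s - 1) cheap
    operations per intrinsic map is one d x d depthwise filter.
    """
    m = n_out // s                      # intrinsic maps from the ordinary convolution
    primary = conv_params(c_in, m, k)   # step (1): compress c -> m
    cheap = m * (s - 1) * d * d         # step (2): depthwise expansion to m * s
    return primary + cheap              # step (3) only concatenates, adding no weights


ordinary = conv_params(256, 256, 1)       # hypothetical 1 x 1 layer
ghost = ghost_module_params(256, 256)     # same input/output channels via Ghost
compression = ordinary / ghost
```

With these hypothetical channel counts the Ghost variant needs roughly half the weights, which is consistent with the role of s as the expansion factor.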
The bottleneck structure comprising Ghost modules falls into two cases, depicted in (a) and (b) of Figure 8. If the convolution stride is 1, the module contains two stacked Ghost modules. If the convolution stride is 2, the image feature information is first extracted by a Ghost module. Next, the width and height of the feature layer are compressed using a depthwise separable convolution with a stride of 2. Finally, the feature information is extracted again by another Ghost module.
Regular convolutions often produce output feature maps with redundant information and a large number of network parameters. Therefore, the Ghost Bottlenecks structure is introduced as a bottleneck layer module to reduce network parameters and eliminate most of the redundant information without compromising accuracy. This structure can greatly reduce computational complexity and improve the real-time performance of the algorithm. To enhance the detection efficiency for blurred underwater objects, this study constructs a Ghost Module-based neck network on top of ASPPCSPC. Figure 9 displays the structure.

To reduce the number of model parameters, a 1 × 1 convolution first changes the number of input channels from c to c/2. Afterward, the Ghost Bottlenecks structure and a 1 × 1 convolution further reduce the model parameters. The second step uses dilated convolution to capture context information of features at different scales. The dilated convolution kernel used in this study is 3 × 3, with branch dilation rates of 1, 6, 12, and 18 from top to bottom, which introduces no additional parameters. After the feature maps from the branches with different dilation rates are concatenated, a 1 × 1 convolution and a Ghost Bottlenecks module are skip-connected to the input layer following a 1 × 1 convolution, thereby optimizing the multibranch dilated convolution module. Finally, a 1 × 1 convolution restores the number of channels to c.
The actual size of the convolution kernel and the receptive field size following dilated convolution are given by:
$$k' = k + (k - 1)(r - 1)$$
$$R_n = R_{n-1} + (k' - 1)\prod_{i=1}^{n-1} s_i$$
Here, k′ represents the actual size of the convolution kernel, which also determines the receptive field size following convolution. The original convolution kernel size is denoted by k and has been set to 3 × 3 in this paper. The expansion rate is represented by r, where r = 1, 2, 3, ..., n, and R_n refers to the receptive field for each pixel in the nth layer. Furthermore, s_i represents the step length of the ith layer, with this paper assigning a value of […].
In general, the refined pyramid module addresses the loss of small-target information caused by the original max-pooling feature pyramid module. The introduction of the Ghost Bottleneck module decreases the number of network parameters while removing redundancies, thereby improving the efficiency of network detection.
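As a worked example of the kernel-size and receptive-field relations described above, the sketch below evaluates them for the four dilation rates used in this paper. It assumes the standard forms k' = k + (k - 1)(r - 1) and R_n = R_(n-1) + (k' - 1) × (product of earlier strides), with R_0 = 1; the two-layer stack at the end is a hypothetical configuration chosen only for illustration.

```python
def effective_kernel(k, r):
    """Actual kernel size of a k x k convolution with dilation rate r."""
    return k + (k - 1) * (r - 1)


def receptive_field(layers):
    """Receptive field after a stack of layers given as (k, r, s) tuples.

    Assumes R_0 = 1 and the usual recursion in which each layer adds
    (k' - 1) times the product of the strides of all earlier layers.
    """
    rf, stride_product = 1, 1
    for k, r, s in layers:
        rf += (effective_kernel(k, r) - 1) * stride_product
        stride_product *= s
    return rf


# Effective kernel sizes for the dilation rates 1, 6, 12, and 18 with k = 3.
kernels = [effective_kernel(3, r) for r in (1, 6, 12, 18)]

# Hypothetical stack: a plain 3 x 3 layer followed by a dilation-6 layer, both stride 1.
rf_two_layers = receptive_field([(3, 1, 1), (3, 6, 1)])
```

The effective kernels grow from 3 to 37 while each branch still holds only nine weights per channel pair, which is why the dilated branches add no parameters relative to a plain 3 × 3 convolution.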

Super Down-Sampling Module Optimization
Down-sampling is a crucial tool for deep learning networks to achieve feature abstraction, increase feature extraction capabilities, and improve generalization. It enables the network to capture global feature information with low-dimensional features, thus increasing the receptive field of each feature point. Max-pooling is the traditional down-sampling method: it takes the maximum value of the feature points within a neighborhood and effectively retains texture feature information while reducing the number of parameters and improving the network's generalization. However, it lacks selectivity in texture selection, and it has no learnable parameters. Thus, this method cannot adaptively adjust to different features during feature extraction.
YOLOv7 tackles this problem by using the MP structure instead of Max-pool for down-sampling. The corresponding MP structure is illustrated in Figure 10.
The MP Conv module comprises two branches whose outputs are concatenated. The first branch includes a max-pooling layer with stride 2, followed by a regular convolution with stride 1. The second branch starts with a standard convolution with stride 1, followed by a regular convolution with stride 2 to down-sample the data. Feature information from both branches is fused after down-sampling. Nonetheless, concatenation (Concat), a simple tensor operation, is insufficient for integrating feature information across contexts. Hence, a Cat-BiFPN structure is proposed to enhance the fusion of shallow network feature information using a weighted bidirectional feature pyramid network. This structure, illustrated in Figure 11, improves the connectivity between the lower and upper layers of the network and boosts overall detection performance via weighted coefficients.
Figure 12 presents the diagram and formula for feature fusion.
The first layer input is represented by X_in1, while the second layer input is represented by X_in2. ω1 and ω2 represent the learnable output weights of the first and second layers, respectively. Y′_out is obtained through weighted fusion using ω1 and ω2. Conv refers to the convolution operation, while Silu represents the activation function. After fusion, the resultant Y′_out is passed through a 1 × 1 convolution with the Silu activation function to reorganize existing features and form new features. In addition, a nonlinear reward is added to the previous layer of learning to produce the final output, Y_out, for this node.
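Since the fusion formula itself is not legible in the text above, the following is only a sketch of one common way to realize a learnable two-input weighted fusion: the fast normalized fusion used in BiFPN-style networks. The normalization by ω1 + ω2 + ε, the clamping of weights to non-negative values, and the value of ε are assumptions, and the 1 × 1 convolution is omitted so that the example stays self-contained.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def weighted_fusion(x_in1, x_in2, w1, w2, eps=1e-4):
    """Element-wise weighted fusion of two equally shaped feature vectors.

    Assumption: BiFPN-style fast normalized fusion, in which the weights
    are kept non-negative and normalized by their sum plus a small eps.
    """
    w1, w2 = max(w1, 0.0), max(w2, 0.0)
    norm = w1 + w2 + eps
    return [(w1 * a + w2 * b) / norm for a, b in zip(x_in1, x_in2)]


# Hypothetical flattened features from the two input layers.
fused = weighted_fusion([1.0, 2.0], [3.0, 4.0], w1=1.0, w2=1.0)

# Stand-in for the post-fusion step (activation only, convolution omitted).
activated = [silu(v) for v in fused]
```

With equal weights the fused value is close to the plain average of the two inputs; during training the weights would shift toward whichever input is more informative.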

Experimental Environment
The experimental environment used in this paper is Ubuntu 18.04, Python 3.7.12, and PyTorch 1.8.0. The relevant hardware configuration and model parameters are documented in Table 1, with 150 training epochs. The experimental image in Figure 13, which was obtained during the 2020 China Underwater Robot Contest, shows the detection of echinus, holothurian, starfish, and scallops. An underwater robot captured the image while performing dynamic inspections in the waters of Dalian. The image appears blurry, with noticeable color bias and low contrast. The targets also exhibit clear multiscale characteristics, which is typical of actual underwater robot inspections.
The dataset comprises a total of 5543 images. Because the original dataset contains some inaccurate annotations, all images were manually verified prior to the experiment, resulting in the removal of non-echinus and non-scallop tags. Additionally, mislabeled tags were corrected, and the XML-formatted tags were converted to the TXT format required for training. Finally, the dataset was randomly divided into training, validation, and test sets, consisting of 4397, 592, and 1200 images, respectively. The final 1200 images were obtained from the official dataset for evaluating the overall performance of the algorithm.
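The XML-to-TXT conversion mentioned above can be sketched as follows. This is an illustration under assumptions rather than the authors' script: it presumes Pascal VOC-style XML annotations and the YOLO TXT convention of one line per object (class index followed by normalized center x, center y, width, and height). The class order, the sample annotation, and the helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed class order; the real index mapping is not given in the text.
CLASSES = ["holothurian", "echinus", "scallop", "starfish"]


def voc_xml_to_yolo_lines(xml_text, classes=CLASSES):
    """Convert one VOC-style annotation into YOLO-format label lines."""
    root = ET.fromstring(xml_text)
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in classes:
            continue  # drop tags outside the target categories
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        cx = (xmin + xmax) / 2 / width
        cy = (ymin + ymax) / 2 / height
        w = (xmax - xmin) / width
        h = (ymax - ymin) / height
        lines.append(f"{classes.index(name)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical annotation: one valid target and one tag that should be discarded.
SAMPLE = (
    "<annotation><size><width>100</width><height>200</height></size>"
    "<object><name>echinus</name><bndbox><xmin>10</xmin><ymin>20</ymin>"
    "<xmax>30</xmax><ymax>60</ymax></bndbox></object>"
    "<object><name>unknown</name><bndbox><xmin>0</xmin><ymin>0</ymin>"
    "<xmax>5</xmax><ymax>5</ymax></bndbox></object></annotation>"
)
labels = voc_xml_to_yolo_lines(SAMPLE)
```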

Evaluating Indicator
The detection performance of the underwater image object detection model is assessed using two metrics: average precision (AP) and mean average precision (mAP). Average precision is a performance measure for single-class detection, obtained by computing the recall and precision metrics.
The formula for calculating the recall rate is shown below:
$$\text{Recall} = \frac{T_P}{T_P + F_N}$$
The formula for calculating precision is presented below:
$$\text{Precision} = \frac{T_P}{T_P + F_P}$$
Here, T_P represents a true positive, which means a positive sample that is correctly identified as positive. F_N represents a false negative, which means a positive sample that is incorrectly identified as negative. F_P represents a false positive, which means a negative sample that is incorrectly identified as positive.
The precision-recall curve is plotted by taking the maximum precision corresponding to each recall rate as the y-axis, and the area under the curve is measured as the AP value. The next step is to obtain the AP values of the individual categories. Averaging these values over all categories gives the mean average precision (mAP). The formula for calculating mAP is as follows:
$$\text{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
In this equation, N denotes the number of object categories to be detected in the dataset. In the current experiment, N is equal to 4, corresponding to echinus, holothurian, starfish, and scallop.
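The metrics described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the experiments: AP is computed here as the area under the precision-recall curve after replacing each precision with the maximum precision at any equal or higher recall, and the function names and sample values are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under the PR curve; recalls must be sorted in ascending order.

    Each precision is first replaced by the maximum precision observed at
    any equal or higher recall (the max-precision envelope).
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values (N = number of categories)."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical values for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
m_ap = mean_average_precision({"echinus": 0.5, "starfish": 1.0})
```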

Network Training
To assess the proposed algorithm's superiority, two confusion matrices from the model verification are plotted in Figure 14. Each confusion matrix is a 5 × 5 matrix covering the four categories (holothurian, echinus, scallop, and starfish), where the rows and columns depict the ground truth values and the predicted results of the network, respectively. Additionally, the background region is included as a category to evaluate the model's performance.
As seen in the figure, the YOLOv7 algorithm achieves a recognition rate of 77% for echinus and 74% for starfish. However, it does not provide satisfactory recognition for holothurian and scallop. In comparison, this paper's algorithm improves the recognition rates of holothurian and scallop by 6% and 1%, respectively, while maintaining effective detection of echinus and starfish. Notably, the missed detections and false alarms of the model are captured in the last row and column of the confusion matrix, respectively. The figure indicates that false alarms are more severe for echinus, mainly because numerous stacked rocks in the underwater environment produce black blocks where the rocks overlap. This is particularly problematic because echinus themselves are black.

Ablation Experiment
The effects of the different network modules proposed in this study on overall detection accuracy were evaluated through ablation experiments, in which YOLOv7 was used as the base network. Specifically, the TA, CAT-BiFPN, ASPPCSPC, and G-ASPPCSPC structures were examined, as illustrated in Figure 15.

The average precision (AP) of the network is determined by the area under the precision-recall (PR) curve. We initially trained the standard YOLOv7 network with an input image size of 640 × 640 pixels, as shown in Figure 15a. Then, we integrated the ELAN-TA triplet attention module into the YOLOv7 network, as shown in Figure 15b, for training and testing. The results indicate that our attention module improved the recall range while maintaining a certain level of detection accuracy for holothurian images, leading to a slightly larger area under the PR curve compared to the original network. Next, we introduced the MP-CB module with the CAT-BiFPN architecture into the YOLOv7 network, as shown in Figure 15c. We observed that the MP-CB module slightly improved the detection accuracy for holothurian images, which also resulted in a larger area under the PR curve. Moreover, we combined the ELAN-TA and MP-CB modules, and the network trained with both modules demonstrated improved detection accuracy for all four objects, particularly for holothurians. Additionally, we incorporated the ASPPCSPC module and the G-ASPPCSPC module, which includes Ghost optimization, into the network, as shown in Figure 15d,e. Both modules improved the performance of the network for the different monitoring objects, with ASPPCSPC exhibiting slightly better detection accuracy.
The previously discussed modules were compared quantitatively using evaluation indicators, and the experimental results are presented in Table 2. In the table, the standard YOLOv7 serves as the baseline network without the improved modules ELAN-TA, MP-CB, ASPPCSPC, and G-ASPPCSPC. Subsequently, we integrated these modules individually to compare their recall rate and mean average precision (mAP), as well as the number of parameters and the model size. The recall rate is a crucial measure of the system's capability to avoid missed targets, which is essential for an algorithm that must detect all real targets. A higher recall rate leads to a lower missed detection rate. By contrast, mAP denotes the detection and recognition accuracy of the algorithm, with a higher value indicating better detection and recognition performance. Furthermore, the number of parameters and the model size are significant indicators of how lightweight the algorithm is. With fewer parameters and a smaller model size, algorithm deployment is more convenient, and the running speed is faster.
When comparing YOLOv7 with the optimized networks that integrate different modules, the ELAN-TA and MP-CB modules have parameter counts and model sizes similar to those of the original network, while improving mAP by 1.6% and 1.4%, respectively, and the recall rate by 0.8% and 1.3%, respectively. Adding both modules simultaneously increases mAP by 2.1%, suggesting that the two modules complement each other and have a more pronounced combined impact on the network's performance. However, the recall rate declines in this case, because ELAN-TA and MP-CB together increase feature extraction for inconspicuous objects, which causes variations in features. In addition, insufficient data cause a slight recall rate decline among all features in this category. The ASPPCSPC module improves the network by enlarging the feature receptive field and adding information redundancy. With the ELAN-TA and MP-CB modules as the basis, adding the ASPPCSPC module yields an additional 1.4% increase in mAP and a 3.2% increase in recall rate, but it also increases the number of parameters and the model size required by the network by 16.5% and 16.3%, respectively, which hinders robot deployment. By contrast, adopting the G-ASPPCSPC network, optimized through the Ghost network, leads to a modest improvement of 0.8% in mAP and 3.4% in recall rate compared to the ASPPCSPC module. Although this increase is less significant than that of the ASPPCSPC module, the number of parameters is reduced to 32.4%, approximately 11% lower than the ELAN-TA and MP-CB network, and the model size has decreased to 10.8% compared to the original size. Compared with the ASPPCSPC module, the G-ASPPCSPC module reduces the number of parameters by approximately 26% and the model size by approximately 25.5%.
Generally, the ELAN-TA module, the MP-CB module, and the G-ASPPCSPC module each demonstrate specific efficiency and value in enhancing the target detection of the entire network. Among all the network variants, the network that includes all three modules is the most appropriate for underwater robotic applications.

Comparative Analysis Experiment
To comprehensively evaluate the application value of our method and compare its performance against other state-of-the-art methods, we conducted comparative experiments using various current target detection networks. Faster-RCNN, a dual-stage detection network known for its high accuracy, was included in the comparison. Additionally, we selected SSD, a classic single-stage network that distinguishes itself from the YOLO series, as well as YOLOv3, YOLOv4, and YOLOv5, which are other networks in the YOLO series. To ensure validity, an identical experimental platform, including the same software and hardware environment as well as the same training and test sets, was used for all the aforementioned networks. The results of the experiments are presented in Tables 3 and 4 and the following figures.
Upon comparing the detection results of the different algorithms, our algorithm demonstrated a significant improvement. Specifically, it yielded a 39% increase in mAP value, corresponding to an increase ratio of 141%, compared to the dual-stage object detection network Faster-RCNN, as shown in Table 4. Furthermore, in comparison to the multiscale detection algorithm SSD, a 24.5% mAP increase with an increase ratio of 58.2% was observed. Although the parameter quantity increased by 35%, the model size was reduced by 28.6 M, a reduction of 31%. YOLOv7, with an mAP value of 63.7%, stands out as the algorithm with the highest performance among the various YOLO algorithms. However, our algorithm improved the overall mAP by 2.9% compared to the YOLOv7 algorithm, with detection performance increasing by 5.1% for holothurians, 3.6% for starfish, 2.3% for scallops, and 0.6% for echinus. Examination of the data showed that the detection performance for scallops was low across these algorithms, primarily due to sediments that partially or entirely occlude subsea scallops. Future research will therefore focus on addressing these specific occlusion challenges.
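The relationship between the absolute mAP gains and the quoted increase ratios can be checked with a short calculation. The sketch assumes that each gain is expressed in absolute percentage points, that the increase ratio is the gain divided by the baseline mAP, and that the proposed algorithm's mAP equals the YOLOv7 value plus the reported 2.9-point improvement; the baseline values are therefore derived here rather than quoted from Tables 3 and 4.

```python
def increase_ratio(ours, gain):
    """Relative improvement: absolute gain divided by the implied baseline mAP."""
    baseline = ours - gain
    return gain / baseline


# Assumed mAP of the proposed algorithm: YOLOv7 (63.7) plus the reported 2.9 points.
OURS_MAP = 63.7 + 2.9

ratio_faster_rcnn = increase_ratio(OURS_MAP, 39.0)  # gain reported against Faster-RCNN
ratio_ssd = increase_ratio(OURS_MAP, 24.5)          # gain reported against SSD
```

Under these assumptions the two ratios come out at roughly 1.41 and 0.58, matching the 141% and 58.2% figures quoted in the text.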

Visual Analysis Experiment
The heat map is a critical visualization tool employed in deep learning object detection to represent the feature information and original image information utilized in detection and classification.The high response areas in the heat map provide essential information support for the ultimate discrimination of the target detection network.To examine the crucial foundation for the upsurge of network detection performance, four underwater blurred images were randomly selected to perform a contrast experiment using heat map visualization.Specifically, this was performed to compare the YOLOv7-based network and the network improved in this paper, as depicted in the Figure 16.The analysis of three representative underwater scenes reveals that the water medium causes significant image blurring, low contrast, and severe color deviation.Additionally, the detection targets have varying scales.Comparing the detection performance By examining the highlighted hotspots, it is evident that YOLOv7 encountered detection problems when detecting echinus, while the proposed algorithm in this paper extracted feature information from all regions containing echinus with a larger receptive field.In contrast, in detecting and recognizing holothurians and starfish, YOLOv7 had difficulty focusing on the objects of interest due to the overly large receptive field of the extracted feature information.This resulted in imprecise detection and recognition.Conversely, the proposed algorithm in this paper extracted more focused feature information that enhanced detection and recognition performance.In terms of detecting scallops, YOLOv7 and the proposed algorithm in this paper exhibited similar performance.However, in cases where multiple scallops were connected, the proposed algorithm had a more concentrated receptive field when compared to YOLOv7.
The detection performance of the proposed model on blurry images in different underwater scenes was verified by testing YOLOv3, YOLOv5, YOLOv7, and the proposed algorithm, as depicted in Figure 17.
The analysis of three representative underwater scenes reveals that the water medium causes significant image blurring, low contrast, and severe color deviation. Additionally, the detection targets have varying scales. Comparing YOLOv3, YOLOv5, YOLOv7, and the proposed algorithm in these three scenes, YOLOv3 performs worst, whereas the proposed algorithm performs comparatively better. Specifically, YOLOv3 detected 11 out of 15 targets in scene 1, 2 out of 4 in scene 2, and 4 out of 15 in scene 3. Although the detected object positions and attribute discrimination are correct, many targets are missed, giving an overall recall rate of only 50% across the three scenes. YOLOv5 detected 13 targets in scene 1, 3 in scene 2, and 8 in scene 3, achieving an overall recall rate of approximately 70.6%. YOLOv7 performed best among the general-purpose object detection algorithms, identifying 13 targets in scene 1, 3 in scene 2, and 9 in scene 3, with an overall recall rate of approximately 73.5%. The proposed algorithm, an optimized and improved version of YOLOv7, detected 14 targets in scene 1, 4 in scene 2, and 11 in scene 3. The most significant effect is the detection of starfish with partially exposed features and low contrast, which ultimately increased the overall recall rate to 85.3%.
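The overall recall rates above follow directly from the per-scene counts. The short sketch below reproduces them, assuming that every reported detection is a correct one (true positive), so that recall is simply detected targets divided by total targets. The dictionary and function names are illustrative.

```python
# Total number of targets in each scene, as stated in the text.
TOTAL = {"scene 1": 15, "scene 2": 4, "scene 3": 15}

# Targets detected by each algorithm in each scene, as stated in the text.
DETECTED = {
    "YOLOv3": {"scene 1": 11, "scene 2": 2, "scene 3": 4},
    "YOLOv5": {"scene 1": 13, "scene 2": 3, "scene 3": 8},
    "YOLOv7": {"scene 1": 13, "scene 2": 3, "scene 3": 9},
    "Proposed": {"scene 1": 14, "scene 2": 4, "scene 3": 11},
}


def overall_recall(detected: dict, total: dict) -> float:
    """Recall over all scenes: detected targets / total targets.

    Assumption: every detection counted here is a true positive.
    """
    return sum(detected.values()) / sum(total.values())


recalls = {name: overall_recall(d, TOTAL) for name, d in DETECTED.items()}

for name, recall in recalls.items():
    print(f"{name}: {recall * 100:.1f}%")
```

With 34 targets in total, the counts give 17, 24, 25, and 29 detections, which correspond to the 50%, 70.6%, 73.5%, and 85.3% recall rates quoted in the text.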
Overall, the algorithm presented in this article adapts better to the underwater environment, particularly in conditions of low clarity, resulting in enhanced detection accuracy and recall rates. Furthermore, the algorithm is more robust and versatile, making it well suited to applications involving underwater detection robots.

Discussion
Underwater robot inspections require target detection algorithms with accurate target positioning, correct classification, low miss rates, and high recall rates. This paper increases the detection mAP of various types of targets while reducing the size and number of parameters of the deep learning model, resulting in significant optimization effects. The algorithm, however, has certain drawbacks; for example, scallops have an mAP of only 42.1%, significantly lower than that of the other detected targets. Additionally, the algorithm's detection performance for blurred, occluded, or buried targets requires further research.

Conclusions
The detection accuracy of image-based target detection algorithms can be compromised during underwater robot dynamic inspection tasks due to the challenges inherent in underwater imaging, such as blurriness, low contrast, and unclear target features. This paper presents an optimization of the backbone and neck networks of the YOLOv7 algorithm to improve detection accuracy in underwater robots. ELAN, an essential feature extraction module in YOLOv7, has been enhanced by proposing a triple attention mechanism that focuses on features across channels and spatial dimensions in the underwater image. Additionally, SPPCSPC forms the basis for feature fusion in the neck network, and multi-scale features have been extracted using multiple dilated convolution layers with different sampling rates to enhance the network's receptive field and information redundancy. To make the network lighter and faster, the Ghost network is introduced in the improved ASPPCSPC module, making it suitable for deployment in robot detection. The

Figure 1 depicts the comprehensive structure diagram of the YOLOv7 object detection algorithm. In the model design process, weight parameters are implemented into the network framework, and the state-of-the-art ELAN feature fusion module is utilized to improve the network's capability to extract target feature information efficiently.

3. Designing the Structure of the Feature Enhancement Network
3.1. The Overall Architecture of the Network

Figure 2. Network structure diagram of the improved YOLOv7.

Figure 5. Structure diagram of the triple attention mechanism.

Figure 11. Improved MP structure.

Presented in Figure 12 are the diagram and formula for feature fusion:

Figure 13. Example of the dataset used in this paper.

stacked rocks in the underwater environment that produce black blocks where rocks overlap. This occurrence is particularly problematic since echinus themselves are black.

Figure 14. Confusion matrices: (a) YOLOv7 and (b) the proposed algorithm in this paper.

Figure 16. Heat map performance of the two algorithms in different categories.

Figure 17. Comparison of detection results among different algorithms and different scenes.

Table 1. Experimental related

Table 2. Ablation experiments based on improved YOLOv7.

Table 3. Comparison of different algorithm model sizes and parameter quantities.

Table 4. Comparison of accuracy of each category in different algorithms.