YOLOv7-CHS: An Emerging Model for Underwater Object Detection

Abstract: Underwater target detection plays a crucial role in marine environmental monitoring and early warning systems. It involves utilizing optical images acquired from underwater imaging devices to locate and identify aquatic organisms in challenging environments. However, the color deviation and low illumination in these images, caused by harsh working conditions, pose significant challenges to effective target detection. Moreover, the detection of numerous small or tiny aquatic targets becomes even more demanding, considering the limited storage and computing power of detection devices. To address these problems, we propose the YOLOv7-CHS model for underwater target detection, which introduces several innovative approaches. Firstly, we replace efficient layer aggregation networks (ELAN) with the high-order spatial interaction (HOSI) module as the backbone of the model. This change reduces the model size while preserving accuracy. Secondly, we integrate the contextual transformer (CT) module into the head of the model, which combines static and dynamic contextual representations to effectively improve the model's ability to detect small targets. Lastly, we incorporate the simple parameter-free attention (SPFA) module at the head of the detection network, implementing a combined channel-domain and spatial-domain attention mechanism. This integration significantly improves the representation capabilities of the network. To validate the effectiveness of our model, we conduct a series of experiments. The results demonstrate that our proposed model achieves higher mean average precision (mAP) values on the Starfish and DUO datasets compared to the original YOLOv7, with improvements of 4.5% and 4.2%, respectively. Additionally, our model achieves a real-time detection speed of 32 frames per second (FPS). Furthermore, the floating point operations (FLOPs) of our model are 62.9 G smaller than those of YOLOv7, facilitating the deployment of the model. Its innovative design and experimental results highlight its effectiveness in addressing the challenges associated with underwater object detection.


Introduction
The Earth's extensive surface is primarily covered by the ocean, which accounts for more than 70% of its total area. This remarkable natural resource offers humanity a wealth of valuable marine resources that are indispensable to our survival and well-being [1]. Achieving sustainable development through the effective monitoring and protection of living marine resources necessitates the utilization of a diverse array of detection techniques in an expeditious manner. Traditional underwater target detection methods rely heavily on human divers' visual assessment of the marine environment, but these tasks can have adverse effects on the divers' health due to their extended and intricate nature. As robot technology continues to advance, human divers are gradually being replaced by underwater robots equipped with object detection algorithms that can locate and identify underwater targets accurately and promptly [2,3]. The object detection model plays a pivotal role in this success.
Object detection is a crucial task in computer vision, which can be categorized into two-stage and one-stage detection methods based on whether region proposals are generated [4]. The former generates proposed regions before classifying them with refined positions, enabling higher accuracy but slower speed. The latter directly outputs the object's class probability and position coordinates without generating region proposals, resulting in faster recognition but lower accuracy. This paper presents a solution to address the cost and real-time limitations of underwater detection devices by introducing a compact network with precise and fast detection capabilities. The underwater environment is characterized by its intricacy and constant change, which lead to environmental noise that can significantly degrade the performance of underwater detection devices [5]. In most captured underwater images, a predominant palette of blue-green hues dominates, creating a major obstacle to effectively discerning underwater targets from the background. Moreover, the presence of a multitude of small-bodied marine species adds another layer of complexity, as these organisms often conceal themselves, making accurate detection challenging. Figure 1 visually exemplifies the formidable obstacles encountered in the detection of underwater objects.

This paper has primarily the following contributions:

• Following a comprehensive examination of the correlation between underwater image enhancement and target detection, it is determined that there is no association between the two. This implies that image enhancement is not a mandatory step in the process of detecting targets in underwater environments.

• To enhance the accuracy of underwater detection while minimizing computational complexity, we propose the high-order spatial interaction (HOSI) module as a replacement for efficient layer aggregation networks (ELAN) as the backbone network for YOLOv7. The HOSI module achieves superior flexibility and customization through the incorporation of high-order spatial interactions between gated convolution and recursive convolution, and greatly reduces model complexity.

• Drawing inspiration from the transformer's working mechanism, we propose the contextual transformer (CT) module to augment our detection network, enabling the integration of both dynamic and static context representations to improve the model's ability to detect small targets.

• We integrate the simple parameter-free attention (SPFA) module into the detection network, enabling it to attend to both channel and spatial information simultaneously, thereby improving its ability to selectively extract relevant information.
The present paper is organized into five distinct sections. To appreciate the innovative nature of the research, Section 2 provides a comprehensive summary of prior studies in the field. Section 3 then introduces the improved network architecture designed for the faster and more precise detection of underwater targets. In Section 4, the experimental setup and analysis are presented to evaluate the performance of the detection network. Finally, Section 5 concludes with the key findings of the investigation and outlines future research directions.

Underwater Object Detection
The successful implementation of generic object detection models on common images has demonstrated their versatility and effectiveness, highlighting their utility in various applications. However, detecting objects in underwater images presents a greater challenge due to complex environmental noise [6]. Despite this, some progress has been made in underwater target detection, which can be broadly categorized into two distinct tasks: underwater image enhancement and generic target detection. The former focuses on improving the quality of degraded underwater images caused by light dispersion and color distortion, while the latter aims to locate and identify underwater targets more accurately and efficiently. Various image enhancement techniques have been investigated in the literature for improving contrast, correcting color shifts, and sharpening edges in underwater images [7][8][9]. Furthermore, other studies aim to counteract the negative effects of image blur by enhancing the network architecture [10][11][12] and refining training strategies [13]. Conversely, there is a high level of interest in improving the accuracy and speed of generic detection models. V. Malathi et al. [14] proposed an HC2PSO algorithm that leveraged a ResNet model with a convolutional neural network architecture for underwater object recognition. This innovative approach successfully eliminated speckles from images and significantly enhanced detection accuracy. In the literature [15], a novel algorithm based on YOLOv4-tiny has been proposed for constructing a symmetric FPN module, which is claimed to enhance the mAP score; however, this algorithm suffers certain losses in inference speed. The M-ResNet [16] approach enhances detection efficiency through multi-scale operations, enabling the accurate identification of objects of varying sizes and making it a viable option for real-time applications. However, the datasets utilized in that investigation are relatively small, which may limit its generalizability. To enhance the effectiveness of underwater target detection, a multi-scale aggregated feature pyramid network has been proposed in the literature [17]. The literature [18] demonstrates a harmonious balance between accuracy and speed through the implementation of two deep learning detectors that learn from one another during training; nonetheless, this approach is limited in its ability to detect small and dense objects. To mitigate the detrimental effects of degraded underwater images on detection accuracy, a novel deep neural network for simultaneous color conversion and underwater target detection has been introduced in the literature [19]. Despite the significant improvement in detection accuracy, challenges remain in detecting small targets accurately. Zhang et al. [20] proposed a method that integrates MobileNetv2 and depthwise separable convolution to effectively reduce the number of parameters while maintaining accuracy; despite this reduction, the method still entails a significant amount of redundant parameters and channels. The FL-YOLOV3-TINY model [21] reduces the number of parameters and the model size by integrating a depthwise separable convolution module, although there is still room for further accuracy improvement. Furthermore, to enhance the accuracy of their models, various modifications to the underlying frameworks have been introduced in the literature [22][23][24][25].
The goal of this study is to optimize the detection model's size while improving its accuracy and speed. To achieve this objective, we propose the integration of the HOSI module within the YOLOv7 network, which facilitates a reduction in model size while improving its visual representation. Additionally, the CT module is designed to enhance the model's accuracy by integrating static and dynamic contexts. Finally, we also present the SPFA module to enhance the model's accuracy and speed by utilizing a parameter-free attention mechanism.

Small Object Detection
To the best of our knowledge, there are no fully satisfactory techniques for detecting small objects [26], particularly in underwater imagery, which is often characterized by incomplete and blurry image features. Given the abundance of small targets present in underwater images, numerous researchers have endeavored to refine existing models, striving to enhance the accuracy and efficiency of small target detection tasks. The CME-YOLOv5 model [27] first replaces the primary C3 module with the C3CA module and then expands the three detection layers to four; additionally, it adopts the EIOU loss function in place of the GIOU loss function to enhance small object detection performance. The DLSODC-GWM technique [28] refines the hyperparameters of the improved RefineDet (IRD) model using an arithmetic optimization algorithm (AOA), and then employs the functional link neural network (FLNN) model to classify small targets, leading to increased detection accuracy. The SWIPENET model [29] was designed to achieve enhanced accuracy in small object detection by utilizing a backbone that generates multiple high-resolution, semantically rich hyper-feature maps. Cao et al. [30] proposed an improved algorithm for small target detection based on Faster-RCNN, which addressed the localization bias issue by enhancing the loss function and RoI pooling operation; moreover, its accuracy was significantly improved by optimizing the non-maximum suppression (NMS) process to prevent the loss of overlapping objects. Furthermore, Xu et al. [31] combined Faster-RCNN with the kernelized correlation filter (KCF) tracking algorithm to achieve the real-time detection of small underwater objects. The literature [32] proposed a small target detection algorithm that utilizes context and incorporates an attention mechanism, leading to improved accuracy in detecting small targets. The aforementioned advancements in small target detection have improved model accuracy. However, such progress often comes at the cost of reduced detection speed or increased model size. To address this challenge, we propose the YOLOv7-CHS network, which not only enhances the accuracy of detecting dense underwater small targets but also significantly reduces the model size.

Image Enhancement
Image enhancement refers to the implementation of digital signal processing techniques to increase the visual quality of images. Underwater image enhancement is a specialized technique that aims to augment specific features of underwater images and suppress noisy, irrelevant background features, in order to improve the overall quality of the image and the visibility of the target of interest [33]. In this study, three underwater image enhancement techniques are explored, namely, Contrast-Limited Adaptive Histogram Equalization (CLAHE) [34], Dark Channel Prior (DCP) [35], and DeblurGAN-v2 [36].
The CLAHE algorithm is a technique for improving the contrast of images while maintaining their details. It achieves this by dividing the image into several regions and processing each region separately with an adaptive histogram equalization method. The algorithm then uses interpolation to stitch the processed regions together, resulting in enhanced image contrast. A key advantage of the CLAHE algorithm is its ability to preserve faint detail information in the image while enhancing contrast.
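As an illustration, the tile-wise clipped equalization at the heart of CLAHE can be sketched in a few lines of NumPy. This is a simplified sketch: real CLAHE additionally interpolates bilinearly between neighbouring tile mappings to avoid block artifacts, and the tile grid and clip limit chosen here are arbitrary illustrative values.

```python
import numpy as np

def clipped_equalize_tile(tile, clip_limit=40, n_bins=256):
    """Histogram-equalize one tile with CLAHE-style clipping:
    histogram counts above clip_limit are redistributed evenly,
    which limits how much local contrast can be amplified."""
    hist, _ = np.histogram(tile, bins=n_bins, range=(0, 256))
    excess = np.maximum(hist - clip_limit, 0).sum()
    hist = np.minimum(hist, clip_limit) + excess // n_bins
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1) * 255.0
    return cdf[tile].astype(np.uint8)

def clahe_like(img, tile=(2, 2), clip_limit=40):
    """Apply clipped equalization tile-by-tile on a uint8 grayscale
    image (full CLAHE would also blend neighbouring tile mappings)."""
    out = np.empty_like(img)
    h_step, w_step = img.shape[0] // tile[0], img.shape[1] // tile[1]
    for i in range(tile[0]):
        for j in range(tile[1]):
            ys = slice(i * h_step, (i + 1) * h_step)
            xs = slice(j * w_step, (j + 1) * w_step)
            out[ys, xs] = clipped_equalize_tile(img[ys, xs], clip_limit)
    return out
```

In practice one would apply this to the luminance channel only (e.g., the L channel of a LAB conversion), so that underwater color casts are not amplified along with the contrast.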
The DCP algorithm is a powerful technique for removing haze from images that employs a two-layer partial differential equation approach. It can effectively clarify an image by estimating the background brightness and the distribution of light in the haze. The first step in its implementation is to calculate the dark channel of the image, which represents the minimum pixel value over a local window across all color channels. This calculation is mathematically expressed as:

J^dark(x) = min_{y ∈ Ω(x)} ( min_{c ∈ {r,g,b}} J^c(y) )

where J^c(y) represents any channel of image J, Ω(x) denotes a rectangular window centered at pixel point x, and J^dark is the dark channel of the image. When there is no fog in an outdoor scene, the intensity of the dark channel is low and approaches zero, so the raw dark channel alone is unsuitable for haze thickness estimation. To address this issue, the algorithm employs two additional steps to optimize the defogging effect: global illumination estimation and haze density estimation. These two steps refine the estimation of the global illumination and haze thickness, leading to better image quality.

The DeblurGAN-v2 algorithm is a deep generative adversarial network (GAN)-based image restoration technique that boasts remarkable speed and efficiency in restoring blurry images to their original clarity. The core of this algorithm lies in the use of an adversarial learning approach for training both the generator and discriminator networks. This enables the generator to produce high-fidelity clear images despite being fed an unclear input, while simultaneously improving the discriminator's ability to distinguish real clear images from the fake ones produced by the generator. To train DeblurGAN-v2, a hybrid loss function L_G is employed, which takes into account multiple aspects of the generated images:
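The dark channel computation follows directly from the formula above. The NumPy sketch below is minimal: the window size is a typical but arbitrary choice, and the subsequent illumination- and haze-estimation steps of DCP are omitted.

```python
import numpy as np

def dark_channel(img, window=15):
    """Dark channel prior: for each pixel x, the minimum intensity over
    all color channels within a local window Omega(x).
    img: float array of shape (H, W, 3) with values in [0, 1]."""
    min_over_channels = img.min(axis=2)  # per-pixel min over r, g, b
    pad = window // 2
    padded = np.pad(min_over_channels, pad, mode='edge')
    h, w = min_over_channels.shape
    out = np.empty_like(min_over_channels)
    for y in range(h):
        for x in range(w):
            # local minimum over the window centered at (y, x)
            out[y, x] = padded[y:y + window, x:x + window].min()
    return out
```

For a haze-free region the result approaches zero, which is exactly the observation DCP exploits when estimating haze thickness.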
L_G = 0.5 L_p + 0.006 L_x + 0.01 L_adv

where the pixel term L_p helps to correct color and texture distortions by measuring the pixel-wise difference between the restored image and the original sharp image. L_x computes the Euclidean loss on the VGG19 conv3_3 feature map, which measures the similarity between the generated image and the ground-truth image in terms of visual features. Finally, L_adv contains both global and local discriminator losses, which help to improve the performance of the generator by forcing it to produce more realistic and detailed images.
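As a worked example, the hybrid objective is just a weighted sum of the three terms; the 0.5/0.006/0.01 weighting shown here is the one reported for DeblurGAN-v2. In a real training loop, the perceptual and adversarial values would come from network forward passes; here they are plain scalars.

```python
import numpy as np

def pixel_loss(pred, target):
    """Pixel term L_p: mean-squared error between the restored image
    and the sharp ground truth."""
    return float(np.mean((pred - target) ** 2))

def hybrid_loss(l_p, l_x, l_adv):
    """Hybrid generator objective L_G = 0.5 L_p + 0.006 L_x + 0.01 L_adv.
    l_x (perceptual, VGG19 conv3_3) and l_adv (global + local
    discriminator losses) are placeholders for network outputs."""
    return 0.5 * l_p + 0.006 * l_x + 0.01 * l_adv
```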

Network Architecture
The You Only Look Once (YOLO) network is a popular object detection technique that can identify and localize objects efficiently in images. In this study, the YOLOv7 network [37] was employed as the base network for underwater target detection. The YOLOv7 network is composed of four fundamental components, namely, the input, backbone, head, and prediction modules. In the input module, an image undergoes Mosaic data enhancement, adaptive anchor frame calculation, and adaptive image scaling to construct a processed image for further processing. In the backbone network, the CBS module (drawn in three color levels, from light to dark) changes the number of channels, extracts features, and downsamples. The MP module downsamples the resulting feature map, with the MP-1 and MP-2 modules having different output channel settings but identical structures. The ELAN module improves the model's adaptability and enhances its robustness by controlling the gradient path. The head network is made up of the SPPCSPC module, the UPSample module, and the concatenation (Cat) operation. The SPPCSPC module expands the receptive field and adapts to images with various resolutions. The UPSample module scales up the dimensions of the input feature map through an upsampling operation to obtain a higher-resolution feature map. The Cat operation merges the features of two branches. Finally, the REP structure is used to tune the output channels, followed by a 1 × 1 convolution to predict the confidence, category, and anchor frame.
The YOLOv7 network, originally designed for generic object detection, often encounters challenges, such as high image quality requirements and insensitivity to small targets, during underwater target detection. Furthermore, the large size of the model makes it impossible to deploy on underwater devices, further complicating underwater detection tasks. To address these issues and enhance the suitability of the detection network for underwater environments, we made significant modifications and optimizations to YOLOv7. Specifically, we replaced the two ELAN modules in the original backbone with two HOSI modules. Additionally, we added downsampling convolutional layers after the first HOSI module and a feature extraction convolutional layer after the second HOSI module. These modifications enable the backbone network to achieve high-order spatial interactions while maintaining a light weight, making it more suitable for underwater detection tasks. Moreover, to enhance the visual modeling capabilities of YOLOv7, we integrated the CT3 module into the head to enable it to detect small targets by combining static and dynamic contexts. Additionally, we incorporated the SPFA attention mechanism into the MP-2 module to improve the model's representational capabilities and accuracy.
As a consequence of these enhancements, we developed the YOLOv7-CHS network, which is more suitable for underwater target detection.The improved network architecture is depicted in Figure 2.


High-Order Spatial Interaction (HOSI) Module
The success of self-attentive networks and other dynamic networks has demonstrated the benefits of incorporating higher-order spatial interactions into network structures to enhance visual modeling capabilities. In this paper, we design the backbone around recursive gated convolution (RGConv) [38], which is illustrated in Figure 3. RGConv further improves model capacity by introducing higher-order interactions. Firstly, a linear projection layer p_in(•) is leveraged to acquire a set of projection features u_0 and {v_k}, k = 0, ..., n − 1:

[u_0, v_0, ..., v_{n−1}] = p_in(x) ∈ R^{HW×(C_0 + Σ_{k=0}^{n−1} C_k)}

where H, W, and C correspond to the height, width, and channel dimension of the images, respectively. The gated convolution is then applied recursively as outlined below:

u_{k+1} = ψ_k(v_k) ⊙ φ_k(u_k) / α,  k = 0, 1, ..., n − 1

where the scaling factor α stabilizes training, ψ_k denotes a set of depthwise convolutional layers, and φ_k is employed to align the channel dimensions between successive orders. Finally, the output of the last recursive step is passed through the projection layer to obtain the output of the recursive gated convolution, p_out(u_n) ∈ R^{HW×C}. To mitigate the excessive computational overhead caused by higher-order interactions, we regulate the channel dimension in each order as C_k = C / 2^{n−k−1}. RGConv is developed using conventional convolution, linear projection, and element-wise multiplication, incorporating input-adaptive spatial mixing akin to self-attention. The HOSI module was designed using RGConv to facilitate the interaction between long-range and higher-order spatial features, as shown in Figure 4. During forward propagation, layer normalization is employed to scale the output values of neurons, which effectively addresses gradient vanishing and exploding issues. Additionally, a multilayer perceptron (MLP) is incorporated to enhance the model's accuracy.
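The recursion above can be sketched at the shape level in NumPy. This is a sketch under simplifying assumptions: random matrices stand in for the learned projections, and a cyclic shift stands in for the depthwise convolution ψ_k; only the order-wise channel growth and the gating structure are kept faithful.

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_mix(v):
    """Stand-in for the depthwise convolution psi_k: any spatial mixing
    that preserves the (HW, C_k) shape works for this sketch."""
    return np.roll(v, shift=1, axis=0)

def gated_recursion(x, order=3, alpha=1.0):
    """Shape-level sketch of recursive gated convolution.
    x: (HW, C). Channel dims double each order: C_k = C / 2^(order-1-k)."""
    hw, c = x.shape
    dims = [c // 2 ** (order - 1 - k) for k in range(order)]  # e.g. C/4, C/2, C
    # p_in: one projection produces u_0 and all v_k
    proj_in = rng.standard_normal((c, dims[0] + sum(dims))) / np.sqrt(c)
    feats = x @ proj_in
    u = feats[:, :dims[0]]
    vs, start = [], dims[0]
    for d in dims:
        vs.append(feats[:, start:start + d])
        start += d
    # recursive gating: u_{k+1} = psi_k(v_k) * phi_k(u_k) / alpha
    for k in range(order):
        phi_k = rng.standard_normal((u.shape[1], dims[k])) / np.sqrt(u.shape[1])
        u = depthwise_mix(vs[k]) * (u @ phi_k) / alpha
    # p_out: project back to the input channel dimension
    proj_out = rng.standard_normal((u.shape[1], c)) / np.sqrt(u.shape[1])
    return u @ proj_out
```

Note how the channel dimension of u grows from C/2^{n−1} to C across the n gating steps, which is what keeps the extra orders cheap.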


Contextual Transformer (CT) Module
The resolution of objects in underwater images is often limited to less than 30 × 30 pixels, resulting in reduced spatial information. As a result, detecting these small objects becomes a challenging task. To overcome this limitation, this paper proposes a novel method for small object detection by incorporating the CT module, which draws inspiration from the Transformer architecture and exhibits strong visual representation capabilities [39].
Figure 5 illustrates the schematic diagram of the CT module. Initially, the input key undergoes encoding using a contextual convolution to capture the static context of the input. Subsequently, the encoded key is concatenated with the input query and passed through two successive convolutions to produce attention weights. By multiplying the attention weights with the input values, the dynamic contextual representation of the input is obtained. Finally, the fusion of the static and dynamic contextual representations serves as the output of the CT module.
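The static/dynamic fusion can be illustrated on a flattened token sequence. This is a simplified sketch: random matrices stand in for the learned convolutions, and neighbour averaging stands in for the k × k contextual convolution that encodes the static key.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def contextual_attention(x):
    """Token-level sketch of the CT idea. x: (N, C) flattened features.
    The static key mixes each token with its neighbours; the attention
    is computed from [static key, query] rather than from isolated
    query-key pairs, so the weights are context-aware."""
    n, c = x.shape
    w_q = rng.standard_normal((c, c)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c)) / np.sqrt(c)
    query, value = x @ w_q, x @ w_v
    # static context: neighbour-averaged keys (stand-in for a k x k conv)
    static_key = (np.roll(x, 1, axis=0) + x + np.roll(x, -1, axis=0)) / 3.0
    # attention weights from the concatenation of static key and query
    w_a = rng.standard_normal((2 * c, n)) / np.sqrt(2 * c)
    attn = softmax(np.concatenate([static_key, query], axis=1) @ w_a, axis=-1)
    dynamic_ctx = attn @ value          # dynamic contextual representation
    return static_key + dynamic_ctx     # fuse static and dynamic contexts
```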

The CT3 module was constructed by utilizing multiple CT modules. The structure diagram of the CT3 module can be seen in Figure 6. This module is composed of two branches, each following a distinct pathway. In the first branch, the input passes through a CBS module and then proceeds to a CTBottleneck module. The CBS module facilitates the integration of contextual information, while the CTBottleneck module takes advantage of the CT module to enhance feature representation. The second branch simply passes through a CBS module. Both branches converge at the Cat operation, where the features from both branches are concatenated. The concatenated features then pass through another CBS module, providing further refinement and integration. The utilization of multiple CT modules allows the module to leverage the complementary characteristics of each branch, thereby improving its efficiency in learning residual features. This module is particularly beneficial for tasks involving small object detection.

Simple Parameter-Free Attention (SPFA) Module
In the human brain, attention is a complex phenomenon that involves two main mechanisms: feature attention and spatial attention [40]. These mechanisms have inspired researchers to develop a module called the SPFA, which efficiently generates true 3D attention weights [41]. The structure of the SPFA module is depicted in Figure 7. To achieve attention, it is crucial to estimate the importance of individual neurons. For this purpose, an energy function e(•) is introduced for each neuron, which evaluates the relevance of that neuron based on its features and spatial location. By combining feature-based attention and spatial attention, the SPFA module can effectively capture features under different attention patterns and adapt to various contexts.
The energy function for a target neuron tn is defined as:

e(w_tn, b_tn, on, tn) = (1/(N − 1)) Σ_{i=1}^{N−1} (−1 − (w_tn·on_i + b_tn))² + (1 − (w_tn·tn + b_tn))² + C_t·w_tn²    (6)

where tn represents the target neuron, while on_i represents the other neurons in a single channel of the input feature denoted as X ∈ R^{HW×C}. The subscript i indexes the spatial dimension, and N = HW represents the total number of neurons in the channel. The term C_t is the regularization coefficient, and w_tn and b_tn are the weight and bias in Equation (6). When the energy function reaches its minimum, the closed-form solutions for w_tn and b_tn are obtained as follows:

w_tn = −2(tn − μ_tn) / ((tn − μ_tn)² + 2σ_tn² + 2C_t)
b_tn = −(tn + μ_tn)·w_tn / 2

where the mean of all neurons except tn is denoted by μ_tn, while σ_tn² represents the variance of all neurons except tn. Substituting these solutions back yields the minimal energy of each neuron, which determines its importance; feature refinement is then conducted using a scaling operator.
The optimization phase of the module can be concisely described as follows:

$$\tilde{X}=\operatorname{sigmoid}\left(\frac{1}{E}\right)\odot X,$$

where $E$ serves as an aggregation of the minimal energies $e_{t_n}^{*}$ across both channel and spatial dimensions. To mitigate the presence of potential excessive values within $E$, a sigmoid function is employed. Notably, the sigmoid function is monotonic and therefore does not alter the relative importance attributed to each neuron.
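The closed-form scoring and sigmoid refinement above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions (per-channel statistics and the regularization coefficient `c_t` follow the notation in the text; it is not the authors' code):

```python
import numpy as np

def spfa_attention(x, c_t=1e-4):
    """Parameter-free 3D attention in the spirit of the SPFA module.

    x: feature map of shape (C, H, W). For each channel, every neuron t is
    scored by the closed-form minimal energy
        e_t* = 4 * (var + c_t) / ((t - mu)^2 + 2*var + 2*c_t),
    and the refinement weight is sigmoid(1 / e_t*). Note that
        1 / e_t* = (t - mu)^2 / (4 * (var + c_t)) + 0.5.
    """
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n     # per-channel variance
    inv_energy = d / (4 * (var + c_t)) + 0.5        # = 1 / e_t*
    weight = 1.0 / (1.0 + np.exp(-inv_energy))      # sigmoid, no parameters
    return x * weight
```

Because the weight lies strictly between 0 and 1, every neuron is attenuated in proportion to how little it stands out from its channel mean.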

The module employed in this study exhibits a straightforward architecture and operates as a parameter-free attention mechanism. Unlike existing channel attention and spatial attention mechanisms, this module directly deduces attention weights in all three dimensions within the network layers, eliminating the need for additional parametric quantities. The module is both adaptable and proficient in enhancing the representational capacity of convolutional networks.
In this paper, we propose the integration of the SPFA module into the MP-2 module of YOLOv7, aiming to enhance feature extraction capabilities. By replacing the CBS module with the SPFA module, we introduce the MP-SPFA module, as illustrated in Figure 8. The MP-SPFA module comprises two branches. The first branch encompasses a maximum pooling layer and a CBS module with a 1 × 1 convolution kernel and a stride size of 1. The second branch involves the SPFA module and a CBS module with a 3 × 3 convolution kernel and a stride size of 2. The outputs of these two branches are combined through a Cat operation. This module facilitates the extraction of deeper features through an attentive mechanism.
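At the level of tensor shapes, the two branches can be sketched as follows. The convolutions are abstracted away (a stride-2 subsampling stands in for the 3 × 3 stride-2 CBS, and the SPFA step is omitted), so this only illustrates how both branches halve the spatial resolution while the Cat operation stacks their channels:

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling with stride 2 on a (C, H, W) feature map."""
    c, h, w = x.shape
    return x[:, :h - h % 2, :w - w % 2].reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def mp_spfa_sketch(x):
    """Shape-level sketch of the two MP-SPFA branches.

    Branch 1: max pooling (stride 2) -> 1x1 conv, stride 1
    Branch 2: SPFA attention -> 3x3 conv, stride 2 (stride-2 slice here)
    The Cat operation concatenates the halved maps channel-wise.
    """
    b1 = max_pool2(x)            # (C, H/2, W/2)
    b2 = x[:, ::2, ::2]          # stand-in for the stride-2 3x3 CBS
    return np.concatenate([b1, b2], axis=0)  # (2C, H/2, W/2)
```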


Experiments

Dataset
To demonstrate the effectiveness and versatility of the object detection model proposed in this paper, we chose two underwater imaging datasets, namely, the Starfish dataset and the DUO dataset, for validation purposes. The utilization of these datasets allowed us to evaluate the model's performance in different underwater scenarios.
The Starfish dataset [42] was obtained from the official Kaggle website and holds great significance in detecting a specific species of starfish that feeds on corals in the ocean, with the aim of safeguarding the marine environment. The dataset consists of genuine underwater images captured in the Great Barrier Reef of Australia, where the Crown-of-Thorns Starfish (COTS) is the sole object detected in the images. With a resolution of 1280 × 720 pixels, the dataset comprises a total of 23,501 images. Among these, 4919 images contain starfish objects, while 18,582 images do not exhibit any starfish presence. To facilitate effective model training, the dataset is randomly partitioned into three sets: a training set, a validation set, and a testing set, following an 8:1:1 ratio.
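An 8:1:1 random partition of this kind can be reproduced with a few lines of Python (the function name and seed are illustrative, not taken from the paper):

```python
import random

def split_dataset(image_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition a list of image ids into train/val/test (8:1:1)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)          # deterministic shuffle
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

For the 23,501 Starfish images this yields 18,800 training, 2350 validation, and 2351 testing images.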
The DUO dataset [43] comprises four distinct object categories: holothurian, echinus, scallop, and starfish. Figure 9 illustrates the distribution of objects across these categories, revealing a total of 74,515 objects. The specific counts for each category are as follows: holothurian (7887), echinus (50,156), scallop (1924), and starfish (14,548). The dataset encompasses a collection of 7782 underwater images captured at varying resolutions. These images are further divided into three subsets, namely, training, validation, and testing sets, following an 8:1:1 ratio. This partitioning scheme ensures a balanced representation of the data in each subset, facilitating effective model training and evaluation.
To ensure comparability, the hyperparameter settings for all models were as follows. The images were resized to an input size of 640 × 640 pixels. The training duration spanned 300 epochs, and the batch size was set to 8. In the image enhancement and ablation experiments, the SGD [44] optimizer was utilized with a learning rate of 0.01. In the model comparison experiments, the Adam [45] optimizer was employed with a learning rate of 0.001.
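For reference, the two optimizers differ only in their update rules; a minimal sketch of one SGD step and one (bias-corrected) Adam step with the learning rates stated above:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One plain SGD update (momentum omitted for brevity)."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and squared gradient (v), bias-corrected by the step count t."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

On the first step with a unit gradient, Adam's bias correction makes the effective step size equal to the learning rate itself.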

Evaluation Metrics
Precision (P), recall (R), average precision (AP), and mean average precision (mAP) are commonly employed as evaluation metrics to assess model accuracy.Specifically, they are formulated as follows:

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},$$

$$AP=\int_{0}^{1} P(R)\,\mathrm{d}R,\qquad mAP=\frac{1}{C}\sum_{j=1}^{C} AP_j,$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. $P$ denotes the proportion of correctly predicted positive samples among all predicted positive samples, while $R$ denotes the ratio of correctly predicted positive samples to all actual positive samples. The term $C$ denotes the number of target categories. AP refers to the area enclosed by the P-R curve, with precision on the y-axis and recall on the x-axis, and mAP is the average AP calculated across all categories. Moreover, frames per second (FPS) provides valuable insights into the network's efficiency and speed in processing frames or data samples, enabling an assessment of its real-time capabilities. The number of network parameters serves as an indicator of the model's size, offering insights into its memory and storage requirements, which are crucial considerations for practical deployment. Additionally, floating-point operations (FLOPs) provide a quantifiable measure of the algorithm's computational complexity, facilitating comparisons between state-of-the-art models based on their utilization of computational resources.
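These metrics can be computed directly from detection counts. The sketch below uses the step-integration form of AP with a monotone precision envelope, a common implementation choice in detection toolkits (the exact integration scheme used by the authors is not stated):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """AP: area under the precision-recall curve, computed by step
    integration after enforcing a monotonically decreasing precision
    envelope (sentinel points added at both ends)."""
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]            # recall change points
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(aps):
    """mAP: AP averaged over the C categories."""
    return sum(aps) / len(aps)
```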

Image Enhancement
Given the challenges posed by underwater imaging, such as low illumination and color deviation, this paper applied three distinct image enhancement methods to tackle these issues. The effectiveness of these methods is demonstrated in Figure 10. The top row showcases the enhancement effect on the Starfish dataset, while the bottom row displays the enhancement effect on the DUO dataset. As depicted, the CLAHE method enhances the uniformity of the color distribution in the original image. The DCP algorithm effectively reduces the impact of fog, resulting in a clearer appearance. Additionally, the DeblurGAN-v2 algorithm alleviates image blurring and enhances the visibility of previously obscured details.
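For intuition, plain histogram equalization, of which CLAHE is the contrast-limited, tile-wise variant, remaps pixel intensities through the image's cumulative histogram. A minimal grayscale sketch (not the CLAHE implementation used in the paper):

```python
import numpy as np

def hist_equalize(gray):
    """Global histogram equalization on an 8-bit grayscale image.

    Builds the cumulative distribution of pixel values and remaps each
    intensity so that the output histogram is approximately uniform.
    """
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[np.nonzero(cdf)][0]                 # first occupied bin
    lut = (cdf - cdf_min) / max(gray.size - cdf_min, 1) * 255.0
    lut = np.clip(np.round(lut), 0, 255).astype(np.uint8)
    return lut[gray]
```

CLAHE applies this remapping per tile with a clip limit on the histogram, which avoids the over-amplification of noise that the global version can produce.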

To investigate the potential impact of enhanced images on detection accuracy, we applied image enhancement techniques to the training set, validation set, and testing set of both datasets. Subsequently, we conducted training on the YOLOv7-CHS network and evaluated the performance using the testing results presented in Table 1. The experiments were divided into four sets. The first set of experiments utilized the original dataset without any image enhancement. The second to fourth sets of experiments applied the three image enhancement algorithms to the dataset, resulting in enhanced versions. Based on the experimental findings, we observed that the CLAHE method yields improved accuracy on the Starfish dataset, but its effectiveness does not extend to the DUO dataset, indicating a lack of generalizability of the enhancement method. Moreover, applying the DCP and DeblurGAN-v2 methods on both datasets does not contribute to any improvement in the detection accuracy. Based on these observations, it can be inferred that the image enhancement algorithms primarily improve the visual quality of the images as perceived by human vision. However, this enhancement does not enable the deep learning network to acquire additional features, and thus fails to improve the detection accuracy when training neural networks. Consequently, it can be concluded that the enhancement of underwater images and improvement in the detection accuracy are not positively correlated. Furthermore, the computational burden on underwater object detection devices increases when employing image enhancement algorithms to preprocess underwater images, thus impacting the real-time performance of the detection task. Therefore, unless specifically stated, we refrained from utilizing such algorithms for preprocessing the captured underwater images in the subsequent experiments.

Ablation Experiments
Ablation experiments were conducted on the two datasets, namely, the Starfish dataset and the DUO dataset, to evaluate the effectiveness of the three enhancements presented in this paper. The primary objective of the initial set of experiments was to assess the detection performance of the YOLOv7 model in its original form, which serves as a baseline for evaluating the impact of subsequent enhancements. Subsequently, experiments two to seven augmented the YOLOv7 model with individual or combined improvements. These enhancements, namely, the CT3, HOSI, and SPFA modules, were incorporated into the YOLOv7 model individually to assess their separate contributions, and combinations of these enhancements were also evaluated to explore potential synergistic effects. The eighth set of experiments examined the detection results of the YOLOv7-CHS model, which incorporates all three improvements (CT3, HOSI, and SPFA) simultaneously, with the aim of achieving an improved detection performance compared to the original YOLOv7 model.
Table 2 presents the results of the ablation experiments conducted on the two datasets to evaluate the impact of the three improvement points. The initial YOLOv7 model without any enhancements achieved an mAP of 47.3% and 80.1% on the Starfish and DUO datasets, respectively, with a computational cost of 103.2 G FLOPs. In the fourth set of experiments, the inclusion of only the SPFA module resulted in an mAP of 82.0% on the DUO dataset. However, there was no improvement in mAP on the Starfish dataset, and the model's computational cost remained too large for practical application to underwater devices. Subsequently, the sixth set of experiments incorporated both the CT module and the SPFA module, leading to improved mAP values on both datasets. Nevertheless, the computational cost remained high at 90.6 G FLOPs, making it unsuitable for deployment on underwater devices. In the eighth set of experiments, the CT3, HOSI, and SPFA modules were incorporated simultaneously, resulting in mAP values of 48.4% and 84.1% on the Starfish and DUO datasets, respectively. These results represent increases of 1.1% and 4.0% over the initial set of experiments, and the computational cost was reduced to 40.3 G FLOPs, making the model more lightweight and suitable for deployment on underwater devices. In summary, the proposed model demonstrates a favorable performance on both underwater datasets, indicating its adaptability and robustness in various underwater environments, while significantly reducing model complexity and enabling convenient deployment on underwater devices.

Selection of Optimizer
In the previous section, we conducted ablation experiments using the SGD optimizer, which is commonly employed for object detection. To assess the suitability of optimizers for underwater object detection, we performed a comparative analysis by conducting two sets of experiments on the two datasets. We compared the performance of two popular optimizers, namely, SGD and Adam. The SGD optimizer utilizes a learning rate of 0.01, while Adam employs a learning rate of 0.001. The objective of these experiments was to determine which optimizer offers a better performance in the context of underwater object detection. As demonstrated in Table 3, employing the Adam optimizer on the Starfish dataset yielded a significant improvement of 5% in recall and a 4.4% increase in mAP. However, this improvement came at the cost of an 8.2% reduction in precision. Additionally, there was a slight improvement in the detection speed. When applied to the DUO dataset, the use of the Adam optimizer resulted in a 1.3% decrease in recall. However, it led to a modest increase of 0.9% in precision and a 0.5% improvement in mAP. Notably, the detection speed remained consistent at 32 FPS. Overall, these findings suggest that the Adam optimizer is more suitable than the SGD optimizer for underwater object detection tasks.

In this study, we conducted comparative experiments on popular object detection models, namely, YOLOv5s, YOLOv7, YOLOv7-tiny, Swin-Transformer, and YOLOv7-CHS, as detailed in Table 4.
Leveraging the success of the previous experiments with the Adam optimizer, we employed it in this comparison study. Our model achieved an impressive mAP of 52.8%, surpassing Swin-Transformer, YOLOv5s, YOLOv7-tiny, and YOLOv7 by 17%, 17.1%, 18.5%, and 4.5%, respectively. These results highlight the effectiveness of our model in accurately detecting and localizing objects. Although our model's FLOPs were not as low as those of Swin-Transformer, YOLOv5s, and YOLOv7-tiny, it still saved 62.9 G compared to the YOLOv7 model. This indicates that our model strikes a balance between accuracy and computational efficiency, resulting in notable computational savings. Additionally, our model demonstrated a detection speed of 32 FPS, enabling real-time object detection. This characteristic is particularly valuable when the timely and efficient detection of underwater objects is required.
Figures 11 and 12 display the detection outcomes of images in which starfish instances are densely and sparsely distributed, respectively. Notably, the YOLOv7-CHS model detects small targets more proficiently than the alternative models, irrespective of the density of targets within the images. These observations underscore the YOLOv7-CHS model's efficacy in accurately identifying and localizing small objects across variations in target density. Consequently, the YOLOv7-CHS model exhibits promising capabilities for object detection, particularly in scenarios involving small underwater targets.

Results on the DUO Dataset
Table 5 presents a comparative experiment evaluating various target detection models on the DUO dataset. The results indicate that these models achieve a limited detection accuracy for the scallop category. We surmise that this discrepancy arises from the scarcity of scallop instances in the dataset, leading to the inadequate learning of this specific category during the training phase. Consequently, we can infer that the detection accuracy of a model for any given category is correlated with the number of instances available within the dataset. Regarding the overall results, our model achieved an exceptional mAP of 84.6%, surpassing all other detection models. Additionally, our model exhibited the highest AP in every individual category. Figure 13 visually portrays the detection outcomes of each target detection model on the DUO dataset. Significantly, our model demonstrated a superior performance by detecting a more comprehensive range of targets with greater confidence. A further analysis revealed that the YOLOv7-CHS model also enhances the detection accuracy on the DUO dataset. This finding underscores the adaptability of the YOLOv7-CHS model in diverse underwater environments.

Figure 14 illustrates the confusion matrices for the YOLOv7 and YOLOv7-CHS models. Darker color blocks along the diagonal of the confusion matrix indicate a higher accuracy in the model's detection results. It is evident that the confusion matrix of the YOLOv7 network exhibits a more scattered color block distribution, with lighter color blocks along the diagonal. This suggests that the model has a higher error rate in detecting various classes of objects. In contrast, the confusion matrix of the YOLOv7-CHS network displays a darker color on the diagonal and a more concentrated color block distribution. These characteristics indicate an improvement in accuracy for the enhanced model, particularly in detecting small underwater targets. Therefore, based on the comparison of these confusion matrices, it can be concluded that the YOLOv7-CHS model outperforms the original model in detecting objects across all classes.

Conclusions
This paper presented a comparative analysis of various image enhancement algorithms to assess their effect on target detection accuracy. The findings indicate that there is no positive correlation between image enhancement techniques and improved detection accuracy. Subsequently, it introduced the YOLOv7-CHS model, which incorporates the HOSI module, CT3 module, and SPFA module into the YOLOv7 architecture. The comparative results demonstrate that this model has a lower computational load, with only 40.3 G FLOPs, which is 62.9 G less than the original YOLOv7 model. This reduction in FLOPs is advantageous for model deployment. The evaluation on the Starfish dataset reveals that the YOLOv7-CHS model achieved an mAP of 52.8%, which is 4.5% higher than the performance of the YOLOv7 model. The detection speed reached 32 FPS, enabling real-time detection capabilities. Moreover, when applied to the DUO dataset, the YOLOv7-CHS model achieved an mAP of 84.6%, which is 4.2% better than the YOLOv7 model. This demonstrates the model's adaptability to diverse underwater environments and highlights its robustness and generalization capabilities.

Despite the significant advancements achieved by the YOLOv7-CHS model in underwater object detection, further improvements are warranted, particularly in terms of speed and model size. Future research should focus on optimizing the proposed model to enhance its performance in small target detection tasks within oceanic settings. By addressing these areas of improvement, we aim to contribute to the broader field of underwater object detection and pave the way for practical applications in various domains.

Figure 4.
Figure 4. Structure diagram of the HOSI module.

Figure 5.
Figure 5. Structure diagram of the CT module. The "*" denotes the local matrix multiplication.

J. Mar. Sci. Eng. 2023, 11, x FOR PEER REVIEW

Figure 6.
Figure 6. Structure diagram of the CT3 module.

Figure 9.
Figure 9. The distribution of instances for the DUO dataset.

Figure 11.
Figure 11. Detection results on dense targets. (a) Original image. (b) Results of the Swin-Transformer detector. (c) Results of the YOLOv5s detector. (d) Results of the YOLOv7-tiny detector. (e) Results of the YOLOv7 detector. (f) Results of the YOLOv7-CHS detector.

Figure 12.
Figure 12. Detection results on sparse targets. (a) Original image. (b) Results of the Swin-Transformer detector. (c) Results of the YOLOv5s detector. (d) Results of the YOLOv7-tiny detector. (e) Results of the YOLOv7 detector. (f) Results of the YOLOv7-CHS detector.

Figure 13.
Figure 13. Detection results on the DUO dataset. (a) Original image. (b) Results of the Swin-Transformer detector. (c) Results of the YOLOv4s-mish detector. (d) Results of the YOLOv5s detector. (e) Results of the YOLOv7-tiny detector. (f) Results of the YOLOv7 detector. (g) Results of the YOLOv7-CHS detector.

Table 1.
Comparison experiments for image enhancement. Bold numbers indicate the best result in each column.

Table 2.
Ablation experiments. In columns 2 through 4, the "×" sign indicates that the module in row 1 was not added to the YOLOv7 network, and the "✓" sign indicates that it was added.

Table 3.
Experimental results of the comparison of the SGD and Adam optimizers.

Table 4.
Comparisons of different object detectors on the Starfish dataset. Bold numbers indicate the best result in each column.

Table 5.
Comparisons of different object detectors on the DUO dataset. Bold numbers indicate the best result in each column.