DSW-YOLOv8n: A New Underwater Target Detection Algorithm Based on Improved YOLOv8n

Abstract: Underwater target detection is widely used in applications such as underwater search and rescue, underwater environment monitoring, and marine resource surveying. However, the complex underwater environment, including factors such as light changes and background noise, poses a significant challenge to target detection. We propose an improved underwater target detection algorithm based on YOLOv8n to overcome these problems. Our algorithm focuses on three aspects. Firstly, we replace the original C2f module with Deformable Convnets v2 to enhance the adaptive ability of the convolution kernel to the target region in the feature map and extract the target region's features more accurately. Secondly, we introduce SimAm, a non-parametric attention mechanism, which can deduce and assign three-dimensional attention weights without adding network parameters. Lastly, we optimize the loss function by replacing the CIoU loss function with the Wise-IoU loss function. We name our new algorithm DSW-YOLOv8n, an acronym of Deformable Convnets v2, SimAm, and Wise-IoU applied to the improved YOLOv8n. To conduct our experiments, we created our own underwater target detection dataset; we also used the Pascal VOC dataset to evaluate our approach. On underwater target detection, the mAP@0.5 and mAP@0.5:0.95 of the original YOLOv8n algorithm were 88.6% and 51.8%, respectively, while DSW-YOLOv8n reaches 91.8% and 55.9%. On the Pascal VOC dataset, the original YOLOv8n achieved 62.2% and 45.9% mAP@0.5 and mAP@0.5:0.95, respectively, while DSW-YOLOv8n achieved 65.7% and 48.3%. The number of model parameters is reduced by about 6%. These experimental results demonstrate the effectiveness of our method.


Introduction
The efficient use of computer vision technology to explore the unknown underwater domain is one of the most active research fields. Due to the dynamic and changeable underwater visual environment, we must advance visual recognition, tracking, and dynamic perception algorithms to deal with its complex challenges [1,2]. Effectively utilizing marine resources can help prevent the overexploitation and destruction of terrestrial resources. In underwater engineering applications and research exploration, an efficient and accurate target detection and recognition algorithm is needed for underwater unmanned vehicles and mobile devices [3,4]. A more robust target detection algorithm can, of course, be applied not only to underwater target detection but also to other scenarios, including autonomous driving and unmanned aerial vehicles [5,6].
However, the complex underwater environment can affect detection results. Factors such as a lack of light due to weather conditions and changes in underwater brightness caused by water depth increase the difficulty of underwater target detection [7,8]. Some researchers have considered using artificial light sources to compensate for these challenges, but under certain conditions this approach may produce bright spots and worsen the scattering from suspended underwater particles, which has a negative impact.
Considering the complexity of the underwater environment, we need to develop a target detection algorithm suitable for underwater equipment, one whose advantages are high precision and low computation [9][10][11]. The YOLO series of target detection algorithms is known for achieving a good balance between detection accuracy and speed [12][13][14][15]. This paper focuses on improving the performance of the YOLOv8n algorithm in three aspects: (1) To improve adaptability to object deformations and enable more precise convolutional operations, we replace certain C2f modules in the YOLOv8n backbone feature extraction network with Deformable Convolution v2 modules. (2) We introduce an attention mechanism (SimAm) into the network structure, which introduces no extra parameters but assigns a 3D attention weight to the feature map. (3) We address a problem with the loss function: discrepancies between the direction of the prediction boxes and the ground truth bounding boxes may cause the position of the prediction box to oscillate during training, slowing convergence and lowering prediction accuracy. To overcome this, we use the WIoU v3 loss function to further improve the network structure.

Related Work

Object Detection Algorithms
YOLOv8 can flexibly support a variety of computer vision tasks; in the field of target detection in particular, the YOLOv8 object detection model stands out as one of the top-performing models. YOLOv8 was built upon the YOLOv5 model, introducing a new network structure and incorporating the strengths of previous YOLO series algorithms and other state-of-the-art design concepts in target detection algorithms [16]. While YOLOv8 still utilizes the DarkNet53 structure in its network architecture, certain parts of the structure have been fine-tuned. For instance, the C3 module in the feature extraction network is replaced by C2f with a residual connection, which includes two convolutional cross-stage partial bottlenecks. This modification allows for the fusion of advanced features and contextual information, resulting in enhanced detection accuracy. Additionally, the model structure of YOLOv8 sets different channel numbers for each version to enhance the model's robustness in handling various types of detection tasks. In the Head section, YOLOv8 continues the anchor-free mechanism found in YOLOv6 [17], YOLOv7 [18], YOLOX [19], and DAMO-YOLO [20]. This mechanism reduces the computational resources required by the model and decreases the overall time consumption. YOLOv8 also draws inspiration from the design ideas of YOLOX, using a Decoupled Head, which improves model detection accuracy by about 1%. This design allows each branch to focus on its own prediction task, thereby improving the performance of the model. The loss function in YOLOv8 consists of two parts: sample matching and loss calculation. The loss includes a category loss and a regression loss, the latter comprising two parts: Distribution Focal Loss and CIoU loss [21].
Target detection algorithms can be categorized into one-stage and two-stage algorithms. Two-stage algorithms, represented by Faster R-CNN [22], are known for their slower processing speed, which makes them unsuitable for real-time target identification and detection. One-stage algorithms, including the YOLO series and the DETR series, offer significant advantages, although the DETR [23] network model is large, difficult to train, and exhibits poor detection of small targets. To some extent, YOLO series algorithms excel in underwater target detection. Researchers have done considerable work on the YOLO series of object detection algorithms. Lou et al. [24] proposed a new downsampling method on the basis of YOLOv8, which better retains contextual feature information and improves the feature network to better combine shallow and deep information. Zhang et al. [25] introduced the global attention mechanism into the YOLOv5 model to strengthen the feature extraction ability of the backbone network for key regions and introduced a multi-branch reparameterized structure to improve multi-scale feature fusion. Lei et al. [26] used the Swin Transformer as the backbone network of YOLOv5 and then improved the PAnet multi-scale feature fusion method and the confidence loss function, which effectively improved object detection accuracy and the robustness of the model. In this paper, we improve the network structure of YOLOv8n with Deformable Convnets v2, add a parameter-free attention mechanism, and optimize the loss function. DSW-YOLOv8n can be divided into three parts: Backbone, Neck, and Detect. The Backbone consists of various convolutional modules. The Neck includes upsampling and concatenation operations in addition to convolutional modules. The network provides three prediction outputs for objects of different sizes. Finally, the predicted results are used to calculate the loss. The network structure of DSW-YOLOv8n is shown in Figure 1.


Fusion of Deformable Convolutional Feature Extraction Network
Deformable Convolution v2 [27] is an improved version of Deformable Convolution v1 [28], which further enhances and optimizes the previous method. A common convolution module uses convolution filters of fixed size and shape. During the feature extraction process, however, the convolution kernel may not align perfectly with the target region and may include excess background noise. In comparison, Deformable Convolution v2 introduces additional offsets, allowing the convolution operations to better align with the target region in the feature map. This enhancement provides improved modeling capability in two complementary forms. Firstly, it extends the use of deformable convolutional layers throughout the network. By incorporating more convolutional layers with adaptive learning, Deformable Convolution v2 can effectively control sampling across a wider range of feature levels. Secondly, an adjustment mechanism is introduced which not only enables each sample to undergo a learned shift but also adaptively adjusts the amplitude of the learned target features.
Compared with traditional convolution modules, deformable convolution is superior in feature extraction accuracy. In the network structure of YOLOv8n, we adjusted some nodes and replaced the C2f modules at positions six and eight in the backbone with Deformable Convnets v2 modules, which effectively enhances the robustness of the model. The difference between a common convolution module and Deformable Convolution v2 is shown in Figure 2.
The output of the feature map obtained by a common convolution is computed as shown in Equation (1):

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)  (1)

R represents the sampling grid of the convolution kernel; it also represents the area of the feature map over which convolution operations are performed. p_0 represents the position of the center point of the convolution kernel, while p_n represents the position of the other sampling points relative to p_0. w(p_n) represents the weight value at position n, and x(p_0 + p_n) represents the pixel value at that position.

The calculation formula of Deformable Convnets v2 is shown in Equation (2):

y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k  (2)

Whereas the common convolution region R is fixed, the deformable convolution region changes as the target changes, so K is the number of sampling positions of a variable kernel, p represents the position of the center point of the convolution kernel, and p_k is the position of the other sampling points relative to p. \Delta p_k and \Delta m_k represent the learnable offset and the modulation scalar at position k. Because \Delta p_k is a real number with an unconstrained range, p + p_k + \Delta p_k may be fractional, in which case bilinear interpolation is used to sample the feature map; the modulation scalar \Delta m_k is constrained to the range [0, 1]. p_n and p_k play the same role: each represents the position of the pixels in the convolution region during the respective convolution operation.
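To make Equation (2) concrete, the following minimal NumPy sketch implements modulated deformable sampling for a single channel and a 3×3 kernel (the function and array names are our own, not from the paper); with zero offsets and a modulation of one it reduces to an ordinary convolution.

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly sample img (H, W) at fractional coordinates (y, x), zero-padded."""
    H, W = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    def val(yy, xx):
        return img[yy, xx] if 0 <= yy < H and 0 <= xx < W else 0.0
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * val(y0, x0) + (1 - wy) * wx * val(y0, x0 + 1)
            + wy * (1 - wx) * val(y0 + 1, x0) + wy * wx * val(y0 + 1, x0 + 1))

def deform_conv2d_toy(x, weight, offsets, mask):
    """Modulated deformable convolution (Equation (2)) for one channel, stride 1.

    offsets: (9, H, W, 2) learnable offsets Δp_k; mask: (9, H, W) modulation Δm_k.
    """
    H, W = x.shape
    r = weight.shape[0] // 2
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    out = np.zeros((H - 2 * r, W - 2 * r))
    for i in range(r, H - r):
        for j in range(r, W - r):
            acc = 0.0
            for n, (dy, dx) in enumerate(grid):
                oy, ox = offsets[n, i, j]          # learnable offset Δp_k
                m = mask[n, i, j]                  # modulation Δm_k in [0, 1]
                acc += weight[dy + r, dx + r] * m * bilinear(x, i + dy + oy, j + dx + ox)
            out[i - r, j - r] = acc
    return out
```

In practice this operation is provided by optimized library kernels; the loop form is only meant to mirror the summation in Equation (2) term by term.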


Simple and Efficient Parameter-Free Attention Mechanism
Attention mechanisms are widely applied in both computer vision and natural language processing. In particular, high-resolution image processing tasks often face information processing bottlenecks. Drawing inspiration from human perception processes, researchers have been exploring selective visual attention models. We compare SimAm with common attention mechanisms, including CBAM [29,30], SE [31], and ECA [32]. SimAm improves model accuracy without adding extra redundancy to the network, whereas CBAM and SE increase the number of parameters by 9.23% and 9.6%, respectively, on ResNet101 [33]; worse still, ECA's increase in the number of parameters is nearly three times that of the model. The channel attention mechanism compresses global information and learns from each channel dimension, assigning different weights to different channels through an excitation method. The spatial attention mechanism, on the other hand, combines global information to process important parts, transforming spatial data and automatically selecting the more important area features. These two mechanisms correspond to 1D and 2D attention, respectively. Underwater target detection differs from conventional target detection due to its susceptibility to illumination changes. One contributing factor is the varying light intensity caused by different weather conditions and times of day. Light transmission in water is also affected by absorption, reflection, scattering, and serious attenuation, which directly limits the visible range of underwater images and leaves them blurred, with low contrast, color incongruity, background noise, and other problems. To reduce the impact of these conditions, we added the SimAm attention mechanism [34] to layer 10 of the backbone. This parameter-free attention mechanism is simple and efficient: most of the operators are selected based on the energy function, and no additional adjustments to the internal network structure are required [35]. The features with full 3D weights are shown in Figure 3.
SimAm is inspired by neuroscience theory; the parameter-free attention mechanism establishes an energy function to obtain the importance of each neuron. The calculation formula is shown in Equation (3):

e_t(w_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1} \sum_{i=1}^{M-1} (y_o - \hat{x}_i)^2  (3)
The linear transformations of t and x_i are represented by \hat{t} = w_t t + b_t and \hat{x}_i = w_t x_i + b_t, respectively. Here, w_t and b_t denote the weight and bias of the transformation. To simplify the formula, binary labels (1 and −1) are used for y_t and y_o, and a regularization term is added to the equation. The energy function is defined as shown in Equation (4):

e_t(w_t, b_t, y, x_i) = \frac{1}{M-1} \sum_{i=1}^{M-1} \left(-1 - (w_t x_i + b_t)\right)^2 + \left(1 - (w_t t + b_t)\right)^2 + \lambda w_t^2  (4)
Theoretically, each channel has M energy functions, where M = H × W. Iteratively solving this equation would require substantial computational resources; a closed-form solution for w_t and b_t offers a better optimization, as shown in Equation (5):

w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}, \quad b_t = -\frac{1}{2}(t + \mu_t) w_t  (5)
The mean value \mu_t = \frac{1}{M-1} \sum_{i=1}^{M-1} x_i and variance \sigma_t^2 = \frac{1}{M-1} \sum_{i=1}^{M-1} (x_i - \mu_t)^2 of the other neurons in the channel can be calculated from these formulas, where \lambda represents the regularization parameter. The solution in Equation (5) is obtained on a single channel, so it is reasonable to assume that all pixels in the channel follow the same distribution. We can therefore calculate the mean and variance over all neurons once and reuse them for every neuron on the channel, which greatly reduces the amount of computation. The minimum energy function is then given by Equation (6):

e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}  (6)
A lower value of e_min means that the difference between neuron t and the other neurons is more pronounced, and therefore that the neuron is more important; the importance of each neuron can be obtained from e_min. Our approach treats each neuron individually and integrates this linear separability into an end-to-end framework, as shown in Equation (7):

\widetilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X  (7)
E is the energy function on each channel; it groups all e_min values across the channel and spatial dimensions. The sigmoid activation function is used to prevent the value of 1/E from becoming too large. SimAm can be flexibly and easily applied to other target detection algorithms; integrating it into the backbone network of YOLOv8n effectively refines the features of the channel and spatial domains, thereby significantly improving the accuracy of object detection without increasing the complexity or computing resources of the network [36].
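As a sketch of how Equations (6) and (7) turn into code, the PyTorch module below is our own minimal version mirroring the public SimAm implementation (the class name and the `lam` argument for the regularization parameter λ are our choices, not from the paper): it computes the inverse minimal energy per activation and applies the sigmoid-weighted refinement.

```python
import torch
import torch.nn as nn

class SimAm(nn.Module):
    """Parameter-free attention: weights each activation by the closed-form
    minimal energy of Equation (6) and the sigmoid refinement of Equation (7)."""

    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam  # regularization parameter λ

    def forward(self, x):
        # x: (N, C, H, W); each channel has M = H * W neurons
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (t - μ̂)² per neuron
        v = d.sum(dim=(2, 3), keepdim=True) / n             # channel variance σ̂²
        e_inv = d / (4 * (v + self.lam)) + 0.5              # 1 / e*_t
        return x * torch.sigmoid(e_inv)                     # X̃ = sigmoid(1/E) ⊙ X
```

Because the module has no learnable parameters, it can be dropped into any backbone layer without changing the parameter count.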

Loss Function with Dynamic Focusing Mechanism
The loss function is essential for improving the performance of the model. Traditional loss functions consider only the overlap between the predicted and ground truth bounding boxes, not the region between them. If there is no intersection between the predicted and ground truth boxes, the loss cannot discriminate between predictions, which is especially troublesome for small target detection; the network model then cannot be optimized, causing variations in the evaluation results [37,38]. In the YOLOv8n network model, Distribution Focal Loss and the CIoU loss are employed as the loss functions. The CIoU loss function adds a detection box scale loss and a length-to-width ratio loss on top of the DIoU loss function; these enhancements contribute to improved accuracy in regression prediction. However, the CIoU loss function requires more computational resources during model training within the original YOLOv8n network structure. Furthermore, datasets may contain low-quality samples with background noise, uncoordinated length-to-width ratios, and other geometric factors that further aggravate the negative impact on training, and CIoU cannot eliminate the negative impact of these geometric factors. We therefore improved our model by replacing CIoU with Wise-IoU [39].

WIoU v1
Low-quality datasets inevitably have a negative impact on the model, usually through geometric factors such as distance and aspect ratio. Therefore, we used WIoU v1, which constructs two layers of attention based on the distance metric, as in Equations (8) and (9) [40]:

L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}  (8)

R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^*}\right)  (9)
where R_WIoU ∈ [1, e), which can significantly enlarge the L_IoU of the anchor box. W_g and H_g are the width and height of the minimum enclosing box. By detaching W_g and H_g from the computation graph (the superscript * indicates this operation), gradients that hinder convergence are prevented without introducing new conditions such as an aspect ratio [39].

WIoU v2
WIoU v2 borrows the design of Focal Loss to construct a monotonic focusing coefficient on the basis of WIoU v1. However, introducing this monotonic focusing coefficient creates another problem: the gradient changes during backpropagation. The gradient gain decreases as L_IoU decreases, which causes the model to take more time to converge in the later stages of training. Therefore, the mean of L_IoU is taken as a normalization factor, which is a good way to speed up later convergence; here the mean of L_IoU acts as an exponential running average with momentum [41].

WIoU v3
The quality of an anchor box is reflected by defining an outlier degree: a high-quality anchor box has a smaller outlier value. Assigning a smaller gradient gain to higher-quality anchor boxes focuses the bounding box regression on ordinary-quality anchor boxes, while assigning a small gradient gain to anchor boxes with large outlier values reduces the large harmful gradients produced by low-quality samples. Based on WIoU v1, a non-monotonic focusing coefficient β is constructed, and the gradient gain is highest when β equals a constant value. Because the mean of L_IoU is dynamic, the quality criterion for anchor boxes is also dynamic, which allows WIoU v3 to dynamically adjust its gradient gain allocation strategy.
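The pieces above can be sketched in plain Python as follows. Boxes are (x1, y1, x2, y2); the focusing hyperparameters `alpha` and `delta` and the running mean of L_IoU follow the Wise-IoU paper and are assumptions on our part, not values stated here.

```python
import math

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def wiou_v3_loss(pred, gt, mean_iou_loss, alpha=1.9, delta=3.0):
    """WIoU v3 for one box pair: R_WIoU * L_IoU scaled by the
    non-monotonic focusing coefficient r = β / (δ · α^(β − δ))."""
    l_iou = 1.0 - iou(pred, gt)
    # center distance between prediction and ground truth (Equation (9) numerator)
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # smallest enclosing box size (W_g, H_g), treated as detached constants (*)
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((px - gx) ** 2 + (py - gy) ** 2) / (wg ** 2 + hg ** 2))
    l_v1 = r_wiou * l_iou                         # WIoU v1
    beta = l_iou / mean_iou_loss                  # outlier degree of the anchor box
    r = beta / (delta * alpha ** (beta - delta))  # non-monotonic focusing coefficient
    return r * l_v1
```

In training, `mean_iou_loss` would be the momentum-smoothed running mean of L_IoU over the batch, so the outlier degree β (and hence the gain allocation) changes as the model improves.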

Experiment

Underwater Target Detection Dataset
We validate our methods using the Pascal VOC dataset and a self-built underwater target detection dataset. The Target Recognition Group of the China Underwater Robot Professional Competition (URPC) provided the majority of the 1585 photos that make up our underwater target detection dataset; the remaining images were gathered from the publicly accessible collection on the whale community platform. The dataset contains seven categories: jellyfish, fish, sea urchins, scallops, sea grass, sharks, and sea turtles. Every image in the collection was annotated with LabelImg software, and the annotations are all in YOLO format. The dataset is randomly split 7:2:1 into training, test, and validation sets. Table 1 presents a complete overview of the underwater target detection dataset, including the total number of samples in each split and for each category. Figure 4 displays a sample of the 1585-image collection. We thoroughly examined the training set during the experiment's training phase; the training set is visualized in Figure 5. The quantity of samples in each category is shown in Figure 5a, and the size and quantity of ground truth boxes in the target area are shown in Figure 5b. It is evident that the dataset has a disproportionately high percentage of small targets. Figure 5c,d assess, respectively, the distribution of the target area's center points and the aspect ratio of the image labels over the entire image.


Experimental Configuration and Environment
Our experiment used the Python programming language and the PyTorch deep learning framework, with Ubuntu 18.04 as the operating system. The hardware setup is displayed in Table 2 below. The following hyperparameters were used during training: the input image size is 640 × 640, the total number of training epochs is 200, and the batch size is 16. The model is optimized with Stochastic Gradient Descent (SGD); the initial learning rate is set to 0.01, the momentum to 0.973, and the weight decay to 0.0005. For training data processing, we used a Mosaic data augmentation strategy and turned it off for the last ten epochs [19]. This strategy randomly crops four images, rescales them, and stitches them into a single image.
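A simplified sketch of the Mosaic strategy described above (the function name and the nearest-neighbor resize are our own simplifications; a real pipeline also remaps the bounding box labels into the mosaic's coordinates):

```python
import numpy as np

def mosaic(images, out_size=640):
    """Join four images into one mosaic around a random center point."""
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # random center dividing the canvas into four regions
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y1, y2, x1, x2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # nearest-neighbor resize via index sampling keeps the sketch dependency-free
        yi = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xi = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[yi][:, xi]
    return canvas
```

Because each of the four source images lands in a differently sized region, the augmentation simultaneously varies object scale and context, which is why it is disabled near the end of training to let the model see undistorted images.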

Model Evaluation Metrics
We used the recall, average detection time, mean average precision (mAP), number of parameters, and floating-point operations per second (FLOPS) to evaluate the performance of the DSW-YOLOv8n model. Precision and Recall are defined in Equations (13) and (14):

\mathrm{Precision} = \frac{TP}{TP + FP}  (13)

\mathrm{Recall} = \frac{TP}{TP + FN}  (14)
TP and FP are the numbers of samples correctly and incorrectly predicted as positive, and FN is the number of positive samples incorrectly predicted as negative. Average precision (AP) represents the average accuracy of the model for one class, and mean average precision (mAP) is the average of the AP values over all classes, where x denotes the number of classes in the dataset. The calculation formulas are shown in Equations (15) and (16), respectively:

AP = \int_0^1 P(R) \, dR  (15)

mAP = \frac{1}{x} \sum_{i=1}^{x} AP_i  (16)
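The metrics above can be sketched as follows (the helper names are our own; `average_precision` uses the all-point interpolation commonly paired with mAP-style evaluation):

```python
def precision_recall(tp, fp, fn):
    """Equations (13) and (14): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Equation (15): area under the precision-recall curve,
    using the monotone (all-point interpolated) precision envelope."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):   # enforce a non-increasing envelope
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(aps):
    """Equation (16): mAP is the mean of the per-class AP values."""
    return sum(aps) / len(aps)
```

For mAP@0.5:0.95, this AP computation is simply repeated at each IoU threshold from 0.5 to 0.95 in steps of 0.05 and the results averaged.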

Comparison of Experimental Results of Different Model
To demonstrate the superiority of DSW-YOLOv8n, we conducted a comparative experimental study against other YOLO-series detection models: DAMO-YOLO, YOLOv7, YOLOX, and the original YOLOv8n. The experimental findings in Table 3 include metrics such as FLOPS (the number of floating-point operations per second) and Params (the number of model parameters). We also assessed the mean average precision (mAP) at different IoU levels: mAP@0.5 is the average over all categories at an IoU threshold of 0.5, and mAP@0.5:0.95 is the mAP averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. In the comparison experiment, all models used default parameters and an input image size of 640 × 640. Notably, the DSW-YOLOv8n model exhibited a 3.2% increase in mAP@0.5 and a 4.1% increase in mAP@0.5:0.95 compared to the original YOLOv8n, while the number of parameters was reduced by 6.1%. Compared with other mainstream target detection algorithms, the mAP@0.5 of DSW-YOLOv8n was 8.3%, 10.3%, and 19% higher than that of YOLOv7, YOLOX, and DAMO-YOLO, respectively, and its mAP@0.5:0.95 was 9.6%, 13.2%, and 18.7% higher, respectively [17–20].
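For clarity, the IoU criterion and the 0.5:0.95 threshold sweep behind these metrics can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the experiments; `map_at` stands in for any per-threshold mAP evaluator.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# mAP@0.5:0.95 averages mAP over ten IoU thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = np.linspace(0.5, 0.95, 10)

def map_50_95(map_at):
    """Average a per-threshold mAP evaluator over the ten thresholds."""
    return float(np.mean([map_at(t) for t in THRESHOLDS]))
```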

Comparison of Ablation Experiments
We tested each module of DSW-YOLOv8n in the ablation experiment and examined how each one affected the model; for the loss function, the best-performing WIoU v3 was selected. Table 4 presents the outcomes. According to the results, Deformable Convnets v2, SimAm, and WIoU v3 individually increased the model's mAP@0.5 by 2.4%, 1.6%, and 3%, respectively, and its mAP@0.5:0.95 by 2.2%, 3.4%, and 2%. Added on top of WIoU v3 and of the combination with Deformable Convnets v2, SimAm further increased mAP@0.5 by 2.9% and 0.1%, and mAP@0.5:0.95 by 2.5% and 3%, respectively [25]. Overall, Deformable Convnets v2 increased the model's detection accuracy. We used the Grad-CAM [41] images in Figure 6 to visually contrast the effect before and after adding the SimAm module. Figure 6a depicts the original input image, Figure 6b the standard output image, Figure 6c the heat map after adding the SimAm module, and Figure 6d the heat map output of the last layer of the backbone network. Comparing the heat maps in Figure 6b,c shows that, after adding the SimAm module, the target region becomes more prominent in the output image [34]. In the second ablation experiment, we specifically targeted the Wise-IoU function to observe its enhancement effect on the model; Table 5 displays the outcomes. WIoU v1, WIoU v2, and WIoU v3 were each added on top of the model with Deformable Convnets v2 and SimAm. Compared with WIoU v1 and WIoU v2, the experimental results show that the mAP@0.5 of WIoU v3 increases by 0.86% and 0.7%, and its mAP@0.5:0.95 by 0.1% and 0.6%, respectively, while the average detection time per image changes by only 0.03 ms and 1.9 ms. This comparison shows that WIoU v3 enhances the performance of our model. To display the prediction results, we selected four situations representing various object categories. Figure 7a,b shows the detection of small targets and of objects with low visibility and high density, respectively, while Figure 7c,d shows target detection and recognition in a general situation. The outcomes in Figure 7 contain no missed or erroneous detections, demonstrating the reliability of DSW-YOLOv8n.
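To make the loss comparison concrete, the following is a hedged sketch of the simplest Wise-IoU variant, WIoU v1, based on our reading of the Wise-IoU formulation; the WIoU v3 used in these ablations adds a dynamic non-monotonic focusing coefficient on top, which is omitted here. `wiou_v1` is an illustrative helper name, not part of any framework API.

```python
import math

def wiou_v1(box, gt):
    """Sketch of the WIoU v1 loss: a distance-based attention term R
    scales the IoU loss, L = R * (1 - IoU). Boxes are (x1, y1, x2, y2).
    Note: in the original formulation the enclosing-box size term is
    detached from the gradient; here it is just a plain value."""
    # IoU of the predicted box and the ground truth box.
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box[2] - box[0]) * (box[3] - box[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    iou = inter / union
    # Dimensions Wg, Hg of the smallest enclosing box.
    wg = max(box[2], gt[2]) - min(box[0], gt[0])
    hg = max(box[3], gt[3]) - min(box[1], gt[1])
    # Center-distance attention term R.
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    r = math.exp(((cx - gx) ** 2 + (cy - gy) ** 2) / (wg ** 2 + hg ** 2))
    return r * (1.0 - iou)
```

A perfectly aligned prediction gives R = e⁰ = 1 and IoU = 1, so the loss is zero, while poorly centered boxes are penalized more heavily than plain 1 − IoU.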

On the Pascal VOC dataset, the inclusion of the Deformable Convnets v2, SimAm, and WIoU v3 modules led to mAP@0.5 improvements of 2.5%, 1.9%, and 1.3%, respectively, demonstrating that the three methods effectively enhance detection accuracy. Comparing the number of parameters, adding Deformable Convnets v2 resulted in a 4.8% reduction, while the SimAm module improved detection accuracy and recall without altering the parameter count. We also analyzed the effect of WIoU v1, WIoU v2, and WIoU v3 on the model with Deformable Convnets v2 and SimAm. Table 6 shows that WIoU v3 achieved the highest detection accuracy, with mAP@0.5 and mAP@0.5:0.95 being 3.5% and 2.4% higher than the original model, respectively. To visually observe the impact of the three versions of Wise-IoU, we plotted the mAP@0.5 accuracy and DFL-loss curves in Figure 8. The red curve represents the performance after integrating Deformable Convnets v2, SimAm, and WIoU v3, indicating that the model has reached an optimal state: compared to WIoU v1, mAP@0.5 and mAP@0.5:0.95 improved by 1.2% and 0.6%, with a 1% improvement relative to WIoU v2. The experimental results on the Pascal VOC2012 dataset align with those on our own underwater target dataset, confirming the effectiveness of DSW-YOLOv8n.


Conclusions
In this paper, we discuss the difficulties caused by subpar underwater image quality and complex environmental changes. We propose three improvements to YOLOv8n to address these problems and conducted comparison studies, ablation experiments, and validation on two datasets using various methods and combinations. The experimental findings show that the three methods further optimize the model and improve its accuracy.
Firstly, we enhance the feature extraction capability of the backbone network by replacing the two-layer convolutional module with Deformable Convnets v2. This improvement increased the mAP@0.5 on the two datasets by 2.4% and 2.5%, respectively. Although the number of parameters was reduced by 6%, the floating-point computation increased by 4%, which is not the desired outcome. We provide a detailed analysis of the principle and formula of Deformable Convnets v2 in Equation (2), which involves adding an extra offset ∆p_k computed through forward inference and back propagation; obtaining ∆p_k slightly increases the computational effort of DSW-YOLOv8n. Secondly, we introduce the flexible and efficient SimAm module in the last layer of the backbone. The core idea behind SimAm is to assign attention weight vectors to different positions of the input feature map. By refining the channel weight allocation, the model focuses more on the target region without adding extra parameters, and mAP@0.5 increases by 1.9% on both the underwater target detection dataset and the Pascal VOC dataset. Finally, we optimize the loss function by using a dynamic non-monotonic focusing bounding box loss instead of the original CIoU. This modification effectively mitigates the negative impact of low-quality data on the model. Through ablation experiments, we demonstrate that WIoU v3 outperforms WIoU v1 and WIoU v2 in terms of improvement effect, average detection speed, and detection accuracy. As a result, mAP@0.5 improves by 3.3% and 1.3% on the underwater target detection dataset and the Pascal VOC dataset, respectively.
However, our work still has some imperfections, and the computational requirements of the model have slightly increased. In future work, we will continue to optimize DSW-YOLOv8n. Considering potential applications on mobile devices, the computational load is an important factor; our future goal is therefore to explore methods for effectively reducing the amount of floating-point computation in the model and to develop a more lightweight object detection model.

Figure 4 .
Figure 4. Some sample images of the underwater target detection dataset.

Figure 5 .
Figure 5. Analysis and presentation of the underwater target detection dataset: (a) bar chart of the number of samples in each class of the training set; (b) size and quantity of the ground truth boxes; (c) position of the center point relative to the image; (d) ratio of the height and width of the object relative to the image.


Figure 6 .
Figure 6. Grad-CAM images of DSW-YOLOv8n. (a) Original image with a fish and a sea turtle; (b) before adding SimAm; (c) result after adding SimAm; and (d) output of the last layer of the backbone.

Figure 7 .
Figure 7. Detection results of DSW-YOLOv8n. The detection results for tiny targets and for targets in challenging underwater environments are shown in (a,b), while the detection results for a typical situation are shown in (c,d).

4.3. Pascal VOC Dataset Experimental Results
The PASCAL Visual Object Classes (VOC) challenge is a world-class open computer vision benchmark whose dataset can be applied to classification, localization, detection, segmentation, and action recognition tasks. To further validate our method, we used the Pascal VOC2012 dataset, which consists of 17,125 images across 20 categories. Our experiment divided the dataset into a training set (12,330 images), a test set (3425 images), and a validation set (1370 images), following a 7:2:1 ratio. The hyperparameters used during model training were consistent with the experiment on the underwater target detection dataset. Due to the larger size of the Pascal VOC2012 dataset and slower model convergence, we increased the number of training epochs to 300. The detailed experimental results are presented in Table 6.

Figure 8 .
Figure 8. mAP@0.5 precision changes are shown in (a); DFL-Loss curve changes are shown in (b).

Table 1 .
Quantity of images and samples in the underwater target detection dataset.

Table 2 .
Experimental configuration and environment.

Table 3 .
The result of comparative experiments of different models.

Table 4 .
Ablation experiments of each method.

Table 6 .
Experimental results on the Pascal VOC dataset.