YOLOv6-ESG: A Lightweight Seafood Detection Method

Abstract: The rapid development of convolutional neural networks has significant implications for automated underwater fishing operations. Among these, object detection algorithms based on underwater robots have become a hot topic in both academic and applied research. Due to the complexity of underwater imaging environments, many studies have employed large network structures to enhance the model's detection accuracy. However, such models contain many parameters and consume substantial memory, making them less suitable for small devices with limited memory and computing capabilities. To address these issues, a YOLOv6-based lightweight underwater object detection model, YOLOv6-ESG, is proposed to detect seafood, such as echinus, holothurian, starfish, and scallop. First, a more lightweight backbone network is designed by rebuilding the EfficientNetv2 with a lightweight ODConv module to reduce the number of parameters and floating-point operations. Then, this study improves the neck layer using lightweight GSConv and VoVGSCSP modules to enhance the network's ability to detect small objects. Meanwhile, to improve the detection accuracy of small underwater objects with poor image quality and low resolution, the SPD-Conv module is also integrated into the two parts of the model. Finally, the Adan optimizer is utilized to speed up model convergence and further improve detection accuracy. To address the issue of interference objects in the URPC2022 dataset, data cleaning has been conducted, followed by experiments on the cleaned dataset. The proposed model achieves 86.6% mAP while the detection speed (batch size = 1) reaches 50.66 FPS. Compared to YOLOv6, the proposed model not only maintains almost the same level of detection accuracy but also achieves faster detection speed. Moreover, the number of parameters and floating-point operations reach the minimum levels, with reductions of 75.44% and 79.64%, respectively. These results indicate the feasibility of the proposed model in the application of underwater detection tasks.


Introduction
With the booming development of computer vision, underwater object detection technology based on optical images is widely used and plays a significant role in marine fisheries [1], aquaculture, marine pollution protection, underwater unexploded ordnance detection [2], and so on. In the field of marine fisheries, most of the traditional seafood collection methods are based on manual diving fishing, which is not only inefficient but also requires workers to have sufficient experience in diving and fishing. Due to the harmfulness of long-term underwater operations, the reduced labor force has led to a continuous increase in the cost of manual fishing operations. On the other hand, underwater scenes are inherently more complex than land scenes, so the images obtained by underwater cameras tend to have lower quality. Fishing operations face great challenges due to the time constraints of fishing and the impact of the marine environment. Therefore, have improved the accuracy to some extent, the overall precision is relatively lower, not more than 80%.
In 2023, Zhang et al. [21] proposed an optimized underwater object detection model based on the YOLOv4 algorithm. Expanding upon the original network, they introduced an additional prediction head to facilitate the detection of objects of different sizes. Additionally, they integrated a channel attention mechanism into the network. Furthermore, the K-means++ was applied to cluster anchor boxes and different activation functions were used to improve the model's performance. Through multiple integrated modules, the detection accuracy was elevated up to 91.1%. The enhancement came at the cost of a larger model size, which reached 182.7 MB. In the same year, Liu et al. [22] introduced a model based on YOLOv7 in this field. This model utilized the ACmixBlock module to replace the 3 × 3 convolutional block of the E-ELAN structure in the base model, while integrating skip connections and 1 × 1 convolutional block to enhance feature extraction capabilities. Simultaneously, they introduced the ResNet-ACmix module to prevent feature information loss and incorporated the Global Attention Mechanism (GAM) module to further amplify feature extraction. They employed the K-means++ algorithm to obtain anchor boxes as well. This series of enhancements led to the model demonstrating 89.6% mAP on the URPC2021 dataset and 97.4% mAP on the Brackish dataset. These improvements also resulted in an increase in the number of model parameters to 177.08 M. The latest two studies based on the YOLO algorithm have highly improved the detection performance on four types of seafood: holothurian, echinus, starfish, and scallop. On the other hand, these methods have concurrently led to a heightened complexity in the network architecture, resulting in an increase in both the number of model parameters and floating-point operations.
Most previous studies have mainly concentrated on incorporating extra feature extraction modules to improve detection accuracy. However, these networks tend to have a larger number of parameters and slow detection speeds, which makes it challenging to deploy them on embedded and mobile devices. Therefore, finding ways to reduce the model complexity and speed up the detection process while ensuring the detection accuracy has gained extensive attention. In 2021, Zhang et al. [23] utilized the lightweight MobileNetV2 and the depthwise separable convolution method to enhance the backbone network of YOLOv4, which was also combined with an attention feature fusion module. This approach achieved a favorable balance between detection time and accuracy on the PASCAL VOC dataset and the Brackish dataset. This method achieved 79.54% mAP on the URPC2020 dataset. In 2022, Yeh et al. [24] proposed a deep model for jointly learning color conversion and object detection for underwater images. Initially, they converted images into grayscale to solve underwater color absorption and to improve subsequent detection performance with lower computational complexity. Their dataset primarily included three classes of objects: fish, debris, and divers. In this study, an improved model based on Feature Pyramid Network (FPN) was utilized, achieving 89.56% mAP. On the Brackish dataset, the detection accuracy reached 80.12%, with a computation complexity of only 5.06 GFLOPs. This allowed the model to be easily deployed on small-scale computing devices. However, the testing results on the URPC dataset are not reported. Han et al. [25] used the CenterNet model integrating the EfficientNet-B3 network as the backbone to reduce the model's parameters. They also combined it with a scene feature fusion method, which effectively improved the model's detection accuracy on the holothurian dataset of URPC, but the method was not verified on other types of seafood. Wang et al. [26] proposed an improved lightweight underwater object detection method based on the YOLOX model. They combined BIFPN-S and FPN and effectively fused them with the features obtained from the backbone layer. The model was evaluated on the URPC dataset and its detection accuracy increased to 82.69%. Similarly in 2023, Shi et al. [27] incorporated ShuffleNet and attention mechanisms into the backbone network of YOLOv4 to reduce the number of model parameters, by using deep convolution to decrease the model size and the RFB-s module in the neck layer. The results evaluated on the holothurian dataset showed that these improvements reduced the model size to 49.2 M, but the detection accuracy also decreased from 93.12% to 92.01%. They did not validate on other types of seafood either. Generally, optimizations for lightweight models focus on improving the backbone and neck layers. One of the most common approaches is to replace the original large backbone network with a lightweight one, which can effectively reduce the number of model parameters and floating-point operations. However, this comes along with some loss of accuracy. Thus, other methods are further tailored to the dataset and object characteristics to retain or enhance the model's detection accuracy. These aim to make the proposed models easier to deploy on small devices, such as underwater robots.
This paper builds upon the basic model structure of the recent YOLOv6 [28] and optimizes its backbone and neck layers to better suit the characteristics of underwater datasets, which is named YOLOv6-ESG. The main contributions of this paper are as follows:

• This paper employs a more efficient lightweight network, EfficientNetv2 [29], and further integrates it with lightweight convolution (ODConv [30]) to rebuild the backbone network. This approach significantly reduces the number of parameters and floating-point operations of the model.
• In response to the problems of poor underwater image quality, low resolution, and difficulty in detecting small seafood, the SPD-Conv [31] (space-to-depth and non-strided convolution) module is utilized to further improve the detection accuracy of underwater objects.
• This paper incorporates the lightweight GSConv [32] and VoVGSCSP [32] modules as basic building blocks for the neck layer of YOLOv6-ESG. The experimental results demonstrate that this approach can effectively balance the accuracy and speed of the model, highlighting the effectiveness of these modules in the proposed model.
• During the training phase, a more efficient Adan [33] optimizer is used, which requires only half of the computing resources to achieve the optimal performance of the current model. This approach can further improve the detection accuracy of the model under the same computing resources.
• Through the analysis of experimental results before and after model improvement, as well as comparisons with some of the current mainstream object detection algorithms, the effectiveness of the proposed model has been verified.

The YOLOv6 Model
The YOLOv6 model is a one-stage object detection model proposed by the Meituan Visual Intelligence Department in 2022. Compared with the former YOLO series, it has certain advantages on detection accuracy and inference efficiency. It builds on the previous YOLO series networks by redesigning the backbone and neck layers, as well as modifying the head layer. Along with the optimization of the network structure, the model also adopts a more streamlined anchor-free detection method and the SimOTA label allocation strategy in the training strategy. The YOLOv6 model consists of five basic models: YOLOv6s, YOLOv6t, YOLOv6m, YOLOv6n, and YOLOv6l. In terms of the detection performance on the dataset studied in this paper, the YOLOv6l model is selected as the basis for optimization.
The basic structure of YOLOv6l includes input layer, backbone layer, neck layer, and head layer. The input layer of YOLOv6l uses an input image resolution of 640 × 640 × 3 (with R, G, and B channels) and performs mosaic and mix-up data augmentation techniques to the original input image. This helps create more balanced object samples for underwater images. In the backbone layer, YOLOv6l adopts a redesigned and more efficient CSP (Cross Stage Partial) module based on the previous YOLO series, known as the CSPStackRep Block. Also known as the BepC3 module, this module contains three 1 × 1 convolutions and N/2 double RepVGG blocks, with additional features including residual connections and concatenation operations. CSP connections are employed within this module to enhance performance without excessive computational costs. Compared with CSPRepResStage [34], it is more compact and considers the balance between accuracy and speed. The neck layer of YOLOv6l follows the PAN (Path Aggregation Network) topology used in previous models like YOLOv5 but replaces the CSPBlock with the CSPStackRep Block. The width and depth of the block are adjusted accordingly to obtain the Rep-PAN structure in YOLOv6l, which enhances the feature extraction capability of the neck layer. In terms of the head layer, YOLOv6l still follows more of the structure in the previous YOLO series. But it uses a mixed-channel strategy to build the detection head, which further reduces computing costs and makes it more efficient.

Materials and Methods
This study focuses on optimizing the YOLOv6l model for application in underwater equipment with limited memory. The resulting YOLOv6-ESG model mainly improves the backbone and neck layers, while the structure of the head layer is preserved, as shown in Figure 1. Firstly, the cleaned dataset is used as the input, and then mosaic data augmentation is applied. The processed images are then resized and fed into the OD-E2 backbone layer, which extracts the input images' features at three different scales. These features are subsequently transferred to the neck layer for further feature extraction. Multiscale feature fusion is then performed after upsampling or downsampling to obtain new fusion features of three different scales. Finally, the fused features are passed to the network's head layer for prediction, where decoupled heads are employed to perform classification and regression prediction, enabling the detection of objects of various sizes.

OD-E2 Backbone Layer
The backbone network of YOLOv6 has strong feature extraction capabilities, but its complex structure and huge numbers of parameters have a certain impact on the detection speed. Therefore, this paper compared several popular lightweight backbone networks and chose the EfficientNetv2 network for further improvement. A more lightweight network is designed as the backbone network OD-E2 of YOLOv6-ESG by integrating EfficientNetv2 with the ODConv and SPD-Conv modules.

EfficientNetv2
EfficientNetv2 is a convolutional neural network proposed by the Google Brain team in 2021. It is an improvement of the previous EfficientNet [35] network, with the goal of improving accuracy and efficiency. EfficientNetv2 introduces the FusedMBConv [36] module based on the previous network, which replaces the 3 × 3 depthwise convolution and 1 × 1 expansion convolution in MBConv [37] with a 3 × 3 regular convolution (Conv 3 × 3), as shown in Figure 2a.

OD-FusedMBConv
The ODConv (Omni-dimensional Dynamic Convolution), as shown in Figure 2b, was proposed by Intel Labs in 2022. As a "plug-and-play" convolution, it can be easily embedded into existing CNN networks. ODConv utilizes a novel multidimensional attention mechanism and a parallel strategy to learn the attention of convolutional kernels along four dimensions of the kernel space at any convolutional layer. Therefore, it can use fewer convolution kernels to greatly improve the feature extraction ability of convolution. In this study, the ODConv module is used to replace the normal 3 × 3 convolution in the FusedMBConv module, further reducing the model's computational complexity and decreasing memory access overhead. The improved module is called OD-FusedMBConv, as shown in Figure 2c.

SPD-Conv
In general, images captured in land-based scenes have good resolution and moderate object sizes. In such cases, object detection models employ designs such as stride convolution and pooling layers to skip redundant pixel information while still being able to learn object features effectively. However, in the challenging task of detecting small and blurry underwater objects, the assumption of redundant information becomes invalid, leading to the loss of fine-grained image details and insufficient learning of object features, resulting in decreased detection performance. To address this issue, the SPD-Conv module is introduced to solve the problem of low-resolution images and small objects.

The SPD-Conv module structure is shown in Figure 3. It was proposed in 2022 and consists of a space-to-depth (SPD) layer and a non-strided convolution layer. The SPD layer downsamples the extracted feature map but retains all pixel information in the channel dimension, thereby avoiding information loss. After each SPD layer, a non-strided convolution is added to reduce the number of channels using learnable parameters. By incorporating this module into our model, we can enhance the feature extraction capability for small underwater objects and further improve the model detection accuracy.
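The space-to-depth step described above can be illustrated with a minimal pure-Python sketch. The function name `space_to_depth`, the nested-list layout (channels × rows × columns), and the order in which the sub-maps are stacked are assumptions made for illustration; only the SPD layer is shown, and the non-strided convolution that follows it is omitted.

```python
def space_to_depth(x, scale=2):
    """Rearrange a C x H x W feature map into (C*scale^2) x H/scale x W/scale.

    Every input value is kept: spatial positions are moved into the channel
    dimension instead of being skipped, as a strided convolution would do.
    The stacking order of the sub-maps is an assumption of this sketch.
    """
    height, width = len(x[0]), len(x[0][0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for col_offset in range(scale):
        for row_offset in range(scale):
            for channel in x:
                # Take every `scale`-th row and column, starting at the offsets.
                out.append([row[col_offset::scale]
                            for row in channel[row_offset::scale]])
    return out


# A single-channel 4 x 4 map becomes four 2 x 2 channels.
feature_map = [[[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12],
                [13, 14, 15, 16]]]
spd = space_to_depth(feature_map)
print(len(spd), len(spd[0]), len(spd[0][0]))  # 4 2 2
```

Because the output holds exactly the same 16 values as the input, the downsampling itself loses nothing; any reduction in channel count is left to the learnable convolution that follows.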
Based on EfficientNetv2, the OD-E2 backbone network is proposed to further reduce model computational complexity. Table 1 presents the architecture of the OD-E2 network. There are several main differences compared to the EfficientNetv2 network: First, the OD-FusedMBConv is proposed for further lightening the basic EfficientNetv2. Then, the SPD-Conv module is introduced to enhance the model's feature extraction ability for low-resolution underwater images and small objects. Furthermore, the last two stages for feature extraction in the original EfficientNetv2 are completely removed, which is suitable for subsequent processing and reduces memory access overhead. Meanwhile, the last classification layer is removed to make it only exist as the backbone feature extraction network.

Table 1. OD-E2 network architecture.

The Integrated Neck Layer
In this study, the GSConv module is introduced instead of the SimConv module in the neck layer of the original model. Meanwhile, the VoVGSCSP module replaces the original BepC3 module. SPD-Conv module is also integrated into the neck layer for enhancing feature extraction. This section mainly describes the lightweight GSConv and VoVGSCSP modules.
Common lightweight design mainly reduces the number of parameters and floating-point operations (FLOPs, the number of multiply-adds) using depthwise separable convolution (DWConv). However, during computation, DWConv separates the channel information of the input image. This defect leads to much lower feature extraction and fusion capabilities of DWConv than the vanilla convolution. As shown in Figure 4, GSConv is a mixed convolution that combines the advantages of the vanilla convolution, DWConv, and a shuffle. Specifically, through using the shuffle, the information generated by the vanilla convolution is permeated into various parts of the information generated by DWConv. This module evenly exchanges local feature information across different channels, allowing information from the vanilla convolution to be fully mixed into the output of DWConv. The design of GSConv aims to make the output of convolutional computations as close as possible to the output of vanilla convolution while reducing computational costs. Based on GSConv, the GS bottleneck is introduced. Then the one-shot aggregation method is used to design the cross-stage partial network (GSCSP) module, named VoVGSCSP.
These improvements not only reduce the computational complexity and inference time of the detector but also maintain detection accuracy.
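To make the mixing step in GSConv more concrete, the following sketch interleaves the channels produced by a vanilla convolution with those produced by DWConv. It is a generic two-group channel shuffle applied to channel labels only; the function name `shuffle_two_groups` is hypothetical, and the exact channel ordering used in the GSConv reference implementation may differ.

```python
def shuffle_two_groups(dense_channels, depthwise_channels):
    """Concatenate two equally sized channel groups, then interleave them.

    `dense_channels` stands for the output of the vanilla convolution and
    `depthwise_channels` for the output of DWConv applied to it. After the
    shuffle, every neighbouring pair holds one channel from each group.
    This ordering is an illustrative assumption, not the authors' code.
    """
    if len(dense_channels) != len(depthwise_channels):
        raise ValueError("both groups must have the same number of channels")
    concatenated = list(dense_channels) + list(depthwise_channels)
    half = len(dense_channels)
    shuffled = []
    for i in range(half):
        shuffled.append(concatenated[i])         # from the vanilla convolution
        shuffled.append(concatenated[half + i])  # from DWConv
    return shuffled


print(shuffle_two_groups(["c0", "c1", "c2"], ["d0", "d1", "d2"]))
# ['c0', 'd0', 'c1', 'd1', 'c2', 'd2']
```

The point of the interleaving is that later layers operating on neighbouring channels see information from both branches, rather than two separate blocks.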

Adan Optimizer
Based on the improvement of the above model, a deep model optimizer called Adan is introduced for the training of the YOLOv6-ESG model. It was jointly proposed by the research teams of Singapore Sea AI LAB (SAIL) and Peking University ZERO Lab in 2022. Under the same computing resources, Adan can effectively improve the detection accuracy of the model and converges faster than the previously used SGD optimizer.
By combining the adapted Nesterov [38] momentum and adaptive optimization algorithms and introducing decoupled weight decay, the Adan optimizer is obtained. By using extrapolation points, Adan can anticipate the surrounding gradient information in advance, which efficiently helps to avoid sharp local minimum regions and increase the model's generalization. The calculation method is shown in Formula (1). In each equation, k represents the number of steps in the update process. m_k represents the first moment of the gradient g_k, v_k represents the second moment of the gradient g_k, and n_k represents the third moment of the gradient g_k, where g_k represents the gradient obtained by taking the derivative of the loss function f(θ) with respect to θ. α is the learning rate used to control the step size, and ε is a small constant added to the denominator for numerical stability. θ represents the parameter to be updated. β_1 represents the first moment decay coefficient, β_2 represents the second moment decay coefficient, and β_3 represents the third moment decay coefficient. λ_k is the weight decay coefficient.
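As a rough illustration of the update described above, the sketch below performs Adan steps on a single scalar parameter, following the update rule as formulated in the Adan paper. In that formulation m tracks the gradient, v tracks the difference between consecutive gradients, and n tracks the squared extrapolated gradient. The function name `adan_step`, the default hyperparameter values, the first-step initialization, and the placement of ε outside the square root are assumptions of this sketch rather than details taken from this study.

```python
import math


def adan_step(theta, grad, state, lr=0.1, betas=(0.02, 0.08, 0.01),
              eps=1e-8, weight_decay=0.0):
    """Apply one Adan update to a scalar parameter and return the new value.

    `state` is a dict that carries m, v, n and the previous gradient between
    calls. All default values are illustrative assumptions.
    """
    b1, b2, b3 = betas
    if not state:
        # First step: there is no previous gradient, so the difference is zero.
        state.update(m=grad, v=0.0, n=grad * grad, prev_grad=grad)
    else:
        diff = grad - state["prev_grad"]
        state["m"] = (1 - b1) * state["m"] + b1 * grad
        state["v"] = (1 - b2) * state["v"] + b2 * diff
        state["n"] = (1 - b3) * state["n"] + b3 * (grad + (1 - b2) * diff) ** 2
        state["prev_grad"] = grad
    # Per-parameter step size, scaled by the running magnitude estimate n.
    eta = lr / (math.sqrt(state["n"]) + eps)
    update = eta * (state["m"] + (1 - b2) * state["v"])
    # Decoupled weight decay is applied as a division, not added to the gradient.
    return (theta - update) / (1 + weight_decay * lr)


# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, state = 5.0, {}
for _ in range(200):
    theta = adan_step(theta, 2 * theta, state)
print(theta)
```

The gradient-difference term is what gives the optimizer its look-ahead behaviour: when consecutive gradients shrink, v becomes negative and damps the step before the parameter overshoots.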

Evaluation Indicators
The performance of the model is comprehensively assessed in terms of three metrics: detection accuracy, detection speed, and model complexity.

mAP
The detection accuracy is evaluated using the mAP (mean Average Precision), which is a commonly used evaluation metric in object detection. It provides a comprehensive evaluation of the detection performance, considering both Precision (P) and Recall (R). The formulas for calculating Precision and Recall are shown in (2) and (3):
$$P = \frac{TP}{TP + FP} \quad (2)$$
$$R = \frac{TP}{TP + FN} \quad (3)$$
where TP represents the number of actual positive samples that are predicted as positive, FP represents the number of actual negative samples that are predicted as positive, and FN represents the number of actual positive samples that are predicted as negative. Generally, improving precision may lead to a decrease in recall. This relationship can be represented by the Precision-Recall (P-R) curve, where the area under the curve (AUC) represents the average precision (AP) for a category. For each category, the AP can be calculated using Formula (4):
$$AP = \int_{0}^{1} p(r)\,dr \quad (4)$$
The symbol p(r) represents the maximum precision value when the recall is greater than or equal to r (where r ranges from 0 to 1). In general, mAP is the average of the AP over all detected categories, and its calculation is shown in Formula (5):
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (5)$$
where N represents the number of object categories contained in the dataset, and AP_i represents the average precision of the i-th category. A higher mAP value indicates more accurate object detection and better performance of the detection model.
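The following sketch shows one way these quantities could be computed for a single category. It assumes each detection has already been matched against the ground truth and is given as a (confidence, is_true_positive) pair; the function names, this input format, and the sample values are illustrative assumptions, and IoU matching is left out.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(detections, num_ground_truth):
    """Area under the P-R curve, using p(r) = max precision at recall >= r."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(precision(tp, fp))
    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


def mean_average_precision(ap_per_class):
    """Average of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Three detections for one category with two ground-truth objects.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
print(round(ap, 4))  # 0.8333
```

In the example, the curve contributes precision 1.0 over the recall range 0 to 0.5 and precision 2/3 over 0.5 to 1.0, which gives 5/6.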

FPS
The detection speed is represented by Frames Per Second (FPS), which measures the number of frames processed per second in the field of image processing. In the field of object detection, FPS is often used to evaluate the real-time performance of object detection models. The faster the detection speed, the more the system can detect objects in real time and determine their instant positions. The calculation for FPS is given by Formula (6):
$$FPS = \frac{1000}{T_{pre} + T_{infer} + T_{nms}} \quad (6)$$
where T_pre refers to the image preprocessing time, T_infer refers to the model inference time, and T_nms can be understood as the postprocessing time, with all three times measured in milliseconds (ms).
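As a small worked example of the FPS calculation, the sketch below converts per-image timings in milliseconds into frames per second. The function name and the timing values are made up for illustration and are not measurements from this study.

```python
def frames_per_second(t_pre_ms, t_infer_ms, t_nms_ms):
    """Convert per-image preprocessing, inference and postprocessing
    times (all in milliseconds) into frames per second."""
    total_ms = t_pre_ms + t_infer_ms + t_nms_ms
    if total_ms <= 0:
        raise ValueError("total time per image must be positive")
    return 1000.0 / total_ms


# Hypothetical timings: 1 ms preprocessing, 17 ms inference, 2 ms NMS.
print(frames_per_second(1.0, 17.0, 2.0))  # 50.0
```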

Params and FLOPs
Model complexity can be evaluated from two aspects: the number of model parameters (Params) and the number of floating-point operations (FLOPs). Generally, the larger the number of model parameters and floating-point operations, the more complex the model, and the accuracy may also be improved. However, this also requires more computing resources during training and higher requirements for the device. It is difficult to deploy on small devices such as underwater robots. Therefore, under the premise of ensuring little loss in accuracy, reducing the number of parameters and floating-point operations as much as possible indicates better model performance.

Experimental Environment
The experiments in this study were all conducted in the same computing platform, as presented in Table 2. The used deep learning framework was Pytorch 1.8.0 + cu111, with an NVIDIA GeForce RTX 3090 GPU and Windows 10. The images were resized to 640 × 640, the batch size of the training process was set to 16, and the epoch was set to 300 iterations. The IDE was PyCharm with a programming environment of Python 3.7.

URPC2022 Dataset
The studied dataset was provided by the 2022 China Underwater Robot Professional Contest (URPC 2022) which consists of 9000 images captured in real marine environments. There is no inter-frame continuity between the images. The dataset includes images in various scales from different geographic environments and lighting conditions. Some sample data are shown in Figure 5.
The dataset contains four types of seafood: holothurian, echinus, starfish, and scallop. However, it also includes a small number of seaweed samples, which may interfere with the experimental results. Therefore, the dataset was first cleaned to ensure data quality and accuracy. The cleaned images were randomly divided into 7102 training samples, 887 validation samples, and 887 test samples, an approximately 8:1:1 ratio, for the subsequent experiments.
The training set contains samples of all four types of seafood. The number of objects in each category is shown in Figure 6. There is a clear imbalance among the categories, with the holothurian category having the fewest samples. This imbalance poses a challenge during training because the network may underfit the under-represented categories, which affects its learning ability.
Figure 5. Sample images from URPC2022 in different geographic environments and lighting conditions, which contain four types of seafood: holothurian, echinus, starfish, and scallop.
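The random 8:1:1 division described above can be sketched as follows. The total of 8876 cleaned images is inferred from the reported split sizes (7102 + 887 + 887); the rounding rule (one tenth, rounded down, for validation and for test, with the remainder used for training), the seed, and the function name `split_8_1_1` are assumptions chosen because they reproduce the reported counts, not details stated in the text.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/val/test at roughly 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_holdout = len(items) // 10        # size of the validation and test sets
    val = items[:n_holdout]
    test = items[n_holdout:2 * n_holdout]
    train = items[2 * n_holdout:]       # everything else is used for training
    return train, val, test


# Image indices stand in for file names (assumption).
train, val, test = split_8_1_1(range(8876))
print(len(train), len(val), len(test))  # 7102 887 887
```

A random split of this kind keeps the class proportions only approximately equal across the three subsets, so the class imbalance noted above carries over to each of them.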

Experimental Results and Discussion
In this study, the proposed model was verified through quantitative comparisons with other recent mainstream detection models. Additionally, ablation experiments were conducted to verify the effectiveness of each improvement. Table 3 presents the experimental results of popular two-stage models, namely Faster RCNN (ResNet50) and Faster RCNN (VGG16), as well as one-stage models, namely RetinaNet, YOLOv5l, YOLOv6s, YOLOv6l, and YOLOv6-ESG (ours). All models were trained for 300 epochs and evaluated with 640 × 640 input images under consistent experimental conditions. Table 3 compares the models on the URPC2022 dataset in terms of mAP over the four types of seafood, Params, FLOPs, and speed.
The results in Table 3 show that the proposed model outperformed the other six models in terms of Params and FLOPs. It had 14.36 M parameters and 29.28 GFLOPs, significantly fewer than the other models, which indicates that it is suitable for fast underwater object detection and for deployment on real-time underwater equipment.
With FPS as the measure of detection speed and the batch size set to 1, the detection speed of the YOLOv6 models was generally lower than that of YOLOv5l, but the proposed YOLOv6-ESG still reached 50.66 FPS, which satisfies the real-time requirements of underwater object detection. When the batch size was set to 4, the proposed YOLOv6-ESG reached 140.06 FPS, which was 32.53 FPS, 2.51 FPS, and 38.02 FPS higher than YOLOv5l, YOLOv6s, and YOLOv6l, respectively. These results show that the proposed YOLOv6-ESG offers a clear advantage in detection speed at a batch size of 4.
In terms of mAP@.5, the improved YOLOv6-ESG model achieved a detection accuracy of 86.6%, which is 13.7, 11.2, 28.8, 1.3, and 1.1 percentage points higher than the first five models in Table 3, respectively.

Comparative Results with Other Models
Although it slightly lagged behind the YOLOv6l model in terms of detection accuracy, the improved YOLOv6-ESG model had fewer parameters and a lower computational cost. Moreover, it achieved a faster detection speed, striking a better balance between detection accuracy and speed. This combination makes it well suited to real-time underwater object detection and to the demands of practical applications.
To provide a more comprehensive analysis, Table 4 reports the accuracy of each model for each seafood category: $AP_{ho}$ (holothurian), $AP_{ec}$ (echinus), $AP_{st}$ (starfish), and $AP_{sc}$ (scallop), all at an IoU (Intersection over Union) threshold of 0.5. This enables a detailed evaluation of how well each model detects and classifies the different types of seafood. Overall, YOLOv6l exhibited a significant advantage in detection accuracy, which is why this paper selected YOLOv6l as the base model.
Comparing the models within each category (vertically), the proposed YOLOv6-ESG model achieved the best performance on echinus, with an accuracy of 89.6%. This exceeded the other models by 10.9, 8, 14.9, 2.1, 1.1, and 0.6 percentage points, respectively. For the other types of seafood, the proposed model was only marginally less accurate than YOLOv6l.
Comparing the categories across models (horizontally), the detection accuracy for echinus and starfish was significantly higher than for the other two types of seafood, with holothurian generally the lower of the two. Two aspects help explain this: (1) From the perspective of the dataset, the number of samples per class was imbalanced. Echinus had the most labeled samples, followed by starfish, while holothurian had the fewest.
As a result, the model could learn more detailed features of echinus and starfish, leading to higher recognition accuracy, whereas its limited exposure to holothurian samples may have prevented it from learning their distinctive characteristics, lowering recognition accuracy. (2) From the perspective of class characteristics, echinus have distinctive spines and a predominantly black appearance, while starfish exhibit a clear "five-pointed star" shape and are typically blue. These characteristics made them relatively easy to identify. Holothurian, on the other hand, exhibited colors and patterns that resembled the background, such as seagrass, and their variable shape and less distinct features made them more difficult to distinguish from their surroundings. Scallops are typically white, but their surfaces could be covered with algae and other impurities, which made their features less prominent and led to lower recognition accuracy.
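The per-category accuracies discussed above are computed at an IoU threshold of 0.5. As a reminder of what that threshold means, the sketch below computes the IoU of two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the example boxes are assumptions made for illustration; they are not taken from the dataset.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)          # hypothetical labeled box
close_prediction = (2, 0, 12, 10)      # IoU = 80 / 120, above 0.5
loose_prediction = (5, 0, 15, 10)      # IoU = 50 / 150, below 0.5
print(iou(ground_truth, close_prediction) >= 0.5)  # True
print(iou(ground_truth, loose_prediction) >= 0.5)  # False
```

For small objects a shift of only a few pixels changes the IoU considerably, which is one reason small, low-contrast targets are harder to score as correct detections.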
Taking all aspects into consideration, the proposed YOLOv6-ESG model was 0.2 percentage points less accurate than the YOLOv6l model. However, it significantly outperformed YOLOv6l in terms of the number of parameters and floating-point operations, while also achieving higher FPS. This makes it more suitable for deployment on underwater robots and other devices that require real-time performance. Although the proposed model was slower than the YOLOv5l model at a batch size of 1, it achieved higher detection accuracy. Moreover, its reduced parameter count and computational demands enhance its feasibility for underwater object detection and recognition tasks.
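The size of this saving can be checked with simple arithmetic. The sketch below combines the 14.36 M parameters and 29.28 GFLOPs reported for YOLOv6-ESG with the reductions of 75.44% and 79.64% stated in the abstract to recover the baseline values those percentages imply. The helper names are hypothetical, and the recovered baselines are derived figures, not values read from Table 3.

```python
def reduction_percent(baseline: float, reduced: float) -> float:
    """Relative reduction of `reduced` with respect to `baseline`, in percent."""
    return (1.0 - reduced / baseline) * 100.0


def implied_baseline(reduced: float, reduction_pct: float) -> float:
    """Baseline value implied by a reduced value and its stated reduction."""
    return reduced / (1.0 - reduction_pct / 100.0)


params_base = implied_baseline(14.36, 75.44)  # in millions of parameters
flops_base = implied_baseline(29.28, 79.64)   # in GFLOPs
print(f"{params_base:.2f} M, {flops_base:.2f} GFLOPs")
```

Under these assumptions the implied baseline is roughly 58.5 M parameters and 143.8 GFLOPs, so the proposed model needs about a quarter of the parameters and a fifth of the floating-point operations.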

Qualitative Analysis of Prediction Results
This section focuses on the prediction results of the different models on sample images. Figure 7 shows prediction results under different water and lighting conditions, covering color distortion, blurriness, small objects, and aggregated objects. The experimental environment remained the same for all models; only the colors of a few detection boxes differ. Figure 7 shows that the underwater dataset contains images of varying degrees of low quality. Compared with the Ground Truth, most models detected most objects, but there were still some missed detections and false positives. Image (a) featured a background with abundant underwater vegetation and a significant color shift, although the objects remained relatively clear visually. The Faster RCNN (ResNet50) and Faster RCNN (VGG16) models detected all existing objects. However, they also produced some false detections, incorrectly identifying several background elements as objects, which reduced their overall detection efficiency.
The RetinaNet model missed a few objects. The other models, including the proposed model, located the objects accurately without any missed or false detections.
In image (b), a "fog" effect was present and the image appeared blurry, making it difficult to discern the objects. A comparison with the Ground Truth revealed that the Faster RCNN (ResNet50) model detected all objects in the image but produced a few false detections. The Faster RCNN (VGG16), RetinaNet, YOLOv5l, YOLOv6s, and YOLOv6l models showed varying degrees of missed detections: some relatively concealed holothurians were not detected. The proposed model, however, produced results entirely consistent with the Ground Truth, detecting all objects without any missed or false detections. These findings indicate that the proposed YOLOv6-ESG performs better on underwater images with slight blurriness and similar issues.
In image (c), the background was murky and cluttered, with significant blurriness. The objects were small and resembled the background, making their features extremely indistinct. As a result, all the experimental models exhibited varying degrees of missed and false detections. This outcome indicates that for images with turbid backgrounds and indistinct objects, it is challenging for all the studied models to detect every object based only on learned object features, and detection performance in such cases was poor.
In image (d), there was no noticeable color deviation, the background was relatively clear, and the objects were clustered together.
All the experimental models demonstrated excellent detection ability on this image. Interestingly, even where the Ground Truth did not label an object (in the lower right corner of the image), each model was still able to detect the object's category and position based on the learned features. This observation suggests that when multiple objects are clustered together, human annotators may also overlook some of them. In such cases, combining the model's detections with manual labeling can yield more accurate and reliable results.
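The missed and false detections discussed above can be counted by matching predictions against the Ground Truth. The sketch below uses a simple greedy rule: predictions are visited in order of decreasing confidence, and each is matched to the best-overlapping unmatched label of the same class if their IoU is at least 0.5. The matching rule, the tuple formats, and the example boxes are assumptions for illustration; the text does not describe the exact procedure used to produce Figure 7.

```python
def _iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_detections(preds, gts, iou_thr=0.5):
    """Return (true positives, false positives, false negatives).

    preds: list of (label, score, box); gts: list of (label, box).
    """
    matched = set()
    tp = fp = 0
    for label, _score, box in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_label, gt_box) in enumerate(gts):
            if idx in matched or gt_label != label:
                continue
            overlap = _iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)  # each label can be matched only once
            tp += 1
        else:
            fp += 1                # no label explains this prediction
    fn = len(gts) - len(matched)   # labels that no prediction matched
    return tp, fp, fn


# Hypothetical image: two labels, one correct and one spurious prediction.
gts = [("echinus", (0, 0, 10, 10)), ("holothurian", (20, 20, 30, 30))]
preds = [("echinus", 0.9, (1, 0, 11, 10)), ("starfish", 0.6, (50, 50, 60, 60))]
print(count_detections(preds, gts))  # (1, 1, 1)
```

Under this counting rule, a correct detection of an object that the annotators missed is scored as a false positive, which is the situation described for image (d).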

In conclusion, the proposed model YOLOv6-ESG demonstrated better performance in detecting object categories and positions, with minimal instances of false detections. It could even detect unlabeled objects in clustered scenarios. The model performed well in scenarios with slight blurriness and small objects. However, its detection performance was compromised in images with heavily turbid backgrounds or significant blurriness. Therefore, to ensure detection accuracy, it is advisable to avoid conducting underwater fishing operations during unfavorable weather conditions or when the marine environment is heavily disturbed.

Ablation Experiments
To assess the effectiveness of the proposed optimizations to the YOLOv6l model, ablation experiments were conducted. To ensure comparability, the environmental configuration remained consistent across all experiments. The number of training epochs was set to 300, and the batch size was set to 1 or 4 for the different experiments. The ablation experiments focused on improvements to the backbone layer, the neck layer, and the optimizer.
In the backbone layer, the proposed model used the improved OD-E2 as the backbone network. In the neck layer, it incorporated the lightweight GSConv module (abbreviated as GS in Table 5) and the lightweight VoVGSCSP module (abbreviated as VoV in Table 5). To further address the difficulties specific to underwater images, the SPD-Conv module (abbreviated as SPD in Table 5) was introduced in both the backbone and neck layers. These modules enhance the model's adaptability and detection performance for underwater images. As shown in Table 5, the ablation experiments were numbered individually to evaluate the impact of each improvement on the model's performance.
Table 5 shows the different module combinations that were tried to improve the YOLOv6l model. When the backbone network was replaced with the lightweight EfficientNetv2, both the number of parameters and the floating-point operations were reduced, which validated the effectiveness of the lightweight modification. However, this improvement came at the cost of a significant decrease in model accuracy. In Exp. 3, when the backbone was replaced with the proposed OD-E2 network, accuracy decreased compared to Exp. 2, but the number of parameters and floating-point operations was reduced further. To enhance the overall detection accuracy of the model, the study adopted the more efficient Adan optimizer. The experimental results revealed that this optimizer introduced only a slight increase in computational complexity while improving detection accuracy by 4.1%. This finding underscores the compatibility of the Adan optimizer with the proposed model and its advantage over the original optimizer. Building on Exp. 4, Exp. 5 focused on the SPD-Conv module.
Despite a slight increase in both the number of parameters and floating-point operations, there was a notable improvement in detection accuracy, which provides strong evidence for the practicality and effectiveness of the SPD-Conv module. Exp. 6 corresponds to the improved model proposed in this study. Building on Exp. 5, the GSConv and VoVGSCSP modules replaced the original modules in the neck layer. This modification resulted in the lowest number of parameters and floating-point operations. As a result, the model achieved a detection accuracy of 86.6%, comparable to that of the base YOLOv6l, while achieving a faster detection speed.
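One way to see why an SPD-style module helps with small, low-resolution objects is its space-to-depth step, which downsamples a feature map by rearranging pixels into extra channels instead of discarding them; in SPD-Conv this step is followed by a non-strided convolution. The pure-Python sketch below shows only the rearrangement for a single channel. The function name, the nested-list representation, and the order of the output sub-maps are assumptions for illustration and do not reproduce the implementation used in this study.

```python
def space_to_depth(feature_map, scale=2):
    """Split one H x W channel into scale*scale sub-maps of size (H/scale) x (W/scale)."""
    h, w = len(feature_map), len(feature_map[0])
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by scale")
    # Each sub-map keeps every `scale`-th pixel, starting at offset (dy, dx).
    return [
        [row[dx::scale] for row in feature_map[dy::scale]]
        for dx in range(scale)
        for dy in range(scale)
    ]


# Hypothetical 4 x 4 single-channel feature map.
x = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
for sub_map in space_to_depth(x):
    print(sub_map)
```

All 16 input values appear exactly once in the four 2 × 2 outputs, so the spatial size is halved without losing information, unlike a strided convolution or pooling layer that skips or merges pixels.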
In summary, the improvements to the backbone and neck layers significantly reduced the model's number of parameters and floating-point operations. Although there was a slight decrease in accuracy, this was traded for a faster detection speed, which ensured that the model met the real-time requirements of object detection. Table 6 presents the detection results of the different models for the four categories of seafood. All experiments were conducted under consistent environmental settings.
The results in Table 6 show that the model designed in this study achieved favorable performance. Compared to the YOLOv6l model, the proposed model performed better on echinus. Although it did not achieve the best results for the other three categories, its performance was close to that of the YOLOv6l model. Across categories within the same model, the proposed YOLOv6-ESG performed best in detecting starfish, followed by echinus, while detection performance for the other two categories was relatively poor. This is likely because echinus and starfish had more samples and possessed distinct color or shape features, allowing the model to learn these features effectively. Conversely, the other two categories had fewer training samples and inconspicuous features. They might have a color similar to the background or be partially covered by seaweed, making it difficult for the model to learn their subtle details. As a result, the detection performance for these categories was not as satisfactory.
The results of the ablation experiments demonstrated that, with the addition of each method, either the model's detection accuracy improved or the number of parameters and floating-point operations decreased. This indicates the effectiveness of the introduced modules in achieving lightweight improvements and validates the model's improvement methods. To ensure the robustness of underwater object detection and better adaptability to various marine environments, the dataset used in this study did not undergo any image preprocessing or similar operations. The proposed model demonstrated favorable detection performance for all four categories of seafood.
Additionally, the model achieved a significant reduction in the number of parameters and floating-point operations with a faster detection speed. These experimental results highlight the effectiveness of the proposed model for underwater object detection, indicating its viability for devices in underwater environments for detection and recognition tasks.
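For readers unfamiliar with the Adan optimizer used in the ablation experiments, the scalar sketch below follows the update rules as they are commonly stated for Adan: a gradient average, an average of gradient differences, and a second-moment estimate of the difference-corrected gradient. It is an illustration written from memory of the published algorithm, not the optimizer code used in this study; the hyperparameter values, the function name `adan_minimize`, and the test function are assumptions.

```python
import math


def adan_minimize(grad_fn, x0, lr=0.01, beta1=0.02, beta2=0.08, beta3=0.01,
                  eps=1e-8, weight_decay=0.0, steps=100):
    """Run a scalar Adan-style update for `steps` iterations and return x."""
    x = x0
    m = v = n = 0.0      # gradient average, difference average, second moment
    prev_grad = None
    for _ in range(steps):
        g = grad_fn(x)
        diff = 0.0 if prev_grad is None else g - prev_grad
        m = (1 - beta1) * m + beta1 * g
        v = (1 - beta2) * v + beta2 * diff
        n = (1 - beta3) * n + beta3 * (g + (1 - beta2) * diff) ** 2
        step = lr / (math.sqrt(n) + eps)  # per-step adaptive learning rate
        x = (x - step * (m + (1 - beta2) * v)) / (1 + weight_decay * lr)
        prev_grad = g
    return x


# Hypothetical objective f(x) = x^2 with gradient 2x, started at x = 5.
x_final = adan_minimize(lambda x: 2.0 * x, x0=5.0)
print(x_final)
```

The gradient-difference term lets the update anticipate how the gradient is changing, which is the mechanism usually credited for Adan's faster convergence.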

Conclusions
This study provides a feasible lightweight method, YOLOv6-ESG, based on YOLOv6 for real-time detection and identification in fishing operations. Previously proposed methods for marine organism recognition, which integrate various techniques to enhance accuracy, may increase energy consumption. To address this concern, this work gives particular attention to energy efficiency in the method design, and a series of measures are taken to limit energy consumption. First and foremost, the techniques introduced in this study focus on lightweight structures. The proposed method integrates a more lightweight backbone network, OD-E2, along with an optimized neck layer. The SPD-Conv module is also incorporated to enhance the model structure with only a small increase in computational cost, and the original SimConv and BepC3 modules in YOLOv6l are replaced with the lightweight GSConv and VoVGSCSP modules to further reduce the model's parameters and floating-point operations. Additionally, the Adan optimizer is used in model training to speed up convergence. These enhancements reduce computational cost and memory consumption during training and validation while maintaining prediction performance comparable to that of the original YOLOv6l model.
Furthermore, the proposed model was extensively evaluated on a real-world underwater image dataset. The experimental results show that, compared with other significant underwater object detection models, the YOLOv6-ESG model achieves an accuracy of 86.6% while having the lowest number of parameters and floating-point operations. The model not only ensures detection accuracy but also accelerates training and detection under the same computing resources, making it more suitable for object detection on underwater equipment. The underwater video detection results of the YOLOv6-ESG model can be found in the Supplementary Materials. In future work, the method will be further optimized and deployed on underwater autonomous fishing equipment for further validation.