CME-YOLOv5: An Efficient Object Detection Network for Densely Spaced Fish and Small Targets

Abstract: Fish are indicator species of a relatively balanced ecosystem. Underwater fish detection is of great significance to fishery resource investigations. Traditional investigation methods cannot meet the increasing requirements of environmental protection and investigation, and existing object detection research has paid little attention to the dynamic identification of underwater fish and small targets. To reduce environmental disturbance and address the problems of numerous, densely spaced and mutually occluding fish and hard-to-detect small targets, an improved CME-YOLOv5 network is proposed to detect fish in dense groups and small targets. First, the coordinate attention (CA) mechanism and the cross-stage partial network with 3 convolutions (C3) structure are fused into a C3CA module that replaces the C3 module in the backbone of You Only Look Once (YOLOv5), improving the extraction of target feature information and detection accuracy. Second, the three detection layers are expanded to four, which enhances the model's ability to capture information in different dimensions and improves detection performance. Finally, the efficient intersection over union (EIOU) loss function is used instead of the generalized intersection over union (GIOU) loss function to optimize the convergence rate and localization accuracy. On a dataset of field images supplemented by a small number of images obtained online, the mean average precision (mAP@0.50) of the proposed algorithm reached 94.9%, which is 4.4 percentage points higher than that of the YOLOv5 algorithm, and the number of fish and small targets detected was 24.6% higher. The results show that the proposed algorithm exhibits good detection performance when applied to densely spaced fish and small targets and can be used as an alternative or supplemental method for fishery resource investigation.


Introduction
Changes in fish stocks can directly reflect the status of river ecosystems. The General Fisheries Commission for the Mediterranean compiled fishing indicators for fisheries in the Mediterranean region from 1970 to 2017 and found that the ecosystem structure had changed due to overexploitation and that the catch had been declining since 2010 [1]. Therefore, regular or irregular fishery resource studies and assessments are needed to confirm the status of ecosystems; they are important for responsible fishing and environmental protection. Early investigations of fishery resources used electric fishing, artificial spraynet fishing, ground cage, and prick-net methods [2][3][4][5] to collect fish samples. However, traditional fishery resource survey methods have several problems: (1) their excessive dependence on manual operations makes them time-consuming and laborious; (2) they cause considerable disturbance to fish and aquatic ecosystems; and (3) small target fish are easily missed [6]. Traditional methods therefore have difficulty meeting the increasing requirements for environmental protection and monitoring, and it is necessary to use the latest technology to carry out fishery resource studies.
In recent years, computer information technology has developed rapidly in the era of artificial intelligence, and there have been advances in computer vision [7]. Object detection, one of its core fields, has made major breakthroughs [8], and an increasing number of deep-learning object detection methods have been applied to underwater object detection. Pei Qianqian et al. [9] applied the deep learning object detection algorithm YOLOv3 [10] in an engineering fishway and conducted real-time detection of passing fish, but the model is too simple: the water quality of the fishway changes with the season, and the detection effect is poor when the background fluctuates strongly. Youssef et al. [11] proposed an object detection algorithm that improves the clarity of water images. The multiscale Retinex (MSR) algorithm [12] was used to enhance blurred water images or video, and the YOLOv3 object detection algorithm was then used to identify the enhanced images. The detection accuracy was significantly improved. However, the MSR algorithm is an image enhancement algorithm based on a physical model; its processing speed is slow, and the image recognition speed is correspondingly slow. Fan Weiya [13] improved the Faster R-CNN [14] algorithm by increasing the number of anchors in the RPN network, modifying the deformable convolution and adding three single fully connected channels. The detection accuracy improved by 6.23% to 92.44% compared with the original algorithm, improving the applicability of the fish object detection algorithm. However, Faster R-CNN is a two-stage detection algorithm with a slow processing speed, and adding anchors and channels increases the number of model parameters, which leads to a significant decrease in processing speed.
Yao et al. [15] used the YOLOv4 [16] object detection algorithm for underwater target recognition, replaced the upsampling module in the original model with a deconvolution module, removed the SPP layer, and added depthwise separable convolution to reduce network computation. The results showed that the improved model reached an mAP of 75.34, nearly 12% higher than the original YOLOv4. Qiang et al. [17] proposed an improved SSD [18] algorithm that uses ResNet instead of VGG together with depthwise separable deformable convolution, which improved fish detection accuracy and speed in complex water environments. Wu Rui et al. [19] improved the YOLOv5 model [20] by introducing a convolutional attention module, and the results showed that the improved method greatly increased the identification accuracy and speed for benthic organisms in coral reefs. In summary, the above studies are based on static individual identification or on the detection of conventional targets in general scenes. There are few studies on the dynamic identification of fish and small targets, and the identification of dense underwater fish and small targets still suffers from high rates of missed and false detections. Densely spaced underwater fish occlude one another and cast shadows [21], and small object detection has always been one of the key difficulties in the object detection field [22,23] because small objects offer little effective feature information and appear blurred. Because current object detection technology has limited ability to handle fish occlusion and small targets, it cannot meet the needs of fishery research, and image recognition must be advanced to improve this situation.
To address the above issues, we propose the CME-YOLOv5 algorithm. Based on the YOLOv5 object detection algorithm, the CA attention mechanism is fused into the C3 structure, the detection layers are expanded to 4 according to the characteristics of small objects, and the GIOU loss is replaced with the EIOU loss. These changes address the poor underwater detection of dense fish swarms and the difficulty of localizing small targets with few pixels, and they effectively improve the accuracy of the algorithm for dense fish swarms and small targets while maintaining real-time performance. This technology can become an alternative or supplementary method for fishery resource studies.

YOLOv5 Object Detection Model
The YOLOv5 framework is divided into four parts. The first part is the input layer, which takes a 640 × 640 three-channel image. The second part is the backbone network, which uses the Darknet-53 framework to extract image features. The third part is the neck module, located between the backbone and the output layer; it includes spatial pyramid pooling-fast (SPPF), which uses max pooling, and a path aggregation network (PANet) taken from an instance segmentation framework, and it repeatedly fuses and extracts shallow and deep information across the three feature layers to make full use of context information. The fourth part (YOLO Head) predicts on and decodes the three generated feature maps of 20 × 20, 40 × 40 and 80 × 80 and directly obtains the position of each prediction box in the image and the class of each object.
The YOLOv5 family contains four models with different network depths and widths. According to the cost-performance comparison in Table 1, YOLOv5l, with few computational parameters, high accuracy and high speed, was selected as the basis for improvement and experimentation.

Improved CME-YOLOv5 Recognition Method
YOLOv5 offers fast detection, high efficiency and flexibility in target recognition. Underwater images, however, generally suffer from low contrast, blurring, color deviation and occlusion, which lower image quality and make fish in dense schools difficult to detect and identify. This hampers multiobject recognition of underwater fish schools and is particularly disadvantageous for detecting small fish. Therefore, the YOLOv5 network needs to be improved further to raise detection accuracy and network performance. In this paper, the CME-YOLOv5 network is proposed for recognizing dense fish schools and small objects; the model is shown in Figure 1. The innovative features of the model are as follows:
(1) The coordinate attention (CA) mechanism is fused with the C3 structure of the YOLOv5 network to form the C3CA structure, which replaces the C3 structure in the backbone feature extraction network. By focusing on the essential information among the large amount available, it increases the model's attention to key information, reduces the interference of invalid object information, and enhances the feature expression ability of the detection network for small objects.
(2) The number of detection layers in the YOLOv5 detection module (YOLO Head) is expanded from 3 to 4 to better capture global and rich context information. This improves the ability of the model to capture information in different dimensions, improves the detection performance of the network on multiscale objects, extracts features better in dense schools of fish, and improves the handling of small objects.
(3) The EIOU loss function is used instead of the GIOU loss function. EIOU considers the overlapping area, the center point distance and the true differences in width and height, and adds focal loss to address the sample imbalance problem in bounding box regression. It thus addresses the slow convergence of the GIOU loss function in the horizontal and vertical directions and its inability to optimize the case in which the predicted bounding box and ground-truth bounding box do not intersect.

Water 2022, 14, x FOR PEER REVIEW

CA Attention Mechanism Module
Coordinate attention (CA) [24] (shown in Figure 2) is a lightweight and efficient attention mechanism that operates on the channel dimension and the X and Y spatial directions. Channel attention is decomposed into two one-dimensional feature encoding processes that aggregate features along the two spatial directions. CA captures long-range dependencies along one spatial direction while retaining precise location information along the other, forming a pair of direction-aware and position-sensitive feature maps that can be applied complementarily to enhance the representation of effective information. In CA, global average pooling of the input feature map along the height and width directions is first carried out to obtain the feature maps $(Z_C^h(h), Z_C^w(w))$. Then, the feature maps in the height and width directions are concatenated and passed through convolution, batch normalization and nonlinear sigmoid activation to obtain the feature map $F$, where $\sigma$ is the sigmoid function. Next, $F$ is split according to the original height and width and convolved to obtain the feature maps $F^h$ and $F^w$, respectively, with the same number of channels as the original. After the sigmoid activation function, the attention weights of the feature map in the height direction and the width direction, $(g^h, g^w)$, are obtained.
Finally, the feature map with attention weight in the height and width direction is obtained by multiplicative weighting calculation on the original feature map, which can enhance the important information and help the model locate and identify the target more accurately.
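The steps described above can be summarized in equations. The block below is a sketch that follows the usual formulation of coordinate attention in [24]; the symbols $x_c$ (input channel $c$), $y_c$ (output), $H$ and $W$ (feature map height and width), $\delta$ (the nonlinear activation after the shared convolution) and $\mathrm{Conv}_1$, $\mathrm{Conv}_h$, $\mathrm{Conv}_w$ (the 1 × 1 convolutions) are assumed names and may differ from the notation used in the original equations of this paper.

```latex
\begin{align}
z_c^h(h) &= \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), &
z_c^w(w) &= \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \\
f &= \delta\!\left(\mathrm{Conv}_1\!\left([z^h, z^w]\right)\right) \\
g^h &= \sigma\!\left(\mathrm{Conv}_h(f^h)\right), &
g^w &= \sigma\!\left(\mathrm{Conv}_w(f^w)\right) \\
y_c(i, j) &= x_c(i, j) \times g_c^h(i) \times g_c^w(j)
\end{align}
```

Here $[\cdot,\cdot]$ denotes concatenation along the spatial dimension, and $f^h$, $f^w$ are the two parts of $f$ after it is split back along height and width. The last line is the multiplicative weighting of the original feature map.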

Multiscale Detection Layer
In the original YOLOv5 network structure, three detection-layer feature maps of 20 × 20, 40 × 40 and 80 × 80 are obtained after the backbone network and the neck enhancement module; they are used to detect large, medium and small objects, respectively. This may achieve the desired effect for conventional detection, but for dense groups of fish with individuals of different sizes there are often omissions or poor detection accuracy, especially for small objects. Therefore, to detect individuals in dense underwater fish schools, a detection layer that can detect smaller objects is added to the original three detection layers. The model is shown in Figure 1. Convolution is performed at the 17th layer, which originally would be downsampled, and is followed by upsampling; the features of layers 20 and 2 are then concatenated so that the network actively learns, adaptively fuses features and concatenates the process information, thus enlarging the receptive field. The 160 × 160 feature map is then obtained through convolution; this is YOLO Head1, the new detection layer introduced in our method. The second detection layer (YOLO Head2) of 80 × 80 is obtained through convolution after feature concatenation of layers 23 and 18. The third detection layer (YOLO Head3) of 40 × 40 is obtained through convolution after feature concatenation of layers 26 and 14. Finally, the features of layers 29 and 10 are concatenated, and the fourth detection layer (YOLO Head4) of 20 × 20 is obtained through convolution.
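To make the effect of the extra head concrete, the short sketch below computes the grid size of each detection layer for the 640 × 640 input. The strides (4, 8, 16, 32) are not listed in the text; they are inferred here from the stated feature-map sizes, and the dictionary and function names are illustrative only.

```python
# Grid sizes of the four detection heads for a 640 x 640 input.
INPUT_SIZE = 640  # input resolution stated in the text

# Assumption: strides inferred as 640 / feature-map size; the text gives
# only the feature-map sizes (160, 80, 40, 20), not the strides.
HEAD_STRIDES = {
    "YOLO Head1": 4,   # new 160 x 160 layer for the smallest objects
    "YOLO Head2": 8,   # 80 x 80
    "YOLO Head3": 16,  # 40 x 40
    "YOLO Head4": 32,  # 20 x 20
}


def grid_size(input_size: int, stride: int) -> int:
    """Return the side length of the feature map produced at a given stride."""
    if input_size % stride != 0:
        raise ValueError("input size must be divisible by the stride")
    return input_size // stride


grid_sizes = {name: grid_size(INPUT_SIZE, s) for name, s in HEAD_STRIDES.items()}

# Number of grid cells before and after adding YOLO Head1.
cells_three_heads = sum(g * g for name, g in grid_sizes.items() if name != "YOLO Head1")
cells_four_heads = sum(g * g for g in grid_sizes.values())

for name, g in grid_sizes.items():
    s = HEAD_STRIDES[name]
    print(f"{name}: {g} x {g} grid, each cell covers {s} x {s} input pixels")
print("grid cells with 3 heads:", cells_three_heads)
print("grid cells with 4 heads:", cells_four_heads)
```

Under these assumptions, each cell of the new head covers a much smaller image patch, which suggests why small fish are easier to localize, while the roughly fourfold increase in grid cells is consistent with the longer detection time reported later.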

Optimized Loss Function
The loss function measures, at each iteration of the neural network, the difference between the result of the forward calculation and the real value, and thus evaluates the gap between the model's predictions and the ground truth. Generally, the better the loss function, the better the model's performance. At present, object detection regression loss functions include IOU [25], GIOU [26], DIOU [27], CIOU [28] and EIOU loss [29]. The original YOLOv5 loss function is GIOU loss. GIOU loss uses the minimum enclosing region as a penalty term, which may lead to nonconvergence during model training. Therefore, we use EIOU loss to calculate the regression loss.
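The equation itself is missing from this copy of the text. As a reference, the EIOU loss is usually written as follows in [29]; this is a reconstruction under that assumption, and $C_w$ and $C_h$ (the width and height of the minimum enclosing region) are symbols assumed here because the text defines only the diagonal $C$.

```latex
\begin{equation}
L_{EIOU} = L_{IOU} + L_{dis} + L_{asp}
         = 1 - IOU
         + \frac{\rho^2\left(b, b^{gt}\right)}{C^2}
         + \frac{\rho^2\left(w, w^{gt}\right)}{C_w^2}
         + \frac{\rho^2\left(h, h^{gt}\right)}{C_h^2}
\end{equation}
```

In this form, $\rho(\cdot,\cdot)$ denotes the Euclidean distance between its two arguments.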
In the formula, intersection over union (IOU) is the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box; Equation (2) represents the Euclidean distance between the center points of the predicted bounding box and the ground-truth bounding box; $b$ is the center point of the predicted bounding box; $b^{gt}$ is the center point of the ground-truth bounding box; $w$ is the width of the predicted bounding box; $w^{gt}$ is the width of the ground-truth bounding box; $h$ is the height of the predicted bounding box; $h^{gt}$ is the height of the ground-truth bounding box; and $C$ is the diagonal length of the minimum enclosing region that contains both the predicted bounding box and the ground-truth bounding box.
The EIOU loss function consists of the IOU loss $L_{IOU}$, the center distance loss $L_{dis}$ and the side length loss $L_{asp}$, which together optimize the convergence speed and positioning accuracy and reduce the likelihood of inaccurate regression results.
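The three terms can be illustrated with a small pure-Python sketch for a single pair of boxes. It is not the training code used in the paper: the box format `(x1, y1, x2, y2)`, the function name `eiou_loss` and the `eps` stabilizer are assumptions, the focal weighting mentioned earlier is omitted, and the side length term follows the common formulation in [29], normalizing by the width and height of the enclosing box.

```python
def eiou_loss(pred, gt, eps=1e-9):
    """EIOU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative sketch only: no focal weighting, no batching.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Widths and heights of the predicted and ground-truth boxes.
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Minimum enclosing region: width, height and squared diagonal.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    diag_sq = cw ** 2 + ch ** 2

    # Squared Euclidean distance between the two box centers.
    center_dist_sq = (
        ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2
        + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    )

    l_iou = 1.0 - iou                                  # overlap term
    l_dis = center_dist_sq / (diag_sq + eps)           # center distance term
    l_asp = ((pw - gw) ** 2 / (cw ** 2 + eps)          # side length term
             + (ph - gh) ** 2 / (ch ** 2 + eps))
    return l_iou + l_dis + l_asp


identical = eiou_loss((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = eiou_loss((0, 0, 1, 1), (2, 0, 3, 1))
print(identical, disjoint)
```

For identical boxes all three terms vanish. For the two disjoint unit boxes the overlap term saturates at 1, but the center distance term still varies with the gap between the boxes, so the loss continues to provide a useful signal when the boxes do not intersect.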

Dataset
The experimental dataset in this analysis consisted of 1500 pictures, of which 65% were images of underwater fish and small target fish collected from 4 coexisting hydropower stations and 1 fish breeding station in Xinjiang and Tibet. To enhance the robustness of the training results and improve the detection effect of the model, the remaining 35% came from the Labeled Fishes in the Wild dataset provided by NOAA, which includes images of large numbers of fish and small target fish. LabelImg was used to label the images one by one. The labeling requirements were as follows: (1) in dense fish groups, fish that are more than 1/5 visible should be labeled; and (2) in small target images, to prevent overfitting and reduce misidentification, the image pixel can be labeled if it does not reach the lost frame rate. After the annotation was complete, scripts were used to convert the annotations into the files required for YOLOv5 training, and the dataset was randomly divided into a training set and a validation set at a ratio of 8:2.
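The random 8:2 split described above can be sketched in a few lines. The file names, the fixed seed and the `split_dataset` helper are hypothetical; the paper does not describe the actual conversion scripts.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and split it into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1500 labeled images.
image_names = [f"img_{i:04d}.jpg" for i in range(1500)]
train_set, val_set = split_dataset(image_names)

print(len(train_set), len(val_set))  # 1200 300
```

With 1500 images, an 8:2 split yields 1200 training images and 300 validation images.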

Experimental Platform and Protocols
The experiments were run on the Windows 10 operating system with an Intel i7-11800H CPU, a GeForce RTX 3080 GPU with 16 GB of video memory, CUDA 11.1 for training acceleration and the PyTorch 1.9 deep learning framework. The image input size was 640 × 640, the initial learning rate was 0.01, the final learning rate was 0.1, the SGD optimizer was used, and the training batch size was 8. The specific model parameter configuration is shown in Table 2.

Model Evaluation Measures
To verify the detection and recognition ability of our proposed model on images of densely spaced underwater fish and small objects, three measures were adopted. Precision is the proportion of the objects predicted by the model that are correct. Recall is the proportion of all real objects that the model predicts correctly. The mean average precision (mAP), based on the area under the precision-recall (P-R) curve, measures the overall performance of the model.
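The equations for these measures are missing from this copy of the text. Their standard definitions, which the description above appears to follow, are given below; $N$ (the number of object classes) and $AP_i$ (the average precision of class $i$) are symbols assumed here.

```latex
\begin{align}
P &= \frac{TP}{TP + FP}, &
R &= \frac{TP}{TP + FN} \\
AP &= \int_0^1 P(R)\, dR, &
mAP &= \frac{1}{N}\sum_{i=1}^{N} AP_i
\end{align}
```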
TP, TN, FP, and FN are abbreviations for true positive, true negative, false positive, and false negative, respectively. Positive and negative represent the predicted results of the model: if the IOU value is greater than the threshold (set to 0.5), the prediction is positive; if the IOU value is less than the threshold, the prediction is negative. True and false indicate whether the predicted result agrees with the real result: if they are the same, the assessment is true, and if they differ, the assessment is false, as shown in Table 3.
To assess the effectiveness of the C3CA structure, we conducted a test in which the C3 structure was replaced with C3CA at different locations, using mAP as the evaluation index. According to the C3CA ablation experiment in Table 4, the replacement method of Framework 1 exhibited the largest accuracy improvement, which is 1.5% and 1.1% higher than those of Framework 2 and Framework 3, respectively. As a result, the first method was adopted.

CME-YOLOv5 Ablation Experiment
To assess the effectiveness and advancement of the algorithm proposed in this paper, 8 groups of ablation experiments were conducted on the same validation set to evaluate the impact of the different improvement schemes on the detection performance of the model. The precision, recall, mean average precision, average detection time and model loss value of each model were used to evaluate the impact of the different modules on the YOLOv5 object detection algorithm under the same experimental conditions. The lower the model loss value, the better the regression of the model. The objective evaluation results are shown in Table 5, and the mAP curves are shown in Figure 3. According to the data in Table 5, after the CA attention mechanism was fused with the C3 structure, mAP@0.50 increased by 2.7 percentage points compared with the initial YOLOv5 model, and the detection time increased by 6.1 ms. Although the detection time increased, the detection accuracy improved effectively, indicating that this method can improve the extraction of target feature information, suppress the interference of invalid feature information, and maximize the utilization of feature information. After the 3 detection layers of YOLOv5 were expanded to 4, mAP@0.50 increased by 1.6 percentage points, the small target detection performance increased, and the model was able to detect objects on different scales. Replacing GIOU loss with EIOU loss lowered the training loss value of the model, reduced the average detection time, and slightly improved mAP@0.50, indicating that EIOU can optimize the convergence speed and positioning accuracy and reduce the nonconvergence of regression results. The final results showed that each enhancement introduced in this paper improved performance over YOLOv5 to a different degree. The mAP@0.50 of the proposed algorithm reached 94.9%, which was 4.4 percentage points higher than that of YOLOv5.
The proposed algorithm was inferior to YOLOv5 in average detection speed: the detection time of a single image increased by 8.5 ms. However, the algorithm can still meet the requirements of real-time detection, and the detection accuracy is greatly improved.
Table 5. Impact of the different improvement schemes on model detection performance. Models 1-3 are single-module improvement experiments: Model 1 integrates the CA attention mechanism, Model 2 expands the detection layer, and Model 3 uses the EIOU loss function. Models 4-6 are two-module improvement experiments: Model 4 combines the CA attention mechanism and the extended detection layer, Model 5 combines the CA attention mechanism and the EIOU loss function, and Model 6 combines the extended detection layer and the EIOU loss function.
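The 8 experimental groups correspond to every on/off combination of the three modules (2^3 = 8). The sketch below enumerates them in an order that matches the model numbering in the Table 5 caption; the module labels and the helper name `ablation_configs` are illustrative and not taken from the authors' code.

```python
from itertools import combinations

# Illustrative labels for the three improvements studied in the ablation.
MODULES = ("C3CA", "four detection layers", "EIOU loss")


def ablation_configs(modules):
    """Return all subsets of `modules`, ordered by how many are enabled."""
    configs = []
    for k in range(len(modules) + 1):
        configs.extend(combinations(modules, k))
    return configs


configs = ablation_configs(MODULES)

for index, enabled in enumerate(configs):
    if not enabled:
        label = "YOLOv5 (baseline)"
    elif len(enabled) == len(MODULES):
        label = "CME-YOLOv5 (all three)"
    else:
        label = f"Model {index}"
    print(f"{label}: {', '.join(enabled) if enabled else 'no modification'}")
```

Indices 1-3 are the single-module models and indices 4-6 the two-module models, so together with the baseline and the full CME-YOLOv5 the enumeration gives exactly the 8 groups.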

Comparison of Experimental Results
To assess the detection performance of our algorithm on representative, more difficult images, the results of our method were compared with those of YOLOv5. As seen in Figure 4, our algorithm greatly reduced missed detections in dense fish schools, improved the detection accuracy for small target fish with few pixels and little feature information, and reduced the extraction of useless image feature information. Its performance was better than that of the initial YOLOv5 model. Figure 4A1 shows the detection results for a photograph taken near a hydropower station. The image is challenging because of low contrast, blur and occlusion. The YOLOv5 algorithm exhibited serious missed detections on the right side of the figure, whereas our algorithm effectively detected small objects lacking feature information. As seen in Figure 4A2, YOLOv5 failed to detect the small target fish at an unusual angle and with few pixels in the lower right corner. Our algorithm detected the blurred small objects with few pixels in the upper right of the image (Figure 4A3). Figure 4B1 shows a school of small target fish. The detection performance of our algorithm was significantly higher than that of YOLOv5, indicating that the model improves small target detection. In Figure 4B2,B3, there are a large number of occluded objects in the image. Our algorithm effectively detected target fish occluded by other fish, which demonstrates the ability of the method to detect occluded and highly overlapping objects.
[Figure 4: detection results of YOLOv5 and our method, panels A1-A3 and B1-B3.]

Table 6 shows the number of fish detected by our algorithm and YOLOv5. According to the table, the total number of objects detected by our algorithm was 49 more than that detected by YOLOv5, and the detection ratio increased by 24.6%. Our improved algorithm had better detection performance for densely spaced fish and small objects.
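The 24.6% figure can be checked from the counts reported in the paper. The short calculation below assumes that the total of 248 detections stated in the Conclusions refers to the same Table 6 totals; the YOLOv5 count is derived from it rather than read from the table, which is not reproduced here.

```python
ours_detected = 248   # total detections of our method, as stated in the Conclusions
extra_detected = 49   # additional detections relative to YOLOv5 (Table 6)

# Assumption: the YOLOv5 total is derived, not read from the table itself.
yolov5_detected = ours_detected - extra_detected

# Relative increase in the number of detected objects, in percent.
relative_gain = extra_detected / yolov5_detected * 100

print(yolov5_detected, round(relative_gain, 1))  # 199 24.6
```

The increase is therefore relative to the YOLOv5 count (49 of 199), not to the total number of fish in the images.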

Discussion
To verify the effectiveness and advancement of the CME-YOLOv5 algorithm proposed in this paper for densely spaced and small target fish, the same dataset was used to compare its performance with that of the SSD, Faster R-CNN, YOLOv4 and YOLOv5 object detection algorithms. As seen in Table 7, CME-YOLOv5 achieved the best accuracy of the five algorithms: its mAP@0.50 was 18.4, 15.3, 10.0 and 4.4 percentage points higher than those of SSD, Faster R-CNN, YOLOv4 and YOLOv5, respectively. The proposed algorithm uses C3CA instead of the C3 module of YOLOv5, expands the detection layers from 3 to 4, and replaces the GIOU loss function with the EIOU loss function, which allows the model to achieve better detection performance, focus more attention on key information areas, and improve its ability to detect small objects. However, it is worth noting that adding the CA attention mechanism and expanding the detection layers increased the number of model parameters; compared with the original YOLOv5 algorithm, the computation of the model also increased, resulting in a longer average detection time. The model proposed in this paper has only been validated on small target detection of underwater fish, but this does not prevent its application to small target scenes in other academic or industrial fields or datasets, such as UAV aerial photography and dense crowds. Applying computer vision technology to real scenes is the trend and focus of current research, but many models concentrate on improving accuracy at the expense of detection speed. In fact, many model structures contain redundant modules, which lead to useless calculations during forward and backward propagation without increasing accuracy. Network distillation and pruning may alleviate these problems. In the future, we plan to develop a model with better performance that can detect targets dynamically and at high speed.

Conclusions
To address problems such as numerous, dense and mutually occluding fish and small targets with little effective information and blurred appearance, we propose a method for recognizing densely spaced fish and small targets based on YOLOv5. Compared with other models, it performs better on several indicators.
First, to address poor localization and the scarcity of effective information in underwater target detection, the CA attention mechanism and the C3 structure were fused to increase the ability of the network to extract key information. Second, to address the large number of targets and the dense detection task, the 3 detection layers of YOLOv5 were expanded to 4. Finally, GIOU loss was replaced by EIOU loss to optimize the convergence speed and reduce inaccurate regression results.
The experimental results showed that the improvements proposed in this paper affected the indicators to different degrees; mAP@0.50 reached 94.9%, indicating better accuracy. The number of detected objects reached 248, which was 49 more than that of YOLOv5, an increase of 24.6%. In summary, our proposed algorithm had higher accuracy and detection performance for densely spaced fish and small objects and is more suitable for underwater fishery resource studies.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Some of the datasets in this study are openly available from NOAA (https://swfscdata.nmfs.noaa.gov/labeled-fishes-in-the-wild/, (accessed on 6 July 2022)).