Article

Dead Fish Detection Model Based on DD-IYOLOv8

1 College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
2 Guangzhou Key Laboratory of Agricultural Products Quality & Safety Traceability Information Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
3 Smart Agriculture Innovation Research Institute, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
* Author to whom correspondence should be addressed.
Fishes 2024, 9(9), 356; https://doi.org/10.3390/fishes9090356
Submission received: 7 August 2024 / Revised: 5 September 2024 / Accepted: 9 September 2024 / Published: 12 September 2024

Abstract

In aquaculture, dead fish floating on the water surface can serve as a bioindicator of health issues or environmental stressors. To improve the precision of detecting dead fish on the water surface, this paper proposes a detection approach that combines data-driven prior knowledge with model improvements. First, to reduce the influence of aquatic disturbances and branches during identification, prior information such as tree branches and ripples is annotated in the dataset, guiding the model to learn the scale and shape characteristics of dead fish, reducing the interference of branches and ripples on detection, and thereby improving identification accuracy. Second, building on the YOLOv8 architecture, a DD-IYOLOv8 (Data-Driven Improved YOLOv8) dead fish detection model is designed. Considering the significant scale changes of dead fish at different distances, DySnakeConv (Dynamic Snake Convolution) is introduced into the neck network to adaptively adjust the receptive field, improving the network's ability to capture features. In addition, a dedicated small-object detection layer is added, increasing the number of detection heads to four, which allows the network to better focus on small and occluded dead fish and improves detection performance. Furthermore, the model incorporates a HAM (Hybrid Attention Mechanism) in the later stages of the backbone network to refine global feature extraction, sharpening the model's focus on dead fish targets and further enhancing detection accuracy. Experimental results show that DD-IYOLOv8 achieves a precision of 92.8%, a recall of 89.4%, an AP of 91.7%, and an F1 score of 91.1% in detecting dead fish. This study enables precise identification of dead fish and supports research on automatic pond-patrol vessels.
Key Contribution: A method for detecting dead fish on the water surface based on “data + model” was proposed.

1. Introduction

The scarcity of land resources and the progression of the global food crisis are increasingly putting pressure on China’s terrestrial systems to ensure food security [1]. The ocean, as a rich treasure trove, is not only a “blue granary” for people to obtain high-quality protein but also an important domain for safeguarding food security [2]. Aquaculture plays a crucial role in the global agricultural industry. It not only meets the broad demand for high-quality aquatic products but also contributes positively to the income growth of farmers and the prosperity of the rural economy. However, fish mortality often occurs in the process of aquaculture due to various factors such as water quality deterioration, disease invasion, and improper management. Dead fish can cause direct economic losses and may also carry pathogens that can easily lead to the spread of diseases. Moreover, the decay of dead fish in the water also causes water quality pollution, affecting the health of aquaculture organisms. Therefore, timely and accurate detection of dead fish is particularly important, as it is of great significance for the timely handling of dead fish, preventing the spread of diseases, and protecting the health of aquaculture water environments and aquatic organisms. Traditional methods of detecting dead fish primarily rely on manual patrols, which are not only time-consuming and labor-intensive but also inefficient in achieving real-time monitoring of large-scale aquaculture areas. With the continuous advancement of technology, modern techniques such as remote sensing and robotic monitoring are increasingly being applied to the field of dead fish detection, offering new approaches and methods to address this issue. The application of these technologies can significantly enhance detection efficiency and enable real-time monitoring of aquaculture areas, allowing for the prompt discovery of dead fish and the implementation of appropriate measures. Consequently, this can effectively reduce economic losses and safeguard the health of the aquaculture environment and aquatic organisms.
Researchers have applied neural networks to detect abnormal fish behavior. Such work, however, focuses mainly on analyzing motion dynamics and faces challenges including behavioral complexity, real-time requirements, and noise in underwater environments. Dead fish detection, by contrast, is concerned with quickly and accurately identifying targets on the water surface, where the main challenges come from image quality, occlusion, and the varied states of dead fish. Accordingly, other researchers have begun to study the detection of dead fish targets. Existing studies on dead fish detection have mostly considered relatively closed aquaculture ponds, whereas open aquaculture environments in practical applications involve more interfering factors. For example, obstructions such as tree branches can lead to missed detections, and under certain lighting conditions, water ripples can resemble the shape of a fish and cause false alarms.
To fulfill the task of high-precision dead fish detection on water surfaces in an open environment, this paper introduces a model based on the enhanced YOLOv8, known as Data-Driven Improved YOLOv8 (DD-IYOLOv8). The contributions of this paper’s method are primarily reflected in the following aspects:
(1) Deep learning-based object detection demands high-quality training datasets, yet there are currently few publicly available datasets specific to dead fish. Therefore, this study creates a dataset of dead fish photographed at a tilt angle in open scenes and incorporates prior knowledge by annotating targets such as dead fish, tree branches, and ripples.
(2) To address the issue of incorrect detections caused by the varying sizes of dead fish at different distances, this paper employs advanced dynamic snake-shaped convolution technology. This convolution technique autonomously generates kernels that closely resemble the shape of the detection targets, which can significantly enhance the effectiveness of feature extraction.
(3) The objective of dead fish detection is to identify all dead fish targets on the water surface as thoroughly as possible. However, distant dead fish appear very small in the images, making them easily overlooked. To tackle this challenge, this paper has specially integrated a small object detection module, aimed at significantly enhancing the model’s recognition and detection capabilities for small targets.
(4) In open scenes, there are not only dead fish targets but also other interfering objects. To address this issue, this paper introduces a Hybrid Attention Module (HAM), which enables the model to discern the critical characteristics of dead fish, thereby focusing more on the detection of dead fish targets and significantly enhancing the model’s precision.

2. Related Works

2.1. Object Detection Methods

Traditional detection of abnormal fish behavior relies on manual observation, which is time-consuming and labor-intensive and, being based on farmers' experience, carries a degree of uncertainty. Zhao et al. [3] enabled a network to detect and localize local anomalous behaviors in fish schools using corrected motion images and quantified these behaviors with a recurrent neural network. Hu et al. [4] proposed a deep learning-based method for real-time detection of fish behavior; using image enhancement techniques and an improved YOLOv3-Lite model, they achieved efficient detection of abnormal fish behaviors. Wang et al. [5], by fine-tuning the path aggregation network, found that the YOLOv5 model greatly improves sensitivity to small fish individuals in a school, significantly improving the identification of anomalous behavioral patterns in a fish community. The network proposed by Zhang and colleagues [6] takes video streams as input and uses EfficientNetV2 as its backbone, reducing parameters while maintaining high accuracy. Chen [7] proposed an anomaly detection method integrating time displacement and attention mechanisms, which effectively identifies anomalous behaviors. Yang [8] designed a fish behavior recognition method that fuses auditory and visual features under complex conditions, effectively capturing cross-modal complementary information and improving recognition accuracy and reliability. Hu et al. [9] used C3D deep neural networks to detect abnormal behaviors in fish behavior datasets collected under complex environmental conditions, demonstrating the practical potential of video surveillance for detecting abnormal fish behavior in aquaculture. Zheng et al. [10] proposed the Combining Attention and Brightness Adjustment Network, a deep learning network for underwater image restoration. Wageeh et al. [11] integrated a Retinex-based multiscale color improvement technique into the YOLO detection framework, enhancing the model's adaptability in underwater environments and addressing the challenges of underwater vision. Wang et al. [12] proposed a new method for identifying abnormal behavior in underwater fish, which can help identify abnormal behavior in aquatic organisms at an early stage and improve monitoring and response capabilities. In 2022, Zhao et al. [13] proposed a dead fish detection model based on YOLOv4 that incorporates lightweight deformable convolution and significantly improves detection accuracy. In 2024, Zhang et al. [14] introduced a model specifically designed for identifying dead juvenile fish, integrating an efficient channel attention mechanism to enhance the feature extraction capability of the YOLOv4 algorithm. Yang et al. [15] proposed a technique for identifying dead fish on the water surface in an open environment that exploits multiscale feature fusion and an attention mechanism to improve detection accuracy and speed.

2.2. Methods for Incorporating Prior Knowledge

Incorporating prior knowledge is currently a research hotspot: domain knowledge or existing experience is embedded within the model, helping it understand certain data patterns in advance. Yan et al. [16] summarize the state of research on interpretable prior-empowered technologies from the perspective of signal-processing priors and physical-knowledge priors commonly used in industrial diagnostics. Ding et al. [17] significantly improved the performance of a deep learning model in an object detection task by integrating prior color knowledge and scene knowledge. Qin et al. [18] integrated neighborhood prior-knowledge constraints for sample data fitting during model construction, which effectively improved diagnostic accuracy. Xie et al. [19] designed a prior-based algorithm for segmenting damaged areas of metal plating, which effectively avoided model overfitting. Incorporating prior knowledge is vital to the training and refinement of deep learning models, ensuring robust performance and efficiency. To enhance model performance and generalization, there are four main ways to introduce prior knowledge:
(1) Auxiliary Learning-based Prior Knowledge Incorporation: Auxiliary learning [24] is a method that enhances model performance by introducing auxiliary tasks related to the main task. Within a multitask learning framework, the model can learn multiple tasks simultaneously, sharing representations and transferring knowledge, thereby improving the performance of all tasks. Pre-training tasks can also be considered a form of auxiliary learning: by learning auxiliary tasks related to the main task, the model acquires more general and meaningful feature representations.
(2) Model Reproduction-based Prior Knowledge Incorporation: Model reproduction is a method that utilizes the knowledge from known models or theories to guide the learning of new models. Knowledge distillation [21] is an efficient model simplification technique that extracts and transfers deep knowledge from large, complex models into more streamlined ones, enabling the smaller models to inherit and replicate the core cognitive abilities of the teacher models.
(3) Input Data-based Prior Knowledge Incorporation: This method uses known information or assumptions about the input data to guide model learning. Data augmentation [22] is a common input data-based method, where the training data are transformed to allow the model to learn more diverse and rich features. Feature engineering [23] is another method that involves manually designing or selecting features that can represent the data characteristics, enabling the model to better understand the data.
(4) Class Activation Mapping (CAM) Activation Constraint-based Prior Knowledge Incorporation [20]: This method uses CAM diagrams to guide the model's attention to the correct image regions. By generating CAM diagrams and analyzing the image regions the model focuses on when making classification decisions, prior knowledge about which regions are important can be incorporated. This can be achieved by refining the loss function, integrating attention mechanisms, or adjusting the model architecture to ensure that the model focuses on the key elements of the classification task, as illustrated in the sketch following this list.
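As a concrete illustration of method (4), the following is a minimal PyTorch sketch, our own construction rather than code from the cited works, of a CAM activation constraint that penalizes activation energy falling outside an annotated prior mask. The function name, its arguments, and the masking scheme are hypothetical.

```python
import torch
import torch.nn.functional as F

def cam_constraint_loss(feats, fc_weight, cls_idx, prior_mask):
    """Penalize class-activation energy that falls outside a prior mask.

    feats:      (B, C, h, w) backbone feature maps
    fc_weight:  (num_classes, C) classifier weights used to form the CAM
    cls_idx:    target class index
    prior_mask: (B, 1, H, W) binary mask of regions known to matter
    """
    cam = torch.einsum("bchw,c->bhw", feats, fc_weight[cls_idx])  # raw CAM
    cam = F.relu(cam).unsqueeze(1)
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-6)       # scale to [0, 1]
    mask = F.interpolate(prior_mask, size=cam.shape[-2:], mode="nearest")
    return (cam * (1.0 - mask)).mean()                            # off-mask energy
```

Adding such a term to the training loss is one way to steer the model's attention toward annotated regions; refining attention modules or the architecture are alternatives mentioned above.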
The approach adopted in this paper leverages prior knowledge derived from input data, steering the model to learn pivotal features by identifying and annotating branches and ripple patterns in the dataset that resemble fish forms. Branches in the images can obscure dead fish, making identification difficult, and some branches look similar to dead fish, which can mislead the detection algorithm. By annotating branches, the model learns to distinguish branch characteristics from those of dead fish, significantly improving the precision and credibility of dead fish detection. Similarly, the shapes of water ripples can resemble dead fish and lead to incorrect judgments, so by manually annotating ripples, the model learns their features and accurately distinguishes the subtle differences between dead fish and ripples. This method enhances the model's ability to discriminate dead fish from similar-looking interference, reducing false positives in complex environments.

2.3. DD-IYOLOv8 Network Model Structure

In this study, we use YOLOv8n as the base model and enhance its feature extraction capability by integrating Dynamic Snake Convolution (DSConv) within the neck network. By enabling flexible adjustment of the receptive field, this change captures more comprehensive contextual information and improves the model's understanding of the scene. To tackle the problem of spotting small, distant dead fish, and to let the network identify and focus on these small-scale targets more effectively, we add a dedicated small-object detection layer, expanding the detection heads of YOLOv8n to four. Furthermore, to concentrate the network's attention on dead fish targets, we incorporate a Hybrid Attention Mechanism (HAM) in the later stages of the backbone network; this refinement of the global feature representation significantly improves detection precision. The improved network architecture is shown in Figure 1.

2.3.1. Feature Extraction: C2f_DySnakeConv

During the automatic patrol of the pond by the robotic boat, variation in the scale of fish targets at different distances can lead to missed detections or false positives, as well as misplaced and inaccurate bounding boxes, degrading detection accuracy. This paper adopts the dynamic snake-shaped convolution proposed by Qi et al. [25]. Offset values are integrated at every sampling point within the convolutional kernel, allowing sampling around the current location and boosting the model's capacity to recognize and process local features. This deviates from the regular grid sampling of standard convolution kernels and employs an iterative strategy to constrain the scope of the receptive field. The arrangement of sampling points in the dynamic snake-shaped convolution resembles elongated fish shapes, which better matches the shape and feature distribution of dead fish targets. Since the appearance and posture of dead fish may change with submersion time or water flow, the flexible sampling strategy effectively captures these variations, enhancing feature extraction for dead fish targets. A comparison of dynamic snake-shaped convolution with standard convolution and deformable convolution is shown in Figure 2.
Given the standard 2D convolution coordinates as $K$, with central coordinate $K_i = (x_i, y_i)$, a 3 × 3 kernel $K$ with dilation 1 is expressed as follows:

$$K = \{(x-1,\, y-1),\ (x-1,\, y),\ \ldots,\ (x+1,\, y+1)\} \tag{1}$$
To improve the convolutional kernel's adaptability to complex geometric structures, deformation offsets $\Delta$ are introduced. If the model learns these offsets without constraint, the receptive field may drift beyond the target area. To prevent this, an iterative strategy is adopted, as illustrated in Figure 3: observation points are determined sequentially for each target position, maintaining the continuity of attention and curbing excessive diffusion of the receptive field caused by large deformation offsets.
In the design of Dynamic Snake Convolution (DSConv), the convolutional kernel is straightened along the x-axis or the y-axis. Taking a kernel size of 9 as an example and observing it along the x-axis, the position of each grid in $K$ is denoted $K_{i \pm c} = (x_{i \pm c},\, y_{i \pm c})$, where $c = \{0, 1, 2, 3, 4\}$ indicates the lateral displacement from the kernel center. Determining the position $K_{i \pm c}$ of each grid is a sequential process: starting from the central position $K_i$, each position depends on the position of the preceding one, with $K_{i+1}$ adding an offset $\Delta = \{\delta \mid \delta \in [-1, 1]\}$ relative to $K_i$. The offsets are accumulated so that the sampling positions trace an orderly trajectory expanding outward from the kernel's core, keeping the kernel consistent with a linear, serpentine structure, as shown in Figure 3. In the x-axis direction, this gives Equation (2):
$$K_{i \pm c} = \begin{cases} (x_{i+c},\ y_{i+c}) = \left(x_i + c,\ \ y_i + \sum_{i}^{i+c} \Delta y\right) \\[4pt] (x_{i-c},\ y_{i-c}) = \left(x_i - c,\ \ y_i + \sum_{i-c}^{i} \Delta y\right) \end{cases} \tag{2}$$
In the y-axis direction, Equation (2) becomes Equation (3):
$$K_{j \pm c} = \begin{cases} (x_{j+c},\ y_{j+c}) = \left(x_j + \sum_{j}^{j+c} \Delta x,\ \ y_j + c\right) \\[4pt] (x_{j-c},\ y_{j-c}) = \left(x_j + \sum_{j-c}^{j} \Delta x,\ \ y_j - c\right) \end{cases} \tag{3}$$
Since the offsets are typically fractional, bilinear interpolation is used to ensure precise sampling:

$$K = \sum_{K'} B(K', K) \cdot K' \tag{4}$$

In Equation (4), $K$ represents a fractional position from Equations (2) and (3), $K'$ enumerates all integral spatial positions, and $B$ denotes the bilinear interpolation kernel, which can be decomposed into two one-dimensional kernels:

$$B(K', K) = b(K'_x, K_x) \cdot b(K'_y, K_y) \tag{5}$$
As depicted in Figure 4, DSConv leverages dynamic adjustments along the 2D plane (across the x-axis and y-axis) during the deformation process, enabling it to encompass an extensive area of 9 × 9. This design considerably extends the coverage of the convolutional kernel, which in turn boosts the model’s capability to recognize sophisticated characteristics. Figure 5 illustrates the process of dynamic snake-shaped convolution sampling.
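To make the sampling procedure of Equations (2)-(5) concrete, the following is a minimal PyTorch sketch of the x-axis case; it is our illustration, not the authors' code or the original DSConv implementation, and the class and parameter names are ours. A per-point offset is predicted, accumulated outward from the kernel center to keep the path continuous, and sampled with bilinear interpolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnakeConvX(nn.Module):
    """Serpentine sampling along the x-axis with cumulative y-offsets (Eq. (2))."""

    def __init__(self, in_ch, out_ch, k=9):
        super().__init__()
        assert k % 2 == 1
        self.k = k
        self.offset = nn.Conv2d(in_ch, k, 3, padding=1)  # one delta per sample point
        self.fuse = nn.Conv2d(in_ch * k, out_ch, 1)      # fuse the gathered samples

    def forward(self, x):
        B, C, H, W = x.shape
        dy = torch.tanh(self.offset(x))                  # delta in [-1, 1]
        c = self.k // 2
        # Accumulate offsets outward from the centre point so the sampling path
        # stays continuous -- the iterative constraint of Figure 3 / Equation (2).
        left = torch.flip(torch.cumsum(torch.flip(dy[:, :c], dims=[1]), dim=1), dims=[1])
        right = torch.cumsum(dy[:, c + 1:], dim=1)
        path = torch.cat([left, torch.zeros_like(dy[:, :1]), right], dim=1)

        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing="ij")
        samples = []
        for i in range(self.k):
            gx = (xs + (i - c) * 2.0 / max(W - 1, 1)).expand(B, -1, -1)  # unit x-steps
            gy = ys + path[:, i] * 2.0 / max(H - 1, 1)                   # learned wiggle
            grid = torch.stack([gx, gy], dim=-1)
            # grid_sample performs the bilinear interpolation of Eqs. (4)-(5).
            samples.append(F.grid_sample(x, grid, align_corners=True))
        return self.fuse(torch.cat(samples, dim=1))
```

A y-axis counterpart follows symmetrically by swapping the roles of the two coordinates.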

2.3.2. Small Target Detection Head

Small target detection has always been a challenge in object detection. In the context of dead fish detection, distant dead fish appear very small in the video, as shown in Figure 6. As the downsampling rate increases in the YOLOv8 feature extraction network, the deeper feature maps may lose the detail information of distant small fish, which is not conducive to capturing the characteristics of small target samples. This greatly affects the detection effect of distant small target fish. Therefore, this model adds a branch at the first C2f layer, with a feature map size of 160 × 160, allowing the model to obtain feature information of small targets at an earlier stage. By integrating feature maps of targets with varying sizes, this model not only enhances its sensitivity to the fine details of small targets but also effectively harnesses the abundant semantic information from higher-level feature maps and the detailed features from lower-level feature maps, thereby comprehensively improving the detection capabilities for both large and small targets.
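As a sketch of this idea, a stride-4 feature map (160 × 160 for a 640 × 640 input) can be fused with the upsampled stride-8 path and given its own prediction head. This is a minimal illustration under assumed channel counts, not the actual DD-IYOLOv8 layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallObjectBranch(nn.Module):
    """Fuse a stride-4 map with the upsampled stride-8 map and attach a
    fourth prediction head to it (channel counts are hypothetical)."""

    def __init__(self, c_p2=64, c_p3=128, out_ch=65):
        super().__init__()
        self.reduce = nn.Conv2d(c_p3, c_p2, 1)             # match P3 channels to P2
        self.fuse = nn.Conv2d(c_p2 * 2, c_p2, 3, padding=1)
        self.head = nn.Conv2d(c_p2, out_ch, 1)             # box + class outputs

    def forward(self, p2, p3):
        # Shallow detail (P2) meets deeper semantics (upsampled P3), so distant,
        # small dead fish keep enough resolution to be detected.
        up = F.interpolate(self.reduce(p3), scale_factor=2, mode="nearest")
        return self.head(self.fuse(torch.cat([p2, up], dim=1)))
```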

2.3.3. Hybrid Attention Mechanism

During the training process of a network for dead fish target detection, the input images may contain not only dead fish targets but also background noise such as branches and ripples. These noise factors can increase the difficulty of the detection task. To enhance the network’s recognition capability for dead fish targets, this paper integrates the HAM [26] (Hybrid Attention Module) following the SPPF module. The introduction of attention mechanisms is designed to enhance the network model structure’s concentration on dead fish targets, enabling the network to probe deeper into the key features of dead fish and effectively suppress irrelevant background noise, thereby significantly enhancing the precision of dead fish target detection. Compared to other attention mechanisms, HAM constructs its structure by sequentially linking a channel attention module with a spatial attention module.
This arrangement skillfully captures the regional correlations present in the input image, greatly enriching the expressiveness of the image features, and effectively enhancing the efficiency and accuracy of the feature learning and extraction process. The HAM utilizes efficient one-dimensional convolutional operations to alleviate the computational load on channel attention mechanisms and employs channel separation techniques to dynamically emphasize key features, thereby enhancing the model’s responsiveness to target characteristics as shown in Figure 7.
As shown in Figure 8, suppose the input feature is $F \in \mathbb{R}^{H \times W \times C}$. The process starts by aggregating the spatial information of the features using average and maximum pooling, which transform the input into two feature tensors. The first, the average-pooled tensor $F_C^{avg}$, captures global spatial information by averaging the values in each region of the feature map; the second, the max-pooled tensor $F_C^{max}$, focuses on the most salient features by extracting the maximum value in each region. These pooling operations retain important feature information in a condensed, useful representation for subsequent processing.

The two tensors are then fed into an adaptive mechanism module to extract a rich fused feature $F_C^{add} \in \mathbb{R}^{1 \times 1 \times C}$. The adaptive module has two adjustable parameters, $\alpha$ and $\beta$, each ranging from 0 to 1, which together form a flexible feature-processing tool that optimizes feature representation during training; its internal process is given in Equation (6). By introducing $\alpha$ and $\beta$, a dynamic adaptive weighting is established between the average-pooled and max-pooled tensors, enhancing feature representation during image feature extraction.

Finally, a fast 1D convolution is applied. To capture interactions across channels, the kernel size of the 1D convolution is set to $k$, calculated as in Equation (7), where $|t|_{odd}$ denotes the odd number nearest to $t$, $C$ denotes the number of channels, and $\gamma$ and $b$ are adjustable hyperparameters, generally set to 2 and 1. Through this mapping function $\phi$, the kernel size adapts flexibly to the number of channels $C$. After $F_C^{add}$ passes through the 1D convolution, the sigmoid function activates the resulting tensor; the channel attention computation is summarized in Equation (8), where $\sigma$ is the sigmoid activation and $C1D_{1 \times k}$ denotes a one-dimensional convolution with kernel size $k$. Multiplying $A_C(F)$ with the original input tensor $F$ yields the refined channel feature $F'$.
$$F_C^{add} = \frac{1}{2}\left(F_C^{avg} + F_C^{max}\right) + \alpha F_C^{avg} + \beta F_C^{max} \tag{6}$$
$$k = \phi(C) = \left|\frac{\log_2 C}{\gamma} + \frac{b}{\gamma}\right|_{odd} \tag{7}$$
$$A_C(F) = \sigma\!\left(C1D_{1 \times k}\!\left(\frac{1}{2}\big(\mathrm{AvgPool}(F) + \mathrm{MaxPool}(F)\big) + \alpha\,\mathrm{AvgPool}(F) + \beta\,\mathrm{MaxPool}(F)\right)\right) \tag{8}$$
As shown in Figure 9, when the refined channel feature $F'$ passes through the spatial attention submodule, the channel dimension of the important features is determined by multiplying the separation ratio $\lambda$ by the channel dimension of $F'$. $\lambda$ serves as the boundary between the important and less important channel groups and takes the value 0.6 in this experiment. The channel dimension of the important features is rounded to the nearest even number, denoted $C_{im}$, calculated as in Equation (9):
$$C_{im} = \left| C_{CR} \cdot \lambda \right|_{even} \tag{9}$$
where $C_{CR}$ designates the channel dimension of the refined feature $F'$, and $|t|_{even}$ represents the even number nearest to $t$. $C_{im}$ identifies the top $n$ highest values in the channel attention map, corresponding to the $n$ most important channels in the refined features. The values in the channel attention map are divided into two parts: values greater than or equal to the $n$th largest value form the important part, and the remaining values form the sub-important part. Two masks matching the shape of the channel attention map are defined, with mask values of 1 for the important part and 0 for the sub-important part. Multiplying the two masks with the refined channel feature $F'$ divides it into an important channel group $F_1$ and a less important channel group $F_2$. This channel isolation maintains the original order of channels within the refined feature, satisfying Equation (10):
$$F_1 \cup F_2 = F' \tag{10}$$
Then, average pooling and maximum pooling are used to aggregate the channel-dimension information of $F_1$ and $F_2$, generating two pairs of 2D maps: $F_{S,1}^{avg} \in \mathbb{R}^{H \times W \times 1}$ and $F_{S,1}^{max} \in \mathbb{R}^{H \times W \times 1}$, and $F_{S,2}^{avg} \in \mathbb{R}^{H \times W \times 1}$ and $F_{S,2}^{max} \in \mathbb{R}^{H \times W \times 1}$. When pooling $F_1$, the channel dimension is $C_{im}$; when pooling $F_2$, it is $C_{CR} - C_{im}$. Once pooling is complete, the outputs are concatenated, generating a comprehensive set of feature descriptors that integrates abstract features from different levels and provides richer information for subsequent processing. Each pair of concatenated descriptors is then processed by a shared 7 × 7 convolutional layer, generating a pair of two-dimensional attention maps. Finally, these maps undergo normalization and activation to produce the spatial attention maps $A_{S,1} \in \mathbb{R}^{H \times W \times 1}$ and $A_{S,2} \in \mathbb{R}^{H \times W \times 1}$, a crucial step for precisely capturing the key information regions of the image. The spatial attention computation is summarized in Equations (11) and (12):
$$A_{S,1}(F) = \varphi\!\left(C2D_{7 \times 7}\big([\mathrm{AvgPool}(F_1);\ \mathrm{MaxPool}(F_1)]\big)\right) = \varphi\!\left(C2D_{7 \times 7}\big([F_{S,1}^{avg};\ F_{S,1}^{max}]\big)\right) \tag{11}$$

$$A_{S,2}(F) = \varphi\!\left(C2D_{7 \times 7}\big([\mathrm{AvgPool}(F_2);\ \mathrm{MaxPool}(F_2)]\big)\right) = \varphi\!\left(C2D_{7 \times 7}\big([F_{S,2}^{avg};\ F_{S,2}^{max}]\big)\right) \tag{12}$$
where $\varphi$ signifies a sequence of nonlinear transformations applied to the spatial attention map. The ReLU is utilized to discard negative entries within the spatial attention map, concentrating exclusively on aspects that contribute positively to the ultimate classification result. $C2D_{7 \times 7}$ refers to a shared convolutional layer with a kernel size of 7 × 7.
The spatial attention sub-module creates two spatially refined features by individually multiplying the spatial attention maps A S , 1 and A S , 2 with their corresponding group features F 1 and F 2 . The resulting refined feature is produced by summing the two spatially refined feature pairs.
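To make the channel-attention step concrete, the following is a minimal PyTorch sketch of Equations (6)-(8); it is our illustration, not the reference implementation of HAM [26]. The class name is ours, the spatial submodule and channel separation (Equations (9)-(12)) are omitted, and treating $\alpha$ and $\beta$ as unconstrained learnable scalars is a simplification of the [0, 1] range stated in the text.

```python
import math
import torch
import torch.nn as nn

class HAMChannelAttention(nn.Module):
    """Channel attention of Eqs. (6)-(8): pooled features are adaptively fused,
    passed through a fast 1D convolution, and squashed by a sigmoid."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                      # nearest odd number, Eq. (7)
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # adaptive weights (the text
        self.beta = nn.Parameter(torch.tensor(0.5))    # restricts them to [0, 1])

    def forward(self, x):                              # x: (B, C, H, W)
        avg = x.mean(dim=(2, 3))                       # average pooling -> (B, C)
        mx = x.amax(dim=(2, 3))                        # max pooling -> (B, C)
        fused = 0.5 * (avg + mx) + self.alpha * avg + self.beta * mx   # Eq. (6)
        a = torch.sigmoid(self.conv(fused.unsqueeze(1))).squeeze(1)    # Eq. (8)
        return x * a.view(x.size(0), -1, 1, 1)         # refined channel feature F'
```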

3. Collection and Construction of Dataset

The dead fish images used in this study were collected from the Nansha Seagull Island Aquaculture Base in Guangzhou, which mainly cultivates aquatic products such as shrimp, crab, shellfish, and fish. The research focuses on dead fish that are multiscale and partially obscured. A total of 958 images were photographed and collected, each containing one or more dead fish. The dataset was partitioned into training and validation sets in a 7:3 ratio, with 670 training images and 288 validation images, and was annotated using LabelMe.

The Mosaic technique is primarily employed for data augmentation; it allows the model to learn from a variety of objects and backgrounds within a single composite image, enhancing its ability to detect small targets. The RandomHSV technique randomly adjusts the hue, saturation, and brightness of images, which diversifies the color palette and improves the model's sensitivity to color variations. Additionally, the flipping operation in RandomFlip augments the model's understanding of the target object across orientations. As shown in Table 1, two sets of comparative experiments demonstrate the effectiveness of these data augmentation methods: all four evaluation indicators improve with augmentation, so this paper adopts data augmentation in its experiments. As shown in Figure 10, the dataset is prepared in two versions: Dataset 1 labels only dead fish, while Dataset 2 labels dead fish, tree branches, and ripple patterns similar in shape to fish. Figures 11 and 12 show the annotation details of the two versions.
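The following is a minimal OpenCV sketch of the HSV jitter and horizontal flip described above (Mosaic is omitted for brevity); the function names and gain values are illustrative assumptions in the style of common YOLO training pipelines, not the exact pipeline used here.

```python
import cv2
import numpy as np

def random_hsv(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly scale hue, saturation and value channels of a BGR image."""
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    h, s, v = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))
    lut_h = ((np.arange(256) * r[0]) % 180).astype(np.uint8)   # OpenCV hue is 0-179
    lut_s = np.clip(np.arange(256) * r[1], 0, 255).astype(np.uint8)
    lut_v = np.clip(np.arange(256) * r[2], 0, 255).astype(np.uint8)
    hsv = cv2.merge((cv2.LUT(h, lut_h), cv2.LUT(s, lut_s), cv2.LUT(v, lut_v)))
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def random_flip(img, boxes, p=0.5):
    """Horizontally flip the image and its normalized xyxy boxes with prob. p."""
    if np.random.rand() < p:
        img = img[:, ::-1].copy()
        boxes = boxes.copy()
        boxes[:, [0, 2]] = 1.0 - boxes[:, [2, 0]]   # mirror x1/x2 around the centre
    return img, boxes
```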

4. Experimental Results and Analysis

4.1. Experimental Environment and Evaluation Metrics

The experiments were conducted on Ubuntu 18 with an NVIDIA GeForce RTX 3090 GPU, using Python 3.8 and PyTorch 2.1.0 with CUDA 11.8. To select reasonable training parameters, we compared three commonly used learning rates, as shown in Table 2.
Through experimental data analysis, we found that a learning rate of 0.01 yields better model training effects. To ensure a fair comparison with other models, we used the same parameters and equipment in our experiments: an input image size of 640 × 640, a batch size of 16, a learning rate of 0.01, and a momentum of 0.937.
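For reproducibility, the stated hyperparameters map directly onto an Ultralytics-style training call. The sketch below is an assumption of how such a run could be configured; the model definition and the deadfish.yaml dataset file are placeholders, not the authors' files.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")  # a DD-IYOLOv8 model definition would replace this
model.train(
    data="deadfish.yaml",     # hypothetical dataset config (958 annotated images)
    imgsz=640,                # input image size 640 x 640
    batch=16,                 # batch size
    lr0=0.01,                 # initial learning rate chosen in Table 2
    momentum=0.937,           # SGD momentum
)
```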
When assessing the efficacy of object detection methodologies, choosing the appropriate measures is essential. This study employs precision–recall (PR) as the primary evaluation metric. The correctness of positive case detection (precision, P) is calculated by the fraction of genuinely positive instances within the sum of cases labeled positive by the model, showing the model’s exactness in pinpointing positive occurrences. Recall, on the other hand, measures the ratio of true positive instances that the model accurately detects, reflecting the model’s comprehensive capability to retrieve all positive instances. The calculations are meticulously detailed in Equations (13) and (14); in this context, TP (true positive) represents the number of targets correctly identified, FP (false positive) denotes the number of targets falsely recognized, and FN (false negative) indicates the number of missed actual targets.
$$P = \frac{TP}{TP + FP} \tag{13}$$

$$R = \frac{TP}{TP + FN} \tag{14}$$
In addition, commonly used metrics for measuring the performance of object detection algorithms are average precision (AP) and the F1 score. AP equals the area under the precision–recall curve, while F1 is the harmonic mean of precision and recall and measures the model's balance between the two. The calculations are shown in Equations (15) and (16):
$$AP = \int_0^1 P(R)\, dR \tag{15}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{16}$$
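As a worked check of Equations (13)-(16), the sketch below computes the metrics from raw detection counts; the counts are hypothetical values chosen to reproduce the reported 92.8% precision and 89.4% recall.

```python
import numpy as np

def detection_metrics(tp, fp, fn):
    p = tp / (tp + fp)                    # precision, Eq. (13)
    r = tp / (tp + fn)                    # recall, Eq. (14)
    f1 = 2 * p * r / (p + r)              # harmonic mean, Eq. (16)
    return p, r, f1

def average_precision(recall, precision):
    """Area under the PR curve (Eq. (15)), by trapezoidal integration."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(precision)[order], np.asarray(recall)[order])

# Hypothetical counts chosen to reproduce the reported values:
p, r, f1 = detection_metrics(tp=893, fp=69, fn=106)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")   # P=0.928, R=0.894, F1=0.911
```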

4.2. Experimental Comparison with Prior Knowledge Integration

To verify the effect of integrating prior knowledge into the dataset on the DD-IYOLOv8 dead fish detection model, comparative experiments were conducted using Dataset 1 and Dataset 2 under the same experimental conditions. The results of the experiments are displayed in Table 3 below.
As indicated by the experimental findings in Table 3, when the method of integrating prior knowledge is adopted, the model’s recall rate increases by 7.8%, the average precision increases by 1.6%, and the F1 increases by 3.5%. This suggests that the model has acquired the characteristics of dead fish targets, tree branches, and ripples, making the model more cautious in identifying dead fish targets and avoiding incorrectly categorizing samples from the other two categories as dead fish, thus improving the recall rate. When using the method of integrating prior knowledge, the model’s precision decreases by 1.8%, but it still meets the requirements of the automatic patrolling robot boat.

4.3. Comparative Experiments with Different Models

In this paper, five representative object detection methods are chosen for comparison with DD-IYOLOv8: the classical two-stage Faster R-CNN [27], YOLOv5 with the smallest number of model parameters, YOLOv7 [28] with its extended efficient layer aggregation networks, the baseline YOLOv8, and the latest YOLOv10 [29]. Considering practicality, all comparison models use their smallest-parameter versions. As shown in Table 3, integrating prior knowledge enhances the model's ability to identify dead fish targets, so the following experiments all use Dataset 2 with prior knowledge integration.
As shown in the experimental data in Table 4, the Faster R-CNN model has a precision of 0.732, but its recall of 0.463 is relatively low, and its large number of parameters makes it difficult to embed in hardware devices. The YOLOv5 model performs best in precision, achieving 0.934, demonstrating high accuracy in the dead fish detection task; however, its recall of 0.873 suggests some limitations in covering all dead fish instances. YOLOv7 shows moderate precision and recall, at 0.837 and 0.767. The latest YOLOv10 model performs moderately in precision and recall, at 0.907 and 0.815, respectively; although relatively balanced, it is still inferior to DD-IYOLOv8 in both. In comparison, the DD-IYOLOv8 model performs better in the dead fish detection task, with a precision of 0.928, a recall of 0.894, an AP of 0.917, and an F1 of 0.911. Compared with the Faster R-CNN, YOLOv5, YOLOv7, and YOLOv8 models in Table 4, its AP is higher by 9.0, 0.2, 9.6, and 3.3 percentage points, and its F1 by 33.7, 0.9, 11.1, and 6.6 percentage points, respectively. This indicates that DD-IYOLOv8 maintains high precision while also achieving high recall, realizing a more balanced performance.
To visually showcase the enhanced recognition capability of the model, the recognition outcomes were visualized. An experimental comparison is conducted among the six models, and one detection result is selected from each model for display as a validation image. As shown in Figure 13, Faster R-CNN can accurately detect two dead fish targets in the image; YOLOv5 can detect one dead fish target with a confidence of 0.90, but there are obvious omissions. YOLOv7 detects two dead fish targets with confidences of 0.85 and 0.53, but the partially blurred target's confidence of 0.53 is low compared with DD-IYOLOv8. YOLOv8 detects two dead fish targets with confidences of 0.74 and 0.29, which are relatively low. YOLOv10 detects one dead fish target with a confidence of 0.86, with obvious omissions. DD-IYOLOv8 detects two dead fish targets with higher confidences of 0.78 and 0.72, without any omission or misdetection. Compared with the other five models, the performance of DD-IYOLOv8 is more balanced.
The precision–recall curve is a tool for evaluating the performance of classification models. It is used to demonstrate the relationship between the precision and recall of a model under different threshold settings. The closer the PR curve is to the upper right corner of the graph, the better the performance. The PR curve graph is of great significance for understanding the performance of the model and making trade-off decisions. In Figure 14, we chose to display the curve under the IOU = 0.5 threshold, which intuitively shows that the PR curve of DD-IYOLOv8 is closer to the upper right corner compared to other models, indicating both high accuracy and high recall.

4.4. Comparative Experiments in Different Scenes

To more vividly illustrate the improved performance of the proposed method in detecting dead fish on the water surface, this study compares the detection results of six algorithms under various conditions, with the findings presented in Figure 15. In large-object scenarios, Faster R-CNN accurately detects the target; YOLOv5 detects the target but with a localization error, predicting two boxes where there is only one dead fish; YOLOv7 fails to detect the target; YOLOv8 and YOLOv10 accurately detect the target; and DD-IYOLOv8 accurately detects and localizes the target. In blurred scenarios, Faster R-CNN fails to accurately detect dead fish targets, while YOLOv5, YOLOv7, YOLOv10, and DD-IYOLOv8 all detect the target, with DD-IYOLOv8 achieving a high confidence of 0.79. In occluded scenarios, Faster R-CNN fails to detect the target; YOLOv5 and YOLOv8 both produce false detections; YOLOv7 predicts one dead fish box, but it is a false detection; YOLOv10 accurately detects the target; and DD-IYOLOv8 detects the target. In scenarios with multiple small targets, Faster R-CNN falsely detects one target and misses the distant ones; YOLOv5 misses both targets; YOLOv7 and YOLOv8 detect nearby targets but miss distant ones; and DD-IYOLOv8 accurately detects both targets. In small-object scenarios, Faster R-CNN detects dead fish targets but with a false detection; YOLOv5, YOLOv7, YOLOv8, YOLOv10, and DD-IYOLOv8 all detect small targets, with DD-IYOLOv8 achieving a high confidence of 0.75. Compared with the other five models, DD-IYOLOv8 shows more comprehensive and accurate detection performance across scenarios, especially for small targets.

4.5. Ablation Experiments

To confirm the potential of the enhanced components in the model’s detection capabilities, ablation studies were carried out to assess the specific contributions of each module to the model’s detection performance. Using YOLOv8 as the baseline model, the impacts of various modules on detection outcomes were assessed under identical experimental conditions. The findings of the refined ablation study are meticulously presented in Table 5.
By comparing the results of YOLOv8 and YOLOv8-A, it can be observed that after using the DySnakeConv structure for feature fusion, the model's precision increases by 3.9%, the recall rises significantly by 4.6%, the average precision by 2.2%, and the F1 score by 4.3%. To visually demonstrate the enhanced feature extraction capability after integrating the DySnakeConv module, the detection results are visualized in Figure 16. Visualization of the feature maps generated by the C2f_DySnakeConv layer indicates that employing DySnakeConv in place of traditional convolutions captures the shape characteristics of dead fish more precisely, confirming that the module significantly improves feature extraction. Comparing YOLOv8-A with YOLOv8-B, the integration of the small target detection module raises the recall by 2.4 percentage points, the average precision by 1.2%, and the F1 by 0.8%, highlighting the positive impact of detecting less conspicuous targets on overall performance. This indicates that the small target detection module substantially boosts the network's capacity to identify dead fish at a distance, effectively reducing missed detections. Comparing YOLOv8-B with the method introduced in this paper, integrating the HAM attention mechanism improves precision by 3.3% and the F1 by 1.5%, further validating the effectiveness of the improvements. This shows that the HAM attention mechanism can constrain global features and enhance the model's focus on dead fish targets, further improving the precision of dead fish identification. The complete DD-IYOLOv8 model achieves a precision of 92.8%, a recall of 89.4%, an AP of 91.7%, and an F1 of 91.1%, substantiating that the optimized model is effective and viable in practical applications.
To vividly illustrate the enhanced recognition capabilities of the model, this paper employs the GradCAM [30] heatmap technique to visually present the recognition outcomes. An experimental comparison is conducted among four models, and one detection result is selected from each model for display as a validation image.
As depicted in Figure 17, integrating the DySnakeConv module in YOLOv8-A significantly enhances the model's ability to concentrate accurately on the features of dead fish targets, substantially improving feature extraction. After the introduction of the small target detection head, YOLOv8-B effectively reduces background noise, enabling the model to more accurately identify small-sized targets and partially occluded dead fish, thereby enhancing detection accuracy. The final incorporation of the HAM attention module refines the global feature constraints, allowing the model to focus precisely and effectively on dead fish targets. These enhancements significantly improve the dead fish identification model, confirming the feasibility and practicality of the improvements presented in this study.

5. Conclusions

This paper aims to achieve precise identification of dead fish targets in open fish pond environments. A dataset was constructed using images of dead fish captured at tilt angles in an open setting, and a prior knowledge method based on the dataset's input data was introduced. This strategy enables the model to discern the differences between dead fish targets and the natural environment, significantly enhancing the accuracy and reliability of dead fish detection. Building on this, the paper proposes the DD-IYOLOv8 model architecture, a further optimization of YOLOv8. The model's neck integrates dynamic snake-shaped convolution, improving the efficiency of feature extraction from targets of varying sizes. To boost responsiveness to small, distant targets, a specialized small-object detection layer is incorporated, further refining detection capabilities. HAM is embedded at the terminal section of the model's backbone to intensify the focus on dead fish targets. The DD-IYOLOv8 model shows strong performance across all metrics: a precision of 92.8%, a recall of 89.4%, an average precision of 91.7%, and an F1 of 91.1%, meeting the performance requirements for automatic pond surveillance robotic boats. Future research can explore the detection of densely packed objects and the recognition of submerged dead fish.

Author Contributions

Conceptualization, S.L.; methodology, J.Z.; investigation, Y.F. and J.Z.; software, Y.F. and J.Z.; validation, J.L.; writing—original draft preparation, Y.F. and R.Z.; writing—review and editing, J.Z. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported partly by the Natural Science Foundation of Guangdong Province under grant 2022B1515120059, Innovation Team Project of Universities in Guangdong Province under grant 2021KCXTD019, Science and Technology Planning Project of Yunfu under grants 2023020202 and 2023020203, Science and Technology Program of Guangzhou under grants 2023E04J1238 and 2023E04J1239, Guangdong Science and Technology Project under grant 2020B0202080002, Major Science and Technology Special Projects in Xinjiang Uygur Autonomous Region under grant 2022A02011, Undergraduate Teaching Quality Project in Guangdong Province: Teaching and Research Section of Artificial Intelligence Curriculum Group (Guangdong Higher Education Letter [2024] No. 9), and Guangdong Postgraduate Education Innovation Plan Project (No. 2024JGXM_090).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available upon reasonable request to the corresponding author.

Acknowledgments

The authors acknowledge any support given which is not covered by the Author Contributions or Funding sections.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Zhang, M.; Pan, S.; Chen, Y.; Deng, Y. Development status, problems and countermeasures of blue granary industry in China. Hubei Agric. Sci. 2023, 62, 214–219.
2. Bao, Z. Marine Ranching: Paving the way for a sustainable blue granary. Anim. Res. One Health 2024, 2, 119–120.
3. Zhao, J.; Bao, W.; Zhang, F.; Zhu, S.; Liu, Y.; Lu, H.; Shen, M.; Ye, Z. Modified motion influence map and recurrent neural network-based monitoring of the local unusual behaviors for fish school in intensive aquaculture. Aquaculture 2018, 493, 165–175.
4. Hu, J.; Zhao, D.; Zhang, Y.; Zhou, C.; Chen, W. Real-time nondestructive fish behavior detecting in mixed polyculture system using deep-learning and low-cost devices. Expert Syst. Appl. 2021, 178, 115051.
5. Wang, H.; Zhang, S.; Zhao, S.; Wang, Q.; Li, D.; Zhao, R. Real-time detection and tracking of fish abnormal behavior based on improved YOLOV5 and SiamRPN++. Comput. Electron. Agric. 2022, 192, 106512.
6. Zhang, Z.; Shen, Y.; Zhang, Z. Recognition of Feeding Behavior of Fish Based on Motion Feature Extraction and 2D Convolution. Trans. Chin. Soc. Agric. Mach. 2024, 55, 246–253.
7. Chen, X. The Method of Fish Abnormal Behavior Detection Based on Deep Learning. Master's Thesis, Shanghai Ocean University, Shanghai, China, 2024.
8. Yang, Y. Fish Behavior Recognition Method Based on Acoustic and Visual Features Fusion under Complex Conditions. Master's Thesis, Dalian Ocean University, Dalian, China, 2024.
9. Hu, Z.; Li, X.; Xie, X.; Zhao, Y. Abnormal Behavior Recognition of Underwater Fish Body Based on C3D Model. In Proceedings of the 2022 6th International Conference on Machine Learning and Soft Computing, Haikou, China, 15–17 January 2022.
10. Zheng, J.; Zhao, R.; Yang, G.; Liu, S.; Zhang, Z.; Fu, Y.; Lu, J. An Underwater Image Restoration Deep Learning Network Combining Attention Mechanism and Brightness Adjustment. J. Mar. Sci. Eng. 2024, 12, 7.
11. Wageeh, Y.; Mohamed, H.E.-D.; Fadl, A.; Anas, O.; El Masry, N.; Nabil, A.; Atia, A. YOLO fish detection with Euclidean tracking in fish farms. J. Ambient Intell. Humaniz. Comput. 2021, 12, 5–12.
12. Wang, J.-H.; Lee, S.-K.; Lai, Y.-C.; Lin, C.-C.; Wang, T.-Y.; Lin, Y.-R.; Hsu, T.-H.; Huang, C.-W.; Chiang, C.-P. Anomalous Behaviors Detection for Underwater Fish Using AI Techniques. IEEE Access 2020, 8, 224372–224382.
13. Zhao, S.; Zhang, S.; Lu, J.; Wang, H.; Feng, Y.; Shi, C.; Li, D.; Zhao, R. A lightweight dead fish detection method based on deformable convolution and YOLOV4. Comput. Electron. Agric. 2022, 198, 107098.
14. Zhang, P.; Zheng, J.; Gao, L.; Li, P.; Long, H.; Liu, H.; Li, D. A novel detection model and platform for dead juvenile fish from the perspective of multi-task. Multimed. Tools Appl. 2024, 83, 24961–24981.
15. Yang, S.; Li, H.; Liu, J.; Fu, Z.; Zhang, R.; Jia, H. A Method for Detecting Dead Fish on Water Surfaces Based on Multi-scale Feature Fusion and Attention Mechanism. J. Zhengzhou Univ. (Nat. Sci. Ed.) 2024, 56, 32–38.
16. Yan, R.; Shang, Z.; Wang, Z.; Xu, W.; Zhao, Z.; Wang, S.; Chen, X. Challenges and Opportunities of XAI in Industrial Intelligent Diagnosis: Priori-empowered. J. Mech. Eng. 2024, 60, 1–20.
17. Ding, X.; Luo, Y.; Li, Q.; Cheng, Y.; Cai, G.; Munnoch, R.; Xue, D.; Yu, Q.; Zheng, X.; Wang, B. Prior knowledge-based deep learning method for indoor object recognition and application. Syst. Sci. Control Eng. 2018, 6, 249–257.
18. Qin, S.; Liu, H.; Chen, L.; Zhang, L. Outlier detection algorithms for penetration depth data of concrete targets combined with prior knowledge. Combust. Explos. Shock Waves 2024, 44, 70–79.
19. Xie, Z.; Shu, C.; Fu, Y.; Zhou, J.; Jiang, J.; Chen, D. Knowledge-Driven Metal Coating Defect Segmentation. J. Electron. Sci. Technol. China 2024, 53, 76–83.
20. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
21. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39.
22. Fawzi, A.; Samulowitz, H.; Turaga, D.; Frossard, P. Adaptive data augmentation for image classification. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3688–3692.
23. Reid Turner, C.; Fuggetta, A.; Lavazza, L.; Wolf, A.L. A conceptual basis for feature engineering. J. Syst. Softw. 1999, 49, 3–15.
24. Liu, S.; Davison, A.J.; Johns, E. Self-Supervised Generalisation with Meta Auxiliary Learning. arXiv 2019, arXiv:1901.08933.
25. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6070–6079.
26. Li, G.; Fang, Q.; Zha, L.; Gao, X.; Zheng, N. HAM: Hybrid attention module in deep convolutional neural networks for image classification. Pattern Recognit. 2022, 129, 108785.
27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
28. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
29. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
30. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
Figure 1. DD-IYOLOv8 network model structure.
Figure 2. (a) Standard convolution sampling; (b) dilated convolution sampling; (c) deformable convolution sampling; and (d) dynamic snake-shaped convolution sampling.
Figure 3. Schematic diagram of DSConv coordinate calculation.
Figure 4. Receptive field of DSConv.
Figure 5. Dynamic snake-shaped convolution sampling process.
Figure 6. Small target dead fish.
Figure 7. HAM (Hybrid Attention Module).
Figure 8. Channel attention submodule.
Figure 9. Spatial attention submodule.
Figure 10. (a) Original image; (b) image after data augmentation.
Figure 11. Dataset with only dead fish labeled.
Figure 12. Dataset labeled with dead fish, tree branches, and ripple patterns.
Figure 13. Detection effect images of various models.
Figure 14. PR curves of different models.
Figure 15. Comparative experiments in different scenes.
Figure 16. Feature visualization results. (a) Visualization of the feature maps extracted by the original C2f layer. (b) Visualization of the feature maps extracted by the C2f_DySnakeConv layer.
Figure 17. Ablation experiment heatmap.
Table 1. Data augmentation comparison experiment.

Setting                 Precision   Recall   AP      F1
No data augmentation    0.890       0.800    0.874   0.842
Data augmentation       0.928       0.894    0.917   0.911
Table 2. Comparison experiment of learning rate.

Learning rate   Precision   Recall   AP      F1
lr = 0.1        0.825       0.700    0.809   0.757
lr = 0.01       0.928       0.894    0.917   0.911
lr = 0.001      0.863       0.861    0.897   0.862
Table 3. Comparative experimental results with and without prior knowledge integration.

Model        Dataset     Precision   Recall   AP      F1
DD-IYOLOv8   Dataset 1   0.946       0.816    0.901   0.876
DD-IYOLOv8   Dataset 2   0.928       0.894    0.917   0.911
Table 4. Comparative experimental results of different models.

Model          Precision   Recall   AP      F1      Params/MB
Faster R-CNN   0.732       0.463    0.827   0.574   495
YOLOv5n        0.934       0.873    0.915   0.902   3.9
YOLOv7-tiny    0.837       0.767    0.821   0.800   12.3
YOLOv8n        0.865       0.827    0.884   0.845   6.3
YOLOv10n       0.907       0.815    0.882   0.858   5.8
DD-IYOLOv8     0.928       0.894    0.917   0.911   7.5
Table 5. Results of ablation experiments.

Model        DySnakeConv   Detection Head   HAM   Precision   Recall   AP      F1
YOLOv8n      –             –                –     0.865       0.827    0.884   0.845
YOLOv8-A     ✓             –                –     0.904       0.873    0.906   0.888
YOLOv8-B     ✓             ✓                –     0.895       0.897    0.918   0.896
DD-IYOLOv8   ✓             ✓                ✓     0.928       0.894    0.917   0.911
