DiffusionFR: Species Recognition of Fish in Blurry Scenarios via Diffusion and Attention

Simple Summary Blurry scenarios often affect the clarity of fish images, posing significant challenges to deep learning models in terms of the accurate recognition of fish species. DiffusionFR, a deep learning method combining a diffusion model and an attention mechanism, is proposed herein to improve the accuracy of fish species recognition in blurry scenarios caused by light reflections and water ripple noise. Extensive experiments on a self-constructed dataset, BlurryFish, showed that the proposed two-stage diffusion network model can restore the clarity of blurry fish images to some extent and that the proposed learnable attention module is effective in improving the accuracy of fish species recognition. Abstract Blurry scenarios, such as light reflections and water ripples, often degrade the clarity and signal-to-noise ratio of fish images, posing significant challenges for traditional deep learning models in accurately recognizing fish species. Firstly, deep learning models rely on a large amount of labeled data, yet data in blurry scenarios are often difficult to label. Secondly, existing deep learning models handle bad, blurry, and otherwise inadequate images poorly, which is an essential reason for their low recognition rates. A method based on a diffusion model and an attention mechanism, DiffusionFR, is proposed to solve these problems and improve the performance of species recognition in blurry fish images. In DiffusionFR, a two-stage diffusion network model, TSD, is designed to deblur bad, blurry, and otherwise inadequate fish scene pictures and restore their clarity, and a learnable attention module, LAM, is designed to improve the accuracy of fish recognition.
In addition, a new dataset of fish images in blurry scenarios, BlurryFish, was constructed from bad, blurry, and otherwise inadequate images drawn from the publicly available dataset Fish4Knowledge and used to validate the effectiveness of DiffusionFR. The experimental results demonstrate that DiffusionFR achieves outstanding performance on various datasets. On the original dataset, DiffusionFR achieved the highest training accuracy of 97.55%, as well as a Top-1 accuracy test score of 92.02% and a Top-5 accuracy test score of 95.17%. Furthermore, on nine datasets with light reflection noise, the mean training accuracy reached a peak of 96.50%, while the mean Top-1 and Top-5 test accuracies were at their highest at 90.96% and 94.12%, respectively. Similarly, on three datasets with water ripple noise, the mean training accuracy reached a peak of 95.00%, while the mean Top-1 and Top-5 test accuracies were at their highest at 89.54% and 92.73%, respectively. These results demonstrate that the method showcases superior accuracy and enhanced robustness in handling both original datasets and datasets with light reflection and water ripple noise.


Introduction
Fish are vital for humans as a protein source and for maintaining marine biodiversity [1]. However, they face challenges like overfishing, habitat destruction, and climate change.
Recognition of fish species benefits animal welfare, ecological protection, and wildlife support. Fish image recognition helps researchers understand fish behavior and improve habitats. It also aids in accurate population counting and the monitoring [2] of wild fish populations. Additionally, it enables rapid recognition of fish at customs and in markets, preventing the illegal trade of endangered species.
In blurry marine scenarios, fish species recognition is challenging, requiring accurate methods [3]. This contributes to surveys, population analyses, and the sustainable utilization of fish as a biological resource.
Underwater cameras are commonly used for fish surveys [4]. Unlike other methods, they minimize ecosystem impact and allow continuous recording of fish activity. However, limitations include a restricted field of view and factors like water turbidity, lighting conditions, and flow magnitude that affect image quality and recognition accuracy.
Previous research mainly focused on high-resolution fish recognition [5]. However, practical scenarios often feature blurry images due to water quality [6], relative movement [7] between the shooting device and the fish, water ripples [8], and light reflection [9]. This poses significant challenges for fish recognition in real-life situations.
In order to overcome the challenges mentioned above, a method of fish image recognition in blurry scenarios based on the diffusion model and attention mechanism [10][11][12], DiffusionFR, is proposed herein. DiffusionFR offers a comprehensive set of technical solutions for fish recognition in blurry scenarios, covering the selection and application of this correcting technique.
The main contributions of this paper are summarized as follows: (1) A two-stage diffusion model for fish recognition in blurry scenarios, TSD, was designed to maximize the removal of bad, blurry, and otherwise inadequate effects in fish images. (2) A learnable attention module, LAM, was designed to ensure that the semantic features learned at the end of the network can distinguish fish for fine-grained recognition. (3) A method for fish image recognition in blurry scenarios that combines TSD and LAM, DiffusionFR, was proposed as a complete solution, and the selection and application of this correcting technique are presented herein. (4) A dataset of fish images in blurry scenarios, BlurryFish, was constructed from the bad, blurry, and otherwise inadequate images in the publicly available dataset Fish4Knowledge and used to validate the effectiveness of DiffusionFR.
The structure of this paper is as follows. In Section 2, we review the relevant works on fish species recognition. Section 3 provides a detailed explanation of the key concepts and methodology used in this study, including the main ideas behind DiffusionFR, the two-stage diffusion model (TSD), the learnable attention module (LAM), the modified ResNet used as the recognition network, the dataset, and the experimental design. Section 4 presents the treatment and analysis of the experimental findings. In Section 5, we discuss the implications and significance of the results. Finally, in Section 6, we summarize the essential findings and draw conclusions based on the research conducted in this paper.

Background
Previous studies on fish species recognition commonly used different deep neural network architectures or employed layered and phased strategies.
Numerous studies on fish recognition have utilized various deep neural networks, such as CNN, Tripmix-Net, DAMNet, MobileNetv3, and VGG16. Villon et al. [13] employed CNN to enhance the accuracy of coral reef fish recognition by using rule-based techniques. They achieved a model accuracy of 94.9%, surpassing manual accuracy. Similarly, Villon et al. [14] used a convolutional neural network to analyze images from social media, providing support in monitoring rare megafauna species. Li et al. [15] proposed Tripmix-Net, a fish image classification model that incorporates multiscale network fusion. Qu et al. [16] introduced DAMNet, a deep neural network with a dual-attention mechanism for aquatic biological image classification. However, due to the incorporation of the dual-attention mechanism, the DAMNet model may exhibit a relatively higher level of complexity. Meanwhile, Alaba et al. [17] developed a model using the MobileNetv3-large and VGG16 backbone networks for fish detection. However, their method still encounters certain challenges, such as dealing with low-light conditions, noise, and the limitations posed by low-resolution images.
A hierarchical and phased approach to fish target recognition refers to dividing the recognition process into multiple phases and levels. Liang et al. [18] divided the recognition process into multiple stages to enhance accuracy and robustness. However, their method suffers from a high number of parameters and computational complexity, which can make the training process extremely time-consuming. Similarly, Ben et al. [5] proposed a hierarchical CNN classification method for automatic fish recognition in underwater environments.
In blurry scenarios [19], intelligent fish image recognition technology aims to improve image clarity using image processing techniques. These techniques include image denoising, image enhancement, and image alignment. Image denoising [20] reduces noise in the image using filters. Image enhancement [21] improves clarity through techniques like histogram equalization. Image alignment [22] addresses image blurring through registration. Neural heuristic video systems [23] analyze video frames automatically using heuristic algorithms, extending image analysis to video analysis. The bilinear pooling with poisoning detection (BPPD) module [24] utilizes bilinear pooling of convolutional neural networks. This algorithm combines data from two networks through bilinear pooling to achieve improved classification accuracy. Intelligent fish image recognition technology utilizes the diffusion model to deblur images. This model enhances image quality, recovers lost information, and improves feature extraction. As a result, it provides better inputs for subsequent image recognition tasks, significantly improving the accuracy of fish image recognition in blurry scenarios [25]. The main ideas behind DiffusionFR can be summarized as follows:

Main Ideas
(1) Two-stage diffusion (TSD): This model consists of two stages: the predictive stage and the reconstructive stage. In the predictive stage, a U-Net structure generates feature probability maps for the bad, blurry, and otherwise inadequate fish images. Each pixel in the maps represents the probability of belonging to a specific class of fish image. In the reconstructive stage, four identical modules comprising a residual block and an up-sampling block are employed to convert the feature probability maps into clear fish images. This modification aims to minimize the loss of accuracy, train a more precise recognition model, and enhance recognition accuracy for fish images in blurry scenarios.

Two-Stage Diffusion (TSD)
Recently, deep neural network-based diffusion models [26][27][28] have become popular for image denoising and super-resolution. These models utilize the capabilities of deep learning to learn image features and predict image evolution. As a result, they can quickly and efficiently denoise and enhance images.
The proposed TSD method in this study consists of two stages: a predictive stage and a reconstructive stage. The predictive stage detects fish image features in blurry images, while the reconstructive stage analyzes and processes diffusion data to address errors or deficiencies in the model. This stage significantly enhances the model's accuracy and reliability. The entire TSD process is visually depicted in Figure 2.
The predictive stage of the proposed method takes the fish image as an input and generates a probability map, which represents the likelihood of each pixel belonging to a specific fish species category, as an output. This probability map provides valuable insights into the model's classification probabilities for different fish species.
To achieve this, the predictive stage utilizes the U-Net architecture [29], as shown in Figure 3. U-Net consists of symmetrical encoders and decoders. The encoder extracts image features using convolution and pooling operations, encoding the input image into a low-dimensional tensor. The decoder then reconstructs the encoder's output into an image of the same dimensions as the input, with each pixel containing a probability value for the target category. To address information loss, U-Net incorporates skip connections that connect the feature maps of the encoder and decoder. It consists of four 2D convolutional layers and four maximum pooling layers, enabling the model to handle fish images of varying sizes and shapes within blurry images.
The reconstructive stage is composed of four modules. Each module consists of a Residual Block [30] and an Up-Sampling Block [31]. The Residual Block addresses issues of gradient vanishing and explosion during deep neural network training, as shown in Figure 4. It includes two convolutional layers and a skip connection, where the input is added directly to the output to form residuals. This helps the network capture the mapping relationship between inputs and outputs, improving the model's performance and robustness. During model training, special attention is given to the error generated by the Up-Sampling Block in the network, as shown in Equation (1):

$$q(x_s \mid x_t, x_0) = \mathcal{N}\!\left(x_s;\ \frac{1}{g_{t0}^2}\left(f_{s0}\, g_{ts}^2\, x_0 + f_{ts}\, g_{s0}^2\, x_t\right),\ \frac{g_{s0}^2\, g_{ts}^2}{g_{t0}^2}\, I\right) \quad (1)$$

where q(x_s | x_t, x_0) denotes the conditional probability distribution of x_s given conditions x_t and x_0, and its mean is a linear combination involving f_{s0}, g_{ts}, f_{ts}, and g_{s0}. In addition, f_{ts} is a ratio indicating the relative scale that maps the input variable t to the input variable s, as shown in Equation (2), and g_{ts} is computed from the scale parameters of the input variables t and s and is used to adjust the propagation process of the error, as shown in Equation (3):
$$g_{ts}^2 = g(t)^2 - f_{ts}^2\, g(s)^2 \quad (3)$$
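Read this way, Equations (1)-(3) can be checked numerically. The sketch below is ours, not the paper's code; it assumes f_{ts} = f(t)/f(s) for the relative scale (Equation (2) is not reproduced in this excerpt) and a schedule with f(0) = 1 and g(0) = 0:

```python
import numpy as np

def posterior(x0, xt, f, g, s, t):
    """Mean and variance of q(x_s | x_t, x_0), per Equations (1)-(3).

    f and g are callables giving the scale f(.) and noise g(.) of the
    diffusion schedule; assumes f(0) = 1 and g(0) = 0, and 0 <= s < t.
    """
    f_ts = f(t) / f(s)  # relative scale mapping t to s (assumed form of Equation (2))
    f_s0 = f(s) / f(0)
    f_t0 = f(t) / f(0)
    # Equation (3): g_ab^2 = g(a)^2 - f_ab^2 * g(b)^2
    g_ts2 = g(t) ** 2 - f_ts ** 2 * g(s) ** 2
    g_s02 = g(s) ** 2 - f_s0 ** 2 * g(0) ** 2
    g_t02 = g(t) ** 2 - f_t0 ** 2 * g(0) ** 2
    # Equation (1): Gaussian over x_s with this mean and variance
    mean = (f_s0 * g_ts2 * x0 + f_ts * g_s02 * xt) / g_t02
    var = g_s02 * g_ts2 / g_t02
    return mean, var
```

With a variance-preserving schedule such as f(t) = e^{-t}, g(t)^2 = 1 - e^{-2t}, this reduces to the familiar DDPM posterior.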

Equation (4) describes the gradual process of recovering the image from noise, as illustrated:

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, Z_{t-1} \quad (4)$$
where x_t denotes the image recovered at moment t, obtained by a linear combination of the initial image x_0 and the previous recovery result Z_{t−1}. This linear combination uses a scaling factor √α_t to adjust the contribution of the initial image and the previous recovery result. Meanwhile, the noise term Z is generated through a Gaussian distribution N(0, I).
Equation (5) represents the inverse diffusion process from the recovered image x_t back to the previously recovered result x_{t−1}:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\left(x_t - \sqrt{1-\alpha_t}\, Z\right),\ \sigma_t^2\, I\right) \quad (5)$$
where the conditional probability distribution q(x_{t−1} | x_t, x_0) represents the conditional probability distribution of x_{t−1} given conditions x_t and x_0; this distribution is a Gaussian whose mean contains a linear combination of x_t and the noise term Z.

TSD implements batch normalization techniques and dropout layers to enhance the stability, convergence, and generalization of the model.
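A minimal numeric sketch of Equations (4) and (5), under our reading of the notation (in the trained model the noise term Z is predicted by the U-Net; here we reuse the true noise, so the reverse step is exactly invertible):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x0, alpha_t):
    """Noising per Equation (4): x_t = sqrt(alpha_t)*x_0 + sqrt(1-alpha_t)*Z."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * z, z

def reverse_step(xt, z, alpha_t):
    """One inverse-diffusion step (mean of Equation (5)): combine x_t and the
    noise term Z to recover the previous result. Here z is the true noise,
    so the original image is recovered exactly."""
    return (xt - np.sqrt(1.0 - alpha_t) * z) / np.sqrt(alpha_t)
```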

Learnable Attention Module (LAM)
In this paper, we propose LAM, which is based on the channel attention mechanism [32] (CAM) and depicted in Figure 5. Unlike CAM, LAM assigns weights to channels by learning the importance of features. In DiffusionFR, the LAM consists of three steps: the computation of channel importance, the learning of the channel weight distribution, and the weighted fusion of features. These steps are illustrated in Figure 1.
The first step is the computation [33] of channel importance. First, the global pooling values for each channel in the feature map F are extracted by a global average pooling or maximum pooling operation to obtain a C-dimensional vector Z. Then, Z is processed using a network architecture containing two fully connected layers and a ReLU activation function to generate a C-dimensional weight assignment vector k, which stores the weight assignments for each channel, as shown in Equation (6).
The second step involves the learning [34] of the channel weight distribution. This distribution determines the significance of each feature map channel. To compute the channel weights, we use the softmax function to map the values in the weight vector between 0 and 1. This ensures that the sum of all weights equals 1, representing the weight of each channel. To gather global information about the channels, we apply a global average pooling operation to the features, which is represented by Equation (7).
In the formula, x_i represents the ith feature map of input size H × W, and y represents the global feature. In the softmax function, each feature vector element is mapped to a value between 0 and 1. With this mapping, the model can determine how much each channel contributes relative to the overall feature map.
The third step involves the weighted fusion [35] of features. Each channel in the feature map is weighted and fused based on its assigned weight. Firstly, weights are assigned to each channel and applied to their corresponding features. Then, the features of all channels are proportionally weighted and fused to generate a feature map adjusted by the attention mechanism. By incorporating the LAM, the network can dynamically adjust the contribution of each channel, improving its robustness and generalization ability. This attention mechanism enables the network to disregard irrelevant information (weights close to 0) and prioritize important features essential for successful task completion.
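The three LAM steps can be sketched in NumPy as follows; the layer shapes and the placeholder weights w1/b1, w2/b2 are our assumptions, standing in for the two learned fully connected layers:

```python
import numpy as np

def lam(feature_map, w1, b1, w2, b2):
    """Sketch of the three LAM steps on a (C, H, W) feature map."""
    c = feature_map.shape[0]
    # Step 1: channel importance -- global average pooling gives a
    # C-dimensional vector Z, then two FC layers with ReLU give the
    # weight-assignment vector k (Equation (6)).
    z = feature_map.mean(axis=(1, 2))
    k = w2 @ np.maximum(w1 @ z + b1, 0.0) + b2
    # Step 2: softmax maps k into (0, 1) with weights summing to 1 (Equation (7)).
    weights = np.exp(k - k.max())
    weights /= weights.sum()
    # Step 3: weighted fusion -- rescale each channel by its learned weight.
    return feature_map * weights.reshape(c, 1, 1)
```

With all-equal placeholder weights, every channel receives the same attention weight 1/C, which is a quick sanity check on the softmax step.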
ResNet50 has a layered architecture that enables it to learn hierarchical representations of input data. Lower layers capture low-level features, while higher layers capture intricate patterns and relationships. The pooling layer reduces spatial dimensionality, improving computational efficiency and translation invariance while mitigating overfitting. Leveraging ResNet50's transfer learning provides a solid foundation for the probabilistic graph generation task. However, in images with complex backgrounds or noise, ResNet50 may unintentionally focus on less relevant regions, impacting model performance. To address this, an attention mechanism is introduced to dynamically adjust feature map weights based on different parts of the input data. This helps prioritize crucial features, enhancing accuracy and generalization capabilities. Therefore, the DiffusionFR approach modifies ResNet50 by incorporating LAM into the network. LAM is added between each pair of adjacent stages, as shown in Figure 6.

Dataset
This paper introduces BlurryFish, a fish image dataset created by integrating blurred images from the publicly available dataset Fish4Knowledge. The construction process involved the following steps: (1) Data Collection The datasets used in this paper are from three sources. The first source is the publicly available dataset Fish4Knowledge, consisting of realistically shot images. The second source is a field-photographed dataset that prioritizes challenging scenarios like low-light conditions and inclement weather to ensure representative fish images. The third source is fish images from the Internet, which we organized and classified. The dataset comprises 25 fish species, and Figure 7 displays these species and example images.
(2) Data Cleaning The collected fish images underwent a cleaning process to ensure their quality and reliability. This involved eliminating invalid samples and duplicate samples.
(3) Dataset Partition The dataset was divided into three sets for the experiments: the training set, the validation set, and the test set. This division follows the hold-out method [39] and maintains an 8:1:1 ratio. The goal was to ensure that all sets included pictures of the same fish species, as well as similar scenarios and angles.
(4) Data Enhancement The collected dataset has an interclass balance problem [40] due to the varying number of pictures for each fish species. This can result in lower recognition accuracy for less common fish species if the dataset is directly used for training. To address this issue, standard data enhancement methods were employed, including panning, cropping, rotating, mirroring, flipping, and brightness adjustment. These operations generated additional image samples, enhancing the model's robustness, generalization ability, and recognition accuracy for smaller fish species. Table 1 shows that the initial BlurryFish dataset contained 2754 bad, blurry, and otherwise inadequate fish images. After applying data enhancement techniques, the dataset expanded to 35,802 such images, as shown in Table 2.

(5) Data Annotation
To create valuable training and testing sets from the image dataset, each image in the fish image dataset was labeled with associated fish species data. We utilized the graphical interface labeling software LabelImg (v 1.8.5) to annotate the fish images and generate XML files. Although DiffusionFR does not impose any restrictions on the resolution and other parameters of the dataset images, we uniformly converted the dataset to RGB images with a resolution of 224 × 224. These images were then stored in the PASCAL VOC data format.
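The flip, mirror, rotation, and brightness operations used for data enhancement might look like the following sketch (the function name and the brightness range are ours; panning and cropping are omitted):

```python
import numpy as np

def augment(image, rng):
    """Generate enhanced variants of one H x W x 3 uint8 fish image:
    horizontal mirror, vertical flip, 90-degree rotation, brightness shift."""
    variants = [
        image[:, ::-1],                      # horizontal mirror
        image[::-1, :],                      # vertical flip
        np.rot90(image),                     # 90-degree rotation
        np.clip(image.astype(np.int16) + int(rng.integers(-40, 41)),
                0, 255).astype(np.uint8),    # brightness adjustment
    ]
    return variants
```

Applying a handful of such operations per image is consistent with the roughly 13-fold growth of BlurryFish reported in Tables 1 and 2.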


Evaluation Indicators
To evaluate the model's performance in classifying fish images in blurry scenarios, accuracy and Top-k accuracy were used as evaluation metrics. The experimental data were processed using Python code and analyzed using Excel software (12.1.0.16250).
(1) Accuracy
Accuracy is a metric that measures the proportion of correctly predicted samples compared to the total number of instances. It is calculated using Equation (8):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (8)$$

where TN denotes true negative, TP denotes true positive, FP denotes false positive, and FN denotes false negative.

(2) Top-k Accuracy
The Top-k accuracy rate measures the proportion of samples where at least one of the top-k predictions matches the true label, compared to the total number of samples. In this study, we use Top-1 accuracy and Top-5 accuracy as model criteria. Equation (9) demonstrates the calculation of the Top-k accuracy:

$$\text{Top-}k\ \text{Accuracy} = \frac{\text{number of samples correctly predicted within the top-}k}{\text{total number of samples}} \times 100\% \quad (9)$$

Here, k can be any positive integer; Top-1 and Top-5 accuracy rates are common, indicating the accuracy within the highest-confidence prediction and the five highest-confidence predictions, respectively.
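Both metrics are straightforward to implement; the sketch below follows Equations (8) and (9) (function names are ours):

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Equation (8): share of correctly predicted samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def top_k_accuracy(scores, labels, k):
    """Equation (9): percentage of samples whose true label appears among
    the k highest-confidence predictions. `scores` is (N, num_classes)."""
    topk = np.argsort(scores, axis=1)[:, -k:]              # k best classes per sample
    hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return hits.mean() * 100.0
```

Top-1 accuracy is simply the k = 1 case, matching the usual classification accuracy on the predicted class.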

Parameters of Experiments
In this section, we conduct comparative experiments for each module of DiffusionFR.
During training, the model parameters of DiffusionFR were continuously adjusted to minimize prediction errors. This was achieved using optimization algorithms and loss functions. After several iterations, the hyperparameters of the DiffusionFR model were determined based on commonly used empirical values. The finalized hyperparameters can be found in Table 4.
(1) Comparison of Backbone Networks
DiffusionFR's backbone network was assessed using the original dataset. The analysis included various backbone networks such as ResNet50, VGG16, MobileNetv3, Tripmix-Net, ResNeXt, DAMNet, ResNet34, ResNet101, EfficientNet [41], neuro-heuristic, bilinear pooling with poisoning detection (BPPD), and CNN(r1, r2).

(2) Comparison of Attention Mechanisms
Comparative experiments were conducted to assess the impact of attention mechanisms on the algorithm. The evaluated attention mechanisms included LAM, CBAM [42], CCA [43], and SE [44].

(3) Comparison of Diffusion Models
To assess the impact of the diffusion model proposed in this paper on the final recognition performance, we conducted a comparative experiment. This experiment involved two deblurring methods: the diffusion module proposed in this paper and the Gaussian denoising module.

(4) Effect of Light Reflection Noise on Recognition Performance
The datasets were labeled according to the light reflection noise added. For instance, D_0E_0 signifies the original dataset without any added noise, while D_{0.6}E_{100} represents the dataset with light reflection noise, where the light diameter is 0.6 cm and the light intensity is 100 Lux, added to D_0E_0. This naming convention is used for other datasets as well.
Nine datasets were created by categorizing the light reflection noise based on different light diameters and intensities. These datasets are named D_{0.6}E_{100}, D_{0.6}E_{250}, D_{0.6}E_{400}, D_{0.8}E_{100}, D_{0.8}E_{250}, D_{0.8}E_{400}, D_{1.0}E_{100}, D_{1.0}E_{250}, and D_{1.0}E_{400}. Table 5 provides an overview of the data volume for the fish image dataset with added light reflection noise. An example of this dataset is shown in Figure 8. We conducted a comparative analysis on datasets with light reflection noise to assess the effectiveness of using corrected fish images for species-specific fish recognition.

(5) Effect of Water Ripple Noise on Recognition Performance
To add water ripple noise to the dataset and generate the water ripple effect, the following steps and equations were used. First, an empty array X of the same size as the original image was created to store the generated water ripple effect. Next, offsets (including offset_x and offset_y) were calculated for each pixel based on the amplitude (A) and frequency (F) by iterating through each pixel in a loop. Then, the pixel values corresponding to these offsets were assigned to each pixel of the empty array X, generating the water ripple effect. Finally, the resulting water ripple effect was overlaid onto the original image, creating the final image with water ripples. The equations involved are shown in (10)-(14).
offset_x = A · sin(2π · F · y_i)  (10)
offset_y = A · sin(2π · F · x_i)  (11)

where (x_i, y_i) denotes the coordinates of a pixel point in the image, F is the frequency of the water ripple, and A is the amplitude of the water ripple. The pixel assignment of array X is calculated using Equations (12) and (13):

X[x_i] = (x_i + offset_x) % width  (12)
X[y_i] = (y_i + offset_y) % height  (13)

where width and height are the width and height of the image, respectively. The final image generation, in which the ripple array X is overlaid onto the original image, is calculated using Equation (14):

img_with_ripples = overlay(img, X)  (14)
where img denotes the original image and img_with_ripples denotes the final image with water ripples. The datasets were labeled according to the water ripple noise added. For instance, F0A0 signifies the original dataset without any added noise, while F0.04A2 indicates the dataset obtained by adding water ripple noise with a frequency of 0.04 and an amplitude of 2 to F0A0. This naming convention is used for the other datasets as well.
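The ripple procedure above can be sketched in plain numpy. The sinusoidal form of the per-pixel offsets and the averaging used for the final overlay are assumptions reconstructed from the description, not the paper's exact equations.

```python
import numpy as np

def add_water_ripples(img, F, A):
    """Sketch of the ripple steps: per-pixel sinusoidal offsets of amplitude A
    and frequency F, wrapped modulo the image size, then overlaid on the
    original image. The sine form and the blend are assumptions."""
    height, width = img.shape[:2]
    X = np.empty_like(img)                                 # the empty array X
    for y in range(height):
        for x in range(width):
            offset_x = int(A * np.sin(2 * np.pi * F * y))  # offset from A, F
            offset_y = int(A * np.sin(2 * np.pi * F * x))
            # Assign the pixel value at the wrapped offset position.
            X[y, x] = img[(y + offset_y) % height, (x + offset_x) % width]
    # Overlay the ripple effect onto the original image (simple average).
    return ((img.astype(np.float64) + X) / 2).astype(img.dtype)

img = np.arange(36, dtype=np.uint8).reshape(6, 6)
rippled = add_water_ripples(img, F=0.25, A=2)
```

With A = 0 the offsets vanish and the output equals the input, which is a quick sanity check on the procedure.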
Water ripple noise can be classified based on the frequency and amplitude of the water ripples: increasing the frequency and amplitude results in a higher offset and greater oscillation in the generated water waves. In this study, the water ripple noise was categorized into three groups, F0.04A2, F0.06A6, and F0.08A10, based primarily on frequency and amplitude. Table 6 provides an overview of the data volume of the fish image dataset with added water ripple noise, while Figure 9 offers an illustrative example. We conducted a comparative analysis of the datasets with water ripple noise to assess the effectiveness of using corrected fish images for species-specific fish recognition.
We conducted Experiments 1 through 5 to assess the impact of different backbone networks, attention mechanisms, and diffusion models, as well as of light reflection noise and water ripple noise, on recognition performance. These experiments were evaluated using three metrics: training accuracy, Top-1 test accuracy, and Top-5 test accuracy. The objective was to comprehensively evaluate recognition performance and analyze the results.
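For reference, Top-k accuracy counts a prediction as correct when the true label appears among the k highest-scoring classes; Top-1 and Top-5 test accuracy are the k = 1 and k = 5 cases. A minimal sketch with hypothetical scores:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring
    classes. scores: (n_samples, n_classes); labels: (n_samples,)."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# Toy example: 3 samples, 6 classes (hypothetical scores, not paper data).
scores = np.array([
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],  # best guess: class 1 (correct)
    [0.30, 0.20, 0.10, 0.10, 0.20, 0.10],  # best guess: class 0, true label 4
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.50],  # best guess: class 5 (correct)
])
labels = np.array([1, 4, 5])
top1 = top_k_accuracy(scores, labels, 1)   # 2 of 3 correct
top5 = top_k_accuracy(scores, labels, 5)   # all labels within the top 5
```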

Results
In this study, the BlurryFish dataset was used to perform comparative experiments on the key innovations of the proposed methodology.

Comparison of Backbone Networks
This study compared and analyzed the backbone network of DiffusionFR. For example, DiffusionFR_VGG16 refers to using VGG16 instead of ResNet50 as the backbone network in DiffusionFR. Similar comparisons were made with other backbone networks. The results of these comparisons can be found in Table 7.

Comparison of Attention Mechanisms
To evaluate the impact of the attention mechanism on the algorithm, a comparative experiment was conducted, as shown in Table 8. The experiment compared DiffusionFR with a variant without any attention mechanism, referred to as DiffusionFR_noA. Furthermore, classical attention methods were used as substitutes for LAM; for example, DiffusionFR_CBAM incorporates CBAM as the attention method in DiffusionFR. The results of these comparisons are presented in Table 8, which shows that the training accuracy of DiffusionFR on the original dataset was 97.55%, the Top-1 test accuracy was 92.02%, and the Top-5 test accuracy was 95.17%. DiffusionFR outperforms the other methods on all of these metrics, establishing it as the method with the most effective recognition capability.

Comparison of Diffusion Models
The final recognition results for the diffusion model proposed in this paper were obtained through experiments, as presented in Table 9. This table includes the performance of DiffusionFR, DiffusionFR_noTSD, and DiffusionFR_Gaussian. DiffusionFR_noTSD refers to the method with the TSD removed, and DiffusionFR_Gaussian uses Gaussian denoising [45] instead of the TSD. DiffusionFR outperforms the other methods on all of these metrics, establishing it as the method with the most effective recognition capability.

Effect of Light Reflection Noise on Recognition
We performed a comparative analysis using DiffusionFR's backbone network on datasets with light reflection noise to evaluate the usability of corrected fish images for species-specific fish recognition. The results of this analysis are presented in Table 10. TSD's effectiveness in processing fish images with light reflection noise is visually demonstrated in Figure 10. TSD reduces the noise before deblurring, thereby preserving critical features for accurate recognition. Additionally, TSD handles light reflection noise better than water ripple noise.

Effect of Water Ripple Noise on Recognition
We conducted a comparative analysis using DiffusionFR's backbone network on datasets with water ripple noise to evaluate the usability of corrected fish images for species-specific fish recognition. The results of this analysis can be found in Table 11. Figure 11 provides a visual representation of TSD's ability to process fish images containing water ripple noise. TSD effectively reduces the frequency and intensity of water ripple noise before deblurring, mitigating its impact on the critical feature extraction capability of the DiffusionFR model and ensuring that the deblurred fish image accurately retains identifying characteristics. In Table 11, the mean training accuracy of DiffusionFR on the three datasets (F0.04A2, F0.06A6, and F0.08A10) with added water ripple noise is 95.00%. The mean Top-1 test accuracy was 89.54%, and the mean Top-5 test accuracy was 92.73%. These values indicate that DiffusionFR outperforms the other methods in terms of accuracy and demonstrate that DiffusionFR, with ResNet50 as the chosen backbone network, has a higher potential for achieving superior recognition performance.

Discussion
Based on the analysis of Tables 7-11, we have drawn several significant conclusions. Firstly, ResNet50 performs better than the other backbone networks when selected as the backbone of DiffusionFR. Its deeper structure relative to ResNet34 enables a more effective capture of intricate image features, while, unlike the still deeper ResNet101, it remains less prone to gradient vanishing or explosion [46]. Additionally, ResNet50's effective integration of the attention mechanism and the residual network approach contributes to its superior gradient propagation.
Furthermore, a comparison between DiffusionFR and DiffusionFR_noA reveals that DiffusionFR outperforms DiffusionFR_noA in terms of training accuracy and accuracy on the test set.This indicates that DiffusionFR is capable of capturing crucial features and achieving more accurate classification and prediction.DiffusionFR also demonstrates superior performance compared to other standard attention methods, further validating the effectiveness of the incorporated LAM.
Moreover, DiffusionFR exhibits remarkable results among the compared methods, achieving superior performance in terms of training accuracy and accuracy on the test set.The proposed TSD approach for fish recognition in blurry scenarios proves to be highly effective.DiffusionFR's end-to-end integrated framework [47] for denoising and recognition surpasses a two-stage scheme by leveraging the interrelationships between these tasks.It enhances accuracy and stability by efficiently handling noise [48] and blur [49] information.
Additionally, the impact of light reflection noise and water ripple noise on recognition performance is evident from the analysis. Increasing the light intensity and diameter of the reflections, or the frequency and amplitude of the water ripples, leads to a decreasing trend in the training accuracy, Top-1 test accuracy, and Top-5 test accuracy of the same backbone network method. This highlights the significant role of light reflection and water ripples in recognition performance and reinforces the usability of corrected fish images for species-specific recognition even in the presence of these noise scenarios.
In comparing the neuro-heuristic analysis of video and bilinear pooling with poisoning detection (BPPD) to the DiffusionFR method, it becomes clear that DiffusionFR outperforms these approaches.While recent advancements in the neural network field have shown progress, DiffusionFR exhibits superior performance, even when compared to CNN(r1, r2).

Conclusions
In this study, we propose a method called DiffusionFR, which combines the diffusion model and attention mechanism to address the challenge of fish image recognition in blurry scenarios.The approach involves deblurring fish scene pictures using a two-stage diffusion network model, TSD, to restore clarity.Furthermore, a learnable attention module, LAM, was incorporated to enhance the accuracy of fish recognition.
DiffusionFR achieves the highest mean value of training accuracy, Top-1 test accuracy, and Top-5 test accuracy, 94.91%, on the original dataset. It also maintains the highest mean accuracies on the datasets with added light reflection noise (94.65%) and on the datasets with added water ripple noise (92.84%).
The effectiveness of DiffusionFR is evident from its superior performance compared to other approaches that use different backbone networks, attention mechanisms, and Gaussian denoising.DiffusionFR proves to be more accurate and robust, making it applicable in various underwater applications such as underwater photography, underwater detection, and underwater robotics.

Figure 1 .
Figure 1 presents the framework of the method based on the diffusion model and attention mechanism for fish image recognition in blurry scenarios, DiffusionFR. The framework visually illustrates the selection and application of the correction technique.
Animals 2024, 14, x FOR PEER REVIEW

(2) Learnable attention module (LAM): The attention mechanism in DiffusionFR comprises three processes: the computation of channel importance, the learning of the channel weight distribution, and the weighted fusion of features. The computation of channel importance involves global average pooling and two fully connected layers with ReLU activation. The learning of the channel weight distribution includes SoftMax and the aggregation of features. Finally, the weighted fusion of features incorporates the channel weights and performs a weighted fusion of the results.
(3) Modifying ResNet as the recognition network: In DiffusionFR, the ResNet feature extraction network is modified by adding the LAM between each pair of adjacent stages.
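The three LAM processes just described can be sketched in plain numpy. The layer sizes, the reduction ratio r, and the random weights are illustrative assumptions; in the actual module, W1, b1, W2, and b2 would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def lam(features, W1, b1, W2, b2):
    """Minimal sketch of the three LAM steps on one (C, H, W) feature map."""
    # 1) Channel importance: global average pooling + two FC layers with ReLU.
    pooled = features.mean(axis=(1, 2))          # (C,) per-channel average
    hidden = np.maximum(0.0, W1 @ pooled + b1)   # first FC + ReLU
    importance = W2 @ hidden + b2                # second FC -> (C,)
    # 2) Channel weight distribution: SoftMax over channels.
    e = np.exp(importance - importance.max())
    weights = e / e.sum()                        # positive, sums to 1
    # 3) Weighted fusion: rescale each channel by its learned weight.
    return features * weights[:, None, None]

C, H, W, r = 8, 4, 4, 2                          # r: assumed reduction ratio
W1 = rng.standard_normal((C // r, C)); b1 = np.zeros(C // r)
W2 = rng.standard_normal((C, C // r)); b2 = np.zeros(C)
x = rng.standard_normal((C, H, W))
y = lam(x, W1, b1, W2, b2)
```

Because the SoftMax weights are strictly positive, the module only rescales channels; it never flips the sign of a feature.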

Figure 2 .
Figure 2. Structure of two-stage diffusion (TSD), including a predictive stage and a reconstructive stage.

Figure 3 .
Figure 3. U-Net architecture, consisting of an encoder and a decoder. The restoration stage is responsible for generating the final restored image. It utilizes both the output image from the prediction stage and the input image. The reconstructive stage is composed of four modules, each consisting of a Residual Block [30] and an Up-Sampling Block [31]. The Residual Block addresses issues of gradient vanishing and explosion during deep neural network training, as shown in Figure 4. It includes two convolutional layers and a jump connection, where the input is added directly to the output to form residuals. This helps the network capture the mapping relationship between inputs and outputs, improving the model's performance and robustness. During model training, special attention is given to the error generated by the Up-Sampling Block in the network, as shown in Equation (1).
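The Residual Block described above consists of two convolutional layers and a jump connection that adds the input directly to the output. A minimal single-channel numpy sketch follows; the real block operates on multi-channel feature maps with learned kernels, which are simplified away here.

```python
import numpy as np

def conv3x3(x, kernel):
    """Naive same-padding 3x3 filter (cross-correlation, as in deep-learning
    'convolutions') on a single-channel map of shape (H, W)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def residual_block(x, k1, k2):
    """Two conv layers plus a jump connection: the input is added directly to
    the second conv's output, so the block learns a residual mapping."""
    y = np.maximum(0.0, conv3x3(x, k1))   # first conv + ReLU
    y = conv3x3(y, k2)                    # second conv
    return np.maximum(0.0, x + y)         # skip connection, then ReLU

x = np.random.default_rng(1).standard_normal((6, 6))
out = residual_block(x, np.full((3, 3), 0.1), np.full((3, 3), 0.1))
```

With all-zero kernels the residual branch vanishes and the block reduces to ReLU(x), which illustrates why residuals ease gradient flow: the identity path is always available.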

Figure 5 .
Figure 5. The framework structure of the learnable attention module (LAM).


Figure 6 .
Figure 6. Structure of the modified ResNet50 with the LAM added between every two neighboring stages.


Figure 10 .
Figure 10. Comparison of images before and after TSD deblurring of light reflection noise: (a) D0.6E100; (b) D0.6E250; (c) D0.6E400; (d) D0.8E100; (e) D0.8E250; (f) D0.8E400; (g) D1.0E100; (h) D1.0E250; and (i) D1.0E400. In Table 10, the mean training accuracy of DiffusionFR on the nine datasets with added light reflection noise was 86.85%. The mean Top-1 test accuracy was 81.87%, and the mean Top-5 test accuracy was 84.71%. These values indicate that DiffusionFR outperforms the other methods in terms of accuracy and demonstrate that DiffusionFR, with ResNet50 as the chosen backbone network, has a higher potential for achieving superior recognition performance.

Table 1 .
Number of fish images in the BlurryFish dataset before data enhancement.

Table 2 .
Number of fish images in the BlurryFish dataset after data enhancement.

Table 3 .
Experimental software and hardware configurations.

Table 5 .
Fish Dataset with Added Light Reflection Noise.


Table 6 .
Fish Dataset with Added Water Ripple Noise.

Table 7 .
Accuracies of Different Feature Extraction Networks.


Table 7 displays the performance metrics of DiffusionFR on the original dataset. The training accuracy was 97.55%, the Top-1 test accuracy was 92.02%, and the Top-5 test accuracy was 95.17%. These values indicate that DiffusionFR outperforms the other methods in terms of accuracy and demonstrate that DiffusionFR, with ResNet50 as the chosen backbone network, has a higher potential for achieving superior recognition performance.

Table 8 .
Accuracies of Different Attention Mechanisms for LAMs.

Table 9 .
Accuracies of Different Diffusion Models.

Table 9 presents the performance metrics of DiffusionFR on the original dataset. The training accuracy of DiffusionFR is 97.55%, with Top-1 and Top-5 test accuracies of 92.02% and 95.17%, respectively.

Table 10 .
Accuracies for Data with Different Light Reflection Noise Effects.


Table 11 .
Accuracies for Data with Different Water Ripple Noise Effects.
