Detection and Identification of Fish Skin Health Status Referring to Four Common Diseases Based on Improved YOLOv4 Model

A primary problem affecting the sustainable development of aquaculture is fish skin disease. In order to prevent the outbreak of fish diseases and to provide prompt treatment to avoid mass mortality, it is essential to detect and identify skin diseases immediately. Based on the YOLOv4 model, coupled with lightweight depthwise separable convolution and an optimized feature extraction network and activation function, a detection and identification model for fish skin disease is constructed in this study. The developed model is tested on the diseases hemorrhagic septicemia, saprolegniasis, benedeniasis, and scuticociliatosis, and applied to monitor the health condition of fish skin in deep-sea cage culture. Results show that the MobileNet v3-GELU-YOLOv4 model proposed in this study has improved learning ability with a reduced number of model parameters. Compared to the original YOLOv4 model, its mAP and detection speed increased by 12.39% and 19.31 FPS, respectively. The advantages of the model are its intra-species classification capability, lightweight deployment, detection accuracy, and speed, making it more applicable to real-time monitoring of fish skin health in a deep-sea aquaculture environment.


Introduction
Aquaculture is the fastest-growing sector of high-protein resources in global food production and is regarded as the most efficient and sustainable approach to meeting the ever-increasing demand for food, contributing to economic development and social stability worldwide [1][2][3]. Global aquaculture production has reached 114.5 million tons, of which finfish, shellfish, and crustacean culture accounts for 82.1 million tons [4].
The supply of high-quality and green-labeled aquatic products and the welfare of farmed aquatic organisms are of great concern to politicians and policymakers, nongovernmental organizations, the aquaculture industry, and consumers [2,5]. However, diseases, which are major contributors to the degradation of seafood quality and the reduction in fish welfare in intensive aquaculture, are probably responsible for massive fish mortality and, to some extent, are known to affect economic and social development worldwide [6]. For many aquaculture species, such as Atlantic salmon Salmo salar and redbanded seabream Pagrus auriga, skin diseases are regarded as a primary problem that affects their sustainable growth [6][7][8][9].
Immediately detecting and identifying skin disease is therefore critical to the prevention of fish disease outbreaks, as prompt treatment would avoid mass mortality of fish. Information on fish skin diseases in aquaculture can be obtained by human observation and artificial intelligence (AI) technology [5,10,11]. Currently, many large-scale aquaculture cages are equipped with underwater camera equipment, but manual real-time detection and identification of fish skin health status through video viewing is extremely costly and difficult to implement in cage farming. The utilization of AI technology, specifically image recognition technology based on target geometry and statistical features, is of great importance for achieving rapid detection and identification of fish skin diseases even when the fish are swimming fast. The technique of image recognition has been widely applied in fisheries for species identification, biomass estimation, and behavioral analysis [12][13][14][15][16][17][18][19][20][21][22][23]. However, research into the detection and identification of fish skin disease is still in its infancy [5]. In the early days, machine learning techniques such as color segmentation and K-means clustering were applied to identify fish diseases [10,11,24]. Waleed et al. [25] then adopted AlexNet, a deep convolutional neural network (CNN), to improve the accuracy of fish disease (epizootic ulcerative syndrome, ichthyophthirius, and columnaris) detection. However, there is still significant potential to improve detection accuracy and to achieve real-time skin disease detection as fish swim. An algorithm that combines high detection speed with high precision is required [19].
The YOLO (you only look once) model provides a more efficient and accurate algorithm for rapid detection and recognition of targets [26][27][28], as demonstrated by Hu et al. [28], who used the YOLOv4 model to detect uneaten feed particles in aquaculture. In this study, an improved YOLOv4 model was used through a coupled algorithm approach to construct a system for the detection and identification of four common fish diseases: hemorrhagic septicemia, saprolegniasis, benedeniasis, and scuticociliatosis. The system has been applied to real-time monitoring of fish skin health status in deep-sea cage culture, to provide useful decision-making support for early warning of aquaculture diseases.

Dataset
The image data used in this study are divided into three types: a training dataset, a test dataset, and an application dataset. As shown in Figure 1, the datasets are mainly sourced from the web and the Penghu semi-submersible deep-sea cage (located at Chimelong Island, Zhuhai, China) and are divided into the training dataset and the test dataset at a ratio of 9:1. The application dataset consists of images and videos of fish populations from the Yellow Sea Long Whale No. 1 submersible deep-sea cage (located at Dashin Island, Yantai, China).

Detection and Identification Model Based on YOLOv4
Although different approaches based on computer vision techniques are used for detection tasks [10,11,24], the YOLO model is one of the main approaches and is well applied in practice [28]. YOLO is a one-stage network model that combines the two stages of R-CNN into one stage and can recognize the location and category of the target in a single operation. In contrast to the two-stage approach of first determining the target location and then determining the target category, the YOLO model pre-divides an image into several preset frames, splits the image into a grid, and obtains the location frame and category information of the target simultaneously by merging grid cells and judging their contents. This greatly reduces the computation time of the target detection network and gives good real-time performance, which is suitable for the rapid detection of fish skin health status in aquaculture.
The YOLOv4 model [29] mainly consists of four components: the feature extraction network CSPDarknet, the enhanced feature extraction network SPP, the upsampling and feature fusion network PANet, and the feature prediction YOLO head. The SPP module and PANet work together to form a feature pyramid. The SPP module greatly increases the field of view and separates the most important contextual features. PANet performs feature extraction not only from bottom to top in the feature pyramid but also from top to bottom, preserving the features of different layers through repeated feature extraction. The YOLO head generates results for feature maps of different sizes.
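The grid-based prediction scheme described above can be made concrete with a short sketch that decodes one raw grid-cell prediction into an absolute bounding box, under the standard YOLO parameterization. The function name, anchor sizes, and 416-pixel input below are illustrative assumptions, not code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolo_cell(t, cell_x, cell_y, anchor_w, anchor_h, grid_size, img_size=416):
    """Decode one raw prediction t = (tx, ty, tw, th, to) for grid cell
    (cell_x, cell_y) into an absolute box in pixels.

    bx = (sigmoid(tx) + cx) * stride   # center offset within the cell
    by = (sigmoid(ty) + cy) * stride
    bw = anchor_w * exp(tw)            # anchor scaled by the prediction
    bh = anchor_h * exp(th)
    """
    tx, ty, tw, th, to = t
    stride = img_size / grid_size      # pixels covered by one grid cell
    bx = (sigmoid(tx) + cell_x) * stride
    by = (sigmoid(ty) + cell_y) * stride
    bw = anchor_w * np.exp(tw)
    bh = anchor_h * np.exp(th)
    objectness = sigmoid(to)           # confidence that a target is present
    return bx, by, bw, bh, objectness
```

Because the center offset is squashed by a sigmoid, each cell can only predict boxes centered inside itself, which is what lets location and category be read off the whole grid in one pass.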

In order to address the difficulty of recognition caused by unclear lighting and fish skin disease characteristics in actual production, and to improve network computation speed and enable lightweight model deployment, this study developed a detection and identification model of fish skin health status based on the YOLOv4 network. Through optimization of the feature extraction network and activation function, combined with the lightweight depthwise separable convolution method, the proposed image-analysis method comprises a preprocessing stage, an improving stage, and a training stage (Figure 2).

Data Preprocessing
Training a robust model typically requires a large amount of labeled data, while high-quality data are hard to collect, and labeling of complex actions is often time-consuming and expensive [30,31]. The dataset used in this study contains 500 images and 5066 annotated samples. The position and posture of fish in water are highly random, and stretching images can affect fish body proportions. To increase the variability of the input images and improve the robustness of the trained model, we used methods such as adjusting brightness, contrast, hue, saturation, and noise to vary the luminance distortion of the dataset. Techniques such as random scaling, cropping, horizontal flipping, and rotation (−10° to 10°) were used to augment the data; the processed images are shown in Figure 3. The original dataset of 500 images was expanded to 4550 images by this data augmentation method. To prevent image distortion caused by resizing during pre-training image input, the initial images were processed by cropping and filling to 416 × 416 pixels.
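A minimal sketch of part of the augmentation pipeline described above, assuming images are NumPy uint8 arrays. Only a subset of the listed transforms is shown, and the function names are illustrative rather than the authors' implementation:

```python
import numpy as np

def adjust_brightness(img, factor):
    # Scale pixel intensities, clipping to the valid 8-bit range.
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def horizontal_flip(img):
    # Mirror the image left-to-right; box labels would be mirrored likewise.
    return img[:, ::-1]

def crop_and_pad(img, size=416):
    """Center-crop dimensions larger than `size`, then zero-pad smaller ones,
    reaching size x size without stretching (so fish proportions are kept)."""
    h, w = img.shape[:2]
    top = max((h - size) // 2, 0)
    left = max((w - size) // 2, 0)
    cropped = img[top:top + size, left:left + size]
    ch, cw = cropped.shape[:2]
    out = np.zeros((size, size, img.shape[2]), dtype=img.dtype)
    y0 = (size - ch) // 2
    x0 = (size - cw) // 2
    out[y0:y0 + ch, x0:x0 + cw] = cropped
    return out
```

Crop-and-pad is preferred over plain resizing here precisely because resizing a non-square frame would distort the body proportions that the disease features depend on.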

Feature Extraction Network
Feature extraction refers to the process of extracting meaningful information from input data and transforming it into features suitable for a machine learning model to learn and process. The extracted features are descriptive and non-redundant.

Activation Function
Activation functions, applied within the neurons that make up a neural network, mimic the response properties of biological neurons. A good activation function enhances the representation and learning ability of the network, allowing gradients to propagate more efficiently and avoiding the problems of exploding and vanishing gradients.
The rectified linear unit (ReLU) activation function is a piecewise linear function that sets all negative values to zero while leaving positive values unchanged. The Mish activation function is a smooth non-monotonic function that helps keep small negative values, thus stabilizing the network gradient flow and avoiding the sharp drop in training speed produced by gradient saturation. Mish is smoother than the ReLU activation function, allowing information to penetrate deeper into the neural network for better accuracy and generalization, but it is more computationally intensive.
The GELU activation function introduces the idea of stochastic regularization into activation, which makes the output more consistent with a normal distribution; it performs particularly well in the NLP domain, especially in the Transformer model.
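For concreteness, the three activation functions can be sketched as follows (NumPy implementations; the GELU uses the common tanh approximation, which is an assumption here rather than the exact Gaussian-CDF form):

```python
import numpy as np

def relu(x):
    # Zeroes negatives, passes positives unchanged (piecewise linear).
    return np.maximum(0.0, x)

def mish(x):
    # x * tanh(softplus(x)); smooth, non-monotonic, keeps small negatives.
    return x * np.tanh(np.log1p(np.exp(x)))

def gelu(x):
    # tanh approximation of GELU; weights inputs by an approximate
    # Gaussian CDF, so activations track the input distribution.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```

Note that both Mish and GELU return small negative outputs for small negative inputs, which is the property credited above with stabilizing gradient flow.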

Depthwise Separable Convolution
The depthwise separable convolution block splits a standard convolution into a depthwise convolution and a pointwise convolution. The feature map of each input channel is convolved channel-by-channel with a 3 × 3 kernel, with channels and convolution kernels corresponding one-to-one; the result is then convolved point-by-point with a 1 × 1 convolution block to combine the feature maps into the output feature maps, whose number grows with the number of output channels (Figure 4). If the number of input channels is X and the number of output channels is Y, a normal 3 × 3 convolution produces the number of parameters shown in Equation (4), while the 3 × 3 depthwise separable convolution yields the number shown in Equation (5):

P_std = 3 × 3 × X × Y (4)

P_ds = 3 × 3 × X + 1 × 1 × X × Y (5)
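The parameter counts of Equations (4) and (5) can be checked with a hypothetical helper (bias terms are ignored, as in the equations):

```python
def standard_conv_params(x_ch, y_ch, k=3):
    # Equation (4): one k x k kernel spanning all X input channels,
    # repeated for each of the Y output channels.
    return k * k * x_ch * y_ch

def depthwise_separable_params(x_ch, y_ch, k=3):
    # Equation (5): one k x k kernel per input channel (depthwise),
    # plus a 1 x 1 pointwise convolution mixing X channels into Y.
    return k * k * x_ch + x_ch * y_ch
```

For X = Y = 256, the standard convolution needs 589,824 parameters versus 67,840 for the depthwise separable version, roughly an 8.7-fold reduction at the layer level.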

Loss Function
In order to determine the error between the prediction results of the model and the annotated dataset, this study uses the sum of three loss functions, complete IoU loss (CIoU loss), binary cross-entropy loss (BCELoss), and mean squared error loss (MSELoss), to evaluate the loss of the trained model [32,33].
CIoU loss is calculated from the confidence of the predicted bounding box and its IoU with the actual bounding box, where the IoU measures the ratio of the intersection of two regions to their union:

IoU = |B_p ∩ B_gt| / |B_p ∪ B_gt|

where B_p is the bounding box of the predicted target and B_gt is the bounding box of the actual target. On the basis of the IoU, CIoU loss introduces the parameters v and α for the aspect ratios of the predicted and actual boxes:

L_CIoU = 1 − IoU + ρ²(b_p, b_gt) / C² + αv

v = (4/π²) · (arctan(w_gt/h_gt) − arctan(w/h))²

α = v / ((1 − IoU) + v)

where w_gt and h_gt are the width and height of the actual bounding box, w and h are the width and height of the predicted bounding box, ρ(b_p, b_gt) is the distance between the centers of the two boxes, and C is the diagonal distance of the minimum enclosing rectangle of the predicted and target boxes. MSELoss calculates the loss as the mean squared difference between predicted and true values and reflects the accuracy of the prediction:

L_MSE = (1/N) Σ_c (p(c) − p̂(c))²

where N is the number of categories, p(c) is the predicted output of the model, and p̂(c) is the target label value. BCELoss is a commonly used binary classification loss function that calculates the difference between predicted and actual values:

L_BCE = −[y · log(x) + (1 − y) · log(1 − x)]

where x and y are the predicted output and the label value of the model, respectively.
Finally, the loss function for model training is the weighted sum of the above three terms over all grid cells and predicted bounding boxes, where λ denotes the penalty coefficients (both equal to five), i indexes the grid cells, S² is the number of grid cells into which the input image is divided, j indexes the predicted bounding boxes, and c is the category information.
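As an illustration of how these loss terms combine box geometry and classification error, the sketch below implements CIoU and BCE for a single box pair in NumPy. Corner-format boxes and the small epsilon in α are implementation assumptions, not details from the paper:

```python
import numpy as np

def iou(box_p, box_g):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    return inter / (area_p + area_g - inter)

def ciou_loss(box_p, box_g):
    i = iou(box_p, box_g)
    # Squared center distance over the squared diagonal of the enclosing box.
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio consistency term v and its weight alpha.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = 4.0 / np.pi ** 2 * (np.arctan(wg / hg) - np.arctan(wp / hp)) ** 2
    alpha = v / ((1.0 - i) + v + 1e-9)  # epsilon avoids 0/0 at perfect overlap
    return 1.0 - i + rho2 / c2 + alpha * v

def bce_loss(y, x):
    # Binary cross-entropy between label y and predicted probability x.
    return -(y * np.log(x) + (1.0 - y) * np.log(1.0 - x))
```

A perfectly matched box pair gives a CIoU loss of zero, and any shift, scale, or aspect-ratio mismatch raises it, which is why CIoU converges faster than a plain IoU loss for box regression.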
The iterative training runs for 100 epochs. Each training round is validated on the validation dataset, and the model parameters and loss values are saved to a system file. The trained weight groups are imported separately into the different network models, and the performance metrics of the network models are analyzed by testing on the test dataset. The trained models perform the monitoring and recognition tasks on hardware devices that support neural network inference or on a Windows PC configured with an NVIDIA graphics card (supporting the compute unified device architecture, CUDA).

Performance of Improved YOLOv4 Model
The performance of the improved YOLOv4 model, in terms of rapidity, accuracy, and robustness, is evaluated by four indices, precision, recall, mAP, and FPS [34], which are calculated as follows. Precision refers to the percentage of samples predicted to belong to a category that actually belong to that category:

P = TP / (TP + FP) (14)

where true positive (TP) is the number of correctly identified positive samples and false positive (FP) is the number of negative samples incorrectly identified as positive. Recall refers to the proportion of correctly predicted samples out of all samples labeled with that category:

r = TP / (TP + FN) (15)

where false negative (FN) is the number of positive samples that are incorrectly identified as negative.
AP refers to the accuracy assessment of each category's target frames in target detection, which reflects the detection performance for that target. mAP is calculated from the AP values and refers to the average accuracy over the target frames of all categories:

mAP = (1/N) Σ_{n=1..N} AP_n (16)

where n is the current category and N is the total number of categories. FPS refers to the number of frames processed per second, which is the main factor determining the detection speed:

FPS = 1/T (17)

where T is the time used to process one image.
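The four evaluation indices follow directly from their definitions, as in this minimal illustration (the AP values themselves would come from precision-recall curves, which are not reproduced here):

```python
def precision(tp, fp):
    # Equation (14): fraction of predicted positives that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation (15): fraction of actual positives that are found.
    return tp / (tp + fn)

def mean_ap(ap_per_class):
    # mAP: average of the per-category AP values.
    return sum(ap_per_class) / len(ap_per_class)

def fps(seconds_per_image):
    # FPS: frames processed per second, the reciprocal of per-image time T.
    return 1.0 / seconds_per_image
```

For example, a model that processes one frame in 25 ms runs at 40 FPS, above the 30 FPS threshold cited later for real-time underwater detection.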

Comparison of Different Loss Functions
The change in loss values of the different models over the training epochs is shown in Figure 5. The models gradually converge after 50 epochs, and the YOLOv4 models with the MobileNet feature extraction module have an obviously lower loss value after the 10th epoch than the original YOLOv4 model, indicating much improved learning efficiency. When the GELU activation function is adopted, the loss value decreases faster during training compared to the other models, which means the MobileNet v3-GELU-YOLOv4 model has a further enhancement in learning ability.

Comparison of Changes in the Number of Parameters
As shown in Table 1, the number of model parameters is reduced by the MobileNet feature extraction networks. One parameter storage process in the operation is removed by the GELU activation function, and the number of parameters is further reduced by 64–67% after adopting depthwise separable convolution. Reducing the number of parameters makes the MobileNet v3-GELU-YOLOv4 model available on hardware such as Raspberry Pi and FPGA devices for the detection and identification of fish skin diseases, providing a good basis for deploying lightweight networks.


Model Detection Performance and Precision
Figure 6 shows the performance of different models in detecting hemorrhagic septicemia in snubnose pompano (Trachinotus blochii), where red boxes indicate that the detected fish are healthy and yellow boxes indicate the presence of skin disease. The MobileNet v3-GELU-YOLOv4 model can improve the skin disease detection accuracy to 0.99 while correctly recognizing healthy fish, even with overlapping occlusion of multiple detected targets. In terms of model precision on the four fish skin diseases (Figure 7), the MobileNet v3-YOLOv4 and MobileNet v3-GELU-YOLOv4 models have higher precision than the other three models. In addition, the MobileNet v3-GELU-YOLOv4 model improved the recognition precision of healthy fish and fish with hemorrhagic septicemia by 0.26% and 0.15%, respectively. A possible reason is that the MobileNet v3-GELU-YOLOv4 model includes more BatchNorm structures and GELU activation functions in the lightweight feature extraction network, allowing the output of the convolutional layer to be readjusted to the data distribution based on the mean and variance of the output [35][36][37], increasing the distance of intraspecific details in the feature extraction stage, enriching regional target features, and improving the accuracy and generalizability of species classification, especially for healthy and diseased fish with small morphological differences. It is noteworthy that the addition of the channel attention mechanism module, which captures the relationships between individual pixels of the feature map, improves the ability of the model to capture image contextual information, thus improving the accuracy of reconstruction of regions with weak structure [38].

The models were validated by running them separately on a training server with an NVIDIA graphics card for a same-frame comparison test on real-time monitoring videos of fish in a deep-sea net cage, as shown in Figure 8. The skin health status of Korean rockfish (Sebastes schlegelii) is monitored, and no diseased fish are found, which is consistent with the actual situation. The YOLOv4 and MobileNet v1-YOLOv4 models each had one false positive, while 28 healthy fish are accurately recognized by the MobileNet v3-GELU-YOLOv4 model, the highest number detected among the models. This indicates that the model is capable of being applied to real-time monitoring of fish skin health status in deep-sea cage culture. Underwater images frequently have low quality due to underwater illumination and detection equipment, decreasing model detection performance. Applying sonar devices to assist in the extraction of underwater information can significantly enhance detection accuracy, provide relevant details for model input, and broaden the application scenarios of our proposed method by combining ROI analysis and image classification models to detect underwater targets in low- and no-light conditions [39,40].
In terms of recall and average precision, the MobileNet v3-GELU-YOLOv4 model was the highest, with 98.65% (recall) and 99.64% (mAP), respectively (Table 2). A reasonable explanation is that the mAP of the model is usually effectively improved by using the MobileNet feature extraction network [28,41,42], and the GELU activation function further improved the average precision of the network model [43]. Yu et al. [44] improved mAP by 2.72% after employing the GELU activation function in a vehicle and pedestrian target detection task. For detection speed, although the MobileNet v1-YOLOv4 model has the highest detection speed of 54.14 FPS, its average accuracy for the four fish skin diseases is relatively low due to its relatively simple structure. Usually, the FPS value of a model should be greater than 30 to meet the requirement of real-time underwater camera detection, and from the perspective of actual fishery production, precise detection and early warning of fish diseases are both top priorities in practice [45]. Therefore, the highest mAP value of 99.64% and an FPS value of 39.62 make the MobileNet v3-GELU-YOLOv4 model suitable for use in different types of aquaculture facilities, such as offshore aquaculture platforms, deep-water net cages, and factory farming ponds. Real-time detection of unhealthy farmed fish is possible without additional hardware by simply installing a clear underwater camera and linking it to a device with deep learning processing capacity. Because the model also has a low number of parameters (11,428,545), making it less constrained by hardware, it can be widely used to assist farmers in identifying and treating diseased fish in a timely manner, effectively improving production and quality and minimizing economic losses.
In deep learning, detection and recognition models for fish disease have been based on color classification, including the recognition of epizootic ulcerative syndrome (EUS), ichthyophthirius (Ich), and columnaris [25], and the detection of wounds and lice on Atlantic salmon [46]. Compared to color classification-based models, an object detection-based model such as the one proposed in this study is able to handle more than one detection target with different diseases in the same image, which considerably expands the potential application scenarios of fish disease detection; nevertheless, little research has been conducted on this [47] and there is significant room for improvement [46].

Conclusions
Immediate detection and identification of fish skin diseases is important for aquaculture to prevent outbreaks of fish diseases that can cause mass fish mortality. In this study, a fish disease detection and recognition system was developed using a coupled algorithm approach based on the YOLOv4 model. The system targets common issues such as overlapping of multiple detection targets and small, indistinct features of fish diseases. Specifically, the system was designed to detect and recognize four common fish diseases, hemorrhagic septicemia, saprolegniasis, benedeniasis, and scuticociliatosis, and has been applied to real-time monitoring of the surface health status of fish in deep-sea cage culture.
Compared with the original YOLOv4 model, the improved MobileNet v3-GELU-YOLOv4 model reduces the parameter count by 52,934,556 and increases the mAP and the detection speed by 12.39% and 19.31 FPS, respectively, providing a significant advantage for lightweight network deployment. With an Nvidia GTX 1060 (6G) as the training and test hardware, the detection speed exceeds 30 FPS, which meets the criteria for real-time tracking detection. Because underwater images and videos of fish with skin diseases are scarce in the dataset, the collected images may not be sufficiently clear, which may reduce the generalization performance of the model. Nevertheless, the proposed system still performs well in terms of mAP, detection precision, number of parameters, and detection time.
In the future, it will be necessary to construct large-scale, high-quality datasets of fish skin diseases to improve the generalization of the network, and to incorporate underwater image-processing algorithms for the real-time recognition of, and warning about, fish disease. In the long term, the development of AI-based fish disease detection and recognition models may represent a paradigm shift in how real-time monitoring of fish diseases is conducted in aquaculture.

Fishes 2023, 8, 186
The YOLOv4 model [29] mainly consists of four components: the feature extraction network CSPDarknet, the enhanced feature extraction network SPP, the upsampling and feature-fusion network PANet, and the feature-prediction YOLO Head. The SPP module and PANet together form a feature pyramid. The SPP module greatly increases the receptive field and separates the most important contextual features. PANet performs feature extraction both bottom-up and top-down in the feature pyramid, preserving the features of different layers through repeated feature extraction. The YOLO Head generates results for the different sizes of feature maps.
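To illustrate how the SPP module enlarges the receptive field, the following PyTorch sketch pools the same feature map at several kernel sizes with stride 1 (padded so spatial dimensions are preserved) and concatenates the results along the channel axis. The class name, the pool sizes of 5, 9, and 13, and the tensor shapes are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling: max-pool one feature map at several
    kernel sizes (stride 1, padding k//2 keeps spatial dims) and
    concatenate original + pooled views along the channel axis."""
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes
        )

    def forward(self, x):
        # Original features plus three pooled views -> 4x the channels.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

feat = torch.randn(1, 512, 13, 13)   # assumed backbone output shape
out = SPP()(feat)                    # shape: (1, 2048, 13, 13)
```

Because each pooling branch keeps the spatial size, the concatenation is cheap and only the channel count grows, which is why SPP can widen the context window without extra learned parameters.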

The proposed image-analysis method comprises a preprocessing stage, an improving stage, and a training stage (Figure 2).

Figure 2. Schematic diagram of image-acquisition system, image preprocessing, model improving, model training, and fish skin disease detection.

Figure 3. Data Augmentation: (a) the original image; (b,c) the data-augmented images.
2.2.2. Model Improving

Feature Extraction Network

Feature extraction refers to the process of extracting meaningful information from input data and transforming it into features that are suitable for machine learning models to learn and process. The extracted features are descriptive and non-redundant, and the quality of feature extraction is crucial for subsequent network recognition and model generalization. The classic lightweight feature extraction network is MobileNet. MobileNet v1 introduced depthwise separable convolution, splitting a standard 3 × 3 convolution into a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution to reduce the number of parameters while preserving the feature extraction effect. MobileNet v2 adopts the Inverted Resblock throughout, which uses 1 × 1 convolution and 3 × 3 depthwise separable convolution for dimensionality increase and feature extraction, and connects the input directly to the output through a residual edge structure. MobileNet v3 adds a lightweight channel-based attention model to enhance the feature extraction effect of the network, using a special bneck structure that combines the depthwise separable convolution of MobileNet v1 and the inverted residual structure of MobileNet v2.
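The parameter saving from depthwise separable convolution can be checked with a short calculation; the layer sizes below are illustrative and not taken from the paper's network:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """k x k depthwise conv (one filter per input channel) followed by
    a 1 x 1 pointwise conv that mixes channels."""
    depthwise = k * k * c_in          # spatial filtering, per channel
    pointwise = 1 * 1 * c_in * c_out  # channel mixing
    return depthwise + pointwise

# Example layer: 3 x 3 kernel, 256 -> 256 channels.
standard = conv_params(3, 256, 256)                  # 589,824 weights
separable = depthwise_separable_params(3, 256, 256)  # 67,840 weights
print(standard / separable)  # roughly 8.7x fewer parameters
```

For a 3 × 3 kernel the reduction factor approaches 9 as the channel count grows, which is what makes MobileNet-style backbones attractive for lightweight deployment.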

This inhibition causes the neurons in the neural network to have sparse activation, which helps the network extract the relevant features and fit the training data:

ReLU(x) = max(0, x)
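Since the improved model replaces the activation with GELU, the two functions can be compared directly. The snippet below uses the exact (erf-based) form of GELU, x·Φ(x) with Φ the standard normal CDF, and is only an illustration of the shapes of the two functions:

```python
import math

def relu(x):
    # Hard gate: negative inputs are zeroed exactly.
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), Phi = standard normal CDF via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={v:+.1f}  ReLU={relu(v):.4f}  GELU={gelu(v):.4f}")
```

Unlike ReLU, GELU is smooth and passes small negative values through with reduced weight rather than clipping them to zero, which can ease gradient flow during training.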

Figure 4. Schematic diagram of depthwise separable convolution.

2.2.3. Model Training

Model training is a core part of deep learning and transfer learning. In this study, we built YOLOv4 and its improved models on the PyTorch platform (hardware and software: Windows 10 OS, 16 GB RAM, Nvidia GeForce GTX 1060 (6G) graphics card), pre-trained the models on the publicly available VOC 2007 dataset, then froze most of the model parameters and fine-tuned them on the extended annotated images. The YOLOv4, MobileNet v1-YOLOv4, MobileNet v2-YOLOv4, MobileNet v3-YOLOv4, and MobileNet v3-GELU-YOLOv4 models were trained and tested on the four fish skin disease datasets, which were divided 8:1:1 into training, validation, and test datasets.
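The 8:1:1 split described above can be reproduced with a few lines of Python; the function name and fixed seed are illustrative choices, not the authors' exact procedure:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle a list of samples and split it 8:1:1 into
    train / validation / test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Shuffling before splitting matters here: images of the same disease were likely collected in batches, and an unshuffled split would leave some classes under-represented in the validation and test sets.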

Figure 7. Detection precision of the four fish skin diseases.


Table 1. Parameter sizes of the different network models.

Table 2. Performance of the network models.