Detection Method of Fry Feeding Status Based on YOLO Lightweight Network by Shallow Underwater Images

: Pellet feed is widely used in fry feeding, which cannot sink to the bottom in a short time, so most fries eat in shallow underwater areas. Aiming at the characteristics of fry feeding, we present herein a nondestructive and rapid detection method based on a shallow underwater imaging system and deep learning framework to obtain fry feeding status. Towards this end, images of fry feeding in shallow underwater areas and ﬂoating uneaten pellets were captured, following which they were processed to reduce noise and enhance data information. Two characteristics were deﬁned to reﬂect fry feeding behavior, and a YOLOv4-Tiny-ECA network was used to detect them. The experimental results indicate that the network works well, with a detection speed of 108FPS and a model size of 22.7 MB. Compared with other outstanding detection networks, the YOLOv4-Tiny-ECA network is better, faster, and has stronger robustness in conditions of sunny, cloudy, and bubbles. It indicates that the proposed method can provide technical support for intelligent feeding in factory fry breeding with natural light


Introduction
Since 2020, global fishing production has exceeded 178 million tons; humans consumed 88% of this output [1]. As the world's population continues to expand, the strain on the global fisheries market will also grow [2]. From January 2020, China began to implement 'The Fishing Ban in Yangtze River' for ten years, and productive exploitation of natural fishery resources shall be prohibited during that period [3][4][5]. Increasing breeding species, reforming breeding technologies, and promoting advanced breeding patterns are becoming the priority to develop, and factory aquaculture has become the current popular model [6,7].
Fry breeding is an important part of factory aquaculture, and high-quality fry is the foundation to support the development of fisheries; healthy fry directly affects the yield and economic benefits [8]. At present, most of the fry factories in China adopt the feeding mode of manual feeding or timed and quantitative machine feeding. In manual feeding, workers determine the feeding amount according to experience and eye observation, which is characterized by high labor intensity and low accuracy, and most of the feeding machines in the market are not intelligent enough to determine the feeding quantity of fry [9,10]. The current feeding method easily causes underfeeding or overfeeding; underfeeding will lead to fish nutrition deficiency, and overfeeding will cause the waste of resources and pollute the water. The feed cost accounts for more than 60% of the variable cost in fry breeding [11]. Hence the accurate and effective detection of fry feeding status is very necessary.
In previous studies, researchers used a variety of sensors to obtain fish status. With the improvement of computer vision technology, its convenience and non-damage characteristics make it widely used in fish behavior recognition [12].
Chen et al. [13] combined the texture method and area method to study the fish feeding behavior; four texture features of the inverse moment, correlation, energy, and contrast images to improve image quality, and the image augmentation method is used to enrich the size of the data set.
The remainder of the paper is organized as follows: Section 2 details the materials, underwater imaging system, and image processing scheme. Section 3 gives the results and discusses them. Finally, Section 4 summarizes and prospects the whole paper.

Material and Sample Preparation
The experiment was performed at the Institute of Technology, Nanjing Agricultural University (Nanjing, China). Altogether, 200 grass carp fry were used in this study, with a mean weight of 5.5 g and a mean body length of 7.4 cm. The fries were selected because of their strong adaptability and their wide use in aquaculture in China. The fries were purchased from Zhenyuan Agricultural Development Co., Ltd. (Guiyang, China) and were delivered by express delivery within 72 h after being caught out of water. Nowadays, graded breeding is widely used in factory fry farming; that is, fries are gradually evacuated and separately raised based on body length growth. After about 20 days of breeding, the fry body length can reach 7 cm, and they can start to develop feeding habits-the research object of this study is grass carp fry at this stage.
After the arrival, the fry population was raised in the fishpond for two weeks to adapt to the experiment environment. The pond was made of canvas, with a diameter of 1.5 m and a water depth of 90 cm. The fishpond was sterilized before the fry population was released. In order to protect the breeding environment, the water was exchanged every two days after the fry were released, and the water was exchanged by 50% each time to keep the ammonia nitrogen and nitrite contents in the fishpond at the appropriate value (<0.2 mg/L). Meanwhile, the pH level of the water was kept at 7.5 ± 0.5, and the water temperature was kept at 22 ± 2 • C. During the period of fry adaption to the environment, the feeding rate was 2% of body weight per day with a commercial diet. When feeding, the feed required for a single time will be evenly distributed on the water surface, and the uneaten pellets will be pulled out after 2 h to prevent water pollution.

Experimental System and Data Acquisition
The experimental system is installed in a glass greenhouse on the roof of the Boyuan Building of Nanjing Agricultural University. As shown in Figure 1, the experimental system consisted mainly of a fishpond, a filtration system, a temperature control system, and a feeding system. The filtering system and temperature control system were connected to the fishpond through the water pipe; the filtering system was used to filter out the harmful material in the fish pond, the temperature control system regulates the water temperature and keeps the water temperature stable, and the feeding system sprinkles feed from above the fish pond to the water surface. The image acquisition system includes 3 SARGO A8 waterproof cameras (SARGO, Inc., Guangzhou, China) with 20 million pixels, CMOS sensor, and a maximum resolution of 3840 pixels × 2140 pixels. The image processing system consisted of a Dell 7920 tower workstation (Dell Inc., Round Rock, TX, USA) with Windows10, a 3.0-GHz Intel ® Xeon(R) Gold 6154, 256 GB RAM, and 2 NVIDIA GeForce RTX 2080Ti of 11 GB. Camera 1 and camera 2 were arranged on the inner wall of the fishpond, and the shooting angle was 90 degrees, covering the whole feeding area. Camera 3 was arranged on the bottom of the fishpond, and the water surface was shot from bottom to top. The three cameras all shoot 30FPS videos, and the image size of each frame is 3840 pixels × 2140 pixels. The 3 cameras and the feeding system are, respectively, connected to the computer, images captured are transmitted to the computer for processing, and then the computer sends instructions to control the feeding machine.
During the image acquisition, camera 1 and camera 2 were, respectively, attached to the inner wall of the fishpond at a position of 5-10 cm below the water surface, and the camera angles intersected vertically to form an acquisition area. Camera 3 is located at the bottom of the pond and looks up at the water's surface. Every 300 s, the feeding machine Electronics 2022, 11, 3856 4 of 17 above the pond dispenses feed pellets into the water surface evenly; the 3 cameras shoot the whole feeding process for 7 days.
Electronics 2022, 11, x FOR PEER REVIEW 4 of 17 pixels × 2140 pixels. The 3 cameras and the feeding system are, respectively, connected to the computer, images captured are transmitted to the computer for processing, and then the computer sends instructions to control the feeding machine. During the image acquisition, camera 1 and camera 2 were, respectively, attached to the inner wall of the fishpond at a position of 5-10 cm below the water surface, and the camera angles intersected vertically to form an acquisition area. Camera 3 is located at the bottom of the pond and looks up at the water's surface. Every 300 s, the feeding machine above the pond dispenses feed pellets into the water surface evenly; the 3 cameras shoot the whole feeding process for 7 days.

Two Typical Characteristics of Fry Feeding
It can be known from the fry feeding images acquired during the adaptation period that the fries were young, and their food intake was small, so the disturbance amplitude of the water surface caused by feeding was little. When feeding, most fries swim up and touch the water lightly; while the cameras were set at a shallow underwater layer, the images captured would have a mirror effect because of the water's surface, as shown in Figure 2. In order to facilitate the study of fry feeding behavior, two typical characteristics of fry feeding are defined in this paper: angle feeding and vertical feeding, and their specific characteristics are described as follows: (1) Angle feeding: fry swim up to the surface at an angle, and their mouths touch the feed floating on the surface, and also touch the fry in the water mirror, showing a shape of ">" or "<" in the image taken by the underwater camera, as shown in the figure below. (2) Vertical feeding: fry swim up to the surface vertically, and their mouths touch the feed floating on the surface, and also touch the fry in the water mirror, showing a shape of "|" in the image taken by the underwater camera, as shown in the figure below.

Two Typical Characteristics of Fry Feeding
It can be known from the fry feeding images acquired during the adaptation period that the fries were young, and their food intake was small, so the disturbance amplitude of the water surface caused by feeding was little. When feeding, most fries swim up and touch the water lightly; while the cameras were set at a shallow underwater layer, the images captured would have a mirror effect because of the water's surface, as shown in Figure 2. In order to facilitate the study of fry feeding behavior, two typical characteristics of fry feeding are defined in this paper: angle feeding and vertical feeding, and their specific characteristics are described as follows: (1) Angle feeding: fry swim up to the surface at an angle, and their mouths touch the feed floating on the surface, and also touch the fry in the water mirror, showing a shape of ">" or "<" in the image taken by the underwater camera, as shown in the figure below. (2) Vertical feeding: fry swim up to the surface vertically, and their mouths touch the feed floating on the surface, and also touch the fry in the water mirror, showing a shape of "|" in the image taken by the underwater camera, as shown in the figure below.
Electronics 2022, 11, x FOR PEER REVIEW 5 of 17 Figure 2. Two typical characteristics of fry feeding: (a,b) angle feeding images; (c,d) vertical feeding images.
2.3.2. Processing of Underwater Acquired Images As shown in Figure 3, fry-feeding images were taken shallow underwater under different light conditions, (a) is the image acquired in sunny conditions and (b) is the image acquired in cloudy conditions. In this section, fry-feeding behaviors in the images were observed and counted manually.

Processing of Underwater Acquired Images
As shown in Figure 3, fry-feeding images were taken shallow underwater under different light conditions, (a) is the image acquired in sunny conditions and (b) is the image acquired in cloudy conditions. In this section, fry-feeding behaviors in the images were observed and counted manually. Figure 2. Two typical characteristics of fry feeding: (a,b) angle feeding images; (c,d) vertical feeding images.

Processing of Underwater Acquired Images
As shown in Figure 3, fry-feeding images were taken shallow underwater under different light conditions, (a) is the image acquired in sunny conditions and (b) is the image acquired in cloudy conditions. In this section, fry-feeding behaviors in the images were observed and counted manually.  As shown in Figure 4, images of fry feeding were taken at the bottom of the pond: (a) is the image of floating uneaten pellets and (b) is the image of fry feeding behavior. In this section, the feed pellets in the images were observed and counted manually.   As shown in Figure 4, images of fry feeding were taken at the bottom of the pond: (a) is the image of floating uneaten pellets and (b) is the image of fry feeding behavior. In this section, the feed pellets in the images were observed and counted manually. Figure 2. Two typical characteristics of fry feeding: (a,b) angle feeding images; (c,d) vertical feeding images.

Processing of Underwater Acquired Images
As shown in Figure 3, fry-feeding images were taken shallow underwater under different light conditions, (a) is the image acquired in sunny conditions and (b) is the image acquired in cloudy conditions. In this section, fry-feeding behaviors in the images were observed and counted manually.  As shown in Figure 4, images of fry feeding were taken at the bottom of the pond: (a) is the image of floating uneaten pellets and (b) is the image of fry feeding behavior. In this section, the feed pellets in the images were observed and counted manually.    Figure 5 shows the relationship between fry feeding behaviors and uneaten pellets on the water surface based on the recognition of human observation. As can be seen from the figure, the number of uneaten pellets on the water surface decreases with the occurrence of fry feeding behaviors, and the higher the frequency of fry feeding characteristics detected, the faster the amount of uneaten pellets decreases; therefore, it proves that the occurrence of fry feeding behaviors can be detected based on fry feeding characteristics. Figure 5 shows the relationship between fry feeding behaviors and uneaten pellets on the water surface based on the recognition of human observation. As can be seen from the figure, the number of uneaten pellets on the water surface decreases with the occurrence of fry feeding behaviors, and the higher the frequency of fry feeding characteristics detected, the faster the amount of uneaten pellets decreases; therefore, it proves that the occurrence of fry feeding behaviors can be detected based on fry feeding characteristics.

Judging the Fry Feeding Status Manually
(a) (b) Figure 5. The relationship between fry feeding behaviors and uneaten pellets: (a) fry feeding in high intensity; (b) fry feeding in low intensity.

Fry Feeding Status Judged by Computer Vision Image Processing Methods
In this section, 2000 images acquired in the shallow underwater layer were selected for processing. The JPG format images were processed and made into data sets for the network training; the processing process is shown in Figure 6.

Image Preprocessing
In order to simulate the influence of different light conditions in the actual breeding and production site, the experimental system was arranged on the rooftop in this study; therefore, the external environment of the image acquisition includes soft light conditions and hard light conditions. In this study, the selected images were cut to obtain a square image containing the fry feeding characteristics, and then the images were processed with information enhancement, as shown in Figure 7.

Fry Feeding Status Judged by Computer Vision Image Processing Methods
In this section, 2000 images acquired in the shallow underwater layer were selected for processing. The JPG format images were processed and made into data sets for the network training; the processing process is shown in Figure 6.
the figure, the number of uneaten pellets on the water surface decreases with the oc rence of fry feeding behaviors, and the higher the frequency of fry feeding characteri detected, the faster the amount of uneaten pellets decreases; therefore, it proves tha occurrence of fry feeding behaviors can be detected based on fry feeding characterist (a) (b) Figure 5. The relationship between fry feeding behaviors and uneaten pellets: (a) fry feeding in intensity; (b) fry feeding in low intensity.

Fry Feeding Status Judged by Computer Vision Image Processing Methods
In this section, 2000 images acquired in the shallow underwater layer were sele for processing. The JPG format images were processed and made into data sets for network training; the processing process is shown in Figure 6.

Image Preprocessing
In order to simulate the influence of different light conditions in the actual bree and production site, the experimental system was arranged on the rooftop in this st therefore, the external environment of the image acquisition includes soft light condit and hard light conditions. In this study, the selected images were cut to obtain a sq image containing the fry feeding characteristics, and then the images were processed information enhancement, as shown in Figure 7.

Image Preprocessing
In order to simulate the influence of different light conditions in the actual breeding and production site, the experimental system was arranged on the rooftop in this study; therefore, the external environment of the image acquisition includes soft light conditions and hard light conditions. In this study, the selected images were cut to obtain a square image containing the fry feeding characteristics, and then the images were processed with information enhancement, as shown in Figure 7. Images taken underwater are often affected by fog; therefore, we first defogged the images and then carried out three steps of processing the defogged images: (1) The images were denoised by median filtering (the window size is 64 × 64), which can remove the high-frequency noise in the image and make the image smoother. (2) The color channe conversion was carried out on the defogged images, images were converted from RGB format to HSV format, and then wavelet transform was carried out on the transformed HSV images, and finally converted them back to RGB images. The processed images can display texture information more clearly. (3) Color channel conversion was carried out on the defogged image, images were converted from RGB format to HSV format, and then the HSV images were separated by channel. The separated V-channel images were pro cessed by CLAHE [29], and then the image channels were merged and converted back to RGB images, which can improve the clarity of the image. The data set obtained in the above three steps was combined to generate a network training data set containing 6000 images. The data set was labeled according to the fry feeding characteristics mentioned above (angle feeding and vertical feeding), and the labeling software was LabelImg (v1.8.1, https://pypi.org/project/labelImg/1.8.1/).
Six thousand images are far from sufficient for training a robust model; in order to ensure accuracy, we used data augmentation to enlarge the dataset size and prevent over fitting. Methods include image rotation, flip, random changes in brightness, and random noise. The expanded data set and annotation files were sorted out, and files with poo quality were excluded. A dataset containing 58,360 images was obtained, among which 52,524 were training data sets and 5836 were verification data sets.

YOLOv4-Tiny Algorithm
As a current excellent one-stage target detection algorithm, the YOLOv4 network represents continued improvements to previous YOLO series networks. It has a faster de tection speed while guaranteeing network accuracy, and it can perform real-time detec tion on devices with low computing power. The YOLOv4-Tiny algorithm is designed on the basis of the YOLOv4 algorithm [30]. Its network structure is more simplified, and the number of network layers is reduced from 162 to 38, which greatly improves the targe detection speed; however, its accuracy and speed can still meet the requirements of prac tical applications. The YOLOv4-Tiny algorithm uses the CSPDarknet53-Tiny network as the backbone network and uses the CSPBlock module in the cross-stage partial network [31] to replace the ResBlock module in the residual network. The CSPBlock module di vides the feature map into two parts and merges the two parts through the CSP residua network so that the gradient information can be transmitted in two different paths to in crease the correlation of gradient information.
Compared with the ResBlock module, the CSPBlock module can enhance the learn ing ability of the convolutional network. Although the computation amount increases by Images taken underwater are often affected by fog; therefore, we first defogged the images and then carried out three steps of processing the defogged images: (1) The images were denoised by median filtering (the window size is 64 × 64), which can remove the high-frequency noise in the image and make the image smoother. (2) The color channel conversion was carried out on the defogged images, images were converted from RGB format to HSV format, and then wavelet transform was carried out on the transformed HSV images, and finally converted them back to RGB images. The processed images can display texture information more clearly. (3) Color channel conversion was carried out on the defogged image, images were converted from RGB format to HSV format, and then the HSV images were separated by channel. The separated V-channel images were processed by CLAHE [29], and then the image channels were merged and converted back to RGB images, which can improve the clarity of the image. The data set obtained in the above three steps was combined to generate a network training data set containing 6000 images. The data set was labeled according to the fry feeding characteristics mentioned above (angle feeding and vertical feeding), and the labeling software was LabelImg (v1.8.1, https://pypi.org/project/labelImg/1.8.1/, accessed on 10 October 2022).
Six thousand images are far from sufficient for training a robust model; in order to ensure accuracy, we used data augmentation to enlarge the dataset size and prevent overfitting. Methods include image rotation, flip, random changes in brightness, and random noise. The expanded data set and annotation files were sorted out, and files with poor quality were excluded. A dataset containing 58,360 images was obtained, among which 52,524 were training data sets and 5836 were verification data sets.

YOLOv4-Tiny Algorithm
As a current excellent one-stage target detection algorithm, the YOLOv4 network represents continued improvements to previous YOLO series networks. It has a faster detection speed while guaranteeing network accuracy, and it can perform real-time detection on devices with low computing power. The YOLOv4-Tiny algorithm is designed on the basis of the YOLOv4 algorithm [30]. Its network structure is more simplified, and the number of network layers is reduced from 162 to 38, which greatly improves the target detection speed; however, its accuracy and speed can still meet the requirements of practical applications. The YOLOv4-Tiny algorithm uses the CSPDarknet53-Tiny network as the backbone network and uses the CSPBlock module in the cross-stage partial network [31] to replace the ResBlock module in the residual network. The CSPBlock module divides the feature map into two parts and merges the two parts through the CSP residual network so that the gradient information can be transmitted in two different paths to increase the correlation of gradient information.
Compared with the ResBlock module, the CSPBlock module can enhance the learning ability of the convolutional network. Although the computation amount increases by 10-20%, its accuracy is greatly improved. In order to reduce the amount of computation, the YOLOv4-Tiny algorithm eliminates the neck part with a high amount of computation in the original CSPBlock module, which not only reduces the amount of computation but also improves the accuracy of the algorithm. In order to further reduce the amount of computation and improve the real-time performance of the network, YOLOv4-Tiny uses the LeakyReLU function to replace the Mish function as the activation function [32]. The LeakyReLU function is shown in Formula (1): Here a i ∈(1,+∞) is a constant.
In feature extraction and fusion, YOLOv4-Tiny is different from YOLOv4, which adopts spatial pyramid pooling (SPP) [33] and path aggregation network (PANet) [34] for feature extraction and fusion. The purpose of SPP is to increase the receptive field of the network, while PANet shortens the distance between features with large sizes at the bottom and features with small sizes at the top, making feature fusion more effective. The YOLOv4-Tiny algorithm uses a feature pyramid network (FPN) [35] to extract feature maps with different proportions to improve target detection speed. Meanwhile, it uses two different proportion feature maps (13 × 13 and 26 × 26) to predict detection results. The network structure of YOLOv4-Tiny is shown in Figure 8. In feature extraction and fusion, YOLOv4-Tiny is different from YOLOv4, which adopts spatial pyramid pooling (SPP) [33] and path aggregation network (PANet) [34] fo feature extraction and fusion. The purpose of SPP is to increase the receptive field of th network, while PANet shortens the distance between features with large sizes at the bot tom and features with small sizes at the top, making feature fusion more effective. Th YOLOv4-Tiny algorithm uses a feature pyramid network (FPN) [35] to extract featur maps with different proportions to improve target detection speed. Meanwhile, it use two different proportion feature maps (13 × 13 and 26 × 26) to predict detection results The network structure of YOLOv4-Tiny is shown in Figure 8.

Calculation Method of Loss Function
The loss function L of the YOLOv4-Tiny network consists of three parts, namely, bounding box regression loss L Coord , confidence loss L Con f , and classification loss L Cls . The loss function L can be expressed in Formula (2): Bounding box regression loss L Coord can be expressed in Formula (3): Confidence loss L Con f can be expressed in Formula (4): Classification loss L Cls can be expressed in Formula (5): The original image is divided into N × N grids, each of which generates M candidate bounding boxes, forming a total of N × N × M bounding boxes. If there is no target in the bounding box (noObj), then only confidence loss is calculated. Here, i is the i-th grid of feature map, j is prediction bounding box responsible for the j-th box, w and h are the width and height of the ground truth box. I

Obj ij
represents whether there is a target in the i-th cell. I noObj ij represents whether there is no target in the i-th cell. Bounding box regression loss will be multiplied by (2 − w × h) to enlarge the loss of the small bounding box. The confidence and classification loss function is cross-entropy loss. The confidence loss has two cases: any object and no object. In the case of no object, λ is introduced to describe the loss percentage. L CIoU is denoted as the CIoU loss between the prediction box and ground truth box. L CIoU can be expressed in Formulas (6)-(8): where IoU represents the ratio of intersection and union in two boxes. ρ is the Euclidian distance of the two centers of the predicted box (b) and ground truth box (b gt ), and c is the diagonal length of the smallest box containing two boxes. α is the weight coefficient and the consistency of the aspect ratio is recorded as v. w gt and h gt are the width and height of the ground truth box and w and h are the width and height of the predicted box, respectively.

Attention Mechanism
In order to obtain a better training effect, the attention mechanism is introduced into the neural network in this study, and the performance of the original network model can be greatly improved by only adding a small amount of computation. The attention mechanism adopted in this paper includes CBAM module, SE module, and ECA module [36][37][38].
CBAM module is composed of a channel attention module and a spatial attention module in series. It calculates from the two dimensions of channel and space. Then, the attention map is multiplied by its input feature map for adaptive learning, so as to improve the accuracy of feature extraction by the model. The schematic diagram of its implementation is shown in Figure 9.
SE module is shown in Figure 10. After global average pooling (GAP) for each channel, the Sigmoid function is selected as its activation function through two nonlinear full connection layers. As can be seen from the figure, two nonlinear fully connected layers in SE module are used to capture nonlinear cross-channel interactions and reduce the complexity of the model through dimensionality reduction. SE module is shown in Figure 10. After global average pooling (GAP) for each c nel, the Sigmoid function is selected as its activation function through two nonlinear connection layers. As can be seen from the figure, two nonlinear fully connected laye SE module are used to capture nonlinear cross-channel interactions and reduce the c plexity of the model through dimensionality reduction. The structure of the ECA module is shown in Figure 11. Different from the SE m ule, the ECA module also uses global average pooling but does not reduce channe mension. ECA module adopts one-dimensional fast convolution with a convolution nel size of k. K is not only the size of the convolution kernel but also represents the co age of local cross-channel interaction, which means that k adjacent channels are invo in attention prediction near each channel. ECA module not only ensures the efficienc the model but also ensures the effect of calculation. K value can be determined adapti by the function of total channel number G, and the calculation is shown in Formula ( = ( ) = | log 2 + 1 2 |  SE module is shown in Figure 10. After global average pooling (GAP) for each c nel, the Sigmoid function is selected as its activation function through two nonlinear connection layers. As can be seen from the figure, two nonlinear fully connected laye SE module are used to capture nonlinear cross-channel interactions and reduce the c plexity of the model through dimensionality reduction. The structure of the ECA module is shown in Figure 11. Different from the SE m ule, the ECA module also uses global average pooling but does not reduce channe mension. ECA module adopts one-dimensional fast convolution with a convolution nel size of k. K is not only the size of the convolution kernel but also represents the co age of local cross-channel interaction, which means that k adjacent channels are invo in attention prediction near each channel. ECA module not only ensures the efficienc the model but also ensures the effect of calculation. K value can be determined adapti by the function of total channel number G, and the calculation is shown in Formula ( where k is the neighborhood of each channel; G is the total number of chan | | is the odd number closest to . The structure of the ECA module is shown in Figure 11. Different from the SE module, the ECA module also uses global average pooling but does not reduce channel dimension. ECA module adopts one-dimensional fast convolution with a convolution kernel size of k. K is not only the size of the convolution kernel but also represents the coverage of local cross-channel interaction, which means that k adjacent channels are involved in attention prediction near each channel. ECA module not only ensures the efficiency of the model but also ensures the effect of calculation. K value can be determined adaptively by the function of total channel number G, and the calculation is shown in Formula (9): where k is the neighborhood of each channel; G is the total number of channels; |x| odd is the odd number closest to x. Electronics 2022, 11, x FOR PEER REVIEW 11 of 17 Figure 11. Structure of ECA module.

Model Performance Analysis
The training parameters used in this paper are shown in Table 1. The training stops when the verification loss remains constant for 10 epochs, and the learning rate increases by 0.1 times when the verification loss remains constant for 2 epochs. In this study, precision rate, recall rate, and F1 score are used to measure the performance of the detection network. The precision rate refers to the proportion of correctly detected targets in the total number of detected targets, recall rate refers to the proportion of correctly detected targets in the total number of targets to be detected in the data set, and F1 score can be regarded as a harmonic average of model accuracy and recall. The calculation methods of precision rate, recall rate, and F1 score are shown in Formulas (10)-(12): where TP is the abbreviation of "true positive", which refers to the correctly identified feeding fry; FN is the abbreviation of "false negative", which refers to the missed feeding fry; FP is the abbreviation of "false positive", which refers to the fish fry incorrectly detected as feeding status.

Results
As shown in Figure 12, with the YOLOv4-Tiny model, fry with angle feeding characteristics are detected in red boxes, and fry with vertical feeding characteristics are detected in blue boxes. The precision rate, recall rate, and F1 score of angle feeding was 90.76%, 86.96%, and 0.89. The precision rate, recall rate, and F1 score of vertical feeding was 85.46%, Figure 11. Structure of ECA module.

Model Performance Analysis
The training parameters used in this paper are shown in Table 1. The training stops when the verification loss remains constant for 10 epochs, and the learning rate increases by 0.1 times when the verification loss remains constant for 2 epochs. In this study, precision rate, recall rate, and F 1 score are used to measure the performance of the detection network. The precision rate refers to the proportion of correctly detected targets in the total number of detected targets, recall rate refers to the proportion of correctly detected targets in the total number of targets to be detected in the data set, and F 1 score can be regarded as a harmonic average of model accuracy and recall. The calculation methods of precision rate, recall rate, and F 1 score are shown in Formulas (10)- (12): where TP is the abbreviation of "true positive", which refers to the correctly identified feeding fry; FN is the abbreviation of "false negative", which refers to the missed feeding fry; FP is the abbreviation of "false positive", which refers to the fish fry incorrectly detected as feeding status.

Results
As shown in Figure 12, with the YOLOv4-Tiny model, fry with angle feeding characteristics are detected in red boxes, and fry with vertical feeding characteristics are detected in blue boxes. The precision rate, recall rate, and F 1 score of angle feeding was 90.76%, 86.96%, and 0.89. The precision rate, recall rate, and F 1 score of vertical feeding was 85.46%, 74.09%, and 0.79. The detection speed of the model was 116FPS and the model size was 22.4 MB.

Comparison with Other Networks
This section compares YOLOv4-Tiny with six other commonly used advanced networks, such as YOLOv3, YOLOv3-EfficientNet, YOLOv4, YOLOv4-ResNet50, YOLOv4-VGG, and YOLOv5. All the networks in this paper were pre-trained on the Google ImageNet dataset, and all the comparison trials followed the same steps as YOLOv4-Tiny. To better illustrate the validity of this approach, a detailed comparison of the model results is presented in Table 2. Compared with the results of other models, the precision rate, recall rate, and F1 score of the YOLOv4-Tiny model are in the middle and lower level, higher than YOLOv4-VGG and lower than other models; however, the processing speed of the YOLOv4-Tiny model is much higher than other models, and its model size is the smallest. By comparison, the YOLOv4-Tiny model has the potential to be applied to real-time detection.

Performance Comparison after Adding Attention Mechanism
As shown in Figure 13, three attention mechanism modules are added to the YOLOv4-Tiny network in this study, namely, CBAM module, SE module, and ECA module, respectively, and the adding positions are in the red star in the figure.

Comparison with Other Networks
This section compares YOLOv4-Tiny with six other commonly used advanced networks, such as YOLOv3, YOLOv3-EfficientNet, YOLOv4, YOLOv4-ResNet50, YOLOv4-VGG, and YOLOv5. All the networks in this paper were pre-trained on the Google Im-ageNet dataset, and all the comparison trials followed the same steps as YOLOv4-Tiny. To better illustrate the validity of this approach, a detailed comparison of the model results is presented in Table 2. Compared with the results of other models, the precision rate, recall rate, and F 1 score of the YOLOv4-Tiny model are in the middle and lower level, higher than YOLOv4-VGG and lower than other models; however, the processing speed of the YOLOv4-Tiny model is much higher than other models, and its model size is the smallest. By comparison, the YOLOv4-Tiny model has the potential to be applied to real-time detection.

Performance Comparison after Adding Attention Mechanism
As shown in Figure 13, three attention mechanism modules are added to the YOLOv4-Tiny network in this study, namely, CBAM module, SE module, and ECA module, respectively, and the adding positions are in the red star in the figure.
As shown in Table 3, the precision rate, recall rate, and F 1 score of the YOLOv4-Tiny network model with the addition of attention mechanisms are all improved. The detection performance of the YOLOv4-Tiny network model with the ECA module is not the highest, while its detection speed decreases at the least. Meanwhile, its detection performance is higher than those of other network models except YOLOv5 above. The precision rate, recall rate, and F 1 score of angle feeding was 96.23%, 95.92%, and 0.96. The precision rate, recall rate, and F 1 score of vertical feeding was 95.89%, 96.44%, and 0.96, respectively. At the same time, the detection speed decreased slightly to 108FPS; model size increased slightly to 22.7 MB. As shown in Table 3, the precision rate, recall rate, and F1 score of the YOLOv4-Tiny network model with the addition of attention mechanisms are all improved. The detection performance of the YOLOv4-Tiny network model with the ECA module is not the highest, while its detection speed decreases at the least. Meanwhile, its detection performance is higher than those of other network models except YOLOv5 above. The precision rate, recall rate, and F1 score of angle feeding was 96.23%, 95.92%, and 0.96. The precision rate, recall rate, and F1 score of vertical feeding was 95.89%, 96.44%, and 0.96, respectively. At the same time, the detection speed decreased slightly to 108FPS; model size increased slightly to 22.7 MB.  Figure 14 shows the precision-recall curves and loss curves of YOLOv3, YOLOv3-EfficientNet, YOLOv4, YOLOv4-ResNet50, YOLOv4-VGG, YOLOv5, and YOLOv4-Tiny-ECA network models. As can be seen from the figure, the precision-recall curve of the YOLOv4-Tiny-ECA model is closer to the upper right corner, and the region enclosed is the largest, indicating the best network detection effect. Meanwhile, it can be seen from the loss curves that the initial loss of the red curve of the YOLOv4-Tiny-ECA model decreases more slowly than other curves, which is due to the addition of an attention mechanism in the model. However, with the increase in training iterations, the red curve decreases faster and reaches a stable state earlier; therefore, the YOLOv4-Tiny-ECA model has better performance and a faster convergence rate.   Figure 14 shows the precision-recall curves and loss curves of YOLOv3, YOLOv3-EfficientNet, YOLOv4, YOLOv4-ResNet50, YOLOv4-VGG, YOLOv5, and YOLOv4-Tiny-ECA network models. As can be seen from the figure, the precision-recall curve of the YOLOv4-Tiny-ECA model is closer to the upper right corner, and the region enclosed is the largest, indicating the best network detection effect. Meanwhile, it can be seen from the loss curves that the initial loss of the red curve of the YOLOv4-Tiny-ECA model decreases more slowly than other curves, which is due to the addition of an attention mechanism in the model. However, with the increase in training iterations, the red curve decreases faster and reaches a stable state earlier; therefore, the YOLOv4-Tiny-ECA model has better performance and a faster convergence rate.

Detection Performance Analysis under Different Underwater Shooting Conditions
In order to verify the robustness of the YOLOv4-Tiny-ECA model, this paper tested the detection performance of the model under four different underwater shooting conditions. As shown in Figure 15, fry with angle feeding characteristics are detected in red boxes, and fry with vertical feeding characteristics are detected in blue boxes.

Detection Performance Analysis under Different Underwater Shooting Conditions
In order to verify the robustness of the YOLOv4-Tiny-ECA model, this paper tested the detection performance of the model under four different underwater shooting conditions. As shown in Figure 15, fry with angle feeding characteristics are detected in red boxes, and fry with vertical feeding characteristics are detected in blue boxes.
(a) (b) (c) Figure 14. The precision-recall curves and loss curves: (a) precision-recall curves of angle feeding; (b) precision-recall curves of vertical feeding; (c) loss curves of different network models.

Detection Performance Analysis under Different Underwater Shooting Conditions
In order to verify the robustness of the YOLOv4-Tiny-ECA model, this paper tested the detection performance of the model under four different underwater shooting conditions. As shown in Figure 15, fry with angle feeding characteristics are detected in red boxes, and fry with vertical feeding characteristics are detected in blue boxes.  Table 4 shows the comparison of detection results under four different underwater shooting conditions. The results indicate that the YOLOv4-Tiny-ECA model can achieve good detection results under both sunny and cloudy conditions. The model detection performance was relatively weak under the condition of underwater bubbles, and the performance was not good in dark underwater conditions.   Table 4 shows the comparison of detection results under four different underwater shooting conditions. The results indicate that the YOLOv4-Tiny-ECA model can achieve good detection results under both sunny and cloudy conditions. The model detection performance was relatively weak under the condition of underwater bubbles, and the performance was not good in dark underwater conditions.

Findings and Limitations
In this study, we defined two typical characteristics (angle feeding and vertical feeding) of fry feeding and proved that the occurrence of fry feeding behaviors can be detected based on fry feeding characteristics. We also proposed a YOLOv4-Tiny-ECA network to detect the two typical characteristics from acquired images; then, we can realize variable feeding based on fry feeding status.
This study indicates that the YOLOv4-Tiny-ECA model has strong robustness to sunny daylight, cloudy daylight, and underwater bubble conditions and relatively weak detection results to dark underwater conditions, meaning that this method can be better applied to the detection of fry feeding status in the daytime.

Conclusions
In this study, we propose and verify a deep learning network for fry feeding status detection in factory farming, which will provide technical support for intelligent feeding.
The results indicate that for the YOLOv4-Tiny-ECA model, the precision rate, recall rate, and F 1 score of angle feeding detection were 96.23%, 95.92%, and 0.96, respectively, and the precision rate, recall rate, and F 1 score of vertical feeding detection were 95.89%, 96.44%, and 0.96, respectively; the detection speed was 108FPS and the model size was 22.7 MB.
Considering the actual breeding environment, this study tested the detection performance of the model under the conditions of sunny, cloudy, dark, and bubble, and the results showed that the YOLOv4-Tiny-ECA model can achieve good detection results under sunny and cloudy conditions. The detection performance of the model is relatively weak under the condition of underwater bubbles and is not good under dark underwater conditions. It indicates that this method can be better applied to the detection of fry feeding status under natural light conditions in the daytime.
In future studies, our goal is to develop an effective feeding behavior detection system for multiple fry species in all-weather and all-season light conditions, which is more challenging but also more practical for application scenarios. We plan to obtain the feeding characteristics information of more species of fry, collect the feeding images of fry under different lighting conditions, and enhance the universality of the detection network. At the same time, we also plan to develop and deploy a set of edge computing detection equipment to realize real-time transmission of fry feeding status information. Due to the limitation of the computing capacity of the equipment, a more efficient and lighter-weight network is needed to realize the function. The ultimate purpose of fish feeding status detection is to serve intelligent feeding equipment, so we will also combine the feeding status detection function with variable feeding equipment to achieve 24-h precision fry feeding.