A Novel Dual-Index Analysis Method for Quantifying Fish School Feeding Intensity Using Average Swimming Speed and Feeding Aggregation Speed

Jia, Bo; Wang, Xiaochan; Shi, Yinyan; Zheng, Jinming; Wang, Jihao; Xu, Zhen; Zhang, Xiaolei; Zhou, Chengquan

doi:10.3390/fishes11050300

Open Access Article


by Bo Jia ¹, Xiaochan Wang ¹,*, Yinyan Shi ¹, Jinming Zheng ², Jihao Wang ¹, Zhen Xu ¹, Xiaolei Zhang ¹ and Chengquan Zhou ³

¹ College of Engineering, Nanjing Agricultural University, Nanjing 210031, China

² School of Mechanical and Electrical Engineering, Chuzhou University, Chuzhou 239000, China

³ Institute of Agricultural Equipment, Zhejiang Academy of Agricultural Sciences, Hangzhou 310000, China

* Author to whom correspondence should be addressed.

Fishes 2026, 11(5), 300; https://doi.org/10.3390/fishes11050300

Submission received: 17 April 2026 / Revised: 15 May 2026 / Accepted: 15 May 2026 / Published: 18 May 2026

(This article belongs to the Special Issue Computer Vision Applications for Fisheries and Aquaculture)


Abstract

Accurate identification and quantitative assessment of fish feeding intensity are pivotal for enhancing aquaculture production efficiency. Currently, feeding intensity is mainly assessed based on fish school feeding images with a single feature, overlooking the interdependencies between individual fish and the fish school’s behavior. Therefore, this paper presents a method based on detecting individual fish heads to characterize the feeding aggregation speed and the average swimming speed of the fish school, thereby quantifying the fish school’s feeding intensity. First, the improved YOLOv11n-ALL model was employed to detect individual fish heads, which improved detection performance, increased inference speed, and reduced computational complexity. Additionally, feeding aggregation speed and average swimming speed indices for fish schools were constructed by combining the YOLOv11n-ALL model with the ByteTrack algorithm to track and extract the centers of the detection boxes of individual fish heads. Finally, the fish school feeding kinetic energy was assessed using the dual indices of feeding aggregation speed and average swimming speed, and the fish school feeding intensity levels were classified according to the feeding kinetic energy. Experimental results reveal that the improved YOLOv11n-ALL model achieved a mean average precision (mAP50) of 94.13% for detecting fish heads, reduced the parameter count by 22.09%, and exhibited a computational complexity of 6.4 GFLOPs. Furthermore, the classification model of fish school feeding intensity, quantified by the dual indices of average swimming speed and feeding aggregation speed, achieved a detection accuracy of 97.41%. This method digitizes detection results, enabling rapid classification of fish school feeding intensity and demonstrating its effectiveness for feeding intensity assessment and the development of scientific feeding strategies.

Keywords: aquaculture; feeding aggregation speed; average swimming speed; feeding intensity; deep learning

Key Contribution: A novel dual-index method combining feeding aggregation speed and average swimming speed is proposed to quantitatively assess fish school feeding intensity.

1. Introduction

In intensive recirculating aquaculture systems (RASs), feeding management plays a significant role in production efficiency and profitability. Conventional feeding methods that rely on subjective manual visual inspection are unsuitable for large-scale aquaculture. Overfeeding not only causes feed waste and increased costs but also leads to deterioration in water quality [1,2,3]. Underfeeding may exacerbate size heterogeneity among cultured individuals and negatively impact the welfare of farmed fish [4,5]. Thus, automated, high-precision approaches for monitoring fish feeding intensity are crucial for developing sustainable aquaculture and reducing feed consumption [6].

Deep learning-based machine vision methods have been widely applied in intensive aquaculture for the recognition of fish feeding behavior and the quantification of feeding intensity, owing to their high detection efficiency and recognition accuracy [7,8,9]. Based on differences in input modalities, detection targets, tracking backbone networks, and behavioral characteristics, methods for assessing fish feeding intensity are categorized into direct detection methods (including fish target recognition, tracking, splash size, and feeding aggregation) and indirect detection methods (based on leftover feed particles). In direct methods, the detection targets are individual fish and the fish school; feeding intensity is quantified by analyzing the behavioral characteristics exhibited during feeding. In indirect methods, the detection targets are feed pellets; feeding intensity is quantified indirectly by measuring the amount of remaining feed.

Previous studies show that fish exhibit vertical or inclined postures in the surface water layer during active feeding, and this behavioral feature provides a critical basis for computer vision-based recognition of fish feeding states [10]. The improved ByteTrack algorithm proposed in prior studies enables a precise quantitative assessment of fish feeding appetite, achieving a feeding state recognition accuracy of up to 98.47% [11,12]. Liu et al. [13] developed an online 3D multi-target tracking method achieving 95.03% accuracy (MOTA). At the same time, image segmentation was applied by Wu et al. [14] to extract key points and construct kinematic parameters such as head-turning and tail-swinging angles.

These behavioral features provide key quantitative indicators for recognizing fish feeding behavior. However, all of the above methods rely on manually extracting fish features and performing feature analysis to identify the feeding status of fish schools, as well as on accurate foreground segmentation and target localization. Consequently, they are suitable mainly for environments with low noise and relatively stable backgrounds, where fish feeding behavior can be identified stably and effectively. In addition, uneven lighting, refraction, and reflection degrade the quality of fish feeding images and increase the demands on recognition algorithms [15,16,17].

Fish school feeding behavior is typically recognized and quantified through movement analysis. Zhou et al. [18] proposed a fish school aggregation method based on Delaunay triangulation, achieving 98% accuracy in determining feeding behavior. Wei et al. [19] developed a spatiotemporal feature-based classification model for feeding intensity, yielding accuracies of 97.08%, 97.35%, 92.50%, and 98.31% across four fish species. Zhao et al. [20] introduced a feeding desire assessment method that incorporates fish movement and environmental parameters, while Ye et al. [21] constructed a feeding intensity model using behavioral features and a combination of entropy analysis and optical flow algorithms.

Classification algorithms have been used to identify and evaluate feeding behavior in fish aquaculture, achieving recognition accuracies above 90% [22,23,24]. Current fish school feeding analysis mainly relies on single-feature recognition, such as Delaunay triangulation for centroid and distance calculation, which is prone to errors in complex aquaculture environments. This is because aggregation behavior alone cannot reliably determine feeding intensity, as this behavior can also occur in non-feeding contexts. Optical flow–based speed analysis is computationally intensive, limiting its use in real time on edge devices. Deep learning-based feeding intensity classification lacks a quantitative assessment of feeding intensity across different feeding levels.

Additionally, fish feeding intensity can be indirectly assessed by detecting uneaten feed. Hu et al. [25] improved YOLOv4 to identify residual feed with 92.61% accuracy, while Xu et al. [2] enhanced YOLOv5 to achieve 94.1%. Zhou et al. [26] applied adaptive linear timestamps and thresholds for feed quantification. Combining MobileNet with receptive field analysis in YOLOv3 improves fish detection accuracy [27], and adding dense connections to YOLOv4 for more efficient feature fusion maintains high detection accuracy in practical aquaculture applications [25]. However, relying solely on uneaten feed detection offers limited insight into feeding status, as pellet size and inter-individual variability substantially affect recognition accuracy in practical aquaculture. By detecting and tracking changes in individual fish swimming speeds and aggregation states, the feeding intensity of fish populations can be effectively assessed, thereby preventing overfeeding or underfeeding. A comparison of existing deep learning detection methods is given below:

Reference	Input	Detection Target	Detection Technology	Behavioral Feature	Dataset	Indicators	Limitation
Yang et al. [10]	RGB image	Single fish	YOLO	Feeding angle of a single fish	Feeding angle dataset	mAP50/Accuracy	These methods are suitable for environments with a single feature, low noise levels, and a relatively stable background, enabling stable and effective identification of fish feeding behavior.
Zhao et al. [12]	RGB video	Fish head	ByteTrack	Feeding of single fish	Fish head dataset	MOTA/IDF1/Accuracy/Precision/Recall/F1-Score
Liu et al. [13]	RGB video	Individual fish	Pyramid Vision Transformer (PVT)	Swimming	3D-ZeF20 dataset from the MOT Challenge	MOTA/MOTP/IDF1/IDS/FM/MTBFm
Wu et al. [14]	RGB image	Individual fish	Image segmentation	Swimming	Fish body segmentation dataset	mAP50/Accuracy
Zhou et al. [18]	RGB video	Fish school	Delaunay triangulation	Fish school aggregation	None	Accuracy/Correlation coefficient R²	Single-feature quantization
Wei et al. [19]	RGB video	Fish school	MKEM network skeleton	Fish movement behavior	Fish feeding behavior dataset	Accuracy/Precision/Recall/F1-Score	Single-feature quantization
Zhao et al. [20]	RGB video	Fish school	Modified social force model/Kinetic energy model	Dispersion degree, interaction force and the changing magnitude of the water flow field	Fish feeding behavior dataset	Correlation coefficient R²	The potential of real-time appetite-based feeding for free-swimming fish, as opposed to feeding methods based on theoretical feeding rhythms.
Ye et al. [21]	RGB video	Fish school	Optical flow	Measurement of digesta index of stomach and bowel (DISB)	Fish feeding behavior dataset	Assessments of shoal activity based on entropy	Optical flow–based speed analysis is computationally intensive, limiting its use in real time on edge devices.
Hu et al. [25]	RGB image	Feed pellets	Improved YOLOv4	Feed feature	Feed pellets dataset	Accuracy/Precision/Recall/F1-Score	Uneaten feed detection offers limited insight into feeding status, as pellet size and inter-individual variability substantially affect recognition accuracy in practical aquaculture.
Zhou et al. [26]	RGB image	Feed pellets	YOLOv5 + fuzzy neural network model	Feed feature	Feed pellets dataset	Accuracy/Precision/Recall/F1-Score
Cai et al. [27]	RGB image	Individual fish	YOLOv3 + MobileNetv1	Feed feature	Fish dataset	mAP50/Accuracy/Precision/Recall/F1-Score

In the early stages of feeding, the stronger the feeding aggregation, the more intense the fish school’s feeding desire; aggregation therefore serves as the initial indicator and starting point for assessing the fish school’s feeding motivation. However, as feeding proceeds, the feed pellets disperse due to the fish school’s competitive feeding behavior and the effects of water flow, and the fish school starts to scatter. At this point, feeding aggregation can no longer quantify feeding intensity. Short-term hunger prior to feeding increases the fish school’s swimming speed, so the average swimming speed of the school can be used to assess its feeding motivation. During feeding, competition among individual fish for food and aggressive feeding behavior increase the average swimming speed of the fish school. Therefore, feeding intensity is also linked to fish school feeding speed. Tracking individual fish and quantifying swimming speed are key to further evaluating fish school feeding behavior [28,29,30,31]. Research on fish tracking and swimming speed quantification encompasses both 2D [12] and 3D methods [13]; 2D methods are more practical for real-time detection of fast-moving fish targets and have lower deployment costs in aquaculture.

Therefore, based on the above review, we propose a quantitative method to characterize fish school feeding intensity using two distinct indicators: the average swimming speed of the fish school and the feeding aggregation speed. The main contributions are as follows:

(1): The proposed YOLOv11n-ALL model enables robust fish-head detection in low-density aquaculture environments characterized by target-scale variation. By integrating the C3k2_EfficientViM feature extraction module and the lightweight BiMAFPN neck structure, the model effectively captures global feature dependencies while reducing parameter redundancy.
(2): The developed lightweight detection framework supports accurate identification of individual fish-head targets in aquaculture video streams under different feeding conditions. The proposed method achieves a fish-head detection precision of 90.10% and an mAP@0.50 of 94.13%, while reducing model parameters by 22.09%, thereby maintaining high detection accuracy with improved computational efficiency.
(3): The proposed fish-school feeding-intensity evaluation system combines YOLOv11n-ALL with the ByteTrack algorithm to perform multi-fish tracking and quantify feeding behavior using average swimming speed and feeding aggregation speed. The system enables quantitative classification of fish feeding intensity in low-density aquaculture environments, achieving an accuracy of 97.41%, a false positive rate of 1.78%, and a false negative rate of 2.32%, with high consistency for feeding-behavior characterization and feeding-demand assessment.

2. Materials and Methods

2.1. Aquaculture Environment

All experiments were conducted at the Yangdu base in Haining, Zhejiang Province, China (as shown in Figure 1). A total of 440 California bass (Largemouth bass) were evenly distributed among eight circular culture tanks, each with a diameter of 3 m. Fish were approximately 25 g in weight and 10–15 cm in length. The feed (2–3 mm in diameter) was sourced from Sichuan Tongwei Feed Co., Ltd. (Chengdu, China). Environmental parameters were maintained as follows: dissolved oxygen 7.0–8.0 mg/L, pH 6.8–8.0, water temperature 20–23 °C, nitrite and ammonia nitrogen both below 0.1 mg/L. Fish were fed twice daily at 08:00 and 16:00, with a feeding rate of 2.0% of total body weight per meal.

2.2. Image Acquisition and Data Labeling

2.2.1. Fish Feeding Behavior Analysis

During fixed-point feeding, the feeding features of fish schools vary with factors such as hunger level and intraspecific food competition [4,23,32,33,34]. Direct characterization of these behaviors is complex, as it requires careful consideration of both the features of individual fish and the internal dynamic characteristics of fish schools. The aggregation behavior of fish schools can reflect feeding intensity to a certain extent: the stronger the feeding aggregation, the higher the corresponding feeding intensity, whereas a more dispersed distribution of the fish school indicates a lower feeding intensity. However, aggregation alone cannot reliably determine feeding intensity, as this behavior can also occur in non-feeding contexts. Feeding intensity is also linked to fish school feeding speed. For instance, pre-feeding hunger elevates fish swimming speed and school aggregation (Figure 2a–c); food competition during feeding intensifies feeding splashes and aggregation (Figure 2d–f); and after feeding, swimming speed decreases and the school disperses (Figure 2g–i). Therefore, quantifying fish school feeding intensity requires simultaneous consideration of both fish school feeding aggregation and swimming speed.

2.2.2. Image Acquisition

The image acquisition system consists of a high-definition camera (MSHiWi, MS-SUA133GC, 1920 × 1080, 30 fps) and a personal computer. The camera was mounted on a triangular bracket 1 m above each fish tank and tilted at 30° from the horizontal to minimize interference from water surface reflections (Figure 3a). Since refraction and reflection at the air–water interface can cause geometric distortion of captured images, camera calibration was performed prior to formal data acquisition. Specifically, images of a checkerboard calibration board were captured at multiple positions and orientations within the camera’s field of view, and the intrinsic and extrinsic parameters of the camera and the lens distortion coefficients were solved accordingly. Four reference points with known physical distances on the water surface were then selected, and a planar homography matrix was estimated to transform image coordinates into real-world coordinates. The resulting scale is 512 pixels per meter. The average reprojection error after homography calibration was 4.3 pixels, corresponding to approximately 2 cm in the real water-surface plane, indicating that the spatial transformation accuracy is sufficient for fish motion analysis.
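The mapping from image coordinates to water-surface coordinates can be sketched as follows. This is a minimal illustration, assuming a 3 × 3 homography matrix has already been estimated; the matrix used here is hypothetical (a pure scaling that mirrors the reported 512 pixels per meter), and the name `pixel_to_world` is illustrative rather than taken from the study.

```python
def pixel_to_world(H, u, v):
    """Map an image point (u, v) to plane coordinates using a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to obtain plane coordinates.
    return x / w, y / w


# Hypothetical homography: uniform scaling only, 512 pixels per meter.
PIXELS_PER_METER = 512.0
H_SCALE = [
    [1.0 / PIXELS_PER_METER, 0.0, 0.0],
    [0.0, 1.0 / PIXELS_PER_METER, 0.0],
    [0.0, 0.0, 1.0],
]

# A pixel at (512, 256) lies 1.0 m and 0.5 m from the origin under this scaling.
x_m, y_m = pixel_to_world(H_SCALE, 512.0, 256.0)
```

A real calibration would generally produce a homography with perspective terms; the scaling-only matrix is used here only to keep the example easy to verify by hand.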

The acquisition system was activated 5 min before feeding and stopped 5 min after obvious feeding splashes were no longer observed on the water surface, ensuring complete coverage of the entire feeding process. The collected feeding videos were divided into two subsets: one was used as the test dataset to quantify fish swimming speed and the feeding aggregation speed index, and the other was used for model training (Figure 3b). The dataset was constructed from fish school feeding frames extracted from the video recordings. To reduce image redundancy, one frame was extracted every 5 frames from the 30 fps videos, and the resulting fish school images were saved as JPGs in RGB color space.
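The frame-sampling step can be illustrated with a short sketch. It assumes sampling starts at the first frame (index 0), which the text does not specify; the function name is hypothetical.

```python
def sampled_frame_indices(total_frames, step=5):
    """Return indices of the frames kept when one frame is taken every `step` frames."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, total_frames, step))


FPS = 30   # camera frame rate stated in the text
STEP = 5   # one frame kept out of every five

# Two seconds of video (60 frames) yield 12 sampled frames.
kept = sampled_frame_indices(total_frames=FPS * 2, step=STEP)

# Effective sampling rate of the extracted image sequence, in frames per second.
effective_fps = FPS / STEP
```

Under these assumptions the extracted images correspond to an effective rate of 6 frames per second.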

2.2.3. Image Labeling and Dataset Construction

Representative feeding images of fish schools were extracted, and selected images covered three typical imaging states: normal, occluded, and deformed, corresponding to Figure 2a, Figure 2d and Figure 2g, respectively. We adopted fish head labeling rather than whole-fish labeling during annotation, as shown in Figure 3b. Compared with other body parts, fish heads are less susceptible to deformation, thereby effectively improving object detection performance in crowded scenes [35]. The established fish head annotation dataset contained 15,000 images, which were randomly split into training, validation, and test sets in an 8:1:1 ratio. To verify the performance of the proposed multi-fish tracking algorithm, we used DarkLabel to build a multi-fish tracking dataset covering the full feeding process for evaluating fish school tracking and feeding intensity assessment, as detailed in Table 1.
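The 8:1:1 random split can be sketched as below. The file names and the random seed are hypothetical placeholders; the study does not state how the shuffle was seeded or implemented.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/validation/test parts by integer ratios."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical image names standing in for the 15,000 annotated images.
image_ids = [f"img_{i:05d}.jpg" for i in range(15000)]
train, val, test = split_dataset(image_ids)
```

With 15,000 images this yields 12,000 training, 1,500 validation, and 1,500 test images.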

2.3. Overall Workflow of the Proposed Method

This study introduces a method for quantifying fish feeding intensity by detecting and extracting the fish head center points and using the ByteTrack algorithm to estimate the average swimming speed of the fish school and the feeding aggregation speed. This approach differs from existing studies that classify feeding intensity or detect feed pellets to assess fish feeding demand. By detecting key head information and computing the dual indices of feeding aggregation speed and average swimming speed, the method quantifies feeding kinetic energy, thereby enabling the assessment and classification of fish school feeding intensity as a whole.

The overall workflow of this study is presented in Figure 4. First, the improved YOLOv11n-ALL model is applied to detect fish heads and extract the head center of each individual fish. Next, the ByteTrack algorithm is used to calculate the displacement of the center points between adjacent frames. Finally, the average swimming speed and the feeding aggregation speed of the fish school are computed using the center points of the fish heads. The calculated results are then used to quantify the fish school’s feeding kinetic energy and assess its feeding intensity, enabling digitalization and rapid classification of the feeding intensity.

2.4. Target Detection for Individual Fish Head Parts

The backbone of the YOLOv11 model is built on CSPDarkNet. YOLOv11 replaces the C2f [36] and C3 [37] modules with the C3k2 and C2PSA modules [38,39] to achieve a lightweight model. The neck structure uses PANet [40] for feature fusion, improving information flow and aggregation. The head includes three output branches, each divided into classification and regression.

To enhance the inference efficiency and detection accuracy of YOLOv11n, this study focuses on optimizing feature extraction and achieving greater model lightweighting. Efficient Vision Mamba (EfficientViM) [41] is integrated into the C3k2 module, improving global dependency capture while reducing computational cost and improving the ability to extract individual fish head features in aquatic environments. To optimize the neck structure, the original neck network is replaced with a combination of the weighted bidirectional feature pyramid network (BiFPN) [42] and multi-branch auxiliary feature pyramid network (MAFPN) [43], enabling cross-stage fusion and multi-directional connections. This improves the flow of gradient information and enhances the detection of small objects, enabling the model to more effectively extract and integrate information on fish head characteristics from different output layers. The improved YOLOv11n-ALL network architecture is shown in Figure 5.

2.4.1. Improved C3k2 Module Integrating Efficient Vision Mamba

The EfficientViM module, shown in Figure 6a, is built upon the HSM-SSD layer, forming an efficient visual Mamba architecture. Each EfficientViM block sequentially stacks a feed-forward network (FFN) and an HSM-SSD layer to enable channel interaction and global information aggregation. The FFN comprises two consecutive 1 × 1 convolutional layers, or pointwise convolutions, and an expansion ratio set to 4, as illustrated in Figure 6b. The 3 × 3 DWConv is added before both the FFN and HSM-SSD layers. Each layer integrates with the subsequent layer via residual connections.

The HSM-SSD module, as depicted in Figure 6c, is an enhancement of the NC-SSD module (Figure 6d), aimed at reducing computational parameters and output costs. The NC-SSD process can be calculated as follows:

\hat{B}, C, Δ, x, z = \mathrm{Linear}(x_{in})

(1)

a, B = \mathrm{Discretization}(\hat{a}, \hat{B}, Δ)

(2)

B, C, x = \mathrm{DWConv}(B, C, x)

(3)

y = \mathrm{NC\text{-}SSD}(x, a, B, C)

(4)

x_{out} = \mathrm{Linear}(y ⊙ σ(z))

(5)

Here, z ∈ R^{L×D} is the gating tensor, σ denotes the activation function, and x_in and x_out represent the input and output tensors of the NC-SSD module. The computational cost of Formulas (1)–(3) with a fixed kernel size is O(LD² + LND); the subsequent NC-SSD execution and output projection require O(LND) and O(LD²), respectively. Assuming N << D (where L is the sequence length, N is the number of states, and D is the number of channels), the linear projections generating x, z, and x_out dominate, resulting in an overall complexity of O(LD²).

For this reason, optimizing the SSD block is of paramount importance. We first perform a linear projection onto the hidden state space by computing h_in, which decreases the state-dependent computational overhead from O(LD²) to O(ND²). Thus, the primary computational burden of this layer can be mitigated by setting the states to N << L.
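To make the scale of this reduction concrete, the sketch below counts multiply-accumulate operations for a D × D linear map applied to L tokens versus N hidden states. The values of L, D, and N are hypothetical and chosen only for illustration; they are not the settings used in the study.

```python
def projection_cost(rows, channels):
    """Multiply-accumulate count for applying a channels x channels linear map to `rows` vectors."""
    return rows * channels * channels


# Hypothetical sizes: a 20 x 20 feature map flattened to L = 400 tokens,
# D = 256 channels, and N = 16 hidden states.
L, D, N = 400, 256, 16

cost_tokens = projection_cost(L, D)   # O(L * D^2): projection applied per token
cost_states = projection_cost(N, D)   # O(N * D^2): projection applied per hidden state

# The ratio equals L / N, so the saving grows as N becomes small relative to L.
reduction = cost_tokens / cost_states
```

For these illustrative sizes the per-state projection needs 25 times fewer operations than the per-token projection.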

The next step is to reduce the computational cost related to O(LD²) in Equation (5). This optimization is achieved by focusing on the shared global hidden state h, which serves as a low-dimensional latent representation that condenses the original long input sequence into a concise sequence of length N. Therefore, a hidden state mixer (HSM) is proposed to perform channel mixing operations on h, as shown in Figure 6c. Accordingly, the output of the NC-SSD layer can be approximated as

\begin{aligned} x_{out} &= f(y) = \mathrm{Linear}(y ⊙ σ(z)) \\ &= (C h ⊙ σ(x_{in} W_{z})) W_{out} \\ &\approx C (h ⊙ σ(h_{in} W_{z})) W_{out} = C f(h) \end{aligned}

(6)

Here, y = Ch, f denotes the gating and channel-mixing function, and W_z, W_out ∈ R^{D×D} are learnable matrices. The HSM module directly performs gating and projection operations on the hidden state. The hidden state is then updated with the C projection to generate the final output x_out. Thus, the total complexity of capturing global context is O(ND² + LND), which becomes negligible when N is small. This further eliminates all tensor operations introduced by the multi-head mechanism (e.g., reshaping and replication operations). Additionally, to mimic the multi-head mechanism’s ability to capture multiple relationships, we set Δ ∈ R^{L×N} and â ∈ R^N to estimate importance weights, thereby assessing the significance of each state. Consequently, the hidden state mixer input is represented as follows:

h_{in} = (A ⊙ B)^{\top} x_{in}

(7)
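The shapes involved in Equations (6) and (7) can be traced with a small plain-Python sketch. It is illustrative only: it assumes A, B, C ∈ R^{L×N}, uses a sigmoid as the activation σ, and takes h = h_in (omitting any input projection), none of which is confirmed by the text. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def hadamard(a, b):
    """Element-wise product of two same-shaped matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sigmoid(m):
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in m]


def hsm_ssd(x_in, A, B, C, W_z, W_out):
    """x_in: L x D; A, B, C: L x N; W_z, W_out: D x D. Returns an L x D output."""
    h_in = matmul(transpose(hadamard(A, B)), x_in)      # N x D, Equation (7)
    gated = hadamard(h_in, sigmoid(matmul(h_in, W_z)))  # h ⊙ σ(h_in W_z), with h = h_in (assumption)
    h = matmul(gated, W_out)                            # f(h), still N x D
    return matmul(C, h)                                 # project back to L x D


# Tiny example with L = 3 tokens, N = 2 states, D = 2 channels.
x_in = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_z = [[0.0, 0.0], [0.0, 0.0]]    # zero gate weights, so σ(.) = 0.5 everywhere
W_out = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection

out = hsm_ssd(x_in, A, B, C, W_z, W_out)
```

The gating and projection act on the N × D hidden state rather than on the L × D sequence, which is where the cost saving described above comes from.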

The EfficientViMBlock is embedded into the C3k2 module in two ways: when C3k is set to True, the inner block is replaced by C3k_EfficientViM; when C3k is set to False, the EfficientViMBlock replaces the Bottleneck. This effectively enhances the network’s feature extraction capability. Additionally, it combines semantic information across multiple scales within the attention receptive field, while reducing redundant parameters and lowering computational costs, as illustrated in Figure 6e.

Conventional CNN-based local receptive fields may struggle to capture long-range dependencies among fragmented fish head features. The HSM-SSD mechanism enhances global contextual modeling, enabling the network to establish spatial associations among dispersed fish head features across turbid aquatic backgrounds. This improves the representation ability for partially occluded and blurred targets.

2.4.2. Improving Neck Network Connectivity with BiMAFPN Networks

The path aggregation feature pyramid network (PAFPN), commonly used in YOLO detectors, optimizes feature fusion to enhance detection accuracy while balancing computational cost. However, it still has limitations in adaptively integrating low-level spatial information and high-level semantics. First, the PAFPN neck network primarily aggregates same-scale features and lacks cross-scale fusion. For example, Block 1 uses only upsampled P5 and corresponding P4 features, ignoring low-level details from P3, while Block 2 omits P2 features critical for small objects, a limitation that persists in later modules. Second, relying on a single top–down path and two basic modules for small object detection limits the model’s ability to capture and learn small object features, as shown in Figure 7a.

In the MAFPN neck, feature maps from P2–P5 serve as input. The B module in the bottom–up path extracts multi-scale features and performs preliminary shallow-layer fusion. In contrast, the C module in the top–down path gathers gradients via dense connections, providing the head with diverse multi-resolution inputs. Adaptive fusion of high-level and low-level features is achieved (Figure 7b).

Therefore, based on the MAFPN neck network structure, we eliminate nodes with a single input edge and introduce feature network layers at the same level between input and output nodes to enhance feature fusion without significantly increasing computational cost. Unlike the PAFPN, which has a single top–down and bottom–up path, we combine the BiFPN and MAFPN (BiMAFPN), treating each bidirectional path as a feature network layer. This layer is repeated multiple times for advanced feature fusion, as shown in Figure 7c. In the feature fusion process, traditional methods resize feature maps to the same size before addition. Prior works, such as the pyramid attention network (PAN), propose global self-attention upsampling to improve pixel localization. However, these methods treat all input features equally, whereas we observe significant variations in feature contributions across resolutions. To address this, we introduce learnable adaptive weights for each input feature, allowing the network to automatically assess their importance. We employ a fast normalization fusion method [42] as the weighted fusion technique, as detailed in Equation (8).

Ψ = \sum_{i} \frac{w_{i}}{ε + \sum_{j} w_{j}} \cdot L_{i}

(8)

To keep the weight w_i non-negative (w_i ≥ 0), we apply the ReLU activation function as a constraint and introduce a small value ε = 0.0001 to enhance numerical stability, where L_i represents the i-th layer. Ultimately, the BiMAFPN adopts a fast normalization fusion strategy and incorporates bidirectional cross-scale connection, with the feature fusion at the 6th layer described by Equations (9) and (10).

P6^{td} = \mathrm{Block}\left(\frac{w_{1} \cdot P6^{in} + w_{2} \cdot \mathrm{Resize}(P10^{td}) + w_{3} \cdot \mathrm{Resize}(P4^{in})}{w_{1} + w_{2} + w_{3} + ε}\right)

(9)

P6^{out} = \mathrm{Block}\left(\frac{w'_{1} \cdot P6^{td} + w'_{4} \cdot \mathrm{Resize}(P4^{td}) + w'_{2} \cdot \mathrm{Resize}(P10^{td}) + w'_{3} \cdot \mathrm{Resize}(P4^{out})}{w'_{1} + w'_{2} + w'_{3} + w'_{4} + ε}\right)

(10)

Here, P6^td represents the intermediate features at the 6th layer in the top–down path, and P6^out denotes the output features at the 6th layer in the bottom–up path. The “Resize” operation adjusts the resolution of feature maps, including upsampling or downsampling, to ensure that features at different levels are dimensionally aligned. Additionally, “Block” typically refers to a convolutional module used for feature extraction and transformation. In this study, we employ the C3k2_EfficientViM structure to implement this module’s functionality.
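The fast normalization fusion of Equation (8) can be sketched in a few lines. This is a simplified illustration in which each feature map is a flat list of numbers that has already been resized to a common length; the function name and example values are hypothetical, and the subsequent “Block” transformation is omitted.

```python
def fast_normalized_fusion(weights, features, eps=1e-4):
    """Fuse same-length feature lists using ReLU-clipped, sum-normalized weights."""
    if len(weights) != len(features):
        raise ValueError("one weight is required per feature map")
    clipped = [max(0.0, w) for w in weights]  # ReLU keeps every weight non-negative
    denom = eps + sum(clipped)                # eps guards against division by zero
    length = len(features[0])
    fused = [0.0] * length
    for w, feat in zip(clipped, features):
        if len(feat) != length:
            raise ValueError("features must be resized to the same length first")
        for k, value in enumerate(feat):
            fused[k] += (w / denom) * value
    return fused


# Three inputs; the negative weight is clipped to zero and contributes nothing.
weights = [1.0, 1.0, -0.5]
features = [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]
fused = fast_normalized_fusion(weights, features)
```

Because the normalized weights sum to slightly less than one (owing to ε), the fused values are marginally below the plain weighted average.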

2.5. Multi-Object Tracking (MOT) Algorithm

This study follows the tracking-by-detection framework, with ByteTrack [44] selected as the core tracking algorithm. As a classic method under the tracking-by-detection paradigm, ByteTrack operates in two main steps. First, the detector generates bounding boxes along with their associated confidence values, and the detection boxes are categorized according to their confidence scores: boxes with scores higher than T_high are assigned to the high-confidence set, while those with scores between T_low and T_high are assigned to the low-confidence set. Second, the detection boxes in the high-confidence set are matched and associated with the existing tracking trajectories, after which the low-confidence bounding boxes are associated with the remaining unassigned trajectories to reduce identity switches. This operation retains valid low-confidence detections while filtering out background interference, and also enables robust trajectory re-association when targets reappear after occlusion. As a result, the identity of an individual fish can be recovered after short-term occlusion, the frequency of ID switches decreases significantly, and the tracking robustness of the proposed algorithm for fish targets improves.
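The confidence-based partition of detections can be sketched as follows. The threshold values and the handling of scores exactly equal to a threshold are assumptions made for illustration; the text does not state the values used, and the dictionary layout of a detection is hypothetical.

```python
def split_by_confidence(detections, t_high=0.6, t_low=0.1):
    """Partition detections into high- and low-confidence sets; discard the rest."""
    high, low = [], []
    for det in detections:
        score = det["score"]
        if score > t_high:
            high.append(det)     # matched against trajectories first
        elif score >= t_low:
            low.append(det)      # matched only against still-unassigned trajectories
        # Detections below t_low are treated as background and dropped.
    return high, low


# Hypothetical fish-head detections: (x1, y1, x2, y2) boxes with confidence scores.
detections = [
    {"id": "a", "box": (10, 10, 40, 40), "score": 0.90},
    {"id": "b", "box": (50, 50, 80, 80), "score": 0.40},
    {"id": "c", "box": (90, 90, 99, 99), "score": 0.05},
]
high, low = split_by_confidence(detections)
```

Keeping the low-confidence set rather than discarding it is what lets a partially occluded fish head remain attached to its existing trajectory.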

To achieve accurate data association, ByteTrack first estimates and predicts the motion states of fish targets using a Kalman filter, then generates corresponding predicted bounding boxes for existing trajectories. It then calculates the similarity between the detection boxes and the predicted bounding boxes during the tracking and matching pipeline. Based on the precomputed similarity matrix, the Hungarian algorithm is then applied to perform the final data association matching between detections and trajectories.
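The association step can be illustrated with a small sketch. Intersection over union (IoU) is assumed here as the similarity measure, and an exhaustive search over permutations is used as a stand-in for the Hungarian algorithm; it finds the same optimal assignment for a square matrix but is only practical for a handful of targets. The Kalman prediction is not modeled, and all names and box values are hypothetical.

```python
from itertools import permutations


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def similarity_matrix(predicted, detected):
    """Rows: predicted trajectory boxes; columns: detection boxes."""
    return [[iou(p, d) for d in detected] for p in predicted]


def best_assignment(sim):
    """Exhaustively find the one-to-one matching with maximal total similarity (square matrix)."""
    n = len(sim)
    best_perm, best_score = None, -1.0
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(enumerate(best_perm))


# Predicted boxes for two trajectories and two detections listed in swapped order.
predicted = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
detected = [(21.0, 21.0, 31.0, 31.0), (1.0, 1.0, 11.0, 11.0)]

sim = similarity_matrix(predicted, detected)
matches = best_assignment(sim)  # list of (trajectory index, detection index) pairs
```

In a full tracker, matches whose similarity falls below a threshold would typically be rejected before trajectories are updated.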

This framework achieves robust, efficient multi-target tracking of fish in complex scenes while maintaining high computational efficiency.
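The two-pass association described above can be illustrated with a minimal sketch. It is not the ByteTrack implementation: the threshold values (`t_high`, `t_low`, `min_iou`) and all function names are assumptions chosen for illustration, the predicted boxes are taken as given rather than produced by a Kalman filter, and a greedy IoU matcher is used in place of the Hungarian algorithm to keep the example dependency-free.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, min_iou):
    """Greedy IoU matching (a simple stand-in for the Hungarian algorithm).

    track_boxes: dict mapping track id -> predicted box
    det_boxes:   dict mapping detection index -> detected box
    Returns a dict mapping track id -> detection index.
    """
    candidates = sorted(
        ((iou(tb, db), tid, did)
         for tid, tb in track_boxes.items()
         for did, db in det_boxes.items()),
        key=lambda c: c[0],
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, tid, did in candidates:
        if score < min_iou:
            break  # all remaining candidates overlap even less
        if tid in matches or did in used_dets:
            continue
        matches[tid] = did
        used_dets.add(did)
    return matches


def two_stage_associate(predicted, detections, t_high=0.5, t_low=0.1, min_iou=0.3):
    """Associate tracks with detections in two passes.

    predicted:  dict mapping track id -> box predicted by the motion model
    detections: list of (box, confidence) tuples from the detector
    Threshold defaults are illustrative assumptions, not values from the paper.
    """
    high = {i: box for i, (box, s) in enumerate(detections) if s > t_high}
    low = {i: box for i, (box, s) in enumerate(detections) if t_low <= s <= t_high}

    # Pass 1: high-confidence detections against all existing tracks.
    matches = greedy_match(predicted, high, min_iou)

    # Pass 2: low-confidence detections against the tracks still unmatched.
    remaining = {tid: box for tid, box in predicted.items() if tid not in matches}
    matches.update(greedy_match(remaining, low, min_iou))

    unmatched_tracks = set(predicted) - set(matches)
    return matches, unmatched_tracks
```

In this sketch a partially occluded fish head detected with low confidence can still keep its track identity in the second pass, whereas detections below `t_low` are discarded. Creation of new tracks from unmatched high-confidence detections and deletion of stale tracks are omitted.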

2.6. Assessment of Fish School Feeding Intensity via Detection of Average Swimming Speed and Feeding Aggregation Speed

Fish feeding speed and feeding aggregation [45,46] are key indicators of behavior and of responses to environmental factors, stress, and hunger. Hungry (relatively underfed) fish exhibit faster feeding speed and stronger aggregation than satiated fish. The preceding sections described how the detection boxes for individual fish heads are obtained. In this section, the YOLOv11n-ALL + ByteTrack algorithm is used to extract the center points of these detection boxes as nodes, from which the average swimming speed and the feeding aggregation speed are computed and then used to describe the feeding intensity of the fish school. The overall framework based on YOLOv11n-ALL + ByteTrack tracking is shown in Figure 8a.

2.6.1. Detection of Average Swimming Speed and Feeding Aggregation Speed

Fish school average swimming speed estimation: This paper proposes using the center of the fish’s head as a key point to represent fish movement. Assuming that the fish head prediction boxes obtained in the previous sections are in a two-dimensional plane, the displacement calculation is first performed. For discrete position data, the speed can be computed using Equations (11)–(13):

v = \frac{\sqrt{{(x_{i + 1} - x_{i})}^{2} + {(y_{i + 1} - y_{i})}^{2}}}{t_{i + 1} - t_{i}}

(11)

t_{i + 1} - t_{i} = \frac{1}{F P S}

(12)

\bar{v} = \frac{1}{n} \sum_{i = 1}^{n} v_{i}

(13)

Here, (x_{i+1}, y_{i+1}) and (x_i, y_i) represent the coordinates of the individual fish head center at times t_{i+1} and t_i, respectively; t_{i+1} − t_i denotes the time difference between the two frames, which is the reciprocal of the video frame rate (FPS); and \bar{v} denotes the average swimming speed of the fish school, obtained by averaging the speeds of the n tracked fish, as shown in Figure 8a. For each tracked fish, the method calculates the displacement of the head center between consecutive frames (frame i and frame i + 1) and divides it by the frame interval. The individual speeds are then averaged to obtain the fish school’s average swimming speed.
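Equations (11)–(13) can be sketched in a few lines of Python. The function name and the dictionary layout (track ID mapped to head center) are assumptions made for illustration. The result is in pixels per second unless the coordinates have first been converted to metres with a calibration factor, which this sketch does not cover.

```python
import math


def average_swimming_speed(prev_centers, curr_centers, fps):
    """Mean speed of the fish tracked in two consecutive frames.

    prev_centers, curr_centers: dict mapping track id -> (x, y) head center
    fps: video frame rate, so the frame interval is 1 / fps (Eq. 12)
    """
    dt = 1.0 / fps
    speeds = []
    for track_id, (x1, y1) in curr_centers.items():
        if track_id not in prev_centers:
            continue  # the fish must be tracked in both frames
        x0, y0 = prev_centers[track_id]
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)  # Eq. (11)
    # Eq. (13): average over the n fish with a valid speed.
    return sum(speeds) / len(speeds) if speeds else 0.0
```

Only track IDs present in both frames contribute, so a fish that is lost or newly detected in the current frame does not produce a spurious displacement.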

Fish school feeding aggregation speed assessment: Feeding aggregation reflects the clustering behavior and intensity of the fish school. Therefore, we propose a novel method to simplify the calculation and quantification of fish school feeding aggregation. The coordinates of the individual fish head detection box center (x_i, y_i) are extracted from sequential images using fish movement tracking. The fish school movement center coordinates (X, Y) are constructed. Finally, the average distance from the individual fish head detection box center coordinates (x_i, y_i) to the fish school movement center coordinates (X, Y) is computed, as shown in Figure 8a. The calculation and quantification of the feeding aggregation d and feeding aggregation speed v_d are given by Equations (14)–(17).

X = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

(14)

Y = \frac{1}{n} \sum_{i = 1}^{n} y_{i}

(15)

d = \frac{1}{n} \sum_{i = 1}^{n} \sqrt{{(x_{i} - X)}^{2} + {(y_{i} - Y)}^{2}}

(16)

v_{d} = \frac{d_{i + 1} - d_{i}}{t_{i + 1} - t_{i}}

(17)

The proposed indices are thus based on the average distance from the individual fish to the fish school movement center and on the rate of change in this distance across consecutive frames (frame i and frame i + 1). The former gives the degree of fish school feeding aggregation d, and the latter gives the feeding aggregation speed index v_d.
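A minimal sketch of Equations (14)–(17) follows. The function names are illustrative assumptions, and the input is assumed to be a list of head-center coordinates per frame.

```python
import math


def aggregation_degree(centers):
    """Mean distance from the head centers to the school center, Eqs. (14)-(16)."""
    n = len(centers)
    if n == 0:
        return 0.0
    cx = sum(x for x, _ in centers) / n  # Eq. (14)
    cy = sum(y for _, y in centers) / n  # Eq. (15)
    return sum(math.hypot(x - cx, y - cy) for x, y in centers) / n  # Eq. (16)


def aggregation_speed(prev_centers, curr_centers, fps):
    """Rate of change of the aggregation degree between two frames, Eq. (17)."""
    return (aggregation_degree(curr_centers) - aggregation_degree(prev_centers)) * fps
```

Under this definition a smaller d means a tighter school, so v_d is negative while the fish converge and positive while they disperse. The sign does not affect the kinetic energy in Equation (19), where v_d is squared.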

2.6.2. Assessment of Fish School Feeding Intensity

In the early stages of feeding, the stronger the feeding aggregation, the more intense the fish school’s feeding desire, which serves as the initial indicator and feeding point for assessing the fish school’s feeding motivation. However, as the feeding process begins, the feed pellets disperse due to the fish school’s competitive feeding behavior and the effects of water flow, and the fish school starts to scatter. At this point, the feeding aggregation cannot continue to quantify feeding intensity. Short-term hunger prior to feeding increases the fish school’s swimming speed, and the average swimming speed of the school can be used to assess its feeding motivation. During feeding, competition among individual fish for food and aggressive feeding behavior increase the average swimming speed of the fish school. Changes in the feeding aggregation and average swimming speed of the fish school serve as the most direct indicators of the fish school’s feeding behavior.

The quantification of fish school feeding intensity is related to the movement of the fish school. During the non-feeding phase, the changes in the average movement speed and the degree of feeding aggregation in the fish school are relatively small. However, during the feeding phase, both the average movement speed and feeding aggregation degree exhibit significant changes. The present study estimates the fish school’s feeding kinetic energy (E_k) using the dual indicators of average swimming speed and feeding aggregation speed to describe feeding intensity, as shown in Figure 8b.

Based on the formula for kinetic energy in motion:

E = \frac{1}{2} m v^{2}

(18)

It is known that kinetic energy is primarily related to mass and velocity; the velocity of a fish school can be broken down into feeding aggregation velocity and average velocity; mass, in turn, is related to the individual fish detected. Therefore, the feeding kinetic energy of a fish school is expressed as:

E = \frac{1}{2} \sum_{i = 1}^{n} m \{{(\bar{v})}^{2} + {(v_{d})}^{2}\}

(19)

where \sum_{i = 1}^{n} m represents the total mass of the detected fish school, kg; \bar{v} represents the average swimming speed of the fish school, m/s; v_{d} denotes the feeding aggregation speed of the fish school, m/s; n represents the number of fish individuals detected; and E represents the total feeding kinetic energy of the detected fish school, J (joule).

To improve the comparability of results across different video clips and under conditions involving varying numbers of tracked fish, we further processed the kinetic metrics by standardizing them based on the number of effectively tracked individuals. The individual fish weight is assumed to be constant at approximately 0.025 kg, so the total mass in Equation (19) reduces to nm:

E = \frac{1}{2} n m \{{(\bar{v})}^{2} + {(v_{d})}^{2}\}

(20)

Finally, the dual-index fish feeding kinetic model is normalized, as shown in Equation (21). The range of E_k is [0, 1].

E_{k} = \frac{E - E_{\min}}{E_{\max} - E_{\min}}

(21)
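Equations (20) and (21) can be sketched as follows. The function names and the per-frame input values in the usage example are hypothetical. The default mass of 0.025 kg is the constant individual weight stated in the text.

```python
def feeding_kinetic_energy(n, v_mean, v_d, mass=0.025):
    """Feeding kinetic energy of the school for one frame, Eq. (20).

    n:      number of tracked fish
    v_mean: average swimming speed (m/s)
    v_d:    feeding aggregation speed (m/s)
    mass:   assumed constant individual mass (kg)
    """
    return 0.5 * n * mass * (v_mean ** 2 + v_d ** 2)


def min_max_normalize(energies):
    """Map a sequence of energies to [0, 1], Eq. (21)."""
    e_min, e_max = min(energies), max(energies)
    if e_max == e_min:
        return [0.0 for _ in energies]  # degenerate case: no variation
    return [(e - e_min) / (e_max - e_min) for e in energies]


# Hypothetical per-frame values: (n, average speed, aggregation speed).
frames = [(50, 0.1, 0.0), (48, 0.5, 0.2), (52, 0.3, -0.1)]
e_k = min_max_normalize([feeding_kinetic_energy(*f) for f in frames])
```

Because the mass is a constant factor in Equation (20), it cancels in the min-max normalization. In this sketch, changing `mass` therefore leaves E_k unchanged as long as the same value is used for every frame.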

Sensitivity Analysis: We varied the assumed fish body weight within a reasonable biological range and artificially introduced different levels of false-negative rates to simulate the detection errors that might occur during actual tracking. The results are shown in Table 2.

The results indicate that, even when fish body weight and the rate of missed observations fluctuate moderately, the feeding intensity (E_k) exhibits only minor variations, suggesting that the proposed normalized fish school kinetic index demonstrates good stability in the analysis of fish feeding behavior.

2.7. Experimental Environment and Evaluation Metrics

Fish head detection and extraction of the center coordinates of the detection boxes were implemented in Python. The indices of average swimming speed and feeding aggregation speed were then constructed and used to calculate and quantify fish school feeding intensity.

The computer system comprises Windows 11 Professional, an NVIDIA RTX 2080Ti GPU, and an Intel Xeon Gold 6226 (11th Gen) CPU. PyTorch 2.0.1 framework and Python 3.8.19 were used for model development. Model training parameters are as follows: 400 epochs, SGD optimizer, batch size of 16, learning rate of 0.01 held constant throughout training, momentum of 0.937, and weight decay of 0.0005, with CUDA 11.7 acceleration. Table 3 lists the evaluation metrics for the detection model.

3. Results

3.1. Performance Comparison of Object Detection Models

3.1.1. Object Detection Results of Individual Fish Head Parts

Compared with the baseline YOLOv11n model, our improved YOLOv11n-ALL (YOLOv11n-C3k2_EfficientViM + BiMAFPN, where BiMAFPN denotes the fusion of BiFPN and MAFPN) model achieved better detection performance. The box plots, precision, mAP50, and PR curves for the improved and baseline models on the validation dataset are shown in Figure 9. The improved model consistently yields lower loss than the baseline model, and its precision, mAP50, and precision–recall curve are all superior, indicating that it effectively reduces both missed detections and false detections, thereby enhancing detection accuracy.

To validate the performance contribution of each modified module, we performed a systematic ablation test. The ablation test was conducted ten times under identical environmental conditions, and the stability of the resulting performance metrics was analyzed based on the average test results. The corresponding results are presented in Table 4 and Figure 9. When the C3k2-EfficientViM module and the BiFPN module were separately integrated into the baseline YOLOv11n model, the GFLOPs of the two modified models increased by 4.54% and 3.08% relative to the baseline, respectively. By contrast, the parameter count was reduced by 4.45% and 33.68%, and the model size decreased by 1.96% and 30.00%, respectively. Meanwhile, mAP@50 improved by 1.30% and 1.21%, and the detection speed (frames per second, FPS) increased by 0.96% and 7.14%, respectively, relative to the baseline.

When the BiFPN and MAFPN modules were simultaneously combined into the YOLOv11n model, the GFLOPs, parameter count, and model size were reduced by 1.61%, 27.09%, and 23.81%, respectively, compared with the baseline. Meanwhile, the mAP@50 and detection speed were improved by 1.57% and 32.79%, respectively.

When the C3k2-EfficientViM and BiFPN modules were simultaneously combined into the baseline YOLOv11n model, the GFLOPs increased by 11.26% relative to the baseline. By contrast, the parameter count and model size were reduced by 32.31% and 26.83%, respectively, while its mAP@50 was improved by 1.42%.

Overall, the proposed improved YOLOv11n-ALL model outperforms the baseline YOLOv11n model, improving precision, recall, F1-score, and mAP@50 by 1.89%, 2.92%, 2.40%, and 1.61%, respectively. Compared to the baseline, the YOLOv11n-ALL model significantly reduces the parameters while maintaining similar GFLOPs. As shown in Table 4, it reduces the parameter count by 22.09%, increases the detection speed (frames per second, FPS) by 7.92%, and reduces the model size by 17.31%.

The feeding features of the fish school vary significantly across feeding stages. We thus divided the feeding process into three distinct stages, namely pre-feeding, feeding, and post-feeding, as illustrated in Figure 10. For fish head detection in the pre-feeding stage, the baseline YOLOv11n model showed a few missed detections, while our proposed improved YOLOv11n-ALL model achieved excellent detection performance with accurate fish head localization. Detection is comparatively easy in this stage because the fish are dispersed in the water with minimal overlapping and occlusion, resulting in easily recognizable targets for the detection model.

In terms of fish head detection performance in the feeding and post-feeding stages, the YOLOv11n and the improved model with different enhanced modules exhibited missed detections to varying degrees. This is mainly because fish feeding activities cause water surface fluctuations and splash interference, and feeding aggregation increases overlap and occlusion between fish targets, resulting in severe target occlusion and thus missed detections by the models.

By separately integrating each proposed improved module into the YOLOv11n baseline model and comparing their detection performance, we found that integrating module A (the C3k2-EfficientViM module) effectively mitigated the baseline model’s severe missed detections. This performance improvement is mainly attributed to the C3k2-EfficientViM module, which enables efficient aggregation of global and local contextual information and enhances cross-channel feature interaction.

The integration of modules B (BiFPN feature fusion neck) and C (MAFPN feature fusion neck) into the YOLOv11n baseline model effectively reduced computational overhead, while maintaining detection accuracy and enhancing the detection capability for small targets. Finally, after integrating all modules, A, B, and C, into the baseline model, the final improved YOLOv11n-ALL model achieved robust and superior detection performance across all fish school feeding stages.

3.1.2. Comparison Results of the Improved Model with Other Models

The comparison test was conducted ten times under identical environmental conditions, and the stability of the resulting performance metrics was analyzed based on the average test results. Table 5 compares the performance of the various algorithms. Compared with YOLOv11n, YOLOv11n-ALL achieves 1.89%, 2.92%, 2.40%, and 1.61% improvements in precision, recall, F1-score, and mAP50, respectively. Meanwhile, compared with YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv12n [47], YOLOv11n-ALL yields precision gains of 0.23%, 0.00%, 1.52%, and 1.46%, recall gains of 0.84%, 0.05%, 4.91%, and 3.02%, F1-score gains of 0.53%, 0.02%, 3.23%, and 2.23%, and mAP50 gains of 0.46%, 0.18%, 1.79%, and 1.59%, respectively. YOLOv11n-ALL has a low parameter count (2.01 M) and computational cost (6.4 GFLOPs) relative to the other models, and its parameter count is 22.09% lower than that of YOLOv11n. It achieves an FPS of 278.24 and a model size of 4.3 MB, outperforming all comparison models in both metrics. These characteristics make the YOLOv11n-ALL model more lightweight.

In Figure 11, most of the models under comparison detect the individual fish heads. However, in complex feeding scenarios for fish, the YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, and YOLOv12n models all display different levels of missed detections.

To further validate the recognition performance of the proposed YOLOv11n-ALL model, a cross-validation method was employed to assess the model’s performance improvement. The comparison results are shown in Table 6. The proposed YOLOv11n-ALL model achieved an accuracy of 94.10%, an average precision of 90.09%, an average recall of 90.72%, and an F1-Score of 90.40%. Among all the comparison models, YOLOv11n-ALL achieved improved performance metrics.
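As a quick consistency check on the values reported above, the F1-score can be recomputed as the harmonic mean of precision and recall. This sketch assumes the reported F1-score is derived from the averaged precision and recall; if Table 6 instead averages per-class F1-scores, the result could differ slightly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall reported for YOLOv11n-ALL, in percent.
print(f"{f1_score(90.09, 90.72):.2f}")  # prints 90.40
```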

A t-test was performed to compare the proposed method with the baseline models. As shown in Table 7, the p-values for YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n and YOLOv12n are all below 0.05, indicating a significant difference between the YOLOv11n-ALL method and the comparison models, thus confirming its superior performance.

Heatmaps were also used to analyze the aggregation patterns of fish schools across different feeding phases, as presented in Figure 12. During the pre-feeding and post-feeding stages, the fish are relatively dispersed across the pond. In contrast, from the onset to the end of feeding, the distribution of the fish school shifts from dispersion to dense aggregation and then back to dispersion, which characterizes the aggregation dynamics of the school throughout the entire feeding process.

3.2. MOT Algorithms Performance Comparison

To evaluate the improved YOLOv11n-ALL detection performance in multi-fish tracking, we constructed a multi-object tracking dataset using DarkLabel, covering different feeding stages of the fish school (pre-feeding, mid-feeding, and post-feeding). We conducted two sets of multi-fish tracking experiments at these three stages by combining the original YOLOv11n detector and the improved YOLOv11n-ALL detector with ByteTrack. The experimental results are presented in Table 8.

As shown in Figure 13a,d, the tracking performance of the YOLOv11n + ByteTrack and YOLOv11n-ALL + ByteTrack models is largely comparable during the pre-feeding stage. Nevertheless, compared with YOLOv11n + ByteTrack, YOLOv11n-ALL + ByteTrack achieves improvements of 1.57%, 1.63%, 5.10%, and 4.24% in MOTA, MOTP, IDR, and IDF1, respectively. The identity switching rate (IDS) decreased by 6.06%, as summarized in Table 8.

During the feeding stage, the YOLOv11n + ByteTrack model suffers from severe widespread tracking failures. This phenomenon is primarily attributed to water surface fluctuations and increased inter-individual occlusion caused by fish feeding behavior, which exacerbate ID switches and identity loss during YOLOv11n + ByteTrack tracking, resulting in a significant decrease in tracking performance. In contrast, as shown in Figure 13b,e, the YOLOv11n-ALL + ByteTrack model successfully tracks most target individuals, with only occasional mismatches between detection boxes and fish, exhibiting relatively stable tracking performance. Compared with the baseline YOLOv11n + ByteTrack tracking model, the YOLOv11n-ALL + ByteTrack model achieves improvements of 3.71%, 5.19%, 5.65%, and 2.59% in MOTA, MOTP, IDR, and IDF1, respectively. The identity switching rate (IDS) decreased by 19.14%, as summarized in Table 8.

During the post-feeding stage, the tracking performance of the YOLOv11n + ByteTrack and YOLOv11n-ALL + ByteTrack models is largely comparable, as shown in Figure 13c,f. However, compared with the YOLOv11n + ByteTrack, the YOLOv11n-ALL + ByteTrack model achieves improvements of 1.23%, 2.39%, 3.71%, and 4.42% in MOTA, MOTP, IDR, and IDF1, respectively. The identity switching rate (IDS) decreased by 5.39%, as summarized in Table 8.

3.3. Mean Swimming Speed Index and Feeding Aggregation Speed Index

Figure 14 shows the visualization results for the fish school’s average swimming speed and feeding aggregation speed index calculated using the individual fish head detection method proposed in this study. The black dotted line marks the dynamic changes in the number of validly tracked individual fish across the pre-feeding, feeding, and post-feeding stages. The red dotted line indicates the fish school’s average swimming speed in three stages, while the blue dotted line reflects the corresponding changes in the feeding aggregation degree index. The green dotted line illustrates the changes in the fish school’s feeding aggregation speed over the same period.

To systematically analyze the behavioral characteristics and variation patterns of feeding intensity across the three feeding stages, this study examines the dynamic changes in the average swimming speed, feeding aggregation index, and feeding aggregation speed of fish schools throughout the entire feeding process under different feeding frequencies. This study also conducts a quantitative analysis of the number of validly tracked individual fish at each feeding stage, as shown in Figure 14.

During the pre-feeding phase, the fish school’s average swimming speed presents a gradual upward trend, with clear anticipatory aggregation behavior already observed. After the first feeding point (as shown in Figure 15a), the fish school’s average swimming speed increases rapidly with the onset of feeding behavior within 0–4.5 min, accompanied by dramatic fluctuations in both the feeding aggregation degree and the number of validly tracked individual fish. The significant fluctuation in the degree of feeding aggregation directly leads to a sharp rise in the fish school’s feeding aggregation speed. During 4.5–6.5 min, the average swimming speed of the fish school gradually decreases from its peak. In contrast, the feeding aggregation degree and the number of individual fish shift from intense fluctuations to a steady state. Correspondingly, the feeding aggregation speed also gradually levels off, as illustrated by the fish school feeding curves over the 0–6.5 min period in Figure 14a,b.

The second feeding event starts at 6.5 min, as shown in Figure 15b. During the 6.5–7.5 min window, both the feeding aggregation index and the number of validly tracked individual fish show an abrupt, sharp decrease. In contrast, the fish school’s feeding aggregation speed fluctuates significantly again, forming a secondary variation peak, as illustrated in Figure 14b. From 7.5 to 8.5 min, as feeding progresses, both the average swimming speed and the feeding aggregation speed of the fish school decrease gradually, as shown in Figure 14a,b.

The third feeding event starts at 8.5 min, as shown in Figure 15c. During the 8.5–10 min window, the feeding aggregation index decreases significantly, and the number of validly tracked individual fish shows a notable change. In contrast, the fish school’s feeding aggregation speed fluctuates markedly again, as illustrated in Figure 14b. From 10 to 12 min, as feeding progresses, the average swimming speed of the fish school continues to decline, while the feeding aggregation index and the number of validly tracked individual fish gradually stabilize. The fluctuation amplitude of the feeding aggregation speed also decreases significantly, and the fish school enters an overall steady state, as shown in Figure 14a,b.

Analysis of behavioral characteristics and feeding intensity in fish schools at different stages from three repeated feeding tests shows that the feeding behavior of cultured fish schools presents clear phase-specific characteristics. First, with the increase in feeding events, the duration of the upward trend in the fish school’s average swimming speed shortens gradually, as illustrated by the feeding response window corresponding to each feeding point in Figure 14. This indicates that feeding motivation decreases progressively as the feeding process progresses. Second, within the feeding response window of a single feeding event, the feeding behavior of the fish school shows typical dynamic changes. In the early stage of the feeding event, the average swimming speed of the fish school remains at a relatively low level, while both the feeding aggregation index and the feeding aggregation speed change significantly.

As feeding behavior progresses, the fish school’s average swimming speed increases gradually, while the feeding aggregation index shifts from a significant initial aggregation state to a state of intense fluctuation, which in turn drives pronounced variations in feeding aggregation speed. This phenomenon is mainly attributed to the spatial dispersion of feed pellets caused by both water flow and the swimming behavior of individual fish. Individual fish within the school exhibit dispersed feeding behavior while chasing scattered feed pellets, which ultimately leads to marked fluctuations in feeding aggregation speed.

In the post-feeding phase, the fish school’s average swimming speed gradually stabilizes. At the same time, the feeding aggregation degree significantly increases again, and the number of validly tracked individual fish also reaches a steady state. This phenomenon is attributed to the fact that at the end of the entire feeding process, individual fish have largely stopped feeding behavior, are dispersed throughout the entire culture water body, and exhibit markedly reduced swimming activity, as presented in Figure 15d. Consequently, the average swimming speed remains at a low level, while the fluctuations in both the feeding aggregation index and feeding aggregation speed gradually diminish, with both indicators tending to stabilize.

3.4. Quantification of Fish School Feeding Intensity

Phase analysis of fish school feeding behavior reveals clear differences in key behavioral indicators (average swimming speed, feeding aggregation index, and feeding aggregation speed) between the non-feeding and feeding stages. During feeding, the fish school exhibits a typical aggregation–dispersion–reaggregation dynamic, driving significant fluctuations in feeding aggregation speed and a significantly higher average swimming speed than in the non-feeding phase. In the non-feeding period, the fish school exhibits no feeding-driven directional aggregation, with a low and stable feeding aggregation index and no notable changes in feeding aggregation speed.

To further quantify feeding intensity and classify its levels, we first measured the baseline kinetic energy during the non-feeding stage. The results showed that the non-feeding stage kinetic energy was significantly lower than that during the feeding phase, and was mainly distributed in the range of 0–0.05 (Figure 16d). Therefore, based on the analysis of the kinetic energy density distribution across different feeding stages and the significant positive correlation between feeding kinetic energy and feeding intensity, we established a quantitative evaluation and classification system for feeding intensity using average swimming speed and feeding aggregation speed as dual core indicators.

Based on the total feeding kinetic energy of the fish school during the different feeding phases (Figure 16a–c), the feeding kinetic energy values of 0.15 and 0.35 were selected as thresholds for feeding intensity classification, because these two values provided the most stable classification performance for the weak, medium, and strong feeding states. The results indicate that the proposed E_k classification method can effectively distinguish between different levels of feeding intensity in fish schools.

Therefore, based on the E_k classification method described above, this study categorizes the feeding intensity of fish schools into three levels: strong (E_k ≥ 0.35), medium (0.15 < E_k < 0.35), and weak (E_k ≤ 0.15). The results on the test set in Table 1 are shown in Figure 17 and Table 9.
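The threshold rule can be written as a short function. The function and parameter names are illustrative assumptions; the thresholds 0.15 and 0.35 are those selected in this study, with the boundary values assigned to the weak and strong levels, respectively.

```python
def classify_feeding_intensity(e_k, weak_max=0.15, strong_min=0.35):
    """Map a normalized feeding kinetic energy E_k in [0, 1] to a level."""
    if e_k >= strong_min:
        return "strong"
    if e_k <= weak_max:
        return "weak"
    return "medium"
```

For example, an E_k of 0.20 falls between the two thresholds and is labeled medium.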

As shown in Figure 17 and Figure 18, the feeding intensity classification based on fish feeding kinetic energy achieved good classification performance on the test set. To further validate the model’s classification performance, we performed a linear regression analysis between the feeding intensity scores predicted for the fish schools and the scores assigned by experts to assess the correlation between the two. Experts assigned a score to each feeding intensity level using the following criteria, as shown in Figure 19: when no surface ripples are observed and the fish school is relatively dispersed, the school is judged as not feeding, with a feeding intensity score of 0; when the school is relatively concentrated and there are noticeable ripples on the water surface, it is judged as moderate feeding, with a score of 1; and when the school’s density fluctuates between concentrated and dispersed states and there are intense ripples on the water surface, it is judged as strong feeding, with a score of 2.

Specifically, we made predictions on the test set in Table 1 and compared them with the expert annotations described above, as shown in Figure 20 and Table 10. The linear fit yielded a coefficient of determination (R²) of 0.958. The Pearson and Spearman correlation coefficients were 0.9795 with a p-value < 0.05, and Cohen’s kappa was 0.969. These results indicate that the classification model has good predictive ability.
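The agreement statistics mentioned above can be computed from paired score lists. The sketch below implements the Pearson correlation coefficient and Cohen’s kappa in plain Python; the two score lists are hypothetical placeholders, not data from this study, and the function names are illustrative.

```python
import math
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def cohens_kappa(a, b):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(a)
    p_observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from the two marginal distributions.
    p_expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical scores (0 = not feeding, 1 = moderate, 2 = strong).
expert_scores = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2]
model_scores = [0, 0, 1, 1, 2, 2, 1, 1, 0, 2]
r = pearson_r(expert_scores, model_scores)
kappa = cohens_kappa(expert_scores, model_scores)
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters when one intensity level dominates the test set.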

To verify the model performance, we tested feeding intensity detection on 440 fish from 8 culture ponds (55 fish per pond) across the three stages. Feeding videos (each approximately 10 min long) were collected from the eight fish ponds, covering three distinct feeding phases (before, during, and after feeding), and the video from each pond was used as an independent test scenario for assessing the feeding intensity of the fish. During testing, the model automatically classified the feeding intensity of the fish schools into three levels: strong, medium, and weak. To further validate the reliability of the model’s predictions, we randomly selected 400 feeding images from the eight test videos (50 images from each video), as detailed in Table 11, and compared the prediction results with those of human experts. As shown in Table 12, the proposed method achieved a classification accuracy (ACC) of 97.41%, a false positive rate (FPR) of 1.78%, and a false negative rate (FNR) of 2.32%. The visualization results of the model’s detection performance are presented in Figure 21. These results indicate that the model maintains high classification accuracy across different fish ponds and feeding stages, and demonstrates good consistency with the judgments of human experts.

4. Discussion

4.1. Improved Algorithm Performance Analysis

To address the reduced fish detection accuracy of the baseline YOLOv11n model under complex aquaculture imaging conditions, we propose a YOLOv11n-ALL model for fish head detection. First, the EfficientViMBlock, based on a state space model (SSM), is integrated into the C3k2 module to construct the C3k2_EfficientViM module, enabling efficient global feature extraction while reducing parameter redundancy. Furthermore, a BiMAFPN neck network is designed to optimize multi-scale feature fusion and reduce the number of model parameters.

Ablation experiments and overall performance tests indicate that the YOLOv11n-ALL model delivers better overall detection performance than the original YOLOv11n model. For the fish head detection task in an aquaculture environment, the model achieved an mAP50 of 94.13%. Compared to the original model, it increases frames per second (FPS) by 7.92%, reduces model size by 17.31%, and reduces parameter count by 22.09%. To further validate the detection performance of the YOLOv11n-ALL model, we compared it with YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv12n, as shown in Figure 22. Experimental results show that YOLOv11n-ALL achieved the highest mAP50, exceeding these models by 0.46%, 0.18%, 1.79%, and 1.59%, respectively.

Figure 22. Performance comparison of various network models; (a) mAP50, precision, recall, and F1-score; (b) model size, FPS, GFLOPs, and params.

Analysis indicates that the BiFPN and MAFPN modules adopt lightweight feature fusion strategies that effectively reduce redundant parameters. However, the repeated multi-scale bidirectional interactions and attention-based feature aggregation operations introduced by the C3k2_EfficientViM feature extraction module increase the computational complexity during inference. Consequently, although the proposed model achieves parameter compression, additional floating-point operations are required to enhance cross-scale feature fusion and contextual information extraction. This also provides a potential direction for further model optimization in future work.

To address the practical need for quantitative analysis of fish school behavior in aquaculture, we further integrated the YOLOv11n-ALL model with the ByteTrack multi-object tracking algorithm to construct a specialized multi-object tracking framework for individual fish. The proposed YOLOv11n-ALL + ByteTrack framework improves multiple object tracking accuracy (MOTA) by 1.57%, 3.71%, and 1.23% in the pre-feeding, feeding, and post-feeding stages, respectively, as shown in Table 8. This improvement mitigates common issues such as frequent target ID switches and trajectory fragmentation caused by dense fish movement in aquaculture water bodies, supporting continuous tracking of individual fish and quantification of their behavioral characteristics. However, the ID switches that remain in ByteTrack may still disrupt the continuity of trajectory association, thereby introducing cumulative errors into the estimation of fish swimming speed. This effect becomes more pronounced in cases of long-term occlusion or repeated trajectory interruptions, which may reduce the stability of feeding intensity quantification. Therefore, establishing a robust trajectory association mechanism under dense occlusion conditions will be an important direction for future research.

4.2. Stability and Robustness Analysis of Proposed Methods for Assessing Fish School Feeding Intensity

During the non-feeding phase, short-term starvation increases the swimming speed of fish schools [14]. During the feeding phase, competitive feeding behavior among individual fish further increases the swimming speed. However, fish schools are highly sensitive to external environmental factors, such as fluctuations in water quality or human interference, which can significantly increase their swimming speed. This characteristic introduces a risk of misjudgment in assessing feeding intensity. Therefore, leveraging the distinct differences in average swimming speed and feeding aggregation speed between the feeding and non-feeding phases, this study proposes a dual-index kinetic quantification method to evaluate feeding intensity.

Compared with the method proposed by Zhou et al. [48], which quantifies fish feeding behavior by calculating moment values, the proposed method evaluates fish school feeding intensity from two aspects, swimming speed and feeding aggregation speed, thereby reducing the likelihood of misjudgment. This is because, during the feeding process, feed particles move irregularly with water flow, causing the fish school to dynamically alternate between aggregation and dispersion states, as illustrated in Figure 14. Therefore, relying solely on a single feature, such as the feeding aggregation index, is insufficient for accurately assessing feeding intensity. In addition, the feeding intensity predicted by the proposed method correlates strongly with expert manual scoring (R² = 0.958).
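
The alternation between aggregation and dispersion can be made concrete with a small sketch. The definitions used here are assumptions for illustration only: aggregation degree is taken as the mean distance of head-box centers to their centroid, and aggregation speed as the rate at which that distance shrinks. The function names and coordinates are hypothetical, and the paper's own formulas may differ.

```python
import math


def aggregation_degree(points):
    """Mean distance of head centers to their centroid (smaller = tighter school)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / n


def aggregation_speed(points_t0, points_t1, dt):
    """Contraction rate in pixels/second: positive when the school tightens,
    negative when it disperses."""
    return (aggregation_degree(points_t0) - aggregation_degree(points_t1)) / dt


# Hypothetical head centers: a loose square that contracts within 0.5 s.
loose = [(-10.0, -10.0), (-10.0, 10.0), (10.0, -10.0), (10.0, 10.0)]
tight = [(-5.0, -5.0), (-5.0, 5.0), (5.0, -5.0), (5.0, 5.0)]

print(aggregation_speed(loose, tight, dt=0.5))  # positive: aggregating
print(aggregation_speed(tight, loose, dt=0.5))  # negative: dispersing
```

Because the sign flips whenever the school disperses, an aggregation measure on its own fluctuates during feeding, which is the motivation for pairing it with swimming speed.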

This study also conducted validation experiments by analyzing the correlation between feeding behaviors at different feeding stages and actual observational data, thereby verifying the reliability of the proposed method. The method achieved an accuracy of 97.41%, a false positive rate of 1.78%, and a false negative rate of 2.32%. In terms of accuracy, it outperforms existing approaches, including a feed-particle-based detection method (accuracy: 90.3%) [49], a short-term feeding behavior analysis method based on an EfficientNet-B2 dual-stream attention network (accuracy: 89.56%) [22], and a machine vision-based automatic grading method for assessing fish feeding intensity and appetite (accuracy: 90%) [50]. Compared with water quality sensor-based and passive acoustic methods, the proposed approach has the advantage of being unaffected by environmental noise in aquaculture systems and enables real-time, contactless detection.
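
The three reported rates follow the definitions in Table 3 (FPR = FP / (FP + TN), FNR = FN / (FN + TP)). The sketch below applies those definitions to hypothetical counts; the numbers are illustrative only and are not the counts behind the reported 97.41%, 1.78%, and 2.32%.

```python
def binary_rates(tp, fp, tn, fn):
    """Accuracy, false positive rate, and false negative rate from raw counts."""
    total = tp + fp + tn + fn
    return {
        "acc": (tp + tn) / total,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rates = binary_rates(tp=90, fp=2, tn=98, fn=10)
for name, value in rates.items():
    print(f"{name}: {100 * value:.2f}%")
```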

4.3. Future Research

Although this study successfully applies an improved lightweight approach to detecting fish school feeding intensity, we acknowledge that there are still limitations and that further investigation is required. In this work, a dual-indicator feeding dataset for largemouth bass was constructed. Due to the image-level random split strategy adopted in this study, potential frame-level data leakage may exist among adjacent video frames, which could lead to optimistic performance estimation. Future work will employ strict video-level or recording-level dataset partition protocols to further evaluate the generalization capability of the proposed model under leakage-free conditions.
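
A video-level partition of the kind described above can be sketched as follows. This is a minimal illustration, assuming each frame carries the identifier of the video it came from; the function name `split_by_video`, the 25% test fraction, and the synthetic frame list are hypothetical.

```python
import random


def split_by_video(frames, test_fraction=0.25, seed=0):
    """Hold out whole videos so adjacent frames never straddle the split.

    frames: list of (video_id, frame_index) pairs.
    """
    video_ids = sorted({video_id for video_id, _ in frames})
    rng = random.Random(seed)
    rng.shuffle(video_ids)
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])
    train = [f for f in frames if f[0] not in test_ids]
    test = [f for f in frames if f[0] in test_ids]
    return train, test


# Synthetic example: 8 videos with 50 frames each.
frames = [(video_id, i) for video_id in range(8) for i in range(50)]
train, test = split_by_video(frames)

train_videos = {video_id for video_id, _ in train}
test_videos = {video_id for video_id, _ in test}
print(len(train), len(test), train_videos & test_videos)
```

Shuffling video identifiers instead of individual frames guarantees that no recording contributes frames to both sets, at the cost of a coarser control over the exact split ratio.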

Future work will extend this dataset to include feeding data from multiple fish species. Since the current model was trained and validated only on 25 g largemouth bass, its direct application to other fish species may require adjustments to target-scale adaptation, feature representation, and trajectory association parameters. Nevertheless, due to YOLOv11n-ALL’s transfer-learning architecture, the model can be rapidly adapted to new aquaculture scenarios through lightweight fine-tuning with a small annotated dataset. In addition, the aggressive feeding behavior of largemouth bass may induce surface disturbances, thereby affecting detection accuracy. Further validation under varying illumination conditions, different water turbidity levels, and across multiple fish species will be conducted to evaluate the robustness and generalizability of the proposed method. Moreover, future research will investigate temporal characteristics such as transition-point duration, response delay, fluctuation periodicity, and symmetry features to achieve a more comprehensive analysis of fish feeding behavior dynamics and to further improve model accuracy. Meanwhile, future work will focus on integrating the proposed fish feeding intensity detection model with multi-sensor fusion technologies for fish behavior monitoring and quantitative feeding intensity analysis, thereby enabling more precise feeding management in aquaculture systems.

5. Conclusions

Existing fish school feeding intensity detection methods rely on single-feature feeding images and overlook the interdependencies between individual fish and fish school behavior. To address this limitation and enable quantification of feeding intensity, this study proposes a method based on two indices: feeding aggregation speed and average swimming speed. First, the YOLOv11n model is enhanced by incorporating the EfficientViMBlock module into a modified C3k2 module and by improving the neck network structure (BiMAFPN), thereby enabling the efficient capture of global dependencies while maintaining computational efficiency. Compared with the YOLOv11n baseline, the improved YOLOv11n-ALL increased precision, recall, F1-score, and mAP50 by 1.89%, 2.92%, 2.4%, and 1.61%, respectively. Combined with ByteTrack to track individual fish and quantify average swimming speed and feeding aggregation speed, the improved model increased MOTA by 1.57%, 3.71%, and 1.23% in the pre-feeding, feeding, and post-feeding stages, respectively. For feeding intensity classification, the method achieved an accuracy of 97.41%, a false positive rate of 1.78%, and a false negative rate of 2.32%. These results support the subsequent development of scientific feeding strategies for fish schools and provide a research foundation for monitoring and digital analysis.

Author Contributions

B.J.: Writing—original draft, software, methodology, data curation, validation. X.W.: Writing—review and editing, supervision, funding acquisition, formal analysis. Y.S.: Writing—review and editing, formal analysis. J.Z.: Writing—review and editing, formal analysis. J.W.: Writing—review and editing, validation. Z.X.: Validation, data curation. X.Z.: Writing—review and editing. C.Z.: Validation, data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Jiangsu Province Key R&D Program Project of China (BE2021362) and Nanjing Modern Agricultural Machinery Equipment and Technology Innovation Demonstration Project of Jiangsu (NJ2023-05).

Institutional Review Board Statement

This study involved collecting images of fish feeding at different stages during normal feeding routines. No experimental interventions were performed on the fish, and all video footage was obtained from routine feeding and management procedures.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Barraza-Guardado, R.H.; Martínez-Córdova, L.R.; Enríquez-Ocaña, L.F.; Martínez-Porchas, M.; Miranda-Baeza, A.; Porchas-Cornejo, M.A. Effect of shrimp farm effluent on water and sediment quality parameters off the coast of Sonora, Mexico. Cienc. Mar. 2015, 40, 221–235.
2. Xu, C.; Wang, Z.; Du, R.; Li, Y.; Li, D.; Chen, Y.; Li, W.; Liu, C. A method for detecting uneaten feed based on improved YOLOv5. Comput. Electron. Agric. 2023, 212, 108101.
3. Wang, Y.; Yu, X.; Liu, J.; An, D.; Wei, Y. Dynamic feeding method for aquaculture fish using multi-task neural network. Aquaculture 2022, 551, 737913.
4. Zhang, Y.; Xu, C.; Du, R.; Kong, Q.; Li, D.; Liu, C. MSIF-MobileNetV3: An improved MobileNetV3 based on multi-scale information fusion for fish feeding behavior analysis. Aquac. Eng. 2023, 102, 102338.
5. Li, D.; Wang, Z.; Wu, S.; Miao, Z.; Du, L.; Duan, Y. Automatic recognition methods of fish feeding behavior in aquaculture: A review. Aquaculture 2020, 528, 735508.
6. Yang, P.; Liu, Q.Y.; Li, Z. A High-Precision Classification Method for Fish Feeding Behavior Analysis Based on Improved RepVGG. Preprints 2024, 2023091041.
7. Cao, Y.; Liu, S.; Wang, M.; Liu, W.; Liu, T.; Cao, L.; Guo, J.; Feng, D.; Zhang, H.; Hassan, S.G.; et al. A Hybrid Method for Identifying the Feeding Behavior of Tilapia. IEEE Access 2024, 12, 76022–76037.
8. Hu, W.C.; Chen, L.B.; Huang, B.K.; Lin, H.M. A Computer Vision-Based Intelligent Fish Feeding System Using Deep Learning Techniques for Aquaculture. IEEE Sens. J. 2022, 22, 7185–7194.
9. Zeng, Y.; Yang, X.; Pan, L.; Zhu, W.; Wang, D.; Zhao, Z.; Liu, J.; Sun, C.; Zhou, C. Fish school feeding behavior quantification using acoustic signal and improved Swin Transformer. Comput. Electron. Agric. 2023, 204, 107580.
10. Yang, H.; Shi, Y.; Wang, X. Detection Method of Fry Feeding Status Based on YOLO Lightweight Network by Shallow Underwater Images. Electronics 2022, 11, 3856.
11. Marti-Puig, P.; Serra-Serra, M.; Campos-Candela, A.; Reig-Bolano, R.; Manjabacas, A.; Palmer, M. Quantitatively scoring behavior from video-recorded, long-lasting fish trajectories. Environ. Model. Softw. 2018, 106, 68–76.
12. Zhao, H.; Cui, H.; Qu, K.; Zhu, J.; Li, H.; Cui, Z.; Wu, Y. A fish appetite assessment method based on improved ByteTrack and spatiotemporal graph convolutional network. Biosyst. Eng. 2024, 240, 46–55.
13. Liu, Y.; Li, B.; Liu, D.; Duan, Q. Adaptive spatial aggregation and viewpoint alignment for three-dimensional online multiple fish tracking. Comput. Electron. Agric. 2025, 236, 110408.
14. Wu, Y.; Shi, Y.; Li, W. Locomotor posture and swimming-intensity quantification in starvation-stress behavior detection of individual fish. Comput. Electron. Agric. 2022, 202, 107399.
15. Wu, Y.; Wang, X.; Shi, Y.; Wang, Y.; Qian, D.; Jiang, Y. Fish feeding intensity assessment method using deep learning-based analysis of feeding splashes. Comput. Electron. Agric. 2024, 221, 108995.
16. Liu, T.; He, S.; Liu, H.; Gu, Y.; Li, P. A Robust Underwater Multiclass Fish-School Tracking Algorithm. Remote Sens. 2022, 14, 4106.
17. Zhang, Z.; Du, X.; Jin, L.; Wang, S.; Wang, L.; Liu, X. Large-scale underwater fish recognition via deep adversarial learning. Knowl. Inf. Syst. 2022, 64, 353–379.
18. Zhou, C.; Lin, K.; Xu, D.; Chen, L.; Guo, Q.; Sun, C.; Yang, X. Near infrared computer vision and neuro-fuzzy model-based feeding decision system for fish in aquaculture. Comput. Electron. Agric. 2018, 146, 114–124.
19. Wei, D.; Bao, E.; Wen, Y.; Zhu, S.; Ye, Z.; Zhao, J. Behavioral spatial-temporal characteristics-based appetite assessment for fish school in recirculating aquaculture systems. Aquaculture 2021, 545, 737215.
20. Zhao, J.; Bao, W.J.; Zhang, F.D.; Ye, Z.Y.; Liu, Y.; Shen, M.W.; Zhu, S.M. Assessing appetite of the swimming fish based on spontaneous collective behaviors in a recirculating aquaculture system. Aquac. Eng. 2017, 78, 196–204.
21. Ye, Z.; Zhao, J.; Han, Z.; Zhu, S.; Li, J.; Lu, H.; Ruan, Y. Behavioral Characteristics and Statistics-Based Imaging Techniques in the Assessment and Optimization of Tilapia Feeding in a Recirculating Aquaculture System. Trans. ASABE 2016, 59, 345–355.
22. Yang, L.; Yu, H.; Cheng, Y.; Mei, S.; Duan, Y.; Li, D.; Chen, Y. A dual attention network based on efficientNet-B2 for short-term fish school feeding behavior analysis in aquaculture. Comput. Electron. Agric. 2021, 187, 106316.
23. Zhang, L.; Wang, J.; Li, B.; Liu, Y.; Zhang, H.; Duan, Q. A MobileNetV2-SENet-based method for identifying fish school feeding behavior. Aquac. Eng. 2022, 99, 102288.
24. Feng, S.; Yang, X.; Liu, Y. Fish feeding intensity quantification using machine vision and a lightweight 3D ResNet-GloRe network. Aquac. Eng. 2022, 98, 102244.
25. Hu, X.; Liu, Y.; Zhao, Z.; Liu, J.; Yang, X.; Sun, C.; Chen, S.; Li, B.; Zhou, C. Real-time detection of uneaten feed pellets in underwater images for aquaculture using an improved YOLO-V4 network. Comput. Electron. Agric. 2021, 185, 106135.
26. Zhou, Y.; Zhang, Q.; Zhang, H.; Yang, J.; Guo, Z.; Bulugu, I.; Shen, Y. A deep vision sensing-based fuzzy control scheme for smart feeding in the industrial recirculating aquaculture systems. Electron. Lett. 2023, 59, e12727.
27. Cai, K.; Miao, X.; Wang, W.; Pang, H.; Liu, Y.; Song, J. A modified YOLOv3 model for fish detection based on MobileNetv1 as backbone. Aquac. Eng. 2020, 91, 102117.
28. Güroy, D.; Karadal, O.; Mantoğlu, S.; Güroy, B.; Şimşek, O.; Çelebi, K.; Eroldoğan, O.T.; Genç, M.A.; Genç, E. The effects of feeding frequency on the growth performance, body composition, health status and histology of juvenile meagre (Argyrosomus regius). Aquac. Res. 2022, 53, 6855–6867.
29. Portinho, J.L.; Silva, M.S.G.M.; Queiroz, J.F.; de Barros, I.; Campos Gomes, A.C.; Losekann, M.E.; Koga-Vicente, A.; Spinelli-Araujo, L.; Vicente, L.E.; Rodrigues, G.S. Integrated indicators for assessment of best management practices in tilapia cage farming. Aquaculture 2021, 545, 737136.
30. Berg, E.M.; Mrowka, L.; Bertuzzi, M.; Madrid, D.; Picton, L.D.; El Manira, A. Brainstem circuits encoding start, speed, and duration of swimming in adult zebrafish. Neuron 2023, 111, 372–386.e374.
31. Magnoni, L.J.; Collins, S.P.; Wylie, M.J.; Black, S.E.; Wellenreuther, M. Morphology and metabolic traits related to swimming performance in Australasian snapper (Chrysophrys auratus) selected for fast growth. J. Fish Biol. 2024, 105, 358–371.
32. Georgopoulou, D.G.; Stavrakidis-Zachou, O.; Mitrizakis, N.; Papandroulakis, N. Tracking and Analysis of the Movement Behavior of European Seabass (Dicentrarchus labrax) in Aquaculture Systems. Front. Anim. Sci. 2021, 2, 754520.
33. Wu, X.; Yang, S.; Cai, Z.; Song, R.; Fan, S. Measurement of Fish Motion Parameters Based on Deeplabcut. In Proceedings of the 2024 IEEE 19th Conference on Industrial Electronics and Applications (ICIEA), Kristiansand, Norway, 5–8 August 2024; pp. 1–5.
34. Zhang, L.; Zhai, G.; Hu, B.; Qiao, Z.; Zhang, P. Fish Target Detection and Speed Estimation Method based on Computer Vision. In Proceedings of the 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 21–24 July 2023; pp. 1330–1336.
35. Wang, S.H.; Zhao, J.W.; Chen, Y.Q. Robust tracking of fish schools using CNN for head identification. Multimed. Tools Appl. 2017, 76, 23679–23697.
36. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304.
37. Mei, L.; Chen, Z. An Improved YOLOv5-Based Lightweight Submarine Target Detection Algorithm. Sensors 2023, 23, 9699.
38. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
39. Sapkota, R.; Karkee, M. Ultralytics YOLO evolution: An overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 object detectors for computer vision and pattern recognition. arXiv 2025, arXiv:2510.09653.
40. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534.
41. Lee, S.; Choi, J.; Kim, H.J. EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality. arXiv 2024, arXiv:2411.15241.
42. Doherty, J.; Gardiner, B.; Kerr, E.; Siddique, N. BiFPN-YOLO: One-stage object detection integrating Bi-Directional Feature Pyramid Networks. Pattern Recognit. 2025, 160, 111209.
43. Yang, Z.; Guan, Q.; Zhao, K.; Yang, J.; Xu, X.; Long, H.; Tang, Y. Multi-Branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for accurate object detection. arXiv 2024, arXiv:2407.04381.
44. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21.
45. Behzadi Pour, F.; Parra, L.; Lloret, J.; Abdanan Mehdizadeh, S. Measuring and Evaluating the Speed and the Physical Characteristics of Fishes Based on Video Processing. Water 2023, 15, 2138.
46. Alahmad, R.; Solpico, D.; Masuda, S.; Ishizuzuka, T.; Naramura, K.; Dong, Z.; Li, Z.; Nishida, Y.; Ishii, K. Visual-Based System for Fish Detection and Velocity Estimation in Marine Aquaculture. Proc. Int. Conf. Artif. Life Robot. 2025, 30, 351–355.
47. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
48. Zhou, C.; Zhang, B.; Lin, K.; Xu, D.; Chen, C.; Yang, X.; Sun, C. Near-infrared imaging to quantify the feeding behavior of fish in aquaculture. Comput. Electron. Agric. 2017, 135, 233–241.
49. Feng, M.; Jiang, P.; Wang, Y.; Hu, S.; Chen, S.; Li, R.; Huang, H.; Li, N.; Zhang, B.; Ke, Q.; et al. YOLO-feed: An advanced lightweight network enabling real-time, high-precision detection of feed pellets on CPU devices and its applications in quantifying individual fish feed intake. Aquaculture 2025, 608, 742700.
50. Zhou, C.; Xu, D.; Chen, L.; Zhang, S.; Sun, C.; Yang, X.; Wang, Y. Evaluation of fish feeding intensity in aquaculture using a convolutional neural network and machine vision. Aquaculture 2019, 507, 457–465.

Figure 1. Recirculating aquaculture system (RAS).

Figure 2. Images of fish school feeding status: (a–c) pre-feeding; (d–f) feeding; (g–i) post-feeding.

Figure 3. Fish school feeding video acquisition system and dataset construction process. (a) Fish school feeding video acquisition system. (b) Dataset construction process.

Figure 4. The overall workflow of the proposed method.

Figure 5. Improved YOLOv11n-ALL model structure.

Figure 6. YOLOv11n C3k2_EfficientViM module: (a) C3k2 module improved based on EfficientViMBlock; (b) FFN module; (c) HSM-SSD module; (d) HC-SSD module; and (e) backbone structure for enhanced YOLOv11n using the C3k2_EfficientViM.

Figure 7. Different neck network connections. (a) Basic structure of the PAFPN; (b) MAFPN neck network architecture; and (c) fusion of BiFPN into the MAFPN neck network structure to form BiMAFPN.

Figure 8. Schematic diagram of the process of detecting the average feeding speed and feeding aggregation in a fish school to assess feeding intensity. (a) Fish school feeding velocity and feeding aggregation assessment; (b) assessment of fish feeding intensity.

Figure 9. Performance comparison of the proposed improved model and YOLOv11n baseline on the validation set. (a) Box loss curves; (b) mAP50; (c) precision; and (d) precision–recall (PR) curves.

Figure 10. Detection performance of the baseline model YOLOv11n integrated with different improved modules across different feeding stages of the fish school.

Figure 11. Different detection models’ recognition performance.

Figure 12. Heatmap of different feeding stages in a fish school.

Figure 13. Comparing the performance of different tracking algorithms across different feeding stages. (a) Tracking performance of YOLOv11n+ByteTrack before feeding; (b) Tracking performance of YOLOv11n+ByteTrack during feeding; (c) Tracking performance of YOLOv11n+ByteTrack after feeding; (d) Tracking performance of YOLOv11n-ALL+ByteTrack before feeding; (e) Tracking performance of YOLOv11n-ALL+ByteTrack during feeding; (f) Tracking performance of YOLOv11n-ALL+ByteTrack after feeding.

Figure 14. Results of the average swimming speed and feeding aggregation speed of fish schools in three stages. (a) The average swimming speed and the number of fish in the school; (b) the feeding aggregation degree and the feeding aggregation speed of the fish school.

Figure 15. Visualization results of the average swimming speed and feeding aggregation index of the fish school before feeding, during feeding, and after feeding. (a) First feeding point; (b) second feeding point; (c) third feeding point; and (d) no-feeding interval.

Figure 16. Classification results of fish school feeding intensity. (a) Feeding kinetic energy during the first feeding stage; (b) Feeding kinetic energy during the second feeding stage; (c) Feeding kinetic energy during the third feeding stage; (d) Feeding kinetic energy during the non-feeding stage.

Figure 17. Classification results of fish school feeding intensity. (a) Confusion matrix; (b) Multi-class ROC curve.

Figure 18. Classification results of fish school feeding intensity. (a) Sensitivity heatmap; (b) Threshold sensitivity analysis.

Figure 19. Intensity of fish school feeding (artificially labeled). (a) Weak; (b) Medium; (c) Strong.

Figure 20. Correlation coefficient metrics. (a) Correlation coefficient R²; (b) Pearson correlation heatmap; (c) Cohen’s kappa.

Figure 21. Results of fish school feeding intensity identification.

Table 1. Multi-object detection and tracking dataset.

Dataset Type	Feeding Stage	Resolution	FPS of the Video	Training Frames	Test Frames
Multi-object detection and tracking	Pre-feeding	1920 × 1080	30	900	900
	Feeding	1920 × 1080	30	900	900
	Post-feeding	1920 × 1080	30	900	900

Table 2. Sensitivity analysis of sample size based on quality and testing.

Mass (g)	Missed Detection Rate (%)	E_k
20	/	0.78
25	/	0.81
30	/	0.82
/	0	0.88
/	10	0.85
/	20	0.83

Table 3. Evaluation metrics for detection models.

Metrics	Description
Precision	The proportion of samples predicted as positive that are truly positive (%).
Recall	The proportion of true positives correctly identified by the model (%).
mAP@50	Mean of the average precision (AP) over all categories at an IoU threshold of 0.5 (%).
F1-score	Harmonic mean of model precision and recall (%).
FPS	Frames processed per second.
Params	Total learned parameters in the network (M).
Model size	Storage size of the model file (MB).
GFLOPs	Giga floating-point operations required for one forward pass, measuring model computational complexity (G).
MOTA	Multiple objects tracking accuracy (%).
MOTP	Multiple objects tracking precision (%).
IDR	ID recall rate (%).
IDF1	ID F1-score (%).
IDSW	Identity switches count.
IDS	Identity switching rate (%).
ACC	The model’s detection accuracy (%).
FPR	FP divided by the sum of FP and TN (%).
FNR	FN divided by the sum of FN and TP (%).
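
As a worked check of the F1-score definition, the sketch below recomputes F1 from the precision and recall values listed for YOLOv11n and YOLOv11n-ALL in Table 4. The function name `f1_score` is a local helper written for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = f1_score(88.21, 87.81)  # YOLOv11n row of Table 4
improved_f1 = f1_score(90.10, 90.73)  # YOLOv11n-ALL row of Table 4

print(f"{baseline_f1:.2f} {improved_f1:.2f}")
```

Both values round to the F1-scores listed in Table 4 (88.01 and 90.41).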

Table 4. Comparison of results from various ablation experimental models.

YOLOv11n	A *	B *	C *	P * (%)	R * (%)	F1-Score (%)	mAP50 (%)	GFLOPs (G)	Params (M)	FPS (f/s)	Model Size (MB)
√	×	×	×	88.21	87.81	88.01	92.52	6.3	2.58	257.82	5.2
√	√	×	×	90.08	90.13	90.11	93.82	6.6	2.47	260.33	5.1
√	×	√	×	90.04	89.97	89.90	93.73	6.5	1.93	277.65	4.0
√	×	√	√	90.01	90.25	90.13	94.09	6.2	2.03	383.60	4.2
√	√	√	×	90.02	90.26	90.14	93.94	7.1	1.95	237.79	4.1
√	√	√	√	90.10	90.73	90.41	94.13	6.4	2.01	278.24	4.3

* A denotes the C3k2_EfficientViM module; B denotes the BiFPN neck network; C denotes the MAFPN neck network; P denotes precision; R denotes recall; YOLOv11n-ALL denotes YOLOv11n combined with C3k2_EfficientViM, BiFPN, and MAFPN.
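
The gains and the parameter reduction quoted in the text can be recomputed from the first and last rows of Table 4. The sketch below is a simple arithmetic check; the dictionary keys are labels chosen for this illustration.

```python
# First row (YOLOv11n baseline) and last row (YOLOv11n-ALL) of Table 4.
baseline = {"P": 88.21, "R": 87.81, "F1": 88.01, "mAP50": 92.52, "params_M": 2.58}
improved = {"P": 90.10, "R": 90.73, "F1": 90.41, "mAP50": 94.13, "params_M": 2.01}

# Absolute gains in percentage points.
gains = {k: round(improved[k] - baseline[k], 2) for k in ("P", "R", "F1", "mAP50")}

# Relative reduction in parameter count, in percent.
param_reduction = round(
    100 * (baseline["params_M"] - improved["params_M"]) / baseline["params_M"], 2
)

print(gains)
print(param_reduction)
```

The gains are differences in percentage points, whereas the parameter reduction is relative to the baseline parameter count.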

Table 5. Performance comparison of various network models.

Model	Precision (%)	Recall (%)	F1-Score (%)	mAP50 (%)	GFLOPs (G)	Params (M)	FPS (f/s)	Model Size (MB)
YOLOv5n	89.87	89.89	89.88	93.67	7.1	2.50	219.44	5.0
YOLOv8n	90.10	90.68	90.39	93.95	8.1	3.01	209.41	6.0
YOLOv10n	88.58	85.82	87.18	92.34	6.5	2.27	245.29	5.5
YOLOv11n	88.21	87.81	88.01	92.52	6.3	2.58	257.82	5.2
YOLOv12n	88.64	87.71	88.18	92.54	5.8	2.51	213.30	5.2
YOLOv11n-ALL	90.10	90.73	90.41	94.13	6.4	2.01	278.24	4.3

Table 6. Comparison results of fish-head recognition models based on cross-validation.

Model	Accuracy (%)	Average Precision (%)	Average Recall (%)	Average F1-Score (%)
YOLOv5n	93.62	89.85	89.88	89.86
YOLOv8n	93.84	90.08	90.63	90.35
YOLOv10n	92.29	88.58	85.80	87.16
YOLOv11n	92.50	88.20	87.78	87.98
YOLOv12n	92.48	88.61	87.70	88.15
YOLOv11n-ALL	94.10	90.09	90.72	90.40

Table 7. t-test results (significance level 0.05).

Model	t Statistic	p Value	Significant?
YOLOv5n	24.24237	1.53848 × 10⁻⁴	Yes
YOLOv8n	3.69336	0.03444	Yes
YOLOv10n	16.281450	7.33681 × 10⁻⁵	Yes
YOLOv11n	161.16336	5.26759 × 10⁻⁷	Yes
YOLOv12n	34.57029	5.32175 × 10⁻⁵	Yes

Table 8. Recognition results of different tracking models.

Models	Feeding Stage	MOTA (%)	MOTP (%)	IDR (%)	IDF1 (%)	IDSW	GT	IDS (%)
YOLOv11n + ByteTrack	Pre-feeding	87.91	77.49	75.24	79.65	15	70	21.43
	Feeding	82.17	70.29	69.04	72.59	35	90	38.89
	Post-feeding	85.47	73.39	72.38	74.11	19	74	25.68
YOLOv11n-ALL + ByteTrack	Pre-feeding	89.48	79.12	80.34	83.89	10	65	15.38
	Feeding	85.88	75.48	74.69	75.18	26	81	19.75
	Post-feeding	86.70	75.78	76.09	78.53	14	69	20.29
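
The per-stage MOTA improvements can be read directly from the table above. The sketch below computes the differences between the two frameworks; the variable names are chosen for this illustration.

```python
# MOTA (%) per feeding stage: (YOLOv11n + ByteTrack, YOLOv11n-ALL + ByteTrack).
mota = {
    "Pre-feeding": (87.91, 89.48),
    "Feeding": (82.17, 85.88),
    "Post-feeding": (85.47, 86.70),
}

# Gain of the improved framework in percentage points.
mota_gain = {stage: round(new - old, 2) for stage, (old, new) in mota.items()}

for stage, gain in mota_gain.items():
    print(f"{stage}: +{gain:.2f}")
```

The largest gain falls in the feeding stage, where fish are densest and occlusion is most frequent.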

Table 9. Results of fish feeding intensity identification.

Feeding Categories	Precision	Recall	F1-Score	Accuracy
Strong Feeding	1.0000	0.9286	0.9630	0.9762
Medium feeding	0.9333	1.0000	0.9655
Weak feeding	1.0000	1.0000	1.0000
Macro avg	0.9778	0.9762	0.9762
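
The per-class and macro-averaged values above can be derived from a three-class confusion matrix. The matrix in the sketch below is an assumption: it is one matrix that is consistent with the precision, recall, and accuracy values in Table 9, inferred from those values and not copied from the confusion matrix in Figure 17a.

```python
labels = ["strong", "medium", "weak"]

# Assumed confusion matrix (rows = true class, columns = predicted class),
# chosen to be consistent with Table 9.
cm = [
    [13, 1, 0],
    [0, 14, 0],
    [0, 0, 14],
]


def per_class_metrics(cm):
    """Return a (precision, recall, F1) tuple for each class."""
    n = len(cm)
    results = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        results.append((p, r, f1))
    return results


metrics = per_class_metrics(cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
macro = [sum(m[j] for m in metrics) / len(metrics) for j in range(3)]

for name, (p, r, f1) in zip(labels, metrics):
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
print(f"macro: P={macro[0]:.4f} R={macro[1]:.4f} F1={macro[2]:.4f}")
print(f"accuracy: {accuracy:.4f}")
```

Under this assumed matrix, a single strong-feeding sample predicted as medium feeding accounts for every value in the table that is below 1.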

Table 10. Statistical table of model-predicted values and manually observed true values.

Points	DOF	Pearson’s r	R² (COD)	Spearman	Cohen Kappa	CI
162	160	0.9795 (p < 0.05)	0.958	0.9795 (p < 0.05)	0.969	95%

Note: “DOF” represents degree of freedom.

Table 11. Specific details of the extracted feeding images.

Feeding Video	Strong Feeding Images	Medium Feeding Images	Weak Feeding Images
1	22	18	10
2	15	9	26
3	27	12	11
4	24	10	16
5	18	17	15
6	13	16	21
7	10	21	19
8	9	19	22

Table 12. Identification performance of fish school feeding intensity levels.

Model	Number	Feeding Images	ACC (%)	FPR (%)	FNR (%)	Cohen Kappa
YOLOv11n-ALL + ByteTrack	440	400	97.41	1.78	2.32	0.962

Note: “Number” represents total number of fish.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
