1. Introduction
Accurate estimation of fish populations is a critical task in fisheries management, marine biodiversity assessment, and aquaculture monitoring. The ability to track fish abundance and movement patterns is essential for ensuring sustainable fisheries, preventing overfishing, and maintaining ecological balance [1]. According to the Food and Agriculture Organization (FAO), global fish production reached approximately 178 million metric tons in 2020, with over 50% originating from aquaculture [2]. In commercial aquaculture, fish counting is used for stock assessment, optimizing feed allocation, and improving farm management, while in fisheries conservation, accurate population estimates help enforce regulations and assess ecosystem health [3].
Traditional fish counting methods, which primarily rely on manual observation and video frame-based counting, remain widely used but pose significant challenges. These methods are labor-intensive, time-consuming, and prone to subjective errors, making them infeasible for large-scale applications [4]. Furthermore, manual counting struggles in complex underwater environments where occlusions, varying lighting conditions, and high-density fish populations introduce substantial inaccuracies [5]. The increasing demand for precise and efficient fish population monitoring has led to the development of automated fish counting systems, which leverage computer vision and deep learning to significantly enhance accuracy and scalability.
Recent advancements in computer vision and deep learning have significantly improved the accuracy and efficiency of automatic fish counting systems [6]. Existing methods can be categorized into three primary approaches: density estimation, object detection and segmentation-based counting, and tracking-based counting.
Density-based methods estimate fish abundance by predicting a continuous density map across the image and integrating its values to approximate the number of fish present. These methods are particularly effective in environments where fish are densely packed and individual segmentation is impractical. Rather than detecting each fish directly, they model a spatial distribution of “fish likelihood” over the scene, typically using regression-based convolutional neural networks. For instance, Liu et al. [7] proposed LGSDNet, a novel deep learning model that aggregates local-global contextual information and incorporates a self-distillation mechanism to enhance the counting performance in deep-sea aquaculture settings. Their method demonstrated improved resilience to low visibility, occlusions, and background noise commonly found in underwater scenarios. Similarly, Zhu et al. [8] developed a semi-supervised density estimation framework leveraging proxy maps and limited annotations to count fish in recirculating aquaculture systems, enabling effective deployment in data-scarce conditions. Density estimation models are generally lightweight, robust to occlusions, and effective for estimating group-level abundance. However, they cannot distinguish individuals or fish species and are therefore unsuitable for behavior tracking or species-level population analysis.
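To make the counting step concrete, the following minimal sketch (our illustration, not code from [7] or [8]) shows how a count is recovered from a predicted density map: since each annotated fish contributes unit mass to the training target, summing the map yields the estimate.

```python
import torch

def count_from_density_map(density: torch.Tensor) -> float:
    """Integrate a predicted density map to estimate the fish count.

    `density` is an (H, W) tensor in which each fish contributes a
    unit-mass blob, so the sum approximates the total count.
    """
    return density.sum().item()

# Hypothetical stand-in for a regression CNN's output on one frame.
density_map = torch.zeros(480, 640)
density_map[100:110, 200:210] = 0.01  # one blob summing to ~1 fish
print(f"Estimated count: {count_from_density_map(density_map):.2f}")
```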
Object detection and segmentation methods aim to identify and localize individual fish by drawing bounding boxes or generating pixel-wise segmentation masks for each instance. These approaches typically employ advanced object detection frameworks like YOLO or Faster R-CNN, or segmentation architectures like Mask R-CNN, which are trained on annotated datasets containing object-level fish labels. For example, Hong Khai et al. [9] used Mask R-CNN to detect and count fish in underwater scenes. Their model achieved accurate detection even in cluttered or low-contrast environments, demonstrating the effectiveness of instance segmentation for complex aquatic imagery. In another study, Arvind et al. [10] applied a deep instance segmentation approach using Mask R-CNN in pisciculture environments, achieving reliable tracking and individual identification even under partial occlusion or overlapping schools of fish. These methods offer fine-grained localization, species-level classification, and compatibility with downstream tasks like behavior analysis or biometric measurement. However, because they operate frame-by-frame without temporal association, they are prone to counting the same fish multiple times across consecutive frames, especially in continuous video sequences. This repeated detection without identity tracking is a critical limitation in real-world fish population monitoring tasks, where accurate cumulative counts are essential.
Tracking-based methods utilize object tracking algorithms to associate fish detections across video frames, allowing for identity-preserving, duplicate-free counting. These approaches typically couple real-time detectors (e.g., YOLO) with multi-object tracking (MOT) algorithms (e.g., SORT, ByteTrack), enabling both fish population monitoring and motion behavior analysis over time. For example, Kandimalla et al. [11] developed an automated fish monitoring pipeline combining YOLOv4 detection with the Norfair tracker. Their system effectively detected, classified, and counted fish in fish passage videos while maintaining object identity, even in scenarios with frequent occlusion and re-entry. Additionally, Liu et al. [12] proposed a dynamic tracking-based method tailored for counting Micropterus salmoides fry in highly occluded environments. By employing appearance-based association and fast target matching, their model achieved robust real-time tracking in challenging aquaculture conditions. Tracking-based methods are particularly valuable in long-term observation, fish movement tracking, and non-redundant population estimation.
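The cumulative-counting logic shared by these systems can be sketched in a few lines: regardless of the detector/tracker pair, each fish is counted once per unique track identity rather than once per frame. The snippet below is a generic illustration, not code from [11] or [12].

```python
from typing import Iterable, List

def cumulative_count(track_ids_per_frame: Iterable[List[int]]) -> int:
    """Count fish without duplicates by counting the unique track IDs
    emitted by any MOT algorithm (SORT, ByteTrack, OC-SORT, ...)."""
    seen = set()
    for ids in track_ids_per_frame:
        seen.update(ids)
    return len(seen)

# Toy sequence: fish 1 leaves the view and later re-enters with the
# same preserved identity, so it is not counted twice.
frames = [[1, 2], [1, 2, 3], [2, 3], [1, 3]]
print(cumulative_count(frames))  # 3 (naive per-frame summing gives 9)
```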
Despite the advantages of tracking-based counting, existing methods still face significant challenges, including identity switches (ID switches), occlusions, and environmental variations [13]. Current tracking frameworks such as SORT and DeepSORT lack robustness in handling high-density fish schools and suffer from tracking failures in challenging underwater scenarios [14]. Additionally, occlusions and rapid fish movement lead to identity fragmentation, affecting counting accuracy. Addressing these challenges requires an improved tracking framework capable of maintaining consistent fish identities across frames while minimizing false associations [15].
This paper proposes an advanced fish counting framework that integrates a modified YOLOv8-based detection model, termed YOLOv8-DT, with an enhanced object tracking algorithm, Byte-OCSORT. YOLOv8-DT incorporates the Deformable Large Kernel Attention Cross Stage Partial (DLKA CSP) module to enhance feature extraction for irregularly shaped fish and the Triple Detail Feature Infusion (TDFI) module to improve small-object detection under occlusions. Byte-OCSORT extends the OC-SORT tracking framework by incorporating ByteTrack’s high-confidence and low-confidence detection matching strategy while integrating class constraints for multi-species tracking. This novel approach improves tracking robustness, reduces ID-switch occurrences, and enhances counting accuracy in complex underwater environments. The key contributions of this paper are as follows:
We propose a modified YOLOv8-based detection model, termed YOLOv8-DT, by integrating DLKA CSP and TDFI modules into the YOLOv8n backbone to enhance small-object detection and robustness in occluded conditions. DLKA CSP improves feature extraction by dynamically adjusting the receptive field using deformable convolutions and large kernel attention, enhancing adaptability to irregular fish shapes and varying underwater conditions. TDFI refines multi-scale feature fusion by integrating high-resolution spatial details with global semantics, improving small-object detection and occlusion handling;
We introduce Byte-OCSORT, an enhanced tracking algorithm that combines ByteTrack’s two-stage matching strategy with OC-SORT’s motion estimation for improved multi-species tracking. A Class-Aware Cost Matrix (CCM) prevents ID mismatches between different fish species, reducing tracking errors in dense and diverse aquatic environments. The two-stage association strategy ensures robust tracking by retaining low-confidence detections, minimizing identity loss, and improving long-term tracking stability (a minimal sketch of the class-aware cost is given after this list);
We conduct extensive experiments to compare YOLOv8-DT and Byte-OCSORT with state-of-the-art models on real-world fish counting datasets. YOLOv8-DT achieves higher mAP and precision than baseline models, particularly in small-object detection. Byte-OCSORT significantly reduces ID-switch occurrences and enhances tracking robustness, especially in high-density and multi-species environments.
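To illustrate the Class-Aware Cost Matrix referenced in the second contribution, the sketch below shows one common way to embed a class constraint in association: add a prohibitive penalty to the geometric (IoU-based) cost wherever a track's species label disagrees with a detection's. The penalty value and exact structure are our assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def class_aware_cost(iou: np.ndarray, track_cls: np.ndarray,
                     det_cls: np.ndarray, penalty: float = 1e6) -> np.ndarray:
    """Association cost that forbids cross-species matches.

    `iou` is (num_tracks, num_dets); `track_cls`/`det_cls` hold integer
    species labels. Mismatched pairs receive a prohibitive penalty so
    the Hungarian solver never associates different species.
    """
    cost = 1.0 - iou                                  # lower = better match
    cost[track_cls[:, None] != det_cls[None, :]] += penalty
    return cost

iou = np.array([[0.8, 0.6], [0.2, 0.7]])
cost = class_aware_cost(iou, track_cls=np.array([0, 1]),
                        det_cls=np.array([0, 1]))
rows, cols = linear_sum_assignment(cost)              # optimal assignment
print(list(zip(rows, cols)))                          # [(0, 0), (1, 1)]
```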
3. Experiments and Discussion
3.1. Datasets and Augmentation
The dataset used for object detection consists of fish images captured by the Coral Ecosystem Cabled Observatory (CECO) in Dongshan, China, annotated in the YOLO format with bounding boxes. It contains 19 distinct fish species, with a total of over 40,000 annotated individual fish targets across several thousand underwater frames. However, as shown in Figure 6, the dataset exhibits a highly skewed distribution in the number of annotated fish targets per species. For instance, Neopomacentrus violascens accounts for over 10,000 individual bounding box annotations, while rare species such as Stephanolepis cirrhifer are represented by as few as 12 annotated targets. This long-tailed distribution poses a significant challenge to training fair and generalizable detection and tracking models.
To address this issue and improve model generalization, data augmentation was applied to all fish species with fewer than 500 images, increasing their sample sizes to at least 500 images per class. The augmentation strategies included horizontal flipping to simulate different fish orientations, optical distortion to mimic lens-warping effects in underwater footage, grid distortion to introduce realistic shape variations, and elastic transformation to enhance adaptability to non-rigid deformations. Motion blur was also applied to simulate fish movement in dynamic water conditions, along with random brightness and contrast adjustment to adapt to varying lighting conditions. Additionally, RGB shift was used to imitate color distortions caused by underwater environments, and safe cropping was introduced to ensure data diversity while preserving object integrity. These eight augmentation techniques were combined to simulate diverse underwater conditions, as shown in Figure 7, allowing the model to effectively detect fish across different environmental scenarios.
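The paper does not name the augmentation library; all eight operations are available in, for example, Albumentations, so a plausible pipeline (illustrative probabilities and limits, not the authors' exact settings) could be composed as follows:

```python
import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                    # different orientations
        A.OpticalDistortion(p=0.3),                 # lens-warping effects
        A.GridDistortion(p=0.3),                    # shape variations
        A.ElasticTransform(p=0.3),                  # non-rigid deformations
        A.MotionBlur(blur_limit=7, p=0.3),          # fish movement
        A.RandomBrightnessContrast(p=0.5),          # lighting changes
        A.RGBShift(p=0.3),                          # underwater color casts
        A.RandomSizedBBoxSafeCrop(height=640, width=640, p=0.3),  # safe crop
    ],
    # YOLO-format boxes are transformed together with the image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = transform(image=img, bboxes=boxes, class_labels=labels)
```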
To ensure fair model evaluation, stratified sampling was used to divide the dataset into training, validation, and test sets, ensuring that each species is proportionally represented in each subset. The dataset was split into 80% training set for model training, 10% validation set for hyperparameter tuning and model selection, and 10% test set for final performance evaluation. This stratified sampling approach prevents class imbalance issues from affecting model performance, ensuring that the network learns to detect both abundant and rare species effectively.
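A stratified 80/10/10 split of this kind can be produced with scikit-learn by splitting twice; this sketch assumes one species label per image (frames containing multiple species would need, e.g., iterative stratification):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins; in practice these come from the annotation files.
image_ids = list(range(1900))
species = [i % 19 for i in range(1900)]  # 19 classes, as in the dataset

# First carve out 80% for training, then split the remaining 20%
# evenly into validation and test, stratifying on species each time.
train_ids, rest_ids, _, rest_y = train_test_split(
    image_ids, species, test_size=0.2, stratify=species, random_state=0)
val_ids, test_ids = train_test_split(
    rest_ids, test_size=0.5, stratify=rest_y, random_state=0)
```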
For object tracking, a single underwater test video was used, recorded at 1920 × 1080 resolution, 30 FPS, and approximately 5 min in duration. To ensure comprehensive evaluation, this video was carefully selected to contain four typical tracking scenarios within different segments, including fast motion, multi-scale fish, dense schools with occlusions, and sparse distributions. These embedded scenarios reflect diverse real-world conditions in marine environments. The video was fully annotated in MOT17 format, with frame-level bounding boxes, unique object IDs, and trajectory associations, enabling precise and fair benchmarking of Byte-OCSORT against other tracking algorithms.
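For reference, each line of a MOT17-style ground-truth file encodes one box with its frame index and persistent identity; a minimal parser (field order per the MOTChallenge convention) looks like:

```python
def parse_mot17_line(line: str) -> dict:
    """Parse one MOT17 ground-truth record with the fields:
    frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility
    """
    f, tid, x, y, w, h, conf, cls, vis = line.strip().split(",")
    return {"frame": int(f), "id": int(tid),
            "box": (float(x), float(y), float(w), float(h)),
            "conf": float(conf), "class": int(cls),
            "visibility": float(vis)}

print(parse_mot17_line("1,3,912.0,484.0,97.0,109.0,1,1,0.9"))
```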
3.2. Implementation Details
The experiments were conducted on an NVIDIA GeForce RTX 3060 (12 GB) GPU with PyTorch 2.2.2, Python 3.10, and CUDA 11.8. The model was trained from scratch without pretrained weights. The input image size was set to 640 × 640 and the training batch size to 32. Training lasted 150 epochs, using Stochastic Gradient Descent (SGD) as the optimizer with a momentum of 0.937, an initial learning rate of 0.001, and a weight decay of 0.0005.
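Using the Ultralytics API, an equivalent run for the stock YOLOv8n baseline would look roughly as follows; the custom YOLOv8-DT modules would require a modified model YAML, and "fish.yaml" is a hypothetical dataset config name:

```python
from ultralytics import YOLO

# Build from the architecture spec (no pretrained weights) and train
# with the hyperparameters reported above. "fish.yaml" is hypothetical.
model = YOLO("yolov8n.yaml")
model.train(
    data="fish.yaml",
    imgsz=640,
    batch=32,
    epochs=150,
    optimizer="SGD",
    lr0=0.001,
    momentum=0.937,
    weight_decay=0.0005,
    pretrained=False,
)
```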
3.3. Detection Results
First, we compare YOLOv8-DT with several baseline object detection models, including YOLOv5n, YOLOv6n, YOLOv8n, and MobileNetV4-SSD.
Table 1 presents the quantitative results in terms of params, precision, recall, mAP50, mAP50:95, and APs (small). Precision and recall are fundamental metrics in object detection, defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where $TP$ (true positives) refers to correctly detected objects, $FP$ (false positives) are incorrect detections, and $FN$ (false negatives) represent missed detections. YOLOv8-DT outperforms all other models across all metrics, achieving the highest mAP50 (0.971) and mAP50:95 (0.742), surpassing YOLOv8n by 1.4% and 4.2%, respectively. The mean Average Precision (mAP) is computed as:

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\, dR,$$

where $P_i(R)$ represents the precision–recall curve of category $i$, and $N$ denotes the number of object categories. mAP50 evaluates performance at an IoU (Intersection over Union) threshold of 0.5, while mAP50:95 averages results over multiple IoU thresholds (from 0.5 to 0.95 with a step of 0.05). We also report APs (small), which measures the average precision for small objects (area < $32^2$ pixels), based on the COCO evaluation protocol. This metric reflects the model’s ability to detect small-sized fish, such as juveniles or distant targets, and is computed over the same IoU thresholds as mAP50:95. Furthermore, YOLOv8-DT achieves the highest precision (0.95) and recall (0.947), demonstrating its strong detection capability under complex underwater conditions. Particularly in APs (small), YOLOv8-DT reaches 0.71, a 5.9% improvement over YOLOv8n, highlighting its superior small-object detection performance.
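As a quick sanity check on these definitions, plugging illustrative counts into the equations (numbers chosen here only to reproduce the reported precision and recall, not taken from the paper) gives:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from detection counts (equations above)."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts: 947 correct detections, 50 spurious, 53 missed.
p, r = precision_recall(947, 50, 53)
print(f"precision={p:.3f}, recall={r:.3f}")  # precision=0.950, recall=0.947
```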
Figure 8 provides qualitative detection results for different models.
3.4. Tracking Results
For the tracking evaluation, Byte-OCSORT was compared against mainstream tracking algorithms, including SORT, ByteTrack, and OC-SORT. The results, presented in Table 2, show that Byte-OCSORT achieves the highest MOTA (72.3) and IDF1 (69.4) while significantly reducing ID switches to 16, demonstrating its ability to maintain stable fish identities across frames. Although we did not explicitly compute species-specific classification or counting errors, the use of class labels in the association process via the Class Cost indirectly reflects classification consistency. Therefore, tracking performance, especially ID switches, serves as a proxy for evaluating the reliability of classification-based counting across species. The Multiple Object Tracking Accuracy (MOTA) is defined as:

$$\mathrm{MOTA} = 1 - \frac{FN + FP + IDSW}{GT},$$

where $FN$ and $FP$ represent false negatives and false positives, $IDSW$ denotes identity switches, and $GT$ is the ground-truth number of objects. The IDF1 score measures the accuracy of identity preservation in tracking, defined as:

$$\mathrm{IDF1} = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN},$$

where $IDTP$, $IDFP$, and $IDFN$ represent true positive, false positive, and false negative identity assignments, respectively. In comparison, SORT suffers from frequent ID switches (108) due to its reliance solely on motion information, while ByteTrack improves stability by integrating high-confidence and low-confidence detection association, reducing ID switches to 67. OC-SORT further refines motion estimation, leading to a MOTA of 67.8 and 35 ID switches, improving tracking robustness. However, Byte-OCSORT, which integrates the advantages of both the ByteTrack two-stage matching strategy and category constraints, outperforms all competitors, demonstrating superior tracking robustness and accuracy.
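Both metrics translate directly into code; the helpers below mirror the equations above, with hypothetical counts (not the paper's raw numbers) that happen to reproduce the reported scores:

```python
def mota(fn: int, fp: int, idsw: int, gt: int) -> float:
    """Multiple Object Tracking Accuracy, as defined above."""
    return 1.0 - (fn + fp + idsw) / gt

def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1: F1 score over identity assignments, as defined above."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

# Hypothetical counts for one evaluation run.
print(f"MOTA = {mota(fn=220, fp=40, idsw=17, gt=1000):.3f}")  # 0.723
print(f"IDF1 = {idf1(idtp=680, idfp=300, idfn=300):.3f}")     # 0.694
```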
3.5. Ablation Study
To evaluate the contributions of the proposed modules in both detection and tracking components, we conducted ablation studies on YOLOv8-DT and Byte-OCSORT.
For the detection component, we tested the impact of integrating the DLKA CSP and TDFI modules into the YOLOv8n baseline.
Table 3 summarizes the experimental results of YOLOv8n, YOLOv8n-DLKA, YOLOv8n-TDFI, and the complete YOLOv8-DT. The results demonstrate that incorporating DLKA CSP increases mAP50 from 0.957 to 0.965 and mAP50:95 from 0.700 to 0.725, indicating that DLKA CSP improves object localization accuracy. On the other hand, TDFI enhances recall from 0.928 to 0.946 and APs (small) from 0.671 to 0.695, showing that it is particularly effective in detecting small objects and improving feature fusion across scales. When combining both modules in YOLOv8-DT, the model reaches the highest overall performance across all metrics, confirming the complementary benefits of these enhancements. The effectiveness of these modules is further visualized in Figure 9, where the heatmaps illustrate that YOLOv8-DT generates more concentrated attention regions on fish targets, indicating improved feature representation and higher detection confidence.
Regarding the tracking component, we examined the effects of introducing the BYTE association strategy and the Class-Aware Cost Matrix to OC-SORT.
Table 4 compares OC-SORT, OC-SORT + BYTE (removing class constraints), OC-SORT + Class Cost (removing two-stage matching), and the full Byte-OCSORT model. The results indicate that adding two-stage matching (OC-SORT + BYTE) improves MOTA from 67.8 to 70.5 but only slightly increases IDF1 from 62.1 to 63.2, suggesting that it primarily improves overall tracking accuracy but has a limited impact on identity consistency. Conversely, introducing class constraints (OC-SORT + Class Cost) raises IDF1 to 66.5 and reduces ID switches to 25, proving effectiveness in minimizing ID mismatches among different fish species. The full Byte-OCSORT model, which combines both improvements, achieves the best tracking performance with a MOTA of 72.3, IDF1 of 69.4, and only 16 ID switches, highlighting the necessity of both strategies for achieving optimal tracking stability.
Figure 10 illustrates Byte-OCSORT’s tracking results in a real-world video scenario.
4. Conclusions
In this paper, we propose an enhanced fish counting framework that integrates YOLOv8-DT for object detection and Byte-OCSORT for multi-object tracking. The framework is specifically designed for underwater fish counting tasks, addressing challenges such as occlusions, varying fish shapes, and dense distributions. By leveraging a track-by-detection strategy, the proposed system ensures accurate trajectory tracking and prevents duplicate counting, enabling robust and precise population estimation.
The YOLOv8-DT model incorporates two key improvements: the DLKA CSP module and the TDFI module. The DLKA CSP module dynamically adjusts the receptive field, enhancing feature extraction to better capture irregular fish shapes and posture variations. Meanwhile, the TDFI module refines multi-scale feature fusion by integrating high-resolution spatial details with global contextual information, significantly improving small-object detection and reducing false positives caused by motion blur.
Experimental results demonstrate that YOLOv8-DT outperforms existing models, achieving a mAP50 of 0.971 and a mAP50:95 of 0.742, with notable improvements in precision and recall. Meanwhile, Byte-OCSORT integrates a two-stage matching strategy with class-aware association, improving tracking robustness. It achieves the highest MOTA (72.3) and IDF1 (69.4), with the lowest number of ID switches (16), proving its effectiveness in reducing identity mismatches.
Despite the improvements, our method has certain limitations. First, the computational complexity of YOLOv8-DT is slightly higher than the baseline YOLOv8n, which may hinder real-time deployment in embedded systems. Second, the framework’s performance under varying environmental conditions—such as water turbidity and suspended sediment concentration—has not been quantitatively evaluated, although these factors can significantly impact detection visibility and accuracy. Finally, the generalization of the model to diverse marine species still requires broader validation to ensure robustness across different underwater ecosystems.
To address these limitations, future work will focus on optimizing computational efficiency for real-time deployment, expanding the dataset to include a wider range of species and environmental conditions, and incorporating adaptive modules or domain adaptation strategies to improve robustness under challenging water quality scenarios. Additionally, self-supervised learning strategies will be investigated to improve the model’s generalization capability while alleviating the dependency on extensive labeled data. These enhancements aim to improve both scalability and practicality of the proposed system in real-world marine monitoring applications.