1. Introduction
In recent years, fisheries and aquaculture production have grown substantially, contributing significantly to the global protein supply and economic livelihoods [1]. According to the Food and Agriculture Organization of the United Nations (FAO), total aquatic production increased from 19 million tonnes in 1950 to a record high of over 185 million tonnes in 2022, representing an average annual growth rate of 3.2 percent [2]. The total first sale value was estimated at USD 452 billion in 2022, with aquaculture accounting for USD 296 billion. However, this rapid expansion has also raised critical concerns regarding disease outbreaks, which can cause economic losses estimated at billions of dollars annually due to mortality, treatment costs, and trade restrictions [3]. During fish farming, the complex and often unpredictable underwater environment can lead to various issues, such as diseases, pollution, and parasites, affecting fish growth [4,5]. Moreover, in high-density farming environments, infectious diseases can spread rapidly among populations, leading to large-scale outbreaks, collective infections, and potentially irreversible economic losses [1,6]. Therefore, the early and accurate detection of fish diseases and abnormal behaviors is crucial for preventing widespread outbreaks, minimizing economic impacts, ensuring animal welfare, and preventing pathogen transmission to both farmed and wild fish populations [7].
Traditional fish health monitoring relies primarily on manual visual inspection by human experts as an initial screening step, complemented by laboratory analysis, microbiological testing, environmental assessment, and veterinary expertise for definitive disease diagnosis. While comprehensive diagnostic procedures remain essential for accurate etiological identification and clinical decision-making, routine visual monitoring in large-scale aquaculture systems is time-consuming, labor-intensive, and difficult to perform consistently [1,7]. Moreover, early-stage symptoms, such as subtle skin discolorations, fin erosion, or behavioral anomalies, can be easily overlooked during visual inspection, especially in challenging aquaculture environments characterized by variable lighting, turbid water, or fish occlusion [4,5,6,7]. These limitations motivate the development of automated computer vision-based tools for early symptom screening and continuous health monitoring, which can assist aquaculture operators by rapidly flagging visually observable abnormalities for further expert examination.
Computer vision techniques have been widely adopted for detecting fish diseases and abnormal behaviors in aquaculture, offering non-invasive, efficient methods for real-time monitoring [8]. With the rapid advancements in deep learning, these techniques have become more accurate and automated, allowing for the detection of subtle behavioral changes in fish [5]. Convolutional neural networks (CNNs) and object detection frameworks, such as You Only Look Once (YOLO) and Faster R-CNN, offer transformative potential for enhanced fish disease detection and behavioral analysis in intensive aquaculture environments [7,8,9,10,11]. These models can learn complex patterns and features from visual data with high accuracy and speed. For instance, YOLO is a well-known end-to-end model that can simultaneously detect and classify fish diseases and abnormal behaviors in a single pass, offering real-time processing capabilities that are crucial for large-scale aquaculture monitoring [12]. However, while such lightweight models are efficient, they may lack the fine-grained accuracy required for detecting subtle diseases or complex behavior patterns, especially under challenging conditions like occlusion or variations in fish species [7]. On the other hand, two-stage models, such as Faster R-CNN, are better suited to settings where detection accuracy is paramount, trading inference speed for precision. In recent years, numerous hybrid models for fish disease detection, tracking, and behavior analysis have been proposed in the literature [7,13,14,15]. Nevertheless, achieving an optimal balance between high detection performance and a low computational cost remains a significant challenge, particularly for real-time applications on embedded and edge devices.
Although deep learning has substantially improved fish disease detection, tracking, and behavior analysis, detecting visually subtle disease symptoms remains difficult. The main obstacles are the inherent characteristics of early-stage disease manifestations, limitations in the available data, and complexities introduced by the aquatic environment, all of which affect model performance and reliability [9]. For example, early-stage diseases often exhibit low visual distinctiveness, such as minor discolorations, slight fin damage, or subtle behavioral changes, making them difficult to distinguish from normal biological variations or imaging noise [9]. Moreover, the symptoms of the same disease can vary significantly between fish species, and, conversely, different diseases might present very similar visual cues, making it difficult to develop generalized disease detection models [2]. Furthermore, deep learning models generally require extensive, diverse, and meticulously annotated datasets to achieve robust performance and generalization capabilities. However, for fish diseases, such comprehensive datasets are often scarce and limited in scope [4,7,9]. Class imbalance further complicates the problem when dealing with subtle symptoms. In typical aquaculture settings, healthy samples significantly outnumber diseased instances, especially early-stage cases, which biases models toward majority classes and degrades their sensitivity to minority yet potentially critical conditions [9,16]. Environmental factors inherent to aquaculture systems also introduce additional complexities, such as fluctuating lighting conditions, water turbidity, occlusion, and motion-induced blur, which further obscure discriminative features and reduce image quality [17]. In addition, existing approaches exhibit an inherent trade-off between detection accuracy and computational efficiency, making it challenging to achieve both high precision and real-time performance simultaneously [9].
Modern aquaculture increasingly embraces species diversification to enhance resilience and mitigate disease-related risks. Although aquaculture involves over 448 species globally, approximately 90% of production is concentrated on just 46 species [18], creating significant vulnerability to species-specific outbreaks. Most existing computer vision systems for fish health monitoring are developed on single-species datasets, limiting their applicability in diversified operations. Multi-species farming further complicates health monitoring, as disease symptoms manifest differently across species due to morphological variability, and early-stage pathological indicators often require species-aware interpretation [19]. This has driven the growing demand for universal computer vision solutions capable of operating across multiple species with minimal reconfiguration. In view of these issues, we propose a novel hybrid framework for joint fish disease and species identification that directly models the spatial–sequence nature of fish surface features. This spatial–sequence paradigm is inspired by the prior successes of CNN–BiLSTM in medical imaging [20] and plant disease classification [21], where the spatial relationships between patches are diagnostically meaningful, and global pooling eliminates such information.
The main contributions of this paper are summarized as follows.
We propose a hybrid ConvNeXt–BiLSTM–multi-head self-attention (MHSA) architecture that transforms a convolutional feature map into a spatially structured attention sequence, enabling the effective modeling of both local spatial patterns and long-range dependencies. The model outperforms the CNN, Transformer, and CNN–LSTM baselines, achieving 94.33% accuracy and a 0.9264 macro F1-score on 21 classes.
We introduce an auxiliary species classification head with multi-task learning, which enforces species-aware feature learning and reduces cross-species confusion without additional inference costs.
We design a comprehensive training strategy to address severe class imbalance, integrating the square-root inverse-frequency focal loss, targeted oversampling (up to 15×) for challenging classes, a weighted random sampling scheme, five-view test-time augmentation, and a two-stage progressive training schedule with differential learning rates.
We develop an end-to-end detection–classification framework by integrating YOLOv8s with the proposed ConvNeXt–BiLSTM–MHSA classifier. The resulting system achieves state-of-the-art performance with precision of 97.1%, recall of 90.9%, mAP50 of 96.1%, and 213.3 FPS, demonstrating a strong accuracy–efficiency trade-off for real-time aquaculture monitoring.
The rest of this paper is organized as follows. Section 2 briefly reviews the related literature on fish disease detection and classification. Section 3 presents the proposed hybrid ConvNeXt–BiLSTM–MHSA framework. Section 4 describes the training strategy and optimization techniques. Section 5 details the experimental setup, followed by the results and analysis in Section 6. Finally, Section 7 concludes this paper.
2. Related Work
Early fish disease detection methods relied on handcrafted image processing techniques, including preprocessing (noise reduction, contrast enhancement, segmentation) followed by manual feature extraction using color, texture, and morphological descriptors such as GLCM and LBP. To improve the performance, these handcrafted features were later integrated with classical machine learning classifiers, including support vector machines (SVMs), k-nearest neighbors (k-NN), decision trees, and random forests. Although moderately effective, these approaches are inherently limited by their dependence on manually engineered features and their sensitivity to environmental variability. Consequently, recent research has shifted toward deep learning, which enables automatic feature extraction and improved robustness.
Unlike image processing- and machine learning-based methods, deep learning-based approaches to fish disease detection can extract more effective and discriminative features of subtle diseases in a wider range of environments and achieve superior performance under complex aquaculture scenarios. Deep learning-based fish disease detection has evolved from classification models to advanced object detection frameworks. Early CNN-based studies, such as [22,23,24], demonstrated promising classification accuracy (up to 96.7%) but were limited to image-level predictions without localizing disease regions. Moreover, these models relied on limited and non-standard datasets, which restricted their generalizability to real-world aquaculture environments.
Current work increasingly adopts object detection architectures, particularly YOLO and Faster R-CNN, to enable real-time and non-destructive disease detection. One-stage detectors (e.g., YOLO) offer high inference speeds suitable for real-time aquaculture monitoring, whereas two-stage detectors (e.g., Faster R-CNN) provide higher precision for subtle disease features. Several studies have focused on improving YOLO-based models. For instance, ref. [25] integrated MobileNetV3 and GELU into YOLOv4, improving both the mAP and inference speed, although dataset limitations restrict the generalization of the model across different fish species, disease types, and real-world underwater conditions. Similarly, DFYOLO [26] enhanced YOLOv5m with lightweight modules and attention mechanisms, achieving 99.7% accuracy but focusing only on detection. Other improvements include CBFW-YOLOv8 [27], YOLOv7 with a normalization-based attention module and image enhancement [28], YOLO-FD with segmentation capabilities [1], and YOLO-TPS [29], all targeting better multi-scale feature extraction and small lesion detection. However, these methods remain sensitive to occlusion and extreme underwater conditions.
To address fine-grained feature extraction, two-stage approaches have also been explored. The RVFL-FR-CNN model [30] improves classification within Faster R-CNN, while VMI-ATN-RCNN [31] achieves state-of-the-art performance in detection and segmentation but at a high computational cost. RT-GalaDet [2] provides a lightweight alternative with balanced accuracy and speed. It improves the existing RT-DETR framework with state-space modeling, local feature enhancement, and lightweight neck compression. Although it achieves competitive precision, recall, and mAP50 at 51.98 FPS, several limitations persist. First, the model underperforms on eye defect and fin defect classes due to their compact scale and high intra-class variability. Second, the achieved throughput is insufficient for multi-camera aquaculture systems, which typically require >100 FPS. Third, conventional CNN- and Transformer-based methods inadequately capture spatially structured and context-dependent disease patterns along the fish body, limiting fine-grained discrimination.
Huang et al. [32] proposed a hybrid CNN-based framework, namely CNN-OSELM, that combines multilayer feature fusion, attention mechanisms, and an online sequential extreme learning machine to improve fish disease recognition in complex underwater environments. It enhances the feature extraction and classification efficiency, achieving strong performance (94.28% accuracy) on a custom dataset, particularly after background elimination. More recently, Transformer-based models have emerged, such as DeformAtt-ViT for feeding behavior analysis [33] and TFMFT for multi-fish tracking [34], demonstrating improved capabilities in modeling complex temporal dependencies. However, these methods [33,34] are computationally complex and are typically developed independently of disease detection frameworks.
Table 1 summarizes recent studies spanning convolutional, attention-based, recurrent, Transformer, and hybrid architectures across diverse aquaculture datasets. Early CNN-based methods established solid baselines for single-species identification, whereas later research introduced attention mechanisms, multi-task learning, and real-time detection to address complex symptom differentiation, severe class imbalance, and limited aquatic imaging data. The trend has moved from purely convolutional models to hybrid spatial–sequence and Transformer-based architectures, reflecting the increasing recognition that accurate multi-species, multi-disease detection requires both fine-grained local texture analysis and global contextual understanding.
Despite these advances, existing works primarily focus on either fish detection or species and disease classification in isolation, with limited exploration of unified frameworks capable of jointly modeling these interrelated tasks in a single end-to-end architecture for marine aquaculture. Moreover, many approaches rely heavily on detection-based pipelines, which, while effective for localization, often overlook global contextual relationships and inter-region dependencies; these are critical for distinguishing subtle and visually similar disease patterns across different fish species. Although systems such as RT-GalaDet [2] attempt to jointly model species identities and lesion types within a unified detection framework, such approaches remain relatively underexplored and are still constrained by challenges related to fine-grained lesion discrimination, cross-species variability, and robustness under complex underwater environments. Olsen et al. [42] demonstrated that computer vision models for Saprolegnia detection in salmonids experience significant performance degradation when transferred across host genera, with the MCC dropping from 0.96 to 0.53–0.71 due to morphological variability. Similarly, Alnemari et al. [43] identified limited cross-species knowledge transfer as a critical barrier to the commercial deployment of automated fish disease detection systems. These findings further emphasize the necessity of robust and generalizable modeling strategies for real-world aquaculture applications. Furthermore, challenges such as class imbalance and fine-grained feature discrimination remain insufficiently addressed in the current literature. Hybrid CNN–BiLSTM–attention architectures combine local feature extraction, sequential dependency modeling, and global context learning, making them well suited for fine-grained classification in data-limited settings. Prior works in medical imaging, spatiotemporal analysis, and plant disease recognition demonstrate their superiority in capturing subtle structural patterns. This complementary integration of sequential and attention mechanisms motivates the CNN–BiLSTM–MHSA design adopted in this work.
3. Proposed Method: Hybrid ConvNeXt–BiLSTM–MHSA Framework
Figure 1 presents an overview of the proposed hybrid ConvNeXt–BiLSTM–MHSA framework, which follows a two-stage design comprising fish instance detection and fine-grained disease–species classification. In the first stage, a YOLOv8s-based detector is employed to localize individual fish instances in raw underwater frames. In the second stage, the detected regions of interest are passed to a deep classification network that performs simultaneous fish species identification and disease recognition. The classification stage consists of a sequential processing pipeline including image preprocessing, hierarchical feature extraction using a ConvNeXt backbone, spatial–sequence dependency modeling via a BiLSTM encoder, and global context refinement using an MHSA module. The resulting representations are then used for multi-task prediction through parallel classification heads. The overall framework is further supported by a comprehensive learning strategy that includes class imbalance mitigation, loss optimization, progressive two-stage training, and test-time augmentation to enhance the robustness under challenging underwater imaging conditions such as low visibility, color distortion, and fine-grained inter-class similarity.
3.1. Proposed Architecture and System Pipeline
The proposed framework follows a two-stage decoupled design that separates fish localization from fine-grained pathological classification, addressing the competing optimization objectives that typically affect monolithic detection architectures. In the first stage, a YOLOv8s detector processes raw input frames captured from underwater cameras and generates a set of bounding box predictions localizing each fish instance within the scene. Each detected bounding box is cropped and resized to a standardized resolution of 224 × 224 pixels, producing a region of interest suitable for downstream fine-grained analysis. In the second stage, each cropped fish image is forwarded to the ConvNeXt–BiLSTM–MHSA classifier, which produces simultaneous predictions for both the species identity and the disease state. The classifier itself is composed of six tightly integrated sub-modules: (i) a Contrast-Limited Adaptive Histogram Equalization (CLAHE)-based preprocessing block that enhances the image contrast under variable underwater illumination; (ii) a ConvNeXt-Small convolutional backbone that extracts 768-dimensional feature descriptors over a 7 × 7 spatial grid; (iii) a two-layer bidirectional LSTM encoder that models sequential dependencies across the flattened spatial tokens; (iv) a four-head self-attention block that performs non-sequential cross-patch reasoning; (v) an attention-weighted pooling head that aggregates the token sequence into a single discriminative embedding; and (vi) dual classification heads for disease and species prediction with multi-task learning. This decoupled design enables each architectural component to specialize fully in its designated task, while the shared backbone representation and auxiliary species supervision provide implicit regularization that improves generalization on the minority disease classes.
3.2. CLAHE-Based Image Preprocessing
Aquaculture underwater imagery is affected by significant visual degradation, such as wavelength-dependent light absorption, light scatter from suspended particles, and non-uniform illumination from artificial light sources, which reduces the visibility of subtle pathological patterns. To mitigate these effects and improve the consistency of input data distribution, we apply CLAHE as the first preprocessing step. Unlike global histogram equalization, which operates on the entire image and may amplify noise in homogeneous regions, CLAHE performs local contrast enhancement over small spatial regions while constraining amplification through a predefined clip limit. This enables the effective enhancement of fine structural details while suppressing noise amplification.
Specifically, each input crop $x$ is first converted from RGB to the LAB color space, where the $L$ (luminance) channel is decoupled from the chromatic information $(a, b)$. CLAHE is applied exclusively to the $L$ channel with a predefined clip limit and tile grid size, producing an enhanced luminance channel $L'$ while preserving the original chromatic signals. The enhanced LAB image $(L', a, b)$ is then converted back to RGB, yielding $x_{\mathrm{clahe}}$. To further enhance fine-grained structural information, an unsharp masking operation is applied as follows:
$$x_{\mathrm{sharp}} = x_{\mathrm{clahe}} + \alpha \left( x_{\mathrm{clahe}} - G_{\sigma}(x_{\mathrm{clahe}}) \right)$$
where $G_{\sigma}(\cdot)$ denotes a Gaussian blur with standard deviation $\sigma$, and $\alpha$ controls the sharpening intensity. The subtraction term $x_{\mathrm{clahe}} - G_{\sigma}(x_{\mathrm{clahe}})$ yields a high-pass filtered representation that emphasizes local edge structures such as lesion boundaries, hemorrhagic regions, and fin deformities, which is then added back to the original enhanced image to produce a sharpened output $x_{\mathrm{sharp}}$. This two-step preprocessing pipeline reduces the blue–green color bias that is prevalent in marine images while boosting the subtle textural cues needed for detailed pathological differentiation.
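The unsharp masking step (output = input + sharpening intensity × (input − Gaussian blur of input)) can be illustrated with a minimal pure-Python sketch on a single row of pixel intensities. The values `sigma=1.0` and `alpha=0.5`, the border replication, and the helper names are assumptions chosen for illustration; they are not the settings used in this work, and a real pipeline would operate on 2-D images and clip the result to the valid intensity range.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel covering roughly +/- 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_blur_1d(signal, sigma):
    """Blur a 1-D signal, replicating border values at both ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index to the signal
            acc += w * signal[j]
        blurred.append(acc)
    return blurred

def unsharp_mask(signal, sigma=1.0, alpha=0.5):
    """Add the high-pass residual (signal - blur) back, scaled by alpha."""
    blurred = gaussian_blur_1d(signal, sigma)
    return [x + alpha * (x - b) for x, b in zip(signal, blurred)]

flat = [100.0] * 9                    # homogeneous region: no edges
edge = [50.0] * 5 + [150.0] * 5       # step edge, e.g. a lesion boundary
sharp_flat = unsharp_mask(flat)
sharp_edge = unsharp_mask(edge)
```

In this toy case the homogeneous row is left unchanged, while the step edge gains an undershoot on its dark side and an overshoot on its bright side, which is the edge emphasis the high-pass term is meant to provide.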
3.3. ConvNeXt-Small Feature Extraction Backbone
The feature extraction backbone serves as the primary representation learning module of the proposed classification system, responsible for encoding the preprocessed input into a hierarchical feature map suitable for subsequent sequential processing. We adopt ConvNeXt-Small, a modernized CNN architecture that is competitive with Vision Transformers on image classification while retaining the efficiency and inductive biases of convolutional networks. ConvNeXt-Small is selected over alternative lightweight architectures such as EfficientNet-B2 and MobileNetV3 based on preliminary experiments, as it demonstrates superior capabilities in capturing subtle textural variations. This property is particularly important for distinguishing fine-grained disease patterns that exhibit minimal shape and color differences relative to healthy tissues.
The backbone transforms the preprocessed image in four stages, progressively reducing the spatial resolution while increasing the channel dimensionality. This results in a final feature tensor $F \in \mathbb{R}^{7 \times 7 \times 768}$, where each of the 49 spatial locations encodes a 768-dimensional feature vector combining local texture and high-level semantic information. The effectiveness of ConvNeXt-Small is attributed to several architectural design choices: (i) large-kernel depthwise convolutions that expand the effective receptive field while maintaining computational efficiency; (ii) inverted bottleneck structures inspired by MobileNetV2 that enhance the representational capacity; and (iii) LayerNorm, which replaces BatchNorm to improve training stability, especially with the small batch sizes used in our gradient accumulation-based training approach. The backbone is initialized using ImageNet-22K pretrained weights to enable effective transfer learning from large-scale natural image distributions, which is particularly beneficial given the limited size of the fish disease dataset.
To facilitate sequential modeling in later stages, the 2D spatial feature map is flattened into a token sequence in row-major order, $S = (s_1, \dots, s_{49}) \in \mathbb{R}^{49 \times 768}$, where each token $s_t$ corresponds to a spatial location in the original feature grid. This transformation enables the subsequent BiLSTM and MHSA modules to model spatial relationships as a structured sequence of feature embeddings.
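The row-major flattening can be made concrete with a small sketch. It assumes a total backbone stride of 32 and a 224-pixel input side, which is consistent with the 49 spatial locations stated above; the helper names are hypothetical, and each channel vector is filled with its own grid index purely so that the ordering is easy to inspect.

```python
def grid_size(input_size, total_stride=32):
    """Side length of the final feature grid for a given input side."""
    return input_size // total_stride

def flatten_row_major(feature_map):
    """feature_map[r][c] is a channel vector; tokens are read row by row."""
    return [feature_map[r][c]
            for r in range(len(feature_map))
            for c in range(len(feature_map[r]))]

def token_index(row, col, width):
    """Position of grid cell (row, col) in the flattened token sequence."""
    return row * width + col

side = grid_size(224)
channels = 768
# Toy feature map: every channel of cell (r, c) holds the value r * side + c.
fmap = [[[float(r * side + c)] * channels for c in range(side)]
        for r in range(side)]
tokens = flatten_row_major(fmap)
```

Under this ordering, horizontally adjacent grid cells are neighbors in the sequence, whereas vertically adjacent cells are separated by seven positions, which is one reason a module with direct access to all token pairs complements the recurrent encoder.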
3.4. BiLSTM Spatial–Sequence Encoder
A key architectural choice that sets the proposed framework apart from purely convolutional and Transformer-based approaches is the integration of a BiLSTM layer applied to the flattened spatial token sequence. The motivation for incorporating the BiLSTM is twofold. First, the BiLSTM imposes an explicit positional inductive bias over the spatial grid, providing stable gradient flow during early training epochs, when the self-attention mechanism is still learning to identify relevant token relationships. Second, the bidirectional formulation captures dependencies in both spatial directions simultaneously, modeling symptom correlations that may propagate along the fish body in either orientation.
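Formally, the BiLSTM processes the input sequence $S = (s_1, \dots, s_{49})$ in two parallel directions:
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\left(s_t, \overrightarrow{h}_{t-1}\right), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\left(s_t, \overleftarrow{h}_{t+1}\right), \qquad h_t = \left[\overrightarrow{h}_t \,\|\, \overleftarrow{h}_t\right]$$
where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively, denote the forward and backward hidden states at position $t$, and the symbol $\|$ represents concatenation along the feature dimension. The internal LSTM cell operations follow the standard gated formulation with input, forget, and output gates, and we employ two stacked layers to provide sufficient representational depth for capturing hierarchical spatial patterns. The hidden dimension is set to 192 per direction, yielding a total output dimension of 384 after concatenation. Applied across all 49 spatial tokens, this produces the encoded sequence $H = (h_1, \dots, h_{49}) \in \mathbb{R}^{49 \times 384}$.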
The BiLSTM effectively establishes a spatial reading order over the feature grid, encoding relationships such as the co-occurrence of hemorrhagic spots in one body region with ulceration in another. Such dependencies are typically overlooked by global pooling operations and only weakly modeled by purely convolutional hierarchies. This reading-order paradigm has proven highly effective in recent CNN–BiLSTM hybrid architectures for medical imaging classification and agricultural disease recognition, and we extend this principle to the domain of fish disease recognition for the first time.
3.5. Multi-Head Self-Attention Block
While the BiLSTM captures sequential positional dependencies, it processes tokens according to a fixed row-major traversal order and cannot directly model arbitrary long-range relationships between spatially distant tokens. To overcome this limitation and enable flexible cross-patch reasoning, we introduce a four-head self-attention module that operates on the BiLSTM output sequence H. The multi-head attention mechanism allows the model to attend to different representational subspaces in parallel, capturing diverse relationship patterns such as species-specific body proportions, cross-regional symptom correlations, and structural symmetries.
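For each attention head $i \in \{1, \dots, 4\}$, the input sequence $H$ is linearly projected into query, key, and value representations:
$$Q_i = H W_i^{Q}, \qquad K_i = H W_i^{K}, \qquad V_i = H W_i^{V}$$
where $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{384 \times d_k}$ are learnable projection matrices producing a per-head dimension of $d_k = 384 / 4 = 96$. Scaled dot-product attention is then computed as
$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
where the scaling factor $1/\sqrt{d_k}$ prevents the softmax from saturating when the dot products grow large in magnitude. The outputs of all four heads are concatenated and linearly projected back to the original dimensionality:
$$\mathrm{MHSA}(H) = \mathrm{Concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_4\right) W^{O}$$
where $W^{O} \in \mathbb{R}^{384 \times 384}$ is the output projection matrix. To stabilize training and facilitate gradient flow, a complete Transformer-style block is formed by surrounding the attention operation with residual connections and layer normalization, followed by a position-wise feed-forward network (FFN):
$$H' = \mathrm{LayerNorm}\left(H + \mathrm{MHSA}(H)\right), \qquad M = \mathrm{LayerNorm}\left(H' + \mathrm{FFN}(H')\right)$$
where the FFN consists of two linear transformations with GELU activation in between, expanding the intermediate dimension to $d_{\mathrm{ff}}$ before projecting back to 384. The final output of this block is $M \in \mathbb{R}^{49 \times 384}$, representing contextually enriched token representations that encode both local sequential dependencies and global cross-patch relationships.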
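As a minimal illustration of the scaled dot-product attention used within each head, the following pure-Python sketch computes the attention weights and outputs for a toy sequence of three tokens. The tiny matrices and the function names are assumptions for illustration only; the learned projections, the four-head concatenation, the residual connections, and the FFN are omitted.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for single-head attention on row vectors."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy example: 3 tokens with d_k = 2 (the text uses 49 tokens and d_k = 96).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, attn = scaled_dot_product_attention(Q, K, V)

head_dim = 384 // 4  # per-head dimension for 384 features split over 4 heads
```

Each row of `attn` is a probability distribution over all tokens, so any token can draw on any other regardless of their distance in the row-major order.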
3.6. Attention-Weighted Pooling
To aggregate the sequence of contextualized tokens M into a single discriminative embedding suitable for classification, we employ an attention-weighted pooling mechanism that learns to prioritize diagnostically relevant spatial locations. This approach generalizes global average pooling by assigning location-specific importance weights learned end-to-end with the classification objective, following the self-attentive pooling paradigm.
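For each token position $t$, an attention score $a_t$ is computed through a small two-layer network with tanh nonlinearity:
$$a_t = v_a^{\top} \tanh\left(W_a m_t + b_a\right)$$
where $W_a$ and $v_a$ are learnable parameters, $b_a$ is a bias vector, and $m_t$ is the $t$-th token of $M$. The scalar scores across all 49 positions are then normalized via softmax to produce a valid probability distribution,
$$\beta_t = \frac{\exp(a_t)}{\sum_{j=1}^{49} \exp(a_j)}$$
and the final pooled embedding is computed as a weighted sum of the token representations:
$$e = \sum_{t=1}^{49} \beta_t\, m_t$$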
This learned pooling strategy allows the model to automatically discover the most informative spatial regions for disease classification, effectively implementing a soft attention-based region-of-interest selection mechanism that typically concentrates on body regions such as the eyes, fins, and lateral line, where pathological symptoms most commonly manifest.
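The pooling mechanism described above (tanh scoring network, softmax normalization, weighted sum) can be sketched in a few lines of pure Python. The parameter values, the three-token input, and the function name are toy assumptions chosen so the behavior is easy to verify by hand; they are not the learned parameters of the model.

```python
import math

def attention_pool(tokens, W, b, v):
    """Pool T tokens (each of size d) into one vector via learned attention.

    W is a d_a x d matrix, b a bias of size d_a, v a scoring vector of size d_a.
    """
    scores = []
    for m in tokens:
        hidden = [math.tanh(sum(W[i][j] * m[j] for j in range(len(m))) + b[i])
                  for i in range(len(W))]
        scores.append(sum(v[i] * hidden[i] for i in range(len(v))))
    # Softmax over positions gives one weight per token.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(tokens[0])
    pooled = [sum(weights[t] * tokens[t][c] for t in range(len(tokens)))
              for c in range(dim)]
    return pooled, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
pooled, weights = attention_pool(tokens, W, b, v=[1.0, 1.0])
# With a zero scoring vector every token scores 0, i.e. plain average pooling.
mean_pooled, uniform = attention_pool(tokens, W, b, v=[0.0, 0.0])
```

The second call shows the sense in which this pooling generalizes global average pooling: when all scores are equal, the weights become uniform and the result is the token mean.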
3.7. Dual Classification Heads with Multi-Task Learning
The pooled embedding e is simultaneously fed into two parallel classification heads that share the same underlying representation but produce distinct predictions, implementing a hard parameter sharing multi-task learning paradigm.
The primary disease classification head predicts the 21-class disease–species joint label through a two-layer MLP with GELU activation:
$$z_{\mathrm{d}} = W_2^{\mathrm{d}}\, \mathrm{GELU}\left(W_1^{\mathrm{d}} e + b_1^{\mathrm{d}}\right) + b_2^{\mathrm{d}}$$
where $W_1^{\mathrm{d}} \in \mathbb{R}^{d_{\mathrm{d}} \times 384}$ and $W_2^{\mathrm{d}} \in \mathbb{R}^{21 \times d_{\mathrm{d}}}$, producing logits over 21 classes, corresponding to the Cartesian product of five health statuses across four primary species plus the additional Neobchi category.
The auxiliary species classification head predicts the five-class species identity through a similar two-layer MLP:
$$z_{\mathrm{s}} = W_2^{\mathrm{s}}\, \mathrm{GELU}\left(W_1^{\mathrm{s}} e + b_1^{\mathrm{s}}\right) + b_2^{\mathrm{s}}$$
where $W_1^{\mathrm{s}} \in \mathbb{R}^{d_{\mathrm{s}} \times 384}$ and $W_2^{\mathrm{s}} \in \mathbb{R}^{5 \times d_{\mathrm{s}}}$. This auxiliary supervision forces the backbone to learn species-discriminative features that would otherwise be suppressed when training solely on the 21-class disease objective, where species information appears only as a conditioning factor within each compound label. The species head incurs no additional inference cost at deployment time, as it can be optionally disabled once training is complete, yet its presence during training provides implicit regularization that improves the minority disease recall.
4. Training Strategy and Optimization
4.1. Loss Formulation
To address the severe class imbalance inherent in the fish-project dataset, where healthy species classes contain over 1100 samples while minority disease classes contain as few as 31 samples, we employ the focal loss with frequency-based class weighting. The focal loss downweights well-classified examples to focus training on hard, often misclassified minority class instances.
For a probability prediction $p_c$ for class $c$ with ground-truth indicator $y_c \in \{0, 1\}$, the focal loss is formulated as
$$\mathcal{L}_{\mathrm{focal}} = -\sum_{c=1}^{C} \alpha_c\, y_c\, (1 - p_c)^{\gamma} \log p_c$$
where $\gamma$ is the focusing parameter that controls the rate at which easy examples are downweighted, and $\alpha_c$ is a class-dependent weighting coefficient. Following the class balance principle while avoiding the extreme weighting that can destabilize training, we define
$$\alpha_c = \frac{1}{Z} \cdot \frac{1}{\sqrt{n_c}}$$
where $n_c$ is the training sample count for class $c$, and $Z$ is a normalization constant that fixes the overall scale of the weights. The square-root inverse-frequency formulation produces weights in a moderate range for our dataset, avoiding the catastrophic overweighting observed in preliminary experiments with pure inverse-frequency weighting, which caused the majority class recall to collapse to zero.
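To make the effect of square-root inverse-frequency weighting concrete, the sketch below computes class weights and a per-sample focal loss in pure Python. It uses the two class sizes quoted in this section (about 1100 and 31 samples) as a two-class toy case. Normalizing the weights to a mean of one and setting `gamma=2.0` are assumptions for illustration, not the settings reported in this work, and the function names are hypothetical.

```python
import math

def sqrt_inverse_frequency_weights(counts):
    """Class weights proportional to 1/sqrt(n_c), rescaled to mean 1 (assumed)."""
    raw = [1.0 / math.sqrt(n) for n in counts]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]

def focal_loss(probs, target, alphas, gamma=2.0):
    """Focal loss for one sample with a one-hot target at index `target`."""
    p = max(probs[target], 1e-12)  # guard against log(0)
    return -alphas[target] * (1.0 - p) ** gamma * math.log(p)

counts = [1100, 31]                      # majority vs. smallest minority class
w_sqrt = sqrt_inverse_frequency_weights(counts)
ratio_sqrt = w_sqrt[1] / w_sqrt[0]       # minority/majority weight ratio
ratio_inverse = counts[0] / counts[1]    # same ratio under pure 1/n_c weighting

unit = [1.0, 1.0]
easy = focal_loss([0.95, 0.05], 0, unit)  # confidently correct sample
hard = focal_loss([0.30, 0.70], 0, unit)  # misclassified sample
```

In this toy case the minority class is weighted roughly 6 times more than the majority class under the square-root scheme, compared with roughly 35 times under pure inverse-frequency weighting, while the focusing term makes the confidently correct sample contribute far less than the misclassified one.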
The total multi-task objective combines the disease classification loss with a weighted auxiliary species classification term:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{disease}} + \lambda \, \mathcal{L}_{\text{species}},$$

where $\lambda$ controls the relative contribution of the species auxiliary task. Its value was selected through a systematic grid search, with the chosen setting providing the best balance between the primary disease objective and the regularization benefit of species supervision.
4.2. Class Imbalance Mitigation Strategy
To tackle the serious long-tailed distribution problem, we use a combined approach that includes targeted oversampling, weighted sampling, and regularization-based data augmentation. This strategy mitigates two common failure modes typically observed with naive imbalance handling: (i) majority class collapse (where aggressive minority weighting causes the model to entirely ignore majority classes) and (ii) overfitting on duplicated minority samples (where excessive oversampling leads to memorization rather than generalization).
Based on an empirical analysis of the per-class validation performance in preliminary experiments, eight visually subtle defect categories (primarily eye and fin defects, characterized by a limited spatial extent and high inter-class similarity) are identified as the most challenging and are oversampled by a factor of 15, so that each image appears 15 times in the effective training set. The remaining disease classes (bleeding and ulcer variants) are also oversampled, by a separate factor. Healthy species classes retain their original frequency without oversampling. Based on the class distribution detailed in Section 5.1, we construct an oversampled training set, as summarized in Table 2. The per-class sample counts before and after oversampling show that the effective training set size increases from 6242 to 12,340 images, substantially increasing the contribution of minority classes while leaving the healthy class counts unchanged.
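The bookkeeping behind static oversampling can be sketched as follows. The class counts are illustrative, and the factor of 5 for the ulcer class is a placeholder; only the factor of 15 for subtle defect classes is taken from the text.

```python
def oversample_counts(counts, factors):
    """Return per-class counts after replicating each class by its factor.

    Classes missing from `factors` keep their original count (factor 1).
    """
    return {name: n * factors.get(name, 1) for name, n in counts.items()}


# Illustrative counts, not the dataset's actual per-class figures.
counts = {"doldom": 1100, "doldom-eyedefect": 85, "doldom-ulcer": 40}
factors = {"doldom-eyedefect": 15, "doldom-ulcer": 5}  # 5 is a placeholder

after = oversample_counts(counts, factors)
total_before = sum(counts.values())
total_after = sum(after.values())
```

In this toy case, a 15-fold replication lifts an 85-image class to 1275 effective samples, which is on the same scale as the healthy class. This is why the replicated images need additional regularization.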
Complementing static oversampling, we employ a WeightedRandomSampler during mini-batch construction. Each training sample $i$ is assigned a sampling weight that is inversely proportional to the square root of its class frequency in the oversampled dataset:

$$w_i = \frac{1}{\sqrt{\tilde{n}_{c_i}}},$$

where $c_i$ denotes the class of sample $i$, and $\tilde{n}_{c_i}$ is the number of samples in that class after oversampling. Under this weighting, the expected share of a class in a mini-batch grows with $\sqrt{\tilde{n}_c}$ rather than with $\tilde{n}_c$, which yields considerably more balanced mini-batches and a consistent representation of both majority and minority classes throughout training.
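The effect of square-root weighting on batch composition can be checked with a small sketch. The two-class label list is a toy example, and the helper names are hypothetical.

```python
import math
from collections import Counter


def sample_weights(labels):
    """Per-sample weight 1/sqrt(n_c), where n_c is the sample's class count."""
    counts = Counter(labels)
    return [1.0 / math.sqrt(counts[c]) for c in labels]


def expected_class_share(labels):
    """Expected fraction of each class when sampling proportionally to the weights."""
    weights = sample_weights(labels)
    total = sum(weights)
    share = Counter()
    for label, weight in zip(labels, weights):
        share[label] += weight / total
    return dict(share)


labels = ["healthy"] * 900 + ["eyedefect"] * 100  # toy 9:1 imbalance
share = expected_class_share(labels)
```

For this toy 9:1 split, the expected batch composition moves to 3:1, because each class is drawn in proportion to the square root of its size. In PyTorch, weights of this form would typically be passed to `torch.utils.data.WeightedRandomSampler` with `replacement=True`.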
To further improve generalization and prevent memorization of the replicated minority samples, we apply Mixup and CutMix augmentations, each with a fixed probability. Mixup creates convex combinations of training pairs using a mixing coefficient, while CutMix replaces a random rectangular patch of one image with content from another. These augmentations provide smoothness regularization in both the input and label spaces, mitigating overfitting on the oversampled minority classes.
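A minimal sketch of the two augmentations on toy single-channel images is given below. The area-based label mixing in `cutmix` follows the commonly used CutMix formulation and is an assumption about the implementation. The fixed mixing coefficient and box are also illustrative: in practice the coefficient is usually sampled (for example from a Beta distribution) and the box is drawn at random.

```python
def mix_labels(y_a, y_b, lam):
    """Convex combination of two one-hot (or soft) label vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]


def mixup(img_a, img_b, y_a, y_b, lam):
    """Blend two images pixel-wise and blend their labels with the same weight."""
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
    return mixed, mix_labels(y_a, y_b, lam)


def cutmix(img_a, img_b, y_a, y_b, top, left, height, width):
    """Paste a rectangle of img_b into img_a; labels are mixed by surviving area.

    The box is assumed to start inside the image and is clipped at the border.
    """
    rows, cols = len(img_a), len(img_a[0])
    bottom, right = min(top + height, rows), min(left + width, cols)
    mixed = [row[:] for row in img_a]
    for r in range(top, bottom):
        mixed[r][left:right] = img_b[r][left:right]
    lam = 1.0 - (bottom - top) * (right - left) / (rows * cols)
    return mixed, mix_labels(y_a, y_b, lam)


img_a = [[0.0] * 4 for _ in range(4)]  # toy image of class 0
img_b = [[1.0] * 4 for _ in range(4)]  # toy image of class 1
mix_img, mix_y = mixup(img_a, img_b, [1.0, 0.0], [0.0, 1.0], lam=0.75)
cut_img, cut_y = cutmix(img_a, img_b, [1.0, 0.0], [0.0, 1.0], 0, 0, 2, 2)
```

Both calls produce the soft label `[0.75, 0.25]` here, but Mixup spreads the second image over every pixel while CutMix confines it to a 2×2 patch.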
4.3. Two-Stage Progressive Training Strategy
The training procedure follows a two-stage progressive schedule designed to balance the rapid convergence of newly initialized components with the stable adaptation of the pretrained backbone. This strategy, derived from the empirical analysis of training dynamics, consistently outperformed single-stage fine-tuning in our ablation studies.
In the first stage, all parameters of the ConvNeXt-Small backbone are frozen (requires_grad = False), and only the newly introduced components (the BiLSTM encoder, MHSA block, attention-weighted pooling, and dual classification heads) are trained. This isolation allows the randomly initialized modules to reach a reasonable operating point without disrupting the pretrained backbone representations. Training uses the OneCycleLR schedule, with a linear warm-up to the maximum learning rate over the first 30% of iterations followed by cosine annealing for the remainder. A comparatively aggressive maximum learning rate is appropriate here because only the head parameters (approximately 2.8 M) are being updated, and these components benefit from the rapid initial exploration of the loss landscape.
In the second stage, the backbone is unfrozen and the entire network is trained end-to-end, but with differentiated learning rates that reflect the different sensitivities of pretrained and newly trained components. The backbone uses a low learning rate to gently adapt the ImageNet-22K pretrained features to the fish disease domain, while the heads use a higher learning rate to continue refining their task-specific representations. Both learning rates follow a smooth cosine annealing schedule without warm restarts, which was empirically found to produce more stable convergence than the warm-restart variant. Early stopping with a patience of 12 epochs is applied based on the validation macro F1-score.
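The Stage 2 schedule can be sketched with the standard cosine annealing formula applied to two parameter groups. The base learning rates and the 60-epoch horizon below are hypothetical placeholders, not the values used in the experiments.

```python
import math


def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine annealing without restarts: base_lr at epoch 0, min_lr at the end."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


# Hypothetical settings chosen only to illustrate the differentiated rates.
BACKBONE_LR, HEAD_LR, EPOCHS = 1e-5, 1e-4, 60

schedule = [
    {
        "backbone": cosine_lr(BACKBONE_LR, epoch, EPOCHS),
        "heads": cosine_lr(HEAD_LR, epoch, EPOCHS),
    }
    for epoch in range(EPOCHS + 1)
]
```

Because both groups share the same cosine factor, the ratio between the head and backbone learning rates stays constant throughout training. In PyTorch, the same behavior would typically be obtained with two optimizer parameter groups and `torch.optim.lr_scheduler.CosineAnnealingLR`.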
Optimization throughout both stages uses AdamW with decoupled weight decay. To manage GPU memory constraints, we employ gradient checkpointing (trading compute for memory by recomputing intermediate activations during the backward pass), mixed-precision training in FP16, and gradient accumulation over 2 micro-batches of 32 images each, yielding an effective batch size of 64. Gradient clipping by maximum norm is applied to prevent occasional gradient explosions during mixed-precision training.
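The equivalence between gradient accumulation and a larger batch can be verified on a toy problem. The sketch below uses a scalar least-squares model rather than the actual network, and the clipping threshold of 1.0 is a hypothetical value.

```python
import math


def mean_grad(w, xs, ys):
    """Gradient of the mean of 0.5 * (w * x - y)^2 with respect to the scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch, accum_steps):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    total = 0.0
    for step in range(accum_steps):
        lo = step * micro_batch
        hi = lo + micro_batch
        total += mean_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]


xs = [i / 64 for i in range(64)]
ys = [2.0 * x + 1.0 for x in xs]
full_batch = mean_grad(0.5, xs, ys)  # one batch of 64
accumulated = accumulated_grad(0.5, xs, ys, micro_batch=32, accum_steps=2)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # hypothetical threshold
```

Two micro-batches of 32, each scaled by one half, reproduce the gradient of a single batch of 64 up to floating-point error. This is what makes the effective batch size 64.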
Table 3 presents the complete specification of the training configuration to ensure the full reproducibility of the reported results. The most consequential hyperparameter choices, including the learning rate gap between the backbone and heads in Stage 2 and the transition from OneCycleLR to cosine annealing across stages, were validated through controlled experiments. These choices are therefore integral to the training protocol rather than arbitrary selections.
4.4. Test-Time Augmentation
To further improve the prediction robustness and calibration, particularly on minority classes, we apply five-view test-time augmentation (TTA) during inference. For each test image, five augmented variants are constructed: the original preprocessed image, a horizontally flipped version, and three center-cropped versions at three different scale factors. Each variant is forwarded through the network independently, and the resulting softmax probabilities are averaged as

$$\bar{p}(x) = \frac{1}{K} \sum_{k=1}^{K} \operatorname{softmax}\big(f_\theta(T_k(x))\big),$$

where $T_k$ denotes the $k$-th augmentation transform, $f_\theta$ is the trained network, and $K = 5$. The final prediction is the argmax of the averaged distribution. TTA provides two key benefits: it reduces prediction variance by averaging over multiple related views, and it mitigates the impact of minor localization errors in the upstream detector by providing multiple aligned crops of the same fish.
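The averaging step can be sketched as follows. The crop scales (0.9, 0.8, 0.7) and the `toy_model` function are hypothetical stand-ins. A real pipeline would also resize each crop back to the network input size, which is omitted here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identity(img):
    return img


def hflip(img):
    return [row[::-1] for row in img]


def center_crop(scale):
    """Return a transform that keeps the central `scale` fraction of each side."""
    def crop(img):
        h, w = len(img), len(img[0])
        ch, cw = max(1, round(h * scale)), max(1, round(w * scale))
        top, left = (h - ch) // 2, (w - cw) // 2
        return [row[left:left + cw] for row in img[top:top + ch]]
    return crop


def tta_predict(model, image, transforms):
    """Average softmax outputs over all views and return (argmax, averaged probs)."""
    views = [softmax(model(t(image))) for t in transforms]
    avg = [sum(v[c] for v in views) / len(views) for c in range(len(views[0]))]
    return max(range(len(avg)), key=lambda c: avg[c]), avg


def toy_model(img):
    """Stand-in classifier: two logits driven by the mean brightness of the view."""
    mean = sum(map(sum, img)) / (len(img) * len(img[0]))
    return [4.0 * mean, 1.0]


image = [[1.0 if 3 <= r < 7 and 3 <= c < 7 else 0.0 for c in range(10)] for r in range(10)]
five_views = [identity, hflip, center_crop(0.9), center_crop(0.8), center_crop(0.7)]
label, probs = tta_predict(toy_model, image, five_views)
```

In this toy case the tightest crop alone would favor class 0, but the averaged distribution still selects class 1. This illustrates how averaging damps the influence of any single view.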
6. Results
In this section, we present a comprehensive experimental evaluation of the proposed ConvNeXt–BiLSTM–MHSA hybrid framework, including analyses of the detection and classification performance, end-to-end pipeline comparisons, systematic ablation studies, and a qualitative assessment of model behavior.
6.1. Dataset Distribution Analysis
Figure 2 provides a detailed illustration of the class frequency distribution across all three dataset splits. The training distribution (left panel) exhibits an extreme long-tailed pattern characteristic of real-world aquaculture monitoring data, where healthy specimens naturally outnumber diseased ones by a wide margin (roughly nine to one overall). The five healthy species classes (Chamdom, Doldom, Gamseongdom, Jopi-bollag, and Neobchi) collectively account for 5595 of the 6242 training samples (89.6%), while the 16 disease-specific classes together comprise only 647 samples (10.4%). The doldom-eyedefect class stands as an outlier among disease classes, with 85 training samples, while several other minority classes contain as few as 31–42 samples.
The validation and test distributions (middle and right panels) exhibit a similar but less extreme pattern, with the Doldom healthy class being particularly prominent in validation (93 samples) due to the standard curator-provided split. This distributional similarity between train and test ensures that the model is evaluated on data drawn from the same underlying population, while the class imbalance in both splits reflects the operational reality that any deployed system must be capable of correctly identifying rare disease cases amid predominantly healthy populations.
6.2. Training Dynamics Analysis
Figure 3 illustrates the training dynamics during Stage 2 fine-tuning. Three key observations from these curves validate the effectiveness of the proposed training strategy.
- (1) Stable optimization under class imbalance. The loss curves (left panel) show stable convergence without divergence, despite the aggressive oversampling of minority classes. This is consistent with the square-root inverse-frequency focal loss weighting being well calibrated and avoiding the majority class collapse observed with more extreme weighting schemes. The training loss decreases monotonically between epoch 1 and epoch 60, while the validation loss exhibits higher variance due to the limited validation set size but follows a consistent downward trend before stabilizing.
- (2) Effect of data augmentation on accuracy dynamics. The accuracy curves (center panel) reveal an initial phase (first ∼15 epochs) where the validation accuracy exceeds the training accuracy. This behavior arises from the use of Mixup and CutMix, which generate interpolated training samples and thereby depress the measured training accuracy, while validation is performed on clean samples. After approximately epoch 20, the training and validation curves intersect, and the typical situation (training > validation) is restored. The final gap between the training accuracy (∼99%) and validation accuracy (∼95%) remains small, indicating effective generalization with minimal overfitting.
- (3) Importance of macro F1 for imbalanced learning. The macro F1 curve (right panel) highlights the necessity of optimizing a class-balanced metric. While the validation accuracy saturates at around epoch 30, the macro F1 continues to improve until approximately epoch 45, where it reaches its peak. This indicates continued gains on minority classes, which have a limited impact on the accuracy but significantly influence the macro F1. Accordingly, model selection is based on the validation macro F1 to ensure balanced performance across all 21 classes, rather than biasing toward majority classes.
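The difference in sensitivity between accuracy and macro F1 can be made concrete with a toy two-class example. The confusion matrices below are invented for illustration and are not taken from the experiments.

```python
def per_class_f1(confusion):
    """F1 per class from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def accuracy(confusion):
    """Fraction of samples on the diagonal."""
    total = sum(map(sum, confusion))
    return sum(confusion[c][c] for c in range(len(confusion))) / total


def macro_f1(confusion):
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


# Toy validation set: 95 healthy samples (class 0) and 5 diseased samples (class 1).
early = [[95, 0], [3, 2]]  # 2 of 5 diseased samples detected
later = [[95, 0], [1, 4]]  # 4 of 5 diseased samples detected

accuracy_gain = accuracy(later) - accuracy(early)
macro_f1_gain = macro_f1(later) - macro_f1(early)
```

Recovering two additional minority samples raises the accuracy by only two percentage points but moves the macro F1 by more than 0.15. This mirrors why the two curves saturate at different epochs.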
6.3. Classification Results
Table 6 summarizes the classification performance of the proposed framework on the 282-image test set using five-view test-time augmentation. The model achieves top-1 accuracy of 94.33%, which increases to 98.94% and 100.00% for the top-3 and top-5 accuracy, respectively. This indicates that misclassifications are typically confined to closely related classes, with the correct label consistently appearing among the top-ranked predictions. The macro-averaged F1-score of 0.9264 demonstrates strong and balanced per-class performance. Similarly, the macro AUC of 0.994 indicates near-perfect separability across different decision thresholds. A key observation is that, for 14 out of 21 classes, the model achieves 100% recall, meaning that no test samples from these classes are missed. The remaining classification errors are concentrated primarily in four visually subtle eye defect categories and three partially confused majority species classes. These error patterns are further analyzed in the confusion matrix.
The test set of 282 images, while standard for curated aquaculture datasets, presents statistical limitations for a 21-class fine-grained classification problem. Several minority disease categories contain fewer than 10 test samples, making the per-class recall and AUC estimates for these categories susceptible to higher variance and sensitivity to individual predictions; therefore, such metrics should be interpreted as indicative rather than definitive measures of model capability.
The results in Table 6 further reveal a notable gap of 6.23 percentage points between the macro precision (97.11%) and macro recall (90.88%). This indicates that the model is highly reliable when assigning positive predictions (high precision) but occasionally fails to detect certain minority class instances (lower recall), particularly in visually subtle categories. From an application perspective, this trade-off is preferable for aquaculture health monitoring. False positives (low precision) would lead to unnecessary interventions and increased operational costs, whereas false negatives (low recall) can be mitigated through continuous monitoring across multiple frames, increasing the likelihood of eventual detection.
6.4. Confusion Matrix Analysis
Figure 4 presents both raw-count and row-normalized confusion matrices, providing detailed insights into the classification behavior of the proposed framework. The dominant diagonal structure in both representations confirms the strong overall performance, while the off-diagonal entries reveal three consistent and diagnostically meaningful error patterns.
The most prominent errors arise from the misclassification of subtle defect samples as their corresponding healthy species classes. For example, chamdom-eyedefect yields 62.5% recall (3/8 misclassified as Chamdom (healthy)), doldom-eyedefect yields 75% recall (2/8 misclassified as Doldom (healthy)), and doldom-findefect yields 89% recall (1/9 misclassified as Doldom (healthy)). These errors stem from the inherently subtle nature of such defects, particularly eye-related ones, which are often spatially small, exhibit weak visual contrast, and may be partially occluded or poorly captured under challenging imaging conditions. In such ambiguous cases, the model exhibits a bias toward the dominant healthy class, consistent with the underlying class distribution.
Limited confusion is observed within majority healthy species classes—that is, one Doldom (healthy) sample is predicted as Doldom-bleeding (yielding 98% recall) and one Gamseongdom (healthy) sample as Gamseongdom-ulcer (97% recall). From an application perspective, such errors are relatively benign in clinical terms, as they would trigger additional inspection rather than resulting in missed detections.
The lowest recall is observed for the remaining eye defect categories, where inter-class similarity is most pronounced. For instance, Gamseongdom-eyedefect yields 50% recall (3/6 misclassified as healthy), and Jopi-eyedefect yields 37.5% recall (5/8 misclassified as healthy). Notably, Jopi-eyedefect yields 100% precision but low recall, indicating a conservative prediction strategy in which the model favors the dominant healthy species label when confidence in disease-specific features is low. This behavior is influenced by the auxiliary species classification head, which provides strong species-level cues that can override the weaker disease signal when the pathological features are ambiguous.
Despite these localized error patterns, the confusion matrix remains highly diagonal with minimal cross-class confusion. All bleeding classes, three of four ulcer classes, and most fin defect classes exhibit near-perfect classification, demonstrating the robustness of the proposed framework under severe class imbalance and limited training data.
6.5. Per-Class Performance Analysis
Figure 5 presents a per-class breakdown of the precision, recall, and F1-score, providing fine-grained insights into model behavior that complement the aggregate metrics reported in
Table 6. The classes are sorted by F1-score in descending order, revealing clear, three-tier performance stratification.
Tier 1 comprises eleven classes that achieve perfect precision, recall, and F1-scores, including multiple bleeding, ulcer, and fin defect categories. Notably, several of these classes contain only four to eight test samples, yet all instances are correctly classified, indicating strong generalization despite limited data.
Tier 2 comprises six classes in the upper-moderate performance range, including the majority of healthy species classes (e.g., Doldom, Chamdom, Jopi-bollag, Gamseongdom) and a few disease categories. Minor reductions in the F1-score are primarily due to occasional intra-species confusion, as previously discussed, but the overall performance remains robust.
Tier 3 comprises the four eye defect classes, which form a distinct lower-performance group with F1-scores ranging from 0.54 to 0.80. These classes exhibit consistently high precision (86–100%) but reduced recall (37.5–75%), indicating a conservative prediction bias. The jopi-eyedefect class is the most challenging (F1 = 0.54, recall = 37.5%), reflecting the combined effects of subtle visual cues, inter-class similarity, and limited training samples.
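As a consistency check, the jopi-eyedefect figures can be reproduced from the counts given in the text (8 test samples, 5 predicted as healthy, and 100% precision, which implies no false positives). The helper name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# jopi-eyedefect: 3 of 8 test samples detected, none falsely flagged.
precision, recall, f1 = precision_recall_f1(tp=3, fp=0, fn=5)
```

This yields a precision of 1.0, a recall of 0.375, and an F1-score of about 0.545, in line with the reported recall of 37.5% and F1 of 0.54.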
This stratified analysis highlights that the proposed framework achieves reliable performance across most classes, while identifying eye defect categories as the primary source of residual error. From a deployment perspective, the predictions for Tier 1 and Tier 2 classes can be considered highly reliable, whereas Tier 3 classes may benefit from additional verification due to their lower recall.
6.6. ROC Curve Analysis
Figure 6 presents the one-vs-rest ROC curves for the ten highest-performing classes (ranked by AUC), along with the macro-averaged AUC of 0.994 across all 21 classes. The curves closely follow the upper-left boundary, with ten classes achieving an AUC of 1.000, indicating perfect separation of positive and negative test samples for those classes.
For these classes, there exists an operating threshold at which the true positive rate approaches 1 while the false positive rate approaches 0, corresponding to ideal classification performance. This behavior confirms that the learned feature representations are highly discriminative for the majority of classes.
The reported macro AUC of 0.994 reflects strong overall performance, even when accounting for more challenging categories such as eye defect classes, whose AUC values range between 0.85 and 0.95. Importantly, these values still indicate good ranking capabilities, suggesting that most misclassifications arise from threshold-dependent decision boundaries rather than insufficient feature representation.
This observation has practical implications: by adjusting the class-specific decision thresholds, particularly for minority and visually subtle categories such as eye defects, it is possible to trade a small reduction in precision for a meaningful gain in recall. Such flexibility is valuable in aquaculture monitoring systems, where missing diseased instances may be more critical than generating additional false alarms.
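The distinction between ranking quality and a fixed decision threshold can be sketched with hypothetical one-vs-rest scores for a single subtle class. The scores and both thresholds are invented for illustration, and a multi-class argmax rule is simplified here to a per-class threshold.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def precision_recall_at(pos_scores, neg_scores, threshold):
    """Precision and recall when every score >= threshold is flagged as positive."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    precision = tp / (tp + fp) if tp + fp else 1.0
    return precision, tp / len(pos_scores)


# Hypothetical scores for diseased (pos) and non-diseased (neg) samples.
pos = [0.62, 0.45, 0.41, 0.38]
neg = [0.36, 0.22, 0.15, 0.10, 0.05]

auc = roc_auc(pos, neg)
strict = precision_recall_at(pos, neg, threshold=0.50)
relaxed = precision_recall_at(pos, neg, threshold=0.35)
```

Here every positive outranks every negative, so the AUC is 1.0, yet a threshold of 0.50 recovers only one of four positives. Lowering the threshold to 0.35 recovers all four at the cost of one false alarm (precision 0.8), which is the trade described above.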
The proposed framework is intended as an assistive early screening tool for detecting visually observable fish disease symptoms and continuous health monitoring, rather than as a replacement for comprehensive veterinary diagnosis, which additionally requires laboratory analysis, microbiological testing, environmental assessment, and expert clinical interpretation. It acts as a triage system, identifying potentially affected individuals who are flagged for prioritized examination by veterinary experts.
6.7. Qualitative Prediction Analysis
Figure 7 provides a qualitative assessment of the model through representative test samples. The top row shows correctly classified examples under significant visual variability, including diverse water colors, fish poses, and disease manifestations. Notably, a chamdom eye-defect case (confidence 0.72) is correctly identified despite subtle visual cues, indicating the model’s ability to detect fine-grained pathological features when sufficiently visible.
The bottom row presents representative misclassifications, all belonging to eye defect categories. These errors generally yield lower confidence scores (0.46–0.81) compared to correct predictions (0.72–0.80), suggesting that uncertainty-aware thresholding could be leveraged during deployment. Specifically, low-confidence predictions can be flagged for human review, enabling a semi-automated workflow that balances efficiency with reliability.
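A minimal sketch of such a review rule is shown below. The cutoff of 0.70 and the class probabilities are hypothetical. Since the reported confidence ranges of correct and incorrect predictions overlap, any single cutoff would separate them only partially.

```python
def route_prediction(probs, min_confidence):
    """Return the top label and whether it should be sent for human review."""
    label = max(probs, key=probs.get)
    confidence = probs[label]
    return {
        "label": label,
        "confidence": confidence,
        "needs_review": confidence < min_confidence,
    }


MIN_CONFIDENCE = 0.70  # hypothetical cutoff, to be tuned on validation data

confident = route_prediction({"Doldom": 0.93, "doldom-eyedefect": 0.07}, MIN_CONFIDENCE)
uncertain = route_prediction({"Jopi-bollag": 0.54, "jopi-eyedefect": 0.46}, MIN_CONFIDENCE)
```

The second prediction is labeled as healthy but falls below the cutoff, so it would be queued for inspection rather than accepted automatically.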
Examining the specific error modes, we observe confusions such as T:chamdom-eyedefect → P:Chamdom, T:jopi-eyedefect → P:Jopi-bollag, and T:doldom-eyedefect → P:Doldom. These results align with the confusion matrix analysis, confirming that the primary challenge lies in detecting subtle eye region abnormalities, while species-level discrimination remains robust due to the auxiliary classification head.
6.8. End-to-End Pipeline Demonstration
Figure 8 illustrates the complete two-stage inference pipeline on representative test images. The YOLOv8s detector localizes fish instances under challenging conditions, including small-scale targets, partial occlusions, boundary regions, and multi-fish scenes. The resulting bounding boxes are used to extract object-centric crops that serve as inputs to the classification module.
This decoupled design separates localization and fine-grained recognition into two specialized models, enabling independent optimization for each task. The detector is optimized for high-recall fish localization across complex underwater backgrounds, while the classifier operates on cropped regions with an improved signal-to-noise ratio, facilitating the more accurate discrimination of visually similar disease patterns.
From a system perspective, this modular formulation enhances the flexibility and maintainability, as either stage can be updated or replaced (e.g., newer YOLO variants or alternative classifiers) without retraining the entire pipeline, making the framework suitable for iterative deployment in aquaculture monitoring systems.
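The modular structure can be sketched as a function that accepts any detector and any classifier as plain callables. The interfaces and the toy stand-ins below are assumptions made for illustration and do not reflect the actual YOLOv8s or classifier APIs.

```python
def crop(frame, box):
    """Extract the region (x1, y1, x2, y2) from a frame stored as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


def run_pipeline(frame, detect, classify):
    """Stage 1 localizes fish; stage 2 classifies each cropped region.

    detect(frame) -> list of (x1, y1, x2, y2); classify(patch) -> label.
    """
    return [(box, classify(crop(frame, box))) for box in detect(frame)]


def toy_detector(frame):
    """Stand-in detector returning two fixed boxes."""
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def toy_classifier(patch):
    """Stand-in classifier that thresholds the mean intensity of the crop."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return "diseased" if mean > 0.5 else "healthy"


frame = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
results = run_pipeline(frame, toy_detector, toy_classifier)
```

Because `run_pipeline` depends only on the two call signatures, either stage can be swapped for a different model without touching the other.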
6.9. End-to-End Detection Pipeline Comparison
Table 7 compares the proposed YOLOv8s + ConvNeXt–BiLSTM–MHSA pipeline with existing state-of-the-art models. The proposed framework achieves the best overall performance in terms of precision, recall, mAP, and inference speed, indicating a clear Pareto improvement over existing methods.
The precision improves from 93.3% (RT-GalaDet) to 97.1%. This gain is primarily attributed to the decoupled detection–classification design, where YOLOv8s provides accurate region proposals and the classifier focuses exclusively on fine-grained disease discrimination using tightly cropped regions. This separation improves the discrimination ability of the classifier for visually subtle minority classes. Recall increases from 89.7% to 90.9%, confirming that the improvement is not achieved at the expense of sensitivity. The gain is consistent with the proposed oversampling strategy, which enhances the representation of challenging eye defect and fin defect categories and improves the detection of rare pathological patterns. The most significant improvement is observed in the mAP50, which increases from 89.0% to 96.1%. This indicates stronger overall ranking quality across confidence thresholds and reflects improved joint precision–recall behavior rather than simple threshold tuning. The proposed pipeline achieves 213.3 FPS, corresponding to a 4.1× speed-up over RT-GalaDet (51.98 FPS). This improvement is mainly due to (i) classification on cropped regions instead of full-resolution frames and (ii) the inherent computational efficiency of the ConvNeXt-Small backbone at inference time. The resulting throughput is sufficient for real-time multi-camera aquaculture deployment.
FPS is measured on a single GPU with a batch size of 1 to simulate real-time deployment. The reported 213.3 FPS is for the full end-to-end pipeline (YOLOv8s detection on frames and ConvNeXt–BiLSTM–MHSA classification on cropped regions) with single-view inference and no test-time augmentation. With five-view TTA, the throughput drops to about 42.7 FPS, which is acceptable in many aquaculture monitoring applications, where accuracy is prioritized over peak throughput.
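The throughput figures can be cross-checked with simple arithmetic. The sketch assumes that the per-frame cost scales linearly with the number of TTA views, which is consistent with the reported numbers. The 30 FPS camera stream rate is a hypothetical deployment parameter.

```python
SINGLE_VIEW_FPS = 213.3  # reported end-to-end throughput
BASELINE_FPS = 51.98     # reported RT-GalaDet throughput
TTA_VIEWS = 5

speedup = SINGLE_VIEW_FPS / BASELINE_FPS
tta_fps = SINGLE_VIEW_FPS / TTA_VIEWS  # assumes cost grows linearly with views
frame_time_ms = 1000.0 / SINGLE_VIEW_FPS

CAMERA_FPS = 30  # hypothetical per-camera stream rate
cameras_supported = int(SINGLE_VIEW_FPS // CAMERA_FPS)
```

Under these assumptions, the speed-up rounds to 4.1×, the five-view throughput to 42.7 FPS, and a single GPU could keep pace with seven 30 FPS streams in single-view mode.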
Overall, the results demonstrate that the proposed pipeline improves both the detection accuracy and computational efficiency, achieving consistent gains over recent YOLO- and RT-DETR-based baselines.
6.10. Ablation Study
To systematically evaluate the contribution of each architectural component and training strategy, we perform an ablation study in which key components are removed from the full model while keeping all other settings unchanged. The results in
Table 8 confirm that each component contributes non-redundantly to the overall performance, with distinct functional roles.
The MHSA module contributes the largest accuracy gain: substituting MHSA with average pooling over the output of the BiLSTM leads to the largest degradation in accuracy (−2.40%) and a comparably large drop in the macro F1-score (−0.030). This indicates that non-sequential cross-patch reasoning is essential for distinguishing subtle pathological features that manifest as correlations between spatially distant body regions. Without MHSA, the model relies solely on the sequential processing of the BiLSTM, which cannot effectively link features from opposite ends of the fish body.
The BiLSTM encoder provides the second most important architectural contribution. Replacing the BiLSTM with a parameter-matched layer causes a drop in performance (−1.90% accuracy, −0.024 macro F1), demonstrating that the sequential feature modeling introduces a beneficial spatial inductive bias for capturing structured local dependencies. This complements MHSA by stabilizing optimization and enhancing local context modeling.
Replacing the focal loss with standard cross-entropy leads to the most pronounced degradation in the macro F1 (−0.031), highlighting its effectiveness in addressing class imbalances. The focusing mechanism is critical for learning from hard minority class samples and cannot be fully compensated for by class reweighting alone.
Removing CLAHE preprocessing reduces the accuracy by 1.20%, indicating that contrast enhancement improves the visibility of fine-grained lesion patterns under degraded underwater imaging conditions, thereby facilitating more discriminative feature extraction.
The removal of the auxiliary species head leads to measurable performance degradation under focal loss (−0.80% accuracy, −0.008 macro F1), indicating the benefit of multi-task regularization. This effect is primarily reflected in the detailed per-class analysis, which shows reduced recall for minority and species-specific disease classes, suggesting that the auxiliary branch encourages the backbone to learn more discriminative species-aware representations that improve class separation in visually similar categories.
Overall, the ablation results demonstrate that the performance gains arise from the synergistic integration of all components rather than any single module, validating the design of the proposed framework for long-tailed fine-grained classification in aquaculture environments.
The proposed ConvNeXt–BiLSTM–MHSA framework achieves state-of-the-art performance for fish disease detection, improving both the accuracy and efficiency over existing baselines while operating at 213.3 FPS. The results demonstrate that decoupling localization and fine-grained recognition, combined with hybrid spatial–sequence modeling and targeted imbalance handling, holds strong potential for detecting visually observable abnormalities in aquaculture environments. The observed improvements across accuracy, robustness, and inference speed support its suitability for real-time deployment in practical monitoring systems. These gains reflect synergistic rather than merely additive interactions among components; the MHSA module benefits from BiLSTM-stabilized token sequences, while species head regularization is most effective in conjunction with focal loss weighting. This interdependent behavior supports the unified architectural design, wherein the components function as an integrated whole rather than independent optional modules.
6.11. Limitations
Despite the strong performance of the proposed framework, the results of this study should be interpreted in light of several limitations:
- (1) Eye defect recall remains at 37.5–75%. This stems from the data: eye defects are physically small and visually subtle, and the problem is aggravated by the limited number of samples. Backbones with longer-range modeling capacity, such as Mamba-style architectures, may help.
- (2) Our pipeline assumes a reliable front-end crop; heavily occluded fish are dropped by YOLOv8s and therefore never reach the classifier.
- (3) The dataset presents notable limitations, including absent metadata on housing and annotation conditions, restricted coverage to five Korean marine species with unverified cross-species generalizability, and potential distributional biases stemming from its multi-source, non-standardized assembly.