1. Introduction
Aquaculture is one of the fastest-growing global food production sectors, accounting for a substantial share of the world’s edible fish and serving as the primary source of seafood, thereby playing a crucial role in food security [
1,
2,
3]. However, this rapid expansion poses major disease management challenges that threaten its economic sustainability and undermine its role in global food security. According to the United Nations Food and Agriculture Organization (FAO), the aquaculture industry incurs substantial annual losses from aquatic animal diseases [
4]. The most prevalent and damaging diseases in farmed fish encompass bacterial infections (e.g., aeromoniasis and vibriosis); fungal infections, notably saprolegniasis; parasitic infections (e.g., white spot disease and dactylogyrosis); and the rapidly spreading white tail disease [
5,
6,
7]. The intensive nature of modern aquaculture—characterized by high-density farming in confined environments—creates ideal conditions for disease transmission, frequently resulting in catastrophic economic impacts [
8].
Traditional diagnosis of fish diseases relies primarily on visual inspection by experienced veterinarians, supplemented by laboratory microscopy [
9]. Although established, these methods have several inherent limitations: the process is time-consuming, causing critical delays in outbreak response; diagnostic accuracy is highly dependent on the expertise of the practitioner; and a global shortage of aquaculture health specialists hinders effective disease surveillance. These limitations are especially critical for diseases like white tail disease, whose rapid progression and highly contagious nature require immediate intervention to prevent mass mortality [
10].
The emergence of computer vision and artificial intelligence has opened new pathways for automated fish disease diagnosis, with technical approaches evolving through three distinct phases. Initial research employed traditional machine learning, relying on handcrafted features with classifiers like Support Vector Machines (SVM) and Random Forests [
11,
12,
13]. While promising in controlled settings, these methods struggled in practical aquaculture due to variations in water quality, lighting, and fish behavior. The advent of deep learning, particularly Convolutional Neural Networks (CNNs) [
14,
15], marked a significant advancement. End-to-end frameworks like Faster R-CNN and YOLO substantially improved detection accuracy [
16,
17,
18,
19,
20,
21,
22,
23,
24]. For instance: Huang et al. developed CNN-OSELM [
25], a multi-layer fusion network with an attention mechanism for precise disease identification; Sanjay Kumaar et al. proposed FishNet [
26], increasing freshwater fish disease detection accuracy by 2%; Yu et al. introduced MobileNet3-GELU-YOLOv4 [
27], achieving a 12.39% mAP increase and 19.31 frames per second (FPS) speed boost over YOLOv4 with fewer parameters; Li et al. extended YOLOv8 with a semantic segmentation branch for lesion localization [
27], proposing YOLO-FD; Wu et al. addressed limited data and symptom similarity with YOLOv11-SDiseasedFishNet, attaining 94.8% mAP, 93.9% recall, and 97.1% precision [
28]. However, CNNs are constrained by limited receptive fields, hindering their ability to capture long-range spatial dependencies of symptoms distributed across a fish’s body—a critical factor for diagnosing diseases with correlated symptom patterns. To address this, researchers have turned to Transformer architectures that leverage self-attention to model global contexts [
29]. Nath et al. combined Vision Transformer (ViT) with CNNs to integrate global analysis and local features, enhancing classification efficiency and accuracy [
30]. Bhattacharjee et al. achieved state-of-the-art 97.92% accuracy with a ViT-based model, surpassing CNNs [
31]. Alluhaidan et al. developed an enhanced Swin-Transformer for automated pathogen detection [
32]. Despite excelling at capturing comprehensive symptom distributions, these Transformer-based methods face a major obstacle: their quadratic computational complexity makes real-time processing of high-resolution underwater images computationally prohibitive, presenting a significant implementation barrier in resource-constrained aquaculture environments.
The recent advent of state space models (SSMs) [
33,
34], exemplified by the Mamba architecture [
35], presents a promising alternative to Transformers by combining comparable long-range dependency modeling with linear computational complexity. This approach efficiently processes sequential data while maintaining a global receptive field. VMamba [
36] is a vision-centric adaptation of the Mamba state space model, which extends the 1D selective scan to 2D for efficient long-range dependency modeling in images. These characteristics are well-suited to aquatic disease diagnosis. However, the potential of SSMs in aquatic animal health monitoring remains largely unexplored, with no existing research systematically investigating their application to fish disease detection in aquaculture. In addition, while not specifically designed for aquatic disease diagnosis, general computer vision research has increasingly focused on improving model robustness in complex environments. For instance, methodologies from robust feature learning in complex scene parsing [
37], and geometric-attack-resistant image representation [
38], provide relevant strategies for handling environmental interference and feature distortion. These works offer valuable cross-domain insights that could inform future improvements in the environmental adaptability and stability of aquatic vision models. Building on these perspectives—and to bridge the identified gap while overcoming the limitations of existing approaches—this paper introduces FishMambaNet, a novel detection framework that strategically integrates selective state space models with convolutional networks.
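To make the complexity argument above concrete, the following minimal sketch compares how the number of token interactions grows with image resolution for self-attention versus a linear scan. The patch size of 16 and the four scan directions are illustrative assumptions, not values taken from this work, and the counts ignore constant factors such as channel width.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image (assumed patch size)."""
    return (height // patch) * (width // patch)


def attention_interactions(n_tokens: int) -> int:
    """Self-attention compares every token with every token: O(N^2)."""
    return n_tokens * n_tokens


def scan_steps(n_tokens: int, directions: int = 4) -> int:
    """A state-space scan visits each token once per direction: O(N)."""
    return n_tokens * directions


if __name__ == "__main__":
    # Print: image side, token count, attention interactions, scan steps.
    for side in (224, 448, 896):
        n = num_tokens(side, side)
        print(side, n, attention_interactions(n), scan_steps(n))
```

Under these assumptions, doubling the image side quadruples the scan cost but multiplies the attention cost by sixteen, which is why the gap widens quickly for high-resolution underwater images.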
The framework introduces three key architectural innovations:
The FSBlock combines a VMamba-based 2D Selective Scan (SS2D) module, which captures global disease patterns, with the convolutional operations of a GCBlock, which extract local symptomatic features, with the combination stabilized through residual learning.
The MSCA mechanism employs attention branches to efficiently capture contextual information, utilizing parallel partial convolutions and channel splitting to gather multi-scale information while maintaining computational efficiency.
The overall network architecture strategically positions the FSBlock and MSCA modules at critical backbone and neck stages, thereby significantly enhancing feature representation.
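The global branch described above can be illustrated with a simplified sketch of four-direction 2D scanning in the style of VMamba's SS2D. This is not the FishMambaNet implementation: the recurrence uses a fixed scalar decay instead of input-dependent (selective) parameters, and the function names and the `decay` value are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def cross_scan(grid: Grid) -> List[List[float]]:
    """Flatten a 2D grid into four 1D sequences: row-major, column-major, and their reverses."""
    rows = [v for row in grid for v in row]
    cols = [grid[r][c] for c in range(len(grid[0])) for r in range(len(grid))]
    return [rows, cols, rows[::-1], cols[::-1]]


def linear_scan(seq: List[float], decay: float = 0.5) -> List[float]:
    """Toy recurrence h_t = decay * h_{t-1} + x_t, one pass per sequence (linear time)."""
    out, h = [], 0.0
    for x in seq:
        h = decay * h + x
        out.append(h)
    return out


def cross_merge(seqs: List[List[float]], height: int, width: int) -> Grid:
    """Map the four scanned sequences back to grid positions and sum them."""
    n = height * width
    merged = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            i_row = r * width + c   # index in row-major order
            i_col = c * height + r  # index in column-major order
            merged[r][c] = (seqs[0][i_row] + seqs[1][i_col]
                            + seqs[2][n - 1 - i_row] + seqs[3][n - 1 - i_col])
    return merged


def ss2d_sketch(grid: Grid, decay: float = 0.5) -> Grid:
    """Scan in four directions, then merge so every cell receives information from the whole grid."""
    scanned = [linear_scan(s, decay) for s in cross_scan(grid)]
    return cross_merge(scanned, len(grid), len(grid[0]))
```

Feeding a single non-zero cell into `ss2d_sketch` yields non-zero values at every position, which is the global receptive field that a small convolution kernel alone cannot provide in one layer.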
This study makes three primary contributions: (1) It pioneers the comprehensive application of selective state space models to fish disease diagnosis, establishing a new paradigm that synergizes SSMs and CNNs for aquatic health monitoring. (2) It introduces two novel components—the FSBlock and the MSCA mechanism—that collaboratively address the challenges of global dependency modeling and multi-scale feature extraction in underwater environments. (3) Through extensive experiments, it validates FishMambaNet’s state-of-the-art performance in multi-category fish disease detection, achieving a mAP@50 of 86.7% with computational efficiency suitable for real-world deployment (4.3 M parameters, 10.7 GFLOPs), offering a practical solution for commercial aquaculture. The remainder of this paper is organized as follows:
Section 2 details the dataset construction and the FishMambaNet architecture;
Section 3 presents the experimental setup and results;
Section 4 discusses practical applications and limitations; and
Section 5 provides concluding remarks and future directions.
4. Discussion
FishMambaNet demonstrates outstanding comprehensive performance in fish disease detection, achieving superior overall accuracy (86.7% mAP@50) compared to mainstream models like the YOLO series and RT-DETR [
40], while maintaining exceptional efficiency with only 4.3 M parameters and 10.7 GFLOPs. This performance is primarily attributed to the synergistic integration of SSMs with convolutional neural networks. The FSBlock demonstrates the strength of selective SSMs in capturing global spatial dependencies through linear-complexity sequence modeling, effectively overcoming the limited receptive field of traditional CNNs. This is crucial for diagnosing diseases with correlated symptom distributions, as evidenced by FishMambaNet’s high AP@50 scores for white tail disease (86.7%) and fungal infections (91.7%) in
Table 3 and
Table 5. The MSCA module functionally complements the FSBlock by providing sensitive multi-scale perception of local lesions. Ablation experiments confirm their strong synergy, as their combined performance gain exceeds the sum of individual contributions. Furthermore, the integration of partial convolution optimizes computational efficiency without sacrificing accuracy, enhancing the model’s suitability for resource-constrained environments.
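The efficiency argument for partial convolution can be made concrete with a rough operation count. The sketch below assumes the FasterNet-style definition, in which the convolution is applied to only a fraction of the input channels while the remaining channels pass through unchanged; the feature-map size, channel count, and 1/4 ratio are illustrative assumptions rather than FishMambaNet's actual configuration.

```python
def conv_macs(height: int, width: int, kernel: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate count of a standard convolution (stride 1, same padding)."""
    return height * width * kernel * kernel * c_in * c_out


def partial_conv_macs(height: int, width: int, kernel: int,
                      channels: int, ratio: float = 0.25) -> int:
    """MACs when only `ratio` of the channels are convolved; the rest are left untouched."""
    c_part = int(channels * ratio)
    return conv_macs(height, width, kernel, c_part, c_part)


if __name__ == "__main__":
    full = conv_macs(40, 40, 3, 64, 64)      # hypothetical 40x40 map, 64 channels
    part = partial_conv_macs(40, 40, 3, 64)  # same map, 16 of 64 channels convolved
    print(full, part, full // part)
```

With a 1/4 channel ratio, the convolved part costs 1/16 of a full convolution over the same feature map, since the cost scales with the product of input and output channel counts.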
This study has several limitations. The dataset, while covering a full production cycle, originates from a single aquaculture region in Southern China. This limits the model’s validated robustness against key environmental and operational variables prevalent in real-world deployments, such as variations in water quality (e.g., turbidity), lighting conditions, diverse fish species, and different imaging equipment. Although computationally efficient, the model’s real-time inference performance and energy consumption on edge devices in practical aquaculture settings require empirical validation. Finally, the current framework focuses exclusively on external diseases and does not address internal pathologies or behavioral abnormalities.
Despite these limitations, FishMambaNet establishes a viable pathway for intelligent aquaculture disease management. Its lightweight architecture enables deployment of high-precision diagnostic models on embedded systems near fish ponds, facilitating early detection and timely intervention to reduce economic losses. Future work will therefore prioritize robustness benchmarking and domain adaptation research, alongside expanding dataset diversity and scale. It will also explore integrating temporal behavior analysis to develop a more comprehensive fish health monitoring system.
5. Conclusions
Addressing the critical need for effective disease management in global aquaculture, this study developed an automated detection solution that balances high accuracy with computational efficiency. To overcome the limited receptive fields of CNNs and the high computational cost of Transformers, we introduced FishMambaNet, a novel framework based on SSMs. The principal contributions are threefold: (1) We established a new architectural paradigm by pioneering the integration of SSMs with CNNs for fish disease detection, leveraging global dependency modeling and local feature extraction. (2) We designed two core innovations—the FSBlock for capturing global disease patterns and local lesion details, and the MSCA mechanism for efficient multi-scale context fusion. (3) Extensive validation demonstrated state-of-the-art performance, with FishMambaNet achieving 86.7% mAP@50 while requiring only 4.3 M parameters and 10.7 GFLOPs, significantly outperforming existing models and offering a practical solution for real-time diagnosis in resource-limited settings.
Future research will focus on the following directions. The foremost is rigorous robustness evaluation and enhancement, which involves systematically constructing benchmarks to quantify performance under varying conditions of turbidity, illumination, and viewpoint, and conducting small-scale field deployments across different farming systems to gather authentic feedback and drive domain-adaptive improvements. Building upon the promising cross-dataset generalization shown in this work, the immediate priority is to expand dataset diversity through collaborative efforts to construct and release a large-scale, open-access benchmark spanning multiple species, diseases, and farming environments. Further directions are to advance edge deployment and validation in operational aquaculture settings and to integrate temporal analysis for early detection of behavioral abnormalities, ultimately enabling a comprehensive health monitoring system from external symptoms to behavioral cues.