DenseFish-v13: A Symmetry-Aware NMS-Free YOLOv13-Mamba Framework for Dense Underwater Fish Detection and Bio-Kinematic Behavior Recognition

Chen, Yujie; Wu, Jiabao; Sun, Maoyuan; Ma, Yiping; Li, Zhiqian; Ma, Zeqi; Xiong, Yang; Wang, Yichen; Guo, Xiaoyin; Huang, Shuai

doi:10.3390/sym18071084


by Yujie Chen ¹, Jiabao Wu ¹, Maoyuan Sun ², Yiping Ma ¹,*, Zhiqian Li ¹, Zeqi Ma ¹, Yang Xiong ³, Yichen Wang ³, Xiaoyin Guo ⁴ and Shuai Huang ²

¹ Merchant Shipping Academy, Shanghai Maritime University, Shanghai 201306, China

² College of Energy and Mechanical Engineering, Shanghai Electric Power University, Shanghai 201306, China

³ Engineering College, Shanghai Ocean University, Shanghai 201306, China

⁴ School of Information, Shanghai Ocean University, Shanghai 201306, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(7), 1084; https://doi.org/10.3390/sym18071084 (registering DOI)

Submission received: 14 May 2026 / Revised: 6 June 2026 / Accepted: 16 June 2026 / Published: 25 June 2026

(This article belongs to the Special Issue Symmetry and Asymmetry in Intelligent Image Processing: Optimization, Security, and Applications)


Abstract

Dense underwater aquaculture poses significant challenges for intelligent image processing because asymmetric occlusion, turbidity, aeration-like bubbles, and motion blur frequently degrade fish contours and quasi-periodic scale textures. These disturbances often cause conventional detectors to miss detections, merge bounding boxes, experience feature collapse, and exhibit unstable counting. To address this problem, we propose DenseFish-v13, a symmetry-aware NMS-free YOLOv13-Mamba framework for dense underwater fish detection and bio-kinematic behavior recognition. The framework integrates a Bio-Harmonic Frequency Gate to preserve biological texture patterns while suppressing bubble-like frequency noise, a Bi-directional Multi-scale Wavelet Mamba backbone for global occlusion-aware structure recovery, and an asymmetry-aware density repulsion strategy to separate highly overlapping fish instances during bipartite matching. In addition, a lightweight Bio-Kinematic Behavior Head converts continuous detections into interpretable trajectory descriptors for behavior-state recognition. Experiments on the Dense-Aqua benchmark, constructed from public aquaculture datasets, show that DenseFish-v13 achieves 64.8% mAP@50:95 and a Counting MAE of 3.7 on the overall test set, while reaching 64.2% mAP@50:95 and a Counting MAE of 4.1 on the extreme-density split. Under a strong synthetic bubble perturbation, the model shows only a 1.3 percentage-point drop in mAP and maintains 125 FPS on Jetson Orin NX. These results demonstrate its effectiveness in robust, real-time underwater aquaculture monitoring.

Keywords:

underwater fish detection; dense aquaculture; symmetry-aware image processing; asymmetric occlusion; aeration noise suppression; YOLOv13-Mamba; NMS-free detection; bio-kinematic behavior recognition; edge intelligent vision

1. Introduction

High-density aquaculture has become an important application scenario for intelligent agricultural inspection, yet reliable visual monitoring in such environments remains highly challenging. As smart agriculture shifts from experience-driven management to data-driven decision-making, agricultural robots and sensing systems are expanding from conventional field-crop applications to broader production scenarios, including aquaculture and livestock monitoring [1,2,3,4,5]. In this transition, optical sensing, machine vision (MV), and artificial intelligence (AI) have become key enabling technologies for automated perception, production assessment, and precision intervention [2,4,5,6]. For aquaculture in particular, recent studies have shown that image-based and sensor-assisted monitoring can support biomass estimation, feeding optimization, and health-related behavioral observation, thereby improving the continuity and intelligence of production management [5,6,7,8,9,10,11]. At the same time, such behavior-oriented AI analysis should be understood as constrained semantic inference based on observable signals rather than full causal reasoning [12]. However, despite substantial progress in terrestrial precision agriculture, the direct transfer of generic computer vision models to underwater inspection robots remains highly challenging, primarily due to a unique phenomenon that we conceptualize as the “Occlusion–Noise Dilemma”. From the perspective of symmetry and asymmetry in intelligent image processing, dense aquaculture scenes are characterized by a conflict between quasi-symmetric biological structures and asymmetric environmental disturbances. Fish bodies usually exhibit continuous contours, bilateral morphology, and quasi-periodic scale textures, whereas occlusion, bubbles, turbidity, and motion blur introduce irregular and asymmetric visual degradation.

The first major dimension of this occlusion–noise dilemma is the severe feature ambiguity induced by extreme target density. In crowded cages, aquatic livestock often form dynamic, overlapping schools, resulting in persistent mutual occlusion and frequent boundary entanglement. In agriculture, mainstream object detection systems are still dominated by CNN- and YOLO-based paradigms [13], while dense object scenarios in orchard fruit detection and maturity recognition have repeatedly shown that heavy overlap leads to missed detections, localization instability, and degraded recall [14,15,16]. Similar limitations have also been observed in fish counting and underwater fish detection tasks, particularly in high-density monitoring settings where overlap and appearance similarity become dominant sources of error [8,9]. At the broader system level, aquatic robots are also evolving toward energy-aware autonomous navigation and decision-making, suggesting that perception modules must ultimately integrate with larger robotic intelligence stacks rather than operate in isolation [17].

The second major dimension originates from the hydro-optical complexity of intensive aquaculture. In practical aquaculture, mechanical aeration and water turbulence may introduce bubble-like visual interference, splashes, and localized scattering, which can significantly disrupt underwater visual observation. Recent underwater image enhancement studies consistently report that underwater perception suffers from scattering, color attenuation, low contrast, and detail blurring, and that enhancement methods often face a trade-off between visibility improvement and texture preservation [18,19,20,21]. In aquaculture scenes, this problem becomes even more acute because the high-frequency visual patterns of bubbles may resemble fish-scale reflections or local body glints. As a result, conventional spatial-domain denoising filters, although capable of suppressing clutter to some extent, may also erase the fine-grained biological textures required for reliable matching, counting, and trajectory extraction [18,19,20].

To address the structural limitations of both dense occlusion and aquatic noise, recent object detection research has increasingly moved toward end-to-end, NMS-free paradigms. RT-DETR explicitly demonstrated that Non-Maximum Suppression (NMS) can degrade both speed and accuracy in real-time detection. At the same time, DETR established bipartite matching as a principled alternative to heuristic post-processing [22,23]. In agricultural detection, lightweight Transformer-based models have also begun to demonstrate the value of global context modeling for target disambiguation in complex natural environments [24]. Nevertheless, the adoption of such globally aware models in underwater agricultural inspection remains constrained by the classical trade-off between contextual reasoning and edge device efficiency.

Recently, State Space Models (SSMs), especially Mamba, have emerged as a promising alternative for achieving global receptive fields with linear computational complexity [25]. Early agricultural studies have already begun to introduce Mamba-like architectures into precision monitoring pipelines, such as blueberry maturity assessment [26]. At the same time, VMamba has further extended selective state-space modeling into visual representation learning [27]. Beyond network architecture alone, temporally coherent evaluation has also become increasingly important in video-centric AI research, where benchmarks now explicitly measure temporal attribute consistency rather than relying solely on frame-wise quality [28]. In agriculture, VMamba-based plant disease identification [29] and broader survey literature on Vision Mamba, including frequency-domain adaptations and hybrid integrations, further indicate that SSM-based visual modeling is becoming a significant research direction for high-resolution, structured vision tasks [30]. Meanwhile, compression-oriented foundation-model studies suggest that deployment bottlenecks may arise not only from the backbone but also from output-layer design [31], and UAV-based agricultural perception has already begun to combine Mamba-related modules with deployment-oriented detector pipelines [32]. Recent optimization research in other large-scale learning systems likewise highlights that training efficiency itself can become a bottleneck in complex multi-objective architectures [33].

To systematically overcome these limitations, we propose DenseFish-v13, a specialized machine vision framework purpose-built for agricultural inspection robots operating in high-density aquaculture settings. We conceptualize a transition from traditional NMS-dependent object detection to an NMS-free, end-to-end architecture, following the broader trend exemplified by YOLOv10 toward more integrated, real-time detection pipelines with reduced reliance on heuristic post-processing [34]. The proposed framework is built upon an official YOLOv13 baseline and further enhanced with linear-complexity state-space modeling to construct a YOLOv13-Mamba backbone. Rather than a nominal modification, this design is a hardware-aware, task-oriented architectural enhancement for underwater agricultural robotics. By combining global contextual modeling with physics-informed learning (e.g., frequency-domain biological prior and kinematic constraints), DenseFish-v13 is designed to address the long-standing accuracy–efficiency paradox in dense aquatic monitoring. Recent advancements in underwater perception have branched into several specialized domains. In dense object detection, traditional NMS-based methods like YOLO-FC [8] have hit a performance ceiling due to heuristic suppression in overlapping schools. NMS-free architectures, pioneered by DETR [23] and further refined by YOLOv10 [34], offer a path toward end-to-end optimization but remain sensitive to feature collapse in turbid waters. While Mamba-like vision models (e.g., VMamba [27], U-Mamba [35]) have shown promise in medical and terrestrial imaging by providing global receptive fields with linear complexity, their application in underwater-specific frequency-domain denoising remains largely unexplored. 
Compared with existing vision-based aquaculture behavior recognition systems [10,36], which mainly focus on behavioral classification or tracking under overlapping conditions, DenseFish-v13 integrates the Discrete Wavelet Transform (DWT) directly into the Mamba scanning mechanism (Bi-MSW-Mamba) for joint spatial-spectral recovery. Unlike general-purpose detectors, our framework uniquely couples an NMS-free head with a Bio-Kinematic Behavior Head, bridging the gap between raw pixel-level detection and high-level ethological interpretation in extreme-density environments.

To align with the Special Issue “Symmetry and Asymmetry in Intelligent Image Processing: Optimization, Security, and Applications”, this study reformulates dense underwater fish recognition as a symmetry–asymmetry modeling problem, in which structured biological patterns must be preserved while irregular occlusions and aeration-induced disturbances are suppressed. The core academic contributions are summarized as follows:

Symmetry-aware global structure modeling: We introduce a YOLOv13-Mamba backbone that replaces purely localized convolutional bottlenecks with Visual State Space Blocks, enabling long-range contextual modeling for recovering partially occluded fish structures while maintaining linear computational complexity and edge deployment feasibility.
Symmetry-preserving spectral disentanglement: We propose a Bio-Harmonic Frequency Gate (B-HFG) that operates in the frequency domain to preserve quasi-periodic biological textures and continuous fish contours while suppressing asymmetric broadband aeration noise.
Asymmetry-aware NMS-free instance separation: We design a Density-Aware Repulsion Loss within a bipartite matching framework, enforcing latent-space separation among highly overlapping fish instances and reducing the risk of feature collapse under dense occlusion.
Bio-kinematic application-level interpretation: We further couple the detector with a lightweight Bio-Kinematic Behavior Head, which transforms continuous detections into motion trajectories and interpretable kinematic descriptors for behavior-state recognition.

The remainder of this paper is organized as follows. Section 2 reviews related work in dense object detection, underwater environmental noise suppression, and global state-space visual modeling. Section 3 presents the architecture and theoretical formulation of DenseFish-v13. Section 4 reports the experimental protocol and comparative results. Section 5 discusses the mechanism, practical significance, and limitations of the proposed framework. Section 6 concludes the study and outlines future research directions.

2. Related Work

This study lies at the intersection of dense underwater object detection, symmetry-aware image enhancement, NMS-free instance separation, and edge-oriented intelligent visual perception. To position DenseFish-v13 more precisely within the scope of symmetry and asymmetry in intelligent image processing, this section reviews prior work from three perspectives: dense underwater object detection and NMS-free architectures, environmental noise suppression in aquatic imagery, and global visual modeling with state-space architectures.

2.1. Dense Object Detection and NMS-Free Architectures

Machine vision has become a central component of agricultural robots for perception, counting, maturity assessment, and yield-related analysis [13]. In practice, YOLO-based detectors remain among the most widely adopted baselines because of their favorable speed–accuracy trade-off, and lightweight variants have already proven useful in agricultural maturity recognition and robotic deployment scenarios [15]. However, dense agricultural scenes continue to expose a structural weakness in conventional detectors. In blueberry ripeness detection, dense occlusions and complex natural backgrounds still degrade localization quality and increase the number of missed detections [14]. Similar problems have been reported in orchard fruit detection under cluttered canopy conditions [16]. In high-density aquaculture counting scenarios, lightweight fish detectors likewise confirm that fish overlap, target adjacency, and appearance similarity remain major barriers to robust counting [8]. Earlier underwater fish-detection frameworks have also highlighted the influence of illumination change, turbidity, and fish-noise separation on detection stability [9].

At a broader monitoring level, dense fish-tracking and behavior modeling studies further indicate that reliable aquaculture perception cannot rely on single-frame localization alone; instead, identity preservation and trajectory continuity become essential when multiple similar targets move within a limited space [10,11]. These findings collectively suggest that dense aquatic perception is not merely a standard object detection problem but a coupled challenge involving overlap, ambiguity, temporal continuity, and biological interpretation.

A central bottleneck in such scenes is the reliance on NMS. In crowded environments, NMS tends to suppress highly overlapping predictions, assuming they represent duplicates, which directly conflicts with the physical reality of multiple adjacent organisms. End-to-end detectors such as DETR replaced this post-processing heuristic with bipartite matching [23], and RT-DETR further demonstrated that removing NMS can improve the speed–accuracy trade-off in real-time detection systems [22]. Meanwhile, recent YOLO research has also moved toward more integrated end-to-end optimization, as exemplified by YOLOv10, further reflecting the broader trend away from heuristic post-processing in real-time detection pipelines [34]. Related agricultural work has already begun to adopt lightweight Transformer-style detectors to exploit global context for target disambiguation [24]. Nevertheless, current end-to-end paradigms still do not fully resolve the problem of latent feature collapse when distinct targets occupy nearly identical spatial neighborhoods. DenseFish-v13 is therefore positioned not merely as an NMS-free detector, but as a detection framework that combines bipartite matching with explicit latent-space repulsion to preserve instance separability in extreme-density aquaculture.
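The conflict between NMS and dense schools can be made concrete with a small sketch. The boxes, scores, and the 0.5 IoU threshold below are hypothetical values chosen for illustration, not settings taken from this study. The code shows plain greedy NMS discarding a second, genuine fish because its box overlaps a higher-scoring neighbour.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Classic greedy NMS: visit boxes by descending score and keep a box
    only if it does not overlap any already-kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical scene: two distinct fish swimming side by side, plus one isolated fish.
boxes = [(10, 10, 110, 50), (30, 14, 130, 54), (300, 200, 380, 240)]
scores = [0.92, 0.88, 0.90]

kept = greedy_nms(boxes, scores)
print("ground-truth fish:", len(boxes), "| boxes kept by NMS:", len(kept))
```

In this toy scene the two adjacent fish overlap with an IoU above the threshold, so the lower-scoring one is treated as a duplicate and the count is one short. This is the failure mode that one-to-one bipartite matching is meant to avoid.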

Recent studies published in Symmetry further confirm the relevance of symmetry-aware modeling to underwater object detection and marine organism recognition. Zhu et al. proposed a YOLOv4-embedding framework for fish detection and marine organism recognition, demonstrating the feasibility of real-time underwater robotic perception in complex marine environments [37]. Zhao et al. developed YOLOv7-SN for underwater target detection, showing that improved YOLO-based architectures remain effective for underwater recognition tasks [38]. More recently, Sun et al. optimized YOLOv8 for underwater robot target detection, emphasizing that asymmetrically distributed underwater environments and low image quality can substantially reduce recognition accuracy [39]. Feng and Liu further proposed IFEM-YOLOv13 for robust underwater object detection in degraded environments, providing a particularly relevant reference for YOLOv13-based underwater visual recognition [40]. In addition, Li et al. investigated multi-scale feature enhancement for underwater object detection, indicating that robust scale-aware representation is essential for targets affected by low visibility, scale variation, and complex aquatic backgrounds [41]. However, these studies mainly focus on generic underwater targets, whereas dense aquaculture scenes also require instance separation amid extreme fish overlap, aeration noise, and behavior-oriented trajectory interpretation.

2.2. Environmental Noise Suppression in Aquatic Environments

Reliable underwater behavior analysis requires visually stable and biologically faithful inputs. Yet underwater imagery is intrinsically degraded by color attenuation, scattering, blur, low contrast, and interference from complex suspended particles. Recent work has shown that conventional image-quality improvements do not always translate into better downstream detection performance, emphasizing the need to evaluate enhancement methods from a task-oriented perspective rather than a purely visual one [18]. This point is especially important in aquaculture, where a visually “cleaner” image may still be detrimental if discriminative fish-scale or contour information is removed in the process.

Recent underwater image enhancement methods have explored both transform-based and deep learning paradigms. DCT-based enhancement methods improve color and detail recovery by employing attenuation-aware contrast reconstruction [19]. At the same time, GFRENet introduces gated linear units and fast Fourier convolution to balance enhancement quality with computational efficiency [20]. Frequency–spatial fusion strategies further show that combining spectral and spatial cues can improve restoration quality and preserve more useful structural information for subsequent detection [21]. Collectively, these studies suggest that the frequency domain provides a natural space for handling underwater degradation because certain distortions become more separable after spectral decomposition.

However, most existing underwater enhancement studies target generic image restoration rather than dense biological monitoring. In practical aquaculture, the challenge is not simply to restore visibility but to preserve the subtle textures and contours critical to distinguishing closely overlapping fish. Accordingly, our method departs from purely spatial denoising and instead introduces a task-driven spectral gate that explicitly decouples structured biological patterns from stochastic aeration-induced noise in the frequency domain.

Symmetry-driven underwater enhancement also provides a useful methodological support for the proposed B-HFG module. You et al. introduced PIC-GAN, a symmetry-driven underwater image enhancement framework based on a symmetric U-Net architecture, partial instance normalization, and color detail modulation, to improve color restoration, texture reconstruction, and detail preservation in degraded underwater scenes [42]. This line of research suggests that underwater enhancement should not only improve visual appearance but also preserve structured features useful for downstream recognition. In contrast to generic enhancement methods, our B-HFG is embedded directly within the detection framework and optimized in a task-driven manner to preserve quasi-periodic biological textures while suppressing asymmetric aeration-induced frequency noise.

2.3. Global Visual Modeling and State-Space Architectures

When severe mutual occlusion occurs, reliable recognition requires more than local convolutional evidence. Vision Transformers have improved this situation by enabling long-range dependency modeling, and agricultural detection studies have already confirmed the usefulness of global context in complex natural environments [24]. Yet the quadratic complexity of self-attention remains a serious limitation for embedded agricultural robots operating on edge hardware. This motivates the search for models that can retain global reasoning capacity without incurring the computational cost of Transformers.

State Space Models provide such a possibility. Mamba established a linear-time selective state-space framework that can propagate or forget information adaptively across long contexts [25]. On the vision side, VMamba demonstrated that state-space visual backbones can inherit global receptive fields while preserving favorable scaling characteristics [27]. Subsequent agricultural studies have begun to extend this paradigm into concrete applications, including Mamba-based blueberry maturity monitoring [26], VMamba-based plant disease identification [29], and UAV agricultural monitoring systems that combine detection-oriented pipelines with Mamba-related visual modules [32]. Survey work has further synthesized this trend and highlighted Mamba’s promise for high-resolution visual analysis, including hybrid CNN–Transformer–Mamba designs and frequency-domain adaptations [30].
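The idea of propagating or forgetting information adaptively in linear time can be illustrated with a deliberately simplified gated recurrence. This is a toy sketch and not Mamba's actual parameterization: the scalar state, the `gate_logits` input, and the update rule `h_t = a_t * h_{t-1} + (1 - a_t) * x_t` are assumptions made for illustration only.

```python
import math


def gated_linear_scan(xs, gate_logits):
    """Toy input-dependent recurrence over a 1D sequence.

    a_t = sigmoid(gate_logit_t) lies in (0, 1): values near 1 retain the
    previous state (propagate), values near 0 overwrite it with the current
    input (forget). The loop touches each element once, so the cost grows
    linearly with sequence length.
    """
    h = 0.0
    outputs = []
    for x, g in zip(xs, gate_logits):
        a = 1.0 / (1.0 + math.exp(-g))
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs


# Hypothetical sequence: a strong response at the first position, then nothing.
xs = [1.0, 0.0, 0.0, 0.0]
remember = gated_linear_scan(xs, [-50.0, 50.0, 50.0, 50.0])  # write once, then hold
forget = gated_linear_scan(xs, [-50.0, -50.0, -50.0, -50.0])  # always overwrite
print(remember, forget)
```

With retaining gates the first response is carried across the whole sequence, whereas with overwriting gates it vanishes after one step. A self-attention layer would instead compare every position with every other position, which is the quadratic cost discussed above.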

For behavior-oriented robotic monitoring, another important issue is temporal consistency. Recent benchmark research in multi-speaker audio-video generation has emphasized that coherent sequence-level evaluation requires explicit attention to temporal attribute consistency rather than frame-wise scores alone [28]. Although the task setting differs from that of aquaculture monitoring, this methodological insight is directly relevant to fish behavior analysis, where trajectory continuity and cross-frame stability are fundamental. At the same time, recent work on model compression has shown that deployment bottlenecks can also arise from output-side design rather than only from feature extractors [31]. In addition, optimization research on complex generative pipelines suggests that mixed, more efficient training dynamics may become increasingly important as multi-objective learning systems grow more sophisticated [33]. Finally, recent theoretical work also reminds us not to over-interpret structured intermediate outputs as evidence of genuine reasoning [12]. This observation is useful in positioning aquaculture behavior analysis more carefully: the goal is not to claim abstract reasoning, but to build interpretable, constrained, and temporally grounded semantic inference from visual trajectories.

In summary, the current literature reveals a fragmented landscape. Dense detection methods improve global disambiguation but often remain computationally heavy; enhancement methods improve visual quality but may not preserve biologically meaningful detail; and emerging SSM-based approaches offer a promising bridge between global modeling and efficiency, yet have rarely been specialized for dense underwater agricultural inspection. DenseFish-v13 is explicitly designed to address this gap by combining spectral denoising, NMS-free matching, global state-space modeling, and trajectory-level semantic interpretation into a unified framework.

2.4. Innovation and Positioning

The innovation of DenseFish-v13 relative to the state-of-the-art is summarized across three dimensions: (1) Structural Innovation: Unlike vanilla VMamba, our Bi-MSW-Mamba performs sub-band decomposition via DWT, specifically targeting the frequency signatures of aeration bubbles. (2) Algorithmic Innovation: We introduce a latent-space repulsion term, the Density-Aware Repulsion Loss, that resolves the ‘duplicate box’ problem in NMS-free heads without the computational overhead of Transformer-based bipartite matching. (3) Application Innovation: The framework provides the first end-to-end solution that transitions from noisy raw imagery to interpretable behavioral descriptors, achieving a 125 FPS throughput that is optimized for real-time robotic deployment.

3. Methodology

In this section, we present DenseFish-v13, a unified vision framework designed to address extreme-density and severe occlusion challenges in underwater aquaculture monitoring. Unlike conventional object detection frameworks, our approach explicitly incorporates both spectral (physical) and behavioral (biological) priors into the learning process.

3.1. Overall Architecture Overview

To ensure technical clarity, we distinguish between the overall framework and the core neural architecture. DenseFish-v13 is a unified aquaculture monitoring framework that encompasses the entire pipeline from spectral denoising (B-HFG) to trajectory-based behavior recognition. In contrast, YOLOv13-Mamba specifically denotes the deep learning architecture optimized for detection within this framework, featuring the Bi-MSW-Mamba backbone and the NMS-free prediction head. This distinction allows for a clear separation between the task-oriented application modules and the foundational computer vision model.

To systematically overcome the limitations of legacy CNNs and early YOLO variants, we design an NMS-free, end-to-end detection pipeline. The complete data flow of DenseFish-v13 operates as follows: raw, noise-degraded underwater images captured by the robot are first processed by the Bio-Harmonic Frequency Gate (B-HFG), which isolates the target biological signals from mechanical aeration noise in the frequency domain. The purified feature maps are then ingested by the YOLOv13-Mamba Backbone, where Visual State Space Blocks perform global feature modeling with linear computational complexity to reconstruct heavily occluded targets. Subsequently, the extracted features enter the NMS-free prediction head, which is constrained by the Density-Aware Repulsion Loss to enforce one-to-one instance separation. Finally, the optimized bounding-box coordinates are fed into the Bio-Kinematic Behavior Head, enabling the transition from static detection to temporal behavioral analysis.

Unlike vanilla Mamba, which scans flattened sequences, we propose the Bi-directional Multi-scale Wavelet Mamba (Bi-MSW-Mamba). This module first performs a Discrete Wavelet Transform (DWT) to decompose features into high-frequency (noise/texture) and low-frequency (structure) sub-bands. We then deploy parallel Mamba scanners with Dynamic Selective Scanning (DSS) that adaptively adjust scanning routes based on the energy distribution across wavelet sub-bands. This allows the model to prioritize biological structural features while suppressing high-frequency aeration noise in a single pass, thereby enabling global context modeling with linear computational complexity. Together, these components form a tightly coupled pipeline that jointly addresses spectral noise, global occlusion, feature entanglement, and behavioral interpretation in dense aquaculture scenarios. The overall architecture of DenseFish-v13 is shown in Figure 1.
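The sub-band decomposition step can be sketched with a single-level 2D Haar transform in NumPy. The wavelet family is not specified in the text, so the choice of Haar is an assumption, and the helper names `haar_dwt2` and `energy_fractions` are hypothetical. The sketch only shows how a map splits into one low-frequency and three high-frequency sub-bands and how the energy distribution across them can be measured. It does not implement the Dynamic Selective Scanning itself.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of an array with even height and width.

    Returns (approx, row_detail, col_detail, diag_detail), each of half resolution.
    """
    a = x[0::2, 0::2]  # top-left sample of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0       # low-frequency structure
    row_detail = (a + b - c - d) / 2.0   # change between the two rows
    col_detail = (a - b + c - d) / 2.0   # change between the two columns
    diag_detail = (a - b - c + d) / 2.0  # diagonal (checkerboard-like) change
    return approx, row_detail, col_detail, diag_detail


def energy_fractions(bands):
    """Share of total squared magnitude carried by each sub-band."""
    energies = np.array([np.sum(band ** 2) for band in bands])
    return energies / energies.sum()


# Hypothetical inputs: a smooth patch versus a fine checkerboard (bubble-like) patch.
smooth = np.ones((8, 8))
checker = (np.indices((8, 8)).sum(axis=0) % 2) * 2.0 - 1.0

print("smooth :", energy_fractions(haar_dwt2(smooth)))
print("checker:", energy_fractions(haar_dwt2(checker)))
```

A smooth patch concentrates its energy in the approximation band, while a pixel-level checkerboard concentrates it in the diagonal detail band. An energy profile of this kind is what a scanner could use to weight structural sub-bands more heavily than noisy ones.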

3.2. Symmetry-Preserving Bio-Harmonic Frequency Gate (B-HFG) for Spectral Denoising

In intensive aquaculture environments, mechanical aeration generates substantial bubble-induced noise that significantly degrades visual quality and impairs the performance of standard machine vision systems. From a spectral perspective, these bubble-induced disturbances exhibit broadband high-frequency characteristics. In contrast, fish-body textures—particularly scale patterns—tend to exhibit more structured, quasi-periodic frequency distributions due to their regular biological morphology. Conventional spatial-domain denoising methods, such as Gaussian filtering, fail to distinguish between these components and often suppress both noise and informative biological features, thereby degrading discriminative capability.

To address this issue, we propose a learnable spectral gating mechanism, termed the Bio-Harmonic Frequency Gate (B-HFG), which operates directly in the frequency domain to disentangle noise and biologically relevant signals at the feature level. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the input feature map extracted from the initial stem of the network. We first transform $X$ into the frequency domain using the two-dimensional Fast Fourier Transform (FFT). The complex spectrum is decomposed into magnitude and phase components, with only the magnitude modulated while the phase is preserved.

$$F(X)(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X(h, w)\, e^{-j 2\pi \left( \frac{hu}{H} + \frac{wv}{W} \right)} \qquad (1)$$

Instead of employing a fixed low-pass or band-pass filter, we introduce a learnable Harmonic Attention Map $M \in \mathbb{R}^{H \times W \times C}$ to adaptively reweight frequency components. Specifically, the magnitude spectrum $|F(X)|$ is fed into a lightweight attention module composed of a $1 \times 1$ convolution followed by Batch Normalization and a Sigmoid activation function. This design enables the network to dynamically emphasize frequency bands associated with structured biological textures while suppressing stochastic high-frequency noise from aeration.

The filtered frequency representation is obtained by applying element-wise modulation to the magnitude spectrum while preserving the phase component. It is then reconstructed in the spatial domain via the inverse FFT.

$$X_{\mathrm{clean}} = \mathrm{IFFT}\left( F(X) \odot \sigma(M) \right) \qquad (2)$$

where $\odot$ denotes element-wise multiplication and $\sigma(\cdot)$ is the Sigmoid activation function. The entire process is fully differentiable and optimized end-to-end, enabling the model to learn a task-specific spectral filtering strategy that preserves discriminative biological structures while attenuating noise.

By integrating spectral-domain processing with learnable attention, the proposed B-HFG module functions as an adaptive band-pass filter tailored to the intrinsic frequency characteristics of aquatic livestock. This design not only improves feature quality under severe visual degradation but also provides more stable and informative inputs for subsequent detection and behavior modeling stages, particularly in dense and noisy aquaculture scenarios.

In addition to spectral refinement, we introduce a complementary motion constraint, the Hydrodynamic-Informed Constraint (HIC) Loss, to regularize predictions and counter feature collapse at extreme densities.

Formally, we define the Hydrodynamic-Informed Constraint Loss as:

\begin{matrix} L_{H I C} = \frac{1}{N} \sum_{i = 1}^{N} ‖ a_{i} (t) ‖^{2} \end{matrix}

(3)

where N is the number of fish instances in the frame and

a_{i} (t)

is the acceleration (the second-order temporal difference) of the i-th fish’s centroid:

\begin{matrix} a_{i} (t) = (v_{i} (t) - v_{i} (t - 1)) / Δ t \end{matrix}

(4)

Fish naturally exhibit smooth, bounded motion patterns. Excessive or abrupt acceleration indicates physically implausible “teleporting” across frames and is penalized. By constraining acceleration magnitudes,

L_{H I C}

prevents fragmented detections from being linked across large spatial gaps, improving count stability under severe occlusion.
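Equations (3) and (4) can be sketched in plain Python as follows. The sketch assumes that the velocity is obtained as a first-order finite difference of the centroid position over three consecutive frames; the function name and the input layout are hypothetical and not taken from the original implementation.

```python
def hic_loss(tracks, dt=1.0):
    """Eq. (3)-(4): mean squared centroid acceleration over N fish.

    tracks : list of N triples [(x, y) at t-2, (x, y) at t-1, (x, y) at t]
    dt     : frame interval
    """
    if not tracks:
        return 0.0
    total = 0.0
    for p_prev2, p_prev1, p_now in tracks:
        # Velocities v(t-1) and v(t) as finite differences of the centroid (assumption).
        v_prev = [(b - a) / dt for a, b in zip(p_prev2, p_prev1)]
        v_now = [(b - a) / dt for a, b in zip(p_prev1, p_now)]
        # Eq. (4): a(t) = (v(t) - v(t-1)) / dt
        acc = [(vn - vp) / dt for vp, vn in zip(v_prev, v_now)]
        total += sum(c * c for c in acc)     # squared L2 norm of a(t)
    return total / len(tracks)               # Eq. (3): average over the N instances

# Usage: a steady swimmer contributes nothing, a sudden jump is penalized.
steady = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
jump = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
loss = hic_loss([steady, jump])
```

A track moving at constant velocity has zero acceleration and therefore zero penalty, so the term only acts on abrupt frame-to-frame changes.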

We incorporate a biophysical prior that ensures the predicted movement of aquatic livestock adheres to the hydrodynamic-inspired kinematic smoothness constraint. By regularizing the detection head using the derivative of the fish’s centroid acceleration and the swimming-angle angular momentum, the model eliminates ‘teleporting’ or physically impossible overlapping predictions. This physics-in-the-loop approach significantly enhances the reliability of NMS-free matching by pruning mathematically valid but biologically impossible candidate trajectories. The overall structure of the proposed B-HFG module with HIC loss is illustrated in Figure 2.

3.3. YOLOv13-Mamba for Global Structure Recovery Under Asymmetric Occlusion

Severe occlusion (>80%) causes the local features of adjacent fish to merge in standard CNNs. To resolve this, the machine vision network must capture long-range spatial dependencies across densely occluded regions to infer the shape of an occluded fish based on visible parts (e.g., head/tail) and the context of the surrounding swarm.

To clarify the internal mechanism of the proposed backbone, each stage consists of a series of Visual State Space (VSS) Blocks. As illustrated in the structural hierarchy, the 2D Selective Scan (SS2D) is not a separate external module but the core operator embedded within each VSS Block. Specifically, a VSS Block follows a dual-path structure: one path performs linear projections and activations, while the other employs the SS2D kernel to scan the feature map in four distinct directions (top-left to bottom-right, and their rotations) to achieve a global receptive field with linear complexity. This hierarchical integration allows the Bi-MSW-Mamba stage to capture spatial dependencies across multiple scales while maintaining computational efficiency.

We construct the YOLOv13-Mamba backbone by replacing the legacy convolutional bottlenecks with Bi-directional Multi-scale Wavelet Mamba (Bi-MSW-Mamba) Blocks, as introduced in Section 3.1. Specifically, the Bi-MSW-Mamba block is custom-designed based on the Vision State Space Model (VMamba) paradigm [27], rather than being an entirely independent architecture. Each Bi-MSW-Mamba Block first decomposes the feature map via Discrete Wavelet Transform (DWT) into structural low-frequency and textural high-frequency sub-bands, then applies parallel directional Mamba scanners with Dynamic Selective Scanning (DSS) before reconstructing the output. The state-space transition within each block follows:

\begin{matrix} h_{t} = A h_{t - 1} + B x_{t}, y_{t} = C h_{t} \end{matrix}

(5)

where

h_{t}

denotes the hidden state at time step t,

A

represents the state transition matrix,

B

is the input projection matrix, and

C

serves as the output projection matrix to generate the response

y_{t}

. Crucially, we employ a 2D Selective Scan (SS2D) that scans the feature map in four distinct directions. In addition, cross-stage information propagation is implicitly modeled through the state-space formulation, thereby preserving long-range dependencies across multiple feature hierarchies. This ensures that even if a target is partially occluded, information from its visible neighbors helps reconstruct its latent representation. This global state-space modeling mechanism, with linear complexity O(N), enables the agricultural inspection robot to resolve dense clusters without incurring the prohibitive computational overhead of traditional Transformers.
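The recurrence in Equation (5) can be sketched as a single linear pass over a flattened scan direction. The sketch below assumes a diagonal, time-invariant A and scalar inputs; in a selective scan the parameters additionally depend on the input, which is omitted here. The function name is hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Run Eq. (5) over a sequence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    xs      : list of scalar inputs x_1..x_T (one scan direction of a flattened map)
    a, b, c : lists of length D; A is taken as diag(a), B as column b, C as row c
    """
    h = [0.0] * len(a)                       # initial hidden state h_0 = 0
    ys = []
    for x in xs:
        # State update: each state dimension decays and absorbs the new input.
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        # Readout: project the hidden state to the output.
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

# Usage: an impulse at the first position still influences later positions.
forward = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
# A reversed scan direction is obtained by flipping the input and the output.
backward = ssm_scan([1.0, 0.0, 0.0][::-1], a=[0.5], b=[1.0], c=[1.0])[::-1]
```

Each position is visited once, so the cost grows linearly with the sequence length, while the hidden state carries information from every earlier position along the scan.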

The architecture comprises a convolutional stem with a stride of 2, followed by four stages of Bi-MSW-Mamba blocks. Crucially, each stage (including Stage 1) is preceded by a downsampling operation with a stride of 2. This configuration results in a total of five downsampling steps (one in the stem and one before each stage), so the four stages produce feature maps with resolutions of H/4, H/8, H/16, and H/32 relative to the input image. These correspond to the multi-scale feature outputs P2, P3, P4, and P5, respectively, ensuring spatial consistency for detecting objects across varying scales.

To further enhance robustness under heterogeneous underwater conditions, the backbone is augmented with a Mixture-of-Experts Spectral Gate (MoE-SG) integrated into intermediate feature stages. This module comprises a bank of spectral ‘experts’ optimized for different water conditions (e.g., high-aeration, high-turbidity, and clear water). A Gating Network dynamically computes a routing weight based on the global entropy of the input image and performs a weighted soft-merging of expert outputs. This allows DenseFish-v13 to autonomously reconfigure its filtering logic in real time as the robot moves through different cage zones, ensuring robust ‘spectral dehazing’ regardless of bubble density.

A ‘decompose-then-scan’ strategy characterizes the internal workflow of the Bi-MSW-Mamba Block. At the beginning of each block, a Discrete Wavelet Transform (DWT) layer decomposes the input feature map into four frequency sub-bands (LL, LH, HL, and HH), separating structural information from textural noise. Parallel Mamba scanners with Dynamic Selective Scanning (DSS) are then applied to these sub-bands to capture long-range dependencies across different frequency bands. Finally, an Inverse Discrete Wavelet Transform (IDWT) or a fusion layer reconstructs the features. This integration of DWT within the VSS Block framework, illustrated in Figure 3, is the key differentiator that enables our model to recover partially occluded fish structures while suppressing high-frequency aeration noise.
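The decomposition and reconstruction steps can be illustrated with a one-level 2D Haar transform in NumPy. The choice of the Haar wavelet is an assumption made for this sketch, since the wavelet family is not specified above, and the LH/HL naming convention varies between libraries. The directional scanners would be applied to each sub-band between the two calls.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT of an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]      # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]      # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0               # low-pass in both directions (structure)
    lh = (a + b - c - d) / 2.0               # detail across rows
    hl = (a - b + c - d) / 2.0               # detail across columns
    hh = (a - b - c + d) / 2.0               # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (2H, 2W) array from four sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

# Usage: decompose a feature map, process the sub-bands, then reconstruct.
feature = np.random.default_rng(0).normal(size=(8, 8))
ll, lh, hl, hh = haar_dwt2(feature)
rebuilt = haar_idwt2(ll, lh, hl, hh)
```

Each sub-band has half the spatial resolution of the input, and without any processing in between the inverse transform recovers the input exactly.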

3.4. Asymmetry-Aware Density Repulsion Loss for NMS-Free Instance Separation

To achieve robust matching performance, we eliminate the heuristic Non-Maximum Suppression (NMS) post-processing step, which often removes valid overlapping boxes in dense crowds. Instead, we adopt a bipartite matching strategy to establish a direct one-to-one correspondence between predictions and ground-truth targets. This can be formulated as an optimization problem that seeks the optimal permutation

\hat{σ}

that minimizes the matching cost between ground-truth targets

y_{i}

and predictions

{\hat{y}}_{σ (i)}

:

\begin{matrix} \hat{σ} = \arg \min_{σ \in S_{N}} \sum_{i = 1}^{N} L_{m a t c h} (y_{i}, {\hat{y}}_{σ (i)}) \end{matrix}

(6)

where

S_{N}

represents the space of all permutations of N elements. However, in extreme-density scenarios, bounding boxes of distinct targets still share highly overlapping ground-truth regions, causing their latent feature representations to collapse into an indistinguishable cluster during the optimization of

L_{m a t c h}

.

To achieve stable NMS-free detection, we employ a One-to-One (O2O) assignment strategy during both training and inference. The core of this mechanism is the construction of a Bipartite Matching Cost

C_{m a t c h}

, which determines the optimal assignment between a set of

N

predictions

\hat{y}

and

N

ground-truth targets

y

. The matching cost for each pair

(y_{i}, {\hat{y}}_{σ (i)})

is formulated as:

\begin{matrix} C_{m a t c h} = λ_{c l s} \cdot L_{c l s} (y_{i}, {\hat{p}}_{σ (i)}) + λ_{L 1} \cdot ‖ b_{i} - {\hat{b}}_{σ (i)} ‖_{1} + λ_{i o u} \cdot L_{i o u} (b_{i}, {\hat{b}}_{σ (i)}) \end{matrix}

(7)

where

\hat{p}

and

\hat{b}

denote the predicted class probabilities and bounding-box coordinates, respectively. We set the balancing coefficients to

λ_{c l s} = 0.5, λ_{L 1} = 5.0

, and

λ_{i o u} = 2.0

based on empirical tuning.
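The matching objective of Equation (6) with the pairwise cost of Equation (7) can be sketched in plain Python. Several choices here are assumptions for illustration only: boxes are given as (x1, y1, x2, y2) in normalized coordinates, the classification term is taken as one minus the predicted fish probability, the IoU term as one minus the IoU, and the optimal permutation is found by exhaustive search, which is feasible only for very small N. All function names are hypothetical.

```python
from itertools import permutations

def iou(b1, b2):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    union = (b1[2] - b1[0]) * (b1[3] - b1[1]) + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter
    return inter / union if union > 0 else 0.0

def pair_cost(gt_box, pred_box, pred_prob, lam_cls=0.5, lam_l1=5.0, lam_iou=2.0):
    """Eq. (7) for one (ground truth, prediction) pair."""
    l_cls = 1.0 - pred_prob                               # assumed classification term
    l_l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    l_iou = 1.0 - iou(gt_box, pred_box)                   # assumed IoU term
    return lam_cls * l_cls + lam_l1 * l_l1 + lam_iou * l_iou

def best_permutation(gt_boxes, pred_boxes, pred_probs):
    """Eq. (6): arg-min of the summed cost over all permutations of N elements."""
    n = len(gt_boxes)

    def total(sigma):
        return sum(pair_cost(gt_boxes[i], pred_boxes[sigma[i]], pred_probs[sigma[i]])
                   for i in range(n))

    return min(permutations(range(n)), key=total)

# Usage: predictions listed in swapped order are matched back to their targets.
gts = [(0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
preds = [(2.0, 2.0, 3.0, 3.0), (0.0, 0.0, 1.0, 1.0)]
sigma_hat = best_permutation(gts, preds, [0.9, 0.9])
```

The Hungarian algorithm reaches the same optimum in polynomial time; SciPy's `scipy.optimize.linear_sum_assignment` is one readily available solver for the resulting cost matrix.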

During the training phase, DenseFish-v13 employs a dual-labeling strategy: a standard One-to-Many (O2M) branch provides rich supervisory signals to accelerate convergence, while the NMS-free O2O branch learns to suppress duplicate boxes by enforcing a strict one-to-one correspondence via the Hungarian algorithm. During the inference stage, the O2M branch and the NMS post-processing are discarded. The model directly outputs the predictions from the O2O head, which, by virtue of the Latent Repulsion Loss and bipartite matching, has already learned to inhibit redundant activations for the same fish instance. This design significantly reduces end-to-end latency while maintaining high precision in dense scenes.

To prevent the feature collapse described above, we propose a Latent Repulsion Loss (

L_{r e p}

) that enforces mathematical orthogonality between overlapping targets during bipartite matching. Let

f_{i}

and

f_{j}

be the latent feature vectors of two predicted objects, and

B_{i}

,

B_{j}

be their corresponding predicted bounding boxes. We define a Repulsion Weight

w_{i j}

based on their spatial overlap (Intersection over Union, IoU):

\begin{matrix} w_{i j} = 1 \{ I o U (B_{i}, B_{j}) > τ \} \cdot I o U (B_{i}, B_{j}) \end{matrix}

(8)

where

τ

is a predefined spatial density threshold, and

1 \{ \cdot \}

is the indicator function that activates the penalty only when the overlap exceeds

τ

. The repulsion loss is then formulated as the cosine similarity between the latent feature vectors of highly overlapping pairs:

\begin{matrix} L_{r e p} = \frac{1}{N_{p a i r}} \sum_{i \neq j} w_{i j} \cdot \frac{f_{i} \cdot f_{j}}{‖ f_{i} ‖ ‖ f_{j} ‖} \end{matrix}

(9)

where

N_{p a i r}

denotes the total number of overlapping target pairs within the current frame. This loss function acts as a repulsive force in the feature space: if two objects are spatially close (indicating a high IoU) but represent distinct individuals, the network is penalized when their latent features exhibit high similarity. This encourages the network to distinguish even nearly coincident bounding boxes, directly addressing the occlusion dilemma in automated agricultural inspection.
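Equations (8) and (9) can be sketched in plain Python as follows. The threshold value used here is a placeholder, since τ is selected by grid search and its value is not stated; boxes are assumed to be (x1, y1, x2, y2); and the pair count is taken over the same ordered pairs as the sum. The function names are hypothetical.

```python
import math

def iou(b1, b2):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    union = (b1[2] - b1[0]) * (b1[3] - b1[1]) + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter
    return inter / union if union > 0 else 0.0

def repulsion_loss(boxes, feats, tau=0.5):
    """Eq. (9): IoU-weighted cosine similarity averaged over overlapping pairs."""
    total, n_pair = 0.0, 0
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i == j:
                continue
            overlap = iou(boxes[i], boxes[j])
            if overlap <= tau:               # Eq. (8): indicator is zero below the threshold
                continue
            norm = math.sqrt(sum(v * v for v in feats[i])) * math.sqrt(sum(v * v for v in feats[j]))
            if norm == 0.0:                  # skip degenerate zero vectors
                continue
            dot = sum(p * q for p, q in zip(feats[i], feats[j]))
            total += overlap * dot / norm    # w_ij times cosine similarity
            n_pair += 1
    return total / n_pair if n_pair else 0.0

# Usage: two coincident boxes with identical features give the maximum penalty,
# while orthogonal features give none.
boxes = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
collapsed = repulsion_loss(boxes, [[1.0, 0.0], [1.0, 0.0]])
separated = repulsion_loss(boxes, [[1.0, 0.0], [0.0, 1.0]])
```

Pairs whose overlap stays below the threshold contribute nothing, so the loss only acts on crowded regions.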

The integration of the Density-Aware Repulsion Loss (

L_{r e p}

) into the NMS-free YOLO head follows a post-matching refinement logic. Specifically, after the Bipartite Matching (Equation (6)) establishes a one-to-one correspondence between predictions and ground truths,

L_{r e p}

is applied as a latent-space constraint. For any two prediction boxes

{\hat{b}}_{i}

and

{\hat{b}}_{j}

that exhibit an IoU higher than the threshold

τ

, the loss function (Equation (9)) calculates the cosine similarity between their corresponding feature vectors

f_{i}

and

f_{j}

.

By penalizing high similarity between overlapping boxes,

L_{r e p}

acts as a ‘Latent Force Field’ that prevents feature collapse. In standard detectors, duplicate predictions often share nearly identical features, causing them to persist after matching. In DenseFish-v13, if the model attempts to generate a duplicate box for an already-matched fish instance, the repulsion loss forces the latent features to remain orthogonal. Consequently, during the optimization of the classification branch (

L_{c l s}

), only one box can maintain high confidence, while the others are suppressed to a background state due to a lack of distinct, supportive features. This mechanism effectively replaces the heuristic-based NMS with a learnable, feature-level suppression strategy.

To fundamentally solve the ‘Occlusion Dilemma,’ we implement Contrastive Prototype Orthogonalization (CPO) in the latent space. We maintain a dynamic bank of ‘Biological Prototypes’ representing different fish orientations and scale patterns. During the bipartite matching phase, the model maximizes the Mutual Information between the predicted feature and its assigned prototype while enforcing Gram-Schmidt Orthogonality to avoid overlapping neighbor features. This forces the neural network to learn distinctive ‘identity signatures’ for individual fish even under severe spatial overlap, enabling the NMS-free head to distinguish between visually merged targets. The density-aware feature decoupling mechanism for NMS-free matching is illustrated in Figure 4.

3.5. Bio-Kinematic Behavior Head for Aquatic Livestock

Static object detection alone is insufficient for comprehensive welfare monitoring in intensive aquaculture scenarios. To capture temporal behavioral dynamics, we introduce a lightweight Bio-Kinematic Behavior Head that performs trajectory-based kinematic analysis and semantic state inference.

The module first associates detections across consecutive frames using a tracking algorithm, generating continuous object trajectories. This design is compatible with the NMS-free detection paradigm and ensures stable identity preservation in the presence of dense occlusion. Based on the obtained trajectories, kinematic features are extracted by computing instantaneous velocity and turning angle:

\begin{matrix} v_{i} = \frac{\sqrt{(x_{t} - x_{t - 1})^{2} + (y_{t} - y_{t - 1})^{2}}}{Δ t}, θ_{i} = \arctan 2 (y_{t} - y_{t - 1}, x_{t} - x_{t - 1}) \end{matrix}

(10)

where

(x_{t}, y_{t})

denotes the center coordinates of the detected object at time step

t

, and

Δ t

is the frame interval.
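Equation (10) translates directly into a few lines of Python. The function name is hypothetical, and the heading is returned in radians in the range (−π, π], which is what `math.atan2` provides.

```python
import math

def step_kinematics(p_prev, p_now, dt):
    """Eq. (10): instantaneous speed and heading between two consecutive centers.

    p_prev, p_now : (x, y) box centers at time steps t-1 and t
    dt            : frame interval
    """
    dx = p_now[0] - p_prev[0]
    dy = p_now[1] - p_prev[1]
    speed = math.hypot(dx, dy) / dt          # Euclidean displacement per unit time
    heading = math.atan2(dy, dx)             # direction of motion in radians
    return speed, heading

# Usage: a displacement of (3, 4) pixels over one frame interval.
speed, heading = step_kinematics((0.0, 0.0), (3.0, 4.0), dt=1.0)
```

Applying this to every consecutive pair of centers along a trajectory yields the velocity and angle sequences from which the trajectory statistics are computed.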

A rule-based Bio-Logic Tree is then employed to infer behavioral states from kinematic statistics. Specifically,

\bar{v}

denotes the average velocity along the trajectory, and

Var (θ)

represents the variance of the turning angles. The kinematic thresholds governing behavioral states

δ_{h i g h}, δ_{l o w}, ϵ, H_{s u r f a c e}

are not manually assigned constants but are determined through a two-step optimization process. First, we perform a statistical analysis of the distribution of annotated validation trajectories to estimate characteristic motion ranges for different behavioral states. In particular,

δ_{l o w}

is initialized from the lower velocity percentile of hypoxia-related floating trajectories, whereas

δ_{h i g h}

and

ϵ

are initialized from the upper velocity and angular-variance percentiles of feeding trajectories. The surface-proximity threshold

H_{s u r f a c e}

is initialized according to the vertical position distribution of hypoxia-related floating samples near the water surface. Second, an iterative grid search is conducted on the validation set to fine-tune these thresholds by maximizing the F1-score for recognition of “Feeding” and “Hypoxia”. This data-driven calibration strategy ensures that the behavioral thresholds are empirically grounded and robust to scale variations across different aquaculture datasets.

Based on the optimized thresholds, a “Feeding (Frenzy)” state is identified when the average velocity satisfies

\bar{v} > δ_{h i g h}

and angular variance satisfies

Var (θ) > ϵ

, reflecting high-speed and irregular motion patterns. A “Hypoxia (Floating)” state is defined when

\bar{v} < δ_{l o w}

and

y_{t} < H_{s u r f a c e}

, indicating sustained low-speed movement near the water surface. All remaining motion patterns are categorized as “Normal”.
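The decision rules above can be written as a small function. The threshold values in the usage example are placeholders chosen only for illustration, since the calibrated values are not given here. The sketch also assumes image coordinates in which y increases downward, so that a small y corresponds to a position near the water surface, and it takes a single representative vertical position per trajectory.

```python
def classify_behavior(mean_speed, heading_var, y, th):
    """Rule-based state inference from trajectory statistics.

    mean_speed  : average velocity along the trajectory
    heading_var : variance of the turning angles
    y           : representative vertical position of the fish (assumption)
    th          : dict with keys delta_high, delta_low, eps, h_surface
    """
    # Feeding (Frenzy): fast and irregular motion.
    if mean_speed > th["delta_high"] and heading_var > th["eps"]:
        return "Feeding"
    # Hypoxia (Floating): slow motion close to the surface.
    if mean_speed < th["delta_low"] and y < th["h_surface"]:
        return "Hypoxia"
    # Everything else is treated as normal swimming.
    return "Normal"

# Usage with placeholder thresholds (hypothetical values, not the calibrated ones).
thresholds = {"delta_high": 5.0, "delta_low": 1.0, "eps": 0.5, "h_surface": 100.0}
state = classify_behavior(mean_speed=0.5, heading_var=0.1, y=50.0, th=thresholds)
```

With these rules a fast, erratic fish near the surface is labeled as feeding rather than hypoxic, because surface proximity alone never triggers the “Hypoxia” state.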

It is worth noting that fish may also approach the water surface during active exploration or feeding. The Bio-Logic Tree distinguishes these cases through multi-parameter coupling rather than relying on surface proximity alone. Specifically, surface-oriented feeding is excluded from the “Hypoxia” category because it usually satisfies the high-velocity and high-angular-variance conditions of the “Feeding” state. Similarly, active exploration near the surface is categorized as “Normal” because it does not exhibit the sustained low-velocity pattern associated with hypoxia-related floating.

By introducing this behavior head, the proposed framework extends pixel-level detection to trajectory-level semantic analysis, enabling biologically meaningful interpretation of motion patterns. This facilitates real-time behavioral monitoring and provides actionable indicators for aquaculture management systems, as illustrated in Figure 5.

3.6. Total End-to-End Training Objective

The network is trained end-to-end using a composite loss function that jointly optimizes bounding-box localization, category classification, fine-grained box-distribution regression, instance-level feature separation, and physical-motion consistency. The overall training objective is formulated as:

\begin{matrix} L_{t o t a l} = λ_{b o x} L_{C I o U} + λ_{c l s} L_{B C E} + λ_{d f l} L_{D F L} + λ_{r e p} L_{r e p} + λ_{h i c} L_{H I C} \end{matrix}

(11)

where

L_{C I o U}

represents the Complete Intersection over Union loss for bounding-box regression,

L_{B C E}

denotes the Binary Cross-Entropy loss for category confidence optimization, and

L_{D F L}

is the Distribution Focal Loss used to improve fine-grained bounding-box representation. In addition,

L_{r e p}

denotes the proposed Density-Aware Repulsion Loss, which enforces latent-space separation among highly overlapping fish instances, while

L_{H I C}

represents the Hydrodynamic-Informed Constraint Loss, which regularizes physically implausible trajectory variations and improves temporal stability under dense occlusion.

Unless otherwise specified, the loss weights are set to

λ_{b o x} = 1.0

,

λ_{c l s} = 1.0

,

λ_{d f l} = 0.5

,

λ_{r e p} = 0.2

, and

λ_{h i c} = 0.1

. The first three terms provide the basic detection supervision, whereas

L_{r e p}

and

L_{H I C}

serve as structural regularization terms for dense-instance separation and motion consistency. To avoid destabilizing early-stage bipartite matching,

L_{r e p}

and

L_{H I C}

are activated after the 50th epoch. Before this activation point, the model mainly learns reliable localization and classification; after the 50th epoch, the repulsion and hydrodynamic constraints further refine the feature space and suppress biologically implausible duplicate predictions.
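The staged objective of Equation (11) can be sketched as follows. The sketch assumes 1-indexed epochs, so that “after the 50th epoch” means epoch 51 onward, and it takes the individual loss values as precomputed numbers; the function name and dictionary keys are hypothetical.

```python
# Loss weights as stated for Eq. (11).
WEIGHTS = {"box": 1.0, "cls": 1.0, "dfl": 0.5, "rep": 0.2, "hic": 0.1}

def total_loss(terms, epoch, activation_epoch=50):
    """Eq. (11) with the repulsion and HIC terms switched on after a given epoch.

    terms : dict with precomputed values for ciou, bce, dfl, rep, hic
    epoch : current epoch, counted from 1 (assumption)
    """
    # Basic detection supervision is always active.
    loss = (WEIGHTS["box"] * terms["ciou"]
            + WEIGHTS["cls"] * terms["bce"]
            + WEIGHTS["dfl"] * terms["dfl"])
    # Structural regularizers are added only after the activation epoch.
    if epoch > activation_epoch:
        loss += WEIGHTS["rep"] * terms["rep"] + WEIGHTS["hic"] * terms["hic"]
    return loss

# Usage: the same loss values give different totals before and after activation.
values = {"ciou": 1.0, "bce": 1.0, "dfl": 1.0, "rep": 1.0, "hic": 1.0}
early = total_loss(values, epoch=10)
late = total_loss(values, epoch=60)
```

Gating the two regularizers in this way leaves the early optimization identical to a standard detection objective.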

In the total end-to-end training objective,

L_{r e p}

serves as the key structural differentiator for NMS-free instance separation. While

L_{C I o U}

and

L_{D F L}

handle spatial localization and

L_{B C E}

supervises category confidence,

L_{r e p}

specifically targets inter-instance ambiguity by encouraging matched predictions to maintain distinct identity signatures in the latent feature space. Meanwhile,

L_{H I C}

introduces a biophysical motion prior that penalizes abrupt or unrealistic trajectory changes. Together, this unified optimization framework enables DenseFish-v13 to learn accurate localization, robust feature separation, and behavior-aware motion consistency within a single end-to-end pipeline.

4. Experiments and Results

In this section, we rigorously validate DenseFish-v13 against the dual challenges of extreme density and aquatic noise. Our evaluation is specifically designed to prove its viability as the core artificial intelligence brain for agricultural inspection robots. We aim to verify three core hypotheses: (1) Disentanglement: Can the YOLOv13-Mamba backbone and NMS-free Repulsion Loss successfully separate overlapping aquatic livestock? (2) Spectral Denoising: Does the Bio-Harmonic Frequency Gate effectively ignore mechanical aeration bubbles? (3) Efficiency for Edge AI: Is the model lightweight and fast enough for deployment on resource-constrained edge devices?

4.1. Experimental Setup

4.1.1. Datasets and Metrics

To evaluate the proposed DenseFish-v13 framework under practical aquaculture monitoring conditions, we constructed a unified experimental benchmark, referred to as Dense-Aqua, by reorganizing and harmonizing two publicly available aquaculture-related datasets: the Pond Fish Detection Dataset [43] and the Healthy and Loser Salmon Dataset [44]. Rather than using these datasets as isolated test sets, we integrate them into a consistent, detection-oriented evaluation framework to assess the model’s robustness to underwater visual degradation, local crowding, scale variation, and adult-fish monitoring scenarios.

Specifically, the Pond Fish Detection Dataset contains 586 underwater pond images with 10,607 annotated fish instances, and provides an original split of 409 training images, 118 validation images, and 59 test images. This dataset serves as the primary benchmark for dense underwater fish detection because it includes challenging visual conditions, including turbidity, low contrast, natural illumination variation, partial occlusion, and local fish aggregation. The Healthy and Loser Salmon Dataset contains 207 sea-cage salmon images with 1750 annotated salmon instances, including 1319 healthy salmon and 431 loser salmon. Its original split consists of 145 training images, 41 validation images, and 21 test images. This dataset provides a complementary adult-fish aquaculture scenario featuring larger body scales, diverse poses, varying orientations, and sea-cage monitoring conditions.

For the detection and counting experiments in this study, all annotated aquatic livestock instances from the two datasets were harmonized into a single species-agnostic “fish” category. This setting is consistent with the main objective of DenseFish-v13, which focuses on robust fish localization, dense-instance separation, counting reliability, and underwater visual-noise resistance rather than species-level or health-status classification. After harmonization, Dense-Aqua contains 793 images and 12,357 annotated fish instances in total, with 554 images for training, 159 for validation, and 80 for testing. The original healthy/loser labels in the salmon dataset are retained only as source-level annotation metadata and are not used as target labels in the detection task.

The two source datasets provide complementary aquaculture scenarios. The Pond Fish Detection Dataset mainly represents pond-based underwater monitoring conditions, including turbidity, low contrast, illumination variation, local crowding, and boundary ambiguity, making it suitable for evaluating dense detection, counting reliability, and robustness under degraded underwater visibility. In contrast, the Healthy and Loser Salmon Dataset represents sea-cage adult-fish monitoring, where farmed Atlantic salmon exhibit larger body scales, diverse poses, orientation changes, and partial occlusions. Together, these two datasets enable a more comprehensive evaluation of DenseFish-v13 across degraded pond environments and adult-fish aquaculture scenarios, while remaining consistent with the proposed framework’s methodological focus on robust detection, dense-instance separation, spectral noise suppression, and edge-oriented aquaculture inspection.

The main evaluation metrics include mAP@0.5, mAP@0.5:0.95, Precision, and Recall for detection performance, as well as Counting MAE for counting reliability. To better evaluate dense-scene robustness, we further introduce Occlusion Recall (

R_{o c c}

), which is computed only on heavily overlapping targets:

\begin{matrix} R_{o c c} = \frac{T P_{o c c}}{T P_{o c c} + F N_{o c c}} \end{matrix}

(12)

where

T P_{o c c}

and

F N_{o c c}

denote the correctly detected and missed targets in the occluded subset, respectively; in this study, the occluded subset consists of ground-truth fish instances whose overlap with neighboring objects exceeds 50%. For deployment analysis, Params, FLOPs, and FPS are also reported. For the behavior recognition branch, Accuracy and class-level F1-score are used when applicable.

Furthermore, given the relatively small scale of the test set (80 images in total), point estimates of performance metrics may be susceptible to variance from individual images. To ensure statistical reliability and address potential small-sample bias, we report our final evaluation metrics with 95% Confidence Intervals (95% CI). These intervals were calculated using a non-parametric bootstrapping approach: we resampled the test set with replacement 1000 times, computing the metrics for each bootstrap sample to derive the empirical distribution, mean, and confidence intervals. The representative visual challenge types targeted by DenseFish-v13 are summarized in Figure 6.
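The bootstrap procedure can be sketched with the Python standard library. The sketch uses the percentile method to form the interval, which is an assumption, as the exact interval construction is not specified above. The `metric` argument is any function that maps a list of per-image records to a single number; the function name is hypothetical.

```python
import random
import statistics

def bootstrap_ci(items, metric, n_boot=1000, alpha=0.05, seed=0):
    """Non-parametric bootstrap: resample with replacement, recompute the metric.

    items  : per-image records of the test set (e.g. absolute counting errors)
    metric : function mapping a list of records to a number
    Returns (bootstrap mean, lower bound, upper bound) for a (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(items)
    stats = []
    for _ in range(n_boot):
        # Draw n records with replacement and evaluate the metric on the resample.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(resample))
    stats.sort()
    lower = stats[round(alpha / 2 * n_boot)]             # 2.5th percentile for alpha = 0.05
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]   # 97.5th percentile
    return statistics.fmean(stats), lower, upper

# Usage: Counting MAE on 80 images, with made-up absolute errors for illustration.
abs_errors = [float(i % 7) for i in range(80)]
mean_mae, low, high = bootstrap_ci(abs_errors, statistics.fmean)
```

For a metric such as mAP, which is not a mean of per-image scores, `metric` would recompute the full metric on each resampled set of images.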

4.1.2. Implementation Details

All models were trained on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) under Ubuntu 22.04 LTS, using Python 3.9 and PyTorch 2.0. The training process followed a warm-up and dual-branch optimization paradigm. During the first 10 epochs, the learning rate was linearly increased from

1 0^{- 5}

to

1 0^{- 3}

to stabilize the YOLOv13-Mamba backbone. After the warm-up stage, the learning rate was updated using cosine decay until the end of training. SGD was used as the optimizer with a momentum of 0.937 and a weight decay of

1 0^{- 4}

. All baseline models were retrained from scratch using the same augmentation and optimization settings, with a total of 100 training epochs.
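The learning-rate schedule described above can be sketched as a function of the epoch index. The sketch assumes 0-indexed epochs and a cosine decay that ends at zero; the final learning rate is not stated above, so `lr_end` is a hypothetical parameter, as is the function name.

```python
import math

def learning_rate(epoch, total_epochs=100, warmup_epochs=10,
                  lr_start=1e-5, lr_peak=1e-3, lr_end=0.0):
    """Linear warm-up followed by cosine decay (epoch counted from 0)."""
    if epoch < warmup_epochs:
        # Linear increase from lr_start towards lr_peak during warm-up.
        frac = epoch / warmup_epochs
        return lr_start + frac * (lr_peak - lr_start)
    # Cosine decay from lr_peak to lr_end over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))

# Usage: tabulate the schedule for the 100 training epochs.
schedule = [learning_rate(e) for e in range(100)]
```

The peak value is reached at the first epoch after warm-up, and the rate then decreases monotonically until the end of training.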

For the loss weighting coefficients in Equation (11), we set

λ_{b o x} = 1.0

,

λ_{c l s} = 1.0

,

λ_{d f l} = 0.5

,

λ_{r e p} = 0.2

, and

λ_{h i c} = 0.1

. These weights were determined by balancing the gradient magnitudes of each term during preliminary experiments. Specifically,

L_{r e p}

and

L_{H I C}

were activated after the 50th epoch, referred to as the Repulsion Activation Epoch, to ensure that the model first learns basic localization before attempting to resolve fine-grained instance separation and trajectory-level motion regularization. This staged training paradigm prevents the repulsion force and hydrodynamic constraint from destabilizing bipartite matching during the early convergence phase.

The spatial density threshold

τ

was selected by grid search on the validation set.

δ_{h i g h}, ϵ, δ_{l o w}, H_{s u r f a c e}

were initialized from annotated trajectory statistics and further refined on the validation set. Edge-side inference was benchmarked on an NVIDIA Jetson Orin NX module (NVIDIA Corporation, Santa Clara, CA, USA) using NVIDIA TensorRT 8.6.

The MoE-SG gating network employs three spectral experts corresponding to high-aeration, high-turbidity, and clear-water conditions, and the routing entropy threshold was determined by grid search on the validation set. CPO maintains a dynamic prototype bank of size 256 per class, updated via exponential moving average with a momentum of 0.999. Gram-Schmidt orthogonalization is applied at each bipartite matching step during training only, incurring no additional inference cost.

For the robustness evaluation, we generated bubble-augmented test images by overlaying semi-transparent circular and elliptical bubble artefacts onto the original test images, while keeping the original bounding-box annotations unchanged. For 640 × 640 images, the number of bubbles was set to 20–40, 50–80, and 90–140 per image for low-, medium-, and strong-disturbance levels, respectively. The bubble radius ranges were 3–18, 4–24, and 5–32 pixels, and the opacity ranges were 0.15–0.30, 0.25–0.45, and 0.35–0.65 for the three levels. Gaussian blur kernels of 3 × 3, 5 × 5, and 7 × 7 were applied to imitate diffused bubble boundaries. All perturbations were generated with a fixed random seed, and the same augmented test sets were used for all models being compared to ensure fair evaluation. The key training and deployment settings are summarized in Table 1.
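A simplified version of the bubble perturbation can be sketched in NumPy using the count, radius, and opacity ranges listed above. This is only an approximation of the described procedure: it draws circular bubbles only, blends them toward white (an assumed bubble color), and omits the Gaussian blur of the bubble boundaries. The function and dictionary names are hypothetical.

```python
import numpy as np

# Ranges per disturbance level, as listed for 640 x 640 images.
LEVELS = {
    "low":    {"count": (20, 40),  "radius": (3, 18), "opacity": (0.15, 0.30)},
    "medium": {"count": (50, 80),  "radius": (4, 24), "opacity": (0.25, 0.45)},
    "strong": {"count": (90, 140), "radius": (5, 32), "opacity": (0.35, 0.65)},
}

def add_bubbles(image, level="strong", seed=0):
    """Overlay semi-transparent circular bubbles on a float image in [0, 1].

    image : array of shape (H, W, 3); the input is left unmodified
    """
    cfg = LEVELS[level]
    rng = np.random.default_rng(seed)        # fixed seed for reproducible test sets
    out = image.astype(float).copy()
    h, w = out.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]              # pixel coordinate grids
    n_bubbles = rng.integers(cfg["count"][0], cfg["count"][1] + 1)
    for _ in range(n_bubbles):
        cx, cy = rng.uniform(0, w), rng.uniform(0, h)
        radius = rng.uniform(*cfg["radius"])
        alpha = rng.uniform(*cfg["opacity"])
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
        # Alpha-blend the covered pixels toward white (assumed bubble color).
        out[mask] = (1.0 - alpha) * out[mask] + alpha
    return out

# Usage: perturb a dummy image; the bounding-box annotations stay unchanged.
clean = np.zeros((640, 640, 3))
perturbed = add_bubbles(clean, level="strong", seed=42)
```

Because the generator is seeded, calling the function with the same seed reproduces the same perturbed image, which is what allows identical augmented test sets to be shared across models.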

For the Bio-Kinematic Behavior Head, the kinematic thresholds

(δ_{h i g h}, ϵ, δ_{l o w}, H_{s u r f a c e})

were structurally determined based on the empirical statistical distributions of annotated trajectory data, combined with domain-specific prior knowledge of aquatic livestock behavior. These hyperparameters underwent iterative grid tuning on the validation set to maximize classification reliability; the detailed sensitivity analyses and specific numeric values governing these operational states are structured within the extended experimental validation framework.

Similarly, the spatial density threshold

τ

used in the Density-Aware Repulsion Loss was determined via grid search

on the validation set. This systematic approach ensures an optimal theoretical balance between false-positive suppression and dense target recall, with further discussions of parameter sensitivity reserved for subsequent ablation studies.

Regarding the edge deployment evaluation, the reported inference speeds (FPS) for all models were benchmarked uniformly on an NVIDIA Jetson Orin NX hardware platform. The evaluations were conducted utilizing TensorRT acceleration frameworks at FP16 half-precision with fixed input resolution, ensuring a rigorous, fair, and replicable hardware assessment metric for agricultural robotics.

4.2. Ablation Study: Deconstructing the Architectural Gains

To quantify the contribution of each proposed component, we conducted cumulative ablation experiments on the Dense-Aqua dataset. YOLOv13-m was adopted as the baseline, and modules were introduced progressively under identical training settings to isolate their individual effects on the coupled “Occlusion–Noise Dilemma.” All five key components are evaluated sequentially: the Bi-MSW-Mamba backbone, the Bio-Harmonic Frequency Gate (B-HFG), the Density-Aware Repulsion Loss (

L_{r e p}

), the Hydrodynamic-Informed Constraint Loss (

L_{H I C}

), and the Mixture-of-Experts Spectral Gate (MoE-SG). Contrastive Prototype Orthogonalization (CPO) is incorporated as part of the full Repulsion Loss configuration and is therefore implicitly captured in the “+

L_{r e p}

+ CPO” row.

As shown in Table 2, each component contributes progressively to the final performance. Replacing the CNN bottleneck with the Bi-MSW-Mamba block improves mAP@50:95 from 56.8% to 59.5% and reduces Counting MAE from 10.2 to 8.5, indicating that global state-space modeling helps recover partially visible fish in crowded scenes. Adding the Bio-Harmonic Frequency Gate further raises mAP@50:95 to 62.1% and reduces Counting MAE to 5.8, demonstrating that frequency-domain refinement improves robustness to bubble-like visual interference and stabilizes counting.

After introducing MoE-SG, mAP@50:95 further increases to 63.0%, suggesting that adaptive spectral routing provides additional robustness under heterogeneous underwater conditions. Incorporating the Density-Aware Repulsion Loss and CPO improves mAP@50:95 to 64.2% and lowers Counting MAE to 4.1, confirming the importance of explicit feature-space separation for dense-instance matching. Finally, adding the Hydrodynamic-Informed Constraint further improves mAP@50:95 to 64.8% and reduces Counting MAE to 3.7. Although the gain from

L_{H I C}

is relatively moderate, it improves trajectory-level consistency and reduces physically implausible predictions, which is particularly beneficial for counting stability under severe overlap.

The expanded metrics further clarify the role of each module. The Bi-MSW-Mamba block primarily improves recall by enhancing global contextual modeling in the presence of occlusion. B-HFG and MoE-SG primarily enhance the robustness of features to underwater visual degradation. The repulsion-based mechanism strengthens instance separation in dense regions, while L_{HIC} further regularizes unstable predictions through motion-consistency constraints. Overall, these results support the design logic of DenseFish-v13: global modeling improves representation completeness, spectral refinement enhances signal purity, density-aware repulsion strengthens instance discrimination, and hydrodynamic regularization improves counting reliability.

4.2.1. Efficiency Overhead Analysis of Each Module

To evaluate whether the performance gains are achieved at an acceptable computational cost, we further compare the parameter count, FLOPs, memory usage, and inference speed of each ablation variant on the Jetson Orin NX. This analysis is important because the proposed model is intended for embedded aquaculture inspection rather than offline-only evaluation.

As shown in Table 3, the introduction of the Bi-MSW-Mamba block and B-HFG results in only a moderate increase in computational complexity. Compared with the baseline YOLOv13-m, the final DenseFish-v13 increases the parameter count from 20.1 M to 22.0 M and the FLOPs from 68.4 G to 72.6 G, while still maintaining 125 FPS on the Jetson Orin NX edge platform. This indicates that the proposed modules improve dense-scene perception capability without sacrificing real-time deployment feasibility.
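To put the reported overhead in relative terms, the following short sketch computes the percentage increase in parameters and FLOPs from the figures quoted above, together with the average per-frame time implied by 125 FPS. The per-frame time is derived from throughput only and is an approximation of latency, assuming single-image, non-pipelined inference.

```python
def relative_increase(before, after):
    """Relative increase of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


params_pct = relative_increase(20.1, 22.0)  # parameters, in millions
flops_pct = relative_increase(68.4, 72.6)   # FLOPs, in giga
frame_ms = 1000.0 / 125                     # average time per frame at 125 FPS

print(f"params +{params_pct:.1f}%, FLOPs +{flops_pct:.1f}%, {frame_ms:.1f} ms/frame")
```

The increases are roughly 9.5% in parameters and 6.1% in FLOPs, with about 8 ms available per frame.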

It should be noted that the Density-Aware Repulsion Loss, Contrastive Prototype Orthogonalization, and Hydrodynamic-Informed Constraint are active only during training. Therefore, they do not introduce additional inference-time parameters, FLOPs, memory consumption, or latency. Overall, DenseFish-v13 achieves a favorable trade-off between detection robustness and edge-side inference efficiency, making it suitable for real-time aquaculture inspection scenarios.

This also explains why the last three variants in Table 3 report identical inference costs: Repulsion Loss, CPO, and HIC act only as training-stage regularization or matching constraints and are removed at inference.

4.2.2. Sensitivity Analysis of the Repulsion Mechanism

Because the repulsion term is central to the proposed NMS-free dense matching strategy, we further analyze the sensitivity of its two key hyperparameters, namely the repulsion weight λ_{rep} and the overlap threshold τ. The purpose of this experiment is to verify that the improvement is stable across a reasonable parameter range rather than dependent on a narrowly tuned setting. For each sensitivity experiment, only one hyperparameter was varied while all other training settings were kept unchanged. The reported results were obtained on the validation set under the same data split and evaluation protocol.
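The one-factor-at-a-time protocol described above can be sketched as a small configuration generator. The grid values below are hypothetical placeholders (the actual values are those listed in Table 4 and Table 5), and the sketch assumes that the hyperparameter not being swept is held at its selected value.

```python
def one_at_a_time(defaults, grids):
    """Yield configs that change exactly one hyperparameter from `defaults`."""
    for name, values in grids.items():
        for value in values:
            if value == defaults[name]:
                continue  # the default run is shared by both sweeps
            cfg = dict(defaults)
            cfg[name] = value
            yield cfg


# Selected setting from the text; the other grid points are assumptions.
defaults = {"lambda_rep": 0.2, "tau": 0.5}
grids = {
    "lambda_rep": [0.05, 0.1, 0.2, 0.4],  # hypothetical sweep values
    "tau": [0.3, 0.4, 0.5, 0.6],          # hypothetical sweep values
}

configs = [dict(defaults)] + list(one_at_a_time(defaults, grids))
for cfg in configs:
    print(cfg)
```

Each generated configuration differs from the reference run in at most one entry, so any change in the validation metrics can be attributed to that single hyperparameter.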

As shown in Table 4, the model achieves the best performance when λ_{rep} = 0.2, with 64.2% mAP@50:95, 68.7% R_{occ}, and the lowest Counting MAE of 4.1. When λ_{rep} is too small, the repulsion constraint is insufficient to separate overlapping fish instances. When λ_{rep} is too large, excessive repulsion may disturb the matching process and slightly degrade localization stability. Therefore, λ_{rep} = 0.2 is selected as the optimal setting.

As shown in Table 5, τ = 0.5 provides the best overall trade-off, achieving the highest mAP@50:95 and R_{occ} while maintaining the lowest Counting MAE. A smaller τ may over-activate the repulsion term for weakly overlapping instances. In contrast, a larger τ may fail to constrain moderately occluded targets. Therefore, τ = 0.5 is adopted in the final model.
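The roles of λ_{rep} and τ can be illustrated with a toy gated penalty. The sketch below is not the loss defined in this paper: it assumes, purely for illustration, that a pair of predictions is penalized by the cosine similarity of its features whenever the box IoU exceeds τ, and that the averaged penalty is scaled by λ_{rep}. The function names and the toy boxes are hypothetical.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def repulsion_penalty(boxes, feats, tau=0.5, lambda_rep=0.2):
    """Assumed form: mean positive cosine similarity over pairs with IoU > tau."""
    total, active = 0.0, 0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > tau:  # gate: only overlapping pairs
                total += max(0.0, cosine(feats[i], feats[j]))
                active += 1
    return lambda_rep * total / active if active else 0.0


# Two heavily overlapping boxes (IoU = 2/3) and one isolated box.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (30, 30, 40, 40)]
collapsed = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # overlapping pair shares a feature
separated = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]  # overlapping pair is orthogonal

print(repulsion_penalty(boxes, collapsed))           # penalized
print(repulsion_penalty(boxes, separated))           # no penalty
print(repulsion_penalty(boxes, collapsed, tau=0.7))  # gate no longer active
```

In this toy form, lowering τ activates the penalty for more weakly overlapping pairs, whereas raising it above the pair's IoU switches the penalty off, which mirrors the trade-off discussed above.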

Figure 7 further visualizes the trends reported in Table 4 and Table 5. For both λ_{rep} and τ, the performance follows a stable unimodal pattern: insufficient repulsion weakens dense-instance separation, whereas excessive or improperly activated repulsion may destabilize NMS-free matching. Overall, the selected configuration, λ_{rep} = 0.2 and τ = 0.5, provides a stable balance between dense-instance discrimination and matching stability. This indicates that the proposed repulsion mechanism is effective within a reasonable hyperparameter range rather than being dependent on a narrowly tuned setting.

4.2.3. Orthogonal Ablation Study of Core Components

To further isolate the contribution of each proposed component, we conducted an orthogonal ablation study using the Dense-Aqua benchmark. Unlike the cumulative ablation in the previous section, this experiment independently turns on or off the main modules, including the Bi-directional Multi-scale Wavelet Mamba block, the Bio-Harmonic Frequency Gate, the Density-Aware Repulsion Loss, and the Hydrodynamic-Informed Constraint. This design allows us to evaluate whether the proposed modules contribute complementary improvements rather than redundant performance gains.

As shown in Table 6, each component contributes positively to the final performance. The Bi-MSW-Mamba block mainly improves recall and mAP@50:95, indicating that global state-space modeling is beneficial for recovering partially occluded fish structures. B-HFG improves both precision and counting reliability, suggesting that frequency-domain refinement suppresses bubble-like visual interference while preserving biologically meaningful textures. The Density-Aware Repulsion Loss yields a clear reduction in Counting MAE, demonstrating its effectiveness at separating adjacent fish instances amid dense overlap. The full model achieves the best overall performance, and the additional HIC term further improves counting stability by regularizing physically implausible trajectory changes. These results confirm that the proposed modules are complementary and jointly contribute to robust dense aquaculture perception.

4.2.4. Ablation of NMS-Free Matching and Repulsion Loss

To verify the necessity of the NMS-free matching strategy and the proposed Density-Aware Repulsion Loss, we compared four detection head configurations under the same backbone and training protocol. The evaluated variants include a conventional NMS-based YOLOv13 head, an NMS-free head without repulsion, an NMS-free head with the proposed Repulsion Loss, and the final configuration further enhanced by Contrastive Prototype Orthogonalization.

As shown in Table 7, the NMS-free head improves occlusion recall compared with the conventional NMS-based head, confirming that heuristic suppression may remove valid predictions in dense fish clusters. However, NMS-free matching alone remains insufficient, as adjacent fish with highly overlapping boxes may still converge on similar latent representations. The proposed Repulsion Loss substantially reduces the merged error rate and Counting MAE by enforcing feature-space separation among overlapping instances. Further integration of CPO provides additional stabilization by encouraging more discriminative prototype-level representations. These results support the use of asymmetry-aware instance separation for dense underwater fish detection.
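The way heuristic suppression can discard a valid detection in a dense cluster is easy to reproduce with a toy example. The sketch below implements classic greedy NMS in plain Python; the boxes, scores, and the 0.5 IoU threshold are illustrative assumptions and are not taken from the experiments.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thr=0.5):
    """Keep boxes in descending score order, dropping any that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


# Two distinct fish swimming side by side (IoU = 2/3) plus one isolated fish.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (30, 30, 40, 40)]
scores = [0.90, 0.85, 0.80]

kept = greedy_nms(boxes, scores)
print(kept)  # index 1 is suppressed although it is a separate fish
```

An NMS-free head avoids this step entirely, which is why the remaining difficulty shifts to keeping the representations of overlapping instances distinct.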

4.3. Comparative Analysis with State-of-the-Art (SOTA) Models

To evaluate the competitiveness of DenseFish-v13, we compared it with representative state-of-the-art detectors under identical training and testing settings. The comparison includes mainstream CNN-based baselines (YOLOv8-m, YOLOv10-m, and YOLOv13-m), a Transformer-based end-to-end detector (RT-DETR-l), and a dense-scene-oriented detector (CrowdDet). This selection covers conventional NMS-based detectors, early NMS-free pipelines, and context-aware dense detection paradigms, thereby providing a comprehensive reference for assessing the proposed method under practical aquaculture conditions. To assess counting stability, we also compared DenseFish-v13 with the specialized counters Deep-Fish [45] and CSRNet [46]. DenseFish-v13 achieves the lowest Counting MAE (4.1), which suggests that NMS-free detection can be more robust than traditional density-map methods in turbid water.

As shown in Table 8, DenseFish-v13 achieves the best overall performance on the extreme-density split with robust statistical confidence. Compared with YOLOv13-m, the proposed method improves mAP@50:95 from 58.7% (95% CI: ±0.9%) to 64.2% (95% CI: ±0.7%), reduces Counting MAE from 8.4 (±0.4) to 4.1 (±0.2), and increases R_{occ} from 53.6% (±1.1%) to 68.7% (±0.9%). These gains, validated by the non-overlapping confidence intervals, indicate that the proposed framework is significantly more effective in separating densely overlapped fish instances under severe visual interference.
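The non-overlap of the reported intervals can be checked directly from the quoted means and half-widths. The sketch below assumes that every ± value is the half-width of a 95% confidence interval, as stated explicitly for mAP@50:95. Non-overlapping intervals are a conservative heuristic and not a formal hypothesis test.

```python
def interval(mean, half_width):
    """Closed interval [mean - half_width, mean + half_width]."""
    return (mean - half_width, mean + half_width)


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (YOLOv13-m, DenseFish-v13) as (mean, half-width), quoted from the text.
reported = {
    "mAP@50:95 (%)": ((58.7, 0.9), (64.2, 0.7)),
    "Counting MAE": ((8.4, 0.4), (4.1, 0.2)),
    "R_occ (%)": ((53.6, 1.1), (68.7, 0.9)),
}

results = {
    metric: overlaps(interval(*baseline), interval(*ours))
    for metric, (baseline, ours) in reported.items()
}
for metric, overlap in results.items():
    print(f"{metric}: intervals overlap = {overlap}")
```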

In addition to general-purpose detectors, we specifically compared DenseFish-v13 with domain-specific models, namely YOLO-FC [8] and FR-CNN [9]. As shown in Table 8, although YOLO-FC is optimized for fish detection, it relies on purely convolutional kernels, which struggle with severe feature collapse in extreme-density scenarios (reaching only 57.2% mAP). FR-CNN, while providing high localization quality, is limited by its two-stage inference speed (18 FPS), making it unsuitable for edge deployment. Our framework outperforms these specialized baselines by leveraging the global receptive field of the Mamba backbone and the density-aware repulsion loss, achieving a 7.0 percentage-point improvement in mAP over YOLO-FC while maintaining real-time inference speeds.

4.3.1. Performance Across Different Density Levels

To further verify that the proposed model is not only effective in the most challenging subset but also stable across different crowding levels, we additionally compare all models on the low-, medium-, and extreme-density splits of Dense-Aqua.

As shown in Table 9, DenseFish-v13 achieves the best performance across low-, medium-, and extreme-density splits, indicating that the proposed model is not only effective in highly crowded scenes but also remains competitive under relatively sparse conditions. In particular, its advantage becomes more pronounced as density increases: under the extreme-density split, DenseFish-v13 reaches 64.2% mAP@50:95, reduces Counting MAE to 4.1, and maintains the highest R_{occ} of 68.7%. These results suggest that the proposed global occlusion-aware modeling and density-aware instance separation are especially effective for dense aquaculture scenarios.

4.3.2. Qualitative Comparison with Representative Baselines

To complement the quantitative results, we provide visual comparisons between DenseFish-v13 and representative baselines under challenging aquaculture visual conditions. The selected examples include crowded occlusion, bubble interference, low visibility, and motion blur. These cases are chosen in line with the methodological motivations of DenseFish-v13: global occlusion-aware modeling, spectral noise suppression, and density-aware instance separation. Compared with baseline detectors, DenseFish-v13 produces fewer missed detections, fewer merged boxes, and more stable localization under visually degraded and crowded conditions.

Figure 8 highlights typical failure modes of baseline models, including missed targets, merged bounding boxes, and bubble-induced false detections, whereas DenseFish-v13 exhibits clearer instance separation and more stable localization.

4.3.3. Disaggregated Evaluation on Source Datasets

To investigate potential distributional shifts and ensure the model’s robustness across different aquaculture scenarios, we conducted a disaggregated evaluation on the individual test sets from the two source datasets: the Pond Fish Detection Dataset (Pond-Aqua) and the Healthy/Loser Salmon Dataset (Salmon-Aqua).

As shown in Table 10, DenseFish-v13 demonstrates strong generalization across both domains despite their inherent differences. The model achieves higher precision and mAP on the Salmon-Aqua dataset (72.1% mAP@50:95), which can be attributed to the larger body scales and clearer water conditions in sea-cage environments. Conversely, on the Pond-Aqua dataset, the mAP is lower (62.3%) and the Counting MAE is higher (4.2 vs. 1.8), reflecting the severe challenges posed by high target density, turbidity, and local fish aggregation in pond-based monitoring. These results indicate that while harmonization into a ‘species-agnostic’ category is successful, the environmental complexity of pond aquaculture remains the primary bottleneck for detection stability.

4.4. Robustness Under Asymmetric Aeration-Induced Visual Disturbance

Robust underwater perception in aquaculture is often hindered by bubble-like visual interference from aeration, water turbulence, and suspended particles. Since the public datasets used in this study do not provide explicit device-level annotations indicating whether mechanical aeration was active during image acquisition, we also conducted a controlled synthetic perturbation test to evaluate DenseFish-v13’s robustness to bubble-induced visual degradation. Specifically, bubble-like artefacts were artificially added to the original test images using an image-editing-based augmentation procedure, while the original bounding-box annotations were retained unchanged. This setting allows us to compare model performance between the original images and their bubble-augmented counterparts under identical ground-truth annotations.

To avoid overclaiming real physical aeration conditions, we refer to the original test images as the Original setting and to the edited images with added bubble artefacts as the Synthetic Bubble setting. This experiment is therefore intended to evaluate visual robustness against aeration-like bubble interference, rather than to simulate the complete hydrodynamic process of real mechanical aeration.

The results in Table 11 show that the baseline detector is more sensitive to synthetic bubble interference, with mAP decreasing from 60.1% to 51.2%. In contrast, DenseFish-v13 shows a much smaller performance drop, decreasing from 64.8% to 63.5%. This indicates that the proposed Bio-Harmonic Frequency Gate can improve robustness against bubble-like high-frequency visual disturbances. Since the perturbations were artificially introduced, these results should be interpreted as a controlled evaluation of visual robustness rather than direct evidence of performance under real mechanical aeration. Nevertheless, the experiment provides useful evidence that DenseFish-v13 is less affected by bubble-shaped clutter and better preserves fish-related visual structures in degraded underwater imagery.
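Because the drops above can be expressed either in percentage points or as a fraction of the original score, the following sketch computes both from the quoted mAP values. It is simple arithmetic on the numbers in the text and introduces no new measurements.

```python
def drops(original, perturbed):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = original - perturbed
    return round(absolute, 1), round(100.0 * absolute / original, 1)


baseline = drops(60.1, 51.2)   # baseline detector, Original -> Synthetic Bubble
densefish = drops(64.8, 63.5)  # DenseFish-v13, Original -> Synthetic Bubble

print(f"baseline:      -{baseline[0]} pp ({baseline[1]}% relative)")
print(f"DenseFish-v13: -{densefish[0]} pp ({densefish[1]}% relative)")
```

The baseline loses 8.9 percentage points (about 14.8% of its original mAP), whereas DenseFish-v13 loses 1.3 percentage points (about 2.0%).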

From a practical perspective, this robustness is important because continuous oxygenation is a routine condition in intensive aquaculture. A detector that performs well only in calm water has limited operational value. The limited degradation of DenseFish-v13 under synthetic bubble interference suggests that it is better suited to continuous monitoring in real farming environments, although this should still be confirmed under real mechanical aeration.

4.4.1. Performance Under Multi-Level Synthetic Bubble Disturbance

To further analyze robustness to different levels of bubble-like interference, we generated three synthetic perturbation levels by progressively increasing the density, opacity, and spatial coverage of added bubble artefacts. The original images served as the clear reference condition, while the low-, medium-, and strong-disturbance settings corresponded to increasing levels of synthetic bubble overlays. The ground-truth bounding boxes were left unchanged because the synthetic perturbations only modify the images’ visual appearance and do not alter the actual object locations.
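A perturbation of this kind can be sketched as an alpha-blended overlay of bright discs whose number, size, and opacity grow with the disturbance level. The level settings, the disc model, and the function name below are assumptions made for illustration; they do not reproduce the image-editing procedure used in the experiments.

```python
import random

# Hypothetical level settings; the exact values used in the study are not given.
LEVELS = {
    "low":    {"count": 5,  "max_radius": 3, "opacity": 0.3},
    "medium": {"count": 15, "max_radius": 5, "opacity": 0.5},
    "strong": {"count": 40, "max_radius": 8, "opacity": 0.7},
}


def add_bubbles(image, level, seed=0):
    """Alpha-blend bright discs onto a grayscale image (list of rows, values 0-255)."""
    cfg = LEVELS[level]
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input image stays untouched
    for _ in range(cfg["count"]):
        cx, cy = rng.randrange(width), rng.randrange(height)
        radius = rng.randint(1, cfg["max_radius"])
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    blended = (1 - cfg["opacity"]) * out[y][x] + cfg["opacity"] * 255
                    out[y][x] = round(blended)
    return out


image = [[100] * 64 for _ in range(48)]  # flat toy image, 64 x 48 pixels
boxes = [(10, 12, 30, 24)]               # annotations are reused unchanged
perturbed = {name: add_bubbles(image, name) for name in LEVELS}
```

Only pixel values are modified, so the same `boxes` list serves as ground truth for every level.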

As shown in Table 12, DenseFish-v13 exhibits the smallest degradation as the synthetic bubble disturbance becomes stronger, with only a 1.3 percentage-point mAP drop and a 0.4 MAE increase from the original setting to the strong-bubble setting. Compared with YOLOv13-m, RT-DETR-l, and CrowdDet, the proposed method maintains more stable localization and counting performance as bubble-like interference progressively increases. This supports the effectiveness of B-HFG in suppressing high-frequency clutter while preserving fish-related contours and textures.

4.4.2. Spectral Mechanism Visualization of B-HFG

To provide direct evidence of the Bio-Harmonic Frequency Gate (B-HFG)’s effectiveness in the presence of strong aeration noise, we further visualize intermediate feature responses before and after spectral refinement, along with the corresponding final detection results.

Figure 9 shows that the baseline detector produces unstable activation in bubble-dominated regions, indicating strong sensitivity to aeration noise. These responses often overlap with fish-scale reflections and fragmented body textures, leading to false detections and incomplete localization. After spectral refinement with B-HFG, the feature maps become cleaner and more concentrated on biologically meaningful target regions. The final detection results show the same trend: the baseline suffers from fragmented boxes, false positives, and missed detections, whereas DenseFish-v13 produces more stable and complete predictions in the same regions. Together with Table 11, Table 12 and Table 13, Figure 9 demonstrates that B-HFG improves underwater perception by suppressing bubble-related interference while preserving fish-related structures.

4.4.3. Ablation of Frequency-Domain Noise Suppression Strategies

To verify that the robustness improvement of B-HFG is not merely due to generic smoothing, we compared the proposed module against several common filtering strategies in the strong synthetic bubble-perturbation setting. The evaluated methods include no filtering, Gaussian spatial filtering, fixed low-pass filtering, fixed band-pass filtering, and the proposed learnable Bio-Harmonic Frequency Gate.

Table 13 shows that conventional spatial-domain smoothing is not suitable for dense aquaculture images. Gaussian filtering and fixed low-pass filtering suppress bubble-like noise but also weaken fish contours and scale-related textures, thereby reducing detection accuracy. Fixed band-pass filtering performs better, but its static frequency response cannot adapt to different underwater conditions. In contrast, B-HFG achieves the best performance by learning a task-specific spectral response that suppresses broadband bubble interference while preserving quasi-periodic biological textures. This result supports the motivation of using symmetry-preserving frequency-domain refinement for underwater fish detection.
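The difference between band-selective gating and generic smoothing can be illustrated on a one-dimensional toy signal. In the sketch below, a sinusoid stands in for a quasi-periodic scale texture and a single impulse stands in for broadband bubble energy; a hand-set spectral gate (not a learned one) passes the texture band and attenuates the remaining bins. All names, the bin index, and the gate values are illustrative assumptions, and the real module operates on two-dimensional feature maps.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT; returns the real part (exact for conjugate-symmetric spectra)."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def band_gate(n, center, half_width=1, floor=0.1):
    """Gate passing bins near `center` and their mirrored bins, attenuating the rest."""
    gate = [floor] * n
    for k in range(center - half_width, center + half_width + 1):
        gate[k % n] = 1.0
        gate[(-k) % n] = 1.0  # mirror bin keeps the filtered signal real
    return gate


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


N, TEXTURE_BIN = 64, 8
texture = [math.sin(2 * math.pi * TEXTURE_BIN * t / N) for t in range(N)]
bubble = [3.0 if t == 20 else 0.0 for t in range(N)]  # impulse spreads over all bins
noisy = [a + b for a, b in zip(texture, bubble)]

gate = band_gate(N, TEXTURE_BIN)
filtered = idft([g * c for g, c in zip(gate, dft(noisy))])

err_before = sq_error(noisy, texture)
err_after = sq_error(filtered, texture)
print(f"squared error before: {err_before:.3f}, after: {err_after:.3f}")
```

The periodic component passes through unchanged while the impulse energy is reduced by about an order of magnitude. A low-pass or Gaussian filter with a cutoff below the texture bin would instead remove the texture itself, which is the failure mode described above.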

4.5. Performance Under Symmetry-Disrupted Extreme Occlusion

To further evaluate DenseFish-v13’s robustness to severe crowding, we analyze the model’s performance across increasingly dense scenes. From a symmetry–asymmetry perspective, heavy occlusion disrupts the continuous bilateral contours and quasi-periodic body textures of fish, causing adjacent individuals to appear visually entangled. Therefore, this section focuses on whether the proposed global state-space modeling and density-aware repulsion mechanism can preserve instance separability when fish-body symmetry cues are strongly disrupted by overlap and crowding.

4.5.1. Performance Under Different Density and Occlusion Levels

To provide a more fine-grained characterization of density robustness, we evaluate all models across four crowding levels: low, medium, high, and extreme density, reporting mAP@50:95, Counting MAE, and Occlusion Recall (R_{occ}) jointly to capture both localization accuracy and dense target recall.

As shown in Table 14, all models exhibit monotonically increasing performance degradation as scene density increases, confirming that severe occlusion is a fundamental bottleneck for underwater fish detection. DenseFish-v13 consistently achieves the highest mAP@50:95 across all four density levels, maintaining 64.2% even under extreme-density conditions. Compared with YOLOv13-m, RT-DETR-l, and CrowdDet, its mAP drop from low to extreme density is the smallest at 14.1 percentage points, versus 18.3, 18.1, and 16.7 for the respective baselines. The advantage of DenseFish-v13 becomes more pronounced as density increases: its R_{occ} decreases from 81.5% to 68.7% across the full density range, representing a substantially more gradual decline than YOLOv13-m (72.6% → 49.3%) and RT-DETR-l (76.3% → 58.4%). These results confirm that the Bi-MSW-Mamba backbone and the density-aware repulsion mechanism collectively preserve instance separability under progressively severe crowding, enabling more reliable performance in practical high-density aquaculture monitoring scenarios.
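The occlusion-recall declines quoted above can be compared on a common scale with a short calculation. The sketch uses only the low-density and extreme-density R_{occ} values stated in the text and reports both the absolute drop and the share of low-density recall that is retained.

```python
# (low-density, extreme-density) occlusion recall in %, as quoted for Table 14.
r_occ = {
    "DenseFish-v13": (81.5, 68.7),
    "YOLOv13-m": (72.6, 49.3),
    "RT-DETR-l": (76.3, 58.4),
}


def degradation(low, extreme):
    """Absolute drop in percentage points and share of low-density recall retained."""
    return round(low - extreme, 1), round(100.0 * extreme / low, 1)


summary = {name: degradation(*pair) for name, pair in r_occ.items()}
for name, (drop, retained) in summary.items():
    print(f"{name:>14s}: -{drop} pp, retains {retained}% of low-density recall")
```

DenseFish-v13 loses 12.8 percentage points and retains about 84% of its low-density recall, compared with 23.3 points (about 68% retained) for YOLOv13-m and 17.9 points (about 77% retained) for RT-DETR-l.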

In summary, DenseFish-v13 preserves the highest occlusion recall across the density range, which indicates a stronger capability to detect partially visible and heavily overlapped fish. The model therefore degrades more gracefully under increasing crowding and is better suited for high-density aquaculture inspection.

Figure 10 further visualizes the density-level evaluation results summarized in Table 14. Figure 10a shows the low-to-extreme change of occlusion recall, while Figure 10b compares mAP@50:95 across low-, medium-, high-, and extreme-density subsets.

As density increases, conventional YOLO-based detectors exhibit a pronounced decline in performance, particularly when the scene contains more than approximately 40 visible fish per frame. This trend indicates that local feature extraction becomes increasingly unreliable when adjacent fish bodies overlap heavily, leading to missed detections, merged bounding boxes, and unstable localization. By contrast, DenseFish-v13 shows a substantially slower rate of degradation, suggesting that its global context modeling and repulsion-based dense matching are more effective at preserving instance separability under severe occlusion.

4.5.2. Qualitative Visualization of Boundary Recovery Under Symmetry-Disrupted Occlusion

To complement the quantitative analysis, we present enlarged visual comparisons of representative extreme-occlusion cases. The selected examples include heavily overlapped fish clusters, boundary entanglement, and partial body visibility. Figure 11 compares the bounding-box predictions of a baseline and DenseFish-v13 under extreme occlusion.

For each case, the figure shows the enlarged region of interest, the RT-DETR-l prediction, and the corresponding DenseFish-v13 prediction. The selected examples highlight typical failure modes in highly crowded scenes, including merged boxes, missed detections, and incomplete localization. By comparison, DenseFish-v13 produces more complete and better-separated predictions under severe occlusion.

4.6. Bio-Kinematic Behavior Recognition Validation

Beyond frame-level detection and counting, we also evaluated the Bio-Kinematic Behavior Head for trajectory-based behavior recognition. Since the Pond Fish Detection Dataset was originally derived from underwater video recordings, it provides a reasonable temporal basis for constructing auxiliary short-trajectory clips rather than relying solely on isolated still images. Based on this video-derived acquisition setting, we organized an auxiliary behavior-validation subset and manually annotated the extracted short clips into three biologically meaningful behavior states: Normal, Feeding, and Hypoxia-related floating behavior. It should be noted that these behavior labels were not included in the original public detection dataset; they were introduced in this study solely for auxiliary trajectory-level validation.

The Bio-Kinematic Behavior Head performs behavior recognition by extracting interpretable trajectory descriptors from continuous fish detections, including instantaneous velocity, turning-angle variance, and vertical proximity to the water surface. Feeding behavior is typically associated with higher instantaneous velocity and more frequent directional changes. In contrast, hypoxia-related floating behavior tends to occur near the water surface, with slower movement and reduced turning. Normal behavior is characterized by a moderate swimming speed, relatively smooth trajectories, and no persistent surface-floating pattern.
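The three descriptors named above can be computed from a sequence of box centres with elementary geometry. The following is a minimal sketch, not the implementation of the Bio-Kinematic Behavior Head: it assumes pixel coordinates with y = 0 at the top edge of the frame, treats that edge as the water surface, and uses hypothetical function and key names.

```python
import math


def kinematic_descriptors(track, fps, frame_height):
    """Compute simple descriptors from (x, y) box centres, one per frame.

    Assumes y = 0 is the top of the frame and is taken as the water surface.
    """
    steps = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(track, track[1:])]
    speeds = [math.hypot(dx, dy) * fps for dx, dy in steps]  # pixels per second
    # Headings are undefined for zero-length steps, so those steps are skipped.
    headings = [math.atan2(dy, dx) for dx, dy in steps if (dx, dy) != (0, 0)]
    turns = [
        math.atan2(math.sin(b - a), math.cos(b - a))  # wrapped to [-pi, pi]
        for a, b in zip(headings, headings[1:])
    ]
    mean_turn = sum(turns) / len(turns) if turns else 0.0
    turn_var = sum((t - mean_turn) ** 2 for t in turns) / len(turns) if turns else 0.0
    mean_depth = sum(y for _, y in track) / len(track) / frame_height
    return {
        "mean_speed": sum(speeds) / len(speeds) if speeds else 0.0,
        "turn_angle_variance": turn_var,
        "surface_proximity": 1.0 - mean_depth,  # 1 = at the surface, 0 = at the bottom
    }


straight = [(0, 100), (3, 104), (6, 108), (9, 112)]           # steady swimming
zigzag = [(0, 10), (5, 15), (10, 10), (15, 15), (20, 10)]     # erratic, near surface

calm = kinematic_descriptors(straight, fps=10, frame_height=200)
erratic = kinematic_descriptors(zigzag, fps=10, frame_height=200)
print(calm)
print(erratic)
```

In this toy example the straight track has zero turning-angle variance, while the zigzag track combines a high variance with a position close to the surface. A downstream classifier would still be needed to map such descriptors to the Normal, Feeding, and Hypoxia-related states.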

As shown in Table 15, while general video recognition models like SlowFast [47] and aquaculture-specific FishBehavior-Net [36] achieve respectable results, our bio-kinematic head yields a higher Macro F1-score (88.4%). This confirms that trajectory-based kinematic descriptors are more effective for interpreting fish behavior in dense scenes than traditional 3D-CNN features. Feeding behavior achieves an F1-score of 88.0%, indicating that combining velocity and turning-angle variance is useful for recognizing rapid and irregular motion. Hypoxia-related floating behavior obtains an F1-score of 85.7%, which is slightly lower than the other two classes, mainly because slow near-surface movement may occasionally overlap with normal low-activity swimming patterns.

The results further indicate that no single kinematic descriptor is sufficient for robust behavior recognition. Velocity is effective for identifying feeding-related activity, while turning-angle variance captures irregular motion patterns associated with rapid directional changes. Surface proximity provides complementary information for identifying floating or low-mobility behavior near the water surface. Therefore, the joint use of motion intensity, directional instability, and vertical-position cues enables more reliable trajectory-level semantic interpretation than using any single descriptor.

Figure 12 qualitatively illustrates the behavior-discrimination mechanism of the proposed Bio-Kinematic Behavior Head. In the Normal case, the trajectory is smooth and continuous, with moderate displacement and limited directional fluctuation, indicating a stable swimming pattern. In contrast, the Feeding trajectory exhibits more frequent direction changes and a more tortuous movement path, consistent with higher swimming activity and greater turning-angle variance during feeding-related motion. For Hypoxia-related floating behavior, the trajectory is concentrated near the water surface and exhibits a short displacement range, indicating low mobility and proximity to the surface.

The proposed behavior head does not rely on a single motion cue but instead jointly considers velocity, turning-angle variance, and vertical position relative to the water surface. This design enables the model to distinguish active feeding behavior from stable normal swimming and low-mobility near-surface floating. Figure 12 further demonstrates that DenseFish-v13 can extend frame-level fish detection to trajectory-level semantic interpretation, thereby providing useful behavioral indicators for intelligent aquaculture monitoring. Overall, this auxiliary validation suggests that DenseFish-v13 can provide not only frame-level localization and counting results, but also trajectory-level semantic information for practical aquaculture management. Such capability is valuable for abnormal-state warning, feeding activity analysis, and health-status monitoring in intelligent farming systems. However, because the behavior labels were manually introduced for auxiliary validation rather than provided by the original detection dataset, future work will further expand the video-based behavior dataset and validate the proposed behavior head under longer-term and multi-camera aquaculture monitoring conditions.

To ensure the reliability of the auxiliary behavior-validation subset, the manual labeling process adhered to a rigorous protocol. Three independent annotators, all researchers specializing in aquaculture engineering with over two years of experience in fish ethology, performed the trajectory-level labeling. The annotators were provided with standardized definitions for ‘Normal’, ‘Feeding’, and ‘Hypoxia’ based on swimming speed, turning frequency, and vertical distribution. To quantify the reliability of the labels, we calculated the inter-annotator agreement using Cohen’s Kappa coefficient (κ). The average pairwise κ across all categories reached 0.84, indicating ‘substantial’ to ‘almost perfect’ agreement. In cases of disagreement (less than 15% of the clips), a final label was determined through consensus arbitration by a senior marine biology expert. This high level of inter-annotator consistency provides a robust foundation for interpreting the 89.2% classification accuracy as reflecting the model’s true discriminative capability rather than an artefact of labeling noise.
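For reference, the average pairwise Cohen’s κ for three annotators can be computed as sketched below. The label sequences are toy data invented for illustration and are not the study’s annotations.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label; kappa is undefined
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all annotator pairs."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy labels for six clips from three annotators (illustrative only).
ann1 = ["Normal", "Normal", "Feeding", "Feeding", "Hypoxia", "Hypoxia"]
ann2 = ["Normal", "Normal", "Feeding", "Feeding", "Hypoxia", "Normal"]
ann3 = ["Normal", "Feeding", "Feeding", "Feeding", "Hypoxia", "Hypoxia"]

print(round(mean_pairwise_kappa([ann1, ann2, ann3]), 2))
```

Unlike raw percentage agreement, κ discounts the agreement expected by chance, which matters when one behavior class dominates the clips.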

4.7. Edge Deployment Inference Performance

To verify the engineering feasibility of DenseFish-v13 as the local artificial intelligence core of an underwater robotic platform, we conducted a rigorous comparative analysis of its computational demands. This evaluation is particularly important because aquaculture inspection robots operate under strict hardware constraints, where model size, computational complexity, memory consumption, and real-time inference speed directly determine whether the algorithm can be deployed on the onboard edge-processing unit.

All deployment experiments were conducted on an NVIDIA Jetson Orin NX module (NVIDIA Corporation, Santa Clara, CA, USA) using NVIDIA TensorRT acceleration with FP16 precision under a fixed input resolution. To provide a comprehensive assessment of practical deployment costs, we report the parameter count (Params), computational complexity (FLOPs), peak memory usage, and real-world inference speed (FPS) for all compared models on the target edge device.

To provide objective support for the model’s suitability for edge deployment, we further analyzed the speed–accuracy trade-offs compared with mainstream lightweight and Mamba-based architectures. As shown in Table 16, although YOLOv10-s achieves a higher throughput (192 FPS), its mAP in dense aquaculture scenes drops significantly to 49.2% due to its limited capacity to resolve occluded features. Similarly, while the VMamba-T provides global modeling, its lack of frequency-domain denoising and repulsion constraints results in lower precision (54.6% mAP) at a higher computational cost (94 FPS). DenseFish-v13 occupies a unique position on the Pareto frontier: it maintains a real-time frame rate of 125 FPS—exceeding the 30–60 FPS requirement for robotic control loops—while providing a 15.0 percentage-point mAP improvement over the fastest lightweight detectors. This demonstrates that our framework is not just ‘fast’, but specifically ‘efficient’ at processing high-entropy underwater imagery on resource-constrained platforms.
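The Pareto claim can be checked for the three models whose figures are quoted above. In the sketch below, the 64.2% mAP for DenseFish-v13 is inferred from the stated 15.0 percentage-point gap over YOLOv10-s (49.2%) and is therefore an assumption; the other models in Table 16 are omitted because their values are not quoted in the text.

```python
def pareto_front(models):
    """Return names of models not dominated in (FPS, mAP), both higher-is-better."""
    front = []
    for name, (fps, acc) in models.items():
        dominated = any(
            o_fps >= fps and o_acc >= acc and (o_fps > fps or o_acc > acc)
            for other, (o_fps, o_acc) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


models = {
    "YOLOv10-s": (192, 49.2),
    "VMamba-T": (94, 54.6),
    "DenseFish-v13": (125, 64.2),  # mAP inferred from the stated 15.0-point gap
}

front = pareto_front(models)
print(front)
```

Among these three, VMamba-T is dominated by DenseFish-v13 (slower and less accurate), while YOLOv10-s remains on the frontier because of its higher throughput.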

As shown in Table 16, DenseFish-v13 achieves the best overall detection accuracy while maintaining strong real-time inference capability on the Orin NX edge device. Although some lightweight CNN baselines may achieve slightly higher FPS, their performance in dense and noisy underwater scenes remains clearly inferior. By contrast, Transformer-based models provide stronger global reasoning but incur a substantial efficiency penalty on embedded hardware. DenseFish-v13, therefore, offers a more favorable balance between computational cost and dense-scene perception performance.

Overall, the comprehensive experimental results obtained under extreme-density layouts, strong aeration noise, and strict hardware constraints demonstrate that DenseFish-v13 has reached a practically deployable level in terms of computational performance. The model satisfies the efficiency requirements for direct integration into the edge-processing unit of aquaculture inspection robots, thereby providing strong support for real-time intelligent monitoring in underwater farming systems.

5. Discussion

This section discusses the internal mechanism of DenseFish-v13 in terms of symmetry and asymmetry in intelligent underwater image processing. The main empirical finding is that dense fish recognition fails not only due to low image quality but also because biologically meaningful quasi-symmetric structures are repeatedly disrupted by asymmetric occlusion, aeration bubbles, and motion blur. DenseFish-v13 addresses this problem through three coupled mechanisms: frequency-domain symmetry preservation, global structure recovery, and asymmetry-aware latent instance separation.

5.1. The Physics of Spectral Disentanglement in Machine Vision

A key finding from our aeration robustness experiments (Table 3) is that DenseFish-v13 maintains stable detection accuracy even under active, turbulent mechanical aeration, a highly dynamic scenario in which standard CNNs and earlier YOLO models suffer significant drops in precision. This result supports our core hypothesis of spectral distinguishability. Traditional spatial-domain denoising methods typically treat high-frequency biological textures (such as scales) and environmental disturbances (such as aeration bubbles) alike as spatial “noise” to be smoothed out.

In contrast, our Bio-Harmonic Frequency Gate (B-HFG) operates in the frequency domain. By learning to pass the harmonic bands associated with the quasi-periodic arrangement of fish scales while attenuating the broadband energy spectrum of bubbles, the model effectively performs continuous “spectral dehazing.” This suggests a shift in perspective for machine vision applications in agriculture: deep learning models should move beyond treating camera inputs as mere arrays of RGB pixels and instead treat images as composite physical signals in which biological identity is encoded within specific spatial-frequency bands.
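The gating idea can be illustrated with a small NumPy sketch: transform a feature map with the FFT, rescale the magnitude spectrum while keeping the phase, and invert. This is not the learned B-HFG itself. The fixed radial band-pass `radial_band_gate`, its `floor` value, and the toy stripe-plus-noise input are all assumptions standing in for the learnable harmonic attention map and real feature maps.

```python
import numpy as np


def radial_band_gate(shape, low, high, floor=0.1):
    """Hypothetical fixed gate: 1 inside a radial frequency band, `floor` outside."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    return np.where((radius >= low) & (radius <= high), 1.0, floor)


def frequency_gate(feature, gate):
    """Scale the magnitude spectrum by `gate` while preserving the phase."""
    spectrum = np.fft.fft2(feature)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    gated = (magnitude * gate) * np.exp(1j * phase)
    # The gate is symmetric in frequency, so the inverse transform is real
    # up to rounding error.
    return np.real(np.fft.ifft2(gated))


# Toy input: a periodic stripe texture (8 cycles across 64 pixels, i.e.
# 0.125 cycles/pixel) plus broadband Gaussian clutter.
rng = np.random.default_rng(0)
n = 64
cols = np.arange(n)
stripes = np.tile(np.sin(2 * np.pi * 8 * cols / n), (n, 1))
clutter = rng.normal(0.0, 1.0, size=(n, n))

gate = radial_band_gate((n, n), low=0.10, high=0.15)
refined = frequency_gate(stripes + clutter, gate)

err_before = np.mean(clutter**2)
err_after = np.mean((refined - stripes) ** 2)
print(f"residual clutter energy: {err_before:.3f} -> {err_after:.3f}")
```

Because the stripe frequency falls inside the pass band, the texture is kept while most of the broadband energy is attenuated. A learned gate would replace the hand-set band with weights fitted to the data.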

5.2. Breaking the Feature Collapse Bottleneck via Physics-Inspired Constraints

The substantial reduction in Counting MAE on the extreme-density split (Table 2) is mainly attributed to the integration of the Density-Aware Repulsion Loss into our NMS-free bipartite matching architecture. In standard object detection frameworks, the loss function only encourages the predicted bounding-box coordinates to regress toward the ground truth. However, in extreme-density scenarios (where mutual occlusion exceeds 80%), this unconstrained optimization allows the latent feature vectors of overlapping fish to converge and become nearly indistinguishable, which leads to feature collapse.

Our Repulsion Loss introduces an artificial “Latent Force Field.” By penalizing the cosine similarity between the features of heavily overlapping instances during the bipartite assignment phase, we compel the YOLOv13-Mamba network to extract subtle discriminative biological cues, such as small variations in swimming angle, body contour, or tail-beat phase, in order to differentiate two targets that share the same pixel region. Unlike repulsion methods from pedestrian detection, which operate on spatial bounding-box overlap (IoU), our Latent Repulsion acts directly on high-dimensional feature vectors. These results suggest that embedding explicit, physics-inspired constraints (i.e., the principle that two distinct physical objects should not occupy the same point in feature space) can complement purely data-driven learning in dense agricultural scenes.
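A minimal NumPy sketch of such a penalty is given below. It assumes that "heavily overlapping" means box IoU above the density threshold, and it borrows the default values 0.5 and 0.2 from Table 1. The exact formulation used in DenseFish-v13 is not reproduced here, and the function names are illustrative.

```python
import numpy as np


def pairwise_iou(boxes):
    """IoU matrix for boxes given as rows of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / union


def latent_repulsion(features, boxes, tau=0.5, lambda_rep=0.2):
    """Weighted mean positive cosine similarity over pairs with IoU > tau."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine = unit @ unit.T
    crowded = np.triu(pairwise_iou(boxes) > tau, k=1)  # count each pair once
    if not crowded.any():
        return 0.0
    return lambda_rep * float(np.maximum(cosine[crowded], 0.0).mean())


# Two heavily overlapping boxes (IoU of about 0.82) and one isolated box.
boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10], [50, 50, 60, 60]], dtype=float)
collapsed = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])  # same direction
separated = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 1.0]])  # orthogonal

loss_collapsed = latent_repulsion(collapsed, boxes)
loss_separated = latent_repulsion(separated, boxes)
print(loss_collapsed, loss_separated)
```

The penalty is largest when the two overlapping instances have parallel feature vectors and vanishes once they are orthogonal, which is the behavior the text describes. The isolated box never contributes because it falls below the overlap threshold.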

5.3. Edge-Oriented Intelligent Underwater Image Processing

The successful deployment of DenseFish-v13 yields significant industrial implications for the broader field of agricultural robotics. Agricultural inspection robots typically operate on resource-constrained, low-power embedded computing platforms (such as the NVIDIA Jetson Orin NX series), which cannot natively support heavy, high-complexity network architectures. While Vision Transformer-based models offer strong global attention capabilities, their practical throughput on low-power edge devices is often insufficient once multiple concurrent perception tasks (detection, tracking, behavioral analysis) share the same onboard processor. This can limit their end-to-end latency budget for real-time temporal tracking of fast-moving aquatic livestock.

The proposed framework eases this hardware–software tension. By integrating our Bi-MSW-Mamba module, which inherits the core Vision State Space Model (VMamba) architecture and has linear computational complexity (O(N)), into the streamlined YOLOv13 baseline, we retain much of the global reasoning capability of heavier Transformers. Consequently, the architecture achieves an inference speed of 125 FPS when deployed directly on the edge platform. This allows the autonomous inspection robot to process continuous visual data streams on board with low latency (about 8 ms per frame at 125 FPS). It also avoids the vulnerability and cost associated with transmitting raw video to onshore cloud servers, which makes the approach scalable for deep-sea aquaculture networks.

Integration with Underwater Robotic Kinematics

The real-time high-throughput of DenseFish-v13 (125 FPS on Jetson Orin NX) is not merely a benchmark metric but a prerequisite for the closed-loop control of underwater inspection robots. In actual deployment, the Bio-Kinematic Behavior Head serves as the perception layer for the robot’s path-planning module. By converting continuous fish detections into interpretable trajectory descriptors, the algorithm enables the robot to perform behavior-triggered navigation. For instance, when the ‘Feeding (Frenzy)’ state is detected, the robot can transition from a broad-area survey to a localized hovering mode to facilitate precise monitoring of feeding. Furthermore, the low-latency NMS-free output allows the robot’s onboard flight controller to execute asynchronous obstacle avoidance and target tracking with a response time of <8 ms, effectively mitigating the delay-induced instability common in underwater PID or MPC control systems. This integration transforms the detector from a passive monitor into an active sensory driver for autonomous aquaculture robotics.
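To make the trajectory-to-state step concrete, the sketch below computes three simple descriptors (mean speed, turning-angle variance, mean image height) and applies a rule-based decision in the spirit of the bio-logic tree. The mapping of the four thresholds to these rules, the threshold values, and the toy tracks are all assumptions for illustration. The paper initializes its thresholds from percentiles and tunes them by grid search.

```python
import numpy as np


def kinematic_descriptors(track):
    """Mean speed, turning-angle variance, and mean height of a (T, 2) track."""
    steps = np.diff(track, axis=0)
    speed = np.linalg.norm(steps, axis=1)
    heading = np.arctan2(steps[:, 1], steps[:, 0])
    # Wrap heading differences into [-pi, pi) before taking the variance.
    turn = (np.diff(heading) + np.pi) % (2 * np.pi) - np.pi
    return speed.mean(), turn.var(), track[:, 1].mean()


def classify(track, d_high, eps, d_low, h_surface):
    """Rule-based behavior state; all thresholds here are placeholders."""
    speed, turn_var, mean_y = kinematic_descriptors(track)
    if speed > d_high and turn_var > eps:
        return "feeding"
    # Image y grows downward, so a small mean y means close to the surface.
    if speed < d_low and turn_var < eps and mean_y < h_surface:
        return "hypoxia"
    return "normal"


thresholds = dict(d_high=0.05, eps=0.5, d_low=0.01, h_surface=0.2)

# Toy tracks in normalized image coordinates (x, y).
cruising = np.array([[0.1 + 0.03 * i, 0.5] for i in range(10)])
floating = np.array([[0.1 + 0.002 * i, 0.05] for i in range(10)])
frenzy = np.array(
    [[0.5, 0.5], [0.6, 0.5], [0.5, 0.6], [0.6, 0.6], [0.5, 0.5], [0.6, 0.4]]
)

labels = [classify(t, **thresholds) for t in (cruising, floating, frenzy)]
print(labels)
```

A planner could then map the returned state to a navigation mode, for example switching to hovering when "feeding" is reported.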

5.4. Limitations and Future Trajectories

While DenseFish-v13 establishes a highly robust machine vision foundation for agricultural inspection robots, we acknowledge specific operational limitations that dictate future research trajectories.

First, Dependency on Optical Visibility: While the Bio-Harmonic Frequency Gate suppresses mechanical bubble noise, it cannot recover information under zero-visibility conditions, such as severe turbidity events or extreme algal blooms. Future iterations of the inspection robot’s sensor suite will explore multi-modal sensor fusion. We intend to integrate our vision architecture with active acoustic sensors (e.g., dual-frequency imaging sonar) so that biomass monitoring can continue independently of optical clarity.

Second, 2D-to-3D Kinematic Ambiguity: The current Bio-Kinematic Behavior Head relies on two-dimensional trajectory analysis. Movements oriented primarily along the Z-axis (depth) relative to the camera appear as artificially low velocity in the 2D image plane. This projection effect can cause the algorithm to misclassify a healthy fish as exhibiting “lethargic/hypoxic” behavior. Future research should integrate lightweight stereo depth estimation into the YOLOv13-Mamba backbone, which would enable 3D behavioral profiling and further improve analytical reliability. Through these architectural improvements, our framework provides a viable, empirically validated pathway for deploying autonomous perception on agricultural inspection robots.

Third, Limited Scale of the Evaluation Dataset: Although our Dense-Aqua benchmark captures extreme-density and noisy conditions, the overall test set comprises only 80 images, and the extreme-density subset is even smaller. While we employed statistical bootstrapping to provide 95% confidence intervals for our comparative results, deep learning models benefit from large-scale, highly diverse test data. Future work will focus on expanding the dataset by collecting and annotating multi-seasonal, multi-farm aquaculture imagery to further validate the model’s generalization at an industrial scale.
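For readers unfamiliar with the bootstrap procedure mentioned above, the following sketch shows a percentile bootstrap for a per-image metric such as the absolute counting error. The per-image values are synthetic placeholders generated only to exercise the function. They are not results from Dense-Aqua, and the resample count and seed are arbitrary choices.

```python
import numpy as np


def bootstrap_ci(per_image_values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of a per-image metric."""
    values = np.asarray(per_image_values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample image indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(means, [alpha, 1.0 - alpha])
    return values.mean(), low, high


# Synthetic absolute counting errors for an 80-image test set (placeholder data).
errors = np.abs(np.random.default_rng(1).normal(0.0, 4.0, size=80))
mean_err, ci_low, ci_high = bootstrap_ci(errors)
print(f"MAE = {mean_err:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

With only 80 images the interval stays fairly wide, which is the practical consequence of the limited test-set size discussed here. Set-level metrics such as mAP would need the whole evaluation recomputed on each resample rather than a simple mean.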

6. Conclusions

In this paper, we proposed DenseFish-v13, a symmetry-aware NMS-free framework that reformulates dense aquaculture perception as a synergy between structured biological pattern preservation and irregular environmental noise suppression. The core academic contribution lies in the ‘symmetry–asymmetry’ modeling approach, which decouples quasi-periodic biological textures from broadband aeration disturbances through a novel Wavelet-Mamba architecture.

Experimental results demonstrate that DenseFish-v13 achieves 64.8% mAP@50:95 and a Counting MAE of 3.7 on the overall test set. From an engineering perspective, the system maintains stable, high-precision counting even under intense mechanical aeration, an industrial scenario in which conventional CNN and Transformer detectors typically suffer from feature collapse. Furthermore, reaching a throughput of 125 FPS on the NVIDIA Jetson Orin NX platform confirms the framework’s practical viability as a real-time ‘AI brain’ for autonomous inspection robots. By moving from static pixel-level detection to trajectory-level behavior recognition, this research provides a robust and interpretable tool for closed-loop welfare monitoring in smart aquaculture.

Future work will focus on multimodal sensing, such as integrating imaging sonar to overcome zero-visibility conditions, and expanding 3D trajectory estimation to provide more granular bio-kinematic insights across diverse industrial farm environments.

Author Contributions

Conceptualization, Y.C., M.S. and X.G.; Methodology, J.W., Z.L., Z.M., Y.X., X.G. and S.H.; Software, J.W., M.S., Z.L., Z.M., Y.X., Y.W. and X.G.; Validation, Y.C., Y.M., Z.M., Y.X., Y.W. and S.H.; Formal analysis, M.S. and S.H.; Investigation, J.W.; Resources, Z.L. and S.H.; Data curation, Y.C.; Writing—original draft, Y.C. and Y.W.; Writing—review & editing, Y.C., J.W., M.S. and X.G.; Visualization, J.W. and Y.M.; Supervision, Y.C., Y.M., Z.L., Y.X. and S.H.; Project administration, M.S., Y.X., Y.W. and X.G.; Funding acquisition, Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the project “Development and Application of New Technologies and Safety Management of Shipping Companies in Smart Shipping” under Grant No. H20230487.

Data Availability Statement

The data presented in this study are openly available in DenseFish-v13 at https://github.com/Eason-Yeager/Densefish-v13 (accessed on 15 June 2026).

Conflicts of Interest

The authors declare that this study received funding from Development and Application of New Technologies for Intelligent Shipping and Safety Management of Shipping Companies. The funder had the following involvement with the study: Data organization and Server maintenance.

References

1. Thakur, A.; Venu, S.; Gurusamy, M. An extensive review on agricultural robots with a focus on their perception systems. Comput. Electron. Agric. 2023, 214, 108146.
2. Zhang, B.; Qiao, Y. AI, Sensors, and Robotics for Smart Agriculture. Agronomy 2024, 14, 1180.
3. Spagnuolo, M.; Todde, G.; Caria, M.; Furnitto, N.; Schillaci, G.; Failla, S. Agricultural Robotics: A Technical Review Addressing Challenges in Sustainable Crop Production. Robotics 2025, 14, 9.
4. Yue, J.; Shu, M.; Zhou, C.; Feng, H.; Yu, F. How Optical Sensors and Deep Learning Enhance the Production Management in Smart Agriculture. Agriculture 2025, 15, 2612.
5. Wu, A.-Q.; Li, K.-L.; Song, Z.-Y.; Lou, X.; Hu, P.; Yang, W.; Wang, R.-F. Deep Learning for Sustainable Aquaculture: Opportunities and Challenges. Sustainability 2025, 17, 5084.
6. Correia, B.; Pacheco, O.; Rocha, R.J.M.; Correia, P.L. Image-Based Shrimp Aquaculture Monitoring. Sensors 2025, 25, 248.
7. Li, G.; Yao, Z.; Hu, Y.; Lian, A.; Yuan, T.; Pang, G.; Huang, X. Deep Learning-Based Fish Detection Using Above-Water Infrared Camera for Deep-Sea Aquaculture: A Comparison Study. Sensors 2024, 24, 2430.
8. Pei, L.; Zhou, H.; Lu, G.; Zhao, J.; Peng, Z.; Zhu, S.; Ye, Z.; Zhou, J. YOLO-FC: A Lightweight Fish Detection Model for High-Density Aquaculture Counting Scenarios. Fishes 2026, 11, 114.
9. Li, S.; Li, P.; He, S.; Kuai, Z.; Gu, Y.; Liu, H.; Liu, T.; Lin, Y. An Automatic Detection and Statistical Method for Underwater Fish Based on Foreground Region Convolution Network (FR-CNN). J. Mar. Sci. Eng. 2024, 12, 1343.
10. Xiao, X.; Liu, T.; He, S.; Li, P.; Gu, Y.; Li, P.; Dong, J. A Multi-Fish Tracking and Behavior Modeling Framework for High-Density Cage Aquaculture. Sensors 2026, 26, 256.
11. Zhou, Y.H.; Ma, Z.M.; Zhou, Y.J.; Li, Y.T.; Xiang, H.X.; Cheng, Y.M.; Chen, T.L.; Zhang, K.J.; Nan, Z.H.; Ni, J.H.; et al. FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection. arXiv 2026, arXiv:2606.16659.
12. Shao, J.; Cheng, Y. CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective. arXiv 2025, arXiv:2506.02878.
13. Khan, Z.; Shen, Y.; Liu, H. Object Detection in Agriculture: A Comprehensive Review of Methods, Applications, Challenges, and Future Directions. Agriculture 2025, 15, 1351.
14. Feng, W.; Liu, M.; Sun, Y.; Wang, S.; Wang, J. The Use of a Blueberry Ripeness Detection Model in Dense Occlusion Scenarios Based on the Improved YOLOv9. Agronomy 2024, 14, 1860.
15. Cheng, Y.; Feng, G.; Zhang, C. An Efficient and Lightweight YOLOv8s Strawberry Maturity Detection Model. J. Agric. Sci. Technol. A 2024, 14, 46–66.
16. Lin, X.; Liao, D.; Du, Z.; Wen, B.; Wu, Z.; Tu, X. SDA-YOLO: An Object Detection Method for Peach Fruits in Complex Orchard Environments. Sensors 2025, 25, 4457.
17. Xie, Y.; Ma, Y.; Cheng, Y.; Li, Z.; Liu, X. BIT*+TD3 Hybrid Algorithm for Energy-Efficient Path Planning of Unmanned Surface Vehicles in Complex Inland Waterways. Appl. Sci. 2025, 15, 3446.
18. Awad, A.; Saleem, A.; Paheding, S.; Lucas, E.; Al-Ratrout, S.; Havens, T.C. Revisiting Underwater Image Enhancement for Object Detection: A Unified Quality–Detection Evaluation Framework. J. Imaging 2026, 12, 18.
19. Wang, L.; Yang, M.; Pan, C.; Tao, J. DCT Underwater Image Enhancement Based on Attenuation Analysis. Sensors 2025, 25, 7192.
20. Zhang, B.; Fang, J.; Li, Y.; Wang, Y.; Zhou, Q.; Wang, X. GFRENet: An Efficient Network for Underwater Image Enhancement with Gated Linear Units and Fast Fourier Convolution. J. Mar. Sci. Eng. 2024, 12, 1175.
21. Feng, X.; He, Y.; Chen, L.; Yang, Y.; Wang, C.; Chen, Y.; Zhong, Y.; Kuang, Z.; Ding, J.; Yin, X.; et al. SRGS: Super-Resolution 3D Gaussian Splatting. arXiv 2024, arXiv:2404.10318.
22. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
23. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 213–229.
24. Li, J.; Qiu, X.; Xu, L.; Guo, L.; Qu, D.; Long, T.; Fan, C.; Li, M. UniF2ace: Fine-Grained Face Understanding and Generation with Unified Multimodal Models. arXiv 2025, arXiv:2503.08120.
25. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752.
26. Zhao, F.; He, Y.; Song, J.; Wang, J.; Xi, D.; Shao, X.; Wu, Q.; Liu, Y.; Chen, Y.; Zhang, G.; et al. Smart UAV-assisted blueberry maturity monitoring with Mamba-based computer vision. Precis. Agric. 2025, 26, 56.
27. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166.
28. Zhou, Y.-H.; Li, H.; Lin, R.; Huang, H.; Zhou, J.; Yuan, C.; Lan, T.; Zhou, Z.; Li, Y.; Xu, J.; et al. MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation. arXiv 2026, arXiv:2602.00607. Available online: https://openreview.net/forum?id=045O8eWf33 (accessed on 15 June 2026).
29. Zhang, H.; Li, S.; Xie, J.; Chen, Z.; Chen, J.; Guo, J. VMamba for plant leaf disease identification: Design and experiment. Front. Plant Sci. 2025, 16, 1515021.
30. Li, H.; Zhou, Y.; Huang, H.; Chen, L.; Cheng, Y.; Liu, X.; Jin, D.; Xu, J.; Liao, J.; Lan, T.; et al. MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation. arXiv 2026, arXiv:2605.28035.
31. Shao, J.; Huang, H.; Wu, J.; Cheng, Y.M.; Wu, Z.Y.; Shan, Y.; Zheng, M.K. VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits. arXiv 2025, arXiv:2505.10202.
32. Du, K.; Wang, B.; Zhang, C.; Cheng, Y.; Lan, Q.; Sang, H.; Cheng, Y.; Yao, J.; Liu, X.; Qiao, Y.; et al. PrefillOnly: An Inference Engine for Prefill-Only Workloads in Large Language Model Applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, Seoul, Republic of Korea, 13–16 October 2025; ACM: New York, NY, USA, 2025; pp. 399–414.
33. Li, J.; Cui, Y.; Huang, T.; Ma, Y.; Fan, C.; Cheng, Y.; Yang, M.; Zhong, Z. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE. arXiv 2025, arXiv:2507.21802.
34. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
35. Ma, J.; Yang, F.; Li, F.; Wang, H. U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722.
36. Zhao, H.; Wu, Y.; Qu, K.; Cui, Z.; Zhu, J.; Li, H.; Cui, H. Vision-based dual network using spatial-temporal geometric features for effective resolution of fish behavior recognition with fish overlap. Aquac. Eng. 2024, 105, 102409.
37. Zhu, J.; He, W.; Weng, W.; Zhang, T.; Mao, Y.; Yuan, X.; Ma, P.; Mao, G. An Embedding Skeleton for Fish Detection and Marine Organisms Recognition. Symmetry 2022, 14, 1082.
38. Zhao, M.; Zhou, H.; Li, X. YOLOv7-SN: Underwater Target Detection Algorithm Based on Improved YOLOv7. Symmetry 2024, 16, 514.
39. Sun, Y.; Chen, W.; Wang, Q.; Fang, T.; Liu, X. Improvement and Optimization of Underwater Image Target Detection Accuracy Based on YOLOv8. Symmetry 2025, 17, 1102.
40. Feng, Z.; Liu, F. Balancing Feature Symmetry: IFEM-YOLOv13 for Robust Underwater Object Detection Under Degradation. Symmetry 2025, 17, 1531.
41. Li, M.; Liu, W.; Shao, C.; Qin, B.; Tian, A.; Yu, H. Multi-Scale Feature Enhancement Method for Underwater Object Detection. Symmetry 2025, 17, 63.
42. You, K.; Li, X.; Yi, P.; Zhang, Y.; Xu, J.; Ren, J.; Bai, H.; Ma, C. PIC-GAN: Symmetry-Driven Underwater Image Enhancement with Partial Instance Normalisation and Colour Detail Modulation. Symmetry 2025, 17, 201.
43. Mohankumar, V.; Sasithradevi, A. A Comprehensive Annotated Image Dataset for Real-Time Fish Detection in Pond Settings. Data Brief 2024, 57, 111007.
44. Banno, K.; Gonçalves, F.M.F.; Sauphar, C.; Anichini, M.; Hazelaar, A.; Sperre, L.H.; Stolz, C.; Aas, G.H.; Gansel, L.C.; da Silva Torres, R. Identifying losers: Automatic identification of growth-stunted salmon in aquaculture using computer vision. Mach. Learn. Appl. 2024, 16, 100562.
45. Saleh, A.; Laradji, I.H.; Konovalov, D.A.; Bradley, M.; Vazquez, D.; Sheaves, M. A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Sci. Rep. 2020, 10, 14671.
46. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1091–1100.
47. Shi, Y.; Yan, W.; Huang, N.; Chen, Y.; Zhang, C.; He, T.; Yeo, S.Y.; Li, M. One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems. arXiv 2026, arXiv:2605.22144.

Figure 1. Overall architecture of DenseFish-v13 for smart aquaculture monitoring.

Figure 2. Bio-harmonic frequency gate (B-HFG) for spectral-domain feature refinement with hydrodynamic-informed constraint (HIC). The input feature map is transformed into the frequency domain via FFT, where the magnitude spectrum is adaptively modulated by a learnable harmonic attention map while preserving phase information. The filtered spectrum is reconstructed via IFFT to obtain denoised features for downstream detection. HIC provides additional physical consistency by constraining motion patterns.

Figure 3. YOLOv13-Mamba backbone with global occlusion-aware modeling.

Figure 4. Density-aware feature decoupling for NMS-free matching. The asterisk (*) marks the selected or emphasized feature/matching branch. Different colors are used only for visual distinction and do not denote different operations or additional mathematical meanings.

Figure 5. Bio-kinematic behavior head for trajectory-based behavioral analysis. Object trajectories are constructed from consecutive frames, from which kinematic features (velocity and turning angle) and their statistics are computed. A rule-based bio-logic tree then classifies behavioral states into feeding, hypoxia, and normal, thereby enabling trajectory-level behavioral interpretation in aquaculture scenarios. The trajectory lines represent fish movement paths, the green dots indicate trajectory starting points, the red stars mark trajectory endpoints, and the black arrows indicate movement directions.

Figure 6. Representative visual challenge types targeted by DenseFish-v13. (a) Adult-fish aquaculture monitoring scene, illustrating the basic application context of aquatic livestock inspection. (b) Multi-scale and multi-pose fish distribution, where fish appear with different sizes, orientations, and depths, requiring global context modeling. (c) Crowded occlusion and boundary entanglement, where adjacent fish bodies overlap, and instance separation becomes difficult. (d) Strong aeration-bubble interference, showing bubble-induced high-frequency disturbance that motivates the proposed Bio-Harmonic Frequency Gate. (e) Illumination fluctuation and low-contrast underwater scene, where unstable visibility and turbidity degrade fish contours and local textures. (f) Motion blur and trajectory instability, where fast fish movement and underwater disturbance reduce localization stability and motivate trajectory-level behavior analysis. Together, these image types reflect the core methodological motivations of DenseFish-v13, including global occlusion-aware modeling, density-aware instance separation, bubble-noise suppression, underwater visibility robustness, and motion-based behavior interpretation.

Figure 7. Sensitivity curves of the repulsion mechanism. (a) Influence of $\lambda_{rep}$ on mAP@50:95, Counting MAE, and $R_{occ}$. (b) Influence of $\tau$ on mAP@50:95, Counting MAE, and $R_{occ}$.

Figure 8. Qualitative comparison of DenseFish-v13 with representative baseline detectors under challenging aquaculture visual conditions. The four rows show representative scenarios of (a) crowded occlusion, (b) bubble interference, (c) low visibility, and (d) motion blur, respectively. The columns present the ground-truth annotations and the detection results of YOLOv13-m, RT-DETR-l, and DenseFish-v13. Green bounding boxes indicate correctly localized fish instances. In contrast, red bounding boxes highlight typical failure cases, including false positives due to bubble-like noise, missed fish in degraded visibility, merged detections in overlapping regions, and unstable localization under motion blur. Compared with YOLOv13-m and RT-DETR-l, DenseFish-v13 produces more complete fish localization, clearer separation between adjacent individuals, and fewer noise-induced false detections, demonstrating the effectiveness of global occlusion-aware modeling, spectral noise suppression, and density-aware instance separation.

Figure 9. Spectral mechanism visualization of B-HFG under strong synthetic bubble perturbation. (a) Representative test image with artificially added bubble artefacts; (b) feature-response heatmap of the baseline model without B-HFG, showing strong activation on bubble-like clutter; (c) refined feature-response heatmap after applying B-HFG, where noise-related responses are suppressed while fish contours remain salient; (d) detection result of the baseline model; (e) detection result of DenseFish-v13. In the heatmaps, warmer colors indicate stronger feature responses, whereas cooler colors indicate weaker responses.

Figure 10. Performance comparison under increasing density levels. (a) Low-to-extreme decline of occlusion recall $R_{occ}$. (b) mAP@50:95 comparison across low-, medium-, high-, and extreme-density subsets.

Figure 11. Qualitative comparison of bounding-box predictions under extreme occlusion.

Figure 12. Trajectory-based behavior recognition results generated by the Bio-Kinematic Behavior Head. (a) A smooth trajectory with moderate swimming velocity, low turning variance, and no persistent surface-floating pattern characterizes normal behavior. (b) Feeding behavior shows a more tortuous trajectory with frequent directional changes, corresponding to high swimming velocity and high turning variance. (c) Hypoxia-related floating behavior is characterized by slow movement near the water surface, with low turning variance and high surface proximity. Green circles and red stars denote the start and end points of each trajectory, respectively; black arrows indicate the movement direction, and the dashed horizontal line in (c) represents the water surface. All trajectories are plotted using normalized image coordinates.

Table 1. Key training and deployment settings.

Parameter	Value	Justification
Batch Size	32 (GPU), 1 (Jetson)	Balances training stability and real-time edge inference
Learning Rate (initial)	$10^{-5} \to 10^{-3}$ warm-up; cosine decay	Stable starting point for dense underwater detector training
Learning Rate Schedule	Cosine decay	Smooth convergence for noisy and crowded scenes
Weight Decay (L2 reg)	0.0005	Reduces overfitting to scene-specific noise patterns
Momentum (SGD)	0.937	Standard setting for stable detector optimization
Optimizer	SGD with momentum	More stable than Adam for object detection
Loss Function	CIoU + BCE + DFL + Repulsion + HIC	Improves localization in crowded fish scenes
Warm-up Epochs	10	Stabilizes early-stage optimization
Total Epochs	100	Sufficient for convergence under the current setup
Repulsion Loss Weight ($\lambda_{rep}$)	0.2	Balances dense-instance separation and training stability
Repulsion Activation Epoch	50	Avoids unstable matching before basic localization is learned
Spatial Density Threshold ($\tau$)	0.5	Activates repulsion mainly in truly crowded regions
Behavior Thresholds ($\delta_{high}, \epsilon, \delta_{low}, H_{surface}$)	Percentile init. + validation grid search	Distinguish normal, feeding, and hypoxia-related motion states
Precision Mode	FP16	Improves edge-side inference efficiency on Orin NX
HIC Loss Weight ($\lambda_{hic}$)	0.1	Balances kinematic regularization without destabilizing localization loss
HIC Activation Epoch	50	Applied after stable trajectory formation
MoE-SG Expert Count	3	Covers high-aeration, high-turbidity, and clear-water conditions
CPO Prototype Bank Size	256 per class	Sufficient coverage of fish orientation and scale variation
CPO EMA Momentum	0.999	Stable prototype updates across training batches

Table 2. Component-wise ablation (baseline: YOLOv13-m).

Variant	Precision (%)	Recall (%)	mAP@50 (%)	mAP@50:95 (%)	Counting MAE (↓)
Baseline (YOLOv13-m)	82.4	78.1	88.6	56.8	10.2
+ Bi-MSW-Mamba Block	84.1	80.7	90.2	59.5 (+2.7)	8.5
+ Bi-MSW-Mamba + B-HFG	86.8	82.4	91.7	62.1 (+2.6)	5.8
+ Bi-MSW-Mamba + B-HFG + MoE-SG	87.6	83.5	92.5	63.0 (+0.9)	5.1
+ above + $L_{rep}$ + CPO	88.5	84.9	93.4	64.2 (+1.2)	4.1
+ above + $L_{HIC}$ (Full DenseFish-v13)	88.9	85.6	93.8	64.8 (+0.6)	3.7

Note: ↓ indicates that lower values are better.

Table 3. Computational overhead analysis of each ablation variant.

Variant	Params (M)	FLOPs (G)	Peak Memory (GB)	FPS (RTX 4090)	FPS (Orin NX)
Baseline YOLOv13-m	20.1	68.4	1.42	356	130
+ Bi-MSW-Mamba Block	21.7	71.2	1.55	334	126
+ Bi-MSW-Mamba + B-HFG	22.0	72.6	1.62	326	125
+ above + Repulsion Loss	22.0	72.6	1.62	326	125
+ above + Repulsion Loss + CPO	22.0	72.6	1.62	326	125
+ above + HIC (Full DenseFish-v13)	22.0	72.6	1.62	326	125

Table 4. Sensitivity analysis of $\lambda_{rep}$.

$\lambda_{rep}$	mAP@50:95 (%)	$R_{occ}$ (%)	Counting MAE
0.0	62.1	62.9	5.8
0.1	63.5	66.1	4.7
0.2	64.2	68.7	4.1
0.3	63.9	68.1	4.3
0.4	63.1	66.4	4.9

Table 5. Sensitivity analysis of $\tau$.

$\tau$	mAP@50:95 (%)	$R_{occ}$ (%)	Counting MAE
0.3	62.9	66.2	5.0
0.4	63.7	67.8	4.4
0.5	64.2	68.7	4.1
0.6	63.8	67.5	4.6
0.7	63.0	66.0	5.1

Table 6. Orthogonal ablation study based on the vanilla YOLOv13-m baseline.

Bi-MSW-Mamba	B-HFG	Repulsion Loss	HIC	Precision (%)	Recall (%)	mAP@50 (%)	mAP@50:95 (%)	Counting MAE ↓
×	×	×	×	83.6	79.2	89.4	58.7	8.4
√	×	×	×	84.9	81.1	90.5	60.4	7.2
×	√	×	×	85.7	81.8	91.1	61.0	6.5
×	×	√	×	85.2	82.4	90.9	60.8	5.9
√	√	×	×	86.9	83.2	92.0	62.8	5.2
√	×	√	×	87.1	83.8	92.4	63.1	4.8
×	√	√	×	87.4	84.1	92.7	63.4	4.6
√	√	√	×	88.5	84.9	93.4	64.2	4.1
√	√	√	√	88.9	85.6	93.8	64.8	3.7

Note: ↓ indicates lower is better; √ indicates that the component is used; × indicates that the component is not used.

Table 7. Effect of NMS-free matching and density-aware repulsion.

Detection Head	mAP@50:95 (%)	Counting MAE ↓	Occlusion Recall (%) ↑	Merged Error Rate (%) ↓
YOLOv13 with NMS	58.7	8.4	53.6	18.2
NMS-free YOLOv13	60.5	6.6	59.1	14.7
NMS-free + Repulsion Loss	63.5	4.8	66.3	9.6
NMS-free YOLOv13 + Repulsion + CPO	64.2	4.1	68.7	8.1

Note: ↓ indicates that lower values are better. ↑ indicates that higher values are better.

Table 8. Performance on the extreme-density split (Dense-Aqua dataset).

Model	Architecture Paradigm	mAP@50:95 (%)	Counting MAE (↓)	Occlusion Recall $R_{occ}$ (%) (↑)	FPS (Edge)
Deep-Fish [45]	Point-supervision	-	9.2 ± 0.6	-	45
CSRNet [46]	Density-map based	-	7.8 ± 0.5	-	32
FR-CNN [9]	Faster R-CNN based	51.4 ± 1.2	14.2 ± 0.7	50.2 ± 1.5	18
YOLO-FC [8]	YOLO-based (CNN)	57.2 ± 0.9	9.6 ± 0.5	56.4 ± 1.3	95
YOLOv8-m	CNN w/NMS	54.2 ± 1.1	12.4 ± 0.6	45.1 ± 1.4	135
YOLOv10-m	CNN (Early NMS-Free)	55.1 ± 1.0	11.8 ± 0.6	46.8 ± 1.3	142
YOLOv11-m	Advanced CNN	56.8 ± 1.0	10.2 ± 0.5	49.3 ± 1.2	130
YOLOv13-m	Vanilla YOLOv13 baseline	58.7 ± 0.9	8.4 ± 0.4	53.6 ± 1.1	128
RT-DETR-l	Transformer	58.3 ± 0.9	7.2 ± 0.4	58.4 ± 1.1	74
CrowdDet	Multi-Head CNN	56.5 ± 1.0	8.5 ± 0.5	55.2 ± 1.2	98
DenseFish-v13	YOLOv13-Mamba	64.2 ± 0.7	4.1 ± 0.2	68.7 ± 0.9	125

Note: ↓ indicates that lower values are better. ↑ indicates that higher values are better.

Table 9. Density-level comparison on Dense-Aqua.

Model	Low mAP@50:95	Medium mAP@50:95	Extreme mAP@50:95	Low MAE	Medium MAE	Extreme MAE	Low $R_{occ}$	Medium $R_{occ}$	Extreme $R_{occ}$
YOLOv8-m	72.8 ± 0.6	63.5 ± 0.9	54.2 ± 1.1	3.6 ± 0.2	7.8 ± 0.4	12.4 ± 0.6	69.4 ± 0.8	55.7 ± 1.2	45.1 ± 1.4
YOLOv10-m	73.6 ± 0.6	64.2 ± 0.8	55.1 ± 1.0	3.4 ± 0.2	7.4 ± 0.4	11.8 ± 0.6	70.8 ± 0.8	57.2 ± 1.1	46.8 ± 1.3
YOLOv11-m	75.1 ± 0.5	66.7 ± 0.8	56.8 ± 1.0	3.1 ± 0.2	6.8 ± 0.3	10.2 ± 0.5	72.6 ± 0.7	59.5 ± 1.1	49.3 ± 1.2
YOLOv13-m	76.2 ± 0.5	68.1 ± 0.7	58.7 ± 0.9	2.8 ± 0.1	5.9 ± 0.3	8.4 ± 0.4	74.5 ± 0.7	62.3 ± 1.0	53.6 ± 1.1
RT-DETR-l	76.4 ± 0.5	68.9 ± 0.7	58.3 ± 0.9	2.9 ± 0.2	5.3 ± 0.3	7.2 ± 0.4	76.3 ± 0.6	65.4 ± 0.9	58.4 ± 1.1
CrowdDet	73.2 ± 0.6	65.1 ± 0.8	56.5 ± 1.0	3.0 ± 0.2	5.9 ± 0.3	8.5 ± 0.5	74.1 ± 0.7	62.8 ± 1.0	55.2 ± 1.2
DenseFish-v13	78.3 ± 0.4	72.4 ± 0.6	64.2 ± 0.7	2.4 ± 0.1	3.8 ± 0.2	4.1 ± 0.2	81.5 ± 0.5	74.2 ± 0.7	68.7 ± 0.9

Table 10. Disaggregated performance analysis on source datasets.

Source Dataset	Environment	mAP@50:95 (%)	Counting MAE (↓)	Precision (%)	Recall (%)
Pond-Aqua	Turbid Pond	62.3 ± 0.8	4.2 ± 0.3	86.4	82.5
Salmon-Aqua	Sea-cage	72.1 ± 0.6	1.8 ± 0.1	93.7	91.2
Combined (Dense-Aqua)	Full Test Set	64.8 ± 0.7	3.7 ± 0.2	88.9	85.6

Note: ↓ indicates that lower values are better.

Table 11. Robustness comparison under synthetic bubble perturbation.

Model	Original mAP@50:95 (%)	Synthetic Bubble mAP@50:95 (%)	Drop (Δ, percentage points)
YOLOv13-m	60.1	51.2	−8.9
DenseFish-v13	64.8	63.5	−1.3

Table 12. Performance comparison under different levels of synthetic bubble perturbation.

Model	Clear mAP	Low mAP	Medium mAP	Strong mAP	Original MAE	Strong MAE	Original $R_{occ}$	Strong $R_{occ}$	ΔmAP	ΔMAE
YOLOv13-m	60.1	57.3	54.4	51.2	6.4	11.6	55.1	43.8	−8.9	+5.2
RT-DETR-l	60.8	59.2	56.5	54.3	5.0	8.1	62.7	54.9	−6.5	+3.1
CrowdDet	58.9	57.1	54.6	52.4	5.7	9.4	59.8	51.6	−6.5	+3.7
DenseFish-v13	64.8	64.4	64.0	63.5	3.9	4.3	69.4	67.8	−1.3	+0.4
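
The ΔmAP and ΔMAE columns of Table 12 are plain differences between the strong-perturbation and unperturbed values. As a quick check, the sketch below recomputes them from the transcribed table entries; the dictionary layout and the helper name `degradation` are illustrative assumptions.

```python
# Values transcribed from Table 12:
# (clear mAP, strong mAP, original MAE, strong MAE)
table12 = {
    "YOLOv13-m": (60.1, 51.2, 6.4, 11.6),
    "RT-DETR-l": (60.8, 54.3, 5.0, 8.1),
    "CrowdDet": (58.9, 52.4, 5.7, 9.4),
    "DenseFish-v13": (64.8, 63.5, 3.9, 4.3),
}


def degradation(clear_map, strong_map, orig_mae, strong_mae):
    """Return (delta mAP, delta MAE), each rounded to one decimal place."""
    return round(strong_map - clear_map, 1), round(strong_mae - orig_mae, 1)


deltas = {model: degradation(*values) for model, values in table12.items()}

for model, (d_map, d_mae) in deltas.items():
    print(f"{model:15s} dmAP = {d_map:+.1f}  dMAE = {d_mae:+.1f}")
```

A negative ΔmAP and a positive ΔMAE both indicate degradation, so values closer to zero in both columns mean a more robust model.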

Table 13. Comparison of different noise suppression strategies under strong synthetic bubble perturbation.

Filtering Strategy	mAP@50:95 (%)	Counting MAE ↓	Precision (%)	Recall (%)
No filtering	55.8	8.9	80.6	76.4
Gaussian filtering	53.4	9.5	78.9	74.7
Fixed low-pass filtering	54.1	9.1	79.5	75.2
Fixed band-pass filtering	57.3	7.6	82.1	78.5
B-HFG	63.5	4.3	87.8	84.2

Note: ↓ indicates that lower values are better.

Table 14. Performance comparison under different density/occlusion levels.

Model	Low Density mAP@50:95 (%)	Medium Density mAP@50:95 (%)	High Density mAP@50:95 (%)	Extreme Density mAP@50:95 (%)	ΔmAP (Low → Extreme)	$R_{occ}$ (Low → Extreme)
YOLOv13-m	75.1 ± 0.5	66.7 ± 0.8	60.3 ± 0.9	56.8 ± 1.0	−18.3	72.6 → 49.3
RT-DETR-l	76.4 ± 0.5	68.9 ± 0.7	62.0 ± 0.8	58.3 ± 0.9	−18.1	76.3 → 58.4
CrowdDet	73.2 ± 0.6	65.1 ± 0.8	60.8 ± 0.9	56.5 ± 1.0	−16.7	74.1 → 55.2
DenseFish-v13	78.3 ± 0.4	72.4 ± 0.6	68.1 ± 0.6	64.2 ± 0.7	−14.1	81.5 → 68.7

Table 15. Quantitative results of behavior classification.

Behavior State	Precision (%)	Recall (%)	F1-Score (%)
Normal	90.8	92.4	91.6
Feeding	88.7	87.3	88.0
Hypoxia-related Floating	86.5	84.9	85.7
Macro average (Ours)	88.7	88.2	88.4
SlowFast [47] (SOTA)	-	-	82.1
FishBehavior-Net [36] (SOTA)	-	-	85.2

Overall classification accuracy: 89.2%.
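
The F1 and macro-average entries in Table 15 follow from the per-class precision and recall. The sketch below recomputes them with the usual definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over classes); the variable and function names are illustrative.

```python
# Per-class (precision %, recall %) transcribed from Table 15.
per_class = {
    "Normal": (90.8, 92.4),
    "Feeding": (88.7, 87.3),
    "Hypoxia-related Floating": (86.5, 84.9),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


raw_f1 = {name: f1(p, r) for name, (p, r) in per_class.items()}
f1_scores = {name: round(value, 1) for name, value in raw_f1.items()}

n = len(per_class)
macro_precision = round(sum(p for p, _ in per_class.values()) / n, 1)
macro_recall = round(sum(r for _, r in per_class.values()) / n, 1)
macro_f1 = round(sum(raw_f1.values()) / n, 1)  # mean of unrounded per-class F1

print(f1_scores)
print(macro_precision, macro_recall, macro_f1)
```

Because the macro average weights the three states equally, the lower scores on hypoxia-related floating pull the macro F1 below the score of the Normal class.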

Table 16. Edge deployment performance comparison on Jetson Orin NX.

Model	Params (M)	FLOPs (G)	Peak Memory (GB)	FPS (Orin NX)	mAP@50:95 (%)
YOLOv8-m	25.9	78.9	1.56	135	54.2
YOLOv10-m	16.5	64.0	1.32	142	55.1
YOLOv13-m	20.1	68.4	1.42	130	56.8
RT-DETR-l	32.0	108.3	2.85	74	58.3
CrowdDet	28.7	94.6	2.21	98	56.5
DenseFish-v13	22.0	72.6	1.62	125	64.2
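
One way to read Table 16 is as a speed-accuracy trade-off. The sketch below, which is an illustrative analysis and not part of the reported experiments, finds the models that are not dominated in both Orin NX throughput and mAP@50:95; the helper name `dominates` is hypothetical.

```python
# (FPS on Orin NX, mAP@50:95 %) transcribed from Table 16.
models = {
    "YOLOv8-m": (135, 54.2),
    "YOLOv10-m": (142, 55.1),
    "YOLOv13-m": (130, 56.8),
    "RT-DETR-l": (74, 58.3),
    "CrowdDet": (98, 56.5),
    "DenseFish-v13": (125, 64.2),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


# A model is on the Pareto front if no other model dominates it.
pareto_front = [
    name
    for name, point in models.items()
    if not any(dominates(other, point) for other in models.values())
]

print(pareto_front)
```

Under this criterion, RT-DETR-l and CrowdDet are both slower and less accurate than DenseFish-v13, while YOLOv10-m and YOLOv13-m remain on the front only because of their higher frame rates.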

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, Y.; Wu, J.; Sun, M.; Ma, Y.; Li, Z.; Ma, Z.; Xiong, Y.; Wang, Y.; Guo, X.; Huang, S. DenseFish-v13: A Symmetry-Aware NMS-Free YOLOv13-Mamba Framework for Dense Underwater Fish Detection and Bio-Kinematic Behavior Recognition. Symmetry 2026, 18, 1084. https://doi.org/10.3390/sym18071084

