Article

A Multimodal Vision-Based Fish Environment and Growth Monitoring in an Aquaculture Cage

1 Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200092, China
2 East China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200090, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(9), 1700; https://doi.org/10.3390/jmse13091700
Submission received: 28 July 2025 / Revised: 23 August 2025 / Accepted: 28 August 2025 / Published: 3 September 2025

Abstract

Fish condition detection, including the identification of feeding desire, biological attachments, fence breaches, and dead fish, has become an important research frontier in fishery aquaculture. However, perception under harsh underwater conditions remains unsatisfactory and is still a challenging problem. First, we develop a multimodal dataset based on neuromorphic vision (NeuroVI) and RGB images, encompassing challenging fishery aquaculture scenarios. Within this dataset, a spiking neural network (SNN) method is designed to filter NeuroVI images, and SIFT feature points are leveraged to select the optimal image. Next, we propose a dual-image cross-attention learning network that achieves scene segmentation in a fishery aquaculture cage. This network comprises dual-channel feature extraction and guided attention learning modules. In detail, the feature matrix of the NeuroVI image serves as the query matrix for the RGB image, generating attention for the computation with the key and value matrices. Then, to alleviate the computational burden of the dual-channel network, we replace dot-product multiplication with element-wise multiplication, thereby reducing the computational load among the different matrices. Finally, experimental results from a fishery cage demonstrate that the proposed method achieves state-of-the-art segmentation performance for the management of fishery aquaculture.

1. Introduction

In fishery aquaculture management, the detection of feeding desire, biological attachments, fence breaches, and dead fish plays a critical role [1]. Current research primarily focuses on segmentation in clear-water conditions. With recent advances in deep learning, the segmentation performance for net-cage cables has improved significantly. However, harsh underwater conditions such as resuspended sediments and dynamic obstacles severely degrade perceptual accuracy during fishery aquaculture operations (Figure 1) [2]. Our experiments indicate that conventional deep neural networks exhibit subpar segmentation performance under the conditions that occur in these cages. To address this research gap, our study aims to overcome the challenges posed by low-visibility cage environments.
In research on offshore cage aquaculture, computer vision is increasingly being used for biofouling detection, net integrity monitoring, dead fish identification, and feeding behavior analysis [3]. While advanced models (e.g., Vision Transformer-based UISFormer for fouling segmentation, YOLOv8 variants for net tear detection, and RL-enhanced systems for appetite tracking) show promise, severe underwater visual degradation remains a primary challenge [4]. Key obstacles include color distortion due to wavelength absorption, low contrast from turbid water and light scattering, dynamic occlusions by fish shoals, and photon scarcity in deep cages [5]. These conditions blur features, hinder object detection, and amplify false negatives. Mitigation strategies such as multi-spectral histogram matching (HMFD_YOLOv5) and acoustic–optical fusion are emerging, yet data scarcity and domain gaps persist [6]. Hence, achieving accurate image segmentation under low-visibility conditions is exceedingly difficult. Given the adverse effects of inclement conditions on cameras and sensors, ensuring the robustness of vision models in fishery aquaculture becomes critically important.
To address the aforementioned challenges in fishery aquaculture, we propose a dual-image cross-attention network that combines two types of images to improve detection accuracy. To achieve this, we first leverage the spiking neural network (SNN) method along with SIFT feature points to extract the best NeuroVI images from the NeuroVI camera, thereby creating a dual-image dataset for fishery aquaculture. Then, a dual-channel transformer network is designed as the feature extraction backbone, where the NeuroVI images provide spatial attention for the RGB images in the fishery cage. Finally, a lightweight multiplication mechanism is integrated to alleviate the computational burden. The relevant relationships are illustrated in Figure 2, and our contributions are as follows:
(1)
By utilizing the NeuroVI and RGB images (Figure 2a), we introduce a novel approach to segmentation in fishery aquaculture. Notably, we have developed a unique low-visibility dataset that encompasses cage management scenarios such as feeding desire, biofouling attachments, fence breach, and dead fish detection.
(2)
A dual-image cross-attention network is designed for the fishery aquaculture cage (Figure 2b). The features of the NeuroVI and RGB images are extracted simultaneously, and the NeuroVI images are leveraged to provide spatial attention for the RGB images, improving the learning efficiency of the network in fishery aquaculture.
(3)
An element-wise multiplication mechanism is designed to replace dot-product multiplication, thereby reducing the computational latency caused by the dual-channel network. Ultimately, we achieve state-of-the-art detection performance in fishery aquaculture, with an 8% improvement in mean average precision (mAP).

2. Related Work

2.1. The Fishery Aquaculture Dataset in Underwater Conditions

Over the last decade, significant efforts have been made to develop original datasets for validating robotic perception systems [7,8]. However, there remains a notable lack of datasets that encompass rare and challenging conditions such as feeding desire, biofouling attachments, fence breaches, and dead fish [9]. To this end, some researchers have recently developed image datasets under low-visibility conditions [10]. For instance, datasets like URPC (Underwater Robot Picking Contest) [11] and SUIM (Semantic Underwater Imagery Dataset) [12] provide annotated images for object detection and segmentation in underwater scenarios, though they still lack comprehensive coverage of rare events. Additionally, synthetic data generation techniques, such as those employing generative adversarial networks (GANs) [13] or physics-based simulations [14], have been explored to augment limited real-world datasets. These approaches help to model rare conditions like biofouling growth or irregular fish behavior, but their generalization to real-world deployments remains an ongoing challenge [15]. Future efforts should focus on collaborative data collection initiatives, leveraging underwater drones and crowd-sourced contributions from aquaculture farms to build more diverse and representative datasets [16]. However, a limitation of these datasets is that they are often collected from a single scene and lack NeuroVI images.
As a newly developed technology, NeuroVI cameras face a significant shortage of datasets, which impedes the full advancement of this sensor. Liu et al. recorded a segment of pedestrian behavior to capture the body’s movement center [17]. To identify moving objects and improve positional calculations in dynamic scenes, Liu et al. developed a NeuroVI dataset to recognize and remove moving objects [18]. Although several datasets are currently available, there remains a scarcity of datasets specifically focused on feeding desire, biofouling attachments, fence breaches, and dead fish in fishery aquaculture.

2.2. Fishery Aquaculture Cage’s Segmentation Methods

By exploiting the powerful feature extraction capability of Convolutional Neural Networks (CNNs), deep learning-based segmentation methods have achieved encouraging results in conventional daytime scenes [19]. The Fully Convolutional Network (FCN) is considered a milestone, demonstrating segmentation capability with an end-to-end output [20]. Chen et al. introduced FPN convolution and UNET++ into the segmentation network [21,22]. Furthermore, to address the challenge of objects with varying scales, a scale-adaptive network was proposed in [23]. Recently, Deng et al. combined the Swin-transformer and deconvolution to achieve nighttime segmentation [24,25]. However, the above methods exhibit poor robustness under underwater illumination.
To address the issue of unstable illumination, some researchers have explored image enhancement methods, aiming to improve images’ brightness and appearance. Based on the image’s bright parts, Wang et al. calculated the enhancement parameters for the dark portions of images [26]. Ge and Ouyang proposed a deep learning model that enhances underwater image segmentation by synergizing global and intricate features through alternating dual iterations [27]. DAPNet enhances underwater images via attention mechanisms and adaptive normalization, excelling in enhancement and dehazing tasks [28]. Transformers, combined with convolutional networks, have also been applied to underwater image segmentation, demonstrating superior performance over traditional CNNs [29]. Romera et al. proposed an edge detection algorithm that used the average of gradients to calculate the threshold for highlighting pixel values [30]. Nevertheless, it should be noted that the aforementioned image enhancement methods largely rely on the clarity of the image’s visible parts.

2.3. Sensors’ Fusion

To address the limitations of RGB cameras in underwater environments, researchers have explored multimodal sensor integration, including depth cameras, sonar, and hyperspectral imaging [31]. For instance, Chen et al. utilized the depth information from LiDAR for image pre-segmentation and subsequently re-segmented the images [32]. Guan et al. proposed an acoustic–optical fusion framework for pipeline inspection, though its applicability is constrained in dynamically changing environments such as aquaculture cages [6]. More recently, deep learning-based fusion strategies, such as feature-level concatenation and cross-modal attention, have shown improved performance in underwater segmentation tasks [33]. However, optimal fusion architectures for aquaculture applications remain understudied, particularly in real-time deployment scenarios. In summary, existing methods mainly rely on cameras, LiDAR, and sonar, but it should be noted that laser light, as used by LiDAR, is rapidly attenuated in water. To bridge this gap, our work presents a novel approach that combines RGB and NeuroVI images, resulting in a robust solution for challenging low-visibility scenarios.

3. Materials and Methods

3.1. Fishery Aquaculture Cage Scene Dataset with NeuroVI and RGB Images

In well-lit environments, RGB images have demonstrated excellent performance in segmentation tasks. However, their effectiveness significantly decreases in challenging visibility conditions, particularly in low-light scenarios where critical visual features become indistinct. While RGB imagery offers comprehensive visual data, its quality deteriorates substantially under poor illumination. Conversely, event cameras excel at capturing object outlines but fail to provide sufficient texture details. To leverage the synergistic benefits of both modalities, we introduce a groundbreaking NeuroVI-RGB multimodal dataset specifically designed for low-visibility environments. This dataset serves as an essential tool for overcoming segmentation difficulties in dimly lit conditions.
Event cameras produce noise event points in fishery aquaculture cages. Moreover, the clarity of NeuroVI images varies with the motion of the camera platform. To overcome these challenges, we propose a two-stage filtering method that performs noise reduction and then selects the best NeuroVI image.
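As a reference, the following minimal sketch illustrates the second filtering stage under the assumption that the NeuroVI events have already been denoised (e.g., by the SNN filter) and accumulated into 8-bit grayscale frames; the OpenCV-based SIFT detector and the function name are illustrative choices rather than the exact implementation used here.

```python
import cv2

def select_best_neurovi_frame(event_frames):
    """Return the accumulated event frame with the most SIFT feature points.

    `event_frames` is assumed to be a list of 8-bit grayscale images obtained
    by accumulating (already denoised) NeuroVI events over short time windows.
    """
    sift = cv2.SIFT_create()
    best_frame, best_count = None, -1
    for frame in event_frames:
        keypoints = sift.detect(frame, None)  # detect SIFT feature points
        if len(keypoints) > best_count:
            best_frame, best_count = frame, len(keypoints)
    return best_frame, best_count
```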

3.2. Cross-Attention Network Based on Spatial Feature Attention

The performance of segmentation algorithms tends to be suboptimal in low-visibility conditions. This study focuses on addressing this issue by utilizing a NeuroVI camera to capture the attention of important features in low-visibility images. To enhance the semantic segmentation performance of these images, a cross-attention network is devised.
RGB images contain rich information. However, in low-visibility scenes, the image may lack clarity, especially for critical features. Event cameras have the advantage of displaying object contours, but lack detailed information. Therefore, to harness complementary advantages and enhance detection accuracy, a cross-attention network is designed.
The cross-attention learning network for dual images consists of two main components: feature extraction and guided attention learning modules. The feature extraction module comprises two feature extraction channels, enabling features to be extracted from both the NeuroVI and RGB images. On the other hand, the guided attention learning module is responsible for generating image attention maps from the NeuroVI camera at different scales.
Figure 3a illustrates the overall architecture of the cross-attention network, which adopts a hierarchical construction approach similar to that of convolutional neural networks. For both the NeuroVI and RGB channels, the image undergoes an initial convolutional operation for image preprocessing. Subsequently, four similar feature extraction modules are sequentially applied. Each of these modules comprises two components: a patch merging block for size adjustment and a Swin-transformer block for feature computation.
It is assumed that the input size for patch merging is C × C × N. The patch merging operation consolidates every 2 × 2 adjacent pixels into a patch. The pixels within each patch are then concatenated along the channel dimension, resulting in an output of C/2 × C/2 × 4N. Finally, a fully connected layer reduces the depth of the feature map from 4N to 2N. In summary, the image undergoes down-sampling at 4×, 8×, and 16× scales, and each down-sampling operation doubles the number of channels.
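For illustration, a minimal PyTorch sketch of one patch-merging step is given below, assuming a channel-last feature layout and even spatial sizes; the class and layer names are illustrative.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 pixel neighborhood and halve the spatial size.

    Input:  (B, H, W, N) feature map; Output: (B, H/2, W/2, 2N),
    matching the rule that each down-sampling doubles the channel depth.
    """
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # gather the four pixels of every 2x2 patch along the channel axis
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4N)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2N)
```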

3.3. Guided Attention Learning Module

The integration of attention mechanisms in computer vision tasks tries to emulate human visual perception, with the goal of identifying prominent regions within complex scenes. By mimicking the human visual selection process, attention mechanisms optimize the calculation weights of interesting zones while disregarding irrelevant details. This enables the model to concentrate on pertinent features, enhancing its performance in understanding and processing visual information.
The guided attention module (Figure 3b) is designed to enhance feature extraction efficiency and reduce the information loss caused by down-sampling operations. This module leverages cross-attention associations between the A1 and A2 channels. Specifically, A1 generates the key and value matrices (K and V), while A2 generates the query matrix (Q). By utilizing the query matrix from A2, relevant features are highlighted in the A1 matrix. This attention mechanism is applied at each feature extraction scale of the backbone network, enabling noise removal and the extraction of valuable information from the RGB images. Through the training and learning process, the network learns to prioritize interesting regions in the images, leading to better segmentation performance.
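The sketch below illustrates this guided cross-attention pattern in PyTorch, with the NeuroVI channel supplying the query and the RGB channel supplying the key and value. Standard dot-product multi-head attention is used here as a stand-in; the lightweight element-wise variant of Section 3.4 can be substituted for it. Tensor and layer names are assumptions.

```python
import torch.nn as nn

class GuidedCrossAttention(nn.Module):
    """Cross-attention in which the NeuroVI channel (A2) supplies the query
    and the RGB channel (A1) supplies the key and value matrices.
    `dim` must be divisible by `heads`."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # query from NeuroVI features
        self.to_k = nn.Linear(dim, dim)   # key from RGB features
        self.to_v = nn.Linear(dim, dim)   # value from RGB features
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, neurovi_tokens):
        # both inputs: (batch, tokens, dim); same token count is assumed,
        # since both images share the same resolution at each scale
        q = self.to_q(neurovi_tokens)
        k = self.to_k(rgb_tokens)
        v = self.to_v(rgb_tokens)
        out, _ = self.attn(q, k, v)       # NeuroVI query highlights RGB features
        return out + rgb_tokens           # residual keeps the RGB pathway
```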

3.4. Element-Wise Multiplication to Release Calculation Burden

The conventional transformer architecture employs the Q, K, and V matrices to compute the final weights. It first performs a dot-product operation between the Q and K matrices, followed by another dot-product multiplication with the V matrix (as specified in Equation (1)) to calculate the final weights [25]. Each element of the intermediate matrix is obtained through a vector multiplication operation.
Specifically, the Q matrix (d × n) is multiplied by the transpose of the K matrix (d × n) to obtain an intermediate matrix (d × d), with a computational cost of d² × n. The intermediate matrix is then multiplied by the V matrix (d × n) to obtain a d × n matrix, again with a computational cost of d² × n. In summary, the computational complexity is directly proportional to both the square of the number of tokens (d²) and the feature dimension (n). However, the use of dual channels in the network amplifies the computational load, increasing the detection latency caused by the dot-product multiplications.
$\hat{x} = \mathrm{softmax}\left(\frac{Q \cdot K^{T}}{\sqrt{d}}\right) \cdot V, \qquad \text{Computation load} = 2 \times d^{2} \times n$  (1)
To mitigate computational complexity, the approach depicted in Figure 3c substitutes the traditional dot-product multiplication with element-wise multiplication. It is worth noting that the Q, K, and V matrices all belong to the real domain R^(d×n). To derive attention weights, the query matrix extracted from the NeuroVI images is first multiplied by a trainable parameter vector (w_n ∈ R^n), followed by a Rectified Linear Unit (ReLU) activation function. This process generates a global attention vector denoted as η_d (as specified in Equation (2)). Since detection targets are typically concentrated in the central regions of images, this operation efficiently optimizes the computation weights for those zones within the NeuroVI image.
$\eta_{d} = \frac{\exp\left(Q \cdot w_{n} / \sqrt{n}\right)}{\sum_{j=1}^{d} \exp\left(Q \cdot w_{n} / \sqrt{n}\right)_{j}}, \qquad \text{Computation load} = d \times n$  (2)
Next, the K-matrix undergoes element-wise multiplication with the global attention vector η_d, yielding a global query vector q ∈ R^n. This global vector q is then subjected to element-wise multiplication with the V matrix, resulting in global features that amalgamate information from both the Q-matrix and the K-matrix. In contrast to the previous dot-product calculation, the load of element-wise multiplication exhibits a linear relationship with the parameters (d × n), leading to reduced computational complexity. We then perform another transformation to activate the final information, as depicted in Equation (3), where T denotes the transformation operation.
$q = \sum_{i=1}^{d} \eta_{i} \times K_{i}, \qquad x = T\left(q \times V\right), \qquad \text{Computation load} = d \times n$  (3)
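A minimal PyTorch sketch of this element-wise (linear-cost) attention, following Equations (2) and (3), is shown below; the tensor shapes, the placement of the ReLU, and the form of the transformation T are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElementWiseAttention(nn.Module):
    """Linear-cost attention: a learned vector pools the query into a global
    attention vector, which then reweights the key and value element-wise."""
    def __init__(self, n_dim):
        super().__init__()
        self.w_n = nn.Parameter(torch.randn(n_dim))  # trainable vector w_n
        self.transform = nn.Linear(n_dim, n_dim)     # the transformation T

    def forward(self, Q, K, V):
        # Q, K, V: (batch, d tokens, n features)
        n = Q.shape[-1]
        scores = F.relu(Q @ self.w_n) / n ** 0.5      # (batch, d)
        eta = torch.softmax(scores, dim=-1)           # global attention vector, Eq. (2)
        q_global = (eta.unsqueeze(-1) * K).sum(dim=1)  # (batch, n), Eq. (3)
        x = self.transform(q_global.unsqueeze(1) * V)  # element-wise with V
        return x                                       # (batch, d, n), cost O(d * n)
```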

3.5. The Definition of Loss Function

The loss function for segmentation under low illumination combines several terms, including the classification loss, bounding box loss, and mask loss (Equation (4)).
Regarding the classification loss, it is assumed that there are N samples and K categories. y_{i,j} (0 or 1) denotes the ground-truth indicator that sample i belongs to category j, and ŷ_{i,j} (0 ≤ ŷ_{i,j} ≤ 1) denotes the predicted probability that sample i belongs to category j. The classification loss adopts a logarithmic loss function, as given in Equation (5), where N denotes the total number of samples and K denotes the number of categories.
The bounding box loss encompasses the position coordinates (x, y) and width–height parameters (w, h) of the bounding box. The regression parameters v* represent the predicted box, while v represents the regression parameters of the actual box. The bounding box loss is defined in Equation (6).
The mask loss is computed only for pixels with labels. As defined in Equation (7), each pixel value is evaluated by the binary cross-entropy loss function. Pi represents the actual label and Pi* represents the predicted label.
$L = L_{cls} + L_{box} + L_{mask}$  (4)
$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{i,j} \log \hat{y}_{i,j}$  (5)
$L_{box} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_{1}}\left(v_{i} - v_{i}^{*}\right), \qquad \mathrm{smooth}_{L_{1}}(x) = \begin{cases} 0.5 x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$  (6)
$L_{mask} = -\left[ P_{i}^{*} \log P_{i} + \left(1 - P_{i}^{*}\right) \log \left(1 - P_{i}\right) \right]$  (7)
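For clarity, the following sketch maps Equations (4)–(7) onto standard PyTorch loss primitives; the tensor shapes and the reductions over samples and pixels are illustrative assumptions rather than the exact implementation.

```python
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, box_pred, box_target, mask_logits, mask_targets):
    """Combined loss of Equation (4): classification + bounding box + mask.

    Illustrative shapes: cls_logits (N, K) with cls_targets (N,) class indices,
    box_pred/box_target (N, 4) as (x, y, w, h), mask_logits/mask_targets (N, H, W).
    """
    # Equation (5): logarithmic (cross-entropy) classification loss
    l_cls = F.cross_entropy(cls_logits, cls_targets)

    # Equation (6): smooth-L1 loss over the four box regression parameters
    l_box = F.smooth_l1_loss(box_pred, box_target, beta=1.0)

    # Equation (7): per-pixel binary cross-entropy on the mask
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)

    return l_cls + l_box + l_mask
```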

4. Results and Discussion in Fishery Aquaculture

In this section, we initially elaborate on the dataset’s recording process and the model’s training process. Then, with the trained model, we conduct segmentation experiments under various low-visibility conditions and compare the detection performance with other existing methods.

4.1. Dataset Collection and Training Settings

Due to the limited availability of low-visibility datasets captured simultaneously by NeuroVI and RGB cameras, we developed a comprehensive dual-image dataset that encompasses various scenarios, including feeding desire, biofouling attachments, fence breach, and dead fish detection. A DAVIS46 camera is used as the NeuroVI camera, as it can output both RGB and NeuroVI images.
Two experimental methods were used to collect the data. The first involved mounting a DAVIS event camera on a real underwater robot (Figure 4a), which was primarily used to collect data in conditions such as biofouling attachments and fence breaches. However, since the dataset covers multiple scenarios, it was difficult to capture a wide range of adverse conditions simultaneously. Therefore, the second method involved fixing the camera in front of a screen to record adverse conditions from internet footage (Figure 4b); videos covering more than ten cable scenes were recorded. The recorded videos could also be replayed at different speeds, simulating different movement speeds and generating event points of different thicknesses. In summary, both methods were based on real-world footage rather than synthetic generation. After collecting the dataset, we annotated it with the LabelMe tool (1.0) [34]. All sub-datasets include targets for detecting feeding desire, biofouling attachments, fence breaches, and dead fish. Each sub-dataset contains over 250 images, and the entire dataset contains a total of 1000 images.
The learning network was implemented with the PyTorch tool (2.0) using a dataset of 1000 images, split into 70% training, 15% validation, and 15% test sets. During the experiments, the training batch size was set to eight, and the network was trained for 30 epochs with data augmentation including random rotation (±15°), scaling (0.8–1.2×), stitching (mosaic augmentation), and NeuroVI-specific event noise injection (±5% sparse events). The Adam optimizer was used for training with hyperparameters β1 = 0.9, β2 = 0.999 and a weight decay of 1 × 10−4. The initial learning rate was set to 0.001, decaying by half at the 12th and 24th epochs. Training was conducted on a GPU computer (NVIDIA GeForce RTX 3080 Ti; Huawei, Shenzhen, China) with 16 GB of memory. Throughout the 30-epoch training process, the five best-performing models were retained, while others with lower accuracy were discarded.
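A minimal training-loop sketch reflecting these settings is shown below; `model` and `train_loader` are placeholders for the cross-attention network and the annotated NeuroVI-RGB dataset, and the loop assumes the model returns the combined loss of Equation (4) for each batch.

```python
import torch

def train(model, train_loader, epochs=30):
    """Training loop mirroring the settings in Section 4.1 (batch size 8 is
    assumed to be configured in `train_loader`)."""
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4
    )
    # halve the learning rate at the 12th and 24th epochs
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[12, 24], gamma=0.5
    )
    for epoch in range(epochs):
        for rgb, neurovi, targets in train_loader:
            optimizer.zero_grad()
            loss = model(rgb, neurovi, targets)  # combined loss of Eq. (4)
            loss.backward()
            optimizer.step()
        scheduler.step()
```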
Training accuracy and learning loss are crucial evaluation metrics for assessing model learning performance, as depicted in Figure 5a,b. Although the resuspended sediments scene dataset exhibited a slower training speed, satisfactory training accuracy was eventually achieved among all four scenarios. Furthermore, images from four scenarios were integrated into a single dataset for another training process. In this process, both single-channel and cross-attention networks were employed to train images, as illustrated in Figure 5c. The results demonstrate that the cross-attention network exhibits higher training accuracy and lower training loss, indicating the superiority of the cross-channel network.

4.2. Segmentation Demonstration Under Low Illumination in Fishery Aquaculture

In Figure 6, we present and compare the segmentation results achieved by the cross-attention network under various low-visibility conditions.
For each scenario, Figure 6 also illustrates the process features learned by the cross-attention network from RGB and NeuroVI images, respectively. Unlike the RGB images, backgrounds and objects in NeuroVI images can be clearly distinguished and marked with distinct prominent colors. Therefore, it can be concluded that NeuroVI images possess the capability to extract the attention zones within the image, and the attention zones are then assigned higher computational weights through the NeuroVI image’s query matrix. Through the multiplication computation among the Q, K, and V matrices, important features of RGB images are strengthened, while unimportant zones in the image are suppressed.
To provide a clearer qualitative and quantitative comparison, we calculate the segmentation performance scores. Figure 7 displays partial results for the biofouling attachment and fence breach scenes, where each scenario is processed by the single-channel and cross-attention networks, respectively. Since the dataset contains five categories, we select the TOP1 class with the highest score to calculate the classification accuracy. Furthermore, the classification with the TOP1 score is defined as the best prediction, which is then compared with the ground truth using the Intersection over Union (IOU) index.
The segmentation results demonstrate that the cross-attention method outperforms the single-image approach, yielding higher TOP1 class scores and larger IOU ratio values. Therefore, the cross-attention network exhibits superior learning attention, resulting in better segmentation results in low-visibility scenarios.

4.3. The Learning Performance Evaluation

In addition, we conduct further comparative evaluations. The evaluation metrics include the P-R curve and the F1 score, with their definitions provided in Equation (8). In Equation (8), TP denotes actual positive samples predicted as positive, FP denotes actual negative samples predicted as positive, FN denotes actual positive samples predicted as negative, and TN denotes actual negative samples predicted as negative.
$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (8)
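As a small worked example of Equation (8), the following snippet computes the three metrics from raw confusion counts (the counts are illustrative, not taken from our experiments).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the metrics of Equation (8) from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# example: 88 true positives, 12 false positives, 10 false negatives
print(precision_recall_f1(88, 12, 10))  # -> (0.88, ~0.898, ~0.889)
```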
According to the definition in Equation (8), we compared the learning performance of relevant indexes, as shown in Figure 8. For the four sub-datasets, the semantic segmentation effects were individually learned, yielding comparable performance. Additionally, the four sub-datasets were integrated, and the average learning performances were recorded. Based on the average performance curve, it is evident that the P-R curve of the single image’s network is entirely covered by that of the cross-attention network, indicating the latter’s superior performance.
Moreover, we can observe that the equilibrium point value (P = R) of the cross-attention network surpasses that of the single-channel network, indicating improved segmentation performance by the cross-attention network. Additionally, the cross-attention network exhibits a higher F1 score, leading us to conclude that the cross-attention network achieves state-of-the-art performance.

4.4. Comparisons with Other Methods

To validate the effectiveness of our proposed components, we conducted systematic experiments by incrementally adding each module to a baseline model. The baseline consists of a standard Swin-Transformer backbone processing only RGB images. All experiments were performed on our aquaculture dataset under identical training conditions.
Table 1 presents a component-wise ablation study analyzing the incremental performance improvements of the model modifications. The baseline model, utilizing only RGB inputs, achieves 72.1% mAP and 63.5% IOU with a latency of 1.8 ms. Incorporating SNN filtering yields a performance gain of +5.2% mAP and +4.3% IOU while introducing only a marginal latency increase of 0.1 ms. The addition of cross-attention mechanisms further enhances performance, improving mAP and IOU by +6.3% and +5.4%, respectively, with a modest latency overhead of 0.2 ms. Finally, element-wise multiplication optimization provides additional gains of +4.3% mAP and +5.0% IOU while reducing latency by 0.7 ms, resulting in a final performance of 87.9% mAP and 78.2% IOU at 1.4 ms latency. Each modification yields a consistent accuracy gain, and the element-wise multiplication additionally improves computational efficiency.
Table 2 presents the calculation latency for various Transformer-based architectures. The first-row method, using dot-product multiplication, achieved a detection accuracy of 75.1% with a latency of 1.3 ms. The second-row method replaced the dot-product multiplication with element-wise multiplication, resulting in a slightly lower detection accuracy of 74.8% but a reduced latency of 1.1 ms. The third-row method employed the cross-attention network, raising accuracy to 88.3% at the cost of a 1.6 ms computational latency. Finally, the fourth-row method combined the cross-attention network and element-wise multiplication, demonstrating commendable performance in both detection accuracy and speed. These comparative results highlight the advantages of our proposed model in terms of both accuracy and latency.
The box detection performance with different backbones is compared in Table 3, with Mask R-CNN used as the detection head. Our cross-attention network achieves 42.1 AP^box, surpassing AlexNet (↑9.7), VGGNet (↑8.6), UNET++ (↑5.9), and ResNet50 (↑4.1). Following the literature [25], we also include the Swin-Transformer for comparison; the proposed cross-attention network outperforms it by 1.8 AP^box. In terms of instance segmentation, our method achieves 38.2 AP^mask, better than the previous algorithms, outperforming AlexNet (↑9.4), VGGNet (↑8.2), UNET++ (↑4.7), and ResNet50 (↑3.8).
The improvement in detection and segmentation performance demonstrates the effectiveness of the proposed backbone and supports reliable perception under adverse underwater conditions.
Prior research in underwater image segmentation has primarily relied on unimodal approaches utilizing either RGB data or sonar imaging alone. Such approaches exhibit limitations in addressing challenging underwater conditions such as suspended sediments and light scattering. To overcome these constraints, we propose NeuroVI-RGB Fusion, a novel dual-modal framework that effectively combines the complementary strengths of both visual modalities. Our architecture incorporates an innovative cross-attention mechanism that distinguishes itself from conventional Transformer-based approaches (e.g., Swin-Transformer) through its guided attention module, which dynamically enhances salient features from NeuroVI to optimize RGB segmentation performance. Furthermore, the proposed design demonstrates superior computational efficiency, achieving faster inference times (1.4 ms) compared to standard dot-product transformer implementations (1.6 ms), while maintaining competitive segmentation accuracy through its lightweight architecture.

5. Conclusions

Underwater conditions pose significant challenges to fishery aquaculture cage tasks, as they degrade sensor and camera performance. This study proposes a dual-image cross-attention network, which improves cable segmentation performance under low illumination. First, we have pioneered the development of four low-light datasets focused on feeding desire, biofouling attachments, fence breach, and dead fish. Secondly, we have designed a dual-channel feature extraction network, in which NeuroVI images provide cross-attention for RGB images. Finally, to reduce detection latency, an element-wise multiplication mechanism is designed to accelerate computations among the different matrices. The final qualitative and quantitative comparisons demonstrate that our network achieves state-of-the-art detection performance.
The current approach has several limitations, including dependence on event camera availability, computational overhead for real-time deployment, and challenges with regard to generalizing to diverse aquatic environments, compounded by high GPU memory usage during training (16 GB for 1000 images). To address these issues, future work will focus on developing lightweight SNN architectures to reduce computational demands, integrating acoustic sensors for deeper water applications where event cameras are less effective, and expanding the dataset to include more species and varied environmental conditions to improve generalization.

Author Contributions

Conceptualization, Z.X.; methodology, X.L.; software, F.M.; validation, X.L. and F.M.; data curation, X.L. and F.M.; writing—original draft preparation, X.L., F.M. and Z.X.; writing—review and editing, X.L., F.M. and Z.X.; project administration, Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the Central Public-interest Scientific Institution Basal Research Fund, CAFS (NO. 2023TD87), in part by the Shanghai Pujiang Program (NO. 24PJA160), and in part by the Central Public-interest Scientific Institution Basal Research Fund, ECSFR, CAFS (NO. 2025YJS201).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank the Fishery Machinery and Instrument Research Institute for helpful discussions and assistance; Xiangyong Liu and Zhiqiang Xu are greatly appreciated. All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huan, X.; Shan, J.; Han, L.; Song, H. Research on the efficacy and effect assessment of deep-sea aquaculture policies in China: Quantitative analysis of policy texts based on the period 2004–2022. Mar. Policy 2024, 160, 105963. [Google Scholar] [CrossRef]
  2. Sun, X.; Hu, L.; Fan, D.; Wang, H.; Yang, Z.; Guo, Z. Sediment Resuspension Accelerates the Recycling of Terrestrial Organic Carbon at a Large River-Coastal Ocean Interface. Glob. Biogeochem. Cycles 2024, 38, e2024GB008213. [Google Scholar] [CrossRef]
  3. Xiao, Y.; Huang, L.; Zhang, S.; Bi, C.; You, X.; He, S.; Guan, J. Feeding behavior quantification and recognition for intelligent fish farming application: A review. Appl. Anim. Behav. Sci. 2025, 285, 106588. [Google Scholar] [CrossRef]
  4. López-Barajas, S.; Sanz, P.J.; Marín-Prades, R.; Gómez-Espinosa, A.; González-García, J.; Echagüe, J. Inspection operations and hole detection in fish net cages through a hybrid underwater intervention system using deep learning techniques. J. Mar. Sci. Eng. 2024, 12, 80. [Google Scholar] [CrossRef]
  5. Zhu, G.; Li, M.; Hu, J.; Xu, L.; Sun, J.; Li, D.; Dong, C.; Huang, X.; Hu, Y. An Experimental Study on Estimating the Quantity of Fish in Cages Based on Image Sonar. J. Mar. Sci. Eng. 2024, 12, 1047. [Google Scholar] [CrossRef]
  6. Guan, M.; Cheng, Y.; Li, Q.; Wang, C.; Fang, X.; Yu, J. An Effective Method for Submarine Buried Pipeline Detection via Multi-sensor Data Fusion. IEEE Access 2019, 7, 125300–125309. [Google Scholar] [CrossRef]
  7. Li, D.; Du, L. Recent advances of deep learning algorithms for aquacultural machine vision systems with emphasis on fish. Artif. Intell. Rev. 2022, 55, 4077–4116. [Google Scholar] [CrossRef]
  8. Kong, H.; Wu, J.; Liang, X.; Xie, Y.; Qu, B.; Yu, H. Conceptual validation of high-precision fish feeding behavior recognition using semantic segmentation and real-time temporal variance analysis for aquaculture. Biomimetics 2024, 9, 730. [Google Scholar] [CrossRef]
  9. Gao, T.; Jin, J.; Xu, X. Study on detection image processing method of offshore cage. J. Phys. Conf. Ser. 2021, 1769, 012070. [Google Scholar] [CrossRef]
  10. Zhou, C.; Lin, K.; Xu, D.; Liu, J.; Zhang, S.; Sun, C.; Yang, X. Method for segmentation of overlapping fish images in aquaculture. Int. J. Agric. Biol. Eng. 2019, 12, 135–142. [Google Scholar] [CrossRef]
  11. Hu, Z.; Cheng, L.; Yu, S.; Xu, P.; Zhang, P.; Tian, R.; Han, J. Underwater Target Detection with High Accuracy and Speed Based on YOLOv10. J. Mar. Sci. Eng. 2025, 13, 135. [Google Scholar] [CrossRef]
  12. Islam, M.J.; Edge, C.; Xiao, Y.; Luo, P.; Mehtaz, M.; Morse, C.; Enan, S.S.; Sattar, J. Semantic Segmentation of Underwater Imagery: Dataset and Benchmark. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 1769–1776. [Google Scholar]
  13. Yang, D.; Wang, C.; Cheng, C.; Pan, G.; Zhang, F. Data Generation with GAN Networks for Sidescan Sonar in Semantic Segmentation Applications. J. Mar. Sci. Eng. 2023, 11, 1792. [Google Scholar] [CrossRef]
  14. Liu, W.; Bai, K.; He, X.; Song, S.; Zheng, C.; Liu, X. FishGym: A High-Performance Physics-based Simulation Framework for Underwater Robot Learning. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 6268–6275. [Google Scholar]
  15. Marnet, L.R.; Grasshof, S.; Brodskiy, Y.; Wąsowski, A. Bridging the Sim-to-Real GAP for Underwater Image Segmentation. In Proceedings of the OCEANS 2024—Singapore, Singapore, 15–18 April 2024; pp. 1–6. [Google Scholar]
  16. Contini, M.; Illien, V.; Barde, J.; Poulain, S.; Bernard, S.; Joly, A.; Bonhommeau, S. From underwater to drone: A novel multi-scale knowledge distillation approach for coral reef monitoring. Ecol. Inform. 2025, 89, 103149. [Google Scholar] [CrossRef]
  17. Liu, X.Y.; Chen, G.; Sun, X.; Knoll, A. Ground Moving Vehicle Detection and Movement Tracking Based on the Neuromorphic Vision Sensor. IEEE Internet Things J. 2020, 7, 9026–9039. [Google Scholar] [CrossRef]
  18. Liu, X.Y.; Yang, Z.X.; Hou, J.; Huang, W. Dynamic Scene’s Laser Localization by NeuroIV-based Moving Objects Detection and LIDAR Points Evaluation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5230414. [Google Scholar] [CrossRef]
  19. Sun, X.; Chen, C.; Wang, X.; Dong, J.; Zhou, H.; Chen, S. Gaussian dynamic convolution for efficient single-image segmentation. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2937–2948. [Google Scholar] [CrossRef]
  20. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  21. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  22. Peng, D.; Zhang, Y.; Guan, H. End-to-End Change Detection for High Resolution Satellite Images Using Improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
  23. Huang, Z.; Wang, C.; Wang, X.; Liu, W.; Wang, J. Semantic image segmentation by scale-adaptive networks. IEEE Trans. Image Process. 2020, 29, 2066–2077. [Google Scholar] [CrossRef] [PubMed]
  24. Deng, X.; Wang, P.; Lian, X.; Newsam, S. Nightlab: A duallevel architecture with hardness detection for segmentation at night. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 938–948. [Google Scholar]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  26. Wang, W.; Chen, Z.; Yuan, X.; Guan, F. An adaptive weak light image enhancement method. In Proceedings of the Twelfth International Conference on Signal Processing Systems 2020, Shanghai, China, 6–9 November 2020. [Google Scholar]
  27. Ge, H.; Ouyang, J. Underwater image segmentation via the progressive network of dual iterative complement enhancement. Expert Syst. Appl. 2025, 266, 126049. [Google Scholar] [CrossRef]
  28. Li, X.; Yu, R.; Zhang, W.; Lu, H.; Zhao, W.; Hou, G.; Liang, Z. DAPNet: Dual Attention Probabilistic Network for Underwater Image Enhancement. IEEE J. Ocean. Eng. 2025, 50, 178–191. [Google Scholar] [CrossRef]
  29. Jiang, J.; Xu, H.; Xu, X.; Cui, Y.; Wu, J. Transformer-Based Fused Attention Combined with CNNs for Image Classification. Neural Process. Lett. 2023, 55, 11905–11919. [Google Scholar] [CrossRef]
  30. Romera, E.; Bergasa, L.M.; Yang, K.; Alvarez, J.M.; Barea, R. Bridging the day and night domain gap for semantic segmentation. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium, Paris, France, 9–12 June 2019; pp. 1312–1318. [Google Scholar]
  31. Domhof, J.; Kooij, J.; Gavrila, D.M. A Joint Extrinsic Calibration Tool for Radar, Camera and Lidar. IEEE Trans. Intell. Veh. 2021, 6, 571–582. [Google Scholar] [CrossRef]
  32. Chen, H.; Xu, F.; Liu, W.; Sun, D.; Liu, P.X.; Menhas, M.I.; Ahmad, B. 3D Reconstruction of Unstructured Objects Using Information from Multiple Sensors. IEEE Sens. J. 2021, 21, 26951–26963. [Google Scholar] [CrossRef]
  33. Roy, S.M.; Beg, M.M.; Bhagat, S.K.; Charan, D.; Pareek, C.M.; Moulick, S.; Kim, T. Application of artificial intelligence in aquaculture—Recent developments and prospects. Aquac. Eng. 2025, 111, 102570. [Google Scholar] [CrossRef]
  34. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
  35. Wan, S.; Liang, Y.; Zhang, Y. Deep convolutional neural networks for diabetic retinopathy detection by image classification. Comput. Electr. Eng. 2018, 72, 274–282. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Figure 1. Fishery aquaculture management system based on fish condition monitoring under harsh underwater illumination. (a) Feeding desire detection. (b) Biofouling attachments detection. (c) Structural breach detection. (d) Dead fish detection.
Figure 2. Fishery aquaculture safety perception in a harsh environment. (a) Comparison of the NeuroVI and RGB images; (b) framework diagram of the technical roadmap.
Figure 3. The guided attention learning module and element-wise multiplication operation. (a) The dual-image cross-attention learning network; (b) the guided attention learning module; (c) the element-wise multiplication operation.
Figure 4. Dataset collection methods. (a) The real collection method; (b) the video playback method.
Figure 5. The training process with different low-visibility datasets. (a) The learning accuracy with four sub-datasets, separately; (b) the training loss with four sub-datasets, separately; (c) the training processes involving single-channel and cross-attention networks are compared with the integrated dataset.
Figure 6. The segmentation demonstration including two kinds of images, process features, and segmentation comparison. (a) RGB image; (b) NeuroVI image; (c) results by cross-attention.
Figure 7. The segmentation of scenes is evaluated with the TOP1 and IOU indexes. (a,b) The left image corresponds to the single channel method, and the right image corresponds to the cross-attention method. Purple and red are the colors of the identified objects.
Figure 8. Comparison of detection with P-R curves and F1 score indexes. (a) The P-R curve for the single-channel method; (b) the F1 score for the single-channel method; (c) the P-R curve for the cross-attention method; (d) the F1 score for the cross-attention method.
Table 1. Component-wise ablation analysis.
Model Configuration | mAP (%) | ΔmAP | IOU (%) | ΔIOU | Latency (ms) | ΔLatency
Baseline (RGB only) | 72.1 | - | 63.5 | - | 1.8 | -
+SNN Filtering | 77.3 | +5.2 | 67.8 | +4.3 | 1.9 | +0.1
+Cross-Attention | 83.6 | +6.3 | 73.2 | +5.4 | 2.1 | +0.2
+Element-wise Multiplication | 87.9 | +4.3 | 78.2 | +5.0 | 1.4 | −0.7
Table 2. The detection latency and TOP1 accuracy with different Transformer-based structures.
Methods | Image | Latency (ms) | Top1 (%)
Dot-product + Single channel | RGB | 1.3 | 75.1
Element-wise + Single channel | RGB | 1.1 | 74.8
Dot-product + Cross-attention | RGB + NeuroVI | 1.6 | 88.3
Element-wise + Cross-attention | RGB + NeuroVI | 1.4 | 87.9
Table 3. The object detection and semantic segmentation with different backbones.
Detection and Instance Segmentation (%)
Backbones | AP^box | AP^box_50 | AP^box_75 | AP^mask | AP^mask_50 | AP^mask_75
AlexNet [34] | 32.4 | 52.7 | 35.0 | 28.8 | 50.2 | 30.9
VGGNet [35] | 33.5 | 53.3 | 36.1 | 30.0 | 51.7 | 32.5
UNET++ [22] | 36.2 | 56.1 | 38.9 | 33.5 | 53.6 | 34.5
ResNet18 [36] | 35.0 | 55.0 | 37.7 | 32.2 | 52.0 | 33.7
ResNet50 [36] | 38.0 | 58.6 | 41.4 | 34.4 | 55.1 | 36.7
Swin-Transformer [25] | 40.3 | 60.5 | 43.6 | 36.3 | 57.4 | 38.4
Ours | 42.1 | 62.2 | 45.1 | 38.2 | 59.8 | 40.2
