RetinalCoNet: Underwater Fish Segmentation Network Based on Bionic Retina Dual-Channel and Multi-Module Cooperation

Jianhua Zheng; Yusha Fu; Junde Lu; Jinfang Liu; Zhaoxi Luo; Shiyu Zhang

doi:10.3390/fishes10090424

¹ College of Artificial Intelligence, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China

² Guangzhou Key Laboratory of Agricultural Products Quality & Safety Traceability Information Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China

³ Smart Agriculture Innovation Research Institute, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China

* Author to whom correspondence should be addressed.

Fishes 2025, 10(9), 424; https://doi.org/10.3390/fishes10090424

Abstract

Underwater fish image segmentation is a key technology for intelligent fisheries and ecological monitoring. However, light attenuation, blurred boundaries, and low contrast caused by complex underwater environments severely limit segmentation accuracy. This paper proposes RetinalCoNet, an underwater fish segmentation network based on bionic retina dual-channel and multi-module cooperation. First, the bionic retina dual-channel module is embedded in the encoder to simulate how biological vision systems separate and process light and dark signals, which enhances feature extraction for blurred target contours and translucent tissues. Second, a dynamic prompt module is introduced; it strengthens the response of key features through adaptive prompt templates and suppresses noise interference from the water body. Finally, an edge prior guidance mechanism is integrated into the decoder, and low-contrast boundary features are dynamically enhanced by conditional normalization. The experimental results show that RetinalCoNet outperforms other mainstream segmentation models on the key indicators, reaching an mDice of 82.3% and an mIoU of 89.2%, and that it segments boundaries well across many different scenes. This study achieves accurate fish segmentation in complex underwater environments and contributes to underwater ecological monitoring.

Keywords: underwater image segmentation; prior knowledge; underwater complex environment

Key Contribution: RetinalCoNet, an underwater fish segmentation network, is proposed. It combines bionic retina dual channels, dynamic prompt learning, and an edge prior guidance mechanism to address fuzzy boundaries and low contrast in underwater environments, and it achieves high-precision segmentation with complete boundaries.

1. Introduction

The intelligent fishery is a modern sustainable fishery production mode in which image processing technology for identifying fish in underwater environments is very important, as it can promote the development of intelligent aquaculture. Underwater fish segmentation technology is an important basis for mining fish biological information and promoting aquaculture informatization. The precise identification of fish species enables the development of more accurate ecological models. This facilitates a deeper understanding of essential biological information, including ecological habits and reproductive patterns. Such detailed insights are crucial for implementing intelligent management strategies in aquaculture operations. Consequently, optimized resource utilization can be achieved while simultaneously mitigating the risks associated with overfishing and resource depletion. This integrated approach ultimately supports the maintenance of the ecological equilibrium in aquatic environments.

Early underwater fish image segmentation methods relied mainly on traditional image processing, including classical techniques such as thresholding, edge detection, region growing, and graph cuts. Using trawl-mounted underwater cameras, Chuang et al. [] developed a fish segmentation approach that integrates dual Otsu thresholding with histogram back-projection algorithms. Baloch et al. [] developed a fish segmentation algorithm that employs morphological preprocessing techniques, namely, grayscale conversion, edge detection, dilation, region filling, and image negation, ultimately yielding a binary segmentation mask. Spampinato et al. [] used image processing and feature extraction to extract fish features from underwater images, used support vector machine (SVM) and other classifiers to classify fish automatically, and conducted a fish behavior analysis to improve the classification accuracy and understand fish behavior. These methods can be effective against simple backgrounds, but they segment poorly when the background is complex, the boundaries are unclear, or the illumination is uneven.

In recent years, deep learning methods have demonstrated significant advances in image segmentation, leveraging their robust feature extraction capabilities. These techniques are now extensively applied to underwater fish segmentation. By designing specific loss functions [], introducing attention mechanisms [], or using Siamese networks [], researchers enable models to learn automatically the features that distinguish fish from complex underwater backgrounds, which significantly improves the accuracy and robustness of segmentation. However, although existing methods perform well on clear, high-resolution images, they still have fundamental limitations on underwater images, which exhibit quality degradation and contain complex scenes. On the one hand, light attenuation and scattering by suspended particles reduce the image contrast and blur the boundaries, which makes fish contours difficult to define. On the other hand, the dynamic deformation and appearance diversity caused by fish swimming further challenge the adaptability of the model.

To enhance model performance in underwater segmentation tasks, researchers frequently integrate image enhancement or restoration techniques into their processing models. Typical examples are physics-based underwater dehazing algorithms and color correction techniques [], which aim to improve the visual quality of underwater images. These methods focus on improving the overall visual quality or correcting specific degradation models, which can optimize the dominant features of images to a certain extent and provide clearer input for subsequent segmentation models. However, physical models struggle to accurately reconstruct fine geometric details and gradient variations in boundary regions. This limitation hinders their ability to resolve the core challenge of structural ambiguity at target boundaries in underwater segmentation tasks.

To address the degraded image quality and low segmentation accuracy caused by complex underwater environments, this paper proposes an underwater fish segmentation network based on bionic retina dual-channel and multi-module cooperation. The technical contribution of this approach lies primarily in three aspects:

(1): The underwater lighting conditions are complex and changeable, and different depths and shooting times lead to great differences in light intensity and spectral distribution. To address critical challenges in underwater imagery—including unstable contrast, reduced definition, blurred boundaries, and noise-obscured semi-transparent tissues—this paper proposes a novel bionic retina dual-channel computational module. By simulating biological vision to separate light and dark signals, it can adapt to the changes in underwater lighting and enhance the perception of subtle features of fish.
(2): To mitigate edge blurring induced by light attenuation and particle scattering while resolving occlusion-induced contour fragmentation, we implement a prompt learning module. This approach amplifies feature responses at indistinct boundaries and occluded regions while adaptively guiding model attention toward discriminative features, thereby enhancing target saliency perception and improving fish segmentation integrity in complex underwater environments.
(3): The underwater optical characteristics lead to the attenuation and scattering of light, which weakens the gradient information of the fish boundary, and the low contrast between target and background further aggravates the difficulty of boundary discrimination. In order to meet this challenge, this paper integrates the concept of edge prior guidance and designs an edge enhancement module to solve the problems of the blurred boundary and insufficient fusion of multi-scale features in underwater fish segmentation, and realize accurate pixel-level segmentation.

2. Related Works

2.1. Underwater Image Segmentation

Liu et al. [] enhanced the DeepLabv3+ architecture for underwater segmentation by integrating an unsupervised color correction module within the encoder to optimize image quality, while incorporating dual upsampling layers in the decoder to preserve high-resolution target features and boundary details. Hambard et al. [] developed an end-to-end underwater generative adversarial network for monocular underwater image depth estimation. Dudhane et al. [] introduced an end-to-end depth network for underwater image restoration, incorporating three specialized components: a channel-wise color feature extraction module, a dense residual feature extraction module, and a custom user-defined loss function. Liu et al. [] introduced a target-guided dual-adversarial contrastive learning framework for underwater image enhancement, specifically designed to preserve detection-favorable features that conventional enhancement methods often degrade. Kim et al. [] designed a special loss function in a parallel segmentation network to improve the extraction of boundary region features and introduced an attention mechanism to enhance the model's ability to learn the foreground and background of underwater images. Liu et al. [] developed an attention-driven underwater saliency detection architecture that integrates channel-wise and spatial attention mechanisms to enhance feature representations, thereby optimizing object boundary delineation in challenging aquatic environments. Fu et al. [] developed MASNet, a deep learning architecture for underwater animal segmentation, which enhances segmentation robustness and precision through integrated Siamese networks and strategic data augmentation techniques. Chen et al. [] detected and segmented underwater fish under different lighting conditions using a pre-trained Mask R-CNN model. Yang et al. [] designed a multi-scale feature extraction module leveraging atrous spatial pyramid pooling, which strengthens the extraction of high-level semantic features through multi-layer dilated convolution pyramidal processing and a triangularly configured adaptive channel attention mechanism. Chicchon et al. [] achieved high accuracy on public underwater datasets using a method based on image contours and a joint loss function.

2.2. Bionic Neural Network

Shen et al. [] developed a bio-inspired polarization navigation method integrating insect visual neural principles with deep learning. This intelligent approach enhances navigation reliability and precision in complex natural environments while establishing novel methodological frameworks for polarimetric orientation sensing. Pu et al. [] developed a bionic artificial lateral line positioning system that emulates fish neuromast functionality to augment the flow-field pressure perception. This bio-inspired approach optimizes underwater target localization accuracy and environmental adaptability while demonstrating superior positioning precision and noise resilience. Li et al. [] developed a bionic olfactory neural network modeled after the mammalian olfactory bulb. This architecture directly processes sensor outputs using Gabor-based neural encoding, circumventing conventional preprocessing stages to enhance electronic nose odor discrimination capabilities. Hu et al. [] proposed a bionic attention method inspired by the human visual system. This method supports mainstream identification by introducing additional category labels as bionic information flow in the model input stage, thus improving the detection effect. Gu et al. [] proposed a vision–smell bionic perception system inspired by zebrafish, which was used to locate and identify liquids with similar colors or smells. Drawing inspiration from the signal integration mechanism observed in the zebrafish retina–olfactory bulb circuit, this system utilizes a gas sensor array and a camera as functional analogs for olfactory and visual perception, respectively. An artificial neural network within the system emulates the operational mechanisms of biological bipolar cells, thereby facilitating the encoding and subsequent fusion of multi-modal signals. Liang et al. [] developed a bionic vision technique inspired by the Drosophila visual system for detecting small-target motion within cotton fields. The model simulates the sensitivity of the Drosophila neurovisual pathway, responds to the weak motion of small targets against a complex background, and introduces a directional selective suppression algorithm to reduce background interference.

2.3. Prior Information

Xia et al. [] assume that different semantic information has different distribution characteristics in the frequency domain and spatial domain. This spatial distribution profile serves as valuable prior knowledge, directing the model’s focus towards critical regions within the image. Such guidance enhances the model’s ability to comprehend and differentiate anatomical structures, ultimately improving segmentation precision. Jiang et al. [] assume that the heterogeneity of seismic data brings its distribution closer to a t-distribution. Within each data window, the computed t-distribution probabilities serve as prior knowledge. These are combined with the soft attention weights derived from the self-attention mechanism via element-wise multiplication, yielding the final posterior attention distribution. Zhang et al. [] designed the RASpine network, which uses prior anatomical knowledge together with an overlap detector to identify the overlapping regions between different vertebrae in the segmentation results. The ApplianceFilter model proposed by Ding et al. [] combines statistical features and current aggregate power data as prior information; an expert feature encoder encodes and fuses this prior knowledge, which improves the accuracy of non-invasive load decomposition. Yan et al. [] introduced a physical perception prior encoder, which effectively eliminated the color shift and blur in underwater images and provided more accurate prior information for the network.

3. Research Method

3.1. Methods for Incorporating Prior Information

In deep learning, prior information refers to the inherent knowledge about data, tasks, or model structure that people have based on domain knowledge, experience, or assumptions before training the model. It can help the model to learn more efficiently, improve its generalization ability, and avoid over-reliance on the noise or accidental patterns in training data. To augment model performance, prior information can be integrated through four principal approaches:

(1): Prior information adding method based on transfer learning and pre-training: refers to using the general knowledge contained in the model pre-trained on large-scale general data as prior information, and transferring this knowledge to the target task through transfer learning technology (such as feature extraction or fine-tuning). This approach circumvents full model training from scratch, substantially enhancing efficacy in data-scarce scenarios while accelerating convergence and strengthening generalization capabilities. Fundamentally, it leverages transferable feature representations acquired during pre-training as robust initialization points, enabling rapid adaptation to novel tasks.
(2): The method of adding prior information based on the data level: this refers to adding prior information such as domain knowledge or expert experience to the model by transforming the training data itself, such as using knowledge to guide data enhancement, screening high-quality samples, embedding structured knowledge representation, or generating synthetic data in line with the prior, so the data more directly and explicitly reflect the known rules or relationships, thus guiding the model to learn this information more efficiently and accurately.
(3): Structural integration of prior information: This approach embeds domain knowledge and inductive biases by designing inherent architectural characteristics within the network. Through such structural constraints, models are guided to autonomously learn feature representations aligned with prior assumptions. This kind of prior does not depend on the data preprocessing or training strategy, but gives the model “built-in sensitivity” to specific types of data patterns through the connection mode, parameter-sharing mechanism, or constraints of the network layers.
(4): The method of adding prior information based on the loss function: This refers to embedding prior information such as domain knowledge, problem characteristics, or expected behavior into the model optimization process by designing the form, structure, or parameters of loss functions, and guiding the model to learn features or prediction results that conform to the prior law.

This paper adopts edge prior information, which belongs to the third category above: integrating prior information into the model structure. Underwater fish segmentation encounters significant environmental challenges, including low ambient light and turbulent water conditions. These factors degrade the image quality by obscuring target–background boundaries and reducing feature discriminability, thereby impeding the model’s capacity to accurately delineate object contours. Consequently, this work incorporates edge prior information to leverage boundary characteristics as fundamental visual cues. By explicitly modeling gradient variations and structural patterns at target–background interfaces, the approach enhances model sensitivity to fish contours and directly addresses the critical challenge of edge degradation in underwater environments.

3.2. RetinalCoNet Model Structure

As shown in Figure 1, RetinalCoNet is an underwater fish segmentation network based on bionic retina dual-channel and multi-module cooperation. The network consists of an encoder and a decoder. In the encoder stage, an Interactive Encoder Block with bionic retina dual channels is integrated: the block first extracts general visual features through a three-layer standard convolution process of “dimension reduction–processing–dimension expansion”, and then refines them for fish segmentation with the bionic retina dual-channel module, forming an efficient structure of “general processing + underwater fish segmentation task customization”. At the same time, the Dynamic Prompt Block generates dynamic prompt templates and selects effective templates based on global mean weighting to adaptively enhance semantic features. In the decoder stage, the Decoder Block performs upsampling and cross-layer feature concatenation, and each decoder has a built-in BoundaryEnhance Block that explicitly enhances target boundary features through edge condition generation, conditional normalization, and residual fusion. Through bionic retinal dual-channel feature extraction in the encoder, dynamic prompt semantic enhancement, and edge prior guidance in the decoder, the network addresses blurred boundaries and insufficient multi-scale feature fusion in complex underwater environments and achieves accurate pixel-level segmentation. In Figure 1, BN stands for batch normalization, and ReLU stands for the Rectified Linear Unit activation function.

Figure 1. Underwater fish segmentation network based on bionic retinal dual-channel and multi-module cooperation.

3.2.1. Interactive Encoder Block

Underwater fish segmentation faces numerous challenges, primarily due to complex environmental lighting conditions. Changes in depth and time factors cause significant fluctuations in light intensity and spectral composition, leading to instability in image contrast and resolution. These factors hinder the reliable extraction of features, resulting in blurred boundaries and loss of fine details in the segmentation images. Additionally, semi-transparent tissue structures of fish such as scales and fins are susceptible to noise occlusion, thereby compromising the integrity of segmentation results.

To address these challenges, this study designs a bionic retina dual-channel module inspired by the dual-channel theory of retinas in the biological vision system, as put forward by Choi et al. []. The human visual system employs dual processing streams for spatial analysis and object recognition. In comparison, computer vision typically relies on purely feedforward architectures, resulting in reduced robustness, adaptability, and operational efficiency relative to biological vision. The module adapts to changes in underwater illumination by simulating how biological vision separates light and dark signals, and it targets the unstable contrast, blurred boundaries, and noise-prone semi-transparent tissues that result from variable underwater lighting. It comprises two parallel channels: (1) Magnocellular Channel: uses dilated convolution to expand the receptive field and rapidly capture the global contour information of the fish body, and is sensitive to moving or blurred targets; (2) Parvocellular Channel: includes ON and OFF subchannels, which respond specifically to “bright center–dark surroundings” and “dark center–bright surroundings” visual patterns, respectively, to finely extract features in bright and dark areas under different lighting conditions. The features from the ON and OFF subchannels are adaptively integrated through a dynamic weight fusion mechanism, which balances feature extraction in bright and dark areas, suppresses background interference, and enhances the perception of subtle fish features.

To effectively drive the subsequent bionic retina dual-channel module and improve feature quality, this work designs an Interactive Encoder Block. Its core objectives are to preprocess and refine input features, remove redundant information, and focus on key visual cues related to fish segmentation, so that the features better match the retinal mechanism’s requirements for detail and contour processing. This block adopts a “general processing + task customization” structure: first, a standard convolutional process is used for general feature extraction and dimensionality reduction; then, the refined features are fed into the bionic retina dual-channel module for customized optimization tailored to underwater fish segmentation.

The specific structure is shown in Figure 2. The input feature map X is first copied as a residual connection branch. X is processed through a three-layer convolutional stack, with each layer containing convolution, BN, and ReLU. The processed feature map is added element-wise to the residual branch to achieve feature fusion and mitigate gradient vanishing, resulting in a refined feature x'. Finally, x' is fed into the bionic retina dual-channel module for further processing.

Figure 2. Structural diagram of Interactive Encoder Block.

Figure 3 shows the bionic retinal dual-channel module. The Magnocellular channel comprises a convolutional layer followed by instance normalization and ReLU activation. This architecture is specifically optimized for contour perception through broad spatial context integration. The convolution is a dilated convolution, which expands the receptive field, as shown in Formula (1):

$$\mathrm{feat}_{m} = \mathrm{ReLU}(\mathrm{InstanceNorm}(\mathrm{Conv}_{3 \times 3}(x')))$$

(1)

Figure 3. Bionic retinal dual-channel module.

The Parvocellular channel comprises ON and OFF subchannels, each composed of a convolution layer, a group normalization layer, and a ReLU activation function, as described by Formulas (2) and (3):

$$\mathrm{feat}_{\mathrm{on}} = \mathrm{ReLU}(\mathrm{GroupNorm}(\mathrm{Conv}_{3 \times 3}(x')))$$

(2)

$$\mathrm{feat}_{\mathrm{off}} = \mathrm{ReLU}(\mathrm{GroupNorm}(\mathrm{Conv}_{3 \times 3}(x')))$$

(3)

The ON channel extracts center-excitation features: a 3 × 3 convolution layer implements the center excitation–peripheral inhibition mechanism, and its kernel weights are initialized to a fixed pattern of 0.8 for the central point and −0.1 for the peripheral points, so that it responds specifically to visual stimuli with a bright center and dark periphery. The OFF channel extracts the complementary features: a dilated convolution (dilation = 2) implements the central inhibition–peripheral excitation mechanism, and its kernel weights are initialized to the mirrored pattern of −0.8 for the center point and 0.1 for the peripheral points, so that it detects visual features with a dark center and bright periphery.
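The initialization pattern above can be checked by hand on toy patches. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes a single channel, ignores the dilation used by the OFF channel, and applies the kernels at their initial values only (training would change them). The helper names `center_surround_kernel` and `respond` are hypothetical.

```python
# Toy check of the ON/OFF initial kernels (single channel, no dilation).

def center_surround_kernel(center, surround):
    """Return a 3x3 kernel with `center` in the middle and `surround` elsewhere."""
    return [[center if (r, c) == (1, 1) else surround for c in range(3)]
            for r in range(3)]

def respond(kernel, patch):
    """Correlate one 3x3 kernel with one 3x3 patch, then apply ReLU."""
    total = sum(kernel[r][c] * patch[r][c] for r in range(3) for c in range(3))
    return max(0.0, total)

on_kernel = center_surround_kernel(0.8, -0.1)   # excitatory center, inhibitory surround
off_kernel = center_surround_kernel(-0.8, 0.1)  # inhibitory center, excitatory surround

bright_spot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # bright center, dark surround
dark_spot = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]    # dark center, bright surround
uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]      # no local contrast

on_bright = respond(on_kernel, bright_spot)    # about 0.8: ON kernel fires
off_bright = respond(off_kernel, bright_spot)  # 0.0: OFF kernel is suppressed
on_dark = respond(on_kernel, dark_spot)        # 0.0: ON kernel is suppressed
off_dark = respond(off_kernel, dark_spot)      # about 0.8: OFF kernel fires
on_uniform = respond(on_kernel, uniform)       # about 0: 0.8 + 8 * (-0.1) = 0
```

Because the center weight and the eight surround weights sum to zero, a region of uniform brightness produces essentially no response at initialization, so both kernels react to local contrast and not to absolute intensity.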

$\mathrm{feat}'_{\mathrm{on}}$ and $\mathrm{feat}'_{\mathrm{off}}$ are obtained by global average pooling of the output features of the ON channel and the OFF channel, respectively. Then, the dynamic fusion weights are calculated by two fully connected layers and the Softmax function, and the feature $\mathrm{feat}_{p}$ is obtained after weighted fusion, as shown in Formulas (4) and (5):

$$\mathrm{weights} = \mathrm{Softmax}(\mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(\mathrm{AvgPool}(\mathrm{cat}(\mathrm{feat}'_{\mathrm{on}}, \mathrm{feat}'_{\mathrm{off}}))))))$$

(4)

$$\mathrm{feat}_{p} = \mathrm{weights}[:, 0:1] \cdot \mathrm{feat}_{\mathrm{on}} + \mathrm{weights}[:, 1:2] \cdot \mathrm{feat}_{\mathrm{off}}$$

(5)
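The weighted fusion in Formulas (4) and (5) can be illustrated on toy data. The sketch below is plain Python under stated assumptions: each subchannel output is a single small 2D map, and the two learned 1 × 1 convolutions with ReLU are replaced by a hypothetical `gate` callable (an identity function here) that maps the two pooled descriptors to two logits. It shows the arithmetic only, not the trained module.

```python
import math

def global_avg_pool(feature_map):
    """Average all values of one 2D feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_on_off(feat_on, feat_off, gate):
    """Pool both maps, turn the gate output into two weights, and blend the maps."""
    pooled = [global_avg_pool(feat_on), global_avg_pool(feat_off)]
    w_on, w_off = softmax(gate(pooled))
    fused = [[w_on * a + w_off * b for a, b in zip(row_on, row_off)]
             for row_on, row_off in zip(feat_on, feat_off)]
    return (w_on, w_off), fused

# The ON map is strongly active and the OFF map is silent.
feat_on = [[2.0, 2.0], [2.0, 2.0]]
feat_off = [[0.0, 0.0], [0.0, 0.0]]

# Identity gate: a stand-in (assumption) for the learned layers in Formula (4).
weights, feat_p = fuse_on_off(feat_on, feat_off, gate=lambda pooled: pooled)
```

Since the Softmax output sums to one, the fused map is a convex combination of the two subchannel outputs; with the identity gate used here, the more active ON map receives the larger weight.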

The output features of the Magnocellular channel and the fused features of the Parvocellular channel are concatenated and passed through a 3 × 3 convolution, group normalization, and ReLU to obtain the fused feature $\mathrm{feat}_{\mathrm{cross}}$, while a 1 × 1 convolution followed by GELU enhances the features to obtain $\mathrm{feat}_{\mathrm{fusion}}$.

The fused feature $\mathrm{feat}_{\mathrm{cross}}$, the enhanced feature $\mathrm{feat}_{\mathrm{fusion}}$, and the original input features undergo element-wise summation to achieve hierarchical feature integration. This aggregated representation yields the final composite feature map for downstream network propagation, as shown in Formula (6):

$$F_{\mathrm{out}} = \mathrm{feat}_{\mathrm{cross}} + \mathrm{feat}_{\mathrm{fusion}} + x'$$

(6)

In Figure 3, ⊕ represents addition, while ⊗ represents multiplication.

3.2.2. Dynamic Prompt Block

In underwater fish segmentation tasks, image features become blurred due to light attenuation and particle scattering in the aquatic environment; contour discontinuities and incomplete segmentation occur due to edge gradient degradation or occlusion; and the features of transparent fish are easily confused with complex backgrounds. As a result, traditional segmentation methods struggle to repair blurred areas and complete missing structures. Potlapalli et al. [] proposed the PromptGen Block, which not only helps the network learn image degradation type information but also adjusts the prompt based on the input content, thereby effectively guiding the network in restoration. This paper adopts a trainable “prompt template” to guide the model to focus on key features. The prompt template can explicitly encode prior knowledge about fish and enhance the model’s attention to the target.

However, the PromptGen Block uses random initialization to generate templates with fixed parameters, so it cannot dynamically adjust the prompt content to changes in the morphology of underwater fish, lighting conditions, and background interference. When handling fish segmentation tasks in different scenarios, the limited representation capability of the templates can easily lead to feature extraction biases.

Therefore, this paper improves the PromptGen Block and renames it the Dynamic Prompt Block. The Dynamic Prompt Block replaces the original static, randomly initialized template with a dynamic prompt template generation network. It generates adaptive prompt content in real time from the fish contours, textures, and environmental features of the input image, which explicitly enhances feature responses at blurred edges and in occluded regions and improves the model’s generalization to complex underwater scenes. By suppressing water noise through weighting, it also repairs contour breaks caused by occlusion or low contrast, making the model more robust against interference. The specific structure is shown in Figure 4, where ⊕ represents addition, while ⊗ represents multiplication.

Figure 4. Structural diagram of Dynamic Prompt Block.

The Dynamic Prompt Block mainly consists of three parts: a dynamic prompt template generation network, a linear layer, and a convolution layer. As shown in Figure 5, the dynamic prompt template generation network consists of a series of convolution layers, batch normalization, activation functions, a pooling operation, and a linear transformation.

Figure 5. Structural diagram of prompt generator.

Given an input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, a prompt template is dynamically generated, which can be expressed by Formula (7):

$$P_{\mathrm{gen}}^{i} = g(x), \quad i \in \{1, 2, 3, 4, 5\}$$

(7)

where $g(\cdot)$ includes convolution, pooling, and linear transformation. The generation of this prompt template is dynamic, that is, it changes with the input features and thus provides input-specific prompt information. The spatial average vector derived from the input features is transformed by a linear layer and then normalized by Softmax to generate the prompt weights $W_{\mathrm{prompt}}$.

The weight $W_{\mathrm{prompt}}$ indicates the importance of each prompt for the current input features and is used for the subsequent selective fusion of prompt templates. Following dimensional expansion, the prompt weights undergo element-wise multiplication with the generated prompt templates. Subsequent summation across the prompt dimension yields a prompt-enhanced feature $P_{\mathrm{fusion}}$, as formalized in Formula (8):

$$P_{\mathrm{fusion}} = \sum_{i = 1}^{L} (W_{\mathrm{prompt}}^{i} \cdot P_{\mathrm{gen}}^{i})$$

(8)

$$F_{\mathrm{out}} = \mathrm{Conv}_{3 \times 3}(P_{\mathrm{upsampled}} + x) \cdot x$$

(9)

This process realizes the dynamic enhancement of input features and effectively integrates the prompt information into the original features. The feature map fused with prompt information is resized by bilinear interpolation to the same spatial size as the input feature to form $P_{\mathrm{upsampled}}$, then added element-wise to the input feature map, further processed by the convolution layer, and finally multiplied element-wise with the input feature map to derive the enhanced feature map $F_{\mathrm{out}}$, as shown in Formula (9).
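The arithmetic of Formulas (8) and (9) can be traced on flattened toy vectors. This is a minimal plain-Python sketch, not the authors' code, and it makes several assumptions: the five prompt templates are hand-written constants standing in for the generator output, the prompt is already at the input resolution (so bilinear upsampling is skipped), and the 3 × 3 convolution is replaced by an identity stand-in. The names `fuse_prompts` and `modulate` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_prompts(templates, logits):
    """Formula (8): softmax-weighted sum over the prompt dimension."""
    weights = softmax(logits)
    fused = [sum(w * t[k] for w, t in zip(weights, templates))
             for k in range(len(templates[0]))]
    return weights, fused

def modulate(x, prompt, conv=lambda v: v):
    """Formula (9): conv(prompt + x) * x, with `conv` an identity stand-in."""
    mixed = conv([p + xi for p, xi in zip(prompt, x)])
    return [m * xi for m, xi in zip(mixed, x)]

# Five constant templates (values 1..5) over four flattened positions.
templates = [[float(i)] * 4 for i in range(1, 6)]

# Equal logits give each template a weight of 0.2, so the fusion is their mean.
weights, p_fusion = fuse_prompts(templates, logits=[0.0] * 5)

x = [1.0, 2.0, 0.0, -1.0]
f_out = modulate(x, p_fusion)   # approximately [4.0, 10.0, 0.0, -2.0]
```

The final multiplication by `x` means that positions where the input feature is zero stay zero regardless of the prompt, so in this form the prompt rescales existing responses and does not create new ones.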

3.2.3. Decoder Block

In underwater fish segmentation tasks, the optical properties of water cause light attenuation and scattering, which weakens the gradient information at fish boundaries. The low contrast between the target and the background further exacerbates the difficulty of boundary distinction. Additionally, the boundary integrity of multi-scale fish targets and the dynamic deformation of boundaries when fish swim pose challenges to segmentation accuracy. To address these issues, this study introduces a boundary enhancement module that integrates edge prior information to improve segmentation accuracy.

As shown in Figure 6, this module consists of two submodules: an Edge Condition Generator and a ConditionalNorm. The Edge Condition Generator primarily focuses on quickly capturing contours by adaptively extracting edge probability distributions from feature maps. ConditionalNorm employs a contrast enhancement mechanism, using the edge probability to dynamically modulate normalization parameters and sharpen low-contrast boundaries. A residual fusion approach is adopted to preserve original features, avoid information loss, and ensure the structural integrity of segmentation results in complex underwater environments.

Figure 6. Decoder block structural diagram.

Given an input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, the Edge Condition Generator captures weak boundary cues by adaptively learning an edge probability distribution. It extracts the edge information through two convolution layers, as shown in Formulas (10) and (11):

e_{1} = \mathrm{ReLU}(\mathrm{BatchNorm}(\mathrm{Conv}_{3 \times 3}(x)))

(10)

EdgeCond = \mathrm{Sigmoid}(\mathrm{Conv}_{3 \times 3}(e_{1}))

(11)

ConditionalNorm dynamically adjusts the normalization parameters according to the edge probability to make low-contrast boundary features more discriminative, as shown in Formulas (12)–(16). First, the input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$ is normalized by instance normalization:

x^{'} = \mathrm{InstanceNorm}(x)

(12)

Then, a dynamic scaling parameter γ and an offset parameter β are generated, and conditional normalization is applied.

γ = \mathrm{Conv}_{3 \times 3}(EdgeCond)

(13)

β = \mathrm{Conv}_{3 \times 3}(EdgeCond)

(14)

Normalized = (1 + \tanh(γ)) \cdot x^{'} + β

(15)

Finally, the normalized features are processed by a convolution layer and combined with $e_{1}$ through a residual connection.

feat_{boundary} = \mathrm{Conv}_{3 \times 3}(Normalized) + e_{1}

(16)

The module is embedded in each upsampling stage to handle multi-scale boundaries, and residual connections preserve the original features. Together, these measures address blurred boundaries, low contrast, multi-scale variation, and dynamic deformation in underwater scenes, and they improve the edge fit and robustness of the segmentation results. Here, ⊕ represents addition, while ⊗ represents multiplication.
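The modulation step in Formulas (12) and (15) can be illustrated with a small plain-Python sketch. It is a simplified illustration under stated assumptions, not the authors' code: `gamma` and `beta` are passed in directly instead of being produced by the 3 × 3 convolutions of Formulas (13) and (14), and each channel is a flat list of pixels. The function names are hypothetical.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel of one sample to zero mean and unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]


def conditional_norm(x, gamma, beta):
    """Apply Normalized = (1 + tanh(gamma)) * x' + beta, channel by channel.

    x, gamma, beta: lists of channels; each channel is a flat list of pixels.
    gamma and beta stand in for the outputs of the convolutions on EdgeCond.
    """
    out = []
    for ch, g_ch, b_ch in zip(x, gamma, beta):
        normed = instance_norm(ch)
        out.append([(1.0 + math.tanh(g)) * v + b
                    for v, g, b in zip(normed, g_ch, b_ch)])
    return out


x = [[0.0, 1.0, 2.0, 3.0]]
# gamma = 0 and beta = 0 leave the instance-normalized feature unchanged.
neutral = conditional_norm(x, [[0.0] * 4], [[0.0] * 4])
# A large gamma at the last pixel nearly doubles its normalized response.
boosted = conditional_norm(x, [[0.0, 0.0, 0.0, 10.0]], [[0.0] * 4])
```

Because tanh is bounded, the scale factor 1 + tanh(γ) stays between 0 and 2, so the edge condition can amplify or suppress a normalized response but cannot flip its sign.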

4. Collection and Construction of Dataset

One of the major challenges in fish segmentation research is the lack of publicly available datasets. To address this issue, this study collected fish videos in open environments such as ponds, reservoirs, and rivers, and generated a total of 1167 images by extracting 1 image every 12 frames. The dataset primarily targets common freshwater fish species such as grass carp, silver carp, and bighead carp, while also being applicable to other fish species. The main occlusions are caused by overlapping aquatic plants, rocks, and the fish's own movements. The environmental conditions cover both clear and relatively turbid water.

The dataset was divided into a training set and a validation set at a 7:3 ratio, containing 817 and 350 images, respectively, and annotated using LabelMe. During annotation, efforts were made to cover all parts of the fish, such as the body, tail, and fins. Even for partially visible fish, the visible parts were annotated to keep the annotations complete and accurate. Figure 7 displays original images and their corresponding ground truth. The dataset was collected at different locations, under different natural lighting conditions, and at different underwater depths to ensure data diversity.
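The frame sampling and the 7:3 split described above can be sketched as follows. The helper names, the fixed random seed, and the shuffling step are assumptions made for illustration; the paper does not state how the images were assigned to the two sets.

```python
import random


def sample_frame_indices(n_frames, step=12):
    """Indices of the frames kept when extracting 1 image every `step` frames."""
    return list(range(0, n_frames, step))


def split_dataset(items, train_ratio=0.7, seed=0):
    """Shuffle reproducibly, then split into training and validation subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


# A hypothetical 120-frame clip yields 10 sampled frames: 0, 12, ..., 108.
frames = sample_frame_indices(120)
train, val = split_dataset(frames)
```

For the 1167 images of this dataset, `round(1167 * 0.7)` gives the 817 training images reported above.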

Figure 7. Sample images from the dataset. (a) Original image; (b) mask.

5. Results and Discussion

5.1. Experimental Environment and Evaluation Metrics

The experiments were conducted on Ubuntu 18 (Canonical, London, UK; released in 2018) with an NVIDIA GeForce RTX 3090 GPU, using Python 3.9.

Because the differences between the segmented images are small, they cannot be evaluated accurately by visual inspection alone. To comprehensively assess model performance, this study employs six quantitative evaluation metrics: mean Dice (mDice), mean IoU (mIoU), S-measure ($S_{α}$), mean E-measure ($E_{φ}$), adaptive F-measure ($F_{β}$), and mean absolute error (MAE). Their mathematical expressions are given in Formulas (17)–(25).

mDice measures the similarity between the predicted segmentation result and the ground-truth label.

Dice = \frac{2TP}{FP + 2TP + FN}

(17)

The Dice coefficient quantifies segmentation accuracy by measuring the overlap between the predicted result and the ground truth: twice the area of their intersection divided by the sum of their areas, which is equivalent to the harmonic mean of precision and recall. The components of this evaluation metric are formally defined in Table 1.

Table 1. Confusion matrix.

mIoU is the average intersection-over-union (IoU) over all categories or all images. The IoU metric quantifies segmentation accuracy as the ratio between the intersection area and the union area of the predicted and ground-truth regions, as formalized in Equation (18), where A denotes the predicted segmentation mask for a class and B represents its corresponding ground-truth mask. In Equation (19), a is the number of categories or images being averaged.

IoU = \frac{|A \cap B|}{|A \cup B|}

(18)

mIoU = \frac{1}{a} \sum_{i=1}^{a} IoU_{i}

(19)
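As a worked illustration of Formulas (17)–(19), the following sketch computes Dice and IoU for binary masks stored as flat lists of 0 and 1. The function names are hypothetical, and returning 1.0 when both masks are empty is an assumption made here, not a convention stated in the paper.

```python
def confusion_counts(pred, gt):
    """Count TP, FP, and FN for two binary masks of equal length."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    return tp, fp, fn


def dice(pred, gt):
    """Dice = 2TP / (FP + 2TP + FN), Formula (17)."""
    tp, fp, fn = confusion_counts(pred, gt)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumption: empty vs. empty is a match


def iou(pred, gt):
    """IoU = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN), Formula (18)."""
    tp, fp, fn = confusion_counts(pred, gt)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # same assumption as above


def mean_score(metric, pairs):
    """Average a per-mask metric over (prediction, ground truth) pairs, Formula (19)."""
    scores = [metric(p, g) for p, g in pairs]
    return sum(scores) / len(scores)


pred = [1, 1, 1, 0, 0, 0]
gt = [0, 1, 1, 1, 0, 0]
d = dice(pred, gt)   # TP = 2, FP = 1, FN = 1, so Dice = 4/6
j = iou(pred, gt)    # IoU = 2/4
m = mean_score(iou, [(pred, gt), (gt, gt)])  # (0.5 + 1.0) / 2
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU), which the toy example satisfies.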

$S_{α}$ measures the structural similarity between the predicted segmentation mask and the ground-truth map, combining an object-aware term ($S_{o}$) and a region-aware term ($S_{r}$).

S_{α} = α S_{o} + (1 - α) S_{r}

(20)

where α ∈ [0, 1] is a weighting factor, usually set to 0.5 to balance the contributions of $S_{o}$ and $S_{r}$.

$E_{φ}$ combines local pixel matching with global image statistics, so it evaluates both the local and the global similarity between the predicted map C and the ground-truth map G.

E_{φ} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} φ[C(x, y), G(x, y)]

(21)

Here, φ denotes the enhanced alignment matrix, W is the width of the input image, and H is its height.
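The text above does not spell out how the enhanced alignment matrix φ is computed. The sketch below assumes the commonly used formulation, in which both maps are mean-centred, an alignment term 2ab / (a² + b²) is computed per pixel, and the result is mapped to [0, 1] by (1 + ξ)² / 4. The function name and the `eps` stabilizer are choices made for this illustration and may differ from the evaluation code used in the paper.

```python
def e_measure(pred, gt, eps=1e-8):
    """Mean enhanced-alignment score for two maps given as flat lists.

    Assumed formulation (not taken from the paper):
      a = pred - mean(pred), b = gt - mean(gt)
      xi = 2ab / (a^2 + b^2), phi = (1 + xi)^2 / 4
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    total = 0.0
    for p, g in zip(pred, gt):
        a = p - mean_p            # deviation of the prediction from its global mean
        b = g - mean_g            # deviation of the ground truth from its global mean
        xi = 2 * a * b / (a * a + b * b + eps)
        total += (1 + xi) ** 2 / 4
    return total / n


gt = [0.0, 0.0, 1.0, 1.0]
same = e_measure(gt, gt)                         # close to 1 for a perfect match
inverted = e_measure([1.0, 1.0, 0.0, 0.0], gt)   # close to 0 for an inverted map
```

The per-pixel term uses the global means of both maps, which is how the metric couples local matching with image-level statistics.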

$F_{β}$ measures the trade-off between precision and recall. The non-binary prediction map, with values in the range 0 to 255, is thresholded at T to obtain a binary mask M(T). Comparing M(T) with the ground-truth map G gives the precision P and the recall R. The symbol |·| denotes the total area of a mask.

P = \frac{|M(T) \cap G|}{|M(T)|}

(22)

R = \frac{|M(T) \cap G|}{|G|}

(23)

F_{β} = \frac{(1 + β^{2}) P \times R}{β^{2} \times P + R}

(24)
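A minimal sketch of Formulas (22)–(24) for one threshold T is given below. The value β² = 0.3 is a common choice in the segmentation literature and is an assumption here, as is the use of `>=` for binarization; neither is stated in the paper.

```python
def binarize(pred_map, threshold):
    """Binary mask M(T): 1 where the 0-255 prediction reaches the threshold."""
    return [1 if v >= threshold else 0 for v in pred_map]


def f_beta(pred_map, gt, threshold, beta_sq=0.3):
    """Return (precision, recall, F_beta) for one threshold T.

    beta_sq is beta squared; 0.3 is an assumed default, not a value from the paper.
    """
    mask = binarize(pred_map, threshold)
    overlap = sum(1 for m, g in zip(mask, gt) if m and g)   # |M(T) ∩ G|
    mask_area = sum(mask)                                   # |M(T)|
    gt_area = sum(gt)                                       # |G|
    precision = overlap / mask_area if mask_area else 0.0
    recall = overlap / gt_area if gt_area else 0.0
    denom = beta_sq * precision + recall
    score = (1 + beta_sq) * precision * recall / denom if denom else 0.0
    return precision, recall, score


pred_map = [200, 180, 90, 30, 10, 220]
gt = [1, 1, 1, 0, 0, 0]
# At T = 128 the mask is [1, 1, 0, 0, 0, 1]: 2 of 3 predicted pixels are correct,
# and 2 of 3 ground-truth pixels are recovered.
precision, recall, score = f_beta(pred_map, gt, threshold=128)
```

Because recall is defined with the intersection, it can never exceed 1, and when precision and recall are equal, $F_{β}$ equals that common value for any β.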

MAE is the average absolute difference in pixel values between the normalized prediction map C and the ground-truth map G, both of which lie in the interval [0, 1]. For each pixel position (x, y), C(x, y) and G(x, y) denote the pixel values of the prediction map and the ground-truth map, respectively.

MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |C(x, y) - G(x, y)|

(25)
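Formula (25) reduces to a single average once both maps are flattened, as in the following sketch. The helper `normalize_8bit` is a hypothetical convenience for bringing 0–255 predictions into [0, 1].

```python
def normalize_8bit(values):
    """Scale 0-255 pixel values to the interval [0, 1]."""
    return [v / 255 for v in values]


def mae(pred_map, gt_map):
    """Mean absolute error between two maps given as flat lists in [0, 1]."""
    if len(pred_map) != len(gt_map):
        raise ValueError("prediction and ground truth must have the same size")
    return sum(abs(c - g) for c, g in zip(pred_map, gt_map)) / len(pred_map)


# Absolute errors are 0, 0.5, 0, and 1, so the mean is 0.375.
error = mae([0.0, 0.5, 1.0, 1.0], [0.0, 1.0, 1.0, 0.0])
```

Unlike the overlap-based metrics, MAE is computed on the non-binarized prediction, so low-confidence pixels contribute partial errors.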

5.2. Comparative Experiments with Different Models

In order to comprehensively evaluate the performance of the proposed model, this study compares it with recent image segmentation techniques using the self-collected and constructed dataset mentioned above. These techniques include the classic UNet model []; the multi-view pattern MFFN model [] for detecting blurry boundaries and deformed objects; the HitNet model [] that enhances low-resolution representations using high-resolution features and iterative feedback patterns; the feature-aware FIRNet model [] for exploring the integrity of hidden objects; a robust underwater animal segmentation network called MASNet [], which primarily addresses the issues of image degradation and animal camouflage in underwater images; and the DeepLab-FusionNet model [] specifically designed for underwater object segmentation.

Detailed experimental results are shown in Table 2. The input size of the dataset used in this paper is 512 × 512, and all other hyperparameters are retained at the values used by the original model authors.

Table 2. Comparative experimental results of different models.

The experimental results indicate that the RetinalCoNet introduced in this paper outperforms the other models on all six evaluation metrics. RetinalCoNet reached 0.823 on mDice, which is 5.5, 25.7, 31.3, 29.1, 19.3, and 34.8 percentage points higher than Unet, MFFN, HitNet, FIRNet, MASNet, and Deeplab-FusionNet, respectively. RetinalCoNet reached 0.892 on mIoU, which is 3.6, 42.7, 47.2, 46.7, 34.8, and 8.0 percentage points higher than Unet, MFFN, HitNet, FIRNet, MASNet, and Deeplab-FusionNet, respectively. These improvements demonstrate the ability of RetinalCoNet to segment fish at the pixel level in complex underwater environments.

RetinalCoNet has the best $S_{α}$ score, reaching 0.856, which is 2.8, 14.7, 17.9, 18.3, 10.4, and 13.1 percentage points ahead of Unet, MFFN, HitNet, FIRNet, MASNet, and Deeplab-FusionNet, respectively. This shows that its segmentation results are more consistent with the ground-truth labels in overall structure and better preserve the shape information of the target.

RetinalCoNet and MASNet achieve the best and nearly identical $F_{β}$ scores (0.834 and 0.833, respectively), and both are clearly better than the other models. This indicates that a high and stable segmentation performance is maintained under different confidence thresholds.

RetinalCoNet has its most prominent advantage on $E_{φ}$, reaching 0.924, which is 11.4, 16.7, 17.2, 14.3, 15.2, and 28.1 percentage points higher than Unet, MFFN, HitNet, FIRNet, MASNet, and Deeplab-FusionNet, respectively. This metric places emphasis on the local accuracy of the boundary area. The improvement indicates that RetinalCoNet has clear advantages in handling the blurred and low-contrast boundaries common in underwater images and that it produces clearer and more accurate target contours.

RetinalCoNet achieved the lowest MAE value (0.023), which is 0.6, 4.9, 4.7, 6.7, 2.4, and 4.6 percentage points lower than those of Unet, MFFN, HitNet, FIRNet, MASNet, and Deeplab-FusionNet, respectively. This shows that its predictions have the lowest pixel-level error relative to the ground-truth annotation.
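The gaps quoted in this section are absolute differences between the scores in Table 2, expressed in percentage points. The short sketch below recomputes them for mDice and mIoU from the table values; the function name `gaps_in_points` is hypothetical.

```python
# mDice and mIoU scores copied from Table 2.
TABLE2_MDICE = {"Unet": 0.768, "MFFN": 0.566, "HitNet": 0.510, "FIRNet": 0.532,
                "MASNet": 0.630, "Deeplab-FusionNet": 0.475, "Ours": 0.823}
TABLE2_MIOU = {"Unet": 0.856, "MFFN": 0.465, "HitNet": 0.420, "FIRNet": 0.425,
               "MASNet": 0.544, "Deeplab-FusionNet": 0.812, "Ours": 0.892}


def gaps_in_points(scores, reference="Ours"):
    """Difference between the reference model and every other model, in percentage points."""
    ref = scores[reference]
    return {name: round((ref - value) * 100, 1)
            for name, value in scores.items() if name != reference}


mdice_gaps = gaps_in_points(TABLE2_MDICE)
miou_gaps = gaps_in_points(TABLE2_MIOU)
```

The same function applies to the other columns. For MAE, where lower is better, the returned gaps are negative.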

The analysis of the above indicators shows that the proposed method achieves a clear improvement in overall performance on the task of underwater fish segmentation. Its advantages on the core accuracy metrics mDice and mIoU, together with the improvements in structural preservation ($S_{α}$), segmentation stability ($F_{β}$), boundary quality ($E_{φ}$), and pixel-level error (MAE), support the effectiveness of the model design. Compared with MASNet, which is specially designed for underwater creatures, and with the other general segmentation models, RetinalCoNet better handles the challenges brought by underwater image degradation, such as haze-like blur, low contrast, and uneven illumination, and segments fish targets accurately.

5.3. Comparative Experiments in Different Scenes

To demonstrate more clearly the improvement that the proposed algorithm brings to underwater fish segmentation, seven algorithms were applied to fish images from diverse scenarios. The segmentation results are shown in Figure 8.

Figure 8. Comparative experiment of different scenes.

The segmentation results for each scene in Figure 8 are analyzed as follows. In the blurred scene, MFFN and MASNet fail to perceive and identify the target. HitNet, FIRNet, U-Net, and Deeplab-FusionNet segment some fish body areas, but they cannot recover the complete shape and outline of the fish. Although the proposed method does not achieve fully accurate segmentation, it outlines the general shapes of the fish more accurately. In the exposure scene, MFFN, FIRNet, MASNet, Deeplab-FusionNet, and U-Net segment the main body of the fish completely, but they struggle to identify translucent tissues such as the fins and tail. HitNet mistakenly segments the stones in the background and misses the middle of the fish. In contrast, the proposed method accurately segments the fish body and also identifies translucent tissues such as the fins and tail, giving more complete segmentation results. In the similar-color scene, there are two slender fish whose color is similar to that of the aquatic plants. MASNet fails to detect the targets. MFFN and HitNet detect both fish, but both mistake some background interference for fish bodies. Deeplab-FusionNet and U-Net identify and segment only one fish. FIRNet identifies both targets, but the segmented shapes are inaccurate. The proposed method identifies both fish and segments their shapes completely. In the overlapping scene, MFFN, HitNet, FIRNet, MASNet, and Deeplab-FusionNet fail to completely segment the overlapping targets. The results of U-Net are clearer and more complete than those of these methods, but the proposed method is more accurate still, especially in locating the tail of the left fish and the fin of the middle fish.

In the occlusion scene, MFFN, HitNet, and FIRNet all mistake some of the occluding water grass for targets. MASNet and U-Net accurately identify the occluded target, but their segmentation boundaries are not smooth enough. In contrast, the contours produced by Deeplab-FusionNet and the proposed method are smoother. In the multi-target scene, MFFN, HitNet, FIRNet, and MASNet fail to identify the target in the middle. Deeplab-FusionNet identifies the middle target, but its segmentation is incomplete. U-Net identifies and segments the middle target, but the proposed method gives a more complete target shape and a smoother boundary. In the near scene, MFFN, HitNet, FIRNet, Deeplab-FusionNet, and U-Net fail to segment the target completely. Although MASNet segments a complete fish body, its segmentation confidence is low in the area of color change inside the fish body, and the boundary is uncertain. The proposed method segments the target shape completely and produces a clearer and smoother boundary.

The multi-scene visual analysis above shows that the proposed method offers clear advantages in segmentation robustness and accuracy in complex underwater environments. It maintains the integrity of the target structure in optically degraded scenes such as blur and exposure. It discriminates well between targets and similar-colored backgrounds, overlapping targets, and obstructions, which significantly reduces the false segmentation rate. It also improves the segmentation of internal texture changes and boundary shapes of fish in close-range and multi-target scenes. The method thus addresses three core problems of underwater fish segmentation caused by optical degradation, namely blurred contours, missing translucent tissue, and complex background interference, and it provides more reliable technical support for underwater ecological monitoring.

5.4. Ablation Experiments

To verify the contribution of each proposed module to the model's segmentation capability, ablation experiments were carried out under the same experimental conditions. The results are listed in Table 3.

Table 3. Experimental results of multi-module ablation.

Compared with the baseline model A, model B, which embeds the dynamic prompt module, improves on all metrics. In particular, mDice increases by 4.9 percentage points, mIoU by 2.0, $S_{α}$ by 2.6, and $F_{β}$ by 4.2; $E_{φ}$ rises from 0.810 to 0.915, a gain of 10.5 percentage points; and MAE decreases from 0.029 to 0.023. This shows that the dynamic prompt module contributes substantially to the segmentation performance of the model. It improves the model's ability to capture the details of underwater fish, particularly in describing fish shapes more completely, which makes the segmentation results more realistic. In addition, the dynamic prompt module adjusts the model's attention to features according to the fish features in the input, which enhances the generalization ability of the model in complex underwater environments and enables it to extract key features stably in different scenes.

Model C adds the boundary enhancement module to model B. Compared with model B, mIoU increases from 0.876 to 0.882, $S_{α}$ increases from 0.854 to 0.857, and $E_{φ}$ increases from 0.915 to 0.918. This indicates that the boundary enhancement module effectively refines the segmentation. It allows the model to capture underwater fish boundaries more precisely, which improves segmentation accuracy and quality. Furthermore, it improves the connectivity of the segmentation results and mitigates fragmentation.

RetinalCoNet (model D), the model proposed in this paper, improves further after the bionic retinal dual-channel module is integrated into model C. mDice increases slightly by 0.5 percentage points, mIoU increases from 0.882 to 0.892 (1.0 percentage point), and $F_{β}$ and $E_{φ}$ each increase by 0.6 percentage points. This demonstrates that the bionic retina dual-channel module successfully integrates multi-scale features, further refines the segmentation results, and enhances the model's adaptability and robustness in complex underwater environments.
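The step-by-step gains discussed above can be recomputed from Table 3 as differences between consecutive configurations. The sketch below copies the table rows and reports each step in percentage points; the dictionary layout and the function name `stepwise_deltas` are choices made for this illustration.

```python
# Rows copied from Table 3 (A: baseline, B: + dynamic prompt,
# C: + boundary enhancement, D: + bionic retinal dual-channel).
METRICS = ["mDice", "mIoU", "S_alpha", "F_beta", "E_phi", "MAE"]
TABLE3 = {
    "A": [0.768, 0.856, 0.828, 0.784, 0.810, 0.029],
    "B": [0.817, 0.876, 0.854, 0.826, 0.915, 0.023],
    "C": [0.818, 0.882, 0.857, 0.828, 0.918, 0.023],
    "D": [0.823, 0.892, 0.856, 0.834, 0.924, 0.023],
}


def stepwise_deltas(table, order=("A", "B", "C", "D")):
    """Change of every metric between consecutive experiments, in percentage points."""
    deltas = {}
    for prev, curr in zip(order, order[1:]):
        deltas[f"{prev}->{curr}"] = {
            metric: round((after - before) * 100, 1)
            for metric, before, after in zip(METRICS, table[prev], table[curr])
        }
    return deltas


deltas = stepwise_deltas(TABLE3)
```

Reading the result step by step separates the contribution of each module. For example, `deltas["A->B"]["E_phi"]` is the gain attributed to the dynamic prompt module alone.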

6. Conclusions

In this paper, a RetinalCoNet network for underwater fish segmentation is proposed to solve the problems of light attenuation, a blurred boundary, and low contrast caused by a complex underwater environment. The network simulates the separation and processing mechanism of light and dark signals by embedding the bionic retina dual-channel module in the encoder, and it enhances the feature extraction ability of the fuzzy target contour and translucent tissue. The dynamic prompt module is introduced, and the response of key features is enhanced by inputting an adaptive prompt template to suppress the noise interference of the water body. The edge prior guidance mechanism is integrated in the decoder, and the low-contrast boundary features are dynamically enhanced by conditional normalization. Through multi-module synergy, RetinalCoNet has achieved an excellent performance on several key indicators, superior to other mainstream segmentation models. In addition, RetinalCoNet also performs well in different scenes, which can effectively deal with the problems of blurred boundaries and low contrast in a complex underwater environment and realize accurate fish segmentation. This research provides important technical support for underwater ecological monitoring and fishery management, and it helps to promote the development and application of underwater image segmentation technology.

Currently, this model integrates a large number of feature-processing modules and does not fully consider model lightweighting and real-time issues. Future research will focus on exploring the development of lightweight versions to improve the deployment efficiency of the model on edge devices, making it more suitable for actual detection applications.

Author Contributions

Conceptualization, S.Z.; methodology, J.Z.; investigation, Y.F. and Z.L.; software, Y.F. and J.L. (Jinfang Liu); validation, J.L. (Junde Lu); writing—original draft preparation, Y.F. and J.Z.; writing—review and editing, J.L. (Jinfang Liu) and J.L. (Junde Lu). All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported partly by the Natural Science Foundation of Guangdong Province under grant 2022B1515120059, Innovation Team Project of Universities in Guangdong Province under grant 2021KCXTD019, Science and Technology Planning Project of Yunfu under grants 2023020202 and 2023020203, Science and Technology Program of Guangzhou under grants 2023E04J1238 and 2023E04J1239, Guangdong Science and Technology Project under grant 2020B0202080002, Major Science and Technology Special Projects in Xinjiang Uygur Autonomous Region under grant 2022A02011, Undergraduate Teaching Quality Project in Guangdong Province: Teaching and Research Section of Artificial Intelligence Curriculum Group (Guangdong Higher Education Letter [2024] No. 9), and Guangdong Postgraduate Education Innovation Plan Project (No. 2024JGXM_090).

Data Availability Statement

The source code of RetinalCoNet is publicly available at https://github.com/jianhuazheng/RetinalCoNet (accessed on 6 August 2025). For questions about the implementation, please contact the corresponding author.

Acknowledgments

The authors acknowledge any support given which is not covered by the Author Contributions or Funding sections.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Chuang, M.C.; Hwang, J.N.; Williams, K.; Towler, R. Automatic fish segmentation via double local thresholding for trawl-based underwater camera systems. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 3145–3148.
Baloch, A.; Ali, M.; Gul, F.; Basir, S.; Afzal, I. Fish Image Segmentation Algorithm (FISA) for improving the performance of image retrieval system. Int. J. Adv. Comput. Sci. Appl. 2017, 8, 396–403.
Spampinato, C.; Giordano, D.; Di Salvo, R.; Chen-Burger, Y.H.J.; Fisher, R.B.; Nadarajan, G. Automatic fish classification for underwater species behavior understanding. In Proceedings of the First ACM International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams, New York, NY, USA, 29 October 2010; pp. 45–50.
Liu, R.; Jiang, Z.; Yang, S.; Fan, X. Twin adversarial contrastive learning for underwater image enhancement and beyond. IEEE Trans. Image Process. 2022, 31, 4922–4936.
Kim, Y.H.; Park, K.R. PSS-net: Parallel semantic segmentation network for detecting marine animals in underwater scene. Front. Mar. Sci. 2022, 9, 1003568.
Liu, L.; Yu, W. Underwater image saliency detection via attention-based mechanism. J. Phys. Conf. Ser. 2022, 2189, 012012.
Liu, F.; Fang, M. Semantic segmentation of underwater images based on improved Deeplab. J. Mar. Sci. Eng. 2020, 8, 188.
Hambarde, P.; Murala, S.; Dhall, A. UW-GAN: Single-image depth estimation and image enhancement for underwater images. IEEE Trans. Instrum. Meas. 2021, 70, 5018412.
Dudhane, A.; Hambarde, P.; Patil, P.; Murala, S. Deep underwater image restoration and beyond. IEEE Signal Process. Lett. 2020, 27, 675–679.
Fu, Z.; Chen, R.; Huang, Y.; Cheng, E.; Ding, X.; Ma, K.-K. MASNet: A robust deep marine animal segmentation network. IEEE J. Ocean. Eng. 2023, 49, 1104–1115.
Chen, I.-H.; Belbachir, N. Using Mask R-CNN for underwater fish instance segmentation as novel objects: A proof of concept. In Proceedings of the Northern Lights Deep Learning Workshop, Tromsø, Norway, 10–12 January 2023; Volume 4.
Yang, Y.; Li, D.; Zhao, S. A novel approach for underwater fish segmentation in complex scenes based on multi-levels triangular atrous convolution. Aquac. Int. 2024, 32, 5215–5240.
Chicchon, M.; Bedon, H.; Del-Blanco, C.R.; Sipiran, I. Semantic segmentation of fish and underwater environments using deep convolutional neural networks and learned active contours. IEEE Access 2023, 11, 33652–33665.
Shen, C.; Wu, Y.; Qian, G.; Wu, X.; Cao, H.; Wang, C.; Tang, J.; Liu, J. Intelligent bionic polarization orientation method using biological neuron model for harsh conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 789–806.
Pu, Y.; Hang, Z.; Wang, G.; Hu, H. Bionic artificial lateral line underwater localization based on the neural network method. Appl. Sci. 2022, 12, 7241.
Li, H.; Tian, F.; Deng, S.; Wu, Z.; Zhao, L. Mammalian olfaction-inspired spike-coded bionic neural network. IEEE Trans. Instrum. Meas. 2024, 73, 6505411.
Hu, Y.; Li, Z.; Lu, Z.; Jia, X.; Wang, P.; Liu, X. Identification method of crop aphids based on bionic attention. Agronomy 2024, 14, 1093.
Gu, T.; Liu, S.; Pu, Q.; Wang, J.; Wang, B.; Hu, X.; Sun, P.; Li, Q.; Zhu, L.; Lu, G. A visual-olfactory bionic sensing system bioinspired from zebrafish for confusable liquid localization and recognition. Sens. Actuators B Chem. 2025, 441, 138053.
Liang, Z.; Lin, Z.; Li, X.; Zou, X. A bionic vision method for extracting motion information of small-target in cotton field backgrounds. In Proceedings of the International Conference on Optical and Photonic Engineering, Bellingham, WA, USA, 14–16 April 2025; Volume 13509.
Xia, B.; Zhan, B.; Shen, M.; Yang, H. Explicit-implicit prior knowledge-based diffusion model for generative medical image segmentation. Knowl. Based Syst. 2024, 303, 112426.
Jiang, W.; Zhang, D.; Hui, G. A dual-branch fracture attribute fusion network based on prior knowledge. Eng. Appl. Artif. Intell. 2023, 127, 107383.
Zhang, Y.; Meng, N.; Zhao, M.; Zhang, T. RASpine: Regional attention lateral spinal segmentation based on anatomical prior knowledge. In Proceedings of the 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 15–19 July 2024; pp. 1–4.
Ding, D.; Li, J.; Wang, H.; Wang, K.; Feng, J.; Xiao, M. ApplianceFilter: Targeted electrical appliance disaggregation with prior knowledge fusion. Appl. Energy 2024, 365, 123157.
Yan, J.; Zhang, Y.; Hu, J.; Cui, H.; Chi, J.; Yang, G.; Chen, C.; Yu, T. Prior-based bi-encoder transformer for underwater image enhancement. Multimed. Syst. 2025, 31, 3.
Choi, M.; Han, K.; Wang, X.; Zhang, Y.; Liu, Z. A dual-stream neural network explains the functional segregation of dorsal and ventral visual pathways in human brains. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), New York, NY, USA, 10–16 December 2023; pp. 50408–50428.
Potlapalli, V.; Zamir, S.W.; Khan, S.; Khan, F.S. PromptIR: Prompting for all-in-one blind image restoration. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New York, NY, USA, 10 December 2023; pp. 71275–71293.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Berlin/Heidelberg, Germany, 2015.
Zheng, D.; Zheng, X.; Yang, L.; Gao, C.; Zhu, Y.; Ruan, M. MFFN: Multi-view feature fusion network for camouflaged object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Hawaii, HI, USA, 2–7 January 2023; pp. 6221–6231.
Hu, X.; Wang, S.; Qin, X.; Dai, H.; Ren, W.; Luo, D.; Tai, Y.; Shao, L. High-resolution iterative feedback network for camouflaged object detection. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 881–889.
Ge, Y.; Ren, J.; Zhang, C.; He, M.; Bi, H.; Zhang, Q. Feature-aware and iterative refinement network for camouflaged object detection. Vis. Comput. 2024, 41, 4741–4758.
Liu, C.; Yao, H.; Qiu, W.; Cui, H.; Fang, Y.; Xu, A. Multi-scale feature map fusion encoding for underwater object segmentation. Appl. Intell. 2025, 55, 163.

Figure 1. Underwater fish segmentation network based on bionic retinal dual-channel and multi-module cooperation.

Figure 2. Structural diagram of Interactive Encoder Block.

Figure 3. Bionic retinal dual-channel module.

Figure 4. Structural diagram of Dynamic Prompt Block.

Figure 5. Structural diagram of prompt generator.

Table 1. Confusion matrix.

	Predicted Positive	Predicted Negative
Actual positive	True Positive (TP)	False Negative (FN)
Actual negative	False Positive (FP)	True Negative (TN)

Table 2. Comparative experimental results of different models.

Model	mDice	mIoU	$S_{α}$	$F_{β}$	$E_{φ}$	MAE
Unet	0.768	0.856	0.828	0.784	0.810	0.029
MFFN	0.566	0.465	0.709	0.589	0.757	0.072
HitNet	0.510	0.420	0.677	0.548	0.752	0.070
FIRNet	0.532	0.425	0.673	0.524	0.781	0.090
MASNet	0.630	0.544	0.752	0.833	0.772	0.047
Deeplab-FusionNet	0.475	0.812	0.725	0.752	0.643	0.069
Ours	0.823	0.892	0.856	0.834	0.924	0.023

Table 3. Experimental results of multi-module ablation.

Experiment Number	Dynamic Prompt Block	Boundary Enhance Block	Bionic Retinal Dual-Channel Block	mDice	mIoU	$S_{α}$	$F_{β}$	$E_{φ}$	MAE
A				0.768	0.856	0.828	0.784	0.810	0.029
B	√			0.817	0.876	0.854	0.826	0.915	0.023
C	√	√		0.818	0.882	0.857	0.828	0.918	0.023
D	√	√	√	0.823	0.892	0.856	0.834	0.924	0.023

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
