Article

Salient Object Detection Guided Fish Phenotype Segmentation in High-Density Underwater Scenes via Multi-Task Learning

1
Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200092, China
2
College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China
3
Qingdao Conson Oceantec Valley Development Co., Ltd., Qingdao 266237, China
4
Sanya Oceanographic Institution, Ocean University of China, Sanya 572011, China
5
East China Sea Fishery Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200090, China
*
Authors to whom correspondence should be addressed.
Fishes 2025, 10(12), 627; https://doi.org/10.3390/fishes10120627
Submission received: 13 November 2025 / Revised: 3 December 2025 / Accepted: 5 December 2025 / Published: 6 December 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Aquaculture)

Abstract

Phenotyping technologies are essential for modern aquaculture, particularly for precise analysis of individual morphological traits. This study focuses on critical phenotype segmentation tasks for fish carcass and fins, which have significant applications in phenotypic assessment and breeding. In high-density underwater environments, fish frequently exhibit structural overlap and indistinct boundaries, making it difficult for conventional segmentation methods to obtain complete and accurate phenotypic regions. To address these challenges, a double-branch segmentation network is proposed for fish phenotype segmentation in high-density underwater scenes. An auxiliary salient object detection (SOD) branch is introduced alongside the primary segmentation branch to localize structurally complete targets and suppress interference from overlapping or incomplete fish, while inter-branch skip connections further enhance the model’s focus on salient targets and their boundaries. The network is trained under a multi-task learning framework, allowing the branches to specialize in edge detection and accurate region segmentation. Experiments on large yellow croaker (Larimichthys crocea) images collected under real farming conditions show that the proposed method achieves Dice scores of 97.58% for carcass segmentation and 88.88% for fin segmentation. The corresponding ASD values are 0.590 and 0.364 pixels, and the HD95 values are 3.521 and 1.222 pixels. The method outperforms nine existing algorithms across key metrics, confirming its effectiveness and reliability for practical aquaculture phenotyping.
Key Contribution: In this work, we propose a double-branch multi-task learning framework guided by salient object detection for fish phenotype segmentation in high-density aquaculture scenes. The auxiliary saliency branch enhances target localization and boundary cues, while inter-branch connections strengthen attention to salient individuals. Joint regression–segmentation supervision enables specialized boundary detection and accurate segmentation. The proposed approach mitigates false-positive segmentation results caused by occlusions and overlapping individuals, substantially improving the precision and robustness of individual fish phenotype segmentation.

1. Introduction

With the development of precision agriculture, phenotyping technologies have become increasingly important for assessing individual traits and guiding farming practices [1]. In the field of fisheries, accurate measurements of fish carcass and fins are of considerable scientific and practical importance, particularly in the context of phenotypic assessment and breeding. By leveraging these measured results, aquaculture professionals can gain insight into the growth conditions of fish populations, allowing them to adjust feeding strategies and stocking densities effectively [2]. Furthermore, measurement of fish fins is vital not only for assessing swimming capabilities and health of fish, but also for supporting genomics, behavioral studies [3], and breeding optimization [4]. Conventional measurement techniques are predominantly based on manual operations, which pose substantial limitations in contemporary intensive aquaculture contexts. In scenarios characterized by high stocking densities and extensive fish populations, manual measurement methods are not only inefficient, but also susceptible to errors, thereby complicating the fulfillment of requirements for speed and precision. Furthermore, these manual approaches frequently require the capture or handling of fish, potentially inflicting varying degrees of harm, which can adversely affect their growth and health. Consequently, the development of non-contact, non-invasive fish phenotype recognition and measurement techniques has emerged as a pivotal challenge in modern aquaculture [5].
Image segmentation, as a significant branch of computer vision, has demonstrated extensive potential in non-contact measurement applications in recent years. The core of image segmentation involves the division of an image’s pixels into distinct and meaningful clusters, wherein pixels within each cluster exhibit similar characteristics or correspond to a specific structural element with particular significance in the image. However, in high-density underwater aquaculture scenes, occlusion and overlap between individual fish are particularly pronounced [6]. This makes it crucial for the model to focus on salient targets in the image, particularly when accurate segmentation of the entire fish is needed for measurements. Furthermore, fish fins in aquatic environments frequently exhibit semi-transparency, which obscures the demarcation between the fins, the surrounding water, and the fish carcass [7]. Consequently, segmentation results are expected to exhibit high boundary precision and enhanced robustness.
In early studies, classical image processing techniques such as thresholding, edge detection, and active contour models were commonly employed for fish segmentation [8,9]. These methods are highly dependent on image quality and the shooting environment, often requiring clear color contrast between the target and the background or the presence of only a single target within the segmentation area. In contrast, deep learning approaches, particularly those based on convolutional neural networks (CNNs), have demonstrated significantly better performance in handling image segmentation tasks under complex environmental conditions. These methods can automatically learn and extract abstract features such as color and texture patterns from annotated data, thereby enabling more robust and accurate segmentation. In the domain of image segmentation, UNet [10], known for its encoder–decoder architecture, is widely adopted. By leveraging multi-scale feature extraction through its encoder–decoder structure and incorporating rich skip connections, UNet achieves strong performance in segmenting targets of varying sizes. Most existing fish segmentation models are based on adaptations or enhancements of UNet. For example, Yu et al. [11] proposed a fish segmentation model based on an improved version of UNet, while Li et al. [12] introduced a model that integrates UNet with atrous spatial pyramid pooling (ASPP). These studies consistently demonstrate the notable advantages of deep learning methods over traditional image processing techniques.
In real-world aquaculture settings, fish are typically cultured at high densities, resulting in frequent occlusions and overlaps among individuals. For accurate measurement, it is essential to segment and analyze only structurally complete individuals within the image. However, most existing segmentation approaches have been developed for out-of-water or controlled laboratory settings [11,13], and their effectiveness in dense, real-world aquaculture conditions remains underexplored. While some semantic segmentation models designed for underwater scenarios are capable of global segmentation, they often fail to distinguish individual fish [14]. Conversely, certain instance segmentation methods can identify and segment individual targets but lack the capacity for multi-class semantic segmentation and may mistakenly include incomplete individuals in the results [15]. Figure 1 illustrates the differences between the application scenarios targeted in this study and existing segmentation tasks.
To overcome the challenges of incomplete segmentation and imprecise boundary delineation commonly encountered in high-density underwater scenes, we propose a novel multi-task cooperative segmentation model that incorporates the SOD task from computer vision. In such environments, a single image often contains multiple fish of the same species, including individuals whose bodies are partially occluded, truncated, or tightly adjacent to others. Since accurate phenotype measurement relies on the integrity of body contours, this study treats the individual with the most complete visible body within each image as the salient target. Other fish that are structurally incomplete or partially occluded by the salient individual are regarded as non-salient interfering objects. This setting naturally introduces the need to discriminate the primary measurement target from neighboring, visually similar distractors, particularly in scenarios where individuals touch or overlap. The proposed architecture comprises two parallel task branches: one dedicated to SOD and the other to semantic segmentation. Both branches are constructed upon an enhanced DeepLabV3+ backbone and are trained using separate loss functions. Specifically, the SOD branch is optimized with a boundary regression loss, while the segmentation branch is supervised using a standard semantic segmentation loss. To enable effective feature interaction between branches, skip connections are introduced for inter-branch communication. This design allows the model to better identify structurally complete fish individuals and reduces misclassifications caused by fragmented or overlapping targets. Additionally, the SOD task is refined through a boundary-specific regression module, which strengthens the model’s ability to capture the structural edge information of fish. Such structural edge information constitutes the fine-grained semantic information required by segmentation models to distinguish overlapping individuals and to produce more accurate phenotype boundaries.
In summary, our main contributions are as follows:
(1) We propose a multi-task cooperative segmentation model that integrates SOD with semantic segmentation to improve fish phenotype segmentation accuracy in densely crowded aquaculture environments.
(2) We develop a task-specific feature-sharing mechanism with parallel branches and skip connections, which enhances segmentation accuracy while reducing errors in detecting fragmented or incomplete fish.
(3) We introduce a boundary-specific regression task that strengthens the model’s focus on boundary features during training, thereby further improving overall segmentation performance.
The rest of this paper is organized as follows. Section 2 reviews existing research. Section 3 introduces the network architecture. Section 4 details the experiments and presents findings. Section 5 discusses implications. Finally, we conclude this work in Section 6.

2. Related Works

Given that the proposed fish phenotype segmentation model integrates both an SOD task and a semantic segmentation task, with particular emphasis on edge refinement, this literature review is organized into the following sections: (1) Fish segmentation tasks; (2) Image segmentation methods; and (3) Salient object detection.

2.1. Fish Segmentation Tasks

Current fish segmentation tasks can be broadly categorized into three types: (1) Semantic segmentation of fish groups without individual distinction, such as the enhanced cascaded decoder network (ECD-Net) proposed by Li et al. [14] for marine animal segmentation, SUR-Net developed by Lin et al. [16] for wild fish segmentation in underwater scenes, and LCFCN introduced by Laradji et al. [17] for counting fish via localization-based fully convolutional networks. These methods are primarily used for rough estimation of marine biomass but often fail to achieve precise identification of individual targets. (2) Instance segmentation with individual distinction, which involves two-stage models based on region proposal network (RPN) [18,19,20], as well as single-stage approaches such as the model proposed by Ye et al. [15], which integrates body features and motion characteristics for fish detection and segmentation. While these methods can effectively localize individual fish in a group, they are typically limited to coarse-grained identification of individuals due to framework constraints and cannot distinguish detailed anatomical parts. (3) Detailed structural segmentation of individual fish, including methods such as the body/part segmentation approach for tilapia proposed by Fernandes et al. [13], the improved UNet for oval squid segmentation developed by Yu et al. [11], and the multi-part segmentation method for fish bodies introduced by Li et al. [12]. These methods enable precise segmentation of critical anatomical regions, which can support accurate quantification of key areas (e.g., meaty parts) while avoiding interference from non-dorsal regions like fins or tails.
However, most existing approaches are designed for simple scenarios with single individuals and are not well-suited for practical applications in complex, real-world production environments where fish may be densely packed. Therefore, a method is urgently needed to accurately segment multiple parts of fish in cluttered, high-density scenes while maintaining computational efficiency.

2.2. Image Segmentation Methods

UNet [10] is a classical model in semantic segmentation tasks. Based on CNNs, it implements the fusion of low-level and high-level features through an encoder–decoder architecture and introduces skip connection mechanisms to improve prediction accuracy at object boundaries. UNet++ [21] further optimizes UNet by adopting nested and dense skip connections, enhancing the model’s ability to capture detailed information and improving spatial consistency in segmentation results. However, both methods have limitations in global perception. They primarily rely on convolution kernels in the encoder’s final layer to obtain contextual information, which may lead to insufficient understanding of distant contextual relationships in complex scenarios (e.g., high morphological similarity between objects or blurred boundaries). To address this, Chen et al. proposed DeepLabV3 [22] using a dilated convolution strategy. By stacking convolutions with different dilation rates in the encoder’s final layer (i.e., the atrous spatial pyramid pooling module, ASPP), DeepLabV3 enhances its ability to utilize global contextual information and better captures distant dependencies between objects. The improved version, DeepLabV3+ [23], further addresses these limitations by introducing skip connection structures similar to UNet, significantly improving segmentation accuracy and robustness. In recent years, the Transformer architecture has gained attention for its ability to model long-range dependencies in natural language processing (NLP). By leveraging self-attention mechanisms, it effectively captures long-distance relationships in images and has inherent advantages in learning multi-scale features. Consequently, some studies have explored applying Transformer structures to image segmentation tasks to enhance global perception and detailed expression capabilities [24,25].
However, Transformer-based models lack the structured inductive biases inherent to CNN methods. This makes it difficult for them to fully learn long-range dependencies in images with limited datasets and can lead to unstable training. Additionally, when dealing with incomplete or similar objects in images, previous methods often produce varying degrees of erroneous segmentation in non-critical regions. Considering that annotations are typically limited in real-world fisheries applications, this work uses DeepLabV3+ as the baseline model. We attempt to enhance the model’s focus on core targets by incorporating an SOD task from computer vision. Figure 2 shows the basic architecture of the previous methods and our proposed method.

2.3. Salient Object Detection

The human visual system possesses a selective attention mechanism that allows it to automatically focus on important targets in complex scenes or environments densely populated with objects, while maintaining sustained attention on specific regions for detailed analysis [26]. SOD is a task designed to mimic this natural cognitive process [27]. With the advancement of computer vision technologies, SOD has demonstrated extensive applicability across multiple domains. For instance, it is used in medical image processing to locate microscopic structures [28], in remote sensing data analysis to extract key geographical information [29], in industrial production for defect detection [30], and in robotics navigation for obstacle recognition and avoidance [31]. Leveraging the inherent focus on important targets, researchers have begun integrating SOD with other visual tasks to develop multi-task learning models that enhance overall performance. For example, Khattar et al. [32] proposed a multi-task learning framework combining SOD and target recognition, achieving improved accuracy in target recognition through shared feature representation and mutual task promotion. Liu et al. [33], addressing the challenge of small-object detection in infrared imaging, designed a method integrating a saliency detection module with a feature pyramid network. By guiding the model to focus on small-target regions, this approach significantly enhanced detection performance in complex scenarios.
In this study, we incorporate SOD technology into fish phenotype segmentation tasks by establishing inter-task skip connections, directing the segmentation model to prioritize attention on salient regions. This method enhances overall segmentation accuracy by reducing mis-segmentation issues in non-critical areas, offering a novel solution for fish phenotype segmentation tasks in high-density underwater scenes.

3. Methods

An SOD-guided fish phenotype segmentation method is proposed in this study. SOD and semantic segmentation tasks are integrated into a multi-task learning framework to achieve precise segmentation of fish in high-density underwater scenes. Additionally, an attention map annotation based on edge-distance mapping is introduced to enhance the segmentation model’s attention to the edge structures of fish. The method details are presented in the following sections.

3.1. Multi-Task Learning Framework

The proposed multi-task learning framework is illustrated in Figure 3. This framework comprises two relatively independent branches: the multi-class segmentation branch and the SOD branch. The former utilizes the original image as input, employing segmentation masks for training purposes. Similarly to conventional segmentation models, this branch focuses on enhancing the capacity to extract fish texture features from raw underwater images. On the other hand, the latter branch incorporates an edge-mixed mapping derived from the edge-mixed extractor as its input and employs a boundary distance map to obtain a salient boundary map for the regression task. This branch is designed to filter out non-target objects beyond the main subject and reinforce the boundary features of fish. Through inter-branch skip connections, feature maps generated by the SOD branch are transmitted to the multi-class segmentation branch, thereby enhancing structural boundary features and providing clearer target and edge attention for detailed semantic segmentation tasks.
As the SOD task shares similarities with multi-class segmentation tasks in requiring attention to object details at different scales, both task branches adopt a classical multi-layer encoder–decoder architecture. Through four layers of encoding (Layer 0 to Layer 4) and continuous down-sampling, the model is capable of simultaneously extracting low-level and high-level semantic information from the two original inputs. Considering that segmentation targets can be effectively focused under the guidance of SOD tasks but may increase model complexity, the proposed framework does not employ layer-wise skip connections like UNet. Instead, a single branch-specific skip connection is implemented after Layer 1. Additionally, cross-branch skip connections are executed after Layer 1 and Layer 4 to enhance the segmentation model’s attention to salient fish targets and their structural edges. Correspondingly, in the decoding phase, spatial detail information is restored through up-sampling and fusion of cross-branch skip connections. This process facilitates the generation of refined segmentation results and saliency regression predictions. All skip connections are realized through channel concatenation combined with feature mapping via 1 × 1 convolutions.
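To make this design concrete, the following minimal PyTorch sketch shows one way such a cross-branch skip connection could be implemented as channel concatenation followed by a 1 × 1 convolution; the module name, channel counts, and feature-map sizes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossBranchSkip(nn.Module):
    """Illustrative cross-branch skip: concatenate the SOD-branch and
    segmentation-branch feature maps along the channel axis, then project
    back to the segmentation branch's channel count with a 1x1 convolution."""
    def __init__(self, seg_channels: int, sod_channels: int):
        super().__init__()
        self.project = nn.Conv2d(seg_channels + sod_channels, seg_channels, kernel_size=1)

    def forward(self, seg_feat: torch.Tensor, sod_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([seg_feat, sod_feat], dim=1)  # channel concatenation
        return self.project(fused)                      # 1x1 feature mapping

# Example: fusing Layer-1 features of both branches (shapes are illustrative)
skip = CrossBranchSkip(seg_channels=64, sod_channels=64)
out = skip(torch.randn(1, 64, 72, 128), torch.randn(1, 64, 72, 128))
```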
At the end of the encoder, the ASPP module from DeepLabV3 [22] is employed. This module consists of multiple parallel convolution kernels with different dilation rates. The increase in computing burden is limited due to the unchanged kernel size; however, the receptive field is significantly expanded through the introduction of dilated convolutions. By combining results from convolutions with various dilation rates, a comprehensive feature representation that incorporates multiscale information is formed. This design enhances the model’s adaptability to complex scenarios and diverse object shapes while effectively mitigating the risk of overfitting. More details of the multi-task learning framework layers can be found in Table 1.
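For reference, a simplified sketch of an ASPP-style block is shown below. The dilation rates (1, 6, 12, 18) are the common DeepLabV3 defaults and are assumed here, and the 1 × 1 and image-pooling branches of the original ASPP are omitted for brevity.

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Simplified ASPP: parallel 3x3 convolutions with different dilation rates
    keep the kernel size (and thus the parameter count) fixed while enlarging
    the receptive field; branch outputs are concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]  # multi-rate responses
        return self.fuse(torch.cat(feats, dim=1))        # multiscale fusion
```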

3.2. Salient-Boundary Map and Edge-Mixed Extractor

In this study, a salient object detection branch is introduced to locate the most prominent individual fish within densely populated underwater imagery. Traditional SOD methods mainly focus on identifying globally salient regions, yet this emphasis often leads to insufficient attention to the fine-grained boundary information required for segmentation. Such boundary cues are essential for achieving accurate phenotype segmentation. At the same time, the anatomical complexity of fish, particularly the low contrast and blurred transitions between the body and fins, further complicates the extraction of precise contours. These factors together make the task highly challenging. To address this challenge, a novel edge-distance mapping method for constructing SOD ground truth labels is proposed. The motivation behind this approach is to encourage the feature extractor of the saliency regression branch to focus not only on the segmentation target itself but also on its boundary regions.
Specifically, the SOD branch’s true labels are derived from segmentation mask annotations combined with multi-class boundary distance information. For the two independent categories (fish carcass and fins) in our fish phenotype segmentation task, the Euclidean distance transformation is first computed to obtain pixel-level distances to the nearest boundary of each category:
$$\mathrm{InDis}_i = \begin{cases} \inf_{\hat{i} \in S_c} \left\| i - \hat{i} \right\|_2, & i \in S_{c,\mathrm{in}} \\ 0, & i \in S_{c,\mathrm{out}} \end{cases}$$
where $i$ indexes pixels in the segmentation mask, $S_c$ denotes the segmentation boundary of the $c$-th category, $c \in \{1, 2\}$, and $S_{c,\mathrm{in}}$ and $S_{c,\mathrm{out}}$ denote the sets of pixels inside and outside the $c$-th category region, respectively. Subsequently, these distance maps are normalized using min-max normalization:
$$\mathrm{Dis}_{\mathrm{norm}}^{c} = \begin{cases} 1 - \mathrm{MinMax}\left( \mathrm{InDis}_i \right), & i \in S_{c,\mathrm{in}} \\ 0, & i \in S_{c,\mathrm{out}} \end{cases}$$
Finally, the multi-class boundary distance information is fused to construct the final salience-edge map:
$$\mathrm{Dis}_{\mathrm{fuse}} = \mathrm{Dis}_{\mathrm{norm}}^{1} + \mathrm{Dis}_{\mathrm{norm}}^{2}$$
As shown in Figure 4, this process highlights pixels near segmentation boundaries with higher values, while assigning lower values to those farther from the edges. This characteristic encourages the SOD branch to focus on structural boundary regions during training, ultimately providing richer edge information to the semantic segmentation branch for more accurate predictions.
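A minimal NumPy/SciPy sketch of this label-construction procedure is given below. It assumes the mask encodes background as 0 and the two categories (carcass and fins) as 1 and 2, and mirrors the three equations above: per-class Euclidean distance transform, min-max normalization, inversion, and fusion.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def salient_boundary_map(seg_mask: np.ndarray, num_classes: int = 2) -> np.ndarray:
    """Build the salience-edge regression target from an integer segmentation mask.

    For each class, the Euclidean distance of every interior pixel to the class
    boundary is computed, min-max normalized, and inverted so that pixels close
    to the boundary receive values near 1; the per-class maps are then summed.
    """
    fused = np.zeros(seg_mask.shape, dtype=np.float32)
    for c in range(1, num_classes + 1):
        inside = seg_mask == c
        if not inside.any():
            continue
        # distance from each interior pixel to the nearest exterior pixel (the boundary)
        dist = distance_transform_edt(inside)
        d = dist[inside]
        norm = (dist - d.min()) / (d.max() - d.min() + 1e-8)
        fused[inside] += 1.0 - norm[inside]  # high near the boundary, low toward the centre
    return fused
```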
Given that the original images contain rich color and texture information, an Edge-mixed extractor is proposed to enhance the image edges and overall structural information for better compatibility with the SOD regression branch. This preprocessing is conducted prior to inputting into the SOD branch and consists of three steps: smoothing filtering, edge extraction, and image fusion.
In classical image processing tasks, edge detection operators can effectively extract structural edges from images. However, underwater images often suffer from high levels of impulse noise due to limited lighting conditions. Directly performing edge detection on raw data often leads to sparse, scattered, and discontinuous edge outputs. Consequently, we introduce smoothing filtering as a preprocessing step to enhance the quality of edge detection results. Underwater imaging commonly contains impulse-like artifacts caused by floating particles, isolated light spots, and other scattering-related disturbances, which make edge preservation challenging. Prior studies have shown that median filters preserve edge structures more effectively than linear filters such as mean filters when dealing with impulse noise [34]. In this work, a median filter with a larger kernel ( Ω = 9 ) is employed for preprocessing the raw images. Empirical testing has shown that this filter strikes an optimal balance between noise reduction and detail preservation, suppressing impulse noise in underwater environments while preserving fine image details:
$$x_s = \mathrm{median}\left\{ x_k \mid k \in \Omega \right\}$$
where $x_k$ denotes the pixels of the input image within the filtering window $\Omega$. After performing the median filtering, the Sobel edge extractor is used to extract the edge structures from the image. Compared to other complex edge detection algorithms, the Sobel operator, implemented through convolution kernels, operates with high computational efficiency and is highly sensitive to edge orientation, thereby preserving more structural information effectively:
$$x_e = \sqrt{ \mathrm{Conv2D}\left( x_s, \omega_v \right)^2 + \mathrm{Conv2D}\left( x_s, \omega_h \right)^2 }$$
$$\omega_v = \begin{bmatrix} -1 & -2 & 0 & 2 & 1 \\ -4 & -8 & 0 & 8 & 4 \\ -6 & -12 & 0 & 12 & 6 \\ -4 & -8 & 0 & 8 & 4 \\ -1 & -2 & 0 & 2 & 1 \end{bmatrix}$$
$$\omega_h = \begin{bmatrix} -1 & -4 & -6 & -4 & -1 \\ -2 & -8 & -12 & -8 & -2 \\ 0 & 0 & 0 & 0 & 0 \\ 2 & 8 & 12 & 8 & 2 \\ 1 & 4 & 6 & 4 & 1 \end{bmatrix}$$
where $\omega_v$ and $\omega_h$ are the vertical and horizontal Sobel kernels, respectively. The Edge-Mixed map is then obtained by fusing the smoothed image with the edge map extracted using the Sobel operator:
$$x_{\mathrm{fuse}} = 0.5 \times x_e + 0.5 \times x_s$$
The processing results at various stages are shown in Figure 5.
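An illustrative OpenCV implementation of the three-step extractor (median smoothing, Sobel gradient magnitude, 50/50 fusion) is sketched below. The signs of the 5 × 5 kernels follow the standard extended Sobel form assumed in the reconstruction above, and rescaling the edge map before fusion is an added assumption for visual balance.

```python
import cv2
import numpy as np

# Extended 5x5 Sobel kernels (vertical and horizontal gradients)
SOBEL_V = np.array([[-1, -2, 0, 2, 1],
                    [-4, -8, 0, 8, 4],
                    [-6, -12, 0, 12, 6],
                    [-4, -8, 0, 8, 4],
                    [-1, -2, 0, 2, 1]], dtype=np.float32)
SOBEL_H = SOBEL_V.T  # the horizontal kernel is the transpose of the vertical one

def edge_mixed_map(image_gray: np.ndarray) -> np.ndarray:
    """Median smoothing -> Sobel gradient magnitude -> 50/50 fusion.
    Expects an 8-bit single-channel image (required by medianBlur with ksize > 5)."""
    smoothed = cv2.medianBlur(image_gray, 9)                   # 9x9 median filter
    smoothed_f = smoothed.astype(np.float32)
    gv = cv2.filter2D(smoothed_f, -1, SOBEL_V)                 # vertical gradient response
    gh = cv2.filter2D(smoothed_f, -1, SOBEL_H)                 # horizontal gradient response
    edge = np.sqrt(gv ** 2 + gh ** 2)
    edge = cv2.normalize(edge, None, 0, 255, cv2.NORM_MINMAX)  # rescale for fusion
    return 0.5 * edge + 0.5 * smoothed_f
```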

3.3. Loss Functions and Training Strategy

Deep supervision strategies are employed for the SOD branch and the semantic segmentation branch, which perform edge saliency regression and multi-class segmentation, respectively. This differentiated loss design effectively guides the two branches to focus on their respective task characteristics while achieving collaborative optimization through inter-branch skip connections.
In the SOD branch, the mean squared error (MSE) is employed as the loss function:
$$L_{\mathrm{Reg}} = \frac{1}{N} \sum_{i=1}^{N} \left( d_i - \hat{d}_i \right)^2$$
where $d_i$ represents the distance mapping value of pixel $i$ to the edge in the ground truth edge distance map, and $\hat{d}_i$ denotes the corresponding value in the predicted edge distance map. MSE is a widely used loss function for regression tasks, aiming to minimize the squared difference between predictions and labels. In this work, the MSE loss encourages the model to simultaneously attend to both the salient fish object and its edges in the image. For the semantic segmentation branch, a combination of Dice loss and cross-entropy (CE) loss is utilized:
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i} y_i^c \, \hat{y}_i^c}{\sum_{i} y_i^c + \sum_{i} \hat{y}_i^c}$$
$$L_{\mathrm{CE}} = - \sum_{i} y_i^c \log \hat{y}_i^c$$
where $y_i^c \in \{0, 1\}$ represents the ground truth binary indicator of the $c$-th class for pixel $i$ in the input, and $\hat{y}_i^c$ denotes the corresponding predicted probability. The combined loss can be defined as
$$L_{\mathrm{Seg}} = \alpha L_{\mathrm{Dice}} + \left( 1 - \alpha \right) L_{\mathrm{CE}}$$
where α = 0.5 is a weighting parameter. In this work, the background pixels in the images are significantly more numerous than the target regions (fish carcass and fins). To alleviate the pronounced class imbalance inherent in this multi-class setting, the hybrid Dice + CE loss is employed. The Dice loss increases the relative weighting of small regions in the normalized overlap computation, thereby enhancing the delineation of fine anatomical structures, such as fins, in complex underwater scenes. Meanwhile, the CE loss provides stable optimization by reinforcing class-level discrimination, ensuring reliable pixel-wise classification across all categories. By combining these complementary properties, the mixed loss function mitigates the adverse effects of class imbalance while maintaining both segmentation precision and training stability.
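A compact PyTorch sketch of the two loss terms, assuming integer-encoded segmentation labels and a single-channel regression target, is given below; it is an interpretation of the formulas above rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor,
                      alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Hybrid Dice + cross-entropy loss for the segmentation branch.
    logits: (B, C, H, W) raw class scores; target: (B, H, W) integer labels."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2.0 * intersection + eps) / (union + eps)).mean()
    return alpha * dice + (1.0 - alpha) * ce

def regression_loss(pred_map: torch.Tensor, target_map: torch.Tensor) -> torch.Tensor:
    """Mean squared error for the SOD branch's salient-boundary regression."""
    return F.mse_loss(pred_map, target_map)
```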
During training, an information transfer mechanism based on skip connections is implemented. Specifically, the two branches exchange feature information through inter-branch skip connections. However, in each training iteration, the forward and backward propagations are conducted independently for each branch, and the optimization processes for their respective tasks are kept separate. Given that the two branches have fundamentally different task objectives and loss functions (SOD focuses on edge regression while semantic segmentation emphasizes multi-class classification), allowing them to share weight update mechanisms could lead to task conflicts, thereby reducing model performance. This independent training strategy ensures that each branch remains focused on its specific task during optimization: the SOD branch efficiently learns salient and boundary features, while the semantic segmentation branch effectively captures global contextual information.
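The toy example below illustrates how such branch-independent optimization could be organized: each branch has its own Adam optimizer, and the features passed across branches are detached so that the segmentation loss does not update the SOD branch. The miniature stand-in networks, tensor shapes, and the reduction of inter-branch sharing to a single detached feature map are simplifications for illustration only.

```python
import torch
import torch.nn as nn

# Miniature stand-ins for the two branches (the real encoder-decoders are far larger)
sod_branch = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
seg_branch = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 1))

opt_sod = torch.optim.Adam(sod_branch.parameters(), lr=1e-3, betas=(0.9, 0.999))
opt_seg = torch.optim.Adam(seg_branch.parameters(), lr=1e-3, betas=(0.9, 0.999))

image = torch.randn(2, 3, 288, 512)            # raw underwater image
edge_mixed = torch.randn(2, 3, 288, 512)       # edge-mixed map (input to the SOD branch)
boundary_map = torch.rand(2, 1, 288, 512)      # salience-edge regression target
seg_mask = torch.randint(0, 3, (2, 288, 512))  # background / carcass / fin labels

# SOD branch: updated only by the boundary regression loss
sod_pred = sod_branch(edge_mixed)
loss_reg = nn.functional.mse_loss(sod_pred, boundary_map)
opt_sod.zero_grad(); loss_reg.backward(); opt_sod.step()

# Segmentation branch: receives detached SOD output, so gradients stay within this branch
seg_in = torch.cat([image, sod_pred.detach()], dim=1)
seg_logits = seg_branch(seg_in)
loss_seg = nn.functional.cross_entropy(seg_logits, seg_mask)
opt_seg.zero_grad(); loss_seg.backward(); opt_seg.step()
```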

4. Experiments and Results

4.1. Dataset

This study centers on the large yellow croaker as the research subject, and the dataset was collected from the deep-sea aquaculture experimental platform of the Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Sciences. Unlike existing datasets for fish segmentation, which are typically acquired under out-of-water or laboratory conditions, all images in our dataset were captured underwater during high-density aquaculture. After removing blurry or low-quality images, 459 images were randomly selected for further processing. These images underwent coarse localization of yellow croaker individuals via object detection and were uniformly cropped to 512 × 288 pixels. As a result of this cropping strategy, each image contains one complete measurement target together with varying numbers of touching, partially occluded, or incomplete fish bodies around it. These surrounding individuals are not considered segmentation targets in this study but are treated as non-salient background regions. Based on the measurement demands of large yellow croaker in actual production, the fish segmentation ground truth was defined as two classes: fish carcass and fins. All images were annotated by two experts with extensive aquaculture experience using X-AnyLabeling, with annotations verified by a third expert to ensure consistency and accuracy. The dataset was randomly divided into training and validation sets in an 8:2 ratio for model training.

4.2. Implementation Details

All experiments are conducted on a workstation equipped with an Intel® Core™ i7-14700K CPU running at 3.40 GHz, an NVIDIA RTX 4090 graphics card with 24 GB of memory, and 64 GB of RAM. The models are implemented in PyTorch 2.3.0.
To ensure strict fairness and reproducibility, all backbone networks and all compared methods are trained entirely from scratch, without using any pre-trained weights or external initialization. During training, the batch size is 16, and the Adam optimizer is used with first- and second-order moment coefficients set to 0.9 and 0.999, respectively. Cosine annealing is employed for learning rate scheduling, with an initial learning rate of 0.001 and a total training duration of 300 epochs.
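For concreteness, the corresponding optimizer and scheduler setup in PyTorch could look as follows; the one-layer placeholder model and the dummy batches merely stand in for the dual-branch network and the data loader.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=1)  # placeholder for the dual-branch network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# Cosine annealing over the full 300-epoch schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    for _ in range(10):  # placeholder loop over batches of 16 images
        loss = model(torch.randn(16, 3, 288, 512)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()     # update the learning rate once per epoch
```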
In the testing stage, four metrics are used to quantitatively evaluate the performance of our method: Accuracy (Acc), Dice Score (Dice), the 95th percentile of the Hausdorff Distance (HD95), and average surface distance (ASD). The Dice index is a commonly used metric in segmentation tasks as it effectively balances the number of true positives against both false positives and false negatives, making it particularly suitable for evaluating performance on imbalanced datasets. However, one limitation of the Dice index lies in its reduced sensitivity to local variations at object boundaries, which may limit its ability to fully capture detailed boundary information. To address this limitation and provide a more comprehensive evaluation, we incorporate HD95 and ASD as supplementary metrics. The HD95 is particularly useful for assessing near-maximal deviations between two boundary point sets while remaining robust to outliers, thereby capturing extreme cases of misalignment that might otherwise be overlooked [35]. On the other hand, ASD provides an average measure of all corresponding point distances, offering insight into the model’s overall alignment accuracy [36]. Together, these metrics allow us to evaluate both the global segmentation performance and local boundary accuracy, ensuring a robust and comprehensive assessment of our method.
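The sketch below shows one self-contained way to compute Dice, ASD, and HD95 from binary masks using SciPy distance transforms; it assumes boolean inputs with non-empty boundaries and is not the exact evaluation code used in this study.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

def _surface_distances(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Symmetric distances between the boundary pixels of two boolean masks."""
    pred_border = pred ^ binary_erosion(pred)
    gt_border = gt ^ binary_erosion(gt)
    dt_to_gt = distance_transform_edt(~gt_border)      # distance to the GT boundary
    dt_to_pred = distance_transform_edt(~pred_border)  # distance to the predicted boundary
    return np.concatenate([dt_to_gt[pred_border], dt_to_pred[gt_border]])

def asd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average (symmetric) surface distance in pixels."""
    return float(_surface_distances(pred, gt).mean())

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th percentile of the symmetric surface distances in pixels."""
    return float(np.percentile(_surface_distances(pred, gt), 95))
```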

4.3. Results and Analysis

The performance of the proposed method is validated through two sets of experiments: (1) Performance of the proposed method under different configurations (incorporating an additional edge map feature extraction network; implementing multi-task learning with SOD regression; adjusting the number of channels in the edge feature extraction branch); (2) Comparison of the proposed method with other segmentation methods.

4.3.1. Ablation Study

As shown in Table 2, different components are added to the model from (A) to (E) for experiments. The baseline backbone model is the standard DeepLabV3+. Model (B) is an enhanced version incorporating an additional SOD branch. Compared with model (A), which only performs feature extraction on the original image using a single-branch model, model (B) achieves a 0.81% improvement in Dice score, and reductions of 34.95% and 4.99% for ASD and HD95, respectively. This indicates that while the Dice metric, which emphasizes large region segmentation accuracy, saw limited gains due to already high baseline performance, the introduction of the SOD branch significantly improved structure edge-related metrics (ASD and HD95). This improvement can be attributed to the enhanced ability of model (B) to capture fine structural boundary features through its additional salient boundary map and edge-mixed feature extractor.
Furthermore, we explored the impact of introducing a regression loss for the SOD branch in model (E), which builds on model (B). Compared to model (B), model (E) achieved further reductions in ASD (7.19%) and HD95 (22.07%), demonstrating that employing relatively independent loss functions for each branch has a positive impact on overall performance.
Finally, we investigated the trade-off between computational efficiency and model performance by adjusting the number of feature channels in the auxiliary branch (from model (C) to model (E)). While reducing the channel count decreased computational overhead, it also resulted in degraded performance. Considering that the fish segmentation task in this study does not require real-time processing, we ultimately selected a configuration where the auxiliary SOD branch maintains a 1:1 channel ratio with the main branch to achieve optimal performance.
To provide a more intuitive presentation of the model’s performance under different combinations in our ablation experiments, we visualize both edge salient regression results and segmentation outcomes. As shown in the first to third columns of Figure 6, introducing the edge saliency detection branch mitigates the mis-segmentation of non-salient fish individuals in high-density underwater scenes. Furthermore, integrating the regression loss from the edge saliency detection branch into the segmentation loss further improves the model’s performance in predicting fin regions. As evidenced by the comparison between models (C) through (E), the configuration with a 1:1 channel ratio between the segmentation and edge saliency branches yields the lowest risk of prediction errors.
To further validate whether the proposed model appropriately focuses on pixels corresponding to target categories and their boundaries, we employ Grad-CAM and Uncertainty Map visualizations (columns 4–5 of Figure 6). Grad-CAM is a visualization technique that highlights regions of an image most critical for a model’s prediction [37]. In this study, the “jet” color map is used to differentiate between fin and carcass regions, where red signifies high importance and blue indicates low importance. As shown in the figures, compared to single-branch segmentation models, models incorporating an edge saliency detection branch exhibit significantly increased attention to structural edge regions. Moreover, model (E) demonstrates a stronger disparity in its focus on structural edges versus internal regions. Uncertainty Map is a method that quantifies a model’s confidence in its predictions at each pixel level [38]. In our experiment, “jet” color scheme is also used to represent the most certain and uncertain prediction regions, where red denotes high uncertainty and blue signifies low uncertainty. From Figure 6, it can be observed that compared to single-branch models, dual-branch models exhibit higher concentrations of uncertainty in areas near fish contours. Notably, the final selected model (E) demonstrates lower prediction uncertainty at critical edge regions, such as the head and tail base.
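For readers who wish to reproduce such visualizations, a minimal hook-based Grad-CAM sketch for a segmentation network is given below; the placeholder network, target layer, and class index are illustrative assumptions, not the configuration used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, target_layer: nn.Module,
             image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Minimal Grad-CAM for a segmentation model: gradients of the summed
    class logits with respect to a chosen layer weight its activations."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(image)               # (B, C, H, W) segmentation logits
    score = logits[:, class_idx].sum()  # scalar score for the chosen class
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)           # GAP of the gradients
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True))  # weighted sum + ReLU
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)     # normalize to [0, 1] before applying the "jet" map

# Toy usage with a placeholder two-layer network
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 1))
heatmap = grad_cam(net, net[0], torch.randn(1, 3, 288, 512), class_idx=1)
```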

4.3.2. Comparison with Other Models

In this section, the proposed model is compared with nine other state-of-the-art models for image semantic segmentation. These include SegNet [13], UNet [11], and RA-UNet [12], which have been employed for fish phenotype segmentation. Additionally, we incorporate general-purpose image segmentation methods based on CNNs, such as UNet++ [21], DeepLabV3 [22], and AHF-UNet [39], along with transformer-based approaches like SwinUNet [25] and SegFormer [40]. To ensure a fair comparison, all models are trained from scratch using the same training protocol without leveraging any pre-trained weights. The quantitative results are presented in Table 3, demonstrating that our proposed model outperforms others in the evaluation metrics of Acc, Dice, ASD, and HD95, despite not having the lowest computational complexity or parameter count.
Compared with SegNet, UNet, and RA-UNet, the proposed method improves the Dice Score by 4.3%, 1.4%, and 0.7%, respectively. Additionally, ASD decreased by 80.4%, 76.4%, and 54.2%, while HD95 decreased by 77.9%, 74.2%, and 26.8%. The superior performance in the ASD and HD95 metrics highlights their sensitivity to model precision at object boundaries and overall structural integrity. This demonstrates the effectiveness of our dual-branch multi-task architecture with edge saliency detection. Interestingly, despite their strong performance in tasks requiring global feature perception, the transformer-based models SwinUNet and SegFormer failed to achieve comparable results in the specific underwater fish image segmentation task targeted by this study. The quantitative results indicate that SegFormer performs comparably to SegNet, which suggests potential limitations of transformer-based approaches when applied to the limited-scale datasets used in this study.
To further evaluate the segmentation performance across different phenotype regions, we conducted a comparative analysis focusing on both fish carcass and fish fin predictions. As shown in Table 4, all methods exhibit relatively lower performance metrics on the fish fin compared to the fish carcass. This can be attributed to the inherent challenges in segmenting fish fins, which are delicate, semi-transparent structures often located near the edges of the fish carcass. The smaller size and complex geometry make fish fins more prone to segmentation errors compared to the larger, more homogeneous fish carcass region. Despite the inherent difficulty of fin segmentation, our method consistently achieves superior performance on both the fish carcass and fins, with ASD and HD95 metrics on fins even outperforming several competing methods on the fish carcass.
Qualitative visualization results are presented in Figure 7, using the same visualization strategy as in the ablation study to intuitively compare the performance of our proposed method with other approaches on specific image samples. Compared to other methods, the proposed approach generates smoother boundary predictions and avoids false positives in dense scenes outside the salient individuals. The CAMs demonstrate a stronger focus on structural boundaries, while the Uncertainty Maps reveal consistently low prediction uncertainty along challenging object edges.

5. Discussion

To investigate the strengths and weaknesses of different methods in high-density underwater scenes, we present additional segmentation results across a variety of samples in Figure 8.
When comparing with the UNet and DeepLab series, it is evident that adding extra skip connections between the encoder and decoder significantly improves segmentation performance near structural boundaries. DeepLabV3, in particular, benefits from the ASPP module introduced at the end of its encoder, which enhances global context awareness and helps reduce false positives on non-salient targets, where UNet models often fail to achieve satisfactory performance. Among the improved variants, RA-UNet incorporates DeepLabV3’s ASPP module into the UNet backbone, while AHF-UNet introduces uncertainty-aware attention mechanisms. Both models aim to enhance boundary prediction and show some success; however, they still produce false-positive segmentation predictions when encountering dense scenes with non-salient distractors. In contrast, the proposed method consistently avoids false-positive segmentation predictions on non-salient fish, even in cases where these individuals partially overlap or touch the salient object. The SOD branch reliably filters out such interfering regions by emphasizing structural completeness cues, allowing the model to focus on the segmentation target. As a result, the model produces smoother boundary transitions and more precise contour delineation around the salient individual compared with other methods. These results validate the effectiveness of our design, which incorporates an edge-aware SOD branch and leverages a multi-task learning strategy that combines segmentation loss with boundary regression loss.
It is noteworthy that Transformer-based models, including SwinUNet and SegFormer, exhibit unexpectedly limited performance. Although these architectures are known for their strong global modeling capacity, they struggle in our high-density underwater segmentation task. As illustrated in Figure 7, both the CAMs and the uncertainty maps of these models exhibit patch-wise discontinuities, especially near the structural boundaries of the large yellow croaker. This tendency is closely related to the limited local detail reconstruction of their lightweight decoders and the absence of strong local inductive bias, which makes boundary continuity difficult to maintain. Moreover, all models in this study were trained from scratch to ensure fairness; without pre-trained weights, Transformer architectures rely heavily on coarse token-level representations when data are relatively limited, which hampers their ability to recover the smooth curvature and fine-grained structures required for phenotype measurement. Future improvements may arise from integrating stronger local priors through hybrid CNN–Transformer designs or edge-aware decoders, or from pre-training on larger-scale aquatic datasets via self-supervised learning to enhance representation stability.
Regarding the trade-off between computational cost and segmentation performance, the visual results in Figure 8 indicate that the proposed method delivers superior accuracy, particularly in fine-grained boundary delineation. However, as shown in Table 4, these improvements are accompanied by higher computational demands. Due to the introduction of an additional boundary-aware branch and a multi-task optimization scheme, the overall number of parameters is relatively large. The model size is second only to RA-UNet, which applies ASPP modules at every layer, and it is substantially larger than lighter single-branch encoder–decoder architectures such as SegNet and UNet. Moreover, the two-branch design generates a larger number of high-resolution feature maps and involves more inter-branch skip connections, resulting in computational costs comparable to those of UNet++, although still lower than those of the uncertainty-guided AHF-UNet, RA-UNet, and SegNet, which is constructed with VGG blocks instead of ResBlocks. Given the practical application scenario targeted in this study, segmentation accuracy has higher priority than real-time inference, and the additional computational overhead is therefore acceptable.
Nevertheless, the current computational burden still limits deployment in real-time systems and lightweight edge platforms. Future work will investigate model compression and related techniques to achieve a more favorable balance between accuracy and efficiency. In addition, because underwater imaging conditions in offshore mariculture vary substantially across time and space, enhancing robustness under domain shifts is equally important. Subsequent studies will explore domain adaptation strategies and cross-domain training schemes to improve generalization across diverse operational environments.

6. Conclusions

In this work, an approach that combines edge-aware SOD with a multi-task learning framework is proposed to improve fish phenotype segmentation performance in high-density underwater scenes. Supervised by boundary distance regression, the edge-aware SOD branch effectively extracts boundary features from boundary maps and transmits them to the image feature extraction branch via inter-branch skip connections. The image feature extraction branch is trained under instance-level segmentation supervision, enhanced with additional edge cues, thereby improving the segmentation accuracy around object boundaries and reducing false positive predictions on non-salient targets. Experimental results demonstrate that the proposed method achieves Dice scores of 97.58% and 88.88% on the fish carcass and fins of large yellow croaker in high-density underwater scenes, respectively. The proposed framework shows great potential for deployment in underwater phenotypic measurement systems for offshore marine aquaculture.

Author Contributions

Conceptualization, J.Z. and S.L.; methodology, J.Z. and S.L.; software, J.Z. and C.Q.; validation, J.X., C.Q. and X.T.; formal analysis, J.Z.; investigation, J.Z.; resources, X.J.; data curation, X.J. and J.X.; writing—original draft preparation, J.Z.; writing—review and editing, C.Q. and S.L.; visualization, J.Z.; supervision, S.L.; project administration, S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Science and Technology Project funded by the Ministry of Agriculture of the PRC, in part by the Central Public-interest Scientific Institution Basal Research Fund, ECSFRCAFS (NO.2025YJ02), and in part by the Central Public-interest Scientific Institution Basal Research Fund, CAFS (NO.2024XT0901).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

Author Xuyang Jiang was employed by the company Qingdao Conson Oceantec Valley Development. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SOD: Salient Object Detection
CNN: Convolutional Neural Network
ASPP: Atrous Spatial Pyramid Pooling
NLP: Natural Language Processing
Acc: Accuracy
HD95: 95th Percentile of the Hausdorff Distance
ASD: Average Surface Distance

References

  1. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic segmentation of agricultural images: A survey. Inf. Process. Agric. 2024, 11, 172–186. [Google Scholar] [CrossRef]
  2. Hu, Z.; Li, R.; Xia, X.; Yu, C.; Fan, X.; Zhao, Y. A method overview in smart aquaculture. Environ. Monit. Assess. 2020, 192, 493. [Google Scholar] [CrossRef]
  3. Zhao, Y.; Qin, H.; Xu, L.; Yu, H.; Chen, Y. A review of deep learning-based stereo vision techniques for phenotype feature and behavioral analysis of fish in aquaculture. Artif. Intell. Rev. 2025, 58, 7. [Google Scholar] [CrossRef]
  4. Freitas, M.V.; Lemos, C.G.; Ariede, R.B.; Agudelo, J.F.G.; Neto, R.R.O.; Borges, C.H.S.; Mastrochirico-Filho, V.A.; Porto-Foresti, F.; Iope, R.L.; Batista, F.M.; et al. High-throughput phenotyping by deep learning to include body shape in the breeding program of pacu (Piaractus mesopotamicus). Aquaculture 2023, 562, 738847. [Google Scholar] [CrossRef]
  5. Strachan, N.J.C. Length measurement of fish by computer vision. Comput. Electron. Agric. 1993, 8, 93–104. [Google Scholar] [CrossRef]
  6. Chen, J.C.; Chen, T.-L.; Wang, H.-L.; Chang, P.-C. Underwater abnormal classification system based on deep learning: A case study on aquaculture fish farm in Taiwan. Aquac. Eng. 2022, 99, 102290. [Google Scholar] [CrossRef]
  7. Zion, B.; Alchanatis, V.; Ostrovsky, V.; Barki, A.; Karplus, I. Real-time underwater sorting of edible fish species. Comput. Electron. Agric. 2007, 56, 34–45. [Google Scholar] [CrossRef]
  8. Hao, Y.; Yin, H.; Li, D. A novel method of fish tail fin removal for mass estimation using computer vision. Comput. Electron. Agric. 2022, 193, 106601. [Google Scholar] [CrossRef]
  9. Atienza-Vanacloig, V.; Andreu-García, G.; López-García, F.; Valiente-González, J.M.; Puig-Pons, V. Vision-based discrimination of tuna individuals in grow-out cages through a fish bending model. Comput. Electron. Agric. 2016, 130, 142–150. [Google Scholar] [CrossRef]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  11. Yu, C.; Hu, Z.; Han, B.; Wang, P.; Zhao, Y.; Wu, H. Intelligent measurement of morphological characteristics of fish using improved U-Net. Electronics 2021, 10, 1426. [Google Scholar] [CrossRef]
  12. Li, J.; Liu, C.; Yang, Z.; Lu, X.; Wu, B. RA-UNet: An intelligent fish phenotype segmentation method based on ResNet50 and atrous spatial pyramid pooling. Front. Environ. Sci. 2023, 11, 1201942. [Google Scholar] [CrossRef]
  13. Fernandes, A.F.A.; Turra, E.M.; de Alvarenga, É.R.; Passafaro, T.L.; Lopes, F.B.; Alves, G.F.O.; Singh, V.; Rosa, G.J.M. Deep learning image segmentation for extraction of fish body measurements and prediction of body weight and carcass traits in Nile tilapia. Comput. Electron. Agric. 2020, 170, 105274. [Google Scholar] [CrossRef]
  14. Li, L.; Dong, B.; Rigall, E.; Zhou, T.; Dong, J.; Chen, G. Marine animal segmentation. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2303–2314. [Google Scholar] [CrossRef]
  15. Ye, Z.; Zhou, J.; Ji, B.; Zhang, Y.; Peng, Z.; Ni, W.; Zhu, S.; Zhao, J. Feature fusion of body surface and motion-based instance segmentation for high-density fish in industrial aquaculture. Aquac. Int. 2024, 32, 8361–8381. [Google Scholar] [CrossRef]
  16. Lin, H.-Y.; Tseng, S.-L.; Li, J.-Y. SUR-Net: A deep network for fish detection and segmentation with limited training data. IEEE Sens. J. 2022, 22, 18035–18044. [Google Scholar] [CrossRef]
  17. Laradji, I.H.; Saleh, A.; Rodriguez, P.; Nowrouzezahrai, D.; Azghadi, M.R.; Vazquez, D. Weakly supervised underwater fish segmentation using affinity LCFCN. Sci. Rep. 2021, 11, 17379. [Google Scholar] [CrossRef] [PubMed]
  18. Garcia, R.; Prados, R.; Quintana, J.; Tempelaar, A.; Gracias, N.; Rosen, S.; Vågstøl, H.; Løvall, K. Automatic segmentation of fish using deep learning with application to fish size measurement. ICES J. Mar. Sci. 2020, 77, 1354–1366. [Google Scholar] [CrossRef]
  19. Yu, C.; Fan, X.; Hu, Z.; Xia, X.; Zhao, Y.; Li, R.; Bai, Y. Segmentation and measurement scheme for fish morphological features based on Mask R-CNN. Inf. Process. Agric. 2020, 7, 523–534. [Google Scholar] [CrossRef]
  20. Alshdaifat, N.F.F.; Talib, A.Z.; Osman, M.A. Improved deep learning framework for fish segmentation in underwater videos. Ecol. Inform. 2020, 59, 101121. [Google Scholar] [CrossRef]
  21. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
  22. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  23. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  24. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280. [Google Scholar] [CrossRef]
  25. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-UNet: UNet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  26. Borji, A.; Itti, L. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 185–207. [Google Scholar] [CrossRef]
  27. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient object detection in the deep learning era: An in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3239–3259. [Google Scholar] [CrossRef] [PubMed]
  28. Li, Y.; Fang, R.; Zhang, N.; Liao, C.; Chen, X.; Wang, X.; Luo, Y.; Li, L.; Mao, M.; Zhang, Y. An improved algorithm for salient object detection of microscope based on U2-Net. Med. Biol. Eng. Comput. 2025, 63, 383–397. [Google Scholar] [CrossRef] [PubMed]
  29. Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  30. Zhou, X.; Fang, H.; Liu, Z.; Zheng, B.; Sun, Y.; Zhang, J.; Yan, C. Dense attention-guided cascaded network for salient object detection of strip steel surface defects. IEEE Trans. Instrum. Meas. 2021, 71, 5004914. [Google Scholar] [CrossRef]
  31. Guo, B.; Guo, N.; Cen, Z. Motion saliency-based collision avoidance for mobile robots in dynamic environments. IEEE Trans. Ind. Electron. 2021, 69, 13203–13212. [Google Scholar] [CrossRef]
  32. Khattar, A.; Hegde, S.; Hebbalaguppe, R. Cross-domain multi-task learning for object detection and saliency estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3639–3648. [Google Scholar]
  33. Liu, Z.; He, J.; Zhang, Y.; Zhang, T.; Han, Z.; Liu, B. Infrared small target detection based on saliency guided multi-task learning. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3459–3463. [Google Scholar]
  34. Ng, P.E.; Ma, K.-K. A switching median filter with boundary discriminative noise detection for extremely corrupted images. IEEE Trans. Image Process. 2006, 15, 1506–1516. [Google Scholar] [CrossRef]
  35. Müller, D.; Soto-Rey, I.; Kramer, F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res. Notes 2022, 15, 210. [Google Scholar] [CrossRef]
  36. Langerak, T.R.; van der Heide, U.A.; Kotte, A.N.T.J.; Berendsen, F.F.; Pluim, J.P.W. Evaluating and improving label fusion in atlas-based segmentation using the surface distance. In Medical Imaging 2011: Image Processing; SPIE Digital Library: Orlando, FL, USA, 2011; Volume 7962, pp. 688–694. [Google Scholar]
  37. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
  38. Ghoshal, B.; Tucker, A.; Sanghera, B.; Wong, W.L. Estimating uncertainty in deep learning for reporting confidence to clinicians in medical image segmentation and disease detection. Comput. Intell. 2021, 37, 701–734. [Google Scholar] [CrossRef]
  39. Munia, A.A.; Abdar, M.; Hasan, M.; Jalali, M.S.; Banerjee, B.; Khosravi, A.; Hossain, I.; Fu, H.; Frangi, A.F. Attention-guided hierarchical fusion U-Net for uncertainty-driven medical image segmentation. Inf. Fusion 2025, 115, 102719. [Google Scholar] [CrossRef]
  40. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Figure 1. Image segmentation applications in fisheries: (a) fish segmentation under water-free conditions [11], (b) fish phenotype segmentation under water-free conditions [13], (c) intensive aquaculture scene, (d) the objective of this study.
Figure 2. Segmentation model architectures: (a) UNet [10], (b) UNet++ [21], (c) TransUNet [24], (d) SwinUNet [25], and (e) DeepLabV3+ [23] from the literature, and (f) ours.
Figure 3. Overall view of the proposed model.
Figure 4. The result of transforming the segmentation mask into the Salient-Boundary Regression map.
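The exact transform behind the Salient-Boundary Regression map is not reproduced in this caption; as a point of reference, the sketch below shows one plausible way a binary segmentation mask can be turned into a boundary-emphasizing regression target using a Euclidean distance transform. The function name, the decay constant sigma, and the exponential fall-off are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: converting a binary segmentation mask into a
# boundary-emphasizing regression map via a Euclidean distance transform.
# The transform actually used in the paper may differ; names and the
# decay constant below are illustrative only.
import numpy as np
from scipy.ndimage import distance_transform_edt

def mask_to_boundary_regression(mask: np.ndarray, sigma: float = 8.0) -> np.ndarray:
    """Map a binary mask (H, W) to a [0, 1] map that peaks near object boundaries."""
    mask = mask.astype(bool)
    # Distance to the object boundary, measured from both sides of the contour.
    dist_inside = distance_transform_edt(mask)
    dist_outside = distance_transform_edt(~mask)
    dist_to_boundary = np.where(mask, dist_inside, dist_outside)
    # Exponential decay: highest adjacent to the boundary, falling off with distance.
    return np.exp(-dist_to_boundary / sigma).astype(np.float32)
```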
Figure 5. The execution process of the Edge-Mixed extractor: (a) the original fish image; (b) the smoothed image; (c) the edge map extracted by the edge detector; (d) the fused result of (b) and (c).
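Figure 5 depicts a smooth-then-detect-then-fuse pipeline. The snippet below is a minimal OpenCV sketch of that pattern, assuming a bilateral filter for smoothing, a Canny detector for the edge map, and a weighted sum for fusion; the actual filters, thresholds, and fusion weights of the Edge-Mixed extractor may differ.

```python
# Hypothetical sketch of the smoothing -> edge detection -> fusion pipeline
# illustrated in Figure 5. Filter choices, thresholds, and the fusion weight
# alpha are assumptions, not the authors' exact settings.
import cv2
import numpy as np

def edge_mixed(image_bgr: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    # (b) Smooth the original image to suppress underwater noise and turbidity.
    smoothed = cv2.bilateralFilter(image_bgr, d=9, sigmaColor=75, sigmaSpace=75)
    # (c) Extract an edge map from the smoothed image.
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, threshold1=50, threshold2=150)
    # (d) Fuse the smoothed image and the edge map (edge map expanded to 3 channels).
    edges_bgr = cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)
    return cv2.addWeighted(smoothed, alpha, edges_bgr, 1.0 - alpha, 0)
```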
Figure 6. Qualitative comparison of ablation experiments for semantic segmentation of large yellow croaker images in high-density underwater scenes. Regression prediction maps appear only for the multi-branch methods; the 'Unavailable' label in the single-branch method's corresponding position indicates the absence of such predictions. (A)–(E) correspond to the results presented in Table 2.
Figure 7. Qualitative comparison of different methods for semantic segmentation of large yellow croaker in high-density underwater scenes.
Figure 8. Visualization of segmentation results on multiple samples using different methods for comparative analysis.
Table 1. Sizes of the inputs and outputs of the multi-task learning framework.

| Multi-Class Segmentation Branch | | | SOD Branch | | |
| Layers | Input | Output | Layers | Input | Output |
| Layer0 | 3 × 288 × 512 | 64 × 144 × 256 | Layer′0 | 1 × 144 × 256 | 64 × 72 × 128 |
| Layer1 | 64 × 144 × 256 | 256 × 72 × 128 | Layer′1 | 64 × 144 × 256 | 256 × 72 × 128 |
| Layer2 | (256 ∗ 2) × 72 × 128 | 512 × 36 × 64 | Layer′2 | 256 × 72 × 128 | 512 × 36 × 64 |
| Layer3 | 512 × 36 × 64 | 1024 × 18 × 32 | Layer′3 | 512 × 36 × 64 | 1024 × 18 × 32 |
| Layer4 | 1024 × 18 × 32 | 2048 × 9 × 16 | Layer′4 | 1024 × 18 × 32 | 2048 × 9 × 16 |
| ASPP | (2048 ∗ 2) × 9 × 16 | 256 × 9 × 16 | ASPP′ | 2048 × 9 × 16 | 256 × 9 × 16 |
| Upsample × 8 | 256 × 9 × 16 | 256 × 72 × 128 | Upsample × 8 | 256 × 9 × 16 | 256 × 72 × 128 |
| Decoder | (256 ∗ 4) × 72 × 128 | 3 × 72 × 128 | Decoder′ | 256 × 72 × 128 | 1 × 72 × 128 |
| Upsample × 4 | 3 × 72 × 128 | 3 × 288 × 512 | Upsample × 4 | 1 × 72 × 128 | 1 × 288 × 512 |
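The "(256 ∗ 2)", "(2048 ∗ 2)", and "(256 ∗ 4)" entries in Table 1 suggest channel-wise concatenation of same-resolution features at the inter-branch connections. The shape-only sketch below traces that bookkeeping with dummy tensors; it is an interpretation of the table, not the authors' code, and the placeholder names are assumptions.

```python
# Shape-only sketch of the inter-branch concatenations implied by Table 1,
# using dummy tensors in place of real encoder features.
import torch

B = 1  # batch size
seg_l1 = torch.randn(B, 256, 72, 128)    # segmentation branch, Layer1 output
sod_l1 = torch.randn(B, 256, 72, 128)    # SOD branch, Layer'1 output
layer2_in = torch.cat([seg_l1, sod_l1], dim=1)        # (256 * 2) x 72 x 128
assert layer2_in.shape == (B, 512, 72, 128)

seg_l4 = torch.randn(B, 2048, 9, 16)     # segmentation branch, Layer4 output
sod_l4 = torch.randn(B, 2048, 9, 16)     # SOD branch, Layer'4 output
aspp_in = torch.cat([seg_l4, sod_l4], dim=1)          # (2048 * 2) x 9 x 16
assert aspp_in.shape == (B, 4096, 9, 16)

# Decoder input: four 256-channel feature maps at 72 x 128 ((256 * 4) in Table 1).
feats = [torch.randn(B, 256, 72, 128) for _ in range(4)]
decoder_in = torch.cat(feats, dim=1)
assert decoder_in.shape == (B, 1024, 72, 128)
```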
Table 2. Quantitative results of ablation experiments for semantic segmentation of large yellow croaker images in high-density underwater scenes. The best-performing results are highlighted in bold.

| Model | Backbone | SOD-Branch | Multi-Task | Acc (%) ↑ | Dice (%) ↑ | ASD (Pixel) ↓ | HD95 (Pixel) ↓ | Flops (G) | Params (M) |
| (A) | | | | 99.03 | 92.38 | 0.654 | 3.552 | 27.64 | 40.82 |
| (B) | Full Channels | | | 99.13 | 92.92 | 0.514 | 3.044 | 70.94 | 100.54 |
| (C) | Quarter Channels | | | 99.07 | 92.74 | 0.552 | 3.441 | 33.31 | 48.22 |
| (D) | Half Channels | | | 99.07 | 92.93 | 0.519 | 3.131 | 43.37 | 60.93 |
| (E) | Full Channels | | | 99.18 | 93.23 | 0.477 | 2.372 | 75.02 | 102.31 |
Table 3. Quantitative results of different methods for semantic segmentation of large yellow croaker in high-density underwater scenes. The best-performing results are highlighted in bold.

| Methods | Acc (%) ↑ | Dice (%) ↑ | ASD (Pixel) ↓ | HD95 (Pixel) ↓ | Flops (G) | Params (M) |
| SegNet [13] | 98.58 | 89.32 | 2.439 | 10.769 | 75.30 | 24.94 |
| UNet [11] | 98.84 | 91.91 | 2.025 | 9.215 | 31.74 | 7.85 |
| UNet++ [21] | 98.80 | 91.87 | 1.931 | 8.568 | 78.66 | 9.16 |
| DeepLabV3 [22] | 98.89 | 91.03 | 0.801 | 3.795 | 14.39 | 39.64 |
| DeepLabV3+ [23] | 99.03 | 92.38 | 0.654 | 3.552 | 27.64 | 40.82 |
| SwinUNet [25] | 96.01 | 75.48 | 11.726 | 42.050 | 22.74 | 41.34 |
| SegFormer [40] | 98.75 | 90.20 | 0.930 | 3.681 | 17.37 | 30.84 |
| RA-UNet [12] | 99.07 | 92.56 | 0.657 | 3.249 | 212.03 | 170.45 |
| AHF-UNet [39] | 99.03 | 92.86 | 1.041 | 4.449 | 168.00 | 37.67 |
| Proposed | 99.18 | 93.23 | 0.477 | 2.372 | 75.02 | 102.31 |
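For reference, the snippet below sketches how the region metrics reported in Tables 3 and 4 (Dice, average surface distance, and 95th-percentile Hausdorff distance, in pixels) can be computed from binary masks with NumPy/SciPy; the exact definitions used by the authors (e.g., symmetric versus one-sided surface distances, boundary extraction) are assumptions here.

```python
# Hedged sketch of Dice, ASD, and HD95 computed from binary 2D masks.
# Symmetric surface distances and cross-connectivity boundary extraction
# are assumptions; the authors' metric implementation may differ.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def _surface_distances(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Distances from pred-boundary pixels to the gt boundary, and vice versa."""
    pred_surf = pred & ~binary_erosion(pred)
    gt_surf = gt & ~binary_erosion(gt)
    dist_to_gt = distance_transform_edt(~gt_surf)      # distance map to gt boundary
    dist_to_pred = distance_transform_edt(~pred_surf)  # distance map to pred boundary
    return np.concatenate([dist_to_gt[pred_surf], dist_to_pred[gt_surf]])

def asd(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(_surface_distances(pred.astype(bool), gt.astype(bool)).mean())

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.percentile(_surface_distances(pred.astype(bool), gt.astype(bool)), 95))
```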
Table 4. Quantitative results of segmentation performance on fish carcass and fins, highlighting regional differences among methods. The best scores for each region are marked in bold.

| | Fish Fins | | | | Fish Carcass | | | |
| Methods | Acc (%) ↑ | Dice (%) ↑ | ASD (Pixel) ↓ | HD95 (Pixel) ↓ | Acc (%) ↑ | Dice (%) ↑ | ASD (Pixel) ↓ | HD95 (Pixel) ↓ |
| SegNet [13] | 98.96 | 85.43 | 2.464 | 11.743 | 98.21 | 93.21 | 2.4152 | 9.7943 |
| UNet [11] | 99.22 | 88.06 | 2.043 | 10.167 | 98.46 | 95.77 | 2.007 | 8.263 |
| UNet++ [21] | 99.21 | 88.00 | 1.713 | 8.969 | 98.40 | 95.75 | 2.149 | 8.166 |
| DeepLabV3 [22] | 99.00 | 85.29 | 0.883 | 4.850 | 98.78 | 96.77 | 0.719 | 2.739 |
| DeepLabV3+ [23] | 99.22 | 87.76 | 0.844 | 4.849 | 98.85 | 97.00 | 0.464 | 2.256 |
| SwinUNet [25] | 97.72 | 65.89 | 16.061 | 57.308 | 94.31 | 85.08 | 7.391 | 26.792 |
| SegFormer [40] | 98.95 | 84.12 | 1.244 | 5.735 | 98.55 | 96.28 | 0.615 | 1.626 |
| RA-UNet [12] | 99.22 | 87.94 | 0.828 | 4.745 | 98.92 | 97.18 | 0.485 | 1.752 |
| AHF-UNet [39] | 99.25 | 88.78 | 1.189 | 6.173 | 98.81 | 96.95 | 0.892 | 2.725 |
| Proposed | 99.27 | 88.88 | 0.590 | 3.521 | 99.09 | 97.58 | 0.364 | 1.222 |