Article

A Uniform Multi-Modal Feature Extraction and Adaptive Local–Global Feature Fusion Structure for RGB-X Marine Animal Segmentation

1 College of Information Engineering, Dalian Ocean University, Dalian 116023, China
2 Dalian Key Laboratory of Smart Fisheries, Dalian 116023, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(19), 3927; https://doi.org/10.3390/electronics14193927
Submission received: 29 August 2025 / Revised: 24 September 2025 / Accepted: 28 September 2025 / Published: 2 October 2025
(This article belongs to the Special Issue Recent Advances in Efficient Image and Video Processing)

Abstract

Marine animal segmentation aims to segment marine animals in complex ocean scenes and plays an important role in underwater intelligence research. Due to the complexity of underwater scenes, relying solely on a single RGB image or learning from one specific combination of multi-modal information may not be effective. Therefore, we propose a uniform multi-modal feature extraction and adaptive local–global feature fusion structure for RGB-X marine animal segmentation. It is applicable to various settings, such as RGB-D (RGB+depth) and RGB-O (RGB+optical flow) marine animal segmentation. Specifically, we first fine-tune the SAM encoder with parallel LoRA and adapters to separately extract RGB information and auxiliary information. Then, the Adaptive Local–Global Feature Fusion (ALGFF) module is proposed to progressively fuse multi-modal and multi-scale features in a simple and dynamic way. Experimental results on both RGB-D and RGB-O datasets demonstrate that our model achieves superior performance in underwater scene segmentation tasks.

1. Introduction

Marine Animal Segmentation (MAS) [1] is a critical task that aims to segment marine animals in complex ocean scenes. It has made significant contributions across multiple domains, including marine engineering technology [2,3,4], as well as in the conservation and management of marine ecological resources [5,6]. The underwater environment has issues such as insufficient illumination, low visibility, and cluttered background information. For example, intertwined seaweed, suspended sediment, and coral reefs of various shapes make it particularly challenging to distinguish the foreground from the background in terms of color and texture. Moreover, the protective coloration of many marine animals makes their body color and texture highly similar to the surrounding background, which poses a significant challenge to the accurate segmentation of underwater images.
In early marine object segmentation, Convolutional Neural Networks (CNNs) [7,8] were commonly employed. These models exploit the local receptive fields of convolutional kernels for feature extraction, allowing them to effectively represent spatial structures within images, such as object shapes and surface textures. However, because of the fixed kernel size, stride, and other parameters, long-distance dependencies in the image may be lost. Transformers [9] handle this better: through the self-attention mechanism, they directly compute the relationship between any two positions in the sequence, unconstrained by distance, and can thus capture long-distance dependencies well. However, self-attention has high computational complexity, which leads to substantial consumption of computing resources when processing large-scale data or complex tasks and seriously impairs computational efficiency. The recently proposed Segment Anything Model (SAM) [10] has made great progress in the field of segmentation: it not only has a powerful architecture but is also trained on tens of millions of images, which improves its effectiveness. However, SAM is trained mainly on common scenes, which differ considerably from underwater scenes. To better adapt SAM to marine animal segmentation, methods such as MAS-SAM [1] and Dual-SAM [11] efficiently fine-tune the SAM model, which enhances segmentation performance in underwater scenes.
To enhance segmentation effectiveness in underwater scenes, some methods utilize auxiliary information to supplement RGB information. Depth information provides geometric structural details, enhancing the ability to identify boundaries. Hong et al. [12] introduced depth information as auxiliary information to augment RGB information, enhancing the dataset’s information completeness. Optical flow captures the motion of pixels between consecutive frames, enabling the identification of moving regions by comparing pixel displacements between adjacent frames. Zhang et al. [13] employed optical flow as an auxiliary cue, significantly improving segmentation performance. Gamma information adjusts the non-linear response of image brightness, improving the visual quality of images and thus assisting segmentation. Zhang et al. [11] utilize gamma information to enhance segmentation performance, but their model suffers from an excessively large number of parameters. These existing methods tend to focus on one specific combination of modalities, which may only be suitable for limited marine scenarios. Meanwhile, they tend to use diverse and complex approaches for multi-scale and multi-modal feature fusion, which complicates the design of the structure.
To address these issues, we propose a uniform multi-modal feature extraction and adaptive local–global feature fusion structure. This framework can be applied to various modality combinations, such as RGB-D (RGB+depth) or RGB-O (RGB+optical flow). Our framework consists of two main components: a uniform multi-modal feature extraction module and an Adaptive Local–Global Feature Fusion (ALGFF) module. The first part uses SAM as the backbone and employs parallel LoRA and adapters to simultaneously extract features from RGB and auxiliary information. The second part handles both multi-modal and multi-scale feature fusion in a simple and consistent way. Specifically, it first roughly fuses two features and then applies two branches to extract local and global information with a multi-kernel CNN block and a Transformer layer. The two branches are combined with a dynamically learned spatial attention map for fused feature enhancement, which enables adaptation to the fusion of different features.
In summary, our contributions are as follows:
  • We propose a novel feature learning architecture, namely a Uniform Multi-Modal Feature Extraction and Adaptive Local–Global Feature Fusion Structure. It enhances marine object segmentation capabilities by integrating RGB information with auxiliary information.
  • In the decoder part, we introduce the Adaptive Local–Global Feature Fusion (ALGFF) module, which combines the strengths of CNNs and Transformers. It performs multi-modal and multi-scale feature fusion in a simple, consistent, and dynamic way.
  • The experimental results prove the efficiency and effectiveness of the proposed method by achieving state-of-the-art performance on three datasets.

2. Related Work

2.1. Marine Animal Segmentation

Due to the low visibility of the underwater environment and suspended particulate matter, marine animal segmentation faces many challenges. Early methods relied on hand-crafted features. CNNs then became the models of choice for marine animal segmentation due to their good performance in extracting information hierarchically. Ref. [14] enhances the model’s performance in complex underwater environments through transfer learning and image restoration techniques. Ref. [15] utilizes Convolutional Neural Networks (CNNs) to achieve color balance and dehazing of degraded underwater images. Ref. [16] utilizes adaptive optical flow selection to extract motion information from video sequences and combines it with a CNN for fish segmentation. Although the segmentation results are satisfactory, CNNs have limitations in capturing global context and long-distance dependencies.
In contrast, Transformers [9] better capture global information and long-distance dependencies through the self-attention mechanism. Ref. [17] proposes the UWSegFormer framework, based on the Transformer architecture, focusing on underwater semantic segmentation and achieving excellent segmentation results. Ref. [18] designs WaterFormer, combining global–local Transformer blocks (GL Transformers) and detail-enhanced skip connections (DESC) to address the loss of local texture in traditional Transformers for underwater image enhancement. More recently, SAM helps models better extract and utilize key features by attending to the importance of different spatial locations in the input data. Ref. [19] uses a depth map to guide SAM to segment underwater objects from multiple perspectives. Dual-SAM [11] introduces a dual-structure encoder in parallel with SAM, which optimizes underwater image features through gamma correction to alleviate color distortion and noise interference. These methods can alleviate the difficulties caused by the complex underwater environment. Building on them, we propose a method suitable for accurate segmentation in multiple scenarios.

2.2. RGB+X Object Detection

Many segmentation methods rely solely on RGB information. Others combine RGB information with auxiliary information to improve segmentation quality and accuracy. Several methods utilize depth information as auxiliary input through multi-modal information fusion. ACMF [20] utilizes a ResNet backbone to derive multi-level features from both RGB and depth data, incorporating a cross-modal identity attention mechanism to dynamically recalibrate feature importance across modalities. CMX [21] introduces a versatile fusion architecture capable of processing multiple input types, including RGB-D and RGB–thermal pairs. The Depth Attention Network [22] leverages depth confidence maps to modulate RGB features adaptively. Meanwhile, CFANet [23] aligns and integrates RGB and depth representations using multi-head self-attention, performing fusion across spatial and channel dimensions. To mitigate underwater image blur, ref. [24] presents a lightweight network built on MobileNetV2 that combines polarization, spatial, and semantic learning.
Several approaches also incorporate optical flow as supplementary information to enhance RGB-based analysis. FlowFusion [25] applies the PWC-Net architecture to compute optical flow between consecutive RGB frames, facilitating the separation of dynamic objects from static backgrounds. Ref. [26] incorporates an optical flow consistency loss to achieve real-time decomposition of static backgrounds and moving objects in RGB videos. Ref. [27] combines RGB appearance features with optical flow motion features, employing an Attention-based Redundancy Removal (AD) module and multi-scale temporal modeling. Ref. [28] utilizes both RGB and optical flow as two-stream inputs, designs a 3D CNN for action recognition, and introduces an attention mechanism to fuse the two modalities.
In addition, some methods use thermal infrared imagery, which captures the temperature cues of objects, as auxiliary information for detection. Ref. [29] proposes a template–search-region bidirectional interaction module, which treats template features as dynamic kernels to perform local convolution in the search region. FMTrack [30] decomposes images into low-frequency structures and high-frequency textures, processes them separately through two expert branches, and then fuses them with frequency-domain gating. Within cross-modal salient object detection, RGB-T Salient Object Detection (RGB-T SOD) focuses on identifying and segmenting salient regions using co-aligned pairs of visible RGB and thermal infrared images [31]. Motivated by these works, this paper combines RGB with auxiliary information to improve the accuracy of underwater scene segmentation.

3. Proposed Method

The method we propose includes a parallel feature extraction module and an adaptive local–global feature fusion module. The parallel feature extraction is used to extract multi-scale features of RGB and auxiliary information, which is achieved by a SAM encoder with parallel LoRA and adapter. The adaptive local–global feature fusion module dynamically fuses the multi-scale features and multi-modal features with a uniform structure to generate the prediction map. These key components will be elaborated in the following sections. The overall structure diagram is shown in Figure 1.

3.1. Parallel Feature Extraction

Given an RGB image $I^r$ and an auxiliary image $I^x$ (depth or optical flow), we aim to obtain the multi-scale, multi-modal features $\{F_i^r\}_{i=1}^{5}$ and $\{F_i^x\}_{i=1}^{5}$. As previously mentioned, the Segment Anything Model (SAM) has demonstrated outstanding performance in conventional segmentation tasks. However, the marine environment is complicated, and marine animals possess unique characteristics in terms of shape, texture, and lighting conditions, so SAM may not be directly suitable for marine animal segmentation. In image segmentation, using RGB information together with auxiliary information (such as depth or optical flow) often leads to a significant improvement in segmentation quality and accuracy. For marine target segmentation, RGB information provides basic cues about the target, such as color and appearance, while the auxiliary information (referred to as X here, representing depth or optical flow) supplements key information about the target’s spatial position and motion state; the two modalities are complementary at the feature level. Based on these observations, we aim to extract effective features from the two modalities, RGB data and X data, as efficiently as possible without altering the overall structure of SAM. To achieve this goal, we efficiently fine-tune SAM, enabling it to better adapt to marine target segmentation and thereby enhancing the segmentation precision and effectiveness.
As shown in Figure 1, the parameters of the SAM encoder are frozen, and we only train the parallel LoRA and adapter modules to efficiently fine-tune the SAM encoder with a small number of parameters. The SAM encoder contains four blocks of Transformer layers, which extract features at four scales for each modality. Each Transformer layer contains a self-attention (SA) layer and a Feed-Forward Network (FFN) layer. First, we apply LoRA to the SA layers; it simulates the parameter update with low-rank matrices to reduce computation and memory usage. It can be expressed as
$\hat{X} = W \cdot X + BA \cdot X,$
where $W$ represents the original pre-trained weight matrix, $A$ is a low-rank dimension-reducing matrix, and $B$ is a low-rank dimension-raising matrix. $X$ represents the input feature vector, while $\hat{X}$ represents the output feature vector. In this paper, we use LoRA to update the queries and keys of the self-attention layers in the SAM encoder, so that refined global relationships within the marine images can be extracted to produce more precise representations. Specifically, with the input features $F_{i-1}^r$ and $F_{i-1}^x$ of the $i$-th encoder block, we apply the shared parameters $W_i^V$ of the SA layer to obtain the values $V_i^r$ and $V_i^x$ for the RGB and X information. Then, we apply the shared frozen parameters $W_i^Q$ and $W_i^K$ of the SA layer, together with two separately learned LoRA modules, to individually produce the refined queries $\hat{Q}_i^r$, $\hat{Q}_i^x$ and keys $\hat{K}_i^r$, $\hat{K}_i^x$ for the two inputs using Equation (1). The outputs of the SA layer for both modalities can be represented as follows:
$\hat{F}_i^r = \mathrm{SA}(\hat{Q}_i^r, \hat{K}_i^r, V_i^r),$
$\hat{F}_i^x = \mathrm{SA}(\hat{Q}_i^x, \hat{K}_i^x, V_i^x),$
where $\mathrm{SA}(\cdot)$ represents the self-attention mechanism, which is applied to the multi-modal inputs (RGB and auxiliary modality information). Through LoRA fine-tuning, SA enables the model to capture long-range spatial dependencies within images, thereby improving the segmentation accuracy.
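As a concrete illustration, the sketch below shows one way to realize this parallel LoRA scheme in PyTorch; the class names, the rank of 4, and the way the frozen projections are wrapped are our own illustrative choices and are not taken from any released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer W plus a trainable low-rank update, as in Equation (1)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # the pre-trained weight W stays frozen
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, rank, bias=False)    # low-rank down projection A
        self.B = nn.Linear(rank, base.out_features, bias=False)   # low-rank up projection B
        nn.init.zeros_(self.B.weight)                 # the low-rank update starts at zero

    def forward(self, x):
        return self.base(x) + self.B(self.A(x))       # W·X + B·A·X


class ParallelLoRAAttention(nn.Module):
    """Shared frozen Q/K/V projections; modality-specific LoRA is applied to queries and keys only."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        q_base, k_base = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)                  # values use the shared projection W^V
        for p in self.v.parameters():
            p.requires_grad = False
        self.q = nn.ModuleDict({m: LoRALinear(q_base, rank) for m in ("rgb", "x")})
        self.k = nn.ModuleDict({m: LoRALinear(k_base, rank) for m in ("rgb", "x")})

    def forward(self, feat: torch.Tensor, modality: str):        # feat: (B, N, dim)
        Q, K, V = self.q[modality](feat), self.k[modality](feat), self.v(feat)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1)
        return attn @ V
```

Because the base projections are shared between the two LoRA wrappers, both modalities pass through the same frozen SAM weights while keeping their own small set of trainable low-rank parameters.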
Then, we apply an adapter to the FFN layer. The adapter inserts a small number of trainable parameters into the original structure, allowing it to efficiently adapt to new tasks. It can be described as follows:
$\mathrm{Adapter}(X) = W_{adp}^{up}(W_{adp}^{down}(X)),$
where $W_{adp}^{down} \in \mathbb{R}^{D \times P}$ and $W_{adp}^{up} \in \mathbb{R}^{P \times D}$ are the weights of the two linear projections, respectively. Here, $D$ is normally much smaller than the input feature dimension $P$ to save computational costs. We take the last FFN layer of the $i$-th encoder block as an example. With the input features $\hat{F}_i^r$ and $\hat{F}_i^x$ from the SA layer of the two modalities, we apply two different adapters along with the frozen FFN layer to obtain the final, refined features of the $i$-th scale. The process can be described as follows:
$F_i^r = \mathrm{Adapter}_i^r(\mathrm{FFN}(\hat{F}_i^r)) + \hat{F}_i^r,$
$F_i^x = \mathrm{Adapter}_i^x(\mathrm{FFN}(\hat{F}_i^x)) + \hat{F}_i^x,$
where $\mathrm{Adapter}_i^r$ and $\mathrm{Adapter}_i^x$ represent the adapters for the RGB and auxiliary data. With these LoRA modules and adapters, we obtain the features $\{F_i^r\}_{i=1}^{4}$ and $\{F_i^x\}_{i=1}^{4}$ from the four encoder blocks. The highest-scale features $F_5^r$ and $F_5^x$ are then produced by a small neck module in the SAM encoder. Both LoRA and the adapters introduce only a small number of trainable parameters to effectively fine-tune the model, making the pre-trained SAM encoder more suitable for the marine environment. The adapted encoder extracts RGB and auxiliary information at multiple scales, which are progressively fused in the decoder.
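A minimal sketch of the adapter and its residual use around the shared frozen FFN is given below; the bottleneck width of 32 and the helper names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a down projection to a small dimension followed by an up projection."""
    def __init__(self, feat_dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(feat_dim, bottleneck)   # W_adp^down
        self.up = nn.Linear(bottleneck, feat_dim)     # W_adp^up

    def forward(self, x):
        return self.up(self.down(x))


def adapted_ffn(ffn: nn.Module, adapter_r: Adapter, adapter_x: Adapter,
                f_r: torch.Tensor, f_x: torch.Tensor):
    """Modality-specific adapters on top of the shared frozen FFN, with residual connections."""
    out_r = adapter_r(ffn(f_r)) + f_r
    out_x = adapter_x(ffn(f_x)) + f_x
    return out_r, out_x
```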

3.2. Adaptive Local–Global Feature Fusion

Due to the complexity of underwater scenes, adaptively fusing multi-modal and multi-scale information is crucial for improving robustness and accuracy. As previous studies have indicated, CNNs efficiently extract local information, whereas Transformers excel at capturing global information; the two are complementary. In this paper, we propose a simple yet effective module, named Adaptive Local–Global Feature Fusion (ALGFF), which leverages the strengths of both the CNN and the Transformer.
As shown in Figure 2a, with two arbitrary features X and Y as inputs, the ALGFF module first uses concatenation to obtain a rough combination R of the two features. Then, it applies a flexible structure to simultaneously extract and adaptively combine the local and global information of R, which effectively enhances the fusion process. Following the idea of Mixture of Experts (MoE) [32], this design is flexible in various situations. It is suitable not only for the fusion of multi-scale features but also for the fusion of diverse modality combinations such as RGB+D and RGB+O. It also simplifies the structure of the decoder, avoiding the complicated structures used in existing methods.
The proposed ALGFF module contains two branches, a local one and a global one. The local branch mainly uses CNN layers to extract information within adjacent regions. To extract local information at different levels, we apply an improved variant of convolution, the multi-kernel CNN (MK-CNN) block. It involves kernels of sizes 3 × 3, 5 × 5, 7 × 7, and 9 × 9, which are processed in parallel, thereby capturing details at different levels. Meanwhile, to maintain efficiency, we apply depthwise separable convolution (a depthwise convolution followed by a pointwise convolution), which greatly reduces the number of parameters. This design therefore expands the receptive field while keeping the computational complexity close to that of a standard convolution, without introducing much additional overhead or parameter count. Specifically, with the roughly fused feature R as input, the output of the local branch is
$H_k = \mathrm{GELU}(R_i * K_k^d) * K_k^p,$
$O_i^l = \mathrm{Concat}(H_3, H_5, H_7, H_9),$
where $H_k$ is the output of each convolution kernel, $k \in \{3, 5, 7, 9\}$ is the kernel size, $*$ is the convolution operation, and $K_k^d$ and $K_k^p$ are the parameters of the depthwise and pointwise parts of the depthwise separable convolution. Through this design, the model can effectively extract multi-level local information at low computational cost.
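The following sketch illustrates the MK-CNN local branch under these assumptions; the trailing 1 × 1 projection that maps the concatenated output back to the input channel width is our own addition for convenience and is not specified in the text.

```python
import torch
import torch.nn as nn

class MKCNN(nn.Module):
    """Multi-kernel local branch: parallel depthwise + pointwise convolutions with
    3x3, 5x5, 7x7, and 9x9 kernels, concatenated along the channel dimension."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in (3, 5, 7, 9):
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),  # depthwise K_k^d
                nn.GELU(),
                nn.Conv2d(channels, channels, 1),                                   # pointwise K_k^p
            ))
        # assumed 1x1 projection so the output keeps the input channel width
        self.proj = nn.Conv2d(4 * channels, channels, 1)

    def forward(self, r):                    # r: (B, C, H, W), the roughly fused feature
        return self.proj(torch.cat([branch(r) for branch in self.branches], dim=1))
```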
The global branch is used to extract long-range relationships between any two positions in the input feature. The self-attention layer of the Transformer is well suited to this, but it greatly increases the computational cost when multiplying the query and key. To avoid this problem, we employ an efficient attention (EA) layer [33] and an FFN layer. Efficient attention modifies traditional dot-product attention by multiplying the key and value first. Specifically, it computes $K^{\top}V$ to obtain a matrix of size $c \times c$ (where c denotes the number of channels, which can be set manually) and then multiplies this matrix by the query. It reduces the computational complexity from $O(n^2)$ to $O(c^2)$, where n is the number of pixels in the feature, while remaining equivalent to the conventional attention mechanism. The FFN layer then applies nonlinear transformations and dimensional operations to further enhance the feature representation, enabling the model to learn more complex patterns and abstract concepts. The efficient attention is computed as
$\mathrm{EA}(R) = \rho_q(Q)\,(\rho_k(K)^{\top} V),$
where $\rho_q(\cdot)$ and $\rho_k(\cdot)$ represent the softmax functions applied along each row of Q and each column of K, respectively. Q, K, and V are obtained from the input feature R. The result then passes through the FFN layer.
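A compact sketch of this efficient-attention computation is shown below; the channel number c = 64 and the linear projections producing Q, K, and V are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EfficientAttention(nn.Module):
    """Efficient attention: softmax(K)^T V is computed first, giving a c x c context
    matrix, which is then multiplied by softmax(Q) instead of forming an N x N map."""
    def __init__(self, dim: int, channels: int = 64):
        super().__init__()
        self.to_q = nn.Linear(dim, channels)
        self.to_k = nn.Linear(dim, channels)
        self.to_v = nn.Linear(dim, channels)
        self.out = nn.Linear(channels, dim)

    def forward(self, r):                      # r: (B, N, dim), N = number of pixels
        q = self.to_q(r).softmax(dim=-1)       # rho_q: softmax over each row of Q
        k = self.to_k(r).softmax(dim=1)        # rho_k: softmax over each column of K
        v = self.to_v(r)
        context = k.transpose(1, 2) @ v        # (B, c, c) context matrix
        return self.out(q @ context)           # (B, N, dim)
```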
The outputs of the two branches can be described as
$O_i^l = \mathrm{MK}(R),$
$O_i^g = \mathrm{TR}(R),$
where $\mathrm{MK}(\cdot)$ represents the overall computation of the local branch, and $\mathrm{TR}(\cdot)$ represents the overall computation of the global branch. Both local and global information are then fused with a dynamically learned spatial attention map:
$C_i = \mathrm{Concat}(O_i^l, O_i^g),$
$W_s = \sigma(\phi_{1\times 1}(\mathrm{GELU}(\phi_{1\times 1}(C_i)))),$
$F_i = W_s \times O_i^l + (1 - W_s) \times O_i^g,$
where $\sigma$ is the sigmoid function and $\phi_{1\times 1}$ is a convolution layer with a 1 × 1 kernel. $W_s$ is the spatial attention map, dynamically learned from $O_i^l$ and $O_i^g$. It allows the local and global information at each pixel to be fused with different weights. Because it varies across positions and inputs, the network can allocate more weight to local detail or to global context as needed. With this design, the module can be adapted to various fusion situations.
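The adaptive fusion step can be sketched as follows; the width of the intermediate 1 × 1 convolution is an assumption, and only the gating logic is meant to mirror the equations above.

```python
import torch
import torch.nn as nn

class AdaptiveLocalGlobalGate(nn.Module):
    """Predict a per-pixel weight map W_s from the concatenated local/global outputs
    and blend them as W_s * O^l + (1 - W_s) * O^g."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.GELU(),
            nn.Conv2d(channels, 1, 1),       # single-channel spatial attention map
            nn.Sigmoid(),
        )

    def forward(self, o_local, o_global):    # both (B, C, H, W)
        w = self.gate(torch.cat([o_local, o_global], dim=1))   # W_s: (B, 1, H, W)
        return w * o_local + (1.0 - w) * o_global
```

In this sketch the gate produces a single-channel map, so the same pixel-wise weight is shared across channels, matching the "spatial" attention interpretation.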
The ALGFF module is used to progressively fuse the multi-modal and multi-scale features from coarse to fine, as shown in Figure 2b. When processing the $i$-th scale, we first fuse the multi-modal features of the same scale; that is, $F_i^r$ and $F_i^x$ are used as the inputs X and Y of the ALGFF module to obtain the multi-modal fused feature $F_i$ as follows:
$F_i = \mathrm{ALGFF}(F_i^r, F_i^x).$
Then, multi-scale feature fusion is also performed with the ALGFF module. Since the fusion proceeds in a coarse-to-fine manner, the multi-scale fusion involves the fused feature from the larger scale. That is, to obtain the final fused feature $\tilde{F}_i$ of this scale, $\tilde{F}_{i+1}$ and $F_i$ are used as the inputs X and Y of the ALGFF module, as shown below:
$\tilde{F}_i = \mathrm{ALGFF}(\tilde{F}_{i+1}, F_i).$
Note that, when $i = 5$, there is no feature from a larger scale, so only multi-modal fusion is performed, without multi-scale fusion. After these processing steps, our model is able to segment images with greater precision.
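The coarse-to-fine schedule can be summarized by the following sketch; the bilinear resizing used to align feature resolutions is our own assumption, since the text does not state how scales are spatially matched.

```python
import torch
import torch.nn.functional as F

def progressive_fusion(algff_modal, algff_scale, feats_rgb, feats_x):
    """Coarse-to-fine decoding: per-scale multi-modal fusion followed by multi-scale fusion.

    feats_rgb, feats_x: lists of per-scale features [F_1, ..., F_5] (0-indexed in code).
    algff_modal[i] / algff_scale[i]: ALGFF modules for multi-modal / multi-scale fusion.
    Returns the refined features [F~_1, ..., F~_5].
    """
    fused = [None] * len(feats_rgb)
    prev = None
    for i in reversed(range(len(feats_rgb))):            # start from the highest scale (i = 5)
        f_i = algff_modal[i](feats_rgb[i], feats_x[i])   # F_i = ALGFF(F_i^r, F_i^x)
        if prev is None:                                 # i = 5: no larger-scale feature exists
            fused[i] = f_i
        else:
            if prev.shape[-2:] != f_i.shape[-2:]:        # spatial alignment, if scales differ
                prev = F.interpolate(prev, size=f_i.shape[-2:],
                                     mode="bilinear", align_corners=False)
            fused[i] = algff_scale[i](prev, f_i)         # F~_i = ALGFF(F~_{i+1}, F_i)
        prev = fused[i]
    return fused
```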

3.3. Progressive Prediction

Finally, we apply progressive prediction. After the features of each scale pass through the ALGFF modules, an intermediate prediction map is generated. We adjust the size of each intermediate prediction map through upsampling and convolution, and then combine all the intermediate maps using a 1 × 1 convolution to generate the final prediction map. The process is as follows:
$P_i = \phi_{1\times 1}(\varphi(\tilde{F}_i)),$
$P = \phi_{1\times 1}([P_0, P_1, P_2, P_3, P_4]),$
where $\phi_{1\times 1}$ represents a 1 × 1 convolution, $\varphi$ is the upsampling operation, and $P_i$ is the $i$-th prediction mask. $P$ is the final prediction map. The proposed ALGFF module optimizes the transmission of information between different layers through progressive feature fusion and maximizes the retention of useful information. The adaptive spatial fusion operation resolves information conflicts during feature fusion. We use the pyramid structure and the proposed ALGFF module to effectively exploit RGB and auxiliary information, making underwater segmentation more efficient and refined.
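A possible realization of the progressive prediction head is sketched below; the output resolution of 512 and the use of bilinear upsampling are assumptions consistent with the training settings in Section 4.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressivePrediction(nn.Module):
    """A per-scale 1x1 convolution after upsampling gives the intermediate maps P_i;
    concatenating them and applying a final 1x1 convolution gives the output map P."""
    def __init__(self, channels: int, num_scales: int = 5, out_size: int = 512):
        super().__init__()
        self.out_size = out_size
        self.heads = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(num_scales))
        self.final = nn.Conv2d(num_scales, 1, 1)

    def forward(self, fused_feats):                      # list of F~_i, each (B, C, h_i, w_i)
        maps = []
        for head, feat in zip(self.heads, fused_feats):
            up = F.interpolate(feat, size=(self.out_size, self.out_size),
                               mode="bilinear", align_corners=False)
            maps.append(head(up))                        # P_i = conv_1x1(upsample(F~_i))
        final_map = self.final(torch.cat(maps, dim=1))   # P = conv_1x1([P_0, ..., P_4])
        return final_map, maps                           # intermediate maps kept for deep supervision
```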
During training, we employ three widely adopted loss functions: Binary Cross-Entropy (BCE) Loss [34,35], Structural Similarity (SSIM) Loss [36], and Intersection over Union (IoU) Loss [37,38]. BCE Loss optimizes the separation of foreground and background by measuring the discrepancy between the predicted probability and the true label at each pixel. SSIM Loss improves perceptual quality by enhancing the similarity in luminance, contrast, and structural details between the predictions and the ground truth. IoU Loss directly boosts localization performance by maximizing the overlap between the predicted and ground-truth regions. These loss functions are applied to each intermediate prediction map, and the final loss is obtained by summing the losses of all prediction levels.
$L_i = L_{\mathrm{BCE}}^i + L_{\mathrm{SSIM}}^i + L_{\mathrm{IoU}}^i,$
$L = \sum_{i=0}^{4} L_i,$
where $L_i$ represents the loss of the $i$-th level prediction, and $L$ is the total training loss.
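A sketch of the combined training loss is given below; `ssim_fn` stands for any differentiable SSIM implementation (not specified here), and the soft-IoU formulation is one common choice rather than necessarily the exact variant used in the paper.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss between a predicted probability map and a binary mask."""
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def total_loss(prediction_logits, target, ssim_fn):
    """BCE + SSIM + IoU applied to every prediction level, then summed."""
    loss = 0.0
    for logits in prediction_logits:
        prob = torch.sigmoid(logits)
        l_bce = F.binary_cross_entropy_with_logits(logits, target)
        l_ssim = 1.0 - ssim_fn(prob, target)   # ssim_fn returns a similarity in [0, 1]
        l_iou = iou_loss(prob, target)
        loss = loss + l_bce + l_ssim + l_iou
    return loss
```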

4. Experiments

4.1. Datasets and Evaluation Metrics

To show that the proposed structure is suitable for both RGB+D and RGB+O marine object detection, we use three common datasets and four evaluation metrics. The RMAs [39] dataset contains 3014 ocean images with marine animals. Mask3K [40] is a dataset focusing on marine animal segmentation, including 3103 images, of which 193 are background images. The scenes in these two datasets are complicated, with noisy backgrounds, where depth maps can be helpful for distinguishing the foreground. The DeepFish dataset we use is the processed version from [13]. The original data are videos [41], which [13] split into frame images; the RAFT algorithm was then used to generate optical flow so that motion information can supplement the RGB information. For more details on the processing, please refer to [13]. DeepFish [13] contains videos of underwater fish of 72 species, where optical flow can be helpful for detecting moving objects.
In this work, RGB-D and RGB-O marine object detection models share an identical architecture but are trained separately. For the RGB-D model evaluated on the RMAs dataset, a split of 1769 training and 1141 testing images is adopted, consistent with MAS-SAM [1]. Training on the Mask3K dataset employs 2514 images for training and 500 for testing, also following the MAS-SAM protocol. The RGB-O model is trained on the DeepFish dataset using 3107 training samples and 609 for testing, in alignment with the setup of MSGNet [13].
The model performance is assessed using four standard evaluation metrics. The first one is the Mean Absolute Error (MAE) [42], which measures the mean absolute deviation per pixel between the predicted result and the ground truth:
$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$
where $S(x, y)$ and $G(x, y)$ represent the predicted saliency probability and the ground-truth value, respectively, while W and H denote the image dimensions.
The second one is the maximum F-measure ($F_m$) [43]. It is a comprehensive metric that combines precision and recall. The F-measure is defined as
$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}},$
where $\beta^2$ is usually set to 0.3. The maximum F-measure is the maximum value of $F_\beta$ over all possible thresholds.
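For reference, these two threshold-based metrics can be computed as in the following NumPy sketch; the granularity of the threshold sweep is an implementation choice.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute per-pixel error between a prediction and a ground truth, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """F_beta over a sweep of binarization thresholds; the maximum value is reported."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```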
The third one is the structure measure ($S_m$) [44]. This metric measures the structural similarity between the predicted segmentation mask and the ground truth, based on the comparison of luminance, contrast, and structure. It is defined as
$S_m = \alpha S_o + (1 - \alpha) S_r,$
where $\alpha \in [0, 1]$ is the balance parameter, set to 0.5, $S_o$ is the object-aware structural similarity measure, and $S_r$ is the region-aware structural similarity measure describing the similarity between the predicted saliency map and the ground truth.
The Mean Enhanced-Alignment Measure ($E_m$) [45] is typically based on the alignment characteristics of regions and boundaries, and it is used to evaluate the degree of matching between segmentation results and ground truth in terms of both regions and boundaries. Its formula can be expressed as follows:
$E_m = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times (TP_r \times TP_b)}{TP_r + TP_b},$
where N denotes the number of test samples. $TP_r$ (true positive for region) represents the true positive at the regional level, which is a metric related to the overlap between the part of the segmentation result correctly predicted as the target region and the actual target region. $TP_b$ (true positive for boundary) represents the true positive at the boundary level, which is a metric related to the overlap between the boundary of the segmentation result and the boundary of the ground truth.
Meanwhile, to validate the efficiency of our model, we also include the number of parameters, the floating-point operations (FLOPs), and the frames-per-second (FPS). These metrics can be used to indicate the model size, the computational complexity, and the testing speed.

4.2. Implementation Details

All experiments were conducted on an NVIDIA GeForce RTX 4090 GPU, using Python 3.10.16 and the PyTorch 2.5.1 framework (built with CUDA 12.4). We initialize the encoder with the SAM-B model, pre-trained on the SA-1B dataset. Other newly incorporated components, including LoRA layers, adapter modules, and the decoder, are fine-tuned on the marine organism datasets. The model is trained for 50 epochs using the AdamW optimizer [46] with an initial learning rate of 0.001, which is reduced by one order of magnitude every 22 epochs. In accordance with [1], we adopt a batch size of 6 and an input resolution of 512 × 512 pixels. To improve robustness, data augmentation techniques such as random flipping and rotation are applied during training.
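The stated optimization settings translate into a training loop roughly like the sketch below; the model interface (`model(rgb, aux)` returning a final map plus side outputs), the data loader, and the `criterion` hook are assumptions, not the released code.

```python
import torch

def train(model, train_loader, criterion, epochs=50):
    """Training loop matching the stated settings: AdamW, initial lr 1e-3, decayed by
    10x every 22 epochs, with only the LoRA/adapter/decoder parameters left trainable."""
    params = [p for p in model.parameters() if p.requires_grad]   # frozen SAM weights excluded
    optimizer = torch.optim.AdamW(params, lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=22, gamma=0.1)
    for _ in range(epochs):
        for rgb, aux, mask in train_loader:        # batches of six 512 x 512 image pairs
            optimizer.zero_grad()
            final_pred, side_preds = model(rgb, aux)
            criterion(side_preds + [final_pred], mask).backward()
            optimizer.step()
        scheduler.step()
```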

4.3. Comparison with the State of the Art

In this section, our method is compared with other methods on three common datasets. We compare against previous methods, namely TFL-Net [47], MAS-SAM [1], AFNet [48], iGAN [49], TC-USOD [12], and Dual-SAM [11]. Both the qualitative and quantitative results show that our method is more advantageous.
Quantitative Comparisons. The quantitative comparisons with state-of-the-art methods are summarized in Table 1. On the RMAs dataset, our approach obtains the top performance in terms of F m , S m , and E m . With an MAE value merely 0.001 higher than that of Dual-SAM, our method ranks second in this metric. Similar superiority is observed on the Mask3K dataset, where our model yields the highest F m and S m scores. Furthermore, on the DeepFish dataset, the proposed method consistently outperforms all others across every evaluation metric, demonstrating a substantial margin of improvement.
Meanwhile, as shown in Table 1, our method has 98.59 M parameters, of which only 12.45 M are trainable, while the rest remain frozen during training. This is much smaller than most of the other methods except MAS-SAM, mainly because MAS-SAM only deals with RGB images, whereas our method additionally uses complementary information from another modality. As a result, our method achieves significant performance improvements, especially on the DeepFish dataset. We have only 1.32 M more trainable parameters than MAS-SAM and 61.85 M fewer trainable parameters than Dual-SAM, which demonstrates the efficiency of our model.
We also include FLOPs and FPS in Table 1 and Table 2 for reference. Our method has relatively higher computational cost and slower testing speed than most of the other methods. However, it should be noted that AFNet, iGAN, TFL-Net, and TC-USOD rely on convolutional backbones, which naturally have lower FLOPs and faster speed. Our method relies on a Transformer-based encoder, which has larger FLOPs and a lower FPS but produces more accurate segmentation predictions. In addition, MAS-SAM also applies the Transformer-based SAM encoder, but it can only deal with the RGB modality without using any complementary information from other modalities. It therefore has slightly smaller FLOPs and a higher FPS, but also poorer segmentation performance. Dual-SAM uses a SAM encoder and can handle both modalities as our method does, but it has much higher FLOPs of 325.68 G, while our method only has 252.11 G. It also has a lower FPS, and our method still generally performs better than Dual-SAM in segmentation prediction.
Qualitative Comparisons. A qualitative comparison between our method and other leading approaches is provided in Figure 3 to visually assess the segmentation effectiveness. Representative examples reveal a clear visual advantage of our method. For instance, in segmenting overall structures (rows 1–2), our approach accurately identifies global morphological characteristics while mitigating typical issues such as structural fragmentation or contour inaccuracies often observed in alternative models. In scenes involving camouflaged marine organisms (rows 3–5), where targets closely resemble their surroundings, our technique reliably distinguishes organisms from the background, substantially reducing missed detections and mis-segmentations. When delineating fine-grained boundaries (rows 6–7), our results exhibit refined edge details and continuous segmentation contours, outperforming other methods that tend to produce blurred or irregular boundaries.
Overall, the proposed method produces clearer segmentations with higher target completeness compared to existing techniques. It consistently delivers high-quality results even when processing challenging images characterized by cluttered backgrounds and abundant details, demonstrating robust performance under complex scenarios. The superior segmentation outcomes can be attributed to the effective incorporation of task-specific auxiliary information within our design. Such information offers enhanced cues for target localization and feature discrimination, considerably strengthening the model’s capacity to address complex environments, camouflaged objects, and detailed boundary preservation.

4.4. Ablation Studies

In this part, we verify the effectiveness of each component of the proposed structure through an ablation study as follows.
Effectiveness of Dual Branches. To demonstrate the effectiveness of the dual branches design of our method, we compare the results of using our method with single-branch and dual-branch models in Table 3. The “RGB” in the table denotes a single-branch model using only RGB images as input, while “RGB+depth(single)” refers to a single-branch model trained with a simple concatenation of RGB and depth images as input. These two models still use LoRA and an adapter for fine-tuning the SAM encoder and ALGFF modules for multi-scale feature fusion. “Ours” represents our dual-branch model that simultaneously processes both RGB and auxiliary information (depth in this table). All the results are performed on the RMAs dataset.
As shown in Table 3, the “RGB” model, which uses only the information from RGB images, tends to have relatively lower performance across all metrics. However, it is slightly better than MAS-SAM (in Table 3), which shows the effectiveness of our method even with only RGB images as input. “RGB+depth(single)” incorporates depth information with a single branch. However, the absence of a dynamic multi-modal fusion mechanism limits its ability to fully exploit the complementary characteristics inherent in RGB and depth data. In contrast, our approach achieves better performance with dual branches that handle the two modalities and perform dynamic fusion on multi-scale and multi-modal features. This experiment demonstrates the effectiveness of our proposed RGB+X framework for marine object segmentation.
Effect of ALGFF on multi-modal and multi-scale feature fusion. To further demonstrate the effectiveness of ALGFF for multi-modal and multi-scale feature fusion, we evaluate several models with or without ALGFF during fusion and show the results in Table 4. Here, “B-add” is a baseline that replaces the ALGFF module with simple summation for multi-modal and multi-scale feature fusion, and “B-concat” is a baseline that replaces it with simple concatenation. “Multi-scale” refers to the model where multi-scale feature fusion is performed with the ALGFF module while multi-modal feature fusion still relies on simple addition. “Ours” indicates that both multi-scale and multi-modal features are fused with the ALGFF module. We also replace the ALGFF module with the FAM module, which is designed for multi-scale feature fusion in MAS-SAM; this variant is denoted “FAM”. All results are obtained on the RMAs dataset.
As shown in Table 4, “B-add” and “B-concat” perform worse because they employ only simple fusion strategies. When we use the ALGFF module to adaptively fuse multi-scale information from the two modalities in the “multi-scale” model, all evaluation metrics improve significantly. When the ALGFF module is applied to both multi-modal and multi-scale feature fusion, as in “Ours”, further performance gains are observed. Meanwhile, the “FAM” variant still generally performs worse than our method, especially on metrics such as $F_m$ and $S_m$. This indicates the effectiveness of the ALGFF module, which captures both local and global information during fusion and dynamically adjusts the fusion weights, thereby fully leveraging the complementarity between multi-modal and multi-scale information. We also provide the number of parameters and FLOPs in Table 4. The increases in the number of parameters and FLOPs are minimal, which indicates the efficiency of our proposed ALGFF module.
Figure 4 presents some visualizations of the feature maps with and without the proposed ALGFF module. It can be seen that the features of the “B-add” only show the coarse location of the objects. By using the ALGFF module for feature fusion, the features around objects are enhanced, which would be helpful for improving the performance of segmentation. These results demonstrate that the proposed ALGFF module can effectively leverage the complementarity of multi-modal and multi-scale information, which can be helpful for marine object segmentation.
Effect of each component of the ALGFF. To validate the effectiveness of each component in ALGFF, we design several models, as shown in Table 5. “B-add” is the baseline using only addition for multi-scale and multi-modal feature fusion without ALGFF. “Local-branch” indicates that we only use the parallel convolutional kernels of varying sizes in ALGFF to capture local information during fusion. “Equal-weight” indicates that global information is further captured with efficient attention layers in ALGFF, but local and global information are fused with equal weights, while “Ours” integrates local and global information with the adaptive weights as designed. The experiments are performed on both the RMAs and DeepFish datasets.
As observed in Table 5 and Table 6, “B-add” yields the poorest performance across all metrics because it does not capture local–global information during fusion. When the local branch is incorporated (“local-branch”) to capture fine-grained details during fusion, a notable improvement in segmentation performance is observed, especially on $S_m$. The global context is further captured in “Equal-weight”, but without dynamic fusion of the two kinds of information, the performance slightly decreases. By using the adaptive weights, we achieve generally better performance than all these variants. These results demonstrate that each component of the ALGFF module is necessary and helpful for achieving accurate marine object segmentation.

4.5. Failure Cases

It is worth noting that the proposed approach still encounters difficulties when handling objects of small size or those containing complex structural details, especially in contexts demanding high precision in edge segmentation. As illustrated in Figure 5, while the segmented outlines of marine animals are generally well-defined in a majority of cases, certain local boundaries appear inadequately refined or exhibit coarseness. Our method shows improved detection performance over Dual-SAM. Nevertheless, achieving higher accuracy in capturing intricate structures and delineating precise boundaries remains an area for future enhancement.
To mitigate this limitation, we plan to incorporate explicit boundary-aware cues in future work, for instance, by introducing auxiliary edge detection tasks to enhance the model’s prediction capability to capture the boundaries. This is expected to further improve the quality and robustness of the segmentation results along boundary details.

5. Conclusions

In this paper, we present a novel feature learning framework for marine object segmentation. Built upon SAM as the backbone, our approach incorporates fine-tuning through LoRA and adapter mechanisms to effectively extract features from RGB, depth, and optical flow modalities. At the decoder stage, we integrate the complementary strengths of CNN and Transformer architectures and propose the Adaptive Local–Global Feature Fusion (ALGFF) module. By unifying multi-modal and multi-scale feature processing, ALGFF facilitates efficient and adaptive fusion, capably handling both multi-scale and multi-modal representations.
Experimental results on three benchmark datasets (RMAs, Mask3K, and DeepFish) demonstrate the effectiveness of our method, which achieves superior performance compared to various state-of-the-art models across multiple evaluation metrics. However, we also identify limitations in the current design, including possible inefficiencies in parallel feature extraction and a lack of in-depth computational analysis.
To address these issues and further improve the framework, future work will aim to enhance the model’s effectiveness and efficiency. We plan to develop more modality-specific tuning strategies such as integrating mixture-of-experts modules into the SAM encoder, which may better handle highly divergent modalities. Additionally, we will introduce boundary-aware learning techniques, including auxiliary edge detection and hierarchical feature fusion, to recover fine details and improve structural accuracy, especially for small and complex objects. Further efforts will focus on reducing the inference time and computational costs to perform real-time marine object segmentation.

Author Contributions

Conceptualization, Y.W. (Yue Wang) and H.Y.; methodology, Y.W. (Yue Wang) and Y.J.; validation, Y.J., Y.G., and Y.W. (Yifei Wang); formal analysis, Y.J.; investigation, Y.J. and Y.W. (Yue Wang); writing—original draft preparation, Y.J.; writing—review and editing, Y.J. and Y.W. (Yue Wang); visualization, Y.J., Y.W. (Yifei Wang) and Y.G.; supervision, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China (Grant Nos. 62406052, 32573571), Liaoning Province Natural Science Foundation (Grant No. 2024-BS-214), the Key R&D Projects in Liaoning Province (Grant No. 2023JH26/10200015), Basic Research Funding Projects of Liaoning Provincial Department of Education (Grant No. LJ212410158022), the special fund for basic scientific research operations of undergraduate universities affiliated to Liaoning Province (Grant No. 2024JBQNZ011).

Data Availability Statement

The original data presented in the study are openly available in the following platforms: IEEE Xplore at 10.1109/joe.2023.3252760 (accessed on 24 September 2025), SpringerLink at 10.1007/978-3-030-71058-3_12 (accessed on 24 September 2025), ScienceDirect at 10.1016/j.neucom.2021.07.018 (accessed on 24 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yan, T.; Wan, Z.; Deng, X.; Zhang, P.; Liu, Y.; Lu, H. MAS-SAM: Segment any marine animal with aggregated features. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 6886–6894. [Google Scholar]
  2. Chen, Z.; Sun, Y.; Gu, Y.; Wang, H.; Qian, H.; Zheng, H. Underwater object segmentation integrating transmission and saliency features. IEEE Access 2019, 7, 72420–72430. [Google Scholar] [CrossRef]
  3. Sun, Y.; Zhe, C.; Wang, H.; Zhang, Z.; Shen, J. Level set method combining region and edge features for segmenting underwater images. J. Image Graph. 2020, 25, 824–835. [Google Scholar]
  4. Ma, Z.; Wang, C.; Niu, Y.; Wang, X.; Shen, L. A saliency-based reinforcement learning approach for a UAV to avoid flying obstacles. Robot. Auton. Syst. 2018, 100, 108–118. [Google Scholar] [CrossRef]
  5. Beijbom, O.; Edmunds, P.J.; Kline, D.I.; Mitchell, B.G.; Kriegman, D. Automated Annotation of Coral Reef Survey Images. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1170–1177. [Google Scholar]
  6. Yang, Y.; Li, D.; Zhao, S. A novel approach for underwater fish segmentation in complex scenes based on multi-levels triangular atrous convolution. Aquac. Int. 2024, 32, 5215–5240. [Google Scholar] [CrossRef]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  8. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  9. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  10. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  11. Zhang, P.; Yan, T.; Liu, Y.; Lu, H. Fantastic animals and where to find them: Segment any marine animal with dual sam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2578–2587. [Google Scholar]
  12. Hong, L.; Wang, X.; Zhang, G.; Zhao, M. Usod10k: A new benchmark dataset for underwater salient object detection. IEEE Trans. Image Process. 2023, 34, 1602–1615. [Google Scholar] [CrossRef]
  13. Zhang, P.; Yu, H.; Li, H.; Zhang, X.; Wei, S.; Tu, W.; Yang, Z.; Wu, J.; Lin, Y. Msgnet: Multi-source guidance network for fish segmentation in underwater videos. Front. Mar. Sci. 2023, 10, 1256594. [Google Scholar] [CrossRef]
  14. Boudhane, M.; Nsiri, B. Underwater image processing method for fish localization and detection in submarine environment. J. Vis. Commun. Image Represent. 2016, 39, 226–238. [Google Scholar] [CrossRef]
  15. Zhu, S.; Luo, W.; Duan, S. Enhancement of underwater images by CNN-based color balance and dehazing. Electronics 2022, 11, 2537. [Google Scholar] [CrossRef]
  16. Zhang, P.; Yang, Z.; Yu, H.; Tu, W.; Gao, C.; Wang, Y. RUSNet: Robust fish segmentation in underwater videos based on adaptive selection of optical flow. Front. Mar. Sci. 2024, 11, 1471312. [Google Scholar] [CrossRef]
  17. Zuo, X.; Jiang, J.; Shen, J.; Yang, W. Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention. Pattern Anal. Appl. 2025, 28, 1–12. [Google Scholar] [CrossRef]
  18. Wen, J.; Cui, J.; Yang, G.; Zhao, B.; Zhai, Y.; Gao, Z.; Dou, L.; Chen, B.M. Waterformer: A global–local transformer for underwater image enhancement with environment adaptor. IEEE Robot. Autom. Mag. 2024, 31, 29–40. [Google Scholar] [CrossRef]
  19. Chen, Z.; Tang, J.; Wang, G.; Li, S.; Li, X.; Ji, X.; Li, X. UW-SDF: Exploiting Hybrid Geometric Priors for Neural SDF Reconstruction from Underwater Multi-view Monocular Images. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 14248–14255. [Google Scholar]
  20. Song, K.; Wang, H.; Zhao, Y.; Huang, L.; Dong, H.; Yan, Y. Lightweight multi-level feature difference fusion network for RGB-DT salient object detection. J. King Saud-Univ.-Comput. Inf. Sci. 2023, 35, 101702. [Google Scholar] [CrossRef]
  21. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
  22. Wu, Z.; Allibert, G.; Meriaudeau, F.; Ma, C.; Demonceaux, C. Hidanet: Rgb-d salient object detection via hierarchical depth awareness. IEEE Trans. Image Process. 2023, 32, 2160–2173. [Google Scholar] [CrossRef]
  23. Wu, L.F.; Wei, D.; Xu, C.A. CFANet: The Cross-Modal Fusion Attention Network for Indoor RGB-D Semantic Segmentation. J. Imaging 2025, 11, 177. [Google Scholar] [CrossRef]
  24. Yang, X.; Li, Q.; Yu, D.; Gao, Z.; Huo, G. Polarization spatial and semantic learning lightweight network for underwater salient object detection. J. Electron. Imaging 2024, 33, 033010. [Google Scholar] [CrossRef]
  25. Zhang, T.; Zhang, H.; Li, Y.; Nakamura, Y.; Zhang, L. Flowfusion: Dynamic dense rgb-d slam based on optical flow. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 7322–7328. [Google Scholar]
  26. Luiten, J.; Kopanas, G.; Leibe, B.; Ramanan, D. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 800–809. [Google Scholar]
  27. Sun, W.; Cao, L.; Guo, Y.; Du, K. Multimodal and multiscale feature fusion for weakly supervised video anomaly detection. Sci. Rep. 2024, 14, 22835. [Google Scholar] [CrossRef]
  28. Anvarov, F.; Kim, D.H.; Song, B.C. Action recognition using deep 3D CNNs with sequential feature aggregation and attention. Electronics 2020, 9, 147. [Google Scholar] [CrossRef]
  29. Hui, T.; Xun, Z.; Peng, F.; Huang, J.; Wei, X.; Wei, X.; Dai, J.; Han, J.; Liu, S. Bridging search region interaction with template for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13630–13639. [Google Scholar]
  30. Xue, Y.; Jin, G.; Zhong, B.; Shen, T.; Tan, L.; Xue, C.; Zheng, Y. FMTrack: Frequency-aware Interaction and Multi-Expert Fusion for RGB-T Tracking. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
  31. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-conquer: Confluent triple-flow network for RGB-T salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974. [Google Scholar] [CrossRef]
  32. Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 5547–5569. [Google Scholar]
  33. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3531–3539. [Google Scholar]
  34. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
  35. Connor, R.; Dearle, A.; Claydon, B.; Vadicamo, L. Correlations of cross-entropy loss in machine learning. Entropy 2024, 26, 491. [Google Scholar] [CrossRef]
  36. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  37. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  38. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–23 October 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  39. Fu, Z.; Chen, R.; Huang, Y.; Cheng, E.; Ding, X.; Ma, K.K. Masnet: A robust deep marine animal segmentation network. IEEE J. Ocean. Eng. 2023, 49, 1104–1115. [Google Scholar] [CrossRef]
  40. Li, L.; Rigall, E.; Dong, J.; Chen, G. Mas3k: An open dataset for marine animal segmentation. In Proceedings of the International Symposium on Benchmarking, Measuring and Optimization, Virtual Event, 15–16 November 2020; pp. 194–212. [Google Scholar]
  41. Saleh, A.; Laradji, I.H.; Konovalov, D.A.; Bradley, M.; Vazquez, D.; Sheaves, M. A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Sci. Rep. 2020, 10, 14671. [Google Scholar] [CrossRef] [PubMed]
  42. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
  43. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
  44. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-Measure: A New Way to Evaluate Foreground Maps. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  45. Fan, D.P.; Ji, G.P.; Qin, X.; Cheng, M.M. Cognitive vision inspired object segmentation metric and loss function. Sci. Sin. Informationis 2021, 6, 5. [Google Scholar]
  46. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  47. Huang, L.; Gong, A. Trigonometric feature learning for RGBD and RGBT image salient object detection. Knowl.-Based Syst. 2025, 310, 112935. [Google Scholar] [CrossRef]
  48. Chen, T.; Xiao, J.; Hu, X.; Zhang, G.; Wang, S. Adaptive fusion network for RGB-D salient object detection. Neurocomputing 2023, 522, 152–164. [Google Scholar] [CrossRef]
  49. Mao, Y.; Zhang, J.; Wan, Z.; Tian, X.; Li, A.; Lv, Y.; Dai, Y. Generative transformer for accurate and reliable salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 1041–1054. [Google Scholar] [CrossRef]
Figure 1. The overall structure of our proposed framework. It consists of two main parts: Parallel Feature Extraction and Adaptive Local–Global Feature Fusion. The parallel feature extraction part is a SAM encoder with parallel LoRA and adapter branches, which extracts multi-scale RGB features and auxiliary information. The Adaptive Local–Global Feature Fusion part performs multi-scale and multi-modal feature fusion, and the PP module is used for progressive prediction.
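To make the parallel feature extraction in Figure 1 easier to follow, the sketch below shows one plausible way a frozen SAM-style projection can be shared by a trainable LoRA branch (for RGB tokens) and a trainable adapter branch (for depth or optical-flow tokens). This is a minimal PyTorch illustration under assumed names and dimensions (ParallelBlock, Adapter, embedding size 256), not the released implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter for the auxiliary branch (illustrative)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.net(x)

class ParallelBlock(nn.Module):
    """One frozen SAM-style projection shared by two parallel trainable branches:
    a LoRA update for RGB tokens and an adapter for auxiliary-modality tokens."""
    def __init__(self, dim: int = 256, rank: int = 4):
        super().__init__()
        self.frozen = nn.Linear(dim, dim)          # stands in for a frozen SAM layer
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.lora_down = nn.Linear(dim, rank, bias=False)  # trainable LoRA (RGB path)
        self.lora_up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.lora_up.weight)        # LoRA starts as a no-op
        self.adapter = Adapter(dim)                # trainable adapter (auxiliary path)

    def forward(self, rgb_tokens, aux_tokens):
        f_rgb = self.frozen(rgb_tokens) + self.lora_up(self.lora_down(rgb_tokens))
        f_aux = self.adapter(self.frozen(aux_tokens))
        return f_rgb, f_aux

# Dummy usage: batch of 2, 196 tokens, embedding dim 256.
block = ParallelBlock(dim=256)
f_rgb, f_aux = block(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(f_rgb.shape, f_aux.shape)  # torch.Size([2, 196, 256]) for both branches
```

Under this scheme only the low-rank matrices and the adapter are updated during fine-tuning, which is consistent with the small trainable-parameter budgets reported in Table 1.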
Figure 2. The structure of our Adaptive Local–Global Feature Fusion (ALGFF) module (a) and the multi-modal and multi-scale fusion scheme (b). (a) consists of two blocks, MK-CNN and TR, where the MK-CNN block extracts local information and the TR block extracts global information. (b) uses two ALGFF modules, one for fusing multi-modal information and the other for fusing multi-scale information.
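The caption above describes ALGFF as a combination of a local MK-CNN block and a global TR block. As a rough, hedged sketch of that idea (not the exact module), the code below mixes a multi-kernel depthwise-convolution branch with a single-head self-attention branch using input-dependent weights; all class names and the gating design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MKConvBranch(nn.Module):
    """Local branch: parallel depthwise convolutions with different kernel sizes (MK-CNN-style, illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in (1, 3, 5)]
        )
        self.proj = nn.Conv2d(3 * dim, dim, 1)

    def forward(self, x):
        return self.proj(torch.cat([c(x) for c in self.convs], dim=1))

class GlobalBranch(nn.Module):
    """Global branch: single-head self-attention over flattened spatial tokens (TR-style, illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, HW, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)

class LocalGlobalFusion(nn.Module):
    """Adaptively weighted sum of local and global features for two input streams."""
    def __init__(self, dim: int):
        super().__init__()
        self.local = MKConvBranch(dim)
        self.global_ = GlobalBranch(dim)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, 2, 1))

    def forward(self, feat_a, feat_b):
        x = feat_a + feat_b                            # e.g. RGB + auxiliary, or two scales
        local, global_ = self.local(x), self.global_(x)
        w = torch.softmax(self.gate(x), dim=1)         # (B, 2, 1, 1) input-dependent weights
        return w[:, 0:1] * local + w[:, 1:2] * global_

# Dummy usage with 64-channel, 32x32 feature maps.
fuse = LocalGlobalFusion(dim=64)
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The input-dependent gate is what the "Equal-weight" ablation in Tables 5 and 6 replaces with fixed 0.5/0.5 weights.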
Figure 3. Visual comparison of the segmentation results predicted by different methods.
Figure 4. The visualization of the feature map with or without ALGFF. “B-add” denotes the heatmap without the ALGFF module, while “Ours” represents the heatmap with the ALGFF module.
Figure 5. Visual examples of some failure cases. “Ours” shows our results, and “Dual-SAM (RGBD)” shows the corresponding results of Dual-SAM.
Table 1. Performance comparison on the RMAs, MAS3K, and DeepFish datasets. We compare our method with MAS-SAM, AFNet, iGAN, TFL-Net, TC-USOD, and Dual-SAM. The best results are marked in red.

| Dataset | Method | Type | MAE | F_m | S_m | E_m | Total Params | Trainable Params | FLOPs |
|---|---|---|---|---|---|---|---|---|---|
| RMAs | MAS-SAM | RGB | 0.0229 | 0.8652 | 0.8561 | 0.9442 | 98.41 M | 11.13 M | 141.03 G |
| RMAs | AFNet | RGB-D | 0.0263 | 0.8527 | 0.8366 | 0.9335 | 254.45 M | 254.45 M | 128.30 G |
| RMAs | iGAN | RGB-D | 0.0258 | 0.8589 | 0.8464 | 0.9325 | 87.32 M | 87.32 M | 47.72 G |
| RMAs | TFL-Net | RGB-D | 0.0243 | 0.8545 | 0.8437 | 0.9426 | 201.45 M | 201.32 M | 54.32 G |
| RMAs | TC-USOD | RGB-D | 0.0237 | 0.8448 | 0.8412 | 0.9230 | 125.96 M | 125.81 M | 29.64 G |
| RMAs | Dual-SAM | RGB-D | 0.0257 | 0.8614 | 0.8400 | 0.9360 | 159.95 M | 74.30 M | 325.68 G |
| RMAs | Dual-SAM | RGB-G | 0.0220 | 0.8609 | 0.8550 | 0.9424 | 159.95 M | 74.30 M | 325.68 G |
| RMAs | Ours | RGB-D | 0.0221 | 0.8769 | 0.8633 | 0.9464 | 98.59 M | 12.45 M | 252.11 G |
| MAS3K | MAS-SAM | RGB | 0.0258 | 0.8748 | 0.8829 | 0.9358 | 98.41 M | 11.13 M | 141.03 G |
| MAS3K | AFNet | RGB-D | 0.0334 | 0.8426 | 0.8581 | 0.9089 | 254.45 M | 254.45 M | 128.30 G |
| MAS3K | iGAN | RGB-D | 0.0288 | 0.8580 | 0.8654 | 0.9128 | 87.32 M | 87.32 M | 47.72 G |
| MAS3K | TFL-Net | RGB-D | 0.0294 | 0.8501 | 0.8623 | 0.9174 | 201.45 M | 201.32 M | 54.32 G |
| MAS3K | TC-USOD | RGB-D | 0.0343 | 0.8229 | 0.8470 | 0.9015 | 125.96 M | 125.81 M | 29.64 G |
| MAS3K | Dual-SAM | RGB-D | 0.0306 | 0.8705 | 0.8703 | 0.9205 | 159.95 M | 74.30 M | 325.68 G |
| MAS3K | Dual-SAM | RGB-G | 0.0252 | 0.8756 | 0.8821 | 0.9330 | 159.95 M | 74.30 M | 325.68 G |
| MAS3K | Ours | RGB-D | 0.0263 | 0.8857 | 0.8863 | 0.9325 | 98.59 M | 12.45 M | 252.11 G |
| DeepFish | MAS-SAM | RGB | 0.0092 | 0.8646 | 0.8783 | 0.9538 | 98.41 M | 11.13 M | 141.03 G |
| DeepFish | TFL-Net | RGB-O | 0.0070 | 0.8893 | 0.8923 | 0.9615 | 201.45 M | 201.32 M | 54.32 G |
| DeepFish | TC-USOD | RGB-O | 0.0070 | 0.8835 | 0.8959 | 0.9546 | 125.96 M | 125.81 M | 29.64 G |
| DeepFish | Dual-SAM | RGB-O | 0.0077 | 0.8783 | 0.8904 | 0.9618 | 159.95 M | 74.30 M | 325.68 G |
| DeepFish | Dual-SAM | RGB-G | 0.0086 | 0.8558 | 0.8727 | 0.9523 | 159.95 M | 74.30 M | 325.68 G |
| DeepFish | Ours | RGB-O | 0.0062 | 0.8964 | 0.9016 | 0.9730 | 98.59 M | 12.45 M | 252.11 G |
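For reference, MAE and F_m in Table 1 are standard measures; the short NumPy sketch below shows how they are typically computed, assuming an adaptive threshold of twice the mean prediction for the F-measure and the conventional β² = 0.3. S_m and E_m involve structural and alignment terms and are omitted for brevity.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a [0,1] prediction map and a binary ground truth."""
    return float(np.abs(pred - gt).mean())

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-measure with an adaptive threshold (2x mean prediction, a common convention)."""
    thr = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thr
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8))

# Toy example: a noisy prediction of a square object.
gt = np.zeros((64, 64)); gt[16:48, 16:48] = 1.0
pred = np.clip(gt + 0.1 * np.random.rand(64, 64), 0.0, 1.0)
print(mae(pred, gt), f_measure(pred, gt))
```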
Table 2. Comparison of testing speed (FPS) on the RMAs dataset. The unit is iterations per second (it/s).

| Method | MAS-SAM | AFNet | iGAN | TFL-Net | TC-USOD | Dual-SAM | Ours |
|---|---|---|---|---|---|---|---|
| FPS (it/s) | 15.17 | 18.65 | 33.13 | 19.58 | 31.81 | 9.91 | 11.03 |
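The speeds in Table 2 are iterations per second. A minimal sketch of how such a number can be measured (with a placeholder model, an assumed 512 × 512 input, warm-up iterations, and GPU synchronization where available) is shown below; it is a generic timing harness, not the exact evaluation script used here.

```python
import time
import torch
import torch.nn as nn

def measure_speed(model: nn.Module, input_shape=(1, 3, 512, 512), iters: int = 100) -> float:
    """Return forward-pass iterations per second on dummy input (illustrative)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations, excluded from timing
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return iters / (time.time() - start)

# Placeholder network standing in for a segmentation model.
dummy = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))
print(f"{measure_speed(dummy):.2f} it/s")
```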
Table 3. Performance effect of the dual branches on the RMAs dataset. The best results are marked in red.

| Method | MAE | F_m | S_m | E_m | Total Params | Trainable Params | FLOPs |
|---|---|---|---|---|---|---|---|
| RGB | 0.0225 | 0.8752 | 0.8605 | 0.9439 | 95.00 M | 11.43 M | 138.99 G |
| RGB+depth (single) | 0.0235 | 0.8693 | 0.8592 | 0.9428 | 95.00 M | 11.51 M | 139.00 G |
| Ours | 0.0221 | 0.8769 | 0.8633 | 0.9464 | 98.59 M | 12.45 M | 252.11 G |
Table 4. The effectiveness of ALGFF on multi-modal and multi-scale feature fusion on the RMAs dataset. The best results are marked in red.

| Method | MAE | F_m | S_m | E_m | Total Params | Trainable Params | FLOPs |
|---|---|---|---|---|---|---|---|
| B-add | 0.0225 | 0.8752 | 0.8578 | 0.9433 | 97.83 M | 11.69 M | 243.39 G |
| B-concat | 0.0222 | 0.8719 | 0.8608 | 0.9388 | 97.96 M | 11.74 M | 243.93 G |
| multi-scale | 0.0220 | 0.8724 | 0.8573 | 0.9457 | 98.31 M | 12.23 M | 248.78 G |
| FAM | 0.0228 | 0.8739 | 0.8584 | 0.9450 | 98.39 M | 12.40 M | 244.92 G |
| Ours | 0.0221 | 0.8769 | 0.8633 | 0.9464 | 98.59 M | 12.45 M | 252.11 G |
Table 5. Performance comparisons of using different modules on the RMAs dataset. The best results are marked in red.

| Method | MAE | F_m | S_m | E_m | Total Params | Trainable Params | FLOPs |
|---|---|---|---|---|---|---|---|
| B-add | 0.0225 | 0.8752 | 0.8578 | 0.9433 | 97.83 M | 11.69 M | 243.39 G |
| local-branch | 0.0222 | 0.8782 | 0.8607 | 0.9439 | 98.20 M | 12.16 M | 246.62 G |
| Equal-weight | 0.0224 | 0.8740 | 0.8602 | 0.9468 | 98.59 M | 12.44 M | 252.06 G |
| Ours | 0.0221 | 0.8769 | 0.8633 | 0.9464 | 98.59 M | 12.45 M | 252.11 G |
Table 6. Performance comparisons of using different modules on the DeepFish dataset. The best results are marked in red.

| Method | MAE | F_m | S_m | E_m | Total Params | Trainable Params | FLOPs |
|---|---|---|---|---|---|---|---|
| B-add | 0.0073 | 0.8789 | 0.8906 | 0.9615 | 97.83 M | 11.69 M | 243.39 G |
| local-branch | 0.0064 | 0.8907 | 0.8974 | 0.9664 | 98.20 M | 12.16 M | 246.62 G |
| Equal-weight | 0.0069 | 0.8872 | 0.8957 | 0.9686 | 98.57 M | 12.44 M | 252.06 G |
| Ours | 0.0062 | 0.8964 | 0.9016 | 0.9730 | 98.59 M | 12.45 M | 252.11 G |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
