Article

A Lightweight Semantic Segmentation Model for Underwater Images Based on DeepLabv3+

School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(5), 162; https://doi.org/10.3390/jimaging11050162
Submission received: 21 April 2025 / Revised: 15 May 2025 / Accepted: 16 May 2025 / Published: 19 May 2025
(This article belongs to the Section Image and Video Processing)

Abstract

Underwater object image processing is a crucial technology for marine environmental exploration. The complexity of marine environments typically results in underwater object images exhibiting color deviation, imbalanced contrast, and blurring. Existing semantic segmentation methods for underwater objects either suffer from low segmentation accuracy or fail to meet the lightweight requirements of underwater hardware. To address these challenges, this study proposes a lightweight semantic segmentation model based on DeepLabv3+. The framework employs MobileOne-S0 as the lightweight backbone for feature extraction, integrates the Simple, Parameter-Free Attention Module (SimAM) into deep feature layers, replaces global average pooling in the Atrous Spatial Pyramid Pooling (ASPP) module with strip pooling, and adopts a content-guided attention (CGA)-based mixup fusion scheme to effectively combine high-level and low-level features while minimizing parameter redundancy. Experimental results demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 71.18% on the DUT-USEG dataset, with parameters and computational complexity reduced to 6.628 M and 39.612 GFLOPs, respectively. These advancements significantly enhance segmentation accuracy while maintaining model efficiency, making the model highly suitable for resource-constrained underwater applications.

1. Introduction

Underwater image semantic segmentation is critical for advancing marine exploration, habitat monitoring, and autonomous underwater operations. However, the aquatic environment introduces unique challenges: light attenuation and scattering in water cause severe color distortion, while suspended particles and turbidity degrade image contrast, resulting in blurred textures and ambiguous object boundaries. These degradation effects not only obscure critical visual cues but also amplify the complexity of feature extraction, rendering conventional segmentation models—often optimized for terrestrial imagery—ineffective in underwater scenarios.
Recent advances in the field of image segmentation leverage deep learning to deal with complex scenes. The atrous convolution in DeepLab [1] has shown its effectiveness in extracting multi-scale contextual features, and Fully Convolutional Networks [2] provide a unified framework for pixel-level prediction. The integration of dynamic kernel mechanisms in K-Net [3] enables adaptive feature refinement for different segmentation tasks, and Transformer-based DETR [4] provides an end-to-end paradigm by detecting and segmenting objects in a unified way. Asymmetric Non-local Networks [5] effectively model long-range spatial dependencies and reduce the computational overhead.
With the growing demand for underwater machine vision, semantic segmentation of underwater images has emerged as a critical research field. Liu et al. [6] developed an underwater semantic segmentation network with an unsupervised color correction module to improve input image quality. Zhou et al. [7,8,9] proposed a suite of underwater image enhancement techniques to address diverse degradation issues. Islam et al. [10] introduced SUIM, the first large-scale dataset for underwater semantic segmentation. WaterBiSeg-Net [11] performs real-time segmentation of marine debris by using multi-scale information to enhance salient cues and suppress background clutter. UIE-Convformer [12] combines a CNN with a feature-fusion Transformer, using a multi-scale U-Net, local and global feature extraction, and multi-scale fusion and refinement modules to obtain better segmentation results. CEWformer [13], a Transformer-based network that performs image enhancement and watermarking simultaneously, offers a new perspective on underwater segmentation. However, most existing models overlook lightweight design, prompting increasing attention to efficient architectures. ENet [14] reduced model complexity but suffered from a limited receptive field. OCNet [15] integrated cross-sparse self-attention with dilated pyramid pooling to boost accuracy. Separable convolution-based networks minimized computational load, and CGNet [16] jointly learned local and global features with under 0.5 M parameters, albeit with compromised accuracy. DFANet [17] proposed deep feature aggregation to combine network-level and stage-level representations, while LEDNet [18] accelerated processing via a lightweight upsampling module. BiSeNetV2 [19] adopted a two-branch architecture in which the semantic branch guides the spatial branch to improve performance. Despite these advances in terrestrial domains, underwater segmentation remains challenging: the low contrast, high noise, and inherent blurriness of underwater imagery often yield rough edges and ambiguous semantic boundaries in segmentation outputs. Furthermore, the real-time processing demands of resource-constrained platforms, such as autonomous underwater vehicles (AUVs) and low-power embedded systems, necessitate models that balance accuracy with computational efficiency.
To address these problems, this paper proposes a lightweight semantic segmentation network for underwater scenarios and conducts experiments on the underwater semantic segmentation dataset DUT-USEG [20]. Following lightweight architecture design criteria, the MobileOne-S0 [21] network is selected as the main feature extractor. Through structural reparameterization, this network uses a multi-branch topology during training to enhance feature representation and merges the branches into a single path during inference, achieving efficient feature encoding at low computational cost. To further enhance multi-scale feature representation, the Simple, Parameter-Free Attention Module (SimAM) [22] is introduced. Built on an energy-function formulation, it constructs three-dimensional attention weights that implicitly model feature correlations across the channel and spatial dimensions, adaptively enhancing salient responses in target regions without introducing additional learnable parameters. To overcome the limited global-context capture of the ASPP module, strip pooling [23] replaces the traditional global average pooling: long, narrow pooling kernels in the horizontal and vertical directions capture long-range contextual dependencies and effectively broaden the model's global receptive field. In the fusion stage for high-level and low-level features, a mixup fusion scheme based on content-guided attention (CGA) [24] is introduced; by dynamically calibrating the receptive fields and semantic consistency of features at different levels, it suppresses information redundancy in cross-scale feature interaction and achieves efficient aggregation of feature maps.
Experimental data show that the improved model reaches an mIoU index of 71.18% on the DUT-USEG dataset, while maintaining a computational complexity of 6.628 M parameters and 39.612 GFLOPs. Under the premise of ensuring the lightweight nature of the model, this scheme provides an effective technical implementation path for real-time semantic segmentation tasks of underwater biological images.
The rest of the paper is organized as follows: Section 2 describes the materials and methods in detail; Section 3 analyzes the experimental results and is followed by the conclusion in Section 4.

2. Materials and Methods

2.1. DeepLabv3+ Model

DeepLabv3+ stands out as an advanced semantic segmentation model, serving as an enhanced version of DeepLabv3 with a residual network typically employed as the underlying architecture. It incorporates the ASPP module and an encoder–decoder framework. The encoding module comprises a backbone feature extraction network—often mainstream architectures such as Xception, ResNet, MobileNetV2, and ShuffleNet—and the ASPP module. The DeepLabv3+ model first performs feature learning, optimization, and hierarchical representation through the backbone network to generate coarse feature maps. These coarse maps are then processed by ASPP to extract and aggregate multi-perspective contextual information. The decoding module fuses low-level features with high-level features, integrating pixel-level location details with contextual information, and employs bilinear interpolation for upsampling to achieve pixel-wise segmentation. By leveraging depthwise separable convolution (DSC) and dilated/atrous convolution (DC), DeepLabv3+ effectively increases network depth while keeping model parameters in check.
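To make the encoder–decoder flow above concrete, the following PyTorch sketch shows how a DeepLabv3+-style decoder fuses projected low-level features with ASPP output and upsamples to the input resolution. The channel sizes (24 low-level channels, 256 ASPP channels, 48 after projection) are illustrative assumptions rather than the exact values used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    """Minimal DeepLabv3+ decoder sketch: project low-level features, fuse them
    with upsampled ASPP output, refine, and upsample to the input resolution."""
    def __init__(self, low_ch=24, aspp_ch=256, num_classes=5):
        super().__init__()
        self.low_proj = nn.Sequential(          # 1x1 conv compresses low-level features
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(              # 3x3 conv refines the fused features
            nn.Conv2d(48 + aspp_ch, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, low_feat, aspp_feat, out_size):
        aspp_up = F.interpolate(aspp_feat, size=low_feat.shape[2:],
                                mode="bilinear", align_corners=False)
        x = torch.cat([self.low_proj(low_feat), aspp_up], dim=1)
        x = self.fuse(x)
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
```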

2.2. Methodology

2.2.1. MobileOne-S0

Many existing improvements to the DeepLabv3+ model replace its relatively “heavyweight” backbone with the classic lightweight MobileNetV2. The primary attraction is MobileNetV2’s small footprint of roughly 3.4 M parameters and about 0.98 ms inference latency, which brings significant performance advantages, and DeepLabv3+ variants built on it indeed perform well. Nevertheless, MobileNetV2 has shortcomings in three main respects: first, the alternating stacking of depthwise convolutions and linear bottleneck structures limits the effective receptive field; second, the channel-splitting mechanism of depthwise separable convolutions adds computational redundancy to cross-channel information interaction; more critically, although the adopted H-swish activation function can enhance representation capability, it incurs significant memory access overhead on mobile or embedded accelerators. These design limitations make it difficult for MobileNetV2 to meet stricter efficiency requirements in latency-sensitive scenarios.
As a representative of newer lightweight network architectures, MobileOne-S0 demonstrates significant advantages in efficient computing scenarios for mobile and embedded devices. Figure 1 illustrates the structural differences in the core modules of MobileOne during the training and inference phases. During training, the module employs a multi-branch design: the main branch consists of 3 × 3 depthwise separable convolutions, while introducing reparameterizable skip connections (including 1 × 1 convolutions and batch normalization layers) and multiple trivial over-parameterized branches (controlled by the hyperparameter k). These branches enhance the model’s feature expression capabilities through parameter sharing. During inference, all branches are merged into a single linear structure via structural reparameterization techniques, eliminating multi-branch operations and forming a directly connected network without skip connections (as shown in the right half of Figure 1). This design significantly reduces computational latency on mobile devices while preserving the multi-scale feature learning capabilities of the training phase.
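The core of structural reparameterization is folding each branch's convolution and batch normalization into plain convolution weights and then summing the branches. The snippet below is a minimal sketch of that fusion for two parallel 3 × 3 branches; the full MobileOne block additionally folds 1 × 1 and skip branches (zero-padded to 3 × 3 kernels), which is omitted here for brevity.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BatchNorm layer into the preceding convolution (inference-time
    reparameterization); returns the equivalent weight and bias tensors."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        b = b + conv.bias * scale
    return w, b

# Illustrative example: two parallel 3x3 branches (as in an over-parameterized
# block) collapse into a single 3x3 convolution by summing their fused weights.
branch_a = nn.Conv2d(16, 16, 3, padding=1, bias=False)
branch_b = nn.Conv2d(16, 16, 3, padding=1, bias=False)
bn_a, bn_b = nn.BatchNorm2d(16).eval(), nn.BatchNorm2d(16).eval()

wa, ba = fuse_conv_bn(branch_a, bn_a)
wb, bb = fuse_conv_bn(branch_b, bn_b)
merged = nn.Conv2d(16, 16, 3, padding=1)
with torch.no_grad():
    merged.weight.copy_(wa + wb)   # single-branch conv used at inference time
    merged.bias.copy_(ba + bb)
```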
Table 1 specifically defines the network hierarchy configurations of different MobileOne variants (S0–S4). As shown in the table, the network is divided into 8 stages. Early stages (e.g., Stages 1–2) process high-resolution inputs (224 × 224 to 112 × 112) and use fewer blocks (1–2 blocks) to reduce computational load. Deep stages (e.g., Stages 3–5) densely stack blocks on low-resolution feature maps (e.g., Stage 3 contains 8 blocks), with channel numbers dynamically adjusted by the width scaling coefficient α (e.g., α = 0.75 for S0, α = 3.0 for S4). This progressive complexity strategy introduces SE-ReLU activation functions in Stages 5–6 (only used in the largest variant S4) and replaces convolutional operations with global average pooling in Stage 7 to further optimize computational efficiency. By comparing the parameter settings of different variants, MobileOne achieves model scaling by adjusting channel numbers rather than input resolution, avoiding the sharp increase in FLOPs and memory consumption caused by resolution improvements. To achieve the lightest weight, this study selects MobileOne-S0 as the backbone network from various MobileOne variants.

2.2.2. SimAM

The ASPP in DeepLabv3+ extracts multi-scale context information from high-level features. Applying SimAM after ASPP further refines these features. It strengthens the correlations among different multi-scale features obtained by ASPP in both spatial and channel dimensions. Without adding extra parameters, SimAM generates more discriminative three-dimensional attention weights. This enables the model to better distinguish and utilize the multi-scale context information provided by ASPP, thus improving the accuracy of object segmentation, especially for objects of diverse scales. The structure of the SimAM attention mechanism is shown in Figure 2.
Drawing on neural network theory, SimAM selects the relatively important features to support semantic segmentation of images. The energy function of each neuron is defined as
$e_t(w_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \dfrac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2$ (1)
In Equation (1), $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$, and binary labels are introduced (i.e., $y_t = 1$ for the target neuron and $y_o = -1$ for the other neurons). $t$ and $x_i$ represent the input features of the target neuron and the other neurons, respectively, while $w_t$ and $b_t$ denote the weight and bias of the linear transformation. $M$ is the number of neurons in a channel. $w_t$ and $b_t$ can be expressed in closed form as
$w_t = -\dfrac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}$ (2)
$b_t = -\dfrac{1}{2}(t + \mu_t)\,w_t$ (3)
where $\lambda$ is a regularization coefficient, and $\mu_t$ and $\sigma_t^2$ represent the mean and variance of the neurons in the channel, respectively. Minimizing Equation (1) and adding the regularization term yields the final energy function:
$e_t^* = \dfrac{4(\sigma_t^2 + \lambda)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}$ (4)
Equation (4) indicates that the lower the energy $e_t^*$, the more distinct the target neuron is from its neighbors and the more important it is. Therefore, neuron saliency can be evaluated by the reciprocal of the energy, $1/e_t^*$, and the features are enhanced after normalization with the sigmoid function. SimAM is a lightweight, parameter-free attention module designed to enhance feature discriminability in deep networks. Rooted in energy function theory, it dynamically generates 3D attention weights (across channel and spatial dimensions) by analyzing feature statistics, suppressing background noise and highlighting salient target regions. Integrated after the ASPP module in DeepLabv3+, SimAM refines multi-scale contextual features without introducing additional learnable parameters. By adaptively emphasizing relevant features, it improves the model’s ability to distinguish objects of varying scales in complex underwater scenes.
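A minimal PyTorch sketch of SimAM following the energy formulation above: the inverse energy $1/e_t^*$ is computed per activation from the channel mean and variance (with eps standing in for the regularization term $\lambda$) and passed through a sigmoid to rescale the input.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: weights each activation by the inverse
    of its energy and rescales the input with a sigmoid (minimal sketch)."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps  # plays the role of the regularization term lambda

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (t - mu)^2 per position
        v = d.sum(dim=(2, 3), keepdim=True) / n              # channel variance
        e_inv = d / (4 * (v + self.eps)) + 0.5                # 1 / e_t*
        return x * torch.sigmoid(e_inv)                       # 3D attention weighting
```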

2.2.3. Strip Pooling

Replacing global average pooling with strip pooling in DeepLabv3+’s ASPP module addresses two limitations of global average pooling while preserving multi-scale context aggregation. Unlike global average pooling, which compresses global features into a single vector and risks oversmoothing spatially anisotropic structures, strip pooling employs horizontal and vertical long-kernel pooling to explicitly model long-range dependencies along orthogonal axes. When the two directions are combined, each spatial location aggregates the average responses of its entire row and column. This complements ASPP’s multi-scale atrous convolutions by adding direction-aware contextual priors, which is particularly beneficial for segmenting irregularly shaped targets. Strip pooling retains global context via its hybrid local–global receptive fields, enhancing feature discriminability without sacrificing computational efficiency. This adaptation synergizes with ASPP’s core design philosophy of multi-scale fusion while addressing the spatial bias inherent to isotropic pooling operations. The module structure is shown in Figure 3. For an input feature map, the row-wise (horizontal) pooling output is computed as
$y_i^h = \dfrac{1}{W}\sum_{0 \le j < W} X_{i,j}$ (5)
and the column-wise (vertical) pooling output is computed as
$y_j^v = \dfrac{1}{H}\sum_{0 \le i < H} X_{i,j}$ (6)
For the input feature map $X \in \mathbb{R}^{H \times W \times C}$, $H$ and $W$ denote the height and width, and $C$ denotes the number of channels. Pooling $X$ in the horizontal and vertical directions yields $y^h \in \mathbb{R}^{H \times C}$ and $y^v \in \mathbb{R}^{W \times C}$. The combined feature map is obtained by summing them:
$y_{c,i,j} = y^h_{c,i} + y^v_{c,j}$ (7)
The combined map is then passed through a 1 × 1 convolution, activated by a sigmoid function, and used to modulate the original input feature map, yielding the final output:
$z = \mathrm{Scale}(X, \sigma(f(y)))$ (8)
In Equation (8), Scale denotes element-wise multiplication, $\sigma$ represents the sigmoid function, and $f$ indicates a 1 × 1 convolution operation.
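The following sketch implements the strip-pooling gate described by Equations (5)–(8): row and column averages are refined by 1D convolutions (an implementation detail assumed here), broadcast-added, and turned into a sigmoid modulation map for the input.

```python
import torch
import torch.nn as nn

class StripPooling(nn.Module):
    """Minimal strip-pooling sketch: pool along rows and columns, broadcast-add
    the two strips, then gate the input with a sigmoid map (cf. Eqs. (5)-(8))."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
        self.conv_v = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)  # f in Eq. (8)

    def forward(self, x):
        y_h = self.conv_h(x.mean(dim=3)).unsqueeze(3)   # (B, C, H, 1): row averages
        y_v = self.conv_v(x.mean(dim=2)).unsqueeze(2)   # (B, C, 1, W): column averages
        y = y_h + y_v                                   # broadcast sum -> (B, C, H, W)
        return x * torch.sigmoid(self.fuse(y))          # Scale(X, sigma(f(y)))
```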

2.2.4. CGA-Based Mixup Fusion Scheme

The CGA-based mixup fusion scheme is an innovative feature fusion mechanism designed to address the inherent defects of traditional feature fusion methods in receptive field mismatch and information redundancy. The core of this scheme lies in dynamically generating channel-specific spatial importance maps (SIMs) to achieve fine-grained alignment and weight allocation for low-level and high-level features. Specifically, CGA achieves collaborative interaction through parallel channel attention and spatial attention, as shown in Figure 4, to first generate a preliminary global SIM. Subsequently, the content information of the input features is used to refine the SIMs at the channel level, enabling the SIM of each channel to adaptively reflect the key region distribution of that channel’s features. This two-stage design preserves global semantic information while enhancing the saliency of local details, providing more discriminative weight criteria for subsequent fusion.
For the input feature $X \in \mathbb{R}^{C \times H \times W}$, the goal of CGA is to generate a SIM of the same dimensions, $W \in \mathbb{R}^{C \times H \times W}$. First, $W_c$ and $W_s$ are calculated using the following equations:
$W_c = C_{1 \times 1}(\max(0, C_{1 \times 1}(X^{c}_{GAP})))$ (9)
$W_s = C_{7 \times 7}([X^{s}_{GAP}, X^{s}_{GMP}])$ (10)
where $\max(0, x)$ is the ReLU activation function, $C_{k \times k}(\cdot)$ denotes a $k \times k$ convolution, and $[\cdot]$ denotes channel-wise concatenation. $X^{c}_{GAP}$, $X^{s}_{GAP}$, and $X^{s}_{GMP}$ represent the global average pooling operation across the channel dimension, the global average pooling operation across the spatial dimension, and the global max pooling operation across the spatial dimension, respectively. To reduce the number of parameters and limit model complexity, the first 1 × 1 convolution reduces the channel dimension from C to C/r (where r is the reduction ratio) and the second 1 × 1 convolution expands it back to C; in this paper, r is set to 8, so the intermediate dimension is C/8. Then, $W_c$ and $W_s$ are added directly according to the broadcasting rule to obtain
$W_{cos} \in \mathbb{R}^{C \times H \times W}$ (11)
To obtain the final $W$, each channel of $W_{cos}$ is adjusted under the guidance of the content of the corresponding input feature channel:
$W = \sigma(GC_{7 \times 7}(CS([X, W_{cos}])))$ (12)
where $\sigma$ represents the sigmoid operation, $CS(\cdot)$ represents the channel-shuffling operation, and $GC_{k \times k}(\cdot)$ represents a hierarchical convolutional group with a kernel size of $k \times k$. CGA assigns a unique SIM to each channel, guiding the model to focus on the important regions within each channel. Therefore, more of the useful information encoded in the features is emphasized, effectively improving the performance.
Compared with traditional direct concatenation or simple weighted fusion strategies, the CGA-based hybrid fusion scheme exhibits significant advantages. First, the concatenation operation assumes that features at different levels can be directly aligned spatially for concatenation. However, in practice, shallow features are limited by local receptive fields and often only capture texture details and edge information, while deep features form global semantic representations through multi-scale aggregation. These two types of features have fundamental differences in spatial coverage. Direct concatenation is prone to introducing high-frequency noise or diluting low-frequency information, making effective information interaction difficult. In contrast, the CGA-based scheme uses dynamically generated SIMs to spatially modulate features, explicitly modeling the receptive field differences between features at different levels. For example, in certain regions, CGA tends to assign higher weights to high-level semantic features to suppress noise, while in areas with clear edges, it enhances the contribution of low-level detail features, achieving adaptive feature selection and fusion. The structure diagram of the CGA-based mixup fusion scheme is shown in Figure 5.
The CGA-based mixup fusion scheme addresses the mismatch in receptive fields and information redundancy between high-level semantic and low-level spatial features. It employs a two-stage attention mechanism: first, parallel channel and spatial attention branches generate a global SIM by fusing global average/max pooling features. Second, input feature content guides channel-wise refinement of the SIM, enabling adaptive weight allocation for cross-scale feature interaction. By dynamically calibrating feature relevance through channel shuffling and hierarchical convolutions, this scheme enhances fine-grained alignment and suppresses redundant information.
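A sketch of a CGA-guided fusion module following Equations (9)–(12) with r = 8: channel and spatial attention form the coarse map $W_{cos}$, the feature content refines it via channel shuffling and a grouped 7 × 7 convolution, and the resulting sigmoid map mixes the low- and high-level inputs. The exact layer arrangement and the final mixing rule are assumptions based on the description above, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channel groups (the CS(.) operation in Eq. (12))."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(b, c, h, w))

class CGAFusion(nn.Module):
    """Sketch of CGA-based mixup fusion of low- and high-level features
    (same channel count assumed for both inputs)."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.ca = nn.Sequential(                    # channel attention W_c, Eq. (9)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1))
        self.sa = nn.Conv2d(2, 1, 7, padding=3)     # spatial attention W_s, Eq. (10)
        self.refine = nn.Conv2d(2 * channels, channels, 7,
                                padding=3, groups=channels)  # GC_7x7, Eq. (12)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, low, high):
        x = low + high
        w_c = self.ca(x)                                            # (B, C, 1, 1)
        w_s = self.sa(torch.cat([x.mean(1, keepdim=True),
                                 x.amax(1, keepdim=True)], dim=1))  # (B, 1, H, W)
        w_cos = w_c + w_s                                           # broadcast -> (B, C, H, W)
        w = torch.sigmoid(self.refine(
            channel_shuffle(torch.cat([x, w_cos], dim=1), 2)))      # content-guided SIM
        return self.out(low * w + high * (1 - w))                   # mixup fusion
```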

2.3. Improved DeepLabv3+ Model

As shown in Figure 6, the input image first enters the feature extraction network with MobileOne-S0 as the backbone. Leveraging its structural reparameterization technology, MobileOne-S0 uses additional linear branches during training and simplifies the structure during inference, effectively reducing memory access costs. This enhances model performance at a low parameter count, allowing the model to quickly and efficiently extract basic image features and providing a strong foundation for subsequent processing. After feature extraction by MobileOne-S0, the high-level features enter the ASPP module, which captures multi-scale contextual information through dilated convolutions with different atrous rates. In the improved model, SimAM is added after ASPP, followed by a 1 × 1 convolution. SimAM adaptively adjusts the weights of the feature maps, highlights important features, suppresses irrelevant information, and enhances the model’s ability to capture target features in complex scenes, making the segmentation results more accurate. Global average pooling is replaced by strip pooling, which gathers richer global and local information through pooling operations in different directions; this addresses the loss of spatial information that global average pooling can cause and further improves the model’s understanding of the overall image structure and details. In the feature fusion stage, the classic DeepLabv3+ concatenates high-level and low-level features, whereas the improved model employs the CGA-based feature fusion scheme. By dynamically generating channel-specific SIMs, CGA achieves fine-grained alignment and weight allocation for shallow and deep features, enhancing interactions between features at different levels and enabling the model to better exploit the semantic and spatial information of the image, thereby improving segmentation accuracy. The resulting feature map is then upsampled to the original image resolution through bilinear interpolation to achieve pixel-level classification and obtain the final segmentation result.
This improved DeepLabv3+ model combines the efficient feature extraction capability of MobileOne-S0, the attention mechanism of SimAM, the information acquisition advantages of strip pooling, and the CGA-based mixup feature fusion scheme, demonstrating superior performance in semantic segmentation tasks.
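The overall data flow of Figure 6 can be summarized by the hypothetical forward pass below; the module objects (backbone, ASPP with strip pooling, SimAM, low-level projection, CGA fusion, classifier) and their return signatures are assumed placeholders for the components sketched in the previous subsections.

```python
import torch.nn.functional as F

def forward_improved_deeplab(x, backbone, aspp, simam, low_proj, cga_fusion, classifier):
    """Hypothetical forward pass wiring together the modules described above;
    module objects and their signatures are illustrative assumptions."""
    low_feat, high_feat = backbone(x)            # MobileOne-S0: shallow and deep features
    ctx = simam(aspp(high_feat))                 # ASPP (with strip pooling) + SimAM
    ctx = F.interpolate(ctx, size=low_feat.shape[2:], mode="bilinear",
                        align_corners=False)     # match low-level spatial size
    fused = cga_fusion(low_proj(low_feat), ctx)  # CGA-based mixup fusion
    logits = classifier(fused)                   # refinement convs + classifier
    return F.interpolate(logits, size=x.shape[2:], mode="bilinear",
                         align_corners=False)    # restore input resolution
```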

2.4. Loss Function

The Focal Loss function was initially proposed for object detection tasks to balance the contributions of easy and hard examples during training, enabling models to optimize parameters more effectively during backpropagation. The Dice Loss function, commonly employed in semantic segmentation, calculates loss by measuring the similarity between two samples, thereby mitigating the adverse effects of foreground–background class imbalance. In this study, the DUT-USEG dataset exhibits significant class imbalance between positive and negative samples, coupled with substantial variations in sample difficulty. To address these challenges, a hybrid loss function combining Focal Loss and Dice Loss was adopted to enhance the model’s segmentation performance. This approach ensures robust optimization under imbalanced data distributions while maintaining precise boundary delineation.
$\mathrm{Focal\;Loss} = -(1 - p_t)^{\gamma}\log(p_t)$ (13)
where $p_t$ denotes the confidence score of the predicted class for a sample, and $\gamma$ is a tunable focusing parameter with a default value of 2.
$\mathrm{Dice\;Loss} = 1 - \dfrac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}$ (14)
where $y_i$ and $\hat{y}_i$ denote the ground-truth and predicted values of pixel $i$, respectively, and $N$ represents the total number of pixels.
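A minimal implementation of the two losses for per-pixel logits is sketched below; the equal weighting of the two terms in the hybrid loss is an assumption, since the paper does not state the mixing ratio.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss for logits (B, C, H, W) and integer labels (B, H, W)."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1 - pt) ** gamma * log_pt).mean()

def dice_loss(logits, target, smooth=1.0):
    """Soft Dice loss averaged over classes; labels are one-hot encoded internally."""
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    return 1 - ((2 * inter + smooth) / (denom + smooth)).mean()

def hybrid_loss(logits, target):
    # Equal weighting of Focal and Dice terms is an assumption.
    return focal_loss(logits, target) + dice_loss(logits, target)
```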

2.5. Model Training and Testing

2.5.1. DUT-USEG Dataset

The DUT-USEG dataset, a real-world underwater semantic segmentation benchmark, comprises 6617 images of four marine species (echinus, holothurian, scallop, and starfish) with resolutions ranging from 586 × 480 to 3840 × 2160. It exhibits substantial class imbalance (e.g., 40,435 echinus instances vs. 1471 scallop instances), mirroring natural ecological distributions. For experiments, a subset of 1487 annotated images was utilized, partitioned into training, validation, and test sets at an 8:1:1 ratio, with input images resized to 512 × 512 pixels, as shown in Figure 7.
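For reproducibility, an 8:1:1 split of the annotated subset can be drawn as in the sketch below; the directory layout and file extension are assumptions.

```python
import random
from pathlib import Path

def split_dut_useg(image_dir, seed=0):
    """Split the annotated subset 8:1:1 into train/val/test lists
    (directory layout and *.jpg extension are assumed)."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.8 * len(paths)), int(0.1 * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```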

2.5.2. Model Training Environment and Parameters

The experiments in this study were conducted on a Windows 10 64-bit operating system using PyTorch 1.12.1 and cuDNN 8.3.02 frameworks. The computational platform was equipped with an Intel Core i7-10700K processor, 16 GB of RAM, and an NVIDIA RTX A4000 GPU. The model was trained for 100 epochs with an initial learning rate of $1 \times 10^{-4}$, where the learning rate was decayed by 10% every 10 epochs. A batch size of 8 and the Adam optimizer were employed during training, with a weight decay factor set to 0.0001 and other hyperparameters maintained at their default configurations.
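The training configuration can be expressed as follows; the model, dataloader, and loss are stand-in placeholders, and "decayed by 10% every 10 epochs" is interpreted here as multiplying the learning rate by 0.9 every 10 epochs via StepLR.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins so the loop is self-contained; in practice these would
# be the improved DeepLabv3+, the DUT-USEG dataloader, and the hybrid loss.
model = nn.Conv2d(3, 5, 1)                      # hypothetical 5-class "model"
train_loader = [(torch.randn(8, 3, 512, 512),   # batch size 8, 512 x 512 inputs
                 torch.randint(0, 5, (8, 512, 512)))]
criterion = nn.CrossEntropyLoss()               # stand-in for Focal + Dice loss

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                            # 10% lr decay every 10 epochs
```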

2.5.3. Model Evaluation Metrics

The evaluation metrics adopted in this study include the mean Intersection over Union (mIoU), mean Pixel Accuracy (mPA), number of trainable parameters, and floating-point operations (FLOPs). The specific equations of these metrics are defined as follows:
$mPA = \dfrac{1}{k+1}\sum_{i=0}^{k}\dfrac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$ (15)
$mIoU = \dfrac{1}{k+1}\sum_{i=0}^{k}\dfrac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ (16)
where $p_{ii}$ is the total number of pixels that belong to class $i$ and are predicted as class $i$, $p_{ij}$ denotes the total number of pixels that belong to class $i$ but are predicted as class $j$, $p_{ji}$ is the total number of pixels that belong to class $j$ but are predicted as class $i$, and $k + 1$ is the number of classes. mIoU evaluates global segmentation accuracy and boundary consistency. mPA quantifies pixel-level classification performance and inter-class recognition robustness. Parameters indicate model complexity and overfitting risks, whereas FLOPs assess computational efficiency.
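Both metrics can be computed from a confusion matrix accumulated over the test set, as in the sketch below (rows index ground-truth classes, columns index predictions).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a (k+1) x (k+1) confusion matrix from flat integer label arrays."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mpa_miou(cm):
    """Compute mPA and mIoU from a confusion matrix (rows: ground truth)."""
    eps = 1e-10
    pa = np.diag(cm) / (cm.sum(axis=1) + eps)                      # p_ii / sum_j p_ij
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0)
                         - np.diag(cm) + eps)                      # Eq. (16) per class
    return pa.mean(), iou.mean()
```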

3. Results and Discussion

3.1. Comparison of Different Models

To validate the efficacy of the proposed improved DeepLabv3+ algorithm, five semantic segmentation models, including UNet, PSPNet, HRNetv2, DeepLabv3+ (MobileNetV2), and SegFormer, were selected as comparative models for comprehensive performance analysis. To ensure the rigor and fairness of the experimental design, all comparative trials were conducted under strictly identical computational configurations and hyperparameter settings throughout the evaluation process. The results are presented in Table 2.
The comparative experiments demonstrate that the improved DeepLabv3+ model achieves an mIoU of 71.18% on the DUT-USEG dataset, outperforming UNet, PSPNet, HRNetv2, DeepLabv3+ (MobileNetV2), and SegFormer by 3.99%, 2.67%, 1.09%, 3.45%, and 2.81%, respectively. Furthermore, the proposed model attains an mPA of 80.42%, surpassing all comparative models and representing the sole model exceeding 80% in the comparative study. These results validate the effectiveness of the proposed improvements in enhancing semantic segmentation accuracy for underwater organisms.
In terms of lightweight performance, the improved DeepLabv3+ model has 6.628 M parameters, significantly fewer than most comparative models, while slightly exceeding DeepLabv3+ (MobileNetV2) (5.814 M). Notably, the computational complexity of the proposed model, measured in FLOPs, is reduced to 39.612 G, a 50.5% reduction compared to DeepLabv3+ (MobileNetV2) (79.949 G). Although SegFormer has marginally lower FLOPs (29.520 G), its parameter count (44.605 M) remains substantially higher, underscoring the superior lightweight characteristics of the proposed architecture.
The improved model achieves a favorable balance between segmentation accuracy and computational efficiency. By keeping the parameter count low (6.628 M) while delivering strong performance (71.18% mIoU, 80.42% mPA), the proposed method provides an effective solution for underwater semantic segmentation tasks, addressing critical hardware constraints such as limited memory and computational resources in real-world marine exploration scenarios.

3.2. Ablation Study

The improved model is based on DeepLabv3+, with the backbone network replaced by MobileOne-S0, SimAM applied to the high-level features, strip pooling substituted for global average pooling, and the CGA-based mixup feature fusion used instead of concatenation. Ablation experiments were conducted with the same dataset and hyperparameters to validate the impact of these improvement strategies on the model's segmentation accuracy and lightweight properties, as shown in Table 3 and Table 4.
Replacing the MobileNetV2 backbone of DeepLabv3+ with MobileOne-S0 yields clear advantages. In terms of accuracy, both mIoU and mPA improve, enabling the model to identify and segment targets more precisely. Regarding lightweight performance, although the number of parameters increases slightly, the FLOPs are significantly reduced, lowering inference computation and making operation more efficient. This replacement therefore optimizes performance, makes the model more competitive in practical applications, and provides strong support for semantic segmentation. Accordingly, in the ablation results of Table 3 and Table 4, the backbone of the proposed model is MobileOne-S0.
Table 4 presents the results of the ablation experiments conducted to evaluate the individual and combined effects of SimAM, strip pooling, and feature fusion on the model’s performance. When only SimAM was applied, mIoU increased by 0.58% and mPA increased by 0.35%. This indicates that SimAM can contribute to the model’s ability to better focus on relevant features, thus improving the segmentation accuracy to some extent. When CGA-based feature fusion was used alone, mIoU increased by 1.78% and mPA increased by 1.12%, which implies that the introduced feature fusion scheme can effectively enhance the model’s performance. Adding strip pooling while keeping SimAM results in an mIoU of 69.35% and an mPA of 79.89%. Although the improvement in mIoU is relatively small compared to the case with only SimAM, the mPA shows a slight increase, suggesting that strip pooling can have a positive impact on the pixel-level accuracy when combined with SimAM. Combining SimAM and feature fusion leads to an mIoU of 70.86% and an mPA of 80.06%, demonstrating a synergistic effect between the two methods. The experimental results demonstrate that the synergistic integration of these methods significantly enhances the model’s overall segmentation performance, providing critical insights into their individual contributions and interactive dynamics within the proposed architectural framework.

3.3. Comparison of Locations of SimAM Application

In this section, different control groups were also established. Considering that the attention module improves the adaptability of the convolutional network by suppressing irrelevant channel weights, experiments were conducted to explore the comparative effects of SimAM at the three common positions where the attention mechanism is added in the DeepLabv3+ network structure. The results are shown in Table 5, which is used to demonstrate the rationality of the position where SimAM is added in the DeepLabv3+ network structure.
The positions A, B, and C are marked in Figure 6: at position A, SimAM is applied to the low-level features extracted by the backbone, before the 1 × 1 convolution; at position B, SimAM is applied after the ASPP module and before the 1 × 1 convolution; at position C, SimAM is applied after the fusion of high-level and low-level features and before the 3 × 3 convolution. Applying SimAM at different positions does not necessarily enhance the semantic segmentation of underwater biological images. For example, in group 1, applying the attention module at position A reduced the model’s segmentation accuracy, possibly because the attention mechanism at this position disrupted the channel weights of the original network. Additionally, groups 4, 5, 6, and 7, which applied the SimAM module at multiple positions, exhibited inferior performance compared with group 2, which applied SimAM solely at position B. Therefore, SimAM was applied only at position B, after the ASPP module, to improve the model’s ability to detect objects of various scales and thereby enhance overall performance. This choice raised the model’s mIoU to 69.12% and mPA to 79.49%.

3.4. Qualitative Analysis

In this section, to visually demonstrate the segmentation performance of the improved model, we selected several sample images from the DUT-USEG dataset for experimentation. Figure 8 compares the segmentation results of these sample images with other segmentation networks.
Qualitative analysis results demonstrate that the improved model exhibits outstanding segmentation performance. It can precisely delineate the edges of underwater organisms such as sea urchins, vividly presenting the fine details of sea urchin spines, and accurately segment small objects like sea cucumbers, thereby assisting underwater detection devices in identifying organisms and estimating their sizes. As shown in Figure 8a, the improved model not only completely segments all sea urchins but also clearly displays the minute spiny structures of the largest sea urchin, an achievement beyond the reach of other comparative models. In Figure 8f, due to the low contrast of the underwater environment and the tiny size of the sea cucumber, which is hard to see with the naked eye, only the improved model successfully accomplishes the segmentation task, while all other models fail to detect it. Compared with other comparative models, the improved model has a significant advantage in segmentation accuracy. It can accurately capture the fine-grained morphological features of underwater organisms, effectively mitigating the edge blurring caused by turbid water or low contrast. Additionally, it performs remarkably well in addressing the challenges of small underwater targets, such as vulnerability to noise interference and missed detection.

4. Conclusions

Aiming at the problems such as image blurring, indistinct target features, complex scenes, and limited performance of underwater hardware devices faced in underwater image semantic segmentation, a lightweight semantic segmentation model based on DeepLabv3+ is proposed. By integrating MobileOne-S0 as the backbone network, SimAM for high-level features, strip pooling for anisotropic context modeling, and a CGA-based mixup fusion scheme for efficient multi-scale feature integration, the proposed model achieves excellent performance on the DUT-USEG dataset, with an mIoU of 71.18% and an mPA of 80.42%. With only 6.628 M parameters and 39.612 G FLOPs, this work provides a practical solution for underwater exploration systems requiring a balance between segmentation accuracy and lightweight deployment.
The lightweight architecture of our model, achieving high accuracy with constrained computational demands, holds significant promise for real-world underwater exploration systems. Its efficiency enables seamless deployment on resource-constrained platforms such as AUVs, where low-latency, real-time semantic segmentation is critical for navigation and habitat mapping. Furthermore, the model’s portability facilitates integration into edge devices for collaborative marine research—for instance, empowering marine biologists to perform in situ species monitoring via compact underwater drones or fixed sensors, even in turbid or low-visibility environments. By bridging the gap between algorithmic innovation and practical deployability, our work advances scalable solutions for sustainable ocean exploration and ecological preservation.
Still, the research conducted in this paper has some limitations. Compared with some heavyweight semantic segmentation models, there remains a gap in segmentation accuracy; however, this trade-off better meets the constraints of underwater hardware devices. In scenarios with extremely low contrast or very turbid water, the segmentation performance of this method is not ideal.
In the future, our research direction will focus on achieving higher accuracy while pursuing a more lightweight model. We will seek better improvement methods, such as incorporating multimodal data (e.g., sonar detection, thermal imaging information, etc.), performing more efficient feature extraction and fusion at multiple scales, and enhancing downsampling and upsampling methods. Additionally, we will attempt to apply the model to real-time semantic segmentation of underwater videos to adapt to dynamic and complex environments.

Author Contributions

Conceptualization, C.X. and Z.Z.; methodology, C.X.; software, C.X.; validation, C.X.; formal analysis, Y.H.; investigation, C.X.; resources, C.X.; data curation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, C.X. and Z.Z.; visualization, C.X.; supervision, C.X. and Z.Z.; project administration, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (No. 2022YFC2803903) and the Key Research and Development Program of Zhejiang Province (No. 2021C03013).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for the data analysis is available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  2. Li, Y.; Zhao, H.; Qi, X.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Fully convolutional networks for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 214–223. [Google Scholar]
  3. Zhang, W.; Pang, J.; Chen, K.; Loy, C. K-net: Towards unified image segmentation. Adv. Neural Inf Process. Syst. 2021, 34, 10326–10338. [Google Scholar]
  4. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  5. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 593–602. [Google Scholar]
  6. Liu, F.; Fang, M. Semantic segmentation of underwater images based on improved Deeplab. J. Mar. Sci. Eng. 2020, 8, 188. [Google Scholar] [CrossRef]
  7. Zhou, J.; Wei, X.; Shi, J.; Chu, W.; Lin, Y. Underwater image enhancement via two-level wavelet decomposition maximum brightness color restoration and edge refinement histogram stretching. Opt. Express 2022, 30, 17290–17306. [Google Scholar] [CrossRef] [PubMed]
  8. Zhou, J.; Wang, Y.; Zhang, W.; Li, C. Underwater image restoration via feature priors to estimate background light and optimized transmission map. Opt. Express 2021, 29, 28228–28245. [Google Scholar] [CrossRef] [PubMed]
  9. Zhou, J.; Yang, T.; Ren, W.; Zhang, D.; Zhang, W. Underwater image restoration via depth map and illumination estimation based on a single image. Opt. Express 2021, 29, 29864–29886. [Google Scholar] [CrossRef] [PubMed]
  10. Islam, M.J.; Edge, C.; Xiao, Y.; Luo, P.; Mehtaz, M.; Morse, C.; Sakib Enan, S.; Sattar, J. Semantic segmentation of underwater imagery: Dataset and benchmark. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE), Las Vegas, NV, USA, 25–29 October 2020; pp. 1769–1776. [Google Scholar]
  11. Zhang, W.; Wei, B.; Li, Y.; Li, H.; Song, T. WaterBiSeg-Net: An underwater bilateral segmentation network for marine debris segmentation. Mar. Pollut. Bull. 2024, 205, 116644. [Google Scholar] [CrossRef] [PubMed]
  12. Wang, B.; Xu, H.; Jiang, G.; Yu, M.; Ren, T.; Luo, T.; Zhu, Z. UIE-Convformer: Underwater Image Enhancement Based on Convolution and Feature Fusion Transformer. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 1952–1968. [Google Scholar] [CrossRef]
  13. Wu, J.; Luo, T.; He, Z.; Song, Y.; Xu, H.; Li, L. CEWformer: A transformer-based collaborative network for simultaneous underwater image enhancement and watermarking. IEEE J. Ocean. Eng. 2023, 49, 30–47. [Google Scholar] [CrossRef]
  14. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. In Proceedings of the British Machine Vision Conference (BMVC), York, UK, 19–22 September 2016; pp. 1–10. [Google Scholar]
  15. Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. OCNet: Object context for semantic segmentation. Int. J. Comput. Vis. 2021, 129, 2375–2398. [Google Scholar] [CrossRef]
  16. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  17. Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE), Long Beach, CA, USA, 16–20 June 2019; pp. 9514–9523. [Google Scholar]
  18. Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; Latecki, L.J. LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (IEEE), Taipei, Taiwan, 22–25 September 2019; pp. 1860–1864. [Google Scholar]
  19. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  20. Ma, Z.; Li, H.; Fan, X.; Luo, Z.; Li, J.; Wang, Z. DUT-USEG: A New Underwater Semantic Segmentation Dataset and Benchmark. J. Beijing Univ. Aeronaut. Astronaut. 2022, 48, 1515–1524. [Google Scholar]
  21. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. MobileOne: An Improved One millisecond Mobile Backbone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7907–7917. [Google Scholar]
  22. Yang, L.; Zhang, R.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–23 June 2022; pp. 12803–12812. [Google Scholar]
  23. Hou, Q.; Zhang, L.; Cheng, M.; Feng, J. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13151–13160. [Google Scholar]
  24. Chen, Z.; He, Z.; Lu, Z. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef] [PubMed]
Figure 1. MobileOne Block.
Figure 2. The structure of SimAM.
Figure 3. The structure of strip pooling.
Figure 4. The structure of CGA.
Figure 5. The structure diagram of the CGA-based mixup fusion scheme.
Figure 6. The structure of improved DeepLabv3+.
Figure 7. DUT-USEG dataset.
Figure 8. Segmentation results on the DUT-USEG dataset. (a–f) represent the segmentation results of six different images selected from the dataset.
Table 1. MobileOne network specifications. The last five columns give the MobileOne block parameters (α, k; activation = ReLU unless noted) for variants S0–S4.

| Stage | Input | Blocks | Stride | Block Type | Channels | S0 | S1 | S2 | S3 | S4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 224 × 224 | 1 | 2 | MobileOne Block | 64 × α | (0.75, 4) | (1.5, 1) | (1.5, 1) | (2.0, 1) | (3.0, 1) |
| 2 | 112 × 112 | 2 | 2 | MobileOne Block | 64 × α | (0.75, 4) | (1.5, 1) | (1.5, 1) | (2.0, 1) | (3.0, 1) |
| 3 | 56 × 56 | 8 | 2 | MobileOne Block | 128 × α | (1.0, 4) | (1.5, 1) | (2.0, 1) | (2.5, 1) | (3.5, 1) |
| 4 | 28 × 28 | 5 | 2 | MobileOne Block | 256 × α | (1.0, 4) | (2.0, 1) | (2.5, 1) | (3.0, 1) | (3.5, 1) |
| 5 | 14 × 14 | 5 | 1 | MobileOne Block | 256 × α | (1.0, 4) | (2.0, 1) | (2.5, 1) | (3.0, 1) | (3.5, 1, SE-ReLU) |
| 6 | 14 × 14 | 1 | 2 | MobileOne Block | 512 × α | (2.0, 4) | (2.5, 1) | (4.0, 1) | (4.0, 1) | (4.0, 1, SE-ReLU) |
| 7 | 7 × 7 | 1 | 1 | AvgPool | – | – | – | – | – | – |
| 8 | 1 × 1 | 1 | 1 | Linear | 512 × α | 2.0 | 2.5 | 4.0 | 4.0 | 4.0 |
Table 2. Performance comparison of different models on the DUT-USEG dataset.

| Model | mIoU (%) | mPA (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|
| UNet | 67.19 | 78.34 | 43.933 | 184.200 |
| PSPNet | 68.51 | 79.68 | 46.708 | 369.481 |
| HRNetv2 | 70.09 | 78.50 | 29.540 | 90.972 |
| DeepLabv3+ (MobileNetV2) | 67.63 | 77.62 | 5.814 | 79.949 |
| SegFormer | 68.37 | 78.26 | 44.605 | 29.520 |
| Ours | 71.18 | 80.42 | 6.628 | 39.612 |
Table 3. Comparative performance of backbones: MobileNetV2 and MobileOne-S0.

| Backbone | mIoU (%) | mPA (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|
| MobileNetV2 | 67.63 | 77.62 | 5.814 | 79.949 |
| MobileOne-S0 | 68.54 | 79.14 | 6.813 | 58.101 |
Table 4. Results of ablation experiments (✓ indicates the component is used; backbone: MobileOne-S0).

| SimAM | Strip Pooling | Feature Fusion | mIoU (%) | mPA (%) |
|---|---|---|---|---|
| ✓ | | | 69.12 | 79.49 |
| | ✓ | | 68.97 | 80.22 |
| | | ✓ | 70.32 | 80.26 |
| ✓ | ✓ | | 69.35 | 79.89 |
| ✓ | | ✓ | 70.86 | 80.06 |
| | ✓ | ✓ | 70.35 | 80.13 |
| ✓ | ✓ | ✓ | 71.18 | 80.42 |
Table 5. Effect of applying SimAM at different positions on the model (✓ indicates the position where SimAM is applied).

| Group | A | B | C | mIoU (%) | mPA (%) | mIoU Gain (%) |
|---|---|---|---|---|---|---|
| 1 | ✓ | | | 68.12 | 78.26 | −0.42 |
| 2 | | ✓ | | 69.12 | 79.49 | 0.58 |
| 3 | | | ✓ | 68.89 | 79.03 | 0.35 |
| 4 | ✓ | ✓ | | 68.61 | 78.93 | 0.07 |
| 5 | ✓ | | ✓ | 68.22 | 78.37 | −0.32 |
| 6 | | ✓ | ✓ | 68.93 | 79.21 | 0.39 |
| 7 | ✓ | ✓ | ✓ | 68.59 | 79.12 | 0.05 |