VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation

Tang, Tang; Wang, Haihui; Rao, Qiang; Zuo, Ke; Gan, Wen

doi:10.3390/electronics14142866

Open AccessArticle

VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation

by

Tang Tang

,

Haihui Wang

^*,

Qiang Rao

,

Ke Zuo

and

Wen Gan

Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430073, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(14), 2866; https://doi.org/10.3390/electronics14142866

Submission received: 9 June 2025 / Revised: 12 July 2025 / Accepted: 14 July 2025 / Published: 17 July 2025

(This article belongs to the Section Bioelectronics)

Download

Browse Figures

Versions Notes

Abstract

Deep learning has advanced medical image segmentation, yet existing methods struggle with complex anatomical structures. Mainstream models, such as CNN, Transformer, and hybrid architectures, face challenges including insufficient information representation and redundant complexity, which limit their clinical deployment. Developing efficient and lightweight networks is crucial for accurate lesion localization and optimized clinical workflows. We propose the VML-UNet, a lightweight segmentation network with core innovations including the CPMamba module and the multi-scale local supervision module (MLSM). The CPMamba module integrates the visual state space (VSS) block and a channel prior attention mechanism to enable efficient modeling of spatial relationships with linear computational complexity through dynamic channel-space weight allocation, while preserving channel feature integrity. The MLSM enhances local feature perception and reduces the inference burden. Comparative experiments were conducted on three public datasets, including ISIC2017, ISIC2018, and PH2, with ablation experiments performed on ISIC2017. VML-UNet achieves 0.53 M parameters, 2.18 MB memory usage, and 1.24 GFLOPs time complexity, with its performance on the datasets outperforming comparative networks, validating its effectiveness. This study provides valuable references for developing lightweight, high-performance skin lesion segmentation networks, advancing the field of skin lesion segmentation.

Keywords:

medical image segmentation; mamba; U-Net

1. Introduction

Medical image segmentation, serving as the core technology for computer-aided diagnosis systems, plays a pivotal role in clinical applications, including tumor localization [1] and skin lesion analysis [2]. Prior to automated solutions, manual segmentation proved both labor-intensive and clinically unreliable. The past decade has witnessed transformative advances in this domain, driven by escalating demands for surgical navigation and diagnostic assistance systems [3,4,5]. A seminal breakthrough emerged in 2015 with Ronneberger et al.’s U-Net architecture [6], which surpassed Fully Convolutional Networks (FCNs) [7] through its encoder–decoder structure with skip connections. This design effectively integrates multi-scale contextual information, establishing U-Net as the paradigm for medical segmentation. Subsequent innovations like U-Net++ [8], UNext [9], and 3D U-Net [10] have refined encoder–decoder configurations to enhance accuracy and robustness. Nevertheless, inherent limitations persist: CNN-based models rely on downsampling to manage computational load, inevitably sacrificing spatial detail. This fundamental trade-off underscores the critical need for richer global context and finer spatial features to support advanced semantic reasoning.

In 2017, Vaswani et al. [11] introduced the Transformer architecture, which revolutionized natural language processing (NLP) through its capacity for modeling long-range dependencies and global context [12,13]. However, its adoption in computer vision remained limited until Dosovitskiy et al. [14] pioneered the Vision Transformer (ViT), successfully adapting this paradigm to visual tasks. This breakthrough catalyzed a new era in semantic segmentation, with hybrid architectures like TransUNet [15] and TransFuse [16] enhancing global feature processing through U-Net integration. Despite these advances, a fundamental limitation persists: the Transformer’s self-attention mechanism incurs quadratic computational complexity relative to image resolution. This scaling behavior severely compromises inference efficiency when processing high-resolution medical images, raising critical research questions about achieving effective long-range dependency modeling without prohibitive computational overhead.

In 2024, Gu et al. [17] proposed the Mamba model, offering a novel approach to balance modeling capability and computational efficiency in deep learning. Departing from Transformer’s attention mechanism, Mamba achieves input-dependent sequence modeling through structured state space equations (S4), a selective scanning mechanism, and hardware-aware algorithms. This architecture maintains linear computational complexity while enabling effective long-range dependency modeling. Subsequent visual adaptations—Vision Mamba [18] with its ViM module and VMamba [19] with the VSS module—successfully translated this framework to computer vision. These innovations position Mamba as a competitive alternative to CNN and Transformer architectures in U-Net optimization efforts.

Current deep learning approaches for skin lesion segmentation fundamentally struggle to reconcile diagnostic precision with clinical adaptability. The canonical UNet framework relies on standard convolutions that inherently blur fine edge textures, producing oversmoothed contours and false-positive artifacts that distort true lesion morphology. Attention-enhanced variants like Attention-UNet compound this limitation by prioritizing global contexts while neglecting micro-lesions and intricate margins, where diagnostic oversight carries direct clinical consequences. Even DeepLabv3+’s atrous convolutions, designed to expand receptive fields, inadvertently dilute locally critical features; this proves particularly detrimental for diagnostically decisive attributes like border microtopography and minute papulation, where feature aggregation attenuates essential information.

Our architectural response strategically counters these failure mechanisms through synergistic integration: The CPME module circumvents computational bottlenecks by fusing state-space modeling with channel prior attention, achieving linear-complexity global context capture impossible for either quadratic-cost Transformers or globally deficient CNNs. Complementarily, the MLSM module fortifies local semantics via multi-scale convolutional branches that hierarchically preserve high-frequency edge signatures and mid-frequency micro-lesion patterns during training, directly addressing pathological feature erosion. Meanwhile, the MAB module’s axially factorized depthwise convolutions maintain orientation-sensitive feature representation while eliminating parameter bloat, thus adapting to morphological complexity beyond conventional operators. This tripartite co-design establishes an emergent capability: clinically viable precision-adaptability alignment previously unattainable in lesion segmentation.

The following are the main contributions of our research:

(1): This paper proposes a lightweight model, VML-UNet, and further explores the potential of the lightweight Mamba model for skin lesion applications. It achieves a trade-off between the Mamba model’s efficiency and accuracy, providing a valuable solution for Mamba to become a mainstream lightweight model in the future.
(2): This paper proposes a visual state space module based on channel prior convolutional attention. This module integrates channel prior convolutional attention and VSS modules, dynamically allocating attention weights through multi-scale depth-separable convolutional modules and retaining channel priors. This combination can help the network better capture important features and improve its ability to represent features.
(3): We designed an MLSM module based on a multi-scale convolutional neural network to compensate for the model’s lack of perception of local feature information. Convolution operations at three different scales and the introduction of an auxiliary loss function enhance the perception of local semantic information. In addition, MLSM is only designed in the training phase, which can reduce the inference burden and thus ensure the model’s operation efficiency.

2. Related Work

2.1. Semantic Segmentation of Medical Images

Melanoma, while representing a minority of skin lesions, exhibits the highest mortality rate among cutaneous malignancies. Its incidence growth rate ranks prominently across six primary cancers, with late-stage interventions offering minimal therapeutic benefit [20]. Consequently, precise early diagnosis and rapid intervention are clinically critical. Traditional reliance on physician visual assessment remains prone to diagnostic error, particularly under suboptimal imaging conditions. Computer-aided diagnosis (CAD) has thus emerged as a vital tool for enhancing skin cancer segmentation accuracy [21,22,23]. Initial CAD implementations leveraged traditional image processing algorithms, capitalizing on computational speed and stability. However, these methods proved notoriously susceptible to noise, image quality variations, and low-contrast scenarios while requiring extensive parameter tuning through iterative experimentation. Such limitations rendered them inadequate for clinical diagnostic workflows demanding both speed and reliability. The advent of deep learning has catalyzed a paradigm shift: convolutional neural networks (CNNs) now supersede traditional methods through superior accuracy and adaptive capability [24], establishing themselves as indispensable assets in clinical dermatology.

U-Net’s seminal encoder–decoder architecture with skip connections has fundamentally advanced medical image segmentation since its inception. As depicted in Figure 1, its layer-wise feature fusion mechanism uniquely preserves global context while recovering spatial detail, establishing an enduring paradigm. Subsequent innovations confront inherent limitations: Zhang et al.’s DIU-Net [25] augments feature extraction breadth through DenseNet-inspired connections [26] and GoogLeNet’s parallel convolutions [27], yet remains constrained by traditional convolution’s limited receptive field interactions. This fundamental restriction in modeling long-range dependencies catalyzed Transformer integration, exemplified by Petit et al.’s U-Transformer [28] employing multi-head self-attention (MHSA) for global encoding and cross-attention (MHCA) for semantic filtering, significantly enhancing low-contrast boundary delineation. Concurrently, Su et al.’s MSU-Net [29] addresses multi-scale challenges through hierarchical receptive fields and interactive skip connections. However, these architectural enhancements incur prohibitive computational costs, particularly the attention mechanisms’ quadratic complexity that severely compromises efficiency in high-resolution medical imaging scenarios.

2.2. Integration of Mamba Module with U-Net for Enhanced Medical Image Segmentation

Despite the Transformer’s dominance in language and vision tasks, its self-attention mechanism suffers from quadratic computational complexity relative to sequence length, a critical limitation in high-resolution image processing. This scalability challenge has motivated state space models (SSMs) as promising alternatives for efficient long-sequence modeling. By transforming sequence processing into linear-complexity operations through continuous-time dynamic systems, SSMs map input sequences x(t) to hidden states h(t), generating outputs y(t) while preserving global modeling capabilities.

However, medical data inherently exists as discrete sampling signals, precluding direct application of continuous formulations. Gu et al.’s structured state space model (S4) [17] addresses this through three synergistic innovations: discretization via zero-order hold techniques enables digital signal processing, convolutional reformulation permits parallel training, and HiPPO (High-order Polynomial Projection Operators) integration resolves long-range dependency capture. This unified approach significantly enhances computational efficiency while maintaining the theoretical advantages of continuous system modeling.

Although Mamba was initially designed for 1D sequences, visual tasks require multidimensional processing. Vision Mamba [18] bridges this gap by reconfiguring the ViT architecture, substituting the Transformer encoder with a Vision Mamba (ViM) encoder while integrating positional embeddings to maintain spatial awareness. This design achieves effective visual representation without self-attention overhead. Parallel work by Liu et al. introduced VMamba [19], whose SS2D (Selective Scanning 2D) module fundamentally rethinks spatial processing. SS2D synergizes cross-scanning mechanisms with state space models to enable linear-complexity global feature modeling, with its operational flow visualized in Figure 2. The SS2D framework transforms 2D feature maps through three cohesive stages: Cross-scanning first decomposes spatial structures into four directional 1D sequences, enabling each pixel to aggregate global context through distinct pathways. These sequences then undergo adaptive processing via S6 blocks—gated SSM units that dynamically filter features while recursively accumulating long-range dependencies within strict O(N) complexity. Finally, directional reconstructions synthesize multipath contextual intelligence into consolidated 2D representations through weighted fusion, preserving spatial fidelity throughout the transformation.

In medical images, segmentation models are primarily based on U-Net. After the emergence of the Mamba model, a series of medical segmentation models were developed in combination with the U-Net model. U-Mamba [30] first attempted to integrate Mamba into U-Net for medical image segmentation and successfully proposed a hybrid SSM-CNN module. It inherits the local feature extraction capability of the convolutional layer and the excellent long-range dependency capture capability of the SSM model, verifying the effectiveness of SSM in medical image segmentation tasks. In order to further break through the limitations of convolutional inductive bias, Ruan et al. [31] first proposed a U-shaped medical image segmentation model VM-UNet based on pure SSM. The VSS block proposed in VMamba [19] is introduced to replace the convolutional layer of the traditional U-shaped network, and the VSS block is used to capture a wide range of contextual information. Combined with ViT’s patch merging and Patch Extension operations, they are used to perform upsampling and downsampling operations. This integrated approach lays a solid foundation for further developing effective medical image segmentation methods. In addition, in the medical field, the computational burden is too heavy to be suitable for clinical applications due to limited computing resources. Nguyen et al. [32] deeply analyzed the key factors affecting parameter efficiency in the Mamba framework. They proposed the MambaU-Lite model, the core of which is to process parallel Vision deep features. The Mamba layer and a parallel convolution layer are used to process features. The model parameters are only 416K, which significantly reduces the computational burden of the model.

2.3. Hybrid Transformer–Mamba and Diffusion-Based Architectures for Medical Segmentation

In 2024, Hatamizadeh et al. pioneered MambaVision, a hybrid architecture integrating state space models (SSMs) with Transformer components [33]. This approach employs CNN residual blocks for rapid multi-resolution feature extraction: the front-end leverages Mamba’s linear complexity for long-sequence processing, while the back-end utilizes Transformer self-attention to enhance global dependency modeling, significantly boosting dense prediction accuracy. Subsequently, Zhang et al.’s HMT-Unet [34] emerged as a representative hybrid model, optimizing local feature extraction through MambaVision blocks while introducing a multi-scale Attention Aggregation (MSAA) module to strengthen multi-scale perception, achieving superior medical image segmentation accuracy over traditional CNN and pure Transformer architectures.

Concurrently, diffusion models have advanced medical image segmentation. MedSegDiff [35] first adapted denoising diffusion probabilistic models (DDPMs) to medical segmentation, employing conditional encoders to inject modality-specific features and achieving state-of-the-art results in 3D MRI brain tumor segmentation. Similarly, diff-Unet [36] innovatively integrated U-Net with latent diffusion models, designing noise-aware skip connections and single-step fine-tuning strategies that improved segmentation accuracy while reducing inference time for skin lesions.

Despite these advances, critical challenges persist in clinical skin lesion segmentation. Hybrid models’ residual Transformer components cause memory overhead surges during high-resolution processing, hindering deployment in resource-constrained settings. Diffusion models face dual limitations: their iterative sampling mechanisms impede real-time performance, while segmentation accuracy suffers on ambiguous lesion boundaries. Addressing these issues, our Mamba-based segmentation framework capitalizes on selective state-space models’ core properties—maintaining linear O(L) complexity while precisely preserving lesion details. Eliminating Transformer dependencies, our approach directly learns lesion feature distributions, demonstrating high accuracy across ISIC2017, ISIC2018, and PH2 datasets, thereby providing efficient, precise technical support for early clinical diagnosis.

3. Methods

Figure 3 shows the proposed VML-UNet. The model adopts the structure of a U-shape network, which contains four basic substructures: encoder, decoder, bottleneck layer, and skip connection.

The input skin lesion image is initially adjusted to 16 channels after passing through a convolution layer. Subsequently, the generated feature map enters the encoder stage. Specifically, the

16 \times H \times W

feature map with a resolution will first pass through two CPMamba Enhancement blocks (CPME). After downsampling through the maximum pooling layer of these two encoding blocks, the spatial resolution of the feature map will be adjusted to

32 \times \frac{H}{2} \times \frac{W}{2}

and

64 \times \frac{H}{4} \times \frac{W}{4}

. The following two encoding layers will be divided into two parallel paths, effectively reducing the number of channels of the feature map by half. Then, it is processed by the CPME block of the main path and the axial convolution encoder (ACE block) of the auxiliary path. After the maximum pooling layers of these two CPME blocks and ACE blocks, the size of the feature map is adjusted to

64 \times \frac{H}{8} \times \frac{W}{8}

, respectively

128 \times \frac{H}{15} \times \frac{H}{16}

. Finally, the feature maps output by these two parallel paths are spliced on the channels to obtain a size of and are input into the

256 \times \frac{H}{16} \times \frac{H}{16}

MAB block of the bottleneck structure, which is then combined with the jump connection and passed through the decoding layer. Through the decoding layer of each layer, the number of channels of the output feature map is adjusted to 1 to obtain the reduced-map. Reduce all decoding layers map to perform channel splicing. These outputs are then passed through Final. The Conv layer performs processing to generate a predicted mask for the input image.

3.1. Multi-Scale Lightweight Axial Convolution Bottleneck (MAB)

Traditional convolutional bottlenecks exhibit fundamental limitations in modeling direction-sensitive structures like lesion edges due to their isotropic spatial response characteristics and parameter redundancy. These constraints impose prohibitive computational costs for capturing orientation-critical features. To address this dual challenge, we introduce the Multi-scale Axial Bottleneck (MAB) as shown in Figure 4a. Its core innovation leverages three parallel axial depthwise separable convolutions (AxialDW) with kernel size n = 3, illustrated in Figure 4b. This architecture implements strategic axis decomposition along orthogonal planes while incorporating depthwise channel factorization. Through synergistic optimization of directional sensitivity and parameter efficiency, MAB achieves O(N) computational complexity—exponentially superior to conventional approaches—while preserving vital orientation-aware feature representations. The AxialDW operation executes this through heat-diffusion-inspired directional processing:

X^{'} = X + D W_{1 \times n} (X) + D W_{n \times 1} (X) .

(1)

Y = GELU ({PW}_{C 1 \to C 2} (BN (X^{'}))) .

(2)

where X is the input feature, Y is the output feature; DW, PW and BN represent depthwise convolution, pointwise convolution and normalization,

1 \times N

and

N \times 1

is the convolution kernel size.

C 1

and

C 2

represent the number of input and output channels of the feature map.

The multi-scale feature extraction module employs axially dilated depthwise convolutional layers (kernel size = 3) with dilation coefficients d = 1, 2, 3, as detailed in Figure 4b. This configuration balances two critical design imperatives: First, after quadruple downsampling, feature map spatial dimensions reduce to 1/16th of the original inputs, where larger kernels would cause excessive feature overlap, making 3 × 3 kernels optimal for receptive field expansion while preserving resolution. Second, parallel branches with multi-rate dilation establish exponentially expanding equivalent receptive fields, systematically integrating contextual semantics across spatial granularities. This approach conceptually aligns with Atrous Spatial Pyramid Pooling (ASPP) principles [37] through channel-concatenated cross-scale feature fusion. To further minimize learnable parameters, a pointwise convolution layer precedes the axial depthwise separable convolution mechanism, reducing channel dimensionality before feature processing.

3.2. CPMamba Enhancement Block (CPME)

Traditional visual state space (VSS) blocks, depicted in Figure 5a, process inputs through sequential VSS, Layer Normalization (LN), and Multi-layer Perceptron (MLP) operations. This architecture fundamentally struggles with efficient modeling of cross-regional pathological features while exhibiting linear parameter growth relative to depth—severely limiting low-resource deployment.

3.2.1. Parallel Vision Mamba

Our CPME block innovates through the multi-branch architecture shown in Figure 5b, centered on VSS-CPCA (channel prior convolutional attention) fusion. The first pathway employs 3 × 3 depthwise separable convolution for computationally efficient local feature extraction. Simultaneously, input features undergo channel-wise bisection into dual VSS processing streams, implementing Wu et al.’s parameter-optimized Mamba framework [38] which preserves global reasoning capabilities at reduced computational cost. These parallel processed features then undergo channel concatenation to restore original dimensionality, followed by Instance Normalization and ReLU activation. This integrated design achieves breakthrough efficiency in complex medical image semantics modeling while maintaining stringent lightweight constraints, formally expressed as follows:

X_{split 1}, X_{split 2} = {Split}_{\frac{c}{2}} ({DepthwiseConv}_{3 \times 3} (X))

(3)

Y = ReLU (IN (Concat [VSS (X_{split 1}), VSS (X_{split 2})]))

(4)

where IN represents the InstanceNorm operation.

{S p l i t}_{\frac{c}{2}}

(\cdot)

represents splitting the output in half from the channel.

3.2.2. Channel Prior Convolutional Attention

In the second branch, the method proposed by CBAM is used to perform channel and spatial attention in sequence. First, a one-dimensional channel weight map is inferred from the input feature map

M_{c} \in R^{C \times 1 \times 1}

. Then, the channel attention values

M_{c}

are broadcast

F_{c} \in R^{C \times H \times W}

along the spatial dimension by element-by-element multiplication with the input features to obtain refined features with channel attention F. Subsequently, the spatial attention module (SA) is processed

F_{c}

to generate a spatial attention map

M_{s} \in R^{C \times H \times W}

. Finally, the features are output by dynamically weighting the spatial dimensions of the feature map

\bar{F} \in R^{C \times H \times W}

. The whole process can be summarized as follows:

F_{c} = CA (F) \otimes F

(5)

\bar{F} = SA (F_{c}) \otimes F_{c}

(6)

Average and maximum pooling operations are first used to aggregate spatial information in the feature map. This process generates two independent spatial context descriptors, which are then nonlinearly mapped through a multi-layer perceptron with shared weights. After element-by-element addition, a channel attention map is finally formed. In addition, to reduce the calculation parameters, a single hidden layer is used to form a shared MLP layer. The calculation of channel attention can be summarized as follows:

CA (F) = σ (MLP (AvgPool (F)) + MLP (MaxPool (F)))

(7)

where

σ

represents the sigmoid operation.

The obtained channel attention map is passed through a multi-scale depth-separable convolution to capture the spatial relationship between features, ensuring the preservation of the relationship between channels while reducing the computational complexity. Subsequently, the obtained output uses a strip convolution optimization structure to enhance the ability of the convolution operation to capture spatial relationships. Finally,

1 \times 1

convolution with the same kernel size fuses the channel information. The calculation of spatial attention can be summarized as the following formula:

SA (F) = {Conv}_{1 \times 1} (\sum_{i = 0}^{3} {Branch}_{i} (DwConv (F)))

(8)

where DwConv represents depthwise convolution, and

{Branch}_{i}

refers to the 0 to i branches inside the MSC block. Branch 0 preserves the original features to maintain the integrity of high-frequency details.

3.3. Multi-Scale Local Supervision Module (MLSM)

While the SS2D module in visual state space (VSS) architectures effectively captures long-range dependencies through receptive field expansion, its global attention mechanism suffers from local semantic dilution in skin lesion segmentation—particularly detrimental at lesion boundaries. Medical imaging’s inherent high-resolution demands precise spatial fidelity, yet deep feature extraction attenuates fine-grained information, compromising edge segmentation accuracy. Addressing this fundamental limitation, our multi-scale local supervision module (MLSM) introduced in Figure 6 enables collaborative local–global optimization through heterogeneous multi-scale convolutions. This synergistic approach preserves critical boundary details while maintaining the SS2D’s global contextual strengths, resolving the spatial precision-efficiency trade-off endemic to medical image analysis.

We use three parallel convolution branches in MLSM:

1 \times 1

convolution layer: extract high-frequency detail features (such as texture features at the edge of the lesion) through point-by-point convolution;

3 \times 3

convolution layer: capture mid-frequency structural information and use symmetric padding to maintain the continuity of feature map boundaries;

5 \times 5

convolution layer: model low-frequency global features and enhance spatial context perception through a large receptive field.

F_{l 1} = LeakyReLU (BN ({Conv}_{1 \times 1} (F_{l})))

(9)

F_{l 1} = LeakyReLU (BN ({Conv}_{3 \times 3} (F_{l})))

(10)

F_{l 1} = LeakyReLU (BN ({Conv}_{5 \times 5} (F_{l})))

(11)

where BN is batch normalization, LeakyReLU is the activation function, and

F_{i}

represents the decoder’s output features at the corresponding stage.

The multi-scale features output by each branch are concatenated in the channel dimension and then passed through the SpatialDropout and

1 \times 1

convolution compression layers. This design is inspired by the channel attention mechanism of SENet [36] and improves feature robustness through spatial dimension regularization. Finally, the original input resolution is restored through bilinear interpolation upsampling. The MLSM process can be summarized as the following formula:

F_{l}^{'} = Upsample ({Conv}_{1 \times 1}^{'} ({Conv}_{1 \times 1} (SpatialDropout (Concat (F_{l 1}, F_{l 2})))))

(12)

During model training, a multi-scale local supervision module (MLSM) is embedded in the second to fourth layers of the decoder to enhance the feature representation capabilities at different stages. The outputs of each layer are summed element by element to obtain the segmentation result, and an auxiliary loss function is constructed based on the result for model optimization.

3.4. Loss Function

In order to strengthen the learning of local details, this paper designs two loss functions: the main loss function and the auxiliary loss function. The main loss function solves the class imbalance problem between lesions and backgrounds in medical images

L_{p}

by combining Dice loss and Tversky loss. Among them, Tversky loss adjusts parameters

α = 0.7

,

β = 0.3

, to adapt to the blurred edges of skin lesions. The auxiliary loss function

L_{A}

is constructed by MLSM, which is introduced at the output of the second to fourth levels of the decoder. Tversky loss is used to constrain the features of each level independently, and the weight coefficient decays exponentially to achieve optimization from coarse-grained to fine-grained. The final total loss function is defined as L:

L_{Dice} (g, p) = 1 - \frac{2 \sum_{i = 1}^{n} p_{i} g_{i}}{\sum_{i = 1}^{n} (p_{i} + g_{i})}

(13)

L_{Tversky} (g, p) = 1 - \frac{2 \sum_{i = 1}^{n} p_{i} g_{i}}{\sum_{i = 1}^{n} (p_{i} g_{i}) + α \sum_{i = 1}^{n} p_{i} (1 - g_{i}) + β \sum_{i = 1}^{n} g_{i} (1 - p_{i})}

(14)

L = L_{p} + L_{a} = (L_{Dice} + L_{Tversky}) + γ L_{Tversky}

(15)

where

p_{i} \in [0, 1]

represents the predicted label of the pixel, and

g_{i} \in [0, 1]

represents the true label of each pixel. The

γ

is set to 0.4.

4. Experiment

4.1. Dataset

The ISIC2017, ISIC2018, and PH2 datasets represent established public benchmarks in skin lesion segmentation, selected for their synergistic alignment with critical research objectives. These datasets collectively address urgent clinical needs through comprehensive coverage of high-mortality malignancies such as melanoma. Precise segmentation of such lesions directly enables early diagnostic interventions, fulfilling our core aim of enhancing clinical efficiency. Methodologically, their complementary design ensures rigorous evaluation: ISIC2017 and ISIC2018 provide substantial scale for robust model training, while PH2’s constrained sample size validates adaptive capability across operational scenarios. Crucially, all datasets incorporate expert-validated lesion annotations that guarantee gold-standard reliability, establishing a foundational basis for model development and comparative analysis.

During preprocessing, all images were standardized to 256 × 256 resolution. This decision balances computational efficiency with diagnostic integrity, specifically preserving critical edge textures essential for accurate segmentation while supporting lightweight architectural goals. Uniform resolution also eliminates input size variability as a potential confounding factor, ensuring experimental fairness in cross-model comparisons.

4.1.1. ISIC2017

This dataset contains 2000 images of skin lesions covering three clinically critical lesion types: melanoma, melanocytic nevus, and seborrheic keratosis. The format is a 24-bit deep RGB three-channel JPG and PNG mixed format, with original resolutions ranging from

768 \times 560

to

1024 \times 768

. An accurate binary segmentation mask accompanies each image. We randomly selected 1800 images as the training set of the model and 200 images as the test set of the model.

4.1.2. ISIC2018

This dataset contains 2594 skin images covering common skin lesions such as melanoma, nevus, and seborrheic keratosis. The format is a 24-bit RGB three-channel PNG format with a uniform resolution of

768 \times 560

pixels. All images contain manually annotated binary segmentation masks. Similar to the ISIC2017 dataset, we randomly selected 2334 images as the training set of the model and 260 images as the test set of the model.

4.1.3. PH2

This dataset contains 200 dermoscopic images covering three clinically critical lesion types: benign nevus, atypical nevus, and melanoma. The images are stored in PNG format with 8-bit RGB depth, and the original resolution ranges from

600 \times 450

to

768 \times 560

pixels. They are all in 24-bit RGB three-channel PNG format. All images contain manually annotated binary segmentation masks. We randomly selected 170 images as the training set of the model and 30 images as the test set of the model.

4.2. Implementation Setup

This experiment is built on the PyTorch 2.3.0 framework, using the AdamW optimizer and performing end-to-end training on the ISIC2017, ISIC2018, and PH2 datasets. The initial learning rate is set to

1 \times 10^{- 3}

, and dynamically adjusted the learning rate is based on the Dice coefficient of the training set: if the optimal performance is not achieved for five consecutive training cycles, the learning rate is reduced to half of the current value, and the minimum learning rate threshold is

1 \times 10^{- 6}

; the experiment was run on an Ubuntu 20.10 server equipped with an NVIDIA GeForce RTX 4070Ti GPU, the batch size was fixed to 16, the total number of training cycles was 300, and all tasks were completed in a single-card environment to control variables.

4.3. Evaluation Index

We used two main metrics in semantic segmentation to evaluate the model’s performance: Dice Similarity Coefficient (DSC) and Intersection over Union (IOU). These metrics help determine the similarity overlap between the predicted mask and the ground truth label, demonstrating the model’s effectiveness. The calculation formula is as follows:

IoU = \frac{TP}{TP + FP + FN}

(16)

DSC = \frac{2 TP}{2 TP + FP + FN}

(17)

where TP (True Positive) represents the number of correctly classified lesion pixels, FP (False Positive) represents the number of pixels in normal tissue that are misclassified as lesions, and FN (False Negative) represents the number of pixels that are missed in the lesion area.

5. Results and Discussion

5.1. Comparative Performance Analysis with State-of-the-Art Models

To verify the effectiveness of the VML-UNet model, this work selected seven representative medical segmentation models as baseline comparisons, including classic architectures (U-Net [6]), attention mechanism variants (Attention U-Net [39], DCSAU-Net [40]), lightweight designs (UNext [9], ULite [41]), and cutting-edge state-space models (VM-UNet [31], MambaU-Lite [32]). All comparison models were reproduced based on the author’s open-source code and trained under a unified experimental configuration to eliminate experimental variable interference. The performance evaluation was performed on the ISIC2017, ISIC2018, and PH² datasets.

First, we compare the accuracy of VML-UNet with other classical models on the ISIC2017, ISIC2018, and PH2 datasets. All models are trained under the same conditions to ensure fairness; the results are shown in Table 1.

For the ISIC2017 and ISIC2018 datasets, which contain diverse skin lesions such as melanoma, nevus, and others with significant variations in size, morphology, and anatomical distribution, VML-UNet achieves an average Dice Similarity Coefficient (DSC) of 0.9075 and Intersection over Union (IoU) of 0.8367. This performance can be attributed to the CPMamba module efficiently capturing the long-range spatial dependencies of complex lesion distributions. In addition, the MLSM module refines the boundaries of heterogeneous lesions, improving the DSC metric by approximately

5 %

compared to CNN-based models such as UNet, UNext, and Attention UNet.

For the small PH2 dataset, characterized by high labeling accuracy and dense micro-lesions, VML-UNet achieves a DSC score of 0.9564 and an IoU of 0.9053. This high accuracy is attributed to two factors: (1) the fine-grained local supervision of the MLSM module accurately capturing the boundaries of micro-lesions, and (2) the effective suppression of background noise by the CPMamba module, facilitated by the low-noise imaging environment typical of PH2 images. Consequently, VML-UNet outperforms models such as VM-UNet and Attention UNet on this dataset. Its final performance also exceeds that of Mamba-based models like VM-UNet and MambaU-Lite, verifying the model’s robustness in small-data scenarios.

The visualization results in Figure 7 further demonstrate that, even with common challenges like low-contrast lesion boundaries and complex backgrounds in the datasets, the model’s predicted masks exhibit significantly higher overlap with the ground truth contours. This is likely due to the CPMamba module effectively capturing global lesion features through its channel-space dynamic weight allocation mechanism.

Table 2 shows the efficiency and corresponding segmentation performance of our VML-UNet and several existing models on the ISIC2017 dataset. On the ISIC2017 dataset, VML-UNet achieves the current optimal segmentation accuracy with a Dice coefficient (DSC) of 0.9121 and an intersection over union (IoU) of 0.8452. Its computational complexity is only 1.24 GFLOPS, and the number of parameters is as low as 0.53 M. Although the UNext model exhibits higher computational efficiency as low as 0.57 GFLOPS, its DSC score and IoU are only 0.8775 and 0.8054, which are significantly lower than VML-UNet, indicating that traditional lightweight designs have difficulty in balancing accuracy and efficiency. Our VML-UNet achieves an excellent balance between segmentation accuracy and efficiency.

5.2. Robustness Evaluation of VML-UNet Under Complex Imaging Interference

To rigorously evaluate robustness in clinical deployment, we systematically augmented public skin lesion datasets to design specialised test datasets, generating 200 images for each interference scenario. Figure 8 visually illustrates the three key real-world interference scenarios we simulated: For sensor noise simulation, we added

20 %

salt-and-pepper noise to ISIC2017 images to replicate CMOS sensor degradation during clinical imaging, with panel (a) clearly showing the contrast between the original image and the noisy version; For hair occlusion scenarios, we used naturally obscured lesion images professionally curated from the ISIC2017 and ISIC2018 datasets, where panel (b) displays the difference between images without occlusion and those with hair interference on either side; Low-light conditions were achieved by calibrated luminance reduction of ISIC2017 samples to mimic suboptimal dermoscopic illumination, an effect observable in the comparison between normal and low-light environment images in panel (c). Additionally, for high-resolution assessment, we preserved the native resolution of ISIC2017 images to fully retain diagnostic-grade details, facilitating evaluation of computational scaling performance. This comprehensive validation framework, closely aligned with the scenarios presented in Figure 8, establishes an ecologically valid stress-testing system that covers various critical failure modes commonly encountered in real-world dermatological clinical practice.

VML-UNet achieves robust multi-scene adaptation with merely 0.53 M parameters and a 2.18 MB memory footprint. This extreme lightweight design demonstrates remarkable operational stability, evidenced in Table 3 by two key observations: Under high-resolution inputs where pixel proliferation extends inference to 210 ms, the model maintains peak segmentation accuracy (DSC = 0.9260, IoU = 0.8630) through meticulous detail preservation. Meanwhile, in clinically challenging scenarios involving noise, low-light, or hair occlusion, inference efficiency stabilizes at 38–47 ms while exhibiting minimal performance variance, with maximum IoU deviation limited to

2.2 %

across all test conditions. Such deterministic behavior under environmental stressors confirms the architecture’s resilience for deployment in diverse clinical environments.

5.3. Ablation Studies

5.3.1. Effects of the Proposed Modules

To verify the effectiveness of the CPME, MAB, and MLSM modules in VML-UNet, this work conducts ablation experiments based on the Mamba-ULite architecture on the ISIC2017 dataset, as shown in Table 4.

Figure 9 demonstrates VML-UNet’s module-specific optimization effects. Replacing encoder Mamba blocks with the CPME module reduces parameters from 0.42 M to 0.33 M while increasing DSC to 0.9078 and IoU to 0.8386. This improvement confirms CPME’s efficacy in decoding complex lesion semantics through fused channel prior attention and Mamba optimization. Substituting the bottleneck’s standard block with the MAB decreases FLOPs from 1.25 G to 0.93 G with only

0.4 %

DSC reduction, validating its axial decomposition strategy for lightweight efficiency. Concurrently, incorporating the MLSM elevates DSC and IoU by

0.6 %

and

0.4 %

respectively, despite increasing parameters to 0.62 M. This module enhances small lesion segmentation robustness through multi-level semantic supervision, strengthening local–global feature associations without inference overhead. Supporting visual evidence is provided in Figure 10. Collectively, CPME and MAB achieve accuracy-efficiency co-optimization via global–local feature fusion and axial convolution decomposition, while MLSM provides zero-cost feature discrimination enhancement. This synergistic approach enables dual improvement in segmentation accuracy and computational efficiency.

5.3.2. Effect of the Dual-Branch Structure

To validate the efficacy of the multi-path design, this study compares the performance of VML-UNet (Full), incorporating the ACE module, against its variant without this component (VML-UNet w/o ACE). Results, as shown in Table 5, demonstrate that VML-UNet (Full) exhibits higher computational requirements with 1.24 G FLOPs, 0.53 M parameters, and 2.18 MB memory occupancy compared to VML-UNet w/o ACE. This increase stems from the additional computational steps and parameters introduced by the ACE module. Crucially, VML-UNet (Full) achieves significantly superior segmentation accuracy, attaining a DSC of 0.9078 and an IoU of 0.8386. These quantitative improvements are visually corroborated in Figure 11, which presents a visualisation of segmentation results for VML-UNet with and without the ACE module. In the regions highlighted by circular annotations, the segmentation output of VML-UNet with the ACE module closely aligns with the ground truth, featuring stronger boundary continuity and more complete capture of fine-grained structures. These metrics and visual evidence collectively confirm that the multi-path architecture enhances skin lesion segmentation through the ACE module’s improved feature capture capability, while also establishing an optimal balance between efficiency and performance, particularly well-suited for high-precision segmentation tasks.

5.4. Discussion

The experimental results and visualization analysis of VML-UNet on three skin lesion datasets verified an effective balance between lightweight design and high accuracy. This balance can be analyzed in two aspects: model performance and core module contributions.

Regarding overall model performance, VML-UNet has only 0.53 M parameters, 2.18 MB memory occupation, and 1.24 GFLOPs computational complexity. This is significantly lower than the 44.3 M parameters of VM-UNet and the 65.52 G FLOPs of U-Net. Meanwhile, the DSC and IoU scores of VML-UNet are better than those of lightweight models, such as U-Lite and MambaU-Lite, across all three datasets. Especially on PH2, the model achieves a DSC score of 0.9564, attributed to its accurate capture of tiny lesions, verifying the synergy between lightweight design and segmentation accuracy.

Compared with existing studies, the key innovations of VML-UNet are as follows: (1) It breaks through the local feature modeling limitations of traditional CNNs by utilizing the VSS block to achieve global dependency capture with linear computational complexity. (2) It employs the “training-specific” MLSM module to enhance local detail accuracy without increasing inference burden. (3) It incorporates the axial decomposition strategy of the MAB module to preserve orientation-sensitive features within the lightweight framework. This contrasts with models relying solely on single mechanisms, such as the MLP block in UNext or the pure SSM structure in VM-UNet.

VML-UNet’s particular success on PH2 can be explained by several factors: The dataset consists of 200 images characterized by high labeling accuracy, dense lesions, and low imaging interference, aligning well with VML-UNet’s design strengths. The MLSM module accurately captures details of dense, small lesions, while the attention mechanism of the CPMamba module effectively suppresses noise in PH2’s low-interference environment. The synergy between these modules enables superior performance on this dataset.

A significant limitation stems from dataset composition bias. The experimental datasets contain a disproportionately high percentage of light-skinned samples, leading to more robust feature learning for high-contrast lesions against light backgrounds. Conversely, the model’s capability to segment low-contrast lesions on dark or medium-skinned skin remains inadequately validated due to insufficient representation. This distributional bias risks reducing segmentation accuracy for dark-skinned lesions in clinical practice—a concern particularly amplified in small datasets like PH2. Compounding this issue, the textural complexity of darker skin may further interfere with lesion edge identification, while the scarcity of such samples fundamentally compromises model generalizability. Additionally, the PH2 dataset introduces specific constraints: its small sample size and limited coverage of rare lesions restrict rigorous validation of the model’s generalizability to these clinically important cases. Consequently, constructing large-scale datasets encompassing diverse skin tones and richer representations of rare lesions is imperative to rigorously assess clinical applicability and address these compounded limitations.

6. Limitation and Future Work

6.1. Limitation

Although VML-UNet demonstrates excellent accuracy and lightweight characteristics in skin lesion segmentation, it still exhibits the following limitations.

While VML-UNet’s FLOPs are significantly lower than U-Net and VM-UNet, its computational complexity remains higher than UNext. Further compression is needed to approach the efficiency required for ultra-lightweight deployment scenarios. Although the three experimental datasets (ISIC2017, ISIC2018, PH2) are representative, their coverage is limited. The PH2 dataset contains only 200 samples, and rare skin lesions are underrepresented across all datasets, restricting the model’s generalization in clinically complex scenarios. The datasets are predominantly composed of light-skinned samples, with insufficient representation of dark- and medium-skinned individuals. Consequently, the model’s performance in low-contrast scenarios (e.g., identifying pigmented lesions against dark skin backgrounds) is not fully validated, potentially limiting its clinical generalizability across diverse patient populations. Validation was confined to dermoscopic image datasets. The model’s adaptability to other imaging modalities or images acquired by different devices remains unevaluated, leaving its cross-modality and cross-device performance uncertain. The model’s performance was not stratified or validated based on lesion size during training or evaluation. Small lesions risk feature loss during downsampling, while large, irregularly shaped lesions may suffer from incomplete edge segmentation. VML-UNet’s robustness across varying lesion sizes requires dedicated assessment.

6.2. Future Work

To address these limitations, future work will focus on the following directions: Explore lightweight variants of the CPMamba module to further reduce FLOPs while maintaining accuracy, aiming to match or surpass the computational efficiency of UNext. Construct larger, more diverse, and hierarchically structured datasets. Prioritize validation across different skin tones and include a broader spectrum of rare lesions to improve clinical applicability and generalization. Incorporate domain adaptation techniques, such as adversarial training, to enhance the model’s generalization capability on cross-modal data. Collaborate with clinical institutions to evaluate real-time performance and diagnostic efficacy within actual clinical workflows, facilitating the transition from laboratory research to clinical application. Investigate the synergistic mechanisms between the CPMamba, MAB, and MLSM modules. Focus on reducing redundant computations through dynamic attention weight allocation. Explore lightweight fusion strategies with diffusion models to improve segmentation accuracy for lesions with fuzzy boundaries.

7. Conclusions

In this study, we proposed a lightweight network VML-UNet model for skin lesion segmentation to minimize the number of parameters, computational cost, and memory usage. We proposed the CPMamba block, which combines the advantages of Mamba and channel-spatial attention to effectively capture high-level and fine-grained features. To improve the local information learning ability of the model, we proposed a multi-scale local information supervision module MLSM that is only trained and introduced the MAB module with axial depthwise separable convolution as the core to ensure that the computational complexity is reduced while retaining the direction-sensitive features. Experiments on three skin lesion datasets verified the effectiveness of the proposed algorithm. The performance of our network is better than that of other methods, with IoUs reaching

84.52 %

,

82.82 %

, and

90.53 %

, respectively.

While our model shows promising results on the skin lesion dataset, there is still a long way to go to make the Mamba model lightweight and efficient, and future work will aim to optimize the model efficiency further.

Author Contributions

Conceptualization, T.T. and H.W.; methodology, T.T.; software, T.T. and W.G.; validation, T.T.; formal analysis, T.T.; investigation, T.T.; data curation, Q.R. and W.G.; writing—original draft preparation, T.T.; writing—review and editing, H.W. and Q.R.; visualization, K.Z. and W.G.; supervision, H.W.; project administration, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (Grant No. 62171327) and the Graduate Innovative Fund of Wuhan Institute of Technology (Grant No. CX2024172).

Institutional Review Board Statement

This study utilized publicly available skin lesion datasets, which were sourced from public databases. These datasets had been anonymized and processed in line with relevant norms during their release to protect participants’ privacy. Since this research did not involve direct recruitment, intervention, or contact with human subjects, and the data usage complied with the access and application rules of public datasets, no additional ethical approval from an Institutional Review Board was required for this study. Thus, it is marked as “Not applicable”.

Data Availability Statement

The datasets used in this study include ISIC2017, ISIC2018 (retrieved from the International Skin Imaging Collaboration archive: https://www.isic-archive.com/ (accessed on 21 October 2024) and PH2 (obtained from the Department of Information Engineering, University of Siena: https://github.com/vikaschouhan/PH2-dataset (accessed on 21 October 2024). For analysis, these raw datasets were preprocessed: images were cropped to 256 × 256 pixels and converted to .npy format. The preprocessed datasets are available from the corresponding author upon reasonable request, to ensure compliance with data privacy and ethical guidelines of the original sources.

Conflicts of Interest

The authors declare that there are no conflicts of interest in this study or during the publication process that could potentially affect the objectivity of the research.

References

Iriawan, N.; Pravitasari, A.A.; Nuraini, U.S.; Nirmalasari, N.I.; Azmi, T.; Nasrudin, M.; Fandisyah, A.F.; Fithriasari, K.; Purnami, S.W.; Irhamah; et al. YOLO-UNet Architecture for Detecting and Segmenting the Localized MRI Brain Tumor Image. Appl. Comput. Intell. Soft Comput. 2024, 2024, 3819801. [Google Scholar] [CrossRef]
Mirikharaji, Z.; Abhishek, K.; Bissoto, A.; Barata, C.; Avila, S.; Valle, E.; Celebi, M.E.; Hamarneh, G. A survey on deep learning for skin lesion segmentation. Med. Image Anal. 2023, 88, 102863. [Google Scholar] [CrossRef]
Shen, D.; Wu, G.; Suk, H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef]
Sun, L.; Li, C.; Ding, X.; Huang, Y.; Chen, Z.; Wang, G.; Yu, Y.; Paisley, J. Few-shot medical image segmentation using a global correlation network with discriminative embedding. Comput. Biol. Med. 2022, 140, 105067. [Google Scholar] [CrossRef]
Li, C.; Ma, W.; Sun, L.; Ding, X.; Huang, Y.; Wang, G.; Yu, Y. Hierarchical Deep Network with Uncertainty-Aware Semi-Supervised Learning for Vessel Segmentation. Neural Comput. Appl. 2022, 3151–3164. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
Zhou, Z.; Rahman; Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer International Publishing: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
Valanarasu, J.M.J.; Patel, V.M. UNext: MLP-based rapid medical image segmentation network. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022. [Google Scholar]
Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; Proceedings, Part II 19. Springer International Publishing: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
Ashish, V. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, I. [Google Scholar]
Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2021, 34, 12116–12128. [Google Scholar]
Hatamizadeh, A.; Yin, H.; Heinrich, G.; Kautz, J.; Molchanov, P. Global context vision transformers. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformersmake strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Zhang, Y.D.; Liu, H.Y.; Hu, Q. TransFuse: Fusing transformers and CNNs for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Strasbourg, France, 27 September 27–1 October 2021; Springer: Cham, Switzerland, 2021; pp. 14–24. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-time sequence Modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectionalstate space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
Wunderlich, K.; Suppa, M.; Gandini, S.; Lipski, J.; White, J.M.; Del Marmol, V. Risk Factors and Innovations in Risk Assessment for Melanoma, Basal Cell Carcinoma, and Squamous Cell Carcinoma. Cancers 2024, 16, 1016. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
Chan, H.P.; Hadjiiski, L.M.; Samala, R.K. Computer-aided diagnosis in the era of deep learning. Med. Phys. 2020, 47, 218–227. [Google Scholar] [CrossRef]
Fujita, H. AI-based computer-aided diagnosis (AI-CAD): The latest review to read first. Radiol. Phys. Technol. 2020, 13, 6–19. [Google Scholar] [CrossRef]
Wang, S.; Zhao, Z.; Ouyang, X.; Liu, T.; Wang, Q.; Shen, D. Interactive computer-aided diagnosis on medical image using large language models. Commun. Eng. 2024, 3, 133. [Google Scholar] [CrossRef]
Liu, S.; Liu, S.; Zhang, Y. A review of deep-learning-based medical image segmentation methods. Sustainability 2021, 13, 1224. [Google Scholar] [CrossRef]
Zhang, Z.A.; Wu, C.D.; Coleman, S.; Kerr, D. DENSE-INception U-net for medical image segmentation. Comput. Methods Programs Biomed. 2020, 192, 105395. [Google Scholar] [CrossRef]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Module, I.N. Googlenet: Going Deeper With Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
Petit, O.; Thome, N.; Rambour, C.; Themyr, L.; Collins, T.; Soler, L. U-net Transformer: Self and cross attention for medical image segmentation. In Machine Learning in Medical Imaging; Springer International Publishing: Cham, Switzerland, 2021; pp. 267–276. [Google Scholar]
Su, R.; Zhang, D.Y.; Liu, J.H.; Cheng, C. MSU-net: Multi-scale U-net for 2D medical image segmentation. Front. Genet. 2021, 12, 639930. [Google Scholar] [CrossRef] [PubMed]
Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
Nguyen, T.N.Q.; Ho, Q.H.; Nguyen, D.T.; Le, H.M.Q.; Pham, V.T.; Tran, T.T. MambaU-Lite: A Lightweight Model based on Mamba and Integrated Channel-Spatial Attention for Skin Lesion Segmentation. arXiv 2024, arXiv:2412.01405. [Google Scholar]
Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 25261–25270. [Google Scholar]
Zhang, M.; Chen, Z.; Ge, Y.; Tao, X. HMT-UNet: A hybrid mamba-transformer vision UNet for medical image segmentation. arXiv 2024, arXiv:2408.11289. [Google Scholar]
Wu, J.; Fu, R.; Fang, H.; Zhang, Y.; Yang, Y.; Xiong, H.; Liu, H.; Xu, Y. Medsegdiff: Medical image segmentation with diffusion probabilistic model. In Proceedings of the Medical Imaging with Deep Learning, PMLR, Paris, France, 3–5 July 2024; pp. 1623–1639. [Google Scholar]
Zheng, Z.; Wan, L.; Fu, H.; Yang, G.; Zhu, L. Diff-unet: A diffusion embedded network for volumetric segmentation. arXiv 2023, arXiv:2303.10326. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Wu, R.; Liu, Y.; Ning, G.; Liang, P.; Chang, Q. Ultralight vm-unet: Parallel vision mamba significantly reduces parameters for skin lesion segmentation. arXiv 2024, arXiv:2403.20035. [Google Scholar] [CrossRef]
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
Xu, Q.; Ma, Z.; Duan, W. DCSAU-Net: A deeper and more compact split-attention U-Net for medical image segmentation. Comput. Biol. Med. 2023, 154, 106626. [Google Scholar] [CrossRef]
Dinh, B.D.; Nguyen, T.T.; Tran, T.T.; Pham, V.T. 1M parameters are enough? A lightweight CNN-based model for medical image segmentation. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; IEEE: New York, NY, USA, 2023. [Google Scholar]

Figure 1. Schematic diagram of the UNet architecture. The structure comprises four components: encoder layers, decoder layers, skip connections, and a bottleneck module.

Figure 2. Schematic diagram of 2D selective scanning (SS2D).The SS2D module employs a four-directional scanning strategy that flattens 2D features into 1D vectors along four distinct spatial directions. These directional features are processed through S6 blocks for context-aware feature extraction and then reintegrated into consolidated 2D output via cross-directional fusion.

Figure 3. Schematic diagram of VML-UNet’s overall structure for skin lesion segmentation.

Figure 4. The architecture of MAB Block.

Figure 5. The main components’ architectures of the proposed VML-UNet model. (a) The architecture of VSS block. (b) The proposed architecture of a conventional CPMamba. (c) The proposed architecture of ACE block. (d) The proposed architecture of CPME block. (e) The architecture of MSC block. (f) The architecture of axial DW block.

Figure 6. The architecture of MLSM block.

Figure 7. Visual segmentation results of the proposed VML-UNet and other state-of-the-art techniques on three skin lesion datasets.

Figure 8. Schematic comparison of skin lesion images under noise, hair occlusion and illumination changes. (a) The left side is the original image without noise, and the right side is the image after adding noise; (b) the left side is the image without hair occlusion, and the right side is the image with hair occlusion interference; (c) the left side is the image with normal illumination, and the right side is the image captured in low-light environment.

Figure 9. Radar comparison of performance and complexity for each module in ablation experiments. The CPME module achieves a balance between lightweight design and performance improvement. The MAB module highlights advantages in computational efficiency but has limited performance enhancement. Although the MLSM does not optimize parameters (Params) and memory, it significantly improves performance.

Figure 10. In the ablation experiments, Grad-CAM heatmap comparisons with different modules added are conducted. Red indicates high-activation regions focused by the model, while blue indicates low-activation regions, overlaid on the original images.

Figure 11. Comparative visualization of segmentation results of VML-UNet with and without ACE module. In the dominant region of circle annotation, the segmentation result of VML-UNet with ACE module is more close to the real annotation, with stronger boundary continuity and more complete capture of fine structure.

Table 1. Comparison with representative segmentation models on three different skin lesion datasets.

Methods	ISIC2017		ISIC2018		PH 2
Methods	DSC	IoU	DSC	IoU	DSC	IoU
U-Net	0.8483	0.7662	0.8769	0.7996	0.9243	0.8672
Attention U-Net	0.8710	0.7918	0.8833	0.8054	0.9260	0.8669
UNext	0.8775	0.8054	0.8899	0.8140	0.9233	0.8716
DCSAU-Net	0.8977	0.8288	0.8908	0.8183	0.9386	0.8870
VM-UNet	0.8903	0.8223	0.8931	0.8237	0.9235	0.8735
U-Lite	0.9057	0.8385	0.8948	0.8241	0.9425	0.8930
MambaU-Lite	0.9028	0.8357	0.8940	0.8191	0.9367	0.8894
PMFSNet	0.8953	0.8267	0.8818	0.8082	0.9301	0.8823
MobileUNTR	0.8878	0.8192	0.8764	0.8011	0.9262	0.8787
VML-UNet (Ours)	0.9121	0.8452	0.9028	0.8282	0.9564	0.9053

Table 2. Comparison of parameters and computational consumption with representative segmentation models on the skin lesion dataset ISIC2017.

Methods	FLOPS	Params	Memory Size	DSC	IoU
U-Net	65.52 G	34.5 M	138.11 MB	0.8483	0.7662
Attention U-Net	66.63 G	34.9 M	139.40 MB	0.8710	0.7918
UNext	0.57 G	1.5 M	5.89 MB	0.8775	0.8054
DCSAU-Net	6.92 G	2.6 M	10.39 MB	0.8977	0.8288
VM-UNet	7.56 G	44.3 M	177.09 MB	0.8903	0.8223
U-Lite	0.76 G	0.88 M	3.51 MB	0.9057	0.8385
PMFSNet	2.21 G	0.99 M	3.82 MB	0.8953	0.8267
MobileUNTR	1.3 G	1.3 M	5.18 MB	0.8878	0.8192
VML-UNet (Ours)	1.24 G	0.53 M	2.18 MB	0.9121	0.8452

Table 3. Performance comparison of algorithms in different image processing scenarios.

Methods	Response Time	Params	Memory Size	DSC	IoU
Contains noise ( $20 %$ )	42 ms	0.53 M	2.18 MB	0.8983	0.8318
Low light	47 ms	0.53 M	2.18 MB	0.8851	0.8189
Hair masking	38 ms	0.53 M	2.18 MB	0.8892	0.8231
High resolution	210 ms	0.53 M	2.18 MB	0.9260	0.8630
VML-UNet (Ours)	38 ms	0.53 M	2.18 MB	0.9121	0.8452

Table 4. Ablation experiments of several modules on the ISIC2017 dataset.

Model Name	CPME (A)	MAB (B)	MLSM (C)	FLOPS	Params	DSC	IoU
Baseline				1.25 G	0.42 M	0.9028	0.8357
Baseline+A	✓			1.24 G	0.33 M	0.9078	0.8386
Baseline+B		✓		0.93 G	0.38 M	0.9042	0.8366
Baseline+C			✓	0.93 G	0.62 M	0.9081	0.8392
Baseline+A+B	✓	✓		1.24 G	0.29 M	0.9087	0.8397
Baseline+A+C	✓		✓	1.24 G	0.55 M	0.9114	0.8437
Baseline+B+C		✓	✓	0.93 G	0.58 M	0.9098	0.8405
Baseline+A+B+C	✓	✓	✓	1.24 G	0.53 M	0.9121	0.8452

Table 5. Comparison of experimental results at ISIC2017 on the impact of ACE module on multipath design in VML-UNet.

Model Name	ACE	FLOPS	Params	Memory Size	DSC	IoU
VML-UNet w/o ACE		0.95 G	0.38 M	1.63 MB	0.8929	0.8257
VML-UNet (Full)	✓	1.24 G	0.53 M	2.18 MB	0.9078	0.8386

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tang, T.; Wang, H.; Rao, Q.; Zuo, K.; Gan, W. VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation. Electronics 2025, 14, 2866. https://doi.org/10.3390/electronics14142866

AMA Style

Tang T, Wang H, Rao Q, Zuo K, Gan W. VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation. Electronics. 2025; 14(14):2866. https://doi.org/10.3390/electronics14142866

Chicago/Turabian Style

Tang, Tang, Haihui Wang, Qiang Rao, Ke Zuo, and Wen Gan. 2025. "VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation" Electronics 14, no. 14: 2866. https://doi.org/10.3390/electronics14142866

APA Style

Tang, T., Wang, H., Rao, Q., Zuo, K., & Gan, W. (2025). VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation. Electronics, 14(14), 2866. https://doi.org/10.3390/electronics14142866

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

VML-UNet: Fusing Vision Mamba and Lightweight Attention Mechanism for Skin Lesion Segmentation

Abstract

1. Introduction

2. Related Work

2.1. Semantic Segmentation of Medical Images

2.2. Integration of Mamba Module with U-Net for Enhanced Medical Image Segmentation

2.3. Hybrid Transformer–Mamba and Diffusion-Based Architectures for Medical Segmentation

3. Methods

3.1. Multi-Scale Lightweight Axial Convolution Bottleneck (MAB)

3.2. CPMamba Enhancement Block (CPME)

3.2.1. Parallel Vision Mamba

3.2.2. Channel Prior Convolutional Attention

3.3. Multi-Scale Local Supervision Module (MLSM)

3.4. Loss Function

4. Experiment

4.1. Dataset

4.1.1. ISIC2017

4.1.2. ISIC2018

4.1.3. PH2

4.2. Implementation Setup

4.3. Evaluation Index

5. Results and Discussion

5.1. Comparative Performance Analysis with State-of-the-Art Models

5.2. Robustness Evaluation of VML-UNet Under Complex Imaging Interference

5.3. Ablation Studies

5.3.1. Effects of the Proposed Modules

5.3.2. Effect of the Dual-Branch Structure

5.4. Discussion

6. Limitation and Future Work

6.1. Limitation

6.2. Future Work

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI