1. Introduction
Dermatological diseases are common in clinical practice; among them, skin pigmentary lesions are caused primarily by an increase or decrease in pigment, leading to changes in skin color. Melanoma is the most severe form of skin cancer among pigmentary lesions. In its early stages, melanoma is easily confused with other benign pigmentary skin lesions [1], and once the disease progresses to an advanced stage, treatment becomes significantly more difficult. Early detection of this disease is therefore crucial. Due to factors such as hair, color, and blood-vessel distribution on the skin surface [2,3], as well as the low contrast between diseased and healthy skin, even experienced clinicians often struggle to accurately identify lesion areas, which can hinder the diagnosis of malignant melanoma [4]. In the medical field, image segmentation technology has evolved from manual segmentation to semi-automated and fully automated segmentation, with progressively improved results [5]. With the increasing size, complexity, and quantity of medical images, traditional machine learning methods and manual segmentation techniques are no longer sufficient to meet the demands of modern healthcare. The introduction of deep learning methods has, to some extent, addressed this challenge [6], significantly enhancing the efficiency and accuracy of image segmentation. However, the automatic segmentation of skin lesion images still faces several significant challenges, including hair occlusion [7], ambiguous lesion boundaries [8], low contrast between foreground and background [9], and high intra-class variability of lesion appearance [10]. These factors place substantial demands on the discriminative and generalization capabilities of segmentation models. The development of more robust segmentation algorithms is therefore of critical importance for enhancing the accuracy and reliability of early melanoma detection [11].
With the rapid development of deep learning, image processing capabilities have improved significantly, and image segmentation has become increasingly widespread in the medical field. Computer-Aided Diagnosis (CAD) [12] has become an important tool in clinical diagnostics, enabling accurate segmentation of skin lesion areas and significantly improving the efficiency of clinical diagnosis. Traditional segmentation methods, including thresholding [13], region-based segmentation [14], edge-based segmentation [15], and support vector machine (SVM)-based techniques [16], are limited in their ability to extract high-level semantic features and often struggle to distinguish complex structures. To address common issues in skin lesion image segmentation, such as edge loss and insufficient segmentation accuracy, researchers have proposed various improved algorithms. Long et al. [17] pioneered the Fully Convolutional Network (FCN), enabling end-to-end pixel-level image segmentation and laying the foundation for semantic segmentation. Building upon this, a significant breakthrough in medical image segmentation came with the U-Net architecture proposed by Ronneberger et al. [18], which employs a symmetric encoder-decoder structure and introduces skip connections to mitigate feature loss during up-sampling and down-sampling, effectively merging high-level and low-level semantic features and thereby promoting the widespread application of medical image segmentation. Subsequently, Oktay et al. [19] introduced Attention-UNet by embedding an attention mechanism within U-Net, adaptively adjusting the encoder output to extract more effective features. Innani et al. [20] further proposed an Efficient-GAN framework based on U-Net, employing a Squeeze-and-Excitation-based compound-scaling encoder to capture dense features and generate segmentation results, while using adversarial learning to distinguish between real and synthetic labels, ultimately improving boundary segmentation accuracy in lesion regions.
However, existing deep learning methods still face certain limitations. For instance, when processing skin lesion images, traditional U-Net architectures may experience feature loss during the up-sampling and down-sampling processes of the encoder-decoder, which affects segmentation accuracy. Moreover, existing methods often fail to fully account for the unique characteristics of skin lesion images, such as variations in the size and shape of lesion regions, hair occlusion, bubble interference, and uneven coloration.
To overcome these challenges, we propose a novel skin lesion segmentation framework that synergistically integrates a U-Net architecture, incorporating dense convolutional blocks, attention mechanisms, and a Long Short-Term Memory (LSTM) module, with the discriminative power of a multi-scale Generative Adversarial Network (GAN). The proposed method achieves dual improvements in segmentation precision and feature discriminability for diverse lesion patterns, demonstrating state-of-the-art performance across multiple benchmark dermatoscopic image datasets. The principal contributions of this work include:
Improvement of U-Net Architecture: We have made significant improvements to the traditional U-Net architecture by enhancing the encoder-decoder up-sampling modules, incorporating attention mechanisms in the skip connections, and integrating generative adversarial training. These modifications strengthen the model’s ability to recognize and segment complex skin lesion features.
Integration of Attention Mechanism: An attention mechanism is introduced into the model. By learning the features of lesion regions during batch-wise training, this strategy significantly improves the model’s ability to separate skin lesions from normal skin and to distinguish between common and rare types of lesions.
Addressing Feature Loss Issue: By incorporating the Bidirectional Convolutional Long Short-Term Memory (BDC-LSTM) module, which combines Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs), we effectively address issues such as gradient vanishing or explosion during training. This approach helps retain crucial feature information and focuses on solving the feature loss problem commonly encountered in the traditional U-Net architecture during the up-sampling and down-sampling processes.
Extensive Experimental Validation: We conducted extensive experiments on multiple skin lesion image datasets to validate the effectiveness of our proposed method and its applicability in recognizing different types of skin lesions. Additionally, we assessed the generalizability of the model across other datasets.
The following sections of this paper will provide a detailed description of our method, including the design of the model architecture, the implementation of generative adversarial training, and the experimental results and analyses on various datasets.
2. Related Work
Early traditional machine learning segmentation methods are cumbersome and heavily reliant on the physician’s prior knowledge and clinical experience during the feature selection phase, requiring extensive manual intervention. This not only increases operational complexity, but also makes the process susceptible to subjective judgment and external factors, leading to feature extraction bias that ultimately affects segmentation accuracy and the reliability of the final diagnostic results. In contrast, deep learning methods, through end-to-end training, automatically extract high-level features from images, significantly reducing manual intervention and effectively improving the accuracy and stability of medical image segmentation. The U-Net segmentation network enhances the precision and consistency of segmentation results by actively learning image features, incorporating well-designed loss functions, and optimizing through gradient-based algorithms. It has become an essential tool in the field of medical image segmentation, playing a critical role in assisting physicians to make more accurate diagnoses.
The U-Net++ model proposed by Zhou et al. [21] incorporates a deep network architecture with multi-layer skip connections to improve feature fusion. However, it faces accuracy limitations when handling blurred boundaries and complex lesion shapes. Xiao et al. [22] introduced a weighted Res-UNet model that builds upon the original Res-UNet by incorporating a weighted attention mechanism. This allows for more precise learning of pixel features in lesion and normal regions, thereby enhancing segmentation accuracy at lesion boundaries. Ruan et al. [23] proposed the Multi-Attention and Lightweight U-Net (MALUNet) model, which effectively balances parameter efficiency and segmentation performance by integrating multiple attention mechanisms with a lightweight design. However, its effectiveness is constrained when dealing with complex lesion regions. Le et al. [24] developed a mobile anti-aliasing attention U-Net model, which mitigates high-frequency information loss and enhances segmentation accuracy by using anti-aliasing pooling layers. Nonetheless, this approach captures insufficient fine-grained detail, particularly at lesion boundaries. Dai et al. [25] introduced a multi-scale residual encoder, which enhances feature representation learning by integrating the benefits of adaptive multi-scale features. Gu et al. [26] proposed the CA-Net model, which integrates a multi-scale attention mechanism with residual connections, effectively adapting to spatial locations, feature channels, and object scales, thus achieving higher accuracy. However, it still faces performance bottlenecks when dealing with larger-scale contextual relationships. Rahman et al. [27] presented a model with integrated attention gating and multi-scale convolutions, which effectively addresses segmentation challenges for irregular shapes and multi-scale objects. Nevertheless, it exhibits limited performance at lesion boundary regions and suffers from high model complexity. Wu et al. [28] introduced an image segmentation method that incorporates an adaptive dual-attention mechanism, effectively improving feature extraction specificity and segmentation accuracy.
Long Short-Term Memory (LSTM) networks have been widely applied in image segmentation tasks due to their outstanding performance in sequence data modeling, particularly in extracting complex spatial dependencies and contextual information from images. Yu et al. [29] proposed the MSAU-Net model, which combines bidirectional convolutional LSTMs to extract shared discriminative features from lesion regions while suppressing features with lower information content. However, this approach results in high complexity, requiring significant computational resources. Shahzaib et al. [30] introduced the TESL-Net model, which uses Swin-Transformer blocks in the encoder module to effectively extract global contextual information from skin lesion images. Nevertheless, the model’s ability is limited when dealing with complex lesion images. Rao et al. [31] proposed the U-LSTM model, which integrates the spatial feature extraction capability of CNNs with the temporal sequence modeling ability of LSTMs. This architecture enables simultaneous capture of both static features and dynamic changes of skin lesions; however, the model is structurally complex and demands substantial computational resources. Arjun et al. [32] introduced the RCNN-LSTM, combining the strengths of RCNN and LSTM to extract spatial features while handling temporal information, making it suitable for complex image data. Nevertheless, its segmentation performance is limited when processing lesion boundaries in smaller images.
Generative Adversarial Networks (GANs), first proposed by Goodfellow et al. [33] in 2014, pioneered a new direction in the research of deep generative models. The core concept of a GAN is an adversarial game between a generator and a discriminator, through which the generator progressively improves the authenticity of the generated samples. Bi et al. [34] introduced an adversarial learning-based method for automatic skin lesion segmentation, which enhances segmentation performance by performing data augmentation and deeply fusing features extracted through convolution operations in both the encoder and decoder. Wei et al. [35] proposed the GAN-based Att-DenseUnet model, which effectively integrates multi-scale features within the encoder, thereby strengthening feature representation capabilities. Additionally, by incorporating an attention mechanism, the model suppresses irrelevant regions in the output feature maps, reducing the impact of artifacts, while the introduction of an adversarial loss further enhances the discriminative power of the generated features and improves segmentation accuracy. Bansal et al. [36] proposed HEXA-GAN, which significantly enhances the accuracy of hair segmentation and enables natural reconstruction of hair-occluded regions, thereby improving the overall visual consistency of dermoscopic images. Bansal et al. [37] also introduced EA-GAN, which integrates attention mechanisms and vision transformers to effectively remove artifacts and hair interference from lesion areas. This approach substantially improves the quality of skin lesion image preprocessing, thereby enhancing the accuracy and robustness of subsequent diagnostic systems.
3. Methods
This paper proposes a novel skin lesion image segmentation model based on the U-Net architecture, incorporating concepts such as dense networks, bidirectional long short-term memory networks, and attention mechanisms. The model consists of two parts: the segmentation module and the discriminator module.
3.1. Segmenter
The proposed segmenter is based on the U-Net architecture and primarily consists of the encoder down-sampling path, the decoder up-sampling path, a hybrid multi-scale attention module, and an attention module based on skip connections, which link the corresponding layers of the two paths, as illustrated in Figure 1. The detailed description of the segmenter is as follows:
3.1.1. Encoder Module
Due to the multi-scale nature and variability of skin lesions, and inspired by the densely connected DenseNet architecture, we adopt a DenseNet-like [38] structure in the down-sampling path to enhance the flow of multi-scale information. This path consists of four dense blocks, each containing d dense layers. Each dense layer outputs the same number of feature-map channels, given by the growth rate k. Within the same dense block, the resolution of the feature maps remains unchanged, and the input of each dense layer is the concatenation of the outputs of all previous dense layers. As the network depth increases, features in the down-sampling path are reused, leading to a significant increase in memory usage. To address the increased memory consumption caused by dense blocks and to expand the receptive field, a transition layer is introduced after each dense block. This transition layer comprises batch normalization (BN) to stabilize training, a ReLU activation function to introduce non-linearity, a 1 × 1 convolutional layer to reduce the number of channels and computational complexity, and a 2 × 2 average pooling layer for spatial down-sampling of the feature maps, thereby reducing their spatial dimensions and enlarging the receptive field. This design effectively mitigates memory overhead while enhancing the model’s ability to capture multi-scale information. The overall computation of the dense convolutional module is expressed by the following equation:

$$x_{l} = H_{l}\left(\left[x_{0}, x_{1}, \ldots, x_{l-1}\right]\right) \tag{1}$$

In the equation, $H_{l}(\cdot)$ represents the non-linear operation consisting of Batch Normalization (BN), ReLU, and convolution, and $[x_{0}, x_{1}, \ldots, x_{l-1}]$ represents the concatenation of the feature maps output by layers 1 to $l-1$. The architecture of the dense convolutional module is illustrated in Figure 2.
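To make the encoder design concrete, the following is a minimal PyTorch sketch of a dense block and its transition layer as described above; the growth rate, layer count, and channel sizes are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN -> ReLU -> 3x3 Conv producing k (growth-rate) new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Concatenate the new features with all previous feature maps (Eq. (1)).
        return torch.cat([x, self.layer(x)], dim=1)

class DenseBlock(nn.Module):
    """d dense layers; each layer's input is the concatenation of all earlier outputs."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        layers = [DenseLayer(in_channels + i * growth_rate, growth_rate)
                  for i in range(num_layers)]
        self.block = nn.Sequential(*layers)
        self.out_channels = in_channels + num_layers * growth_rate

    def forward(self, x):
        return self.block(x)

class TransitionLayer(nn.Module):
    """BN -> ReLU -> 1x1 Conv (channel reduction) -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.layer(x)
```

Stacking four such DenseBlock/TransitionLayer pairs yields the down-sampling path, with each transition halving the spatial resolution.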
3.1.2. Attention Module
In medical image segmentation tasks, the target regions often exhibit irregular shapes, varying sizes, and blurred boundaries, which cause traditional skip connections to introduce redundant information during feature fusion, thereby degrading segmentation accuracy. To enhance the model’s ability to capture critical information, we propose an improved attention mechanism that generates spatial attention maps by jointly leveraging the attention signal (Attentional_Signal) and the local feature map (DenseBlock_Out). The attention module takes two inputs: the attention signal $G$ and the local feature map $X$. We define the local feature map as $X = \{x_{i}\}_{i=1}^{N}$, where $x_{i} \in \mathbb{R}^{C}$ represents the feature vector at the $i$-th pixel position of $X$, with $C$ channels. The attention signal is defined as $G = \{g_{j}\}_{j=1}^{M}$, where $g_{j} \in \mathbb{R}^{D}$ represents the feature vector at the $j$-th pixel position of $G$, with $D$ channels. First, we apply a 1 × 1 convolution, batch normalization, and ReLU activation to the attention signal $G$ to extract global contextual information and reduce dimensionality. Then, the result is up-sampled to align its spatial dimensions with those of $X$, yielding the channel-matched feature maps $G'$ and $X'$. Subsequently, $G'$ and $X'$ are added element-wise and passed through a ReLU activation, and the result is projected to a single channel and normalized with a sigmoid to generate the spatial attention map. Finally, the generated attention map is multiplied element-wise with the original input feature map to obtain the attention-modulated output Att_Output. The specific computation is shown in Equation (3):

$$\text{Att\_Output} = X \odot \sigma\left(\psi\left(\mathrm{ReLU}\left(W_{x} X + \mathrm{Up}\left(W_{g} G\right)\right)\right)\right) \tag{3}$$

In the equation, $W_{g}$ denotes a 1 × 1 convolution kernel with $c$ filters; $W_{x}$ represents a 3 × 3 convolution with a stride of 2; and $\psi$ refers to a 1 × 1 convolution kernel with a single filter. The up-sampling operation Up uses bilinear interpolation to match spatial dimensions, so the generated attention map shares the spatial resolution of the dense-module output DenseBlock_Out. The attention map is then multiplied element-wise with DenseBlock_Out to produce the final output of the attention module.

By performing weighted fusion of features across hierarchical levels, the attention mechanism effectively captures critical target information at multiple scales while suppressing background noise and irrelevant regions. This enhances the saliency of the target area and improves the network’s ability to focus on and localize the region of interest. The attention module is illustrated in Figure 3.
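The gating logic of Equation (3) can be sketched in PyTorch as follows; the module and argument names (AttentionGate, inter_channels) are our own, and the resolution handling assumes the bilinear up-sampling described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Gates the encoder feature map (DenseBlock_Out) with the attention signal."""
    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        # 1x1 conv + BN + ReLU on the attention signal (dimensionality reduction).
        self.W_g = nn.Sequential(
            nn.Conv2d(g_channels, inter_channels, kernel_size=1),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
        )
        # 3x3 stride-2 convolution on the local feature map.
        self.W_x = nn.Conv2d(x_channels, inter_channels,
                             kernel_size=3, stride=2, padding=1)
        # Single-filter 1x1 convolution producing the raw attention map.
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, x, g):
        g1 = self.W_g(g)
        x1 = self.W_x(x)
        # Bilinear up-sampling aligns the signal with the strided feature map.
        g1 = F.interpolate(g1, size=x1.shape[2:], mode="bilinear", align_corners=False)
        att = torch.sigmoid(self.psi(F.relu(x1 + g1)))
        # Restore the attention map to the resolution of x before gating.
        att = F.interpolate(att, size=x.shape[2:], mode="bilinear", align_corners=False)
        return x * att
```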
3.1.3. Channel Spatial Attention Enhancement Module
In our network architecture, the deepest feature map encapsulates global image information, and an attention mechanism is employed there to enhance salient spatial and channel-wise features. Specifically, we integrate the Convolutional Block Attention Module (CBAM) [39] and train it end-to-end alongside the base network to prioritize critical regions and discriminative features while suppressing irrelevant background noise. The core principle of this approach is to infer attention weights along two orthogonal dimensions (channel and spatial) and apply them to the original feature maps through multiplicative modulation, thereby enabling adaptive feature refinement. This dual-path attention mechanism allows the model to dynamically emphasize informative features while attenuating redundant or distracting elements, leading to a more efficient and discriminative feature representation.
The Convolutional Block Attention Module consists of two sequentially arranged submodules: channel attention and spatial attention. The channel attention mechanism focuses on identifying and emphasizing the most informative channels by adaptively recalibrating channel-wise feature weights, thereby enabling more effective extraction of global contextual information. Distinct from the standard CBAM, which employs global pooling operations, we propose a multi-scale pooling strategy by incorporating average pooling with kernel sizes of 2 × 2 and 4 × 4. This approach captures both local and semi-global statistical information, thereby enhancing the model’s capability to assess the importance of channel features across different spatial scales.
The detailed procedure of the channel attention module is illustrated in Figure 4. Given an input feature map $F$, spatial compression is first performed in parallel using average pooling with two different kernel sizes (2 × 2 and 4 × 4), global average pooling, and max pooling, respectively generating four feature descriptors $F_{1}$, $F_{2}$, $F_{3}$, and $F_{4}$, each with spatial dimensions smaller than those of the original feature map. Subsequently, these four descriptors are separately fed into a shared multi-layer perceptron (MLP) to ensure parameter efficiency. The outputs of the MLP are channel descriptor vectors with spatial dimensions 1 × 1 and channel dimension C. Finally, the four outputs are summed element-wise and passed through a sigmoid activation function, yielding the channel attention weight vector $M_{c}$ of size 1 × 1 × C. The detailed computational pipeline is shown in Figure 4, with the mathematical formulation provided in Equations (4)–(8):

$$F_{1} = P_{2\times2}(F) \tag{4}$$
$$F_{2} = P_{4\times4}(F) \tag{5}$$
$$F_{3} = P_{avg}(F) \tag{6}$$
$$F_{4} = P_{max}(F) \tag{7}$$
$$M_{c} = \sigma\left(\mathrm{MLP}(F_{1}) + \mathrm{MLP}(F_{2}) + \mathrm{MLP}(F_{3}) + \mathrm{MLP}(F_{4})\right) \tag{8}$$

In the equations, $\sigma$ denotes the sigmoid activation function; MLP denotes the shared multi-layer perceptron; $P_{avg}$ denotes global average pooling; $P_{2\times2}$ and $P_{4\times4}$ denote 2 × 2 and 4 × 4 average pooling, respectively; and $P_{max}$ denotes max pooling.
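A possible PyTorch realization of the multi-scale channel attention of Equations (4)–(8) is sketched below. Because the text does not specify how the 2 × 2 and 4 × 4 pooled maps are reduced to 1 × 1 descriptors, this sketch implements the shared MLP with 1 × 1 convolutions and spatially averages each MLP output; the pooling-to-grid interpretation and the reduction ratio of 16 are also assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttention(nn.Module):
    """Channel attention from 2x2 / 4x4 average pooling plus global avg/max pooling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP realised with 1x1 convolutions so it accepts any spatial size.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        pools = [
            nn.functional.adaptive_avg_pool2d(x, 2),  # semi-global 2x2 statistics (F1)
            nn.functional.adaptive_avg_pool2d(x, 4),  # semi-global 4x4 statistics (F2)
            nn.functional.adaptive_avg_pool2d(x, 1),  # global average pooling (F3)
            nn.functional.adaptive_max_pool2d(x, 1),  # global max pooling (F4)
        ]
        # Collapse each MLP output to a 1x1xC descriptor, then sum and squash (Eq. (8)).
        descriptors = [self.mlp(p).mean(dim=(2, 3), keepdim=True) for p in pools]
        m_c = torch.sigmoid(sum(descriptors))
        return x * m_c
```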
The spatial attention mechanism, in contrast, is designed to focus on salient regions within feature maps by computing pixel-wise importance weights, thereby directing the model’s attention towards critical spatial areas. The computational pipeline operates as follows. Given the channel-refined feature map $F'$, the mechanism first applies max pooling and average pooling along the channel dimension to generate two spatial descriptors, $F_{avg}^{s}$ and $F_{max}^{s}$, of size H × W × 1. These descriptors are concatenated along the channel axis to form a composite representation of dimension H × W × 2, which is then processed by a convolutional layer with a 7 × 7 kernel and stride of 1 to capture broad spatial contextual information. Finally, a sigmoid activation function generates the spatial attention weights $M_{s}$, where each element represents the relative importance of the corresponding spatial location. The complete formulation is provided in Equation (9):

$$M_{s} = \sigma\left(\mathrm{Conv}_{7\times7}\left(\left[F_{avg}^{s}; F_{max}^{s}\right]\right)\right) \tag{9}$$

In the equation, $\sigma$ denotes the sigmoid activation function, Conv represents the convolution operation, $F_{avg}^{s}$ denotes the average-pooled descriptor, and $F_{max}^{s}$ denotes the max-pooled descriptor. The complete computational procedure is illustrated in the spatial attention module of Figure 5.
The CBAM module’s operational pipeline follows a sequential attention mechanism, formalized as follows. The input feature map $F$ first undergoes channel-wise refinement through multiplication with the channel attention vector $M_{c}$, producing the intermediate enhanced representation $F_{1}$. This channel-attentive feature map then enters the spatial attention phase, where it is modulated by the spatial attention weights $M_{s}$ to yield $F_{2}$. The final output $F'$ is obtained through a residual summation that combines the original features with the doubly attended features, as formulated in Equations (10)–(12):

$$F_{1} = M_{c}(F) \odot F \tag{10}$$
$$F_{2} = M_{s}(F_{1}) \odot F_{1} \tag{11}$$
$$F' = F + F_{2} \tag{12}$$

In the formulation, $M_{c}$ denotes the channel attention map, $M_{s}$ represents the spatial attention map, and the operator ⊙ indicates element-wise multiplication (Hadamard product). The complete computational procedure is illustrated in Figure 6.
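The spatial branch and the residual combination of Equations (9)–(12) could then be wired as follows, reusing the MultiScaleChannelAttention sketch above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Pixel-wise attention from channel-wise average and max descriptors (Eq. (9))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x):
        avg_desc = x.mean(dim=1, keepdim=True)         # H x W x 1 average descriptor
        max_desc = x.max(dim=1, keepdim=True).values   # H x W x 1 max descriptor
        m_s = torch.sigmoid(self.conv(torch.cat([avg_desc, max_desc], dim=1)))
        return x * m_s

class ResidualCBAM(nn.Module):
    """Channel then spatial attention, with a residual sum (Eqs. (10)-(12))."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = MultiScaleChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, f):
        f1 = self.channel_att(f)   # F1 = Mc(F) ⊙ F
        f2 = self.spatial_att(f1)  # F2 = Ms(F1) ⊙ F1
        return f + f2              # F' = F + F2
```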
3.1.4. Bidirectional Convolutional LSTM Temporal Modeling Module
While CBAM enhances critical features through channel and spatial attention mechanisms, it remains limited in capturing long-range spatial dependencies. Moreover, conventional unidirectional LSTM networks can only exploit past information, making them inadequate for modeling bidirectional contextual relationships, particularly when dealing with complex structures or ambiguous boundaries in skin lesion regions. To address these limitations and further improve the model’s capacity for understanding and perceiving lesion areas, we introduce a Bidirectional Convolutional Long Short-Term Memory (BDC-LSTM) module following the attention mechanism. ConvLSTM integrates convolutional operations into the traditional LSTM architecture, enabling it to process spatially structured sequential data more effectively. Unlike fully connected LSTMs, ConvLSTM applies convolutions to the input, hidden states, and gating mechanisms, thereby capturing spatial features and local dependencies with greater precision. The core computations of ConvLSTM comprise the input gate $i_{t}$, forget gate $f_{t}$, output gate $o_{t}$, and the update of the cell state, all adapted to the spatial structure of image data. The corresponding formulas are given in Equations (13)–(17):

$$i_{t} = \sigma\left(W_{xi} * X_{t} + W_{hi} * H_{t-1} + b_{i}\right) \tag{13}$$
$$f_{t} = \sigma\left(W_{xf} * X_{t} + W_{hf} * H_{t-1} + b_{f}\right) \tag{14}$$
$$o_{t} = \sigma\left(W_{xo} * X_{t} + W_{ho} * H_{t-1} + b_{o}\right) \tag{15}$$
$$C_{t} = f_{t} \odot C_{t-1} + i_{t} \odot \tanh\left(W_{xc} * X_{t} + W_{hc} * H_{t-1} + b_{c}\right) \tag{16}$$
$$H_{t} = o_{t} \odot \tanh\left(C_{t}\right) \tag{17}$$

In the mathematical formulation, $W$ denotes the learnable convolutional weights, $b$ represents the bias terms, $X_{t}$ signifies the input tensor at time step $t$, $H_{t}$ corresponds to the hidden-state tensor, and $C_{t}$ indicates the cell-state tensor. The operator $*$ denotes the convolution operation, $\sigma$ refers to the sigmoid activation function, and ⊙ designates element-wise multiplication (Hadamard product).
Building upon this foundation, we integrate the BDC-LSTM module after the attention mechanism. On one hand, this strategy leverages the salient features enhanced by the attention mechanism as input, allowing the model to focus more effectively on key regions during spatiotemporal modeling. On the other hand, it compensates for the limitations of attention mechanisms in capturing long-range dependencies, thereby improving the model’s ability to recognize variations in lesion morphology and ambiguous boundaries. The BDC-LSTM module consists of two independent ConvLSTM branches: a forward branch that processes the feature sequence $(X_{1}, X_{2}, \ldots, X_{T})$ in chronological order, gradually accumulating information from past states, and a backward branch that traverses the sequence in reverse order $(X_{T}, X_{T-1}, \ldots, X_{1})$, modeling the potential influence of future states on the current state. At each time step, the hidden states from both branches are non-linearly transformed and fused using a hyperbolic tangent (tanh) activation function, resulting in a more comprehensive spatiotemporal context representation. This bidirectional structure enables BDC-LSTM to aggregate feature information from both temporal directions, effectively establishing long-range contextual dependencies and enhancing both the consistency of the feature representation and segmentation accuracy. The corresponding computation is defined in Equation (18):

$$Y_{t} = \tanh\left(W^{f} * H_{t}^{f} + W^{b} * H_{t}^{b} + b\right) \tag{18}$$

In the formulation, $Y_{t}$ denotes the output at the current time step; $W^{f}$ and $W^{b}$ represent the weight matrices of the forward and backward branches; $H_{t}^{f}$ and $H_{t}^{b}$ indicate the corresponding hidden states; and $b$ refers to the bias term. The detailed process is illustrated in Figure 7.
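A compact sketch of the ConvLSTM cell (Equations (13)–(17)) and the bidirectional wrapper (Equation (18)) is given below; fusing the two directions with a 1 × 1 convolution before the tanh is an implementation choice, and the input is assumed to be a list of feature maps.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: convolutions replace the matrix products of a standard LSTM."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution computes all four gate pre-activations (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)
        self.hidden_channels = hidden_channels

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # Eq. (16): cell-state update
        h = o * torch.tanh(c)          # Eq. (17): hidden-state update
        return h, c

class BDCLSTM(nn.Module):
    """Runs a forward and a backward ConvLSTM over a feature sequence and fuses them."""
    def __init__(self, in_channels, hidden_channels):
        super().__init__()
        self.fwd = ConvLSTMCell(in_channels, hidden_channels)
        self.bwd = ConvLSTMCell(in_channels, hidden_channels)
        self.fuse = nn.Conv2d(2 * hidden_channels, hidden_channels, kernel_size=1)

    def forward(self, seq):  # seq: list of (B, C, H, W) tensors
        b, _, h, w = seq[0].shape
        zeros = lambda: torch.zeros(b, self.fwd.hidden_channels, h, w,
                                    device=seq[0].device)
        hf, cf, hb, cb = zeros(), zeros(), zeros(), zeros()
        fwd_states, bwd_states = [], []
        for x in seq:                 # chronological pass
            hf, cf = self.fwd(x, (hf, cf))
            fwd_states.append(hf)
        for x in reversed(seq):       # reverse pass
            hb, cb = self.bwd(x, (hb, cb))
            bwd_states.append(hb)
        bwd_states.reverse()
        # Eq. (18): tanh fusion of the two directions at each step.
        return [torch.tanh(self.fuse(torch.cat([f, bk], dim=1)))
                for f, bk in zip(fwd_states, bwd_states)]
```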
3.1.5. Decoder Path
The up-sampling pathway adopts a symmetrical architecture utilizing transposed convolutions to facilitate hierarchical feature fusion. Each up-sampling module initially performs bilinear interpolation to double the spatial resolution, followed by a 3 × 3 convolutional layer equipped with batch normalization for effective feature transformation. Subsequently, a ReLU activation function is applied to introduce non-linearity. During the decoding process, the up-sampled feature maps are concatenated along the channel dimension with their corresponding encoder features through skip connections, enabling the seamless integration of high-level semantic information with low-level spatial details. The concatenated features are further refined via dense convolutional blocks, which promote feature reuse and enhance multi-scale representation capability. The decoder progressively reconstructs the spatial resolution through four successive up-sampling stages, each doubling the resolution, culminating in a 1 × 1 convolutional layer that projects the refined features onto the target classes for end-to-end pixel-wise prediction. This architectural design ensures precise segmentation performance while maintaining computational efficiency through systematic feature recombination.
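One decoder stage might look as follows in PyTorch, combining bilinear up-sampling, 3 × 3 convolution with BN and ReLU, skip-connection concatenation, and dense refinement; it reuses the DenseBlock sketch from Section 3.1.1, and all channel counts are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One decoder stage: bilinear x2 up-sampling, 3x3 Conv + BN + ReLU,
    skip-connection concatenation, then dense refinement of the fused features."""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # DenseBlock from the Section 3.1.1 sketch refines the concatenated features.
        self.refine = DenseBlock(out_channels + skip_channels,
                                 growth_rate=12, num_layers=2)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.reduce(x)
        x = torch.cat([x, skip], dim=1)  # fuse high-level and low-level features
        return self.refine(x)
```

Four such stages restore the full resolution, after which a 1 × 1 convolution projects the features onto the target classes.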
3.2. Discriminator
The discriminator employs an inverted-pyramid multi-scale feature extraction architecture with progressive spatial down-sampling for robust adversarial discrimination. The network first utilizes three parallel convolutional pathways with varying receptive fields to capture both local and global structural features, where each pathway independently processes the input through a $k \times k$ convolution, $F_{k} = \mathrm{Conv}_{k\times k}(X)$, for $k \in \{3, 5, 7\}$. The multi-scale features are then concatenated along the channel dimension to form an enriched representation. Subsequent stages employ strided convolutions for hierarchical feature compression: the second layer maintains dual-path 3 × 3 and 5 × 5 convolutions to preserve multi-scale processing, while the third and fourth layers utilize single 3 × 3 convolutions for efficient spatial reduction. All intermediate layers incorporate batch normalization and LeakyReLU activation to ensure training stability. The final discrimination head employs a 4 × 4 valid convolution coupled with a sigmoid activation. The final discrimination output is computed as in Equation (19):

$$D_{out} = \sigma\left(\mathrm{Conv}\left(W, F_{4}\right) + b\right) \tag{19}$$

In the formulation, $\sigma$ denotes the sigmoid activation function; $W$ represents the weights of the final convolutional kernel; Conv denotes a stride-1 convolution with no padding; $F_{4}$ represents the input feature map from the fourth layer; and $b$ is the bias term.
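The discriminator described above can be sketched as follows; the input is assumed to be the image concatenated with a segmentation map (four channels), the base channel width is a placeholder, and averaging the patch-wise sigmoid outputs into a single probability is our own simplification.

```python
import torch
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Inverted-pyramid discriminator: parallel 3/5/7 convolutions, strided
    compression stages, and a 4x4 valid convolution with sigmoid output."""
    def __init__(self, in_channels=4, base=32):
        super().__init__()
        # Three parallel pathways with receptive fields k = 3, 5, 7.
        self.paths = nn.ModuleList([
            nn.Conv2d(in_channels, base, k, padding=k // 2) for k in (3, 5, 7)
        ])
        def block(cin, cout, k):  # strided conv + BN + LeakyReLU
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, stride=2, padding=k // 2),
                nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True))
        self.stage2a = block(3 * base, base, 3)  # dual-path 3x3 ...
        self.stage2b = block(3 * base, base, 5)  # ... and 5x5 convolutions
        self.stage3 = block(2 * base, 2 * base, 3)
        self.stage4 = block(2 * base, 4 * base, 3)
        self.head = nn.Conv2d(4 * base, 1, kernel_size=4)  # 4x4 "valid" convolution

    def forward(self, x):
        x = torch.cat([p(x) for p in self.paths], dim=1)  # multi-scale concatenation
        x = torch.cat([self.stage2a(x), self.stage2b(x)], dim=1)
        x = self.stage4(self.stage3(x))
        # Eq. (19): sigmoid discrimination, averaged to one probability per sample.
        return torch.sigmoid(self.head(x)).mean(dim=(2, 3))
```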
3.3. Generative Adversarial Network
This paper proposes an adversarial training framework based on a Generative Adversarial Network (GAN) [40], designed to improve the segmentation performance of skin lesions in medical images. By integrating a multi-scale convolutional discriminator with a semantic segmentation network, the framework establishes an end-to-end adversarial optimization mechanism, enabling the generated segmentation maps to closely approximate the ground truth. The entire system is trained through an alternating optimization strategy between the generator and the discriminator, resulting in more stable and accurate segmentation outcomes. During training, the Adam optimizer is employed to optimize both the generator and the discriminator independently over 100 epochs, aiming to accelerate convergence while maintaining model stability. The optimizer parameters are configured as follows: an initial learning rate of 0.0001, a first-order momentum decay rate of 0.9, and a second-order momentum decay rate of 0.999. In addition, mixed-precision training is adopted to reduce memory consumption and enhance computational efficiency, while gradient clipping with a threshold of 1.0 is applied to prevent gradient explosion.
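The alternating optimization with the stated hyper-parameters (Adam, learning rate 0.0001, momentum decay rates 0.9/0.999, 100 epochs, mixed precision, gradient clipping at 1.0) might be organized as sketched below. `generator`, `discriminator`, `seg_loss`, and `loader` are assumed to be defined elsewhere; for numerical safety under autocast, the sketch also assumes the discriminator returns pre-sigmoid logits so that the sigmoid of Equation (19) can be folded into BCEWithLogitsLoss.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

bce = torch.nn.BCEWithLogitsLoss()  # sigmoid folded into the loss (autocast-safe)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.9, 0.999))
scaler_g, scaler_d = GradScaler(), GradScaler()

for epoch in range(100):
    for images, masks in loader:
        images, masks = images.cuda(), masks.cuda()

        # --- Discriminator step: real masks vs. generated segmentations ---
        opt_d.zero_grad()
        with autocast():  # mixed precision: lower memory, faster computation
            fake = generator(images).detach()
            d_real = discriminator(torch.cat([images, masks], dim=1))
            d_fake = discriminator(torch.cat([images, fake], dim=1))
            loss_d = bce(d_real, torch.ones_like(d_real)) + \
                     bce(d_fake, torch.zeros_like(d_fake))
        scaler_d.scale(loss_d).backward()
        scaler_d.unscale_(opt_d)
        torch.nn.utils.clip_grad_norm_(discriminator.parameters(), 1.0)  # clip at 1.0
        scaler_d.step(opt_d)
        scaler_d.update()

        # --- Generator step: segmentation loss plus adversarial loss ---
        opt_g.zero_grad()
        with autocast():
            pred = generator(images)
            d_pred = discriminator(torch.cat([images, pred], dim=1))
            loss_g = seg_loss(pred, masks) + bce(d_pred, torch.ones_like(d_pred))
        scaler_g.scale(loss_g).backward()
        scaler_g.unscale_(opt_g)
        torch.nn.utils.clip_grad_norm_(generator.parameters(), 1.0)
        scaler_g.step(opt_g)
        scaler_g.update()
```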
The discriminator is designed with a three-layer multi-scale feature extraction module. Each convolutional layer is followed by batch normalization and a LeakyReLU activation function, which enhances non-linear representation capability and stabilizes the training process. The final discrimination probability is obtained through a 4 × 4 valid convolution followed by a sigmoid activation function, indicating the likelihood that the input segmentation map is real or generated. The generator adopts a four-stage symmetric up-sampling architecture, where each stage consists of bilinear interpolation, a 3 × 3 convolution, batch normalization, and a ReLU activation function. This structure progressively restores the resolution of the feature maps to the original input size. To preserve spatial details and improve localization, skip connections are introduced during the decoding process, linking feature maps from corresponding encoder layers directly to the decoder, thereby enhancing multi-scale feature fusion and spatial precision.
To further improve the model’s generalization capability and adapt to different training phases, a dynamic learning rate adjustment strategy is implemented. When validation metrics (e.g., Dice coefficient or accuracy) show no improvement for five consecutive epochs, the learning rate is automatically reduced by a factor of 0.1, helping the model escape local minima and achieve better overall performance. The final segmentation output is generated via a sigmoid activation function, producing a probability map that aligns closely with the ground truth annotations.
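Continuing the training sketch above, this schedule maps directly onto PyTorch’s ReduceLROnPlateau; the validation routine `evaluate` and the per-epoch training call are assumed.

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Reduce the learning rate by a factor of 0.1 after five consecutive epochs
# without improvement in the monitored validation metric (here, Dice).
scheduler_g = ReduceLROnPlateau(opt_g, mode="max", factor=0.1, patience=5)
scheduler_d = ReduceLROnPlateau(opt_d, mode="max", factor=0.1, patience=5)

for epoch in range(100):
    train_one_epoch()               # assumed: one pass of the loop sketched above
    val_dice = evaluate(generator)  # assumed validation routine
    scheduler_g.step(val_dice)
    scheduler_d.step(val_dice)
```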
3.4. Loss Function
In the design of the loss function, we combine Dice Loss with Binary Cross-Entropy (BCE) Loss [41]. The Dice Loss effectively alleviates the class imbalance problem by measuring the overlap between the predicted results and the ground-truth masks, thereby promoting the model’s ability to learn more accurate boundary information. BCE Loss, as a pixel-wise loss, provides stable gradient information and enhances the model’s ability to recognize the target regions. The two losses and their combination can be expressed as Equations (20)–(22):

$$L_{Dice} = 1 - \frac{2\sum_{i} p_{i} g_{i} + \varepsilon}{\sum_{i} p_{i} + \sum_{i} g_{i} + \varepsilon} \tag{20}$$
$$L_{BCE} = -\frac{1}{N}\sum_{i}\left[g_{i}\log p_{i} + \left(1 - g_{i}\right)\log\left(1 - p_{i}\right)\right] \tag{21}$$
$$L_{total} = L_{Dice} + L_{BCE} \tag{22}$$

In the equations, $p_{i}$ denotes the predicted probability at pixel $i$, $g_{i}$ the corresponding ground-truth label, $N$ the number of pixels, and $\varepsilon$ a small smoothing constant.
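The combined objective of Equations (20)–(22) can be sketched as a single loss module; the smoothing constant and the equal weighting of the two terms are assumptions.

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    """Combined Dice + BCE loss (Eqs. (20)-(22)); equal weighting is assumed."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth
        self.bce = nn.BCELoss()

    def forward(self, pred, target):
        # pred: sigmoid probabilities in [0, 1]; target: binary ground-truth mask.
        p, g = pred.flatten(1), target.flatten(1)
        inter = (p * g).sum(dim=1)
        dice = (2 * inter + self.smooth) / (p.sum(dim=1) + g.sum(dim=1) + self.smooth)
        return (1 - dice).mean() + self.bce(pred, target)
```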
3.5. Experimental Details
3.5.1. Experimental Setup
The experiments for the proposed skin lesion image segmentation model were conducted using a deep learning framework based on Python 3.9. The operating system was Windows 10, and the GPU was an NVIDIA 3080 with 16 GB of memory (NVIDIA Corporation, Santa Clara, CA, USA).
3.5.2. Dataset and Preprocessing
In this paper, we utilized the PH2 dataset alongside the ISIC 2017 and ISIC 2018 datasets from the ISBI challenge for model training and validation. The PH2 dataset comprises 200 dermoscopic images, which were divided into 140 training samples, 20 validation samples, and 40 test samples. The ISIC 2017 dataset consists of 2000 original images with corresponding annotations, supplemented by 150 validation samples and 600 test samples. The ISIC 2018 dataset contains 2596 images and their annotations, supplemented by 200 validation samples and 596 test samples. The detailed composition of these datasets is summarized in Table 1. Collectively, these datasets cover three primary types of skin lesions (melanoma, seborrheic keratosis, and nevus), offering both strong representativeness and significant challenges for segmentation tasks. As illustrated in Figure 8, images from the ISIC 2017 test set show lesions on raw dermoscopic images; the presence of noise artifacts such as hair and bubbles is evident, which increases the complexity of lesion segmentation and poses challenges for practical applications.
The dataset utilized in this study exhibits significant challenges for deep learning-based feature extraction, including image noise, blurred boundaries, heterogeneous skin tones, hair occlusion, capillary interference, and diverse lesion morphologies. Given the limited size of publicly available dermatological datasets and the prohibitive cost of manual annotation, we implemented an extensive data augmentation pipeline incorporating both geometric transformations and photometric variations. This approach effectively expanded the training dataset while preserving pathological features, as validated through quantitative performance improvements in subsequent experiments.
To address the challenges posed by varying image sizes and significant resolution discrepancies in skin lesion datasets, this study implements a comprehensive data normalization and augmentation strategy; the specific details of the dataset are shown in Table 1. Initially, all lesion images are rescaled to a uniform resolution of 256 × 256 pixels to eliminate scale variability. Furthermore, to mitigate the limited sample sizes of publicly available datasets, a multidimensional data augmentation approach is employed, including geometric transformations such as rotation, multi-directional translation, scaling, and flipping. These operations enhance data diversity and significantly improve the model’s generalization capability while preserving the integrity of pathological features. The specific training data are presented in Table 2.
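A torchvision sketch of such a pipeline is shown below; the parameter ranges are illustrative, and in practice the same geometric transform must be applied jointly to each image and its segmentation mask.

```python
import torchvision.transforms as T

# Illustrative preprocessing/augmentation pipeline: resize to 256 x 256, then
# apply geometric transformations (rotation, translation, scaling, flipping).
train_transform = T.Compose([
    T.Resize((256, 256)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomAffine(degrees=30, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.ToTensor(),
])
```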
3.5.3. Evaluation Metrics
In this experiment, evaluation metrics commonly used in medical image segmentation are employed to assess the model’s performance, including Accuracy, Dice Similarity Coefficient (Dice), Specificity, Sensitivity (SE), and mean Intersection over Union (mIoU), calculated as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Dice = \frac{2TP}{2TP + FP + FN}$$
$$Specificity = \frac{TN}{TN + FP}$$
$$SE = \frac{TP}{TP + FN}$$
$$mIoU = \frac{1}{K}\sum_{k=1}^{K}\frac{TP_{k}}{TP_{k} + FP_{k} + FN_{k}}$$

In the formulation, TP denotes the number of positive samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, FN indicates the number of positive samples incorrectly predicted as negative, and TN refers to the number of negative samples correctly predicted as negative; K is the number of classes over which the IoU is averaged.
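These metrics follow directly from the confusion-matrix counts; a NumPy sketch for a single pair of binary masks is given below (mIoU is obtained by averaging the IoU over classes or images).

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute Accuracy, Dice, Specificity, Sensitivity, and IoU from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return {
        "Accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "Dice":        2 * tp / (2 * tp + fp + fn),
        "Specificity": tn / (tn + fp),
        "Sensitivity": tp / (tp + fn),
        "IoU":         tp / (tp + fp + fn),  # average over classes/images for mIoU
    }
```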
5. Discussion
To tackle the inherent challenges of medical image segmentation, including morphological variability and indistinct lesion boundaries, this study introduces MA-DenseUNet—a novel and task-specific segmentation framework designed to enhance lesion delineation in medical imaging. Specifically, a newly designed dense convolutional module is integrated into the encoder to strengthen feature representation and facilitate the extraction of high-level semantic information. The dense connectivity promotes efficient feature propagation across layers, mitigating issues such as gradient vanishing and feature redundancy.
To further improve the model’s attention to critical lesion regions, a multi-scale Convolutional Block Attention Module (CBAM) is incorporated. By jointly leveraging multi-scale channel and spatial attention, the model effectively concentrates on lesion-relevant features while suppressing background noise. Moreover, to address issues related to blurred lesion boundaries and unclear textures, a Bidirectional Convolutional Long Short-Term Memory (BDC-LSTM) module is embedded within the bottleneck layer. This enables the modeling of forward and backward spatial dependencies, enhancing the integration of local and global contextual information and improving segmentation in complex anatomical structures. Furthermore, a multi-scale Generative Adversarial Network (GAN) is introduced to refine segmentation results through adversarial learning. By employing discriminators at multiple scales, the model enforces fine-grained structural realism in the predicted segmentation maps, particularly at lesion boundaries. This adversarial strategy enhances the generator’s capacity to reconstruct detailed lesion contours, contributing to more stable and robust segmentation outcomes, especially under challenging background conditions.
To evaluate the robustness and generalizability of the proposed method, extensive comparisons are conducted with several state-of-the-art models published in recent years. As shown in Table 3, on the PH2 dataset, MA-DenseUNet achieves improvements over LCAUnet of 0.6% in Dice coefficient, 0.42% in specificity, and 1.5% in sensitivity, along with a 2.4% increase in mean Intersection over Union (mIoU). These results confirm the model’s superior capability in accurately localizing skin lesions, particularly by improving recall without compromising precision. Additionally, as illustrated in Table 4 and Table 5, MA-DenseUNet consistently outperforms contemporary approaches on the ISIC datasets across all evaluation metrics. The model demonstrates strong adaptability in segmenting various lesion types, especially those characterized by irregular shapes, fuzzy boundaries, and low contrast. These findings substantiate the effectiveness and practical utility of the proposed framework in both controlled experimental settings and real-world clinical scenarios.
6. Conclusions
This study proposes a novel convolutional neural network based on the U-Net architecture, termed MA-DenseUNet, which integrates dense convolutional blocks, a multi-scale attention mechanism, and bidirectional LSTM units to enhance the accuracy and robustness of medical image segmentation tasks. Experimental results demonstrate that each of the proposed components contributes significantly to overall performance improvement. The dense convolutional blocks strengthen the representation of deep semantic features and effectively mitigate gradient vanishing and feature redundancy. To address the considerable variation in lesion sizes—ranging from large regions occupying most of the image to small lesions comprising less than one-tenth of the area—a multi-scale attention mechanism is incorporated to guide the model’s focus toward critical lesion regions while suppressing background noise. Furthermore, the integration of the BDC-LSTM module enhances the modeling of spatial contextual information, showing superior structural awareness and regional consistency, particularly in cases of blurred boundaries and unclear textures. In addition, a multi-scale generative adversarial network (GAN) is employed for joint training, leveraging multi-scale discriminators to refine segmentation details and compensate for the local structure insensitivity of traditional loss functions.
Comprehensive experimental evaluations reveal that MA-DenseUNet achieves outstanding performance across multiple benchmark medical image segmentation datasets, significantly outperforming existing state-of-the-art models, thereby validating its strong segmentation capability and generalizability. Nevertheless, the model remains relatively complex in structure and demands substantial computational resources. Its performance on extremely small lesions or low-contrast images still presents challenges. Future work will focus on incorporating lightweight architectural designs to reduce model complexity, and exploring self-supervised and few-shot learning strategies to further improve the model’s applicability and clinical deployment potential.