FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation

Chen, Lifang; Yu, Entao; Cao, Qihang; Hu, Ke

doi:10.3390/info16121102

Open AccessArticle

FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation

¹

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China

²

Engineering Research Center of Intelligent Technology for Healthcare, Ministry of Education, Wuxi 214122, China

^*

Author to whom correspondence should be addressed.

Information 2025, 16(12), 1102; https://doi.org/10.3390/info16121102 (registering DOI)

Submission received: 6 November 2025 / Revised: 9 December 2025 / Accepted: 11 December 2025 / Published: 13 December 2025

Download

Browse Figures

Versions Notes

Abstract

Accurate segmentation of lesion regions is essential for skin cancer diagnosis. As dermoscopic images of skin lesions demonstrate different sizes, diverse shapes, fuzzy boundaries, and so on, accurate segmentation still faces great challenges. To address these issues, we propose a new dermatologic image segmentation network, FFM-Net. In FFM-Net, we design a new FM block encoder based on state space models (SSMs), which integrates a low-frequency information extraction module (LEM) and an edge detail extraction module (EEM) to extract broader overall structural information and more accurate edge detail information, respectively. At the same time, we dynamically adjust the input channel ratios of the two module branches at different stages of our network, so that the model can learn the correlation relationship between the overall structure and edge detail features more effectively. Furthermore, we designed the cross-channel spatial attention (CCSA) module to improve the model’s sensitivity to channel and spatial dimensions. We deploy a multi-level feature fusion module (MFFM) at the bottleneck layer to aggregate rich multi-scale contextual representations. Finally, we conducted extensive experiments on three publicly available skin lesion segmentation datasets, ISIC2017, ISIC2018, and PH2, and the experimental results show that the FFM-Net model outperforms most existing skin lesion segmentation methods.

Keywords:

low-frequency information; edge information; multi-level feature fusion; U-Net; Mamba

1. Introduction

In recent years, skin cancer has become one of the most prevalent and serious diseases affecting human health worldwide. According to incomplete statistics, although the mortality rate of malignant skin cancer remains high, early diagnosis can reduce mortality in 95% of patients, which can improve the survival rate by an average of 5 years [1]. Skin lesion image segmentation is crucial in medical diagnosis, as reliable medical image segmentation methods can help medical professionals quickly grasp the shape and detailed features of a patient’s lesion area, thereby laying the foundation for more accurate diagnostic and treatment decisions. However, accurately segmenting skin lesions remains challenging, with complex difficulties such as blurred skin lesion boundaries, irregular textured lesions, and lesion shapes of varying sizes, as shown in Figure 1. These complexities make it difficult to distinguish them from normal skin tissues, and advanced segmentation techniques are needed to accurately identify and differentiate between various skin lesions.

Ronneberger et al. [2] proposed the CNN-based U-Net architecture, which has emerged as one of the most extensively utilized networks for medical image segmentation tasks and maintains dominant status in the field of skin lesion segmentation. Zhou et al. introduced U-Net++ [3], which utilizes dense skip connections between networks to resolve semantic differences between encoder and decoder paths, thus improving segmentation accuracy. Ruan et al. [4] proposed an EGE-Unet network that integrates a multi-axis attention module to extract pathological information from different angles, which effectively enhances the recognition of irregularly shaped lesion regions and ensures the accuracy of segmentation. CMUNeXt [5] is a lightweight skin lesion segmentation network based on large kernel convolution and inverted bottleneck design, which ensures segmentation accuracy while achieving lightweighting. Xu et al. [6] proposed a DCSAU-Net network for melanoma segmentation, which effectively fuses low-level and high-level semantic information and enhances the network feature extraction by adopting a multipath approach with a different number of convolutions and a channel attention mechanism. Although CNN-based models possess excellent feature representation capabilities, they lack the ability to learn global information and long-term relationships due to the limitation of convolutional kernel size. Transformers [7] employ multi-head self-attention mechanisms to capture long-range dependencies, enabling efficient global modeling and focusing on key regions of the image. TransUnet [8] is a pioneer in medical image segmentation based on the Transformer model, integrating the Transformer into the final encoder layer of the widely used U-Net framework to explore its potential in medical image segmentation. TransAttUnet [9] is a kind of attention-guided network based on the Transformer, which incorporates multi-level attention guidance and multi-scale skip connection methods, thereby enhancing the network’s capability to segment dermoscopic images of various shapes and sizes. Missformer [10] is an effective and powerful medical image segmentation network that improves the performance of the model by redesigning the feedforward network to take full advantage of the multi-scale features generated from hierarchical Transformers. Li et al. [11] proposed a medical image segmentation MA-UNet network based on U-Net, which designed two levels of encoders including rough ordinary extraction and multi-scale fine extraction to enhance the model feature extraction. CMLCNet [12] introduces a capsule encoder to learn the relationship between parts and the whole in medical images, which helps the network to extract more local details and context information and reduces the problem of information loss caused by pooling in the process of downsampling. However, Transformer’s quadratic complexity also imposes a high computational burden, and the trade-off between global context modeling and computational efficiency remains unresolved.

Recently, state space models (SSMs) [13] have attracted the interest of many researchers. Based on the traditional SSM, the modern SSM (Mamba [14]) achieves a remarkable breakthrough by introducing an efficient selective scanning mechanism. This approach not only establishes long-range dependencies but also achieves five times higher throughput than Transformers. Moreover, its computational complexity scales linearly with input size, substantially improving computational efficiency. Ma et al. [15] proposed a hybrid medical image segmentation model called U-Mamba based on SSMs. This model combines the local feature extraction capability of CNNs with SSM’s ability to capture long-range dependencies, effectively enhancing medical image segmentation capabilities while addressing the inherent issues of locality or computational complexity in both CNNs and Transformers. Vmamba [16] employs a cross-scan module (SS2D) to process image blocks. It utilizes a quad-directional scanning strategy to achieve scanning in four directions, effectively capturing global features while reducing computational complexity. In addition, many researchers have applied Mamba to explore its potential in dermatoscopy image segmentation. Ruan et al. [17] proposed a medical image segmentation model VM-UNet based on SSM, which captures a wide range of context information through visual state space blocks. MambaULite [18] is a lightweight network model that efficiently segments skin lesion images with variable shapes by dynamically adjusting the fusion strategy of local and global features. Tang et al. [19] proposed the RM-Unet network, which innovatively integrates the residual visual state space ResVSS block and the rotating SSM module to mitigate the network degradation problem caused by the reduced efficiency of information transmission from the shallow to the deep layers. Zou et al. [20] introduced SkinMamba, a hybrid Mamba–CNN architecture for dermatologic image analysis. The model enhances segmentation precision by incorporating a frequency-boundary guided module at the bottleneck layer while leveraging retained information to assist the decoder during reconstruction.

In summary, although segmentation models based on CNNs, Transformers, and Mamba have made significant progress in medical imaging, CNNs struggle with limited local modeling capabilities, making it difficult to handle blurred boundaries and capture global structures. Transformers face challenges in effectively balancing complexity and accuracy. Existing Mamba models, constrained by fixed feature scanning methods, have yet to fully realize their potential. Furthermore, skin lesion segmentation continues to face challenges such as blurred lesion boundaries, varying lesion sizes, and information loss due to subsampling.

We propose FFM-Net, a novel architecture for lesion segmentation to extract more extensive feature information. In medical image segmentation, low-frequency information corresponds to regions with gradual image changes, characterizing the overall structural features of skin lesions. Edge detail information is associated with regions undergoing rapid and abrupt changes, delineating the sharp boundary transitions between lesions and normal skin. The FFM-Net effectively addresses skin lesion segmentation in images with blurred boundaries by dynamically adjusting the channel ratio between overall structure and edge details, and further improves the segmentation accuracy by leveraging the long-distance capture capability of the Mamba model. Given the higher image resolution and richer detail information in the shallow layer of the network, increasing the proportion of overall structural information channels helps preserve overall structural information. In the deeper layers of the network, with the decrease in image resolution and the abstraction of feature information, the proportion of edge information can be appropriately increased to capture more edge detail features. The low-frequency information extraction module (LEM) enhances overall structural feature extraction while suppressing interference from texture irregularities, while the edge detail extraction module (EEM) captures fine edge details. We design the cross-channel spatial attention module (CCSA) to effectively narrow the semantic gap between the encoder and decoder feature maps and enhance the sensitivity to the channel spatial dimension. In addition, we apply a multi-level feature fusion module (MFFM) at the bottleneck layer to fuse different levels of information, producing a feature map as an input to the first layer decoder. This ensures effective segmentation of lesions of varying sizes.

The main contributions of this study are as follows:

We propose the FFM-Net model for the segmentation of skin lesions. FFM-Net uses a parallel two-branches architecture to extract the overall structure and edge detail information respectively. In addition, it dynamically adjusts channel-wise ratios between the two information streams at different network depths to adequately extract the information in the image.
We propose the CCSA module to enhance the model’s sensitivity to both channel and spatial dimensions, reducing the semantic gap between the encoder and decoder feature maps.
We propose the MFFM, which can effectively fuse multi-scale information to enhance the ability to learn feature representation.

2. Related Work

2.1. Segmentation Based on Mamba

Mamba [14] is an emerging selective state space model that achieves linear computational complexity while maintaining long-range dependency modeling capabilities, and has shown excellent performance in a variety of vision tasks. Inspired by VMamba [16], many image segmentation studies have adopted VSS blocks derived from VMamba as the basic components of their models. Zhu et al. [21] designed an encoder that combines Mamba with ResNet, utilizing discrete cosine transform to analyze features from various spectral angles, thereby effectively integrating local and global features. Wu et al. [22] proposed the SK-VM++ network, a novel skip connection architecture based on Mamba, which effectively reduces computational complexity while realizing the fusion of high and low feature information. VM-UNetV2 [23] combines the VSS block with the feature fusion mechanism of UNetV2 [24]. The architecture employs VSS blocks as encoders to capture extensive contextual information while introducing the semantics and detail infusion mechanism that effectively combines low-level and high-level features, thereby improving segmentation performance. Chen et al. [25] introduced a triple attention module based on Mamba to achieve selective global modeling of image features, thereby improving the overall performance of the segmentation process. UltraLightVM-Unet [26] adopts a parallel visual mamba layer method to process image features, which achieves excellent performance with minimal computational overhead and parameter size. In addition, LightM-UNet [27] and H-vmunet [28] further explore the potential of SSM to reduce computational burden and achieve lightweighting.

2.2. Low-Frequency Information Extraction

In recent years, the research on the frequency domain has attracted increasing attention in the field of deep learning [29,30]. TinyViM [31] introduces a novel Laplace mixer to decouple the frequency features and improves the performance of global modeling by inputting low-frequency components into Mamba. FMamba [32] introduced a learnable parameter matrix as a global filter, and coordinated the work by combining the VSS module and the frequency feature enhancement module to effectively enhance the global feature extraction capability while reducing the computational complexity. Zhang et al. [33] combined the Mamba feature extraction branch and the Gaussian filter frequency domain feature extraction branch to carry out the fusion processing of feature information layer by layer and improve the accuracy of model segmentation. FBI-Net [34] is a three-encoder network architecture which takes RGB images and high/low DCT-filtered images as inputs, the network can learn feature information more comprehensively to further improve the segmentation performance of the network. Han et al. [35] improved the discrete Fourier transform (DFT) by introducing low-frequency information to realize multi-channel data expansion for each image, thus enriching the diversity of input data. The WET-Unet network [36] integrates the wavelet transform into the U-Net and utilizes the low-frequency component to adjust the encoder, which enhances the overall structural information of the lesion and improves the accuracy and robustness of image segmentation. Tang et al. [37] verified that the use of filters in U-Net networks can reduce the number of parameters while achieving better segmentation performance. As shown in Figure 2, we take an image from the ISIC2018 dataset as an example. By gradually increasing the size of the truncation frequency radius, it can be observed that the whole of the image becomes more and more structurally clear as the truncation frequency radius expands.

2.3. Edge Information Extraction

Edge detection is an important problem in the field of image processing [38,39]. The main purpose of edge detection is to identify regions of an image where there is a sharp change in gray level or structure, and extract the contour features of the target object. In early edge detection methods, classical conventional operators (Robert, Prewitt, Sobel [40]) used differential information to represent mutations and details of the edge context. In recent years, researchers have begun to explore methods that combine traditional edge detection operators with modern convolutional neural networks, which utilize the powerful feature extraction capabilities of CNNs to achieve impressive performance. The central series difference convolution (CDC) [41] is a novel operator that incorporates the central difference step into the standard convolution process. Specifically, it subtracts the pixel value of the central point from the pixel values of its surrounding nine pixels and then aggregates nine new values obtained by computing the dot product with the convolution kernel weights to produce the final output value. The cross-central difference convolution (CCDC) [42] performs differential operations on all neighboring features in the neighborhood and diagonally, reducing the redundancy and complexity incurred of the computation. The pixel difference convolution (PDC) [43] employs a unique strategy to encode pixel relationships in different directions, achieving more accurate edge detection. These models are more streamlined and easier to learn than other methods, and these innovations are very enlightening.

3. Method

3.1. Algorithm Overview

The overall architecture of FFM-Net is shown in Figure 3, which consists of an encoder, bottleneck layer, and decoder. The encoder consists of four stages, an FM encoder block, and a downsampling layer. After passing through each encoder stage, the input features are mapped into a deeper feature space, while the width and height of the image are halved and the number of channels is doubled. Theoretically, the network’s shallow image has higher resolution and richer detail information, while the network’s deep image has lower resolution and mainly retains the overall structure and higher-level semantic information. In the FM block encoder, by dividing the channel ratio between LEM and EEM and performing weighted fusion on their outputs, the fused features undergo further feature extraction through the selective scanning (SS2D) module. This enables the encoder to extract feature information more comprehensively at each layer. We propose the CCSA module for the efficient fusion of channel features and spatial features from encoders and decoders. In addition, we design the MFFM at the bottleneck layer of the network, which can effectively enhance the ability of the network to learn feature representations by fusing shallow detailed features with deeper high-level semantic features.

The decoder consists of three stages; each decoder layer includes a decoder block and an upsampling layer, and the decoder block consists of a 3 × 3 residual block convolution, batch normalization, and a ReLU activation function. The final layer of the decoder is the output module, which integrates feature mapping and restores the resolution to the input resolution H×W× Class, and outputs pixel-level segmentation predictions.

3.2. Edge Detail Extraction Module (EEM)

The edge detection operator extracts edge detail information by calculating the gradient of the image, which involves calculating the difference between neighboring pixel values. As shown in Figure 4, we combine the Sobel operator with differential convolution to obtain two types of convolutional instances: Sobel X-Convolution and Sobel Y-Convolution. The specific process for these two new convolutional instances involves selecting six pixels horizontally and vertically within the feature map. The difference between these six pixels at corresponding positions is then computed according to the Sobel operator’s calculation method. These six pixel differences are multiplied element-by-element with the kernel weights, and finally the summation operation outputs the values in the feature map.

We computed the difference between each neighboring pixel in the horizontal and vertical directions to obtain a convolution kernel with gradient weights, which was summarized to obtain the final output. The method of construction is as follows:

\begin{matrix} X & = (x_{1} - x_{3}) \cdot w_{1} + (x_{3} - x_{1}) \cdot w_{3} + 2 (x_{4} - x_{6}) \cdot w_{4} + 2 (x_{6} - x_{4}) \cdot w_{6} \\ + (x_{7} - x_{9}) \cdot w_{7} + (x_{9} - x_{7}) \cdot w_{9} \end{matrix}

(1)

\begin{matrix} Y & = (x_{1} - x_{7}) \cdot w_{1} + 2 (x_{2} - x_{8}) \cdot w_{2} + (x_{3} - x_{9}) \cdot w_{3} + (x_{7} - x_{1}) \cdot w_{7} \\ + 2 (x_{8} - x_{2}) \cdot w_{8} + (x_{9} - x_{3}) \cdot w_{9} \end{matrix}

(2)

where

x_{i}

is the selected pixel position,

w_{i}

is the weights of the convolution kernel, and X and Y denote the gradient values obtained by differential convolution in the horizontal and vertical directions, respectively.

It is well known from Equations

(1)

and

(2)

that computing the pixel difference before each convolution operation raises the computational cost and memory utilization. To solve this problem, we transform the pixel differences into standard convolutions with specific weighted differences, which not only accomplishes the same results as before but also simplifies the computational process. The calculation of Sobel X-convolution is taken as an example, and the conversion process is as follows:

\begin{matrix} X & = (x_{1} - x_{3}) \cdot w_{1} + (x_{3} - x_{1}) \cdot w_{3} + 2 (x_{4} - x_{6}) \cdot w_{4} + 2 (x_{6} - x_{4}) \cdot w_{6} \\ + (x_{7} - x_{9}) \cdot w_{7} + (x_{9} - x_{7}) \cdot w_{9} \\ = (w_{1} - w_{3}) \cdot x_{1} + (w_{3} - w_{1}) \cdot x_{3} + 2 (w_{4} - w_{6}) \cdot x_{4} + 2 (w_{6} - w_{4}) \cdot x_{6} \\ + (w_{7} - w_{9}) \cdot x_{7} + (w_{9} - w_{7}) \cdot x_{9} \end{matrix}

(3)

where

x_{i}

is the selected pixel position,

w_{i}

is the weight of the convolution kernel, and X denotes the gradient value obtained by differential convolution in the horizontal direction.

We further designed the edge detail extraction module (EEM), which integrates the gradient information into the convolution operation so that the model can focus on the edge features of the lesion when learning the features. As shown in Figure 5, we process the input feature map

I_{i n}

with 3 × 3 depthwise convolution to obtain the lesion texture feature map

I_{d i_{n}}

. On the other hand, the feature map

I_{i n}

extracts the horizontal and vertical edge features in Sobel X-Convolution and Sobel Y-Convolution, respectively, and

I_{d i_{n}}

and

I_{X Y}

are added to obtain the enhanced feature map

I_{e n}

. Then, the feature information is further extracted by 3 × 3 depthwise convolution and the final output

I_{o u t}

.

The edge detail extraction module (EEM) is represented as follows:

\begin{matrix} I_{d i n} = f_{d c 3} (I_{i n}) \end{matrix}

(4)

\begin{matrix} I_{X Y} = \sqrt{{(f_{s o b e l X} (I_{i n}))}^{2} + {(f_{s o b e l Y} (I_{i n}))}^{2}} \end{matrix}

(5)

\begin{matrix} I_{e n} = I_{d i n} + I_{X Y} \end{matrix}

(6)

\begin{matrix} I_{o u t} = f_{r} (f_{b n} (f_{d c 3} (I_{e n}))) \end{matrix}

(7)

where

f_{d c 3}

denotes 3 × 3 depth convolution,

f_{b n}

denotes batch normalization,

f_{r}

denotes ReLU activation function, and

f_{s o b e l X}

and

f_{s o b e l Y}

are horizontal and vertical gradient difference convolution, respectively.

3.3. Low-Frequency Information Extraction Module (LEM)

Low-frequency information is crucial in medical segmentation because it contains information about the location relationship and overall structure of the lesion. TinyViM network [31] uses the Laplace mixer to decouple the frequency features and inputs the low-frequency components into Mamba, which enhances Mamba’s global modeling capabilities. However, the TinyViM method is more complicated and suffers from the problem of information loss. We use the low-pass filter to extract features, which allows the transmission of low-frequency information and filters out high-frequency information. It can avoid complex mixing operations and simplifies the processing flow while improving the extraction accuracy of low-frequency information. Specifically, in the frequency branch, we use the discrete Fourier transform (DFT) to convert the input feature map from the spatial domain to the frequency domain.

After the DFT operation, the low-frequency components are initially located at the four corners of the frequency domain map, and then subsequently shifted to the center of the frequency domain map to obtain the central spectral map. Then, we set the appropriate truncation frequency to let the low-frequency components pass and hinder the passage of the high-frequency components. Finally, the frequency domain feature map is converted back to the spatial domain using the inverse discrete Fourier transform (IDFT).

As shown in Figure 6, we propose an adaptive adjustment method for the low-frequency region in 2D-DFT. We establish a rectangular coordinate system at the center of the spectrogram

X_{m n}

and set its center point as

(0, 0)

, and set

(m, n)

as an arbitrary point in the coordinate system, where

m \in [- H / 2, H / 2]

,

n \in [- H / 2, H / 2]

. r is defined as the distance from

(m, n)

to

(0, 0)

. It is noteworthy that the maximum radius of the circle in the spectrogram is

max (H / 2, W / 2)

. The formula for adaptively acquiring the low frequency

X_{m n}^{l}

in the spectrum can be expressed as:

\begin{matrix} r = k \times max (\frac{H}{2}, \frac{W}{2}) \end{matrix}

(8)

\begin{matrix} X_{m n} = \{\begin{matrix} X_{m n}^{l} & , \sqrt{m^{2} + n^{2}} < r \\ 0 & , others \end{matrix} \end{matrix}

(9)

where k is a suitable truncation frequency parameter

(0 < k < 1)

. DFT and IDFT denote the Fourier transform and inverse Fourier transform, respectively, while DFTshift and IDFTshift denote the Fourier shift operation and the Fourier inverse shift operation, respectively.

The 2D-DFT and 2D-IDFT formulas are expressed as follows:

F (u, v) = \sum_{h = 0}^{H - 1} \sum_{w = 0}^{W - 1} f (h, w) e^{- j 2 π (\frac{u h}{H} + \frac{v w}{W})}

(10)

f (h, w) = \frac{1}{H W} \sum_{u = 0}^{H - 1} \sum_{v = 0}^{W - 1} F (u, v) e^{j 2 π (\frac{u h}{H} + \frac{v w}{W})}

(11)

where u and v correspond to the vertical and horizontal frequency components, respectively, h and w denote the vertical and horizontal spatial domain position indices, and H and W represent the height and width of the input signal. j is the imaginary unit (

j^{2} = - 1

).

F (u, v)

represents the result of transforming the spatial domain signal

f (h, w)

into the frequency domain via the discrete Fourier transform (2D-DFT), while

f (h, w)

denotes the result of converting the frequency domain signal

F (u, v)

back into the spatial domain through the inverse discrete Fourier transform (2D-IDFT).

3.4. Cross-Channel Spatial Attention Module (CCSA)

Although Mamba [14] has made significant progress in dermatoscopic image segmentation [17,20,22], these methods mainly focus on capturing spatial dependencies, ignoring the critical role of channel information in image segmentation. We focus on the interrelationships between the feature maps of the encoder and decoder in both the channel and spatial dimensions, rather than the characteristic analysis of a single feature map. We propose the cross-channel spatial attention module. Through the cross-channel spatial attention module, the encoder in the same layer and the decoder in the previous layer can learn the importance of each channel feature and emphasize a meaningful feature selection in the spatial map to locate the critical structures and narrow the semantic gap between the encoder and decoder feature graphs.

The cross-channel spatial attention module is illustrated in Figure 7. It consists of two branches: channel attention and spatial attention. The process of CCSA can be represented as follows:

\begin{matrix} f_{C A M} = f_{s k} \times f_{s i g} (f_{3 c} (d_{a p o o l} (f_{u p})) + f_{3 c} (d_{a p o o l} (f_{s k}))) \end{matrix}

(12)

\begin{matrix} f_{s k}^{'} = f_{r} (f_{b n} (f_{1 c} (f_{s k}))) \end{matrix}

(13)

\begin{matrix} f_{u p}^{'} = f_{r} (f_{b n} (f_{1 c} (f_{u p}))) \end{matrix}

(14)

\begin{matrix} f_{S A M} = f_{u p} \times f_{s i g} (f_{b n} (f_{1 c} (f_{s k}^{'} + f_{u p}^{'}))) \end{matrix}

(15)

where

f_{C A M}

and

f_{S A M}

represent the channel attention weight map and the spatial attention weight map, respectively.

f_{s k}

and

f_{u p}

are the input feature maps from the jump connection and upsampling, respectively.

f_{3 c}

denotes 3 × 3 convolution,

f_{1 c}

denotes 1 × 1 convolution,

f_{b n}

denotes batch normalization, and

f_{r}

denotes ReLU activation function.

f_{s i g}

denotes the sigmoid function, and

d_{a p o o l}

represents average pooling.

3.5. Multi-Level Feature Fusion Module (MFFM)

The MFFM is a multi-level feature fusion module that can receive feature maps at four different resolutions. In order to capture the feature representation of different levels of images, we fuse different levels of image information by pooling and convolution operations. Theoretically, shallow layers are more concerned with localized detailed information, while deep layers are more concerned with high-level semantic information. The fusion of shallow and deep features can enrich the semantic and detailed features of network inputs, and effectively enhance the ability of network feature learning representation. As shown in Figure 8, where

f_{i} (i = 1, 2, 3, 4)

represents the features of each layer in the network encoder, respectively. We use the Max-pooling operation to reduce the image size of each layer to the same size as the size of the fourth layer

f_{4}

, thereby maximizing the retention of salient edge information within the features. After the pooled four layers of features

f_{i}^{'} (i = 1, 2, 3, 4)

are spliced together, we use 1 × 1 convolution to adjust the number of channels, and then the feature information of the fused four sizes is further feature-extracted using 3 × 3 convolution. Finally, the different levels of feature information are fused together as input to the first layer decoder.

4. Experiments

4.1. Dataset and Evaluation Matrix

This study conducted systematic experiments on three widely recognized public dermatology datasets: ISIC2017 [44], ISIC2018 [45], and PH2 [46]. The ISIC2018 and ISIC2017 datasets comprise 2594 and 2150 high-resolution dermatoscopic images, respectively, covering various skin lesion types, including melanoma. They serve as common benchmarks for evaluating skin lesion segmentation performance. PH2 consists of 200 skin lesion images, which are suitable for algorithm validation in small-sample scenarios. To ensure consistency and fairness in experimental evaluation, all three datasets were randomly split into training and validation sets at an 8:2 ratio. Additionally, in order to reduce the network load and speed up the network training process, we resized all the images of the three datasets to 256 × 256 pixels. To ensure fairness and rigor in experimental comparisons, all comparison methods were retrained using publicly available code under uniform dataset partitioning and hardware environments. To evaluate the state-of-the-art (SOTA) deep learning methods and our proposed FFM-Net network, we conducted experiments using several standard evaluation metrics, including Mean Intersection over Union (mIoU), Dice Similarity Coefficient (DSC), Accuracy (ACC), Specificity (Spe), and Sensitivity (Sen). The mathematical formulas for these indicators are summarized below:

\begin{matrix} mIoU = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{T P}{T P + F P + F N} \end{matrix}

(16)

\begin{matrix} DSC = \frac{2 T P}{2 T P + F P + F N} \end{matrix}

(17)

\begin{matrix} ACC = \frac{T P + T N}{T P + T N + F P + F N} \end{matrix}

(18)

\begin{matrix} Spe = \frac{T N}{T N + F P} \end{matrix}

(19)

\begin{matrix} Sen = \frac{T P}{T P + F N} \end{matrix}

(20)

where TP, FP, FN, and TN in the formula represent true positives, false positives, false negatives, and true negatives, respectively.

k + 1

is the number of categories including background.

4.2. Implementation Details

We implemented the FFM-Net model in PyTorch 2.7.1. Using an NVIDIA RTX 4090 GPU for training. The initial input image size is 256 × 256 pixels with 3 channels. After processing by the first encoder layer, the image size becomes 64 × 64 pixels with 96 channels. In subsequent layers, the image size is halved while the number of channels doubles. Our loss function design adopts the approach from VM-UNet [17], employing a weighted sum of BCE loss and Dice loss, with both coefficients set to 0.5. The AdamW optimizer is used with an initial learning rate of 0.0001. The training period is 100 epochs and the batch size is set equal to 16. We employ random flipping and 0–360 degree random rotation as data augmentation techniques for preprocessing, with each operation applied with a 50% probability.

4.3. Comparison with Other Network Results

Table 1, Table 2 and Table 3 show the experimental results of FFM-Net on the ISIC2018, ISIC2017, and PH2 datasets, respectively, and compare its performance with other mainstream segmentation models based on CNN, Transformer, and Mamba architectures. Specifically, on the three datasets, FFM-Net achieves a significant improvement in segmentation accuracy compared to the traditional CNN network architectures U-Net [2], UNet++ [3], and EGE-Unet [4]. Additionally, we compared FFM-Net with networks based on Transformer and Mamba architectures, including MiSSFormer [10], TransAttUnet [9], VM-UNet [17], and MambaULite [18]. The results show that FFM-Net outperforms these competing methods in mIoU and DSC metrics. FFM-Net achieved DSC values of 0.9173, 0.9103, and 0.9542 on the three datasets, respectively. These results fully validate the accuracy and robustness of FFM-Net for skin lesion segmentation tasks. Specifically, on the ISIC2018 dataset, FFM-Net achieved a 2.23% improvement in DSC and a 0.75% improvement in ACC compared to MA-Unet [11]. Although its specificity (Spe) was slightly lower than other methods, FFM-Net achieved the best performance across the remaining four metrics. Compared to MA-Unet [11] on three datasets, our FFM-Net method achieved improvements of 3.74%, 3.15%, and 2.34% in mIoU metrics, respectively. This enhancement stems from our model’s ability to accurately extract edge information for precise segmentation through the EEM. The ISIC2018 dataset contains skin images with blurred edges, posing challenges for accurate segmentation. SkinMamb [20] only incorporates a frequency-guided boundary module at the final encoder layer, resulting in limited extraction of boundary information. While this partially mitigates boundary blurring and information loss caused by downsampling, high-frequency information is prone to noise contamination, compromising segmentation accuracy. Theoretically, images exhibit higher resolution and richer details at shallow layers, while deeper layers feature lower resolution and abstract semantic information. Our FFM-Net approach employs dynamic channel adjustment at each layer, increasing the proportion of LEM to preserve the overall morphology of lesions while avoiding interference from redundant details, and appropriately increasing the proportion of EEM to precisely capture blurred boundaries and compensate for edge information decay. This dynamic adjustment mechanism enables each layer to balance learning overall structure and edge details, achieving effective feature information extraction. Therefore, on the ISIC2017 dataset, our FFM-Net method achieves improvements of 0.29% and 0.18% over SkinMamba [20] in mIoU and DSC metrics, respectively. On the three datasets, the mIoU values of FFM-Net are 0.8473, 0.8353, and 0.9124, respectively, which are higher than other networks based on Transformer and Mamba models. These encouraging experimental results highlight the accuracy and robustness of FFM-Net for skin lesion segmentation.

Figure 9 shows the segmentation results of these networks on the ISIC2018 dataset, visualizing the segmentation capabilities of each network. From the visualization results in Figure 9, it can be seen that our segmentation results have significantly improved in terms of completeness and accuracy compared to other networks, achieving better results in the accuracy of shape segmentation. The third and fourth lines of the figure show that MiSSFormer [10] and VM-UNet [17] rely on the global feature modeling capabilities of Transformer and Mamba respectively, and have improved the accuracy of lesion contour segmentation, but are still insufficient in segmentation accuracy of fuzzy edge regions. The FFM-Net is able to accurately identify the main shapes of lesions that are very light in color and difficult to distinguish from normal skin, and can accurately capture some of the detailed features of the edges of the lesions, as shown in the area circled in red. In addition, based on the images provided in the first, third, and fifth lines, the design of FFM-Net incorporates the multi-level feature fusion module, and our model has the ability to capture features at different scales, which enables accurate segmentation, and the segmentation results are closer to Ground Truth.

4.4. Ablation Study

4.4.1. Suitable Radius of Truncation Frequency

The function of the low-frequency information extraction module (LEM) is to extract the overall structural information from the frequency feature map using a low-pass filter. To explore the appropriate truncation frequency, we take H and W at the first layer of the encoder as examples and perform ablation experiments on the PH2 dataset. The truncation frequency radius sizes are set to 52, 48, 24, 12, and 6, as shown in Figure 10.

The experimental results show that the mIoU (%) metric of FFM-Net increases from 90.92 (truncation frequency is 24) to 91.24 (truncation frequency is 48), the DSC (%) metric increases from 95.24 (truncation frequency is 24) to 95.42 (truncation frequency is 48), and the ACC (%) metric increases from 96.85 (truncation frequency is 24) to 96.99 (truncation frequency is 48). However, as the truncation frequency increases further, the performance of FFM-Net starts to decrease. Therefore, it can be concluded that the optimal k value for the LEM module in our model design is 0.75.

4.4.2. The Suitable Channel Ratio Division Between LEM and EEM

Theoretically, the shallow layer of the network has higher resolution and richer edge details, while the deep layer of the network has lower resolution and richer overall structure and high-level semantic information. We adjust the model using proportional allocation of channels to avoid information redundancy and can focus on learning different feature information at different stages. We used the PH2 dataset to explore the appropriate channel ratio division between the LEM module and the EEM in the four stages of the encoder. The proportion combinations are designed based on the pattern of feature emphasis across different layers of the network, covering the core gradient range from 0.1 to 0.9 with increments or decrements of 0.1. This approach simultaneously avoids redundant combinations to prevent redundancy. In order to reasonably explore the appropriate channel ratio, the CCSA module and the MFFM are not added to the network model. As shown in Table 4, the experimental results indicate that the appropriate channel division ratios of the model from the first to the fourth layer are (0.7, 0.3), (0.7, 0.3), (0.4, 0.6), and (0.3, 0.7), respectively.

4.4.3. Effect of the EEM Module

In order to verify that the EEM can capture the edge detail features, we conducted ablation experiments on the ISIC2018 dataset and visualized the segmentation results, as shown in Figure 11. To better demonstrate the differences in lesion segmentation results, we use red lines to label the segmentation results. NEEMnet indicates the network in the encoder that does not use the EEM, while FFM-Net contains the EEM. A comparison of the segmentation results in the red-boxed area shows that NEEMnet cannot accurately segment lesions with blurred boundaries, while FFM-Net can. The results of the two experiments are shown in Table 5. The experimental results show that FFM-Net with EEM is 2.24%, 1.33% and 0.59% higher than NEEMnet without EEM in terms of mIoU, DSC and ACC values, respectively, which can effectively improve the segmentation accuracy of lesions.

4.4.4. Effect of the LEM Module

The LEM module efficiently captures overall structural information. As shown in Figure 12, we conducted ablation experiments on the ISIC-2018 dataset. Meanwhile, in order to better illustrate the differences among the lesion segmentation results, we used red lines to highlight the segmentation results. The network without the LEM module in the encoder is denoted as NLEMnet. As can be seen from the red box area, the segmentation performance of NLEMnet is inferior to that of FFM-Net. Because NLEnet only focuses on the change in the edge detail information and lacks the ability to capture global structural information, which is susceptible to interference from other elements such as hairs, color variations, and so on. The results of the two experiments are shown in Table 6. In the mIoU, DSC and ACC metrics, FFM-Net is 0.95%, 0.56%, and 0.21% higher than NLEMnet, respectively.

4.4.5. Validity of Modules Within the FM Block Encoder

To further validate the effectiveness of the LEM, EEM, and SS2D components within the FM block encoder of our FFM-Net model, we designed component replacement experiments using standard convolutions of equivalent complexity. The results are shown in Table 7. EEMConv-Net replaces the EEM model with standard convolutions. LEMConv-Net denotes replacing the LEM module with an equivalent convolutional module without performing low-frequency information filtering operations. SS2DConv-Net replaces the SS2D operation with standard convolutions. As shown in Table 7, compared to EEMConv-Net, FFM-Net achieves improvements of 0.79%, 0.47%, and 14% in mIoU, DSC, and ACC, respectively. Simultaneously, the visualization results in Figure 13 demonstrate that FFM-Net achieves more precise edges in segmented lesion regions. This indicates that FFM-Net, incorporating the EEM, can more effectively capture edge details of skin lesions, thereby enhancing the accuracy of the segmentation task. Furthermore, under identical standard convolution replacement strategies, FFM-Net demonstrates superior performance compared to both LEMConv-Net and SS2DConv-Net. The segmentation comparison in Figure 13 provides further intuitive validation of the critical importance of the LEM module’s ability to capture low-frequency information for overall structural details and the SS2D module’s feature enhancement capability in improving skin lesion segmentation performance.

4.4.6. Effect of FFM-Net Architecture Modifications

We conducted a series of ablation experiments on the ISIC2018 dataset to evaluate the contribution of each module to our model, and the results of the experiments are shown in Table 7. The baseline adopts the pure Mamba structure as the encoder, and the MFFM and CCSA module are not added to the network. The decoding layer is a basic network architecture composed of 3 × 3 convolutional residual blocks and upsampling. LEEM denotes the integration of LEM and EEM modules in the four layers of the encoder in a proportional channel allocation of (0.7, 0.3), (0.7, 0.3), (0.4, 0.6), and (0.3, 0.7), respectively. CCSA denotes cross-channel spatial attention module. MFFM denotes the multi-level feature fusion module.

In Table 8, it can be seen that after adding the MFFM to the bottleneck layer, the scores of mIoU, DSC, and ACC increase from 0.8224, 0.9025, and 0.9571 to 0.8300, 0.9071, and 0.9596, respectively, especially the mIoU metric improves by 0.76%. The results show that the multi-level feature fusion module can effectively fuse different levels of information and improve the segmentation performance.

After adding the CCSA module to the baseline, we can see a significant improvement in model performance compared to the baseline, with mIoU, DSC, and ACC metrics improving by 0.84%, 0.51%, and 0.27%, respectively. Mamba mainly focuses on capturing spatial dependencies. Comparing the experimental results of the baseline and the CCSA module, it can be found that the addition of the CCSA module enhances the network’s ability to pay attention to and process the channel features, thus realizing the efficient fusion of the channel and the spatial features, and improving the performance of the network.

Compared with the baseline, after adding the LEEM module, the mIoU, DSC, and ACC metrics have increased by 1.78%, 1.06%, and 0.45%, respectively. Furthermore, the combination of BaseLine+MFFM+LEEM further clarifies the synergistic effects of the modules. Adding the LEEM module to the BaseLine + MFFM framework improved mIoU, DSC, and ACC by 1.52%, 0.9%, and 0.34%, respectively. This result validates that the holistic structural and edge contour information extracted by the LEEM module complements the multi-scale information extracted by the MFFM, effectively enhancing the representational capability of features. The experimental results show that the model can effectively extract the overall structural information and edge detail information by appropriate channel ratio division, which is a great improvement to the baseline network.

4.5. Algorithmic Efficiency

In this section, we analyze the efficiency of FFM-Net and compare it with other methods. We evaluate algorithm efficiency through the number of parameters (Param) and the number of floating-point operations (FLOPS), with the relevant results summarized in Table 9. Our FFM-Net features 36.30 M parameters and performs 4.00 G floating-point operations, outperforming most comparison methods. Compared to traditional architectures like U-Net [2], UNet++ [3], and TransAttUnet [9], FFM-Net exhibits fewer parameters and faster computational efficiency. While FFM-Net has slightly more parameters than Mamba-based VM-Unet [17] and SkinMamba [20], it requires fewer floating-point operations. This stems from the LEM module in FFM-Net, which simplifies features through filtering operations and reduces redundant computations, thereby lowering overall FLOPS. The newly introduced EEM and MFFM modules introduce some additional parameter overhead. While our network may not be as lightweight as MambaULite [18] or EGE-UNet [4], it remains suitable for most computational platforms.

5. Conclusions

We propose a novel FFM-Net model based on state space models (SSMs). We design LEM and EEM to extract the overall structural information and edge detail information. Meanwhile, in order to further improve the network performance, we set an appropriate channel ratio in each layer. In addition, we use the cross-channel spatial attention module to focus on important channel and spatial information. The multi-level feature fusion module correlates shallow feature information with deeper high-level semantic information, realizing multi-scale feature mapping and further improving the segmentation capability of the model. Experimental results demonstrate that our model achieves state-of-the-art performance across the ISIC2017, ISIC2018, and PH2 public datasets. Particularly in skin lesion segmentation scenarios characterized by blurred boundaries and irregular textures, the model achieves mIoU scores of 0.8353, 0.8473, and 0.9124, respectively. Compared to other methods, FFM-Net demonstrates superior segmentation accuracy and visual quality. However, the model still has room for further optimization. Future research will focus on enhancing the model’s adaptability and generalization capabilities, particularly in its application to other medical image segmentation tasks. Concurrently, we will further explore methods to streamline the network architecture, aiming to strike a balance between efficiency and accuracy.

Author Contributions

Conceptualization, L.C. and E.Y.; methodology, L.C. and E.Y.; software, E.Y.; validation, E.Y., Q.C., and K.H.; formal analysis, E.Y., Q.C., and K.H.; investigation, E.Y., Q.C., and K.H.; resources, L.C.; data curation, E.Y.; writing—original draft preparation, E.Y.; writing—review and editing, L.C.; visualization, E.Y.; supervision, L.C.; project administration, E.Y.; funding acquisition, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study conducted systematic experiments on three widely recognized public dermatology datasets: ISIC2017, ISIC2018, and PH2.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Balch, C.M.; Gershenwald, J.E.; Soong, S.-j.; Thompson, J.F.; Atkins, M.B.; Byrd, D.R.; Buzaid, A.C.; Cochran, A.J.; Coit, D.G.; Ding, S.; et al. Final version of 2009 AJCC melanoma staging and classification. J. Clin. Oncol. 2009, 27, 6199–6206. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Munich, Germany, 2015; pp. 234–241. [Google Scholar]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4; Springer: Granada, Spain, 2018; pp. 3–11. [Google Scholar]
Ruan, J.; Xie, M.; Gao, J.; Liu, T.; Fu, Y. Ege-unet: An efficient group enhanced unet for skin lesion segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer: Cham, Switzerland, 2023; pp. 481–490. [Google Scholar]
Tang, F.; Ding, J.; Quan, Q.; Wang, L.; Ning, C.; Zhou, S.K. Cmunext: An efficient medical image segmentation network based on large kernel and skip fusion. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), San Francisco, CA, USA, 1–4 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
Xu, Q.; Ma, Z.; He, N.; Duan, W. DCSAU-Net: A deeper and more compact split-attention U-Net for medical image segmentation. Comput. Biol. Med. 2023, 154, 106626. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
Chen, B.; Liu, Y.; Zhang, Z.; Lu, G.; Kong, A.W.K. Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 8, 55–68. [Google Scholar] [CrossRef]
Huang, X.; Deng, Z.; Li, D.; Yuan, X. Missformer: An effective medical image segmentation transformer. arXiv 2021, arXiv:2109.07162. [Google Scholar] [CrossRef] [PubMed]
Li, H.; Ren, Z.; Zhu, G.; Liang, Y.; Cui, H.; Wang, C.; Wang, J. Enhancing medical image segmentation with MA-UNet: A multi-scale attention framework. Vis. Comput. 2025, 41, 6103–6120. [Google Scholar] [CrossRef]
Qin, C.; Wang, Y.; Zhang, J. CMLCNet: Medical image segmentation network based on convolution capsule encoder and multi-scale local co-occurrence. Multimed. Syst. 2024, 30, 220. [Google Scholar] [CrossRef]
Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2024; Volume 37, pp. 103031–103063. [Google Scholar]
Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
Nguyen, T.-N.-Q.; Ho, Q.-H.; Nguyen, D.-T.; Le, H.-M.-Q.; Pham, V.-T.; Tran, T.-T. MambaU-Lite: A Lightweight Model based on Mamba and Integrated Channel-Spatial Attention for Skin Lesion Segmentation. arXiv 2024, arXiv:2412.01405. [Google Scholar]
Tang, H.; Huang, G.; Cheng, L.; Yuan, X.; Tao, Q.; Chen, X.; Zhong, G.; Yang, X. RM-UNet: UNet-like Mamba with rotational SSM module for medical image segmentation. Signal Image Video Process. 2024, 18, 8427–8443. [Google Scholar] [CrossRef]
Zou, S.; Zhang, M.; Fan, B.; Zhou, Z.; Zou, X. SkinMamba: A Precision Skin Lesion Segmentation Architecture with Cross-Scale Global State Modeling and Frequency Boundary Guidance. arXiv 2024, arXiv:2409.10890. [Google Scholar]
Zhu, X.; Wang, W.; Zhang, C.; Wang, H. Polyp-Mamba: A Hybrid Multi-Frequency Perception Gated Selection Network for polyp segmentation. Inf. Fusion 2025, 115, 102759. [Google Scholar] [CrossRef]
Wu, R.; Pan, L.; Liang, P.; Chang, Q.; Wang, X.; Fang, W. SK-VM++: Mamba assists skip-connections for medical image segmentation. Biomed. Signal Process. Control 2025, 105, 107646. [Google Scholar] [CrossRef]
Zhang, M.; Yu, Y.; Jin, S.; Gu, L.; Ling, T.; Tao, X. VM-UNET-V2: Rethinking vision mamba UNet for medical image segmentation. In Proceedings of the International Symposium on Bioinformatics Research and Applications, Kunming, China, 19–21 July 2024; Springer: Cham, Switzerland, 2024; pp. 335–346. [Google Scholar]
Peng, Y.; Sonka, M.; Chen, D.Z. U-net v2: Rethinking the skip connections of u-net for medical image segmentation. arXiv 2023, arXiv:2311.17791. [Google Scholar]
Chen, Q.; Li, J.; Fang, X. Dual triple attention guided CNN-VMamba for medical image segmentation. Multimed. Syst. 2024, 30, 275. [Google Scholar] [CrossRef]
Wu, R.; Liu, Y.; Liang, P.; Chang, Q. Ultralight vm-unet: Parallel vision mamba significantly reduces parameters for skin lesion segmentation. arXiv 2024, arXiv:2403.20035. [Google Scholar] [CrossRef]
Liao, W.; Zhu, Y.; Wang, X.; Pan, C.; Wang, Y.; Ma, L. Lightm-unet: Mamba assists in lightweight unet for medical image segmentation. arXiv 2024, arXiv:2403.05246. [Google Scholar]
Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order vision mamba unet for medical image segmentation. Neurocomputing 2025, 624, 129447. [Google Scholar] [CrossRef]
Yang, C.; Zhang, Z. Pfd-net: Pyramid fourier deformable network for medical image segmentation. Comput. Biol. Med. 2024, 172, 108302. [Google Scholar] [CrossRef]
Shan, L.; Li, X.; Wang, W. Decouple the high-frequency and low-frequency information of images for semantic segmentation. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1805–1809. [Google Scholar]
Ma, X.; Ni, Z.; Chen, X. TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba. arXiv 2024, arXiv:2411.17473. [Google Scholar]
Liu, S.; Lin, Y.; Liu, D.; Wang, P.; Zhou, B.; Si, F. Frequency-Enhanced Lightweight Vision Mamba Network for Medical Image Segmentation. IEEE Trans. Instrum. Meas. 2025; in press. [Google Scholar]
Zhang, Z.; Peng, B.; Zhao, T. An ultra-lightweight network combining Mamba and frequency-domain feature extraction for pavement tiny-crack segmentation. Expert Syst. Appl. 2025, 264, 125941. [Google Scholar] [CrossRef]
Gu, A.-R.; Nam, J.-H.; Lee, S.-C. FBI-Net: Frequency-based image forgery localization via multitask learning with self-attention. IEEE Access 2022, 10, 62751–62762. [Google Scholar] [CrossRef]
Han, Q.; Wang, H.; Hou, M.; Weng, T.; Pei, Y.; Li, Z.; Chen, G.; Tian, Y.; Qiu, Z. HWA-SegNet: Multi-channel skin lesion image segmentation network with hierarchical analysis and weight adjustment. Comput. Biol. Med. 2023, 152, 106343. [Google Scholar] [CrossRef]
Zeng, Y.; Li, J.; Zhao, Z.; Liang, W.; Zeng, P.; Shen, S.; Zhang, K.; Shen, C. WET-UNet: Wavelet integrated efficient transformer networks for nasopharyngeal carcinoma tumor segmentation. Sci. Prog. 2024, 107, 00368504241232537. [Google Scholar] [CrossRef]
Tang, S.; Ran, H.; Yang, S.; Wang, Z.; Li, W.; Li, H.; Meng, Z. A frequency selection network for medical image segmentation. Heliyon 2024, 10, e35698. [Google Scholar] [CrossRef] [PubMed]
Muthukrishnan, R.; Radha, M. Edge detection techniques for image segmentation. Int. J. Comput. Sci. Inf. Technol. 2011, 3, 259. [Google Scholar] [CrossRef]
Bertasius, G.; Shi, J.; Torresani, L. Semantic segmentation with boundary neural fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3602–3610. [Google Scholar]
Chaple, G.N.; Daruwala, R.D.; Gofane, M.S. Comparisons of Robert, Prewitt, Sobel operator based edge detection methods for real time uses on FPGA. In Proceedings of the 2015 International Conference on Technologies for Sustainable Development (ICTSD), Mumbai, India, 12–13 February 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–4. [Google Scholar]
Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE/CVF: Piscataway, NJ, USA, 2020; pp. 5295–5305. [Google Scholar]
Yu, Z.; Qin, Y.; Zhao, H.; Li, X.; Zhao, G. Dual-cross central difference network for face anti-spoofing. arXiv 2021, arXiv:2105.01290. [Google Scholar] [CrossRef]
Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietik"ainen, M.; Liu, L. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE/CVF: Piscataway, NJ, USA, 2021; pp. 5117–5127. [Google Scholar]
Gutman, D.; Codella, N.C.F.; Celebi, E.; Helba, B.; Marchetti, M.; Mishra, N.; Halpern, A. Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC). arXiv 2016, arXiv:1605.01397. [Google Scholar] [CrossRef]
Codella, N.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.; et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). arXiv 2019, arXiv:1902.03368. [Google Scholar] [CrossRef]
Mendonça, T.; Ferreira, P.M.; Marques, J.S.; Marcal, A.R.S.; Rozeira, J. PH 2-A dermoscopic image database for research and benchmarking. In Proceedings of the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 3–7 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 5437–5440. [Google Scholar]
Azad, R.; Heidari, M.; Wu, Y.; Merhof, D. Contextual attention network: Transformer meets u-net. In Proceedings of the International Workshop on Machine Learning in Medical Imaging (MLMI), Held in Conjunction with MICCAI 2022, Singapore, 18 September 2022; Springer: Cham, Switzerland, 2022; pp. 377–386. [Google Scholar]

Figure 1. Some representative images from the ISIC2018 dataset.

Figure 2. The effect of low-frequency features with gradually increasing filter sizes in an image, where r represents the radius of the filter.

Figure 3. Overview of the proposed FFM-Net architecture.

Figure 4. Convolution of pixel differences in Sobel X–/Y–Convolution.

Figure 5. Edge detail extraction module.

Figure 6. Low-pass filter process for 2D-DFT and 2D-IDFT.

Figure 7. Cross-channel spatial attention module.

Figure 8. Multi-level feature fusion module.

Figure 9. Comparison of segmentation results between the proposed FFM-Net and the SOTA networks on the ISIC2018 dataset. We highlight the key regions with the appropriate red boxes.

Figure 10. Performance of LEM with different truncation frequencies.

Figure 11. Visual comparison between the FFM-Net and NEEMnet on the ISIC2018 dataset. We highlight the key regions with the appropriate red boxes.

Figure 12. Visual comparison between the FFM-Net and NLEMnet on the ISIC2018 dataset. We highlight the key regions with the appropriate red boxes.

Figure 13. Visual comparison of replacement experiments for FM encoder components on the ISIC2018 dataset. We highlight the key regions with the appropriate red boxes.

Table 1. Validation results on the ISIC2018 comparison with SOTA methods.

Networks	Year	mloU	DSC	ACC	Spe	Sen
U-Net [2]	2015	0.7688	0.8693	0.9434	0.9715	0.8451
UNet++ [3]	2018	0.8107	0.8955	0.9548	0.9793	0.8693
MiSSFormer [10]	2021	0.8139	0.8974	0.9555	0.9790	0.8736
TMU-net [47]	2022	0.8135	0.8971	0.9555	0.9792	0.8724
EGE-UNet [4]	2023	0.8299	0.9070	0.9590	0.9768	0.8971
TransAttUnet [9]	2023	0.8071	0.8933	0.9546	0.9835	0.8536
VM-UNet [17]	2024	0.8136	0.8972	0.9554	0.9790	0.8732
MambaULite [18]	2024	0.8413	0.9138	0.9625	0.9829	0.8913
SkinMamba [20]	2024	0.8451	0.9160	0.9633	0.9821	0.8979
MA-Unet [11]	2025	0.8099	0.8950	0.9558	0.9871	0.8463
Ours(FFM-Net)	2025	0.8473	0.9173	0.9633	0.9779	0.9126

Table 2. Validation results on the ISIC2017 comparison with SOTA methods.

Networks	Year	mloU	DSC	ACC	Spe	Sen
U-Net [2]	2015	0.7110	0.8311	0.9418	0.9786	0.7784
UNet++ [3]	2018	0.7268	0.8418	0.9460	0.9830	0.7817
MiSSFormer [10]	2021	0.7883	0.8816	0.9576	0.9801	0.8578
TMU-net [47]	2022	0.7743	0.8728	0.9558	0.9855	0.8239
EGE-UNet [4]	2023	0.8341	0.9096	0.9673	0.9837	0.8945
TransAttUnet [9]	2023	0.8127	0.8967	0.9636	0.9874	0.8581
VM-UNet [17]	2024	0.8109	0.8955	0.9623	0.9816	0.8771
MambaULite [18]	2024	0.8230	0.9029	0.9651	0.9839	0.8817
SkinMamba [20]	2024	0.8324	0.9085	0.9673	0.9865	0.8823
MA-Unet [11]	2025	0.8038	0.8912	0.9623	0.9902	0.8387
Ours(FFM-Net)	2025	0.8353	0.9103	0.9674	0.9830	0.8982

Table 3. Validation results on the PH2 comparison with SOTA methods.

Networks	Year	mloU	DSC	ACC	Spe	Sen
U-Net [2]	2015	0.8108	0.8955	0.9342	0.9690	0.8625
UNet++ [3]	2018	0.8236	0.9032	0.9387	0.9692	0.8758
MiSSFormer [10]	2021	0.8366	0.9110	0.9399	0.9394	0.9410
TMU-net [47]	2022	0.8856	0.9393	0.9616	0.9868	0.9097
EGE-UNet [4]	2023	0.8885	0.9409	0.9614	0.9720	0.9396
TransAttUnet [9]	2023	0.8787	0.9354	0.9590	0.9831	0.9093
VM-UNet [17]	2024	0.8580	0.9236	0.9519	0.9826	0.8887
MambaULite [18]	2024	0.8839	0.9383	0.9607	0.9833	0.9141
SkinMamba [20]	2024	0.9014	0.9481	0.9666	0.9821	0.9347
MA-Unet [11]	2025	0.8890	0.9412	0.9626	0.9849	0.9165
Ours(FFM-Net)	2025	0.9124	0.9542	0.9699	0.9753	0.9589

Note: The bolded values in Table 1, Table 2 and Table 3 represent the optimal results for each metric on the ISIC2018, ISIC2017, and PH2 datasets, respectively.

Table 4. Impact of different channel scale divisions of the encoder. The scores are the performance on the PH2 dataset.

Encoder Layer	LEM:EEM	mIoU	DSC	ACC
	0.9:0.1	0.8998	0.9472	0.9658
	0.8:0.2	0.9015	0.9482	0.9662
First layer	0.7:0.3	0.9045	0.9498	0.9674
	0.6:0.4	0.9041	0.9496	0.9671
	0.5:0.5	0.9011	0.9480	0.9663
	0.9:0.1	0.9034	0.9492	0.9670
	0.8:0.2	0.9045	0.9498	0.9675
Second layer	0.7:0.3	0.9051	0.9502	0.9677
	0.6:0.4	0.9046	0.9499	0.9675
	0.5:0.5	0.9045	0.9498	0.9674
	0.5:0.5	0.9051	0.9502	0.9677
	0.4:0.6	0.9066	0.9510	0.9683
Third layer	0.3:0.7	0.9056	0.9504	0.9676
	0.2:0.8	0.9051	0.9502	0.9676
	0.1:0.9	0.9041	0.9496	0.9674
	0.5:0.5	0.9066	0.9510	0.9683
	0.4:0.6	0.9071	0.9513	0.9681
Fourth layer	0.3:0.7	0.9078	0.9516	0.9680
	0.2:0.8	0.9071	0.9512	0.9684
	0.1:0.9	0.9064	0.9509	0.9678

Note: The bolded values in the table represent the optimal metric values for each layer on the PH2 dataset.

Table 5. Performance comparison of the FFM-Net against the NEEMnet on the ISIC2018 dataset.

Networks	mIoU	DSC	ACC
NEEMnet	0.8249	0.9040	0.9574
FFM-Net	0.8473	0.9173	0.9633

Table 6. Performance comparison of the FFM-Net against the NLEMnet on the ISIC2018 dataset.

Networks	mIoU	DSC	ACC
NLEMnet	0.8378	0.9117	0.9612
FFM-Net	0.8473	0.9173	0.9633

Table 7. Effectiveness of FM block encoder components on the ISIC2018 dataset.

Networks	mIoU	DSC	ACC
EEMConv-Net	0.8394	0.9126	0.9619
LEMConv-Net	0.8411	0.9137	0.9621
SS2DConv-Net	0.8338	0.9094	0.9596
FFM-Net	0.8473	0.9173	0.9633

Table 8. Ablation studies of different components of FFM-Net on the ISIC2018 dataset.

BaseLine	MFFM	CCSA	LEEM	mIoU	DSC	ACC
✔				0.8224	0.9025	0.9571
✔	✔			0.8300	0.9071	0.9596
✔		✔		0.8308	0.9076	0.9598
✔			✔	0.8402	0.9131	0.9616
✔	✔		✔	0.8452	0.9161	0.9630
✔		✔	✔	0.8452	0.9161	0.9632
✔	✔	✔	✔	0.8473	0.9173	0.9633

Table 9. Algorithm complexity of our proposed FFM-Net and other methods.

Networks	Year	FLOPS (G)	Param (M)
U-Net [2]	2015	50.71	32.08
UNet++ [3]	2018	138.66	36.62
MiSSFormer [10]	2021	7.25	42.46
TMU-net [47]	2022	98.31	165.16
EGE-UNet [4]	2023	0.28	0.05
TransAttUnet [9]	2023	88.78	25.96
VM-UNet [17]	2024	4.11	27.42
MambaLite [18]	2024	0.93	0.41
SkinMamba [20]	2024	5.09	14.08
MA-Unet [11]	2025	192.7	68.53
Ours(FFM-Net)	2025	4.00	36.30

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, L.; Yu, E.; Cao, Q.; Hu, K. FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation. Information 2025, 16, 1102. https://doi.org/10.3390/info16121102

AMA Style

Chen L, Yu E, Cao Q, Hu K. FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation. Information. 2025; 16(12):1102. https://doi.org/10.3390/info16121102

Chicago/Turabian Style

Chen, Lifang, Entao Yu, Qihang Cao, and Ke Hu. 2025. "FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation" Information 16, no. 12: 1102. https://doi.org/10.3390/info16121102

APA Style

Chen, L., Yu, E., Cao, Q., & Hu, K. (2025). FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation. Information, 16(12), 1102. https://doi.org/10.3390/info16121102

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

FFM-Net: Fusing Frequency Selection Information with Mamba for Skin Lesion Segmentation

Abstract

1. Introduction

2. Related Work

2.1. Segmentation Based on Mamba

2.2. Low-Frequency Information Extraction

2.3. Edge Information Extraction

3. Method

3.1. Algorithm Overview

3.2. Edge Detail Extraction Module (EEM)

3.3. Low-Frequency Information Extraction Module (LEM)

3.4. Cross-Channel Spatial Attention Module (CCSA)

3.5. Multi-Level Feature Fusion Module (MFFM)

4. Experiments

4.1. Dataset and Evaluation Matrix

4.2. Implementation Details

4.3. Comparison with Other Network Results

4.4. Ablation Study

4.4.1. Suitable Radius of Truncation Frequency

4.4.2. The Suitable Channel Ratio Division Between LEM and EEM

4.4.3. Effect of the EEM Module

4.4.4. Effect of the LEM Module

4.4.5. Validity of Modules Within the FM Block Encoder

4.4.6. Effect of FFM-Net Architecture Modifications

4.5. Algorithmic Efficiency

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI