1. Introduction
The epidermal surface in digital epidermal microscopic (DEM) images displays a netlike pattern, characterized by an interwoven network of skin furrows and ridges [1]. Quantitative analysis of DEM images is crucial for investigating the extent of skin aging, because skin furrows and ridges undergo gradual modifications linked to both intrinsic and extrinsic aging processes [2]. These analyses compute structural features of furrows, such as length, width, and angle, as well as ridge features, including number, perimeter, and area. To extract these features, DEM images must first be segmented to accurately separate ridges from furrows [3]. In practice, reliable DEM segmentation has two key applications. In clinical dermatology, it supports the objective assessment of skin aging, enabling the early detection of age-related disorders and the evaluation of treatment efficacy. In cosmetics, it provides quantitative indicators for monitoring anti-aging product effectiveness and developing personalized skincare strategies.
However, human skin microstructure exhibits significant geometric heterogeneity. As a result, furrows and ridges in DEM images present highly complex topological features. Additionally, uneven illumination and low contrast problems often occur during the imaging process. Together, these factors challenge the development of automatic and robust segmentation methods for DEM images.
Traditional image segmentation techniques, such as region growing and clustering segmentation [4,5], have been utilized for skin images. However, they often perform poorly when facing complex skin patterns, as well as issues like lighting variation and low contrast [6]. In recent years, Convolutional Neural Networks (CNNs) and Transformer-based architectures have shown great potential for medical image segmentation tasks, including skin lesion detection and dermoscopic image analysis [7,8]. These methods enable models to automatically extract features through end-to-end learning and overcome the limitations of traditional methods to some extent. However, existing studies have mainly focused on the segmentation of skin lesions [9,10], while far less research has addressed the automatic segmentation of furrows and ridges under the microscope.
Recent advances in deep learning have further promoted the development of medical image segmentation, including the wide application of models such as the fully convolutional network FCN-8s [11], UNet [7], SegNet [12], and DeepLabV3+ [13]. UNet [7] and its variants show good performance in medical image segmentation tasks due to their distinctive skip-connection structure. Typical improvements to the UNet model include ResUNet [14], which introduces residual connections; AttentionUNet [15], which incorporates attention mechanisms; and NestedUNet [16], which achieves multi-scale feature aggregation. In addition, Transformer-based networks (e.g., TransUNet [17]) have further enhanced the capability of deep learning in medical image segmentation. However, the applicability of these existing methods requires further validation and optimization due to the specific challenges of DEM image segmentation, including ambiguity, individual variability, and complex texture. Furthermore, the scarcity of publicly available DEM datasets limits the development and evaluation of automatic segmentation methods.
This study first constructs a new DEM dataset to explore deep learning methods capable of accurately and automatically segmenting complex DEM images while meeting diverse performance requirements. The dataset contains finely labeled microscopic skin images and provides a reliable benchmark for training and evaluating deep learning models. Moreover, this study systematically investigates eight state-of-the-art deep learning models, namely FCN-8s [11], SegNet [12], UNet [7], DeepLabV3+ [13], ResUNet [14], NestedUNet [16], AttentionUNet [15], and TransUNet [17], comparing and analyzing their effectiveness in the DEM image segmentation task. Beyond a comprehensive evaluation of segmentation accuracy, the study analyzes each model's computational efficiency, robustness, and generalization ability. The goal is to identify accurate and flexible methods for segmenting skin furrows and ridges in DEM images, thereby laying the foundation for subsequent applications, such as skin aging analysis based on furrow and ridge features.
2. Materials and Methods
This section outlines the experimental framework used to evaluate deep learning models for DEM image segmentation. It begins with a detailed description of data collection, including image acquisition from healthy volunteers and subsequent preprocessing steps such as cropping, labeling, and augmentation. The aim is to build a robust dataset for training and testing. Following this, the section introduces eight segmentation models—ranging from classical CNNs to hybrid architectures—and explains their structural features. Finally, it presents the training setup and evaluation metrics used to assess both segmentation accuracy and computational efficiency.
2.1. Data Collection and Processing
DEM images were captured by a specialized skin imaging device developed by Boseview Technology Company (Guangzhou, China) [18]. This instrument integrates flexible plug-and-play technology and connects to a computer via a standard USB interface. The system includes an ergonomic handheld probe, designed to minimize operator fatigue during prolonged use. The probe allows flexible positioning and stable contact with the skin surface, ensuring high-quality image acquisition. This design provides researchers with convenient and reliable access to detailed skin microstructural information.
For this study, DEM images were obtained from a cohort of 46 healthy volunteers (age range: 20–50 years) residing in Yichang, China. The majority of participants reported indoor occupations, which helped to reduce the influence of long-term ultraviolet radiation and environmental exposure on skin morphology. The forearm, specifically the dorsal and ventral regions between the elbow and wrist, was selected as the sampling site. These anatomical sites are commonly used in dermatological research because they offer relatively uniform skin thickness, easy accessibility, and minimal interference from dense hair growth.
In total, 46 raw DEM images were acquired, corresponding to one image from each participant. The collected dataset provides a valuable resource for analyzing the microtopography and biophysical properties of normal skin.
Figure 1 presents four representative raw DEM images, illustrating the variations in texture, fine lines, and microstructures among different individuals.
To mitigate the adverse impacts of uneven illumination and provide more effective training samples, we preprocessed each raw 704 × 576-pixel DEM image. From each raw image, we cropped four to seven local images at a resolution of 256 × 256 pixels (see regions ②–⑤ in Figure 2). This process resulted in a total of 261 local DEM images. Subsequently, an experienced researcher used Adobe Photoshop 2023 to accurately label the skin furrows and ridges in the 261 local DEM images, generating a corresponding set of labeled images. Additionally, a larger test image (greater than 256 × 256 but smaller than the original 704 × 576 pixels) was cropped from each raw DEM image (see region ① in Figure 2), yielding a total of 46 test images.
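To make the cropping step concrete, the following is a minimal sketch in Python using OpenCV; the file path and crop coordinates are illustrative stand-ins, not the exact regions marked in Figure 2.

```python
import cv2

# Load one raw 704 x 576 DEM image (path is illustrative).
raw = cv2.imread("raw_dem_001.png")  # shape: (576, 704, 3)

# Illustrative top-left corners (y, x) for 256 x 256 training patches;
# the actual regions (2-5 in Figure 2) were chosen per image.
train_corners = [(0, 0), (0, 300), (250, 100), (280, 420)]
train_patches = [raw[y:y + 256, x:x + 256] for (y, x) in train_corners]

# One larger test crop per raw image (bigger than 256 x 256 but smaller
# than the full 704 x 576 frame), here 512 x 640 as an example.
test_crop = raw[32:544, 32:672]

for i, patch in enumerate(train_patches):
    cv2.imwrite(f"train_patch_{i}.png", patch)
cv2.imwrite("test_crop.png", test_crop)
```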
To reduce the risk of overfitting during model training and to improve model robustness [19], we applied data augmentation to the 261 local DEM images using Python 3.9. The augmentation operations include horizontal flipping, rotation, geometric transformations, brightness and contrast adjustment, and gamma correction. The details are as follows (see Figure 3): flip horizontally with 50% probability; rotate with 50% probability by randomly selecting one of four angles (0°, 90°, 180°, or 270°); apply a combination of translation, scaling, and rotation with 50% probability; adjust brightness and contrast with 50% probability; and change the gamma value with 50% probability. All of these operations use random sampling, so that repeated augmentations of the same source image produce distinct outputs, effectively increasing the diversity of the training data.
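The text does not name the augmentation library; the sketch below shows an equivalent pipeline using Albumentations, where each transform corresponds to one of the operations listed above and is applied jointly to an image and its label mask. File paths are illustrative.

```python
import albumentations as A
import cv2

# Each transform fires with 50% probability, mirroring the operations above.
# Geometric transforms are applied to both image and mask; intensity
# transforms (brightness/contrast, gamma) affect only the image.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),                # horizontal flip
    A.RandomRotate90(p=0.5),                # rotate by 0/90/180/270 degrees
    A.ShiftScaleRotate(p=0.5),              # translation + scaling + rotation
    A.RandomBrightnessContrast(p=0.5),      # brightness and contrast jitter
    A.RandomGamma(p=0.5),                   # gamma correction
])

image = cv2.imread("local_dem_patch.png")
mask = cv2.imread("local_dem_label.png", cv2.IMREAD_GRAYSCALE)

# Generate five augmented versions of each source patch, as in the paper.
for i in range(5):
    out = transform(image=image, mask=mask)  # mask follows the same geometry
    cv2.imwrite(f"aug_{i}_img.png", out["image"])
    cv2.imwrite(f"aug_{i}_mask.png", out["mask"])
```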
After data augmentation, each of the 261 local DEM images has five augmented versions which, together with the originals, yield a total of 1566 local DEM images (256 × 256 pixels) and the corresponding 1566 labeled images. These images are randomly divided into training and validation sets at a ratio of 8:2. The test set consists of 46 images, each larger than the 256 × 256-pixel images used for training. This setup means that the models learn from local patches during training but must make predictions on larger images during testing, which better evaluates their generalization ability [20].
2.2. Deep Learning Models for Segmentation
2.2.1. Model Descriptions
The architectural design of the eight deep learning models selected for this study largely follows the classic encoder–decoder paradigm. This structure employs a contracting path (the encoder) to capture the contextual semantic information of an image and a symmetric expanding path (the decoder) to restore spatial resolution, enabling precise pixel-level localization. The primary distinctions among these models lie in their feature extraction modules, decoder upsampling strategies, and the methods used to fuse features from different levels. The most distinctive core design of each model is briefly introduced below.
- (i) FCN-8s
The Fully Convolutional Network (FCN) is a pioneering work in the field of semantic segmentation [11]. It replaces the final fully connected layers of a traditional CNN with convolutional layers, enabling end-to-end, pixel-level prediction for inputs of arbitrary size. The FCN-8s variant used in this study merges feature maps from three different depths (with strides of 8, 16, and 32), which combines deep semantic information with shallow details more effectively than earlier versions, thus improving segmentation fineness.
- (ii) SegNet
The core innovation of SegNet lies in its efficient decoding mechanism [12]. During max-pooling in the encoder, it stores the positional indices of the maximum values. In the decoder, it uses these indices to perform non-parametric upsampling (unpooling), which helps restore object boundary details with minimal additional computational cost.
- (iii) UNet
UNet’s signature designs are its U-shaped symmetric architecture and skip connections [7]. By directly concatenating high-resolution feature maps from each stage of the encoder to the corresponding stage of the decoder, it significantly mitigates the information loss incurred during downsampling. This allows the model to leverage deep semantic information and shallow texture details simultaneously, and it has achieved great success in medical image segmentation.
- (iv) ResUNet
ResUNet integrates the core ideas of UNet with Residual Networks (ResNet) [14]. It introduces residual connections within each convolutional block of the UNet architecture, allowing information to skip across layers. This effectively addresses the vanishing gradient problem in deep network training, making it possible to build deeper networks that learn more complex feature representations.
- (v) NestedUNet
NestedUNet redesigns the decoder path of UNet by introducing dense, nested skip connections [16]. This design enables the decoder at each node to fuse features from multiple scales, achieving deep feature aggregation and flexible network pruning. Consequently, it delivers excellent performance across segmentation tasks of varying complexity.
- (vi) DeepLabV3+
The distinguishing feature of DeepLabV3+ is its powerful capability for capturing multi-scale contextual information [13]. It employs an Atrous Spatial Pyramid Pooling (ASPP) module with atrous convolutions in the encoder to obtain a larger receptive field without reducing spatial resolution [21,22]. It also adopts an encoder–decoder structure to better recover object boundary information.
- (vii) TransUNet
TransUNet was among the first models to successfully integrate the Transformer architecture into medical image segmentation. It innovatively embeds a Transformer module at the bottom of the UNet encoder (the bottleneck layer), using its self-attention mechanism to capture global long-range dependencies in the image, thereby compensating for the limitations of traditional CNNs in global context modeling [17].
- (viii) AttentionUNet
AttentionUNet incorporates Attention Gates into the skip connections of the UNet architecture [15]. Guided by high-level semantic information from the decoder, these gates automatically learn to focus on the feature regions most relevant to the current segmentation task while suppressing feature transmission from irrelevant areas, such as the background. This allows the model to attend more selectively to important details [23,24].
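To illustrate the gating mechanism described in (viii), the following is a minimal PyTorch sketch of an additive attention gate in the style of [15]; channel sizes are illustrative, and the gating and skip features are assumed to share the same spatial resolution (implementations typically resample one of them first).

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate placed on a UNet skip connection (sketch)."""
    def __init__(self, gate_ch, skip_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Sequential(nn.Conv2d(gate_ch, inter_ch, kernel_size=1),
                                 nn.BatchNorm2d(inter_ch))
        self.w_x = nn.Sequential(nn.Conv2d(skip_ch, inter_ch, kernel_size=1),
                                 nn.BatchNorm2d(inter_ch))
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1),
                                 nn.BatchNorm2d(1),
                                 nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, g, x):
        # g: decoder (gating) features; x: encoder (skip) features.
        alpha = self.psi(self.relu(self.w_g(g) + self.w_x(x)))  # coefficients in (0, 1)
        return x * alpha  # irrelevant encoder regions are suppressed before concatenation

# Example: gate a 64-channel skip connection with a 64-channel decoder feature map.
gate = AttentionGate(gate_ch=64, skip_ch=64, inter_ch=32)
g = torch.randn(1, 64, 128, 128)
x = torch.randn(1, 64, 128, 128)
print(gate(g, x).shape)  # torch.Size([1, 64, 128, 128])
```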
2.2.2. Model Setting and Training
The eight network models in this study were all implemented with Python 3.9 in the PyCharm 2025.1 development environment, using the PyTorch 2.0.0+cu118 deep learning framework. All experiments are conducted on a unified hardware platform equipped with the Windows 10 operating system, an NVIDIA GeForce RTX 3060 Laptop GPU (12 GB graphics memory), an Intel Core i9-11980HK CPU (2.60 GHz), and 32 GB RAM.
To ensure the fairness and comparability of the experiments, after data preprocessing, environment construction, and model building, each model is trained for 200 epochs with the same training parameters. The learning rate is set to 0.001, a value shown in most image segmentation tasks to balance convergence speed and stability, and thus a common default setting [25].
In the model training phase, we set the batch size to 2, 4, 8, and 16 to evaluate its impact on model performance. The choice of batch size is primarily constrained by model complexity and GPU memory capacity; within the memory limit, candidate values follow the 2^n (power-of-two) pattern. Under the same number of training epochs and hyperparameters, each setting is evaluated on the validation dataset. The batch size that achieves the best validation performance is selected as optimal for that model, and the final evaluation and comparison are then performed on the test dataset with this batch size.
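The following is a schematic sketch of this batch-size sweep; the one-layer model, random tensors, and two-epoch loop are toy stand-ins for the actual segmentation networks, DEM dataset, and 200-epoch schedule.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins so the sweep runs end to end; replace with the real
# DEM dataset and segmentation models in practice.
train_ds = TensorDataset(torch.randn(64, 3, 256, 256),
                         torch.randint(0, 2, (64, 1, 256, 256)).float())
val_ds = TensorDataset(torch.randn(16, 3, 256, 256),
                       torch.randint(0, 2, (16, 1, 256, 256)).float())

def dice_score(logits, target, eps=1e-7):
    pred = (torch.sigmoid(logits) > 0.5).float()
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

best_bs, best_dsc = None, -1.0
for bs in (2, 4, 8, 16):                                    # candidate batch sizes
    model = nn.Conv2d(3, 1, kernel_size=3, padding=1)       # toy model for illustration
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)     # fixed lr, as in the paper
    loader = DataLoader(train_ds, batch_size=bs, shuffle=True)
    for epoch in range(2):                                  # the paper trains 200 epochs
        for img, mask in loader:
            opt.zero_grad()
            loss = nn.functional.binary_cross_entropy_with_logits(model(img), mask)
            loss.backward()
            opt.step()
    with torch.no_grad():                                   # score on the validation set
        dsc = sum(dice_score(model(i.unsqueeze(0)), m.unsqueeze(0))
                  for i, m in val_ds) / len(val_ds)
    if dsc > best_dsc:
        best_bs, best_dsc = bs, dsc
print(f"selected batch size: {best_bs} (val DSC {best_dsc:.4f})")
```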
To ensure reproducibility and minimize human intervention, we used PyTorch’s default settings for all hyperparameters not explicitly mentioned, such as optimizer parameters and weight initialization methods.
Table 1 shows the specific parameters used in the training phase of each model, including image input size, batch size, learning rate, and number of training rounds.
2.3. Quantitative Metrics for Model Evaluation
A set of mainstream performance metrics is used to provide a comprehensive assessment of model performance. The metrics used to evaluate segmentation accuracy include the Dice Similarity Coefficient (DSC), Intersection over Union (IoU), Recall, and Precision [26,27].
DSC is defined as follows:

$$\mathrm{DSC} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

where TP is the number of pixels correctly predicted as furrows and confirmed by the ground truth; FP is the number of pixels predicted as furrows but belonging to the background in the ground truth; FN is the number of pixels predicted as background but corresponding to furrows in the ground truth; and TN is the number of pixels predicted as background and confirmed by the ground truth.

IoU is defined as follows:

$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

Recall is defined as follows:

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

Precision is defined as follows:

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
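For reference, a minimal sketch computing these four metrics from a predicted and a ground-truth binary mask (1 = furrow, 0 = background):

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-7):
    """DSC, IoU, Recall, Precision from binary masks (1 = furrow, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # furrow pixels predicted correctly
    fp = np.logical_and(pred, ~gt).sum()     # background predicted as furrow
    fn = np.logical_and(~pred, gt).sum()     # furrow predicted as background
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    return dsc, iou, recall, precision

# Example on two random 256 x 256 masks.
rng = np.random.default_rng(0)
pred = rng.integers(0, 2, (256, 256))
gt = rng.integers(0, 2, (256, 256))
print(segmentation_metrics(pred, gt))
```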
In addition, three metrics were employed to quantify the processing efficiency of the models under study [28]: the number of parameters (Params), floating-point operations (FLOPs), and inference time. Params measure memory resource usage and are reported in millions (M). FLOPs indicate computational complexity and are reported in GFLOPs, calculated with the thop toolkit by estimating the number of operations required for a single forward pass with a 256 × 256 input. Inference time, including average CPU and GPU times, reflects segmentation speed and is reported in milliseconds (ms).
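A minimal sketch of this profiling procedure; the torchvision model is a stand-in for the networks under study, and the exact counting convention of thop (multiply–accumulate operations reported as FLOPs) may vary across versions.

```python
import time

import torch
from thop import profile
from torchvision.models.segmentation import fcn_resnet50

# Stand-in network; the study's own models would be profiled the same way.
model = fcn_resnet50(weights=None, num_classes=2).eval()
x = torch.randn(1, 3, 256, 256)          # single forward pass at 256 x 256

# Params and FLOPs via thop.
flops, params = profile(model, inputs=(x,))
print(f"Params: {params / 1e6:.2f} M, FLOPs: {flops / 1e9:.2f} G")

# Average CPU inference time; GPU timing would additionally require moving
# the model and input to CUDA and calling torch.cuda.synchronize().
with torch.no_grad():
    for _ in range(5):                   # warm-up runs
        model(x)
    runs = 20
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    print(f"CPU inference: {(time.perf_counter() - t0) / runs * 1e3:.2f} ms")
```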
3. Results
This section presents a comprehensive evaluation of the eight deep learning models based on segmentation accuracy and computational efficiency. Quantitative metrics such as DSC, IoU, Recall, and Precision are used to measure segmentation performance, while Params, FLOPs, and inference time assess efficiency. A normalized comparison is also provided to rank the models holistically.
3.1. Segmentation Accuracy of Each Model
This study evaluates the segmentation accuracy of eight deep learning models in the DEM image segmentation task, using DSC, IoU, Recall, and Precision for comparative analysis.
Table 2 presents the specific segmentation accuracy values (mean ± standard deviation across test samples) of the eight models. Due to computational constraints, all results are based on a single training run. To further assess the robustness of the findings, we computed the evaluation metrics for each test image individually and then reported their mean and standard deviation. The relatively small per-sample fluctuations indicate stable performance on the test set.
As shown in Table 2, AttentionUNet achieved the highest segmentation accuracy (DSC: 0.8696, IoU: 0.7703, Recall: 0.8819, Precision: 0.8587), demonstrating a strong ability to capture complex furrow boundaries. FCN-8s (0.8690) and NestedUNet (0.8642) also perform competitively: FCN-8s offers comparable accuracy with faster inference, while NestedUNet benefits from dense skip connections and deep supervision for robust training. The baseline UNet (0.8641) remains solid, whereas ResUNet (0.8548) and DeepLabV3+ (0.8470) are slightly weaker, reflecting the limitations of deeper or more complex designs under limited data. In contrast, SegNet (0.8272) and TransUNet (0.8281) show the lowest accuracy, due to information loss in unpooling and the underutilization of Transformer modules on small, localized datasets. Overall, AttentionUNet excels in accuracy, FCN-8s offers the best balance of accuracy and efficiency, while SegNet and TransUNet illustrate the risks of oversimplified or overly complex architectures in small-sample scenarios.
Figure 4 shows segmentation results of the eight network models on five representative DEM images. We evaluated the performance of each network model by comparing its segmentation results against the ground truth images. AttentionUNet and FCN-8s maintained clear furrow boundaries. Both models accurately captured fine structures. UNet and NestedUNet also exhibit good segmentation capabilities and demonstrate strong resistance to interference from complex backgrounds. During segmentation, DeepLabV3+ and SegNet tend to produce boundary blurring, particularly within high-complexity regions, making discontinuous results more likely. TransUNet maintains high accuracy under certain conditions; however, its performance becomes less stable when dealing with complex textures, leading to substantial uncertainty along the boundaries.
In summary, AttentionUNet and FCN-8s demonstrate relatively higher segmentation accuracy for the DEM image segmentation, while UNet and NestedUNet also exhibit good robustness. DeepLabV3+ and SegNet, however, exhibit a tendency for boundary blurring in specific scenarios, and TransUNet still has space for further optimization in segmenting complex skin textures.
3.2. Segmentation Efficiency of Each Model
This study also evaluates the efficiency of the eight deep learning network models in the DEM image segmentation task. Comparative analysis is conducted using Params, FLOPs, average CPU inference time (ms), and average GPU inference time (ms).
Table 3 shows that different models exhibit significant efficiency differences in processing the DEM image segmentation task.
FCN-8s is a lightweight model with only 18.64 M parameters and 25.50 G FLOPs. It achieves the fastest inference speed among all tested models (CPU: 12.62 ms; GPU: 37.36 ms), making it highly suitable for scenarios that demand computational efficiency. The computational complexity of SegNet is slightly higher than that of FCN-8s, but its overall efficiency remains high (CPU: 23.92 ms; GPU: 54.00 ms), striking a balance between speed and model capability. NestedUNet achieves good inference speed (CPU: 29.11 ms; GPU: 80.18 ms) with the smallest parameter count (9.16 M) and moderate FLOPs (34.90 G). Despite having fewer parameters (13.04 M), ResUNet exhibits a higher FLOPs count (80.98 G), reflecting a more intricate computational structure; while it performs well on the CPU (18.67 ms), its longer average GPU inference time (98.86 ms) suggests that its structure parallelizes less efficiently on the GPU.
In contrast to the previously discussed models, both UNet and TransUNet suffer from longer inference times. UNet involves 31.04 M parameters and 54.74 G FLOPs, leading to an average GPU inference time as high as 130.08 ms. TransUNet has the largest parameter count (67.08 M) and the longest GPU inference time (207.27 ms). AttentionUNet and DeepLabV3+ exhibit good segmentation accuracy but have higher computational overheads, with average GPU inference times of 97.97 ms and 94.26 ms, respectively.
Overall, FCN-8s demonstrated the highest efficiency in both CPU and GPU processing. UNet demonstrates lower efficiency for both CPU and GPU processing. Although TransUNet shows moderate CPU processing efficiency, it has the lowest efficiency for GPU processing. SegNet and NestedUNet achieve a good balance between CPU and GPU processing efficiency. AttentionUNet, while exhibiting higher CPU processing efficiency, shows lower efficiency for GPU processing.
3.3. Comprehensive Performance Evaluation of Each Model
To comprehensively evaluate the segmentation performance of the eight network models, it is essential to consider both the segmentation accuracy and efficiency metrics. However, because the various quantitative metrics have different units and numerical ranges, they cannot be directly compared. To remove the effect of these differing units, each metric must undergo scaling to ensure all values fall within the [0, 1] range.
Regarding positive metrics such as DSC, IoU, Recall, and Precision (higher values denote superior performance), Max Scaling [29] is applied. The normalization equation is as follows:

$$x' = \frac{x}{x_{\max}}$$

where $x$ denotes the raw metric value and $x_{\max}$ denotes the maximum value of that metric across all models.
For negative metrics such as Params, FLOPs, CPU runtime, and GPU runtime (where lower values signify superior performance), reciprocal normalization [30] is used. The formula is as follows:

$$x' = \frac{x_{\min}}{x}$$

where $x_{\min}$ denotes the minimum value of that metric across all models.
After this normalization, all indicators adhere to a common principle: the higher the value, the better the model’s performance on that metric. The normalized quantitative evaluation results are shown in Table 4.
The normalized metrics are visualized in a radar chart (see Figure 5). Since higher normalized values correspond to superior model performance, values nearer the outer edge of the chart indicate better performance. Based on the areas enclosed on the radar charts (see Table 5), the overall model performance ranking from highest to lowest is as follows: FCN-8s, NestedUNet, SegNet, ResUNet, AttentionUNet, DeepLabV3+, UNet, and TransUNet.
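The paper does not spell out the area computation; assuming the eight normalized metrics are plotted on equally spaced axes and the vertices are connected in plotting order, the enclosed polygon area can be computed as in the sketch below. Note that the resulting area depends on the order in which the axes are arranged.

```python
import math

def radar_area(values):
    """Area of the polygon formed by n values on equally spaced radar axes."""
    n = len(values)
    # Sum of the triangle areas spanned by consecutive axes (cyclic).
    return 0.5 * math.sin(2 * math.pi / n) * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )

# Upper bound with eight axes: all normalized values equal to 1 gives
# about 2.828, consistent with the reported FCN-8s area of 2.3856.
print(radar_area([1.0] * 8))

# Illustrative normalized scores for one model (eight metrics in [0, 1]).
print(radar_area([0.99, 0.99, 0.98, 0.99, 1.0, 1.0, 1.0, 1.0]))
```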
FCN-8s shows the largest area on the radar chart, suggesting the best overall performance in segmentation accuracy and efficiency, whereas TransUNet has the smallest area, indicating the worst overall performance. Traditional convolutional architectures such as FCN-8s and SegNet are structurally simpler than newer designs, yet they maintain stable performance across multiple evaluation metrics. In particular, FCN-8s achieves the best overall performance in this task. In contrast, TransUNet, which integrates attention mechanisms and Transformer structures, fails to demonstrate the expected advantages; this may be influenced by the data characteristics or training strategies.
These results indicate that model performance depends not only on the complexity of the network architecture but also closely relates to the task type and data characteristics. In the DEM image segmentation task, traditional network architectures still possess significant competitiveness, whereas the advantages of newer architectures require further validation and optimization within specific contexts.
4. Discussion
Focusing on automatic segmentation for DEM images, this study systematically assesses the performance of eight representative deep learning models based on eight metrics that cover segmentation accuracy and computational efficiency.
Regarding segmentation accuracy, AttentionUNet demonstrates relatively better performance across multiple evaluation metrics (DSC: 0.8696, IoU: 0.7703, Recall: 0.8819), highlighting its advantage in handling complex boundary structures. AttentionUNet extends the traditional UNet architecture by integrating attention gates. These gates dynamically suppress irrelevant regions during feature transfer, strengthening the model’s focus on salient areas such as furrow edges. This mechanism enhances both spatial perception and feature selection accuracy. As a result, it is especially effective for DEM images characterized by indistinct boundaries and intricate textures. Additionally, the attention module mitigates the traditional UNet’s weakness in differentiating complex tissue structures by introducing non-linear correlation modeling.
NestedUNet significantly improves the fusion of multi-scale semantic and spatial features. It achieves this by introducing dense, nested skip connections, which are further enhanced by a deep supervision mechanism. Its unique decoding path adopts a multi-level aggregation strategy. This strategy enhances semantic consistency while preserving spatial details. As a result, it effectively bridges the information gap between shallow features and deep semantic alignment that exists in the original UNet structure. Additionally, the integration of deep supervision enables loss calculation at all sub-network levels during training, which improves convergence speed and segmentation robustness.
ResUNet, on the other hand, incorporates residual blocks into UNet. This integration enables efficient gradient flow through identity mappings, thereby stabilizing the training of the network’s deeper layers. The structure enhances nonlinear expressive ability during the feature extraction phase. It effectively addresses the common issues of vanishing gradients and feature degradation in deeper networks while preserving the texture details in lower layers. ResUNet demonstrates good performance in both training stability and accuracy.
Overall, classic models based on the encoder–decoder architecture, along with their improved variants, continue to demonstrate strong adaptability and scalability when addressing the challenges posed by complex local structures and dense textures in DEM images. These model designs generally emphasize multi-scale feature fusion (e.g., NestedUNet’s cascaded structure), deep semantic extraction (e.g., ResUNet’s residual blocks), and local boundary enhancement (e.g., AttentionUNet’s attention mechanism), providing a solid structural foundation for segmenting complex medical images.
However, greater architectural complexity does not necessarily translate into better performance. TransUNet integrates a vision transformer module into the UNet encoder. This design theoretically enhances global context modeling and long-range dependency capture. However, in this study, its segmentation performance (DSC: 0.8281) was significantly lower than that of several CNN-based models. Several factors may explain this result. First, the texture features in DEM images are primarily localized. Consequently, the Transformer’s global attention mechanism struggles to precisely capture small, irregular edge details. Second, the Transformer module’s high parameter complexity and data dependency hinder its structural benefits when data volume is limited, potentially causing reduced generalization and overfitting risks. This suggests that Transformer-based modules are better suited to large datasets with pronounced global structures. However, for tasks involving dense local textures, their performance benefits remain to be validated and optimized.
In terms of computational efficiency, FCN-8s demonstrates a relative advantage. This model employs a relatively shallow convolutional architecture and a fully convolutional up-sampling strategy, resulting in a concise structure. With only 18.64 M parameters and 25.50 G FLOPs, it achieves an average GPU inference time of 37.36 ms, which is the lowest among all the tested models. The skip connection mechanism in FCN-8s preserves high-resolution features and facilitates semantic information flow, effectively controlling computational resource consumption. This model features a short fusion path and a straightforward up-sampling process, making it suitable for resource-constrained or computationally efficient application scenarios. However, its down-sampling stage relies on traditional max-pooling, which can compromise the recovery of high-frequency details.
Although SegNet also uses an encoder–decoder structure, its reliance on max unpooling with indices during the decoding phase leads to significant information loss. This deficiency results in blurred segmentation boundaries and the poor recovery of high-frequency details. Additionally, while this structure reduces storage overhead, it has limitations in complex texture reconstruction tasks.
Notably, although AttentionUNet demonstrates superior accuracy, its complex attention computation and deep feature fusion structure result in relatively slow inference speed. This makes it more suitable for tasks requiring high segmentation accuracy and tolerable delay.
Based on a normalized multi-metric comprehensive analysis using radar charts, FCN-8s demonstrates stable performance in both accuracy and efficiency. It achieves the largest radar chart area (2.3856), indicating the best overall performance. In contrast, NestedUNet and SegNet manage to maintain a certain level of segmentation accuracy while also achieving structural light-weighting and edge-awareness, placing them among the top performers as well. However, TransUNet, despite incorporating structurally complex Transformer modules, fails to demonstrate the expected advantages in the DEM image segmentation task. This result further underscores the tension between architectural complexity and task suitability. It suggests that adapting a model’s structure to a specific task is more effective than simply increasing its complexity.
Nevertheless, several limitations of this study should be acknowledged. First, the dataset is relatively small (46 volunteers; 261 original images) and relies heavily on offline augmentation, which may restrict the generalization ability of the models. This constraint may particularly disadvantage data-intensive architectures like TransUNet and DeepLabV3+. Consequently, their suboptimal performance in our experiments is likely driven by data scarcity rather than inherent architectural limitations. Second, the training was conducted on 256 × 256 cropped images, while testing used larger images, creating a domain gap that may bias performance evaluation. Moreover, because hyperparameters were not systematically tuned for each model, the robustness of our comparative conclusions may be limited.
5. Conclusions
The performance of eight representative deep learning models was systematically evaluated for DEM image segmentation. AttentionUNet achieved the best segmentation accuracy, particularly in capturing complex boundary structures, while FCN-8s demonstrated the highest computational efficiency. In contrast, TransUNet showed the lowest performance in both accuracy and efficiency.
Considering both segmentation accuracy and efficiency, FCN-8s provides a good trade-off and can serve as a baseline model for DEM segmentation tasks. Different models may suit different scenarios: lightweight models (e.g., FCN-8s) are preferable for real-time clinical applications; high-accuracy models (e.g., AttentionUNet) are suitable for research requiring precise recognition; and balanced models (e.g., NestedUNet) show promise for small-scale or collaborative studies.
The dataset has notable limitations regarding demographic diversity. It was collected from healthy volunteers in a single region, with relatively concentrated age ranges and skin types. This may limit the generalization of the models across populations with different ages, genders, skin tones, and lifestyles. Therefore, future work should collect larger and more diverse datasets to further validate the comprehensive performance of the eight network models, and should also emphasize model robustness and generalization, including lightweight optimization for AttentionUNet and accuracy improvement for FCN-8s, to support broader applicability.