Article

AKtransU-Net: Transformer-Equipped U-Net Model for Improved Actinic Keratosis Detection in Clinical Photography

by Panagiotis Derekas 1, Charalampos Theodoridis 2, Aristidis Likas 2, Ioannis Bassukas 3, Georgios Gaitanis 3, Athanasia Zampeta 3, Despina Exadaktylou 4 and Panagiota Spyridonos 1,*

1 Department of Medical Physics, Faculty of Medicine, School of Health Sciences, University of Ioannina, 45110 Ioannina, Greece
2 Department of Computer Science & Engineering, School of Engineering, University of Ioannina, 45110 Ioannina, Greece
3 Department of Skin and Venereal Diseases, Faculty of Medicine, School of Health Sciences, University of Ioannina, 45110 Ioannina, Greece
4 Department of Dermatology, General Hospital of Nikaia—Piraeus “Agios Panteleimon”, 18454 Nikaia, Greece
* Author to whom correspondence should be addressed.
Diagnostics 2025, 15(14), 1752; https://doi.org/10.3390/diagnostics15141752
Submission received: 3 June 2025 / Revised: 8 July 2025 / Accepted: 9 July 2025 / Published: 10 July 2025
(This article belongs to the Special Issue Artificial Intelligence in Dermatology)

Abstract

Background: Integrating artificial intelligence into clinical photography offers great potential for monitoring skin conditions such as actinic keratosis (AK) and skin field cancerization. Identifying the extent of AK lesions often requires more than analyzing lesion morphology—it also depends on contextual cues, such as surrounding photodamage. This highlights the need for models that can combine fine-grained local features with a comprehensive global view. Methods: To address this challenge, we propose AKTransU-net, a hybrid U-net-based architecture. The model incorporates Transformer blocks to enrich feature representations, which are passed through ConvLSTM modules within the skip connections. This configuration allows the network to maintain semantic coherence and spatial continuity in AK detection. This global awareness is critical when applying the model to whole-image detection via tile-based processing, where continuity across tile boundaries is essential for accurate and reliable lesion segmentation. Results: The effectiveness of AKTransU-net was demonstrated through comparative evaluations with state-of-the-art segmentation models. A proprietary annotated dataset of 569 clinical photographs from 115 patients with actinic keratosis was used to train and evaluate the models. From each photograph, crops of 512 × 512 pixels were extracted using translation lesion boxes that encompassed lesions in different positions and captured different contexts. AKtransU-net exhibited a more robust context awareness and achieved a median Dice score of 65.13%, demonstrating significant progress in whole-image assessments. Conclusions: Transformer-driven context modeling offers a promising approach for robust AK lesion monitoring, supporting its application in real-world clinical settings where accurate, context-aware analysis is crucial for managing skin field cancerization.

1. Introduction

Actinic keratoses (AKs) are common precancerous skin lesions that develop in areas of chronic ultraviolet radiation exposure. They are considered early indicators of squamous cell carcinoma [1] and often coexist with subclinical changes in the surrounding skin, highlighting their key role in the broader process of skin field cancerization. Managing AKs effectively requires addressing both individual lesions and the entire affected skin field to reduce the risk of progression to invasive skin cancers [2].
The clinical evaluation of AK involves assessing the extent of the AK lesions (the AK burden), which is crucial for determining baseline disease severity and guiding treatment decisions.
In routine clinical practice, assessing the AK burden presents challenges. Most guidelines rely on a lesion count, yet specialists often find this metric inconsistent [3]. Additionally, studies indicate that complete patient clearance rates inversely correlate with baseline lesion numbers, yet many intervention studies fail to account for this baseline burden, potentially underestimating treatment’s benefits [4,5].
An accurate evaluation of treatment outcomes and long-term monitoring are critical to optimizing patient care. Current methods for assessing the AK burden and treatment response, such as clinical scoring systems and qualitative indices, are subjective and prone to interobserver variability [6,7]. These limitations underscore the need for more objective, reproducible, and comprehensive approaches that can enhance management strategies and improve the surveillance of AK over time.
Clinical photography has become an essential documentation tool in dermatology, particularly for diagnosing and monitoring conditions such as actinic keratoses (AKs). It allows clinicians to track changes in skin lesions over time, particularly in cases of extensive sun damage where AKs appear diffusely across large body surface areas [8]. Clinical photography, coupled with image analysis, offers an efficient means of the comprehensive documentation of AK clearance and/or the formation of new AKs.
Detecting AK lesions in clinical photographs poses a significant challenge due to their subtle visual characteristics and ill-defined boundaries. Their variability in size, shape, color, and texture further complicates the automated detection process. Early attempts at quantitatively assessing AK lesions using clinical photography faced challenges such as inconsistent illumination, lesion color diversity [9], and the restriction of detection to preselected subregions of the photographed skin areas using a binary patch classifier to discriminate AKs from healthy skin [10,11,12]. These limitations hindered the reproducibility and accuracy of detection.
Leveraging an optimized actinic keratosis convolutional neural network (AKCNN) as a patch classifier has significantly improved the detection of AKs [13]. However, a patch classifier processes small image patches (e.g., 50 × 50 pixels) independently, classifying each patch without explicit consideration of its surrounding spatial context. Moreover, the AKCNN requires manually predefined scanning areas to exclude skin regions that are prone to being falsely classified as AK in clinical images. Unlike more distinct skin conditions, AK lesions often blend with the surrounding sun-damaged skin, making it challenging to delineate them precisely. These limitations prompted our earlier work, which explored semantic segmentation using a U-net model with recurrent spatial processing in skip connections, thereby enhancing the detection of AKs in clinical photographs (AKU-net) [14].
However, an accurate identification of the AK burden within skin field cancerization often relies more on contextual cues—such as surrounding photodamage—than on the lesion’s appearance alone. This dependence on broader spatial information underscores the importance of models that integrate fine-grained local features and a global context to effectively detect and segment AK lesions in a real clinical photography setting.
Moreover, the high-resolution nature of the clinical photographs and the need to detect fine-texture AK regions impose additional limitations when implementing a regular semantic segmentation method. Training a segmentation network on high-resolution images requires either a huge memory capacity, which is often unavailable, or progressively reducing the image’s spatial size to fit into the memory. Nevertheless, down-sampling increases the risk of losing critical discriminative features. To address the challenge of segmenting AK lesions from clinical images, we have previously adopted the “tile strategy” [14]. This method involves breaking down a large image into smaller tiles (crops), making predictions on these tiles, and then stitching them back together to form the final prediction. However, this process may result in the loss of image features shared between tiles, leading to under-segmentation or the missed detection of AK lesions.
As semantic segmentation evolves rapidly, it is essential to continually reassess and refine architectural choices to meet the growing demands of complex clinical tasks, including the evaluation of the AK burden and longitudinal monitoring in dermatology. Within this context, the main contributions of our study are as follows:
  • We explore U-net-based architectural enhancements that integrate Transformers to enable more coherent and contextually informed AK segmentations.
  • We introduce AKTransU-net, a hybrid U-net-based architecture that incorporates Transformer blocks to enrich feature representations. These are further refined through ConvLSTM modules embedded within the skip connections. Improvements in AK segmentation accuracy—at both crop and full-image levels—are demonstrated through comparisons with state-of-the-art semantic segmentation models.
  • We significantly improve the whole-image segmentation performance by employing a Gaussian weighting technique to mitigate boundary effects during prediction merging.
  • We underscore the crucial role of scanning methodology in real-world segmentation applications, suggesting that evaluations conducted solely under ideal conditions may fail to capture a model’s variability and robustness during deployment.
The remainder of the paper is organized as follows: Section 2 reviews related work on Transformer architectures in medical image segmentation, with an emphasis on hybrid U-net-based models and enhancements to their skip connections. Section 3 introduces the proposed AKtransU-net architecture, describes the materials, the data preparation, and details the evaluation methodology. Section 4 presents comparative results on both localized image crops and full-image regions. Section 5 discusses the main findings, highlights the study’s limitations, and outlines future research directions. Finally, Section 6 concludes the paper.

2. Related Work

2.1. The Evolution of Transformer Architectures in Medical Image Segmentation

Initially developed for natural language processing, Transformers have increasingly demonstrated strong potential in computer vision. The introduction of the Vision Transformer (ViT) by Dosovitskiy et al. [15] marked a significant shift, establishing the viability of Transformers for visual tasks. Unlike traditional convolutional neural networks (CNNs), Transformers leverage self-attention mechanisms to model long-range dependencies across the entire image, enabling a deeper understanding of spatial relationships and context. Although ViT was developed for natural image understanding, it quickly attracted attention for dense prediction tasks such as semantic segmentation due to its powerful global context modeling. However, early efforts to directly apply pure Transformer architectures to pixel-wise segmentation encountered challenges, including a high computational cost and limited spatial precision. To mitigate these issues, various Transformer variants were introduced. The Swin Transformer [16] improved computational efficiency and spatial locality by computing self-attention within shifted windows. Other variants, such as Pyramid Vision Transformer (PVT) [17] and LeViT [18], further optimized the balance between resolution and context modeling. These advances expanded the feasibility of integrating Transformers into medical segmentation tasks, paving the way for more practical and effective designs.
Since its introduction, U-net has inspired numerous enhanced variants, demonstrating a widespread adoption in medical image segmentation [19,20]. The rise of Transformer architectures has further driven the development of hybrid models that combine U-net’s strengths with global attention mechanisms. The review papers in the bibliography reflect this rapid evolution and highlight the growing impact of these hybrid approaches in the medical field [21,22,23].
One of the earliest and most influential examples is TransU-net, which integrates a ViT module into the U-net encoder [24]. In this architecture, a CNN backbone is first employed for local feature extraction, followed by the ViT component, which captures long-range dependencies across the spatial domain. More recently, the TransU-net framework has been extended to include Transformer modules that can be flexibly inserted into the U-net backbone, resulting in three hybrid configurations: Encoder-only, Decoder-only, and Encoder + Decoder. This modular design allows users to tailor the architecture more easily to specific segmentation tasks [25].
Following TransU-net, several variants explored different ways to integrate Transformer blocks. Swin-Unet [26] replaced ViT with Swin Transformer blocks to enable efficient, hierarchical, and windowed attention across scales. DS-TransUnet [27] extended this concept by utilizing Swin Transformers in both the encoder and decoder paths, with dual-branch interactions to enhance its multi-scale consistency. UCTransnet [28], in contrast, focused on the skip connections, replacing them with a channel transposition and attention fusion mechanism.
Most recently, HmsU-net [29] was proposed as a multi-scale U-net architecture that integrates CNN and Transformer blocks in parallel at both the encoder and decoder. Each CNN-Trans block processes features through two branches, one convolutional and one Transformer-based, and fuses them using a multi-scale feature fusion module. A cross-attention mechanism further enhances skip connections across levels, enabling a stronger interaction between encoder and decoder representations. This design allows the network to effectively capture both local textures and global semantics [29].
These models illustrate the diversity of Transformer integration strategies within the U-net framework, ranging from encoder-only attention to full dual-path designs. Each variant reflects a unique attempt to balance spatial accuracy, global reasoning, and computational efficiency, making Transformer-enhanced U-nets a powerful class of models for the segmentation of medical images.

2.2. Skip Connections in Hybrid U-Net Architectures

Skip connections have been a cornerstone of U-net-based architectures, effectively bridging the encoder and decoder by preserving spatial details [30]. However, several studies have shown that plain skip connections, such as direct concatenation, can introduce semantic gaps between encoder and decoder features, negatively impacting segmentation performance. To address this, various models have rethought the design of skip connections. MultiResUnet [31], for example, instead of simply concatenating the feature maps from the encoder stages to the decoder stages, first passes them through a chain of convolutional layers with residual connections and then concatenates them with the decoder features. IBA-U-net [32] addresses the semantic mismatch between the encoder feature maps and the corresponding decoder feature maps by placing an Attentive Bidirectional Convolutional Long Short-Term Memory (BConvLSTM) block in the skip connections: the BConvLSTM extracts bidirectional information, and an attention block then weights its outputs to highlight the salient features passed to the decoder. UCTransnet [28] and FAFS-Unet [33] have introduced cross-attention and feature selection modules to enrich skip connections with context-aware filtering and semantic alignment, thereby enhancing their capabilities. Similarly, HmsU-net [29] improved the communication between the encoder and decoder by introducing a cross-attention module that fuses features from the last three encoder stages and distributes them to the decoder.

3. Materials and Methods

3.1. AKtransU-Net: Skip Connection Enforcement with Global Dependencies

Inspired by recent developments in hybrid U-net architectures and based on our previous exploration of U-net variants for the detection of AKs [14], the proposed architecture adopts the view that skip connections should actively propagate feature maps enriched with global dependencies to the decoder stage. To achieve this, we apply ViT blocks [15] to the encoder-derived feature maps at multiple scales. Passing the CNN feature maps through a Transformer block augments them with global relationships.
The output of each Transformer block is then passed into ConvLSTM modules within the skip connections, which process three inputs sequentially: the CNN encoder output, the Transformer features, and the up-sampled decoder output. This design enables the network to detect lesions not only by analyzing local textures but also by leveraging their broader spatial context. The architecture is shown in Figure 1.

3.1.1. Transformer Encoder

In AKTransU-net, three Transformer encoder blocks are integrated at progressively down-sampled scales within the encoding path. This design enhances the CNN-extracted features with global attention, enabling the model to capture long-range dependencies across the spatial domain.
Each Transformer encoder block follows a standard ViT configuration (Figure 2), utilizing 16 attention heads (h = 16), and processes embedded tokens of size D = 1024. The encoder block consists of two sub-layers, each preceded by Layer Normalization and followed by a residual connection.
In the first sub-layer, the embedded patches are projected into sets of query (Q), key (K), and value (V) vectors through Multi-Head Self-Attention. For each head, attention is computed using scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the key vectors, calculated as follows:

$$d_k = \frac{D}{h} = \frac{1024}{16} = 64$$
The outputs from all the attention heads are concatenated and passed through a linear projection. This is followed by a second Layer Normalization and a Feed-Forward Network composed of two linear layers with a non-linear activation function (Rectified Linear Unit). Finally, the output token sequence is reshaped back to its original spatial dimensions. This step ensures that the global context learned through attention is preserved while maintaining compatibility with the subsequent ConvLSTM and decoder layers.
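For illustration, the block below gives a minimal TensorFlow/Keras sketch of one such pre-norm ViT encoder block applied to a CNN feature map. The feed-forward width (mlp_dim), the linear token embedding/projection, and the omission of positional embeddings are simplifying assumptions and are not taken from our implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vit_encoder_block(tokens, num_heads=16, embed_dim=1024, mlp_dim=2048):
    """One pre-norm ViT encoder block: Multi-Head Self-Attention and a
    Feed-Forward Network, each preceded by LayerNorm and wrapped in a
    residual connection. mlp_dim is an illustrative choice."""
    h = layers.LayerNormalization(epsilon=1e-6)(tokens)
    h = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=embed_dim // num_heads)(h, h)  # d_k = 1024/16 = 64
    tokens = layers.Add()([tokens, h])

    h = layers.LayerNormalization(epsilon=1e-6)(tokens)
    h = layers.Dense(mlp_dim, activation="relu")(h)
    h = layers.Dense(embed_dim)(h)
    return layers.Add()([tokens, h])

def transformer_on_feature_map(feat, embed_dim=1024):
    """Flatten a CNN feature map (H, W, C) into tokens, run one ViT block,
    and reshape back to the original spatial dimensions (static shapes
    assumed; positional embeddings omitted for brevity)."""
    _, h, w, c = feat.shape
    tokens = layers.Reshape((h * w, c))(feat)
    tokens = layers.Dense(embed_dim)(tokens)      # linear token embedding
    tokens = vit_encoder_block(tokens, embed_dim=embed_dim)
    tokens = layers.Dense(c)(tokens)              # project back to the CNN channel depth
    return layers.Reshape((h, w, c))(tokens)
```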

3.1.2. ConvLSTM in Skip Connections for Spatial Refinement

ConvLSTM layers are a variation of LSTMs (Long Short-Term Memory networks) that incorporate convolutional operations, making them well-suited for capturing spatial relationships in image data [34,35,36,37]. While ConvLSTMs are often used in spatiotemporal tasks, they can be effectively applied to purely spatial data by focusing on spatial interactions across feature maps. ConvLSTM modules have been integrated into U-net skip connections to enhance the model’s capacity to reconstruct accurate segmentation maps [32,37,38].
A ConvLSTM consists of an input gate $i_t$, an output gate $o_t$, a forget gate $f_t$, a memory cell $c_t$, and a hidden state $H_t$. The gates control access to, update, and clear the memory cell.
For an input $X_t$, the ConvLSTM process is formulated as follows:

$$i_t = \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \cdot c_{t-1} + b_i\right)$$

$$f_t = \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \cdot c_{t-1} + b_f\right)$$

$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right)$$

$$o_t = \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \cdot c_t + b_o\right)$$

$$H_t = o_t \cdot \tanh(c_t)$$

In the final step $t$, the refined feature map $H_t$, which captures the accumulated spatial dependencies, is passed to the decoder. $W_{x\cdot}$ and $W_{h\cdot}$ denote the 2D convolution kernels applied to the input and hidden states, respectively; $*$ represents the convolution operation and $\cdot$ the Hadamard product (element-wise multiplication). $b_i$, $b_f$, $b_c$, and $b_o$ are the bias terms, and $\sigma$ is the sigmoid function.
In our architecture, the ConvLSTM layer processes the CNN encoding, the Transformer encoding, and the up-sampled decoder feature map in a sequential three-step process. In the first step, the encoder’s CNN-based feature map is processed. In the second step, the Transformer-based representation is incorporated, using the hidden state from the previous step. In the third and final step, the up-sampled decoder feature map is refined, leveraging the accumulated spatial memory from the earlier steps. This sequential integration allows the ConvLSTM to serve as a spatial refinement module, enhancing the consistency of skip connections. As a result, the decoder receives a richer combination of low- and high-level spatial information, leading to more precise and accurate segmentation.
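A minimal Keras sketch of this three-step skip connection is given below; the kernel size and the assumption that the three feature maps share spatial dimensions and channel depth are illustrative choices rather than implementation details from this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

def convlstm_skip(cnn_feat, vit_feat, dec_feat, filters):
    """Refine a skip connection with a ConvLSTM that consumes the CNN encoding,
    the Transformer encoding, and the up-sampled decoder map as a three-step
    sequence, returning the final hidden state H_t for the decoder.
    All three maps are assumed to share shape (batch, H, W, C)."""
    seq = tf.stack([cnn_feat, vit_feat, dec_feat], axis=1)    # (batch, 3, H, W, C)
    refined = layers.ConvLSTM2D(filters=filters,
                                kernel_size=3,
                                padding="same",
                                return_sequences=False)(seq)  # last hidden state only
    return refined
```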

3.2. Material

The use of archival photographic material for this study was approved by the Human Investigation Committee (IRB) of the University Hospital of Ioannina (Approval No.: 3/17-2-2015 [θ.17]). The study included 115 patients diagnosed with facial AK, comprising 60 males and 55 females, aged between 45 and 85 years, who attended the specialized Dermato-Oncology Clinic of the Dermatology Department.
The inclusion criteria for the clinical photographs were the presence of clinically confirmed AKs depicted in the acquired and utilized photographs. The exclusion criteria were the existence of a biopsy-confirmed keratinocytic tumor (basal cell or squamous cell carcinoma).
Facial photographs were captured with the camera positioned perpendicular to the target area, ensuring full coverage of the face from the chin to the hairline. High-resolution digital images (4016 × 6016 pixels) were obtained using a Nikon D610 (Nikon, Tokyo, Japan) equipped with a Nikon NIKKOR© 60 mm f/2.8G ED Micro lens, following a protocol adapted from Muccini et al. [38]. The camera was configured with an aperture of f/18, shutter speed of 1/80 s, ISO 400 [39], autofocus enabled, and white balance set to auto mode. A Sigma ring flash (Sigma, Fukushima, Japan) in TTL mode was mounted on the camera. Linear polarizing filters were placed in front of both the lens and the flash, aligned to create a 90° polarization angle between them, thereby minimizing surface reflections and enhancing the lesions’ visibility.
Two dermatologists jointly reviewed and annotated the photographs, reaching a consensus on the regions affected by AK.

Data Augmentation and Dataset Preparation

Semantic segmentation is a dense prediction task that requires pixel-level annotation. In medical imaging, however, obtaining such labels is particularly challenging, as it demands expert knowledge and is both time-consuming and costly.
Despite this, successful models have been trained on relatively small datasets. The solution to the limited availability of annotated medical images is excessive data augmentation [30].
While traditional augmentation techniques—such as elastic deformations, rotations, and flips—are commonly used to expand training data, we adopted a strategy more aligned with the clinical characteristics of our dataset. Specifically, multiple photographs were taken per patient to capture the presence of lesions across the entire face. These photographs were intended to provide different views of the same lesions, resulting in 569 annotated clinical photographs.
Given the high spatial resolution of the images (4016 × 6016 pixels), model training was performed using rectangular crops (tiles). The crop size was chosen to preserve sufficient contextual information. To this end, the images were initially downscaled by a factor of 0.5, and 512 × 512-pixel crops were extracted using translation-based lesion bounding boxes. This approach ensured that the resulting image tiles included lesions in varied positions and captured different contexts of the perilesional skin. In total, 16,891 lesion-center and translation-augmented patches were extracted. These augmentations provided natural variations in appearance and positioning without distorting the lesion morphology, allowing the model to learn more realistic and generalizable features.
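The following NumPy sketch illustrates this translation-based crop extraction around a single annotated lesion; the specific shift offsets are arbitrary examples, not the values used to build the dataset.

```python
import numpy as np

def lesion_crops(image, lesion_box, crop=512,
                 shifts=((0, 0), (-96, 0), (96, 0), (0, -96), (0, 96))):
    """Extract translation-augmented crops around one annotated lesion.
    The image is assumed to be already downscaled by 0.5; the shift offsets
    are arbitrary examples, not the values used in the study."""
    y0, x0, y1, x1 = lesion_box                       # lesion bounding box (pixels)
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2           # lesion centre
    H, W = image.shape[:2]
    crops = []
    for dy, dx in shifts:
        top = int(np.clip(cy + dy - crop // 2, 0, H - crop))
        left = int(np.clip(cx + dx - crop // 2, 0, W - crop))
        crops.append(image[top:top + crop, left:left + crop])
    return crops  # lesion appears at varied positions with different perilesional context
```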
For training, we used 510 clinical photographs acquired from 98 patients, from which 16,488 translation-augmented image crops were extracted. Approximately 20% of these crops (3298), corresponding to 15 patients, were reserved for validation. To evaluate the model’s performance, an independent test set comprising 59 photographs from 17 patients was used, yielding 403 central lesion crops without translation augmentation (Table 1).

3.3. Evaluation

To demonstrate the anticipated improvements in AKtransU-net in detecting AK lesions, we present comparative results using both CNN-based and CNN–Transformer hybrid architectures at the crop level. One of the CNN models evaluated was a deeper variant of AKU-net [14], which extends the original four hierarchical levels of AKU-net to five (AKU-net5). AKU-net is a U-net architecture with ConvLSTM modules integrated into the skip connections, without any Transformer components. Additionally, we assessed DeepLabv3+, a widely recognized benchmark model for the segmentation of medical images [40,41]. For CNN–Transformer hybrids, we included both TransU-net and HmsU-net in our comparisons. TransU-net is among the earliest and most extensively evaluated Transformer-based architectures for complex medical image segmentation tasks [24]. HmsU-net, currently considered a state-of-the-art model, has demonstrated superior performance across various medical image segmentation applications [29].
In semantic segmentation, several key evaluation metrics are employed to assess a model’s performance:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy} = \frac{TN + TP}{TN + FN + TP + FP}$$

$$\mathrm{Jaccard\ index} = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FN + FP}$$

$$\mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FN + FP} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$$

The notation $|\cdot|$ denotes the total number of pixels in a set. $TP$, $TN$, $FP$, and $FN$ are the true-positive, true-negative, false-positive, and false-negative pixel counts, respectively.
Precision represents the proportion of correctly predicted positives among all predicted positives. Sensitivity (or Recall) measures the model’s ability to detect actual positives, while Specificity reflects its ability to identify actual negatives. Accuracy captures the overall correctness of the model by accounting for both positive and negative predictions. The Dice coefficient (also known as the F1-score) effectively balances Precision and Sensitivity and is particularly useful in segmentation tasks where both false positives and false negatives matter. The Jaccard index, also known as Intersection over Union (IoU), measures the intersection between the predicted and ground-truth masks relative to their union.
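For reference, the Dice and Jaccard values discussed below can be computed from binary masks with a simple helper such as the following sketch.

```python
import numpy as np

def dice_and_jaccard(pred, gt, eps=1e-7):
    """Pixel-level Dice and Jaccard for binary masks (illustrative helper)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    jaccard = tp / (tp + fp + fn + eps)
    return dice, jaccard
```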
However, not all metrics are relevant to the nature of a specific segmentation task. Specificity and Accuracy, for example, can appear misleadingly high due to the overwhelming presence of TN, especially when the background dominates the image. In our case, the mean lesion coverage within the lesion-centered 512 × 512 image tiles is approximately 15%, yielding a background-to-lesion pixel ratio of about 7:1. This imbalance inflates both Specificity and Accuracy, leading to an overestimation of segmentation performance. This effect becomes even more pronounced during whole-image inference, where background regions become more prevalent.
Moreover, segmentation evaluation should reflect the clinical context and task-specific challenges. In the case of the detection of AKs, lesions often lack sharply defined borders, with margins that are blurred and gradually blend into the surrounding healthy skin, making precise delineation challenging even for experienced dermatologists. As a result, annotations are inherently approximate. IoU is strict and penalizes any deviation more heavily, making it less suitable for evaluating segmentations with loose annotations. By contrast, the Dice coefficient is less sensitive to boundary deviations, making it more suitable for tasks with imprecise or fuzzy annotations, such as AK segmentation.
Additionally, the segmentations that our experts (IB, GG, DE, AZ) consider acceptable for AK lesions tend to align more closely with the Dice scores, reflecting the metric’s strong agreement with expert judgment (Figure 3). Given its robustness in this context, Dice is adopted throughout the remainder of this study to compare segmentation models in a clinically meaningful way.
To rigorously compare model performance, we analyzed the Dice coefficients achieved by each method using non-parametric statistical tests, which are appropriate for repeated measures on the same dataset. A Friedman test was applied to determine whether performance differences existed across all of the models. Post hoc pairwise comparisons were performed using the Wilcoxon signed-rank test with the Bonferroni correction for multiple comparisons.
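A sketch of this testing procedure with SciPy is given below; `dice_per_model` is a hypothetical mapping from model name to per-image Dice scores computed over the same test cases.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def compare_models(dice_per_model, alpha=0.05):
    """Friedman test across all models, then Bonferroni-corrected Wilcoxon
    signed-rank pairwise comparisons. dice_per_model maps model name to a
    list of per-image Dice scores over the same test cases."""
    stat, p = friedmanchisquare(*dice_per_model.values())
    print(f"Friedman test: chi2 = {stat:.3f}, p = {p:.4g}")

    pairs = list(combinations(dice_per_model, 2))
    alpha_corrected = alpha / len(pairs)              # e.g. 0.05 / 15 for six models
    for a, b in pairs:
        _, p_pair = wilcoxon(dice_per_model[a], dice_per_model[b])
        verdict = "significant" if p_pair < alpha_corrected else "n.s."
        print(f"{a} vs {b}: p = {p_pair:.4g} ({verdict})")
```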
Prominent architectures at the crop level were finally evaluated for the detection of AKs in clinical images employing a sliding window approach. This method involves breaking down a large image into smaller overlapping tiles, making predictions on these patches, and then stitching them back together to form the final prediction. This technique enables the model to focus on smaller areas of the image, which is particularly advantageous when working with high-resolution images that display a wide range of textures, color intensities, and lesion appearances. This overlapping strategy helps mitigate the boundary effects that typically arise in segmentation tasks, where regions at the edges of crops may receive less attention from the model, or errors can even occur when the patch boundary splits objects. Once the crops have been processed and predictions are made, the next challenge is to merge these predictions back into a single, coherent segmentation map. Given the overlapping nature of the crops, multiple predictions will be made for the same pixel. Without proper weighting, inconsistencies or artifacts may arise, especially in overlapping regions where predictions from different crops may vary (Figure 4).
To ensure a seamless integration of predictions, we applied a Gaussian weighting technique. The idea is to prioritize the center of each crop when merging the predictions.
Pixels near the center of a crop tend to be more reliable, because they are farther from the crop’s boundaries, where predictions might be affected by edge artifacts or the limited context available to the model.
The Gaussian weight map, G x , y , is a 2D map that assigns higher weights to the central pixels and gradually reduces the weights towards the edges. It is defined as follows:
$$G(x, y) = \exp\!\left(-0.5 \times \frac{x^2 + y^2}{\sigma^2}\right)$$
where x and y are the pixel coordinates relative to the center of the crop, and σ controls the spread of the Gaussian function. The Gaussian map resembles a bell-shaped curve, where the central region of the crop is emphasized, and the boundary regions are down-weighted.
In our experiments, the Gaussian function’s influence extends up to one quarter of the crop’s width from its center. This ensures that the center region of each crop, which contains the most confident predictions, is weighted more heavily, while the outer pixels, which may be affected by edge effects, contribute less to the final prediction.
To reconstruct the full segmentation map from the individual crops, the predictions for each crop are multiplied by the corresponding Gaussian weight map, which emphasizes the central pixels and down-weights the predictions near the edges. This weighted prediction is added to a cumulative full-size prediction map.
In addition to the full prediction map, a cumulative weight map is maintained to track the total weight assigned to each pixel during the merging process. This ensures that areas with more overlap receive appropriate normalization.
For each crop, the weighted prediction is added to the full prediction map at its corresponding position, and the Gaussian weight map is added to the cumulative weight map. After all the crops have been processed, the final full-size prediction map is normalized by dividing it by the cumulative weight map, ensuring that each pixel’s final prediction is a weighted average of all the overlapping crops that contributed to it. This process can be expressed as follows:
$$P_{\mathrm{final}}(x, y) = \frac{\sum_i G_i(x, y)\, P_i(x, y)}{\sum_i G_i(x, y)}$$

where $P_{\mathrm{final}}(x, y)$ is the final prediction for pixel $(x, y)$, and $P_i(x, y)$ and $G_i(x, y)$ are the prediction and the Gaussian weight map for the $i$-th crop, respectively.
The full-size segmentation map, reconstructed using the Gaussian-weighted predictions, is then thresholded to produce a binary mask.
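The sketch below summarizes the Gaussian-weighted merging described above; the choice σ = crop_size / 4 is one reading of the stated quarter-width influence, and the 0.5 binarization threshold is likewise an assumption.

```python
import numpy as np

def gaussian_weight_map(crop_size, sigma):
    """2D Gaussian weight map centred on the crop; edge pixels are down-weighted."""
    ax = np.arange(crop_size) - (crop_size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-0.5 * (xx ** 2 + yy ** 2) / sigma ** 2)

def merge_crop_predictions(image_shape, crop_positions, crop_preds,
                           crop_size=512, threshold=0.5):
    """Stitch overlapping crop predictions into a full-size probability map
    using Gaussian weighting, then threshold to a binary mask.
    sigma = crop_size / 4 and threshold = 0.5 are assumptions."""
    H, W = image_shape
    pred_sum = np.zeros((H, W), dtype=np.float32)    # cumulative weighted predictions
    weight_sum = np.zeros((H, W), dtype=np.float32)  # cumulative weights per pixel
    G = gaussian_weight_map(crop_size, sigma=crop_size / 4)

    for (top, left), pred in zip(crop_positions, crop_preds):
        pred_sum[top:top + crop_size, left:left + crop_size] += G * pred
        weight_sum[top:top + crop_size, left:left + crop_size] += G

    full_map = pred_sum / np.maximum(weight_sum, 1e-7)  # weighted average per pixel
    return (full_map > threshold).astype(np.uint8)
```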
Besides the overlapping tiles approach, we also experimented with the YOLOv8-Seg model, which processes the entire image to detect and segment AK regions. The YOLOv8-Seg model is an extension of the YOLOv8 architecture explicitly designed for instance segmentation tasks [42,43,44]. While it retains the core object detection framework of YOLOv8, it adds segmentation capabilities, enabling it to predict pixel-wise masks in addition to bounding boxes and class probabilities for objects in an image. This allows the model to not only detect objects but also to separate them from the background with more precise segmentation.
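For comparison, a hypothetical Ultralytics fine-tuning and inference setup is sketched below; the checkpoint name and the `ak_seg.yaml` dataset file are placeholders, and only the 1120-pixel input size reflects the configuration reported in Section 3.4.

```python
from ultralytics import YOLO

# Hypothetical fine-tuning/inference setup for YOLOv8-Seg; the checkpoint and
# dataset YAML are placeholders. Only imgsz=1120 mirrors the rescaling
# reported in Section 3.4.
model = YOLO("yolov8s-seg.pt")                          # pretrained segmentation weights
model.train(data="ak_seg.yaml", imgsz=1120, epochs=100)
results = model.predict("clinical_photo.jpg", imgsz=1120)
masks = results[0].masks                                # per-instance segmentation masks
```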
To assess the segmentation performance of the proposed models, we computed the median Dice coefficient (%) for each model.

3.4. Implementation Details

The models were implemented in Python 3 using the TensorFlow 2.15 framework or PyTorch 1.11 (HmsU-net, TransU-net). Training was conducted on a system equipped with an NVIDIA A100 GPU and 100 GB of RAM. All models were trained for 100 epochs using the Binary Cross-Entropy loss function and the Adam optimizer, with an initial learning rate of 0.001 and a weight decay of 1 × 10−6. The batch size was set to 32, which was chosen based on the memory efficiency and throughput characteristics of the NVIDIA A100 GPU, as recommended by the manufacturer for deep learning workloads.
Due to computational constraints, the image crops were downscaled by a factor of 0.5, resulting in a final resolution of 256 × 256 pixels. For YOLOv8-Seg and the whole-image AK segmentation task, the images were instead rescaled to 1120 × 1120 pixels.
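A minimal Keras training configuration mirroring these hyperparameters is sketched below; `model`, `train_ds`, and `val_ds` are assumed to be defined elsewhere, and the monitoring metric is an illustrative choice.

```python
import tensorflow as tf

# Sketch of the reported training setup: Adam (lr = 0.001, weight decay = 1e-6),
# Binary Cross-Entropy loss, batch size 32, 100 epochs. `model`, `train_ds`,
# and `val_ds` (tf.data pipelines of image/mask pairs) are assumed available.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, weight_decay=1e-6)
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryIoU(threshold=0.5)])
model.fit(train_ds.batch(32), validation_data=val_ds.batch(32), epochs=100)
```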

4. Results

4.1. AK Detection in Localized Image Regions (256 × 256 Crops)

Segmentation performance at the crop level was summarized as the median Dice coefficient (%) for each model, and post hoc pairwise comparisons were performed using the Wilcoxon signed-rank test with the Bonferroni correction for multiple comparisons. With 15 pairwise model comparisons, the Bonferroni-adjusted threshold for significance was
α_corrected = 0.05/15 ≈ 0.0033
The violin plot in Figure 5 visually summarizes these results. Horizontal lines indicate each pairwise test. Asterisks denote adjusted p-value thresholds as follows:
*: p < 0.01 (nominal, but not significant under Bonferroni);
**: p < α_corrected (statistically significant under Bonferroni);
***: p < 0.001 (highly significant).
Figure 5. Violin plots of Dice coefficient distributions for six models. Pairwise comparisons were conducted using Wilcoxon signed-rank tests with Bonferroni correction for multiple testing (15 comparisons; α_corrected ≈ 0.0033). Asterisks indicate significance levels: *: p < 0.01 (not significant after Bonferroni correction), **: p < 0.0033 (Bonferroni-corrected significance), ***: p < 0.001 (highly significant). The shape and spread of each violin reflect performance consistency. Models like HmsU-net exhibit a narrow distribution with a high median, indicating both a strong and stable performance across test cases.
Moreover, Table 2 compiles the pairwise statistical comparisons against the baseline AKU-net model [14].
The results indicate that HmsU-net significantly outperforms all other models in terms of both central tendency and consistency of performance. Both HmsU-net and AKtransU-net exhibited highly significant improvements (***) over the reference model, AKU-net, with HmsU-net achieving a substantial increase of more than 18 percentage points in its median Dice score.
In contrast, the improvements observed with TransU-net and AKU-net5, while nominally better than the baseline, did not withstand statistical correction for multiple comparisons and thus lacked statistical robustness. Similarly, DeepLabv3+ showed no meaningful improvement over AKU-net, either in raw performance (60.24% median Dice) or in statistical significance.
A visual demonstration of the efficiency of the evaluated models is given in Figure 6.

4.2. Lesion Detection in Full Clinical Photographs—Whole-Image Inference

To handle lesion detection in whole images, we compared three crop-based scanning strategies during semantic segmentation and evaluated their impact on segmentation performance. The first tested approach was simple block processing, involving non-overlapping image tiling into fixed-size crops (512 × 512 pixels), with predictions directly concatenated. While computationally efficient, this approach introduced artifacts at crop boundaries due to the lack of contextual information. Next, overlapping crop inference was applied using a sliding window with a stride of either 256 × 256 pixels or 128 × 128 pixels, where overlapping predictions were averaged to produce the final segmentation. Finally, we employed a Gaussian-weighted overlapping inference that used the same stride (Table 3). The quantitative evaluation revealed a progressive improvement in segmentation accuracy and boundary continuity across methods. The Gaussian-weighted stride approach achieved the highest segmentation accuracy, with a median Dice score of 65.13%, followed by the overlapping average stride (48.11%) and simple block processing (45.74%). Pairwise comparisons using the Wilcoxon signed-rank test confirmed that both alternative strategies performed significantly worse than the Gaussian-stride baseline (p < 0.01 for both comparisons). These results demonstrate that Gaussian weighting over overlapping patches effectively improves segmentation consistency and accuracy in large, high-resolution photographs.
The wide-area segmentation performance of AKtransU-net, using a Gaussian-weighted stride (base model), was compared with that of HmsU-net and YOLOv8-Seg. The results compiled in Table 4 show that AKtransU-net performed significantly better than HmsU-net at a 256 × 256 stride (p < 0.05), while HmsU-net at a 128 × 128 stride and YOLOv8-Seg showed no significant difference from AKtransU-net (p > 0.05). However, a stride of 128 × 128 pixels requires approximately four times more inference computations than a stride of 256 × 256 pixels for the same image. Figure 7 illustrates the qualitative results of AK detection on full-face images.
We performed an additional evaluation using targeted patch extraction to estimate the upper bound of segmentation performance for each model. Specifically, crops were extracted with their centers aligned to the annotated AK lesions. This lesion-centered extraction strategy ensured that each crop fully encompassed the lesion of interest while minimizing background variability and irrelevant tissue. As a result, the models were evaluated at whole-image detection under optimal conditions that approximate the maximum achievable segmentation performance. These results provide a helpful reference point for interpreting the practical scanning performance of each model under standard whole-image tiling conditions.
Based on the lesion-centered patch extraction, the upper-bound evaluation for whole-image inference confirmed the superiority of hybrid CNN + ViT architectures over their CNN-only counterparts, highlighting the benefit of incorporating global attention mechanisms (Table 5). Moreover, AKtransU-net and HmsU-net achieved the highest segmentation performance. Both models achieved comparable median Dice scores under these ideal conditions (p > 0.05), indicating a similar potential when the lesion is fully captured within the input crop. However, the results of the full-image scanning experiments demonstrated substantial performance differences between the models, with AKtransU-net consistently outperforming HmsU-net under standard tiling conditions. This divergence suggests that the scanning strategy and variability in context have a strong influence on models’ behavior. Specifically, the performance of HmsU-net appeared more sensitive to changes in scanning configuration and patch overlap, while AKtransU-net maintained a more consistent accuracy.
These findings underscore the crucial role of scanning methodology in real-world segmentation applications, suggesting that evaluating models solely under optimal conditions may underestimate their variability and robustness during deployment.

5. Discussion

The integration of artificial intelligence into clinical photography holds significant potential for enhancing the monitoring of skin conditions, such as actinic keratosis, and for the broader context of skin field cancerization. AK lesions are often subtle, poorly defined, and diffusely distributed across chronically sun-damaged skin, making their accurate and consistent evaluation challenging, even for experienced clinicians. Given that cutaneous field cancerization progresses gradually and exhibits complex spatial patterns, there is a clear need for an advanced semantic segmentation model tailored explicitly to these clinical demands. If AK lesions can be reliably identified across complex, photodamaged skin, then a range of critical capabilities emerge, such as the consistent assessment of disease burden, the early detection of subclinical field changes, and the objective quantification of treatment response over time. Furthermore, such a system would enable standardized follow-up protocols by providing clinicians with detailed, reproducible, and spatially coherent information. In essence, high-performance AK detection is the foundation for longitudinal monitoring, therapeutic planning, and, ultimately, the advancement of the precise and personalized management of field-directed therapies in dermatology.
In this work, we extended a previously introduced U-net architecture enhanced with ConvLSTM modules by further integrating ViT encoding into the skip connections. Our goal was to target improvements where they are most critically needed: preserving spatial detail while incorporating a global contextual understanding. The inclusion of ViT modules in the skip pathways enabled the model to retain fine-grained localization features while simultaneously gaining a broader semantic view of the image. This global awareness is vital for suppressing false positives, as lesion detection benefits from an enhanced understanding of the surrounding tissue context. This is supported by a comparison with the state-of-the-art hybrid architecture HmsU-Net, where our model exhibited a more robust context awareness in whole-image assessments. The achieved Dice score of 65.13% represents the highest performance reported to date, demonstrating significant progress (Table 4).
To mitigate boundary artifacts and improve the smoothness and coherence of the final segmentation map, Gaussian stride processing was employed, which applies a weighted blending of overlapping tile outputs and improves detection accuracy by approximately 17 percentage points in median Dice relative to plain averaging of overlapping predictions (Table 3). While this method enables the processing of full-size images on limited hardware, it inherently fragments spatial information. Moreover, upper-bound evaluations revealed the gap between the tile-wise and true detection accuracy of targeted lesions, indicating room for improvement (Table 5).
Alternatively, YOLOv8-Seg offers a segmentation solution that processes the entire image in a single pass, preserving the global context. While its performance is comparable to that of the proposed AKTransU-net (Table 4), it appears to have reached a performance plateau in the context of AK segmentation. A key limitation is its maximum input resolution of 1120 × 1120 pixels, which is constrained by hardware and the architectural design. This restricts its capacity for further improvement in high-resolution dermatological image analysis.
In contrast, tile-based processing offers a more flexible and scalable framework, with significant potential for further improvement through advancements in both model architecture and tile management strategies.
Finally, while the current architecture prioritizes segmentation accuracy and contextual understanding, its computational complexity remains a limitation, particularly due to the inclusion of multi-scale Transformer blocks. In future work, we aim to explore model compression techniques, lightweight Transformer variants, and efficient attention mechanisms to improve its inference speed and make the model more suitable for deployment on resource-constrained devices.

6. Conclusions

The introduction of Transformers into visual tasks has significantly expanded the deep learning armamentarium, providing powerful tools for modeling long-range dependencies and capturing a global context—capabilities that are particularly valuable in clinical imaging. This is especially important for challenging applications, such as skin field cancerization, where the detection of AK lesions depends on both fine detail and a broader contextual understanding. As high-resolution imaging becomes increasingly central to dermatologic care, Transformer-driven context modeling offers a promising path forward for robust, scalable, and clinically meaningful AI-assisted monitoring.

Author Contributions

Conceptualization, P.D., P.S., A.L., I.B. and G.G.; methodology, P.D., P.S. and A.L.; software, P.D., C.T. and P.S.; validation, P.D., P.S. and G.G.; resources, G.G., I.B., D.E. and A.Z.; data curation, G.G., I.B., D.E. and A.Z.; writing—original draft preparation, P.S. and P.D.; writing—review and editing, P.S., P.D., C.T., A.L., G.G. and I.B.; visualization, G.G., I.B. and D.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the University Hospital of Ioannina (Approval No.: 3/17-2-2015 [θ.17], approval date: 17 February 2015).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patients to publish this paper.

Data Availability Statement

The code is available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AK: Actinic Keratosis
CNN: Convolutional Neural Network
ViT: Vision Transformer
ConvLSTM: Convolutional Long Short-Term Memory
TransU-net: Transformer U-Net
HmsU-net: Hybrid Multi-Scale U-Net
YOLO: You Only Look Once
BConvLSTM: Bidirectional Convolutional Long Short-Term Memory

References

  1. Willenbrink, T.J.; Ruiz, E.S.; Cornejo, C.M.; Schmults, C.D.; Arron, S.T.; Jambusaria-Pahlajani, A. Field cancerization: Definition, epidemiology, risk factors, and outcomes. J. Am. Acad. Dermatol. 2020, 83, 709–717. [Google Scholar] [CrossRef]
  2. Nart, I.F.; Cerio, R.; Dirschka, T.; Dréno, B.; Lear, J.T.; Pellacani, G.; Peris, K.; de Casas, A.R. Defining the actinic keratosis field: A literature review and discussion. J. Eur. Acad. Dermatol. Venereol. 2018, 32, 544–563. [Google Scholar] [CrossRef]
  3. Schmitz, L.; Broganelli, P.; Boada, A. Classifying Actinic Keratosis: What the Reality of Everyday Clinical Practice Shows Us. J. Drugs Dermatol. 2022, 21, 845–849. [Google Scholar] [CrossRef]
  4. Malvehy, J.; Stratigos, A.J.; Bagot, M.; Stockfleth, E.; Ezzedine, K.; Delarue, A. Actinic keratosis: Current challenges and unanswered questions. J. Eur. Acad. Dermatol. Venereol. 2024, 38 (Suppl. S5), 3–11. [Google Scholar] [CrossRef]
  5. Ezzedine, K.; Painchault, C.; Brignone, M. Systematic Literature Review and Network Meta-analysis of the Efficacy and Acceptability of Interventions in Actinic Keratoses. Acta Derm. Venereol. 2021, 101, adv00358. [Google Scholar] [CrossRef]
  6. Epstein, E. Quantifying actinic keratosis: Assessing the evidence. Am. J. Clin. Dermatol. 2004, 5, 141–144. [Google Scholar] [CrossRef]
  7. Baker, C.; James, A.; Supranowicz, M.; Spelman, L.; Shumack, S.; Cole, J.; Weightman, W.; Sinclair, R.; Foley, P. Method of Assessing Skin Cancerization and KeratosesTM (MASCKTM): Development and photographic validation in multiple anatomical sites of a novel assessment tool intended for clinical evaluation of patients with extensive skin field cancerization. Clin. Exp. Dermatol. 2022, 47, 1144–1153. [Google Scholar] [CrossRef]
  8. Pasquali, P. Photography in Clinical Medicine, 1st ed.; Springer: Cham, Switzerland, 2020. [Google Scholar]
  9. Hames, S.C.; Sinnya, S.; Tan, J.M.; Morze, C.; Sahebian, A.; Soyer, H.P.; Prow, T.W. Automated detection of actinic keratoses in clinical photographs. PLoS ONE 2015, 10, e0112447. [Google Scholar] [CrossRef]
  10. Spyridonos, P.; Gaitanis, G.; Likas, A.; Bassukas, I.D. Automatic discrimination of actinic keratoses from clinical photographs. Comput. Biol. Med. 2017, 88, 50–59. [Google Scholar] [CrossRef]
  11. Nanni, L.; Paci, M.; Maguolo, G.; Ghidoni, S. Deep learning for actinic keratosis classification. AIMS Electron. Electr. Eng. 2020, 4, 47–56. [Google Scholar] [CrossRef]
  12. Spyridonos, P.; Gaitanis, G.; Likas, A.; Bassukas, I.D. Late fusion of deep and shallow features to improve discrimination of actinic keratosis from normal skin using clinical photography. Skin. Res. Technol. 2019, 25, 538–543. [Google Scholar] [CrossRef] [PubMed]
  13. Spyridonos, P.; Gaitanis, G.; Likas, A.; Bassukas, I.D. A convolutional neural network based system for detection of actinic keratosis in clinical images of cutaneous field cancerization. Biomed. Signal Process Control. 2023, 79, 104059. [Google Scholar] [CrossRef]
  14. Derekas, P.; Spyridonos, P.; Likas, A.; Zampeta, A.; Gaitanis, G.; Bassukas, I. The Promise of Semantic Segmentation in Detecting Actinic Keratosis Using Clinical Photography in the Wild. Cancers 2023, 15, 4861. [Google Scholar] [CrossRef]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  17. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 548–558. [Google Scholar] [CrossRef]
  18. Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 12239–12249. [Google Scholar] [CrossRef]
  19. Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057. [Google Scholar] [CrossRef]
  20. Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical Image Segmentation Review: The Success of U-Net. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10076–10095. [Google Scholar] [CrossRef] [PubMed]
  21. Pu, Q.; Xi, Z.; Yin, S.; Zhao, Z.; Zhao, L. Advantages of transformer and its application for medical image segmentation: A survey. Biomed. Eng. Online 2024, 23, 14. [Google Scholar] [CrossRef]
  22. Khan, R.F.; Lee, B.D.; Lee, M.S. Transformers in medical image segmentation: A narrative review. Quant. Imaging Med. Surg. 2023, 13, 8747–8767. [Google Scholar] [CrossRef]
  23. Xiao, H.; Li, L.; Liu, Q.; Zhu, X.; Zhang, Q. Transformers in medical image segmentation: A review. Biomed. Signal Process Control. 2023, 84, 104791. [Google Scholar] [CrossRef]
  24. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  25. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280. [Google Scholar] [CrossRef] [PubMed]
  26. Atek, S.; Mehidi, I.; Jabri, D.; Belkhiat, D.E.C. SwinT-Unet: Hybrid architecture for Medical Image Segmentation Based on Swin transformer block and Dual-Scale Information. In Proceedings of the 2022 7th International Conference on Image and Signal Processing and Their Applications (ISPA), Mostaganem, Algeria, 8–9 May 2022; pp. 1–6. [Google Scholar] [CrossRef]
  27. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 4005615. [Google Scholar] [CrossRef]
  28. Wang, H.; Cao, P.; Wang, J.; Zaïane, O. UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-Wise Perspective with Transformer. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2441–2449. [Google Scholar] [CrossRef]
  29. Fu, B.; Peng, Y.; He, J.; Tian, C.; Sun, X.; Wang, R. HmsU-Net: A hybrid multi-scale U-net based on a CNN and transformer for medical image segmentation. Comput. Biol. Med. 2024, 170, 108013. [Google Scholar] [CrossRef]
  30. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2015; Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  31. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef]
  32. Chen, S.; Zou, Y.; Liu, P.X. IBA-U-Net: Attentive BConvLSTM U-Net with Redesigned Inception for medical image segmentation. Comput. Biol. Med. 2021, 135, 104551. [Google Scholar] [CrossRef]
  33. Zhang, X.; Yang, S.; Jiang, Y.; Chen, Y.; Sun, F. FAFS-UNet: Redesigning skip connections in UNet with feature aggregation and feature selection. Comput. Biol. Med. 2024, 170, 108009. [Google Scholar] [CrossRef] [PubMed]
  34. Byeon, W.; Breuel, T.M.; Raue, F.; Liwicki, M. Scene labeling with LSTM recurrent neural networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3547–3555. [Google Scholar] [CrossRef]
  35. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28, 802–810. Available online: https://arxiv.org/abs/1506.04214v2 (accessed on 4 July 2023).
  36. Azad, R.; Asadi-Aghbolaghi, M.; Fathy, M.; Escalera, S. Bi-Directional ConvLSTM U-Net with Densley Connected Convolutions. Int. Conf. Comput. Vis. Workshop 2019, 2019, 406–415. [Google Scholar] [CrossRef]
  37. Jiang, X.; Jiang, J.; Wang, B.; Yu, J.; Wang, J. SEACU-Net: Attentive ConvLSTM U-Net with squeeze-and-excitation layer for skin lesion segmentation. Comput. Methods Programs Biomed. 2022, 225, 107076. [Google Scholar] [CrossRef]
  38. Muccini, J.A.; Kollias, N.; Phillips, S.B.; Anderson, R.R.; Sober, A.J.; Stiller, M.J.; Drake, L.A. Polarized light photography in the evaluation of photoaging. J. Am. Acad. Dermatol. 1995, 33, 765–769. [Google Scholar] [CrossRef] [PubMed]
  39. ISO 12232:2019; Photography—Digital Still Cameras—Determination of Exposure Index, ISO Speed Ratings, Standard Output Sensitivity, and Recommended Exposure Index. ISO: Geneva, Switzerland, 2019.
  40. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); ECCV: Munich, Germany, 2018; Volume 11211, pp. 833–851. [Google Scholar] [CrossRef]
  41. Prokopiou, I.; Spyridonos, P. Highlighting the Advanced Capabilities and the Computational Efficiency of DeepLabV3+ in Medical Image Segmentation: An Ablation Study. BioMedInformatics 2025, 5, 10. [Google Scholar] [CrossRef]
  42. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024. [Google Scholar] [CrossRef]
  43. Zhang, Z.; Zou, Y.; Tan, Y.; Zhou, C. YOLOv8-seg-CP: A lightweight instance segmentation algorithm for chip pad based on improved YOLOv8-seg model. Sci. Rep. 2024, 14, 27716. [Google Scholar] [CrossRef] [PubMed]
  44. Explore Ultralytics YOLOv8. Ultralytics YOLO Docs (n.d.). Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 29 May 2025).
Figure 1. AKtransU-net overview. The encoder consists of four stages, where feature maps are progressively down-sampled using convolutional and pooling layers. Each encoder-level input is passed to a Transformer block. ConvLSTM units in skip connections sequentially process convolutional, Transformer, and up-sampled decoder features.
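A minimal sketch of the skip-connection fusion described in the Figure 1 caption, assuming TensorFlow/Keras and feature maps of matching spatial size and channel count; it is not the authors' implementation, and the function and argument names are illustrative only.

```python
import tensorflow as tf

def convlstm_skip(conv_feat, transformer_feat, decoder_feat, filters):
    """Fuse encoder, Transformer, and up-sampled decoder features.

    The three maps (each of shape [B, H, W, C]) are stacked along a
    pseudo-time axis and processed sequentially by a ConvLSTM2D layer,
    mirroring the role of the ConvLSTM units in the skip connections
    sketched in Figure 1.
    """
    seq = tf.stack([conv_feat, transformer_feat, decoder_feat], axis=1)  # [B, 3, H, W, C]
    fused = tf.keras.layers.ConvLSTM2D(
        filters=filters, kernel_size=3, padding="same",
        return_sequences=False)(seq)                                     # [B, H, W, filters]
    return fused
```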
Figure 2. ViT encoder adapted from [15].
Figure 3. Segmentation results considered acceptable by clinical experts, with corresponding Dice and Jaccard scores [Dice, Jaccard]. Red contours represent ground-truth annotations, while green contours show model predictions. In cases where segmentations exhibit slight boundary mismatches, the Dice scores remain relatively high, whereas the Jaccard scores drop more sharply due to their stricter penalization of disagreement. Dice better aligns with clinical judgment in cases of ambiguous lesion margins, such as those seen in AK, where exact boundaries are often difficult to define.
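For reference, the two overlap metrics quoted in Figure 3 can be computed as in the NumPy sketch below; this is illustrative code, not code from the study. Because Jaccard relates to Dice as J = D / (2 − D), a Dice of 0.70 corresponds to a Jaccard of only about 0.54, which is why Jaccard drops more sharply under slight boundary mismatch.

```python
import numpy as np

def dice_jaccard(pred, gt, eps=1e-8):
    """Dice and Jaccard overlap for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    jaccard = inter / (np.logical_or(pred, gt).sum() + eps)
    return dice, jaccard
```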
Figure 4. The left panel shows the original facial image overlaid with a grid indicating tile-based segmentation. The middle panel illustrates the result of simple tile-wise processing followed by direct stitching, which leads to missed detections and boundary artifacts in AK localization. In contrast, the right panel demonstrates the improved AK detection achieved through Gaussian-weighted blending across tile boundaries, which alleviates discontinuities and enhances detection accuracy and consistency.
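A hedged sketch of the Gaussian-weighted blending idea illustrated in Figure 4, assuming 512 × 512 tiles that lie fully inside the image, per-tile probability maps, and an illustrative Gaussian width; tile size, stride handling, and parameter names are assumptions, not the published implementation.

```python
import numpy as np

def gaussian_window(tile=512, sigma_frac=0.25):
    """2-D Gaussian weight map that peaks at the tile centre."""
    ax = np.arange(tile) - (tile - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * (sigma_frac * tile) ** 2))
    return np.outer(g, g)

def blend_tiles(image_shape, tile_preds, tile_origins, tile=512):
    """Accumulate Gaussian-weighted tile probabilities and normalise.

    tile_preds   : list of [tile, tile] probability maps from the segmenter
    tile_origins : list of (row, col) top-left coordinates of each tile
    """
    h, w = image_shape
    acc = np.zeros((h, w), dtype=np.float32)
    wsum = np.full((h, w), 1e-8, dtype=np.float32)
    win = gaussian_window(tile)
    for pred, (y, x) in zip(tile_preds, tile_origins):
        acc[y:y + tile, x:x + tile] += pred * win
        wsum[y:y + tile, x:x + tile] += win
    return acc / wsum  # per-pixel probability; threshold (e.g., 0.5) for the mask
```

Down-weighting tile borders means that pixels near a tile edge are dominated by the overlapping tile in which they lie closer to the centre, which is what suppresses the stitching artifacts visible in the middle panel of Figure 4.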
Figure 6. Qualitative comparison of lesion segmentation results on cropped cases where the base model, AKU-net, underperformed. Red contours indicate the ground-truth segmentations, while green contours represent the predicted segmentations by each model.
Figure 7. Qualitative comparison of AK detection results on whole-face images. From left to right: AKtransU-net (stride 256 × 256), HmsU-net (256 × 256), HmsU-net (128 × 128), and YOLOSegV8. Red contours denote the ground-truth segmentations, while green contours indicate the predictions generated by each model.
Table 1. Dataset splitting into train, validation, and test sets.
              Patients   Images   Image Crops   Augmentation Technique
Train            83       410       13,190      Translation
Validation       15       100        3298       Translation
Test             17        59         403       None
Total           115       569       16,891
Table 2. Median Dice coefficients and statistical significance (p-values) compared to the baseline AKU-net model. *: p < 0.01 (nominal, but not significant under Bonferroni), ***: p < 0.001 (highly significant).
Model                       Median Dice (%)   p-Value (vs. Baseline Model)
AKU-net (baseline model)        57.21
AKU-net5                        63.16         *
DeepLabv3+                      60.24         -
TransU-net                      61.70         *
HmsU-net                        75.68         ***
AKtransU-net                    65.10         ***
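The per-model comparisons summarized in Table 2 can be reproduced in outline with SciPy's Wilcoxon signed-rank test plus a Bonferroni adjustment; the sketch below is an assumption about the analysis workflow (variable names and score arrays are hypothetical), not the authors' script.

```python
from scipy.stats import wilcoxon

def compare_to_baseline(baseline_dice, model_dice, alpha=0.05):
    """Paired Wilcoxon tests of each model's per-image Dice vs. the baseline.

    baseline_dice : sequence of Dice scores, one per test image
    model_dice    : dict mapping model name -> matching sequence of scores
    """
    n_tests = len(model_dice)
    results = {}
    for name, scores in model_dice.items():
        _, p = wilcoxon(scores, baseline_dice)        # paired, per test image
        results[name] = (p, p < alpha / n_tests)      # Bonferroni-adjusted decision
    return results
```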
Table 3. Performance of the whole-image AKtransU-net. Pairwise comparisons using the Wilcoxon signed-rank test showed that Gaussian weighting (the base method) significantly improved segmentation consistency compared to alternative strategies, such as block processing or uniform averaging of overlapping patches. Decreasing the stride from 256 to 128 pixels did not improve the segmentation accuracy of the AKtransU-net model.
Scanning Method                                   Median Dice (%)   p-Value (vs. Baseline Method)
Gaussian-weighted stride (256 × 256), baseline        65.13
Gaussian-weighted stride (128 × 128)                  64.70         p > 0.05
Average overlapping stride (256 × 256)                48.11         p < 0.01
Non-overlapping block processing                      45.74         p < 0.01
Table 4. Comparison of segmentation performance between AKtransU-net, HmsU-net, and YOLOSegV8 models. AKtransU-net with a stride of 256 × 256 pixels served as the baseline model. Pairwise comparisons were performed using the Wilcoxon signed-rank test.
Model                                        Median Dice (%)   p-Value (vs. Baseline Model)
AKtransU-net, stride (256 × 256), baseline       65.13
HmsU-net (256 × 256)                             52.39          p < 0.05
HmsU-net (128 × 128)                             61.21          p > 0.05
YOLOSegV8                                        61.44          p > 0.05
Table 5. Upper-bound segmentation performance confirmed the superiority of hybrid CNN + ViT architectures (AKtransU-net and HmsU-net) over their CNN-only counterparts.
Model                   Median Dice (%)
AKU-net (base model)        59.21
AKU-net5                    61.35
DeepLabv3+                  60.70
TransU-net                  67.51
AKtransU-net                73.34
HmsU-net                    74.02