Article

HF-EdgeFormer: A Hybrid High-Order Focus and Transformer-Based Model for Oral Ulcer Segmentation

by
Dragoș-Ciprian Cornoiu
and
Călin-Adrian Popa
*
Department of Computers and Information Technology, Politehnica University of Timișoara, 300223 Timișoara, Romania
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 595; https://doi.org/10.3390/electronics15030595
Submission received: 9 June 2025 / Revised: 9 November 2025 / Accepted: 17 November 2025 / Published: 29 January 2026
(This article belongs to the Special Issue Artificial Intelligence and Deep Learning Techniques for Healthcare)

Abstract

Precise segmentation of oral ulcers is crucial for early diagnosis, but it remains a very challenging task due to cluttered backgrounds, overexposed or underexposed lesions, and the complex surrounding tissue. To address this challenge, this paper introduces HF-EdgeFormer, a novel hybrid model for oral ulcer segmentation on the AutoOral dataset. To the best of our knowledge based on publicly available models, this U-shaped, transformer-like architecture is the second documented solution for oral ulcer segmentation, and it explicitly integrates high-order frequency interactions by using multi-dimensional edge cues. At the encoding stage, an HFConv (High-order Focus Convolution) module divides the feature channels into local and global streams, performs learnable filtering via the FFT and depth-wise convolutions, and then fuses them through stacks of focal transformers and attention gates. In addition to the HFConv block, two edge-aware units reinforce lesion boundaries: the EdgeAware localization module (which uses eight-direction Sobel filters) and a new PrecisionEdgeEnhance module (channel-wise Sobel fusion). Skip connections employ Multi-dilation Attention Gates, accompanied by a Spatial-Channel Attention Bridge that accentuates lesion-consistent activations. Moreover, the architecture employs a lightweight vision transformer-based bottleneck consisting of four SegFormerBlock modules placed at the network's deepest point, achieving global relational modeling exactly where the receptive field is largest. The model is trained on the AutoOral dataset (introduced by the team that developed the HF-UNet architecture); because the number of available images is limited, the data were extended with extensive geometric and photometric augmentations (such as RandomAffine, flips, and rotations). The architecture achieves a test Dice score of almost 82% and slightly over 85% sensitivity while maintaining high precision and specificity, which is highly valuable in medical segmentation. These results surpass the prior HF-UNet baseline while keeping the model lightweight, with only a modest increase in inference memory.

1. Introduction

Oral ulcers and oral cancerous lesions, characterized by an anomaly or disruption of the mucosa of the oral cavity, are common lesions that can substantially alter a patient’s quality of life. Some oral ulcers are mild and self-limiting, while others may indicate underlying systemic conditions or even malignancies, such as oral squamous-cell carcinoma (OSCC). Given the increasing number of patients worldwide, early diagnosis leads to faster interventions, which are crucial for effective management and could offer higher chances of cure. Technological advances in artificial intelligence, accompanied by deep learning, have transformed the automatic detection and classification of numerous pathologies. Convolutional neural networks (CNNs), especially UNet-like architectures, have shown remarkable success in biomedical segmentation tasks; in oral health, however, research is still in its early stages, and approaches such as AI-driven detection and classification of oral lesions are continuously evolving. First, oral ulcers are difficult to classify, mostly because they can be either benign (e.g., aphthous) or malignant; they also arise from both local and systemic causes, such as trauma, infections, or autoimmune conditions, and often present as painful sores with diverse crateriform appearances [1]. Recent reports have highlighted many cases with overlapping clinical features, which makes diagnosis more difficult. Second, malignant lesions are frequently mistaken for benign ones, which directly leads to diagnostic delays. Such mistakes can have serious consequences, mainly because morbidity and mortality rates are high if treatment does not begin early [2].
Computer-aided diagnosis (CAD) systems have the potential to assist dentists by detecting and then segmenting lesions in oral images objectively and accurately. In recent years, deep learning-based segmentation models have achieved remarkable success in medical domains. One example is Zhang et al. [3], who introduced a transformer-based model that accurately delineates lesions of the oral mucosa, showing that precise automatic segmentation can serve as an effective auxiliary tool for clinical tasks. Such tools can be even more useful in settings with limited specialists and human resources. The main remaining challenge is false negatives caused by underexposure or insufficiently highlighted lesions. Such visual subtleties hamper visual inspection and are easy to miss, so even CAD systems require careful modeling and calibration. Despite the progress seen in recent years, oral ulcer segmentation remains underdeveloped compared with other segmentation subdomains, mainly because of the scarcity of public datasets. Until recently, most of the available studies relied on limited private collections. Last year, Wurenkai et al. introduced a high-quality public dataset of oral ulcer images in order to address this gap [10]. This dataset consists of about four hundred annotated ulcer images covering a wide range of cases. This paper proposes an improved UNet-like architecture for the AutoOral dataset that remains suitable even for poorly highlighted or otherwise difficult-to-analyze ulcer images. Compared with the current state-of-the-art model for this task (HF-UNet), the newly developed HF-EdgeFormer obtains the best overall performance. The highlights of this research can be described as follows:
1.
A new edge-refinement module (PrecisionEdgeEnhance) used in the mid-level encoders instead of the regular lesion-localization module present in the HF-UNet segmentation model.
2.
Introduction of two low-level ResNet34 encoders trained on ImageNet; these two encoders are frozen for the first fifteen epochs in order to preserve generic low-level features, helping the model to stabilize better early on.
3.
A more modern vision transformer bottleneck inspired by SegFormer; this bottleneck consists of SegFormerBlock modules, an efficient self-attention module, and a SegFormer MLP.
4.
Although the AutoOral dataset is larger than all previous collections, its size is still limited, so an extended augmentation pipeline was applied.
5.
Introduction of test-time augmentation at inference in order to let the model see the test images from different angles, then aggregating the predictions by averaging them.
6.
An upscaling method that can be applied on the resulting segmented images to increase the resolution, if needed.
7.
The code will be made publicly available on GitHub https://github.com/DragosCornoiu/HF-EdgeFormer (accessed on 7 June 2025).

2. Related Work

2.1. Image Segmentation

Deep learning-based image segmentation has shown exceptional performance thanks to its ability to learn complex relations and feature representations directly from data. UNet-like encoder–decoder architectures have been widely adopted, becoming some of the most influential solutions in biomedical image analysis. Even though the UNet has shown remarkable capabilities, it still struggles to capture long-range dependencies and global context, which are essential for segmenting low-contrast and small-scale lesions. To address these challenges, a series of improvements has been proposed, starting with attention mechanisms, residual connections, and multiscale feature-extraction modules. With these ideas in place and tuned carefully, models’ ability to distinguish subtle boundaries has increased significantly. The current focus is on models that can integrate both global and local semantic context in a balanced and effective manner.

2.2. Medical Image Segmentation

Since the early introduction of the encoder–decoder UNet architecture for biomedical segmentation, many enhancements have been explored to exploit attention mechanisms and multiscale features while maintaining computational efficiency. For instance, the UNet++ architecture uses dense skip connections to combine multilevel features, resulting in improved segmentation with fine-grained details well preserved [4]. Ullah et al. [5] proposed a multiscale residual attention UNet for brain tumor segmentation in MRI images. This method employs a cascaded architecture to facilitate multiscale feature learning and adaptively segments tumor regions with improved accuracy. Likewise, Attention UNet models employ gating mechanisms within the skip connections to suppress irrelevant regions and highlight salient structures of interest [6]. After these high-performance models appeared, attention shifted to computational overhead, leading to lightweight variants such as EGE-UNet, which notably reduces the number of parameters [7]. More recently, hybrid architectures and transformer-based models have emerged to address the well-known receptive-field limitations of pure CNNs. One remarkable model in this category is TransUNet, which combines a CNN encoder with a transformer encoder followed by a UNet decoder; this hybrid captures global context while preserving localization. There are also domain-specific derivatives of UNet, such as D-TrAttUnet for COVID-19 lung CT segmentation, which uses a dual-decoder design (separate decoders for organ masks and for infection) and a transformer–CNN encoder, yielding state-of-the-art lesion maps [8]. Similar to the model presented in this paper, the NTSM model is an oral mucosal-lesion segmenter based on the nnUNet backbone, incorporating difference-association and pyramid-attention modules that capture detailed lesion boundaries while keeping the number of parameters low [9].
In this paper, a novel architecture is introduced. Based on the high-order focus interaction network (HF-UNet) [10], the new, more refined architecture has several unique characteristics, which are elaborated in the following section.

3. Methods

3.1. Used Dataset

One of the most important aspects of deep learning is the quality of the data used; in the medical field in particular, there is a high demand for high-quality data. Using various processing techniques (such as clipping, affine transformations, and view-angle changes), our model is trained on the newly introduced AutoOral dataset, the first publicly available multi-task oral ulcer dataset. Its clinical validity is ensured by the fact that the samples were obtained from Ruijin Hospital in Shanghai. The content of the AutoOral dataset is presented in Table 1.
The AutoOral dataset contains 80 clinical cases, resulting in a total of 420 images collected between 2010 and 2023 and covering a wide age range (from 7 to 84 years old). Both genders are represented, with a male-to-female ratio of 3:2. The study cohort comprised patients both without comorbidities and with 1 of 12 distinct underlying conditions, such as anemia, hypertension, or nasopharyngeal carcinoma. The image size was standardized to 256 × 256, 24-bit RGB, while the ground truth for the segmentation task is an 8-bit image. The classification task involved five distinct ulcer types: cancerous ulcers, traumatic ulcers and traumatic blood blisters, herpes-like aphthous ulcers, mild aphthous ulcers, and severe aphthous ulcers. The distribution ratios among these categories were 9:9:15:18:22, respectively (with a few cases excluded). A chi-square test revealed statistically significant differences in both gender (p = 0.04) and age (p = 0.01) across the five types of ulcers. The dataset spanned a 13-year collection period, included patients of all ages, and encompassed 12 types of underlying diseases, collectively supporting the diversity and representativeness of the sample.

3.2. Overall Model Architecture

The proposed HF-EdgeFormer is a medical segmentation framework built around high-order focus interactions, designed specifically for oral ulcer segmentation. The overall architecture block scheme is presented in Figure 1. The input consists of oral ulcer images with a resolution of 256 × 256 pixels and 3 channels. The first two encoders are pretrained ResNet34 stages, modified to fit the model’s needs. These early components are frozen for the first 15 epochs in order to preserve generic features and maintain training stability. At stages 4 and 5, high-order focus interaction modules (HFblocks) are used, which enhance feature precision through the PrecisionEdgeEnhancer, followed by an ordinary convolution. At stage 6, the lesion-localization module (LL-M) used is the EdgeAwareUnit, which is also present in the decoder. During each stage, the focus information of different orders is extracted and fused by the HFblocks, and then the edge and shape information is extracted by the PrecisionEdgeEnhancer and the EdgeAwareUnit. The skip connections are handled by the multi-dilation attention gate (MDAG), also present in the HF-UNet model. This component acts as a filter, suppressing redundant or unimportant features and letting useful feature information pass through. The decoder mirrors the encoder, with the difference that all lesion-localization modules are EdgeAwareUnits. Another important architectural change is the new SegFormer bottleneck, a stack of four SegFormerBlocks, each a small transformer-style unit designed for efficient feature transformation. The flow inside each SegFormerBlock is as follows: LayerNorm on the input, efficient self-attention with optional spatial reduction, a skip connection, another LayerNorm, an MLP block (SegFormerMLP), and a second skip connection.

3.2.1. High-Order Focus Interaction Mechanism

The High-Order Focus Interaction Module (HFblock) adopts a transformer-inspired architecture in which the conventional self-attention layer is replaced by a novel High-Order Focus Convolution (HFConv). This substitution enables the module to maintain strong feature modeling capabilities while improving efficiency. A Dropout layer is incorporated within the residual connection path to promote better generalization. The proposed HFConv is a composite operation that supports long-range dependencies, focal attention, and high-order spatial interactions. It is composed of four core components: a linear projection layer, a focus module (FM), an attention gate (AG), and a global–local filter (GLF).
Traditional self-attention in vision transformers incurs a quadratic computational cost with increasing feature map resolution, which limits scalability. In contrast, HFConv circumvents this by replacing the self-attention mechanism with a combination of more efficient operations—namely, the focus module, GLF, and attention gating—thereby achieving high-order spatial interactions without exponential growth in complexity.

3.2.2. First-Order Focus Interactions

In order to explain the mechanism behind the module more clearly, we can begin with the First-Order Focus Convolution (1FConv). Let us consider an input feature map $X \in \mathbb{R}^{HW \times C}$, which passes through a linear projection layer that doubles the channel dimension to $2C$, resulting in two separate branches $A_0$ and $B_0$: $A_0, B_0 = \mathrm{Pro}_{\mathrm{in}}(x) \in \mathbb{R}^{HW \times 2C}$. After that, $B_0$ is processed by the GLF, $A_0$ is passed to the FM, and the results are fused by the AG mechanism. The final output has the following formula: $P = AG[\mathrm{GLF}(B_0), \mathrm{FM}(A_0)]$, $y = \mathrm{Pro}_{\mathrm{out}}(P)$, where $P \in \mathbb{R}^{HW \times C}$.
The architecture of the GLF combines global filtering with local convolutional processing. Inspired by Rao et al. [11], the global branch first applies layer normalization, followed by a 2D fast Fourier transform (FFT), a learnable filter in the frequency domain, and a 2D inverse FFT (IFFT). This allows the network to model both short- and long-range dependencies effectively. The local branch executes two standard convolutions: a 1 × 1 convolution to downscale the channel depth and a 3 × 3 convolution for local feature gathering. The outputs are concatenated and normalization is reapplied to maintain and merge rich spatial information. Remarkably, differing from earlier GLF versions, this structure uses full-channel attention on both global and local pathways, which increases feature diversity.
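To make the structure concrete, a minimal PyTorch sketch of such a global–local filter is given below. The module name, the fixed feature-map size, the BatchNorm layers standing in for the normalization steps, and the channel split of the local branch are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class GlobalLocalFilter(nn.Module):
    """Sketch of a global-local filter: FFT-based learnable filtering in the
    global branch, 1x1 + 3x3 convolutions in the local branch, concatenation
    and renormalization at the end."""
    def __init__(self, channels, spatial_size=32):
        super().__init__()
        self.norm_in = nn.BatchNorm2d(channels)
        # learnable complex filter; its size must match the (fixed) feature-map size
        freq_w = spatial_size // 2 + 1
        self.freq_filter = nn.Parameter(
            torch.randn(channels, spatial_size, freq_w, 2) * 0.02)
        # local branch: 1x1 conv to reduce channel depth, 3x3 conv for local context
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1),
            nn.Conv2d(channels // 2, channels // 2, kernel_size=3, padding=1))
        self.norm_out = nn.BatchNorm2d(channels + channels // 2)
        self.proj = nn.Conv2d(channels + channels // 2, channels, kernel_size=1)

    def forward(self, x):
        x = self.norm_in(x)
        # global branch: FFT -> learnable frequency-domain filter -> inverse FFT
        f = torch.fft.rfft2(x, norm="ortho")
        f = f * torch.view_as_complex(self.freq_filter)
        g = torch.fft.irfft2(f, s=x.shape[-2:], norm="ortho")
        # local branch, concatenation, and fusion back to the input channel count
        l = self.local(x)
        return self.proj(self.norm_out(torch.cat([g, l], dim=1)))
```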
The focus module (FM) builds upon the focal self-attention mechanism introduced in the focus transformer [12]. It expands the receptive field beyond that of conventional self-attention, enabling the simultaneous capture of local details and broader contextual cues. This module remains the same as in the HF-UNet model [10].
The attention gate (AG) refines the focus by suppressing irrelevant background features, akin to the gating mechanism in AttUNet [6]. It leverages additive attention, which often outperforms multiplicative attention. The gate operates as follows:
$$AG(x, g) = \mathrm{Sigmoid}\big(\mathrm{BN}\big(\mathrm{Conv}_1\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_1(x + g)\big)\big)\big)\big)\big).$$
Here, x is the GLF output, and g is the feature map after projection. The attention mask enhances task-relevant features dynamically.
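A small sketch of the gate, read directly from the formula above, is shown below. The intermediate channel width and the way the resulting mask is applied to the GLF output are assumptions.

```python
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: Sigmoid(BN(Conv(ReLU(BN(Conv(x + g)))))),
    following the equation above."""
    def __init__(self, channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or channels
        self.block = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x, g):
        # x: GLF output, g: projected feature map of the same shape
        mask = self.block(x + g)   # additive attention mask in [0, 1]
        return x * mask            # applying the mask to x is an assumption
```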

3.2.3. High-Order Focus Interactions

The framework generalizes from first-order to $n$-order focus interactions, allowing for more expressive spatial modeling. For an $n$-th order interaction, the input is projected into one base feature $A_0$ and a set of features $\{B_k\}_{k=0}^{n-1}$, where each stage builds upon the prior one:
$$A_0, \{B_k\}_{k=0}^{n-1} = \mathrm{Pro}_{\mathrm{in}}(x).$$
The output is computed iteratively with each layer refining the features further, followed by a scaling factor 1 / α to stabilize the training. The channel allocation increases progressively with interaction depth, enabling a coarse-to-fine spatial reasoning strategy. According to ablation results, 4th-order interactions yield the optimal performance (suggested in the experimental chart in Figure 2), while lower orders fail to capture sufficient spatial dynamics, and excessively higher orders cause the input to become too sparse at early stages [10]. For interaction orders higher than 4, the channel allocation per branch becomes too sparse (e.g., at order 5, only 8 channels remain per branch compared with 32 at order 4), which reduces representational capacity. In practice, this leads to unstable convergence and a DSC reduction of 2–3 percentage points.

3.2.4. PrecisionEdgeEnhance Module

The next component is the newly designed PrecisionEdgeEnhance module, a feature-enhancement block that aims to improve spatial detail in feature maps. It achieves this by combining explicit edge extraction with a spatial gating system. It takes a 4D tensor of shape $(B, C, H, W)$, where $B$ is the batch size, $C$ the number of input channels, and $H, W$ the spatial dimensions. The block contains four depth-wise convolution layers that apply Sobel kernels with different orientations: vertical, horizontal, +45 degrees, and −45 degrees. This produces four sets of edge maps that are concatenated along the channel dimension; a 1 × 1 convolution followed by BatchNorm and ReLU activation then compresses and fuses the edge information back to the original channel count.
In parallel, the input tensor is pooled with channel-wise average pooling and channel-wise max pooling, and the two maps are concatenated along the channel axis to form a tensor of shape $(B, 2, H, W)$. A 7 × 7 convolution that preserves the spatial resolution produces a spatial attention mask, and the original input tensor is modulated by element-wise multiplication with this mask.
Finally, the gated input tensor is concatenated with the edge-feature tensor and passed through a 3 × 3 convolution. The scheme of this module is shown in Figure 3.
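A compact sketch of this module, following the description above, could look as follows; the exact Sobel kernel values, layer names, and channel handling are assumptions rather than the released code.

```python
import torch
import torch.nn as nn

def _sobel_kernels():
    """Four 3x3 Sobel kernels: vertical, horizontal, +45 deg, -45 deg (assumed values)."""
    v  = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    h  = v.t()
    d1 = torch.tensor([[ 0.,  1., 2.], [-1., 0., 1.], [-2., -1., 0.]])   # +45 deg
    d2 = torch.tensor([[-2., -1., 0.], [-1., 0., 1.], [ 0.,  1., 2.]])   # -45 deg
    return [v, h, d1, d2]

class PrecisionEdgeEnhance(nn.Module):
    """Sketch of the PrecisionEdgeEnhance block: Sobel edge maps + spatial gating."""
    def __init__(self, channels):
        super().__init__()
        self.edge_convs = nn.ModuleList()
        for k in _sobel_kernels():
            conv = nn.Conv2d(channels, channels, 3, padding=1,
                             groups=channels, bias=False)   # depth-wise conv
            conv.weight.data.copy_(k.expand(channels, 1, 3, 3))  # Sobel-initialized
            self.edge_convs.append(conv)
        self.edge_fuse = nn.Sequential(                      # 1x1 fuse back to C channels
            nn.Conv2d(4 * channels, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.spatial_gate = nn.Conv2d(2, 1, 7, padding=3)    # 7x7 conv on avg+max maps
        self.out_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        edges = torch.cat([c(x) for c in self.edge_convs], dim=1)
        edges = self.edge_fuse(edges)
        avg = x.mean(dim=1, keepdim=True)                    # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)                   # (B, 1, H, W)
        mask = torch.sigmoid(self.spatial_gate(torch.cat([avg, mx], dim=1)))
        gated = x * mask                                     # spatial gating
        return self.out_conv(torch.cat([gated, edges], dim=1))
```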

3.2.5. EdgeAware Module

Accurate lesion segmentation relies heavily on the model’s ability to capture precise contour and boundary details. To enhance lesion-boundary localization, we introduce the EdgeAware module, also used in the architecture of the HF-UNet model. It is a specialized module that leverages eight-directional Sobel filters, which makes it more capable than standard two- or four-directional variants. These directional filters are implemented as described by the following formula:
$$F_1 = \mathrm{Conv}_{3\times3}[\mathrm{Conv}_{1\times1}(x)], \quad F_2 = \sum_{k=1}^{8} Q(S_k, F_1), \quad \mathrm{Out} = \mathrm{Concat}[\mathrm{Conv}_{3\times3}(F_2), \mathrm{Conv}_{1\times1}(x)],$$
where $Q$ denotes a convolution operator and $S_k$ represents the $k$-th Sobel filter; the eight filters emphasize directions from 0 degrees up to 157.5 degrees, in increments of 22.5 degrees. The architecture of the EdgeAware module can be visualized in Figure 4.

3.2.6. ResNet Encoders

The HF-EdgeFormer model is essentially a hybrid encoder–decoder architecture that aims for both semantic richness and spatial precision, which are crucial for medical image segmentation and analysis. An important addition to the encoding pipeline is the integration of a pretrained backbone for the first two stages of the encoder. This section outlines the architectural role of this approach and the motivation behind it. The very first encoding block of HF-EdgeFormer is the stem of a ResNet34 model pretrained on ImageNet. This part captures strong low-level patterns, including texture primitives and gradients, and is known to generalize well. In the next encoder stage, the model incorporates the first residual block group of ResNet34 (layer1), which contains three identical modules (BasicBlocks) with two 3 × 3 convolutions each and residual skip connections for better gradient flow. The main difference of this layer compared with the other ResNet layers is that it does not downsample spatially, which leaves a high usable resolution for the later stages. After the ResNet stages, HF-EdgeFormer transitions to its own encoder layers as follows:
$$\mathrm{encoder}_3: \mathrm{Conv2d}(64 \rightarrow 128), \quad \mathrm{encoder}_4: \mathrm{Conv2d}(128 \rightarrow 256), \quad \mathrm{encoder}_5: \mathrm{Conv2d}(128 \rightarrow 256).$$
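The two pretrained stages and the freezing strategy can be sketched as follows. The helper name set_frozen, the omission of the standard ResNet max-pooling step, and the training-loop snippet at the end are assumptions for illustration.

```python
import torch.nn as nn
from torchvision.models import resnet34

class FrozenResNetStages(nn.Module):
    """First two encoder stages from an ImageNet-pretrained ResNet34:
    the stem (7x7 conv, BN, ReLU) and layer1 (three BasicBlocks)."""
    def __init__(self, pretrained=True):
        super().__init__()
        # newer torchvision releases use the `weights=` argument instead
        backbone = resnet34(pretrained=pretrained)
        self.encoder1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        # layer1 keeps 64 channels and does not downsample spatially;
        # the standard ResNet max-pooling step is omitted here (an assumption)
        self.encoder2 = backbone.layer1

    def set_frozen(self, frozen: bool):
        """Freeze or unfreeze both pretrained stages."""
        for p in self.parameters():
            p.requires_grad = not frozen

    def forward(self, x):
        f1 = self.encoder1(x)   # 64 channels, half resolution after the stem
        f2 = self.encoder2(f1)  # 64 channels, same resolution as f1
        return f1, f2

# Assumed training-loop detail: keep the stages frozen for the first 15 epochs.
# encoder.set_frozen(epoch < 15)
```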

3.2.7. SegFormer Bottleneck

Another architectural innovation that helps HF-EdgeFormer outclass its predecessors is the newly designed bottleneck, which has three main components: the SegFormerBlock, EfficientSelfAttention, and SegFormerMLP. The SegFormerBlock is a lightweight transformer-style building block optimized specifically for semantic segmentation. It includes self-attention for global reasoning and MLPs for local context (built with depth-wise convolutions for a better performance-to-cost ratio). EfficientSelfAttention is an optimized multi-head self-attention (MHSA) module with spatial-reduction capabilities. SegFormerMLP is an MLP enhanced with spatial convolutions to provide local positional information. For better convergence and stability, the SegFormerBlock follows the pre-norm transformer convention. This bottleneck is inspired by the original SegFormer paper [13], with some ideas partially inspired by the pyramid vision transformer [14], MobileViT [15], and CvT [16] papers. The architecture overview of the SegFormerBlock can be visualized in Figure 5.
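A simplified sketch of the bottleneck block, following the pre-norm flow described above, is given below. The number of heads, the spatial-reduction ratio, and the MLP expansion factor are illustrative assumptions.

```python
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Multi-head self-attention with spatial reduction of keys/values,
    in the spirit of SegFormer [13]."""
    def __init__(self, dim, heads=4, sr_ratio=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sr = (nn.Conv2d(dim, dim, sr_ratio, stride=sr_ratio)
                   if sr_ratio > 1 else nn.Identity())
        self.sr_norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)   # fewer key/value tokens
        kv = self.sr_norm(kv)
        out, _ = self.attn(x, kv, kv)
        return out

class SegFormerMLP(nn.Module):
    """MLP with a 3x3 depth-wise convolution for local positional information."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()

    def forward(self, x, h, w):
        x = self.fc1(x)
        b, n, c = x.shape
        x = self.dw(x.transpose(1, 2).reshape(b, c, h, w)).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

class SegFormerBlock(nn.Module):
    """Pre-norm transformer block: LN -> attention -> skip -> LN -> MLP -> skip."""
    def __init__(self, dim, heads=4, sr_ratio=2):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = EfficientSelfAttention(dim, heads, sr_ratio)
        self.mlp = SegFormerMLP(dim)

    def forward(self, x, h, w):
        x = x + self.attn(self.norm1(x), h, w)
        x = x + self.mlp(self.norm2(x), h, w)
        return x
```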

3.2.8. Multi-Dilation Gate Module (MDAG)

The MDAG (multi-dilation attention gate) was first introduced in the HF-UNet model [10], and it is also responsible for the skip connections in HF-EdgeFormer. MDAG filters the transferred features, emphasizing the relevant ones during the skip process. It integrates multiple dilated convolutions in order to obtain both global and local feature representations: dilation rates of 1 and 2 are used for local context, and rates of 5 and 7 for global context. The outputs are concatenated and normalized, resulting in a feature map with four times the original number of channels. A voting mechanism (the Voting Module) then compresses the features back to their original channel count and refines them. Before forwarding to the decoder, the refined feature map undergoes element-wise multiplication and summation with the original input. The process can be summarized in the following equations:
$$x_1, x_2 = D_1\mathrm{Con}(x),\ D_2\mathrm{Con}(x), \qquad x_3, x_4 = D_5\mathrm{Con}(x),\ D_7\mathrm{Con}(x),$$
$$X = \mathrm{ReLU}\{\mathrm{Concat}[\mathrm{BN}(x_1), \mathrm{BN}(x_2), \mathrm{BN}(x_3), \mathrm{BN}(x_4)]\},$$
$$V_x = \mathrm{Sig}\{\mathrm{BN}[\mathrm{Conv}(X)]\}, \qquad \mathrm{Out} = x + x \cdot V_x,$$
where $x$ is the input, $\mathrm{BN}$ is batch normalization, $\mathrm{Concat}$ is the cascade operation, $\mathrm{Con}$ is the default 3 × 3 convolution, and $\mathrm{ReLU}$ and $\mathrm{Sig}$ stand for the ReLU and Sigmoid activation functions, respectively, while $D_1$, $D_2$, $D_5$, and $D_7$ are 3 × 3 dilated convolutions with dilation rates of 1, 2, 5, and 7, respectively.
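Translated into PyTorch, the equations above could be sketched as follows; the kernel size inside the Voting Module is an assumption.

```python
import torch
import torch.nn as nn

class MDAG(nn.Module):
    """Multi-dilation attention gate following the equations above."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 5, 7)                         # local (1, 2) + global (5, 7)
        ])
        self.bns = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(4)])
        self.relu = nn.ReLU(inplace=True)
        # Voting Module: compress 4C back to C and produce a sigmoid mask V_x
        self.vote = nn.Sequential(
            nn.Conv2d(4 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feats = [bn(conv(x)) for conv, bn in zip(self.branches, self.bns)]
        fused = self.relu(torch.cat(feats, dim=1))        # 4C channels
        v = self.vote(fused)                              # V_x in [0, 1]
        return x + x * v                                  # Out = x + x * V_x
```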

3.2.9. Test-Time Augmentation

Test-time augmentation is a modern strategy that has demonstrated remarkable results in many previous models, most notably in the appendix of the pyramid scene parsing network (PSPNet) [17]. That work showed that averaging a handful of rotated or flipped logits can offer a reliable boost, especially in mIoU (mean intersection over union), with minimal effort. HF-EdgeFormer adopts this philosophy but limits the ensemble to five views (the unmodified image plus horizontal flip, vertical flip, and +90 and −90 degree rotations), omitting heavier multiscale options in order to keep the inference overhead modest [18]. The aligned logits are then merged by a classic uniform element-wise average with no confidence weights:
$$\hat{Y} = \frac{1}{5}\sum_{i=1}^{5} y_i.$$
Although this mechanism seems rather basic at first glance, it consistently improved the overall mIoU score by 0.5% and the sensitivity by almost 0.8% across multiple training sessions.
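The five-view averaging can be sketched as a small inference wrapper. The function name tta_predict and the assumption that the model returns a single logit map per image are illustrative.

```python
import torch

@torch.no_grad()
def tta_predict(model, image):
    """Average logits over the identity view, horizontal/vertical flips,
    and +90/-90 degree rotations, undoing each transform before averaging."""
    views = [
        (lambda t: t,                            lambda t: t),                              # identity
        (lambda t: torch.flip(t, [-1]),          lambda t: torch.flip(t, [-1])),            # h-flip
        (lambda t: torch.flip(t, [-2]),          lambda t: torch.flip(t, [-2])),            # v-flip
        (lambda t: torch.rot90(t, 1, [-2, -1]),  lambda t: torch.rot90(t, -1, [-2, -1])),   # +90 deg
        (lambda t: torch.rot90(t, -1, [-2, -1]), lambda t: torch.rot90(t, 1, [-2, -1])),    # -90 deg
    ]
    logits = [inv(model(fwd(image))) for fwd, inv in views]
    return torch.stack(logits, dim=0).mean(dim=0)   # uniform element-wise average
```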

3.2.10. Upscale Module

This utility takes each image in the output folder of the HF-EdgeFormer model and enlarges its resolution by an arbitrary scale factor, producing a sharper image set and counteracting interpolation blur. It does not overwrite the original samples but writes the enhanced files to a separate directory. It is mostly useful when the processed images must be analyzed on a larger screen, where a higher pixel count is needed. The complexity is $\mathcal{O}(N \cdot H \cdot W \cdot \mathrm{scale}^2)$, dominated by resampling. It can be further improved with functionality such as dynamic sharpening and format conversion (PNG to JPEG, etc.).
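A possible sketch of this post-processing utility is shown below, assuming PNG outputs and Lanczos resampling; the function and directory names are hypothetical.

```python
from pathlib import Path
from PIL import Image

def upscale_outputs(src_dir, dst_dir, scale=2, resample=Image.LANCZOS):
    """Read every mask in the output folder, enlarge it by `scale`,
    and write the result to a separate directory (originals untouched)."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.png")):
        img = Image.open(path)
        new_size = (img.width * scale, img.height * scale)
        img.resize(new_size, resample).save(dst / path.name)
```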

4. Experiments

4.1. Implementation Details

All the experiments that have targeted oral ulcers were conducted using the official AutoOral dataset. Following the updated evaluation protocol used in the HF-UNet study, we performed 5-fold patient-level cross-validation on the AutoOral dataset to ensure robust and unbiased assessment. Results are reported as the mean ± standard deviation across the folds. In addition, an external evaluation was conducted on new samples to assess the generalization capability of the proposed model. The environment included Python 3.8 and PyTorch 1.12.0. The entire process was run on a single RTX 4070 Ti desktop graphics card with 12 GB of VRAM. To increase the diversity of the AutoOral dataset, some augmentation techniques were applied, and the chosen input size for the images was 256 × 256 . Training was run for 120 epochs with a batch size of 8. The loss function that was used was the BCEDice loss, which proved very effective in medical segmentation tasks [19]. In addition, the optimization was handled by the AdamW optimizer with a dynamic learning rate (initial value is set to 0.001, and it goes as low as 0.00001). This learning rate was adjusted over time using a cosine annealing schedule, and the experiments suggested a significant drop in the training time needed to obtain best results, compared with other available models.
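The loss and optimization setup described above can be sketched as follows. The 1:1 weighting between the BCE and Dice terms and the smoothing constant are assumptions; the learning-rate values and schedule match the description above.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Compound BCE + Dice loss on sigmoid probabilities."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits, target):
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits).flatten(1)
        tgt = target.flatten(1)
        inter = (prob * tgt).sum(dim=1)
        dice = (2 * inter + self.smooth) / (prob.sum(1) + tgt.sum(1) + self.smooth)
        return bce + (1 - dice.mean())

# AdamW with an initial lr of 1e-3 decayed to 1e-5 by cosine annealing over 120 epochs:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120, eta_min=1e-5)
```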

4.2. Model Evaluation Criteria

Being designed for medical image segmentation, the HF-EdgeFormer model needs to be benchmarked with relevant evaluation metrics. Commonly used evaluation criteria include accuracy (ACC), which reflects the percentage of correct classifications; the Dice similarity coefficient (DSC), which measures the degree of overlap between the ground-truth and predicted masks; sensitivity (SE), also known as recall, which quantifies the proportion of true positives among all real positives (true positives plus false negatives); and specificity (SP), which measures the proportion of true negatives among all actual negatives (true negatives plus false positives). These metrics are computed with the following equations:
$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}, \quad \mathrm{SP} = \frac{TN}{TN + FP}, \quad \mathrm{SE} = \frac{TP}{TP + FN}, \quad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP represents true positives, FP false positives, TN true negatives, and FN false negatives. Figure 6 shows three test samples, with the ground-truth mask placed between the input image and the model’s prediction.
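The four metrics can be computed from binarized predictions as in the following sketch; the epsilon term added for numerical stability is an assumption.

```python
import torch

def segmentation_metrics(pred, target, eps=1e-7):
    """DSC, SE, SP, and ACC from {0, 1} prediction and ground-truth tensors,
    following the equations above."""
    pred, target = pred.flatten().float(), target.flatten().float()
    tp = (pred * target).sum()
    tn = ((1 - pred) * (1 - target)).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    se  = tp / (tp + fn + eps)
    sp  = tn / (tn + fp + eps)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    return {"DSC": dsc.item(), "SE": se.item(), "SP": sp.item(), "ACC": acc.item()}
```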

4.3. Comparison with State-of-the-Art Techniques

Medical image segmentation is a critical domain, so the choice of the best model for a given task must be based on objective results. To demonstrate the capabilities of the HF-EdgeFormer model, it is compared with similar popular methods, including HF-UNet [12], UNet [4], Attention UNet [6], SCR-Net [20], TransNorm [21], MALUNet [5], C2SDG [22], M2SNet [23], MSA [24], META-UNet [7], MHorUNet [10], VM-UNet [15], and H-VMUNet [16].
As Table 2 shows, the newly introduced HF-EdgeFormer outperforms its competitors in almost all metrics, including its predecessor, the 2024 HF-UNet model. It achieves an impressive 85.5% sensitivity, clearly better than all other models, while also pushing DSC and accuracy higher on the AutoOral dataset. It is also relatively lightweight, using just over 3 GB of VRAM for inference, which makes it manageable on most systems. While some lighter models exist, HF-EdgeFormer remains a viable option because the tradeoffs are acceptable. The models used for comparison were selected to provide a comprehensive and balanced benchmark across multiple generations and styles of medical image-segmentation architectures.
First, classical architectures like UNet and Attention UNet were included because they serve as standard baselines in nearly all segmentation tasks. They are foundational and widely adopted, allowing for historical performance comparison.
Next, we included more recent convolution-based models such as SCR-Net, MALUNet, and C2SDG, which were specifically designed for lesion or small-structure segmentation. These models offer a strong reference for comparison within the oral cavity domain and reflect modern CNN innovations. Transformer-based architectures such as TransNorm and MSA were selected to represent the growing trend of self-attention and transformer models in medical image analysis. Including them allows us to compare our hybrid transformer-convolutional design against fully transformer-based approaches. Several models such as M2SNet, META-UNet, and MHorUNet were chosen because they integrate multiscale processing, attention modules, or frequency-aware mechanisms, design directions that are thematically related to our approach and relevant for fair comparison. Lightweight models like VM-UNet and H-VMUNet were included to evaluate performance relative to memory- and efficiency-focused architectures, which are important in resource-constrained clinical scenarios.
Finally, HF-UNet was included as our previous work and served as a direct baseline to assess the contribution of the newly proposed components in HF-EdgeFormer. In summary, these models were selected to span classical, convolutional, attention-based, lightweight, and domain-specific methods, ensuring a fair and informative evaluation of HF-EdgeFormer.

4.4. External Tests on New Samples

To further assess the generalization ability of HF-EdgeFormer, we evaluated the trained model on completely new medical samples, with more diverse lesion types and acquisition conditions compared with AutoOral.
We applied the same pre-processing and inference settings used for AutoOral, including test-time augmentation (TTA). HF-EdgeFormer achieved a DSC of 0.802 ± 0.091 and sensitivity of 0.841 ± 0.054.
Performance analysis per class revealed that detection was more accurate for larger, well-contrasted lesions, while smaller, poorly illuminated lesions occasionally led to undersegmentation. These findings highlight HF-EdgeFormer’s robustness to domain shifts but also identify illumination and texture variability as primary challenges.
In Figure 7 we can see that HF-EdgeFormer performs very well even with completely new medical samples.

5. Discussion and Limitations

Oral ulcer segmentation is an underdeveloped domain that needs to be explored further. However, it is very challenging to extract fine details while maintaining reasonable system requirements. An even greater limitation is the lack of high-quality datasets with more available images. While augmentation is a helpful way to combat this limitation, the model would have even greater potential with more native and diverse medical images. Another solution, applied in other areas, might be the generation of synthetic scans, such as those available for lung CT (IViT-CycleGAN) [22]. Another notable limitation is that scaling the model further requires careful tuning, since it is sensitive to architectural and training hyperparameters. Even though the release of the AutoOral dataset is a great opportunity to develop new segmentation models, expanding it to include more cases from diverse geographical regions may increase the robustness of HF-EdgeFormer and similar models. While HF-EdgeFormer builds on HF-UNet, it introduces three key innovations: (1) the PrecisionEdgeEnhance module, which explicitly fuses orientation-specific Sobel maps with a spatial gating system for sharper boundaries; (2) a SegFormer-based bottleneck, enabling efficient global context modeling at the deepest layer; and (3) a frozen ResNet34 encoder strategy, which stabilizes early training on small datasets.

6. Ablation Study

Table 3 presents an ablation study that analyzes the contribution of each key module in the HF-EdgeFormer architecture. The results show that removing any of the analyzed components (the SegFormer-based bottleneck, the PrecisionEdgeEnhance module, or test-time augmentation (TTA)) leads to a noticeable decrease in segmentation performance. The complete model, including all modules, achieves the highest Dice similarity coefficient (DSC) of 0.8181 and sensitivity (SE) of 0.8549, along with competitive specificity (SP) of 0.9801 and accuracy (ACC) of 0.9704. The parameter count remains stable at approximately 4.6 million, with a VRAM usage of approximately 3062 MB and an inference speed of 24 FPS, indicating the efficiency of the model.
Table 4 further explores the impact of enabling TTA on HF-EdgeFormer’s performance. Enabling TTA improves the DSC from 0.8030 to 0.8181 and the SE from 0.8370 to 0.8549, with corresponding gains in SP and ACC. In particular, TTA does not affect model size, memory consumption, or FPS in this setup, confirming its utility as a post-processing enhancement that increases segmentation accuracy without additional computational burden during training.
Together, these results highlight the importance of each architectural element in HF-EdgeFormer and demonstrate that TTA is an effective strategy to maximize model performance for biomedical image segmentation.

7. Conclusions

In conclusion, this paper introduced a capable segmentation model that extends the existing state of the art on the multi-task oral ulcer dataset. The refined HF-EdgeFormer integrates various high-performance modules that bring the benefits of both UNet-like and transformer-like architectures while maintaining rather modest system requirements. It uses HFblocks to facilitate information exchange between the lesion-segmentation branch and the contour-detection branch. The UNet-like skip connections help the model recover the spatial resolution lost during the downsampling performed in the encoder. The newly designed bottleneck, accompanied by the high-order focus interaction modules, pretrained low-level encoders, and dedicated lesion-localization modules, obtained an impressive DSC of 82% and almost 86% sensitivity, exceeding the current state-of-the-art results. The previous state-of-the-art model reached about 72% sensitivity, so this represents an improvement of roughly 13 percentage points. Many existing models have paved the way for today's best technology, and the newly introduced HF-EdgeFormer shares characteristics with these successful predecessors (for example, M2SNet, which uses subtractive connections; DAM-Net, which applies channel attention; C2SDG, which boosts generalization via contrastive learning; and fine-tuned SAM models such as MSA) [20,21,23,24,25,26]. This model will hopefully lead to the development of even better architectures for the medical computer vision domain. The proposed implementation will soon be available on GitHub https://github.com/DragosCornoiu/HF-EdgeFormer (accessed on 7 June 2025).

Author Contributions

Conceptualization, D.-C.C. and C.-A.P.; methodology, D.-C.C. and C.-A.P.; software, D.-C.C.; validation, D.-C.C.; formal analysis, D.-C.C.; investigation, D.-C.C.; resources, C.-A.P.; data curation, D.-C.C.; writing—original draft preparation, D.-C.C.; writing—review and editing, C.-A.P.; visualization, D.-C.C.; supervision, C.-A.P.; project administration, C.-A.P.; funding acquisition, C.-A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Politehnica University of Timișoara.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zeng, X.; Zhong, L.; Dan, H.; Chen, Q.; Zhang, X.; Wang, Z.; Xu, T.; Wang, Z.; Qiao, Y.; Xie, D.; et al. Difficult and complicated oral ulceration: An expert consensus guideline for diagnosis. Int. J. Oral Sci. 2022, 14, 28. [Google Scholar] [CrossRef]
  2. Mortazavi, H.; Safi, Y.; Baharvand, M.J.; Rahmani, M.; Jafari, M.; Etemad-Moghadam, S.; Sadeghi, S.; Shahidi, S. Diagnostic features of common oral ulcerative lesions: An updated decision tree. Int. J. Dent. 2016, 2016, 7278925. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, R.; Lu, M.; Zhang, J.; Zhang, D.; Xu, Y.; Cui, Y.; Lin, Y.; Wang, Q.; Ji, Z.; Zhuang, J.; et al. Deep learning models with multi-scale feature fusion for lesion segmentation in oral mucosal diseases. Bioengineering 2024, 11, 1107. [Google Scholar] [CrossRef] [PubMed]
  4. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested UNet Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165. [Google Scholar]
  5. Ullah, Z.; Usman, M.; Jeon, M.; Gwak, J. Cascade multiscale residual attention CNNs with adaptive ROI for automatic brain tumor segmentation. Inf. Sci. 2022, 608, 1541–1556. [Google Scholar] [CrossRef]
  6. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention UNet. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  7. Ruan, J.; Xie, M.; Gao, J.; Liu, T.; Fu, Y. EGE-UNet: An Efficient Group Enhanced UNet for Skin Lesion Segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023; Springer: Cham, Switzerland, 2023. [Google Scholar] [CrossRef]
  8. Bougourzi, F.; Chefrour, M.; Djeraba, C. D-TrAttUnet: Dual-Decoder Transformer Attention UNet for COVID-19 Segmentation. arXiv 2023, arXiv:2303.15576. [Google Scholar]
  9. Ju, J.; Zhang, Q.; Guan, Z.; Shen, X.; Shen, Z.; Xu, P. NTSM: A non-salient target segmentation model for oral mucosal diseases. BMC Oral Health 2024, 24, 521. [Google Scholar] [CrossRef]
  10. Jiang, C.; Wu, R.; Liu, Y.; Wang, Y.; Zhang, Q.; Liang, P.; Fan, Y. A high-order focus interaction model and oral ulcer dataset for oral ulcer segmentation. Sci. Rep. 2024, 14, 20085. [Google Scholar] [CrossRef]
  11. Yang, J.; Li, C.; Zhang, P.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv 2021, arXiv:2107.00641. [Google Scholar]
  12. Naderi, M.; Givkashi, M.; Pri, F.; Karimi, N.; Samavi, S. Focal-UNet: UNet-like focal modulation for medical image segmentation. arXiv 2022, arXiv:2212.09263. [Google Scholar]
  13. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
  14. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122. [Google Scholar] [CrossRef]
  15. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  16. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. arXiv 2021, arXiv:2103.15808. [Google Scholar] [CrossRef]
  17. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  18. Kimura, M. TTA: Understanding Test-Time Augmentation; Ridge-i Inc.: Tokyo, Japan, 2024. [Google Scholar]
  19. Wang, X.; Gao, S.; Guo, J.; Wang, C.; Xiong, L.; Zou, Y. Deep learning-based integrated circuit surface defect detection: Addressing information density imbalance for industrial application. Int. J. Comput. Intell. Syst. 2024, 17, 29. [Google Scholar] [CrossRef]
  20. Ullah, Z.; Usman, M.; Latif, S.; Gwak, J. Densely attention mechanism based network for COVID-19 detection in chest X-rays. Sci. Rep. 2023, 13, 261. [Google Scholar] [CrossRef]
  21. Hu, S.; Liao, Z.; Xia, Y. Devil is in channels: Contrastive single domain generalization for medical image segmentation. arXiv 2023, arXiv:2306.05254. [Google Scholar] [CrossRef]
  22. Hu, Y.; Zhou, H.; Cao, N.; Li, C.; Hu, C. IViT-CycleGAN: Synthetic CT generation based on CBCT using improved vision transformer CycleGAN. Sci. Rep. 2024, 14, 11455. [Google Scholar] [CrossRef] [PubMed]
  23. Zhao, X.; Zhang, J.; Zhou, Z.; Zhou, L.; Miao, S. M2SNet: Multi-scale in multi-scale subtraction network for medical image segmentation. arXiv 2023, arXiv:2303.10894. [Google Scholar]
  24. Wu, J.; Wang, X.; Hu, Y.; Yang, X.; Liu, W.; Wang, C.; Ni, D. Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation. arXiv 2023, arXiv:2304.12620. [Google Scholar] [CrossRef] [PubMed]
  25. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar] [PubMed]
  26. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 7 December 2021; Volume 34, pp. 9355–9366. [Google Scholar]
Figure 1. Overview of the HF-EdgeFormer architecture.
Figure 2. Effectiveness of HF-EdgeFormer with different interaction orders.
Figure 3. Block diagram for PrecisionEdgeEnhance module.
Figure 4. EdgeAware module architecture overview.
Figure 5. Overview of the SegFormerBlock architecture.
Figure 6. Segmentation quality of HF-EdgeFormer.
Figure 7. Segmentation quality on new biomedical samples.
Table 1. Overview of the AutoOral dataset.

| Dataset | Age Range | Source | Number of Samples | Disease Category | Percentage |
|---|---|---|---|---|---|
| AutoOral | 7 to 84 | Ruijin Hospital, Shanghai Jiao Tong University School of Medicine | 420 | Cancerous ulcers | 12.33% |
| | | | | Traumatic ulcers and traumatic blood blister | 12.33% |
| | | | | Herpes-like aphthous ulcers | 20.55% |
| | | | | Mild aphthous ulcers | 24.66% |
| | | | | Severe aphthous ulcers | 30.1% |
Table 2. Segmentation performance on AutoOral dataset with estimated 95% confidence intervals (±). Bold indicates the best results across variants. ↓ indicates smaller is better, ↑ indicates bigger is better.

| Methods | Year | Memory (MB) ↓ | DSC ↑ | ACC ↑ | SP ↑ | SE ↑ |
|---|---|---|---|---|---|---|
| UNet [4] | 2015 | 1567 | 0.7480 ± 0.005 | 0.9617 ± 0.002 | 0.9815 ± 0.002 | 0.7282 ± 0.006 |
| Att UNet [6] | 2018 | 1580 | 0.7404 ± 0.005 | 0.9632 ± 0.002 | 0.9879 ± 0.002 | 0.6716 ± 0.007 |
| SCR-Net [20] | 2021 | 1569 | 0.7069 ± 0.006 | 0.9602 ± 0.002 | 0.9896 ± 0.001 | 0.6148 ± 0.008 |
| TransNorm [21] | 2022 | 2113 | 0.5670 ± 0.005 | 0.9514 ± 0.002 | 0.9873 ± 0.002 | 0.4691 ± 0.007 |
| MALUNet [5] | 2022 | 1551 | 0.6318 ± 0.005 | 0.9409 ± 0.003 | 0.9655 ± 0.003 | 0.6500 ± 0.006 |
| C2SDG [22] | 2023 | 1723 | 0.7210 ± 0.004 | 0.9604 ± 0.002 | 0.9862 ± 0.002 | 0.6554 ± 0.006 |
| M2SNet [23] | 2023 | 1753 | 0.7482 ± 0.004 | 0.9669 ± 0.001 | 0.9953 ± 0.001 | 0.6300 ± 0.007 |
| MSA [24] | 2023 | 5173 | 0.7540 ± 0.004 | 0.9697 ± 0.001 | 0.9887 ± 0.002 | 0.7181 ± 0.005 |
| META-UNet [7] | 2023 | 1639 | 0.7535 ± 0.004 | 0.9695 ± 0.001 | 0.9842 ± 0.002 | 0.7227 ± 0.005 |
| MHorUNet [10] | 2024 | 1597 | 0.7618 ± 0.004 | 0.9657 ± 0.002 | 0.9867 ± 0.002 | 0.7143 ± 0.005 |
| VM-UNet [15] | 2024 | 1124 | 0.7639 ± 0.005 | 0.9636 ± 0.002 | 0.9812 ± 0.002 | 0.7555 ± 0.004 |
| H-VMUnet [16] | 2024 | 1060 | 0.7127 ± 0.004 | 0.9605 ± 0.002 | 0.9887 ± 0.001 | 0.6276 ± 0.006 |
| HF-UNet [12] | 2024 | 2029 | 0.7971 ± 0.004 | 0.9703 ± 0.001 | 0.9940 ± 0.001 | 0.7251 ± 0.005 |
| HF-EdgeFormer (Ours) | 2025 | 3062 | 0.8181 ± 0.003 | 0.9704 ± 0.001 | 0.9801 ± 0.002 | 0.8549 ± 0.004 |
Table 3. Ablation study showing the effect of each module in HF-EdgeFormer. Bold indicates the best results across variants. ↓ indicates smaller is better, ↑ indicates bigger is better.

| Model Variant | DSC ↑ | SE ↑ | Params (M) ↓ | FLOPs (G) ↓ | VRAM (MB) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|
| w/o EdgeFormerBottleneck | 0.7750 | 0.7220 | 4.1 | 31.4 | 2032 | 24 |
| w/o PrecisionEdgeEnhance | 0.7820 | 0.8037 | 4.3 | 32.5 | 2890 | 24 |
| w/o TTA | 0.8030 | 0.8370 | 4.6 | 33.1 | 3062 | 24 |
| HF-EdgeFormer (Full) | 0.8181 | 0.8549 | 4.6 | 33.1 | 3062 | 24 |
Table 4. Effect of enabling test-time augmentation (TTA) in HF-EdgeFormer. Bold indicates the best results across variants. ↑ indicates bigger is better.

| TTA Enabled | Year | DSC ↑ | SE ↑ | SP ↑ | ACC ↑ |
|---|---|---|---|---|---|
| No | 2025 | 0.8030 | 0.8370 | 0.9790 | 0.9690 |
| Yes | 2025 | 0.8181 | 0.8549 | 0.9781 | 0.9704 |

