Article

Multi-Modal Multi-Stage Multi-Task Learning for Occlusion-Aware Facial Landmark Localisation

by Yean Chun Ng 1, Alexander G. Belyaev 2, Florence Choong 1, Shahrel Azmin Suandi 3, Joon Huang Chuah 4,5 and Bhuvendhraa Rudrusamy 1,*

1 School of Engineering and Physical Sciences, Heriot-Watt University Malaysia, Putrajaya 62200, Malaysia
2 School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK
3 School of Electrical and Electronic Engineering, Universiti Sains Malaysia, Gelugor 11700, Malaysia
4 Faculty of Engineering and Information Technology, Southern University College, Skudai 81300, Malaysia
5 Department of Electrical Engineering, Faculty of Engineering, University of Malaya, Kuala Lumpur 50603, Malaysia
* Author to whom correspondence should be addressed.
Submission received: 8 December 2025 / Revised: 4 January 2026 / Accepted: 9 January 2026 / Published: 15 January 2026
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

Thermal facial imaging enables non-contact measurements of face heat patterns that are valuable for healthcare and affective computing, but common occluders (glasses, masks, scarves) and the single-channel, texture-poor nature of thermal frames make robust landmark localisation and visibility estimation challenging. We propose M3MSTL, a multi-modal, multi-stage, multi-task framework for occlusion-aware landmarking on thermal faces. M3MSTL pairs a ResNet-50 backbone with two lightweight heads: a compact fully connected landmark regressor and a Vision Transformer occlusion classifier that explicitly fuses per-landmark temperature cues. A three-stage curriculum (mask-based backbone pretraining, head specialisation with a frozen trunk, and final joint fine-tuning) stabilises optimisation and improves generalisation from limited thermal data. On the TFD68 dataset, M3MSTL substantially improves both visibility and localisation: the occlusion accuracy reaches 91.8% (baseline 89.7%), the mean NME reaches 0.246 (baseline 0.382), the ROC–AUC reaches 0.974, and the AP is 0.966. Paired statistical tests confirm that these gains are significant. Our approach aims to improve the reliability of temperature-based biometric and clinical measurements in the presence of realistic occluders.

1. Introduction

Thermal facial imaging (long-wave infrared) provides non-contact measurements of face heat patterns and has proven valuable in healthcare and affective computing because thermal signatures often correlate with physiological state and emotional arousal [1,2]. Compared with visual images, thermal frames have a single channel, lack color and fine texture, and often represent anatomy only through temperature gradients; common occluders such as eyeglasses, masks, scarves, or facial hair can therefore completely obscure or insulate critical landmark regions (for example, the inner canthi), producing spatially incomplete or misleading measurements [2,3]. Common detectors trained on RGB appearance fail in this setting because they rely on cues that do not exist in infrared [3,4], and prior thermal studies frequently avoid occluded regions with heuristics or focus on downstream tasks rather than delivering a general pixel- or landmark-level occlusion signal [1,2]. Motivated by these gaps, we propose an occlusion-aware framework for thermal faces that jointly (1) regresses landmark coordinates and (2) predicts per-landmark visibility. The model uses a ResNet backbone trained from scratch on raw thermal images to learn modality-specific local encodings, lightweight fully connected heads for precise coordinate regression, and a Vision Transformer (ViT) visibility head that applies multi-head self-attention to reason globally across facial regions. Critically, both task heads fuse backbone features with explicit per-landmark temperature measurements so that visual heat patterns and landmark temperatures inform localisation and occlusion decisions. Together, these design choices keep the backbone focused on infrared radiometry while enabling heads to combine local detail, global context, and direct thermal signals to improve robustness under occlusion.
Because occlusion detection and landmark localisation are intrinsically coupled, we formulate the problem as multi-task learning (MTL) so the network can share representations and exploit task synergies [5,6]. MTL increases data efficiency and reduces overfitting by shaping features that generalise across tasks; in our setting, jointly learning landmarks and occlusion labels helps the model (a) identify which facial regions supply reliable localisation cues under occlusion, and (b) allow occlusion prediction to leverage landmark geometry as an informative signal. To stabilise optimisation and improve the final performance, we employ a three-stage training curriculum: (1) we train the ResNet backbone from scratch on thermal inputs, (2) freeze the backbone and train the task heads so they specialise in regression and visibility using both thermal images and per-landmark temperature inputs, and (3) unfreeze and fine-tune the entire model end to end with a combined loss. This staged scheme decouples representation learning from task specialisation, improves convergence with limited thermal data, and helps the hybrid ResNet–ViT architecture exploit both local and long-range cues for reliable, temperature-informed biometric and clinical measurements in real-world occlusion scenarios [6,7,8,9].
The remainder of the paper is organised as follows. Section 2 reviews related work on occlusion-aware landmark localisation and thermal face analysis. Section 3 details the proposed methodology, including the ResNet–ViT architecture and the three-stage training curriculum. Section 4 describes the experimental setup: dataset preparation and annotation, baseline models, and training and hyperparameters. Section 5 presents quantitative and qualitative results, hypothesis testing, and a discussion of robustness under occlusion. Finally, Section 6 concludes the paper and outlines limitations and directions for future work. The dataset, code, and pre-trained models presented in this paper can be accessed at https://github.com/lucas-nyc/M3MSTL (accessed on 29 December 2025).

2. Related Works

Previous studies have advanced facial landmark prediction, occlusion handling, and thermal face analysis in complementary ways; nevertheless, occlusion-aware landmark detection specifically targeted at thermal imagery remains underexplored.

2.1. Architectures for Occlusion-Aware Landmark Localisation

Thermal imaging of faces has been studied for recognition, affect analysis, and clinical applications, but it is less mature than visible-spectrum research. Early thermal deep models demonstrate that naively applying visible-light detectors to infrared inputs fails due to the large modality gap; accordingly, end-to-end thermal models (e.g., U-Net variants that jointly predict landmarks and emotions) were proposed to capture thermal-specific structures [6]. An alternative approach is RGB-to-thermal synthesis, which allows for the reuse of pretrained RGB models. However, these transfers often introduce structural distortions and do not reliably reproduce occluders. Joint multi-modal fusion by processing RGB and thermal together has therefore emerged as a strong option: multi-modal networks that combine CNN feature extractors with transformer-based cross-modal attention (e.g., M2FNet) consistently outperform single-modality baselines by leveraging complementary cues (visible texture + thermal heat patterns) [10]. Such fusion can exploit the robustness of thermal imaging (low-light, privacy-preserving signatures) while retaining appearance information from RGB when available.
MTL leverages a shared encoder that feeds multiple task-specific heads, a technique that is widely used in face analysis because shared representations improve data efficiency and generalisation [11]. In the thermal domain, MTL U-Net variants that jointly regress landmarks and recognise emotion outperform single-task baselines, indicating that related tasks provide useful auxiliary signals when thermal training data are limited [5,6,12].
Occlusion handling has been approached from several angles. Structure-aware alignment methods model inter-landmark relationships using relational graphs and relative-location losses, improving robustness to pose and partial occlusion on visible-light benchmarks [3]. Attention-based and region-partitioning strategies suppress corrupted features or isolate reliable facial regions under occlusion [7]. Transformer-based occlusion-recovery schemes (e.g., ORFormer) use messenger tokens or patch-level reasoning to detect and recover non-visible regions, achieving strong resilience to occlusion in RGB datasets [8]. Specific to thermal imagery, targeted studies show that occlusions (notably eyeglasses, which appear as cold, signal-blocking regions) severely degrade downstream tasks; methods that exclude or specially treat occluded ROIs can improve affect recognition, but they typically rely on manually defined regions rather than producing landmark- or pixel-level occlusion outputs [2].
Complementing these lines of work are several concrete occlusion-aware architectures and segmentation-based strategies that demonstrate beneficial trade-offs for landmark robustness. Occlusion-Directed Networks (ODNs) are occlusion-specific architectures built on a CNN backbone (commonly ResNet-18) that add three dedicated modules: an occlusion-distillation module that predicts a per-location occlusion probability map, a low-rank recovery module that attempts to reconstruct lost features, and a geometry-aware module that encodes facial-shape relationships to guide recovery [13]. ODN effectively down-weights features in occluded regions via the learned occlusion map and uses facial priors to recover missing information, producing substantially improved landmark accuracy on occluded benchmarks [13]; the principal drawbacks are added model complexity and parameters, an implicit assumption of a single learned occlusion model that highly varied occluders may challenge, and reliance on somewhat costly module annotations or careful unsupervised learning of the occlusion map.
Segmentation-based methods such as Mask R-CNN provide an alternative by explicitly predicting pixel-level masks: extending a two-stage detector with a mask head enables the network to detect occluding objects or segment occluded versus visible face regions, which can then be used to mask out unreliable pixels or trigger special handling [14]. Mask R-CNN is conceptually simple and strong on benchmarks (detection, segmentation, and landmarks), but it is heavier, two-stage, requires box-level and mask supervision, and is less attractive for low-latency or resource-constrained deployments.
Stacked Hourglass networks, while typically used for heatmap-based landmark regression, can be repurposed as occlusion classifiers by training them to output binary masks or per-landmark visibility scores instead of (or in addition to) coordinate heatmaps [15]. The multi-scale down-/upsampling stacks and intermediate supervision make hourglass architectures naturally capable of integrating global context and recovering weak responses over later stacks, but they remain computationally intensive and depend on pixel-level supervision for reliable occlusion masks.
Transformer-based approaches provide yet another avenue for occlusion reasoning. Vision Transformers (ViTs) process images as sequences of patch tokens with multi-head self-attention, allowing information to propagate across distant facial regions and allowing the model to infer missing content from surrounding context [16]. Extensions such as ORFormer introduce specialised “messenger” tokens that help the network detect and recover occluded patches, demonstrating strong occlusion-recovery capabilities in RGB landmark detection [8]. Although powerful, standard ViTs typically require extensive pretraining on large datasets and lack intrinsic multi-scale inductive biases unless explicitly augmented; without such design choices, a vanilla ViT may treat occluded patches as noisy tokens rather than explicitly modelling occlusion.
Convolutional backbones (e.g., ResNet) offer robust local feature extraction through residual hierarchies and pre-trained transfer learning priors, which are particularly beneficial when thermal data are scarce. However, CNNs have limited receptive fields and may struggle when local appearance is missing. Vision Transformers (ViTs) complement CNNs by modelling long-range dependencies through self-attention, enabling direct interactions among distant facial regions and supporting inference about occluded parts based on distant cues. Recent work hybridises ResNet backbones with ViT-style heads, making local detail and global context available simultaneously. Such hybrids have demonstrated improved localisation and classification across several imaging domains [9]. Applied to the thermal occlusion problem, a hybrid design allows the model to combine high-resolution visual features with global relational reasoning to both identify occlusions and estimate plausible landmark positions.

2.2. Training Strategies Under Thermal Data Scarcity

Data scarcity in thermal face benchmarks has motivated the development of cross-modal distillation and the creation of synthetic datasets. Multi-level distillation transfers supervision from large RGB models into thermal detectors, boosting localisation accuracy but typically assuming visible landmarks and not explicitly predicting occlusions [4]. Synthetic thermal datasets (e.g., T-FAKE) produced by RGB-to-thermal style transfer provide dense annotations at scale and improve detector performance; however, these synthetic scenes are often idealised and rarely include realistic occluders, such as eyeglasses or facial hair, which limits their utility for occlusion-aware training [4].
Training strategies that use multi-stage optimisation can stabilise learning in complex multi-task systems. Two-stage or multi-stage curricula (pretrain encoder, train heads separately, then jointly fine-tune) have been shown to improve convergence, accuracy, and task balance in thermal face tasks and related domains [6,12,17]. Such schemes are beneficial when combining cross-modal fusion, MTL, and occlusion prediction under limited thermal data.
Collectively, prior work contributes structural priors, attention- and transformer-based occlusion filtering/recovery, multi-modal fusion techniques, and synthetic/distillation remedies for data scarcity. However, significant gaps remain for the specific problem of occlusion-aware landmark detection in thermal imagery. Many high-performing structural or occlusion-recovery methods were developed for RGB appearance and are not optimised for single-channel thermal images. Thermal studies that address occlusion often employ manual ROI rules or focus on downstream tasks rather than producing landmark- or pixel-level occlusion outputs and visibility scores. Synthetic and distillation approaches improve landmark localisation but generally do not simulate or explicitly supervise realistic occluders, limiting their usefulness for learning explicit occlusion semantics.
These observations motivate a unified approach that explicitly predicts per-landmark visibility in thermal images, leverages thermal-specific signals and temperature measurements rather than assuming RGB appearance, and combines structural priors with global contextual recovery so that downstream temperature-based biometric measurements remain reliable in the presence of common occluders (e.g., eyeglasses, masks, facial hair).

3. Methodology

Our occlusion-aware landmark system is a multi-modal multi-stage multi-task network composed of a shared ResNet backbone and task-specific heads (a dense landmark regressor and a ViT occlusion classifier); see Figure 1. ResNet-style convolutional encoders provide strong local and mid-level feature extraction due to deep residual hierarchies and well-known inductive biases for edges, textures, and local patterns; they also benefit from widespread ImageNet pretraining that yields robust starting representations for many downstream tasks [18]. In contrast, ViT excels at modelling long-range relationships through self-attention, which is particularly valuable when portions of the input are missing or corrupted by occluders (e.g., eyeglasses, masks). Combining these two paradigms allows the network to (1) extract reliable local features via the ResNet encoder and (2) reason globally about the configuration of facial parts via self-attention in the ViT head. Recent hybrid ResNet–ViT systems applied to medical and other imaging domains have demonstrated that such combinations bridge local and global representations and improve downstream classification or segmentation performance [9,16]. In our facial-thermal setting, the ResNet supplies texture- and geometry-oriented features while the ViT head helps reconcile inconsistent local cues by attending to symmetrically or contextually relevant areas.

3.1. Occlusion-Aware Landmark Localisation Architecture

This section details the hybrid architecture used in M3MSTL: a ResNet-50 backbone for dense spatial features, and task-specific heads (a compact landmark regressor and a ViT occlusion classifier) that fuse image features with per-landmark temperature cues.

3.1.1. ResNet-50

ResNet-50 is a deep convolutional encoder built from residual bottleneck blocks that use identity skip-connections to mitigate vanishing gradients and enable practical training of very deep networks [18]. Architecturally, the residual design encourages the network to learn residual functions, which stabilises optimisation and yields strong local feature extractors that are useful for low-level radiometric patterns in thermal images. In our setting, the ResNet backbone is trained from scratch on single-channel thermal inputs, allowing its early convolutional filters to specialise in detecting infrared gradients and edges rather than RGB texture. Backbone activations are processed by a light spatial refinement path to produce a high-resolution spatial feature map aligned to the input resolution that the subsequent task heads can exploit.

3.1.2. Vision Transformer

The ViT applies multi-head self-attention to a sequence of patch embeddings, enabling global, content-driven reasoning across the image [16]. Unlike convolutional modules that aggregate locality via receptive fields, self-attention directly models long-range interactions between facial regions, which is valuable for occlusion reasoning, where evidence for a landmark’s visibility may come from distant, unoccluded cues. In our lightweight ViT, pooled spatial patches are projected to compact token embeddings and passed through a small stack of transformer blocks. The pooled transformer output is then fused with explicit per-landmark temperature features, so that attention-based contextual reasoning is informed by both spatial signals and direct thermal measurements.

3.1.3. Spatial Features and Task Heads

Building on the ResNet-50 backbone and the lightweight ViT described above, the full architecture fuses dense spatial features and global context to support both landmark regression and per-landmark occlusion prediction. Given an input thermal image $X \in \mathbb{R}^{H \times W \times 1}$, the backbone produces a spatial feature map $F_s \in \mathbb{R}^{H \times W \times C_s}$ that is aligned to the input resolution (see Figure 1), where $C_s$ denotes the number of output channels of the backbone.
Landmark Regressor (Compact FC Head): We first apply global average pooling (GAP) to $F_s$, then pass the result through a lightweight dense tower: Dense(2048), Dropout(0.3), Dense(512), Dropout(0.3), Dense($2K$) with sigmoid activation, outputting the concatenated, normalised coordinates $(x, y)$ for all landmarks (here, $K$ is the number of landmarks and coordinates are scaled to $[0, 1]$ per axis).
ViT Occlusion Head: We tokenise $F_s$ into non-overlapping patch embeddings and process them with a lightweight ViT consisting of two transformer blocks, each with LayerNorm, multi-head self-attention with four heads (key dimension 32), residual connections, and an MLP with an output dimension of 128. The token embeddings are aggregated using GAP and regularised with Dropout(0.1). The per-landmark temperature vector $T \in \mathbb{R}^{K}$ is projected to a lower dimension through a small fully connected projection layer (Flatten, then Dense(64)) and concatenated with the pooled embeddings prior to the final classification layers: Dense(256), Dropout(0.2), Dense($K$) with sigmoid activation, yielding per-landmark occlusion probabilities in $[0, 1]$.
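For concreteness, the following Keras sketch illustrates the two heads on top of a shared backbone feature map. Layer sizes follow the description above, but the function and tensor names (build_heads, spatial_features, landmark_temps), the token embedding width, and the backbone output shape are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the released code): compact FC landmark regressor and
# lightweight ViT occlusion head operating on a shared backbone feature map.
import tensorflow as tf
from tensorflow.keras import layers

K_LANDMARKS = 68        # assumed 68-point annotation
TOKEN_DIM = 128         # assumed compact token embedding size

def build_heads(spatial_features, landmark_temps):
    # ----- Landmark regressor (compact FC head) -----
    x = layers.GlobalAveragePooling2D()(spatial_features)
    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    landmarks = layers.Dense(2 * K_LANDMARKS, activation="sigmoid",
                             name="landmarks")(x)          # normalised (x, y) in [0, 1]

    # ----- ViT occlusion head -----
    h, w, c = spatial_features.shape[1:]
    tokens = layers.Reshape((h * w, c))(spatial_features)   # patch tokens
    tokens = layers.Dense(TOKEN_DIM)(tokens)                # compact token embeddings
    for _ in range(2):                                      # two transformer blocks
        attn_in = layers.LayerNormalization()(tokens)
        attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(attn_in, attn_in)
        tokens = layers.Add()([tokens, attn])               # residual connection
        mlp_in = layers.LayerNormalization()(tokens)
        mlp = layers.Dense(TOKEN_DIM, activation="gelu")(mlp_in)  # MLP, output dim 128
        tokens = layers.Add()([tokens, mlp])
    pooled = layers.GlobalAveragePooling1D()(tokens)
    pooled = layers.Dropout(0.1)(pooled)

    temps = layers.Flatten()(landmark_temps)                # per-landmark temperatures T
    temps = layers.Dense(64, activation="relu")(temps)      # temperature projection
    fused = layers.Concatenate()([pooled, temps])
    fused = layers.Dense(256, activation="relu")(fused)
    fused = layers.Dropout(0.2)(fused)
    occlusion = layers.Dense(K_LANDMARKS, activation="sigmoid",
                             name="occlusion")(fused)       # per-landmark visibility
    return landmarks, occlusion

# Example wiring with an assumed backbone output shape (8 x 8 x 512).
spatial_in = tf.keras.Input(shape=(8, 8, 512))
temps_in = tf.keras.Input(shape=(K_LANDMARKS,))
lm_out, occ_out = build_heads(spatial_in, temps_in)
heads = tf.keras.Model([spatial_in, temps_in], [lm_out, occ_out])
```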
Training objective: The system is trained in a multi-task fashion: the landmark branch minimises Wing loss [19], while the occlusion branch minimises binary cross-entropy per landmark. The total objective is
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{landmark}} + \lambda_{\text{occl}}\,\mathcal{L}_{\text{occl}}$$
which balances the two tasks, with $\lambda_{\text{occl}}$ as the balancing parameter.
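As a concrete illustration, the snippet below sketches this combined objective in NumPy, using the Wing-loss form from [19] with assumed default parameters (w = 10, ε = 2); the function names and parameter values are illustrative, not the released configuration.

```python
# Illustrative sketch of the multi-task objective: Wing loss for landmark
# coordinates plus per-landmark binary cross-entropy for occlusion labels.
import numpy as np

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss of Feng et al. [19]; w and eps are assumed defaults."""
    x = np.abs(pred - target)
    c = w - w * np.log(1.0 + w / eps)          # keeps the two branches continuous at |x| = w
    return float(np.mean(np.where(x < w, w * np.log(1.0 + x / eps), x - c)))

def bce_loss(prob, label, tiny=1e-7):
    """Per-landmark binary cross-entropy on occlusion probabilities."""
    prob = np.clip(prob, tiny, 1.0 - tiny)
    return float(np.mean(-(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))))

def total_loss(pred_lm, gt_lm, pred_occ, gt_occ, lambda_occl=1.0):
    """L_total = L_landmark + lambda_occl * L_occl (lambda_occl = 1.0 in Section 4.4)."""
    return wing_loss(pred_lm, gt_lm) + lambda_occl * bce_loss(pred_occ, gt_occ)
```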
Section 3.2 details the staged optimisation: first, the backbone learns thermal-specific spatial features, then the heads specialise with explicit temperature conditioning, before a final joint fine-tuning aligns shared features to the combined objective.

3.2. Multi-Stage Training Strategies

We adopt a three-stage training curriculum (see Algorithm 1 and the architectural layout in Figure 1) to stabilise optimisation, decouple representation learning from task specialisation, and then align the shared features with the joint objective through a brief end-to-end fine-tuning. We formalise notation as follows (learning rates and epoch budgets are detailed in Section 4.4). Let the dataset be
$$\mathcal{D} = \{(X_i, M_i, T_i, Y_i)\}_{i=1}^{N}$$
where $X$ is a single-channel thermal image, $M$ is a dense face mask used for backbone pretraining, $T \in \mathbb{R}^{K}$ is the per-landmark temperature vector, and $Y = (Y^{\mathrm{lm}}, Y^{\mathrm{oc}})$ contains landmark coordinates $Y^{\mathrm{lm}} \in [0, 1]^{2K}$ and binary occlusion labels $Y^{\mathrm{oc}} \in \{0, 1\}^{K}$. The backbone (b) and head (h) parameters are partitioned into $\theta_b$ and $\theta_h$, respectively. We use Adam optimisation with stage-specific learning rates $\eta_1, \eta_2, \eta_3$ and epoch budgets $\mathrm{epochs}_1, \mathrm{epochs}_2, \mathrm{epochs}_3$, as described in Algorithm 1.

3.2.1. Stage 1: Backbone Pre-Training

Predicted masks are $\hat{M} = b_{\theta_b}(X)$. Stage 1 optimises only $\theta_b$ to minimise the pixel-wise BCE mask loss:
$$\theta_b \leftarrow \mathrm{Adam}_{\eta_1}\big(\theta_b,\ \mathrm{BCE}(b_{\theta_b}(X), M)\big)$$
Pre-training on a dense spatial task encourages the backbone to capture face shape and radiometric structure (edges, gradients) that are useful for the later landmark and occlusion tasks. We validate mask metrics each epoch and checkpoint the best backbone weights $\theta_b^{(1)}$.

3.2.2. Stage 2: Heads Pretraining

Load $\theta_b^{(1)}$ and freeze the backbone. Extract features $F = b_{\theta_b^{(1)}}(X)$ and compute head outputs $\hat{Y} = h_{\theta_h}(F, T)$. Train only $\theta_h$ to minimise the joint loss $\mathcal{L}_{\text{total}}$ defined in Section 3.1.3:
$$\theta_h \leftarrow \mathrm{Adam}_{\eta_2}\big(\theta_h,\ \mathcal{L}_{\text{total}}(h_{\theta_h}(F, T), Y)\big)$$
Passing the per-landmark temperature vector $T$ directly to the head allows it to condition its predictions on explicit thermal cues without altering the backbone representation. Validation monitors NME for landmarks and AUC/AP/accuracy for occlusion; the best head weights $\theta_h^{(2)}$ are saved.

3.2.3. Stage 3: Joint Fine-Tuning

In Stage 3, the per-landmark temperature vector $T$ continues to condition the occlusion head, while the backbone remains image-only but is unfrozen for modest adaptation. Load $\theta_b^{(1)}$ and $\theta_h^{(2)}$, unfreeze $\theta_b$, and fine-tune the whole network end to end with a smaller learning rate:
$$(\theta_b, \theta_h) \leftarrow \mathrm{Adam}_{\eta_3}\big((\theta_b, \theta_h),\ \mathcal{L}_{\text{total}}(h_{\theta_h}(b_{\theta_b}(X), T), Y)\big)$$
Joint fine-tuning enables a modest adaptation of the shared features to support the combined localisation and occlusion objective more effectively. At the same time, the earlier staged specialisation reduces variance and helps avoid collapse during end-to-end optimisation. We validate final task metrics each epoch and use checkpointing/early stopping on $\mathcal{L}_{\text{total}}$.
In summary, Stage 1 shapes thermal-specific spatial features; Stage 2 specialises the task heads with explicit temperature conditioning and a frozen encoder; Stage 3 aligns the shared features to the joint objective with a small learning rate. This progression empirically improves optimisation stability and final accuracy (see Section 5).
Algorithm 1 M3MSTL training algorithm
Require: Backbone b, head h, dataset $\mathcal{D} = \{(X_i, M_i, T_i, Y_i)\}$
Require: Hyperparameters: $\mathrm{epochs}_1, \mathrm{epochs}_2, \mathrm{epochs}_3$, learning rates $\eta_1, \eta_2, \eta_3$, batch size B
1:  Initialize backbone parameters $\theta_b$ and head parameters $\theta_h$
2:  Stage 1: Backbone pre-training
3:  for epoch = 1 to $\mathrm{epochs}_1$ do
4:    for each minibatch (X, M) of B samples from $\mathcal{D}$ do
5:      $\hat{M} \leftarrow b(X)$;  $\mathrm{loss\_mask} \leftarrow \mathrm{BCE}(\hat{M}, M)$
6:      Update $\theta_b$ using $\mathrm{Adam}_{\eta_1}$ to minimize $\mathrm{loss\_mask}$
7:    end for
8:    Validate mask performance; checkpoint $\theta_b$
9:  end for
10: Save Stage-1 backbone weights $\theta_b^{(1)}$
11: Stage 2: Head pre-training
12: Load $\theta_b^{(1)}$ and freeze backbone b
13: for epoch = 1 to $\mathrm{epochs}_2$ do
14:   for each minibatch (X, T, Y) from $\mathcal{D}$ do
15:     $F \leftarrow b(X)$;  $\hat{Y} \leftarrow h(F, T)$;  $\mathcal{L} \leftarrow \mathcal{L}_{\text{total}}(\hat{Y}, Y)$
16:     Update $\theta_h$ using $\mathrm{Adam}_{\eta_2}$ to minimize $\mathcal{L}$
17:   end for
18:   Validate metrics; checkpoint $\theta_h$
19: end for
20: Save Stage-2 head weights $\theta_h^{(2)}$
21: Stage 3: Joint fine-tuning
22: Load $\theta_b^{(1)}$, $\theta_h^{(2)}$; unfreeze $\theta_b$
23: for epoch = 1 to $\mathrm{epochs}_3$ do
24:   for each minibatch (X, T, Y, M) from $\mathcal{D}$ do
25:     $F \leftarrow b(X)$;  $\hat{Y} \leftarrow h(F, T)$;  $\mathcal{L} \leftarrow \mathcal{L}_{\text{total}}(\hat{Y}, Y)$
26:     Update $\theta_b, \theta_h$ using $\mathrm{Adam}_{\eta_3}$ to minimize $\mathcal{L}$
27:   end for
28:   Validate final metrics; checkpoint final weights
29: end for
30: Save final model weights $\theta_{b,h}^{\text{final}}$
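The freeze/unfreeze mechanics of this curriculum translate naturally into Keras. The sketch below is a simplified illustration under assumed names (backbone, mask_head, full_model, mask_ds, task_ds) and the learning rates of Section 4.4; it is not the released training script, and the landmark loss is replaced by MAE purely to keep the sketch short (the paper uses Wing loss, sketched in Section 3.1.3).

```python
# Simplified illustration of the three-stage M3MSTL curriculum (assumed model/dataset names).
import tensorflow as tf

def run_m3mstl(backbone, mask_head, full_model, mask_ds, task_ds,
               epochs=(15, 20, 25), lrs=(1e-4, 1e-4, 1e-5), lambda_occl=1.0):
    # Placeholder losses: the paper uses Wing loss for landmarks; MAE keeps this sketch short.
    losses = {"landmarks": "mae", "occlusion": "binary_crossentropy"}
    weights = {"landmarks": 1.0, "occlusion": lambda_occl}

    # Stage 1: pretrain the backbone on dense face-mask prediction (pixel-wise BCE).
    stage1 = tf.keras.Sequential([backbone, mask_head])
    stage1.compile(optimizer=tf.keras.optimizers.Adam(lrs[0]),
                   loss="binary_crossentropy")
    stage1.fit(mask_ds, epochs=epochs[0])

    # Stage 2: freeze the backbone and train only the task heads,
    # which also receive the per-landmark temperature vector T.
    backbone.trainable = False
    full_model.compile(optimizer=tf.keras.optimizers.Adam(lrs[1]),
                       loss=losses, loss_weights=weights)
    full_model.fit(task_ds, epochs=epochs[1])

    # Stage 3: unfreeze everything and fine-tune end to end at one-tenth the rate.
    backbone.trainable = True
    full_model.compile(optimizer=tf.keras.optimizers.Adam(lrs[2]),
                       loss=losses, loss_weights=weights)
    full_model.fit(task_ds, epochs=epochs[2])
    return full_model
```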

4. Experimental Setup

4.1. TFD68 Dataset

TFD68 is a publicly released paired thermal–visible facial dataset designed for occlusion-aware thermal landmark research [12], as shown in Figure 2. Thermal images were captured with a FLIR A400 streaming camera (320 × 240, 7.5–14 μm spectral range, ±2 °C accuracy, ∼40 mK sensitivity), while synchronised visible-light pairs were recorded by the camera's integrated visible sensor (640 × 480). The collection occurred in a climate-controlled laboratory at 24 °C, with automated capture driven by a rotating platform and triggered by an Arduino. This setup minimised background thermal artifacts and standardised subject-to-camera geometry [12].
To provide dense pose coverage, head yaw was sampled every 15° from −90° to +90° (13 positions) and pitch every 15° from −30° to +30° (five positions), yielding 65 pose combinations per subject. A two-stage occlusion protocol was employed to minimise residual microclimate effects: an initial non-occluded capture following a five-minute acclimatisation period, and a subsequent capture with participant-worn or provided occluders (e.g., eyeglasses, surgical masks, hijabs). The dataset also contains per-pixel thermal maps, registered RGB–thermal pairs, and seven posed expression categories (neutral, happiness, sadness, fear, anger, surprise, disgust), enabling both geometric and affective evaluations [12].
Annotations were generated cross-modally: 68-point 3D landmarks were estimated on RGB frames (3DDFA-V2) and transferred to registered thermal images. Meanwhile, landmark visibility (occlusion) labels were derived via a simulated 3D head model for pose-induced self-occlusion and temperature-based heuristics to detect accessory-induced occlusion. All automated annotations and occlusion labels were subsequently validated and corrected to ensure high-quality landmark coordinates, per-landmark visibility flags, and pixel-wise occlusion masks. These properties make TFD68 well suited for developing and benchmarking occlusion-aware thermal landmark detectors and temperature-based biometric analyses [12].

4.2. Data Preparation

We adopt a deterministic, subject-wise partition to eliminate identity leakage across folds: the dataset is split by subject identifier into training, validation, and test sets with ratios of 60/20/20, respectively. The split is saved along with the random seed to ensure reproducibility. For each image, the ground-truth face bounding box is used to crop the face region; the crop origin and scale factor are recorded, and the crop is resized to a fixed resolution of 256 × 256. Recording the crop parameters enables exact, invertible transforms between original-image coordinates and the cropped coordinate frame for all downstream annotations and measurements.
Original landmark coordinates $(x, y)$ are transformed into the crop coordinate system using the recorded origin and scale, then normalised to $[0, 1]$ by dividing x by the crop width and y by the crop height. Per-landmark scalar temperatures are sampled from the co-registered thermal map at the corresponding transformed locations and stored alongside the normalised coordinates for use by the model heads. Binary masks are generated from the annotated landmarks to indicate valid face regions, as shown in Figure 3, and are resized to match the crop resolution and any downstream input sizes.
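The coordinate transform and temperature sampling can be expressed compactly. The NumPy sketch below illustrates the idea under assumed array conventions (a recorded crop origin and scale, and a thermal map co-registered with the resized crop); it is not the dataset's actual preprocessing code.

```python
# Illustrative sketch: map landmark coordinates into the crop frame, normalise
# them to [0, 1], and sample per-landmark temperatures at those locations.
import numpy as np

def prepare_sample(landmarks_xy, crop_origin, crop_scale, crop_size, thermal_crop):
    """landmarks_xy: (K, 2) pixel coordinates in the original image.
    crop_origin:  (x0, y0) of the face crop in the original image.
    crop_scale:   resize factor from the raw crop to crop_size.
    crop_size:    (W, H) of the resized crop, e.g. (256, 256).
    thermal_crop: (H, W) temperature map co-registered with the resized crop."""
    crop_xy = (np.asarray(landmarks_xy, dtype=np.float32)
               - np.asarray(crop_origin, dtype=np.float32)) * crop_scale
    norm_xy = crop_xy / np.asarray(crop_size, dtype=np.float32)   # in [0, 1] per axis
    # Sample temperatures at the transformed (crop-frame) landmark locations.
    cols = np.clip(np.round(crop_xy[:, 0]).astype(int), 0, thermal_crop.shape[1] - 1)
    rows = np.clip(np.round(crop_xy[:, 1]).astype(int), 0, thermal_crop.shape[0] - 1)
    temps = thermal_crop[rows, cols]
    return norm_xy, temps
```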

4.3. Baseline Models

We evaluate a diverse set of convolutional and transformer-based backbones chosen to balance depth, multi-scale feature extraction, and spatial resolution. To preserve high-resolution spatial information, we evaluate an HRNet-inspired backbone that fuses multi-scale streams in parallel. This architecture has shown clear advantages for pixel-level tasks, such as face alignment. A lightweight hourglass design is also considered; its single downsampling–upsampling pathway with skip connections captures bottom-up and top-down context in a compact stacked-hourglass style. Complementing these, we test a SegFormer-inspired backbone (utilising strided convolutions and transpose convolutions) and a classic U-Net backbone; the latter’s symmetric encoder–decoder with skip connections is well suited to capturing context while enabling precise localisation. The final landmark regressor utilises the same compact, fully connected layers as described in Section 3.
For occlusion prediction, we implement a spectrum of head designs that reflect common strategies in the literature, ranging from simple global classifiers to multi-scale CNNs and transformer-based modules. The lightweight MLP head pools the spatial features and passes them through dense layers with dropout to produce per-landmark occlusion scores, serving as a fast baseline. An ODN-inspired head enhances the convolutional backbone with residual convolutional blocks, an occlusion-distillation style branch, and a low-rank recovery component, explicitly down-weighting corrupted regions and reconstructing missing features, thereby trading additional parameters for improved robustness under heavy occlusion. An HRNet-inspired head builds parallel multi-resolution branches (each with pooling/upsampling and 1 × 1 convolutions) and fuses them by adding features at multiple scales before prediction.
In contrast, the hourglass head implements a single pooling-and-upsampling fusion that mirrors the multi-scale behaviour of larger stacked hourglass networks. We also use a Mask R-CNN–inspired head that predicts per-landmark spatial masks via 3 × 3 convolutions and upsampling; these mask logits are then globally pooled and passed to an MLP to yield occlusion probabilities, following the intuition that pixel-wise segmentation can inform visibility classification. Finally, the ViT head treats the spatial feature tensor as a grid of patch tokens and applies transformer layers (multi-head self-attention and MLPs) to capture long-range dependencies and relate distant facial regions, an approach that has proven effective for occlusion recovery when coupled with attention-based occlusion modules. In all heads, we fuse per-landmark temperature into the prediction pipeline and apply dropout for regularisation (0.3 in the MLP and mask heads and 0.1–0.2 in the transformer head). This collection of backbone and head architectures provides flexible capacity for modelling occlusion cues, ranging from global statistical signals to spatially detailed segmentation and global attention-based recovery.

4.4. Hyperparameters

Unless otherwise stated, experiments use a batch size of 8. Training follows the three-stage scheme: Stage 1 runs for 15 epochs (1000 steps) with a learning rate of $1 \times 10^{-4}$; Stage 2 runs for 20 epochs (1000 steps) with the same learning rate of $1 \times 10^{-4}$; and Stage 3 runs for 25 epochs (200 steps) with a learning rate equal to one-tenth of the Stage 2 learning rate. We use the Adam optimiser and set the occlusion-loss weight to 1.0. Backbone dropout is set to 0.5 for the U-Net backbone (0.3 for other backbones). The landmark head uses a dropout of 0.3, the ViT blocks use a dropout of 0.1, and the final ViT MLP uses a dropout of 0.2.
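For reference, these settings can be collected into a single configuration object; the dictionary below simply restates the values above, with field names of our own choosing.

```python
# Hyperparameter summary restated as a config dict (field names are illustrative).
M3MSTL_CONFIG = {
    "batch_size": 8,
    "optimizer": "adam",
    "occlusion_loss_weight": 1.0,
    "stage1": {"epochs": 15, "steps": 1000, "lr": 1e-4},   # backbone mask pretraining
    "stage2": {"epochs": 20, "steps": 1000, "lr": 1e-4},   # heads with frozen backbone
    "stage3": {"epochs": 25, "steps": 200,  "lr": 1e-5},   # joint fine-tuning (0.1 x Stage 2 lr)
    "dropout": {
        "backbone_unet": 0.5, "backbone_other": 0.3,
        "landmark_head": 0.3, "vit_blocks": 0.1, "vit_mlp": 0.2,
    },
}
```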

4.5. Evaluation Metrics

This section defines the metrics used to evaluate (1) facial landmark localisation and (2) per-landmark occlusion classification. For landmark localisation, we use the Normalised Mean Error (NME), but we replace the common inter-ocular normalisation factor with a triangulation-derived length that is robust to extreme yaw. For occlusion classification, we report standard binary classification metrics (accuracy, precision, recall, F1).

4.5.1. Landmark Localisation Metrics

Given a test image containing K landmark points, let $p_i = (p_{i,x}, p_{i,y})$ denote the predicted pixel coordinates, while $g_i = (g_{i,x}, g_{i,y})$ denotes the ground-truth pixel coordinates for landmark i. The per-sample Mean Error (ME) is the average Euclidean distance between prediction and ground truth:
$$\mathrm{ME} = \frac{1}{K}\sum_{i=1}^{K} \lVert p_i - g_i \rVert_2$$
The Normalised Mean Error (NME) divides the ME by a normalisation length d to produce a scale-invariant error:
$$\mathrm{NME} = \frac{1}{K}\sum_{i=1}^{K} \frac{\lVert p_i - g_i \rVert_2}{d}$$
To reduce sensitivity to extreme head yaw, which can make the inter-ocular distance very small and inflate normalised errors, we use a normalisation factor derived from the area of a triangle formed by three robust facial landmarks: the left outer eye corner (landmark 37), the right outer eye corner (landmark 46), and the chin (landmark 9), denoted A, B, and C, respectively, with coordinates $A = (x_A, y_A)$, $B = (x_B, y_B)$, $C = (x_C, y_C)$. The signed area of the triangle is
$$\mathcal{A}(A, B, C) = \tfrac{1}{2}\big[x_A(y_B - y_C) + x_B(y_C - y_A) + x_C(y_A - y_B)\big]$$
We then take the normalisation length to be the square root of the (absolute) triangle area:
$$d_{\mathrm{tri}} = \sqrt{\lvert \mathcal{A}(A, B, C) \rvert}$$
Finally, the triangulation-normalised NME is
$$\mathrm{NME}_{\mathrm{tri}} = \frac{1}{K}\sum_{i=1}^{K} \frac{\lVert p_i - g_i \rVert_2}{d_{\mathrm{tri}}}$$
The triangle formed by two eye corners and the chin tends to retain a substantial geometric extent even under extreme yaw. When the face rotates, the projected inter-ocular distance can shrink, but the vertical separation to the chin helps to keep the triangle area sufficiently large. Taking the square root of the area yields a length-scale with the same units (pixels) as the Euclidean errors, making it a natural normaliser.
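A compact NumPy sketch of this metric is given below; the 0-based array indices for the eye corners and chin are assumptions based on the 68-point convention referenced above.

```python
# Illustrative sketch: triangulation-normalised NME for one sample.
import numpy as np

# Assumed 0-based indices of landmark 37 (left outer eye corner),
# 46 (right outer eye corner), and 9 (chin) in the 68-point convention.
LEFT_EYE, RIGHT_EYE, CHIN = 36, 45, 8

def nme_tri(pred, gt):
    """pred, gt: (K, 2) arrays of pixel coordinates."""
    a, b, c = gt[LEFT_EYE], gt[RIGHT_EYE], gt[CHIN]
    # Signed triangle area via the shoelace formula.
    area = 0.5 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    d_tri = np.sqrt(np.abs(area))                 # normalisation length in pixels
    errors = np.linalg.norm(pred - gt, axis=1)    # per-landmark Euclidean errors
    return float(np.mean(errors / d_tri))
```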

4.5.2. Occlusion Classification Metrics

Occlusion detection is treated as a binary classification problem per landmark. After applying a decision threshold to the model's occlusion probability (commonly 0.5), predictions are summarised by the confusion matrix. From these counts we compute the standard thresholded metrics:
$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
These metrics are interpretable and straightforward, but they depend on the chosen operating threshold and can be misleading in the presence of class imbalance. Because occlusion outputs are probabilistic, we also report threshold-agnostic ranking metrics that evaluate performance across all decision thresholds. The Receiver Operating Characteristic (ROC) curve plots True-Positive Rate (TPR) against the False-Positive Rate (FPR) as the decision threshold varies. The area under the ROC curve (ROC–AUC) summarises this curve:
$$\mathrm{ROC\text{–}AUC} = \int_{0}^{1} \mathrm{TPR}(\tau)\, \mathrm{d}\,\mathrm{FPR}(\tau)$$
The Precision–Recall (PR) curve plots precision versus recall across thresholds; its area (Average Precision, AP) is especially informative with class imbalance:
$$\mathrm{AP} = \int_{0}^{1} \mathrm{Precision}(r)\, \mathrm{d}\,\mathrm{Recall}(r)$$
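In practice these quantities can be computed directly from the per-landmark probabilities and labels, for example with scikit-learn as sketched below; the array names are placeholders.

```python
# Illustrative sketch: thresholded and threshold-agnostic occlusion metrics.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

def occlusion_metrics(y_true, y_prob, threshold=0.5):
    """y_true: flattened binary occlusion labels; y_prob: predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),        # threshold-agnostic ranking
        "ap": average_precision_score(y_true, y_prob),   # area under the PR curve
    }
```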

5. Results and Discussion

Table 1 shows a consistent benefit from the proposed multi-stage multi-task learning (M3MSTL) protocol: most backbone/head combinations gain occlusion accuracy and reduce NME relative to their baseline counterparts. By baseline occlusion accuracy, the strongest entries are ResNet+MLP (90.8%), ResNet+Hourglass (90.4%), and a near-tie for third place at 89.8% for U-Net+ViT and U-Net+HRNet. Their M3MSTL results show modest-to-substantial improvements: ResNet+MLP increases to 91.2% (+0.46%) while lowering NME from 0.367 to 0.245 (−32.88%); ResNet+Hourglass drops slightly in accuracy to 90.0% (−0.48%) but reduces NME from 0.422 to 0.269 (−36.16%); and U-Net+ViT improves occlusion accuracy to 91.7% (+2.22%) and reduces NME from 0.508 to 0.338 (−33.41%). These results indicate that, even for already strong backbones, M3MSTL tends to preserve or improve occlusion detection while substantially reducing localisation error.
The smallest baseline NMEs are all ResNet variants: ResNet+MLP (0.367), ResNet+ViT (0.382), and ResNet+Hourglass (0.422). Under M3MSTL, these three models achieve some of the lowest final NMEs in the table: ResNet+MLP (0.245, −32.88%), ResNet+ViT (0.246, −35.47%), and ResNet+Hourglass (0.269, −36.16%). Crucially, the ResNet+ViT pairing closes nearly all of the gap to ResNet+MLP on landmark localisation while providing superior occlusion accuracy.
The highest M3MSTL occlusion accuracies are ResNet+ViT (91.8%), ResNet+MLP (91.2%), and HRNet+HRNet (90.7%). ResNet+ViT achieves the best occlusion performance (91.8%) while producing an NME of 0.246, which is only marginally worse than the best localisation result (ResNet+MLP with an NME of 0.245). In other words, ResNet+ViT achieves state-of-the-art occlusion detection in our suite while maintaining an effective tie with the top landmark regressor in terms of localisation quality, which is a desirable trade-off for occlusion-aware thermal applications.
The lowest NMEs after M3MSTL are led by ResNet+MLP (0.245), ResNet+ViT (0.246) and ResNet+Mask-RCNN (0.246). All three started from stronger ResNet baselines and benefited from large relative reductions in NME (roughly −33% to −36%). These reductions demonstrate that M3MSTL achieves a robust decrease in localisation error across high-performing encoders.
The top three models by M3MSTL AUC are ResNet+ViT (AUC = 0.974), ResNet+Mask-R-CNN (AUC = 0.972) and ResNet+HRNet (AUC = 0.972). These three models are tightly clustered at the top of the ranking, but ResNet+ViT achieves the highest AUC, indicating that it maintains superior true-positive rates at low false-positive rates compared to alternatives. The ROC curves for these three models are plotted in Figure 4. There, the ResNet+ViT curve consistently dominates or closely matches the other two across most operating points, which supports its strong threshold-agnostic discrimination, as reported in the table.
ResNet+ViT also leads for AP (AP = 0.966), followed by ResNet+Mask-R-CNN (AP = 0.963) and ResNet+HRNet (AP = 0.963). The high AP for ResNet+ViT indicates that, in addition to producing good ranking scores (high AUC), it also maintains a favourable precision–recall trade-off. It delivers relatively few false positives even when operating at recall levels that would be useful in downstream pipelines. The precision–recall curves in Figure 4 show that ResNet+ViT retains higher precision than the other two across a wide range of recalls, which explains its leading AP score and makes it the most reliable choice when a conservative (high-precision) occlusion flag is desired.
Some architectures show dramatic relative gains in occlusion accuracy (e.g., SegFormer+ViT, +41.23%), but these are not always accompanied by improved localisation (SegFormer+ViT NME increased slightly by +0.30%). Conversely, HRNet+HRNet demonstrates a significant NME reduction (−66.26%) while also achieving competitive occlusion accuracy of 90.7%. U-Net variants with occlusion-aware heads typically achieve sizable combined improvements, where U-Net+ODN achieves a +5.51% accuracy increase and a 45.16% NME reduction, illustrating that explicit occlusion modelling and recovery modules can substantially help localisation under occlusion.
The overall pattern suggests that M3MSTL's principal benefit comes from (1) staged training that stabilises feature learning for the backbone and heads and (2) joint conditioning on explicit thermal cues and occlusion supervision, which helps the heads learn to ignore unreliable local evidence. ViT-based occlusion heads excel at aggregating distant contextual cues to determine visibility, while simple MLP regressors attached to strong ResNet encoders remain exceptionally efficient and accurate for coordinate regression. The best practical compromise in our results is ResNet+ViT, which maximises occlusion detection (91.8% accuracy, an AUC of 0.974, and an AP of 0.966) while achieving near-optimal localisation, making it particularly attractive for temperature-sensitive downstream tasks that require both reliable occlusion flags and accurate landmarks.
A few combinations (notably some SegFormer variants) show significant occlusion gains, but either do not improve or slightly worsen localisation; these cases warrant further analysis to understand when occlusion-focused modules harm spatial precision.
Finally, Figure 5 illustrates qualitative results from the ResNet+ViT M3MSTL variant. In each subfigure, the ground-truth landmarks are shown in red, the model predictions appear in green, and white markers denote annotated occluded landmarks. The examples highlight three useful behaviours: when landmarks are visible (e.g., nose and mouth in unobstructed regions) the green predictions closely overlap the red ground truth, indicating accurate localisation; when accessories (mask, glasses) or extreme poses create occlusion, the white markers indicate those ground-truth occlusions and the ViT head typically assigns high occlusion probability to those points, producing either suppressed or lower-confidence predictions; and in pose-induced self-occlusion, the ViT leverages long-range context to preserve plausible relative geometry for non-occluded landmarks while flagging the hidden side. Overall, the visualisation shows that M3MSTL with a ViT occlusion head detects occluded landmarks and maintains accurate localisation for visible landmarks, supporting the quantitative improvements reported above.
The training–inference trade-offs reported in Table 2 show that the M3MSTL protocol increases training cost because of its staged optimisation, yet it preserves efficient inference for all backbone/head combinations. Notably, the ResNet–ViT M3MSTL variant achieves a strong operating point, attaining state-of-the-art occlusion accuracy while maintaining the highest ROC–AUC and AP, all while maintaining competitive throughput (approximately 225 FPS). Although a few lighter configurations yield marginally higher frame rates, they do so at the expense of occlusion performance. This accuracy–speed balance makes ResNet–ViT well suited for real-time thermal landmark estimation, where robust occlusion flags are required. Conversely, practitioners with tight latency budgets may prefer U-Net–ViT or ResNet–Mask R-CNN variants as pragmatic alternatives when a slight drop in accuracy is acceptable. In summary, M3MSTL trades extra training time for substantially improved robustness to occlusion while maintaining an inference speed sufficient for real-world deployment.
Across Table 3, the paired t-tests show firm evidence that the M3MSTL models differ from the baseline models: almost every comparison has a p-value effectively equal to zero (many reported as $< 10^{-300}$ in the table), meaning the null hypothesis of no mean difference is rejected at any usual significance level. The reported statistic is the paired t for the difference defined as $\mathrm{diff} = \text{Baseline} - \text{M3MSTL}$. Thus a positive t (and positive Cohen's d) for $\overline{\mathrm{NME}}$ indicates that the baseline $\overline{\mathrm{NME}}$ is larger than the M3MSTL $\overline{\mathrm{NME}}$. Conversely, a negative t and negative d for AUC/AP/$\mathrm{Acc}_{\mathrm{Occl}}$ indicate that the M3MSTL scores are higher than the baseline.
The most extreme significance (largest |t|) occurs in the AP comparisons (e.g., Hourglass/Mask-RCNN AP with $|t| \approx 122.52$, $p < 10^{-300}$), reflecting massive, highly consistent per-sample improvements. The least significant comparisons are concentrated in a few SegFormer+MLP/SegFormer+Hourglass entries (for example, SegFormer/MLP $\mathrm{Acc}_{\mathrm{Occl}}$ has $|t| \approx 0.62$ and $p \approx 0.535$, and SegFormer/Hourglass $\overline{\mathrm{NME}}$ has $|t| \approx 1.32$, $p \approx 0.185$), where we cannot claim a reliable change.
Looking at effect sizes, which quantify the practical magnitude of the paired differences, ResNet50 and U-Net variants show moderate positive d for $\overline{\mathrm{NME}}$ (e.g., ResNet50+ViT $d \approx 0.368$, U-Net+ViT $d \approx 0.347$), meaning M3MSTL reduces $\overline{\mathrm{NME}}$ by a meaningful amount. For AUC/AP/$\mathrm{Acc}_{\mathrm{Occl}}$, the Cohen's d values are negative, with the largest practical gains often reported for ResNet50+Mask-RCNN (e.g., AUC $d \approx -0.626$, AP $d \approx -0.491$, $\mathrm{Acc}_{\mathrm{Occl}}$ $d \approx -0.528$). ResNet50+ViT and U-Net+ViT show smaller but consistent improvements (absolute $d \approx 0.28$–$0.37$ for AUC/AP/Acc), indicating that M3MSTL also helps these models, albeit to a lesser degree than ResNet50+Mask-RCNN.
Overall, M3MSTL yields statistically significant and practically meaningful improvements in most cases. The strongest effects (largest | t | and | d | ) are typically observed in AP and AUC, and are especially pronounced for Mask-RCNN heads on ResNet50 backbones. A few SegFormer+MLP comparisons show little to no significant change.
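For reproducibility, the paired test and effect size used here can be computed as sketched below with SciPy; the per-sample metric arrays are placeholders, and Cohen's d is the standard paired-sample formulation (mean difference divided by the standard deviation of the differences), which we assume matches the reported values.

```python
# Illustrative sketch: paired t-test and Cohen's d on per-sample metric values,
# with diff = baseline - M3MSTL as defined in the text.
import numpy as np
from scipy import stats

def paired_comparison(baseline_metric, m3mstl_metric):
    """baseline_metric, m3mstl_metric: per-sample values of the same metric."""
    baseline_metric = np.asarray(baseline_metric, dtype=float)
    m3mstl_metric = np.asarray(m3mstl_metric, dtype=float)
    t_stat, p_value = stats.ttest_rel(baseline_metric, m3mstl_metric)
    diff = baseline_metric - m3mstl_metric
    cohens_d = diff.mean() / diff.std(ddof=1)   # paired-sample effect size
    return t_stat, p_value, cohens_d
```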

6. Conclusions and Outlook

We introduced a multi-stage multi-task learning strategy for occlusion-aware facial landmarks in thermal images. The architecture combines a ResNet-50 encoder with two task-specific heads: a compact fully connected regressor for coordinates and a ViT occlusion head that leverages long-range self-attention and per-landmark temperature signals. The three-stage curriculum stabilises optimisation and encourages the backbone to learn generalisable spatial features while permitting heads to specialise in their objectives.
Empirically, M3MSTL yields consistent and substantial gains: most backbone/head combinations show higher occlusion accuracy and reduced mean NME relative to single-stage baselines. The ResNet+ViT M3MSTL variant is the best practical trade-off in our suite, delivering 91.8% occlusion accuracy, mean NME 0.246, ROC–AUC 0.974, AP 0.966, and competitive throughput (≈225 FPS). Statistical testing corroborates these improvements: paired t-tests (Table 3) reject the null hypothesis for the vast majority of baseline–M3MSTL pairs (many p-values effectively zero), indicating that the observed gains are highly unlikely to be due to chance. Effect-size analysis shows moderate, practically meaningful NME reductions for ResNet50 and U-Net variants (e.g., ResNet50+ViT Cohen's $d \approx 0.37$) and consistent, positive improvements in occlusion metrics (AUC/AP/accuracy with $|d| \approx 0.28$–$0.31$ for ResNet50+ViT), while a small subset of SegFormer combinations show no reliable change.
Limitations and future work: Some architecture pairings produce strong occlusion gains without matching localisation improvements, suggesting that occlusion-focused modules can occasionally harm spatial precision. Future work will analyse per-occluder failure modes, pursue lightweight compression for embedded deployment, explore cross-modal distillation while preserving thermal semantics, and expand occluder diversity to strengthen generalisation. Overall, M3MSTL with a ResNet+ViT hybrid provides a practical, statistically validated solution for occlusion-aware thermal landmarking.

Author Contributions

Conceptualization, B.R.; Methodology, B.R. and Y.C.N.; Software, Y.C.N.; Validation, B.R. and Y.C.N.; Formal analysis, B.R. and Y.C.N.; Investigation, Y.C.N. and B.R.; Resources, B.R.; Data curation, Y.C.N.; Writing—original draft, Y.C.N.; Writing—review and editing, B.R.; Visualization, Y.C.N.; Supervision, A.G.B., F.C. and B.R.; Project administration, B.R.; Funding acquisition, S.A.S., J.H.C. and B.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Grant Scheme (FRGS), Ministry of Higher Education Malaysia, grant number FRGS/1/2023/TK0/HWUM/02/1.

Institutional Review Board Statement

This study was approved by the Ethics Committee of the School of Engineering and Physical Sciences (EPS), Heriot-Watt University (protocol code: 2023-5418-8035, approval date: 19 July 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. A total of 137 subjects participated in thermal face biometric data collection. All participants signed informed consent forms prior to participation. Any identifiable illustrative images appearing in the figures were taken from members of our research team who provided explicit consent for their use in this manuscript.

Data Availability Statement

The dataset, code, and pre-trained models supporting this study are publicly available at: https://github.com/lucas-nyc/M3MSTL (accessed on 29 December 2025).

Acknowledgments

The authors thank the Heriot-Watt University Malaysia research facilities for providing technical support during data collection.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Baskaran, R.; Moller, K.; Wiil, U.K.; Brabrand, M. Using Facial Landmark Detection on Thermal Images as a Novel Prognostic Tool for Emergency Departments. Front. Artif. Intell. 2022, 5, 815333. [Google Scholar] [CrossRef] [PubMed]
  2. Qudah, M.A.; Mohamed, A.; Lutfi, S. Analysis of Facial Occlusion Challenge in Thermal Images for Human Affective State Recognition. Sensors 2023, 23, 3513. [Google Scholar] [CrossRef] [PubMed]
  3. Lin, C.; Zhu, B.; Wang, Q.; Liao, R.; Qian, C.; Lu, J.; Zhou, J. Structure-Coherent Deep Feature Learning for Robust Face Alignment. IEEE Trans. Image Process. 2021, 30, 5313–5326. [Google Scholar] [CrossRef] [PubMed]
  4. Flotho, P.; Piening, M.; Kukleva, A.; Steidl, G. T-FAKE: Synthesizing Thermal Images for Facial Landmarking. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26356–26366. [Google Scholar] [CrossRef]
  5. Kuzdeuov, A.; Koishigarina, D.; Aubakirova, D.; Abushakimova, S.; Varol, H.A. SF-TL54: A Thermal Facial Landmark Dataset with Visual Pairs. In Proceedings of the 2022 IEEE/SICE International Symposium on System Integration (SII), Narvik, Norway, 9–12 January 2022; pp. 748–753. [Google Scholar] [CrossRef]
  6. Chu, W.T.; Liu, Y.H. Thermal Facial Landmark Detection by Deep Multi-Task Learning. In Proceedings of the 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), Kuala Lumpur, Malaysia, 27–29 September 2019; pp. 1–6. [Google Scholar] [CrossRef]
  7. Ding, H.; Zhou, P.; Chellappa, R. Occlusion-Adaptive Deep Network for Robust Facial Expression Recognition. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–9. [Google Scholar] [CrossRef]
  8. Chiang, J.C.; Hu, H.N.; Hou, B.S.; Tseng, C.Y.; Liu, Y.L.; Chen, M.H.; Lin, Y.Y. ORFormer: Occlusion-Robust Transformer for Accurate Facial Landmark Detection. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 784–793. [Google Scholar] [CrossRef]
  9. Wahid, J.A.; Xu, X.; Ayoub, M.; Husssain, S.; Li, L.; Shi, L. A hybrid ResNet–ViT approach to bridge the global and local features for myocardial infarction detection. Sci. Rep. 2024, 14, 4359. [Google Scholar] [CrossRef] [PubMed]
  10. Jiang, C.; Ren, H.; Yang, H.; Huo, H.; Zhu, P.; Yao, Z.; Li, J.; Sun, M.; Yang, S. M2FNet: Multi-modal fusion network for object detection from visible and thermal infrared images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103918. [Google Scholar] [CrossRef]
  11. Rakhimzhanova, T.; Kuzdeuov, A.; Varol, H.A. AnyFace++: Deep Multi-Task, Multi-Domain Learning for Efficient Face AI. Sensors 2024, 24, 5993. [Google Scholar] [CrossRef] [PubMed]
  12. Ng, Y.C.; Belyaev, A.G.; Choong, F.; Suandi, S.A.; Chuah, J.H.; Rudrusamy, B. TFD68: A Fully Annotated Thermal Facial Dataset with 68 Landmarks, Pose Variations, Per-Pixel Thermal Maps, Visual Pairs, Occlusions, and Facial Expressions. In Proceedings of the SIGGRAPH Asia 2025 Technical Communications, Hong Kong, China, 15–18 December 2025. [Google Scholar] [CrossRef]
  13. Zhu, M.; Shi, D.; Zheng, M.; Sadiq, M. Robust Facial Landmark Detection via Occlusion-Adaptive Deep Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3481–3491. [Google Scholar] [CrossRef]
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  15. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar] [CrossRef]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar] [CrossRef]
  17. Mahmud, T. A Novel Multi-Stage Training Approach for Human Activity Recognition From Multimodal Wearable Sensor Data Using Deep Neural Network. IEEE Sens. J. 2021, 21, 4995–5004. [Google Scholar] [CrossRef]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  19. Feng, Z.H.; Kittler, J.; Awais, M.; Huber, P.; Wu, X.J. Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2235–2245. [Google Scholar] [CrossRef]
Figure 1. M3MSTL architecture. Stage 1: thermal image → backbone (mask pretraining). Stage 2: spatial features + landmark temperature → multi-task heads (backbone frozen); heads consume F s and T (per-landmark temperatures). Stage 3: end-to-end fine-tuning (backbone unfrozen); heads continue to consume F s and T.
Figure 2. Visual and thermal image examples showcasing different scenarios: (a) frontal pose, (b) occluded face, (c) facial expression, and (d) extreme pose angle.
Figure 3. Example mask images at various poses: (a) frontal pose; (b,c) side poses.
Figure 4. (a) ROC–AUC and (b) precision–recall curves for the top 3 best-performing backbone/head combinations. Solid lines represent M3MSTL results, while dotted lines represent baseline results.
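The ROC–AUC and average-precision values summarised in Figure 4 can be computed from per-landmark occlusion labels and predicted visibility scores. The sketch below uses scikit-learn on placeholder arrays; the variable names and data are assumptions, not the authors' evaluation code.

```python
# Sketch of the summary metrics behind Figure 4, computed with scikit-learn on
# placeholder per-landmark occlusion labels/scores (1 = occluded, 0 = visible).
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=68 * 100)                                  # placeholder labels
y_score = np.clip(0.6 * y_true + 0.5 * rng.random(y_true.size), 0.0, 1.0)   # placeholder scores

auc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)                        # points for the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)  # points for the PR curve
print(f"ROC-AUC = {auc:.3f}, AP = {ap:.3f}")
```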
Figure 5. Predicted landmarks (green) vs. ground truth (red) under different poses; white markers indicate annotated occluded landmarks. (a,b) Extreme angle with a face mask; (c) extreme angle with eyeglasses; (d) extreme angle.
Table 1. Comparison of the baseline (single-stage, single-modal, multi-task) and M3MSTL. Metrics: occlusion accuracy (Acc_Occl), mean NME, ROC–AUC (AUC), and average precision (AP). $\Delta\% = 100 \cdot (\mathrm{M3MSTL} - \mathrm{baseline}) / \mathrm{baseline}$.
Backbone | Head | Baseline (Acc_Occl / NME / AUC / AP) | M3MSTL (Acc_Occl / NME / AUC / AP) | Δ% (Acc_Occl / NME / AUC / AP)
Hourglass | Mask-RCNN | 0.632 / 1.727 / 0.662 / 0.540 | 0.887 / 1.717 / 0.954 / 0.941 | +40.31% / −0.56% / +43.98% / +74.17%
Hourglass | MLP | 0.628 / 1.722 / 0.702 / 0.568 | 0.640 / 1.707 / 0.702 / 0.568 | +1.92% / −0.87% / +0.00% / +0.00%
Hourglass | ViT | 0.640 / 1.729 / 0.705 / 0.570 | 0.900 / 1.712 / 0.961 / 0.949 | +40.65% / −0.98% / +36.24% / +66.56%
Hourglass | Hourglass | 0.628 / 1.725 / 0.929 / 0.901 | 0.857 / 1.699 / 0.929 / 0.901 | +36.40% / −1.52% / +0.00% / +0.00%
Hourglass | ODN | 0.637 / 1.729 / 0.699 / 0.573 | 0.899 / 1.718 / 0.962 / 0.951 | +41.05% / −0.61% / +37.58% / +66.10%
HRNet | HRNet | 0.796 / 1.724 / 0.870 / 0.835 | 0.907 / 0.582 / 0.968 / 0.958 | +13.96% / −66.26% / +11.31% / +14.71%
HRNet | MLP | 0.652 / 1.680 / 0.714 / 0.585 | 0.803 / 0.598 / 0.884 / 0.845 | +23.20% / −64.43% / +23.92% / +44.37%
HRNet | ViT | 0.808 / 1.694 / 0.878 / 0.848 | 0.904 / 0.611 / 0.965 / 0.954 | +11.88% / −63.92% / +9.88% / +12.62%
HRNet | Hourglass | 0.657 / 1.710 / 0.721 / 0.595 | 0.892 / 0.585 / 0.953 / 0.938 | +35.96% / −65.79% / +32.31% / +57.84%
HRNet | ODN | 0.789 / 1.724 / 0.860 / 0.822 | 0.905 / 0.585 / 0.964 / 0.954 | +14.62% / −66.06% / +12.08% / +16.10%
HRNet | Mask-RCNN | 0.756 / 1.716 / 0.836 / 0.778 | 0.905 / 0.628 / 0.965 / 0.955 | +19.74% / −63.43% / +15.42% / +22.83%
ResNet50 | MLP | 0.908 / 0.367 / 0.966 / 0.956 | 0.912 / 0.245 / 0.971 / 0.962 | +0.46% / −32.88% / +0.48% / +0.65%
ResNet50 | ODN | 0.894 / 0.443 / 0.958 / 0.944 | 0.907 / 0.263 / 0.971 / 0.961 | +1.43% / −40.73% / +1.38% / +1.79%
ResNet50 | HRNet | 0.891 / 0.421 / 0.954 / 0.938 | 0.911 / 0.259 / 0.972 / 0.963 | +2.24% / −38.50% / +1.86% / +2.70%
ResNet50 | Hourglass | 0.904 / 0.422 / 0.964 / 0.953 | 0.900 / 0.269 / 0.970 / 0.960 | −0.48% / −36.16% / +0.62% / +0.73%
ResNet50 | Mask-RCNN | 0.879 / 0.400 / 0.946 / 0.931 | 0.915 / 0.246 / 0.972 / 0.963 | +4.12% / −38.64% / +2.79% / +3.53%
ResNet50 | ViT (ours) | 0.897 / 0.382 / 0.962 / 0.949 | 0.918 / 0.246 / 0.974 / 0.966 | +2.23% / −35.47% / +1.30% / +1.80%
SegFormer | MLP | 0.629 / 1.725 / 0.641 / 0.524 | 0.629 / 1.729 / 0.641 / 0.524 | −0.03% / +0.18% / +0.00% / +0.00%
SegFormer | ViT | 0.637 / 1.726 / 0.685 / 0.554 | 0.900 / 1.732 / 0.964 / 0.953 | +41.23% / +0.30% / +40.77% / +72.01%
SegFormer | Hourglass | 0.627 / 1.726 / 0.643 / 0.524 | 0.850 / 1.724 / 0.923 / 0.893 | +35.67% / −0.10% / +43.57% / +70.46%
SegFormer | Mask-RCNN | 0.650 / 1.726 / 0.689 / 0.562 | 0.862 / 1.737 / 0.942 / 0.924 | +32.66% / +0.64% / +36.75% / +64.50%
SegFormer | HRNet | 0.690 / 1.726 / 0.742 / 0.670 | 0.895 / 1.731 / 0.961 / 0.948 | +29.72% / +0.30% / +29.51% / +41.64%
SegFormer | ODN | 0.672 / 1.728 / 0.734 / 0.600 | 0.894 / 1.732 / 0.959 / 0.946 | +33.00% / +0.18% / +30.73% / +57.67%
U-Net | MLP | 0.869 / 0.496 / 0.946 / 0.927 | 0.905 / 0.337 / 0.966 / 0.955 | +4.11% / −32.03% / +2.10% / +3.01%
U-Net | ViT | 0.898 / 0.508 / 0.947 / 0.947 | 0.917 / 0.338 / 0.973 / 0.965 | +2.22% / −33.41% / +2.68% / +1.87%
U-Net | Hourglass | 0.876 / 0.535 / 0.948 / 0.930 | 0.906 / 0.312 / 0.965 / 0.953 | +3.47% / −41.72% / +1.91% / +2.44%
U-Net | Mask-RCNN | 0.877 / 0.497 / 0.930 / 0.930 | 0.913 / 0.316 / 0.971 / 0.962 | +4.09% / −36.53% / +4.38% / +3.45%
U-Net | HRNet | 0.898 / 0.503 / 0.960 / 0.947 | 0.914 / 0.302 / 0.971 / 0.963 | +1.78% / −39.93% / +1.19% / +1.66%
U-Net | ODN | 0.861 / 0.538 / 0.921 / 0.921 | 0.909 / 0.295 / 0.971 / 0.961 | +5.51% / −45.16% / +5.37% / +4.39%
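The relative changes in the Δ% columns of Table 1 follow directly from the formula in the caption. The snippet below applies it to the ResNet-50 + ViT occlusion-accuracy pair as a worked example; the small gap to the published +2.23% suggests the table's deltas were computed from unrounded metric values.

```python
# Worked example of the Delta% definition from the Table 1 caption.
def delta_percent(m3mstl: float, baseline: float) -> float:
    """100 * (M3MSTL - baseline) / baseline."""
    return 100.0 * (m3mstl - baseline) / baseline

# ResNet-50 + ViT occlusion accuracy: 0.897 (baseline) -> 0.918 (M3MSTL).
print(f"{delta_percent(0.918, 0.897):+.2f}%")  # +2.34% from the rounded table values;
# Table 1 reports +2.23%, presumably computed from unrounded metrics.
```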
Table 2. Training and inference time summary. Training and inference times are listed in minutes; FPS denotes throughput in frames per second.
Backbone | Head | Baseline (Train (min) / Inf. (min) / FPS) | M3MSTL (Train (min) / Inf. (min) / FPS)
Hourglass | Hourglass | 2.76 / 0.116 / 437.92 | 25.17 / 0.114 / 443.57
Hourglass | HRNet | 8.92 / 0.269 / 188.27 | 40.60 / 0.269 / 188.67
Hourglass | Mask-RCNN | 3.99 / 0.132 / 399.93 | 27.74 / 0.131 / 388.08
Hourglass | MLP | 2.31 / 0.095 / 534.95 | 24.79 / 0.096 / 529.44
Hourglass | ODN | 7.03 / 0.216 / 244.13 | 41.40 / 0.212 / 238.76
Hourglass | ViT | 3.99 / 0.171 / 299.15 | 34.45 / 0.151 / 336.46
HRNet | Hourglass | 4.38 / 0.154 / 328.61 | 31.09 / 0.154 / 328.94
HRNet | HRNet | 9.56 / 0.346 / 148.21 | 50.07 / 0.340 / 148.89
HRNet | Mask-RCNN | 6.12 / 0.175 / 289.09 | 33.67 / 0.172 / 298.23
HRNet | MLP | 3.53 / 0.108 / 469.97 | 30.07 / 0.110 / 450.78
HRNet | ODN | 9.54 / 0.261 / 387.35 | 41.04 / 0.262 / 192.99
HRNet | ViT | 4.82 / 0.180 / 281.56 | 30.03 / 0.175 / 283.06
ResNet50 | Hourglass | 6.10 / 0.200 / 252.82 | 36.24 / 0.191 / 264.67
ResNet50 | HRNet | 10.23 / 0.342 / 148.31 | 51.49 / 0.346 / 146.41
ResNet50 | Mask-RCNN | 8.32 / 0.200 / 253.27 | 35.97 / 0.208 / 243.25
ResNet50 | MLP | 5.20 / 0.156 / 323.72 | 32.90 / 0.164 / 308.57
ResNet50 | ODN | 9.67 / 0.281 / 187.22 | 51.44 / 0.288 / 175.55
ResNet50 | ViT (ours) | 8.18 / 0.218 / 232.06 | 37.20 / 0.226 / 224.33
SegFormer | Hourglass | 2.44 / 0.116 / 435.56 | 25.99 / 0.115 / 441.04
SegFormer | HRNet | 8.58 / 0.268 / 189.16 | 36.04 / 0.268 / 188.84
SegFormer | Mask-RCNN | 3.44 / 0.132 / 376.67 | 22.70 / 0.132 / 383.92
SegFormer | MLP | 2.44 / 0.094 / 539.37 | 21.89 / 0.094 / 538.43
SegFormer | ODN | 6.60 / 0.213 / 239.97 | 39.22 / 0.211 / 239.39
SegFormer | ViT | 3.54 / 0.150 / 337.24 | 23.23 / 0.151 / 336.35
U-Net | Hourglass | 7.21 / 0.238 / 217.09 | 37.21 / 0.233 / 217.20
U-Net | HRNet | 11.71 / 0.367 / 137.97 | 31.82 / 0.377 / 134.33
U-Net | Mask-RCNN | 10.39 / 0.254 / 196.02 | 36.11 / 0.251 / 198.63
U-Net | MLP | 6.61 / 0.227 / 222.95 | 38.82 / 0.224 / 225.73
U-Net | ODN | 11.44 / 0.355 / 142.57 | 38.48 / 0.362 / 139.91
U-Net | ViT | 7.76 / 0.273 / 188.50 | 39.24 / 0.269 / 188.66
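Table 2 pairs an inference time in minutes with a throughput in frames per second. The sketch below shows one plausible way such numbers could be measured; the model, data loader, and device are placeholders and this is not the authors' benchmarking code.

```python
# Hedged sketch of how the per-model inference time (minutes) and FPS in Table 2
# could be measured; `model` and `loader` are placeholders, not the authors' code.
import time
import torch

@torch.no_grad()
def measure_throughput(model, loader, device: str = "cuda"):
    model.eval().to(device)
    n_frames = 0
    start = time.perf_counter()
    for images, temps in loader:                 # assumed (thermal image, landmark temps) batches
        model(images.to(device), temps.to(device))
        n_frames += images.size(0)
    if device.startswith("cuda") and torch.cuda.is_available():
        torch.cuda.synchronize()                 # ensure queued GPU work is counted
    elapsed_s = time.perf_counter() - start
    return elapsed_s / 60.0, n_frames / elapsed_s  # (inference time in minutes, FPS)
```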
Table 3. Paired t-test results (baseline vs. M3MSTL). Metrics: mean NME, occlusion accuracy (Acc_Occl), ROC–AUC (AUC), and average precision (AP). p-values reported as <1e−300 indicate extremely small values.
Backbone | Head | NME (t / p / Cohen's d) | Acc_Occl (t / p / Cohen's d) | ROC–AUC (t / p / Cohen's d) | AP (t / p / Cohen's d)
Hourglass | Hourglass | 8.579 / 1.510e−17 / 0.156 | −67.455 / <1e−300 / −1.223 | −69.345 / <1e−300 / −1.258 | −95.439 / <1e−300 / −1.731
Hourglass | HRNet | 10.273 / 2.310e−24 / 0.186 | −76.032 / <1e−300 / −1.379 | −77.401 / <1e−300 / −1.404 | −95.892 / <1e−300 / −1.739
Hourglass | Mask-RCNN | 7.431 / 1.390e−13 / 0.135 | −88.489 / <1e−300 / −1.605 | −82.741 / <1e−300 / −1.501 | −122.516 / <1e−300 / −2.222
Hourglass | MLP | 5.654 / 1.710e−08 / 0.103 | −19.145 / 3.150e−77 / −0.347 | −15.034 / 2.500e−49 / −0.273 | −16.359 / 1.030e−57 / −0.297
Hourglass | ODN | 8.867 / 1.250e−18 / 0.161 | −73.870 / <1e−300 / −1.340 | −81.006 / <1e−300 / −1.469 | −122.160 / <1e−300 / −2.216
Hourglass | ViT | 12.126 / 4.430e−33 / 0.220 | −74.618 / <1e−300 / −1.353 | −80.115 / <1e−300 / −1.453 | −108.885 / <1e−300 / −1.975
HRNet | Hourglass | 21.063 / 4.870e−92 / 0.382 | −73.902 / <1e−300 / −1.340 | −79.723 / <1e−300 / −1.446 | −110.203 / <1e−300 / −1.999
HRNet | HRNet | 21.213 / 3.010e−93 / 0.385 | −48.093 / <1e−300 / −0.872 | −43.885 / <1e−300 / −0.796 | −52.529 / <1e−300 / −0.953
HRNet | Mask-RCNN | 20.607 / 2.020e−88 / 0.374 | −61.046 / <1e−300 / −1.107 | −53.349 / <1e−300 / −0.968 | −57.256 / <1e−300 / −1.038
HRNet | MLP | 20.730 / 2.150e−89 / 0.376 | −58.292 / <1e−300 / −1.057 | −54.839 / <1e−300 / −0.995 | −68.193 / <1e−300 / −1.237
HRNet | ODN | 21.298 / 6.300e−94 / 0.386 | −58.900 / <1e−300 / −1.068 | −47.496 / <1e−300 / −0.861 | −52.554 / <1e−300 / −0.953
HRNet | ViT | 20.992 / 1.780e−91 / 0.381 | −39.792 / 4.340e−279 / −0.722 | −30.040 / 7.870e−174 / −0.545 | −27.497 / 7.850e−149 / −0.499
ResNet50 | Hourglass | 18.294 / 5.260e−71 / 0.332 | 2.693 / 7.130e−03 / 0.049 | −9.159 / 9.340e−20 / −0.166 | −5.259 / 1.550e−07 / −0.095
ResNet50 | HRNet | 21.916 / 5.730e−99 / 0.398 | −16.671 / 8.940e−60 / −0.302 | −27.949 / 3.500e−153 / −0.507 | −17.601 / 4.160e−66 / −0.319
ResNet50 | Mask-RCNN | 26.070 / 2.200e−135 / 0.473 | −29.106 / 1.690e−164 / −0.528 | −34.492 / <1e−300 / −0.626 | −27.056 / 1.230e−144 / −0.491
ResNet50 | MLP | 21.370 / 1.640e−94 / 0.388 | −4.654 / 3.400e−06 / −0.084 | −5.921 / 3.550e−09 / −0.107 | −3.952 / 7.930e−02 / −0.072
ResNet50 | ODN | 25.806 / 5.930e−133 / 0.468 | −11.261 / 7.540e−29 / −0.204 | −18.362 / 1.690e−71 / −0.333 | −10.660 / 4.510e−26 / −0.193
ResNet50 | ViT (ours) | 20.280 / 7.280e−86 / 0.368 | −16.836 / 7.080e−61 / −0.305 | −16.217 / 8.690e−57 / −0.294 | −15.621 / 5.690e−53 / −0.283
SegFormer | Hourglass | 1.325 / 1.854e−01 / 0.024 | −68.098 / <1e−300 / −1.235 | −67.333 / <1e−300 / −1.221 | −88.701 / <1e−300 / −1.609
SegFormer | HRNet | −8.306 / 1.470e−16 / −0.151 | −68.761 / <1e−300 / −1.247 | −61.132 / <1e−300 / −1.109 | −80.918 / <1e−300 / −1.468
SegFormer | Mask-RCNN | −9.375 / 1.310e−20 / −0.170 | −69.701 / <1e−300 / −1.264 | −89.276 / <1e−300 / −1.619 | −101.164 / <1e−300 / −1.835
SegFormer | MLP | −2.783 / 5.420e−03 / −0.051 | 0.620 / 5.353e−01 / 0.011 | −1.896 / 5.810e−02 / −0.034 | −2.107 / 3.520e−02 / −0.038
SegFormer | ODN | −5.663 / 1.630e−08 / −0.103 | −81.113 / <1e−300 / −1.471 | −79.682 / <1e−300 / −1.445 | −111.454 / <1e−300 / −2.021
SegFormer | ViT | −5.865 / 4.980e−09 / −0.106 | −78.844 / <1e−300 / −1.430 | −81.542 / <1e−300 / −1.479 | −118.796 / <1e−300 / −2.155
U-Net | Hourglass | 22.442 / 2.440e−103 / 0.407 | −23.016 / 3.460e−108 / −0.417 | −23.185 / 1.240e−109 / −0.421 | −15.747 / 9.110e−54 / −0.286
U-Net | HRNet | 22.250 / 9.770e−102 / 0.404 | −17.664 / 1.490e−66 / −0.320 | −28.143 / 4.590e−155 / −0.510 | −17.704 / 7.990e−67 / −0.321
U-Net | Mask-RCNN | 19.860 / 1.250e−82 / 0.360 | −30.327 / 1.010e−176 / −0.550 | −38.522 / <1e−300 / −0.699 | −26.450 / 6.270e−139 / −0.480
U-Net | MLP | 19.541 / 3.400e−80 / 0.354 | −23.955 / 2.600e−116 / −0.434 | −28.260 / 3.360e−156 / −0.513 | −17.299 / 5.060e−64 / −0.314
U-Net | ODN | 25.232 / 1.060e−127 / 0.458 | −29.088 / 2.520e−164 / −0.528 | −31.858 / 2.020e−192 / −0.578 | −22.204 / 2.380e−101 / −0.403
U-Net | ViT | 19.151 / 2.860e−77 / 0.347 | −17.219 / 1.790e−63 / −0.312 | −17.657 / 1.680e−66 / −0.320 | −16.334 / 1.500e−57 / −0.296
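Table 3 reports paired t-tests and Cohen's d for each backbone/head combination. The sketch below shows one standard way such statistics can be computed from per-image metric values with SciPy, using placeholder arrays rather than the authors' analysis script. With the baseline values passed first, a positive t for an error metric such as the NME indicates a larger baseline error, which appears consistent with the sign pattern in the table.

```python
# Illustrative paired t-test and Cohen's d computation (placeholder data, not the
# authors' analysis script). baseline_vals and m3mstl_vals hold one value per test
# image for the same metric, e.g. per-image NME.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
baseline_vals = 0.2 + 0.5 * rng.random(1000)             # placeholder per-image NME
m3mstl_vals = baseline_vals - 0.1 * rng.random(1000)     # placeholder improved NME

t_stat, p_value = ttest_rel(baseline_vals, m3mstl_vals)  # paired test over the same images
diff = baseline_vals - m3mstl_vals
cohens_d = diff.mean() / diff.std(ddof=1)                # paired-samples (d_z) effect size
print(f"t = {t_stat:.3f}, p = {p_value:.3e}, d = {cohens_d:.3f}")
```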