Multi-Modal Multi-Stage Multi-Task Learning for Occlusion-Aware Facial Landmark Localisation
Abstract
1. Introduction
2. Related Work
2.1. Architectures for Occlusion-Aware Landmark Localisation
2.2. Training Strategies Under Thermal Data Scarcity
3. Methodology
3.1. Occlusion-Aware Landmark Localisation Architecture
3.1.1. ResNet-50
3.1.2. Vision Transformer
3.1.3. Spatial Features and Task Heads
3.2. Multi-Stage Training Strategies
3.2.1. Stage 1: Backbone Pre-Training
3.2.2. Stage 2: Heads Pretraining
3.2.3. Stage 3: Joint Fine-Tuning
Algorithm 1. M3-MSTL training algorithm.
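Below is a minimal PyTorch-style sketch of the three-stage schedule named in Sections 3.2.1–3.2.3, wired to the ResNet-50 backbone and ViT-style head of Section 3.1. Everything concrete here is an assumption for illustration — the `M3MSTLNet` name, layer sizes, pooling scheme, Adam optimizer, learning rates, epoch counts, equal-weight L2 + BCE objective, and the toy `loader` — not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class M3MSTLNet(nn.Module):
    """Sketch: ResNet-50 backbone feeding a ViT-style encoder head with two
    task outputs -- 68-point landmark regression and per-landmark occlusion
    classification. All sizes and the pooling scheme are assumptions."""
    def __init__(self, n_landmarks: int = 68):
        super().__init__()
        resnet = tvm.resnet50(weights=None)  # pretrained weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep spatial map
        layer = nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True)
        self.vit_head = nn.TransformerEncoder(layer, num_layers=2)
        self.landmarks = nn.Linear(2048, n_landmarks * 2)  # (x, y) per landmark
        self.occlusion = nn.Linear(2048, n_landmarks)      # occlusion logit per landmark

    def forward(self, x):
        feats = self.backbone(x)                    # (B, 2048, H', W')
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'*W', 2048) token sequence
        pooled = self.vit_head(tokens).mean(dim=1)  # (B, 2048)
        return self.landmarks(pooled), self.occlusion(pooled)

def loss_fn(pred_lm, lm, pred_occ, occ):
    # Assumed equal-weight multi-task objective: L2 landmarks + BCE occlusion.
    return (nn.functional.mse_loss(pred_lm, lm)
            + nn.functional.binary_cross_entropy_with_logits(pred_occ, occ))

def run_stage(model, loader, params, epochs, lr):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for img, lm, occ in loader:
            pred_lm, pred_occ = model(img)
            loss = loss_fn(pred_lm, lm, pred_occ, occ)
            opt.zero_grad(); loss.backward(); opt.step()

# Toy stand-in for a TFD68 loader: thermal crops replicated to 3 channels.
loader = [(torch.randn(2, 3, 224, 224), torch.randn(2, 136),
           torch.randint(0, 2, (2, 68)).float())]
model = M3MSTLNet()
# Stage 1: backbone pre-training (only backbone weights are stepped).
run_stage(model, loader, model.backbone.parameters(), epochs=1, lr=1e-4)
# Stage 2: freeze the backbone, pre-train the task heads.
for p in model.backbone.parameters():
    p.requires_grad = False
run_stage(model, loader, [p for p in model.parameters() if p.requires_grad],
          epochs=1, lr=1e-4)
# Stage 3: unfreeze everything, fine-tune jointly at a lower learning rate.
for p in model.parameters():
    p.requires_grad = True
run_stage(model, loader, model.parameters(), epochs=1, lr=1e-5)
```

The point the stages encode: the backbone first adapts to the thermal domain, the task heads are then fitted on a frozen representation, and only afterwards is everything fine-tuned jointly at a reduced learning rate so the fresh heads do not destabilise the backbone.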
4. Experimental Setup
4.1. TFD68 Dataset
4.2. Data Preparation
4.3. Baseline Models
4.4. Hyperparameters
4.5. Evaluation Metrics
4.5.1. Landmark Localisation Metrics
4.5.2. Occlusion Classification Metrics
5. Results and Discussion
6. Conclusions and Outlook
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Baskaran, R.; Moller, K.; Wiil, U.K.; Brabrand, M. Using Facial Landmark Detection on Thermal Images as a Novel Prognostic Tool for Emergency Departments. Front. Artif. Intell. 2022, 5, 815333.
- Qudah, M.A.; Mohamed, A.; Lutfi, S. Analysis of Facial Occlusion Challenge in Thermal Images for Human Affective State Recognition. Sensors 2023, 23, 3513.
- Lin, C.; Zhu, B.; Wang, Q.; Liao, R.; Qian, C.; Lu, J.; Zhou, J. Structure-Coherent Deep Feature Learning for Robust Face Alignment. IEEE Trans. Image Process. 2021, 30, 5313–5326.
- Flotho, P.; Piening, M.; Kukleva, A.; Steidl, G. T-FAKE: Synthesizing Thermal Images for Facial Landmarking. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26356–26366.
- Kuzdeuov, A.; Koishigarina, D.; Aubakirova, D.; Abushakimova, S.; Varol, H.A. SF-TL54: A Thermal Facial Landmark Dataset with Visual Pairs. In Proceedings of the 2022 IEEE/SICE International Symposium on System Integration (SII), Narvik, Norway, 9–12 January 2022; pp. 748–753.
- Chu, W.T.; Liu, Y.H. Thermal Facial Landmark Detection by Deep Multi-Task Learning. In Proceedings of the 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), Kuala Lumpur, Malaysia, 27–29 September 2019; pp. 1–6.
- Ding, H.; Zhou, P.; Chellappa, R. Occlusion-Adaptive Deep Network for Robust Facial Expression Recognition. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–9.
- Chiang, J.C.; Hu, H.N.; Hou, B.S.; Tseng, C.Y.; Liu, Y.L.; Chen, M.H.; Lin, Y.Y. ORFormer: Occlusion-Robust Transformer for Accurate Facial Landmark Detection. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 784–793.
- Wahid, J.A.; Xu, X.; Ayoub, M.; Hussain, S.; Li, L.; Shi, L. A Hybrid ResNet–ViT Approach to Bridge the Global and Local Features for Myocardial Infarction Detection. Sci. Rep. 2024, 14, 4359.
- Jiang, C.; Ren, H.; Yang, H.; Huo, H.; Zhu, P.; Yao, Z.; Li, J.; Sun, M.; Yang, S. M2FNet: Multi-Modal Fusion Network for Object Detection from Visible and Thermal Infrared Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103918.
- Rakhimzhanova, T.; Kuzdeuov, A.; Varol, H.A. AnyFace++: Deep Multi-Task, Multi-Domain Learning for Efficient Face AI. Sensors 2024, 24, 5993.
- Ng, Y.C.; Belyaev, A.G.; Choong, F.; Suandi, S.A.; Chuah, J.H.; Rudrusamy, B. TFD68: A Fully Annotated Thermal Facial Dataset with 68 Landmarks, Pose Variations, Per-Pixel Thermal Maps, Visual Pairs, Occlusions, and Facial Expressions. In Proceedings of the SIGGRAPH Asia 2025 Technical Communications, Hong Kong, China, 15–18 December 2025.
- Zhu, M.; Shi, D.; Zheng, M.; Sadiq, M. Robust Facial Landmark Detection via Occlusion-Adaptive Deep Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3481–3491.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 483–499.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26 April–1 May 2020.
- Mahmud, T. A Novel Multi-Stage Training Approach for Human Activity Recognition From Multimodal Wearable Sensor Data Using Deep Neural Network. IEEE Sens. J. 2021, 21, 4995–5004.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Feng, Z.H.; Kittler, J.; Awais, M.; Huber, P.; Wu, X.J. Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2235–2245.

| Backbone | Head | Baseline | | | | M3-MSTL | | | | Δ (%) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hourglass | Mask-RCNN | 0.632 | 1.727 | 0.662 | 0.540 | 0.887 | 1.717 | 0.954 | 0.941 | +40.31% | −0.56% | +43.98% | +74.17% |
| Hourglass | MLP | 0.628 | 1.722 | 0.702 | 0.568 | 0.640 | 1.707 | 0.702 | 0.568 | +1.92% | −0.87% | +0.00% | +0.00% |
| Hourglass | ViT | 0.640 | 1.729 | 0.705 | 0.570 | 0.900 | 1.712 | 0.961 | 0.949 | +40.65% | −0.98% | +36.24% | +66.56% |
| Hourglass | Hourglass | 0.628 | 1.725 | 0.929 | 0.901 | 0.857 | 1.699 | 0.929 | 0.901 | +36.40% | −1.52% | +0.00% | +0.00% |
| Hourglass | ODN | 0.637 | 1.729 | 0.699 | 0.573 | 0.899 | 1.718 | 0.962 | 0.951 | +41.05% | −0.61% | +37.58% | +66.10% |
| HRNet | HRNet | 0.796 | 1.724 | 0.870 | 0.835 | 0.907 | 0.582 | 0.968 | 0.958 | +13.96% | −66.26% | +11.31% | +14.71% |
| HRNet | MLP | 0.652 | 1.680 | 0.714 | 0.585 | 0.803 | 0.598 | 0.884 | 0.845 | +23.20% | −64.43% | +23.92% | +44.37% |
| HRNet | ViT | 0.808 | 1.694 | 0.878 | 0.848 | 0.904 | 0.611 | 0.965 | 0.954 | +11.88% | −63.92% | +9.88% | +12.62% |
| HRNet | Hourglass | 0.657 | 1.710 | 0.721 | 0.595 | 0.892 | 0.585 | 0.953 | 0.938 | +35.96% | −65.79% | +32.31% | +57.84% |
| HRNet | ODN | 0.789 | 1.724 | 0.860 | 0.822 | 0.905 | 0.585 | 0.964 | 0.954 | +14.62% | −66.06% | +12.08% | +16.10% |
| HRNet | Mask-RCNN | 0.756 | 1.716 | 0.836 | 0.778 | 0.905 | 0.628 | 0.965 | 0.955 | +19.74% | −63.43% | +15.42% | +22.83% |
| ResNet50 | MLP | 0.908 | 0.367 | 0.966 | 0.956 | 0.912 | 0.245 | 0.971 | 0.962 | +0.46% | −32.88% | +0.48% | +0.65% |
| ResNet50 | ODN | 0.894 | 0.443 | 0.958 | 0.944 | 0.907 | 0.263 | 0.971 | 0.961 | +1.43% | −40.73% | +1.38% | +1.79% |
| ResNet50 | HRNet | 0.891 | 0.421 | 0.954 | 0.938 | 0.911 | 0.259 | 0.972 | 0.963 | +2.24% | −38.50% | +1.86% | +2.70% |
| ResNet50 | Hourglass | 0.904 | 0.422 | 0.964 | 0.953 | 0.900 | 0.269 | 0.970 | 0.960 | −0.48% | −36.16% | +0.62% | +0.73% |
| ResNet50 | Mask-RCNN | 0.879 | 0.400 | 0.946 | 0.931 | 0.915 | 0.246 | 0.972 | 0.963 | +4.12% | −38.64% | +2.79% | +3.53% |
| ResNet50 | ViT (ours) | 0.897 | 0.382 | 0.962 | 0.949 | 0.918 | 0.246 | 0.974 | 0.966 | +2.23% | −35.47% | +1.30% | +1.80% |
| SegFormer | MLP | 0.629 | 1.725 | 0.641 | 0.524 | 0.629 | 1.729 | 0.641 | 0.524 | −0.03% | +0.18% | +0.00% | +0.00% |
| SegFormer | ViT | 0.637 | 1.726 | 0.685 | 0.554 | 0.900 | 1.732 | 0.964 | 0.953 | +41.23% | +0.30% | +40.77% | +72.01% |
| SegFormer | Hourglass | 0.627 | 1.726 | 0.643 | 0.524 | 0.850 | 1.724 | 0.923 | 0.893 | +35.67% | −0.10% | +43.57% | +70.46% |
| SegFormer | Mask-RCNN | 0.650 | 1.726 | 0.689 | 0.562 | 0.862 | 1.737 | 0.942 | 0.924 | +32.66% | +0.64% | +36.75% | +64.50% |
| SegFormer | HRNet | 0.690 | 1.726 | 0.742 | 0.670 | 0.895 | 1.731 | 0.961 | 0.948 | +29.72% | +0.30% | +29.51% | +41.64% |
| SegFormer | ODN | 0.672 | 1.728 | 0.734 | 0.600 | 0.894 | 1.732 | 0.959 | 0.946 | +33.00% | +0.18% | +30.73% | +57.67% |
| U-Net | MLP | 0.869 | 0.496 | 0.946 | 0.927 | 0.905 | 0.337 | 0.966 | 0.955 | +4.11% | −32.03% | +2.10% | +3.01% |
| U-Net | ViT | 0.898 | 0.508 | 0.947 | 0.947 | 0.917 | 0.338 | 0.973 | 0.965 | +2.22% | −33.41% | +2.68% | +1.87% |
| U-Net | Hourglass | 0.876 | 0.535 | 0.948 | 0.930 | 0.906 | 0.312 | 0.965 | 0.953 | +3.47% | −41.72% | +1.91% | +2.44% |
| U-Net | Mask-RCNN | 0.877 | 0.497 | 0.930 | 0.930 | 0.913 | 0.316 | 0.971 | 0.962 | +4.09% | −36.53% | +4.38% | +3.45% |
| U-Net | HRNet | 0.898 | 0.503 | 0.960 | 0.947 | 0.914 | 0.302 | 0.971 | 0.963 | +1.78% | −39.93% | +1.19% | +1.66% |
| U-Net | ODN | 0.861 | 0.538 | 0.921 | 0.921 | 0.909 | 0.295 | 0.971 | 0.961 | +5.51% | −45.16% | +5.37% | +4.39% |
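The table above compares each backbone–head pair before and after M3-MSTL training: four metric columns for the Baseline, the same four for M3-MSTL, and their relative change. The sketch below assumes Δ = (M3-MSTL − Baseline) / Baseline × 100; recomputing from the rounded table entries reproduces the printed percentages only up to rounding, which suggests the published values were derived from the unrounded metrics.

```python
def rel_change(baseline: float, ours: float) -> float:
    """Relative change in percent; positive means M3-MSTL exceeds the baseline
    (for the error-type second column, improvement shows up as negative)."""
    return (ours - baseline) / baseline * 100.0

# ResNet50 + ViT (ours) row, first and second metric columns:
print(f"{rel_change(0.897, 0.918):+.2f}%")  # +2.34% (table prints +2.23%)
print(f"{rel_change(0.382, 0.246):+.2f}%")  # -35.60% (table prints -35.47%)
```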
| Backbone | Head | Baseline | | | M3-MSTL | | |
|---|---|---|---|---|---|---|---|
| | | Train (min) | Inf. Time (min) | FPS | Train (min) | Inf. Time (min) | FPS |
| Hourglass | Hourglass | 2.76 | 0.116 | 437.92 | 25.17 | 0.114 | 443.57 |
| Hourglass | HRNet | 8.92 | 0.269 | 188.27 | 40.60 | 0.269 | 188.67 |
| Hourglass | Mask-RCNN | 3.99 | 0.132 | 399.93 | 27.74 | 0.131 | 388.08 |
| Hourglass | MLP | 2.31 | 0.095 | 534.95 | 24.79 | 0.096 | 529.44 |
| Hourglass | ODN | 7.03 | 0.216 | 244.13 | 41.40 | 0.212 | 238.76 |
| Hourglass | ViT | 3.99 | 0.171 | 299.15 | 34.45 | 0.151 | 336.46 |
| HRNet | Hourglass | 4.38 | 0.154 | 328.61 | 31.09 | 0.154 | 328.94 |
| HRNet | HRNet | 9.56 | 0.346 | 148.21 | 50.07 | 0.340 | 148.89 |
| HRNet | Mask-RCNN | 6.12 | 0.175 | 289.09 | 33.67 | 0.172 | 298.23 |
| HRNet | MLP | 3.53 | 0.108 | 469.97 | 30.07 | 0.110 | 450.78 |
| HRNet | ODN | 9.54 | 0.261 | 387.35 | 41.04 | 0.262 | 192.99 |
| HRNet | ViT | 4.82 | 0.180 | 281.56 | 30.03 | 0.175 | 283.06 |
| ResNet50 | Hourglass | 6.10 | 0.200 | 252.82 | 36.24 | 0.191 | 264.67 |
| ResNet50 | HRNet | 10.23 | 0.342 | 148.31 | 51.49 | 0.346 | 146.41 |
| ResNet50 | Mask-RCNN | 8.32 | 0.200 | 253.27 | 35.97 | 0.208 | 243.25 |
| ResNet50 | MLP | 5.20 | 0.156 | 323.72 | 32.90 | 0.164 | 308.57 |
| ResNet50 | ODN | 9.67 | 0.281 | 187.22 | 51.44 | 0.288 | 175.55 |
| ResNet50 | ViT (ours) | 8.18 | 0.218 | 232.06 | 37.20 | 0.226 | 224.33 |
| SegFormer | Hourglass | 2.44 | 0.116 | 435.56 | 25.99 | 0.115 | 441.04 |
| SegFormer | HRNet | 8.58 | 0.268 | 189.16 | 36.04 | 0.268 | 188.84 |
| SegFormer | Mask-RCNN | 3.44 | 0.132 | 376.67 | 22.70 | 0.132 | 383.92 |
| SegFormer | MLP | 2.44 | 0.094 | 539.37 | 21.89 | 0.094 | 538.43 |
| SegFormer | ODN | 6.60 | 0.213 | 239.97 | 39.22 | 0.211 | 239.39 |
| SegFormer | ViT | 3.54 | 0.150 | 337.24 | 23.23 | 0.151 | 336.35 |
| U-Net | Hourglass | 7.21 | 0.238 | 217.09 | 37.21 | 0.233 | 217.20 |
| U-Net | HRNet | 11.71 | 0.367 | 137.97 | 31.82 | 0.377 | 134.33 |
| U-Net | Mask-RCNN | 10.39 | 0.254 | 196.02 | 36.11 | 0.251 | 198.63 |
| U-Net | MLP | 6.61 | 0.227 | 222.95 | 38.82 | 0.224 | 225.73 |
| U-Net | ODN | 11.44 | 0.355 | 142.57 | 38.48 | 0.362 | 139.91 |
| U-Net | ViT | 7.76 | 0.273 | 188.50 | 39.24 | 0.269 | 188.66 |
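In the timing table, FPS is simply the test-set size divided by wall-clock inference time. The sketch below back-solves a hypothetical frame count from one row; the true test-set size is not stated here, so `NUM_TEST_FRAMES` is an assumption.

```python
def fps(total_minutes: float, num_frames: int) -> float:
    """Throughput from total wall-clock inference time over the test set."""
    return num_frames / (total_minutes * 60.0)

# Hypothetical: back-solving the Hourglass+Hourglass baseline row
# (0.116 min at 437.92 FPS) suggests roughly 0.116 * 60 * 437.92 ~= 3048 frames.
NUM_TEST_FRAMES = 3048
print(f"{fps(0.116, NUM_TEST_FRAMES):.1f} FPS")  # ~437.9
```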
| Backbone | Head | | | | | | | AUC (ROC–AUC) | | | AP (Average Precision) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | t-Value | p-Value | Cohen's d | t-Value | p-Value | Cohen's d | t-Value | p-Value | Cohen's d | t-Value | p-Value | Cohen's d |
| Hourglass | Hourglass | 8.579 | | 0.156 | −67.455 | <0.001 | −1.223 | −69.345 | <0.001 | −1.258 | −95.439 | <0.001 | −1.731 |
| Hourglass | HRNet | 10.273 | | 0.186 | −76.032 | <0.001 | −1.379 | −77.401 | <0.001 | −1.404 | −95.892 | <0.001 | −1.739 |
| Hourglass | Mask-RCNN | 7.431 | | 0.135 | −88.489 | <0.001 | −1.605 | −82.741 | <0.001 | −1.501 | −122.516 | <0.001 | −2.222 |
| Hourglass | MLP | 5.654 | | 0.103 | −19.145 | | −0.347 | −15.034 | | −0.273 | −16.359 | | −0.297 |
| Hourglass | ODN | 8.867 | | 0.161 | −73.870 | <0.001 | −1.340 | −81.006 | <0.001 | −1.469 | −122.160 | <0.001 | −2.216 |
| Hourglass | ViT | 12.126 | | 0.220 | −74.618 | <0.001 | −1.353 | −80.115 | <0.001 | −1.453 | −108.885 | <0.001 | −1.975 |
| HRNet | Hourglass | 21.063 | | 0.382 | −73.902 | <0.001 | −1.340 | −79.723 | <0.001 | −1.446 | −110.203 | <0.001 | −1.999 |
| HRNet | HRNet | 21.213 | | 0.385 | −48.093 | <0.001 | −0.872 | −43.885 | <0.001 | −0.796 | −52.529 | <0.001 | −0.953 |
| HRNet | Mask-RCNN | 20.607 | | 0.374 | −61.046 | <0.001 | −1.107 | −53.349 | <0.001 | −0.968 | −57.256 | <0.001 | −1.038 |
| HRNet | MLP | 20.730 | | 0.376 | −58.292 | <0.001 | −1.057 | −54.839 | <0.001 | −0.995 | −68.193 | <0.001 | −1.237 |
| HRNet | ODN | 21.298 | | 0.386 | −58.900 | <0.001 | −1.068 | −47.496 | <0.001 | −0.861 | −52.554 | <0.001 | −0.953 |
| HRNet | ViT | 20.992 | | 0.381 | −39.792 | | −0.722 | −30.040 | | −0.545 | −27.497 | | −0.499 |
| ResNet50 | Hourglass | 18.294 | | 0.332 | 2.693 | | 0.049 | −9.159 | | −0.166 | −5.259 | | −0.095 |
| ResNet50 | HRNet | 21.916 | | 0.398 | −16.671 | | −0.302 | −27.949 | | −0.507 | −17.601 | | −0.319 |
| ResNet50 | Mask-RCNN | 26.070 | | 0.473 | −29.106 | | −0.528 | −34.492 | <0.001 | −0.626 | −27.056 | | −0.491 |
| ResNet50 | MLP | 21.370 | | 0.388 | −4.654 | | −0.084 | −5.921 | | −0.107 | −3.952 | | −0.072 |
| ResNet50 | ODN | 25.806 | | 0.468 | −11.261 | | −0.204 | −18.362 | | −0.333 | −10.660 | | −0.193 |
| ResNet50 | ViT (ours) | 20.280 | | 0.368 | −16.836 | | −0.305 | −16.217 | | −0.294 | −15.621 | | −0.283 |
| SegFormer | Hourglass | 1.325 | | 0.024 | −68.098 | <0.001 | −1.235 | −67.333 | <0.001 | −1.221 | −88.701 | <0.001 | −1.609 |
| SegFormer | HRNet | −8.306 | | −0.151 | −68.761 | <0.001 | −1.247 | −61.132 | <0.001 | −1.109 | −80.918 | <0.001 | −1.468 |
| SegFormer | Mask-RCNN | −9.375 | | −0.170 | −69.701 | <0.001 | −1.264 | −89.276 | <0.001 | −1.619 | −101.164 | <0.001 | −1.835 |
| SegFormer | MLP | −2.783 | | −0.051 | 0.620 | | 0.011 | −1.896 | | −0.034 | −2.107 | | −0.038 |
| SegFormer | ODN | −5.663 | | −0.103 | −81.113 | <0.001 | −1.471 | −79.682 | <0.001 | −1.445 | −111.454 | <0.001 | −2.021 |
| SegFormer | ViT | −5.865 | | −0.106 | −78.844 | <0.001 | −1.430 | −81.542 | <0.001 | −1.479 | −118.796 | <0.001 | −2.155 |
| U-Net | Hourglass | 22.442 | | 0.407 | −23.016 | | −0.417 | −23.185 | | −0.421 | −15.747 | | −0.286 |
| U-Net | HRNet | 22.250 | | 0.404 | −17.664 | | −0.320 | −28.143 | | −0.510 | −17.704 | | −0.321 |
| U-Net | Mask-RCNN | 19.860 | | 0.360 | −30.327 | | −0.550 | −38.522 | <0.001 | −0.699 | −26.450 | | −0.480 |
| U-Net | MLP | 19.541 | | 0.354 | −23.955 | | −0.434 | −28.260 | | −0.513 | −17.299 | | −0.314 |
| U-Net | ODN | 25.232 | | 0.458 | −29.088 | | −0.528 | −31.858 | | −0.578 | −22.204 | | −0.403 |
| U-Net | ViT | 19.151 | | 0.347 | −17.219 | | −0.312 | −17.657 | | −0.320 | −16.334 | | −0.296 |
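Each metric group above reports a paired t-statistic, its p-value where given (<0.001), and Cohen's d. The t-to-d ratios cluster near √3000, which implies roughly 3,000 paired samples — consistent with per-image pairing over the test set suggested by the throughput table. Below is a sketch of how such paired statistics are commonly computed; the per-image pairing and the synthetic stand-in values are assumptions, not the paper's stated protocol.

```python
import numpy as np
from scipy import stats

def paired_stats(baseline: np.ndarray, ours: np.ndarray):
    """Paired t-test plus Cohen's d for paired samples
    (mean of the differences over the SD of the differences)."""
    t, p = stats.ttest_rel(baseline, ours)
    diff = baseline - ours
    d = diff.mean() / diff.std(ddof=1)
    return t, p, d

# Synthetic per-image AUC pairs: M3-MSTL ~0.02 higher on average.
rng = np.random.default_rng(0)
baseline = rng.normal(0.90, 0.01, size=3000)
ours = baseline + rng.normal(0.02, 0.0167, size=3000)
t, p, d = paired_stats(baseline, ours)
print(f"t = {t:.3f}, p = {p:.3g}, d = {d:.3f}")  # roughly t ~ -66, d ~ -1.2, p << 0.001
```

With this sign convention (baseline minus M3-MSTL), negative t and d indicate M3-MSTL scoring higher, matching the predominantly negative entries in the table.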