SCGViT: A Pseudo-Multimodal Low-Latency Framework for Real-Time Skin Lesion Diagnosis
Abstract
1. Introduction
- This article extends the MobileViT architecture to pseudo-multimodal inputs and designs a branch-specific feature extractor with learnable weights. The design jointly models the local texture and global long-range dependencies of skin lesions at low computational cost (only 0.524 M parameters and 0.866 G FLOPs).
- An added cross-attention module strengthens directional features through an asymmetric fusion strategy: RGB visual features guide the injection of information from the auxiliary IR and INV branches. Compared with simple concatenation or weighted summation, this captures complementary pathological information across branches more accurately.
- As a multi-task framework, the model introduces a self-supervised generator for adversarial training. The generated interference samples force the classifier to learn more fundamental pathological features and reduce its reliance on surface artifacts.
- Auxiliary classification heads are attached to the end of each branch extractor, and a deep supervision loss enforces the discriminative ability of single-branch features.
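
The asymmetric fusion in the second contribution can be illustrated with a minimal single-head sketch in plain Python. It assumes that RGB tokens supply the queries, that auxiliary (IR or INV) tokens supply both keys and values, and that the result is added back to the RGB tokens through a residual connection. The learned projection matrices and multi-head structure are omitted, and the names `softmax` and `cross_attend` are hypothetical; this is not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(rgb_tokens, aux_tokens):
    """Single-head cross-attention without learned projections (assumption).

    Queries come from the RGB branch; keys and values come from the
    auxiliary branch, so information flows one way: auxiliary -> RGB.
    """
    dim = len(rgb_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in rgb_tokens:
        # Scaled dot-product similarity of this RGB token to every auxiliary token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in aux_tokens]
        weights = softmax(scores)
        # Attention-weighted sum of the auxiliary tokens (used as values).
        context = [sum(w * tok[d] for w, tok in zip(weights, aux_tokens)) for d in range(dim)]
        # Residual connection keeps the RGB feature as the main pathway.
        fused.append([q + c for q, c in zip(query, context)])
    return fused

rgb = [[1.0, 0.0], [0.0, 1.0]]              # toy RGB tokens
ir = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]   # toy auxiliary tokens
fused = cross_attend(rgb, ir)
print(len(fused), len(fused[0]))
```

The output keeps the RGB token count and feature width regardless of how many auxiliary tokens are attended to, which is what makes the fusion asymmetric: the auxiliary branch only modulates the RGB representation and never replaces it.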
2. Related Work
3. Proposed Framework
3.1. Overview of the Overall Architecture
3.2. Residual CGAN for Generative Data Balancing
3.3. Multi-Branch Cross-Attention Fusion Mechanism
3.4. SCG-SE Blocks: Feature Recalibration and Global Modeling
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Comparative Experiments on HAM10000 Dataset
4.5. Ablation Experiments
5. Conclusions
- Introducing the CGAN during training enables the model to better learn the intrinsic distribution of lesion features, improving its resistance to variations and noise outside the training set. Ablation experiments confirm that adversarial training significantly strengthens the robustness of the model’s identification of easily confused categories.
- Multi-branch fusion significantly improves diagnostic accuracy. A single RGB image becomes a recognition bottleneck when lesion edges are blurred or the image is noisy. This paper adds IR and INV branches and uses cross-attention for feature-level interaction, extracting deep structural texture and boundary contrast information from the lesion area. On the 7-class HAM10000 classification task, SCGViT achieves a macro-averaged F1 score of 0.973, the highest among the compared models (the next best, MobileNetV3-S, reaches 0.951).
- The lightweight design optimizes the parameter distribution while preserving accuracy. The final model contains only 0.52 million parameters, approximately 1/7 of the OverLoCK model, while achieving an inference speed of 304.439 FPS.
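
The efficiency figures in the last point can be checked with a short calculation. The sketch below takes the parameter counts and frame rate from the comparison table and assumes that FPS denotes single-image throughput, so its reciprocal approximates the mean per-image latency; the variable names are illustrative.

```python
# Values reported in the comparison table.
scgvit_params_m = 0.524     # SCGViT parameters, in millions
overlock_params_m = 3.872   # OverLoCK parameters, in millions
scgvit_fps = 304.439        # SCGViT inference speed, frames per second

# Fraction of OverLoCK's parameter count used by SCGViT.
ratio = scgvit_params_m / overlock_params_m

# Assumption: FPS is single-image throughput, so 1/FPS is the mean latency.
latency_ms = 1000.0 / scgvit_fps

print(f"parameter ratio: {ratio:.3f} (about 1/{1 / ratio:.1f})")
print(f"mean latency per image: {latency_ms:.2f} ms")
```

Under these assumptions the ratio is about 0.135, or roughly 1/7.4, and the per-image latency is about 3.28 ms.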
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest








| Model Name | Macro-Averaged F1 Score | Number of Parameters (M) | FLOPs (G) | Inference Speed (FPS) |
|---|---|---|---|---|
| CNN | 0.901 | 1.161 | 1.6763 | 347.392 |
| ResNet18 | 0.902 | 11.180 | 1.823 | 346.998 |
| ViT-Tiny | 0.908 | 2.526 | 0.495 | 341.183 |
| SCGViT | 0.973 | 0.524 | 0.866 | 304.439 |
| OverLoCK | 0.915 | 3.872 | 1.203 | 298.654 |
| H-CAST | 0.928 | 4.516 | 1.437 | 285.319 |
| GhostNetV2 | 0.943 | 1.235 | 0.167 | 328.415 |
| MobileNetV3-S | 0.951 | 0.934 | 0.057 | 386.124 |
| Model Name | Metric | Akiec | Bcc | Bkl | Df | Mel | Nv | Vasc |
|---|---|---|---|---|---|---|---|---|
| CNN | Precision | 0.867 | 0.889 | 0.882 | 0.843 | 0.876 | 0.978 | 0.915 |
| CNN | Recall | 0.924 | 0.897 | 0.893 | 0.901 | 0.912 | 0.956 | 0.938 |
| CNN | F1 score | 0.895 | 0.893 | 0.887 | 0.871 | 0.894 | 0.967 | 0.926 |
| ResNet18 | Precision | 0.872 | 0.885 | 0.886 | 0.849 | 0.873 | 0.981 | 0.908 |
| ResNet18 | Recall | 0.918 | 0.903 | 0.898 | 0.897 | 0.916 | 0.953 | 0.937 |
| ResNet18 | F1 score | 0.894 | 0.894 | 0.892 | 0.872 | 0.894 | 0.966 | 0.922 |
| ViT-Tiny | Precision | 0.865 | 0.893 | 0.884 | 0.846 | 0.871 | 0.979 | 0.919 |
| ViT-Tiny | Recall | 0.909 | 0.901 | 0.896 | 0.893 | 0.914 | 0.957 | 0.943 |
| ViT-Tiny | F1 score | 0.886 | 0.897 | 0.890 | 0.869 | 0.892 | 0.968 | 0.931 |
| SCGViT | Precision | 0.896 | 0.924 | 0.912 | 0.878 | 0.903 | 0.982 | 0.957 |
| SCGViT | Recall | 0.921 | 0.947 | 0.935 | 0.905 | 0.928 | 0.964 | 0.973 |
| SCGViT | F1 score | 0.908 | 0.935 | 0.923 | 0.891 | 0.915 | 0.973 | 0.965 |
| OverLoCK | Precision | 0.873 | 0.859 | 0.891 | 0.852 | 0.883 | 0.980 | 0.924 |
| OverLoCK | Recall | 0.912 | 0.908 | 0.902 | 0.895 | 0.918 | 0.958 | 0.945 |
| OverLoCK | F1 score | 0.892 | 0.901 | 0.896 | 0.873 | 0.900 | 0.969 | 0.934 |
| H-CAST | Precision | 0.881 | 0.903 | 0.898 | 0.861 | 0.892 | 0.981 | 0.932 |
| H-CAST | Recall | 0.918 | 0.921 | 0.915 | 0.902 | 0.924 | 0.961 | 0.952 |
| H-CAST | F1 score | 0.899 | 0.912 | 0.906 | 0.881 | 0.908 | 0.971 | 0.942 |
| GhostNetV2 | Precision | 0.888 | 0.915 | 0.907 | 0.869 | 0.899 | 0.981 | 0.941 |
| GhostNetV2 | Recall | 0.923 | 0.934 | 0.926 | 0.907 | 0.930 | 0.963 | 0.960 |
| GhostNetV2 | F1 score | 0.905 | 0.924 | 0.916 | 0.887 | 0.914 | 0.972 | 0.950 |
| MobileNetV3-S | Precision | 0.893 | 0.921 | 0.913 | 0.876 | 0.902 | 0.981 | 0.948 |
| MobileNetV3-S | Recall | 0.927 | 0.940 | 0.931 | 0.911 | 0.935 | 0.962 | 0.967 |
| MobileNetV3-S | F1 score | 0.910 | 0.930 | 0.922 | 0.890 | 0.913 | 0.972 | 0.957 |
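
Each F1 entry in the per-class table is the harmonic mean of the precision and recall reported for the same model and class. The short sketch below recomputes three of the SCGViT entries from the table; the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# SCGViT entries from the per-class table: class -> (precision, recall)
scgvit = {
    "Akiec": (0.896, 0.921),
    "Bcc": (0.924, 0.947),
    "Vasc": (0.957, 0.973),
}

for name, (p, r) in scgvit.items():
    print(name, round(f1_score(p, r), 3))
```

Rounded to three decimals, the results (0.908, 0.935, and 0.965) agree with the tabulated F1 scores for these classes.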
| CGAN | Cross Attention | SE | Macro-Averaged F1 Score | Number of Parameters (M) | FLOPs (G) | Inference Speed (FPS) |
|---|---|---|---|---|---|---|
| | | | 0.897 | 0.458 | 0.579 | 398.625 |
| √ | | | 0.928 | 0.486 | 0.699 | 356.814 |
| √ | √ | | 0.959 | 0.512 | 0.782 | 321.573 |
| √ | √ | √ | 0.973 | 0.524 | 0.866 | 304.439 |
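
The accuracy and speed trade-off across the ablation configurations can be read off as step-to-step differences. The sketch below uses the four rows of the ablation table in order, from no added modules to all three; the helper name `step_deltas` is illustrative.

```python
# Rows of the ablation table, top to bottom:
# (macro F1, parameters in M, FLOPs in G, inference speed in FPS)
rows = [
    (0.897, 0.458, 0.579, 398.625),
    (0.928, 0.486, 0.699, 356.814),
    (0.959, 0.512, 0.782, 321.573),
    (0.973, 0.524, 0.866, 304.439),
]

def step_deltas(table):
    """Return (F1 gain, FPS change) between each pair of consecutive rows."""
    return [
        (round(cur[0] - prev[0], 3), round(cur[3] - prev[3], 3))
        for prev, cur in zip(table, table[1:])
    ]

deltas = step_deltas(rows)
for f1_gain, fps_change in deltas:
    print(f"F1 {f1_gain:+.3f}, FPS {fps_change:+.3f}")

# Overall change from the first row to the full model.
total_f1_gain = round(rows[-1][0] - rows[0][0], 3)
fps_drop_pct = 100 * (rows[0][3] - rows[-1][3]) / rows[0][3]
print(f"total: F1 {total_f1_gain:+.3f}, FPS -{fps_drop_pct:.1f}%")
```

Across the three steps the macro-averaged F1 score rises by 0.031, 0.031, and 0.014, for a total gain of 0.076, while the inference speed falls by about 23.6% relative to the first row.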
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Luo, Z.; Hou, C.; Wang, H. SCGViT: A Pseudo-Multimodal Low-Latency Framework for Real-Time Skin Lesion Diagnosis. Electronics 2026, 15, 845. https://doi.org/10.3390/electronics15040845

