Hybrid Deep Learning Models for Arabic Sign Language Recognition in Healthcare Applications
Abstract
1. Introduction
- Our primary contribution is ResNet50ViT, a novel hybrid architecture that combines ResNet50’s hierarchical local feature extraction with ViT’s global attention mechanism to achieve superior performance on Arabic Sign Language Recognition (ArSLR). The approach addresses the challenges of ArSL by processing local features through the CNN backbone while simultaneously capturing long-range dependencies via the Transformer.
- A keyframe extraction approach based on motion detection is introduced, with a novel configuration strategy that ensures uniform key frame selection across each video. This reduces redundant frames and improves computational efficiency without sacrificing precision.
- A data preprocessing strategy for the ASiL dataset combines several techniques (Sections 4.3, 4.4 and 4.5) in a way that balances recognition performance with computational efficiency.
2. Related Works
3. Background Theory
4. System Framework
- Training: The training phase begins with a diverse image and video dataset, presented in Section 4. The image dataset is enriched through data augmentation techniques such as geometric transformations, while the video dataset is converted into labeled key frames, 23,000 in total, extracted with a motion-based method detailed later (a simplified sketch follows this list). Both datasets then undergo preprocessing and optimization steps that standardize the data for efficient model training. Multiple models are employed (Section 5): deep learning-based sign detection models extract static features from the augmented images, while a vision transformer captures dynamic features from the key frames. One of these trained models is then combined with a vision transformer into a unified hybrid model that learns both the local and global information flow of ArSL. Model performance is rigorously evaluated with various metrics during training.
- Classification: A video or image input is processed through the trained model to extract feature maps, which are used to detect and classify signs. The classified signs are then translated into textual output, mapping ArSL gestures to their corresponding words. The system incorporates continuous evaluation to monitor performance, ensuring robust and accurate recognition. This approach thus combines deep learning techniques with transformer architectures to provide a scalable solution for ArSL recognition.
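To make the motion-based key frame step concrete, the following is a minimal sketch of a frame-differencing selector that keeps the selection uniform across the video, in the spirit of the contribution described in Section 1. The scoring function (absolute frame-difference energy), the per-segment selection strategy, and the function name are illustrative assumptions rather than the paper’s exact method.

```python
import cv2
import numpy as np

def extract_keyframes(video_path, num_keyframes=5):
    """Pick the highest-motion frame from each of num_keyframes equal
    video segments, so the key frames stay uniformly spread over time."""
    cap = cv2.VideoCapture(video_path)
    frames, scores = [], []
    ok, prev = cap.read()
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        # Motion score: total absolute difference between consecutive frames.
        diff = cv2.absdiff(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY),
                           cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY))
        frames.append(frame)
        scores.append(float(diff.sum()))
        prev = frame
    cap.release()

    # Uniform selection: split the frame indices into equal segments and
    # keep the highest-motion frame from each segment.
    segments = np.array_split(np.arange(len(frames)), num_keyframes)
    return [frames[seg[int(np.argmax([scores[i] for i in seg]))]]
            for seg in segments if len(seg) > 0]
```

Applied to the 4600 videos mentioned in Section 6.9, five key frames per video would yield the 23,000 labeled frames reported for training.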
4.1. ASiL Dataset
4.2. Video Fragmentation
4.3. Histogram Equalization
4.4. Application of a Noise Reduction Filter
Algorithm 1. Optimization of NLM parameters (h, hColor, template_size, search_size)

Input: h_values ← [5, 10, 15]; hColor_values ← [5, 10, 15]; template_sizes ← [7, 10, 15]; search_sizes ← [21, 35, 45]
Output: best parameter values of h, hColor, template_size, search_size; total_time
Initialization: best_result ← None; best_params ← None; best_quality ← Infinity; start_time ← CurrentTime(); num_combinations ← Size(h_values) × Size(hColor_values) × Size(template_sizes) × Size(search_sizes); combination_counter ← 0
For each h in h_values Do
  For each hColor in hColor_values Do
    For each template_size in template_sizes Do
      For each search_size in search_sizes Do
        Denoise with (h, hColor, template_size, search_size); evaluate quality; if quality < best_quality, update best_quality, best_result, best_params; combination_counter ← combination_counter + 1
      End For
    End For
  End For
End For
total_time ← CurrentTime() − start_time
Return best_params, best_result, total_time
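The loop body of Algorithm 1 can be realized with OpenCV’s non-local means implementation. Below is a minimal Python sketch, assuming cv2.fastNlMeansDenoisingColored as the denoiser and mean squared error against a reference image as the quality measure to minimize; the reference image and the choice of MSE are assumptions, since the paper’s quality criterion is not reproduced here.

```python
import itertools
import time

import cv2
import numpy as np

def optimize_nlm_params(noisy, reference):
    """Grid search over NLM parameters, keeping the result with the
    lowest MSE against a reference image (lower is better)."""
    h_values = [5, 10, 15]
    hColor_values = [5, 10, 15]
    template_sizes = [7, 10, 15]   # note: OpenCV recommends odd template sizes
    search_sizes = [21, 35, 45]

    best_result, best_params, best_quality = None, None, np.inf
    start_time = time.time()

    for h, hColor, tmpl, search in itertools.product(
            h_values, hColor_values, template_sizes, search_sizes):
        denoised = cv2.fastNlMeansDenoisingColored(
            noisy, None, h=h, hColor=hColor,
            templateWindowSize=tmpl, searchWindowSize=search)
        quality = np.mean((denoised.astype(np.float64)
                           - reference.astype(np.float64)) ** 2)
        if quality < best_quality:
            best_quality, best_result = quality, denoised
            best_params = (h, hColor, tmpl, search)

    total_time = time.time() - start_time
    return best_params, best_result, total_time
```

Because the grid has 3 × 3 × 3 × 4 = 108 combinations, the exhaustive search remains tractable as a one-off calibration run per dataset.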
4.5. Data Augmentation
5. Arabic Sign Language Recognition Approaches
5.1. Approach 1: Custom CNN Architectures
5.2. Approach 2: Modified ResNet50 Architecture
5.3. Approach 3: Modified ResNet50V2 Architecture
5.4. Fine-Tuning
5.5. Approach 4: Architecture of SignViT

5.6. Approach 5: Architecture of ResNet50ViT
6. The Experimental Setup and Results
6.1. Performance Metrics and Callbacks
6.2. Results of Custom CNNs
6.3. Results of the Modified Transfer Learning Models
6.4. Results of the Fine-Tuned ResNet50
6.5. Results of the Fine-Tuned ResNet50V2
6.6. Results of SignViT
6.7. Results of ResNet50ViT
6.8. Comparison of SignViT and ResNet50ViT
6.9. Comparison with Other Model Performances
7. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Deafness and Hearing Loss. Available online: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss (accessed on 27 September 2024).
2. Deafness and Hearing Loss Toolkit: Hearing Loss a Global Problem | RCGP Learning. Available online: https://elearning.rcgp.org.uk/mod/book/view.php?id=12532&chapterid=288 (accessed on 15 April 2025).
3. World Federation of the Deaf. Canadian Association of the Deaf—Association des Sourds du Canada. Available online: https://cad-asc.ca/about-us/about-cad-asc/world-federation-of-the-deaf/ (accessed on 16 May 2024).
4. al Moustafa, A.; Rahim, M.; Bouallegue, B.; Khattab, M.; Soliman, A.; Tharwat, G.; Ahmed, A. Integrated Mediapipe with a CNN Model for Arabic Sign Language Recognition. J. Electr. Comput. Eng. 2023, 2023, 8870750.
5. Luqman, H.; Mahmoud, S.A. Automatic translation of Arabic text-to-Arabic sign language. Univers. Access Inf. Soc. 2019, 18, 939–951.
6. Kamal, S.M.; Chen, Y.; Li, S.; Shi, X.; Zheng, J. Technical Approaches to Chinese Sign Language Processing: A Review. IEEE Access 2019, 7, 96926–96935.
7. Kothadiya, D.R.; Bhatt, C.M.; Saba, T.; Rehman, A.; Bahaj, S.A. SIGNFORMER: DeepVision Transformer for Sign Language Recognition. IEEE Access 2023, 11, 4730–4739.
8. Shin, J.; Miah, A.S.M.; Hasan, M.A.M.; Hirooka, K.; Suzuki, K.; Lee, H.S.; Jang, S.W. Korean Sign Language Recognition Using Transformer-Based Deep Neural Network. Appl. Sci. 2023, 13, 3029.
9. Rathi, P.; Kuwar Gupta, R.; Agarwal, S.; Shukla, A. Sign Language Recognition Using ResNet50 Deep Neural Network Architecture. In Proceedings of the 5th International Conference on Next Generation Computing Technologies (NGCT-2019), Dehradun, India, 20–21 December 2019.
10. Wadhawan, A.; Kumar, P. Deep learning-based sign language recognition system for static signs. Neural Comput. Appl. 2020, 32, 7957–7968.
11. Thakar, S.; Shah, S.; Shah, B.; Nimkar, A.V. Sign Language to Text Conversion in Real Time using Transfer Learning. arXiv 2022, arXiv:2211.14446.
12. Areeb, Q.M.; Maryam; Nadeem, M.; Alroobaea, R.; Anwer, F. Helping Hearing-Impaired in Emergency Situations: A Deep Learning-Based Approach. IEEE Access 2022, 10, 8502–8517.
13. Zakariah, M.; Alotaibi, Y.A.; Koundal, D.; Guo, Y.; Mamun Elahi, M. Sign Language Recognition for Arabic Alphabets Using Transfer Learning Technique. Comput. Intell. Neurosci. 2022, 2022, 4567989.
14. Luqman, H.; El-Alfy, E.-S.M. Towards Hybrid Multimodal Manual and Non-Manual Arabic Sign Language Recognition: mArSL Database and Pilot Study. Electronics 2021, 10, 1739.
15. Bora, J.; Dehingia, S.; Boruah, A.; Chetia, A.A.; Gogoi, D. Real-time Assamese Sign Language Recognition using MediaPipe and Deep Learning. Procedia Comput. Sci. 2023, 218, 1384–1393.
16. Özdemir, O.; Kındıroğlu, A.A.; Camgöz, N.C.; Akarun, L. BosphorusSign22k Sign Language Recognition Dataset. arXiv 2020, arXiv:2004.01283.
17. Suardi, C. CNN architecture based on VGG16 model for SIBI sign language. AIP Conf. Proc. 2023, 2909, 120010.
18. Islam, M.; Aloraini, M.; Aladhadh, S.; Habib, S.; Khan, A.; Alabdulatif, A.; Alanazi, T.M. Toward a Vision-Based Intelligent System: A Stacked Encoded Deep Learning Framework for Sign Language Recognition. Sensors 2023, 23, 9068.
19. Alamri, M.; Lajmi, S. Design a smart platform translating Arabic sign language to English language. Int. J. Electr. Comput. Eng. (IJECE) 2024, 14, 4759–4774.
20. GitHub—byhqsr/tzutalin-labelImg: LabelImg is a graphical image annotation tool for labeling object bounding boxes in images. Available online: https://github.com/byhqsr/tzutalin-labelImg (accessed on 20 October 2025).
21. Noor, T.H.; Noor, A.; Alharbi, A.F.; Faisal, A.; Alrashidi, R.; Alsaedi, A.S.; Alharbi, G.; Alsanoosy, T.; Alsaeedi, A. Real-Time Arabic Sign Language Recognition Using a Hybrid Deep Learning Model. Sensors 2024, 24, 3683.
22. Al-Barham, M.; Sa’Aleek, A.A.; Al-Odat, M.; Hamad, G.; Al-Yaman, M.; Elnagar, A. Arabic Sign Language Recognition Using Deep Learning Models. In Proceedings of the 2022 13th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 21–23 June 2022; pp. 226–231.
23. Mahmoud, E.; Wassif, K.; Bayomi, H. Transfer Learning and Recurrent Neural Networks for Automatic Arabic Sign Language Recognition. In Proceedings of the 8th International Conference on Advanced Machine Learning and Technologies and Applications (AMLTA2022), Cairo, Egypt, 5–7 May 2022; Hassanien, A.E., Rizk, R.Y., Snášel, V., Abdel-Kader, R.F., Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer International Publishing: Cham, Switzerland, 2022; Volume 113, pp. 47–59, ISBN 978-3-031-03917-1.
24. Gochoo, M.; Batnasan, G.; Ahmed, A.A.; Otgonbold, M.-E.; Alnajjar, F.; Shih, T.K.; Tan, T.-H.; Wee, L.K. Fine-Tuning Vision Transformer for Arabic Sign Language Video Recognition on Augmented Small-Scale Dataset. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Oahu, HI, USA, 1–4 October 2023; pp. 2880–2885.
25. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 7478–7498.
26. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
27. Kyaw, N.N.; Mitra, P.; Sinha, G.R. Automated recognition of Myanmar sign language using deep learning module. Int. J. Inf. Technol. 2024, 16, 633–640.
28. Al-Obodi, A.H.; Al-Hanine, A.M.; Al-Harbi, K.N.; Al-Dawas, M.S.; Al-Shargabi, A.A. A Saudi Sign Language Recognition System based on Convolutional Neural Networks. Int. J. Eng. Res. Technol. 2020, 13, 3328–3334.
29. Balat, M.; Awaad, R.; Adel, H.; Zaky, A.B.; Aly, S.A. Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models. arXiv 2024, arXiv:2410.00681.
30. Al-Nafjan, A.; Al-Abdullatef, L.; Al-Ghamdi, M.; Al-Khalaf, N.; Al-Zahrani, W. Designing SignSpeak, an Arabic Sign Language Recognition System. In Proceedings of the HCI International 2020—Late Breaking Papers: Universal Access and Inclusive Design, Copenhagen, Denmark, 19–24 July 2020; Stephanidis, C., Antona, M., Gao, Q., Zhou, J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12426, pp. 161–170, ISBN 978-3-030-60148-5.
31. RVL-SLLL American Sign Language Database. Available online: https://engineering.purdue.edu/RVL/Database/ASL/asl-database-front.htm (accessed on 22 January 2025).
32. American Sign Language Video Dataset. Available online: https://crystal.uta.edu/~athitsos/projects/asl_lexicon/ (accessed on 22 January 2025).
33. Zahedi, M.; Keysers, D.; Deselaers, T.; Ney, H. Combination of Tangent Distance and an Image Distortion Model for Appearance-Based Sign Language Recognition. In Pattern Recognition; Kropatsch, W.G., Sablatnig, R., Hanbury, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3663, pp. 401–408.
34. Robert, E.J.; Duraisamy, H.J. A review on computational methods based automated sign language recognition system for hearing and speech impaired community. Concurr. Comput. Pract. Exp. 2023, 35, e7653.
35. Feng, B.; Zhang, H. Expression Recognition Based on Visual Transformers with Novel Attentional Fusion. J. Phys. Conf. Ser. 2024, 2868, 012036.
36. Yulvina, R.; Putra, S.A.; Rizkinia, M.; Pujitresnani, A.; Tenda, E.D.; Yunus, R.E.; Djumaryo, D.H.; Yusuf, P.A.; Valindria, V. Hybrid Vision Transformer and Convolutional Neural Network for Multi-Class and Multi-Label Classification of Tuberculosis Anomalies on Chest X-Ray. Computers 2024, 13, 343.
37. Sarker, I.H. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN Comput. Sci. 2021, 2, 420.
38. Taye, M.M. Understanding of Machine Learning with Deep Learning: Architectures, Workflow, Applications and Future Directions. Computers 2023, 12, 91.
39. Iman, M.; Arabnia, H.R.; Rasheed, K. A Review of Deep Transfer Learning and Recent Advancements. Technologies 2023, 11, 40.
40. Jiang, X.; Satapathy, S.C.; Yang, L.; Wang, S.-H.; Zhang, Y.-D. A Survey on Artificial Intelligence in Chinese Sign Language Recognition. Arab. J. Sci. Eng. 2020, 45, 9859–9894.
41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
42. Hagen, A. Microsoft Vision Model: A State-of-the-Art Pretrained Vision Model. Microsoft Research. Available online: https://www.microsoft.com/en-us/research/blog/microsoft-vision-model-resnet-50-combines-web-scale-data-and-multi-task-learning-to-achieve-state-of-the-art/ (accessed on 22 September 2024).
43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
44. Boesch, G. Vision Transformers (ViT) in Image Recognition: Full Guide—viso.ai. Available online: https://viso.ai/deep-learning/vision-transformer-vit/ (accessed on 24 January 2025).
45. Alharthi, N.M.; Alzahrani, S.M. Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition. Appl. Sci. 2023, 13, 11625.
46. Arabic Sign Language Dictionary for the Deaf 2 | Arab Organization of Sign Language Interpreters. Available online: https://selaa.org/node/215 (accessed on 19 July 2024).
47. Arabic Sign Language Dictionary for the Deaf 1 | Arab Organization of Sign Language Interpreters. Available online: https://selaa.org/node/204 (accessed on 19 July 2024).
48. Schindler, K.; Van Gool, L. Action snippets: How many frames does human action recognition require? In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
49. Obaid, F.; Babadi, A.; Yoosofan, A. Hand Gesture Recognition in Video Sequences Using Deep Convolutional and Recurrent Neural Networks. Appl. Comput. Syst. 2020, 25, 57–61.
50. Maharana, K.; Mondal, S.; Nemade, B. A review: Data pre-processing and data augmentation techniques. Glob. Transit. Proc. 2022, 3, 91–99.
51. Celebi, T.; Shayea, I.; El-Saleh, A.A.; Ali, S.; Roslee, M. Histogram Equalization for Grayscale Images and Comparison with OpenCV Library. In Proceedings of the 2021 IEEE 15th Malaysia International Conference on Communication (MICC), Virtual, 1–2 December 2021; pp. 92–97.
52. Siby, T.A.; Pal, S.; Arlina, J.; Nagaraju, S. Gesture based Real-Time Sign Language Recognition System. In Proceedings of the 2022 International Conference on Connected Systems & Intelligence (CSI), Trivandrum, India, 31 August–2 September 2022; pp. 1–6.
53. Das, S.; Yadav, S.K.; Samanta, D. Isolated Sign Language Recognition Using Deep Learning. In Proceedings of the Computer Vision and Image Processing, Jammu, India, 3–5 November 2023; Kaur, H., Jakhetiya, V., Goyal, P., Khanna, P., Raman, B., Kumar, S., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 343–356.
54. Maheshan, C.M.; Prasanna Kumar, H. Performance of image pre-processing filters for noise removal in transformer oil images at different temperatures. SN Appl. Sci. 2020, 2, 67.
55. Wang, G.; Lan, Y.; Wang, Y.; Xiong, W.; Li, J. Modified Non-local Means Filter for Color Image Denoising. Rev. Tec. Fac. Ing. Univ. Zulia 2016, 39, 123–131.
56. Chlap, P.; Min, H.; Vandenberg, N.; Dowling, J.; Holloway, L.; Haworth, A. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 2021, 65, 545–563.
57. ImageNet. Available online: https://www.image-net.org/update-mar-11-2021.php (accessed on 2 September 2024).
58. Brownlee, J. A Gentle Introduction to Transfer Learning for Deep Learning. MachineLearningMastery.com. 2017. Available online: https://machinelearningmastery.com/transfer-learning-for-deep-learning/ (accessed on 2 September 2024).
59. Akcay, S.; Kundegorski, M.E.; Willcocks, C.G.; Breckon, T.P. Using Deep Convolutional Neural Network Architectures for Object Classification and Detection Within X-Ray Baggage Security Imagery. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2203–2215.
60. Gülmez, B. A Comprehensive Review of Convolutional Neural Networks based Disease Detection Strategies in Potato Agriculture. Potato Res. 2024, 68, 1295–1329.
61. Thongkhome, P.; Yonezawa, T.; Kawaguchi, N. Performance Evaluation of KNIME Low Code Platform in Deep Learning Study and Optimal Hyperparameter Tuning. In Proceedings of the TENCON 2024—2024 IEEE Region 10 Conference (TENCON), Singapore, 1–4 December 2024; pp. 1373–1376.
62. Tesfagergis, A.M. Transformer Networks for Short-Term Forecasting of Electricity Prosumption; LUT University: Lappeenranta, Finland, 2021.
63. Mogan, J.N.; Lee, C.P.; Lim, K.M.; Muthu, K.S. Gait-ViT: Gait Recognition with Vision Transformer. Sensors 2022, 22, 7362.
64. Vujovic, Ž.Ð. Classification Model Evaluation Metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 599–606.
65. Keras Team. Keras Documentation: EarlyStopping. Available online: https://keras.io/api/callbacks/early_stopping/ (accessed on 10 September 2024).
66. Keras Team. Keras Documentation: ModelCheckpoint. Available online: https://keras.io/api/callbacks/model_checkpoint/ (accessed on 10 September 2024).
67. Keras Team. Keras Documentation: ReduceLROnPlateau. Available online: https://keras.io/api/callbacks/reduce_lr_on_plateau/ (accessed on 10 September 2024).
68. Kandel, I.; Castelli, M. The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express 2020, 6, 312–315.
69. Sharma, S.; Sharma, S. Comparison of Transfer learning-based Models for Sign Language Recognition. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–6.
70. Sulistya, Y.I.; Bangun, E.T.; Tyas, D.A. CNN Ensemble Learning Method for Transfer Learning: A Review. Available online: https://www.researchgate.net/publication/381101834_CNN_Ensemble_Learning_Method_for_Transfer_learning_A_Review (accessed on 12 March 2025).
71. AlKhuraym, B.Y.; Ismail, M.M.B.; Bchir, O. Arabic Sign Language Recognition using Lightweight CNN-based Architecture. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 319–328.
72. Alnabih, A.F.; Maghari, A.Y. Arabic sign language letters recognition using Vision Transformer. Multimed. Tools Appl. 2024, 83, 81725–81739.
73. Herbaz, N.; Idrissi, H.E.; Badri, A. Advanced Sign Language Recognition Using Deep Learning: A Study on Arabic Sign Language (ArSL) with VGGNet and ResNet50 Models. Res. Sq. 2025.
74. Dong, W.; Shen, S.; Han, Y.; Tan, T.; Wu, J.; Xu, H. Generative Models in Medical Visual Question Answering: A Survey. Appl. Sci. 2025, 15, 2983.
75. Sandeep, R.; Prakash, R.; Amit, D. Retrieval-Augmented Generation of Medical Vision-Language Models. Available online: https://www.researchgate.net/publication/389095901_Retrieval-Augmented_Generation_of_Medical_Vision-Language_Models (accessed on 14 March 2025).
76. Chen, L.; Chen, Y.; Ouyang, Z.; Dou, H.; Zhang, Y.; Sang, H. Boosting adversarial transferability in vision-language models via multimodal feature heterogeneity. Sci. Rep. 2025, 15, 7366.
| Authors/Year | Dataset Type | Classes | Sign Language | Sign Type | Methodology | Accuracy |
|---|---|---|---|---|---|---|
| Shin et al., 2023 [8] | Public / Self-built | 77 / 20 | Korean Sign Language (KSL) | Multi | Vision-based transformer | 89% / 98.3% |
| Rathi et al., 2019 [9] | Public | 36 | American Sign Language (ASL) | Static | Two-level ResNet50-based neural network architecture | 99.03% |
| Thakar et al., 2022 [11] | Self-built | 29 | ASL | Static | CNN + transfer learning | 98.7% |
| Areeb et al., 2022 [12] | Self-built | 8 | Indian Sign Language (ISL) | Dynamic | 3D CNNs; pretrained VGG16 + LSTM; YOLOv5 | 82%; 98%; 99.6% |
| Zakariah et al., 2022 [13] | Public: ArSL2018 | 32 | Arabic Sign Language (ArSL) | Static | EfficientNetB4 model | 95% |
| Alamri and Lajmi, 2024 [19] | Self-built | 118 | ArSL | Static | Fine-tuned SSD-ResNet50 V1 FPN | Average F-score 86.4%; accuracy 94% |
| Mahmoud, Wassif, and Bayomi, 2022 [23] | Self-built | — | ArSL | — | Hybrid model: transfer learning + RNN | 93.4% |
| Islam et al., 2023 [18] | Public: ArSL2018 | 32 | ArSL | Static | EfficientNetB3 with encoder and decoder network | 99.26% |
| Noor et al., 2024 [21] | Self-built: 4000 images (10 static gesture words) + 500 videos (10 dynamic gesture words) | 20 | ArSL | Static and dynamic | Hybrid model: CNN sub-model and LSTM sub-model | CNN: 94.40%; LSTM: 82.70% |
| Al-Barham et al., 2022 [22] | Public: ArSL2018 | 32 | ArSL | Static | CNNs: VGG-16 and ResNet-18 | Average 99.7% |
| Gochoo et al., 2023 [24] | Self-built | 6 | ArSL | Dynamic | Fine-tuned ViT | 93% |
| Özdemir et al., 2020 [16] | Self-built | 744 | Turkish Sign Language (TSL) | Multi | CNN | 94.76% for 3D ResNets; 88.53% for IDT |
| Kyaw et al., 2024 [27] | Self-built | 8 | Myanmar Sign Language (MSL) | Dynamic | CNN | 94% (Adam); 92% (SGDM) |
| Al-Obodi et al., 2020 [28] | Self-built | 40 | Saudi Sign Language | Static | CNN | 97.69% (training); 99.47% (test) |
| Balat et al., 2024 [29] | Public | 32 | ArSL | Static | Fine-tuned CNN and transformer-based models | ResNet50: 99.30%; MobileNetV2: 99.48%; EfficientNetB7: 99.60%; ViT: 99.38%; Microsoft Swin: 99.60% |
| Network Layers | Output Shape | Parameters |
|---|---|---|
| conv2d (Conv2D) | (None, 229, 229, 128) | 9728 |
| max_pooling2d(MaxPooling2D) | (None, 115, 115, 128) | 0 |
| conv2d_1 (Conv2D) | (None, 115, 115, 64) | 32,832 |
| max_pooling2d_1(MaxPooling2D) | (None, 58, 58, 64) | 0 |
| conv2d_2 (Conv2D) | (None, 58, 58, 32) | 8224 |
| max_pooling2d_2(MaxPooling2D) | (None, 29, 29, 32) | 0 |
| flatten_1 (Flatten) | (None, 26912) | 0 |
| dense (Dense) | (None, 512) | 13,779,456 |
| dropout (Dropout) | (None, 512) | 0 |
| dense_1 (Dense) | (None, 36) | 18,468 |
| Total parameters: 13,848,708 (52.83 MB) | ||
| Network Layers | Output Shape | Parameters |
|---|---|---|
| conv2d (Conv2D) | (None, 229, 229, 256) | 19,456 |
| max_pooling2d(MaxPooling2D) | (None, 115, 115, 256) | 0 |
| conv2d_1 (Conv2D) | (None, 115, 115, 128) | 819,328 |
| max_pooling2d_1(MaxPooling2D) | (None, 58, 58, 128) | 0 |
| conv2d_2 (Conv2D) | (None, 58, 58, 64) | 32,832 |
| max_pooling2d_2(MaxPooling2D) | (None, 29, 29, 64) | 0 |
| conv2d_3 (Conv2D) | (None, 29, 29, 32) | 8224 |
| max_pooling2d_3(MaxPooling2D) | (None, 15, 15, 32) | 0 |
| conv2d_4 (Conv2D) | (None, 15, 15, 16) | 2064 |
| batch_normalization (Batch Normalization) | (None, 15, 15, 16) | 64 |
| flatten (Flatten) | (None, 3600) | 0 |
| dense (Dense) | (None, 512) | 1,843,712 |
| dropout (Dropout) | (None, 512) | 0 |
| dense_1 (Dense) | (None, 36) | 18,468 |
| Total parameters: 2,744,148 (10.47 MB) | ||
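For reference, the five-layer network summarized in the second table can be reconstructed in Keras as follows. The kernel sizes (5×5 for the first two convolutions, 2×2 for the rest) are inferred from the listed parameter counts, and 'same' padding reproduces the listed output shapes; the activations and the dropout rate are assumptions not stated in the table.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(229, 229, 3))
x = layers.Conv2D(256, 5, padding="same", activation="relu")(inputs)  # 19,456 params
x = layers.MaxPooling2D(2, padding="same")(x)                         # -> 115x115
x = layers.Conv2D(128, 5, padding="same", activation="relu")(x)       # 819,328
x = layers.MaxPooling2D(2, padding="same")(x)                         # -> 58x58
x = layers.Conv2D(64, 2, padding="same", activation="relu")(x)        # 32,832
x = layers.MaxPooling2D(2, padding="same")(x)                         # -> 29x29
x = layers.Conv2D(32, 2, padding="same", activation="relu")(x)        # 8,224
x = layers.MaxPooling2D(2, padding="same")(x)                         # -> 15x15
x = layers.Conv2D(16, 2, padding="same", activation="relu")(x)        # 2,064
x = layers.BatchNormalization()(x)                                    # 64
x = layers.Flatten()(x)                                               # 15*15*16 = 3600
x = layers.Dense(512, activation="relu")(x)                           # 1,843,712
x = layers.Dropout(0.5)(x)                                            # rate assumed
outputs = layers.Dense(36, activation="softmax")(x)                   # 18,468
model = keras.Model(inputs, outputs)
model.summary()  # total: 2,744,148 parameters (~10.47 MB in float32)
```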
| Parameter | Value |
|---|---|
| Image size | (256, 256, 3) |
| Patch size P | 32 |
| Number of encoder layers L | 8 |
| Number of heads K | 8 |
| MLP dimension | 2048 |
| Hidden size D | 512 |
| Epochs | 30 |
| Hyperparameter | Value |
|---|---|
| Image size | (256, 256, 3) |
| Patch size P | 32 |
| Number of encoder layers L | 6 |
| Number of heads K | 6 |
| MLP dimension | 1024 |
| Hidden size D | 256 |
| Epochs | 30 |
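The two tables above differ only in depth and width. As a concrete illustration, here is a minimal Keras sketch of an encoder built from the first (SignViT) configuration: a 256×256 input with 32×32 patches gives 64 tokens, hidden size D = 512, L = 8 encoder layers, K = 8 heads, and MLP dimension 2048. The strided-convolution patch embedding, mean-pooling of tokens instead of a class token, GELU activation, and the 92-way output head are assumptions rather than details taken from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

class AddPositionEmbedding(layers.Layer):
    """Adds a learnable positional embedding to the patch tokens."""
    def build(self, input_shape):
        self.pos = self.add_weight(name="pos",
                                   shape=(1, input_shape[1], input_shape[2]),
                                   initializer="random_normal")
    def call(self, x):
        return x + self.pos

IMAGE_SIZE, PATCH, D, L, K, MLP_DIM, NUM_CLASSES = 256, 32, 512, 8, 8, 2048, 92
NUM_PATCHES = (IMAGE_SIZE // PATCH) ** 2  # 64 tokens

inputs = keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
# Patch embedding: one D-dimensional token per non-overlapping 32x32 patch.
x = layers.Conv2D(D, kernel_size=PATCH, strides=PATCH)(inputs)
x = layers.Reshape((NUM_PATCHES, D))(x)
x = AddPositionEmbedding()(x)

for _ in range(L):  # pre-norm transformer encoder blocks
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=K, key_dim=D // K)(h, h)
    x = layers.Add()([x, h])
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.Dense(MLP_DIM, activation="gelu")(h)
    h = layers.Dense(D)(h)
    x = layers.Add()([x, h])

x = layers.LayerNormalization(epsilon=1e-6)(x)
x = layers.GlobalAveragePooling1D()(x)  # pool tokens instead of a CLS token
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```

Swapping in L = 6, K = 6, MLP_DIM = 1024, and D = 256 yields the second, lighter configuration.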
| Models | Epochs | Accuracy | Val Accuracy | Loss | Test Accuracy | Test Loss | Val Loss |
|---|---|---|---|---|---|---|---|
| CNN 3 Layers | 70 | 93% | 95% | 0.28 | 100% | 0.0011 | 0.24 |
| CNN 5 Layers | 50 | 96% | 96% | 0.24 | 100% | 0.014 | 0.36 |
| Modified ResNet50 | 20 | 98% | 93% | 1.35 | 94.64% | 0.28 | 0.41 |
| Modified ResNet50V2 | 12 | 99% | 95% | 0.18 | 97.61% | 0.20 | 0.55 |
| Fine-tuned ResNet50 | 30 | 99% | 86% | 0.03 | 98.03% | 0.02 | 0.005 |
| Fine-tuned ResNet50V2 | 18 | 90% | 70% | 0.15 | 97.02% | 0.23 | 0.23 |
| Models | Epochs | Accuracy | Val Accuracy | Loss | Test Accuracy | Test Loss | Val Loss |
|---|---|---|---|---|---|---|---|
| SignViT | 18 | 76% | 78% | 0.87 | 92.68% | 0.21 | 0.75 |
| ResNet50ViT | 20 | 98% | 82% | 0.09 | 99.86% | 0.002 | 0.03 |
| Studies | Public/Self-Built | Size | Classes | Methodology | Accuracy |
|---|---|---|---|---|---|
| [22] | Self-built | 54,049 images of 32 alphabets (static) | 32 classes | CNN, VGG-16, ResNet-18 | 99.47% (best result, ResNet-18) |
| [71] | Self-built | 5400 images of the ArSL alphabet | 30 classes | Lightweight EfficientNet CNN | 94% |
| [19] | Self-built | 5900 images | 118 classes | MobileNet and fine-tuned SSD-ResNet50 | 94% |
| [72] | Public | ArSL2018 dataset: 54,049 images of the ArSL letters | 32 classes | Fine-tuned ViT-based model | 99.3% |
| [73] | Public and self-built | 54,049 images of the alphabet, including 15,200 images | 32 classes | ResNet50 | 98.50% |
| Our Framework: RAFID | Self-built | 1800 static images; 23,000 key frames (4600 videos) | 36 classes; 92 classes | Fine-tuned ResNet50; ResNet50ViT | 98.03%; 99.86% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).