Real-Time Mobile Application for Translating Portuguese Sign Language to Text Using Machine Learning
Abstract
1. Introduction
2. Related Work
2.1. Sign Language Characteristics and LGP Context
2.2. Evolution of Sign Language Recognition Techniques
3. Materials and Methods
3.1. System Architecture
3.2. LGP Alphabet Dataset: Acquisition and Preprocessing
3.2.1. Data Collection and Preparation
- Direct Image Capture: Still images were captured from several participants signing LGP letters in diverse environments, featuring different backgrounds, lighting conditions, and camera angles (see examples in Figure 3). This aimed to reflect potential real-world usage scenarios.
- Video Frame Extraction: Video recordings of individuals performing sequences of LGP letters provided another data source, significantly increasing the number of available training samples. Relevant frames for each letter gesture were extracted from these videos. To focus the models on the hand gesture itself and minimize background influence, the hand region in each extracted frame was identified using the MediaPipe framework [24], and only this region was retained (see examples in Figure 4; a code sketch of this cropping step follows this list). Although many frames extracted this way were visually similar, this apparent redundancy proved beneficial for capturing subtle variations in gesture execution under diverse conditions, potentially improving model robustness. However, the high degree of similarity also presented a risk of overfitting, reinforcing the need to subsequently apply data augmentation to further diversify the training set.
- “Nothing” Class Inclusion: To enable the system to differentiate between a valid LGP letter gesture and the absence of one, a dedicated “nothing” class was included. Images for this class, representing background scenes or hands not performing gestures, were sourced from a publicly available dataset on Kaggle [34] (see an example in Figure 5). This helps reduce false positive classifications when no intended gesture is present.
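To make the hand-region extraction step concrete, the following is a minimal sketch of how a hand could be located and cropped with the MediaPipe Hands solution [24]; the function name, margin value, and single-hand assumption are illustrative and not taken from the implementation described in this paper.

```python
# Hedged sketch of hand-region cropping with MediaPipe Hands; parameters are assumptions.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def crop_hand_region(frame_bgr, margin=0.15):
    """Return the cropped hand region of a frame, or None if no hand is detected."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    h, w = frame_bgr.shape[:2]
    xs = [lm.x for lm in results.multi_hand_landmarks[0].landmark]
    ys = [lm.y for lm in results.multi_hand_landmarks[0].landmark]
    # Bounding box around the 21 landmarks, expanded by a relative margin and clipped to the frame.
    x_min = max(int((min(xs) - margin) * w), 0)
    x_max = min(int((max(xs) + margin) * w), w)
    y_min = max(int((min(ys) - margin) * h), 0)
    y_max = min(int((max(ys) + margin) * h), h)
    return frame_bgr[y_min:y_max, x_min:x_max]
```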
3.2.2. Data Preprocessing
- Image Normalization: All images were resized to a uniform resolution of 200 × 200 pixels. This size was selected as a trade-off between retaining sufficient detail for gesture recognition and maintaining computational efficiency for model training and inference. To preserve the original aspect ratio and avoid distortion during resizing, black padding was added as necessary (see Figure 6).
- Data Augmentation: To enhance model robustness and generalization capabilities, data augmentation techniques were applied to the training set, artificially increasing its size and variability. The applied transformations included random rotations (simulating variations in hand orientation), horizontal flips (accounting for potential left-handed signers), brightness and contrast adjustments (simulating different lighting conditions), and small random translations (improving spatial invariance). Examples of augmented images are shown in Figure 7. A code sketch of the resizing and augmentation steps follows this list.
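The sketch below illustrates the two preprocessing steps described above: an aspect-ratio-preserving resize to 200 × 200 pixels with black padding, followed by a Keras ImageDataGenerator configured with augmentations analogous to those listed. The specific parameter ranges are assumptions for illustration; contrast adjustment, which ImageDataGenerator does not expose directly, would require a custom preprocessing function.

```python
# Hedged sketch of the preprocessing pipeline; augmentation ranges are illustrative assumptions.
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

TARGET = 200  # target resolution used in the paper

def letterbox(image, size=TARGET):
    """Resize so the longer side equals `size`, padding the remainder with black."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((size, size, 3), dtype=resized.dtype)
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas

# Augmentations analogous to those listed above; exact values are assumptions.
augmenter = ImageDataGenerator(
    rotation_range=15,            # random rotations (hand orientation)
    horizontal_flip=True,         # left-handed signers
    brightness_range=(0.7, 1.3),  # lighting variation
    width_shift_range=0.1,        # small horizontal translations
    height_shift_range=0.1,       # small vertical translations
)
```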
3.2.3. Dataset Organization and Splitting
- 94,064 images for training;
- 6,157 images for validation;
- 2,520 images for testing.
3.3. Classification Model Development
3.3.1. Model 1: CNN for On-Device Image Recognition
Architecture
Training
Class Imbalance Handling
Mobile Deployment
3.3.2. Model 2: MediaPipe Landmarks and MLP Classification
Architecture and Feature Extraction
MLP Classifier
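As a rough illustration of this approach (not the authors' exact architecture), a small Keras MLP could classify the 42 values formed by the (x, y) coordinates of the 21 MediaPipe hand landmarks [28]; the layer sizes and the assumed class count (26 LGP letters plus the "nothing" class) below are placeholders for illustration only.

```python
# Generic sketch of a landmark-based MLP classifier; sizes and class count are assumptions.
import tensorflow as tf

NUM_CLASSES = 27  # assumed: 26 LGP letters + the "nothing" class

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(21 * 2,)),        # flattened (x, y) landmark coordinates
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```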
3.4. Mobile Application Implementation and Interface
3.5. Evaluation Methodology
3.5.1. Model Performance Evaluation
3.5.2. Usability Evaluation
4. Results
4.1. Evaluation of Classification Models
4.1.1. Model 1: CNN for Letter Recognition
4.1.2. Model 2: MediaPipe for Hand Landmark Detection
4.1.3. Comparative Analysis
4.2. Mobile Application Evaluation
4.2.1. Technical Performance Within Application
4.2.2. User Experience Feedback
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Lucas, C. (Ed.) Sociolinguistic Variation in American Sign Language; Cambridge University Press: Cambridge, UK, 2001. [Google Scholar] [CrossRef]
- Klima, E.S.; Bellugi, U. Signs of Language; Harvard University Press: London, UK, 1979. [Google Scholar]
- Stokoe, W. Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf. J. Deaf. Stud. Deaf. Educ. 2005, 10, 3–37. [Google Scholar] [CrossRef]
- Assembleia da República. Lei Constitucional n.º 1/97, de 20 de Setembro. Point (h) of Paragraph 2 of Article 74. 1997. Available online: https://diariodarepublica.pt/dr/detalhe/lei-constitucional/1-653562 (accessed on 22 July 2024).
- Pfau, R.; Steinbach, M.; Woll, B. (Eds.) Sign Language: An International Handbook; De Gruyter: Berlin, Germany, 2012. [Google Scholar] [CrossRef]
- Tamura, S.; Kawasaki, S. Recognition of sign language motion images. Pattern Recognit. 1988, 21, 343–353. [Google Scholar] [CrossRef]
- Tanibata, N.; Shimada, N.; Shirai, Y. Extraction of Hand Features for Recognition of Sign Language. In Proceedings of the International Conference on Vision Interface, New York, NY, USA, 8–10 July 2002. [Google Scholar]
- Guerin, G. Sign Language Recognition—Using MediaPipe & DTW. Available online: https://data-ai.theodo.com/en/technical-blog/sign-language-recognition-using-mediapipe (accessed on 22 July 2024).
- Li, G.; Tang, H.; Sun, Y.; Kong, J.; Jiang, G.; Jiang, D.; Tao, B.; Xu, S.; Liu, H. Hand gesture recognition based on convolution neural network. Clust. Comput. 2019, 22, 2719–2729. [Google Scholar] [CrossRef]
- Papastratis, I.; Dimitropoulos, K.; Konstantinidis, D.; Daras, P. Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space. IEEE Access 2020, 8, 91170–91180. [Google Scholar] [CrossRef]
- Cihan Camgoz, N.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 14–19 June 2020; pp. 10020–10030. [Google Scholar] [CrossRef]
- Tayade, A.; Halder, A. Real-time Vernacular Sign Language Recognition using MediaPipe and Machine Learning. Int. J. Res. Publ. Rev. 2021, 2, 9–17. [Google Scholar] [CrossRef]
- Indriani; Harris, M.; Agoes, A.S. Applying Hand Gesture Recognition for User Guide Application Using MediaPipe. In Proceedings of the 2nd International Seminar of Science and Applied Technology (ISSAT 2021), Online, 12 October 2021; Atlantis Press: Amsterdam, The Netherlands, 2021; pp. 101–108. [Google Scholar] [CrossRef]
- Feng, Q.; Yang, C.; Wu, X.; Li, Z. A smart TV interaction system based on hand gesture recognition by using RGB-D Sensor. In Proceedings of the 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), Shenyang, China, 20–22 December 2013; pp. 1319–1322. [Google Scholar] [CrossRef]
- Mora-Zarate, J.E.; Garzón-Castro, C.L.; Castellanos Rivillas, J.A. Construction and Evaluation of a Dynamic Sign Dataset for the Colombian Sign Language. In Proceedings of the 2024 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Bogota, Colombia, 13–15 November 2024; pp. 1–5. [Google Scholar] [CrossRef]
- Neiva, I.G.S. Desenvolvimento de um Tradutor de Língua Gestual Portuguesa. Master’s Thesis, NOVA School of Science and Technology, NOVA University Lisbon, Lisbon, Portugal, 2014. Available online: http://hdl.handle.net/10362/14753 (accessed on 22 July 2024).
- Ribeiro, P.R. Sistema de Reconhecimento de Língua Gestual Portuguesa Recorrendo à Kinect. Master’s Thesis, School of Engineering, University of Minho, Minho, Portugal, 2019. Available online: https://hdl.handle.net/1822/72165 (accessed on 22 July 2024).
- Oliveira, O.R.S.A. Tradutor da Língua Gestual Portuguesa Modelo de Tradução Bidireccional. Master’s Thesis, ISEP—Porto School of Engineering, Polytechnic University of Porto, Porto, Portugal, 2013. Available online: http://hdl.handle.net/10400.22/6246 (accessed on 22 July 2024).
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef]
- Goswami, T.; Javaji, S.R. CNN Model for American Sign Language Recognition. In Proceedings of the ICCCE 2020; Kumar, A., Mozar, S., Eds.; Springer: Singapore, 2021; pp. 55–61. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE 2021, 109, 43–76. [Google Scholar] [CrossRef]
- TecPerson. Sign Language MNIST: Drop-In Replacement for MNIST for Hand Gesture Recognition Tasks. 2017. Available online: https://www.kaggle.com/datasets/datamunge/sign-language-mnist (accessed on 22 July 2024).
- Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar]
- Google AI. MediaPipe Gesture Recognizer. 2025. Available online: https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer (accessed on 22 July 2024).
- Kruse, R.; Mostaghim, S.; Borgelt, C.; Braune, C.; Steinbrecher, M. Multi-layer Perceptrons. In Computational Intelligence: A Methodological Introduction; Springer International Publishing: Cham, Switzerland, 2022; pp. 53–124. [Google Scholar] [CrossRef]
- Takahashi, S. Hand Gesture Recognition Using MediaPipe. 2023. Available online: https://github.com/Kazuhito00/hand-gesture-recognition-using-mediapipe (accessed on 22 July 2024).
- Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar]
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
- Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef]
- Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
- Fahey, O.; Lexset. Synthetic ASL Alphabet. 2022. Available online: https://www.kaggle.com/datasets/lexset/synthetic-asl-alphabet/ (accessed on 22 July 2024).
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
- Google. Flutter MediaPipe. 2024. Available online: https://github.com/google/flutter-mediapipe (accessed on 22 July 2024).
Comparative performance of the two classification models (see Section 4.1.3):

| Model | Accuracy | Macro Avg. Precision | Macro Avg. Recall | Macro Avg. F1-Score |
|---|---|---|---|---|
| CNN | 76% | 79% | 75% | 76% |
| MediaPipe + MLP | 77% | 78% | 76% | 77% |
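The accuracy and macro-averaged precision, recall, and F1-score reported above can be computed from per-class predictions with scikit-learn (cf. [33]); the sketch below uses hypothetical label arrays purely to show the computation.

```python
# Hedged sketch of the metric computation; y_true and y_pred are placeholder arrays.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["A", "B", "C", "nothing", "A"]   # ground-truth letters (placeholder)
y_pred = ["A", "B", "A", "nothing", "A"]   # model predictions (placeholder)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%}")
```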