Proceeding Paper

Development of Transactional Filipino Sign Language Recognition System Using MediaPipe and Gated Recurrent Units †

by Angela Cardano, Franz Railey Columna and Jocelyn Villaverde *
School of Electrical, Electronics, and Computer Engineering, Mapúa University, Manila 1002, Philippines
* Author to whom correspondence should be addressed.
Presented at the 7th Eurasia Conference on IoT, Communication and Engineering 2025 (ECICE 2025), Yunlin, Taiwan, 14–16 November 2025.
Eng. Proc. 2026, 134(1), 47; https://doi.org/10.3390/engproc2026134047
Published: 14 April 2026

Abstract

Persistent communication barriers for the deaf and hard-of-hearing community in the Philippines are addressed in this study by developing a Filipino Sign Language Recognition (SLR) system. The system focuses on transactional signs commonly used in commercial environments such as markets and public facilities, thereby filling a gap left by existing SLR models. A vision-based approach was adopted, employing MediaPipe for landmark detection and Gated Recurrent Units for translating signs into text. To train the model, a custom dataset comprising 1065 video samples of 26 transactional signs was created, accounting for subtle variations in individual signing styles. The complete system was implemented on a Raspberry Pi 5 equipped with a webcam and touchscreen display. When evaluated on unseen data, the system achieved a recognition accuracy of 87%, demonstrating its potential for real-world applications in supporting commercial interactions for deaf and hard-of-hearing individuals.

1. Introduction

The Filipino Sign Language (FSL) was declared the national sign language of the Philippines through Republic Act No. 11106 [1]. Despite this recognition since 2018, deaf and hard-of-hearing people still face communication barriers in official and commercial transactions [2]. Sign language recognition (SLR) systems are emerging as a technology-based response to these barriers [3,4,5,6]. Comprehensive reviews of machine learning applications in SLR highlight a shift toward deep learning to improve gesture classification [7,8]. Several vision-based algorithms have also been compared to determine which are best suited to SLR systems [9,10]. Recent studies on FSL recognition have explored architectures such as InceptionV3 and hybrid convolutional neural network and long short-term memory (CNN-LSTM) models to capture spatial-temporal features [14,15].
The MediaPipe and LSTM combination in [9] reached an accuracy of 95% during evaluation and 98% on the test dataset, which demonstrates significant potential for dynamic sign language recognition. Gated Recurrent Units (GRUs) are often preferred over LSTM networks for sequential data processing. They offer similar performance with lower computational complexity and shorter training times [11,12,13].
GRUs are a type of recurrent neural network (RNN) designed for sequential data. They mitigate the vanishing gradient problem that affects traditional RNNs. The update gate controls the proportion of information from the previous hidden state that is carried over to the current hidden state [12,13]. The reset gate governs how much of the previous hidden state is disregarded when the new candidate state is computed. When the reset gate is close to zero, the network effectively resets its internal memory and focuses on the current input.
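The gating behaviour described above can be made concrete with a single GRU time step in NumPy. This is a minimal illustrative sketch and not the authors' model. The weights are random, and the helper names `init_params` and `gru_step` are hypothetical. The final blend follows the convention of PyTorch's `nn.GRU` and the Keras `GRU` layer, in which `z` is the share of the previous state that is kept. Some papers write the blend with `z` and `1 - z` swapped. Library implementations also differ slightly in where the reset gate is applied.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim: int, hidden_dim: int, rng: np.random.Generator):
    """Small random weights and zero biases for the update, reset, and candidate parts."""
    params = []
    for _ in range(3):
        params += [
            rng.normal(0.0, 0.1, size=(hidden_dim, input_dim)),   # W (input weights)
            rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim)),  # U (recurrent weights)
            np.zeros(hidden_dim),                                 # b (bias)
        ]
    return params


def gru_step(x, h_prev, params):
    """One GRU time step; z is the share of h_prev that is kept."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)            # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)            # reset gate
    # r near 0 removes h_prev from the candidate, so it depends on x alone.
    h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)
    return z * h_prev + (1.0 - z) * h_cand               # blend old and new state


# Run a GRU over one 75-frame sequence of 126 hand features (sizes illustrative).
rng = np.random.default_rng(0)
params = init_params(input_dim=126, hidden_dim=64, rng=rng)
h = np.zeros(64)
for frame in rng.normal(size=(75, 126)):
    h = gru_step(frame, h, params)
print(h.shape)  # (64,)
```

You can check the gates directly. Forcing the update-gate bias high makes the step copy the previous state. Forcing both biases low makes the output independent of the previous state.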
In this study, the MediaPipe Holistic pipeline was used for feature extraction, in particular for hand landmarks. Hand landmarks are essential for recognizing both static and dynamic FSL gestures [14,15]. The holistic pipeline performs face, hand, and pose landmark detection. It isolates key points that serve as inputs to the machine learning models used for classification. In total it provides 543 landmarks, which yields 1629 features when the X, Y, and Z coordinates of each landmark are considered. Such granularity is critical for robust and nuanced FSL recognition [14].
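The 543-landmark total follows from the standard MediaPipe Holistic outputs: 33 pose landmarks, 468 face-mesh landmarks, and 21 landmarks per hand. The short check below reproduces the 1629-feature figure. It also shows the smaller 126-value vector that results when only the two hands are kept, as in the MediaPipe processing module described later.

```python
# Standard MediaPipe Holistic landmark counts.
POSE_LANDMARKS = 33
FACE_LANDMARKS = 468
HAND_LANDMARKS = 21   # per hand
COORDS = 3            # x, y, z

holistic_landmarks = POSE_LANDMARKS + FACE_LANDMARKS + 2 * HAND_LANDMARKS
holistic_features = holistic_landmarks * COORDS
hands_only_features = 2 * HAND_LANDMARKS * COORDS  # left + right hands only

print(holistic_landmarks, holistic_features, hands_only_features)  # 543 1629 126
```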
Building on these findings, we developed a transactional FSL recognition system using MediaPipe and GRUs. The system captures video through a webcam interface. MediaPipe extracts key landmarks, and GRUs translate dynamic transactional gestures into text. The system was implemented on a Raspberry Pi 5. Similar edge-computing setups using NVIDIA Jetson or Raspberry Pi boards have been used for real-time object detection in checkout systems [16]. Related recognition techniques have also been applied to tasks such as PIR-sensor-based gesture control [17], lip-reading for Philippine regional languages [18], and the translation of air-written Baybayin [19]. System performance was evaluated using a multi-class confusion matrix and overall accuracy. The system contributes to automated FSL recognition by expanding the dataset to include common transactional signs while accounting for natural variations in signing styles.

2. Materials and Methods

2.1. System Development

The system follows an input–process–output model and takes the video feed of transactional signs as input (Figure 1). MediaPipe extracts the hand and pose landmark coordinates from the video frames. These coordinates are then fed as a time sequence to the GRU pipeline. The GRU models the temporal dependencies to recognize the sign, and the final output is the text translation of the recognized gesture. The system was built on a Raspberry Pi 5 (Cytron Technologies, Bukit Mertajam, Malaysia). A 1080p web camera connected through the Camera Serial Interface (CSI) port supplies the video input. The output is displayed on a 7-inch touchscreen monitor. Other components include a microSD card, a USB-C power supply, and a CSI cable. All components are housed in a 3D-printed case, with the Raspberry Pi 5 attached to the back of the monitor (Figure 2).
The developed FSL recognition system follows the workflow illustrated in Figure 3. At system initialization, the webcam is activated to capture the video feed. The process relies on two primary modules: MediaPipe Landmarking and GRU Sign Identification. The MediaPipe Landmarking module extracts the spatial coordinates of landmarks from each video frame. The GRU Sign Identification module then analyzes the temporal progression of the gesture by tracking landmark movements across frames. Video frames are captured and processed continuously. The system outputs the text translation of the recognized sign with an average latency of approximately five seconds.
The MediaPipe processing module divides the three-second video feed into 75 frames, that is, 25 frames per second. Each frame undergoes a color format conversion before MediaPipe performs landmark detection. The detector locates the 3D landmark coordinates of the left and right hands, which comprise 21 landmarks per hand. If a hand is not detected in a frame, its coordinates are filled with zeros so that the data structure stays constant. The landmark data from both hands are then concatenated into a single array that represents that frame (Figure 4).
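The per-frame step can be sketched as follows. This is a minimal NumPy sketch, not the authors' code, and it rests on several assumptions. Frames are sampled uniformly from the clip. The left hand precedes the right in the concatenated vector. The landmark objects have the shape returned by MediaPipe's legacy Holistic solution, where `results.left_hand_landmarks` is `None` when no hand is found. The helper names `sample_frame_indices`, `hand_vector`, and `frame_features` are hypothetical. A stand-in results object replaces a real MediaPipe call so that the sketch runs without a camera.

```python
from types import SimpleNamespace

import numpy as np

NUM_FRAMES = 75                     # 3 s clip at 25 frames per second
HAND_LANDMARKS = 21                 # MediaPipe hand model
HAND_FEATURES = HAND_LANDMARKS * 3  # x, y, z per landmark -> 63


def sample_frame_indices(total_frames: int, num_frames: int = NUM_FRAMES) -> np.ndarray:
    """Evenly spaced frame indices (assumption: uniform sampling of the clip)."""
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int)


def hand_vector(hand_landmarks) -> np.ndarray:
    """Flatten one hand to 63 values, or return zeros if it was not detected."""
    if hand_landmarks is None:
        return np.zeros(HAND_FEATURES, dtype=np.float32)
    return np.array(
        [[lm.x, lm.y, lm.z] for lm in hand_landmarks.landmark], dtype=np.float32
    ).flatten()


def frame_features(results) -> np.ndarray:
    """Concatenate left then right hand into one 126-value frame vector."""
    return np.concatenate(
        [hand_vector(results.left_hand_landmarks),
         hand_vector(results.right_hand_landmarks)]
    )


# With MediaPipe's legacy Solutions API, `results` would come from, e.g.:
#   holistic = mediapipe.solutions.holistic.Holistic()
#   results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
# Stand-in results object: left hand detected, right hand missing.
left = SimpleNamespace(landmark=[SimpleNamespace(x=0.1, y=0.2, z=0.3)] * HAND_LANDMARKS)
demo_results = SimpleNamespace(left_hand_landmarks=left, right_hand_landmarks=None)
demo_vec = frame_features(demo_results)
print(demo_vec.shape)  # (126,)
```

Zero-filling a missing hand keeps every frame at exactly 126 values, so the sequences stack cleanly into a fixed-size model input.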
The processed landmark data from MediaPipe are fed directly into the GRU Model Inference Module (Figure 5). A reshaping operation aligns the data with the input structure that the GRU model expects. It stacks the per-frame arrays, in order, into a sequence that corresponds to the original video clip. The GRU model analyzes this entire sequence to capture the temporal progression of the sign. It then identifies the gesture as the sign with the highest assigned probability.
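The reshaping and selection steps can be illustrated with NumPy alone. This is a sketch under stated assumptions. The `(1, frames, features)` batch layout is the usual input shape for Keras-style recurrent layers. The three labels are an illustrative subset of the 26 signs, in a hypothetical order. Fixed probabilities stand in for the trained model's output. The helpers `to_model_input` and `pick_sign` are hypothetical names.

```python
import numpy as np

# Illustrative subset of sign labels; the full 26-label order is an assumption.
LABELS = ["Again", "Card", "Coin"]


def to_model_input(frame_vectors) -> np.ndarray:
    """Stack per-frame vectors into a (1, frames, features) batch."""
    sequence = np.stack(frame_vectors, axis=0)  # (frames, features), time-ordered
    return sequence[np.newaxis, ...]            # add a batch dimension


def pick_sign(probabilities, labels):
    """Return the most probable label and its probability (confidence score)."""
    probs = np.asarray(probabilities, dtype=float).reshape(-1)
    best = int(np.argmax(probs))
    return labels[best], float(probs[best])


frames = [np.zeros(126, dtype=np.float32) for _ in range(75)]
batch = to_model_input(frames)
print(batch.shape)  # (1, 75, 126)

# With a trained Keras-style model this would be, e.g., model.predict(batch)[0];
# fixed probabilities stand in for the model output here.
sign, confidence = pick_sign([0.1, 0.7, 0.2], LABELS)
print(sign, confidence)  # Card 0.7
```

The highest probability doubles as the confidence score that the interface displays next to the recognized sign.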

2.2. Experimental Setup

Figure 6a,b show the prototype in use, with the touchscreen monitor prompting the signer to perform a gesture. The camera must capture the user's upper body and arm reach during the three-second recording window. The system then displays the text translation of the recognized sign below the video feed, together with its confidence score.

2.3. Data Gathering

We created a custom dataset of everyday transactional interactions in Filipino Sign Language. Twenty participants (aged 19 to 60) performed the signs, with two variations per sign to account for natural signing nuances. The final dataset consists of 1065 videos, split in an 80:20 ratio into a training set and a validation/testing set.
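The paper does not state how the 80:20 split was drawn. The sketch below assumes a simple seeded random split over video indices, with an arbitrary seed and a hypothetical helper `split_indices`. It shows the resulting subset sizes for 1065 videos.

```python
import numpy as np

NUM_VIDEOS = 1065


def split_indices(n: int, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle 0..n-1 and cut it into train and held-out index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(n * train_fraction))
    return order[:n_train], order[n_train:]


train_idx, test_idx = split_indices(NUM_VIDEOS)
print(len(train_idx), len(test_idx))  # 852 213
```

A split grouped by participant would instead test how well the model generalizes to unseen signers.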

3. Results

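A multi-class confusion matrix is used to tabulate the classification results. Four counts are defined for each sign. True positives (TP) are samples of that sign that are correctly identified. False positives (FP) are samples of other signs predicted as that sign. False negatives (FN) are samples of that sign predicted as another sign. True negatives (TN) are all remaining samples. The accuracy of the model is calculated using Equation (1).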
$$\mathrm{Accuracy} = \frac{\sum_{n=1}^{26} A_{nn}}{\sum_{i=1}^{26} \sum_{j=1}^{26} A_{ij}} \tag{1}$$
The numerator sums the diagonal elements of the confusion matrix (Figure 7). Each diagonal cell counts the samples of a sign that were correctly predicted as that sign. The denominator sums all entries in the matrix, which is the total number of predictions made by the system.
Table 1 summarizes the detailed test results, including the actual and predicted outputs for each trial. The test comprised 78 trials (three recorded clips for each of the 26 gestures), and the system produced ten misclassifications. The resulting multi-class confusion matrix, shown in Figure 7, highlights key performance trends of the model.
Performance was evaluated using the confusion matrix together with Equation (1). Equation (2) applies Equation (1) to the test results by summing the correct classifications for each gesture and dividing by the total number of trials. On this basis, the model achieved an overall recognition accuracy of 87.18% (68 of 78 trials) in identifying transactional Filipino Sign Language gestures, which indicates its potential for practical applications.
$$\mathrm{Accuracy} = \frac{0 + 0 + 1 + 1 + 2 + 2 + 2 + 3 + \cdots + 3}{78} = \frac{68}{78} \approx 87.18\% \tag{2}$$
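Equations (1) and (2) and the per-sign counts can be checked with a short NumPy sketch. It assumes that rows hold actual signs and columns hold predicted signs. It uses a toy three-sign matrix rather than the study's data. The helper names `accuracy_from_confusion` and `per_class_counts` are hypothetical.

```python
import numpy as np


def accuracy_from_confusion(cm) -> float:
    """Equation (1): trace (correct predictions) over the sum of all entries."""
    cm = np.asarray(cm)
    return float(np.trace(cm) / cm.sum())


def per_class_counts(cm, k: int):
    """TP, FP, FN, TN for class k, with rows = actual and columns = predicted."""
    cm = np.asarray(cm)
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp   # other signs predicted as k
    fn = cm[k, :].sum() - tp   # sign k predicted as something else
    tn = cm.sum() - tp - fp - fn
    return int(tp), int(fp), int(fn), int(tn)


# Toy 3-sign matrix (illustrative, not the study's data); 3 clips per sign.
toy = np.array([[3, 0, 0],
                [0, 2, 1],
                [0, 1, 2]])
print(accuracy_from_confusion(toy))   # 0.777... (7/9)
print(per_class_counts(toy, 1))       # (2, 1, 1, 5)
print(round(68 / 78 * 100, 2))        # 87.18
```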

4. Discussion

Extended and distinct gesture sequences, such as the signs for Card and Coin, were consistently recognized by the prototype, highlighting its strength in identifying macro-level temporal distinctions in FSL. In contrast, the ten misclassified trials primarily involved numerical signs (Eight, Seven, Three, and Ten), with single instances of misclassification for Wait and Yes. These errors are attributed to the limited distinctiveness of landmark placement and temporal sequencing in numerical gestures compared to longer, more dynamic signs. This challenge is particularly evident in signs like Eight and Seven, which share nearly identical sequences and placements, differing only by one finger.

5. Conclusions

We developed an FSL recognition system that translates dynamic gestures from recorded clips into text and recognizes 26 transactional signs. The system's applicability was validated in public and commercial settings, and the model achieved an overall recognition accuracy of 87.18%. Previous studies focused on basic signs or hand gestures under varied backgrounds [20], and our work extends recognition to transactional contexts. Other recent efforts in facial and gesture recognition have achieved high precision using Keras-based models [21] and real-time inference pipelines [22]. The system classified extended gestures such as Card and Coin consistently and precisely, which underscores its potential for capturing macro-level temporal distinctions in FSL. However, numerical signs were misclassified, particularly Seven and Eight. This suggests that MediaPipe landmark extraction combined with GRU processing struggles to capture fine-grained differences in finger placement. Despite this limitation, this study establishes a foundational set of transactional signs for FSL recognition. These signs address real-world commercial interactions, in contrast to prior studies that concentrated mainly on basic signs.
Recognition of similar gestures could be improved by enhancing feature extraction, applying targeted data augmentation, and expanding the dataset with additional video samples. A diverse collection of the misclassified gestures is essential for training the model on subtle landmark differences. It would also make the model more robust in distinguishing closely related signs.
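The sketch below shows one possible form of the targeted augmentation suggested above. It is an assumption and not the authors' method. Each detected hand is randomly scaled about its centroid, shifted, and jittered, while zero-padded (undetected) hands are left at zero. The function name `augment_sequence`, the left-then-right layout, and all parameter values are hypothetical.

```python
import numpy as np


def augment_sequence(seq, rng, scale_range=(0.9, 1.1), shift_std=0.02, jitter_std=0.005):
    """Randomly scale, shift, and jitter hand landmarks in a (frames, 126) array.

    Layout assumption: per frame, left hand then right hand, 21 landmarks x (x, y, z).
    Hands that were zero-padded (not detected) stay at zero.
    """
    seq = np.asarray(seq, dtype=np.float32)
    frames = seq.shape[0]
    coords = seq.reshape(frames, 2, 21, 3).copy()       # (frame, hand, landmark, xyz)
    detected = np.any(coords != 0, axis=(2, 3))          # (frame, hand) mask

    scale = rng.uniform(*scale_range)                    # one scale for the whole clip
    shift = rng.normal(0.0, shift_std, size=3)
    shift[2] = 0.0                                       # keep relative depth unshifted
    centroid = coords.mean(axis=2, keepdims=True)        # per-hand centre in each frame
    jitter = rng.normal(0.0, jitter_std, size=coords.shape)

    augmented = (coords - centroid) * scale + centroid + shift + jitter
    coords[detected] = augmented[detected]               # leave padded hands untouched
    return coords.reshape(frames, -1)


# Demo: left hand present at a fixed point in every frame, right hand missing.
rng = np.random.default_rng(0)
demo = np.zeros((75, 126), dtype=np.float32)
demo[:, :63] = 0.5
demo_aug = augment_sequence(demo, rng)
print(demo_aug.shape)  # (75, 126)
```

The sketch leaves out horizontal mirroring on purpose, since swapping the left and right hands may change the meaning of a sign.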

Author Contributions

Conceptualization, A.C., F.R.C. and J.V.; methodology, A.C. and F.R.C.; software, A.C. and F.R.C.; validation, A.C. and F.R.C.; formal analysis, A.C. and F.R.C.; investigation, A.C. and F.R.C.; resources, A.C. and F.R.C.; data curation, A.C. and F.R.C.; writing—original draft preparation, A.C., F.R.C. and J.V.; writing—review and editing, A.C., F.R.C. and J.V.; visualization, A.C. and F.R.C.; supervision, J.V.; project administration, A.C., F.R.C. and J.V.; funding acquisition, A.C. and F.R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. All participants signed a consent form that explicitly describes their responsibilities as participants and how their data are used in the study.

Data Availability Statement

The dataset is available on request from the authors.

Acknowledgments

We wish to convey our most profound gratitude to our families, whose unwavering love, remarkable patience, and steadfast support served as a continual source of strength throughout this endeavor. We are equally indebted to our friends, whose shared knowledge and incisive insights proved instrumental in navigating the challenges of this study. This research benefited immensely from the exceptional mentorship of our adviser, Jocelyn F. Villaverde. We sincerely thank her for her invaluable guidance, deeply constructive feedback, and sustained encouragement, which were crucial in shaping the final form of this work. We are further obligated to John Paul T. Cruz, Noel B. Linsangan, and Analyn N. Yumang for their discerning comments and expert professional critique, which significantly elevated the quality and rigor of this investigation. A special and sincere acknowledgment is extended to all the participants who generously gave their time and effort, which was fundamental to creating the transactional signs dataset. Their direct contributions form the bedrock upon which our findings rest. Finally, we recognize the School of Electrical, Electronics, and Computer Engineering of Mapúa University for providing the infrastructure, essential resources, and academic environment that made this research possible.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SLR: Sign Language Recognition
GRU: Gated Recurrent Units
LSTM: Long Short-Term Memory
YOLO: You Only Look Once
CNN: Convolutional Neural Networks

References

1. Filipino Sign Language Act, Rep. Act No. 11106, 2018. Available online: https://ncda.gov.ph/disability-laws/republic-acts/ra-11106/ (accessed on 30 October 2025).
2. Sintos, M.L. Psychological Distress of Filipino Deaf: Role of Environmental Vulnerabilities, Self-Efficacy, and Perceived Functional Social Support. Asia-Pac. Soc. Sci. Rev. 2020, 20, 42–55.
3. Verdadero, M.S.; Dela Cruz, J.C. An Assistive Hand Glove for Hearing and Speech Impaired Persons. In Proceedings of the 2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Laoag, Philippines, 29 November–1 December 2019.
4. Linsangan, N.B.; Calites, J.V.G.; Reyes, J.T.L.; Sioson, G.C.D.; Pellegrino, R.V.; Juanatas, L.C. Filipino Sign Language to Text Converter using K-Nearest Neighbor Algorithm. In Proceedings of the 2022 IEEE 14th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Boracay, Philippines, 28–30 November 2022.
5. Villagomez, E.B.; King, R.A.; Lazaro, J.; Villaverde, J.F. Hand Gesture Recognition for Deaf-Mute using Fuzzy-Neural Network. In Proceedings of the 2019 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Bangkok, Thailand, 12–14 June 2019.
6. Rosero-Montalvo, P.D.; Godoy-Trujillo, P.; Flores-Bosmediano, E.; Carrascal-Garcia, J.; Otero-Potosi, S.; Benitez-Pereira, H. Sign Language Recognition Based on Intelligent Glove Using Machine Learning Techniques. In Proceedings of the 2018 IEEE Third Ecuador Technical Chapters Meeting (ETCM), Cuenca, Ecuador, 15–19 October 2018.
7. Singh, J.; Singh, D. A Comprehensive Review on Sign Language Recognition Using Machine Learning. In Proceedings of the 2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 13–14 October 2022; pp. 1–6.
8. Sharma, S.; Singh, S. Vision-based sign language recognition system: A Comprehensive Review. In Proceedings of the 2020 International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–28 February 2020; pp. 140–144.
9. Madrid, G.K.R.; Villanueva, R.G.R.; Caya, M.V.C. Recognition of Dynamic Filipino Sign Language using MediaPipe and Long Short-Term Memory. In Proceedings of the 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 3–5 October 2022.
10. Balbin, J.R.; Padilla, D.A.; Caluyo, F.S.; Fausto, J.C.; Hortinela, C.C., IV; Bernardino, C.K.S.; Fiñones, E.G. Sign language word translator using Neural Networks for the Aurally Impaired as a tool for communication. In Proceedings of the 2016 6th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 25–27 November 2016; pp. 425–429.
11. Tupal, I.; Cabatuan, M.; Manguerra, M. Recognizing Filipino Sign Language with InceptionV3, LSTM, and GRU. In Proceedings of the 2022 IEEE 14th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Boracay, Philippines, 28–30 November 2022; pp. 1–5.
12. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In Proceedings of the NIPS 2014 Deep Learning and Representation Learning Workshop, Montreal, QC, Canada, 12 December 2014.
13. Ebrahimi, Z.; Loni, M.; Daneshtalab, M.; Gharehbaghi, A. A review on deep learning methods for ECG arrhythmia classification. Expert Syst. Appl. X 2020, 7, 100033.
14. Grishchenko, I.; Bazarevsky, V. MediaPipe Holistic: Simultaneous Face, Hand and Pose Prediction, on Device. Available online: https://research.google/blog/mediapipe-holistic-simultaneous-face-hand-and-pose-prediction-on-device/ (accessed on 25 June 2025).
15. Oropesa, A.R.M.; Felicen, G.L.R.; de Guzman, J.A. SENYAS: A Filipino Sign Language Recognition System Using MediaPipe and CNN-LSTM. Bachelor’s Thesis, University of the Philippines, Quezon City, Philippines, 2024.
16. Calimag, A.D.R.; Padilla, D.A.; Manlises, C.O. Checkout System with Object Detection using NVIDIA Jetson Nano and Raspberry Pi. In Proceedings of the 2023 IEEE 5th Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 27–29 October 2023.
17. Landrito, M.L.C.; Abanilla, A.J.E.; Linsangan, N.B. Gesture-based Television Controller Using PIR Sensor Array. In Proceedings of the 2024 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), Shah Alam, Malaysia, 29 June 2024.
18. Abadiano, A.B.; Sangilan, F.O.; Cruz, J.P.T. Lip-Reading for Philippine Regional Language Classification of Cebuano and Ilocano. In Proceedings of the 2025 17th International Conference on Computer and Automation Engineering (ICCAE), Brisbane, Australia, 21–23 April 2025.
19. Villespin, J.A.A.V.; Magana, M.J.U.; Manlises, C.O. Translation of Air-Written Baybayin Using Optical Flow in Complex Background. In Proceedings of the TENCON 2024 IEEE Region 10 Conference, Singapore, 1–4 December 2024.
20. Ang, M.C.; Richmond, K.; Taguibao, C.; Manlises, C.O. Hand Gesture Recognition for Filipino Sign Language Under Different Backgrounds. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Malaysia, 13–15 September 2022.
21. Logronio, A.D.; Reyes, R.C.; Linsangan, N.B. Age Range Classification Through Facial Recognition Using Keras Model. In Proceedings of the 2023 15th International Conference on Computer and Automation Engineering (ICCAE), Sydney, Australia, 3–5 March 2023.
22. Debnath, J.; Joe, P. Real-Time Gesture Based Sign Language Recognition System. In Proceedings of the International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Gwalior, India, 18–19 April 2024; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/10533518 (accessed on 10 June 2025).
Figure 1. The input–process–output model of the system developed in this study.
Figure 2. Hardware components.
Figure 3. Software process of the system.
Figure 4. MediaPipe landmarking workflow.
Figure 5. GRU identification process.
Figure 6. (a) Graphical user interface (GUI) displaying the real-time sign recognition output; (b) the deployed system prototype in operation.
Figure 7. Confusion matrix.
Table 1. The system output.
Trial No. | Actual Output | Predicted Output | Result
--- | --- | --- | ---
1 | Again | Again | Classified
2 | Again | Again | Classified
3 | Again | Again | Classified
4 | Card | Card | Classified
5 | Card | Card | Classified
… | … | … | …
78 | Ten | Yes | Misclassified
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cardano, A.; Columna, F.R.; Villaverde, J. Development of Transactional Filipino Sign Language Recognition System Using MediaPipe and Gated Recurrent Units. Eng. Proc. 2026, 134, 47. https://doi.org/10.3390/engproc2026134047

AMA Style

Cardano A, Columna FR, Villaverde J. Development of Transactional Filipino Sign Language Recognition System Using MediaPipe and Gated Recurrent Units. Engineering Proceedings. 2026; 134(1):47. https://doi.org/10.3390/engproc2026134047

Chicago/Turabian Style

Cardano, Angela, Franz Railey Columna, and Jocelyn Villaverde. 2026. "Development of Transactional Filipino Sign Language Recognition System Using MediaPipe and Gated Recurrent Units" Engineering Proceedings 134, no. 1: 47. https://doi.org/10.3390/engproc2026134047

APA Style

Cardano, A., Columna, F. R., & Villaverde, J. (2026). Development of Transactional Filipino Sign Language Recognition System Using MediaPipe and Gated Recurrent Units. Engineering Proceedings, 134(1), 47. https://doi.org/10.3390/engproc2026134047
