Towards a Bidirectional Mexican Sign Language–Spanish Translation System: A Deep Learning Approach

Abstract: People with hearing disabilities often face communication barriers when interacting with hearing individuals. To address this issue, this paper proposes a bidirectional Sign Language Translation System that aims to bridge the communication gap. Deep learning models such as recurrent neural networks (RNN), bidirectional RNN (BRNN), LSTM, GRU, and Transformers are compared to find the most accurate model for sign language recognition and translation. Keypoint detection using MediaPipe is employed to track and understand sign language gestures. The system features a user-friendly graphical interface with modes for translating between Mexican Sign Language (MSL) and Spanish in both directions. Users can input signs or text and obtain corresponding translations. Performance evaluation demonstrates high accuracy, with the BRNN model achieving 98.8% accuracy. The research emphasizes the importance of hand features in sign language recognition. Future developments could focus on enhancing accessibility and expanding the system to support other sign languages. This Sign Language Translation System offers a promising solution to improve communication accessibility and foster inclusivity for individuals with hearing disabilities.


Introduction
Deaf communities globally encounter significant challenges in accessing vital services like education, healthcare, and employment due to language barriers, rather than auditory limitations [1]. Their primary language is often a signed language, such as American, French, German, or Greek Sign Language, each a unique and complete language, distinct from spoken languages and each other. These languages, with over two hundred identified varieties, possess the same depth and expressive power as spoken languages [2,3]. However, for the Deaf, any spoken language is secondary, leading to low literacy rates; for instance, in the U.S., deaf high school graduates have an average reading level of third to fourth grade [4]. This language gap not only hinders everyday interactions with the hearing, non-signing population but also affects access to critical services. While certified sign language interpreters are the best solution for essential services, their scarcity and cost render them impractical for everyday, brief interactions. Thus, the development of effective automatic translation systems for spoken and signed languages could significantly improve communication and inclusivity for the Deaf community.
To overcome this barrier, technological solutions have been developed. Translator gloves [5][6][7], mobile applications, and automatic translators are the leading technologies that have been used for unidirectional or bidirectional communication.
Our research aims to develop a bidirectional sign language translator that can translate from Spanish to Mexican Sign Language (MSL) and vice versa, bridging the gap between these two languages. The system involves two operation modes: from Mexican Sign Language to Spanish (MSL-SPA) and from Spanish to Mexican Sign Language (SPA-MSL). In the MSL-SPA mode, the system captures live video and processes it to recognize the sign and translate it to Spanish, as shown in the upper path in Figure 1. Conversely, in the SPA-MSL mode, the user types the phrase and the system displays a sign language animation, as shown in the lower path in Figure 1.
Our system is based on deep learning techniques, which have shown great success in various computer vision and natural language processing tasks. Specifically, we used MediaPipe for keypoint detection, which is an advanced, real-time framework that utilizes machine learning to detect and track keypoints on objects, faces, hands, or poses in images and videos. We also used recurrent networks such as RNN, BRNN, LSTM, and GRU, as well as an encoder-only transformer for the translation process, which we treated as a time-series classification problem.
One of the main challenges in developing a bidirectional sign language translator is the variability and complexity of sign language gestures, as well as the need to capture the nuances and context of the conversation. Another challenge is the lack of large and diverse sign language datasets, which are crucial for training accurate models. To address these challenges, we collected a new dataset consisting of gestures from MSL, which we used to train and evaluate our system.
The proposed bidirectional sign language translator has the potential to significantly improve the communication and integration of the deaf community into society by allowing them to communicate more effectively with hearing people. Moreover, it can facilitate the learning of sign language for hearing people and promote a more inclusive and diverse society.
The paper is organized as follows: Section 2 summarizes the relevant literature, Section 3 outlines the methodology we employed in our project, Section 4 presents the results we obtained, and Section 5 presents our concluding thoughts.

Related Work
The landscape of sign language translation and recognition research is rich and varied, marked by a series of interconnected advancements that build upon each other. This section weaves through these developments, highlighting how each contribution sets the stage for the next.
Starting with Bungeroth & Ney [8], we see the foundations being laid with a German Sign Language (DGS) translation system. This innovative approach, integrating audio feedback and animated representation, utilizes IBM Model 1-4 and Hidden Markov Models (HMM) for training. The challenge they faced due to limited training samples echoes the necessity for a robust corpus, as further exemplified by the notation method of [9].
Building on the concept of practical translation, San-Segundo et al. [10] introduced a real-time method for Spanish-to-sign language translation. Their dual approach, blending rule-based and statistical methods, demonstrated adaptability and precision, particularly in contexts with limited vocabulary.
Pichardo-Lagunas et al. [11] continued this trajectory, focusing on Mexican Sign Language (MSL). They brought a meticulous, analytical lens to Spanish text, using Freeling to classify words for accurate translation. This method, though currently limited to one-way translation, reflects the evolving complexity of sign language translation systems.
Segueing to pose detection and classification, Qiao et al. [12] utilized the OpenPose model, demonstrating a significant leap in motion analysis without the dependency on specialized hardware. This development represents a shift towards more accessible and cost-effective solutions in the field.
Barrera-Melchor et al. [13] then added a new dimension by applying these technologies to educational content translation into MSL. Their cloud-based system, which translates speech to text and then to MSL using a 3D avatar, exemplifies the integration of cloud computing in sign language translation.
In a similar vein, focusing on a specific application area, Sosa-Jimenez et al. [14] developed a research prototype tailored for primary care health services in Mexican Sign Language. Their use of Microsoft Kinect sensors [15] and HMMs highlights the trend of specialized systems addressing distinct contexts like healthcare.
Parallel to these developments, Martínez-Gutiérrez et al. [16] and Martinez-Seis et al. [17] focused on MSL alphabet recognition through advanced computational methods, each achieving notable accuracy in their respective areas.
Carmona et al. [18] introduced a system for recognizing the static alphabet in Mexican Sign Language using Leap Motion and MS Kinect 1 sensors. Their unique application of 3D affine moment invariants for sign recognition demonstrated a significant improvement in accuracy, showcasing the potential of 3D modeling in sign language recognition. Naranjo et al. [19] attempt to expand the field, developing a graphical tool to aid in learning Costa Rican Sign Language (LESCO). Their methodology, utilizing phonological parameters and a similarity formula, provides a bridge for learners to grasp the nuances of sign languages, emphasizing the role of educational tools in sign language dissemination. Complementing these efforts, Trujillo et al. [20] presented a translation system from Mexican Sign Language to spoken language, employing 3D hand movement trajectories. Their approach to refining movement patterns and using advanced algorithms like KNN highlights the continuous push for higher precision and efficiency in translation systems.
In a similar spirit of refinement, Jimenez et al. [21] and Cervantes et al. [22] each contributed distinct methodologies for sign language recognition, whether through 3D affine invariants or sophisticated video analysis. These studies underscore the diverse technological avenues being explored to enhance sign language translation and recognition accuracy.
With the advent of the Transformer as a powerful deep learning model for translation, it has been used to improve the accuracy of sign language translation by effectively extracting joint visual-text features and capturing contextual information [23]. One approach is to design an efficient transformer-based deep network architecture that exploits multilevel spatial and temporal contextual information, such as the proposed heterogeneous attention-based transformer (HAT) model [24]. Another approach is to address the local temporal relations and non-local and global context modeling in sign videos, using techniques like the multi-stride position encoding scheme and the adaptive temporal interaction module [25]. Additionally, transfer learning with pretrained language models, such as BERT, can be used to initialize sign language translation models and improve performance [26]. Furthermore, incorporating content-aware and position-aware convolution layers, as well as injecting relative position information into the attention mechanism, can enhance sign language understanding and improve translation quality [27].
Using avatars to translate sign language presents both challenges and benefits. One of the main challenges is the complexity of sign languages, which requires a deep understanding of their grammatical and inflecting mechanisms [28]. Additionally, the lack of direct participation from the deaf community and the underestimation of sign language complexity have resulted in structural issues with signing avatar technologies [29]. However, the benefits of using avatars for sign language translation include increased accessibility for the deaf community and the potential for automation and efficiency in translating spoken or written language to sign language [30][31][32]. Avatars can also be used as educational tools and have the potential to improve the naturalness and believability of sign language motion, either from text to animation [33], from animation to text [23,34], or both ways, as proposed in this work.
In summary, this collective body of work forms a tapestry of innovation, each research piece contributing to a greater understanding and capability in the field of sign language translation and recognition. Our research aims to add to this body of work by developing a bidirectional translator between Spanish and Mexican Sign Language (MSL). Leveraging techniques like MediaPipe and deep learning models, our goal is to bridge the communication gap for the deaf community. Section 3 details our approach, situating it within this evolving research landscape.

Methods
This section describes the methodology pursued to develop the bidirectional translation system. The development stages were the following:
1. Hardware selection
2. Feature selection
3. Data collection
4. Model definition
5. Graphical user interface

Hardware Selection
To select the computing board, we compared the Raspberry Pi 4 Model B [35], the UP Squared [36], and the Nvidia Jetson Nano [37], running a benchmark to evaluate the inference speed of each.
For this benchmark, we ran MediaPipe's Holistic model on each board. This model detects keypoints on the body, hands, and face, making it computationally expensive. The Raspberry Pi 4 ran at four frames per second, the UP Squared at six frames per second, and the Jetson Nano, the most powerful of the three due to its GPU, at 13 frames per second.
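A frames-per-second benchmark of this kind can be sketched as follows. This is only an illustration of the measurement procedure, not the script used in this work: the function name `measure_fps`, the warm-up count, and the stand-in workload `dummy_inference` are assumptions; on a real board the callable would run the keypoint detector on one captured frame.

```python
import time
from typing import Callable


def measure_fps(process_frame: Callable[[], None],
                n_frames: int = 100,
                warmup: int = 10) -> float:
    """Return the average frames per second achieved by `process_frame`."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        process_frame()
    start = time.perf_counter()
    for _ in range(n_frames):
        process_frame()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed


def dummy_inference() -> None:
    # Hypothetical stand-in workload; replace with one detector call per frame.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"{measure_fps(dummy_inference):.1f} FPS")
```

Averaging over many frames after a warm-up period reduces the influence of one-off start-up costs on the reported rate.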

Feature Selection
The model's inference depends on the keypoints it receives as input, so it is necessary to identify the keypoints that are statistically significant and contribute to the inference process, balancing the number of features to process against the computational cost.
To optimize the Jetson Nano's resources for keypoint coordinate detection, we reduced the number of features. This reduction freed up computational capacity for other tasks in our translation system. We conducted performance tests using MediaPipe's pose detection, hand detection, and holistic pipelines. The holistic pipeline was the most resource-intensive, leading us to combine the pose and hand detection pipelines for greater efficiency. This combination created a lighter version than the holistic model by eliminating the dense facial keypoint mesh computation. Figure 2 shows the full face mesh containing 468 keypoints, with the eleven keypoints we ended up using shown in blue. We chose this approach because the facial mesh keypoints added little value to our model's inference, particularly since the signs we needed to identify mainly involved arm movements and finger positions.
For the body, we reduced the keypoints to five: four from the original BlazePose model and one midpoint between the shoulder keypoints for chest detection. This selection was made because the movements in the signs occur above the waist, making leg keypoints irrelevant for our model. For the hands, we kept all 21 keypoints per hand because hand and finger positions are crucial for distinguishing between signs. Figure 3 displays the final topology of our translation system, comprising 58 keypoints. We calculated the X, Y, and Z coordinates for each, resulting in 174 features for processing.
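The keypoint arithmetic above (11 face + 5 body + 2 × 21 hand keypoints = 58, times three coordinates = 174 features) can be illustrated with a minimal sketch. The function names, the concatenation order, and the assumption that the first two body keypoints are the shoulders are hypothetical choices made for this example; the text does not specify them.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (X, Y, Z) of one keypoint

N_FACE, N_BODY_RAW, N_HAND = 11, 4, 21          # counts stated in the text
N_KEYPOINTS = N_FACE + (N_BODY_RAW + 1) + 2 * N_HAND  # 58, incl. chest midpoint


def chest_midpoint(left_shoulder: Point, right_shoulder: Point) -> Point:
    """Midpoint between the two shoulder keypoints."""
    return (
        (left_shoulder[0] + right_shoulder[0]) / 2.0,
        (left_shoulder[1] + right_shoulder[1]) / 2.0,
        (left_shoulder[2] + right_shoulder[2]) / 2.0,
    )


def build_feature_vector(face: Sequence[Point],
                         body: Sequence[Point],
                         left_hand: Sequence[Point],
                         right_hand: Sequence[Point],
                         shoulder_idx: Tuple[int, int] = (0, 1)) -> List[float]:
    """Flatten one frame of keypoints into a 174-value feature vector.

    Assumption: `body` holds the four retained pose keypoints and
    `shoulder_idx` gives the positions of the two shoulders within it.
    """
    if (len(face) != N_FACE or len(body) != N_BODY_RAW
            or len(left_hand) != N_HAND or len(right_hand) != N_HAND):
        raise ValueError("unexpected number of keypoints")
    chest = chest_midpoint(body[shoulder_idx[0]], body[shoulder_idx[1]])
    keypoints = list(face) + list(body) + [chest] + list(left_hand) + list(right_hand)
    # Concatenation order is an arbitrary choice for this sketch.
    return [coord for point in keypoints for coord in point]
```

A sequence of such vectors, one per frame, would then form the time series passed to the classifier.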
By reducing the keypoints, we optimized the model's input layer, thereby decreasing its computational demands.This optimization made both the training and inference processes more efficient and reduced the data volume needed for training, validation, and testing splits of the model.

Data Collection
To keep the system manageable, we chose a subset of ten signs, specifically phrases applicable in a school setting. The selected phrases are: "Hello", "Are there any questions?", "Help me", "Good morning", "Good afternoon", "Good night", "What is the homework?", "Is this correct?", "The class is over", and "Can you repeat it?". Approximately 1000 samples of each sign were collected from six individuals, comprising an equal gender split of three women and three men, with their ages ranging from 22 to 55.
For sample collection of these phrases, we developed a Python script that uses MediaPipe's keypoint detector to gather samples of each sign. We collected each sign from a distance of 2 m, with each sample containing 15 frames with detections. We found that, in practice, all signs fit within this time period.
For each keypoint, we calculated the X, Y, and Z coordinates.
To compute the Z coordinate, we used the depth provided by the OAK-D camera [38]. The depth camera is composed of a stereo pair of OMNIVISION OV9282 1 MP grayscale image sensors [39]. The depth accuracy varies with the distance to the object being measured and is higher at closer ranges. From 0.7 m to 4 m, the camera maintains an absolute depth error below 1.5 cm [40], which is sufficient for our application.
We used perspective projection to determine the distance relative to the camera for each keypoint of interest, as shown in Equation (1).
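A common way to apply perspective projection with a depth camera is the pinhole model, in which a pixel (u, v) with measured depth Z maps to camera-relative coordinates X = (u − c_x) Z / f_x and Y = (v − c_y) Z / f_y. The sketch below assumes this standard model and may differ from the exact formulation used in this work; the name `back_project` and the intrinsic values in the usage example are hypothetical and would in practice come from the camera calibration.

```python
from typing import NamedTuple, Tuple


class Intrinsics(NamedTuple):
    """Pinhole camera intrinsics in pixels (values come from calibration)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def back_project(u: float, v: float, depth_m: float,
                 k: Intrinsics) -> Tuple[float, float, float]:
    """Map a pixel (u, v) with depth in meters to (X, Y, Z) in the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


if __name__ == "__main__":
    # Hypothetical intrinsics for illustration only.
    k = Intrinsics(fx=800.0, fy=800.0, cx=640.0, cy=360.0)
    print(back_project(700.0, 300.0, 2.0, k))
```

With this model, a keypoint at the principal point lies on the optical axis, so its X and Y are zero regardless of depth.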
To increase the variability of the samples, signs were collected from six different individuals, aiming to reduce sample bias.For each of the ten signs, we gathered approximately 900 samples on average, resulting in a total of around 9300 samples.

Model Definition
Given the specific challenges of our project, we chose to implement a Recurrent Neural Network (RNN) model within our translation system. RNNs are particularly effective for tasks like natural language processing, video analysis, and machine translation, mainly because of their ability to maintain a form of memory. This memory helps in understanding sequences, as it can track changes over time.
To find the most suitable RNN model for classifying signs in Mexican Sign Language (MSL), we evaluated various RNN architectures.The models we considered included the following:

• Standard RNN [41]: Ideal for handling sequences and time-series data.
• Long Short-Term Memory (LSTM) [42]: Similar to GRU but with a different gating mechanism, often used for more complex sequence data.
• Bidirectional RNN (BRNN) [43] and Bidirectional LSTM [44]: Enhances the standard recurrent networks by processing data in both forward and backward directions, offering a more comprehensive understanding of the sequence context.
• Gated Recurrent Unit (GRU) [45]: A more efficient version of the standard RNN, known for better performance on certain datasets.
• Transformer [46]: A newer model that has gained popularity in various sequence modeling tasks, known for handling long-range dependencies well.
• Model Ensemble: An ensemble averaging of all the previous models.
Each of these models was trained and evaluated for its effectiveness in classifying MSL signs. The designed architectures for the RNN and BRNN used in our tests are depicted in Figure 4. The LSTM and Bidirectional LSTM architectures are shown in Figure 5. The GRU architecture is shown in Figure 6.
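Ensemble averaging can be illustrated with a short sketch. It assumes that each model outputs a probability vector over the sign classes and that the vectors are combined with equal weights; whether this work averages probabilities or another quantity, and with which weights, is not stated, so the functions below are illustrative only.

```python
from typing import List, Sequence


def ensemble_average(probs_per_model: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted mean of the class-probability vectors of several models."""
    if not probs_per_model:
        raise ValueError("at least one model output is required")
    n_models = len(probs_per_model)
    n_classes = len(probs_per_model[0])
    if any(len(p) != n_classes for p in probs_per_model):
        raise ValueError("all models must predict the same number of classes")
    return [sum(p[c] for p in probs_per_model) / n_models
            for c in range(n_classes)]


def ensemble_predict(probs_per_model: Sequence[Sequence[float]]) -> int:
    """Index of the class with the highest averaged probability."""
    avg = ensemble_average(probs_per_model)
    return max(range(len(avg)), key=avg.__getitem__)
```

Averaging lets models that disagree on a sample outvote a single confident but wrong model, which is one reason ensembles are often more robust than their members.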

Graphical User Interface Design
To improve user interaction with our translation system, we developed a graphical user interface (GUI). This GUI is aimed at enhancing the usability and accessibility of the system. A User Interface (UI) is essentially the point of interaction between the user and the system, enabling the user to input commands and data and to access the system's content. UIs are integral to a wide variety of systems, including computers, mobile devices, and games.
Beyond the UI, we also focused on User Experience (UX). UX is about the overall experience of the user, encompassing their emotions, thoughts, reactions, and behavior during both direct and indirect engagement with the system, product, or service. This aspect of design is critical because it shapes how users perceive and interact with the system.
The outcomes of our efforts to develop a compelling UI and UX for the translation system are detailed in Section 4.2.

Results
We compared the performance of the models described in Section 3.4 using our collected data for training and testing. We trained each model using early stopping with a patience of five epochs. Table 1 shows the epochs and accuracy per model.
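Early stopping with a patience of five epochs can be sketched as below. The monitored quantity (validation loss) and the `min_delta` parameter are assumptions for this example; the text specifies only the patience value.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta  # assumed: minimum decrease that counts
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=5)
    # Hypothetical validation losses, one per epoch.
    for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99]):
        if stopper.step(loss):
            print(f"stopping after epoch index {epoch}")
            break
```

Because each model stops when it ceases to improve, the number of training epochs differs across models.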
To simulate a real environment, we tested our models under different perturbations:
1. Drop keypoints: in this test, we randomly remove keypoints to simulate real-life situations where the keypoints are incomplete.
2. Noise: in this test, we add Gaussian noise to the keypoints to simulate variations in keypoint detection.
3. Drop keypoints + noise: in this test, we apply both of the perturbations described.
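The two perturbations, dropping keypoints and adding Gaussian noise, can be sketched on a flattened frame of keypoint coordinates as follows. Representing a removed keypoint by zeros, as well as the drop probability and noise standard deviation used in the example, are assumptions; the text does not give these details.

```python
import random
from typing import List, Sequence


def drop_keypoints(frame: Sequence[float], drop_prob: float,
                   rng: random.Random, n_coords: int = 3,
                   fill: float = 0.0) -> List[float]:
    """Randomly replace whole keypoints (X, Y, Z triplets) with `fill`."""
    if len(frame) % n_coords != 0:
        raise ValueError("frame length must be a multiple of n_coords")
    out = list(frame)
    for start in range(0, len(out), n_coords):
        if rng.random() < drop_prob:
            out[start:start + n_coords] = [fill] * n_coords
    return out


def add_gaussian_noise(frame: Sequence[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each value."""
    return [x + rng.gauss(0.0, sigma) for x in frame]


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [0.1] * 174  # one frame: 58 keypoints x 3 coordinates
    # Hypothetical perturbation strengths for illustration.
    perturbed = add_gaussian_noise(drop_keypoints(frame, 0.2, rng), 0.01, rng)
    print(len(perturbed))
```

The combined test corresponds to applying both functions in sequence to each frame of the test samples.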
We used the Macro F1, the unweighted mean of the F1 scores calculated per class, to compare the models. Equations (2)-(5) show the formulas used to compute this metric. Table 2 shows the Macro F1 score for the different models under the different test conditions. The model with the best performance overall was the ensemble averaging, followed by the Bidirectional LSTM.
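For reference, the standard definition of this metric can be written out in code: per class, precision is TP / (TP + FP), recall is TP / (TP + FN), F1 is their harmonic mean (equivalently 2 TP / (2 TP + FP + FN)), and Macro F1 is the unweighted mean of the per-class F1 scores. The sketch below follows that definition; scoring a class with no true or predicted samples as zero is an assumed convention.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    f1_scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the class never occurs or is predicted.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes
```

Because every class contributes equally, a model that performs poorly on a single sign is penalized even if that sign has few test samples.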

Ablation Studies
Hand features are essential for sign language recognition, whereas body and face features are less so. In this section, we evaluate the accuracy of the models under different perturbations and with different combinations of features, in order to determine how the body and facial features contribute to overall sign language recognition. Tables 3-9 show the accuracy for each model with different combinations of features.
Focusing on the best model, Table 9 shows the performance of the ensemble averaging when applied to sign language recognition, with varying input features and under different conditions of data perturbation. The model's baseline accuracy is measured without any perturbations. When keypoints are dropped from the test data, simulating incomplete data, there is a noticeable decrease in accuracy across all input feature combinations, indicating that the model relies significantly on the complete set of keypoints to make accurate predictions. The addition of Gaussian noise, simulating variations in keypoint detection, also lowers the model's accuracy, but less dramatically than dropping keypoints when all the keypoints are used, suggesting that the model has some robustness to noise. However, when both perturbations are applied (dropping keypoints and adding noise), the accuracy declines substantially, underscoring that the integrity and quality of input data are critical for the model's performance. The most robust combination of input features against these perturbations is "Hands + Face + Body", which achieves the best results under the most significant data perturbation. Combining the three sources of keypoints therefore yields more robust sign language recognition.

User Interface
The system operates in two modes: translating from Mexican Sign Language to Spanish text (MSL-SPA) and from Spanish text to Mexican Sign Language (SPA-MSL). The user interface comprises three main views:
1. MSL-SPA Mode View: In this view, the system displays a live video feed from a user-facing camera. This feed shows the user's body keypoints before the system classifies the sign. This view is depicted in Figure 7.
2. SPA-MSL Mode View: This view is for the SPA-MSL mode. Here, the system displays the result of an MSL-SPA translation. The second user can then respond by typing on a keyboard. This typed text is used to generate a sign language animation for the other user. This view is illustrated in Figure 8.
3. Animation View: The third view presents the animation generated from the SPA-MSL mode. Users can see the sign language animation created from the text input. This view is shown in Figure 9.
The back-end system consists of multiple parallel processes (daemons) running on independent CPU threads. Figure 10 shows the general pipeline with the back-end daemons for processing tasks in blue and interpreter tasks in yellow and green. The MSL-SPA mode has two main functions: it receives the video feed from the camera and displays the live output as shown in Figure 7, along with making sign predictions. Conversely, the SPA-MSL mode shows sign language translations from the MSL-SPA mode and can also accept text input to generate animations of sign language, which are then displayed to the partner. These modes are carried out through three parallel processes: the frames, prediction, and feedback processors.
The frames processor takes input frames and utilizes the MediaPipe keypoint detector to obtain the X, Y, and Z coordinates relative to the camera. Because the input is live video, the frames processor uses a sliding window consisting of 15 frames (as depicted in Figure 11) to create a feature vector that is placed into the prediction queue. Additionally, the frames processor performs an initial calibration with the user to ensure they are positioned at least two meters away from the camera. This distance ensures that the camera captures the necessary area and maintains consistency between the live data and the training data.
The prediction processor can operate in two modes: MSL-SPA and SPA-MSL. In the MSL-SPA mode, it retrieves data from the frames queue produced by the frames processor and runs the prediction model to generate a list of results. The final prediction is based on the number of votes, and a notification is sent to the feedback processor. In contrast, in the SPA-MSL mode, it receives Spanish text and sends a notification to the feedback processor.
In the MSL-SPA mode, the feedback processor is responsible for displaying the text translation, while in the SPA-MSL mode, it displays an animation video. We have a database containing animation videos for each sign. The feedback processor searches for the corresponding video in the database and presents it to the user in the SPA-MSL mode, as shown in Figure 9.
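The sliding window and vote-based decision can be sketched as follows. The 15-frame window size comes from the text; the stride of one frame, the number of predictions pooled per vote, and the function names are assumptions made for this illustration.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator, List, Optional, Sequence

WINDOW_SIZE = 15  # frames per window, as stated in the text


def sliding_windows(frames: Iterable[Sequence[float]],
                    size: int = WINDOW_SIZE) -> Iterator[List[Sequence[float]]]:
    """Yield the most recent `size` frames each time a new frame arrives."""
    buffer: Deque[Sequence[float]] = deque(maxlen=size)
    for frame in frames:
        buffer.append(frame)  # the oldest frame is discarded automatically
        if len(buffer) == size:
            yield list(buffer)


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the most frequent label, or None if there are no predictions."""
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    stream = [[float(i)] for i in range(17)]  # toy stream of 17 one-value frames
    print(sum(1 for _ in sliding_windows(stream)))  # number of full windows
    print(majority_vote(["Hello", "Help me", "Hello"]))
```

Pooling the predictions of several overlapping windows before reporting a result reduces the effect of a single misclassified window on the displayed translation.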
We created the video animations using Blender3D with the free rig character Rain [47]. We chose to animate Rain (see Figure 9) to enhance the gestures for improved comprehension.

Prototype
The final prototype consists of a Jetson Nano board that has two seven-inch LCD touch displays. One of the screens displays the MSL-SPA mode, while the other shows the SPA-MSL mode. The MSL-SPA mode screen has an OAK-D camera attached to it for video capture. We used free-access 3D models from Thingiverse [48] to create the mounting structure of the board and screens and added the camera stand on top of one display. Finally, we 3D printed the structure. The final prototype is shown in Figure 12.

Conclusions
Our bidirectional Mexican Sign Language (MSL) translation system aims to bridge the communication gap between the deaf community and the hearing world. Utilizing machine learning, including recurrent neural networks, transformers, and keypoint detection, our system shows promise in enabling seamless communication and in integrating individuals with hearing disabilities into society and education more effectively. This innovation emphasizes the importance of inclusive technology and the role of artificial intelligence in surmounting language barriers. The project's key successes lie in its bidirectional translation system, characterized by efficiency and accuracy. The use of MediaPipe for keypoint detection, along with RNN, BRNN, LSTM, GRU, and Transformer architectures, facilitates accurate translation of signs into text and vice versa. The system's real-time functionality, adaptability to various sign language variations, and user-friendly interface make it a practical tool for everyday use.
However, the study faced challenges related to the variability and complexity of sign language gestures and the scarcity of diverse sign language datasets, which impacted the training accuracy of the models. The system's performance under different real-world conditions, such as varied lighting and backgrounds, also presents ongoing challenges. These issues highlight the necessity for further research, particularly in dataset development and in enhancing the system's adaptability. Future directions include expanding the system to more languages and sign language variants, refining algorithms for complex signs and nonmanual signals, and collaborating with the deaf community for feedback and improvements.

Figure 1. General pipeline of the bidirectional sign language translation system. In the MSL-SPA mode, the system recognizes the sign from live video and displays the text in Spanish on the other end. Conversely, in the SPA-MSL mode, the user types the phrase and the system displays a sign language animation on the other end.

Figure 2. MediaPipe Face Mesh includes 468 3D face landmarks. To reduce computation, we used the blue keypoints obtained from the pose model instead.

Figure 3. Final 58 keypoints used for the sign-recognition system.

Figure 4. RNN and BRNN architectures tested for sign language recognition.

Figure 5. LSTM and BLSTM architectures tested for sign language recognition.

Figure 6. GRU architecture tested for sign language recognition.

Figure 7. Mexican Sign Language to Spanish mode (MSL-SPA). The system shows the video from a front-facing camera with the body keypoints and the status Interpretando, indicating that the system is interpreting the sign.

Figure 8. Interface for Spanish to Mexican Sign Language (SPA-MSL) translation. The interface presents two input fields: 'Traducción LSM' for displaying the translation in Spanish and 'Respuesta ESP' for entering text to translate into MSL. A virtual keyboard is provided for the user to input text, which is then converted into MSL animations. The buttons 'Borrar' and 'Enviar' allow the user to delete the input or send it for translation, respectively.

Figure 9. SPA-MSL mode. The user can type text in Spanish and generate an animation showing the corresponding sign.

Figure 10. The back-end system consists of multiple parallel processes: frames, prediction, feedback processors, and the two interpreters.

Figure 11. The frames processor applies a sliding window technique to the video stream, creating a feature vector for analysis. Green frames indicate those that have already been processed, yellow frames are currently undergoing processing, and red frames are queued for future processing.

Figure 12. Final prototype. The image shows the two touchscreens back-to-back with the main processor in the middle.

Table 1. Number of epochs and accuracy per model during training.

Table 2. Comparison of Macro F1 scores across various models under diverse testing conditions, with the highest performing model in each scenario emphasized in bold.

Table 3. Accuracy of the RNN with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.

Table 4. Accuracy of the BRNN with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.

Table 5. Accuracy of the LSTM with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.

Table 6. Accuracy of the BLSTM with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.

Table 7. Accuracy of the GRU with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.

Table 8. Accuracy of the Transformer with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.

Table 9. Accuracy of the ensemble with various input feature combinations, with the highest accuracy in each testing condition emphasized in bold.