1. Introduction
Mexican Sign Language (MSL) is essential for communication, education, and social inclusion within the Deaf community in Mexico. However, its formal teaching remains limited and heterogeneous, partly due to the inherent complexity of coordinating hands, arms, head, torso, and non-manual markers (facial expressions) to convey precise information. This situation perpetuates communication barriers that hinder access to basic services and rights for people with hearing disabilities.
The use of virtual trainers for sign language instruction represents an innovative solution that helps reduce communication barriers and enhances the teaching–learning process. Several studies have explored this field using neural networks and/or immersive environments. Examples include tutoring and accessibility systems based on deep learning models [1,2,3,4]; virtual reality proposals employing 3D avatars and interactive scenarios [5,6,7]; and sensor-based solutions using devices such as Leap Motion for manual gesture classification [8,9]. In parallel, translation and representation systems have been developed for various sign languages through avatars capable of converting text or speech into animated sequences that reproduce the corresponding gestures [10,11].
Among the most recent works in the literature dedicated to the teaching and learning of sign languages, several studies stand out for their focus on integrating digital tools and interactive environments to improve the acquisition of visual–gestural language. These contributions range from didactic proposals mediated by online training platforms to immersive experiences in virtual and augmented reality environments, all aimed at strengthening comprehension, participation, and educational inclusion of both Deaf and hearing individuals.
The system proposed in [12] was developed to improve the learning and communication of Chinese Sign Language (CSL) through an interactive smartphone-based platform. Recognizing the limitations of traditional learning methods such as books and videos, this tool provides a more dynamic and engaging approach for both Deaf and hearing users. The system suggests the top three possible signs, helping users recall the correct gesture. The study concludes that this interactive approach significantly improves the efficiency of sign language learning and contributes to reducing communication barriers between Deaf and hearing individuals.
In the work of Lara et al. [13], a virtual robotic training platform was designed and implemented to teach Mexican Sign Language (MSL) to both Deaf and hearing users. The proposed system, named MSL–Virtual Robotic Trainer (MSL–VRT), was modeled after the human upper limb (hand, forearm, and wrist) and incorporates 25 degrees of freedom (DOF) to perform the fingerspelling of the 21 static one-handed signs of the MSL alphabet.
Gutiérrez Navarrete [14] proposed a project aimed at facilitating initial communication between Deaf and hearing individuals through a mobile learning application for Colombian Sign Language (LSC). The initiative sought to overcome communication barriers and promote interaction between both groups in everyday contexts. Considering the large population of people with hearing disabilities in Colombia, effective communication was identified as an urgent social need. The proposed application was designed to teach LSC using practical daily-life phrases, fostering empathy, inclusion, and awareness of the Deaf community. The results were promising, supported by non-probabilistic sampling methods. The authors emphasized the importance of continuing research and developing inclusive strategies to foster a more equitable and accessible society, noting that the prototype could be further refined to enhance usability and overall performance.
The proposal presented in [15] addresses the limited availability of sign language learning resources that provide immediate feedback to learners. This study evaluates the effectiveness of an online tool designed to offer real-time video feedback, accessible even on standard laptops. By integrating Google’s MediaPipe technology with a robust logical framework, the prototype aims to deliver accurate and instantaneous responses to support effective American Sign Language (ASL) instruction. Focusing on the ASL alphabet, the research analyzes computer vision-based learning tools for static and dynamic signs, assessing the potential and current limitations of the system in advancing sign language education.
The article in [16] introduces an innovative approach to building an intelligent avatar tutor for teaching Bangla Sign Language (BSL). The proposed system applies image processing and machine learning techniques to recognize and analyze hand gestures from a dedicated dataset, generating corresponding animations that allow the avatar to reproduce numbers and alphabet signs interactively. The study involved collecting and pre-processing gesture data to train the model, resulting in an intuitive and user-friendly interface that enables learners to physically practice signs, improve sensorimotor engagement, and receive immediate feedback through cosine similarity evaluation.
The study in [17] presents a web application based on machine learning designed to make sign language learning more accessible. The proposed platform differs from traditional approaches by assigning users various words to spell through hand signs. Learners must correctly sign each letter to complete the word and earn points, creating a gamified and engaging experience. The paper details the system’s development, main features, and underlying machine learning framework. Implemented with HTML, CSS, JavaScript and Flask, the web application uses the user’s webcam to capture live video and display the model’s predictions in real time, enabling interactive practice sessions.
The system proposed in [18] was designed, developed, and evaluated to support sign language learning and communication within Social Virtual Reality (SVR) environments. Building on insights from previous research, the system was implemented as a Virtual Reality Learning Environment (VRLE) that replicates the real-life processes of teaching and assessing sign language learners. The study’s findings indicate that, although current technological solutions have proven effective in transferring non-verbal cues (NVCs) and enhancing interaction, several challenges remain. Further research is required to improve the overall quality and effectiveness of sign language communication in virtual environments.
The study conducted by Erofeeva et al. [19] examined how virtual reality (VR) can foster inclusive and immersive learning through community-based sign language (SL) education within the VRChat platform. Using multimodal video analysis of two French SL classes taught by a Deaf instructor and involving learners with diverse sensory, linguistic, and technological profiles, the research explored how VR tools, such as air pens and spatial-directional emojis, facilitated communication and engagement. The findings suggest that these tools helped reduce technological and communicative asymmetries, although their effectiveness depended on participants’ interactional competencies.
Coy et al. [20] examined the opportunities and challenges of an approach designed to facilitate real-time communication between Deaf students and hearing teachers who do not use sign language, integrating speech and language technologies, computer vision, machine translation systems, and 3D avatars powered by artificial intelligence (AI).
Similarly, the initiative developed in [21] at the National Taiwan Library promotes digital and interactive strategies to enhance communication between sign language users and the hearing community, reflecting the growing emphasis on inclusivity for people with disabilities. The “Interactive System of Digital Sign Language Picture Books,” which combines augmented reality (AR) and artificial intelligence (AI), aims to create more inclusive and engaging learning experiences. The results revealed that ease of use significantly influenced perceived usefulness, attitude toward use, and intention to use, demonstrating that such digital tools can enhance learning engagement and foster more effective social communication.
With the evolution of computer vision and deep learning approaches [22], multimodal models have also gained relevance due to their ability to extract video embeddings and reason over temporal sequences, allowing new strategies for recognition and comparative analysis [23]. In this context, video embeddings can be combined with few-shot classifiers such as Matching Networks to assign a sample to its most probable class based on similarity metrics [24]. More recently, large-scale multimodal models have made it possible to obtain robust representations from frame sequences, facilitating the exploration of direct classification scenarios in sign language videos [25].
Our study presents two complementary approaches for the analysis of MSL sign videos, applied to a custom dataset of 335 recordings produced with professional interpreters: (i) an interactive training system that extracts hand and face keypoints using MediaPipe and employs Dynamic Time Warping (DTW) to compare the user’s performance against a reference; and (ii) an experimental direct video classification method that uses an advanced multimodal model to generate video embeddings and a Matching Network to assign each sample to its most probable class [23,24,25].
To construct the dataset, twelve MSL lessons were recorded with professional interpreters: one interpreter recorded the alphabet using the left hand, while another recorded the remaining eleven lessons using the right hand. In addition, a second practice dataset was created with a non-expert participant who studied the signs beforehand and recorded them individually. This complementary dataset allowed evaluation of system robustness under learner-like conditions and served as the baseline for DTW experiments.
The first approach, implemented in a lightweight and easily deployable application, focuses on learning through near-real-time quantitative feedback, enabling users to progressively improve their performance. The second approach was explored as a research-oriented alternative for automatic sign recognition without explicit keypoint preprocessing. Although not integrated into the application due to computational requirements, its results show potential for large-scale classification and video pre-labeling tasks.
The main contributions of this work are as follows: (1) an interpretable, low-cost training framework for learning Mexican Sign Language (MSL), which extracts hand and face keypoints from RGB videos and provides real-time quantitative feedback using Dynamic Time Warping (DTW); and (2) an experimental baseline that combines video embeddings with a Matching Network to explore scalability to larger vocabularies and semi-automatic dataset labeling.
The remainder of this article is organized as follows. Section 2 describes the materials and methods, including participant recruitment, digital resource development, and outcome measures. Section 3 presents the quantitative and qualitative results of the proposal. Section 4 discusses these findings in the context of the existing literature, highlighting their strengths, limitations, and implications for practice. Finally, Section 5 summarizes the main conclusions and suggests directions for future research.
2. Materials and Methods
This section describes the methodological framework used to design, implement, and evaluate the proposed system. We first present the construction of two parallel datasets (reference and practice) and the procedure for extracting 2D keypoints from video sequences. We then detail the normalization process, similarity metrics, and temporal alignment using Dynamic Time Warping (DTW), as well as robustness analysis under noise perturbations. Next, we introduce the graphical user interface that supports interactive practice, and the design of a user study conducted to assess usability and learning effectiveness.
2.1. Dataset Construction
We built two parallel datasets of videos under an identical protocol to ensure comparable processing (Figure 1). The reference dataset was recorded with professional interpreters and contains 335 videos across 12 lessons. The practice dataset was recorded by a non-expert participant who studied the same lessons and reproduced all signs. All sessions were captured with a conventional RGB camera. The videos were manually trimmed frame by frame to preserve only the relevant segment, and the audio was removed.
Figure 2 shows four representative signs in Mexican Sign Language (MSL). Each row corresponds to one sign, and its four images display consecutive frames (left to right) from start to finish.
2.2. Keypoint Extraction and Data Representation
We used MediaPipe [26] to identify keypoints on the hands and face for each video. The Hands module captures 21 keypoints per hand, while the Face Mesh module identifies six facial landmarks, namely the right and left eyes, nose tip, mouth, and right and left ears. Consequently, every frame consists of 48 keypoints, each of which has 2D coordinates $(x, y)$.
Each frame $t$ was vectorized as:
$$\mathbf{f}_t = \left[x_{1,t},\, y_{1,t},\, x_{2,t},\, y_{2,t},\, \ldots,\, x_{48,t},\, y_{48,t}\right] \in \mathbb{R}^{96}$$
Data were saved in CSV format, each file corresponding to a different sign and subject, comprising 96 columns; each row represented a frame.
Figure 3 shows the keypoint distribution.
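The representation described above can be sketched as follows. This is a minimal illustration only: the keypoint ordering (left hand, right hand, face), the column names, the zero fill for missing detections, and the helper names `csv_header`, `vectorize_frame`, and `write_sequence` are assumptions, not details taken from the original implementation.

```python
import csv
import io

N_HAND = 21                          # keypoints per hand (MediaPipe Hands)
N_FACE = 6                           # facial landmarks kept in this work
N_KEYPOINTS = 2 * N_HAND + N_FACE    # 48 keypoints per frame


def csv_header():
    """Return 96 column names x0, y0, ..., x47, y47 (hypothetical naming)."""
    header = []
    for k in range(N_KEYPOINTS):
        header.extend([f"x{k}", f"y{k}"])
    return header


def vectorize_frame(left_hand, right_hand, face):
    """Flatten lists of (x, y) pairs into one 96-value row.

    A missing detection (None) is filled with zeros; this fill policy
    is an assumption made for the sketch.
    """
    row = []
    groups = ((left_hand, N_HAND), (right_hand, N_HAND), (face, N_FACE))
    for points, expected in groups:
        if points is None:
            points = [(0.0, 0.0)] * expected
        if len(points) != expected:
            raise ValueError(f"expected {expected} keypoints, got {len(points)}")
        for x, y in points:
            row.extend([float(x), float(y)])
    return row


def write_sequence(frames, stream):
    """Write one header line plus one CSV row per frame."""
    writer = csv.writer(stream)
    writer.writerow(csv_header())
    writer.writerows(frames)
```

Each CSV file produced this way has one row per frame and 96 columns, matching the layout described in the text.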
2.3. Normalization and Frame-Wise Similarity Metric
To make interpreter and user trajectories comparable under framing or distance variations, axis-wise normalization (z-score) was applied to each sequence:
$$\hat{x} = \frac{x - \mu_x}{\sigma_x}, \qquad \hat{y} = \frac{y - \mu_y}{\sigma_y}$$
where $\mu_x, \mu_y$ and $\sigma_x, \sigma_y$ are the mean and standard deviation per axis, computed over the sequence. This centers the point cloud at $(0, 0)$ and standardizes variances, reducing the effects of translation and scale (Figure 4).
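A minimal sketch of this axis-wise z-score is given below. It assumes that each frame is a flat row of interleaved x and y values and that the statistics are pooled over all keypoints and frames of the sequence; the function name `zscore_sequence`, the use of the population standard deviation, and the `eps` guard are illustrative choices.

```python
import statistics


def zscore_sequence(frames, eps=1e-8):
    """Normalize a sequence of rows [x1, y1, ..., xK, yK] per axis.

    The mean and standard deviation are computed separately for all x
    values and all y values of the whole sequence (pooling assumption).
    """
    xs = [v for row in frames for v in row[0::2]]
    ys = [v for row in frames for v in row[1::2]]
    mu_x, mu_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x = statistics.pstdev(xs) or eps   # avoid division by zero
    sd_y = statistics.pstdev(ys) or eps
    normalized = []
    for row in frames:
        out = []
        for x, y in zip(row[0::2], row[1::2]):
            out.extend([(x - mu_x) / sd_x, (y - mu_y) / sd_y])
        normalized.append(out)
    return normalized
```

Under these assumptions, a sequence that is shifted and uniformly scaled in the image maps to the same normalized sequence, which is the invariance the normalization is meant to provide.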
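Instantaneous similarity between two normalized frames $i$ and $j$ was computed using the Euclidean distance over the 48 keypoints:
$$d(i,j) = \sqrt{\sum_{k=1}^{48}\left[\left(\hat{x}_{k}^{(i)} - \hat{x}_{k}^{(j)}\right)^{2} + \left(\hat{y}_{k}^{(i)} - \hat{y}_{k}^{(j)}\right)^{2}\right]}$$
Figure 5 illustrates the basic calculation between two homologous points.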
2.4. Temporal Alignment with Dynamic Time Warping (DTW)
To compare entire sequences with potential differences in execution speed or minor desynchronizations, Dynamic Time Warping (DTW) was employed. Given the frame-wise distance $d(i,j)$, the accumulated cost is defined recursively as:
$$D(i,j) = d(i,j) + \min\left\{D(i-1,j),\; D(i,j-1),\; D(i-1,j-1)\right\}$$
The final DTW cost acts as a global dissimilarity measure (lower is better). For user feedback, the system also reports a similarity percentage derived from this score (Figure 6).
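The recursive accumulated cost can be sketched with the classical dynamic-programming table. This is a plain, unoptimized version under stated assumptions: the standard boundary conditions (zero cost at the origin, infinite cost on the borders) and the unweighted three-neighbor step pattern are assumed, and the names `frame_distance` and `dtw_distance` are illustrative.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two frame vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of frame vectors."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = math.inf
    # D[i][j] is the best accumulated cost aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

With this formulation, a user who performs the same trajectory more slowly (repeated frames) still obtains a cost of zero, whereas spatial deviations accumulate along the alignment path. The table has size n × m, so memory and time grow with the product of the two sequence lengths.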
To define a practical acceptance criterion for the correct execution of a sign, we analyzed the distribution of the DTW distances across all lessons (N = 335). Because the alphabet lesson was recorded with opposite handedness between interpreter and user, its 27 samples presented systematically higher DTW values (mean = 1419.5, SD = 169.4). Based on this empirical distribution, we adopted a global acceptance threshold $\tau$, which lies between the median and the 75th percentile and is close to the overall mean of the full dataset (≈583). This cutoff provides a balanced trade-off between tolerance to natural intra-user variability and rejection of mismatched executions.
A short sensitivity analysis showed that applying thresholds of 500, 600, and 700 yielded acceptance rates of 55.8%, 69.5%, and 80.5%, respectively, confirming that $\tau$ is a robust and interpretable criterion. The alphabet samples were excluded from threshold determination due to the handedness mismatch; alternatively, they could be handled by applying a mirroring step before DTW to remove the systematic bias.
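The sensitivity analysis amounts to counting how many DTW distances fall under each candidate cutoff. The sketch below illustrates the computation on made-up values; the function names and the choice of an inclusive comparison (`<=`) are assumptions, and the toy numbers in the usage note are not the study data.

```python
def acceptance_rate(dtw_values, threshold):
    """Percentage of executions whose DTW distance is within the threshold."""
    if not dtw_values:
        raise ValueError("dtw_values must not be empty")
    accepted = sum(1 for value in dtw_values if value <= threshold)
    return 100.0 * accepted / len(dtw_values)


def threshold_sensitivity(dtw_values, thresholds=(500, 600, 700)):
    """Acceptance rate for each candidate threshold."""
    return {t: acceptance_rate(dtw_values, t) for t in thresholds}
```

For example, `threshold_sensitivity([300, 450, 550, 650, 900])` evaluates the three cutoffs on five hypothetical distances; a threshold is easier to defend when the acceptance rate changes smoothly around it rather than jumping abruptly.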
2.5. Noise Robustness
To assess the sensitivity of the system to capture variability and detection errors, perturbed versions of the sequences were generated by adding Gaussian noise with standard deviation $\sigma$ to the $(x, y)$ coordinates. This allowed evaluation of how DTW distances change under random displacements, simulating variability due to camera or detection artifacts. Results are reported in Section 3.
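A perturbation of this kind can be sketched as below. The noise is assumed to be zero-mean and independent per coordinate; the function name `add_gaussian_noise` and the `seed` parameter (added for reproducibility) are illustrative, and the noise level used in the study is not reproduced here.

```python
import random


def add_gaussian_noise(frames, sigma, seed=None):
    """Return a copy of the sequence with N(0, sigma^2) noise on every coordinate."""
    rng = random.Random(seed)   # local generator keeps runs reproducible
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in frames]
```

Comparing the DTW distance of the clean pair with the distance obtained after perturbing one of the sequences, for several values of `sigma`, gives a simple picture of how quickly the score degrades with detection jitter.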
2.6. Robustness to Lighting and Resolution Changes
To evaluate the robustness of the DTW alignment under realistic capture conditions, we generated user video variants with controlled degradations in lighting and spatial resolution. We defined: Light50 and Light25: moderate and severe brightness reductions, respectively; Res50 and Res25: moderate and severe spatial downscaling; Light50_Res50 and Light25_Res25: moderate and severe combined degradations. The reference condition corresponds to the user pair without any degradation. The extraction of keypoints and the DTW computation were identical to those in the baseline condition.
As an illustrative example of the degradations considered in this subsection, Figure 7 shows the base user video (sign “Number 4”) along with the six lighting and resolution variants, arranged in a single row of seven consecutive panels.
2.7. Partial Occlusion Protocol
We simulated occlusions in key regions relevant to Mexican Sign Language (MSL): left hand (L), right hand (R), both hands (B), face (Face), and full frame (All). Three severity levels (10%, 30%, and 60%) were tested, as well as two masking patterns (suffixes c/r for constant/random masks). For each condition, we report the mean and standard deviation of DTW values, as well as the percentage change relative to the reference, as detailed in Section 3.2.
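The per-condition summary (mean, standard deviation, and percentage change relative to the reference) can be computed as in the sketch below. The function name and the use of the sample standard deviation are assumptions; the numbers in the usage note are invented for illustration.

```python
import statistics


def summarize_condition(dtw_values, baseline_mean):
    """Mean, sample SD, and percentage change of DTW versus the baseline mean."""
    if baseline_mean == 0:
        raise ValueError("baseline_mean must be non-zero")
    mean = statistics.fmean(dtw_values)
    sd = statistics.stdev(dtw_values) if len(dtw_values) > 1 else 0.0
    change_pct = 100.0 * (mean - baseline_mean) / baseline_mean
    return {"mean": mean, "sd": sd, "change_pct": change_pct}
```

For instance, `summarize_condition([110.0, 130.0], baseline_mean=100.0)` reports a mean of 120 and a change of +20% relative to the baseline.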
2.8. Graphical User Interface
We built a graphical user interface to evaluate the method with participants. The application is organized into two levels: (i) a main menu with the available lessons and (ii) a practice window where the user views the reference video and performs practice attempts.
In the main menu, lessons are displayed as buttons arranged in rows and columns for an ordered selection (Figure 8).
Upon selecting a lesson, the practice window opens as shown in Figure 9. In the upper-left corner, a home button allows returning to the main menu. The central area displays the reference video; above it, two labels show the lesson title and the sign name. The bottom strip contains five control buttons: Previous, Play, Pause, Next, and Practice. The Practice button initiates camera capture and evaluation.
As shown in Figure 10a, when the Practice button is pressed, a preparation message instructs: “Stand one meter away from the camera so your open palms are visible.” The camera then activates, and the system captures the hand and face keypoints per frame using MediaPipe. The user sequence is normalized and compared with the reference CSV via DTW, providing immediate feedback as a similarity percentage, as shown in Figure 10b,c.
2.9. User Study Design
A user study was conducted with 33 participants to assess perceived usefulness and user experience. A 5-point Likert questionnaire (1 = strongly disagree, 5 = strongly agree) measured four constructs: ease of use, perceived usefulness, satisfaction, and learning efficiency. The questionnaire’s internal consistency was assessed with Cronbach’s $\alpha$.
Each participant: (1) opened the application and selected a lesson; (2) reviewed the reference video and control layout; (3) pressed Practice to capture their attempt after the preparation message; (4) received immediate feedback in the form of a similarity percentage; (5) repeated attempts as needed; and (6) completed the questionnaire. During the process, the researcher resolved questions in person or via video call.
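For reference, the internal-consistency coefficient mentioned for the questionnaire can be computed from the raw response matrix with the standard Cronbach formula. The sketch below assumes one row per participant and one column per item, and uses sample variances; the function name is illustrative and the data in the usage note are invented.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (participants) of item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two participants")
    # Variance of each item across participants.
    item_vars = [statistics.variance([row[i] for row in responses]) for i in range(k)]
    # Variance of the total score per participant.
    total_var = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)
```

When all items rank the participants identically, for example `cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3], [5, 5, 5]])`, the coefficient reaches its maximum of 1; uncorrelated items push it toward 0.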
3. Results
The experiments with dynamic time warping (DTW) enabled a systematic comparison of user executions against reference performances by professional interpreters. A lower DTW distance indicates greater similarity, whereas larger values reflect discrepancies in shape or timing. This approach made it possible to identify and quantify the differences between movements, adjust for temporal variations, and generate objective metrics for immediate feedback during practice.
In the alphabet lesson, the DTW distances were consistently greater than 1000 due to the difference in the dominant hand (interpreter using the left hand vs. user using the right hand). In contrast, in lessons involving basic vocabulary and short phrases, most distances ranged between 300 and 600, consistent with correct executions.
Table 1 summarizes the mean and standard deviation of DTW distances per lesson.
3.1. Performance Evaluation of the DTW Method
To quantify the computational requirements of the proposed method based on Dynamic Time Warping (DTW), a single sign from Lesson 10 (Sign_0011) was analyzed. The experiment was conducted on a Windows 11 computer equipped with an Intel Core i7 CPU and 32 GB of RAM. The process included three stages: (1) keypoint extraction from both interpreter and user videos using MediaPipe, (2) data normalization, and (3) DTW distance computation between both temporal trajectories. Execution time was measured in milliseconds (ms), and memory usage in megabytes (MB). RSS_delta represents the net memory change, peak_py indicates the peak memory used by Python 3.10 objects, and peak_rss is the overall process peak (including native video buffers and TensorFlow Lite delegates).
Table 2 summarizes the execution time and memory usage for each stage of the DTW process (keypoint extraction, normalization, and alignment).
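A measurement of this kind can be sketched with the Python standard library alone. The helper below times one stage and records the peak memory of Python objects through `tracemalloc`, which corresponds to the peak_py figure; process-level values such as RSS_delta and peak_rss require an operating-system query (for example through a third-party package) and are not covered. The name `profile_stage` is hypothetical.

```python
import time
import tracemalloc


def profile_stage(func, *args, **kwargs):
    """Run one pipeline stage and return (result, elapsed_ms, peak_python_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed_ms, peak_bytes / (1024 * 1024)
```

Wrapping each of the three stages (keypoint extraction, normalization, DTW) in such a helper yields one row of timing and memory figures per stage. Note that tracing allocations adds overhead, so timings taken with `tracemalloc` active are slightly pessimistic.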
The extracted matrices contained 96 normalized coordinates per frame and occupied 0.073 MB for the interpreter and 0.054 MB for the user, for a total memory footprint of approximately 0.126 MB. The DTW distance obtained between both sequences was 365.08, indicating strong similarity in temporal alignment and spatial configuration.
The results demonstrate that the computational bottleneck lies in the keypoint extraction stage due to video decoding and neural landmark detection, whereas DTW computation itself is lightweight in both time and memory consumption. When starting from precomputed keypoints (CSV files), the inference time drops below 0.3 s, making this approach feasible for real-time feedback on standard CPUs without dedicated GPUs.
Figure 11 presents a representative heatmap of DTW distances between interpreter and user signs. Darker values represent higher similarity, while lighter values indicate discrepancies. As expected, in basic vocabulary lessons most comparisons fell in the 300–600 range, whereas alphabet signs consistently exceeded 1000 due to handedness differences.
3.2. Robustness Results with DTW
Robustness experiments with Dynamic Time Warping (DTW) quantify how visual conditions affect the similarity between user executions and interpreter references. Lower DTW means higher similarity. We analyze (i) lighting and resolution changes and (ii) partial occlusions.
3.2.1. Effect of Lighting and Resolution
We generated video variants with reduced brightness and spatial resolution. Light50 and Light25 reduce brightness to 50% and 25% of the original level, respectively. Res50 and Res25 downscale resolution to 50% and 25%. Light50_Res50 and Light25_Res25 apply both factors simultaneously. The baseline condition is the unaltered user–interpreter pair (see Table 3).
To visualize the overall impact of lighting and resolution, Figure 12 summarizes the average DTW distance by condition (error bars = ±1 SD).
Table 4 reports the mean DTW, standard deviation and percentage change relative to baseline. As shown in the table, moderate degradations (Light50, Res50, Light50_Res50) alter DTW by less than 5%, whereas the severe combined condition (Light25_Res25) yields the largest increase (about +38%), indicating sensitivity to simultaneous low light and low resolution.
3.2.2. Effect of Partial Occlusions
We occluded key regions at three levels (10%, 30%, 60%) with constant (c) or random (r) masks. The regions were left hand (L), right hand (R), both hands (B), face (Face), and whole frame (All). The notation is occ[Region][Level][Pattern] (see Table 5).
Table 6 summarizes the mean DTW, standard deviation, and relative change for each occlusion region, level, and mask pattern.
DTW increases with occlusion severity. The largest effects occur when both hands or the full frame are masked, while single-hand and 10% masks have smaller impact. This confirms the critical role of hand visibility for temporal alignment.
3.3. Experimental Comparison with Video Embeddings
Beyond DTW, we explored an experimental approach based on direct video classification using advanced multimodal models, aimed at evaluating scalability to larger datasets. We selected three state-of-the-art (SOTA) models to test this approach: the Qwen2.5-VL-7B model, pre-trained on a dataset constructed through a combination of methods, including cleaning raw web data and synthesizing data, among others [25]; the VideoMAEv2 model (vit_small_patch16_224), pre-trained on the Kinetics-710 dataset with a dual masking strategy for self-supervised pre-training [27]; and the VJEPA2 model (vjepa2-vitg-fpc64-384), pre-trained on internet-scale videos and then post-trained with a small amount of interaction data [28]. Each selected model can generate video embeddings from frame sequences by decomposing each frame into multiple patches represented as N-dimensional vectors (see Figure 13). Each model employs a distinct vision encoder configuration, resulting in feature vectors with different dimensions, as shown in Table 7.
For similarity-based classification, we implemented a Matching Network (MN) [24], comparing query embeddings against a set of supports using cosine distance and soft voting, as shown in Figure 14. The support vectors are the patch embeddings of the reference videos, and each of them carries the class of the video it belongs to. The query vectors are those that need to be classified. To classify the patch embeddings of a video, the cosine distance between each query patch and each support vector is computed, and the class of the support vector with the smallest distance is taken as the prediction for that patch. Because every video sample has N video-embedding patches, the process is repeated for all patches of the video, and each prediction is recorded. Following the soft voting strategy, the most voted class is set as the preliminary predicted class.
To gain a broader understanding of the advantages and disadvantages of using these foundational models as feature extractors, it is necessary to compare their computational resource requirements and processing latency. We compare the amount of memory used by the complete system (feature extractor + Matching Network) and the processing latency of a single sample (see Table 8).
3.4. Embedding Baseline Results
The embedding baseline achieved different results across the foundational models used as feature extractors. In the case of Qwen2.5VL + MN, we achieved a clear class separability between multiple signs with a perfect accuracy score. Figure 15, Figure 16 and Figure 17 show the classification results using video embedding similarity. This strong performance can be attributed to the Qwen2.5VL architecture, which emphasizes video understanding as a core capability. In contrast, VideoMAEv2 and VJEPA2 showed the opposite trend, with poor accuracy (see Table 9). These results can be attributed to the fact that VideoMAEv2 and VJEPA2 are pre-trained on masked prediction and reconstruction tasks and are commonly fine-tuned for downstream tasks, making them unsuitable for zero-shot scenarios.
However, due to the higher computational cost and lower interpretability, this approach was not integrated into the final practice module.
3.5. User Evaluation Results
To assess participant perceptions of the application, a five-point Likert questionnaire (1–5) was administered to 33 participants. The items were grouped into four categories: ease of use, perceived usefulness, satisfaction, and learning efficiency. An open question was also included for qualitative suggestions.
Figure 18 summarizes the distribution of responses per category (subfigures a–d). The responses were clearly concentrated in the higher range (4–5), suggesting favorable perceptions of the system in all dimensions.
Table 10 reports, for each item, the mean, standard deviation, p-value of a one-sample t-test vs. 3, Cohen’s d effect size, and 95% confidence intervals. In all categories, the mean was significantly higher than 3.
We applied the Holm–Bonferroni correction for multiple comparisons; significance remained after adjustment.
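The statistics involved can be sketched with the standard library. The first helper computes the one-sample t statistic and Cohen's d against the neutral point 3; the second applies the Holm–Bonferroni step-down rule to a list of p-values. The p-values themselves are assumed to come from a statistics package, since the standard library has no t-distribution; the function names and the example numbers are illustrative.

```python
import math
import statistics


def one_sample_summary(scores, mu0=3.0):
    """Mean, sample SD, t statistic, and Cohen's d of scores against mu0."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    d = (mean - mu0) / sd          # Cohen's d for a one-sample design
    t = d * math.sqrt(n)           # t = (mean - mu0) / (sd / sqrt(n))
    return {"mean": mean, "sd": sd, "t": t, "d": d}


def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                   # all larger p-values are retained
    return reject
```

For example, with p-values `[0.01, 0.04, 0.03]` and `alpha=0.05`, only the first hypothesis survives the correction, because the second-smallest p-value (0.03) exceeds 0.05/2.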
Qualitative Results
The open question (Q13) highlighted recurring improvement opportunities: (i) visual optimization of the interface, (ii) improved robustness under poor lighting or low-quality cameras, and (iii) more detailed feedback to guide practice. However, several participants expressed their full satisfaction with the application.
Taken together, the quantitative evidence (means significantly greater than 3, medium to large effect sizes, and confidence intervals above the neutral point) and the psychometric evidence (reliability, correlations, and PCA) indicate that the application is perceived as useful, clear, and motivating for learning Mexican Sign Language.
4. Discussion
The Dynamic Time Warping (DTW)-based approach proved effective for guided practice of Mexican Sign Language (MSL), as it provides immediate and quantitative feedback using only a conventional RGB camera. The similarity threshold allowed for consistent differentiation between correct and incorrect executions, particularly in lessons involving basic vocabulary and short phrases. These findings indicate that motion–trajectory alignment can serve as a reliable metric for sign fidelity in low-cost configurations, making it suitable for educational applications without requiring specialized sensors or wearable devices.
The examples in Figure 2 visually illustrate the temporal evolution of the recorded signs, reinforcing the interpretability and reproducibility of the dataset for subsequent DTW analysis.
In addition to DTW, several alternative methods could be considered for aligning or comparing temporal motion sequences between the interpreter and the user. Examples include frame-wise Euclidean or cosine distance, cross-correlation measures, and probabilistic or parametric models such as Dynamic Movement Primitives (DMPs) and the soft-DTW variant. We selected classical DTW for this work because it provides an explicit alignment path, is robust to differences in sequence length and execution speed, and produces interpretable distance matrices that can be directly visualized as feedback for learners. In contrast, methods such as soft-DTW or embedding-based similarity scores are more suitable for large-scale automatic recognition but less transparent for educational feedback. Therefore, DTW was favored for its balance between simplicity, computational efficiency, and pedagogical interpretability.
A complementary baseline using video embeddings and Matching Networks was also explored to evaluate scalability toward larger vocabularies. Although this approach achieved clear class separability and high accuracy, it demands greater computational resources and offers limited interpretability at the frame level. In contrast, DTW offers lightweight computation and transparent alignment visualization, which are advantageous for real-time feedback and educational use. Therefore, both strategies are considered complementary: DTW is ideal for interactive practice and formative evaluation, while the embedding-based model has potential for large-scale recognition and automatic dataset annotation.
The computational analysis (Table 2) demonstrated that keypoint extraction dominates resource consumption, while DTW computation remains lightweight. This supports the method’s suitability for real-time educational feedback.
The main strength of the proposed system lies in its accessibility and ease of deployment. It can be used on standard laptops or tablets and provides an intuitive interface for learners. However, its performance depends on accurate keypoint detection; tracking errors or occlusions can increase dissimilarity scores, especially under poor lighting or low-resolution conditions. The visual examples in Figure 7 show how lighting and resolution degradations affect keypoint detection and, consequently, DTW alignment stability.
Quantitatively, moderate lighting or resolution degradations altered DTW distances by less than 5%, indicating robustness for standard webcams. However, severe combined conditions (Light25_Res25) and occlusions affecting both hands increased DTW distances by 35–70%, which could lead to false negatives in practical use. These results underscore the need for adaptive thresholds and more robust keypoint tracking in uncontrolled environments. In addition, the dataset was limited to 12 lessons and a small group of interpreters, which may constrain generalization to broader contexts of MSL use.
The results of the user study further support the educational value of the system. Participants reported high satisfaction, perceived usefulness, and learning motivation. The psychometric validation of the questionnaire (Cronbach’s α, with all item means significantly above the neutral point after Holm–Bonferroni correction) supports the reliability and internal consistency of the participants’ responses and the robustness of the subjective evaluations. These outcomes are consistent with previous research showing that interactive and feedback-oriented learning environments enhance engagement and retention in sign language education. The DTW-based feedback mechanism not only improved self-assessment but also encouraged repeated practice, demonstrating its potential as a tool for inclusive and autonomous learning.
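For readers unfamiliar with the Holm–Bonferroni procedure mentioned above, a minimal sketch of the step-down correction is given below. The function name and the p-values are illustrative assumptions only and are not results from this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans (same order as p_values) indicating which
    null hypotheses are rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Once one test fails, all larger p-values are retained.
    return reject


# Illustrative p-values only (not taken from the questionnaire analysis).
example_p = [0.01, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(example_p)
```

With these illustrative values, the two smallest p-values are rejected and the procedure stops at the third, so the remaining hypotheses are retained.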
5. Conclusions
This study presented a virtual training system for learning Mexican Sign Language (MSL) using a computer vision approach based on Dynamic Time Warping (DTW) and, experimentally, video embeddings with Matching Networks. The DTW-based module allowed learners to practice signs with real-time feedback through a conventional RGB camera, providing an interpretable and efficient similarity metric that supports immediate correction. The embedding-based baseline, in turn, achieved high classification accuracy, demonstrating potential for large-scale vocabulary expansion and automatic dataset labeling. Together, these complementary methods highlight the value of combining interpretability and scalability for inclusive sign language education.
Limitations and Future Work
The main limitations of this study include the relatively small dataset (12 lessons with few interpreters), which limits model generalization, and the DTW metric’s dependence on accurate keypoint detection, making it sensitive to occlusions and lighting variations. The evaluation also focused mainly on short, isolated signs rather than continuous signing, and the participant group lacked broad diversity in age, hearing condition, or cultural background. Furthermore, while accurate, the embedding-based approach requires substantial computational resources, hindering real-time deployment on low-power devices.
Building directly upon the robustness findings, we will (i) adopt adaptive DTW thresholds conditioned on capture quality (lighting and resolution) and (ii) improve keypoint extraction under occlusions, especially when both hands are partially hidden. Concretely, we will calibrate per-lesson and per-condition thresholds using validation splits, integrate lightweight exposure and contrast normalization for low-light scenes, and add occlusion-aware hand tracking and user prompts to preserve alignment stability in uncontrolled environments.
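One possible way to calibrate per-lesson and per-condition thresholds is sketched below. It assumes, hypothetically, that validation DTW distances from correct executions are grouped by (lesson, capture condition) and that the acceptance threshold is set to the mean plus k standard deviations; the function names, the value of k, the condition labels, and the distances are illustrative and do not describe an implemented component.

```python
from statistics import mean, pstdev


def calibrate_thresholds(validation, k=2.0):
    """Derive one acceptance threshold per (lesson, condition) pair.

    validation: dict mapping (lesson, condition) to a list of DTW distances
    measured on correct executions in a validation split.
    Assumption: threshold = mean + k * population standard deviation.
    """
    return {
        key: mean(distances) + k * pstdev(distances)
        for key, distances in validation.items()
        if distances
    }


def is_accepted(distance, lesson, condition, thresholds, fallback):
    """Accept an attempt if its DTW distance is within the calibrated threshold.

    Falls back to a global threshold for unseen (lesson, condition) pairs.
    """
    return distance <= thresholds.get((lesson, condition), fallback)


# Hypothetical validation distances, for illustration only.
validation_distances = {
    ("lesson_01", "good"): [10.0, 12.0, 14.0],
    ("lesson_01", "low_light"): [15.0, 18.0, 21.0],
}
thresholds = calibrate_thresholds(validation_distances)
ok_good = is_accepted(17.0, "lesson_01", "good", thresholds, fallback=20.0)
ok_low_light = is_accepted(17.0, "lesson_01", "low_light", thresholds, fallback=20.0)
```

In this sketch the same attempt distance is rejected under good capture conditions but accepted under low light, which is the intended effect of conditioning the threshold on capture quality.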
Future work will expand the dataset with additional interpreters and vocabulary, including participants from diverse demographic and linguistic backgrounds, to improve representativeness and generalization across populations. We also plan to evaluate the transferability of the model between different user groups, enhance robustness under various recording conditions, and explore adaptive similarity metrics that personalize feedback. Lightweight multimodal models could merge the interpretability of DTW with the scalability of embeddings, enabling hybrid systems suitable for both teaching and large-scale MSL analysis.
Future research could also explore end-to-end recurrent architectures such as LSTM- or GRU-based models to learn temporal dependencies directly from raw video sequences, provided that larger and more balanced datasets become available. This dual analysis highlights a complementary pathway: DTW provides interpretable, low-latency feedback for learners, while embedding-based classification enables scalability for automatic dataset expansion. Integrating both could yield hybrid systems capable of delivering personalized, explainable, and large-scale sign language instruction. Finally, robustness experiments under controlled degradations (lighting, resolution, and occlusions) showed that DTW similarity remains stable under moderate conditions, with significant degradation only when both hands or the full frame were occluded (Table 6). These findings support the reliability of the method for typical classroom and home environments.