Article

Toward a Recognition System for Mexican Sign Language: Arm Movement Detection

by Gabriela Hilario-Acuapan, Keny Ordaz-Hernández *, Mario Castelán and Ismael Lopez-Juarez *
Robotics and Advanced Manufacturing Department, Centre for Research and Advanced Studies (CINVESTAV), Ramos Arizpe 25900, Mexico
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(12), 3636; https://doi.org/10.3390/s25123636
Submission received: 5 April 2025 / Revised: 3 June 2025 / Accepted: 4 June 2025 / Published: 10 June 2025

Abstract

This paper describes ongoing work surrounding the creation of a recognition system for Mexican Sign Language (LSM). We propose a general sign decomposition that is divided into three parts, i.e., hand configuration (HC), arm movement (AM), and non-hand gestures (NHGs). This paper focuses on the AM features and reports the approach created to analyze visual patterns in arm joint movements (wrists, shoulders, and elbows). For this research, a proprietary dataset—one that does not limit the recognition of arm movements—was developed, with active participation from the deaf community and LSM experts. We analyzed two case studies involving three sign subsets. For each sign, the pose was extracted to generate shapes of the joint paths during the arm movements, which were then fed to a CNN classifier. YOLOv8 was used for pose estimation and visual pattern classification purposes. The proposed approach, based on pose estimation, shows promising results for constructing CNN models to classify a wide range of signs.

1. Introduction

Deafness or hearing loss is the partial or total loss of the ability to hear sounds in one or both ears. The World Health Organization’s most recent World Report on Hearing [1] estimates that more than 1.5 billion people have some degree of hearing loss. Approximately 430 million of them have moderate or greater hearing loss in their better ear; this number is expected to increase to 700 million people by 2050.
According to the Ministry of Health [2], approximately 2.3 million people in Mexico have hearing disabilities. This vulnerable group faces significant levels of discrimination and limited employment opportunities. Additionally, this health condition restricts access to education, healthcare, and legal services, further exacerbating social inequalities and limiting opportunities for integration. One of the primary challenges faced by the deaf community is communication with hearing individuals, as linguistic differences hinder social and workplace interactions. While technology has proven useful in reducing some of these barriers, deaf individuals often rely on the same technological tools as the hearing population, such as email and text messaging applications. However, these tools are not always effective, as not all deaf individuals are proficient in written Spanish.
In the Americas, the most widely studied sign languages are American Sign Language (ASL) and Brazilian Sign Language (LIBRAS), which have facilitated research and technological advancements aimed at improving communication with the deaf community. One example of such innovation is SLAIT [3], a startup that emerged from a research project at Aachen University of Applied Sciences in Germany. During this research, an ASL recognition engine was developed using MediaPipe and recurrent neural networks (RNNs). Similarly, Ref. [4] announced an innovative project in Brazil that uses computer vision and artificial intelligence to translate LIBRAS into text and speech in real time. Although this technology is still undergoing internal testing, the developers claim that after four years of work, the system has reached a significant level of maturity. This technology was developed by Lenovo researchers in collaboration with the Center for Advanced Studies and Systems in Recife (CESAR), which has already patented part of this technology [5]. The system is capable of recognizing the positions of arm joints, fingers, and specific points on the face, similar to SLAIT. From these data, it processes facial movements and gestures, enabling the identification of sentence flow and the conversion of sign language into text. CESAR and Lenovo believe that their system has the potential to become a universally applicable tool.
Compared to speech recognition and text translation systems, applications dedicated to sign language (SL) translation remain scarce. This is partly due to the relatively new nature of the field and the inherent complexity of sign language recognition (SLR), which involves visual, spatial, and gestural elements. Recognizing sign language presents a significant challenge, primarily due to limited research and funding. This highlights the importance of promoting research into the development of digital solutions that enhance the quality of life for the deaf community (cf. [6]). However, researchers agree that the key factor for developing successful machine learning models is data (cf. [7]). In this regard, for SLs like LSM, existing databases are often inadequate in terms of both size and quality, hindering the advancement of these technologies. Sensing technology also plays a fundamental role in the reliability of incoming data, which is the main reason why SLR is broadly divided into two branches, i.e., contact sensing and contactless sensing.
Sign data acquisition with contact relies on gloves [8], armbands [9], wearable inertial sensors [10,11], or electromyographic (EMG) signals [12]. In contrast, contactless sign data acquisition is mainly divided into two types, depending on the kind of hardware, that is, simple hardware (color or infrared cameras) or specialized hardware (e.g., depth sensors, optical 3D sensors [13], commercial WiFi devices [14], and ultrasonic devices [15]).
This classification is similar to the one presented by [16] (Figure 1), except that their sign data acquisition approaches are divided into sensor-based approaches and vision-based approaches. We present several examples of sign language research and related work, along with various approaches to sign data acquisition, as detailed in Table 1.
In Table 1, we include information regarding the features of signs that are included in the sign data acquisition for each reported work. Instead of using the separation employed by [17] (facial, body, and hand features), we propose our own decomposition into hand configurations (HCs), arm movements (AMs), and non-hand gestures (NHGs); see Figure 1. This is a fundamental concept of our research, so this decomposition is discussed in more detail in Section 1.1.2. The separation of facial, body, and hand features is a concept commonly seen in pose estimators (such as MediaPipe [18]), which are also common in SL research, as presented in Table 1. It is also possible to observe that most SL research is focused on the HC features.
We now present the scientific context of LSM research: first, the known datasets, and then studies on LSM recognition and analysis.
LSM is composed of two parts, i.e., dactylology (fingerspelling) and ideograms ([19], p. 12). Dactylology is a small subset of LSM and basically consists of letters of the alphabet, most of which are static signs. A few signs for numbers are also static. Due to the small, nevertheless important, role of dactylology, we are interested in LSM ideogram datasets. To the best of our knowledge, there are three publicly available ideogram-focused datasets. Two of them are visual, i.e., (i) the MX-ITESO-100 preview [20], which contains video clips of 11 signs from 3 signers (out of 100 signs, although not all are currently available), and (ii) the Mexican Sign Language dataset [21,22], which includes image sequences of 249 signs from 11 signers. The third dataset, consisting of keypoints, is provided by [23]; this dataset contains 3000 samples of 30 signs from 4 signers (8 letters, 20 words, and 2 phrases). This was constructed by processing the RGBD data into keypoints by means of the MediaPipe [18] tool, but the unprocessed visual data is not provided. A comparison of these datasets, along with LSM glossaries, is provided in Table 2 and Table 3.
Table 1. Sign language research and related work.
Ref. | SL | Sign Group * | Sign Type | Sign Features | Sensor/Tool
Yao et al. (2025) [24] | ASL | L, N, P | Static | HC | Hydrogel strain sensor
Chiradeja et al. (2025) [8] | - | S | Dynamic | HC | Gloves
Rodríguez-Tapia et al. (2019) [10] | ASL | W | Dynamic | HC | Myoelectric bracelets
Filipowska et al. (2024) [12] | PJM | W | Dynamic | HC | EMG
Umut and Kumdereli (2024) [9] | TSL | W | Dynamic | HC, AM | Myo armbands (IMU + sEMG)
Gu et al. (2024) [11] | ASL | W, S | Dynamic | HC, AM | IMUs
Wei et al. (2025) [25] | - | W | Dynamic | HC | Gloves
Wang et al. (2025) [26] | ASL | L | Both | HC | Triboelectric sensor
Urrea et al. (2023) [27] | ASL | L, W | Static | HC | Camera/MediaPipe
Al-Saidi et al. (2024) [16] | ArSL | L | Static | HC | Camera/MediaPipe
Niu (2025) [28] | ASL | L | Static | HC | Camera
Hao et al. (2020) [14] | - | W | Dynamic | HC | WiFi
Galván-Ruiz et al. (2023) [13] | LSE | W | Dynamic | HC | Leap motion
Wang et al. (2023) [15] | CSL | W, P | Dynamic | HC | Ultrasonic
Raihan et al. (2024) [29] | BdSL | L, N, W, P | Dynamic | HC | Kinect
Woods and Rana (2023) [30] | ASL | W | Dynamic | AM, NHG | Camera/OpenPose
Eunice et al. (2023) [31] | ASL | W | Dynamic | HC, AM, NHG | Camera/Sign2Pose, YOLOv3
Gao et al. (2024) [17] | ASL, TSL | W | Dynamic | HC, AM, NHG | Camera, Kinect
Kim and Baek (2023) [32] | DGS, KSL | W, S | Dynamic | HC, AM, NHG | Camera/AlphaPose
Boháček and Hrúz (2022) [33] | ASL, LSA | W | Dynamic | HC, AM, NHG | Camera/Vision API (Apple)
Cihan Camgöz et al. (2020) [34] | DGS | S | Dynamic | HC, AM, NHG | Camera
Miah et al. (2024) [35] | ASL, PSL, LSM | L, W, P | Dynamic | HC, AM, NHG | Camera/MediaPipe, OpenPose
Gil-Martín et al. (2023) [36] | LSE | L, N, W | Both | HC, AM, NHG | Virtual camera/MediaPipe
Villa-Monedero et al. (2023) [37] | LSE | L, N, W | Both | HC, AM, NHG | Virtual camera/MediaPipe
Current study | LSM | W, P | Dynamic | AM | Camera/YOLOv8
* L: alphabet letter; N: number; W: word; P: phrase; S: sentence; HC: hand configuration; AM: arm movement; NHG: non-hand gesture. “Virtual camera” indicates that the dataset was created using synthetic avatars. SL names are provided in the Abbreviations section. Top part: sign data acquisition with contact sensing. Bottom part: contactless sign data acquisition.
Figure 1. Sign features: hand configuration (HC), arm movement (AM), and non-hand gesture (NHG). “Surprise!” sign images were taken from screenshots of the corresponding YouTube video of the GDLSM [38]; see Appendix A.
Table 2. LSM datasets and glossaries.
Ref. | Type | Sign Group * | Sign Signal | Samples
DIELSEME 1 (2004) [39] | Glossary | 535 W | Visual | 1 video per sign
DIELSEME 2 (2009) [40] | Glossary | 285 W | Visual | 1 video per sign
GDLSM (2024) [38] | Glossary | 27 L, 49 N, 667 W, 4 P | Visual | 1 video per sign
MX-ITESO-100 (2023) [20] | Dataset | 96 W, 4 P | Visual | 50 videos per sign
Mexican Sign Language dataset (2024) [22] | Dataset | 243 W, 6 P | Visual | 11 image sequences per sign
Mexican Sign Language Recognition (2022) [23] | Dataset | 8 L, 21 W, 1 P | Keypoints | 100 samples per sign
* L: alphabet letter; N: number; W: word; P: phrase. According to [41], DIELSEME 1 and 2 are actually glossaries and not dictionaries. The three LSM glossaries contain only one sample per sign, whereas the datasets include multiple samples per sign. Their site reports 719 videos, but only 715 were found; also, the 32 videos in the “Estados y capitales” thematic category include 2 signs per video.
Table 3. LSM datasets and glossaries: sign and signal properties.
Ref. | Sign Features | Signal Properties | File Format | Comments
DIELSEME 1 (2004) [39] | HC, AM *, NHG | 320 × 234 @ 12 fps | SWF videos |
DIELSEME 2 (2009) [40] | HC, AM, NHG | 720 × 405 @ 30 fps | FLV videos |
GDLSM (2024) [38] | HC, AM, NHG | 1920 × 1080 @ 60 fps | Videos | Hosted on a streaming platform; cf. Appendix A
MX-ITESO-100 (2023) [20] | HC, AM, NHG | 512 × 512 @ 30 fps | MP4 videos | Preview only
Mexican Sign Language dataset (2024) [22] | HC, AM * | 640 × 480 | JPEG images | Blurred faces
Mexican Sign Language Recognition (2022) [23] | HC, AM, NHG | 20 × 201 array | CSV files | One row per frame, 67 (x, y, z) keypoints
* In those cases, the background and clothing are black, so the segmentation of skin (hand and face) is easier, but tracking joints for AM is more difficult. Only 11 signs (words) are available in the public preview. Also, the 50 samples of each sign were performed by a single subject.
Regarding LSM studies, most SLR research on LSM focuses on classifying static letters and numbers using classical machine learning techniques and convolutional neural networks (CNNs) [42,43,44,45,46,47,48,49]. Using the classification provided by [16], there are four classes of signs: (i) continuous signs, (ii) isolated signs, (iii) letter signs, and (iv) number signs. In LSM, most of the signs in the last three categories are static signs. However, signing in LSM is generally highly dynamic and continuous, since most signs are ideograms, as mentioned before.
In terms of dynamic sign recognition, early studies focused on classifying letters and numbers with motion. For example, Ref. [50] used the CamShift algorithm to track the hand trajectory, generating a bitmap that captures the pixels of the hand path; these bitmaps were then classified using a CNN. Another approach, presented in [51], involved obtaining the (x, y) coordinates of 22 keypoints of the hand using an Intel RealSense sensor, which were used as training data for a multilayer perceptron (MLP) neural network. Finally, in [52], 3D body keypoints obtained with MediaPipe were used to train two recurrent neural networks (RNNs), i.e., LSTM and GRU.
In more recent research, in addition to letters and numbers, some simple words and phrases were included. Studies such as Refs. [53,54,55] used MLP-type neural networks, while others, such as Ref. [23], used more advanced RNN models. In Ref. [20], CNNs were used to extract features from the frames of a series of videos, which were then used as input to an LSTM model.
On the other hand, Ref. [56] presented a method for dynamic sign classification that involves extracting a sequence of frames, followed by a color-based neural network segmentation of the skin of the hands and face. To classify the signs, four classical machine learning algorithms were compared, i.e., a Bayesian classifier, decision trees, SVM, and NN.
Although research on LSM recognition has been conducted for several years, progress in this area has been slow and limited compared to other SLs. A common approach is to use computer vision techniques such as CNNs to build automatic sign recognition systems. However, with the recent emergence of pose recognition models, such as MediaPipe and YOLOv8, there is a trend in both LSM and other sign languages to use these tools to train more complex models, such as RNNs, or more sophisticated architectures, such as Transformers. A comparison of the studies mentioned here, with additional details, is shown in Table 4.
Table 4. LSM research.
Ref. | Sign Group * | Sign Type | Sign Feature | Sensor/Tool
Solís et al. (2016) [42] | L | Static | HC | Camera
Carmona-Arroyo et al. (2021) [43] | L | Static | HC | Leap Motion, Kinect
Salinas-Medina and Neme-Castillo (2021) [44] | L | Static | HC | Camera
Rios-Figueroa et al. (2022) [45] | L | Static | HC | Kinect
Morfín-Chávez et al. (2023) [46] | L | Static | HC | Camera/MediaPipe
Sánchez-Vicinaiz et al. (2024) [47] | L | Static | HC | Camera/MediaPipe
García-Gil et al. (2024) [48] | L | Static | HC | Camera/MediaPipe
Jimenez et al. (2017) [49] | L, N | Static | HC | Kinect
Martínez-Gutiérrez et al. (2019) [51] | L | Both | HC | RealSense f200
Rodriguez et al. (2023) [52] | L, N | Both | HC | Camera/MediaPipe
Rodriguez et al. (2025) [57] | L, N | Both | HC | Camera/MediaPipe
Martinez-Seis et al. (2019) [50] | L | Both | AM | Camera
Mejía-Peréz et al. (2022) [23] | L, W | Both | HC, AM, NHG | OAK-D/MediaPipe
Sosa-Jiménez et al. (2022) [58] | L, N, W | Both | HC, body but not NHG | Kinect
Sosa-Jiménez et al. (2017) [53] | W, P | Dynamic | HC, AM | Kinect/Pose extraction
Varela-Santos et al. (2021) [59] | W | Dynamic | HC | Gloves
Espejel-Cabrera et al. (2021) [56] | W, P | Dynamic | HC | Camera
García-Bautista et al. (2017) [54] | W | Dynamic | AM | Kinect
Martínez-Guevara and Curiel (2024) [60] | W, P | Dynamic | AM | Camera/OpenPose
Martínez-Guevara et al. (2019) [61] | W | Dynamic | HC, AM | Camera
Trujillo-Romero and García-Bautista (2023) [55] | W, P | Dynamic | HC, AM | Kinect
Martínez-Guevara et al. (2023) [62] | W, P | Dynamic | HC, AM | Camera
Martínez-Sánchez et al. (2023) [20] | W | Dynamic | HC, AM, NHG | Camera
González-Rodríguez et al. (2024) [63] | P | Dynamic | HC, AM, NHG | Camera/MediaPipe
Miah et al. (2024) [35] | L, W, P | Dynamic | HC, AM, NHG | Camera/MediaPipe, OpenPose
Current study | W, P | Dynamic | AM | Camera/YOLOv8
* L: alphabet letter; N: number; W: word; P: phrase.

1.1. Toward a Recognition System for LSM

We present the sign data acquisition, the hardware selected, and the fundamental concepts of our research toward a recognition system for LSM.

1.1.1. Contactless Sign Data Acquisition with Simple Hardware

Due to the socioeconomic conditions of the main users of LSM, this research uses contactless, simple hardware for sign data acquisition (i.e., a pure vision-based approach), since color cameras are widely accessible and available in portable devices, which are very common in Mexico. As presented in Table 4, one important remark is that only one LSM research work [59] used contact sensing for sign data acquisition.

1.1.2. Sign Features

From a linguistics perspective, LSM signs present six documented parameters, that is, basic articulatory parameters that simultaneously combine to form signs [39,64,65,66]. We propose a simplified kinematics perspective, as shown in Figure 1, which combines four of those parameters into arm movements (AMs):
  • Hand configuration (HC): The shape adopted by one or both hands. As seen in Table 1 and Table 3, most research focuses on HC only. Hand segmentation [67] and hand pose detectors are very promising technologies for this feature. The number of HCs required to perform a sign is variable in LSM; some examples regarding the number of HCs required for a sign are as follows: number “1” (1 HC), number “9” (2 HCs), number “15” (2 hands, 1 HC), and “grandmother” (2 hands, 3 HCs). See Appendix A for samples of these signs.
  • Non-hand gestures (NHGs): Facial expressions (frowning, raising eyebrows), gestures (puffing out cheeks, blowing), and body movements (pitching, nodding). While most signs do not require non-hand gestures, some LSM signs do. Some signs that require one or more NHGs are as follows: “How are you?”, “I’m sorry”, “Surprise!” (two NHGs of this sign are shown in Figure 1). See Appendix A for links to samples of these signs.
  • Arm movement (AM): This can be characterized by tracking the joint movements of the wrists, shoulders, and elbows. Tracking these joints is sufficient to obtain the following basic articulatory parameters [39,64,65,66]:
    (a) Articulation location: This is the location on the signer’s body or space where the signs are executed.
    (b) Hand movement: The type of movement made by the joints from one point to another.
    (c) Direction of movement: The trajectory followed by the hand when making the sign.
    (d) Hand orientation: Orientation of the palm of one or both hands, with respect to the signer’s body when making the manual configuration.
    This part can be studied using pose-based approaches (cf. [31,32], with pose estimation using AlphaPose).
Other decompositions have been proposed to simplify sign analysis, such as in [62] (Figure 1), where an LSM sign is decomposed into fixed postures and movements. We consider that this approach could lose important information, as transitions in hand postures are also important (as documented in the Hamburg Notation System (HamNoSys) [68]).
The use of pose estimators, particularly MediaPipe, enables the extraction of facial, hand, and body features; cf. [17,23]. Pose estimators are frequently used in SL research; however, there are still areas for improvement (cf. [27], Figure 8, where a PhBFC was designed to improve MediaPipe hand pose estimation). Complementary approaches such as bimodal frameworks [17] highlight the current limitations of these estimators.
We believe that focusing on a single element to describe LSM is inadequate, given the meaning and contribution of each element to the sign. However, covering everything at the same time is also very complex, as seen in most LSM research. Since most of the LSM work focuses on HC, this paper focuses on the AM part and reports the approach created to analyze visual patterns in arm joint movements. Our current work uses YOLOv8 [69,70] for pose estimation. Although it is a 2D method and MediaPipe additionally provides 3D coordinates, we justify this decision in Appendix B.
The main contribution of this work involves the use of arm movement keypoints, particularly wrist positions, as a partial feature for sign language recognition. This is motivated by the observation in [23], where wrist location played a crucial role in distinguishing similar signs. For instance, the same hand configuration used at different vertical positions (e.g., near the head to indicate a headache, or near the stomach to indicate a stomachache) conveys different meanings. By isolating and analyzing this spatial feature, we aim to better understand its discriminative power in sign recognition tasks.
This paper is structured as follows. Section 2 describes the data acquisition, the experimental design and setup, the stages of the proposed approach for SLR, and the evaluation process and metrics. Section 3 describes the results from the analysis of two case studies and presents a comparison of the proposed methodology against state-of-the-art works. The potential and the limitations of our approach are discussed in Section 4. The conclusions of this work are presented in Section 5.

2. Materials and Methods

This section describes the resources, tools, and procedures used in this study. First, the acquisition of a visual sign language dataset is presented, including a detailed description of its features. Next, the experimental design is introduced, indicating the experiments to be performed and their objectives. Then, the experimental setup involving the deep learning models and the computational resources employed is described. Afterward, the stages of the proposed first step toward a sign language recognition system are explained. Lastly, a detailed explanation of the motion shapes used in the experimentation and the evaluation metrics is provided.

2.1. Data Acquisition

In this research, a proprietary dataset was developed with the active participation of the deaf community and LSM experts, ensuring no restrictions on recognizing hand configurations, arm movements, and facial expressions. The creation of the dataset was reviewed and approved by the Bioethics Committee for Human Research at Cinvestav, and all participants provided written informed consent.
The dataset comprises 74 signs—73 performed by 17 subjects and 1 (“iron”) performed by 16 subjects. In total, we have 1257 color videos (900 × 720 @ 90 fps) for RGB data acquisition. We consider this dataset a visual sign signal dataset.
All signs involve HCs and AMs, and three of them include NHGs (“How?”, “How are you?”, “Why?”). There are four phrases in the dataset, as follows: “Good morning!” (“¡Buenos días!”), “Good afternoon!” (“¡Buenas tardes!”), “How are you?” (“¿Cómo estás?”), and “Why?” (“¿Por qué?”). The latter is a single question word in English, but it is constructed with two words in Spanish and, in LSM, is represented by a sign composed of two signs with independent meanings. This information is summarized in Table 5.

2.2. Experimental Design

Experiments were conducted on the custom dataset. The goal of these experiments was to classify dynamic LSM signs by detecting and tracking the wrist, elbow, and shoulder joints in order to characterize the AM. For this purpose, since sign production involves motion and changes in shape in space, we decided to use a pose-based approach to transform the visual sign signals into keypoint sign signals, and CNN for classification.
Two case studies are presented in this research. The first case only considers shoulders and wrists, as the wrists exhibit the predominant movement while the shoulders serve as base joints with minimal displacement. The second case includes the elbows, in addition to the shoulders and wrists, as the elbows also experience significant movement.
To carry out these analyses, three groups of signs were selected from the custom dataset. Each group was chosen based on specific characteristics. The first two subsets were selected based on signs with visually distinguishable motion patterns; in contrast, the third subset is composed of signs with variants to examine how this variability influences the classifier’s performance. More detailed information about these subsets is provided in Section 2.5.

2.3. Experimental Setup

For the experimentation, a pose detector and a CNN classifier framework were required. To select a pose estimation framework, we conducted preliminary experiments to compare the commonly used MediaPipe (Google LLC, Mountain View, CA, USA) and the YOLOv8-pose (Ultralytics Inc., Frederick, MD, USA) detector. Based on this comparison, we chose YOLOv8-pose due to its superior performance. The details of this comparison, which support our decision, can be found in Appendix B.
As YOLOv8-pose was selected for pose estimation, we used YOLOv8-cls (Ultralytics Inc., Frederick, MD, USA) to analyze visual patterns of the arm joint movements. Using a single technology for multiple tasks offers several advantages. For example, a unified architecture reduces the need for format adaptation between different models, simplifies implementation, and streamlines the workflow. Also, it reduces the possible problems of training and running multiple models across different frameworks.
A micromamba (QuantStack, Saint-Maur-des-Fossés, France) environment was employed for the installation and implementation of the pose detection and image classification models used in this work. Table 6 provides a summary of the technical specifications of the components of the experimental setting.

2.4. Sign Language Recognition

This work represents a preliminary step toward a recognition system; the proposed recognition process comprises three steps: (1) pose estimation, (2) shape generation, and (3) class prediction. A simplified diagram of this process is presented in Figure 2.
In this setup, a video file is passed through a pose detector, where six keypoints are extracted for each frame and saved as a NumPy (NumFOCUS, Austin, TX, USA) array. These keypoints are then plotted to generate motion shapes, and the resulting images are used as input to a classification model. The classification model returns the top five predicted classes and their associated confidence scores. Detailed descriptions of each stage in the process are provided in the following subsections.
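For illustration, the following is a minimal sketch of this pipeline written with the Ultralytics Python API; the video path and classifier weights are hypothetical placeholders, and the plotting settings only indicate the shape-generation step rather than the exact figure parameters used in this work.

```python
import matplotlib
matplotlib.use("Agg")  # render figures off-screen
import matplotlib.pyplot as plt
import numpy as np
from ultralytics import YOLO

VIDEO = "signs/deer/sample_01.mp4"                    # hypothetical path
POSE_WEIGHTS = "yolov8x-pose-p6.pt"                   # Ultralytics pose model (17 COCO keypoints)
CLS_WEIGHTS = "runs/classify/train/weights/best.pt"   # hypothetical trained shape classifier

# 1. Pose estimation: collect the six arm keypoints (shoulders, elbows, wrists) per frame.
ARM_IDX = [5, 6, 7, 8, 9, 10]                         # COCO indices used by YOLOv8-pose
pose_model = YOLO(POSE_WEIGHTS)
paths = []
for result in pose_model(VIDEO, stream=True, verbose=False):
    if result.keypoints is None or len(result.keypoints) == 0:
        continue                                      # no person detected in this frame
    kpts = result.keypoints.xy[0].cpu().numpy()       # (17, 2) for the first person
    paths.append(kpts[ARM_IDX])                       # (6, 2)
paths = np.stack(paths)                               # (N_frames, 6, 2)

# 2. Shape generation: draw each joint path and save the resulting image.
fig, ax = plt.subplots(figsize=(3, 3))
for joint in range(paths.shape[1]):
    ax.plot(paths[:, joint, 0], -paths[:, joint, 1], marker=".", linewidth=1)
ax.axis("off")
fig.savefig("motion_shape.png", dpi=100, bbox_inches="tight")
plt.close(fig)

# 3. Classification: the classifier returns the top-5 classes with confidence scores.
cls_model = YOLO(CLS_WEIGHTS)
pred = cls_model("motion_shape.png", verbose=False)[0]
for idx, conf in zip(pred.probs.top5, pred.probs.top5conf):
    print(pred.names[int(idx)], float(conf))
```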

2.4.1. Visual Sign Signals

To process the visual information, the video frames were cropped to 720 × 720 pixels (see Figure 3), as YOLOv8-pose operates internally on square images. This adjustment does not affect sign visibility, as all relevant joints remain within the square frame.
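As a minimal sketch, the square crop could be performed with OpenCV as follows; the video path is hypothetical, and a horizontally centered crop is assumed (the text above only specifies the 720 × 720 output size).

```python
import cv2

def crop_to_square(frame, size=720):
    """Crop a 900x720 frame to size x size; a centered crop is assumed here."""
    h, w = frame.shape[:2]
    x0 = max((w - size) // 2, 0)
    y0 = max((h - size) // 2, 0)
    return frame[y0:y0 + size, x0:x0 + size]

cap = cv2.VideoCapture("signs/deer/sample_01.mp4")    # hypothetical path
ok, frame = cap.read()
cap.release()
if ok:
    print(crop_to_square(frame).shape)                # (720, 720, 3)
```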

2.4.2. Pose Estimation

In LSM, only the upper part of the body is meaningful in signing; therefore, of the 17 keypoints detected by the selected pose detector, only the 13 corresponding to the upper body are relevant, and the 4 keypoints for the knees and ankles are discarded. If the model fails to detect a joint, it is assigned a null value, which allows these missing values to be easily discarded in further processing. Below is an example of pose estimation applied to the initial and final poses of the “deer” sign (Figure 4), as well as the extraction of the 13 keypoints.
The keypoints are stored in NPY format, a file type used by NumPy for efficiently storing data arrays. These arrays have dimensions of (13, 2, N): keypoints, 2D ( x , y ) coordinates, and the number of frames in each video.
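A sketch of this storage step is shown below, assuming the COCO keypoint ordering used by YOLOv8-pose (keypoints 0–12 cover the face, shoulders, elbows, wrists, and hips; 13–16 are the knees and ankles) and NaN as the null value for missed detections; the actual null representation and file naming are assumptions.

```python
import numpy as np
from ultralytics import YOLO

UPPER_BODY = list(range(13))      # COCO keypoints 0-12: face, shoulders, elbows, wrists, hips
pose_model = YOLO("yolov8x-pose-p6.pt")

frames = []
for result in pose_model("signs/deer/sample_01.mp4", stream=True, verbose=False):
    if result.keypoints is None or len(result.keypoints) == 0:
        frames.append(np.full((13, 2), np.nan))       # no person detected: null frame
        continue
    kpts = result.keypoints.xy[0].cpu().numpy()[UPPER_BODY].astype(float)
    kpts[(kpts == 0).all(axis=1)] = np.nan            # joints reported at (0, 0) -> null
    frames.append(kpts)

keypoints = np.stack(frames, axis=-1)                 # (13, 2, N), as described above
np.save("deer_sample_01.npy", keypoints)              # hypothetical file name
print(keypoints.shape)
```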

2.4.3. Shape Generation

From these arrays, the coordinates corresponding to the wrists, shoulders, and elbows are extracted according to each case study. The positions of these coordinates were plotted for each frame, illustrating the movement pattern of each joint, as shown in Figure 5.
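A minimal sketch of this step is given below, assuming the (13, 2, N) arrays from Section 2.4.2 and the COCO indices for the arm joints; the plotting style of the actual motion shapes (markers, colors, resolution) is not specified here, so the figure settings are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

# COCO indices within the stored 13-keypoint arrays (see Section 2.4.2).
SHOULDERS_WRISTS = [5, 6, 9, 10]                 # case study 1
SHOULDERS_ELBOWS_WRISTS = [5, 6, 7, 8, 9, 10]    # case study 2

def plot_motion_shape(npy_path, joint_indices, out_path):
    kpts = np.load(npy_path)                      # (13, 2, N)
    fig, ax = plt.subplots(figsize=(3, 3))
    for j in joint_indices:
        # Flip y because image coordinates grow downward; NaN frames leave gaps.
        ax.plot(kpts[j, 0, :], -kpts[j, 1, :], marker=".", linewidth=1)
    ax.axis("off")
    fig.savefig(out_path, dpi=100, bbox_inches="tight")
    plt.close(fig)

plot_motion_shape("deer_sample_01.npy", SHOULDERS_ELBOWS_WRISTS, "deer_shape_case2.png")
```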

2.4.4. Classification

The shape classification stage involves assigning each image a label from a predefined set of classes. For this purpose, the YOLOv8x-cls model was employed. This classifier is the most robust of the YOLOv8 classification models and maintains a deep CNN structure. The classifier outputs the top-5 predicted class labels along with their associated confidence scores.
The maximum number of examples per sign in all selected sets is 17; 10 examples were used for training, 2 for validation, and 5 for testing. Table 7 shows the most relevant hyperparameters for model training and configuration, while Table 8 details the data augmentation-related hyperparameters handled by YOLOv8 (not all of these parameters are active).
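As a sketch, training and evaluating the classifier with the Ultralytics API could look like the following; the dataset directory name is hypothetical, and the hyperparameter values shown are placeholders for the values actually listed in Tables 7 and 8.

```python
from ultralytics import YOLO

# Ultralytics classification datasets use an ImageFolder-style layout, e.g.:
#   shapes_subset1/train/<sign>/*.png   (10 shape images per sign)
#   shapes_subset1/val/<sign>/*.png     (2 per sign)
#   shapes_subset1/test/<sign>/*.png    (5 per sign)
model = YOLO("yolov8x-cls.pt")                   # pretrained YOLOv8x classification model

# Placeholder hyperparameters; the values used in this work are given in Tables 7 and 8.
model.train(data="shapes_subset1", epochs=100, imgsz=224, batch=16)

# Validation on the held-out split reports top-1 and top-5 accuracy.
metrics = model.val(data="shapes_subset1", split="test")
print(metrics.top1, metrics.top5)
```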

2.5. Evaluation

Experimentation was conducted on the two case studies outlined in Section 2.2, using the three sets of motion shapes described below. The lists of signs in each subset are shown in Table 9, Table 10 and Table 11.
The first subset consists of a small group of five signs, chosen for their distinguishable shapes based on a qualitative evaluation. The primary objective of this group is to conduct a more controlled evaluation of the neural network, which allows for a clearer analysis of what the network is learning in an environment with fewer variables. Examples of these signs are presented in Figure 6, while the corresponding words are listed in Table 9.
In the second subset, the signs are similarly distinguishable, but with a larger set consisting of 62 signs. The goal now is to assess whether the neural network’s behavior remains consistent with that of the first set, despite the increased number of classes. Some examples of these signs are presented in Figure 7, and the corresponding words are listed in Table 10.
The third subset consists of 16 words related to the semantic field of house. This group is particularly notable for the high number of variants in its signs. As such, this experiment aims to assess the model’s accuracy, as well as its ability to generalize and identify distinctive features within more complex sign language contexts. Examples of the sign forms from this set can be seen in Figure 8, and the corresponding vocabulary is outlined in Table 11.
Once the training stage is completed, the corresponding weights are saved in a custom model, which is then utilized for the subsequent testing phase. During this phase, key performance metrics, such as top-1 and top-5 accuracies, are collected. Top-1 accuracy measures how often the model’s first prediction is correct, while top-5 accuracy evaluates whether the correct class appears among the five most probable predictions.
Top-1 accuracy is computed using the standard approach employed for most classification tasks. It is defined as the proportion of correctly predicted labels over the total number of samples. Let $\hat{y}_i$ be the predicted label for the i-th sample and $y_i$ the corresponding true label. The top-1 accuracy is then calculated as follows [71]:

$$\text{top-1 accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \mathbf{1}(\hat{y}_i = y_i)$$

where $\mathbf{1}(x)$ is the indicator function, which returns 1 if the prediction is correct and 0 otherwise.
On the other hand, top-k accuracy considers a prediction correct if the true label is among the k highest predicted scores. Thus, top-1 accuracy is a special case of top-k accuracy, where $k = 1$.
Let $\hat{f}_{i,j}$ represent the predicted class for the i-th sample that has the j-th highest predicted score, and let $y_i$ be the corresponding true label. The top-k accuracy is then calculated as follows [72]:

$$\text{top-}k\text{ accuracy}(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \sum_{j=1}^{k} \mathbf{1}(\hat{f}_{i,j} = y_i)$$

where k is the number of top predictions considered, and $\mathbf{1}(x)$ is the indicator function.
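These definitions match scikit-learn’s accuracy_score and top_k_accuracy_score; the toy sketch below (with made-up scores, not results from this study) shows both the library calls and an equivalent NumPy computation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, top_k_accuracy_score

# Toy example only: 4 samples, 3 classes, per-class scores from a classifier.
y_true = np.array([0, 1, 2, 2])
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
])

top1 = accuracy_score(y_true, scores.argmax(axis=1))
top2 = top_k_accuracy_score(y_true, scores, k=2, labels=[0, 1, 2])
print(top1, top2)                       # 0.5 and 0.75 for this toy example

# Equivalent NumPy-only top-k: is the true label among the k highest scores?
k = 2
topk_idx = np.argsort(scores, axis=1)[:, -k:]
print((topk_idx == y_true[:, None]).any(axis=1).mean())   # 0.75
```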
These metrics are crucial for assessing the model’s performance in a multi-class classification environment.
Additionally, a confusion matrix is generated for each experiment, providing a detailed overview of correct and incorrect predictions for each class. The results, along with their interpretation and analysis, are discussed in the following section.

3. Results

A total of seven SLR experiments on LSM were conducted (six with our custom dataset and one using an external dataset) to test our approach. The results are presented below.
Performance was evaluated using top-1 accuracy, top-5 accuracy, and the confusion matrix (see Section 2.5), which together provide a comprehensive view of the model’s performance across each subset. In addition, performance graphs depicting loss and accuracy across training epochs are included, allowing observation of the model’s learning curve over time.

3.1. Visual Sign Signal Dataset

3.1.1. First Subset

In the first experiment, five of the most distinguishable classes were selected (see confusion matrices in Figure 9). The results reveal that using only the shoulder and wrist coordinates achieved a top-1 accuracy of 0.9599. However, when the elbow coordinates were included, the top-1 accuracy decreased to 0.8799, suggesting that the additional information had a negative impact on performance.
Both the “son” and “deer” classes were classified with high accuracy in both case studies. However, slight confusion was observed between the “Monday” and “hello” classes in the first case. Additionally, when elbow coordinates were included, the model made errors in three of the five classes, indicating greater difficulty in differentiating between them. The performance graphs show that the accuracy in both models tends to stabilize around the 30th epoch, while the loss continues to decrease. Despite this, the model using only the wrist and shoulder coordinates outperformed the version with elbow coordinates, achieving higher accuracy (see graphs in Figure 10). In summary, the results are highly favorable in the best-case scenario, with a classification rate exceeding 95%. This suggests that the model is capable of effectively distinguishing between a limited number of well-defined classes. However, it is preferable to restrict the analysis to wrist and shoulder data, as including elbow data appears to negatively impact performance.

3.1.2. Second Subset

In the second experiment, we expanded the number of classes to 62, while ensuring that the shapes remained distinguishable from one another (see confusion matrices in Figure 11). The model using only wrist and shoulder coordinates achieved a top-1 accuracy of 0.6375, whereas including elbow information resulted in a slight improvement to 0.6537.
For top-5 accuracy, the results were similar, with the first model achieving an accuracy of 0.8640, which improved to 0.8932 when elbow data was included. Performance analysis during training and validation revealed a consistent trend in both models, that is, accuracy steadily increased while loss progressively decreased (see Figure 12), indicating effective learning. The best model achieved an overall accuracy of 65%, which is acceptable, but showed variability in class performance. Some classes were classified nearly perfectly, while others exhibited notable precision issues. This suggests that, despite clear visual distinctions between classes, the large number of classes (62) combined with the limited number of examples per class (5) may hinder the model’s ability to generalize effectively. In conclusion, although incorporating elbow information improves classification accuracy, the inconsistent performance underscores the need for more examples per class to optimize the model’s results.

3.1.3. Third Subset

In this experiment, the set is composed of 16 words in the home semantic field. The complexity of this group lies in the fact that some signs have variants. It is interesting to note that—in both models—words such as “internet”, “keys”, “mop”, and “window” were classified correctly, since they showed less variability. In contrast, words like “curtains”, “garden”, and “wall” performed poorly in both models (see confusion matrices in Figure 13).
The model using only wrist and shoulder information achieved a top-1 accuracy of 0.6875, while including the elbow coordinates increased the accuracy to 0.7125. For top-5 accuracy, both models achieved a value of 0.9250.
Performance in both studies was quite similar (see the graphs in Figure 14), showing fluctuations during training, but with a tendency to stabilize at a constant value toward the later stages. This suggests that the model managed to learn the main features of the signs, although its generalization capacity is limited by the complexity of the variants within the set. The classification rate reached up to 71% when the elbow information was included, which indicates that this additional information contributes positively to the recognition, although the increase in accuracy is not very significant.
Despite the limitations, the model was able to detect patterns in some cases. However, its ability to generalize across a large number of classes, variants, and a limited number of examples is insufficient. Notwithstanding, the performance graphs reveal a tendency toward stabilization, suggesting that while the model holds potential for certain datasets, it requires additional information—such as finger movements—to enhance its classification accuracy in more complex scenarios.

3.2. Comparison of the Proposed Model on a Keypoint Sign Signal Dataset

In order to compare our approach against other state-of-the-art works, we needed to perform additional experiments on another LSM dataset. We selected from among the publicly available LSM datasets (see Table 2 and Table 3); our selection criterion was the number of SLR studies that used each dataset and reported performance accuracy, to enable a proper comparison. Therefore, we opted for a keypoint sign signal dataset, that is, the MSLR dataset from [23]. Details about this dataset are available in Appendix C. This dataset has been tested and reported by at least three different machine learning models [23,35,73]. In contrast to the visual sign signal dataset, the MSLR dataset required a shorter pipeline; the pipeline is shown in Figure 15.
For this comparison, we tested our arm movement approach with this dataset using all the arm joints. For classification, we trained a model from scratch, using the current YOLO nano architecture [74], YOLO11n-cls, with the PyTorch framework. This architecture uses 86 layers and has a computational complexity of 0.5 GFLOPs, with 1,633,584 parameters, when using a frame size of 224 pixels [75]. The results of this comparison are presented in Table 12.
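A sketch of this from-scratch training with the Ultralytics API is shown below; building the model from its YAML definition gives randomly initialized weights, and the directory of motion-shape images rendered from the MSLR keypoints is a hypothetical placeholder.

```python
from ultralytics import YOLO

# Building from the YAML definition (rather than a .pt checkpoint) yields a
# randomly initialized nano classification model, i.e., training from scratch.
model = YOLO("yolo11n-cls.yaml")

# Hypothetical directory of motion-shape images rendered from the MSLR keypoints,
# organized as train/val/test subfolders with one subfolder per sign.
model.train(data="mslr_shapes", epochs=100, imgsz=224)
metrics = model.val(data="mslr_shapes", split="test")
print(metrics.top1, metrics.top5)
```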

4. Discussion

Table 13 presents the accuracy values based on the top-1 accuracy metric obtained using the YOLOv8x-cls model. The results indicate that including elbow coordinates led to better performance in two out of the three experiments. Although the improvement was modest (ranging from 3% to 4%), it suggests that incorporating additional joint information can contribute to more accurate classifications.
The experiments with various datasets allowed us to observe the behavior of the convolutional neural network (CNN) based on the input data. It became evident that the network’s performance is heavily influenced by the selection of classes. Using all available classes from the database is not always ideal, as this tends to yield suboptimal results. Therefore, a more focused approach, where only relevant classes are included, is recommended for improving model classification.
Despite certain limitations—such as the small number of examples per class, the presence of variants, and the high similarity between some signs—the neural network was still able to classify a significant number of signs correctly and recognize patterns in the movement data. This demonstrates the potential of the YOLOv8 model for this type of task.
Compared to other CNNs, YOLOv8 stands out due to its optimized architecture, which allows for the use of pre-trained models on large datasets like ImageNet. This enables the model to achieve high accuracy and efficiency, making it suitable for real-time applications. However, as with any model, performance is largely dependent on the quality and quantity of the input data. In this case, the limited number of examples (17 per class) restricts the network’s ability to achieve optimal accuracy.
These results highlight both the potential and the limitations of our approach. The experiments demonstrated that it is possible to classify a considerable number of signs, indicating that this dataset and strategy could serve as a useful tool for training a convolutional neural network (CNN), such as YOLOv8. However, the analysis also reveals that the current structure of the dataset—characterized by a limited number of examples, variants between classes, and high similarity among some signs—presents challenges that must be addressed through alternative approaches.
The comparison between the two case studies (with and without elbows) was intended to assess whether the inclusion of a greater number of keypoints improves the performance of the model; the results seem to indicate that it does. The next immediate step is to optimize these results, either by using a different convolutional neural network (CNN) or by exploring different architectures, such as recurrent neural networks (RNNs), while keeping the focus on the use of keypoints, i.e., using pose-based approaches.
Additionally, the study performed on the MSLR dataset showed good results for the proposed approach (with an accuracy of 85.78% using 6 keypoints), compared to the extraordinary results obtained by [23] (with accuracies of 96.44% and 97.11%) and [35] (with accuracies of 99% and 99.75%) while using the complete keypoint sign signals in the dataset; see Table 12. This is an interesting finding that shows the relevance of AM sign features, as most previous research studies have typically focused on HC sign features.

5. Conclusions

This paper presents ongoing work toward the creation of a recognition system for LSM. A decomposition of sign features is proposed into HC, AM, and NHG. Contactless, simple hardware was used for sign signal acquisition. A custom proprietary dataset of 74 signs (70 words and 4 phrases) was constructed for this research. In contrast to most LSM research, this paper reports an analysis focused on the AM part of signs, rather than on HC-focused or holistic approaches (HC + AM + NHG).
The analysis was conducted through a series of classification experiments using YOLOv8, aimed at identifying visual patterns in the movement of key joints, i.e., wrists, shoulders, and elbows. A pose detection model was used to extract joint movements, followed by an image classification model (both integrated into YOLOv8) to classify the shapes generated by these movements.
These experiments are the first stage of a larger project. For now, we are focusing on the analysis of arm movement (shoulders, elbows, and wrists) because it is a less-studied feature, and information can be extracted from it using a relatively simple methodology.
Later, the goal will be to integrate other essential components of sign language, such as manual configuration and non-hand gestures, to develop a more complete system. Ultimately, this will support progress toward automatic sign language recognition.

Author Contributions

Conceptualization, G.H.-A., K.O.-H. and M.C.; methodology, G.H.-A., K.O.-H. and M.C.; software, G.H.-A. and K.O.-H.; validation, G.H.-A.; formal analysis, G.H.-A., K.O.-H. and M.C.; investigation, G.H.-A., K.O.-H. and M.C.; resources, K.O.-H. and M.C.; data curation, G.H.-A.; writing—original draft preparation, G.H.-A., K.O.-H. and I.L.-J.; writing—review and editing, G.H.-A., K.O.-H., M.C. and I.L.-J.; visualization, G.H.-A.; supervision, K.O.-H. and M.C.; project administration, K.O.-H.; funding acquisition, G.H.-A. and I.L.-J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by CONAHCYT through scholarship grant number 828990.

Institutional Review Board Statement

Ethical review and approval were conducted by the Ethics Committee of Cinvestav (protocol code: 105/2023; date of approval: 7 December 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study, and due to technical and time limitations. Requests to access the datasets should be directed to keny.ordaz@cinvestav.edu.mx.

Acknowledgments

We thank Felipe Hernández Rodríguez for providing a space at his institution for dataset acquisition. We thank Hilda Xóchitl Cabrera Hernández, Daniela Fernanda Espinoza Ibarra, and María Guadalupe Luna Arguello for their help with contacting participants.

Conflicts of Interest

The authors declare no conflicts of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AM | arm movement
API | application programming interface
ArSL | Arabic Sign Language
ASL | American Sign Language
BdSL | Bangladeshi Sign Language
CESAR | Recife Center for Advanced Studies and Systems
CSL | Chinese Sign Language
CSV | comma-separated values
CNN | convolutional neural network
DGS | German Sign Language (Deutsche Gebärdensprache)
EMG | electromyography
FLV | flash video
fps | frames per second, frame rate
GCAR | graph convolution with attention and residual connection
GCN | graph convolutional network
GRU | gated recurrent units
HC | hand configuration
IMU | inertial measurement unit
JPEG | Joint Photographic Experts Group, ISO/IEC 10918
KSL | Korean Sign Language
LIBRAS | Brazilian Sign Language (Língua Brasileira de Sinais)
LSA | Argentinian Sign Language (Lengua de Señas Argentina)
LSE | Spanish Sign Language (Lengua de Señas Española)
LSM | Mexican Sign Language (Lengua de Señas Mexicana)
LSTM | long short-term memory
MKV | Matroska video
MLP | multilayer perceptron
MSLR | Mexican Sign Language Recognition dataset
MP4 | MPEG-4 Part 14, ISO/IEC 14496-14:2003
NHG | non-hand gesture
NN | neural network
NPY | NumPy standard binary file format
PJM | Polish Sign Language (Polski Język Migowy)
PSL | Pakistan Sign Language
RGBD | red, green, blue, and depth
RNN | recurrent neural network
sEMG | surface EMG
SL | sign language
SLR | sign language recognition
SVM | support vector machine
SWF | small web format
TSL | Turkish Sign Language
YOLO | you only look once

Appendix A. Digital Glossary of LSM

The GDLSM [38] has 747 signs grouped into 19 thematic categories. We provide direct links to some of the signs included in this digital glossary, which were mentioned in Section 1.1.2.

Appendix B. Comparison Between MediaPipe and YOLOv8 Pose Detection Models

MediaPipe detects 33 keypoints with its Pose Landmarker (Heavy) model and can provide 2D and 3D coordinates. YOLOv8 detects 17 keypoints with its YOLOv8x-pose-p6 model and provides 2D coordinates. YOLOv8 keypoints 5–10 correspond to the shoulder, elbow, and wrist joints, and MediaPipe keypoints 11–16 correspond to the same joints. We compared the MediaPipe and YOLOv8 pose detectors on several signs and chose YOLOv8 over MediaPipe because MediaPipe frequently lost track of the wrist joint in many of the signs, particularly when the hands were occluded. An example of this issue is shown in Figure A1.
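For reference, a small sketch of the arm-joint index mapping between the two estimators (using the standard COCO and MediaPipe Pose landmark orderings) is given below.

```python
# Arm-joint keypoint indices in the two pose estimators compared in this appendix.
YOLOV8_ARM_KEYPOINTS = {          # COCO ordering used by YOLOv8-pose (17 keypoints)
    5: "left shoulder", 6: "right shoulder",
    7: "left elbow", 8: "right elbow",
    9: "left wrist", 10: "right wrist",
}
MEDIAPIPE_ARM_LANDMARKS = {       # MediaPipe Pose Landmarker (33 landmarks)
    11: "left shoulder", 12: "right shoulder",
    13: "left elbow", 14: "right elbow",
    15: "left wrist", 16: "right wrist",
}
```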
Figure A1. Comparison of wrist joint tracking between YOLOv8 and MediaPipe. Example with the “state” sign. Top row: MediaPipe. Bottom row: YOLOv8 pose detector. Four inner frames: MediaPipe loses track of the wrist joint, while YOLOv8 keeps track of the AM in all frames.

Appendix C. MSLR Dataset

The Mexican Sign Language Recognition (MSLR) dataset was created by [23]. It contains samples of 30 signs in LSM; see Table A1. This is a keypoint-based sign signal dataset, as shown in Table A2. Each sample consists of 20 frames, with 67 pose keypoints recorded per frame. The keypoints are distributed as follows: 20 for the face, 5 for the body (shoulders, elbows, and a midpoint between the shoulders), and 21 for each hand.
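A sketch of how one MSLR sample could be loaded and its body keypoints selected is shown below; the file path, the absence of a header row, and the block ordering (face, body, left hand, right hand) are assumptions for illustration, so the actual column layout should be checked against the dataset documentation in [23].

```python
import numpy as np
import pandas as pd

# One CSV per sample: 20 rows (frames) x 201 columns (67 keypoints x (x, y, z)); see Table 3.
sample = pd.read_csv("mslr/sample_0001.csv", header=None).to_numpy()   # hypothetical path
frames = sample.reshape(-1, 67, 3)            # (20, 67, 3) per the dataset description

FACE, BODY = 20, 5                            # keypoints per block (hands: 21 each)
body = frames[:, FACE:FACE + BODY, :]         # assumed order: shoulders, elbows, mid-shoulder
arm_xy = body[:, :, :2]                       # keep 2D coordinates for the AM analysis
print(arm_xy.shape)                           # (20, 5, 2)
```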
Table A1. Signs in the MSLR [23] dataset.
No. | Group * | Sign
1 | alphabet | A
2 | alphabet | B
3 | alphabet | C
4 | alphabet | D
5 | alphabet | J
6 | alphabet | K
7 | alphabet | Q
8 | alphabet | X
9 | questions | What?
10 | questions | When?
11 | questions | How much?
12 | questions | Where?
13 | questions | For what?
14 | questions | Why?
15 | questions | What is that?
16 | questions | Who?
17 | days of the week | Monday
18 | days of the week | Tuesday
19 | days of the week | Wednesday
20 | days of the week | Thursday
21 | days of the week | Friday
22 | days of the week | Saturday
23 | days of the week | Sunday
24 | frequent words | (to) spell
25 | frequent words | (to) explain
26 | frequent words | thank you
27 | frequent words | name
28 | frequent words | please
29 | frequent words | yes
30 | frequent words | no
* Group names and information taken from [23] (Table 2).
Table A2. MSLR [23] dataset.
Feature | Description
Signs * | 8 L, 21 W, 1 P
Signers | 4
Samples | 30 signs with 100 samples
Sign features | HC, AM, NHG
Sign signal | Keypoints
File format | CSV files
Samples for training | 70 samples
Samples for validation | 15 samples
Samples for testing | 15 samples
* L: letters, W: words; P: phrase. This split was defined by the dataset authors.

References

  1. World Health Organization. World Report on Hearing. 2021. Available online: https://www.who.int/publications/i/item/9789240020481 (accessed on 31 March 2025).
  2. Secretaría de Salud. 530. Con Discapacidad Auditiva, 2.3 Millones de Personas: Instituto Nacional de Rehabilitación. 2021. Available online: https://www.gob.mx/salud/prensa/530-con-discapacidad-auditiva-2-3-millones-de-personas-instituto-nacional-de-rehabilitacion (accessed on 31 March 2025).
  3. SLAIT. SLAIT—AI-Driven American Sign Language Translator. 2024. Available online: https://slait.ai (accessed on 29 March 2025).
  4. Lenovo. Lenovo’s AI-Powered Sign Language Translation Solution Empowers Signers in Brazil. 2023. Available online: https://news.lenovo.com/ai-powered-sign-language-translation-solution-hearing-brazil/ (accessed on 31 March 2025).
  5. Rocha, J.V.; Lensk, J.; Ferreira, M.D.C. Techniques for Determining Sign Language Gesture Partially Shown in Image(s). U.S. Patent 11587362B2, 21 February 2023. [Google Scholar]
  6. Mane, V.; Puniwala, S.N.; Rane, V.N.; Gurav, P. Advancements in Sign Language Recognition: Empowering Communication for Individuals with Speech Impairments. Grenze Int. J. Eng. Technol. (GIJET) 2024, 10, 4978–4984. [Google Scholar]
  7. Krishnan, S.R.; Varghese, C.M.; Jayaraj, A.; Nair, A.S.; Joshy, D.; Sulbi, I.N. Advancements in Sign Language Recognition: Dataset Influence on Model Accuracy. In Proceedings of the 2024 International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS), Bengaluru, India, 17–18 December 2024; pp. 1563–1568. [Google Scholar] [CrossRef]
  8. Chiradeja, P.; Liang, Y.; Jettanasen, C. Sign Language Sentence Recognition Using Hybrid Graph Embedding and Adaptive Convolutional Networks. Appl. Sci. 2025, 15, 2957. [Google Scholar] [CrossRef]
  9. Umut, I.; Kumdereli, U.C. Novel Wearable System to Recognize Sign Language in Real Time. Sensors 2024, 24, 4613. [Google Scholar] [CrossRef] [PubMed]
  10. Rodríguez-Tapia, B.; Ochoa-Zezzatti, A.; Marrufo, A.I.S.; Arballo, N.C.; Carlos, P.A. Sign Language Recognition Based on EMG Signals through a Hibrid Intelligent System. Res. Comput. Sci. 2019, 148, 253–262. [Google Scholar] [CrossRef]
  11. Gu, Y.; Oku, H.; Todoh, M. American Sign Language Recognition and Translation Using Perception Neuron Wearable Inertial Motion Capture System. Sensors 2024, 24, 453. [Google Scholar] [CrossRef]
  12. Filipowska, A.; Filipowski, W.; Mieszczanin, J.; Bryzik, K.; Henkel, M.; Skwarek, E.; Raif, P.; Sieciński, S.; Doniec, R.; Mika, B.; et al. Pattern Recognition in the Processing of Electromyographic Signals for Selected Expressions of Polish Sign Language. Sensors 2024, 24, 6710. [Google Scholar] [CrossRef]
  13. Galván-Ruiz, J.; Travieso-González, C.M.; Pinan-Roescher, A.; Alonso-Hernández, J.B. Robust Identification System for Spanish Sign Language Based on Three-Dimensional Frame Information. Sensors 2023, 23, 481. [Google Scholar] [CrossRef]
  14. Hao, Z.; Duan, Y.; Dang, X.; Liu, Y.; Zhang, D. Wi-SL: Contactless Fine-Grained Gesture Recognition Uses Channel State Information. Sensors 2020, 20, 4025. [Google Scholar] [CrossRef]
  15. Wang, Y.; Hao, Z.; Dang, X.; Zhang, Z.; Li, M. UltrasonicGS: A Highly Robust Gesture and Sign Language Recognition Method Based on Ultrasonic Signals. Sensors 2023, 23, 1790. [Google Scholar] [CrossRef]
  16. Al-Saidi, M.; Ballagi, A.; Hassen, O.A.; Saad, S.M. Type-2 Neutrosophic Markov Chain Model for Subject-Independent Sign Language Recognition: A New Uncertainty–Aware Soft Sensor Paradigm. Sensors 2024, 24, 7828. [Google Scholar] [CrossRef]
  17. Gao, Q.; Hu, J.; Mai, H.; Ju, Z. Holistic-Based Cross-Attention Modal Fusion Network for Video Sign Language Recognition. IEEE Trans. Comput. Soc. Syst. 2024; early access. [Google Scholar] [CrossRef]
  18. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.; Lee, J.; et al. MediaPipe: A Framework for Perceiving and Processing Reality. In Proceedings of the Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, CA, USA, 17 June 2019. [Google Scholar]
  19. Serafín De Fleischmann, M.; González Pérez, R. Manos con voz: Diccionario de Lengua de Señas Mexicana; Consejo Nacional para Prevenir la Discriminación: Mexico City, Mexico, 2011. [Google Scholar]
  20. Martínez-Sánchez, V.; Villalón-Turrubiates, I.; Cervantes-Álvarez, F.; Hernández-Mejía, C. Exploring a Novel Mexican Sign Language Lexicon Video Dataset. Multimodal Technol. Interact. 2023, 7, 83. [Google Scholar] [CrossRef]
  21. Espejel-Cabrera, J.; Dominguez, L.; Cervantes, J.; Cervantes, J. Mexican Sign Language Dataset. 2023. Available online: https://data.mendeley.com/datasets/6rj76z6y3n/1 (accessed on 31 March 2025). [CrossRef]
  22. Espejel, J.; Jalili, L.D.; Cervantes, J.; Canales, J.C. Sign language images dataset from Mexican sign language. Data Brief 2024, 55, 110566. [Google Scholar] [CrossRef]
  23. Mejía-Peréz, K.; Córdova-Esparza, D.M.; Terven, J.; Herrera-Navarro, A.M.; García-Ramírez, T.; Ramírez-Pedraza, A. Automatic Recognition of Mexican Sign Language Using a Depth Camera and Recurrent Neural Networks. Appl. Sci. 2022, 12, 5523. [Google Scholar] [CrossRef]
  24. Yao, D.; Wang, W.; Wang, H.; Luo, Y.; Ding, H.; Gu, Y.; Wu, H.; Tao, K.; Yang, B.R.; Pan, S.; et al. Ultrasensitive and Breathable Hydrogel Fiber-Based Strain Sensors Enabled by Customized Crack Design for Wireless Sign Language Recognition. Adv. Funct. Mater. 2025, 35, 2416482. [Google Scholar] [CrossRef]
  25. Wei, C.; Liu, S.; Yuan, J.; Zhu, R. Multimodal hand/finger movement sensing and fuzzy encoding for data-efficient universal sign language recognition. InfoMat 2025, 7, e12642. [Google Scholar] [CrossRef]
  26. Wang, W.; Bo, X.; Li, W.; Eldaly, A.B.M.; Wang, L.; Li, W.J.; Chan, L.L.H.; Daoud, W.A. Triboelectric Bending Sensors for AI-Enabled Sign Language Recognition. Adv. Sci. 2025, 12, 2408384. [Google Scholar] [CrossRef]
  27. Urrea, C.; Kern, J.; Navarrete, R. Bioinspired Photoreceptors with Neural Network for Recognition and Classification of Sign Language Gesture. Sensors 2023, 23, 9646. [Google Scholar] [CrossRef]
  28. Niu, P. Convolutional neural network for gesture recognition human-computer interaction system design. PLoS ONE 2025, 20, e0311941. [Google Scholar] [CrossRef]
  29. Raihan, M.J.; Labib, M.I.; Jim, A.A.J.; Tiang, J.J.; Biswas, U.; Nahid, A.A. Bengali-Sign: A Machine Learning-Based Bengali Sign Language Interpretation for Deaf and Non-Verbal People. Sensors 2024, 24, 5351. [Google Scholar] [CrossRef]
  30. Woods, L.T.; Rana, Z.A. Modelling Sign Language with Encoder-Only Transformers and Human Pose Estimation Keypoint Data. Mathematics 2023, 11, 2129. [Google Scholar] [CrossRef]
  31. Eunice, J.; J, A.; Sei, Y.; Hemanth, D.J. Sign2Pose: A Pose-Based Approach for Gloss Prediction Using a Transformer Model. Sensors 2023, 23, 2853. [Google Scholar] [CrossRef] [PubMed]
  32. Kim, Y.; Baek, H. Preprocessing for Keypoint-Based Sign Language Translation without Glosses. Sensors 2023, 23, 3231. [Google Scholar] [CrossRef] [PubMed]
  33. Boháček, M.; Hrúz, M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2022; pp. 182–191. [Google Scholar] [CrossRef]
  34. Cihan Camgöz, N.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10020–10030. [Google Scholar] [CrossRef]
  35. Miah, A.S.M.; Hasan, M.A.M.; Nishimura, S.; Shin, J. Sign Language Recognition Using Graph and General Deep Neural Network Based on Large Scale Dataset. IEEE Access 2024, 12, 34553–34569. [Google Scholar] [CrossRef]
  36. Gil-Martín, M.; Villa-Monedero, M.; Pomirski, A.; Sáez-Trigueros, D.; San-Segundo, R. Sign Language Motion Generation from Sign Characteristics. Sensors 2023, 23, 9365. [Google Scholar] [CrossRef]
  37. Villa-Monedero, M.; Gil-Martín, M.; Sáez-Trigueros, D.; Pomirski, A.; San-Segundo, R. Sign Language Dataset for Automatic Motion Generation. J. Imaging 2023, 9, 262. [Google Scholar] [CrossRef]
  38. INDISCAPACIDAD. Glosario Digital de Lengua de Señas Mexicana. 2024. Available online: https://lsm.indiscapacidad.cdmx.gob.mx (accessed on 31 March 2025).
  39. Calvo-Hernández, M.T. Diccionario Español-Lengua de Señas Mexicana (DIELSEME). 2004. Available online: http://campusdee.ddns.net/dielseme.aspx (accessed on 31 March 2025).
  40. Álvarez Hidalgo, A.; Acosta-Arellano, A.; Moctezuma-Contreras, C.; Sanabria-Ramos, E. Diccionario Lengua de Señas Mexicana (DIELSEME 2). 2009. Available online: http://campusdee.ddns.net/dielseme.aspx (accessed on 31 March 2025).
  41. Cruz-Aldrete, M. Hacia la construcción de un diccionario de Lengua de Señas Mexicana. Rev. Investig. 2014, 38, 57–80. [Google Scholar]
  42. Solís, F.; Martínez, D.; Espinoza, O. Automatic Mexican Sign Language Recognition Using Normalized Moments and Artificial Neural Networks. Engineering 2016, 8, 733–740. [Google Scholar] [CrossRef]
  43. Carmona-Arroyo, G.; Rios-Figueroa, H.V.; Avendaño-Garrido, M.L. Mexican Sign-Language Static-Alphabet Recognition Using 3D Affine Invariants. In Machine Vision Inspection Systems, Volume 2: Machine Learning-Based Approaches; Scrivener Publishing LLC: Beverly, MA, USA, 2021; pp. 171–192. [Google Scholar] [CrossRef]
  44. Salinas-Medina, A.; Neme-Castillo, J.A. A real-time deep learning system for the translation of mexican signal language into text. In Proceedings of the 2021 Mexican International Conference on Computer Science (ENC), Morelia, Mexico, 9–11 August 2021; pp. 1–7. [Google Scholar] [CrossRef]
  45. Rios-Figueroa, H.V.; Sánchez-García, A.J.; Sosa-Jiménez, C.O.; Solís-González-Cosío, A.L. Use of Spherical and Cartesian Features for Learning and Recognition of the Static Mexican Sign Language Alphabet. Mathematics 2022, 10, 2904. [Google Scholar] [CrossRef]
  46. Morfín-Chávez, R.F.; Gortarez-Pelayo, J.J.; Lopez-Nava, I.H. Fingerspelling Recognition in Mexican Sign Language (LSM) Using Machine Learning. In Advances in Computational Intelligence; Calvo, H., Martínez-Villaseñor, L., Ponce, H., Eds.; Springer: Cham, Switzerland, 2023; pp. 110–120. [Google Scholar] [CrossRef]
  47. Sánchez-Vicinaiz, T.J.; Camacho-Pérez, E.; Castillo-Atoche, A.A.; Cruz-Fernandez, M.; García-Martínez, J.R.; Rodríguez-Reséndiz, J. MediaPipe Frame and Convolutional Neural Networks-Based Fingerspelling Detection in Mexican Sign Language. Technologies 2024, 12, 124. [Google Scholar] [CrossRef]
  48. García-Gil, G.; López-Armas, G.d.C.; Sánchez-Escobar, J.J.; Salazar-Torres, B.A.; Rodríguez-Vázquez, A.N. Real-Time Machine Learning for Accurate Mexican Sign Language Identification: A Distal Phalanges Approach. Technologies 2024, 12, 152. [Google Scholar] [CrossRef]
  49. Jimenez, J.; Martin, A.; Uc, V.; Espinosa, A. Mexican Sign Language Alphanumerical Gestures Recognition using 3D Haar-like Features. IEEE Lat. Am. Trans. 2017, 15, 2000–2005. [Google Scholar] [CrossRef]
  50. Martinez-Seis, B.; Pichardo-Lagunas, O.; Rodriguez-Aguilar, E.; Saucedo-Diaz, E.R. Identification of Static and Dynamic Signs of the Mexican Sign Language Alphabet for Smartphones using Deep Learning and Image Processing. Res. Comput. Sci. 2019, 148, 199–211. [Google Scholar] [CrossRef]
  51. Martínez-Gutiérrez, M.E.; Rojano-Cáceres, J.R.; Benítez-Guerrero, E.; Sánchez-Barrera, H.E. Data Acquisition Software for Sign Language Recognition. Res. Comput. Sci. 2019, 148, 205–211. [Google Scholar] [CrossRef]
  52. Rodriguez, M.; Oubram, O.; Ali, B.; Lakouari, N. Mexican Sign Language’s Dactylology and Ten First Numbers–Extracted Features and Models. 2023. Available online: https://data.mendeley.com/datasets/hmsc33hmkz/1 (accessed on 31 March 2025). [CrossRef]
  53. Sosa-Jiménez, C.O.; Ríos-Figueroa, H.V.; Rechy-Ramírez, E.J.; Marin-Hernandez, A.; González-Cosío, A.L.S. Real-time Mexican Sign Language recognition. In Proceedings of the 2017 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), Ixtapa, Mexico, 8–10 November 2017; pp. 1–6. [Google Scholar] [CrossRef]
  54. García-Bautista, G.; Trujillo-Romero, F.; Caballero-Morales, S.O. Mexican Sign Language Recognition Using Kinect and Data Time Warping Algorithm. In Proceedings of the 2017 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, Mexico, 22–24 February 2017; pp. 1–5. [Google Scholar] [CrossRef]
  55. Trujillo-Romero, F.; García-Bautista, G. Mexican Sign Language Corpus: Towards an Automatic Translator. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 212. [Google Scholar] [CrossRef]
  56. Espejel-Cabrera, J.; Cervantes, J.; García-Lamont, F.; Ruiz-Castilla, J.S.; Jalili, L.D. Mexican sign language segmentation using color based neuronal networks to detect the individual skin color. Expert Syst. Appl. 2021, 183, 115295. [Google Scholar] [CrossRef]
  57. Rodriguez, M.; Oubram, O.; Bassam, A.; Lakouari, N.; Tariq, R. Mexican Sign Language Recognition: Dataset Creation and Performance Evaluation Using MediaPipe and Machine Learning Techniques. Electronics 2025, 14, 1423. [Google Scholar] [CrossRef]
  58. Sosa-Jiménez, C.O.; Ríos-Figueroa, H.V.; Solís-González-Cosío, A.L. A Prototype for Mexican Sign Language Recognition and Synthesis in Support of a Primary Care Physician. IEEE Access 2022, 10, 127620–127635. [Google Scholar] [CrossRef]
  59. Varela-Santos, H.; Morales-Jiménez, A.; Córdova-Esparza, D.M.; Terven, J.; Mirelez-Delgado, F.D.; Orenday-Delgado, A. Assistive Device for the Translation from Mexican Sign Language to Verbal Language. Comput. Sist. 2021, 25, 451–464. [Google Scholar] [CrossRef]
  60. Martínez-Guevara, N.; Curiel, A. Quantitative Analysis of Hand Locations in both Sign Language and Non-linguistic Gesture Videos. In Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, Turin, Italy, 20 May 2024; pp. 225–234. [Google Scholar]
  61. Martínez-Guevara, N.; Rojano-Cáceres, J.R.; Curiel, A. Detection of Phonetic Units of the Mexican Sign Language. In Proceedings of the 2019 International Conference on Inclusive Technologies and Education (CONTIE), San Jose del Cabo, Mexico, 30 October–1 November 2019; pp. 168–1685. [Google Scholar] [CrossRef]
  62. Martínez-Guevara, N.; Rojano-Cáceres, J.R.; Curiel, A. Unsupervised extraction of phonetic units in sign language videos for natural language processing. Univers. Access Inf. Soc. 2023, 22, 1143–1151. [Google Scholar] [CrossRef]
  63. González-Rodríguez, J.R.; Córdova-Esparza, D.M.; Terven, J.; Romero-González, J.A. Towards a Bidirectional Mexican Sign Language–Spanish Translation System: A Deep Learning Approach. Technologies 2024, 12, 7. [Google Scholar] [CrossRef]
  64. López-García, L.A.; Rodríguez-Cervantes, R.M.; Zamora-Martínez, M.G.; Esteban-Sosa, S.S. Mis Manos que Hablan, Lengua de Señas para Sordos; Editorial Trillas: Mexico City, Mexico, 2008. [Google Scholar]
  65. Cruz-Aldrete, M. Gramática de la Lengua de Señas Mexicana; El Colegio de México: Mexico City, Mexico, 2008. [Google Scholar]
  66. Escobedo-Delgado, C.E. (Ed.) Diccionario de Lengua de Señas Mexicana de la Ciudad de México; INDEPEDI: Mexico City, Mexico, 2017. [Google Scholar]
  67. Sánchez-Brizuela, G.; Cisnal, A.; de la Fuente-López, E.; Fraile, J.C.; Pérez-Turiel, J. Lightweight real-time hand segmentation leveraging MediaPipe landmark detection. Virtual Real. 2023, 27, 3125–3132. [Google Scholar] [CrossRef]
  68. Hanke, T. HamNoSys—Representing Sign Language Data in Language Resources and Language Processing Contexts. In Proceedings of the LREC 2004, Workshop Proceedings: Representation and Processing of Sign Languages, Lisbon, Portugal, 26–28 May 2004; Streiter, O., Vettori, C., Eds.; European Language Resources Association (ELRA): Paris, France, 2004; pp. 1–6. [Google Scholar]
  69. Rasheed, A.F.; Zarkoosh, M. Optimized YOLOv8 for multi-scale object detection. J. Real-Time Image Process. 2024, 22, 6. [Google Scholar] [CrossRef]
  70. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 1–16. [Google Scholar] [CrossRef]
  71. Scikit-Learn. Accuracy Score. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score (accessed on 19 March 2025).
  72. Scikit-Learn. Top-k Accuracy Score. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#top-k-accuracy-score (accessed on 19 March 2025).
  73. Miah, A.S.M.; Hasan, M.A.M.; Shin, J. Dynamic Hand Gesture Recognition Using Multi-Branch Attention Based Graph and General Deep Learning Model. IEEE Access 2023, 11, 4703–4716. [Google Scholar] [CrossRef]
  74. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  75. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO Docs: Image Classification. Available online: https://docs.ultralytics.com/tasks/classify/#models (accessed on 5 June 2025).
Figure 2. Pipeline of the arm movement approach for SLR.
Figure 3. Dimensions of original and cropped frames.
Figure 4. Pose detection of the “deer” sign. (Left): neutral pose. (Right): final pose.
Figure 5. Movement shapes for the “deer” sign. (Left): only wrists and shoulders. (Right): also elbows.
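As a complement to Figures 4 and 5, the sketch below illustrates one possible way to obtain the six arm-joint keypoints (wrists, elbows, and shoulders) with a YOLOv8 pose model and to render their paths as a movement-shape image for the classifier. The model variant, COCO keypoint indices, video filename, and plotting choices are illustrative assumptions rather than the exact implementation used in this work.

```python
# Illustrative sketch (not the authors' exact code): track the arm joints with
# a YOLOv8 pose model and draw their trajectories as a movement-shape image.
import cv2
import matplotlib.pyplot as plt
from ultralytics import YOLO

# COCO keypoint indices used by YOLOv8 pose models (assumed ordering):
# 5/6 = shoulders, 7/8 = elbows, 9/10 = wrists (left/right).
ARM_JOINTS = [5, 6, 7, 8, 9, 10]

model = YOLO("yolov8n-pose.pt")            # pretrained pose model (assumed variant)
paths = {idx: [] for idx in ARM_JOINTS}    # per-joint (x, y) trajectories

cap = cv2.VideoCapture("deer_sign.mkv")    # hypothetical sample video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]
    if result.keypoints is None or len(result.keypoints.xy) == 0:
        continue                           # no signer detected in this frame
    kpts = result.keypoints.xy[0]          # first detected person, shape (17, 2)
    for idx in ARM_JOINTS:
        x, y = kpts[idx].tolist()
        paths[idx].append((x, y))
cap.release()

# Render the joint paths on a blank canvas, roughly 224 x 224 px for the classifier.
fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
for pts in paths.values():
    if pts:
        xs, ys = zip(*pts)
        ax.plot(xs, ys, linewidth=2)
ax.invert_yaxis()                          # image coordinates grow downward
ax.axis("off")
fig.savefig("deer_shape.png", bbox_inches="tight")
```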
Figure 6. Shapes of the first subset (see words in Table 9). (Top): only wrists and shoulders. (Bottom): also elbows.
Figure 7. Shape examples of the second subset (“hug”, “tall”, “atole”, “airplane”, “flag”, and “bicycle”). (Top): only wrists and shoulders. (Bottom): also elbows.
Figure 8. Shape examples of the third subset (“garbage”, “trash can”, “house”, “curtains”, “electricity”, and “stairs”). (Top): only wrists and shoulders. (Bottom): also elbows.
Figure 9. Confusion matrices for the first subset. (Left): only wrists and shoulders. (Right): also elbows.
Figure 10. Performance charts for the first subset. (Left): only wrists and shoulders. (Right): also elbows.
Figure 11. Confusion matrices for the second subset. (Left): only wrists and shoulders. (Right): also elbows.
Figure 12. Performance charts for the second subset. (Left): only wrists and shoulders. (Right): also elbows.
Figure 13. Confusion matrices for the third subset. (Left): only wrists and shoulders. (Right): also elbows.
Figure 14. Performance charts for the third subset. (Left): only wrists and shoulders. (Right): also elbows.
Figure 15. Pipeline of the arm movement approach for SLR with the MSLR dataset.
Table 5. Custom dataset.
Feature | Description
Signs * | 70 W, 4 P
Signers | 17
Samples | 73 signs with 17 samples, 1 sign with 16 samples
Sign features | HC, AM, NHG
Sign signal | Visual
Signal properties | 900 × 720 @ 90 fps
File format | MKV videos
Samples for training | 10 samples
Samples for validation | 2 samples
Samples for testing | 5 samples
* W: words; P: phrase.
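As a hypothetical illustration of how the split in Table 5 (10 training, 2 validation, and 5 test samples per sign) could be materialized, the sketch below arranges per-sign shape images into the class-subfolder layout used by the Ultralytics classification task; the directory names are assumptions, not the authors' tooling.

```python
# Minimal sketch (assumed folder names): split the per-sign shape images into
# the 10/2/5 train/val/test partition of Table 5, using the class-subfolder
# layout expected by Ultralytics image classification.
import random
import shutil
from pathlib import Path

SRC = Path("shapes")                        # hypothetical: shapes/<sign>/<sample>.png
DST = Path("lsm_shapes_split")
SPLIT = {"train": 10, "val": 2, "test": 5}  # samples per sign (Table 5)

random.seed(0)
for sign_dir in sorted(p for p in SRC.iterdir() if p.is_dir()):
    samples = sorted(sign_dir.glob("*.png"))
    random.shuffle(samples)
    start = 0
    for split, count in SPLIT.items():
        for img in samples[start:start + count]:
            out = DST / split / sign_dir.name
            out.mkdir(parents=True, exist_ok=True)
            shutil.copy(img, out / img.name)
        start += count
```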
Table 6. Computational resources.
Component | Version/Model
Operating system | Ubuntu 22.04.2 (Canonical Ltd., London, England)
Graphics card | Asus ROG STRIX GeForce RTX 2080 Ti O11G (ASUS Holdings Mexico S.A. de C.V., Mexico City, Mexico)
Computing API | CUDA 12.4 (NVIDIA Corporation, Santa Clara, CA, USA)
Programming language | Python 3.11.8 (Python Software Foundation, Beaverton, OR, USA)
Machine Learning library | PyTorch 2.2.2 (Linux Foundation, San Francisco, CA, USA)
Framework | YOLO 8.1.47 (Ultralytics Inc., Frederick, MD, USA)
Table 7. Training parameters and their descriptions.
Parameter | Value | Description
epochs | 50 | Number of epochs or training cycles.
batch | 16 | Number of images processed in each iteration.
imgsz | 224 | Size of the images input into the model.
patience | 100 | Number of epochs without improvement before stopping the training.
lr0 | 0.01 | Initial learning rate.
pre-trained | True | Indicates that the model uses pre-trained weights (ImageNet).
single_cls | False | If set to true, the model classifies into a single class.
dropout | 0.0 | Dropout rate; a regularization technique used to reduce overfitting in artificial neural networks.
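The Table 7 values map directly onto the Ultralytics train() interface. The sketch below is a hedged illustration of such a call, assuming a yolov8n-cls.pt starting checkpoint and a hypothetical dataset path; it should not be read as the exact training script used in this study.

```python
# Sketch of a YOLOv8 classification training run with the Table 7 hyperparameters.
from ultralytics import YOLO

model = YOLO("yolov8n-cls.pt")      # pretrained classification weights (assumed variant)
model.train(
    data="lsm_shapes_split",        # hypothetical dataset root (train/val/test subfolders)
    epochs=50,                      # training cycles
    batch=16,                       # images per iteration
    imgsz=224,                      # input image size
    patience=100,                   # epochs without improvement before early stopping
    lr0=0.01,                       # initial learning rate
    pretrained=True,                # start from pre-trained (ImageNet) weights
    single_cls=False,               # keep one class per sign
    dropout=0.0,                    # no extra dropout regularization
)
```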
Table 8. Image augmentation parameters and their descriptions.
Parameter | Value | Description
hsv_h | 0.015 | Hue of the image in the HSV color space.
hsv_s | 0.7 | Saturation of the image in the HSV color space.
hsv_v | 0.4 | Brightness of the image in the HSV color space.
degrees | 0.0 | Random rotation applied to the images.
translate | 0.1 | Random translation of the images.
scale | 0.5 | Random scaling factor applied to the images.
shear | 0.0 | Random shear angle applied to the images.
perspective | 0.0 | Perspective transformation applied to the images.
flipud | 0.0 | Probability of flipping the image vertically.
fliplr | 0.5 | Probability of flipping the image horizontally.
bgr | 0.0 | BGR to RGB color space correction factor.
mosaic | 1.0 | Probability of using the mosaic technique to combine images.
mixup | 0.0 | Probability of mixing two images.
copy_paste | 0.0 | Technique of copying and pasting objects between images.
auto_augment | randaugment | Type of data augmentation used.
erasing | 0.4 | Probability of erasing parts of the image to simulate occlusions.
crop_fraction | 1.0 | Proportion of the image to be cropped. A value of 1.0 indicates no cropping.
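The augmentation settings in Table 8 can be supplied to the same train() call as keyword overrides. The sketch below simply copies the listed values; it assumes the same hypothetical model and dataset path as above and is illustrative only.

```python
# Sketch: passing the Table 8 augmentation settings as Ultralytics train() overrides.
from ultralytics import YOLO

augmentation = dict(
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,          # HSV color jitter
    degrees=0.0, translate=0.1, scale=0.5,      # geometric transforms
    shear=0.0, perspective=0.0,
    flipud=0.0, fliplr=0.5,                     # flip probabilities
    bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0,
    auto_augment="randaugment",                 # policy-based augmentation
    erasing=0.4, crop_fraction=1.0,             # random erasing; no extra cropping
)

model = YOLO("yolov8n-cls.pt")                  # assumed variant, as above
model.train(data="lsm_shapes_split", epochs=50, batch=16, imgsz=224, **augmentation)
```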
Table 9. Signs for the first subset.
No. | Semantic Field | Sign
1 | family | son *
2 | greetings | hello *
3 | days of the week | Monday *
4 | family | godfather *
5 | animals | deer *
* These signs are also in the second subset.
Table 10. Signs for the second subset.
No. | Semantic Field | Sign | No. | Semantic Field | Sign
1 | verbs | hug | 32 | verbs | to arrive
2 | adjectives | tall | 33 | days of the week | Monday *
3 | drinks | atole | 34 | kitchen | tablecloth
4 | transport | airplane | 35 | miscellaneous | sea
5 | school | flag | 36 | fruits | melon
6 | transport | bicycle | 37 | kitchen | table
7 | greetings | Good afternoon! | 38 | verbs | to swim
8 | greetings | Good morning! | 39 | colors | dark
9 | cities | capital | 40 | family | godfather *
10 | house | house † | 41 | animals | bird
11 | miscellaneous | sky | 42 | clothing | pants
12 | questions | How? | 43 | animals | penguin
13 | questions | How are you? | 44 | school | blackboard
14 | school | classmate | 45 | food | pizza
15 | house | curtains † | 46 | room | iron
16 | days of the week | day | 47 | miscellaneous | please
17 | house | broom † | 48 | questions | Why?
18 | living room | light bulb | 49 | time | present
19 | animals | rooster | 50 | professions | president
20 | adjectives | fat | 51 | bathroom | shower
21 | adjectives | big | 52 | living room | living room
22 | verbs | to like | 53 | food | sauce
23 | family | daughter | 54 | cities | Saltillo
24 | family | son * | 55 | clothing | shorts
25 | greetings | hello * | 56 | verbs | to dream
26 | time | hour | 57 | transport | taxi
27 | time | today | 58 | bathroom | towel
28 | animals | giraffe | 59 | animals | deer *
29 | verbs | to play | 60 | house | window †
30 | drinks | milk | 61 | clothing | dress
31 | vegetables | lettuce | 62 | person | widower
* These signs are also in the first subset. † These signs are also in the third subset.
Table 11. Signs for the third subset.
No. | Semantic Field | Sign
1 | house | garbage
2 | house | trash can
3 | house | house *
4 | house | curtains *
5 | house | electricity
6 | house | stairs
7 | house | broom *
8 | house | internet
9 | house | garden
10 | house | keys
11 | house | wall
12 | house | floor
13 | house | door
14 | house | roof
15 | house | mop
16 | house | window *
* These signs are also in the second subset.
Table 12. Performance accuracy with the MSLR dataset and a state-of-the-art comparison.
Ref. | Dataset | Joint Keypoints * | Performance Accuracy (%)
RNN [23] | MSLR | 67 | 96.44
GRU [23] | MSLR | 67 | 97.11
Dynamic-GCN [73] † | MSLR | 67 | 98.55
Single-stream GCAR [35] | MSLR | 67 | 99.00
Two-stream GCAR [35] | MSLR | 67 | 99.75
Proposed model | MSLR | 6 | 85.78
* 67 keypoints of the full body; 6 keypoints of the arm joints: wrists, elbows, and shoulders. † The model is presented in [73]; the performance accuracy is reported in [35].
Table 13. Top-1 accuracy comparison on the custom dataset.
Dataset | No. Classes | Description | With Elbows | Without Elbows
1 | 5 | More distinguishable | 0.8799 | 0.9599
2 | 62 | More or less distinguishable | 0.6537 | 0.6375
3 | 16 | House group | 0.7125 | 0.6875
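The top-1 values in Table 13 (and the top-k metric referenced in [72]) follow the scikit-learn definitions cited above. The short sketch below shows how such scores are computed; the label and score arrays are small placeholders, not the predictions of the trained models.

```python
# Sketch of the accuracy metrics from [71,72] on placeholder predictions.
import numpy as np
from sklearn.metrics import accuracy_score, top_k_accuracy_score

y_true = np.array([0, 1, 2, 2, 1])       # ground-truth sign indices (placeholder)
y_pred = np.array([0, 1, 2, 1, 1])       # top-1 predictions (placeholder)
scores = np.array([                      # per-class classifier scores (placeholder)
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.5, 0.4],
    [0.3, 0.6, 0.1],
])

print("Top-1 accuracy:", accuracy_score(y_true, y_pred))            # fraction of exact matches
print("Top-2 accuracy:", top_k_accuracy_score(y_true, scores, k=2)) # true class within 2 best scores
```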