4.1. Experimental Framework
This subsection details the pipeline used for the training and testing of the ISLR models utilizing the MS-G3D method.
Figure 4 illustrates the standardized pipeline implemented across all experiments described in the following subsection, which involved the three selected datasets: SWL-LSE, WLASL300, and ASL Citizen. For each dataset, keypoints were extracted using the same approach: MediaPipe Holistic with the Heavy model, focusing on the keypoints defined in Section 3.3.
The datasets were chosen based on the following criteria:
To assess the value of SWL-LSE against a dataset with a similar number of sign classes and signers, we compared it with WLASL300;
To evaluate the effectiveness of a baseline skeleton-only ISLR method and test its scalability with an increasing number of classes, we selected ASL Citizen, which was recorded under conditions similar to those of SWL-LSE.
To preserve the simplicity of the baseline framework and ensure a clear understanding of how different input streams affect the model, we deliberately refrained from using fusion techniques aimed at maximizing the accuracy.
4.2. Experiments and Analysis
The provided training pipeline (see https://github.com/mvazquezgts/SWL-LSE, accessed on 6 September 2024) was executed with identical training parameters for each dataset and input stream configuration, and each experiment was repeated up to five times. The results report the mean, standard deviation, and maximum of the accuracy. For reproducibility, the checkpoints of the models with the highest accuracy for each dataset are also provided.
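The reporting of repeated runs can be illustrated with a minimal sketch. The function name `summarize_runs`, the use of the sample standard deviation, and the accuracy values are assumptions made for illustration; they are not taken from the released pipeline or from the tables.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, std, max) for the accuracies of repeated runs."""
    if not accuracies:
        raise ValueError("at least one run is required")
    # The sample standard deviation needs two or more runs; report 0.0 otherwise.
    std = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), std, max(accuracies)


# Hypothetical accuracies (%) from five repetitions, not values from the paper.
runs = [90.0, 91.0, 92.0, 93.0, 94.0]
avg, std, best = summarize_runs(runs)
print(f"mean={avg:.3f} std={std:.3f} max={best:.3f}")
```

Whether the sample or the population standard deviation is the appropriate choice depends on how the reported figures were computed, which is not specified here.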
Each stream is provided to the model as a mono-modal input whose number of channels depends on whether the Z component is included. A confidence channel is always added. For Bones, this channel is the average of the confidences of the two keypoints at the ends of the bone; for the one-dimensional angle, the confidences of the three keypoints involved are included. The notation for each stream is Feature_C#_xy[z]c, where ‘Feature’ can be Joints, Bones, Angles, or Joint_Motion, and ‘#’ represents three or four channels. The Angles stream always contains four channels, but the channel carrying the cosine information can be calculated with or without the Z component. The graph for the MS-G3D model is then composed of 61 nodes, each with the number of channels defined by a specific stream.
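As a rough illustration of how the Joints and Bones streams and their confidence channel could be assembled for one frame, consider the following sketch. The function names, the per-node `parents` list (with the root pointing to itself so that the number of nodes is preserved), and the toy three-keypoint frame are assumptions for illustration and do not reproduce the released code.

```python
def joints_stream(frame, use_z=False):
    """frame: list of (x, y, z, c) tuples, one per keypoint.

    Returns Joints_C3_xyc (use_z=False) or Joints_C4_xyzc (use_z=True).
    """
    return [(x, y, z, c) if use_z else (x, y, c) for (x, y, z, c) in frame]


def bones_stream(frame, parents, use_z=False):
    """One bone vector per node, pointing from its parent to the node.

    parents[i] is the (assumed) parent index of node i; the root is its
    own parent, which yields a zero vector and keeps the node count.
    """
    out = []
    for child, parent in enumerate(parents):
        px, py, pz, pc = frame[parent]
        cx, cy, cz, cc = frame[child]
        conf = (pc + cc) / 2.0  # average confidence of the two end keypoints
        vec = (cx - px, cy - py, cz - pz) if use_z else (cx - px, cy - py)
        out.append(vec + (conf,))
    return out


# Toy frame with three keypoints (x, y, z, confidence); values are made up.
frame = [(0.5, 0.5, 0.0, 1.0), (0.6, 0.7, -0.1, 0.8), (0.9, 0.7, -0.2, 0.6)]
parents = [0, 0, 1]  # hypothetical skeleton: 0 is the root, 1 -> 0, 2 -> 1
bones = bones_stream(frame, parents)
print(bones)
```

In the actual graph the same construction would apply to the 61 nodes, with the parent assignment given by the skeleton topology.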
For each dataset and stream, four normalization experiments were conducted:
No further normalization from the MediaPipe Holistic output (No Norm);
Skeleton centering and scaling (Shoulders Norm);
Feature standardization (Std. Norm);
Shoulders Norm followed by standardization.
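The four configurations above can be sketched as follows. The exact definitions are assumptions: here Shoulders Norm centers the skeleton on the midpoint between the shoulders and divides by the shoulder width, and Std. Norm is a per-channel z-score. The function names and the toy frame are hypothetical, and the paper's implementation may differ in both respects.

```python
import math
from statistics import fmean, pstdev


def shoulders_norm(frame, left_shoulder, right_shoulder):
    """Center on the shoulder midpoint and scale by the shoulder width.

    frame: list of (x, y) tuples, one per keypoint.
    """
    lx, ly = frame[left_shoulder]
    rx, ry = frame[right_shoulder]
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    width = math.hypot(lx - rx, ly - ry)
    if width == 0.0:
        raise ValueError("shoulder keypoints coincide; cannot scale")
    return [((x - cx) / width, (y - cy) / width) for x, y in frame]


def fit_standardizer(frames):
    """Per-channel mean and std over every keypoint of every frame.

    In practice these statistics would be estimated on the training split.
    """
    xs = [x for frame in frames for x, _ in frame]
    ys = [y for frame in frames for _, y in frame]
    return (fmean(xs), fmean(ys)), (pstdev(xs), pstdev(ys))


def standardize(frame, means, stds, eps=1e-8):
    """Apply the z-score with the fitted statistics; eps avoids division by 0."""
    (mx, my), (sx, sy) = means, stds
    return [((x - mx) / (sx + eps), (y - my) / (sy + eps)) for x, y in frame]


# Toy frame with three (x, y) keypoints; indices 0 and 1 play the shoulders.
frame = [(0.4, 0.5), (0.6, 0.5), (0.5, 0.9)]
centered = shoulders_norm(frame, left_shoulder=0, right_shoulder=1)
means, stds = fit_standardizer([centered])
both = standardize(centered, means, stds)  # Shoulders Norm, then Std. Norm
print(centered)
print(both)
```

Under these assumptions, No Norm uses `frame` directly, Shoulders Norm uses `centered`, Std. Norm applies `standardize` to the raw frame, and the fourth configuration corresponds to `both`.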
The training parameters were consistent across all experiments. An MS-G3D architecture was used with a batch size of 32 and the standard SGD optimizer with an initial learning rate of 0.1. Learning rate adjustments were managed by the ReduceLROnPlateau scheduler, set with a reduction factor of 0.5 and patience of 10 epochs. L2 (weight decay) regularization was set to 0.0005, and early stopping was applied to halt training after 20 epochs with no improvement. Data augmentation was applied by horizontally mirroring the skeleton sequences with a probability of 0.5.
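The interaction between the learning rate schedule and early stopping can be sketched in plain Python. This is a simplified re-implementation written for illustration, not the scheduler of any particular library. The monitored quantity (validation accuracy), the function name, and the accuracy curve are assumptions; only the initial learning rate (0.1), the reduction factor (0.5), and the two patience values (10 and 20 epochs) come from the text.

```python
def simulate_training(val_accuracies, lr=0.1, factor=0.5,
                      lr_patience=10, stop_patience=20):
    """Replay a validation curve and track learning rate and stopping epoch."""
    best = float("-inf")
    since_best = 0      # epochs since the last improvement (early stopping)
    since_lr_event = 0  # epochs since the last improvement or LR reduction
    history = []
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, since_best, since_lr_event = acc, 0, 0
        else:
            since_best += 1
            since_lr_event += 1
            # Reduce the LR once the plateau lasts longer than the patience.
            if since_lr_event > lr_patience:
                lr *= factor
                since_lr_event = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:
            break  # early stopping: 20 epochs without improvement
    return best, lr, history


# Hypothetical curve: improvement for two epochs, then a long plateau.
curve = [50.0, 60.0] + [59.0] * 30
best, final_lr, history = simulate_training(curve)
print(best, final_lr, len(history))
```

With this toy curve, the learning rate is halved once during the plateau and training stops before a second reduction would be triggered, since the early stopping patience is only twice the scheduler patience.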
4.3. Baseline Experiments
The first set of experiments was conducted on the smaller datasets, SWL-LSE and WLASL300, to assess the new dataset and the baseline model and to identify the best combination of streams and normalization techniques, irrespective of the video source or quality. The smaller dataset size allowed for more extensive experimentation. Both datasets contain 300 classes, but, as noted in Section 2.5, WLASL2000 includes some labeling ambiguities. To avoid the potential contamination of the results due to mislabeling, we cross-referenced the 300 labels in WLASL300 and replaced 42 ambiguous signs with 42 ‘clean’ and frequently occurring signs from the 1449 subset defined from WLASL2000 in [43]. This modified dataset is referred to as WLASL300Custom, and its signs and splits are documented in the GitHub repository associated with this paper to ensure reproducibility and support further experimentation.
Table 4 shows the accuracies obtained on the SWL-LSE test set for all streams and normalization configurations.
These experiments were repeated under the same training conditions with the WLASL300Custom dataset. The results are presented in Table 5.
The first noticeable observation from Table 4 and Table 5 is the significant performance difference between the two datasets, which prompted us to verify that our model produces results consistent with those reported in the literature for the original WLASL300. A recent study achieved state-of-the-art results on WLASL300 using keypoints and other modalities [23] (without correcting labeling issues), so we also conducted tests on the original WLASL300. The accuracy reported by [23] in their Table 4 (53.75%) is based solely on the skeleton version, in which keypoints were extracted using MMPose (including 68 facial landmarks) and the classification model was SPOTER [56], a skeleton-based Transformer. The best accuracy on this dataset using our baseline method and Joints (2D, as in [23]) with shoulder normalization was 70.11%. This comparison shows that our baseline approach trained on the original WLASL300 dataset, which feeds MediaPipe Holistic keypoints focused primarily on the arms and hands into a standard MS-G3D model, outperformed SPOTER by a wide margin. It is unclear whether the large difference is due to the DL model, the reduction in keypoints, the method of extracting them (MMPose vs. MediaPipe), or a combination of these factors. Additionally, the results show that correcting the labeling errors in WLASL300 and replacing ambiguous classes leads to a dataset that enables the training of a slightly more accurate recognition model (see the Joints row in Table 5). Returning to the significant performance gap between the models trained on SWL-LSE and WLASL300Custom, a detailed review of the quality differences in the videos across both datasets could provide further insights. However, such an analysis is beyond the scope of this work.
Regarding the best-performing streams, Table 4 and Table 5 clearly show that Joints consistently outperforms Bones by a small margin, and both outperform Motion-Joints and Angles, regardless of the dataset and the normalization method. The performance decrease for these latter two streams appears to be highly dependent on the dataset. In SWL-LSE, Motion-Joints yields slightly lower accuracies than Bones, while Angles performs significantly worse. In WLASL300Custom, however, both Motion-Joints and Angles show much weaker results than Joints and Bones. Further analysis may help to explain why these specific streams depend so strongly on the dataset, but our hypothesis is that the lower quality of the WLASL dataset, possibly coupled with more variable frame rates during acquisition, contributes to these differences.
The same pattern appears when the Z coordinate from the MediaPipe output is used: every stream that includes the Z component yields similar or lower accuracy, regardless of the dataset or normalization method. The performance drop is particularly significant for the Angles and Motion-Joints streams, and the decline is more pronounced for WLASL300Custom than for SWL-LSE. These results corroborate findings from other researchers regarding the noisiness of the Z component in MediaPipe’s estimation.
Both tables provide valuable insights into the effects of different normalization methods, although the conclusions are less broadly applicable. The combination that consistently performs best across the datasets is Joints with Shoulders Norm. In the case of WLASL300Custom, applying additional standard normalization further improves the accuracy, with a mean of 74.608% and a maximum accuracy of 77.350% in one of the trained models. This result surpasses the best reported performance on the original WLASL300 dataset, which used more complex models with the fusion of multiple modalities [23]. Motion-Joints also benefits from Shoulders Norm, but it is unclear whether adding standard normalization consistently improves the results across the datasets, as the experiments with and without standardization show overlapping 68% confidence intervals. In contrast, the next best-performing stream, Bones, only benefits from Shoulders Norm in the WLASL300Custom dataset, with no additional gains from standard normalization. However, none of the normalization differences are statistically significant. Lastly, the Angles stream, which performs the worst among the streams, appears minimally affected by any normalization method, as the results show only slight variations across the methods in both datasets, likely because the cosine values are inherently normalized by definition.
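The overlap check mentioned above can be made concrete with a small sketch. It assumes that the 68% interval is taken as the mean plus or minus one standard deviation of the repeated runs (a normality assumption); the function names and the numeric values are hypothetical and are not taken from Table 4 or Table 5.

```python
def interval_68(mean, std):
    """Approximate 68% interval as mean +/- one standard deviation."""
    return (mean - std, mean + std)


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical (mean, std) pairs for one stream with and without Std. Norm.
without_std = interval_68(70.0, 1.0)  # (69.0, 71.0)
with_std = interval_68(71.5, 1.0)     # (70.5, 72.5)
overlap = intervals_overlap(without_std, with_std)
print(overlap)
```

When the intervals overlap, as in this toy case, the difference between the two configurations cannot be distinguished from run-to-run variation under this criterion.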
4.4. Scalability of the Baseline Model
The second set of experiments focused on the largest dataset, ASL Citizen. It was designed to assess the scalability of the deep learning model, compare its performance with other state-of-the-art results, and explore the potential of using this dataset for the pretraining of models that could be applied to smaller datasets. These experiments only used the best-performing streams, Joints and Bones, with and without Shoulders Norm.
Table 6 shows the results when using the same training conditions as the experiments in the smaller datasets.
The results achieved on this dataset can be considered a new state of the art for the ASL Citizen dataset, although it should be noted that, as far as we know, only the Microsoft team that collected the dataset has reported results. As summarized in Table 2, they reported an accuracy of 59.52% using keypoints and 63.10% using RGB. It is also worth noting that Shoulders Norm again improves the performance of Joints and only slightly benefits Bones. However, unlike in the smaller datasets, Bones surpasses the accuracy achieved with Joints. We believe that this change is not due to the type of data acquisition, which is similar to that of SWL-LSE; rather, the size of the dataset, which is ten times larger, may play a more significant role. Further research is needed to better understand this behavior. Lastly, the use of the Z coordinate shows no benefit when Shoulders Norm is applied, and there is only a marginal, statistically insignificant improvement when no normalization is applied.
The good results on this dataset and the large number of signs that the model can recognize led us to test it as a pretraining stage for the training of models for other sign language datasets.
Table 7 illustrates the results using the checkpoints of the best Joints and Bones models with Shoulders Norm in Table 6 (Joints-based model with 74.180% and Bones-based model with 75.060% accuracy).
These results clearly demonstrate that pretraining on a large dataset within the same domain enhances the training and improves the model accuracy, a well-established principle in deep learning. In this case, the performance gain is particularly significant when the smaller dataset is closely aligned with the larger one. Not only do WLASL300 and ASL Citizen share the same sign language, but they also have substantial overlap in specific signs. In fact, upon reviewing the sign-gloss names, we found that 96% of the WLASL300Custom signs are present in ASL Citizen. In the case of SWL-LSE, since it represents a different sign language, comparing the sign-gloss names does not reveal similar gestures. A comprehensive visual overlap analysis was beyond the scope of this study.
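A gloss-name overlap such as the one reported above can be computed with a simple set intersection. The sketch below is illustrative only: the function name, the case-insensitive matching, and the toy gloss lists are assumptions, and the actual comparison may have handled spelling variants differently.

```python
def gloss_overlap(small_glosses, large_glosses):
    """Percentage of glosses in the smaller dataset found in the larger one."""

    def normalize(gloss):
        # Assumed matching rule: ignore case and surrounding whitespace.
        return gloss.strip().lower()

    small = {normalize(g) for g in small_glosses}
    large = {normalize(g) for g in large_glosses}
    if not small:
        raise ValueError("the smaller gloss list is empty")
    return 100.0 * len(small & large) / len(small)


# Toy gloss lists, not the real vocabularies of either dataset.
small_vocab = ["book", "Drink", "computer", "family"]
large_vocab = ["BOOK", "drink", "family", "apple"]
pct = gloss_overlap(small_vocab, large_vocab)
print(f"{pct:.2f}% of the smaller vocabulary appears in the larger one")
```

Such a name-based match only captures shared glosses within one language, which is why it is not informative for SWL-LSE.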
Finally, we explored whether the original WLASL2000 dataset (despite containing some noisy labels) could benefit from pretraining on ASL Citizen. A similar sign-gloss analysis revealed that 45.46% of the WLASL2000 gloss names appear in ASL Citizen. Using the same pretraining checkpoints for the Joints- and Bones-based models, we ran three experiments for each stream on WLASL2000 and achieved a maximum accuracy of 57.070% with Joints and 60.130% with Bones. This shows that, despite the more comparable dataset sizes, pretraining still led to improved accuracy, surpassing the best results reported in the literature (Table 2 in [47]) with a simpler model.