Efficient Temporal Modeling for Real-World Sign Language Recognition: A Comparative Study Under Data-Constrained Scenarios

Cherrate, Meryem; El Manaa, Imane; Sabri, My Abdelouahed; Abouch, Yassine; Yahyaouy, Ali; Aarab, Abdellah

doi:10.3390/a19050399

Open Access Article


by Meryem Cherrate ¹,*, Imane El Manaa ², My Abdelouahed Sabri ¹, Yassine Abouch ², Ali Yahyaouy ¹ and Abdellah Aarab ¹

¹ Faculty of Sciences Dhar El Mehraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
² DAKAI Laboratory, Nextronic by Aba Technology, Casablanca 20253, Morocco
* Author to whom correspondence should be addressed.

Algorithms 2026, 19(5), 399; https://doi.org/10.3390/a19050399

Submission received: 9 April 2026 / Revised: 5 May 2026 / Accepted: 14 May 2026 / Published: 16 May 2026


Abstract

Designing effective temporal modeling strategies for video-based sign language recognition (SLR) remains challenging, particularly in low-resource settings where the behavior of modern architectures is not fully understood. In this study, we present a controlled comparative evaluation of temporal models, including recurrent architectures (RNN, LSTM, GRU) and a Transformer encoder, within a unified spatio-temporal framework based on a shared MobileNetV2 feature extractor. All models are trained and evaluated under identical conditions on a curated subset of the WLASL dataset (37 classes), ensuring a fair and reproducible comparison. The results show that recurrent models consistently achieve higher performance than the Transformer-based approach in data-constrained scenarios, with the CNN–LSTM model reaching an accuracy of 90.02%. In contrast, the Transformer model exhibits lower generalization capability, which may be attributed to its higher data requirements. Additionally, increasing architectural complexity through hybrid temporal designs does not result in performance improvements. These findings suggest that simpler recurrent architectures remain effective for temporal modeling in limited data settings and highlight the importance of aligning model complexity with data availability for practical SLR applications.

Keywords: sign language recognition; temporal modeling; deep learning; recurrent neural networks; transformer; video-based recognition; MobileNetV2; low-resource learning

1. Introduction

Sign languages constitute a fundamental means of communication for deaf and hard-of-hearing communities, relying on a rich combination of hand gestures, facial expressions, and body movements to convey linguistic information. Automatic sign language recognition (SLR) has therefore emerged as a key research area in computer vision and human–computer interaction, aiming to bridge communication gaps and improve accessibility. However, SLR remains a challenging task due to the complex spatio-temporal structure of gestures, high intra- and inter-signer variability, and diverse real-world recording conditions [1].

Early approaches to SLR primarily relied on handcrafted features and conventional machine learning techniques, which suffered from limited robustness in unconstrained environments. The advent of deep learning, particularly convolutional neural networks (CNNs), has significantly improved spatial feature extraction capabilities, enabling the learning of discriminative visual representations from raw data. Nevertheless, CNN-based models operate on individual frames and fail to capture temporal dependencies, which are essential for modeling dynamic sign language sequences.

To address this limitation, recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent units (GRU), have been widely adopted for temporal modeling. These architectures effectively capture sequential dependencies and have demonstrated strong performance in video-based SLR tasks. However, recurrent models inherently rely on sequential processing, limiting their ability to model long-range dependencies and reducing computational efficiency in large-scale scenarios.

More recently, Transformer-based architectures have emerged as a powerful paradigm for sequence modeling by leveraging self-attention mechanisms to capture global dependencies. Transformers have been successfully applied to video understanding and sign language recognition tasks, showing promising results when sufficient data is available. Recent studies have explored Transformer-based pipelines, hybrid CNN–Transformer models, and self-supervised video transformers for SLR, highlighting their ability to learn rich spatio-temporal representations [2,3,4]. In particular, large-scale experiments have demonstrated that advanced video Transformers such as VideoMAE and TimeSformer can achieve state-of-the-art performance on sign language datasets when trained with sufficient data and appropriate fine-tuning strategies [5].

Despite these advances, a critical limitation persists: most existing works rely on large-scale datasets and complex architectures, making them less suitable for real-world applications where annotated data is limited. Recent research has emphasized that model performance in SLR is highly dependent on dataset size, data quality, and training strategies, and that Transformer-based models may not generalize well in low-resource environments [5].

Furthermore, the lack of standardized experimental protocols across studies makes it difficult to draw fair and reliable comparisons between recurrent and attention-based approaches.

In addition, emerging works have begun exploring hybrid and multimodal architectures that integrate pose estimation, multi-stream learning, and attention mechanisms to improve robustness and interpretability [6,7]. While these approaches demonstrate promising performance, they often introduce additional complexity and computational overhead, raising important questions about the trade-off between model complexity and practical deployment.

In this context, the main contribution of this work lies in the design of a controlled and reproducible experimental framework that enables a fair comparison of temporal modeling strategies under identical conditions. Rather than proposing a new architecture, the study focuses on isolating the impact of the temporal modeling component, allowing for a more reliable and interpretable evaluation.

Moreover, the study is conducted in an application-oriented setting, where the selected vocabulary corresponds to frequently used signs in service-oriented environments. This choice reflects practical deployment constraints, where recognizing a limited set of essential interactions is often more relevant than handling large vocabularies.

Within this framework, we systematically evaluate several representative temporal architectures, including recurrent models (RNN, LSTM, and GRU) and a Transformer-based encoder, using a shared MobileNetV2 backbone for spatial feature extraction. All models are trained and evaluated under identical conditions on a curated subset of the WLASL dataset, ensuring consistency and reproducibility.

The objective of this study is to provide empirical insights into the relationship between model complexity, data availability, and performance in sign language recognition. In particular, we aim to answer the following research question: Which temporal modeling strategy is most effective under data-constrained conditions?

2. Related Work

Early deep learning approaches for sign language recognition (SLR) primarily relied on convolutional neural networks (CNNs) to extract spatial features from individual video frames. These models demonstrated strong performance in static gesture recognition due to their ability to capture discriminative visual patterns. However, their frame-wise processing limits their capacity to model temporal dependencies, which are essential for understanding dynamic sign sequences. As highlighted in recent studies, CNN-based approaches alone remain insufficient for capturing the complex spatio-temporal dynamics inherent in sign language videos [8,9].

To address this limitation, recurrent neural networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), have been widely adopted to model temporal dependencies across video frames. Hybrid CNN–LSTM frameworks, in particular, have shown strong performance by effectively combining spatial feature extraction with sequential modeling. Several works have demonstrated competitive results using such architectures on benchmark datasets, including subsets of WLASL [10,11]. Nevertheless, recurrent models suffer from inherent limitations, such as difficulty in capturing long-range dependencies and limited parallelization [12].

More recently, Transformer-based architectures have emerged as a powerful paradigm for sequence modeling by leveraging self-attention mechanisms to capture global dependencies. Models such as Vision Transformers, TimeSformer, VideoMAE, and ViViT have shown promising results in video understanding and SLR tasks [13,14,15,16]. Their ability to model long-range temporal interactions and enable parallel computation represents a significant advancement over recurrent approaches. However, several studies have emphasized that Transformer-based methods are highly dependent on large-scale annotated datasets and may underperform in data-constrained environments [15,16].

The introduction of large-scale datasets such as WLASL has significantly advanced research in isolated sign language recognition. WLASL provides a diverse and scalable benchmark for evaluating SLR systems across different vocabulary sizes [10]. Numerous studies have evaluated CNN–RNN and Transformer-based architectures on WLASL subsets (e.g., WLASL100, WLASL300), demonstrating the effectiveness of hybrid and attention-based approaches [10,17,18].

More recent works have explored large-scale settings such as WLASL2000, achieving improved generalization through advanced architectures and training strategies [17]. Despite these advances, most approaches rely on large datasets and complex models, limiting their applicability in resource-constrained scenarios.

In parallel, recent research has explored hybrid and multimodal approaches that integrate multiple sources of information, such as RGB frames, pose estimation, and attention mechanisms. Multi-stream architectures combining hand, body, and facial features have shown promising results in capturing fine-grained motion patterns. Furthermore, lightweight and efficient models have been proposed to balance performance and computational cost, enabling real-time applications. While these approaches improve robustness and accuracy, they often introduce increased architectural complexity.

Recent work has also investigated alternative temporal modeling strategies aimed at improving efficiency and scalability. Approaches such as Temporal Convolutional Networks (TCN) and InceptionTime leverage convolutional structures to capture temporal dependencies and have demonstrated strong performance in sequence modeling tasks. In parallel, efficient Transformer variants, including Linformer and Performer, have been proposed to reduce the computational cost of self-attention while maintaining competitive performance. Although these approaches represent promising directions, they are not included in the present study in order to preserve a controlled experimental framework focused on comparing representative sequence modeling paradigms under identical conditions. Evaluating these architectures under data-constrained scenarios remains an important direction for future work.

Despite significant progress in the field, several important limitations remain. Existing studies frequently rely on heterogeneous experimental setups, making fair comparisons between temporal modeling approaches difficult. Moreover, most Transformer-based models are evaluated under data-rich conditions, leaving their effectiveness in low-resource scenarios insufficiently explored. Finally, the trade-off between model complexity and performance remains unclear. Most importantly, a fundamental question remains unanswered: which temporal modeling strategy is most effective under data-constrained conditions?

To address these limitations, this work proposes a unified and controlled experimental framework to systematically compare recurrent and Transformer-based architectures under identical conditions using a subset of the WLASL dataset. By focusing on low-resource settings, this study aims to provide clearer insights into the relationship between model complexity, data availability, and performance.

3. Dataset and Preprocessing

3.1. Dataset Description

This study is conducted using a subset of the Word-Level American Sign Language (WLASL) dataset [10], a widely adopted benchmark for isolated sign language recognition. While WLASL provides a large-scale and diverse collection of real-world video samples, this work focuses on a carefully curated subset tailored to a specific application context.

A total of 37 sign classes are selected, comprising 275 video samples in total. This relatively limited dataset is intentionally designed to reflect practical communication needs in service-oriented environments, where deaf and hard-of-hearing individuals interact with systems to express essential requests (e.g., assistance, payment, or service-related actions).

The number of samples per class ranges from 7 to 8, resulting in a slight class imbalance. As illustrated in Figure 1, some classes are less represented than others. This variability reflects realistic data collection conditions and may influence model performance, particularly for underrepresented classes.
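The stated totals pin down the class distribution. As a quick check, assuming every class holds either 7 or 8 samples as described above, the counts follow from two linear equations (the variable names are illustrative):

```python
# Solve n7 + n8 = 37 and 7*n7 + 8*n8 = 275 for the number of
# classes that hold 7 and 8 samples respectively.
NUM_CLASSES = 37
NUM_VIDEOS = 275

n8 = NUM_VIDEOS - 7 * NUM_CLASSES   # classes with 8 samples
n7 = NUM_CLASSES - n8               # classes with 7 samples

mean_per_class = NUM_VIDEOS / NUM_CLASSES
print(n7, n8, round(mean_per_class, 2))
```

Under that assumption, 21 classes have 7 samples and 16 classes have 8, for an average of about 7.43 videos per class.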

The selection of sign classes is guided by semantic relevance, frequency of use, and alignment with real-world communication scenarios. To enhance reliability, an informal validation is conducted through consultation with domain experts, including researchers and educators familiar with sign language communication.

From an application perspective, the system is designed for real-world deployment, where accurately recognizing user intent is more critical than covering a large vocabulary. Therefore, prioritizing a reduced yet meaningful set of classes allows the model to focus on high-impact interactions while maintaining efficiency.

Despite this filtering, the dataset preserves key challenges of real-world sign language recognition, including variability in signers, backgrounds, illumination conditions, and camera viewpoints. This ensures that the evaluation remains realistic and representative of unconstrained environments.

Overall, this dataset design supports the evaluation of temporal modeling strategies under data-constrained conditions, while maintaining practical relevance for real-world applications.

3.2. Data Preprocessing

To prepare the video data for training, each input sequence is first decomposed into individual frames while preserving the temporal order of the gestures. This step enables frame-level processing while maintaining the sequential structure required for temporal modeling.

All extracted frames are resized to a fixed spatial resolution of 64 × 64 pixels. This relatively compact resolution is chosen as a trade-off between computational efficiency and the preservation of essential visual information. Despite the reduced size, the resolution remains sufficient to capture key discriminative features such as hand shape and motion patterns, particularly when combined with a deep feature extractor.

In this work, a pretrained MobileNetV2 network is employed as the spatial feature extractor. MobileNetV2 is specifically designed for lightweight and efficient representation learning, making it well-suited for scenarios involving limited computational resources and relatively small input resolutions. Its depthwise separable convolutions allow for effective feature extraction while significantly reducing the number of parameters and computational cost [19]. Such architectures are particularly suitable for real-time or resource-constrained applications.

Before feature extraction, pixel values are normalized to a standard range to improve numerical stability during training and accelerate convergence. This normalization also reduces the impact of illumination variations and noise present in real-world video recordings.

Since input videos have varying lengths, all sequences are uniformly sampled to a fixed number of frames T.

The value of T is treated as an experimental parameter and is set to 10, 15, or 20, depending on the configuration. Videos longer than T frames are downsampled, while shorter sequences are padded.

Overall, this preprocessing pipeline ensures a consistent and efficient input representation, facilitating a fair comparison between different temporal modeling approaches while maintaining robustness to visual variability.

3.3. Data Construction

Since sign language is inherently dynamic, capturing temporal dependencies across frames is essential for accurate recognition. Following common practices in video-based sign language recognition, the extracted frames are organized into fixed-length sequences to model the temporal evolution of gestures [8,11].

For each video, a predefined number of frames is selected to represent the gesture. This fixed-length representation ensures compatibility with batch processing and enables a fair comparison across different temporal modeling approaches.

When the number of frames in a video exceeds the required sequence length, uniform sampling is applied to preserve the overall temporal structure while reducing redundancy. This strategy has been widely adopted in video analysis tasks to maintain a consistent representation without introducing bias toward specific temporal segments [14].

Conversely, when the number of frames is shorter than the desired sequence length, padding is introduced to maintain a consistent input size. This allows all sequences to be processed efficiently within the same computational framework, which is particularly important for training recurrent and attention-based models [13].
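The sampling and padding rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and since the text does not state what padding consists of, padding at the end with a placeholder value is an assumption.

```python
def uniform_indices(num_frames: int, target_len: int) -> list[int]:
    """Pick `target_len` evenly spaced frame indices, first and last included."""
    if target_len == 1:
        return [0]
    step = (num_frames - 1) / (target_len - 1)
    return [round(i * step) for i in range(target_len)]


def to_fixed_length(frames: list, target_len: int, pad_value=None) -> list:
    """Downsample long sequences uniformly; pad short ones at the end.

    Assumption: padding appends `pad_value` (e.g. an all-zero frame).
    """
    if len(frames) >= target_len:
        return [frames[i] for i in uniform_indices(len(frames), target_len)]
    return frames + [pad_value] * (target_len - len(frames))


# A 21-frame clip reduced to T = 5 keeps frames 0, 5, 10, 15 and 20.
clip = to_fixed_length(list(range(21)), 5)
```

Because the step is at least one frame whenever the video is longer than T, the selected indices never repeat and the temporal order is preserved.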

Such a standardized sequence construction process ensures that all models are trained and evaluated under identical temporal conditions, thereby enabling a fair and controlled comparison of their ability to capture spatio-temporal dependencies.

3.4. Data Augmentation

To improve the generalization capability of the models and reduce overfitting, data augmentation techniques are applied during training. Such strategies are widely used in computer vision tasks to artificially increase data variability and enhance model robustness [8,20].

In this study, both spatial and temporal augmentation techniques are employed to better simulate real-world variations in sign language videos.

From a spatial perspective, transformations such as horizontal flipping and slight rotations are applied to the input frames. These operations help the model become invariant to changes in signer orientation and camera viewpoint, which are common in unconstrained recording environments [21].

In addition to spatial transformations, temporal augmentation is introduced through slight variations in the frame sampling positions, commonly referred to as temporal jittering. This technique encourages the model to learn more robust temporal representations by reducing its sensitivity to precise frame alignment and timing variations within gesture sequences [14].
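One possible reading of temporal jittering is to perturb each uniformly spaced sampling position by a small random offset. The sketch below assumes this interpretation; the function name, the `max_shift` parameter, and its default of one frame are hypothetical and not taken from the study.

```python
import random


def jittered_indices(num_frames, target_len, max_shift=1, rng=None):
    """Uniform sampling positions, each shifted by up to `max_shift` frames."""
    rng = rng or random.Random()
    step = (num_frames - 1) / max(target_len - 1, 1)
    indices = []
    for i in range(target_len):
        shifted = round(i * step) + rng.randint(-max_shift, max_shift)
        # Clamp to the valid frame range of the video.
        indices.append(min(max(shifted, 0), num_frames - 1))
    # Sorting keeps the frames in temporal order after jittering.
    return sorted(indices)


train_indices = jittered_indices(40, 10, max_shift=2, rng=random.Random(0))
```

With `max_shift=0` the function reduces to plain uniform sampling, which is the natural choice at test time.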

By combining spatial and temporal augmentation strategies, the proposed preprocessing pipeline enhances the diversity of the training data while preserving the semantic integrity of the gestures, leading to improved generalization performance.

In addition to standard augmentation techniques, particular attention is given to transformations that preserve the semantic integrity of sign language gestures. Unlike general object recognition tasks, sign language recognition is highly sensitive to spatial orientation and temporal dynamics. Therefore, augmentations are carefully designed to avoid altering the meaning of gestures while still increasing data variability.

Specifically, horizontal flipping is applied with caution, as some signs may be orientation-dependent, while small rotations and temporal jittering are used to simulate natural variations in signer movement and timing. These lightweight and task-aware augmentation strategies play a crucial role in improving model generalization in low-resource settings without increasing model complexity.

4. Proposed Methodology

4.1. Overview of the Proposed Framework

This work proposes a unified spatio-temporal framework for video-based sign language recognition, designed to enable a fair and systematic comparison of different temporal modeling strategies under identical conditions. The general workflow is summarized in Figure 2.

As illustrated in Figure 2, the overall pipeline follows a sequential processing scheme commonly adopted in video understanding tasks, where spatial features are first extracted from individual frames and then modeled over time to capture temporal dependencies [11,22].

First, the input video is decomposed into a sequence of frames while preserving the temporal order of the gesture. A fixed number of frames is then selected through uniform sampling to ensure a consistent input representation across all samples, as widely used in prior video-based approaches.

Each frame is subsequently processed by a pretrained MobileNetV2 network, which serves as a spatial feature extractor. This step transforms raw visual data into a sequence of compact and discriminative feature vectors, leveraging the efficiency and representational capability of lightweight convolutional architectures.

The extracted features are then organized into a temporal sequence that captures the evolution of the gesture over time. To model temporal dependencies, this sequence is passed to different temporal architectures, including RNN, LSTM, GRU, and a Transformer encoder. Recurrent models have been widely used for sequential modeling in sign language recognition due to their ability to capture temporal dynamics, while Transformer-based approaches rely on self-attention mechanisms to model global dependencies across frames.

Finally, the output of the temporal model is fed into a fully connected classification layer followed by a softmax function, which produces the predicted sign class.

This modular design ensures that all models share the same spatial representation and differ only in the temporal modeling component, allowing for a controlled and unbiased comparison. Such a unified framework is essential for drawing reliable conclusions regarding the effectiveness of different temporal modeling strategies under data-constrained conditions.

4.2. Spatial Feature Extraction Using MobileNetV2

To extract discriminative spatial features from individual frames, a pretrained MobileNetV2 network is employed. MobileNetV2 is a lightweight convolutional architecture based on depthwise separable convolutions and inverted residual blocks, enabling efficient feature extraction with reduced computational complexity [19].

The MobileNetV2 backbone produces a 1280-dimensional feature vector for each frame after global average pooling.

Given an input frame, I_t, the spatial feature extractor can be formulated as:

f_t = Φ(I_t; θ_c)

(1)

where Φ denotes the MobileNetV2 mapping, and θ_c represents its parameters. The output f_t is a compact feature vector encoding relevant visual information such as hand shape and appearance.

The use of MobileNetV2 allows efficient processing of low-resolution inputs while maintaining strong representational capacity, which is particularly beneficial in scenarios with limited computational resources or constrained datasets.

4.3. Temporal Modeling Strategies

To capture the temporal evolution of gestures, the sequence of spatial features {f₁, f₂, …, f_T} is processed using different temporal modeling approaches. This study investigates two representative families of strategies: recurrent models (RNN, GRU, and LSTM) and an attention-based Transformer encoder.

4.3.1. Recurrent Modeling with GRU

Gated Recurrent Units (GRUs) are employed to model temporal dependencies, offering reduced computational complexity compared to LSTM networks. GRUs use gating mechanisms to control information flow and mitigate vanishing gradient issues [23].

The hidden state update is defined as:

h_t = GRU(f_t, h_{t−1}; θ_g)

(2)

where h_t represents the hidden state at time t, and θ_g denotes the GRU parameters.
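Equation (2) leaves the GRU update abstract. For reference, a commonly used formulation of the gates is given below. The weight names W, U, and b are notational assumptions introduced here (together they make up θ_g), σ is the logistic sigmoid, and ⊙ is the element-wise product. Some implementations swap the roles of z_t and 1 − z_t in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z f_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r f_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h f_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The update gate z_t decides how much of the previous state is overwritten, while the reset gate r_t controls how much of it contributes to the candidate state.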

4.3.2. Long Short-Term Memory (LSTM)

To better capture long-range dependencies, Long Short-Term Memory (LSTM) networks are also considered. LSTMs introduce memory cells and gating mechanisms that enable the modeling of long-term temporal relationships [12,24].

LSTM-based models have been widely used in video-based sign language recognition due to their effectiveness in modeling sequential data [11].
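To make the difference in complexity between the recurrent variants concrete, the sketch below counts trainable weights using the textbook formulation in which each gate or candidate block has an input kernel, a recurrent kernel, and one bias vector. The hidden size of 128 is a hypothetical value, since the number of recurrent units is not reported here, and some implementations add extra bias terms that change the totals slightly.

```python
def recurrent_param_count(input_dim: int, units: int, gate_blocks: int) -> int:
    """Weights for `gate_blocks` blocks: input kernel + recurrent kernel + bias."""
    return gate_blocks * units * (input_dim + units + 1)


INPUT_DIM = 1280   # MobileNetV2 feature size stated in Section 4.2
UNITS = 128        # hypothetical hidden size; not reported in the text

counts = {
    "RNN": recurrent_param_count(INPUT_DIM, UNITS, gate_blocks=1),
    "GRU": recurrent_param_count(INPUT_DIM, UNITS, gate_blocks=3),
    "LSTM": recurrent_param_count(INPUT_DIM, UNITS, gate_blocks=4),
}
for name, value in counts.items():
    print(f"{name}: {value:,} parameters")
```

Whatever hidden size is chosen, the ratio stays 1 : 3 : 4 for RNN, GRU, and LSTM under this formulation.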

4.3.3. Transformer Encoder

In addition to recurrent models, a Transformer encoder is employed to model global temporal dependencies using self-attention mechanisms. Unlike recurrent architectures, Transformers process all time steps in parallel and capture long-range relationships more effectively [13].

The Transformer encoder consists of two layers with four attention heads and a feed-forward dimension of 256. The input feature dimension is aligned with the output of MobileNetV2, and sinusoidal positional encoding is applied to preserve temporal order. The architectural parameters are selected to balance model capacity and the limited dataset size.
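A sinusoidal positional encoding of the kind mentioned above can be sketched as follows. The sketch assumes the standard sine/cosine scheme with base 10000 and a model dimension of 1280 (the MobileNetV2 feature size); the function name is illustrative.

```python
import math


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a seq_len x d_model table: sine on even columns, cosine on odd."""
    table = []
    for pos in range(seq_len):
        row = []
        for col in range(d_model):
            # Columns 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (col // 2)) / d_model))
            row.append(math.sin(angle) if col % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# One encoding vector per sampled frame, for the longest setting T = 20.
pe = sinusoidal_positional_encoding(seq_len=20, d_model=1280)
```

Each row is added to the corresponding frame feature before the first encoder layer. With four heads and a model dimension of 1280, each head would operate on 320 dimensions.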

The attention mechanism is defined as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V

(3)

where Q, K, and V represent the query, key, and value matrices derived from the input feature sequence, and d_k is the dimension of the key vectors.
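As a small illustration of Equation (3), the following sketch computes scaled dot-product attention for a single head on plain Python lists. It is written for readability rather than speed, and writes the key dimension as `d_k`; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = attention([[0.0, 0.0]], K, V)
```

A zero query scores every key equally, so the weights are uniform and the output is simply the average of the value vectors.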

Transformer-based models have recently demonstrated strong performance in sequence modeling tasks, including video understanding and sign language recognition [14,18].

The selection of the temporal models considered in this study is guided by the objective of comparing representative sequence modeling paradigms within a controlled experimental setting. Recurrent architectures, including RNN, LSTM, and GRU, are chosen due to their well-established effectiveness in modeling temporal dependencies and their widespread use in sign language recognition, particularly in scenarios where training data is limited. In parallel, a Transformer-based encoder is included to represent a more recent class of models that rely on self-attention mechanisms and have demonstrated strong performance in sequence modeling tasks when sufficient data is available.

By focusing on these two families of approaches, the study aims to provide a balanced comparison between traditional sequential modeling and attention-based methods under identical conditions. While other model families, such as 3D convolutional neural networks or multimodal approaches combining visual and pose-based features, are also relevant for sign language recognition, they introduce additional design factors beyond temporal modeling, including joint spatio-temporal feature extraction and modality fusion. In order to preserve a controlled experimental setting and ensure a fair comparison, the present study focuses exclusively on architectures that operate on a shared sequence of pre-extracted features. This design choice allows the analysis to isolate the impact of the temporal modeling strategy while minimizing the influence of other components.

4.4. Baseline: Spatial-Only Model

To assess the importance of temporal modeling, a baseline model based solely on spatial features is considered. In this setting, predictions are computed independently for each frame, and the final prediction is obtained by averaging the softmax probabilities across all frames. This baseline serves as a reference to evaluate the contribution of temporal information in the recognition process.
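The averaging rule used by this baseline can be sketched in a few lines. The probabilities below are made up for illustration, and the function name is hypothetical.

```python
def average_frame_probabilities(frame_probs):
    """Average per-frame class probabilities; return (class index, mean probs)."""
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    mean_probs = [
        sum(p[c] for p in frame_probs) / num_frames for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=mean_probs.__getitem__)
    return predicted, mean_probs


# Three frames, three classes (illustrative values only).
frame_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
predicted, mean_probs = average_frame_probabilities(frame_probs)
```

Shuffling the frames leaves the result unchanged, which is exactly why this baseline carries no temporal information.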


4.5. Classification Layer

For all models, the final representation (either the last hidden state or the aggregated sequence representation) is passed to a fully connected classification layer. This layer consists of a dense transformation followed by a softmax activation function that outputs class probabilities:

\hat{y} = \mathrm{softmax}(W h + b)

(4)

where h represents the learned feature vector, and W and b denote the trainable parameters of the classifier.
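Equation (4) amounts to one matrix-vector product followed by a softmax. The sketch below uses tiny, made-up dimensions (two features, three classes); in the setting described here, W would have one row per class, i.e. 37 rows.

```python
import math


def dense_softmax(h, W, b):
    """Compute softmax(W h + b) for a feature vector h."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


h = [1.0, 2.0]                               # illustrative feature vector
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # illustrative weights (3 classes)
b = [0.0, 0.0, 0.0]
probs = dense_softmax(h, W, b)
predicted_class = probs.index(max(probs))
```

The logits here are 1, 2, and 0, so the second class receives the highest probability.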

4.6. Training Strategy

The models are trained using categorical cross-entropy loss, which is well-suited for multi-class classification tasks:

\mathcal{L} = - \sum_{i = 1}^{C} y_{i} \log(\hat{y}_{i})

(5)

where C is the number of classes, y_{i} denotes the ground truth label, and \hat{y}_{i} represents the predicted probability for class i.
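For a one-hot label, the loss in Equation (5) reduces to the negative log-probability assigned to the correct class. A minimal sketch, in which the small `eps` floor is an added safeguard against log(0) rather than part of the equation:

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# Illustrative three-class example: the true class is index 1.
loss = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

# Reference point: a uniform guess over 37 classes costs log(37).
num_classes = 37
one_hot = [1] + [0] * (num_classes - 1)
uniform_loss = categorical_cross_entropy(one_hot, [1 / num_classes] * num_classes)
```

The uniform-guess value, about 3.61, is a useful sanity check for the initial training loss of an untrained 37-class classifier.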

The Adam optimizer is employed for parameter optimization due to its efficiency and adaptive learning capabilities [25].

To mitigate overfitting and improve generalization, several regularization techniques are applied. Dropout is introduced in the fully connected layers with a rate of 0.5, which helps reduce co-adaptation between neurons. In addition, early stopping is used by monitoring the validation loss with a patience of 7 epochs, allowing the training process to stop when no further improvement is observed.
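The early-stopping rule can be illustrated with a short simulation. The validation losses below are made up, and the function is a simplified stand-in for a framework callback, assuming that any strict decrease in validation loss counts as an improvement.

```python
def early_stopping_epoch(val_losses, patience=7):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss; the weights from `best_epoch` would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Illustrative curve: improvement for three epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 10
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=7)
```

In this example the best loss occurs at epoch 3, so training halts at epoch 10, well before the 100-epoch limit.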

Furthermore, data augmentation techniques are employed to increase variability in the training data. These include spatial transformations such as horizontal flipping and slight rotations, as well as temporal augmentation through frame sampling variations (temporal jittering). These strategies are particularly important in data-constrained settings to improve model robustness and generalization performance.

5. Experimental Setup

5.1. Implementation Details

All experiments were conducted using Python 3.10 and TensorFlow/Keras 2.15 within the Google Colab environment, leveraging GPU acceleration. The models were trained in a controlled setting to ensure reproducibility and consistency across different temporal architectures.

The input data consists of video frames resized to 64 × 64 pixels and normalized within the range [0, 1]. Each video is represented as a sequence of frames, which are processed using a TimeDistributed MobileNetV2 network pretrained on ImageNet for spatial feature extraction.

The pretrained backbone is used as a fixed feature extractor, with its weights frozen during training, allowing the models to focus on learning temporal dependencies.

5.2. Dataset Split

To ensure an unbiased evaluation, the dataset is divided into training and testing subsets.

The dataset is split as follows:

Training set: 80%;
Testing set: 20%.

The split is performed using a randomized strategy with a fixed seed to ensure reproducibility.

To handle variable-length videos, all sequences are uniformly sampled to a fixed number of frames T, where T ∈ {10, 15, 20}.

Videos longer than T frames are downsampled, while shorter ones are padded to maintain a consistent input size.

Additionally, a validation split of 20% is applied on the training data during model training.
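The resulting subset sizes follow directly from the 275 videos. The sketch below assumes a plain (non-stratified) shuffle and a seed of 42; neither the seed value nor whether the split is stratified by class is stated in the text, and the way the validation samples are drawn may differ in the actual pipeline.

```python
import random


def split_indices(n, test_fraction=0.2, val_fraction=0.2, seed=42):
    """Shuffle sample indices, then carve out test and validation subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # fixed seed for reproducibility
    n_test = round(n * test_fraction)
    test, train_full = indices[:n_test], indices[n_test:]
    n_val = round(len(train_full) * val_fraction)
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


train, val, test = split_indices(275)
print(len(train), len(val), len(test))
```

This gives 55 test videos and 220 training videos, of which 44 are held out for validation, leaving 176 for weight updates.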

5.3. Training Configuration

All models are trained under identical conditions to ensure a fair comparison.

Optimizer: Adam;
Loss Function: Categorical Cross-Entropy;
Batch Size: 30;
Number of Epochs: 100;
Validation Split: 0.2;
Early Stopping: patience = 7.

Early stopping is employed to prevent overfitting by monitoring validation performance and restoring the best model weights.
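The early stopping rule can be traced on a toy loss curve. This is a sketch under assumptions: the monitored quantity is taken to be validation loss (the text says only "validation performance"), the function name `early_stopping_epoch` is hypothetical, and the loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=7):
    """Return (stop_epoch, best_epoch), both zero-based.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the weights from `best_epoch` would
    then be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# HYPOTHETICAL curve: improves for four epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5] + [0.55] * 10
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=7)
```

In this toy run the best epoch is index 3 and training halts at index 10, seven epochs later, well before a 100-epoch budget would be exhausted.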

Although Transformer-based models typically require longer training, early stopping ensured convergence within the given training budget.

5.4. Model Variants

To evaluate the impact of different temporal modeling strategies, multiple architectures are implemented and compared.

The following models are considered:

CNN + LSTM;
CNN + GRU;
CNN + RNN;
CNN + Transformer Encoder.

Each model uses the same spatial feature extractor and differs only in the temporal modeling component.
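The shared-backbone design from Section 5.1, with only the temporal layer swapped, might look as follows in Keras. This is a hedged sketch rather than the authors' code: the names `ClipConfig` and `build_model`, the hidden size of 128, and the use of global average pooling are assumptions, and the Transformer variant is omitted. Note that 64 × 64 is not one of the resolutions for which Keras ships MobileNetV2 ImageNet weights, so Keras is expected to warn and fall back to its default-resolution weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipConfig:
    num_frames: int = 20     # T, one of {10, 15, 20} in the text
    frame_size: int = 64     # frames are resized to 64 x 64
    channels: int = 3        # RGB input
    num_classes: int = 37    # size of the WLASL subset
    hidden_units: int = 128  # ASSUMPTION: not stated in this section

    def input_shape(self):
        return (self.num_frames, self.frame_size, self.frame_size, self.channels)


def build_model(config, temporal="lstm"):
    """Frozen MobileNetV2 per frame, one recurrent layer, softmax output."""
    # Imported lazily so the configuration above is usable without TensorFlow.
    import tensorflow as tf
    from tensorflow.keras import layers

    recurrent = {"rnn": layers.SimpleRNN, "lstm": layers.LSTM, "gru": layers.GRU}
    backbone = tf.keras.applications.MobileNetV2(
        include_top=False,
        weights="imagenet",
        input_shape=config.input_shape()[1:],
        pooling="avg",  # ASSUMPTION: one pooled feature vector per frame
    )
    backbone.trainable = False  # fixed feature extractor

    inputs = layers.Input(shape=config.input_shape())
    features = layers.TimeDistributed(backbone)(inputs)  # (batch, T, features)
    hidden = recurrent[temporal](config.hidden_units)(features)
    outputs = layers.Dense(config.num_classes, activation="softmax")(hidden)
    return tf.keras.Model(inputs, outputs)


config = ClipConfig()
```

Calling `build_model(config, temporal="gru")` would produce the GRU variant with the same backbone, which is the sense in which the variants differ only in their temporal component.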

5.5. Evaluation Metrics

Model performance is evaluated using standard classification metrics.

Accuracy;
Classification Report (Precision, Recall, F1-score);
Confusion Matrix.

These metrics provide a comprehensive evaluation of both overall performance and class-wise behavior.
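The relationship between the confusion matrix and the reported metrics can be made concrete with a small example. The function name `per_class_metrics` and the 3-class matrix are hypothetical; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_metrics(confusion):
    """Return (accuracy, [(precision, recall, f1), ...]) from a square matrix.

    Convention assumed here: confusion[true_class][predicted_class].
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total

    report = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return accuracy, report


# HYPOTHETICAL 3-class confusion matrix, 10 samples per class.
matrix = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 10]]
accuracy, report = per_class_metrics(matrix)
```

Accuracy summarizes the diagonal as a whole, while the per-class triplets expose classes whose errors a single accuracy figure would hide.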

All temporal models are evaluated independently under identical experimental conditions. This ensures a fair and unbiased comparison of their respective capabilities in modeling temporal dependencies.

6. Results and Discussion

6.1. Overall Performance and Comparison

The performance of the evaluated models is summarized in Table 1, which reports the mean accuracy and standard deviation over five independent runs conducted with different random seeds to ensure a more reliable comparison.

As shown in Table 1, recurrent architectures (RNN, LSTM, and GRU) achieve consistently high performance, with mean accuracies close to 90%. Among them, the CNN + LSTM model achieves the highest mean accuracy (90.02% ± 0.21), followed by CNN + GRU (89.10% ± 0.27) and CNN + RNN (88.89% ± 0.24). In contrast, the Transformer-based model yields lower performance (87.11% ± 0.35), suggesting that it may be less suited to data-constrained scenarios.

Although CNN + LSTM shows the highest mean accuracy, the differences between recurrent models are small, at most 1.13 percentage points (CNN + LSTM versus CNN + RNN). To further assess the reliability of this comparison, a paired t-test was conducted between CNN + LSTM and CNN + RNN across five runs with different random seeds. The results indicate that the observed difference is not statistically significant (p > 0.05).
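The mechanics of a paired t-test over five seeded runs can be sketched as follows. The per-run accuracies below are HYPOTHETICAL and do not reproduce the study's runs, which are not listed in the text; the function name `paired_t_statistic` is likewise an assumption. The critical value 2.776 is the standard two-sided 5% threshold for four degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Standard error of the mean difference (sample standard deviation).
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# HYPOTHETICAL per-seed accuracies (%), paired by random seed.
model_a = [90.1, 89.8, 90.3, 89.9, 90.0]
model_b = [89.9, 90.0, 89.7, 90.2, 89.6]

t_stat, dof = paired_t_statistic(model_a, model_b)
T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_DF4
```

Pairing by seed means the test works on per-run differences, so its outcome depends on how consistent those differences are across seeds and not only on the two means.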

These findings suggest that RNN, LSTM, and GRU exhibit comparable and stable performance under the same experimental conditions, and that no definitive ranking can be established among them based on such marginal differences. Instead, the results highlight a broader and more robust conclusion: recurrent architectures, regardless of their specific formulation, are effective for modeling temporal dependencies in sign language recognition under data-constrained conditions.

Figure 3 further illustrates this comparison, showing that all recurrent models follow similar performance trends, with only marginal differences between them. Although the CNN + LSTM model achieves the highest mean accuracy, the gap with RNN and GRU remains negligible and not statistically significant. In contrast, the Transformer-based model consistently performs lower, suggesting that more complex architectures may require larger datasets to leverage their modeling capacity fully.

6.2. Training Behavior Analysis

The training behavior of the evaluated models is illustrated in Figure 4, Figure 5, Figure 6 and Figure 7, which show the evolution of training and validation accuracy and loss over epochs.

Overall, all models exhibit a consistent learning pattern, characterized by a rapid decrease in training loss during the initial epochs, followed by a clear stabilization phase. The validation loss follows a similar trend, indicating that all models converge effectively and do not suffer from severe overfitting.

A more detailed analysis of the learning curves reveals important differences in convergence behavior across architectures. In particular, recurrent models (RNN, LSTM, and GRU) converge rapidly, typically within the first 5–10 epochs. Beyond this point, both training and validation curves stabilize, and additional training up to 100 epochs does not lead to significant performance improvements. This indicates that these models reach their optimal performance early in the training process.

Among them, the CNN + LSTM model demonstrates the most stable and smooth convergence, with minimal fluctuations between training and validation curves. This behavior suggests effective modeling of temporal dependencies and strong generalization capability. The CNN + GRU model shows a similar trend, although with slightly higher variability in the validation loss, indicating a marginally less stable learning process. In contrast, the CNN + RNN model converges quickly but exhibits more noticeable oscillations, reflecting less stable training dynamics.

The Transformer-based model, on the other hand, shows comparatively slower convergence and higher variability in validation performance. Although its performance improves with extended training, it still remains less stable and achieves lower overall accuracy compared to recurrent architectures. This behavior is likely due to its higher model complexity and greater sensitivity to limited training data.

In terms of accuracy, training and validation curves remain closely aligned across all models, confirming good generalization performance. Overall, these results indicate that while all models are capable of learning meaningful temporal representations, recurrent architectures, particularly LSTM, provide more stable, efficient, and reliable training behavior under data-constrained conditions. Moreover, the extended training to 100 epochs confirms that the performance differences are not due to insufficient training duration, but rather to intrinsic differences in model suitability for low-resource scenarios.

6.3. Confusion Matrix Analysis

To provide a detailed evaluation at the class level, the confusion matrix of the best-performing model (CNN + LSTM) is presented in Figure 8.

The confusion matrix exhibits a strong diagonal dominance, indicating that the majority of sign classes are correctly classified. This confirms the high overall accuracy of the model.

However, a limited number of misclassifications can be observed between certain classes. These errors are likely due to visual similarities between signs, such as similar hand shapes or motion patterns, which make them more difficult to distinguish.

Overall, the confusion matrix analysis highlights the robustness of the model while revealing specific areas where classification remains challenging.
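One way to locate the class pairs responsible for the off-diagonal errors is to rank the off-diagonal cells of the confusion matrix. The sketch below uses a hypothetical function name `top_confusions`, placeholder labels, and an invented 3-class matrix; rows are assumed to be true classes.

```python
def top_confusions(confusion, labels, k=3):
    """Return up to k (count, true_label, predicted_label) triples, largest first."""
    pairs = []
    for true_idx, row in enumerate(confusion):
        for pred_idx, count in enumerate(row):
            # Keep only misclassifications that actually occurred.
            if true_idx != pred_idx and count > 0:
                pairs.append((count, labels[true_idx], labels[pred_idx]))
    pairs.sort(key=lambda item: -item[0])
    return pairs[:k]


# HYPOTHETICAL labels and counts, for illustration only.
labels = ["sign_a", "sign_b", "sign_c"]
matrix = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 10]]
confused = top_confusions(matrix, labels)
```

Applied to a 37-class matrix, the same ranking would list the sign pairs that are most often mistaken for one another.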

6.4. Ablation Study

To evaluate the impact of temporal context on recognition performance, an ablation study is conducted by varying the number of input frames used for each video sequence. Specifically, three configurations are considered: 10, 15, and 20 frames, uniformly sampled from each video.

The results, presented in Table 2, show a clear and consistent improvement in performance as the sequence length increases. When using 10 frames, the model achieves an accuracy of 85.20%, which increases to 87.10% with 15 frames, and reaches its highest value of 90.02% at 20 frames.

This improvement can be attributed to the richer temporal information captured when more frames are included, allowing the model to better capture motion dynamics and gesture transitions. The gain does not flatten within the evaluated range: accuracy rises by 1.90 percentage points from 10 to 15 frames and by 2.92 points from 15 to 20 frames, so these experiments do not establish where returns begin to diminish. Within the evaluated range, the best performance is achieved at 20 frames (90.02%). Because longer sequences also increase computational cost, selecting the sequence length remains a trade-off between temporal context and efficiency, particularly in data-constrained scenarios where both efficiency and generalization are critical.
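The accuracy increments implied by the sequence-length rows of Table 2 can be checked with a few lines of arithmetic. The accuracies are taken from Table 2; the function name `accuracy_gains` is a hypothetical helper.

```python
# Accuracy (%) by number of input frames, as reported in Table 2.
accuracy_by_frames = {10: 85.20, 15: 87.10, 20: 90.02}


def accuracy_gains(accuracy):
    """Return (from_frames, to_frames, gain) for consecutive settings."""
    lengths = sorted(accuracy)
    return [
        (short, long, round(accuracy[long] - accuracy[short], 2))
        for short, long in zip(lengths, lengths[1:])
    ]


gains = accuracy_gains(accuracy_by_frames)
```

Within this range the second increment (15 to 20 frames) is larger than the first (10 to 15 frames), so evaluating sequences longer than 20 frames would be needed to find where the curve levels off.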

To provide a deeper understanding of the contribution of each component within the proposed framework, the ablation study also analyzes three key design factors: the choice of temporal modeling architecture, the length of the input sequence, and the feature extraction strategy. Training conditions are kept identical across all experiments to ensure a fair comparison. The results indicate that recurrent architectures (RNN, LSTM, and GRU) achieve very similar performance levels, with the LSTM slightly outperforming the others and reaching the highest accuracy (90.02%). This suggests that these models are all capable of effectively capturing short- and mid-range temporal dependencies.

In contrast, the Transformer-based model exhibits lower performance (87.11%), even after extended training, which may be attributed to its higher data requirements and reduced generalization capability in low-resource settings. Furthermore, the analysis of sequence length confirms that increasing the number of frames improves performance within the evaluated range (10, 15, and 20 frames), with the best results obtained at 20 frames. This indicates that incorporating additional temporal context enhances recognition accuracy.

Additionally, the comparison between frozen and fine-tuned feature extraction in Table 2 shows that adapting the MobileNetV2 backbone improves accuracy from 86.50% to 88.00%, highlighting the importance of task-specific feature learning. Overall, these findings demonstrate that model performance is strongly influenced by the alignment between architectural complexity and data availability, and that increasing model complexity does not necessarily lead to better results in data-constrained scenarios.

6.5. Computational Efficiency Analysis

In addition to performance evaluation, a comparative analysis of computational efficiency was conducted to assess the trade-off between model complexity and runtime.

As shown in Table 3, all models share a common MobileNetV2 backbone, so their parameter counts are of the same order of magnitude (roughly 2.5 to 4.3 million). Differences in computational cost arise from the temporal modeling component. Recurrent architectures exhibit lower training and inference times (6.8 to 7.9 ms per inference), whereas the Transformer-based model introduces higher computational overhead (9.4 ms). These findings indicate that simpler temporal models provide a more favorable efficiency–performance trade-off, particularly in data-constrained scenarios.
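Because the backbone is shared, the parameter differences between the recurrent rows of Table 3 should come from the gate structure alone. The sketch below checks this with the standard single-bias parameter formulas. Every configuration value in it is an ASSUMPTION that is not stated in this section: a 1280-dimensional frame feature, 128 recurrent units, a 256-unit dense layer before the 37-way output, and 2,257,984 backbone parameters (the size of Keras MobileNetV2 without its classification top). It is a consistency check on the arithmetic, not a description of the authors' implementation.

```python
def recurrent_params(input_dim, units, gates):
    """Single-bias count: each gate has input, recurrent, and bias weights."""
    return gates * units * (input_dim + units + 1)


def dense_params(inputs, outputs):
    return inputs * outputs + outputs


# ASSUMED configuration (see the lead-in); none of these values is from the text.
BACKBONE = 2_257_984
FEATURES, UNITS, HIDDEN, CLASSES = 1280, 128, 256, 37
GATES = {"CNN + RNN": 1, "CNN + GRU": 3, "CNN + LSTM": 4}

head = dense_params(UNITS, HIDDEN) + dense_params(HIDDEN, CLASSES)
totals = {
    name: BACKBONE + recurrent_params(FEATURES, UNITS, gates) + head
    for name, gates in GATES.items()
}

# Totals reported in Table 3.
reported = {"CNN + RNN": 2_480_869, "CNN + GRU": 2_841_573, "CNN + LSTM": 3_021_925}
```

Under these assumed values the three totals match Table 3, and the step between rows equals one gate's worth of weights (180,352 parameters). Keras' GRU layer with its default `reset_after=True` stores a second bias per gate, which would add 384 parameters to the GRU count used here.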

All measurements were obtained under the same experimental conditions to ensure a fair comparison. The reported runtime values are indicative and are provided to ensure a consistent relative comparison across models.
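Since the runtime values are meant for relative comparison, it can help to express Table 3 as ratios against the lightest model. The numbers below are copied from Table 3; the function name `relative_cost` and the choice of CNN + RNN as baseline are assumptions made for this sketch.

```python
# (parameters, training time per epoch, inference time in ms) from Table 3.
COSTS = {
    "CNN + RNN": (2_480_869, 10.6, 6.8),
    "CNN + GRU": (2_841_573, 11.4, 7.3),
    "CNN + LSTM": (3_021_925, 12.1, 7.9),
    "CNN + Transformer": (4_270_181, 15.2, 9.4),
}


def relative_cost(costs, baseline):
    """Divide every column by the baseline model's value, rounded to 2 places."""
    base = costs[baseline]
    return {
        name: tuple(round(value / ref, 2) for value, ref in zip(values, base))
        for name, values in costs.items()
    }


ratios = relative_cost(COSTS, baseline="CNN + RNN")
```

Ratios are unit-free, so they remain meaningful even though the absolute timings depend on the hardware used.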

6.6. Discussion

The use of a subset limited to 37 classes reflects a deliberate design choice aligned with the targeted application scenario. The selected signs correspond to frequently used expressions in service-oriented environments, such as service stations, where communication typically involves a restricted set of essential interactions. In addition, the selection process was informed by consultation with domain experts, ensuring the practical relevance and validity of the chosen classes. While this reduced vocabulary enables a focused evaluation under data-constrained conditions, it does not capture the full complexity of large-scale sign language recognition tasks. Consequently, the reported results should be interpreted within this specific context, and further evaluation on larger and more diverse datasets would be necessary to assess broader generalization.

In particular, the reported quantitative performance cannot be directly generalized to datasets involving a larger number of classes or more diverse recording conditions.

The experimental results provide important insights into the role of temporal modeling for video-based sign language recognition under data-constrained conditions. Recurrent architectures consistently achieve strong and stable performance, demonstrating their effectiveness in capturing short to mid-range temporal dependencies. In contrast, increasing architectural complexity does not lead to significant performance gains, highlighting that model efficiency is more critical than complexity in limited data scenarios.

Within this controlled experimental setting, the observed trends are expected to generalize qualitatively to similar low-resource scenarios, where data availability remains limited.

Despite these promising results, several limitations should be acknowledged. First, the relatively small dataset size may restrict the generalization capability of all evaluated models, particularly for more complex architectures. Second, the use of a frozen pretrained feature extractor, while computationally efficient, may limit the adaptation of spatial features to the specific characteristics of sign language data. Third, the fixed-length sequence representation may lead to partial loss of temporal information, especially for longer or highly dynamic gestures.

In addition, attention-based models such as Transformers may require larger-scale datasets to fully exploit their capacity to model long-range dependencies.

Furthermore, certain sign classes with high visual similarity remain challenging to distinguish, as reflected in the confusion matrix. Finally, the results indicate that model performance is highly dependent on the alignment between architectural design and data availability, suggesting that more complex models are not necessarily better suited for constrained environments.

6.7. Comparison with WLASL-Based Methods

To ensure a fair and meaningful evaluation, the proposed approach is compared exclusively with state-of-the-art methods evaluated on the WLASL dataset. This choice avoids inconsistencies arising from differences in datasets, class distributions, and evaluation protocols, which can otherwise lead to misleading conclusions.

As shown in Table 4, the comparison should be interpreted with caution due to differences in dataset scale and task complexity.

The results indicate that the proposed approach achieves competitive performance within the context of the evaluated dataset. However, accuracy values are not directly comparable across different WLASL subsets, as the difficulty of the classification task increases with the number of classes.

In particular, methods evaluated on larger subsets such as WLASL100 or beyond inherently address a more complex classification problem compared to smaller subsets. Therefore, the reported accuracy of 90.02% on 37 classes should not be interpreted as a direct indication of superior performance over methods evaluated on larger class sets.

Instead, this comparison is intended to provide a contextual positioning of the proposed approach within the literature. The results demonstrate that the CNN–LSTM framework is capable of achieving strong performance under data-constrained conditions, while maintaining efficiency and stable training behavior.

Furthermore, Transformer-based approaches have shown strong performance on large-scale WLASL subsets, benefiting from their ability to model long-range dependencies. However, their effectiveness is closely tied to the availability of large training datasets and substantial computational resources. In contrast, the present results suggest that, in low-resource settings, recurrent architectures can provide a more favorable balance between performance, stability, and efficiency.

Overall, these observations highlight that model effectiveness depends not only on architectural design but also on the scale and characteristics of the dataset. In this context, the proposed approach offers a practical and efficient solution for sign language recognition under constrained data conditions.

7. Conclusions

This paper presented a comprehensive study of spatio-temporal modeling strategies for video-based sign language recognition under a unified experimental framework.

Rather than proposing a new architecture, the study focuses on providing a controlled and reproducible comparison of representative temporal modeling approaches under identical conditions.

A range of temporal architectures, including recurrent models (LSTM, GRU, and RNN) and a Transformer-based approach, were evaluated using a shared spatial feature extraction backbone based on MobileNetV2. The experimental results demonstrate that recurrent models consistently achieve high and comparable performance, with the CNN + LSTM configuration providing the best overall accuracy.

These results highlight the importance of aligning model complexity with data availability, particularly in low-resource settings.

The findings highlight that, despite the increasing popularity of attention-based architectures, recurrent models remain highly effective for sequence modeling in data-constrained environments. Furthermore, the results show that increasing model complexity through hybrid architectures does not necessarily improve performance, emphasizing the importance of simplicity and efficiency in model design.

This observation is particularly relevant for real-world deployment scenarios, where computational constraints and limited annotated data are common.

The main contributions of this work can be summarized as follows:

A unified experimental framework enabling fair comparison of multiple temporal modeling strategies;
A systematic evaluation of recurrent and Transformer-based architectures under identical conditions;
Empirical evidence demonstrating the effectiveness of recurrent models in limited data scenarios;
An in-depth analysis highlighting the limitations of Transformer-based approaches in small-scale datasets;
Practical insights for selecting efficient temporal models for real-world sign language recognition systems.

In particular, the study is grounded in an application-oriented setting, where the selected vocabulary reflects frequently used signs in service-oriented environments, ensuring the practical relevance of the evaluation.

Overall, this work provides valuable guidance for the design of efficient and robust sign language recognition systems, particularly in scenarios where data availability is limited.

While the findings are primarily applicable to low-resource and application-specific settings, they offer useful insights into the relationship between model design, data scale, and performance, which may inform future developments in the field.

It is important to note that the findings of this study are primarily validated in an extremely low-resource setting, characterized by a limited number of classes and samples per class. While the observed trends provide meaningful insights into the behavior of temporal modeling strategies under constrained conditions, they should be interpreted within this specific context. Further evaluation on larger and more diverse datasets is therefore necessary to assess the generalizability of these results and to determine whether the observed performance patterns hold in more data-rich scenarios.

The efficiency analysis further confirms that recurrent architectures provide a more favorable balance between performance and computational cost in data-constrained settings.

8. Future Work

Several directions can be explored to further improve the proposed framework.

First, the use of larger and more diverse datasets could enhance the performance of attention-based models such as Transformers, allowing them to fully exploit their capability to model long-range dependencies.

In particular, extending the experiments to larger subsets of WLASL or other sign language datasets would provide a more comprehensive evaluation of model generalization under varying data conditions.

Second, incorporating advanced data augmentation techniques and synthetic data generation could help improve model generalization, particularly for underrepresented sign classes.

In addition, self-supervised or pretraining strategies could be explored to better leverage unlabeled data and mitigate data scarcity.

Third, future work may investigate multimodal approaches by integrating additional information such as depth data or skeletal keypoints, which could provide complementary temporal cues.

Such multimodal representations may improve robustness, especially in challenging real-world conditions involving occlusions or background variability.

Fourth, extending the framework to continuous sign language recognition represents an important direction, as it involves more complex temporal dependencies and better reflects real-world communication scenarios.

This would allow the system to move beyond isolated sign recognition toward more natural and practical interaction settings.

Finally, optimizing lightweight architectures for real-time deployment represents an important direction, especially for practical applications in assistive technologies.

In this context, further investigation of the trade-off between model complexity, computational efficiency, and recognition performance would be particularly valuable.

In addition, future research will explore alternative temporal modeling approaches designed to improve efficiency and scalability, such as Temporal Convolutional Networks (TCN), InceptionTime architectures, and efficient Transformer variants including Linformer and Performer. A systematic evaluation of these architectures within the same controlled experimental framework, particularly in data-constrained scenarios, would provide deeper insight into the trade-offs between model complexity, computational cost, and performance.

Author Contributions

M.C.: Conceptualization, Methodology, Data Curation, Validation, Investigation, and Writing: original draft, Writing: review and editing. I.E.M.: Conceptualization, Methodology, Investigation, Writing: review and editing. M.A.S.: Conceptualization, Validation, Investigation, Writing: review and editing. Y.A.: Conceptualization, Investigation, Writing: review and editing. A.Y.: Conceptualization, Investigation, Writing: review and editing. A.A.: Conceptualization, Writing: review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are publicly available from: https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed (accessed on 15 January 2026).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, 2025 version) exclusively for language editing to improve clarity and readability. All scientific content, analysis, and conclusions were developed and validated by the authors, who take full responsibility for the final version of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SLR	Sign Language Recognition
CNN	Convolutional Neural Network
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
GRU	Gated Recurrent Unit
HCI	Human–Computer Interaction
WLASL	Word-Level American Sign Language Dataset
DL	Deep Learning
AI	Artificial Intelligence
FPS	Frames Per Second
RGB	Red, Green, Blue
ReLU	Rectified Linear Unit
FC	Fully Connected

References

1. Alzahrani, N.; Bchir, O.; Ben Ismail, M.M. Unified Spatiotemporal Detection for Isolated Sign Language Recognition Using YOLO-Act. Electronics 2025, 14, 4589.
2. Sandoval-Castaneda, M.; Li, Y.; Brentari, D.; Livescu, K.; Shakhnarovich, G. Self-supervised video transformers for isolated sign language recognition. arXiv 2023, arXiv:2309.02450.
3. Raki, M.T.A.; Izhar, A.; Japar, N.; Kai, M.C.Y. Transformer-Based Isolated Sign Language Recognition Pipeline with Mediapipe Hand-Pose Keypoints. SSRN 2025.
4. Lamaakal, I.; Yahyati, C.; Maleh, Y.; El Makkaoui, K.; Ouahbi, I. An explainable hybrid CNN–transformer model for sign language recognition on edge devices using adaptive fusion and knowledge distillation. Sci. Rep. 2026, 16, 7143.
5. Shawon, J.A.B.; Hasan, M.K.; Mahmud, H. A comparative analysis of video vision transformers on word-level sign language datasets. PLoS ONE 2026, 21, e0341909.
6. Talaat, F.M.; Hassan, B.M. A multistream attention based neural network for visual speech recognition and sign language understanding. Sci. Rep. 2025, 15, 44675.
7. Alkhoraif, A.A.; Alsulaiman, M.; Abdul, W.; Bencherif, M. Ensemble Transformer-based Word-Level Sign Language Recognition with Multi-Modal Input Fusion. J. Eng. Res. 2025, 14, 738–747.
8. Koller, O. Quantitative survey of the state of the art in sign language recognition. arXiv 2020, arXiv:2008.09918.
9. Pigou, L.; Dieleman, S.; Kindermans, P.J.; Schrauwen, B. Sign language recognition using convolutional neural networks. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2014; pp. 572–578.
10. Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1459–1469.
11. Huang, J.; Zhou, W.; Li, H.; Li, W. Sign language recognition using 3d convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), Turin, Italy, 29 June 2015–3 July 2015; IEEE: New York, NY, USA, 2015; pp. 1–6.
12. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Advances in neural information processing systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
14. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? arXiv 2021, arXiv:2102.05095.
15. Tong, Z.; Song, Y.; Wang, J.; Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093.
16. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846.
17. De Coster, M.; Van Herreweghe, M.; Dambre, J. Sign language recognition with transformer networks. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 13–15 May 2020; pp. 6018–6024.
18. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10023–10033.
19. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
22. Rastgoo, R.; Kiani, K.; Escalera, S. Sign language recognition: A deep survey. Expert Syst. Appl. 2021, 164, 113794.
23. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
24. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45.
25. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.

Figure 1. Distribution of samples per class.

Figure 2. Overview of the proposed spatio-temporal framework for sign language recognition, including spatial feature extraction using MobileNetV2 and temporal modeling using different sequence architectures.

Figure 3. Comparative accuracy of recurrent models (CNN + RNN, CNN + LSTM, CNN + GRU) across training epochs.

Figure 4. Training and validation accuracy and loss curves for the CNN + GRU model.

Figure 5. Training and validation accuracy and loss curves for the CNN + LSTM model.

Figure 6. Training and validation accuracy and loss curves for the CNN + RNN model.

Figure 7. Training and validation accuracy and loss curves for the Transformer-based model.

Figure 8. Confusion Matrix of the Proposed CNN–LSTM Model.

Table 1. Performance Comparison of Temporal Modeling Architectures (mean ± standard deviation over 5 runs).

Model	Accuracy (%)
CNN + LSTM	90.02 ± 0.21
CNN + RNN	88.89 ± 0.24
CNN + GRU	89.10 ± 0.27
CNN + Transformer	87.11 ± 0.35

Table 2. Ablation Study Results.

Configuration	Setting	Accuracy (%)
Temporal Model	CNN + RNN	88.89
	CNN + LSTM	90.02
	CNN + GRU	89.10
	CNN + Transformer	87.11
Sequence Length	10 frames	85.20
	15 frames	87.10
	20 frames	90.02
Feature Extraction	Frozen MobileNetV2	86.50
	Fine-tuned MobileNetV2	88.00

Table 3. Comparative analysis of model complexity and computational efficiency.

Model	Parameters	Training Time per Epoch	Inference Time (ms)
CNN + RNN	2,480,869	10.6	6.8
CNN + GRU	2,841,573	11.4	7.3
CNN + LSTM	3,021,925	12.1	7.9
CNN + Transformer	4,270,181	15.2	9.4

Table 4. Contextual comparison with related works on WLASL (interpret with caution due to varying number of classes).

Study	Architecture	WLASL Subset	Number of Classes	Accuracy (%)
Li et al. (2020) [10]	CNN + LSTM	WLASL100	100	85.3
De Coster et al. (2020) [17]	Transformer	WLASL100	100	83.0
Tong et al. (2022) [15]	VideoMAE (Transformer)	WLASL (large-scale)	100+	87–89
Bertasius et al. (2021) [14]	TimeSformer	WLASL (adapted)	100+	87–90
Our Approach	MobileNetV2 + LSTM	WLASL (subset)	37	90.02


