Article

Unified Spatiotemporal Detection for Isolated Sign Language Recognition Using YOLO-Act

by
Nada Alzahrani
1,2,*,
Ouiem Bchir
1 and
Mohamed Maher Ben Ismail
1
1
Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
2
Computer Science Department, College of Computer Engineering and Science, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4589; https://doi.org/10.3390/electronics14234589
Submission received: 23 October 2025 / Revised: 19 November 2025 / Accepted: 21 November 2025 / Published: 23 November 2025

Abstract

Isolated Sign Language Recognition (ISLR), which focuses on identifying individual signs from sign language videos, presents substantial challenges due to small and ambiguous hand regions, high visual similarity among signs, and large intra-class variability. This study investigates the adaptability of YOLO-Act, a unified spatiotemporal detection framework originally developed for generic action recognition in videos, when applied to large-scale sign language benchmarks. YOLO-Act jointly performs signer localization (identifying the person signing within a video) and action classification (determining which sign is performed) directly from RGB sequences, eliminating the need for pose estimation or handcrafted temporal cues. We evaluate the model on the WLASL2000 and MSASL1000 datasets for American Sign Language recognition, achieving Top-1 accuracies of 67.07% and 81.41%, respectively. The latter represents a 3.55% absolute improvement over the best-performing baseline without pose supervision. These results demonstrate the strong cross-domain generalization and robustness of YOLO-Act in complex multi-class recognition scenarios.

1. Introduction

Sign language is the primary mode of communication within the deaf community and is recognized as a natural language with its own syntax, grammar, and expressive richness [1,2]. Unlike spoken languages, sign languages are visual-gestural systems that rely on manual features such as hand motion, shape, and orientation, as well as non-manual cues including facial expressions and head movements [3,4]. These multimodal qualities make sign languages fundamentally distinct from spoken and written languages, which presents a unique challenge for computational modeling.
Automatic sign language recognition (SLR) has gained increasing attention due to its potential to bridge communication gaps between deaf and hearing individuals. SLR aims to translate sign language videos into textual representations and is generally divided into two subtasks: isolated SLR (ISLR) and continuous SLR (CSLR). On one hand, ISLR is a word-level recognition task that classifies individual signs. On the other hand, CSLR is a sentence-level task that recognizes unsegmented signing sequences. While CSLR is essential for full sentence translation, ISLR remains a crucial and challenging problem, as it demands fine-grained visual classification of individual signs. This study focuses on the challenges of isolated sign language recognition (ISLR) as a foundational step toward more complex continuous sign recognition tasks. Despite its narrower scope, word-level recognition often serves as the foundation for higher-level tasks such as sign spotting [5,6], video retrieval [7], sign language translation [3,8], and continuous recognition [9,10]. However, ISLR remains highly challenging due to several factors. The hand occupies only a small and often ambiguous visual region, lacking distinctive local features. As illustrated in Figure 1, hand gestures are prone to self-occlusion and motion blur, particularly during rapid movements or in cluttered scenes, which makes recognition difficult. Moreover, semantic ambiguity arises when different signs appear visually identical in static frames, such as “Sit” and “Chair” in Figure 2, both of which share similar initial handshapes and positions. Disambiguation in such cases depends on motion dynamics, such as single versus repeated movements or divergent trajectories. This requires models to capture temporal cues rather than relying solely on appearance. Finally, intra-class variability further complicates recognition, as different signers may perform the same sign with variations in hand position, trajectory, or amplitude, as shown in Figure 3, thereby reducing inter-signer consistency and hindering generalization.
Existing approaches often attempt to mitigate these issues by cropping hand regions or leveraging pose-based representations. However, such methods risk discarding important non-manual cues and remain sensitive to occlusions, resolution loss, and background clutter. In contrast, modeling the full person bounding box in RGB data preserves richer information, such as upper-body posture and facial expressions, that enhances robustness and disambiguation across signers.
To address these challenges, we propose to employ YOLO-Act [11], a lightweight and efficient pipeline for isolated sign language recognition based solely on RGB input. Unlike prior methods that rely on pose estimation, multi-modal fusion, or handcrafted features, YOLO-Act leverages a unified spatiotemporal detection pipeline operating directly on RGB frames. This design significantly simplifies deployment while maintaining strong performance on the WLASL and MSASL datasets. YOLO-Act emphasizes efficiency, robustness, and practicality [11], which positions it as a strong candidate for scalable, real-world SLR applications, while also offering new insights into the balance between accuracy and computational efficiency in spatiotemporal modeling.
The main contribution of this study is the adaptation of the YOLO-Act framework to the domain of isolated sign language recognition (ISLR). The proposed approach employs joint signer localization and word-level classification directly from RGB video sequences, thereby eliminating the need for pose estimation or handcrafted temporal cues. Furthermore, comprehensive evaluations on large-scale benchmarks, WLASL and MS-ASL, confirm the model’s robustness and generalization, positioning YOLO-Act as a strong baseline for unified spatiotemporal sign recognition.

2. Related Work

2.1. Sign Language Recognition

Sign language recognition (SLR) is one of the core tasks in sign language understanding and has drawn considerable interest in recent years [12,13,14,15,16]. Based on the input modality, existing research can generally be divided into two groups: (i) RGB-based techniques, which use RGB video as input, and (ii) pose-based techniques, which use skeleton sequences as input.

2.1.1. RGB-Based Input Modality

SLR methods based on RGB input have evolved significantly over the years. Early research focused on designing handcrafted features such as HOG, SIFT, and motion trajectories to capture the visual cues of hand shapes and body movements during signing [9,10,17,18,19,20]. These traditional approaches, while foundational, were limited in their ability to generalize across different signers or complex visual environments.
With the rise of deep learning in computer vision, convolutional neural networks (CNNs) became a popular choice for modeling visual features in SLR. Various architectures have been explored, including 2D CNNs for spatial modeling and 3D CNNs for capturing spatiotemporal information from video sequences [12,21,22,23,24]. Some studies combined 2D CNNs with sequence models such as HMMs to incorporate temporal dynamics into the recognition process [13].
More recent developments have seen the integration of deeper temporal models. For instance, 3D CNNs have been adopted for their ability to model spatiotemporal dependencies [7,16,25]. These models can better represent complex signing sequences but often come with increased computational cost due to their reliance on full video processing.

2.1.2. Pose-Based Input Modality

In addition to the RGB-based methods described above, numerous studies have investigated pose-based methods. The pose modality is a high-level and concise representation of human action that encodes the physical connections between skeleton joints [26,27]. Recurrent neural networks such as the GRU [28] and LSTM [29] have been used to model the temporal information of keypoint sequences [30,31,32]. Other CNN-based works convert the input keypoint sequence into a feature map and employ standard CNNs to capture spatiotemporal dynamics [26,33].
Action recognition tasks have seen a rise in the popularity of graph convolutional networks (GCNs) for modeling the structured nature of human pose data. The first spatial-temporal GCN was introduced by Yan et al. [8], who constructed graphs in which nodes represent human keypoints and edges capture their physical connections. This method allows efficient modeling of the spatiotemporal dynamics of actions. The robust performance of GCN-based methods in extracting semantic representations from pose sequences has been further demonstrated in subsequent works [5,26,32,34,35]. Furthermore, the authors of [36] introduced a hybrid model that integrates Transformers with GCNs to more effectively capture spatial-temporal dependencies in sign language recognition.

2.2. CNN-Based Methods for Sign Language Recognition

With the rise of deep learning, CNN-based approaches have increasingly replaced conventional approaches based on handcrafted descriptors in SLR. The authors of [37] proposed a deep end-to-end architecture that integrates temporal convolutions with bidirectional recurrence, substantially enhancing frame-wise gesture recognition compared to static single-frame models. Building on this, the study in [6] introduced a recurrent 3D-CNN capable of detecting and classifying dynamic hand gestures in unsegmented streams through connectionist temporal classification. The authors of [38] further explored multimodal inputs by designing a 3D-CNN algorithm that leverages both depth and intensity data, demonstrating the effectiveness of multimodal fusion. Wu et al. [39] extended this idea with Deep Dynamic Neural Networks, combining Deep Belief Networks for skeleton features with 3D-CNNs for RGB-D inputs. Furthermore, the authors of [40] proposed a CNN architecture for identifying sign language by extracting skeletal features of the hands and body from RGB video frames captured by a conventional webcam. The model uses hierarchical feature extraction to capture complex spatial patterns of hand shapes and movements, in contrast to traditional vision-based methods that rely on skin segmentation or handcrafted features. The system performs well under varying lighting and background conditions and supports real-time recognition, making it suitable for human–computer interaction.
A CNN-based image classification approach was proposed in [41] for recognizing alphabet signs from RGB hand-gesture images. The model uses a standard convolutional neural network trained end-to-end to classify static hand postures representing letters of the alphabet. It operates directly on raw RGB images without relying on pose estimation or additional preprocessing steps, offering a simple and effective baseline for sign recognition. Compared with other CNN-based methods that incorporate hand segmentation or hybrid feature extraction, this approach demonstrates competitive accuracy but remains sensitive to variations in lighting, background, and hand appearance.
Moreover, a real-time RGB-based CNN model was proposed in [42] for recognizing hand gestures in the American Sign Language alphabet. The system integrates skin segmentation and convex-hull analysis to extract the hand region before applying a lightweight CNN classifier trained on cropped RGB images. This approach achieves high accuracy and efficient real-time performance, demonstrating the effectiveness of CNNs for visual gesture recognition. However, the reliance on handcrafted preprocessing and a small dataset constrains its scalability to large-vocabulary or continuous sign language recognition tasks.
More recent work has focused on addressing large-vocabulary challenges. For instance, Zuo et al. [43] introduced Natural Language Assisted Sign Language Recognition (NLA-SLR). The NLA-SLR framework is a semantically guided CNN model designed to enhance sign language recognition by leveraging linguistic information from gloss annotations. The approach addresses the challenge of visually indistinguishable signs (VISigns), which share similar visual appearances but have different meanings, making them difficult for conventional vision-based neural networks to differentiate. The framework integrates natural language modeling into the recognition process through two key innovations: language-aware label smoothing and inter-modality mix-up. The first generates soft labels based on semantic similarity among glosses, allowing the model to learn smoother decision boundaries for similar signs, while the second fuses visual and textual features to strengthen the model’s discriminative capability in the latent space. The architecture comprises three main components: (i) a preprocessing stage that produces video–keypoint pairs, where a video–keypoint pair refers to a video clip together with its associated sequence of detected keypoints; (ii) a video–keypoint network (VKNet) that extracts complementary features from RGB videos and keypoint heatmaps; and (iii) a head network that applies the semantic regularization strategies. Trained on large-scale datasets such as WLASL, MS-ASL, and NMFs-CSL, NLA-SLR achieves competitive results by effectively aligning linguistic semantics with visual representations to improve recognition accuracy and generalization.
Overall, CNN-based approaches demonstrate strong capability in modeling spatial and spatiotemporal features for SLR. However, they are often computationally expensive (particularly 3D CNNs) and may rely on multimodal data not always accessible in real-world deployment scenarios.

2.3. SLR Based on Transformer

Transformer-based architectures have recently gained traction in SLR, particularly in pre-training frameworks designed to capture semantic representations through self-supervised tasks. Several studies [44,45] employ masking-and-reconstruction strategies to model contextual dependencies, achieving strong performance in sign language understanding (SLU). The SignBERT series [44,45], for example, leverages large-scale unlabeled data and self-supervised objectives to enhance representation capability, although these approaches primarily focus on low-level visual semantics and often underutilize textual knowledge, limiting their transferability to tasks such as sign language translation (SLT).
Subsequent work has extended Transformer-based pre-training with additional pretext objectives. For instance, MSLU [46] introduces a multimodal pre-training framework that enhances sign language understanding through joint modeling of visual and linguistic information. The framework integrates keypoint reconstruction to capture fine-grained motion and articulation details. This enables the model to learn both manual and non-manual components of sign language. It employs a sign pose encoder, a multilingual text encoder, and a pose decoder trained via multi-task objectives, including masked pose modeling and sign–text contrastive learning. This structure enables the network to align pose dynamics with semantic representations from gloss text, improving cross-modal understanding. Leveraging a large-scale dataset (SL-1.5M) containing approximately 1.5 million text-labeled sign samples, MSLU demonstrates strong generalization across isolated and continuous recognition, translation, and retrieval tasks. Similarly, C2RL [47] enhances textual alignment through integrated language modeling to bridge visual–semantic gaps. Although these methods demonstrate the potential of multimodal supervision, they remain constrained by the limited availability of paired gloss text datasets. Furthermore, the Uni-Sign [48] framework introduces a large-scale, Transformer-based model that unifies multiple sign language understanding tasks under a single generative formulation. The model is trained on the newly proposed CSL-News dataset, which includes 1985 h of Chinese Sign Language videos paired with text, providing the scale needed for effective pre-training. Uni-Sign operates in three stages: pose-only pre-training, RGB-pose interaction pre-training, and unified fine-tuning. It processes 69 keypoints grouped into hand, body, and face sub-poses using spatial graph convolutional networks, with RGB hand crops encoded through EfficientNet. A Prior-Guided Fusion (PGF) module aligns pose and RGB modalities using keypoint coordinates as priors, while a score-aware sampling strategy reduces redundancy by focusing on low-confidence frames. Unlike prior CNN or task-specific models, Uni-Sign adopts a language-modeling objective for both pre-training and fine-tuning, treating isolated recognition, continuous recognition, and translation uniformly as text generation tasks.
In parallel, Transformers have also been adapted to pose sequences. BEST [49] organizes hand and body poses into structured triplet units and introduces a coupling tokenization mechanism, enabling masked unit modeling to reconstruct missing motion patterns. By combining NLP-inspired objectives with pose-specific encoding, BEST demonstrates the value of adapting Transformer pre-training to domain specific structures in sign language.
Despite these advances, Transformer-based methods often depend on large-scale multimodal corpora and gloss text resources, which limits their scalability across languages and datasets.

2.4. Discussion

To incorporate temporal information, many related works employ 3D CNNs [7,16,25]. These models capture spatiotemporal dependencies by jointly processing consecutive frames, but they are computationally expensive and struggle to model long-range temporal dynamics effectively. Other pipelines often integrate pose-based methods, where GRUs, LSTMs, or GCNs [5,8,28] model skeleton key point sequences to encode temporal information. While efficient, these approaches sacrifice essential non-manual cues such as posture and facial expressions, reducing their ability to disambiguate visually similar glosses. Furthermore, traditional descriptors like HOG, SIFT, and motion trajectories rely heavily on handcrafted features and fixed motion patterns, limiting robustness in complex real-world signing conditions.
Recent computer vision research has also investigated 3D vision technologies for automated structural inspection, including crack detection and damage assessment using self-developed robotic systems [50,51]. These studies highlight the broader applicability of vision-based recognition methods across diverse real-world domains. Moreover, detection-driven frameworks continue to demonstrate the importance of robust spatial localization and multi-branch feature processing in complex visual understanding tasks. For example, the ConTriNet model proposed in [52] introduces a triple-flow architecture for RGB–Thermal salient object detection, effectively separating modality-specific cues and enabling stronger spatial representation learning. Such detection-based strategies highlight the value of precise spatial feature extraction, which is also employed in YOLO-Act for robust signer localization and accurate keyframe-based temporal recognition.
More recently, Transformer-based pre-training frameworks such as SignBERT [44,45], MSLU [46], and Uni-Sign [48] have shown strong performance by leveraging masked modeling, multimodal fusion, and gloss–text alignment. However, these methods rely on large, paired datasets and suffer high computational costs. This restricts their scalability across sign languages and deployment in resource-constrained environments. In contrast, YOLO-Act [11] leverages deep detection to automatically learn spatial cues (e.g., hand trajectory, body posture, facial orientation) and temporal dynamics (e.g., motion progression) directly from RGB data. By combining signer localization, keyframe-based temporal abstraction, and confidence-based late fusion, our framework achieves robust generalization across diverse signers and gloss categories while maintaining lightweight efficiency. Table 1 compares CNN-based and Transformer-based methods for isolated sign language recognition, emphasizing differences in input modalities and recognition performance across datasets, while highlighting the trade-offs between spatiotemporal modeling capability and computational efficiency.
As shown in Table 1, Transformer-based models such as Uni-Sign [48] have recently outperformed earlier CNN-based methods like NLA-SLR [43]. Nevertheless, CNN architectures remain highly effective for large-scale datasets such as WLASL and MS-ASL, offering stable convergence and strong spatial feature representation. In this work, YOLO-Act employs a CNN-based backbone to achieve a balance between accuracy and computational efficiency, providing a robust foundation for large-vocabulary sign recognition.

3. The YOLO-Act Model

YOLO-Act, introduced in [11], is a unified spatiotemporal action detection framework that extends YOLOv8 to video-based action recognition. The framework integrates detection, tracking, and temporal reasoning in a single pipeline. Its design allows efficient processing of keyframes while maintaining strong recognition performance across dynamic actions.
In this study, we provide a comprehensive evaluation of YOLO-Act on large-scale sign language datasets (WLASL [53] and MSASL [54]), demonstrating that although YOLO-Act is a model originally developed for action recognition, it can generalize effectively to the domain of sign language recognition. This establishes a strong baseline for future research.
The overall YOLO-Act architecture is illustrated in Figure 4. The model consists of four main components: (i) keyframe extraction, (ii) action detection, (iii) class fusion for temporal consistency, and (iv) actor tracking for spatial continuity. YOLO-Act [11] begins with action detection and tracking using YOLOv8 to ensure consistent localization across frames. From each signing sequence, three representative keyframes are extracted, corresponding to the beginning, middle, and end of the gloss, to preserve the temporal structure. Predictions from these frames are then combined through a confidence-based late fusion strategy, which enhances classification accuracy and robustness by leveraging agreement across temporal snapshots.
To avoid ambiguity regarding the scope and contribution of this work, we clarify that the objective of this study is not to introduce a new architectural model but to adapt, analyze, and evaluate an existing action-detection framework (YOLO-Act) within the specialized domain of isolated sign language recognition (ISLR). Unlike generic human action detection, ISLR involves highly fine-grained gloss distinctions, subtle and rapid hand-centric motion, signer-focused spatial–temporal contexts, and large, long-tailed vocabularies. These characteristics require domain-specific adjustments in preprocessing, class formulation, and training and evaluation protocols. The contribution of this work therefore lies in demonstrating how an action-detection architecture behaves when transferred to the ISLR domain, establishing the first YOLO-based RGB-only baseline for ISLR, and providing empirical insights into the strengths, limitations, and error patterns of such models under fine-grained signer-centric conditions.
Additionally, we clarify potential confusion with other models bearing similar names. The YOLO-Act model used in this study is entirely unrelated to YOLACT (“You Only Look At CoefficienTs”) by Bolya et al. [55]. YOLACT is an instance segmentation framework that generates prototype masks combined with per-instance coefficient vectors, whereas YOLO-Act is an action-detection model designed for joint human localization and temporal action classification. The two models differ fundamentally in purpose, architecture, task formulation, and output representation—YOLACT produces pixel-level instance masks, while YOLO-Act outputs bounding boxes with action labels over short spatiotemporal windows. Apart from the coincidental similarity in naming, there is no architectural or conceptual connection between the two approaches.

3.1. Keyframe Extraction

The number of frames considered directly influences the trade-off between computational efficiency and temporal reasoning. A single frame is sufficient for localizing an actor, but inadequate for action classification. Conversely, using too many frames increases redundancy and processing cost. To address this, YOLO-Act [11] employs a first-middle-last frame selection strategy. By selecting three representative frames at the beginning, middle, and end of each action sequence, the model captures critical temporal dynamics without redundancy. This method ensures that transitions are preserved while remaining computationally efficient.
As mentioned before, in YOLO-Act, three keyframes per action (the first, middle, and last) are deliberately selected to represent the temporal evolution of each gesture. This strategy captures the critical phases of an action: initiation, peak (or steady state), and conclusion. Despite using a limited number of frames, this approach preserves sufficient temporal information to enable effective action recognition.
The choice of three frames balances computational efficiency with predictive performance. Processing fewer frames reduces inference time, memory usage, and computational complexity, making YOLO-Act suitable for real-time applications. By focusing on the most informative moments of an action, the model avoids redundancy in video frames and learns more discriminative temporal features. Furthermore, in many practical scenarios, key actions can be characterized by their start, mid-point, and end dynamics, making this three-frame uniform temporal sampling strategy effective for most action types.
In this work, the identification of the signer’s initial frame does not rely on explicit motion-estimation techniques (e.g., optical flow or frame differencing). Instead, the beginning of the signing interval is inferred implicitly from the continuous localization trajectory produced by the YOLOv8 detector and tracker. Once the tracker forms a stable sequence of signer detections across consecutive frames with consistent bounding boxes and sufficient confidence, the earliest frame in this stable track is selected as the first keyframe. This ensures that keyframe extraction begins only after reliable signer localization is achieved.
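For illustration, the following minimal Python sketch shows how the first, middle, and last keyframes could be drawn from a stable signer track. The track representation (a list of per-frame detections with a confidence score) and the confidence threshold are assumptions made for this example, not implementation details specified in [11].

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    frame_idx: int      # index of the frame in the video
    confidence: float   # detector confidence for the signer box

def select_keyframes(track: List[Detection], min_conf: float = 0.5) -> List[int]:
    """Pick first, middle, and last frame indices from a stable signer track.

    The 'stable' portion is approximated here as the detections whose
    confidence exceeds `min_conf`; the threshold is an illustrative choice.
    """
    stable = [d.frame_idx for d in track if d.confidence >= min_conf]
    if not stable:
        raise ValueError("no stable signer detections in this track")
    first, last = stable[0], stable[-1]
    middle = stable[len(stable) // 2]
    return [first, middle, last]

# Example: a short clip where the signer is reliably detected from frame 3 onward.
track = [Detection(i, 0.3 if i < 3 else 0.9) for i in range(40)]
print(select_keyframes(track))   # -> [3, 21, 39]
```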

3.2. Action Detection

Action detection requires both actor localization and action classification. YOLO-Act leverages YOLOv8 [56] as the detection backbone, benefiting from its dynamic label assignment and strong detection accuracy. Unlike region-based approaches such as Faster R-CNN, YOLOv8 processes the full frame directly, allowing it to exploit contextual cues without discarding environmental information.

3.3. Class Fusion

Since each keyframe is classified independently, YOLO-Act [11] applies a late fusion strategy to aggregate predictions into a final action label. Confidence scores from the three detectors are combined according to their temporal order. Let $p(C_i^n \mid F_n)$ denote the probability that the keyframe $F_n$ is the $n$-th keyframe of action $i$; it is predicted by the $n$-th YOLOv8 model as the confidence score of $F_n$ with respect to the class $C_i^n$. Similarly, let $p(C_i \mid F_1, F_2, \ldots, F_N)$ be the probability that the action represented by the $N$ consecutive keyframes $F_1, F_2, \ldots, F_N$ belongs to class $C_i$.
To classify the action, we employ a late fusion technique [57], in which the confidence scores from individual keyframes are combined using a multiplicative rule. Specifically, the final probability $p(C_i \mid F_1, F_2, \ldots, F_N)$ that an action sequence belongs to class $C_i$ is computed as the product of the individual keyframe probabilities $p(C_i^n \mid F_n)$, as shown in Equation (1). This approach assumes that the predictions for each keyframe are conditionally independent, and it emphasizes classes that are consistently supported across all keyframes. Formally,
$$p(C_i \mid F_1, F_2, \ldots, F_N) = \prod_{n=1}^{N} p(C_i^n \mid F_n) \quad (1)$$
This multiplicative fusion strategy enhances robustness by reducing the influence of isolated misclassifications: if a keyframe incorrectly predicts an action class with low confidence, its effect on the overall prediction becomes minimal. Consequently, only action classes consistently supported across multiple keyframes achieve a high final confidence score.
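A minimal sketch of the multiplicative rule in Equation (1) is shown below. The per-keyframe score dictionaries are illustrative; in the actual pipeline they would come from the stage-specific YOLOv8 detectors, and the stage-specific classes (e.g., mother_1, mother_2, mother_3) are collapsed to their gloss label here for simplicity. The small epsilon for missing classes is an assumption made for numerical handling, not part of the original formulation.

```python
import math
from typing import Dict, List, Tuple

def fuse_confidences(per_frame_scores: List[Dict[str, float]],
                     eps: float = 1e-9) -> Tuple[str, float]:
    """Multiplicative late fusion over keyframe predictions (Equation (1)).

    Each element of `per_frame_scores` maps a gloss label to the confidence
    assigned for that keyframe. Log-space summation is used only for
    numerical stability; it is equivalent to the product rule.
    """
    classes = set().union(*per_frame_scores)
    fused = {}
    for c in classes:
        # Classes absent from a frame receive a small epsilon rather than zero,
        # an illustrative choice so one missing prediction does not zero the product.
        fused[c] = math.exp(sum(math.log(frame.get(c, eps))
                                for frame in per_frame_scores))
    best = max(fused, key=fused.get)
    return best, fused[best]

# Example: "mother" is consistently supported across the three keyframes,
# whereas "woman" is competitive only in the ambiguous first frame.
scores = [{"mother": 0.55, "woman": 0.45},
          {"mother": 0.70, "woman": 0.25},
          {"mother": 0.80, "woman": 0.15}]
print(fuse_confidences(scores))  # -> ('mother', ~0.308)
```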

3.4. Actor Tracking

To ensure consistent localization across frames, YOLO-Act integrates the YOLOv8 [56] tracking module. The tracker associates detected actors across frames through appearance embeddings and motion cues, propagating bounding boxes across consecutive frames. This reduces identity switches and guarantees that the same actor is consistently localized during the action sequence.
In this work, the YOLOv8 tracker is applied prior to the first, middle, and last keyframe selection process to ensure stable and continuous signer localization across consecutive frames. The tracker associates YOLOv8 detections into a single coherent signing track, from which the three keyframes are extracted. Its role is limited to providing consistent bounding boxes for keyframe sampling. It does not propagate identity across sparse keyframes, nor does it contribute directly to the classification stage. By combining tracking with keyframe-based detection, the framework maintains both spatial accuracy and temporal coherence, enabling robust recognition of dynamic activities.
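Using the Ultralytics API, the tracking step that precedes keyframe selection might look like the sketch below. The weight file, video path, confidence threshold, and person-class filter are illustrative assumptions; the trained sign-language detector itself is not reproduced here.

```python
# Minimal sketch of signer tracking with the Ultralytics YOLOv8 tracker.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                      # hypothetical backbone weights
results = model.track(source="sign_clip.mp4",   # hypothetical input clip
                      persist=True,             # keep track IDs across frames
                      classes=[0],              # COCO class 0 = person (the signer)
                      conf=0.5)

# Collect one signer box per frame; this per-frame track can then feed the
# first/middle/last keyframe selection step described in Section 3.1.
signer_boxes = []
for frame_idx, r in enumerate(results):
    if r.boxes.id is not None and len(r.boxes) > 0:
        best = int(r.boxes.conf.argmax())       # highest-confidence person box
        signer_boxes.append((frame_idx,
                             r.boxes.xyxy[best].tolist(),
                             float(r.boxes.conf[best])))
```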

4. Experiments

This section presents a detailed evaluation of the proposed YOLO-Act model for isolated sign language recognition (ISLR), focusing on both spatial localization and temporal understanding using only RGB video. The experiments were conducted on two large-scale benchmarks: Word-Level American Sign Language (WLASL) [53] and Microsoft American Sign Language (MSASL) [54], each offering diverse signers and vocabularies to evaluate the model’s generalization across sign glosses and users. Both WLASL and MS-ASL benchmarks are among the largest and most comprehensive datasets for American Sign Language recognition. These datasets include diverse signers, complex backgrounds, and varied recording conditions, making them well suited for assessing model generalization and robustness. In contrast, smaller datasets such as ASLLVD [58] offer limited scale and variability, reducing their suitability for large-vocabulary sign recognition evaluation.
WLASL is the most recent large-scale benchmark for ISLR, with a vocabulary size of 2000 glosses. It was collected from diverse, unconstrained sources to capture vocabulary richness and signer variability. The dataset is divided into 14,289 training samples, 3916 validation samples, and 2878 test samples, totaling more than 21,000 annotated video clips. To facilitate evaluation, common experimental subsets include WLASL100 [53], which restricts evaluation to 100 glosses, and WLASL2000 [53], which covers the full vocabulary. The dataset poses challenges such as significant intra-class variation, signer diversity, and visually similar glosses that require strong temporal reasoning for disambiguation.
MSASL provides a complementary resource with a vocabulary size of 1000 glosses and over 25,000 annotated video samples. The data are contributed by more than 200 unique signers across varied environments and conditions, ensuring broad coverage of signing styles and visual contexts. The dataset is partitioned into 16,054 training samples, 5287 validation samples, and 4172 test samples. Standard evaluation splits include MSASL100 and MSASL1000, corresponding to 100-gloss and full 1000-gloss benchmarks.
Both datasets are highly challenging due to heterogeneous signing speeds, subtle temporal transitions, and intra-class variation. Their scale and diversity make them robust testbeds for evaluating the temporal modeling, localization, and generalization capabilities of spatiotemporal detection frameworks such as YOLO-Act.
Although each video in WLASL and MSASL is annotated with a single gloss label, the temporal boundaries of the signing action may not perfectly align with the start and end of the video. Consequently, some clips may include idle or transitional frames outside the active signing interval. This is consistent with standard ISLR settings, where the annotated gloss corresponds to the central signing segment rather than the entire sequence. To address this, the proposed YOLO-Act framework initiates keyframe extraction only after detecting the signer’s first movement and designates this as the initial (beginning) frame. This procedure ensures that the selected beginning, middle, and end keyframes represent active signing intervals and minimizes the inclusion of non-signing frames. These keyframes are localized using the YOLOv8 detection and tracking module. To illustrate the labeling approach, Figure 5 shows three representative sign glosses segmented into beginning, middle, and end frames. For example, the gloss “Across” is divided into across_1, across_2, and across_3, corresponding to the temporal evolution of the sign. Similar segmentation is applied to other glosses such as “Adverb” and “All Day”. Each keyframe is annotated with a bounding box around the signer, ensuring consistent localization while capturing the temporal progression of the gesture.
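To make the labeling scheme concrete, the sketch below expands each gloss into its three stage-specific class names and writes one YOLO-format label line per keyframe. The normalized box coordinates, file layout, and function names are illustrative assumptions rather than the exact data-preparation scripts used in this work.

```python
from pathlib import Path

def stage_classes(glosses):
    """Expand each gloss into three stage-specific classes, e.g. across_1..across_3."""
    return [f"{g}_{stage}" for g in glosses for stage in (1, 2, 3)]

def write_keyframe_label(label_dir: Path, video_id: str, stage: int,
                         class_index: int, box_xywhn):
    """Write one YOLO-format label line (class cx cy w h, normalized to [0, 1])."""
    x, y, w, h = box_xywhn
    label_dir.mkdir(parents=True, exist_ok=True)
    (label_dir / f"{video_id}_kf{stage}.txt").write_text(
        f"{class_index} {x:.6f} {y:.6f} {w:.6f} {h:.6f}\n")

# Example: the gloss "across" segmented into across_1, across_2, across_3.
classes = stage_classes(["across", "adverb", "all_day"])
print(classes[:3])   # -> ['across_1', 'across_2', 'across_3']
write_keyframe_label(Path("labels"), "vid0001", stage=1,
                     class_index=classes.index("across_1"),
                     box_xywhn=(0.50, 0.55, 0.40, 0.80))  # illustrative signer box
```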
Following prior works [43,48], we evaluate ISLR performance using two standard metrics: Per-Instance Top-1 Accuracy (P-I) and Per-Class Top-1 Accuracy (P-C). P-I measures the overall accuracy across all test instances, reflecting how well the model performs on average. P-C calculates the average accuracy across all sign classes, providing insight into the model’s ability to generalize across frequent and rare signs alike. These complementary metrics allow for a more comprehensive assessment of recognition performance in imbalanced sign language datasets like WLASL and MSASL. P-I and P-C Top-1 accuracies are calculated as follows:
$$P\text{-}I = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(y_i = \hat{y}_i)$$
$$P\text{-}C = \frac{1}{C} \sum_{c=1}^{C} \left( \frac{1}{N_c} \sum_{i \in c} \mathbf{1}(y_i = \hat{y}_i) \right)$$
where $N$ is the total number of test samples, $C$ is the number of classes, and $N_c$ is the number of samples in class $c$. $y_i$ and $\hat{y}_i$ denote the ground-truth and predicted gloss labels, respectively, and $\mathbf{1}(\cdot)$ is the indicator function, which equals 1 if the prediction is correct and 0 otherwise.
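Both metrics can be computed directly from the predicted and ground-truth label arrays, as in the short sketch below (NumPy is assumed; the toy labels are illustrative).

```python
import numpy as np

def top1_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Per-instance (P-I) and per-class (P-C) Top-1 accuracy."""
    correct = (y_true == y_pred)
    p_i = float(correct.mean())                       # average over all instances
    per_class = [correct[y_true == c].mean()          # average within each class
                 for c in np.unique(y_true)]
    p_c = float(np.mean(per_class))                   # average over classes
    return p_i, p_c

# Toy example with an imbalanced two-class test set.
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 1, 1])
print(top1_metrics(y_true, y_pred))   # -> (0.8, 0.875)
```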
The experiments were conducted on a system running Windows 11 equipped with 64 GB RAM, an NVIDIA GeForce RTX 4080 SUPER GPU, and an Intel i9-14900F CPU. The software platform consisted of Torch 2.4.0 with CUDA 12.6, managed by Anaconda3. The dataset was divided into 60% for training, 20% for validation, and 20% for testing. Table 2 summarizes the training hyperparameters used across all experiments. The model was optimized using AdamW with a learning rate of 0.0001, a cosine learning rate decay schedule, and a 0.1 dropout rate. A warm-up phase of 5 epochs was used, and early stopping was applied with a patience of 15 epochs. All models were trained with a batch size of 16 for up to 150 epochs.
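Assuming the Ultralytics training interface, the hyperparameters in Table 2 might translate into a call such as the one below. The dataset YAML path and the model weights are hypothetical placeholders; the sketch only illustrates how the listed settings map onto standard training arguments.

```python
# Illustrative training call mapping Table 2's hyperparameters onto the
# Ultralytics API; "islr_keyframes.yaml" and the weight file are hypothetical.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="islr_keyframes.yaml",  # hypothetical dataset config (keyframes + signer boxes)
    epochs=150,                  # maximum number of epochs
    batch=16,                    # batch size
    optimizer="AdamW",           # optimizer
    lr0=1e-4,                    # initial learning rate
    cos_lr=True,                 # cosine learning-rate decay schedule
    dropout=0.1,                 # dropout rate
    warmup_epochs=5,             # warm-up phase
    patience=15,                 # early-stopping patience
)
```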
During training and validation, the classification loss curves are monitored as illustrated in Figure 6. Both curves converge smoothly after approximately 80 epochs, with no signs of overfitting, suggesting strong generalization capability.

4.1. Comparison with the State of the Arts

To assess the effectiveness of the proposed YOLO-Act framework, we conduct a series of experiments on ISLR benchmarks [53,54]. The proposed YOLO-Act-based sign language recognition model is benchmarked against state-of-the-art approaches for isolated sign language recognition (ISLR) on the WLASL [53] and MSASL [54] datasets in order to demonstrate its ability to generalize from action recognition to sign language recognition. As shown in Table 3, the proposed model outperforms previous approaches across multiple evaluation settings.
The WLASL and MSASL datasets are typically evaluated under different vocabulary scales to measure model generalization. MSASL100 and WLASL100 are smaller subsets containing the 100 most frequent and visually distinct signs, designed to assess performance in low-vocabulary settings. In contrast, MSASL1000 and WLASL2000 include 1000 and 2000 unique glosses, respectively, offering a broader vocabulary and greater signer diversity. These larger subsets introduce higher intra-class variation and more visually similar signs, making them substantially more challenging. Evaluating both small and large vocabularies enables a comprehensive assessment of the model’s scalability and robustness to increasing recognition complexity. As illustrated in Table 3, while accuracy decreases with larger vocabulary sizes, this trend is consistent across most sign language recognition models due to the high inter-class similarity and long-tailed class distributions present in datasets such as WLASL and MS-ASL. Nevertheless, YOLO-Act maintains competitive performance in these challenging settings using only RGB input, demonstrating strong generalization and scalability without reliance on pose information or gloss-text supervision.
On the MSASL1000 benchmark, YOLO-Act achieves a Top-1 accuracy of 81.41%, surpassing prior methods such as MSLU [46] and Uni-Sign [48]. On WLASL2000, the proposed model achieves 67.07%, representing a 3.55% absolute improvement over the best-performing baseline without relying on pose supervision. This improvement can be attributed to YOLO-Act’s end-to-end spatiotemporal design, which learns directly from RGB sequences without relying on pose supervision or large-scale language model pretraining, unlike Uni-Sign and MSLU, which depend heavily on pose-based and text-driven pretraining.
Although WLASL2000 is more challenging due to its large-scale multi-class setting, the proposed approach delivers the highest performance gain on this dataset, demonstrating strong robustness to complex sign language recognition tasks. In fact, the proposed YOLO-Act-based approach does not require task-specific architectural tuning or modality-specific adaptation. In contrast, Uni-Sign employs a large generative language model with prior-guided fusion between pose and RGB features, and MSLU integrates multi-stream language modeling and keypoint reconstruction, both of which increase complexity and limit real-time performance. Instead, YOLO-Act benefits from its unified detection and recognition framework and robust spatiotemporal fusion strategy. The achieved results across both small-scale (WLASL100, MSASL100) and large-scale (WLASL2000, MSASL1000) settings highlight the scalability and generalization capacity of YOLO-Act. Additionally, YOLO-Act’s use of three connected frames provides an enriched temporal context that enables the model to disambiguate signs sharing nearly identical handshapes and initial positions. This design proves particularly effective for closely related signs such as “mother” and “woman”, both of which begin with the hand placed at the chin and therefore appear visually indistinguishable in isolated frames. As illustrated in Figure 7, the confidence scores across three consecutive frames demonstrate that YOLO-Act generally assigns higher probability to the correct class by leveraging temporal evolution, capturing the outward extension for “mother” versus the downward descent to the chest for “woman”. This temporal reasoning reduces misclassification and underscores the robustness of YOLO-Act in distinguishing visually overlapping signs.
By incorporating three strategically sampled keyframes, YOLO-Act captures the temporal evolution that differentiates these signs in a structured manner. While the early stage of both signs appears visually similar, the subsequent frames expose divergent trajectories: “mother” progresses outward from the chin, whereas “woman” transitions downward toward the chest. These temporal cues reveal critical directional changes and endpoint locations. As shown in Figure 7, this enriched temporal context enables YOLO-Act to correctly separate signs that static-frame approaches would struggle with, demonstrating the effectiveness of three-frame fusion in resolving fine-grained ambiguities inherent in sign language recognition.
The confidence-based fusion strategy further reinforces this effect by selecting the class with the highest product of confidence scores across all three frames. This multiplicative scheme ensures that even if one frame has a lower score, the overall decision strongly favors classes that maintain consistent support across the sequence. By reducing reliance on any single, potentially ambiguous frame and instead leveraging agreement through combined confidence, the approach yields more robust and semantically accurate predictions. These findings emphasize the importance of multi-frame temporal modeling in disambiguating visually similar glosses and improving overall recognition performance. This temporal reasoning also gives YOLO-Act a key advantage over Uni-Sign and MSLU, which primarily rely on pose-based static representations and lack fine-grained temporal tracking of signer motion, making them less effective in differentiating closely related signs. Overall, YOLO-Act’s gains over Uni-Sign and MSLU arise from its unified detection-recognition pipeline that couples spatial localization and temporal evolution directly at the feature level, eliminating the pose dependence present in prior frameworks.
Moreover, Figure 8 presents a performance breakdown for the 15 most confused glosses in our experiments. The stacked bar chart illustrates the proportion of correctly (green) and incorrectly (red) classified samples for each of these challenging classes. As shown, most misclassifications occur between signs that are visually or semantically similar. For example, glosses such as cry and sad are often confused because both involve similar handshapes and movements near the face, while eat and food share nearly identical motion patterns. This analysis confirms that YOLO-Act successfully distinguishes the majority of glosses, although fine-grained visual similarities remain the primary source of recognition errors. Overall, the plot provides an interpretable and quantitative view of model performance, highlighting the classes that may benefit from further refinement.

4.2. Ablation Study: Effect of Bounding Box Cropping

To assess the role of spatial context in sign recognition, an ablation study is performed. Specifically, the effect of Bounding Box Cropping (Hand vs. Human) is investigated. In this regard, we compare two input configurations on the WLASL100 benchmark: (i) cropping the full human bounding box, and (ii) cropping only the localized hand regions. In the hand-cropping experiment, the bounding boxes were manually annotated using the LabelImg tool [59]. This approach allowed precise localization of the hand regions across frames and provided a reliable basis for comparing hand-only and full-human cropping strategies. As shown in Table 4, restricting the input to hands alone leads to a substantial drop in accuracy, with P-I decreasing from 93.77% to 59.01% and P-C from 94.32% to 60.12%. Although the hands are the primary articulators in sign language, removing the broader context discards critical cues such as arm trajectory, facial orientation, and upper-body posture, which are essential for disambiguating visually similar glosses. In contrast, full human bounding boxes preserve these complementary features, enabling the model to capture both fine-grained hand motion and larger-scale temporal dynamics. These findings highlight that accurate sign recognition requires modeling not only the hands but also the broader spatial context of the signer.

4.3. Ablation Study: Impact of Multi-Stage Supervision

Another ablation study was performed to investigate the effectiveness of employing multi-stage supervision. For this purpose, we trained models on individual stage-specific subsets (Action1, Action2, Action3) and compared them with a fused model trained jointly on all stages. As presented in Table 5, the fused multi-stage model consistently outperforms its stage-specific counterparts. This improvement is particularly evident for signs characterized by gradual or complex temporal transitions, where isolating a single stage fails to capture the full motion dynamics. By integrating signals from the beginning, middle, and end of each sign, the fused training scheme enhances temporal representation learning and provides more robust generalization across diverse signing styles. These findings confirm that multi-stage sub-action data plays a critical role in boosting recognition accuracy and mitigating ambiguity in glosses with overlapping handshapes or partial motions.
The results in Table 5 highlight the diagnostic role of the stage-specific experiments. Each stage-specific model represents an incomplete portion of a sign and consequently exhibits lower accuracy. When all temporal segments are integrated through multi-stage supervision, the complementary motion cues across the beginning, middle, and end phases collectively form a more complete and discriminative representation of the sign, confirming the necessity of temporal fusion for accurate recognition.

4.4. Ablation Study: Number of Keyframes

To investigate how temporal sampling density influences recognition performance, we conducted an ablation study using varying numbers of uniformly selected keyframes. We adopt the formulation
$$K = 2^{n} + 1$$
where n denotes the number of recursive splits and K the resulting number of keyframes. We experimented with configurations n = 0, 1, 2, 3. When n = 0, no splits are applied and two keyframes are extracted (at frames 0 and N), where N is the total number of frames. When n = 1, one split is introduced at the midpoint, yielding three keyframes at frames 0, N/2, and N. Higher values of n divide the video into additional segments, producing five and nine keyframes. This formulation allows us to systematically assess how sampling density affects temporal representativeness and computational efficiency. As shown in Table 6, relying on only two keyframes (start and end) fails to capture key articulatory changes, particularly mid-sign motions necessary for distinguishing visually similar glosses. Increasing the number of keyframes to five or nine yields a denser sampling strategy; however, many of these frames correspond to nearly identical poses or transitional phases, providing minimal additional discriminative information while incurring additional computation. In contrast, selecting three uniformly spaced keyframes (corresponding to the first, middle, and last frames) achieves the optimal balance between temporal completeness and efficiency. This sampling strategy captures the onset, peak articulation, and concluding phases of each sign without introducing temporal redundancy. The results in Table 6 confirm that three keyframes provide a compact yet sufficiently informative temporal representation for isolated sign language recognition.
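The relationship between the number of recursive splits $n$ and the resulting keyframe positions can be made concrete with a short sketch. The uniform placement over the clip length shown here is our reading of the formulation above; the clip length of 65 frames is purely illustrative.

```python
def keyframe_indices(num_frames: int, n_splits: int):
    """Uniformly place K = 2**n_splits + 1 keyframes over a clip."""
    k = 2 ** n_splits + 1
    last = num_frames - 1                      # index of the final frame
    return [round(i * last / (k - 1)) for i in range(k)]

for n in range(4):
    print(n, keyframe_indices(num_frames=65, n_splits=n))
# 0 [0, 64]                            -> start and end only
# 1 [0, 32, 64]                        -> first, middle, last (YOLO-Act default)
# 2 [0, 16, 32, 48, 64]                -> five keyframes
# 3 [0, 8, 16, 24, 32, 40, 48, 56, 64] -> nine keyframes
```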

5. Conclusions

This study provided a comprehensive evaluation of the YOLO-Act framework for isolated sign language recognition on large-scale benchmarks, WLASL and MSASL, emphasizing its spatial localization and temporal modeling capabilities using only RGB video. Our experiments demonstrate that YOLO-Act achieves state-of-the-art performance across small- and large-scale settings, with Top-1 accuracies of 81.41% on MSASL1000 and 67.07% on WLASL2000, highlighting its strong generalization and scalability without relying on pose supervision or gloss annotations. Ablation studies reveal the critical role of full human bounding boxes in preserving spatial context and the effectiveness of multi-stage supervision in capturing temporal dynamics, particularly for visually similar glosses. These findings confirm that YOLO-Act’s combination of keyframe selection, temporal fusion, and enriched spatial context enables accurate isolated sign language recognition. It demonstrates that accurate sign recognition requires modeling both fine-grained hand motion and holistic spatiotemporal cues.

Author Contributions

Conceptualization, N.A. and O.B.; Methodology, N.A.; Validation, N.A.; Formal analysis, N.A.; Investigation, N.A.; Writing—original draft, N.A.; Writing—review & editing, O.B. and M.M.B.I.; Visualization, O.B.; Supervision, O.B. and M.M.B.I.; Project administration, O.B. and M.M.B.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study is publicly available. WLASL is available at https://github.com/dxli94/WLASL (accessed on 20 November 2025) and MS-ASL is available at https://www.microsoft.com/en-us/research/project/ms-asl/ (accessed on 20 November 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Adaloglou, N.; Chatzis, T.; Papastratis, I.; Stergioulas, A.; Papadopoulos, G.T.; Zacharopoulou, V.; Xydopoulos, G.J.; Atzakas, K.; Papazachariou, D.; Daras, P. A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition. IEEE Trans. Multimed. 2022, 24, 1750–1762. [Google Scholar] [CrossRef]
  2. Sandler, W.; Lillo-Martin, D.C. Sign Language and Linguistic Universals; Cambridge University Press: Cambridge, UK, 2006; ISBN 978-0-521-48248-6. [Google Scholar]
  3. Rastgoo, R.; Kiani, K.; Escalera, S. Sign Language Recognition: A Deep Survey. Expert Syst. Appl. 2021, 164, 113794. [Google Scholar] [CrossRef]
  4. Yin, A.; Zhao, Z.; Jin, W.; Zhang, M.; Zeng, X.; He, X. MLSLT: Towards Multilingual Sign Language Translation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5099–5109. [Google Scholar]
  5. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1109–1118. [Google Scholar]
  6. Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215. [Google Scholar]
  7. Huang, J.; Zhou, W.; Li, H.; Li, W. Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2822–2832. [Google Scholar] [CrossRef]
  8. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  9. Koller, O.; Forster, J.; Ney, H. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers. Comput. Vis. Image Underst. 2015, 141, 108–125. [Google Scholar] [CrossRef]
  10. Evangelidis, G.D.; Singh, G.; Horaud, R. Continuous Gesture Recognition from Articulated Poses. In Computer Vision—ECCV 2014 Workshops; Agapito, L., Bronstein, M.M., Rother, C., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; Volume 8925, pp. 595–607. ISBN 978-3-319-16177-8. [Google Scholar]
  11. Alzahrani, N.; Bchir, O.; Ismail, M.M.B. YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences. Sensors 2025, 25, 3013. [Google Scholar] [CrossRef] [PubMed]
  12. Hu, H.; Zhou, W.; Li, H. Hand-Model-Aware Sign Language Recognition. Assoc. Adv. Artif. Intell. 2021, 35, 1558–1566. [Google Scholar] [CrossRef]
  13. Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs. Int. J. Comput. Vis. 2018, 126, 1311–1325. [Google Scholar] [CrossRef]
  14. Momeni, L.; Bull, H.; Prajwal, K.R.; Albanie, S.; Varol, G.; Zisserman, A. Automatic Dense Annotation of Large-Vocabulary Sign Language Videos; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
  15. Niu, Z.; Mak, B. Stochastic Fine-Grained Labeling of Multi-State Sign Glosses for Continuous Sign Language Recognition. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12361, pp. 172–186. ISBN 978-3-030-58516-7. [Google Scholar]
  16. Li, D.; Yu, X.; Xu, C.; Petersson, L.; Li, H. Transferring Cross-Domain Knowledge for Video Sign Language Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6204–6213. [Google Scholar]
  17. Buehler, P.; Everingham, M.; Zisserman, A. Learning Sign Language by Watching TV (Using Weakly Aligned Subtitles). In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  18. Yasir, F.; Prasad, P.W.C.; Alsadoon, A.; Elchouemi, A. SIFT Based Approach on Bangla Sign Language Recognition. In Proceedings of the 2015 IEEE 8th International Workshop on Computational Intelligence and Applications (IWCIA), Hiroshima, Japan, 6–7 November 2015; pp. 35–39. [Google Scholar]
  19. Starner, T.E. Visual Recognition of American Sign Language Using Hidden Markov Models. Master’s Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995. [Google Scholar]
  20. Starner, T.; Weaver, J.; Pentland, A. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Trans. Pattern Anal. Machine Intell. 1998, 20, 1371–1375. [Google Scholar] [CrossRef]
  21. Selvaraj, P.; NC, G.; Kumar, P.; Khapra, M. OpenHands: Making Sign Language Recognition Accessible with Pose-Based Pretrained Models across Languages. arXiv 2021, arXiv:2110.05877. [Google Scholar] [CrossRef]
  22. Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. Multi-Fiber Networks for Video Recognition. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11205, pp. 364–380. ISBN 978-3-030-01245-8. [Google Scholar]
  23. Qiu, Z.; Yao, T.; Mei, T. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542. [Google Scholar]
  24. Qiu, Z.; Yao, T.; Ngo, C.-W.; Tian, X.; Mei, T. Learning Spatio-Temporal Representation with Local and Global Diffusion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12048–12057. [Google Scholar]
  25. Albanie, S.; Varol, G.; Momeni, L.; Afouras, T.; Chung, J.S.; Fox, N.; Zisserman, A. BSL-1K: Scaling up Co-Articulated Sign Language Recognition Using Mouthing Cues. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  26. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-Occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018. [Google Scholar]
  27. Ng, E.; Ginosar, S.; Darrell, T.; Joo, H. Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11860–11869. [Google Scholar]
  28. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [Google Scholar]
  29. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  30. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. Assoc. Adv. Artif. Intell. 2017, 31, 4263–4270. [Google Scholar] [CrossRef]
  31. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks. Assoc. Adv. Artif. Intell. 2016, 30, 3697–3703. [Google Scholar] [CrossRef]
  32. Yong, D.; Wang, W.; Wang, L. Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
  33. Cao, C.; Lan, C.; Zhang, Y.; Zeng, W.; Lu, H.; Zhang, Y. Skeleton-Based Action Recognition with Gated Convolutional Neural Networks. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3247–3257. [Google Scholar] [CrossRef]
  34. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7784–7793. [Google Scholar]
  35. Min, Y.; Zhang, Y.; Chai, X.; Chen, X. An Efficient PointLSTM for Point Clouds Based Gesture Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5760–5769. [Google Scholar]
  36. Tunga, A.; Nuthalapati, S.V.; Wachs, J. Pose-Based Sign Language Recognition Using GCN and BERT. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 5–9 January 2021; pp. 31–40. [Google Scholar]
  37. Pigou, L.; Van Den Oord, A.; Dieleman, S.; Van Herreweghe, M.; Dambre, J. Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video. Int. J. Comput. Vis. 2018, 126, 430–439. [Google Scholar] [CrossRef]
  38. Molchanov, P.; Gupta, S.; Kim, K.; Kautz, J. Hand Gesture Recognition with 3D Convolutional Neural Networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 1–7. [Google Scholar]
  39. Wu, D.; Pigou, L.; Kindermans, P.-J.; Le, N.D.-H.; Shao, L.; Dambre, J.; Odobez, J.-M. Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1583–1597. [Google Scholar] [CrossRef]
  40. Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Sign Language Recognition Based on Hand and Body Skeletal Data. In Proceedings of the 2018—3DTV-Conference: The True Vision—Capture, Transmission and Display of 3D Video (3DTV-CON), Helsinki, Finland, 3–5 June 2018; pp. 1–4. [Google Scholar]
  41. Daroya, R.; Peralta, D.; Naval, P. Alphabet Sign Language Image Classification Using Deep Learning. In Proceedings of the TENCON 2018—2018 IEEE Region 10 Conference, Jeju, Republic of Korea, 28–31 October 2018; pp. 646–650. [Google Scholar]
  42. Taskiran, M.; Killioglu, M.; Kahraman, N. A Real-Time System for Recognition of American Sign Language by Using Deep Learning. In Proceedings of the 2018 41st International Conference on Telecommunications and Signal Processing (TSP), Athens, Greece, 4–6 July 2018; pp. 1–5. [Google Scholar]
  43. Zuo, R.; Wei, F.; Mak, B. Natural Language-Assisted Sign Language Recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  44. Hu, H.; Zhao, W.; Zhou, W.; Wang, Y.; Li, H. SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11067–11076. [Google Scholar]
  45. Hu, H.; Zhao, W.; Zhou, W.; Li, H. Signbert+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11221–11239. [Google Scholar] [CrossRef]
  46. Zhou, W.; Zhao, W.; Hu, H.; Li, Z.; Li, H. Scaling up Multimodal Pre-Training for Sign Language Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 11753–11767. [Google Scholar] [CrossRef]
  47. Chen, Z.; Zhou, B.; Huang, Y.; Wan, J.; Hu, Y.; Shi, H.; Liang, Y.; Lei, Z.; Zhang, D. C2 RL: Content and Context Representation Learning for Gloss-Free Sign Language Translation and Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8533–8544. [Google Scholar] [CrossRef]
  48. Li, Z.; Zhou, W.; Zhao, W.; Wu, K.; Hu, H.; Li, H. Uni-Sign: Toward Unified Sign Language Understanding at Scale. arXiv 2025, arXiv:2501.15187. [Google Scholar] [CrossRef]
  49. Zhao, W.; Hu, H.; Zhou, W.; Shi, J.; Li, H. BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization. Assoc. Adv. Artif. Intell. 2023, 37, 3597–3605. [Google Scholar] [CrossRef]
  50. Dai, R.; Wang, R.; Shu, C.; Li, J.; Wei, Z. Crack Detection in Civil Infrastructure Using Autonomous Robotic Systems: A Synergistic Review of Platforms, Cognition, and Autonomous Action. Sensors 2025, 25, 4631. [Google Scholar] [CrossRef] [PubMed]
  51. Yuan, C.; Xiong, B.; Li, X.; Sang, X.; Kong, Q. A Novel Intelligent Inspection Robot with Deep Stereo Vision for Three-Dimensional Concrete Damage Detection and Quantification. Struct. Health Monit. 2022, 21, 788–802. [Google Scholar] [CrossRef]
  52. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef]
  53. Li, D.; Opazo, C.R.; Yu, X.; Li, H. Word-Level Deep Sign Language Recognition from Video: A New Large-Scale Dataset and Methods Comparison. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1448–1458. [Google Scholar]
  54. Joze, H.R.V.; Koller, O. MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language. arXiv 2018, arXiv:1812.01053. [Google Scholar] [CrossRef]
  55. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9156–9165. [Google Scholar]
  56. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics (Version 8.0.0). 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 3 October 2024).
  57. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  58. Athitsos, V.; Neidle, C.; Sclaroff, S.; Nash, J.; Stefan, A.; Yuan, Q.; Thangali, A. The American Sign Language Lexicon Video Dataset. In Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  59. Tzutalin. LabelImg (Free Software: MIT License). 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 15 May 2025).
Figure 1. Example of small and ambiguous hand regions in SLR.
Figure 2. Semantic ambiguity in visually similar signs. Static frames of “Sit” and “Chair” appear nearly identical, requiring temporal dynamics for disambiguation.
Figure 3. Intra-class variability across signers. The same sign can differ in trajectory or position.
Figure 4. Block diagram of the YOLO-Act approach. The red bounding box shows the tracked signer after YOLOv8 tracking.
Figure 5. Sample sign gloss keyframes with temporal segmentation. (ac) Gloss “Across”: (a) Beginning, (b) Middle, (c) End. (df) Gloss “Adverb”: (d) Beginning, (e) Middle, (f) End. (gi) Gloss “All Day”: (g) Beginning, (h) Middle, (i) End. Each frame includes a bounding box surrounding the signer performing the sign. Red boxes indicate the detected signer bounding boxes.
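Figure 5 shows each gloss clip represented by keyframes from three temporal stages (Beginning, Middle, End). The snippet below is a minimal sketch of such a three-way temporal split; the equal-thirds partition is an illustrative assumption, not necessarily the paper's exact segmentation rule.

```python
# Minimal sketch: split a clip's frame indices into Beginning/Middle/End stages,
# as illustrated in Figure 5. The equal-thirds split is an assumption for
# illustration only.
from typing import Dict, List

def split_into_stages(num_frames: int) -> Dict[str, List[int]]:
    """Partition frame indices 0..num_frames-1 into three temporal stages."""
    third = num_frames // 3
    return {
        "beginning": list(range(0, third)),
        "middle": list(range(third, 2 * third)),
        "end": list(range(2 * third, num_frames)),
    }

# Example: a 100-frame gloss clip.
stages = split_into_stages(100)
print({k: (v[0], v[-1]) for k, v in stages.items()})
# {'beginning': (0, 32), 'middle': (33, 65), 'end': (66, 99)}
```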
Figure 6. Training and validation classification loss trends. (a) Training classification loss, (b) Validation classification loss.
Figure 7. Illustration of YOLO-Act’s predictions of the sign gloss “Mother” and the sign gloss “Woman”. Bounding boxes are shown around the signer, labeled with the predicted gloss, temporal stage, and confidence score. (a) Start of the gloss Mother, (b) middle of the gloss Mother, (c) end of the gloss Mother, (d) start of the gloss Woman, (e) middle of the gloss Woman, (f) end of the gloss Woman.
Figure 8. Performance on the 15 most confused glosses from the WLASL2000 test set. Each bar represents the total number of test samples for that gloss, divided into correct (green) and incorrect (red) predictions.
Table 1. Overview of deep learning approaches for isolated sign and gesture recognition, illustrating variations in input modalities and recognition performance across datasets.
| Category | Reference | Input Modality | Architecture/Approach | Dataset | Performance |
| SLR based on CNN | [6] | RGB | CNN with a recurrent extension | SKIG, ChaLearn2014 | Acc = 97.7%, 98.2% |
| | [38] | RGB-D | 3D-CNN | VIVA | Acc = 77.5% |
| | [39] | RGB-D + Pose | 3D-CNN + Gaussian–Bernoulli DBN | ChaLearn LAP | Jaccard index = 0.81 |
| | [43] | RGB + Pose | S3D backbone + CNN-based keypoint encoding | MSASL1000, WLASL2000 | Acc = 72.56%, 61.05% |
| | [37] | RGB | CNN spatial features + LSTM/RNN temporal dynamics | ChaLearn LAP | Jaccard index = 0.906 |
| SLR based on Transformer | [44] | Pose | Transformer encoder–decoder backbone | MSASL1000, WLASL2000 | Acc = 67.96%, 52.08% |
| | [45] | Pose | Transformer encoder + hand-model-aware decoder | MSASL1000, WLASL2000 | Acc = 62.42%, 48.85% |
| | [46] | Pose | Transformer pre-training (contrastive + masked modeling) | MSASL1000, WLASL2000 | Acc = 74.07%, 56.29% |
| | [48] | RGB + Pose | Unified Transformer pre-training with GCN-based pose encoder | MSASL1000, WLASL2000 | Acc = 78.16%, 63.52% |
| | [49] | Pose | BERT-style Transformer with GCN pose embedding | MSASL1000, WLASL2000 | Acc = 71.21%, 54.59% |
Table 2. Training configurations.
| Hyperparameter | Value |
| Optimizer | AdamW |
| Optimizer momentum | 0.937 |
| Weight decay | 0.0005 |
| Learning rate schedule | Cosine decay |
| Warmup epochs | 5.0 |
| Dropout | 0.1 |
| Learning rate | 0.0001 |
| Batch size | 16 |
| Epochs | 150 (early stopping triggered at epoch 80) |
| Early stopping patience | 15 |
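For reference, the Table 2 settings map directly onto the training arguments exposed by the Ultralytics YOLOv8 interface [56]. The snippet below is a minimal sketch under that assumption; the backbone checkpoint and dataset YAML names are placeholders, not the paper's actual files.

```python
# Sketch of launching training with the Table 2 hyperparameters via the
# Ultralytics YOLOv8 API [56]. Checkpoint and dataset YAML are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # hypothetical choice of backbone size
model.train(
    data="wlasl2000_keyframes.yaml",  # hypothetical dataset config
    optimizer="AdamW",       # Table 2: optimizer
    momentum=0.937,          # Table 2: optimizer momentum
    weight_decay=0.0005,     # Table 2: weight decay
    cos_lr=True,             # Table 2: cosine-decay learning-rate schedule
    warmup_epochs=5.0,       # Table 2: warmup epochs
    dropout=0.1,             # Table 2: dropout
    lr0=0.0001,              # Table 2: initial learning rate
    batch=16,                # Table 2: batch size
    epochs=150,              # Table 2: epoch budget
    patience=15,             # Table 2: early stopping patience
)
```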
Table 3. Performance comparison of YOLO-Act with state-of-the-art methods on the MSASL and WLASL datasets. Best results are shown in bold and second-best are underlined; ✓ marks the input modality used.
| Method | Pose | RGB | MSASL100 P-I | MSASL100 P-C | MSASL1000 P-I | MSASL1000 P-C | WLASL100 P-I | WLASL100 P-C | WLASL2000 P-I | WLASL2000 P-C |
| ST-GCN [8] | ✓ | – | 50.78 | 51.62 | 34.40 | 32.53 | 50.78 | 51.62 | 34.40 | 32.53 |
| SignBERT [44] | ✓ | – | 76.09 | 76.65 | 49.54 | 46.39 | 76.36 | 77.68 | 39.40 | 36.74 |
| SignBERT+ [45] | ✓ | – | 84.94 | 85.23 | 62.42 | 60.15 | 79.84 | 80.72 | 48.85 | 46.37 |
| MSLU [46] | ✓ | – | 91.54 | 91.75 | 74.07 | 71.81 | 88.76 | 89.25 | 56.29 | 53.29 |
| BEST [49] | ✓ | – | 89.56 | 90.08 | 71.21 | 68.24 | 81.01 | 81.63 | 54.59 | 52.12 |
| NLA-SLR [43] | ✓ | ✓ | 90.49 | 91.04 | 72.56 | 69.86 | 91.47 | 92.17 | 61.05 | 58.05 |
| Uni-Sign [48] | ✓ | ✓ | 93.79 | 94.02 | 78.16 | 76.97 | 92.25 | 92.67 | 63.52 | 61.32 |
| YOLO-Act (Ours) | – | ✓ | 95.04 | 95.49 | 81.41 | 79.13 | 93.77 | 94.32 | 67.07 | 64.59 |
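P-I and P-C in Tables 3–5 conventionally denote per-instance (overall Top-1) and per-class (macro-averaged) accuracy in this benchmark literature. The sketch below shows one way to compute both metrics under that assumption.

```python
# Sketch of the two accuracy metrics in Table 3, assuming the usual convention:
# P-I is overall Top-1 accuracy; P-C averages the per-gloss Top-1 accuracies,
# so rare glosses weigh as much as common ones.
from collections import defaultdict
from typing import List

def per_instance_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy example: class 0 has many samples, class 1 only one.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(per_instance_accuracy(y_true, y_pred))  # 0.8
print(per_class_accuracy(y_true, y_pred))     # (1.0 + 0.0) / 2 = 0.5
```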
Table 4. Effect of Bounding Box Cropping (Hand vs. Human) on WLASL100.
| Input Crop | P-I (%) | P-C (%) |
| Hand Bounding Box | 59.01 | 60.12 |
| Human Bounding Box | 93.77 | 94.32 |
| Δ (Human − Hand) | +34.76 | +34.20 |
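Table 4 suggests that cropping the full signer region is far more informative than cropping the hands alone. The snippet below is a minimal sketch of cropping a frame to a tracked person box before classification, assuming OpenCV-style BGR frames; the (x1, y1, x2, y2) pixel box format and the output size are illustrative assumptions.

```python
# Sketch of cropping a frame to the tracked signer's bounding box (the
# "Human Bounding Box" setting in Table 4). Box format (x1, y1, x2, y2) in
# pixels is an assumption; the frame is a standard H x W x 3 BGR array.
import cv2
import numpy as np

def crop_to_signer(frame: np.ndarray, box: tuple, out_size: int = 640) -> np.ndarray:
    """Clip the box to the frame, crop, and resize for the detector input."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(w, int(x2)), min(h, int(y2))
    crop = frame[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size))

# Usage with a hypothetical tracker output:
# frame = cv2.imread("keyframe.jpg")
# signer_crop = crop_to_signer(frame, (120, 40, 520, 700))
```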
Table 5. Impact of multi-stage supervision on WLASL2000.
| Training Scheme | P-I (%) | P-C (%) |
| Stage-specific (Action1) | 12.87 | 11.60 |
| Stage-specific (Action2) | 17.26 | 15.64 |
| Stage-specific (Action3) | 15.76 | 14.09 |
| Fused multi-stage (Action1 + 2 + 3) | 67.07 | 64.59 |
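Table 5 indicates that no single temporal stage is discriminative on its own, while fusing all three stages recovers the full 67.07% per-instance accuracy. The sketch below shows one plausible clip-level fusion rule, accumulating detection confidences per gloss across the stage keyframes; it is an illustrative assumption, not necessarily the exact fusion rule used by YOLO-Act.

```python
# One plausible way to fuse per-stage detections into a clip-level gloss
# prediction: sum detection confidences per gloss across the beginning/middle/
# end keyframes and take the argmax. Illustrative assumption only.
from collections import defaultdict
from typing import List, Tuple

def fuse_stage_detections(dets: List[Tuple[str, str, float]]) -> str:
    """dets: (gloss, stage, confidence) tuples from the three temporal stages."""
    score = defaultdict(float)
    for gloss, _stage, conf in dets:
        score[gloss] += conf
    return max(score, key=score.get)

# Example: detections from the three stages of one clip.
dets = [("mother", "start", 0.71), ("woman", "middle", 0.40), ("mother", "end", 0.66)]
print(fuse_stage_detections(dets))  # "mother"
```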
Table 6. Ablation study on the number of selected keyframes using uniform temporal sampling.
| No. of Frames | Sampled Frame Indices (example 100-frame clip) |
| 2 | 0, 99 |
| 3 | 0, 49, 99 |
| 5 | 0, 25, 50, 75, 99 |
| 9 | 0, 12, 25, 37, 50, 62, 75, 87, 99 |
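The keyframe indices in Table 6 follow uniform temporal sampling over a 100-frame clip. The sketch below reproduces this selection with evenly spaced indices; depending on the rounding convention, some middle indices can differ by one frame from those listed in the table.

```python
# Sketch of uniform temporal sampling used in the Table 6 ablation: pick k
# evenly spaced frame indices from a clip, always including the first and
# last frames. Middle indices may differ by one frame from Table 6 depending
# on the rounding rule.
import numpy as np

def uniform_keyframe_indices(num_frames: int, k: int) -> list:
    return np.linspace(0, num_frames - 1, num=k).round().astype(int).tolist()

for k in (2, 3, 5, 9):
    print(k, uniform_keyframe_indices(100, k))
# 2 [0, 99]
# 3 [0, 50, 99]
# 5 [0, 25, 50, 74, 99]
# 9 [0, 12, 25, 37, 50, 62, 74, 87, 99]
```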
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
