1. Introduction
Arabic sign language (ArASL) is the primary medium of communication among Deaf individuals and persons with hearing loss in Arabic-speaking societies. Despite its importance, the development of robust machine learning-based models capable of automatically recognizing and interpreting ArASL remains a significant research challenge. Many current ArASL recognition systems focus on static hand gestures and are built entirely on visual signs extracted from images or videos. These vision-only approaches typically rely on convolutional neural networks (CNNs) or recurrent and transformer-based sequence models and have achieved acceptable performance in detecting the corresponding letters.
However, these models work as closed-set classifiers and can detect only the gesture classes observed during training. This limitation is critical in the ArASL domain, where labeled datasets are limited and many sign letters share similar visual patterns. As a result, existing approaches do not scale well to new letters and lack semantic awareness. In addition, off-the-shelf vision–language models are not designed for fine-grained sign language gestures and require task-specific alignment to capture subtle hand-shape differences.
Multimodal learning offers a promising direction to address these limitations by linking visual representations with linguistic information. Inspired by CLIP-style contrastive learning, multimodal approaches align visual features with textual descriptions in a shared embedding space. This formulation enables recognition through image–text similarity and supports zero-shot prediction. In the context of ArASL recognition, where each sign letter is associated with linguistic descriptions, such alignment allows the use of semantic information that is not captured by visual features alone. Despite its potential, multimodal alignment has received limited attention for static ArASL recognition, which remains a low-resource research area where semantic information can provide additional benefits beyond pure visual modeling.
To explore the role of multimodal alignment in ArASL recognition, we introduce CLIP-ArASL, a lightweight CLIP-style approach that jointly learns visual and textual representations for static ArASL signs. The proposed approach integrates an EfficientNet-B0 image encoder [1] with a multilingual MiniLM text encoder [2] and aligns their outputs using a hybrid training objective that combines supervised classification and contrastive alignment. This design supports zero-shot evaluation, in which unseen gesture classes are recognized by matching image embeddings to their corresponding textual descriptions.
This paper provides three main contributions:
A lightweight multimodal approach for isolated static ArASL sign recognition that aligns hand gesture images with bilingual textual descriptions in a shared embedding space, enabling semantic grounding beyond visual-only classification.
A task-specific training strategy that combines cross-entropy and contrastive alignment to support supervised learning while maintaining semantic consistency.
A class-level zero-shot evaluation for ArASL, where 20% of gesture classes are excluded during the training phase and recognized at test time using image–text similarity, demonstrating semantic generalization to unseen classes.
The rest of this manuscript is structured as follows. Section 2 reviews previous work on ArASL recognition and multimodal learning. Section 3 describes the proposed CLIP-ArASL approach, including input data, multimodal encoding, shared embedding space, and model training and evaluation. Section 4 and Section 5 present the experimental results and discuss the findings. Section 6 concludes this work and highlights potential directions for future research.
2. Related Work
Sign languages are an important means of communication for Deaf and hard-of-hearing people. Arabic sign language is the main language for these people in Arabic-speaking regions. In recent years, several research projects within computer vision and machine learning have been conducted to develop recognition systems and methods to support communication, education, and accessibility for the Deaf community. Most current methods rely on visual signs such as hand shape, orientation, and movement to map the visual hand gestures to letters or words. However, most of these efforts treat recognition purely as a visual task and overlook the linguistic meaning of signs, which limits generalization and interpretability.
To address data limitations, the authors in [3] introduced a realistic ArASL dataset containing 150 isolated signs covering 18 hand shapes. Their recognition system followed a three-stage approach involving hand segmentation, shape extraction, and motion analysis, using depth and skeleton data from Kinect V2. However, due to high visual similarity among signs and recording noise, the model achieved an accuracy of 55.6%. More recently, Luqman [4] released ArabSign, a continuous ArASL dataset comprising 9335 annotated samples captured in RGB, depth, and skeleton modalities. An encoder–decoder model was proposed to benchmark the dataset and showed better performance than an attention-based baseline.
Other works have focused on developing recognition methods for isolated ArASL. The authors in [5] developed a signer-independent system combining MobileNetV2 and ResNet18 with a Self-MLP classifier and using MediaPipe segmentation to isolate hand and facial areas from RGB videos. Noor et al. [6] proposed a CNN-LSTM model to capture both spatial and temporal dependencies, though it was constrained by a small 20-word dataset. Javaid and Rizvi [7] introduced the Sign Language Action Transformer Network (SLATN) framework for recognizing both manual and non-manual signs using the PKSLMNM dataset. Alasmari and Asiri [8] proposed ASLDetect, a hybrid ResNet34 and U-Net-like model aimed at handling noisy ArASL gesture images, but it does not incorporate textual descriptions. Alsaadi et al. [9] implemented a real-time ArASL recognition system based on AlexNet without augmentation or transfer learning.
Research on transformer-based models has also grown. Baghdadi et al. [10] proposed a vision transformer-based ensemble with LIME explanations to improve semantic grounding for ArASL classification, though LIME can produce unstable explanations. The authors in [11] compared transfer learning and vision transformers for ArASL alphabet recognition, showing that pretrained models such as MobileNet, ResNet, ViT, and Swin outperform CNNs trained from scratch.
More recent efforts explore CLIP-based multimodal approaches. Alyami and Luqman [12] proposed CLIP-SLA, a parameter-efficient framework for continuous sign language recognition (CSLR) that integrates SLA-LoRA and SLA-Adapter into the CLIP visual encoder. The framework enables better temporal and spatial modeling and achieves competitive results against state-of-the-art CSLR methods. Jiang et al. [13] proposed SignCLIP, which aligns sign language videos with spoken language using contrastive learning. The method focuses on representation learning and retrieval tasks rather than lightweight classification. SignVLM [14] also adopts CLIP as a visual feature extractor for video frames and combines it with a transformer-based temporal decoder to support isolated and continuous sign recognition. AraCLIP, developed by [15], adapts CLIP for Arabic text-to-image retrieval using knowledge distillation from an English CLIP model. However, it targets general Arabic imagery and is not designed for ArASL or sign-specific visual modeling.
Recent CLIP-based models have shown promise for sign language processing. However, they differ from the proposed approach in scope and design. SignCLIP focuses on large-scale video–text pretraining using multilingual sign dictionaries and targets general sign language understanding across multiple languages. Its main purpose is representation learning, with less emphasis on lightweight deployment, and it relies on video-based inputs and extensive pretraining resources. Similarly, SignVLM introduces additional temporal modeling components to support isolated and continuous sign recognition, which increases model complexity, and it does not formulate zero-shot recognition as a semantic image–text matching problem.
In contrast, CLIP-ArASL targets static Arabic sign letter recognition using a compact image-based formulation. The proposed model aligns hand gesture images directly with bilingual textual descriptions in a shared embedding space. It operates on static visual inputs without additional temporal modeling. CLIP-SLA differs in that it focuses on continuous sign language recognition and adapts the CLIP visual encoder using parameter-efficient fine-tuning methods. Textual information is incorporated indirectly through gloss-level supervision instead of direct semantic alignment. In comparison, CLIP-ArASL is designed as a lightweight and task-specific approach. It supports both supervised recognition and zero-shot prediction through image–text similarity.
In summary, most previous work on ArASL recognition has concentrated mainly on visual features, with limited consideration of semantic and linguistic information. Although vision transformer-based and vision–language models have been introduced, their application in this domain often emphasizes visual representation learning rather than direct vision–language alignment. To address this limitation, this work proposes a lightweight multimodal approach that aligns ArASL images with bilingual textual descriptions in a shared embedding space. The integration of textual semantics supports supervised recognition and enables zero-shot recognition of unseen classes through image–text matching.
3. Methodology
This section describes the proposed CLIP-ArASL approach in detail. The approach follows the vision–language alignment principle of CLIP by mapping hand gesture images and bilingual textual descriptions into a shared embedding space. At the same time, CLIP-ArASL introduces task-specific adaptations for static ArASL recognition. Unlike standard CLIP models trained on large-scale image–text pairs, the proposed approach operates using class-level textual descriptions and static hand gesture images. The visual encoder is adapted via selective fine-tuning of higher convolutional layers, while the text encoder remains frozen to maintain linguistic semantics. In addition, a hybrid training objective is employed to jointly support supervised classification and multimodal alignment, enabling both closed-set recognition and zero-shot evaluation.
Figure 1 shows the overall workflow of the proposed methodology. Each part is discussed in the following subsections.
3.1. Input Data
The proposed approach utilized two types of input data: visual hand gesture images and textual class descriptions.
For visual data, we used two different datasets, ArASL2018 and ArASL21L. Each dataset was separated into three parts: for training, for validation, and the remaining portion for testing. To mitigate the effect of class imbalance, the datasets were balanced by downsampling.
The ArASL2018 dataset [16,17] contains 32 Arabic sign language letter classes and a total of 54,049 grayscale images. The original dataset shows noticeable class imbalance, where the number of images per class varies considerably. To address this issue, each class was reduced to 1290 images, resulting in a total of 41,280 images. The images were selected randomly to preserve variability and ensure equal class representation.
The ArASL21L dataset [18,19] has 32 Arabic sign letter classes with a total of 14,202 images. The number of images per class ranges between 401 and 451, which introduces moderate class imbalance. A new balanced dataset was created by reducing each class to 400 images.
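The balancing step described above can be sketched as a per-class random downsampling routine. This is a minimal illustration, not the authors' code: the function name `downsample_per_class`, the `(item, label)` pair layout, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def downsample_per_class(samples, n_per_class, seed=0):
    """Randomly keep exactly n_per_class items for every class label.

    samples: iterable of (item, label) pairs (hypothetical layout).
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed is an illustrative choice
    balanced = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(items)} items")
        # Sampling without replacement keeps n_per_class distinct items.
        for item in rng.sample(items, n_per_class):
            balanced.append((item, label))
    return balanced

# Toy data: class "a" has 5 items, class "b" has 3; keep 3 of each.
toy = [(f"a_{i}.png", "a") for i in range(5)]
toy += [(f"b_{i}.png", "b") for i in range(3)]
balanced = downsample_per_class(toy, n_per_class=3)
count_a = sum(1 for _, label in balanced if label == "a")
count_b = sum(1 for _, label in balanced if label == "b")
```

With `n_per_class=1290` and 32 classes, such a routine yields the 41,280 images reported for ArASL2018; with `n_per_class=400` it yields 12,800 images for ArASL21L.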
Figure 2 shows representative samples from both datasets.
Prior to training, all images are resized to pixels and normalized using the ImageNet mean and standard deviation values. During training, data augmentation was applied to improve generalization and reduce overfitting. It included random horizontal flipping with probability , random rotation within ±15°, and color jittering with brightness, contrast, and saturation factors of 0.2. During validation and testing, only resizing and normalization were applied.
The textual data consisted of bilingual (Arabic and English) descriptions for each class, stored as arrays in a JSON file (the JSON file is publicly available at https://doi.org/10.6084/m9.figshare.31082545, accessed on 18 January 2026). The descriptions were adapted from an official guideline issued by the Government Communication Center of the Saudi Ministry of Media (2021) and translated into English. The English descriptions were obtained through manual translation and verification to preserve semantic equivalence with the Arabic source.
The same set of bilingual descriptions is used across all datasets in this work. This ensures that textual class representations remain standardized and independent of the visual data source. Each class is associated with one Arabic description and one English description. When a description contains multiple sentences, the sentences are stored as an array and treated as a single semantic unit.
These bilingual descriptions are used to construct text-based class representations, which are then used in both supervised classification and zero-shot evaluation.
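As an illustration of how multi-sentence descriptions can be treated as a single semantic unit, the sketch below joins each sentence array into one string per language. The schema is hypothetical: the layout of the published JSON file is not described in the text, so the class key `letter_01`, the language keys `ar` and `en`, and the placeholder sentences are all assumptions.

```python
import json

# Hypothetical schema: one entry per class, one sentence array per language.
# The real file layout may differ; the sentences below are placeholders.
raw = """
{
  "letter_01": {
    "ar": ["Arabic sentence 1 (placeholder).", "Arabic sentence 2 (placeholder)."],
    "en": ["English sentence 1 (placeholder).", "English sentence 2 (placeholder)."]
  }
}
"""

def load_descriptions(json_text):
    """Return {class_name: {language: text}}, joining sentence arrays."""
    data = json.loads(json_text)
    merged = {}
    for class_name, langs in data.items():
        merged[class_name] = {
            # An array of sentences becomes one space-separated string.
            lang: " ".join(value) if isinstance(value, list) else value
            for lang, value in langs.items()
        }
    return merged

descriptions = load_descriptions(raw)
```

Each joined string can then be passed to the text encoder as one input, so that every class contributes exactly one Arabic and one English embedding.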
3.2. Multimodal Encoding
We applied two encoders, an image encoder and a text encoder, to extract visual features and to encode the text descriptions into semantic vectors.
In the image encoder, visual feature extraction is performed using a pretrained EfficientNet-B0 convolutional neural network [1]. EfficientNet-B0 was selected due to its compact architecture and competitive performance. The model is initialized with ImageNet weights to take advantage of previous visual knowledge. During the training phase, only higher-level convolutional blocks are fine-tuned, while the earlier layers remain frozen to balance task adaptation and generalization. In particular, the last three feature blocks of EfficientNet-B0 are updated, whereas all preceding layers remain fixed. The resulting feature maps are aggregated using adaptive average pooling to produce fixed-length image embeddings.
Regarding the text encoder, textual class descriptions are encoded using a pretrained MiniLM sentence transformer. It was selected due to its cross-lingual representation capability and compact architecture [2]. This encoder maps both Arabic and English descriptions into a shared semantic vector space. For each class, the Arabic and English descriptions are encoded separately. The embeddings are then averaged to produce a single class-level representation. The final embeddings are L2-normalized to ensure consistent similarity computation. The text encoder remains frozen during training to preserve semantic stability and prevent language drift.
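The averaging and normalization step can be sketched in a few lines. This is a framework-free illustration that assumes the two sentence embeddings are already available as plain lists of floats; the function names are hypothetical and the three-dimensional toy vectors stand in for real encoder outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def class_text_embedding(ar_vec, en_vec):
    """Average the Arabic and English embeddings, then L2-normalize."""
    if len(ar_vec) != len(en_vec):
        raise ValueError("embeddings must have the same dimensionality")
    mean = [(a + e) / 2.0 for a, e in zip(ar_vec, en_vec)]
    return l2_normalize(mean)

# Toy 3-dimensional vectors in place of real sentence embeddings.
emb = class_text_embedding([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
emb_norm = math.sqrt(sum(x * x for x in emb))
```

Because the text encoder is frozen, these class-level vectors can be computed once and reused for every training step and at inference.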
After feature extraction, the outputs of the visual and textual encoders are related through a similarity-based alignment mechanism. For each input image, the visual encoder produces an image embedding, while the textual encoder provides a fixed class-level text embedding corresponding to each gesture label. These embeddings are mapped to the same dimensional space and normalized, allowing direct similarity computation. In this way, the model can operate using a unified representation during both training and inference, without requiring additional modality-specific classifiers.
3.3. Shared Multimodal Embedding Space
To achieve alignment between visual and textual representations, image features generated by the encoder are projected into a shared embedding space using a learnable linear projection layer. Both image and text embeddings are L2-normalized so similarity can be computed based on cosine similarity. This shared embedding space forms the core of the proposed approach, enabling visual samples to be directly compared against textual class representations. Such alignment is required for zero-shot evaluation, where unseen classes are represented through their textual descriptions, while recognition is performed by matching image embeddings to these text embeddings.
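Recognition by matching in the shared space amounts to a nearest-neighbour search under cosine similarity. The sketch below is a minimal, framework-free illustration with hypothetical names (`predict_class`, `class_a`, `class_b`) and two-dimensional toy embeddings; it is not the authors' implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def predict_class(image_emb, text_embs):
    """Return the best-matching class name and all cosine similarities.

    text_embs: {class_name: embedding}. For unit vectors, the dot
    product equals the cosine similarity.
    """
    img = l2_normalize(image_emb)
    scores = {}
    for name, txt in text_embs.items():
        unit_txt = l2_normalize(txt)
        scores[name] = sum(a * b for a, b in zip(img, unit_txt))
    best = max(scores, key=scores.get)
    return best, scores

# Toy example: the image embedding points mostly along class_a.
text_embs = {"class_a": [1.0, 0.0], "class_b": [0.0, 1.0]}
best, scores = predict_class([0.9, 0.1], text_embs)
```

For zero-shot evaluation, `text_embs` would contain only the text embeddings of the unseen classes, so no image of those classes is needed to define the decision rule.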
3.4. Model Training
During training, images and descriptions are mapped into a shared vision–language embedding space. Given an input image, the visual encoder produces a 384-dimensional image embedding, where 384 denotes the dimensionality of the shared vision–language embedding space. Each class description is encoded by the text encoder to obtain a text embedding in the same space. Both embeddings are L2-normalized before similarity computation. The similarity between an image and a class description is computed using scaled cosine similarity, where the similarity score between image i and text description j is scaled by a learnable temperature parameter that controls the sharpness of the similarity distribution.
For supervised classification, the similarity scores over the seen classes are used as logits, and the cross-entropy loss is applied with respect to the ground-truth class label of image i.
To encourage multimodal alignment, a contrastive loss is applied between image and text embeddings within a mini-batch, where the image-to-text and text-to-image losses are computed by applying cross-entropy over the scaled similarity matrix.
The final training objective is defined as a weighted combination of both losses.
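For reference, the quantities described in this subsection can be written out as follows. This is a hedged reconstruction from the prose rather than the original notation: the symbol names ($v_i$, $t_c$, $s_{ic}$, $\tilde{s}_{ik}$, $\tau$, $y_i$, $B$, $C$) are chosen here for illustration, the temperature is written as a divisor (a multiplicative logit scale $1/\tau$ is equivalent), and averaging the two contrastive directions follows the common CLIP convention, which the text does not state explicitly.

```latex
\begin{align}
  v_i, t_c &\in \mathbb{R}^{384}, \qquad \lVert v_i \rVert_2 = \lVert t_c \rVert_2 = 1, \\
  s_{ic} &= \frac{v_i^{\top} t_c}{\tau}, \qquad
  \tilde{s}_{ik} = \frac{v_i^{\top} t_{y_k}}{\tau}, \\
  \mathcal{L}_{\mathrm{CE}} &= -\frac{1}{B} \sum_{i=1}^{B}
      \log \frac{\exp(s_{i y_i})}{\sum_{c=1}^{C} \exp(s_{ic})}, \\
  \mathcal{L}_{\mathrm{i2t}} &= -\frac{1}{B} \sum_{i=1}^{B}
      \log \frac{\exp(\tilde{s}_{ii})}{\sum_{k=1}^{B} \exp(\tilde{s}_{ik})}, \qquad
  \mathcal{L}_{\mathrm{t2i}} = -\frac{1}{B} \sum_{i=1}^{B}
      \log \frac{\exp(\tilde{s}_{ii})}{\sum_{k=1}^{B} \exp(\tilde{s}_{ki})}, \\
  \mathcal{L}_{\mathrm{con}} &= \tfrac{1}{2}\left(\mathcal{L}_{\mathrm{i2t}} + \mathcal{L}_{\mathrm{t2i}}\right), \\
  \mathcal{L} &= 0.5\,\mathcal{L}_{\mathrm{CE}} + 0.5\,\mathcal{L}_{\mathrm{con}}.
\end{align}
```

Here $B$ is the mini-batch size, $C$ is the number of seen classes, $y_i$ is the ground-truth class of image $i$, and $t_{y_k}$ is the text embedding paired with the $k$-th image in the batch.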
Overall, the model is trained on seen classes using a hybrid objective that integrates cross-entropy loss for supervised classification and contrastive loss for multimodal alignment, as defined above. Both loss components are equally weighted, with each contributing 0.5 to the total training objective. This balanced weighting allows supervised classification and contrastive alignment to contribute equally during optimization. It encourages the correspondence between image embeddings and their associated textual descriptions while preserving discriminative class boundaries. A learnable temperature scaling parameter is used to control the sharpness of the similarity distribution.
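The hybrid objective can be illustrated with a small framework-free sketch. Several details are assumptions rather than statements from the text: the temperature is applied as a divisor, the two contrastive directions are averaged as in CLIP, each image is paired with the text embedding of its own class, and all function names are hypothetical. An actual implementation would use PyTorch tensors and automatic differentiation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def cross_entropy(logits, targets):
    """Mean negative log-probability of the target index in each row."""
    total = sum(log_softmax(row)[t] for row, t in zip(logits, targets))
    return -total / len(targets)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hybrid_loss(img_embs, class_txt_embs, labels, temperature,
                w_ce=0.5, w_con=0.5):
    """Weighted sum of classification and contrastive losses.

    img_embs: B unit-norm image embeddings.
    class_txt_embs: C unit-norm class text embeddings (seen classes).
    labels: B ground-truth class indices.
    """
    # Supervised term: similarities to every seen class act as logits.
    cls_logits = [[dot(v, t) / temperature for t in class_txt_embs]
                  for v in img_embs]
    loss_ce = cross_entropy(cls_logits, labels)

    # Contrastive term: image k is paired with the text of its own class,
    # giving a B x B similarity matrix whose diagonal holds the positives.
    paired_txt = [class_txt_embs[y] for y in labels]
    sim = [[dot(v, t) / temperature for t in paired_txt] for v in img_embs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    diagonal = list(range(len(img_embs)))
    loss_con = 0.5 * (cross_entropy(sim, diagonal)
                      + cross_entropy(sim_t, diagonal))

    return w_ce * loss_ce + w_con * loss_con

# Toy batch: two images, two classes, orthogonal unit embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
class_texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = hybrid_loss(images, class_texts, labels=[0, 1], temperature=0.1)
misaligned = hybrid_loss(images, class_texts, labels=[1, 0], temperature=0.1)
```

In the toy batch, correctly labelled images give a loss near zero, while swapped labels give a large loss. When a mini-batch contains several images of the same class, the paired text embeddings repeat; how such duplicates are handled is not specified in the text and is left out of this sketch.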
Training was conducted for a maximum of 80 epochs using the AdamW optimizer. The initial learning rate was , and the weight decay was 0.01. A batch size of 32 was used in all experiments. The learning rate was reduced by half every ten epochs using a step-based schedule. Early stopping was applied based on validation accuracy with a patience of seven epochs to limit overfitting.
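The schedule and stopping rule described above can be sketched without any framework. The names `step_lr` and `EarlyStopping` are hypothetical, `BASE_LR` is a placeholder rather than the value used in the experiments, and the validation curve is synthetic.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate at a 0-based epoch when halved every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=7):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        """Record one epoch and return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

BASE_LR = 1.0     # placeholder value, not the learning rate of the paper
MAX_EPOCHS = 80
stopper = EarlyStopping(patience=7)
stopped_at = None
for epoch in range(MAX_EPOCHS):
    lr = step_lr(BASE_LR, epoch)
    # Synthetic validation curve: improves for three epochs, then plateaus.
    val_acc = min(epoch, 2) * 0.1
    if stopper.step(val_acc):
        stopped_at = epoch
        break
```

In PyTorch, the same behaviour is typically obtained with `torch.optim.AdamW` and `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)`, with the early-stopping check performed once per epoch on the validation split.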
3.5. Evaluation Phase
In this phase, we conducted two kinds of evaluation: supervised and zero-shot. In the supervised evaluation, the model was tested on the test split of seen classes by comparing image embeddings against the corresponding text embeddings. Performance was measured using standard metrics, namely accuracy, precision, recall, and F1-score.
In zero-shot evaluation, 20% of gesture classes were selected at the class level and withheld entirely from the training and validation sets. During this phase, the model was trained using only images and textual representations of the remaining seen classes. Textual embeddings of unseen classes were not used during training and were introduced only at inference time. At test time, images from unseen classes were recognized by matching their image embeddings to the textual embeddings of the unseen classes within the shared embedding space. This protocol ensured that no information leakage occurred and evaluated the ability of the approach to transfer semantic knowledge from text to image.
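A class-level split of this kind can be sketched as follows. The class names, the seed, and the choice to round the number of unseen classes down are assumptions; rounding down is consistent with the six unseen classes out of 32 shown later in the confusion matrices.

```python
import random

def split_classes(class_names, unseen_ratio=0.2, seed=0):
    """Partition class names into disjoint seen and unseen lists."""
    names = sorted(class_names)
    n_unseen = int(len(names) * unseen_ratio)  # rounds down (assumption)
    rng = random.Random(seed)
    unseen = sorted(rng.sample(names, n_unseen))
    seen = [c for c in names if c not in unseen]
    return seen, unseen

# Hypothetical class identifiers for the 32 sign letters.
classes = [f"class_{k:02d}" for k in range(32)]
seen, unseen = split_classes(classes)

# Chance level when guessing uniformly among the unseen classes only.
chance = 1.0 / len(unseen)
```

All images of the unseen classes are kept out of training and validation, and at test time the candidate label set is restricted to the unseen classes. With six candidates, uniform random guessing gives an accuracy of about 16.7%.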
4. Results
We evaluated the CLIP-ArASL approach using two different ArASL datasets, namely ArASL2018 and ArASL21L, which are described in Section 3.1. In addition, we employed a JSON file that contains bilingual text descriptions in Arabic and English.
In the supervised learning method, we divided each dataset into three subsets: a training set (
), a validation set (
), and a testing set (
). In zero-shot evaluation, 20% of the hand gesture classes were completely excluded from both training and validation and were used only for testing, as described in Section 3.5.
All experiments were repeated over ten runs on a personal computer running Windows 11 Pro, equipped with an Intel Core i7 CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 5070 GPU. The proposed approach was implemented in Python 3.10.11 using PyTorch 2.8.0 (CUDA 12.8) for model training and inference. The EfficientNet-B0 image encoder and data transformations were implemented using torchvision 0.23.0. Text encoding was performed using the SentenceTransformers 5.1.2 library, which builds on the Transformers framework.
4.1. Supervised Learning Method
Table 1 reports the mean recognition performance of the proposed approach over ten experiments on the ArASL2018 dataset using different evaluation metrics. Accuracy reflects the overall correctness of the approach as the percentage of correctly classified images in the testing set. The proposed approach achieved a mean accuracy of
, indicating excellent performance. Precision calculates the ratio of true positive images to the total number of images predicted as positive. The approach obtained a mean precision of
, which reflects very low false positive estimation. Recall evaluates the effectiveness of the approach in detecting all relevant positive images. In our work, the mean of the recall value was
, showing that the approach successfully recognized most of the true sign instances. The mean of the F1-score was equal to
and represents a metric that integrates both precision and recall to provide a balanced evaluation of the approach’s performance for this dataset. All reported values were computed using weighted averaging across classes, where each class contribution was weighted by its number of test samples. Consequently, the reported F1-score was obtained as a weighted average of class-wise F1-scores and was not directly derived from the aggregated precision and recall values.
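The distinction between a weighted average of class-wise F1-scores and an F1-score derived from aggregated precision and recall can be seen in a small worked example. The labels below are toy data, not results from the experiments, and the function name `weighted_prf` is hypothetical.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all true classes."""
    support = Counter(y_true)
    n = len(y_true)
    precision_w = recall_w = f1_w = 0.0
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / support[c]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support[c] / n  # each class weighted by its test samples
        precision_w += weight * precision
        recall_w += weight * recall
        f1_w += weight * f1
    return precision_w, recall_w, f1_w

# Toy labels: three samples of class "a" and one of class "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
p_w, r_w, f_w = weighted_prf(y_true, y_pred)

# F1 computed from the aggregated precision and recall, for comparison.
f_from_aggregates = 2 * p_w * r_w / (p_w + r_w)
```

Here the weighted precision is 0.875 and the weighted recall is 0.75, yet the weighted F1-score is about 0.767, whereas the harmonic mean of the two aggregated values is about 0.808. This is why a reported weighted F1-score need not lie between, or be derivable from, the reported precision and recall.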
Figure 3 presents the training and validation curves for both loss and accuracy on the ArASL2018 dataset for the initial experiment. On the left side, both loss curves decrease steadily over the training epochs, meaning that the model learned effectively and progressively reduced prediction errors. In addition, this trend indicates that the approach achieved good generalization without notable overfitting. On the right side, the training and validation accuracy curves are shown. The training accuracy rises sharply during the early epochs (around the second one) and reaches above
. After this point, the training and validation accuracy curves follow a similar pattern and remain nearly overlapped. Then both curves change slightly and move in parallel until the end of the training. This behavior indicates a stable learning process and consistent performance of the approach.
As shown in Table 2, the proposed CLIP-ArASL approach achieved a mean recognition accuracy of
on the ArASL21L dataset. The corresponding mean of the precision, recall, and F1-score values (
) indicate consistent recognition performance across all classes. Although overall performance was lower than that obtained on the ArASL2018 dataset, the results are strong given the increased complexity of the ArASL21L dataset. This dataset includes challenging factors such as complex backgrounds, lighting variations, and images containing both facial features and hand gestures, which increase recognition difficulty. Despite these challenges, the proposed approach demonstrated robust and stable performance across all evaluation metrics.
Figure 4 presents the training and validation curves for loss and accuracy of the initial experiment when the approach was trained on the ArASL21L dataset. The loss curves on the left side show a consistent downward trend for both training and validation, with similar shapes throughout the training process. This pattern indicates effective learning and suggests that the approach maintained reasonable generalization. On the right side, the accuracy curves for training and validation follow a similar trend. The two curves intersect at an early stage of training, after which a gradual gap appears and increases slightly toward the end of the training. To limit possible overfitting, we applied early stopping during the training process. Overall, the curves demonstrate stable learning behavior despite the high complexity of the ArASL21L dataset.
Table 3 reports supervised recognition results of CLIP-ArASL on seen classes alongside representative vision-based ArASL methods evaluated on ArASL2018 or ArASL21L. These methods differed in training protocols, data splits, augmentation strategies, and use of transfer learning. Therefore, the comparison was indicative rather than strictly controlled. However, the results showed that CLIP-ArASL achieved competitive accuracy on both datasets while using a compact multimodal architecture. Unlike prior approaches that relied only on visual cues, CLIP-ArASL integrated visual and textual representations. This design maintained high recognition performance exceeding
on both datasets and supported multimodal learning.
The lightweight design refers to the compact vision backbone and the number of trainable parameters. The trained vision model contains 4.54 million parameters, of which 3.65 million are optimized during training. The text encoder is used in frozen mode and is not updated. The final trained checkpoint occupies MB on disk.
To further contextualize the results, we additionally evaluated three off-the-shelf vanilla CLIP models, ViT-B/32, ViT-B/16, and ViT-L/14, under different prompt settings. These settings included class name, English descriptions, Arabic descriptions, and bilingual Arabic–English descriptions. All CLIP models were kept frozen and used without fine-tuning. As shown in Table 4, vanilla CLIP achieved near-random accuracy, below
, across all configurations. This outcome indicated limited transferability to ArASL recognition and highlighted the need for task-specific multimodal alignment, which was addressed by CLIP-ArASL.
4.2. Zero-Shot Recognition
To evaluate the generality of our proposed approach beyond the closed-set scenario, a zero-shot evaluation was carried out. As described in Section 3.5, 20% of the hand gesture classes were excluded from the training phase and the approach was required to recognize these unseen classes using only their textual descriptions.
On the ArASL2018 dataset, CLIP-ArASL obtained a mean zero-shot accuracy of . For the more complex dataset (i.e., ArASL21L), the mean accuracy reached . Although these results are lower than those achieved in the supervised setting, they remain well above the chance level expected from random guessing. This suggests that the learned embedding space captures meaningful semantic relationships between gestures and text.
Zero-shot learning in sign language recognition has been explored under different problem formulations, depending on what is considered unseen during the evaluation phase. Table 5 summarizes these differences and reports the main evaluation metrics from representative works. SignCLIP [13] concentrates on cross-modal representation learning, allowing zero-shot transfer across tasks (e.g., retrieval or classification) by aligning sign language videos and text in a shared embedding space. Its performance is evaluated using retrieval metrics such as recall at one (R@1). The reported results reflect retrieval accuracy rather than classification accuracy. In contrast, SignVLM [14] addresses cross-dataset transfer for isolated sign language recognition, evaluating zero-shot and few-shot transfer where models trained on one dataset are tested on disjoint sign classes from other datasets, mainly within supervised classification settings. The performance is measured using classification accuracy, and the results vary across datasets.
The proposed work addresses a different case, which is class-level zero-shot learning. In this case, a subset of hand gesture classes was completely excluded during the training phase and introduced at test time via their textual descriptions. The model was therefore required to recognize unseen signs without having access to any visual samples of these classes during training. The performance was measured using classification accuracy on the unseen classes. Unlike previous methods, CLIP-ArASL is specifically designed for isolated ArASL recognition, where visual similarities between signs can be subtle and language-specific.
Although all methods are described as zero-shot, they differ in task definition, utilized datasets, and evaluation metrics. Therefore, the results in Table 5 are provided as a contextual reference and are not directly comparable.
We first performed a single zero-shot experiment and obtained accuracies of
and
on the two datasets. To evaluate the stability of the results, we carried out a total of ten experiments using a fixed 20% unseen class ratio. The first run corresponded to the initial experiment, and the remaining runs were performed with different randomly selected unseen classes.
Table 6 reports the accuracy of each experiment on both datasets, together with the mean and standard deviation. On ArASL2018, CLIP-ArASL achieved a mean zero-shot accuracy of
. On ArASL21L, the mean accuracy was
. The results indicate variability across different class splits, particularly for ArASL2018, which shows a higher standard deviation. ArASL21L shows lower variance but also lower overall accuracy.
Figure 5 and Figure 6 present the confusion matrices from Experiment 1 in Table 6. This experiment achieved
on ArASL2018 and
on ArASL21L. These figures provide a class-wise view of how the proposed CLIP-ArASL predicts unseen sign classes using only textual descriptions. The confusion matrices visualize the six unseen classes. These classes represent
of the total 32 gesture classes. They were excluded during the training phase and used only for zero-shot evaluation.
For ArASL2018, the confusion matrix displays clear diagonal dominance, indicating that several unseen letters could be correctly detected in the zero-shot setting. This behavior suggests that the approach could align textual descriptions with visual patterns when the corresponding hand gestures had distinctive shapes. At the same time, noticeable confusion was observed between visually similar letters, where overlapping finger positions or hand orientations led to misclassifications.
In the case of ArASL21L, the confusion matrix shows a weaker diagonal dominance compared to ArASL2018, indicating a higher difficulty in zero-shot recognition. While some correct predictions are shown, the overall distribution of errors is more dispersed across classes. This pattern can be attributed to the higher visual similarity among letters in this dataset, as well as the greater variation in backgrounds and viewing conditions. Consequently, zero-shot recognition becomes more challenging.
In general, the confusion matrices demonstrate that the proposed approach has structured prediction behavior rather than random classification when dealing with unseen classes. However, the observed confusion patterns indicate the inherent limitations of zero-shot recognition for fine-grained sign language gestures, highlighting the need for richer textual descriptions and more discriminative representations in future work.
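A confusion matrix of this kind, together with a simple measure of diagonal dominance, can be computed as sketched below. The labels `u1` to `u3` and the predictions are toy placeholders rather than data from the experiments, and both function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true class, columns index the predicted class."""
    index = {c: k for k, c in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def diagonal_fraction(matrix):
    """Share of all samples that fall on the diagonal (overall accuracy)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0

# Toy example with three placeholder unseen classes.
labels = ["u1", "u2", "u3"]
y_true = ["u1", "u1", "u2", "u2", "u3", "u3"]
y_pred = ["u1", "u2", "u2", "u2", "u3", "u1"]
cm = confusion_matrix(y_true, y_pred, labels)
acc = diagonal_fraction(cm)
```

Off-diagonal cells with large counts identify the pairs of letters that are most often confused, which is the kind of pattern examined in the discussion of visually similar hand shapes.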
5. Discussion
This section discusses the experimental results in the context of ArASL recognition, with particular emphasis on supervised performance and zero-shot capability.
The supervised results show that CLIP-ArASL achieved an accuracy comparable to recent vision-based methods on the ArASL2018 and ArASL21L datasets. This result is notable because our approach uses a lighter architecture. Unlike many existing methods that rely only on visual features (e.g., our earlier work [8] and the approaches reviewed there), CLIP-ArASL learns both visual features and linguistic semantics through multimodal contrastive alignment. This design helps the approach maintain strong classification performance while improving flexibility.
The drop in mean accuracy from ArASL2018 to ArASL21L is expected, as the second dataset contains more background noise and visual clutter. From a data-centric perspective, these characteristics increase inter-class overlap and reduce class separability. Under such conditions, reliance on visual cues alone becomes less effective. The inclusion of textual guidance helps stabilize feature learning when visual information is less reliable, which further motivates the use of CLIP-ArASL in zero-shot settings.
Unlike traditional ArASL recognition methods that work under a closed-set assumption, our approach can recognize unseen gesture classes using only their text descriptions. While recent studies have examined zero-shot learning in sign language recognition, many focus on retrieval tasks, cross-dataset transfer, or attribute-based supervision. In contrast, CLIP-ArASL performs class-level zero-shot recognition of isolated ArASL alphabet letters without using any visual samples of the unseen classes during the training phase. This tackles the scalability limitations of supervised ArASL recognition systems.
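Recognition from text descriptions alone can be illustrated as a nearest-text-embedding search in the shared space. The sketch below is illustrative and is not the CLIP-ArASL implementation: the function names, the class labels, and the 3-dimensional embeddings are hypothetical stand-ins for the outputs of the visual and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def zero_shot_predict(image_emb, text_embs):
    """Pick the class whose text embedding is most similar to the image embedding.

    text_embs maps a class label to the embedding of its textual description.
    Returns the best label and the cosine score for every candidate class.
    """
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in text_embs.items():
        txt = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings for three classes never seen during training.
unseen_text_embs = {
    "class_a": [1.0, 0.0, 0.0],
    "class_b": [0.0, 1.0, 0.0],
    "class_c": [0.6, 0.8, 0.0],
}
image_emb = [0.5, 0.9, 0.1]  # stand-in for an encoded test image
label, scores = zero_shot_predict(image_emb, unseen_text_embs)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Because the candidate set is defined only by the text embeddings, adding a new letter in this scheme requires a new description rather than new labeled images.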
The zero-shot confusion matrices shown in Figure 5 and Figure 6 provide insight into the class-level behavior of the proposed approach when applied to unseen sign classes. In both cases, the predictions show a clear non-random structure and reveal limitations related to visual similarity between letters. Most errors are associated with visually similar hand shapes and finger configurations rather than arbitrary misclassification.
In the ArASL2018 dataset, some letters showed clear misclassification patterns. The letter
had eight correct predictions and was often misclassified as
. Another case was the confusion between
and
, where many
samples were classified as
. Despite these errors, several letters maintained strong diagonal responses. This explains the
zero-shot accuracy observed in Experiment 1, as shown in
Figure 5.
In contrast, the ArASL21L dataset showed a lower zero-shot performance. The accuracy was
in Experiment 1, as shown in
Figure 6. The confusion matrix reveals reduced diagonal dominance with misclassifications distributed across classes. The letter
had zero correct predictions and appeared as a predicted class for other letters. For example, many samples of
were predicted as
. Another confusion was observed between
and
, and vice versa. These errors were spread over multiple classes, and no single letter was consistently recognized. From a data-centric viewpoint, these patterns indicate higher inter-class ambiguity and weaker discriminative cues. As a result, the approach found it more difficult to distinguish letters in this dataset.
The behavior of frozen vanilla CLIP models further clarifies the design choices in this work. Despite large-scale pretraining and the use of different prompt types, vanilla CLIP models produced near-random results on both ArASL datasets. This observation suggests that generic vision–language models do not capture the fine-grained hand-shape differences required for isolated sign letter recognition. In contrast, CLIP-ArASL benefits from task-specific multimodal alignment and supervised sign language data. This allows the model to learn subtle visual distinctions while leveraging linguistic descriptions, which accounts for the large performance gap between the two approaches.
Overall, the analysis shows that CLIP-ArASL produces structured predictions in class-level zero-shot scenarios. Most errors are caused by subtle visual differences between hand gestures rather than random classification. At the same time, the results highlight the limitations of zero-shot recognition for isolated sign language letters and point to the need for more discriminative visual representations and richer semantic descriptions in future work. The findings indicate that CLIP-ArASL is applicable to practical ArASL scenarios where new letters may be introduced and labels are limited.
6. Conclusions and Future Work
This work introduced CLIP-ArASL, a CLIP-style multimodal approach that aligns isolated Arabic sign language images with bilingual textual descriptions through a shared embedding space. The approach combines a partially fine-tuned EfficientNet-B0 visual encoder with a MiniLM text encoder. Training is guided by a hybrid objective that integrates cross-entropy classification loss with contrastive vision–language alignment loss. This design allows for joint learning of visual recognition and image–text alignment within a single architecture.
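A hybrid objective of this kind can be sketched as a weighted sum of a classification term and a symmetric contrastive term. The code below is a minimal illustration under stated assumptions, not the training code of this work: the convex weighting `alpha * CE + (1 - alpha) * contrastive`, the default `alpha=0.5`, the temperature of 0.07, and all function names are assumptions made for the example. Embeddings are assumed to be L2-normalized, and each image in the toy batch is paired with a distinct text.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target index in each row."""
    return -sum(log_softmax(r)[t] for r, t in zip(logits, targets)) / len(logits)


def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image should match the i-th text, and vice versa."""
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt_embs]
           for i in img_embs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    idx = list(range(len(img_embs)))
    return 0.5 * (cross_entropy(sim, idx) + cross_entropy(sim_t, idx))


def hybrid_loss(class_logits, labels, img_embs, txt_embs, alpha=0.5):
    """Assumed form: alpha * classification + (1 - alpha) * alignment."""
    ce = cross_entropy(class_logits, labels)
    con = contrastive_loss(img_embs, txt_embs)
    return alpha * ce + (1 - alpha) * con


# Toy batch of two samples with made-up values.
class_logits = [[2.0, 0.1], [0.2, 1.5]]
labels = [0, 1]
img_embs = [[1.0, 0.0], [0.0, 1.0]]
txt_aligned = [[1.0, 0.0], [0.0, 1.0]]  # each text matches its image
txt_swapped = [[0.0, 1.0], [1.0, 0.0]]  # texts paired with the wrong image
loss_aligned = hybrid_loss(class_logits, labels, img_embs, txt_aligned)
loss_swapped = hybrid_loss(class_logits, labels, img_embs, txt_swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch the classification term is identical in both cases, so the difference between the two losses comes entirely from the alignment term, which penalizes image–text pairs that are mismatched in the shared space.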
The proposed approach was evaluated on the ArASL2018 and ArASL21L datasets using supervised and zero-shot settings. In supervised evaluation, our proposed work obtained recognition accuracies of and , respectively. The training and validation curves remained stable in all epochs. Early stopping was applied to avoid overfitting. Overall, the results indicate that CLIP-ArASL is effective for isolated Arabic sign language recognition using static image input.
Zero-shot experiments were carried out by excluding a subset of classes from training and performing recognition using only their textual descriptions. The approach was able to associate unseen test images with their corresponding text embeddings at a level above random selection. This indicates that the shared embedding space captures semantic information that generalizes beyond the seen classes. While recognition performance remains affected by visual similarity between certain hand shapes and the lack of temporal information, the zero-shot results suggest that vision–language alignment can support ArASL recognition when annotated visual data are limited.
For future work, this approach can be improved in several directions. First, CLIP-ArASL can be extended to dynamic sign recognition by replacing the image encoder with spatiotemporal models such as 3D convolutional networks or video transformers. These models can process video inputs and capture temporal information. The same contrastive alignment strategy can map video features to textual descriptions in a shared embedding space. This extension would support supervised and zero-shot recognition of dynamic signs. Second, alternative multilingual text encoders and different prompt design strategies may be explored to generate richer semantic representations and improve zero-shot accuracy. Finally, further investigation of alternative loss weight settings and fine-tuning strategies may provide additional performance improvements.