Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition

Abstract: Sign languages are complex, but there are ongoing research efforts in engineering and data science to recognize, understand, and utilize them in real-time applications. Arabic sign language (ArSL) recognition has been examined and applied using various traditional and intelligent methods. However, there have been limited attempts to enhance this process by utilizing pretrained models and large-sized vision transformers designed for image classification tasks. This study aimed to create robust transfer learning models trained on a dataset of 54,049 images depicting 32 alphabets from an ArSL dataset. The goal was to accurately classify these images into their corresponding Arabic alphabets. This study included two methodological parts. The first was the transfer learning approach, wherein we utilized various pretrained models, namely MobileNet, Xception, Inception, InceptionResNet, DenseNet, and BiT, and two vision transformers, namely ViT and Swin. We evaluated different variants, from base-sized to large-sized pretrained models and vision transformers, with weights initialized from the ImageNet dataset or otherwise randomly. The second part was the deep learning approach using convolutional neural networks (CNNs), wherein several CNN architectures were trained from scratch to be compared with the transfer learning approach. The proposed methods were evaluated using the accuracy, AUC, precision, recall, F1, and loss metrics. The transfer learning approach consistently performed well on the ArSL dataset and outperformed the CNN models. ResNet and InceptionResNet obtained a comparably high performance of 98%. By combining the concepts of transformer-based architecture and pretraining, ViT and Swin leveraged the strengths of both architectures and reduced the number of parameters required for training, making them more efficient and stable than other models and existing studies for ArSL classification. This demonstrates the effectiveness and robustness of using transfer learning with vision transformers for sign language recognition, including for other low-resourced languages.


Introduction
Hearing loss refers to the inability to hear at the same level as individuals with normal hearing, who have thresholds of 20 decibels or better in both ears. It can range from mild to profound and can occur in one or both ears. Profound hearing loss is typically observed in deaf individuals, who have minimal or no hearing ability. Various factors can contribute to hearing loss, including congenital or early-onset hearing loss during childhood, chronic infections of the middle ear, exposure to loud noises, age-related changes in hearing, and the use of certain drugs that can cause damage to the inner ear [1]. The World Health Organization estimates that more than 1.5 billion people (nearly 20% of the global population) suffer from hearing loss, and this number could rise to over 2.5 billion by 2050 (https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, accessed on 17 September 2023). According to the General Authority for Statistics in Saudi Arabia, there are approximately 289,355 deaf people in the Kingdom of Saudi Arabia, according to the latest update for 2017 (https://www.stats.gov.sa/ar/904, accessed on 17 September 2023). Sign language (SL) is a visual form of communication used by deaf and hearing-impaired individuals. It involves the use of hand movements and facial expressions to convey meaning, following specific grammar rules. Each SL has a wide variety of signs, with slight variations in hand shape, motion, and position. Additionally, SL incorporates non-manual features and facial expressions, including eyebrow movements, to enhance communication. Like spoken language, SL develops naturally as a result of groups of people interacting with one another. Regions and cultures additionally play an essential role in its development [2]. Understanding SL outside the deaf community is uncommon, which makes communication between deaf individuals and hearing individuals difficult. Some deaf children are born to hearing parents, and consequently, communication gaps exist within the family. Moreover, there is no standardized form among SLs, which makes learning an SL challenging [3].
There are two main types of SLs, namely, alphabetic and ideographic [4]. The first is the SL alphabet (SLA), wherein each word is spelled letter by letter; it is used by deaf individuals to participate in the traditional educational and pedagogical process. The second is ideographic SL (ISL), which expresses each meaningful word by a specific hand gesture; for instance, the word "father" is represented by a single sign. ISL comprises four main components: hand movements, facial expressions, lip movements, and body movements [4,5]. One difficulty worth mentioning is that people from different areas might use various gestures to represent the same word, just like different dialects in natural languages. There are two main types of SL recognition systems. Device-based systems involve wearable tools, such as gloves, to track and interpret gestures [6]. Vision-based systems, on the other hand, utilize techniques for processing and analyzing images and videos of SL speakers using artificial intelligence (AI). Images and videos are captured by cameras, and the signer does not need to be equipped with devices or sensors. Vision-based SL recognition systems are in their developmental stages and have varying degrees of success [6].
Arabic SL (ArSL) is a natural language used by deaf and hearing-impaired people in the Arab world and has a large community. ArSL differs from one Arab country to another based on local dialects and cultures. ArSL recognition is a challenging task because Arabic contains hundreds of words whose hand poses can be very similar. The gestures are not only performed with the hands but also include other nonverbal communication means such as facial expressions and body movements [7]. Different research works have investigated the use of sensors and device-based techniques for ArSL [5,8,9]. These studies have developed speech recognition engines to convert common Arabic words into ArSL and vice versa. Some studies were designed to measure sign language motion using data gloves [8,9]. To the best of our knowledge, few research works have investigated ArSL vision-based methods which rely on AI techniques including image processing, computer vision, machine learning [3], and deep learning [7,10-12]. Therefore, the contributions of this work can be highlighted in the following points:

•
This study explores the use of the transfer learning approach with pretrained deep learning models such as VGG, ResNet, MobileNet, Inception, Xception, DenseNet, and InceptionResNet for ArSL classification using a large-sized dataset. Additionally, a state-of-the-art pretrained model, BiT (Big Transfer) by Google Research, was evaluated for ArSL classification.

•
This study investigates the use of vision transformers, namely ViT and Swin, with the transfer learning approach for ArSL classification. Since the reviewed literature has shown successful vision-based transformer models applied to other sign languages, these vision transformers are also investigated and evaluated for ArSL classification.

•
Finally, this research work investigates several deep learning convolutional neural network (CNN) architectures trained from scratch for ArSL classification and compares their results with the proposed transfer learning models. An ablation study was conducted to compare different CNN architectures for the classification of Arabic sign language alphabets.
The rest of the paper is organized as follows. Section 2 gives an extensive review of SL recognition methods and resources, moving from machine learning to deep learning methods and transfer learning-based methods for SL classification. Section 3 introduces the methodology framework. Section 4 presents the experimental setup, the ArSL dataset, and the evaluation metrics. Section 5 presents our experimental results using different architectures of deep learning CNN models, pretrained models, and vision transformers. Finally, Section 6 gives the conclusions and directions for future work.

Related Works
During the past two decades, many research works have been published on automatic SL recognition in different languages and countries [13-16]; most of them used device-based sensors such as gloves and wearable devices [17], and the Leap Motion Controller (LMC) [7,15,18-22]. As device-based methods are expensive and require the person to be equipped with sensors, and due to the recent advances in AI, current trends and efforts in SL research have been directed toward the use of vision-based techniques combined with machine learning and deep learning [23-26]. In this section, literature review studies in the SL classification field from 2020 until the present are discussed. Prior to these years, review papers [18,27,28] investigated SL research works published between 1998 and 2021, providing a systematic analysis of the research employed in SL recognition-related studies. Spanning a period of two decades, these reviews cover the various stages involved in the process, such as image acquisition, image segmentation, feature extraction, and classification algorithms. They also highlight the accomplishments of different researchers in achieving high recognition accuracy. Additionally, the reviews discuss several limitations and obstacles faced by vision-based approaches, including a lack of available datasets, the complexity of certain techniques, the nature of the signs themselves, the complexity of backgrounds, and illumination conditions in images and videos.

Overview of SL Classification Research Based on Methods
For the recognition and classification of hand gestures, several machine learning methods have been employed, such as principal component analysis (PCA) [4], support vector machines (SVMs) [4,28-30], K-nearest neighbors (KNN) [3,28], and linear discriminant analysis (LDA) [4,31]. Recently, deep learning methods have been employed widely in the literature, and we focus on them in this paper. The following methods are commonly used for SL recognition: multilayer perceptron (MLP) [28], convolutional neural network (CNN) [23,30,32-36], recurrent neural network (RNN) [37,38], long short-term memory (LSTM) [32,34], and transfer learning-based methods [14]. Table 1 shows the SL classification research studies using various feature extraction techniques coupled with different machine learning and deep learning methods. As can be seen in the table, different research works have employed CNNs for SL classification [23,30,32,34,36]. In ref. [34], researchers developed a real-time sign language recognition system using a CNN and a convex hull algorithm to segment the hand region from input images, employing YCbCr skin color segmentation. The CNN model they proposed consisted of an input layer, two convolution layers, pooling layers, a flattening layer, and two dense layers. When tested with real-time data, the system achieved an accuracy of 98.05%. A 3D CNN model was trained on British SL and achieved a 98% accuracy, outperforming state-of-the-art results on British SL [36]. A real-time system was created in ref. [32] to teach SL to beginners; it utilized skin-color modeling to separate the hand region from the background, and a CNN was used to classify the images. The system achieved a 72.3% accuracy on signed sentences and 89.5% on isolated sign words. In ref. [30], a CNN was employed for hand gesture recognition using a modified structure from the AlexNet and VGG16 models for feature extraction and an SVM for the classification of SL numbers and letters. Their two-level models achieved a 99.82% accuracy, better than the state-of-the-art models. Similarly, in ref. [23], CNN and VGG19 were applied for SL classification using single-handed isolated signs, achieving 99% with VGG19 and 97% with the CNN. The RNN is a type of neural network that handles sequential data of different lengths. Unlike other neural networks, RNNs are not limited to fixed-size inputs, allowing for the processing of sequences with varying lengths or images of various sizes [28]. To address the issue of long-term dependencies, LSTM is a specific type of RNN. Regular RNNs struggle with predicting words stored in long-term memory and rely on recent information for predictions. However, LSTM networks have a built-in ability to retain information for extended periods. This is achieved through a memory cell that acts as a container for holding information over time, making them highly suitable for tasks such as language translation, speech recognition, and time series forecasting [28]. Research works have introduced RNNs for SL classification [37,38]: to recognize static signs, attaining a 95% accuracy in ref. [37], and to improve a real-time sign recognition system for American sign language, obtaining 99% in ref. [38]. In ref. [32], RNN and LSTM were used to classify SL by breaking continuous signs into smaller parts and analyzing them with neural networks. This approach eliminated the need to train for different combinations of subunits. The models were tested on 942 signed sentences containing 35 different sign words. The average accuracy achieved was 72% for signed sentences and 89.5% for isolated words. In transfer learning, the insights, features, or representations learned from the source task are transferred and utilized to enhance the learning process or performance on the target task. The features of the pretrained network are reused; the output of the last layer is removed and replaced by a new layer for the set of classes of the new task. Transfer learning was used for SL classification in ref. [14], wherein a YCbCr segmentation technique based on skin color and the local binary pattern were utilized to accurately segment shapes and capture texture features and local shape information. The VGG-19 model was adjusted to obtain features, which were then combined with manually crafted features using a serial fusion technique. This approach achieved an accuracy of 98.44%. In ref. [2], a combination of three models, including a 3D CNN, a mixed RNN-LSTM, and YOLO, was used for advanced object recognition in a video dataset created for eight different emergency situations. The frames were extracted and used to evaluate these models, which attained 82% for the 3D CNN, 98% for the RNN-LSTM, and 99.6% for the YOLO model.

Overview of SL Classification Research Based on Languages
Many types of SLs have been used in previous studies based on countries and spoken languages. The languages used in SL research are numerous, including American [16,22,41-46], Mexican [47], Arabic [3,7,10,11,39,40,48,49], Algerian [50], Pakistani [51], Indian [52-54], Bangla [55], Chinese [56,57], Indonesian [58], Italian [23], Peruvian [59], Turkish [34], and Urdu [24]. Based on [60], which analyzed SL classification research studies published from 2014 to 2021, another breakdown of the literature was presented according to the local variation in the SL referred to. Most of the works were dedicated to American SL, but Arabic was also included; Chinese SL was second after American SL. Table 2 shows a summary of research works in SL based on the languages covered in our work for the years from 2020 until the present. According to the number of studies in this recent range of years, American SL is still the most dominant in SL research, and other SLs have borrowed many principles from it.

Overview of SL Classification Research Based on Datasets
Many studies in the literature have utilized public datasets, with a focus on American SL due to its accessibility. Other SLs often require researchers to create their own datasets, which they then publish for other researchers to use. In our review of the previous literature, we found that ArSL itself is diverse due to variations between cities and dialects. One significant challenge for SL research is the limited data availability, particularly for ArSL [33]. Datasets, also called databases in earlier research on SL, can be obtained through public or private repositories [16,23,47,59]. SL datasets can be divided into three main categories: finger spelling, isolated signs, and continuous signs. Finger spelling datasets are based primarily on the shape and direction of the fingers; most of the numbers and alphabets in SL are static and use only the fingers. Isolated datasets are equivalent to spoken words and can be static (i.e., images) or dynamic (i.e., videos). Continuous SL datasets are formed from videos of signers representing words and sentences in the language. The evaluation of SL datasets should consider several factors, including size, background conditions, and sign representation. One factor that determines the variability of an SL database is the number of signers, which is important for evaluating the generalization of recognition systems. Signer-independent recognition systems are evaluated on signers other than those involved in the system training by increasing the number of signers. The number of samples per sign is another factor: having several samples per sign, with some variation per sample, is important for training deep learning models. Samples in SL datasets can be in grayscale or RGB format. Any researcher interested in this field can obtain an SL dataset from online repositories such as Kaggle or by building a dataset [3,33-35,51,63,64]. Table 3 summarizes various datasets in terms of isolated vs. continuous, static vs. dynamic, language, recognition model, lexicon size, acquisition mode, signs, signers, single vs. double handed, and background conditions. Existing datasets available for ArSL were explored from the point of view of isolated and continuous signing. To the best of our knowledge, there were only two datasets available for ArSL, where the second was developed from the first by Latif et al. [12]. The ArSL alphabets dataset consists of 54,049 images compiled by more than 40 volunteers for 32 signs and alphabets. More details are given in Section 4.

Overview of SL Classification Research Based on Arabic Language
ArSL classification deals with the identification of alphabets (SLA) and ideographic signs (ISL), whereby the latter is shaped by the diversity of Arabic dialects across countries and their cultures. In 2001, the Arab Federation of the Deaf designated ArSL as the official language for individuals with speech and hearing impairments in Arab countries. Despite Arabic being widely spoken, ArSL is still undergoing development. One major challenge faced by ArSL is "diglossia," where regional dialects are spoken instead of the written language in each nation. In this regard, different spoken dialects have produced different ArSL variants [18], but still, few studies have addressed this problem. The focus of existing works has been on alphabetic ArSL classification [3,10,12]. The datasets used in ArSL research works were mostly collected by the researchers themselves. In ref. [7], a deep feature extractor was used to deal with the fine details of ArSL. A 3D CNN was used to recognize 25 gestures from the ArSL dictionary. The system achieved a 98% accuracy on observed data and an 85% average accuracy on new data; the results could be improved as more data from additional signers are included. In ref. [11], a vision-based system applying a CNN for recognizing ArSL letters and translating them into Arabic speech was proposed. The proposed system automatically detected hand sign letters and spoke out the result in Arabic, achieving a 90% accuracy. These results affirmed that using deep learning models such as CNNs is highly dependable and encouraging. In ref. [35], a 3D CNN skeleton network and 2D CNN models were used for ArSL classification using 80 static and dynamic signs. The dataset was created by each of 40 signers repeating each sign five times. Their models attained accuracies of 98.39% for dynamic and 88.89% for static signs in the signer-dependent mode, and 96.69% for dynamic and 86.34% for static signs in the signer-independent mode. When mixing both modes, the results achieved were 89.62% for signer-dependent and 88.09% for signer-independent recognition. Additional recent research works presented models utilizing fine-tuning with deep learning for the specific task of recognizing ArSL [39,40]; these initiated the research direction of using transfer learning for ArSL and also created the datasets used in this study. CNN models that reduce the size of the dataset required for training while reaching a higher accuracy were investigated in [40]. Models like VGG-16 and ResNet152 were adjusted using the ArSL dataset. To address the class size imbalance, random undersampling was utilized, reducing the image count from 54,049 to 25,600. Moreover, the study in ref. [39] presented a modified ResNet18 model and reported the best accuracy score compared to the previous models, which was 1.87% higher than the existing works using ResNet. A recent study [49] explored continuous ArSL classification using video data and CNN-LSTM-SelfMLP models with MobileNetV2 and ResNet18 backbones for feature extraction. Their study achieved an 87.7% average accuracy.

General Framework
The framework in this study was developed using vision-based approaches instead of sensor-based approaches. Several proposed models were designed, developed, and evaluated to obtain the best possible results for ArSL recognition and classification. In particular, this work aims to investigate the following:
Problem: Few research works have investigated ArSL vision-based methods which rely on AI techniques. The use of transfer learning with recently built large-sized deep learning models and vision transformers remains a relatively unexplored area of research for ArSL.
Solution: To address the abovementioned problem, this work utilized two methodological parts for the classification of Arabic signs into alphabets. The first was the transfer learning approach using pretrained deep learning models, including VGG, ResNet, MobileNet, Xception, Inception, InceptionResNet, DenseNet, and BiT, and vision transformers, including ViT and Swin. The second was the deep learning approach using CNN architectures, namely CNN-1, CNN-2, and CNN-3, with data augmentation and batch normalization techniques. The aim of the latter was to provide a comparison with the pretrained models and vision transformers for the ArSL classification task. The general framework of the methodology is shown in Figure 1. The input dataset was first split into training and test splits to ensure that our study did not suffer from data leakage and inflated accuracy results [66]. Different pretrained models using transfer learning were used to classify images in the training and test splits. Each model has its own hyperparameters that we initialized; then, fine-tuning was performed on each model separately using an added dense layer and one classification layer. On the other hand, three CNN architectures with some variants were proposed with a set of layers, including convolutions, pooling, and fully connected layers, in order to produce as output one of a number of possible classes. We proposed three different CNN architectures, as can be seen in the figure, and trained them from scratch on the dataset. The first two CNNs differed in the number of convolution layers, and the third CNN used image augmentation, batch normalization, and dropout techniques during training.
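To make the leakage point concrete, a held-out split can be sketched in plain Python: shuffle once with a fixed seed, then carve off the test portion before any training takes place. The file names and label scheme below are placeholders for illustration, not the actual ArSL dataset layout.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=42):
    """Shuffle once, then split, so no sample appears in both sets.

    `samples` is a list of (image_path, label) pairs; the paths and
    labels here are hypothetical stand-ins for the real dataset.
    """
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Toy example: 10 labeled samples, 20% held out for testing.
data = [("img_%d.png" % i, i % 32) for i in range(10)]
train, test = split_dataset(data)
assert len(train) == 8 and len(test) == 2
assert set(train).isdisjoint(test)    # no leakage between the splits
```

Fixing the seed also makes the split reproducible, so every model variant in the comparison is evaluated on the same held-out images.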

Transfer Learning for ArSL Classification
Image classification is the process of learning features from images and categorizing the images according to these features. Throughout this section, different pretrained small- and large-sized deep learning models for image classification were utilized. By fine-tuning these models, they learned an inner representation of images that could then be used to extract features useful for downstream tasks. In our research, an ArSL dataset of labeled images was used to train a standard classifier by placing a linear layer on top of the pretrained models. We utilized the transfer learning approach in our experiments, following a series of stages: preprocessing the data, dividing it into splits, fine-tuning the hyperparameters, determining the loss function and optimizer, defining the model architecture, and training it. Once trained, we validated and tested the results. Several deep learning pretrained models were utilized in this study, as follows.
VGG was introduced by the Oxford Visual Geometry Group and achieved second place in the ImageNet challenge [67]. The VGG family of networks remains popular today and is often used as a benchmark for comparing newer architectures. VGG improved upon previous designs by replacing a single 5 × 5 layer with two stacked 3 × 3 layers, or a single 7 × 7 layer with three stacked 3 × 3 layers. This approach has several benefits, as using stacked layers reduces the number of weights and operations compared to a single layer with a large filter size. Moreover, stacking multiple layers enhances the discriminative capabilities of the decision function. We utilized two variants in this study: VGG16, as one of the baselines in [39], and VGG19.
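The weight savings from stacking small filters can be checked with simple arithmetic (biases ignored, and equal input/output channel counts assumed for illustration):

```python
def conv_params(k, c_in, c_out):
    """Weights in one k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

C = 64  # channel count; any value shows the same trend

# Two stacked 3x3 layers vs one 5x5 layer (same receptive field):
stacked_two_3x3 = 2 * conv_params(3, C, C)   # 18 * C^2 weights
single_5x5 = conv_params(5, C, C)            # 25 * C^2 weights

# Three stacked 3x3 layers vs one 7x7 layer:
stacked_three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2 weights
single_7x7 = conv_params(7, C, C)             # 49 * C^2 weights

assert stacked_two_3x3 < single_5x5
assert stacked_three_3x3 < single_7x7
```

The stacked form is cheaper in both cases, and each intermediate layer also adds a nonlinearity, which is the source of the improved discriminative capability mentioned above.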
ResNet, short for residual neural network, is a deep learning architecture introduced by researchers at Microsoft [68]. It has become one of the most popular and influential network architectures in the field of computer vision. The main idea behind ResNet is the use of residual connections, which aim to overcome the degradation problem that arises when deep neural networks are trained. The degradation problem refers to the observation that as the network depth increases, the network's accuracy starts to saturate and then rapidly degrades. To address this problem, ResNet introduces "skip connections" that allow the network to bypass some layers and learn residual mappings. These skip connections connect layers that are not adjacent to each other, allowing the network to skip over certain layers and reuse the learned features from previous layers. By doing so, ResNet ensures that the gradient signals during the training process can flow more easily, preventing the degradation problem. The main building block in ResNet is called a residual block. It consists of two or more convolutional layers, usually of size 3 × 3, and a skip connection that adds the input of the block to its output. This skip connection enables the network to learn the residual mapping, which is the difference between the output and the input of the block. ResNet-152, as another baseline in [40], and ResNet-50, with the number indicating the total number of layers in the network, were used in this study.
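The core computation of a residual block, output = F(x) + x, can be sketched in a few lines; the convolutional layers are replaced here by a trivial elementwise linear transform purely to show the role of the skip connection:

```python
def layers(x, weight, bias):
    """A stand-in for the block's convolutional layers, F(x)."""
    return [weight * v + bias for v in x]

def residual_block(x, weight, bias):
    """Output = F(x) + x: the skip connection adds the input back."""
    fx = layers(x, weight, bias)
    return [f + v for f, v in zip(fx, x)]

x = [1.0, 2.0, 3.0]
# If the layers learn nothing (all-zero weights), the block still passes
# x through unchanged: the identity mapping that plain deep stacks
# struggle to learn, which is what eases the degradation problem.
assert residual_block(x, weight=0.0, bias=0.0) == x
```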
MobileNet [69] and MobileNetV2 [70] were also utilized in this study. MobileNet is a convolutional neural network architecture designed to be lightweight and efficient for mobile and embedded devices with limited computational resources. MobileNet achieves this efficiency by using depthwise separable convolutions and, in MobileNetV2, a technique called linear bottlenecks. Depthwise separable convolutions split the standard convolution operation into two separate operations: a depthwise convolution and a pointwise convolution. This reduces the number of parameters and computations required, resulting in a lighter model. Both variants were applied to the ArSL classification dataset used in our study.
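The parameter reduction from the depthwise/pointwise split follows directly from the layer shapes; the channel counts below are illustrative, not taken from any specific MobileNet layer:

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise separable version of the same layer."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

# A 3x3 layer with 128 input and 256 output channels:
std = standard_conv_params(3, 128, 256)   # 294,912 weights
sep = separable_conv_params(3, 128, 256)  # 33,920 weights
assert sep < std   # roughly 11.5% of the standard parameter count
```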
Inception [71] and InceptionResNet [72] were fine-tuned for ArSL in this work. The Inception architecture, also known as GoogLeNet, is characterized by its use of the Inception module, designed to capture multiscale features in an efficient manner. It achieves this by using multiple convolutional filters of different sizes (1 × 1, 3 × 3, and 5 × 5) in parallel and concatenating their outputs. By doing so, the Inception module can capture features at different scales without significantly increasing the number of parameters.
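Because the parallel branches are concatenated along the channel axis, the module's output depth is simply the sum of the branch depths. The branch widths below are hypothetical, chosen only to illustrate the bookkeeping:

```python
def inception_output_channels(branch_channels):
    """Parallel branches run side by side; their feature maps are
    concatenated along the channel axis, so the output depth is the sum."""
    return sum(branch_channels)

# Hypothetical module: 1x1, 3x3, and 5x5 branches plus a pooling branch.
branches = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
assert inception_output_channels(branches.values()) == 256
```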
InceptionResNet, on the other hand, combines the Inception architecture with residual connections to allow the network to directly propagate information from one layer to a deeper layer. This helps to alleviate the problem of vanishing gradients and enables the network to learn more easily, even when it becomes very deep. The combination of the Inception module and residual connections in InceptionResNet results in a powerful and efficient architecture that has demonstrated state-of-the-art performance on various benchmark datasets and has been widely adopted in both research and industry applications. We utilized two variants, InceptionV3 and InceptionResNetV2, for ArSL classification.
Xception, proposed in 2016, stands for "Extreme Inception", as it is an extension of Google's Inception architecture [73]. The main goal of Xception is to improve the efficiency of deep learning models. It achieves this by replacing traditional convolutional layers with a modified version called depthwise separable convolutions, which separate the spatial filtering and channel filtering processes. It applies spatial filtering on each input channel separately, followed by a 1 × 1 convolution that combines the filtered outputs. This reduces the number of parameters and allows for better learning capabilities. We utilized Xception in our study and evaluated it for ArSL classification. DenseNet, developed in [74], has a primary building block called the dense block, which consists of a series of layers that are connected to each other in a dense manner. Within a dense block, each layer receives the feature maps of all preceding layers as input, resulting in highly connected feature maps. This encourages feature reuse and enables the network to capture more diverse and abstract features. To ensure that the network does not become excessively large and computationally expensive, DenseNet incorporates a transition layer between dense blocks. The transition layer consists of a convolutional layer followed by a pooling operation, which reduces the number of feature maps, compresses information, and controls the growth of the network. We utilized one variant, DenseNet169.
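The channel bookkeeping inside a dense block, and the compression applied by the transition layer, can be sketched as follows. The growth rate, layer counts, and compression factor below are illustrative and not the exact DenseNet169 configuration:

```python
def dense_block_channels(c_in, num_layers, growth_rate):
    """Each layer sees all earlier feature maps and contributes
    `growth_rate` new ones, so the depth grows linearly."""
    return c_in + num_layers * growth_rate

def transition_channels(c, compression=0.5):
    """The transition layer compresses the feature maps between blocks."""
    return int(c * compression)

# Illustrative numbers only:
c = dense_block_channels(c_in=64, num_layers=6, growth_rate=32)  # 64 + 6*32
assert c == 256
assert transition_channels(c) == 128   # halved before the next block
```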
BiT (Big Transfer by Google Research) [75] was additionally utilized in this study as one of the large-scale pretrained models. It was designed to transfer knowledge from pretraining tasks to downstream tasks more effectively. The BiT model follows a two-step process: pretraining and transfer learning. During pretraining, the model is trained on large and diverse image datasets. The goal is to learn general visual representations that capture various concepts and patterns in the images, helping the model develop a strong understanding of the visual world. After pretraining, the knowledge learned by the BiT model is transferred to specific downstream tasks. Transfer learning involves fine-tuning the pretrained model on a smaller labeled dataset that is specific to the target task. By starting with a pretrained model, the transfer learning process benefits from the rich visual representations learned during pretraining. This often leads to improved performance and faster convergence on the target task compared to training a model from scratch. BiT offers a range of model variations, each with a different size and capacity. In this study, BiT helped accelerate transfer learning by being fine-tuned on our ArSL dataset. This approach enables the model to leverage the knowledge gained during pretraining, leading to improved performance and efficiency for the classification of sign language alphabets. This study utilized two variants of the BiT architecture, namely BiT-m-r50x1 and BiT-m-r50x3, for ArSL recognition.
Additionally, two pretrained vision transformers were utilized for ArSL classification in this study, as follows. ViT, introduced in ref. [76], is a vision transformer which treats an image as a sequence of patches, similar to how words are treated in NLP tasks. The image is divided into small patches, and each patch is flattened into a 1D vector. These patch embeddings serve as the input to the transformer model. The transformer architecture consists of multiple layers, including multihead self-attention and feed-forward neural networks. The self-attention mechanism allows the model to attend to different patches and capture the relationships between them. It helps the model understand the context and dependencies between different parts of the image. ViT also incorporates positional embeddings, which provide information about the spatial location of each patch. Two variants of ViT, namely ViT_b16 and ViT_l32, were used in our work.
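The patch-extraction step that ViT performs before the transformer layers can be sketched in TensorFlow as follows (a patch size of 16 as in ViT_b16; the helper name is ours):

```python
import tensorflow as tf

# Split a batch of images into non-overlapping patches and flatten each patch
# into a 1D vector, as ViT does before feeding them to the transformer layers.
def extract_patches(images, patch_size):
    batch = tf.shape(images)[0]
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )
    # -> (batch, num_patches, patch_size * patch_size * channels)
    return tf.reshape(patches, [batch, -1, patch_size * patch_size * images.shape[-1]])

imgs = tf.random.uniform((2, 224, 224, 3))
patches = extract_patches(imgs, patch_size=16)   # 14 x 14 = 196 patches per image
print(patches.shape)  # (2, 196, 768)
```

Each of the 196 patch vectors (16 × 16 × 3 = 768 values) would then be linearly projected and combined with a positional embedding.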
The Swin transformer was also utilized in this study. According to ref. [77], Swin is built on the Transformer architecture, which consists of multiple layers of self-attention mechanisms to process and understand sequential data. Swin uses shifted windows, which work by dividing the input image into smaller patches or windows and shifting them during the self-attention process. This approach allows Swin to efficiently process large images without relying on computationally expensive operations, such as sliding window mechanisms or convolutional operations. By shifting the patches, Swin extends the receptive field to capture global dependencies, enabling the model to better understand the visual context. In this work, we employed the variant named SwinV2Tiny256 for ArSL classification.
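The window partitioning behind Swin's shifted-window attention can be sketched as follows (a window size of 7 and a stage-1 feature map of 56 × 56 × 96 as in Swin-Tiny; the shift amount and helper name are illustrative):

```python
import tensorflow as tf

# Partition a feature map into non-overlapping windows, as in Swin; attention
# is then computed within each window. The "shift" in shifted-window attention
# is a cyclic roll of the feature map before partitioning.
def window_partition(x, window_size):
    b, h, w, c = x.shape
    x = tf.reshape(x, [b, h // window_size, window_size, w // window_size, window_size, c])
    x = tf.transpose(x, [0, 1, 3, 2, 4, 5])
    return tf.reshape(x, [-1, window_size, window_size, c])  # (num_windows*b, ws, ws, c)

feat = tf.random.uniform((1, 56, 56, 96))             # Swin-Tiny stage-1 feature map
shifted = tf.roll(feat, shift=(-3, -3), axis=(1, 2))  # shifted-window variant
windows = window_partition(shifted, window_size=7)
print(windows.shape)  # (64, 7, 7, 96)
```

Because attention is restricted to 7 × 7 windows, cost grows linearly with image size rather than quadratically, which is the efficiency property described above.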

CNN for ArSL Classification: An Ablation Study
To further validate the performance of pretrained models and vision transformers, we conducted an ablation study on CNN deep learning architectures. The purpose of the study was to investigate and understand the individual contributions and importance of different components of the CNN models. The study involved systematically removing or disabling specific features, layers, or components of the model and evaluating the impact on the model's overall performance. We targeted three questions in this study: Which layers or modules of the model are essential for achieving a high performance? What is the relative importance or contribution of convolution filters and their size to the model? How does the model's performance change when using batch normalization and data augmentation? To this aim, three CNN architectures were utilized. CNN-1 had three convolution layers, three max-pooling layers, one fully connected (dense) layer, and one output classification (softmax) layer. The CNN-2 network had five convolution layers, five max-pooling layers, one fully connected (dense) layer, and an output classification (softmax) layer. The convolution layers in both CNN-1 and CNN-2 used filters, which were applied to the entire input data, with each filter characterized by a set of adjustable weights. The max-pooling layers were used to select the neuron with the highest activation value within each local receptive field, forwarding only that value. The output from the last convolutional layer was flattened; this was necessary because the fully connected layer expected a one-dimensional input, while the output of the convolutional layer was three-dimensional. The fully connected (dense) layer received the flattened input and passed its weights to the final classification (softmax) layer. The final layer acted as a classification layer, utilizing the softmax function, which is typically employed in the output layer of a classifier to generate probabilities for
each input's class identification. The CNN-3 network was the same as CNN-2 but also applied batch normalization, dropout, and data augmentation. Batch normalization provides a method to process data in the hidden layers of a network, similar to the standard score. It ensures that the outputs of the hidden layer are normalized within each minibatch, maintaining a mean activation value close to 0 and a standard deviation close to 1. This technique can be applied to both convolutional and fully connected layers. When the training data are limited, the network may become prone to overfitting. To address this issue, data augmentation was used to artificially increase the size of the training set. We used common image augmentations, including random rotation with a factor of 0.2 and random horizontal and vertical flipping. The proposed CNN architectures for ArSL classification are shown in Table 4. ArSL images with three RGB color channels were resized to 224 × 224 pixels before being fed as input to the CNN models. In all of our CNN models, we used the leaky ReLU (rectified linear unit) activation function, defined as f(x) = x for x > 0 and f(x) = αx for x ≤ 0, where x represents the input to the activation function, and α is a small constant that determines the slope of the function when x is less than or equal to 0. Typically, the value of α is set to a small positive number, such as 0.01, to ensure a small gradient when the input is negative. This helps to address the issue of dead neurons in the ReLU activation function, where neurons can become inactive and stop learning if their inputs are always negative. Leaky ReLU allows for a small gradient and thus enables the learning process even for negative values. By incorporating this ablation study into our methodology, we gained insights into which components were critical for the model's performance and which were less important or redundant. This helped us understand the underlying behavior of the model, leading to better model design choices for ArSL classification, as shown in Section 5.
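A minimal Keras sketch of a CNN-3-style network follows; the filter counts and dense size are illustrative, not the exact values in Table 4, but the structure (five conv/pool stages with batch normalization, leaky ReLU, dropout, augmentation, and a 32-way softmax) matches the description above:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 32
ALPHA = 0.01  # leaky ReLU slope for negative inputs

# Augmentation applied only during training: rotation factor 0.2 plus
# horizontal/vertical flips, as described in the text.
augment = keras.Sequential([
    layers.RandomRotation(0.2),
    layers.RandomFlip("horizontal_and_vertical"),
])

def conv_block(filters):
    # Conv -> BatchNorm -> LeakyReLU -> MaxPool, one of the five stages.
    return [
        layers.Conv2D(filters, 3, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(ALPHA),
        layers.MaxPooling2D(),
    ]

model = keras.Sequential(
    [keras.Input(shape=(224, 224, 3)), augment]
    + conv_block(32) + conv_block(64) + conv_block(128)
    + conv_block(128) + conv_block(256)
    + [layers.Flatten(),
       layers.Dense(256), layers.LeakyReLU(ALPHA), layers.Dropout(0.5),
       layers.Dense(NUM_CLASSES, activation="softmax")]
)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Dropping the BatchNormalization and augmentation layers from this sketch yields the CNN-2 configuration, which is exactly the kind of component removal the ablation study performs.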

Dataset
To evaluate the proposed models, the ArSL dataset presented in ref. [12] was utilized in this work. The dataset contains 54,049 images distributed across 32 classes of Arabic signs. The dimensions of the images are 64 × 64, and many variations of images are presented using different lighting and backgrounds. A randomly generated sample from the dataset can be seen in Figure 2. One observation noted was that the number of images per class was not balanced. Table 5 shows the classification of the Arabic alphabet signs, with labels, the English transliteration of each alphabet, and the number of images, which is also visualized in Figure 3. It has been claimed in ref. [12] that this dataset is sufficient for the training and classification of ArSL alphabets. We did not make any attempt to identify and eliminate noisy features, as we believe their inclusion enhances the model's resilience and ability to generalize to various situations.

Baselines
This work was compared to four baselines from existing work reported in refs. [39,40]. In these two studies, VGG16 and ResNet152 were fine-tuned for ArSL recognition and classification using our dataset. We also tested VGG19 and ResNet variants using our dataset to establish comparisons between the performance of the proposed pretrained models and vision transformers and that of the baselines.

Resources and Tools
The resources utilized in this research included three machines used for our initial experiments, each with an Intel(R) Core(TM) i7-10700T 3.2 GHz processor and 16 GB of RAM. For our final experiments, we used a Tesla A100 offered by Google Colab Pro for accelerated deep learning tasks. The A100 is a powerful, high-performance GPU that excels in deep learning workloads, offering the NVIDIA Ampere architecture, 6912 CUDA cores, 432 Tensor cores, and 40 GB of high-bandwidth memory (HBM2). Model training and testing were implemented using TensorFlow 2.8.0, which uses Keras as a high-level API. Complementary libraries for training the proposed models, including Scikit-Learn, Matplotlib, Pandas, NumPy, os, seaborn, and glob, were used.

Evaluation
The effectiveness of the proposed ArSL classification was evaluated based on four major outcomes [32]: true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn). The accuracy determines the ability to differentiate ArSL signs correctly and was computed by the following formula: Accuracy = (tp + tn)/(tp + tn + fp + fn). In deep learning, the loss metric is a measure of how well the model is performing in terms of its ability to make accurate predictions. It represents the discrepancy between the predicted output of the model and the actual output or target value. The loss metric is commonly used to quantify the error or cost associated with the model's predictions, and the goal is to minimize this error by adjusting the model's parameters during the training process. The loss measure in this study was the cross-entropy, computed by the following formula: Loss = −Σ y_true · log(y_pred), where y_pred is the predicted output of the model, and y_true is the ground-truth output or target value. We also used the AUC, precision, recall, and F1-score to evaluate our models. The AUC (area under the receiver operating characteristic curve) measures the overall performance of a binary classification model across various classification thresholds. It quantifies the model's ability to distinguish between positive and negative samples. The AUC represents the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (recall) against the false positive rate (1 − specificity) at various classification thresholds. Precision measures the proportion of true positive predictions out of all positive predictions made by the model, i.e., Precision = tp/(tp + fp). It indicates how well the model correctly identifies positive samples.
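The accuracy and cross-entropy computations above can be sketched numerically as follows (the outcome counts and prediction vectors are illustrative):

```python
import numpy as np

# Accuracy from the four outcome counts, matching the formula above.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Cross-entropy loss between one-hot targets and predicted probabilities,
# averaged over the batch; eps guards against log(0).
def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1).mean()

print(accuracy(tp=90, tn=880, fp=10, fn=20))        # 0.97

y_true = np.array([[0.0, 1.0, 0.0]])                # correct class is index 1
y_pred = np.array([[0.1, 0.8, 0.1]])                # model assigns it 0.8
print(round(cross_entropy(y_true, y_pred), 4))      # -log(0.8) ≈ 0.2231
```

The loss shrinks toward 0 as the probability assigned to the correct class approaches 1, which is what training minimizes.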
Recall, also known as sensitivity or the true positive rate, measures the proportion of true positive predictions out of all actual positive samples, i.e., Recall = tp/(tp + fn). It indicates how well the model captures the positive samples.
The F1 score is the harmonic mean of precision and recall, i.e., F1 = 2 · Precision · Recall/(Precision + Recall). It provides a balanced measure of the model's performance by considering both precision and recall simultaneously. The F1 score ranges between 0 and 1, with 1 indicating perfect precision and recall, and 0 indicating poor performance.
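Precision, recall, and F1 can likewise be computed directly from the outcome counts (illustrative values):

```python
# Precision, recall, and F1 from the outcome counts defined above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall

p = precision(tp=90, fp=10)   # 0.9
r = recall(tp=90, fn=20)      # ~0.818
print(round(f1_score(p, r), 3))  # 0.857
```

Note how F1 sits between the two values but closer to the smaller one, which is the penalty the harmonic mean imposes on imbalanced precision/recall pairs.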

Experimental Results from the CNN Approach
CNN-1, CNN-2, and CNN-3 utilized different convolutional and max-pooling layers with various sizes and numbers of filters. Following the convolution and pooling layers, two fully connected layers were utilized to flatten the output. Eight variants of CNN models were tested, wherein the number of filters was changed in the convolution layers. To obtain initial results, the models were trained for one cycle (i.e., one epoch) on 48,645 images for training and 5404 images for validation. Then, we used 10 epochs to show the performance of the suggested architectures. The model which showed robust performance was trained for more epochs (30 epochs and 100 epochs).
Figure 4 illustrates the performance of the proposed CNN architectures in terms of accuracy and loss over epochs. Meanwhile, Table 6 presents the results of our ablation study, showcasing metrics such as loss, accuracy, precision, recall, AUC, and F1-score. After noting the unstable performances of the CNN-1 and CNN-2 models, we observed that the performance became more consistent with CNN-3 when we implemented augmentation and batch normalization techniques. To further improve the CNN-3 model, we increased the number of training epochs and closely monitored training progress, as illustrated in Figure 5. The results indicated that the model became more resilient and displayed a gradual improvement in accuracy, accompanied by a decrease in loss values. Specifically, after 100 epochs, the CNN-3 model achieved an accuracy of 0.6323. Consequently, we decided to compare its performance with transfer learning methods, which is discussed in Section 5.2.

Experimental Results from the Transfer Learning Approach
Table 7 shows the experimental results from transfer learning for ArSL classification using 15 variants from the following models: VGG, ResNet, MobileNet, Xception, Inception, InceptionResNet, DenseNet, BiT, ViT, and Swin. In these models, the top layer responsible for classification was removed, and our own classification layer was added on top. The proposed pretrained models, except for MobileNet, were initialized with pretrained weights from the ImageNet dataset, which contains millions of images and thousands of object categories. The MobileNet models were initialized with random weights instead of pretrained weights. VGG19 and ResNet were used to provide powerful base models for extracting rich features from input images that can be used to train a classifier. As can be seen in the left half of the table below, one cycle of fine-tuning (i.e., one epoch) showed promising results with some models, such as ResNet152, but poor performance with large-sized models, such as ViT_l32. The results after one cycle of fine-tuning indicated that the pretrained models and vision transformers required additional training to achieve competitive performance. To further validate the performance of our models, we also fine-tuned them for several epochs and reported their accuracy and loss results. The right half of Table 7 shows the experimental results after fine-tuning the models for 10 epochs on 32 classes of signs, with 48,645 images for training and 5404 images for validation. Additionally, we used several figures to monitor the performance of the pretrained models, the vision transformers, and the baseline models. Figure 6 visualizes the performance of the pretrained models used for ArSL classification, including MobileNet, MobileNetV2, Xception, InceptionV3, InceptionResNetV2, and DenseNet169. We noticed that MobileNet, MobileNetV2, and Xception showed robust learning, while InceptionV3, InceptionResNetV2, and DenseNet169 showed unstable performance. Based on the results after fine-tuning for ArSL classification, the pretrained models achieved accuracy results in the range between 0.88 and 0.96.
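The head-replacement procedure described above (removing the top classification layer and adding our own) can be sketched in Keras. This example uses MobileNetV2 with random weights, matching the MobileNet setup described above; the other backbones would instead pass weights="imagenet":

```python
import tensorflow as tf
from tensorflow import keras

NUM_CLASSES = 32  # ArSL alphabet classes

# Load the backbone without its top classification layer (include_top=False);
# weights=None gives random initialization, as used for the MobileNet models.
base = keras.applications.MobileNetV2(
    include_top=False, weights=None, input_shape=(224, 224, 3))
base.trainable = True  # fine-tune all backbone layers

# Our own classification head: pooling plus a 32-way softmax layer.
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # datasets assumed
```

Only the head is new; everything else is the published architecture, which is what lets the ImageNet-initialized variants converge in a handful of epochs.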
The InceptionResNetV2 model attained the highest accuracy of 0.9817, followed by the MobileNet variants, which attained an accuracy of 0.96. DenseNet169 achieved the lowest accuracy of 0.5587. Notably, the deep learning CNN-based architecture trained from scratch achieved an accuracy of 0.6323, which was low compared to transfer learning with pretrained models. However, the experimentally chosen structure of CNN-3, the randomly initialized weights, and longer training times could change these results.
Hence, we believe that the effectiveness of using transfer learning is highly promising for ArSL recognition. Additionally, our findings highlight the importance of selecting appropriate model architectures and hyperparameters to achieve the best possible performance for ArSL tasks. Figure 7 shows the performance of the BiT (Big Transfer) model used for ArSL classification. As can be seen, different variants of BiT may result in different performance values. The accuracy of BiT was above 90%. As another example, different variants of the vision transformers achieved different results, as shown in Figure 8. The ViT and Swin transformer models achieved great performance with stable learning over time, reaching 88-97% accuracy after fine-tuning for classification. Our experimental study focused on examining the performance of various pretrained models and vision transformers on an isolated ArSL dataset. The results indicated that certain pretrained models, such as MobileNet, Xception, and Inception, tended to overfit the data, as depicted in Figure 6. On the other hand, models such as our CNNs, ViT, and the Big Transfer models exhibited superior learning performance without any signs of overfitting, as demonstrated in Figures 7 and 8. As a comparison with other baseline works on ArSL in refs. [39,40], we retrained VGG16, VGG19, ResNet50, and ResNet152, which attained 85-98% accuracy when applied to our dataset. Our findings highlight that these baselines initiated the research direction of using pretrained models and transfer learning for ArSL classification and recognition, and our results from 11 variants of other state-of-the-art models confirmed it as a successful direction. Figures 9 and 10 illustrate the performance of the VGG and ResNet baselines. The results reveal that pretrained CNN architectures can exhibit rapid learning, but they are also susceptible to sudden drops in performance. We found that ResNet and InceptionResNet achieved a higher accuracy compared to the other models. This can be justified by looking at the InceptionResNet architecture, which combines the advantages of both the Inception and ResNet models. By combining the concepts of both Inception and ResNet, it leverages the strengths of both architectures to achieve better results in ArSL image classification. Moreover, InceptionResNet incorporates residual connections, allowing for the reuse of features across different layers. This helps to combat the vanishing gradient problem and enables a more efficient training of deep models.

Conclusions and Future Works
This research paper investigated various methods for recognizing Arabic Sign Language (ArSL) using transfer learning. The study utilized pretrained models originally designed for image classification tasks, such as VGG, ResNet, MobileNet, Xception, Inception, InceptionResNet, and DenseNet. A pretrained Google Research model named BiT (Big Transfer) was also evaluated on our dataset. Additionally, the paper explored state-of-the-art vision transformers, specifically ViT and Swin, for the task at hand. Different deep learning CNN-based architectures were also employed and compared against the pretrained models and vision transformers. Experimental results revealed that the transfer learning approach, using both pretrained models and vision transformers, achieved a higher accuracy compared to traditional CNN-based deep learning models. Although the pretrained models performed better in terms of accuracy, the vision transformers exhibited more consistent learning. The findings suggest that using recently designed pretrained models such as BiT and InceptionResNet, as well as vision transformers like ViT and Swin, can enhance accuracy, parameter efficiency, feature reuse, scalability, and overall performance in ArSL classification tasks. These advantages make these models a promising choice for future ArSL classification tasks involving words and sentences in image and video datasets. In particular, transfer learning can be a valuable tool for image classification tasks in other sign languages with limited resources and data, outperforming traditional CNN deep learning models in many scenarios. Some limitations of this work are as follows. The dataset was not representative enough to capture the complexity and variability of ArSL words and sentences as a whole; a larger and more diverse dataset could potentially provide a more comprehensive evaluation. Furthermore, practical implementation challenges such as computational resources, model deployment, real-time performance, and usability could pose obstacles that need to be addressed for real-world applications.
We propose several areas of future research that can enhance the progress of ArSL recognition and sign language understanding. These suggestions can also offer valuable guidance for the creation of robust and efficient models for other sign languages. Some of the recommended future works are as follows:

• Address the imbalance in datasets: Imbalanced datasets can affect the performance of deep learning models. Future work can focus on handling the class imbalance present in ArSL datasets to ensure a fair representation and improve the accuracy of minority classes.

• Extend the research to video-based ArSL recognition: The paper focused on ArSL recognition in images, but future work can expand the research to video-based ArSL recognition. This would involve considering temporal information and exploring techniques such as 3D convolutional networks or temporal transformers to capture motion and dynamic features in sign language videos.

• Investigate hybrid approaches: Explore hybrid approaches that combine both pretrained models and vision transformers. Investigate ways to leverage the strengths of both architectures to achieve an even higher accuracy and stability in ArSL recognition tasks. This could involve combining features from pretrained models with the attention mechanisms of vision transformers.

• Further investigate the fine-tuning strategies and optimization techniques for transfer learning with pretrained models and vision transformers. Explore different hyperparameter settings, learning rate schedules, and regularization methods to improve the performance and stability of the models.

• Explore techniques for data augmentation and generation specifically tailored for ArSL recognition. This can help overcome the challenge of limited data and enhance the performance and generalization of the models. Techniques such as generative adversarial networks (GANs) or domain adaptation can be investigated.

• Investigate cross-language transfer learning: Explore how pretrained models and vision transformers trained on one sign language can be used to improve the performance of models on another sign language with limited resources. This can help address the challenges faced by sign languages with limited data availability.

• Finally, continue benchmarking and comparing different architectures, pretrained models, and vision transformers on ArSL recognition tasks. Evaluate their performance on larger and more diverse datasets to gain a deeper understanding of their capabilities and limitations. This can help identify the most effective models for specific ArSL recognition scenarios.
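As a concrete starting point for the first suggestion above, class imbalance is commonly handled with inverse-frequency loss weights; a minimal sketch (the label array is illustrative) follows:

```python
import numpy as np

# Inverse-frequency class weights for an imbalanced label set, a common remedy
# for the imbalance noted in the ArSL dataset. weight_c = total / (K * count_c),
# as in scikit-learn's "balanced" heuristic; the resulting dict can be passed
# to Keras via model.fit(..., class_weight=weights).
def class_weights(labels, num_classes):
    counts = np.bincount(labels, minlength=num_classes)
    total = len(labels)
    return {c: total / (num_classes * counts[c]) for c in range(num_classes)}

labels = np.array([0, 0, 0, 1, 1, 2])   # class 2 is under-represented
weights = class_weights(labels, num_classes=3)
print(weights)  # rarer classes get proportionally larger weights
```

Under-represented classes then contribute more to the loss per sample, counteracting the bias toward majority classes.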

Figure 1. The general framework of the methodology proposed for this study.

Figure 3. Samples distribution of the Arabic Sign Language dataset.

Figure 4. Performance of various CNN model architectures used for ArSL classification.

Figure 5. Performance of the CNN-3's best results using a different number of epochs.

Figure 6. Performance of pretrained deep learning models used for ArSL classification.

Figure 7. Performance of the BiT (Big Transfer) model used for ArSL classification.

Figure 8. Performance of the ViT and Swin vision transformers used for ArSL classification.

Table 1. Summary of intelligent vision-based methods using various feature extraction techniques.

Table 2. Summary of SL research reviewed in this work from 2020 to 2023, listed alphabetically according to their acronyms.

Table 3. Summary of the key parameters of sign language recognition studies.

Table 4. Architectural details of CNN models for ArSL classification.

Table 5. The classification of the Arabic alphabet signs, with labels and number of images.

Table 6. Ablation study results from deep learning CNN models trained for ArSL classification.

Table 7. Experimental results from transfer learning for ArSL classification using the baseline models and the proposed models.