Empowering Communication: A Deep Learning Framework for Arabic Sign Language Recognition with an Attention Mechanism

This article emphasises the urgent need for appropriate communication tools for communities of people who are deaf or hard-of-hearing, with a specific emphasis on Arabic Sign Language (ArSL). In this study, we use long short-term memory (LSTM) models in conjunction with MediaPipe to reduce the barriers to effective communication and social integration for deaf communities. The model design incorporates LSTM units and an attention mechanism to handle the input sequences of extracted keypoints from recorded gestures. The attention layer selectively directs its focus toward relevant segments of the input sequence, whereas the LSTM layer handles temporal relationships and encodes the sequential data. A comprehensive dataset comprising fifty frequently used words and numbers in ArSL was collected for developing the recognition model. This dataset contains many instances of gestures recorded by five volunteers. The results of the experiment support the effectiveness of the proposed approach, as the model achieved accuracies of more than 85% (individual volunteers) and 83% (combined data). This level of accuracy emphasises the potential of artificial intelligence-powered translation software to improve effective communication for people with hearing impairments and to enable them to interact with the larger community more easily.


Introduction
People with hearing loss and speech impairments are deprived of effective contact with the rest of the community. According to the statistics of the International Federation of the Deaf and the World Health Organisation (WHO), more than 5% of people around the world, approximately 360 million people, are deaf and have severe difficulties communicating with those without hearing impairments. Instead of speech, deaf individuals communicate using sign language (SL) [1]. SL facilitates communication both within the deaf community and between deaf and hearing people. SL is a visual communication system that encompasses both manual elements, such as hand gestures, and nonmanual elements, such as facial expressions and body movements [2]. SL is a complicated style of communication based mostly on hand gestures. These gestures are formed by different components, such as hand shape, hand motion, hand location, palm orientation, the movement of the lips, facial expressions, and points of contact between the hands or between the hands and other parts of the body, to express words, letters, and numbers.
Many sign languages exist in the deaf community, roughly one per country, and they vary as much as spoken languages [3], e.g., Arabic Sign Language (ArSL), American Sign Language (ASL), British Sign Language (BSL), Australian Sign Language (Auslan), French Sign Language (LSF), Japanese Sign Language (JSL), Chinese Sign Language (CSL), German
1. The DArSL50 dataset is a large-scale dataset comprising 50 dynamic gestures in Arabic Sign Language (ArSL), including words and numbers, resulting in a total of 7500 video samples. This extensive dataset addresses the lack of sufficient data for dynamic gestures in ArSL and supports the development and evaluation of robust sign language recognition systems.
2. The proposed model leverages long short-term memory (LSTM) units with an attention mechanism combined with MediaPipe for keypoint extraction. This architecture effectively handles the temporal dynamics of gestures and focuses on relevant segments of input sequences.
3. The model's performance was evaluated in the following two scenarios: individual volunteer data and combined data from multiple volunteers. This dual evaluation approach ensures that the model is tested for its ability to generalise across different individuals and different signing styles.
4. The proposed framework is validated for real-time performance.
The rest of this paper is organised as follows. Section 2 describes the methodology of the proposed ArSL recognition system and includes details about the DArSL50 dataset. The experimental results are reported in Section 3, while an explanation of the results is presented in Section 4. Section 6 concludes the discussion and outlines future research directions.
The following two categories of sign language recognition systems can be distinguished according to the method used for data collection in the academic literature: sensor-based and vision-based [11], as shown in Figure 1. In the sensor-based method, sensors and equipment are used to collect the position, hand motion, wrist orientation, and velocity. Flex sensors, for instance, are used to measure finger movements. The inertial measurement unit (IMU) measures the acceleration of the fingers using a gyroscope and an accelerometer. The IMU is also used to detect wrist orientation. Wi-Fi and radar systems use electromagnetic signals to detect variations in signal intensity in the air. Electromyography (EMG) identifies finger mobility by measuring the electrical pulses in human muscles and then decoding the biosignal. Other devices include haptic, mechanical, electromagnetic, ultrasonic, and flex sensors [12]. Sensor-based systems have an important advantage over vision-based systems, since gloves can rapidly communicate data to computers [13]. Device-based sensors (the Microsoft Kinect sensor, the Leap Motion Controller, and electronic gloves) can directly extract features without preprocessing, which means that they can minimise the time needed to prepare sign language datasets, data can be obtained directly, and a good accuracy rate can be achieved in comparison with vision-based devices [14]. Figure 2 demonstrates the primary phases of SL gesture data collection and detection using the sensor-based system. The sensor-based approach has the drawback of requiring the end-user to have a physical connection to the computer, which makes it impractical. Furthermore, it is expensive due to the use of sensitive gloves [13]. Despite the accuracy of the data that may be obtained from these devices, whether they are worn as gloves or coupled to a computer, devices such as the Leap Motion or Microsoft Kinect remain uncomfortable for users [14].
Another option is the vision-based approach, which involves using a video camera to capture hand gestures. This gesture-detection solution combines appearance information with a 3D hand model. Key gesture capture technology in a vision-based technique was developed in Ref. [13]. Body markers such as colourful gloves, wristbands, and LED lights were used in this study, as well as active light projection systems such as the Kinect (Microsoft Corporation, Redmond, WA, USA) and the Leap Motion Controller (LMC; Ultraleap Inc., San Francisco, CA, USA). The capture device may be a single camera, such as a smartphone camera, a webcam, or a video camera, or a stereo camera, which delivers rich information by combining several monocular cameras. The primary benefit of employing a camera is that it removes the need for sensors in sensory gloves, lowering the system's manufacturing costs. Cameras are fairly inexpensive, and most laptops employ a high-specification camera because of the blurring effect of a webcam [13]. A simplified representation of the camera vision-based method for extracting and detecting hand movements is shown in Figure 3.
In the literature, many SLR systems use traditional machine learning algorithms to classify the features of images to recognise SL gestures. These systems first use traditional image segmentation algorithms to segment hand shapes from sign language images or from the frames of sign language videos and then apply a machine learning approach (such as SVM, HMM, or the k-NN algorithm). Traditional machine learning algorithms have disadvantages related to handcrafted features, which have a limited representational capability. It is difficult to extract representative semantic information from complex material, and step-by-step gesture recognition performs poorly in real time. Other researchers have used deep neural networks to detect and recognise SL gestures. Deep neural network models such as CNNs, RNNs, GRUs, long short-term memory (LSTM) networks, and bidirectional long short-term memory (BiLSTM) networks are used to address the issue of frame dependency in sign movement. These models employ an object-detection neural network to learn the features of the video frame, allowing them to find the hand while also classifying the movements. Compared to traditional image processing and machine learning algorithms, deep neural network-based target detection networks frequently achieve a higher accuracy and recognition speed, as well as better real-time performance, and have become the mainstream method of dynamic target detection. The advantage of deep learning is its ability to automatically learn data representations directly from raw inputs. Deep learning models can autonomously extract features and patterns from complex datasets without the need for manual feature engineering [15].
SLR studies can also be divided into static sign language recognition and dynamic sign language recognition. The former performs gesture recognition by judging the hand posture, and it does not contain dynamic information. The latter contains hand movements and performs gesture recognition based on the video sequence, which is essentially a classification problem. Dynamic sign language recognition is much more difficult to implement than static sign language recognition, but it is more meaningful and valuable.
The following presents a review of SLR studies, including methods and datasets. In Ref. [16], a recognition system was utilised as a communication tool between those who are hearing-challenged and others who are not. This work describes the first automatic Arabic Sign Language (ArSL) recognition system using hidden Markov models (HMMs). A vast number of samples were utilised to identify 30 isolated terms from the standard Arabic Sign Language. The recognition accuracy of the system was between 90.6 and 98.1%. In Ref. [17], ArSL recognition was also based on the hidden Markov model (HMM). The authors collected a large dataset to detect 20 isolated phrases from genuine recordings of deaf persons with various clothing and skin tones, and they obtained a recognition rate of approximately 82.22%. In Ref. [18], the authors presented an ArSL recognition system. The scope of this study includes the identification of static and dynamic word gestures. This study provides an innovative approach for dealing with posture fluctuations in 3D object identification. This approach generates image features using a pulse-coupled neural network (PCNN) from two separate viewing angles. The proposed approach achieved a 96% recognition accuracy. Ref. [19] provided an automated visual SLR system that converted isolated Arabic word signs to text. The proposed system consisted of the following four basic stages: hand segmentation, tracking, feature extraction, and classification. A dataset of 30 isolated words used in the everyday school lives of hearing-challenged students was created to evaluate the proposed method, with 83% of the words having varied occlusion conditions. The experimental findings showed that the proposed system had a 97% identification rate in the signer-independent mode. Ref. [20] presented a framework for Arabic Sign Language recognition. A deep feature extractor was utilised to capture the fine details of Arabic Sign Language. A 3D convolutional neural network (CNN) was utilised to recognise 25 gestures from the Arabic Sign Language vocabulary. The recognition system obtained its data from depth maps captured using two cameras. The system obtained a 98% accuracy for the observed data, but for new data, the average accuracy was 85%. The results might be enhanced by including more data from various signers. In Ref. [21], a computational mechanism was described that allowed an intelligent translator to recognise the separate dynamic gestures of ArSL. The authors utilised a 100-sign ArSL vocabulary and 1500 video clips to represent these signs. These signs included static signs such as alphabets, numbers ranging from 1 to 10, and dynamic words. Experiments were carried out on the authors' own ArSL dataset, and the matching between ArSL and Arabic text was evaluated using the Euclidean distance. The test findings revealed that the proposed system can detect and translate single dynamic ArSL gestures with a 95.8% accuracy. In Ref. [4], the authors generated a video-based Arabic Sign Language dataset with 20 signs performed by 72 signers and suggested a deep learning architecture based on CNN and RNN models. The authors separated the data preprocessing into three stages. In the first stage, the dimensions of each frame were reduced to lower the overall complexity. In the second stage, every two consecutive frames were subtracted to determine the motion between them. Finally, in the third stage, the attributes of each class were merged to produce 30 frames, with each unified frame combining 3 frames. The goal of stage three was to decrease the duplication while not losing any information. The primary idea behind the proposed architecture was to train two distinct CNNs independently for feature extraction, then concatenate the output into a single vector and pass it to an RNN for classification. The proposed model scored 98% and 92% on the validation and testing subsets of the specified dataset, respectively. Furthermore, the authors attained promising top-one and top-five accuracies of 93.40% and 98.80%, respectively, on the UCF-101 dataset. The study in Ref. [22] provides a computer application for translating Iraqi Sign Language into Arabic (text). The translation process began with the capture of videos to create the dataset (41 words). The proposed system then employed a convolutional neural network (CNN) to categorise the sign language based on its attributes to infer the meaning of the signs. The part of the proposed system that translates the sign language into Arabic text had an accuracy rate of 99% for the sign words.
Research on Arabic Sign Language recognition lacks common datasets available for researchers. Despite the publication of two volumes of "A Unified Arabic Sign Language Dictionary" in 2008, researchers in this field continue to face a lack of large-scale datasets. As such, each researcher has needed to create a sufficiently large dataset to develop ArSL recognition systems. Therefore, this study endeavoured to create a comprehensive dataset that was explicitly tailored for Arabic Sign Language recognition. Subsequently, this dataset serves as the foundation for the development of an accurate Arabic Sign Language recognition system capable of recognising the dynamic gestures inherent in ArSL.

Materials and Methods
The suggested system for recognising dynamic hand gestures uses extracted keypoints. It is a neural network model that is constructed for learning from one sequence to another. Figure 4 depicts the primary phases of the proposed framework for recognising the dynamic gestures of Arabic Sign Language. The model architecture incorporates both long short-term memory (LSTM) units and an attention mechanism. The model receives a series of keypoints extracted from recorded gestures, which indicate the spatial configuration of the hands in each frame. The LSTM layer is responsible for processing the input sequence, identifying the temporal dependencies, and encoding the sequential information in its output sequence. The LSTM output sequence is then refined by an attention layer that allows the model to focus on different parts of the input sequence based on how relevant they are to the task at hand. The incorporation of this attention mechanism enhances the ability of the model to recognise significant temporal patterns and spatial configurations within the sequences of gestures. Ultimately, the output layer generates a probability distribution over the potential classes of hand gestures, enabling the model to categorise the input sequences into predetermined gesture categories.
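As an illustration of how an attention layer can weight the LSTM output sequence, the following is a minimal pure-Python sketch. It assumes an additive scoring function (a tanh projection followed by a dot product), which the text does not specify; the names `attention_pool`, `W`, `b`, and `v` are hypothetical and are not taken from the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, W, b, v):
    """Collapse a sequence of LSTM outputs into one context vector.

    hidden_states: T vectors of size H (one per frame)
    W: A x H projection matrix, b: bias of size A, v: scoring vector of size A
    (assumed additive attention; A is the number of attention units)
    """
    scores = []
    for h in hidden_states:
        # Project the hidden state into the attention space: tanh(W h + b)
        projected = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        # One scalar relevance score per time step
        scores.append(sum(vi * pi for vi, pi in zip(v, projected)))

    weights = softmax(scores)  # attention weights, one per frame, summing to 1

    # Context vector: attention-weighted sum of the hidden states
    hidden_size = len(hidden_states[0])
    context = [
        sum(a * h[j] for a, h in zip(weights, hidden_states))
        for j in range(hidden_size)
    ]
    return context, weights
```

With 30 frames and 64 LSTM units, `attention_pool` returns a 64-value context vector and 30 attention weights; the context vector is what an output layer would then map to class probabilities.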
Anaconda Navigator (Anaconda3) and the free Jupyter Notebook (Version 6.4.3) environment were used to create the framework software package for the selected models. The MediaPipe Holistic model was put into action, along with some drawing functions, using two libraries: the Open-Source Computer Vision Library (OpenCV) Version 4.5.3, a specialised image and video processing library that enables a wide range of tasks, including image analysis, facial recognition, and the identification of sign language gestures, and the MediaPipe library, which extracts information from multimedia and is the main tool for motion tracking and video analysis. A dataset was recorded in which each volunteer performed all of the gestures, recording 30 videos of 30 frames each. The next stage was the conversion of the frames from the BGR to the RGB colour space, because MediaPipe expects RGB input whereas OpenCV uses BGR. To apply the model to each frame and extract the keypoint values, we created subfolders under a main folder, with a separate folder for each class and for each video within that class; these data were used to train the model to classify the classes. The dataset was collected and recorded using a webcam and analysed using the MediaPipe model. The volunteers had to follow the recording criteria, which are described later, while performing the gestures. The keypoint values detected by the MediaPipe Holistic model were extracted and stored for training. Then, we started the preprocessing phase, which involved labelling each class. Each label was converted from its class name into a one-hot binary representation. For example, with 50 classes (0-49), Class 0 becomes [1, 0, 0, ..., 0] and Class 1 becomes [0, 1, 0, ..., 0], each vector having 50 entries. A sequential neural network model comprising LSTM layers and fully connected layers was constructed for the classification. Training used the "Adam" algorithm to optimise the weight parameters, while the "categorical_crossentropy" function was employed to compute the loss during training. The term "categorical accuracy" refers to the correctness of the categorisation and served as a metric for evaluating the model's performance. The subsequent step involved saving the model so that it could later be reloaded to make predictions or to continue training. The last phase involved evaluating the model using the confusion matrix, accuracy, and classification report.
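The labelling and storage steps described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `one_hot` and `frame_path`, the `.npy` file extension, and the exact folder naming are hypothetical; the text only states that each class and each video has its own folder.

```python
from pathlib import Path

NUM_CLASSES = 50        # gestures in the dataset (class indices 0-49)
VIDEOS_PER_CLASS = 30   # recordings of each gesture per volunteer
FRAMES_PER_VIDEO = 30   # frames kept from each recording


def one_hot(class_index, num_classes=NUM_CLASSES):
    """Return a list of length num_classes with a single 1 at class_index."""
    if not 0 <= class_index < num_classes:
        raise ValueError(f"class index {class_index} is outside 0-{num_classes - 1}")
    vector = [0] * num_classes
    vector[class_index] = 1
    return vector


def frame_path(root, class_index, video_index, frame_index):
    """Assumed layout: <root>/<class>/<video>/<frame>.npy (hypothetical naming)."""
    return Path(root) / str(class_index) / str(video_index) / f"{frame_index}.npy"
```

For instance, `one_hot(1)` yields a 50-element vector whose second entry is 1, and `frame_path("data", 3, 7, 0)` points at the first frame of the eighth video of class 3 under the assumed layout.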

Dataset
In recent years, there has been tremendous development in the field of deep learning algorithms in artificial intelligence (AI). The success of AI applications depends on the quality and quantity of training and testing data. To improve AI systems, vast datasets must be collected and used. As far as we are aware, there is a lack of sufficient datasets for dynamic signs in Arabic Sign Language, which impedes the progress of recognition systems. Thus, it is crucial to create a large-scale dataset for dynamic signs in Arabic Sign Language. Accordingly, we created the DArSL50 dataset with a wide range of Arabic Sign Language dynamic gestures. The DArSL50 dataset comprises 50 Arabic gestures representing 44 words and 6 digits. Each gesture was recorded by five participants. We selected signs from two dictionaries, whose titles translate as "Sign Language Dictionary for Deaf Children" and "The Arabic Sign Language Dictionary for the Deaf".
To collect the dataset, a series of processes were carried out. Initially, a collaboration was formed with the Deaf Centre, ensuring access to resources and specialised knowledge in Arabic Sign Language. The two dictionaries were examined to understand the signs. This study focused on 50 frequently used words and numbers, with a particular emphasis on those that may be expressed using only the right hand for the sake of simplicity. A group of volunteers was enlisted to imitate the signs, with each sign being replicated 30 times to capture variations. Data collection involved recording videos using a laptop camera, while OpenCV was used to process the video clips by extracting important characteristics and preparing the data for additional analysis. This meticulous approach resulted in the creation of a complete and representative dataset for the study of ArSL signs. Volunteers of diverse demographics participated without limitations, ensuring inclusivity and diversity within the dataset. During recording, it was important to guarantee that the volunteer's body and all of their movements fit within the camera frame. A consistent and unchanging background setting should be ensured, with a particular emphasis on capturing the volunteers' hands and faces. A robust camera tripod was used to generate crisp and dependable video recordings. In addition, it is advisable to establish the duration and frame count of the clip before recording, and to strive for a resolution of 640 × 480 or greater to achieve the best possible quality.
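The dataset dimensions stated in this paper (50 gestures, five volunteers, 30 repetitions per sign, 30 frames per video, and 1662 values per frame) can be checked with a short calculation. The function name `dataset_shape` is hypothetical, and the sketch assumes every volunteer recorded every gesture the same number of times.

```python
GESTURES = 50            # 44 words and 6 digits
VOLUNTEERS = 5           # participants who recorded every gesture
REPETITIONS = 30         # recordings of each sign per volunteer
FRAMES_PER_VIDEO = 30    # frames kept from each recording
VALUES_PER_FRAME = 1662  # keypoint values extracted per frame


def dataset_shape(volunteers=VOLUNTEERS):
    """Return (samples, frames, values) for the given number of volunteers."""
    samples = GESTURES * volunteers * REPETITIONS
    return (samples, FRAMES_PER_VIDEO, VALUES_PER_FRAME)
```

Under these assumptions, `dataset_shape()` gives 7500 sequences for the combined data, matching the reported number of video samples, and `dataset_shape(1)` gives 1500 sequences for a single volunteer.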

Feature Extraction Using MediaPipe
Google created MediaPipe, an open-source framework that allows developers to build multimodal (video, audio, and time-series data) cross-platform applied ML pipelines. MediaPipe contains a wide range of human body identification and tracking algorithms that were trained using Google's massive and diverse dataset. These algorithms track keypoints on different parts of the body, represented as a skeleton of nodes and edges, or landmarks. All of the coordinate points are normalised in three dimensions. The models, built by Google developers using TensorFlow Lite, are organised as graphs, which makes the flow of information easy to adapt and modify [23]. Sign language is based on hand gestures and pose estimation, yet the recognition of dynamic gestures and faces presents several challenges as a result of the continual movement. These challenges involve recognising the hands and establishing their form and orientation. MediaPipe was used to address these issues. It extracts the keypoints in the three dimensions of X, Y, and Z for both hands and estimates the pose for each frame. The pose estimation approach was used to predict and track the position of the hand relative to the body. The output of the MediaPipe architecture was a list of keypoints for the hand and pose estimation.
MediaPipe extracted 21 keypoints for each hand [24], as shown in Figure 6. The keypoints were determined in three dimensions, X, Y, and Z, for each hand. Therefore, the number of extracted keypoints for the hands is determined as follows [25]: keypoints in hand × three dimensions × no. of hands = (21 × 3 × 2) = 126 keypoints.
For the pose estimation, MediaPipe extracted 33 keypoints [26], as shown in Figure 7. They were calculated in three dimensions (X, Y, and Z), in addition to the visibility. The visibility value indicates whether a point is visible or concealed (occluded by another body component) in a frame. Thus, the total number of keypoints extracted from the pose estimate is computed as follows [27]: keypoints in pose × (three dimensions + visibility) = (33 × (3 + 1)) = 132 keypoints.
For the face, MediaPipe extracted 468 keypoints [28], as shown in Figure 8. Lines linking landmarks define the contours around the face, eyes, lips, and brows, while dots symbolise the 468 landmarks. They were computed in three dimensions (X, Y, and Z). Thus, the number of retrieved keypoints from the face is computed as follows: keypoints in face × three dimensions = (468 × 3) = 1404 keypoints.
The total number of keypoints for each frame was determined by summing the number of keypoints in the hands, the pose, and the face. This calculation resulted in a total of 126 + 132 + 1404 = 1662 keypoints. Figure 9 displays the keypoints retrieved from a sample of frames.
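The per-frame count above can be reproduced with a small sketch that flattens the landmarks into a single feature vector. It uses a stand-in `Landmark` tuple instead of the MediaPipe result objects so that it runs without the library. Zero-filling a body part that was not detected and the concatenation order (pose, face, left hand, right hand) are assumptions, not details given in the text, and `flatten` and `frame_vector` are hypothetical names.

```python
from collections import namedtuple

# Stand-in for a detected landmark; not the MediaPipe class itself
Landmark = namedtuple("Landmark", ["x", "y", "z", "visibility"])

HAND_POINTS, POSE_POINTS, FACE_POINTS = 21, 33, 468
HAND_VALUES = HAND_POINTS * 3        # 63 per hand, 126 for both hands
POSE_VALUES = POSE_POINTS * (3 + 1)  # 132, including visibility
FACE_VALUES = FACE_POINTS * 3        # 1404
FRAME_VALUES = 2 * HAND_VALUES + POSE_VALUES + FACE_VALUES  # 1662


def flatten(landmarks, expected_points, with_visibility=False):
    """Flatten landmarks to a list of floats; zero-fill if nothing was detected."""
    width = 4 if with_visibility else 3
    if landmarks is None:
        # Assumption: a missing body part is represented by zeros
        return [0.0] * (expected_points * width)
    values = []
    for lm in landmarks:
        values.extend([lm.x, lm.y, lm.z])
        if with_visibility:
            values.append(lm.visibility)
    return values


def frame_vector(pose, face, left_hand, right_hand):
    """Concatenate pose, face, and both hands into one per-frame vector."""
    return (
        flatten(pose, POSE_POINTS, with_visibility=True)
        + flatten(face, FACE_POINTS)
        + flatten(left_hand, HAND_POINTS)
        + flatten(right_hand, HAND_POINTS)
    )
```

Zero-filling keeps every frame at the same length of 1662 values even when a hand leaves the camera view, which a fixed-size model input requires.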

Model
To process the dynamic gestures, the data were represented as a series of frames, with each frame containing a collection of values representing the features of the hand posture in that frame. A recurrent neural network, specifically long short-term memory (LSTM), was used to process the resulting set of frames. LSTM is a well-known tool for encoding time series by extracting latent sign language expressions [29]. The model used in this study combines LSTM units with an attention mechanism. Its structure comprises three primary layers: an LSTM layer, an attention mechanism layer, and an output layer. The LSTM layer consists of 64 units and contributes the most parameters to the model because of its recurrent nature and the parameters associated with each unit. The attention mechanism layer introduces a limited number of parameters, consisting of 10 units that govern the attention weights. The output layer, which is responsible for predicting the hand gesture classes, has a number of parameters determined by the size of the context vector generated by the attention mechanism and the number of classes to be predicted. In total, the model consists of 89,771 parameters, with the LSTM layer accounting for the largest proportion. This architecture was designed to handle sequential data efficiently, exploit temporal relationships, and dynamically prioritise essential sections of the input sequence, ultimately facilitating precise hand motion detection.

The choice of suitable parameters was pivotal for building these layers. Table 1 displays the model parameters used. When the model is used, the parameters of each layer can be modified by picking values from Table 1 in preparation for the training phase. The choice of 64 hidden units and the specific activation function (ReLU) was based on preliminary experiments and established practices in similar research domains. An LSTM with 64 hidden units was used to balance the model complexity and computational performance: we wanted a model that could learn complex data patterns without overfitting, which may occur with large networks. Experiments showed that 10 attention units offered enough attentional focus without too much of a processing burden. We used the softmax activation function because it is common for classification tasks, especially multiclass problems. The LSTM model was trained for a total of 40 epochs, with early stopping based on the validation loss to prevent overfitting.

The model's inputs are defined by the sequence length and the total number of keypoints. The sequence length is the number of frames contained in each clip, and the total number of keypoint values per frame was 1662. At this point, the model is ready to accept the dataset and begin the training phase using the collected keypoint sequences. In this way, the sign movement is analysed and a hand gesture label is assigned, so that the signs in DArSL-50 can be accurately recognised.
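To make the role of the attention layer concrete, the following is a minimal NumPy sketch of attention pooling over the per-frame LSTM outputs. It uses one common additive-attention formulation as an assumption; the article does not give the exact attention equations, so this should not be read as the authors' implementation. The function name `attention_pool`, the weight names `W`, `b`, and `v`, and the clip length `T = 30` are hypothetical, while the 64 hidden units and 10 attention units follow the text.

```python
import numpy as np

def attention_pool(hidden, W, b, v):
    """Collapse LSTM outputs of shape (T, H) into one context vector of shape (H,).

    Assumed additive attention: score_t = v . tanh(h_t W + b).
    """
    scores = np.tanh(hidden @ W + b) @ v      # (T,) one score per frame
    scores = scores - scores.max()            # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()         # softmax over the time axis
    context = weights @ hidden                # (H,) weighted sum of frame states
    return context, weights

rng = np.random.default_rng(0)
T, H, A = 30, 64, 10   # T is hypothetical; H and A follow the text
hidden = rng.standard_normal((T, H))          # stand-in for LSTM outputs
W = rng.standard_normal((H, A)) * 0.1
b = np.zeros(A)
v = rng.standard_normal(A)

context, weights = attention_pool(hidden, W, b, v)
print(context.shape, weights.shape)
```

The context vector has the same size as one LSTM hidden state, which is why the output layer's parameter count depends only on that size and on the number of classes.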

Experiments
This research collected data from five participants and evaluated two separate scenarios. In the first scenario, the model was built using the data from each volunteer separately. In the second scenario, the data gathered from the volunteers were combined, and then the proposed model was applied.

In Scenario 1, the dataset comprised data from five volunteers, with each volunteer contributing 1500 data points. For the training set, 1125 data points were selected, representing 75% of the total data, ensuring a comprehensive representation of the variability within the dataset. The remaining 375 data points, representing 25% of the total data, were allocated to the testing set. This subset was reserved for evaluating the performance and generalisability of the trained models, as shown in Table 2.

In Scenario 2, four datasets were generated by combining the volunteer data. Data-I was composed of data collected from two volunteers, resulting in 3000 data points. Data-II, Data-III, and Data-IV were formed by merging the data from three, four, and five volunteers, resulting in dataset sizes of 4500, 6000, and 7500 data points, respectively. To evaluate the proposed model, each dataset was partitioned into training and testing sets using a 75-25 split. As a result, the training set consisted of 3375, 4500, and 5625 data points, while the testing set contained 1125, 1500, and 1875 data points for the datasets with three, four, and five volunteers, respectively, as shown in Table 3.

The objective of combining data from several individuals was to improve the reliability and applicability of the trained models across a wide variety of signers and signing styles. Training on data from several individuals exposes the models to variations in gestures and signing styles, which is intended to improve performance in real-world applications. This training and testing procedure allowed for a thorough assessment and validation of the models across different settings and populations.
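The split sizes quoted for both scenarios follow directly from the 75-25 ratio and the 1500 samples contributed by each volunteer. The short sketch below reproduces that arithmetic; the helper `split_sizes` is a hypothetical name used only for illustration.

```python
SAMPLES_PER_VOLUNTEER = 1500   # as stated in the text
TRAIN_FRACTION = 0.75          # 75-25 train/test split

def split_sizes(n_volunteers):
    """Return (total, train, test) sample counts for a merged dataset."""
    total = n_volunteers * SAMPLES_PER_VOLUNTEER
    train = int(total * TRAIN_FRACTION)
    return total, train, total - train

# One volunteer corresponds to Scenario 1; two to five volunteers to Scenario 2.
for n in (1, 2, 3, 4, 5):
    print(n, split_sizes(n))
```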

Evaluation Metrics
Evaluation metrics, such as the accuracy, precision, recall, and F1 score, are commonly used to evaluate the performance of classification models. These metrics provide crucial information about how well the model is doing and where it may require improvement.

Accuracy is the simplest and most commonly used metric for classification. It represents the ratio of the number of correctly classified predictions to the total number of predictions. A high level of accuracy indicates that the model is making correct predictions overall. The accuracy was calculated using Equation (1), as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision measures the proportion of true positive predictions among all positive predictions. A high precision indicates that, when the model predicts a positive class, it is likely to be correct. The precision is calculated using Equation (2), as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall measures the proportion of true positive predictions among all actual positive instances. A high recall indicates that the model can identify most of the positive instances. The recall is calculated using Equation (3), as follows:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

The F1 score is the harmonic mean of the precision and recall, providing a balanced measure between the two metrics. Because it considers both the precision and recall, it is suitable for imbalanced datasets where one class dominates. The F1 score is calculated using Equation (4), as follows:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

where true positives (TP) are the positive class samples correctly classified by the model, true negatives (TN) are the negative class samples correctly classified by the model, false positives (FP) are the negative class samples incorrectly predicted to be of the positive class, and false negatives (FN) are the positive class samples incorrectly predicted to be of the negative class.

The classification report provides the precision, recall, and F1 score for each class, as well as the overall metrics. These measures were used to determine how accurately and consistently the trained models recognised Arabic Sign Language gestures on the testing datasets.
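The four metrics can be computed directly from the confusion counts of a single class treated one-vs-rest. The sketch below is a plain-Python illustration of the standard definitions; the function names and the example counts are hypothetical and do not come from the reported results.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical one-vs-rest counts for a single sign class:
# few false alarms (high precision) but several missed instances (lower recall).
tp, tn, fp, fn = 24, 1837, 2, 12

acc = accuracy(tp, tn, fp, fn)
prec, rec, f1 = precision_recall_f1(tp, fp, fn)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

A class with this pattern of counts has a precision well above its recall, which is the situation a per-class report is meant to expose.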

Results
The experiments were carried out on a PC with an Intel(R) Core(TM) i7-10750H CPU operating at a base frequency of 2.60 GHz, with 12 logical cores and 16,384 MB of RAM. The framework was developed using the Python programming language. The source code for this study may be accessed at the following URL: https://drive.google.com/file/d/1FcXudNQqXb_IzehsdMWb0tSBplcq-8LJ/view?usp=sharing (accessed on 10 June 2024). The dataset was recorded by five volunteers and covers a total of 50 distinct classes. Every participant recorded all 50 classes, and the results were evaluated on the resulting DArSL50 dataset. The DArSL50 dataset was divided randomly, with 75% used for training and 25% used for testing. The performance criteria, namely the accuracy, precision, recall, and F1 score, were assessed under different scenarios to evaluate the proposed system.

In the first scenario, we evaluated the classification model on the recordings of each of the five participants, each of whom provided 1500 data points. A training set was created from 1125 data points (75% of the total), and a testing set was created from 375 data points (25% of the total). Table 4 presents the performance metrics obtained for each volunteer in Scenario 1. The third volunteer achieved the highest accuracy, approximately 85%, while the first volunteer achieved the lowest accuracy, approximately 82%. Nevertheless, the accuracy was very similar across all volunteers, indicating that the model discriminates effectively for each individual. The results of Scenario 1 provide valuable insights into the model's efficacy in categorising hand movements using the given dataset. By evaluating the accuracy, precision, recall, and F1 score, we can assess the model's ability to generalise across various volunteers and accurately recognise gestures. The model's high accuracy, precision, recall, and F1 score demonstrate its effectiveness in recognising hand gestures from varied recordings, which suggests that the model is resilient and generalisable across multiple volunteers and signing styles.

Table 5 shows the findings of Scenario 2, which included experiments to recognise dynamic hand gestures for four datasets, each a combination of volunteer data. The highest accuracy, 83%, was achieved by Data-I, which represents the combined data of two participants, whereas Data-III and Data-IV achieved the lowest accuracy, approximately 80%. The accuracy of the four experiments thus ranged from 80% to 83%, and the precision and recall values were close to the accuracy. The F1 score, a metric that combines precision and recall using the harmonic mean, provides a well-balanced evaluation of the models' overall performance, with scores ranging from 0.80 to 0.82. Table 5 also shows that the best accuracy achieved with merged data is very close to the accuracy obtained when all five volunteers were merged, which suggests that the system discriminates well and remains robust as the number of signers and the size of the dataset increase. Overall, the models had good precision and recall scores, indicating that they could make accurate predictions and successfully detect positive instances. These results show that the trained models are effective at recognising Arabic Sign Language.

For Data-IV, Table 6 shows the performance metrics (precision, recall, and the F1 score) for recognising 50 different types of ArSL gestures. Every row represents a particular class, and the metrics indicate the model's performance in accurately differentiating between gestures of that class. Table 6 thus presents a comprehensive per-class analysis of the classification report. Some classes demonstrate exceptional performance, as seen by their high precision, recall, and F1 scores. For instance, the classes "Takes a shower", "Our", "The grandfather", and "Understand" exhibit high scores in all measures, indicating that the model accurately recognises these actions. However, some classes exhibit disparities between the performance indicators. For example, the "Blind" class exhibits relatively high precision but lower recall and F1 scores, suggesting that the model's predictions of this gesture are usually correct but that it fails to detect some actual occurrences.
Classes such as "Common cold", "Measles", and "Stupid" are recognised consistently well across all metrics, indicating robust recognition of these gestures. Conversely, classes such as "North direction", "East direction", and "To grow" show uneven performance metrics, with higher precision but lower recall values. This suggests that the model might have difficulty in identifying all occurrences of these gestures. Based on the classification report results, we found that classes 11, 12, 13, and 14 (equivalent to classes 23, 24, 25, and 26, respectively) performed relatively poorly compared to the other classes. This is due to the nature of the movement in these classes, where the distinction between individual movements may be unclear. For example, the movement could be a slight hand gesture with no substantial variations in motion, or the difference between one movement and another may not be obvious enough, making classification more difficult for these classes. High values of accuracy, precision, recall, and the F1 score indicate successful model performance, while lower values may signify areas for improvement in the model's predictive capabilities.
To evaluate the system performance in real-time sign language detection, the recognition error rate was measured first. Algorithm 1 presents the approach used to measure the system performance metrics. Each sign was tested individually with five participants, and 40 iterations were applied to each sign to determine how often it was recognised. Consequently, the performance of the proposed system can be assessed by calculating the recognition accuracy of each gesture, followed by the total accuracy of the entire system, as shown in Algorithm 1. Errors in the results may be categorised as either "misclassification" (incorrect recognition) or "gesture not recognised" (no detection). The accuracy and the two error rates are determined from these counts.

The real-time results are summarised in Table 7, which shows the accuracy, the error of incorrect recognition, and the error of non-detection for each sign. The system achieved an overall real-time accuracy of 83.5%. Performance was generally high for many gestures, such as "Goodbye" and "Cleaning teeth", which reached 100% and 98% accuracy, respectively, with minimal errors. This reflects the model's effectiveness in recognising gestures with distinct patterns. Conversely, gestures such as "Smell" and "Blind" achieved only moderate accuracy, with substantial non-detection errors (20% and 16%, respectively), highlighting areas for improvement. Numeric gestures, particularly "11" (Eleven) and "12" (Twelve), showed lower accuracy and higher error rates, suggesting challenges in distinguishing similar visual patterns. Figure 10 shows examples of complex signs that achieved low accuracy due to similarity problems.
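The per-sign real-time figures can be illustrated with a small sketch. It assumes that each rate is the corresponding count divided by the number of trials for that sign and that the overall accuracy pools the counts of all signs; the original equations and Algorithm 1 are not reproduced here, so both the formulas and the counts below are assumptions for illustration only.

```python
def realtime_rates(correct, misclassified, not_detected):
    """Per-sign percentages, assuming rate = count / trials * 100."""
    trials = correct + misclassified + not_detected
    return {
        "accuracy": 100 * correct / trials,
        "misclassification": 100 * misclassified / trials,
        "not_detected": 100 * not_detected / trials,
    }

# Hypothetical outcome counts for two signs over 40 trials each.
counts = {
    "sign_a": (40, 0, 0),
    "sign_b": (32, 2, 6),
}

per_sign = {name: realtime_rates(*c) for name, c in counts.items()}

# Assumed overall accuracy: total correct recognitions over total trials.
overall = 100 * sum(c[0] for c in counts.values()) / sum(sum(c) for c in counts.values())

print(per_sign["sign_b"], overall)
```

Under these assumptions the three rates for a sign always sum to 100%, which gives a quick consistency check on a table of real-time results.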

Discussion
The evaluation of the model performance through the comparison of macro- and weighted averages offers useful insights into how the distribution of classes affects the classification accuracy. While the macro-average is a simple, unweighted mean over all classes, the weighted average takes class imbalance into account by weighting each class by its number of instances. Our investigation revealed that both types of averages showed similar patterns across the different scenarios, indicating a consistent influence of the class distribution on the model results.

Analysing the outcomes of each scenario clarifies the connection between the model performance, volunteer contributions, and dataset size. The best accuracy and F1 score were obtained in Scenario 1, when each volunteer provided 1500 data points, demonstrating the strength of the individual volunteer datasets. We observed a modest decline in the accuracy and F1 score in Scenario 2, as the dataset size rose with the merged data from several participants. This trend suggests that, although larger datasets may have advantages, combining contributions from a variety of volunteers introduces additional variability that can impair the model performance.

Additional analysis of the classification report offers valuable information about the specific difficulties faced by the model in distinct categories. Classes 10, 11, 12, and 13 showed lower precision, recall, and F1 scores than the other classes, suggesting challenges in successfully recognising these gestures. This difference highlights the significance of analysing per-class metrics to discover areas where the model may need more refinement or training data augmentation to enhance its performance.
Several factors contribute to these classes' inferior performance. First, the movements within these classes may be inherently difficult to tell apart. For example, they may include subtle gestures or only minor differences between signs, making it difficult for the model to distinguish between them reliably. Furthermore, the classification model may have problems capturing the intricacies of these movements, particularly if they involve small fluctuations or sophisticated hand movements that are difficult to identify precisely. Moreover, the limited size and diversity of the dataset for these classes may have contributed to the poor performance. A larger and more diversified dataset would give the model a broader set of instances, improving its capacity to generalise and identify these complex movements.

To summarise, while the model's overall performance is acceptable, further modification and augmentation of the dataset, as well as of the model architecture, are required to enhance the classification accuracy for these hard classes. This highlights the need for ongoing research and development in the field of sign language recognition to address these issues while also improving the accessibility and effectiveness of sign language recognition technology. The observed influence of an increasing dataset size emphasises the need for data augmentation and the establishment of larger, more diverse datasets in sign language recognition research. One of the study's objectives was to create a comprehensive dataset exclusively for Arabic Sign Language recognition. By expanding the dataset, the model can be trained on a broader collection of instances, boosting its capacity to generalise and reliably identify sign language movements, especially in difficult categories. This is consistent with the overall goal of improving the accessibility and effectiveness of sign language recognition systems, ultimately leading to greater inclusivity for people with hearing impairments.
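The difference between the macro- and weighted averages discussed above can be seen in a short example. The sketch below uses hypothetical per-class F1 scores and class sizes; the function names are illustrative only.

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean of the per-class scores weighted by the number of samples per class."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

f1_scores = [0.9, 0.6, 0.75]          # hypothetical per-class F1 scores
balanced_support = [30, 30, 30]       # equal number of samples per class
imbalanced_support = [60, 10, 20]     # one class dominates

macro = macro_average(f1_scores)
weighted_balanced = weighted_average(f1_scores, balanced_support)
weighted_imbalanced = weighted_average(f1_scores, imbalanced_support)

print(round(macro, 4), round(weighted_balanced, 4), round(weighted_imbalanced, 4))
```

When every class has the same number of test samples, the two averages coincide, so similar macro- and weighted averages are what one would expect from a balanced dataset; they diverge only when some classes contribute more samples than others.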

A Comparison with Previous Studies
This study focused on the recognition of dynamic gestures performed with a single hand and captured using a single camera setup. The primary goal was to recognise isolated dynamic words and dynamic numbers expressed through sign language gestures. The data collection process involved recording sessions in which individuals performed these gestures in front of the camera, ensuring that the dataset captured a diverse range of hand movements and expressions while limiting the scope to dynamic gestures performed with one hand. Table 8 provides a comparison with prior studies that align with our objectives.

The dataset in the proposed work is also larger, at 7500 samples, compared to 7200 in Ref. [4], 390 in Ref. [23], and 4045 in Ref. [17]. A larger dataset contributes to better model generalisability and robustness, helping the model to perform well on diverse and unseen data. Moreover, the proposed framework handles 50 gestures, including both simple and complex signs, whereas the other studies focus primarily on simple signs (20 in Ref. [4], and 30 in Refs. [23] and [17]). This broader range of gestures, which includes words and numbers, demonstrates the versatility and applicability of the proposed model for more comprehensive sign language recognition tasks. The data used in the proposed framework are balanced, so that the model is trained on an equal representation of all gesture classes, reducing bias and improving the overall performance. In contrast, the datasets in Refs. [4,17,23] are not balanced, which could lead to skewed results favouring more frequent classes.

For data collection, the proposed framework uses recorded videos with keypoint extraction using MediaPipe, a state-of-the-art framework for extracting hand and body keypoints. This method captures more detailed motion data than the simpler approaches used in other studies, such as the smartphone videos in Ref. [4] and OpenPose version 1.4 in Ref. [17]. In terms of preprocessing, the proposed framework simplifies the process by not converting frames to greyscale, preserving more information from the original videos.
The MediaPipe feature extraction method used in the proposed framework is more advanced than methods such as adaptive thresholding, convolution layers, and the discrete cosine transform (DCT), which have been used in other studies. Although the proposed framework might not be as accurate as those used in other studies, it is a strong and flexible solution for sign language recognition because it can better handle complex gestures, has a larger and more balanced dataset, uses advanced data collection and preprocessing methods, and can be evaluated in real time.

Conclusions
In this study, we attempted to meet the pressing need for effective communication tools for the deaf community by developing a model that can recognise dynamic hand gestures from video recordings. This was accomplished by combining the attention mecha-

Figure 2. The main phases of recognising the SL gesture data using a sensor-based system [13].

Figure 4. The proposed sign language recognition framework for dynamic Arabic gestures.

Figure 5 displays a segment of the sign language database, which includes 50 dynamic signs from Arabic Sign Language (ArSL). Five volunteers recorded each sign, with each participant performing each sign 30 times. Hence, the total number of videos reached 7500 (50 signs × 5 volunteers × 30 repetitions). The data were captured using OpenCV's VideoCapture interface and then saved in NumPy format for further analysis.

Figure 5. Images from the ArSL Words and Numbers dataset, which includes the lexicon for sign language for children that are deaf and the Arabic Sign Language Dictionary.

Figure 7. A total of 33 keypoints for the pose.

Figure 8. A total of 468 keypoints for the face.

Figure 9. Keypoints that were extracted from a sample of frames.

Figure 10. The similarity between the signs in ArSL.

Table 2. Data size, training set, and test set for each volunteer.

Table 3. Data size, training set, and test set for Scenario 2.

Table 5. The proposed framework results for Scenario 2.

Table 6. Results for the scenarios with classification reports for each class of Scenario 5.

Table 7. The real-time performance results.

Table 8. Comparison with similar ArSL recognition systems.