Sensors
  • Article
  • Open Access

6 June 2024

Real-Time Arabic Sign Language Recognition Using a Hybrid Deep Learning Model

Department of Computer Science, College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning Technology and Image Sensing

Abstract

Sign language is an essential means of communication for individuals with hearing disabilities. However, there is a significant shortage of sign language interpreters for some languages, especially in Saudi Arabia. This shortage results in a large proportion of the hearing-impaired population being deprived of services, especially in public places. This paper aims to address this gap in accessibility by leveraging technology to develop systems capable of recognizing Arabic Sign Language (ArSL) using deep learning techniques. In this paper, we propose a hybrid model to capture the spatio-temporal aspects of sign language (i.e., letters and words). The hybrid model consists of a Convolutional Neural Network (CNN) classifier to extract spatial features from sign language data and a Long Short-Term Memory (LSTM) classifier to extract spatial and temporal characteristics to handle sequential data (i.e., hand movements). To demonstrate the feasibility of our proposed hybrid model, we created a dataset of 20 different ArSL words: 4000 images for 10 static gesture words and 500 videos for 10 dynamic gesture words. Our proposed hybrid model demonstrates promising performance, with the CNN and LSTM classifiers achieving accuracy rates of 94.40% and 82.70%, respectively. These results indicate that our approach can significantly enhance communication accessibility for the hearing-impaired community in Saudi Arabia. Thus, this paper represents a major step toward promoting inclusivity and improving the quality of life for the hearing impaired.

1. Introduction

Sign language is a communication method utilized by deaf individuals and encompasses a series of hand gestures and symbols [1]. It is also employed by hearing individuals to facilitate communication with the deaf community. Predominantly, sign language is concept-based, with each gesture or symbol representing a distinct idea or concept. It comprises four major manual components: (1) finger configuration, (2) hand movement, (3) hand orientation, and (4) hand location relative to the body [2,3]. Compared to other forms of gestural communication, sign language is the most structured: it has a large set of signs, each with a specific meaning [4], which contrasts with word-based communication systems. Nonetheless, certain words and names lack direct equivalents in sign language. To address this, the deaf community often resorts to using a hand (i.e., finger) alphabet to spell out such words, ensuring clarity and precision in communication. This approach highlights the adaptability and inclusivity of sign language as a communication tool [5].
The Kingdom of Saudi Arabia is home to a sizable deaf population of about 229,541, many of whom are not provided with appropriate care in public venues because of a lack of interpreters, as the General Authority for Statistics has shown in recent years [6]. According to the Center for Strategic and International Studies (CSIS) [7], in the state of California, the ratio of sign language interpreters to hearing-impaired individuals is approximately 1:46. This indicates a relatively high availability of interpreters for the deaf community. In contrast, Saudi Arabia presents a starkly different scenario, with the ratio being approximately 1:93,000. This vast disparity highlights significant differences in the availability of sign language interpretation services between the two regions, underscoring a potential area of concern in terms of accessibility and support for the hearing-impaired population in Saudi Arabia. Moreover, most current research utilizes either only letters/alphabets to translate Arabic Sign Language (ArSL) or sensors to facilitate this process [4,8,9]. Thus, there is a need to conduct further research on ArSL [9,10].
In addressing the shortage of sign language interpreters, the role of technology, particularly machine-based communication methods, becomes crucial in bridging this gap. Advances in machine learning have led to the development of automated sign language translation systems. These systems employ sophisticated algorithms and gesture recognition technologies to translate sign language into spoken and written language, and vice versa. Such technological solutions offer a promising path to alleviate interpreter scarcity, particularly in areas such as the Kingdom of Saudi Arabia, where the ratio of interpreters to hearing-impaired individuals is exceedingly low. Machine-based communication can provide real-time, on-demand translation services, making communication more accessible and inclusive for the deaf community.
Furthermore, contemporary technological approaches, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM), are instrumental in enhancing the efficacy of automated sign language translation systems. CNNs can process and interpret visual information, making them highly suitable for recognizing and analyzing the intricate hand gestures and facial expressions inherent in sign language [11,12,13,14]. LSTM networks, a form of recurrent neural network, excel in handling sequential data, thus effectively capturing the dynamic and temporal aspects of sign language [13,14,15,16]. Hence, in this paper, both CNNs and LSTMs were employed to develop the proposed model.
The novelty and main contribution of this paper focus on tackling a critical societal issue: the lack of sign language interpreters in the Kingdom of Saudi Arabia, which has left a significant hearing-impaired population without adequate support in public spaces. This research highlights the importance and versatility of Arabic Sign Language (ArSL) as a communication tool and leverages technology such as artificial intelligence to improve accessibility for the deaf community in Saudi Arabia. The key contributions of this paper include:
  • Identifying the Accessibility Gap: This research comprehensively analyzes the accessibility gap faced by the hearing-impaired population in Saudi Arabia compared to regions with more interpreters such as California, USA. By highlighting the stark contrast in interpreter ratios, this paper sheds light on a crucial issue that requires attention and action.
  • Leveraging Deep Learning: We propose a novel hybrid model to capture the spatio-temporal aspects of sign language. The hybrid model consists of a CNN classifier to extract spatial features from sign language data and an LSTM classifier to extract spatial and temporal characteristics to handle sequential data. This hybrid model facilitates the process of automatic sign language recognition in real time from either spoken or written language, addressing the shortage of qualified interpreters.
  • Building a Custom Annotated Dataset: We create a dataset of 20 commonly used ArSL words, comprising 4000 images for 10 static gesture words and 500 videos for 10 dynamic gesture words. The dataset can be used to train and evaluate ArSL recognition models.
  • Validation: We conduct a comprehensive evaluation of the hybrid model on our dataset and compare its performance on 20 different words (i.e., 10 static gesture words and 10 dynamic gesture words).
The remainder of this paper is organized as follows. The related works are covered in Section 2. Section 3 reviews the system architecture and the hybrid model. Section 4 describes the proposed model’s implementation, and Section 5 discusses the experimental results. Section 6 presents the concluding remarks and discusses future work.

3. System Architecture and the Hybrid Model

To overcome the lack of Arabic interpreters for people with hearing impairment, especially in Saudi Arabia, an architecture for sign language recognition is presented in this work that enables automatic real-time translation between sign language and spoken or written language. The architecture is shown in Figure 1 and consists of four layers: the data acquisition layer, the mobile network layer, the cloud layer, and the sign language recognition layer.
Figure 1. Real-time ArSL system architecture.

3.1. Architecture Layers

Each layer of the architecture is responsible for a set of tasks and interacts with the other layers. The architecture uses a streamlined pipeline design to make communication accessible and intuitive. The architecture layers are as follows:
1. Data Acquisition Layer: This layer gathers the sign language data that require translation using images collected from cameras (e.g., webcams, smartphone cameras, wearable gadget cameras, etc.) or videos collected from cameras, which are broken into frames. The images and frames are then sent through the mobile network layer and the sign language recognition layer for processing.
2. Mobile Network Layer: This layer connects the data acquisition layer and the sign language recognition layer. It is made up of several Wireless Access Points (WAPs), Base Transceiver Stations (BTSs), and satellites to enable communication. The information that is provided includes the ID of the camera and the images or video frames that contain the hand landmark or gesture sequence that needs to be translated. This layer makes communication easy and straightforward through a streamlined pipeline design.
3. Cloud Layer: For the other layers, this layer offers Infrastructure as a Service (IaaS) and Software as a Service (SaaS), making it possible for data to be stored and shared across layers via the Internet. It also gives the system security, scalability, and dependability. By employing a hybrid model for processing the deep learning models, including the training and learning phases, we use SaaS, a cloud computing model that does not require management of the underlying infrastructure. The management and storage of data, including training and recognition data for upcoming training, are achieved using the IaaS cloud computing architecture. With this computing approach, resources like servers, storage, and networks can grow as needed without needing to be managed.
4. Sign Language Recognition Layer: This layer utilizes the proposed hybrid model on each image or video frame that comes from the data acquisition layer, which contains a hand landmark or gesture sequence to be translated. The hybrid model consists of two deep learning models, namely Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM), each of which consists of several modules:
  • Image/Frame Pre-processing: This module employs the Google MediaPipe framework (https://developers.google.com/mediapipe/framework, accessed on 23 February 2024) to extract hand landmarks or gesture sequences from a set of images or frames. The images or frames are organized into separate directories, each representing a unique category. For every image or frame, the coordinates of detected hand landmarks or gesture sequences are captured and flattened into a list. These lists are then compiled into a 'data' list while corresponding category labels are added to the 'labels' list. The entire dataset, consisting of hand landmarks or gesture sequences and their associated labels, is stored in a pickle file (i.e., in the cloud layer). A minimal sketch of this pre-processing step, together with the data split described next, is shown after this list.
  • Data Sampling: This module is a crucial aspect of deep learning, particularly when dealing with data that require training and testing for their models. It randomly splits the data into 80% for training and 20% for testing. This division ensures that both models (i.e., CNN and LSTM in the hybrid model) learn patterns and relationships from a diverse range of examples during the training phase while also assessing their performance on unseen data during testing. The training set, comprising 80% of the data, is used to train the models, which allows them to learn from a large variety of hand gestures and their corresponding landmarks. On the other hand, the remaining 20% of the data reserved for testing serves as an independent validation set to evaluate both models’ performance and generalization ability on new, unseen examples.
  • CNN Training: The CNN sub-model of our hybrid model is designed to detect human hands by leveraging the Google MediaPipe framework to identify the hand’s 21 3D landmarks. These landmarks are crucial for understanding hand gestures and movements. To train the CNN sub-model, we utilize pre-processed data stored in a pickle file (i.e., in the cloud layer), where the hand landmarks and corresponding labels are organized. We then employ the Random Forest Classifier (RFC) to predict the output of these landmarks. The RFC is an ensemble learning algorithm that constructs multiple decision trees during training. It outputs the mode of the classes (i.e., classification) or the mean prediction (i.e., regression) of the individual trees. Each decision tree in the ensemble is trained on a random subset of the training data. During prediction, each tree contributes a decision, with the final output determined by a majority or averaging mechanism. This combined approach helps the CNN sub-model to accurately detect and interpret human hand gestures in real-time applications.
  • LSTM Training: The LSTM sub-model of our hybrid model uses an approach that focuses on capturing sequential patterns in time-series data, particularly in the context of hand gesture sequences. To train the LSTM sub-model, we utilize pre-processed data stored in a pickle file (i.e., in the cloud layer), where the features and corresponding labels of the hand gesture sequences are organized. The features are then organized into sequences of frames, with each frame representing a window of historical data of the hand gesture sequence. We utilize the LSTM architecture, a type of recurrent neural network (RNN) designed to effectively model long-term dependencies in sequential data. During training, the LSTM sub-model learns to capture intricate patterns and relationships within the hand gesture sequences. We employ techniques such as dropout regularization to prevent overfitting and optimize the model’s generalization ability.
  • Sign Language Recognition: This module is responsible for predicting the hand gestures based on the training information from both the CNN and LSTM models (i.e., hybrid model). Further details on the hybrid model are elaborated in Section 3.2.
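To make the pre-processing and data sampling modules more concrete, the following minimal sketch extracts the 21 hand landmarks with the MediaPipe Hands solution, flattens them into feature vectors, stores them in a pickle file, and performs the 80/20 split. The directory layout and file names are assumptions for illustration, not the authors' exact pipeline.

```python
import os, pickle
import cv2
import mediapipe as mp
from sklearn.model_selection import train_test_split

DATA_DIR = "dataset_images"   # assumed layout: one sub-directory per word/category
hands = mp.solutions.hands.Hands(static_image_mode=True, min_detection_confidence=0.5)

data, labels = [], []
for label in os.listdir(DATA_DIR):
    for img_name in os.listdir(os.path.join(DATA_DIR, label)):
        img = cv2.imread(os.path.join(DATA_DIR, label, img_name))
        results = hands.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        if not results.multi_hand_landmarks:
            continue                                    # skip frames with no detected hand
        lm = results.multi_hand_landmarks[0].landmark   # 21 3D landmarks
        data.append([c for p in lm for c in (p.x, p.y, p.z)])  # flatten to a 63-value list
        labels.append(label)

# Persist the pre-processed dataset (stored in the cloud layer in the full system)
with open("landmarks.pickle", "wb") as f:
    pickle.dump({"data": data, "labels": labels}, f)

# Data sampling module: random 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, shuffle=True, stratify=labels)
```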

3.2. Hybrid Model

Our proposed architecture is designed to gather the ArSL data that require translation using images or videos collected from cameras, displaying the corresponding text translation on screen in real time. We propose a hybrid model that consists of a CNN classifier to extract spatial features from ArSL data and an LSTM classifier to extract spatial and temporal characteristics to handle sequential data. To enable the user to use both images and videos, we created a function called “Sign Language Recognition Decision” to give our system the flexibility to use either images or videos. This function essentially chooses between the image processing sub-model (CNN) and the video processing sub-model (LSTM) depending on the input type. By integrating the strengths of CNNs and LSTMs, the system can leverage both spatial and temporal information simultaneously. The CNN extracts relevant visual features, and the LSTM processes these features in a sequential manner, capturing the dynamic aspects of sign language communication. This allows users to communicate naturally using gestures. More details on CNN and LSTM models are elaborated in the following subsection.
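The following sketch illustrates one way such a decision function could be structured; the cnn_predict and lstm_predict callables are hypothetical placeholders for the trained sub-models and are not part of the paper's code.

```python
import numpy as np

def sign_language_recognition_decision(inputs, cnn_predict, lstm_predict):
    """Route a single image to the CNN sub-model and a frame sequence to the LSTM sub-model.

    `inputs` is either one image of shape (H, W, C) or a clip of shape (T, H, W, C);
    the two predictor callables stand in for the trained sub-models.
    """
    arr = np.asarray(inputs)
    if arr.ndim == 3:        # single image -> static gesture word (CNN)
        return cnn_predict(arr)
    if arr.ndim == 4:        # stack of frames -> dynamic gesture word (LSTM)
        return lstm_predict(arr)
    raise ValueError("Expected an image (H, W, C) or a video clip (T, H, W, C)")
```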

3.2.1. Convolutional Neural Network (CNN)

The CNN sub-model of our hybrid model is designed to detect human hands by leveraging the Google MediaPipe framework to identify the hand’s 21 3D landmarks. These landmarks are crucial for understanding hand gestures and movements. For the proposed CNN sub-model, we utilized LeNet-5, a well-known CNN architecture frequently employed for image recognition tasks. This architecture comprises several layers: it begins with two convolutional layers, each followed by max-pooling layers, and concludes with three fully connected (linear) layers. As shown in Figure 2, the input to LeNet-5 is a grayscale image with dimensions of 32 × 32 × 1, where the “1” denotes a single channel for grayscale intensity. The first convolutional layer applies six filters of size 5 × 5 to the input image, producing an output shape of [−1, 6, 28, 28]. The subsequent max-pooling layers halve the spatial dimensions, resulting in feature maps of sizes [−1, 6, 14, 14] and [−1, 16, 5, 5], respectively. The final three linear layers have output shapes of [−1, 120], [−1, 84], and [−1, 10], with the last layer representing the output classes. LeNet-5 consists of 61,706 parameters in total, including weights and biases, distributed across the convolutional and linear layers.
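For illustration, a PyTorch sketch of the LeNet-5 layout described above is shown below; it is an assumed re-implementation rather than the authors' code, but its layer shapes and total of 61,706 parameters match the figures reported in the text.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.MaxPool2d(2),                   # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),   # -> 16x10x10
            nn.Tanh(),
            nn.MaxPool2d(2),                   # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),        # 10 output classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))   # 61706
print(model(torch.zeros(1, 1, 32, 32)).shape)       # torch.Size([1, 10])
```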
Figure 2. CNN model structure.
To train the CNN sub-model, we utilized pre-processed data stored in a pickle file. We then employed the Random Forest Classifier (RFC) to predict sign language word classes from these landmarks. Bootstrapping was used to sample a random subset of the data and features for training each tree in the forest. The final RFC prediction is an aggregation of the predictions produced by every individual decision tree. The prediction of a single tree τ for hand gesture γ taking on sign language word class σ is given in Equation (1):

$$P_\tau(\sigma \mid \gamma, \Omega, \pi) = \sum_{\lambda \in \Lambda} \pi_{\lambda\sigma}\, \mu_\lambda(\gamma, \Omega), \tag{1}$$

where Ω denotes the parameters of the decision functions δ, π denotes the class label distributions of all leaf nodes, Λ denotes the set of leaf nodes, π_{λσ} denotes the probability that the hand gesture belongs to sign language word class σ according to leaf node λ, and μ_λ(γ, Ω) denotes the probability of routing the hand gesture γ to leaf node λ. If we treat the probability of reaching a leaf node as a weight, this value can be interpreted as a weighted sum of the leaf-node class distributions. The decision function δ_ν for split node ν is defined in Equations (2) and (3):

$$\delta_\nu(\gamma; \Omega) = \beta\!\left(f_\nu(\gamma; \Omega)\right), \tag{2}$$

$$\beta(\gamma) = \left(1 + e^{-\gamma}\right)^{-1}, \qquad \text{where } f_\nu(\cdot\,; \Omega): \Gamma \rightarrow \mathbb{R}, \tag{3}$$

where β(γ) denotes the sigmoid function, whose output gives the probability of routing the hand gesture to the left or right sub-tree in the RFC, and f_ν(·; Ω) denotes a real-valued function parametrized by Ω.
The model's ability to learn hand landmarks is encoded by the parameter Ω. In this paper, Ω represents the parameters of a deep CNN, which are used to automatically learn an appropriate hand landmark representation from incoming images. Every function f_ν can be thought of as a linear output unit of a deep network. The final prediction of the CNN model, delivered by the forest F = {τ_1, τ_2, …, τ_x} for hand gesture γ, is calculated as shown in Equation (4):

$$P_F(\sigma \mid \gamma) = \frac{1}{x} \sum_{n=1}^{x} P_{\tau_n}(\sigma \mid \gamma), \tag{4}$$

where γ denotes the hand gesture, σ denotes the sign language word class, x denotes the number of trees in the forest, and P_{τ_n} represents the probability that the hand gesture belongs to sign language word class σ according to the n-th tree. The final prediction is the average of the predictions made by all trees in the forest.
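Equation (4) corresponds to the standard soft-voting rule used by random forest implementations. As a hedged illustration, scikit-learn's RandomForestClassifier computes predict_proba by averaging the per-tree class probabilities, which can be checked directly; the random landmark-like features below are placeholders only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: flattened hand-landmark vectors (e.g., 21 landmarks x 3 coordinates = 63 features)
# y: sign language word labels; random data here only to make the sketch runnable
rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 63)), rng.integers(0, 10, 200)

forest = RandomForestClassifier(n_estimators=100, bootstrap=True).fit(X, y)

# P_F(sigma | gamma): average of the individual trees' class probabilities, Eq. (4)
proba_manual = np.mean([tree.predict_proba(X[:1]) for tree in forest.estimators_], axis=0)
proba_sklearn = forest.predict_proba(X[:1])
print(np.allclose(proba_manual, proba_sklearn))   # expected: True
```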

3.2.2. Long Short-Term Memory (LSTM)

LSTM is a type of recurrent neural network (RNN) designed to effectively model long-term dependencies in sequential data [39,40,41]. For the proposed LSTM sub-model, we employ a stack structure, which involves arranging multiple LSTM layers in sequence, where the output of each LSTM layer serves as the input for the next. This deep architecture allows for capturing more intricate patterns in sequential data by utilizing both LSTM and dense layers. As illustrated in Figure 3, the first LSTM layer has 64 units, returns sequences, and uses the hyperbolic tangent function ( t a n h ) for activation. The second LSTM layer includes 128 units, returns sequences, and also uses t a n h for activation. The third LSTM layer, with 64 units, does not return sequences and performs t a n h activation. Following the LSTM layers are the dense layers: the first dense layer has 64 units and uses t a n h activation, the second dense layer has 32 units with t a n h activation, and the final dense layer has units equal to the number of actions, employing the Softmax function to constrain the outputs between 0 and 1, ensuring their sum is always 1.
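As an illustrative sketch (not the authors' code), the stacked structure described above can be written in Keras as follows; the 30-frame sequence length and 1662-dimensional per-frame feature vector are taken from the dataset description in Section 4.2.2, and the 10 output units correspond to the 10 dynamic gesture words.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_ACTIONS = 10              # one class per dynamic gesture word
SEQ_LEN, FEATURES = 30, 1662  # frames per sequence, features per frame (Section 4.2.2)

model = Sequential([
    LSTM(64, return_sequences=True, activation="tanh", input_shape=(SEQ_LEN, FEATURES)),
    LSTM(128, return_sequences=True, activation="tanh"),
    LSTM(64, return_sequences=False, activation="tanh"),
    Dense(64, activation="tanh"),
    Dense(32, activation="tanh"),
    Dense(NUM_ACTIONS, activation="softmax"),  # outputs between 0 and 1, summing to 1
])
model.summary()
```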
Figure 3. LSTM model structure.
During training, the LSTM sub-model learns to capture intricate patterns and relationships within the hand gesture sequences. In this paper, the LSTM model is employed to learn hand gesture sequences and to recognize the corresponding sign language translation, leveraging the Google MediaPipe framework to extract the input representations. The LSTM structure consists of several gates, namely the input gate, the forget gate, and the output gate, together with the cell state update. The input gate is described in Equation (5), as follows:
$$\alpha_t = \psi\!\left(\omega_\alpha \cdot [h_{t-1}, x_t] + b_\alpha\right), \tag{5}$$

where the input gate, denoted as α, is indexed by the time step t, ψ denotes the logistic sigmoid function, the previous hidden state h_{t-1} and the current input x_t are weighted by the matrix ω_α, and b_α denotes the bias of the input gate. The forget gate is described in Equation (6), as follows:

$$\varphi_t = \psi\!\left(\omega_\varphi \cdot [h_{t-1}, x_t] + b_\varphi\right), \tag{6}$$

where the forget gate, denoted as φ, is indexed by the time step t. As for the input gate, ψ denotes the logistic sigmoid function, the previous hidden state h_{t-1} and the current input x_t are weighted by the matrix ω_φ, and b_φ denotes the bias of the forget gate. As we can observe, the forget gate's mechanism is identical to that of the input gate, but it uses a completely different set of weights. The output gate is expressed in Equation (7), as follows:

$$\beta_t = \psi\!\left(\omega_\beta \cdot [h_{t-1}, x_t] + b_\beta\right), \tag{7}$$

where the output gate, denoted as β, is likewise indexed by the time step t. As for the input and forget gates, ψ denotes the logistic sigmoid function, the previous hidden state h_{t-1} and the current input x_t are weighted by the matrix ω_β, and b_β denotes the bias of the output gate. The cell state and hidden state updates are shown in Equations (8) and (9), as follows:

$$\varepsilon_t = \varphi_t \odot \varepsilon_{t-1} + \alpha_t \odot \tanh\!\left(\omega_\varepsilon \cdot [h_{t-1}, x_t] + b_\varepsilon\right), \tag{8}$$

$$h_t = \beta_t \odot \tanh(\varepsilon_t), \tag{9}$$

where tanh denotes the hyperbolic tangent function used to transform the values into a normalized range, and ⊙ represents the element-wise product of two vectors.
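To make the gate equations concrete, the following minimal NumPy sketch steps a single LSTM cell through a toy sequence; the dictionary keys (alpha, phi, beta, eps), sizes, and random weights are illustrative only and do not correspond to the trained sub-model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Equations (5)-(9).

    W and b hold the weight matrices and biases for the input gate (alpha),
    forget gate (phi), output gate (beta), and cell candidate (eps).
    Each weight matrix acts on the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    a_t = sigmoid(W["alpha"] @ z + b["alpha"])   # input gate, Eq. (5)
    f_t = sigmoid(W["phi"]   @ z + b["phi"])     # forget gate, Eq. (6)
    o_t = sigmoid(W["beta"]  @ z + b["beta"])    # output gate, Eq. (7)
    c_t = f_t * c_prev + a_t * np.tanh(W["eps"] @ z + b["eps"])  # cell state, Eq. (8)
    h_t = o_t * np.tanh(c_t)                     # hidden state, Eq. (9)
    return h_t, c_t

# Toy usage: hidden size 4, input size 3, a 5-frame sequence
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: rng.standard_normal((H, H + D)) * 0.1 for k in ("alpha", "phi", "beta", "eps")}
b = {k: np.zeros(H) for k in ("alpha", "phi", "beta", "eps")}
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)   # (4,)
```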

4. Implementation

For the implementation, we utilized Google Cloud Computing Services (Google Cloud) (https://cloud.google.com/, accessed on 28 February 2024). Within this environment, we deployed virtual machines running the Ubuntu Operating System version 20.04, sourced from the Ubuntu OS Cloud Marketplace (https://console.cloud.google.com/marketplace/product/ubuntu-os-cloud/ubuntu-focal, accessed on 28 February 2024). These virtual machines were configured with the Docker platform (https://cloud.google.com/compute/docs/containers, accessed on 28 February 2024) to execute the application container. In particular, we employed the standard c3d-Standard-4 configuration on Google Cloud, with each virtual machine featuring 4 VCPUs and 16 gigabytes (GB) of memory. Our system architecture consisted of three such virtual machines, each assigned distinct responsibilities for each layer in our architecture, including the data acquisition, sign language recognition, and cloud layers, as illustrated in Figure 1. Leveraging Google Cloud facilitated the setup and management of our system architecture, providing reliability, speed, and scalability. Additionally, Google Cloud offers a range of services and features tailored to our system architecture requirements and use cases. We also leveraged Google Cloud’s security and governance services to safeguard our data and applications against unauthorized access and potential threats. The configuration parameters and corresponding values chosen for setting up and managing our system architecture on Google Cloud are detailed in Table 1.
Table 1. List of configuration parameters and values.
Upon initiating the c3d-Standard-4 instance, we proceeded to install Docker version 4.28.0 on the virtual machine (https://docs.docker.com/desktop/release-notes/#4280, accessed on 29 February 2024). Subsequently, we established three containers utilizing Arch Linux via Docker. Each container served a distinct purpose within our system architecture: the first container managed the data acquisition layer, the second container handled the sign language recognition layer, and the third container oversaw the cloud layer functionality. The data acquisition layer was responsible for gathering images or videos from various sources such as webcams, smartphone cameras, and wearable gadget cameras. The sign language recognition layer processed these data to interpret hand gestures. These containers were equipped with a Python interpreter version 3.12.2 (https://www.python.org/downloads/release/python-3122/, accessed on 1 March 2024) and NodeJS version 20.11.1 (https://nodejs.org/en/blog/release/v20.11.1, accessed on 1 March 2024), enabling us to develop and execute code for both the data acquisition and sign language recognition layers. Python facilitated the creation of concise and comprehensible code capable of handling intricate tasks and data structures. Meanwhile, NodeJS permitted the utilization of JavaScript for both front-end and back-end development, streamlining our codebase and enhancing performance. The final container utilized a Docker Compose file comprising an image for MySQL to effectively manage and store the data (https://hub.docker.com/_/mysql, accessed on 29 February 2024).

4.1. Dataset Description

In this paper, we employed a hybrid model that leverages deep learning techniques to achieve high performance and accuracy in gesture recognition tasks. To accomplish this, we utilized a large-scale dataset consisting of images and videos captured using built-in webcams with a resolution of 480p. For the CNN model, we collected 400 images for each of the 10 targeted words. Each image has dimensions of 640 × 480 pixels, ensuring a detailed representation of hand gestures. Additionally, for the LSTM model, we obtained 50 videos for each word (i.e., 10 words), totaling 500 videos. These videos, recorded with the same 480p webcam, provide dynamic visual sequences for temporal analysis. We used the most universal and commonly used words for sign language, like hello, water, teacher, work, etc. The translations of these words were taken from the Handspeak website (https://www.handspeak.com/, accessed on 12 February 2024) and the Saudi Sign Language website (https://sshi.sa/, accessed on 12 February 2024). The Handspeak website is an online resource for sign language, and the Saudi Sign Language website is a platform affiliated with the Saudi Society for Hearing Impairment. The diverse nature and scale of our dataset enabled robust training and evaluation of the hybrid model for accurate gesture recognition tasks.

4.2. Data Labeling and Model Training

We divided the dataset into 80% for training and 20% for testing to validate the proposed hybrid model.

4.2.1. CNN Sub-Model

The dataset for training the CNN sub-model comprised images we captured, each depicting a specific hand gesture. We utilized MediaPipe to extract hand features from these images, focusing on 21 hand landmarks. These features were saved as .npy files in directories corresponding to each of the 10 distinct gesture labels in our dataset. Each label category contained 400 images, resulting in a balanced dataset of 4000 images. Of these, 3200 hand landmark images were manually labeled and categorized based on the selected 10 words (as shown in Table 2) to train the CNN sub-model of the proposed hybrid model. The remaining 800 hand landmark images were used to evaluate the CNN sub-model.
Table 2. CNN data labeling.
As detailed in Table 3, the CNN sub-model was trained using the following hyperparameters: a learning rate of 0.001 to ensure stable convergence, a batch size of 32 to balance memory usage and training speed, and a total of 50 epochs to allow for thorough training and convergence. We used the Adam optimizer, known for its efficiency and adaptive learning rates, and employed cross-entropy loss as the loss function, which is suitable for our classification task. To prevent overfitting and enhance the model’s generalization ability, we incorporated a dropout rate of 0.5. Dropout is a regularization technique that involves randomly ignoring selected neurons during training, which helps the model avoid relying too heavily on specific features or neurons and promotes robustness.
Table 3. CNN sub-model hyperparameter and loss function settings.
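A minimal PyTorch training-loop sketch using the Table 3 settings is given below; the tensors are placeholders, the LeNet5 class refers to the illustrative sketch in Section 3.2.1, and the placement of the dropout layer is an assumption since the paper reports only the rate.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters from Table 3
LR, BATCH_SIZE, EPOCHS, DROPOUT = 1e-3, 32, 50, 0.5

model = LeNet5(num_classes=10)                   # illustrative class from Section 3.2.1
model.classifier.insert(3, nn.Dropout(DROPOUT))  # dropout placement is an assumption

criterion = nn.CrossEntropyLoss()                # cross-entropy loss for classification
optimizer = optim.Adam(model.parameters(), lr=LR)

# Placeholder tensors standing in for the 3200 training images and their labels
X_train = torch.zeros(3200, 1, 32, 32)
y_train = torch.randint(0, 10, (3200,))
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```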

4.2.2. LSTM Sub-Model

The dataset comprised recorded sequences of actions stored in directories corresponding to each specific action. It included 10 actions, each represented by 50 videos. Each sequence consisted of 30 frames, with each frame represented by a 1662-dimensional feature vector. These files were loaded into memory, and the sequences and their labels were compiled into arrays. We captured the videos in this dataset; 400 of the collected hand gesture sequence videos were manually labeled and categorized according to 10 selected words (as shown in Table 4) to train the LSTM sub-model of the proposed hybrid model, which was trained for 50 epochs. Additionally, 100 hand gesture sequence videos were used to test the LSTM sub-model.
Table 4. LSTM data labeling.
Table 5 provides details of the LSTM sub-model’s training, including the parameters for the LSTM layers and dense layers. The first LSTM layer outputs a shape of (None, 30, 64) with 442,112 parameters, followed by a second LSTM layer with an output shape of (None, 30, 128) and 98,816 parameters. The third LSTM layer produces an output shape of (None, 64) with 49,408 parameters. This is followed by three dense layers: the first dense layer has an output shape of (None, 64) with 4,160 parameters, the second dense layer has an output shape of (None, 32) with 2,080 parameters, and the third dense layer has an output shape of (None, 10) with 330 parameters. In total, the model has 596,906 parameters.
Table 5. LSTM sub-model’s layers and parameter setup.
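The per-layer counts in Table 5 follow from the standard parameter formulas, 4 × ((input_dim + units) × units + units) for an LSTM layer and input_dim × units + units for a dense layer; the short check below reproduces the reported totals.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with a weight matrix over [h_{t-1}, x_t] plus a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

layers = [
    lstm_params(1662, 64),   # 442,112
    lstm_params(64, 128),    #  98,816
    lstm_params(128, 64),    #  49,408
    dense_params(64, 64),    #   4,160
    dense_params(64, 32),    #   2,080
    dense_params(32, 10),    #     330
]
print(layers, sum(layers))   # total: 596,906
```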

5. Experimental Results

To demonstrate our proposed hybrid model’s applicability, we conducted several experiments to evaluate the performance of each sub-model of our proposed hybrid model, including the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models.

5.1. Evaluation Metrics

In this section, we outline the evaluation criteria used to gauge the effectiveness of the hybrid model, which includes the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models. Specifically, we examine important metrics such as accuracy, precision, recall, F1 score, confusion matrix, and loss. These metrics are essential for gaining a comprehensive understanding of the hybrid model’s predictive abilities and the sub-models’ ability to generalize. Alongside the traditional performance metrics, assessing loss provides insights into the optimization process and the convergence of the sub-models. It is worth emphasizing that all metrics were assessed using a distinct validation dataset, separate from the training data, to ensure the robustness and reliability of the hybrid model performance assessment.
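For reference, a sketch of how these metrics are typically computed with scikit-learn on a held-out validation set is shown below; the label arrays are placeholders, and the weighted averaging choice is an assumption rather than a detail reported in the paper.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# y_true: ground-truth word labels of the held-out set; y_pred: model predictions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="weighted"))
print("recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1 score :", f1_score(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred))
```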

5.2. CNN Evaluation Metrics

Figure 4 illustrates the performance of the CNN sub-model on the 10 words listed in Table 2. For this experiment, we trained the CNN sub-model using 3200 hand landmark images and validated it using 800 images. The accuracy of the CNN sub-model was 94.40% (see Figure 4a). This can be attributed to the adequate amount of training data, which allowed the model to learn the hand gestures from the hand landmarks. The number of epochs was set to 50 to give the CNN sub-model sufficient iterations to learn the targeted hand gestures. Additionally, the CNN sub-model achieved a precision of 95.00% (see Figure 4b) and recall and F1 scores of 94.40% and 94.90%, respectively, as shown in Figure 4c,d. The accuracy, precision, recall, and F1 scores consistently fell within the range of 91.00% to 95.00%. This uniformity in performance across these metrics indicates balanced model behavior. The similarity in accuracy, precision, recall, and F1 score can be attributed to the balanced nature of the dataset, where each class is adequately represented, facilitating equitable performance evaluation across all classes.
Figure 4. CNN sub-model performance evaluation. (a) CNN sub-model accuracy; (b) CNN sub-model precision; (c) CNN sub-model recall; (d) CNN sub-model F1 score.
Figure 5 presents the confusion matrix derived from the validation results of the CNN sub-model. This matrix offers a comprehensive visual representation of the CNN sub-model’s classification performance, highlighting the distribution of true positive, true negative, false positive, and false negative predictions across the 10 classes (i.e., the 10 labels for the 10 words listed in Table 2). Examining the confusion matrix provides crucial insights into the CNN sub-model’s ability to accurately classify instances within each class. As we can see, the number of true positives, i.e., cases where the CNN sub-model predicted the targeted word correctly, fell within the range of 82 to 97. An interesting observation from the confusion matrix is that the CNN sub-model scored 89 and 92 true positives for labels 3 and 8, respectively, indicating that the CNN sub-model’s recognition performance was affected by differences in viewing angle. Furthermore, Figure 6 shows the loss evaluation for the training and validation of the CNN sub-model. The observed discrepancy in loss between the training and validation phases indicates that during the validation phase, the CNN sub-model became increasingly confident in interpreting hand gestures using hand landmarks as the number of epochs progressed.
Figure 5. CNN sub-model confusion matrix.
Figure 6. CNN sub-model validation loss.

5.3. LSTM Evaluation Metrics

Figure 7 illustrates the performance of the LSTM sub-model on the 10 words listed in Table 4. For this experiment, we trained the LSTM sub-model using 400 hand gesture sequence videos and validated it using 100 videos. As we can see, the accuracy of the LSTM sub-model was 82.70% (see Figure 7a). This can be attributed to the amount of training data, which allowed the model to learn the hand gesture sequences. For this experiment, we again used 50 epochs to allow the LSTM sub-model to learn the targeted hand gesture sequences. We can also observe that the LSTM sub-model reached a precision of 84.20% (see Figure 7b) and recall and F1 scores of 82.50% and 82.70%, respectively, as shown in Figure 7c,d. The accuracy, precision, recall, and F1 scores consistently fell within the range of 78.50% to 82.70%. The consistency in performance across these metrics suggests well-balanced model behavior. The similarity in accuracy, precision, recall, and F1 score can be attributed to the balanced composition of the dataset, ensuring sufficient representation of each class. This balanced representation enabled fair evaluation of performance across all classes.
Figure 7. LSTM sub-model performance evaluation. (a) LSTM sub-model accuracy; (b) LSTM sub-model precision; (c) LSTM sub-model recall; (d) LSTM sub-model F1 score.
Figure 8 presents the confusion matrix derived from the validation results of the LSTM sub-model. This matrix offers a comprehensive visual representation of the LSTM sub-model’s classification performance, highlighting the distribution of true positive, true negative, false positive, and false negative predictions across 10 different classes (i.e., the 10 labels for the 10 words listed in Table 4). Examining the confusion matrix provides crucial insights into the LSTM sub-model’s ability to accurately classify instances within each class. As we can see, the true positives fell within the range of 56 to 95 when the LSTM sub-model correctly predicted the targeted word. Furthermore, Figure 9 shows the loss evaluation for the training and validation of the LSTM sub-model. We can see fluctuations in the loss outcomes between the training and validation phases, which can be attributed to the LSTM sub-model’s focus on extracting temporal features, such as motion signs, rather than spatial features, such as static signs, as observed in the CNN sub-model. Additionally, the loss results in the validation phase of the CNN sub-model (see Figure 6) exhibit superior performance compared to those of the LSTM sub-model, primarily due to this distinction in feature extraction.
Figure 8. LSTM sub-model confusion matrix.
Figure 9. LSTM sub-model validation loss.

6. Conclusions and Future Work

We examined the issue of the pronounced deficiency of sign language interpreters in the Kingdom of Saudi Arabia, highlighting the notable gap in the ratio of interpreters to individuals with hearing impairments compared to other regions like California, USA. This study revealed that this scarcity presents a significant obstacle in delivering services to the deaf community in public spaces. In this paper, we propose a hybrid model designed to capture both spatial and temporal elements of sign language. This hybrid model comprises a CNN classifier to extract spatial features from sign language data and an LSTM classifier to capture both the spatial and temporal characteristics essential for sequential data processing. By automating ArSL translation in real time between sign language and spoken or written language, this hybrid model aims to address the interpreter shortage.
To demonstrate the viability of our proposed hybrid model, we created a dataset of 20 different words, comprising 4000 images for 10 static gesture words and 500 videos for 10 dynamic gesture words. Our hybrid model showcased promising performance, with the CNN and LSTM classifiers achieving accuracy rates of 94.40% and 82.70%, respectively. One observation from our research is the superior performance of the CNN compared to the LSTM, attributable to the LSTM’s emphasis on extracting temporal features (e.g., motion signs), whereas the CNN focuses on spatial features, such as static signs. In future work, we aim to expand the number of words in the datasets for both models in terms of both images and videos and use different viewing angles to enhance the performance of the sub-models utilized in our hybrid model. This research work holds the potential to enhance the availability of translation services, catering to the needs of the deaf community, thereby fostering effective communication, enhancing accessibility, and promoting solidarity with this significant segment of society.

Author Contributions

Conceptualization, T.H.N., A.N., A.F.A. and A.F.; methodology, T.H.N. and A.N.; software, A.F.A., A.F., R.A. and A.S.A.; validation, T.H.N., A.F.A. and A.F.; formal analysis, T.H.N., A.N., R.A. and A.S.A.; investigation, T.H.N., G.A. and A.A.; resources, T.A. and A.A.; data curation, A.F.A., A.F., R.A. and A.S.A.; writing—original draft preparation, T.H.N. and A.N.; writing—review and editing, G.A. and T.A.; visualization, T.H.N. and A.N.; supervision, T.H.N. and A.N.; project administration, T.H.N.; funding acquisition, A.N., G.A. and T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are available at the following link (https://github.com/AhmedIbrahim110/Dataset).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rastgoo, R.; Kiani, K.; Escalera, S. Sign language recognition: A deep survey. Expert Syst. Appl. 2021, 164, 113794. [Google Scholar] [CrossRef]
  2. Costello, E. Random House Webster’s Compact American Sign Language Dictionary; Penguin Random House: New York, NY, USA, 2008. [Google Scholar]
  3. Kumar, P.; Gauba, H.; Roy, P.P.; Dogra, D.P. A multimodal framework for sensor based sign language recognition. Neurocomputing 2017, 259, 21–38. [Google Scholar] [CrossRef]
  4. Hassan, M.; Assaleh, K.; Shanableh, T. Multiple proposals for continuous arabic sign language recognition. Sens. Imaging 2019, 20, 4. [Google Scholar] [CrossRef]
  5. Ministry of Health, S.A. The Deaf and Sign Language. Available online: https://www.moh.gov.sa/en/Ministry/Information-and-services/Pages/Sign-language.aspx (accessed on 15 February 2024).
  6. Ministry of Health, S.A. We Are with You. Available online: https://www.moh.gov.sa/en/Ministry/Projects/with-you/Pages/default.aspx (accessed on 10 February 2024).
  7. Center for Strategic and International Studies (CSIS). Reading the Signs: Diverse Arabic Sign Languages. Available online: https://www.csis.org/analysis/reading-signs-diverse-arabic-sign-languages (accessed on 5 January 2024).
  8. Alzohairi, R.; Alghonaim, R.; Alshehri, W.; Aloqeely, S. Image based Arabic sign language recognition system. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 185–194. [Google Scholar] [CrossRef]
  9. Zakariah, M.; Alotaibi, Y.A.; Koundal, D.; Guo, Y.; Elahi, M.M. Sign language recognition for Arabic alphabets using transfer learning technique. Comput. Intell. Neurosci. 2022, 2022, 4567989. [Google Scholar] [CrossRef]
  10. Wadhawan, A.; Kumar, P. Sign language recognition systems: A decade systematic literature review. Arch. Comput. Methods Eng. 2021, 28, 785–813. [Google Scholar] [CrossRef]
  11. Noor, T.H.; Noor, A.; Elmezain, M. Poisonous Plants Species Prediction Using a Convolutional Neural Network and Support Vector Machine Hybrid Model. Electronics 2022, 11, 3690. [Google Scholar] [CrossRef]
  12. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote. Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  13. Wang, B.; Chen, Y.; Yan, Z.; Liu, W. Integrating Remote Sensing Data and CNN-LSTM-Attention Techniques for Improved Forest Stock Volume Estimation: A Comprehensive Analysis of Baishanzu Forest Park, China. Remote Sens. 2024, 16, 324. [Google Scholar] [CrossRef]
  14. Almukhalfi, H.; Noor, A.; Noor, T.H. Traffic management approaches using machine learning and deep learning techniques: A survey. Eng. Appl. Artif. Intell. 2024, 133, 108147. [Google Scholar] [CrossRef]
  15. Noor, T.H.; Almars, A.M.; El-Sayed, A.; Noor, A. Deep learning model for predicting consumers’ interests of IoT recommendation system. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 161–170. [Google Scholar] [CrossRef]
  16. Abib, G.; Castel, F.; Satouri, N.; Afifi, H.; Said, A.M. Survey and Enhancements on Deploying LSTM Recurrent Neural Networks on Embedded Systems. In Proceedings of the ICC 2023-IEEE International Conference on Communications, Rome, Italy, 28 May–1 June 2023; pp. 949–953. [Google Scholar]
  17. Koller, O. Quantitative survey of the state of the art in sign language recognition. arXiv 2020, arXiv:2008.09918. [Google Scholar]
  18. Cheok, M.J.; Omar, Z.; Jaward, M.H. A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn. Cybern. 2019, 10, 131–153. [Google Scholar] [CrossRef]
  19. Mohandes, M.; Deriche, M.; Liu, J. Image-based and sensor-based approaches to Arabic sign language recognition. IEEE Trans. Hum.-Mach. Syst. 2014, 44, 551–557. [Google Scholar] [CrossRef]
  20. Aiouez, S.; Hamitouche, A.; Belmadoui, M.S.; Belattar, K.; Souami, F. Real-time Arabic Sign Language Recognition based on YOLOv5. In Proceedings of the IMPROVE, Online Streaming, 22–24 April 2022; pp. 17–25. [Google Scholar]
  21. Alawwad, R.A.; Bchir, O.; Ismail, M.M.B. Arabic sign language recognition using Faster R-CNN. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 692–700. [Google Scholar] [CrossRef]
  22. Elhagry, A.; Elrayes, R.G. Egyptian sign language recognition using cnn and lstm. arXiv 2021, arXiv:2107.13647. [Google Scholar]
  23. Hdioud, B.; Tirari, M.E.H. A Deep Learning based Approach for Recognition of Arabic Sign Language Letters. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 424–429. [Google Scholar] [CrossRef]
  24. Rivera-Acosta, M.; Ruiz-Varela, J.M.; Ortega-Cisneros, S.; Rivera, J.; Parra-Michel, R.; Mejia-Alvarez, P. Spelling correction real-time American sign language alphabet translation system based on YOLO network and LSTM. Electronics 2021, 10, 1035. [Google Scholar] [CrossRef]
  25. Dutta, K.K.; Bellary, S.A.S. Machine learning techniques for Indian sign language recognition. In Proceedings of the 2017 International Conference on Current Trends in Computer, Electrical, Electronics, and Communication (CTCEEC), Mysore, India, 8–9 September 2017; pp. 333–336. [Google Scholar]
  26. Sako, S.; Hatano, M.; Kitamura, T. Real-time Japanese sign language recognition based on three phonological elements of sign. In Proceedings of the HCI International 2016–Posters’ Extended Abstracts: 18th International Conference, HCI International 2016, Toronto, ON, Canada, 17–22 July 2016; Proceedings, Part II 18. Springer: Berlin/Heidelberg, Germany, 2016; pp. 130–136. [Google Scholar]
  27. Uyyala, P. Sign Language Recognition Using Convolutional Neural Networks. J. Interdiscip. Cycle Res. 2022, 14, 1198–1207. [Google Scholar]
  28. Vyavahare, P.; Dhawale, S.; Takale, P.; Koli, V.; Kanawade, B.; Khonde, S. Detection and interpretation of Indian Sign Language using LSTM networks. J. Intell Syst. Control 2023, 2, 132–142. [Google Scholar] [CrossRef]
  29. Shurid, S.A.; Amin, K.H.; Mirbahar, M.S.; Karmaker, D.; Mahtab, M.T.; Khan, F.T.; Alam, M.G.R.; Alam, M.A. Bangla sign language recognition and sentence building using deep learning. In Proceedings of the 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Gold Coast, Australia, 16–18 December 2020; pp. 1–9. [Google Scholar]
  30. Daniels, S.; Suciati, N.; Fathichah, C. Indonesian sign language recognition using YOLO method. In Proceedings of the IOP Conf. on Materials Science and Engineering, Yogyakarta, Indonesia, 13–14 November 2020; pp. 1–9. [Google Scholar]
  31. Ko, S.K.; Son, J.G.; Jung, H. Sign language recognition with recurrent neural network using human keypoint detection. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems, Honolulu, HI, USA, 9–12 October 2018; pp. 326–328. [Google Scholar]
  32. Deng, Z.; Leng, Y.; Chen, J.; Yu, X.; Zhang, Y.; Gao, Q. TMS-Net: A multi-feature multi-stream multi-level information sharing network for skeleton-based sign language recognition. Neurocomputing 2024, 572, 127194. [Google Scholar] [CrossRef]
  33. Zuo, R.; Wei, F.; Mak, B. Natural language-assisted sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14890–14900. [Google Scholar]
  34. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-visual speech and gesture recognition by sensors of mobile devices. Sensors 2023, 23, 2284. [Google Scholar] [CrossRef] [PubMed]
  35. Qi, W.; Fan, H.; Xu, Y.; Su, H.; Aliverti, A. A 3d-CLDNN based multiple data fusion framework for finger gesture recognition in human-robot interaction. In Proceedings of the 2022 4th International Conference on Control and Robotics (ICCR), Guangzhou, China, 22–24 October 2022; pp. 383–387. [Google Scholar]
  36. Bora, J.; Dehingia, S.; Boruah, A.; Chetia, A.A.; Gogoi, D. Real-time assamese sign language recognition using mediapipe and deep learning. Procedia Comput. Sci. 2023, 218, 1384–1393. [Google Scholar] [CrossRef]
  37. Eunice, J.; Sei, Y.; Hemanth, D.J. Sign2Pose: A Pose-Based Approach for Gloss Prediction Using a Transformer Model. Sensors 2023, 23, 2853. [Google Scholar] [CrossRef] [PubMed]
  38. Sincan, O.M.; Keles, H.Y. Autsl: A large scale multi-modal turkish sign language dataset and baseline methods. IEEE Access 2020, 8, 181340–181355. [Google Scholar] [CrossRef]
  39. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef] [PubMed]
  40. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
  41. Alqurafi, A.; Alsanoosy, T. Measuring Customers’ Satisfaction Using Sentiment Analysis: Model and Tool. J. Comput. Sci. 2024, 20, 419–430. [Google Scholar] [CrossRef]
  42. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
  43. Bansal, M.; Goyal, A.; Choudhary, A. A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decis. Anal. J. 2022, 3, 100071. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
