1. Introduction
According to the World Health Organization (WHO), over 430 million people worldwide live with disabling hearing loss, which creates ongoing communication barriers in professional and public settings. Advances in Artificial Intelligence (AI), computer vision, and the Internet of Things (IoT) have enabled the development of automated sign language recognition (SLR) systems. However, most current systems either demand significant computational power or rely on separate offline evaluation, overlooking real-time constraints and the long-term training stability of models on low-power IoT devices.
To address these challenges, this paper proposes an IoT-enabled hybrid deep learning system for real-time sign-to-text recognition that combines a Long Short-Term Memory (LSTM) network for static gesture recognition with a 3D Convolutional Neural Network (CNN) for dynamic gesture recognition. The system runs on a Raspberry Pi platform using MediaPipe for landmark extraction, allowing on-device inference without relying on cloud processing.
Unlike previous studies, this study emphasizes both model performance and deployment feasibility in IoT environments. The work offers a detailed, epoch-wise comparative analysis of LSTM and 3D CNN performance, demonstrating that while the 3D CNN maintains exceptional stability across all epochs, the LSTM is highly prone to overfitting beyond its optimal training period (1000 epochs). This analysis validates the architecture choice based on sign type and highlights the need for a hybrid approach to ensure robust performance across different gesture modalities while balancing computational costs.
The main contributions of this paper are as follows: (1) a real-time IoT-enabled hybrid Sign-to-Text system deployed on embedded hardware is developed; (2) a detailed epoch-wise comparison of LSTM and 3D CNN models for static and dynamic sign language recognition is conducted, highlighting optimal performance points (e.g., 1000 epochs for LSTM) and assessing long-term stability; (3) training stability, overfitting tendencies (notably the LSTM collapse at 2000 epochs), and deployment potential are analyzed on resource-limited IoT devices; and (4) a comprehensive evaluation of performance, including accuracy, precision, recall, inference time, and computational cost, is provided.
The rest of this paper is organized as follows.
Section 2 reviews related work covering IoT-enabled assistive systems, CNN/LSTM-based gesture recognition, hybrid spatio-temporal models, MediaPipe-based pipelines, and transformer-based approaches, and highlights key gaps in the literature.
Section 3 describes the proposed system architecture and methodology, including dataset preparation, MediaPipe-based feature extraction, model design, training procedures, and evaluation metrics.
Section 4 presents the experimental results and performance analysis, including epoch-wise stability comparisons and sign classification accuracy.
Section 5 discusses deployment feasibility, dataset limitations, generalizability, and real-time IoT performance. Finally,
Section 6 concludes the paper and outlines future research directions.
3. Methodology
This study introduces an IoT-enabled hybrid Sign-to-Text (STT) system that combines Long Short-Term Memory (LSTM) networks for static gesture recognition with a 3D Convolutional Neural Network (3D CNN) for dynamic gesture classification. The approach includes dataset preparation, feature extraction, model development, training procedures, evaluation metrics, and deployment on Raspberry Pi hardware.
Figure 1 illustrates the overall workflow.
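To make the feature-extraction stage concrete, the following is a minimal Python sketch of MediaPipe-based landmark extraction feeding the LSTM branch. It assumes the MediaPipe Hands solution, up to two hands with 21 landmarks each, and fixed-length 30-frame sequences; these specifics are assumptions for illustration, not the exact configuration used in the system.

```python
# Minimal sketch of MediaPipe-based landmark extraction for the LSTM (STT) branch.
# Assumptions: MediaPipe Hands, up to two hands with 21 landmarks each (x, y, z),
# and fixed-length 30-frame sequences; missing hands are zero-padded.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
FEATURES_PER_FRAME = 2 * 21 * 3  # 126 values per frame

def extract_keypoints(frame_bgr, hands):
    """Return a flat (126,) landmark vector for a single video frame."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    keypoints = np.zeros(FEATURES_PER_FRAME, dtype=np.float32)
    if results.multi_hand_landmarks:
        for h, hand in enumerate(results.multi_hand_landmarks[:2]):
            coords = np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark],
                              dtype=np.float32).flatten()
            keypoints[h * 63:(h + 1) * 63] = coords
    return keypoints

def sequence_from_video(path, seq_len=30):
    """Read a gesture clip and return a (seq_len, 126) array for the LSTM."""
    cap = cv2.VideoCapture(path)
    frames = []
    with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        while len(frames) < seq_len:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(extract_keypoints(frame, hands))
    cap.release()
    while len(frames) < seq_len:                 # zero-pad short clips
        frames.append(np.zeros(FEATURES_PER_FRAME, dtype=np.float32))
    return np.stack(frames)
```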
4. Experimental Results
The experiments were conducted on the hybrid system using representative subsets of static and dynamic gestures to evaluate classification accuracy, stability, and generalization across training epochs. Each sign gesture was predicted individually to assess how accurately the models recognize it. For the STT model, the sign “teacher” demonstrated strong and consistent learning, with accuracy remaining high and increasing from 98.59% at 500 epochs to 99.07% at 1000 epochs. The sign “blame” improved modestly from 89.29% to 90.16%, indicating gradual learning. Conversely, the accuracy of the sign “danger” dropped from 93.88% at 500 epochs to 83.33% at 1000 epochs, possibly due to model instability or overfitting. No results are reported at 2000 epochs because of instability during extended training, which led to low training accuracy. The experimental results for both models are presented in
Table 4.
As shown in
Figure 4, across all assessed signs and epochs, the DIST system’s results using the 3D CNN model show consistently high training accuracy. The signs “angry” and “worried” reached 99.5% accuracy after 500 epochs, demonstrating early convergence and model stability, with no improvement or decline observed at 1000 and 2000 epochs. The sign “again” started at 95.68% accuracy at 500 epochs, increased to 99.6% by 1000 epochs, and maintained this performance at 2000 epochs.
The 3D CNN model demonstrates high learning capacity, quick convergence, and strong resistance to overfitting across multiple training runs.
Table 5 shows how well each model matches the expected sign-gesture output after training for 500 epochs. The 3D CNN consistently outperforms the LSTM, especially for dynamic gestures such as “again”, “angry”, and “online”.
The results show a clear performance difference between the LSTM and 3D CNN models in recognizing dynamic sign language gestures. Although both models were trained for 500 epochs under the same conditions, the 3D CNN performed significantly better than the LSTM on nearly all tested signs, achieving a match rate above 94% for four of the five signs, with the highest at 98% for the sign “online”. In contrast, the LSTM recognized only two signs, “angry” (95.48%) and “online” (95%), and failed to recognize “again”, “appointment”, or “work” (0%). The near-zero detection of “work” by both models points to gesture ambiguity or insufficient representative samples in the dataset rather than a failure of the underlying architectures, and it indicates a need for further dataset refinement. These results highlight the LSTM’s weakness in modeling the complex spatiotemporal patterns of dynamic gestures, potentially due to its sequential processing and lack of explicit spatial feature extraction. The 3D CNN, which extracts spatial and temporal features simultaneously, shows higher consistency and stability and is therefore the more reliable architecture for dynamic sign recognition systems.
Table 6 compares the recognition accuracy of three key dynamic gestures (“appointment”, “useless”, and “worry”) using the STT-LSTM and 3D CNN models. The 3D CNN consistently outperformed the STT-LSTM across all three gestures. “Appointment” had the highest recognition accuracy with the 3D CNN (98.96%), followed closely by the STT-LSTM (96.99%). “Useless” showed a marked improvement with the 3D CNN (98.75%) compared to the STT-LSTM (91.74%), indicating better feature extraction by the 3D convolutional approach for this gesture. “Worry” had the lowest overall recognition performance, with the STT-LSTM at 93.28% and the 3D CNN at 96.97%, though the CNN still achieved higher accuracy.
For static gestures, the STT model exhibited high accuracy at 500 and 1000 epochs. For example, the signs “teacher”, “correct”, and “good” consistently achieved precision, recall, and F1-scores of 1.0. The sign “blame” improved from 50% to 100% precision between 500 and 1000 epochs, while “bad” reached an F1-score of 0.80 at 500 epochs. However, at 2000 epochs, the STT model showed a dramatic decline in macro-F1 score (0.088). This instability confirms the model’s susceptibility to overfitting under extended training, so 1000 epochs were identified as the optimal training duration for the STT subsystem.
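For reference, the per-class and macro-averaged metrics quoted above (precision, recall, F1, macro-F1) can be reproduced with scikit-learn as in the short sketch below; the label arrays are placeholders, not data from the study.

```python
# Illustrative computation of per-class precision/recall/F1 and the macro-F1
# score used to summarize the STT model; the labels below are placeholders.
from sklearn.metrics import classification_report, f1_score

y_true = ["teacher", "teacher", "blame", "bad", "danger", "blame"]
y_pred = ["teacher", "teacher", "blame", "teacher", "danger", "bad"]

print(classification_report(y_true, y_pred, zero_division=0))
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```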
Table 7 provides a quantitative comparison of the training performance of STT and DIST over 500, 1000, and 2000 epochs. STT performs well at 500 and 1000 epochs but drops sharply at 2000 epochs, likely due to overfitting or weight divergence, whereas DIST maintains high performance across all epochs, indicating better generalization and stability.
The STT subsystem exhibits several notable strengths. It achieves excellent performance on static signs, with stable convergence observed up to 1000 training epochs. The model has relatively low computational requirements, making it well suited for deployment on IoT edge devices, and it integrates effectively with MediaPipe-derived landmark features. However, several limitations were also identified. The model is sensitive to overtraining, and signs with limited representation in the dataset (e.g., “blame”, “bad”) show higher variability in precision. Its performance depends strongly on dataset size and class balance, and the architecture is not well suited to recognizing dynamic, temporally complex gestures.
The STT system (LSTM-based) performed well on static gestures at 500–1000 epochs but failed at 2000 epochs due to overfitting and recurrent weight divergence. In contrast, the DIST system (3D CNN-based) maintained stable, near-perfect performance across all epochs, effectively capturing the temporal motion in dynamic gestures. This contrast underscores the value of using the LSTM for static signs and the 3D CNN for dynamic signs, supporting the hybrid architecture of the proposed IoT-enabled Sign-to-Text framework.
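As an illustration of how such a hybrid front-end might route each input to the appropriate branch, the sketch below uses a simple motion-energy heuristic over the landmark sequence. The paper does not specify the routing rule, so the heuristic and its threshold are assumptions rather than the authors’ method.

```python
# Hypothetical routing between the STT (LSTM) and DIST (3D CNN) branches based on
# frame-to-frame landmark motion; rule and threshold are illustrative assumptions.
import numpy as np

MOTION_THRESHOLD = 0.02  # assumed value; would be tuned on validation data

def is_dynamic(keypoint_sequence):
    """Treat a (T, 126) landmark sequence as dynamic if landmarks move enough."""
    frame_deltas = np.diff(keypoint_sequence, axis=0)
    return float(np.mean(np.abs(frame_deltas))) > MOTION_THRESHOLD

def classify_sign(keypoint_sequence, frame_volume, stt_lstm, dist_cnn):
    """Route static signs to the LSTM and dynamic signs to the 3D CNN."""
    if is_dynamic(keypoint_sequence):
        return dist_cnn.predict(frame_volume[np.newaxis, ...])   # DIST branch
    return stt_lstm.predict(keypoint_sequence[np.newaxis, ...])  # STT branch
```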
Although the DIST model achieved near-perfect classification accuracy across all evaluated epochs, these results should be interpreted in the context of a controlled acquisition environment, limited gesture vocabulary, and a small dataset size. The reported accuracy reflects feasibility and stability under constrained conditions rather than large-scale, real-world generalization.
5. Discussion
The evaluation of the hybrid system uncovers distinct strengths and limitations for both deep learning architectures. The LSTM model showed excellent performance for static gestures at lower training epochs, reaching up to 99.07% accuracy for signs such as “teacher”. This aligns with the LSTM’s ability to capture temporal patterns in short, structured sequences. However, its performance declined sharply when training was extended to 2000 epochs, highlighting the model’s sensitivity to prolonged training, which can lead to overfitting or gradient instability. Effective use of the LSTM in real-world applications therefore requires careful hyperparameter tuning, regularization, and early stopping.
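As a concrete example of such training regulation, the following Keras sketch combines dropout, early stopping, and learning-rate scheduling for the LSTM branch. The layer sizes, input shape, class count, and patience values are illustrative assumptions (the 0.2 dropout rate matches the value mentioned later in the text), not the authors’ exact configuration.

```python
# Sketch of training-regulation measures for the STT LSTM: dropout, early
# stopping, and learning-rate scheduling. Architecture and hyperparameters are
# illustrative assumptions, not the paper's exact configuration.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 126)),            # 30 frames x 126 landmark features (assumed)
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),   # number of static signs (assumed)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Stop long before the 2000-epoch regime where the LSTM collapsed
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                     restore_best_weights=True),
    # Halve the learning rate when validation loss plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=20),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=1000, callbacks=callbacks)
```

With restore_best_weights=True, the weights from the best validation epoch are retained even if training runs too long, directly guarding against the late-epoch collapse discussed above.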
In contrast, the 3D CNN model maintained stable and high performance across all evaluated epochs. Its capacity to simultaneously capture spatial and temporal features makes it especially suitable for recognizing dynamic gestures such as “again” and “appointment”, with near-perfect accuracy. Nevertheless, the increased computational demands of 3D CNNs must be considered when deploying on resource-limited devices like Raspberry Pi. Optimization techniques such as model pruning or quantization may be necessary to maintain real-time performance without sacrificing accuracy. A key finding is that combining the two models allows the system to intelligently select the appropriate architecture based on whether the gesture is static or dynamic. This hybrid approach not only boosts recognition accuracy but also balances computational efficiency, making it ideal for IoT-based assistive technologies. Future research may consider integrating attention mechanisms, transformer architectures, or federated learning to improve personalization and privacy.
Despite promising results, the system has some limitations. The LSTM model exhibited overfitting in later epochs, limiting its scalability without regularization. The 3D CNN, while accurate, is computationally demanding, making real-time use on low-power devices challenging. Additionally, the system currently only supports isolated gestures and is limited to a small dataset. Future work will focus on improving training stability through early stopping, compressing the 3D CNN for edge deployment, and expanding the gesture set with larger, multilingual datasets. We also plan to incorporate transformer-based models and extend the system to support continuous gesture recognition in natural communication environments.
This study focuses on comparative training stability and IoT deployment feasibility rather than large-scale generalization, providing a practical and transparent evaluation of hybrid deep learning architectures under real-world edge computing constraints.
5.2. IoT Deployment and Real-Time Performance
To assess practical feasibility, the proposed hybrid STT–DIST system was deployed and tested on a Raspberry Pi 4B (4 GB RAM), showing that real-time sign classification is possible on resource-limited IoT hardware.
Table 8 summarizes the deployment metrics obtained during continuous inference. The system achieves a frame rate of 12–15 fps, meeting real-time processing requirements, with an average inference time of approximately 65 ms per frame. Peak memory usage remained around 850 MB, while CPU utilization ranged between 72% and 85% under sustained operation. The deployed TensorFlow Lite models had compact sizes (5.8 MB for LSTM and 18.2 MB for the 3D CNN), enabling efficient edge deployment. Thermal measurements remained below 68 °C using passive cooling, indicating stable long-term operation.
The Raspberry Pi maintains stable real-time performance due to the optimized input pipeline (MediaPipe key points for LSTM, down-sampled video frames for 3D CNN) and lightweight model architecture. The LSTM quickly performs inference because of its small input space, while the 3D CNN, though heavier, stays within the device’s computational limits.
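The latency and frame-rate figures above can be measured with a short timing loop around the TensorFlow Lite interpreter, as in the self-contained sketch below; the stand-in model and iteration counts are illustrative, and in deployment the converted STT/DIST models would be loaded instead.

```python
# Sketch of measuring per-frame inference time and frame rate with the TFLite
# interpreter. A small stand-in model keeps the example self-contained; on the
# Raspberry Pi, the deployed .tflite files would be loaded from disk instead.
import time
import numpy as np
import tensorflow as tf

stand_in = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 126)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(10, activation="softmax"),
])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(stand_in).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])
latencies = []
for _ in range(220):                       # first iterations serve as warm-up
    start = time.perf_counter()
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
    latencies.append(time.perf_counter() - start)

mean_ms = 1000 * float(np.mean(latencies[20:]))   # discard warm-up iterations
print(f"mean inference time: {mean_ms:.2f} ms  (~{1000 / mean_ms:.1f} fps)")
```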
Model Compression and Optimization Techniques: To sustain high performance on low-power IoT devices, the following model compression and optimization strategies were implemented.
Quantization: Both models were converted to TensorFlow Lite using 8-bit integer quantization (a minimal conversion sketch is given after this list). This process reduced model size by approximately 40–60% and sped up inference by about 30%, with no more than a 1% decrease in classification accuracy. These improvements are essential for deployment on embedded platforms with limited storage and computing resources.
Pruning: Unstructured pruning was incorporated during training to remove low-magnitude weights. This reduced the computational and memory footprints, improved inference speed on ARM-based CPUs, and contributed to smoother optimization dynamics, particularly for the LSTM model, by mitigating redundancy in the recurrent layers and stabilizing training.
Lightweight Architectures: The baseline network designs were tailored explicitly for edge deployment. The 3D CNN uses relatively shallow 3D convolutional layers compared to full C3D or I3D architectures, and the input dimensionality is reduced to 32 × 32 frames. Similarly, the LSTM model employs minimal recurrent depth. These design choices together ensure that both models operate within the thermal, computing, and memory constraints of the Raspberry Pi platform while maintaining discriminative capacity.
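As a concrete illustration of the 8-bit quantization step described in this list, the following sketch uses the standard TensorFlow Lite post-training quantization workflow. The stand-in dense classifier and random calibration data are placeholders (fully integer-quantizing recurrent or 3D-convolutional layers can require additional handling), so this is not the authors’ actual conversion script.

```python
# Minimal post-training int8 quantization with the TensorFlow Lite converter.
# Stand-in model and random calibration data keep the sketch self-contained;
# in practice the trained STT/DIST models and real samples are used instead.
import numpy as np
import tensorflow as tf

stand_in = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(126,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
calibration = [np.random.rand(1, 126).astype("float32") for _ in range(100)]

def representative_data_gen():
    # Representative inputs let the converter calibrate activation ranges
    # for full-integer quantization.
    for sample in calibration:
        yield [sample]

converter = tf.lite.TFLiteConverter.from_keras_model(stand_in)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
with open("model_int8.tflite", "wb") as f:   # hypothetical output name
    f.write(tflite_int8)
print(f"int8 model size: {len(tflite_int8) / 1024:.1f} KB")
```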
Comparative Behavior of LSTM and 3D CNN Models: The experimental results highlight distinct behaviors for the two architectures.
LSTM Model: The LSTM achieved over 99% accuracy for static gestures when trained for 500–1000 epochs. However, performance declined when training was extended to 2000 epochs, primarily due to overfitting, gradient saturation, and the relatively small dataset size. Stability can be improved through early stopping, dropout (0.2), and learning-rate scheduling, as well as pruning and quantization. Importantly, the sharp performance collapse of the LSTM model at 2000 epochs highlights its sensitivity to prolonged training on small datasets; this behavior, driven by overfitting, recurrent weight instability, and the absence of early stopping, represents a cautionary finding rather than a modeling failure, and it underscores the importance of careful training regulation for edge-based IoT deployments.
Three-Dimensional CNN Model: The 3D CNN model consistently achieved over 99% accuracy across all tested epochs and showed limited sensitivity to overfitting. This robustness stems from its strong spatio-temporal feature representations and its larger effective input space. The main drawback is its higher computational cost on IoT devices, which was mitigated through TensorFlow Lite conversion, lower frame resolution, and the lightweight architectural design described earlier.
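To illustrate the kind of lightweight 3D CNN described above, the sketch below stacks a few shallow Conv3D blocks over down-sampled 32 × 32 frames. The layer counts, filter sizes, clip length, and class count are assumptions for illustration, not the published DIST architecture.

```python
# Illustrative lightweight 3D CNN over down-sampled 32x32 frames, kept shallow
# relative to C3D/I3D. Clip length, filters, and class count are assumptions.
import tensorflow as tf

def build_dist_3dcnn(num_classes=10, clip_len=16):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(clip_len, 32, 32, 3)),
        tf.keras.layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D((1, 2, 2)),
        tf.keras.layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D((2, 2, 2)),
        tf.keras.layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_dist_3dcnn()
model.summary()  # roughly 70k parameters with these assumptions, far below C3D/I3D
```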
Despite dataset limitations and IoT hardware constraints, the proposed hybrid STT–DIST system effectively demonstrates real-time, on-device sign language recognition using LSTM for static signs and 3D CNN for dynamic gestures. By applying model compression techniques (quantization, pruning) and efficient feature extraction (MediaPipe), the system achieves high accuracy, low latency, and stable performance, demonstrating strong potential for scalable deployment in inclusive assistive technologies for the Deaf and hard-of-hearing communities.
While the system achieves real-time performance under controlled conditions, real-world deployment introduces additional challenges, including multi-user scenarios, partial hand occlusion, variations in lighting, motion blur, and hands moving outside the camera frame. These factors may degrade recognition accuracy and temporal tracking, particularly for continuous signing. Addressing these challenges will require adaptive gesture segmentation, multi-person detection, and robustness-enhancing data augmentation strategies.