1. Introduction
According to the World Health Organization (WHO), over 430 million people worldwide live with disabling hearing loss, which creates ongoing communication barriers in professional and public settings. Advances in Artificial Intelligence (AI), computer vision, and the Internet of Things (IoT) have enabled the development of automated sign language recognition (SLR) systems. However, most current systems either demand significant computational power or rely on separate offline evaluation, overlooking real-time constraints and the long-term training stability of models on low-power IoT devices.
To address these challenges, this paper proposes an IoT-enabled hybrid deep learning system for real-time sign-to-text recognition that combines a Long Short-Term Memory (LSTM) network for static gesture recognition with a 3D Convolutional Neural Network (CNN) for dynamic gesture recognition. The system runs on a Raspberry Pi platform using MediaPipe for landmark extraction, allowing on-device inference without relying on cloud processing.
Unlike previous studies, this study emphasizes both model performance and deployment feasibility in IoT environments. The work offers a detailed, epoch-wise comparative analysis of LSTM and 3D CNN performance, demonstrating that while the 3D CNN maintains exceptional stability across all epochs, the LSTM is highly prone to overfitting beyond its optimal training period (1000 epochs). This analysis validates the architecture choice based on sign type and highlights the need for a hybrid approach to ensure robust performance across different gesture modalities while balancing computational costs.
The main contributions of this paper are as follows: (1) a real-time IoT-enabled hybrid Sign-to-Text system deployed on embedded hardware is developed; (2) a detailed epoch-wise comparison of LSTM and 3D CNN models for static and dynamic sign language recognition is conducted, highlighting optimal performance points (e.g., 1000 epochs for LSTM) and assessing long-term stability; (3) training stability, overfitting tendencies (notably the LSTM collapse at 2000 epochs), and deployment potential are analyzed on resource-limited IoT devices; and (4) a comprehensive evaluation of performance, including accuracy, precision, recall, inference time, and computational cost, is provided.
The rest of this paper is organized as follows.
Section 2 reviews related work covering IoT-enabled assistive systems, CNN/LSTM-based gesture recognition, hybrid spatio-temporal models, MediaPipe-based pipelines, and transformer-based approaches, and highlights key gaps in the literature.
Section 3 describes the proposed system architecture and methodology, including dataset preparation, MediaPipe-based feature extraction, model design, training procedures, and evaluation metrics.
Section 4 presents the experimental results and performance analysis, including epoch-wise stability comparisons and sign classification accuracy.
Section 5 discusses deployment feasibility, dataset limitations, generalizability, and real-time IoT performance. Finally,
Section 6 concludes the paper and outlines future research directions.
3. Methodology
This study introduces an IoT-enabled hybrid Sign-to-Text (STT) system that combines Long Short-Term Memory (LSTM) networks for static gesture recognition with a 3D Convolutional Neural Network (3D CNN) for dynamic gesture classification. The approach includes dataset preparation, feature extraction, model development, training procedures, evaluation metrics, and deployment on Raspberry Pi hardware.
Figure 1 illustrates the overall workflow.
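To make the feature-extraction stage concrete, the following is a minimal Python sketch of MediaPipe-based landmark extraction feeding the LSTM branch. It assumes the MediaPipe Hands solution, up to two hands with 21 landmarks each, and fixed-length 30-frame sequences; these specifics are assumptions for illustration, not the exact configuration used in the system.

```python
# Minimal sketch of MediaPipe-based landmark extraction for the LSTM (STT) branch.
# Assumptions: MediaPipe Hands, up to two hands with 21 landmarks each (x, y, z),
# and fixed-length 30-frame sequences; missing hands are zero-padded.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
FEATURES_PER_FRAME = 2 * 21 * 3  # 126 values per frame

def extract_keypoints(frame_bgr, hands):
    """Return a flat (126,) landmark vector for a single video frame."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    keypoints = np.zeros(FEATURES_PER_FRAME, dtype=np.float32)
    if results.multi_hand_landmarks:
        for h, hand in enumerate(results.multi_hand_landmarks[:2]):
            coords = np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark],
                              dtype=np.float32).flatten()
            keypoints[h * 63:(h + 1) * 63] = coords
    return keypoints

def sequence_from_video(path, seq_len=30):
    """Read a gesture clip and return a (seq_len, 126) array for the LSTM."""
    cap = cv2.VideoCapture(path)
    frames = []
    with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        while len(frames) < seq_len:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(extract_keypoints(frame, hands))
    cap.release()
    while len(frames) < seq_len:                 # zero-pad short clips
        frames.append(np.zeros(FEATURES_PER_FRAME, dtype=np.float32))
    return np.stack(frames)
```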
4. Experimental Results
The experiments were conducted on the hybrid system using representative subsets of static and dynamic gestures to evaluate classification accuracy, stability, and generalization across training epochs. Each sign gesture was predicted individually to assess how accurately the models recognize it. For the STT model, the sign “teacher” demonstrated strong and consistent learning, with accuracy remaining high and increasing from 98.59% at 500 epochs to 99.07% at 1000 epochs. The sign “blame” improved modestly from 89.29% to 90.16%, indicating gradual learning. Conversely, the accuracy of the sign “danger” dropped from 93.88% at 500 epochs to 83.33% at 1000 epochs, possibly due to model instability or overfitting. No results are reported at 2000 epochs because of instability during extended training, which led to low training accuracy. The experimental results for both models are presented in
Table 4.
As shown in
Figure 4, across all assessed signs and epochs, the DIST system’s results using the 3D CNN model show consistently high training accuracy. The signs “angry” and “worried” reached 99.5% accuracy after 500 epochs, demonstrating early convergence and model stability, with no improvement or decline observed at 1000 and 2000 epochs. The sign “again” started at 95.68% accuracy at 500 epochs, increased to 99.6% by 1000 epochs, and maintained this performance at 2000 epochs.
The 3D CNN model demonstrates high learning capacity, quick convergence, and strong resistance to overfitting across multiple training runs.
Table 5 shows how well each model matches the expected sign-gesture output after training for 500 epochs. The 3D CNN consistently outperforms the LSTM, especially for dynamic gestures such as “again”, “angry”, and “online”.
The results show a clear performance difference between the LSTM and 3D CNN models in recognizing dynamic sign language gestures. Although both models were trained for 500 epochs under the same conditions, the 3D CNN performed significantly better than the LSTM on nearly all tested signs, achieving a match rate above 94% for four of the five signs, with the highest at 98% for the sign “online”. In contrast, the LSTM recognized only two signs, “angry” (95.48%) and “online” (95%), and failed to recognize “again”, “appointment”, or “work” (0%). The near-zero detection of “work” by both models points to gesture ambiguity or insufficient representative samples in the dataset rather than a failure of the underlying architectures, and it indicates a need for further dataset refinement. These results highlight the LSTM’s weakness in modeling the complex spatiotemporal patterns of dynamic gestures, potentially due to its sequential processing and lack of explicit spatial feature extraction. The 3D CNN, which extracts spatial and temporal features simultaneously, shows higher consistency and stability and is therefore the more reliable architecture for dynamic sign recognition systems.
Table 6 compares the recognition accuracy of three key dynamic gestures (“appointment”, “useless”, and “worry”) using the STT-LSTM and 3D CNN models. The 3D CNN consistently outperformed the STT-LSTM across all three gestures. “Appointment” had the highest recognition accuracy with the 3D CNN (98.96%), followed closely by the STT-LSTM (96.99%). “Useless” showed a marked improvement with the 3D CNN (98.75%) compared to the STT-LSTM (91.74%), indicating better feature extraction by the 3D convolutional approach for this gesture. “Worry” had the lowest overall recognition performance, with the STT-LSTM at 93.28% and the 3D CNN at 96.97%, though the CNN still achieved higher accuracy.
For static gestures, the STT model exhibited high accuracy at 500 and 1000 epochs. For example, the signs “teacher”, “correct”, and “good” consistently achieved precision, recall, and F1-scores of 1.0. The sign “blame” improved from 50% to 100% precision between 500 and 1000 epochs, while “bad” reached an F1-score of 0.80 at 500 epochs. However, at 2000 epochs, the STT model showed a dramatic decline in macro-F1 score (0.088). This instability confirms the model’s susceptibility to overfitting under extended training, so 1000 epochs were identified as the optimal training duration for the STT subsystem.
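For reference, the per-class and macro-averaged metrics quoted above (precision, recall, F1, macro-F1) can be reproduced with scikit-learn as in the short sketch below; the label arrays are placeholders, not data from the study.

```python
# Illustrative computation of per-class precision/recall/F1 and the macro-F1
# score used to summarize the STT model; the labels below are placeholders.
from sklearn.metrics import classification_report, f1_score

y_true = ["teacher", "teacher", "blame", "bad", "danger", "blame"]
y_pred = ["teacher", "teacher", "blame", "teacher", "danger", "bad"]

print(classification_report(y_true, y_pred, zero_division=0))
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```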
Table 7 provides a quantitative comparison of the training performance of STT and DIST over 500, 1000, and 2000 epochs. STT performs well at 500 and 1000 epochs but drops sharply at 2000 epochs, likely due to overfitting or weight divergence, whereas DIST maintains high performance across all epochs, indicating better generalization and stability.
The STT subsystem exhibits several notable strengths. It achieves excellent performance on static signs, with stable convergence observed up to 1000 training epochs. The model has relatively low computational requirements, making it well suited for deployment on IoT edge devices, and it integrates effectively with MediaPipe-derived landmark features. However, several limitations were also identified. The model is sensitive to overtraining, and signs with limited representation in the dataset (e.g., “blame”, “bad”) show higher variability in precision. Its performance depends strongly on dataset size and class balance, and the architecture is not well suited to recognizing dynamic, temporally complex gestures.
The STT system (LSTM-based) performed well on static gestures at 500–1000 epochs but failed at 2000 epochs due to overfitting and recurrent weight divergence. In contrast, the DIST system (3D CNN-based) maintained stable, near-perfect performance across all epochs, effectively capturing the temporal motion in dynamic gestures. This contrast underscores the value of using the LSTM for static signs and the 3D CNN for dynamic signs, supporting the hybrid architecture of the proposed IoT-enabled Sign-to-Text framework.
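As an illustration of how such a hybrid front-end might route each input to the appropriate branch, the sketch below uses a simple motion-energy heuristic over the landmark sequence. The paper does not specify the routing rule, so the heuristic and its threshold are assumptions rather than the authors’ method.

```python
# Hypothetical routing between the STT (LSTM) and DIST (3D CNN) branches based on
# frame-to-frame landmark motion; rule and threshold are illustrative assumptions.
import numpy as np

MOTION_THRESHOLD = 0.02  # assumed value; would be tuned on validation data

def is_dynamic(keypoint_sequence):
    """Treat a (T, 126) landmark sequence as dynamic if landmarks move enough."""
    frame_deltas = np.diff(keypoint_sequence, axis=0)
    return float(np.mean(np.abs(frame_deltas))) > MOTION_THRESHOLD

def classify_sign(keypoint_sequence, frame_volume, stt_lstm, dist_cnn):
    """Route static signs to the LSTM and dynamic signs to the 3D CNN."""
    if is_dynamic(keypoint_sequence):
        return dist_cnn.predict(frame_volume[np.newaxis, ...])   # DIST branch
    return stt_lstm.predict(keypoint_sequence[np.newaxis, ...])  # STT branch
```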
Although the DIST model achieved near-perfect classification accuracy across all evaluated epochs, these results should be interpreted in the context of a controlled acquisition environment, limited gesture vocabulary, and a small dataset size. The reported accuracy reflects feasibility and stability under constrained conditions rather than large-scale, real-world generalization.
5. Discussion
The evaluation of the hybrid system uncovers distinct strengths and limitations for both deep learning architectures. The LSTM model showed excellent performance for static gestures at lower training epochs, reaching up to 99.07% accuracy for signs such as “teacher”. This aligns with the LSTM’s ability to capture temporal patterns in short, structured sequences. However, its performance declined sharply when training was extended to 2000 epochs, highlighting the model’s sensitivity to prolonged training, which can lead to overfitting or gradient instability. Effective use of the LSTM in real-world applications therefore requires careful hyperparameter tuning, regularization, and early stopping.
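As a concrete example of such training regulation, the following Keras sketch combines dropout, early stopping, and learning-rate scheduling for the LSTM branch. The layer sizes, input shape, class count, and patience values are illustrative assumptions (the 0.2 dropout rate matches the value mentioned later in the text), not the authors’ exact configuration.

```python
# Sketch of training-regulation measures for the STT LSTM: dropout, early
# stopping, and learning-rate scheduling. Architecture and hyperparameters are
# illustrative assumptions, not the paper's exact configuration.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 126)),            # 30 frames x 126 landmark features (assumed)
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),   # number of static signs (assumed)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Stop long before the 2000-epoch regime where the LSTM collapsed
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                     restore_best_weights=True),
    # Halve the learning rate when validation loss plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=20),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=1000, callbacks=callbacks)
```

With restore_best_weights=True, the weights from the best validation epoch are retained even if training runs too long, directly guarding against the late-epoch collapse discussed above.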
In contrast, the 3D CNN model maintained stable and high performance across all evaluated epochs. Its capacity to simultaneously capture spatial and temporal features makes it especially suitable for recognizing dynamic gestures such as “again” and “appointment”, with near-perfect accuracy. Nevertheless, the increased computational demands of 3D CNNs must be considered when deploying on resource-limited devices like Raspberry Pi. Optimization techniques such as model pruning or quantization may be necessary to maintain real-time performance without sacrificing accuracy. A key finding is that combining the two models allows the system to intelligently select the appropriate architecture based on whether the gesture is static or dynamic. This hybrid approach not only boosts recognition accuracy but also balances computational efficiency, making it ideal for IoT-based assistive technologies. Future research may consider integrating attention mechanisms, transformer architectures, or federated learning to improve personalization and privacy.
Despite promising results, the system has some limitations. The LSTM model exhibited overfitting in later epochs, limiting its scalability without regularization. The 3D CNN, while accurate, is computationally demanding, making real-time use on low-power devices challenging. Additionally, the system currently only supports isolated gestures and is limited to a small dataset. Future work will focus on improving training stability through early stopping, compressing the 3D CNN for edge deployment, and expanding the gesture set with larger, multilingual datasets. We also plan to incorporate transformer-based models and extend the system to support continuous gesture recognition in natural communication environments.
This study focuses on comparative training stability and IoT deployment feasibility rather than large-scale generalization, providing a practical and transparent evaluation of hybrid deep learning architectures under real-world edge computing constraints.
5.2. IoT Deployment and Real-Time Performance
To assess practical feasibility, the proposed hybrid STT–DIST system was deployed and tested on a Raspberry Pi 4B (4 GB RAM), showing that real-time sign classification is possible on resource-limited IoT hardware.
Table 8 summarizes the deployment metrics obtained during continuous inference. The system achieves a frame rate of 12–15 fps, meeting real-time processing requirements, with an average inference time of approximately 65 ms per frame. Peak memory usage remained around 850 MB, while CPU utilization ranged between 72% and 85% under sustained operation. The deployed TensorFlow Lite models had compact sizes (5.8 MB for LSTM and 18.2 MB for the 3D CNN), enabling efficient edge deployment. Thermal measurements remained below 68 °C using passive cooling, indicating stable long-term operation.
The Raspberry Pi maintains stable real-time performance due to the optimized input pipeline (MediaPipe key points for LSTM, down-sampled video frames for 3D CNN) and lightweight model architecture. The LSTM quickly performs inference because of its small input space, while the 3D CNN, though heavier, stays within the device’s computational limits.
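The latency and frame-rate figures above can be measured with a short timing loop around the TensorFlow Lite interpreter, as in the self-contained sketch below; the stand-in model and iteration counts are illustrative, and in deployment the converted STT/DIST models would be loaded instead.

```python
# Sketch of measuring per-frame inference time and frame rate with the TFLite
# interpreter. A small stand-in model keeps the example self-contained; on the
# Raspberry Pi, the deployed .tflite files would be loaded from disk instead.
import time
import numpy as np
import tensorflow as tf

stand_in = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 126)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(10, activation="softmax"),
])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(stand_in).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])
latencies = []
for _ in range(220):                       # first iterations serve as warm-up
    start = time.perf_counter()
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
    latencies.append(time.perf_counter() - start)

mean_ms = 1000 * float(np.mean(latencies[20:]))   # discard warm-up iterations
print(f"mean inference time: {mean_ms:.2f} ms  (~{1000 / mean_ms:.1f} fps)")
```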
Model Compression and Optimization Techniques: To sustain high performance on low-power IoT devices, the following model compression and optimization strategies were implemented.
Quantization: Both models were converted to TensorFlow Lite using 8-bit integer quantization (a minimal conversion sketch is given after this list). This process reduced model size by approximately 40–60% and sped up inference by about 30%, with no more than a 1% decrease in classification accuracy. These improvements are essential for deployment on embedded platforms with limited storage and computing resources.
Pruning: Unstructured pruning was incorporated during training to remove low-magnitude weights. This reduced the computational and memory footprints, improved inference speed on ARM-based CPUs, and contributed to smoother optimization dynamics, particularly for the LSTM model, by mitigating redundancy in the recurrent layers and stabilizing training.
Lightweight Architectures: The baseline network designs were tailored explicitly for edge deployment. The 3D CNN uses relatively shallow 3D convolutional layers compared to full C3D or I3D architectures, and the input dimensionality is reduced to 32 × 32 frames. Similarly, the LSTM model employs minimal recurrent depth. These design choices together ensure that both models operate within the thermal, computing, and memory constraints of the Raspberry Pi platform while maintaining discriminative capacity.
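As a concrete illustration of the 8-bit quantization step described in this list, the following sketch uses the standard TensorFlow Lite post-training quantization workflow. The stand-in dense classifier and random calibration data are placeholders (fully integer-quantizing recurrent or 3D-convolutional layers can require additional handling), so this is not the authors’ actual conversion script.

```python
# Minimal post-training int8 quantization with the TensorFlow Lite converter.
# Stand-in model and random calibration data keep the sketch self-contained;
# in practice the trained STT/DIST models and real samples are used instead.
import numpy as np
import tensorflow as tf

stand_in = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(126,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
calibration = [np.random.rand(1, 126).astype("float32") for _ in range(100)]

def representative_data_gen():
    # Representative inputs let the converter calibrate activation ranges
    # for full-integer quantization.
    for sample in calibration:
        yield [sample]

converter = tf.lite.TFLiteConverter.from_keras_model(stand_in)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
with open("model_int8.tflite", "wb") as f:   # hypothetical output name
    f.write(tflite_int8)
print(f"int8 model size: {len(tflite_int8) / 1024:.1f} KB")
```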
Comparative Behavior of LSTM and 3D CNN Models: The experimental results highlight distinct behaviors for the two architectures.
LSTM Model: The LSTM achieved over 99% accuracy for static gestures when trained for 500–1000 epochs. However, performance declined when training was extended to 2000 epochs, primarily due to overfitting, gradient saturation, and the relatively small dataset size. Stability can be improved through early stopping, dropout (0.2), and learning-rate scheduling, as well as pruning and quantization. Importantly, the sharp performance collapse of the LSTM model at 2000 epochs highlights its sensitivity to prolonged training on small datasets; this behavior, driven by overfitting, recurrent weight instability, and the absence of early stopping, represents a cautionary finding rather than a modeling failure, and it underscores the importance of careful training regulation for edge-based IoT deployments.
Three-Dimensional CNN Model: The 3D CNN model consistently achieved over 99% accuracy across all tested epochs and showed limited sensitivity to overfitting. This robustness stems from its strong spatio-temporal feature representations and its larger effective input space. The main drawback is its higher computational cost on IoT devices, which was mitigated through TensorFlow Lite conversion, lower frame resolution, and the lightweight architectural design described earlier.
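To illustrate the kind of lightweight 3D CNN described above, the sketch below stacks a few shallow Conv3D blocks over down-sampled 32 × 32 frames. The layer counts, filter sizes, clip length, and class count are assumptions for illustration, not the published DIST architecture.

```python
# Illustrative lightweight 3D CNN over down-sampled 32x32 frames, kept shallow
# relative to C3D/I3D. Clip length, filters, and class count are assumptions.
import tensorflow as tf

def build_dist_3dcnn(num_classes=10, clip_len=16):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(clip_len, 32, 32, 3)),
        tf.keras.layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D((1, 2, 2)),
        tf.keras.layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D((2, 2, 2)),
        tf.keras.layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_dist_3dcnn()
model.summary()  # roughly 70k parameters with these assumptions, far below C3D/I3D
```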
Despite dataset limitations and IoT hardware constraints, the proposed hybrid STT–DIST system effectively demonstrates real-time, on-device sign language recognition using LSTM for static signs and 3D CNN for dynamic gestures. By applying model compression techniques (quantization, pruning) and efficient feature extraction (MediaPipe), the system achieves high accuracy, low latency, and stable performance, demonstrating strong potential for scalable deployment in inclusive assistive technologies for the Deaf and hard-of-hearing communities.
While the system achieves real-time performance under controlled conditions, real-world deployment introduces additional challenges, including multi-user scenarios, partial hand occlusion, variations in lighting, motion blur, and hands moving outside the camera frame. These factors may degrade recognition accuracy and temporal tracking, particularly for continuous signing. Addressing these challenges will require adaptive gesture segmentation, multi-person detection, and robustness-enhancing data augmentation strategies.