Article

Improving Dynamic Gesture Recognition with Attention-Enhanced LSTM and Grounding SAM

School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(9), 1793; https://doi.org/10.3390/electronics14091793
Submission received: 28 March 2025 / Revised: 22 April 2025 / Accepted: 23 April 2025 / Published: 28 April 2025

Abstract

Dynamic gesture detection is a key topic in computer vision and deep learning, with applications in human–computer interaction and virtual reality. However, traditional methods struggle with long sequences, complex scenes, and multimodal data, facing issues such as high computational cost and background noise. This study proposes an Attention-Enhanced dual-layer LSTM (Long Short-Term Memory) network combined with Grounding SAM (Grounding Segment Anything Model) for gesture detection. The dual-layer LSTM captures long-term temporal dependencies, while a multi-head attention mechanism improves the extraction of global spatiotemporal features. Grounding SAM, composed of Grounding DINO for object localization and SAM (Segment Anything Model) for image segmentation, is employed during preprocessing to precisely extract gesture regions and remove background noise. This enhances feature quality and reduces interference during training. Experiments show that the proposed method achieves 96.3% accuracy on a self-constructed dataset and 96.1% on the SHREC 2017 dataset, outperforming several baseline methods by an average of 4.6 percentage points. It also demonstrates strong robustness under complex and dynamic conditions. This approach provides a reliable and efficient solution for future dynamic gesture-recognition systems.

1. Introduction

With the rapid advancement of technology, dynamic gesture detection has shown broad application prospects across various fields. In virtual reality (VR) and augmented reality (AR) especially, users can interact with virtual environments through gestures, achieving a more natural and immersive experience. Furthermore, dynamic gesture detection has also proven to be highly effective in human–computer interaction scenarios such as smart homes and automotive control, enabling users to operate devices with simple gestures without the need to touch screens or physical buttons. For example, drivers can control in-car systems via gestures, simplifying operations and enhancing driving safety. More importantly, gesture recognition plays a crucial role in health monitoring and rehabilitation, allowing for the real-time tracking of patients’ progress and assisting in therapy and rehabilitation exercises. Additionally, in the security sector, gesture-detection technology is being used to identify potential threats and abnormal behavior, thus improving public safety. Therefore, dynamic gesture detection is not only an important topic in academic research but also an essential tool in driving the development of an intelligent, future-oriented society.
Dynamic gesture [1,2] detection is a pivotal research area within computer vision and deep learning, with broad applications in cutting-edge fields such as human–computer interaction and virtual reality. Traditional gesture-recognition methods primarily rely on image classification techniques and basic temporal modeling approaches, achieving limited success. However, with the advent of deep learning technologies, particularly the application of LSTM [3] networks and Transformer [4] architectures, dynamic gesture recognition has made transformative progress. Despite these advancements, existing methods still encounter substantial challenges, particularly when handling long time sequences, complex environments, and multimodal data. These challenges include high computational complexity, inefficient training processes, and sensitivity to background noise, all of which necessitate further optimization.
In dynamic gesture-detection tasks, a critical challenge is the effective extraction of spatiotemporal features [5] from video data, especially during the preprocessing phase, where removing background noise is essential for ensuring the clarity of the gesture regions. Additionally, modeling long-term dependencies in long time sequences is crucial for accurately capturing the dynamic changes of gestures, especially in complex environments. Moreover, optimizing the computational efficiency and memory usage of models while maintaining high accuracy, reducing computational complexity, and enhancing real-time recognition capabilities remains an essential challenge in contemporary research.
To address these challenges, this study proposes an innovative dynamic gesture-detection method that integrates the strengths of LSTM and the multi-head attention mechanism [6]. Initially, LSTM is employed to model temporal dependencies, capturing the long-term dynamic changes of gestures within video data. Subsequently, the Transformer's multi-head attention mechanism efficiently extracts global temporal features and strengthens the model's ability to capture global time dependencies, thereby improving both the accuracy and efficiency of gesture recognition. Furthermore, this study incorporates Grounding SAM (Grounding Segment Anything Model) [7,8] for data preprocessing. Grounding SAM consists of Grounding DINO and SAM, where Grounding DINO is used for precise object localization and SAM is responsible for image segmentation, effectively extracting gesture regions and removing background noise. By combining the capabilities of both localization and segmentation, this approach not only optimizes the feature-extraction process but also significantly reduces the impact of background interference [9] on model training, ensuring high accuracy and system stability in complex environments [10].
To address the limitations of existing methods, this study makes the following contributions:
1.
A text-guided Grounding SAM preprocessing framework is proposed for accurate gesture-region localization and segmentation. By binarizing the extracted features, the method effectively reduces computational complexity while preserving essential spatial information, improving efficiency.
2.
A hybrid recognition model that combines multi-head attention mechanisms with LSTM is introduced to enhance the model’s ability to focus on key hand regions and capture the temporal evolution of gestures. This significantly improves gesture-recognition accuracy and temporal modeling capability in dynamic scenarios.

2. Related Work

Dynamic gesture recognition plays a vital role in human–computer interaction (HCI) and computer vision, aiming to enable more natural and efficient interactions by analyzing continuous hand movements, such as waving, fist-clenching, and sliding [11]. As virtual reality (VR), augmented reality (AR), smart home systems, and robotics continue to advance, the application domains for dynamic gesture recognition have expanded, encompassing areas like smart device control, surgical robotics, and communication aids for the hearing-impaired. Despite these advancements, several challenges persist. Gestures inherently involve complex temporal dynamics, requiring models to accurately capture the entire gesture sequence and differentiate between similar movements [12]. Additionally, variations in gesture speed, amplitude, and style can significantly impact a model’s ability to generalize. The presence of external factors such as cluttered backgrounds, lighting shifts, and hand occlusion further complicates the recognition task [13,14]. As a result, improving the robustness and adaptability of gesture-recognition systems remains a significant challenge in the field.
Early dynamic gesture-recognition methods predominantly relied on handcrafted features, such as trajectories, speed, and shape descriptors, in combination with traditional machine learning techniques like Hidden Markov Models (HMMs) [15] and Dynamic Time Warping (DTW) [16]. For instance, Lu et al. [17] utilized HMMs to model the temporal dependencies in gesture trajectories. However, these methods were heavily dependent on manually designed features, making them less effective at handling complex gestures and limiting their ability to generalize. With the advent of deep learning, Köpüklü et al. [18] introduced a dual-stream CNN-RNN architecture, where CNNs extract spatial features, and RNNs model temporal dynamics, resulting in significant improvements in recognition accuracy. However, RNNs are prone to vanishing gradients, making it difficult to capture long-term dependencies, while CNNs focus on local features and fail to capture global spatiotemporal relationships. To address these challenges, LSTM-based models were introduced, with Min et al. [19] proposing an LSTM-based framework that enhanced the model’s ability to handle long sequences. However, LSTM models are computationally intensive and lack the flexibility needed to focus on key time segments, which affects real-time performance. More recently, Zhang et al. [20] integrated the attention mechanism with LSTM to dynamically adjust the importance of different time steps, thereby improving the recognition of intricate gestures.
By 2023, the widespread adoption of Transformer architectures marked a significant step forward in gesture recognition. Hampiholi et al. [21] proposed a Transformer-based multimodal fusion model that combines visual and depth data, leading to a marked improvement in complex gesture recognition. This model employs a self-attention mechanism to capture global spatiotemporal relationships while reducing the computational demands of RNNs and LSTMs in long-sequence processing. Additionally, Slama et al. [22] introduced Graph Neural Networks (GNNs) to model spatial dependencies within gestures, which, when combined with Transformers, enhances the model's adaptability across diverse users and environments. Although these methods represent significant progress, challenges persist in optimizing real-time performance while mitigating the impact of external factors like background complexity and lighting variations. Therefore, advancing the model's ability to capture finer gesture features and long-term dependencies without sacrificing computational efficiency remains a key area of ongoing research.
This study presents the “Attention-Enhanced LSTM with Grounding SAM” model to improve dynamic gesture recognition performance. By combining Grounding SAM [8], multi-head attention mechanisms [6], and LSTM [3], the model enhances its ability to capture temporal dynamics in gesture sequences. Grounding SAM first isolates the gesture regions, reducing background interference and irrelevant information, which helps clarify the key features. The multi-head attention mechanism then dynamically weights these features, focusing on the most important time intervals to capture subtle gesture transitions more effectively. LSTM models the long-term temporal dependencies, enabling the system to learn the global evolution of gestures and track changes over extended periods. This approach not only mitigates the impact of background noise but also improves adaptability to variations in user style and gesture execution speed, thereby enhancing generalization. The experimental results show that the Attention-Enhanced LSTM with Grounding SAM model achieves superior robustness in complex environments, significantly boosts recognition accuracy, and ensures stability in dynamic settings.

3. Methods

As shown in Figure 1, first, the input text information is used by Grounding DINO to locate the target in the image and draw bounding boxes. Then, SAM performs segmentation within these regions, and the segmented result is processed and converted into a binary image. Next, the dynamic gesture images are also binarized. The binarized images are then processed by ViT for feature extraction, with position embeddings and token embeddings added, followed by multi-head attention and Add/Norm layers. The processed features are then passed through a two-layer LSTM, with a dropout layer in between, and finally classified through the Softmax layer.

3.1. Grounding SAM Data Preprocessing

Grounding SAM is a composite system that integrates Grounding DINO and Segment Anything Model (SAM) to enable text-driven object detection and high-precision image segmentation. The pipeline begins with Grounding DINO [23], which takes an image and a natural language description as input to generate bounding boxes corresponding to the described objects. These bounding boxes are then passed as prompts to SAM [7], which produces pixel-level segmentation masks for the detected regions. By combining language-guided detection with versatile segmentation capabilities, Grounding SAM forms a streamlined workflow from “text description” to “object localization” to “precise segmentation”, significantly improving annotation efficiency and segmentation accuracy—especially beneficial for large-scale data labeling and complex scene understanding. The workflow of Grounding SAM is shown in Figure 2.
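To make the pipeline concrete, the sketch below wires the text-to-box-to-mask workflow together in Python. It assumes the publicly released segment_anything package and a local SAM checkpoint (sam_vit_h.pth, an assumed path); the detect_gesture_boxes helper is a hypothetical placeholder for a Grounding DINO inference call, since the exact detection interface depends on the Grounding DINO release being used.
```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor


def detect_gesture_boxes(image_rgb: np.ndarray, text_prompt: str) -> np.ndarray:
    """Hypothetical wrapper around Grounding DINO inference.

    Should return an (N, 4) array of XYXY pixel-coordinate boxes for regions
    matching text_prompt (e.g., "hand"). Replace with the inference utilities
    shipped with your Grounding DINO installation.
    """
    raise NotImplementedError


def segment_gestures(image_rgb: np.ndarray, text_prompt: str = "hand") -> list:
    # Step 1: language-guided detection, text prompt -> bounding boxes.
    boxes = detect_gesture_boxes(image_rgb, text_prompt)

    # Step 2: box-prompted segmentation with SAM (checkpoint path is an assumption).
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # expects an HxWx3 uint8 RGB image

    masks = []
    for box in boxes:
        m, _, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(m[0].astype(np.uint8))  # HxW mask: 1 = gesture, 0 = background
    return masks
```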
The mask generation process in SAM relies on its powerful visual understanding and multi-level feature-extraction mechanisms. Initially, the model extracts multi-scale features through deep neural networks and employs a self-attention mechanism to analyze the semantic and edge features of the image. Building on this, SAM integrates user-provided text prompts or detection boxes through cross-modal alignment, accurately determining target regions. It then utilizes a Transformer architecture to model both global and local information, achieving high-precision foreground segmentation. Finally, the model generates a binary mask with dimensions matching the input image, where a value of 1 corresponds to the gesture region and 0 to the background. Specifically, M(x, y) = 1 denotes the gesture, while M(x, y) = 0 indicates the background. The preprocessing workflow of dynamic gesture sequences is shown in Figure 3, which illustrates the steps of using Grounding DINO for object detection, fine segmentation with SAM, extracting the hand-region mask, and performing binarization to highlight key features. This approach effectively retains essential information while reducing computational cost and improving processing efficiency.
Using this mask, we further refine the extraction of the gesture region to effectively remove background noise. Specifically, the non-mask regions in the original gesture image are set to black, ensuring that only a clear gesture outline remains while the background is completely eliminated, thereby reducing the interference of background noise in subsequent tasks. To further enhance gesture recognition, the processed image undergoes binarization, converting the gesture region to pure white while keeping the background black, thereby creating a high-contrast foreground outline. This processing significantly improves the clarity of gesture contours, making features more distinct and optimizing feature extraction [24] and deep learning training.
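As an illustration of the masking and binarization steps just described, the following NumPy sketch (assuming a 0/1 gesture mask such as the one produced by SAM above) blacks out the non-mask pixels and then converts the gesture region to pure white, yielding the high-contrast single-channel frames used for training.
```python
import numpy as np


def mask_and_binarize(frame_bgr: np.ndarray, mask: np.ndarray):
    """Apply a 0/1 gesture mask to a frame and produce a binary image.

    frame_bgr : HxWx3 uint8 frame from the original gesture video.
    mask      : HxW array where 1 marks the gesture region and 0 the background.
    """
    # Remove background noise: set every pixel outside the gesture mask to black.
    foreground = frame_bgr.copy()
    foreground[mask == 0] = 0

    # Binarize: the gesture region becomes pure white (255), the background stays black (0).
    binary = (mask > 0).astype(np.uint8) * 255
    return foreground, binary
```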
By integrating the target-detection capability of Grounding DINO with the precise segmentation of SAM [8,25], we improve both the speed and precision of data annotation while minimizing background noise during preprocessing. This approach significantly boosts the performance and reliability of dynamic gesture recognition, enabling more accurate recognition in complex environments [26].

3.2. Network Model

Figure 4 illustrates the main process of the dynamic gesture-recognition model. The input gesture consists of five binarized images P_i, where each image is divided into 2 × 3 patches (six in total) and processed by ViT to extract the feature matrix Z_i. Subsequently, Z_i undergoes position encoding, multi-head attention, and Add/Norm for feature enhancement, resulting in Z_i'. The enhanced features are then sequentially fed into a two-layer LSTM for temporal modeling, where the first LSTM layer captures inter-frame dependencies, and the second LSTM layer further learns global temporal features. Finally, at t = 4, the second LSTM layer predicts the gesture class and outputs the final recognition result.
The model focuses on deep modeling of temporal features in dynamic gesture sequences, aiming to significantly improve recognition accuracy, enhance robustness, and optimize the model’s generalization capabilities. Specifically, the method first decomposes the dynamic gesture video into multiple keyframes (this study uses 5 frames) to capture the critical phases of gesture movements and reduce redundant computations. Subsequently, Vision Transformer (ViT) [4,27] is utilized to extract both local detailed features and global contextual information from each frame, while integrating token embedding and positional encoding techniques to further enhance the expressive power of the features, ensuring the model can effectively distinguish subtle differences between similar gestures. Next, the spatial correlations within and between frames are modeled using multi-head attention [9], capturing the spatial structural information of the gesture movements. The Add Norm operation (residual connection and layer normalization) is applied to stabilize the training process and prevent issues such as gradient vanishing or explosion. Finally, an LSTM network [3] is used to capture the temporal evolution of gesture movements, modeling their dynamic patterns. The output is then processed by a fully connected layer to perform the classification task. This design not only effectively addresses the gradient vanishing problem in traditional methods for long-sequence modeling but also enhances the model’s focus on keyframes through the attention mechanism, thereby demonstrating higher robustness and generalization performance in complex scenarios [28]. The model proceeds through the following steps:
Given a dynamic gesture video V, we uniformly sample five keyframes to form an input sequence X = {x_1, x_2, …, x_5}, where each frame x_i ∈ R^{H×W} is a single-channel binary image (pixel values are 0 for the background and 1 for the gesture region). This sampling captures the gesture's start, transition, peak, and end phases, preserving temporal information while reducing redundancy. Binarization simplifies the image, emphasizing gesture contours, reducing computational complexity, and enhancing robustness to noise and background interference. The sequence X provides the spatiotemporal features required for efficient and accurate gesture recognition.
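The uniform five-frame sampling can be expressed in a few lines; the sketch below assumes the binarized frames of one gesture video are already stacked into a NumPy array.
```python
import numpy as np


def sample_keyframes(frames: np.ndarray, num_keyframes: int = 5) -> np.ndarray:
    """Uniformly sample keyframes from a binarized gesture sequence.

    frames : (T, H, W) array of 0/1 frames for one gesture video.
    Returns a (num_keyframes, H, W) array spanning the start, transition,
    peak, and end phases of the gesture.
    """
    total_frames = frames.shape[0]
    indices = np.linspace(0, total_frames - 1, num_keyframes).round().astype(int)
    return frames[indices]
```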
In Vision Transformer (ViT), the input image needs to be converted into a sequence of fixed-size patches for processing. Specifically, each frame x_i ∈ R^{H×W} is first divided into N non-overlapping patches of size P × P. The number of patches N is calculated by the following formula.
N = \frac{H \times W}{P^{2}}
In this context, H and W represent the image’s height and width, respectively, while P is the side length of each patch. Each patch is treated as an individual token and is transformed into a feature vector via a linear projection.
z_i^{0} = W_e \cdot P(x_i), \quad z_i^{0} \in \mathbb{R}^{N \times D}
In this equation, P(x_i) denotes the matrix obtained by dividing the image x_i into patches, and W_e is a learnable projection matrix that maps each patch into a D-dimensional feature space, where D is the hidden dimension of the Transformer. Through this step, the image is converted into a sequence of tokens, which are then fed into the Transformer encoder for further processing. The purpose of patch embedding is to transform the spatial information of the image into serialized data suitable for Transformer processing, while mapping the low-dimensional pixel space into a high-dimensional feature space through linear projection, thereby enhancing the model's ability to represent image features. This approach not only preserves the local structural information of the image but also provides a foundation for subsequent global modeling.
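In PyTorch terms, the patch partition and linear projection amount to reshaping each frame into N = (H × W)/P² flattened patches and applying the learnable projection W_e; the sketch below assumes square P × P patches and an illustrative hidden dimension D.
```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split a single-channel frame into P x P patches and project each to D dims."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 256):
        super().__init__()
        self.patch_size = patch_size
        # Linear projection W_e applied to each flattened patch of P*P pixels.
        self.proj = nn.Linear(patch_size * patch_size, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W) binarized frames; N = (H * W) / P^2 patches per frame.
        B, H, W = x.shape
        P = self.patch_size
        patches = x.unfold(1, P, P).unfold(2, P, P)  # (B, H/P, W/P, P, P)
        patches = patches.reshape(B, -1, P * P)      # (B, N, P*P) flattened patches
        return self.proj(patches)                    # (B, N, D) token sequence z^0
```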
Given an input dynamic gesture sequence consisting of five frames, each frame x_i ∈ R^{H×W} is processed by a Vision Transformer (ViT) to obtain patch embeddings, mapping the input into a high-dimensional feature space.
Z_i^{0} = [\,\mathrm{CLS};\; z_{i1}^{0}, z_{i2}^{0}, \ldots, z_{iN}^{0}\,], \quad Z_i^{0} \in \mathbb{R}^{(N+1) \times D}
where the CLS token serves as a global representation, N is the number of patches, and D is the feature-embedding dimension. All frame tokens are concatenated, and trainable positional embeddings are introduced to enhance temporal modeling, ensuring that position information is explicitly encoded into the feature representation.
Z_i = Z_i^{0} + Z_{\mathrm{pos}}, \quad i \in [1, 5]
In the multi-head self-attention (MHA) mechanism, the input features are projected into the query, key, and value spaces to capture global dependencies across frames.
Q = Z W_Q, \quad K = Z W_K, \quad V = Z W_V
A = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{D}} \right) V
After the multi-head self-attention (MHA) computation [29], the outputs from multiple attention heads are concatenated and projected back into a unified feature space, expressed as MHA(Z) = Concat(head_1, …, head_h) W_O. Subsequently, a residual connection and layer normalization (LayerNorm) are applied [30], where the input is added to the MHA output and then normalized to enhance gradient stability and improve feature expressiveness, formulated as Z' = LayerNorm(Z + MHA(Z)). This modeling approach effectively reinforces the spatiotemporal dependencies of dynamic gestures, enabling more robust and expressive feature representations [31]. As shown in Figure 5, the multi-head attention mechanism projects input features into query (Q), key (K), and value (V) matrices. Multiple attention heads capture different information, and the outputs are concatenated and linearly transformed to enhance feature representation.
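In implementation terms, this block corresponds to a multi-head self-attention layer followed by a residual connection and LayerNorm; a minimal PyTorch sketch (with illustrative embedding size and head count) is given below.
```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Multi-head self-attention followed by a residual connection and LayerNorm."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, N+1, D) token sequence with positional embeddings already added.
        attn_out, _ = self.mha(z, z, z)   # Q, K, V are all projected from z
        return self.norm(z + attn_out)    # Z' = LayerNorm(Z + MHA(Z))
```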
After applying LayerNorm, the feature representations are fed into a two-layer bidirectional LSTM [32] to model the temporal dependencies of dynamic gestures [28]. A Dropout layer is introduced between the two LSTM layers to prevent overfitting and improve generalization. The final output of the LSTM layers integrates both forward and backward temporal information, enhancing the robustness and expressiveness of gesture recognition. The feature sequence Z' obtained from the multi-head attention mechanism serves as the input to the first LSTM layer. The hidden state output from the first LSTM layer undergoes Dropout to reduce overfitting. The dropout-processed hidden states serve as the input to the second LSTM layer [3,20,21].
h_t^{(1)}, c_t^{(1)} = \mathrm{LSTM}^{(1)}\big( Z'_t,\; h_{t-1}^{(1)},\; c_{t-1}^{(1)} \big)
\tilde{h}_t^{(1)} = \mathrm{Dropout}\big( h_t^{(1)} \big)
h_t^{(2)}, c_t^{(2)} = \mathrm{LSTM}^{(2)}\big( \tilde{h}_t^{(1)},\; h_{t-1}^{(2)},\; c_{t-1}^{(2)} \big)
In these equations, h̃_t^(1) is the Dropout-processed hidden state of the first LSTM layer, which serves as the input to the second LSTM layer, and h_{t-1}^(k) and c_{t-1}^(k) denote the hidden state and cell state of layer k at the previous time step. The second LSTM layer uses h̃_t^(1) to compute the current hidden state h_t^(2) and cell state c_t^(2), further extracting temporal features [33].
Figure 6 demonstrates the structure of a two-layer LSTM, where a dropout layer is incorporated between the two LSTM layers to mitigate overfitting and improve generalization. The first LSTM layer processes the input features, followed by the dropout layer, which randomly drops certain connections to enhance model robustness. The second LSTM layer refines the output from the dropout layer, effectively capturing temporal dependencies within the sequence.
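The two-layer LSTM with inter-layer Dropout and a Softmax classifier can be sketched as follows; layer sizes and the dropout rate are assumptions rather than the exact training configuration, and nn.LSTM with num_layers=2 applies dropout to the outputs of the first layer only, matching the structure in Figure 6.
```python
import torch
import torch.nn as nn


class TemporalHead(nn.Module):
    """Two-layer bidirectional LSTM with inter-layer Dropout and a Softmax classifier."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128,
                 num_classes: int = 12, dropout: float = 0.5):
        super().__init__()
        # With num_layers=2, dropout is applied between the first and second LSTM layers.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True,
                            dropout=dropout, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, feat_dim) per-frame features from the attention block.
        out, _ = self.lstm(z)                    # (B, T, 2 * hidden_dim)
        logits = self.fc(self.drop(out[:, -1]))  # classify from the final time step
        return torch.softmax(logits, dim=-1)     # class probabilities
```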
To offer a clearer insight into our hybrid model, we present a streamlined pseudocode that outlines its key components and workflow.
Algorithm 1 outlines the core pipeline of our hybrid model for dynamic gesture recognition. This model integrates Transformer layers for spatial feature extraction, multi-head attention mechanisms for enhancing representation learning, and LSTM layers for temporal modeling  [33]. The pseudocode provides a high-level overview of how these components work together to achieve robust performance.
Algorithm 1 Algorithm for dynamic gesture recognition
Input: a sequence of gesture images
Output: the final gesture classification
Initialize: Transformer layers, positional encodings, token encodings, and two LSTM layers
for each input sequence do
    # Transformer processing
    Apply token encoding and positional encoding to the input
    for each Transformer layer do
        Apply multi-head attention
        Apply Add and Norm (residual connection and layer normalization)
    end for
    # LSTM processing
    Apply the first LSTM layer
    Apply Dropout
    Apply the second LSTM layer
    # Output processing
    Take the output of the second LSTM layer at the final time step
    Apply Dropout
    Apply Softmax to obtain class probabilities
end for
return the final classification
  • Key Steps:
  • Input Processing: The input is a sequence of images representing a dynamic gesture. Each frame is encoded using token and positional encoding, which allows the model to incorporate spatial structure and temporal order.
  • Transformer Stage: The encoded sequence is passed through Transformer layers. In each layer, multi-head attention enables the model to attend to different parts of the sequence in parallel. This is followed by residual connections and layer normalization (Add/Norm), which help stabilize training and improve gradient flow [34].
  • LSTM Stage: The output of the Transformer is then fed into a two-layer LSTM network, which captures long-term temporal dependencies across the sequence. Dropout is applied between and after the LSTM layers to prevent overfitting and enhance generalization.
  • Prediction: The final LSTM output is passed through a Softmax layer to produce the classification result.
This algorithm highlights the synergy between Transformer-based spatial modeling and LSTM-based temporal modeling, forming a highly adaptable and accurate solution for gesture recognition in real-world environments.
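Putting the pieces together, the condensed sketch below shows how these components can be assembled into a single forward pass. It reuses the PatchEmbedding, AttentionBlock, and TemporalHead sketches defined earlier in this section, and all layer sizes, the patch grid, and the class count are illustrative assumptions rather than the authors' exact configuration.
```python
import torch
import torch.nn as nn

# PatchEmbedding, AttentionBlock, and TemporalHead are the sketches defined earlier.


class GestureRecognizer(nn.Module):
    """End-to-end sketch: per-frame patch tokens -> attention -> two-layer LSTM head."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 256,
                 num_classes: int = 12, num_patches: int = 6):
        super().__init__()
        self.patch_embed = PatchEmbedding(patch_size, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.attn = AttentionBlock(embed_dim)
        self.head = TemporalHead(embed_dim, num_classes=num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, H, W) binarized keyframes, T = 5.
        B, T, H, W = frames.shape
        per_frame = []
        for t in range(T):
            tokens = self.patch_embed(frames[:, t].float())       # (B, N, D)
            cls = self.cls_token.expand(B, -1, -1)                # (B, 1, D)
            z = torch.cat([cls, tokens], dim=1) + self.pos_embed  # add positional embeddings
            z = self.attn(z)                                      # Z' = LayerNorm(Z + MHA(Z))
            per_frame.append(z[:, 0])                             # CLS token as the frame feature
        sequence = torch.stack(per_frame, dim=1)                  # (B, T, D)
        return self.head(sequence)                                # gesture class probabilities
```
With the 2 × 3 patch grid of Figure 4, num_patches is 6 and each frame contributes one CLS feature to the five-step temporal sequence fed into the LSTM head.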

4. Experimental Design and Results Analysis

4.1. Introduction to the SHREC 2017 Dataset and Our Dataset

To evaluate the performance of the proposed dynamic gesture-detection method, experiments were conducted on two datasets: the SHREC 2017 dataset and our dataset. SHREC 2017 is a widely used public benchmark for dynamic hand-gesture recognition, consisting of 2800 gesture sequences collected from 28 participants. It includes 14 different gesture categories, each performed under three different scenarios: “arm straight”, “arm bent”, and “random”. The dataset provides depth map sequences recorded using the Intel RealSense camera, offering both spatial and temporal variations, and is well-suited for evaluating the robustness of gesture-recognition algorithms in real-world settings.
In addition, our dataset was developed to simulate more complex and varied application environments. The dataset contains 3600 dynamic gesture sequences across 12 gesture categories, collected from 30 participants of different ages and genders under diverse background settings and lighting conditions. Each gesture was captured from multiple viewpoints using a depth camera, and the sequences were manually labeled. This dataset emphasizes real-world variability, background clutter, and lighting diversity, making it particularly valuable for testing the generalization and adaptability of dynamic gesture recognition models.

4.2. Ablation Study

To assess the contribution of each component in the proposed dynamic gesture-detection model, a series of ablation experiments were conducted by systematically removing or altering key components. The results are summarized in Table 1.
In the full configuration with no components removed, the model achieved an accuracy of 96.3%. When the Transformer layers were removed, performance dropped significantly to 87.5%, highlighting the essential role of the multi-head attention mechanism in capturing global dependencies and enhancing gesture recognition. Removing the LSTM layers resulted in an accuracy of 89.2%, emphasizing the importance of temporal dependency modeling in dynamic gestures.
Additionally, removing the Token Encoding component led to an accuracy of 91.4%, showing that while helpful, it is not as critical as other components. Similarly, removing the Positional Encoding slightly reduced the accuracy to 92.0%, indicating its role in capturing temporal relationships between gestures. The removal of Dropout led to overfitting, with accuracy dropping to 90.1%, demonstrating the importance of regularization. Finally, removing the Softmax layer resulted in a slight accuracy drop to 93.0%, confirming its role in normalizing output for classification.
These results emphasize the importance of each component, particularly the Transformer, LSTM layers, and Dropout regularization, in achieving high performance in dynamic gesture recognition. The experiment also highlights the robustness of the model.

4.3. Performance Comparison on SHREC 2017 and Our Dataset

In Figure 7 and Table 2, we present the performance of different models on the SHREC 2017 Dataset [35] and our own dataset, including Swin [36], ViViT [37], SlowFast [38], X3D [39], SFT [40], and our proposed method (Ours). These results clearly highlight the disparities in accuracy and mean average precision (mAP) between models.
On the SHREC 2017 Dataset, the performance of all models varies: some demonstrate more stability, while others fluctuate significantly on specific gesture categories. Our method achieves state-of-the-art performance, particularly excelling in both accuracy and robustness, clearly outperforming existing mainstream approaches.
On our own dataset, which features more complex environments and greater variability, the differences in model performance become even more apparent. Our method again demonstrates leading accuracy and mAP, showcasing strong adaptability and effectiveness in real-world dynamic gesture-recognition tasks.
To better understand the comparative models, we briefly describe them below:
  • Swin Transformer adopts a hierarchical structure with shifted windows to efficiently capture spatial representations.
  • ViViT extends Vision Transformers to video tasks by decomposing attention mechanisms for temporal and spatial modeling.
  • SlowFast employs a dual-pathway structure to model both slow and fast motion streams, well suited for capturing dynamic changes.
  • X3D expands 2D CNNs along spatial–temporal dimensions, offering a good balance between performance and efficiency.
  • SFT (Space-Time Fusion Transformer) integrates spatial and temporal features through Transformer modules tailored for hand-gesture tasks.
  • Ours combines Transformer encoding with dual-layer LSTM for sequential modeling, enhanced by multi-head attention and normalization layers, yielding superior performance across both benchmark and real-world datasets.
Overall, although performance varies across models and datasets, our method consistently delivers superior results across multiple metrics, confirming its robustness and generalization capabilities.

4.4. Model Performance Under Different Conditions

In Figure 8 and Table 3, we present the accuracy performance of various models under both interference and non-interference conditions as the observation distance increases. These experiments were conducted using our proprietary dataset to evaluate the adaptability of different models in real-world environments. The results clearly indicate that as the observation distance grows, the accuracy of all models gradually decreases. Under non-interference conditions, the performance of all models remains relatively stable, with our proposed method consistently leading in accuracy, particularly at longer distances, where it demonstrates a significant advantage over other models. In comparison, the accuracy of other models declines more sharply, especially at greater distances.
When interference is introduced, all models experience a drop in accuracy, with the interference negatively impacting performance. However, even under interference conditions, our method exhibits strong robustness, maintaining relatively higher accuracy with a smaller rate of decline as the distance increases, highlighting its superior adaptability. Despite the overall performance degradation caused by interference, our method continues to outperform others across the entire testing range, demonstrating its effectiveness and stability in complex environments.
Our model outperforms existing mainstream methods due to the combination of several innovations and optimizations. Firstly, by integrating the multi-head attention mechanism with a dual-layer LSTM, we are able to capture both spatial and temporal features while effectively modeling long-term and short-term dependencies, overcoming the limitations of traditional methods in handling long sequences. In addition, the introduction of Grounding SAM technology helps the model eliminate interference in complex backgrounds, improving gesture-recognition accuracy and stability. The combination of these technologies allows our model to excel in accuracy, robustness, and generalization ability. The experimental results show that our model maintains high performance, especially in long-distance observation and high-interference environments, demonstrating its strong adaptability. Therefore, our approach not only stands out in the current technological framework but also provides a promising solution for practical applications in dynamic gesture recognition.

4.5. Computational Cost and Deployment Considerations

Although the proposed hybrid model combines both Transformer and LSTM layers to enhance feature representation and temporal modeling, this architecture inevitably increases computational cost. To evaluate its feasibility for deployment in real-time or resource-constrained environments, we conducted an analysis of inference time and memory usage. Experiments were performed on a machine equipped with an NVIDIA RTX 3090 GPU and an Intel i7 CPU. The average inference time per gesture sequence is approximately 23.5 ms on GPU and 97.2 ms on CPU, indicating that the model is capable of near real-time performance. Additionally, the model contains around 12.4 million parameters and requires approximately 420 MB of GPU memory during inference. While the Transformer layers introduce additional complexity, the model maintains acceptable efficiency and remains deployable in edge-computing scenarios with moderate hardware resources.
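Latency and parameter counts of this kind can be measured with a short profiling routine such as the sketch below, where model stands for any instantiated recognizer and sample is one preprocessed gesture sequence; absolute timings will of course vary with hardware and batch size.
```python
import time

import torch


def profile_model(model: torch.nn.Module, sample: torch.Tensor, runs: int = 100) -> None:
    """Report the parameter count and average inference time for one input sample."""
    model.eval()
    num_params = sum(p.numel() for p in model.parameters())
    print(f"Parameters: {num_params / 1e6:.1f} M")

    with torch.no_grad():
        # Warm-up iterations so one-time initialization does not skew the timing.
        for _ in range(10):
            model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / runs

    print(f"Average inference time: {elapsed * 1000:.1f} ms per sequence")
```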

5. Discussion

The model proposed in this study combines Transformer, multi-head attention, and LSTM, achieving excellent performance in dynamic gesture recognition. However, its complexity requires significant computational resources, which may limit its use on resource-constrained devices. To address this, techniques such as model compression (e.g., knowledge distillation), lightweight architectures (e.g., MobileBERT), and hardware acceleration (e.g., GPU or FPGA) can be explored to reduce computational load without sacrificing performance.
Currently, the focus is mainly on visual data, but integrating multimodal information like audio and depth data could improve accuracy and robustness. Audio provides context on environmental noise, while depth data enhances spatial understanding, especially in complex environments. However, multimodal fusion presents challenges, such as handling data discrepancies and synchronization across different sources.
The attention mechanism significantly enhances model performance by focusing on important features. To optimize it further, exploring adaptive attention, depthwise separable convolutions, and multi-scale attention mechanisms could improve both efficiency and accuracy, especially in complex gesture-recognition tasks.
In conclusion, while the proposed method shows strong performance, there is room for improvement in terms of resource efficiency, multimodal fusion, and attention mechanism optimization. These areas should be explored in future work to enhance the model’s practical applicability.

6. Conclusions

This study presents a dynamic gesture-recognition approach that leverages Transformer, multi-head attention, and LSTM. The experimental results indicate that our method surpasses existing models on both the SHREC 2017 and custom datasets, achieving remarkable accuracy and mAP. The use of Grounding SAM ensures stable performance under various conditions, demonstrating strong anti-interference and generalization. Future work will focus on optimizing the attention mechanism, incorporating self-supervised learning, and integrating multi-modal information to further improve performance. This approach provides a robust and efficient solution for recognizing dynamic gestures in complex environments.

Author Contributions

Conceptualization: J.C. and F.J.; Methodology: J.C. and F.J.; Software: J.C. and F.J.; Validation: J.C., F.J. and Y.J.; Formal Analysis: J.C. and F.J.; Investigation: J.C. and F.J.; Resources: J.C. and F.J.; Data Curation: J.C. and F.J.; Writing—Original Draft Preparation: J.C. and F.J.; Writing—Review and Editing: J.C., F.J. and Y.Z.; Visualization: J.C. and F.J.; Supervision: F.J.; Project Administration: F.J.; Funding Acquisition: X.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Guangxi Key Research & Development Program, Grant Nos. Gui Ke AB23026048 and AB22080047.

Data Availability Statement

The image used in Figure 2 is openly available at [https://img1.qunliao.info/fastdfs6/M00/2F/9C/rBUESWEHS1GALaDHAET3ntlZL4w099.jpg?imgW=4356&imgH=2904] (accessed on 22 April 2025). The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215. [Google Scholar]
  2. Wu, D.; Pigou, L.; Kindermans, P.J.; Le, N.D.H.; Shao, L.; Dambre, J.; Odobez, J.M. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1583–1597. [Google Scholar] [CrossRef]
  3. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Networks Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef]
  4. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  5. Suresha, M.; Kuppa, S.; Raghukumar, D. A study on deep learning spatiotemporal models and feature extraction techniques for video understanding. Int. J. Multimed. Inf. Retr. 2020, 9, 81–101. [Google Scholar] [CrossRef]
  6. Fu, R.; Liang, H.; Wang, S.; Jia, C.; Sun, G.; Gao, T.; Chen, D.; Wang, Y. Transformer-BLS: An efficient learning algorithm based on multi-head attention mechanism and incremental learning algorithms. Expert Syst. Appl. 2024, 238, 121734. [Google Scholar] [CrossRef]
  7. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  8. Liu, R.; Zhang, J.; Peng, K.; Zheng, J.; Cao, K.; Chen, Y.; Yang, K.; Stiefelhagen, R. Open scene understanding: Grounded situation recognition meets segment anything for helping people with visual impairments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1857–1867. [Google Scholar]
  9. Liu, C.; Wu, Y.; Liu, J.; Sun, Z. Improved YOLOv3 network for insulator detection in aerial images with diverse background interference. Electronics 2021, 10, 771. [Google Scholar] [CrossRef]
  10. Bambach, S.; Lee, S.; Crandall, D.J.; Yu, C. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1949–1957. [Google Scholar]
  11. Nolker, C.; Ritter, H. Visual recognition of continuous hand postures. IEEE Trans. Neural Netw. 2002, 13, 983–994. [Google Scholar] [CrossRef]
  12. Lee, M.; Bae, J. Real-time gesture recognition in the view of repeating characteristics of sign languages. IEEE Trans. Ind. Inform. 2022, 18, 8818–8828. [Google Scholar] [CrossRef]
  13. Zheng, C.; Lin, W.; Xu, F. A Self-Occlusion Aware Lighting Model for Real-Time Dynamic Reconstruction. IEEE Trans. Vis. Comput. Graph. 2022, 29, 4062–4073. [Google Scholar] [CrossRef]
  14. Iwase, S.; Saito, S.; Simon, T.; Lombardi, S.; Bagautdinov, T.; Joshi, R.; Prada, F.; Shiratori, T.; Sheikh, Y.; Saragih, J. Relightablehands: Efficient neural relighting of articulated hand models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16663–16673. [Google Scholar]
  15. Blunsom, P. Hidden markov models. Lect. Notes August 2004, 15, 48. [Google Scholar]
  16. Lichtenauer, J.F.; Hendriks, E.A.; Reinders, M.J. Sign language recognition by combining statistical DTW and independent classification. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 2040–2046. [Google Scholar] [CrossRef]
  17. Lu, P.; Zhang, M.; Zhu, X.; Wang, Y. Head nod and shake recognition based on multi-view model and hidden markov model. In Proceedings of the International Conference on Computer Graphics, Imaging and Visualization (CGIV’05), Beijing, China, 26–29 July 2005; pp. 61–64. [Google Scholar]
  18. Köpüklü, O.; Gunduz, A.; Kose, N.; Rigoll, G. Real-time hand gesture detection and classification using convolutional neural networks. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–8. [Google Scholar]
  19. Min, Y.; Zhang, Y.; Chai, X.; Chen, X. An efficient pointlstm for point clouds based gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5761–5770. [Google Scholar]
  20. Zhang, W.; Wang, J.; Lan, F. Dynamic hand gesture recognition based on short-term sampling neural networks. IEEE/CAA J. Autom. Sin. 2020, 8, 110–120. [Google Scholar] [CrossRef]
  21. Hampiholi, B.; Jarvers, C.; Mader, W.; Neumann, H. Convolutional transformer fusion blocks for multi-modal gesture recognition. IEEE Access 2023, 11, 34094–34103. [Google Scholar] [CrossRef]
  22. Slama, R.; Rabah, W.; Wannous, H. Online hand gesture recognition using Continual Graph Transformers. arXiv 2025, arXiv:2502.14939. [Google Scholar]
  23. Hu, Z.; Gao, K.; Zhang, X.; Yang, Z.; Cai, M.; Zhu, Z.; Li, W. Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14. [Google Scholar] [CrossRef]
  24. Duch, W.; Adamczak, R.; Grabczewski, K. A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Trans. Neural Netw. 2001, 12, 277–306. [Google Scholar] [CrossRef]
  25. Zhai, J.; Tian, F.; Ju, F.; Zou, X.; Qian, S. with Bounding SAM for HIFU Target Region Segmentation. In Pattern Recognition and Computer Vision, Proceedings of the 7th Chinese Conference, PRCV 2024, Urumqi, China, 18–20 October 2024, Proceedings, Part XIV; Springer Nature: Berlin/Heidelberg, Germany, 2024; Volume 15044, p. 118. [Google Scholar]
  26. Min, R.; Wang, X.; Zou, J.; Gao, J.; Wang, L.; Cao, Z. Early gesture recognition with reliable accuracy based on high-resolution IoT radar sensors. IEEE Internet Things J. 2021, 8, 15396–15406. [Google Scholar] [CrossRef]
  27. Yao, T.; Li, Y.; Pan, Y.; Wang, Y.; Zhang, X.P.; Mei, T. Dual vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10870–10882. [Google Scholar] [CrossRef]
  28. Mitra, S.; Acharya, T. Gesture recognition: A survey. IEEE Trans. Syst. Man, Cybern. Part C Appl. Rev. 2007, 37, 311–324. [Google Scholar] [CrossRef]
  29. Lin, Y.; Wang, C.; Song, H.; Li, Y. Multi-head self-attention transformation networks for aspect-based sentiment analysis. IEEE Access 2021, 9, 8762–8770. [Google Scholar] [CrossRef]
  30. Sauvola, J.; Pietikäinen, M. Adaptive document image binarization. Pattern Recognit. 2000, 33, 225–236. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  32. Wang, J.; Zhang, J.; Wang, X. Bilateral LSTM: A two-dimensional long short-term memory model with multiply memory units for short-term cycle time forecasting in re-entrant manufacturing systems. IEEE Trans. Ind. Inform. 2017, 14, 748–758. [Google Scholar] [CrossRef]
  33. Jin, R.; Chen, Z.; Wu, K.; Wu, M.; Li, X.; Yan, R. Bi-LSTM-based two-stream network for machine remaining useful life prediction. IEEE Trans. Instrum. Meas. 2022, 71, 1–10. [Google Scholar] [CrossRef]
  34. Viel, F.; Maciel, R.C.; Seman, L.O.; Zeferino, C.A.; Bezerra, E.A.; Leithardt, V.R.Q. Hyperspectral image classification: An analysis employing CNN, LSTM, transformer, and attention mechanism. IEEE Access 2023, 11, 24835–24850. [Google Scholar] [CrossRef]
  35. Savva, M.; Yu, F.; Su, H.; Kanezaki, A.; Furuya, T.; Ohbuchi, R.; Zhou, Z.; Yu, R.; Bai, S.; Bai, X.; et al. Shrec’17 track large-scale 3d shape retrieval from shapenet core55. In Proceedings of the Eurographics Workshop on 3D Object Retrieval, Lyon, France, 23–24 April 2017; Volume 10. [Google Scholar]
  36. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
  37. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  38. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  39. Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
  40. Beeri, E.B.; Nissinman, E.; Sintov, A. Recognition of Dynamic Hand Gestures in Long Distance using a Web-Camera for Robot Guidance. arXiv 2024, arXiv:2406.12424. [Google Scholar]
Figure 1. The left shows Grounding DINO locating targets, the middle shows SAM segmenting, and the right is the network model.
Figure 2. The figure illustrates the workflow of Grounding SAM, where Grounding DINO locates the target from text input, and SAM segments the relevant areas to extract features.
Figure 3. This figure shows preprocessing steps: Grounding DINO for detection, SAM for segmentation, hand mask extraction, and binarization.
Figure 4. The architecture of the dynamic gesture-recognition model.
Figure 5. This figure shows the multi-head attention mechanism, which captures relationships between features and outputs fused information.
Figure 6. A two-layer LSTM with dropout to capture temporal features and reduce overfitting.
Figure 7. This figure shows model accuracy on SHREC 2017 and our dataset.
Figure 8. A demonstration of accuracy vs. distance with/without interference.
Table 1. Ablation study results on our custom dataset.
Configuration         Removed Component        Accuracy (%)   F1     Epochs
Full Model            None                     96.3           0.94   300
No Transformer        Transformer              87.5           0.85   300
No LSTM               LSTM                     89.2           0.88   300
No Token Enc.         Token Encoding           91.4           0.91   300
No Pos. Enc.          Positional Encoding      92.0           0.92   300
No Dropout            Dropout                  90.1           0.89   300
No Softmax            Softmax                  93.0           0.92   300
No Trans. and LSTM    Transformer and LSTM     75.3           0.75   300
Table 2. Comparison of accuracy and mAP values of different models on SHREC 2017 and custom datasets.
Dataset               Model            Accuracy (%)   mAP
SHREC 2017 Dataset    Swin [36]        78.5           0.74
                      ViViT [37]       77.3           0.68
                      SlowFast [38]    73.4           0.66
                      X3D [39]         82.2           0.76
                      SFT [40]         94.7           0.86
                      Ours             96.1           0.90
Our Dataset           Swin             81.5           0.78
                      ViViT            79.3           0.75
                      SlowFast         76.4           0.71
                      X3D              81.4           0.80
                      SFT              95.3           0.92
                      Ours             96.3           0.94
Table 3. Accuracy variations of different models at various distances under interference and non-interference conditions.
Under Non-Interference Conditions (Accuracy, %)
Distance (m)   Swin [36]   ViViT [37]   SlowFast [38]   X3D [39]   SFT [40]   Ours
2              85          83           82              87         96         97
3              81          80           78              84         92         94
4              78          76           75              80         88         91
5              74          72           71              77         84         87
6              70          68           67              73         80         83
7              66          64           63              69         76         79
8              63          61           60              66         72         75

Under Interference Conditions (Accuracy, %)
Distance (m)   Swin [36]   ViViT [37]   SlowFast [38]   X3D [39]   SFT [40]   Ours
2              75          73           72              77         90         92
3              72          70           68              73         85         89
4              69          67           65              70         81         85
5              65          63           61              66         77         81
6              61          59           58              62         73         77
7              57          55           54              58         69         73
8              53          52           51              55         65         70
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
