Article

A Novel Multimodal Hand Gesture Recognition Model Using Combined Approach of Inter-Frame Motion and Shared Attention Weights

1 College of Computer and Information Engineering, Nanjing Tech University, Nanjing 211816, China
2 College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
4 College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
5 College of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Computers 2025, 14(10), 432; https://doi.org/10.3390/computers14100432
Submission received: 12 August 2025 / Revised: 21 September 2025 / Accepted: 25 September 2025 / Published: 13 October 2025
(This article belongs to the Special Issue Multimodal Pattern Recognition of Social Signals in HCI (2nd Edition))

Abstract

Dynamic hand gesture recognition based on computer vision aims at enabling computers to understand the semantic meaning conveyed by hand gestures in videos. Existing methods predominantly rely on spatiotemporal attention mechanisms to extract hand motion features in a large spatiotemporal scope. However, they cannot accurately focus on the moving hand region for hand feature extraction because frame sequences contain a substantial amount of redundant information. Although multimodal techniques can extract a wider variety of hand features, they are less successful at utilizing information interactions between various modalities for accurate feature extraction. To address these challenges, this study proposes a multimodal hand gesture recognition model combining inter-frame motion and shared attention weights. By jointly using an inter-frame motion attention (IFMA) mechanism and adaptive down-sampling (ADS), the spatiotemporal search scope can be effectively narrowed down to the hand-related regions based on the characteristic of hands exhibiting obvious movements. The proposed inter-modal attention weight (IMAW) loss enables RGB and Depth modalities to share attention, allowing each to adjust its distribution based on the other. Experimental results on the EgoGesture, NVGesture, and Jester datasets demonstrate the superiority of our proposed model over existing state-of-the-art methods in terms of hand motion feature extraction and hand gesture recognition accuracy.

1. Introduction

As a basic form of communication, gestures are widely employed in a variety of contexts, including beckoning, traffic police hand signals, and particularly sign language used by deaf–mute people. Fundamentally, gesturing is a way of expression. Similarly to spoken languages, which are continuous and dynamic, gestures convey meaning through changing and moving hand postures over time. Hand gesture recognition aims at enabling computers to understand the meaning of target hand gestures. Dynamic hand gesture recognition based on computer vision refers to analyzing videos captured by cameras using specific algorithms to classify gestures [1]. It has extensive applications in human–computer interaction scenarios [2,3], such as virtual reality [4], sign language translation [5], and clinical care [6].
With the advancement of deep learning, an increasing number of neural networks have been developed for video understanding in various fields [7,8,9,10,11]. However, compared to other video understanding fields, dynamic hand gesture recognition focuses primarily on hand movement. It classifies gestures mostly based on information from the hand in every frame and the movement of the hand between frames [1]. Nevertheless, frame sequences in video often contain a lot of redundant information, which interferes with the model's attention to the hand [12]. On the one hand, the frame sequences contain a complex background in addition to the human body, and a complex background outside the hand region may degrade the models' ability to locate the hand accurately. Moreover, poor device quality can lead to blurred frames, while excessively high frame rates during video collection may produce redundant or repeated frames, both of which impact the ability to extract hand motion features from the video. Commonly, an attention mechanism is used to address such issues. It employs specific neural network structures to automatically learn the contributions of different parts in images to the recognition task. This enables the neural network model to pay more attention to the moving hand region. Existing methods [13] often feed the data directly into attention modules and use spatiotemporal attention mechanisms to extract effective features from a large spatiotemporal scope. Although these methods allow the model to focus on the hand, the large search scope makes it difficult to capture the hand motion features accurately. Additionally, in the dynamic hand gesture recognition datasets, the complex background is often stationary or only exhibits slight movements. In other words, the hand is the primary moving object in the datasets for hand gesture recognition. The characteristic of hands exhibiting obvious movements can help the model exclude redundant information irrelevant to the hand, accurately extract hand motion features, and further improve hand gesture recognition accuracy.
Some researchers work to extract hand motion information more thoroughly using multimodal fusion techniques to increase the accuracy of hand gesture recognition. These methods typically apply an attention mechanism to the different modalities individually and then extract and fuse the features from each modality. Although different modal data with different attributes can enrich the final fused features, directly fusing features from different modalities lacks informative interaction between the modalities. Therefore, it fails to leverage the feature information from other modalities besides its own modality during hand localization, leading to an inaccurate feature extraction for each modality.
To address the aforementioned challenges, this study proposes a multimodal hand gesture recognition model combining inter-frame motion and shared attention weights. The motivation is to exploit two complementary strategies: first, the inter-frame motion attention (IFMA) module explicitly captures pixel-level motion between adjacent frames, enabling the model to focus on moving hand regions and discard redundant static background information [14]; second, the inter-modal attention weight (IMAW) loss allows the RGB and Depth modalities to share attention information, enhancing cross-modal feature interaction and improving robustness when one modality is noisy or occluded [15]. The model integrates IFMA with an adaptive down-sampling (ADS) mechanism to efficiently determine the spatiotemporal search scope, guiding feature extraction to moving hand regions and reducing irrelevant information. IMAW further improves feature extraction accuracy for both the RGB and Depth modalities through cross-modal attention guidance, minimizing redundancy in the fused features. Specifically, to the best of our knowledge, this is the first work that proposes the IFMA module. The proposed module describes the obvious movement of the hand by calculating the similarity between any pixel in each frame and all pixels in the adjacent frames within the same local region. Accordingly, the spatiotemporal search scope is efficiently reduced to hand-related regions based on the characteristic of hands exhibiting obvious movements, which guides the model to focus on the hand motion feature extraction, reducing redundant information contained in the feature. To reduce the computational complexity of the whole model while maintaining the effectiveness of features, we further propose an ADS module based on the self-attention mechanism, which samples the output of IFMA adaptively in spatial, temporal, and modal dimensions according to the calculated weights of importance, thus reducing the computational burden of subsequent modules. Furthermore, we propose a novel IMAW loss to train the model. This loss leverages the attention weights of one modality to compensate for the missing attention weights of another modality because of an ineffective attention generation, considering the correlation and interaction between different modalities. Specifically, it enables each modality to utilize the attention weights of another modality to adjust its own attention distribution. This is achieved by reducing the distance between the attention weights of different modalities by a certain proportion, thereby enhancing the reliability of gesture recognition. In other words, the attention weights of various modalities collectively create a shared repository and each modality takes into account all of the attention weights in the repository to optimize its parameters. To prevent incorrect weights in one modality from negatively impacting the other modality, we evaluate their effectiveness before sharing the attention weights. This approach significantly enhances the accuracy of feature extraction in each modality, thereby reducing redundant information in the fused features. Experimental results on the EgoGesture, NVGesture, and Jester datasets show that our proposed model can better extract the features of moving hands and improve the accuracy of hand gesture recognition.
In summary, the main contributions of this paper are as follows:
  • A novel inter-frame motion attention mechanism is proposed, which effectively reduces the spatiotemporal search range to hand-related regions, enabling the model to more accurately focus on hand motion features and improve gesture recognition accuracy.
  • A novel inter-modal attention weight loss is proposed to train the model, which can improve the interaction between the Depth and RGB modalities, thereby reducing the amount of redundant information in the final fused features.
  • Experiments are conducted on challenging datasets such as EgoGesture, NVGesture, and Jester. The results demonstrate that our proposed model outperforms other existing methods in terms of the accuracy of hand gesture recognition. Ablation experiments highlight the contributions of each component of our proposed model.
The remainder of the paper is organized as follows. Section 2 reviews related work on hand gesture recognition, attention mechanisms, and multimodal applications. Section 3 introduces our proposed model along with the experimental setup. Section 4 presents the results and analysis. Section 5 discusses the findings, and Section 6 concludes the study with directions for future work.

2. Related Work

The proposed gesture recognition method aims to improve existing attention mechanisms and multimodal approaches. Therefore, we review the application of attention mechanisms to hand gesture recognition and related work on multimodal hand gesture recognition in this section.

2.1. Attention Mechanism

The attention mechanism [16,17] in neural networks can automatically focus on the regions that contribute significantly to the recognition results. In the field of dynamic hand gesture recognition, it is commonly used to capture hand motion features in frame sequences, and existing research often directly applies the attention mechanism to the spatiotemporal dimensions. For example, Chen et al. [18] proposed a dynamic graph-based spatiotemporal attention method that automatically learns effective features through spatiotemporal attention mechanisms. Shi et al. [19] improved the spatiotemporal attention mechanism, in which a decoupled spatiotemporal attention network is introduced to model the correlations among different joints in dynamic skeletal data. To extend the learning ability of the attention mechanism, Miah et al. [20] proposed a multi-level spatiotemporal attention network, applied at multiple stages, to enhance hand gesture recognition accuracy. Additionally, attention mechanisms have been used in two-stream networks as well. Zhang et al. [21] applied spatiotemporal attention mechanisms to a two-stream network, in which the attention value for the pose stream and motion stream is computed to guide the model towards extracting effective features. Using attention mechanisms in the spatiotemporal dimension does enable the model to concentrate on hand motion features and increase hand gesture recognition accuracy, but the search scope is still too broad without taking into account the characteristic of hands exhibiting obvious movements, leading to insufficiently accurate hand gesture recognition. Unlike the above methods that directly apply attention mechanisms to the spatiotemporal dimension, our proposed IFMA module reduces the search scope of the attention mechanism based on the characteristic of hands exhibiting obvious movements. It describes the hand’s obvious movement by computing the similarity between any pixel in each frame and all pixels in the adjacent frames within the same local region. This helps further narrow down the search scope of the attention mechanism to the hand-related region, thus guiding the model to focus on hand motion feature extraction and reducing redundant information. Additionally, in order to reduce the computational complexity brought by the IFMA module while keeping the efficacy of the extracted features, we propose an ADS module based on self-attention mechanisms.

2.2. Multimodal Hand Gesture Recognition

Recent studies have shown that multimodal data have been used widely in image recognition tasks. RGB, Depth, and optical flow are common data modalities that are used in the field of hand gesture recognition [22,23,24]. For example, Zhang et al. [25] randomly selected a frame of RGB image and an optical stream from a fixed group for feature fusion before feeding it into a convolutional neural network to learn short-term and long-term features. To handle complex backgrounds, occlusion, and illumination simultaneously, which cannot be solved by a single network, Elboushaki et al. [26] proposed a multi-dimensional CNN for gesture recognition in videos, in which multiple features from RGB and Depth modalities are combined by different network branches to improve the accuracy of hand gesture recognition. Similarly, Yu et al. [27] proposed a 3D Center Difference Convolution (3D-CDC) in conjunction with neural architecture search strategy to extract features from RGB and Depth modalities. Additionally, further research suggested that extracting correlations between modalities can help improve model accuracy. Gammulle et al. [28] individually trained each modality using the same network architecture. Then, they establish the correlations between different modalities at the same time step by using a proposed Fusion Block. In contrast, Li et al. [29] proposed the Skeleton-Guided MultiModal Network (SGM-Net), which leverages the actionable information from the skeleton modality to assist the RGB modality for feature extraction. However, their guiding strategy is unidirectional and not easily extended to other modalities. Note that, although these multimodal methods can extract richer hand features compared to single-modal methods, they cannot effectively utilize the features from different modalities to improve the accuracy of feature extraction for individual modalities before fusing them. As a result, the fused features for classification are not accurate enough.
This study presents a novel IMAW loss to enhance the training of the model. This loss function takes into consideration the correlation and interaction between different modalities, leveraging the attention weights of one modality to compensate for the missing attention regions in another modality caused by ineffective attention weights. As a result, each modality can use the attention weights from other modalities to dynamically adjust its attention distribution and increase the accuracy of feature extraction for each modality. This is accomplished by reducing the distance between attention weights of various modalities by a certain proportion.

3. Materials and Methods

This section details the materials and methods used in this study. We first introduce the datasets for training and evaluation, followed by the implementation details and the metrics used to evaluate performance. Finally, we present the detailed architecture of our proposed model, including its core components and the loss function.

3.1. Datasets

In this study, we utilize two complementary modalities, RGB and Depth (RGB-D), to leverage both appearance and 3D structural information. Specifically, RGB provides rich spatial and texture features that describe hand appearance and background context, while Depth offers structural and distance cues that are more robust to illumination changes and background clutter. Both modalities are temporal in nature, as the datasets contain continuous video sequences of dynamic hand gestures, enabling the model to capture frame-to-frame motion patterns within a dynamic window. Only these two modalities (RGB and Depth) are considered in our experiments, as they represent the most widely available and practically applicable data sources in gesture recognition. The datasets are as follows:
EgoGesture Dataset [30]: It is a rich multimodal dataset for the classification of first-person gestures, with a primary concentration on gestures related to interactions with wearable devices. The dataset contains dynamic gestures in complex backgrounds, such as walking subjects and intense sunlight conditions. It covers real-world scenarios, including indoor and outdoor settings, providing abundant gesture data in realistic environments. The dataset comprises 2081 RGB-D videos from 50 different subjects and involves 83 gesture categories. In total, there are 24,161 gesture samples, with 14,416 samples for training, 4768 samples for validation, and 4977 samples for testing. Representative examples from the six scene settings are shown in Figure 1, with black-and-white images representing Depth.
NVGesture Dataset [31]: The dataset is a multimodal dataset primarily focused on touchless driver control. It consists of 1532 dynamic gestures executed by 20 subjects, with 25 gesture categories. We divided the RGB-D data in the dataset into training and testing sets by a 7:3 ratio to perform experiments. An example frame sequence is illustrated in Figure 2.
Jester Dataset [32]: The dataset provides only the RGB modality. It consists of 148,092 labeled gesture videos captured through a laptop or network camera. Some gestures have high similarity, allowing the model to learn fine-grained variations. Therefore, the dataset can be used to verify the capacity of the model to distinguish subtle differences between gestures. Specifically, it consists of 27 gesture categories, including two “no gesture” categories to help the model differentiate between meaningless actions and specific gestures in real-world scenarios. The dataset is divided into three subsets, namely training, validation, and testing, by a split ratio of 8:1:1, including 118,562 training videos, 14,787 validation videos, and 14,743 testing videos. Four representative gestures are illustrated in Figure 3.

3.2. Computer Setup

The model was implemented on a Linux system with Python 3.7, PyTorch 1.3, and CUDA 11.3. The experiments were conducted on three parallel GeForce RTX 3090 GPUs. We optimized our proposed model using the Stochastic Gradient Descent (SGD) optimizer for 90 epochs. The initial learning rate was set to 1 × 10−2, momentum was set to 0.9, and weight decay was set to 1 × 10−4. A warmup strategy and a cosine annealing decay strategy were applied to the learning rate. The warmup strategy used 5 epochs, and the minimum learning rate for decay was set to 0. The network’s weights were initialized using pre-trained weights from ImageNet. The input size of frame sequences was uniformly set to 16 × 224 × 224 × 3. Note that the embedding layer was removed from the gesture classification model to ensure matching in the feature dimension.
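For reproducibility, the following PyTorch sketch illustrates the optimization schedule described above (SGD with momentum 0.9, weight decay 1 × 10−4, a 5-epoch linear warmup, and cosine annealing to a minimum learning rate of 0). The placeholder model, the manual scheduler function, and the loop structure are our own illustrative assumptions, not the authors' training code.

```python
import math
import torch

# Placeholder model; the paper's architecture would be substituted here.
model = torch.nn.Linear(10, 83)

base_lr, min_lr = 1e-2, 0.0
warmup_epochs, total_epochs = 5, 90

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)

def lr_at_epoch(epoch: int) -> float:
    """Linear warmup for 5 epochs, then cosine annealing down to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... one training epoch over 16 x 224 x 224 x 3 frame sequences ...
```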

3.3. Evaluation Metrics

We evaluate the performance of all models using Top-1 and Top-5 accuracy, which are standard metrics for multi-class classification tasks like gesture recognition.
  • Top-1 Accuracy: This metric measures the percentage of predictions where the class with the highest predicted probability is identical to the ground-truth label.
  • Top-5 Accuracy: This metric considers a prediction to be correct if the ground-truth label is among the top five classes with the highest predicted probabilities. It is particularly useful for datasets with many classes or fine-grained distinctions, where multiple predictions might be plausible.
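As a concrete illustration, the snippet below computes Top-k accuracy from a batch of logits; the function name and the toy tensors are ours and are not part of the original evaluation code.

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose ground-truth label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices              # (N, k) predicted class indices
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# Toy example: 4 samples, 5 classes.
logits = torch.randn(4, 5)
labels = torch.tensor([0, 2, 1, 4])
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```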

3.4. Patch Embedding

To enable the proposed IFMA to better extract local patterns from local regions, a patch embedding operation is applied to the frame sequence to establish correlations between different local regions. This is accomplished by applying a 3D convolution with a size of (2,4,4) and a stride of 1 to extract local features while expanding the channel dimension, thereby mapping the frame sequence to a higher-dimensional vector space. Additionally, a learnable parameter matrix is used for position encoding. The embedding operation transforms the dimensions of the frame sequence from (B, 3, T, H, W) to (B, C, T, H, W), where B represents the batch size, C denotes the embedded dimension, T stands for the number of frames, H and W are the height and width of the frames, respectively.
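A minimal PyTorch sketch of this embedding step is shown below. Since the paper specifies a (2, 4, 4) kernel with stride 1 but not the padding, the padding and the cropping used to keep the (B, C, T, H, W) shape are our assumptions, as is the choice of C = 96 (the value later selected in Section 4.1.1).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Maps (B, 3, T, H, W) frame sequences to (B, C, T, H, W) embeddings."""
    def __init__(self, embed_dim: int = 96, frames: int = 16, size: int = 224):
        super().__init__()
        # (2, 4, 4) kernel with stride 1; padding chosen here so T, H, W are preserved
        # after cropping (an assumption -- the paper does not state the padding).
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=(2, 4, 4),
                              stride=1, padding=(1, 2, 2))
        # Learnable position encoding as a plain parameter matrix.
        self.pos = nn.Parameter(torch.zeros(1, embed_dim, frames, size, size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)  # slightly larger than (B, C, T, H, W) with this padding
        x = x[:, :, : self.pos.shape[2], : self.pos.shape[3], : self.pos.shape[4]]
        return x + self.pos

# Usage: (1, 3, 16, 224, 224) -> (1, 96, 16, 224, 224)
out = PatchEmbedding()(torch.randn(1, 3, 16, 224, 224))
```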

3.5. Inter-Frame Motion Attention

As shown in Figure 4, the proposed IFMA module is used to extract motion hand features from the patch-embedded frame sequence. The search scope of the attention mechanism in the spatiotemporal dimension is determined by IFMA based on the local pattern similarity between any point in a frame and all points in the corresponding local regions of adjacent frames. The IFMA module has two primary components: the feature extractor and the attention calculator. First, the feature extractor extracts the local patterns of each frame, and then the attention calculator computes the IFMA for two adjacent frames in the same modality.

3.5.1. Feature Extractor

To facilitate the subsequent similarity computation, the tensor $X_{raw} \in \mathbb{R}^{B \times C \times T \times H \times W}$ output by the patch embedding is first divided into multiple blocks using a 3D sliding window with size (2, h, w), resulting in a tensor with the dimension of $\left(B, C, \frac{T}{2}, \frac{H}{h}, \frac{W}{w}, h, w\right)$. Each block in this tensor corresponds to the same local region between adjacent frames. Additionally, the shape of the tensor is adjusted to $\left(2, \frac{BT}{2}\frac{H}{h}\frac{W}{w}, hw, C\right)$, and then the tensor is divided into two tensors along the first dimension. The above calculation is shown in Equation (1).
$X_t, X_{t+1} = \mathrm{reshape}(\mathrm{window}(X_{raw}))$ (1)
where $X_t \in \mathbb{R}^{\frac{BT}{2}\frac{H}{h}\frac{W}{w} \times hw \times C}$ represents the previous frame and $X_{t+1} \in \mathbb{R}^{\frac{BT}{2}\frac{H}{h}\frac{W}{w} \times hw \times C}$ represents the subsequent frame; window represents the division of the tensor using a sliding window, and reshape represents adjusting the shape of the tensor and dividing it along the first dimension. Subsequently, the features $X_t, X_{t+1}$ are fed into the feature extractor to extract local patterns. The feature extractor consists of a convolution module and two linear layers, which are used to extract and further optimize the local patterns. The convolution module, inspired by the ResNet [33] architecture, is employed to extract local patterns. It begins with adjusting the channel dimension of the frame images by a 1 × 1 convolutional kernel with 64 output channels. Then, a 3 × 3 convolutional kernel with 64 output channels is utilized to extract local patterns. Finally, a 1 × 1 convolutional kernel with C output channels is applied to restore the number of channels. Additionally, when the neural network structure becomes excessively deep, feature degradation will inevitably arise. To alleviate this problem, a shortcut connection is employed. The expression is as follows:
$P_t = X_t + \mathrm{Conv}(X_t)$ (2)
$P_{t+1} = X_{t+1} + \mathrm{Conv}(X_{t+1})$ (3)
where $P_t, P_{t+1} \in \mathbb{R}^{\frac{BT}{2}\frac{H}{h}\frac{W}{w} \times hw \times C}$ represent the features containing local patterns in the original frames, and Conv denotes the convolution operation. To further enhance the expressive capacity of the features, two linear layers are utilized to optimize the local patterns:
$U = P_t A_1^{\mathsf{T}} + b_1$ (4)
$V = P_{t+1} A_2^{\mathsf{T}} + b_2$ (5)
where U and V are the outputs of the two linear layers, respectively; $A_1$ and $A_2$ represent the weight matrices of the two linear layers, while $b_1$ and $b_2$ correspond to the biases of these layers; $\mathsf{T}$ denotes the matrix transpose operation.
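The following sketch outlines the feature extractor under the assumption that each window tensor of shape (N, hw, C) is reshaped to (N, C, h, w) before the 2D convolutions; the window size h = w = 7 and the single linear layer shown (standing in for one of the pair in Equations (4) and (5)) are illustrative choices rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class LocalPatternExtractor(nn.Module):
    """Feature extractor of IFMA (Eqs. (2)-(5)): a 1x1 -> 3x3 -> 1x1 bottleneck
    with a shortcut connection, followed by a linear projection."""
    def __init__(self, channels: int, h: int = 7, w: int = 7):
        super().__init__()
        self.h, self.w = h, w
        # Bottleneck with 64 hidden channels, as described in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.Conv2d(64, channels, kernel_size=1),
        )
        self.linear = nn.Linear(channels, channels)  # one linear layer (A, b)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, h*w, C) -- one window per adjacent-frame pair; reshaping to
        # (N, C, h, w) for the 2D convolutions is our assumption.
        n, hw, c = x.shape
        feat = x.transpose(1, 2).reshape(n, c, self.h, self.w)
        p = x + self.conv(feat).reshape(n, c, hw).transpose(1, 2)  # shortcut, Eq. (2)/(3)
        return self.linear(p)                                      # Eq. (4)/(5)

# Usage: 8 windows of 7x7 points with 96 channels.
u = LocalPatternExtractor(96)(torch.randn(8, 49, 96))
```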

3.5.2. Attention Calculator

After the above-described extraction and optimization of local patterns, each point in the feature represents the high-dimensional information of the local region in the original frame. The attention calculator determines whether hand movements happen in the original frame by comparing the similarity of high-dimensional information between frames. In this way, the attention mechanism’s search area is narrowed to the area associated with the hand. To be more precise, we multiply a point in one of the two frames by all the points in the other frame and then sum the products to obtain the point’s similarity value. A higher similarity value indicates a higher probability of motion occurrence at that point, and vice versa. Since each point needs to be multiplied with all the points in the adjacent frame, it is inefficient to compute them one by one. To improve computational efficiency, we use parallel matrix computation to obtain the similarity matrix $M_S \in \mathbb{R}^{\frac{BT}{2}\frac{H}{h}\frac{W}{w} \times hw \times hw}$, which yields the similarity value of any point in the two frames by a single matrix multiplication, as shown in Equation (6).
$M_S = U \times V^{\mathsf{T}}$ (6)
In addition, the core idea of the IFMA mechanism is to judge whether a point is in motion by calculating its similarity with points at different locations in the adjacent frame. However, when a point is not in motion, its similarity with the same position in the adjacent frame will also be the highest, which negatively affects the model’s judgment. To prevent this, we employ a mask operation, which sets the main diagonal elements of the similarity matrix to zero, considering that each main diagonal element represents the product of points at the same position in adjacent frames.
$M_m = \mathrm{Mask}(M_S)$ (7)
Afterward, summation operations are performed separately along the row and column dimensions of the similarity matrix to obtain the similarity vectors for the corresponding two adjacent frames. In Figure 4, sum (−1) denotes summation along the rows, and sum (−2) denotes summation along the columns. To prevent the SoftMax function from weakening the diversity of the original features, the weight vector obtained through SoftMax is multiplied by a trainable parameter, called Scale:
$Atten_1 = \mathrm{Softmax}(\mathrm{Sum}(M_m, -1)) \cdot \mathrm{Scale}$ (8)
$Atten_2 = \mathrm{Softmax}(\mathrm{Sum}(M_m, -2)) \cdot \mathrm{Scale}$ (9)
Finally, to make the obtained attention weight dimensions match the input features, the weights $Atten_1$ and $Atten_2$ are repeated C times to expand the channel dimension, and then applied to the input features:
$X_{out} = [X_t, X_{t+1}] \odot [Atten_1, Atten_2]$ (10)
where $X_{out} \in \mathbb{R}^{B \times C \times T \times H \times W}$ represents the output of the IFMA weighting, $\odot$ represents the Hadamard product, and $[\cdot]$ denotes concatenation.
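A compact sketch of the attention calculator (Equations (6)–(9)) is given below; the function name, the batch shapes, and the multiplicative way the diagonal mask is applied are our assumptions for illustration.

```python
import torch

def inter_frame_motion_attention(u: torch.Tensor, v: torch.Tensor,
                                 scale: torch.Tensor):
    """Attention calculator of IFMA (Eqs. (6)-(9)).

    u, v: (N, hw, C) local-pattern features of two adjacent frames.
    scale: trainable scalar that restores feature diversity after SoftMax.
    Returns per-point attention weights for the previous and subsequent frame.
    """
    sim = u @ v.transpose(1, 2)                        # (N, hw, hw) similarity matrix, Eq. (6)
    hw = sim.shape[-1]
    mask = 1.0 - torch.eye(hw, device=sim.device)      # zero same-position products, Eq. (7)
    sim = sim * mask
    atten_prev = torch.softmax(sim.sum(dim=-1), dim=-1) * scale   # row sums, Eq. (8)
    atten_next = torch.softmax(sim.sum(dim=-2), dim=-1) * scale   # column sums, Eq. (9)
    return atten_prev, atten_next

# Usage: the returned weights are broadcast over the channel dimension when
# re-weighting the input features, as in Eq. (10).
u = torch.randn(8, 49, 96); v = torch.randn(8, 49, 96)
a1, a2 = inter_frame_motion_attention(u, v, torch.tensor(1.0))
```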

3.6. Adaptive Down-Sampling Module

To reduce the overall computational complexity of the model while preserving the effectiveness of features, an ADS module is employed after enhancing the hand motion features through IFMA. This module implements adaptive fusion down-sampling in the spatial, temporal, and modal dimensions based on the importance weights computed by the self-attention mechanism. The ADS module is subdivided into ASDS, ATDS, and AMDS, which divide the features using sliding windows based on the dimension to be downsampled and then perform the self-attention down-sampling for the divided features.
First, the feature is partitioned using sliding-window operations based on the dimension to be down-sampled. ASDS divides the features in the spatial dimension: it assigns different weights to each point in the local features, samples them according to the weight distribution, and halves the size of each frame’s feature map, thereby reducing the input size for subsequent modules. As shown in Figure 5a, ASDS partitions the input features using a 3 × 3 window with a stride of 2. The purpose of ATDS is to fuse and down-sample adjacent frames in the input sequence for the same region based on weight proportions. This process further refines effective features while reducing the number of frames in the sequence. As depicted in Figure 5b, ATDS divides the features in the temporal dimension and partitions the input feature using a 3 × 1 × 1 window with a stride of 2. AMDS divides the features in the modal dimension; it aims to exploit the local correlations between different modalities through self-attention mechanisms to fuse and down-sample multimodal data. As shown in Figure 5c, the AMDS module groups the frame at moment T of each modality together with the frames adjacent to its left and right into one block.
After dividing the data in the spatial, temporal, and modal dimensions using sliding windows as described above, we perform feature extraction, weight vector computation, and fusion down-sampling on the features to down-sample adaptively based on feature importance. First, we utilize three linear layers with the same number of neurons as the input to extract features for each point, resulting in the feature matrices Q, K, and V:
$Q = X A_q^{\mathsf{T}} + b_q$ (11)
$K = X A_K^{\mathsf{T}} + b_K$ (12)
$V = X A_v^{\mathsf{T}} + b_v$ (13)
where $A_q$, $A_K$, and $A_v$ are the weight matrices of the three linear layers, and $b_q$, $b_K$, and $b_v$ are the corresponding biases of these layers. Next, to obtain the weight vectors based on self-attention, we perform matrix multiplication between Q and K, and the result is multiplied by the scaling factor $\frac{1}{\sqrt{d}}$ to obtain the correlation matrix. Here, d represents the number of neurons in the linear layer, and the scaling factor is introduced to prevent the SoftMax from producing extremely small gradients [34]. Before applying the SoftMax function, we apply average pooling to each row of the correlation matrix to obtain the importance of the different points within the same sliding window, which is represented as:
$A(Q, K) = \mathrm{Softmax}\left(P_{avg}\left(\frac{QK^{\mathsf{T}}}{\sqrt{d}}\right)\right)$ (14)
After obtaining the weight vectors A (Q, K), we proceed to perform matrix multiplication between the output V of the third linear layer and the transpose of the weight vectors. Here, A (Q, K) corresponds to the importance of each point in the original features, while V corresponds to different points within the original features. Multiplying A (Q, K) and V enables features to be fused and down-sampled according to their importance. Subsequently, we optimize the result through another linear layer to obtain the final output R. This process can be represented as follows:
$R = \left(V^{\mathsf{T}} A(Q, K)\right) A_r^{\mathsf{T}} + b_r$ (15)
where $A_r$ and $b_r$ are the weight matrix and bias of the linear layer, respectively.
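The sketch below condenses the per-window self-attention fusion of Equations (11)–(15) into a single module that reduces each window of n points to one fused point; the surrounding window partitioning and reshaping are omitted, and fusing via a weighted sum over V is our reading of Equation (15), not a verified reimplementation.

```python
import torch
import torch.nn as nn

class AdaptiveDownSampling(nn.Module):
    """Self-attention fusion over one sliding window (Eqs. (11)-(15)).

    Each window of n points (e.g., a 3x3 spatial window or 3 temporal
    neighbours) is fused into a single point by learned importance weights.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, n, d) -- n points inside one window.
        q, k, v = self.q(x), self.k(x), self.v(x)              # Eqs. (11)-(13)
        corr = q @ k.transpose(1, 2) * self.scale              # (B, n, n) correlations
        weights = torch.softmax(corr.mean(dim=-1), dim=-1)     # row-wise average pooling, Eq. (14)
        fused = torch.einsum("bn,bnd->bd", weights, v)         # importance-weighted fusion
        return self.out(fused)                                 # Eq. (15), shape (B, d)

# Usage: one 3x3 spatial window of 96-dim features is reduced to a single point.
y = AdaptiveDownSampling(96)(torch.randn(4, 9, 96))   # -> (4, 96)
```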

3.7. Inter-Modal Attention Weights Loss

We propose an inter-modal attention weights loss, which allows each modality to adjust its own attention weight according to the attention weight of the other modality. This is achieved by reducing the distance between the attention weights of the depth modality and the RGB modality by a certain proportion. In other words, the attention weights of different modalities form a shared repository, and each modality considers all attention weights in the repository to optimize its own parameters. To prevent incorrect weights in one modality from affecting the other modality, the validity of the attention weights in each region is evaluated before sharing them. This approach significantly enhances the accuracy of feature extraction in each modality, thereby reducing redundant information in the fused features. The implementation of this method is divided into two stages: Firstly, the effectiveness of the attention weight is evaluated, and then the weight distance between modalities is calculated.
After the attention mechanisms are applied, some attention weights, typically with relatively small values, are still allocated to incorrect regions due to the influence of redundant information. To restrain incorrect weights from these regions, we first evaluate the effectiveness of the weight values in each region. Specifically, we employ a 4 × 4 average pooling layer to process the weight matrices. This step leverages the translational invariance and feature summarization capability of average pooling to obtain localized representative weight values (LRWV) for each modality, and this value is used to evaluate the effectiveness of the local attention weights.
$M_1 = \mathrm{Avg}(m_1)$ (16)
$M_2 = \mathrm{Avg}(m_2)$ (17)
where $M_1$ and $M_2$ represent the LRWVs of the RGB modality and the depth modality, respectively, and they are used to evaluate the effectiveness of the attention weights in the local region. $m_1$ represents the attention weights from the IFMA of the RGB modality, while $m_2$ represents the attention weights from the IFMA of the depth modality. Avg refers to a 4 × 4 average pooling operation. To simultaneously evaluate the effectiveness of attention weights from both modalities within the same local region, we perform a Hadamard product on the LRWV to obtain the effective attention weight matrix (EAWM) α:
$\alpha = M_1 \odot M_2$ (18)
To optimize the attention weights of each modality based on those of the other modality during training, we reduce the distance between the attention weights of the different modalities by a certain proportion and further multiply by the EAWM α to avoid the negative effect of incorrect weights. Note that α represents the validity of the attention weights for the same localized regions of both modalities. When α of a localized region is higher, the attention weights in that region are considered more effective, and the attention weights of the same localized region in the other modality are allowed to move closer to them to a greater extent. In this way, the attention weights of each local region in each modality are adjusted according to those of the same local region in the other modality, and the proportion of the adjustment changes dynamically with the measured effectiveness of the local attention weights on both sides.
$L_{miw} = \frac{\sum_{i}^{n}\sum_{j}^{n} \alpha_{\frac{i}{4},\frac{j}{4}} \left( W^{1}_{i,j} - W^{2}_{i,j} \right)^2}{n^2}$ (19)
where $\alpha_{\frac{i}{4},\frac{j}{4}}$ represents the value at the $\frac{i}{4}$-th row and $\frac{j}{4}$-th column of the EAWM. $W^{1}_{i,j}$ and $W^{2}_{i,j}$ denote the attention weights of the RGB modality and depth modality, respectively, at the i-th row and j-th column. n represents the number of rows and columns. Additionally, for a correct classification, we utilize cross-entropy loss as the final classification loss function, defined as follows:
$L_{ce} = -\frac{1}{n}\sum_{i} Y_i \log y_i$ (20)
where $Y_i$ denotes the ground-truth labels, and $y_i$ represents the model’s predicted values. Finally, the two loss functions are combined in a weighted sum as the final loss:
$L = -\frac{1}{n}\sum_{i} Y_i \log y_i + \lambda \frac{\sum_{i}^{n}\sum_{j}^{n} \left(\mathrm{Avg}(m_1) \odot \mathrm{Avg}(m_2)\right)_{\frac{i}{4},\frac{j}{4}} \left( W^{1}_{i,j} - W^{2}_{i,j} \right)^2}{n^2}$ (21)
where λ is the proportion of the IMAW loss in the final loss value, which is a hyperparameter.
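As a reference, the following sketch implements the IMAW term of Equation (19) and the combined loss of Equation (21) for square attention maps; upsampling α by repetition so that each 4 × 4 block shares one effectiveness value is our interpretation of the $\frac{i}{4},\frac{j}{4}$ indexing, and the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def imaw_loss(w_rgb: torch.Tensor, w_depth: torch.Tensor) -> torch.Tensor:
    """Inter-modal attention weight loss (Eq. (19)).

    w_rgb, w_depth: (B, n, n) IFMA attention weight maps of the two modalities,
    with n assumed divisible by 4. Local 4x4 average pooling yields the LRWVs;
    their Hadamard product is the effective attention weight matrix (EAWM) alpha.
    """
    m1 = F.avg_pool2d(w_rgb.unsqueeze(1), kernel_size=4)     # Eq. (16)
    m2 = F.avg_pool2d(w_depth.unsqueeze(1), kernel_size=4)   # Eq. (17)
    alpha = (m1 * m2).squeeze(1)                             # Eq. (18)
    # Expand alpha so each 4x4 block shares one effectiveness value.
    alpha_full = alpha.repeat_interleave(4, dim=1).repeat_interleave(4, dim=2)
    n = w_rgb.shape[-1]
    return (alpha_full * (w_rgb - w_depth) ** 2).sum(dim=(1, 2)).mean() / (n ** 2)

def total_loss(logits, labels, w_rgb, w_depth, lam: float = 0.25):
    """Cross-entropy plus lambda-weighted IMAW term (Eq. (21)), lambda = 0.25."""
    return F.cross_entropy(logits, labels) + lam * imaw_loss(w_rgb, w_depth)
```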

4. Results

This section presents the results of our comprehensive experiments. We begin by reporting on the hyperparameter selection process used to optimize our model. We then present a series of ablation studies to validate the contribution of each proposed module. Finally, we compare the performance of our proposed model against several state-of-the-art methods on the benchmark datasets.

4.1. Hyperparameter Selection

This subsection describes the fine-tuning procedure conducted to determine the optimal hyperparameters for our proposed modules. The values determined here were fixed for all subsequent ablation and comparison experiments to ensure a fair evaluation. First, we set the number of stacked IFMA and ADS blocks to 2, which halves the input size in both the temporal and spatial dimensions at each stage, resulting in a 4 × 56 × 56 × C feature map that matches the classifier input dimensions. Then, we use SlowOnly [35] as the classifier in this part of the experiment to classify the features.

4.1.1. Embedding Channel Number C

To appropriately determine the number of feature channels in the IFMA module, different channel values were evaluated on two datasets. As shown in Figure 6, the highest accuracy on the Jester dataset was achieved when C = 96. Although peak performance on EgoGesture occurred at C = 192, we selected C = 96 for all subsequent experiments to maintain a favorable balance between computational cost and performance.

4.1.2. Loss Function Coefficient λ

An evaluation was conducted to determine the optimal weight coefficient λ for the proposed IMAW loss. The results, shown in Figure 7, indicate that the best performance was achieved when λ = 0.25. As λ increased further, the overall performance slightly decreased. Therefore, we set λ = 0.25 for subsequent experiments.

4.1.3. Number of Feature Extractor Layers N

The feature extractor in the IFMA module can capture local patterns. However, as N increases, gradients gradually vanish and the computational complexity increases, leading to poor model convergence. To select the optimal value for N, corresponding experiments were conducted on two datasets, as shown in Table 1. Both the EgoGesture and Jester datasets achieved the best performance when N = 2. Hence, N = 2 was used for the subsequent experiments.

4.2. Ablation Study

To validate the effectiveness of each proposed module, ablation experiments were conducted. In this section, we divide the experiments into two parts: single-modal and multimodal. The former evaluates the proposed IFMA, ASDS, and ATDS on RGB modality data, while the latter evaluates the proposed AMDS and IMAW on multimodal data.

4.2.1. Single-Modal Modules

In this subsection, we describe how we validated each proposed single-modal module on the EgoGesture and Jester datasets. Table 2 presents the effects of different module combinations, using the SlowOnly [35] model as the classifier. The outcomes show that each module contributes to the overall hand gesture recognition model. The inclusion of the IFMA module greatly raises the accuracy of hand gesture recognition. This may be attributed to the advantage that the attentional mechanism reduces the search scope to hand-related regions, enabling the model to extract more accurate features. Although the individual use of ASDS and ATDS may not show a significant improvement, when combined with IFMA, they improve the accuracy of hand gesture recognition. This is because the feature maps in the absence of IFMA include too many redundant features, making it challenging for ASDS and ATDS to avoid combining redundant and useful features, which in turn affects classification accuracy. However, using the IFMA can filter out many redundant features to a large extent, and it makes the attention mechanism work in a smaller search scope, thereby obtaining effective and refined features. On this basis, ASDS and ATDS can better leverage their advantages, by optimizing the features and reducing the feature map size input to the classifier.
Additionally, Figure 8 provides a clearer curve of the accuracy improvement brought by each module. As shown in Figure 8, with the addition of IFMA, ASDS, and ATDS to SlowOnly, the accuracy curve shows an upward trend, which indicates that each of the proposed modules brings a significant contribution to the accuracy of the final classification.

4.2.2. Multimodal Modules

To demonstrate the effectiveness of the proposed multimodal methods, AMDS and IMAW loss, we conducted ablation experiments on different modalities using the EgoGesture and NVGesture datasets. In Table 3, we present the evaluation results of RGB modality and depth modality. We use the most commonly used late fusion as the baseline for multimodal fusion methods; it is denoted as FC in Table 3. From the table, it can be observed that the AMDS fusion method achieved a 0.9% improvement in the Top-1 accuracy compared to the late fusion method on both datasets. This is due to the adaptive capacity of the AMDS self-attention mechanism. Additionally, IMAW increased both fusion approaches’ accuracy, indicating a key role for the attention weight sharing process in IMAW. When both AMDS and IMAW were used together, the Top-1 accuracy was improved by 1.5% and 1.9% on the two datasets compared to the traditional late fusion method. This could be due to the fact that the proposed AMDS and IMAW loss enhance the accuracy of multimodal feature extraction, and these two multimodal methods work well together without mutual exclusion.

4.3. Comparison with State-of-the-Art Methods

To validate the effectiveness of our proposed model, in this section, we compared it with state-of-the-art methods in both single-modal and multimodal scenarios.

4.3.1. Compared Methods and Naming Convention

We compare our work against several established gesture recognition models, including 2D CNN-based (VGG16 [36], TSN [37]), 3D CNN-based (C3D [38], 3D ResNet-50 [39], I3D [40]), and efficient modern architectures (SlowOnly [35], CatNet [41], TEA [42], TSM [43], X3D [44], ACTION-Net [45]). To ensure clarity and consistency, we adopt the following naming convention:
  • Backbone: Refers to an existing SotA model (e.g., SlowOnly) without our modules.
  • Ours+Backbone: Refers to a single-modal (RGB) backbone enhanced with our proposed IFMA, ASDS, and ATDS modules. For example, Ours+SlowOnly denotes the SlowOnly model integrated with our complete single-modal framework.
  • Ours+SlowOnly (RGB-D): Refers to our full multimodal model built upon the SlowOnly architecture, utilizing both the AMDS module for fusion and the IMAW loss for training.

4.3.2. Single-Modal (RGB) Performance

Table 4 summarizes the experimental results on the EgoGesture and Jester datasets using only the RGB modality. Our proposed modules, when integrated with modern backbones like X3D [44], ACTION-Net [45], and SlowOnly [35], consistently improved performance. For instance, Ours+SlowOnly achieved a Top-1 accuracy of 95.2% on EgoGesture and 98.3% on Jester, outperforming the original SlowOnly by 0.6% and 1.1%, respectively. While these improvements may seem modest, it is important to note that on mature and competitive benchmarks like these, even gains of less than 1–2% are considered valuable contributions. The consistent performance uplift across three different backbones indicates that the improvements are a direct result of our modules’ ability to extract more potent features, rather than being an artifact of a single architecture. The confusion matrices in Figure 9 further illustrate this performance gain, where the Ours+SlowOnly model shows a stronger concentration of predictions along the main diagonal compared to the baseline SlowOnly. The 3D bar graph (Log-scaled for clarity) visually confirms this, as taller blue bars indicate more correct predictions.

4.3.3. Multimodal (RGB-D) Performance

Table 5 presents the results of our multimodal experiments on the EgoGesture and NVGesture datasets. Building upon our single-modal results where Ours+SlowOnly was the top-performing model, we use this architecture as the foundation for our multimodal system.
Our full multimodal model, Ours+SlowOnly (RGB-D), demonstrated superior performance compared to other methods. On the EgoGesture and NVGesture datasets, our model achieved Top-1 accuracies of 95.8% and 84.2%, representing improvements of 0.7% and 0.4% over the strongest baseline, respectively. This highlights that our proposed AMDS and IMAW modules can effectively fuse features from different modalities to capture richer, more comprehensive representations for gesture recognition.

5. Discussion

Our experimental results demonstrate the effectiveness of the proposed framework, consistently improving gesture recognition accuracy across multiple datasets and backbone architectures. The primary success of our model comes from efficiently pruning irrelevant information by focusing on motion. The IFMA module calculates pixel-level similarity between adjacent frames, effectively isolating moving hands from static or minimally moving backgrounds. Ablation studies (Table 2) confirm that IFMA provides the largest performance gain among the single-modal components. Combined with the ADS module, IFMA enables more effective dimensionality reduction while preserving salient motion features; without this motion-based filtering, such down-sampling would be less effective on noisy inputs.
In multimodal scenarios, the proposed IMAW loss introduces cross-modal supervision by forcing RGB and Depth streams to learn from each other’s attention distributions. This shared-weights strategy strengthens feature robustness, particularly when one modality struggles due to poor lighting or occlusion. As a result, multimodal fusion produces complementary and more reliable features, leading to the superior performance observed in our experiments (Table 3 and Table 5).
Despite these improvements, our study has some limitations. The pixel-wise similarity calculation in IFMA is computationally intensive, which may hinder real-time deployment on resource-constrained devices. The ADS module currently operates separately on 2D spatial and 1D temporal dimensions; more integrated 3D approaches could capture complex spatiotemporal dynamics more effectively and simplify the architecture. Despite the strong performance of our RGB-D model, other modalities—such as skeletal poses, infrared, IMU, or audio—were not considered. Incorporating these could provide complementary information, improve robustness under occlusion or low lighting, and is left for future work. Additionally, while our model shows strong benchmark performance, its robustness under extreme real-world conditions—such as severe blur, occlusions, or multiple gesturing individuals—remains to be thoroughly evaluated.
The broader implication of this work is that using explicit motion cues to guide attention is highly generalizable. This approach could be adapted for other video understanding tasks where moving objects are the primary focus, such as action recognition, vehicle tracking, or animal behavior analysis. Our results emphasize a shift toward specialized, domain-aware attention mechanisms that exploit the inherent properties of the target task. This highlights the importance of designing task-specific attention mechanisms rather than relying solely on generic spatiotemporal attention.

6. Conclusions and Future Work

In the present study, we propose a multimodal hand gesture recognition model combining inter-frame motion and shared attention weights for improved accuracy. This model integrates the IFMA module with ADS, leveraging the characteristic of obvious hand movement to reduce the search scope in the spatiotemporal dimension. This guides the model to focus on extracting features from moving hand-related regions, effectively reducing the interference of redundant information in hand motion feature extraction. Additionally, the proposed IMAW loss allows the RGB modality and depth modality to achieve more accurate hand feature extraction by sharing attention weights. We compared our proposed model with state-of-the-art dynamic hand gesture recognition methods on the EgoGesture, Jester, and NVGesture datasets. The results showed that the accuracy of our suggested model was significantly improved.
Future research can concentrate on minimizing the computing load and latency in hand gesture recognition. The IFMA module could be improved by using compression strategies like model pruning and knowledge distillation. Additionally, incorporating 3D sliding windows instead of 2D sliding windows into different-dimensional ADS modules would simplify the model architecture and enhance the real-time performance of hand gesture recognition systems, making them more suitable for practical applications. While our method shows strong performance, it relies on the quality of depth data; low-quality or noisy depth input could reduce recognition accuracy. Our method focuses on RGB and depth modalities; exploring additional modalities could further enhance gesture recognition accuracy and robustness in complex real-world scenarios.

Author Contributions

Conceptualization, X.Z. (Xiaorui Zhang); Methodology, S.L.; Software, X.Z. (Xianglong Zeng); Validation, P.L.; Formal analysis, W.S.; Writing—original draft preparation, X.Z. (Xianglong Zeng); Funding acquisition, X.Z. (Xiaorui Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China [grant numbers 62272236, 62376128] and in part by the Natural Science Foundation of Jiangsu Province [grant numbers BK20201136, BK20191401].

Data Availability Statement

Data are contained within the article.

Acknowledgments

X.Z. (Xiaorui Zhang) sincerely thanks all the individuals who have contributed to this and other related projects.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rahman, M.M.; Uzzaman, A.; Khatun, F.; Aktaruzzaman, M.; Siddique, N. A Comparative Study of Advanced Technologies and Methods in Hand Gesture Analysis and Recognition Systems. Expert Syst. Appl. 2025, 266, 125929. [Google Scholar] [CrossRef]
  2. Xu, C.; Wu, X.; Wang, M.; Qiu, F.; Liu, Y.; Ren, J. Improving Dynamic Gesture Recognition in Untrimmed Videos by an Online Lightweight Framework and a New Gesture Dataset ZJUGesture. Neurocomputing 2023, 523, 58–68. [Google Scholar] [CrossRef]
  3. Qi, J.; Ma, L.; Cui, Z.; Yu, Y. Computer Vision-Based Hand Gesture Recognition for Human-Robot Interaction: A Review. Complex Intell. Syst. 2024, 10, 1581–1606. [Google Scholar] [CrossRef]
  4. Yang, L.I.; Huang, J.; Feng, T.; Hong-An, W.; Guo-Zhong, D.A.I. Gesture Interaction in Virtual Reality. Virtual Real. Intell. Hardw. 2019, 1, 84–112. [Google Scholar] [CrossRef]
  5. Sharma, S.; Singh, S. Vision-Based Hand Gesture Recognition Using Deep Learning for the Interpretation of Sign Language. Expert Syst. Appl. 2021, 182, 115657. [Google Scholar] [CrossRef]
  6. Hashi, A.O.; Hashim, S.Z.M.; Asamah, A.B. A Systematic Review of Hand Gesture Recognition: An Update from 2018 to 2024. IEEE Access 2024, 12, 143599–143626. [Google Scholar] [CrossRef]
  7. Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.-G. Svformer: Semi-Supervised Video Transformer for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18816–18826. [Google Scholar]
  8. Gedamu, K.; Ji, Y.; Gao, L.; Yang, Y.; Shen, H.T. Relation-Mining Self-Attention Network for Skeleton-Based Human Action Recognition. Pattern Recognit. 2023, 139, 109455. [Google Scholar] [CrossRef]
  9. Esteva, A.; Chou, K.; Yeung, S.; Naik, N.; Madani, A.; Mottaghi, A.; Liu, Y.; Topol, E.; Dean, J.; Socher, R. Deep Learning-Enabled Medical Computer Vision. npj Digit. Med. 2021, 4, 5. [Google Scholar] [CrossRef]
  10. Zhao, D.; Yang, Q.; Zhou, X.; Li, H.; Yan, S. A Local Spatial–Temporal Synchronous Network to Dynamic Gesture Recognition. IEEE Trans. Comput. Soc. Syst. 2022, 10, 2226–2233. [Google Scholar] [CrossRef]
  11. Dong, S.; Wang, P.; Abbas, K. A Survey on Deep Learning and Its Applications. Comput. Sci. Rev. 2021, 40, 100379. [Google Scholar] [CrossRef]
  12. Rastgoo, R.; Kiani, K.; Escalera, S. Sign Language Recognition: A Deep Survey. Expert Syst. Appl. 2021, 164, 113794. [Google Scholar] [CrossRef]
  13. Niu, Z.; Zhong, G.; Yu, H. A Review on the Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  14. Saini, M.; Fatemi, M.; Alizad, A. Fast Inter-Frame Motion Correction in Contrast-Free Ultrasound Quantitative Microvasculature Imaging Using Deep Learning. Sci. Rep. 2024, 14, 26161. [Google Scholar] [CrossRef] [PubMed]
  15. Shao, Z.; Zhu, H.; Zhou, Y.; Xiang, X.; Liu, B.; Yao, R.; Ma, L. Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample. Int. J. Comput. Vis. 2025, 133, 1711–1726. [Google Scholar] [CrossRef]
  16. Wang, Y.; Yang, G.; Li, S.; Li, Y.; He, L.; Liu, D. Arrhythmia Classification Algorithm Based on Multi-Head Self-Attention Mechanism. Biomed. Signal Process. Control 2023, 79, 104206. [Google Scholar] [CrossRef]
  17. Li, X.; Li, M.; Yan, P.; Li, G.; Jiang, Y.; Luo, H.; Yin, S. Deep Learning Attention Mechanism in Medical Image Analysis: Basics and Beyonds. Int. J. Netw. Dyn. Intell. 2023, 2, 93–116. [Google Scholar] [CrossRef]
  18. Chen, Y.; Zhao, L.; Peng, X.; Yuan, J.; Metaxas, D.N. Construct Dynamic Graphs for Hand Gesture Recognition via Spatial-Temporal Attention. arXiv 2019, arXiv:1907.08871. [Google Scholar] [CrossRef]
  19. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action-Gesture Recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  20. Miah, A.S.M.; Hasan, M.A.M.; Shin, J.; Okuyama, Y.; Tomioka, Y. Multistage Spatial Attention-Based Neural Network for Hand Gesture Recognition. Computers 2023, 12, 13. [Google Scholar] [CrossRef]
  21. Zhang, W.; Lin, Z.; Cheng, J.; Ma, C.; Deng, X.; Wang, H. STA-GCN: Two-Stream Graph Convolutional Network with Spatial–Temporal Attention for Hand Gesture Recognition. Vis. Comput. 2020, 36, 2433–2444. [Google Scholar] [CrossRef]
  22. Ohn-Bar, E.; Trivedi, M.M. Hand Gesture Recognition in Real Time for Automotive Interfaces: A Multimodal Vision-Based Approach and Evaluations. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2368–2377. [Google Scholar] [CrossRef]
  23. Miao, Q.; Li, Y.; Ouyang, W.; Ma, Z.; Xu, X.; Shi, W.; Cao, X. Multimodal Gesture Recognition Based on the Resc3d Network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3047–3055. [Google Scholar]
  24. Zhang, X.; Zeng, X.; Sun, W.; Ren, Y.; Xu, T. Multimodal Spatiotemporal Feature Map for Dynamic Gesture Recognition. Comput. Syst. Sci. Eng. 2023, 46, 671–686. [Google Scholar] [CrossRef]
  25. Zhang, W.; Wang, J.; Lan, F. Dynamic Hand Gesture Recognition Based on Short-Term Sampling Neural Networks. IEEECAA J. Autom. Sin. 2020, 8, 110–120. [Google Scholar] [CrossRef]
  26. Elboushaki, A.; Hannane, R.; Afdel, K.; Koutti, L. MultiD-CNN: A Multi-Dimensional Feature Learning Approach Based on Deep Convolutional Networks for Gesture Recognition in RGB-D Image Sequences. Expert Syst. Appl. 2020, 139, 112829. [Google Scholar] [CrossRef]
  27. Yu, Z.; Zhou, B.; Wan, J.; Wang, P.; Chen, H.; Liu, X.; Li, S.Z.; Zhao, G. Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition. IEEE Trans. Image Process. 2021, 30, 5626–5640. [Google Scholar] [CrossRef] [PubMed]
  28. Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. TMMF: Temporal Multi-Modal Fusion for Single-Stage Continuous Gesture Recognition. IEEE Trans. Image Process. 2021, 30, 7689–7701. [Google Scholar] [CrossRef]
  29. Li, J.; Xie, X.; Pan, Q.; Cao, Y.; Zhao, Z.; Shi, G. SGM-Net: Skeleton-Guided Multimodal Network for Action Recognition. Pattern Recognit. 2020, 104, 107356. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Trans. Multimed. 2018, 20, 1038–1050. [Google Scholar] [CrossRef]
  31. Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3d Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215. [Google Scholar]
  32. Materzynska, J.; Berger, G.; Bax, I.; Memisevic, R. The Jester Dataset: A Large-Scale Video Dataset of Human Gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  35. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  36. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  37. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9912, pp. 20–36. [Google Scholar]
  38. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  39. Hara, K.; Kataoka, H.; Satoh, Y. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3154–3160. [Google Scholar]
  40. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  41. Abavisani, M.; Joze, H.R.V.; Patel, V.M. Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1165–1174. [Google Scholar]
  42. Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; Wang, L. TEA: Temporal Excitation and Aggregation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 909–918. [Google Scholar]
  43. Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
  44. Feichtenhofer, C. X3D: Expanding Architectures for Efficient Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
  45. Wang, Z.; She, Q.; Smolic, A. Action-Net: Multipath Excitation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13214–13223. [Google Scholar]
Figure 1. Examples from the EgoGesture dataset, illustrating diverse recording scenarios.
Figure 2. Example sequence from the NVGesture dataset, showing dynamic gestures.
Figure 3. Examples from the Jester dataset, showing common gesture categories.
Figure 4. Overall network architecture with the proposed inter-frame motion attention module. The upper part illustrates the full model pipeline, while the lower part details the structure and operation of IFMA, including the feature extractor and attention calculator.
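For readers who want a concrete picture of the two components named in the Figure 4 caption, the following minimal PyTorch-style sketch shows one way an inter-frame motion attention block of this kind can be organized: frame differences pass through a small convolutional feature extractor, an attention calculator maps them to a spatial weighting in [0, 1], and the weighting modulates the backbone features. The class name, channel sizes, frame-difference motion cue, and residual-style re-weighting are illustrative assumptions, not the exact implementation evaluated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterFrameMotionAttentionSketch(nn.Module):
    """Illustrative sketch: spatial attention derived from inter-frame motion."""

    def __init__(self, in_channels: int, hidden_channels: int = 16, num_layers: int = 2):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(num_layers):  # small feature extractor over motion maps
            layers += [nn.Conv2d(c, hidden_channels, 3, padding=1), nn.ReLU(inplace=True)]
            c = hidden_channels
        self.feature_extractor = nn.Sequential(*layers)
        # Attention calculator: collapse to a single-channel spatial map in [0, 1].
        self.attention_calculator = nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) raw clip; features: (B, T, C', H', W') backbone features.
        motion = (frames[:, 1:] - frames[:, :-1]).abs()       # moving (hand) regions light up
        motion = torch.cat([motion, motion[:, -1:]], dim=1)   # pad back to T frames
        b, t, c, h, w = motion.shape
        attn = self.attention_calculator(self.feature_extractor(motion.reshape(b * t, c, h, w)))
        attn = F.interpolate(attn, size=features.shape[-2:], mode="bilinear", align_corners=False)
        attn = attn.reshape(b, t, 1, *features.shape[-2:])
        return features * (1.0 + attn)                        # residual-style re-weighting
```

In this sketch the motion cue is a plain absolute frame difference; any lightweight motion representation could take its place without changing the overall structure.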
Figure 5. Examples of the proposed down-sampling modules: (a) ASDS for spatial feature reduction, (b) ATDS for temporal feature fusion and reduction, and (c) AMDS for modality-aware down-sampling. All three modules share the same network structure but use different sliding windows to partition and sample features.
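To make the shared structure described in the Figure 5 caption more tangible, the snippet below sketches a window-based adaptive down-sampling step for the spatial case: features are partitioned by a non-overlapping sliding window, a learned 1×1 scorer assigns per-position importances, and each window is collapsed by the resulting soft selection. The temporal (ATDS) and modality-wise (AMDS) variants would slide the window over a different axis. Names, window sizes, and the softmax-based selection are assumptions for exposition, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWindowDownSample(nn.Module):
    """Illustrative sketch of window-based adaptive (learned) down-sampling."""

    def __init__(self, channels: int, window: int = 2):
        super().__init__()
        self.window = window
        self.scorer = nn.Conv2d(channels, channels, kernel_size=1)  # per-position importance

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W divisible by the window size.
        k = self.window
        scores = self.scorer(x)                              # (B, C, H, W)
        x_win = F.unfold(x, kernel_size=k, stride=k)         # (B, C*k*k, L)
        s_win = F.unfold(scores, kernel_size=k, stride=k)    # (B, C*k*k, L)
        b, _, l = x_win.shape
        c = x.shape[1]
        x_win = x_win.reshape(b, c, k * k, l)
        s_win = s_win.reshape(b, c, k * k, l).softmax(dim=2)  # soft selection inside each window
        out = (x_win * s_win).sum(dim=2)                       # (B, C, L)
        h_out, w_out = x.shape[2] // k, x.shape[3] // k
        return out.reshape(b, c, h_out, w_out)
```

For example, a (2, 64, 56, 56) feature map with a window of 2 is reduced to (2, 64, 28, 28), with each output value being a learned weighted combination of its 2 × 2 window rather than a fixed average or stride.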
Figure 6. Effect of feature channel number in the IFMA module on recognition accuracy.
Figure 7. Effect of the inter-modal attention loss weight λ on recognition accuracy.
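Figure 7 studies the weight λ that balances the inter-modal attention term against the classification objectives. One plausible form of such a combined objective is sketched below, assuming per-modality cross-entropy terms plus a λ-weighted penalty on the discrepancy between the RGB and Depth attention maps; the function name, the MSE discrepancy, and the softmax normalization are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def imaw_style_loss(attn_rgb: torch.Tensor,
                    attn_depth: torch.Tensor,
                    logits_rgb: torch.Tensor,
                    logits_depth: torch.Tensor,
                    labels: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    """Hypothetical combined objective: per-modality cross-entropy plus a
    lambda-weighted term that pulls the two modalities' attention maps together."""
    ce = F.cross_entropy(logits_rgb, labels) + F.cross_entropy(logits_depth, labels)
    # Normalize each attention map to a distribution before comparing them.
    p_rgb = attn_rgb.flatten(1).softmax(dim=1)
    p_depth = attn_depth.flatten(1).softmax(dim=1)
    share = F.mse_loss(p_rgb, p_depth)   # discrepancy between the shared attention weights
    return ce + lam * share
```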
Figure 8. Accuracy improvement contributed by each proposed module (IFMA, ASDS, and ATDS) on the EgoGesture and Jester datasets.
Figure 9. Confusion matrix and 3D confusion bar graph showing the performance improvement of the SlowOnly model when combined with the proposed IFMA and ADS modules.
Table 1. Effect of the number of feature extractor layers (N) in the IFMA module on recognition accuracy (%).

| N | EgoGesture Top-1 | EgoGesture Top-5 | Jester Top-1 | Jester Top-5 |
|---|------------------|------------------|--------------|--------------|
| 0 | 92.2 | 96.8 | 94.3 | 98.2 |
| 1 | 94.2 | 98.8 | 96.2 | 98.6 |
| 2 | 95.2 | 99.9 | 98.3 | 99.9 |
| 3 | 94.5 | 98.9 | 96.5 | 98.6 |
| 4 | 89.3 | 96.2 | 91.2 | 98.0 |
Table 2. Performance of different single-modal module combinations (IFMA, ASDS, and ATDS) on the EgoGesture and Jester datasets (accuracy, %). “√” indicates that the module is used; “-” indicates that the module is not used.

| IFMA | ASDS | ATDS | EgoGesture Top-1 | EgoGesture Top-5 | Jester Top-1 | Jester Top-5 |
|------|------|------|------------------|------------------|--------------|--------------|
| - | - | - | 94.6 | 99.8 | 97.2 | 99.9 |
| √ | - | - | 94.9 | 99.8 | 97.5 | 99.9 |
| - | √ | - | 93.2 | 98.6 | 95.8 | 99.8 |
| - | - | √ | 92.4 | 97.3 | 95.6 | 99.8 |
| √ | √ | - | 95.1 | 99.9 | 97.6 | 99.9 |
| √ | - | √ | 95.0 | 99.9 | 97.9 | 99.9 |
| - | √ | √ | 89.8 | 96.8 | 93.2 | 98.6 |
| √ | √ | √ | 95.2 | 99.9 | 98.3 | 99.9 |
Table 3. Ablation study of the proposed multimodal methods (AMDS and IMAW) on the EgoGesture and NVGesture datasets (accuracy, %). “√” indicates that the module is used; “-” indicates that the module is not used.

| RGB | Depth | FC | AMDS | IMAW | EgoGesture Top-1 | EgoGesture Top-5 | NVGesture Top-1 | NVGesture Top-5 |
|-----|-------|----|------|------|------------------|------------------|-----------------|-----------------|
| √ | - | - | - | - | 95.2 | 99.9 | 80.2 | 92.6 |
| - | √ | - | - | - | 93.5 | 98.3 | 78.5 | 91.3 |
| √ | √ | √ | - | - | 94.3 | 98.6 | 82.3 | 92.7 |
| √ | √ | - | √ | - | 95.2 | 99.9 | 83.2 | 96.5 |
| √ | √ | √ | - | √ | 95.3 | 99.9 | 83.6 | 96.8 |
| √ | √ | - | √ | √ | 95.8 | 99.9 | 84.2 | 98.6 |
Table 4. Performance metrics of the proposed inter-modal attention weight loss model (“Ours”) compared with selected state-of-the-art gesture recognition models (accuracy, %).

| Methods | EgoGesture Top-1 | EgoGesture Top-5 | Jester Top-1 | Jester Top-5 |
|---------|------------------|------------------|--------------|--------------|
| VGG16 [36] | 63.1 | 95.4 | 67.2 | 96.3 |
| C3D [38] | 86.4 | 98.6 | 87.6 | 98.3 |
| CatNet [41] | 90.1 | 99.1 | 91.2 | 98.5 |
| TEA [42] | 92.1 | 98.2 | 96.7 | 99.8 |
| TSN [37] | 79.6 | 98.3 | 82.1 | 98.7 |
| TSM [43] | 92.2 | 98.6 | 94.5 | 99.8 |
| X3D [44] | 93.5 | 98.5 | 95.6 | 98.5 |
| Ours+X3D | 94.5 | 99.8 | 95.8 | 98.8 |
| ACTION-Net [45] | 94.4 | 98.8 | 97.1 | 99.9 |
| Ours+ACTION-Net | 95.1 | 99.8 | 98.2 | 99.9 |
| SlowOnly [35] | 94.6 | 99.8 | 97.2 | 99.9 |
| Ours+SlowOnly | 95.2 | 99.9 | 98.3 | 99.9 |
Table 5. Comparison of multimodal hand gesture recognition methods on the EgoGesture and NVGesture datasets (accuracy, %).

| Methods | Modality | EgoGesture Top-1 | EgoGesture Top-5 | NVGesture Top-1 | NVGesture Top-5 |
|---------|----------|------------------|------------------|-----------------|-----------------|
| 3D ResNet-50 [39] | RGB-D | 86.2 | 97.5 | 79.6 | 92.6 |
| C3D [38] | RGB-D | 88.7 | 98.2 | 80.2 | 94.6 |
| I3D+FC [40] | RGB-D | 92.6 | 98.6 | 83.6 | 96.5 |
| SlowOnly+FC [35] | RGB-D | 95.1 | 99.9 | 83.8 | 98.3 |
| Ours+SlowOnly | RGB-D | 95.8 | 99.9 | 84.2 | 98.6 |