The proposed system, called 3D-Jointsformer, enables real-time recognition of hand gestures from color video sequences using a standard CPU, in contrast to other approaches that require specialized hardware such as Graphical Processing Units (GPUs). The system consists of two modules (see Figure 1): Hand Skeleton Detection and Hand Gesture Recognition. In the first module, Hand Skeleton Detection, a set of 3D key points, or joints, representing the hand skeleton is inferred from every image, yielding a sequence of hand skeletons from the video stream. The second module, Hand Gesture Recognition, processes and classifies this sequence into one of the predetermined hand gestures. This recognition module consists of three blocks. The first block, the Local Spatio-Temporal Embedding Estimator, employs a specially designed 3D-CNN architecture to compute local embeddings that encode subsets of skeleton joints within a short time span. These embeddings serve as local high-level semantic representations of the hand skeleton sequence. The second block, the Long-Term Embedding Estimator, utilizes a Transformer-based architecture to efficiently capture long-term dependencies among all the previous local embeddings, producing a final, holistic feature-based representation of the hand skeleton sequence. The third and final block, hand gesture classification, combines all the information from this holistic feature to infer the class of the performed hand gesture. Overall, 3D-Jointsformer offers real-time hand gesture recognition without compromising accuracy, using readily available CPU resources.
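As an illustrative aid, the following Python (PyTorch-style) sketch shows how the two modules compose. The function detect_hand_skeleton, the joint count, and the model interface are hypothetical placeholders rather than the actual implementation.

```python
import numpy as np
import torch

def detect_hand_skeleton(frame: np.ndarray) -> np.ndarray:
    """Hypothetical per-frame detector returning a (V, 3) array of 3D joints;
    stands in for any hand-pose estimator (21 joints is an assumption)."""
    raise NotImplementedError

def recognize_gesture(frames, model: torch.nn.Module) -> int:
    # Module 1: Hand Skeleton Detection -> (T, V, 3) joint sequence
    skeletons = np.stack([detect_hand_skeleton(f) for f in frames])
    # Module 2: Hand Gesture Recognition on a (1, C, T, V) tensor
    x = torch.from_numpy(skeletons).float().permute(2, 0, 1).unsqueeze(0)
    logits = model(x)                    # 3D-CNN + Transformer + classifier
    return int(logits.argmax(dim=-1))    # index of the predicted gesture
```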
3.2. Hand Gesture Recognition
The hand gesture recognition module estimates the category of the performed gesture from a sequence of 3D skeletons through three different stages. First, the Local Spatio-Temporal Embedding Estimator computes local embeddings that encode subsets of skeleton joints within short time spans, resulting in high-level semantic representations of subparts of a hand skeleton sequence. These embeddings are obtained using a 3D-CNN, a modified implementation of SlowFast Networks, originally conceived for video recognition [48]. The original SlowFast architecture comprises two streams: the Slow Pathway and the Fast Pathway. The Slow Pathway processes frames at a lower frame rate and aims to capture broader motion patterns and context changes in the video, whereas the Fast Pathway operates at a higher frame rate and is designed to capture fast and subtle motion patterns. The backbone of the original SlowFast combines 3D convolutional layers for the Slow Pathway with 2D convolutional layers for the Fast Pathway, and the outputs from both pathways are fused to obtain the final prediction. While this architecture has shown great success in recognizing actions performed by the entire body, it is not ideal for capturing the fine-grained movements specific to hand gestures. Therefore, several key modifications were designed and implemented to adapt the existing framework to computing Local Spatio-Temporal Embeddings of hand skeleton sequences.
To account for the nature of the hand skeleton, four modifications were introduced to the original architecture. The first is to adopt only the Slow Pathway stream, which suffices for the purpose of this local embedding module: computing Local Spatio-Temporal Embeddings from subsets of joints. Using only the Slow Pathway reduces the overall computational complexity of the model, facilitating real-time operation. The second modification reduces the depth of the network by using fewer convolutional blocks, since the number of joints is much smaller than the number of pixels in an image; this avoids the overfitting associated with excessive network parameters while further reducing computational complexity. The third modification reduces the width of the network by using fewer channels, for similar reasons. The fourth modification uses a smaller kernel size, which reduces the receptive field so as to focus on capturing more local spatial information.
The overall architecture of the modified system aims to capture finer spatial features of the input skeleton sequence. It does so by processing frames at a lower temporal resolution in the earlier layers of the network while increasing the temporal resolution in the deeper layers to capture fine-grained temporal dynamics and hand motion information. The detailed architecture of the proposed 3D-CNN network, including the specific kernel sizes and channels, is shown in Figure 2.
Based on the optimal trade-off between computational efficiency and performance, the final setting of the proposed lightweight Local Spatio-Temporal Embedding Estimator is composed of four CNN blocks. The first block, also known as the stem layer, processes an input tensor of size $C \times T \times V$ representing a hand skeleton sequence, where C stands for the channel, T for the temporal, and V for the spatial (joint) dimension. The kernel size applied is $1 \times 3$ (temporal $\times$ spatial), considering neighborhoods of three joints, since hand gestures are finer than whole-body actions and operate on smaller spatial dimensions. Note that the temporal stride is set to one to preserve the temporal resolution, while the spatial stride is three, allowing the model to capture global patterns of the hand joints inside each frame. The second block stacks several 3D-CNN layers with kernel sizes of $1 \times 1$ and $1 \times 3$, focusing on processing local spatial joint information without mixing the temporal aspect. The last two blocks, 3 and 4, introduce a kernel size of $3 \times 3$, combining temporal information and increasing the temporal receptive field to capture a wider temporal context.
For all blocks, batch normalization (BN) and rectified linear unit (ReLU) layers are applied after each convolutional layer, helping to regularize the model. Moreover, residual connections are adopted after each block to alleviate the problem of vanishing gradients and stabilize network training. Consequently, the output of the 3D-CNN backbone is a feature map, $S$, of size $T \times D$, where D represents the embedding dimension of the model.
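A minimal sketch of one such convolutional block is given below, assuming the joint coordinates act as input channels so that the convolution operates over the temporal and joint axes; the class name and the default kernel, stride, and channel values are illustrative assumptions inferred from the text.

```python
import torch
import torch.nn as nn

class SkelConvBlock(nn.Module):
    """One conv block of the local embedding estimator: Conv + BN + ReLU
    with a residual connection. Kernels are (temporal, joint) pairs."""
    def __init__(self, in_ch, out_ch, kernel=(1, 3), stride=(1, 1)):
        super().__init__()
        pad = (kernel[0] // 2, kernel[1] // 2)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, pad)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the residual matches when channels/stride change
        self.proj = (nn.Identity() if in_ch == out_ch and stride == (1, 1)
                     else nn.Conv2d(in_ch, out_ch, 1, stride))
    def forward(self, x):  # x: (batch, C, T, V)
        return self.relu(self.bn(self.conv(x)) + self.proj(x))

# Example: a stem block with temporal stride 1 and spatial stride 3
stem = SkelConvBlock(in_ch=3, out_ch=64, kernel=(1, 3), stride=(1, 3))
x = torch.randn(8, 3, 32, 21)   # (batch, C=3 coords, T=32 frames, V=21 joints)
print(stem(x).shape)            # -> torch.Size([8, 64, 32, 7])
```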
The second stage, known as the Long-Term Embedding Estimator, focuses on inferring long-term temporal interactions from the previous feature map, S, which contains local spatio-temporal joint skeleton information. To achieve this, a Transformer neural network [49] is employed, consisting of multiple Transformer encoder blocks that determine the depth of the model. These encoder blocks refine and improve the representations by capturing increasingly complex patterns and long-range dependencies. Each encoder block utilizes a self-attention mechanism to draw long-term dependencies among the spatio-temporal joint skeleton embeddings. Specifically, the Long-Term Embedding Estimator comprises an initial embedding layer followed by N Transformer encoder blocks (as illustrated in Figure 3). The best performance was achieved with N = 2 encoder blocks. This result supports our hypothesis that the spatio-temporal embeddings obtained from the Local Spatio-Temporal Embedding Estimator already constitute high-level representations of the input skeleton sequence; adding more Transformer encoder blocks would propagate unnecessary information through the network while yielding an insignificant improvement in accuracy.
The embedding layer takes every component of the feature map, S, as a token and adds the positional embedding proposed by Vaswani et al. [49], which adopts sine and cosine functions of different frequencies. Defining PE as the positional embedding; p as the position of the embedded skeleton component within S; i as the index over the embedding dimensions; $d_{model}$ as the dimension of the output embedding space; and sin and cos as the sine and cosine functions, respectively, the positional embedding is computed as follows:

$$PE_{(p,\,2i)} = \sin\!\left(\frac{p}{10000^{2i/d_{model}}}\right), \qquad PE_{(p,\,2i+1)} = \cos\!\left(\frac{p}{10000^{2i/d_{model}}}\right) \quad (1)$$
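As an illustrative sketch, Equation (1) can be implemented as a fixed lookup table (assuming an even $d_{model}$); the function name is a placeholder.

```python
import math
import torch

def sinusoidal_positional_embedding(num_tokens: int, d_model: int) -> torch.Tensor:
    """Equation (1): fixed sine/cosine embedding, one row per position p."""
    pe = torch.zeros(num_tokens, d_model)
    position = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)  # p
    # 10000^(-2i/d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe  # added elementwise to the token embeddings
```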
Next, the resulting embeddings are passed through the attention blocks, where a normalization layer is applied, and three distinct new embeddings, namely, Query (Q), Key (K), and Value (V), are computed using fully connected layers. Subsequently, the dot product $QK^{T}$ is calculated to determine the pairwise similarities between the different components of the input sequence, S. To ensure proper scaling, the dot product is normalized by a factor of $\sqrt{d_k}$, where $d_k$ represents the dimension of each embedding (Q or K). The resulting values are referred to as attention scores, and they contain information about the affinity of each spatio-temporal joint skeleton embedding with respect to the others in the sequence. These attention scores are further processed using the softmax function, which enables them to represent the importance of each embedding in relation to the others. Finally, the attention scores are used as weights to integrate the most relevant spatio-temporal contextual information into each local joint skeleton embedding. This process is illustrated in Equation (2), where Att represents the attention function:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \quad (2)$$
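Equation (2) translates directly into code; the following sketch assumes batched tensors, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor,
                                 V: torch.Tensor) -> torch.Tensor:
    """Equation (2): Att(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V have shape (..., seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise similarities
    weights = F.softmax(scores, dim=-1)            # normalized attention scores
    return weights @ V                             # weighted aggregation
```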
The attention block illustrated in Equation (2) is applied in a parallel fashion, known as Multi-Head Attention (MHA) [49], by dividing the input embedding vector into several parts, which are processed by different heads, or attention blocks. Each head learns a different representation from a different perspective, and these representations are then linearly combined into one final feature vector. Specifically, eight heads are used, each with a dimension of 64. The attention outputs obtained from each head, $\mathrm{head}_j = \mathrm{Att}(sW_j^{Q}, sW_j^{K}, sW_j^{V})$, where s denotes the input sequence of the corresponding Transformer encoder block and $W_j^{Q}$, $W_j^{K}$, $W_j^{V}$ are learned projection matrices, are then concatenated using the Concat operation. This concatenated output is projected into a vector of dimension $d_{model}$ using the matrix of learned parameters $W^{O}$, as shown in Equation (3):

$$\mathrm{MHA}(s) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\, W^{O} \quad (3)$$

The number of attention heads, the embedding size of the model, and the head dimension were determined based on the settings commonly applied in the literature and tuned by the experimental observations performed in this work, taking into account the trade-off between model size and performance.
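A sketch of Equation (3), reusing the attention function above, could look as follows; fusing the per-head projections into single linear layers and the default $d_{model} = 512$ (eight heads of dimension 64) are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Equation (3): parallel attention heads, concatenated and projected
    by the learned matrix W^O."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # per-head projections W_j^Q, W_j^K, W_j^V, fused into one layer each
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection W^O
    def forward(self, s: torch.Tensor) -> torch.Tensor:  # s: (batch, seq, d_model)
        b, n, _ = s.shape
        split = lambda t: t.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.W_q(s)), split(self.W_k(s)), split(self.W_v(s))
        heads = scaled_dot_product_attention(q, k, v)      # sketched above
        concat = heads.transpose(1, 2).reshape(b, n, -1)   # Concat operation
        return self.W_o(concat)
```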
Each Transformer encoder block is composed of Multi-Head Attention (MHA) and Feed-Forward Network (FFN) blocks (Equations (4) and (5)). Given the input sequence, s, the feature representation, s′, obtained by the MHA block is passed to the FFN block, which globally aggregates all the spatio-temporal skeleton information. As shown in Equation (5), the FFN comprises a Multilayer Perceptron (MLP) and Layer Normalization (LN), where the MLP is composed of two fully connected layers with a Gaussian Error Linear Unit (GELU) activation function between them:

$$s' = \mathrm{MHA}(\mathrm{LN}(s)) + s \quad (4)$$

$$\mathrm{FFN}(s') = \mathrm{MLP}(\mathrm{LN}(s')) + s' \quad (5)$$

The output of each Transformer encoder block is fed to the subsequent one, and the dimension of the embedding remains consistent across all the encoder blocks.
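The whole encoder block of Equations (4) and (5) could then be sketched as follows, reusing the MHA module above; the fourfold hidden expansion of the MLP is an assumption.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block following Equations (4) and (5): MHA and an MLP
    (two fully connected layers with a GELU in between), each preceded by
    LayerNorm and wrapped in a residual connection."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.mha = MultiHeadAttention(d_model, num_heads)  # sketched above
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )
    def forward(self, s: torch.Tensor) -> torch.Tensor:
        s = s + self.mha(self.ln1(s))     # Equation (4)
        return s + self.mlp(self.ln2(s))  # Equation (5)
```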
In the final stage, the classifier head comprises a fully connected layer followed by a softmax layer, which produces a probability distribution over all hand gesture classes. The predicted gesture is the class with the highest probability.
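A minimal sketch of the classifier head follows; the mean pooling over the encoder tokens and the number of gesture classes are illustrative assumptions not specified above.

```python
import torch
import torch.nn as nn

class GestureClassifierHead(nn.Module):
    """Fully connected layer followed by softmax over the gesture classes."""
    def __init__(self, d_model: int = 512, num_classes: int = 14):
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (batch, seq, d_model)
        logits = self.fc(tokens.mean(dim=1))  # pool tokens, then project
        return torch.softmax(logits, dim=-1)  # probability per gesture class
```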