Article

HGREncoder: Enhancing Real-Time Hand Gesture Recognition with Transformer Encoder—A Comparative Study

by Luis Gabriel Macías *, Jonathan A. Zea, Lorena Isabel Barona, Ángel Leonardo Valdivieso and Marco E. Benalcázar
Artificial Intelligence and Computer Vision Research Lab, Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito 170143, Ecuador
* Author to whom correspondence should be addressed.
Math. Comput. Appl. 2025, 30(5), 101; https://doi.org/10.3390/mca30050101
Submission received: 18 July 2025 / Revised: 7 August 2025 / Accepted: 15 August 2025 / Published: 16 September 2025
(This article belongs to the Special Issue New Trends in Computational Intelligence and Applications 2025)

Abstract

In the field of Hand Gesture Recognition (HGR), Electromyography (EMG) is used to detect the electrical impulses that muscles emit when a movement is generated. Currently, there are several HGR models that use EMG to predict hand gestures. However, most of these models have limited performance in real-time applications, with the highest recognition rate achieved being 65.78 ± 15.15% without post-processing steps. Other non-generalizable models, i.e., those trained with a small number of users, achieved a window-based classification accuracy of 93.84%, but not in real-time applications. Therefore, this study addresses these issues by employing transformers to create a generalizable model and enhance recognition accuracy in real-time applications. The architecture of our model is composed of a Convolutional Neural Network (CNN), a positional encoding layer, and the transformer encoder. To obtain a generalizable model, the EMG-EPN-612 dataset was used. This dataset contains records of 612 individuals. Several experiments were conducted with different architectures, and our best results were compared with previous research that used CNN, LSTM, and transformers. The findings of this research reached a classification accuracy of 95.25 ± 4.9% and a recognition accuracy of 89.7 ± 8.77%. This recognition accuracy is a significant contribution because it is computed over the entire sequence without post-processing steps.

1. Introduction

Communication and control of electronic devices through Hand Gesture Recognition (HGR) are fundamental aspects of human–machine interaction. Gestures represent a natural and effective way to convey information and commands in various contexts. Therefore, HGR has a wide range of applications, such as surgery, video gaming, and sign language translation, among others [1,2].
However, implementing HGR systems presents several challenges, particularly for individuals who have experienced limb loss [3]. Hand gestures can be recognized using images, videos, or surface electromyographic (sEMG) signals, among others [4,5,6,7,8]. Although image or video-based approaches are common, they require the physical presence of the hand or limb to capture movement, making them unsuitable for amputees. In contrast, sEMG signals offer a key advantage: they can still be detected in individuals with partial or total hand amputation. This is possible because sEMG measures the electrical activity generated by muscles in response to neural stimulation, regardless of whether the limb is present or not [9,10,11].
These signals are used in HGR to predict hand gestures. Using EMG signals to create an HGR model is a good option in cases of mid- or low-transradial amputations [3]. Despite the existence of many models that predict movement using EMG, a fully functional model has not yet been achieved. Existing models have difficulty predicting movement, especially in real time, because EMG signals vary within the same person and between different people (that is, intrapersonal and interpersonal variations) [9,12,13].
Transformers are a type of Artificial Neural Network (ANN) that has proven successful in Natural Language Processing (NLP) due to its ability to consider context [14]. The context in an EMG signal is useful for predicting a gesture in real-time. The action of performing a gesture is important before, during, and after its execution [15,16]. Furthermore, transformers are based on attention mechanisms. Attention mechanisms can mitigate the problem of variation within a person and between different people.
However, transformer models have been trained on datasets involving only 40 or fewer participants, which is a relatively small number [6,15,17]. The use of a limited number of participants affects the generalizability of the model [11]. A generalizable and more accurate HGR model is crucial to ensure smooth interactions with electronic devices. To address these challenges, transformers are used in this research along with the EMG-EPN-612 dataset [18], which contains records of 612 individuals. This research aims to achieve the following objectives:

1.1. Research Objective

Develop a real-time HGR model for five right-hand gestures using a transformer neural network and EMG signals.

1.2. Specific Objectives

1.
Analyze the literature on neural network architectures for HGR, focusing on models that have employed transformer neural networks over the past five years, in order to gain a deeper understanding of the existing scientific literature on this topic.
2.
Develop a real-time HGR model for five right-hand gestures using a transformer neural network, using the EMG-EPN-612 dataset to address generalization challenges, remove spurious labels, and improve recognition accuracy.
3.
Compare the performance of the proposed model with existing models in the HGR field, evaluating their classification and recognition accuracy.

1.3. Hypothesis

An HGR model that uses transformer neural networks can achieve higher accuracy and better real-time recognition than existing models that use other neural network architectures.

1.4. Problems Addressed by the Study

This study develops an HGR model that uses transformers with attention mechanisms, significantly improving real-time recognition accuracy, eliminating the need for post-processing, and reducing the variability and classification errors present in other approaches. Additionally, the use of the EMG-EPN-612 dataset [18], which includes a substantial number of volunteers, helps create a generalizable model, overcoming the limitation of the small datasets used in previous studies.

2. Literature Review

Several methods have emerged to capture data related to hand gestures. The methods for capturing hand gesture data can be classified into the following two groups: invasive and non-invasive methods [19].
  • Invasive methods stand out for their ability to detect the electromyographic signal of a specific muscle, which allows muscle movements to be monitored with great precision. Invasive methods necessitate the insertion of intramuscular electrodes, which demands specialized skills from the installer [19,20].
  • Non-invasive methods are those that do not require penetrating the person’s skin. Unlike invasive methods, non-invasive methods are simple to apply, even for people with little specialized knowledge [19].
Surface electromyography (sEMG) records muscle activity by placing electrodes on the skin and is a non-invasive method. For this reason, sEMG is commonly used in HGR to monitor muscle activity, as shown in [9]. Despite its advantages, sEMG faces a significant challenge related to variability. This variability can be both intrapersonal and interpersonal, as follows [21,22]:
  • Intrapersonal variability refers to changes in the signals of the same person and is associated with factors such as fatigue, mood, sensor repositioning, and other aspects [23,24,25,26].
  • Interpersonal variability refers to differences in the EMG signal produced by one person compared to another. This variability can be attributed to different factors, such as physical constitution, body mass index, fat level, gender, and age, among other aspects [27,28].

2.1. Machine Learning Models for Hand Gesture Recognition

Considering the above, a considerable amount of research has been conducted to develop specific or general models in the field of HGR with sEMG. Several machine learning models have been developed, including Support Vector Machines (SVM), Hidden Markov Models (HMM), Decision Trees, Random Forests, and Artificial Neural Networks (ANN) [9,29]. Specific models are designed for an individual with his or her own data, so they tend to have good accuracy, between 94% and 97%. In other research, a CovGCN architecture was proposed for sEMG-based HGR [30]. This method demonstrated a classification accuracy of 88.02% with a CovGCN model that dynamically refines topologies for each hand gesture. The following two datasets were used in that research: the UWB-Gestures dataset, with 8 different subjects, and the Ninapro DB2 dataset, with 40 subjects.
General models are trained on a multi-user dataset and can be used by several users without additional training [31]. For general models, the results are less favorable, as shown in Table 1. Chung et al. [32] used an ANN to achieve a recognition accuracy of 85.08 ± 15.21%. For their part, Jaramillo et al. [33] used an SVM to achieve a result of 87.5%. Finally, Barona et al. [16] achieved 87.26 ± 11.14% using a CNN and 90.55 ± 9.45% using an LSTM. It should be noted that all of the results stated above rely on post-processing techniques to calculate recognition accuracy. Thus, machine learning offers improved performance, although the results are still not perfect.

2.2. Transformers in Hand Gesture Recognition

In 2017, a machine learning algorithm known as “transformers” was developed. Transformers were initially designed for NLP tasks, as described in the article “Attention is all you need” [14]. This architecture uses attention mechanisms that are part of an encoder and a decoder. In several studies, it has been confirmed that using only the encoder is sufficient to perform classification tasks [34,35]. In the field of HGR with sEMG, research has also been conducted using only the transformer encoder [6,15,36].
In the research conducted by Zabihi et al. [15], a model with a hybrid architecture based on a transformer encoder combined with a CNN was created. This model (TraHGR) was developed using data from 40 individuals performing nine gestures, achieving a classification accuracy of 93.84%. On the other side, Montazerin et al. conducted two investigations using transformers in their architectures: ViT-HGR and CT-HGR [6,36]. In both cases (ViT-HGR and CT-HGR), they used a variant of the Vision Transformer (ViT) architecture and employed two grids, each comprising 64 electrodes, to capture sEMG signals while individuals performed 66 distinct hand gestures. These data were collected from a group of 20 individuals. The best model (CT-HGR) achieved a classification accuracy of 91.98 ± 2.22%.
Models using the transformer architecture show better performance than previous machine learning models. This good result is attributed to the transformer’s ability to process data in sequential blocks, as noted in [15]. Processing data in sequential blocks reduces spurious labels. Spurious labels are incorrectly assigned values in a classification vector, as explained by Benalcázar et al. in their research [37]. In other words, a classification vector is valid when a user has their hand relaxed, performs a gesture for a certain amount of time, and then returns to the relaxed state. Therefore, the gesture performed during execution should not receive a different classification label. Such a differing classification label is called a spurious label, as shown in Figure 1.

2.3. Research Gap

Transformer models have managed to mitigate the problem of spurious labels. However, the transformer models mentioned above (TraHGR, ViT-HGR, and CT-HGR) are specific models. The results obtained cannot be generalized due to the use of relatively small datasets [31]. For this reason, we used transformers in our research with the EMG-EPN-612 dataset [18]. This architecture uses sequences of blocks that produce fewer spurious labels. The resulting model is generalizable since it has been trained, validated, and tested with data from 612 individuals.
Using a larger dataset improves the generalization of the model and reduces spurious labels, as the transformer architecture we employ works with block sequences, which optimizes the model’s performance by better handling variability in gestures.
Additionally, our research reduces the need for post-processing techniques, which are necessary in many previous models to correct false labels and inconsistencies in gesture classification. Thanks to the use of CNN for feature extraction and transformers, our model does not rely on these additional steps, significantly improving its real-time performance. This ability to process data quickly and accurately, without extra steps, is essential for its implementation in real-world applications, where efficiency is crucial.

3. Materials and Methods

3.1. Dataset

A common challenge in many HGR models is the use of small datasets. For a model to be generalizable, it needs to be trained with a significant amount of data. Therefore, we have used the “EMG-EPN-612” dataset [18]. This dataset is publicly available for use in similar research.

3.1.1. Origin of the Dataset

The dataset consists of a collection of EMG signals from 612 individuals, of which 66% were male and the remaining 34% were female. This dataset is designed for training, evaluating, and testing HGR models. It was created by the Artificial Intelligence and Computer Vision Research Lab of the Escuela Politécnica Nacional [38].
The EMG-EPN-612 dataset was divided into two groups, each with 306 individuals. One group was intended for training and validating the model, while the other group was intended for testing the classification and recognition accuracy of the created model. Furthermore, each individual recorded 50 EMGs per gesture, including the relaxed-hand state. Consequently, each individual has a total of 300 EMGs, and there are a total of 91,800 EMGs in each group. This is illustrated in Figure 2.

3.1.2. Characteristics of the Dataset

The EMGs were captured using an EMG signal collection device placed on the volunteers’ right forearms, as shown in Figure 3a. The volunteers performed five hand gestures: pinch, wave out, wave in, fist, and open, as shown in Figure 3b. The recording time for each gesture was 5 s. In addition, during the same time, the EMG of the relaxed hand, called “noGesture”, was recorded.

3.1.3. Comparison with Other Datasets

Compared to other widely used datasets such as NinaPro [39], which is one of the most recognized databases in the HGR domain, EMG-EPN-612 offers features that support the development of generalizable models [18]. Although NinaPro includes a greater variety of gestures, even its most comprehensive versions include no more than 78 participants, which limits its utility for training and evaluating models with high generalization capacity [39]. In contrast, EMG-EPN-612 contains recordings from 612 individuals, making it the largest known public dataset in terms of user count.
Moreover, the dataset was carefully designed with a balanced division: 306 users with ground truth labels are used for training and validation, while the remaining 306 users—whose ground truth is not provided—are reserved for final testing [12]. This allows for a clear and unbiased evaluation of the model’s generalization ability and potential overfitting. It is worth noting that this same division strategy has been adopted in previous studies, as indicated in references [12,16,40,41].

3.2. Data Preprocessing

The raw data for all 612 individuals include the EMG signal in 8 channels for each of the hand gestures. In data preprocessing, the entire dataset containing the EMG signals is discretized and normalized to facilitate its conversion into spectrograms. The preprocessing of an EMG signal is detailed below; the signal is represented as follows:
$$E(t) = \begin{bmatrix} E_1(t) & E_2(t) & \cdots & E_8(t) \end{bmatrix}^{T}$$
The EMG signal is divided into overlapping windows with a selected window size of |W| = 300 sample points (1500 ms), as a design criterion, and a stride of 45 sample points (225 ms), as shown in Figure 4a. The formal representation of the window W at time instant n of the signal for 8 channels is given by the following:
$$S_t = \begin{bmatrix} E(n-299) & \cdots & E(n) \end{bmatrix} \in [-1, 1]^{300 \times 8}$$
Following the segmentation process, we rectify the signal S_t by taking its absolute value |S_t|. This rectification step is crucial to prevent the mean in each channel of S_t from becoming zero [42]. Subsequently, we apply a digital low-pass Butterworth filter, denoted as Ψ, to the rectified signal |S_t| in order to smooth the signal and reduce noise. The Butterworth filter is of fifth order with a cutoff frequency of 10 Hz; this is a design criterion. The resulting filtered signal is denoted as Ψ(|S_t|), referred to as S_Ψ, as shown in Figure 4b. Then, we utilize an internal sliding window, denoted as L, within a window W, with a width of 24 sample points (|L| = 24) and an overlapping region of 12 sample points (|d| = 12), as shown in Figure 4c.
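To make these preprocessing steps concrete, the following is a minimal Python sketch of the segmentation, rectification, and Butterworth filtering described above. The 200 Hz sampling rate is an assumption (consistent with 300 samples spanning 1500 ms), the zero-phase filtfilt call is one possible way to apply the filter, and the function names are illustrative rather than taken from the authors' code.

import numpy as np
from scipy.signal import butter, filtfilt

FS = 200        # assumed sampling rate in Hz (300 samples = 1500 ms)
WINDOW = 300    # |W| = 300 sample points
STRIDE = 45     # 225 ms stride

def sliding_windows(emg):
    """Yield windows S_t of shape (300, 8) from an EMG signal of shape (T, 8)."""
    for start in range(0, emg.shape[0] - WINDOW + 1, STRIDE):
        yield emg[start:start + WINDOW, :]

def rectify_and_filter(window):
    """Rectify |S_t| and apply a 5th-order low-pass Butterworth filter at 10 Hz."""
    b, a = butter(N=5, Wn=10, btype="low", fs=FS)
    return filtfilt(b, a, np.abs(window), axis=0)  # filter each channel independently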
Once the signal has been rectified and filtered, the spectrograms are calculated using the Short-Time Fourier Transform (STFT). According to Tsai et al. [43], the STFT is used to transform the EMG signal from the time domain to the time-frequency domain. This step is necessary in order to later perform feature extraction. In our case, using the STFT, we obtain as a result a matrix denoted as follows:
$$\mathbf{X} = \begin{bmatrix} X_1 & X_2 & \cdots & X_{|L|} \end{bmatrix} \in \mathbb{C}^{13 \times 24}$$
The value |L| corresponds to the number of columns in X. It is important to note that X can also be expressed as follows:
$$\mathbf{X} = \mathbf{A} + \mathbf{B}i$$
where A and B are the real and imaginary components of the STFT. This entire process can be visualized in Figure 4d. We then generate the spectrogram, denoted as follows:
$$M_i = |\mathbf{X}|^2$$
by calculating the squared magnitude of the signal, as depicted in Figure 4e. We calculate the spectrogram for each of the 8 channels individually, as shown in Figure 4f. Finally, we combine the spectrograms of all channels to create a tensor, as follows:
$$\Lambda = (M_1, M_2, \ldots, M_8)$$
This tensor serves as input for the feature extraction process, as shown in Figure 4g.
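The spectrogram computation can be sketched as follows, assuming SciPy's STFT with a 24-sample inner window and 12-sample overlap, which yields X in C^(13×24) for a 300-sample window; M_i = |X|^2 is the per-channel spectrogram, and the 8 channels are stacked into the tensor Λ. Function names are illustrative.

import numpy as np
from scipy.signal import stft

def channel_spectrogram(s_psi_channel, fs=200.0):
    """Spectrogram M_i = |STFT|^2 of one rectified and filtered channel (300 samples)."""
    _, _, X = stft(s_psi_channel, fs=fs, nperseg=24, noverlap=12,
                   boundary=None, padded=False)  # X has shape (13, 24)
    return np.abs(X) ** 2

def build_tensor(s_psi):
    """Stack the spectrograms of all 8 channels into a tensor of shape (8, 13, 24)."""
    return np.stack([channel_spectrogram(s_psi[:, ch]) for ch in range(s_psi.shape[1])])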

3.3. Feature Extraction

As mentioned in the data preprocessing Section, the signal in the time domain was transformed into a time-frequency domain signal using the STFT. The resulting signal was processed with a Convolutional Neural Network (CNN). CNNs automatically extract and learn relevant features from the input data [44]. In our study, we employ a CNN for feature extraction as part of preprocessing, as it has shown considerable potential to improve overall performance compared to models that do not use a CNN for preprocessing [15].
The feature extraction was designed based on the previous experience and empirical experimentation of Szegedy et al. in their article defining “Inception modules” [45]. The structure used for feature extraction is as follows: 6 convolutional layers, each followed by a ReLU activation function and connected through residual connections, as shown in Figure 5a. Within each CNN layer, the following configuration was used: 4 blocks of 1 × 1 2D convolution, 1 block of 3 × 3 convolution, 1 block of 5 × 5 convolution, and finally a concatenation layer, as shown in Figure 5b.
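A hedged PyTorch sketch of one such Inception-style layer is shown below: parallel 1 × 1, 3 × 3, and 5 × 5 convolutions whose outputs are concatenated, wrapped in a residual connection followed by ReLU. The branch channel counts and the 1 × 1 projection used to match dimensions for the residual addition are assumptions; the paper does not report them explicitly.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, branch_ch=16):
        super().__init__()
        # Four parallel 1x1 branches plus one 3x3 and one 5x5 branch, as described above.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, kernel_size=1) for _ in range(4)]
            + [nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1),
               nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)]
        )
        # 1x1 projection so the residual addition has matching channel counts (an assumption).
        self.proj = nn.Conv2d(in_ch, branch_ch * 6, kernel_size=1)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)  # concatenation layer
        return torch.relu(y + self.proj(x))                  # residual connection + ReLU

# Example: one spectrogram tensor (batch of 1, 8 channels, 13 x 24).
features = InceptionBlock(in_ch=8)(torch.randn(1, 8, 13, 24))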

3.4. Architecture

3.4.1. Overview of Transformer Architecture

The transformer is a type of ANN that has proven effective in NLP, computer vision (CV), and speech recognition [34,35]. The transformer was introduced by Vaswani et al. in 2017 in the paper “Attention is All You Need” [14]. The transformer architecture has four main components, as illustrated in Figure 6a. These main components are:
Input Embedding: Involves representing input tokens as vectors in a high-dimensional space.
Positional Encoding: Adds positional information to the input sequence.
The Encoder: Processes the signal and transforms it into an abstract representation.
The Decoder: Uses this representation to generate an output sequence.
Figure 6. Transformer and encoder architecture.
Both the encoder and the decoder employ attention mechanisms. It is necessary to emphasize that in our research we use only the encoder because our output is not a sequence: it is a value between 0 and 5, where each number represents one of the six classes, including “noGesture”. Furthermore, we do not use input embedding because our input is already a numeric vector.

3.4.2. Positional Encoding

Positional encoding is a sinusoidal function that is added to each sequence to provide positional information. Positional encoding is necessary because the transformer receives the entire sequence and processes it comprehensively, without distinction [14]. To preserve the relative positions of the sequence, a geometric progression is employed. The wavelengths of these sinusoidal signals range from 2π to 10,000 × 2π, expressed as follows:
$$\mathrm{PE}_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
$$\mathrm{PE}_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
The choice of a geometric progression for the wavelengths is based on [14]; it allows the model to capture different scales of positional information effectively. The higher frequencies, represented by smaller wavelengths near 2π, capture finer-grained positional details, while the lower frequencies, with larger wavelengths around 10,000 × 2π, contribute to encoding broader positional relationships.
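The sinusoidal encoding defined by the two formulas above can be generated as in the following sketch (d_model is assumed to be even):

import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe                                  # added element-wise to the input sequence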

3.4.3. Encoder

The encoder consists of a stack of n identical layers. Each layer has two parts: a multi-head attention layer and a fully connected layer, as can be seen in Figure 6b.
The multi-head attention layer helps the encoder focus on different parts of the input, capturing long-range relationships between positions in the sequence. The fully connected layer is a feed-forward network applied independently to each input position. After each layer, layer normalization is applied to stabilize training. Additionally, residual connections are used to avoid the vanishing gradient problem [14].

3.4.4. Multi-Head Attention

Multi-head attention allows the model to pay attention to different parts of the input simultaneously. It performs multiple attention calculations simultaneously and then combines them into a single output. The expression for multi-head attention is as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
In this formula, Q, K, and V represent the query, key, and value matrices. These matrices are projected into h different subspaces using the projection matrices W_i^Q, W_i^K, and W_i^V, producing QW_i^Q, KW_i^K, and VW_i^V. Then, a weighted attention calculation is performed for each subspace using the following formula:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$$
where W_i^Q, W_i^K, and W_i^V are the learned parameter matrices that project Q, K, and V for the i-th attention head. The weighted attention function (also called, in this particular case, “Scaled Dot-Product Attention”) is defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where d_k represents the dimension of the key projection, and softmax is the function that normalizes the attention weights. The division by √(d_k) is used to maintain stability during model training. Finally, the results of the weighted attention heads are concatenated and transformed again using a projection matrix W^O, producing the final output.
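These three formulas translate directly into the following NumPy sketch of scaled dot-product attention and its multi-head combination; the per-head weight matrices passed as lists are illustrative placeholders for the learned projections.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """Concat(head_1, ..., head_h) W_O with per-head projections W_i^Q, W_i^K, W_i^V."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo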

3.5. Customization for Hand Gesture Recognition

To meet the specific demands of this paper, we tailored the transformer architecture with several adjustments. This study focuses exclusively on the transformer encoder, which has proven sufficient for classification tasks, as demonstrated in previous research [34,35]. Therefore, after obtaining the output of the CNN feature extraction process, the signal was flattened and used as input to the transformer encoder. The developed transformer encoder consists of the following stages:
1.
A positional encoding layer that assigns an additional value to each position in the sequence.
2.
The processed sequence then passes through 8 attention heads (multi-head attention), each of which simultaneously attends to different details of the sequence.
3.
After the attention stage, the sequence moves through a fully connected layer.
4.
Between the multi-head attention and fully connected layers, normalization is applied to both the inputs and outputs of these layers, mitigating issues such as overflow and vanishing gradients and ensuring more efficient convergence.
5.
Finally, a Softmax function is applied to the 6 possible gesture categories, including the “noGesture” class.
This process culminates in the determination of the resulting gesture class, as shown in Figure 7. A minimal sketch of these stages is given below.
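The sketch uses PyTorch's built-in encoder layer: a positional term is added to the flattened CNN features, an 8-head encoder layer processes the sequence, and a softmax produces probabilities over the 6 classes (5 gestures plus "noGesture"). The sequence length, model width, the learned positional parameter standing in for the sinusoidal encoding, and the mean pooling before the classifier are all assumptions made only to keep the example self-contained.

import torch
import torch.nn as nn

class HGREncoderSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=8, n_classes=6, seq_len=24):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))  # stand-in positional encoding
        self.encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                                  batch_first=True, norm_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, sequence_length, d_model) flattened CNN features
        h = self.encoder(x + self.pos[:, : x.size(1)])
        return torch.softmax(self.classifier(h.mean(dim=1)), dim=-1)

probs = HGREncoderSketch()(torch.randn(2, 24, 128))  # probabilities over the 6 classes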

3.6. Training and Validation

The EMG-EPN-612 dataset is composed of two groups, for training and testing. Each group contains the signals of 306 volunteers, as mentioned in the dataset Section. The training group was subdivided into two subgroups for training and validation, and both subgroups were converted into spectrograms. Training was carried out in MATLAB (R2025a) using the Deep Learning Toolbox. The training parameters were set as shown in Table 2:
These training parameters were based on the model developed in [16]. Using the training parameters mentioned above, a graph of the learning process was obtained and the confusion matrix was constructed for a detailed analysis of the classification results.
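Since the original training was performed in MATLAB's Deep Learning Toolbox, the following PyTorch sketch only mirrors the Table 2 configuration for readers who prefer Python: initial learning rate 0.001, a piecewise drop by a factor of 0.2 every 8 epochs, mini-batches of 64, and 10 epochs. The choice of the Adam optimizer and the model and dataset placeholders are assumptions.

import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=10):
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=8, gamma=0.2)  # piecewise drop
    loss_fn = torch.nn.CrossEntropyLoss()  # model is assumed to return class logits
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()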

3.7. Post-Processing

The post-processing technique applied was the Voting Ensemble [46]. This method compares the previous three labels and assigns the value of the most frequent label to each spurious label. This substitution is conducted from left to right, ignoring “noGesture” labels. This process reduces spurious labels in the classification predictions, as shown in Figure 8.
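Our reading of this procedure is sketched below: scanning left to right, each non-"noGesture" label is replaced by the most frequent of the three preceding labels (ignoring "noGesture"), which smooths out isolated spurious labels. This is an interpretation of the description above, not the authors' code.

from collections import Counter

def vote_postprocess(labels):
    out = list(labels)
    for i in range(3, len(out)):
        if out[i] == "noGesture":
            continue                                     # "noGesture" labels are left untouched
        prev = [p for p in out[i - 3:i] if p != "noGesture"]
        if prev:
            out[i] = Counter(prev).most_common(1)[0][0]  # majority of the previous three labels
    return out

# Example: the isolated "open" label is smoothed to "fist".
print(vote_postprocess(["noGesture", "fist", "fist", "open", "fist", "fist"]))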
Classification and recognition were performed on both the training and testing datasets, and in both cases the accuracy was calculated. It is important to note that the test dataset was not used during the training process. This practice ensures a reliable evaluation of the model’s generalization performance on unseen data and avoids overfitting to the test dataset.

4. Results

The results obtained and their analysis are detailed below. We have focused on classification and recognition accuracy metrics, along with their respective standard deviations, overlap, and processing time.

4.1. Evaluation Metrics

The developed model has six classification classes (including “NoGesture”) and each gesture is performed in an interval of 5 s. This model will be evaluated taking into account the following metrics:
  • Prediction Vector Validity: The prediction vector is considered valid only when it does not present discontinuities throughout its classification. That is, it does not have spurious labels.
  • Classification Accuracy: Classification accuracy refers to the proportion of gestures correctly classified by the model.
  • Recognition Accuracy: Recognition accuracy goes beyond classification. Recognition is based on the presence of a valid prediction vector, which maintains a single designated label over time.
  • Standard Deviation: It is a statistical measure of dispersion used to understand the variability present in a dataset. In this case, the standard deviation refers to the variability between the users’ mean values.
  • Overlap: The overlap is a percentage value. This value represents the percentage by which the prediction vector overlaps with the ground truth at a given time. Ground truth refers to the manually labeled actual values that serve as a reference for evaluating the accuracy of the model’s predictions.
  • Processing Time: Records the time it takes for the model to process and assign a label to a gesture. This metric is essential to evaluate the model in real-time. In this work, a model must respond in less than 300 ms for its response to be considered real-time according to [47].
These metrics provide a comprehensive assessment of both the effectiveness and consistency of the gesture recognition model, enabling comparison with other research.
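As a hedged illustration of how two of these metrics could be computed, the sketch below checks prediction-vector validity (a single label outside "noGesture") and measures overlap as the fraction of ground-truth gesture samples covered by matching predictions; this is our interpretation of the definitions above, not the evaluation code used in the study.

import numpy as np

def is_valid(pred):
    """True when the non-"noGesture" part of the prediction vector carries a single label."""
    gesture = pred[pred != "noGesture"]
    return len(set(gesture.tolist())) <= 1

def overlap(pred, truth):
    """Fraction of ground-truth gesture samples matched by the prediction vector."""
    active = truth != "noGesture"
    return float(np.mean(pred[active] == truth[active])) if active.any() else 0.0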

4.2. Training Results

Initially, several tests were conducted without using positional encoding, where classification scores remained below 70%, despite experimenting with different numbers of heads and channels. Subsequently, positional encoding was incorporated, leading to notable improvements in performance. Following this adjustment, several training sessions were conducted with different numbers of heads, namely 4, 8, and 16. The number of channels for Q , K , and V was also varied, using channel numbers of 16, 32, 64, 128, and 256, as it is shown in Figure 9. The final result was achieved with 8 heads and 128 channels. It is noteworthy that the number of channels must be divisible by the number of heads since they are concatenated, as indicated in the multi-head attention Section.
Figure 10 illustrates the training progress and loss of the best model obtained, which completed 7170 iterations in 10 epochs, with a total time of 152 min. The classification accuracy by window provided by the MATLAB Deep Learning Toolbox was 96.56%. The result was obtained through cross-validation using the training dataset. The learning curve exceeds 90% after 2000 iterations.
Figure 11 presents the confusion matrix corresponding to the 6 gestures classified by the model. The matrix highlights the correct predictions diagonally and provides insights into the model’s performance across different gesture categories.
Table 3 contains performance metrics such as precision, recall, F1 score, and specificity [48]:
Precision: Proportion of true positive predictions among all positive predictions. Measures accuracy when the model predicts a positive class.
Recall: Proportion of true positive predictions among all actual positives. Measures the model’s ability to capture all positive instances.
F1 Score: Harmonic mean of precision and recall. Balances precision and recall, providing a single metric.
Specificity: Proportion of true negative predictions among all actual negatives. Measures accuracy when the model predicts a negative class.
Table 3. Performance metrics per class for the hand gesture classification by window model.

Class      Precision   Recall    F1 Score   Specificity
Fist       97.12%      95.88%    96.49%     99.43%
NoGesture  99.28%      99.66%    99.47%     99.86%
Open       94.87%      91.31%    93.05%     99.01%
Pinch      95.36%      97.27%    96.30%     99.05%
WaveIn     96.90%      96.71%    96.81%     99.38%
WaveOut    93.54%      96.20%    94.85%     98.67%
As can be seen in Table 3, the model is more accurate with the “NoGesture” gesture. This fact must be taken into account when interpreting the results, as it could bias them. For this reason, the analysis of the results is supported by other metrics such as overlap and recognition accuracy.
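For reference, the per-class metrics in Table 3 can be derived from a confusion matrix such as the one in Figure 11 with a small helper like the following sketch (rows are assumed to be true classes and columns predicted classes):

import numpy as np

def per_class_metrics(cm, class_idx):
    tp = cm[class_idx, class_idx]
    fn = cm[class_idx, :].sum() - tp
    fp = cm[:, class_idx].sum() - tp
    tn = cm.sum() - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }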

4.3. Test Results

The test results were obtained using the test dataset. Table 4 shows that the classification accuracy by window is the same with and without post-processing. The classification is the same in both cases because post-processing is applied only to improve recognition, as explained in the post-processing Section (Section 3.7). For this reason, only the recognition accuracy increases, by 2%, and its standard deviation is reduced from 8.77% to 8.2%. Furthermore, it is important to note that the classification accuracy provided by MATLAB (96.56%) on the training results is calculated window by window. This result differs from the classification accuracy applied during testing (95.25%), as the latter is based on the classification of the entire signal.

4.4. Discussion

In terms of classification accuracy, the transformer model developed in this study (HGREncoder) outperformed all the models indicated in the literature review, namely CNN, LSTM, CT-HGR, and TraHGR. HGREncoder achieved a classification accuracy by window of 95.25 ± 4.9%.
Figure 12a shows the comparison of the four mentioned models (CNN, LSTM, CT-HGR, and TraHGR) and HGREncoder. Firstly, classification accuracy is discussed for the CNN, LSTM, and HGREncoder models; secondly, it is discussed for the CT-HGR, TraHGR, and HGREncoder models.
In the first case, the comparison is balanced because the CNN, LSTM, and HGREncoder models used the same dataset. The CNN (in purple in Figure 12), with an accuracy of 90.62 ± 9.58%, and the LSTM (in yellow in Figure 12), with an accuracy of 92.02 ± 2.22%, achieved lower classification accuracy than HGREncoder (in red in Figure 12). The reduction in the standard deviation with respect to the CNN indicates that a more consistent and improved model was obtained.
In the second case, CT-HGR and TraHGR (in blue in Figure 12a) are transformer-based models. These models were developed to classify a larger number of gestures (66 and 9 gestures, respectively) and were evaluated on smaller datasets (20 and 40 individuals, respectively) compared to the HGREncoder model. For this reason, a fully fair comparison cannot be made. Despite this, HGREncoder shows a higher standard deviation compared to CT-HGR, which achieves an accuracy of 91.98 ± 2.22%. The higher standard deviation of the HGREncoder model is due to the use of a larger dataset, which results in greater variability in the results. In the case of the TraHGR model, the standard deviation is not reported in that study, but its classification accuracy is 93.84%.
Figure 12b displays a comparison between HGREncoder and the CNN and LSTM models in terms of response time. It is observed that the response time of HGREncoder is longer than that of the other models (CNN and LSTM). A longer response time means that the model takes longer to process and assign a label to a gesture. Response speed is important in determining the viability of a real-time application. The response time of HGREncoder is less than 300 ms, so it is considered acceptable for real-time applications. Importantly, to simulate real-time conditions, HGREncoder needs to receive at least three previous time windows before making predictions. The wait for the first three windows is due to the model’s need for a broader context. Therefore, we collect information from the third window to the end of the sequence to make predictions. HGREncoder processes the entire sequence simultaneously, which implies greater processing and response time.
The recognition accuracy results of the CNN, LSTM, and HGREncoder models, with and without post-processing, are presented in Figure 13. The recognition accuracy achieved by HGREncoder without post-processing is the most significant advance of this research. The HGREncoder model achieves a recognition accuracy without post-processing of 89.7 ± 8.77% (Figure 13a), significantly outperforming the LSTM model, with a recognition accuracy of 65.8 ± 15.51%, and the CNN model, with a recognition accuracy of 45.4 ± 15.51%. This good recognition accuracy is because transformer models use attention mechanisms to focus on the most relevant parts of the input sequence. Furthermore, Figure 13b shows the results obtained in terms of recognition accuracy with post-processing; HGREncoder reaches 91.7 ± 8.20%. This result represents a slight improvement compared to the 90.6 ± 9.45% of the LSTM model and the 87.26 ± 11.14% of the CNN model.
Figure 14a shows the overlap percentage of the transformer model without post-processing, which reaches a value of 49.752 ± 8.23%. This result is lower than the 64 ± 9.5% achieved by the CNN and the 67.59 ± 10.05% achieved by the LSTM. Figure 14b shows the overlap percentage with post-processing. The trend of the overlap percentages with and without post-processing is the same: the CNN and LSTM models show better overlap results, with 67.9 ± 7.98% and 69.5 ± 7.4%, respectively, while HGREncoder reaches 50.55 ± 8.31%. The overlap results are low in both cases (with and without post-processing) because the transformer model uses the entire sequence for recognition. Using the entire sequence for classification could limit the ability to recognize transitions between gestures. Additionally, using the entire sequence could result in predictions of shorter duration, or in predictions that lead or lag the actual gesture.

5. Conclusions

In this research, a transformer model using positional encoding and the transformer encoder has been developed for real-time gesture classification and recognition. The transformer model created in this work outperforms the CNN, LSTM, CT-HGR, and TraHGR models. HGREncoder achieves a classification accuracy of 95.25 ± 4.9% and a recognition accuracy of 91.73 ± 8.19%. Despite requiring more processing time, these results highlight the potential of the transformer architecture, which has been extensively studied over the past five years.
The most notable advancement in this research is the significant improvement in recognition accuracy without post-processing, achieving 89.71 ± 8.77%. This recognition accuracy represents an improvement of up to 23% compared to previous models developed with the same dataset (the CNN and LSTM models). The transformer model learns better because it pays attention to the key components of the entire sequence, which comprises the “relax”, “gesture”, and “relax” phases that occur during each recording. However, the transformer model developed in this research can be improved in terms of processing time and overlap. The processing time of HGREncoder is 4.8 times longer than that of the previous models (CNN and LSTM). The model takes longer to respond because it needs context from the previous three windows. Future work could explore model lightening techniques to reduce processing time while maintaining accuracy. Additionally, the use of more efficient transformer variants could be investigated to optimize the architecture for real-time applications.
On the other hand, the overlap percentage achieved by the transformer model is lower than that of the previous models (CNN and LSTM). This decrease is due to the fact that the transformer model does not identify the exact moment at which the transition from “noGesture” to a gesture occurs. Furthermore, this low percentage could be a consequence of the duration of the detected gesture being shorter than the actual duration, or of the prediction being ahead of or behind the actual event. Receiving the full sequence could affect the ability to detect subtle variations.

Author Contributions

Conceptualization, L.G.M. and M.E.B.; methodology, L.G.M. and J.A.Z.; software, J.A.Z.; validation, L.I.B. and Á.L.V.; formal analysis, L.I.B. and L.G.M.; investigation, J.A.Z.; data curation, L.G.M.; writing—original draft preparation, L.G.M.; writing—review M.E.B.; supervision, Á.L.V. and M.E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are available upon reasonable request.

Acknowledgments

The authors gratefully acknowledge the financial support provided by the Escuela Politécnica Nacional (EPN) for the development of the research project “PIS-22-10 Análisis de la influencia de la variación intrapersonal de señales electromiográficas sobre la exactitud de modelos de reconocimientos de gestos de la mano”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Saggio, G.; Cavallo, P.; Ricci, M.; Errico, V.; Zea, J.; Benalcázar, M.E. Sign language recognition using wearable electronics: Implementing K-nearest neighbors with dynamic time warping and convolutional neural network algorithms. Sensors 2020, 20, 3879. [Google Scholar] [CrossRef]
  2. Koh, J.I.; Cherian, J.; Taele, P.; Hammond, T. Developing a hand gesture recognition system for mapping symbolic hand gestures to analogous emojis in computer-mediated communication. ACM Trans. Interact. Intell. Syst. 2019, 9, 1–35. [Google Scholar] [CrossRef]
  3. Simon, A.M.; Turner, K.L.; Miller, L.A.; Dumanian, G.A.; Potter, B.K.; Beachler, M.D.; Hargrove, L.J.; Kuiken, T.A. Myoelectric prosthesis hand grasp control following targeted muscle reinnervation in individuals with transradial amputation. PLoS ONE 2023, 18, e0280210. [Google Scholar] [CrossRef] [PubMed]
  4. Godoy, R.V.; Dwivedi, A.; Liarokapis, M. Electromyography Based Decoding of Dexterous, In-Hand Manipulation Motions With Temporal Multichannel Vision Transformers. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 30, 2207–2216. [Google Scholar] [CrossRef] [PubMed]
  5. Tan, C.K.; Lim, K.M.; Chang, R.K.Y.; Lee, C.P.; Alqahtani, A. HGR-ViT: Hand Gesture Recognition with Vision Transformer. Sensors 2023, 23, 5555. [Google Scholar] [CrossRef] [PubMed]
  6. Montazerin, M.; Zabihi, S.; Rahimian, E.; Mohammadi, A.; Naderkhani, F. ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, Scotland, UK, 11–15 July 2022; pp. 5115–5119. [Google Scholar] [CrossRef]
  7. Birkeland, S.; Fjeldvik, L.J.; Noori, N.; Yeduri, S.R.; Cenkeramaddi, L.R. Thermal video-based hand gestures recognition using lightweight CNN. J. Ambient Intell. Humaniz. Comput. 2024, 15, 3849–3860. [Google Scholar] [CrossRef]
  8. Ahmed, I.T.; Gwad, W.H.; Hammad, B.T.; Alkayal, E. Enhancing Hand Gesture Image Recognition by Integrating Various Feature Groups. Technologies 2025, 13, 164. [Google Scholar] [CrossRef]
  9. Jaramillo-Yánez, A.; Benalcázar, M.E.; Mena-Maldonado, E. Real-time hand gesture recognition using surface electromyography and machine learning: A systematic literature review. Sensors 2020, 20, 2467. [Google Scholar] [CrossRef]
  10. Vásconez, J.P.; López, L.I.B.; Valdivieso Caraguay, Á.L.; Benalcázar, M.E. Hand Gesture Recognition Using EMG-IMU Signals and Deep Q-Networks. Sensors 2022, 22, 9613. [Google Scholar] [CrossRef]
  11. Benalcázar, M.E.; Motoche, C.; Zea, J.A.; Jaramillo, A.G.; Anchundia, C.E.; Zambrano, P.; Segura, M.; Benalcázar Palacios, F.; Pérez, M. Real-time hand gesture recognition using the Myo armband and muscle activity detection. In Proceedings of the 2017 IEEE Second Ecuador Technical Chapters Meeting (ETCM), Salinas, Ecuador, 16–20 October 2017; pp. 1–6. [Google Scholar] [CrossRef]
  12. Valdivieso Caraguay, A.L.; Vásconez, J.P.; Barona López, L.I.; Benalcázar, M.E. Recognition of Hand Gestures Based on EMG Signals with Deep and Double-Deep Q-Networks. Sensors 2023, 23, 3905. [Google Scholar] [CrossRef]
  13. Anastasiev, A.; Kadone, H.; Marushima, A.; Watanabe, H.; Zaboronok, A.; Watanabe, S.; Matsumura, A.; Suzuki, K.; Matsumaru, Y.; Nishiyama, H.; et al. A Novel Bilateral Data Fusion Approach for EMG-Driven Deep Learning in Post-Stroke Paretic Gesture Recognition. Sensors 2025, 25, 3664. [Google Scholar] [CrossRef] [PubMed]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  15. Zabihi, S.; Rahimian, E.; Asif, A.; Mohammadi, A. TraHGR: Transformer for Hand Gesture Recognition via ElectroMyography. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 4211–4224. [Google Scholar] [CrossRef] [PubMed]
  16. Barona López, L.I.; Ferri, F.M.; Zea, J.; Valdivieso Caraguay, Á.L.; Benalcázar, M.E. CNN-LSTM and post-processing for EMG-based hand gesture recognition. Intell. Syst. Appl. 2024, 22, 200352. [Google Scholar] [CrossRef]
  17. Jiang, Y.; Song, L.; Zhang, J.; Song, Y.; Yan, M. Multi-Category Gesture Recognition Modeling Based on sEMG and IMU Signals. Sensors 2022, 22, 5855. [Google Scholar] [CrossRef]
  18. Benalcázar, M.; Barona, L.; Valdivieso, L.; Aguas, X.; Zea, J. EMG-EPN-612 Dataset. 2020. Available online: http://dx.doi.org/10.5281/zenodo.4421500 (accessed on 18 July 2025).
  19. Merletti, R. Electromyography: Physiology, Engineering, and Non-Invasive Applications; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 11. [Google Scholar] [CrossRef]
  20. Péter, A.; Andersson, E.; Hegyi, A.; Finni, T.; Tarassova, O.; Cronin, N.J.; Grundström, H.; Arndt, A. Comparing Surface and Fine-Wire Electromyography Activity of Lower Leg Muscles at Different Walking Speeds. Front. Physiol. 2019, 10, 1283. [Google Scholar] [CrossRef]
  21. Kyranou, I.; Vijayakumar, S.; Erden, M.S. Causes of performance degradation in non-invasive electromyographic pattern recognition in upper limb prostheses. Front. Neurorobotics 2018, 12, 58. [Google Scholar] [CrossRef]
  22. Huang, Q.; Yang, D.; Jiang, L.; Zhang, H.; Liu, H.; Kotani, K. A Novel Unsupervised Adaptive Learning Method for Long-Term Electromyography (EMG) Pattern Recognition. Sensors 2017, 17, 1370. [Google Scholar] [CrossRef]
  23. Schulte, R.V.; Prinsen, E.C.; Buurke, J.H.; Poel, M. Adaptive Lower Limb Pattern Recognition for Multi-Day Control. Sensors 2022, 22, 6351. [Google Scholar] [CrossRef]
  24. Jiang, N.; Muceli, S.; Graimann, B.; Farina, D. Effect of arm position on the prediction of kinematics from EMG in amputees. Med Biol. Eng. Comput. 2013, 51, 143–151. [Google Scholar] [CrossRef]
  25. Zhai, X.; Jelfs, B.; Chan, R.H.; Tin, C. Self-recalibrating surface EMG pattern recognition for neuroprosthesis control based on convolutional neural network. Front. Neurosci. 2017, 11, 379. [Google Scholar] [CrossRef]
  26. He, J.; Zhang, D.; Jiang, N.; Sheng, X.; Farina, D.; Zhu, X. User adaptation in long-term, open-loop myoelectric training: Implications for EMG pattern recognition in prosthesis control. J. Neural Eng. 2015, 12, 046005. [Google Scholar] [CrossRef] [PubMed]
  27. Côté-Allard, U.; Fall, C.L.; Drouin, A.; Campeau-Lecours, A.; Gosselin, C.; Glette, K.; Laviolette, F.; Gosselin, B. Deep Learning for Electromyographic Hand Gesture Signal Classification Using Transfer Learning. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 760–771. [Google Scholar] [CrossRef] [PubMed]
  28. Unanyan, N.; Belov, A. Case Study: Influence of Muscle Fatigue and Perspiration on the Recognition of the EMG Signal. Adv. Syst. Sci. Appl. 2021, 21, 58–70. [Google Scholar] [CrossRef]
  29. Espinoza, D.L.; Velasco, L.E.S. Comparison of EMG Signal Classification Algorithms for the Control of an Upper Limb Prosthesis Prototype; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020; Volume 11. [Google Scholar] [CrossRef]
  30. Zhou, H.; Thanh Le, H.; Zhang, S.; Lam Phung, S.; Alici, G. Hand Gesture Recognition From Surface Electromyography Signals With Graph Convolutional Network and Attention Mechanisms. IEEE Sens. J. 2025, 25, 9081–9092. [Google Scholar] [CrossRef]
  31. Barona López, L.I.; Valdivieso Caraguay, A.L.; Vimos, V.H.; Zea, J.A.; Vásconez, J.P.; Álvarez, M.; Benalcázar, M.E. An Energy-Based Method for Orientation Correction of EMG Bracelet Sensors in Hand Gesture Recognition Systems. Sensors 2020, 20, 6327. [Google Scholar] [CrossRef]
  32. Chung, E.A.; Benalcázar, M.E. Real-Time Hand Gesture Recognition Model Using Deep Learning Techniques and EMG Signals. In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, 2–6 September 2019; pp. 1–5. [Google Scholar] [CrossRef]
  33. Jaramillo-Yanez, A.; Unapanta, L.; Benalcázar, M.E. Short-Term Hand Gesture Recognition using Electromyography in the Transient State, Support Vector Machines, and Discrete Wavelet Transform. In Proceedings of the 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Guayaquil, Ecuador, 11–15 November 2019; pp. 1–6. [Google Scholar] [CrossRef]
  34. Alaparthi, S.; Mishra, M. Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey. arXiv 2020, arXiv:2007.01127. [Google Scholar] [CrossRef]
  35. Llinet Benavides, C.; Manso-Callejo, M.A.; Cira, C.I. BERT (Bidirectional Encoder Representations from Transformers) for Missing Data Imputation in Solar Irradiance Time Series. Eng. Proc. 2023, 39, 26. [Google Scholar]
  36. Montazerin, M.; Rahimian, E.; Naderkhani, F.; Atashzar, S.; Yanushkevich, S.; Mohammadi, A. Transformer-based hand gesture recognition from instantaneous to fused neural decomposition of high-density EMG signals. Sci. Rep. 2023, 13, 11000. [Google Scholar] [CrossRef]
  37. Benalcázar, M.E.; Valdivieso Caraguay, Á.L.; López, L.I.B. A user-specific hand gesture recognition model based on feed-forward neural networks, EMGs, and correction of sensor orientation. Appl. Sci. 2020, 10, 8604. [Google Scholar] [CrossRef]
  38. Laboratorio de Investigación en Inteligencia y Visión Artificial. Escuela Politécnica Nacional, 2024. Available online: https://laboratorio-ia.epn.edu.ec/es/ (accessed on 15 August 2025).
  39. Atzori, M.; Gijsberts, A.; Castellini, C.; Caputo, B.; Hager, A.G.; Elsig, S.; Giatsidis, G.; Bassetto, F.; Müller, H. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Sci. Data 2014, 1, 140053. [Google Scholar] [CrossRef]
  40. Eddy, E.; Campbell, E.; Bateman, S.; Scheme, E. Big data in myoelectric control: Large multi-user models enable robust zero-shot EMG-based discrete gesture recognition. Front. Bioeng. Biotechnol. 2024, 12, 2024. [Google Scholar] [CrossRef]
  41. Vásconez, J.P.; Barona López, L.I.; Valdivieso Caraguay, Á.L.; Benalcázar, M.E. A comparison of EMG-based hand gesture recognition systems based on supervised and reinforcement learning. Eng. Appl. Artif. Intell. 2023, 123, 106327. [Google Scholar] [CrossRef]
  42. Benalcázar, M.E.; Jaramillo, A.G.; Zea, J.A.; Páez, A.; Andaluz, V.H. Hand gesture recognition using machine learning and the Myo armband. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 1040–1044. [Google Scholar] [CrossRef]
  43. Tsai, A.C.; Luh, J.; Lin, T.T. A novel STFT-ranking feature of multi-channel EMG for motion pattern recognition. Expert Syst. Appl. 2015, 42, 44. [Google Scholar] [CrossRef]
  44. Jiang, W.; Ren, Y.; Liu, Y.; Wang, Z.; Wang, X. Recognition of Dynamic Hand Gesture Based on Mm-Wave Fmcw Radar Micro-Doppler Signatures. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4905–4909. [Google Scholar] [CrossRef]
  45. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  46. Rahimi, N.; Eassa, F.; Elrefaei, L. An Ensemble Machine Learning Technique for Functional Requirement Classification. Symmetry 2020, 12, 1601. [Google Scholar] [CrossRef]
  47. Farrell, T.R.; Weir, R.F. The Optimal Controller Delay for Myoelectric Prostheses. IEEE Trans. Neural Syst. Rehabil. Eng. 2007, 15, 111–118. [Google Scholar] [CrossRef]
  48. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2011, arXiv:2010.16061. [Google Scholar]
Figure 1. Illustration of valid and invalid classification vectors.
Figure 2. Overview of the EMG-EPN-612 dataset structure [18].
Figure 3. EMG data capture setup and hand gesture performance using an EMG signal collection device.
Figure 4. Preprocessing of EMG signals: application of the STFT to convert raw data into spectrograms. This figure is a modified version of [16].
Figure 5. Feature extraction architecture based on convolutional layers with Inception-style modules.
Figure 7. Architecture of the HGR model.
Figure 8. Post-processing technique.
Figure 9. Heatmap of performance results with different numbers of heads and channels.
Figure 10. Training progress (top) and cost function (bottom) of the obtained model.
Figure 11. Heatmap of confusion matrix for hand gesture classification by window.
Figure 12. Comparison between HGREncoder, CNN, LSTM, CT-HGR, and TraHGR models in terms of classification accuracy and response time.
Figure 13. Comparison of the model developed in this work with the CNN and LSTM models in terms of recognition accuracy.
Figure 14. Comparison of the model developed in this work with the CNN and LSTM models in terms of overlap.
Table 1. Recognition accuracy for general HGR models.

Model   Recognition Accuracy   Variability   Number of Weights
ANN     85.08%                 ±15.21        70.9 K
SVM     87.5%                  —             —
CNN     87.26%                 ±11.14        219.1 K
LSTM    90.55%                 ±9.45         11.7 M
Table 2. Training parameter configuration.

Parameter                    Value
Initial Learning Rate        0.001
Learning Rate Scheduling     piecewise
Learning Rate Drop Factor    0.2
Learning Rate Drop Period    8
MiniBatch Size               64
Number of Epochs             10
Table 4. Test results with and without post-processing.

Metric           No Post-Processing   With Post-Processing
Classification   95.25 ± 4.9%         95.25 ± 4.9%
Recognition      89.7 ± 8.77%         91.7 ± 8.2%
Overlapping      49.8 ± 8.27%         50.5 ± 8.35%
Time (ms)        151 ± 49.2           167 ± 50.8
