1. Introduction
The demand for both quantity and quality of services continues to increase in the field of hand gesture recognition (HGR) [1]. Face-to-face communication relies heavily on hand gestures [2]. In human–computer interaction systems, hand gesture applications offer a touchless interface [3]. HGR plays an important role in various applications [2], including human–computer interaction (HCI), virtual reality (VR), sign-language interpretation, drone control, and smart home control, as shown in Figure 1, highlighting its importance in modern technology [4,5]. Although hand gesture recognition is an active field of computer vision, many challenges remain. One of them is the availability of a suitable dataset that allows a model to generalize during training and perform well in real-world scenarios [6].
Hand gesture classification for practical application scenarios is a challenging task and requires careful design of the dataset and selection of a pre-trained model [7]. Therefore, it is crucial to take into account both the selection of lightweight models for training and the availability of datasets [8]. By exploiting pre-trained weights through transfer learning, the need for a large dataset can be avoided [1]. Although researchers frequently concentrate on improving model performance, the contributions of a quality dataset and a well-trained model are occasionally disregarded [9]. This omission may result in computationally complex models with high hardware requirements that are impractical for real-time applications [10]. Our goal is to investigate and select an ideal model and dataset for HGR.
Access to a large dataset is critical for the success of any machine learning (ML) method, especially when handling complex problems like hand gesture recognition [11]. In this regard, datasets designed expressly for hand gesture recognition play an important role in furthering research and development in this field [12]. These datasets contain both static and dynamic gestures and are diverse in nature, which helps models generalize during training [13]. Because diverse hand gesture datasets are essential, careful experiments and testing allow researchers to gain insights into the requirements for acquiring data useful for real-world applications like vision-based HCI, drone control, and sign-language interpretation [14,15].
Similarly, choosing the best pre-trained deep learning model for hand gesture recognition requires a careful selection procedure [12]. Researchers often spend considerable time on experimental evaluation in this phase [16]. The choice also depends on the type of dataset that will be used in the transfer learning process [17]. In this study, we focused on identifying the best lightweight model that performs well on the targeted datasets for real-time hand gesture recognition, while also exploring which dataset aligns best with each model in terms of performance. Therefore, we analyzed six state-of-the-art lightweight models, namely InceptionV4 [18], NanoViT [19], EfficientNet [20], EdgeNeXt [21], ConvNeXt [22], and MobileNetV3 [23], on four widespread benchmark datasets, each with rich contents and diverse hand gesture characteristics. This study aims to determine not only the most suitable model but also the dataset–model combinations that maximize performance in real-world scenarios.
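For illustration, a minimal transfer learning sketch is shown below. It assumes PyTorch and the timm library with MobileNetV3 as the backbone; the dataset path, class count, and hyperparameters are placeholders rather than our exact experimental configuration.

```python
# Minimal transfer learning sketch (assumes PyTorch, timm, and an
# ImageFolder-style frame dataset; all values are illustrative only).
import timm
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_CLASSES = 27               # hypothetical number of gesture classes
DATA_DIR = "gestures/train"    # hypothetical folder of per-class frames

# Load a lightweight backbone pre-trained on ImageNet and replace its head.
model = timm.create_model("mobilenetv3_large_100", pretrained=True,
                          num_classes=NUM_CLASSES)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
loader = DataLoader(datasets.ImageFolder(DATA_DIR, preprocess),
                    batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one illustrative fine-tuning epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```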
Through our study, we analyze the overall nature of the datasets [24,25] and the transfer learning models. The factors affecting the quality and nature of a dataset are very important; they include the size of the dataset [26], the diversity of the subjects (and whether they are trained to perform the gestures) [27], different lighting conditions [28], the quality and calibration of the data acquisition sensor, the resolution of the images, and the overall design and collection of the gestures [29]. Such factors contribute to a model's adaptability for use in real-world scenarios [30,31].
In summary, this study not only fills the gap by evaluating several lightweight transfer learning models and datasets but also offers valuable insights for researchers and developers who aim to build a real-time hand gesture recognition module. By including subjective test results under different environmental conditions, this study further offers guidelines for creating an effective dataset for applications like drone control, sign-language interpretation, computer vision-based HCI, and more. We summarize our contributions as follows:
This study presents an evaluation of six state-of-the-art lightweight deep learning models (including transformer-based architectures) for hand gesture recognition.
This study presents subjective and objective performance evaluations to assess the datasets’ quality and their impact on the models’ performances.
In this study, we also exploit the MediaPipe hand landmark (MHL) [32] and OpenPose [33] models for the extraction of keypoints and assess dataset quality by the success and failure rates in detecting landmarks under diverse lighting conditions and image qualities (a minimal sketch of this check follows this list).
This study also highlights the trade-offs between the models’ complexities and accuracies in real-time application scenarios related to hand gesture recognition, offering practical guidelines for lightweight model selection in HGR systems.
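As a rough illustration of the landmark-based quality check mentioned above, the sketch below counts the fraction of sampled frames in which the MediaPipe Hands model returns at least one hand; the frame paths and confidence threshold are hypothetical, and the exact criteria used in our pipeline may differ.

```python
# Sketch: per-dataset hand landmark detection rate with MediaPipe Hands.
# Assumes frames are stored as image files; paths and thresholds are illustrative.
import glob
import cv2
import mediapipe as mp

def landmark_detection_rate(frame_glob: str) -> float:
    hands = mp.solutions.hands.Hands(static_image_mode=True,
                                     max_num_hands=2,
                                     min_detection_confidence=0.5)
    paths = glob.glob(frame_glob)
    detected = 0
    for path in paths:
        image = cv2.imread(path)
        if image is None:
            continue
        # MediaPipe expects RGB input.
        result = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            detected += 1
    hands.close()
    return detected / max(len(paths), 1)

# Example comparison across two datasets (hypothetical directories).
print(landmark_detection_rate("datasets/jester/frames/*.jpg"))
print(landmark_detection_rate("datasets/sign_language/frames/*.jpg"))
```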
The rest of this paper is organized as follows. Section 2 reviews related work, and Section 3 describes the methodology. The detailed results are reported in Section 4 and discussed comprehensively in Section 5. Finally, we conclude our work in Section 6 with possible future research directions.
2. Related Work
Hand gesture recognition has been extensively used in real-world application scenarios, like drone control, sign-language interpretation, computer vision-based HCI, and more [7,13]. Several studies have reported insights related to datasets, models, and performance improvement. Due to the computational complexity of convolutional neural networks (CNNs), the focus has shifted to lightweight models that achieve competitive accuracy at lower computational cost [10]. Similarly, with the popularity of these data-hungry models, the focus is usually on creating a large dataset, targeting quantity alone while ignoring quality. Several factors contribute to the quality of a dataset, e.g., biases, lighting conditions, the type of sensor used, the distances and locations of objects in the frames, and the number of subjects involved. So, while many studies have shown deep interest in improving model accuracy, fewer efforts have been made to analyze the nature and diversity of the datasets, which affect the overall performance of a model during training.
Yu et al. [34] evaluated several HGR techniques for drone control, focusing on model accuracy in real time. While they reported performance comparisons of several models, their study lacked an evaluation of benchmark datasets. Diliberti et al. [35] investigated CNN-based models for real-world scenarios; despite their high accuracy, the high computational complexity of these models made them unsuitable for real-time use on embedded systems. Unlike lightweight models, such CNNs are poorly suited to diverse scenarios and real-world applications. Another notable work, by Sethuraman et al. [36], investigated a lightweight model for HGR; while the approach mostly focused on reducing model size, the study lacked an exploration of benchmark datasets for improving applicability in real-world application scenarios. Lastly, Jiang and Shu [37] investigated HGR methods, including sensor-based and vision-based approaches. Although they briefly discussed the trade-off between accuracy, model size, and applicability in real-world scenarios, their study lacked a comparison of transfer learning models on benchmark datasets for real-time applications, like drone control and vision-based HCI systems.
Hence, most studies have overlooked the crucial aspects of dataset analysis and the selection of a suitable transfer learning model. Recent advancements in hand gesture recognition have been significant due to the availability of diverse datasets and ML techniques, but challenges remain in robust and precise gesture classification, especially in real-time interaction-based systems (i.e., drone control, sign-language processing, etc.). Several studies have utilized various datasets to study and assess hand gesture recognition; for example, the EgoGesture [38], NVGesture [39], Jester [40], and SignLanguage [41] datasets have been extensively used in the literature, with the respective authors reporting reasonable accuracy on each.
To address the gap discussed above, we present our detailed pipeline in the following sections.
5. Discussion
In this study, a comprehensive evaluation of the models and benchmark datasets was performed through objective and subjective testing. The performance evaluation results yielded insights that are worth discussing in the following subsections.
5.1. Models’ Performance Analysis
We evaluated six lightweight models in this study: InceptionV4 [18], NanoViT [19], EfficientNet [20], EdgeNeXt [21], ConvNeXt [22], and MobileNetV3 [23]. They showed diverse yet significant behavior on the selected benchmark datasets: depending on the nature of the dataset and the model architecture, some achieved high accuracy while others struggled to generalize from the training data.
Across all the datasets, InceptionV4 [18] and EfficientNet [20] performed best, followed by MobileNetV3 [23], showing their ability to handle variations in the data while maintaining excellent accuracy. Even with noise and irregular motions, strong feature extraction was made possible by InceptionV4's Inception modules, which can analyze spatial and temporal information with remarkable accuracy. Similarly, the compound scaling technique in EfficientNet [20] keeps the network compact while preserving the input data's dimensions, allowing it to reach high accuracy with minimal computation. This enabled both models to generalize well across all the datasets. Hence, both models also performed reasonably when exposed to subjective testing under the different test scenarios listed in Table 2, Table 7, and Table 9.
On the other hand, ConvNeXt [22] and EdgeNeXt [21] performed less than ideally on a number of datasets. Although effective, ConvNeXt's [22] purely convolutional architecture lacks the depth necessary to generalize well on datasets with substantial intra-class variability and irregular gesture execution; its dependence on conventional convolutional hierarchies limited its ability to capture the fine-grained spatial relationships that are essential for hand gesture identification. EdgeNeXt's [21] lightweight transformer-based architecture, although designed for resource-constrained contexts, has trouble extracting spatial and temporal characteristics from gesture sequences because of its lower model complexity. These constraints hampered their overall effectiveness and suitability for difficult datasets.
The lightweight vision transformer NanoViT [19] performed in the middle tier, outperforming ConvNeXt [22] and EdgeNeXt [21] but not InceptionV4 [18] or EfficientNet [20]. Because NanoViT's self-attention mechanisms enabled it to capture the global context well, it was able to withstand a certain amount of dataset variability. However, it was unable to fully exploit complicated patterns due to its relatively shallow design and smaller parameter count, especially on datasets that required in-depth spatial and temporal understanding. Although NanoViT [19] provided a balance between accuracy and computational efficiency, these trade-offs made it less useful for datasets with many different gestures.
5.2. Datasets’ Performance Evaluation
The results presented in Section 4.1 and Section 4.2 are quite revealing in several ways. Despite having fewer samples per class than the Jester and EgoGesture datasets, the sign-language dataset was still the most effective. This can be attributed to several factors: the images in the dataset have a reasonable resolution, which enhances the clarity and quality of the data for use with the MHL model, and the gesture duration in this dataset is consistent, ensuring uniform video sample lengths. Unlike the other datasets, which exhibit standard deviations of 0.49, 2.38, and 2.48 s per clip, this dataset maintains a minimal standard deviation of 0.21 s per clip, avoiding overly short or lengthy clips. These details are presented in Table 1.
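As a simple illustration of how the per-clip duration statistics in Table 1 can be computed, the following sketch derives the mean and standard deviation of clip durations from per-clip frame counts; the frame counts and the assumed frame rate are hypothetical.

```python
# Sketch: mean and standard deviation of gesture clip durations (seconds).
# Frame counts per clip and the 30 fps frame rate are illustrative assumptions.
import numpy as np

def duration_stats(frame_counts, fps=30.0):
    durations = np.asarray(frame_counts, dtype=float) / fps
    return durations.mean(), durations.std()

mean_s, std_s = duration_stats([36, 42, 90, 28, 75])  # hypothetical clips
print(f"mean={mean_s:.2f} s, std={std_s:.2f} s")
```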
Similarly, in this dataset, hand gestures are captured with sharp edge definition and minimal motion distortion (no blurriness), ensuring a high detection rate for hand landmarks using the MHL model. This is due to the well-trained subjects, who performed the gestures at an optimal speed, neither too fast nor too slow. This judgment was made by carefully analyzing the dataset's image quality and the contents of the image frames. For a cross-dataset comparison, a blurriness index was estimated by analyzing the motion blur levels across frames in each dataset (see Table A1 and Figure A1 in Appendix A and Appendix B). We considered edge sharpness and pixel variance as key indicators, assigning values based on the relative frequency of blurred images. As observed, Jester exhibited the highest blurriness, followed by NVGesture, EgoGesture, and sign-language. This indicates that gestures in certain datasets were performed with less precision, likely due to the lack of proper training among subjects, leading to inconsistencies in movement and reduced model performance. The detection rate of hand landmarks was also notably high for the sign-language dataset with the MHL model, contributing to its effectiveness.
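The blurriness index described above can be approximated with a standard sharpness proxy. The sketch below uses the variance of the Laplacian (lower variance indicates stronger blur) averaged over sampled frames; it is only one possible realization of the edge-sharpness and pixel-variance indicators, and the frame paths are hypothetical.

```python
# Sketch: blurriness proxy via variance of the Laplacian (OpenCV).
# A higher returned index means more blur across the dataset; paths are illustrative.
import glob
import cv2
import numpy as np

def sharpness_score(image_path: str) -> float:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return float("nan")
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def dataset_blurriness(frame_glob: str) -> float:
    scores = [sharpness_score(p) for p in glob.glob(frame_glob)]
    # Invert mean sharpness so that a larger value indicates more blur.
    return 1.0 / (np.nanmean(scores) + 1e-6)

print(dataset_blurriness("datasets/jester/frames/*.jpg"))
```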
On the other hand, the Jester dataset, despite having a large number of samples per class (see Table 1), revealed the following limitations, which make it the least effective:
The resolution of the frames in the Jester dataset is very low, which is not suitable for the MHL model and results in a low detection rate of hand landmarks, as shown in Figure 4 and Figure 5.
The standard deviation of the gesture clip durations (in seconds) reveals a huge difference in the lengths of the gesture clips, as shown in Table 1, resulting in many overly short and long clips and eventually leading to redundant data.
Most of the frames in this dataset are not clear because of the low resolution, as shown in Figure 3.
The observed variability and inconsistency in the Jester dataset suggest that measures such as using trained subjects were not implemented, potentially affecting the dataset's reliability; previous studies have highlighted the role of trained subjects in maintaining dataset consistency and quality during gesture recognition tasks [48].
Biases introduced in a dataset also play a crucial role in its effectiveness [9]. All three datasets besides sign-language exhibit a bias, as most gestures are performed with the right hand, and model performance declined under cross-hand testing. Cross-hand testing was employed to highlight this bias in the NVGesture, Jester, and EgoGesture datasets: because the gestures were recorded predominantly with one hand, errors arise when subjects use their less-practiced hand, reducing recognition accuracy. The resulting performance drops of the individual datasets under different test scenarios can be seen in the Results Section 4.2, in Table 4, Table 5, Table 6 and Table 7, and in the graph of the evaluation metrics for cross-hand testing, where the three datasets (besides the sign-language dataset) show lower performance. The lighting conditions affected the performance of all the datasets, as shown in the aforementioned tables.
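One way to approximate the cross-hand test described above is to horizontally mirror the evaluation frames so that gestures recorded with one hand are presented as the other. The sketch below assumes a trained PyTorch classifier and a standard test dataloader; note that direction-sensitive classes (e.g., swipe left/right) would additionally require label remapping, which is omitted here.

```python
# Sketch: cross-hand evaluation by horizontally mirroring test frames,
# so gestures recorded with one hand are presented as the other hand.
import torch

@torch.no_grad()
def cross_hand_accuracy(model, loader) -> float:
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        # Mirror along the width axis to swap apparent handedness.
        mirrored = torch.flip(images, dims=[-1])
        predictions = model(mirrored).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)
```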
EgoGesture and NVGesture were medium performers; they have both limitations and advantages but could be improved for use in real-world scenarios if data augmentation were applied (removing biases and applying transformations can improve performance) [46,47]. Reflecting upon the results (Section 4), it becomes evident that the challenges confronted across the datasets extend beyond the hand gestures themselves. Factors such as the lighting conditions, image resolution, variable-length sequences with large deviations, and inherent biases present in the datasets play pivotal roles in shaping a model's adaptability and robustness in real-world application scenarios. The sign-language dataset showed resilience and remained the most effective across the different test scenarios.
In summary, the comparative evaluation of the six models across the selected hand gesture recognition benchmarks reveals a consistent performance hierarchy: InceptionV4 > EfficientNet > MobileNetV3 > NanoViT > ConvNeXt > EdgeNeXt. Similarly, among the benchmarks, SignLanguage performed best overall, followed by EgoGesture, NVGesture, and Jester, indicating that dataset quality, shaped by balanced class distribution, subject diversity, effective augmentation, minimized bias, consistent sequence lengths, and varied conditions, plays a critical role in model generalization. These findings underscore the importance of both architectural efficiency and data quality when designing gesture recognition systems, especially for real-time or resource-constrained environments.
5.3. Limitations
Although this study can be very helpful in situations where dataset creation and careful model selection are needed, it has a number of limitations that should be noted. The generalizability of the results may be limited because, although we tested our models on four benchmark datasets, these datasets might not accurately reflect the variety of real-world situations, such as variations in lighting, backgrounds, and cultural gestures. Furthermore, the datasets did not include professionally trained subjects performing the gestures, which resulted in inconsistent data that may have impacted the models' performance. Our dependence on the MHL and OpenPose models as the main feature extractors may also have skewed the findings in favor of models that fit this particular representation well, possibly overlooking alternative feature extraction techniques.
Additionally, this investigation focused solely on temporal-level data processing, without considering any segmentation techniques, which might not adequately account for the continuous stream of frames fed to the system. Meanwhile, we did not thoroughly examine other important aspects like latency, energy consumption, and resistance to adversarial inputs, all of which are essential for real-time deployment in safety-critical applications, like drone control and other HCI systems. We also focused solely on performance evaluation metrics, covering both subjective and objective testing. The thoroughness of our findings may be limited since, although our exploration included six lightweight models, alternative designs like an enhanced fine-tuned MobileNet [23] or ShuffleNet [49] were not taken into account. Lastly, even though this study focused on hand gesture detection for vision-based control systems, it is unclear whether the findings will apply to other fields, like virtual reality or healthcare. These drawbacks point to areas that need more investigation to improve the scalability and resilience of gesture recognition systems.