A Multi-Model CNN Approach Using Pre-Trained Network for Improved Hand Gesture Recognition

Chen, Yeou-Jiunn; Aryanti, Aryanti; Hong, Qian-Bei

doi:10.3390/asi9050100


by Yeou-Jiunn Chen ¹, Aryanti Aryanti ¹,² and Qian-Bei Hong ¹,*

¹ Department of Electrical Engineering, Southern Taiwan University of Science and Technology, Tainan 710, Taiwan

² Department of Electrical Engineering, Politeknik Negeri Sriwijaya, Palembang 30139, Indonesia

* Author to whom correspondence should be addressed.

Appl. Syst. Innov. 2026, 9(5), 100; https://doi.org/10.3390/asi9050100

Submission received: 31 March 2026 / Revised: 8 May 2026 / Accepted: 11 May 2026 / Published: 13 May 2026

(This article belongs to the Topic Social Sciences and Intelligence Management, 2nd Volume)


Abstract

Hand gesture recognition (HGR) is a critical area in computer vision that supports intuitive human–computer interaction and sign language communication, yet existing systems remain sensitive to lighting variations, background clutter, and diverse hand postures. This study introduces two contributions to address these limitations: a Gradient-Based Augmentation Validation (GBAV) framework that establishes structurally safe augmentation ranges before training, and a multi-backbone Convolutional Neural Network (CNN) architecture combining ResNet50 and InceptionV3 with optional attention-based pooling. GBAV uses magnitude-weighted gradient orientation histograms with Pearson correlation and Kullback–Leibler divergence thresholds to verify label invariance under spatial transformations, providing a classifier-agnostic pre-training calibration mechanism. The proposed framework is evaluated on three static gesture datasets, Indonesian Sign Language (BISINDO), American Sign Language (ASL), and Hand Gesture 14 (HG14), yielding validation accuracies of 96.87%, 99.92%, and 95.25%, respectively, with 5-fold cross-validation on HG14 confirming result stability (93.51% ± 2.31%). Quantitative attention localization, cross-dataset transfer evaluation, and computational efficiency analysis (26.8 ms per image, ~37 FPS) further support the framework’s robustness and practical deployability. These findings establish GBAV-calibrated augmentation as the principal performance driver, which complements the multi-backbone architecture for robust hand gesture recognition across diverse visual contexts.

Keywords:

hand gesture recognition; convolutional neural networks; attention-based pooling; multi-model integration; sign language recognition

1. Introduction

Hand gesture recognition (HGR) offers an intuitive interface for human–computer interaction, facilitating efficient control across multiple technological domains. It has thus been implemented in many applications, including gaming, virtual and augmented reality, automotive systems, industrial automation, and healthcare [1]. HGR also supports communication accessibility by enabling sign language interpretation, facilitating daily interactions, education, and social inclusion for members of the Deaf community. However, linguistic and cultural variations among Deaf communities worldwide lead to differences in manual alphabet systems. Consequently, there is a need for HGR systems capable of learning regional sign languages rather than relying on a presumed universal standard.

Recent advances in machine learning have substantially improved the performance of automated systems for hand movement recognition. Various techniques have been proposed for translating sign language into textual representations, including optical character recognition (OCR) [2] and ultrasonic array sensing [3]. In addition, machine learning has been used for educational support, automatic feature extraction, and dynamic HGR [4]. More recent studies have focused on the use of deep learning architectures, particularly pre-trained Convolutional Neural Networks (CNNs), to achieve higher accuracy and robustness in hand gesture classification.

However, despite these promising developments, CNN-based HGR models remain vulnerable to dataset bias because their feature extractors are typically optimized for a single visual distribution. As a result, illumination shifts, background clutter, or variations in the hand morphology can suppress discriminative cues and compromise the classification robustness [5,6]. Furthermore, traditional single-model CNN architectures may fail to capture all the spatial and semantic features required for robust recognition, resulting in impaired performance in real-world scenarios. To address these limitations, recent studies have explored multi-model strategies that combine multiple pre-trained CNN architectures [7,8]. By leveraging complementary feature representations across different backbones, such approaches aim to enhance the classification accuracy and overall reliability of automatic HGR systems.

Several studies have implemented ensemble or multi-backbone strategies to enhance the feature diversity in HGR. Typical examples include combining transfer learning-based VGG16 with Random Forest classifiers [9] and training GoogLeNet, VGGNet, and AlexNet in parallel for real-time CNN-ensemble processing [10]. In another example, ResNet50 was integrated with the Tamura texture descriptor to improve its robustness to texture and lighting variations [11]. Some ensemble deep learning pipelines combine SE-CapsNet with BiGRU, VAE, and BiLSTM, with the hyperparameters tuned using Improved Beluga Whale Optimization [12]. Other authors have employed probabilistic fusion strategies, including deep ensembles for predictive uncertainty estimation [13], or used GAN-based data generation to improve the generalization performance of CNN classifiers [14]. Two-stream networks have also been studied. For instance, HGR-Net integrates semantic segmentation with CNN-based classification, achieving a high classification performance with low-latency inference (~23 ms/frame). Collectively, these ensemble and multi-stage approaches enhance the classification precision, resilience, and processing speed, supporting the development of reliable real-time HGR applications.

Attention mechanisms have also been employed to improve the spatial selectivity and feature efficiency of HGR models. By focusing on key attributes within the input data, these mechanisms enhance feature discrimination and support real-time processing. Several implementations have been reported. For example, ResNet architectures have been augmented with attention modules for static gesture recognition [15], while CNN–BiLSTM frameworks incorporating attention mechanisms have also been explored for gesture recognition tasks [16]. Similarly, VGG-16 has been enhanced with attention blocks to strengthen the image-based gesture recognition performance [17]. More recently, Vision Transformer–based models, such as HGR-ViT, have leveraged multi-head self-attention to capture long-range positional dependencies within gesture representations [18].

Although multi-backbone CNN architectures and attention mechanisms have been widely adopted for HGR, many existing studies implicitly assume that frequently used data augmentation techniques preserve semantic consistency throughout transformations. This premise, referred to as label invariance, is seldom explicitly verified. In real-world applications, however, significant spatial perturbations, such as substantial translations, rotations, or scaling operations, can alter critical finger configurations and disrupt structural coherence. As a result, augmented samples may deviate from the underlying gesture manifold, introducing semantic distortion and potentially misleading the learning process.

Beyond this theoretical concern, the implications are also empirical. For instance, the performance gains reported in previous studies may partially stem from training on semantically invalid samples, potentially leading to unstable optimization dynamics and an overstated generalization capability. Furthermore, although attention mechanisms are frequently employed to improve spatial selectivity, relatively few studies have offered feature-level evidence demonstrating that learned attention maps reliably concentrate on significant hand regions rather than background artifacts. Together, these limitations underscore the need for a framework that simultaneously validates the augmentation integrity and ensures model interpretability, while leveraging complementary CNN backbones.

To address this need, this study introduces a Gradient-Based Augmentation Validation (GBAV) framework to explicitly assess the reliability of the training distribution. This framework outlines the structural boundaries of the gesture manifold, thereby reducing semantic noise and facilitating a reliable interpretability analysis.

The main contributions of this study can be summarized as follows:

(1): A GBAV framework is presented as a calibration mechanism for pre-training, aimed at empirically validating label invariance in the context of spatial transformations. It also seeks to establish safe ranges for augmentation by employing KL divergence and analyzing gradient-structure consistency.
(2): A multi-CNN architecture is proposed that integrates ResNet50 and InceptionV3 to leverage the synergistic advantages of residual learning and inception modules, respectively, to improve feature extraction.

The remainder of this paper is organized as follows. Section 2 details the Materials and Methods, covering the GBAV calibration procedure, combined CNN architecture, datasets, preprocessing, and training configuration. Section 3 reports classification results, cross-dataset generalization, quantitative attention localization, 5-fold cross-validation, and computational efficiency analysis. Section 4 discusses implications and limitations. Section 5 concludes.

2. Materials and Methods

The research methodology comprised a sequence of structured stages, as shown in Figure 1.

The workflow began with dataset collection, followed by preprocessing to standardize the inputs and prepare samples for model training. The deep learning model was then designed and optimized. The model performance was evaluated using validation data and established classification metrics. Finally, the experimental results were obtained and analyzed.

2.1. Datasets

This study employed three publicly available hand gesture datasets: Indonesian Sign Language (BISINDO), American Sign Language (ASL), and HG14. These datasets exhibit significant variations in gesture vocabulary, background complexity, image resolution, and color-channel composition, thus offering a diverse benchmark for assessing the robustness and generalization ability of the proposed models. The details of each dataset are provided below.

(1): Indonesian Sign Language (BISINDO) Dataset

The BISINDO dataset was developed by Ma’aruf [19] and consists of 26 hand gestures. It contains 11,471 images, each with a resolution of 640 × 640 pixels. The samples are captured against diverse backgrounds, as shown in Figure 2.

(2): American Sign Language (ASL) Dataset

The ASL dataset contains 26,000 images representing A to Z [20]. Each image has a size of 256 × 256 pixels and is captured against a black background to maintain visual consistency and facilitate feature extraction. As shown in Figure 3, the dataset includes samples from multiple individuals to capture variations in the hand size, shape, and pose, thereby reducing the model bias and improving its robustness.

(3): Hand Gesture 14 (HG14) Dataset

The HG14 dataset, created by Guler [21], comprises 14 different hand gestures for hand-based interaction and application control in augmented reality environments (Figure 4). The dataset contains 14,000 color (RGB) images with a resolution of 256 × 256 pixels.

2.2. Preprocessing Data

In machine learning and deep learning models, data preprocessing plays a crucial role in standardizing the data format, improving the data quality, and ensuring that the input is easy for the model to interpret, thereby increasing the training effectiveness and supporting more accurate predictions. In this study, the preprocessing step involved resizing, rescaling, and partitioning to ensure consistent input dimensions and well-structured samples for model development. The details of the preprocessing step are described below.

(1): Resizing Dataset

For each dataset, the images were resized to 299 × 299 pixels to ensure consistent spatial dimensions for CNN processing. This specific resolution was chosen to align with the architectural requirements of the InceptionV3 backbone [22], while also being compatible with ResNet50, which accommodates varying input resolutions through global pooling. Moreover, this standard resolution aligns with common ImageNet-pretrained architectures, reduces the computational overhead, and limits scale variation so that the model can focus on the salient visual features [23].

(2): Rescaling

After resizing, the pixel values were normalized from the 0–255 RGB range to [0, 1] [24]. The normalization process stabilizes gradient-based optimization and accelerates convergence by constraining input magnitudes, which supports efficient CNN training.

(3): Partitioning

Each dataset was partitioned into training and validation subsets using an 80:20 ratio, where 80% of the samples were used for model learning and 20% were reserved for performance evaluation [25]. Holding out a validation subset supports generalization assessment and helps detect overfitting. Accordingly, BISINDO was divided into 9186 training and 2285 validation images, ASL into 20,800 and 5200 images, and HG14 into 11,200 and 2800 images, respectively.
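The partitioning step can be illustrated with a minimal sketch of a deterministic per-class 80/20 split. The function name `stratified_split`, the toy labels, and the choice to round the validation count down within each class are assumptions for illustration; the seed value of 42 follows the implementation details reported later in the paper.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_percent=20, seed=42):
    """Return (train_indices, val_indices) using a per-class hold-out."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Rounding the validation count down is an assumption of this sketch.
        n_val = len(indices) * val_percent // 100
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# Toy example: three hypothetical classes with ten samples each.
labels = [c for c in "ABC" for _ in range(10)]
train_idx, val_idx = stratified_split(labels)
```

Because the split is computed per class, every class contributes the same proportion of samples to the validation subset, and re-running the function with the same seed reproduces the same partition.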

2.3. Data Augmentation

Data augmentation was performed to improve the generalization of the model by broadening the data diversity while maintaining semantic consistency. Standard spatial transformations were first applied and subsequently constrained using the proposed GBAV framework.

(1): Baseline Augmentation Strategies

Data augmentation was implemented on-the-fly during training using a stochastic pipeline that included horizontal flipping, rotation (±5°), zoom (±5%), and translation (±10%), matching the GBAV-calibrated limits. While these transformations were randomly sampled at each epoch, their permissible ranges were constrained by the proposed GBAV framework to maintain semantic consistency and prevent structural distortion.
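When augmentation limits are passed to Keras preprocessing layers, units matter: `RandomRotation` interprets its `factor` argument as a fraction of a full turn (2π), whereas `RandomZoom` and `RandomTranslation` take fractions of the image size. The sketch below converts the GBAV-calibrated limits reported in this paper (rotation = 5°, zoom = 5%, translation = 10%) into such factors. The helper name `keras_augmentation_factors` and the dictionary keys are hypothetical, and the snippet computes the numbers only; it does not reproduce the authors' pipeline.

```python
# GBAV-calibrated limits quoted in the paper.
SAFE_ROTATION_DEG = 5.0   # degrees
SAFE_ZOOM = 0.05          # fraction of image size
SAFE_TRANSLATION = 0.10   # fraction of image height/width


def keras_augmentation_factors(rotation_deg, zoom, translation):
    """Convert human-readable limits into Keras-style layer factors."""
    return {
        # RandomRotation(factor=f) rotates within [-f * 2*pi, +f * 2*pi].
        "rotation_factor": rotation_deg / 360.0,
        # RandomZoom(height_factor=f) zooms by up to a fraction f.
        "zoom_factor": zoom,
        # RandomTranslation(height_factor=f, width_factor=f) shifts by up to f.
        "translation_factor": translation,
    }


factors = keras_augmentation_factors(
    SAFE_ROTATION_DEG, SAFE_ZOOM, SAFE_TRANSLATION
)
# Under this convention a rotation factor of 0.05 would mean about 18 degrees.
degrees_for_factor_005 = 0.05 * 360.0
```

Under this convention, a 5° limit corresponds to a rotation factor of roughly 0.0139, not 0.05.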

(2): Gradient-Based Augmentation Validation (GBAV)

Data augmentation typically operates under the label invariance assumption, which posits that semantic class labels do not change under permissible transformations [26]. Nevertheless, excessive or unconstrained augmentation can distort the underlying data manifold and cause semantic drift, resulting in inconsistent supervision signals and diminished model performance. Unlike adaptive augmentation strategies, GBAV does not dynamically adjust transformations during optimization but instead provides a principled verification mechanism to constrain augmentation intensity before training.

To mitigate this issue, the proposed GBAV framework systematically evaluated the structural consistency of the augmented samples prior to their incorporation into the training process. In the proposed framework, the structural integrity of the augmented samples was quantified by leveraging principles from representation learning [27] and gradient-based feature descriptors [28]. In particular, the structural characteristics were encoded through magnitude-weighted gradient orientation histograms, which are inspired by HOG representations and have been demonstrated to effectively encode human-centric structures such as edges and contours [29]. The similarity between the original and augmented samples was then measured using Kullback–Leibler divergence [30], complemented by correlation-based metrics, to identify the transformations that preserve semantic consistency.
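As a rough illustration of the descriptor described above, the following sketch computes a magnitude-weighted gradient orientation histogram for a small grayscale image stored as nested lists. Central differences, unsigned orientations in [0°, 180°), hard bin assignment, and normalization to unit sum are assumptions in the spirit of HOG; the paper does not specify these details, and the function name is hypothetical.

```python
import math


def gradient_orientation_histogram(image, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # horizontal central difference
            gy = image[y + 1][x] - image[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            index = min(int(angle / bin_width), bins - 1)
            hist[index] += magnitude  # each pixel votes with its gradient magnitude
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


# A vertical edge produces purely horizontal gradients (orientation 0 degrees).
vertical_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
# A horizontal edge produces purely vertical gradients (orientation 90 degrees).
horizontal_edge = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]

hist_vertical = gradient_orientation_histogram(vertical_edge)
hist_horizontal = gradient_orientation_histogram(horizontal_edge)
```

The two toy images place all of their histogram mass in different bins, which is the kind of structural change the similarity measures are meant to detect.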

To operationalize this framework, a calibration procedure was conducted using a held-out subset of 200 BISINDO images stratified across all classes. For each augmentation type, magnitude-weighted 9-bin gradient orientation histograms (HOG-style descriptors) [28] were computed for both original and augmented samples. Structural similarity was quantified using Pearson correlation (r) and symmetric Kullback–Leibler (KL) divergence with Laplace smoothing. A transformation magnitude was accepted as “safe” if r ≥ 0.85 and D_{KL} ≤ 0.05. These thresholds correspond to conventional cutoffs for strong structural similarity and distributional equivalence and were defined prior to the sweep to avoid post hoc selection. The evaluated parameter grid included rotation {2, 3, 4, 5, 6, 7, 9, 18, 27, 36, 45} degrees, zoom {0.05, 0.10, 0.20, 0.50, 0.90}, and translation {0.05, 0.10, 0.15, 0.20, 0.30}. The resulting safe magnitudes used in all training experiments were rotation = 5°, zoom = 5%, and translation = 10%.
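The acceptance test above can be sketched in a few lines, assuming two 9-bin histograms are already available. The smoothing constant `alpha`, the averaging of the two KL directions, and all function names are assumptions for illustration; the paper states only that a symmetric KL divergence with Laplace smoothing was used.

```python
import math

R_MIN = 0.85    # Pearson correlation threshold from the paper
KL_MAX = 0.05   # KL divergence threshold from the paper


def pearson(p, q):
    """Pearson correlation between two equal-length histograms."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    std_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (std_p * std_q) if std_p > 0 and std_q > 0 else 0.0


def laplace_smooth(hist, alpha=1e-3):
    """Additive smoothing so that no bin is exactly zero (alpha is assumed)."""
    total = sum(hist) + alpha * len(hist)
    return [(v + alpha) / total for v in hist]


def symmetric_kl(p, q, alpha=1e-3):
    """Average of KL(p||q) and KL(q||p) on smoothed histograms (assumed form)."""
    p, q = laplace_smooth(p, alpha), laplace_smooth(q, alpha)
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def is_structurally_safe(original_hist, augmented_hist):
    """Apply both acceptance criteria to a pair of histograms."""
    r = pearson(original_hist, augmented_hist)
    kl = symmetric_kl(original_hist, augmented_hist)
    return r >= R_MIN and kl <= KL_MAX


# Hypothetical histograms: an unchanged copy versus a strongly altered one.
reference = [0.6, 0.2, 0.1, 0.05, 0.05, 0.0, 0.0, 0.0, 0.0]
altered = list(reversed(reference))
```

An unchanged histogram passes both criteria, whereas the reversed one fails on correlation alone.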

Unlike existing augmentation strategies such as AutoAugment [31], RandAugment [32], Faster AutoAugment [33], and TrivialAugment [34], which rely on downstream model performance to evaluate augmentation policies, GBAV operates directly on input-level structural statistics and is therefore architecture-independent in principle. Empirical validation in this work is conducted on CNN-based pipelines, while extension to other architectures remains straightforward and is left for future work. Furthermore, GBAV is applied as a pre-training calibration step rather than during optimization, in contrast to adaptive curriculum-based approaches such as AugMax [35]. These distinctions position GBAV as a principled validation mechanism for ensuring semantic integrity prior to model training, rather than a training-time augmentation strategy.

2.4. Model Development

ResNet50 was selected for its hierarchical residual representations and stable optimization in deep architectures; InceptionV3 was selected for its multi-scale receptive fields, capturing both local finger-level features and global hand shape simultaneously. We considered EfficientNet and Vision Transformer alternatives but selected ResNet50 + InceptionV3 because both backbones provide well-validated ImageNet pre-training and produce complementary representation types at comparable parameter budgets.

A hybrid pretrained model was constructed by integrating two complementary CNN architectures, ResNet50 and InceptionV3, together with an attention-based pooling mechanism to enhance feature extraction and classification performance for hand gesture images. The model received input images of size 299 × 299 × 3, the standard input resolution of InceptionV3, which ResNet50 also accepts. The images were processed in parallel through two branches: ResNet50, which generated a 10 × 10 × 2048 feature map, and InceptionV3, which produced an 8 × 8 × 2048 feature map. Global Average Pooling (GAP) was then applied to convert both feature maps into 2048-dimensional vectors, which were subsequently reduced to 256-dimensional embeddings via ReLU-activated dense layers to reduce the computational complexity while preserving the key discriminative information.

The two compact feature vectors were subsequently concatenated into a unified 512-dimensional representation to form the fused backbone of the model. The attention-based pooling layer then refined this vector by prioritizing the features that contributed most significantly to the accuracy of the classification outputs. The attention-enhanced representation was further processed by a fully connected layer with 256 ReLU-activated neurons, followed by a dropout layer with a rate of 0.3 to prevent overfitting during training. Finally, a dense layer equipped with SoftMax activation was used to produce a probability distribution across the target classes, enabling the model to determine the most likely category for each input image. The proposed model, which combines the pretrained ResNet50 and InceptionV3 backbones with attention-based pooling, is shown in Figure 5.
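The data flow described above can be summarized compactly as follows. The symbols (F, g, v, u, a, h, the weight matrices W and biases b, and the class count C) are notation introduced here for illustration and do not appear in the paper; the dimensionality of the attention output a is left open because the text does not state it.

```latex
\begin{aligned}
x &\in \mathbb{R}^{299 \times 299 \times 3} \\
F_R &= \mathrm{ResNet50}(x) \in \mathbb{R}^{10 \times 10 \times 2048},
  & F_I &= \mathrm{InceptionV3}(x) \in \mathbb{R}^{8 \times 8 \times 2048} \\
g_R &= \mathrm{GAP}(F_R) \in \mathbb{R}^{2048},
  & g_I &= \mathrm{GAP}(F_I) \in \mathbb{R}^{2048} \\
v_R &= \mathrm{ReLU}(W_R\, g_R + b_R) \in \mathbb{R}^{256},
  & v_I &= \mathrm{ReLU}(W_I\, g_I + b_I) \in \mathbb{R}^{256} \\
u &= [\, v_R \,;\, v_I \,] \in \mathbb{R}^{512} \\
a &= \mathrm{AttnPool}(u) \\
h &= \mathrm{ReLU}(W_h\, a + b_h) \in \mathbb{R}^{256} \\
\hat{y} &= \mathrm{softmax}\big(W_o\, \mathrm{Dropout}_{0.3}(h) + b_o\big) \in \mathbb{R}^{C}
\end{aligned}
```

Here C is 26 for BISINDO and ASL and 14 for HG14, following the dataset descriptions in Section 2.1.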

The following subsections briefly summarize the core building blocks of the proposed framework, including the convolutional feature extractors, residual networks, multi-branch inception modules, channel-wise concatenation, and attention-based pooling, to clarify the functional roles of each element within the fused backbone.

(1): Convolutional Neural Networks (CNNs)

CNNs comprise a feature extraction stage followed by a classification stage, as illustrated in Figure 6.

Convolutional layers apply learned kernels to extract spatial representations, which are passed through non-linear activations (ReLU) and refined using pooling to reduce the spatial resolution and enhance the translational invariance. Fully connected layers then map the resulting feature vectors to the output classes through supervised training.

(2): Residual Network 50 (ResNet50)

ResNet50 is a 50-layer convolutional network that mitigates the vanishing gradient problem through residual (skip) connections, as shown in Figure 7.

These identity mappings bypass stacks of convolutions and support stable optimization in deep feature hierarchies built from bottleneck blocks (1 × 1, 3 × 3, and 1 × 1 with batch normalization and ReLU). Spatial activations are then aggregated by global-average pooling and mapped to output classes by a fully connected layer.

(3): Inception Version 3 (InceptionV3)

InceptionV3 is a convolutional architecture designed to capture multi-scale spatial structures using parallel convolutional branches with different receptive field sizes, as shown in Figure 8.

The network consists of a stem for initial feature extraction, followed by stacked inception modules that perform multi-branch processing with dimensionality-reduction (1 × 1) factorization to control the computational cost. The outputs from the parallel paths are concatenated and subsequently aggregated by global-average pooling, and a fully connected layer with SoftMax activation then generates class probability predictions.

(4): Concatenation Layer

Concatenation combines feature maps along the channel dimension to merge outputs from parallel convolutions, enabling the network to integrate multi-scale representations within a single layer. By preserving all the activations produced by the individual branches, the concatenation operation expands the channel depth without altering the spatial resolution, thereby enriching the representational capacity of the feature tensor [36].

(5): Attention-Based Pooling Method

Attention-based pooling assigns learned weights to the feature activations to produce a weighted representation that prioritizes task-relevant patterns in the input images. Unlike max or average pooling, which applies fixed selection rules, attention pooling adapts to the input by estimating importance scores using a self-attention mechanism. Multi-head attention extends this operation by projecting the features into multiple subspaces and computing independent weight vectors, yielding richer contextual representations and an improved discriminative performance.

The attention score for each input element is computed by applying a linear transformation followed by the tanh activation function, as expressed in (1):

e_{i} = \tanh (X_{i} \cdot W + b)

(1)

where b represents the bias, W is the trainable attention weight matrix, and X_{i} is the i-th input feature vector. The Softmax function is then used to normalize these attention scores and generate probabilistic weights, as shown in (2):

\alpha_{i} = \frac{\exp (e_{i})}{\sum_{j = 1}^{n} \exp (e_{j})}

(2)

Finally, the obtained weights are applied to combine the features in a weighted manner [37,38], resulting in a combined output defined as (3):

z = \sum_{i = 1}^{n} \alpha_{i} \cdot X_{i}

(3)

The main advantage of the attention-based pooling mechanism is that it can flexibly select the most important features, making the model more efficient and accurate when handling difficult tasks such as classification and segmentation.
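The three steps of attention-based pooling (score, normalize, weighted sum) can be traced with a small pure-Python sketch. Treating W as a single weight vector so that each score is a scalar, and the toy feature values, are assumptions made for illustration.

```python
import math


def attention_pool(features, weights, bias=0.0):
    """Return (attention weights, pooled vector) for a list of feature vectors."""
    # Score step: e_i = tanh(X_i . W + b), one scalar score per element.
    scores = [
        math.tanh(sum(x * w for x, w in zip(vector, weights)) + bias)
        for vector in features
    ]
    # Normalization step: softmax over the scores (max-shifted for stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Pooling step: z = sum_i alpha_i * X_i, computed per dimension.
    dim = len(features[0])
    pooled = [
        sum(a * vector[k] for a, vector in zip(alphas, features))
        for k in range(dim)
    ]
    return alphas, pooled


# Three hypothetical 2-dimensional feature vectors.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, pooled = attention_pool(features, weights=[2.0, 2.0])

# With zero weights every score is tanh(0) = 0, so pooling reduces to the mean.
uniform_alphas, mean_pooled = attention_pool(features, weights=[0.0, 0.0])
```

The zero-weight case shows that average pooling is the special case in which all attention weights are equal.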

2.5. Implementation Details

The experiments were implemented in Python using a suite of deep-learning and scientific computing libraries. TensorFlow, accessed through the Keras API, served as the primary framework for building and training the models. NumPy was utilized for numerical computations, and Python's os module handled file management. Image preprocessing and augmentation were performed using the TensorFlow tf.data API alongside Keras preprocessing layers, enabling efficient data loading, parallel processing, and real-time augmentation. Model evaluation, including the generation of confusion matrices and classification reports, was performed with Scikit-Learn. Finally, the training outcomes, such as the accuracy and loss curves and confusion matrices, were visualized using Matplotlib (version 3.10.0). GBAV calibration was performed prior to model training to determine the maximum permissible transformation intensities, which were established based on divergence thresholds obtained from an analysis of gradient-structure consistency.

Four model configurations were evaluated:

(1): Combined ResNet50 + InceptionV3 with attention;
(2): Combined without attention;
(3): ResNet50;
(4): InceptionV3.

All the experiments were carried out on a personal computer, the specifications of which are summarized in Table 1, while the relevant hyperparameter configurations are detailed in Table 2.

Preprocessing was implemented using explicit TensorFlow operations. Image resizing was performed using tf.image.resize with bilinear interpolation to a fixed resolution of 299 × 299 × 3, followed by per-image normalization to the [0, 1] range. Dataset partitioning was conducted deterministically using a per-class 80/20 split with a global random seed of 42 to ensure reproducibility. Data augmentation was implemented using tf.keras.layers, including RandomFlip, RandomRotation, RandomZoom, and RandomTranslation, with transformation magnitudes constrained by the GBAV calibration procedure. Model training utilized the Adam optimizer with a constant learning rate of 5 × 10⁻⁵ and a batch size of 64 for a maximum of 30 epochs. Label-smoothed categorical cross-entropy (smoothing = 0.01) was used as the loss function. L2 weight regularization with a coefficient of 1 × 10⁻⁴ was applied to fully connected layers, and dropout with a rate of 0.3 was used to mitigate overfitting. The pretrained ResNet50 and InceptionV3 backbones were initially frozen and subsequently fine-tuned by unfreezing the final convolutional block of each network. Training progress was monitored using a validation set, with EarlyStopping (patience = 5) and ModelCheckpoint configured to preserve the weights corresponding to the lowest validation loss. The complete training and evaluation pipeline, including GBAV calibration, cross-dataset evaluation, attention quantification, efficiency benchmarking, and 5-fold cross-validation, is publicly available as described in the Data Availability statement.
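To make the loss function concrete, the sketch below computes label-smoothed categorical cross-entropy for a single prediction with the smoothing value of 0.01 reported above. The smoothing rule used here, which scales the one-hot target by (1 − ε) and adds ε/K to every class, follows the convention used by Keras; the function names and the example probabilities are hypothetical.

```python
import math


def smooth_labels(true_class, num_classes, smoothing=0.01):
    """One-hot target scaled by (1 - smoothing) plus smoothing / K on every class."""
    off_value = smoothing / num_classes
    return [
        (1.0 - smoothing) + off_value if k == true_class else off_value
        for k in range(num_classes)
    ]


def cross_entropy(targets, probs, eps=1e-12):
    """Categorical cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(targets, probs))


# Hypothetical softmax output for a 3-class problem; the true class is 0.
probs = [0.7, 0.2, 0.1]
targets = smooth_labels(true_class=0, num_classes=3)
loss = cross_entropy(targets, probs)
plain_loss = cross_entropy(smooth_labels(0, 3, smoothing=0.0), probs)
```

With such a small smoothing value, the loss stays close to the unsmoothed value while still penalizing fully saturated predictions.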

2.6. Model Evaluation

The effectiveness of the proposed model was assessed using four standard classification metrics: accuracy, precision, recall, and F1-score. Accuracy indicates the ratio of correctly classified samples to the total number of inputs, whereas precision evaluates the reliability of the positive predictions. Recall assesses the model’s capability to accurately identify pertinent samples, and the F1-score delivers a holistic evaluation by merging the precision and recall into one metric. Together, these metrics provide a thorough assessment of the classification performance across datasets characterized by different complexities and class distributions.
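The four metrics can be illustrated with a small sketch that derives them from true and predicted labels. Macro-averaging over classes is an assumption, since the averaging scheme is not stated here, and the toy labels are invented for illustration; the paper itself reports using Scikit-Learn for evaluation.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, per-class (precision, recall, F1), and macro averages."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = (precision, recall, f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Macro average: unweighted mean of each metric over the classes.
    macro = tuple(
        sum(values[i] for values in per_class.values()) / len(classes)
        for i in range(3)
    )
    return accuracy, per_class, macro


# Toy example with three gesture classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, per_class, macro = classification_metrics(y_true, y_pred)
```

In this toy case, class 1 has perfect recall but imperfect precision, which illustrates why accuracy alone can hide class-level differences.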

Besides quantitative assessment, a qualitative interpretability analysis was also performed to investigate the spatial emphasis of the acquired representations. In particular, Grad-CAM was utilized as a post hoc visualization method to emphasize class-discriminative areas and evaluate the impact of attention pooling on feature localization [39].

3. Results

This section evaluates the proposed methodology from three perspectives: the feature robustness, classification performance across datasets, and model interpretability.

3.1. Feature Stability Analysis

Before evaluating the classification performance, a gradient-based analysis was performed to assess the structural stability of the extracted features under typical spatial transformations. The GBAV descriptor was used to examine the variations in the gradient orientation distributions caused by rotation, translation, and scaling perturbations. To measure the feature divergence, Pearson’s correlation and the Kullback–Leibler (KL) divergence were calculated between the original and transformed representations. This approach enabled a classifier-independent evaluation of the feature invariance. The GBAV responses to rotational, scaling, and translational transformations are depicted in Figure 9, while the associated quantitative correlation and divergence metrics are presented in Table 3.

Rotational perturbations demonstrated the greatest sensitivity to transformation magnitude. Small rotations up to 5° preserved structural consistency, achieving high correlation (r = 0.8596) and low divergence (D_{KL} = 0.0135), satisfying the GBAV acceptance criteria. However, increasing the rotation beyond this threshold resulted in a rapid degradation of structural alignment. At 18°, the correlation dropped significantly (r = 0.4735), and at 36°, it became negative (r = −0.0937), accompanied by increased divergence (D_{KL} = 0.0998), indicating a breakdown in gradient correspondence. These results confirm that rotational transformations are highly sensitive and must be tightly constrained to preserve semantic consistency.

Conversely, moderate scaling transformations exhibited stable behavior at low magnitudes. A zoom factor of 5% maintained strong structural consistency (r = 0.9007, D_{KL} = 0.0089), while larger scaling levels led to a progressive degradation. At 10%, the correlation dropped below the acceptance threshold (r = 0.8313), and extreme zooming (90%) resulted in severe structural distortion (r = 0.3082, D_{KL} = 0.1504), likely due to contextual loss and feature truncation.

Translational perturbations showed a more gradual degradation pattern. Small displacements of 5% and 10% preserved high structural similarity (r = 0.9503 and 0.8875, respectively), with low divergence values. However, larger translations beyond 10% led to a decline in correlation (r = 0.8200 at 15%) and increased divergence, indicating partial occlusion of discriminative regions. Based on these findings, conservative augmentation parameters were selected for training: rotation = 5°, zoom = 5%, and translation = 10%. These values correspond to the maximum magnitudes that satisfy the GBAV acceptance criteria (r ≥ 0.85 and D_{KL} ≤ 0.05), ensuring semantic validity while maintaining controlled variability.
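The selection rule can be checked against the figures quoted in this subsection. The sketch below uses only the (r, D_{KL}) pairs stated in the text for rotation and zoom; where no divergence value is quoted, `None` is stored rather than a guessed number, and those entries already fail on correlation alone. The function names and the dictionary layout are hypothetical.

```python
R_MIN = 0.85    # acceptance threshold on Pearson correlation
KL_MAX = 0.05   # acceptance threshold on KL divergence


def is_safe(r, kl):
    """Both criteria must hold; a missing KL value is never treated as passing."""
    if r < R_MIN:
        return False
    return kl is not None and kl <= KL_MAX


def largest_safe_magnitude(sweep):
    """Return the largest magnitude whose (r, kl) pair satisfies both criteria."""
    safe = [magnitude for magnitude, (r, kl) in sweep.items() if is_safe(r, kl)]
    return max(safe) if safe else None


# Values quoted in the text: magnitude -> (r, D_KL); None = not quoted.
rotation_sweep = {5: (0.8596, 0.0135), 18: (0.4735, None), 36: (-0.0937, 0.0998)}
zoom_sweep = {0.05: (0.9007, 0.0089), 0.10: (0.8313, None), 0.90: (0.3082, 0.1504)}

safe_rotation = largest_safe_magnitude(rotation_sweep)
safe_zoom = largest_safe_magnitude(zoom_sweep)
```

For the quoted values, this reproduces the reported limits of 5° for rotation and 5% for zoom; the translation sweep is omitted because its divergence values are not quoted in the text.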

3.2. Classification Performance

Four model configurations were assessed:

Unified ResNet50–InceptionV3 architecture utilizing attention-based pooling.
The same combined backbone without attention.
ResNet50 standalone model.
InceptionV3 standalone model.

Given that all the datasets comprised an equal number of classes, accuracy was selected as the primary comparative metric, while the precision, recall, and F1-score metrics were used to provide additional insights.

(1): Combined Model with Attention-Based Pooling

The evaluation outcomes of the attention-augmented combined model are presented in Table 4.

The model demonstrated a robust and consistent performance across all the datasets, achieving validation accuracies of 96.87% (BISINDO), 99.92% (ASL), and 95.25% (HG14). Importantly, the HG14 dataset, marked by greater gesture variability and intricate backgrounds, significantly benefited from attention-based pooling, which enhanced the spatial focus and aggregation of the discriminative features.

(2): Combined Model without Attention Pooling

The performance metrics of the combined backbone without attention are presented in Table 5.

Although the model retained a high accuracy on BISINDO and ASL, a noticeable decline was evident on HG14 when compared to the attention-enhanced version. This performance disparity underscores the importance of attention pooling in alleviating feature ambiguity in visually complex gesture datasets.
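To make the comparison concrete, the sketch below contrasts a generic softmax-weighted spatial pooling with plain global average pooling. It is a textbook-style illustration and not the layer used in this study: the scoring vector `w` stands in for learned parameters, and the function name is hypothetical.

```python
import numpy as np

def attention_pool(features, w):
    """Softmax-weighted spatial pooling of an (H, W, C) feature map."""
    h, width, c = features.shape
    flat = features.reshape(h * width, c)          # one row per spatial location
    scores = flat @ w                              # relevance score per location
    scores = scores - scores.max()                 # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    return alpha @ flat                            # (C,) weighted average of locations
```

With `w` set to zeros every location receives the same weight and the result reduces to global average pooling, so the two variants differ only in how strongly individual locations are emphasised.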

(3): Single-Model Backbones

Table 6 and Table 7 present the findings for the ResNet50 and InceptionV3 models, respectively. Both architectures performed well on the ASL and BISINDO datasets.

However, their performance deteriorated markedly on HG14, where validation accuracy fell to 90.79% for ResNet50 and 82.21% for InceptionV3. ResNet50 showed limited resilience to background clutter and gesture variability, while InceptionV3 produced the weakest results, reaching 80.00% training accuracy and 82.21% validation accuracy (Table 7). These findings suggest that single-backbone architectures face significant challenges in capturing the varied spatial cues inherent in more demanding gesture datasets.

3.3. Interpretability Analysis

To explore the internal decision-making behavior of the various models, a qualitative interpretability analysis was conducted utilizing Grad-CAM. Although quantitative metrics provide an overall evaluation of the classification performance, Grad-CAM facilitates a more detailed examination of the specific spatial areas that significantly impact the model predictions. Such analysis is particularly valuable in hand gesture recognition tasks characterized by visual ambiguity and subtle finger configurations [5].
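As a reminder of what a Grad-CAM map contains, the core computation is sketched below in NumPy. It assumes the activations of a convolutional layer and the gradients of the class score with respect to them have already been extracted from the network; the function name is hypothetical, and the final rescaling to [0, 1] is a common display convention rather than part of the definition.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from (H, W, K) activations and matching gradients."""
    # Channel weights: gradients averaged over the spatial dimensions.
    weights = gradients.mean(axis=(0, 1))
    # Weighted sum of channels followed by ReLU keeps positive evidence only.
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam   # rescale to [0, 1] for display
```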

This investigation is of particular significance for the HG14 dataset, which yielded a relatively lower performance compared to BISINDO and ASL due to greater intra-class variability and subtle visual distinctions among the hand gestures. Based on the class-wise performance evaluation results, Gestures 3 and 10 were chosen as representative examples. These gestures involve visually similar hand configurations with overlapping finger placements, a challenge frequently reported in CNN-based ensemble recognition studies [10]. Such similarity reduces the class separability despite high overall accuracy, making these classes suitable for analyzing how architectural design influences spatial feature discrimination.

As shown in Figure 10, the Grad-CAM responses generated without the use of attention-based pooling were notably diffuse, with the activations spread across wider areas of the hand and, in certain instances, encroaching into the background.

A similar pattern has been noted in baseline convolutional models documented in existing research, where the lack of explicit spatial weighting leads to a partial dependence on non-discriminative areas such as the base of the palm or the wrist [11]. This phenomenon was particularly pronounced for Gesture 10, where the inaccurate localization of the primary finger structure added to the classification uncertainty.

Conversely, the attention-enhanced model demonstrated a more focused and semantically relevant activation pattern, consistent with observations from recent attention-based fusion methodologies. The highlighted regions corresponded to essential finger articulations and joint configurations that characterized each gesture, signifying an improved spatial selectivity [39]. For Gesture 3, attention was focused on the relative arrangement of the extended fingers, whereas for Gesture 10, it was concentrated on the orientation of the dominant finger. This refined spatial localization suggests that attention-based pooling suppresses irrelevant background elements and amplifies discriminative areas, which may improve the model’s robustness for visually intricate gesture categories. Qualitatively, the attention-enhanced architecture exhibited reduced background activation dispersion compared to the baseline configuration, indicating a more accurate localization of distinguishing finger structures.

3.4. Cross-Dataset Generalization

To evaluate external validity, we trained the headline model on BISINDO and evaluated it on ASL without retraining, and vice versa, since both datasets share the alphabetic label space A–Z. Both directions of transfer collapsed to near-chance accuracy (BISINDO → ASL: 6.99%, ASL → BISINDO: 6.72%, versus a 3.85% random baseline over 26 classes). This reflects the linguistic independence of the two sign systems rather than a model failure. We classified each letter pair as identical (I, L, V), similar (C, O), or different (the remaining 21 letters). Per-class transfer precision correlated strongly with handshape similarity (Pearson r = 0.71 for BISINDO → ASL, r = 0.80 for ASL → BISINDO). Letters with identical handshapes achieved 30–55% mean transfer precision (peak: I = 95.4% precision, ASL → BISINDO), letters with similar handshapes achieved 19–32%, and letters with different handshapes averaged below 2%. This monotonic correspondence provides direct evidence that the model captures genuine gesture features rather than dataset-specific artifacts. A summary is shown in Table 8. Details of each class distribution are depicted in Figure 11.
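The random baseline and the correlation statistic used in this analysis can be reproduced with a few lines of Python. The per-letter values below are hypothetical placeholders chosen only to show the calculation (they are not the per-class results of this study), and the numeric coding of handshape similarity (2 = identical, 1 = similar, 0 = different) is likewise an assumption.

```python
import math

NUM_CLASSES = 26                          # shared A-Z label space
chance = 1 / NUM_CLASSES
print(f"chance accuracy: {chance:.2%}")   # chance accuracy: 3.85%

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical illustration only: similarity code and transfer precision per letter.
similarity = [2, 2, 1, 0, 0]
precision = [0.50, 0.40, 0.25, 0.02, 0.01]
r = pearson(similarity, precision)
```

Relative to the 3.85% baseline, the reported transfer accuracies of 6.99% and 6.72% are a little under twice the chance level.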

3.5. Quantitative Attention Localization

To move beyond purely visual Grad-CAM inspection, we extracted pseudo-ground-truth hand masks via MediaPipe HandLandmarker (convex-hull dilation of detected landmarks) for 200 randomly sampled HG14 validation images, of which 181 yielded a detected hand. On the fused Grad-CAM heatmap (mean of ResNet50 conv5_block3_out and InceptionV3 mixed10 activations), we computed three localization metrics: pointing-game accuracy (peak activation inside the mask), energy-in-hand fraction (proportion of above-threshold activation energy inside the mask), and IoU (intersection-over-union between thresholded heatmap and mask). The results for both architectures are shown in Table 9. Both models attend strongly to hand regions in absolute terms (energy-in-hand > 70%, pointing-game accuracy of roughly 90%), and the energy-in-hand and IoU differences between the two variants are well below one standard deviation. This indicates that the combined backbone intrinsically learns hand-focused features regardless of whether attention pooling is used.
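The three metrics can be expressed compactly for a single image, as in the following sketch. It assumes a heatmap scaled to [0, 1] and a binary hand mask of the same shape; the threshold of 0.5 and the function name are assumptions, since the threshold used in the experiments is not stated here.

```python
import numpy as np

def localization_metrics(heatmap, hand_mask, threshold=0.5):
    """Pointing-game hit, energy-in-hand fraction and IoU for one image."""
    mask = hand_mask.astype(bool)
    # Pointing game: does the strongest activation fall inside the hand mask?
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    hit = bool(mask[peak])
    # Energy-in-hand: share of above-threshold activation mass inside the mask.
    active = heatmap >= threshold
    energy = heatmap * active
    total = energy.sum()
    energy_in_hand = float(energy[mask].sum() / total) if total > 0 else 0.0
    # IoU between the binarised heatmap and the mask.
    union = np.logical_or(active, mask).sum()
    iou = float(np.logical_and(active, mask).sum() / union) if union > 0 else 0.0
    return hit, energy_in_hand, iou
```

Averaging the hit indicator over all images gives the pointing-game accuracy, while the other two quantities are averaged directly to obtain the mean and standard deviation.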

3.6. 5-Fold Cross-Validation

Stratified 5-fold cross-validation was conducted on HG14, the dataset with the most realistic intra-class variability, to assess result stability. Across five folds, mean accuracy was 93.51% ± 2.31% and macro-F1 was 93.51% ± 2.33%. The single-split HG14 accuracy of 95.25% reported in the main results falls within one standard deviation of the cross-validated mean (93.51% + 2.31% = 95.82%), confirming that the main result is not an artifact of a favorable partition. Cross-validation was prioritized for HG14 because ASL exhibits near-saturated performance across all models, making it insensitive to partitioning, and because the BISINDO single-split performance (96.87%) lies within the HG14 variance range (±2.31%), suggesting that additional cross-validation would not materially affect the conclusions. The aggregated results are shown in Table 10.
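The procedure can be sketched in plain Python: samples of each class are dealt out evenly across the folds, and the single-split result is then compared with the cross-validated mean and standard deviation. The helper names are hypothetical, and the sketch omits the per-class shuffling that a real run would normally apply before assigning folds.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # deal each class out round-robin
    return folds

def within_one_std(value, mean, std):
    """True if value lies no further than one standard deviation from the mean."""
    return abs(value - mean) <= std

print(within_one_std(95.25, 93.51, 2.31))  # True
```

Each fold serves once as the validation set while the remaining four are used for training.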

3.7. Computational Efficiency Analysis

To characterize the computational cost of the dual-backbone architecture, we benchmarked all four variants on a Tesla P100 GPU. FLOPs were computed via the TensorFlow graph profiler; latency was measured over 100 trials of single-image forward passes (299 × 299 resolution, 20 warm-up iterations). The results are shown in Table 11. The combined model without attention has 46.6 M parameters, 26.0 GFLOPs, and 26.8 ± 1.5 ms latency per image (~37 FPS on the P100), suitable for interactive applications. The optional attention layer adds ~0.06 M parameters and about 0.4 ms of latency (27.2 versus 26.8 ms, roughly 1.5%, which is within the measurement spread); given that the ablation shows it does not improve accuracy, the simpler no-attention variant is recommended for deployment.
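A latency measurement of this kind can be organised as in the sketch below, which runs untimed warm-up passes and then times individual calls. Here `forward` is a hypothetical zero-argument callable wrapping a single-image forward pass; when timing on a GPU it should also wait for the result (for example by copying the output back to the host), otherwise asynchronous execution can make the timings too optimistic.

```python
import statistics
import time

def benchmark(forward, warmup=20, trials=100):
    """Time a single-image forward pass; returns (mean_ms, std_ms)."""
    for _ in range(warmup):          # warm-up runs are executed but not timed
        forward()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        forward()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)

def fps(latency_ms):
    """Convert per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

print(round(fps(26.8), 1))  # 37.3
```

The same conversion gives about 36.8 FPS for the 27.2 ms attention variant.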

4. Discussion

The experimental findings indicate that integrating diverse CNN backbones improves classification performance over single-model configurations on the more demanding datasets. The hybrid ResNet50–InceptionV3 architecture outperformed its standalone counterparts on BISINDO and HG14, whereas on ASL all four configurations were near-saturated (99.75–100%) and the standalone models scored marginally higher. This pattern suggests that the fusion of residual learning and multi-scale convolution improves representational capacity when the task is not already saturated, and it aligns with prior studies demonstrating performance gains from combining complementary CNN architectures [37,40,41].

Specifically, on the BISINDO dataset, the combined model achieved validation accuracies of 96.87% with attention and 97.00% without attention, indicating that the inclusion of attention does not provide a statistically meaningful improvement when hand structures are already distinct and background complexity is moderate. A similar trend is observed on the ASL dataset, where all configurations achieved near-saturated performance (99.75–100%), suggesting that the uniform background and limited gesture variability reduce the potential benefit of attention mechanisms. In contrast, the HG14 dataset exhibits a modest improvement when attention is included, with the combined model achieving 95.25% compared to 94.18% without attention. This suggests that attention-based pooling enhances spatial discrimination in visually complex scenarios characterized by background clutter and subtle finger variations, although the improvement remains limited.

Importantly, the observed differences between attention and non-attention configurations fall within ±2.31%, corresponding to the fold-to-fold variance identified through cross-validation on HG14. This indicates that the performance gains attributed to attention are not statistically distinguishable across the evaluated datasets. Instead, the primary improvement over baseline configurations arises from the use of GBAV-calibrated augmentation, which ensures that training data remain structurally consistent and semantically valid.
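The comparison described here can be reproduced directly from the reported numbers. The short check below uses the validation accuracies from Tables 4 and 5 and the HG14 fold-to-fold standard deviation from Table 10; the variable names are hypothetical, and comparing a difference with a standard deviation is an informal heuristic rather than a formal significance test.

```python
# Validation accuracy (%) from Tables 4 and 5: (with attention, without attention).
VAL_ACC = {"BISINDO": (96.87, 97.00), "ASL": (99.92, 99.75), "HG14": (95.25, 94.18)}
FOLD_STD = 2.31  # HG14 fold-to-fold standard deviation from Table 10

deltas = {name: with_att - without_att for name, (with_att, without_att) in VAL_ACC.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points, within fold std: {abs(delta) <= FOLD_STD}")
```

The largest gap is the 1.07-point difference on HG14, which is less than half of the 2.31-point fold-to-fold spread.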

The effectiveness of attention mechanisms is therefore context-dependent. In datasets with uniform backgrounds and well-aligned hand regions, such as ASL, attention provides minimal benefit. In more complex scenarios such as HG14, attention contributes to improved spatial focus, but the magnitude of improvement remains modest. This suggests that further gains may be achieved through complementary preprocessing strategies, such as segmentation or background suppression, rather than relying solely on architectural modifications.

The cross-dataset transfer experiment provides additional insight into the nature of the learned representations. Despite sharing identical alphabetic labels, BISINDO and ASL represent linguistically distinct sign systems. The model achieved only approximately 7% transfer accuracy between datasets, while maintaining within-distribution accuracies above 96%. This indicates that the model learns dataset-specific handshape representations rather than domain-general gesture features. Importantly, this finding strengthens the methodological interpretation rather than weakening it, as it confirms that high within-dataset accuracy reflects consistent learning of the underlying sign system rather than spurious dataset bias.

Furthermore, InceptionV3 exhibited a more pronounced performance decline on HG14 compared to the hybrid architecture, suggesting that multi-scale feature extraction alone is insufficient to handle high intra-class variability without complementary residual representations. The combined architecture mitigates this limitation by integrating both global and hierarchical feature learning.

Although the proposed hybrid model has a higher parameter count than the single-backbone configurations, the computational overhead remains manageable. As shown in the efficiency analysis, the model achieves real-time performance (~37 FPS), making it suitable for practical deployment in moderate-scale applications. When evaluated in conjunction with the comparative findings shown in Table 12, the proposed architecture exhibits competitive performance in relation to well-established CNN-based methods such as DeepASLR [5], BISINDO-oriented CNN models [42], and MobileNet-based HG14 assessments [43].

When compared with existing approaches, the proposed method demonstrates competitive performance across multiple datasets with varying characteristics. While some studies report higher accuracy under specific dataset splits or controlled conditions, the proposed approach maintains consistently strong performance across BISINDO, ASL, and HG14. This cross-dataset stability highlights the advantage of combining complementary backbone architectures with GBAV-calibrated augmentation, rather than optimizing exclusively for a single dataset.

5. Conclusions

This study investigated hand gesture recognition across datasets with varying visual complexity by combining a Gradient-Based Augmentation Validation (GBAV) framework with a hybrid CNN architecture integrating ResNet50 and InceptionV3. The results demonstrate two primary contributions. First, GBAV provides a principled pre-training calibration mechanism that constrains augmentation magnitudes based on structural consistency, leading to improved model reliability and performance across all evaluated datasets. Second, the proposed multi-backbone CNN architecture leverages complementary feature representations from residual and multi-scale convolutional networks, resulting in consistently strong classification performance compared to single-model baselines.

The experimental findings further indicate that the impact of attention-based pooling is limited and dataset-dependent. While attention provides modest improvements in visually complex scenarios such as HG14, its effect remains within the cross-validation fold-to-fold variance (±2.31%), indicating that the gains are not statistically distinguishable across the evaluated benchmarks. Instead, the primary performance improvements can be attributed to GBAV-calibrated augmentation, which ensures semantic consistency during training.

Additional analyses reinforce these conclusions. The cross-dataset transfer experiment revealed that BISINDO and ASL, despite sharing alphabetic labels, represent linguistically distinct sign systems, with transfer accuracy remaining near chance (~7%) while within-dataset performance exceeds 96%. This finding confirms that the model captures dataset-specific handshape structures rather than domain-general gesture representations. Furthermore, efficiency evaluation shows that the proposed model achieves real-time performance (~37 FPS), demonstrating its practicality for deployment in real-world applications.

Future work will focus on enhancing robustness in visually complex environments, particularly for datasets such as HG14. Promising directions include integrating explicit hand segmentation or background suppression techniques to stabilize spatial representations and reduce interference from non-discriminative regions. Additionally, optimizing hybrid architectures through backbone-specific learning rates or adaptive fusion strategies may further improve training stability and feature balance. Extensions to alternative architectures, including EfficientNet and Vision Transformers, as well as applications to dynamic gesture recognition and multimodal interaction systems, represent valuable avenues for further research.

Author Contributions

Conceptualized the research, Y.-J.C.; methodology and software used, A.A.; investigation and validation, Y.-J.C. and Q.-B.H.; formal analysis, A.A.; resources, A.A.; data curation, A.A.; writing—preparing the original draft.; writing—reviewing and editing the manuscript, A.A., Y.-J.C. and Q.-B.H.; supervised the analysis, reviewed the manuscript, Y.-J.C. and Q.-B.H.; visualization and project administration, A.A.; funding acquisition, Y.-J.C. and Q.-B.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Higher Education Sprout Project of the Ministry of Education, Taiwan, and the Ministry of Science and Technology, grant numbers NSTC 114-2221-E-218-018 and NSTC 113-2222-E-218-003-MY2.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chang, V.; Eniola, R.O.; Golightly, L.; Xu, Q.A. An Exploration into Human–Computer Interaction: Hand Gesture Recognition Management in a Challenging Environment. SN Comput. Sci. 2023, 4, 441.
2. Nahar, K.M.O.; Alsmadi, I.; Al Mamlook, R.E.; Nasayreh, A.; Gharaibeh, H.; Almuflih, A.S.; Alasim, F. Recognition of Arabic Air-Written Letters: Machine Learning, Convolutional Neural Networks, and Optical Character Recognition (OCR) Techniques. Sensors 2023, 23, 9475.
3. Joo, J.; Koh, J.; Lee, H. Hand Gesture Recognition Using Ultrasonic Array with Machine Learning. Sensors 2024, 24, 6763.
4. Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network. Sensors 2022, 22, 706.
5. Al-Qurishi, M.; Souissi, R. DeepASLR: A CNN based human computer interface for American Sign Language recognition for hearing-impaired individuals. Comput. Methods Programs Biomed. Update 2022, 2, 100048.
6. Azhar, N.Z.B.; Teo, N.H.I.; Hamzah, R.; Roslan, R.; Maskat, R. A Hybrid ResNet-MobileNet Deep Learning Model for Smart Bin Waste Classification. In Proceedings of the 2024 5th International Conference on Artificial Intelligence and Data Sciences (AiDAS), Bangkok, Thailand, 3–4 September 2024; pp. 356–362.
7. Khattar, S.; Kumar, V. ResMobileNet: A Deep Ensemble Approach for Classification of Breast Cancer Using Transfer Learning. In Proceedings of the 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT 2025), Bengaluru, India, 5–7 February 2025.
8. Zhang, X.; Huang, S.; Zhang, X.; Wang, W.; Wang, Q.; Yang, D. Residual Inception: A New Module Combining Modified Residual with Inception to Improve Network Performance. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP 2018), Athens, Greece, 7–10 October 2018; pp. 886–893.
9. Ewe, E.L.R.; Lee, C.P.; Kwek, L.C.; Lim, K.M. Hand Gesture Recognition via Lightweight VGG16 and Ensemble Classifier. Appl. Sci. 2022, 12, 7643.
10. Sen, A.; Mishra, T.K.; Dash, R. A novel hand gesture detection and recognition system based on ensemble-based convolutional neural network. Multimed. Tools Appl. 2022, 81, 40043–40066.
11. Ahmed, I.T.; Gwad, W.H.; Hammad, B.T.; Alkayal, E. Enhancing Hand Gesture Image Recognition by Integrating Various Feature Groups. Technologies 2025, 13, 164.
12. Assiri, M.; Selim, M.M. Gesture recognition for hearing impaired people using an ensemble of deep learning models with improving beluga whale optimization-based hyperparameter tuning. Sci. Rep. 2025, 15, 21441.
13. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Adv. Neural Inf. Process. Syst. 2017, 30, 6405–6416.
14. Alex, S.A.; Jhanjhi, N.Z.; Khan, N.A.; Husin, H.S. G-DCNN: GAN based Deep 2D-CNN for COVID-19 Classification. In Proceedings of the 2022 International Visualization and Informatics Technology Conference (IVIT), Kuala Lumpur, Malaysia, 1–2 November 2022.
15. Reddy, V.S.N.; Harsha, D.S.S.; Krishna, M.G.; Arhith, P.S. Interactive Projection Technology using Hand Gesture Recognition with Attention Mechanism and ResNet. In Proceedings of the 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 4–6 May 2023; pp. 650–654.
16. Wu, J.; Ren, P.; Song, B.; Zhang, R.; Zhao, C.; Zhang, X. Data glove-based gesture recognition using CNN-BiLSTM model with attention mechanism. PLoS ONE 2023, 18, e0294174.
17. Kumar, S.; Rani, R.; Chaudhari, U. Real-time sign language detection: Empowering the disabled community. MethodsX 2024, 13, 102901.
18. Tan, K.; Lim, K.M.; Chang, R.K.Y.; Lee, C.P.; Alqahtani, A. HGR-ViT: Hand Gesture Recognition with Vision Transformer. Sensors 2023, 23, 5555.
19. Nurrahma, N.; Yusuf, R.; Prihatmanto, A.S. Indonesian Sign Language Fingerspelling Recognition using Vision-based Machine Learning. In Proceedings of the 2021 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA), Bandung, Indonesia, 1–2 December 2021.
20. Garg, B.; Kasar, M.; Paygude, P.; Dhumane, A.; Ambala, S.; Rajpurohit, J.; Sharma, A.; Meshram, V.; Vats, A.; Kashyap, A. Sign language detection dataset: A resource for AI-based recognition systems. Data Brief 2025, 61, 111703.
21. Güler, O.; Yücedağ, İ. Hand Gesture Recognition from 2D Images by Using Convolutional Capsule Neural Networks. Arab. J. Sci. Eng. 2022, 47, 1211–1225.
22. Talebi, H.; Milanfar, P. Learning to Resize Images for Computer Vision Tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 497–506.
23. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Las Vegas, NV, USA, 2016; pp. 2818–2826.
24. Pei, X.; Zhao, Y.H.; Chen, L.; Guo, Q.; Duan, Z.; Pan, Y.; Hou, H. Robustness of machine learning to color, size change, normalization, and image enhancement on micrograph datasets with large sample differences. Mater. Des. 2023, 232, 112086.
25. Joseph, V.R. Optimal ratio for data splitting. Stat. Anal. Data Min. ASA Data Sci. J. 2022, 15, 531–538.
26. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
27. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
28. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
29. Wang, X.; Han, T.X.; Yan, S. An HOG-LBP human detector with partial occlusion handling. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 32–39.
30. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
31. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Policies from Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 113–123.
32. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020.
33. Hataya, R.; Zdenek, J.; Yoshizoe, K.; Nakayama, H. Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
34. Müller, S.G.; Hutter, F. TrivialAugment: Tuning-Free Yet State-of-the-Art Data Augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021.
35. Wang, Z.; Jiang, H.; Zhang, Y.; Liu, Z.; Chen, X.; Gu, Q. AugMax: Adversarial Composition of Random Augmentations for Robust Training. Adv. Neural Inf. Process. Syst. 2021, 34, 237–250.
36. Ismail, M.H.; Dawwd, S.A.; Ali, F.H. Static hand gesture recognition of Arabic sign language by using deep CNNs. Indones. J. Electr. Eng. Comput. Sci. 2021, 24, 178–188.
37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762v7.
38. Er, M.J.; Zhang, Y.; Wang, N.; Pratama, M. Attention pooling-based convolutional neural network for sentence modelling. Inf. Sci. 2016, 373, 388–403.
39. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
40. Alkhaled, L.; Roy, A.; Palaiahnakote, S. An Attention-Based Fusion of ResNet50 and InceptionV3 Model for Water Meter Digit Recognition. Artif. Intell. Appl. 2025, 20, 1–11.
41. Hossain, M.M.; Hossain, M.M.; Arefin, M.B.; Akhtar, F.; Blake, J. Combining State-of-the-Art Pre-Trained Deep Learning Models: A Noble Approach for Skin Cancer Detection Using Max Voting Ensemble. Diagnostics 2024, 14, 89.
42. Kelana, E.L.; Prasetya, M.R.A.; Zulfadhilah, M. Integrating the CNN Model with the Web for Indonesian Sign Language (BISINDO) Recognition. J. Appl. Inform. Comput. 2025, 9, 883–896.
43. Savaş, S.; Ergüzen, A. Hand Gesture Recognition with Two Stage Approach Using Transfer Learning and Deep Ensemble Learning. arXiv 2023, arXiv:2309.11610.

Figure 1. Overview of the proposed research workflow for hand gesture recognition, including dataset preparation, preprocessing, data augmentation, model development, and performance evaluation.

Figure 2. Sample images from the Indonesian Sign Language (BISINDO) dataset showing hand gestures captured under diverse background conditions.

Figure 3. Example images from the American Sign Language (ASL) dataset representing alphabet gestures captured against a black background.

Figure 4. Sample hand gesture images from the HG14 dataset used for gesture-based interaction and control in augmented reality applications.

Figure 5. Architecture of the proposed hybrid CNN model combining ResNet50 and InceptionV3 with attention-based pooling for hand gesture classification.

Figure 6. General architecture of a convolutional neural network (CNN) illustrating convolution, pooling, and fully connected classification layers.

Figure 7. Architecture of the ResNet50 network showing residual blocks with identity shortcut connections for deep feature learning.

Figure 8. InceptionV3 architecture illustrating multi-branch convolutional modules used for multi-scale feature extraction.

Figure 9. GBAV calibration curves showing structural consistency across augmentation magnitudes. Pearson correlation (blue) and KL divergence (red) are plotted for rotation, zoom, and translation transformations. Dashed lines indicate the acceptance thresholds (r ≥ 0.85, KL ≤ 0.05). Safe augmentation magnitudes correspond to the region where correlation remains high and divergence remains low, beyond which structural degradation is observed.

Figure 10. Grad-CAM visualization comparing baseline CNN responses and attention-enhanced activations. The attention-based pooling mechanism produces more focused activation regions corresponding to discriminative finger structures.

Figure 11. Per-class cross-dataset transfer accuracy between BISINDO and ASL, grouped by handshape similarity. Letters with identical handshapes (green: I, L, V) exhibit substantially higher transfer precision, while similar handshapes (yellow: C, O) show moderate transfer. In contrast, most letters with different handshapes (red) result in near-zero precision. Blue bars indicate BISINDO → ASL precision, and orange bars indicate ASL → BISINDO precision. The results demonstrate that transfer performance is primarily governed by handshape similarity rather than dataset origin.

Table 1. Hardware and software specifications used in the experiments.

Hardware/Software	Specification
Processor (CPU)	Intel(R) Xeon(R) CPU @ 2.20 GHz (2 vCPUs)
Memory (RAM)	13 GB (System)/16 GB (GPU VRAM)
Graphical Processing Unit (GPU)	NVIDIA Tesla P100-PCIE-16GB
Operating System	Linux (Ubuntu 20.04.6 LTS)
Python Version	3.11.5
Frameworks	TensorFlow 2.15.0, Keras 3.0

Table 2. Hyperparameters of the CNN used for training the Combined Models.

Hyperparameter	Value
Batch size	64
Optimizer	Adam
Learning rate	5 × 10⁻⁵
L2 Regularization	1 × 10⁻⁴
Number of epochs	30
Activation function	ReLU, Softmax

Table 3. GBAV analysis under spatial augmentations.

Augmentation	Magnitude	Correlation (r)	KL Divergence
Rotation	5°	0.8956	0.0134
Rotation	18°	0.4731	0.0449
Rotation	36°	−0.0936	0.0998
Zoom	5%	0.9006	0.0089
Zoom	10%	0.8312	0.0143
Zoom	90%	0.3082	0.1504
Translation	5%	0.9502	0.0057
Translation	10%	0.8875	0.0115
Translation	20%	0.7445	0.0248

Table 4. The evaluation results of the Combined Pretrained Models (ResNet50 + InceptionV3) using attention-based pooling.

Dataset	Training Accuracy	Validation Accuracy	Precision	Recall	F1-Score
BISINDO	98.23%	96.87%	98.84%	98.81%	98.83%
ASL	99.30%	99.92%	99.98%	99.98%	99.98%
HG14	95.57%	95.25%	95.38%	95.25%	95.27%

Table 5. The evaluation results of the Combined Pretrained Models (ResNet50 + InceptionV3) without attention-based pooling.

Dataset	Training Accuracy	Validation Accuracy	Precision	Recall	F1-Score
BISINDO	98.61%	97.00%	97.10%	97.00%	97.00%
ASL	99.12%	99.75%	99.76%	99.75%	99.75%
HG14	93.52%	94.18%	94.45%	94.18%	94.23%

Table 6. The evaluation results of the single model (ResNet50).

Dataset	Training Accuracy	Validation Accuracy	Precision	Recall	F1-Score
BISINDO	96.83%	95.05%	95.35%	95.04%	95.05%
ASL	99.29%	99.98%	99.98%	99.98%	99.98%
HG14	93.00%	90.79%	90.89%	90.79%	90.73%

Table 7. The evaluation results of the single model (InceptionV3).

Dataset	Training Accuracy	Validation Accuracy	Precision	Recall	F1-Score
BISINDO	89.40%	88.80%	89.43%	88.81%	88.85%
ASL	100%	100%	100%	100%	100%
HG14	80.00%	82.21%	82.65%	82.21%	82.24%

Table 8. Cross-dataset transfer accuracy summary.

Direction	Transfer Accuracy	Random Baseline	r (Hand-Shape)
BISINDO → ASL	6.99%	3.85%	0.71
ASL → BISINDO	6.72%	3.85%	0.80

Table 9. Quantitative attention localization on HG14 (n = 181 images with detected hands).

Architecture	Pointing-Game	Energy-in-Hand	IoU
Without Attention	0.9061	0.7553 ± 0.1753	0.4849 ± 0.1592
With Attention	0.8950	0.7247 ± 0.1950	0.4911 ± 0.1657

Table 10. 5-fold cross-validation on HG14 (combined model with attention pooling).

Metric	Mean	Std
Accuracy	93.51%	±2.31%
Precision	93.74%	±2.32%
Recall	93.51%	±2.31%
F1-Macro	93.51%	±2.33%
F1-Weighted	93.51%	±2.33%

Table 11. Computational efficiency of all four architectural variants (Tesla P100, 299 × 299 input).

Architecture	Params (M)	Latency ms (Mean ± Std)	GFLOPs
Combined With Attention	46.7	27.2 ± 1.6	26.0
Combined Without Attention	46.6	26.8 ± 1.5	26.0
ResNet50	24.1	10.7 ± 0.8	14.5
InceptionV3	22.3	14.9 ± 1.1	11.5

Table 12. Comparison of representative hand gesture recognition methods.

Category	Method	Backbone	Dataset	Evaluation	Accuracy	Reference
External Works	DeepASLR	Custom CNN	ASL	Test	99.38%	[5]
External Works	Kelana et al.	Custom CNN	BISINDO	Validation	97.44%	[42]
External Works	Savaş & Ergüzen	MobileNet	HG14	Test	96.79%	[43]
External Works	Savaş & Ergüzen	ResNet50	HG14	Test	92.14%	[43]
External Works	Savaş & Ergüzen	ResNet152	HG14	Test	90.71%	[43]
External Works	Savaş & Ergüzen	InceptionV3	HG14	Test	76.64%	[43]
Proposed Method	Hybrid + Attention	ResNet50 + InceptionV3	BISINDO	Validation	96.87%	-
Proposed Method	Hybrid + Attention	ResNet50 + InceptionV3	HG14	Validation	95.25%	-
Proposed Method	Hybrid + Attention	ResNet50 + InceptionV3	ASL	Validation	99.92%	-

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
