1. Introduction
Hand gesture recognition (HGR) offers an intuitive interface for human–computer interaction, facilitating efficient control across multiple technological domains. It has thus been implemented in many applications, including gaming, virtual and augmented reality, automotive systems, industrial automation, and healthcare [
1]. HGR also supports communication accessibility by enabling sign language interpretation, facilitating daily interactions, education, and social inclusion for members of the Deaf community. However, linguistic and cultural variations among Deaf communities worldwide lead to differences in manual alphabet systems. Consequently, there is a need for HGR systems capable of learning regional sign languages rather than relying on a presumed universal standard.
Recent advances in machine learning have substantially improved the performance of automated systems for hand movement recognition. Various techniques have been proposed for translating sign language into textual representations, including optical character recognition (OCR) [
2] and ultrasonic array sensing [
3]. In addition, machine learning has been used for educational support, automatic feature extraction, and dynamic HGR [
4]. More recent studies have focused on the use of deep learning architectures, particularly pre-trained Convolutional Neural Networks (CNNs), to achieve higher accuracy and robustness in hand gesture classification.
However, despite these promising developments, CNN-based HGR models remain vulnerable to dataset bias because their feature extractors are typically optimized for a single visual distribution. As a result, illumination shifts, background clutter, or variations in the hand morphology can suppress discriminative cues and compromise the classification robustness [
5,
6]. Furthermore, traditional single-model CNN architectures may fail to capture all the spatial and semantic features required for robust recognition, resulting in impaired performance in real-world scenarios. To address these limitations, recent studies have explored multi-model strategies that combine multiple pre-trained CNN architectures [
7,
8]. By leveraging complementary feature representations across different backbones, such approaches aim to enhance the classification accuracy and overall reliability of automatic HGR systems.
Several studies have implemented ensemble or multi-backbone strategies to enhance the feature diversity in HGR. Typical examples include combining transfer learning-based VGG16 with Random Forest classifiers [
9] and training GoogLeNet, VGGNet, and AlexNet in parallel for real-time CNN-ensemble processing [
10]. In another example, ResNet50 was integrated with the Tamura texture descriptor to improve its robustness to texture and lighting variations [
11]. Some ensemble deep learning pipelines combine SE-CapsNet with BiGRU, VAE, and BiLSTM, with the hyperparameters tuned using Improved Beluga Whale Optimization [
12]. Other authors have employed probabilistic fusion strategies, including deep ensembles for predictive uncertainty estimation [
13], or used GAN-based data generation to improve the generalization performance of CNN classifiers [
14]. Two-stream networks have also been studied. For instance, HGR-Net integrates semantic segmentation with CNN-based classification, achieving a high classification performance with low-latency inference (~23 ms/frame). Collectively, these ensemble and multi-stage approaches enhance the classification precision, resilience, and processing speed, supporting the development of reliable real-time HGR applications.
Attention mechanisms have also been employed to improve the spatial selectivity and feature efficiency of HGR models. By focusing on key attributes within the input data, these mechanisms enhance feature discrimination and support real-time processing. Several implementations have been reported. For example, ResNet architectures have been augmented with attention modules for static gesture recognition [
15], while CNN–BiLSTM frameworks incorporating attention mechanisms have also been explored for gesture recognition tasks [
16]. Similarly, VGG-16 has been enhanced with attention blocks to strengthen the image-based gesture recognition performance [
17]. More recently, Vision Transformer–based models, such as HGR-ViT, have leveraged multi-head self-attention to capture long-range positional dependencies within gesture representations [
18].
Although multi-backbone CNN architectures and attention mechanisms have been widely adopted for HGR, many existing studies implicitly assume that frequently used data augmentation techniques preserve semantic consistency throughout transformations. This premise, referred to as label invariance, is seldom explicitly verified. In real-world applications, however, significant spatial perturbations, such as substantial translations, rotations, or scaling operations, can alter critical finger configurations and disrupt structural coherence. As a result, augmented samples may deviate from the underlying gesture manifold, introducing semantic distortion and potentially misleading the learning process.
Beyond this theoretical inconsistency, there are empirical implications. For instance, the performance gains reported in previous studies may partially stem from training on semantically invalid samples, potentially leading to unstable optimization dynamics and an overstated generalization capability. Furthermore, although attention mechanisms are frequently employed to improve spatial selectivity, relatively few studies have offered feature-level evidence demonstrating that learned attention maps reliably concentrate on significant hand regions rather than background artifacts. Together, these limitations underscore the need for a framework that simultaneously validates the augmentation integrity and ensures model interpretability, while leveraging complementary CNN backbones.
To address this need, this study introduces a Gradient-Based Augmentation Validation (GBAV) framework to explicitly assess the reliability of the training distribution. This framework outlines the structural boundaries of the gesture manifold, thereby reducing semantic noise and facilitating a reliable interpretability analysis.
The main contributions of this study can be summarized as follows:
A GBAV framework is presented as a calibration step applied before training to empirically validate label invariance under spatial transformations. It establishes safe augmentation ranges using KL divergence and an analysis of gradient-structure consistency.
A multi-CNN architecture is proposed that integrates ResNet50 and InceptionV3 to leverage the synergistic advantages of residual learning and inception modules, respectively, to improve feature extraction.
The remainder of this paper is organized as follows.
Section 2 details the Materials and Methods, covering the GBAV calibration procedure, combined CNN architecture, datasets, preprocessing, and training configuration.
Section 3 reports classification results, cross-dataset generalization, quantitative attention localization, 5-fold cross-validation, and computational efficiency analysis.
Section 4 discusses implications and limitations.
Section 5 concludes.
2. Materials and Methods
The research methodology comprised a sequence of structured stages, as shown in
Figure 1.
The workflow began with dataset collection, followed by preprocessing to standardize the inputs and prepare samples for model training. The deep learning model was then designed and optimized. The model performance was evaluated using validation data and established classification metrics. Finally, the experimental results were obtained and analyzed.
2.1. Datasets
This study employed three publicly available hand gesture datasets: Indonesian Sign Language (BISINDO), American Sign Language (ASL), and HG14. These datasets exhibit significant variations in gesture vocabulary, background complexity, image resolution, and color-channel composition, thus offering a diverse benchmark for assessing the robustness and generalization ability of the proposed models. The details of each dataset are provided below.
- (1) Indonesian Sign Language (BISINDO) Dataset
The BISINDO dataset was developed by Ma’aruf [
19] and consists of 26 hand gestures. It contains 11,471 images, each with a resolution of 640 × 640 pixels. The samples are captured against diverse backgrounds, as shown in
Figure 2.
- (2) American Sign Language (ASL) Dataset
The ASL dataset contains 26,000 images representing A to Z [
20]. Each image has a size of 256 × 256 pixels and is captured against a black background to maintain visual consistency and facilitate feature extraction. As shown in
Figure 3, the dataset includes samples from multiple individuals to capture variations in the hand size, shape, and pose, thereby reducing the model bias and improving its robustness.
- (3) Hand Gesture 14 (HG14) Dataset
The HG14 dataset, created by Guler [
21], comprises 14 different hand gestures for hand-based interaction and application control in augmented reality environments (
Figure 4). The dataset contains 14,000 color (RGB) images with a resolution of 256 × 256 pixels.
2.2. Preprocessing Data
In machine learning and deep learning models, data preprocessing plays a crucial role in standardizing the data format, improving the data quality, and ensuring that the input is easy for the model to interpret, thereby increasing the training effectiveness and supporting more accurate predictions. In this study, the preprocessing step involved resizing, rescaling, and partitioning to ensure consistent input dimensions and well-structured samples for model development. The details of the preprocessing step are described below.
- (1) Resizing Dataset
For each dataset, the images were adjusted to a size of 299 × 299 pixels to ensure consistent spatial dimensions for CNN processing. This specific resolution was chosen to align with the architectural requirements of the Inception-V3 backbone [
22], while also being compatible with ResNet50, which accommodates varying input resolutions through global pooling. Moreover, this standard resolution aligns with common ImageNet-pretrained architectures, reduces the computational overhead, and limits scale variation so that the model can focus on the salient visual features [
23].
- (2) Rescaling
After resizing, the pixel values were normalized from the 0–255 RGB range to [0, 1] [
24]. The normalization process stabilizes gradient-based optimization and accelerates convergence by constraining input magnitudes, which supports efficient CNN training.
- (3) Partitioning
Each dataset was partitioned into training and validation subsets using an 80:20 ratio, where 80% of the samples were used for model learning and 20% were reserved for performance evaluation [
25]. This split supports generalization assessment and mitigates overfitting. Accordingly, BISINDO was divided into 9186 training and 2285 validation images, ASL into 20,800 and 5200 images, and HG14 into 11,200 and 2800 images, respectively.
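The per-class 80:20 partitioning can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `stratified_split`, the rounding rule, and the assumption of 1000 images per ASL class (the text reports only the total of 26,000) are all hypothetical. The seed of 42 is the one reported in the implementation details.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split sample indices into train/validation sets class by class."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Rounding per class is an assumption; other rules shift counts slightly.
        cut = round(len(members) * train_fraction)
        train_idx.extend(members[:cut])
        val_idx.extend(members[cut:])
    return train_idx, val_idx


# ASL-sized example: 26 classes, assumed 1000 images each.
labels = [c for c in range(26) for _ in range(1000)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 20800 5200
```

Splitting inside each class keeps the class proportions identical in both subsets, which matters most for BISINDO, whose 11,471 images cannot be divided evenly across 26 classes.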
2.3. Data Augmentation
Data augmentation was performed to improve the generalization of the model by broadening the data diversity while maintaining semantic consistency. Standard spatial transformations were first applied and subsequently constrained using the proposed GBAV framework.
- (1) Baseline Augmentation Strategies
Data augmentation was implemented on-the-fly during training using a stochastic pipeline that included horizontal flipping, rotation (±5%), zoom (±5%), and translation (±5%). While these transformations were randomly sampled at each epoch, their permissible ranges were constrained by the proposed GBAV framework to maintain semantic consistency and prevent structural distortion.
- (2) Gradient-Based Augmentation Validation (GBAV)
Data augmentation typically operates under the label invariance assumption, which posits that semantic class labels do not change under permissible transformations [
26]. Nevertheless, excessive or unconstrained augmentation can distort the underlying data manifold and cause semantic drift, resulting in inconsistent supervision signals and diminished model performance. Unlike adaptive augmentation strategies, GBAV does not dynamically adjust transformations during optimization but instead provides a principled verification mechanism to constrain augmentation intensity before training.
To mitigate this issue, the proposed GBAV framework systematically evaluated the structural consistency of the augmented samples prior to their incorporation into the training process. In the proposed framework, the structural integrity of the augmented samples was quantified by leveraging principles from representation learning [
27] and gradient-based feature descriptors [
28]. In particular, the structural characteristics were encoded through magnitude-weighted gradient orientation histograms, which are inspired by HOG representations and have been demonstrated to effectively encode human-centric structures such as edges and contours [
29]. The similarity between the original and augmented samples was then measured using Kullback–Leibler divergence [
30], complemented by correlation-based metrics, to identify the transformations that preserve semantic consistency.
To operationalize this framework, a calibration procedure was conducted using a held-out subset of 200 BISINDO images stratified across all classes. For each augmentation type, magnitude-weighted 9-bin gradient orientation histograms (HOG-style descriptors) [
28] were computed for both original and augmented samples. Structural similarity was quantified using Pearson correlation ($r$) and symmetric Kullback–Leibler (KL) divergence ($D_{\mathrm{KL}}$) with Laplace smoothing. A transformation magnitude was accepted as “safe” if $r \geq 0.85$ and $D_{\mathrm{KL}} \leq 0.05$. These thresholds correspond to conventional cutoffs for strong structural similarity and distributional equivalence and were defined prior to the sweep to avoid post hoc selection. The evaluated parameter grid included rotation {2, 3, 4, 5, 6, 7, 9, 18, 27, 36, 45} degrees, zoom {0.05, 0.10, 0.20, 0.50, 0.90}, and translation {0.05, 0.10, 0.15, 0.20, 0.30}. The resulting safe magnitudes used in all training experiments were rotation = 5°, zoom = 5%, and translation = 10%.
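The calibration check described above can be sketched with NumPy. This is an illustrative approximation under stated assumptions, not the published implementation: the function names are hypothetical, images are assumed to be 2-D grayscale arrays, the histogram is computed over the whole image rather than per cell, the smoothing constant `eps` is arbitrary, and the symmetric KL divergence is taken as the mean of the two directed divergences.

```python
import numpy as np


def orientation_histogram(image, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (HOG-style)."""
    gy, gx = np.gradient(np.asarray(image, dtype=np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.degrees(np.arctan2(gy, gx)), 180.0)  # fold into [0, 180)
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0), weights=magnitude)
    return hist


def smoothed_distribution(hist, eps=1e-3):
    """Normalize to probabilities with additive smoothing (eps is an assumption)."""
    total = hist.sum()
    p = hist / total if total > 0 else np.full(hist.shape, 1.0 / hist.size)
    p = p + eps
    return p / p.sum()


def symmetric_kl(hist_a, hist_b):
    """Mean of KL(p||q) and KL(q||p); the averaging convention is an assumption."""
    p, q = smoothed_distribution(hist_a), smoothed_distribution(hist_b)
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def is_safe(original, augmented, r_min=0.85, kl_max=0.05):
    """Apply the acceptance rule r >= r_min and D_KL <= kl_max."""
    h0 = orientation_histogram(original)
    h1 = orientation_histogram(augmented)
    r = np.corrcoef(h0, h1)[0, 1]
    return bool(r >= r_min and symmetric_kl(h0, h1) <= kl_max)


# Toy check: a horizontal intensity ramp versus its 90-degree rotation.
ramp = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
print(is_safe(ramp, ramp))            # True
print(is_safe(ramp, np.rot90(ramp)))  # False
```

Because the descriptor depends only on the input pixels, the same check can be run once per candidate magnitude before any model is trained.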
Unlike existing augmentation strategies such as AutoAugment [
31], RandAugment [
32], Faster AutoAugment [
33], and TrivialAugment [
34], which rely on downstream model performance to evaluate augmentation policies, GBAV operates directly on input-level structural statistics and is therefore architecture-independent in principle. Empirical validation in this work is conducted on CNN-based pipelines, while extension to other architectures remains straightforward and is left for future work. Furthermore, GBAV is applied as a pre-training calibration step rather than during optimization, in contrast to adaptive curriculum-based approaches such as AugMax [
35]. These distinctions position GBAV as a principled validation mechanism for ensuring semantic integrity prior to model training, rather than a training-time augmentation strategy.
2.4. Model Development
ResNet50 was selected for its hierarchical residual representations and stable optimization in deep architectures; InceptionV3 was selected for its multi-scale receptive fields, capturing both local finger-level features and global hand shape simultaneously. We considered EfficientNet and Vision Transformer alternatives but selected ResNet50 + InceptionV3 because both backbones provide well-validated ImageNet pre-training and produce complementary representation types at comparable parameter budgets.
A hybrid pretrained model was constructed by integrating two complementary CNN architectures, ResNet50 and InceptionV3, together with an attention-based pooling mechanism to enhance feature extraction and classification performance for hand gesture images. The model received input images of size 299 × 299 × 3, which matches the native input resolution of InceptionV3 and is also accepted by ResNet50. The images were processed in parallel through two branches: ResNet50, which generated a 10 × 10 × 2048 feature map, and InceptionV3, which produced an 8 × 8 × 2048 feature map. Global Average Pooling (GAP) was then applied to convert both feature maps into 2048-dimensional vectors, which were subsequently reduced to 256-dimensional embeddings via ReLU-activated dense layers to reduce the computational complexity while preserving the key discriminative information.
The two compact feature vectors were subsequently concatenated into a unified 512-dimensional representation to form the fused backbone of the model. The attention-based pooling layer then refined this vector by prioritizing the features that contributed most to the classification output. The attention-enhanced representation was further processed by a fully connected layer with 256 ReLU-activated neurons, followed by a dropout layer with a rate of 0.3 to prevent overfitting during training. Finally, a dense layer with SoftMax activation was used to produce a probability distribution across the target classes, enabling the model to determine the most likely category for each input image. The proposed model, which combines the pretrained ResNet50 and InceptionV3 backbones with attention-based pooling, is shown in
Figure 5.
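The tensor shapes along the two branches can be traced with a small NumPy sketch. Random arrays stand in for the backbone outputs and the dense-layer weights, and the helper `dense_relu` is hypothetical, so the snippet illustrates only the dimensional bookkeeping (GAP, projection to 256, concatenation to 512), not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense_relu(x, out_dim, rng):
    """Stand-in for a ReLU-activated dense layer with random weights."""
    w = rng.normal(scale=0.01, size=(x.shape[-1], out_dim))
    return np.maximum(x @ w, 0.0)


# Placeholder feature maps with the shapes reported for a 299 x 299 x 3 input.
resnet_map = rng.random((1, 10, 10, 2048))
inception_map = rng.random((1, 8, 8, 2048))

# Global average pooling collapses the spatial axes: (1, H, W, C) -> (1, C).
resnet_vec = resnet_map.mean(axis=(1, 2))
inception_vec = inception_map.mean(axis=(1, 2))

# Each 2048-d vector is projected to a 256-d embedding.
resnet_emb = dense_relu(resnet_vec, 256, rng)
inception_emb = dense_relu(inception_vec, 256, rng)

# Concatenation along the feature axis yields the fused 512-d representation.
fused = np.concatenate([resnet_emb, inception_emb], axis=-1)
print(resnet_vec.shape, resnet_emb.shape, fused.shape)  # (1, 2048) (1, 256) (1, 512)
```

Applying GAP before fusion is what allows two feature maps with different spatial sizes (10 × 10 and 8 × 8) to be combined without resampling.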
The following subsections briefly summarize the core building blocks of the proposed framework, including the convolutional feature extractors, residual networks, multi-branch inception modules, channel-wise concatenation, and attention-based pooling, to clarify the functional roles of each element within the fused backbone.
- (1) Convolutional Neural Networks (CNNs)
CNNs comprise a feature extraction stage followed by a classification stage, as illustrated in
Figure 6.
Convolutional layers apply learned kernels to extract spatial representations, which are passed through non-linear activations (ReLU) and refined using pooling to reduce the spatial resolution and enhance the translational invariance. Fully connected layers then map the resulting feature vectors to the output classes through supervised training.
- (2) Residual Network 50 (ResNet50)
ResNet50 is a 50-layer convolutional network that mitigates the vanishing gradient problem through residual (skip) connections, as shown in
Figure 7.
These identity mappings bypass stacks of convolutions and support stable optimization in deep feature hierarchies built from bottleneck blocks (1 × 1, 3 × 3, and 1 × 1 with batch normalization and ReLU). Spatial activations are then aggregated by global-average pooling and mapped to output classes by a fully connected layer.
- (3) Inception Version 3 (InceptionV3)
InceptionV3 is a convolutional architecture designed to capture multi-scale spatial structures using parallel convolutional branches with different receptive field sizes, as shown in
Figure 8.
The network consists of a stem for initial feature extraction, followed by stacked inception modules that perform multi-branch processing with dimensionality-reduction (1 × 1) factorization to control the computational cost. The outputs from the parallel paths are concatenated and subsequently aggregated by global-average pooling, and a fully connected layer with SoftMax activation then generates class probability predictions.
- (4) Concatenation Layer
Concatenation combines feature maps along the channel dimension to merge outputs from parallel convolutions, enabling the network to integrate multi-scale representations within a single layer. By preserving all the activations produced by the individual branches, the concatenation operation expands the channel depth without altering the spatial resolution, thereby enriching the representational capacity of the feature tensor [
36].
- (5) Attention-Based Pooling Method
Attention-based pooling assigns learned weights to the feature activations to produce a weighted representation that prioritizes task-relevant patterns in the input images. Unlike max or average pooling, which applies fixed selection rules, attention pooling adapts to the input by estimating importance scores using a self-attention mechanism. Multi-head attention extends this operation by projecting the features into multiple subspaces and computing independent weight vectors, yielding richer contextual representations and an improved discriminative performance.
The attention score is calculated for each input element by first applying a linear transformation followed by the tanh activation function, which can be expressed as (1):

$$e_i = \tanh\left(W x_i + b\right) \qquad (1)$$

where $b$ represents the bias, $W$ is the trainable attention weight matrix, and $x_i$ is the input vector. The Softmax function is then used to normalize these attention scores and generate probabilistic weights, as shown in (2):

$$\alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)} \qquad (2)$$

Finally, the obtained weights are applied to combine the features in a weighted manner [37,38], resulting in a combined output defined as (3):

$$c = \sum_{i} \alpha_i x_i \qquad (3)$$
The main advantage of the attention-based pooling mechanism is that it can flexibly select the most important features, making the model more efficient and accurate when handling difficult tasks such as classification and segmentation.
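The three steps described above (tanh scoring, softmax normalization, and weighted combination) can be sketched in NumPy. The function name `attention_pool` is hypothetical, and treating the input as a set of n feature vectors scored by a single-column weight matrix is an assumption about how the layer is applied; the text does not specify the exact tensor layout.

```python
import numpy as np


def attention_pool(x, W, b):
    """Pool n feature vectors into one using learned attention weights.

    x: (n, d) input features; W: (d, 1) weight matrix; b: scalar bias.
    Returns the pooled (d,) vector and the (n,) attention weights.
    """
    scores = np.tanh(x @ W + b).squeeze(-1)   # one score per element, shape (n,)
    scores = scores - scores.max()            # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over elements
    pooled = weights @ x                      # weighted sum of the inputs
    return pooled, weights


# With zero weights every element gets the same score, so attention pooling
# reduces to plain average pooling.
x = np.arange(12, dtype=float).reshape(4, 3)
pooled, weights = attention_pool(x, np.zeros((3, 1)), 0.0)
print(weights)  # four equal weights of 0.25
print(pooled)   # equals x.mean(axis=0)
```

The zero-weight case makes the relationship to average pooling explicit: training only has to learn deviations from uniform weighting.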
2.5. Implementation Details
The experiments were implemented in Python using a suite of deep-learning and scientific computing libraries. TensorFlow, accessed through the Keras API, served as the primary framework for building and training the models. NumPy was utilized for numerical computations, and the os module handled file management. Image preprocessing and augmentation were performed using the TensorFlow tf.data API alongside Keras preprocessing layers, enabling efficient data loading, parallel processing, and real-time augmentation. Model evaluation, including the generation of confusion matrices and classification reports, was performed with Scikit-Learn. Finally, the training outcomes, such as the accuracy and loss curves and confusion matrices, were visualized using Matplotlib (version 3.10.0). GBAV calibration was performed prior to model training to determine the maximum permissible transformation intensities, which were established based on divergence thresholds obtained from an analysis of gradient-structure consistency.
Four model configurations were evaluated:
- (1) Combined ResNet50 + InceptionV3 with attention;
- (2) Combined ResNet50 + InceptionV3 without attention;
- (3) ResNet50;
- (4) InceptionV3.
All the experiments were carried out on a personal computer, the specifications of which are summarized in
Table 1, while the relevant hyperparameter configurations are detailed in
Table 2.
Preprocessing was implemented using explicit TensorFlow operations. Image resizing was performed using tf.image.resize with bilinear interpolation to a fixed resolution of 299 × 299 × 3, followed by per-image normalization to the [0, 1] range. Dataset partitioning was conducted deterministically using a per-class 80/20 split with a global random seed of 42 to ensure reproducibility. Data augmentation was implemented using tf.keras.layers, including RandomFlip, RandomRotation, RandomZoom, and RandomTranslation, with transformation magnitudes constrained by the GBAV calibration procedure. Model training utilized the Adam optimizer with a constant learning rate of 5 × 10⁻⁵ and a batch size of 64 for a maximum of 30 epochs. Label-smoothed categorical cross-entropy (smoothing = 0.01) was used as the loss function. L2 weight regularization with a coefficient of 1 × 10⁻⁴ was applied to fully connected layers, and dropout with a rate of 0.3 was used to mitigate overfitting. The pretrained ResNet50 and InceptionV3 backbones were initially frozen and subsequently fine-tuned by unfreezing the final convolutional block of each network. Training progress was monitored using a validation set, with EarlyStopping (patience = 5) and ModelCheckpoint configured to preserve the weights corresponding to the lowest validation loss. The complete training and evaluation pipeline, including GBAV calibration, cross-dataset evaluation, attention quantification, efficiency benchmarking, and 5-fold cross-validation, is publicly available as described in the Data Availability statement.
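One practical detail when passing calibrated magnitudes to the Keras layers named above is the unit of each argument: RandomRotation takes its factor as a fraction of a full turn (2π), whereas RandomZoom and RandomTranslation take fractions of the image size. The helper below is a hypothetical sketch of that conversion using the safe magnitudes reported in Section 2.3; it only builds a plain dictionary and is not taken from the released pipeline.

```python
def keras_augmentation_factors(rotation_deg, zoom_fraction, translation_fraction):
    """Convert calibrated magnitudes into Keras-style layer arguments."""
    return {
        # RandomRotation expects a fraction of 2*pi, so degrees are divided by 360.
        "RandomRotation": {"factor": rotation_deg / 360.0},
        # RandomZoom and RandomTranslation expect fractions of the image size.
        "RandomZoom": {"height_factor": zoom_fraction},
        "RandomTranslation": {
            "height_factor": translation_fraction,
            "width_factor": translation_fraction,
        },
    }


# Safe magnitudes reported by the calibration: 5 degrees, 5% zoom, 10% translation.
factors = keras_augmentation_factors(5.0, 0.05, 0.10)
for layer_name, kwargs in factors.items():
    print(layer_name, kwargs)
```

Under this convention a 5° limit corresponds to a rotation factor of about 0.0139, which is far smaller than a factor of 0.05 (18°).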
2.6. Model Evaluation
The effectiveness of the proposed model was assessed using four standard classification metrics: accuracy, precision, recall, and F1-score. Accuracy is the ratio of correctly classified samples to the total number of samples, whereas precision is the fraction of samples predicted as a given class that truly belong to it. Recall is the fraction of samples of a given class that the model correctly identifies, and the F1-score is the harmonic mean of precision and recall. Together, these metrics provide a thorough assessment of the classification performance across datasets characterized by different complexities and class distributions.
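For concreteness, the four metrics can be computed from predicted and true labels as sketched below. Macro averaging over classes is an assumption here, since the text does not state which averaging scheme was used, and the toy labels are purely illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Illustrative labels for three classes (not results from the paper).
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print({name: round(value, 4) for name, value in metrics.items()})
```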
Besides quantitative assessment, a qualitative interpretability analysis was also performed to investigate the spatial emphasis of the acquired representations. In particular, Grad-CAM was utilized as a post hoc visualization method to emphasize class-discriminative areas and evaluate the impact of attention pooling on feature localization [
39].
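The core Grad-CAM computation can be sketched in NumPy once the activations of a convolutional layer and the gradients of the class score with respect to them are available. Obtaining those two arrays requires a framework-specific backward pass that is omitted here; the function name `grad_cam` and the max-normalization step are illustrative choices.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Class activation map from (H, W, K) activations and matching gradients."""
    # Channel importance: spatial average of the gradients for each channel.
    channel_weights = gradients.mean(axis=(0, 1))                  # (K,)
    # Weighted sum of the feature maps, keeping only positive evidence (ReLU).
    cam = np.maximum((activations * channel_weights).sum(axis=-1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam                         # scale to [0, 1]


# Random placeholders standing in for a real layer's outputs and gradients.
rng = np.random.default_rng(0)
acts = rng.random((8, 8, 4))
cam = grad_cam(acts, np.ones((8, 8, 4)))
print(cam.shape)  # (8, 8)
```

The resulting low-resolution map is typically upsampled to the input size and overlaid on the image for visual inspection.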
3. Results
This section evaluates the proposed methodology from three perspectives: the feature robustness, classification performance across datasets, and model interpretability.
3.1. Feature Stability Analysis
Before evaluating the classification performance, a gradient-based analysis was performed to assess the structural stability of the extracted features under typical spatial transformations. The GBAV descriptor was used to examine the variations in the gradient orientation distributions caused by rotation, translation, and scaling perturbations. To measure the feature divergence, Pearson’s correlation and the Kullback–Leibler (KL) divergence were calculated between the original and transformed representations. This approach enabled a classifier-independent evaluation of the feature invariance. The GBAV responses to rotational, scaling, and translational transformations are depicted in
Figure 9, while the associated quantitative correlation and divergence metrics are presented in
Table 3.
Rotational perturbations demonstrated the greatest sensitivity to transformation magnitude. Small rotations up to 5° preserved structural consistency, achieving high correlation ($r$ = 0.8596) and low divergence ($D_{\mathrm{KL}}$ = 0.0135), satisfying the GBAV acceptance criteria. However, increasing the rotation beyond this threshold resulted in a rapid degradation of structural alignment. At 18°, the correlation dropped significantly ($r$ = 0.4735), and at 36°, it became negative ($r$ = −0.0937), accompanied by increased divergence ($D_{\mathrm{KL}}$ = 0.0998), indicating a breakdown in gradient correspondence. These results confirm that rotational transformations are highly sensitive and must be tightly constrained to preserve semantic consistency.
Conversely, moderate scaling transformations exhibited stable behavior at low magnitudes. A zoom factor of 5% maintained strong structural consistency ($r$ = 0.9007, $D_{\mathrm{KL}}$ = 0.0089), while larger scaling levels led to a progressive degradation. At 10%, the correlation dropped below the acceptance threshold ($r$ = 0.8313), and extreme zooming (90%) resulted in severe structural distortion ($r$ = 0.3082, $D_{\mathrm{KL}}$ = 0.1504), likely due to contextual loss and feature truncation.
Translational perturbations showed a more gradual degradation pattern. Small displacements of 5% and 10% preserved high structural similarity ($r$ = 0.9503 and 0.8875, respectively), with low divergence values. However, larger translations beyond 10% led to a decline in correlation ($r$ = 0.8200 at 15%) and increased divergence, indicating partial occlusion of discriminative regions. Based on these findings, conservative augmentation parameters were selected for training: rotation = 5°, zoom = 5%, and translation = 10%. These values correspond to the maximum magnitudes that satisfy the GBAV acceptance criteria ($r \geq 0.85$ and $D_{\mathrm{KL}} \leq 0.05$), ensuring semantic validity while maintaining controlled variability.
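The selection of the largest acceptable magnitude can be reproduced from the correlation values quoted in this subsection. The sketch below checks only the correlation half of the criterion, because divergence values are not quoted for every magnitude in the text, and it uses only the subset of the grid reported above; the function name `largest_safe_magnitude` is hypothetical.

```python
R_MIN = 0.85  # correlation threshold from the GBAV acceptance criteria

# Pearson correlations quoted in the text for a subset of the evaluated grid.
reported_r = {
    "rotation_deg": {5: 0.8596, 18: 0.4735, 36: -0.0937},
    "zoom": {0.05: 0.9007, 0.10: 0.8313, 0.90: 0.3082},
    "translation": {0.05: 0.9503, 0.10: 0.8875, 0.15: 0.8200},
}


def largest_safe_magnitude(r_by_magnitude, r_min=R_MIN):
    """Return the largest magnitude whose correlation meets the threshold."""
    safe = [m for m, r in r_by_magnitude.items() if r >= r_min]
    return max(safe) if safe else None


selected = {name: largest_safe_magnitude(values) for name, values in reported_r.items()}
print(selected)  # {'rotation_deg': 5, 'zoom': 0.05, 'translation': 0.1}
```

A full reproduction would apply the same rule jointly with the divergence threshold over every magnitude listed in Table 3.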
3.2. Classification Performance
Four model configurations were assessed:
Unified ResNet50–InceptionV3 architecture utilizing attention-based pooling.
The same combined backbone without attention.
ResNet50 standalone model.
InceptionV3 standalone model.
Given that all the datasets comprised an equal number of classes, accuracy was selected as the primary comparative metric, while the precision, recall, and F1-score metrics were used to provide additional insights.
- (1) Combined Model with Attention-Based Pooling
The evaluation outcomes of the attention-augmented combined model are presented in
Table 4.
The model demonstrated a robust and consistent performance across all the datasets, achieving validation accuracies of 96.87% (BISINDO), 99.92% (ASL), and 95.25% (HG14). Importantly, the HG14 dataset, marked by greater gesture variability and intricate backgrounds, significantly benefited from attention-based pooling, which enhanced the spatial focus and aggregation of the discriminative features.
- (2) Combined Model without Attention Pooling
The performance metrics of the combined backbone without attention are presented in
Table 5.
Although the model retained a high accuracy on BISINDO and ASL, a noticeable decline was evident on HG14 when compared to the attention-enhanced version. This performance disparity underscores the importance of attention pooling in alleviating feature ambiguity in visually complex gesture datasets.
- (3) Single-Model Backbones
Table 6 and
Table 7 present the findings for the ResNet50 and InceptionV3 models, respectively. Both architectures performed well on the ASL and BISINDO datasets.
However, their performance significantly deteriorated on HG14. ResNet50 showed limited resilience to background clutter and gesture variability, while InceptionV3, despite achieving strong training accuracy, exhibited diminished generalization on HG14. These findings suggest that single-backbone architectures face significant challenges in capturing the varied spatial cues inherent in more demanding gesture datasets.
3.3. Interpretability Analysis
To explore the internal decision-making behavior of the various models, a qualitative interpretability analysis was conducted utilizing Grad-CAM. Although quantitative metrics provide an overall evaluation of the classification performance, Grad-CAM facilitates a more detailed examination of the specific spatial areas that significantly impact the model predictions. Such analysis is particularly valuable in hand gesture recognition tasks characterized by visual ambiguity and subtle finger configurations [
5].
This investigation is of particular significance for the HG14 dataset, which yielded a relatively lower performance compared to BISINDO and ASL due to greater intra-class variability and subtle visual distinctions among the hand gestures. Based on the class-wise performance evaluation results, Gestures 3 and 10 were chosen as representative examples. These gestures involve visually similar hand configurations with overlapping finger placements, a challenge frequently reported in CNN-based ensemble recognition studies [
10]. Such similarity reduces the class separability despite high overall accuracy, making these classes suitable for analyzing how architectural design influences spatial feature discrimination.
As shown in Figure 10, the Grad-CAM responses generated without the use of attention-based pooling were notably diffuse, with the activations spread across wider areas of the hand and, in certain instances, encroaching into the background.
A similar pattern has been noted in baseline convolutional models documented in existing research, where the lack of explicit spatial weighting leads to a partial dependence on non-discriminative areas such as the base of the palm or the wrist [11]. This phenomenon was particularly pronounced for Gesture 10, where the inaccurate localization of the primary finger structure added to the classification uncertainty.
Conversely, the attention-enhanced model demonstrated a more focused and semantically relevant activation pattern, consistent with observations from recent attention-based fusion methodologies. The highlighted regions corresponded to the finger articulations and joint configurations that characterize each gesture, indicating improved spatial selectivity [39]. For Gesture 3, attention was focused on the relative arrangement of the extended fingers, whereas for Gesture 10, it was concentrated on the orientation of the dominant finger. This refined spatial localization illustrates how attention-based pooling suppresses irrelevant background elements and amplifies discriminative areas, which may improve the model’s robustness for visually intricate gesture categories. Qualitatively, the attention-enhanced architecture exhibited less dispersed background activation than the baseline configuration, which suggests more accurate localization of distinguishing finger structures.
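For reference, the core computation behind such heatmaps can be sketched as follows. This is a minimal NumPy sketch of the standard Grad-CAM weighting, not the pipeline used in this study; it assumes that the activations of a convolutional layer and the gradients of the target class score with respect to them have already been extracted from the framework, and `grad_cam` is a hypothetical helper name.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap from one layer's activations and gradients.

    Both inputs have shape (H, W, K): K feature maps of size H x W, with
    `gradients` holding d(class score)/d(activations) for the target class.
    """
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(0, 1))          # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(activations @ weights, 0.0)   # shape (H, W)
    # Scale to [0, 1]; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

A diffuse heatmap in this formulation means that many spatial positions receive comparable weighted activation, whereas a focused one concentrates high values on a few positions such as the finger regions.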
3.4. Cross-Dataset Generalization
To evaluate external validity, we trained the headline model on BISINDO and evaluated it on ASL without retraining, and vice versa, since both datasets share the alphabetic label space A–Z. Both directions of transfer collapsed to near-chance accuracy (BISINDO → ASL: 6.99%, ASL → BISINDO: 6.72%, versus a 3.85% random baseline over 26 classes). This reflects the linguistic independence of the two sign systems rather than a model failure. To test this interpretation, we classified each letter pair by handshape as identical (I, L, V), similar (C, O), or different (the remaining 21 letters). Per-class transfer precision correlated strongly with handshape similarity (Pearson r = 0.71 for BISINDO → ASL, r = 0.80 for ASL → BISINDO). Letters with identical handshapes achieved 30–55% mean transfer precision (peak: I = 95.4% precision, ASL → BISINDO), letters with similar handshapes achieved 19–32%, and letters with different handshapes averaged below 2%. This monotonic correspondence indicates that the model responds to handshape structure rather than to dataset-specific artifacts. A summary is shown in Table 8, and the per-class distribution is depicted in Figure 11.
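The random baseline and the similarity correlation above can be reproduced along the following lines. This is an illustrative sketch only: the ordinal coding of the three similarity groups (0, 1, 2) is an assumption, since the text does not state how similarity was encoded for the Pearson computation, and the helper names are hypothetical. No per-class precision values are included; they would come from the evaluation itself.

```python
from math import sqrt


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy (%) of uniform random guessing over balanced classes."""
    return 100.0 / num_classes


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Assumed ordinal coding of the similarity groups described in the text.
SIMILARITY_SCORE = {"different": 0, "similar": 1, "identical": 2}
# Letter groups as listed in the text; every other letter is "different".
GROUPS = {"I": "identical", "L": "identical", "V": "identical",
          "C": "similar", "O": "similar"}


def similarity_scores(letters: list[str]) -> list[int]:
    """Map each letter to its assumed ordinal handshape-similarity score."""
    return [SIMILARITY_SCORE[GROUPS.get(letter, "different")] for letter in letters]


print(f"{chance_accuracy(26):.2f}%")  # 3.85%
```

Given a list of per-class transfer precisions ordered A to Z, the correlation would be obtained as `pearson_r(similarity_scores(letters), precisions)`.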
3.5. Quantitative Attention Localization
To move beyond purely visual Grad-CAM inspection, we extracted pseudo-ground-truth hand masks via MediaPipe HandLandmarker (convex-hull dilation of detected landmarks) for 200 randomly sampled HG14 validation images and computed three localization metrics on the fused Grad-CAM heatmap (mean of ResNet50 conv5_block3_out and InceptionV3 mixed10 activations): pointing-game accuracy (peak activation inside the mask), energy-in-hand fraction (proportion of above-threshold activation energy inside the mask), and IoU (intersection-over-union between thresholded heatmap and mask). The results for both architectures are shown in
Table 9. Both models attend strongly to hand regions in absolute terms (energy > 70%, pointing-game = 100%), confirming that the combined backbone intrinsically learns hand-focused features regardless of whether attention pooling is used.
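The three localization metrics can be expressed compactly as below. This is a minimal sketch under stated assumptions rather than the evaluation code used here: the heatmap is assumed to be already fused, normalized to [0, 1], and resized to the mask resolution; the threshold of 0.5 is a placeholder, as the value used is not stated; and `localization_metrics` is a hypothetical helper name.

```python
import numpy as np


def localization_metrics(heatmap: np.ndarray, mask: np.ndarray,
                         threshold: float = 0.5) -> dict:
    """Score one heatmap against a binary hand mask of the same shape.

    `threshold` is an assumed cut-off on the normalized heatmap.
    """
    mask = mask.astype(bool)
    active = heatmap >= threshold
    # Pointing game: does the single highest activation fall inside the mask?
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    hit = bool(mask[peak])
    # Energy-in-hand: share of above-threshold activation energy inside the mask.
    energy = np.where(active, heatmap, 0.0)
    total = energy.sum()
    energy_in_hand = float(energy[mask].sum() / total) if total > 0 else 0.0
    # IoU between the thresholded heatmap and the mask.
    union = np.logical_or(active, mask).sum()
    iou = float(np.logical_and(active, mask).sum() / union) if union > 0 else 0.0
    return {"pointing_hit": hit, "energy_in_hand": energy_in_hand, "iou": iou}
```

Pointing-game accuracy over a sample is then the fraction of images with `pointing_hit` set, while the other two metrics are averaged across images.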
3.6. 5-Fold Cross-Validation
Five-fold stratified cross-validation was conducted on HG14, the dataset with the most realistic intra-class variability, to assess result stability. Across the five folds, mean accuracy was 93.51% ± 2.31% and macro-F1 was 93.51% ± 2.33%. The single-split HG14 accuracy of 95.25% reported in the main results falls within one standard deviation of the cross-validated mean (a difference of 1.74 percentage points), indicating that the main result is not an artifact of a favorable partition. Cross-validation was prioritized for HG14 because ASL exhibits near-saturated performance across all models, making it insensitive to partitioning, and BISINDO single-split performance (96.87%) lies within the HG14 variance range (±2.31%), suggesting additional cross-validation would not materially affect conclusions. Per-fold results are shown in
Table 10.
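The procedure can be illustrated with a short standard-library sketch. It is not the code used in this study: the round-robin fold assignment is one simple way to stratify, the use of the sample standard deviation is an assumption (the text does not say which estimator was reported), and all function names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_folds(labels: list[int], k: int = 5, seed: int = 0) -> list[list[int]]:
    """Split sample indices into k folds with near-equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def summarize(fold_scores: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


def within_one_std(value: float, center: float, spread: float) -> bool:
    """Check whether a single-split score lies within one deviation of the mean."""
    return abs(value - center) <= spread


print(within_one_std(95.25, 93.51, 2.31))  # True
```

Each fold serves once as the validation set while the remaining four are used for training, and `summarize` is applied to the five resulting scores.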
3.7. Computational Efficiency Analysis
To characterize the computational cost of the dual-backbone architecture, we benchmarked all four variants on a Tesla P100 GPU. FLOPs were computed via the TensorFlow graph profiler; latency was measured over 100 trials of single-image forward passes (299 × 299 resolution, 20 warm-up iterations). The results are shown in
Table 11. The headline combined model has 46.6 M parameters, 26.0 GFLOPs, and 26.8 ± 1.5 ms latency per image (~37 FPS on the P100), which is suitable for interactive applications. The optional attention layer adds ~0.06 M parameters and <1% latency overhead; given that the ablation shows no accuracy gain beyond fold-to-fold variance, the simpler no-attention variant is recommended for deployment.
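The latency protocol described above (warm-up iterations followed by timed single-image passes) can be sketched generically as follows. This is an illustrative harness, not the benchmarking script used here: `forward` is a hypothetical zero-argument callable wrapping one forward pass, and on a GPU it would need to block until the output is ready for the timings to be meaningful.

```python
import time
from statistics import mean, stdev
from typing import Callable


def benchmark(forward: Callable[[], object], warmup: int = 20,
              trials: int = 100) -> tuple[float, float]:
    """Return mean and standard deviation of per-call latency in milliseconds."""
    # Warm-up calls are discarded so one-off setup costs do not skew timings.
    for _ in range(warmup):
        forward()
    timings_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        forward()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(timings_ms), stdev(timings_ms)


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to throughput."""
    return 1000.0 / latency_ms


print(f"{frames_per_second(26.8):.1f} FPS")  # 37.3 FPS
```

The conversion shows how the reported 26.8 ms per image corresponds to roughly 37 frames per second.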
4. Discussion
The experimental findings indicate that the integration of diverse CNN backbones enhances classification performance across all evaluated datasets when compared to single-model configurations. The hybrid ResNet50–InceptionV3 architecture consistently outperformed its standalone counterparts on BISINDO, ASL, and HG14, confirming that the fusion of residual learning and multi-scale convolution improves representational capacity. This observation aligns with prior studies demonstrating performance gains from combining complementary CNN architectures [37, 40, 41].
Specifically, on the BISINDO dataset, the combined model achieved validation accuracies of 96.87% with attention and 97.00% without attention, indicating that the inclusion of attention does not provide a statistically meaningful improvement when hand structures are already distinct and background complexity is moderate. A similar trend is observed on the ASL dataset, where all configurations achieved near-saturated performance (approximately 99.7–99.9%), suggesting that the uniform background and limited gesture variability reduce the potential benefit of attention mechanisms. In contrast, the HG14 dataset exhibits a modest improvement when attention is included, with the combined model achieving 95.25% compared to 94.18% without attention. This suggests that attention-based pooling enhances spatial discrimination in visually complex scenarios characterized by background clutter and subtle finger variations, although the improvement remains limited.
Importantly, the observed differences between attention and non-attention configurations (0.13 percentage points on BISINDO and 1.07 on HG14) are smaller than the ±2.31% fold-to-fold standard deviation identified through cross-validation on HG14. This indicates that the performance gains attributed to attention are not statistically distinguishable across the evaluated datasets. Instead, the primary improvement over baseline configurations arises from the use of GBAV-calibrated augmentation, which ensures that training data remain structurally consistent and semantically valid.
The effectiveness of attention mechanisms is therefore context-dependent. In datasets with uniform backgrounds and well-aligned hand regions, such as ASL, attention provides minimal benefit. In more complex scenarios such as HG14, attention contributes to improved spatial focus, but the magnitude of improvement remains modest. This suggests that further gains may be achieved through complementary preprocessing strategies, such as segmentation or background suppression, rather than relying solely on architectural modifications.
The cross-dataset transfer experiment provides additional insight into the nature of the learned representations. Despite sharing identical alphabetic labels, BISINDO and ASL represent linguistically distinct sign systems. The model achieved only approximately 7% transfer accuracy between datasets, while maintaining within-distribution accuracies above 96%. This indicates that the model learns dataset-specific handshape representations rather than domain-general gesture features. Importantly, this finding strengthens the methodological interpretation rather than weakening it, as it confirms that high within-dataset accuracy reflects consistent learning of the underlying sign system rather than spurious dataset bias.
Furthermore, InceptionV3 exhibited a more pronounced performance decline on HG14 compared to the hybrid architecture, suggesting that multi-scale feature extraction alone is insufficient to handle high intra-class variability without complementary residual representations. The combined architecture mitigates this limitation by integrating both global and hierarchical feature learning.
Although the proposed hybrid model introduces a higher parameter count relative to single-backbone configurations, the computational overhead remains manageable. As shown in the efficiency analysis, the model achieves real-time performance (~37 FPS), making it suitable for practical deployment in moderate-scale applications. Considered alongside the comparative results in Table 12, the proposed architecture performs competitively with well-established CNN-based methods such as DeepASLR [5], BISINDO-oriented CNN models [42], and MobileNet-based HG14 assessments [43].
When compared with existing approaches, the proposed method demonstrates competitive performance across multiple datasets with varying characteristics. While some studies report higher accuracy under specific dataset splits or controlled conditions, the proposed approach maintains consistently strong performance across BISINDO, ASL, and HG14. This cross-dataset stability highlights the advantage of combining complementary backbone architectures with GBAV-calibrated augmentation, rather than optimizing exclusively for a single dataset.
5. Conclusions
This study investigated hand gesture recognition across datasets with varying visual complexity by combining a Gradient-Based Augmentation Validation (GBAV) framework with a hybrid CNN architecture integrating ResNet50 and InceptionV3. The results demonstrate two primary contributions. First, GBAV provides a principled pre-training calibration mechanism that constrains augmentation magnitudes based on structural consistency, leading to improved model reliability and performance across all evaluated datasets. Second, the proposed multi-backbone CNN architecture leverages complementary feature representations from residual and multi-scale convolutional networks, resulting in consistently strong classification performance compared to single-model baselines.
The experimental findings further indicate that the impact of attention-based pooling is limited and dataset-dependent. While attention provides modest improvements in visually complex scenarios such as HG14, its effect remains within the cross-validation fold-to-fold variance (±2.31%), indicating that the gains are not statistically distinguishable across the evaluated benchmarks. Instead, the primary performance improvements can be attributed to GBAV-calibrated augmentation, which ensures semantic consistency during training.
Additional analyses reinforce these conclusions. The cross-dataset transfer experiment revealed that BISINDO and ASL, despite sharing alphabetic labels, represent linguistically distinct sign systems, with transfer accuracy remaining near chance (~7%) while within-dataset performance exceeds 96%. This finding confirms that the model captures dataset-specific handshape structures rather than domain-general gesture representations. Furthermore, efficiency evaluation shows that the proposed model achieves real-time performance (~37 FPS), demonstrating its practicality for deployment in real-world applications.
Future work will focus on enhancing robustness in visually complex environments, particularly for datasets such as HG14. Promising directions include integrating explicit hand segmentation or background suppression techniques to stabilize spatial representations and reduce interference from non-discriminative regions. Additionally, optimizing hybrid architectures through backbone-specific learning rates or adaptive fusion strategies may further improve training stability and feature balance. Extensions to alternative architectures, including EfficientNet and Vision Transformers, as well as applications to dynamic gesture recognition and multimodal interaction systems, represent valuable avenues for further research.