1. Introduction
Sign language serves as the principal mode of communication for individuals with hearing and speech impairments. It conveys information through structured hand gestures and visual expressions, enabling non-verbal interaction and emotional expression. However, people who are not familiar with sign language often find it difficult to understand, which creates communication barriers [
1]. According to available statistics, approximately 5% of the global population, or about 466 million people, are affected by hearing impairment, including around 34 million children. Therefore, members of the hearing-impaired community rely on sign language to express their needs and communicate effectively with both hearing-impaired and hearing individuals in essential areas such as education, healthcare, and employment [
2].
With the progress of computer technology and the enhancement of hardware systems, human–computer interaction (HCI) has become an integral part of everyday life. Hand gesture recognition (HGR), as an important component of HCI, has become an active research topic in recent years and plays a significant role in modern technological applications. It is widely used in areas such as virtual reality, robot control, interactive gaming, and natural user interfaces. Among these applications, sign language recognition, which belongs to the human communication system, represents a major and important application domain of hand gesture recognition [
3]. With the rapid development of modern technologies, there is growing potential to utilize these tools to create new approaches for sign language learning [
4].
American Sign Language (ASL) is a non-verbal communication method based on hand gestures that allows deaf people to communicate information effectively and to enhance their ability to express themselves [5]. However, without automated recognition systems, effective communication with hearing individuals remains a challenge, especially because diverse hand poses and gestures must be recognized accurately. In addition, the visual similarity among certain signs further increases the complexity of building reliable recognition models [6]. Therefore, this study focuses on sign language alphabet recognition, as alphabet letters form the fundamental components of any language. To verify the proposed approach, experiments are performed on the publicly available ASL Alphabet Dataset hosted on Kaggle (https://www.kaggle.com/grassknoted/asl-alphabet) (accessed on 20 April 2026).
Early studies on sign language alphabet recognition mainly relied on handcrafted features and traditional classifiers. For example, Sharma et al. employed skin-color-based hand segmentation and a contour tracing descriptor, followed by k-NN and multi-class SVM for classification. Although SVM achieved better performance, noticeable confusion still occurred among visually similar alphabet gestures, indicating the remaining challenges in recognizing ambiguous hand shapes [7]. Convolutional Neural Networks (CNNs) were subsequently applied extensively to image-based ASL recognition. Bheda and Radpour used a deep CNN to classify still images of ASL letters and numbers. On a controlled dataset, they achieved 82.5% accuracy, but performance dropped sharply when the same setup was applied to their own photos, in which lighting and skin tone varied considerably [8]. Garcia and Viesca took a different approach and built a real-time fingerspelling system using transfer learning; it was functional but reached only 72% accuracy on some basic classification tasks [9]. In recent years, researchers have moved beyond local feature extraction toward modeling sequences more globally. Ferreira et al. [10] proposed a Transformer-based contrastive learning model that analyzes sequences of body key points. Although it achieves nearly 90% accuracy in few-shot recognition tests, significant gaps remain unaddressed. CNNs are effective at capturing local patterns, but they struggle to model the long-range, global relationships that are necessary for recognizing complex signs. In contrast, Transformers capture global context more readily; however, they are computationally expensive and require large amounts of training data to compensate for their weak inductive biases, which makes them difficult to apply effectively to smaller ASL datasets. Furthermore, many prior studies rely heavily on pure CNNs, pure Transformers, or existing hybrid frameworks such as HCformer [11], which focus primarily on complex shifted-window mechanisms for global denoising. Consequently, lightweight hybrid CNN–Transformer models that balance fine-grained local feature extraction with global contextual understanding remain limited. To address this gap, we propose CT-Net, a lightweight, structure-aware hybrid architecture for ASL recognition.
To evaluate the effectiveness of the proposed CT-Net, we conducted extensive comparative experiments across multiple model families and data configurations. CT-Net achieves strong performance, reaching 95.67% accuracy, and consistently outperforms baselines, including traditional CNNs such as ResNet18; modern convolutional architectures such as ConvNeXt-Tiny; lightweight models such as MobileNetV3 and EfficientNet-B0; pure Transformers such as ViT; hybrid machine learning methods such as CNN + SVM and CNN + XGBoost; and recent lightweight vision models including MobileViT-S, MobileViT-v2, EfficientFormer-L1, and FastViT-SA12. In addition to accuracy, we evaluate computational efficiency and real-time suitability, showing that CT-Net preserves the speed advantages of lightweight CNN-based models while achieving stronger recognition performance than heavier Transformer-based approaches. These results demonstrate that CT-Net provides an effective balance between fine-grained local feature extraction, global contextual modeling, and computational efficiency for practical ASL recognition in resource-constrained settings. The main contributions of our work are as follows:
Architectural Innovation: We introduce a fusion mechanism that combines the convolutional inductive bias of ConvNeXt-Tiny with a lightweight Transformer encoder. Unlike generic models, this design balances the extraction of fine-grained finger-level details with broader contextual modeling, helping reduce confusion between visually similar gestures.
Methodological Advancement: We present a systematic evaluation across three distinct data configurations, including the MediaPipe-enhanced setting. This allows us to directly examine how well the model handles distribution shifts and gesture ambiguity. The results also provide practical evidence that appearance-based RGB features benefit from the addition of structural priors.
Comprehensive Experimental Validation: We perform a large-scale comparative evaluation across traditional CNNs, lightweight architectures, Transformer-based models, and recent hybrid vision frameworks, demonstrating that CT-Net achieves a strong balance between recognition accuracy and computational efficiency for real-time ASL recognition.
The paper is organized as follows: In Section 2, we summarize related works in the field of ASL recognition and discuss current research gaps. In Section 3, we focus on the background and theoretical framework. Section 4 describes the proposed methodology in detail, with a focus on the architecture of the hybrid model. In Section 5, an experimental evaluation of the proposed model is provided. Finally, we conclude the paper by discussing the proposed method, its challenges, and future directions.
2. Related Works
Existing research indicates that vision-based sign language recognition (SLR) frameworks have been analyzed across multiple dimensions. Cameras dominated data acquisition, while intrusive methods like gloves remained marginal. Most research has focused on static, isolated, and single-handed signs, whereas continuous recognition has received limited attention. Traditional neural networks and Support Vector Machines were the prevalent classification techniques, with Convolutional Neural Networks emerging as a recent trend. These findings highlight a historical preference for simpler tasks using non-intrusive visual sensors [
12]. Given this context, the present work focuses on fine-grained handshape classification of static American Sign Language (ASL) alphabet gestures. This section reviews image-based approaches under static conditions, emphasizing traditional techniques, CNN-based methods, Transformer-based models, and emerging hybrid architectures.
Before the emergence of deep learning approaches, sign language recognition predominantly relied on manually extracted global and meta features combined with classical machine learning classifiers. Darrell and Pentland applied the Dynamic Time Warping (DTW) algorithm, originally used in speech recognition, to perform gesture recognition based on temporal sequence matching [
13]. Starner et al. [
14] proposed a real-time ASL recognition system based on Hidden Markov Models (HMMs). In their framework, coarse descriptions of hand shape, orientation, and trajectory were extracted from video sequences and fed into HMMs for sentence-level continuous sign recognition. Grobel and Assan [
15] applied HMMs to isolated sign language recognition with a vocabulary of 262 signs. The system achieved recognition rates of up to 94% in signer-dependent experiments. Vogler and Metaxas [
16] pointed out that the use of HMMs alone imposed several limitations in ASL recognition, particularly the insufficiency of training data when modeling context-dependent HMMs. Pham Quoc Thang et al. experimentally evaluated SimpSVM and RVM on the Auslan and ASLID datasets. In addition, they compared these methods with several classical machine learning algorithms. J48 achieved an accuracy of 85.5%, Naïve Bayes obtained 70.8%, HMM reached 87.1%, AdaBoost achieved 93.6%, RVM reached 93.373%, and SimpSVM achieved 98.09% on the sign language recognition task. The authors concluded that SimpSVM and RVM could achieve good predictive performance for sign recognition [
17].
Recent studies in complex visual recognition tasks have also demonstrated the effectiveness of deep learning under challenging conditions, such as Mamba-based YOLOv10 mussel detection in turbid waters [
18] and lightweight EPRepSADet for railway catenary foreign object detection [
19]. Recent studies also highlight robustness and security, including DB-COVIDNet for backdoor mitigation and analyses of defense challenges [
20,
21]. Similarly, CNNs were among the earliest deep learning architectures to receive significant attention in vision-based sign language recognition research. CNNs fundamentally transformed the landscape of SLR by introducing automated feature learning. A key advantage of CNNs in this field is that they can reduce or even eliminate the complicated manual steps usually needed to extract features; Yang and Zhu [22] demonstrated this by achieving accurate recognition with images fed directly into the network. Here, convolutional layers apply learnable weights to the input images to create feature maps that capture important details [23]. Researchers have since extended this idea: 3D-CNNs, such as the 3D-ResNet architectures, model space and time jointly, which provides a fuller representation when analyzing dynamic sign language video clips [24]. In addition to end-to-end classification, CNNs were frequently employed as robust feature extractors to feed subsequent specialized classifiers, such as Support Vector Machines (SVMs), which helped enhance overall recognition accuracy [25]. This versatility was further supported by the internal layer hierarchy of CNNs; for instance, the internal convolutional layers could effectively function as filters to eliminate irrelevant information, such as minor skin-color noise, during the feature learning stage [26].
In the field of sign language research, several studies introduced self-attention mechanisms and Transformer-based models [
27]. In addition, Transformer models were employed in sign language production [
28]. Transformer and self-attention exhibit strong general modeling capacity and enable dynamic computation. Furthermore, a Transformer can capture arbitrary relationships through verification [
29], which is difficult for CNNs to achieve [
30]. In addition, Transformers complement convolution. According to [
31,
32], in CNNs, convolutional layers mainly focus on nearby pixels. To capture wider image regions and more complex spatial patterns, several layers are needed. On the other hand, a Transformer layer uses self-attention which enables it to directly compare different parts of an image and model long-range dependencies more easily. Inspired by the success of self-attention and Transformer architectures in natural language processing (NLP), recent studies have increasingly replaced traditional convolution modules with self-attention layers [
30]. One of the first models to apply Transformer layers directly to image patches was Vision Transformer (ViT) [
33]. Another model was Data-efficient image Transformers (DeiT) which improved training efficiency through knowledge distillation and optimization strategies [
34]. The Swin Transformer introduced a hierarchical architecture with shifted window mechanisms and achieved strong performance across various computer vision tasks [
35]. Although Transformers are a powerful approach for sequence modeling, they are not always directly suitable for SLR. Video action understanding is chronological, and future frames cannot be observed from the present one, which imposes limits [30]. Past studies note that end-to-end Sign2Text models struggle with long-term dependencies: video frame sequences are much longer than gloss sequences, which hurts direct video-to-text performance [36]. In addition, Transformer-based dual learning frameworks often require heavy computation. Training several high-parameter models at once can limit how much large-scale data can actually be used, especially when resources are tight [37].
Existing studies show there is often a trade-off between what CNNs and Transformers do best. On the one hand, convolutional layers capture fine-grained local patterns efficiently, but they need several stacked layers to cover wider image regions. On the other hand, Transformer-style self-attention lets models process global context and see all parts of the input at once, though this usually comes with a higher computational cost and heavier training demands. To bridge this gap, we introduce CT-Net, a hybrid ConvNeXt–Transformer architecture that combines the local feature-learning strength of convolution with the global reasoning ability of self-attention. By bringing these two approaches together, CT-Net captures both fine hand details and broader contextual information, improving static ASL alphabet recognition while keeping computational cost low.
4. Proposed Methodology
In this section we introduce the details of the proposed CT-Net architecture and its training procedure. CT-Net follows a cascaded hybrid design that combines the strengths of convolution and self-attention mechanisms. First, ConvNeXt-Tiny is used as a modern convolutional backbone to extract fine-grained local features. Then, a lightweight Transformer encoder is added to capture broader spatial dependencies across the image. The overall pipeline includes data preprocessing and augmentation, hybrid architecture design, loss optimization, and the evaluation protocol.
4.1. Data Preprocessing and Augmentation Strategy
To address common variations in sign language images such as lighting changes, scale differences, and viewing-angle variations, we designed a comprehensive data augmentation pipeline.
4.1.1. Training Dataset Processing
To increase data diversity and improve model generalization, we used a series of random data augmentation methods during the training procedure. We apply random-sized cropping within a predefined scale range and then resize the image to the target resolution. This helps the model to handle changes in camera distance and pushes the model to learn features that work across different scales. Random rotation is used to improve robustness against small hand or camera tilts. Color jitter is also applied to randomly change the brightness and contrast, simulating real-world lighting conditions. Finally, all images are normalized using the ImageNet mean and standard deviation to stabilize optimization and support faster convergence.
Table 1 shows the specific hyperparameters of this stochastic augmentation pipeline.
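The normalization step at the end of the augmentation pipeline can be illustrated with a minimal sketch. The per-channel constants below are the commonly used ImageNet statistics for RGB images scaled to [0, 1]; the helper name `normalize_pixel` is hypothetical and is not taken from the authors' code.

```python
# Widely used ImageNet channel statistics (RGB order, inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A black pixel maps to a negative value in every channel.
black = normalize_pixel((0, 0, 0))
# A pixel equal to the dataset mean maps to (approximately) zero.
mean_pixel = normalize_pixel(tuple(m * 255.0 for m in IMAGENET_MEAN))
print(black)
print(mean_pixel)
```

Because the same statistics are applied at training and test time, the pre-trained backbone receives inputs in the value range it was originally trained on.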
4.1.2. Testing Dataset Processing
During testing, we applied deterministic resizing to the target resolution, followed by the same normalization used for training. This makes the evaluation consistent and reproducible. For dataset loading, we used the PyTorch (version 2.10.0) DataLoader with mini-batches and multi-process loading to improve I/O efficiency and computational speed.
4.2. CT-Net Architecture Design
The main idea behind CT-Net is to combine the local feature-learning strength of convolution with the broader contextual modeling ability of self-attention. The overall data flow of the proposed architecture can be summarized as follows:
Given an input image $X$, CT-Net produces the predicted class probability $\hat{y}$.
Figure 1 illustrates each step of the architecture and its modular building blocks, showing how the input image is transformed step by step into the final class prediction.
4.2.1. Local Feature Extraction Backbone
In the first feature extraction stage, we use a ConvNeXt-Tiny model pre-trained on ImageNet-1K. Unlike traditional ResNet architectures, ConvNeXt adopts a modern design that includes depthwise convolution, an inverted bottleneck structure, and Layer Normalization. These components allow the model to learn hierarchical visual representations while preserving the efficiency of convolutional operations. In our implementation, we remove the global average pooling layer and the classification head, and retain only the feature extraction backbone. Given an input tensor $X \in \mathbb{R}^{B \times 3 \times H \times W}$, where $B$ denotes the batch size, the backbone produces a high-dimensional feature map $F \in \mathbb{R}^{B \times C \times H' \times W'}$. In our architecture, the channel dimension is $C = 768$, while the spatial resolution is reduced to $H' = 7$ and $W' = 7$. This feature representation encodes rich fine-grained local information, including finger contours, edge structures, and subtle hand texture patterns.
4.2.2. Feature Tokenization
To prepare the convolutional feature map for the Transformer encoder, we use a parameter-free flattening operation that connects the CNN backbone to the sequence-based Transformer module. Specifically, the ConvNeXt feature map is flattened along the spatial dimensions to produce a tensor of shape $B \times C \times N$. The tensor is then permuted into a sequence representation of shape $B \times N \times C$, where $N = 49$ is the sequence length (one token per spatial position of the $7 \times 7$ feature map) and $C = 768$ is the embedding dimension. This direct mapping strategy avoids extra projection layers, which preserves the semantic information learned by the convolutional features and matches the Transformer input shape $(B, N, C)$.
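The flatten-and-permute step can be sketched in plain Python on nested lists, for a single image without the batch dimension. The function name `tokenize_feature_map` and the toy channel count are illustrative assumptions; a framework implementation would use tensor reshaping instead.

```python
def tokenize_feature_map(fmap):
    """Turn a feature map indexed as [C][H][W] into tokens indexed as [H*W][C]."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    # Flatten the two spatial axes: [C][H][W] -> [C][N], with N = H * W.
    flat = [
        [fmap[c][i][j] for i in range(height) for j in range(width)]
        for c in range(channels)
    ]
    # Permute the axes: [C][N] -> [N][C], one token per spatial position.
    return [[flat[c][n] for c in range(channels)] for n in range(height * width)]


# Toy feature map with 3 channels on a 7 x 7 grid (CT-Net uses 768 channels).
fmap = [[[c * 100 + i * 7 + j for j in range(7)] for i in range(7)] for c in range(3)]
tokens = tokenize_feature_map(fmap)
print(len(tokens), len(tokens[0]))  # 49 3
```

No parameters are involved: every token is simply the channel vector found at one spatial location of the feature map.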
4.2.3. Global Context Modeling
In this step, we need to capture long-range dependencies between different parts of the hand, such as the spatial relationship between the thumb and the little finger. For this, the token sequence is passed to a lightweight Transformer encoder consisting of stacked Transformer encoder layers. Unlike 6-layer or 12-layer Transformers, this shallow configuration reduces computational cost and helps prevent overfitting on the medium-sized ASL dataset. Each encoder layer includes Multi-Head Self-Attention (MHSA). Through self-attention, each token can exchange information with every other token in the sequence, which enables the model to identify spatial regions that are important for gesture classification. After the attention module, a Feed-Forward Network (FFN) further refines these token representations by expanding the hidden dimension to 2048. The GELU activation function introduces nonlinearity and improves the model’s representational capacity. To keep optimization stable, each layer includes Dropout with a probability of 0.1, residual connections, and Layer Normalization. Together, these components make training more stable and help the model converge reliably. Combining ConvNeXt-Tiny with the Transformer encoder addresses the confusion between visually similar ASL gestures by blending detailed local hand features with larger spatial patterns. ConvNeXt-Tiny captures fine details, such as finger shape and texture changes, while the Transformer encoder connects distant parts of the hand through self-attention to capture broader relationships. Compared to the ConvNeXt-Tiny baseline, CT-Net improved F1-scores in several difficult categories: ‘A’, ‘E’, ‘I’, ‘K’, ‘S’, ‘U’, ‘V’, and ‘Y’.
Notably, the F1-score for ‘S’ increased from 76.85% to 86.49%, suggesting that the global contextual modeling introduced by the Transformer encoder enhanced the discriminative power for visually ambiguous gestures. Nevertheless, highly similar categories such as ‘M’ and ‘N’ remained challenging due to severe finger occlusion and subtle structural nuances.
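To make the token-to-token exchange concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is a simplification rather than the encoder used in CT-Net: queries, keys, and values are assumed to be the tokens themselves (no learned projections), and there is only one head, no FFN, and no normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections, so Q = K = V = tokens.
    Returns the attended tokens and the attention weight matrix.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token with every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of all tokens (global context).
        outputs.append(
            [sum(w * tok[c] for w, tok in zip(row, tokens)) for c in range(dim)]
        )
    return outputs, weights


# Toy sequence: 4 tokens with 3 features each (CT-Net uses 49 tokens of size 768).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

Each row of `weights` sums to one, so every output token is a convex combination of all input tokens, regardless of how far apart they lie on the spatial grid.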
4.2.4. Classification Head
The Transformer encoder produces a contextual token sequence $Z \in \mathbb{R}^{B \times 49 \times 768}$, where each token contains both local visual information and global contextual cues. Global Average Pooling is then applied along the token dimension to combine the 49 spatial tokens into a single 768-dimensional feature vector. This vector is then normalized using Layer Normalization and passed through a fully connected layer that projects it onto the 29 ASL alphabet categories. The resulting logits have a shape of $B \times 29$ and are used for gesture classification.
Table 2 summarizes the tensor dimensions throughout the CT-Net architecture pipeline.
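The shape bookkeeping along the pipeline can be sketched as follows. The 224 × 224 input size and the overall downsampling factor of 32 are assumptions chosen because they reproduce the 7 × 7 grid (49 tokens) described in the text; the function name `ctnet_shapes` is hypothetical.

```python
def ctnet_shapes(batch, height=224, width=224, channels=768, num_classes=29, stride=32):
    """Return (stage, shape) pairs for the CT-Net pipeline.

    Assumptions: 224 x 224 inputs and a total backbone stride of 32,
    which together yield the 7 x 7 feature grid described in the text.
    """
    grid_h, grid_w = height // stride, width // stride
    num_tokens = grid_h * grid_w
    return [
        ("input image", (batch, 3, height, width)),
        ("backbone feature map", (batch, channels, grid_h, grid_w)),
        ("flattened feature map", (batch, channels, num_tokens)),
        ("token sequence", (batch, num_tokens, channels)),
        ("encoder output", (batch, num_tokens, channels)),
        ("pooled feature vector", (batch, channels)),
        ("logits", (batch, num_classes)),
    ]


for stage, shape in ctnet_shapes(batch=48):
    print(f"{stage:24s} {shape}")
```

Only the backbone and the final linear layer change the feature dimensionality; flattening, permutation, and pooling merely rearrange or average existing values.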
4.3. Experimental Setup and Training Strategy
All experiments were conducted in a GPU-enabled computing environment and implemented using the PyTorch framework. To account for the characteristics of our hybrid architecture, we adopted a tailored optimization strategy to ensure stable convergence and robust generalization. The model was optimized using AdamW, which decouples weight decay from gradient-based parameter updates and is known to enhance convergence stability, particularly for architectures incorporating Transformer components. The initial learning rate was kept small, as Transformer modules are typically sensitive to large learning rates and benefit from smaller, more stable optimization steps. A weight decay of 0.05 was applied to regularize model complexity and mitigate overfitting. The batch size was set to 48 to balance GPU memory usage and stable gradient estimation, particularly because the Transformer encoder requires additional memory. Furthermore, because many ASL hand gestures are visually similar, the model was trained with cross-entropy loss combined with label smoothing. We set the label smoothing factor to 0.1 so that the target labels are not treated as strictly one-hot. This strategy reduces overconfidence, limits overfitting caused by noisy labels, and supports better generalization across similar ASL classes. The model was trained for 20 epochs. Since hybrid architectures may require more time to converge than pure CNN baselines, this provides enough time for optimization while keeping the overall training process efficient. A best-checkpoint saving policy was implemented, whereby the model state dictionary was saved only when the current evaluation accuracy surpassed the best previously observed performance. All final results are reported using the best-performing checkpoint obtained during training.
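The effect of label smoothing on the training target can be illustrated with a small sketch. It assumes the common convention in which the true class receives 1 − ε + ε/K and every other class receives ε/K, with ε = 0.1 and K = 29 classes; whether the authors' implementation follows exactly this convention is an assumption, and the function names are hypothetical.

```python
import math


def smoothed_targets(true_class, num_classes, eps=0.1):
    """Soft target distribution: eps is spread uniformly over all classes."""
    targets = [eps / num_classes] * num_classes
    targets[true_class] += 1.0 - eps
    return targets


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


def smoothed_cross_entropy(logits, true_class, eps=0.1):
    """Cross-entropy between the smoothed targets and the predicted distribution."""
    log_probs = log_softmax(logits)
    targets = smoothed_targets(true_class, len(logits), eps)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Toy example: 29 classes, the model strongly favors the correct class 0.
logits = [5.0] + [0.0] * 28
hard_loss = smoothed_cross_entropy(logits, true_class=0, eps=0.0)
soft_loss = smoothed_cross_entropy(logits, true_class=0, eps=0.1)
print(round(hard_loss, 4), round(soft_loss, 4))
```

With smoothing, a confident correct prediction still incurs a small loss from the probability mass assigned to the other classes, which discourages the model from pushing logits toward extreme values.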
4.4. Performance Evaluation Indicators
To evaluate CT-Net on the ASL alphabet recognition task, we used several standard classification metrics. Accuracy is the main metric and measures the overall proportion of correctly classified samples. To examine class-level performance, we also used Precision, Recall, and F1-score. Precision measures the proportion of samples predicted as a given class that actually belong to it; Recall measures the proportion of samples of a given class that the model correctly identifies; and the F1-score is the harmonic mean of the two. These metrics are especially useful for ASL recognition because some hand gestures are visually similar and may be confused by the model.
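As a minimal sketch of how these class-level metrics can be computed and macro-averaged (each class weighted equally), consider the following plain-Python helper. The function name `macro_metrics` and the toy labels are illustrative and not taken from the paper.

```python
def macro_metrics(y_true, y_pred, labels):
    """Macro-averaged precision, recall and F1 over the given class labels."""
    precisions, recalls, f1_scores = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    count = len(labels)
    return sum(precisions) / count, sum(recalls) / count, sum(f1_scores) / count


# Toy example with three visually similar letters.
y_true = ["M", "M", "N", "N", "S", "S"]
y_pred = ["M", "N", "N", "N", "S", "M"]
precision, recall, f1 = macro_metrics(y_true, y_pred, labels=["M", "N", "S"])
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.7222 0.6667 0.6556
```

In this toy case 'M' is confused in both directions, so its precision and recall are both 0.5, which pulls the macro averages down even though overall accuracy is 4/6.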
5. Experimentation Results and Analysis
5.1. Experimental Setup
To evaluate the performance of the proposed CT-Net, we conducted a wide range of experiments. We considered three separate data subsets and twelve comparative models including eight core baselines and four state-of-the-art lightweight architectures. In this section, we explain the details of dataset protocols and partitioning, implementation details and the promising results of our proposed mechanism.
5.1.1. Dataset Protocols and Partitioning
Since the official test set of the ASL alphabet dataset contains only one image per class, it is insufficient for a robust statistical evaluation. Therefore, we re-partitioned the original training pool (3000 images per class) into two segments to establish a statistically meaningful test set: a training source (the first 2000 images) and a testing source (the remaining 1000 images). To ensure a fair comparison, all baselines were re-trained and evaluated under this same protocol, so that performance gains reflect genuine generalization capabilities rather than partitioning artifacts. Based on this partition, three specialized sub-datasets were constructed:
Sub-dataset 1 (Sequential Sampling): This subset simulated a low-variance environment. We selected the first 500 images from the training source and the first 250 images from the testing source (indices 2001–2250). This sub-dataset served as a baseline to evaluate the model’s fundamental fitting ability under high temporal correlation.
Sub-dataset 2 (Uniform Stratified Sampling): To assess generalization across diverse conditions, we applied a uniform sampling strategy. We extracted every 4th image from both the training and testing sources to obtain 500 training and 250 testing samples per class. This sub-dataset provided a more comprehensive representation of the dataset’s overall distribution.
Sub-dataset 3 (MediaPipe-Enhanced Visual Fusion): Built upon the sampling logic of Sub-dataset 2, this version introduced a visual feature enhancement stage using the MediaPipe Hands framework. For images where hand landmarks were successfully detected, the skeletal topology (keypoints and connections) was explicitly rendered and overlaid onto the original RGB images. This process embedded geometric priors directly into the visual input, allowing the model to learn from both texture and structural cues simultaneously. For samples where landmark detection failed, typically due to underexposure or motion blur, the original RGB images were retained without modification to ensure data completeness. The failure rate of MediaPipe detection was 25.6%; this hybrid approach was adopted to respect the original data distribution in cases of poor image quality, while manual annotation could be considered in future work to further eliminate this modality inconsistency. By adopting this hybrid image synthesis strategy, Sub-dataset 3 maintained an identical sample size (500 training and 250 testing images per class) to the previous subsets. This controlled setup enabled precise quantification of the impact of visually integrated structural information on CT-Net, eliminating interference from data volume variations. An example of the landmark overlay is illustrated in
Figure 2.
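The per-class index selection behind Sub-dataset 1 and Sub-dataset 2 can be sketched as follows. Image indices are assumed to be 1-based (consistent with the range 2001–2250 quoted above), and the uniform sampling is assumed to start at the first image of each source; both choices, like the function names, are illustrative.

```python
def split_sources(per_class=3000, train_pool=2000):
    """Split 1-based image indices into training and testing sources."""
    train_source = list(range(1, train_pool + 1))              # 1 .. 2000
    test_source = list(range(train_pool + 1, per_class + 1))   # 2001 .. 3000
    return train_source, test_source


def sequential_subset(train_source, test_source, n_train=500, n_test=250):
    """Sub-dataset 1: take the first images of each source."""
    return train_source[:n_train], test_source[:n_test]


def uniform_subset(train_source, test_source, step=4):
    """Sub-dataset 2: take every 4th image of each source (assumed offset 0)."""
    return train_source[::step], test_source[::step]


train_source, test_source = split_sources()
seq_train, seq_test = sequential_subset(train_source, test_source)
uni_train, uni_test = uniform_subset(train_source, test_source)
print(len(seq_train), len(seq_test), seq_test[0], seq_test[-1])
print(len(uni_train), len(uni_test), uni_train[:3], uni_test[:3])
```

Both strategies yield 500 training and 250 testing images per class, so any accuracy difference between the two sub-datasets comes from which images are selected rather than from how many.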
5.1.2. Evaluation Metrics
We evaluated model performance using accuracy, Precision, Recall, and F1-score. To treat all 29 classes fairly and reduce bias toward more frequent ones, we used macro-averaging.
5.1.3. Implementation Details
All experiments were implemented in PyTorch and executed on GPU-accelerated hardware. Each input image was resized to $224 \times 224$ pixels. During training, we applied geometric and photometric distortions, including random resized cropping, rotations, and color jittering, followed by standard normalization. To ensure a fair comparison and reduce overfitting, we initialized all deep learning models with pre-trained ImageNet weights. For the hybrid CNN + SVM and CNN + XGBoost baselines, the CNN component was used as a fixed feature extractor to obtain general spatial representations, without further fine-tuning. All models were trained using the AdamW optimizer with the learning rate and the weight decay of 0.05 described in Section 4.3. We minimized cross-entropy loss with label smoothing, using a batch size of 48 for 20 epochs. The same training protocol was applied across all models to support consistency and reproducibility; however, some architectures may achieve stronger performance with model-specific hyperparameter tuning.
5.1.4. Model Architectures and Baselines
To assess how well CT-Net works, we selected eight models that cover four main types of visual recognition architectures. The models were chosen to be diverse in their design, computational requirements, and feature extraction approach.
End-to-End Deep Learning Models:
ResNet18: This classic residual convolutional network uses skip connections to solve the vanishing gradient problem. We chose it as a standard CNN baseline to represent traditional feature extraction.
ConvNeXt-Tiny: This is a more recent CNN inspired by Transformers. It uses large $7 \times 7$ kernels, LayerNorm, and inverted bottleneck structures, and it represents the latest advances in convolutional networks.
MobileNetV3: A lightweight CNN designed for mobile and edge devices. It uses depthwise separable convolutions and neural architecture search (NAS) to compute efficiently while keeping accuracy.
EfficientNet-B0: This CNN uses compound scaling to optimize network depth, width, and resolution all at once. It strikes a strong balance between accuracy and computational cost.
Vision Transformer (ViT): This is a purely Transformer-based model that treats images as sequences of patches, using self-attention mechanisms instead of any convolutional operations. It fully embraces global context modeling.
CT-Net: This proposed hybrid architecture combines ConvNeXt-Tiny as the local feature extractor with a Transformer Encoder to capture global relationships. By merging these two, it leverages the unique strengths of each architectural paradigm.
Two-Stage Hybrid Machine Learning Models:
CNN + SVM: This uses a pre-trained ResNet18 to extract features, which are then passed to a Support Vector Machine (SVM) classifier with an RBF kernel. It follows the classic machine learning pipeline, enhanced with deep features.
CNN + XGBoost: Like CNN + SVM, but swaps the SVM classifier for XGBoost (eXtreme Gradient Boosting), which relies on ensemble decision trees to boost discriminative power.
All deep learning models started with pre-trained weights and were fine-tuned end-to-end on the ASL datasets. For the two-stage hybrid models, feature extraction was done with a frozen ResNet18 backbone to ensure a fair comparison with the end-to-end methods.
5.2. Overall Performance Comparison
5.3. Analysis of Data Quality and Distribution
5.3.1. Generalization Ability Across Data Distributions
To evaluate the robustness of different architectures against data distribution shifts, we compared model performance between Sub-dataset 1 and Sub-dataset 2. As shown in Figure 3, most models demonstrated improved accuracy on Sub-dataset 2, indicating that the uniform sampling strategy provides a more representative data distribution that facilitates better feature learning.
The proposed CT-Net achieved the highest accuracy on both subsets (92.69% on Sub-dataset 1; 94.90% on Sub-dataset 2), an improvement of 2.21 percentage points when moving to the more diverse Sub-dataset 2. This consistent performance suggests that our hybrid architecture learns generalized features rather than overfitting to the sequential patterns present in Sub-dataset 1. Notably, the CNN baselines exhibited substantial performance gains from Sub-dataset 1 to Sub-dataset 2. ResNet18 improved from 80.50% to 89.90%, a gain of 9.40 percentage points. EfficientNet-B0 showed an even larger gain of 17.92 points, from 64.22% to 82.14%, and MobileNetV3 rose by 7.45 points, from 66.55% to 74.00%. In contrast, ViT (pure Transformer) dropped by 2.43 points, from 74.47% to 72.04%, which suggests that pure attention-based models can be sensitive to changes in data distribution when they lack the inductive bias that convolutional layers offer. The two-stage hybrid models (CNN + SVM and CNN + XGBoost) showed comparatively small gains of 3.40 and 2.81 points, respectively, which is consistent with the idea that fixed feature extractors have limited capacity to adapt to diverse data. Overall, CT-Net delivered stable and strong performance under both sampling strategies, indicating resilience to shifts in data distribution, a property that is important for real-world applications where data conditions vary.
5.3.2. Sensitivity to Data Quality Enhancement
To quantify the impact of data quality on model performance, we compared results between Sub-dataset 2 and Sub-dataset 3. As described in
Section 5.1.1, Sub-dataset 3 incorporates structural landmark information for high-quality samples while retaining original RGB images for samples with failed detection, maintaining identical sample sizes across both subsets. As illustrated in
Figure 4, the proposed CT-Net achieved the highest overall performance on Sub-dataset 3 with 95.67% accuracy, representing a +0.77% gain from Sub-dataset 2 (94.90%). This result indicates that hybrid architectures can effectively leverage structural landmarks when available, while the ConvNeXt backbone ensures robust feature extraction from raw RGB inputs when landmarks are unavailable. This observation aligns with recent studies demonstrating the effectiveness of structured representations in improving model performance [
52].
ConvNeXt-Tiny improved by +2.04% (from 92.76% to 94.80%), confirming that modern CNN architectures benefit from the structural enhancement. ResNet18 exhibited a modest gain of +1.12% (from 89.90% to 91.02%), while MobileNetV3 showed the largest improvement among all models, +3.77% (from 74.00% to 77.77%), suggesting that lightweight models may benefit disproportionately from additional structural guidance. Conversely, several models experienced performance degradation on Sub-dataset 3. ViT suffered the largest drop (−16.65%, from 72.04% to 55.39%), indicating that pure Transformer architectures struggle to utilize the hybrid landmark-RGB representation without convolutional inductive bias. EfficientNet-B0 also decreased by 4.32% (from 82.14% to 77.82%), suggesting a potential incompatibility between its compound-scaled feature extraction and the structural enhancement pipeline. The two-stage approaches (CNN + SVM and CNN + XGBoost) showed only small changes (−1.83% and −0.10%, respectively), as their frozen feature extractor cannot adapt to the enhanced structural information.
These findings demonstrate that data quality enhancement through structural landmark integration benefits architectures capable of end-to-end adaptation. CT-Net’s superior performance confirms its ability to effectively fuse local RGB features with global structural information.
5.4. Per-Class Performance Analysis
To further evaluate the discriminative capacity of CT-Net across individual ASL gestures, a detailed per-class analysis was conducted on Sub-dataset 3, where the model reached its peak performance (95.67% accuracy).
Figure 5 illustrates the Precision, Recall, and F1-score for each of the 29 classes.
CT-Net demonstrated strong classification performance, achieving perfect or near-perfect scores (F1 ≥ 0.99) in 15 of the 29 categories, including ‘F’, ‘L’, ‘Q’, ‘T’, ‘nothing’, and ‘space’. These results underscore the model’s robustness in extracting features from gestures with highly distinct hand configurations. Nevertheless, performance dropped in four challenging categories, primarily due to inherent visual ambiguities in the ASL alphabet:
Class ‘M’ (F1 = 0.54): This category yielded the lowest Recall (0.38), indicating a strong tendency of the model to misclassify ‘M’ as other morphologically similar gestures.
Class ‘N’ (F1 = 0.76): This class exhibited diminished Precision (0.62), suggesting that samples from other visually similar classes were frequently predicted as ‘N’.
Class ‘S’ (F1 = 0.86): A moderate decline in Recall (0.77) was observed, attributed to its high visual similarity with other closed-fist gestures.
Class ‘V’ (F1 = 0.88): An imbalance between Precision and Recall was noted, likely stemming from the sensitivity of finger orientation features within this category.
The systematic confusion, particularly between ‘M’ and ‘N’, aligns with the objective difficulty of ASL recognition; these two gestures are differentiated only by the subtle positioning of the thumb between different fingers. Such error patterns suggest that the remaining misclassifications are driven by inter-class similarity and fine-grained structural nuances rather than inherent limitations of the CT-Net architecture. These specific challenges could be addressed in future studies through targeted data augmentation or higher-resolution landmark integration. Overall, CT-Net maintained F1-scores above 0.85 for 86.2% of the classes (25 out of 29), supporting its reliability and efficacy for practical sign language recognition deployments.
5.5. Cross-Dataset Convergence Analysis
We evaluated how stable the training was and how well the models converged on three different sub-datasets.
Figure 6 shows the evolution of accuracy and loss over 20 training epochs for five deep learning models. CT-Net and ConvNeXt-Tiny consistently performed best, with the highest accuracy across all sub-datasets. On Sub-dataset 1, both models reached approximately 89–92% accuracy (
Figure 6a). On Sub-dataset 2, CT-Net peaked at 94.9% at epoch 17, and ConvNeXt-Tiny stayed above 90% accuracy throughout (
Figure 6c). For Sub-dataset 3, both CT-Net and ConvNeXt-Tiny remained above 93% accuracy during the entire training process (
Figure 6e). ResNet18 achieved intermediate accuracy, between 80% and 90%, whereas MobileNetV3 and EfficientNet-B0 lagged behind, with accuracy between 60% and 80% and slower convergence. The accuracy comparison across all three sub-datasets indicates that our hybrid architecture design captures discriminative features regardless of the data representation method.
The training loss convergence further validates the stability of all models during the learning process. As shown in the loss comparison plots, all models demonstrated rapid convergence during the initial training phase, with the training loss decreasing sharply within the first 2–3 epochs before stabilizing around 0.64–0.65. This pattern is clearly visible in
Figure 6b for Sub-dataset 1,
Figure 6d for Sub-dataset 2, and
Figure 6f for Sub-dataset 3. The consistent loss convergence across all three sub-datasets indicates that the models quickly learned effective feature representations. Furthermore, the minimal fluctuation in loss values after epoch 5 suggests that 20 epochs were sufficient for training across all architectures. Although CT-Net requires a longer per-epoch training time (~275 s vs. ~82 s for ResNet18), its superior accuracy and convergence stability across all six sub-plots support the effectiveness of our hybrid architecture design, indicating a favorable balance between accuracy, stability, and convergence speed in epochs on diverse data representations.
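One simple way to check a statement such as "the loss fluctuates minimally after epoch 5" is to find the first epoch from which the loss stays within a small band around its final value. The sketch below illustrates this idea; the function name `plateau_epoch`, the tolerance, and the loss values are all hypothetical and are not taken from the training logs of this study.

```python
def plateau_epoch(losses, tolerance=0.015):
    """Return the first 1-based epoch from which every later loss value
    stays within `tolerance` of the final loss value."""
    final = losses[-1]
    epoch = len(losses)
    # Walk backwards from the last epoch until a value leaves the band.
    for i in range(len(losses) - 1, -1, -1):
        if abs(losses[i] - final) > tolerance:
            break
        epoch = i + 1
    return epoch


# Hypothetical 20-epoch loss curve: a sharp early drop, then a plateau
# near 0.64-0.65 (made-up values for illustration only).
losses = [1.90, 1.05, 0.72, 0.67, 0.65, 0.65] + [0.64, 0.65] * 7

print("loss plateaus from epoch", plateau_epoch(losses))
```

A criterion of this kind depends on the chosen tolerance, so it is best read alongside the curves themselves rather than as a replacement for them.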
5.6. Computational Complexity and Inference Efficiency Analysis
To further evaluate the practical deployment potential of the proposed CT-Net, we compared the computational complexity and inference efficiency of different models in terms of parameters, Floating Point Operations (FLOPs), inference latency, and FPS. As shown in
Table 4, CT-Net contained 38.87 M parameters and required 4.77 G FLOPs, which were moderately higher than ConvNeXt-Tiny but substantially lower than ViT. Despite the additional Transformer encoder, CT-Net achieved an inference latency of only 6.11 ms per image and a processing speed of 163.55 FPS, indicating that the proposed model maintained real-time inference capability. Compared with ViT, CT-Net reduced the parameter count by 54.71% and FLOPs by 57.75%, while achieving higher FPS. As shown in
Table 2, although ResNet18 and ConvNeXt-Tiny showed slightly faster inference speeds, CT-Net provided a better trade-off between recognition accuracy and computational efficiency. In contrast, MobileNetV3 and EfficientNet-B0 had fewer parameters and lower FLOPs, but their inference latency was unexpectedly higher, resulting in lower FPS. The two-stage CNN + SVM approach also suffered from increased latency due to the additional classifier stage. These results demonstrated that CT-Net achieved competitive computational efficiency while preserving superior classification performance, supporting its suitability for real-time ASL recognition applications.
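Latency and FPS are two views of the same measurement: a per-image latency of t milliseconds corresponds to roughly 1000/t frames per second. The sketch below shows this conversion together with a generic timing loop. It is not the benchmarking procedure of the study; `mean_latency_ms`, its warm-up and run counts, and the stand-in workload are assumptions for illustration. For GPU inference, the device would additionally need to be synchronized before each clock reading.

```python
import time


def fps_from_latency_ms(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


def mean_latency_ms(fn, sample, warmup=10, runs=100):
    """Average wall-clock time of fn(sample) in milliseconds.

    Warm-up calls are excluded so that one-time setup costs do not
    distort the average.
    """
    for _ in range(warmup):
        fn(sample)
    start = time.perf_counter()
    for _ in range(runs):
        fn(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


# The reported CT-Net latency of 6.11 ms corresponds to about 163.7 FPS,
# close to the reported 163.55 FPS.
print(f"{fps_from_latency_ms(6.11):.2f} FPS")

# Stand-in workload (hypothetical), used only to exercise the timing loop.
latency = mean_latency_ms(lambda x: sum(v * v for v in x), list(range(1000)))
print(f"stand-in workload: {latency:.4f} ms per call")
```

The small gap between 1000/6.11 and the reported FPS may come from rounding or from how the averages were taken; the section does not specify this.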
5.7. Ablation Study
An ablation study was conducted to examine the influence of Transformer configurations in CT-Net by varying the number of encoder layers and attention heads while keeping all other settings unchanged. As shown in
Table 5, the default configuration with two layers and eight heads achieved the best performance, reaching 96.59% Precision, 95.67% Recall, and 95.43% F1-score. Here, Precision is the proportion of true positive predictions among all samples predicted as positive.
Reducing the encoder depth to one layer slightly decreased the performance, with the F1-score dropping to 95.18%. Increasing the depth to three and four layers caused more obvious degradation, with the F1-score decreasing to 93.66% and 92.38%, respectively, indicating that excessive depth introduced unnecessary complexity. Similarly, changing the number of attention heads from eight to four or twelve also reduced performance, with F1-scores of 93.80% and 93.28%, respectively. These results indicate that the default configuration provides the most effective trade-off between global feature modeling and stable training behavior.
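The complexity added by deeper encoders can be made tangible with a rough parameter count. The sketch below assumes a standard Transformer encoder layer (multi-head self-attention, a two-layer feed-forward block, and two layer normalizations, all with biases). The embedding width of 768 and the feed-forward width of 2048 are assumptions chosen for illustration; this section does not state CT-Net's actual values.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameter count of one standard Transformer encoder layer."""
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    attention = 4 * d_model * d_model + 4 * d_model
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, with biases.
    feed_forward = 2 * d_model * d_ff + d_ff + d_model
    # Two layer normalizations, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    return attention + feed_forward + layer_norms


D_MODEL = 768   # assumed embedding width (not stated in this section)
D_FF = 2048     # assumed feed-forward width (not stated in this section)

per_layer = encoder_layer_params(D_MODEL, D_FF)
for depth in (1, 2, 3, 4):
    print(f"{depth} layer(s): {depth * per_layer / 1e6:.2f} M parameters")

# In standard multi-head attention the head count only partitions d_model,
# so it must divide d_model and does not change the parameter count.
for heads in (4, 8, 12):
    print(f"{heads} heads -> head dimension {D_MODEL // heads}")
```

Under these assumptions, the encoder's parameter count grows linearly with depth, whereas varying the number of heads changes only the per-head dimension. This is one way to read why depth and head count are ablated separately.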
5.8. Additional Comparison with Recent Lightweight Architectures
To further assess the effectiveness of CT-Net, we compared it with several state-of-the-art lightweight and hybrid architectures: MobileViT-S, MobileViT-v2, EfficientFormer-L1, and FastViT-SA12.
Table 6 summarizes the quantitative results.
Among the models in this comparison, CT-Net delivered the lowest inference latency at 6.11 ms and the highest throughput at 163.55 FPS. Even though CT-Net had more parameters and higher FLOPs than the lightweight baselines, it provided faster real-time inference. This suggests that CT-Net’s design is more hardware-friendly and better suited to parallel execution. In terms of recognition performance, CT-Net outperformed every competing model on all metrics. Of the lightweight models, EfficientFormer-L1 came closest. Conversely, MobileViT-v2 showed relatively lower accuracy in these experiments. This discrepancy was likely attributable to the unified training schedule of 20 epochs employed for all models; MobileViT-v2 may require a longer training schedule or more specialized hyperparameter tuning to reach its best performance. Overall, CT-Net offers a stronger balance between recognition accuracy and inference efficiency, which makes it a strong choice for practical applications that require both precise classification and real-time processing.
6. Conclusions
This study introduces CT-Net, a hybrid deep learning architecture for American Sign Language (ASL) alphabet recognition, with a particular focus on distinguishing visually similar hand gestures. CT-Net combines the local feature extraction strength of ConvNeXt-Tiny with the global context modeling ability of a lightweight Transformer encoder. By integrating these two components, CT-Net addresses key limitations of approaches that rely solely on convolution or attention mechanisms. Across different experimental settings, CT-Net consistently outperforms strong baseline models, including ResNet18, MobileNetV3, and standard Vision Transformers. On the MediaPipe-enhanced dataset, CT-Net achieves its highest accuracy of 95.67%. Ablation studies further show that a shallow two-layer Transformer encoder provides the best balance between modeling capacity and training stability. The efficiency analysis and the comparison with recent lightweight architectures show that, although CT-Net has more parameters than some lightweight networks, it achieves faster inference, reaching a throughput of 163.55 FPS, which makes it a promising choice for real-time applications. Some limitations remain. CT-Net still confuses classes such as ‘M’ and ‘N’, which are hard to tell apart because the main difference is the number of downward-pointing fingers. Self-occlusion and low inter-class variation obscure these subtle details, especially at different hand angles. In future work, we plan to strengthen CT-Net’s robustness across more varied environments and larger datasets. We also plan to extend the model beyond static image recognition to temporal video streams by adding layers that track the motion of hand gestures over time. This step would allow the architecture to support continuous sign language recognition and sentence-level interpretation, paving the way for more practical, real-time assistive communication tools.