CT-Net: A Hybrid ConvNeXt–Transformer Approach for ASL Alphabet Classification

Yang, Zhuofan; Lu, Houjin; Shamshiri, Samaneh

doi:10.3390/app16105168


by Zhuofan Yang ¹, Houjin Lu ²,* and Samaneh Shamshiri ³

¹ Department of Integrated Arts, Silla University, Busan 46958, Republic of Korea
² Division of Electronics and Electrical Engineering, Dongguk University, Seoul 04620, Republic of Korea
³ Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5168; https://doi.org/10.3390/app16105168 (registering DOI)

Submission received: 23 February 2026 / Revised: 8 May 2026 / Accepted: 20 May 2026 / Published: 21 May 2026


Abstract

Recognition of the American Sign Language (ASL) alphabet is of utmost importance in bridging the communication gap between the hearing-impaired and the hearing. However, robust classification remains difficult because some hand gestures are morphologically very similar. To address this problem, this study presents CT-Net, a hybrid deep learning architecture that integrates ConvNeXt-Tiny with a lightweight Transformer encoder. CT-Net combines convolutional feature extraction and self-attention mechanisms, which enable it to capture fine-grained local patterns and long-range spatial dependencies effectively. The proposed model was extensively compared with various architectures including traditional CNNs, Transformer-based models, hybrid machine-learning approaches and recent lightweight hybrid networks. The experimental results show that CT-Net achieved the best overall performance with a peak accuracy of 95.67% on the enhanced ASL dataset. Ablation studies demonstrate the effectiveness of our design choice. CT-Net achieves a strong trade-off between recognition accuracy and computational efficiency with an inference rate of 163.55 Frames Per Second (FPS). These findings highlight the potential of hybrid frameworks as a powerful tool for fine-grained gesture recognition tasks.

Keywords:

American Sign Language; deep learning; gesture recognition; ConvNeXt; Transformer

1. Introduction

Sign language serves as the principal mode of communication for individuals with hearing and speech impairments. It conveys information through structured hand gestures and visual expressions, enabling non-verbal interaction and emotional expression. However, people who are not familiar with sign language often find it difficult to understand, which creates communication barriers [1]. According to available statistics, approximately 5% of the global population, or about 466 million people, are affected by hearing impairment, including around 34 million children. Therefore, members of the hearing-impaired community rely on sign language to express their needs and communicate effectively with both hearing-impaired and hearing individuals in essential areas such as education, healthcare, and employment [2].

With the progress of computer technology and the enhancement of hardware systems, human–computer interaction (HCI) has become an integral part of everyday life. Hand gesture recognition (HGR), as an important component of HCI, has become an active research topic in recent years and plays a significant role in modern technological applications. It is widely used in areas such as virtual reality, robot control, interactive gaming, and natural user interfaces. Among these applications, sign language recognition, which belongs to the human communication system, represents a major and important application domain of hand gesture recognition [3]. With the rapid development of modern technologies, there is growing potential to utilize these tools to create new approaches for sign language learning [4].

ASL is an alternative, non-verbal communication method based on hand gestures that allows deaf people to convey information effectively and to enhance their ability to express themselves [5]. However, without automated recognition systems, effective communication with hearing individuals remains a challenge, especially because diverse hand poses and gestures must be recognized accurately. In addition, the visual similarity among certain signs further increases the complexity of building reliable recognition models [6]. Therefore, this study focuses on sign language alphabet recognition, as alphabet letters form the fundamental components of any language. To verify the proposed approach, experiments are performed on the publicly available ASL Alphabet Dataset hosted on Kaggle (https://www.kaggle.com/grassknoted/asl-alphabet) (accessed on 20 April 2026).

Early studies on sign language alphabet recognition mainly relied on handcrafted features and traditional classifiers. For example, Sharma et al. employed skin-color-based hand segmentation and a contour tracing descriptor, followed by k-NN and multi-class SVM for classification. Although SVM achieved better performance, noticeable confusion still occurred among visually similar alphabet gestures, indicating the remaining challenges in recognizing ambiguous hand shapes [7]. Convolutional Neural Networks (CNNs) were subsequently applied extensively to image-based ASL recognition. Bheda and Radpour used a deep CNN to classify still images of ASL letters and numbers. They reached 82.5% accuracy on a controlled dataset, but performance dropped sharply on their own photos, where lighting and skin tone varied considerably [8]. Garcia and Viesca took a different approach and built a real-time fingerspelling system using transfer learning; it worked, but reached only 72% accuracy on some basic classification tasks [9]. In recent years, researchers have moved beyond local feature extraction toward modeling the input more globally. Ferreira et al. [10] proposed a Transformer-based contrastive learning model that analyzed sequences of body key points. Although it achieved nearly 90% accuracy in few-shot recognition tests, significant gaps remain unaddressed. CNNs are effective at capturing local patterns, but they struggle to model the long-range, global relationships needed to recognize complex signs. In contrast, Transformers capture global context more effectively; however, they are computationally expensive and, because they lack strong inductive biases, require large amounts of training data. This makes them difficult to apply effectively to smaller ASL datasets.

Furthermore, many prior studies rely heavily on pure CNNs, Transformers, or existing hybrid frameworks such as HCformer [11]. The primary focus of these works is on complex shifted-window mechanisms for global denoising. Therefore, lightweight hybrid CNN–Transformer models that balance fine-grained local feature extraction with global contextual understanding remain limited. To address this gap, we propose CT-Net, a lightweight, structure-aware hybrid architecture for ASL recognition.

To evaluate the effectiveness of the proposed CT-Net, we conducted extensive comparative experiments across multiple model families and data configurations. CT-Net achieves strong performance, reaching 95.67% accuracy, and consistently outperforms baselines, including traditional CNNs such as ResNet18; modern convolutional architectures such as ConvNeXt-Tiny; lightweight models such as MobileNetV3 and EfficientNet-B0; pure Transformers such as ViT; hybrid machine learning methods such as CNN + SVM and CNN + XGBoost; and recent lightweight vision models including MobileViT-S, MobileViT-v2, EfficientFormer-L1, and FastViT-SA12. In addition to accuracy, we evaluate computational efficiency and real-time suitability, showing that CT-Net preserves the speed advantages of lightweight CNN-based models while achieving stronger recognition performance than heavier Transformer-based approaches. These results demonstrate that CT-Net provides an effective balance between fine-grained local feature extraction, global contextual modeling, and computational efficiency for practical ASL recognition in resource-constrained settings. The main contributions of our work are as follows:

Architectural Innovation: We introduce a fusion mechanism that combines the convolutional inductive bias of ConvNeXt-Tiny with a lightweight Transformer encoder. Unlike generic models, this design balances the extraction of fine-grained finger-level details with broader contextual modeling, helping reduce confusion between visually similar gestures.
Methodological Advancement: We present a systematic evaluation across three distinct data configurations, including the MediaPipe-enhanced setting. This allows us to directly examine how well the model handles distribution shifts and gesture ambiguity. The results also provide practical evidence that appearance-based RGB features benefit from the addition of structural priors.
Comprehensive Experimental Validation: We perform a large-scale comparative evaluation across traditional CNNs, lightweight architectures, Transformer-based models, and recent hybrid vision frameworks, demonstrating that CT-Net achieves a strong balance between recognition accuracy and computational efficiency for real-time ASL recognition.

The paper is organized as follows: In Section 2, we summarize some related works in the field of ASL recognition and discuss current research gaps. In Section 3, we focus on the background and theoretical framework. Section 4 describes the proposed methodology in detail, with a focus on the architecture of the hybrid model. In Section 5, an experimental evaluation of the proposed model is provided. Finally, we wrap up this paper by discussing the proposed method, its challenges, and future direction.

2. Related Works

Existing research indicates that vision-based sign language recognition (SLR) frameworks have been analyzed across multiple dimensions. Cameras dominated data acquisition, while intrusive methods like gloves remained marginal. Most research has focused on static, isolated, and single-handed signs, whereas continuous recognition has received limited attention. Traditional neural networks and Support Vector Machines were the prevalent classification techniques, with Convolutional Neural Networks emerging as a recent trend. These findings highlight a historical preference for simpler tasks using non-intrusive visual sensors [12]. Given this context, the present work focuses on fine-grained handshape classification of static American Sign Language (ASL) alphabet gestures. This section reviews image-based approaches under static conditions, emphasizing traditional techniques, CNN-based methods, Transformer-based models, and emerging hybrid architectures.

Before the emergence of deep learning approaches, sign language recognition predominantly relied on manually extracted global and meta features combined with classical machine learning classifiers. Darrell and Pentland applied the Dynamic Time Warping (DTW) algorithm, originally used in speech recognition, to perform gesture recognition based on temporal sequence matching [13]. Starner et al. [14] proposed a real-time ASL recognition system based on Hidden Markov Models (HMMs). In their framework, coarse descriptions of hand shape, orientation, and trajectory were extracted from video sequences and fed into HMMs for sentence-level continuous sign recognition. Grobel and Assan [15] applied HMMs to isolated sign language recognition with a vocabulary of 262 signs. The system achieved recognition rates of up to 94% in signer-dependent experiments. Vogler and Metaxas [16] pointed out that the use of HMMs alone imposed several limitations in ASL recognition, particularly the insufficiency of training data when modeling context-dependent HMMs. Pham Quoc Thang et al. experimentally evaluated SimpSVM and RVM on the Auslan and ASLID datasets. In addition, they compared these methods with several classical machine learning algorithms. J48 achieved an accuracy of 85.5%, Naïve Bayes obtained 70.8%, HMM reached 87.1%, AdaBoost achieved 93.6%, RVM reached 93.373%, and SimpSVM achieved 98.09% on the sign language recognition task. The authors concluded that SimpSVM and RVM could achieve good predictive performance for sign recognition [17].

Recent studies in complex visual recognition tasks have also demonstrated the effectiveness of deep learning under challenging conditions, such as Mamba-based YOLOv10 mussel detection in turbid waters [18] and the lightweight EPRepSADet for railway catenary foreign object detection [19]. Other recent studies highlight robustness and security, including DB-COVIDNet for backdoor mitigation and analyses of defense challenges [20,21]. In sign language recognition, CNNs were among the earliest deep learning architectures to receive significant attention, and they fundamentally transformed the landscape of SLR by introducing automated feature learning. A major advantage of CNNs in this field is that they reduce or even eliminate the complicated manual steps usually needed to extract features; Yang and Zhu [22] showed this by achieving accurate recognition with images fed directly into the network. Here, convolutional layers apply learnable weights to the input images to create feature maps that capture important details [23]. Researchers have since gone further: by adopting 3D-CNNs, such as 3D-ResNet architectures, they model space and time jointly, which gives a fuller representation when analyzing dynamic sign language video clips [24]. In addition to end-to-end classification, CNNs were frequently employed as robust feature extractors to feed subsequent specialized classifiers, such as Support Vector Machines (SVMs), which helped enhance overall recognition accuracy [25]. This versatility was further supported by the internal layer hierarchy of CNNs; for instance, the internal convolutional layers could effectively function as filters to eliminate irrelevant information, such as minor skin-color noise, during the feature learning stage [26].

In the field of sign language research, several studies introduced self-attention mechanisms and Transformer-based models [27]. In addition, Transformer models were employed in sign language production [28]. Transformer and self-attention exhibit strong general modeling capacity and enable dynamic computation. Furthermore, a Transformer can capture arbitrary relationships through verification [29], which is difficult for CNNs to achieve [30]. In addition, Transformers complement convolution. According to [31,32], in CNNs, convolutional layers mainly focus on nearby pixels. To capture wider image regions and more complex spatial patterns, several layers are needed. On the other hand, a Transformer layer uses self-attention which enables it to directly compare different parts of an image and model long-range dependencies more easily. Inspired by the success of self-attention and Transformer architectures in natural language processing (NLP), recent studies have increasingly replaced traditional convolution modules with self-attention layers [30]. One of the first models to apply Transformer layers directly to image patches was Vision Transformer (ViT) [33]. Another model was Data-efficient image Transformers (DeiT) which improved training efficiency through knowledge distillation and optimization strategies [34]. The Swin Transformer introduced a hierarchical architecture with shifted window mechanisms and achieved strong performance across various computer vision tasks [35]. Although Transformers are a powerful approach for sequence modeling, they are not always directly suitable for SLR. Video action understanding is chronological, and you cannot see future frames from the present, which creates limits [30]. Past studies note that end-to-end Sign2Text models struggle with long-term dependencies—video frame sequences are much longer than gloss sequences, which hurts direct video-to-text performance [36]. 
In addition, Transformer-based dual learning frameworks often require heavy computation. Training several high-parameter models at once can limit how much large-scale data can actually be used, especially when resources are constrained [37].

Existing studies show that there is often a trade-off between the strengths of CNNs and Transformers. On the one hand, CNNs capture local patterns efficiently but struggle to model long-range dependencies. On the other hand, Transformer-style self-attention lets models process global context and attend to all parts of the input at once, though this usually comes with a higher computational cost and heavier training demands. To bridge this gap, we introduce CT-Net, a hybrid ConvNeXt–Transformer architecture that combines the local feature-learning strength of convolution with the global reasoning ability of self-attention. By bringing these two approaches together, CT-Net captures both fine hand details and broader contextual information, improving static ASL alphabet recognition while keeping computational cost low.

3. Background

3.1. Feature Extraction Paradigms in Computer Vision

Conventional computer vision methods relied heavily on hand-crafted features, such as filters designed to detect edges, corners, or textures. The state-of-the-art performance of deep learning models reduced this dependence on manual feature design and made it possible to scale computer vision models to a wide range of academic and industrial tasks [38]. Since then, two major approaches have shaped modern visual recognition: convolution-based models, mainly CNNs, and attention-based models, mainly Vision Transformers (ViTs). CNNs use learnable kernels that slide across local receptive fields to produce feature maps. Across successive layers, they learn increasingly complex patterns, from simple textures to object-level features. Their built-in assumptions, such as local connectivity and shared weights, make them efficient and effective for tasks where local spatial patterns are important [39].

In contrast, ViTs divide an image into non-overlapping patches and treat those patches as a sequence of tokens, together with positional information. This design allows the model to capture relationships between distant parts of an image and learn broader contextual information. Related work has also reported that bringing in multimodal and contextual cues improves scene understanding and overall perception performance [40]. However, because ViTs have fewer built-in assumptions about image structure, they usually require larger datasets and more computational resources to achieve strong performance [34,41].

Each approach has its own benefits. CNNs are strong at local feature extraction while ViTs are stronger at global context modeling. Researchers recently started exploring hybrid architectures by integrating convolutional backbones with attention layers. The main purpose of these hybrid models is to balance fine-grained local features, global context, and computational efficiency [42].

3.2. ConvNeXt: Modernizing Convolutional Networks

ConvNeXt was proposed to bridge the gap between the “pre-ViT” and “post-ViT” eras for ConvNets by investigating which Transformer design choices actually mattered and whether a pure ConvNet could reach Transformer-level performance while remaining simple [43]. Starting from a standard ResNet [44], the authors gradually modernized the architecture toward the style of a hierarchical vision Transformer [35,45,46], while retaining only standard convolutional components. The modernization included adjustments to the stage ratio, adoption of depthwise convolution, increased network width, and revisiting large kernel sizes ($7 \times 7$). The micro-design moved closer to Transformer standards by swapping ReLU for GELU, cutting down on activation and normalization layers, and replacing Batch Normalization with Layer Normalization [43].

3.3. Transformer Encoder Theory

The Transformer architecture dispenses with recurrence and convolution and relies instead on attention mechanisms for sequence transduction tasks [47]. This approach opened the door to much more parallel computation. The original Transformer encoder consisted of a stack of $N = 6$ identical layers. Each layer includes two main subcomponents: a multi-head self-attention mechanism and a position-wise fully connected Feed-Forward Network. Residual connections were applied around each sub-layer, followed by Layer Normalization. The main attention operation is Scaled Dot-Product Attention [47], defined as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_{k}}} \right) V

(1)

where $Q$ denotes the Query matrix, $K$ is the Key matrix, and $V$ represents the Value matrix. The term $d_{k}$ denotes the dimension of the keys; dividing by $\sqrt{d_{k}}$ acts as a scaling factor that stabilizes training. Multi-head attention applies this operation in parallel across several learned projections, which lets the model focus on different parts of the input and capture information from multiple representation subspaces at the same time. Since the Transformer does not use recurrence or convolution, positional encodings are added to the input embeddings to preserve information about token order. The original Transformer used sinusoidal positional encodings with different frequencies [47]:

PE_{(pos, 2i)} = \sin\left( pos / 10000^{2i / d_{model}} \right)

(2)

PE_{(pos, 2i+1)} = \cos\left( pos / 10000^{2i / d_{model}} \right)

(3)

where $PE$ denotes the positional encoding, $pos$ is the token’s position in the sequence, and $i$ is the dimension index in the embedding vector. This design supports efficient parallelization and helps the model capture long-range dependencies more effectively [47].
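Equations (1)–(3) can be made concrete with a small worked example. The following is a minimal sketch in plain Python (nested lists, single attention head, no learned projections); the function names are illustrative and do not come from the paper or from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def positional_encoding(num_positions, d_model):
    """Eqs. (2)-(3): sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # two_i plays the role of the even index 2i in the equations.
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)
    return pe
```

With a zero query vector, all scores are equal, so the attention weights are uniform and the output is simply the mean of the value vectors; this is a quick sanity check on the softmax normalization.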

3.4. Theoretical Analysis of ASL Recognition Challenges

American Sign Language (ASL) recognition aims to bridge the communication gap between hearing and deaf individuals by automatically identifying hand gestures corresponding to the alphabet and digits. However, sign language is considerably more sophisticated and unpredictable than many other visual activities because it combines fine-grained finger movements with coarse-grained arm motions [48]. This inherent structural complexity significantly increases the difficulty of achieving high-precision recognition. The performance of recognition systems is severely affected by variations in gestures and by environmental disruptions, such as cluttered backgrounds, which limits real-world usability. Furthermore, certain signs involve hand configurations that are highly similar to one another, making it difficult for standard recognition algorithms to differentiate between them. Misclassification occurs frequently among visually similar motions, highlighting the necessity of collecting diverse datasets that account for different movement types and varied viewing angles [49,50].

In addition, recognition performance often fluctuates between user-dependent and user-independent evaluation settings, reflecting the significant variability in individual signing styles, such as differences in movement speed and occlusion patterns [51]. These factors collectively indicate that robust ASL recognition requires advanced feature extraction and classification mechanisms, such as multi-stream fusion architectures capable of integrating complementary appearance and motion cues, to handle gesture variability, visual similarity, and environmental disturbances [51].

4. Proposed Methodology

In this section we introduce the details of the proposed CT-Net architecture and its training procedure. CT-Net follows a cascaded hybrid design that combines the strengths of convolution and self-attention mechanisms. First, ConvNeXt-Tiny is used as a modern convolutional backbone to extract fine-grained local features. Then, a lightweight Transformer encoder is added to capture broader spatial dependencies across the image. The overall pipeline includes data preprocessing and augmentation, hybrid architecture design, loss optimization, and the evaluation protocol.

4.1. Data Preprocessing and Augmentation Strategy

To address common variations in sign language images such as lighting changes, scale differences, and viewing-angle variations, we designed a comprehensive data augmentation pipeline.

4.1.1. Training Dataset Processing

To increase data diversity and improve model generalization, we apply a series of random data augmentation methods during training. We first apply random-sized cropping within a predefined scale range and then resize the image to the target resolution. This helps the model handle changes in camera distance and encourages it to learn features that work across different scales. Random rotation is used to improve robustness against small hand or camera tilts. Color jitter is also applied to randomly change the brightness and contrast, simulating real-world lighting conditions. Finally, all images are normalized using the ImageNet mean and standard deviation to stabilize optimization and support faster convergence. Table 1 shows the specific hyperparameters for this stochastic augmentation pipeline.
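The final normalization step can be illustrated with a short worked example. This sketch assumes the per-channel ImageNet statistics that are commonly used with ImageNet-pretrained models; the values actually used by the authors are those reported in their hyperparameter table, and the helper name `normalize_pixel` is hypothetical.

```python
# Commonly used ImageNet channel statistics (assumed here, not quoted from the paper).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [c / 255.0 for c in rgb_uint8]
    return [(c - m) / s for c, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]

# A black pixel maps to negative values, a white pixel to positive values,
# so the network sees inputs roughly centered around zero.
black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```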

4.1.2. Testing Dataset Processing

During testing, we applied deterministic resizing to the target resolution, followed by the same normalization used for training. This makes the evaluation consistent and reproducible. For dataset loading, we used the PyTorch (version 2.10.0) DataLoader with mini-batches and multi-process loading to improve I/O efficiency and computational speed.

4.2. CT-Net Architecture Design

The main idea behind CT-Net is to combine the local feature-learning strength of convolution with the broader contextual modeling ability of self-attention. The overall data flow of the proposed architecture can be summarized as follows:

I \xrightarrow{\text{ConvNeXt}} F \xrightarrow{\text{Tokenization}} X \xrightarrow{\text{Transformer}} Z \xrightarrow{\text{Pooling/Classifier}} \hat{y}

(4)

Given an input image $I$, CT-Net produces the predicted class probability $\hat{y}$. Figure 1 illustrates each step of the architecture and its modular building blocks, showing how the input image is transformed step by step into the final class prediction.

4.2.1. Local Feature Extraction Backbone

In the first feature extraction stage, we use a ConvNeXt-Tiny model pre-trained on ImageNet-1K. Unlike traditional ResNet architectures, ConvNeXt adopts a modernized design that includes $7 \times 7$ depthwise convolution, an inverted bottleneck structure, and Layer Normalization. These components allow the model to learn hierarchical visual representations while preserving the efficiency of convolutional operations. In our implementation, we remove the global average pooling layer and the classification head and retain only the feature extraction backbone. Given an input tensor $I \in \mathbb{R}^{B \times 3 \times 224 \times 224}$, where $B$ denotes the batch size, the backbone produces a high-dimensional feature map $F \in \mathbb{R}^{B \times C \times H' \times W'}$. In our architecture, the channel dimension is $C = 768$, while the spatial resolution is reduced to $H' = 7$ and $W' = 7$. This feature representation encodes rich fine-grained local information, including finger contours, edge structures, and subtle hand texture patterns.
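The reduction from a $224 \times 224$ input to a $7 \times 7$ map with 768 channels follows from the backbone's downsampling schedule. The sketch below traces the shapes under the standard ConvNeXt-Tiny configuration (a stride-4 stem followed by three stride-2 downsampling layers, with stage widths 96, 192, 384, and 768); these stage details are an assumption drawn from the usual ConvNeXt-Tiny design rather than from the text above, and the function name is illustrative.

```python
def convnext_tiny_feature_shapes(batch, height=224, width=224):
    """Trace (B, C, H, W) after each of the four ConvNeXt-Tiny stages."""
    stage_channels = (96, 192, 384, 768)  # assumed standard stage widths
    strides = (4, 2, 2, 2)                # stem stride 4, then three stride-2 steps
    shapes = []
    h, w = height, width
    for channels, stride in zip(stage_channels, strides):
        h, w = h // stride, w // stride
        shapes.append((batch, channels, h, w))
    return shapes

shapes = convnext_tiny_feature_shapes(batch=48)
final_shape = shapes[-1]  # shape of the feature map F for a batch of 48 images
```

The overall stride is 4 × 2 × 2 × 2 = 32, which is why a 224-pixel side shrinks to 224 / 32 = 7.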

4.2.2. Feature Tokenization

To prepare the convolutional feature map for the Transformer encoder, we use a parameter-free flattening operation that connects the CNN backbone to the sequence-based Transformer module. Specifically, the ConvNeXt feature map $F \in \mathbb{R}^{B \times 768 \times 7 \times 7}$ is flattened along the spatial dimensions to produce a tensor of shape $B \times 768 \times 49$. The tensor is then permuted into a sequence representation $X \in \mathbb{R}^{B \times N \times D}$, where $N = H' \times W' = 49$ is the sequence length and $D = 768$ is the embedding dimension. This direct mapping avoids extra projection layers, which preserves the semantic information learned by the convolutional features while matching the Transformer input shape $(B, N, D)$.
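The flatten-and-permute step can be sketched without any framework. The example below operates on nested Python lists and is only meant to show the index bookkeeping; in PyTorch the same reshaping would typically be a one-liner such as `F.flatten(2).transpose(1, 2)`, though the authors' exact code is not shown in the text. The function name is hypothetical.

```python
def tokenize_feature_map(F):
    """Flatten (B, C, H, W) to (B, C, H*W), then permute to (B, N, D)."""
    X = []
    for sample in F:  # sample has shape C x H x W
        # Flatten each channel's H x W grid row by row: C x N.
        flat = [[v for row in channel for v in row] for channel in sample]
        # Transpose so each spatial position becomes one token: N x C.
        tokens = [list(col) for col in zip(*flat)]
        X.append(tokens)
    return X

# Tiny example: B=1, C=2, H=W=2. Each token collects both channel values
# at one spatial location.
toy = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
toy_tokens = tokenize_feature_map(toy)
```

Because the operation only rearranges indices, it adds no learnable parameters, which matches the parameter-free description above.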

4.2.3. Global Context Modeling

In this step we need to capture long-range dependencies between different parts of the hand, such as the spatial relationship between the thumb and the little finger. For this, the token sequence $X$ is passed to a lightweight Transformer encoder. The encoder consists of $L = 2$ stacked Transformer encoder layers. Unlike 6-layer or 12-layer Transformers, this shallow configuration reduces computational cost and helps prevent overfitting on the medium-sized ASL dataset. Each encoder layer includes Multi-Head Self-Attention (MHSA) with $h = 8$ attention heads. Through self-attention, each token can exchange information with every other token in the sequence, which enables the model to identify spatial regions that are important for gesture classification. After the attention module, a Feed-Forward Network (FFN) further refines these token representations by expanding the hidden dimension to 2048, roughly 2.7 times the embedding dimension of 768. The GELU activation function introduces nonlinearity and improves the model’s representational capacity. To keep optimization stable, each layer includes Dropout with probability 0.1, residual connections, and Layer Normalization. Together, these components make training more stable and help the model converge reliably.

Combining ConvNeXt-Tiny with the Transformer encoder reduces the confusion between visually similar ASL gestures by blending detailed local hand features with larger spatial patterns. ConvNeXt-Tiny captures fine details, such as finger shapes and texture changes, while the Transformer encoder connects distant parts of the hand through self-attention to capture broader relationships. Compared to the ConvNeXt-Tiny baseline, CT-Net improved F1-scores in several difficult categories: ‘A’, ‘E’, ‘I’, ‘K’, ‘S’, ‘U’, ‘V’, and ‘Y’. Notably, the F1-score for ‘S’ increased from 76.85% to 86.49%, suggesting that the global contextual modeling introduced by the Transformer encoder enhanced the discriminative power for visually ambiguous gestures. Nevertheless, highly similar categories such as ‘M’ and ‘N’ remained challenging due to severe finger occlusion and subtle structural nuances.
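To give a rough sense of how lightweight this encoder is, the sketch below counts learnable parameters for the stated configuration (768-dimensional tokens, FFN width 2048, two layers). It assumes the standard encoder-layer layout with biases on every linear projection and two Layer Normalization modules per layer, as in common implementations; the paper does not report this count, so the result is an estimate under those assumptions, and the function names are illustrative.

```python
def encoder_layer_params(d_model=768, d_ff=2048):
    """Learnable parameters in one standard Transformer encoder layer."""
    attn_in = 3 * (d_model * d_model + d_model)   # Q, K, V projections
    attn_out = d_model * d_model + d_model        # attention output projection
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                     # two LayerNorms: scale and shift
    return attn_in + attn_out + ffn + norms

def attention_matrix_entries(num_tokens=49, num_heads=8):
    """Number of attention weights computed per image in one layer."""
    return num_heads * num_tokens * num_tokens

per_layer = encoder_layer_params()
two_layers = 2 * per_layer
head_dim = 768 // 8  # channels handled by each attention head
```

Under these assumptions the two layers together hold about 11 million parameters, and with only 49 tokens the attention matrices stay small, which is consistent with the low computational cost described above.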

4.2.4. Classification Head

The Transformer encoder produces a contextual token sequence $Z \in \mathbb{R}^{B \times 49 \times 768}$, where each token contains both local visual information and global contextual cues. Global Average Pooling is then applied along the token dimension to combine the 49 spatial tokens into a single 768-dimensional feature vector. This vector is normalized using Layer Normalization and passed through a fully connected layer that projects it onto the $K = 29$ ASL alphabet categories. The resulting logits have shape $B \times 29$ and are used for gesture classification. Table 2 summarizes the tensor dimensions throughout the CT-Net architecture pipeline.
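The pooling step and the resulting shapes can be sketched as follows. This is a plain-Python illustration of averaging over the token axis and of the shape sequence in the head; the helper names are hypothetical, and the Layer Normalization and linear layer are represented only by their output shapes.

```python
def global_average_pool(Z):
    """Average over the token axis: (B, N, D) -> (B, D)."""
    pooled = []
    for tokens in Z:
        n = len(tokens)
        # zip(*tokens) iterates over embedding dimensions across all tokens.
        pooled.append([sum(col) / n for col in zip(*tokens)])
    return pooled

def head_shapes(batch, num_tokens=49, dim=768, num_classes=29):
    """Shapes after the encoder, after pooling, and after the classifier."""
    return [(batch, num_tokens, dim), (batch, dim), (batch, num_classes)]

pooled_toy = global_average_pool([[[1.0, 2.0], [3.0, 4.0]]])
shapes_for_batch = head_shapes(48)
```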

4.3. Experimental Setup and Training Strategy

All experiments were conducted in a GPU-enabled computing environment and implemented using the PyTorch framework. To account for the characteristics of our hybrid architecture, we adopted a tailored optimization strategy to ensure stable convergence and robust generalization. The model was optimized using AdamW, which decouples weight decay from gradient-based parameter updates and is known to enhance convergence stability, particularly for architectures incorporating Transformer components. The initial learning rate was set to $3 \times 10^{-5}$, as Transformer modules are typically sensitive to large learning rates and benefit from smaller, more stable optimization steps. A weight decay of 0.05 was applied to regularize model complexity and mitigate overfitting. The batch size was set to 48 to balance GPU memory usage and stable gradient estimation, particularly because the Transformer encoder requires additional memory. Furthermore, because many ASL hand gestures are visually similar, the model was trained with cross-entropy loss combined with label smoothing. We set the label smoothing factor to 0.1 so that the target labels are not treated as strictly one-hot. This training strategy reduces overconfidence, helps limit overfitting caused by noisy labels, and supports better generalization across similar ASL classes. The model was trained for 20 epochs. Since hybrid architectures may require more time to converge than pure CNN baselines, this provides enough time for optimization while keeping the overall training process efficient. A best-checkpoint saving policy was implemented, whereby the model state dictionary was saved only when the current evaluation accuracy surpassed the best previously observed performance. All final results were reported using the best-performing checkpoint obtained during training.
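The effect of label smoothing on the loss can be seen in a small worked example. The sketch below assumes the common uniform-smoothing convention, in which the target keeps $1 - \epsilon$ of its mass on the true class and spreads $\epsilon$ evenly over all $K$ classes; the text above states only the factor 0.1, so the exact formulation is an assumption, and the function name is illustrative.

```python
import math

def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy between softmax(logits) and a label-smoothed target."""
    k = len(logits)
    m = max(logits)
    # Log-sum-exp for a numerically stable log-softmax.
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    # Smoothed target: epsilon / K everywhere, plus (1 - epsilon) on the true class.
    q = [epsilon / k + (1.0 - epsilon if i == target else 0.0) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

# A very confident, correct prediction is penalized more with smoothing
# than without it, which discourages overconfident logits.
confident = [10.0, 0.0, 0.0]
loss_plain = smoothed_cross_entropy(confident, target=0, epsilon=0.0)
loss_smooth = smoothed_cross_entropy(confident, target=0, epsilon=0.1)
```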

4.4. Performance Evaluation Indicators

To evaluate CT-Net on the ASL alphabet recognition task, we used several standard classification metrics. Accuracy is the main metric and measures the overall proportion of correctly classified samples. To examine class-level performance, we also used Precision, Recall, and F1-score. Precision is the proportion of samples predicted as a given class that truly belong to it, Recall is the proportion of samples of a given class that are correctly identified, and F1-score is the harmonic mean of the two. These metrics are especially useful for ASL recognition because some hand gestures are visually similar and may be confused by the model.

5. Experimentation Results and Analysis

5.1. Experimental Setup

To evaluate the performance of the proposed CT-Net, we conducted a wide range of experiments. We considered three separate data subsets and twelve comparative models, including eight core baselines and four state-of-the-art lightweight architectures. In this section, we describe the dataset protocols and partitioning, the implementation details, and the results obtained with the proposed model.

5.1.1. Dataset Protocols and Partitioning

Since the official test set of the ASL alphabet dataset contains only one image per class, it is insufficient for a robust statistical evaluation. We therefore re-partitioned the original training pool (3000 images per class) into two segments: a training source (the first 2000 images) and a testing source (the remaining 1000 images). To ensure a fair comparison, all baselines were re-trained and evaluated under this same protocol, so that performance gains reflect generalization capability rather than partitioning artifacts. Based on this split, three specialized sub-datasets were constructed:

Sub-dataset 1 (Sequential Sampling): This subset simulated a low-variance environment. We selected the first 500 images from the training source and the first 250 images from the testing source (indices 2001–2250). This sub-dataset served as a baseline to evaluate the model’s fundamental fitting ability under high temporal correlation.
Sub-dataset 2 (Uniform Stratified Sampling): To assess generalization across diverse conditions, we applied a uniform sampling strategy. We extracted every 4th image from both the training and testing sources to obtain 500 training and 250 testing samples per class. This sub-dataset provided a more comprehensive representation of the dataset’s overall distribution.
Sub-dataset 3 (MediaPipe-Enhanced Visual Fusion): Built upon the sampling logic of Sub-dataset 2, this version introduced a visual feature enhancement stage using the MediaPipe Hands framework. For images where hand landmarks were successfully detected, the skeletal topology (keypoints and connections) was explicitly rendered and overlaid onto the original RGB images. This process embedded geometric priors directly into the visual input, allowing the model to learn from both texture and structural cues simultaneously. For samples where landmark detection failed, typically due to underexposure or motion blur, the original RGB images were retained without modification to ensure data completeness. The failure rate of MediaPipe detection was 25.6%; this hybrid approach was adopted to respect the original data distribution in cases of poor image quality, while manual annotation could be considered in future work to further eliminate this modality inconsistency. By adopting this hybrid image synthesis strategy, Sub-dataset 3 maintained an identical sample size (500 training and 250 testing images per class) to the previous subsets. This controlled setup enabled precise quantification of the impact of visually integrated structural information on CT-Net, eliminating interference from data volume variations. An example of the landmark overlay is illustrated in Figure 2.
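The three sampling protocols above can be sketched as index selections over one class. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the starting offset of the "every 4th image" rule is an assumption (the text does not state it), and the landmark detector and overlay renderer are passed in as placeholder callables rather than real MediaPipe calls.

```python
def split_pool(images_per_class=3000, train_size=2000):
    """Return 0-based indices of the training and testing sources for one class."""
    train_source = list(range(0, train_size))
    test_source = list(range(train_size, images_per_class))
    return train_source, test_source

def sequential_sample(source, count):
    """Sub-dataset 1: take the first `count` items of a source."""
    return source[:count]

def uniform_sample(source, step=4):
    """Sub-dataset 2: take every `step`-th item (assumed to start at the first item)."""
    return source[::step]

def enhance_or_fallback(image, detect_landmarks, draw_overlay):
    """Sub-dataset 3: overlay the skeleton if detection succeeds, else keep the RGB image."""
    landmarks = detect_landmarks(image)
    if landmarks is None:
        return image  # detection failed: retain the original image unchanged
    return draw_overlay(image, landmarks)

train_source, test_source = split_pool()
sub1_train = sequential_sample(train_source, 500)
sub1_test = sequential_sample(test_source, 250)
sub2_train = uniform_sample(train_source)
sub2_test = uniform_sample(test_source)
print(len(sub1_train), len(sub1_test), len(sub2_train), len(sub2_test))
```

Both selections yield 500 training and 250 testing indices per class, and the 0-based test indices 2000 to 2249 of Sub-dataset 1 correspond to images 2001–2250 in the 1-based numbering used in the text.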

5.1.2. Evaluation Metrics

We evaluated model performance using accuracy, Precision, Recall, and F1-score. To treat all 29 classes fairly and reduce bias toward more frequent ones, we used macro-averaging.
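As a minimal sketch of macro-averaging, the following computes per-class Precision, Recall, and F1-score from predictions and then averages them with equal weight per class. The function name and the toy labels are illustrative assumptions and are unrelated to the ASL data.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class receives a false positive
            fn[truth] += 1  # true class receives a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        p = safe_div(tp[c], tp[c] + fp[c])
        r = safe_div(tp[c], tp[c] + fn[c])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    return {
        "accuracy": safe_div(sum(tp), len(y_true)),
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }

# Toy example with three balanced classes (hypothetical labels, not ASL data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_metrics(y_true, y_pred, num_classes=3))
```

When every class has the same number of test samples, as in the 250-images-per-class test sets described above, macro-averaged Recall coincides with accuracy; the toy example shows the same behavior.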

5.1.3. Implementation Details

All experiments were implemented in PyTorch and executed on GPU-accelerated hardware. Each input image was resized to $224 \times 224$ pixels. During training, we applied geometric and photometric distortions, including random resizing, rotations, and color jittering, followed by standard normalization. To ensure a fair comparison and reduce overfitting, we initialized all deep learning models with pre-trained ImageNet weights. For the hybrid CNN + SVM and CNN + XGBoost baselines, the CNN component was used as a fixed feature extractor to obtain general spatial representations, without further fine-tuning. All models were trained using the AdamW optimizer with a learning rate of $3 \times 10^{-5}$ and a weight decay of 0.05. We minimized cross-entropy loss with label smoothing, using a batch size of 48 for 20 epochs. The same training protocol was applied across all models to support consistency and reproducibility; however, some architectures may achieve stronger performance with model-specific hyperparameter tuning.

5.1.4. Model Architectures and Baselines

To evaluate CT-Net, we selected eight models that cover four main types of visual recognition architectures. The models were chosen to be diverse in their design, their computational cost, and the way they extract features.

End-to-End Deep Learning Models:

ResNet18: This classic residual convolutional network uses skip connections to solve the vanishing gradient problem. We chose it as a standard CNN baseline to represent traditional feature extraction.
ConvNeXt-Tiny: This is a more recent CNN inspired by Transformers. It uses large $7 \times 7$ kernels, LayerNorm, and inverted bottleneck structures. It represents recent advances in convolutional networks.
MobileNetV3: A lightweight CNN designed for mobile and edge devices. It uses depthwise separable convolutions and neural architecture search (NAS) to compute efficiently while keeping accuracy.
EfficientNet-B0: This CNN uses compound scaling to optimize network depth, width, and resolution all at once. It strikes a strong balance between accuracy and computational cost.
Vision Transformer (ViT): This is a purely Transformer-based model that treats images as sequences of patches, using self-attention mechanisms instead of any convolutional operations. It fully embraces global context modeling.
CT-Net: This proposed hybrid architecture combines ConvNeXt-Tiny as the local feature extractor with a Transformer Encoder to capture global relationships. By merging these two, it leverages the unique strengths of each architectural paradigm.

Two-Stage Hybrid Machine Learning Models:

CNN + SVM: This uses a pre-trained ResNet18 to extract features, which are then passed to a Support Vector Machine (SVM) classifier with an RBF kernel. It follows the classic machine learning pipeline, enhanced with deep features.
CNN + XGBoost: Like CNN + SVM, but swaps the SVM classifier for XGBoost (eXtreme Gradient Boosting), which relies on ensemble decision trees to boost discriminative power.

All deep learning models started with pre-trained weights and were fine-tuned end-to-end on the ASL datasets. For the two-stage hybrid models, feature extraction was done with a frozen ResNet18 backbone to ensure a fair comparison with the end-to-end methods.

5.2. Overall Performance Comparison

Cross-Dataset Performance Benchmarking

To assess the robustness and adaptability of CT-Net, we conducted experiments under three data protocols: Sequential Sampling (Sub-dataset 1), Uniform Stratified Sampling (Sub-dataset 2), and MediaPipe-Enhanced Visual Fusion (Sub-dataset 3). Table 3 compares the eight architectures, reporting both accuracy (%) and macro F1-score for each setup.

The results were consistent: CT-Net achieved the best performance in every scenario. With Sequential Sampling in Sub-dataset 1, CT-Net reached 92.69% accuracy and an F1-score of 92.85%. This slightly exceeded ConvNeXt-Tiny's 92.12% and was well ahead of ResNet18 at 80.50%, suggesting that the hybrid architecture handles subtle spatial patterns in sequential data better than conventional CNNs. In the Sub-dataset 2 (Uniform Stratified Sampling) scenario, designed to test generalization, CT-Net maintained its lead with 94.90% accuracy (F1 = 0.9447). While ConvNeXt-Tiny remained competitive (92.76%), ResNet18 was considerably less stable, with accuracy ranging from 80.50% on Sub-dataset 1 to 89.90% on Sub-dataset 2. In contrast, CT-Net's consistent performance indicates robustness against distribution shifts.

The influence of data quality was most evident in Sub-dataset 3 (MediaPipe-Enhanced), where CT-Net reached its peak of 95.67% accuracy (F1 = 0.9543), exceeding ConvNeXt-Tiny by 0.87 percentage points. This supports the hypothesis that the combination of local feature extraction and global context modeling is effective on structurally enhanced data. Conversely, the pure Transformer (ViT) declined sharply to 55.39%, likely due to the lack of convolutional inductive bias and overfitting on the enhanced dataset.

The results also reveal the limitations of two-stage pipelines such as CNN + SVM and CNN + XGBoost, whose accuracy remained below 55%. These findings suggest that fixed CNN features are not expressive enough for ASL recognition, where small differences in hand shape and finger position can change the class label, whereas end-to-end models can adapt their feature representations directly to these subtle gesture-specific patterns. Overall, CT-Net achieved the strongest and most stable performance across the evaluated data settings, showing its effectiveness for static ASL gesture recognition.

5.3. Analysis of Data Quality and Distribution

5.3.1. Generalization Ability Across Data Distributions

To evaluate the robustness of different architectures against data distribution shifts, we compared model performance between Sub-dataset 1 and Sub-dataset 2. As shown in Figure 3, most models demonstrated improved accuracy on Sub-dataset 2, indicating that the uniform sampling strategy provides a more representative data distribution that facilitates better feature learning.

The proposed CT-Net achieved the highest accuracy on both subsets (92.69% on Sub-dataset 1; 94.90% on Sub-dataset 2), an improvement of 2.21 percentage points when moving to the more diverse Sub-dataset 2. This consistency suggests that the hybrid architecture learns generalized features rather than overfitting to the sequential patterns present in Sub-dataset 1. Notably, traditional CNN architectures exhibited substantial gains from Sub-dataset 1 to Sub-dataset 2. ResNet18 improved from 80.50% to 89.90%, a gain of 9.40 points. EfficientNet-B0 showed an even larger increase of 17.92 points, from 64.22% to 82.14%, and MobileNetV3 rose by 7.45 points, from 66.55% to 74.00%. In contrast, ViT (pure Transformer) dropped by 2.43 points, from 74.47% to 72.04%, which suggests that pure attention-based models can be sensitive to changes in data distribution when they lack the inductive bias that convolutional layers provide. The two-stage hybrid models (CNN + SVM and CNN + XGBoost) changed little, with gains of just 3.40 and 2.81 points, respectively. This supports the idea that fixed feature extractors struggle to adapt to diverse data. Overall, CT-Net delivered steady and strong performance under both sampling strategies, indicating resilience to shifts in data distribution, a desirable property for real-world applications where data conditions vary.

5.3.2. Sensitivity to Data Quality Enhancement

To quantify the impact of data quality on model performance, we compared results between Sub-dataset 2 and Sub-dataset 3. As described in Section 5.1.1, Sub-dataset 3 incorporates structural landmark information for high-quality samples while retaining original RGB images for samples with failed detection, maintaining identical sample sizes across both subsets. As illustrated in Figure 4, the proposed CT-Net achieved the highest overall performance on Sub-dataset 3 with 95.67% accuracy, representing a +0.77% gain from Sub-dataset 2 (94.90%). This result indicates that hybrid architectures can effectively leverage structural landmarks when available, while the ConvNeXt backbone ensures robust feature extraction from raw RGB inputs when landmarks are unavailable. This observation aligns with recent studies demonstrating the effectiveness of structured representations in improving model performance [52].

MobileNetV3 showed the largest improvement among all models, with a gain of 3.77 percentage points (from 74.00% to 77.77%), suggesting that lightweight models may benefit disproportionately from additional structural guidance. ConvNeXt-Tiny also improved substantially, by 2.04 points (from 92.76% to 94.80%), indicating that modern CNN architectures benefit from the structural enhancement, while ResNet18 exhibited a modest gain of 1.12 points (from 89.90% to 91.02%). Conversely, several models experienced performance degradation on Sub-dataset 3. ViT suffered the largest drop (16.65 points, from 72.04% to 55.39%), indicating that pure Transformer architectures struggle to utilize the hybrid landmark-RGB representation without convolutional inductive bias. EfficientNet-B0 also decreased by 4.32 points (from 82.14% to 77.82%), suggesting a potential incompatibility between its compound-scaled feature extraction and the structural enhancement pipeline. The two-stage approaches (CNN + SVM and CNN + XGBoost) showed minimal changes (−1.83 and −0.10 points, respectively), as their frozen feature extractor cannot adapt to the enhanced structural information.

These findings demonstrate that data quality enhancement through structural landmark integration benefits architectures capable of end-to-end adaptation. CT-Net’s superior performance confirms its ability to effectively fuse local RGB features with global structural information.
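The accuracy changes discussed in Section 5.3 can be recomputed directly from the figures quoted in the text. The short sketch below does this arithmetic for the six end-to-end models; the dictionary layout and function name are illustrative choices, and the differences are expressed in percentage points (pp).

```python
# Accuracies (%) quoted in the text for Sub-datasets 1, 2, and 3.
reported = {
    "CT-Net": (92.69, 94.90, 95.67),
    "ConvNeXt-Tiny": (92.12, 92.76, 94.80),
    "ResNet18": (80.50, 89.90, 91.02),
    "EfficientNet-B0": (64.22, 82.14, 77.82),
    "MobileNetV3": (66.55, 74.00, 77.77),
    "ViT": (74.47, 72.04, 55.39),
}

def deltas(acc):
    """Return the (Sub1 -> Sub2, Sub2 -> Sub3) changes in percentage points."""
    s1, s2, s3 = acc
    return round(s2 - s1, 2), round(s3 - s2, 2)

for name, acc in reported.items():
    d12, d23 = deltas(acc)
    print(f"{name:<16} Sub1->Sub2 {d12:+6.2f} pp   Sub2->Sub3 {d23:+6.2f} pp")

# Model with the largest gain when moving from Sub-dataset 2 to Sub-dataset 3.
largest_gain = max(reported, key=lambda n: deltas(reported[n])[1])
print("largest Sub2->Sub3 gain:", largest_gain)
```

Reporting the changes as percentage-point differences, rather than relative percentages, avoids ambiguity when comparing models whose baseline accuracies differ widely.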

5.4. Per-Class Performance Analysis

To further evaluate the discriminative capacity of CT-Net across individual ASL gestures, a detailed per-class analysis was conducted on Sub-dataset 3, where the model reached its peak performance (95.67% accuracy). Figure 5 illustrates the Precision, Recall, and F1-score for each of the 29 classes.

CT-Net demonstrated exceptional classification proficiency, achieving perfect or near-perfect scores (F1 $\geq$ 0.99) in 15 of the 29 categories, including ‘F’, ‘L’, ‘Q’, ‘T’, ‘nothing’, and ‘space’. These results underscore the model’s robustness in extracting features from gestures with highly distinct hand configurations. Nevertheless, specific performance fluctuations were identified in four challenging categories, primarily due to inherent visual ambiguities in the ASL alphabet:

Class ‘M’ (F1 = 0.54): This category yielded the lowest Recall (0.38), indicating a significant tendency of the model to misclassify ‘M’ as other morphologically similar gestures.
Class ‘N’ (F1 = 0.76): This class exhibited diminished Precision (0.62), suggesting that it frequently incurred false-positive predictions from other visually proximal classes.
Class ‘S’ (F1 = 0.86): A moderate decline in Recall (0.77) was observed, attributed to its high visual similarity with other closed-fist gestures.
Class ‘V’ (F1 = 0.88): An imbalance between Precision and Recall was noted, likely stemming from the sensitivity of finger orientation features within this category.

The systematic confusion, particularly between ‘M’ and ‘N’, aligns with the objective difficulty of ASL recognition; these two gestures are differentiated only by the subtle positioning of the thumb between different fingers. Such error patterns suggest that the remaining misclassifications are driven by inter-class similarity and fine-grained structural nuances rather than inherent limitations of the CT-Net architecture. These specific challenges could be further addressed in future studies through targeted data augmentation or higher-resolution landmark integration. Overall, CT-Net maintained F1-scores above 0.85 for 86.2% of the classes (25 out of 29), validating its reliability and efficacy for practical sign language recognition deployments.
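Because F1 is the harmonic mean of Precision and Recall, the component that the text does not quote for ‘M’ and ‘N’ can be estimated from the two values that are given. The sketch below is only a back-of-the-envelope derivation from the rounded figures above; the resulting numbers are implied estimates, not values reported by the authors, and the function name is hypothetical.

```python
def missing_component(f1, known):
    """Solve F1 = 2PR / (P + R) for the unknown of P or R, given the other."""
    denom = 2 * known - f1
    if denom <= 0:
        raise ValueError("F1 cannot reach or exceed twice the known component")
    return f1 * known / denom

# Values quoted in the text (rounded to two decimals).
implied_precision_m = missing_component(f1=0.54, known=0.38)  # 'M': Recall is given
implied_recall_n = missing_component(f1=0.76, known=0.62)     # 'N': Precision is given

print(f"'M' implied Precision ~ {implied_precision_m:.2f}")
print(f"'N' implied Recall    ~ {implied_recall_n:.2f}")
```

Taken at face value, the rounded figures imply a Precision of about 0.93 for ‘M’ and a Recall of about 0.98 for ‘N’. This pattern (low Recall for ‘M’, low Precision for ‘N’) would be consistent with ‘M’ samples being predicted as ‘N’, although only the confusion matrix could confirm it.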

5.5. Cross-Dataset Convergence Analysis

We evaluated training stability and convergence behavior on the three sub-datasets. Figure 6 shows how accuracy and loss changed over 20 training epochs for five deep learning models. CT-Net and ConvNeXt-Tiny consistently performed best, with the highest accuracy across all sub-datasets. On Sub-dataset 1, both models reached around 89–92% accuracy (Figure 6a). On Sub-dataset 2, CT-Net peaked at 94.9% at epoch 17, and ConvNeXt-Tiny stayed above 90% accuracy throughout (Figure 6c). On Sub-dataset 3, both CT-Net and ConvNeXt-Tiny remained above 93% accuracy during the entire training process (Figure 6e). ResNet18 performed moderately, with accuracy between 80% and 90%, whereas MobileNetV3 and EfficientNet-B0 lagged behind, with accuracy between 60% and 80% and slower convergence. The accuracy comparison across all three sub-datasets indicates that the hybrid architecture captures discriminative features regardless of the data representation method.

The training loss curves further support the stability of all models during the learning process. As shown in the loss comparison plots, all models converged rapidly during the initial training phase, with the training loss decreasing sharply within the first 2–3 epochs before stabilizing around 0.64–0.65. This pattern is clearly visible in Figure 6b for Sub-dataset 1, Figure 6d for Sub-dataset 2, and Figure 6f for Sub-dataset 3. The consistent loss convergence across all three sub-datasets indicates that the models quickly learned effective feature representations. Furthermore, the minimal fluctuation in loss values after epoch 5 suggests that 20 epochs were sufficient for training across all architectures. Although CT-Net required a longer per-epoch training time (~275 s vs. ~82 s for ResNet18), its superior accuracy and convergence stability across all six sub-plots support the effectiveness of the hybrid architecture design, indicating a favorable balance between accuracy and training stability on diverse data representations.
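One possible reading of the loss plateau near 0.64–0.65 is that it reflects the lower bound imposed by label smoothing rather than residual error. With smoothed targets, the cross-entropy can never fall below the entropy of the target distribution itself. The sketch below computes that bound under the assumption that the smoothing factor of 0.1 is spread uniformly over all 29 classes (the PyTorch convention); the function name is hypothetical.

```python
import math

def smoothed_loss_floor(num_classes, epsilon):
    """Entropy of the smoothed target: the smallest reachable cross-entropy."""
    off = epsilon / num_classes   # probability mass on each incorrect class
    on = 1.0 - epsilon + off      # probability mass on the true class
    return -(on * math.log(on) + (num_classes - 1) * off * math.log(off))

floor = smoothed_loss_floor(num_classes=29, epsilon=0.1)
print(f"lower bound on the label-smoothed loss: {floor:.3f}")
```

Under these assumptions the bound is roughly 0.64, so a training loss that settles at 0.64–0.65 would indicate that the models fit the training targets almost as closely as the smoothed objective allows.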

5.6. Computational Complexity and Inference Efficiency Analysis

To further evaluate the practical deployment potential of the proposed CT-Net, we compared the computational complexity and inference efficiency of different models in terms of parameters, Floating Point Operations (FLOPs), inference latency, and FPS. As shown in Table 4, CT-Net contained 38.87 M parameters and required 4.77 G FLOPs, which were moderately higher than ConvNeXt-Tiny but substantially lower than ViT. Despite the additional Transformer encoder, CT-Net achieved an inference latency of only 6.11 ms per image and a processing speed of 163.55 FPS, indicating that the proposed model maintained real-time inference capability. Compared with ViT, CT-Net reduced the parameter count by 54.71% and FLOPs by 57.75%, while achieving higher FPS. Although ResNet18 and ConvNeXt-Tiny showed slightly faster inference speeds (Table 4), CT-Net provided a better trade-off between recognition accuracy and computational efficiency. In contrast, MobileNetV3 and EfficientNet-B0 had fewer parameters and lower FLOPs, but their inference latency was unexpectedly higher, resulting in lower FPS. The two-stage CNN + SVM approach also suffered from increased latency due to the additional classifier stage. These results indicate that CT-Net achieved competitive computational efficiency while preserving superior classification performance, supporting its suitability for real-time ASL recognition applications.
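The relation between the quoted latency, throughput, and relative reductions can be checked with simple arithmetic. The sketch below assumes that FPS is the reciprocal of the per-image latency (single-image inference) and that the reductions are relative to the ViT baseline; the function names are hypothetical, and the "implied" ViT figures are derived from the percentages in the text rather than taken from Table 4.

```python
def fps_from_latency(latency_ms):
    """Throughput in frames per second for a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

def baseline_from_reduction(value, reduction_pct):
    """Recover the reference value given a value and its relative reduction (%)."""
    return value / (1.0 - reduction_pct / 100.0)

fps = fps_from_latency(6.11)
vit_params_m = baseline_from_reduction(38.87, 54.71)
vit_flops_g = baseline_from_reduction(4.77, 57.75)

print(f"FPS from 6.11 ms: {fps:.2f} (reported: 163.55)")
print(f"implied ViT size: {vit_params_m:.2f} M parameters, {vit_flops_g:.2f} G FLOPs")
```

The reciprocal of 6.11 ms is about 163.7 FPS, which agrees with the reported 163.55 FPS up to rounding of the latency or separate timing of the two measurements.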

5.7. Ablation Study

An ablation study was conducted to examine the influence of Transformer configurations in CT-Net by varying the number of encoder layers and attention heads while keeping all other settings unchanged. As shown in Table 5, the default configuration with two layers and eight heads achieved the best performance, reaching 96.59% Precision, 95.67% Recall, and 95.43% F1-score. Precision measured the proportion of true positive predictions among all samples predicted as positive.

Reducing the encoder depth to one layer slightly decreased the performance, with the F1-score dropping to 95.18%. Increasing the depth to three and four layers caused more obvious degradation, with the F1-score decreasing to 93.66% and 92.38%, respectively, indicating that excessive depth introduced unnecessary complexity. Similarly, changing the number of attention heads from eight to four or twelve also reduced performance, with F1-scores of 93.80% and 93.28%, respectively. These results indicate that the default configuration provides the most effective trade-off between global feature modeling and stable training behavior.

5.8. Additional Comparison with Recent Lightweight Architectures

To further test how effective CT-Net is, we carried out thorough comparisons with several state-of-the-art lightweight and hybrid architectures: MobileViT-S, MobileViT-v2, EfficientFormer-L1, and FastViT-SA12. Table 6 summarizes the quantitative results.

CT-Net delivered the lowest inference latency at 6.11 ms and the highest throughput at 163.55 FPS among the compared lightweight and hybrid models. Even though CT-Net had more parameters and higher FLOPs than the lightweight baselines, it provided faster real-time inference. This suggests that CT-Net's design maps more efficiently onto parallel hardware, since parameter and FLOP counts alone do not determine latency. In terms of recognition performance, CT-Net outperformed every competing model on all metrics. Of the lightweight models, EfficientFormer-L1 came closest. Conversely, MobileViT-v2 showed relatively lower accuracy in these experiments. This discrepancy is likely attributable to the unified training schedule of 20 epochs employed for all models: MobileViT-v2 may require longer training or model-specific hyperparameter tuning to reach its best performance. Overall, CT-Net shows a stronger balance between recognition accuracy and inference efficiency, which makes it a strong choice for practical applications that require both precise classification and real-time processing.

6. Conclusions

This study introduces CT-Net, a hybrid deep learning architecture for American Sign Language (ASL) alphabet recognition, with a particular focus on distinguishing visually similar hand gestures. CT-Net combines the local feature extraction strength of ConvNeXt-Tiny with the global context modeling ability of a lightweight Transformer encoder. By integrating these two components, CT-Net addresses key limitations of approaches that rely solely on convolution or attention mechanisms. Across different experimental settings, CT-Net consistently outperforms strong baseline models, including ResNet18, MobileNetV3, and standard Vision Transformers. On the MediaPipe-enhanced dataset, CT-Net achieves its highest accuracy of 95.67%. Ablation studies further show that a shallow two-layer Transformer encoder provides the best balance between modeling capacity and training stability. The efficiency analysis and the comparison with recent lightweight models show that, although CT-Net has more parameters than some lightweight networks, it delivers faster inference in our measurements, reaching a throughput of 163.55 FPS. This makes it a promising choice for real-time applications.

Some limitations remain. CT-Net sometimes confuses classes such as ‘M’ and ‘N’, which are hard to tell apart because the main difference is the number of downward-pointing fingers. Factors such as self-occlusion and low variation between classes hide these subtle details, especially at different hand angles. In future work, we plan to strengthen CT-Net’s robustness across more varied environments and larger datasets. We also intend to extend the model beyond static image recognition to temporal video streams by adding layers that track the motion of hand gestures over time. This step would allow the architecture to support continuous sign language recognition and sentence-level interpretation, paving the way for more practical, real-time assistive communication tools.

Author Contributions

Conceptualization, Z.Y. and H.L.; methodology, Z.Y. and H.L.; software, Z.Y.; validation, Z.Y.; formal analysis, Z.Y.; investigation, Z.Y.; data curation, Z.Y.; writing—original draft preparation, Z.Y.; writing—review and editing, Z.Y., H.L. and S.S.; visualization, Z.Y. and H.L.; supervision, H.L. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data supporting the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zheng, L.; Liang, B.; Jiang, A. Recent advances of deep learning for sign language recognition. In Proceedings of the 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA); IEEE: New York, NY, USA, 2017; pp. 1–7. [Google Scholar]
Miah, A.S.M.; Hasan, M.A.M.; Nishimura, S.; Shin, J. Sign language recognition using graph and general deep neural network based on large scale dataset. IEEE Access 2024, 12, 34553–34569. [Google Scholar] [CrossRef]
Sharma, S.; Singh, S. Vision-based hand gesture recognition using deep learning for the interpretation of sign language. Expert Syst. Appl. 2021, 182, 115657. [Google Scholar] [CrossRef]
Alam, S.; Lamberton, J.; Wang, J.; Leannah, C.; Miller, S.; Palagano, J.; de Bastion, M.; Smith, H.L.; Malzkuhn, M.; Quandt, L.C. ASL champ!: A virtual reality game with deep-learning driven sign recognition. Comput. Educ. X Real. 2024, 4, 100059. [Google Scholar] [CrossRef]
Khan, A.; Ali, R.H.; Akmal, U.; Ramazan, A. Asl recognition using deep learning algorithms. In Proceedings of the 2024 International Conference on IT and Industrial Technologies (ICIT); IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
Kasapbaşi, A.; Elbushra, A.E.A.; Omar, A.H.; Yilmaz, A. DeepASLR: A CNN based human computer interface for American Sign Language recognition for hearing-impaired individuals. Comput. Methods Programs Biomed. Update 2022, 2, 100048. [Google Scholar] [CrossRef]
Sharma, R.; Nemani, Y.; Kumar, S.; Kane, L.; Khanna, P. Recognition of single handed sign language gestures using contour tracing descriptor. In Proceedings of the world congress on engineering; International Association of Engineers: Hong Kong, China, 2013; Volume 2, pp. 3–5. [Google Scholar]
Bheda, V.; Radpour, D. Using deep convolutional networks for gesture recognition in american sign language. arXiv 2017, arXiv:1710.06836. [Google Scholar] [CrossRef]
Garcia, B.; Viesca, S.A. Real-time American sign language recognition with convolutional neural networks. Convolutional Neural. Netw. Vis. Recognit. 2016, 2, 225–232. [Google Scholar]
Ferreira, S.; Costa, E.; Dahia, M.; Rocha, J. A transformer-based contrastive learning approach for few-shot sign language recognition. arXiv 2022, arXiv:2204.02803. [Google Scholar]
Yuan, J.; Zhou, F.; Guo, Z.; Li, X.; Yu, H. HCformer: Hybrid CNN-transformer for LDCT image denoising. J. Digit. Imaging 2023, 36, 2290–2305. [Google Scholar] [CrossRef]
Wadhawan, A.; Kumar, P. Sign Language Recognition Systems: A Decade Systematic Literature Review: A. Wadhawan, P. Kumar. Arch. Comput. Methods Eng. 2021, 28, 785–813. [Google Scholar] [CrossRef]
Darrell, T.; Pentland, A. Space-time gestures. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 1993; pp. 335–340. [Google Scholar]
Starner, T.; Weaver, J.; Pentland, A. Real-time american sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 20, 1371–1375. [Google Scholar] [CrossRef]
Grobel, K.; Assan, M. Isolated sign language recognition using hidden Markov models. In Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation; IEEE: New York, NY, USA, 1997; Volume 1, pp. 162–167. [Google Scholar]
Vogler, C.; Metaxas, D. ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271); IEEE: New York, NY, USA, 1998; pp. 363–369. [Google Scholar]
Pham, Q.T.; Nguyen, D.D.; Nguyen, T.T. A comparison of simpsvm and rvm for sign language recognition. In Proceedings of the 2017 International Conference on Machine Learning and Soft Computing, ICMLSC; Association for Computing Machinery: New York, NY, USA, 2017; Volume 17, pp. 13–16. [Google Scholar]
Zhao, F.; Xu, D.; Ren, Z.; Shao, X.; Wu, Q.; Liu, Y.; Wang, J.; Song, J.; Chen, Y.; Zhang, G.; et al. Mamba-based super-resolution and semi-supervised YOLOv10 for freshwater mussel detection using acoustic video camera: A case study at Lake Izunuma, Japan. Ecol. Inform. 2025, 90, 103324. [Google Scholar] [CrossRef]
Chen, Z.; Yang, J.; Li, F.; Feng, Z.; Chen, L.; Jia, L.; Li, P. Foreign object detection method for railway catenary based on a scarce image generation model and lightweight perception architecture. IEEE Trans. Circuits Syst. Video Technol. 2025, 36, 1377–1391. [Google Scholar] [CrossRef]
Shamshiri, S.; Han, K.J.; Sohn, I. Db-covidnet: A defense method against backdoor attacks. Mathematics 2023, 11, 4236.
Shamshiri, S.; Sohn, I. Defense method challenges against backdoor attacks in neural networks. In Proceedings of the 2024 International Conference on Artificial Intelligence in Information and Communication (ICAIIC); IEEE: New York, NY, USA, 2024; pp. 396–400.
Yang, S.; Zhu, Q. Video-based Chinese sign language recognition using convolutional neural network. In Proceedings of the 2017 IEEE 9th International Conference on Communication Software and Networks (ICCSN); IEEE: New York, NY, USA, 2017; pp. 929–934.
Cihan Camgoz, N.; Hadfield, S.; Koller, O.; Bowden, R. Subunets: End-to-end hand shape and continuous sign language recognition. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 3056–3065.
Pu, J.; Zhou, W.; Li, H. Iterative alignment network for continuous sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2019; pp. 4165–4174.
Nguyen, H.B.; Do, H.N. Deep learning for american sign language fingerspelling recognition system. In Proceedings of the 2019 26th International Conference on Telecommunications (ICT); IEEE: New York, NY, USA, 2019; pp. 314–318.
Al-Qurishi, M.; Khalid, T.; Souissi, R. Deep learning for sign language recognition: Current techniques, benchmarks, and open issues. IEEE Access 2021, 9, 126917–126951.
Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Multi-channel transformers for multi-articulatory sign language translation. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 301–319.
Saunders, B.; Camgoz, N.C.; Bowden, R. Progressive transformers for end-to-end sign language production. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 687–705.
Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 3588–3597.
Du, Y.; Xie, P.; Wang, M.; Hu, X.; Zhao, Z.; Liu, J. Full transformer network with masking future for word-level sign language recognition. Neurocomputing 2022, 500, 115–123.
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 7794–7803.
Yin, M.; Yao, Z.; Cao, Y.; Li, X.; Zhang, Z.; Lin, S.; Hu, H. Disentangled non-local neural networks. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 191–207.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 10012–10022.
Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 10023–10033.
Chaudhary, L.; Ananthanarayana, T.; Hoq, E.; Nwogu, I. Signnet ii: A transformer-based two-way sign language translation model. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12896–12907.
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A survey of the vision transformers and their CNN-transformer based variants. Artif. Intell. Rev. 2023, 56, 2917–2970.
Zhang, J.; Lu, G. Vision-language embodiment for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025.
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.H.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 558–567.
Manzari, O.N.; Kaleybar, J.M.; Saadat, H.; Maleki, S. BEFUnet: A hybrid CNN-transformer architecture for precise medical image segmentation. arXiv 2024, arXiv:2402.08793.
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 11976–11986.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 770–778.
Zhang, X.; Tian, Y.; Xie, L.; Huang, W.; Dai, Q.; Ye, Q.; Tian, Q. Hivit: A simpler and more efficient design of hierarchical vision transformer. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762v7.
Mohamed, N.; Mustafa, M.B.; Jomhari, N. A review of the hand gesture recognition system: Current progress and future directions. IEEE Access 2021, 9, 157422–157436.
Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. Int. J. Comput. Vis. 2018, 126, 1311–1325.
Sathyanarayanan, D.; Reddy, T.S.; Sathish, A.; Geetha, P.; Arunkumar, J.R.; Deepak, S.P.K. American Sign Language Recognition System for Numerical and Alphabets. In Proceedings of the 2023 International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE); IEEE: New York, NY, USA, 2023; pp. 1–6.
Yang, Y.; Min, Y.; Chen, X. S2net: Skeleton-aware slowfast network for efficient sign language recognition. In Proceedings of the Asian Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 319–336.
Lin, J.; Zhang, J.; Lu, G. Graph integrated multimodal concept bottleneck model. arXiv 2025, arXiv:2510.00701.

Figure 1. CT-Net stage-wise architecture from image input to class prediction output.

Figure 2. Example of landmark-enhanced image synthesis for alphabet ‘A’. (a) Original RGB image; (b) processed image with explicit skeletal connections overlaid via MediaPipe.

Figure 3. Accuracy variation from Sub-dataset 1 to Sub-dataset 2.

Figure 4. Accuracy variation from Sub-dataset 2 to Sub-dataset 3.

Figure 5. Fine-grained performance metrics (Precision, Recall, and F1-score) of CT-Net across 29 gesture classes on Sub-dataset 3.

Figure 6. (a) Accuracy on Sub-dataset 1. (b) Loss comparison on Sub-dataset 1. (c) Accuracy comparison on Sub-dataset 2. (d) Loss comparison on Sub-dataset 2. (e) Accuracy comparison on Sub-dataset 3. (f) Loss comparison on Sub-dataset 3.

Table 1. Hyperparameter settings for the stochastic data augmentation pipeline.

Augmentation Technique	Hyperparameter Range/Value
Random Resized Crop	Scale: (0.8, 1.0), Aspect Ratio: (0.75, 1.33)
Random Rotation	Degree: (−10°, +10°)
Color Jitter	Brightness: 0.2, Contrast: 0.2
Input Size	224 × 224 pixels
Normalization	Mean: [0.485, 0.456, 0.406], Std: [0.229, 0.224, 0.225]
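
To make the Table 1 settings concrete, the sketch below samples one set of augmentation parameters and applies the channel-wise normalization to a single pixel. It is a minimal pure-Python illustration, not the authors' pipeline: the function names are hypothetical, pixel values are assumed to be scaled to [0, 1], and the log-uniform aspect-ratio sampling and the reading of a jitter value of 0.2 as a multiplicative factor in [0.8, 1.2] follow torchvision's conventions, which is an assumption here.

```python
import math
import random

# Values copied from Table 1.
CROP_SCALE = (0.8, 1.0)
CROP_RATIO = (0.75, 1.33)
MAX_ROTATION_DEG = 10.0
BRIGHTNESS = 0.2
CONTRAST = 0.2
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def sample_augmentation(rng):
    """Draw one random augmentation setting within the Table 1 ranges."""
    # Assumption: the aspect ratio is sampled log-uniformly, as torchvision does.
    log_lo, log_hi = math.log(CROP_RATIO[0]), math.log(CROP_RATIO[1])
    return {
        "crop_scale": rng.uniform(*CROP_SCALE),
        "crop_ratio": math.exp(rng.uniform(log_lo, log_hi)),
        "rotation_deg": rng.uniform(-MAX_ROTATION_DEG, MAX_ROTATION_DEG),
        "brightness_factor": rng.uniform(1 - BRIGHTNESS, 1 + BRIGHTNESS),
        "contrast_factor": rng.uniform(1 - CONTRAST, 1 + CONTRAST),
    }


def normalize_pixel(rgb):
    """Normalize one RGB pixel (values in [0, 1]) channel by channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


params = sample_augmentation(random.Random(0))
white = normalize_pixel((1.0, 1.0, 1.0))
```

A pixel equal to the per-channel mean maps to zero after normalization, and a pure white pixel maps to roughly 2.2 to 2.6 depending on the channel.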

Table 2. Tensor dimension flow in CT-Net.

Stage	Operation	Output Shape
Input	Raw Image	[B, 3, 224, 224]
Backbone	ConvNeXt-Tiny Features	[B, 768, 7, 7]
Tokenization	Flatten and Permute	[B, 49, 768]
Encoder	Transformer Layers	[B, 49, 768]
Pooling	Global Average Pooling	[B, 768]
Output	Classifier Logits	[B, K]
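
The shapes in Table 2 can be reproduced with simple integer arithmetic. The following sketch traces them without any deep learning library; the function names are hypothetical, the stride pattern (a stride-4 stem followed by three stride-2 downsampling layers, with channel widths 96, 192, 384, 768) is the standard ConvNeXt-Tiny layout, and K = 29 is taken from the class count in the Figure 5 caption.

```python
def convnext_tiny_feature_shape(batch, height, width):
    """Shape of the last ConvNeXt-Tiny feature map for a given input size."""
    channels = (96, 192, 384, 768)
    h, w = height // 4, width // 4          # stem: 4x4 convolution, stride 4
    for _ in channels[1:]:                  # three stride-2 downsampling layers
        h, w = h // 2, w // 2
    return (batch, channels[-1], h, w)


def tokenize(shape):
    """Flatten the spatial grid and move channels last: [B, C, H, W] -> [B, H*W, C]."""
    b, c, h, w = shape
    return (b, h * w, c)


def trace_shapes(batch, num_classes, size=224):
    """Return (stage, shape) pairs in the order listed in Table 2."""
    features = convnext_tiny_feature_shape(batch, size, size)
    tokens = tokenize(features)
    return [
        ("Input", (batch, 3, size, size)),
        ("Backbone", features),
        ("Tokenization", tokens),
        ("Encoder", tokens),                    # self-attention keeps the token shape
        ("Pooling", (batch, tokens[2])),        # average over the 49 tokens
        ("Output", (batch, num_classes)),
    ]


for stage, shape in trace_shapes(batch=4, num_classes=29):
    print(f"{stage:<13}{list(shape)}")
```

The total stride of 32 explains why a 224 × 224 input yields a 7 × 7 grid, and therefore 49 tokens of width 768.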

Table 3. Performance comparison of 8 models across 3 sub-datasets.

Model Category	Model Architecture	Sub-Dataset 1	Sub-Dataset 2	Sub-Dataset 3
Proposed	CT-Net	92.69/92.85	94.90/94.47	95.67/95.43
Modern CNN	ConvNeXt-Tiny	92.12/92.21	92.76/92.42	94.80/94.65
Traditional CNN	ResNet18	80.50/78.89	89.90/89.24	91.02/90.80
Transformer	ViT	74.47/71.93	72.04/70.82	55.39/53.39
Lightweight CNN	MobileNetV3	66.55/64.18	74.00/71.08	77.77/75.81
Lightweight CNN	EfficientNet-B0	64.22/61.81	82.14/79.19	77.82/75.87
Hybrid ML	CNN + SVM	50.90/46.30	54.30/50.67	52.47/48.56
Hybrid ML	CNN + XGBoost	45.99/43.35	48.80/46.29	48.70/46.39

Values are presented as accuracy (%)/Macro F1-score (%).
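
Because each cell of Table 3 packs two metrics into one string, a small parsing step helps when comparing models. The sketch below, with hypothetical helper names, splits the cells and computes the accuracy gain of CT-Net over the ConvNeXt-Tiny baseline on each sub-dataset; only the two rows needed for that comparison are copied from the table.

```python
# "accuracy/macro-F1" strings copied from Table 3 (Sub-datasets 1, 2, 3).
TABLE3 = {
    "CT-Net": ["92.69/92.85", "94.90/94.47", "95.67/95.43"],
    "ConvNeXt-Tiny": ["92.12/92.21", "92.76/92.42", "94.80/94.65"],
}


def parse_cell(cell):
    """Split one table cell into (accuracy, macro F1), both in percent."""
    accuracy, macro_f1 = cell.split("/")
    return float(accuracy), float(macro_f1)


def accuracy_gain(model, baseline):
    """Accuracy difference in percentage points, one value per sub-dataset."""
    return [
        round(parse_cell(m)[0] - parse_cell(b)[0], 2)
        for m, b in zip(TABLE3[model], TABLE3[baseline])
    ]


gains = accuracy_gain("CT-Net", "ConvNeXt-Tiny")
print(gains)
```

Read this way, adding the Transformer encoder improves accuracy over the backbone alone by 0.57, 2.14, and 0.87 percentage points on Sub-datasets 1, 2, and 3, respectively.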

Table 4. Computational complexity and inference efficiency comparison of different models.

Model Category	Model Architecture	Parameters (M)	FLOPs (G)	Inference Latency (ms/Image)	FPS
Proposed	CT-Net	38.87	4.77	6.11	163.55
Modern CNN	ConvNeXt-Tiny	27.84	4.46	5.44	183.76
Traditional CNN	ResNet18	11.19	1.82	4.40	226.95
Transformer	ViT	85.82	11.29	8.73	114.53
Lightweight CNN	MobileNetV3	4.24	0.23	13.84	72.28
Lightweight CNN	EfficientNet-B0	4.05	0.41	17.19	58.52
Hybrid ML	CNN + SVM	11.18	1.82	11.12	89.93
Hybrid ML	CNN + XGBoost	11.18	1.82	4.53	220.65
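
The latency and FPS columns of Table 4 are two views of the same measurement: for single-image inference, FPS is approximately 1000 divided by the latency in milliseconds. The sketch below checks that relationship on the reported values. The helper names are hypothetical, and the small residual gaps may come from how the measurements were averaged, which this section does not specify.

```python
# (model, latency in ms/image, reported FPS) copied from Table 4.
TABLE4 = [
    ("CT-Net", 6.11, 163.55),
    ("ConvNeXt-Tiny", 5.44, 183.76),
    ("ResNet18", 4.40, 226.95),
    ("ViT", 8.73, 114.53),
    ("MobileNetV3", 13.84, 72.28),
    ("EfficientNet-B0", 17.19, 58.52),
    ("CNN + SVM", 11.12, 89.93),
    ("CNN + XGBoost", 4.53, 220.65),
]


def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def relative_gap(latency_ms, reported_fps):
    """Relative difference between the implied and the reported FPS."""
    return abs(fps_from_latency(latency_ms) - reported_fps) / reported_fps


GAPS = {name: relative_gap(latency, fps) for name, latency, fps in TABLE4}

for name, latency, fps in TABLE4:
    print(f"{name:<16}{fps_from_latency(latency):8.2f} implied vs {fps:8.2f} reported")
```

All eight rows agree with the 1000/latency rule to within 1%, so either column can be used when comparing throughput.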

Table 5. Ablation study on the number of layers and attention heads.

Experiment	Layers	Heads	Precision (%)	Recall (%)	F1-Score (%)
(Default)	2	8	96.59	95.67	95.43
Layer Ablation	1	8	96.41	95.46	95.18
Layer Ablation	3	8	95.74	94.06	93.66
Layer Ablation	4	8	94.59	93.09	92.38
Head Ablation	2	4	95.74	94.08	93.80
Head Ablation	2	12	95.14	93.82	93.28
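
To see what the two ablation axes change structurally, the sketch below computes the per-head width for each head count and the parameter cost of stacking encoder layers, using the token width of 768 from Table 2. It assumes a standard Transformer encoder layer (multi-head attention, a two-layer feed-forward block, two LayerNorms). The feed-forward width of 2048 is PyTorch's default and is an assumption, since this section does not state the value the authors used; the function names are hypothetical.

```python
D_MODEL = 768  # token width from Table 2


def head_dim(d_model, num_heads):
    """Per-head width in standard multi-head attention."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


def encoder_layer_params(d_model, ffn_dim):
    """Weights and biases of one standard Transformer encoder layer."""
    attention = 4 * d_model * d_model + 4 * d_model  # Q, K, V and output projections
    feed_forward = 2 * d_model * ffn_dim + ffn_dim + d_model
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, each with scale and shift
    return attention + feed_forward + layer_norms


HEAD_DIMS = {heads: head_dim(D_MODEL, heads) for heads in (4, 8, 12)}

# Assumption: ffn_dim = 2048 (PyTorch's default), not confirmed by the paper.
PARAMS_BY_DEPTH = {
    layers: layers * encoder_layer_params(D_MODEL, 2048) for layers in (1, 2, 3, 4)
}

print(HEAD_DIMS)
print({layers: round(count / 1e6, 2) for layers, count in PARAMS_BY_DEPTH.items()})
```

Under these assumptions, changing the head count alters only the per-head width (192, 96, or 64) and leaves the parameter count unchanged, whereas each additional layer adds about 5.5 M parameters. Two layers give about 11.03 M, which is close to the gap between CT-Net (38.87 M) and ConvNeXt-Tiny (27.84 M) in Table 4, although this does not confirm the authors' configuration.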

Table 6. Performance comparison with state-of-the-art lightweight models.

Model Category	Model Architecture	Parameters (M)	FLOPs (G)	Inference Latency (ms/Image)	FPS	Precision (%)	Recall (%)	F1-score (%)
Proposed	CT-Net	38.87	4.77	6.11	163.55	96.59	95.67	95.43
Lightweight	MobileViT-S	4.96	1.44	8.70	114.99	90.75	89.32	88.70
Lightweight	MobileViT-v2	1.12	0.37	19.17	52.16	68.46	50.61	45.46
Lightweight	EfficientFormer-L1	11.42	1.30	6.96	143.70	91.27	89.54	89.01
Lightweight	FastViT-SA12	10.59	1.50	10.49	95.30	85.36	81.66	80.13

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Z.; Lu, H.; Shamshiri, S. CT-Net: A Hybrid ConvNeXt–Transformer Approach for ASL Alphabet Classification. Appl. Sci. 2026, 16, 5168. https://doi.org/10.3390/app16105168
