Article

Bidirectional Translation of ASL and English Using Machine Vision and CNN and Transformer Networks

1 Department of Languages and Cultures, West Chester University, West Chester, PA 19383, USA
2 Department of Computer Science, West Chester University, West Chester, PA 19383, USA
3 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
4 School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Athabasca, AB T9S 3A3, Canada
* Authors to whom correspondence should be addressed.
Computers 2026, 15(1), 20; https://doi.org/10.3390/computers15010020
Submission received: 24 November 2025 / Revised: 22 December 2025 / Accepted: 24 December 2025 / Published: 4 January 2026
(This article belongs to the Section AI-Driven Innovations)

Abstract

This study presents a real-time, bidirectional system for translating American Sign Language (ASL) to and from English using computer vision and transformer-based models to enhance accessibility for deaf and hard of hearing users. Leveraging publicly available sign language and text-to-gloss datasets, the system integrates MediaPipe-based holistic landmark extraction with CNN- and transformer-based architectures to support translation across video, text, and speech modalities within a web-based interface. In the ASL-to-English direction, the sign-to-gloss model achieves a 25.17% word error rate (WER) on the RWTH-PHOENIX-Weather 2014T benchmark, which is competitive with recent continuous sign language recognition systems, and the gloss-level translation attains a ROUGE-L score of 79.89, indicating strong preservation of sign content and ordering. In the reverse English-to-ASL direction, the English-to-gloss transformer trained on ASLG-PC12 achieves a ROUGE-L score of 96.00, demonstrating high-fidelity gloss sequence generation suitable for landmark-based ASL animation. These results highlight a favorable accuracy-efficiency trade-off achieved through compact model architectures and low-latency decoding, supporting practical real-time deployment.

1. Introduction

American Sign Language (ASL) is a vital mode of communication for the Deaf and Hard of Hearing (DHH) community [1,2]. However, the communication gap between ASL users and non-ASL speakers restricts access to social, educational, and professional settings. Existing approaches are limited, often focusing on static handshapes, isolated ASL alphabets [3], numbers [4], or simple voice-to-English text translations [5]. This gap is further expanded by the fact that not all hearing people know ASL, and not all deaf people know English. Studies have reported that automated solutions often assume that deaf people in the USA know English [2]. As a result, there is a lack of systems that can translate ASL into English text and English speech (sounds) into ASL sentences. The development of automated translation systems between ASL and English can address this limitation and help improve inclusivity and accessibility for the DHH community [6].
This study aims to create an ASL-to-English and English-to-ASL translation system utilizing both computer vision and machine learning techniques. In this study, we used publicly available datasets sourced from Kaggle and RWTH Aachen University. Our system trains models to recognize ASL signs accurately and convert them into English text or speech. Additionally, we employ a combination of CNN- and transformer-based models to support ASL glossing and to convert images to text. Unlike previous work, our system features an easy-to-use and intuitive web interface to translate from ASL to English and vice versa. This bidirectional approach allows us to translate an English sentence into a sequence of ASL signs, unlike previous ASL translation models, which prioritized translation from ASL to English. Users can also translate a voice recording into a sequence of ASL signs. The web interface provides a simple way to expand the current dataset, making it easy to add additional signs and samples. The current approach supports real-time translation, whereas previous architectures relied on static video sequences. The specific contributions of this work are as follows:
  • Training an NLP-based transformer model on a publicly available dataset and incorporating it with computer vision.
  • Developing a web-based interface to translate ASL into English and English into ASL, enabling real-time bidirectional translation with interactive playback, going beyond previous one-way or static video systems.
  • Demonstrating advances in gesture recognition, translation accuracy, and system efficiency through an integrated multimodal design, showing improvements over previous isolated or single-modal approaches.
  • Conducting a case study with deaf individuals and incorporating feedback from deaf educators, providing practical validation and user-centered evaluation not previously undertaken in similar research.
The remainder of this paper is organized as follows. Section 2 provides an overview of previous and related work. Section 3 outlines the methodologies used. Section 4 describes the evaluation metrics. Section 5 and Section 6 report and discuss the experimental results and challenges encountered. Section 7 concludes the paper and discusses future directions.

2. Related Work

Prior works in ASL translation technologies center primarily on American Sign Language Recognition (ASLR) and American Sign Language Translation (ASLT) and typically use traditional computer vision methods, deep learning approaches, and linguistically informed frameworks. Most of these works focus on isolated components of the ASL–English translation pipeline.
Earlier notable examples include [1,2], which used isolated sign recognition and were constrained by limited datasets that lacked contextual and sentence-level fluency. Traditional methods emphasized hand gesture classification but did not capture full linguistic meaning. The transition to continuous sign language datasets enabled more naturalistic modeling. A major breakthrough came with the RWTH-PHOENIX-Weather dataset and the Neural Machine Translation (NMT) approach in [7], which used attention-based models to map video input to glosses and written language. The introduction of transformer architectures further improved translation quality across domains. Sign language transformers, combining Connectionist Temporal Classification (CTC) and attention mechanisms, achieved state-of-the-art performance by integrating recognition and translation [7,8]. CTC enables sign language transformers to align variable-length sign videos with text without precise timing labels, allowing end-to-end recognition and translation with strong performance [9,10]. However, these systems still rely heavily on gloss-level supervision and lack multi-modal or bidirectional capabilities.
Further advances emerged with two key contributions. The Hand-Talk model introduced a multimodal architecture that fused RGB video, optical flow, and hand keypoints for sentence-level ASL translation using a transformer encoder–decoder pipeline [1]. Follow-up work extended this approach by unifying recognition and translation in a joint CTC-NMT framework, enabling weakly supervised training and improving gloss alignment [2]. While both works enhanced semantic fluency, they remained limited to one-way translation and did not incorporate audio or support real-time interaction.
Additional innovations explored Long Short-Term Memory (LSTM)-based gesture recognition. An LSTM is a type of recurrent neural network designed to learn long-range temporal dependencies in sequential data [11]. It is helpful for ASL recognition because sign language consists of time-ordered hand shapes, movements, and transitions, and LSTMs can model these temporal dynamics to distinguish between similar signs that differ in motion or timing [12]. For example, the Thai Sign Language work in [13] used MediaPipe to capture full-body landmarks, but the system still performed only one-way gesture-to-text mapping without generative or bidirectional features. In [14], YOLOv8s enabled high-speed detection and real-time ASL-to-text translation; however, the focus on hand gestures alone restricted linguistic depth. In the generative AI domain, [15] compared feed-forward, convolutional, and diffusion autoencoders for ASL alphabet reconstruction, showing that diffusion models achieved the highest Mean Opinion Score (MOS). However, this work centered on image reconstruction rather than full sequence translation.
This study builds upon these prior contributions by expanding beyond isolated components of the SLR-SLT pipeline. While previous work advanced gesture classification, gloss prediction, or one-way translation, our system unifies and extends these capabilities to support full bidirectional communication.
  • For multimodal input, we integrate video (ASL), audio (spoken English), and text using holistic MediaPipe tracking for hands, face, and full-body landmarks, providing richer spatiotemporal context than gesture-only systems.
  • For bidirectional translation, our system supports both ASL to English and English to ASL, enabling real-time interactive communication rather than one-way output.
  • For temporal and semantic modeling, we combine transformer and CNN architectures to capture long-range temporal structure and semantic alignment without relying solely on gloss intermediates.
  • Finally, for real-time scalability, we deploy a working system capable of generalizing to varied environments and future extensions (e.g., AR or wearable hardware), moving beyond controlled, dataset-bound research.
Compared to the framework of Avina et al. (2023) [1], which primarily relies on CNN-based word-level classification and smoothing techniques for bidirectional ASL–English translation, our work advances toward sequence-level continuous sign language translation by explicitly modeling temporal dependencies using transformer architectures and CTC-based training. While Avina et al. demonstrate strong performance on isolated and word-level ASL recognition, their approach does not fully address sentence-level alignment between gloss sequences and natural language. Similarly, the English-to-ASL system presented by May et al. (2024) [2] focuses on a rule-based NLP pipeline combined with MediaPipe-driven animation for proof-of-concept translation, emphasizing accessibility and deaf-centered design rather than quantitative evaluation or continuous ASL recognition. In contrast, our system unifies data-driven gloss-to-sentence modeling, benchmark-based evaluation, and holistic landmark representations (hands, face, and upper body), enabling scalable, real-time translation for both ASL-to-English and English-to-ASL directions within a single integrated framework.

3. Methodology

In this section, we first describe the data used to develop the framework. Next, we present the AI models used to transform ASL into English, followed by the models for English-to-ASL translation. We then describe the data preprocessing steps and the model architectures. After that, we discuss the training setup and present the user interface developed for the system (see Figure 1). Finally, we explain how all processes and interfaces are integrated.

3.1. Data

A major challenge in sign language translation is the scarcity of large parallel corpora linking written text to ASL [16]. This domain therefore relies on a mix of continuous sign-language corpora and large-scale text–gloss or isolated-sign resources:
  • RWTH-PHOENIX-Weather 2014T (GSL). A widely used continuous sign language translation benchmark with signing videos aligned to gloss annotations and German sentences [17].
  • How2Sign (ASL). A large-scale continuous ASL dataset providing signing videos aligned to English transcripts, with additional annotations/modalities such as gloss and keypoints [18].
  • ASLG-PC12. A large English–ASL sentence–gloss parallel corpus generated via rule-based transformations and verified by experts, designed to support text–gloss modeling at scale [19].
  • WLASL. A word-level ASL dataset containing isolated sign videos labeled by gloss, commonly used for gloss-level ASL modeling and sign realization/visualization tasks [20].
Other notable datasets in the broader sign language translation/recognition landscape include large isolated-sign datasets (e.g., MS-ASL and ASL Citizen) and additional continuous SLT corpora (e.g., CSL-Daily for CSL) [21,22,23].

3.2. ASL to English

Our sign-to-English direction is implemented as a cascaded pipeline with three stages: (a) sign video → gloss, (b) gloss → spoken-language sentence, and (c) sentence → English. In the current stage, we train and evaluate the video-to-gloss and gloss-to-sentence components using RWTH-PHOENIX-Weather 2014T [17], which provides continuous signing videos with aligned gloss annotations and German sentences. The resulting German sentences are then translated into English using a German→English machine translation model.

3.2.1. Sign Video to German Gloss

For the video-to-gloss component, we use RWTH-PHOENIX-Weather 2014T [17], a widely adopted benchmark for continuous sign language recognition and translation. The dataset contains weather forecast broadcasts recorded between 2009 and 2011, with signing video at 25 fps (interpreter box resolution 210 × 260), gloss annotations, and aligned German sentences [17]. For our experiments, we follow the official dataset splits provided by RWTH-PHOENIX-Weather 2014T [17], using 70% of the samples for training, 15% for validation, and 15% for testing.
RWTH-PHOENIX-Weather 2014T is based on German Sign Language (GSL). Although GSL and ASL are linguistically distinct and not mutually intelligible, the underlying visual and kinematic structure of sign languages is largely shared [7,24,25]. Across sign languages, communication relies on similar components such as hand shape, hand motion, orientation, location, and non-manual cues, all of which evolve continuously over time [26]. Consequently, datasets in one sign language can be leveraged to learn general visual–temporal representations of signing, particularly for training the visual encoder and temporal modeling components. Following the design guidelines in [27], we use RWTH-PHOENIX-Weather 2014T to train our CNN–BiLSTM video-to-gloss model. While this work focuses on the GSL-based training setup, continuous ASL datasets such as How2Sign [18] provide a natural path for future ASL-specific extensions.
We adopt the model design guidelines from Hu et al. [27] primarily for practicality and deployability. Recent continuous sign language recognition models often introduce substantial computational and memory overheads that make GPU-only inference the default in many settings. In contrast, the architecture in Hu et al. [27] provides a favorable efficiency–accuracy trade-off for our use case: the released checkpoint is compact (under 400 MB), does not impose strict storage constraints, and enables CPU-based inference with reasonable latency. In our preliminary tests, the model produced gloss predictions in approximately 2–3 s on CPU, which aligns with our goal of a lightweight pipeline that can run without specialized hardware.

3.2.2. German Gloss to German Sentence

To convert the predicted German gloss sequences into well-formed German sentences, we train a gloss-to-text model using RWTH-PHOENIX-Weather 2014T [17]. This dataset provides parallel gloss and spoken-language sentence annotations (distributed in CSV files), which allows us to directly learn a mapping from gloss sequences to German sentences.
We follow the official RWTH-PHOENIX-Weather 2014T splits used in Section 3.2.1, using the same 70:15:15 split train/validation/test partition to ensure consistency across pipeline components and avoid overlap between training and evaluation data. Our gloss-to-sentence model follows a standard sequence-to-sequence Transformer architecture based on Vaswani et al. [28]. Gloss tokens are treated as the source sequence and German word tokens as the target sequence. The model is trained with teacher forcing to maximize the likelihood of the reference German sentence conditioned on the input gloss sequence.

3.2.3. German Sentence to English Sentence

After generating German sentences from gloss in Section 3.2.2, we translate these German sentences into English using a transformer-based Neural Machine Translation (NMT) model. We follow the recent Gloss2Text study of Fayyazsanavi et al. [29], which demonstrates that multilingual sequence-to-sequence LLMs (e.g., NLLB-200) can be effectively leveraged in low-resource sign-language translation settings [29]. Motivated by their findings, we adopt a similar multilingual transformer NMT setup for the German → English sentence translation stage, which is straightforward to integrate.
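As a concrete illustration, the sketch below shows how a compact multilingual NMT checkpoint can be used for this German → English stage. The checkpoint name (facebook/nllb-200-distilled-600M), the language codes, and the decoding settings are illustrative assumptions rather than the exact configuration used in this work.

```python
# Minimal sketch of the German -> English sentence translation stage using a
# multilingual seq2seq model from the NLLB-200 family. Checkpoint name and
# language codes are illustrative assumptions, not the exact configuration used.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"  # assumed compact NLLB-200 variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="deu_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def german_to_english(sentence: str) -> str:
    """Translate one German sentence produced by the gloss-to-text stage."""
    inputs = tokenizer(sentence, return_tensors="pt")
    # Force the decoder to start generating in English (eng_Latn).
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        num_beams=1,          # greedy decoding, matching the low-latency setup
        max_new_tokens=64,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(german_to_english("morgen regnet es im norden"))
```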

3.3. English to ASL

We additionally built an English-to-gloss model to generate an intermediate gloss sequence from a provided English sentence. This component follows a standard transformer sequence-to-sequence architecture based on [28], where the English token sequence is encoded and a decoder autoregressively predicts the corresponding gloss tokens. The predicted gloss sequence is then used to retrieve the corresponding sign realizations from WLASL, enabling a direct bridge from text to sign-level animation.
Finally, to support English-to-ASL animation generation, we used the Word-Level American Sign Language dataset (WLASL), which contains over 12,000 videos covering more than 2000 lexical signs [20]. For each predicted gloss token, we look up the matching WLASL entry, select a representative video, and extract full landmark data using MediaPipe (see Figure 2). These extracted keypoints are stored in the MessagePack format for efficient retrieval during run-time and are used to drive the landmark-based ASL animation pipeline. The working procedure for the English-to-ASL interface is described in Section 3.7.1.
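A minimal sketch of this offline landmark-extraction step is shown below; the directory layout, field names, and MessagePack schema are assumptions for illustration, not the exact storage format used by the system.

```python
# Sketch of the offline WLASL pre-processing step: extract holistic landmarks
# from one sign video with MediaPipe and store them as MessagePack for fast
# retrieval at run time. File layout and field names are illustrative.
import cv2
import msgpack
import mediapipe as mp

def _to_list(landmarks):
    # Convert a MediaPipe landmark list to plain [x, y, z] triples (or None).
    if landmarks is None:
        return None
    return [[lm.x, lm.y, lm.z] for lm in landmarks.landmark]

def extract_landmarks(video_path: str, out_path: str) -> None:
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append({
            "pose": _to_list(result.pose_landmarks),
            "left_hand": _to_list(result.left_hand_landmarks),
            "right_hand": _to_list(result.right_hand_landmarks),
            "face": _to_list(result.face_landmarks),
        })
    cap.release()
    holistic.close()
    with open(out_path, "wb") as f:
        f.write(msgpack.packb({"frames": frames}))

extract_landmarks("wlasl/videos/hello.mp4", "glossary/hello.msgpack")
```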

3.4. Data Pre-Processing

Before training and evaluation, the annotation text and gloss sequences are normalized. For RWTH-PHOENIX-Weather 2014T, punctuation such as periods is removed, and samples containing numbers or plus signs are excluded to reduce tokenization noise [27].
During training, video preprocessing follows a standard augmentation pipeline to improve robustness to appearance and temporal variability. Frames are resized to 256 × 256 and randomly cropped to 224 × 224, with horizontal flipping and random temporal scaling applied as lightweight augmentations.
At inference time (validation/testing and runtime decoding), preprocessing is made deterministic to ensure stable outputs. A central 224 × 224 crop is selected, and frames are sub-sampled at a fixed rate (e.g., every two frames starting from the first frame) without stochastic augmentations such as flipping or temporal scaling. To further prioritize low-latency inference, decoding is performed with greedy decoding (beam size 1), since increasing beam size typically improves translation quality at the cost of slower decoding [30]. Figure 3 summarizes the complete preprocessing flow used in our system.
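The deterministic inference-time preprocessing can be summarized by a short sketch such as the following; the normalization details are assumptions, while the crop size and frame stride mirror the description above.

```python
# Sketch of deterministic inference preprocessing: resize, central 224x224 crop,
# and fixed-rate temporal sub-sampling (every second frame, starting at frame 0).
import cv2
import numpy as np

def preprocess_for_inference(frames: list[np.ndarray], stride: int = 2) -> np.ndarray:
    processed = []
    for frame in frames[::stride]:                  # keep every `stride`-th frame
        frame = cv2.resize(frame, (256, 256))
        top = (256 - 224) // 2
        crop = frame[top:top + 224, top:top + 224]  # central 224x224 crop
        crop = crop.astype(np.float32) / 255.0      # assumed [0, 1] scaling
        processed.append(crop)
    # Shape: (T, 224, 224, 3), ready to be converted to the model's input layout.
    return np.stack(processed)
```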

3.5. Model Architectures

In this section, we introduce the two core models used in our system: the English-to-gloss transformer and the ASL-to-English CNN transformer (see Figure 4), followed by their training and implementation details.
For sign video to gloss recognition, we adopt a lightweight hybrid CNN–BiLSTM architecture (see Figure 4). The CNN extracts frame-level visual features, and a BiLSTM models temporal dependencies across the resulting feature sequence, enabling the recognition of continuous signing without requiring explicit frame-level segmentation. For gloss-to-text generation (German gloss → German sentence) and sentence-level translation (German → English), we use standard transformer sequence-to-sequence models. Transformers are widely used for language modeling and translation and have also been successfully applied in vision and multimodal settings [31]. Compared to recurrent architectures, transformers provide strong performance for text generation while benefiting from parallelizable training and efficient inference.
For the sign-to-gloss component, each video frame is first processed by a 2D CNN to produce a frame-level feature representation. The resulting feature sequence is then passed through a 1D temporal convolution module to capture short-range motion patterns and to improve local temporal smoothness. Next, a Bidirectional Recurrent Layer (BiLSTM) models longer-range dependencies across the entire sequence. The network outputs per-time-step gloss logits, and Connectionist Temporal Classification (CTC) is used to map the unsegmented frame sequence to a gloss sequence. Following this design, lightweight self-emphasizing attention modules are inserted within the CNN backbone to highlight informative spatial regions (e.g., hands/face) and temporally discriminative frames, improving recognition without expensive pose/heatmap supervision [27]. During inference, we decode using greedy CTC decoding (beam size 1) to prioritize low-latency recognition [30].
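A simplified PyTorch sketch of this sign-to-gloss pipeline is shown below. The layer sizes, the blank-token index, and the omission of the self-emphasizing attention modules of [27] are simplifications for illustration, not the exact architecture used.

```python
# Minimal sketch: per-frame 2D CNN features -> 1D temporal convolution -> BiLSTM
# -> per-step gloss logits trained with CTC, decoded greedily (beam size 1).
import torch
import torch.nn as nn

class SignToGloss(nn.Module):
    def __init__(self, num_glosses: int, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.frame_cnn = nn.Sequential(             # stand-in for the 2D CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.temporal_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2)
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_glosses + 1)  # +1 for CTC blank

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W) -> per-step gloss logits (B, T, num_glosses + 1)
        b, t = video.shape[:2]
        feats = self.frame_cnn(video.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal_conv(feats.transpose(1, 2)).transpose(1, 2)
        feats, _ = self.bilstm(feats)
        return self.classifier(feats)

def greedy_ctc_decode(logits: torch.Tensor, blank: int = 0) -> list[int]:
    # Collapse repeated predictions and drop blanks (assumes batch size 1).
    best = logits.argmax(-1).squeeze(0).tolist()
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(idx)
        prev = idx
    return out
```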
For the gloss-to-text stage, we adopt a transformer encoder–decoder architecture based on the NLLB-200 family, which provides a multilingual sequence-to-sequence formulation suitable for generating spoken-language sentences from discrete gloss token sequences. We use the NLLB-200 600M variant as a practical accuracy–efficiency trade-off [29]: it remains lightweight enough for fast inference and easy loading in low-compute environments while retaining strong multilingual generation capacity.
The English-to-ASL transformer translates an English sentence into a sequence of ASL signs. The sentence is first translated into ASL gloss using the English-to-gloss transformer. We used MediaPipe Holistic to extract landmark points from videos in the Word-Level American Sign Language (WLASL) dataset, storing them in a custom landmark dataset. Given the predicted gloss sequence, we retrieve the corresponding landmark sequences from the dataset to generate an ordered sequence of ASL signs.

3.6. Training Setup

For the sign-to-gloss component, the model is trained with the Adam optimizer for 80 epochs using an initial learning rate of 1 × 10⁻⁴, decayed by a factor of 5 after epochs 40 and 60, and a weight decay of 10⁻³ [27]. In our implementation, we leverage the released checkpoint to minimize additional training overhead.
For the gloss-to-text model, the training protocol is based on the Gloss2Text setup [29], but adapted for fast training and deployment on low-compute environments. We initialize from the distilled NLLB-200 600M encoder–decoder model and train the model end-to-end (without LoRA), using AdamW with β₁ = 0.9 and β₂ = 0.998. A batch size of 64 is used to take advantage of the available VRAM on an A100 GPU, improving throughput.
For the English-to-gloss model trained on the ASLG-PC12 dataset, we use 2 encoder blocks and 2 decoder blocks. The attention modules use 512 hidden units and 8 attention heads, and the transformer block's feed-forward layers use 2048 hidden units. Finally, we apply a dropout rate of 0.1 to mitigate overfitting.
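This configuration corresponds to a standard nn.Transformer setup, sketched below; the vocabulary sizes, the learned positional-embedding scheme, and the maximum sequence length are illustrative assumptions.

```python
# Sketch of the English-to-gloss transformer configuration described above
# (2 encoder / 2 decoder blocks, d_model = 512, 8 heads, 2048-unit feed-forward
# layers, dropout 0.1). Embedding details are assumptions.
import torch
import torch.nn as nn

class EnglishToGloss(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, max_len: int = 128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, 512)
        self.tgt_embed = nn.Embedding(tgt_vocab, 512)
        self.pos_embed = nn.Embedding(max_len, 512)   # learned positions (assumed)
        self.transformer = nn.Transformer(
            d_model=512, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=2048, dropout=0.1,
            batch_first=True,
        )
        self.out = nn.Linear(512, tgt_vocab)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        src_pos = torch.arange(src_ids.size(1), device=src_ids.device)
        tgt_pos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        src = self.src_embed(src_ids) + self.pos_embed(src_pos)
        tgt = self.tgt_embed(tgt_ids) + self.pos_embed(tgt_pos)
        # Causal mask so the decoder only attends to previously generated glosses.
        mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(src_ids.device)
        return self.out(self.transformer(src, tgt, tgt_mask=mask))
```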

3.7. Interface

A key component of our system is a real-time interface that exposes the model’s capabilities to end users. This interface allows users to interact with the system without needing to run any models locally. The interface supports three main functions, each accessible through its own page (see Figure 1): English-to-ASL translation, recording ASL signs for the glossary library, and ASL-to-English translation.
The backend is implemented in Flask, and the frontend is built in AstroJS. Communication occurs through routes prefixed with /api. Both codebases are integrated using Nix and deployed through a NixOS virtual machine (see Figure 5), as described in Section 3.8.

3.7.1. English-to-ASL Interface

The “English-to-ASL” interface (see Figure 1 and Figure 8) allows users to enter or speak English sentences (see Figure 6). The SpeechRecognition API can optionally be used to capture spoken English input. As described in Section 3.3, the English-to-ASL transformer is responsible for converting English input into ASL representations. The “English-to-ASL” interface uses this transformer to retrieve MediaPipe-based animations and display them to users. Users may also type English sentence(s), which are then passed to the transformer to generate and present the corresponding ASL animations (see Figure 7). Figure 8 illustrates an example in which a user says “hello, how are you?”, and the interface, with the help of the transformer model, produces the gloss sequence “hello, how you” along with the corresponding ASL animation.
The translation process consists of the following steps (a minimal backend sketch follows this list):
  • Sending the English sentence to /api/gloss.
  • The backend uses the model described in Section 3.5 to translate it into ASL gloss terms.
  • The backend returns a list of gloss strings.
  • For each gloss term, the frontend requests its animation from /api/word/<name> and renders the animation on a Canvas element, which is described in Section 3.7.2.
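The sketch below illustrates the /api/gloss route in Flask, which the backend uses as described above; the helper name english_to_gloss() and the JSON request/response shapes are illustrative placeholders rather than the actual implementation.

```python
# Minimal Flask sketch of the /api/gloss route used by the interface.
from flask import Flask, jsonify, request

app = Flask(__name__)

def english_to_gloss(sentence: str) -> list[str]:
    # Placeholder for the trained English-to-gloss transformer (Section 3.5).
    # Here we simply uppercase and split the input so the route is runnable.
    return sentence.upper().replace("?", "").replace(",", "").split()

@app.route("/api/gloss", methods=["POST"])
def gloss():
    sentence = request.get_json().get("sentence", "")
    gloss_terms = english_to_gloss(sentence)      # e.g., ["HELLO", "HOW", "YOU"]
    return jsonify({"gloss": gloss_terms})
```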

3.7.2. Recording the ASL Interface

Animations used by the “English-to-ASL” interface (Section 3.7.1; see Figure 1) must first be recorded through the “Record ASL” interface. As shown in Figure 7, users record themselves signing a gloss, after which the system extracts landmarks and generates a 3D animation. If satisfied, the user may save the animation to the glossary library. Animations recorded in this step are then used when the user translates English to ASL.
This page operates using the following steps:
  • Video is captured via navigator.getUserMedia.
  • The recording is sent to /api/mark.
  • The backend applies MediaPipe Holistic to extract landmarks.
  • Landmark data are returned as 3D coordinates for each frame.
  • The frontend replays the animation for user verification.
  • If approved by the user who is recording, the animation is saved to /api/word/<name> with the supplied gloss name.
For bulk creation, we automated this workflow using the WLASL dataset. Our script extracts the gloss name from each video filename, processes the video automatically, and stores the animation in the glossary library.
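A sketch of such a bulk-processing script is shown below; the module name landmarks, the extract_landmarks() helper (from the sketch in Section 3.3), and the directory layout are hypothetical.

```python
# Sketch of the bulk glossary-creation script: the gloss name is taken from each
# WLASL video filename, landmarks are extracted, and the result is stored in the
# glossary library. Paths and module names are illustrative assumptions.
from pathlib import Path

from landmarks import extract_landmarks  # helper sketched in Section 3.3 (assumed module)

def build_glossary(video_dir: str, glossary_dir: str) -> None:
    out = Path(glossary_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        gloss_name = video.stem.lower()            # e.g., "hello.mp4" -> "hello"
        extract_landmarks(str(video), str(out / f"{gloss_name}.msgpack"))

build_glossary("wlasl/videos", "glossary")
```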

3.7.3. ASL-to-English Interface

The “ASL to English” page allows users to record ASL and receive an English translation (see Figure 8). This process involves translating sign videos into English sentences, as described in Section 3.2. More specifically, the pipeline consists of converting sign videos to gloss and then translating gloss into English sentence(s). After these transformations, the interface displays the corresponding English sentence(s) produced from the ASL input.
Figure 8. Interaction flow for the English-to-ASL translation system. After the user provides an English sentence, the frontend sends a gloss request to the backend, receives gloss terms, fetches landmark files for each term (e.g., hello, how, you), and finally, renders the ASL animations in sequence.
  • The frontend captures 30-frame video segments using getUserMedia.
  • Each segment is uploaded to /api/a2e.
  • The backend processes the segment with the ASL-to-English pipeline described in Section 3.2.
  • The backend returns an English sentence, which is displayed on the page.

3.8. Integration and Deployment

We used the Nix build system to integrate the backend and frontend in a unified, reproducible environment. This approach ensures consistent builds and deployments across development and production systems. Backend dependencies for the Python services are managed using the uv package manager.
Integration: The frontend is built using buildNpmPackage, while the backend is packaged with uv2nix. Both components are defined and exposed through flake.nix, allowing them to be built, composed, and reused in a consistent manner.
Deployment: The backend is deployed as a systemd service on NixOS and served using gunicorn. A NixOS virtual machine integrates the frontend and backend behind an Nginx reverse proxy. The VM is designed to be stateless, with the exception of controlled updates to the glossary library when required.

4. Model Evaluation

To rigorously assess the performance of our sign-to-gloss and gloss-to-sentence translation framework, we employed evaluation metrics tailored to each translation stage and task objective. Since the framework consists of multiple components trained using different AI models, a single metric is insufficient to capture overall performance. Instead, we adopt task-specific evaluation measures commonly used in sign language processing and machine translation.

4.1. Evaluation of the Video-to-Gloss Model

Word Error Rate (WER) Metric: The video-to-gloss module is evaluated using the WER, which measures the discrepancy between the predicted gloss sequence and the ground-truth gloss annotation. WER is widely used in speech recognition and sign language recognition tasks, as it accounts for insertion, deletion, and substitution errors.
\mathrm{WER} = \frac{S + D + I}{N}, (1)
where S denotes the number of substitutions, D denotes the number of deletions, I denotes the number of insertions, and N is the total number of words in the reference gloss sequence.
A lower Word Error Rate (WER) indicates better model performance, with a value of 0 corresponding to a perfect match between the predicted and reference gloss sequences. Because gloss annotations preserve the core linguistic structure while omitting full grammatical inflection, WER serves as a meaningful and interpretable measure of recognition accuracy for the ASL-to-gloss task. Prior work has shown that a WER below 30% is sufficiently accurate to be useful to human transcriptionists, providing performance comparable to human-assisted transcription workflows [32].
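For reference, WER can be computed over gloss tokens with a standard edit-distance routine, as in the following self-contained sketch (the example sequences are illustrative).

```python
# Small sketch of WER over gloss tokens: minimum number of substitutions,
# deletions, and insertions (Levenshtein distance) divided by reference length.
def wer(reference: list[str], hypothesis: list[str]) -> float:
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution / match
    return d[n][m] / max(n, 1)

# Example: one substitution in a four-token reference -> WER = 0.25.
print(wer("REGEN MORGEN NORD STARK".split(), "REGEN HEUTE NORD STARK".split()))
```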

4.2. Evaluation of the Gloss-to-Sentence Model

Bilingual Evaluation Understudy (BLEU) and BLEU-4 Metrics: For the gloss-to-sentence translation module, we evaluate performance using the BLEU score, a standard metric for machine translation. BLEU measures the n-gram overlap between the generated sentence and one or more reference translations, thereby capturing both lexical accuracy and fluency.
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right), (2)
where p_n represents the precision of n-grams, w_n are weighting factors (typically uniform), N is the maximum n-gram length, and BP is the brevity penalty, defined as follows:
\mathrm{BP} = \begin{cases} 1, & \text{if } c > r \\ \exp\!\left(1 - \dfrac{r}{c}\right), & \text{if } c \le r \end{cases}, (3)
where c is the length of the generated translation, and r is the length of the reference translation.
Higher BLEU scores indicate better translation quality, reflecting improved alignment between generated and reference English sentences [33]. Recent studies have reported BLEU-4 scores either instead of or in addition to the general BLEU score, reflecting the widespread adoption of BLEU with a fixed n-gram order in contemporary evaluation practices [34,35]. Therefore, in our study, we evaluate model performance using both the BLEU score and the BLEU-4 score. BLEU-4 is a specific and widely adopted instantiation of BLEU in which N = 4 and uniform weights are used, defined as follows:
w_1 = w_2 = w_3 = w_4 = \frac{1}{4}, (4)
resulting in
\text{BLEU-4} = \mathrm{BP} \cdot \exp\!\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right). (5)
From Equations (2) and (5), we observe that the two formulations are structurally identical, differing only in the choice of the maximum n-gram order and the weighting scheme. Consequently, BLEU-4 is not a distinct metric but a specific instantiation of BLEU with N = 4 and uniform weights. This variant has become the de facto standard reported in most machine translation research.
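As a quick illustration, BLEU-4 with uniform weights can be computed with an off-the-shelf implementation such as NLTK; the example sentences and the choice of smoothing below are illustrative, and any standard implementation (e.g., sacrebleu) could be substituted.

```python
# Corpus-level BLEU-4 with uniform 1/4 weights, matching Equation (5).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["tomorrow", "it", "will", "rain", "in", "the", "north"]]]
hypotheses = [["tomorrow", "rain", "is", "expected", "in", "the", "north"]]

score = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),           # BLEU-4 with uniform weights
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {100 * score:.2f}")
```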
The use of WER and BLEU enables a comprehensive evaluation of our multi-stage framework. WER effectively captures recognition errors at the gloss level, which is critical since errors at this stage propagate to downstream translation modules. BLEU, on the other hand, evaluates semantic and syntactic fidelity in the final English output. By employing task-appropriate metrics at each stage, we ensure that performance evaluation accurately reflects the strengths and limitations of individual model components. This multi-metric evaluation strategy allows us to identify error sources, guide model improvements, and provide a fair comparison with existing ASL translation systems.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) Metric: ROUGE is a family of automatic evaluation metrics commonly used in natural language generation tasks such as summarization and machine translation. ROUGE-L measures the similarity between a generated text (candidate) and one or more reference texts by computing the longest common subsequence (LCS) between them [36,37].
Formally, let S be the candidate summary and R be a reference summary. We define the following:
\mathrm{LCS}(S, R) = \text{length of the longest common subsequence of } S \text{ and } R. (6)
Using LCS, we compute the following:
\mathrm{Precision}_{\mathrm{LCS}} = \frac{\mathrm{LCS}(S, R)}{|S|}, (7)
\mathrm{Recall}_{\mathrm{LCS}} = \frac{\mathrm{LCS}(S, R)}{|R|}, (8)
where |S| and |R| denote the number of tokens in S and R, respectively.
The ROUGE-L score is defined as the harmonic mean of precision and recall:
\text{ROUGE-L} = \frac{(1 + \beta^2) \cdot \mathrm{Precision}_{\mathrm{LCS}} \cdot \mathrm{Recall}_{\mathrm{LCS}}}{\beta^2 \cdot \mathrm{Precision}_{\mathrm{LCS}} + \mathrm{Recall}_{\mathrm{LCS}}}, (9)
where β is typically set to 1, yielding the balanced F1 score.
ROUGE-L scores range from 0 to 1 (equivalently, 0–100 when reported as percentages), with higher values indicating greater similarity between the candidate and reference texts. A ROUGE-L score close to 1 suggests high sequence overlap and better performance, while a score near 0 indicates low similarity [37]. In many summarization and generation evaluation tasks, ROUGE-L F1 scores above approximately 0.25–0.40 are considered acceptable or competitive, depending on task difficulty and dataset characteristics. Scores above 0.50 often indicate strong agreement with reference texts in extractive summarization benchmarks [36].
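The following self-contained sketch computes ROUGE-L (F1 with β = 1) directly from the LCS-based definitions above; the token sequences are illustrative.

```python
# ROUGE-L (F1, beta = 1) computed from the LCS definition via dynamic programming.
def rouge_l(candidate: list[str], reference: list[str]) -> float:
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)  # harmonic mean (beta = 1)

print(rouge_l("hello how you".split(), "hello how are you".split()))  # ~0.857
```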

5. Results

Table 1 summarizes the end-to-end performance of our sign-based models on the RWTH-PHOENIX-Weather 2014T and ASLG-PC12 test sets. Our sign-to-gloss model achieved a Word Error Rate (WER) of 25.17%, along with BLEU-4 = 37.00 and ROUGE-L = 79.89. In addition, the sign-to-sentence pipeline achieved BLEU-4 = 14.95 and ROUGE-L = 33.14 on the test set. These results are competitive with recently reported performance, which typically ranges between 20% and 40% WER depending on the model architecture, training regime, and evaluation protocol [27,38].
Importantly, our approach achieves this level of performance using a relatively simple architecture and training pipeline, suggesting favorable efficiency–performance trade-offs. In addition, some design choices were made explicitly to reduce inference latency and support low-compute environments (e.g., greedy decoding with beam size 1 and selecting a compact 600 M gloss-to-text model), which can introduce a measurable drop in generation quality compared to larger models and wider-beam decoding. The remaining performance gap may be attributed to non-exhaustive hyperparameter tuning, limitations in the expressive power of the CNN-based visual embeddings, or sensitivity to noise introduced by aggressive data augmentation. We believe these factors represent clear directions for future improvement rather than fundamental limitations of the proposed framework.
We stopped the training process after observing stable training and validation losses (see Figure 9). Although additional training epochs and further hyperparameter tuning could potentially improve model performance, the achieved WER falls within a satisfactory range for the scope of this work [38].
For the intermediate gloss-to-sentence component (German gloss → German sentence), we observe a noticeable gap between dev and test scores (Dev: BLEU-4 = 25.85 vs. Test: BLEU-4 = 16.12), which suggests overfitting in this stage under the current training configuration. Figure 9 and Figure 10 provide additional evidence supporting this observation. As shown in Figure 9, the training loss for the gloss → text transformer decreases steadily and smoothly over time, indicating effective optimization and convergence on the training data. In contrast, Figure 10 shows that the dev ROUGE score improves rapidly during the early stages of training but subsequently plateaus, with only minor oscillations despite continued reductions in training loss. This divergence between training loss and validation performance is a characteristic symptom of overfitting, where the model increasingly fits the training distribution without corresponding improvements in generalization.
Prior work on Gloss2Text [29] reports that billion-parameter LLMs can exhibit similar overfitting behavior when trained on limited sign-language supervision, and demonstrates that LoRA adapters can effectively mitigate this issue by constraining the number of trainable parameters. Their results show that a 3.3 B model augmented with LoRA outperforms its fully fine-tuned counterpart while remaining memory-efficient. This motivates exploring larger backbone models with LoRA-based adaptation in future iterations, combined with stronger regularization and early-stopping strategies, to improve generalization and reduce the dev–test performance gap.
For the English-to-gloss component, our transformer model achieves strong sequence-level performance on the test set, with BLEU-4 = 93.27 and ROUGE-L = 96.00. These results indicate that the model preserves both gloss content and ordering with high fidelity, and are competitive with recent English–ASL gloss modeling results reported in the literature [39].
Table 2 summarizes key differences between our approach and recent ASL–English translation systems. Prior work by Avina et al. [1] primarily focuses on word-level classification and smoothing strategies, without addressing sentence-level alignment or continuous sign language translation. The English-to-ASL framework proposed by May et al. [2] emphasizes accessibility and prototype development using rule-based NLP and MediaPipe-driven animation, but does not provide quantitative performance evaluation. In contrast, our system supports continuous, sequence-level translation in both directions and is evaluated using established metrics, demonstrating competitive word error rates and strong sequence preservation while maintaining real-time performance.

6. Discussion

Our results highlight a practical efficiency–accuracy trade-off for bidirectional sign language translation. On the recognition side, the sign-to-gloss component achieves a WER of 25.17% with high ROUGE-L (79.89) and BLEU-4 (37.00), indicating that the model frequently recovers the correct gloss content and ordering even when exact token matches are not perfect. This aligns with the goal of using a lightweight CSLR backbone that remains usable in constrained environments while staying within the performance range reported by recent CSLR systems.
For sign-to-sentence translation, performance is lower (BLEU-4 14.95 and ROUGE-L 33.14), which is expected in a cascaded setting where errors can accumulate across stages: recognition errors in sign-to-gloss propagate to gloss-to-sentence generation, and any remaining ambiguity is further amplified during sentence-level translation. In addition, some design choices were made explicitly to prioritize low-latency inference (e.g., greedy decoding with beam size 1 and selecting compact models), which typically reduces generation quality compared to larger models and wider-beam decoding.
A key observation is the generalization gap in the gloss-to-sentence component, suggesting overfitting under the current training configuration. This may be driven by limited domain coverage, lexical sparsity, and the inherent difficulty of generating fluent spoken-language sentences from gloss sequences, which are structurally simplified and often omit grammatical markers [40]. Recent work in Gloss2Text shows that larger multilingual sequence-to-sequence models with parameter-efficient adaptation (e.g., LoRA) can improve training stability and reduce overfitting compared to training large models directly [29]. In addition, prior work has shown that incorporating pretrained language models can improve fluency and contextual coherence in generation tasks [41]. Together, these findings motivate exploring larger backbones with LoRA (and stronger regularization/early stopping) in future iterations, while still retaining a deployment-friendly footprint.

7. Conclusions

This study presents a real-time, bidirectional American Sign Language (ASL)–English translation system that integrates computer vision, deep learning, and web-based technologies to enhance communication accessibility for the Deaf and Hard of Hearing (DHH) community. The proposed system introduces a holistic, multimodal pipeline that captures spatiotemporal visual information from ASL video input, extracts body, hand, and facial landmarks, and maps these representations to linguistic meaning through a sequence of learning-based transformations.
In the ASL-to-English direction, the system employs a CNN–BiLSTM architecture with CTC to convert sign video sequences into intermediate gloss representations, which are subsequently translated into natural English sentences using transformer-based sequence-to-sequence models. In the reverse direction, the system supports English-to-ASL translation by generating gloss sequences from English text and reconstructing corresponding ASL landmark animations, enabling visually interpretable sign playback.
A web-based interface enables real-time interaction through multiple modalities, including video, audio, and text, offering an accessible and intuitive user experience across different communication preferences. By tightly integrating gesture recognition, gloss prediction, and language translation within a single end-to-end framework, this work advances beyond prior approaches that typically address these components in isolation or operate in offline settings.
Overall, this research demonstrates the feasibility of a unified, real-time ASL–English translation system that balances linguistic accuracy, responsiveness, and system scalability. Future work will focus on expanding and diversifying ASL video datasets, improving model architectures for richer linguistic modeling, and incorporating non-manual features such as facial expressions to further enhance translation quality and support robust real-world deployment.

Limitations and Future Work

While our system demonstrates strong potential in real-time, bidirectional ASL–English translation, several limitations create opportunities for future improvement. One key limitation is the size and diversity of the training data. The sign recognition and translation models were trained on a specialized German sign language dataset, which may not capture the full variability of natural sign language use, including differences in signing styles, speeds, lighting conditions, and signer characteristics. Expanding training resources using larger publicly available datasets, such as How2Sign, RWTH-Fingerspelling, or CSL-Daily, could substantially improve model generalization.
The primary challenge lies in developing a large-scale, high-quality video dataset for American Sign Language (ASL) and training models capable of directly extracting English glosses from ASL videos. Current approaches often rely on datasets such as GSL and intermediate representations like German glosses, which introduce unnecessary dependencies and limit applicability to ASL. While large-scale continuous ASL datasets such as How2Sign exist, incorporating them more directly into our framework to reduce cross-lingual mismatch and enable ASL-native training is an important direction we plan to explore.
Another limitation of the current system is its restricted vocabulary, as it depends on a manually recorded gloss-term library. This reliance constrains translation flexibility and increases the manual effort required to expand the lexicon. Future work could focus on automating gloss-term landmark generation using generative approaches such as diffusion models or 3D avatar synthesis, enabling the system to support unseen or low-resource vocabulary more efficiently.
Additionally, while the system leverages MediaPipe Holistic to extract landmarks for the hands, face, and body, it does not yet fully exploit facial expressions and other non-manual markers, which are essential components of ASL grammar. Incorporating explicit modeling of facial expressions, head movements, and other non-manual features could substantially improve grammatical accuracy and the overall naturalness of the generated translations.
The translation model also seems to have difficulty extracting necessary patterns and information from the videos themselves. This problem could be due to the visual backbone not effectively representing the features of the footage. In the future, we hope to utilize a pre-trained 3D convolutional network (I3D), as it incorporates RGB and optical flow information to encode more accurate temporal and visual cues in our feature vectors [42].
From a usability standpoint, the system could be extended to mobile, wearable, or AR platforms for deployment in real-world environments. Integration with smart glasses or mobile devices could provide hands-free accessibility for DHH users. Introducing a user feedback loop, allowing users to correct or rate translations, could further refine accuracy over time and personalize the translation experience.
In summary, while the current system lays a strong foundation for real-time, bidirectional ASL–English translation, future work will focus on scaling the data, enhancing model architectures, capturing richer linguistic features, and improving usability through continuous learning and broader deployment platforms.

Author Contributions

Conceptualization: S.A. and M.A.; methodology: S.A. and M.A.; software: B.C. and R.M.B.; validation: S.A., M.A., R.M.B., A.P., J.D., L.N. and M.A.A.D.; formal analysis: S.A., M.A., B.C., A.P. and M.A.A.D.; investigation: S.A., M.A., B.C. and A.P.; resources: S.A., M.A. and L.N.; data curation: S.A., B.C., A.P. and R.M.B.; writing—original draft preparation: S.A., M.A., J.D. and A.P.; writing—review and editing: S.A., M.A., R.M.B., L.N. and M.A.A.D.; visualization: B.C.; supervision: M.A. and S.A.; project administration: M.A. and S.A.; funding acquisition: M.A., S.A. and M.A.A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the Pennsylvania State System of Higher Education (PASSHE) Faculty Professional Development Council (FPDC) grant.

Data Availability Statement

All data are publicly available. Here is the URL: https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/ (accessed on 25 August 2025) and https://www.kaggle.com/datasets/ayuraj/asl-dataset (accessed on 25 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Avina, V.D.; Amiruzzaman, M.; Amiruzzaman, S.; Ngo, L.B.; Dewan, M.A.A. An AI-Based Framework for Translating American Sign Language to English and Vice Versa. Information 2023, 14, 569. [Google Scholar] [CrossRef]
  2. May, J.; Brennan, K.; Amiruzzaman, S.; Amiruzzaman, M. English to American Sign Language: An AI-Based Approach. J. Comput. Sci. Coll. 2024, 40, 164–175. [Google Scholar]
  3. Shin, J.; Matsuoka, A.; Hasan, M.A.M.; Srizon, A.Y. American sign language alphabet recognition by extracting feature from hand pose estimation. Sensors 2021, 21, 5856. [Google Scholar] [CrossRef]
  4. Hellara, H.; Barioul, R.; Sahnoun, S.; Fakhfakh, A.; Kanoun, O. Improving the accuracy of hand sign recognition by chaotic swarm algorithm-based feature selection applied to fused surface electromyography and force myography signals. Eng. Appl. Artif. Intell. 2025, 154, 110878. [Google Scholar] [CrossRef]
  5. Balamani, T.; Subiksha, K.; Swathi, D.; Vennila, V.; Vishnu, B.J. IYAL: Real-Time Voice to Text Communication for the Deaf. In Proceedings of the 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), Erode, India, 20–22 January 2025; IEEE: New York, NY, USA, 2025; pp. 105–111. [Google Scholar]
  6. Kaur, B.; Chaudhary, A.; Bano, S.; Yashmita; Reddy, S.; Anand, R. Fostering inclusivity through effective communication: Real-time sign language to speech conversion system for the deaf and hard-of-hearing community. Multimed. Tools Appl. 2024, 83, 45859–45880. [Google Scholar] [CrossRef]
  7. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10023–10033. [Google Scholar]
  8. Tan, S.; Miyazaki, T.; Khan, N.; Nakadai, K. Improvement in sign language translation using text CTC alignment. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 3255–3266. [Google Scholar]
  9. Graves, A. Connectionist temporal classification. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 61–93. [Google Scholar]
  10. Kubra, K.T.; Umair, M.; Zubair, M.; Naseem, M.T.; Lee, C.S. Detection and Recognition of Bilingual Urdu and English Text in Natural Scene Images Using a Convolutional Neural Network–Recurrent Neural Network Combination with a Connectionist Temporal Classification Decoder. Sensors 2025, 25, 5133. [Google Scholar] [CrossRef]
  11. Krichen, M.; Mihoub, A. Long short-term memory networks: A comprehensive survey. AI 2025, 6, 215. [Google Scholar] [CrossRef]
  12. Mittal, A.; Kumar, P.; Roy, P.P.; Balasubramanian, R.; Chaudhuri, B.B. A modified LSTM model for continuous sign language recognition using leap motion. IEEE Sens. J. 2019, 19, 7056–7063. [Google Scholar] [CrossRef]
  13. Jintanachaiwat, W.; Jongsathitphaibul, K.; Pimsan, N.; Sojiphan, M.; Tayakee, A.; Junthep, T.; Siriborvornratanakul, T. Using LSTM to translate Thai sign language to text in real time. Discov. Artif. Intell. 2024, 4, 17. [Google Scholar] [CrossRef]
  14. Jia, W.; Li, C. SLR-YOLO: An improved YOLOv8 network for real-time sign language recognition. J. Intell. Fuzzy Syst. 2024, 46, 1663–1680. [Google Scholar] [CrossRef]
  15. Praun-Petrovic, V.; Koundinya, A.; Prahallad, L. Comparison of Autoencoders for tokenization of ASL datasets. arXiv 2025, arXiv:2501.06942. [Google Scholar] [CrossRef]
  16. Kahlon, N.K.; Singh, W. Machine translation from text to sign language: A systematic review. Univers. Access Inf. Soc. 2023, 22, 1–35. [Google Scholar] [CrossRef]
  17. Camgöz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  18. Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giro-i Nieto, X. How2sign: A large-scale multimodal dataset for continuous american sign language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2735–2744. [Google Scholar]
  19. Achraf, O.; Zouhour, T. English-ASL Gloss Parallel Corpus 2012: ASLG-PC12. In Proceedings of the Fourth International Conference On Information and Communication Technology and Accessibility ICTA’13, Hammamet, Tunisia, 24–26 October 2013. [Google Scholar]
  20. Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1459–1469. [Google Scholar]
  21. Joze, H.R.V.; Koller, O. MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language. arXiv 2019, arXiv:1812.01053. [Google Scholar] [CrossRef]
  22. Desai, A.; Berger, L.; Minakov, F.O.; Milan, V.; Singh, C.; Pumphrey, K.; Ladner, R.E.; Daumé, H., III; Lu, A.X.; Caselli, N.; et al. ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition. arXiv 2023, arXiv:2304.05934. [Google Scholar] [CrossRef]
  23. Zhou, H.; Zhou, W.; Qi, W.; Pu, J.; Li, H. Improving Sign Language Translation with Monolingual Data by Sign Back-Translation. arXiv 2021, arXiv:2105.12397. [Google Scholar] [CrossRef]
  24. Pfau, R.; Steinbach, M.; Woll, B. Sign Language: An International Handbook; Walter de Gruyter: Berlin, Germany, 2012; Volume 37. [Google Scholar]
  25. Huang, J.; Zhou, W.; Li, H.; Li, W. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2822–2832. [Google Scholar] [CrossRef]
  26. Koller, O. Quantitative survey of the state of the art in sign language recognition. arXiv 2020, arXiv:2008.09918. [Google Scholar] [CrossRef]
  27. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Self-Emphasizing Network for Continuous Sign Language Recognition. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  29. Fayyazsanavi, P.; Anastasopoulos, A.; Košecká, J. Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing. arXiv 2024, arXiv:2407.01394. [Google Scholar] [CrossRef]
  30. Freitag, M.; Al-Onaizan, Y. Beam Search Strategies for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, BC, Canada, 30 July–4 August 2017. [Google Scholar] [CrossRef]
  31. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  32. Gaur, Y.; Lasecki, W.S.; Metze, F.; Bigham, J.P. The effects of automatic speech recognition quality on human transcription latency. In Proceedings of the 13th International Web for All Conference, Montreal, QC, Canada, 11–13 April 2016; pp. 1–8. [Google Scholar]
  33. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  34. Jia, Y.; Ji, Y.; Xue, X.; Ren, Q.D.E.J.; Wu, N.; Liu, N.; Zhao, C.; Liu, F. A Semantic Uncertainty Sampling Strategy for Back-Translation in Low-Resources Neural Machine Translation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Vienna, Austria, 27 July–1 August 2025; pp. 528–538. [Google Scholar]
  35. Marie, B.; Fujita, A.; Rubino, R. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. arXiv 2021, arXiv:2106.15195. [Google Scholar] [CrossRef]
  36. Citarella, A.A.; Barbella, M.; Ciobanu, M.G.; De Marco, F.; Di Biasi, L.; Tortora, G. Assessing the effectiveness of ROUGE as unbiased metric in Extractive vs. Abstractive summarization techniques. J. Comput. Sci. 2025, 87, 102571. [Google Scholar] [CrossRef]
  37. Kumar, S.; Solanki, A. Improving ROUGE-1 by 6%: A novel multilingual transformer for abstractive news summarization. Concurr. Comput. Pract. Exp. 2024, 36, e8199. [Google Scholar] [CrossRef]
  38. Ranjbar, H.; Taheri, A. Continuous sign language recognition using intra-inter gloss attention. Multimed. Tools Appl. 2025, 1–19. [Google Scholar] [CrossRef]
  39. Varanasi, A.B.; Sinha, M.; Dasgupta, T.; Jadhav, C. Linguistically Informed Transformers for Text to American Sign Language Translation. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), Bangkok, Thailand, 15 August 2024; Ojha, A.K., Liu, C.H., Vylomova, E., Pirinen, F., Abbott, J., Washington, J., Oco, N., Malykh, V., Logacheva, V., Zhao, X., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 50–56. [Google Scholar] [CrossRef]
  40. Liddell, S.K. American Sign Language Syntax; Walter de Gruyter GmbH & Co KG: Berlin, Germany, 2021; Volume 52. [Google Scholar]
  41. Wang, H.; Li, J.; Wu, H.; Hovy, E.; Sun, Y. Pre-trained language models and their applications. Engineering 2023, 25, 51–65. [Google Scholar] [CrossRef]
  42. Tarrés, L.; Gállego, G.I.; Duarte, A.; Torres, J.; Giró-i Nieto, X. Sign language translation from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5625–5635. [Google Scholar]
Figure 1. Main interface: (a) shortcuts to the three primary buttons, (b) English-to-ASL translation, (c) recording ASL for future use or ASL vocabulary storage, and (d) ASL-to-English translation.
Figure 2. Mapping of a handshape using MediaPipe. The red lines capture the shape of the hand, and the red dots indicate the joint points.
Figure 3. Flowchart of the dataflow after pre-processing and of the training and validation processes: GSL videos are used to obtain glosses (left), and glosses are used to obtain English sentences (right).
Figure 4. Diagram showing the architecture for the ASL-to-English model. The RGB frames are fed through the ASL-to-English model to generate an English sentence.
Figure 5. System architecture showing a NixOS virtual machine, where Nginx (ports 8080/8443) serves the static Astro frontend and proxies API requests to a Flask backend running on Gunicorn (port 8000).
Figure 6. A web interface showing the playback of the word “Hello” in ASL via a 3D animation. The “Download Data” button allows users to save the video to *.msgpack (i.e., MessagePack) format, similar to how it is stored on the server.
Figure 7. A web interface showing a recorded animation of the ASL gloss “Hello”, with an option to save the gloss to the library.
Figure 9. Training loss over time for the gloss → text transformer.
Figure 10. Dev ROUGE over time for the gloss → text transformer.
Table 1. End-to-end results across sign, gloss, and text modules (test set).

Task | Dataset | BLEU-4 | ROUGE-L | WER (%)
Sign → Gloss | RWTH-PHOENIX-Weather 2014T | 37.00 | 79.89 | 25.17
Sign → Sentence | RWTH-PHOENIX-Weather 2014T | 14.95 | 33.14 | —
English → Gloss | ASLG-PC12 | 93.27 | 96.00 | —
Table 2. Comparison of ASL–English translation systems.

Study | Translation Direction | Modeling Approach | Continuous ASL | Evaluation
Avina et al. [1] | ASL ↔ English | CNN-based word-level classification with smoothing | No | Classification accuracy
May et al. [2] | English → ASL | Rule-based NLP with MediaPipe animation | No | Qualitative/prototype-based
This work | ASL ↔ English | CNN + BiLSTM/Transformer with CTC | Yes | WER = 25.17%, ROUGE-L = 79.89/96.00
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Amiruzzaman, S.; Amiruzzaman, M.; Batchu, R.M.; Dracup, J.; Pham, A.; Crocker, B.; Ngo, L.; Dewan, M.A.A. Bidirectional Translation of ASL and English Using Machine Vision and CNN and Transformer Networks. Computers 2026, 15, 20. https://doi.org/10.3390/computers15010020

