Article

Exploring Sign Language Dataset Augmentation with Generative Artificial Intelligence Videos: A Case Study Using Adobe Firefly-Generated American Sign Language Data

Computer Science Department, National University of Science and Technology Politehnica Bucharest, RO-060042 Bucharest, Romania
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 799; https://doi.org/10.3390/info16090799
Submission received: 12 July 2025 / Revised: 24 August 2025 / Accepted: 10 September 2025 / Published: 15 September 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Currently, high-quality datasets for Sign Language Recognition (SLR) are either private, proprietary, or costly to obtain. We therefore aim to mitigate this problem by augmenting a publicly available dataset with artificially generated data in order to obtain a richer and more diverse dataset. The performance of SLR systems is highly dependent on the quality and diversity of training datasets, yet acquiring large-scale, well-annotated sign language video data remains a significant challenge. This experiment explores the use of Generative Artificial Intelligence (GenAI), specifically Adobe Firefly, to create synthetic video data for American Sign Language (ASL) fingerspelling. Thirteen of the 26 letters were selected for generation, and short videos representing each sign were synthesized and processed into static frames. These synthetic frames replaced approximately 7.5% of the original dataset and were integrated into the training data of a publicly available Convolutional Neural Network (CNN) model. After retraining on the augmented dataset, accuracy did not drop and validation accuracy remained approximately the same. The resulting model achieved a maximum accuracy of 98.04%. While the performance gain was limited (less than 1%), the approach illustrates the feasibility of using GenAI tools to generate training data and supports further research into data augmentation for low-resource SLR tasks.

1. Introduction

SLR plays a pivotal role in improving digital accessibility for the deaf and hard-of-hearing community, enabling more inclusive communication interfaces in education, healthcare, and public services. While recent advances in computer vision and deep learning have led to improvements in SLR systems, their effectiveness remains highly dependent on the availability and quality of training data. Unlike datasets for spoken or written language, sign language corpora are particularly difficult to obtain at scale [1]. Collecting sign language data often requires collaboration with skilled performers, high-quality video recording, and expert annotation, all of which are time-consuming and costly. As a result, many SLR models are trained on limited, imbalanced, or low-variability datasets—especially for static fingerspelling, where precision and hand pose clarity are critical. Recent developments in GenAI offer a promising solution to this bottleneck. In particular, tools like Adobe Firefly can generate high-quality, realistic videos based on text prompts. In this research, we investigate whether synthetic videos generated via Firefly can be used to supplement an existing ASL fingerspelling dataset to improve model performance.
The specific purpose of this experiment is therefore to analyze whether GenAI video data can be integrated into an existing ASL fingerspelling dataset to improve, or at least maintain, a model’s accuracy. Using Adobe Firefly, we generated short video clips of individuals performing 13 selected ASL letters. These videos were converted into static frames and incorporated into both training and testing subsets. A publicly available deep learning model [2] for ASL recognition was then retrained on the augmented dataset, and its performance was compared to a baseline model trained on original data only.
This experiment makes the following key contributions:
  • Proposing and evaluating a pipeline for generating synthetic ASL fingerspelling video data using Adobe Firefly;
  • Assessing the impact of synthetic augmentation on recognition accuracy and validation loss in a CNN-based ASL recognition model;
  • Analyzing the effectiveness and limitations of prompt engineering in producing useful training data for sign language models.
To guide this research, we define the following Scientific Questions (SQs):
  • SQ1: Can generative AI tools like Adobe Firefly be used to create visually realistic ASL fingerspelling video data suitable for dataset augmentation?
  • SQ2: Does augmenting an existing ASL dataset with synthetic video frames affect the recognition performance (accuracy and validation loss) of a CNN-based model?
  • SQ3: What are the practical limitations and considerations of using GenAI-based prompt engineering for generating usable sign language training data?
These questions are addressed through a case study on ASL letter recognition using a public CNN model retrained with and without GenAI-augmented data. The findings are discussed in detail in the conclusion.
The remainder of this paper is organized as follows: Section 2 reviews related work on sign language recognition, dataset limitations, and prior applications of generative AI for data augmentation. Section 3 details the prompt engineering strategies used to produce synthetic ASL fingerspelling videos with Adobe Firefly. Section 4 presents the research methodology, including dataset selection, video-to-frame conversion, augmentation processing, and CNN model retraining. Section 5 explores future directions for integrating GenAI into sign language recognition workflows. Section 6 concludes the paper by summarizing the findings, discussing limitations, and proposing avenues for further research.

2. Related Works

Deep learning approaches have shown strong performance in static sign recognition tasks. For instance, recent work by Barbhuiya et al. applied modified CNN architectures based on AlexNet and VGG16 for recognizing ASL hand gestures, including both alphabets and numerals. Their system achieved a high accuracy of 99.82% using a multiclass SVM classifier for final prediction [3]. Importantly, the model was trained and evaluated on real data, using both leave-one-subject-out and random split cross-validation, and demonstrated competitive performance even on low-resource hardware setups. While their study highlights the effectiveness of CNNs for hand gesture recognition, it relies entirely on manually collected and annotated real-world datasets.
Previous research on American Sign Language recognition has explored both classical machine learning techniques and deep learning architectures. In one such study, Jain et al. investigated the use of Support Vector Machines (SVM) with various kernels and CNNs with different filter sizes and depths for recognizing ASL signs [4]. Their experiments demonstrated that a double-layer CNN with an optimal filter size of 8 × 8 achieved a recognition accuracy of 98.58%, outperforming SVM-based models. The results also highlighted the importance of architectural tuning and hyperparameter optimization in improving model performance.
Previous research has explored the use of AI-generated images to augment datasets for SLR. In our earlier work, we proposed that artificially generated images can serve as a scalable solution for training SLR models, particularly in scenarios where data collection is difficult or where underrepresented sign languages lack sufficient annotated examples. This approach has the potential to enrich existing datasets, improve model generalization, and increase inclusivity in SLR systems by simulating a broader variety of gestures and signing styles. A similar use of GenAI to address dataset incompleteness was proposed by Lan et al., where a GenAI-based Data Completeness Augmentation Algorithm was applied in the healthcare domain to improve training data through a “Quest → Estimate → Tune-up” process [5]. While their work focused on structured data in smart healthcare applications, our approach leverages generative video synthesis to augment image datasets in SLR tasks, addressing similar challenges of limited and imbalanced data.
Recent work in other domains has similarly explored the potential of GenAI to overcome data scarcity. For instance, Kasimalla et al. [6] investigated the use of GANs and GenAI to generate synthetic fault data for microgrid protection systems, a domain where transient fault data is both rare and privacy-sensitive. Their findings demonstrated that synthetic data, when statistically validated and labeled, can significantly improve model robustness and accuracy. While the application domain differs, the underlying challenge of limited access to diverse, labeled data is shared, and their methodology reinforces the viability of synthetic data generation for high-stakes machine learning tasks, including SLR.
GenAI techniques have gained traction across various domains facing data scarcity. For example, García-Pérez et al. addressed the challenge of limited annotated X-ray datasets by introducing synthetic defects into real radiographs using a Scalable Conditional Wasserstein GAN. By strategically injecting synthetic defects based on noise and location constraints, they improved defect detection performance by 17% over baselines trained on real data alone. This supports the broader applicability of GenAI-generated synthetic datasets in enhancing machine learning models in data-constrained environments, including gesture recognition and SLR [7].
Other recent studies have explored the use of deep learning methods, particularly CNNs, for recognizing static ASL gestures. One such work developed a CNN-based application capable of detecting ASL alphabets in real time using webcam input, achieving an accuracy of around 98%. The system utilized a dataset of static hand gesture images that were preprocessed and fed into the model, enabling live prediction of individual letters. While the study focused on static alphabets rather than continuous sign sequences, it demonstrates the effectiveness of CNN architectures for sign recognition tasks and highlights the potential of computer vision techniques for real-time applications in accessibility technologies [8].
Many other recent SLR studies, covering both continuous and static recognition, have reported accuracies above 95%, demonstrating the advancements in the field. For instance, Al Ahmadi et al. proposed a hybrid CNN–TCN model evaluated on British and American sign datasets, achieving around 95.31% accuracy [9]. Another study employed five state-of-the-art deep learning models (ResNet-50, EfficientNet, ConvNeXt, AlexNet, and VisionTransformer) to recognize the ASL alphabet using a large dataset of over 87,000 images. The best-performing model, ResNet-50, achieved a striking 99.98% accuracy, while EfficientNet and ConvNeXt also surpassed 99.5% [10]. In Indian Sign Language recognition, a two-stream Graph Convolution Network fusing joint and bone data achieved 98.08% Top-1 accuracy on the CSL-500 dataset [11]. Moreover, a deep hybrid model using CNN and hybrid optimizers (HO-based CNNSa-LSTM) obtained 98.7% accuracy, significantly outperforming conventional CNN, RNN, and LSTM baselines [12].

3. Prompt Engineering

Prompt engineering played a critical role in controlling the visual fidelity and semantic relevance of the generated ASL signs using Adobe Firefly [13]. Since GenAI models are highly sensitive to input prompts, we carefully designed concise prompts specifying hand shape, position, and background neutrality to generate consistent and isolated gestures for each ASL letter. These prompt designs were iteratively tested and refined to reduce generation noise, minimize stylistic variation, and ensure alignment with the dataset. By optimizing prompts, we aimed to improve the quality of synthetic data and ensure it could reasonably substitute for real samples in the training set.
Another challenge when using generative models to create sign language data is that they often generate hands or fingers incorrectly. The models sometimes add extra fingers, miss fingers, or show unnatural hand shapes [14,15,16]. This is a problem because sign language relies on precise finger positions, and even small mistakes can change the meaning of a sign. Prompt engineering helps to some extent, but it does not fully solve the issue. This shows the need for better models or post-processing methods focused on generating accurate hands.
As we mentioned previously, one of the key challenges in prompt engineering is the importance of descriptive details. GenAI models interpret prompts in a probabilistic manner, which means even small ambiguities or vague descriptions can lead to unexpected or unintended outputs. For instance, a prompt like “Human describing cat in sign language using fingers” might yield an image of a man wearing a suit with a cat’s head, pointing his fingers at a cat’s face as in Figure 1—an output that deviates significantly from the intended meaning. This occurs because the model struggles to parse the abstract concept of “describing” with sign language and instead tries to combine visual elements literally, resulting in a surreal or confusing image.
Conversely, more precise prompts often generate outputs that align closely with user expectations. A prompt such as “A human showing OK with hand near the lips” provides clear, concrete details that the AI can easily interpret, producing Figure 2, which matches the intended gesture without extraneous or bizarre features. This example highlights how clarity and careful wording in prompt construction help minimize misinterpretations by the model, leading to higher quality and more relevant results.
In our case, generating high-quality images was facilitated by first creating videos using Adobe Firefly and then extracting frames from those videos. For example, the prompt “hand of a person who shows the peace sign on a neutral background” combined with seed 963,445 resulted in Figure 3, which is a very good video output for the letter V in ASL. This approach demonstrated how detailed and explicit prompts, combined with a methodical workflow, can enhance the quality of generated content and better serve the goals of SLR dataset augmentation.
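To keep prompts and seeds reproducible across letters, it helps to record them in a simple configuration structure. The sketch below is only an illustration of such a prompt/seed log: the entry for the letter V reflects the prompt and seed reported above, while the entry for the letter A is a hypothetical placeholder, not a prompt used in this study.

# Hypothetical prompt/seed log for reproducible Firefly generation.
# Only the "V" entry reflects the prompt and seed reported above;
# the "A" entry is an illustrative placeholder.
FIREFLY_PROMPTS = {
    "V": {"prompt": "hand of a person who shows the peace sign on a neutral background",
          "seed": 963445},
    "A": {"prompt": "closed fist with the thumb resting against the side of the index finger, neutral background",
          "seed": None},  # record the seed once an acceptable video is found
}

def describe(letter):
    # Return a short human-readable summary of the stored prompt for a letter.
    entry = FIREFLY_PROMPTS[letter]
    return f"{letter}: '{entry['prompt']}' (seed={entry['seed']})"

print(describe("V"))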

4. Research Methodology

To establish a fair baseline for evaluation, we began by training the SLR model using the original dataset, without any enhancements or synthetic data. The dataset was partitioned in the following way: 87.5% of the total data was allocated for training, while the remaining 12.5% was reserved for testing and evaluation purposes. For each letter in the ASL alphabet included in the dataset, a total of 2000 images were available. Of these, 1750 images per letter were used for training the model, and 250 images per letter were used to test the model’s performance. The purpose of this initial training phase was to obtain a reference point or baseline performance of the model using only the original dataset. These baseline results were essential for comparative analysis, allowing us to later evaluate the impact of augmenting the dataset with additional generated frames.
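As a rough illustration of this split, the following sketch copies 1750 of the 2000 images of a letter into a training folder and the remaining 250 into a test folder. The per-letter directory layout used here is an assumption for the example, not necessarily the layout of the original repository.

import random
import shutil
from pathlib import Path

# Minimal sketch of the 87.5%/12.5% per-letter split described above
# (1750 training and 250 test images out of 2000 per letter).
SOURCE = Path("dataset/raw")                       # assumed: raw/<LETTER>/*.png
TRAIN, TEST = Path("dataset/train"), Path("dataset/test")
TRAIN_PER_LETTER = 1750                            # 87.5% of 2000

def split_letter(letter, seed=0):
    images = sorted((SOURCE / letter).glob("*.png"))
    random.Random(seed).shuffle(images)
    for i, img in enumerate(images):
        dest = (TRAIN if i < TRAIN_PER_LETTER else TEST) / letter
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy(img, dest / img.name)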
We then used Adobe Firefly to generate numerous videos of hand gestures representing ASL letters. Each video was manually verified; if the result was not good enough, we generated new videos by altering the prompt or the seed. Once an acceptable video was obtained, we extracted frames from it and edited them to match the appearance of the real dataset. This process is described in Figure 4.
The following tools and libraries were utilized throughout the study to implement, train, and evaluate the sign language recognition pipeline:
  • Python 3.11.8: Served as the primary programming language for scripting, data manipulation, and integration of various components in the workflow.
  • TensorFlow 2.19.0 and Keras 3.10.0: Used for building, training, and evaluating the CNN model. Keras, as a high-level API, enabled rapid prototyping, while TensorFlow provided the backend computational support. The evaluation (accuracy, loss, validation accuracy, and validation loss) was provided through Keras’s terminal output.
  • Matplotlib 3.10.3 and SciPy 1.15.3: Used for visualization and statistical analysis. Matplotlib helped plot accuracy and loss curves to monitor training progress, while SciPy supported data processing and performance evaluation tasks.
To generate the data presented in the tables, the model was trained five times for each of two training durations: 25 and 35 epochs. This approach allowed us to observe the model’s behavior and performance consistency across multiple runs. During each training session, the script recorded key performance metrics after every epoch, including training accuracy, validation accuracy, training loss, and validation loss. For the purpose of analysis and comparison, however, we extracted and reported only the metrics from the final epoch of each run. Moreover, to check whether the model overfits, we also trained it for 300 epochs in each experiment and inspected the resulting Matplotlib plots.
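The following minimal Python sketch illustrates this repeated-training protocol. The build_model() function and the train_ds/val_ds datasets are placeholders standing in for the public repository’s architecture and input pipeline, so this is an outline of the procedure rather than the exact training script.

# Sketch: retrain the model several times for a fixed number of epochs and
# keep only the final-epoch metrics of each run, as described above.
def run_experiments(build_model, train_ds, val_ds, epochs=25, runs=5):
    final_metrics = []
    for _ in range(runs):
        model = build_model()                      # fresh weights for every run
        history = model.fit(train_ds, validation_data=val_ds,
                            epochs=epochs, verbose=2)
        # keep only the last-epoch accuracy, loss, val_accuracy, val_loss
        final_metrics.append({name: values[-1]
                              for name, values in history.history.items()})
    return final_metrics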

4.1. Model Selection and Baseline Training

For the entire experiment, we utilized the open-source model called Simple Sign Language Detector, which uses a CNN for ASL letter classification. We chose this model because it is publicly available and, most importantly, achieves a high accuracy (more than 97%), which makes it well suited for checking whether the AI-generated data could raise or maintain that accuracy level. As previously mentioned, to establish baseline performance, we trained the model using the original dataset for a total of 25 epochs. This training configuration replicated the setup provided in the original repository to ensure consistency and reliability in our comparison. The goal was to evaluate how the model performed under its default conditions, prior to any dataset changes. The results obtained from this initial training phase served as a reference point for subsequent experiments. This baseline enabled us to measure the effectiveness of our proposed dataset augmentation strategy and assess any changes in classification accuracy, loss, and validation accuracy.
The baseline performance of the model is presented in Table 1:
The confusion matrix for the baseline training is presented in Figure 5. The figure shows that the letter ‘H’ is sometimes misclassified as the letter ‘G’, which is expected because these two letters are very similar in ASL. The same applies to the letters ‘S’ and ‘E’.
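A confusion matrix such as the one in Figure 5 can be produced from the trained Keras model and the test set as sketched below. Note that scikit-learn is used here purely for convenience and is an assumption on our part; it is not among the tools listed above, and the test dataset is assumed to yield batches of (images, one-hot labels).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Sketch: collect true and predicted class indices, then plot the matrix.
def plot_confusion(model, test_ds, class_names):
    y_true, y_pred = [], []
    for images, labels in test_ds:
        y_true.extend(np.argmax(labels.numpy(), axis=1))
        y_pred.extend(np.argmax(model.predict(images, verbose=0), axis=1))
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                            display_labels=class_names,
                                            xticks_rotation=45)
    plt.tight_layout()
    plt.show()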
Taking into consideration the results from Figure 6 and Figure 7 with the model accuracy and model loss, we conducted a supplementary experiment in which the model was trained for an extended duration of 35 epochs. The purpose of this experiment was to observe how the model’s performance evolved over a longer training period and to evaluate whether additional epochs would lead to improvements in accuracy or generalization. The results obtained from this extended training phase are presented in Table 2:
As we observed from the initial and extended training experiments, the model demonstrated a slight increase in accuracy alongside a gradual decrease in training loss (Table 2), indicating stable learning behavior and improved performance with additional epochs. Encouraged by these trends, we decided to extend the training duration significantly and run the model for a total of 300 epochs. The aim of this long-duration training was to assess the model’s capacity to achieve near-optimal performance and determine whether prolonged exposure to the dataset would yield meaningful improvements or lead to overfitting.
Following this extended training process, the model achieved an impressive training accuracy of 99.82% with a corresponding loss of 0.01, suggesting excellent learning of the training data. On the validation set, the model reached a validation accuracy of 99.02%, while the validation loss increased slightly to 0.13. This indicates that although the model continued to generalize well, there may be early signs of overfitting due to the extended number of epochs. Nonetheless, the results demonstrate that the model maintained strong performance even under prolonged training, further validating the robustness of the architecture and the quality of the dataset.
As illustrated in the training and validation loss curves in Figure 8 and Figure 9, the model converged effectively within the first 30–50 epochs. While training loss continued to decrease and remained low throughout the 300 epochs, the validation loss plateaued and began to exhibit minor fluctuations. This behavior indicates that further training beyond the early convergence point provided minimal benefit to generalization and may suggest early signs of overfitting. These observations highlight the potential utility of early stopping strategies for improving training efficiency without compromising model performance.
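A minimal sketch of such an early-stopping setup, using the standard Keras callback, is shown below; the patience value of 10 epochs is an assumption rather than a setting used in this study.

import tensorflow as tf

# Sketch: stop training once the validation loss stops improving and
# restore the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,
                                              restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=300, callbacks=[early_stop])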

4.2. Synthetic Video Generation with Firefly

To enhance the dataset and explore the potential of GenAI in SLR, we selected thirteen ASL letters for synthetic video generation. The selected letters were A, B, C, G, K, M, O, P, Q, R, S, U, and V. These were chosen because their corresponding hand gestures are static and do not require continuous motion for accurate representation, unlike letters such as J and Z, which involve dynamic movement and were therefore excluded from this phase of the study. For each selected letter, we generated a short video clip lasting between 3 and 5 s using Adobe Firefly, a GenAI tool capable of producing high-fidelity visual content. The videos were carefully designed to simulate realistic hand gestures corresponding to each letter, maintaining a consistent style, background, and visual clarity to support model training. This process ensured that the generated samples were not only visually accurate but also usable as input for training deep learning models under controlled conditions.
The use of synthetic data aimed to supplement the original dataset and evaluate the impact of AI-generated video samples on the overall performance of the SLR model.

4.3. Frame Extraction and Preprocessing

To incorporate the synthetic videos into the training dataset, we performed frame extraction and preprocessing to ensure consistency with the original data format. From each generated video, a total of 300 individual frames were extracted. This frame count was chosen to provide a diverse set of image samples that capture slight variations in hand positioning and gesture clarity across the short duration of each video (3–5 s).
To maintain visual consistency with the original dataset and enhance the usability of the frames for model training, we applied a series of preprocessing steps using Adobe Express (for example, flipping, contrast enhancement, and background removal, as shown in Figure 10) [17]. These steps included contrast enhancement, which improved the visibility of hand shapes and edges, and resizing, which ensured that all frames matched the spatial dimensions required by the CNN used in our experiments. This preprocessing pipeline was essential for reducing domain discrepancies between the synthetic and original images, thereby facilitating more effective model training and integration of the augmented dataset.
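For illustration, the frame-extraction and resizing step can be sketched as follows. OpenCV and the 128 x 128 target size are assumptions made for this example; in the study itself, the visual edits (contrast, background removal, flipping) were carried out in Adobe Express.

import cv2

# Sketch: sample roughly 300 frames from a short (3-5 s) Firefly clip and
# resize them to the input size expected by the CNN. If the clip has fewer
# frames than requested, some source frames are sampled more than once.
def extract_frames(video_path, out_dir, n_frames=300, size=(128, 128)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    saved = 0
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n_frames))
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{out_dir}/frame_{i:04d}.png", cv2.resize(frame, size))
        saved += 1
    cap.release()
    return saved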

4.4. Dataset Augmentation

To enhance the variability and representation of the dataset for selected ASL letters, we incorporated the synthetic frames into the original data splits. From the total of 300 frames extracted per letter, 87.5% were allocated to the training set and the remaining 12.5% were assigned to the validation set. This distribution mirrored the split used in the original dataset, ensuring consistency across experiments.
In the initial experiment, we evaluated the impact of replacing the training data for the 13 selected letters (A, B, C, G, K, M, O, P, Q, R, S, U, V) entirely with the newly generated synthetic frames, while keeping the original test set unchanged. The goal was to observe whether synthetic training data alone, without modifying the test set, could lead to improved model performance. However, this approach resulted in poor generalization during validation, as indicated by the following metrics:
  • Training Accuracy: 98.60%;
  • Training Loss: 0.03;
  • Validation Accuracy: 58.32%;
  • Validation Loss: 5.38.
These results, visualized in Figure 11 and Figure 12, suggest that the synthetic training data, although realistic in appearance, introduced a domain gap that the model could not effectively bridge when evaluated on real data.
To address this performance degradation, we adopted a data grafting strategy, in which a portion of the original dataset was retained and combined with the generated content. Specifically, for each of the 13 letters, 263 synthetic images (87.5% of the 300 generated frames) replaced original images in the training set, and the remaining 37 synthetic images (12.5%) replaced original images in the test set. This hybrid approach aimed to preserve the natural data distribution while introducing synthetic variability to improve model robustness. We then conducted two sub-experiments (a minimal code sketch of the grafting step follows the list below):
  • Sub-experiment A: Used the augmented training set (with synthetic images) while keeping the original test set unchanged.
  • Sub-experiment B: Used both the augmented training and test sets, incorporating the synthetic frames into both splits.
These experiments allowed us to analyze the impact of synthetic data under both isolated and combined augmentation conditions.
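A minimal sketch of the grafting step is given below. It assumes one folder of 300 synthetic frames per letter and a per-letter train/test folder layout; the paths and file extensions are illustrative assumptions, not the exact scripts used in this study.

import random
import shutil
from pathlib import Path

# Sketch: for each of the 13 letters, 263 synthetic frames replace randomly
# chosen original training images and the remaining 37 replace test images.
LETTERS = list("ABCGKMOPQRSUV")
REPLACEMENTS = {"train": 263, "test": 37}

def graft_letter(letter, synth_dir, dataset_dir, seed=0):
    synthetic = sorted(Path(synth_dir, letter).glob("*.png"))
    rng = random.Random(seed)
    start = 0
    for split, count in REPLACEMENTS.items():
        originals = sorted(Path(dataset_dir, split, letter).glob("*.png"))
        for synth_frame, victim in zip(synthetic[start:start + count],
                                       rng.sample(originals, count)):
            victim.unlink()                          # remove one original image
            shutil.copy(synth_frame, victim.parent)  # add a synthetic frame
        start += count

# Example usage (assumed directory names):
# for letter in LETTERS:
#     graft_letter(letter, "synthetic_frames", "dataset")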

4.4.1. Results Using the Augmented Training Set with the Original Test Set

In the first sub-experiment, we evaluated the model’s performance when trained exclusively on the newly augmented training set—comprising both original and AI-generated frames for the selected 13 ASL letters—while keeping the original test set unchanged. This setup allowed us to isolate the effect of the augmented training data on model performance without introducing any variation into the evaluation data.
The model was trained for 25 epochs under the same conditions as in the baseline experiment. The results of this training configuration are presented in Table 3:
The confusion matrix for the model trained with the augmented training set and the original test set is presented in Figure 13. The figure shows that the letter ‘H’ is sometimes misclassified as the letter ‘G’, which is expected because these two letters are very similar in ASL. The same applies to the letters ‘S’ and ‘E’, as observed in the baseline training.
For 25 epochs, the model trained on the original dataset achieved an accuracy of 97.26% (worst run), whereas training on the augmented dataset resulted in a higher accuracy of 98.04% (best run). This represents an improvement of 0.78 percentage points, so the augmented results remain very close to the original ones. For 35 epochs, the model trained on the original dataset achieved an accuracy of 97.98%, whereas training on the augmented dataset resulted in a higher accuracy of 98.54%, an improvement of 0.56 percentage points, again very close to the original results. The results are presented in Table 4:
Observing a consistent increase in accuracy and a corresponding decrease in loss during the earlier experiments (Table 4), we extended the training process to 300 epochs in order to further explore the model’s learning potential. The results from this extended training were promising: the model achieved a training accuracy of 99.82% with a training loss of 0.01. On the validation set, the model reached an accuracy of 98.88% and a validation loss of 0.11. These outcomes, illustrated in Figure 14 and Figure 15, suggest that longer training contributed to improved performance and model generalization.

4.4.2. Results Using the Fully Augmented Dataset (Training and Testing Sets)

In this sub-experiment, both the training and testing sets were updated to include synthetic data generated for the selected ASL letters. This allowed us to evaluate the model’s performance on a fully augmented dataset, providing a consistent distribution between the training and evaluation phases.
The model was trained for 25 epochs, and the results of this configuration are presented in Table 5:
The confusion matrix for the model trained with the augmented training and test sets is presented in Figure 16. The figure shows that the letter ‘H’ is sometimes misclassified as the letter ‘G’, which is expected because these two letters are very similar in ASL. The same applies to the letters ‘S’ and ‘E’, as in the baseline and previous augmented training cases. However, some confusion also appears between the letters ‘O’ and ‘Z’ and between ‘T’ and ‘N’.
For 25 epochs, the model trained on the original dataset achieved an accuracy of 97.26%, whereas training on the augmented dataset resulted in a higher accuracy of 97.84%. This represents an improvement of 0.58 percentage points, highlighting the potential benefits of incorporating synthetic data into the training process.
To maintain consistency in the evaluation format, we performed another training of the model with augmented training and test sets and the results are presented in Table 6:
For 35 epochs, the model trained on the original dataset achieved an accuracy of 97.98%, whereas training on the augmented dataset resulted in a higher accuracy of 98.51%, an improvement of 0.53 percentage points that is very similar to the original results. Given the observable trend of increasing accuracy and decreasing loss during the earlier training sessions exemplified in Table 6, we kept the evaluation consistent by also training the model for 300 epochs to further assess its capacity for improvement and generalization. This extended training produced good results: the model achieved a training accuracy of 99.81% with a corresponding loss of 0.01. On the validation set, it reached an impressive accuracy of 99.10% and a validation loss of 0.06. These results indicate that prolonged training contributed to further refinement of the model’s learning, improving its predictive performance while maintaining generalization across unseen data.
Compared to previous training sessions with fewer epochs, the extended training depicted in Figure 17 and Figure 18 demonstrates improved generalization, as evidenced by the narrowing gap between training and test loss beyond epoch 150. The validation loss closely follows the training loss curve, suggesting that the model has not overfitted, even after 300 epochs. This consistency indicates a well-regularized training process and effective model learning, capable of sustaining performance across unseen data.

5. Exploring Future Directions

Building on the promising results of this study, several avenues for future research are identified to further advance the effectiveness and applicability of synthetic data in SLR:
  • Comprehensive Alphabet and Dynamic Sign Expansion. Future work should aim to extend the synthetic data generation pipeline to encompass all 26 ASL letters, as well as incorporate dynamic signs representing full words, phrases, and sentences. This would enable the development of more comprehensive and context-aware recognition systems. The challenge in this direction is that generating accurate dynamic sign sequences with temporal coherence using GenAI tools remains technically complex. Current video generation models often lack fine-grained control over hand-shape transitions, motion consistency, and signer-specific variations, which are critical for continuous sign language recognition.
  • Temporal Modeling for Continuous Sign Recognition. While the current work focused on isolated, static gestures, a natural progression involves applying similar synthetic augmentation strategies to continuous SLR. This would require models capable of temporal sequence modeling, such as LSTM, GRU, or transformer-based architectures, to handle video streams with complex temporal dependencies. The main challenge is that training temporal models requires large volumes of coherent, temporally aligned data. Generating high-quality synthetic video sequences with accurate frame-to-frame transitions is still a major limitation in generative AI tools. Moreover, aligning generated signs with appropriate gloss or text labels introduces further annotation complexity.
  • Deployment in Real-Time Systems. Finally, it is essential to evaluate the robustness of models trained on synthetic–augmented datasets in real-time applications. This includes deploying and testing models in virtual interpreting systems, gesture-controlled interfaces, or assistive technologies for the Deaf and hard-of-hearing communities. The challenge is that models trained on synthetic data may not generalize well to real-world environments due to differences in lighting, signer appearance, camera angle, and background. There is also a risk of domain shift if synthetic data lacks visual realism. Ensuring low-latency and efficient inference for real-time use remains a technical hurdle, especially on edge devices.

6. Conclusions

This research demonstrates one possible use case of GenAI-generated videos. The integration of Adobe Firefly-generated content into the training pipeline still produced good results. These findings underline the promise of generative tools for dataset enrichment and enhancement, particularly in domains where data acquisition costs are the main obstacle. As GenAI technologies evolve, their strategic use in accessibility-related AI can significantly accelerate progress toward more inclusive communication systems. This experiment explored the replacement of part of an SLR dataset with synthetic video frames, with the objective of maintaining or improving model accuracy. By selectively augmenting 13 different ASL letters, the dataset was enriched with high-quality artificial samples. This approach allowed us to investigate not only the direct impact of synthetic augmentation on model accuracy and loss, but also the extent to which GenAI can serve as a viable solution for low-resource data challenges in SLR. Specifically, training with the augmented dataset (87.5% of the synthetic frames placed in the training set and 12.5% in the test set) achieved a peak accuracy of 98.04%, compared with 97.26% for the original dataset, a small increase of 0.78 percentage points.
Answers to scientific questions:
  • SQ1: Can generative AI tools like Adobe Firefly be used to create visually realistic ASL fingerspelling video data suitable for dataset augmentation? Answer: Yes. Adobe Firefly successfully generated visually realistic short videos of ASL fingerspelling signs based on text prompts. The output videos were plausible and distinguishable enough to be used as training samples after conversion into image frames.
  • SQ2: Does augmenting an existing ASL dataset with synthetic video frames affect the recognition performance of a CNN-based model? Answer: It depends. Integrating approximately 7.5% synthetic frames into the dataset led to results similar to those of the original dataset, with a very small increase of 0.78 percentage points. The baseline model trained on the original dataset achieved a peak accuracy of 97.26%, while the augmented model reached 98.04%. On the other hand, training with the augmented dataset for 300 epochs resulted in a training accuracy of 99.81%, a validation accuracy of 99.10%, and a reduced validation loss of 0.06, indicating better generalization and model stability, although in this case the model might be overfitted.
  • SQ3: What are the practical limitations and considerations of using GenAI-based prompt engineering for generating usable sign language training data? Answer: While Adobe Firefly produced useful data, certain challenges were observed: generating signs with precise hand shapes and orientations was inconsistent; limited control over lighting, hand size, and background uniformity affected output quality; and some signs required several prompt iterations to achieve satisfactory visual fidelity.
This study adds value to the field in the following ways:
  • Demonstrating a low-cost, scalable approach to SLR dataset augmentation;
  • Providing an empirical case study showing how synthetic data can be used without degrading model accuracy;
  • Introducing a hybrid training strategy (real + synthetic data) that achieved high accuracy, over 97%.
Several difficulties emerged during the research:
  • Prompt Engineering Complexity: It was non-trivial to generate accurate and diverse ASL signs using text prompts alone. Iterative testing was needed to obtain visually correct signs for each letter.
  • Dataset Imbalance: Balancing synthetic and real samples without over-representing certain classes required manual fine-tuning.
Overall, the findings support the use of GenAI tools such as Adobe Firefly to augment visual datasets in machine learning pipelines, particularly for domains like SLR where high-quality, balanced data remains scarce. The results underscore the potential of hybrid training strategies—combining real and synthetic data—to yield more robust and accurate models. Future work may explore scaling this method across the full ASL alphabet, applying similar strategies to continuous video-based sign recognition, and testing cross-dataset generalizability using other public sign language corpora.

Author Contributions

The concept of this article was proposed by N.P.; the data resources and validation were contributed by V.B.; the formal analysis, investigation, and draft preparation were performed by V.B.; the supervision and reviewing of this study were headed by N.P.; the final writing was critically revised by N.P. and finally approved by the authors. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SLR	Sign Language Recognition
GenAI	Generative Artificial Intelligence
ASL	American Sign Language
CNN	Convolutional Neural Network
SQ	Scientific Question

References

  1. Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Giro-i-Nieto, X. How2Sign: A large-scale multimodal dataset for continuous American sign language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2735–2744.
  2. Convolution Neural Network Based SLR Model. Available online: https://github.com/rrupeshh/Simple-Sign-Language-Detector (accessed on 20 April 2025).
  3. Barbhuiya, A.A.; Karsh, R.K.; Jain, R. CNN based feature extraction and classification for sign language. Multimed. Tools Appl. 2021, 80, 3051–3069.
  4. Jain, V.; Jain, A.; Chauhan, A.; Kotla, S.S.; Gautam, A. American sign language recognition using support vector machine and convolutional neural network. Int. J. Inf. Technol. 2021, 13, 1193–1200.
  5. Lan, G.; Xiao, S.; Yang, J.; Wen, J.; Xi, M. Generative AI-based data completeness augmentation algorithm for data-driven smart healthcare. IEEE J. Biomed. Health Inform. 2023, 29, 4001–4008.
  6. Kasimalla, S.R.; Park, K.; Hong, J.; Kim, Y.J. Enhancing Microgrid Protection Systems Through GAN- and GenAI-Generated Synthetic Fault Datasets. TechRxiv 2025.
  7. García-Pérez, A.; Gómez-Silva, M.J.; de la Escalera-Hueso, A. Improving automatic defect recognition on GDXRay castings dataset by introducing GenAI synthetic training data. NDT E Int. 2025, 151, 103303.
  8. Ikram, S.; Dhanda, N. American sign language recognition using convolutional neural network. In Proceedings of the 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies, Kuala Lumpur, Malaysia, 24–26 September 2021; pp. 1–12.
  9. Ahmadi, S.A.; Muhammad, F.D.; Al Dawsari, H. CNN-TCN: Deep Hybrid Model Based on Custom CNN with Temporal CNN to Recognize Sign Language. J. Disabil. Res. 2024, 3, 20240034.
  10. Alsharif, B.; Altaher, A.S.; Altaher, A.; Ilyas, M.; Alalwany, E. Deep learning technology to recognize American sign language alphabet. Sensors 2023, 23, 7970.
  11. Meng, L.; Li, R. An attention-enhanced multi-scale and dual sign language recognition network based on a graph convolution network. Sensors 2021, 21, 1120.
  12. Baihan, A.; Alutaibi, A.I.; Alshehri, M.; Sharma, S.K. Sign language recognition using modified deep learning network and hybrid optimization: A hybrid optimizer (HO) based optimized CNNSa-LSTM approach. Sci. Rep. 2024, 14, 26111.
  13. Adobe Firefly That Can Be Used to Generate Images and Videos Using Prompts. Available online: https://www.adobe.com/products/firefly/features/ai-video-generator.html (accessed on 17 May 2025).
  14. Keyes, O.K.; Hyl, A. Hands Are Hard: Unlearning How We Talk About Machine Learning in the Arts. Tradit. Innov. Arts Des. Media High. Educ. 2023, 1, 4.
  15. Yang, Y.G.; Hi, A.N.; Turk, G. Annotated hands for generative models. arXiv 2024, arXiv:2401.15075.
  16. Sarkar, M.; Chatterjee, S.; Hazra, S.; Sinha, A.; Reza, M.S.; Shah, M.A. Analyzing why AI struggles with drawing human hands with CLIP. F1000Research 2025, 14, 193.
  17. Adobe Express, the Design Application Used to Process the Videos (Remove Background, Increase Contrast, Flip Elements, etc.). Available online: https://new.express.adobe.com (accessed on 14 April 2025).
Figure 1. Image generated with Adobe Firefly using the prompt “Human describing cat in sign language using fingers”.
Figure 2. Image generated with Adobe Firefly using the prompt “A human showing OK with hand near the lips”.
Figure 3. A frame from the video generated with prompt “hand of a person who shows the peace sign on a neutral background” with seed 963,445.
Figure 4. Flow chart with the research methodology.
Figure 5. Confusion matrix of the SLR model that was trained with the initial dataset for 25 epochs.
Figure 6. Accuracy of the SLR model that was trained with the initial dataset for 25 epochs.
Figure 7. Loss of the SLR model that was trained with the initial dataset for 25 epochs.
Figure 8. Accuracy of the SLR model that was trained with the initial dataset for 300 epochs.
Figure 9. Loss of the SLR model that was trained with the initial dataset for 300 epochs.
Figure 10. Example of how letter “V” was processed in SLR. Steps of processing the videos in Adobe Express until they become valid data to be used in the dataset.
Figure 11. Accuracy of the SLR model that was trained with only the synthetic data for 25 epochs.
Figure 12. Loss of the SLR model that was trained with only the synthetic data for 25 epochs.
Figure 13. Confusion matrix of the SLR model that was trained with the augmented training set with the original test set for 25 epochs.
Figure 14. Accuracy of the SLR model that was trained with the augmented training set with the original test set for 300 epochs.
Figure 15. Loss of the SLR model that was trained with the augmented training set with the original test set for 300 epochs.
Figure 16. Confusion matrix of the SLR model that was trained with the augmented training and test sets for 25 epochs.
Figure 17. Accuracy of the SLR model that was trained with the augmented training and test sets for 300 epochs.
Figure 18. Loss of the SLR model that was trained with the augmented training and test sets for 300 epochs.
Table 1. Model performance across five runs with the original dataset.
Run | Accuracy | Loss | Val. Accuracy | Val. Loss
1 | 97.64% | 0.07 | 97.85% | 0.10
2 | 97.53% | 0.07 | 98.35% | 0.14
3 | 97.70% | 0.06 | 99.80% | 0.06
4 | 97.26% | 0.07 | 98.94% | 0.09
5 | 97.89% | 0.06 | 97.92% | 0.09
Table 2. Model performance across five runs with the original dataset, trained for 35 epochs.
Run | Accuracy | Loss | Val. Accuracy | Val. Loss
1 | 98.42% | 0.04 | 98.06% | 0.11
2 | 98.26% | 0.05 | 99.23% | 0.02
3 | 98.11% | 0.04 | 98.54% | 0.05
4 | 97.98% | 0.06 | 97.89% | 0.13
5 | 98.23% | 0.03 | 99.01% | 0.03
Table 3. Model performance across five runs, trained for 25 epochs with the augmented training set and the original test set.
Run | Accuracy | Loss | Val. Accuracy | Val. Loss
1 | 97.71% | 0.06 | 98.37% | 0.15
2 | 97.56% | 0.07 | 98.48% | 0.07
3 | 97.73% | 0.06 | 98.91% | 0.07
4 | 97.49% | 0.07 | 98.54% | 0.06
5 | 98.04% | 0.05 | 98.49% | 0.14
Table 4. Model performance across five runs, trained for 35 epochs with the augmented training set and the original test set.
Run | Accuracy | Loss | Val. Accuracy | Val. Loss
1 | 98.11% | 0.05 | 97.85% | 0.10
2 | 98.16% | 0.04 | 98.91% | 0.10
3 | 98.24% | 0.04 | 99.03% | 0.10
4 | 98.40% | 0.03 | 99.12% | 0.10
5 | 98.52% | 0.03 | 99.30% | 0.09
Table 5. Model performance across five runs, trained for 25 epochs with the augmented training and test sets.
Run | Accuracy | Loss | Val. Accuracy | Val. Loss
1 | 97.44% | 0.07 | 98.57% | 0.10
2 | 97.38% | 0.07 | 98.44% | 0.05
3 | 97.84% | 0.06 | 98.44% | 0.09
4 | 97.68% | 0.06 | 98.95% | 0.11
5 | 97.65% | 0.06 | 98.56% | 0.12
Table 6. Model performance across five runs, trained for 35 epochs with the augmented training and test sets.
Run | Accuracy | Loss | Val. Accuracy | Val. Loss
1 | 98.43% | 0.04 | 98.63% | 0.08
2 | 98.29% | 0.05 | 98.45% | 0.13
3 | 98.37% | 0.04 | 98.54% | 0.10
4 | 98.51% | 0.04 | 98.70% | 0.09
5 | 98.45% | 0.04 | 98.68% | 0.08