1. Introduction
Although deep learning-based automatic speech recognition (ASR) technologies have progressed significantly, obtaining a sufficient volume of high-quality speech data remains a major obstacle in the development of robust ASR systems, especially in low-resource settings [
1]. Modern ASR systems rely on large amounts of labeled speech data to learn complex acoustic and linguistic patterns through deep neural architectures [
2,
3,
4]. However, in many practical scenarios, especially within low-resource domains or languages, such labeled data remains limited [
4,
5]. This scarcity notably limits the model’s ability to generalize, often resulting in degraded recognition performance [
2]. Addressing this data bottleneck is therefore crucial for building robust ASR systems in real-world low-resource settings. Further, the collection and annotation of speech data incurs substantial financial and temporal costs, and these challenges are especially severe in the context of low-resource languages, where sufficiently large and high-quality labeled speech datasets are often unavailable [
6]. ASR models developed under such conditions are likely to overfit to the patterns in the training data [
7]. Empirical studies have demonstrated that ASR systems trained in such low-resource settings yield higher word error rates (WERs), thereby hindering their practical deployment and usability [
8,
9,
10,
11,
12].
To address data scarcity in low-resource ASR, data augmentation and large-scale pretrained speech foundation models have emerged as promising strategies. Augmentation techniques such as noise addition, time-stretching, pitch shifting, SpecAugment, and speech synthesis have been shown to enhance ASR model generalization [
13]. In addition, the emergence of pretrained speech foundation models, such as wav2vec 2.0 [
14], HuBERT [
15], and WavLM [
16], has significantly advanced ASR performance. These models learn generalizable speech representations through self-supervised learning on large-scale unlabeled speech datasets, enabling better performance on downstream tasks, such as phoneme classification, speaker recognition, emotion recognition, and ASR [
17]. Among these foundation models, Whisper [
18] stands out as a supervised multilingual ASR model trained on 680K hours of labeled audio. Its broad coverage of languages and robustness to noisy conditions make it particularly well-suited for real-world speech recognition tasks.
Despite the availability of pretrained models that generalize well and the application of data augmentation, developing ASR systems that perform well in practical scenarios remains difficult, particularly for domain-specific tasks or extremely low-resource settings. While pretrained speech foundation models offer strong generalization, they struggle to capture domain-specific speech patterns that had limited coverage in the pretraining stage. Fine-tuning can help but is not always feasible in low-resource conditions [
12]. Moreover, traditional speech data augmentation approaches primarily enhance acoustic diversity while neglecting lexical and grammatical variations.
To overcome the limitations of conventional data augmentation and improve coverage of task-specific linguistic patterns, we leverage text-to-speech (TTS) systems to generate synthetic speech from task-relevant text. Unlike conventional augmentation methods that modify existing audio, TTS-based synthesis can effectively generate entirely new speech samples that directly align with ASR objectives [
19,
20]. Recent advances in TTS technologies, including conditional variational autoencoder with adversarial learning for end-to-end TTS (VITS) [
21], FastSpeech2 [
22], and Tacotron 2 [
23], have remarkably enhanced the perceptual quality of synthesized speech. Moreover, zero-shot multi-speaker systems such as VALL-E [
24], XTTS [
25], YourTTS [
26], and VoiceBox [
27] have demonstrated promising performance, enabling high-quality synthetic speech generation without speaker-specific fine-tuning. These advancements offer a promising avenue to mitigate the challenges associated with limited labeled data. However, synthetic speech generated by TTS methods differs from real speech in acoustic characteristics, prosody, and naturalness. Such discrepancies cause ASR models to perceive synthetic and real speech as originating from distinct distributions, consequently diminishing training efficiency and potentially degrading model performance. We refer to this phenomenon as “Domain Distribution Inconsistency”, as shown in
Figure 1.
In this paper, we address the domain mismatch between real and TTS-generated speech, which generally degrades ASR performance as the proportion of synthetic data increases. To this end, we propose a joint domain adversarial learning (JODAL) approach that employs domain adaptation to learn domain-invariant representations. By treating real and synthetic data as distinct domains, the model is trained to align the feature distributions of real and synthetic speech, enabling robust speech representation learning despite distributional shifts. This approach mitigates domain-specific biases and facilitates consistent learning of task-relevant acoustic characteristics without regard to the origin of the data.
The main contributions of this work are summarized as follows:
We introduce a resynthesis-based data augmentation strategy using off-the-shelf TTS systems, showing that acoustic diversity in multi-speaker systems is more critical for ASR utility than perceptual naturalness.
We define and analyze the “Domain Distribution Inconsistency” issue, demonstrating that excessive incorporation of synthetic data without domain adaptation can adversely affect ASR performance, especially in noisy datasets like GigaSpeech [
28].
We formulate the distributional mismatch between real and TTS-generated speech as a domain adaptation problem in low-resource ASR. Based on this formulation, we evaluate a gradient reversal-based adversarial learning strategy that mitigates the feature-level real-synthetic discrepancy across the evaluated ASR backbones, TTS systems, datasets, and training configurations.
We evaluate JODAL across multiple ASR backbones, TTS systems, synthetic-to-real data ratios, and two English benchmark datasets. In the evaluated WavLM configuration, JODAL achieved lower WER than the 100% real-data baseline on LibriSpeech [
29] and GigaSpeech, suggesting that TTS-generated speech can be an effective supplementary resource in the examined low-resource ASR settings.
3. Methodology
3.1. Problem Formulation: Data Augmentation for Low-Resource ASR by TTS Model
The collection and annotation of high-quality speech data remain a critical bottleneck, particularly in low-resource domains such as dialectal speech, pediatric speech, or speech with pathological characteristics. This data scarcity hinders the generalization capability of ASR models across diverse acoustic and linguistic environments. To address this limitation, we propose a TTS-based data augmentation framework that synthesizes training data to enhance ASR performance in low-resource settings.
Given a limited real dataset $\mathcal{D}_{\mathrm{real}}$ and its corresponding transcriptions $\mathcal{T}$, the goal is to generate a synthetic speech set $\mathcal{D}_{\mathrm{syn}}$ such that $\mathcal{D}_{\mathrm{real}} \cup \mathcal{D}_{\mathrm{syn}}$ forms a more diverse and robust training corpus for the ASR model. While TTS-generated samples offer scalability and linguistic flexibility, they also introduce challenges related to the distributional discrepancy between synthetic and real speech. Inconsistencies such as unnatural prosody, limited speaker variability, absence of environmental noise, and artificial signal characteristics can adversely affect ASR performance, particularly when synthetic data accounts for a significant portion of the training set. To address these challenges, the following sections present a TTS-based augmentation framework and a JODAL strategy to improve ASR performance under varying synthetic-to-real data ratios.
3.2. Speech Data Generation Using TTS System
As an initial approach, we address the challenges of low-resource ASR training by generating synthetic speech, using both audio and textual prompts with TTS systems. As shown in
Table 1, we employ a range of TTS models, including XTTS [
25] (Available online:
https://github.com/coqui-ai/tts, accessed on 18 April 2026), YourTTS [
26] (Available online:
https://github.com/coqui-ai/tts, accessed on 18 April 2026), Tacotron2 [
23] (Available online:
https://github.com/coqui-ai/tts, accessed on 18 April 2026), and VITS [
21] (Available online:
https://github.com/coqui-ai/tts, accessed on 18 April 2026), to construct augmented training datasets. XTTS and YourTTS are capable of zero-shot speaker adaptation by leveraging learned speaker embeddings, thereby facilitating multilingual and multi-speaker speech synthesis. These models allow conditioning on both acoustic and textual inputs to generate speech outputs. In this study, we use a resynthesis strategy, wherein synthetic speech is generated by simultaneously employing the original audio waveform and its corresponding transcription, effectively preserving speaker characteristics and prosodic information from the source samples.
In contrast, Tacotron2 and VITS are designed for single-speaker synthesis and operate solely based on textual inputs. Tacotron2 adopts a sequence-to-sequence architecture that maps character-level inputs to Mel-spectrograms, which are subsequently converted into waveforms via neural vocoders. VITS, on the other hand, employs a variational inference framework within a fully end-to-end architecture, eliminating the need for external alignment mechanisms and enabling the generation of high-fidelity speech.
To simulate a low-resource scenario, we limited the original training set to only 10% of the full dataset. We then incrementally augmented this reduced set by synthesizing additional speech at varying proportions (10%, 20%, 30%, 40%, and 90% of the full dataset) using TTS systems. Each synthesized subset was combined with the 10% real data subset to construct progressively larger mixed-domain training sets. These mixed datasets serve to examine the impact of synthetic data scale on ASR model performance.
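The construction of the mixed-domain training sets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_mixed_sets`, the random split, and the use of utterance IDs as stand-ins for audio files are all hypothetical, and the synthetic proportions are taken as fractions of the full dataset.

```python
import random


def build_mixed_sets(utterance_ids, real_fraction=0.10,
                     synthetic_fractions=(0.10, 0.20, 0.30, 0.40, 0.90),
                     seed=0):
    """Split a corpus into a fixed real subset plus growing synthetic subsets.

    Hypothetical helper: returns a dict mapping a configuration name such as
    "R10 + S90" to the IDs kept as real audio and the IDs to be resynthesized.
    """
    rng = random.Random(seed)
    shuffled = list(utterance_ids)
    rng.shuffle(shuffled)

    n_total = len(shuffled)
    n_real = round(real_fraction * n_total)
    real_ids = shuffled[:n_real]   # kept as real recordings
    tts_pool = shuffled[n_real:]   # remaining utterances, candidates for TTS

    configs = {}
    for frac in synthetic_fractions:
        n_syn = round(frac * n_total)
        name = f"R{round(real_fraction * 100)} + S{round(frac * 100)}"
        # Each larger configuration contains the smaller ones as a prefix.
        configs[name] = {"real": real_ids, "synthetic": tts_pool[:n_syn]}
    return configs


configs = build_mixed_sets([f"utt{i:04d}" for i in range(1000)])
for name, parts in configs.items():
    print(name, len(parts["real"]), len(parts["synthetic"]))
```

Taking the synthetic subsets as nested prefixes of one shuffled pool is a design choice of this sketch; it keeps the configurations comparable because each larger set only adds utterances to the previous one.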
3.3. Joint Domain Adversarial Learning for Mitigating Data Distributional Inconsistency Issue
Despite significant advances in neural TTS systems, synthetic speech still differs from real speech samples, particularly in prosody patterns, speaker-specific variations, and background noise characteristics. When synthetic data constitutes a substantial portion of the mixed training corpus (i.e., real + synthetic), these discrepancies may introduce distributional mismatches, which can cause ASR models to overfit to non-natural artifacts, leading to generalization errors on real-world speech. Empirically, we observed that ASR performance degrades relative to real-data training as the proportion of synthetic data increases. Specifically, models trained on a mixture of 10% real data and varying amounts of synthetic data (10%, 20%, 30%, 40%, and 90%) exhibit noticeable performance drops compared to those trained on original data subsets of equivalent sizes (20%, 30%, 40%, 50%, and 100%), highlighting the potential negative effect of excessive reliance on synthetic inputs.
To address this issue, we adopt the established domain adversarial learning framework based on a gradient reversal layer and integrate it into the ASR training pipeline for TTS-augmented low-resource ASR, as illustrated in
Figure 2. We refer to this training framework as JODAL. The objective of JODAL is not to introduce a new adversarial optimization mechanism, but to formulate the real-synthetic mismatch as a domain adaptation problem and to empirically evaluate whether feature-level adversarial alignment can improve ASR training when synthetic speech is increasingly incorporated. Specifically, the domain discriminator is implemented as a single-layer feedforward neural network that is attached to the ASR encoder representation and is trained to distinguish whether a given speech sample originates from the real or synthetic domain. Through the gradient reversal layer, the encoder receives the reversed domain classification gradient during backpropagation, thereby suppressing domain-discriminative information while preserving ASR-relevant representations.
Mathematically, the proposed JODAL approach is defined as follows. Let $E_{\theta_e}(x)$ denote the feature representation extracted by the ASR encoder parameterized by $\theta_e$ for an input speech signal $x$. The ASR decoder, represented as $G_{\theta_g}$ with parameters $\theta_g$, is responsible for predicting the corresponding output sequence, consisting of words or phonemes. To enforce domain invariance, we incorporate a data discriminator $C_{\theta_c}$, parameterized by $\theta_c$, which estimates the probability that the input $x$ originates from real speech $\mathcal{D}_{\mathrm{real}}$ rather than synthetic speech $\mathcal{D}_{\mathrm{syn}}$. The overall training objective is formulated as a multi-task loss, combining the standard ASR loss with an adversarial domain classification loss, and is given by
$$\mathcal{L} = \mathcal{L}_{\mathrm{ASR}}\big(G_{\theta_g}(E_{\theta_e}(x)),\, y\big) + \lambda\, \mathcal{L}_{\mathrm{dom}}\big(C_{\theta_c}(E_{\theta_e}(x)),\, d\big),$$
where $\mathcal{L}_{\mathrm{ASR}}$ is the standard ASR loss, typically a sequence-to-sequence or Connectionist Temporal Classification (CTC) [46] loss, and $\mathcal{L}_{\mathrm{dom}}$ is the data discriminator loss, which in this case is cross-entropy. The ground-truth transcript is denoted by $y$, while $d$ is the data domain label (0 for synthetic, 1 for real), and $\lambda$ is a domain adaptation hyperparameter. The gradient reversal layer implicitly scales the gradient of $\mathcal{L}_{\mathrm{dom}}$ with $-\lambda$ during backpropagation, encouraging the encoder to learn representations that are predictive for the ASR task but invariant to the speech domain. Through this mechanism, JODAL enables the ASR model to focus on the speech content rather than the domain-specific characteristics, thereby improving the model's ability to generalize across both real and synthetic speech data. This approach has shown significant promise in addressing domain shift problems [
43,
44,
45,
47] and enhancing the effectiveness of TTS-based data augmentation for ASR.
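The effect of the gradient reversal layer can be illustrated with a scalar toy model in which every gradient is written out by hand. This is a sketch under strong simplifying assumptions rather than the authors' implementation: the encoder, ASR head, and discriminator are each a single weight, a squared error stands in for the CTC or sequence-to-sequence loss, and all names are hypothetical. Whether the discriminator's own gradient is additionally scaled by the domain adaptation weight depends on where that weight is applied; the sketch leaves it unscaled.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def jodal_toy_gradients(x, y, d, w_enc, w_task, w_dom, lam):
    """Return hand-derived gradients for a one-weight-per-module toy model.

    x: input, y: regression target (stand-in for the transcript),
    d: domain label (0 synthetic, 1 real), lam: reversal strength.
    """
    # Forward pass.
    h = w_enc * x                      # encoder feature
    task_err = w_task * h - y          # stand-in "ASR" head error
    p_real = sigmoid(w_dom * h)        # discriminator: P(real | h)

    task_loss = task_err ** 2
    dom_loss = -(d * math.log(p_real) + (1 - d) * math.log(1.0 - p_real))

    # Backward pass, written out explicitly.
    dtask_dh = 2.0 * task_err * w_task
    ddom_dh = (p_real - d) * w_dom     # cross-entropy + sigmoid derivative

    return {
        "task_loss": task_loss,
        "dom_loss": dom_loss,
        "task_head": 2.0 * task_err * h,
        # The discriminator sees the ordinary gradient: it learns to classify.
        "discriminator": (p_real - d) * h,
        # The encoder sees the domain gradient multiplied by -lam, so a
        # descent step on it increases the domain loss (domain confusion).
        "encoder": (dtask_dh - lam * ddom_dh) * x,
    }


plain = jodal_toy_gradients(1.0, 0.0, 1, 1.0, 1.0, 1.0, lam=0.0)
reversed_ = jodal_toy_gradients(1.0, 0.0, 1, 1.0, 1.0, 1.0, lam=0.5)
print(plain["encoder"], reversed_["encoder"])
```

With the reversal strength set to zero the encoder follows only the task gradient; with a positive value its gradient shifts in the direction that makes the domain classification harder, while the discriminator gradient is unchanged.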
3.4. Fine-Tuning Pretrained ASR Models
To train ASR models on mixed datasets comprising both real and synthetic speech, we selected one supervised model (Whisper [
18]) and three self-supervised models: wav2vec 2.0 [
14], HuBERT [
15], and WavLM [
16]. We then applied fine-tuning techniques to optimize them for our target dataset. These models have demonstrated state-of-the-art performance across a wide range of speech recognition studies and serve as robust initialization points for speech downstream tasks [
17]. In particular, Whisper, trained on a large-scale multilingual corpus of 680,000 h with a cross-entropy loss, provides a versatile encoder–decoder architecture that supports transcription and translation across diverse languages. The self-supervised models, in contrast, learn latent speech representations from unlabeled audio during pretraining (with a contrastive objective in wav2vec 2.0 and masked-prediction objectives in HuBERT and WavLM) and are then fine-tuned with a CTC [46] loss for the ASR task. All four models serve as backbones for task-specific ASR fine-tuning in this study.
4. Experimental Settings
We conducted extensive experiments under controlled low-resource conditions to evaluate the effectiveness of the proposed TTS-based data augmentation and JODAL approach. Our experimental setup was designed to address the following key questions:
Research Question 1. How effective is TTS-based data augmentation in improving ASR performance in low-resource settings?
Research Question 2. What is the impact of increasing the proportion of synthetic data on ASR performance?
Research Question 3. How do different TTS systems compare in terms of their effectiveness for ASR data augmentation?
Research Question 4. Can JODAL mitigate the domain distribution inconsistency problem between real and synthetic speech?
4.1. Dataset
We employed two primary speech datasets in our experiments: LibriSpeech [
29] (Available online:
https://www.openslr.org/12, accessed on 18 April 2026) and GigaSpeech [
28] (Available online:
https://github.com/SpeechColab/GigaSpeech, accessed on 18 April 2026), both of which are widely recognized resources in the field of ASR. LibriSpeech is a benchmark corpus comprising approximately 960 h of read English speech sourced from public domain audiobooks available through the LibriVox project. The dataset features over 2000 distinct speakers, with balanced representation across genders and diverse reading styles, thereby enabling robust modeling of speaker variability. To simulate a low-resource scenario, we selectively utilize the train-clean-100 subset, which contains approximately 100 h of speech. From this, we sample only 10% (roughly 10 h) to construct our real speech dataset
$\mathcal{D}_{\mathrm{real}}$, while preserving diversity in speaker identity and linguistic content. The remaining speech data from the corpus is reserved for generating synthetic speech using the TTS systems.
GigaSpeech, on the other hand, is a significantly larger and acoustically more diverse corpus, comprising approximately 10,000 h of English speech recordings collected from a broad range of real-world sources, including audiobooks, podcasts, and YouTube videos. It features spontaneous, conversational, and scripted speech styles, and includes substantial background noise and speaker variability, which makes it particularly suitable for evaluating domain robustness in ASR models. For our experiments, we employed the 250 h GigaSpeech training subset S, from which we took a limited portion (10%, approximately 25 h) to serve as real speech data, while reserving the remainder of the subset for synthetic data generation.
For ASR evaluation, we use the standard test sets provided with each dataset. From LibriSpeech, we employ the test-clean set, which contains high-quality read speech and serves as a benchmark for clean ASR evaluation. From GigaSpeech, we use its corresponding test set, which better reflects real-world acoustic variability due to its diverse recording sources and spontaneous speech content.
4.2. Training Set Design: Real vs. Synthetic Composition
To examine the impact of synthetic data on ASR performance, we constructed a series of training sets by systematically adjusting the ratio between real and synthesized speech. Specifically, we sampled subsets of real speech at 10%, 20%, 30%, 40%, 50%, and 100% of the original corpus. The 10% subset serves as the low-resource baseline condition. To simulate progressive data augmentation, we generate synthetic speech from the remaining transcripts using the TTS models outlined in
Section 3.2. The synthesized utterances were incrementally added to the 10% real subset, resulting in a range of augmented training scenarios, as summarized in
Table 2.
These training configurations enable a controlled analysis of the influence of increasing synthetic speech while maintaining a fixed amount (10%) of real speech. Furthermore, we constructed comparative baselines using real data only, at 20%, 30%, 40%, 50%, and 100% levels, denoted as R20, R30, …, R100. This design allows us to evaluate the relative benefit of synthetic data augmentation versus collecting additional real data. All audio inputs (both real and synthetic) were downsampled to 16 kHz, converted to mono, and truncated to a maximum duration of 30 s.
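The mono conversion and 30 s truncation described above can be sketched in a few lines. This is an illustrative simplification: audio is represented as a list of per-channel sample tuples, the helper names are hypothetical, and resampling to 16 kHz is assumed to have been done already because it requires a signal-processing library that the text does not name.

```python
TARGET_SAMPLE_RATE = 16_000   # Hz, as stated in the text
MAX_DURATION_S = 30           # seconds, as stated in the text


def to_mono(frames):
    """Average the channels of each frame (a tuple of per-channel samples)."""
    return [sum(frame) / len(frame) for frame in frames]


def truncate(samples, sample_rate=TARGET_SAMPLE_RATE,
             max_duration_s=MAX_DURATION_S):
    """Keep at most max_duration_s seconds of audio."""
    max_samples = sample_rate * max_duration_s
    return samples[:max_samples]


stereo = [(0.5, -0.5), (1.0, 0.0)]
print(to_mono(stereo))                  # [0.0, 0.5]
print(len(truncate([0.0] * 500_000)))   # 480000
```

At 16 kHz, the 30 s limit corresponds to 480,000 samples per utterance.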
4.3. Implementation Details
For synthetic speech generation, all TTS model checkpoints employed in this study are obtained from the publicly available Coqui-AI TTS repository (
https://github.com/coqui-ai/TTS (accessed on 18 April 2026)). For ASR model fine-tuning, we adopt the “tiny” configuration of Whisper, comprising approximately 39 million parameters. Regarding wav2vec 2.0 [
14], HuBERT [
15], and WavLM [
16], we used the “Base” configuration with around 95 million parameters and integrated a linear projection layer on top of the feature encoder to convert latent representations into output token distributions. For the evaluation of these self-supervised models, we integrated a 4-gram language model [
29] (available online:
https://www.openslr.org/12, accessed on 18 April 2026) using KenLM [
48] (available online:
https://github.com/kpu/kenlm, accessed on 18 April 2026). During decoding, a beam size of 50 was applied, with fixed weights consisting of a language model weight of 2 and a word insertion penalty of −1, as described in [
14]. In this study, we exclusively report the performance of these self-supervised architectures with the 4-gram language model to ensure robust recognition results. In contrast, the Whisper model was evaluated without any language model.
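The role of the language model weight and the word insertion penalty during beam search can be illustrated with a small rescoring sketch. It assumes a common log-linear combination of acoustic score, weighted language model score, and a per-word term; the exact scoring formula of the decoder used in the experiments is not given in the text, so the functions, field names, and toy scores below are hypothetical.

```python
def combined_score(acoustic_logprob, lm_logprob, num_words,
                   lm_weight=2.0, word_insertion_penalty=-1.0):
    """Log-linear hypothesis score (assumed form, not the exact decoder)."""
    return (acoustic_logprob
            + lm_weight * lm_logprob
            + word_insertion_penalty * num_words)


def pick_best(hypotheses, **score_kwargs):
    """Return the hypothesis with the highest combined score."""
    return max(
        hypotheses,
        key=lambda h: combined_score(
            h["am"], h["lm"], len(h["text"].split()), **score_kwargs
        ),
    )


# Toy scores chosen for illustration only.
beam = [
    {"text": "the cat sat", "am": -4.0, "lm": -3.0},
    {"text": "the cats at", "am": -3.5, "lm": -5.0},
]
print(pick_best(beam)["text"])
print(pick_best(beam, lm_weight=0.0, word_insertion_penalty=0.0)["text"])
```

In this toy beam the acoustically preferred hypothesis loses once the language model score is weighted in, which is the behavior the language model weight is meant to control.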
All models were optimized on the mixed dataset $\mathcal{D}_{\mathrm{real}} \cup \mathcal{D}_{\mathrm{syn}}$ comprising real and synthesized speech samples. Training was conducted with FP16 precision, a batch size of 16, and an initial learning rate of $5 \times 10^{-5}$. All models were trained for up to 10 epochs using the AdamW optimizer [
49], incorporating gradient norm clipping [
50] and a linear learning rate decay schedule with a 7% warm-up phase. No additional data augmentation or explicit regularization techniques were applied during training. The training process typically consisted of approximately 35,470 steps under the R100 configuration, although the exact number of steps may vary depending on the configuration types. For both LibriSpeech and GigaSpeech, the best-performing checkpoints were selected based on the development set performance, and the final results were reported on the respective test sets using these optimized models.
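The learning rate schedule described above (7% warm-up followed by linear decay) can be written as a small function. This sketch assumes a linear ramp from zero during warm-up and a decay to zero at the final step, neither of which is stated in the text; the function name is hypothetical, and the default step count is the R100 figure quoted above.

```python
def linear_warmup_decay_lr(step, total_steps=35_470, peak_lr=5e-5,
                           warmup_ratio=0.07):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from the peak down to 0 at the last step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1_000, 2_482, 20_000, 35_470):
    print(step, linear_warmup_decay_lr(step))
```

Under these assumptions, warm-up covers the first 2482 of the 35,470 steps in the R100 configuration.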
4.4. Evaluation Metrics
To quantitatively assess ASR performance, we used the word error rate (WER), the primary metric for ASR evaluation, which measures transcription accuracy by accounting for the different types of recognition errors. Formally, WER is defined as
$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%,$$
where
S denotes the number of substitutions,
D the number of deletions,
I the number of insertions, and
N the total number of words in the ground-truth reference transcription. A lower WER value indicates higher transcription accuracy and, therefore, better ASR performance. We reported the ASR performance on the test-clean and test-other subsets of LibriSpeech, as well as on the dev and test sets of GigaSpeech.
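A minimal WER computation is sketched below. It uses the standard word-level edit distance, whose minimum cost equals the combined number of substitutions, deletions, and insertions; whitespace tokenization and the absence of text normalization are simplifying assumptions, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return WER as a fraction: (S + D + I) / N, via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; the table is filled one reference word at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(f"{100 * word_error_rate(ref, hyp):.2f}%")
```

In the example, one substitution ("sat" to "sit") and one deletion ("the") against a six-word reference give a WER of 2/6, or about 33.33%.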
Although the Mean Opinion Score (MOS) is inherently subjective and cannot serve as an absolute or standalone evaluation metric, it remains a valuable proxy for assessing perceptual quality, particularly in the context of synthetic speech evaluation. In this study, we investigated the relationship between MOS values and ASR performance, to analyze their correlation patterns. To this end, we employed the Naturalness Mean Opinion Score (nMOS) predicted by the UTMOS model to estimate perceptual naturalness, which was applied in prior studies [
25,
51]. Specifically, we employ the publicly available implementation of UTMOS (Available online:
https://github.com/tarepan/SpeechMOS, accessed on 18 April 2023), which has been shown to produce nMOS predictions that closely align with human-annotated scores across various TTS outputs.
6. Discussion
This section interprets the experimental findings in the context of the research questions (RQs) and discusses the broader implications for low-resource ASR development.
6.1. Evaluation of RQ1: Effectiveness of TTS-Based Augmentation
Our results consistently demonstrate that data augmentation using off-the-shelf TTS systems is a highly effective strategy for low-resource ASR. Across both LibriSpeech and GigaSpeech datasets, augmenting the 10% real data (R10) baseline led to significant WER reductions. For instance, adding just 10% XTTS-generated data (R10 + S10) reduced the Whisper-tiny WER on LibriSpeech test-clean from 25.49% to 17.93%. Similar improvements were observed in self-supervised models, where WavLM achieved a relative improvement of 60.8% on the same task. These findings confirm that synthetic speech can provide critical linguistic and phonetic variations that compensate for the scarcity of real-world recordings, despite the inherent acoustic differences between the two domains.
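For reference, the relative improvement implied by an absolute WER change can be computed as below; the helper name is hypothetical, and the numbers are the Whisper-tiny values quoted above.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Whisper-tiny on LibriSpeech test-clean: R10 vs. R10 + S10 (XTTS).
print(f"{relative_wer_reduction(25.49, 17.93):.2f}%")
```

The drop from 25.49% to 17.93% corresponds to a relative reduction of roughly 29.7%.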
6.2. Evaluation of RQ2: Impact of Synthetic Data Scaling
The scaling analysis revealed that while increasing synthetic data volume generally improves performance, the gains are not linear and exhibit diminishing returns beyond the R10 + S40 configuration. On the cleaner LibriSpeech dataset, models followed a relatively stable scaling trend. However, on GigaSpeech, vanilla augmentation (without domain adaptation) often resulted in performance stagnation or degradation as the proportion of synthetic data reached 90% (R10 + S90); for example, YourTTS reached 34.98% WER, which is worse than its R10 baseline. This phenomenon underscores the Domain Distribution Inconsistency problem: without proper alignment, an excessive reliance on synthetic data can cause the model to overfit to TTS-specific artifacts rather than learning generalized speech representations.
6.3. Evaluation of RQ3: Comparison of Different TTS Systems
A key insight from our comparative analysis is that acoustic diversity is more vital for ASR training than perceptual naturalness. While VITS achieved the highest UTMOS score (4.356), its utility for ASR was consistently surpassed by XTTS (UTMOS: 3.883). The superiority of XTTS and YourTTS is likely due to their zero-shot multi-speaker capabilities, which expose the ASR model to a wider variety of speaker styles and prosodic patterns compared to high-quality but acoustically limited single-speaker models like VITS and Tacotron2. The decoupling of UTMOS and WER ranks, reflected in the lower Pearson correlation between the two for WavLM, further suggests that JODAL allows models to leverage diverse synthetic data regardless of its perceived quality.
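The Pearson correlation used to relate UTMOS and WER can be computed with the standard formula, as sketched below. The function name is hypothetical and the input vectors are toy values chosen only to exercise the function; they are not measurements from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy inputs (not experimental data): perfectly inverse relationship.
print(pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]))
```

A coefficient near -1 would mean that higher predicted naturalness goes with lower WER, while a value near zero would indicate that the two are largely unrelated.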
6.4. Evaluation of RQ4: Mitigating Domain Inconsistency Through JODAL
The proposed JODAL method helped bridge the domain gap. Across the evaluated architectures and datasets, JODAL consistently outperformed standard joint training, with its benefits being most pronounced at higher synthetic ratios. The most notable result occurred with WavLM on GigaSpeech, where the JODAL-enhanced R10 + S90 configuration achieved a WER of 16.76%, lower than that of the 100% real data baseline (R100: 17.67%). This result suggests that JODAL encourages the ASR encoder to learn domain-invariant features, making TTS-generated speech an effective supplementary resource when real-world data collection is expensive in low-resource settings.
6.5. Limitations and Future Works
Despite the effectiveness of JODAL, several avenues for future research remain:
Persistent Domain Discrepancy: While the proposed JODAL approach is effective in mitigating the domain inconsistency between real and TTS-generated speech, it does not completely eliminate this discrepancy. The experimental results show that even with JODAL and a high proportion of synthetic data (e.g., R10 + S90), the ASR performance did not consistently reach the level obtained with fully real data (e.g., R100); the WavLM configuration was an exception rather than the rule. This suggests that the synthetic data, even after domain adaptation, is not a perfect substitute for real speech in fully capturing its complexities and variability. Further improvements may require more advanced TTS models capable of capturing subtle speaker characteristics, coarticulation effects, and contextual prosody variations more faithfully. Moreover, extending domain adaptation beyond feature-level alignment to include model-level or task-level strategies may further reduce the remaining performance gap.
Dependence on TTS Quality: The effectiveness of the TTS-based data augmentation strategy is inherently dependent on the quality and characteristics of the synthetic speech produced by the specific TTS systems used. Our experimental results demonstrate that different TTS systems yield varying levels of improvement, with some (like XTTS) consistently outperforming others (YourTTS, VITS, Tacotron2). This highlights that the quality of the synthetic data serves as a limiting factor for the overall ASR performance gain, even with domain adaptation. As a future direction, it would be valuable to explore joint optimization frameworks in which TTS and ASR models are trained collaboratively, allowing the TTS system to generate speech more closely tailored to the requirements of ASR training. Furthermore, developing TTS models explicitly optimized for recognition performance could enhance the informativeness and effectiveness of synthetic data for speech recognition tasks.
Inherent Differences Between Synthetic and Real Speech: Although JODAL reduces feature-level domain discrepancy between real and TTS-generated speech, it does not completely eliminate fine-grained differences between the two domains. In particular, real-world background noise distributions, recording channel effects, speaker-specific idiosyncrasies, coarticulation patterns, and subtle prosodic variations may remain insufficiently captured by synthetic speech. Therefore, JODAL should be interpreted as a feature-level mitigation strategy rather than a complete solution to all real-synthetic mismatch factors. These inherent distinctions may limit the extent to which synthetic data can replicate the rich acoustic and linguistic variations present in real-world speech, particularly in complex datasets such as GigaSpeech. Future work should investigate ASR-aware TTS generation, noise-aware synthesis, and multi-level adaptation strategies that explicitly address these fine-grained acoustic and speaker-dependent discrepancies.
Evaluation Scope: This study evaluates JODAL on two English ASR benchmark datasets, LibriSpeech and GigaSpeech. Therefore, the results should not be interpreted as evidence of general robustness across multilingual, accented, or broader multi-domain speech conditions. Future work should extend the evaluation to accented speech, multilingual ASR, domain-shifted recording environments, and other low-resource speech scenarios to more comprehensively assess the generalizability of the proposed framework.
7. Conclusions
This study addressed the critical challenge of data scarcity in low-resource ASR by strategically leveraging synthetic speech generated by TTS systems. To overcome the inherent domain discrepancy between real and synthetic data, we proposed JODAL, a framework designed to learn domain-invariant acoustic representations. Our extensive experiments on the LibriSpeech and GigaSpeech datasets revealed that while vanilla TTS-based augmentation improves performance, it often suffers from stagnation or degradation as the proportion of synthetic data increases, particularly in acoustically complex environments. JODAL mitigated the performance degradation caused by real-synthetic mismatch and reduced WER across the evaluated ASR architectures, including Whisper-tiny and self-supervised models such as wav2vec 2.0, HuBERT, and WavLM. In the evaluated WavLM configuration, the model trained on 10% real data augmented with 90% synthetic data achieved WERs of 16.76% on GigaSpeech and 4.15% on LibriSpeech, which were lower than those of the corresponding 100% real-data baselines in these experiments. These results suggest that TTS-generated speech, when combined with feature-level domain adversarial alignment, can be an effective supplementary resource under the examined English low-resource ASR settings. Furthermore, our analysis of UTMOS scores suggests that acoustic and speaker diversity in TTS models are more critical for ASR utility than perceptual naturalness alone.
While some domain differences remain, this work indicates that JODAL is a practical way to increase the utility of synthetic data in the evaluated settings. Future research will explore multi-level adversarial alignment and the development of ASR-aware TTS systems to further narrow the gap between synthetic and natural speech domains.