JODAL: Joint Domain Adversarial Learning for TTS-Augmented Automatic Speech Recognition

Kim, June-Woo; Jung, Ho-Young

doi:10.3390/math14101669

by June-Woo Kim ¹ and Ho-Young Jung ²,*

¹ Department of Electronic Engineering, Wonkwang University, Iksan 54538, Republic of Korea

² Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1669; https://doi.org/10.3390/math14101669

Submission received: 18 April 2026 / Revised: 2 May 2026 / Accepted: 13 May 2026 / Published: 14 May 2026

(This article belongs to the Special Issue Advances in Deep Learning, Computer Vision, and Engineering Applications)

Abstract

Despite rapid advances in deep learning, the scarcity of task-specific labeled data remains a significant challenge in automatic speech recognition (ASR). While data augmentation and large-scale pretrained speech foundation models have shown promise in mitigating this limitation, they are often insufficient for improving downstream ASR performance in practical low-resource scenarios. We address these challenges by leveraging off-the-shelf text-to-speech (TTS) systems to generate synthetic speech samples from task-specific text and speech prompts in a resynthesis manner. Although training on real and synthetic data together is expected to alleviate data scarcity while maintaining ASR performance, we observe that the distributional discrepancy between real and synthetic data introduces a domain inconsistency problem that negatively impacts ASR performance. To address this, we propose Joint Domain Adversarial Learning (JODAL), which reduces the mismatch between real and TTS-generated samples through domain adversarial training. Experimental results demonstrate that JODAL consistently improves ASR robustness across various architectures, including Whisper-tiny and three self-supervised models. Notably, for the WavLM architecture, JODAL surpassed the performance of models trained on 100% real data, highlighting its potential as a reasonable substitute for expensive real-world data collection.

Keywords: automatic speech recognition; text-to-speech; domain adaptation; data augmentation; synthetic data; deep learning applications; artificial intelligence

MSC: 68T10

1. Introduction

Although deep learning-based automatic speech recognition (ASR) technologies have progressed significantly, obtaining a sufficient volume of high-quality speech data remains a major obstacle in the development of robust ASR systems, especially in low-resource settings [1]. Modern ASR systems rely on large amounts of labeled speech data to learn complex acoustic and linguistic patterns through deep neural architectures [2,3,4]. However, in many practical scenarios, especially within low-resource domains or languages, such labeled data remains limited [4,5]. This scarcity notably limits the model’s ability to generalize, often resulting in degraded recognition performance [2]. Addressing this data bottleneck is therefore crucial for building robust ASR systems in real-world low-resource settings. Further, the collection and annotation of speech data incur substantial financial and temporal costs, and these challenges are especially severe for low-resource languages, where sufficiently large and high-quality labeled speech datasets are often unavailable [6]. ASR models developed under such conditions are likely to overfit to the patterns in the training data [7]. Empirical studies have demonstrated that ASR systems trained in such low-resource settings yield higher word error rates (WERs), thereby hindering their practical deployment and usability [8,9,10,11,12].

To address data scarcity in low-resource ASR, data augmentation and large-scale pretrained speech foundation models have emerged as promising strategies. Augmentation techniques such as noise addition, time-stretching, pitch shifting, SpecAugment, and speech synthesis have been shown to enhance ASR model generalization [13]. In addition, the emergence of pretrained speech foundation models, such as wav2vec 2.0 [14], HuBERT [15], and WavLM [16], has significantly advanced the performance of ASR. These models learn generalizable speech representations through self-supervised learning on large-scale unlabeled speech datasets, enabling better performance on downstream tasks such as phoneme classification, speaker recognition, emotion recognition, and ASR [17]. Among these foundation models, Whisper [18] stands out as a supervised multilingual ASR model trained on 680K hours of labeled audio. Its broad coverage of languages and robustness to noisy conditions make it particularly well-suited for real-world speech recognition tasks.

Despite the availability of well-generalizable pretrained models and the application of data augmentations, developing ASR systems that perform well in practical scenarios remains difficult, particularly for domain-specific tasks or extremely low-resource settings. While pretrained speech foundation models offer strong generalization, they struggle to capture domain-specific speech patterns that had limited coverage in the pretraining stage. Fine-tuning can help but is not always feasible in low-resource conditions [12]. Moreover, traditional speech data augmentation approaches primarily enhance acoustic diversity while neglecting lexical and grammatical variations.

To overcome the limitations of conventional data augmentation and improve coverage of task-specific linguistic patterns, we leverage text-to-speech (TTS) systems to generate synthetic speech from task-relevant text. Unlike conventional augmentation methods that modify existing audio, TTS-based synthesis can effectively generate entirely new speech samples that directly align with ASR objectives [19,20]. Recent advances in TTS technologies, including conditional variational autoencoder with adversarial learning for end-to-end TTS (VITS) [21], FastSpeech2 [22], and Tacotron 2 [23], have remarkably enhanced the perceptual quality of synthesized speech. Moreover, zero-shot multi-speaker systems such as VALL-E [24], XTTS [25], YourTTS [26], and VoiceBox [27] have demonstrated promising performance, enabling high-quality synthetic speech generation without speaker-specific fine-tuning. These advancements offer a promising avenue to mitigate the challenges associated with limited labeled data. However, synthetic speech generated by TTS methods differs from real speech in acoustic characteristics, prosody, and naturalness. Such discrepancies cause ASR models to perceive synthetic and real speech as originating from distinct distributions, consequently diminishing training efficiency and potentially degrading model performance. We refer to this phenomenon as “Domain Distribution Inconsistency”, as shown in Figure 1.

In this paper, we address the domain mismatch between real and TTS-generated speech, which generally degrades ASR performance as the proportion of synthetic data increases. To this end, we propose a joint domain adversarial learning (JODAL) approach that employs domain adaptation to learn domain-invariant representations. By treating real and synthetic data as distinct domains, the model is trained to align the feature distributions of real and synthetic speech, enabling robust speech representation learning despite distributional shifts. This approach mitigates domain-specific biases and facilitates consistent learning of task-relevant acoustic characteristics without regard to the origin of the data.

The main contributions of this work are summarized as follows:

We introduce a resynthesis-based data augmentation strategy using off-the-shelf TTS systems, showing that acoustic diversity in multi-speaker systems matters more for ASR utility than perceptual naturalness.
We define and analyze the “Domain Distribution Inconsistency” issue, demonstrating that excessive incorporation of synthetic data without domain adaptation can adversely affect ASR performance, especially in noisy datasets like GigaSpeech [28].
We formulate the distributional mismatch between real and TTS-generated speech as a domain adaptation problem in low-resource ASR. Based on this formulation, we evaluate a gradient reversal-based adversarial learning strategy that mitigates the feature-level real-synthetic discrepancy across the evaluated ASR backbones, TTS systems, datasets, and training configurations.
We evaluate JODAL across multiple ASR backbones, TTS systems, synthetic-to-real data ratios, and two English benchmark datasets. In the evaluated WavLM configuration, JODAL achieved lower WER than the 100% real-data baseline on LibriSpeech [29] and GigaSpeech, suggesting that TTS-generated speech can be an effective supplementary resource in the examined low-resource ASR settings.

2. Related Works

2.1. TTS-Based Data Augmentation for ASR

Recent advancements in TTS have significantly improved the utility of synthetic speech for ASR data augmentation. Whereas early efforts focused primarily on limited-vocabulary or low-scalability approaches, modern approaches increasingly emphasize the synthesis of speech that reflects both lexical richness and acoustic variability. Mimura et al. [30] proposed augmenting ASR training with synthetic speech generated from target-domain text using a sequence-to-sequence TTS model, showing an improvement in word error rate (WER) over models trained only on real speech data. Ueno et al. [31] addressed the out-of-vocabulary (OOV) word limitation of ASR models by using a multi-speaker speech synthesis system to generate target-domain training data from text, leading to significant performance gains over both baseline and single-speaker setups. Rossenbach et al. [32] demonstrated that using synthetic speech generated from text-only data by TTS systems can notably improve end-to-end ASR performance, achieving up to 33% relative WER reduction in low-resource settings. Du and Yu [33] introduced a speaker augmentation method for ASR data generation using a VAE-based TTS system to synthesize diverse speaker voices from text. Zheng et al. [34] enhanced RNN-T ASR models [35] by using TTS-generated synthetic audio for OOV words, achieving a 57% relative WER reduction on OOV utterances through fine-tuning with elastic weight consolidation, while preserving overall performance on the full test set. VoiceBox [27] has demonstrated its applicability to ASR data augmentation by enabling the training of an automatic speech recognition model exclusively on synthetic speech, which was subsequently evaluated on real-world speech data with promising results.

2.2. Domain Adaptation

Although deep learning models have achieved strong performance across a wide range of tasks, their effectiveness heavily depends on the availability of large-scale labeled datasets, which are typically scarce in emerging or specialized domains [36,37]. This lack of diverse data hinders generalization to domains with distinct data distributions, resulting in degraded performance [38,39,40,41]. To mitigate this challenge, domain adaptation techniques have been proposed to facilitate knowledge transfer from a well-resourced source domain to a target domain with distinct characteristics. A foundational study by Ben-David et al. [42] introduced theoretical generalization bounds for domain adaptation, leveraging a finite number of labeled samples from the source domain. Building on this theory, Ganin et al. [38] introduced domain adversarial training of neural networks (DANN), which employs a domain discriminator to promote the learning of domain-invariant representations. A gradient reversal layer inverts the discriminator’s gradients during backpropagation, effectively aligning feature distributions between domains.

Domain Adversarial Training (DAT) has been extensively explored in the ASR community as an effective strategy to mitigate domain mismatch and enhance the learning of domain-invariant representations. Sun et al. [43] presented the DAT approach to improve accented speech recognition by learning accent-invariant features without requiring transcriptions for the target domain. Woszczyk et al. [44] applied DAT to dysarthric speech recognition, achieving a 12.18% absolute improvement over the baseline and matching speaker-adaptive models in performance. Tanaka et al. [45] introduced a self-supervised speech representation learning method that leverages DAT to learn domain-invariant features, enhancing ASR robustness in unseen domains. While prior studies have primarily applied DAT to naturally occurring domain shifts across speaker accents, speech disorders, or recording conditions, our approach targets a different source of mismatch: the distributional gap between real and synthetic speech, which widens as the proportion of synthetic speech in training increases and can degrade ASR performance. Therefore, the methodological novelty of this work is not in modifying the gradient reversal layer itself, but in applying and systematically evaluating domain adversarial feature alignment for the specific real-synthetic mismatch caused by TTS-based augmentation in low-resource ASR.

3. Methodology

3.1. Problem Formulation: Data Augmentation for Low-Resource ASR by TTS Model

The collection and annotation of high-quality speech data remain a critical bottleneck, particularly in low-resource domains such as dialectal speech, pediatric speech, or speech with pathological characteristics. This data scarcity hinders the generalization capability of ASR models across diverse acoustic and linguistic environments. To address this limitation, we propose a TTS-based data augmentation framework that synthesizes training data to enhance ASR performance in low-resource settings.

Given a limited real dataset $D_{real}$ and its corresponding transcriptions $T = \{t_{1}, t_{2}, \dots, t_{n}\}$, the goal is to generate a synthetic speech set $D_{syn} = \{s_{1}, s_{2}, \dots, s_{n}\}$ such that $D_{mixed} = D_{real} \cup D_{syn}$ forms a more diverse and robust training corpus for the ASR model. While TTS-generated samples offer scalability and linguistic flexibility, they also introduce challenges related to the distributional discrepancy between synthetic and real speech. Inconsistencies such as unnatural prosody, limited speaker variability, absence of environmental noise, and artificial signal characteristics can adversely affect ASR performance, particularly when synthetic data accounts for a significant portion of the training set. To address these challenges, in the following sections we present a TTS-based augmentation framework and a JODAL strategy to improve ASR performance under varying synthetic-to-real data ratios.

3.2. Speech Data Generation Using TTS System

As an initial approach, we address the challenges of low-resource ASR training by generating synthetic speech, using both audio and textual prompts with TTS systems. As shown in Table 1, we employ a range of TTS models, including XTTS [25], YourTTS [26], Tacotron2 [23], and VITS [21] (all available online: https://github.com/coqui-ai/tts, accessed on 18 April 2026), to construct augmented training datasets. XTTS and YourTTS are capable of zero-shot speaker adaptation by leveraging learned speaker embeddings, thereby facilitating multilingual and multi-speaker speech synthesis. These models allow conditioning on both acoustic and textual inputs to generate speech outputs. In this study, we use a resynthesis strategy, wherein synthetic speech is generated by simultaneously employing the original audio waveform and its corresponding transcription, effectively preserving speaker characteristics and prosodic information from the source samples.

In contrast, Tacotron2 and VITS are designed for single-speaker synthesis and operate solely based on textual inputs. Tacotron2 adopts a sequence-to-sequence architecture that maps character-level inputs to Mel-spectrograms, which are subsequently converted into waveforms via neural vocoders. VITS, on the other hand, employs a variational inference framework within a fully end-to-end architecture, eliminating the need for external alignment mechanisms and enabling the generation of high-fidelity speech.

To simulate a low-resource scenario, we limited the original training set to only 10% of the full dataset. We then incrementally augmented this reduced set by synthesizing additional speech at varying proportions (specifically 20%, 30%, up to 90%) using TTS systems. Each synthesized subset was combined with the 10% real data subset to construct progressively larger mixed-domain training sets. These mixed datasets serve to examine the impact of synthetic data scale on ASR model performance.
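The construction of a mixed training set can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_mixed_set`, the utterance-level (rather than duration-based) sampling, and the fixed random seed are all assumptions. The domain labels follow the convention used later in the paper (1 for real, 0 for synthetic).

```python
import random

def build_mixed_set(utterance_ids, real_frac=0.10, syn_frac=0.40, seed=0):
    """Pick a real subset and a disjoint subset to be resynthesized by TTS.

    Returns (utterance_id, domain_label) pairs: 1 = real, 0 = synthetic.
    Sampling by utterance count is an assumption made for illustration.
    """
    if real_frac + syn_frac > 1.0 + 1e-9:
        raise ValueError("real_frac + syn_frac must not exceed 1.0")
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    n_real = round(real_frac * len(ids))
    n_syn = round(syn_frac * len(ids))
    real = [(u, 1) for u in ids[:n_real]]
    # Utterances outside the real subset supply the text and speaker prompts
    # that a TTS system would turn into synthetic speech.
    syn = [(u, 0) for u in ids[n_real:n_real + n_syn]]
    return real + syn

# Hypothetical corpus of 1000 utterance IDs, configuration R10 + S40.
corpus = [f"utt{i:04d}" for i in range(1000)]
mixed = build_mixed_set(corpus, real_frac=0.10, syn_frac=0.40)
n_real = sum(1 for _, d in mixed if d == 1)
n_syn = sum(1 for _, d in mixed if d == 0)
print(n_real, n_syn)
```

Keeping the two subsets disjoint means that no transcript appears in both real and synthetic form, so any change in WER can be attributed to the added synthetic material rather than to repeated text.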

3.3. Joint Domain Adversarial Learning for Mitigating Data Distributional Inconsistency Issue

Despite significant advances in neural TTS systems, synthetic speech still differs from real speech samples, particularly in prosody patterns, speaker-specific variations, and background noise characteristics. When synthetic data constitutes a substantial portion of the mixed training corpus (i.e., real + synthetic), these discrepancies may introduce distributional mismatches, which can cause ASR models to overfit to non-natural artifacts, leading to generalization errors on real-world speech. Empirically, we observed a consistent decline in ASR performance as the proportion of synthetic data increases. Specifically, models trained on a mixture of 10% real data and varying amounts of synthetic data (10%, 20%, 30%, 40%, 90%) exhibit noticeable performance drops compared to those trained on original data subsets of equivalent sizes (20%, 30%, 40%, 50%, and 100%), highlighting the potential negative effect of excessive reliance on synthetic inputs.

To address this issue, we adopt the established domain adversarial learning framework based on a gradient reversal layer and integrate it into the ASR training pipeline for TTS-augmented low-resource ASR, as illustrated in Figure 2. We refer to this training framework as JODAL. The objective of JODAL is not to introduce a new adversarial optimization mechanism, but to formulate the real-synthetic mismatch as a domain adaptation problem and to empirically evaluate whether feature-level adversarial alignment can improve ASR training when synthetic speech is increasingly incorporated. Specifically, the domain discriminator is implemented as a single-layer feedforward neural network that is attached to the ASR encoder representation and trained to distinguish whether a given speech sample originates from the real or synthetic domain. Through the gradient reversal layer, the encoder receives the reversed domain classification gradient during backpropagation, thereby suppressing domain-discriminative information while preserving ASR-relevant representations.

Mathematically, the proposed JODAL approach is defined as follows. Let $f(x)$ denote the feature representation extracted by the ASR encoder, parameterized by $\theta_{f}$, for an input speech signal $x$. The ASR decoder, represented as $g(f(x))$ with parameters $\theta_{g}$, is responsible for predicting the corresponding output sequence, consisting of words or phonemes. To enforce domain invariance, we incorporate a data discriminator $h(f(x))$, parameterized by $\theta_{h}$, which estimates the probability that the input $x$ originates from real speech $D_{real}$ rather than synthetic speech $D_{syn}$. The overall training objective is formulated as a multi-task loss, combining the standard ASR loss with an adversarial domain classification loss, and is given by

$$L_{total} = L_{ASR}(g(f(x)), y) - \lambda L_{dis}(h(f(x)), d), \tag{1}$$

where $L_{ASR}$ is the standard ASR loss, typically a sequence-to-sequence or Connectionist Temporal Classification (CTC) [46] loss, and $L_{dis}$ is the data discriminator loss, which in this case is cross-entropy. The ground truth transcript is denoted by $y$, while $d$ is the data domain label (0 for synthetic, 1 for real), and $\lambda$ is a domain adaptation hyperparameter. The gradient reversal layer implicitly scales the gradient of $L_{dis}$ by $-\lambda$ during backpropagation, encouraging the encoder to learn representations that are predictive for the ASR task but invariant to the speech domain. Through this mechanism, JODAL enables the ASR model to focus on the speech content rather than the domain-specific characteristics, thereby improving the model’s ability to generalize across both real and synthetic speech data. This approach has shown significant promise in addressing domain shift problems [43,44,45,47] and enhancing the effectiveness of TTS-based data augmentation for ASR.
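To make the role of the gradient reversal layer concrete, the following sketch computes the gradients of Equation (1) by hand for a scalar toy model. Everything here is an illustrative assumption rather than the authors' implementation: the encoder, task head, and discriminator are single weights (`w_f`, `w_g`, `w_h`), a squared error stands in for the CTC or sequence-to-sequence loss, and the function names are hypothetical. Only the sign logic mirrors the text: the discriminator descends $L_{dis}$ as usual, while the encoder receives that gradient multiplied by $-\lambda$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def losses(x, y, d, w_f, w_g, w_h):
    """Return (L_ASR, L_dis) for one sample of a scalar toy model."""
    f = w_f * x                      # encoder feature f(x)
    g = w_g * f                      # task head g(f(x))
    h = sigmoid(w_h * f)             # discriminator h(f(x)) = P(real)
    l_asr = 0.5 * (g - y) ** 2       # stand-in for the CTC / seq2seq loss
    l_dis = -(d * math.log(h) + (1 - d) * math.log(1.0 - h))
    return l_asr, l_dis

def jodal_gradients(x, y, d, w_f, w_g, w_h, lam):
    """Backward pass with a gradient reversal layer before the discriminator."""
    f = w_f * x
    g = w_g * f
    h = sigmoid(w_h * f)

    # Task branch.
    dl_asr_dg = g - y
    grad_w_g = dl_asr_dg * f
    dl_asr_df = dl_asr_dg * w_g

    # Domain branch: for sigmoid + cross-entropy, dL_dis/dlogit = h - d.
    dl_dis_dlogit = h - d
    grad_w_h = dl_dis_dlogit * f     # discriminator still minimizes L_dis
    dl_dis_df = dl_dis_dlogit * w_h

    # Gradient reversal: identity forward, multiply by -lam on the way back.
    grad_w_f = (dl_asr_df - lam * dl_dis_df) * x
    return grad_w_f, grad_w_g, grad_w_h

# Hypothetical values: one real sample (d = 1) with lambda = 0.1.
grad_w_f, grad_w_g, grad_w_h = jodal_gradients(
    x=1.0, y=0.5, d=1, w_f=0.8, w_g=1.2, w_h=-0.5, lam=0.1
)
```

In this toy setting the encoder gradient equals the derivative of $L_{ASR} - \lambda L_{dis}$ with respect to the encoder weight, whereas the discriminator gradient is the ordinary derivative of $L_{dis}$. In an automatic differentiation framework the same effect is usually obtained with a custom layer whose backward pass negates and scales the incoming gradient.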

3.4. Fine-Tuning Pretrained ASR Models

To train ASR models on mixed datasets comprising both real and synthetic speech, we selected one supervised model (Whisper [18]) and three self-supervised models (wav2vec 2.0 [14], HuBERT [15], and WavLM [16]) as backbone encoders and fine-tuned them on our target datasets. These models have demonstrated state-of-the-art performance across a wide range of speech recognition studies and serve as robust initialization points for downstream speech tasks [17]. In particular, Whisper, trained with a cross-entropy loss on a large-scale multilingual corpus exceeding 680,000 h, provides a versatile encoder–decoder architecture that supports transcription and translation across diverse languages. The self-supervised models, in contrast, learn latent speech representations from unlabeled audio during pretraining (with a contrastive loss in wav2vec 2.0 and masked-prediction objectives in HuBERT and WavLM) and are then fine-tuned with a CTC [46] loss for the ASR task.

4. Experimental Settings

We conducted extensive experiments under controlled low-resource conditions to evaluate the effectiveness of the proposed TTS-based data augmentation and JODAL approach. Our experimental setup was designed to address the following key questions:

Research Question 1. How effective is TTS-based data augmentation in improving ASR performance in low-resource settings?
Research Question 2. What is the impact of increasing the proportion of synthetic data on ASR performance?
Research Question 3. How do different TTS systems compare in terms of their effectiveness for ASR data augmentation?
Research Question 4. Can JODAL mitigate the domain distribution inconsistency problem between real and synthetic speech?

4.1. Dataset

We employed two primary speech datasets in our experiments: LibriSpeech [29] (Available online: https://www.openslr.org/12, accessed on 18 April 2026) and GigaSpeech [28] (Available online: https://github.com/SpeechColab/GigaSpeech, accessed on 18 April 2026), both of which are widely recognized resources in the field of ASR. LibriSpeech is a benchmark corpus comprising approximately 960 h of read English speech sourced from public domain audiobooks available through the LibriVox project. The dataset features over 2000 distinct speakers, with balanced representation across genders and diverse reading styles, thereby enabling robust modeling of speaker variability. To simulate a low-resource scenario, we selectively utilize the train-clean-100 subset, which contains approximately 100 h of speech. From this, we sample only 10% (roughly 10 h) to construct our real speech dataset $D_{real}$, while preserving diversity in speaker identity and linguistic content. The remaining speech data from the corpus is reserved for generating synthetic speech using the TTS systems.

GigaSpeech, on the other hand, is a significantly larger and acoustically more diverse corpus, comprising approximately 10,000 h of English speech recordings collected from a broad range of real-world sources, including audiobooks, podcasts, and YouTube videos. It features spontaneous, conversational, and scripted speech styles, and includes substantial background noise and speaker variability, which makes it particularly suitable for evaluating domain robustness in ASR models. For our experiments, we employed the 250 h GigaSpeech training subset S, from which we took a limited portion (10%, approximately 25 h) to serve as real speech data, while reserving the remainder of the subset for synthetic data generation.

For ASR evaluation, we use the standard test sets provided with each dataset. From LibriSpeech, we employ the test-clean set, which contains high-quality read speech and serves as a benchmark for clean ASR evaluation. From GigaSpeech, we use its corresponding test set, which better reflects real-world acoustic variability due to its diverse recording sources and spontaneous speech content.

4.2. Training Set Design: Real vs. Synthetic Composition

To examine the impact of synthetic data on ASR performance, we constructed a series of training sets by systematically adjusting the ratio between real and synthesized speech. Specifically, we sampled subsets of real speech at 10%, 20%, 30%, 40%, 50%, and 100% of the original corpus. The 10% subset serves as the low-resource baseline condition. To simulate progressive data augmentation, we generated synthetic speech from the remaining transcripts using the TTS models outlined in Section 3.2. The synthesized utterances were incrementally added to the 10% real subset, resulting in a range of augmented training scenarios, as summarized in Table 2.

These training configurations enable a controlled analysis of the influence of increasing synthetic speech while maintaining a fixed amount (10%) of real speech. Furthermore, we constructed comparative baselines using real data only, at 20%, 30%, 40%, 50%, and 100% levels, denoted as R20, R30, …, R100. This design allows us to evaluate the relative benefit of synthetic data augmentation versus collecting additional real data. All audio inputs (both real and synthetic) were downsampled to 16 kHz, converted to mono, and truncated to a maximum duration of 30 s.

4.3. Implementation Details

For synthetic speech generation, all TTS model checkpoints employed in this study were obtained from the publicly available Coqui-AI TTS repository (https://github.com/coqui-ai/TTS, accessed on 18 April 2026). For ASR model fine-tuning, we adopted the “tiny” configuration of Whisper, comprising approximately 39 million parameters. For wav2vec 2.0 [14], HuBERT [15], and WavLM [16], we used the “Base” configuration with around 95 million parameters and integrated a linear projection layer on top of the feature encoder to convert latent representations into output token distributions. For the evaluation of these self-supervised models, we integrated a 4-gram language model [29] (available online: https://www.openslr.org/12, accessed on 18 April 2026) using KenLM [48] (available online: https://github.com/kpu/kenlm, accessed on 18 April 2026). During decoding, a beam size of 50 was applied, with a fixed language model weight of 2 and a word insertion penalty of −1, as described in [14]. In this study, we report the performance of these self-supervised architectures only with the 4-gram language model to ensure robust recognition results. In contrast, the Whisper model was evaluated without any language model.
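The way a language model weight and a word insertion penalty typically enter beam-search decoding can be sketched as a simple hypothesis score. This is an assumption about the common log-linear form (acoustic score plus weighted language model score plus a per-word term), not a description of the exact decoder used here. The function name `fused_score` and the log-probabilities in the example are hypothetical.

```python
def fused_score(am_logprob, lm_logprob, num_words,
                lm_weight=2.0, word_penalty=-1.0):
    """Log-linear hypothesis score: acoustic + weighted LM + per-word term."""
    return am_logprob + lm_weight * lm_logprob + word_penalty * num_words

# Two hypothetical beam candidates with made-up log-probabilities.
hyps = [
    {"text": "the cat sat", "am": -4.0, "lm": -3.0},
    {"text": "the cats at", "am": -3.8, "lm": -5.5},
]
best = max(
    hyps,
    key=lambda h: fused_score(h["am"], h["lm"], len(h["text"].split())),
)
print(best["text"])  # the cat sat
```

In this example the second candidate has the better acoustic score, but the weighted language model term reverses the ranking, which is the intended effect of integrating an external n-gram model.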

All models were optimized on the mixed dataset $D_{mixed}$ comprising real and synthesized speech samples. Training was conducted with FP16 precision, a batch size of 16, and an initial learning rate of 5 × 10⁻⁵. All models were trained for up to 10 epochs using the AdamW optimizer [49], incorporating gradient norm clipping [50] and a linear learning rate decay schedule with a 7% warm-up phase. No additional data augmentation or explicit regularization techniques were applied during training. The training process typically consisted of approximately 35,470 steps under the R100 configuration, although the exact number of steps may vary depending on the configuration types. For both LibriSpeech and GigaSpeech, the best-performing checkpoints were selected based on the development set performance, and the final results were reported on the respective test sets using these optimized models.
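The learning rate schedule described above can be sketched as a small function. The peak rate (5 × 10⁻⁵), the 7% warm-up fraction, and the 35,470-step length come from the text. That the warm-up is linear, counted in optimizer steps, and that the rate decays to exactly zero at the final step are assumptions, as is the function name `linear_warmup_decay`.

```python
def linear_warmup_decay(step, total_steps=35470, peak_lr=5e-5, warmup_frac=0.07):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr at the end of warm-up to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for s in (0, 1241, 2482, 35470):
    print(s, linear_warmup_decay(s))
```

With these defaults the warm-up covers the first 2482 steps, after which the rate falls linearly for the remaining 32,988 steps.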

4.4. Evaluation Metrics

To quantitatively assess the performance of the ASR model, we used the word error rate (WER), which is the primary metric for evaluating ASR performance, offering a comprehensive measure of transcription accuracy by accounting for various types of recognition errors. Formally, WER is defined as

$$\mathrm{WER} = \frac{S + D + I}{N}, \tag{2}$$

where S denotes the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of words in the ground-truth reference transcription. A lower WER value indicates higher transcription accuracy and, therefore, better ASR performance. We reported the ASR performance on the test-clean and test-other subsets of LibriSpeech, as well as on the dev and test sets of GigaSpeech.
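As an illustration of Equation (2), the sketch below computes WER for a single utterance with the standard word-level edit distance, whose minimum cost equals S + D + I. The function name `word_error_rate`, the whitespace tokenization, and the absence of text normalization are assumptions. Published results are normally computed over a whole test set by summing errors and reference words across utterances.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.2f}%")  # 33.33%
```

Because insertions count as errors while N is fixed by the reference, WER can exceed 100% when the hypothesis is much longer than the reference.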

Although the Mean Opinion Score (MOS) is inherently subjective and cannot serve as an absolute or standalone evaluation metric, it remains a valuable proxy for assessing perceptual quality, particularly in the context of synthetic speech evaluation. In this study, we investigated the relationship between MOS values and ASR performance, to analyze their correlation patterns. To this end, we employed the Naturalness Mean Opinion Score (nMOS) predicted by the UTMOS model to estimate perceptual naturalness, which was applied in prior studies [25,51]. Specifically, we employ the publicly available implementation of UTMOS (Available online: https://github.com/tarepan/SpeechMOS, accessed on 18 April 2023), which has been shown to produce nMOS predictions that closely align with human-annotated scores across various TTS outputs.

5. Results

5.1. Results on LibriSpeech Dataset

5.1.1. Performance Analysis of Whisper-Tiny Model on LibriSpeech

Table 3 summarizes the WER performance of the Whisper-tiny model on the LibriSpeech test-clean and test-other subsets under various training configurations. The first row reports baseline results with increasing volumes of real speech data (R10 to R100), while the subsequent rows present the performance of models augmented with TTS-generated synthetic speech (SXX) with and without the application of JODAL. Notably, the Zero-Shot performance reflects the accuracy of the original pretrained Whisper-tiny model evaluated directly without further task-specific fine-tuning.

Interestingly, the zero-shot performance (7.54% on test-clean) significantly outperformed all fine-tuned models, including the R100 baseline (11.70%). This suggests that while fine-tuning on limited task-specific data adapts the model to the target domain, it may simultaneously induce catastrophic forgetting or a degradation of the robust, generalized recognition capabilities inherent in the large-scale pretrained model. This observation shows the difficulty of the experimental setting, where even substantial target-domain data (R100) struggles to surpass the original pretrained state.

Despite these challenges, data augmentation using synthetic speech consistently improved ASR performance relative to the R10 baseline (25.49%/40.46% WER). For instance, adding just 10% of XTTS-generated data (R10 + S10) notably reduced the WER to 17.93% and 33.18% on the respective test sets. Among the evaluated TTS systems, XTTS exhibited the most substantial and consistent improvements, outperforming other models in all configurations. This superiority is likely attributed to its advanced speaker modeling and multi-speaker synthesis capabilities, which effectively reduce the acoustic mismatch between synthetic and real domains.
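The size of these gains is easier to compare across test sets when expressed as a relative WER reduction. The helper below is a small illustrative sketch (the function name `relative_wer_reduction` is hypothetical) applied to the Whisper-tiny figures quoted in the preceding paragraph.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# R10 baseline vs. the XTTS-augmented figures quoted in the text.
clean = relative_wer_reduction(25.49, 17.93)
other = relative_wer_reduction(40.46, 33.18)
print(f"test-clean: {100 * clean:.1f}%")  # 29.7%
print(f"test-other: {100 * other:.1f}%")  # 18.0%
```

The absolute reductions are similar on both sets (7.56 and 7.28 points), but the relative gain is smaller on test-other because its baseline WER is higher.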

While increasing the proportion of synthetic data generally yielded further performance gains, diminishing returns were observed beyond the R10 + S40 configuration. For instance, the WER difference between R10 + S40 and R10 + S90 using XTTS was relatively marginal (15.06% → 13.38% on test-clean), suggesting a saturation effect. Furthermore, even with 90% synthetic data (R10 + S90), performance remained inferior to the R100 baseline, indicating a persistent domain gap between synthetic and natural speech that cannot be resolved solely through volume scaling.
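
The saturation effect can be made concrete by computing the WER reduction per additional 10 points of synthetic data between consecutive configurations. The sketch below uses only the four Whisper-tiny/XTTS test-clean values quoted in the text above (joint training without JODAL); the helper name `marginal_gain_per_10` is hypothetical, and intermediate configurations from Table 3 are not included.

```python
# WERs (%) on LibriSpeech test-clean for Whisper-tiny with XTTS augmentation,
# as quoted in the text (joint training without JODAL).
# Keys are the synthetic share added on top of R10; key 0 is the R10 baseline.
wer_by_synth_share = {0: 25.49, 10: 17.93, 40: 15.06, 90: 13.38}


def marginal_gain_per_10(points: dict[int, float]) -> list[tuple[int, int, float]]:
    """Absolute WER reduction per additional 10 points of synthetic data,
    computed between consecutive reported configurations."""
    shares = sorted(points)
    gains = []
    for lo, hi in zip(shares, shares[1:]):
        drop = points[lo] - points[hi]
        gains.append((lo, hi, drop / (hi - lo) * 10))
    return gains


for lo, hi, gain in marginal_gain_per_10(wer_by_synth_share):
    print(f"S{lo} -> S{hi}: {gain:.2f} WER points per +10% synthetic data")
```

Under these numbers, the first 10% of synthetic data is worth roughly 7.6 WER points, whereas each further 10% between S40 and S90 is worth roughly 0.3 points.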

The proposed JODAL approach effectively mitigated this domain discrepancy. Across all TTS systems and data scales, models trained with JODAL consistently outperformed their non-adversarial counterparts. For example, in the R10 + S10 configuration, XTTS with JODAL achieved a WER of 16.74%/32.50%, compared to 17.93%/33.18% without the adversarial component. JODAL proved especially effective at higher synthetic ratios (e.g., R10 + S40), where non-adversarial models exhibited performance degradation. These results indicate that JODAL is effective in learning domain-invariant acoustic representations, thereby enhancing the utility of synthetic data in ASR model training. The benefit of JODAL is particularly evident in low-resource scenarios where reliance on TTS-based augmentation is high.

5.1.2. Performance Analysis of Self-Supervised Models on LibriSpeech

WER results for the three self-supervised learning models on the LibriSpeech test-clean and test-other sets are summarized in Table 4. All models in this evaluation were integrated with the external 4-gram language model to ensure robust decoding. Compared to the Whisper-tiny results, these models exhibited a higher sensitivity to data scarcity in the R10 baseline. For instance, wav2vec 2.0 yielded a WER of 21.40% on test-clean, which is notably higher than its R100 performance of 4.41%.

However, the introduction of TTS-based data augmentation led to a reduction in WER across all architectures. Using XTTS-generated data at the R10 + S10 configuration, WavLM achieved a WER of 9.58% on test-clean, marking a relative improvement of approximately 60.8% compared to the R10 baseline. Among the three models, WavLM consistently achieved the lowest absolute WER, likely benefiting from its specialized pretraining for ASR downstream tasks, as shown in [16]. Consistent with the observations for the supervised Whisper-tiny model, increasing synthetic data volume improved ASR performance, though gains began to stabilize after the R10 + S40 configuration. A notable domain gap remained evident, as the performance of augmented models even at R10 + S90 (e.g., 5.02% for wav2vec 2.0) did not fully match the R100 baseline (4.41%) through scaling alone.
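
The relative improvement quoted above is the standard relative WER reduction, 100 × (baseline − new) / baseline. The sketch below is a minimal illustration with hypothetical helper names. It first applies the formula to the Whisper-tiny values from Section 5.1.1, where both endpoints appear in the text, and then inverts it to show the R10 WER implied by the WavLM figures. That second number is a back-calculation, not a value read from Table 4.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, relative_reduction_pct: float) -> float:
    """Invert the formula: the baseline WER implied by a reported reduction."""
    if not 0.0 <= relative_reduction_pct < 100.0:
        raise ValueError("relative_reduction_pct must be in [0, 100)")
    return new_wer / (1.0 - relative_reduction_pct / 100.0)


# Whisper-tiny, LibriSpeech test-clean, R10 -> R10 + S10 with XTTS (values from the text).
print(round(relative_wer_reduction(25.49, 17.93), 1))

# WavLM: a 9.58% WER after a ~60.8% relative reduction implies this R10 WER.
# Back-calculated for illustration only; not read from Table 4.
print(round(implied_baseline(9.58, 60.8), 2))
```

With these inputs, the Whisper-tiny reduction comes to roughly 29.7% relative, and the implied WavLM R10 WER is roughly 24.4%.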

The application of JODAL further enhanced the robustness of these models by effectively aligning the feature distributions of real and synthetic speech. Across all configurations in Table 4, models leveraging the JODAL framework consistently outperformed their standard joint training counterparts. For example, in the wav2vec 2.0 R10 + S90 setting using XTTS, JODAL reduced the WER from 5.02% to 4.96% on test-clean and from 12.13% to 11.98% on test-other. Notably, in the case of WavLM, the combination of XTTS and JODAL at R10 + S90 reached a WER of 4.15%, which surpassed the R100 baseline performance of 4.89%. These results suggest that, with JODAL, synthetic data can stand in for a large share of the real data without significant performance loss, even at high synthetic-to-real ratios.

5.2. Results on GigaSpeech Dataset

5.2.1. Performance Analysis of Whisper-Tiny Model on GigaSpeech

Table 5 summarizes the WER performance of the Whisper-tiny model on the GigaSpeech development and test sets. Compared to LibriSpeech, the GigaSpeech corpus presents a more challenging environment for ASR due to its substantial acoustic variability, spontaneous speech patterns, and prevalent real-world noise. The zero-shot performance (17.82%/28.39%) serves as a high-performance baseline, once again outperforming most fine-tuned configurations and highlighting the difficulty of adapting the pretrained Whisper to this complex domain with limited data.

The baseline results using only real speech samples reflect the inherent difficulty of the dataset. Interestingly, the R30 configuration (27.83%/27.33% WER) achieved slightly better performance than the R40 configuration (28.64%/27.67% WER), likely due to a higher degree of acoustic mismatch between the specific training samples and the evaluation sets at that scale.

In contrast to the results observed on LibriSpeech, standard synthetic data augmentation (Original + Synthetic) on GigaSpeech initially showed limited utility. Most TTS-based variants without domain adaptation struggled to improve upon the R10 baseline (32.57%/31.69%) as synthetic volume increased. For example, the XTTS model without JODAL reached a performance plateau, while models like YourTTS actually exhibited performance degradation at higher synthetic ratios, such as R10 + S90 (34.98%/33.66%). This failure of vanilla augmentation indicates the severe domain distribution inconsistency between TTS-generated speech and the noisy, spontaneous nature of the GigaSpeech real-world recordings.

JODAL substantially mitigated this domain discrepancy in the GigaSpeech experiments. JODAL-integrated models consistently outperformed their non-JODAL counterparts across all TTS systems. Specifically, for XTTS at the R10 + S10 configuration, JODAL improved the WER from 31.47%/30.94% to 30.15%/29.96%. This improvement was even more pronounced at the R10 + S40 scale, where the JODAL-based model reached 29.11%/29.07%. This result markedly narrowed the gap with the R50 baseline (28.23%/27.41%), whereas the vanilla augmentation model achieved 31.70%/30.09%. These findings demonstrate that JODAL can effectively leverage synthetic data in high-variance acoustic domains by facilitating the learning of domain-invariant features.
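
As a worked illustration of how far the gap to R50 narrows at R10 + S40, one can compute the fraction of the vanilla-to-R50 gap that JODAL recovers. This "gap closed" ratio is an illustrative quantity introduced here, not a metric defined in the paper, and it uses only the three WER pairs quoted above:

$$
\text{gap closed} = \frac{\mathrm{WER}_{\text{vanilla}} - \mathrm{WER}_{\text{JODAL}}}{\mathrm{WER}_{\text{vanilla}} - \mathrm{WER}_{\text{R50}}}
$$

$$
\text{first value: } \frac{31.70 - 29.11}{31.70 - 28.23} = \frac{2.59}{3.47} \approx 0.75,
\qquad
\text{second value: } \frac{30.09 - 29.07}{30.09 - 27.41} = \frac{1.02}{2.68} \approx 0.38
$$

Assuming the paired values are ordered development/test, JODAL recovers roughly three quarters of the gap on the development set and a little over a third on the test set.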

5.2.2. Performance Analysis of Self-Supervised Models on GigaSpeech

Table 6 summarizes the WER results for the three self-supervised learning models on the GigaSpeech development and test sets. Similar to the observations with Whisper-tiny, these architectures exhibited high sensitivity to data scarcity in the R10 baseline but showed consistent performance gains through synthetic data augmentation. Among the tested configurations, XTTS remained the most effective TTS system. For wav2vec 2.0, the introduction of XTTS-generated data at the R10 + S10 scale reduced the WER from 31.69%/30.25% to 26.60%/25.22%, representing a significant improvement over the 10% real data baseline. This trend was mirrored in HuBERT and WavLM, with WavLM maintaining the lowest absolute WER across all training scenarios, underscoring its robust representation capabilities in noisy environments.

The application of JODAL consistently enhanced performance by mitigating the domain mismatch between the synthetic and real-world acoustic distributions of GigaSpeech. Across all models and TTS variants, training under the JODAL objective outperformed standard joint training. For instance, in the R10 + S90 configuration with XTTS, JODAL reduced the WER of wav2vec 2.0 to 22.94%/21.99%, compared to 23.15%/22.00% without the adversarial component. Notably, for WavLM, the integration of JODAL at the R10 + S90 scale achieved a WER of 17.25%/16.76%, which successfully surpassed the R100 baseline (18.04%/17.67%). These results demonstrate that JODAL effectively leverages high-volume synthetic data to compensate for limited real-world resources, even in acoustically challenging datasets like GigaSpeech, by ensuring the learning of domain-invariant acoustic features.

5.3. Analysis of Perceptual Quality and ASR Utility

5.3.1. Discrepancy Between UTMOS and ASR Performance

To investigate the relationship between the perceptual quality of synthetic speech and its effectiveness in ASR data augmentation, we evaluated the Naturalness Mean Opinion Score (nMOS) of speech generated by each TTS system using the publicly available UTMOS model, which has previously been demonstrated to align with human judgments [25,51]. While we comprehensively analyzed the ASR performance results across various proportions of synthetic data and the presence or absence of JODAL, this subsection focuses on the R10 + S30 configuration, where the proportion of synthetic data is three times that of real data (30% vs. 10%). This scenario effectively highlights the challenges of domain discrepancy and the compensatory role of JODAL. Simultaneously, applying JODAL at this configuration shows a substantial performance improvement in our observations, making it suitable for discussing the effect of our proposed method in relation to UTMOS. Therefore, while UTMOS scores are reported for the synthetic speech datasets derived from the original LibriSpeech or GigaSpeech datasets, we only provide the WER results of the R10 + S30 configuration and the real-only R40 subset. Moreover, we specifically focus on the GigaSpeech dataset for this analysis, as it was not used during the self-supervised pretraining phase of the evaluated models [14,15,16].

As summarized in Table 7, there is no direct linear correlation between a system’s UTMOS rank and its ASR utility. For instance, VITS achieved the highest naturalness score ($4.356 \pm 0.135$) but failed to provide the best recognition results across any of the evaluated ASR architectures. In contrast, XTTS, despite having a relatively lower UTMOS score ($3.883 \pm 0.319$), emerged as the most effective source for augmentation when paired with JODAL. Specifically, for Whisper-tiny, the integration of JODAL with XTTS reduced the WER from 31.35% to 29.74%, significantly narrowing the gap with the R40 baseline (28.64%).

The most compelling finding lies in the performance of self-supervised models. For wav2vec 2.0, HuBERT, and WavLM, the combination of XTTS and JODAL in the R10 + S30 configuration successfully outperformed the R40 baseline. For example, WavLM achieved a WER of 18.90% with JODAL-integrated XTTS augmentation, surpassing the 19.36% recorded for the R40 configuration. Similarly, HuBERT reached 24.69%, significantly better than the R40 result of 25.83%.

As shown in Table 7, within the R10 + S30 configuration, there is a weak direct correlation between the UTMOS rank and the ASR WER rank, regardless of whether JODAL was applied. For example, VITS, which has the highest UTMOS score (4.356 ± 0.135), did not yield the best ASR performance for either the Whisper-tiny or wav2vec 2.0 models. Instead, XTTS, with a relatively lower UTMOS score (3.883 ± 0.319), achieved the best WER when JODAL was applied (Whisper-tiny: 14.16%, wav2vec 2.0: 17.49%), demonstrating its effectiveness as a data augmentation source for ASR despite its UTMOS score. In terms of UTMOS, YourTTS places lower than XTTS, but its Whisper-tiny WER (16.20%) is similar to or better than VITS (16.18%) and Tacotron2 (16.83%), while its wav2vec 2.0 WER (19.10%) is higher than VITS (18.47%) and lower than Tacotron2 (19.91%).

5.3.2. Correlation Analysis and the Impact of Speaker Diversity

The absence of a meaningful linear relationship between perceptual quality and recognition accuracy is further evidenced by the Pearson correlation coefficients (r) illustrated in Figure 3. Notably, for the WavLM model, the correlation coefficient decreased significantly from $r = 0.49$ to $r = 0.11$ upon applying JODAL. This decoupling effect indicates that JODAL effectively mitigates the model’s reliance on the synthetic artifacts of the TTS engine, allowing it to focus on domain-invariant acoustic features rather than surface-level naturalness.
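
For readers who want to reproduce this kind of analysis, the Pearson coefficient between per-system quality scores and WERs can be computed as in the sketch below. The function name and the toy inputs are hypothetical and deliberately unrelated to Table 7. Which data points were pooled to obtain the values in Figure 3 is not specified in this section, so this shows only the statistic itself.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy inputs for illustration only; these are NOT values from Table 7.
toy_quality = [1.0, 2.0, 3.0, 4.0]
toy_wer_coupled = [40.0, 30.0, 20.0, 10.0]    # WER falls linearly with quality
toy_wer_decoupled = [20.0, 10.0, 10.0, 20.0]  # no linear relation to quality

print(f"coupled:   r = {pearson_r(toy_quality, toy_wer_coupled):+.2f}")
print(f"decoupled: r = {pearson_r(toy_quality, toy_wer_decoupled):+.2f}")
```

A coefficient near zero, as in the second toy case, means that ranking systems by the quality score says little about how they rank by WER. With only a handful of TTS systems per ASR model, such coefficients are best read as descriptive rather than as statistically robust estimates.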

This discrepancy is also attributable to the speaker diversity inherent in different TTS architectures. Zero-shot multi-speaker models like XTTS and YourTTS expose ASR systems to a broader range of acoustic patterns, aiding better generalization despite having lower UTMOS scores. Conversely, single-speaker models like VITS produce highly natural speech from a single voice, leading to high UTMOS scores but limited variability for ASR training. Consequently, JODAL proves essential as it aligns the feature distributions of synthetic and real domains, ensuring that models can leverage linguistic and phonetic diversity without overfitting to domain-specific biases, even when the intrinsic naturalness is sub-optimal.

5.4. Comparative Analysis of Findings

In this subsection, we compare our findings across the different experimental settings, focusing on the impact of the TTS system, data scale, model architecture, and domain adaptation.

5.4.1. Impact of XTTS Architecture and Diversity

Across all evaluated datasets and ASR architectures, XTTS consistently emerged as the most effective TTS system for generating synthetic data to improve ASR performance. Unlike single-speaker models such as VITS and Tacotron2, which prioritize high perceptual naturalness but offer limited acoustic variability, XTTS’s multi-speaker and zero-shot synthesis capabilities effectively bridge the domain gap by providing a broader range of speaker representations. Our results suggest that this acoustic diversity is more critical for ASR generalization than intrinsic perceptual quality, as evidenced by the decoupling between UTMOS scores and WER. Nevertheless, the issue of inconsistency inherent in synthetic data remained observable.

5.4.2. Scaling Laws and Dataset Complexity

As illustrated by the scaling curves in Figure 4, the benefit of increasing synthetic data volume is highly dependent on the target domain’s complexity. On the relatively clean LibriSpeech dataset, scaling the synthetic portion generally led to consistent improvements, following a predictable scaling law until diminishing returns were observed beyond the R10 + S40 configuration.

In contrast, on the noisy and spontaneous GigaSpeech dataset, vanilla augmentation often led to performance stagnation or even degradation when the synthetic ratio was high. This performance drop (noted by the blue lines in Figure 4) highlights the severe acoustic mismatch between pristine TTS-generated speech and real-world audio. However, the integration of JODAL (red lines) successfully rectified this trend, enabling continuous performance gains as data volume increased.

5.4.3. Architecture-Wise Sensitivity and Breakthrough Points

We observed distinct responses to augmentation between supervised and self-supervised architectures. Whisper-tiny maintained superior absolute WER due to its extensive pre-training, but its relative gains from augmentation were modest compared to self-supervised models. Conversely, wav2vec 2.0, HuBERT, and WavLM exhibited higher sensitivity to data scarcity but derived more substantial relative benefits from the JODAL approach.

A significant breakthrough was observed for WavLM, where the JODAL-integrated R10 + S90 configuration surpassed the R100 real-only baseline on both datasets (indicated by the arrows in Figure 4). Specifically, on GigaSpeech, WavLM with JODAL reached a WER of 16.76%, outperforming the 17.67% achieved using 100% real data. This suggests that JODAL can turn high-volume synthetic data into a reasonable, and in this case superior, substitute for expensive real-world resources.

5.4.4. Effectiveness of Domain Adversarial Training

The proposed JODAL approach consistently improved ASR performance across all experimental configurations. While LibriSpeech yielded modest yet stable gains, the improvement on GigaSpeech was more pronounced, particularly at higher synthetic scales where relative WER reductions often exceeded 5%. Across all evaluated TTS models, JODAL consistently outperformed non-adversarial joint training, confirming its robustness in facilitating domain-invariant feature learning and maximizing the utility of TTS-generated speech for ASR training.

5.5. Sensitivity Analysis of Domain Adaptation Hyperparameter $λ$

To further investigate the impact of the domain adaptation hyperparameter $\lambda$ on JODAL’s performance, we conducted a sensitivity analysis using the wav2vec 2.0 model. As defined in Equation (1), $\lambda$ controls the trade-off between the standard ASR loss and the adversarial domain classification loss. We evaluated four candidate values of $\lambda \in \{0.01, 0.1, 0.5, 1.0\}$ across varying synthetic-to-real data ratios, from R10 + S10 to R10 + S90, on the LibriSpeech test-clean dataset.

Figure 5 summarizes the WER results for these configurations under an LM-integrated decoding setting. Among the tested candidate values, $\lambda = 0.1$ showed the most consistent and lowest WER trends across the evaluated TTS systems and synthetic data ratios. This result suggests that a moderate adversarial loss weight provides a practical balance between task-specific recognition learning and feature-level domain alignment in our experimental setting.

In contrast, $\lambda = 0.01$ appeared to provide a relatively weak adversarial signal, whereas larger values such as $\lambda = 0.5$ and $\lambda = 1.0$ tended to degrade ASR performance. This trend suggests that excessively strong domain adversarial pressure may suppress task-relevant acoustic information and negatively affect ASR optimization. Based on these observations, we empirically selected $\lambda = 0.1$ as the default value for the main experiments.
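
To make the role of $\lambda$ concrete, the sketch below shows how a gradient reversal layer, the standard mechanism in domain adversarial training, combines the two gradient signals that reach a shared encoder. It assumes the generic formulation in which the domain classifier's gradient is multiplied by $-\lambda$ before reaching the encoder. Whether this matches Equation (1) term for term should be checked against that equation. The function name and the toy gradient values are hypothetical.

```python
def encoder_gradient(asr_grad: list[float],
                     domain_grad: list[float],
                     lam: float) -> list[float]:
    """Gradient reaching the shared encoder under gradient reversal.

    asr_grad:    d(ASR loss) / d(encoder features)
    domain_grad: d(domain classification loss) / d(encoder features)
    lam:         adversarial weight (lambda); generic DANN form is assumed.

    The domain gradient is sign-flipped and scaled by lam, so a descent step
    lowers the ASR loss while raising the domain classifier's loss.
    """
    if len(asr_grad) != len(domain_grad):
        raise ValueError("gradients must have the same length")
    return [g_asr - lam * g_dom for g_asr, g_dom in zip(asr_grad, domain_grad)]


# Toy gradients (hypothetical values) showing how lambda scales the adversarial term.
asr_grad = [1.0, -2.0]
domain_grad = [0.5, 4.0]
for lam in (0.01, 0.1, 0.5, 1.0):
    print(lam, encoder_gradient(asr_grad, domain_grad, lam))
```

In this toy case, a small $\lambda$ leaves the update almost identical to the pure ASR gradient, while $\lambda = 1.0$ lets the reversed domain term dominate one component. That is one plausible reading of why very small values gave a weak adversarial signal and large values hurt recognition.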

6. Discussion

This section interprets the experimental findings in the context of the research questions (RQs) and discusses the broader implications for low-resource ASR development.

6.1. Evaluation of RQ1: Effectiveness of TTS-Based Augmentation

Our results consistently demonstrate that data augmentation using off-the-shelf TTS systems is a highly effective strategy for low-resource ASR. Across both LibriSpeech and GigaSpeech datasets, augmenting the 10% real data (R10) baseline led to significant WER reductions. For instance, adding just 10% XTTS-generated data (R10 + S10) reduced the Whisper-tiny WER on LibriSpeech test-clean from 25.49% to 17.93%. Similar improvements were observed in self-supervised models, where WavLM achieved a relative improvement of 60.8% on the same task. These findings confirm that synthetic speech can provide critical linguistic and phonetic variations that compensate for the scarcity of real-world recordings, despite the inherent acoustic differences between the two domains.

6.2. Evaluation of RQ2: Impact of Synthetic Data Scaling

The scaling analysis revealed that while increasing synthetic data volume generally improves performance, the gains are not linear and exhibit diminishing returns beyond the R10 + S40 configuration. On the cleaner LibriSpeech dataset, models followed a relatively stable scaling law. However, on GigaSpeech, vanilla augmentation (without domain adaptation) often resulted in performance stagnation or degradation as the proportion of synthetic data reached 90% (R10 + S90). For example, Whisper-tiny with YourTTS reached a WER of 34.98%, worse than the R10 baseline (32.57%). This phenomenon underscores the Domain Distribution Inconsistency problem: without proper alignment, an excessive reliance on synthetic data can cause the model to overfit to TTS-specific artifacts rather than learning generalized speech representations.

6.3. Evaluation of RQ3: Comparison of Different TTS Systems

A key insight from our comparative analysis is that acoustic diversity is more vital for ASR training than perceptual naturalness. While VITS achieved the highest UTMOS score (4.356), its utility for ASR was consistently surpassed by XTTS (UTMOS: 3.883). The superiority of XTTS and YourTTS is likely due to their zero-shot multi-speaker capabilities, which expose the ASR model to a wider variety of speaker styles and prosodic patterns compared to high-quality but acoustically limited single-speaker models like VITS and Tacotron2. The decoupling of UTMOS and WER ranks, quantified by the drop in Pearson correlation from $r = 0.49$ to $r = 0.11$ in WavLM, further suggests that JODAL allows models to leverage diverse synthetic data regardless of its perceived quality.

6.4. Evaluation of RQ4: Mitigating Domain Inconsistency Through JODAL

The proposed JODAL method was validated as a critical mechanism for bridging the domain gap. Across all architectures and datasets, JODAL consistently outperformed standard joint training, with its benefits being most pronounced at higher synthetic ratios. The most significant breakthrough occurred with WavLM on GigaSpeech, where the JODAL-enhanced R10 + S90 configuration achieved a WER of 16.76%, effectively surpassing the 100% real data baseline (R100: 17.67%). This remarkable result demonstrates that JODAL enables the ASR encoder to learn domain-invariant features, transforming synthetic data from a mere supplement into a superior substitute for expensive real-world data collection in low-resource settings.

6.5. Limitations and Future Works

Despite the effectiveness of JODAL, several avenues for future research remain:

Persistent Domain Discrepancy: While the proposed JODAL approach is effective in mitigating the domain inconsistency between real and TTS-generated speech, it does not completely eliminate this discrepancy. The experimental results show that, in most of the evaluated settings, even with JODAL and a high proportion of synthetic data (e.g., R10 + S90), the ASR performance did not reach the level obtained with fully real data (e.g., R100), with WavLM being a notable exception. This suggests that the synthetic data, even after domain adaptation, is not a perfect substitute for real speech in fully capturing its complexities and variability. Further improvements may require more advanced TTS models capable of capturing subtle speaker characteristics, coarticulation effects, and contextual prosody variations more faithfully. Moreover, extending domain adaptation beyond feature-level alignment to include model-level or task-level strategies may further reduce the remaining performance gap.
Dependence on TTS Quality: The effectiveness of the TTS-based data augmentation strategy is inherently dependent on the quality and characteristics of the synthetic speech produced by the specific TTS systems used. Our experimental results demonstrate that different TTS systems yield varying levels of improvement, with some (like XTTS) consistently outperforming others (YourTTS, VITS, Tacotron2). This highlights that the quality of the synthetic data serves as a limiting factor for the overall ASR performance gain, even with domain adaptation. As a future direction, it would be valuable to explore joint optimization frameworks in which TTS and ASR models are trained collaboratively, allowing the TTS system to generate speech more closely tailored to the requirements of ASR training. Furthermore, developing TTS models explicitly optimized for recognition performance could enhance the informativeness and effectiveness of synthetic data for speech recognition tasks.
Inherent Differences Between Synthetic and Real Speech: Although JODAL reduces feature-level domain discrepancy between real and TTS-generated speech, it does not completely eliminate fine-grained differences between the two domains. In particular, real-world background noise distributions, recording channel effects, speaker-specific idiosyncrasies, coarticulation patterns, and subtle prosodic variations may remain insufficiently captured by synthetic speech. Therefore, JODAL should be interpreted as a feature-level mitigation strategy rather than a complete solution to all real-synthetic mismatch factors. These inherent distinctions may limit the extent to which synthetic data can replicate the rich acoustic and linguistic variations present in real-world speech, particularly in complex datasets such as GigaSpeech. Future work should investigate ASR-aware TTS generation, noise-aware synthesis, and multi-level adaptation strategies that explicitly address these fine-grained acoustic and speaker-dependent discrepancies.
Evaluation Scope: This study evaluates JODAL on two English ASR benchmark datasets, LibriSpeech and GigaSpeech. Therefore, the results should not be interpreted as evidence of general robustness across multilingual, accented, or broader multi-domain speech conditions. Future work should extend the evaluation to accented speech, multilingual ASR, domain-shifted recording environments, and other low-resource speech scenarios to more comprehensively assess the generalizability of the proposed framework.

7. Conclusions

This study addressed the critical challenge of data scarcity in low-resource ASR by strategically leveraging synthetic speech generated by TTS systems. To overcome the inherent domain discrepancy between real and synthetic data, we proposed JODAL, a framework designed to learn domain-invariant acoustic representations. Our extensive experiments on the LibriSpeech and GigaSpeech datasets revealed that while vanilla TTS-based augmentation improves performance, it often suffers from stagnation or degradation as the proportion of synthetic data increases, particularly in acoustically complex environments. JODAL mitigated the performance degradation caused by real-synthetic mismatch and reduced WER across the evaluated ASR architectures, including Whisper-tiny and self-supervised models such as wav2vec 2.0, HuBERT, and WavLM. In the evaluated WavLM configuration, the model trained on 10% real data augmented with 90% synthetic data achieved WERs of 16.76% on GigaSpeech and 4.15% on LibriSpeech, which were lower than those of the corresponding 100% real-data baselines in these experiments. These results suggest that TTS-generated speech, when combined with feature-level domain adversarial alignment, can be an effective supplementary resource under the examined English low-resource ASR settings. Furthermore, our analysis of UTMOS scores suggests that acoustic and speaker diversity in TTS models are more critical for ASR utility than perceptual naturalness alone.

While some domain differences remain, this work establishes JODAL as a robust and scalable solution for maximizing the utility of synthetic data. Future research will explore multi-level adversarial alignment and the development of ASR-aware TTS systems to further narrow the gap between synthetic and natural speech domains.

Author Contributions

Conceptualization, J.-W.K.; methodology, J.-W.K.; software, J.-W.K.; validation, J.-W.K. and H.-Y.J.; formal analysis, J.-W.K.; investigation, J.-W.K.; resources, J.-W.K. and H.-Y.J.; data curation, J.-W.K.; writing—original draft preparation, J.-W.K.; writing—review and editing, J.-W.K. and H.-Y.J.; visualization, J.-W.K.; supervision, J.-W.K. and H.-Y.J.; project administration, J.-W.K. and H.-Y.J.; funding acquisition, H.-Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2019-II190004, Development of semi-supervised learning language intelligence technology and Korean tutoring service for foreigners), and by the Regional Innovation System & Education (RISE) program through the Jeonbuk RISE Center, funded by the Ministry of Education (MOE) and the Jeonbuk State, Republic of Korea (2026-RISE-13-WKU).

Data Availability Statement

The data presented in this study are openly available in reference [28] at https://doi.org/10.21437/Interspeech.2021-1965 and reference [29] at https://doi.org/10.1109/ICASSP.2015.7178964.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Jurafsky, D. Speech & Language Processing; Pearson Education India: Delhi, India, 2000.
Kamath, U.; Liu, J.; Whitaker, J. Deep Learning for NLP and Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2019; Volume 84.
Kipyatkova, I.S.; Kagirov, I.A. Analytical review of methods for solving data scarcity issues regarding elaboration of automatic speech recognition systems for low-resource languages. Inform. Autom. 2022, 21, 678–709.
Kheddar, H.; Hemis, M.; Himeur, Y. Automatic speech recognition using advanced deep learning approaches: A survey. Inf. Fusion 2024, 109, 102422.
Yeroyan, A.; Karpov, N. Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach. arXiv 2024, arXiv:2406.01446.
Abraham, B.; Goel, D.; Siddarth, D.; Bali, K.; Chopra, M.; Choudhury, M.; Joshi, P.; Jyoti, P.; Sitaram, S.; Seshadri, V. Crowdsourcing speech data for low-resource languages from low-income workers. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2819–2826.
Mainzinger, J.; Levow, G.A. Fine-Tuning ASR models for Very Low-Resource Languages: A Study on Mvskoke. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Bangkok, Thailand, 11–16 August 2024; pp. 170–176.
Le, V.B.; Besacier, L. Automatic speech recognition for under-resourced languages: Application to Vietnamese language. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1471–1482.
Burget, L.; Schwarz, P.; Agarwal, M.; Akyazi, P.; Feng, K.; Ghoshal, A.; Glembek, O.; Goel, N.; Karafiát, M.; Povey, D.; et al. Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE: Piscataway, NJ, USA, 2010; pp. 4334–4337.
Thomas, S. Data-Driven Neural Network Based Feature Front-Ends for Automatic Speech Recognition; The Johns Hopkins University: Baltimore, MD, USA, 2012.
Imseng, D.; Bourlard, H.; Dines, J.; Garner, P.N.; Doss, M.M. Applying multi-and cross-lingual stochastic phone space transformations to non-native speech recognition. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1713–1726.
Bartelds, M.; San, N.; McDonnell, B.; Jurafsky, D.; Wieling, M. Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 715–729.
Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019, 2019, 2613.
Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
Yang, S.w.; Chi, P.H.; Chuang, Y.S.; Lai, C.I.J.; Lakhotia, K.; Lin, Y.Y.; Liu, A.T.; Shi, J.; Chang, X.; Lin, G.T.; et al. SUPERB: Speech Processing Universal PERformance Benchmark. Interspeech 2021, 2021.
Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning; PMLR: New Orleans, LA, USA, 2023; pp. 28492–28518.
Yang, G.; Yu, F.; Ma, Z.; Du, Z.; Gao, Z.; Zhang, S.; Chen, X. Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap. arXiv 2024, arXiv:2410.16726.
Tran, M.; Pang, Y.; Paul, D.; Pandey, L.; Jiang, K.; Guo, J.; Li, K.; Zhang, S.; Zhang, X.; Lei, X. A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data. In Proceedings of the ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021.
Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
Wang, C.; Chen, S.; Wu, Y.; Zhang, Z.; Zhou, L.; Liu, S.; Chen, Z.; Liu, Y.; Wang, H.; Li, J.; et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv 2023, arXiv:2301.02111.
Casanova, E.; Davis, K.; Gölge, E.; Göknar, G.; Gulea, I.; Hart, L.; Aljafari, A.; Meyer, J.; Morais, R.; Olayemi, S.; et al. XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model. arXiv 2024, arXiv:2406.04904.
Casanova, E.; Weber, J.; Shulby, C.D.; Junior, A.C.; Gölge, E.; Ponti, M.A. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022.
Le, M.; Vyas, A.; Shi, B.; Karrer, B.; Sari, L.; Moritz, R.; Williamson, M.; Manohar, V.; Adi, Y.; Mahadeokar, J.; et al. Voicebox: Text-guided multilingual universal speech generation at scale. Adv. Neural Inf. Process. Syst. 2024, 36, 14005–14034. [Google Scholar]
Chen, G.; Chai, S.; Wang, G.B.; Du, J.; Zhang, W.Q.; Weng, C.; Su, D.; Povey, D.; Trmal, J.; Zhang, J.; et al. GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio. Interspeech 2021, 2021, 3670–3674. [Google Scholar] [CrossRef]
Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
Mimura, M.; Ueno, S.; Inaguma, H.; Sakai, S.; Kawahara, T. Leveraging sequence-to-sequence speech synthesis for enhancing acoustic-to-word speech recognition. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT); IEEE: Piscataway, NJ, USA, 2018; pp. 477–484. [Google Scholar]
Ueno, S.; Mimura, M.; Sakai, S.; Kawahara, T. Multi-speaker sequence-to-sequence speech synthesis for data augmentation in acoustic-to-word speech recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2019; pp. 6161–6165. [Google Scholar]
Rossenbach, N.; Zeyer, A.; Schlüter, R.; Ney, H. Generating synthetic audio data for attention-based speech recognition systems. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2020; pp. 7069–7073. [Google Scholar]
Du, C.; Yu, K. Speaker augmentation for low resource speech recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2020; pp. 7719–7723. [Google Scholar]
Zheng, X.; Liu, Y.; Gunceler, D.; Willett, D. Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end asr systems. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2021; pp. 5674–5678. [Google Scholar]
Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar] [CrossRef]
Quinonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; Lawrence, N.D. Dataset Shift in Machine Learning; Mit Press: Cambridge, MA, USA, 2008. [Google Scholar]
Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the CVPR 2011; IEEE: New York, NY, USA, 2011; pp. 1521–1528. [Google Scholar]
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176. [Google Scholar]
Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Proceedings of the Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Proceedings, Part III 27; Springer: Berlin/Heidelberg, Germany, 2018; pp. 270–279. [Google Scholar]
Guan, H.; Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 2021, 69, 1173–1185. [Google Scholar]
Ben-David, S.; Blitzer, J.; Crammer, K.; Pereira, F. Analysis of representations for domain adaptation. Adv. Neural Inf. Process. Syst. 2006, 19, 137–144. [Google Scholar]
Sun, S.; Yeh, C.F.; Hwang, M.Y.; Ostendorf, M.; Xie, L. Domain adversarial training for accented speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2018; pp. 4854–4858. [Google Scholar]
Woszczyk, D.; Petridis, S.; Millard, D. Domain Adversarial Neural Networks for Dysarthric Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3875–3879. [Google Scholar]
Tanaka, T.; Masumura, R.; Sato, H.; Ihori, M.; Matsuura, K.; Ashihara, T.; Moriya, T. Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks. In Proceedings of the INTERSPEECH, Incheon, Republic of Korea, 18–22 September 2022; pp. 1066–1070. [Google Scholar]
Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
Kim, J.W.; Yoon, C.; Toikkanen, M.; Bae, S.; Jung, H.Y. Adversarial fine-tuning using generated respiratory sound to address class imbalance. arXiv 2023, arXiv:2311.06480. [Google Scholar] [CrossRef]
Heafield, K. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 30–31 July 2011; pp. 187–197. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: New Orleans, LA, USA, 2013; pp. 1310–1318. [Google Scholar]
Lee, S.H.; Choi, H.Y.; Kim, S.B.; Lee, S.W. Hierspeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis. arXiv 2023, arXiv:2311.12454. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Schematic illustration of the conceptual feature distribution discrepancy between synthetic (blue) and real (red) speech representations before and after joint domain adversarial training. The left panel illustrates a simplified domain discrepancy, where synthetic and real speech features are shown as separated distributions. The right panel illustrates the intended effect of our proposed method, which encourages domain-invariant representations by reducing this discrepancy.

Figure 2. Overview of the proposed ASR training framework combining real and synthetic speech data with domain adversarial training. A mixed ASR dataset composed of real ($D_{real}$) and TTS-generated synthetic ($D_{syn}$) data is passed through an ASR encoder. The encoder output is fed simultaneously to the ASR decoder and to a data discriminator, which predicts whether the input features come from real or synthetic data. Discriminator gradients are reversed during backpropagation to encourage domain-invariant representation learning.
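The gradient reversal described in the Figure 2 caption can be illustrated with a toy scalar model. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `jodal_style_gradients`, the scalar encoder weight `w`, the discriminator weight `v`, and the squared-error stand-in for the ASR loss are all hypothetical, and applying $\lambda$ as the scale of the reversed gradient is one common convention that the paper may or may not follow.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def jodal_style_gradients(w: float, v: float, x: float, y: float,
                          d: float, lam: float) -> tuple:
    """Return (encoder_grad, discriminator_grad) for one toy sample.

    Hypothetical scalar setup (an assumption for illustration):
      encoder feature      h = w * x
      task loss            (h - y) ** 2        # stand-in for the ASR loss
      discriminator output p = sigmoid(v * h)  # P(sample is synthetic)
      domain loss          binary cross-entropy against label d (1 = synthetic)
    """
    h = w * x
    p = sigmoid(v * h)

    # Gradient of the task loss with respect to the encoder weight.
    g_task_w = 2.0 * (h - y) * x

    # Gradient of the domain loss with respect to the feature h,
    # then with respect to the encoder weight via the chain rule.
    g_dom_h = (p - d) * v
    g_dom_w = g_dom_h * x

    # The discriminator is trained normally on the domain loss.
    g_v = (p - d) * h

    # Gradient reversal: the encoder receives the domain gradient with its
    # sign flipped and scaled by lam, so it is pushed to confuse the
    # discriminator while still minimizing the task loss.
    g_w = g_task_w - lam * g_dom_w
    return g_w, g_v


if __name__ == "__main__":
    g_w, g_v = jodal_style_gradients(w=1.0, v=1.0, x=1.0, y=1.0, d=1.0, lam=0.1)
    print(f"encoder grad: {g_w:.4f}, discriminator grad: {g_v:.4f}")
```

With `lam = 0` the encoder gradient reduces to the task gradient alone, which corresponds to plain training on the mixed real and synthetic data.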

Figure 3. Correlation analysis between perceptual speech quality (UTMOS) and ASR performance (WER) across four different architectures. The plots evaluate Whisper-tiny, wav2vec 2.0, HuBERT, and WavLM under the R10 + S30 configuration on the GigaSpeech dataset. The red and blue lines represent linear regression for configurations with and without JODAL, respectively. Markers indicate different TTS engines (XTTS, YourTTS, VITS, and Tacotron2), while the green ‘X’ denotes the real-data baseline (R40). The Pearson correlation coefficients (r) highlight the decoupling effect between naturalness and recognition utility facilitated by JODAL.

Figure 4. Comparative scaling analysis of ASR performance across LibriSpeech and GigaSpeech datasets. The plots illustrate the Word Error Rate (WER, %) trends for Whisper-tiny and WavLM as the total training data volume scales from 10% (R10 baseline) to 100% (R100 or R10 + S90 configuration). The grey dashed lines represent models trained solely on varying amounts of real data, while the blue and red lines depict the impact of augmenting a 10% real data baseline with XTTS-generated synthetic data, without and with the JODAL approach, respectively. Notably, on the GigaSpeech dataset, vanilla augmentation (blue lines) exhibits performance stagnation or degradation as the synthetic volume increases, whereas JODAL (red lines) consistently mitigates domain mismatch. In particular, for the WavLM architecture, the JODAL-integrated R10 + S90 configuration effectively surpasses the 100% real-data baseline (R100) on both datasets, as indicated by the arrows.

Figure 5. Evaluation of the impact of the adversarial loss weight $\lambda$ on the WER of LibriSpeech test-clean under the LM-integrated decoding setting. Four subplots illustrate the performance trends for different TTS-based augmentation sources. Among the tested candidate values, $\lambda = 0.1$ generally achieved the lowest WER across the evaluated synthetic data scales. These results suggest that a moderate adversarial training weight can provide a practical balance between ASR learning and feature-level domain alignment, whereas excessively high $\lambda$ values may degrade the primary ASR task performance by suppressing task-relevant acoustic features.

Table 1. Comparison of TTS models in terms of generation input type and speaker support.

Model	Input Type for Generation	Speaker Support
XTTS	Audio + Text	Multi-Speaker
YourTTS	Audio + Text	Multi-Speaker
VITS	Text-only	Single-Speaker
Tacotron2	Text-only	Single-Speaker

Table 2. Composition of training datasets with varying proportions of real and synthetic speech. RXX indicates real data percentage while SXX indicates synthetic data percentage.

Configuration ID	Real (%)	Synthetic (%)	Total Ratio (%)
R10	10	0	10
R10 + S10	10	10	20
R10 + S20	10	20	30
R10 + S30	10	30	40
R10 + S40	10	40	50
R10 + S90	10	90	100
R20	20	0	20
R30	30	0	30
R40	40	0	40
R50	50	0	50
R100	100	0	100
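The configuration IDs in Table 2 follow a regular pattern, so they can be decoded programmatically. The following is a minimal sketch; the helper name `parse_config` is hypothetical and not part of the paper, and it assumes IDs are always written as `R<percent>` optionally followed by `+ S<percent>`.

```python
import re
from typing import Tuple


def parse_config(config_id: str) -> Tuple[int, int, int]:
    """Decode an ID such as 'R10 + S30' into (real %, synthetic %, total %)."""
    real = re.search(r"R(\d+)", config_id)
    if real is None:
        raise ValueError(f"no real-data ratio in {config_id!r}")
    syn = re.search(r"S(\d+)", config_id)
    real_pct = int(real.group(1))
    syn_pct = int(syn.group(1)) if syn else 0  # real-only IDs have no S part
    return real_pct, syn_pct, real_pct + syn_pct


if __name__ == "__main__":
    for cid in ["R10", "R10 + S30", "R10 + S90", "R100"]:
        print(cid, parse_config(cid))
```

Matching total ratios is what makes pairs such as R40 and R10 + S30 comparable in the result tables.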

Table 3. WER (%) results on LibriSpeech test-clean and test-other sets using Whisper-tiny trained with varying amounts of real and TTS-augmented synthetic data, with and without the JODAL approach. “Zero-Shot” denotes the performance of the original Whisper-tiny model evaluated directly on the test sets without any additional task-specific fine-tuning.

		Data Amounts (Configuration IDs)
Training Data	TTS Model	R10	R20	R30	R40	R50	R100
Original (Real Only)		25.49/40.46	18.45/33.89	16.16/32.27	14.82/30.74	14.49/29.49	11.70/27.90
		R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
Original + Synthetic	XTTS	–	17.93/33.18	16.35/31.90	15.54/31.32	15.06/31.15	13.38/30.15
	YourTTS	–	20.44/36.72	18.13/33.69	17.58/34.30	17.98/33.68	15.56/33.85
	VITS	–	18.87/34.62	17.81/34.90	17.93/35.36	17.53/34.19	17.41/35.44
	Tacotron2	–	19.43/37.45	20.79/38.26	17.83/34.65	18.24/35.44	17.96/37.35
+ JODAL	XTTS	–	16.74/32.50	15.02/30.78	14.16/30.51	14.87/31.19	12.83/28.93
	YourTTS	–	19.42/35.26	17.15/32.89	16.20/33.61	16.61/32.85	13.19/30.53
	VITS	–	18.58/34.21	17.15/33.80	16.18/33.81	15.62/33.80	14.76/34.13
	Tacotron2	–	18.99/40.05	18.35/34.88	16.83/32.92	15.98/34.10	16.47/33.74
Zero-Shot		7.54/22.84

Table 4. WER (%) results on LibriSpeech test-clean and test-other sets using self-supervised models (wav2vec 2.0, HuBERT, and WavLM) trained with varying amounts of real and TTS-augmented synthetic data, with and without the JODAL approach. All models in this table are evaluated using an external 4-gram language model.

			Data Amounts (Configuration IDs)
ASR Model	Training Data	TTS Model	R10	R20	R30	R40	R50	R100
wav2vec 2.0	Original (Real Only)		21.40/31.07	7.63/15.47	6.15/13.43	5.59/12.46	5.14/11.89	4.41/10.60
			R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
	Original + Synthetic	XTTS	–	8.26/16.81	6.43/14.41	6.16/13.62	5.79/13.37	5.02/12.13
		YourTTS	–	8.20/16.15	6.99/15.30	6.75/14.92	6.53/14.65	6.17/14.57
		VITS	–	8.46/17.53	6.81/15.20	6.62/14.87	6.11/14.56	5.67/13.75
		Tacotron2	–	9.10/18.23	7.91/16.67	7.24/15.96	6.68/15.16	6.09/14.19
	+ JODAL	XTTS	–	7.76/15.97	6.36/14.08	5.96/13.45	5.56/13.08	4.96/11.98
		YourTTS	–	7.87/16.02	6.74/15.19	6.72/15.01	6.38/14.67	6.04/14.02
		VITS	–	8.12/17.02	6.62/15.10	6.11/14.32	6.00/14.15	5.47/13.82
		Tacotron2	–	8.74/17.68	7.59/16.17	6.49/14.75	6.28/14.45	6.08/14.69
HuBERT	Original (Real Only)		25.91/40.69	12.54/23.82	8.14/16.86	6.51/14.07	5.93/12.88	5.44/11.79
			R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
	Original + Synthetic	XTTS	–	12.73/24.21	8.48/18.18	7.01/15.65	6.58/15.18	6.07/13.99
		YourTTS	–	14.16/27.22	9.15/19.79	8.21/17.60	7.92/17.52	7.40/16.72
		VITS	–	14.12/27.06	9.81/20.79	8.04/17.05	7.37/16.56	6.83/15.88
		Tacotron2	–	15.56/29.43	10.45/21.84	9.14/19.21	9.03/19.19	7.89/17.11
	+ JODAL	XTTS	–	12.36/23.60	8.12/21.38	6.93/15.45	6.44/14.56	5.79/13.06
		YourTTS	–	13.16/25.68	8.87/18.99	8.15/17.72	7.43/16.50	6.43/14.80
		VITS	–	13.67/25.58	9.61/19.62	7.82/17.42	7.04/16.07	6.79/15.30
		Tacotron2	–	13.73/26.46	10.42/21.06	8.70/18.76	7.91/17.26	6.90/15.43
WavLM	Original (Real Only)		24.48/37.78	9.34/17.28	6.25/12.22	5.62/11.46	5.25/10.49	4.89/10.11
			R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
	Original + Synthetic	XTTS	–	9.58/17.85	6.33/12.67	5.70/11.96	5.18/11.08	4.38/9.99
		YourTTS	–	10.34/20.92	8.19/16.04	6.93/12.40	5.98/13.08	5.43/12.02
		VITS	–	10.99/21.53	6.56/13.93	6.40/13.63	5.67/12.71	5.51/12.37
		Tacotron2	–	11.24/21.25	8.68/17.59	7.12/14.17	6.13/13.07	6.60/14.29
	+ JODAL	XTTS	–	8.00/16.20	6.03/12.11	5.50/11.08	4.77/11.64	4.15/9.82
		YourTTS	–	8.99/17.07	6.95/14.37	6.15/12.05	5.52/12.10	4.77/10.54
		VITS	–	9.97/18.57	6.52/13.66	5.95/11.71	5.44/13.34	5.23/11.77
		Tacotron2	–	9.69/19.34	7.17/15.07	6.35/14.08	5.79/13.48	6.36/13.99

Table 5. WER (%) results on GigaSpeech dev/test sets using Whisper-tiny trained with varying amounts of real and TTS-augmented synthetic data, with and without the JODAL approach. “Zero-Shot” denotes the performance of the original Whisper-tiny model evaluated directly on the test sets without any additional task-specific fine-tuning.

		Data Amounts (Configuration IDs)
Training Data	TTS Model	R10	R20	R30	R40	R50	R100
Original (Real Only)		32.57/31.69	30.25/30.60	27.83/27.33	28.64/27.67	28.23/27.41	26.06/25.81
		R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
Original + Synthetic	XTTS	–	31.47/30.94	30.68/29.81	31.35/29.90	31.70/30.09	31.37/30.54
	YourTTS	–	32.12/31.20	32.29/31.32	34.56/32.56	33.78/32.96	34.98/33.66
	VITS	–	32.41/31.49	32.36/30.90	32.62/30.93	33.05/31.96	33.97/32.58
	Tacotron2	–	33.76/33.07	33.25/32.16	34.13/32.19	33.94/32.44	35.45/34.21
+ JODAL	XTTS	–	30.15/29.96	29.94/29.72	29.74/29.33	29.11/29.07	28.91/28.82
	YourTTS	–	31.00/30.54	30.59/29.83	30.31/29.60	30.04/29.46	29.85/29.34
	VITS	–	31.72/30.48	31.10/30.30	30.74/30.15	30.50/29.96	29.79/29.55
	Tacotron2	–	32.41/31.63	32.09/31.39	31.83/30.91	31.71/30.88	30.90/30.67
Zero-Shot		17.82/28.39
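The gap between the "Original + Synthetic" and "+ JODAL" rows can be summarized as a relative WER reduction. The sketch below applies the standard formula to the XTTS R10 + S90 cells of Table 5; the function name `relative_wer_reduction` is hypothetical, and the paper may report improvements in a different form.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - improved) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Table 5, Whisper-tiny on GigaSpeech, XTTS, R10 + S90 (dev / test).
    without_jodal = (31.37, 30.54)
    with_jodal = (28.91, 28.82)
    for split, base, new in zip(("dev", "test"), without_jodal, with_jodal):
        print(f"{split}: {relative_wer_reduction(base, new):.2f}% relative reduction")
```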

Table 6. WER (%) results on GigaSpeech dev and test sets using self-supervised models (wav2vec 2.0, HuBERT, and WavLM) trained with varying amounts of real and TTS-augmented synthetic data, with and without the JODAL approach. All self-supervised models were evaluated using the external 4-gram language model.

			Data Amounts (Configuration IDs)
ASR Model	Training Data	TTS Model	R10	R20	R30	R40	R50	R100
wav2vec 2.0	Original (Real Only)		31.69/30.25	26.24/25.15	25.41/24.34	24.81/23.66	24.38/23.41	23.61/22.60
			R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
	Original + Synthetic	XTTS	–	26.60/25.22	25.17/23.89	24.70/23.35	24.17/22.93	23.15/22.00
		YourTTS	–	26.83/25.55	26.16/24.95	25.73/24.45	25.32/24.01	25.03/23.55
		VITS	–	27.24/25.76	26.45/24.76	25.79/24.22	25.64/24.04	24.95/23.45
		Tacotron2	–	27.62/26.23	27.48/25.83	26.41/24.77	26.32/24.84	26.02/23.97
	+ JODAL	XTTS	–	26.46/25.11	25.11/23.95	24.60/22.78	24.01/22.78	22.94/21.99
		YourTTS	–	26.94/25.65	25.85/24.52	25.31/24.04	25.16/23.76	24.79/23.41
		VITS	–	26.86/25.37	26.41/24.86	25.41/23.82	25.57/23.99	24.80/23.16
		Tacotron2	–	26.54/25.25	26.30/24.80	26.28/24.76	26.18/24.75	25.33/23.83
HuBERT	Original (Real Only)		29.56/28.38	28.71/27.53	26.15/25.26	25.83/25.03	25.08/24.42	23.32/22.61
			R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
	Original + Synthetic	XTTS	–	29.46/28.06	26.60/25.33	25.30/24.22	24.47/23.53	23.06/22.12
		YourTTS	–	28.41/27.19	27.69/26.26	25.97/25.00	25.63/24.69	25.18/24.05
		VITS	–	29.28/28.02	26.73/25.46	26.23/24.97	25.58/24.40	24.62/23.43
		Tacotron2	–	29.53/28.33	28.42/27.03	27.39/26.03	26.89/25.49	25.54/24.23
	+ JODAL	XTTS	–	27.09/25.94	25.35/24.36	24.69/23.67	24.07/23.18	22.99/22.11
		YourTTS	–	27.97/26.73	26.71/25.61	25.94/24.92	25.45/24.57	25.14/24.01
		VITS	–	28.26/27.05	26.67/25.43	25.79/24.58	25.35/24.22	24.54/23.30
		Tacotron2	–	29.25/28.08	27.61/26.58	27.09/25.99	26.39/24.92	25.45/24.07
WavLM	Original (Real Only)		29.28/28.72	21.57/20.13	19.73/19.26	19.36/19.02	19.27/18.89	18.04/17.67
			R10	R10 + S10	R10 + S20	R10 + S30	R10 + S40	R10 + S90
	Original + Synthetic	XTTS	–	21.99/21.01	20.56/19.78	19.65/19.09	18.88/18.22	17.69/17.10
		YourTTS	–	23.72/22.76	21.09/20.34	20.15/19.52	19.37/18.66	18.57/17.85
		VITS	–	22.91/22.24	21.30/20.43	20.57/19.81	19.70/18.99	18.35/17.63
		Tacotron2	–	23.22/22.38	22.60/21.14	21.09/20.25	19.84/18.40	19.68/19.02
	+ JODAL	XTTS	–	21.07/20.49	19.85/19.38	18.90/18.29	18.30/17.28	17.25/16.76
		YourTTS	–	21.22/20.70	20.76/20.17	19.14/18.54	18.49/17.89	18.47/17.64
		VITS	–	21.11/20.47	20.73/20.02	18.96/18.38	18.36/17.69	18.20/17.53
		Tacotron2	–	21.27/20.71	20.66/20.10	20.06/19.33	19.27/18.29	19.06/18.47
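The claim that a JODAL-trained R10 + S90 model can surpass the R100 real-data baseline can be checked directly against the table cells. This is a small sketch, assuming each cell holds two WERs separated by a slash as in Tables 4 and 6; the helper names `parse_cell` and `beats_on_both` are hypothetical.

```python
from typing import Tuple


def parse_cell(cell: str) -> Tuple[float, float]:
    """Split a table cell such as '4.15/9.82' into two WER values."""
    first, second = cell.split("/")
    return float(first), float(second)


def beats_on_both(candidate: str, baseline: str) -> bool:
    """True if the candidate has a lower WER than the baseline on both sets."""
    cand = parse_cell(candidate)
    base = parse_cell(baseline)
    return cand[0] < base[0] and cand[1] < base[1]


if __name__ == "__main__":
    # WavLM, XTTS + JODAL at R10 + S90 versus the R100 real-only baseline.
    comparisons = {
        "LibriSpeech (Table 4)": ("4.15/9.82", "4.89/10.11"),
        "GigaSpeech (Table 6)": ("17.25/16.76", "18.04/17.67"),
    }
    for name, (jodal_cell, r100_cell) in comparisons.items():
        print(name, beats_on_both(jodal_cell, r100_cell))
```

The same check applied to wav2vec 2.0 on LibriSpeech (4.96/11.98 versus 4.41/10.60) returns `False`, which shows that the result is architecture-dependent.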

Table 7. UTMOS scores for synthetic speech samples along with ASR WERs (%) on the GigaSpeech test set using Whisper-tiny, wav2vec 2.0, HuBERT, and WavLM under the R10 + S30 and R40 configurations. Bold denotes the best performance in each column.

Configuration	TTS Model	UTMOS	JODAL	Whisper-tiny	wav2vec 2.0	HuBERT	WavLM
R10 + S30	XTTS	$3.883 \pm 0.319$	No	31.35	24.70	25.30	19.65
	XTTS	$3.883 \pm 0.319$	Yes	29.74	24.60	24.69	18.90
	YourTTS	$3.520 \pm 0.336$	No	34.56	25.73	25.97	20.15
	YourTTS	$3.520 \pm 0.336$	Yes	30.31	25.31	25.94	19.14
	VITS	$4.356 \pm 0.135$	No	32.62	25.79	26.23	20.57
	VITS	$4.356 \pm 0.135$	Yes	30.74	25.41	25.79	18.96
	Tacotron2	$4.083 \pm 0.320$	No	34.13	26.41	27.39	21.09
	Tacotron2	$4.083 \pm 0.320$	Yes	31.83	26.28	27.09	20.06
R40	Real Data Only		–	28.64	24.81	25.83	19.36
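The correlation analysis summarized in Figure 3 can be approximated from the values in Table 7. The sketch below computes a Pearson coefficient between mean UTMOS and WavLM WER over the four TTS engines; it is an illustration under assumptions, since the paper's exact procedure (for example, whether the real-data point or per-sample scores are included) is not stated here, so the resulting values need not match the reported $r$. The helper name `pearson_r` is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Table 7, R10 + S30, order: XTTS, YourTTS, VITS, Tacotron2.
utmos = [3.883, 3.520, 4.356, 4.083]
wavlm_wer_without_jodal = [19.65, 20.15, 20.57, 21.09]
wavlm_wer_with_jodal = [18.90, 19.14, 18.96, 20.06]

r_without = pearson_r(utmos, wavlm_wer_without_jodal)
r_with = pearson_r(utmos, wavlm_wer_with_jodal)

if __name__ == "__main__":
    print(f"r without JODAL: {r_without:.3f}")
    print(f"r with JODAL:    {r_with:.3f}")
```

With only four points per configuration, any such coefficient is a rough indicator rather than a statistically robust estimate.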

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
