1. Introduction
Urdu TTS faces unique challenges, including complex phonetics, the frequent absence of diacritics, and dialectal diversity. Unlike Google Translate’s generic Urdu TTS, our work addresses these challenges by integrating phonetic alignment and regional acoustic models. We contribute the following: (1) a curated 100 h Urdu dataset with dialectal variations, (2) a Tacotron2-based model optimized for Urdu’s phonology, and (3) voice cloning for personalized outputs.
Section 2 reviews Urdu TTS limitations; Section 3, Section 4, and Section 5 detail our methodology, results, and discussion, respectively. Recent advancements in NLP [1,2,3], especially TTS systems [4,5], are revolutionizing human–machine interfaces; however, such resources are not available for languages with limited computational support, such as Urdu. This study targets the development of a robust Urdu TTS model and a high-quality dataset, with the aim of bridging the digital divide and supporting the phonetic and morphological uniqueness of the language. A neural network-based approach using Tacotron technology produces natural-sounding Urdu speech that goes well beyond basic TTS systems [6]. In this regard, one major limitation in Urdu system design is the absence of large datasets.
We address this limitation by gathering a wide range of audio samples reflecting the dialects and tonal variations present in the language, which will encourage future Urdu NLP research. Beyond TTS, our project targets voice cloning that captures Urdu’s regional accents and cultural nuances to personalize synthesized voices. The methodology includes regionally tuned acoustic models and vocoders that reflect Urdu’s linguistic richness, with wide applicability across a range of Urdu technologies. Technical hurdles remain, such as limited corpora and the complexity of Urdu phonetics. Nonetheless, this work contributes to digital inclusion and accessibility. A reliable Urdu TTS will make it easier to reach people in Pakistan’s multilingual context, but continued development is required to bridge the remaining gaps, specifically the scarcity of Urdu data samples, the complexity of the language, and reliable voice cloning.
2. Related Work
Tacotron2 (Shen et al., 2018) [4] maps text to mel spectrograms and achieved a high Mean Opinion Score (MOS) of 4.53 for English, but it struggles with Urdu due to phonetic mismatches. Hussain’s work [7] focuses on improving Urdu intonation and stress algorithms in TTS, including Letter-to-Sound (LTS) rules. Wirth et al. [5] applied the Tacotron2 architecture to German and obtained a competitive MOS of 3.74, while noting dataset quality as the main area for improvement. Jamal et al. [6] reported transfer learning for Urdu TTS using a Tacotron model pre-trained on English, with MOS scores ranging between 2.90 and 3.45, pointing to the need for richer Urdu corpora. Our work directly addresses these gaps through the following steps: (1) expanding Hussain’s rules with dialectal data, (2) surpassing Jamal’s MOS (2.90–3.45) via curated training, and (3) incorporating voice cloning, which is absent from prior Urdu research.
Table 1 summarizes key studies.
3. Methodology
A sample workflow depicting the complete methodology of this study is shown in Figure 1.
3.1. Dataset Creation
We began this study by gathering data. Reliable Urdu datasets are hard to find, and those available suffered from voice and quality issues, so we prepared our own dataset. More than 100 h of voice recordings were required; the best source was Urdu novel readings on YouTube, with Nemrah Ahmad’s novels being the most prominent choice. Audio snippets (5–7 s) were extracted from 170 h of YouTube Urdu novel readings and transcribed using Whisper (OpenAI, San Francisco, CA, USA), yielding 115,000 samples that were validated against a 40,000-word Dakshina dictionary (AI4Bharat, Chennai, India). Silence and noise were removed using Librosa’s Voice Activity Detection (VAD) (version 0.10.1, open-source Python library), and phonetic alignment was performed with the Montreal Forced Aligner (McGill University, Montreal, QC, Canada).
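As a concrete illustration of this pipeline, the following sketch shows how long recordings can be segmented into 5–7 s snippets using librosa’s energy-based silence splitting (standing in for the VAD step) and transcribed with Whisper. The file paths, Whisper model size, and silence threshold are illustrative assumptions rather than the exact settings used in this study.

```python
import librosa
import soundfile as sf
import whisper

SR = 22050                        # target sampling rate, matching the training setup
MIN_LEN_S, MAX_LEN_S = 5.0, 7.0   # snippet length range described above

def extract_snippets(long_audio_path, out_prefix):
    """Cut a long recording into 5-7 s non-silent snippets."""
    y, sr = librosa.load(long_audio_path, sr=SR)
    # Energy-based silence removal (top_db threshold is an assumption).
    intervals = librosa.effects.split(y, top_db=30)
    paths = []
    min_len, max_len = int(MIN_LEN_S * sr), int(MAX_LEN_S * sr)
    for start, end in intervals:
        segment = y[start:end]
        # Slice long non-silent regions into at-most-7 s chunks.
        for i in range(0, len(segment), max_len):
            chunk = segment[i:i + max_len]
            if len(chunk) >= min_len:
                path = f"{out_prefix}_{len(paths):05d}.wav"
                sf.write(path, chunk, sr)
                paths.append(path)
    return paths

def transcribe_snippets(paths, model_size="medium"):
    """Transcribe snippets with Whisper; Urdu is selected via `language`."""
    model = whisper.load_model(model_size)
    return {p: model.transcribe(p, language="ur")["text"] for p in paths}
```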
3.2. Text Transcription and Preprocessing
This study involved preparing clean, standardized data for TTS model development, starting with text transliteration using the AI4Bharat transliteration library (AI4Bharat, Chennai, India) and reducing errors through a 40,000-word dictionary from the Dakshina dataset. High-quality audio files were processed at a 22,050 Hz sampling rate for clarity, while the model supported 48 kHz specifications. The dataset was split into 85% training and 15% validation subsets, ensuring contamination-free performance evaluation. Sample data from the cleaned dataset after preprocessing are shown in Figure 2.
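A minimal sketch of this preprocessing stage is given below, assuming the transliteration step has already produced (audio path, Romanized text) pairs. It resamples audio to 22,050 Hz, filters transcripts against the Dakshina-derived dictionary, and performs the 85/15 split; the file names and the out-of-vocabulary threshold are assumptions for illustration.

```python
import random
import librosa
import soundfile as sf

TARGET_SR = 22050

def load_dictionary(path="dakshina_urdu_lexicon.txt"):
    # One Romanized word per line (the 40,000-word dictionary described above).
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def resample_to_target(wav_path, out_path):
    y, _ = librosa.load(wav_path, sr=TARGET_SR)   # librosa resamples on load
    sf.write(out_path, y, TARGET_SR)

def validate_transcript(text, lexicon, max_oov_ratio=0.2):
    # Flag samples with too many out-of-vocabulary words (threshold assumed).
    words = text.lower().split()
    oov = sum(w not in lexicon for w in words)
    return bool(words) and oov / len(words) <= max_oov_ratio

def train_val_split(samples, val_fraction=0.15, seed=42):
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_fraction)
    return samples[n_val:], samples[:n_val]       # 85% train, 15% validation
```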
3.3. Model Training
We utilized Tacotron2 (open-source implementation, originally by Google Brain, Mountain View, CA, USA), a model pre-trained on English using the LJSpeech dataset (University of Montreal, Montreal, QC, Canada) covering 13 h of speech, and adapted it to Urdu. Tacotron2 was fine-tuned using Urdu-specific hyperparameters: the Adam optimizer with a learning rate of 1 × 10−4, weight decay for regularization, and a batch size of 32 (optimized via grid search). Training spanned 75,000 iterations on an NVIDIA RTX 3070 GPU (NVIDIA Corporation, Santa Clara, CA, USA), with early stopping at epoch 20 to prevent overfitting.
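The following sketch illustrates this fine-tuning setup against a generic PyTorch Tacotron2 implementation (the model, data loaders, and loss are assumed to exist, and the forward signature is simplified). The learning rate, batch size, iteration budget, and early-stopping horizon come from the description above; the Adam betas and the weight-decay value were not legible in the source, so common defaults are used and marked as assumptions.

```python
import torch

def build_optimizer(model):
    # Learning rate follows the paper; betas and weight decay are assumptions
    # (standard Adam defaults and a small L2 term).
    return torch.optim.Adam(model.parameters(), lr=1e-4,
                            betas=(0.9, 0.999), weight_decay=1e-6)

def fine_tune(model, train_loader, val_loader, criterion,
              max_iters=75_000, patience_epochs=20, device="cuda"):
    """Fine-tune with early stopping once validation loss stops improving."""
    optimizer = build_optimizer(model)
    best_val, epochs_since_best, iters = float("inf"), 0, 0
    while iters < max_iters and epochs_since_best < patience_epochs:
        model.train()
        for text, mel in train_loader:          # simplified batch format, batch size 32
            optimizer.zero_grad()
            loss = criterion(model(text.to(device)), mel.to(device))
            loss.backward()
            optimizer.step()
            iters += 1
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(t.to(device)), m.to(device)).item()
                           for t, m in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, epochs_since_best = val_loss, 0
        else:
            epochs_since_best += 1              # stop after `patience_epochs` stalls
    return model
```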
3.4. Inference and Testing
The deep learning model showed promise in generating Urdu poetry using TTS technology and was tested on various text formats. The model’s versatility makes it suitable for a range of practical applications. First, the model transliterates Urdu text into Romanized Urdu (Urdu written in the English script) and then processes it, filtering out any noise before producing predictions. The model generates mel spectrograms that represent speech-related characteristics such as pitch, duration, and intensity.
The model’s ability to handle unseen, real-world text in different contexts—such as news, casual conversations, and technical scripts—helps refine its performance. Our goal is to develop a TTS model that understands Urdu’s complexities and generates natural, emotional speech. By visualizing mel spectrograms, we can evaluate the model’s output and ensure that the generated speech aligns with the input text, resembling human-like speech synthesis.
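The sketch below outlines this inference path under the assumption that the fine-tuned checkpoint exposes an infer-style method mapping a character-ID sequence to a mel spectrogram, as common open-source Tacotron2 implementations do; the Romanized-Urdu symbol table is a placeholder. The plotting helper reflects the kind of mel spectrogram visualization used to inspect outputs.

```python
import torch
import matplotlib.pyplot as plt

# Placeholder Romanized-Urdu symbol table (an assumption for illustration).
SYMBOLS = "abcdefghijklmnopqrstuvwxyz '?!.,"
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def text_to_sequence(romanized_text):
    ids = [SYMBOL_TO_ID[c] for c in romanized_text.lower() if c in SYMBOL_TO_ID]
    return torch.LongTensor(ids).unsqueeze(0)        # shape: (1, text_length)

def synthesize_mel(model, romanized_text):
    model.eval()
    with torch.no_grad():
        # Assumed interface: returns a (1, n_mels, n_frames) mel spectrogram.
        mel = model.infer(text_to_sequence(romanized_text))
    return mel.squeeze(0).cpu().numpy()

def plot_mel(mel, path="mel.png"):
    plt.figure(figsize=(8, 3))
    plt.imshow(mel, aspect="auto", origin="lower")
    plt.xlabel("Frames")
    plt.ylabel("Mel bins")
    plt.title("Predicted mel spectrogram")
    plt.tight_layout()
    plt.savefig(path)
```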
3.5. Voice Cloning
We also explored voice cloning, which opens new avenues for entertainment and education, not to mention assistive technology. After gaining experience with TTS, we advanced to a more complex synthesis mechanism: Retrieval-based Voice Conversion (RVC). This enabled the Tacotron2-based pipeline to mimic a speaker’s voice down to the slightest differences in delivery. By feeding the TTS-derived speech characteristics into the RVC model (open-source project, GitHub community), we synthesized audio that mimics individual tones and pronunciations in Urdu. With this methodology, our model proved flexible enough to personalize interactions and change user experiences across diverse applications.
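The two-stage hand-off can be summarized by the following sketch. Both synthesize_to_wav and rvc_convert are hypothetical wrappers: the former stands in for the TTS inference described in Section 3.4, and the latter for the RVC project’s conversion entry point, whose exact interface varies between releases.

```python
def synthesize_to_wav(romanized_text, tts_checkpoint):
    """Hypothetical wrapper around the fine-tuned Tacotron2 + vocoder (Section 3.4)."""
    raise NotImplementedError("plug in the TTS inference pipeline here")

def rvc_convert(wav_path, rvc_speaker_model):
    """Hypothetical wrapper around the RVC conversion entry point."""
    raise NotImplementedError("plug in the RVC inference script here")

def clone_voice(romanized_text, tts_checkpoint, rvc_speaker_model):
    # Stage 1 (TTS): text -> base Urdu speech in a neutral voice.
    base_wav = synthesize_to_wav(romanized_text, tts_checkpoint)
    # Stage 2 (RVC): convert the base speech to the target speaker's timbre,
    # keeping the Urdu pronunciation and prosody produced in stage 1.
    return rvc_convert(base_wav, rvc_speaker_model)
```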
4. Results
One of the major contributions of our study is voice cloning. We compared voice cloning outputs at various iterations and found that the best results were achieved after 28,000 iterations. A comparison of the results after 20,000 and 28,000 iterations is shown in Figure 3.
Model performance was evaluated using the following:
MOS (1–5): 4.2 (naturalness) vs. Jamal’s 3.45.
WER: 8.5% on test data.
Speaker Similarity (Speaker Verification to Text-to-Speech, SV2TTS): 85% for cloned voices. Audio samples are in the GitHub samples folder.
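For reference, the WER above can be reproduced from paired reference transcripts and ASR transcriptions of the synthesized audio using the jiwer package, as sketched below; the tab-separated file format is an assumption for illustration.

```python
import jiwer

def corpus_wer(pairs_path="test_pairs.tsv"):
    """Compute corpus-level WER from 'reference<TAB>hypothesis' lines."""
    refs, hyps = [], []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            ref, hyp = line.rstrip("\n").split("\t")
            refs.append(ref)
            hyps.append(hyp)
    return jiwer.wer(refs, hyps)   # 0.085 corresponds to the reported 8.5%

if __name__ == "__main__":
    print(f"WER: {corpus_wer() * 100:.1f}%")
```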
For epochs 1–7 of TTS model training, the produced output was mostly noise. This is normal, especially in complex NLP tasks like TTS, where the model is learning intricate features, including grammar, pronunciation, intonation, and emotion in speech synthesis. The attention mechanism at this point is still underdeveloped and cannot focus the model on the important segments of text, so the outputs are noisy and unfocused. This is the phase in which the model learns phonemes, stress patterns, and other rhythmic features, essentially a basic language acquisition stage.
By epochs 7–15, the model began producing intelligible words from the input text, an early success in text-to-speech synthesis. Although numerous pronunciation errors remained, the model was clearly learning, and the attention mechanism started to function well, attending more appropriately to key words. Although this stage still showed noticeable distortions, it represented a great step forward in the model’s comprehension and pronunciation, implying that the model was learning the relationship between written text and sound patterns.
From epochs 15–20, the model improved further, and spoken words became more accurate and correct. The attention mechanism matured, focusing on the more relevant parts of the text, which made the speech sound clearer and more natural. By this point, significant improvements were evident, as the model could convert input text into speech of almost human-like quality, reflecting the architectural modifications and adjustments to the training procedure.
At 20 epochs, however, signs of overfitting began to emerge, evidenced by increased distortion in the synthesized outputs, which is typically caused by the model overspecializing on the training data (shown in Figure 4). This implied that further training would limit the model’s ability to generalize to new examples. Hence, we stopped training at this stage to avoid further overfitting and to preserve the flexibility of the model, thus supporting its generalization capability for a wide range of speech synthesis tasks in real applications. Through training, our TTS model initially produced noise and pronunciation errors, as would be expected in complex language tasks because of the learning curve of the attention mechanism [8]. By the 20th epoch, it achieved clear, human-like speech, although slight overfitting emerged, degrading performance on new data.
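Although not part of the reported experiments, a common way to observe the behaviour described above is to plot the decoder’s attention alignment matrix, which is diffuse in early epochs and becomes a sharp near-diagonal once the text-to-audio correspondence has been learned. The sketch below assumes the alignment is available as a (decoder steps × encoder steps) NumPy array and includes a synthetic example.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(alignment, title="Attention alignment"):
    """Heatmap of a (decoder_steps x encoder_steps) attention matrix."""
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
    ax.set_xlabel("Encoder steps (input characters)")
    ax.set_ylabel("Decoder steps (mel frames)")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    return fig

if __name__ == "__main__":
    # Synthetic example of a focused (near-diagonal) alignment.
    dec_steps, enc_steps = 200, 80
    rows = np.arange(dec_steps)[:, None]
    cols = np.arange(enc_steps)[None, :]
    band = np.exp(-((cols - rows * enc_steps / dec_steps) ** 2) / 8.0)
    band /= band.sum(axis=1, keepdims=True)
    plot_alignment(band, "Synthetic focused alignment").savefig("alignment.png")
```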
5. Discussion
Our TTS model produced speech that closely matched the transcription, but with occasional noise and silence. These errors may have resulted from minor defects in the dataset: mismatched transcriptions, excessive silence, or inconsistent audio files. Such imperfections make it harder for the model to learn and reduce its ability to produce smooth, flowing speech. Addressing these data-quality issues alone would make the model more capable of synthesizing natural-sounding speech. Noise in the outputs stemmed from dataset silences (addressed via VAD) and the limits of Tacotron2’s mel spectrogram inversion. Future work includes WaveNet vocoders and dialect expansion. Despite overfitting after epoch 20, our model achieves state-of-the-art Urdu TTS, with cloning fidelity validated by SV2TTS. In general, our results indicate that better dataset quality is necessary for human-like TTS output.
6. Conclusions
In conclusion, this research represents a significant advancement toward TTS and voice cloning systems that produce natural-sounding Urdu speech, a language for which, to this day, few computational resources are available despite its rich phonetic and cultural nuances. More than 100 h of audio was used to build a high-quality, large-scale Urdu speech dataset, which was coupled with sophisticated neural network models such as Tacotron2 to create Urdu speech with a native flavor. The integration of voice cloning further personalized the synthesized voices, capturing regional accents and cultural subtleties, which opens new possibilities for applications in education, entertainment, and assistive technologies. Perfect speech synthesis remains a challenge because of dataset imperfections such as mismatched transcriptions and audio inconsistencies; these issues indicate that even higher-quality data are needed to further fine-tune the model’s performance. Moreover, despite the improvements in natural Urdu speech, the model shows overfitting after 20 epochs, suggesting that it is crucial to balance training against generalization. This work not only bridges the digital divide for Urdu-speaking communities but also forms a foundation for future research in Urdu NLP and speech synthesis.