Proceeding Paper

Towards a More Natural Urdu: A Comprehensive Approach to Text-to-Speech and Voice Cloning †

by Muhammad Ramiz Saud 1,‡, Muhammad Romail Imran 1,2,*,‡ and Raja Hashim Ali 1,2,*
1 Department of Artificial Intelligence, Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi 23460, Pakistan
2 Department of Business, University of Europe for Applied Sciences, Konrad-Zuse-Ring 11, 14469 Potsdam, Germany
* Authors to whom correspondence should be addressed.
† Presented at the 5th International Electronic Conference on Applied Sciences, 4–6 December 2024; https://sciforum.net/event/ASEC2024.
‡ These authors contributed equally to this work.
Eng. Proc. 2025, 87(1), 112; https://doi.org/10.3390/engproc2025087112
Published: 20 October 2025
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)

Abstract

This paper introduces a comprehensive approach to building natural-sounding Urdu Text-to-Speech (TTS) and voice cloning systems, addressing the lack of computational resources for Urdu. We developed a large-scale dataset of over 100 h of Urdu speech, carefully cleaned and phonetically aligned through an automated transcription pipeline to preserve linguistic accuracy. The dataset was then used to fine-tune Tacotron2, a neural network model originally trained for English, with modifications tailored to Urdu’s phonological and morphological features. To further enhance naturalness, we integrated voice cloning techniques that capture regional accents and produce personalized speech outputs. Model performance was evaluated through mean opinion score (MOS), word error rate (WER), and speaker similarity, showing substantial improvements compared to previous Urdu systems. The results demonstrate clear progress toward natural and intelligible Urdu speech synthesis, while also revealing challenges such as handling dialectal variation and preventing model overfitting. This work contributes an essential resource and methodology for advancing Urdu natural language processing (NLP), with promising applications in education, accessibility, entertainment, and assistive technologies.

1. Introduction

Urdu TTS faces unique challenges, including complex phonetics, the absence of diacritics, and dialectal diversity. Unlike Google Translate’s generic Urdu TTS, our work addresses these by integrating phonetic alignment and regional acoustic models. We contribute the following: (1) a curated 100 h Urdu dataset with dialectal variations, (2) a Tacotron2-based model optimized for Urdu’s phonology, and (3) voice cloning for personalized outputs. Section 2 reviews the limitations of existing Urdu TTS; Section 3, Section 4 and Section 5 detail our methodology, results, and discussion. Recent advancements in NLP [1,2,3], especially TTS systems [4,5], are revolutionizing human–machine interfaces; however, such resources are not available for languages with limited computational support, such as Urdu. This study targets the development of a robust Urdu TTS model and a high-quality dataset, with the aim of bridging the digital divide and supporting the phonetic and morphological uniqueness of the language. A neural network-based approach built on Tacotron produces natural-sounding Urdu speech that goes well beyond basic TTS systems [6]. In this regard, one major limitation in Urdu system design is the absence of large datasets.
We address this limitation by gathering a wide range of audio samples reflecting the dialects and tonal variations present in the language, which will encourage future Urdu NLP research. Beyond TTS, our project targets voice cloning that captures Urdu’s regional accents and cultural nuances to personalize synthesized voices. The methodology includes regionally tuned acoustic models and vocoders that reflect Urdu’s linguistic richness and apply across a range of Urdu technologies. Technical hurdles remain, such as scarce corpora and the complexity of Urdu phonetics. Nonetheless, this work contributes to digital inclusion and accessibility. A reliable Urdu TTS will make it easier to reach people in Pakistan’s multilingual context, but continued development is required to bridge the remaining gaps, specifically the scarcity of Urdu data samples, the complexity of the language, and reliable voice duplication.

2. Related Work

Tacotron2 [4] maps text to mel spectrograms and achieves a high mean opinion score (MOS) of 4.53, while Hussain’s work [7] focuses on improving intonation and stress algorithms for Urdu TTS. Wirth et al. [5] applied the Tacotron2 architecture to German and reported a competitive MOS of 3.74, noting dataset quality as the main area for improvement. Jamal et al. [6] report on transfer learning for Urdu TTS using a Tacotron model originally trained on English, with MOS scores ranging between 2.90 and 3.45, pointing to the need for larger Urdu corpora. Prior Urdu TTS efforts, such as Hussain’s Letter-to-Sound (LTS) rules [7] and Jamal’s transfer learning (2022) [6], highlight the need for richer corpora. Tacotron2 (Shen et al., 2018 [4]) achieved a MOS of 4.53 for English but struggles with Urdu due to phonetic mismatches. Our work directly addresses these gaps by (1) expanding Hussain’s rules with dialectal data, (2) surpassing Jamal’s MOS (2.90–3.45) via curated training, and (3) incorporating voice cloning, absent in prior Urdu research. Table 1 summarizes key studies.

3. Methodology

The complete workflow of this study, from data collection to evaluation, is shown in Figure 1.

3.1. Dataset Creation

We began by gathering data. Reliable Urdu speech datasets are hard to find, and those we located suffered from voice and quality issues, so we prepared our own. More than 100 h of voice recordings were required; the best source was Urdu novel readings on YouTube, with Nemrah Ahmad being the most prominent choice. Audio snippets (5–7 s) were extracted from 170 h of YouTube Urdu novels, transcribed using Whisper (OpenAI, San Francisco, CA, USA) into 115,000 samples, and validated against a 40,000-word dictionary from the Dakshina dataset (AI4Bharat, Chennai, India). Silence and noise were removed using Librosa’s voice activity detection (VAD; version 0.10.1, open-source Python library), and phonetic alignment was performed with the Montreal Forced Aligner (McGill University, Montreal, QC, Canada).
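As an illustration of the cleaning and transcription steps, the sketch below trims silence with librosa and transcribes a clip with Whisper. The file paths, the top_db threshold, and the Whisper model size are assumptions for illustration, not the exact settings used in this study.

```python
import numpy as np
import librosa
import soundfile as sf
import whisper

# Load a raw clip at the sampling rate later used for training.
audio, sr = librosa.load("raw_clips/clip_001.wav", sr=22050)

# Energy-based silence removal; librosa.effects.split returns non-silent intervals.
intervals = librosa.effects.split(audio, top_db=30)
cleaned = np.concatenate([audio[start:end] for start, end in intervals])
sf.write("clean_clips/clip_001.wav", cleaned, sr)

# Transcribe the cleaned clip with Whisper, forcing the Urdu language tag.
model = whisper.load_model("medium")
result = model.transcribe("clean_clips/clip_001.wav", language="ur")
print(result["text"])
```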

3.2. Text Transcription and Preprocessing

This step involved preparing clean, standardized data for TTS model development, starting with text transliteration using the AI4Bharat transliteration library (AI4Bharat, Chennai, India) and reducing errors through a 40,000-word dictionary from the Dakshina dataset. High-quality audio files were processed at a 22,050 Hz sampling rate for clarity, while the model supported 48 kHz specifications. The dataset was split into 85% training and 15% validation subsets, ensuring contamination-free performance evaluation. Sample entries from the cleaned dataset after preprocessing are shown in Figure 2.
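A minimal sketch of the resampling and the 85/15 split follows; the metadata file layout (pipe-separated audio path and transcript) and the use of scikit-learn's train_test_split are assumptions made for illustration.

```python
import glob
import librosa
import soundfile as sf
from sklearn.model_selection import train_test_split

# Resample every cleaned clip to 22,050 Hz, the rate used for training.
for path in glob.glob("clean_clips/*.wav"):
    audio, sr = librosa.load(path, sr=22050)
    sf.write(path, audio, sr)

# 85% training / 15% validation split over (audio path, transcript) pairs.
with open("metadata.csv", encoding="utf-8") as f:
    pairs = [line.strip().split("|") for line in f]
train, val = train_test_split(pairs, test_size=0.15, random_state=42)
print(f"{len(train)} training samples, {len(val)} validation samples")
```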

3.3. Model Training

We utilized Tacotron2 (open-source implementation, originally by Google Brain, Mountain View, CA, USA), a model pre-trained on English using the LJSpeech dataset covering 13 h of speech, and adapted it to Urdu. Tacotron2 was fine-tuned with Urdu-specific hyperparameters: the Adam optimizer (lr = 1 × 10⁻⁴, β₁ = 0.9, β₂ = 0.999), a weight decay of 5 × 10⁻⁴ for regularization, and a batch size of 32 (optimized via grid search). Training spanned 75,000 iterations on an NVIDIA RTX 3070 GPU (NVIDIA Corporation, Santa Clara, CA, USA), with early stopping at epoch 20 to prevent overfitting.
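For reference, these optimizer settings map onto PyTorch roughly as in the sketch below; it only restates the reported hyperparameters, and checkpoint loading for whichever open-source Tacotron2 implementation is used has been omitted.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer with the hyperparameters reported above."""
    return torch.optim.Adam(
        model.parameters(),
        lr=1e-4,             # learning rate
        betas=(0.9, 0.999),  # beta_1, beta_2
        weight_decay=5e-4,   # regularization
    )

# Training-loop settings: batch size 32, up to 75,000 iterations,
# with early stopping around epoch 20 based on validation loss.
BATCH_SIZE = 32
MAX_ITERATIONS = 75_000
```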

3.4. Inference and Testing

The deep learning model showed promise in generating Urdu poetry and was tested on various text formats, making it suitable for a range of practical uses. First, the input Urdu text is transliterated to Romanized Urdu (Urdu written in Latin script) and then processed, with noise in the text filtered out before prediction. The model generates mel spectrograms that represent speech-related characteristics, such as pitch, duration, and intensity.
The model’s ability to handle unseen, real-world text in different contexts—such as news, casual conversations, and technical scripts—helps refine its performance. Our goal is to develop a TTS model that understands Urdu’s complexities and generates natural, emotional speech. By visualizing mel spectrograms, we can evaluate the model’s output and ensure that the generated speech aligns with the input text, resembling human-like speech synthesis.
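As a rough sketch of this inference step, assuming the open-source Tacotron2 reference implementation (its text_to_sequence helper and model.inference method), text-to-mel generation and visualization might look like the following; the cleaner choice and input sentence are placeholders.

```python
import torch
import matplotlib.pyplot as plt

# text_to_sequence and model.inference follow the open-source Tacotron2
# reference implementation; other ports may use different names.
from text import text_to_sequence

def synthesize_mel(model: torch.nn.Module, romanized_text: str) -> torch.Tensor:
    """Convert Romanized Urdu text to a predicted mel spectrogram."""
    ids = text_to_sequence(romanized_text, ["basic_cleaners"])
    sequence = torch.LongTensor(ids).unsqueeze(0)
    with torch.no_grad():
        _, mel_postnet, _, _ = model.inference(sequence)
    return mel_postnet[0]

def plot_mel(mel: torch.Tensor) -> None:
    """Visualize the spectrogram to sanity-check pitch and duration structure."""
    plt.imshow(mel.cpu().numpy(), aspect="auto", origin="lower")
    plt.xlabel("Frames")
    plt.ylabel("Mel channels")
    plt.show()

# Usage (assuming `tacotron2` is the fine-tuned model in eval mode):
# plot_mel(synthesize_mel(tacotron2, "aap kaise hain"))
```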

3.5. Voice Cloning

We also explored voice cloning, which opens new avenues for entertainment, education, and assistive technology. After gaining experience with TTS, we advanced to a more complex synthesis mechanism: retrieval-based voice conversion. This enabled Tacotron2 outputs to mimic a speaker’s voice down to even slight differences in language. By feeding the TTS-derived speech into the Retrieval-based Voice Conversion (RVC) model (open-source project, GitHub community), we were able to synthesize audio that mimics unique tones and pronunciations in Urdu. With this methodology, our model proved flexible enough to personalize interactions and change user experiences across diverse applications.
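Conceptually, the cloning stage simply chains the TTS output into a voice-conversion model. The convert_voice call below is a hypothetical wrapper (the RVC project is normally driven through its own scripts or web UI rather than a stable Python API), so the sketch only illustrates how the two stages connect.

```python
import soundfile as sf

def clone_speaker(tts_wav_path: str, speaker_model_path: str, out_path: str) -> None:
    """Chain Tacotron2/vocoder output into a voice-conversion stage.

    convert_voice is a HYPOTHETICAL wrapper around RVC inference; the actual
    project is usually run through its own command-line scripts or web UI.
    """
    audio, sr = sf.read(tts_wav_path)                          # synthesized Urdu speech
    converted = convert_voice(audio, sr, speaker_model_path)   # hypothetical RVC call
    sf.write(out_path, converted, sr)
```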

4. Results

One of the major contributions of our study is voice cloning. We compared voice cloning at various iteration counts and found that the best results were achieved after 28,000 iterations. A comparison of results after 20,000 and 28,000 iterations is shown in Figure 3.
Model performance was evaluated using the following:
  • MOS (1–5): 4.2 (naturalness) vs. Jamal’s 3.45.
  • WER: 8.5% on test data (a brief computation sketch follows this list).
  • Speaker Similarity (Speaker Verification to Text-to-Speech, SV2TTS): 85% for cloned voices. Audio samples are in the GitHub samples folder.
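As a small illustration of the WER evaluation, assuming the jiwer package and placeholder reference/hypothesis pairs:

```python
from jiwer import wer

# Placeholder reference transcripts and model hypotheses (Romanized Urdu).
references = ["aap kaise hain", "mausam aaj bahut acha hai"]
hypotheses = ["aap kese hain", "mausam aaj bahut acha hai"]

# Word error rate over the whole set of pairs.
print(f"WER: {wer(references, hypotheses):.1%}")
```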
During epochs 1–7 of TTS training, the produced output was mostly noise. This is normal in complex NLP tasks like TTS, where the model must learn features such as grammar, pronunciation, intonation, and emotion. At this stage the attention mechanism is still underdeveloped and cannot focus the model on the important segments of text, so the outputs remain noisy and unfocused. This is the phase in which the model learns phonemes, stress patterns, and rhythm, a basic language acquisition stage.
By epochs 7–15, the model began producing intelligible words from input text, marking early success in text-to-speech conversion. Although numerous pronunciation errors remained, the model was clearly learning, and the attention mechanism began to function well, attending more appropriately to key words. Although this stage still showed noticeable distortions, it represented a clear step forward in the model’s comprehension and pronunciation, implying that it was learning the relationship between written text and sound patterns.
From epochs 15–20, quality improved further, and spoken words became more accurate. The attention mechanism matured, focusing on the more relevant parts of the text, which made the speech sound more natural and clearer. By this point it was clear that significant improvements had been made: the model could convert input text into speech of near human-like quality, confirming the value of the architectural modifications and adjustments to the training procedure.
At 20 epochs, however, signs of overfitting began to emerge, as evidenced by increased distortion in synthesized outputs, an indicator often caused by the model overspecializing on the training data (shown in Figure 4). This implied that further training would limit the model’s ability to generalize to new examples. We therefore stopped training at this stage to avoid further overfitting and to preserve the model’s flexibility, supporting its generalization across a wide range of real-world speech synthesis tasks. Over the course of training, our TTS model initially exhibited noise and pronunciation issues, as expected in complex language tasks given the learning curve of the attention mechanism [8]. By the 20th epoch, it achieved clear, human-like speech, although slight overfitting emerged, degrading performance on new data.

5. Discussion

Our TTS model produced speech that closely matched the transcription, but with occasional noise and silence. These errors likely resulted from minor defects in the dataset: mismatched transcriptions, excessive silence, or inconsistent audio files. Such imperfections make it harder for the model to learn and reduce its ability to produce smooth, flowing speech. Addressing these data quality issues alone would make the model more capable of synthesizing natural-sounding speech. Noise in the outputs stemmed from silences in the dataset (addressed via VAD) and from the limits of Tacotron2’s mel spectrogram inversion. Future work includes WaveNet vocoders and dialect expansion. Despite overfitting after epoch 20, our model achieves state-of-the-art Urdu TTS, with cloning fidelity validated by SV2TTS. In general, our results indicate that better dataset quality is necessary for human-like TTS output.

6. Conclusions

In conclusion, the present research represents a significant step toward TTS and voice cloning systems that produce natural-sounding Urdu speech; to date, no substantial computational resources have been available for Urdu, despite its rich phonetic and cultural nuances. More than 100 h of audio was used to build a high-quality, large-scale Urdu speech dataset, which, combined with neural network models such as Tacotron2, produced Urdu speech with a native flavor. The integration of voice cloning further personalized the synthesized voices, capturing regional accents and cultural subtleties, and opens new possibilities for applications in education, entertainment, and assistive technologies. Perfect speech synthesis remains a challenge because of dataset imperfections such as mismatched transcriptions and audio inconsistencies; these issues point to the need for even higher-quality data to further refine the model’s performance. Moreover, despite the improvements in natural Urdu speech, the model shows overfitting after 20 epochs, underscoring the importance of balancing training against generalization. This work not only helps bridge the digital divide for Urdu-speaking communities but also lays the foundation for future research in Urdu NLP and speech synthesis.

Author Contributions

Conceptualization, R.H.A.; Methodology, M.R.S.; Software, M.R.S.; Validation, R.H.A.; Formal Analysis, M.R.I.; Writing—Original Draft Preparation, M.R.S., M.R.I., and R.H.A.; Writing—Review and Editing, R.H.A. and M.R.I.; Supervision, R.H.A. All authors have read and agreed to the published version of this manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset and code are openly available in GitHub at https://github.com/RamizSaud/TTS (accessed on 18 July 2025), with directories: 1. data: Cleaned audio (WAV) and transcripts (TXT). 2. scripts: Preprocessing (Whisper alignment, noise removal). 3. models: Tacotron2 configs and trained weights. See README.md for replication steps.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Naeem, M.; Siddique, A.B.; Ali, R.H.; Arshad, U.; ul Abideen, Z.; Khan, T.A.; Shah, M.H.; Ijaz, A.Z.; Ali, N. Performance Evaluation of Popular Deep Neural Networks for Neural Machine Translation. In Proceedings of the 2023 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 11–12 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 220–225. [Google Scholar] [CrossRef]
  2. Paracha, A.W.; Arshad, U.; Ali, R.H.; ul Abideen, Z.; Shah, M.H.; Khan, T.A.; Ijaz, A.Z.; Ali, N.; Siddique, A.B. Leveraging AI and NLP in Chatbot Development: An Experimental Study. In Proceedings of the 2023 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 11–12 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 172–177. [Google Scholar] [CrossRef]
  3. Ul-Abidden, Z.; Ali, R.H.; Zeeshan, A.I.; Khan, T.A. A Novel NLP-Based Stock Market Price Prediction and Risk Analysis Framework. J. Comput. Sci. Technol. 2024, 24, e07. [Google Scholar] [CrossRef]
  4. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.J.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783. [Google Scholar] [CrossRef]
  5. Wirth, J.; Puchtler, P.; Peinl, R. Neural Speech Synthesis in German. In Proceedings of the 14th International Conference on Advanced Human-Oriented Personalized Mechanisms, Technologies, Services, Barcelona, Spain, 3–7 October 2021; pp. 26–34. Available online: https://personales.upv.es/thinkmind/dl/conferences/centric/centric_2021/centric_2021_2_30_30009.pdf (accessed on 18 July 2025).
  6. Jamal, S.; Rauf, S.A.; Majid, Q. Exploring Transfer Learning for Urdu Speech Synthesis. In Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered, and Lesser-Resourced Languages in Eurasia, 13th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 70–74. Available online: https://aclanthology.org/2022.eurali-1.11/ (accessed on 18 July 2025).
  7. Hussain, S. Letter-to-sound conversion for Urdu text-to-speech system. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland, 28 August 2004; pp. 74–79. Available online: https://aclanthology.org/W04-1613.pdf (accessed on 18 July 2025).
  8. Bapna, A.; Cherry, C.; Zhang, Y.; Jia, Y.; Johnson, M.; Cheng, Y.; Khanuja, S.; Riesa, J.; Conneau, A. MSLAM: Massively multilingual joint pre-training for speech and text. arXiv 2022, arXiv:2202.01374. [Google Scholar] [CrossRef]
Figure 1. Workflow of the proposed Urdu text-to-speech system. The pipeline consists of three main stages: Data Handling (data collection and preprocessing), AI & NLP Framework (model training, inference, and evaluation), and Insights & Impact (recommendations and applications). Data are collected from YouTube Urdu novel readings, converted into Roman Urdu, and preprocessed through resampling, text cleaning, and alignment. The Tacotron2 model is fine-tuned on the dataset with GPU support, followed by inference, speech generation, and validation. The final stage includes evaluation of results and insights such as model refinement, dataset expansion, performance benchmarks, error analysis, and applications in assistive technologies, education, and media.
Figure 2. Examples of entries from the curated Urdu speech dataset. Each row lists the corresponding audio file, the original Urdu text, and its Latin script transcription. The Latin script represents Romanized Urdu (phonetic transcription) rather than an English translation, illustrating the format and alignment used for training and evaluation.
Figure 3. Result at 20,000 iterations and 28,000 iterations, respectively.
Figure 4. Training loss and validation loss at 28,000 iterations.
Table 1. Condensed literature review table.
Sr. | Year | Author(s) | Methodology | Contribution | Limitations
1 | 2004 | Sarmad Hussain [7] | LTS rules for Urdu | Foundational rules for Urdu TTS | No quantitative results
2 | 2018 | Shen et al. [4] | Tacotron 2 + WaveNet on 24.6 h US English data | High-quality TTS with MOS 4.53 | Mispronunciations, limited text diversity
3 | 2021 | Wirth et al. [5] | Tacotron 2 + Multi-Band MelGAN for German | MOS 3.74, comparable to English models | Non-deterministic, dataset quality issues
4 | 2022 | Jamal et al. [6] | Transfer learning with Tacotron | MOS 2.90–3.45 for naturalness | Limited Urdu corpus, need for richer data