Article

The Development and Experimental Evaluation of a Multilingual Speech Corpus for Low-Resource Turkic Languages

Faculty of Information Technology and Artificial Intelligence, Farabi University, Almaty 050040, Kazakhstan
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 12880; https://doi.org/10.3390/app152412880
Submission received: 27 October 2025 / Revised: 1 December 2025 / Accepted: 3 December 2025 / Published: 5 December 2025

Abstract

The development of parallel audio corpora for Turkic languages, such as Kazakh, Uzbek, and Tatar, remains a significant challenge in the development of multilingual speech synthesis, recognition systems, and machine translation. These languages are low-resource in speech technologies, lacking sufficiently large audio datasets with aligned transcriptions that are crucial for modern recognition, synthesis, and understanding systems. This article presents the development and experimental evaluation of a speech corpus focused on Turkic languages, intended for use in speech synthesis and automatic translation tasks. The primary objective is to create parallel audio corpora using a cascade generation method, which combines artificial intelligence and text-to-speech (TTS) technologies to generate both audio and text, and to evaluate the quality and suitability of the generated data. To evaluate the quality of synthesized speech, metrics measuring naturalness, intonation, expressiveness, and linguistic adequacy were applied. As a result, a multimodal (Kazakh–Turkish, Kazakh–Tatar, Kazakh–Uzbek) corpus was created, combining high-quality natural Kazakh audio with transcription and translation, along with synthetic audio in Turkish, Tatar, and Uzbek. These corpora offer a unique resource for speech and text processing research, enabling the integration of ASR, MT, TTS, and speech-to-speech translation (STS).

1. Introduction

For most Turkic languages, there is a severe lack of large open audio corpora, especially those containing parallel transcriptions and translations [1]. This problem significantly hinders the development of speech processing technologies such as automatic speech recognition (ASR), text-to-speech (TTS), and especially machine translation of spoken language [2]. Unlike languages with global coverage, such as English, Chinese, or Spanish, Turkic languages are rarely represented in international multimodal corpus collection initiatives. Even for the relatively well-resourced Turkish language, audio corpora with accurate speech-to-text alignment—and even more so with translations into other languages—are the exception rather than the rule. This shortage of structured audio resources greatly limits the ability to design and train effective models that can handle the phonetic complexity, prosody, and morphological richness inherent to these languages. Among the Turkic language family, Turkish remains the most developed in terms of available speech corpora and audio technologies, supported by commercial applications and a growing set of open datasets. In contrast, Central Asian Turkic languages remain largely underrepresented in global speech technology efforts, not to mention languages for which such resources are either absent altogether or exist only as small, fragmented collections that do not meet the requirements of machine learning. The lack of such corpora not only increases the cost of developing and training language models and diminishes the accuracy of speech technologies but also significantly affects the digital landscape, restricting the presence and use of Turkic languages in this increasingly vital domain.
One of the key challenges in creating parallel audio corpora for Turkic languages is the quality and origin of the available speech data. Since there are virtually no targeted recordings of speech in these languages in the public domain, researchers are forced to collect audio materials from secondary sources such as radio broadcasts, TV shows, podcasts, or YouTube videos. However, such data usually contain considerable noise—background sounds, echo, background music, interruptions, recording defects, and other acoustic artifacts such as reverberation or distortion—that significantly complicates ASR. Moreover, such sources are rarely accompanied by accurate, aligned transcriptions, making it impossible to use these data for model training without prior, labor-intensive annotation.
Aligning audio fragments with parallel texts at the sentence or word level (forced alignment) is a task that demands tremendous time and computational resources. In the absence of reliable language models and specialized tools, the task becomes virtually impossible to solve. This is especially true for Turkic languages, which have the potential to benefit significantly from fully developed and stable ASR and TTS systems. However, these systems are either non-existent or in the experimental stage and currently lack the accuracy and reliability for widespread use. The formation of synthetic parallel speech corpora in Turkic languages is hindered not only by the shortage of structured resources but also by the lack of the technical infrastructure needed for their effective processing and subsequent integration into machine learning systems and intelligent technologies. Overcoming these challenges is crucial to realizing the potential of ASR and TTS systems for Turkic languages.
Modern multilingual automatic speech recognition models, such as Whisper [3], Massively Multilingual Speech (MMS) [4], and Soyle Automatic Speech Recognition Model (Soyle) [5], demonstrate impressive results for widely spoken Turkic languages; however, their efficiency decreases significantly when working with rare and low-resource languages. For Turkic languages, this is reflected in a high proportion of speech recognition errors (Word Error Rate, WER), especially in the presence of dialectal features, non-standard phonetics, and limited training material. In addition, models pre-trained on one Turkic language often do not demonstrate satisfactory transferability to other related languages such as Uzbek, Kyrgyz, or Turkmen. Despite their genetic and typological closeness, differences in phonetics, intonation, alphabet, and orthographic norms create significant obstacles to effective interlingual transfer.
The scientific novelty of the proposed study lies in the creation of the first synthetic multilingual parallel audio corpus for Turkic languages, built on a unified, controlled “machine translation followed by speech synthesis” methodology grounded in high-quality natural and synthetic data.
The paper is organized as follows: The Introduction discusses the resource scarcity issues for resource-poor Turkic languages. Section 2 provides an overview of the literature and resources on speech technology and Natural Language Processing (NLP). Section 2.1 provides an overview of high-resource multilingual speech corpora for speech-to-speech (S2S) translation and speech synthesis and their contributions to the field. Section 2.2 details the resources available for Turkic languages and their applicability to language technologies. Section 2.3 presents a comparative analysis of available audio and text resources for Kazakh, Tatar, Uzbek, and Turkish. Section 3 describes the methods used in the study and consists of five subsections. Section 3.1 discusses the creation of the audio corpus using a cascade scheme. Section 3.2 discusses the methodology for creating audio corpora for Turkic languages using the text-first approach. Section 3.3 describes text-to-speech systems developed for Turkic languages, including their features, limitations, and performance. Section 3.4 provides a detailed discussion of the datasets used in the study. Finally, Section 3.5 explains the metrics used to evaluate the quality of the resulting audio corpora and TTS systems. Section 4 presents the results of the developed parallel audio corpora for the Kazakh–Turkish, Kazakh–Uzbek, and Kazakh–Tatar languages, as well as evaluations of synthesized speech quality using various metrics. Section 5 discusses the results, challenges encountered, and an analysis of the obtained results. The article concludes with a summary of the main findings and suggests future directions for work on audio corpora of Turkic languages for speech technologies.

2. Related Works

In recent years, speech technologies have expanded beyond basic tasks such as automatic speech recognition (ASR) and speech synthesis (TTS). Multilingual and multidialectal systems capable of operating under resource-constrained and non-standard scenarios are of increasing interest.
One of the key drivers of this development has been the availability of large, open corpora of speech and translation data, which have enabled the training and comparison of different models. Most of the currently available resources focus on European and major Asian languages. Despite the rich morphology and historical significance of Turkic languages, they remain underrepresented in modern speech research. The lack of high-quality parallel and annotated corpora limits the development of effective translation systems. Nevertheless, research in this area has advanced in recent years. Specialized corpora have been created for a number of Turkic languages, including Kazakh, Tatar, Uzbek, and Turkish, and neural network models are being actively adapted to their specific features. These steps form the basis for integrating Turkic languages into the global system of multilingual speech technologies. This paper presents an overview of current speech and translation corpora used in speech recognition and synthesis, as well as in speech translation systems.

2.1. Multilingual Speech Corpora

Recent research has focused extensively on developing multilingual and multidialectal speech corpora to support tasks such as speech recognition, S2S translation, and speech synthesis.
Certain methodologies were applied to construct SpeechMatrix, a large-scale multilingual corpus for speech-to-speech translation, using real speech data mined from recordings of the European Parliament. These approaches demonstrated strong performance and scalability [6]. To assess the quality of the extracted parallel speech data, bilingual S2S translation models were trained exclusively on the mined corpus, establishing baseline performance metrics on the Europarl-ST, VoxPopuli, and FLEURS test sets. A thorough evaluation demonstrated high-quality speech alignments. However, SpeechMatrix does not include Kazakh as a source language and contains no systematically aligned synthetic data for Turkic languages; therefore, it cannot serve as a parallel resource comparable to the corpus proposed in this study.
Building upon this, the Multilingual TEDx corpus [7] extended the availability of multilingual data beyond English-centric resources, focusing on ASR and ST tasks across eight languages. Baseline experiments explored multilingual modeling strategies to enhance translation for low-resource pairs. Continuing this line, VoxPopuli [8] offered one of the world’s largest open speech datasets, comprising 400,000 h of unlabeled speech in 23 languages, and included the first large-scale open-access S2S interpretation data. Efforts have also targeted regional and dialectal diversity. For example, SwissDial provides parallel text and audio data across eight Swiss German dialects, with quality assessed via neural speech synthesis experiments covering multiple configurations [9]. Further methodologies introduced unified multilingual S2S translation frameworks leveraging vocabulary masking and multilingual vocoding to mitigate cross-lingual interference and improve model performance [10].
In addition to curated resources, community-driven projects such as Common Voice [11] and CVSS [12] have expanded the multilingual landscape. Common Voice facilitates scalable multilingual ASR development through crowd-sourced data, while CVSS integrates speech-to-speech translation pairs from 21 languages into English, validated through baseline direct and cascaded S2ST models. Beyond supervised datasets, unsupervised approaches have also emerged. Unlike CVSS, which does not cover Kazakh, Uzbek, or Tatar and does not generate controlled parallel audio across related languages, the present work introduces a synthetic Kazakh-centric parallel corpus specifically designed for Turkic languages. A method for building parallel corpora from dubbed video content combined visual and linguistic cues to align speech across languages, achieving high accuracy and robustness on Turkish–Arabic data [13]. Similarly, the MuST-C corpus addressed the scarcity of large-scale end-to-end SLT resources by aligning English TED Talk audio with multilingual translations, providing valuable training data for SLT systems [14]. Finally, wSPIRE contributed parallel data for both neutral and whispered speech, opening new research directions in speech modality analysis [15].
Together, these resources demonstrate steady progress toward building scalable and diverse multilingual corpora. However, challenges remain in ensuring balance across languages, improving data quality for low-resource settings, and facilitating cross-corpus interoperability—areas that continue to motivate further research.

2.2. Resources for Turkic Languages

While these resources have significantly advanced multilingual speech translation, they remain limited in coverage and applicability for Turkic languages, which continue to face data scarcity and underrepresentation in existing corpora. The following reviews existing research and available resources related to speech and translation technologies for Turkic languages. Certain methodologies were applied to develop KazakhTTS2, an extended version of an earlier open-source Kazakh text-to-speech (TTS) corpus. The updated dataset increases from 93 to 271 h of speech and includes recordings from five speakers (three female, two male) across a wider range of topics, such as literature and Wikipedia articles. The corpus is designed to support high-quality TTS systems for Kazakh, addressing challenges typical of agglutinative Turkic languages. Experimental evaluations report mean opinion scores ranging from 3.6 to 4.2. The dataset, including code and pretrained models, is publicly available on GitHub [16]. Specific methodologies were also employed to develop a cascade speech translation system for translating spoken Kazakh into Russian. The system is based on the ST-kk-ru dataset, which is derived from the ISSAI Corpus and includes aligned Kazakh speech and Russian translations. It consists of an ASR module for Kazakh transcription and an NMT module for translation into Russian. The study demonstrates that augmenting both components with additional data improves translation performance by approximately two BLEU points. Further comparison between DNN-HMM and End-to-End Transformer-based ASR models was conducted, with results reported using Word Error Rate (WER) and Character Error Rate (CER). These findings highlight the importance of data augmentation for improving speech translation systems in low-resource languages such as Kazakh [17].
For languages with limited digital resources, such as Tatar, the availability of corpus-based resources has become increasingly critical. In recent years, several open-access Tatar-language text and speech corpora have been developed through both academic initiatives and crowdsourced contributions, aiming to address data scarcity in Turkic-language technologies.
The development and implementation of the national corpus of the Tatar language “Tugan tel”, which represents an important step in the digitalization of the Tatar language, are described in the article [18]. The corpus includes more than 27-million-word usages from texts of various genres—from fiction to journalism and official documents. The authors consider the principles of morphological annotation adapted to the agglutinative structure of the Tatar language. Building on this foundation, the authors in [19] proposed a methodology and a toolkit for automatic grammatical disambiguation in a large Tatar-language corpus. The system demonstrates the ability to improve the accuracy of morphological processing significantly. The study makes an important contribution to the corpus infrastructure for the Tatar language and can become a basis for further improvements in the field of morphology and NLP. Complementary resources, such as the Corpus of Written Tatar [20], expand the infrastructure by offering detailed linguistic, morphological, and frequency data. The system later incorporated a multi-component morphological search engine [21], enabling flexible searches by lemma, affix, or structural parameters, and supporting advanced corpus analysis. Further research [22] discussed challenges in building a large-scale corpus exceeding 400 million tokens, outlining technical, annotation, and organizational solutions for sustainable corpus development. The first systematic study of semantic relation extraction from the Tatar corpus “Tugan Tel” was presented [23]. Algorithmic limitations, morphological difficulties, and the need for lexical semantic infrastructure are highlighted. The study makes a step towards semantically rich tools and resources for Turkic languages.
Progress in Tatar language technologies also extends to speech processing. The authors in [24] described the construction of an automatic speech recognition (ASR) system for the Tatar language using an iterative self-supervised pre-training approach. The paper demonstrates the potential of self-supervised methods in the context of limited manually annotated resources, which makes it especially valuable for low-resource languages such as Tatar. The TatarTTS dataset [25] provides an open-source speech synthesis corpus comprising 70 h of professionally recorded Tatar audio, facilitating TTS research and applications. Additionally, in [26], the authors presented a noise-robust multilingual ASR system adapted for Tatar, based on the Whisper model and trained on over 260 h of speech data from the TatSC_ASR corpus.
Overall, these studies on Tatar language resources demonstrate a gradual transition from corpus compilation and morphological annotation to more advanced semantic and speech technologies, highlighting the growing potential for integrating the Tatar language into multilingual and multimodal artificial intelligence systems.
While some Turkic languages still lack large publicly available speech and text datasets, Uzbek has seen significant progress in recent years. The following is an overview of existing work related to the development of speech and audio resources for Uzbek. Several earlier efforts have explored Uzbek speech recognition, though most were limited in scope and dataset availability. For instance, specific methodologies were applied to build an ASR system focused on recognizing geographical entities, using a small dataset of 3500 utterances [27]. In a related study, a read speech recognition system was proposed based on 10 h of transcribed Uzbek audio [28]. A significant step forward was made with the introduction of the Uzbek Speech Corpus (USC)—the first open-source corpus for Uzbek. The dataset comprises 105 h of manually validated speech from 958 speakers. Baseline ASR experiments using both DNN-HMM and end-to-end architectures yielded word error rates of 18.1% and 17.4% on the validation and test sets, respectively. The dataset, along with pretrained models and training recipes, is publicly available to support reproducibility [29]. A crowdsourced contribution to Uzbek ASR was made through the Common Voice project—a multilingual corpus whose Uzbek subset includes 266 h of speech from over 2000 speakers, representing a valuable resource for building robust ASR systems in low-resource settings [11]. Additionally, FeruzaSpeech was introduced as a 60 h single-speaker read speech corpus recorded by a native female speaker from Tashkent. The dataset contains literary and news texts in both Cyrillic and Latin scripts and demonstrates improvements in ASR performance when integrated into training pipelines for Uzbek [30].
In recent years, there has been active development of Turkish speech corpora, contributing significantly to advances in automatic speech recognition and synthesis. Specific analyses were conducted to provide a comprehensive review of studies on Turkish automatic speech recognition systems, covering approaches from classical HMM-GMM models to modern transformer and self-supervised architectures. The main limitations were identified as a lack of resources and computational power, and it was concluded that targeted efforts are needed to achieve quality comparable to human speech perception [31]. A LAS (Listen-Attend-Spell)-based transformer architecture for Turkish ASR was proposed and trained on multilingual data, achieving high recognition performance on a Turkish subcorpus [32].
An end-to-end Turkish speech synthesizer based on the Tacotron2 architecture was developed with modifications to account for the morphological features of the language. The model was trained on a local corpus of audio and text, using the WaveGlow vocoder to generate sound. Experiments showed improvements in synthesis quality according to the MOS metric compared to traditional TTS systems, highlighting the potential of the technology for practical applications in voice interfaces and assistive technologies [33]. Comparative studies of Whisper Small and Wav2Vec2 XLS R 300M models were conducted on the Mozilla Common Voice Turkish v11 dataset, yielding WER scores of approximately 0.16 and 0.28, respectively, with additional testing on real call center data [34]. Adaptation of the Whisper model to Turkish speech using the LoRA (low-rank adaptation) method demonstrated a significant WER improvement—up to 52% compared to the original model—across five Turkish datasets, confirming the effectiveness of parameter-efficient fine-tuning for low-resource languages [35]. An ASR system based on the XLSR-Wav2Vec2.0 model retrained on Turkish Mozilla Common Voice recordings achieved an impressive WER of 0.23, demonstrating the high potential of self-trained transformer models on under-labeled data [36].
Further efforts were dedicated to creating Turkish speech resources. A collaborative initiative by METU and CSLR resulted in a 193-speaker audio corpus and accompanying newspaper text. Automatic phoneme alignment achieved an accuracy of 91.2% within ±20 ms relative to manual tagging, and a phoneme error rate (PER) of 29.3%, demonstrating that the ported SONIC ASR engine was successfully applied to Turkish. The development of a phonetic dictionary and corpus tools provided a foundation for further research and opened new opportunities for improving ASR systems [37]. The porting of the SONIC recognition OS to Turkish, along with the creation of new corpora, achieved up to 91% phoneme alignment accuracy compared to the standard. Analysis of the phoneme model showed that while some PER remained around 30%, the system was technically suitable for ASR, highlighting the importance of architecture transfer approaches and phonological tool development for low-resource languages [38]. These results mark an important milestone for Turkish ASR systems and lay the foundation for subsequent corpora and models.
A corpus-based analysis of the oral speech of Arabic–Turkish bilinguals and Turkish monolinguals revealed no major differences in word order, voice structures, or frequency of interjections. However, bilinguals produced a greater number of general sentences, indicating differences in speech structure strategies. This corpus-driven approach made it possible to identify subtle variations in linguistic repertoire and behavior, enriching the understanding of bilingualism and demonstrating the methodological value of corpus data for psycholinguistic analysis [39]. Finally, a methodology for creating a large Turkish spoken corpus was developed based on audio extracted from subtitled movies, yielding 90 h of synchronized audio and transcription. The integration of preprocessor, parser, and speech extractor modules ensured efficient automation of corpus acquisition and tagging, essential for scalable language resource construction. Speech data from 120 movies showed an average segment length of 45 min, confirming the high density of speech and lexical diversity. The corpus proved suitable for training and testing acoustic ASR models, demonstrating a scalable framework for future research in Turkish speech processing [40].

2.3. A Comparative Analysis of Available Audio and Text Resources for Kazakh, Tatar, Uzbek, and Turkish

The Kazakh Speech Corpus 2 (KSC2) [41] is the first industrial-scale, open-source Kazakh speech corpus developed by the Institute of Intelligent Systems and Artificial Intelligence (ISSAI) at Nazarbayev University. This corpus combines data from two corpora—the Kazakh Speech Corpus and KazakhTTS2—and includes additional material collected from a wide range of sources, such as radio, magazines, television programs, and podcasts. It contains approximately 1200 h of high-quality data, comprising over 600,000 utterances. The corpus is freely available and can be used for both academic research and industry. KazakhTTS2 [42] is a high-quality speech dataset and an updated, expanded version of the KazakhTTS dataset. Compared to the previous version of KazakhTTS, the audio data size has been increased from 90 to 271 h. Three new professional speakers (two female and one male) have also been added, each recording over 25 h of carefully transcribed speech. The inclusion of texts from various sources, including fiction, articles, and Wikipedia, has also enriched thematic content. Like the previous version, KazakhTTS2 is freely available and can be downloaded from the official ISSAI website. Relevant data on these corpora are shown in Table 1.
The National Corpus of the Kazakh Language is a large-scale collection of electronic texts containing millions of words, fully covering the lexical and grammatical structure of the Kazakh language (with extensive annotations), and a “smart” specialized knowledge base that collects all information about the Kazakh language. The total number of words is 65,000,000. The National Corpus of the Kazakh Language currently consists of 16 subcorpora, each developed for a specific purpose. All internal subcorpora contain morphological, semantic, lexical, and phonetic-phonological annotations [45,46].
The TatSC_ASR corpus is one of the largest and most systematic speech datasets for the Tatar language. This corpus includes over 269 h of audio recordings annotated with transcriptions in the Tatar language. The corpus materials were collected from various sources, including crowdsourcing platforms and audiobooks. Hackathon Tatar ASR is a corpus that was assembled during the Tatar.Bu Hackathon in 2024 and is an example of a coordinated crowdsourcing approach to creating speech resources. The corpus size is about 90 h of audio data distributed across almost 69 thousand segments. TatarTTS is a specialized open corpus focused on speech synthesis tasks for the Tatar language. The distinctive feature of this corpus is that the audio recordings are made by professional speakers and are accompanied by high-quality transcriptions, which makes it optimal for Text-to-Speech (TTS) tasks. The corpus contains about 70 h of audio recordings. Table 2 provides an analysis for each of these corpora.
The Tatar corpus “Tugan tel” is a linguistic resource reflecting the modern literary Tatar language. This corpus is being developed for a wide range of users: linguists, specialists in Tatar and Turkic languages, general linguists, typologists, etc. The corpus includes texts of various genres—fiction, media materials, official documents, educational and scientific texts, etc. Each document is provided with a meta description, including information about the authors, publication data, creation dates, genres, and structural parts of the text. The Corpus of Written Tatar is a collection of electronic texts in the Tatar language created to support scientific research in the field of Tatar vocabulary. Currently, the corpus volume exceeds 500 million words (more than 620 million tokens), and the total number of unique word forms is about 5 million. The corpus is aimed at anyone interested in the structure, current state, and development of the Tatar language. The Tatar–English Corpus is an open parallel corpus containing ~7775 aligned pairs of Tatar and English sentences. However, despite its small size, this corpus is of great importance for various studies in the field of multilingual models and the development of machine translation systems for low-resource languages. A comparative analysis of the corpora is shown in Table 3.
The Uzbek Speech Corpus (USC), introduced by Musaev et al. [53], contains over 105 h of transcribed speech data from 958 speakers of different ages and dialects. This corpus is freely available and widely used in ASR research. Another important resource is FeruzaSpeech [30,54], a 60 h mono-speaker corpus of read speech with both Latin and Cyrillic transcriptions. It offers high-quality recordings and includes punctuation, casing, and contextual information, making it particularly valuable for robust model training. In addition, Uzbek is included in multilingual open-source datasets such as Mozilla’s Common Voice [55], although the amount of data remains limited and less curated. These corpora complement each other: USC offers diversity of speakers, FeruzaSpeech provides clarity and consistency, and Common Voice contributes with volume and crowd-sourced variation. In addition to publicly available speech corpora, several large-scale Uzbek speech datasets exist in closed access, including commercial corpora, a 392 h mobile-recorded dataset of 200 speakers [56]. Information about these corpora is presented in Table 4 below.
The following section outlines the key initiatives taken to develop and compile text corpora for the Uzbek language. The UzWaC (Uzbek Web Corpus), hosted on the Sketch Engine platform, contains around 18 million words collected from online sources and is used for lexical and frequency analysis [57]. Another significant project is the Uzbek National Corpus (UNC), which encompasses millions of words drawn from a wide range of sources, including literature, newspapers, and academic works [58]. Serving as a fundamental linguistic resource, the UNC provides extensive support for morphological and syntactic annotation tasks.
Another important dataset is the TIL Corpus, developed as part of the TIL project, which focuses on parallel Uzbek–English data for machine translation tasks [59]. The Leipzig Corpora Collection (uzb_community_2017) contains around 9 million tokens and is freely available with automatic linguistic annotation [60]. The Uzbek Corpus Sample, hosted on GitHub, is a small open-access corpus of approximately 100,000 sentences suitable for basic text analysis tasks [61]. The Text Classification Dataset includes more than 500,000 news articles and is used for training and evaluating topic classification models [62]. In addition, Uzbek is included in multilingual datasets such as the OPUS collection, particularly in corpora like OpenSubtitles, GlobalVoices, and Tatoeba, though the size and quality of Uzbek content vary significantly [63]. These resources serve different purposes: the UNC provides linguistic richness and structural annotation, the TIL Corpus is valuable for translation and alignment, and OPUS-based datasets contribute with multilingual breadth. Furthermore, some large-scale Uzbek textual corpora remain in closed access, including commercial datasets containing millions of sentences from web-crawled content, social media, and online publications. Information on these corpora is summarized in Table 5 below.
In recent years, there has been significant progress in the creation and publication of Turkish speech corpora, which is especially important for the tasks of automatic speech recognition (ASR), text-to-speech (TTS), linguistic analysis, and language learning. Among the presented resources, the Spoken Turkish Corpus (STC) developed by METU stands out as one of the most valuable academic corpora, including synchronized audio and transcriptions obtained from natural conversations [64]. The Middle East Technical University Speech Corpus focuses on structured speech and is a balanced corpus useful for building acoustic models and testing microphone speech [65]. The Mozilla Common Voice Turkish Corpus is a crowdsourced open resource, available without restrictions, and is actively used in educational and applied projects [11]. Despite its small size (about 22 h), it is regularly updated by the community. The largest by volume is the ASR-BigTurk Conversational Speech Corpus, which contains over 1500 h of spontaneous speech, making it particularly valuable for industrial applications, but its commercial license limits access for research projects [66]. The PhonBank Turkish Istanbul Corpus offers unique child speech data collected as part of projects studying phonological development and is of great importance for clinical linguistics and speech therapy [67]. Also noteworthy is the TAR Corpus, developed by the ITU NLP Group, as it is one of the few fully open resources that includes a variety of speech genres: reading, spontaneous dialogues, and interviews. It provides both audio data and verified transcriptions, making it easy to integrate into machine learning [68].
Thus, there is a growing diversity of Turkish speech corpora in both genres and intended uses—from academic research to commercial use. At the same time, problems remain related to limited access to large and well-annotated corpora, especially for spontaneous and dialectal speech recognition tasks. The development of open and accessible corpora such as the Mozilla Common Voice Turkish and the Turkish Broadcast News Speech & Transcripts corpus (developed by the ITU NLP Group, approximately 130 h of professionally recorded and transcribed news data) contributes to the democratization of research in the field of Turkish speech processing and supports the multilingual development of artificial intelligence systems. Information about these corpora is presented in Table 6 below.
The most significant project is the METU Turkish Corpus, which has pioneered the systematic collection of modern Turkish texts in a balanced genre composition. Despite its relatively small size (~2 million words), it contains high-quality markup (XCES and discourse annotation), making it valuable for academic research [69,70]. The largest in terms of scope and volume is the TS Corpus, which includes over 1.3 billion tokens. It covers both written and web texts, with automatic morphological and syntactic annotation. It is accessible via a user-friendly web interface, although downloading the raw data is limited [71,72]. Another large-scale resource is trTenTen20, a web text corpus of almost 5 billion words, accessible through the commercial Sketch Engine platform and widely used in statistical linguistic research [73]. The Turkish National Corpus (TNC) is the official national corpus of Turkey. It is an academically verified resource (~50 million words), with a carefully thought-out genre structure and time coverage (2000–2009). However, its access is limited: the full data are not always open, and access to it is carried out through the website interface [74,75,76]. In contrast, trWaC, created using CLARIN technologies, is a lightweight version of the web corpus and is available for free [77]. A special category is made up of corpora with syntactic and semantic annotations, such as ITU Web Treebank and Boun Treebank, made in the Universal Dependencies (UD) format. These resources are focused on training and testing parsing models and are intended for use in NLP competitions and educational projects [78,79].
Political discourse is represented in the ParlaMint-TR corpus, which contains speeches from members of the Turkish parliament from 2011 to 2021. The corpus is annotated with metadata about the speakers (gender, party, position) and provided with morphological markup, which makes it useful for studying political rhetoric and analyzing opinions [80].
Overall, the diversity of available Turkish-language text corpora enables effective solutions to a wide range of linguistic and engineering problems. At the same time, certain limitations remain, including limited access to the largest resources (TNC, TS Corpus) and underrepresentation of certain genres, such as fiction and social media. Information about these corpora is presented in Table 7 below.
Despite the diversity of corpora found through OPUS and other sources, the analysis reveals that Turkish still lacks sufficiently large, well-annotated, and balanced open-access resources, especially for speech-related systems.
Multilingual corpora play an important role in the development of modern speech processing and translation technologies. However, their linguistic diversity is limited. Most resources focus on high-resource languages, while morphologically complex and agglutinative languages, including Turkic, are either poorly represented or completely absent. Therefore, existing multilingual models require additional adaptation and data expansion to effectively work with Turkic languages.

3. Materials and Methods

The proposed methodology for constructing a parallel audio corpus for Turkic languages is implemented as a sequential workflow comprising several stages. In the first step, a validated Kazakh audio-text corpus is selected as the initial dataset. Transcriptions are then extracted from the corpus and automatically translated into the target Turkic languages (Turkish, Uzbek, and Tatar) using the NLLB-200 model. The resulting translations are synthesized into audio files using modern speech synthesis systems (MMS-TTS, TurkicTTS, ElevenLabs, and CoquiTTS). Then, each initial unit is mapped to four interconnected components: Kazakh audio, Kazakh transcription, translated text, and synthesized audio in the target (Turkish, Tatar, or Uzbek) language. The final stage involves corpus evaluation and validation using automated metrics (BLEU, chrF) to assess translation quality, along with manual verification of naturalness and pronunciation consistency. This overview provides a conceptual description of the methodological pipeline and serves as a guide for the detailed discussion in Section 3.1, Section 3.2, Section 3.3, Section 3.4 and Section 3.5.

3.1. Cascade Audio Corpus Generation

Collecting speech corpora for low-resource languages with the conventional cascade pipeline (STT—TTT—TTS) has several serious drawbacks, especially when working with Turkic languages; these drawbacks motivate the streamlined cascade scheme, built on an existing validated corpus, that is described below.
The cascade scheme, with its sequential data processing from transcription to translation and speech synthesis, is highly scalable. It utilizes available resources at every stage, such as audio corpora, LLM translators, and TTS models, allowing for seamless adaptation to new languages or domains. Its scalability highlights the framework’s flexibility, requiring modifications only to specific modules when expanding or adjusting its scope. This scheme, with its use of a ready-made corpus (audio and transcription) and automated tools (an LLM for translation and a TTS for synthesis), offers a significant reduction in labor and financial costs. This cost-effectiveness is a key advantage, especially when compared to the full cycle of recording and manual data processing.
In this study, we introduced a highly efficient cascade technology for forming a parallel audio corpus. This innovative approach is based on the reuse of an existing resource—a structured Kazakh-language audio corpus with verified transcriptions. By eliminating the need for speakers and studio recordings, as well as manual audio segmentation, which is typical for processing unstructured Internet sources, we have significantly streamlined the process.
At the first stage, text transcriptions of the Kazakh corpus are automatically translated into target languages (Turkish, Uzbek, Tatar) using a large language model (NLLB-200 1.3B). The translated texts are subsequently processed by text-to-speech (TTS) systems, which produce corresponding audio files in the target languages. This process results in a fully developed parallel dataset that includes both audio and text components in the source language (Kazakh) and the respective target languages.
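As a concrete illustration of the translation stage, the following minimal sketch uses the publicly released NLLB-200 (1.3B) checkpoint through the Hugging Face transformers library; the checkpoint name, the FLORES-200 language codes (kaz_Cyrl, tur_Latn, uzn_Latn, tat_Cyrl), and the generation settings are assumptions for illustration rather than the exact configuration used in this study.

```python
# A minimal sketch of the Kazakh -> target-language translation step,
# assuming the public "facebook/nllb-200-1.3B" checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "facebook/nllb-200-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, src_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def translate(sentence: str, target_lang: str = "tur_Latn") -> str:
    """Translate one Kazakh sentence into the given FLORES-200 target language."""
    inputs = tokenizer(sentence, return_tensors="pt")
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_length=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("Қазақ тілі — түркі тілдерінің бірі.", "tur_Latn"))
```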
To quantify the accuracy of machine translation produced by the NLLB-200 (1.3B) model, the system was evaluated using two widely adopted automatic metrics: BLEU and chrF. In the absence of standardized benchmark datasets for the Kazakh → Turkish, Kazakh → Uzbek, and Kazakh → Tatar translation directions, a manually curated test set comprising 1000 parallel sentences for each target language was constructed. The evaluation results are provided in Table 8.
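Corpus-level BLEU and chrF can be computed with the sacreBLEU library; in the sketch below, the 1000-sentence test set is assumed to be stored as plain-text hypothesis and reference files, and the file names are illustrative.

```python
# A minimal sketch of the automatic evaluation step with sacreBLEU;
# file names and the one-sentence-per-line format are assumptions.
import sacrebleu

with open("kk_tr_hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("kk_tr_references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Corpus-level scores over the manually curated test set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```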
The developed approach significantly reduces the amount of work and financial costs required to form audio corpora in several languages. The advantages of the proposed approach include:
  • no need for manual recording and segmentation of audio;
  • complete automation of translation and voice-over stages;
  • the ability to scale to multiple target languages with minimal resources;
  • focus on supporting low-resource Turkic languages, for which the number of open parallel corpora is minimal.
The architecture of the cascade technology for forming a parallel audio corpus is implemented as a sequence of steps. This process aims to transform audio data in the Kazakh language into multilingual text–audio pairs (Turkish, Tatar, and Uzbek). The process includes the following steps (Figure 1):
  • Initial audio corpus (L1):
    As a starting point, a ready-made audio corpus in the Kazakh language (L1) is used, including synchronized pairs of “audio + transcribed text”. The source of the corpus is Nazarbayev University (subsets of 17 K and 12 K sentences).
  • Text transformation (L1 → L2):
    The transcribed Kazakh text is passed to the input of a large language model (LLM), which performs automatic machine translation into the target language (L2), for example, Turkish, Uzbek, or Tatar.
  • Text-to-speech (TTS, L2 → audio):
    The resulting translation is used as input for a neural network text-to-speech (TTS) model, which generates an audio file corresponding to the translated text in L2.
  • Parallel data generation:
As a result of cascade processing, a complete dataset is created for each source sentence:
  • audio file in Kazakh (L1),
  • transcription in Kazakh (L1),
  • translation into the target language (L2),
  • generated audio file in the target language (L2).
This resulted in four interconnected output components: (1) audio and (2) text in the source Kazakh language, and (3) audio and (4) text in the target language. This structure provides parallel correspondence at both text and audio data levels, making the corpus suitable for training and evaluating models in speech machine translation and multilingual TTS/STT.
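For illustration, each parallel unit can be stored as a single manifest record that links the four components; the field names and file paths below are hypothetical and do not reflect a storage format prescribed by the corpus.

```python
# Illustrative manifest record linking the four components of one parallel
# unit; field names and file paths are hypothetical, not a prescribed schema.
import json

record = {
    "id": "unit_000123",
    "src_lang": "kk",
    "tgt_lang": "tr",
    "src_audio": "audio/kk/unit_000123.wav",      # natural Kazakh recording (L1)
    "src_text": "Бүгін ауа райы жақсы.",          # Kazakh transcription (L1)
    "tgt_text": "Bugün hava güzel.",              # NLLB-200 translation (L2)
    "tgt_audio": "audio/tr/unit_000123_tts.wav",  # synthesized Turkish audio (L2)
}

with open("manifest_kk_tr.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```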
Despite the high quality of output data achieved using cascade speech translation technology, this approach places high demands on resources at the stage of preparing training corpora. Building a reliable corpus involves several meticulous steps: organizing studio recordings with native speakers, manually segmenting audio files, performing transcription, conducting linguistic proofreading, and verifying the final transcriptions. These processes are highly labor-intensive and costly, particularly for low-resource languages like Kazakh, where access to extensive audio data and qualified experts remains limited. The impact of these challenges is profound, as they directly influence the progress and availability of speech translation technologies. Given these limitations, this paper also considered an alternative, less labor- and time-consuming method for preparing a parallel audio corpus. In Section 3.2, we introduce the use of crowdsourced data and automated tools for corpus preparation, which can significantly reduce the resources and time required to create corpora.

3.2. Text-First Cascade Audio Corpus Generation for Turkic Languages

Classical approaches using automatic speech recognition (ASR) for languages with limited resources, in particular Turkic languages (Kazakh, Uzbek, Kyrgyz, Tatar), are associated with significant difficulties. A significant number of recognition errors, a shortage of large speech databases, pronunciation variability, and richness of vocabulary hinder the effective use of STT technologies for the formation of high-quality parallel audio corpora. These limitations indicate the need for alternative methodologies.
The TTT approach is a reliable tool that offers a more accurate solution, guaranteeing high quality and reliability of the obtained data. In this methodology, parallel text pairs are first created manually or semi-automatically. Then, each side is processed using text-to-text (TTT) systems. This approach enables complete control over all critical parameters of the parallel corpus, from the semantic accuracy of translation to consistent sentence alignment and stylistic equivalence of texts. This meticulous control ensures the preservation of the grammatical and morphological integrity characteristic of agglutinative languages such as Kazakh, Uzbek, Kyrgyz, Tatar, and others.
First, controlling the process at the text level allows for eliminating distortions that often occur during automatic speech transcription. Moreover, the use of modern speech synthesis systems trained on high-quality and verified data ensures the generation of sound that is virtually indistinguishable from natural speech while accurately conveying the content of the source text. This emphasis on the naturalness of sound creates a sense of authenticity and closeness to the original.
We present an approach to forming audio corpora for Turkic languages, utilizing speech synthesis technology to eliminate the traditional stage of recording audio data with speaker participation. The source material is Kazakh texts, which are fed to the input of a neural machine translation system, particularly the NLLB (No Language Left Behind) model. This system provides automatic translation from Kazakh into closely related Turkic languages: Turkish, Tatar, and Uzbek. The translated texts are then processed using speech synthesis systems (TTS—Text-to-Speech), resulting in the formation of audio files corresponding to the translated texts. Thus, based on the synthesis, a parallel audio corpus is formed for the specified languages. Additionally, the original Kazakh texts are also synthesized using the same TTS system, allowing for the generation of synthetic audio data in the source language.
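A minimal sketch of this text-first generation loop is shown below; it reuses a translate() function such as the one sketched in Section 3.1 and assumes a hypothetical synthesize() wrapper around whichever TTS backend (MMS-TTS, TurkicTTS, or CoquiTTS) is selected.

```python
# Text-first generation: both the Kazakh source sentence and its machine
# translation are synthesized, so the resulting pair is synthetic on both
# sides. translate() is the NLLB-200 helper sketched earlier; synthesize()
# is a hypothetical wrapper around the chosen TTS backend.
def build_text_first_pair(kk_sentence: str, target_lang: str, pair_id: str) -> dict:
    tgt_sentence = translate(kk_sentence, target_lang)  # Kazakh -> target text
    kk_audio = synthesize(kk_sentence, lang="kaz_Cyrl",
                          out_path=f"{pair_id}_kk.wav")
    tgt_audio = synthesize(tgt_sentence, lang=target_lang,
                           out_path=f"{pair_id}_{target_lang}.wav")
    return {
        "src_text": kk_sentence, "src_audio": kk_audio,
        "tgt_text": tgt_sentence, "tgt_audio": tgt_audio,
    }
```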
This approach is designed to automate the process of creating audio corpora, offering high scalability and reducing the dependence on resources required in the traditional approach involving speakers. This automation is a key feature of the methodology. An illustrative diagram of the proposed methodology is presented in Figure 2.
Thus, the TTT → TTS approach, which consists of the formation of parallel text pairs (TTT) with their subsequent pronunciation by speech synthesis systems (TTS), ensures not only high accuracy of parallel corpora but also the necessary flexibility for working with languages with limited resources, where automatic methods have not yet demonstrated sufficient reliability. This approach, with its inherent flexibility, becomes especially valuable for creating training datasets needed for the development of multilingual models for translation, speech generation and analysis, as well as the digital preservation and development of minority languages.
Using the Text-to-Text (TTT) method instead of Speech-to-Text (STT) when creating parallel audio data appears more justified from both theoretical and practical perspectives. This is especially relevant in the context of machine translation of speech, building multilingual systems, and forming synthetic parallel speech corpora for Turkic languages.
Firstly, the TTT method achieves high accuracy through the manual or semi-automatic creation of parallel texts. This approach is particularly effective for Turkic languages, which are characterized by rich morphological agglutination, the process of forming words by combining multiple morphemes, along with flexible word order and diverse grammatical sentence structures.
The use of TTT is particularly beneficial for low-resource Turkic languages, such as Uzbek, Turkmen, Kyrgyz, and even Kazakh, where available resources remain limited. In these cases, STT often produces high recognition error rates due to insufficient model training, limited speech data, substantial dialectal variation, accents, noise, and phonetic distortions. TTT, on the other hand, significantly reduces these risks, providing a more reliable basis for corpus construction.
In addition, the use of TTT enables the standardization of input texts, automatic normalization, morphological analysis, and filtering, which is critical when building training corpora for machine translation systems and multilingual speech synthesis. This is especially relevant in research on creating multimodal models and in projects like KazParC, TurkicASR, and KazakhTTS, where parallel text-audio data are generated for dozens of languages, including Turkic.
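As an illustration of such preprocessing, the sketch below applies simple rule-based normalization and length filtering to input sentences before they are passed to the translation and synthesis stages; the specific rules are illustrative assumptions rather than the normalization pipeline used in this work.

```python
# Illustrative text normalization and filtering before MT/TTS;
# the rules below are simple examples, not the project's actual pipeline.
import re

def normalize(sentence: str) -> str:
    """Collapse whitespace and strip characters that TTS systems rarely handle."""
    sentence = sentence.strip()
    sentence = re.sub(r"\s+", " ", sentence)               # collapse whitespace
    sentence = re.sub(r"[^\w\s,.!?\-«»'’]", "", sentence)  # drop stray symbols
    return sentence

def keep(sentence: str, min_words: int = 3, max_words: int = 40) -> bool:
    """Filter out fragments that are too short or too long to synthesize cleanly."""
    return min_words <= len(sentence.split()) <= max_words

raw_lines = ["  Қазақ тілі – түркі тілдерінің бірі!!  ", "Иә."]
cleaned = [normalize(s) for s in raw_lines if keep(normalize(s))]
print(cleaned)  # only the first sentence passes the length filter
```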
The proposed approach to creating audio corpora, based on automatic translation and speech synthesis, is of particular value for low-resource languages. It allows for a significant reduction in time and resource costs by eliminating the need for studio recording with speakers and special equipment. The complete automation of the process ensures high reproducibility and scalability, making it possible to quickly generate large volumes of audio data in several Turkic languages. This method opens up the possibility of creating speech resources even for those languages where traditional methods of collecting audio corpora are difficult or impossible. Its potential impact on the digital representation of low-resource languages in modern information systems is significant, contributing to their increased visibility and accessibility.

3.3. TTS Systems for Turkic Languages

Despite the advances in speech synthesis, not all modern TTS systems currently support Turkic languages, and many of them require additional adaptation and training efforts. The TTS systems considered below are described in terms of their degree of openness, language support, and voice quality.
MMS-TTS, developed by Meta AI, is a truly advanced open-source speech synthesis system [81]. It supports more than 1100 languages, including resource-poor Turkic languages such as Kazakh, Uzbek, Kyrgyz, and Tatar. The system, presented in May 2023 on GitHub, has quickly become one of the most ambitious projects in the field of multilingual speech synthesis, catering to a wide range of linguistic diversity.
MMS-TTS uses a unified multilingual architecture based on Transformer and HiFi-GAN technologies, which allows it to work effectively with typologically diverse languages, including agglutinative Turkic languages. The system relies on cross-lingual phoneme representations and joint acoustic modeling, which makes it especially effective for low-resource languages.
It stands out for its high naturalness of synthesized speech in the Kazakh language, with correct intonation and accent design. This impressive capability is made possible by the use of open corpora KazakhTTS and KazakhTTS2, which significantly facilitate additional training and adaptation of models.
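For reference, the following sketch shows how an MMS-TTS model can be invoked through the Hugging Face transformers library; the checkpoint identifier "facebook/mms-tts-kaz" is an assumption based on the naming of the publicly released MMS checkpoints, and the example is not tied to the exact models used in this study.

```python
# A minimal sketch of Kazakh synthesis with an MMS-TTS checkpoint via
# transformers; the model ID is an assumed public checkpoint name.
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-kaz")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kaz")

text = "Сәлеметсіз бе! Бұл синтезделген сөйлеудің мысалы."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).waveform[0]  # mono waveform tensor

scipy.io.wavfile.write(
    "kk_mms_sample.wav",
    rate=model.config.sampling_rate,  # 16 kHz for MMS-TTS checkpoints
    data=waveform.numpy(),
)
```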
MMS-TTS is not just a theoretical concept but a practical tool actively used in scientific initiatives and open-source projects such as TTS4All and TurkicTTS, which attests to its reliability and relevance in the research community.
Despite its strengths, MMS-TTS is not without limitations. One notable issue is the limited variety of voices, with most models lacking female speakers. The system also struggles with supporting dialects and tonal variations, particularly in the Tatar language. Furthermore, the lack of support for the Latin alphabet in the Uzbek language, the official script in Uzbekistan, poses a significant challenge. MMS either fails to voice such texts or makes segmentation errors. The scarcity of large open corpora in the Uzbek Latin alphabet further complicates the training and objective assessment of model quality. In some instances, models designed for English prosody yield inconsistent results when synthesizing Turkish speech, particularly in terms of stress and intonation.
Despite these technical and resource limitations, MMS-TTS remains a significant solution for speech synthesis in Turkic languages within the open software context. Its unique focus on Turkic languages, high naturalness of synthesized speech, and potential for further development make it a valuable tool for researchers, developers, and students interested in speech synthesis and open-source AI projects.
TurkicTTS is a significant step forward in the development of speech synthesis technologies for Turkic languages, oriented toward linguistic inclusivity and digital accessibility [82]. Designed explicitly for Kazakh, Uzbek, Kyrgyz, Tatar, and other Turkic languages, it takes into account their phonetic and morphological features, as well as the scarcity of language resources in public corpora.
Like Meta AI’s MMS-TTS, TurkicTTS leverages modern deep learning architectures such as Transformer and HiFi-GAN, but with a unique focus on the Turkic language family. These architectures enable a more accurate adaptation of models to the agglutinative structure of languages, complex stress patterns, and pronunciation variability. The system’s extensive use of cross-linguistic phoneme representations ensures robust generation even for dialectal forms and rare sound combinations. One of the key strengths of TurkicTTS is its high expressiveness and naturalness of synthesized speech, even when working with limited data. This, coupled with its technical maturity, makes TurkicTTS a reliable and advanced tool for Turkic language speech synthesis.
The most developed model in the system is the Kazakh model in TurkicTTS, based on the KazakhTTS2 corpus (about 270 h of speech from five speakers, including male and female voices), which ensures high-quality synthesis and stable performance on long and complex phrases. However, spoken forms in Kazakh often sound mechanical, with limited intonation and stylistic variability. The model does not support emotional coloring, regional accents (variations in pronunciation specific to a region), or voice diversity, which makes the intonation monotonous and predictable, with a lack of natural pauses and rhythmic dynamics. Turkish is characterized by a more natural intonation, especially in neutral speech. Uzbek is supported in the Latin script, and at the standard speed, synthesized sentences sound rhythmic and precise. Although the overall quality of Uzbek synthesis is currently inferior to Kazakh and Turkish, with less natural sound, a limited number of speakers, pronounced synthetic artifacts, and weak intonation modulation, the system can be further trained on user-provided corpora, which offers the potential to improve quality and adapt the Uzbek model to specific tasks.
ElevenLabs TTS, a cutting-edge commercial speech synthesis system, provides robust support for Turkic languages and is capable of producing highly natural speech with accurate intonation and pauses [83]. Since 2024, ElevenLabs has gradually introduced support for Turkic languages into its multilingual TTS platform. Leveraging extensive multilingual corpora and advanced deep learning architectures, the system consistently generates high-quality speech in Kazakh, Turkish, Uzbek, and Tatar, ensuring reliable and fluent performance. A key advantage of ElevenLabs is its wide range of available voices, offering variations in timbre and tone that enable users to customize output to their preferences. Its handling of Turkish speech is particularly impressive, maintaining proper rhythm and flow even in lengthy sentences. Unlike many free systems, ElevenLabs fully supports the Latin alphabet—an essential feature for modern Uzbek. However, certain limitations persist. The system may mispronounce foreign names and terms, and difficulties arise when processing specific Uzbek letters (such as o’, g’, sh, and ch), which can affect pronunciation accuracy. Additionally, it sometimes struggles to capture the natural stress and intonation patterns of Uzbek, especially in formal contexts.
Unlike academic or open solutions such as MMS-TTS from Meta AI or the specialized TurkicTTS system, the ElevenLabs platform operates on a subscription model with usage restrictions. Even the basic subscription tiers impose strict monthly character limits, which can become a bottleneck under intensive use, for instance when voicing long texts or generating many samples.
The system does not provide open-source code, does not allow local installation, and excludes the possibility of additional training for specific tasks or accents. In addition, the stability and quality of synthesis in Uzbek and Tatar may be inferior to that in Kazakh and Turkish, probably because the language corpora used in training differ in completeness. Thus, ElevenLabs TTS entails financial costs, especially for large-scale tasks, where a paid subscription with a limited character volume may not be economically feasible in the long term.
CoquiTTS is a modern open speech synthesis system created by the Coqui.ai team on the basis of the Mozilla TTS project [84]. The system was officially launched in December 2021 by former Mozilla employees, who developed a fully open platform for speech generation. The main distinguishing feature of CoquiTTS is its ability to work with various languages, including Turkic ones (Kazakh, Turkish, Tatar). The system has a modular structure that allows users to build individual voice models even with limited data, opening opportunities for high-quality speech synthesis in Turkic languages that previously lacked sufficient support in voice technologies. Its technological base includes modern components (Tacotron 2, Glow-TTS, HiFi-GAN) that provide precise intonation control and natural sound. The platform supports several languages simultaneously and allows voice characteristics to be customized for specific needs. The highest synthesis quality is achieved for Kazakh, in particular when using the KazakhTTS2 corpus, and for Turkish, where the VITS architecture implemented within CoquiTTS performs well; speech in these languages is distinguished by clarity, intonational expressiveness, and natural melody. The Uzbek language (in ElevenLabs) also demonstrates moderately high quality, but in terms of intonation, elaboration, and fluency it is inferior to Kazakh and Turkish.
However, CoquiTTS provides no official models for Uzbek and Tatar, and developing them requires independent training on the corresponding corpora, which complicates the initial setup for low-resource languages. In addition, intonation errors may occur during generation, especially in long phrases, and semantic accents can be lost, which is especially critical for Uzbek and Tatar, where intonation contours play a significant role in conveying meaning.
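For illustration, the sketch below shows how speech can be synthesized with the Coqui TTS Python API using the multilingual XTTS-v2 voice-cloning model mentioned later in the experiments. This is a minimal example under stated assumptions, not the exact configuration used in this study; the model identifier, reference-audio path, and example sentence are placeholders.

```python
# Minimal sketch (not the exact pipeline of this study): synthesizing Turkish speech
# with the Coqui TTS Python API and the multilingual XTTS-v2 voice-cloning model.
# The model name, reference audio, and output path are placeholders.
from TTS.api import TTS

# Load the multilingual XTTS-v2 model (downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the timbre of a short reference recording and voice a Turkish sentence.
tts.tts_to_file(
    text="Merhaba, bu bir deneme cümlesidir.",  # "Hello, this is a test sentence."
    speaker_wav="reference_speaker.wav",         # short, clean recording of the target voice
    language="tr",                               # XTTS-v2 language code
    file_path="output_tr.wav",
)
```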
Modern speech synthesis systems (MMS-TTS, ElevenLabs, CoquiTTS, TurkicTTS) have made significant strides in supporting Turkic languages, and each has its advantages: TurkicTTS is tailored to the language family and covers Kazakh, Turkish, Uzbek, and Tatar; MMS is a versatile, free system that works with many languages; ElevenLabs is a commercial system that produces high-quality, expressive speech; and CoquiTTS allows users to train their own models. While Kazakh and Turkish are currently the best-supported Turkic languages, significant challenges remain for Uzbek and Tatar.
Thus, the systems considered are reliable and flexible tools for generating speech in Turkic languages. However, their effective use requires careful data preparation, which is especially critical when working with low-resource languages.

3.4. Datasets

This study employed two types of corpora: (1) an audio corpus with transcriptions, consisting of 1000 recordings, for implementing the cascade speech synthesis approach, and (2) a text corpus comprising 4000 sentences for the Text-First method. In total, 5000 sentences were used in the experimental part.
For the cascade approach, the audio data were drawn from the Nazarbayev University (NU) corpus, which contains 29,000 professional studio recordings with accurate transcriptions in Kazakh. The NU corpus was selected as the primary source of real speech data after a detailed comparison of existing public and partially accessible databases by key parameters: audio recording quality, correctness of transcriptions, and accuracy of segmentation; it fully met the established requirements for our research tasks. Because generating synthetic speech with commercial TTS systems is costly, a representative sample of 1000 sentences was formed for the experiments, reflecting the diversity of the language’s vocabulary and grammatical structures.
For testing the Text-First Cascade Audio Corpus Generation methodology, a test set of 4000 sentences was used. This dataset was used to generate speech with four different TTS systems, enabling cross-system analysis based on both quantitative and qualitative evaluation criteria. The test set included sentences in four Turkic languages: Kazakh, Uzbek, Turkish, and Tatar, with 4000 sentences per language, resulting in a total of 16,000 unique test instances. All sentences in the test set were disjoint from the training data to ensure experimental validity.
The experimental part included the dataset consisting of 1000 high-quality audio recordings, along with their verified transcriptions in Kazakh. These samples served as the reference dataset for the Cascade Audio Corpus Generation methodology. In addition, we prepared 4000 text sentences covering diverse thematic domains (e.g., everyday communication, news, technical discourse, literature) to ensure variability and representativeness in the generated audio for the Text-First Cascade Audio Corpus Generation for Turkic languages.
Together, these datasets of 5000 sentences were used in the experiments with the four TTS systems: MMS, TurkicTTS, ElevenLabs, and Coqui TTS.
The audio files generated by all four TTS systems across the 5000 sentences had a total duration of 508,129 s (8468.8 min, or approximately 141.1 h), distributed among the TTS systems as shown in Table 9.
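As an illustration of how per-system duration statistics such as those in Table 9 can be obtained, the sketch below sums the lengths of the generated WAV files with the soundfile library; the directory layout and system names are assumptions.

```python
# Minimal sketch: aggregating the duration of generated audio per TTS system.
# Assumes WAV files are stored under one folder per system; paths are placeholders.
import glob
import soundfile as sf

systems = ["mms", "turkictts", "elevenlabs", "coqui"]
for system in systems:
    total_sec = 0.0
    for path in glob.glob(f"generated_audio/{system}/**/*.wav", recursive=True):
        info = sf.info(path)                  # reads the header only, no full decode
        total_sec += info.frames / info.samplerate
    print(f"{system}: {total_sec:.0f} s = {total_sec/60:.2f} min = {total_sec/3600:.2f} h")
```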
Each TTS model is represented by several voice variants: CoquiTTS used one female voice for Kazakh, Tatar, and Uzbek and one male voice for Turkish; ElevenLabs used one female voice for Kazakh, Tatar, and Uzbek and one male voice for Turkish; MMS used two male voices for Kazakh, Turkish, Tatar, and Uzbek; and TurkicTTS used four male voices for Kazakh, Turkish, Tatar, and Uzbek.
In addition, a monolingual text corpus in Kazakh was compiled, consisting of 100,000 sentences covering topics such as medicine, politics, and social affairs. The Kazakh texts were sourced from formal media and institutional publications and were intended to both enrich the linguistic base and serve as input for TTS synthesis.
An additional subcorpus of 5000 news headlines was extracted from the Kazakh-language news portal 24.kz. Headlines were selected for their linguistic purity and formal style.
These corpora provide broad linguistic coverage of contemporary Kazakh and serve as a robust empirical foundation for research in speech synthesis, TTS evaluation, and cross-lingual modeling.

3.5. Metrics

To evaluate TTS (text-to-speech) systems without original audio (reference-free), the following metrics were used: MOSNet, DNSMOS, NISQA, and SSL-MOS.
The MOSNet (Mean Opinion Score Network) metric estimates the perceived naturalness of synthetic speech, mimicking the human judgment typically captured via Mean Opinion Scores (MOS) [85]. This metric is especially valuable in text-to-speech (TTS) and voice conversion systems, where producing human-like audio is a key objective. MOSNet is a deep neural network model trained on large-scale human-annotated datasets to learn the mapping between audio and the corresponding MOS values. After training, it functions as a non-intrusive quality estimator that takes an audio signal as input, extracts relevant spectral features, and outputs a predicted MOS score reflecting the naturalness and intelligibility of the speech (1).
\mathrm{MOSNet}(x) = f_{\theta}\big(\mathrm{mel}(x)\big) = \hat{y}_{\mathrm{MOS}} \in [1, 5]
where $f_{\theta}$ is a neural network regressor, $\mathrm{mel}(x)$ is the mel spectrogram of the input audio, and $\hat{y}_{\mathrm{MOS}}$ is the predicted MOS.
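The following sketch illustrates the mapping in Equation (1) as a simple mel-spectrogram-based regressor in PyTorch. It is a schematic stand-in rather than the published MOSNet architecture, which combines convolutional and recurrent layers; all hyperparameters here are illustrative.

```python
# Schematic (not the published MOSNet architecture): a non-intrusive MOS regressor
# that maps a mel spectrogram to a single predicted score in [1, 5], as in Eq. (1).
import torch
import torch.nn as nn
import torchaudio

class MelMOSRegressor(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.encoder = nn.GRU(input_size=n_mels, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        mel = self.mel(waveform).transpose(1, 2)      # (batch, frames, n_mels)
        _, hidden = self.encoder(mel)                 # summarize the utterance
        score = self.head(hidden[-1]).squeeze(-1)     # one scalar per utterance
        return 1.0 + 4.0 * torch.sigmoid(score)       # constrain to the MOS range [1, 5]

model = MelMOSRegressor()
audio = torch.randn(1, 16000)                         # 1 s of dummy audio at 16 kHz
print(model(audio))                                   # predicted MOS, e.g. tensor([3.1])
```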
DNSMOS (Deep Noise Suppression Mean Opinion Score) is a neural network-based metric developed by Microsoft researchers within the framework of the Deep Noise Suppression (DNS) Challenge [86]. It aims to automatically estimate the MOS, a standard subjective measure of speech quality rated by human listeners on a scale from 1 (poor) to 5 (excellent). Unlike traditional methods, DNSMOS is non-intrusive, meaning that it does not rely on a clean reference signal; instead, it operates directly on noisy or synthesized audio inputs to produce an estimated MOS score.
The model utilizes a neural function (2):
\hat{y} = f_{\theta}\big(\mathrm{Features}(x)\big)
where $x$ is the input audio signal, $\mathrm{Features}(x)$ denotes spectral representations (e.g., mel spectrograms or log-power spectra), and $f_{\theta}$ is a neural network parameterized by $\theta$ that outputs the predicted MOS score $\hat{y}$.
Training is performed by minimizing the Mean Squared Error (MSE) between predicted and actual MOS values, as defined by Equation (3):
L_{\mathrm{MOS}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
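As a brief worked illustration of Equation (3), the snippet below computes the MSE objective for a handful of dummy predicted and listener-rated MOS values.

```python
# Worked illustration of Eq. (3): MSE between predicted and listener-rated MOS values.
# The numbers are dummy values used only to demonstrate the computation.
import torch

y_pred = torch.tensor([3.8, 4.2, 2.9, 4.5])   # predicted MOS
y_true = torch.tensor([4.0, 4.1, 3.2, 4.6])   # human ratings
loss = torch.mean((y_pred - y_true) ** 2)     # (1/N) * sum (y_hat_i - y_i)^2
print(loss.item())                            # approximately 0.0375
```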
The NISQA (Non-Intrusive Speech Quality Assessment) metric assesses the naturalness and intelligibility of synthesized speech when no reference (clean) signal is available for TTS systems. Its main task is to predict speech quality from the standpoint of human perception in the presence of the noise and distortions that arise during synthesis. NISQA is based on a neural network model trained on large datasets of quality ratings collected from real users. The model receives an audio file, extracts spectral and statistical features (for example, log-mel spectrograms), and produces predicted scores reflecting overall quality, noise level, and the presence of gaps. The score gives a reliable indication of perceived quality: the closer to 5, the more natural and pleasant the sound.
The most recent of these approaches is SSL-MOS (Self-Supervised Learning MOS). SSL-MOS is a state-of-the-art neural model designed to assess the quality of synthesized speech automatically, without human intervention or reference audio. It utilizes pre-trained self-supervised speech models (e.g., HuBERT, wav2vec 2.0, data2vec) to extract deep, universal acoustic features, making it more robust, portable, and accurate than classic MOS predictors such as MOSNet (4). The high correlation of SSL-MOS with real MOS ratings makes it a reliable tool for speech quality assessment [87].
\hat{y} = f_{\theta}\big(\mathrm{SSLFeatures}(x)\big)
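A schematic of an SSL-MOS-style predictor is shown below: utterance-level wav2vec 2.0 features are pooled and passed through a small regression head, as in Equation (4). The checkpoint name is an assumption, and published SSL-MOS models fine-tune the backbone rather than keeping it frozen.

```python
# Schematic SSL-MOS-style predictor: wav2vec 2.0 features pooled over time,
# followed by a small regression head (Eq. (4)). Checkpoint id is an assumption.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLMOSPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        features = self.backbone(waveform).last_hidden_state   # (batch, frames, dim)
        pooled = features.mean(dim=1)                           # utterance-level embedding
        return 1.0 + 4.0 * torch.sigmoid(self.head(pooled)).squeeze(-1)

predictor = SSLMOSPredictor()
with torch.no_grad():
    print(predictor(torch.randn(1, 16000)))   # predicted MOS for 1 s of dummy audio
```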
Scores predicted by metrics such as MOSNet, NISQA, SSL-MOS, and DNSMOS are interpreted as follows:
  • 5.0 ‘Excellent’—speech sounds completely natural;
  • 4.0–4.9 ‘Good’—speech is almost indistinguishable from real;
  • 3.0–3.9 ‘Acceptable’—unnatural elements are heard, but speech is understandable;
  • 2.0–2.9 ‘Poor’—speech is distorted, and the synthetics are noticeable;
  • 1.0–1.9 ‘Very poor’—speech is difficult to understand or incomprehensible.
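For reference, this banding can be expressed as a small helper function; it is an illustrative mapping, not part of the metric implementations themselves.

```python
# Illustrative helper that maps a predicted MOS-style score to the verbal bands above.
def mos_category(score: float) -> str:
    if score >= 5.0:
        return "Excellent"    # speech sounds completely natural
    if score >= 4.0:
        return "Good"         # almost indistinguishable from real speech
    if score >= 3.0:
        return "Acceptable"   # unnatural elements audible, but speech is understandable
    if score >= 2.0:
        return "Poor"         # distorted, clearly synthetic
    return "Very poor"        # difficult to understand

print(mos_category(4.27))     # "Good"
```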
The obtained values of the synthesized speech quality metrics (MOSNet, NISQA, SSL-MOS, DNSMOS) are presented in the Results section for each of the languages considered.

3.6. Ethics/Data-Use Statement

In this study, we used a mixed-type audio corpus: (a) a subset of real speech recordings from the licensed Nazarbayev University (NU) corpus; and (b) synthetic speech generated via TTS systems (MMS, TurkicTTS, ElevenLabs, Coqui TTS). The real-speech portion was used under the existing licensing terms of the NU corpus; we did not record any new human voice data for the purposes of this study.
As recommended by Yang et al. [88], we documented the provenance and nature of all audio data, distinguishing real-speech and synthetic-speech components to ensure transparency in data use and compliance with privacy and ethical standards.
For the synthetic-speech portion, all audio samples were generated from text-only inputs; therefore, they do not represent personally identifiable voice or biometric data tied to any living individuals.
Study participants—native-speaker listeners—were involved solely to perform listening tests (listening to and rating audio samples). Prior to participation, all listener-evaluators provided informed voluntary consent. Their responses and any associated metadata are stored anonymously and will not be publicly disclosed.
Access to raw generated audio files is restricted. While synthetic speech samples were used internally for evaluation and analysis, raw synthetic audio will not be publicly released without explicit control and approval. Any audio shared will be stripped of metadata that could enable speaker identity tracing or misuse, ensuring robust anonymization and minimizing privacy risks.
We commit to providing audio data only under conditions that prevent misuse, require explicit acknowledgment of synthetic origin, and prohibit impersonation or unethical use.
Because our study combines licensed real-speech data and newly generated synthetic audio, without recording new voice talents, the requirement for obtaining additional consent from speakers/voice talents does not apply.

4. Results

Audio samples in Kazakh, Turkish, Tatar, and Uzbek were generated using four TTS systems: MMS, TurkicTTS, ElevenLabs, and Coqui TTS. Each model was deployed with its own parameters and computational environment.
  • MMS-TTS (https://huggingface.co/facebook/mms-tts, accessed on 1 August 2025) integrates a text encoder, stochastic duration predictor, flow-based decoder, and vocoder. Input text is processed end-to-end, producing speech directly from linguistic embeddings.
  • TurkicTTS (https://github.com/IS2AI/TurkicTTS, accessed on 1 August 2025) is a multilingual end-to-end model based on Tacotron 2, with IPA-based transliteration and a Parallel WaveGAN vocoder. Fine-tuning uses configurable hyperparameters (e.g., attention threshold, length ratios, speed control α = 1.0).
  • Coqui TTS (https://github.com/coqui-ai/TTS, accessed on 1 August 2025) employs XTTS-v2, an end-to-end multilingual voice cloning model. It takes text, a reference speaker’s audio, and a language code as inputs, running with CUDA-enabled PyTorch for GPU-accelerated inference.
  • ElevenLabs TTS (https://elevenlabs.io/app/speech-synthesis/text-to-speech, accessed on 1 August 2025) operates via a Python API client (v2.11.0). Speech generation is handled entirely on ElevenLabs’ cloud infrastructure, with text, voice_id, and model_id as inputs. Hyperparameter tuning is not available to users.
The hardware setup was as follows:
  • MMS and TurkicTTS required a high-performance station (Intel Core i7 CPU, RTX 4090 GPU, 1 TB SSD).
  • Coqui TTS and ElevenLabs were run on a lighter setup (Intel Core i7 CPU, RTX 2070 GPU, 1 TB SSD) as they do not demand extensive local resources.
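As an example of the MMS-TTS setup listed above, the sketch below loads a single-language VITS checkpoint through the Hugging Face transformers library and writes one synthesized sentence to a WAV file. The checkpoint identifier (facebook/mms-tts-kaz for Kazakh) and the example sentence are assumptions, and this is a simplified illustration rather than the full generation pipeline used in this study.

```python
# Simplified sketch: generating one Kazakh utterance with an MMS-TTS (VITS) checkpoint
# via Hugging Face transformers. The checkpoint id and sentence are assumptions.
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-kaz")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kaz")

inputs = tokenizer("Қазақ тіліндегі сөйлеу синтезінің мысалы.", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform          # shape: (1, num_samples)

scipy.io.wavfile.write(
    "mms_kazakh_example.wav",
    rate=model.config.sampling_rate,             # MMS-TTS models generate at 16 kHz
    data=waveform.squeeze().numpy(),
)
```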
The results of evaluating the quality of synthesized speech for Kazakh–Turkish, Kazakh–Uzbek, and Kazakh–Tatar using the MOSNet, DNSMOS, NISQA, and SSL-MOS metrics are shown in Table 8, Table 9, Table 10 and Table 11. These metrics also capture aspects of intonation, expressiveness, and linguistic adequacy. Intonation refers to the pitch patterns and melodic contours of speech that signal emphasis, emotion, and sentence structure. Expressiveness describes the degree to which speech conveys emotion, style, and natural variation beyond the literal words. Linguistic adequacy is the extent to which the spoken output correctly reflects the intended text in terms of grammar, pronunciation, and semantic meaning.
Although these metrics are designed to evaluate the naturalness and perceptual audio quality, their scores have the potential to reflect aspects of intonation, expressiveness, and linguistic adequacy in generated speech.
Higher scores on these metrics generally correspond to more natural acoustic patterns and fewer distortions, which often co-occur with well-formed prosody and clearer articulation. SSL-MOS, owing to its reliance on self-supervised speech representations, can be sensitive to unnatural pauses or inconsistent phoneme transitions, features that relate to intonation and linguistic clarity. Similarly, NISQA’s multi-dimensional modeling of speech quality captures subtle variations in timbre and continuity that influence expressiveness and linguistic adequacy.
This section presents the developed parallel audio corpora for Kazakh–Turkish, Kazakh–Uzbek, and Kazakh–Tatar (Table 10, Table 11, Table 12 and Table 13), along with the results of evaluating the quality of the synthesized speech using formal metrics aimed at measuring its naturalness, intonation, expressiveness, and auditory perception.
The evaluation of the obtained results was carried out for all four models: MMS, TurkicTTS, ElevenLabs, and Coqui TTS. Their advantages and drawbacks were analyzed in the experimental results.
For MMS, the MOSNet score in Kazakh was close to 3.91, and DNSMOS was in the range of 3.25 to 3.30. For Turkish, the MOSNet score was slightly above 4.0, while SSL-MOS often exceeded 3.4 at larger data sizes, reflecting better perceptual similarity than for Kazakh. In Uzbek, MOSNet again stayed around 4.0, with NISQA reaching 4.27, showing stronger expressiveness than in Kazakh but less than ElevenLabs. In Tatar, MMS achieved a MOSNet score above 4.0, indicating balanced performance across all metrics. Overall, MMS is a stable audio generation system, but it does not outperform the commercial ElevenLabs model.
TurkicTTS showed MOSNet results similar to MMS, with scores from 3.9 to 4.0. It also reached strong NISQA scores for Kazakh and Tatar speech, indicating good potential for expressive and intonational quality. However, its SSL-MOS values dropped below 3.0, revealing difficulties in perceptual quality despite good naturalness. TurkicTTS demonstrated stable performance for Uzbek: in terms of naturalness it achieved solid results broadly comparable to the other systems, and the generated speech is generally clear and intelligible. High NISQA scores indicated that TurkicTTS generated speech with rich variation, making it sound less monotone than MMS or Coqui.
ElevenLabs is the clear leader across all metrics, showing better scores than MMS, TurkicTTS, and Coqui TTS. In Kazakh, MOSNet reached 4.1, whereas MMS and TurkicTTS obtained scores around 3.9; the NISQA scores were above 4.6, a clear advantage over TurkicTTS and MMS. In Turkish, ElevenLabs was slightly behind the MMS and TurkicTTS models: while MMS and TurkicTTS performed reliably in naturalness, ElevenLabs was balanced in most aspects, including naturalness, clarity, and perceptual similarity. For Uzbek, ElevenLabs showed robust results, maintaining high naturalness and significantly outperforming TurkicTTS in perceptual similarity, producing expressive speech that also sounds convincingly human. The MOSNet score reached approximately 4.14, while the NISQA score of 4.6 outperformed TurkicTTS and MMS. Compared with TurkicTTS, which achieved a strong NISQA but weak SSL-MOS, ElevenLabs delivered high scores across every metric simultaneously. It is important to note that ElevenLabs is a commercial platform that operates on a subscription basis.
The last platform, Coqui TTS, remains the weakest system. Although it generates reasonably natural-sounding speech, it lacks clarity, expressiveness, and perceptual similarity, which restricts its suitability for high-quality applications. While its MOSNet scores were strong, with values of around 3.98, its DNSMOS results were often below 3.0, and its NISQA values were lower than those of the other models. In Kazakh, DNSMOS values dropped below 2.9, highlighting weaker noise robustness and clarity compared with ElevenLabs or MMS. For Turkish, Coqui TTS achieved a similar MOSNet value of 3.99, suggesting a naturalness baseline comparable to MMS or TurkicTTS. Its DNSMOS improved slightly, averaging 3.1–3.3, but was still behind ElevenLabs, which consistently stayed above 3.3.
In Uzbek, Coqui again obtained MOSNet values close to 3.98, showing that the naturalness metric alone did not reveal its deeper weaknesses. Its DNSMOS values remained low, around 2.96–3.0, indicating problems with clarity under noisier conditions. NISQA scores rose slightly to about 3.2–3.6 but still lagged behind TurkicTTS and ElevenLabs. For Tatar, Coqui repeated the same pattern: MOSNet stayed near 3.98–3.99, indicating basic naturalness, while DNSMOS was particularly low, sometimes dipping below 2.9, which reduced perceived clarity. Although Coqui TTS requires significant improvement in noise reduction and naturalness, it retains potential for further development.
Summarizing the performance of the TTS models, the evaluation results highlight clear distinctions among them. It was important to consider all four metrics when comparing audio quality across TTS systems and languages, rather than relying on a single metric. Generally, scores above 4.0 for all metrics indicate high-quality sound. In the case of NISQA, some audio sets scored below 4.0; for example, for the 4000 audio samples generated by MMS and TurkicTTS, the NISQA scores were 2.991 and 2.992, respectively. Evaluating each system, MMS provides a steady and balanced baseline across all languages, consistently delivering natural and intelligible speech but lacking the expressiveness and perceptual similarity of stronger systems. TurkicTTS shows high potential, particularly in capturing prosody and intonation, with strong expressiveness in several languages; however, its performance is uneven, as it struggles to achieve convincing perceptual similarity, making its speech sound less human-like despite being lively. ElevenLabs emerged as the most advanced system, consistently outperforming all others across every metric and language: it combines high naturalness, clarity, expressive variation, and perceptual realism, producing speech that not only scores well numerically but also sounds authentically human. Coqui TTS, in contrast, was the weakest overall: while it achieved reasonable naturalness, its scores in noise robustness, expressiveness, and similarity remained significantly lower, resulting in outputs that sounded flat and less realistic than those of the other models.
The results were also statistically evaluated using the t-test and ANOVA. The t-test was applied to the MOSNet metric: the scores for the Kazakh language were taken for all four models (MMS, TurkicTTS, ElevenLabs, and Coqui TTS) across the 1 K, 4 K, and 5 K audio sets. In the t-test, every pair of models was compared with each other. The test yields two statistics: the t-statistic and the p-value. The t-statistic shows how large the difference between the two groups is relative to the variation within each group, and the p-value gives the probability of observing such a difference by chance if the two groups did not actually differ. The results are shown in Table 14.
The pairwise comparison revealed strong numerical distinctions between models. ElevenLabs achieved significantly higher scores than all other models, as shown by large t-stat and near-zero p-values against MMS (t-stat = −82.96, p-value = 0.0000002243), TurkicTTS (t-stat = −11.28, p-value = 0.00729), and Coqui TTS (t-stat = −51.77, p-value = 0.00000217). Coqui TTS outperformed MMS with a substantial difference, as t-stat = −24.21 and p-value = 0.0000186. In contrast, MMS and TurkicTTS did not differ significantly, with t-stat = −1.15 and p-value = 0.3653, indicating similar performance, and the comparison between TurkicTTS and Coqui TTS likewise did not reach significance, with t-stat = −3.03 and p-value = 0.0902.
Generally, the t-test revealed significant differences between most model pairs. ElevenLabs consistently outperformed all other models, with p-value < 0.01 and particularly large effect sizes when compared to MMS and Coqui TTS. Coqui TTS also showed significantly higher scores than MMS, with a p-value < 0.001. In contrast, MMS–TurkicTTS and TurkicTTS–CoquiTTS comparisons did not reach statistical significance, having a p-value > 0.05, which indicates similar levels of performance for these models. Overall, ElevenLabs demonstrated the highest objective speech quality, followed by Coqui TTS, TurkicTTS, and MMS.
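These pairwise comparisons can be reproduced with SciPy; the sketch below uses the independent-samples t-test on placeholder MOSNet means for the 1 K, 4 K, and 5 K sets (the numbers are illustrative, not the scores reported in Table 14).

```python
# Sketch of one pairwise comparison with SciPy's independent-samples t-test.
# The arrays are placeholder MOSNet means for the 1 K, 4 K, and 5 K audio sets,
# not the actual values used in the study.
from scipy import stats

mms_scores        = [3.91, 3.89, 3.90]
elevenlabs_scores = [4.10, 4.12, 4.14]

t_stat, p_value = stats.ttest_ind(mms_scores, elevenlabs_scores)
print(t_stat, p_value)   # a large negative t_stat and small p_value favour ElevenLabs
```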
The next statistical test was the Analysis of Variance (ANOVA), which determines whether three or more groups differ significantly in their mean values. In the context of TTS model evaluation, ANOVA examines whether the objective speech-quality scores differ statistically across the models. Instead of comparing only two systems at a time, as in the t-test, ANOVA evaluates all models simultaneously, determining whether there is an overall effect of the model on the measured metric. If the differences between models are much larger than the natural variability within each model, ANOVA produces a large F-statistic and a small p-value, indicating that at least one model performs differently from the others. The F-statistic measures how much the model means differ relative to their internal variability, and the p-value shows whether the differences could have occurred by chance. Since ANOVA compares all models together, the experiments used all four languages: Kazakh, Turkish, Tatar, and Uzbek. The results are shown in Table 15.
The ANOVA results showed that the differences in MOSNet scores among the TTS models were statistically significant across all four languages. For Kazakh, the model effect was strong, with an F-statistic of 122.245 and a p-value of 5.07 × 10−7, indicating substantial variation in performance. Similarly, Tatar and Uzbek exhibited very strong model effects, with an F-statistic of 109.476 and a p-value of 7.79 × 10−7, and an F-statistic of 212.629 and a p-value of 5.76 × 10−7, respectively, confirming that the models differed significantly for these languages. The large F-statistics reflect large between-model differences relative to the internal variability of each model’s scores. In contrast, Turkish showed a smaller but still statistically significant effect, with an F-statistic of 5.792 and a p-value of 0.0211, suggesting that model differences stood out less than in the other languages. Overall, the ANOVA results provide strong statistical evidence that the choice of TTS model has a meaningful impact on output quality, with the strongest effects observed for Kazakh, Tatar, and Uzbek, and a moderate effect for Turkish.
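The corresponding one-way ANOVA can be computed with SciPy's f_oneway, as sketched below for a single language; again, the score arrays are placeholders rather than the study's actual values.

```python
# Sketch of a one-way ANOVA across the four TTS systems for a single language.
# Placeholder MOSNet scores are used purely to illustrate the call.
from scipy import stats

mms        = [3.91, 3.89, 3.90]
turkictts  = [3.95, 3.92, 3.96]
elevenlabs = [4.10, 4.12, 4.14]
coqui      = [3.98, 3.97, 3.99]

f_stat, p_value = stats.f_oneway(mms, turkictts, elevenlabs, coqui)
print(f_stat, p_value)   # a large F and small p indicate that at least one model differs
```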
In addition, experiments were conducted in which the generated audio in Uzbek, Turkish, and Tatar was recognized and translated back into Kazakh to assess how closely the result matched the original text. The results were evaluated using the BLEU, WER, TER, METEOR, and chrF scores.
The score values for 1000 sentences are shown below (Table 16).
BLEU, WER, TER, and chrF range from 0 to 100, and METEOR ranges from 0 to 1. For BLEU, METEOR, and chrF, higher values indicate closer agreement between the original and back-translated texts; for WER and TER, the opposite holds, with lower values indicating better agreement. Together, these metrics characterize the translation accuracy of the texts.
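The sketch below shows how such scores can be computed for pairs of original and back-translated Kazakh sentences using sacrebleu and jiwer; the library choices and sentence pairs are assumptions, and METEOR (available, for example, in NLTK) is omitted for brevity.

```python
# Sketch of the back-translation evaluation: comparing the Kazakh text recovered
# via ASR + MT against the original Kazakh sentences. Library choices (sacrebleu,
# jiwer) are assumptions; the sentence pairs are placeholders.
import sacrebleu
import jiwer

originals = ["Бүгін ауа райы жақсы.", "Ол кітап оқып отыр."]
recovered = ["Бүгін ауа райы жақсы.", "Ол кітапты оқып отыр."]

bleu = sacrebleu.corpus_bleu(recovered, [originals]).score   # 0-100, higher is better
chrf = sacrebleu.corpus_chrf(recovered, [originals]).score   # 0-100, higher is better
ter  = sacrebleu.corpus_ter(recovered, [originals]).score    # 0-100, lower is better
wer  = 100 * jiwer.wer(originals, recovered)                 # 0-100, lower is better
print(bleu, chrf, ter, wer)
```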
On a small dataset (1000 sentences), the Tatar-Kazakh translation performed best, while Turkish–Kazakh was the weakest. On a large dataset (4000 sentences), quality improved significantly across all dimensions: Turkish–Kazakh and Tatar–Kazakh translations demonstrated the highest scores across all metrics, while Uzbek–Kazakh remained weaker but also significantly improved.
Uzbek performed worse because its corpora and models often mix Latin and Cyrillic scripts. This graphical heterogeneity leads to errors in text processing: the system does not always interpret characters correctly, reducing the accuracy of translation and speech synthesis. As a result, the quality of text generation for Uzbek was lower than for the other Turkic languages, which use a single, stable script.
Table 17 presents the total number of developed parallel audio corpora for the Cascade and Text-First Cascade Audio Corpus Generation methods for Turkic languages.
Figure 3 shows screenshots of fragments of the developed audio corpus uploaded to the Hugging Face platform. At the current stage, the repository is closed and not made publicly available, since the project plans to register copyrights for the collected audio materials in the languages considered in this study. Premature publication of the corpus may create risks associated with the legal protection of data and limit the possibilities of its subsequent official use. After completing all the necessary legal procedures, the corpus is planned to be made publicly available.
To facilitate understanding, Table 18 provides English translations of the corpus samples shown in Figure 3.
The tables above present the performance metrics of the various models across the languages studied. Note that automated metrics do not always reflect perceptual qualities such as the naturalness or authenticity of a speaker’s speech. Therefore, to confirm these results, an additional manual audio evaluation was conducted by native speakers of each language.
Fifteen native speakers evaluated the quality of the synthesized audio: 4 for Uzbek, 3 for Tatar, 3 for Turkish, and 5 for Kazakh. The participants were university students and faculty members aged 18 to 42 who regularly use their native language in daily life. The participants were selected from different age groups and with varying professional experience. Native speakers listened to and rated 100 audio samples in each of the four languages.
The evaluation process included both synthesized audio and real recordings. The real recordings closely matched the generated audio in terms of language, speaker profile, duration, vocabulary, and recording conditions. Audio samples were randomly assigned to the experts, who did not know the condition labels to avoid bias.
This design provided sufficient statistical power for subsequent analyses (t tests, ANOVAs, effect sizes) and strengthened methodological transparency, thereby reducing potential bias and improving the robustness of the findings.

5. Discussion

A major limitation of current multilingual TTS systems, including the TTS component of Meta AI’s Massively Multilingual Speech project (MMS-TTS), is the lack of support for the Uzbek language in its Latin script. Although Uzbek holds official status and is one of the more widely spoken Turkic languages, its integration within MMS remains minimal and insufficiently developed.
In current versions of the model, only Cyrillic variants are available, or Uzbek support is absent altogether. Although the Latin alphabet has been officially adopted in Uzbekistan, the Cyrillic script remains ubiquitous in media, older books, and school materials. This discrepancy severely hampers the evaluation of synthesis quality with automated metrics (for example, DNSMOS or MOSNet): the model either fails to produce speech or distorts the input text, so synthetic Uzbek speech in Latin orthography cannot be generated within MMS without additional model adaptation, and the synthesis cannot be compared with other languages or scripts within a single experimental procedure.
Therefore, the absence of support for the Uzbek Latin alphabet in MMS-TTS not only restricts the technology’s accessibility for native speakers but also hinders its objective assessment with existing metrics, diminishing both the scientific and practical value of the model in the context of Turkic TTS. Possible remedies include adding Latin-script support via custom preprocessing, retraining the model with the appropriate orthography, or including official Latin-script texts in the training corpora of future MMS versions.
When generating audio sets of different sizes, a clear pattern emerged: the 1000-sample sets yielded the highest metric values, the 4000-sample sets the lowest, and the 5000-sample sets fell in between. This pattern appeared in nearly all languages and models, showing that quality decreases as data volume increases but can stabilize on mixed sets.
Across the Kazakh, Turkish, Uzbek, and Tatar languages, the MMS, TurkicTTS, ElevenLabs, and Coqui TTS models revealed distinct performance patterns in naturalness, expressiveness, perceptual similarity, and clarity. MMS demonstrated stable, consistent results across all four languages, with MOSNet values typically between 3.9 and 4.0 and moderate DNSMOS and SSL-MOS scores. TurkicTTS achieved competitive naturalness scores comparable to MMS and performed especially well in Kazakh and Tatar, with higher NISQA scores indicating improved intonation and expressiveness; however, its relatively low SSL-MOS values below 3.0 suggest persistent issues with perceptual similarity, making the generated voices sound less convincingly human. Overall, while MMS and TurkicTTS produce natural and intelligible speech, their expressiveness and perceptual realism remain limited, particularly when compared with commercial systems such as ElevenLabs.
ElevenLabs consistently outperformed all other systems, achieving superior results across every metric and language, with MOSNet values above 4.1 and NISQA often exceeding 4.5, reflecting exceptional clarity, prosodic richness, and perceptual realism. A significant drawback of ElevenLabs, however, is its commercial model: the service is expensive, operates on a subscription basis, and requires additional payment once the audio generation limit is exhausted. It is worth noting separately that the new ElevenLabs v3 model already supports the Kazakh language and places stresses correctly, which significantly increases the naturalness of synthesis; at present, however, v3 is available only through the web interface and does not yet have an API, which limits its use in automated scenarios. Coqui TTS, in contrast, ranked lowest among the models: while it produced reasonably natural speech with MOSNet around 3.9–4.0, its DNSMOS and SSL-MOS scores below 3.0 indicate weaknesses in clarity and perceptual similarity. Despite these limitations, Coqui’s performance suggests potential for improvement, especially in noise handling and expressiveness.
The MOSNet, DNSMOS, NISQA, and SSL-MOS scores were complemented with t-tests and ANOVA. The t-test analysis further highlighted performance differences among the TTS models by identifying which systems differed significantly. Across languages, ElevenLabs consistently demonstrated superior performance, with large t-statistics and p-values very close to zero in comparisons against MMS, TurkicTTS, and CoquiTTS, indicating that its high objective scores are unlikely to be due to chance. CoquiTTS also outperformed MMS with a strongly significant difference, reinforcing its position as a mid-level performer. In contrast, the comparisons between MMS and TurkicTTS, as well as between TurkicTTS and CoquiTTS, yielded non-significant p-values, suggesting that these models produce broadly similar quality levels. These findings show that ElevenLabs achieved the highest objective speech quality, CoquiTTS obtained moderate but reliable performance, and MMS and TurkicTTS remained close together with comparable results.
The ANOVA results provide further evidence of meaningful performance differences among the evaluated TTS models. Across Kazakh, Tatar, and Uzbek, the analysis produced large F-statistics with small p-values, below 1 × 10−6, indicating that the variation between models is far greater than the within-model variability. These strong effects show that the choice of model has a substantial impact on MOSNet scores for these languages, and that the models diverge sharply in their ability to generate high-quality synthetic speech. The Turkish dataset, with an F-statistic of 5.792 and a p-value of 0.021, showed a much smaller effect size, suggesting that model performance in Turkish is more clustered and less separable than in the other languages. This pattern may be attributed to greater data availability, language-specific acoustic properties, or smaller performance gaps among models for Turkish. Overall, the ANOVA findings demonstrate that model architecture and training strategy play a major role in determining TTS quality, with the most significant performance differentiation observed in Kazakh, Tatar, and Uzbek, while Turkish showed more moderate variation across models.
Although the study focused primarily on automated metrics, a limited manual analysis was conducted to partially verify the naturalness of the synthesized speech. A small expert group of 15 native speakers was engaged to provide preliminary validation of naturalness and language identity across the target languages. The evaluators were exposed to both generated and natural control audio that were matched in language, speaker characteristics, duration, lexical content, and recording conditions, and all samples were presented in a randomized, fully blinded manner.
Recent advances in TTS have led to synthetic speech that can be nearly indistinguishable from the human voice, which raises important ethical and societal concerns. As argued in [88], beyond technical metrics, researchers must address the risks of misuse, privacy violations, and data provenance transparency.
In our work, we followed these recommendations by providing a transparent “Ethics/Data-use Statement”, stating the sources and nature of data, the consent procedure for listeners, and the absence of new speaker recordings. We believe that this will help to ensure that our evaluation remains not only scientifically rigorous, but also ethically responsible and socially conscientious.
Nevertheless, we recognize that realistic synthesized speech, even when anonymized, can potentially be misused (e.g., for impersonation, fraudulent audio, or misinformation). Therefore, we caution downstream users and recommend that any use or distribution of synthetic audio be accompanied by informed consent, clear labeling of synthetic origin, and, where possible, technical safeguards or usage restrictions. We encourage future work on TTS and voice synthesis to adopt responsible-use policies and transparent data-governance practices.
The evaluators were native speakers of their respective languages. The manual analysis did not attempt to evaluate detailed linguistic correctness in the other languages; instead, the focus was on how natural the samples sounded, the authenticity of the speakers, and whether the synthesized samples kept the correct language identity without mixing languages. The evaluation showed clear differences between models and languages. According to the assessment, the TurkicTTS model produced the most natural speech synthesis for Kazakh, ElevenLabs performed best for Turkish, and MMS showed superior quality for Uzbek. For Tatar, all models displayed noticeable foreign accents, although MMS performed somewhat better. Despite their limited scope, these results partially confirm the findings of the automated metrics and help interpret the differences between the models.
There are limitations to evaluating large numbers of samples (audio or text), as concentration can be reduced, leading to random errors and less consistent ratings. Human evaluation is also subjective, and individual evaluator preferences can influence the final values of metrics such as MOS, BLEU, and other indicators. To mitigate this effect, the study utilized automated metrics alongside human evaluation.
A further limitation concerns the evaluators’ linguistic proficiency: although all participants were fluent in the target language, their proficiency varied (native speakers, bilinguals, and advanced learners). This variability can affect the perceived quality of translation or speech synthesis, especially in subtle linguistic aspects such as intonation and stylistic appropriateness. In addition to the factors already discussed, the complexity of the evaluated texts represents another limitation of this study. The developed corpora did not include very long or syntactically complex sentences, which may pose additional challenges for both translation and speech synthesis systems. As a result, the findings primarily reflect performance on short to medium-length utterances. Future work should therefore incorporate longer, structurally more complex sentences to better capture system behavior under more demanding linguistic conditions.
Our experiments also showed that the audio corpora produced by the Text-First Cascade Audio Corpus Generation method match the quality of those generated by the traditional cascade method. This result enables a broader scale of research in the automatic translation of Kazakh speech and provides a viable way to create training resources in resource-constrained settings.

6. Conclusions and Future Work

Speech synthesis technologies, powered by deep neural networks, have now achieved a remarkable level of quality. The audio files they generate are so close to real human speech that they are practically indistinguishable. TTS systems using modern architectures demonstrate an exceptional degree of speech naturalness.
The development of Artificial Intelligence makes audio corpora of Turkic languages an essential resource for cultural preservation. The creation of such corpora will help millions of people gain access to modern technologies and will ensure equal opportunities for the linguistic heritage of Turkic peoples in the digital age. The quality of speech generation for Turkic languages was evaluated across four TTS models, MMS, TurkicTTS, ElevenLabs, and Coqui TTS, for Kazakh, Turkish, Uzbek, and Tatar. The experimental results showed significant quantitative and qualitative performance differences. MMS demonstrated a stable baseline, achieving MOSNet scores between 3.8 and 4.0, DNSMOS scores between 3.25 and 3.30, and NISQA scores up to 4.1, reflecting consistent naturalness and intelligibility across all languages, although its expressiveness and perceptual similarity remained moderate. TurkicTTS also reached MOSNet values of 3.9–4.0 and NISQA of around 4.2–4.3, and was especially strong in Kazakh and Tatar, but its SSL-MOS often dropped below 3.0, indicating reduced perceptual quality. ElevenLabs emerged as the clear leader, consistently outperforming all others with MOSNet scores reaching 4.2, NISQA exceeding 4.6, and DNSMOS above 3.3; these results underscore its superior naturalness, clarity, and expressive quality, producing speech that closely resembles human delivery across all evaluated languages. Coqui TTS, on the other hand, achieved MOSNet values near 3.97, but its DNSMOS often decreased to 2.83, and its NISQA remained in the range of 3.2–3.6, demonstrating significant deficiencies in clarity and robustness.
In general, MMS and TurkicTTS provide a strong open-source foundation for multilingual TTS research, achieving competitive naturalness for low-resource Turkic languages, but their perceptual and expressive quality still lags behind ElevenLabs, which sets the benchmark for human-like synthesis. It should be noted that these experiments used the Eleven Multilingual v2 model, which provides an API; even though it is not fully developed for some Turkic languages, it delivers high perceptual and expressive quality. An even newer model, Eleven v3, is under development and is better adapted to Turkic languages, specifically Kazakh; once full API access is released, it may surpass other existing TTS models. The Coqui TTS model requires major refinement in noise reduction and perceptual modeling to reach comparable standards. Future work should focus on cross-lingual adaptation and self-supervised fine-tuning to close the performance gap, particularly in expressiveness and perceptual similarity metrics.
Objective and subjective quality assessments (MOS, NISQA, and DNSMOS metrics) show that for many modern TTS systems, the average naturalness scores of synthesized speech lie in the range of 4.0–4.8 out of a possible 5, which is within the range in which listeners perceive speech as “human”. Moreover, in several experiments, listeners were unable to distinguish synthetic samples from real recordings at a statistically significant level.
Therefore, in the realm of automatic speech processing, which encompasses the training and testing of recognition, synthesis, and analysis systems, the synthetic audio corpora generated by modern TTS systems have the potential not only to supplement but also to replace natural corpora entirely. This breakthrough opens up new opportunities for rapid research scaling, cost-effective data collection, and enhanced linguistic diversity.
As part of the research, parallel audio corpora were developed for Kazakh–Turkish, Kazakh–Uzbek, and Kazakh–Tatar language pairs, with 272 K, 104 K, and 109 K audio samples, respectively. Similar corpora do not currently exist, lending significant scientific novelty to this work. Further studies are planned to train a new multilingual speech-to-text translation model. Our goal is to develop an innovative direct speech-to-speech translation system that will translate audio directly, without converting it to text. This system will ensure more natural and rapid translation between Turkic languages. This approach will ensure a worthy place for Turkic languages in modern speech technologies and open up new opportunities for intercultural communication in the digital age. This will allow us to use the resulting corpus for both training and evaluating multilingual systems.

Author Contributions

Conceptualization, A.K.; methodology, A.K. and V.K.; software, A.K. and V.K.; experiments, A.K. and V.K.; validation, A.K. and V.K.; formal analysis, B.A., D.A., D.R. and A.S.; investigation, A.K.; resources, A.K., V.K. and R.A.; data curation, A.K.; writing—original draft preparation, A.K., V.K., B.A., D.A. and D.R.; writing—review and editing, A.K., V.K. and U.T.; visualization, V.K.; supervision, A.K.; project administration, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the grant project “Study of automatic generation of parallel speech corpora of Turkic languages and their use for neural models” (grant number IRN AP AP23488624) of the Ministry of Science and Higher Education of the Republic of Kazakhstan.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The audio datasets used in our experiment are not publicly available at this stage, as we plan to register the copyright for the collected audio materials in the languages discussed in the article after the completion of the project. Publishing the data before this process is finalized may complicate their subsequent legal protection and future use. Once all necessary legal procedures are completed, we will carefully consider the possibility of making the datasets publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mussakhojayeva, S.; Dauletbek, K.; Yeshpanov, R.; Varol, H.A. Multilingual Speech Recognition for Turkic Languages. Information 2023, 14, 74. [Google Scholar] [CrossRef]
  2. Bekarystankyzy, A.; Mamyrbayev, O.; Mendes, M.; Fazylzhanova, A.; Assam, M. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets. Sci. Rep. 2024, 14, 13835. [Google Scholar] [CrossRef]
  3. Whisper. Available online: https://github.com/openai/whisper (accessed on 2 July 2025).
  4. MMS (Massively Multilingual Speech). Available online: https://github.com/facebookresearch/fairseq/tree/main/examples/mms (accessed on 15 July 2025).
  5. Soyle. Available online: https://github.com/IS2AI/Soyle (accessed on 20 July 2025).
  6. Duquenne, P.-A.; Gong, H.; Dong, N.; Du, J.; Lee, A.; Goswami, V.; Wang, C.; Pino, J.; Sagot, B.; Schwenk, H. SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 16251–16269. [Google Scholar] [CrossRef]
  7. Salesky, E.; Wiesner, M.; Bremerman, J.; Cattoni, R.; Negri, M.; Turchi, M.; Oard, D.W.; Post, M. The Multilingual TEDx Corpus for Speech Recognition and Translation. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association (Interspeech 2021), Brno, Czechia, 30 August–3 September 2021; pp. 3655–3659. [Google Scholar] [CrossRef]
  8. Wang, C.; Riviere, M.; Lee, A.; Wu, A.; Talnikar, C.; Haziza, D.; Williamson, M.; Pino, J.; Dupoux, E. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 993–1003. [Google Scholar] [CrossRef]
  9. Dogan-Schönberger, P.; Mäder, J.; Hofmann, T. SwissDial: Parallel Multidialectal Corpus of Swiss German Speech and Text. arXiv 2021, arXiv:2103.11401. [Google Scholar]
  10. Gong, H.; Dong, N.; Popuri, S.; Goswami, V.; Lee, A.; Pino, J. Multilingual Speech-to-Speech Translation into Multiple Target Languages. arXiv 2023, arXiv:2307.08655. [Google Scholar]
  11. Ardila, R.; Branson, M.; Davis, K.; Kohler, M.; Meyer, J.; Henretty, M.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 11–16 May 2020; pp. 4218–4222. [Google Scholar]
  12. Jia, Y.; Ramanovich, M.T.; Wang, Q.; Zen, H. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20–25 June 2022; pp. 6691–6703. [Google Scholar]
  13. Baali, M.; El-Hajj, W.; Ali, A. Creating Speech-to-Speech Corpus from Dubbed Series. arXiv 2022, arXiv:2203.03601. [Google Scholar]
  14. Di Gangi, M.A.; Cattoni, R.; Bentivogli, L.; Negri, M.; Turchi, M. MuST-C: A Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 2012–2017. [Google Scholar] [CrossRef]
  15. Singhal, B.; Naini, A.R.; Ghosh, P.K. wSPIRE: A Parallel Multi-Device Corpus in Neutral and Whispered Speech. In Proceedings of the 24th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Singapore, 18–20 November 2021; pp. 146–151. [Google Scholar] [CrossRef]
  16. Mussakhojayeva, S.; Khassanov, Y.; Varol, H.A. KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus with More Data, Speakers, and Topics. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20–25 June 2022; pp. 5404–5411. [Google Scholar]
  17. Kozhirbayev, Z.; Islamgozhayev, T. Cascade Speech Translation for the Kazakh Language. Appl. Sci. 2023, 13, 8900. [Google Scholar] [CrossRef]
  18. Suleymanov, D.; Nevzorova, O.; Gatiatullin, A.; Gilmullin, R.; Khakimov, B. National Corpus of the Tatar Language “Tugan Tel”: Grammatical Annotation and Implementation. Procedia-Soc. Behav. Sci. 2013, 95, 68–74. [Google Scholar] [CrossRef]
  19. Khakimov, B.; Gilmullin, R.; Gataullin, R. Grammatical Disambiguation in the Tatar National Corpus. In Proceedings of the 8th International Conference on Corpus Linguistics (CILC2016), Málaga, Spain, 2–4 March 2016; EPiC Series in Language and Linguistics; EasyChair: Manchester, UK, 2016; Volume 1, pp. 228–235. [Google Scholar]
  20. Ibragimov, T.I.; Saikhunov, M.R. Written Corpus of the Tatar Language: Structural and Functional Characteristics. In Current Issues of Dialectology of the Languages of the Peoples of Russia, Proceedings of the 14th All-Russian Scientific Conference, Pereslavl-Zalessky, Russia, 20–22 November 2014; Kazan Federal University: Ufa, Russia, 2014; pp. 261–263. (In Russian) [Google Scholar]
  21. Saikhunov, M.R.; Khusainov, R.R.; Ibragimov, T.I. System of Complex Morphological Search in the Written Corpus of the Tatar Language. In Traditional Culture of Turkic Peoples in a Changing World, Proceedings of the 1st International Scientific Conference, Kazan, Russia, 12–15 April 2017; Ministry of Culture of the Republic of Tatarstan, the Republican Center for Development of Traditional Culture, and Kazan State Conservatoire named after N.G. Zhiganov: Kazan, Russia, 2017; pp. 554–558. (In Russian) [Google Scholar]
  22. Saikhunov, M.R.; Khusainov, R.R.; Ibragimov, T.I. Challenges in Creating a Text Corpus Exceeding 400 Million Tokens. In The Finno-Ugric World in the Multiethnic Space of Russia: Cultural Heritage and New Challenges, Proceedings of the 6th All-Russian Scientific Conference of Finno-Ugric Studies, Izhevsk, Russia, 4–7 June 2019; Udmurt State University: Izhevsk, Russia, 2019; pp. 548–554. (In Russian) [Google Scholar]
  23. Mindubaev, A.; Gatiatullin, A. Problems of Semantic Relation Extraction from Tatar Text Corpus “Tugan Tel”. In Proceedings of the IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE), Novosibirsk, Russia, 15–17 November 2024; pp. 1690–1694. [Google Scholar]
  24. Khusainov, A.; Suleymanov, D.; Muhametzyanov, I. Incorporation of Iterative Self-supervised Pre-training in the Creation of the ASR System for the Tatar Language. In Text, Speech, and Dialogue. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12848, pp. 481–488. [Google Scholar] [CrossRef]
  25. Orel, D.; Kuzdeuov, A.; Gilmullin, R.; Khakimov, B.; Varol, H.A. TatarTTS: An Open-Source Text-to-Speech Synthesis Dataset for the Tatar Language. In Proceedings of the 6th International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Osaka, Japan, 19–22 February 2024; pp. 717–721. [Google Scholar] [CrossRef]
  26. Mussakhojayeva, S.; Gilmullin, R.; Khakimov, B.; Galimov, M.; Orel, D.; Abilbekov, A.; Varol, H.A. Noise-Robust Multilingual Speech Recognition and the Tatar Speech Corpus. In Proceedings of the 6th International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Osaka, Japan, 19–22 February 2024; pp. 732–737. [Google Scholar] [CrossRef]
  27. Musaev, M.; Khujayorov, I.; Ochilov, M. Automatic recognition of Uzbek speech based on integrated neural networks. In 11th World Conference “Intelligent System for Industrial Automation” (WCIS-2020); Springer: Cham, Switzerland, 2021; pp. 215–223. [Google Scholar] [CrossRef]
  28. Musaev, M.; Khujayorov, I.; Ochilov, M. Development of integral model of speech recognition system for Uzbek language. In Proceedings of the IEEE International Conference on Application of Information and Communication Technologies (AICT), Tashkent, Uzbekistan, 7–9 October 2020; pp. 1–6. [Google Scholar] [CrossRef]
  29. Musaev, M.; Mussakhojayeva, S.; Khujayorov, I.; Khassanov, Y.; Ochilov, M.; Varol, H.A. USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments. arXiv 2021. [Google Scholar] [CrossRef]
  30. Povey, A.; Povey, K. FERUZASPEECH: A 60 Hour Uzbek Read Speech Corpus with Punctuation, Casing, and Context. In Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), Trento, Italy, 19–20 October 2024; pp. 360–364. [Google Scholar]
  31. Sinan, R.; Barışçı, N. A detailed survey of Turkish automatic speech recognition. Turk. J. Electr. Eng. Comput. Sci. 2020, 28, 3253–3269. [Google Scholar] [CrossRef]
  32. Zen, H.; Sak, H. Unidirectional Long Short-Term Memory Recurrent Neural Network with Recurrent Output Layer for Low-Latency Speech Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 4470–4474. [Google Scholar] [CrossRef]
  33. Oyucu, S. A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning. Electronics 2023, 12, 1900. [Google Scholar] [CrossRef]
  34. Mercan, O.B.; Cepni, S.; Tasar, D.E.; Ozan, Ş. Performance Comparison of Pre-Trained Models for Speech-to-Text in Turkish: Whisper Small and Wav2Vec2 XLS-R 300M. arXiv 2023, arXiv:2307.04765. [Google Scholar] [CrossRef]
  35. Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227. [Google Scholar] [CrossRef]
  36. Taşar, D.E.; Koruyan, K.; Çılgın, C. Transformer-Based Turkish Automatic Speech Recognition. Acta Infologica 2024, 8, 1–10. [Google Scholar] [CrossRef]
  37. Salor, Ö.; Pellom, B.L.; Çılgın, T.; Hacıoğlu, K.; Demirekler, M. On Developing New Text and Audio Corpora and Speech Recognition Tools for the Turkish Language. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), Denver, CO, USA, 16–20 September 2002; pp. 349–352. [Google Scholar] [CrossRef]
  38. Salor, Ö.; Pellom, B.L.; Çılgın, T.; Demirekler, M. Turkish Speech Corpora and Recognition Tools Developed by Porting SONIC: Towards Multilingual Speech Recognition. Comput. Speech Lang. 2007, 21, 580–593. [Google Scholar] [CrossRef]
  39. Ağçam, R.; Bulut, A. A Corpus-Based Study on Turkish Spoken Productions of Bilingual Adults. Univers. J. Educ. Res. 2016, 4, 2032–2038. [Google Scholar] [CrossRef]
  40. Polat, H.; Oyucu, S. Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results. Symmetry 2020, 12, 290. [Google Scholar] [CrossRef]
  41. Institute of Smart Systems and Artificial Intelligence. Available online: https://issai.nu.edu.kz/ (accessed on 1 August 2025).
  42. Kazakh Language Text-to-Speech—2. Available online: https://issai.nu.edu.kz/tts2-eng/ (accessed on 1 August 2025).
  43. Kazakh Speech Corpus 2. Available online: https://github.com/IS2AI/Kazakh_ASR (accessed on 1 August 2025).
  44. KazTTS. Available online: https://github.com/IS2AI/Kazakh_TTS (accessed on 1 August 2025).
  45. National Corpus of Kazakh Language. General Information. Available online: https://qazcorpus.kz/header/jalpyMalimeten.php (accessed on 1 August 2025).
  46. National Corpus of Kazakh Language. Available online: https://qazcorpus.kz/indexen.php (accessed on 1 August 2025).
  47. TatSC_ASR (Tatar Speech Corpus). Available online: https://huggingface.co/datasets/yasalma/TatSC_ASR (accessed on 1 August 2025).
  48. Hackathon Tatar ASR. Available online: https://huggingface.co/datasets/yasalma/tat_hackathon_asr (accessed on 1 August 2025).
  49. TatarTTS. Available online: https://github.com/IS2AI/TatarTTS (accessed on 1 August 2025).
  50. Tugan Tel. Available online: https://tugantel.tatar/ (accessed on 1 August 2025).
  51. Corpus of Written Tatar. Available online: https://www.corpus.tatar/ (accessed on 1 August 2025).
  52. Tatar–English Corpus. Available online: https://huggingface.co/datasets/yasalma/tt-en-language-corpus (accessed on 1 August 2025).
  53. Uzbek Speech Corpus (USC). Available online: https://huggingface.co/datasets/issai/Uzbek_Speech_Corpus (accessed on 1 August 2025).
  54. FERUZASPEECH. Available online: https://huggingface.co/datasets/k2speech/FeruzaSpeech (accessed on 1 August 2025).
  55. Common Voice. Available online: https://commonvoice.mozilla.org/en/datasets (accessed on 1 August 2025).
  56. Uzbek Speech Recognition Corpus (Mobile). Available online: https://dataoceanai.com/datasets/asr/uzbek-speech-recognition-corpus-mobile-3/ (accessed on 1 August 2025).
  57. Uzbek Text Corpora. Available online: https://www.sketchengine.eu/corpora-and-languages/uzbek-text-corpora/ (accessed on 1 August 2025).
  58. Uzbek National Corpus. Available online: https://uzbekcorpus.uz/enNashrlar (accessed on 1 August 2025).
  59. TIL Project: Uzbek-English Parallel Corpus. Available online: http://uzbekcorpora.uz/ (accessed on 1 August 2025).
  60. Leipzig Corpora Collection: Uzbek (uzb_community_2017). Available online: https://corpora.uni-leipzig.de/en?corpusId=uzb_community_2017 (accessed on 1 August 2025).
  61. Uzbek Corpus Sample. Available online: https://github.com/elmurod1202/Uzbek-Corpus-Sample/tree/main (accessed on 1 August 2025).
  62. Text Classification Dataset and Analysis for Uzbek Language. Uzbek Text Classification Dataset. Available online: https://github.com/elmurod1202/textclassification?tab=readme-ov-file (accessed on 1 August 2025).
  63. Open Parallel Corpora (OPUS). Available online: https://opus.nlpl.eu/ (accessed on 1 August 2025).
  64. Salor, Ö.; Ciloglu, T.; Demirekler, M. METU Turkish Microphone Speech Corpus. In Proceedings of the 2006 IEEE 14th Signal Processing and Communications Applications, Antalya, Turkey, 17–19 April 2006; pp. 1–4. [Google Scholar] [CrossRef]
  65. METU Turkish Microphone Speech v1.0—LDC Catalog. Available online: https://catalog.ldc.upenn.edu/LDC2006S33 (accessed on 1 August 2025).
  66. ASR-BigTurkCSC: Turkish Conversational Speech Corpus. Available online: https://magichub.com/datasets/turkish-conversational-speech-corpus/ (accessed on 1 August 2025).
  67. PhonBank Turkish Istanbul Corpus. Available online: https://talkbank.org/phon/access/Other/Turkish/Istanbul.html (accessed on 1 August 2025).
  68. ITU Turkish Natural Language Processing Pipeline Prepared by ITU NLP Group. Available online: http://tools.nlp.itu.edu.tr/ (accessed on 1 August 2025).
  69. Say, B.; Zeyrek, D.; Oflazer, K.; Özge, U. Development of a Corpus and a Treebank for Present-Day Written Turkish. In Proceedings of the Eleventh International Conference of Turkish Linguistics, Eastern Mediterranean University, Famagusta, Cyprus, 7–9 August 2002. [Google Scholar]
  70. European Language Grid (ELG). METU Turkish Corpus (MTC)—Resource Catalogue. Available online: https://live.european-language-grid.eu/catalogue/corpus/15024 (accessed on 1 August 2025).
  71. TS Corpus: A Large-Scale Turkish Corpus with Morphological Annotation. Available online: https://tscorpus.com (accessed on 1 August 2025).
  72. Çöltekin, Ç.; Doğruöz, A.S.; Çetinoğlu, Ö. Resources for Turkish natural language processing: A critical survey. Lang Resour. Eval. 2023, 57, 449–488. [Google Scholar] [CrossRef] [PubMed]
  73. Kilgarriff, A.; Baisa, V.; Bušta, J.; Jakubíček, M.; Kovář, V.; Michelfeit, J.; Rychlý, P.; Suchomel, V. The Sketch Engine: Ten Years on. Lexicography; Springer: Berlin/Heidelberg, Germany, 2014; Volume 1, pp. 7–36. [Google Scholar] [CrossRef]
  74. Aksan, Y.; Aksan, M.; Kolcu, E.; Yildiz, I. Construction of the Turkish National Corpus (TNC). In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 23–25 May 2012; pp. 3223–3227. [Google Scholar]
  75. Turkish National Corpus (TNC) Official Website. Türkçe Ulusal Derlemi. Available online: https://live.european-language-grid.eu/catalogue/corpus/12920 (accessed on 2 December 2025).
  76. The Linguist List. Demo Access Announcement for Turkish National Corpus (TNC). Available online: https://linguistlist.org/issues/24/1024/ (accessed on 1 August 2025).
  77. ITU NLP Group. trWaC: Turkish Web as Corpus (Free CLARIN Version). Available online: https://nlp.itu.edu.tr/en/toolsandresources (accessed on 1 August 2025).
  78. Sulubacak, U.; Gökırmak, M.; Tyers, F.M.; Çöltekin, Ç.; Nivre, J.; Eryiğit, G. Universal Dependencies for Turkish. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING 2016), Osaka, Japan, 11–16 December 2016; pp. 3444–3454. [Google Scholar]
  79. Türk, U.; Atmaca, F.; Özateş, Ş.B.; Berk, G.; Bedir, S.T.; Köksal, A.; Başaran, B.Ö.; Güngör, T.; Özgür, A. Resources for Turkish dependency parsing: Introducing the BOUN Treebank and the BoAT annotation tool. Lang Resour. Eval. 2022, 56, 259–307. [Google Scholar] [CrossRef]
  80. Erjavec, T.; Ogrodniczuk, M.; Osenova, P.; Ljubešić, N.; Simov, K.; Pančur, A.; Rudolf, M.; Kopp, M.; Barkarson, S.; Steingrímsson, S.; et al. The ParlaMint corpora of parliamentary proceedings. Lang Resour. Eval. 2023, 57, 415–448. [Google Scholar] [CrossRef] [PubMed]
  81. MMS: Multilingual Text-to-Speech (TTS). Available online: https://huggingface.co/facebook/mms-tts (accessed on 2 June 2025).
  82. TurkicTTS. Available online: https://github.com/IS2AI/TurkicTTS (accessed on 10 July 2025).
  83. ElevenLabs TTS. Available online: https://elevenlabs.io/docs/capabilities/text-to-speech (accessed on 20 July 2025).
  84. CoquiTTS. Available online: https://github.com/coqui-ai/TTS (accessed on 15 July 2025).
  85. Lo, C.-C.; Fu, S.-W.; Huang, W.-C.; Wang, X.; Yamagishi, J.; Tsao, Y.; Wang, H.-M. MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), Graz, Austria, 15–19 September 2019; pp. 1541–1545. [Google Scholar] [CrossRef]
  86. Reddy, C.K.; Gopal, V.; Cutler, R. Dnsmos: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6493–6497. [Google Scholar] [CrossRef]
  87. Manocha, P.; Kumar, A. Speech Quality Assessment through MOS using Non-Matching References. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association (Interspeech 2022), Incheon, Republic of Korea, 18–22 September 2022; pp. 654–658. [Google Scholar] [CrossRef]
  88. Yang, Y.; Wang, H.; Han, B.; Liu, S.; Li, J.; Qin, Y.; Chen, X. Towards Responsible Evaluation for Text-to-Speech. arXiv 2025, arXiv:2510.06927. [Google Scholar] [CrossRef]
Figure 1. The scheme of the Cascade Audio Corpus Generation for Turkic languages.
Figure 2. The scheme of Text-First Cascade Audio Corpus Generation for Turkic languages.
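As an illustration of how such a text-first cascade can be implemented in practice, the sketch below translates a Kazakh sentence into Turkish with NLLB-200 (cf. Table 8) and synthesizes the translation with an MMS TTS voice [81]. The specific checkpoints, language codes, and output handling are illustrative assumptions, not the exact configuration used to build the corpus.

```python
# A minimal sketch of a text-first cascade step (assumed checkpoints:
# facebook/nllb-200-distilled-600M for MT, facebook/mms-tts-tur for Turkish TTS).
import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, VitsModel

# 1) Translate a Kazakh sentence into Turkish.
mt_tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="kaz_Cyrl")
mt_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

kazakh_sentence = "кеден тарифі"  # example row from Table 18
generated = mt_model.generate(
    **mt_tok(kazakh_sentence, return_tensors="pt"),
    forced_bos_token_id=mt_tok.convert_tokens_to_ids("tur_Latn"),
    max_length=128,
)
turkish_text = mt_tok.batch_decode(generated, skip_special_tokens=True)[0]

# 2) Synthesize Turkish audio for the translated sentence.
tts_tok = AutoTokenizer.from_pretrained("facebook/mms-tts-tur")
tts_model = VitsModel.from_pretrained("facebook/mms-tts-tur")
with torch.no_grad():
    waveform = tts_model(**tts_tok(turkish_text, return_tensors="pt")).waveform

sf.write("turkish_synthetic.wav", waveform.squeeze().numpy(), tts_model.config.sampling_rate)
```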
Figure 3. Fragment of the audio corpus presentation on Hugging Face.
Table 1. Analysis of audio corpora for the Kazakh language.
Name | Duration | Reference | Open or Private
Kazakh Speech Corpus 2 (KSC2) [43] | ~1200 h | https://github.com/IS2AI/Kazakh_ASR (accessed on 1 August 2025) | Open (CC BY 4.0)
KazakhTTS2 [44] | 271 h | https://github.com/IS2AI/Kazakh_TTS (accessed on 1 August 2025) | Open (CC BY 4.0)
Table 2. Analysis of audio corpora for the Tatar language.
Name | Duration | Reference | Open or Private
TatSC_ASR (Tatar Speech Corpus) [47] | 269 h | Hugging Face: https://huggingface.co/datasets/yasalma/TatSC_ASR (accessed on 1 August 2025) | Open (CC BY 4.0)
Hackathon Tatar ASR [48] | ~90 h | Hugging Face: https://huggingface.co/datasets/yasalma/tat_hackathon_asr (accessed on 1 August 2025) | Open
TatarTTS [49] | ~70 h | GitHub: https://github.com/IS2AI/TatarTTS (accessed on 1 August 2025) | Open (Apache-2.0)
Table 3. Analysis of text corpora for the Tatar language.
Name | Type | Size | Reference
Tugan Tel [50] | Monolingual | 194 million word forms (as of 15 December 2019) | https://tugantel.tatar/ (accessed on 1 August 2025)
Corpus of Written Tatar [51] | Monolingual | more than 500 million words (>620 million tokens) | https://www.corpus.tatar/ (accessed on 1 August 2025)
Tatar–English Corpus [52] | Parallel | ~7775 sentence pairs | Hugging Face, MIT license: https://huggingface.co/datasets/yasalma/tt-en-language-corpus (accessed on 1 August 2025)
Table 4. Analysis of audio corpora for the Uzbek language.
Name | Duration | Reference | Open or Private
USC | 105 h (958 different speakers) | https://huggingface.co/datasets/issai/Uzbek_Speech_Corpus (accessed on 1 August 2025) | Open (CC BY 4.0)
FeruzaSpeech | 60 h (1 speaker) | https://huggingface.co/datasets/k2speech/FeruzaSpeech (accessed on 1 August 2025) | Open
Common Voice (Uzbek) | 266 h (2000 different speakers) | Common Voice: https://commonvoice.mozilla.org/en/datasets (accessed on 1 August 2025) | Open
Uzbek Speech Recognition Corpus (Mobile) | 392 h | https://dataoceanai.com/datasets/asr/uzbek-speech-recognition-corpus-mobile-3/ (accessed on 1 August 2025) | Private
Table 5. Analysis of text corpora for the Uzbek language.
Name | Type | Size | Reference | Access
uzWaC (Uzbek Web Corpus) | Monolingual | ~18 million words | https://www.sketchengine.eu/uzwac-uzbek-corpus/ (accessed on 1 August 2025) | Via Sketch Engine; subscription required
Uzbek National Corpus | Monolingual and bilingual | ~20 million words | http://uzbekcorpus.uz (accessed on 1 August 2025) | Partially open; registration and possible restrictions
TIL (uzbekcorpora) | Bilingual | ~75 million words | http://uzbekcorpora.uz/ (accessed on 1 August 2025) | Open
Leipzig Corpora Collection (uzb_community_2017) | Monolingual | 663,119 sentences and 9,256,001 tokens | https://corpora.uni-leipzig.de/en?corpusId=uzb_community_2017 (accessed on 1 August 2025) | Open
Uzbek Corpus Sample | Monolingual | 100,000 sentences | https://github.com/elmurod1202/Uzbek-Corpus-Sample/tree/main (accessed on 1 August 2025) | Open
Text classification dataset | Monolingual | 120 million words | https://github.com/elmurod1202/textclassification?tab=readme-ov-file (accessed on 1 August 2025) | Open
OPUS | Bilingual | Uzbek–English: 34,533,094 sentences; Uzbek–Kazakh: 253,941 sentences; Uzbek–Russian: 652,625 sentences | https://opus.nlpl.eu/ (accessed on 1 August 2025) | Open
Table 6. Analysis of audio corpora for the Turkish language.
Name | Duration | Reference | Open or Private
Spoken Turkish Corpus (STC, METU) | 500,000 words | https://std.metu.edu.tr (accessed on 1 August 2025) | An email access request is required; demo and sample versions are available after registration
Middle East Technical University Speech Corpus | 500 min (120 speakers, each reading 40 sentences) | https://catalog.ldc.upenn.edu/LDC2006S33 (accessed on 1 August 2025) | Available on academic request; often used in ASR research
Common Voice Turkish (Mozilla) | 22 h of speech | https://commonvoice.mozilla.org (accessed on 1 August 2025) | Open
ASR-BigTurk Conversational Speech Corpus | 1537 h | https://magichub.com (accessed on 1 August 2025) | Commercial access; license purchase required
PhonBank Turkish Istanbul Corpus | Over 50 h (120 participants, including children’s speech, ages 2–7) | https://talkbank.org (accessed on 1 August 2025) | Open for research purposes; requires registration on TalkBank
TAR Corpus (ITU NLP Group) | 40 h of audio (reading, dialogues) | http://tools.nlp.itu.edu.tr (accessed on 1 August 2025) | Open
Table 7. Analysis of text corpora for the Turkish language.
Name | Size | Reference | Comments
METU Turkish Corpus (MTC) | 2 million words | https://ii.metu.edu.tr (accessed on 1 August 2025) | By request through the university; requires an agreement
Turkish National Corpus (TNC) | 50 million words (2000–2009) | https://live.european-language-grid.eu/catalogue/corpus/12920 (accessed on 1 August 2025) | Partially open; web interface available
TS Corpus | 1.3 billion tokens | https://tscorpus.com (accessed on 1 August 2025) | Web interface (CQPweb); searchable but not available for bulk download
Turkish Web Corpus (trTenTen20) | 4.9 billion words | https://www.sketchengine.eu/trtenten-turkish-corpus/ (accessed on 1 August 2025) | Via Sketch Engine (subscription required)
Turkish Web Corpus (trWaC) | 32 million words | https://www.clarin.si/repository/xmlui/handle/11356/1514 (accessed on 1 August 2025) | Available for free through CLARIN.SI
ParlaMint-TR (Parliament Corpus) | 43 million words (2011–2021) | https://www.clarin.eu/parlamint (accessed on 1 August 2025) | Open; downloadable in XML/TEI format
ITU Web Treebank | 5000 sentences | https://ddi.itu.edu.tr/en/toolsandresources (accessed on 1 August 2025) | Open; used for syntactic and semantic tasks (UD, PropBank)
Table 8. The evaluation of NLLB-200.
Language Pair | BLEU | chrF
Kazakh → Turkish | 21.3 | 44.8
Kazakh → Uzbek | 17.2 | 39.6
Kazakh → Tatar | 10.5 | 33.1
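The scores in Table 8 were obtained with standard corpus-level MT metrics. The sketch below shows how BLEU and chrF can be computed with sacreBLEU for a list of system outputs and references; the sentences here are placeholders, not the actual evaluation data.

```python
# A minimal sketch of corpus-level BLEU and chrF scoring with sacreBLEU.
# The hypothesis/reference sentences are placeholders for illustration only.
import sacrebleu

hyps = ["gümrük tarifesi", "çünkü bir çocuğa bakıyorum"]          # system outputs (Kazakh -> Turkish)
refs = [["gümrük tarifesi", "çünkü bir çocuğa bakıyorum"]]        # one reference stream, aligned with hyps

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```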
Table 9. Total Duration of Audio Recordings (in Seconds).
TTS System | Kazakh | Turkish | Tatar | Uzbek
1000 sentences
MMS | 9411 | 7812 | 7959 | 9164
TurkicTTS | 7298 | 8570 | 8086 | 8789
Elevenlabs | 9436 | 8150 | 9255 | 9365
Coqui TTS | 12,401 | 8503 | 11,449 | 12,169
4000 sentences
MMS | 22,278 | 19,227 | 18,905 | 21,404
TurkicTTS | 22,278 | 19,227 | 18,905 | 21,404
Elevenlabs | 21,189 | 16,361 | 20,118 | 20,061
Coqui TTS | 33,073 | 22,272 | 29,950 | 33,660
5000 sentences
MMS | 31,689 | 27,039 | 26,864 | 30,568
TurkicTTS | 29,576 | 27,797 | 26,991 | 30,193
Elevenlabs | 30,625 | 24,511 | 29,373 | 29,426
Coqui TTS | 45,474 | 30,775 | 41,399 | 45,829
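The totals in Table 9 can be recomputed directly from the generated WAV files. The sketch below sums file durations with the soundfile library; the directory layout is an assumption for illustration.

```python
# A minimal sketch: sum the duration (in seconds) of all WAV files in one directory.
# The directory path is a placeholder, not the project's actual layout.
from pathlib import Path
import soundfile as sf

def total_duration_seconds(wav_dir: str) -> float:
    total = 0.0
    for wav_path in Path(wav_dir).glob("*.wav"):
        info = sf.info(wav_path)                # reads the header only, no full decode
        total += info.frames / info.samplerate  # duration of this file in seconds
    return total

print(round(total_duration_seconds("corpus/turkish/elevenlabs_1k")))
```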
Table 10. Results for Kazakh speech.
Model \ Metrics | MOSNet | DNSMOS | NISQA | SSL-MOS
1 K audio
MMS | 3.917 | 3.307 | 3.072 | 2.950
TurkicTTS | 3.971 | 3.296 | 4.524 | 2.455
Elevenlabs | 4.149 | 3.438 | 4.613 | 3.756
Coqui TTS | 3.996 | 2.849 | 3.044 | 2.547
4 K audio
MMS | 3.910 | 3.261 | 2.991 | 3.018
TurkicTTS | 3.910 | 3.262 | 2.992 | 3.018
Elevenlabs | 4.143 | 3.371 | 4.517 | 3.638
Coqui TTS | 3.988 | 2.734 | 2.928 | 2.621
5 K audio
MMS | 3.911 | 3.270 | 3.007 | 3.004
TurkicTTS | 3.922 | 3.269 | 3.298 | 2.905
Elevenlabs | 4.145 | 3.384 | 4.536 | 3.662
Coqui TTS | 3.990 | 2.757 | 2.951 | 2.606
Table 11. Results for Turkish speech.
Model \ Metrics | MOSNet | DNSMOS | NISQA | SSL-MOS
1 K audio
MMS | 3.986 | 3.410 | 4.617 | 2.438
TurkicTTS | 3.987 | 3.332 | 4.449 | 2.434
Elevenlabs | 4.055 | 3.405 | 4.386 | 3.545
Coqui TTS | 3.990 | 3.327 | 3.893 | 3.405
4 K audio
MMS | 4.012 | 3.387 | 4.622 | 3.679
TurkicTTS | 4.012 | 3.387 | 4.622 | 3.679
Elevenlabs | 4.020 | 3.288 | 4.431 | 3.451
Coqui TTS | 3.992 | 3.237 | 3.719 | 3.342
5 K audio
MMS | 4.007 | 3.391 | 4.621 | 3.431
TurkicTTS | 4.007 | 3.376 | 4.587 | 3.430
Elevenlabs | 4.027 | 3.312 | 4.422 | 3.470
Coqui TTS | 3.992 | 3.255 | 3.754 | 3.355
Table 12. Results for Uzbek speech.
Model \ Metrics | MOSNet | DNSMOS | NISQA | SSL-MOS
1 K audio
MMS | 4.036 | 3.326 | 4.261 | 3.245
TurkicTTS | 3.993 | 3.307 | 4.439 | 2.393
Elevenlabs | 4.148 | 3.424 | 4.618 | 3.771
Coqui TTS | 3.980 | 3.012 | 3.223 | 2.715
4 K audio
MMS | 4.015 | 3.296 | 4.271 | 3.298
TurkicTTS | 4.015 | 3.296 | 4.271 | 3.298
Elevenlabs | 4.138 | 3.368 | 4.604 | 3.622
Coqui TTS | 3.980 | 2.963 | 3.620 | 2.791
5 K audio
MMS | 4.019 | 3.302 | 4.269 | 3.287
TurkicTTS | 4.011 | 3.298 | 4.305 | 3.117
Elevenlabs | 4.140 | 3.379 | 4.607 | 3.652
Coqui TTS | 3.980 | 2.973 | 3.541 | 2.776
Table 13. Results for Tatar speech.
Model \ Metrics | MOSNet | DNSMOS | NISQA | SSL-MOS
1 K audio
MMS | 4.020 | 3.387 | 3.337 | 3.293
TurkicTTS | 3.972 | 3.322 | 4.502 | 2.438
Elevenlabs | 4.150 | 3.428 | 4.560 | 3.677
Coqui TTS | 3.980 | 2.934 | 3.252 | 2.905
4 K audio
MMS | 4.016 | 3.369 | 3.540 | 3.440
TurkicTTS | 4.016 | 3.369 | 3.541 | 3.440
Elevenlabs | 4.141 | 3.339 | 4.436 | 3.579
Coqui TTS | 3.989 | 2.812 | 3.649 | 2.910
5 K audio
MMS | 4.017 | 3.373 | 3.499 | 3.411
TurkicTTS | 4.007 | 3.360 | 3.733 | 3.240
Elevenlabs | 4.143 | 3.357 | 4.461 | 3.599
Coqui TTS | 3.987 | 2.836 | 3.570 | 2.909
Table 14. The statistics of the t-test.
Models | t-Stat | p-Value
MMS–TurkicTTS | −1.153 | 0.365259737017
MMS–Elevenlabs | −82.956 | 0.000000224322
MMS–Coqui TTS | −24.213 | 0.000018606616
TurkicTTS–MMS | 1.153 | 0.365259737017
TurkicTTS–Elevenlabs | −11.277 | 0.007294295224
TurkicTTS–Coqui TTS | −3.03 | 0.090167463643
Elevenlabs–MMS | 82.956 | 0.000000224322
Elevenlabs–TurkicTTS | 11.277 | 0.007294295224
Elevenlabs–Coqui TTS | 51.765 | 0.000002165021
Coqui TTS–MMS | 24.213 | 0.000018606616
Coqui TTS–TurkicTTS | 3.03 | 0.090167463643
Coqui TTS–Elevenlabs | −51.765 | 0.000002165021
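Table 14 reports pairwise t-tests between the four TTS systems. A minimal sketch of such a comparison with SciPy is shown below, assuming a paired t-test over per-language mean scores; the score vectors are placeholders (the 5 K MOSNet means from Tables 10–13), not the exact inputs behind the reported statistics.

```python
# A minimal sketch of pairwise paired t-tests between TTS systems. Each system is
# represented by a vector of per-language mean scores (placeholder values, ordered
# Kazakh, Turkish, Tatar, Uzbek).
from itertools import permutations
from scipy import stats

scores = {
    "MMS":        [3.911, 4.007, 4.017, 4.019],
    "TurkicTTS":  [3.922, 4.007, 4.007, 4.011],
    "Elevenlabs": [4.145, 4.027, 4.143, 4.140],
    "Coqui TTS":  [3.990, 3.992, 3.987, 3.980],
}

for a, b in permutations(scores, 2):
    t_stat, p_value = stats.ttest_rel(scores[a], scores[b])
    print(f"{a}–{b}: t = {t_stat:.3f}, p = {p_value:.6f}")
```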
Table 15. The statistics of ANOVA.
Language | F-Statistic | p-Value
Kazakh | 122.245 | 0.000000506785
Turkish | 5.792 | 0.021006960888
Tatar | 109.476 | 0.000000779332
Uzbek | 212.629 | 0.000000057633
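The ANOVA in Table 15 tests whether the four TTS systems differ significantly within each language. A minimal sketch with scipy.stats.f_oneway is shown below; the per-system groups are placeholders (the 5 K Kazakh rows of Table 10), not the full score sets behind the reported F-statistics.

```python
# A minimal sketch of a one-way ANOVA across the four TTS systems for one language.
# The groups are placeholder metric values; the paper's test uses the full score sets.
from scipy import stats

kazakh_scores = {
    "MMS":        [3.911, 3.270, 3.007, 3.004],
    "TurkicTTS":  [3.922, 3.269, 3.298, 2.905],
    "Elevenlabs": [4.145, 3.384, 4.536, 3.662],
    "Coqui TTS":  [3.990, 2.757, 2.951, 2.606],
}

f_stat, p_value = stats.f_oneway(*kazakh_scores.values())
print(f"Kazakh: F = {f_stat:.3f}, p = {p_value:.12f}")
```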
Table 16. The back-translation results.
Language Pair \ Metrics | BLEU | WER | TER | METEOR | chrF
1 K audio
Uzbek–Kazakh | 25.09 | 64.48 | 52.98 | 0.51 | 62.18
Turkish–Kazakh | 18.02 | 69.31 | 58.46 | 0.45 | 56.67
Tatar–Kazakh | 28.26 | 58.73 | 47.19 | 0.57 | 65.52
4 K audio
Uzbek–Kazakh | 37.28 | 45.54 | 44.66 | 0.56 | 70.3
Turkish–Kazakh | 46.16 | 37.22 | 36.50 | 0.65 | 77.1
Tatar–Kazakh | 47.33 | 35.64 | 34.91 | 0.65 | 78.05
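The metrics in Table 16 (BLEU, WER, TER, METEOR, chrF) are standard MT and ASR evaluation measures. The sketch below shows how BLEU, chrF, and WER might be computed for one back-translation direction with sacreBLEU and jiwer; the sentences are placeholders, and TER and METEOR would be computed analogously with their respective tools.

```python
# A minimal sketch of back-translation scoring: BLEU and chrF via sacreBLEU,
# word error rate via jiwer. The sentences are placeholders for illustration.
import sacrebleu
import jiwer

back_translations = ["кеден тарифі"]   # Uzbek side translated back into Kazakh (placeholder)
original_kazakh   = ["кеден тарифі"]   # source Kazakh transcriptions (placeholder)

bleu = sacrebleu.corpus_bleu(back_translations, [original_kazakh]).score
chrf = sacrebleu.corpus_chrf(back_translations, [original_kazakh]).score
wer  = jiwer.wer(original_kazakh, back_translations) * 100  # reference first, then hypothesis

print(f"BLEU = {bleu:.2f}, chrF = {chrf:.2f}, WER = {wer:.2f}")
```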
Table 17. Developed parallel audio corpora.
Language Pair | Total Number of Audio Files
Kazakh–Turkish | 272,000
Kazakh–Uzbek | 104,000
Kazakh–Tatar | 109,000
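Table 17 summarizes the released parallel audio corpora, and Figure 3 shows their presentation on Hugging Face. A corpus of this kind can typically be loaded with the Hugging Face datasets library as sketched below; the repository identifier and column names are hypothetical placeholders, since the exact dataset schema is not restated here.

```python
# A minimal sketch of loading a parallel Kazakh–Turkish audio corpus from Hugging Face.
# "username/kazakh-turkish-parallel-speech" and the column names are hypothetical
# placeholders; substitute the actual repository ID and schema.
from datasets import load_dataset, Audio

ds = load_dataset("username/kazakh-turkish-parallel-speech", split="train")
ds = ds.cast_column("kk_audio", Audio(sampling_rate=16_000))  # decode audio lazily on access

sample = ds[0]
print(sample["kk_text"], "->", sample["tr_text"])
print("Kazakh waveform length:", len(sample["kk_audio"]["array"]))
```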
Table 18. Corpus Samples with English Translation.
Kazakh | English | Turkish
жаңғақ ағашының бұтағынан оларға аруақтар елестеді ме | did they see ghosts on the walnut tree branch? | ceviz ağacının dalından onlara cisimlenen hayaletler mi göründü
себебі бала күтімінде отырмын | because I am taking care of a child | çünkü bir çocuğa bakıyorum
пеш тоңазытқыш және сөрелер бар ас үй | kitchen with stove, refrigerator and shelves | fırın buzdolabı ve raflarla donatılmış bir mutfak
кеден тарифі | customs tariff | gümrük tarifesi
әр тақтайшаның қуаты екі жүз жетпіс ватт | the power of each plate is two hundred and seventy watts. | her plakanın gücü iki yüz yetmiş watt’tır.
ерекшелікті бағалау | assessment of uniqueness | spesifikliğin değerlendirilmesi