Article

Speech Recognition and Synthesis Models and Platforms for the Kazakh Language

Information Systems Department, Faculty of Information Technology and Artificial Intelligence, Farabi University, Almaty 050040, Kazakhstan
*
Author to whom correspondence should be addressed.
Information 2025, 16(10), 879; https://doi.org/10.3390/info16100879
Submission received: 9 July 2025 / Revised: 15 September 2025 / Accepted: 3 October 2025 / Published: 10 October 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech.

1. Introduction

The introduction of automatic speech recognition (ASR) systems offers significant benefits to various aspects of society, including increased productivity and expanded opportunities for the elderly and individuals with disabilities [1]. However, effective and safe use of these technologies requires a high degree of accuracy and the ability of the speech model to understand language in a context-sensitive manner.
ASR systems are a set of methods and algorithms aimed at converting spoken language into text form. In recent decades, ASR has evolved from experimental technology into an integral part of everyday life. Modern ASR systems are widely used in various fields, including voice assistants, automatic translation systems, smart device control, and audio data processing.
ASR for the Kazakh language presents several significant challenges, both technical and linguistic. Despite growing interest in studying and digitalizing the Kazakh language, it remains a low-resource language, particularly in the context of speech technologies. Turkic languages like Kazakh, Kyrgyz, Uzbek, and Tatar are considered low-resource languages, and all are agglutinative [2]. An agglutinative language forms words by combining morphemes and affixes, and the Kazakh language therefore has complex morphology. In Kazakh, as in most Turkic languages, words are formed by adding affixes to the root. Each affix typically conveys a single grammatical or word-formation meaning; affixes do not fuse with one another but retain their individual form. In the works [3,4,5,6], the authors explain how words are formed in Turkic languages. They note that all Turkic languages use four types of affixes: plural, personal, case, and possessive affixes. The mentioned works highlight that Turkic languages follow a universal principle of word formation based on these four main categories. The order of these affixes is not random; it is governed by strict morphological and phonological rules, especially vowel harmony, which ensures phonetic consistency and grammatical coherence. Apart from Turkish, most Turkic languages have a limited amount of available and labeled audio and text data, which complicates the development of speech recognition algorithms and other speech technology applications. Additionally, the high morphological complexity and agglutinative nature of these languages present further challenges when designing ASR models [7].
Moreover, Turkic languages have letters or diacritics that are specific to each language. Diacritics, found in Uzbek and Turkish, are orthographic marks that indicate a change in the pronunciation of the letter they are attached to. For ASR systems, removing or incorrectly restoring diacritics, as well as misrecognizing such letters, directly increases CER/WER and generates false matches with frequent but semantically incorrect words.
Kazakh phonology has a unique set of letters that represent sounds not found in most Indo-European languages. These letters capture important phonetic differences and are key to maintaining the system of vowel harmony, a defining feature of the language. The Kazakh language has nine unique letters. They are: ‘ә’, ‘қ’, ‘ғ’, ‘ң’, ‘ү’, ‘ұ’, ‘і’, ‘ө’, ‘һ’.
Among the vowels, Ә (ә) denotes the front open vowel [æ], similar to the English word “map.” ‘Ө (ө)’ represents the front rounded vowel [ø], similar to the English word ‘Murphy’. ‘Ү (ү)’ corresponds to the front rounded vowel [y], like the ‘u’ in the French word ‘Lune’, while ‘Ұ (ұ)’ stands for the back rounded vowel [ʊ], similar to the vowel in the English word ‘but.’ Finally, І (і) denotes the short front unrounded vowel [i], which is systematically opposed to Ы (ы) ([ɯ]), the back unrounded vowel.
The consonantal system includes unique phonemes. Қ (қ) represents the voiceless uvular plosive [q], which is different from the velar К (к) [k]. Ғ (ғ) denotes the voiced uvular fricative [ʁ], which sounds similar to the Arabic ‘غ’. ‘Ң (ң)’ marks the velar nasal [ŋ], corresponding to the ng in English sing. Finally, ‘Һ (һ)’, a glottal fricative [h], mainly appears in borrowed words.
All nine phonemes not only expand the sound inventory of the Kazakh language but also play an important role in speech recognition. Many ASR models confuse them, for example substituting ‘қ’ with ‘k’, ‘ң’ with ‘н’, or ‘ү/ұ’ with ‘u’, which leads to incorrect transcriptions. For instance, if the phrase “қу түлкі” is recognized as “ку тулкы”, the WER rises sharply because both words have changed. Violations of vowel harmony, such as replacing ‘ы’ with ‘і’, also break the structure of agglutinative word forms. Therefore, correct modeling of these phonemes and their harmonic relations is important for speech technologies, especially for automatic speech recognition and synthesis in the Kazakh language.
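To make this effect concrete, the short sketch below computes word- and character-level error rates for the example above. It is only an illustration of the metrics, not the evaluation pipeline used later in this paper, and the simple Levenshtein routine stands in for any standard WER/CER implementation.

# Illustrative only: word- and character-level error rates for the
# phoneme-confusion example "қу түлкі" -> "ку тулкы".
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

reference = "қу түлкі"
hypothesis = "ку тулкы"

wer = edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
cer = edit_distance(list(reference), list(hypothesis)) / len(reference)

print(f"WER = {wer:.2f}")  # 1.00: both words count as substitutions
print(f"CER = {cer:.2f}")  # 0.38: only 3 of 8 characters differ

The example shows why phoneme-level confusions are so costly at the word level: three single-character substitutions are enough to make every word in the phrase wrong.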
Modern research in the field of automatic speech recognition inevitably faces several significant challenges, including the limited availability and inconsistent quality of audio recordings in most Turkic languages. Among them, Turkish stands out with relatively large and high-quality corpora, while Tatar, Uzbek, and Kazakh remain comparatively low-resource, with smaller datasets and less consistent recording conditions. Additional challenges include high variability in speech patterns, the need for many qualified annotators, and the requirement to comply with strict data protection regulations. Although the stage of audio data collection is an important component in the development of speech technologies, a much more complex task is to ensure the high quality and consistency of both the audio fragments themselves and their accompanying annotation.
Audio data often contains elements that allow identifying the speaker, which requires obtaining explicit and documented consent from participants to use their voice information for research or commercial purposes. The study utilized open data from Internet sources that do not infringe upon the privacy rights of the speakers.
The process of annotating audio materials inevitably requires text transcription, the quality of which has a direct impact on the success of model training. The presence of incomplete, inaccurate, or subjective transcription, as well as the use of non-standardized abbreviations and designations, significantly reduces the effectiveness of subsequent processing.
A separate difficulty is the technical quality of audio recordings. The presence of background noise, acoustic distortions, low diction intelligibility, or poor recording conditions complicates the training of models. It requires the use of specialized signal preprocessing algorithms, including noise reduction, volume normalization, and filtering methods. In some cases, data that cannot be corrected is excluded from the training set, which increases the cost of resources and time required to form a comprehensive training sample.
One practical approach to compensate for the lack of audio data is the creation of synthetic speech corpora, which generate audio signals based on text-to-speech (TTS) technologies [8,9]. This method enables the scalable acquisition of audio materials with controlled parameters, such as speech rate, intonation, acoustic conditions, and the speaker’s speech characteristics, ensuring the formation of balanced and diverse training samples. An alternative, widely used approach is the extraction of audio data from open-source Internet sources, including podcasts, video hosting sites, and public speech databases. Additionally, data can be collected through targeted recording of user speech using specialized equipment and software. These methods provide realistic examples of spontaneous and formal speech in the Kazakh language across various acoustic and linguistic contexts, which is crucial for testing and training models.
For the Kazakh language, the available audio data are limited compared with high-resource languages such as English [10,11]. These data are presented in Section 2.3 below.
The study’s purpose is not only a technical analysis, but also the development of recommendations for improving the availability and quality of voice technologies for Kazakh-speaking users. These recommendations, including proposals for the development of open speech corpora, support for data crowdsourcing initiatives, and stimulation of scientific and commercial projects in Kazakh speech analytics, have the potential to significantly reduce the digital divide and expand the presence of the Kazakh language in the digital space.
This study aims to address these issues by evaluating a wider range of models, including both ISSAI systems and commercial tools like OpenAI, ElevenLabs, and Voiser. Given the rapid advancements in commercial ASR and TTS applications, it is vital to include these systems in our comparisons. The main goal of this paper is to perform a thorough analysis of ASR and TTS for Kazakh. We want to identify distinct performance features and provide guidance for developing better models. One major challenge we face is the lack of large-scale parallel speech-text datasets for Kazakh. Other than the ISSAI resources, we did not find any suitable open-source datasets. Training new systems on just these datasets would likely repeat the earlier ISSAI results without pushing the field forward. To address this issue, our study evaluates current approaches and highlights the need to create new datasets, such as the proposed 24 kz corpus. This corpus can serve as a foundation for training stronger models and for further research.
This paper is organized as follows. The Introduction explains the issues related to speech recognition and synthesis in the Kazakh language. Section 2 contains a thorough literature review, discussing several key works related to ASR and TTS systems for low-resource and Turkic languages. Section 2.1 covers general approaches to low-resource ASR, while Section 2.2 looks at research progress on Turkic languages. Section 2.3 explains Kazakh speech features in systems and resources, and Section 2.4 summarizes the findings from the literature review. Section 3 outlines the methodological framework used in this study. Section 3.1 explains how the audio and text datasets were formed. In Section 3.2, ASR systems and their selection criteria are thoroughly examined, and Section 3.3 discusses the application of Text-to-Speech (TTS) to Kazakh. Section 4 presents the text and audio quality metrics and details the evaluation procedures and results, offering a clear description of the experiments and performance outcomes of STT and TTS for the Kazakh language. Section 5 concludes the research by summarizing the findings and discussing future work, including possible expansions, improvements to datasets, and the integration of new neural models for better performance.

2. Related Works

Large datasets, advanced neural architectures, and powerful computing resources have become key factors for developing robust ASR and TTS systems. Current research efforts are focused on enhancing the models’ robustness to noise and their ability to adapt to low-resource languages, improving the capacity to generalize across languages, and increasing the naturalness and expressiveness of synthesized speech.
ASR and TTS technologies have improved significantly in recent years due to advances in deep learning, increased available speech corpora, and increased computing power. These advances have significantly improved the accuracy and efficiency of existing models for many languages, especially high-resource languages. However, there is a category of resource-constrained languages for which developing high-quality ASR and TTS systems remains a challenge. Limited data for such languages, such as minority languages or regional dialects, imposes significant constraints on the development of relevant technologies. This section reviews key research in the field of ASR and TTS for resource-constrained languages, with an emphasis on approaches that can be adapted for Turkic languages and other resource-constrained languages.

2.1. Low-Resource ASR: General Approaches

Automatic speech recognition has seen rapid progress due to advances in deep learning. A recent survey in [12] provides a comprehensive overview of end-to-end architectures, feature extraction, transfer learning, and multilingual systems, while also emphasizing challenges that arise in low-resource scenarios. The potential of speech synthesis for augmenting ASR training data is demonstrated in [13], where synthetic speech improved accuracy, though a gap with natural speech remains. A multilingual E2E ASR system with integrated LID prediction in the RNN-T framework was suggested in [14], achieving 96.2% LID accuracy and competitive WER.
The Universal Speech Model (USM), covering over 100 languages, was introduced in [15] as a large-scale multilingual foundation model for speech recognition. The Whisper model has also attracted considerable attention due to its strong cross-lingual performance. Fine-tuning strategies for Whisper in low-resource settings, including Kazakh, have been systematically evaluated in [16], showing that appropriate adaptation significantly improves recognition accuracy while reducing computational costs. Limited-vocabulary ASR for linguistic inclusion was discussed in [17], with experiments on Assamese, Bengali, Lao, Haitian, Zulu, and Tamil demonstrating the benefits of robust bottleneck features.

2.2. Turkic Languages: ASR and TTS

Turkic languages share several phonetic and morphological features that present unique challenges for speech technologies. In particular, the phonetic inventory of Turkic languages contains sounds and allophonic variations that directly affect Word Error Rate (WER). For example, in Kazakh and Kyrgyz, the consonant «қ» [q] (uvular plosive) is often confused with «к» [k] (velar plosive) by standard ASR models trained on high-resource languages, since many acoustic models are optimized for Indo-European phoneme sets. Similarly, front/back vowel harmony (e.g., келеді [keledi] vs. қалады [qalady]) introduces systematic vowel alternations that can lead to substitution errors if not explicitly modeled.
Morphologically, Turkic languages are agglutinative, with long word forms that concatenate several morphemes (e.g., үйіміздегілерден [üiımızdegılerden]—“from the ones in our house”). This increases the Out-Of-Vocabulary (OOV) rate and leads to higher WER if the training lexicon does not cover sufficient morpheme combinations. These findings indicate that the direct application of standard ASR architectures to low-resource Turkic languages may be insufficient to address their linguistic complexity.
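As an illustration of this fragmentation and OOV problem, the sketch below runs a long agglutinative Kazakh word through a multilingual subword tokenizer. The Whisper tokenizer and the openai/whisper-small checkpoint are used here only as an assumed stand-in for any BPE vocabulary trained mostly on high-resource languages; the exact token boundaries (and their byte-level display) will differ between vocabularies.

# Illustration of subword fragmentation for a long agglutinative Kazakh word.
# The tokenizer and checkpoint are examples; any multilingual BPE vocabulary
# trained mostly on high-resource languages shows similar behaviour.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

word = "үйіміздегілерден"  # "from the ones in our house"
pieces = tokenizer.tokenize(word)

# A single word form is typically split into many short pieces that do not
# align with morpheme boundaries (Cyrillic pieces may display as byte units).
print(len(pieces), pieces)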
The comprehensive review in [18] highlights that the unique phonetic and morphological features of Turkic languages, including vowel harmony and agglutinativity, pose additional challenges for automatic speech recognition (ASR) systems. These features increase WER, especially when standard models that are not adapted to the specifics of Turkic languages are used, which calls for specialized approaches to acoustic and language modeling. A multilingual TTS system for ten Turkic languages was developed in [8], using Kazakh as training data in a zero-shot Tacotron 2 setup. Transcription-to-IPA transliteration improved cross-lingual synthesis quality.
For Uzbek, the UzLM language model was introduced in [19], combining statistical and neural methods on a large corpus. Neural approaches achieved a CER of 5.26%, outperforming traditional baselines. Broader NLP progress for Central Asian Turkic languages is summarized in [20], emphasizing transfer learning and corpus creation as critical directions.
For Turkish, one of the first DL-based end-to-end TTS systems using Tacotron 2 + HiFi-GAN was developed in [21], achieving MOS scores above 4.4. This high performance was largely due to the use of a parallel training strategy and high-quality, speaker-consistent data, which improved intonation and rhythm, making the speech sound more natural. Whisper-based Turkish ASR was explored in [22], where LoRA fine-tuning reduced WER by up to 52%. The improvement is attributed to parameter-efficient fine-tuning (LoRA) on domain-specific data, which allowed Whisper to better capture Turkish phonetics and reduce substitution errors. A publicly available Uzbek Speech Corpus (USC) with 105 h of manually verified recordings was released in [23], supporting further ASR work.
The study in [24] demonstrates that multilingual models that exploit the phonetic and morphological similarities of Turkic languages can reduce WER substantially (by up to 54%) compared to models trained on unrelated linguistic material, confirming the importance of accounting for linguistic features to improve recognition quality.

2.3. Kazakh Speech Features in Systems and Resources

Kazakh, as a representative of Turkic languages, inherits many of the phonetic and morphological complexities outlined in the previous section. Its rich system of vowel harmony, presence of uvular consonants, and highly agglutinative morphology directly impact both acoustic modeling and language modeling in ASR systems. For example, long compound words and context-sensitive suffixes lead to lexical sparsity and complicate pronunciation modeling. These factors must be considered when designing Kazakh ASR and TTS systems to ensure high accuracy and naturalness, particularly in low-resource settings.
Speech recognition and synthesis technologies have experienced remarkable development in recent years, largely driven by advances in deep learning. ASR systems are designed to transcribe spoken language into text. Traditional ASR architectures based on HMM and Gaussian Mixture Models (GMM) have largely been replaced by end-to-end models such as Transformer [25], Conformer [26], and Whisper [27], which unify acoustic, language, and pronunciation models into a single deep neural network.
In low-resource settings such as the Kazakh language, the lack of large annotated corpora presents a major limitation. To address this, researchers have applied techniques like transfer learning [28], multilingual modeling [29], and LoRA [30] to adapt pre-trained ASR models to Kazakh and other Turkic languages. Open-source toolkits such as ESPnet [31], Kaldi [32], and Hugging Face’s Transformers library [33] provide flexible environments for building such systems.
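As a rough illustration of the adaptation techniques mentioned above, the sketch below attaches LoRA adapters to a pretrained Whisper checkpoint with the Hugging Face peft library. The checkpoint name, target modules, and rank settings are illustrative defaults, not configurations reported in the cited works.

# Sketch: parameter-efficient (LoRA) adaptation of a pretrained Whisper model.
# Checkpoint, target modules, and hyperparameters are illustrative assumptions.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights is trainable

# The wrapped model can then be fine-tuned on Kazakh audio-text pairs with a
# standard Hugging Face training loop, leaving the original weights frozen.

Because only the adapter weights are updated, this kind of setup keeps memory and compute requirements low, which is exactly what makes it attractive for low-resource languages such as Kazakh.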
In parallel, TTS synthesis has evolved from concatenative and parametric approaches to fully neural, end-to-end architectures. Systems such as Tacotron 2 [34], FastSpeech 2 [35], and HiFi-GAN [36] achieve high-quality synthesis by modeling both prosody and phonetic detail. These models typically use a sequence-to-sequence frontend with attention mechanisms (which align input text with corresponding acoustic features) and a neural vocoder (a neural network that converts acoustic features into realistic speech waveforms), enabling the generation of natural and intelligible speech.
Recent studies provide a strong foundation for Kazakh ASR/TTS. A high-quality open-source Kazakh TTS dataset containing 93 h of audio is presented in [9], where the authors carefully selected text to ensure broad phonetic and morphological coverage, explicitly considering the agglutinative nature and vowel harmony of the Kazakh language. This dataset enabled the development of an end-to-end Tacotron 2/Transformer-based TTS system, which was evaluated using MOS scores. In [7], a much larger 554-h Kazakh speech corpus was collected, covering a wide range of speakers and phonetic contexts. This work emphasized the need to address phoneme-level confusion (e.g., uvular vs. velar consonants) and provided the basis for DeepSpeech2 experiments, resulting in a 66.7% reduction in model size and a 7.6% improvement in CER. Together, these resources demonstrate that accounting for phonetic diversity and morphological complexity is essential for improving both ASR and TTS system performance.
The impact of incorrect letter recognition is similarly presented in [37]. The authors provide a comparative analysis of ASR systems/frameworks, such as Kaldi, Mozilla DeepSpeech, and Google Speech-to-Text, for Kazakh speech. The experiment’s results show that Google Speech-to-Text demonstrated the lowest error rate, with a WER of 52.97%. In contrast, Kaldi and Mozilla DeepSpeech had even higher WER values, underscoring the challenges modern systems encounter in recognizing the Kazakh language. The authors conclude that even with the best result, the error exceeds 50%, which makes ready-made solutions unsuitable for practical use. They emphasize the urgent need for adaptation to the peculiarities of Kazakh phonetics and morphology, highlighting the importance of this task in improving Kazakh speech recognition systems.
The authors of [38] tackle the difficult task of automatic speech recognition for the Kazakh language, focusing on children’s speech. The study is important because there are few corpora and models for low-resource languages, and this is especially true for children’s speech, whose acoustics and production differ from adult speech. The study also explores the phonetics of the Kazakh language. The authors have made a valuable contribution by creating the first specialized corpus of Kazakh children’s speech. The corpus includes recordings of children grouped by age (2–8 years) and gender, covering different dialects and accents. All recordings were carefully transcribed and checked, resulting in a high-quality dataset essential for training models. The evaluation was carried out using the WER and CER metrics, which are particularly important for the Kazakh language with its agglutinative structure. WER was measured on short words and on full sentences: for Kazakh sentences the WER was about 0.86, whereas for isolated words it was lower, at about 0.24. The experiments revealed that most errors are related to phonetic features, such as vowel reduction or consonant doubling, as well as the morphological complexity of the language. The authors note that it is necessary to integrate language models that take into account the morphological features of the Kazakh language.
Other studies on automatic speech recognition of Kazakh do not provide a detailed analysis of the impact of unique phonemes on quality, limiting themselves to aggregate WER, CER, and TER scores and offering virtually no data on phoneme-level errors; as a result, the contribution of these language-specific contrasts to the total error count remains insufficiently studied. The authors of the above-mentioned articles nevertheless reach important conclusions. They argue that, to improve recognition accuracy, it is important to create specialized acoustic models that match the unique sound system of the Kazakh language, and to develop language models that take into account phonetic rules and morphological agreement; such work greatly enhances our understanding of ASR for low-resource languages.
For the Kazakh language, research efforts have focused on developing ASR and TTS systems using small-scale datasets and transliteration techniques. For instance, multilingual TTS approaches leveraging IPA mapping (conversion of text into a standardized phonetic representation using the International Phonetic Alphabet) and zero-shot synthesis (the ability of a model to generate speech for an unseen language without direct training data) have shown promising results for Turkic languages [8].
The Kazakh Speech Corpus (KSC), with 332 h of diverse audio, was introduced in [39], achieving strong benchmark results (CER 2.8%, WER 8.7%). Importantly, the authors emphasize the agglutinative nature and vowel harmony of Kazakh and ensure phonetic balance in speaker and text selection, which is critical for reducing phoneme-level confusions. Further, [40] proposed a cascade Kazakh speech translation system combining ASR and MT modules, where data augmentation improved BLEU by +2, demonstrating the benefit of increasing linguistic and morphological coverage. Transformer-based and CTC models for Kazakh ASR were studied in [11], achieving a CER of 3.7% with integrated language models and showing that handling long agglutinative word forms is crucial for improving recognition accuracy. A broader exploration of Kazakh ASR using both deep learning and HMMs is given in [41], emphasizing the importance of reliable acoustic and morphology-aware language modeling for low-resource languages. The Soyle model [42] demonstrated effectiveness across Turkic and UN languages, while also providing the first large-scale Tatar speech dataset and introducing noise-robustness augmentation to handle real-world phonetic variability. Finally, the first industrial-scale open-source Kazakh corpus was presented in [43], combining KSC, KazakhTTS2, and additional broadcast/podcast data to increase phonetic and stylistic diversity and improve model generalization.
Despite this progress, the Kazakh language remains low-resource and agglutinative, so the problem persists. The availability and total duration of Kazakh audio resources are summarized in Table 1.
Thus, the overview of ASR and TTS systems, along with the current state of Kazakh audio resources, provides the methodological basis for the algorithm presented in the following sections.

2.4. Summary

The reviewed studies were selected to provide a comprehensive understanding of recent advancements, methodologies, and challenges in the development of speech and language technologies for Turkic and low-resource languages. By analyzing works related to ASR, TTS, language modeling, and multilingual processing, particularly for languages such as Kazakh, Uzbek, Turkish, and others, we aim to identify effective strategies, transferable techniques, and gaps that remain unaddressed. All the studies reviewed in this section have investigated various aspects of ASR, TTS, and speech translation for Turkic languages. The insights of this study, which analyzes existing models, serve as a foundation for the proposed methodology, focusing on enhancing the performance of end-to-end ASR and TTS systems, improving language representation in multilingual models, and developing robust, scalable tools for Turkic languages with minimal annotated resources. This approach plays an important role in the development of Kazakh language technologies, and the study places special emphasis on its practical application.
In the subsequent Materials and Methods Section, the ASR and TTS frameworks used for processing the audio and text corpora of a low-resource language are considered. While these systems can, in principle, be applied to different languages, their performance depends heavily on the models and frameworks available for the chosen language. All steps of corpus and model processing are described in detail below.

3. Materials and Methods

3.1. Audio and Text Dataset Formation

Speech-to-Text (STT) and TTS are tasks that transform audio data into textual data and vice versa, respectively. These tasks require datasets that are high-quality, precise, and mainly formal in nature. The Kazakh news website is valuable because it features YouTube-hosted video content accompanied by textual transcripts. The 24 kz news portal was scraped using a Python 3.11 script, retrieving the Title, Date, Script, YouTube URL, and web page URL. Another script was implemented for downloading videos from YouTube’s video hosting service. This step ensures that high-quality, diverse spoken language samples are captured directly from real-life broadcasts, which is crucial for building robust STT and TTS systems. Audio files were extracted from the videos using video editing software such as CyberLink PowerDirector or CapCut, which are commonly used in multimedia processing and allow audio tracks to be separated from video content. The audio files were exported as MP3 and WAV files with a unified frequency range. For effective processing, the speech was split into sentences and saved in separate files, and the corresponding textual scripts were split and stored in the same way. The scheme of web scraping, audio extraction, and parallel corpora formation is shown in Figure 1.
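A condensed sketch of this pipeline is given below. The portal URL, CSS selectors, and file names are hypothetical placeholders, and yt-dlp plus ffmpeg are used here as assumed stand-ins for the downloading and audio-extraction tools described above.

# Sketch of the scraping/audio-extraction pipeline. The URL, CSS selectors,
# and output layout are hypothetical; yt-dlp and ffmpeg stand in for the
# downloading and audio-separation tools mentioned in the text.
import subprocess
import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://24.kz/..."  # placeholder for a single news-item page

def scrape_item(url: str) -> dict:
    """Collect title, date, transcript, and YouTube link from one news page."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "date": soup.select_one("time").get_text(strip=True),          # assumed selector
        "script": soup.select_one("div.article-text").get_text(" "),   # assumed selector
        "youtube_url": soup.select_one("iframe[src*='youtube']")["src"],
        "page_url": url,
    }

def download_audio(youtube_url: str, out_wav: str) -> None:
    """Download the video track and convert it to 16 kHz mono WAV."""
    subprocess.run(["yt-dlp", "-x", "--audio-format", "wav",
                    "-o", "tmp.%(ext)s", youtube_url], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "tmp.wav",
                    "-ar", "16000", "-ac", "1", out_wav], check=True)

item = scrape_item(PAGE_URL)
download_audio(item["youtube_url"], "item_001.wav")

Sentence-level segmentation of the resulting WAV files and transcripts would follow as a separate step, producing the parallel files described in this subsection.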
Then, the ready-made parallel audio and textual corpora from Nazarbayev University were taken, and the same procedure was applied to split the audio and text into sentences and save them in separate files. As a result, audio and textual corpora of 200 sentences (100 sentences scraped from the news portal and 100 sentences from the Nazarbayev University corpora) were formed.

3.2. ASR Systems and Selecting Criteria

Several existing ASR systems were considered and evaluated based on a predetermined set of criteria, each of which plays a crucial role in the practical application of speech recognition technologies. The criteria were arranged in order of importance, allowing for an objective analysis.
  • Availability: The most important criterion was the system’s availability for large-scale use. This concept meant both technical availability (the ability to deploy and integrate quickly) and legal openness, including the availability of a free license or access to the source code. Preference was given to open-source solutions that did not require significant financial investments at the implementation stage.
  • Recognition quality: One of the key technical parameters was the linguistic accuracy of the system. This crucial factor ensures the system’s ability to correctly interpret both standard and accented speech, taking into account the language’s morphological and syntactic features. Particular attention was also paid to the contextual relevance of the recognized text, that is, the system’s ability to preserve semantic integrity when converting oral speech into written form.
  • The efficiency of subsequent processing: An additional criterion was the system’s ability to effectively work with a large volume of input data, implying not only accurate recognition but also the possibility of further processing (for example, automatic translation or categorization of content). Special importance was given to the scalability of the architecture and support for batch processing of audio files, ensuring a high-performance system.
Research on automatic speech recognition (ASR) for the Kazakh language is still limited. Contributions come mainly from a few research institutes that rely on their own datasets. The Institute of Smart Systems and Artificial Intelligence [48] has been a key player, introducing ASR solutions for Kazakh and nine other Turkic languages. Their developments include models like TurkicASR, KazakhTTS, and KazakhTTS2. However, these studies have primarily focused on in-house systems and have not offered thorough comparisons with other publicly available or commercial tools. The research conducted in this work allowed for the identification of several ASR models and systems that are effective in converting audio data into text for the Kazakh language. Among them, the most significant ones are Whisper, GPT-4o-transcribe, Soyle, Elevenlabs, Voiser, and others.
The Whisper model, developed by OpenAI [49], is a large-scale STT system for robust transcription and translation of multilingual and noisy speech. It can automatically detect the spoken language, perform direct speech-to-text transcription, and even translate speech into other languages. The model was trained on a large dataset of approximately 680,000 h of multilingual audio-text pairs collected from the web. The core of the Whisper model is a transformer-based encoder–decoder architecture, structurally similar to autoregressive models like GPT. The encoder operates on log-Mel spectrograms, which are time–frequency representations of the input audio. These features are chosen instead of MFCCs because they retain the full spectral distribution across Mel bands, providing richer acoustic information for deep neural networks. The decoder autoregressively generates output tokens one step at a time, conditioning each prediction on previous tokens. The Whisper model family includes 4–32 transformer layers, 4–16 attention heads, and 256–1280 embedding dimensions. Another important specification is the tokenization strategy: Whisper employs a byte-pair encoding (BPE) vocabulary covering multiple languages, and this unified vocabulary allows the model to switch between languages and tasks seamlessly. The transcription was performed using Whisper-1 of the OpenAI API, which corresponds to the Whisper large-v2 architecture with 1.55 B parameters, 32 encoder and decoder layers, a 1280-dimensional hidden state, and 20 attention heads. The input audio was processed into 80-dimensional log-Mel spectrograms at a sampling rate of 16 kHz, using a 25 ms window and a 10 ms stride. The decoder autoregressively generates tokens from a multilingual BPE vocabulary of 50,000 units. These architectural specifications interact with the phonetic, morphological, and syntactic features of the Kazakh language and shape the Whisper model’s output. Phonetically, Kazakh has many vowels with front–back and rounded–unrounded contrasts, along with vowel harmony, which requires the model’s encoder to capture long-range dependencies across syllables. Whisper’s transformer encoder, with its multi-head self-attention and large context window, is well-suited to model such harmony patterns. However, because context length and layer depth are configured primarily for English, the system may fail to capture these vowel interactions in Kazakh. Morphologically, Kazakh is agglutinative: words can be extended with long chains of suffixes, which results in high token variability and a large effective vocabulary. As Whisper relies primarily on a BPE tokenizer with a fixed vocabulary size that is efficient for European languages, it often splits Kazakh morphemes in unnatural ways, reducing recognition accuracy for rare or complex forms. Finally, Kazakh is a subject-object-verb (SOV) language, so Whisper’s decoder must model long-distance dependencies between subjects, objects, and verbs. However, because the model was pre-trained predominantly on SVO European languages, its learned attention patterns may not optimally handle Kazakh syntax. Consequently, even with strong architectural capacity, Whisper struggles with Kazakh speech due to mismatches between its pretraining distributions and the phonetic, morphological, and syntactic realities of the language.
The Whisper model’s architectural and hyperparameter specifications establish the most versatile open-source ASR system. Its ability to integrate transcription and translation, along with robustness to real-world noise, makes it a strong benchmark for evaluating speech recognition in low-resource languages.
GPT-4o-transcribe is a multimodal variant of GPT-4o optimized for speech recognition [50]. The model is based on a transformer encoder–decoder architecture with a multimodal encoder that processes log-Mel-like spectrograms and integrates them into a shared representation space. In comparison to Whisper, GPT-4o-transcribe operates with a much larger parameter count and an extended context window, enabling higher robustness to noise and more accurate modeling of long speech sequences. The API exposes options such as temperature, language, prompt, and output format, which lets users control decoding behavior and output formatting. For text generation, the model employs SentencePiece- and BPE-based tokenization with broad multilingual coverage. Like the Whisper model, GPT-4o-transcribe must handle the phonetic, morphological, and syntactic characteristics of the Kazakh language. Phonetically, Kazakh features a rich vowel system, including the back rounded vowel ‘ұ’ [ʊ], which requires fine-grained spectral modeling. The model’s high-resolution acoustic encoders and multi-head attention over long contexts allow it to capture these harmonic dependencies across syllables more effectively. Morphologically, Kazakh’s agglutinative structure produces long, morphologically complex words with numerous suffixes for case, possession, tense, and derivation. The larger tokenizer of GPT-4o-transcribe, combined with subword segmentation optimized across multiple languages, reduces the fragmentation of morphemes compared to Whisper’s BPE, allowing the model to better represent and generalize across Kazakh word forms. Syntactically, Kazakh’s SOV word order introduces long-distance dependencies between arguments and verbs. The autoregressive model, with its deeper transformer layers and longer context window, is better positioned to maintain coherence across such sentence structures, thereby reducing errors in word reordering. Overall, GPT-4o-transcribe has been trained on multiple tasks, including speech recognition, translation, and dialog, which enhances its robustness to noise, improves its ability to process long audio segments, and enables seamless integration with conversational systems.
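As an illustration of how these two OpenAI models can be accessed, the sketch below sends a single audio file to the transcription endpoint. The file name is a placeholder, and the request options shown (language hint, response format) are illustrative assumptions rather than the full configuration used in this study.

# Sketch: transcribing one Kazakh audio file through the OpenAI API.
# The file path is a placeholder; swap the model name to compare systems.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("sentence_001.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",          # or "gpt-4o-transcribe"
        file=audio_file,
        language="kk",              # ISO language hint for Kazakh (assumed option)
        response_format="text",
    )

print(result)

In practice the same call is repeated over the whole sentence-level corpus, and the returned transcripts are stored next to the reference texts for scoring.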
The Soyle model, developed by ISSAI at Nazarbayev University, is designed as a specialized ASR system for low-resource Turkic languages, with particular emphasis on Kazakh [51]. Architecturally, Soyle follows a transformer-based encoder–decoder design, similar to the Whisper model. The encoder processes acoustic features in the form of log-Mel spectrograms, extracted from 16 kHz audio using a window of 25 ms and a stride of 10 ms, to produce a time–frequency representation of speech suitable for deep learning. The decoder generates transcriptions autoregressively, predicting each token conditioned on the previously generated sequence. Tokenization is performed with a BPE vocabulary, allowing the model to represent both Cyrillic orthography and morphologically complex word forms. In terms of scale, Soyle is a mid-sized model compared to large global architectures. Its parameters are designed to strike a balance between accuracy and efficiency, making it deployable on CPU-based and GPU-based systems in real-world environments. The model’s context window is sufficient to handle continuous speech segments, though it does not extend to very long-form audio processing as in GPT-4o-transcribe. Soyle’s most significant strength lies in its domain-specific training and noise robustness. The Soyle model’s architecture includes a transformer with approximately 100–200 million parameters, configured with 12 encoder layers and 6 decoder layers, each using 8 attention heads and a hidden dimension of 512. Input audio at 16 kHz is converted into 80-dimensional log-Mel spectrograms, which are then processed by the encoder. The decoder autoregressively generates tokens from a BPE vocabulary of 10,000 units tailored to Kazakh orthography. The model’s training phase utilizes the Adam optimizer, cross-entropy loss with label smoothing, and data augmentation techniques, including SpecAugment, to enhance robustness against noise and speaker variability. Because the model retains Whisper’s architectural capacity but adjusts its weights to better represent Kazakh phoneme distributions, morphological suffix sequences, and syntactic dependencies, it captures many language-specific characteristics of Kazakh. Unlike large global STT models, Soyle was trained on carefully curated corpora, such as the Kazakh Speech Corpus 2 and Mozilla Common Voice. These corpora provide authentic recordings of Kazakh speech across different dialects, speaking styles, and acoustic conditions, enabling the model to capture the phonetic and morphological complexity of the language more effectively than general-purpose systems. In the phonetic aspect, Soyle benefits from training on more Kazakh audio, which helps its acoustic encoder better resolve subtle vowel contrasts, consonant allophony, and phenomena like vowel harmony and potentially reduced consonant sounds. Additionally, noise augmentation during training enhances robustness to background interference, which is crucial for Kazakh phonetics in real-world settings. In morphology, since Kazakh produces long words via multiple suffixes, many of which are rare or combinatorial, the tokenizer and decoder of Soyle learn common morpheme boundaries and patterns more effectively. Syntactically, Kazakh’s SOV word order requires the model to maintain context across longer spans and to avoid confusing subjects and objects.
The architecture’s large context window and self-attention layers allow Soyle to capture these dependencies better than models not fine-tuned on similar syntactic phenomena. Although smaller in scale than global models, Soyle achieves competitive accuracy in Kazakh by leveraging domain-specific training data and noise-resilient hyperparameters.
The advanced ElevenLabs voice AI platform offers a more commercially driven, feature-rich solution [52]. It delivers outstanding transcription accuracy for Kazakh, marking a significant step forward in speech recognition for underrepresented languages. In addition to transcription, it enables the system to differentiate between multiple speakers. Although the model’s many architectural hyperparameters, such as layer counts and hidden dimensions, are not publicly disclosed, the model supports transcription in 99 languages, with up to 32 speakers, word-level timestamps, and the detection of non-speech audio events. As ElevenLabs is mostly a commercial application, it does not disclose all its layers or exact acoustic feature extraction parameters, but in the case of phonetic, morphological, and syntactic features of the Kazakh language, their specifications can be suggested in the following way. Scribe features a high-resolution deep encoder stack with multiple self-attention heads, enabling it to focus on subtle phonetic cues. A tokenization strategy with low error rates suggests that their subword vocabulary is well-regularized for Kazakh and has seen enough Kazakh data to learn morpheme patterns.
The Voiser Transcribe model, developed by Voiser, is an ASR system that converts spoken audio into written text and supports over 75 languages, with a specific emphasis on Kazakh and Turkish, thereby addressing an important gap in regional speech technologies [53]. Voiser leverages the strengths of modern deep learning for robust handling of speech in real-world conditions, including noisy backgrounds and variable acoustic settings. While the model’s architecture is not fully publicly disclosed, it is known to use a Transformer-based encoder–decoder architecture, which is commonly used in modern STT systems due to its efficiency in handling sequential data and capturing long-range dependencies. However, the lack of publicly available architectural details and hyperparameters limits the ability to assess the model’s design and performance thoroughly. Overall, Voiser combines speed, accuracy, and ease of deployment, positioning itself as a practical solution for organizations that require reliable transcription services with support for Kazakh. Compared with Soyle, it is less specialized in terms of training data transparency and linguistic documentation, but it offers a broader commercial application scope. With respect to the Kazakh language’s features, Voiser’s transformer encoder layers can likely capture subtle phonemic transitions, vowel frontness, or harmonic cues that spread over adjacent syllables, and the encoder–decoder architecture facilitates flexibility in syntax and enables the modeling of long morphological chains. When compared to ElevenLabs, Voiser offers a more regionally focused alternative, with an emphasis on Turkic languages rather than primarily English. In this sense, Voiser can be seen as a middle ground between an academic specialization system, such as Soyle, and global commercial scalability, as seen in ElevenLabs, making it an important contributor to the growing ecosystem of Kazakh and Turkic STT technologies.
The scheme of the STT models is shown in Figure 2.
The comparison of STT models is presented in Table 2.
The comparative analysis of STT systems for the Kazakh language above provides a reasonable basis for their practical assessment; the results, focusing on text-evaluation metrics, are reported in the Results and Discussion Section.
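To make the assessment procedure concrete, the sketch below shows how the text-level metrics named in this paper (WER, BLEU, chrF) can be computed for reference-hypothesis pairs using the jiwer and sacrebleu libraries. The sample sentences are placeholders, and TER and COMET would be computed analogously with their respective toolkits.

# Sketch: scoring STT hypotheses against reference transcripts.
# The sentence pair is a placeholder; jiwer and sacrebleu serve as
# representative open-source implementations of the metrics named above.
import jiwer
import sacrebleu

references = ["бүгін ауа райы жылы болады"]   # ground-truth transcripts
hypotheses = ["бугін ауа райы жылы болады"]   # ASR output

wer = jiwer.wer(references, hypotheses)
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score

print(f"WER={wer:.3f}  BLEU={bleu:.2f}  chrF={chrf:.2f}")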

3.3. Text-to-Speech (TTS)

Modern TTS systems employ deep neural network models that significantly enhance the quality of synthesis, yielding speech that is more natural and expressive. There are several TTS models used for converting text data into audio. This research covers the Massively Multilingual Speech (MMS), TurkicTTS, KazakhTTS2, ElevenLabs TTS, and OpenAI TTS models.
The MMS model, developed by Meta AI, represents one of the largest-scale attempts at multilingual speech technology, covering over 1100 languages, including Kazakh [54]. The model’s architecture builds on wav2vec 2.0, a popular framework for speech representation, and adapts it for multi-task learning. The encoder processes raw audio waveforms through a convolutional feature encoder followed by 24 transformer layers, each with 16 attention heads and a hidden dimension of 1024. MMS employs a sequence-to-sequence architecture with a FastSpeech 2-style decoder, a non-autoregressive approach optimized for faster and more efficient speech synthesis. The TTS decoder consists of 12 transformer layers with a hidden dimension of 768. Training utilizes a learning rate of 1 × 10−4, a batch size of 16, and gradient clipping set to 1.0 to stabilize the training process. Phonetically, Kazakh’s rich system of vowel harmony requires the acoustic (vocoder) part of the model to capture fine spectral details, including frequency resolution, formant transitions, and how phonemic distinctions are carried across morpheme boundaries. In MMS, the flow-based module maps text encodings to spectrogram-like acoustic features, and a HiFi-GAN-style decoder reconstructs waveforms with good accuracy. Morphologically, since suffixes in Kazakh often appear at the ends of words, the model’s text encoder and tokenizer must represent morpheme boundaries and allow the duration predictor to assign appropriate timing. MMS utilizes a text encoder and a tokenizer that handle the Kazakh orthographic script, represent frequent suffixes, and enable the stochastic duration predictor to produce rhythm variation. Syntactically, the MMS model handles intonation, stress at clause ends, and possible pause insertions carefully. Generally, MMS is a versatile solution for the Kazakh language, as the model’s hyperparameters, such as sampling rate, spectrogram resolution, and number of flow layers, are crucial for speech generation.
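For reference, MMS text-to-speech checkpoints are distributed through the Hugging Face Transformers library. The minimal sketch below assumes the Kazakh checkpoint follows the facebook/mms-tts-<iso3> naming convention and is loaded with the VITS-style classes exposed by the library; the input sentence and output file name are placeholders.

# Sketch: generating Kazakh speech with an MMS TTS checkpoint from Hugging Face.
# The checkpoint id follows the facebook/mms-tts-<iso3> convention (assumed here).
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-kaz")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kaz")

text = "Сәлеметсіз бе, бұл сөйлеу синтезінің мысалы."  # placeholder sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (1, num_samples)

scipy.io.wavfile.write("mms_kaz.wav", rate=model.config.sampling_rate,
                       data=waveform.squeeze().numpy())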
TurkicTTS is a multilingual text-to-speech system, designed to address the systematic underrepresentation of Turkic languages in speech technologies, with a primary focus on Kazakh [55]. The architecture of TurkicTTS combines Tacotron2, a popular sequence-to-sequence model for text-to-spectrogram conversion, with a WaveGAN vocoder specifically trained on Kazakh data. Tacotron2 is responsible for converting the input text into mel-spectrograms, which represent the acoustic features of speech. The WaveGAN vocoder then synthesizes these spectrograms into waveforms, providing the final audio output. This combination of Tacotron2 and WaveGAN is designed to efficiently handle the complexities of Kazakh phonology, enabling the generation of high-quality synthetic speech for this language. The architecture of the TurkicTTS model comprises an encoder consisting of a convolutional layer, followed by self-attention and GRU layers, which are responsible for sequence modeling. The GRU layers typically contain 512 hidden units, enabling the model to efficiently capture dependencies within the input text. For the attention mechanism, location-sensitive attention is used to ensure proper alignment between the input text and the generated speech features. The model generates 80-dimensional mel-spectrograms as its output, capturing the essential frequency components for speech synthesis. Both the input text and output spectrograms are padded to a maximum sequence length during training to ensure stable learning and alignment across the sequence-to-sequence process. During training, a batch size of 32–64 is commonly used to maintain stable learning. The learning rate is set to 1 × 10−4 for both the generator and discriminator, using the Adam optimizer to adjust model parameters and improve waveform generation quality. In the phonetic specifications of Kazakh, the Tacotron-2 encoder includes a bidirectional LSTM layer with 512 units that processes the normalized Kazakh input of 42 Kazakh letters and punctuation marks. This size is sufficient to capture phonetic context, including how earlier vowels or consonants influence later ones, which is crucial for vowel harmony. In the morphological aspect, TurkicTTS utilizes the KazakhTTS2 corpus, which features high-quality data, including varied morphological forms, and the model is exposed to numerous real examples of suffixation. The model can correctly render suffix-like ends because it has learned phoneme correspondences for suffixes. In the syntactic features, Tacotron-2 generates mel spectrogram frames conditioned on encoded text input. Because of its autoregressive nature, it can model prosodic features that depend on large spans of input text. For example, when the verb or suffix ending is far into the input, the attention mechanism helps the decoder anticipate how to shape the prosody of preceding parts. The inclusion of punctuation also contributes, since punctuation often cues syntactic boundaries in text, which are mirrored in prosody. Generally, TurkicTTS is more research-oriented. While accurate for Kazakh and adaptable for related languages, it is less optimized for large-scale production environments, and its audio quality can vary outside its primary training domain.
The KazakhTTS2 is explicitly developed for the Kazakh language, aimed at advancing the creation of high-quality digital Kazakh speech resources [45]. Built on the Tacotron 2 architecture, the system employs a Seq2Seq framework with attention mechanisms to convert input text into Mel spectrograms, which represent the acoustic features necessary for speech generation. These spectrograms are then processed by a HiFi-GAN neural vocoder, which reconstructs the waveform from the spectrograms, producing high-fidelity, natural-sounding speech output. The combination of Tacotron 2 and HiFi-GAN enables KazakhTTS2 to produce speech with smooth intonation, correct stress placement, and prosodic naturalness. In terms of hyperparameters, the KazakhTTS2 model uses an encoder with convolutional layers and self-attention layers, followed by a decoder that generates the Mel spectrograms using location-sensitive attention. The hidden dimension in the GRU layers of both the encoder and decoder is typically 512, with 16 heads for optimal alignment between the input text and generated speech. The model outputs 80-dimensional Mel spectrograms to capture the frequency components of speech. The HiFi-GAN vocoder is used for high-quality waveform synthesis and consists of a generator and a discriminator, both of which utilize 3 × 3 filters in convolutional layers. The system typically employs batch sizes of 32–64 and a learning rate of 1 × 10−4, optimized using the Adam optimizer. Phonetically, in KazakhTTS2, the large size and speaker variety of the corpus allow the model to learn fine spectral details and realistic variation in the pronunciation of Kazakh sounds. The Tacotron-2 encoder–decoder with attention has enough capacity to distinguish subtle formant differences that hinge on phoneme identity. Because vocoders like ParallelWaveGAN are used, the model’s hyperparameters, such as sampling rate, number of mel-spectrogram bins, window size, and the architecture depth of the encoder and decoder, are all tuned so that vowel quality is preserved. Morphologically, the text normalization and preprocessing must represent the complete set of suffixes, and the input must provide sufficient context so that the attention mechanism in Tacotron-2 can accurately predict duration and prosody for long words. Additionally, the text input is sufficiently large to handle long Kazakh words without truncation, so the model does not lose context or fail to generate correct phonetic output for those. Syntactically, in KazakhTTS2, building the model with attention in Tacotron-2 helps the decoder to see the full input and map syntax cues. Also, the corpus includes speakers reading varied sentence types, giving the model training examples for prosody in different syntactic settings. Hyperparameters that impact this include the maximum text length, attention window, or decoder memory, as well as the weighting or loss functions that penalize misalignment or unnatural prosody. KazakhTTS2 represents a significant advancement in Kazakh TTS technology, offering improved synthesis quality and the potential for broader application, though further work is needed to enhance emotional expressiveness in speech.
The ElevenLabs TTS system is a commercial platform that has produced high-fidelity, natural-sounding, and emotionally adaptive speech across multiple languages, including Kazakh, since 2024 [56]. As with the ElevenLabs STT system, the company does not disclose the exact details of its architecture; it can be inferred that the system generates intermediate speech representations, likely in the form of spectrograms or quantized audio tokens, which are then decoded into audio waveforms. The system likely employs transformer-based models, either autoregressive or diffusion-based, to capture long-range dependencies in both the text and the prosody of the input. Although the architecture is not publicly disclosed, it presumably uses a large transformer with roughly 12 to 24 layers, hidden dimensions between 512 and 1024 units, and around 8–16 attention heads per layer. The model likely relies on sequence-to-sequence architectures with mechanisms such as location-sensitive attention for alignment between the input text and prosodic features, and quantization techniques, such as vector quantization or tokenization-based methods, are likely used to represent audio data efficiently during training and inference. Phonetically, the model features a sufficiently fine-grained acoustic frontend and spectral detail (i.e., the number of mel-spectrogram bins or the specific spectral representation used), a high sampling rate, and a high-capacity decoder, all of which contribute to voice quality. Its "Multilingual v2" model aims for more nuanced expression, which suggests deeper or more expressive decoders that better capture subtle phonetic variation. Morphologically, the system handles long input text with a full in-context representation of suffixes: its tokenizer retains morpheme boundaries, or at least recognizes frequent suffixes, so that pronunciation, prosody, and durations match. ElevenLabs supports inputs of up to 10,000 characters for long-form generation, allowing the model to see whole phrases or sentences rather than truncated fragments, which helps with suffixes at word ends and with morphological context. Syntactically, Kazakh uses SOV as its default order; because verbs often come at the end, phrase-final intonation, phrase boundaries, and prosodic cues at the ends of phrases and words are important. The ElevenLabs TTS models include features such as nuanced intonation, pacing, emotional awareness, voice styles, and dialog-style output. Overall, the architecture and hyperparameter choices of the ElevenLabs TTS system make it reasonably well suited to Kazakh's phonetic demands, morphological complexity, and syntactic structure.
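For reproducibility, the hedged sketch below shows how Kazakh text can be sent to the ElevenLabs text-to-speech REST endpoint. The API key and voice identifier are placeholders, and the "eleven_multilingual_v2" model name is taken from the public documentation of the Multilingual v2 model mentioned above; details should be checked against the current API reference before use.

```python
# Hedged sketch of a call to the ElevenLabs text-to-speech REST endpoint.
# API key and voice ID are placeholders; the model name follows the public docs.
import requests

API_KEY = "YOUR_ELEVENLABS_KEY"   # placeholder
VOICE_ID = "YOUR_VOICE_ID"        # placeholder voice identifier

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Қазақ тілі — түркі тілдерінің бірі.",
        "model_id": "eleven_multilingual_v2",  # multilingual model covering Kazakh (assumed)
    },
    timeout=60,
)
resp.raise_for_status()
with open("elevenlabs_kk.mp3", "wb") as f:
    f.write(resp.content)  # response body is the generated audio (MP3 by default, assumed)
```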
The OpenAI TTS model is an advanced autoregressive system designed to convert text into high-quality, natural-sounding speech [57]. While the specific architectural details are proprietary, these models are known to utilize transformer-based architectures, leveraging large-scale pretraining on extensive audio datasets to capture nuanced prosody and emotional expressiveness. The tts-1 model is optimized for real-time applications, providing a balance between speed and quality that makes it well suited to latency-sensitive use, while higher-fidelity variants target applications where sound quality is paramount. OpenAI TTS offers a variety of voice options, including alloy, echo, fable, onyx, nova, and shimmer, each with distinct tonal characteristics suited to different applications. In terms of technical specifications, the models generate speech at a sample rate of 24 kHz, ensuring clear and detailed sound reproduction. Audio outputs are typically provided in PCM format, making them easy to integrate into various systems. Accessible via OpenAI's API, the models can be seamlessly integrated into applications with both synchronous and asynchronous operations.
Phonetically, OpenAI TTS has a sufficiently fine spectral resolution and capacity in its acoustic model, as well as precise duration modules, ensuring that transitions and coarticulation are well-rendered. The NeuralHD variants typically trade off some latency to improve voice fidelity and spectral richness. These higher-fidelity variants help with preserving vowel formant structure, acoustic differentiation of rounded vs. unrounded vowels, and the harmony of vowels across suffixes. Morphologically, OpenAI TTS features a robust text preprocessing and normalization mapping that preserves morpheme boundaries, allowing pronunciation and duration predictions to reflect morphological structure accurately. The ability to instruct voice style may help in choosing different prosodic patterns depending on sentence type, which interacts with morphology.
Syntactically, OpenAI TTS anticipates or projects toward the end of a clause when generating speech. OpenAI’s TTS models include training on large multilingual data and expressivity features, which likely gives the model exposure to languages with verb-final or flexible order, helping it learn to align prosody with such syntax. From the architectural side, attention layers in the encoder and decoder have sufficient capacity and context windows to handle long sentences and large clause structures, preserving syntactic cues, such as case suffixes or punctuation, that indicate where the verb will be located. Overall, the OpenAI TTS model’s architecture and hyperparameter choices are well aligned with the demands of Kazakh: preserving vowel harmony and phonetic contrast, handling long morphological sequences, and generating natural prosodic structure that corresponds to the syntax.
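A comparable hedged sketch for the OpenAI speech endpoint is given below; the tts-1 model and the alloy voice follow the public documentation cited above, while the API key, the input sentence, and the output handling are illustrative assumptions.

```python
# Hedged sketch of a call to the OpenAI speech endpoint used in this comparison.
# Model and voice names follow the public documentation; the key is a placeholder.
import requests

resp = requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_OPENAI_KEY"},  # placeholder key
    json={
        "model": "tts-1",
        "voice": "alloy",
        "input": "Сәлеметсіз бе! Бұл қазақша синтезделген сөйлеу.",
        "response_format": "wav",  # 24 kHz PCM/WAV output, per the documentation
    },
    timeout=120,
)
resp.raise_for_status()
with open("openai_tts_kk.wav", "wb") as f:
    f.write(resp.content)
```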
The scheme of the TTS models is shown in Figure 3.
The comparison of TTS models is presented in Table 3.
The presented comparative analysis of TTS systems for the Kazakh language allows for a reasonable approach to their practical assessment, the results of which are presented in the next section based on the metrics of speech recognition and synthesis quality.

4. Results and Discussion

Table 4 presents the results obtained with the evaluation metrics, providing an overview of the STT quality achieved by the selected ASR systems for the Kazakh language on data collected from the 24 kz YouTube portal. The results are evaluated with BLEU, WER, TER, chrF, and COMET scores. BLEU, originally developed for machine translation, is also applied in STT evaluation when multiple valid transcriptions are possible. It measures the overlap of n-grams between the predicted transcription and the reference [58]. BLEU scores range from 0 to 1, with higher values indicating greater similarity to the reference. WER is the most widely adopted metric in speech recognition; it measures transcription errors at the word level [59]. Lower values correspond to better performance, with 0 representing a perfect match. Owing to its interpretability, WER remains the gold standard for STT evaluation. TER, adapted from MT evaluation, calculates the number of edits (insertions, deletions, substitutions, and shifts) required to transform a hypothesis into the reference, normalized by the reference length [60]. As with WER, lower values indicate better transcription quality, with 0 representing a perfect match. chrF is a character-based metric that compares character n-grams between hypothesis and reference [61]. Scores range from 0 to 1, and higher values indicate greater similarity. By operating at the character level, it captures morphological and lexical differences, making it particularly effective for agglutinative languages such as Kazakh, where words can be long and morphologically complex. COMET, developed for MT evaluation, leverages neural network models to estimate the semantic similarity between hypothesis and reference [62]. The metric outputs a score typically ranging from −1 to 1 (or 0 to 1, depending on configuration), where higher values indicate closer semantic alignment. COMET is especially useful in multilingual or semantically sensitive tasks, where preserving meaning is more critical than exact word matches. For BLEU, chrF, and COMET, higher values indicate better performance, whereas for WER and TER, lower values are better. In Table 4 and Table 5, these directions are marked with arrows: an up arrow indicates that higher scores are preferable, and a down arrow indicates that lower scores are preferable. In addition, the model with the best average metric scores on each dataset is highlighted in bold.
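As an illustration of how these lexical metrics can be computed in practice, the sketch below uses the common open-source packages jiwer (WER) and sacrebleu (BLEU, chrF, TER); COMET requires a separate pretrained neural model and is omitted here. The toy sentences are illustrative and are not taken from the evaluation data.

```python
# Illustrative computation of the STT metrics reported in Tables 4 and 5,
# using jiwer for WER and sacrebleu for BLEU/chrF/TER (COMET omitted).
import jiwer
import sacrebleu

references = ["бүгін ауа райы жақсы болады"]   # toy reference transcription
hypotheses = ["бүгін ауа райы жақсы болды"]    # toy system output

wer = jiwer.wer(references, hypotheses) * 100            # word error rate, %
bleu = sacrebleu.corpus_bleu(hypotheses, [references])    # n-gram overlap
chrf = sacrebleu.corpus_chrf(hypotheses, [references])    # character n-gram F-score
ter = sacrebleu.corpus_ter(hypotheses, [references])      # edit-based TER

print(f"WER={wer:.2f}  BLEU={bleu.score:.2f}  chrF={chrf.score:.2f}  TER={ter.score:.2f}")
```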
Table 5 shows the quality metrics for STT on the finished audio data from Nazarbayev University.
The use of TTS after STT in this experiment is explained by the need for a two-way quality assessment, encompassing not only speech recognition but also speech synthesis. When collecting audio data from the 24 kz portal, we gathered not only the audio but also its transcribed text, which was then used as TTS input; this comprehensive approach strengthens the reliability of the results. Table 6 and Table 7 show the results of comparing the gold audio with the audio generated by the different TTS models. Audio quality is evaluated with STOI, PESQ, MCD, LSD, and DNSMOS. PESQ (Perceptual Evaluation of Speech Quality, range 1–4.5) [64] measures perceptual quality by modeling the human auditory response, with higher values indicating better speech quality. STOI (Short-Time Objective Intelligibility, range 0–1) [65] quantifies how well speech can be understood, where higher values reflect greater intelligibility. MCD (Mel Cepstral Distortion) [63] evaluates the distance between synthesized and reference speech in the mel-cepstral domain, while LSD (Log Spectral Distance) measures spectral differences on a logarithmic scale; in both cases, lower values represent better similarity to the reference signal. DNSMOS [66] is a neural-network-based metric (range 1–5) that predicts human mean opinion scores without requiring a reference, with higher scores corresponding to more natural and cleaner speech.
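The reference-based acoustic metrics can be computed as in the hedged sketch below, using the pesq and pystoi packages and one simple log-spectral-distance formulation; DNSMOS requires a separate pretrained model and is omitted. The file names, the 16 kHz resampling, and the truncation-based alignment are simplifying assumptions.

```python
# Illustrative computation of reference-based TTS metrics (PESQ, STOI, LSD).
# File names are placeholders; both signals are resampled to 16 kHz for wideband PESQ.
import numpy as np
import librosa
from pesq import pesq
from pystoi import stoi

ref, sr = librosa.load("gold_audio.wav", sr=16000)   # reference (gold) recording
syn, _ = librosa.load("tts_output.wav", sr=16000)    # synthesized speech
n = min(len(ref), len(syn))                          # crude alignment by truncation (assumption)
ref, syn = ref[:n], syn[:n]

pesq_score = pesq(sr, ref, syn, "wb")    # wideband PESQ, roughly 1-4.5
stoi_score = stoi(ref, syn, sr)          # intelligibility, 0-1

# Log-spectral distance between magnitude spectrograms (one common formulation)
R = np.abs(librosa.stft(ref, n_fft=512)) + 1e-8
S = np.abs(librosa.stft(syn, n_fft=512)) + 1e-8
lsd = np.mean(np.sqrt(np.mean((20 * np.log10(R / S)) ** 2, axis=0)))

print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.2f}  LSD={lsd:.2f} dB")
```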
The comparative performance of the STT models across the 24 kz and Nazarbayev University datasets reveals distinct strengths and weaknesses for each model, discussed in turn below.
Whisper exhibited consistently low performance on both datasets, with WER and TER scores of 77.10% and 74.87% on the 24 kz data. Despite its general applicability across multiple languages, it struggled significantly with Kazakh, failing to capture the nuances of the language accurately. The phonetic richness, morphological complexity, and syntactic patterns of Kazakh presented challenges that Whisper's fixed subword vocabulary and pretraining distribution did not adequately capture, and these errors mainly accounted for the high WER. In contrast, the other tested models reached WER values between 40% and 48% on this dataset, indicating much better accuracy. In addition to the metric scores, a manual examination of the generated texts was carried out. It revealed frequent merging of letters within words, shortened and incomplete outputs, other hallucinations, and incorrect renderings of some words. It was therefore concluded that Whisper performs poorly for Kazakh ASR tasks in its current form and requires targeted adaptation and fine-tuning to become effective, making it less suitable than the other models.
GPT-4o-transcribe demonstrated much better scores than Whisper, with balanced and reliable results across all metrics. On the 24 kz data, its BLEU, WER, and TER reached 45.57%, 43.75%, and 42.35%. Its semantic accuracy was especially noteworthy, as it maintained consistent transcription quality even when some errors occurred, and it performed stably across both datasets. On the Nazarbayev University data, chrF and COMET scores of 81.15 and 1.02 showed that GPT-4o-transcribe captured character-level details and preserved meaning effectively across longer and more complex sentences. Compared with Whisper, these results highlight GPT-4o-transcribe's advantage in modeling Kazakh's flexible SOV syntax and maintaining coherence even with long sentences and rich case marking. Its architecture, featuring deeper transformers, longer context windows, and multilingual optimization, makes it better suited to the phonetic richness, morphological complexity, and syntactic flexibility of Kazakh. The manual analysis found morphological confusion and lexical substitution: letter replacements originated from the weak articulation of consonants in fluent speech, and there were semantically related changes with a tendency to replace Turkicized forms with Russian borrowings. Phonetically, some words were heard and transcribed differently. Overall, the transcription errors, combined with phonetic misperceptions, the influence of Russian lexical forms, and morphological inaccuracies, affect both the grammatical correctness and the stylistic naturalness of the Kazakh sentences. Nevertheless, GPT-4o-transcribe maintains stability across both datasets, making it a much more reliable baseline for Kazakh STT tasks.
Soyle demonstrated remarkable performance, particularly on the Nazarbayev University dataset, where it excelled in domain-specific recognition, achieving the best BLEU score of 74.93% and WER and TER scores of 18.61%, beating even the GPT-4o-transcribe and ElevenLabs models. The primary reason for these high scores is that the collected corpus was used in the training of Soyle, which gives it a significant advantage on this dataset. These results indicate that Soyle is particularly well adapted to Kazakh's agglutinative morphology, where rare and long word forms can otherwise hinder recognition, and to its flexible syntax, where prosodic and morphological cues, rather than word order, carry much of the meaning. Overall, Soyle demonstrates why domain-specific fine-tuning is critical for low-resource languages like Kazakh. While Whisper struggles with Kazakh's phonetic and morphological richness, Soyle's adaptation enables very low error rates, strong lexical and semantic accuracy, and excellent handling of long, suffix-heavy words. Compared with GPT-4o-transcribe and ElevenLabs, Soyle appears especially strong in scenarios requiring robustness to Kazakh morphology and syntax, making it one of the best-performing models for Kazakh STT. In the manual analysis of errors, there were almost no phonetic, morphological, or lexical deviations. Unlike the GPT-4o-transcribe output, Soyle maintained both the grammatical accuracy and the stylistic naturalness of the sentences without introducing distortions or unnecessary substitutions, avoiding the typical phonetic mishearings and Russian-influenced substitutions that appeared in the GPT-4o-transcribe transcription. While its performance on the 24 kz dataset was more moderate, Soyle's outstanding results on domain-specific data underscore its potential as a local, open-source solution for Kazakh speech transcription.
ElevenLabs also delivered strong results, showing excellent transcription accuracy and a good balance between error rates and semantic preservation, which makes it a competitive option for practical use, particularly where precise and accurate transcriptions are required. On the 24 kz dataset, ElevenLabs achieved a BLEU score of 43.33, far above Whisper's 13.22 and close to GPT-4o-transcribe's 45.57. Its WER and TER scores of 42.77% and 41.89% were also much lower than Whisper's, showing greater accuracy in whole-word recognition and sentence-level structure. On the Nazarbayev University dataset, ElevenLabs reached BLEU, WER, and TER scores of 59.45%, 30.84%, and 17.27%. These results show the model's ability to preserve both character-level detail and semantic content across longer and more complex Kazakh texts, and suggest that it is particularly effective at modeling Kazakh's morphological richness, where suffix chains often challenged other models, and at handling its flexible word order and verbal structures. Overall, ElevenLabs delivers consistent, high-quality recognition across both datasets, making it one of the most competitive systems for Kazakh STT. While Soyle edged ahead in morphological precision and GPT-4o-transcribe maintained broad balance, ElevenLabs provided robust error reduction, strong semantic alignment, and adaptability across datasets, suggesting that its architecture is well tuned to Kazakh phonetic and morphological specifications even without fine-tuning. The manual analysis of the ElevenLabs output revealed several problems: phonetically plausible but nonstandard forms, one of which was repeated twice in the sentence examined, as well as notable morphological distortions. Overall, the ElevenLabs transcription combines disfluency markers, phonetic approximations, and morphological and semantic distortions, resulting in sentences that deviate more heavily from the original than Soyle's output and contain errors comparable to those of GPT-4o-transcribe.
Voiser proved to be a highly effective model for Kazakh STT tasks, combining good accuracy with character-level precision, and performed well across both datasets. On the 24 kz dataset, Voiser reached a BLEU score of 38.41%, nearly triple Whisper's 13.22% and roughly equal to Soyle's 38.66%. On the Nazarbayev University dataset, it achieved BLEU, WER, and TER scores of 47.04, 37.11, and 22.95, respectively. Voiser stands out as a commercially viable option for Kazakh speech recognition, offering balanced performance that is well suited for real-world applications and indicating that its architecture adapts well to Kazakh's vowel harmony, suffix-heavy morphology, and flexible syntax. In practice, this makes Voiser a strong alternative to ElevenLabs and GPT-4o-transcribe, especially where minimizing raw recognition errors is prioritized, and its reliable, robust results make it an attractive solution for industries requiring dependable transcription systems. The manual analysis of the generated texts showed that Voiser's outputs were quite precise compared with the original texts: there were no serious phonetic distortions, morphological confusions, or lexical substitutions, and the verbal forms were preserved without errors. Compared with GPT-4o-transcribe and ElevenLabs, which introduced both phonetic and semantic mistakes, Voiser performed at the same level as Soyle, delivering faithful and natural transcriptions that closely match the original sentences.
After the transcription was obtained from the original audio using STT models, this text was voiced using TTS to check how well the original speech signal can be reconstructed from the text.
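Conceptually, this round trip can be expressed as in the sketch below, where Whisper stands in for the STT stage and synthesize is a placeholder callable for any of the TTS backends sketched earlier; this illustrates the evaluation flow rather than the exact scripts used in this study.

```python
# Conceptual sketch of the audio -> text -> audio round trip used in the evaluation.
# Whisper is used as an open STT example; synthesize() is a placeholder TTS callable.
import whisper

def round_trip(audio_path: str, synthesize, out_path: str = "resynthesized.wav") -> str:
    """Transcribe Kazakh audio, re-synthesize the transcript, return the transcript."""
    stt_model = whisper.load_model("large")                  # multilingual Whisper checkpoint
    result = stt_model.transcribe(audio_path, language="kk")  # "kk" = Kazakh
    text = result["text"]
    synthesize(text, out_path)                                # user-supplied TTS backend
    return text
```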
MMS performed consistently below the other models on most metrics. On the 24 kz dataset, it achieved a DNSMOS of 4.63, the lowest among all systems, and its STOI of 0.09 and PESQ of 1.12 indicated poor perceptual quality. Its MCD score on this dataset suggested moderate spectral distortion, and its LSD score further indicated noticeable synthesis artifacts. On the Nazarbayev University dataset, MMS again lagged behind, with the lowest DNSMOS and only modest improvements in STOI and MCD. This consistent underperformance across datasets highlights the limitations of MMS for Kazakh, particularly given the language's phonetic complexity and morphological richness, which require precise acoustic modeling for accurate vowel harmony and long suffix chains. These results indicate that MMS struggles with naturalness, making it the least effective system in this evaluation. The manual examination of the generated audio revealed pronunciation errors with missing sounds; although the speech was mostly clear, it did not sound completely natural.
TurkicTTS delivered strong performance on the 24 kz dataset, leading on several metrics, and more mixed results on the Nazarbayev University dataset. On the 24 kz dataset, TurkicTTS obtained a STOI of 0.11 and PESQ of 1.16, higher than MMS and most other models, indicating better intelligibility and perceptual quality of the generated speech. Its LSD score of 1.06 was also the best on this dataset, reinforcing its ability to generate clearer and less distorted speech. On the Nazarbayev University dataset, its STOI score of 0.15 was the best of all models, showing that it produces the most intelligible speech in this evaluation. Overall, TurkicTTS is a well-balanced multilingual TTS system that adapts well to Kazakh's linguistic challenges: its strengths lie in capturing vowel harmony and complex suffixal morphology accurately at the acoustic level and in producing highly intelligible speech. The manual analysis of the generated audio showed that the pronunciation was natural, with correctly placed stresses, and requires only minor improvements.
The KazakhTTS2 model demonstrated the best perceptual quality across both datasets. On the 24 kz dataset, KazakhTTS2 reached a STOI of 0.10 and PESQ of 1.09, slightly lower than TurkicTTS and comparable to MMS, while achieving the highest DNSMOS of 8.79. On the Nazarbayev University dataset (Table 7), KazakhTTS2 shows similar trends: its STOI (0.12) and PESQ (1.07) are solid but not the top values, and it again achieved the highest DNSMOS score of 8.96. These results indicate that KazakhTTS2 produces highly natural-sounding speech, reflecting its effectiveness in preserving the specifics of the Kazakh language, which is essential for high-quality TTS synthesis, and demonstrating that the model better captures the prosody, rhythm, and intonation patterns tied to Kazakh's SOV syntax and suffix-heavy morphology. Overall, KazakhTTS2 stands out for naturalness and human-likeness in Kazakh speech synthesis. The generated audio sounded natural, as with the TurkicTTS model.
OpenAI TTS performed consistently well across both datasets, showing balanced spectral accuracy and intelligibility. The model achieved the lowest MCD values of all systems, 123.44 on the 24 kz dataset and 117.11 on the Nazarbayev University dataset, highlighting its strong spectral accuracy and showing that it produces the most acoustically precise spectral features. Its LSD score of 1.16 on the 24 kz data is also competitive, slightly higher than TurkicTTS and KazakhTTS2 but much lower than ElevenLabs. Overall, OpenAI TTS shows a clear strength in spectral fidelity and intelligibility, outperforming MMS as well as the specialized TurkicTTS and ElevenLabs systems in terms of acoustic precision. It handles Kazakh's phonetic richness with minimal distortion and adapts well to the morphological complexity of long suffix chains by maintaining clarity across long input sequences. The manual examination of the generated audio revealed that the pronunciation was quite clear, though with a slight accent typical of European languages. OpenAI TTS provides a robust balance of accuracy, clarity, and naturalness, making it one of the most effective systems for Kazakh TTS overall.
The ElevenLabs system showed mixed performance, reasonable in perceptual quality but weaker in spectral accuracy. On the 24 kz dataset, it achieved a DNSMOS of 6.13, higher than MMS and TurkicTTS but lower than KazakhTTS2 and OpenAI TTS, suggesting that while the model produces speech that is subjectively fairly natural, it does not match the top performers in perceptual quality. Its MCD score of 164.29 indicates significant spectral distortion, making it less competitive in spectral accuracy than models such as KazakhTTS2 and OpenAI TTS. On the Nazarbayev University dataset, ElevenLabs again showed mixed results: its STOI score of 0.13 is a solid improvement, close to the OpenAI TTS and TurkicTTS scores of 0.14 and 0.15, but its PESQ score of 1.08 is among the lowest of all systems, pointing to weaker perceptual quality. Overall, ElevenLabs TTS provides acceptable performance for Kazakh but is clearly outperformed by the KazakhTTS2 and TurkicTTS models, which are tailored to the language. Its main weaknesses are higher MCD and LSD spectral distortions and lower DNSMOS naturalness, which limit its ability to fully capture Kazakh's phonetic richness and prosodic patterns, although it handles Kazakh morphology and syntax adequately at the intelligibility level. The manual analysis of the output audio revealed an accent reminiscent of Russian pronunciation, noticeably stronger than in the OpenAI TTS audio, though the words were still pronounced clearly.
Generally, the experimental results showed the need for further research and optimization in Kazakh TTS systems to improve spectral accuracy, naturalness, and intelligibility of the generated audio.

5. Conclusions and Future Work

The use of TTS after STT in this experiment represents a crucial stage in a comprehensive assessment of speech technology quality. It allows us to analyze the accuracy of the full cycle of speech signal conversion, from audio to text and back (audio → text → audio). This approach makes it possible to assess TTS systems not only in terms of fidelity to the generated text but also in terms of how closely the synthesized speech approximates the original sound in both acoustic and perceptual characteristics. In addition, it helps determine the extent to which a system can reproduce the original speech flow, including intonation and prosodic features, rather than merely generating a formally correct voiceover of the text.
In this study, various STT and TTS systems were evaluated using a test dataset, and the most suitable system was chosen based on predefined criteria, including availability, speech recognition accuracy, speech synthesis quality, and efficiency. Thus, the integration of TTS after STT extends beyond the technical procedure and serves as a crucial tool for a comprehensive evaluation of the reliability and realism of speech systems under conditions of limited training data.
Through a meticulous evaluation methodology, the optimal models for STT and TTS tasks in the Kazakh language were selected. The selection was based on a range of quality metrics, including WER, TER, chrF, COMET, and BLEU for STT, and MCD, PESQ, STOI, LSD, and DNSMOS for TTS, with a focus on accuracy, intelligibility, and perceptual quality.
In the task of automatic Kazakh speech recognition, the GPT-4o-transcribe model demonstrated the best results among the general-purpose models not trained on the target corpus. Its high accuracy on the Nazarbayev University data, with WER = 36.22% and TER = 23.04%, as well as high chrF (81.15) and COMET (1.02) values, reaffirms its potential as a strong candidate for STT tasks in scalable or multilingual applications.
It should be noted that, despite demonstrating the best metrics among all tested systems (WER = 18.61%, chrF = 95.60%, COMET = 1.23), the Soyle model was not selected as the final solution. This is because some of the data used in the experiment may have been part of its training set, potentially leading to an overestimation of the model’s actual accuracy on the target test corpus. This could reduce the validity of its choice as a universal tool for general speech recognition tasks, as it may not perform as well on unseen data.
The OpenAI TTS model was selected for Kazakh speech synthesis (TTS), demonstrating the best balance between spectral accuracy and subjective sound quality. The model achieved the lowest MCD value (117.11), as well as high PESQ (1.14) and DNSMOS (7.04) scores, indicating high-quality acoustic implementation and a positive perception of the synthesized speech. The STOI value (0.14) also confirms speech intelligibility, making OpenAI TTS a suitable choice for a wide range of applications, including speech generation in educational settings, media content, and dialog systems in the Kazakh language.
This study makes a significant contribution to low-resource speech technology. First, it provides an in-depth review of existing ASR and TTS systems, with a particular focus on Turkic languages. Second, it outlines a clear framework for creating a Kazakh audio-text dataset. Third, it compares recognition and synthesis systems with both objective and subjective quality measures. Finally, it highlights the strengths and weaknesses of current methods. Together, these efforts enhance our understanding of low-resource ASR and TTS, laying the groundwork for future research. In future work, we plan to focus on collecting parallel audio data between Kazakh and other Turkic languages. The creation of parallel corpora is essential given the current shortage of parallel speech data for Turkic languages, which is critical for developing and testing speech-to-speech systems (STS). The development of this database will address two important problems: it will provide a foundational platform for creating speech models in various languages and will aid in advancing the development of digital linguistic technologies in the Central Asian region.

Author Contributions

Conceptualization, A.K.; methodology, A.K. and V.K.; software, A.K. and V.K.; experiments, A.K. and V.K.; validation, A.K. and V.K.; formal analysis, B.A. and D.A.; investigation, A.K.; resources, A.K. and V.K.; data curation, A.K.; writing—original draft preparation, A.K., V.K., B.A. and D.A.; writing—review and editing, A.K. and V.K.; visualization, V.K.; supervision, A.K.; project administration, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the grant project "Study of automatic generation of parallel speech corpora of Turkic languages and their use for neural models" (grant number IRN AP23488624) of the Ministry of Science and Higher Education of the Republic of Kazakhstan.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available at https://github.com/NLP-KazNU/Kazakh-STT_TTS (accessed on 1 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR: Automatic speech recognition
TTS: Text-to-speech
STT: Speech-to-text
E2E: End-to-end
WER: Word error rate
TER: Translation edit rate
BLEU: Bilingual evaluation understudy
chrF: Character-level F-score
LoRA: Low-rank adaptation
CER: Character error rate
KSC: Kazakh Speech Corpus
MT: Machine translation
HMM: Hidden Markov model
PESQ: Perceptual evaluation of speech quality
STOI: Short-time objective intelligibility
USM: Universal Speech Model
USC: Uzbek Speech Corpus
MCD: Mel cepstral distortion
DNSMOS: Deep noise suppression mean opinion score
DNS: Deep noise suppression
MOS: Mean opinion score
MSE: Mean square error
MMS: Massively Multilingual Speech
ISSAI: Institute of Intelligent Systems and Artificial Intelligence
NU: Nazarbayev University
CTC: Connectionist temporal classification
KSD: Kazakh Speech Dataset
AI: Artificial intelligence
COMET: Crosslingual optimized metric for evaluation of translation
RNN-T: Recurrent neural network transducer
LSTM: Long short-term memory
UzLM: Uzbek language model
STS: Speech-to-speech
LID: Language identifier
DL: Deep learning
IPA: International Phonetic Alphabet
API: Application programming interface
GPT: Generative pre-trained transformer
WebRTC: Web real-time communication
HiFi-GAN: Generative adversarial network for efficient and high-fidelity speech synthesis
WaveGAN: Generative adversarial network for unsupervised synthesis of raw-waveform audio

References

  1. Vacher, M.; Aman, F.; Rossato, S.; Portet, F. Development of Automatic Speech Recognition Techniques for Elderly Home Support: Applications and Challenges. In Lecture Notes in Computer Science, Proceedings of the International Conference on Human Aspects of IT for the Aged Population; Springer: Los Angeles, CA, USA, 2015; pp. 341–353. [Google Scholar] [CrossRef]
  2. Bekarystankyzy, A.; Mamyrbayev, O.; Mendes, M.; Fazylzhanova, A.; Assam, M. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets. Sci. Rep. 2024, 14, 13835. [Google Scholar] [CrossRef]
  3. Tukeyev, U.; Turganbayeva, A.; Abduali, B.; Rakhimova, D.; Amirova, D.; Karibayeva, A. Inferring the Complete Set of Kazakh Endings as a Language Resource. In Advances in Computational Collective Intelligence: Proceedings of the 12th International Conference, ICCCI 2020, Da Nang, Vietnam, 30 November–3 December 2020; Hernes, M., Wojtkiewicz, K., Szczerbicki, E., Eds.; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2020; Volume 1287, pp. 741–751. [Google Scholar] [CrossRef]
  4. Tukeyev, U.; Karibayeva, A.; Zhumanov, Z. Morphological Segmentation Method for Turkic Language Neural Machine Translation. Cogent Eng. 2020, 7, 1832403. [Google Scholar] [CrossRef]
  5. Tukeyev, U.; Karibayeva, A.; Turganbayeva, A.; Amirova, D. Universal Programs for Stemming, Segmentation, Morphological Analysis of Turkic Words. In Computational Collective Intelligence: Proceedings of the International Conference (ICCCI 2021), Rhodes, Greece, 29 September–1 October 2021; Nguyen, N.T., Iliadis, L., Maglogiannis, I., Trawiński, B., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12876, pp. 643–654. [Google Scholar] [CrossRef]
  6. Tukeyev, U.; Gabdullina, N.; Karipbayeva, N.; Abdurakhmonova, N.; Balabekova, T.; Karibayeva, A. Computational Model of Morphology and Stemming of Uzbek Words on Complete Set of Endings. In Proceedings of the 2024 IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE), Novosibirsk, Russia, 15–17 November 2024; pp. 1760–1764. [Google Scholar] [CrossRef]
  7. Kadyrbek, N.; Mansurova, M.; Shomanov, A.; Makharova, G. The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters. Big Data Cogn. Comput. 2023, 7, 132. [Google Scholar] [CrossRef]
  8. Yeshpanov, R.; Mussakhojayeva, S.; Khassanov, Y. Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration. In Proceedings of the INTERSPEECH, 2023, Dublin, Ireland, 20–24 August 2023; pp. 5521–5525. [Google Scholar] [CrossRef]
  9. Mussakhojayeva, S.; Janaliyeva, A.; Mirzakhmetov, A.; Khassanov, Y.; Varol, H.A. KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. In Proceedings of the INTERSPEECH, Brno, Czechia, 30 August–3 September 2021; pp. 2786–2790. [Google Scholar] [CrossRef]
  10. Kuanyshbay, D.; Amirgaliyev, Y.; Baimuratov, O. Development of Automatic Speech Recognition for Kazakh Language Using Transfer Learning. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 5880–5886. [Google Scholar] [CrossRef]
  11. Orken, M.; Dina, O.; Keylan, A.; Tolganay, T.; Mohamed, O. A study of transformer-based end-to-end speech recognition system for Kazakh language. Sci. Rep. 2022, 12, 8337. [Google Scholar] [CrossRef]
  12. Ahlawat, H.; Aggarwal, N.; Gupta, D. Automatic Speech Recognition: A Survey of Deep Learning Techniques and Approaches. Int. J. Cogn. Comput. Eng. 2025, 7, 201–237. [Google Scholar] [CrossRef]
  13. Rosenberg, A.; Zhang, Y.; Ramabhadran, B.; Jia, Y.; Moreno, P.; Wu, Y.; Wu, Z. Speech recognition with augmented synthesized speech. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop, Singapore, 14–18 December 2019; pp. 996–1002. [Google Scholar]
  14. Zhang, C.; Li, B.; Sainath, T.; Strohman, T.; Mavandadi, S.; Chang, S.-Y.; Haghani, P. Streaming end-to-end multilingual speech recognition with joint language identification. In Proceedings of the INTERSPEECH, 2022, Incheon, Korea, 18–22 September 2022. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Han, W.; Qin, J.; Wang, Y.; Bapna, A.; Chen, Z.; Chen, N.; Li, B.; Axelrod, V.; Wang, G.; et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv 2023, arXiv:2303.01037. [Google Scholar] [CrossRef]
  16. Liu, Y.; Yang, X.; Qu, D. Exploration of Whisper Fine-Tuning Strategies for Low-Resource ASR. EURASIP J. Audio Speech Music Process. 2024, 2024, 29. [Google Scholar] [CrossRef]
  17. Metze, F.; Gandhe, A.; Miao, Y.; Sheikh, Z.; Wang, Y.; Xu, D.; Zhang, H.; Kim, J.; Lane, I.; Lee, W.K.; et al. Semi-supervised training in low-resource ASR and KWS. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5036–5040. [Google Scholar] [CrossRef]
  18. Du, W.; Maimaitiyiming, Y.; Nijat, M.; Li, L.; Hamdulla, A.; Wang, D. Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: An Overview. Appl. Sci. 2023, 13, 326. [Google Scholar] [CrossRef]
  19. Mukhamadiyev, A.; Mukhiddinov, M.; Khujayarov, I.; Ochilov, M.; Cho, J. Development of Language Models for Continuous Uzbek Speech Recognition System. Sensors 2023, 23, 1145. [Google Scholar] [CrossRef]
  20. Veitsman, Y.; Hartmann, M. Recent Advancements and Challenges of Turkic Central Asian Language Processing. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, Abu Dhabi, United Arab Emirates, 19–20 January 2025; pp. 309–324. [Google Scholar]
  21. Oyucu, S. A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning. Electronics 2023, 12, 1900. [Google Scholar] [CrossRef]
  22. Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish ASR System and Evaluation of Fine-Tuning with LoRA Adapter. Electronics 2024, 13, 4227. [Google Scholar] [CrossRef]
  23. Musaev, M.; Mussakhojayeva, S.; Khujayorov, I.; Khassanov, Y.; Ochilov, M.; Atakan Varol, H. USC: An open-source Uzbek speech corpus and initial speech recognition experiments. In Speech and Computer; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2021; pp. 437–447. [Google Scholar]
  24. Mussakhojayeva, S.; Dauletbek, K.; Yeshpanov, R.; Varol, H.A. Multilingual Speech Recognition for Turkic Languages. Information 2023, 14, 74. [Google Scholar] [CrossRef]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  26. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 5036–5040. [Google Scholar] [CrossRef]
  27. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar] [CrossRef]
  28. Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Soplin, N.E.; Heymann, J.; Wiesner, M.; Chen, N. ESPnet: End-to-End Speech Processing Toolkit. arXiv 2018, arXiv:1804.00015. [Google Scholar]
  29. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the ACL, Online, 19 July 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  30. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  31. ESPnet Toolkit. Available online: https://github.com/espnet/espnet (accessed on 10 June 2025).
  32. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the ASRU, Hilton Waikoloa Village Resort, Waikoloa, HI, USA, 11–15 December 2011; pp. 1–4. [Google Scholar]
  33. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
  34. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783. [Google Scholar] [CrossRef]
  35. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
  36. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Network for Efficient and High Fidelity Speech Synthesis. arXiv 2020, arXiv:2010.05646. [Google Scholar]
  37. Karabaliyev, Y.; Kolesnikova, K. Kazakh Speech and Recognition Methods: Error Analysis and Improvement Prospects. Sci. J. Astana IT Univ. 2024, 20, 62–75. [Google Scholar] [CrossRef]
  38. Rakhimova, D.; Duisenbekkyzy, Z.; Adali, E. Investigation of ASR Models for Low-Resource Kazakh Child Speech: Corpus Development, Model Adaptation, and Evaluation. Appl. Sci. 2025, 15, 8989. [Google Scholar] [CrossRef]
  39. Khassanov, Y.; Mussakhojayeva, S.; Mirzakhmetov, A.; Adiyev, A.; Nurpeiissov, M.; Varol, H.A. A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL, Online, 19–23 April 2021; pp. 697–706. [Google Scholar] [CrossRef]
  40. Kozhirbayev, Z.; Islamgozhayev, T. Cascade Speech Translation for the Kazakh Language. Appl. Sci. 2023, 13, 8900. [Google Scholar] [CrossRef]
  41. Kapyshev, G.; Nurtas, M.; Altaibek, A. Speech recognition for Kazakh language: A research paper. Procedia Comput. Sci. 2024, 231, 369–372. [Google Scholar] [CrossRef]
  42. Mussakhojayeva, S.; Gilmullin, R.; Khakimov, B.; Galimov, M.; Orel, D.; Abilbekov, A.; Varol, H.A. Noise-Robust Multilingual Speech Recognition and the Tatar Speech Corpus. In Proceedings of the 2024 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Osaka, Japan, 19–22 February 2024; pp. 732–737. [Google Scholar] [CrossRef]
  43. Mussakhojayeva, S.; Khassanov, Y.; Varol, H.A. KSC2: An industrial-scale open-source Kazakh speech corpus. In Proceedings of the INTERSPEECH, Incheon, Korea, 18–22 September 2022; pp. 1367–1371. [Google Scholar] [CrossRef]
  44. Common Voice. Available online: https://commonvoice.mozilla.org/ru/datasets (accessed on 10 June 2025).
  45. KazakhTTS. Available online: https://github.com/IS2AI/Kazakh_TTS (accessed on 10 June 2025).
  46. Kazakh Speech Corpus. Available online: https://www.openslr.org/102/ (accessed on 10 June 2025).
  47. Kazakh Speech Dataset. Available online: https://www.openslr.org/140/ (accessed on 10 June 2025).
  48. ISSAI. Available online: https://github.com/IS2AI/ (accessed on 10 September 2025).
  49. Whisper. Available online: https://github.com/openai/whisper (accessed on 2 June 2025).
  50. GPT-4o-transcribe (OpenAI). Available online: https://platform.openai.com/docs/models/gpt-4o-transcribe (accessed on 2 July 2025).
  51. Soyle. Available online: https://github.com/IS2AI/Soyle (accessed on 2 June 2025).
  52. ElevenLabs Scribe. Available online: https://elevenlabs.io/docs/capabilities/speech-to-text (accessed on 20 June 2025).
  53. Voiser. Available online: https://voiser.net/ (accessed on 30 June 2025).
  54. MMS (Massively Multilingual Speech). Available online: https://github.com/facebookresearch/fairseq/tree/main/examples/mms (accessed on 10 June 2025).
  55. TurkicTTS. Available online: https://github.com/IS2AI/TurkicTTS (accessed on 12 June 2025).
  56. ElevenLabs TTS. Available online: https://elevenlabs.io/docs/capabilities/text-to-speech (accessed on 2 July 2025).
  57. OpenAI TTS. Available online: https://platform.openai.com/docs/guides/text-to-speech (accessed on 30 June 2025).
  58. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  59. Gillick, L.; Cox, S. Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Glasgow, UK, 23–26 May 1989; Volume 1, pp. 532–535. [Google Scholar] [CrossRef]
  60. Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; Available online: https://aclanthology.org/2006.amta-papers.25/ (accessed on 10 June 2025).
  61. Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 392–395. [Google Scholar] [CrossRef]
  62. Rei, R.; Farinha, A.C.; Martins, A.F.T. COMET: A Neural Framework for MT Evaluation. In Proceedings of the EMNLP, Online, 16–20 November 2020; pp. 2685–2702. [Google Scholar] [CrossRef]
  63. Kubichek, R. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference on Communications Computers and Signal Processing, Victoria, BC, Canada, 19–21 May 1993; Volume 1, pp. 125–128. [Google Scholar] [CrossRef]
  64. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ). In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752. [Google Scholar] [CrossRef]
  65. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
  66. Reddy, C.K.; Gopal, V.; Cutler, R. DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6493–6497. [Google Scholar] [CrossRef]
Figure 1. The scheme of parallel audio and text corpora formation from the news portal.
Figure 2. Audio transcription with STT models.
Figure 3. Text conversion with TTS models.
Table 1. Kazakh Audio Resources—Availability and Total Duration.

| Audio Corpus Name | Data Type | Volume | Accessibility |
|---|---|---|---|
| Common Voice [44] | Audio recordings with transcriptions | 150+ h | open access |
| KazakhTTS [45] | Audio–text pairs | 271 h | conditionally open |
| Kazakh Speech Corpus [39,46] | Speech + transcriptions | 330 h | open access |
| Kazakh Speech Dataset (KSD) [7,47] | Speech | 554 h | open access |
Table 2. The comparison of STT models.

| Models | Advantages | Drawbacks |
|---|---|---|
| GPT-4o-transcribe | State-of-the-art accuracy; real-time streaming via WebSocket/WebRTC; low word error rate, strong for morphologically rich languages; robust in noisy and complex environments; integration with multimodal GPT-4o | Limited access through the OpenAI API; proprietary, no open-source release; requires a stable internet connection |
| Whisper | Open-source and freely available; supports 99+ languages, multilingual robustness; automatic language detection and translation; pretrained models available in multiple sizes; resilient to noise and accents | Possible lags in low-latency applications; large models are resource-intensive |
| Soyle | Focused on Kazakh and low-resource Turkic languages; trained on local corpora (Kazakh Speech Corpus 2, Common Voice); effective against background noise and speaker variability; locally developed, supports national use | Limited language coverage; scarce public documentation; restricted deployment options |
| Elevenlabs | High transcription accuracy for Kazakh; advanced features: speaker diarization, non-verbal event detection; easy access via API and web interface; optimized for low-latency deployment (real-time version in development) | Closed-source, proprietary system; primarily optimized for English; data privacy and reproducibility concerns; transcription quality affected by noisy environments |
| Voiser | High accuracy in Kazakh and Turkish; real-time and batch transcription; punctuation and speaker diarization; flexible cloud-based integration | Proprietary and closed-source; limited global language range; less academic benchmarking and transparency; performance affected by noisy input |
Table 3. The comparison of TTS models.

| Models | Advantages | Drawbacks |
|---|---|---|
| MMS | Open-source and publicly available; supports 1100+ languages, including Kazakh; unified model for ASR, TTS, and language identification; trained on a vast amount of data; rapid adaptation to low-resource languages | Less optimized for real-time use; may show degraded performance on specific dialects |
| TurkicTTS | Specially designed for 10 Turkic languages (Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek); incorporates phonological features of Turkic speech; provides open research resources and benchmarks; Tacotron2 and WaveGAN architecture; zero-shot generalization without parallel corpora | Sometimes misidentifies Turkic languages; limited domain coverage and audio variation; research-focused, minimal production integration |
| KazakhTTS2 | Tailored for high-quality Kazakh TTS; improved naturalness, stress, and prosody; developed for national applications; open-source, freely available via GitHub; Tacotron2 and HiFi-GAN vocoder; focus on Kazakh phonological accuracy; basis for national digital services and education | Limited to Kazakh only; requires fine-tuning for expressive or emotional speech |
| Elevenlabs | High-fidelity, human-like voice synthesis; supports multilingual and emotional speech; user-friendly web and API interfaces; fast inference and low-latency output; speaker cloning and adaptation; optimized for interactive applications | Commercial licensing with usage restrictions; no access to full training data or fine-tuning options |
| OpenAI TTS | Advanced expressiveness; integrated with GPT models for contextual adaptation; robust handling of pauses, emphasis, and emotion; multilingual support, including Kazakh | Closed-source and API-only; limited user customization; subject to quotas and usage caps |
Table 4. STT Comparative Analysis for 24 kz Data.

| Model | BLEU% ↑ | WER% ↓ | TER% ↓ | chrF ↑ | COMET ↑ |
|---|---|---|---|---|---|
| Whisper | 13.22 | 77.10 | 74.87 | 55.30 | 0.42 |
| GPT-4o-transcribe | 45.57 | 43.75 | 42.35 | 76.99 | 0.86 |
| Soyle | 38.66 | 48.14 | 36.30 | 80.35 | 0.97 |
| Elevenlabs | 43.33 | 42.77 | 41.89 | 77.36 | 0.88 |
| Voiser | 38.41 | 40.65 | 31.97 | 80.88 | 1.01 |

↑ indicates that higher scores are better; ↓ indicates that lower scores are better. Bold highlighting indicates the model with the best scores.
Table 5. Comparative Analysis of STT for Nazarbayev University Data.

| Model | BLEU% ↑ | WER% ↓ | TER% ↓ | chrF ↑ | COMET ↑ |
|---|---|---|---|---|---|
| Whisper | 21.97 | 60.55 | 54.36 | 68.36 | 0.30 |
| GPT-4o-transcribe | 53.46 | 36.22 | 23.04 | 81.15 | 1.02 |
| Soyle | 74.93 | 18.61 | 18.61 | 95.60 | 1.23 |
| Elevenlabs | 59.45 | 30.84 | 17.27 | 88.04 | 1.13 |
| Voiser | 47.04 | 37.11 | 22.95 | 84.51 | 1.05 |

↑ indicates that higher scores are better; ↓ indicates that lower scores are better. Bold highlighting indicates the model with the best scores.
Table 6. TTS Comparative Analysis for 24 kz Data.

| Model | STOI ↑ | PESQ ↑ | MCD ↓ | LSD ↓ | DNSMOS ↑ |
|---|---|---|---|---|---|
| MMS | 0.09 | 1.12 | 145.16 | 1.15 | 4.63 |
| TurkicTTS | 0.11 | 1.16 | 129.54 | 1.06 | 5.92 |
| KazakhTTS2 | 0.10 | 1.09 | 150.53 | 1.11 | 8.79 |
| Elevenlabs | 0.10 | 1.10 | 164.29 | 1.34 | 6.13 |
| OpenAI TTS | 0.09 | 1.12 | 123.44 | 1.16 | 7.43 |

↑ indicates that higher scores are better; ↓ indicates that lower scores are better. Bold highlighting indicates the model with the best scores.
Table 7. Comparative Analysis of TTS for Nazarbayev University Data.

| Model | STOI ↑ | PESQ ↑ | MCD ↓ | LSD ↓ | DNSMOS ↑ |
|---|---|---|---|---|---|
| MMS | 0.12 | 1.11 | 148.40 | 1.20 | 3.91 |
| TurkicTTS | 0.15 | 1.14 | 145.49 | 1.12 | 6.39 |
| KazakhTTS2 | 0.12 | 1.07 | 137.03 | 1.12 | 8.96 |
| Elevenlabs | 0.13 | 1.08 | 139.75 | 1.29 | 6.38 |
| OpenAI TTS | 0.14 | 1.14 | 117.11 | 1.19 | 7.04 |

↑ indicates that higher scores are better; ↓ indicates that lower scores are better. Bold highlighting indicates the model with the best scores.
