1. Introduction
Children are often referred to as the “digital age generation” because they are growing up in a world where digital technologies have become an integral part of everyday life and are accessible to almost everyone [
1]. This has made them regular active users of smartphones, tablets, and other devices [
2]. Their constant interaction with technology leads to the development of unique features in their speech, such as a variety of intonations and manner of communication, which creates new challenges for automatic voice analysis and processing systems. Moreover, recent studies show that the widespread use of social media has a significant impact on the linguistic environment of Kazakhstani children, accelerating bilingualism and affecting lexical development [
3]. Early exposure to two or more languages through social media, entertainment, and peer interaction leads to inconsistent language use, code-switching, and hybrid lexical forms. These phenomena complicate acoustic modeling and language decoding in ASR systems, especially when applied to children’s speech in low-resource settings.
These features of children’s speech place serious demands on ASR and verification systems. While automatic speaker verification (ASV) has advanced considerably in adult speech processing thanks to deep learning [
4], the accuracy of such systems decreases dramatically when working with children’s voices. Studies show that the error rate can increase by 40–45% in relative terms when models trained on adult voices are used to enroll and verify child speakers [
5,
6]. This decrease is due to the mismatch between the training and testing conditions of the models, caused by physiological and articulatory differences between adults and children, particularly in the 3–14-year age group. Significant differences in children’s vocal tract characteristics, immature articulatory skills, and limited access to specialized children’s speech data make it difficult to develop adapted models [
7].
Various approaches are used to solve these problems. For example, data from the CMU Kids Corpus, PF-STAR, and My Science Tutor (MyST) databases [
8,
9,
10] can be used in conjunction with large adult speech corpora such as VoxCeleb2 [
11] to train acoustic models. This reduces dependence on scarce children’s speech resources and improves model accuracy. However, to achieve significant improvements, it is necessary to consider the unique features of children’s voices and the variability of their characteristics.
According to research [
12], the word error rate (WER) in children’s speech recognition can be up to five times higher than that in adults. The difficulty lies in the lack of specialized corpora for child speech, whereas adult speech is more readily available from sources such as news broadcasts, interviews, and public records. Collecting a children’s speech corpus is a complex task due to ethical and technical factors. Firstly, informed consent must be obtained from parents or legal guardians, which complicates the data collection process. Secondly, children’s speech is highly variable; it can be unclear and unstable in tempo and phonetics, especially in young children or those with speech disorders. In addition, children tire quickly, limiting the duration and volume of recordings. It is also necessary to take into account age stratification and dialect diversity to ensure the corpus is representative. As of 2013, there were 13 child speech corpora with full or partial language transcriptions and a high level of digital resources; however, there was no specialized children’s acoustic corpus for low-resource languages such as Kazakh [
13]. To fill this gap, researchers have proposed an approach that combines large adult speech corpora with small amounts of children’s data to train acoustic models [
14,
15,
16,
17,
18]. Experiments in previous studies, e.g., by Tong et al. and by Shahnawazuddin et al., have shown that combining adult speech with small amounts of child speech, using methods such as pitch adaptation and transfer learning, can significantly reduce the amount of child-specific data required while maintaining ASR performance.
A child’s voice is characterized by a higher timbre and significant changes in the frequency range with age. As the data in
Figure 1 show, the pitch distribution in children is much wider than in adult men and women. These differences require additional solutions, since the high-frequency characteristics of children’s speech alone are insufficient for building effective ASR models; however, lower-frequency adult data can be used to improve the models.
Studies [
9,
14] also emphasize that standard Mel-frequency cepstral coefficients (MFCC) are vulnerable to pitch effects. To address this problem, methods have been proposed for filtering high-frequency components depending on the speaker’s pitch [
4] and scaling the pitch of children’s speech to bring it closer to the adult range [
5]. In [
6], differences in speech speed were additionally taken into account, and in [
10], vocal tract length normalization (VTLN) was applied to compensate for spectral variations. These approaches aim to reduce acoustic variability and adapt systems to the unique features of children’s speech.
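To make the pitch-scaling idea concrete, the following minimal sketch (our illustration, not code from the cited studies) uses librosa to estimate a child recording’s median pitch and shift it toward an assumed adult target of about 220 Hz; the file name and target value are hypothetical.

```python
import numpy as np
import librosa
import soundfile as sf

# Load a child utterance (hypothetical file name), resampled to 16 kHz.
y, sr = librosa.load("child_utterance.wav", sr=16000)

# Estimate the fundamental frequency (F0) contour and take its median.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=100, fmax=500, sr=sr)
median_f0 = float(np.nanmedian(f0))

# Shift the pitch so the median F0 lands near an assumed adult target (~220 Hz);
# for typical child voices the resulting shift is negative (downward).
n_steps = 12 * np.log2(220.0 / median_f0)
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

sf.write("child_utterance_pitch_adapted.wav", y_shifted, sr)
```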
Amid the rapid evolution of the digital society, developing effective communication skills from an early age has become increasingly important. Ensuring equitable access to mother tongue education for all children, regardless of their linguistic or social background, is emerging as a key priority. In this context, preschools and educational institutions play a critical role by laying the groundwork for children’s future language acquisition and overall cognitive development. Research by G. Lyamina [
19] highlights notable individual differences in early speech activity levels. For example, during a 30 min play session, children aged 2–3 typically produce 25–28 words; those aged 3–4 produce 70–80 words; and by 4–4.5 years, their speech output reaches 110–115 words [
20]. As a result, evaluating and identifying speech developmental delays is most effective and informative within the 3–4 age range, when speech abilities are expected to be more fully developed.
2. Related Works
Modern approaches to speech recognition include convolutional (CNN), recurrent (RNN), LSTM, Transformer, and Conformer models, as well as wav2vec 2.0 and HuBERT architectures [
21]. Children’s speech is being studied within the framework of the CMU Kids Corpus, OGI Kids’ Speech Corpus, and CHiME-5 projects. For low-resource languages, data augmentation [
22], transfer learning, synthetic speech generation (for example, FastSpeech2 and Tacotron2), and multilingual models [
23] are employed.
Kazakh ASR solutions are represented by KATEK (Kazakh Telephone Corpus), KazakhTTS, and KazakhBERT systems [
24]; however, children’s speech has not been widely studied. Research in the field of transferring models from adult speech to children’s speech confirms the effectiveness of fine-tuning in the presence of a limited corpus of children’s speech.
When recognizing children’s speech under difficult acoustic conditions, it is important to take into account not only the detailed speech patterns but also the physical recording parameters, especially for children with language disorders. Gordon et al., 2025, conducted experiments with 24 children aged 7–12 years who had a confirmed language disorder, investigating the effect of background noise and reverberation on speech recognition accuracy [
25]. A technique was employed to determine the signal-to-noise ratio threshold for 50% word recognition (SNR50) on sentences recorded using a conventional and a remote microphone (RM). The results showed that the use of RMs substantially improves speech intelligibility: under noise alone, the SNR50 improved from −7.8 dB to −19.9 dB, and under noise with reverberation, it improved from −4.0 dB to −10.9 dB. The greater gain in intelligibility under reverberation and noise when using RMs indicates that, in addition to algorithmic approaches (e.g., refinement of ASR models), technical considerations are also important, such as microphone selection and placement, as well as noise reduction. The study thus establishes prerequisites for the continuous assessment of pedagogical effectiveness for children with special educational needs within an integrated approach.
Maatallaoui et al. [
26] examined the prospects for integrating ASR technologies into applications for teaching children foreign languages, focusing on pronunciation and second language acquisition (L2). The researchers highlighted several key challenges in child speech recognition, including the high variability in children’s speech, limited availability of training data, influence of the first language (L1), and physiological differences in the child’s vocal tract. Special attention was paid to the comparative analysis of Wav2vec v2.0 and Whisper models, with the latter demonstrating greater robustness to non-native and spontaneous speech. The authors emphasized that standard ASR systems developed for adult speech struggle with children’s speech, and strategies for processing non-native speech tend to result in low accuracy. To solve these problems, the authors proposed (1) data augmentation, not only by modifying the original audio files but also by expanding the size and diversity of the training corpus; (2) hybrid modeling that combines traditional acoustic models with transformer architectures; and (3) feature mapping, which relates the characteristics of children’s voices to those of adults, enabling the use of pretrained models originally developed for adult speech. Instead of building new models from scratch, which requires large amounts of data and computing power, this approach adapts existing models using a smaller amount of child-specific data. This significantly reduces the effort, time, and resources needed, while still allowing the models to perform well with children’s speech. However, the authors emphasized the need for additional empirical research.
Block Medin et al. [
27] conducted a systematic comparison of three state-of-the-art self-supervised models—Wav2Vec v2.0, HuBERT, and WavLM—for phoneme recognition in the speech of French children (ages 5–8 years) in the context of assessing reading literacy. The baseline model was a Transformer + CTC system, first trained on the Common Voice corpus and then fine-tuned on an in-house speech corpus of French children (13 h of audio). The results showed that WavLM base+ outperformed the other models, achieving a PER of 41.5% with only CTC-layer adaptation after pretraining on ~94,000 h of unlabeled English speech. Deeper fine-tuning (unfreezing the Transformer blocks) on the children’s corpus reduced the PER to 26.1%, which is 33% lower than with shallow adaptation. The authors also tested transfer learning on the large MyST corpus (161 h of English), but performance suffered due to mismatches in age, language, and speaking style (spontaneous versus read speech). The WavLM base+ model fine-tuned on the in-house corpus demonstrated excellent results on complex word lists and pseudo-word discrimination tasks, especially under high-noise conditions (SNR < 10 dB), with a relative decrease in PER of approximately 44% compared to the baseline. Thus, WavLM base+ demonstrated robustness to noise and flexible applicability to low-resource languages without large annotated corpora, making it promising for educational speech technologies.
Below we review scientific works devoted to the problems of Kazakh speech recognition, each of which uses its own terminological base. The authors interpreted and applied such concepts as phonemes, allophones, graphemes, and other units of oral and written speech differently, which is due to differences in methodological approaches and specific research goals. In our work, we relied on descriptions and terminology, striving to preserve the main semantic meaning without distorting the scientific content. We evaluated ASR model outputs at the orthographic level using Kazakh Cyrillic transcriptions. Phonological terms like “phoneme” were applied post hoc to interpret recurring substitution errors (e.g., қ → к as /q/ → /k/), based on earlier studies of Kazakh sound patterns and their relevance for writing reform and language technology [
28,
29,
30]. We did not use IPA transcriptions, analyze allophones, or equate letters with sounds; rather, phonological categories were used as explanatory tools to identify systematic recognition challenges. This approach aligns with other ASR studies on low-resource languages, where phoneme-level annotation is limited but linguistic interpretation remains essential. For example, Karabaliyev and Kolesnikova [
31] conducted a detailed comparison of Kaldi, Mozilla DeepSpeech, and Google Speech-to-Text for Kazakh ASR using 101 recordings from speakers with diverse dialects and fluency. The systems frequently made errors in recognizing Kazakh-specific phonemes. For example, the voiceless uvular stop қ was often confused with the velar stop к (e.g., қала → кала, translit: qala → kala), the nasal ң was replaced by н (/ŋ/ → /n/), and the rounded vowels ү and ұ were often misrecognized as у (/y/, /ʊ/ → /u/). Additionally, the models failed to follow vowel harmony rules, such as incorrectly substituting ы (/ɯ/) with і (/i/), which affects word formation. Case endings such as -да, -нің, and -ға were frequently omitted or altered in long agglutinative word forms (e.g., менің кітаптарымда → менің кітаптарым, translit: menıñ kitaptarymda → menıñ kitaptarym). While Google STT had the lowest WER (52.97%), none of the systems reached an acceptable accuracy. The authors proposed a hybrid architecture integrating TDNN, Transfer Learning, and RNNLM to better model Kazakh’s morphological structure.
Gretter et al. [
32] presented the results of the large-scale international ETLT 2021 challenge, designed to improve the quality of ASR systems for non-native children’s speech. The challenge provided 101.6 h of English and 6.7 h of German children’s audio responses, recorded during an assessment of the language competence of students aged 9–16 years. The target audience of the challenge was researchers, developers, and scientists of all levels. In addition, the challenge provided both labeled and unlabeled data, which allowed participants to develop fully supervised as well as semi-supervised models. The baseline systems were implemented in Kaldi using TDNN-F acoustic models trained with LF-MMI on MFCC + i-vector features. Augmentation techniques included three-way speed perturbation and SpecAugment. The best participants additionally used wav2vec2, Sequence-to-Sequence architectures, and RNNLM rescoring. In English, the best WER was 23.98% (compared to a baseline of 33.21%), and in German, it was 23.50% (against a baseline of 45.21%). Most of the winning systems used long-context representations and spectral augmentation and did not resort to training on adult data. This highlights the need to create specialized models and corpora that reflect the age and accent features of children’s speech, especially under conditions of low resource availability and variable L2 pronunciation.
Jain et al. [
33] conducted a comprehensive study on the adaptation of the Whisper model for child speech recognition, comparing it with fine-tuning of the self-supervised wav2vec2 model. During the experiments, the authors evaluated three main children’s speech datasets, MyST, PFSTAR, and CMU Kids, along with the adult dev-clean set (LibriTTS) for an objective comparative assessment of the models’ results. Whisper was tested on several datasets, both in zero-shot mode and after fine-tuning on different samples. All experiments demonstrated that the original (non-fine-tuned) Whisper is less accurate than wav2vec2 for child speech recognition (WER on MyST: 25% for Whisper versus 12.5% for wav2vec2-large). However, after fine-tuning, Whisper improves significantly (11.66% on MyST), although it still lags behind wav2vec2 (7.42%). The exception is CMU Kids, where Whisper, trained on other data, showed a lower error than wav2vec2, which may reflect Whisper’s robustness to domain variability. Additional experiments also showed that model size plays a role (the best results are achieved with Whisper Large-V2 and wav2vec2-large). Thus, the authors concluded that although wav2vec2 adapts better to the task when data are available, Whisper is more versatile under an “unknown” data distribution and can be used in adaptation and generalization tasks, as well as in low-resource conditions.
Kim et al., 2025, studied the practical applicability of ASR in the diagnosis of speech sound disorders (SSD) in a South Korean pediatric population by comparing the results obtained from speech-language pathologists (SLPs) with those of the Wav2Vec2-XLS-R-1B model, which was trained on 93.6 min of speech from children with articulation disorders [
34]. The standardized APAC and U-TAP tests, scored with the phoneme-level error rate (PIPE), consonant PIPE (C-PIPE), and percentage of correct consonants (PCC), were used for validation. The average PIPE was 8.42% (APAC) and 8.91% (U-TAP), while the C-PIPE was 10.58% (APAC) and 11.86% (U-TAP). The ICC between the ASR and SLP results was 0.984 (APAC) and 0.978 (U-TAP), indicating high comparability. The most common errors were associated with final consonants, liquids (l, t, k), and the final position after the vowel i, as well as with complex CVC-CVC words. They noted that the model was particularly good at making sounds, rather than recognizing them, which underscores the need for visual control and the development of test words. This, in turn, suggests that a model pre-trained on fragments of children’s speech can yield high-quality results and can help evaluate SSD, but cannot replace a speech therapist.
Liu et al., 2025, proposed the TAML-Adapter method, a parameter-efficient fine-tuning technique for adapting ASR models to low-resource languages [35]. It is based on the XLS-R (0.3B) architecture, with adapters integrated after each Transformer layer. A distinctive feature of the approach is the use of Task-Agnostic Meta-Learning (TAML) to initialize the adapter parameters before they are fine-tuned on the target languages. During meta-training, data from three high-resource languages (German, Italian, and Swedish) from Common Voice were used, followed by fine-tuning on five target low-resource languages (Arabic, Kazakh, Marathi, Nynorsk, and Swahili), including FLEURS sub-corpora. Experiments showed that TAML-Adapter significantly reduced WER compared to classic fine-tuning (FT), adapter tuning (AT), and Meta-Adapter (MAML): the average improvement in the WER was −12.32% compared to FT, −8.02% compared to AT, and −2.5% compared to MAML on Common Voice; on FLEURS, it was −8.68% (FT) and −2.88% (AT). TAML-Adapter retains generalization ability and avoids bias towards the meta-learning languages thanks to an entropy-based loss function. This approach demonstrates high efficiency in adapting ASR models to low-resource languages without increasing the number of trainable parameters, while maintaining scalability.
Table 1 shows how different studies have approached child speech recognition, especially for languages with limited data. It summarizes the target user groups, the methods used, and how the models were adapted for children’s speech. This comparison of approaches helps explain why our method is well suited to the Kazakh language.
Figure 1 illustrates the main architectures and approaches used in modern ASR systems for recognizing child speech across different languages and resource conditions.
The constructed pie chart reflects the distribution of studies on the use of adult speech adaptation in ASR for children. More than 70% of the studies employ transfer learning, fine-tuning, or meta-learning, which effectively helps to compensate for the scarcity of child speech data. Only a few studies manage without adaptation, relying solely on native child speech corpora. This confirms that adapting adult models is a key strategy in ASR for child speech, especially in low-resource language settings. In conclusion, it is evident that creating an application that uses a correctly compiled children’s thematic dictionary can significantly improve the development of speech in children.
3. Materials and Methods: Data Collection and Preprocessing
3.1. Corpus Development and Data Collection
Modern end-to-end ASR models such as Whisper, ESPnet (in E2E mode), DeepSpeech, and Vosk convert raw audio directly into text without requiring any explicit phonetic or phonological annotation. These models are trained on paired audio-text data and do not rely on manually labeled phonemes, allophones, or pronunciation dictionaries. Unlike traditional systems (e.g., Kaldi with GMM-HMM), they learn the mapping between acoustic input and symbolic output automatically. In line with this modeling approach, the development of the Kazakh child speech acoustic corpus focused not on phonetic detail, but on collecting diverse and representative speech data suitable for end-to-end training.
Data collection for creating the Kazakh child speech acoustic corpus was carried out using three methods. The first method involved the development and deployment of a custom Telegram-based bot, called Dataset Loader Bot [
36], developed with Python 3.9 and aiogram 2.25.1. This tool was specifically designed to enable the semi-automated collection and structuring of Kazakh child speech data. It was connected to a dedicated server maintained by the research team, ensuring uninterrupted operation, automatic verification of audio quality, and attachment of metadata such as age, gender, and topic. Telegram was selected as the platform due to its widespread use in Kazakhstan, especially among families with children, as well as its technical advantages: native support for voice messaging, seamless bot integration, and the absence of the need for additional applications. The bot operated according to a structured workflow: users received predefined words and phrases as prompts, recorded them directly in the chat, and submitted their voice messages through the interface. The system automatically assessed the audio quality, checking for duration, clarity, and signal-to-noise ratio, and requested re-recordings, if necessary. Dataset Loader Bot (software version 1.0) is officially registered as a computer program and protected, confirming its originality and legal status.
The second method involved recording children speaking Kazakh in natural environments such as at home, in schools, and in kindergartens, as well as collecting spontaneous speech from publicly available YouTube videos where real children speak in their own voices. We deliberately avoided dubbed cartoons or films, as these typically feature adult voice actors. Instead, we selected only videos that captured authentic child speech in everyday situations. For example, in the video “Learn Kazakh with My Niece” [
37], a young girl speaks Kazakh fluently in a natural setting. This method required additional effort to isolate children’s voices and reduce background noise, but it allowed us to gather more representative and realistic speech data for our target age group.
The third method involved using a voice recorder to collect high-quality recordings of children’s speech under more controlled conditions. An iPhone 14 Pro Max (Apple Inc., Cupertino, CA, USA) connected to a macOS computer (Apple Inc., Cupertino, CA, USA) was used to record audio in WAV format at a 16 kHz sampling rate. During these sessions, children were asked to complete simple tasks such as identifying colors, naming objects in pictures, and counting, which helped elicit clear, age-relevant speech samples. All recordings obtained through this method were manually annotated by the research team. Each file was carefully reviewed and segmented; utterances were transcribed, and metadata such as age and gender were assigned. Recordings were excluded if the child’s speech was unintelligible, if an adult dominated the recording, or if the utterance was incomplete (e.g., the beginning was not captured). Final recordings were organized and stored using Google Drive, a cloud service provided by Google LLC (Mountain View, CA, USA), for further use.
The personal data of underage children were collected and processed only with the consent of their parents or legal guardians. All data, including audio recordings and metadata (such as name, address, gender, etc.), were anonymized. Unique ID codes were used instead of names. The research was conducted in full compliance with the following legal and ethical standards: [
38,
39,
40,
41].
Table 2 describes the data collection for the acoustic corpus of child speech and presents the methods and recording characteristics.
The use of diverse data collection methods—a Telegram bot, recordings in natural environments, and controlled recordings using a voice recorder—ensured the generation of a high-quality, well-structured corpus suitable for analyzing child speech. This corpus can also be utilized for training and analysis based on age-specific characteristics.
At the data processing stage, the collected words and phrases were grouped into themes such as nature, family, and everyday life. In parallel, vocabulary was divided by age groups of 3–4, 5–6, and 7–8 years to match the cognitive and language development stages of children. This helped make the material more age-appropriate and engaging.
For example, children aged 3–4 were given simple and familiar words such as жер (‘earth’, (zher)), ағаш (‘tree’, (aghash)), and алма (‘apple’, (alma)). In the 5–6 age group, vocabulary became more descriptive, including items such as жабайы жануар (‘wild animal’, (zhabaiy zhanwar)) and тәтті қызанақ (‘delicious tomato’, (tatti qyzanaq)), which helps develop expressive language. In the 7–8 age group, full phrases were used, for example, орманда неше түрлі жидектер өседі (‘many kinds of berries grow in the forest’, (ormanda neshe türli zhidekter ösedi)) or достар күшікпен серуендеуге шықты (‘the friends went for a walk with the puppy’, (dostar küşikpen seruendeuge shykty)), to support sentence-building skills.
In addition to age-based groups, we created a separate category for complex words, such as шанышқы (‘fork’, (shanyshqy)) and бүлдіршін (‘toddler’, (büldirshin)), which may be challenging regardless of age. This categorization is based on pedagogical and speech therapy principles, which emphasize matching vocabulary to a child’s developmental stage [
42].
The division of vocabulary into age-specific groups and the inclusion of a category for complex words made it possible to structure the learning process in a way that matched the comprehension abilities of each group.
Many of the speech samples we collected, including the examples shown in
Table 3, were recorded using smartphones. This method proved to be one of the most convenient under real conditions, especially when working with families at home or in preschool environments. Smartphones were easy to use, did not require special equipment, and allowed us to capture natural, spontaneous speech from children across different age groups. Their availability made it possible to collect more balanced and diverse data, even in low-resource settings.
To initiate data collection, we developed a custom Telegram bot called Dataset Loader Bot, designed to systematically collect and organize child speech data. The recording process followed a clear structure: a list of words and phrases was uploaded to the bot and updated weekly to reflect new content.
Users (typically parents or educators) interacted with the bot, received short instructions, and were asked to record specific words or phrases. They recorded each item by holding the microphone button and then submitted the audio. The bot automatically saved each recording, along with metadata such as age group and topic, and even checked the quality, requesting re-recordings, if needed.
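The voice-collection step of such a bot can be illustrated with the following simplified sketch using aiogram 2.x; this is not the registered Dataset Loader Bot itself, and the token placeholder, file naming, and the one-second duration check are assumptions.

```python
import os
from aiogram import Bot, Dispatcher, executor, types

bot = Bot(token="YOUR_BOT_TOKEN")          # placeholder token
dp = Dispatcher(bot)

SAVE_DIR = "recordings"
os.makedirs(SAVE_DIR, exist_ok=True)

@dp.message_handler(commands=["start"])
async def send_prompt(message: types.Message):
    # Send the current target word or phrase as a recording prompt.
    await message.answer("Please record the word: алма (apple)")

@dp.message_handler(content_types=types.ContentType.VOICE)
async def save_voice(message: types.Message):
    # Simplified quality check: reject recordings shorter than one second.
    if message.voice.duration < 1:
        await message.answer("The recording is too short, please try again.")
        return
    # Save the OGG voice note under the sender's numeric ID (an anonymized speaker code).
    path = os.path.join(SAVE_DIR, f"{message.from_user.id}_{message.message_id}.ogg")
    await message.voice.download(destination_file=path)
    await message.answer("Thank you! The recording has been saved.")

if __name__ == "__main__":
    executor.start_polling(dp, skip_updates=True)
```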
This method enabled the efficient and organized collection of audio data by both age and theme, making it suitable for later use in educational or speech recognition applications. The figures below provide a detailed illustration of each stage of the Dataset Loader Bot workflow in Telegram.
A structured database containing words and phrases adapted for a child audience was created based on the collected data. Each user is assigned a unique identifier (ID), under which all audio recordings are stored. Each recording is linked to the user’s ID and is accompanied by metadata such as the child’s age and gender, as well as the specific word or phrase that was recorded.
This process ensures that each recording is stored in an organized way and linked to relevant metadata, such as age group and topic. This structure makes it easier to use the data later in educational tools and research.
Figure 2 presents a horizontal block diagram illustrating the process of collecting, processing, and storing audio data using the Dataset Loader Bot system. The diagram illustrates the main stages of user interaction with the bot, as well as the internal data processing steps for structured storage and subsequent use.
Approximately 800 audio recordings containing speech data from children aged 2 to 8 years were collected using the Dataset Loader Bot. The total volume of the corpus consisted of approximately 1000 text entries and approximately 12 min of audio material. The database contains recordings from approximately 30 children (17 girls and 9 boys) and features 75 unique words and phrases selected based on thematic relevance and appropriate complexity levels for a child audience.
This structure provides an informative corpus suitable for the development of educational and therapeutic programs. To ensure the usability of audio data collected through the Telegram bot, thorough processing, involving several stages of sound quality enhancement, is conducted.
The recordings are initially saved in OGG format and, after data extraction, are converted to WAV, a format more suitable for subsequent training and analysis.
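As a hedged illustration of this conversion step (the authors used a desktop converter for this purpose), the following sketch assumes pydub with FFmpeg installed and resamples every OGG voice note to 16 kHz mono WAV:

```python
import os
from pydub import AudioSegment  # requires FFmpeg to be available on the system

def ogg_to_wav(src_dir: str, dst_dir: str, sample_rate: int = 16000) -> None:
    """Convert every Telegram OGG voice note in src_dir to 16 kHz mono WAV files."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if not name.lower().endswith(".ogg"):
            continue
        audio = AudioSegment.from_file(os.path.join(src_dir, name), format="ogg")
        audio = audio.set_frame_rate(sample_rate).set_channels(1)
        audio.export(os.path.join(dst_dir, name[:-4] + ".wav"), format="wav")

ogg_to_wav("recordings", "recordings_wav")
```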
Table 4 presents the details of the collected audio recordings, including the distribution of data by age and gender, as well as the number of recordings for each age group.
These data provide insight into the composition of the corpus and its suitability for use in educational and research purposes.
Figure 3 visualizes the number of audio recordings for each age group, showing the distribution between girls and boys. This data distribution allows for easy assessment of the corpus’s representativeness by age and gender, as well as the identification of potential gaps for further data collection (
Figure 4).
Thus, the age structure and gender composition of the corpus provide broad coverage of speech data across different ages, making it valuable for tasks associated with the training and analysis of children’s speech.
3.2. Technology and Processing Audio Recordings Using Libraries
The processing of audio recordings was conducted in stages using various libraries and algorithms, enabling the preparation of data for analysis and the generation of training models for speech recognition. All files were processed using a script containing functions for removing noise, trimming excess silence, and increasing volume. The script was written in the PyCharm 2024.1.4 (JetBrains, Prague, Czech Republic) environment [
43] in Python 3.11.9 (Python Software Foundation, Wilmington, DE, USA) [
44] using the following libraries: librosa [
45] for reading, processing, and analyzing audio signals; torch [
46], a framework for machine and deep learning; soundfile [
47] for reading and writing audio files; noisereduce [
48] for removing noise from audio recordings; and os [
49] for working with the file system. After processing, including noise reduction and silence trimming, the total duration of the audio recordings decreased due to the removal of pauses between words.
To assess the quality of data processing and visually analyze the enhanced audio recordings, a spectral analysis method was employed, enabling visualization of the signal’s frequency content. The study utilized the Short-Time Fourier Transform (STFT) [
50], which decomposes the signal into sinusoidal components of varying frequencies, allowing identification of the presence and amplitude of frequency components within the audio signal.
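A minimal sketch of this STFT-based visualization with librosa is shown below; the 25 ms window, 10 ms hop, and file path are illustrative assumptions rather than the exact settings used in the study.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("processed/orman.wav", sr=16000)   # hypothetical processed file

# Short-Time Fourier Transform with a 25 ms window and a 10 ms hop at 16 kHz.
n_fft, hop_length = 400, 160
S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
S_db = librosa.amplitude_to_db(np.abs(S), ref=np.max)   # amplitude in decibels

librosa.display.specshow(S_db, sr=sr, hop_length=hop_length, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram of a processed child utterance")
plt.tight_layout()
plt.show()
```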
Figure 5 illustrates the processing stages, providing a clearer understanding of the workflow.
During audio data processing, a sequence of steps is performed to improve sound quality and prepare the files for analysis. A description of each stage is provided below, starting from the initial conversion to the WAV format.
3.3. Conversion of Audio Files to Standard Format (WAV)
The preliminary processing of each audio recording, whether individual words, phrases, or sentences, was carried out as follows: First, all audio files were converted to the WAV format. This format preserves high sound quality since it is uncompressed. To efficiently and quickly convert a large number of recordings, the VSDC converter application (version 6.8.6.352, Flash-Integro LLC, Tashkent, Uzbekistan) [
51] was used.
The Librosa, Soundfile, Torch, and Noisereduce libraries were used to remove background noise and equalize volume. In the process_audio method, audio files were loaded using librosa.load, noise was then suppressed via noisereduce.reduce_noise, and silence was cut using librosa.effects.trim. After this, the audio volume was increased using a gain function, and the processed file was saved in the .wav format via soundfile.write. The process_audio function accepts a directory containing audio files for processing and returns the path to the folder with the processed files; it is called in a loop to batch process all audio files in the specified folder. The program code is shown in
Figure 6.
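Because Figure 6 reproduces the code only as an image, a minimal re-creation of the described process_audio pipeline is sketched below; the gain factor, trimming threshold, and directory handling are assumptions rather than the exact values used, and the torch dependency mentioned above is omitted for brevity.

```python
import os
import numpy as np
import librosa
import noisereduce as nr
import soundfile as sf

def process_audio(input_dir: str, output_dir: str = "processed") -> str:
    """Denoise, trim silence, and amplify every WAV file in input_dir; return the output folder."""
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        if not name.lower().endswith(".wav"):
            continue
        y, sr = librosa.load(os.path.join(input_dir, name), sr=16000)
        y = nr.reduce_noise(y=y, sr=sr)              # suppress stationary background noise
        y, _ = librosa.effects.trim(y, top_db=30)    # trim leading/trailing silence (assumed 30 dB)
        y = np.clip(y * 1.5, -1.0, 1.0)              # simple gain with clipping protection
        sf.write(os.path.join(output_dir, name), y, sr)
    return output_dir

process_audio("recordings_wav")
```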
To make the recording continuous, pauses were removed using librosa.effects.trim. This function reduced unnecessary silence at the beginning and end of recordings, leaving only the key signal. Furthermore, the recording was amplified using a gain adjustment, which improved the clarity of the sound.
Table 5 shows the key parameters of the audio files before and after processing using the script. The main focus is on the duration of pauses, the volume of the signal, and the level of background noise.
To better understand the effectiveness of the audio processing script, spectrograms of the same audio recording were generated before and after processing. The recording used was of the word “орман” (‘forest’, (orman)), spoken by a 4-year-old boy. It was 2 s long.
Figure 7 shows the spectrogram of this word before processing.
Figure 8 shows the spectrogram of the same word after processing.
Comparing the two figures shows that the contours of the spectrogram became sharper and some scattered noise points disappeared after processing, indicating a reduction in background noise. Additionally, the yellow lines (representing the child’s voice) shifted slightly toward the beginning of the spectrogram, showing that unnecessary silence at the start of the recording was trimmed.
To separate the voice from background sounds, the HPSS method (Harmonic-Percussive Source Separation) was used. This method, implemented in librosa.effects.hpss(), divides an audio file into harmonic (primarily vocal) and percussive (mostly background noise) components. As a result, a cleaner speech signal is extracted, which is especially important for training speech recognition systems.
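A minimal example of this separation step, using librosa’s default HPSS parameters and a hypothetical file path:

```python
import librosa
import soundfile as sf

y, sr = librosa.load("processed/orman.wav", sr=16000)   # hypothetical processed file

# Split the signal into a harmonic component (dominated by the voice)
# and a percussive component (transient background noise).
y_harmonic, y_percussive = librosa.effects.hpss(y)

sf.write("orman_voice.wav", y_harmonic, sr)
sf.write("orman_background.wav", y_percussive, sr)
```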
A spectrogram is a visual representation of the spectral content of an audio signal as a function of time. It simultaneously displays three key variables: frequency, amplitude, and time [
52]. In other words, it shows which frequencies are present in the signal at different time points. The horizontal axis represents time, while the vertical axis represents frequency, i.e., the number of sound wave vibrations per unit of time, measured in Hertz (Hz); the higher the frequency, the higher the perceived pitch. Amplitude refers to the strength or loudness of the sound and is represented as a matrix of values in decibels (dB). On a spectrogram, amplitude is visualized using color or brightness: the more intense the color at a given point, the louder the sound at that time and frequency (
Table 6).
Adjusting these parameters allows for the optimization of the spectrogram to analyze different types of signals. For example, the choice of window size and hop length affects the accuracy of temporal and frequency resolution, while logarithmic scaling improves the visibility of low frequencies, which is important for speech processing tasks.
Children’s speech recordings taken from online videos required additional processing, including cutting out background music, reducing noise, and increasing volume. Segments with loud background music were difficult to clean completely, as removing the music could also remove parts of the speech. As a result of this multi-stage processing, the data volume was reduced from 15 to 9 min. To prevent quality degradation, the processing was carried out in two stages: initial noise reduction and silence removal, followed by separation into voice and background.
Figure 9 shows the visualization of the audio recording before and after processing, including noise reduction and voice extraction from background music.
The time-domain graph illustrates how the signal amplitude changes over time, allowing for the assessment of key sound characteristics: duration, intensity, and dynamic variations.
4. Architecture and Methodology of the ASR System for Kazakh Children’s Speech
An important task in this study was developing a robust and linguistically adaptive ASR system tailored to Kazakh-speaking children. This required designing a system architecture that could address the unique phonetic, acoustic, and linguistic characteristics of child speech, while also overcoming the challenges associated with low-resource language environments. The methodology combined acoustic preprocessing, data segmentation, model selection, and language adaptation strategies, all of which were aligned with the needs of early childhood speech development. The following sections describe the core components of the proposed ASR pipeline, the models employed, and the rationale behind the chosen adaptation techniques.
The overall architecture of the proposed ASR system is illustrated in
Figure 10.
The automatic recognition of children’s speech is a multi-stage process, the architecture of which is illustrated in
Figure 10. This process has been specifically adapted to accommodate the unique phonetic and linguistic characteristics of child speech. At the Input Stage, the system receives audio recordings collected through microphones, smartphones, or a dedicated Telegram bot. Child speech differs significantly from adult speech: it features a higher pitch, unstable articulation, spontaneous intonation, and often occurs in noisy environments. These factors necessitate enhanced audio quality and rigorous preprocessing.
In the Feature Extraction Stage, acoustic features are computed from the raw signal using MFCCs or log Mel spectrograms. These features are then passed to the Acoustic Modeling Stage, where deep learning models trained on speech fragments are applied to learn time-aligned representations. Due to the limited availability of annotated child speech data, models initially trained on adult speech are often adapted through fine-tuning techniques using a dedicated child speech corpus. In this study, four ASR models—Whisper, DeepSpeech, ESPnet, and Vosk ASR—were employed. The Whisper, DeepSpeech, and ESPnet models were fine-tuned on the collected Kazakh children’s speech corpus to account for age-specific pronunciation patterns and acoustic variability. The Vosk ASR system, however, was used in its standard pretrained configuration due to the lack of support for custom training.
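As an illustration of this fine-tuning setup, the sketch below shows a single training step for Whisper using the Hugging Face transformers implementation; the checkpoint name, learning rate, and single-example batching are assumptions, not the exact configuration used in this study.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")   # assumed checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(audio_array, sampling_rate, transcription):
    """One fine-tuning step on a single (audio, transcription) pair from the child corpus."""
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    labels = processor.tokenizer(transcription, return_tensors="pt").input_ids
    # Whisper computes the cross-entropy loss internally when labels are provided.
    outputs = model(input_features=inputs.input_features, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

In practice, mini-batching, label padding, and a learning-rate schedule would be added on top of this skeleton.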
Following acoustic decoding, the output enters the Language Modeling Stage, where raw phonetic hypotheses are transformed into coherent word sequences. Given the simplified syntax, limited vocabulary, and frequent grammatical inconsistencies in children’s speech, the language model was adapted with age-appropriate lexicons and linguistic patterns specific to Kazakh-speaking children.
After the acoustic model generates a sequence of sounds, the language model plays a key role in converting them into meaningful words and phrases. It relies on statistical patterns of word usage, syntax, and grammatical structures to make educated guesses about what the speaker intended to say. To make the language model more effective for Kazakh child speech, we adapted it using real examples from our collected dataset. These included word lists and sentence patterns commonly found in children’s spontaneous speech—often short, simple, and occasionally incomplete or ungrammatical. Because Kazakh is a morphologically rich agglutinative language, special attention was given to how children use affixes and endings, which often differ from standard adult forms. In addition, the model was tuned to handle common features of child speech, such as unclear articulation, pronunciation variability, and occasional code-switching with Russian. These adjustments allowed the language model to more accurately interpret the outputs of the acoustic model, especially in cases where the input was noisy or fragmentary.
At the Output Stage, the system generates final textual transcriptions. These transcriptions are used for speech development analysis, educational application design, and early diagnosis of speech-related disorders. The architecture also incorporates a feedback loop, where typical recognition errors and developmental speech features are used to iteratively improve model performance. This is especially critical in low-resource settings such as the Kazakh language, where high-quality, age-specific corpora are scarce.
Based on this architecture, in the present study, we implemented a practical ASR system using the aforementioned models. A key aspect of this work was the construction of a novel Kazakh child speech corpus covering a broad range of age groups and lexical categories. Special attention was given to improving recognition quality under conditions of limited data, high phonetic variability, and background noise. The following sections detail the data collection methodology, preprocessing pipeline, model adaptation strategies, and comparative evaluation based on standard ASR metrics.
Table 7 summarizes the key properties of the ASR models applied in this study.
This table summarizes both the models fine-tuned for Kazakh children’s speech and the model used in its original pretrained form. It also highlights each model’s language coverage, size, and notable features relevant to child ASR tasks.
Figure 11 shows a conceptual framework outlining key challenges and adaptation strategies for developing ASR systems for Kazakh-speaking children in low-resource settings.
This framework outlines the main challenges and research directions in developing ASR systems for Kazakh-speaking children in low-resource conditions. It extends existing models by addressing both language-specific and age-specific mismatches, which are particularly prominent in child speech applications.
The development of ASR systems for Kazakh children is of great significance for both educational and inclusive technology domains. Such systems can support the early diagnosis of speech disorders, improve access to digital learning resources in the Kazakh language, and serve as core components of smart classroom platforms, speech therapy tools, and mobile language learning applications. These technologies are especially relevant in the context of national digital transformation and the growing role of Kazakh in electronic services and educational systems.
Several key challenges arise in designing ASR systems for children. First, models trained on adult speech typically fail to generalize to children’s speech due to fundamental differences in articulation and acoustics. Second, child speech is highly variable, differing in rate, intonation, pronunciation, and phonetic stability. Third, there is a critical lack of publicly available corpora for Kazakh child speech. Finally, data collection itself poses challenges, including ethical concerns and the need for parental consent.
Child speech differs from adult speech in multiple ways. Acoustically, it is characterized by higher pitch, shorter articulation spans, and unstable phonetic patterns. Linguistically, it tends to use simpler syntax, limited vocabulary, shorter utterances, and frequent disfluencies or spontaneous pauses. All these factors must be carefully considered when designing robust ASR systems for children in under-resourced language environments.
To formalize the concepts presented in the framework, we describe several mathematical models that capture the core processes in Kazakh child ASR systems.
The general ASR model can be defined as a mapping function $f_\theta: x \rightarrow y$, where $x \in \mathbb{R}^T$ is an acoustic input of duration $T$ and the output is a token sequence $y \in V^*$. The model aims to maximize the posterior probability of the output given the input:

$$\hat{y} = \arg\max_{y \in V^*} P(y \mid x; \theta)$$

Here, $\theta$ represents the model parameters, $V$ is the target vocabulary, and $P(y \mid x; \theta)$ is the conditional probability of predicting $y$ given input $x$ under model $\theta$.
In the case of acoustic mismatch, such as when child speech differs from the adult training data, adaptation is performed through fine-tuning with the Connectionist Temporal Classification (CTC) loss:

$$\theta^{*} = \arg\min_{\theta} \, \mathcal{L}_{\mathrm{CTC}}\big(f_\theta(x_{\mathrm{child}}), y\big)$$

where $x_{\mathrm{child}}$ is the child speech input, $y$ is the target transcription, $\mathcal{L}_{\mathrm{CTC}}$ is the CTC loss function, and $\theta^{*}$ is the set of fine-tuned model parameters minimizing the loss.
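A PyTorch sketch of this CTC objective is shown below; the tensor shapes and random values are placeholders standing in for the acoustic encoder output $f_\theta(x_{\mathrm{child}})$ and the target transcriptions $y$.

```python
import torch
import torch.nn as nn

# Toy dimensions: T acoustic frames, N utterances in the batch, C output tokens (incl. blank).
T, N, C = 120, 4, 40
ctc_loss = nn.CTCLoss(blank=0)

# Stand-in for encoder log-probabilities; a real system would produce these from child audio.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Stand-in targets: concatenated token IDs and per-utterance lengths.
target_lengths = torch.tensor([12, 9, 15, 7])
targets = torch.randint(low=1, high=C, size=(int(target_lengths.sum()),))
input_lengths = torch.full((N,), T, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients of this loss drive the fine-tuning of theta
```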
For linguistic mismatch, we define a probabilistic language model that estimates the likelihood of a sequence based on its context:

$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid y_1, \ldots, y_{i-1})$$

where $y_1, \ldots, y_n$ is a sequence of tokens and $P(y_i \mid y_1, \ldots, y_{i-1})$ is the conditional probability of token $y_i$ given its history.
This model is optimized via the negative log-likelihood over a child-specific corpus:

$$\varphi^{*} = \arg\min_{\varphi} \; -\sum_{i} \log P_{\varphi}(y_i \mid y_{<i})$$

where $\varphi$ are the parameters of the language model trained on Kazakh child-directed speech, $y_i$ is the $i$-th token in a child speech transcription, and $y_{<i}$ is the preceding context.
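The chain-rule factorization and its negative log-likelihood can be illustrated with a tiny add-one-smoothed bigram model; the three example sentences are placeholders, not items from the collected corpus.

```python
import math
from collections import Counter, defaultdict

# Placeholder child-style transcriptions (tokenized Kazakh words).
corpus = [["алма", "тәтті"], ["ағаш", "биік"], ["алма", "қызыл"]]

unigram_counts, bigram_counts = Counter(), defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence
    for prev, cur in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[prev][cur] += 1

def bigram_prob(prev: str, cur: str, alpha: float = 1.0) -> float:
    """P(cur | prev) with add-alpha smoothing over the observed vocabulary."""
    vocab_size = len(unigram_counts) + 1
    return (bigram_counts[prev][cur] + alpha) / (unigram_counts[prev] + alpha * vocab_size)

def negative_log_likelihood(sentence) -> float:
    """Negative log-likelihood of a sentence under the bigram model."""
    tokens = ["<s>"] + sentence
    return -sum(math.log(bigram_prob(p, c)) for p, c in zip(tokens, tokens[1:]))

print(negative_log_likelihood(["алма", "тәтті"]))
```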
In low-resource settings, when only a small labeled dataset $D_{\mathrm{child}}$ is available, two primary approaches are considered.

Semi-supervised learning combines labeled and unlabeled data:

$$\theta^{*} = \arg\min_{\theta} \; \mathcal{L}_{\mathrm{sup}}(D_{\mathrm{labeled}}; \theta) + \lambda \, \mathcal{L}_{\mathrm{unsup}}(D_{\mathrm{unlabeled}}; \theta)$$

where $D_{\mathrm{labeled}}$ and $D_{\mathrm{unlabeled}}$ are the labeled and unlabeled datasets, $\lambda$ is a balancing factor for the unsupervised loss, and $\theta^{*}$ denotes the optimized model parameters.
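A schematic PyTorch formulation of this combined objective is given below; the model object and its ctc_loss and predict_logits methods are hypothetical placeholders, and the consistency term is only one of several possible choices for the unsupervised loss.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_batch, unlabeled_batch, lam: float = 0.3):
    """L = L_sup(D_labeled) + lambda * L_unsup(D_unlabeled); lam balances the two terms."""
    # Supervised CTC loss on the small labeled child dataset (hypothetical model API).
    sup_loss = model.ctc_loss(labeled_batch["audio"], labeled_batch["text"])

    # Unsupervised term: consistency between predictions on the original audio
    # and a perturbed (e.g., noise-augmented) version of the same utterances.
    with torch.no_grad():
        reference = model.predict_logits(unlabeled_batch["audio"])
    perturbed = model.predict_logits(unlabeled_batch["augmented_audio"])
    unsup_loss = F.mse_loss(perturbed, reference)

    return sup_loss + lam * unsup_loss
```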
These formal mathematical models provide a comprehensive framework for addressing both acoustic and linguistic mismatches in Kazakh child ASR low-resource conditions. By clearly defining optimization objectives and adaptation mechanisms, they serve as a theoretical foundation for model development. In the next section, we empirically evaluate several state-of-the-art ASR architectures trained and tested using a dedicated Kazakh child speech corpus, based on the modeling strategies outlined above.
5. Practical Results and Evaluation
To evaluate the models, the input data consisted of audio files containing recordings of Kazakh words spoken by children. The audio files were stored and loaded from a preprocessed dataset, where each file had been manually or automatically trimmed to segments containing only the target speech. This preprocessing allowed the system to focus solely on the relevant speech signal, excluding background noise and non-speech fragments. All audio data were processed using a unified pipeline, ensuring consistent and reproducible conditions across all evaluated models.
Figure 12 illustrates the input audio files containing children’s speech, which served as the basis for testing various ASR models under unified preprocessing conditions.
A tabular structure was used to feed audio data into the ASR system, including the path to each file, the corresponding transcription, and the child’s gender, age group, and session ID. An example of the input data format is shown in
Figure 13. In this structure, the first column indicates the relative path to the audio file stored in the database, used for loading the speech signal during training and evaluation. The second column provides the reference transcription (ground truth) that the ASR model aims to predict. The third column specifies the child’s gender (F for female, M for male), and the fourth column indicates the age of the child at the time of recording. The fifth column contains a session ID, which groups multiple recordings belonging to the same speaker. This allows for session-based tracking and optional speaker-specific modeling. Such a structured approach supports consistent data processing and enables detailed analysis of ASR performance across different demographic and session-based factors.
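Such a manifest can be loaded and inspected with pandas; the file name and column headers below mirror the described structure but are assumptions rather than the exact strings used.

```python
import pandas as pd

# Hypothetical manifest; columns mirror the described structure.
columns = ["audio_path", "transcription", "gender", "age", "session_id"]
manifest = pd.read_csv("kazakh_child_manifest.csv", names=columns)

# Check the age and gender balance of the evaluation set.
print(manifest.groupby(["age", "gender"]).size())

# Group recordings by session for speaker-level tracking.
for session_id, rows in manifest.groupby("session_id"):
    audio_files = rows["audio_path"].tolist()
    # each list of files can then be passed to the ASR evaluation pipeline
```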
To objectively evaluate ASR quality in this study, a set of widely accepted metrics was employed: WER, BLEU, TER, CSRF2, and Accuracy. Each of these metrics captures a distinct aspect of alignment between the recognized output and the reference text, collectively providing a multifaceted approach to ASR evaluation.
The WER is the fundamental metric for ASR performance, accounting for insertions, deletions, and substitutions of words [
53]. It provides a clear indication of overall transcription quality in a speech context.
BLEU, originally proposed by Papineni et al., is widely used not only in machine translation but also in speech recognition to assess semantic similarity through n-gram overlap [
54]. It reflects how closely the recognized text matches the intended meaning.
The TER, developed by Snover et al., measures the number of edits required to transform the system output into the reference, making it especially useful for evaluating phrase-level variation and acceptable rewordings [
55].
CSRF2 evaluates transcription accuracy at the character level, which is crucial when dealing with child speech that often includes subtle distortions and non-standard pronunciations.
Accuracy, in contrast, measures the proportion of utterances that are recognized correctly, serving as a concise and intuitive metric.
The combined use of these metrics enabled a robust and comprehensive assessment of the ASR systems, particularly in the context of recognizing non-standardized Kazakh child speech.
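These word- and character-level scores can be computed with standard open-source tools, as in the sketch below, which uses jiwer for WER and sacrebleu for BLEU, TER, and a character-level F-score (shown here as a stand-in for the character-level metric reported in this paper); the reference/hypothesis pair is a toy example.

```python
import jiwer
from sacrebleu.metrics import BLEU, TER, CHRF

references = ["орманда неше түрлі жидектер өседі"]
hypotheses = ["орманда неше түрлі жидек өседі"]

wer = jiwer.wer(references, hypotheses)                            # word error rate
bleu = BLEU().corpus_score(hypotheses, [references]).score         # n-gram overlap
ter = TER().corpus_score(hypotheses, [references]).score           # translation edit rate
chrf = CHRF(beta=2).corpus_score(hypotheses, [references]).score   # character-level F-score

# Utterance-level accuracy: share of hypotheses that match the reference exactly.
accuracy = sum(r == h for r, h in zip(references, hypotheses)) / len(references)

print(f"WER={wer:.3f} BLEU={bleu:.2f} TER={ter:.2f} chrF2={chrf:.2f} Accuracy={accuracy:.2f}")
```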
In this study, three models—Whisper, DeepSpeech, and ESPnet—were evaluated using a dedicated corpus of Kazakh child speech. Each of these models underwent fine-tuning based on its architectural characteristics. Whisper, built on an encoder–decoder Transformer architecture, was adapted using cross-entropy loss and trained on short child-oriented phrases. DeepSpeech, based on an RNN with CTC loss, was fine-tuned using manually cleaned WAV files containing speech from children of various ages.
ESPnet was applied in a seq2seq configuration that integrates CTC and attention mechanisms. It was trained on Kazakh transcriptions and enhanced with a language model. In contrast, the Vosk ASR model was used in its pretrained state (vosk-model-small-kz-0.15) without further adaptation, enabling a fair assessment of its out-of-the-box performance on Kazakh child speech. To ensure comparability, all models were evaluated using the same set of transcription labels and audio files, processed through a unified preprocessing pipeline. This consistency ensured a valid comparison across systems using key evaluation metrics.
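Evaluation with the pretrained Vosk model follows the standard Vosk Python API, as sketched below; the model directory is assumed to be downloaded locally, and the WAV file (16 kHz mono) is a hypothetical example from the test set.

```python
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-kz-0.15")            # path to the downloaded Kazakh model

with wave.open("processed/orman.wav", "rb") as wf:   # hypothetical 16 kHz mono test file
    recognizer = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        recognizer.AcceptWaveform(data)
    result = json.loads(recognizer.FinalResult())

print(result["text"])                                # recognized transcription
```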
To objectively compare ASR model performance on Kazakh child speech, two stages of evaluation were conducted: before and after fine-tuning. In the first stage, Whisper, DeepSpeech, and ESPnet were tested in their original pretrained configurations without adaptation to the Kazakh language, allowing for an assessment of baseline performance.
Table 8 and
Table 9 present the experimental outcomes for short speech segments consisting of single words spoken by children.
Table 8 shows qualitative examples of model outputs. Each row corresponds to one audio file containing a single word, with the reference transcription and outputs from the three models. Clear issues are observed, including sound distortions, word substitutions, insertions, and even language drift, which is particularly noticeable in Whisper’s output before fine-tuning. Words with agglutinative morphology, typical of Kazakh, were especially challenging for the systems.
Table 9 summarizes the quantitative performance metrics, including WER, BLEU, TER, CSRF2, and Accuracy. These metrics allowed us to measure each model’s capacity for accurate phonetic and semantic reproduction. Before fine-tuning, DeepSpeech achieved the best BLEU and TER scores; however, none of the models produced any fully correct transcriptions, as indicated by the zero Accuracy across all systems.
Following the fine-tuning of ASR models on a specialized corpus of Kazakh child speech, substantial improvements were observed across both qualitative and quantitative recognition performance indicators. Fine-tuning enabled the models to better adapt to the phonetic characteristics of Kazakh, the idiosyncrasies of child articulation, and the unique intonation patterns of short utterances.
Table 10 presents updated transcriptions for the same audio samples shown in
Table 8, reflecting outputs after fine-tuning. In most cases, the outputs exhibit closer alignment with the reference forms, with fewer phonetic distortions, elimination of language-switching errors (notably improved in Whisper), and a marked increase in semantic accuracy.
Table 11 reports the updated evaluation metrics, WER, BLEU, TER, CSRF2, and Accuracy, for each model. A comparison with
Table 9 reveals a consistent decrease in error rates (WER, TER), increased similarity at the character and lexical levels (CSRF2 and BLEU), and the appearance of non-zero Accuracy values. This confirms that several utterances were fully recognized without any errors. The most significant gains were recorded by ESPnet, while Whisper demonstrated stable and uniform improvements. DeepSpeech showed moderate improvements but fell short in several metrics.
As illustrated in
Table 10, fine-tuning led to a noticeable improvement in transcription quality across all models. While many utterances are now correctly recognized in full, certain errors remain. These patterns are particularly evident in longer sentences or words with rare or dialect-influenced pronunciations. Such observations highlight the linguistic complexity of Kazakh and underscore the necessity for using domain-specific data for ASR model adaptation.
Table 10 and
Table 11 present the results obtained after fine-tuning the ASR models on the Kazakh child speech corpus.
Table 10 shows updated transcription outputs compared to those in
Table 8. In most cases, the results exhibit closer alignment with the reference texts, particularly in terms of phonetic accuracy, morphological consistency, and reduced language-switching errors. Whisper demonstrates fewer code-mixing phenomena, while ESPnet and Vosk produce more stable and coherent word forms.
However, some recognition errors persist, particularly in longer and syntactically rich sentences, such as syllable omission (e.g., аспаұртты (aspaúrtty) instead of аспан бұлтты ‘the sky is cloudy’ (aspan búltty)), sound substitution (e.g., коем (koem) instead of қоян ‘rabbit’ (qoyan)), and morphologically incorrect endings (e.g., қатамығы (qatamyǵy) instead of қараңғы ‘dark’ (qaranǵy)). These issues typically arise from phonetic variability in children’s speech and limited exposure to rare or context-sensitive lexical structures during training. Such error patterns are particularly noticeable in words containing nasal or affricate consonants, complex agglutinative suffixes, or intonation-sensitive segments. These findings underscore the phonological challenges specific to the Kazakh language and reveal the limitations of multilingual ASR models without dedicated language-specific adaptation.
Table 11 complements the qualitative analysis with updated evaluation metrics, WER, BLEU, TER, CSRF2, and Accuracy, reflecting quantitative improvements achieved through fine-tuning. Compared to the baseline scores in
Table 9, all models exhibit a decrease in error rates (WER and TER) and an increase in character- and token-level similarity (CSRF2 and BLEU). The appearance of non-zero Accuracy values confirms that several utterances were fully recognized without any transcription errors.
Among the models, ESPnet recorded the most significant overall gains and the most balanced performance across all metrics, while Vosk demonstrated the highest Accuracy. Whisper showed stable improvements, most notably in BLEU and CSRF2 and in reduced semantic drift, but still struggled with longer utterances. DeepSpeech showed moderate improvement but remained behind in lexical precision and complete utterance recognition.
Overall, these results clearly demonstrate the critical role of domain-specific fine-tuning when developing ASR systems for under-resourced languages such as Kazakh. They also highlight the importance of utilizing a dedicated, age-appropriate speech corpus to capture key phonetic, morphological, and cognitive characteristics of children’s speech, particularly for educational, assistive, and therapeutic speech technologies.
To provide a visual comparison of model performance before and after fine-tuning,
Figure 14 presents a comparative analysis of five key evaluation metrics, WER, BLEU, TER, CSRF2, and Accuracy, across the four ASR models tested on Kazakh child speech: Whisper, DeepSpeech, ESPnet, and Vosk. While the first three models were fine-tuned on the specialized child speech corpus, the Vosk model was evaluated in its pretrained state. Each subplot in the figure illustrates how an individual metric evolved for each model, offering a clearer understanding of fine-tuning effects across different model architectures.
The results illustrated in Figure 14 clearly demonstrate the positive impact of fine-tuning on the ASR models. All three fine-tuned models showed consistent improvements across most evaluation metrics. ESPnet achieved the most significant overall gains, particularly in WER, TER, and Accuracy, indicating effective adaptation to both the phonetic and lexical structures of children’s speech. Whisper also exhibited strong improvements, especially in BLEU and CSRF2, with fewer language detection errors compared to its baseline performance. DeepSpeech showed moderate but measurable improvements, maintaining a good balance between structural and character-level accuracy. In contrast, the Vosk model, used without fine-tuning, yielded intermediate results: it outperformed the baseline (non-adapted) models but did not match the performance of the fine-tuned ones. This underscores both the utility of pretrained ASR models and the crucial role of domain-specific fine-tuning, especially for child-directed speech in under-resourced languages such as Kazakh.
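The comparison in Figure 14 can be reproduced with a small plotting script. The sketch below uses matplotlib and placeholder zero arrays in place of the reported scores; the actual values should be taken from Tables 9 and 11.

```python
# Sketch of a grouped bar chart in the spirit of Figure 14 (matplotlib).
# The arrays below are placeholders; fill them with the scores from Tables 9 and 11.
import matplotlib.pyplot as plt
import numpy as np

models = ["Whisper", "DeepSpeech", "ESPnet", "Vosk"]
metrics = ["WER", "BLEU", "TER", "CSRF2", "Accuracy"]
before = np.zeros((len(models), len(metrics)))   # baseline scores (placeholder)
after = np.zeros((len(models), len(metrics)))    # fine-tuned scores (placeholder)

fig, axes = plt.subplots(1, len(metrics), figsize=(18, 3))
x = np.arange(len(models))
for j, (ax, metric) in enumerate(zip(axes, metrics)):
    ax.bar(x - 0.2, before[:, j], width=0.4, label="before fine-tuning")
    ax.bar(x + 0.2, after[:, j], width=0.4, label="after fine-tuning")
    ax.set_title(metric)
    ax.set_xticks(x)
    ax.set_xticklabels(models, rotation=45, ha="right")
axes[0].legend()
plt.tight_layout()
plt.show()
```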
In summary, the results affirm that fine-tuning on contextually and linguistically relevant data is essential for maximizing ASR model performance, particularly in complex scenarios involving children’s speech. The combination of acoustic variability, morphological complexity, and age-specific pronunciation patterns makes this domain especially challenging, and equally essential, for the development of inclusive and accurate speech recognition technologies.
6. Discussion
The results of this study demonstrate the importance of domain-specific adaptation for ASR systems when applied to Kazakh children’s speech, a particularly challenging setting due to linguistic complexity, phonetic variability, and limited training resources. A detailed comparison across the four tested models (Whisper, DeepSpeech, ESPnet, and Vosk) reveals distinct strengths, limitations, and adaptation behaviors.
At the initial stage, the Whisper model showed some instability, particularly in recognizing short words and correctly identifying the language: Kazakh input was often misidentified as Vietnamese or English, or produced nonsensical outputs. This is likely due to the absence of Kazakh child speech in the model’s original training data. After fine-tuning on our dataset, the model’s BLEU and CSRF2 scores improved noticeably, indicating better phonetic and structural alignment, and it became more reliable in identifying Kazakh as the target language. Notably, Whisper tended to perform more consistently on longer and more complete utterances, where it was more successful at preserving context and overall fluency. This observation aligns with findings by Radford et al. [
56], who noted that the model is sensitive to input structure and performs more stably when the speech is fluent and well-formed. However, its performance declined in the presence of background noise, fragmented speech, or atypical articulation, all of which are common in real-world Kazakh child speech recordings. Overall, our observations suggest that Whisper handled well-structured, longer utterances confidently, while noisy or irregular input remained a challenge.
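In practice, one straightforward mitigation for the language-identification errors is to pin the decoder to Kazakh instead of relying on automatic detection. The sketch below assumes the open-source openai-whisper package and an illustrative file name; it is not the decoding configuration used in the experiments.

```python
# Sketch: forcing the decoding language to Kazakh ("kk") so that short or noisy
# child utterances are not routed to the wrong language by automatic detection.
import whisper

model = whisper.load_model("small")
result = model.transcribe("child_utterance.wav", language="kk", task="transcribe")
print(result["text"])
```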
DeepSpeech, trained using a CTC-based loss function, showed moderate performance both before and after fine-tuning. Although it improved in terms of BLEU and WER, the model continued to struggle with high phonetic variation, often producing fragmentary or distorted outputs for short words. The character-level metric, CSRF2, and overall Accuracy remained comparatively low, suggesting limitations in DeepSpeech’s ability to generalize to acoustically diverse child speech. These findings are consistent with previous evaluations [
57], which highlight the challenges of applying CTC-based models to speech from non-standard populations.
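For reference, the objective optimized by such CTC-based models can be illustrated with PyTorch’s built-in CTC loss; the shapes and vocabulary size below are purely illustrative and do not correspond to the DeepSpeech configuration used here.

```python
# Illustration of the CTC objective used by DeepSpeech-style models.
import torch
import torch.nn as nn

T, N, C = 50, 2, 34   # encoder frames, batch size, output symbols (blank = index 0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # character ids of transcripts
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```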
ESPnet, employing a hybrid CTC and attention-based architecture, emerged as the most balanced and adaptive model. Fine-tuning on Kazakh children’s speech led to a consistent reduction in WER and TER, alongside the highest gains in BLEU, CSRF2, and Accuracy. Its flexible architecture allowed for better integration of phonetic and syntactic structures specific to the Kazakh language. Furthermore, ESPnet showed robustness across both short and long inputs, maintaining stable outputs even under variable background conditions. This confirms earlier findings [
34] on the adaptability of attention-enhanced architectures to low-resource settings.
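For reference, hybrid systems of this kind are trained and decoded with an interpolation of the CTC and attention objectives, where λ denotes the CTC weight (the value used in our ESPnet configuration is not restated here):

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1,
\qquad
\hat{y} = \arg\max_{y}\bigl\{\lambda \log p_{\mathrm{CTC}}(y \mid x) + (1-\lambda)\log p_{\mathrm{att}}(y \mid x)\bigr\}.
```

Larger values of λ emphasize monotonic, frame-level alignment, whereas smaller values favor the attention decoder’s contextual modeling.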
Vosk, evaluated in a pretrained configuration using a Kaldi-derived TDNN model, showed surprisingly strong baseline performance. Although no fine-tuning was applied in this study, Vosk achieved respectable results on short child-spoken words. The model’s offline compatibility and efficiency make it a strong candidate for embedded or mobile ASR applications in low-connectivity environments. However, Vosk did show sensitivity to noisy and emotionally variable speech, consistent with reports by Alphabeta [
58], highlighting the limitations of general-purpose models without child-specific adaptation.
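For context, decoding with a pretrained Vosk model requires only a few lines and runs fully offline. The sketch below assumes a downloaded Kazakh model directory and a 16 kHz mono PCM recording; both names are illustrative.

```python
# Minimal offline decoding with a pretrained Vosk model.
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-kz-0.15")      # assumed Kazakh model directory
wf = wave.open("short_word.wav", "rb")         # 16 kHz, 16-bit mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```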
An important cross-model observation was that short words and phonetically similar syllables caused the most confusion, especially in the baseline (pre-finetuned) evaluations. Common recognition errors included substitutions involving soft and hard consonants (e.g., с (s) vs. ш (sh)), vowel confusion, and deletion of syllables. Additionally, models struggled with agglutinative structures in Kazakh, often truncating or overextending suffixes.
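Error profiles of this kind can be tallied automatically by aligning reference and hypothesis strings at the character level. The sketch below uses only the Python standard library and one of the error examples quoted above; it illustrates the analysis rather than reproducing the exact procedure used in this study.

```python
# Sketch: approximate counts of character-level substitutions, deletions, and
# insertions between a reference and a hypothesis transcription.
from collections import Counter
from difflib import SequenceMatcher

def error_profile(reference: str, hypothesis: str) -> Counter:
    counts = Counter()
    for op, i1, i2, j1, j2 in SequenceMatcher(None, reference, hypothesis).get_opcodes():
        if op == "replace":
            counts["substitutions"] += max(i2 - i1, j2 - j1)   # approximate count
        elif op == "delete":
            counts["deletions"] += i2 - i1
        elif op == "insert":
            counts["insertions"] += j2 - j1
    return counts

print(error_profile("аспан бұлтты", "аспаұртты"))
```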
The use of multiple evaluation metrics, WER, BLEU, TER, CSRF2, and Accuracy, enabled a multidimensional performance analysis. The WER and TER effectively captured overall structural accuracy, while BLEU reflected semantic overlap and CSRF2 provided a granular look at character-level distortions, which is critical in child speech with unstable articulation. A visual comparison of the four tested models across all five evaluation metrics is shown in
Figure 15.
To provide a more detailed view of how the ASR models perform on short lexical units,
Figure 16 presents a comparative analysis of common error types observed during the recognition of short words across the four tested systems.
Overall, the fine-tuned models demonstrated measurable gains in recognition quality, with ESPnet showing the most robust performance. These results reinforce the necessity of not only language-specific but also age-specific model adaptation, particularly for educational and diagnostic speech technology in underrepresented languages such as Kazakh.
7. Conclusions
This research represents a significant step toward developing ASR systems for under-resourced child speech, focusing on Kazakh-language data from children aged 2 to 8 years. The constructed corpus, enriched with lexically stratified content and balanced gender representation, enabled a detailed evaluation of modern ASR architectures.
Evaluation of the ESPnet, Whisper, DeepSpeech, and Vosk models on this corpus revealed key performance differences:
ESPnet demonstrated the best results for short words, with WER = 0.567, BLEU = 0.371, TER = 0.711, CSRF2 = 0.351, and Accuracy = 21%, and for sentences, with WER = 0.811, BLEU = 0.242, TER = 0.800, CSRF2 = 0.300, and Accuracy = 32%. This indicates its ability to adapt to the phonetic variations and syntactic structures of child speech; however, the model still struggles with recognition accuracy, especially for longer utterances.
Whisper, after fine-tuning, achieved WER = 0.866 for short words and 0.767 for sentences, BLEU = 0.326 and 0.416, TER = 0.854 and 0.537, CSRF2 = 0.233 and 0.200, and Accuracy = 35% and 33%, respectively. Whisper showed significant improvements, particularly in BLEU and CSRF2, indicating enhanced semantic processing accuracy and phonetic precision.
Vosk ASR, evaluated without fine-tuning, achieved the highest accuracy for short words (68%) and sentences (52%). The model demonstrated strong lexical recognition in simpler contexts, with BLEU values of 0.600 and 0.477 and CSRF2 values of 0.643 and 0.583 for short words and sentences, respectively. Fine-tuning is necessary for further improvement in recognizing more complex sentences.
DeepSpeech showed stable results, with WER = 0.717 and 0.800, BLEU = 0.500 and 0.287, TER = 0.600 and 0.717, CSRF2 = 0.267 and 0.200, and Accuracy = 60% and 25% for short words and sentences, respectively. Although its improvements were moderate overall, the model made noticeable progress in recognition accuracy for short words, while the recognition of longer utterances still leaves room for improvement. The gains in Accuracy and BLEU confirm the progress achieved through fine-tuning and suggest a growing ability to handle more complex speech structures.
Fine-tuning resulted in a substantial reduction in WER (by up to 65%) and an increase in sentence-level accuracy of more than 30% for both ESPnet and Whisper compared to the baseline models. These results confirm the crucial role of domain-specific adaptation and age-appropriate corpus design in ASR systems for child-directed speech.
Future work will involve fine-tuning the Vosk ASR model on a server using a pre-compiled Kazakh children’s speech corpus formatted for Kaldi. After constructing the acoustic model and decoding graph (HCLG.fst), the final model will be integrated into Vosk and evaluated on child speech data. Further work will also focus on expanding the corpus to include emotional intonation, speaker and regional variability, and multilingual examples. We also aim to explore hybrid architectures that combine the real-time efficiency of Vosk with the contextual depth of ESPnet and Whisper. Additionally, emotion-aware ASR components will be integrated to better handle affective speech patterns typical of early childhood, and real-time augmentation techniques, such as pitch shifting and speaking-rate simulation, will be applied to enhance robustness. The dataset will also be enriched with dialectal and sociolectal variations in the speech of children from different regions of Kazakhstan.
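As an illustration of the augmentation step mentioned above, the sketch below applies pitch shifting and speaking-rate simulation with librosa; the parameter values and file names are placeholders rather than the settings planned for the extended corpus.

```python
# Sketch of offline pitch and speaking-rate augmentation (librosa, soundfile).
# The shift amount, stretch rate, and file names are illustrative placeholders.
import librosa
import soundfile as sf

y, sr = librosa.load("child_utterance.wav", sr=16000)

shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # raise pitch by 2 semitones
slowed = librosa.effects.time_stretch(y, rate=0.9)           # simulate a slower speaking rate

sf.write("child_utterance_pitch+2.wav", shifted, sr)
sf.write("child_utterance_rate0.9.wav", slowed, sr)
```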
8. Patents
The authors developed a custom software system called “Dataset Loader Bot: Semi-Automated Collection System for Kazakh Child Speech Data”, designed for structured collection of children’s audio recordings via Telegram, including metadata assignment and server-side processing. This software is officially registered as a copyrighted object under the following certificate:
Copyright Certificate No. 56775, issued on 15 April 2025, by the Ministry of Justice of the Republic of Kazakhstan.
Authors: Zhansaya Duisenbekkyzy, Diana Ramazanovna Rakhimova, Rashid Shamilievich Aliev.
Object Type: Software for Electronic Computing Machines (ECM).
Date of Creation: 9 July 2024.
This registered tool was used as the primary mechanism for collecting speech data in this study and ensured compliance with ethical and technical standards in data acquisition.
https://copyright.kazpatent.kz/?!.iD=3sMv (accessed on 20 April 2025).