Article

Listen Closely: Self-Supervised Phoneme Tracking for Children’s Reading Assessment

Department for Smart and Interconnected Living, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 40; https://doi.org/10.3390/info17010040
Submission received: 20 October 2025 / Revised: 27 November 2025 / Accepted: 12 December 2025 / Published: 4 January 2026
(This article belongs to the Special Issue AI Technology-Enhanced Learning and Teaching)

Abstract

Reading proficiency in early childhood is crucial for academic success and intellectual development. However, more and more children are struggling with reading: according to the latest PIRLS study, one out of five children in Austria is dealing with reading difficulties. The reasons for this are diverse, but an application that tracks children while reading aloud and guides them when they experience difficulties could offer meaningful help. This proposal therefore explores a prototyping approach for a core component that tracks children’s reading using a self-supervised Wav2Vec2 model with a limited amount of data. Self-supervised learning allows models to learn general representations from large amounts of unlabeled audio, which can then be fine-tuned on smaller, task-specific datasets, making it especially useful when labeled data is limited. Our model operates at the phonetic level using the International Phonetic Alphabet (IPA). To implement this, the KidsTALC dataset from Leibniz University Hannover was used, which contains spontaneous speech recordings of German-speaking children. To enhance the training data and improve robustness, several data augmentation techniques were applied and evaluated, including pitch shifting, formant shifting, and speed variation. The models were trained using different data configurations to compare the effects of data variety and quality on recognition performance. The best model trained in this work achieved a phoneme error rate (PER) of 14.3% and a word error rate (WER) of 31.6% on unseen child speech data, demonstrating the potential of self-supervised models for such use cases.

1. Introduction

1.1. Motivation

Reading is a fundamental skill that shapes human development throughout life. For adults, it is essential for personal growth, professional success, and full participation in society. However, it is during childhood that reading plays its most crucial role. Research published in “Psychological Medicine” [1] has clearly shown how important reading is for children. According to an article published by Austria’s public broadcaster ORF [2], reading helps children develop their minds and do better in school. When children read, they learn to understand texts better and think more critically. Reading books and stories helps them learn new words and improve their writing, and it also helps them understand the world better, be more creative, and understand how others feel. Children who read regularly also get better at focusing and solving problems and are less likely to suffer from depression and aggression problems [2].
Despite these benefits, many children today face significant challenges with reading. The latest PIRLS study from 2023 revealed a troubling trend. One in five children in Austria struggles with basic reading skills, with boys being more affected than girls [3]. In other words, 20% of children lack the fundamental reading abilities essential for success in school and everyday life. While Austria still ranks above the international average, the findings are nonetheless concerning. Notably, compared to other participating countries, children’s reading skills in Austria are more strongly linked to the socio-economic background of children’s families, including parental education and occupation [3]. Before looking at how this problem can be tackled and how technology can help, it is important to understand why children struggle with reading.
Several key factors contribute to children’s reading difficulties. One major issue is the changing media consumption habits of today’s children. According to Statista [4], only 14% of children in Germany read daily, while 67% watch TV every day. Compared to previous generations, children today read significantly less, as entertainment is increasingly consumed through visual and audio media like Netflix and YouTube.

1.2. Research Questions

The primary goal is to develop a system that monitors and tracks children as they read aloud, with the potential to detect reading difficulties or mistakes in the future. While various methods exist for assessing reading skills [5], this work specifically focuses on pronunciation accuracy at the word and phoneme levels. To achieve this, a machine learning model will be trained and tested using a speech corpus of German children. Unlike most existing models, the proposed approach is fine-tuned to recognize speech at the phonetic level rather than just the word level, transcribing utterances into the International Phonetic Alphabet (IPA) instead of words. The following research questions are key:
1.
Feasibility of Accurate Phonetic Recognition: Is it feasible to achieve accurate phonetic recognition in children’s speech using a Wav2Vec2 model fine-tuned with less than 25 h of data, and what are the key factors that influence this performance?
2.
Trade-off Between Data Quality and Quantity: How does the trade-off between data quality and quantity affect phoneme recognition performance in children’s speech? This question explores whether it is more helpful to fine-tune a pretrained Wav2Vec2 model with a small amount of high-quality data or with a larger amount of lower-quality data.

2. Background and Related Work

This section shows how reading fluency is currently measured in Austrian schools and which tools are used for this purpose. Next, it introduces the fundamentals of automatic speech recognition (ASR) and highlights the specific challenges of tracking children’s speech. The focus then shifts to different ASR system implementations, which can be divided into two main groups: traditional approaches such as Dynamic Time Warping (DTW) and Hidden Markov Models with Gaussian Mixture Models (HMM-GMM), and more recent deep learning-based methods. Finally, the chapter discusses how these deep learning models work and how well they have performed in previous studies.

2.1. Measuring and Analyzing Reading Fluency

Currently used methods to determine reading metrics in Austrian schools are Running Records [6] and the Salzburger Lese-Screening (SLS) [7]. During a Running Record, the teacher listens to the child reading a short text aloud while noting any errors, self-corrections or hesitations. SLS is a diagnostic tool where a student is given three minutes to evaluate as many statements as possible for their truthfulness, marking them accordingly.
However, such assessment methods often work by manually listening to children read aloud and actively following along with their reading. These methods can aim to capture various aspects of reading fluency, such as accuracy, speed, and comprehension, by closely monitoring and recording the child’s reading behaviors.
It is hence of particular interest to explore automated reading tools that aim to support exactly this process, offering more scalable, personalized, and efficient ways to help children improve their reading skills. Unfortunately, most of these tools are currently only available in English and are therefore not suitable for use in Austria. Existing tools include:
  • SoapBox Labs offers a speech recognition engine optimized for children’s voices, designed as backend technology for integration into educational tools. It handles mispronunciations, accents, and noise, providing real-time feedback on pronunciation, fluency, and pace, and supports multiple languages including German, English, Spanish, and Mandarin [8].
  • Ello is an AI reading tutor for kindergarten to Grade 3, with over 700 decodable e-books tailored to a child’s level and interests. It listens to children reading aloud, giving real-time feedback, and supports both digital and physical books. Currently, it is only available in English and targets the US market [9].
  • Microsoft Reading Coach provides AI-driven, individualized reading support, identifying words or patterns students struggle with and offering targeted exercises. Integrated with Microsoft Teams and Reading Progress, it helps track progress and reduce teacher workload [10].
  • Google Read Along listens to children reading aloud, highlights mistakes, and integrates with Google Classroom for task assignment and progress tracking. It supports several languages, including English, Spanish, Portuguese, and Hindi, but not German [11].

2.2. Speech Recognition for Reading Assessment

Especially in English-speaking countries such as the US, tools are thus already available that have been shown to help and support children in improving their reading skills. What they all have in common is that their core function is to understand what the child says while reading aloud. Tools like Ello, Microsoft Reading Coach, or SoapBox all depend on speech recognition to do this. Only when the system can correctly hear and understand the words or sounds the child says does it become possible to measure reading speed, track progress, and give helpful feedback to the children. This chapter looks at how speech recognition works for reading, with a focus on the challenges of recognizing children’s speech.

2.2.1. Fundamentals of Speech Recognition

Speech recognition technology relies on a sequence of computational processes, including feature extraction, acoustic modeling, and language modeling. The fundamental workflow of ASR consists of converting spoken language into digital signals, analyzing phonetic structures, and matching them with linguistic models to produce textual output.
In 2022 Basak et al. [12] conducted a systematic review which analyzed data from 78 different research projects published on Google Scholar between 1993 and 2021. The goal was to analyze key advancements and challenges in the field of ASR. The review categorized ASR methodologies into traditional statistical models and modern deep learning-based approaches. According to their findings, a typical ASR pipeline involves three stages which are shown in the following Figure 1.
Feature Extraction
The first step in automatic speech recognition (ASR) according to Basak et al. is converting the continuous speech signal into a digital format that can be processed by machine learning models. Since speech is an analog waveform, it must be digitized through sampling and quantization. The sampling process records the amplitude of the sound wave at discrete intervals, typically at a rate of 16 kHz or higher to capture the full range of human speech. Quantization then maps these sampled values to a finite set of digital levels, allowing them to be represented as numerical data.
Once digitized, the audio data are typically analyzed and prepared further to extract more informative representations. A common method is to divide the data into small overlapping frames, usually between 20 and 40 ms, because speech characteristics change dynamically over time. Each of these frames is then transformed from the time domain to the frequency domain using techniques like the Fourier Transform (FT) or Mel-frequency transformations, which helps reveal properties relevant for phoneme recognition, for example formants, which are crucial for distinguishing phonemes [12]. According to Basak et al., the most common feature extraction methods in ASR are Mel-Frequency Cepstral Coefficients (MFCCs), spectrograms, and Linear Predictive Coding (LPC). MFCCs are designed to imitate how humans perceive sound, focusing on important frequencies and reducing the complexity of speech signals. Spectrograms represent speech visually over time and frequency, making them useful for AI models that learn from image-like patterns. LPC, on the other hand, predicts future sounds based on past ones, which helps compress speech data but makes it more sensitive to background noise. Effective feature extraction significantly influences ASR performance, as it determines how well a system can differentiate between similar-sounding phonemes and handle variations in pronunciation.
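To make the framing and feature extraction concrete, the following minimal Python sketch computes a spectrogram and MFCCs for a single recording. The librosa library and the file name are illustrative choices, not part of the original pipeline.

```python
# Minimal sketch of the feature extraction step described above, using the
# librosa library (chosen here for illustration; the file path is hypothetical).
import librosa
import numpy as np

# Load an audio file and resample it to 16 kHz.
signal, sr = librosa.load("child_utterance.wav", sr=16000)

# Short-time Fourier transform: 25 ms windows with 10 ms hop,
# i.e. the overlapping 20-40 ms frames mentioned in the text.
n_fft = int(0.025 * sr)       # 400 samples per frame
hop_length = int(0.010 * sr)  # 160 samples between frame starts
spectrogram = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))

# Mel-Frequency Cepstral Coefficients, the classic ASR feature set.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                             n_fft=n_fft, hop_length=hop_length)

print(spectrogram.shape)  # (frequency bins, frames)
print(mfccs.shape)        # (13 coefficients, frames)
```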
Modeling
The modeling stage involves converting extracted speech features into phonemes, words or complete sentences. Once features are extracted, ASR systems employ modeling techniques to recognize and classify speech. The primary modeling approaches found by Basak et al. [12] are:
  • Acoustic-Phonetic Approach: This is one of the oldest strategies and is based on linguistic knowledge, assuming that speech consists of a finite set of phones. These units can be identified using formant analysis and time-domain processing. Early ASR systems used handcrafted linguistic principles by manually defining phoneme transition rules.
  • Pattern Recognition Approach: This approach treats speech recognition as a classification problem. Statistical models learn patterns in phoneme pronunciation from large speech datasets. Two different methods exist: template matching, which simply tries to find the best match against an already existing reference recording, and the stochastic approach. Examples are Hidden Markov Models (HMMs) and Dynamic Time Warping, both described in Section 2.2.4. While these techniques enhanced ASR performance, they required extensive manual tuning and struggled with generalization.
  • Deep Learning Approach: Modern ASR systems use deep learning models to learn speech representations directly from data. This eliminates the need for handcrafted phonetic rules. These models handle noisy environments, multiple languages, and real-time speech processing with unprecedented accuracy. A more detailed look is taken in Section 2.2.5.
Performance Evaluation Techniques
ASR systems need reliable metrics to measure how well they work and how they handle different accents and noisy environments. Basak et al. [12] list the following metrics as being of particular interest in the ASR literature.
  • Word Error Rate (WER): The most common metric, quantifying the percentage of words inserted, deleted, or substituted in the ASR transcript. Lower WER indicates better performance compared to the reference transcript.
  • Phoneme Error Rate (PER): Focuses on errors at the phoneme level rather than entire words. This is useful for analyzing systems working with languages having complex phonetic structures.
  • Sentence Error Rate (SER): Assesses the correctness of full sentences, providing a broader evaluation of ASR accuracy in conversational speech.

2.2.2. Challenges in Children’s Speech Recognition

Although ASR systems have become quite accurate for adult speech, they don’t work as well for children’s speech. Research has shown that ASR systems trained mainly on adult speech have much higher error rates when they process children’s speech. For example, Yeung and Alwan found that ASR systems perform much worse for younger children, especially kindergarten-aged kids, because their speech tends to vary a lot more than that of older children [13].
Additionally, a study by D’Arcy and Russell in 2005 showed that ASR systems make more mistakes with children’s speech as they get younger, and they are especially bad at recognizing words when the audio quality is reduced [14].
According to a literature review by Bhardwaj et al. [15], the main reasons for the reduced performance of ASR systems are:
  • Physiological and Anatomical Differences: As children grow, both their physical development and speech characteristics change. These changes are largely influenced by the differences in the structure of their vocal tract compared to adults. The smaller size of children’s vocal folds and shorter vocal tracts result in higher pitch and sound frequencies in their speech. These physiological differences contribute to the distinct sound of children’s voices, making their speech sound different from that of adults. Additionally, children have less control over speech features like tone, rhythm, and intonation due to the ongoing development of their vocal control mechanisms. As they grow older, when around 12 to 14 years old, their speech becomes more similar to adult speech [15,16].
  • Pronunciation Variability: When it comes to pronunciation, younger children often struggle with articulating words accurately due to their limited experience with language. Children’s vocabulary tends to be smaller than that of adults, and they often rely on their imagination to create words [15].
  • Limited Training Data: There isn’t enough good-quality speech data from children to train ASR systems. Collecting this kind of data is hard because of privacy issues and other challenges. As a result, ASR models struggle to understand children’s speech since they weren’t trained on enough examples from children [13].

2.2.3. Augmentation Techniques for Children’s Speech

As already stated in previous sections, children’s speech recognition is a challenging task, and ASR systems trained on adult speech often perform poorly on children’s speech. To minimize these errors, several adaptation techniques have been developed and proven effective, which try to bridge the gap between adult and children’s speech characteristics.
  • Vocal Tract Length Normalization (VTLN): VTLN specifically addresses the physiological differences between adult and child vocal tracts. Since children have shorter vocal tracts, their formant frequencies are shifted higher. VTLN applies a warping function to the frequency axis of the speech spectrum to normalize these differences. Potamianos and Narayanan demonstrated that VTLN alone could reduce WER by 10–20% relative for children’s speech.
  • Maximum Likelihood Linear Regression (MLLR): MLLR adapts the parameters of a GMM-HMM system by applying a linear transformation to the mean vectors of the Gaussian components. This technique can adapt to both speaker-specific characteristics and acoustic conditions. For children’s speech, MLLR helps the model adjust to the higher pitch and formant frequencies characteristic of children’s voices.
  • SpecAugment: SpecAugment, introduced by Park et al. in 2019 [17] from the Google Brain team, is a data augmentation method used to make automatic speech recognition (ASR) systems more robust. It works directly on log mel spectrograms and applies three simple changes: time warping, frequency masking, and time masking. These changes imitate missing or noisy parts of speech, helping the model become more flexible and better at handling different audio conditions. According to the initial paper, SpecAugment led to much lower word error rates on widely used benchmarks such as LibriSpeech and reached strong results without using extra language models.
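The masking operations can be sketched in a few lines with torchaudio; the example below applies frequency and time masking to a log mel spectrogram. Time warping is omitted, and the file name and mask parameters are placeholders rather than values from the original paper.

```python
# Illustrative SpecAugment-style masking with torchaudio (simplified sketch).
import torch
import torchaudio
import torchaudio.transforms as T

# Hypothetical input file; any 16 kHz mono recording would do.
waveform, sr = torchaudio.load("child_utterance.wav")

# Log mel spectrogram, the representation SpecAugment operates on.
mel = T.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)

# Frequency and time masking: random bands of mel channels / time steps are masked.
freq_mask = T.FrequencyMasking(freq_mask_param=15)
time_mask = T.TimeMasking(time_mask_param=35)
augmented = time_mask(freq_mask(log_mel))
```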

2.2.4. Traditional Speech Recognition Approaches

Following the discussion of speech recognition fundamentals and the challenges specific to children’s speech, this section examines the traditional approaches that formed the foundation of automatic speech recognition (ASR) before the deep learning revolution. These approaches, while now largely superseded by neural methods, established important principles that continue to inform current research, particularly for phonetic-level analysis and pronunciation assessment.
Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) represents one of the earliest approaches to speech recognition, predating the statistical modeling paradigm. Developed in the 1970s [18], DTW addresses a fundamental challenge in speech recognition: the variable duration of speech sounds. The same word spoken at different speeds produces acoustic patterns of different lengths, making direct comparison difficult [19].
DTW works by non-linearly aligning two time series (such as speech patterns) to find the optimal match between them, even if they have different durations or speeds. The basic algorithm follows these steps:
1.
A reference template is created for each word or phoneme in the vocabulary
2.
The input speech is converted to a sequence of acoustic feature vectors
3.
A distance matrix is computed between the input sequence and each reference template
4.
The algorithm finds the path through this matrix that minimizes the total distance while satisfying continuity constraints
5.
The template with the lowest distance score is selected as the recognized word
Unlike statistical approaches, DTW is deterministic and requires no training in the conventional sense. Instead, it relies on having good reference templates that represent the target vocabulary items. However, DTW has several limitations. It requires a predefined reference template for each word or phoneme, making it less scalable for large vocabulary speech recognition. The approach is also computationally expensive, particularly for long speech sequences, due to its reliance on dynamic programming. Moreover, DTW struggles with speaker variability and background noise, as it does not model probabilistic variations in speech [19]. While DTW played an important role historically, it is no longer used in modern ASR systems, and the author was unable to identify any recent research applying DTW to children’s speech recognition.
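The template-matching idea behind DTW can be illustrated with a short, unoptimized sketch; the feature sequences and templates below are random placeholders rather than real speech data.

```python
# Minimal DTW distance between two feature sequences (e.g. MFCC frames),
# illustrating steps 3-5 above; a sketch, not an optimized implementation.
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """seq_a: (n, d) and seq_b: (m, d) arrays of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Continuity constraint: match, insertion or deletion.
            cost[i, j] = local + min(cost[i - 1, j],      # deletion
                                     cost[i, j - 1],      # insertion
                                     cost[i - 1, j - 1])  # match
    return cost[n, m]

# Recognition = template with the lowest accumulated distance (templates are dummies).
templates = {"Hund": np.random.rand(40, 13), "Haus": np.random.rand(55, 13)}
query = np.random.rand(48, 13)
best_word = min(templates, key=lambda w: dtw_distance(query, templates[w]))
print(best_word)
```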
Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs)
Following the template-based approaches, the combination of Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs) became the leading method in automatic speech recognition for several decades before neural network approaches gained prominence in recent years.
The GMM-HMM approach models both the sequence of speech sounds and the acoustic features of each sound. One of the most important early works in this area is “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition” by Lawrence R. Rabiner (1989) [20], which laid the foundation for applying HMMs to speech processing. In this approach, speech is modeled as a sequence of states, with each state representing a distinct phonetic unit. At a high level, the GMM-HMM framework operates as follows [21]:
1.
Breaking Speech into Frames: The speech signal is split into short time frames (typically 25 ms each). This helps analyze speech as a sequence of small, overlapping sound units.
2.
Extracting Features: Instead of working directly with the raw audio waveform, the system extracts important characteristics from each frame. A common method for this is Mel-Frequency Cepstral Coefficients (MFCCs), which were already mentioned previously.
3.
Modeling Speech Sequences with HMMs: Speech is not just a set of isolated sounds, it follows a structured order. HMMs help capture this structure by representing speech as a sequence of hidden states. Each state corresponds to a phoneme. The system learns how likely it is for one sound to follow another by analyzing large amounts of training data.
4.
Modeling Sound Variability with GMMs: People pronounce the same word in different ways depending on their accent, speed, and tone. GMMs help handle this variation by modeling the probability distribution of acoustic features within each HMM state.
5.
Recognizing Speech: Once the model is trained, it can analyze new speech samples. Given a sequence of extracted features, the system determines the most likely sequence of HMM states that match the observed speech. The Viterbi algorithm, a well-known dynamic programming method, is commonly used to find this optimal sequence.
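Step 5 can be illustrated with a toy Viterbi decoder; the transition and emission values below are made up purely for demonstration and do not correspond to a trained GMM-HMM system.

```python
# Toy Viterbi decoding: find the most likely HMM state sequence for a
# sequence of observation log-likelihoods. All values are illustrative only.
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """log_init: (S,), log_trans: (S, S), log_obs: (T, S) log-probabilities."""
    T, S = log_obs.shape
    delta = np.zeros((T, S))            # best log-probability ending in each state
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (S, S): previous -> current state
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    # Backtrack the best path from the most likely final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Three phoneme states, five frames of made-up emission log-likelihoods.
path = viterbi(np.log([0.8, 0.1, 0.1]),
               np.log([[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]]),
               np.log(np.random.dirichlet(np.ones(3), size=5)))
print(path)
```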

2.2.5. Deep Learning & the Transformer Architecture

Deep learning has become a cornerstone of modern artificial intelligence, enabling breakthroughs in areas such as computer vision, natural language processing, and speech recognition. At its core, a deep learning model is a neural network composed of layers of interconnected processing units, the neurons, where each layer transforms its input data into a more abstract representation. The network’s first layers learn low-level features like simple sound patterns or edges in images, while deeper layers combine these into higher-level concepts like phonemes or object parts. Through a training process called backpropagation, the model’s parameters are adjusted so that each layer progressively refines its representation to better predict the target outputs.
The Transformer is a deep learning architecture introduced by Vaswani et al. in 2017 that revolutionized sequence modeling by relying entirely on self-attention mechanisms instead of recurrence or convolution [22]. In traditional RNN-based models, information is processed sequentially, which makes capturing long-range dependencies challenging and prevents parallelization across sequence elements. The Transformer’s key innovation, self-attention, allows the model to focus on different parts of the input sequence all at once, regardless of their positions.
In traditional models such as RNNs, the model reads the sentence word by word, from left to right. The model can only remember previous words as it moves forward, and often gives the most attention to neighboring words. This is shown in Figure 2, where only short, direct connections are made between adjacent words. Figure 3 illustrates how the Transformer’s self-attention mechanism works. Each word can directly attend to every other word in the sentence. This means the model can compare “student” and “book” directly, even though they are several words apart.
The self-attention mechanism provides two major advantages over earlier models. First, it enables modeling of long-range dependencies with ease: even elements far apart in a sequence can directly influence one another’s representation, addressing the vanishing context problem of RNNs. Second, the Transformer can process sequence elements in parallel, since each layer’s attention computation does not depend on previous time steps as in an RNN. This parallelism leads to significantly faster training and the ability to utilize hardware like GPUs more effectively. The impact of the Transformer architecture has been enormous, and it quickly became the foundation of most cutting-edge models in natural language processing, including BERT and GPT. Its versatility has also led to adoption in speech processing and other domains. One such model, which leverages the Transformer’s strengths for speech audio, is Wav2Vec2, which will be discussed in the following section [22,23].
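For illustration, the following NumPy sketch implements single-head scaled dot-product self-attention over a toy sequence; the dimensions and weights are arbitrary and serve only to show that every position attends to every other position in a single step.

```python
# Minimal scaled dot-product self-attention over a toy sequence (illustrative only).
import numpy as np

def self_attention(x: np.ndarray, w_q, w_k, w_v) -> np.ndarray:
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over all positions
    return weights @ v                                # each output mixes the whole sequence

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8                   # e.g. six "words" or audio frames
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
print(out.shape)  # (6, 8): one contextualized vector per input position
```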
Wav2Vec2-Self-Supervised Speech Recognition
Wav2Vec2 is a deep learning model that builds upon the described Transformer architecture to learn powerful speech representations in a self-supervised manner. Introduced by Baevski et al. in 2020 [24], Wav2Vec2 was a breakthrough in automatic speech recognition because it demonstrated that a model could be pre-trained on massive amounts of unlabeled audio data and later fine-tuned to achieve excellent ASR performance with very little labeled data. In other words, the model learns from raw audio alone by solving an internal prediction task, and this knowledge can be transferred to speech recognition tasks without requiring large annotated speech corpora. Wav2Vec2’s design is composed of several components and training steps that work together to enable this self-supervised learning. According to Platen [24], the architecture of Wav2Vec2 is as follows:
  • Raw Audio Input: The model begins with continuous raw audio waveforms sourced from large-scale, predominantly unlabeled multilingual speech datasets. These corpora encompass a wide range of speaking styles, recording conditions, and linguistic diversity, including read speech, conversational speech, and spontaneous speech. Examples include Common Voice, Multilingual LibriSpeech, and VoxPopuli, which together provide hundreds of thousands of hours of audio across dozens of languages. This extensive variability enables the model to develop robust, language-agnostic acoustic representations.
  • Feature Extraction with CNN: A convolutional neural network (CNN) processes the raw waveform into a sequence of latent acoustic features. The CNN progressively downsamples the audio while preserving salient temporal and spectral cues, effectively converting high-frequency waveform information into a compact representation. This front-end captures low-level characteristics such as energy contours, pitch variations, and local phonetic structure, producing features suitable for higher-level contextual modeling.
  • Transformer: The extracted feature sequence is then passed to a Transformer network, which models contextual relationships across long temporal spans. Through self-attention mechanisms, the Transformer integrates information distributed across entire utterances, allowing the model to infer dependencies between distant acoustic events such as syllable structure, prosody, and cross-phoneme transitions. This stage is crucial for capturing long-range linguistic patterns that cannot be represented by local convolution alone.
  • Self-Supervised Learning: During pre-training, a subset of the Transformer’s input representations is randomly masked. The model is then trained to reconstruct the masked segments using a quantization module that maps continuous embeddings to a finite set of learned discrete speech units. This contrastive prediction task encourages the model to learn phonetic and subphonetic regularities without any labeled supervision, allowing it to generalize across languages and domains. By predicting masked content, the model develops internally consistent, language-independent acoustic abstractions.
  • Fine-Tuning Layer: After self-supervised pre-training, the learned speech representations can be adapted to downstream tasks through the addition of lightweight task-specific output layers. Fine-tuning is performed on labeled datasets tailored to the target task, enabling the model to specialize in speech recognition, speech-to-text translation, speaker or language classification, or other speech processing objectives. Because the bulk of the model is already pretrained, only modest amounts of labeled data are required to achieve strong performance across a wide range of applications.
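The division of labor between the CNN feature encoder and the Transformer can be seen directly when running the pretrained model through the Hugging Face transformers library, as in the sketch below; the dummy audio and the printed shapes are illustrative, and exact frame counts depend on the input length.

```python
# Sketch: run the pretrained Wav2Vec2 encoder with Hugging Face Transformers.
# The CNN produces latent features, the Transformer turns them into contextual
# representations. No fine-tuning head is attached yet.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-xls-r-300m"  # multilingual XLS-R variant used later
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# One second of dummy 16 kHz audio in place of a real child utterance.
audio = torch.randn(16000).numpy()
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.extract_features.shape)   # CNN output: (1, ~49 frames, 512)
print(outputs.last_hidden_state.shape)  # Transformer context: (1, ~49 frames, 1024)
```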
Variants of the Wav2Vec2 model exist, differing in both model parameter size and the amount and type of training data used. One specific variant is XLSR-53, introduced by Conneau et al. in 2021 for multilingual applications [25]. As illustrated on the left side of the architecture, the model was trained on speech data from different languages, enabling cross-lingual speech pattern recognition. This multilingual training improves performance on tasks involving less widely spoken languages or underrepresented speech varieties, such as, in our use case, German children’s speech, for which no large-scale annotated datasets exist.
Alternative Deep-Learning Models
Published in 2020, Wav2Vec2 was one of the first self-supervised deep learning models for speech recognition. Following its success, further models were released that try to enhance the initial idea, each with its own design or training strategy. These models differ from Wav2Vec2 in how they handle self-supervision, the size and type of training data, or the architecture components.
  • HuBERT (Hidden-Unit BERT), introduced by Hsu et al. [26] in 2021, is a self-supervised model that clusters audio segments via k-means to create pseudo-labels, then predicts these labels for masked segments. This approach learns phonetic and linguistic structures without transcripts and has shown strong performance in phoneme recognition.
  • WavLM, presented by Chen et al. [27] in 2022, extends Wav2Vec2 with improved masking, denoising objectives, and training on larger, more diverse datasets. It performs well in noisy conditions and supports tasks such as speaker diarization and speech separation.
  • Whisper, developed by OpenAI [28] in 2023, is a supervised model trained on 680,000 h of labeled speech. It handles transcription, translation, and language identification across many languages and noisy environments, but is less suited for small phoneme-level tasks.

2.2.6. Deep Learning Models Applied to Children’s Speech Recognition

Although self-supervised models like Wav2Vec2, HuBERT, and WavLM are relatively new, several studies already highlight their good performance:
  • Implementing Wav2Vec2 into an Automated Reading Tutor: Mostert et al. [29] integrated Wav2Vec2 models into a Dutch automated reading tutor to detect mispronunciations in children’s speech. Two variants were trained on a Dutch speech therapy corpus and 20 h of children’s speech: an end-to-end phonetic recognizer and a phoneme-level mispronunciation classifier. Both outperformed the baseline Goodness-of-Pronunciation detector in phoneme-level accuracy and F-score, yielding modest gains in phone-level error detection but little change in word-level accuracy. The study suggests Wav2Vec2 can enhance fine-grained phonetic feedback.
  • Phoneme Recognition for French Children: Medin et al. [30] compared Wav2Vec2.0, HuBERT, and WavLM to a supervised Transformer baseline for phoneme recognition in French-speaking children. Models were trained on 13 h of child speech, with cross-lingual English data offering no benefit. WavLM achieved the lowest phoneme error rate (13.6%), excelling on isolated words and pseudo-words, and proved most robust under noise (31.6% vs. baseline 40.6% at SNR < 10 dB). Results show self-supervised models, especially WavLM, improve phoneme recognition in noisy conditions.
  • Adaptation of Whisper models to child speech recognition: Jain et al. [31] fine-tuned Whisper-medium and large-v2 on 65 h of child speech (MyST and PF-STAR) and compared them to Wav2Vec2. Fine-tuning reduced Whisper-medium’s WER on MyST from 18.1% to 11.7%, with large-v2 reaching 3.1% on PF-STAR. Wav2Vec2 achieved the best matched-domain results (2.9% PF-STAR, 10.2% MyST), while Whisper generalized better to unseen domains.
  • Benchmarking Children’s ASR: Fan et al. [32] benchmarked Whisper, Wav2Vec2, HuBERT, and WavLM on MyST (133 h) and OGI Kids (50 h) datasets. In zero-shot tests, large Whisper models performed well (12.5% WER on MyST), with fine-tuning further reducing errors (Whisper-small: 9.3%, WavLM: 10.4%). On OGI, Whisper-small achieved 1.8% WER. Data augmentation gave small, consistent gains. Overall, fine-tuned Whisper models generally outperformed self-supervised models of similar size.

2.3. Summary and Research Gap

2.3.1. Summary

In Austria, teachers use various methods to assess children’s reading, often relying on reading aloud and listening closely. Studies show that automated reading tutors can improve skills and motivation, yet most existing tools target English-speaking children. Currently, no widely available solution exists for German-speaking learners. Speech recognition in this context falls into two categories: traditional and deep learning-based approaches. Traditional methods, such as Dynamic Time Warping and the Gaussian Mixture Model-Hidden Markov Model, can perform well but require large labeled datasets, which is challenging for child speech due to higher pitch, shorter vocal tracts, and greater variability. Data augmentation helps, but generalization remains limited. Modern deep learning approaches, such as Wav2Vec2 by Meta AI, use Transformer architectures with self-supervised learning to achieve strong results, including for children’s speech. They can learn useful representations from unlabeled audio, making them suitable where labeled data is scarce.

2.3.2. Research Gap

Self-supervised deep learning models, including Wav2Vec2, have proven effective for ASR and child speech, but most focus on word- or sentence-level recognition. Few address phoneme-level recognition, which is crucial for diagnosing early reading difficulties. While tools exist for English learners, there is a lack of phoneme-level speech tracking for German-speaking children. To our knowledge, no study has fine-tuned a deep learning ASR model for German children’s speech with phoneme-level annotations. This work aims to fill that gap.

3. Approach and Methodology

This section describes how the system was designed to recognize children’s speech at the phoneme level using self-supervised learning. The goal was to evaluate whether a Wav2Vec2-based model, fine-tuned on limited German child speech data, could reliably transcribe audio into sequences of phonemes represented in the International Phonetic Alphabet (IPA). The system is intended as a core component for a future reading support tool but does not yet include automatic error detection.
The following sections explain the design of the system pipeline, including the choice of datasets, preprocessing steps, IPA transcription process, model architecture, training strategy and evaluation framework.
Starting with dataset collection and preprocessing, the audio data was resampled and aligned with phonetic transcriptions. These were then used to fine-tune a self-supervised speech recognition model. Finally, the model performance was evaluated using standard metrics such as Word Error Rate (WER) and Phoneme Error Rate (PER).

3.1. System Overview

The system architecture is illustrated in Figure 4. It outlines each step of the implementation pipeline from data preparation to mobile deployment. The first step in the process was identifying suitable datasets. The KidsTALC dataset created by Leibniz University Hannover [33] was selected as the primary data source. After importing the data, all audio files were normalized and resampled to 16 kHz. This ensures compatibility with most speech models, including the Wav2Vec2 model used here.
The next step was the phonetic transcription of the training data. All available text transcriptions were converted into the International Phonetic Alphabet (IPA). This enabled the model to be trained at the phoneme level rather than the word level. Phoneme-level output allows for more detailed analysis of mispronunciations and subtle reading errors.
Following the data preparation, the Wav2Vec2 model was fine-tuned on the prepared audio data and the phonetic transcriptions. Finally, the model’s performance was evaluated using two common metrics in automatic speech recognition: Word Error Rate (WER) and Phoneme Error Rate (PER).
After evaluation, the goal was to transform the best-performing model into a format suitable for mobile deployment and then use the output of the model to track the reading errors of children. The deployment and the reading error detection are not part of this proposal, but we discuss potential approaches in the future work part of the discussion (Section 6.3). More details about the implementation of each step are provided in the upcoming sections.

3.2. Datasets and Resources

As in most machine learning projects, the most important, and often also the most challenging, part is the data collection. This is also the case for speech recognition tasks involving children. Data must not only be sufficiently large and diverse but also age-appropriate and representative of the target users. Unfortunately, publicly available datasets that contain speech from German-speaking children are very rare. While there are well-known corpora for adult speech, such as Mozilla’s Common Voice dataset [34], they cannot be used as direct substitutes because children’s speech differs from adult speech in many ways: it has a higher pitch, different articulation, and more variation in pronunciation. These differences significantly reduce the performance of models trained only on adult data.
In our research, we were given the opportunity to use the KidsTALC corpus [33], a dataset collected by Leibniz University Hannover to support research and the development of speech technology solutions tailored to children’s speech. It represents one of the rather rare corpora that contain recordings of natural, connected German child speech across a broad age range. It consists of approximately 25 h of natural, continuous speech from 39 different children aged 3.5 to 11 years and focuses on spontaneous speech rather than read sentences, reflecting real conversational behavior in which the children interact with and talk directly to adults. The recordings took place under natural real-life conditions such as homes and schools. Age distribution and group characteristics can be seen in Table 1. The corresponding number of speakers and their respective ages are provided with the dataset and have not been automatically recognized or assigned by us.
The speech data in KidsTALC is transcribed in two versions. The first version is the spoken text in orthographic transcription (standard German text) and the second version is a detailed phonetic transcription using the IPA (see Section 3.3 for more details). Furthermore, additional annotations and information are provided, including typical developmental speech errors, non-speech vocalizations and regions of overlapping or unintelligible speech, as well as the age and gender of the speaker. The transcriptions and annotations were made by trained graduate students of speech-language therapy, trained speech-language therapists and a professional transcription agency. Additionally, the recordings of the adults the children are talking to are also available, but for these parts, only the orthographic transcriptions are provided.

3.2.1. Word and Phoneme Occurrence

The corpus uses a simplified version of the IPA for the phonetic transcriptions, with a set of 40 different phonemes to represent German children’s speech. In total, around 176,000 phonemes were transcribed, stemming from about 55,000 words the children spoke. Among these, there are roughly 4300 unique words and around 7600 different pronunciation variants. Most of these variants occur only a few times, leading to an uneven but natural distribution. Figure 5 shows the phoneme occurrence in the KidsTALC dataset in thousands. It shows that some phonemes appear much more frequently than others and that the phonemes of the German language are not distributed evenly.

3.2.2. Limitations

While the KidsTALC dataset provides a good starting point for a prototypical implementation, there are several limitations that need to be considered. All recordings are based on spontaneous speech, recorded during natural play and conversation. The children did not read from a fixed text, which means the spoken content is highly variable and unstructured. For the use case, tracking children while they read aloud, this is not ideal. Since the goal is to detect pronunciation and reading errors during reading, spontaneous speech is somewhat off-domain.
Another limitation concerns the phonetic transcriptions. According to [33], these transcriptions were created by multiple trained experts, but small errors and inconsistencies still occur. To understand how reliable the transcriptions are, the authors of the dataset carried out a transcription agreement study. They selected 144 utterances of different lengths, balanced across age and gender, and had them transcribed independently by three experts. The phoneme error rate (PER) was then calculated between each pair of transcriptions, as well as compared to the final reviewed transcript. The average PER between transcribers was 14.6% and the PER to the final version was 12.8%. Most disagreements came from similar-sounding vowels (such as /e/ vs. /ε/ or /a/ vs. /ɑ/), and whether final /t/ sounds were included, especially in words like “und” and “jetzt”.

3.3. IPA and Phoneme-Level Annotation

As already described in the previous sections, the KidsTALC corpus provides phonetic transcriptions, which fits the goal of tracking children’s reading at the phoneme level. A very common notation for phonetic transcriptions is the International Phonetic Alphabet, for which a rather approachable overview can be found in [35]. The IPA is a standardized system of phonetic notation designed to represent the sounds of spoken language. Developed by the International Phonetic Association in the late 19th century, the IPA assigns a unique symbol to each distinct sound (phoneme), ensuring a consistent and unambiguous representation of pronunciation across different languages and dialects.
Unlike conventional orthography, where a single letter can correspond to multiple sounds (e.g., the letter “e” in German words like “gehen” and “geben”), the IPA provides a one-to-one correspondence between symbols and sounds. This precision is particularly beneficial in fields such as linguistics, language education and speech technology, where accurate representation of pronunciation is crucial.
In the context of German, the IPA facilitates the detailed transcription of phonemes, capturing nuances that are often overlooked in standard spelling. For instance, the German word “Schule” (meaning “school”) is spelled using six letters but comprises only four phonemes. The word “Schule” transcribed with IPA is shown in Figure 6.
  • /ʃ/ represents the “sh” sound, as heard in the English word “shoe” or the German word “schön”. It is produced by placing the tongue near the roof of the mouth without touching it and letting air flow through.
  • /uː/ is a long “oo” sound, like in the English word “food” or the German word “gut”. It is produced with rounded lips and the tongue positioned high and towards the back of the mouth.
  • /l/ corresponds to the familiar “l” sound, as in “Lampe”. It is made by touching the tip of the tongue to the area just behind the upper front teeth.
  • /ə/ is known as a schwa, a very short and neutral vowel sound. It often occurs in unstressed syllables, such as the second syllable in the German word “bitte”. The tongue remains in a relaxed, central position.

3.4. Data Preparation & Augmentation

3.4.1. Data Preprocessing

The KidsTALC corpus provides one single audio file per child, along with one corresponding JSON file for all the recorded samples. This JSON file includes the orthographic and phonetic transcriptions, metadata, and annotations for each individual utterance.
To prepare the dataset for model training, the audio file was first resampled to 16 kHz. This sampling rate is required by most modern machine learning models, including Wav2Vec2, and ensures consistency in the training pipeline. After resampling, the large audio file was segmented into individual utterances. Each resulting audio file corresponds to one sample from the JSON file.
The next step was to filter the dataset. Samples without any orthographic transcription were removed. In addition, recordings shorter than 0.5 s were discarded, as they are often too short for reliable phoneme recognition. Many transcriptions contained annotation markers for reading or pronunciation errors, embedded directly into the orthographic or phonetic strings. To obtain clean transcriptions, these markers were removed using regular expressions. For the adult speech, phonetic transcriptions were missing; to generate these, the Python 3 library IPA-Phonemizer [36] was used, which produces phoneme sequences in the IPA based on orthographic input. Since the Phonemizer uses slightly different symbols than the experts who created the original transcriptions, all phonemes were mapped to the same reduced set of 40 IPA characters used for the children’s speech.
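As a rough illustration of this step, the snippet below phonemizes a German sentence with the phonemizer package’s espeak backend and applies a toy symbol mapping; the actual library version, backend, and mapping table used in this work may differ.

```python
# Hedged sketch of generating IPA transcriptions for the adult utterances.
# The phonemizer package with its espeak-ng backend is assumed here; the
# mapping to the reduced 40-symbol KidsTALC inventory is only hinted at.
from phonemizer import phonemize

text = "Die Kinder lesen ein Buch"
ipa = phonemize(text, language="de", backend="espeak", strip=True)
print(ipa)  # e.g. something like "diː kɪndɐ leːzən aɪn buːx"

# Map symbols produced by the phonemizer onto the reduced phoneme inventory
# (the actual mapping table is dataset-specific; these entries are hypothetical).
symbol_map = {"ɐ": "ə", "ʁ": "r"}
normalized = "".join(symbol_map.get(ch, ch) for ch in ipa)
print(normalized)
```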
To improve audio quality, a mid-pass filter was applied which emphasized frequencies between 2 kHz and 4 kHz, which are crucial for understanding speech. Since many utterances were recorded at low volume, loudness normalization was also applied to ensure consistent volume across the dataset.
The final output of this pipeline was a folder containing individual, cleaned audio files and a structured CSV file with the corresponding orthographic and phonetic transcriptions. All samples were free of error annotations and ready for training and evaluation. An overview of the entire process is shown in Figure 7.

3.4.2. Data Augmentation

To improve model robustness and reduce overfitting, several data augmentation techniques were also applied. These methods help to artificially increase dataset diversity, which is especially valuable for underrepresented domains such as children’s speech recognition. As described by Bhardwaj et al. [15] and Fan et al. [32], techniques like pitch shifting, speed perturbation, and vocal tract length modification have been shown to improve the performance of ASR systems on children’s speech. In this project, three augmented versions were created for each child sample and two for each adult sample. For children, the aim was to slightly vary the signal while preserving child-like speech characteristics. For adult data, the goal was to simulate child-like traits to improve generalization to child speech. The following augmentations were applied (a simplified code sketch follows after the list):
  • Pitch Shift: Applied randomly within a narrow range to child samples and with stronger upward shifts for adults to increase similarity to the higher pitch of children’s voices.
  • Vocal Tract Length Normalization (VTLN): Is a speaker-normalization technique that reduces variability in speech signals by compensating for differences in speakers’ vocal tract lengths, which affect the resonance patterns of the vocal tract and thus the observed spectral characteristics. The method operates by applying a child-specific warping of the frequency axis, typically parameterized by a single scalar warp factor that stretches or compresses the spectrum so that speakers with different tract lengths are mapped to a common acoustic space. This warp factor is usually estimated through maximum-likelihood procedures, where the value that best aligns the child’s speech features with an acoustic model is selected.
  • Speed Perturbation: Applied to both groups. For children, only small variations were introduced. For adults, a slower speed was used more frequently to resemble the typically slower speech rate of children.
Figure 7 shows the activity diagram of the applied preprocessing steps necessary to prepare the data for training.
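A simplified sketch of the pitch and speed augmentations is given below using librosa; the VTLN-style frequency warping is omitted, and the parameter ranges and file names are placeholders rather than the exact values used in the experiments.

```python
# Simplified sketch of pitch and speed augmentation with librosa.
import random
import librosa
import soundfile as sf

y, sr = librosa.load("child_utterance.wav", sr=16000)  # hypothetical input file

# Child samples: small random pitch shift (in semitones) and speed change.
pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=random.uniform(-1.0, 1.0))
stretched = librosa.effects.time_stretch(y, rate=random.uniform(0.95, 1.05))

# Adult samples: stronger upward pitch shift and slower speed to mimic children.
adult, _ = librosa.load("adult_utterance.wav", sr=16000)
child_like = librosa.effects.time_stretch(
    librosa.effects.pitch_shift(adult, sr=sr, n_steps=random.uniform(2.0, 4.0)),
    rate=random.uniform(0.85, 0.95),
)

sf.write("adult_utterance_aug.wav", child_like, sr)
```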

3.5. Model Architecture

After collecting and preparing the data, a Wav2Vec2 model, a self-supervised speech recognition model developed by Meta AI and made available through the Hugging Face Model Hub, was fine-tuned. As already discussed in Section 2.2.5, Wav2Vec2 is specifically designed to learn contextualized representations of raw audio data and has shown strong performance across various speech-related tasks, especially when labeled training data is limited. The model architecture supports phoneme-level fine-tuning, making it well suited for applications such as detecting reading errors in children’s speech.

3.5.1. Model Choice and Justification

Among the available self-supervised speech models, Wav2Vec2 was selected over alternatives such as WavLM, HuBERT, and Whisper (described in Section 2.2.5) due to a combination of technical and practical factors. While WavLM has recently demonstrated superior performance on several speech benchmarks, it was relatively new at the time of model development, and comprehensive documentation and fine-tuning support were still limited. Additionally, WavLM builds upon the Wav2Vec2 architecture, with many design principles remaining similar. As such, Wav2Vec2 was selected as a stable and well-supported foundation for prototyping.
In contrast, Whisper, developed by OpenAI, is a large, supervised model trained on 680,000 h of transcribed speech. While highly robust and multilingual, Whisper is primarily optimized for full-sentence transcription tasks and general-purpose speech recognition. Its architecture is significantly larger and less efficient for domain-specific adaptation to phoneme-level tasks. Furthermore, its resource requirements make it less suitable for mobile deployment, which is a potential use case for the system developed in this work. To sum up, Wav2Vec2 offers a strong balance of performance, flexibility, and documentation. Additionally, it supports fine-tuning with small datasets and is able to provide interpretable outputs at the phoneme level.
The Wav2Vec2 model exists in different variants, differing in size and in the data used for pretraining, each targeting specific use cases. The specific variant we used is wav2vec2-xls-r-300m (also referred to as “Base”), which contains approximately 300 million parameters. It is part of the XLS-R model family, pretrained on 436,000 h of unlabeled multilingual speech data from datasets such as VoxPopuli, CommonVoice, BABEL, and MLS. The large-scale and multilingual pretraining makes the model well suited for handling varied speaker characteristics, including those present in children’s speech.
While larger XLS-R variants exist, with model sizes up to 2 billion parameters, the base variant was chosen due to its significantly lower memory footprint and computational requirements. This improves compatibility with resource-constrained environments, such as mobile devices. Smaller models also allow faster inference, which is desirable in interactive educational settings.

3.5.2. Fine-Tuning Setup

For fine-tuning the model, the data was split into 70% training, 15% validation, and 15% test sets, as is common practice to ensure a clear separation between model training, hyperparameter tuning and the final evaluation. The training set is used to optimize the model weights, the validation set monitors performance during training and helps guide tuning decisions, and the test set provides an unbiased measure of how well the model generalizes to unseen data.
During fine-tuning, different hyperparameters, such as batch size and learning rate, were tested to identify a stable and effective configuration. A learning rate warm-up strategy was also applied, where the learning rate increases gradually at the beginning of training. This helps prevent unstable updates and improves convergence by allowing the model to start learning more smoothly. Furthermore, an early stopping mechanism was used to automatically stop training if the validation performance did not improve over several steps. This helps avoid overfitting and reduces training time and costs by halting the process once further improvements are unlikely.
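The following sketch shows how such a fine-tuning setup can look with the Hugging Face Trainer, including warm-up and early stopping. The hyperparameter values are placeholders, the datasets and data collator are assumed to be prepared elsewhere, and argument names may differ slightly between transformers versions; it is not the exact configuration used in our experiments.

```python
# Hedged sketch of a Wav2Vec2 CTC fine-tuning setup with Hugging Face Transformers.
from transformers import (EarlyStoppingCallback, Trainer, TrainingArguments,
                          Wav2Vec2ForCTC)

# CTC head on top of the pretrained encoder; the head is newly initialized
# with a vocabulary of roughly 40 IPA phonemes plus special tokens (assumed size).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=43,
    pad_token_id=0,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the CNN front-end fixed during fine-tuning

# Placeholder hyperparameters illustrating warm-up and best-model tracking.
training_args = TrainingArguments(
    output_dir="wav2vec2-kidstalc-ipa",  # hypothetical output directory
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    warmup_steps=500,                    # gradual learning-rate warm-up
    num_train_epochs=30,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# The Trainer is then built with the prepared splits and a CTC data collator
# (not shown here), with early stopping on the validation loss:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_split, eval_dataset=val_split,
#                   data_collator=collator,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
# trainer.train()
```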

3.6. Evaluation Setup

To evaluate the performance of the trained models, two widely used metrics from the field of ASR were used: Phoneme Error Rate (PER) and Word Error Rate (WER). These metrics provide a quantitative assessment of how accurately the model transcribes spoken input into target sequences, either at the phoneme or word level. Both metrics are based on a comparison between the predicted transcription and the reference (ground truth) transcription. The comparison is performed using alignment algorithms such as Levenshtein distance, which compute the minimum number of operations needed to convert one sequence into another. These operations include:
  • Substitutions (S): a predicted token is incorrect (e.g., /b/ instead of /p/).
  • Deletions (D): a token from the reference is missing in the predicted output.
  • Insertions (I): an extra token appears in the predicted output that is not present in the reference.

3.6.1. Phoneme Error Rate (PER)

PER is particularly relevant as the model is trained to predict sequences of phonemes in the IPA. This allows for fine-grained analysis of pronunciation and phonetic accuracy. The phoneme error rate is calculated as:
$$\mathrm{PER} = \frac{S + D + I}{N} \times 100$$
where S is the number of substituted, D is the number of deleted and I is the number of inserted phonemes. N is the total number of phonemes in the reference transcription. A lower PER indicates better phonetic performance and is desirable in ASR systems.

3.6.2. Word Error Rate (WER)

WER measures model performance at the word level and is commonly used in general-purpose ASR systems. Although our proposal focuses on phoneme-level transcription, WER is reported for completeness and broader comparability. The formula is identical in structure to the PER, but the variables refer to word-level errors instead of phonemes. While WER gives an indication of overall transcription quality, PER is more informative for our use case, where detecting specific phoneme-level errors is essential.
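Both metrics reduce to the same edit-distance computation, applied either to phoneme or to word sequences. The standalone sketch below illustrates this; in practice, libraries such as jiwer are commonly used instead, and the example sequences are invented.

```python
# Minimal sketch of computing PER and WER via Levenshtein alignment.
def edit_ops(reference: list, hypothesis: list) -> int:
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m]

def error_rate(reference: list, hypothesis: list) -> float:
    return 100.0 * edit_ops(reference, hypothesis) / len(reference)

ref_phones = ["ʃ", "uː", "l", "ə"]   # reference IPA phonemes for "Schule"
hyp_phones = ["ʃ", "u", "l", "ə"]    # model output missing the length mark
print(f"PER: {error_rate(ref_phones, hyp_phones):.1f}%")   # one substitution -> 25.0%

ref_words = "die kinder lesen ein buch".split()
hyp_words = "die kinder lese ein buch".split()
print(f"WER: {error_rate(ref_words, hyp_words):.1f}%")     # one substitution -> 20.0%
```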

3.6.3. Evaluation Procedure

For evaluation, the model output is aligned with reference transcriptions using edit distance algorithms. The total numbers of insertions, deletions, and substitutions are recorded for both phoneme and word sequences. PER and WER are then computed over the entire test set. These metrics provide insight into the model’s strengths and limitations and form the basis for further refinement and integration into reading support applications.
The previous sections outlined the conceptual design and methodological foundations of the system, including data preparation, model architecture, and evaluation strategy. The following section now focuses on the practical implementation of these concepts. It describes how the training pipeline was set up, which tools and frameworks were used, and how the experiments were conducted in practice.

4. Practical Implementation

This chapter describes the practical implementation of the methodology outlined previously, covering the full pipeline from data preparation to model deployment. Using the KidsTALC dataset, we built a training infrastructure for fine-tuning Wav2Vec2 with Hugging Face Transformers, integrated experiment tracking, and developed model variants. Emphasis was placed on reproducibility and scalability through cloud-based training, data versioning, and automated workflows. The chapter ends with a discussion of deployment challenges for mobile adaptation.

4.1. Data Preparation and Preprocessing

This section describes the data preparation and preprocessing steps undertaken to create a training dataset for the recognition model. It covers the cleaning and filtering of the KidsTALC corpus, phonetic transcription generation and audio processing techniques applied to optimize the data for model training.
The primary dataset used was KidsTALC, a corpus of speech from German-speaking children aged 3.5 to 11 years, containing roughly 25 h of speech annotated with both orthographic and phonetic transcriptions. However, as described in Section 3.2, the raw KidsTALC data required significant preprocessing to ensure quality and consistency for model training. The data was first cleaned to remove samples that would be problematic for training: all entries with missing or empty transcripts were dropped, and utterances with extremely short duration (<0.5 s) were discarded to avoid uninformative training samples. After cleaning, the dataset was reduced to about 14,000 samples (from an initial ~22,000 entries). Each remaining audio file was validated to ensure it contained an audible speech segment with a corresponding transcript.
Since the goal is phoneme-level recognition, the target transcripts for the model are sequences of phonetic symbols rather than orthographic words. The KidsTALC corpus provides a phonetic transcription for each child utterance, using the International Phonetic Alphabet (IPA). The adult data, however, was only transcribed orthographically, so these transcripts had to be converted to IPA using the Phonemizer library [36], which provides a convenient way to convert German text into IPA symbols.
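A minimal sketch of this conversion step is shown below; the espeak backend and the example sentence are illustrative, and the exact IPA output may vary with the backend version.

```python
from phonemizer import phonemize

text = "auf der kaputten Laterne"   # illustrative orthographic transcript
ipa = phonemize(
    text,
    language="de",     # German voice of the backend
    backend="espeak",  # requires espeak-ng to be installed on the system
    strip=True,        # remove trailing separators
)
print(ipa)             # IPA string; exact symbols depend on the backend version
```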

4.2. Audio Processing

The audio files of the KidsTALC corpus were provided as one long recording per speaker, resulting in 39 audio files in total. Each file contained multiple spoken utterances, which were segmented based on metadata provided in the dataset: the start position and duration of each utterance within the continuous recording were used to extract individual audio segments. The extracted segments were then resampled from their original sample rate of 48 kHz to 16 kHz, since the Wav2Vec2 model used was pre-trained on audio data with a sample rate of 16 kHz.
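The following is a minimal sketch of this segmentation and resampling step; the use of librosa and soundfile, as well as the metadata field names, are assumptions made for illustration.

```python
import librosa
import soundfile as sf

def extract_utterance(recording_path, start_s, duration_s, out_path):
    # Load the full speaker recording at its native sample rate (48 kHz)
    audio, sr = librosa.load(recording_path, sr=None)
    # Cut out the utterance using the start time and duration from the metadata
    segment = audio[int(start_s * sr): int((start_s + duration_s) * sr)]
    # Resample to the 16 kHz expected by the pretrained Wav2Vec2 model
    segment_16k = librosa.resample(segment, orig_sr=sr, target_sr=16000)
    sf.write(out_path, segment_16k, 16000)
```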
To further improve the clarity of speech for transcription, a frequency boost was applied to the midrange between 2 kHz and 4 kHz, since this frequency band is especially important for understanding spoken language. The boost was achieved using a peaking equalizer filter centered at 3 kHz with a gain of +6 dB. After resampling and boosting, a loudness adjustment step was introduced, because many of the recordings, especially those from children, had very low volume, which could negatively impact the performance of the speech recognition model.
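The sketch below shows one possible implementation of these two steps: a peaking-equalizer biquad based on the standard audio-EQ-cookbook formulas, followed by a simple peak normalization. The Q factor and the normalization method are assumptions; the source only specifies the 3 kHz center frequency and the +6 dB gain.

```python
import numpy as np
from scipy.signal import lfilter

def peaking_eq(audio, sr, center_hz=3000.0, gain_db=6.0, q=1.0):
    # Biquad peaking EQ (audio-EQ-cookbook); q=1.0 is an assumed value
    A = 10 ** (gain_db / 40.0)
    w0 = 2 * np.pi * center_hz / sr
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return lfilter(b / a[0], a / a[0], audio)

def normalize_peak(audio, target=0.9):
    # Simple loudness adjustment: scale so the loudest sample reaches `target`
    peak = np.max(np.abs(audio)) + 1e-9
    return audio * (target / peak)

# boosted = normalize_peak(peaking_eq(audio_16k, sr=16000))  # audio_16k: a resampled utterance
```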

4.3. Data Augmentation

After processing and cleaning the samples, data augmentation techniques were employed to enhance the model’s robustness and address the limited size of the children’s speech dataset. This section describes the various augmentation methods applied to both adult and children’s speech data, including pitch shifting, speed modification and the “childrenization” process used to make adult speech more acoustically similar to children’s voices.

4.3.1. Adult Speech Data and “Childrenization”

To improve the model’s performance given the limited amount of child speech data, additional adult speech data was incorporated as outlined in Figure 8. Since adult speech differs significantly from child speech, exhibiting a lower fundamental frequency, a different formant structure and more fluent articulation (see Section 2.2.2), using it directly can lead to mismatches. To address this, a preprocessing method referred to as “childrenization” was applied to the adult recordings. This technique aims to make the adult recordings acoustically resemble child speech and thereby reduce the age-related domain mismatch. Specifically, each adult audio sample was digitally altered in pitch and tempo: the pitch was randomly increased by 1 to 3 semitones to approximate the higher pitch range typical of children’s voices, and the tempo was reduced to reflect the slower speech rate often observed in children. Furthermore, to simulate anatomical differences such as shorter vocal tracts in children, frequency-warping transformations were applied to emulate varying Vocal Tract Length Normalization (VTLN) conditions. These augmentations aimed to generate more child-like acoustic features and thus enhance the model’s generalization to child speech. For each adult sample, one extra augmented sample was created (an augmentation ratio of 1:1).
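A minimal sketch of the pitch and tempo part of this “childrenization” step is given below, using librosa. The pitch range of +1 to +3 semitones is taken from the description above, the tempo-reduction range is an assumed placeholder, and the VTLN-style frequency warping is omitted for brevity.

```python
import random
import librosa

def childrenize(audio, sr=16000):
    # Raise the pitch by 1-3 semitones to approximate children's voices
    n_steps = random.uniform(1.0, 3.0)
    shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)
    # Slow the speech down slightly; the exact range is an assumed placeholder
    rate = random.uniform(0.90, 0.97)
    return librosa.effects.time_stretch(shifted, rate=rate)
```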

4.3.2. Data Augmentation for Children’s Speech

Beyond the adult data augmentation, the children’s speech data itself was augmented to increase the effective training size and variety. On the one hand, data augmentation helps to simulate additional child voices; on the other hand, it creates more variations of each sample and thus helps the model generalize to differences in pronunciation and recording conditions. For each original child utterance, three additional augmented versions (a 2:1 augmentation ratio) were created. These augmentations included small pitch shifts of ±1 semitone and speed changes of ±5–10%.
All augmented files were saved in separate folders to keep track of their origin and to simplify training through a clear separation between original and augmented data. In addition, metadata fields were added to the dataset CSV to indicate whether a given sample was original or augmented and which augmentation parameters were applied. In total, after augmentation, the combined dataset (children + adults) contained 34,700 audio samples, of which roughly 20,900 were newly created augmentations. A detailed overview of the augmentation parameters used is given in Table 2.
As outlined before, the split strategy was 70% training, 15% validation and 15% test of the filtered and cleaned audio. For models that used augmented data, the augmented samples were included only in the training set.

4.4. Implementation of Wav2Vec2

The model implementation is based on the transformers library developed by Hugging Face [37]. This open-source library provides powerful tools for working with state-of-the-art Transformer-based models, including pretrained models, tokenizers, feature extractors and training utilities. To fine-tune the chosen Wav2Vec2 model, the tokenizer, feature extractor and trainer components from the library were used:
  • Tokenizer: The tokenizer is responsible for converting text into a sequence of token IDs and vice versa. For Wav2Vec2, a special CTC tokenizer is used, based on a custom vocabulary file (vocab.json) that contains all the phonetic symbols of the target transcriptions, as shown in [35]. In addition, two special tokens for unknown characters and padding were added.
  • Feature Extractor: The feature extractor transforms the raw audio samples into a format suitable for input to the Wav2Vec2 model. This preprocessing step typically involves resampling, normalization and segmenting the waveform into smaller frames.
  • Trainer: At its core, the system uses the pretrained facebook/wav2vec2-xls-r-300m checkpoint, a multilingual model with 300 million parameters trained in a self-supervised fashion. Instantiating it for CTC appends a linear classification head to the Transformer encoder; this head maps the model’s hidden representations to phoneme-level tokens and is trained during fine-tuning. The Trainer is set up with the training and evaluation datasets, the model and the training arguments. To monitor performance and adjust training behavior, a custom metric computation function was added so that the model is evaluated using PER and WER, as described in Section 3.6. A minimal sketch of this setup is shown below.
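The sketch assumes the preprocessed dataset objects, the CTC padding collator and the metric function are defined elsewhere; only the wiring of the Hugging Face components is shown.

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC, Trainer)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),        # linear CTC head over the phoneme vocabulary
)

trainer = Trainer(
    model=model,
    args=training_args,               # e.g., the TrainingArguments sketched in Section 3.5.2
    train_dataset=train_dataset,      # assumed preprocessed datasets
    eval_dataset=val_dataset,
    data_collator=data_collator,      # assumed CTC padding collator
    compute_metrics=compute_metrics,  # custom PER/WER computation (Section 3.6)
    tokenizer=processor.feature_extractor,
)
```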

4.5. Hyper-Parameter Tuning

To optimize model performance, key training hyper-parameters (such as the learning rate, batch size, number of epochs, warm-up ratio and learning rate scheduling strategy) were refined through a grid search performed on a reduced subset of the training data.
Grid search is a systematic method for hyperparameter optimization in which a fixed set of values is defined for each parameter. All possible combinations of these values are then evaluated. For each configuration, the model was trained and evaluated using the reduced dataset, while all other conditions were held constant. The performance of each configuration was validated using PER and WER, allowing the best-performing parameter set to be identified.
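As an illustration, the loop below sketches such a grid search; the candidate values and the train_and_evaluate() helper are hypothetical placeholders for a short training run on the reduced subset that returns the validation PER.

```python
import itertools

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],  # illustrative candidate values
    "batch_size": [4, 8, 16],
    "warmup_ratio": [0.0, 0.1],
}

best_config, best_per = None, float("inf")
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    per = train_and_evaluate(config)  # placeholder: short fine-tuning run on the reduced subset
    if per < best_per:
        best_config, best_per = config, per

print("Best configuration:", best_config, "with validation PER:", best_per)
```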
The final configuration obtained through this process was used to train the model on the full dataset. Table 3 summarizes the selected hyper-parameters along with brief descriptions of their roles.

4.6. Overfitting Avoidance

To improve efficiency and avoid overfitting, the model’s lower layers (e.g., feature encoder) were frozen, preventing parameter updates during backpropagation. This is particularly useful with small datasets, as these layers already contain pretrained audio representations, allowing training to focus on higher layers and the classification head. An early stopping mechanism monitored validation loss, halting training if no improvement occurred over three consecutive evaluations, with a minimum improvement threshold of 0.001 to reset patience. This ensured training stopped once optimal validation performance was reached.
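Both measures map directly onto the Hugging Face API; a minimal sketch, assuming the model and trainer objects from Section 4.4, is:

```python
from transformers import EarlyStoppingCallback

# Keep the pretrained convolutional feature encoder fixed during fine-tuning
model.freeze_feature_encoder()

# Stop training after 3 evaluations without an improvement of at least 0.001
trainer.add_callback(EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.001,
))
```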

4.7. Training Configurations and Model Variants

Three variants of the Wav2Vec2 model were trained, differing in the composition of their training data. All other training settings (model architecture, hyper-parameters etc.) remained consistent across these experiments. The following sections detail each model variant and the rationale behind their development:
(1) Mixed Data Model: The first model trained was based on the idea that more data, regardless of quality, might improve generalization. This version used a large and diverse training set created from the KidsTALC dataset, combining both children’s and adult speech, along with various augmented versions of each. The dataset also included samples of mixed quality, such as recordings with background noise or unclear articulation. This model aimed to make full use of all available data and test whether the model could still learn useful speech patterns despite inconsistencies.
(2) Child-Only Model: The second model was trained exclusively on child speech data from the KidsTALC dataset, including both original and augmented recordings. No adult speech was used. This variant focused on learning the specific acoustic and phonetic characteristics of young speakers. By removing adult voices, the model could specialize more closely in the types of speech it is expected to analyze during real-world use.
(3) Combined Clean Children + Adult Model (No Augmentation): The final model used a carefully filtered subset of the KidsTALC dataset, containing only clean, original recordings from both children and adults. All samples marked as hard to understand, noisy, or overlapping were excluded and no data augmentation was applied. This variant was designed to test whether using only high-quality, natural speech would lead to more stable and accurate recognition, particularly in contrast to the larger but noisier training sets used in the previous models.
Table 4 summarizes the three model variants and their training data characteristics.

4.8. Experiment Tracking with Weights & Biases

Weights & Biases (W&B) [38] is a platform for experiment tracking, model management and performance visualization in machine learning workflows. It provides a suite of tools to log training metrics, visualize results, compare runs and organize experiments in a reproducible and collaborative way. In addition to experiment tracking, W&B supports dataset versioning, artifact storage and even model deployment, making it a comprehensive solution for managing the entire machine learning lifecycle. W&B was used to monitor the training process, log key performance metrics, store model checkpoints and record system statistics such as GPU and CPU usage. This helped ensure that all experiments were reproducible and well documented.

4.9. Model Evaluation

The evaluation of the fine-tuned Wav2Vec2 model focused on assessing transcription quality at both the word and phoneme levels. To achieve this, two widely used metrics from automatic speech recognition (ASR) were applied: WER and PER. While WER offers a general measure of how accurately spoken words are transcribed, PER provides a more granular perspective (see the methodology in Section 3.6).
The implementation of these metrics was carried out using the open-source Python library jiwer, which calculates error rates based on the Levenshtein distance. Insertions, deletions and substitutions were counted for each prediction-reference pair and used to derive the respective error rates.
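A minimal sketch of this computation with jiwer is shown below; the example strings are taken from the qualitative analysis in Section 5.1.3, and since phonemes are encoded as individual IPA characters, jiwer’s character error rate corresponds to the PER here.

```python
import jiwer

reference  = "auf dea kaputn lateane"
prediction = "auf dεa kaputn latεane"

wer = jiwer.wer(reference, prediction)  # word-level error rate
per = jiwer.cer(reference, prediction)  # character-level rate, i.e., PER for single-character IPA tokens
print(f"WER: {wer:.1%}, PER: {per:.1%}")
```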
Beyond summary statistics, a phoneme-level confusion matrix was also generated to gain deeper insight into model behavior. This matrix was used to analyze which phonemes were commonly misrecognized and which were accurately detected.

5. Evaluation and Results

This chapter evaluates the performance of the three fine-tuned Wav2Vec2 models on the KidsTALC child speech dataset. Each model was trained on a different subset of data, and their accuracy is compared using the phoneme error rate (PER) and word error rate (WER). The results are outlined, the best-performing model is analyzed in detail, and the effect of data quality and quantity on the outcomes is discussed. Finally, the chapter compares the findings with human performance and answers the research questions posed earlier.

5.1. Model Performance Analysis

First, the performance of the three Wav2Vec2 models, each fine-tuned for phoneme and word recognition on the KidsTALC dataset, is compared. Each model was trained on a different subset of the available data, which enables a direct comparison of how data quality and quantity affect recognition accuracy. The following analysis summarizes the key results and takes a closer look at the best-performing model.

5.1.1. Overall Performance of the Models

The three Wav2Vec2-based models achieved noticeably different results. Table 5 summarizes the performance of each model in terms of PER and WER.
Training began with the mixed-data model, based on the idea that using as much data as possible would yield the best results. Although this provided a reasonable starting point, the assumption proved wrong: the performance was worse than that reported in comparable studies. For the next experiment, a model was trained only on child speech, under the assumption that the adult speech in the mixed model might make it harder for the model to learn child speech correctly. This approach brought some improvement, but the results were still not ideal. In the final experiment, training used only recordings that were clear and easy to understand. Even though this model used the least data, it performed best, showing that data quality matters more than quantity when fine-tuning a pretrained model. The evaluation was always performed on the same test data, so only the training data changed between models.
Model 1: Mixed Data
This model was trained on the largest dataset, about 40 h of speech, which included both child and adult speech as well as noisy and augmented audio. The model achieved a PER of 28.7% and a WER of 89.1%, the highest error rates among the three models. This means that about 71% of phonemes were recognized correctly, but nearly nine out of ten words were transcribed incorrectly. Using a large amount of mixed and noisy data therefore did not reach the performance reported in related work. It indicates that the additional variability, especially from adult speech, hard-to-understand samples and noise, may have made learning more difficult for the model.
Model 2: Child-Only Data
After the first model’s limited success, the next approach focused on training a model using only children’s speech, while still including augmented data to increase the training set size. This led to a noticeable improvement in phoneme recognition: the PER decreased to 23.5%, but the WER barely improved, remaining at 91.3%. Although phoneme recognition improved compared to the first model, the WER stayed high, with only a small fraction of words transcribed correctly. This suggests that while the model became better at recognizing child-specific phonetic patterns, challenges remained at the word level.
Model 3: Clean Data
In the final experiment, the model was trained exclusively on high-quality data. After filtering, this amounted to only about 7.5 h of speech from both children and adults, consisting solely of recordings marked as clear and easy to understand; no noisy or augmented data was used. Despite having the smallest dataset, this model achieved the best results: on the held-back test set, the PER dropped to 14.3% and the WER to 31.6%. This means that 85.7% of phonemes and more than two-thirds of words were recognized correctly, roughly halving the error rates of the first model and showing a clear improvement over the second model as well. The findings demonstrate that, during fine-tuning, the quality of training data can be more important than quantity, especially when starting from a pre-trained model that has already seen a large amount of speech data, as is the case for the Wav2Vec2 model used here.

5.1.2. Phoneme and Word Error Rate Progression During Training

To better understand how the models learned over time, the progression of both Phoneme Error Rate (PER) and Word Error Rate (WER) was tracked during training. The following sections summarize how these error rates changed for each model, a quick overview of essential results is shown in Table 6.
Phoneme Error Rate (PER)
The first chart (Figure 9) shows the progression of the Phoneme Error Rate (PER) during training. This metric is equivalent to the Character Error Rate (CER), since phonemes are encoded as individual IPA characters. As already mentioned, the model trained only on clean speech (blue line) demonstrates the best performance, reaching a PER below 0.12 (11.8%) on the validation set and a PER of 14.3% on the held-back test set. The kids-only model (pink line) improves quickly at first, but stops earlier in the training process. In contrast, the mixed-augmented model (gray line) learns more slowly and ends with a much higher error rate of about 0.29. It is important to note that the mixed-augmented model runs for more training steps than the other two models; this longer training is due to its larger dataset, which includes extra augmented samples as described in Section 4.7. The red line, on the other hand, corresponds to a model trained on a larger dataset than the blue one, but because the augmented data is very similar to the original samples, the model stopped improving after a certain number of epochs and the early stopping mechanism ended training early.
Word Error Rate (WER)
In addition to the PER, the second chart (Figure 10) shows the WER for all three models during training. As the PER decreases, the WER also improves. However, in some cases, as can be seen for the red line, the WER remains high while the PER is shrinking, since even a single phoneme error can cause an entire word to be transcribed incorrectly. When the PER drops below about 20%, the WER decreases much more rapidly, as more full words are recognized without any errors. This effect is clearly visible for the clean model (blue), which achieves a WER below 30%. The kids-only model (pink) shows some improvement, but not as much as the clean-data model. The mixed-augmented model (gray) remains at a high WER throughout training and recognizes few words correctly.

5.1.3. Best Model Analysis

While the evaluation of the model’s performance was primarily based on the PER, a more fine-grained phoneme-level analysis was also conducted for the best model. Initially, the goal was to generate a phoneme confusion matrix to identify specific substitutions made by the model. However, this approach proved difficult in practice. Since the model outputs a full phoneme sequence without explicit phoneme alignments or timestamps, even a single deletion or insertion can cause all subsequent phonemes to be misaligned. This inflates the error counts and renders the confusion matrix unreliable. Moreover, the model sometimes fails to correctly recognize word boundaries, which further complicates alignment. For these reasons, a manual qualitative evaluation of the model’s predictions was carried out instead to better understand the recurring phoneme-level errors. The following error types were observed during this evaluation:
Vowel Confusions
One frequently observed issue is the confusion between phonetically similar vowels. In particular, the model often predicts [e] instead of [ε], and [ø] is sometimes replaced with [o] or even [e]. These vowel pairs are acoustically close, and such substitutions indicate that the model struggles to distinguish them reliably. This type of error likely arises from the overlap in spectral properties and the limited training data for certain vowel sounds. However, it is also important to note that some of these apparent errors might not originate from the model alone. As stated in the transcription agreement section of the official KidsTALC documentation, the phoneme-level transcriptions themselves may contain a certain level of inconsistency, particularly in the representation of similar vowels. Disagreements among annotators or slight variations in phonetic realization can lead to a measurable PER even in the target transcriptions. The following example illustrates a case where the model replaced the vowel [e] with the vowel [ε]:
Ground Truth: auf dea kaputn lateane
Prediction: auf dεa kaputn latεane
Word-Final Deletions
Another recurring issue is the loss of word-final consonants. Phonemes such as [n], [t], or [s] are often missing at the end of words. These phonemes are typically pronounced less prominently and are more likely to be dropped, and the model frequently skips them. In the example below, the nasal ending [en] of klainen (German “kleinen”) is affected. In addition, another vowel confusion occurs in the word dea:
Ground Truth: di lebm in dea klain vεlt
Prediction: di nebm min dεa klainen vεlt
Short Utterance Misrecognition
In very short utterances or isolated words, the model sometimes fails to produce a meaningful phoneme sequence. Instead, it appears to substitute the target with a more frequent and acoustically “safer” vowel. This behavior is particularly visible in one-word phrases or non-standard expressions, for which the training data may contain few examples. The following example shows how the initial sound [ç] is replaced with [i], resulting in [ia] instead of the intended [ça]. While [ç] and [i] are not phonetically similar, [i] appears more frequently in the training data and may have been used by the model as a fallback.
Ground Truth: ça
Prediction: ia
Word Boundary Errors
Finally, the model sometimes fails to correctly identify word boundaries, leading to merged words or false splits. The following example illustrates two words being merged that should have remained separate:
Ground Truth: genau das dεa
Prediction: dasis dεa

5.2. Comparison with Existing Solutions

As discussed in the background and related work in Section 2, there is existing work on phoneme recognition using a variety of approaches. In this section, the results achieved in this research are compared with those reported in related studies. First, we examine relevant work that also employs self-supervised learning, review its results, and consider a frequently used benchmark, the TIMIT dataset.
Following this, we consider previous results achieved on the KidsTALC dataset. For this specific corpus, no prior work using self-supervised models exists, so we compare against models that do not rely on pretraining or self-supervised learning. We also compare our results with those achieved by trained experts transcribing the KidsTALC corpus into IPA.

5.2.1. Comparison with Self-Supervised Solutions

At the time of writing, no published study was found that used self-supervised learning for phoneme-level recognition on German-speaking children. Therefore, direct benchmarks for this specific target group do not exist. However, relevant studies using similar methods on other languages and datasets are discussed to provide context for the results of our research.
French Child Speech with WavLM
Medin et al. [30] investigated self-supervised phoneme recognition on French child speech and compared Wav2Vec2, HuBERT, and WavLM. Their training data consisted of an in-house corpus with about 13 h of read speech from French children aged 5 to 8 years. The best results were achieved with WavLM, reaching a phoneme error rate (PER) of 13.6%. Notably, WavLM outperformed all other models when transcribing isolated words and pseudo-words and also showed strong robustness under noise.
Compared to this work, our model achieved a PER of 14.3%, trained on only 7.5 h of spontaneous speech from German children and adults. The data we used was more variable and from a different domain, as it came from spontaneous, conversational speech. In contrast, Medin et al. used controlled, read-aloud material and a slightly larger dataset.
The results, however, are very similar: both studies achieve a PER of around 14%. This suggests that a PER of around 14% is a solid result for phoneme recognition in child speech. It is also worth noting that Medin et al. used only 35 different IPA symbols for their French transcriptions, while we used a more fine-grained set of 43 symbols, making the recognition task even more challenging.
Comparison with Common Benchmarks: The TIMIT Dataset
To better contextualize our results, it is helpful to compare them with established benchmarks in speech recognition. One frequently used benchmark is the TIMIT corpus, a well-known dataset for evaluating the performance of speech recognition models. It contains recordings from 630 adult speakers from different regions of the United States. Each speaker reads ten short sentences, designed to cover a wide range of English sounds and dialects. The dataset also provides detailed word- and phoneme-level transcriptions.
Although TIMIT consists of read English speech from adults, while we focus on spontaneous speech from German-speaking children, a comparison is still instructive. In their original Wav2Vec2 paper [24], Baevski et al. report a phoneme error rate between 7.4% and 8.3% on TIMIT, depending on the model size and the test data. More recent models such as HuBERT and WavLM are reported to have achieved even lower rates.
Our model achieves a PER of around 14%. Given that it was pretrained on cross-lingual data rather than specifically on German audio, and that the evaluation data consists of spontaneous child speech rather than clean adult recordings, this result can still be considered strong.
Looking at the WER, one of the best-trained Wav2Vec2 models reaches a WER of 8.1% on the TIMIT benchmark, as shown on the global leaderboard at https://paperswithcode.com (accessed on 11 December 2025). Compared to this top result, our model reaches a WER of 31.6%. While this number is considerably higher, it is reasonable given the major differences between the two settings. The TIMIT dataset contains read speech from adults in clean recording conditions, all in American English; as with the PER, this makes the recognition task easier. In contrast, our model deals with spontaneous speech from young children, which is often less clear, more varied, and harder to transcribe due to less data and natural differences between adult and child speech (see Section 2.2.2). In addition, the speech is in German, and the amount of labeled training data is much smaller. All of these factors make the task more challenging, so a higher error rate is expected. Still, the results indicate that both PER and WER could be improved further with more and better-quality data and additional model tuning.

5.2.2. Comparison with Previous KidsTALC Results

In their original paper [33], Rumberg et al. also trained a baseline phoneme recognition model using SpeechBrain on the KidsTALC dataset. Their system used Mel-spectrogram features processed by convolutional and recurrent layers, trained with a CTC loss function. Evaluated on the same corpus, the model achieved a PER of 26.18% when using only the KidsTALC data and a PER of 24.04% when adding samples from the Mozilla Common Voice dataset.
Compared with these baselines, our model represents a substantial improvement of nearly 10 percentage points. While the exact test set and data splits differ, the comparison is meaningful, since both models were evaluated on spontaneous speech from the same source corpus and used similar phoneme token sets.
The difference in performance again highlights the benefits of self-supervised pretraining. The Wav2Vec2 model learned general speech representations from large, unlabeled datasets and only needed fine-tuning on a small amount of clean, domain-specific data. In contrast, the baseline model from Rumberg et al. was trained from scratch, without leveraging pretraining. This comparison clearly shows the advantage of transfer learning.

5.2.3. Comparison with Human Performance

The KidsTALC paper by Rumberg et al. also includes a transcriber agreement study that provides an interesting reference for comparing model performance with human-level consistency. In that study, 144 utterances were independently transcribed phonetically by three trained annotators. The average phone error rate between human transcribers was 12.8%. The study showed that the experts often mapped certain phonemes differently to similar-sounding vowels (for example, e was sometimes marked as E). They also frequently disagreed on word endings, such as in the German words “und” and “jetzt”, where the final /t/ is often omitted in adult speech.
Looking at the analysis of our best-performing model (see Section 5.1.3), it reached a phoneme error rate (PER) of 14.3%, which is close to the level of agreement between human transcribers. This means that, for the dataset used, the model performs roughly as well as the trained annotators. The errors the model makes are also very similar to those made by human transcribers: for example, the model often mixes up similar-sounding phonemes such as /e/ and /ɛ/ and sometimes omits word endings. This is plausible and also hard to prevent, since the model was trained on the same transcriptions that the human transcribers created.

5.3. Addressing the Research Questions

5.3.1. Is Accurate Phoneme-Level Recognition Possible with Limited Clean Data?

Yes, accurate phoneme-level recognition is possible even with a small but clean dataset. The best-performing model was trained on only about 7.5 h of carefully prepared speech data from both children and adults. No data augmentation was used. Despite the limited size of the training set, the model achieved a PER of 14.3%, which is close to the 12.8% PER observed between trained human annotators in the KidsTALC dataset.
This strong result can be explained by the self-supervised pretraining of the Wav2Vec2 model. Before fine-tuning, the model already learned general speech features from hundreds of hours of unlabeled audio. This helped it adapt well to the phoneme recognition task, even with only a small amount of labeled training data.
These findings show that clean, domain-specific fine-tuning data can lead to good performance, even in low-resource settings. However, looking at benchmarks like TIMIT, it becomes clear that using more data, and especially pretraining on speech closer to the target domain (such as exclusively German speech), could improve the model’s accuracy further.

5.3.2. How Does the Trade-Off Between Data Quality and Quantity Affect Phoneme Recognition Performance in Children’s Speech?

The results show that data quality is more important than quantity when fine-tuning modern pretrained models for phoneme recognition in children’s speech. All models were evaluated on the same clean test set of child speech to ensure a fair comparison. The best results were achieved by the model trained on only about 7.5 h of clean child and adult data, without using any noisy or augmented samples.
Experiments with larger training sets that included more, but less clean, data or artificially augmented samples did not lead to better performance. In fact, these models showed higher error rates and performed worse on the held-back test set. This suggests that the model had already learned rich and generalizable speech representations during pretraining and therefore benefits more from a small amount of high-quality, domain-specific data during fine-tuning. Clean, well-prepared input gives the model the right signal to adapt its pretrained knowledge effectively, while simply increasing the dataset size without ensuring quality does not improve model performance.

6. Discussion

This chapter looks back at the main results and experiences from building and testing the speech recognition system aimed at children. It explains what worked well, what was difficult, and why these findings are important for helping children learn to read. The discussion covers both the technical side, such as how the model was trained and where it made mistakes, and the educational side, such as how this technology can support teachers and students in real life.
It also outlines ideas for future work and how this research can serve as a foundation for further experiments and tests, helping researchers and developers build on the current system and results.

6.1. System Analysis

This section analyzes the strengths and limitations of the developed system, considering both its technical performance and educational potential. It outlines what worked well and where future improvements are necessary.

6.1.1. Strengths

  • Recognition Accuracy with Limited Data: One of the system’s most notable achievements is its high phoneme recognition accuracy, despite being trained on a relatively small dataset. This success is largely due to the use of self-supervised learning with the Wav2Vec2-XLSR model, which was fine-tuned on a carefully selected and cleaned subset of children’s speech. By including both child and adult speech in the fine-tuning phase, the model was able to generalize well, including to speakers not seen during training. This shows that a combination of high-quality data and diversity can compensate for limited training volume when fine-tuning a strong pre-trained model.
  • Phoneme-Level Output and IPA Representation: Another technical strength of the system is its ability to produce phoneme-level transcriptions using the International Phonetic Alphabet (IPA). This allows it to detect subtle pronunciation errors that word-level models might overlook. In combination with a matching algorithm, the model could provide feedback on partial word production, sound substitutions or mispronounced endings. The use of IPA also makes the system adaptable to other languages or multilingual settings, improving its long-term flexibility.
  • Educational Value and Scalability: From an educational perspective, the system offers detailed and objective feedback that can assist teachers, researchers, or digital learning tools. It identifies which sounds individual children struggle with, supports tracking of progress over time, and helps recognize patterns in reading development. Since its accuracy approaches the level of human transcription, the system can serve as a scalable assessment assistant. Unlike a teacher, it can be used by multiple learners at once, making it practical for classroom settings or large-scale studies.

6.1.2. Limitations

  • Phoneme Confusions and Speech Variability: One challenge lies in distinguishing between similar-sounding phonemes, particularly consonant pairs that differ only in voicing, such as [b] vs. [p], as well as close vowel pairs such as [e] vs. [ε].
  • Speaker and Recording Variability: The model is also sensitive to variations in recording conditions. Background noise, microphone quality, and overall audio clarity can significantly influence performance. This may limit the reliability of the system in uncontrolled environments, such as homes or busy classrooms. Furthermore, the KidsTALC dataset used in training contains recordings collected under relatively uniform conditions, which may not generalize well to more diverse real-world settings.
  • Model Size and Deployment Constraints: A significant practical limitation is the model’s size. The fine-tuned Wav2Vec2 model used is approximately 1.3 GB, making it unsuitable for deployment on most mobile devices or tablets. Without further optimization, the model requires server-based processing, which complicates deployment in classrooms or homes.

6.2. Key Findings

To sum up, we successfully applied self-supervised speech recognition to the task of tracking children’s reading and achieved notable technical outcomes. Most significantly, the fine-tuned Wav2Vec2 model reached a PER of approximately 14.3% on the test set. This represents a substantial improvement of around 12 percentage points compared to the traditional baseline system, which achieved a PER of 26% without self-supervised pretraining. Notably, this performance was achieved using only 7.5 h of transcribed training data, highlighting the model’s ability to generalize from limited labeled input by using knowledge learned during pretraining on large unlabeled audio.
Beyond the performance metrics, we demonstrated that self-supervised learning is a viable approach for tracking children’s speech and reading behavior. These findings support the broader goal stated in the introduction: to explore how automatic systems might assist and guide children during reading aloud. While no complete prototype was implemented, the experiments show that accurate phoneme-level analysis is technically feasible and can form the basis for future interactive applications that track children while they read aloud. Although challenges remain, such as optimizing the model for real-time use and integrating it into a mobile platform, the results provide a solid starting point for a prototypical implementation and further development.

6.3. Future Work and Improvements

6.3.1. More Data and Variability

A clear next step is to expand the dataset for training and evaluation, as performance is limited by the small amount and variety of data. Collecting more recordings of German children reading aloud—through school collaborations or crowd-sourcing via educational apps—would improve robustness. A larger corpus would expose the model to more pronunciations, speaking styles, and variability in accents, ages, and reading materials. Training on children from different regions and language backgrounds would make the model more inclusive and accurate. Including younger voices could help capture early readers’ error patterns, while older children’s speech would provide more fluent examples for contrast.

6.3.2. Utilize Data from Similar Languages

Another approach to address the data limitations could be to add child speech data from related Germanic languages. Languages such as Dutch, Flemish, or Norwegian share many sounds with German, and thanks to the use of the International Phonetic Alphabet (IPA), these common phonemes can be represented consistently across languages.
Incorporating data from these languages could help the model learn a broader range of pronunciation patterns for similar phonemes. This may improve its ability to deal with variations in speech, accents, or disfluency and make it more robust—especially when German training data is too limited or lacks diversity. Cross-lingual training or pretraining on related languages is an approach that has shown promise in other speech tasks and would be worth exploring here.

6.3.3. Reading Mistake Classification

Another interesting direction for future work is to move beyond phoneme recognition and explore the direct classification of reading mistakes. The KidsTALC corpus we used includes annotations for different types of errors, such as phoneme substitutions, missing words or incorrect pronunciations. While we focused on recognizing phoneme sequences, future research could train a model to detect and classify reading errors directly based on these annotated labels.
This approach would make it possible to skip the intermediate step of phoneme transcription and instead output structured feedback about the types of mistakes a child makes. This could lead to more informative and actionable insights for teachers and learners. For example, instead of just showing that a phoneme was misread, the system could indicate whether the child skipped a word, mispronounced a vowel, or added an extra sound.

6.3.4. Fine-Grained Error Analysis

Another important area for future work is to conduct a fine-grained error analysis of the system’s performance and use those insights to guide improvements. In the current work, some analysis has already been done but there are much more insights to be gained. Going forward, this could be taken further by carefully analyzing the errors the system makes on a phoneme-by-phoneme and word-by-word basis. By identifying patterns in these errors, researchers can pinpoint why they occur and devise solutions. For instance, if it’s confirmed that voicing distinctions (like /d/ vs. /t/) are a frequent source of mistakes, one could explore ways to improve the model’s handling of those sounds. This might involve collecting more training examples of often confused phoneme pairs, or adjusting the training objective to pay more attention to those specific distinctions.

6.3.5. Extended Evaluation on Unseen Data

While the system was evaluated on a held-out portion of the KidsTALC dataset, broader evaluation on unseen data is crucial for understanding how well it would perform in real-world applications. Future work should test the model on completely new datasets or in live settings that were not part of the original training and testing regime. One target for extended evaluation is to gather a set of children reading prepared texts aloud (since the ultimate use-case is tracking reading).
This could involve, for example, giving a group of students a short story or a list of sentences to read while recording their voice. The model’s phoneme transcriptions could then be compared to the known text to see how accurately it tracks the reading. Such an evaluation would reveal how the system handles more formal reading tasks as opposed to conversational speech.

6.3.6. Streaming ASR Development

To make the system more interactive and useful in real-time situations like classrooms, future work should focus on streaming speech recognition. Right now, the model only processes audio after the child has finished reading. A streaming version would allow the system to listen continuously and update the transcription as the child speaks. This would make it possible to give immediate feedback during reading.
Since Wav2Vec2 is based on a Transformer architecture, it is not naturally designed for streaming. However, one solution could be to split the audio into small overlapping chunks that the model processes in sequence, while carefully merging the outputs to keep context intact.
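The following is a minimal sketch of such a chunk-based approach; the window and overlap sizes are assumed values, and a real implementation would need a smarter strategy for merging the overlapping outputs than the naive concatenation shown here.

```python
import numpy as np

def stream_chunks(audio, sr=16000, chunk_s=4.0, overlap_s=1.0):
    # Yield overlapping windows of the incoming audio signal
    step = int((chunk_s - overlap_s) * sr)
    size = int(chunk_s * sr)
    for start in range(0, max(len(audio) - size, 0) + 1, step):
        yield audio[start:start + size]

def transcribe_stream(audio, transcribe_fn, sr=16000):
    # transcribe_fn is a placeholder wrapping the fine-tuned model's inference
    partial = []
    for chunk in stream_chunks(audio, sr=sr):
        partial.append(transcribe_fn(chunk))
    return partial  # naive concatenation of per-chunk phoneme outputs
```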
Real-time feedback would make the tool more engaging and helpful for children. For example, the system could highlight correctly read words in green and mistakes in red, as they happen. It could even give gentle prompts like “try that word again” immediately, which may be more effective for learning than delayed feedback.

6.3.7. Model Compression and Deployment

Another important step, which was attempted but not achieved in this work, is to make the model small and fast enough to run on everyday devices such as tablets or school computers. The current Wav2Vec2 model is large and computationally demanding, which makes real-world use difficult.
There are several ways to reduce the model size. One is quantization, which stores the model weights using lower-precision numbers to save memory and speed up processing. Another is pruning, where less important parts of the model are removed. A third method is knowledge distillation, where a smaller “student” model learns to mimic the behavior of the large model; this smaller version can be much faster and use less memory while staying close in accuracy. Future work could, for example, train such a student model and aim for comparable performance. Beyond optimizing the model itself, the next step would be to build and test an actual app. A minimal sketch of one of these options is given below.
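The sketch below shows post-training dynamic quantization with PyTorch; this is only one of the options listed above, the file path is hypothetical, and the exact size and accuracy impact would need to be re-evaluated on the fine-tuned model.

```python
import torch

# Post-training dynamic quantization of the fine-tuned model's linear layers
quantized_model = torch.quantization.quantize_dynamic(
    model,              # the fine-tuned Wav2Vec2ForCTC model
    {torch.nn.Linear},  # layer types to quantize to 8-bit integers
    dtype=torch.qint8,
)
torch.save(quantized_model.state_dict(), "wav2vec2_quantized.pt")  # hypothetical output path
```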
The following chapter summarizes the main results, draws key lessons learned from the project, and reflects on their significance for the field of reading support and speech technology.

7. Conclusions

In conclusion, we set out to test whether self-supervised learning methods could be applied to the task of tracking children as they read aloud. The results show that this approach is promising and performs well. By fine-tuning a Wav2Vec2 model from the Hugging Face library on a relatively small dataset of German children’s speech, the system achieved a phoneme error rate (PER) of around 14%. This is a strong result, especially when compared with existing benchmarks and related work, and it suggests that self-supervised models can generalize well to child speech even in low-resource scenarios. Nevertheless, with more diverse and higher-quality training data, further improvements in accuracy appear possible, as suggested by benchmarks on English datasets.
To build and evaluate the system, a full machine learning pipeline was developed. This included the use of Amazon S3 for external data storage and Weights & Biases for experiment tracking, versioning, and visualization. This infrastructure proved robust and scalable, enabling reproducible results throughout the project. Various data augmentation techniques, such as speed perturbation, pitch shifting, and vocal tract length normalization, were also applied and tested to help the model generalize better.
Despite the promising performance, the model still has several limitations. Most notably, it currently does not support streaming inference, which is essential for real-time applications such as classroom or home use. Additionally, it was not yet possible to deploy the model efficiently on resource-limited devices such as smartphones or tablets, which would be desirable in educational settings, not least because of privacy concerns. These technical challenges point to clear areas for future development that need to be addressed before the system can be used in production.
Perhaps most importantly, this work lays the foundation for a planned children’s reading app designed to support and monitor young readers as they read aloud. Such an application could provide teachers with valuable insights into reading progress and common difficulties, making it possible to offer more targeted and personalized support.

Author Contributions

Conceptualization, P.O.; methodology, P.O.; software, P.O.; resources, E.S. and M.K.; data curation, P.O.; writing—original draft preparation, P.O.; writing—review and editing, E.S., J.K. and S.S.; supervision, E.S.; project administration, E.S.; funding acquisition, E.S. and M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The KidsTALC dataset was kindly made available for this research by Leibniz University Hannover and may be obtained from them upon request only.

Acknowledgments

Open Access Funding by the University of Applied Sciences Upper Austria.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, Y.J.; Sahakian, B.J.; Langley, C.; Yang, A.; Jiang, Y.; Kang, J.; Zhao, X.; Li, C.; Cheng, W.; Feng, J. Early-initiated childhood reading for pleasure: Associations with better cognitive performance, mental well-being and brain structure in young adolescence. Psychol. Med. 2024, 54, 359–373. [Google Scholar] [CrossRef] [PubMed]
  2. ORF Science. Lesende Kinder Werden Zufriedene Jugendliche. 2023. Available online: https://science.orf.at/stories/3219992/ (accessed on 11 December 2025).
  3. Der Standard. PIRLS-Studie: Jedes fünfte Kind in Österreich hat Probleme beim Lesen. 2023. Available online: https://www.derstandard.at/story/2000146458792/ (accessed on 11 December 2025).
  4. Statista. Mediennutzung von Kindern-Statistiken und Umfragen. 2024. Available online: https://de.statista.com/themen/2660/mediennutzung-von-kindern/ (accessed on 11 December 2025).
  5. The Access Center. Early Reading Assessment: A Guiding Tool for Instruction. 2024. Available online: https://www.readingrockets.org/topics/assessment-and-evaluation/articles/early-reading-assessment-guiding-tool-instruction (accessed on 9 December 2024).
  6. Professional Development Service for Teachers (PDST). Running Records. 2015. Available online: https://www.pdst.ie/sites/default/files/Running%20Records%20Final%2015%20Oct.pdf (accessed on 11 December 2025).
  7. Bildung durch Sprache und Schrift (BiSS). Salzburger Lesescreening für die Schulstufen 2–9 (SLS 2–9). 2023. Available online: https://www.biss-sprachbildung.de/btools/salzburger-lesescreening-fuer-die-schulstufen-2-9/ (accessed on 9 April 2025).
  8. SoapBox Labs. Children’s Speech Recognition & Voice Technology Solutions. Available online: https://www.soapboxlabs.com/ (accessed on 9 April 2025).
  9. Hello Ello. Read with Ello. Available online: https://www.ello.com/ (accessed on 9 April 2025).
  10. Ravaglia, R. Microsoft Reading Coach—Seamless AI Support For Students and Teachers. Forbes. 2024. Available online: https://www.forbes.com/sites/rayravaglia/2024/01/18/microsoft-reading-coach-seamless-ai-support-for-students-and-teachers/ (accessed on 11 December 2025).
  11. Google. Read Along. Available online: https://readalong.google.com/ (accessed on 9 April 2025).
  12. Basak, S.; Agrawal, H.; Jena, S.; Gite, S.; Bachute, M.; Pradhan, B.; Assiri, M. Challenges and limitations in speech recognition technology: A critical review of speech signal processing algorithms, tools and systems. CMES-Comput. Model. Eng. Sci. 2023, 135, 1053–1089. [Google Scholar] [CrossRef]
  13. Yeung, G.; Alwan, A. On the difficulties of automatic speech recognition for kindergarten-aged children. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  14. D’Arcy, S.; Russell, M.J. A comparison of human and computer recognition accuracy for children’s speech. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 2197–2200. [Google Scholar]
  15. Bhardwaj, V.; Ben Othman, M.T.; Kukreja, V.; Belkhier, Y.; Bajaj, M.; Goud, B.S.; Rehman, A.U.; Shafiq, M.; Hamam, H. Automatic speech recognition (asr) systems for children: A systematic literature review. Appl. Sci. 2022, 12, 4419. [Google Scholar] [CrossRef]
  16. Russell, M.J.; D’Arcy, S. Challenges for computer recognition of children’s speech. SLaTE 2007, 108, 111. [Google Scholar]
  17. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; Volume 9. [Google Scholar] [CrossRef]
  18. Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 43–49. [Google Scholar] [CrossRef]
  19. Yadav, M.; Alam, M.A. Dynamic time warping (dtw) algorithm in speech: A review. Int. J. Res. Electron. Comput. Eng. 2018, 6, 524–528. [Google Scholar]
  20. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  21. Li, T. Study on a CNN-HMM approach for audio-based musical chord recognition. J. Phys. Conf. Ser. 2021, 1802, 032033. [Google Scholar] [CrossRef]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  23. Kushwaha, N. A Basic High-Level View of Transformer Architecture & LLMs from 35000 Feet. Python Plain Engl. 2023. Available online: https://python.plainenglish.io/a-basic-high-level-view-of-transformer-architecture-llms-from-35000-feet-c036dd2f7c25 (accessed on 11 December 2025).
  24. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 30; Curran Associates Inc.: Red Hook, NY, USA, 2020; pp. 12449–12460. [Google Scholar]
  25. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  26. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  27. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  28. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR. pp. 28492–28518. [Google Scholar]
  29. Mostert, N. Implementing Wav2Vec 2.0 into an Automated Reading Tutor. Master’s Thesis, Utrecht University, Utrecht, The Netherlands, 2024. [Google Scholar]
  30. Medin, L.B.; Pellegrini, T.; Gelin, L. Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning. In Proceedings of the 25th Interspeech Conference (Interspeech 2024), Kos Island, Greece, 1–5 September 2024; Volume 9, pp. 5168–5172. [Google Scholar] [CrossRef]
  31. Jain, R.; Barcovschi, A.; Yiwere, M.; Corcoran, P.; Cucu, H. Adaptation of Whisper models to child speech recognition. arXiv 2023, arXiv:2307.13008. [Google Scholar] [CrossRef]
  32. Fan, R.; Shankar, N.B.; Alwan, A. Benchmarking Children’s ASR with Supervised and Self-supervised Speech Foundation Models. arXiv 2024, arXiv:2406.10507. [Google Scholar]
  33. Rumberg, L.; Gebauer, C.; Ehlert, H.; Wallbaum, M.; Bornholt, L.; Ostermann, J.; Lüdtke, U. kidsTALC: A Corpus of 3- to 11-year-old German Children’s Connected Natural Speech. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 5160–5164. [Google Scholar] [CrossRef]
  34. Ardila, R.; Branson, M.; Davis, K.; Kohler, M.; Meyer, J.; Henretty, M.; Morais, R.; Saunders, L.; Tyers, F.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; European Language Resources Association: Reykjavik, Iceland, 2020; pp. 4218–4222. [Google Scholar]
  35. International Phonetic Association. The International Phonetic Alphabet (2020 Revision). Available online: https://www.internationalphoneticassociation.org/IPAcharts/IPA_chart_orig/pdfs/IPA_Kiel_2020_full.pdf (accessed on 21 October 2025).
  36. Bernard, M.; Titeux, H. Phonemizer: Text to Phones Transcription for Multiple Languages in Python. J. Open Source Softw. 2021, 6, 3958. [Google Scholar] [CrossRef]
  37. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  38. Biewald, L. Experiment Tracking with Weights and Biases. 2020. Available online: https://wandb.ai/site/experiment-tracking/ (accessed on 11 December 2025).
Figure 1. Types of speech recognition techniques as described by Basak et al. [12].
Figure 2. Example of how an RNN architecture processes a sentence [23].
Figure 3. Example of how the transformer architecture processes a sentence [23].
Figure 4. Flowchart of the proposed method.
Figure 5. Phoneme occurrence in the KidsTALC dataset in thousands [33].
Figure 6. Example of the IPA transcription of the word “Schule” vs. “Sule”.
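For reference, a transcription like the one in Figure 6 could be obtained with the Phonemizer library [36]. The following is a minimal sketch, assuming the eSpeak NG backend for German, which is not necessarily the configuration used in this work.

```python
# Minimal sketch: convert the target word "Schule" and the (presumably
# misread) form "Sule" to IPA with the Phonemizer library [36].
# The eSpeak NG backend for German is an assumption and must be
# installed separately on the system.
from phonemizer import phonemize

words = ["Schule", "Sule"]
ipa = phonemize(words, language="de", backend="espeak", strip=True)
print(dict(zip(words, ipa)))  # e.g. "Schule" -> "ʃuːlə"
```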
Figure 7. Activity diagram of the applied preprocessing steps.
Figure 8. “Childrenization” steps of adult speech data.
Figure 9. Screenshot of the CER (PER) for the three models during training.
Figure 10. Screenshot of the WER for the three models during training.
Table 1. Overview of the KidsTALC recordings and their age distribution.

Age | Sex | Duration in Minutes | Number of Speakers
3–4 years | f | 101.3 | 6
3–4 years | m | 85.3 | 4
5–6 years | f | 176.0 | 8
5–6 years | m | 166.1 | 8
7–8 years | f | 36.6 | 2
7–8 years | m | 58.3 | 3
9–10 years | f | 40.3 | 3
9–10 years | m | 85.9 | 5
Table 2. Overview of the audio augmentation parameters applied to the speech data.

Augmentation Parameter | Adult | Child | Purpose
Pitch Shift (in semitones) | 1.0–3.0 | −1.0–1.0 | Simulate child-like speech and add variation
VTLN Factor (in percent) | 120–150% | 95–105% | Adapt formant profile to simulate a shorter vocal tract
Speed Factor (in percent) | 85–95% | 90–110% | Reflect slower speech rate or natural variability
Augmentation Ratio | 1:1 | 2:1 | Increase training diversity and robustness
Amount of Augmented Samples | 6514 | 14,452 | –
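As an illustration of how the pitch- and speed-related ranges in Table 2 could be realized, the sketch below applies the adult-column values to a single recording with librosa. This is not the authors’ implementation, and formant (VTLN) shifting is omitted because it typically requires a dedicated tool such as Praat.

```python
# Illustrative "childrenization" of an adult recording using the Table 2
# ranges for pitch and speed (adult column); formant/VTLN shifting omitted.
import random

import librosa
import soundfile as sf

def augment_adult_sample(in_path: str, out_path: str) -> None:
    y, sr = librosa.load(in_path, sr=16000)  # Wav2Vec2 expects 16 kHz audio

    # Raise the pitch by 1.0-3.0 semitones to sound more child-like.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=random.uniform(1.0, 3.0))

    # Slow playback to 85-95% of the original speed.
    y = librosa.effects.time_stretch(y, rate=random.uniform(0.85, 0.95))

    sf.write(out_path, y, sr)
```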
Table 3. Hyper-parameters used for model training.

Hyperparameter | Value | Description
Learning rate | 3 × 10⁻⁵ | Controls how big the steps are when the model learns. A small learning rate means the model learns slowly but more carefully.
Batch size | 8 | How many examples the model looks at before updating itself. Smaller batches use less memory but can make learning a bit noisier, since each batch gives a less complete picture of the whole dataset.
Number of epochs | 45 | Number of complete passes through the training dataset.
Evaluation steps | 500 | Specifies how often (in number of training steps) the model is evaluated on the validation set.
Warmup ratio | 0.1 | Fraction of total training steps used to gradually increase the learning rate from zero, which helps stabilize early training.
Learning rate scheduler | Linear | Gradually decreases the learning rate linearly after the warm-up phase, which helps stabilize training as convergence is approached.
Optimizer | AdamW | The optimizer updates the model weights to minimize the loss. AdamW builds on Adam by combining adaptive learning rates with decoupled weight decay to help prevent overfitting.
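For readers who wish to reproduce a comparable setup, the Table 3 values map roughly onto the training arguments of the Hugging Face Transformers library [37]. The following sketch is illustrative only; the output directory and the step-based evaluation strategy are assumptions rather than settings reported in this work.

```python
# Sketch of the Table 3 hyper-parameters expressed as Hugging Face
# TrainingArguments [37]; output_dir is a hypothetical path.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-ipa-kids",   # hypothetical output path
    learning_rate=3e-5,               # small, careful update steps
    per_device_train_batch_size=8,    # examples per weight update
    num_train_epochs=45,              # full passes over the training data
    eval_strategy="steps",            # called evaluation_strategy in older releases
    eval_steps=500,                   # validate every 500 training steps
    warmup_ratio=0.1,                 # 10% of steps for learning-rate warm-up
    lr_scheduler_type="linear",       # linear decay after warm-up
    optim="adamw_torch",              # AdamW optimizer
)
```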
Table 4. Overview of the three model variants and their training data characteristics.

Data Type | Mixed Data | Child-Only | Combined Clean
Kids Data | × | × | ×
Kids Augmented | × | × | 
Adult Data | × |  | ×
Adult Augmented | × |  | 
Hard to understand | × | × | 
Total Samples | ~34,000 | ~21,000 | ~10,000
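The three configurations in Table 4 amount to different concatenations of the preprocessed subsets. A minimal sketch with the Hugging Face datasets library is shown below; the subset variables are hypothetical placeholders for the actual preprocessed splits.

```python
# Hypothetical assembly of the three Table 4 configurations with the
# Hugging Face datasets library; the toy subsets below merely stand in
# for the real preprocessed audio/IPA splits.
from datasets import Dataset, concatenate_datasets

kids      = Dataset.from_dict({"source": ["kids"] * 3})
kids_aug  = Dataset.from_dict({"source": ["kids_aug"] * 3})
adult     = Dataset.from_dict({"source": ["adult"] * 2})
adult_aug = Dataset.from_dict({"source": ["adult_aug"] * 2})
hard      = Dataset.from_dict({"source": ["hard"] * 1})

mixed_data     = concatenate_datasets([kids, kids_aug, adult, adult_aug, hard])
child_only     = concatenate_datasets([kids, kids_aug, hard])
combined_clean = concatenate_datasets([kids, adult])  # no augmentation, no hard-to-understand clips
```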
Table 5. Performance of the three Wav2Vec2-based models.

Model | PER | WER | Data Size | Training Duration
Model 1: Mixed Data | 28.7% | 89.1% | ≈25.5 h | 10 h 12 min
Model 2: Child-Only Data | 23.5% | 91.3% | ≈16 h | 6 h 17 min
Model 3: Clean Data | 14.3% | 31.6% | ≈7.5 h | 4 h 5 min
Table 6. Summary of PER and WER progression during training for the three models.

Model | PER | WER Trend | Key Observations
Clean Data | 11.8% | Below 30% | Best overall performance; steady learning; sharp WER drop once PER < 20%.
Child-Only Data | 25% | Moderate improvement | Fast early learning but plateaus; worse than clean data.
Mixed Data | 29% | High throughout | Slow learning; large dataset but augmentation not helpful; highest PER and WER.
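Both PER and WER are edit-distance-based error rates. A minimal sketch of how they could be computed with the jiwer library on space-separated phoneme strings is given below; the example transcriptions are invented, and jiwer is an illustrative choice rather than the evaluation code used in this work.

```python
# Illustrative PER computation with jiwer on space-separated phoneme
# strings; the transcriptions below are invented examples.
import jiwer

reference  = ["ʃ uː l ə", "h aʊ s"]   # ground-truth IPA sequences
hypothesis = ["s uː l ə", "h aʊ s"]   # model output

# With one phoneme per token, jiwer.wer() yields the phoneme error rate.
per = jiwer.wer(reference, hypothesis)
print(f"PER: {per:.1%}")  # 1 substitution out of 7 phonemes ≈ 14.3%
```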
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
