1. Introduction
Mispronunciation detection and diagnosis (MD/MDD) is a core component in computer-assisted pronunciation training (CAPT) and speech-enabled language learning systems. The classical pipeline treats learner speech as a noisy realization of a canonical phone sequence, then detects mismatches via forced alignment, posterior-based scoring (e.g., goodness of pronunciation, GOP), or, more recently, end-to-end neural models that predict mispronunciation labels at the phone or segment level [1,2,3,4,5,6,7,8,9,10,11,12].
A long-standing but often under-discussed assumption in MD research is the existence of a deterministic gold standard label: for each phone token, there is a correct label (correct/mispronounced) that a human annotator can reveal. There are two common views of this “gold standard”, which we call Native Norm and Target Norm, respectively.
Many MD systems are built under a Native Norm assumption: learner speech is evaluated by comparing it with a canonical L1 pronunciation lexicon, and deviations are treated as errors. Representative examples include forced-alignment-based goodness-of-pronunciation (GOP) scoring [1] and graph-based recognition variants such as the extended recognition network (ERN) [2].
Despite its engineering convenience, a strict Native Norm can be problematic in practice: it may penalize systematic accent patterns and raise fairness concerns. In addition, speech perception research indicates that listener judgments are adaptive rather than fixed to an L1 template; listeners rapidly adjust to the statistics of the talker or target population [13,14,15,16,17]. Motivated by this evidence, we introduced the Target Norm in our previous study [18], operationalizing population-level adaptation by adapting a pre-trained model to the target speech. We observed improved agreement between model outputs and human annotations after such adaptation.
Both the Native Norm and Target Norm treat the MD task from a “production” perspective, i.e., they measure to what degree the produced phones or phrases deviate from a reference pronunciation: the canonical native pronunciation (Native Norm) or the population-average pronunciation of the target speakers (Target Norm). However, MD is essentially a human perception task and goes far beyond an objective measurement of deviation in the speech signal. It involves a complex perceptual and psychological process, particularly when a pronunciation lies near a phone-category boundary. This leads to high intra-annotator uncertainty and inter-annotator variation. We argue that this “subjectivity” is the nature of MD (and of speech assessment in general).
Therefore, we argue that an ideal MD system should simulate human perceptual judgments rather than solely measure signal deviation, if its final goal is to help L2 speakers speak in a way that others can understand, rather than merely sound like a canonical pronunciation. For simplicity, we call this idea the “Perception Norm”.
To support this argument, we built a new dataset, UY/CH-CHILD-MA, that contains Chinese words and phrases spoken by Uyghur children. Compared to other datasets such as L2-ARCTIC [19], UY/CH-CHILD-MA is challenging due to both L2-accented pronunciation and the additional variability of child speech [20]. We hypothesize that these settings elicit stronger perceptual and psychological effects in human annotation. Moreover, UY/CH-CHILD-MA is annotated by four annotators, enabling the analysis of inter-annotator variation.
Following our previous work [18], we employed a Transformer-based end-to-end model to simulate human perceptual behavior. We first pre-trained the model on native speech and then fine-tuned it to align with human annotators. The experiments demonstrate that fine-tuning is crucial for aligning the model with a target annotator, and that models aligned with different annotators behave distinctively. These observations provide strong evidence for Perception Norm.
We finally propose an ensemble paradigm for the MD task. Specifically, it first trains an individual MD model for each annotator, and uses these models to compose an “expert committee” whose outputs are aggregated to produce a group decision. This new paradigm isolates the perception process of individual annotators from the social decision process of annotator groups, potentially providing a more flexible and faithful simulation of the annotation process.
Our contributions are threefold:
Perception Norm. We propose Perception Norm, which treats MD as a subjective perceptual judgment process rather than an objective deviation measurement. This perspective aligns with how labels are produced in practice: human annotators deploy perceptual and psychological resources to judge pronunciations, and their decisions are inherently uncertain and subjective. We argue that an ideal MD system should simulate this uncertain and subjective process of human perception as much as possible, which is the core idea of Perception Norm.
Computational evidence for Perception Norm. We conduct a series of experiments on UY/CH-CHILD-MA, a challenging accented child speech dataset with multi-annotator labels, and provide computational evidence supporting Perception Norm by showing that annotator-specific models can be learned accurately and exhibit distinct behaviors.
A new ensemble paradigm for pronunciation assessment. We propose a new ensemble paradigm for MD (and more general pronunciation assessment) that isolates individual assessment from social aggregation, providing a flexible and efficient way to simulate the underlying process of human annotation and decision making.
In the rest of this paper, we first briefly discuss the related work in Section 2 and then introduce Perception Norm in Section 3. In Section 4, we introduce our new dataset, UY/CH-CHILD-MA. Section 5 presents the proposed MD model, Section 6 describes the experimental settings and results, Section 7 discusses the implications and limitations of Perception Norm, and the paper is concluded in Section 8.
3. Perception Norm
To facilitate a quick comparison, Table 1 summarizes Native Norm, Target Norm, and Perception Norm side-by-side in terms of (i) model inputs, (ii) supervision labels, (iii) training data, and (iv) the pronunciation standard represented by each norm. As illustrated in Figure 1, the three norms differ in the reference region and the implied decision boundary in the acoustic/phonetic space.
3.1. A Formal Notation
Let x be an acoustic segment aligned to an intended phone p within a word/utterance context c (including neighboring phones, lexical identity, and prosodic cues). Let r denote a listener/annotator. Each annotator r produces a binary label y_r ∈ {0, 1} indicating whether the phone is mispronounced.

Perception Norm treats y_r as an observable consequence of (i) a perceptual and psychological process parameterized by θ_r and (ii) a decision process with an annotator-specific threshold τ_r. One minimal form is a signal-detection style model:

y_r = 1[ ℓ(x, p, c; θ_r) < τ_r ],

where ℓ measures the goodness of the pronunciation (x, p, c) perceived by the annotator r (e.g., as a posterior probability), and τ_r captures annotator severity/leniency.

Note that θ_r and τ_r can be interpreted computationally as parameters of annotator-specific decision models and can be optimized by annotator behavior data, i.e., the MD labels.
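As a deliberately minimal illustration, the annotator-specific decision rule can be sketched in a few lines of Python. The function name `annotator_decision` and the numeric scores below are our own illustrative choices, not part of the paper's implementation.

```python
def annotator_decision(goodness: float, tau_r: float) -> int:
    """Signal-detection-style decision for one annotator r.

    `goodness` plays the role of the perceived score ell(x, p, c; theta_r);
    `tau_r` is the annotator-specific severity/leniency threshold.
    Returns 1 (mispronounced) when perceived goodness falls below tau_r.
    """
    return int(goodness < tau_r)

# A lenient annotator (low tau_r) flags fewer phones than a strict one.
scores = [0.2, 0.45, 0.8]
lenient = [annotator_decision(s, tau_r=0.3) for s in scores]  # -> [1, 0, 0]
strict = [annotator_decision(s, tau_r=0.6) for s in scores]   # -> [1, 1, 0]
```

Raising τ_r flags more phones, which is exactly the severity/leniency effect the notation is meant to capture.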
Relation to the model implementation. The signal-detection-style notation above is intended as a conceptual abstraction of listener-specific perception. In our experiments, the Transformer-based MD model (Section 5) implements a flexible scoring function that maps (x, p, c) to a phone-level logit (and posterior) for each canonical position; the listener-dependent component θ_r is operationalized implicitly by fine-tuning the model using labels from annotator r. The decision threshold τ_r is instantiated at inference time via operating-point selection and, when combining multiple annotator-specific models, via post hoc calibration and aggregation (Section 6.7.2).
3.2. Why Perception Norm?
Perception Norm has several distinct features that make it novel and attractive:
(1) Perception Norm does not assume any canonical pronunciation; instead, it learns from the perceptions of human annotators. This distinguishes it from Native Norm and Target Norm.
(2) Perception Norm does not try to learn aggregated annotations; instead, it learns the perception of a single annotator. This distinguishes it from traditional end-to-end learning methods that are trained on pooled or aggregated labels.
3.2.1. From Production Deviation to Perception Simulation
Both Native Norm and Target Norm assume a canonical pronunciation representation, parameterized by Λ, and the decision can be formulated as

ŷ = 1[ f(x, p, c; Λ) > τ ],

where the evaluation function f measures the deviation of x from the canonical pronunciation, parameterized by Λ, and τ is a pre-defined threshold (possibly context-dependent). For Native Norm, Λ represents the native pronunciation system, while for Target Norm, Λ represents the pronunciation system of the L2 population.
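The production-view decision reduces to thresholding a deviation score. One common instantiation, sketched below under our own assumptions, takes f as a GOP-style negative log posterior of the canonical phone; the function name and the τ value are illustrative.

```python
import math

def deviation_decision(post_p: float, tau: float) -> int:
    """Production-view decision shared by Native Norm and Target Norm.

    `post_p` is a model posterior of the canonical phone p given the
    segment (a stand-in for the reference model Lambda);
    f(x, p, c; Lambda) = -log post_p is a GOP-style deviation measure.
    Flag a mispronunciation when the deviation exceeds the preset tau.
    """
    f = -math.log(max(post_p, 1e-12))  # guard against log(0)
    return int(f > tau)

# High posterior -> small deviation -> accepted; low posterior -> flagged.
assert deviation_decision(0.9, tau=1.0) == 0   # -log 0.9 ~= 0.105
assert deviation_decision(0.1, tau=1.0) == 1   # -log 0.1 ~= 2.303
```

Note that τ here is a global, listener-independent threshold, which is precisely what distinguishes this formulation from the annotator-specific τ_r of Perception Norm.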
Both Native Norm and Target Norm assume an objective measure, i.e., they do not explicitly model the inner perception process of human listeners, even though these listeners are the communication targets and the evaluators in pronunciation assessment.
In fact, speech assessment is a complex perceptual and psychological process, rather than merely an objective measurement of how far a produced speech signal departs from a reference pronunciation. It depends on a listener’s linguistic background and experience. As an annotator, the assessment also depends on error tolerance, personal pronunciation habits, and individual preference.
If we believe that human MD behavior is the reference, then MD (and pronunciation assessment in general) should not be reduced to a simple distance measurement from an “accepted canonical pronunciation”. A more ideal MD system should model the complex perceptual and psychological processes of human annotators. Therefore, the central goal of MD is to simulate how human annotators evaluate pronunciations—one of the central ideas of Perception Norm.
3.2.2. From End-to-End Learning to Perception Simulation
Modern MD methods are often based on end-to-end models. In this setting, the model learns to predict MD labels y from the input x and context c, where the model is parameterized by θ. Formally, this can be written as

ŷ = arg max_y P(y | x, c; θ).
At first glance, this formulation looks similar to Perception Norm. The key difference, however, is that an end-to-end approach typically learns whatever labels the dataset provides, without explicitly considering what those labels represent. In contrast, Perception Norm emphasizes learning the perception process of individual annotators, and thus uses annotations from a single annotator at a time.
This seemingly trivial difference is significant not only conceptually but also in practice. In fact, most datasets are labeled by multiple annotators, and final labels are often derived by aggregation, which (1) may hide substantial variability among annotators; and (2) introduces a social mechanism (voting/averaging). As a result, a model trained on aggregated labels may learn a complex process that involves both individual perceptual/psychological activities and the social aggregation mechanism.
Learning “individual” annotators, as Perception Norm advocates, has several advantages. First, it learns a single annotator and is therefore not directly troubled by inter-annotator variation, reducing the complexity of the learning problem. Second, it learns only the perception/annotation process, without entangling it with the aggregation process, which further simplifies learning. Finally, it naturally supports an ensemble approach, which first learns a group of MD models, each representing one annotator. Once learned, these individual models can form an expert committee that uses a human-defined aggregation rule (e.g., majority voting). To change the system behavior, one can simply modify the aggregation rule without retraining the individual models.
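A minimal sketch of such a committee, under our own illustrative assumptions: each annotator-specific model is represented only by its output posterior, and the aggregation rule is a swappable, human-defined function.

```python
from statistics import mean

def committee_decide(model_probs, rule="majority", threshold=0.5):
    """Aggregate per-annotator model posteriors into a group decision.

    `model_probs`: mispronunciation posteriors, one per annotator-specific
    model (hypothetical values). The aggregation rule is the human-defined
    social mechanism; changing it requires no retraining.
    """
    if rule == "majority":
        votes = [p > threshold for p in model_probs]
        return int(sum(votes) * 2 > len(votes))
    if rule == "average":
        return int(mean(model_probs) > threshold)
    raise ValueError(f"unknown rule: {rule}")

probs = [0.7, 0.4, 0.6]                           # three annotator models
assert committee_decide(probs, "majority") == 1   # votes: 1, 0, 1
assert committee_decide(probs, "average") == 1    # mean ~= 0.567
```

Swapping `rule="majority"` for `rule="average"` (or any other aggregation) changes the committee's behavior while the individual models stay fixed, which is the flexibility argued for above.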
4. UY/CH-CHILD-MA Dataset
In previous work, we designed UY/CH-CHILD, a Mandarin speech corpus produced by Uyghur children (4–12 years old), for studying accented child speech and pronunciation modeling [48]. In this work, we reuse this dataset and provide additional human annotations to support Perception Norm research. We denote the new dataset by UY/CH-CHILD-MA, where MA means “multiple annotations”. For completeness, we review the original UY/CH-CHILD dataset, and then present the multiple annotations.
4.1. Data Collection
UY/CH-CHILD is a prompted word-production corpus designed for studying accented child speech and segmental/tonal pronunciation modeling. In this work, we reuse the original recordings and phonetic annotations of UY/CH-CHILD and extend them with multiple independent MD labels. The construction of the original corpus can be summarized as (i) curating a phonetically representative set of target words; (ii) eliciting and recording prompted productions from Uyghur children; (iii) extracting speech segments corresponding to the prompted target words; and (iv) annotating both the canonical and realized pronunciations in Pinyin. For a complete description of the original corpus and protocol, we refer readers to [48].
The target-word list follows the articulation test for Mandarin Chinese-speaking preschoolers developed by the Institute of Linguistics at the Chinese Academy of Social Sciences (CASS) [49,50]. It contains 174 common, imageable words with 1–3 Chinese characters (syllables), covering key phonological features of Mandarin (e.g., syllables, tones, syllable combinations, weak stress, and rhotacism). This design keeps the task focused on pronunciation production rather than lexical knowledge.
To accommodate different ages, the 174 words were organized into two prompt sheets. Test A contains 138 relatively simple and highly recognizable items (e.g., “hand”, “flower”), and Test B contains 140 more challenging items (e.g., “crocodile”, “axe”, “chalk”). The two sheets have 104 overlapping words. Test A was mainly used for younger children (4–5 years old) in kindergarten, whereas Test B was used for older children in kindergarten and primary school.
Recordings were conducted in two phases (May 2022 and February 2023) in Ili Prefecture, Xinjiang Uyghur Autonomous Region, China. Participants were Uyghur children (4–12 years old) from Uyghur-speaking families who attended kindergartens or primary schools where Chinese is the primary instructional language. Basic speaker information was collected, and parental/guardian consent was obtained under privacy-protection procedures.
Speech was recorded in quiet rooms using a laptop at 16 kHz, 16-bit, single-channel settings. During recording, a tutor presented pictures of the target words and elicited spoken responses; when needed, a prerecorded reference prompt was played to facilitate word identification. Importantly, the tutor did not provide corrective feedback once the child identified the intended word, and the full session audio was retained to support downstream segmentation and annotation.
4.2. Multiple Annotations
All recordings were uploaded to a web-based annotation platform. We first performed target-segment extraction to locate clearly spoken instances of the prompted items. This segmentation step was carried out by trained student assistants, whose task was to identify cleanly articulated segments regardless of pronunciation correctness.
The extracted segments were then annotated for pronunciations and phone-level MD labels. Annotators listened to each segment and marked deviations from the canonical pronunciation in Pinyin. Each syllable is represented as Initial∼Final∼Tone. When an annotator could not confidently determine a component, an “*” tag was used. For example, if the canonical form is guang1, an annotation g∼uan∼* indicates a final mismatch (“uang”→“uan”) with uncertain tone. For the MD task, each canonical phone position is assigned a binary label: 1 indicates “mispronounced” and 0 indicates “correct”.
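The Initial∼Final∼Tone convention can be turned into binary MD labels mechanically. The sketch below is our own reading: treating an “*” component as a missing (uncertain) label rather than as correct or mispronounced is an illustrative assumption, since the paper does not prescribe how uncertain components map to the binary scheme.

```python
def md_labels(canonical: str, annotation: str):
    """Compare a canonical syllable with its annotation, both written as
    Initial~Final~Tone, and emit one label per component:
    1 = mispronounced, 0 = correct, None = annotator uncertain ("*").
    (Mapping "*" to None is our illustrative assumption.)
    """
    labels = []
    for canon, real in zip(canonical.split("~"), annotation.split("~")):
        if real == "*":
            labels.append(None)        # uncertain component
        else:
            labels.append(int(real != canon))
    return labels

# Canonical guang1 split as g~uang~1; annotation g~uan~* marks a final
# mismatch ("uang" -> "uan") with an uncertain tone.
assert md_labels("g~uang~1", "g~uan~*") == [0, 1, None]
```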
We collected MD labels in two phases. In Phase I, segments were annotated in a collective manner: each segment received two initial labels, and a third annotator adjudicated disagreements, yielding an effectively majority-based decision. Because the adjudicator varies across segments, the resulting labels reflect a collective behavior rather than any single individual; we thus treat it as a virtual “collective annotator” in the analyses. In Phase II, the same set of segments was labeled independently by three annotators (Anno1–Anno3) to enable the explicit study of inter-annotator variation.
Collective annotation. We assign each segment to two annotators for independent labeling. If the two labels agree, the decision is recorded; otherwise, a third annotator listens to the segment and makes the final decision.
Individual annotation. We distribute all segments to three annotators who label each segment independently, producing three parallel label sets for the same speech material.
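The Phase-I collective procedure above can be sketched directly; `adjudicate` below is a placeholder for the third annotator's judgment, which in the real protocol varies across segments.

```python
def collective_label(label1: int, label2: int, adjudicate) -> int:
    """Phase-I collective annotation: two independent labels are compared,
    and a third annotator adjudicates only when they disagree, yielding an
    effectively majority-based decision."""
    if label1 == label2:
        return label1          # agreement: record the shared decision
    return adjudicate()        # disagreement: third listener decides

assert collective_label(1, 1, adjudicate=lambda: 0) == 1  # agreement kept
assert collective_label(1, 0, adjudicate=lambda: 0) == 0  # tie broken
```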
Annotator profiles. To support interpretation of inter-annotator variation under Perception Norm, Table 2 provides a coarse-grained profile of the three individual annotators (Anno1–Anno3), including native-language background, relevant training, and prior experience with child/L2 speech assessment. To protect privacy, we report only coarse categories and do not include any personally identifiable information. A key observation is that the annotators do not differ in these coarse backgrounds, indicating that the significant cross-annotator variation is mainly due to individual preference.
4.3. Characteristics of UY/CH-CHILD-MA
Two factors make UY/CH-CHILD-MA particularly informative for Perception Norm.
Uyghur and Mandarin differ in phonemic inventories, syllable structure, and phonotactics. Transfer can induce systematic substitutions and distortions that are perceptually ambiguous.
Child speech exhibits higher acoustic variability and developmental effects, which can increase perceptual uncertainty and annotator dependence.
Together, these factors increase the space of “borderline” pronunciations, amplifying listener variability. To give a clear picture of UY/CH-CHILD-MA, some basic statistics are shown in Table 3, and the agreement of the collective annotator and the three individual annotators is shown in Table 4.
6. Experiments
6.1. Data
Our experiments focus on Chinese L2 speech, using two corpora: AISHELL-2 for pre-training and UY/CH-CHILD-MA for fine-tuning.
We use AISHELL-2 [51], a large-scale Mandarin read-speech corpus (about 1000 h; 1991 speakers) with an official train/dev/test split, as the source of native-speech pre-training data.
We use UY/CH-CHILD-MA for fine-tuning and evaluation. Due to the limited data size, we adopt a 10-fold speaker-level cross-validation protocol. Specifically, the 106 speakers are partitioned into 10 folds. For each fold, we use one fold (held-out speakers) as the test split and the remaining folds as the adaptation split for fine-tuning. Speaker identities are strictly separated between adaptation and test splits to avoid speaker leakage. We report the mean and standard deviation of the main evaluation metrics across the 10 folds.
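The speaker-level fold assignment can be sketched as follows. The function and speaker IDs are hypothetical; the point is that folds partition speakers, not utterances, so no speaker appears in both the adaptation and test splits.

```python
import random

def speaker_folds(speaker_ids, n_folds=10, seed=0):
    """Partition speakers (not utterances) into folds so that adaptation
    and test splits never share a speaker. Returns a fold index per speaker.
    """
    speakers = sorted(set(speaker_ids))
    rng = random.Random(seed)
    rng.shuffle(speakers)
    return {spk: i % n_folds for i, spk in enumerate(speakers)}

# 106 hypothetical speaker IDs, matching the size of UY/CH-CHILD-MA.
assign = speaker_folds([f"spk{i:03d}" for i in range(106)])
held_out = {s for s, f in assign.items() if f == 0}  # test split, fold 0
adapt = {s for s, f in assign.items() if f != 0}     # adaptation split
assert len(held_out) + len(adapt) == 106
assert held_out.isdisjoint(adapt)                    # no speaker leakage
```

Utterances are then routed to splits by looking up their speaker's fold, which is what enforces the strict speaker separation described above.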
6.2. Evaluation Metrics
We evaluate phone-level MD as a binary detection problem [7]. For each canonical phone position, the system outputs a binary decision (or a posterior) indicating whether the phone is perceived as mispronounced, which we compare against the corresponding human label. We report counts in a confusion-matrix style:
True reject (TR; true positive): A mispronounced phone correctly flagged as mispronounced.
False reject (FR; false positive): A correct phone incorrectly flagged as mispronounced.
True accept (TA; true negative): A correct phone correctly accepted as correct.
False accept (FA; false negative): A mispronounced phone incorrectly accepted as correct.
Precision and recall are then computed as

Precision = TR / (TR + FR),  Recall = TR / (TR + FA).

We report the F1 score

F1 = 2 · Precision · Recall / (Precision + Recall),

as well as the ROC-AUC, which summarizes the trade-off between true-positive and false-positive rates across decision thresholds.
To better characterize system behavior in realistic use, we report results at two representative operating points. The first fixes precision at 0.5, i.e., among the phones flagged by the system, approximately half are true mispronunciations; we treat this as a reasonable working point for CAPT feedback. The second fixes recall at 0.5, meaning the system identifies about half of all mispronounced phones, which reflects a balanced detection regime. In addition to these two points, we report the full precision–recall (PR) curve to summarize performance across the entire operating range.
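To make these definitions concrete, a minimal sketch with made-up confusion counts (the counts and function name are ours, not results from the paper):

```python
def md_metrics(tr: int, fr: int, ta: int, fa: int):
    """Precision, recall, and F1 from the confusion counts defined above:
    TR = true reject, FR = false reject, TA = true accept, FA = false accept.
    (TA enters accuracy-style metrics but not precision/recall.)"""
    precision = tr / (tr + fr)
    recall = tr / (tr + fa)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts illustrating the "precision = 0.5" operating point:
# half of the flagged phones are true mispronunciations.
p, r, f1 = md_metrics(tr=50, fr=50, ta=880, fa=20)
assert p == 0.5
assert abs(r - 50 / 70) < 1e-12
```

In practice, an operating point is found by sweeping the decision threshold on the posterior until precision (or recall) reaches the target value.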
6.3. Settings
All models are implemented in PyTorch 2.6.0 using standard Transformer blocks. A HuBERT pre-trained front-end is used to extract acoustic features. The HuBERT model contains roughly 300 million parameters and was pre-trained on the WenetSpeech L subset, which contains approximately 10,000 h of Mandarin Chinese speech data; the checkpoint was downloaded from the official website (https://huggingface.co/TencentGameMate/chinese-hubert-large, accessed on 20 February 2026).
We generate synthetic MD labels on the fly by randomly substituting phones with other phones sampled from the same broad category (vowel vs. consonant). For each phone sequence, corruption is applied with a preset probability, and at most a preset proportion of the phones can be substituted. We deliberately use relatively strong corruption to ensure that pre-training receives a stable and diverse supervision signal for learning robust MD-related representations; in our experience, milder corruption often leads to less stable pre-training and weaker downstream MD performance.
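The on-the-fly corruption can be sketched as follows. The phone sets, the corruption probability, and the substitution budget below are illustrative assumptions, not the values used in the paper.

```python
import random

# Illustrative broad-category phone sets (not the full Mandarin inventory).
VOWELS = {"a", "o", "e", "i", "u", "ai", "ang", "eng", "ou"}
CONSONANTS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k"}

def corrupt(phones, p_corrupt=0.9, max_frac=0.3, seed=0):
    """Randomly substitute phones within the same broad category
    (vowel vs. consonant) to create synthetic MD labels on the fly.
    Returns (possibly corrupted phones, binary labels; 1 = substituted).
    """
    rng = random.Random(seed)
    out, labels = list(phones), [0] * len(phones)
    if rng.random() >= p_corrupt:
        return out, labels                 # leave this sequence intact
    budget = max(1, int(max_frac * len(phones)))
    for i in rng.sample(range(len(phones)), budget):
        pool = VOWELS if phones[i] in VOWELS else CONSONANTS
        out[i] = rng.choice(sorted(pool - {phones[i]}))  # same category
        labels[i] = 1
    return out, labels

corrupted, labels = corrupt(["b", "a", "t", "ang"])
i = labels.index(1)
assert corrupted[i] != ["b", "a", "t", "ang"][i]   # substitution happened
```

The binary labels produced this way serve directly as the supervision signal for pre-training.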
We use gradient clipping in all experiments. For AISHELL-2 pre-training, we adopt a learning-rate schedule with a fixed peak rate and 10 warm-up epochs. The batch size is 128, and the hyperparameter in Equation (7) is set to 2. We pre-train for 25 epochs and take the averaged checkpoints from epochs 20–25 as the final pre-trained model.
For UY/CH-CHILD-MA fine-tuning, we initialize from the epoch-25 checkpoint and use a smaller learning rate with batch size 32. We report results using the averaged checkpoints from epochs 15–25. The two test conditions in the fine-tuning stage are summarized below.
In this work, phone-level MD labels are defined on canonical phone positions, i.e., whether the intended phone is perceived as mispronounced (substitution or deletion relative to the canonical sequence). Insertion errors do not naturally map to an intended canonical position without introducing additional alignment/labeling conventions (e.g., insertion slots between phones). Therefore, we exclude insertions in the current formulation to keep the label space consistent across annotators. We note that this choice may under-represent certain error patterns, but it ensures a fair performance comparison. In fact, many recent studies have adopted this “insertion exclusion” protocol, as it is more consistent with the definition of the MD task. Further discussion can be found in [18].
6.4. Experiment 1: Native Norm, Target Norm, and Perception Norm
6.4.1. Performance of Native Norm
The pre-trained model can be regarded as following Native Norm, as it was trained on large-scale native speech to detect mismatches between speech and phone labels. Table 5 shows the results when its prediction is evaluated with the annotations of different annotators as the ground truth. (Note that the missing value at the target precision = 0.5 arises because there are very few positives (label = 1), so an operating point with precision = 0.5 could not be readily found.) It can be seen that there is a clear gap across annotators, confirming our argument that speech assessment is highly subjective and that there is no commonly agreed gold standard.
6.4.2. Performance of Target Norm
In the next experiment, we use synthetic MD labels to fine-tune the model. This fine-tuning adapts the model to the target pronunciation and essentially realizes Target Norm.
The results are shown in Table 6, where the fine-tuning and test are conducted independently for each annotator. Two observations can be made: (1) once again, performance varies significantly across annotators, confirming that annotation is subjective and annotator-specific; (2) compared to Table 5, significant and consistent improvements are observed. Since the models have been adapted to the target population and the test employs the human labels as ground truth, the obtained improvement implies that human annotators have adapted to the target pronunciation (as the models do), as advocated by Target Norm.
6.4.3. Performance of Perception Norm
In this experiment, we use human labels to fine-tune the model. Specifically, we fine-tune and test an individual model using labels from a single annotator. This learns the perceptual behavior of that annotator and thus represents Perception Norm. The results are shown in
Table 7.
To complement the point-wise comparisons at fixed precision–recall, we further examine the overall ranking behavior of the two norms by plotting the full precision–recall (PR) curve.
Figure 3 compares Target Norm and Perception Norm under collective-label supervision. As shown, Perception Norm yields a consistently higher PR curve across the entire recall range, with especially clear advantages in the mid-to-high recall region, suggesting substantially improved ranking quality and more favorable trade-offs for downstream CAPT feedback.
Compared to the Target Norm (
Table 6), the model trained under Perception Norm achieves substantially higher F1 scores and ROC-AUC. This suggests that accent-specific adaptation explains only part of label alignment. Modeling listener-specific perceptual criteria explains substantially more. In other words, adaptation-to-accent is necessary but insufficient; adaptation-to-listener is decisive.
For example, at the operating point precision = 0.5, F1 for Anno1 improves from 0.2343 under Target Norm (
Table 6) to 0.5429 under Perception Norm (
Table 7), highlighting the additional gain from listener-specific alignment beyond accent/population adaptation.
Interestingly, the collective labels do not always yield the highest F1 or ROC-AUC, implying that aggregated annotation is not necessarily more stable than individual annotation.
6.4.4. Statistical Significance
To assess whether Perception Norm consistently outperforms Target Norm across folds, we conduct a paired two-sided Wilcoxon signed-rank test over the 10 fold-level scores (paired by fold) for each annotator. We report
p-values together with the mean differences in
Table 8.
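The paired test over fold-level scores can be reproduced in a self-contained way. Since n = 10, the sketch below computes an exact two-sided p-value by enumerating all 2^10 sign patterns; the fold-level F1 scores are hypothetical stand-ins (the paper's actual numbers appear in Table 8), and the implementation assumes no zero differences and no ties in |difference|.

```python
from itertools import product

def wilcoxon_exact(a, b):
    """Paired two-sided Wilcoxon signed-rank test with an exact p-value
    obtained by enumerating all 2^n sign patterns (cheap for n = 10).
    Assumes no zero differences and no ties in |difference|."""
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    rank = [0] * len(diffs)
    for pos, i in enumerate(order, start=1):
        rank[i] = pos                      # rank by |difference|
    w_pos = sum(rank[i] for i, d in enumerate(diffs) if d > 0)
    n = len(diffs)
    mean_w = n * (n + 1) / 4               # null mean of W+
    obs_dev = abs(w_pos - mean_w)
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if abs(sum(s * r for s, r in zip(signs, rank)) - mean_w)
        >= obs_dev - 1e-12
    )
    return w_pos, extreme / 2 ** n

# Hypothetical fold-level F1 scores, paired by fold (10 folds).
target = [0.21, 0.25, 0.23, 0.20, 0.26, 0.22, 0.24, 0.28, 0.19, 0.27]
perception = [0.31, 0.37, 0.37, 0.36, 0.44, 0.42, 0.46, 0.52, 0.45, 0.55]
w, p_value = wilcoxon_exact(perception, target)
mean_diff = sum(p - t for p, t in zip(perception, target)) / len(target)
assert w == 55 and p_value < 0.01 and mean_diff > 0
```

With all ten differences positive, W+ reaches its maximum (55) and the exact two-sided p-value is 2/1024, which is the most significant result attainable from ten paired folds.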
6.5. Experiment 2: Inter-Annotator Variation
In the previous section, we empirically demonstrated that learning the perception process of human annotators is possible and that Perception Norm is a reasonable framework for MD. In this section, we further demonstrate that different annotators behave differently, so there is no gold-standard ground truth for the MD task.
6.5.1. Correlation Analysis
We first analyze the label agreement and Pearson correlation among the four human annotators. The results are shown in
Table 9. It can be seen that the Pearson correlations are not high (mostly below 0.5), confirming our argument that annotators are relatively independent. However, label agreement is high, indicating that the annotators agree on a large proportion of the samples and that the low correlation stems from a small set of confusing words and phrases. Another observation is that Anno1 and Anno3 are more alike. Finally, although the collective annotator is an aggregation of multiple annotators, there is no clear evidence that it is a representative annotator, as its correlation with the three individual annotators is not significantly higher than the correlations among the individual annotators themselves.
6.5.2. Disagreement Concentration Analysis
To characterize when Perception Norm matters most, we analyze where annotator disagreement concentrates. At the phone level, disagreement is substantially higher for vowel/final units (8.36%) than for consonant/initial units (2.26%), suggesting that vowel and tone realizations introduce stronger perceptual ambiguity. The most disagreement-prone phones are mainly vowel+tone units, including iy3 (40.6%), u5 (38.3%), ang3 (35.8%), uang3 (35.3%), eng5 (35.3%), and ou5 (34.5%). At the speaker level, disagreement decreases with age (4–6 years: 5.79%; 7–9 years: 4.88%; 10–12 years: 2.98%), consistent with higher acoustic variability and less stable production in younger children. To further illustrate these patterns, representative borderline cases are shown in Table 10.
These examples mainly involve vowel- and tone-related units, which are more prone to perceptual ambiguity. As a result, the same token may lie near the perceptual decision boundary, leading to inconsistent labels across annotators.
6.6. Cross-Annotator Evaluation
In this experiment, we align a model fine-tuned with labels of a particular annotator to other annotators, by performing the test using the labels of different annotators as the ground truth. By this experiment, we wish to examine whether the fine-tuned models are annotator-specific. The results of the model fine-tuned with the collective labels are shown in
Table 11, and the models fine-tuned with labels of the three individual annotators are shown in
Table 12,
Table 13 and
Table 14. Some observations are as follows:
A model fine-tuned with a particular annotator’s labels achieves the best F1 score when aligned with that annotator, indicating that the model is simulating the annotator’s behavior.
The relative performance across annotators largely reflects the human correlations shown in
Table 9, further confirming that the annotators behave very differently from each other, and fine-tuning can effectively capture the specific characteristics of single annotators.
ROC-AUC does not drop dramatically across annotators, suggesting that annotators largely agree on the ranking of sample difficulty, while differing more in score calibration and acceptance/rejection thresholds.
Table 11.
Cross-annotator evaluation with model aligned to the collective annotator.
| Operating Point | Ground Truth | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Target Precision | Collective | 0.5 | 0.6286 | 0.5533 | 0.94 |
| | Anno1 | 0.5 | 0.4045 | 0.4235 | 0.94 |
| | Anno2 | 0.5 | 0.3499 | 0.3684 | 0.96 |
| | Anno3 | 0.5 | 0.5268 | 0.5015 | 0.92 |
| Target Recall | Collective | 0.6078 | 0.5 | 0.5464 | 0.94 |
| | Anno1 | 0.4493 | 0.5 | 0.4679 | 0.94 |
| | Anno2 | 0.3915 | 0.5 | 0.4285 | 0.96 |
| | Anno3 | 0.5530 | 0.5 | 0.5139 | 0.92 |
Table 12.
Cross-annotator evaluation with model aligned to Anno1.
| Operating Point | Ground Truth | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Target Precision | Anno1 | 0.5 | 0.6057 | 0.5429 | 0.95 |
| | Anno2 | 0.5 | 0.3240 | 0.3481 | 0.96 |
| | Anno3 | 0.5 | 0.4595 | 0.4487 | 0.91 |
| Target Recall | Anno1 | 0.6085 | 0.5 | 0.5476 | 0.95 |
| | Anno2 | 0.3561 | 0.5 | 0.3996 | 0.96 |
| | Anno3 | 0.5091 | 0.5 | 0.4898 | 0.91 |
Table 13.
Cross-annotator evaluation with model aligned to Anno2.
| Operating Point | Ground Truth | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Target Precision | Anno1 | 0.5 | 0.3308 | 0.3880 | 0.89 |
| | Anno2 | 0.5 | 0.5015 | 0.4688 | 0.96 |
| | Anno3 | 0.5 | 0.3261 | 0.3756 | 0.87 |
| Target Recall | Anno1 | 0.2953 | 0.5 | 0.3620 | 0.89 |
| | Anno2 | 0.5403 | 0.5 | 0.4996 | 0.96 |
| | Anno3 | 0.3019 | 0.5 | 0.3587 | 0.87 |
Table 14.
Cross-annotator evaluation with model aligned to Anno3.
| Setting | Annotator | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Target Precision | Anno1 | 0.5 | 0.4133 | 0.4241 | 0.92 |
| | Anno2 | 0.5 | 0.3699 | 0.3950 | 0.96 |
| | Anno3 | 0.5 | 0.7355 | 0.5879 | 0.95 |
| Target Recall | Anno1 | 0.4490 | 0.5 | 0.4604 | 0.92 |
| | Anno2 | 0.3738 | 0.5 | 0.4112 | 0.96 |
| | Anno3 | 0.7429 | 0.5 | 0.5914 | 0.95 |
6.7. Experiment 3: Utility of Multiple Annotators
In this section, we investigate how multiple annotators can be used effectively. We first examine whether simple voting offers any benefit, and then evaluate an ensemble approach that averages the predictions of a group of individual models.
6.7.1. Whether Voting Is Reliable
First, we examine whether voting can be regarded as a reliable ground truth. To this end, we apply three “voting rules” to the individual human labels and test whether the resulting voted labels are closely related to the collective labels, which themselves essentially follow a majority voting rule.
The results are shown in
Table 15. Among the three voting rules, majority voting aligns best with the collective annotator. More importantly, compared with the labels of the individual annotators, the majority voting labels obtain the highest agreement score and the second highest Pearson correlation when aligned with the collective annotator. Since the collective annotator is also derived via majority voting, these results suggest that majority voting can reduce annotation uncertainty. However, the correlation between the two majority voting-based annotators is not substantially higher than that between any two individual annotators, or between an individual annotator and a majority voting-based annotator. This suggests that voting among only three annotators (as in both the collective annotator and the majority voting annotator in Table 15) may be insufficient to yield a fully stable “mean” annotator.
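To make the voting rules concrete, the following is a minimal sketch of the majority voting rule applied to per-annotator binary labels; the labels and the tie-breaking choice are hypothetical, for illustration only:

```python
def majority_vote(label_sets):
    """Fuse per-annotator binary labels (1 = mispronounced) by majority vote.

    label_sets: one list of phone-level labels per annotator.
    Ties are broken toward 0 ('correct'), a choice made for illustration.
    """
    n = len(label_sets)
    return [int(2 * sum(votes) > n) for votes in zip(*label_sets)]

# Three hypothetical annotators labeling five phone tokens
anno1 = [1, 0, 1, 0, 1]
anno2 = [1, 0, 0, 0, 1]
anno3 = [0, 1, 1, 0, 1]
print(majority_vote([anno1, anno2, anno3]))  # [1, 0, 1, 0, 1]
```

With an odd number of annotators, as here, ties cannot occur at the token level; the tie-breaking rule only matters for even-sized committees.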
6.7.2. Majority Voting Model and Score Average
We now investigate how to utilize multiple annotators effectively. Traditionally, this is achieved by training a majority voting model, i.e., fine-tuning the pre-trained model using the majority voting labels. This is essentially the popular end-to-end approach. As mentioned, it implicitly simulates the complex MD process, which involves both individual perception and social aggregation.
We present a new approach to utilizing multiple annotators more effectively. Specifically, we train three annotation models (anno1–3) for the three individual annotators, respectively, and then score the test pronunciation with the three models. After calibration, the scores produced by the three models are averaged to perform the MD decision.
The calibration step is designed to ensure comparability across the three individual annotation models. We adopt “temperature scaling” as a simple yet effective post hoc calibration method.
Within each cross-validation fold, after fine-tuning is completed, we construct a held-out calibration set by sampling 10% of the training speakers from the adaptation split (speaker-level sampling). Given the pre-sigmoid logit z produced by an individual model, the calibrated probability is

$$\hat{p} = \sigma\left(\frac{z}{T}\right),$$

where $\sigma$ denotes the sigmoid function and T is a scalar temperature parameter. T is learned by minimizing the negative log-likelihood on this calibration set, while keeping all other model parameters fixed. The test speakers are never used for calibration.
This procedure rescales the confidence of each individual model without altering its ranking behavior, thereby aligning their probabilistic outputs under a common decision framework. After temperature scaling, we average the calibrated probabilities as follows:

$$\bar{p} = \frac{1}{R}\sum_{r=1}^{R}\sigma\left(\frac{z_r}{T_r}\right),$$

where $T_r$ is the annotator-specific temperature parameter, $z_r$ is the pre-sigmoid logit produced by the model aligned to annotator r, and R is the number of annotator models (here, R = 3).
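A minimal sketch of this calibrate-then-average procedure follows; a grid search stands in for gradient-based NLL minimization, and the logits and labels are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_temperature(logits, labels):
    """Fit a scalar temperature T by minimizing NLL on a calibration set.

    A coarse grid search replaces the gradient-based optimization described
    in the text; logits are pre-sigmoid outputs of one annotator model.
    """
    eps = 1e-12

    def nll(t):
        total = 0.0
        for z, y in zip(logits, labels):
            p = min(max(sigmoid(z / t), eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total

    grid = [0.25 + 0.05 * i for i in range(60)]  # T in [0.25, 3.20]
    return min(grid, key=nll)

def committee_probability(logits_per_model, temperatures):
    """Average the calibrated probabilities of the annotator-specific models."""
    return sum(
        sigmoid(z / t) for z, t in zip(logits_per_model, temperatures)
    ) / len(temperatures)

# Overconfident toy model: half of its confident predictions are wrong,
# so calibration should soften the scores (T > 1).
T = fit_temperature([4.0, -4.0, 4.0, -4.0], [1, 0, 0, 1])
print(T > 1.0)  # True
```

Note that dividing the logit by T > 0 is monotone, so calibration changes confidence but never the ranking of test samples, exactly the property the text relies on.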
We compare the performance of the voting model approach and the score average approach, with the majority voting labels as the ground truth. The performance is shown in
Table 16. It can be seen that the score average approach slightly outperforms the majority voting model, and the improvement is statistically significant according to
Table 17, although the training objective of the majority voting model is to predict the majority voting labels. This is consistent with our argument that learning individual perception is easier than learning the entire MD process.
Beyond comparing summary metrics (e.g., ROC-AUC/PR-AUC and the fixed operating points), we also examine ranking quality across the full operating range using precision–recall (PR) curves.
Figure 4 contrasts the score-averaging ensemble with the voting-label-trained single model. The ensemble yields a consistently higher PR curve across most recall values, indicating a more favorable precision–recall trade-off and more robust ranking behavior for downstream use.
We conduct a paired two-sided Wilcoxon signed-rank test over the 10 fold-level scores (paired by fold) to assess whether the score average committee significantly outperforms the majority voting model. The results are summarized in
Table 17.
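The test itself can be run with `scipy.stats.wilcoxon`; the fold-level F1 scores below are hypothetical stand-ins, not the values behind Table 17:

```python
from scipy.stats import wilcoxon

# Hypothetical fold-level F1 scores (10 CV folds), paired by fold
score_average  = [0.56, 0.58, 0.55, 0.57, 0.59, 0.54, 0.56, 0.58, 0.57, 0.55]
majority_model = [0.54, 0.55, 0.54, 0.55, 0.56, 0.53, 0.55, 0.56, 0.55, 0.54]

# Paired two-sided Wilcoxon signed-rank test over fold-level differences
stat, p_value = wilcoxon(score_average, majority_model, alternative="two-sided")
print(f"W = {stat}, p = {p_value:.4f}")
```

Because every fold favors the score-average system in this toy example, the smaller rank sum is zero and the test rejects the null at the 0.05 level.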
This result suggests an ensemble paradigm that trains individual models to simulate individual annotators and then makes decisions using a separate aggregation rule. A key advantage of this ensemble approach over the end-to-end approach trained on aggregated labels (e.g., voting labels) is that the aggregation function can be adjusted to the requirements of real applications without re-training the model. This provides a powerful paradigm that can simulate the social mechanism of voting (or other aggregation rules) at limited additional cost.
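To illustrate how the aggregation rule can be swapped without retraining, the following sketch applies several hypothetical rules to the same set of calibrated per-annotator probabilities:

```python
# Hypothetical calibrated per-annotator probabilities for one phone token
scores = {"anno1": 0.62, "anno2": 0.35, "anno3": 0.71}

def aggregate(probs, rule="mean", threshold=0.5):
    """Apply an application-chosen aggregation rule; no retraining needed."""
    vals = list(probs.values())
    if rule == "mean":        # score averaging, as in Experiment 3
        return sum(vals) / len(vals)
    if rule == "strict":      # high precision: the most lenient model decides
        return min(vals)
    if rule == "inclusive":   # high recall: the most severe model decides
        return max(vals)
    if rule == "vote":        # simulate majority voting on thresholded scores
        return sum(v > threshold for v in vals) / len(vals)
    raise ValueError(f"unknown rule: {rule}")

print(aggregate(scores, "mean"))    # ≈ 0.56
print(aggregate(scores, "strict"))  # 0.35
```

The rule names and thresholds here are illustrative choices; the point is only that the committee scores are computed once, and the decision policy is a cheap post hoc function.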
7. Discussion
7.1. Does the Perception Norm Lead to Subjective Evaluation?
Yes—Perception Norm explicitly acknowledges that pronunciation assessment is subjective, because it depends on listeners’ perceptual boundaries, priors, and acceptance thresholds. Importantly, “subjective” does not mean “arbitrary”. In our experiments, annotators show high label agreement while exhibiting low-to-moderate Pearson correlations (
Table 9), suggesting that they largely agree on clearly correct/incorrect cases but differ on borderline cases and calibration. This is consistent with a signal-detection view: different annotators may share similar rankings of difficulty (reflected by relatively stable ROC-AUC across cross-annotator evaluation) but apply different decision thresholds and severities.
From an application perspective, subjectivity is unavoidable and must be specified rather than ignored. Any MD system that is trained and evaluated against human labels is implicitly tied to a listener population and an operational definition of “error”. Perception Norm makes this dependence explicit: the system designer can decide which listener(s) the system is meant to simulate and how multiple judgments should be aggregated for the target use case.
7.2. Implication for MD System Design
Perception Norm suggests separating two components that are conflated in many end-to-end approaches: (i) individual perceptual modeling and (ii) social aggregation. Concretely, instead of directly training a single model on aggregated labels, one can train annotator-specific models via fine-tuning (
Table 7) and then combine them at inference time.
This separation has several practical benefits. First, learning each annotator is an easier and more stable learning problem than learning an aggregated label that already mixes perception and aggregation. Second, the aggregation function can be chosen to match the target application without retraining.
For example, a strict aggregation rule may be preferred for high-stakes testing (high precision; teacher-like), while a more inclusive aggregation may be preferred in CAPT to provide conservative feedback (high recall; peer-like). Third, model ensembles can naturally represent annotator diversity and provide uncertainty estimates: as an uncertainty signal, disagreement among the committee of models can be used to flag borderline pronunciations for human review or for adaptive feedback strategies. More broadly, the aggregation rule can be selected to align with different stakeholder preferences (e.g., teachers, learners, or examination boards) without retraining the underlying annotator-specific models.
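As a minimal sketch of the disagreement-based uncertainty signal mentioned above, with a hypothetical spread threshold and probabilities:

```python
def flag_borderline(probs, margin=0.15):
    """Flag a token for human review when annotator models disagree.

    probs: calibrated probabilities from the annotator-specific models.
    A hypothetical rule: a large spread signals a borderline pronunciation.
    """
    spread = max(probs) - min(probs)
    return spread > margin

print(flag_borderline([0.62, 0.35, 0.71]))  # True  (committee disagrees)
print(flag_borderline([0.10, 0.12, 0.08]))  # False (committee agrees)
```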
Experiment 3 provides a concrete instance: calibrating annotator-specific models and averaging their outputs slightly outperforms a voting-label-trained model (
Table 16), and the gain is statistically significant (
Table 17), suggesting that “model-then-aggregate” can be competitive or superior to “aggregate-then-model”.
A practical concern is computational cost. Under the model architecture proposed in this paper, using N MD models would indeed increase the computation by a factor of N. However, this overhead can be significantly reduced by sharing a common backbone, while keeping only the MD head annotator-specific. We have verified this approach and found that it not only reduces computational cost but also improves accuracy.
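A minimal PyTorch sketch of the shared-backbone design follows; the dimensions are hypothetical and a small linear stack stands in for the pre-trained speech encoder:

```python
import torch
import torch.nn as nn

class MultiAnnotatorMD(nn.Module):
    """Shared backbone with one lightweight MD head per annotator.

    Only the small linear heads are annotator-specific, so scoring with
    an N-annotator committee costs roughly one backbone forward pass
    instead of N full-model passes.
    """
    def __init__(self, feat_dim=768, hidden_dim=256, n_annotators=3):
        super().__init__()
        # Placeholder for the shared pre-trained encoder
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        # One pre-sigmoid logit per annotator
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(n_annotators)
        )

    def forward(self, feats):
        h = self.backbone(feats)  # computed once, reused by all heads
        return torch.cat([head(h) for head in self.heads], dim=-1)

model = MultiAnnotatorMD()
logits = model(torch.randn(4, 768))  # batch of 4 phone-level feature vectors
print(logits.shape)                  # torch.Size([4, 3])
```

During fine-tuning, each head would be trained on its annotator's labels (optionally with the backbone shared or frozen); the per-head logits can then be calibrated and aggregated exactly as in Experiment 3.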
7.3. Implications for MD Database Construction
Perception Norm implies that MD datasets should preserve annotator information rather than collapsing it into a single aggregated label. Practically, we recommend (1) storing per-annotator labels and annotator metadata (experience, linguistic background) when possible; (2) ensuring sufficient overlap so that inter-annotator variability and calibration can be quantified; and (3) documenting the aggregation rule as part of the dataset specification, since it changes the construct the model is trained to predict.
In addition, borderline cases are especially informative for Perception Norm research and for training robust systems. Data collection protocols that elicit challenging pronunciations (e.g., accented child speech as in UY/CH-CHILD-MA) can amplify perceptual variability and make annotator effects observable.
7.4. Limitations and Future Work
This study has several limitations. First, UY/CH-CHILD-MA contains four annotators; while sufficient to reveal substantial annotator variation, larger and more diverse annotator pools would enable more fine-grained analysis and more stable estimates of an “average” listener. Second, our work focuses on segment-level/phone-level mispronunciation detection; extending Perception Norm to richer diagnoses (e.g., substitution types, suprasegmentals, and intelligibility/comprehensibility dimensions) is an important direction. Third, we used post hoc calibration (temperature scaling) for averaging; exploring jointly trained mixtures-of-experts or Bayesian models that explicitly represent annotator severity and uncertainty may further improve aggregation. Finally, evaluating how Perception-Norm-based systems affect real CAPT outcomes and perceived fairness across learner populations remains an open and practically important question.