Article

Perception Norm for Mispronunciation Detection

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830049, China
2 Center for Speech and Language Technologies, BNRist, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3311; https://doi.org/10.3390/app16073311
Submission received: 21 February 2026 / Revised: 23 March 2026 / Accepted: 26 March 2026 / Published: 29 March 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Mispronunciation detection (MD) is a key component in computer-assisted pronunciation training (CAPT) and speaking tests. Most MD systems adopt a production view, measuring phone-level deviation from a canonical pronunciation (Native Norm) or the expected pronunciation of a target population (Target Norm). Yet, pronunciation assessment is fundamentally perceptual: listeners map speech to linguistic categories under uncertainty and with individual psychological priors, so judgments are inherently subjective and lack a single gold standard. Labels are therefore often aggregated (e.g., voting), but aggregation rules are themselves subjective, require many annotators, and entangle individual perception with social consensus, complicating model training. In this paper, we propose a “Perception Norm”, which models MD as the decision process of individual annotators and trains models to simulate single listeners rather than an annotator pool. To support this study, we introduce UY/CH-CHILD-MA, a corpus of Uyghur-accented child Mandarin words and phrases with four independent phone-level annotations. Our experiments reveal substantial inter-annotator variation and show that a Transformer with pre-training and fine-tuning can learn annotator-specific patterns with high accuracy. Finally, we present a committee ensemble that combines annotator models using application-matched aggregation rules to produce task-specific assessments. The data and source code will be made publicly available upon publication.

1. Introduction

Mispronunciation detection and diagnosis (MD/MDD) is a core component in computer-assisted pronunciation training (CAPT) and speech-enabled language learning systems. The classical pipeline treats learner speech as a noisy realization of a canonical phone sequence, then detects mismatches via forced alignment, posterior-based scoring (e.g., goodness of pronunciation, GOP), or more recently, end-to-end neural models that predict mispronunciation labels at the phone or segment level [1,2,3,4,5,6,7,8,9,10,11,12].
A long-standing but often under-discussed assumption in MD research is the existence of a deterministic gold standard label: for each phone token, there is a correct label (correct/mispronounced) that a human annotator can reveal. There are two common views of this “gold standard”, which we call Native Norm and Target Norm, respectively.
Many MD systems are built under a Native Norm assumption: learner speech is evaluated by comparing it with a canonical L1 pronunciation lexicon, and deviations are treated as errors. Representative examples include forced-alignment-based goodness-of-pronunciation (GOP) scoring [1] and graph-based recognition variants such as the extended recognition network (ERN) [2].
Despite its engineering convenience, a strict Native Norm can be problematic in practice: it may penalize systematic accent patterns and raise fairness concerns. In addition, speech perception research indicates that listener judgments are adaptive rather than a fixed L1 template; listeners rapidly adjust to the statistics of the talker or target population [13,14,15,16,17]. Motivated by this evidence, we introduced the Target Norm in our previous study [18], operationalizing population-level adaptation by adapting a pre-trained model to the target speech. We observed improved agreement between model outputs and human annotations after such adaptation.
Both the Native Norm and Target Norm treat the MD task from a “production” perspective, i.e., they measure to what degree the produced phones or phrases deviate from a reference pronunciation: the canonical native pronunciation (Native Norm) or the population-average pronunciation of the target speakers (Target Norm). However, MD is essentially a human perception task and goes far beyond an objective measurement of deviation in the speech signal. It involves a complex perceptual and psychological process, particularly when the pronunciation lies near a phone category boundary. This leads to high intra-annotator uncertainty and inter-annotator variation. We argue that this subjectivity is inherent to MD (and to speech assessment in general).
Therefore, we argue that an ideal MD system should simulate human perceptual judgments rather than solely measure signal deviation, if its final goal is to help L2 speakers speak in a way that others can understand, rather than merely sound like a canonical pronunciation. For simplicity, we call this idea the “Perception Norm”.
To support this argument, we built a new dataset UY/CH-CHILD-MA that contains Chinese words and phrases spoken by Uyghur children. Compared to other datasets such as L2-ARCTIC [19], UY/CH-CHILD-MA is challenging due to both L2-accented pronunciation and the additional variability of child speech [20]. We hypothesize that these settings elicit stronger perceptual and psychological effects in human annotation. Moreover, UY/CH-CHILD-MA is annotated by four annotators, enabling the analysis of inter-annotator variation.
Following our previous work [18], we employed a Transformer-based end-to-end model to simulate human perceptual behavior. We first pre-trained the model on native speech and then fine-tuned it to align with human annotators. The experiments demonstrate that fine-tuning is crucial for aligning the model with a target annotator, and that models aligned with different annotators behave distinctively. These observations provide strong evidence for the perception norm.
We finally propose an ensemble paradigm for the MD task. Specifically, it first trains an individual MD model for each annotator, and uses these models to compose an “expert committee” whose outputs are aggregated to produce a group decision. This new paradigm isolates the perception process of individual annotators from the social decision process of annotator groups, potentially providing a more flexible and faithful simulation of the annotation process.
Our contributions are threefold:
  • Perception Norm. We propose Perception Norm, which treats MD as a subjective perceptual judgment process rather than an objective deviation measurement. This perspective aligns with how labels are produced in practice: human annotators deploy perceptual and psychological resources to judge pronunciations, and their decisions are inherently uncertain and subjective. We argue that an ideal MD system should simulate this uncertain and subjective process of human perception as much as possible, which is the core idea of Perception Norm.
  • Computational evidence for Perception Norm. We conduct a series of experiments on UY/CH-CHILD-MA, a challenging accented child speech dataset with multi-annotator labels, and provide computational evidence supporting Perception Norm by showing that annotator-specific models can be learned accurately and exhibit distinct behaviors.
  • A new ensemble paradigm for pronunciation assessment. We propose a new ensemble paradigm for MD (and more general pronunciation assessment) that isolates individual assessment from social aggregation, providing a flexible and efficient way to simulate the underlying process of human annotation and decision making.
In the rest of this paper, we first briefly discuss the related work in Section 2 and then introduce Perception Norm in Section 3. In Section 4, we introduce our new dataset—UY/CH-CHILD-MA. Section 5 presents the proposed MD model, Section 6 describes the experimental settings and results, Section 7 discusses the implications and limitations of Perception Norm, and the paper is concluded in Section 8.

2. Background and Related Work

We position Perception Norm by drawing connections to three relevant research threads: (i) mispronunciation detection and pronunciation assessment, (ii) speech perception and adaptation, and (iii) multi-annotator learning together with label aggregation.

2.1. Mispronunciation Detection and Pronunciation Assessment

Early MD systems relied on ASR alignment and posterior-based scoring. GOP-based approaches compare the posterior probability of the intended phone to competing phones, providing a phone-level score used for assessment [1]. Subsequent work improved acoustic modeling and classification by integrating DNN acoustic models and logistic regression transfer learning [3]. Other approaches enrich diagnostic feedback by modeling speech attributes [21] or by multi-distribution DNNs to better handle non-native variants [4].
End-to-end approaches reduce dependence on forced alignment and hand-crafted rules. CNN-RNN-CTC models provide phone recognition and error labeling in a unified framework [5]. SED-MDD introduces sentence-dependent modeling to exploit reference text [6]. Anti-phone modeling expands the label space to capture categorical and non-categorical errors [7]. Transformer and self-supervised front-ends (e.g., wav2vec 2.0/HuBERT) improve representation and robustness [8,22].
A parallel trend is to move from MD to broader pronunciation assessment (segmental + suprasegmental, intelligibility/comprehensibility), with reviews emphasizing construct validity and evaluation concerns [23].

2.2. Speech Perception: Adaptation and Listener Variability

Speech perception is known to be adaptive and context-sensitive. Listeners rapidly adapt to foreign-accented speech [13,14]. Lexically guided perceptual learning (“phonetic retuning”) shows that exposure to ambiguous sounds in lexical contexts shifts category boundaries [16,24]. Bayesian/ideal-adapter accounts formalize perception as rational inference under uncertainty, predicting listener-specific adaptation based on priors and cue reliability [15].
In L2 speech assessment, “accentedness” and “comprehensibility” are related but distinct constructs. Annotators differ in how they weight segmental and suprasegmental cues, and annotator experience changes judgments [25,26,27,28].
These findings suggest a natural hypothesis for MD: different annotators effectively implement different perceptual decision strategies, especially for borderline cases.

2.3. Multi-Annotator Learning, Aggregation, and Calibration

When multiple annotators label the same item, disagreement can be informative rather than noisy [29]. In statistics, observer error models (e.g., Dawid–Skene) treat the true label as latent and estimate annotator confusion matrices [30]. In machine learning, learning from crowds models annotators of unknown expertise and item difficulty [31,32,33].
A pragmatic alternative to probabilistic label models is to keep per-annotator models and combine them via ensembling or score averaging. Ensemble methods often reduce variance and improve generalization [34]. However, combining probabilistic outputs meaningfully requires calibration: modern neural networks can be miscalibrated, and temperature scaling is a simple yet effective post hoc method [35,36]. In imbalanced MD settings, PR-based evaluation is recommended over ROC alone [37,38,39].

2.4. Norms, Construct Validity, and Annotator Effects: Lessons from Language Assessment

The Perception Norm framing is closely aligned with the notion of construct validity in educational and language assessment: what a score or label means depends on the interpretation and the intended use, and validation requires an argument about the inferences and assumptions linking observations to score-based decisions [40,41]. In pronunciation assessment, the construct has multiple facets (segmental accuracy, prosody, intelligibility, comprehensibility), and different assessment settings operationalize different facets with different consequences [25,26,42].
From this viewpoint, a phone-level “mispronunciation” label is not merely an empirical observation but an assessment event: an annotator observes an acoustic token, applies a decision process (often implicit), and produces a categorical judgment. Annotator effects are therefore not surprising; they are an expected component of the measurement process.

2.4.1. Annotator Severity, Consistency, and the Many-Facet Perspective

In performance assessments, annotator severity/leniency and annotator consistency are routinely studied and corrected. Many-facet Rasch measurement (MFRM) treats examinee ability, item difficulty, task difficulty, and annotator severity as separable latent facets [43,44]. Although MD research rarely uses MFRM directly, the conceptual lesson is important: when multiple human judgments determine a label, it is scientifically meaningful to model who the annotator is, not only what the acoustic signal is.

2.4.2. Aggregation Rules Are Construct Definitions

Perception Norm emphasizes that an “MD label” is inseparable from the norm used to generate it. Even when we treat aggregation as a purely technical step, the choice of aggregation rule changes the construct that the model is trained to predict. This observation is routine in assessment validity theory [40,41] but is not yet standard in MD benchmarking.
We distinguish three idealized aggregation norms (often implemented implicitly in annotation projects):
  • Inclusive (union-like) aggregation: This marks a token as an error if any qualified annotator flags it. This increases sensitivity and encourages conservative feedback in CAPT, but it may over-represent borderline deviations.
  • Consensus (majority-vote) aggregation: This marks an error only when supported by a majority. This increases stability across annotators and may better reflect “typical listener” perception, but it can miss systematic deviations that are salient to certain expert annotators.
  • Strict (intersection-like) aggregation: This marks an error only when all annotators agree. This produces high-precision positives but can severely under-sample the positive class.
The important point is not that one rule is universally correct, but that each rule corresponds to a different intended use and a different interpretation of what “mispronunciation” means.
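As a concrete illustration, the following minimal Python sketch shows how the same set of per-annotator labels yields different aggregated labels under the three idealized rules; the function name, label encoding (1 = mispronounced), and tie-breaking behavior are illustrative assumptions rather than any project's actual annotation logic.

```python
from typing import List


def aggregate(labels: List[int], rule: str = "majority") -> int:
    """Aggregate per-annotator binary MD labels (1 = mispronounced).

    rule: 'inclusive' (union-like), 'majority' (consensus),
          or 'strict' (intersection-like).
    """
    positives = sum(labels)
    if rule == "inclusive":
        return int(positives >= 1)               # any annotator flags it
    if rule == "majority":
        return int(positives * 2 > len(labels))  # strict majority
    if rule == "strict":
        return int(positives == len(labels))     # all annotators agree
    raise ValueError(f"unknown rule: {rule}")


# Example: three annotators disagree on a borderline token.
per_annotator = [1, 0, 1]
print(aggregate(per_annotator, "inclusive"))  # 1
print(aggregate(per_annotator, "majority"))   # 1
print(aggregate(per_annotator, "strict"))     # 0
```

The same token thus becomes a positive or a negative training example depending solely on the aggregation rule, which is exactly why the rule should be documented as part of the construct definition.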

2.4.3. Consequential and Fairness Considerations

Norm choice also has social consequences. A native-speaker norm can systematically disadvantage certain learner populations and accents, while an intelligibility-oriented norm may better reflect communicative goals [42,45,46,47]. Perception Norm provides an empirical lens on this debate: if listener populations differ in their judgments, then “fair” MD cannot be defined without specifying which listener population the system is meant to serve.

2.4.4. Implication for MD Corpora and Evaluation

When a corpus releases only a single merged label, the annotator facet is collapsed into an opaque mixture. This is convenient for supervised learning, but it makes it difficult to evaluate how a model would behave for different listener populations (e.g., teachers vs. peers; trained phoneticians vs. lay listeners) or under different instructional goals (native-likeness vs. intelligibility). Our corpus design follows the opposite principle: preserve annotator traces, then study the consequences for learning and evaluation.

3. Perception Norm

To facilitate a quick comparison, Table 1 summarizes Native Norm, Target Norm, and Perception Norm side-by-side in terms of (i) model inputs, (ii) supervision labels, (iii) training data, and (iv) the pronunciation standard represented by each norm. As illustrated in Figure 1, the three norms differ in the reference region and the implied decision boundary in the acoustic/phonetic space.

3.1. A Formal Notation

Let $x$ be an acoustic segment aligned to an intended phone $p$ within a word/utterance context $c$ (including neighboring phones, lexical identity, and prosodic cues). Let $r \in \{1, \ldots, R\}$ denote a listener/annotator. Each annotator produces a binary label $y^{(r)} \in \{0, 1\}$ indicating whether the phone is mispronounced.
Perception Norm treats $y^{(r)}$ as an observable consequence of (i) a perceptual and psychological process parameterized by $\theta_r$ and (ii) a decision process with an annotator-specific threshold $\tau_r$. One minimal form is a signal-detection-style model:
$$l = p(z \mid x, c; \theta_r),$$
$$y^{(r)} = \mathbb{I}[l > \tau_r],$$
where $l$ measures the goodness of the pronunciation perceived by annotator $r$ (e.g., as a posterior probability), and $\tau_r$ captures annotator severity/leniency.
Note that $\theta_r$ and $\tau_r$ can be interpreted computationally as the parameters of annotator-specific decision models and can be optimized from annotator behavior data, i.e., the MD labels.
Relation to the model implementation. The signal-detection-style notation above is intended as a conceptual abstraction of listener-specific perception. In our experiments, the Transformer-based MD model (Section 5) implements a flexible scoring function that maps $(x, c)$ to a phone-level logit (and posterior) for each canonical position; the listener-dependent component $\theta_r$ is operationalized implicitly by fine-tuning the model using labels from annotator $r$. The decision threshold $\tau_r$ is instantiated at inference time via operating-point selection and, when combining multiple annotator-specific models, via post hoc calibration and aggregation (Section 6.7.2).
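As a minimal illustration of this decision rule, the sketch below thresholds the phone-level posterior of an annotator-specific model; the logits, the threshold value, and the function name are hypothetical, and the sketch is not the released implementation.

```python
import torch


def annotator_decision(logit: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """y^(r) = I[l > tau_r], where l = sigmoid(logit) is the phone-level
    posterior produced by the model fine-tuned on annotator r's labels,
    and tau is the annotator-specific operating threshold."""
    l = torch.sigmoid(logit)   # listener-specific score p(z | x, c; theta_r)
    return (l > tau).long()    # 1 = flagged as mispronounced


# Hypothetical logits for four canonical phone positions of one word.
logits = torch.tensor([-2.1, 0.3, 1.7, -0.4])
print(annotator_decision(logits, tau=0.5))  # tensor([0, 1, 1, 0])
```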

3.2. Why Perception Norm?

Perception Norm has several distinct features that make it novel and attractive:
(1) Perception Norm does not assume any canonical pronunciation; instead, it learns from the perceptions of human annotators. This distinguishes it from Native Norm and Target Norm.
(2) Perception Norm does not try to learn aggregated annotations; instead, it learns the perception of a single annotator. This distinguishes it from traditional end-to-end learning methods that are trained on pooled or aggregated labels.

3.2.1. From Production Deviation to Perception Simulation

Both Native Norm and Target Norm assume a canonical pronunciation representation, parameterized by $\theta$, and the decision can be formulated as
$$d = f(x, c; \theta),$$
$$y = \mathbb{I}[d > \tau],$$
where the evaluation function $f$ measures the deviation of $x$ from the canonical pronunciation, parameterized by $\theta$, and $\tau$ is a pre-defined threshold (possibly context-dependent). For Native Norm, $\theta$ represents the native pronunciation system, while for Target Norm, $\theta$ represents the pronunciation system of the L2 population.
Both Native Norm and Target Norm assume an objective measure, i.e., they do not explicitly model the inner perception process of human listeners, even though these listeners are the communication targets and the evaluators in pronunciation assessment.
In fact, speech assessment is a complex perceptual and psychological process, rather than merely an objective measurement of how far a produced speech signal departs from a reference pronunciation. It depends on a listener’s linguistic background and experience. As an annotator, the assessment also depends on error tolerance, personal pronunciation habits, and individual preference.
If we believe that human MD behavior is the reference, then MD (and pronunciation assessment in general) should not be reduced to a simple distance measurement from an “accepted canonical pronunciation”. A more ideal MD system should model the complex perceptual and psychological processes of human annotators. Therefore, the central goal of MD is to simulate how human annotators evaluate pronunciations—one of the central ideas of Perception Norm.

3.2.2. From End-to-End Learning to Perception Simulation

Modern MD methods are often based on end-to-end models. In this setting, the model learns to predict MD labels $y$ from the input $x$ and context $c$, where the model is parameterized by $\theta$. Formally, this can be written as
$$l = p(z \mid x, c; \theta),$$
$$y = \mathbb{I}[l > \tau].$$
At first glance, this formulation looks similar to Perception Norm. The key difference, however, is that an end-to-end approach typically learns whatever labels the dataset provides, without explicitly considering what those labels represent. In contrast, Perception Norm emphasizes learning the perception process of individual annotators, and thus uses annotations from a single annotator at a time.
This seemingly trivial difference is significant not only conceptually but also in practice. In fact, most datasets are labeled by multiple annotators, and final labels are often derived by aggregation, which (1) may hide substantial variability among annotators; and (2) introduces a social mechanism (voting/averaging). As a result, a model trained on aggregated labels may learn a complex process that involves both individual perceptual/psychological activities and the social aggregation mechanism.
Learning “individual” annotators, as Perception Norm advocates, has several advantages. First, it learns a single annotator and is therefore not directly troubled by inter-annotator variation, reducing the complexity of the learning problem. Second, it learns only the perception/annotation process, without entangling the aggregation process, which further simplifies the learning. Finally, it naturally supports an ensemble approach, which first learns a group of MD models, each representing one annotator. Once learned, these individual models can form an expert committee, using a human-defined aggregation rule (e.g., majority voting). To change the system behavior, one can simply modify the aggregation rule, without retraining the individual models.

4. UY/CH-CHILD-MA Dataset

In previous work, we designed UY/CH-CHILD, a Mandarin speech corpus produced by Uyghur children (4–12 years old), for studying accented child speech and pronunciation modeling [48]. In this work, we reuse this dataset and provide additional human annotations to support Perception Norm research. We denote the new dataset by UY/CH-CHILD-MA, where MA means “multiple annotations”. For completeness, we review the original UY/CH-CHILD dataset, and then present the multiple annotations.

4.1. Data Collection

UY/CH-CHILD is a prompted word-production corpus designed for studying accented child speech and segmental/tonal pronunciation modeling. In this work, we reuse the original recordings and phonetic annotations of UY/CH-CHILD and extend them with multiple independent MD labels. The construction of the original corpus can be summarized as (i) curating a phonetically representative set of target words; (ii) eliciting and recording prompted productions from Uyghur children; (iii) extracting speech segments corresponding to the prompted target words; and (iv) annotating both the canonical and realized pronunciations in Pinyin. For a complete description of the original corpus and protocol, we refer readers to [48].
The target-word list follows the articulation test for Mandarin Chinese-speaking preschoolers developed by the Institute of Linguistics at the Chinese Academy of Social Sciences (CASS) [49,50]. It contains 174 common, imageable words with 1–3 Chinese characters (syllables), covering key phonological features of Mandarin (e.g., syllables, tones, syllable combinations, weak stress, and rhotacism). This design keeps the task focused on pronunciation production rather than lexical knowledge.
To accommodate different ages, the 174 words were organized into two prompt sheets. Test A contains 138 relatively simple and highly recognizable items (e.g., “hand”, “flower”), and Test B contains 140 more challenging items (e.g., “crocodile”, “axe”, “chalk”). The two sheets have 104 overlapping words. Test A was mainly used for younger children (4–5 years old) in kindergarten, whereas Test B was used for older children in kindergarten and primary school.
Recordings were conducted in two phases (May 2022 and February 2023) in Ili Prefecture, Xinjiang Uyghur Autonomous Region, China. Participants were Uyghur children (4–12 years old) from Uyghur-speaking families who attended kindergartens or primary schools where Chinese is the primary instructional language. Basic speaker information was collected, and parental/guardian consent was obtained under privacy-protection procedures.
Speech was recorded in quiet rooms using a laptop at 16 kHz, 16-bit, single-channel settings. During recording, a tutor presented pictures of the target words and elicited spoken responses; when needed, a prerecorded reference prompt was played to facilitate word identification. Importantly, the tutor did not provide corrective feedback once the child identified the intended word, and the full session audio was retained to support downstream segmentation and annotation.

4.2. Multiple Annotations

All recordings were uploaded to a web-based annotation platform. We first performed target-segment extraction to locate clearly spoken instances of the prompted items. This segmentation step was carried out by trained student assistants, whose task was to identify cleanly articulated segments regardless of pronunciation correctness.
The extracted segments were then annotated for pronunciations and phone-level MD labels. Annotators listened to each segment and marked deviations from the canonical pronunciation in Pinyin. Each syllable is represented as Initial∼Final∼Tone. When an annotator could not confidently determine a component, an “*” tag was used. For example, if the canonical form is guang1, an annotation g∼uan∼* indicates a final mismatch (“uang”→“uan”) with uncertain tone. For the MD task, each canonical phone position is assigned a binary label: 1 indicates “mispronounced” and 0 indicates “correct”.
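The following minimal sketch illustrates how a syllable-level annotation of this form can be mapped to phone-level binary labels; the component-wise string comparison and the treatment of the “*” tag as an explicit “uncertain” value are simplifying assumptions for illustration, not the exact logic of the annotation platform.

```python
def md_labels(canonical: str, annotated: str):
    """Compare a canonical syllable (Initial~Final~Tone) with an annotated
    realization and return one binary label per component:
    1 = mispronounced, 0 = correct, None = uncertain ('*' tag)."""
    labels = []
    for canon, real in zip(canonical.split("~"), annotated.split("~")):
        if real == "*":
            labels.append(None)  # annotator could not decide (assumption)
        else:
            labels.append(int(real != canon))
    return labels


# Canonical "guang1" written as g~uang~1; the annotation g~uan~* marks a
# final mismatch ("uang" -> "uan") with an uncertain tone.
print(md_labels("g~uang~1", "g~uan~*"))  # [0, 1, None]
```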
We collected MD labels in two phases. In Phase I, segments were annotated in a collective manner: each segment received two initial labels, and a third annotator adjudicated disagreements, yielding an effectively majority-based decision. Because the adjudicator varies across segments, the resulting labels reflect a collective behavior rather than any single individual; we thus treat it as a virtual “collective annotator” in the analyses. In Phase II, the same set of segments was labeled independently by three annotators (Anno1–Anno3) to enable the explicit study of inter-annotator variation.
  • Collective annotation. We assign each segment to two annotators for independent labeling. If the two labels agree, the decision is recorded; otherwise, a third annotator listens to the segment and makes the final decision.
  • Individual annotation. We distribute all segments to three annotators who label each segment independently, producing three parallel label sets for the same speech material.
Annotator profiles. To support interpretation of inter-annotator variation under Perception Norm, Table 2 provides a coarse-grained profile of the three individual annotators (Anno1–Anno3), including native-language background, relevant training, and prior experience with child/L2 speech assessment. To protect privacy, we report only coarse categories and do not include any personally identifiable information. A key observation is that the annotators do not differ in these coarse background factors, indicating that the substantial cross-annotator variation observed later is mainly attributable to individual preference.

4.3. Characteristics of UY/CH-CHILD-MA

Two factors make UY/CH-CHILD-MA particularly informative for Perception Norm.
  • Accent and phonological transfer.
Uyghur and Mandarin differ in phonemic inventories, syllable structure, and phonotactics. Transfer can induce systematic substitutions and distortions that are perceptually ambiguous.
  • Child speech variability.
Child speech exhibits higher acoustic variability and developmental effects, which can increase perceptual uncertainty and annotator dependence.
Together, these factors increase the space of “borderline” pronunciations, amplifying listener variability. To give a clear picture of UY/CH-CHILD-MA, some basic statistics are shown in Table 3, and the agreement of the collective annotator and the three individual annotators is shown in Table 4.

5. Mispronunciation Detection Models

Following previous work on Target Norm [18], we employ an end-to-end MD model based on the Transformer architecture. We briefly describe the model in this section.

5.1. Model Architecture

We take as input a raw speech waveform together with its canonical phone sequence $S = (s_1, \ldots, s_N)$. A frozen HuBERT encoder transforms the waveform into a frame-level feature sequence $X = (x_1, \ldots, x_T)$. The model then predicts, for each canonical phone position $n$, a binary mispronunciation label $y_n \in \{0, 1\}$, where $y_n = 1$ denotes an error (substitution or deletion) and $y_n = 0$ denotes a correct realization. As illustrated in Figure 2, the network contains (i) a speech branch trained with an auxiliary CTC objective for phone recognition and (ii) a phone branch trained with a BCE objective to produce phone-level MD posteriors.

5.1.1. Speech Branch

The speech branch consumes 1024-dimensional HuBERT features and projects them to 256-dimensional vectors via a linear layer, followed by positional encoding. The resulting sequence is processed by a 6-layer Speech Transformer. Each layer applies (i) multi-head self-attention over the speech stream, (ii) speech-to-phone cross-attention that conditions on the phone-branch representations, and (iii) a position-wise feed-forward network.
We use 4-head self-attention with $d_{\text{model}} = 256$. The speech-to-phone cross-attention uses a single head with $d_{\text{model}} = 256$; speech hidden states act as queries and attend to phone representations as keys/values, and LayerNorm is applied within the cross-attention block. The feed-forward subnetwork has dimensions 256 × 512 × 256 with ReLU activation.
For the auxiliary phone-recognition objective, the Speech Transformer outputs are projected to 218 logits (214 pronounced phones, one blank, and three special tokens) at each (subsampled) time step. A softmax is applied to obtain posteriors, and the CTC loss is computed against the canonical phone sequence.

5.1.2. Phone Branch

The phone branch starts from the canonical phone sequence $S = (s_1, \ldots, s_N)$. Phone IDs are embedded into $E \in \mathbb{R}^{N \times 256}$ using a 256-dimensional embedding layer (vocabulary size 69). We then apply a position-wise feed-forward block (256 × 256 × 256, ReLU) and add positional encoding.
The encoded phone sequence is passed through a 4-layer Phone Transformer (4-head self-attention with $d_{\text{model}} = 256$, followed by a 256 × 512 × 256 ReLU feed-forward network). To produce MD decisions conditioned on the speech evidence, we further stack a 4-layer MD Transformer that alternates (i) self-attention over phone positions, (ii) phone-to-speech cross-attention (single head; phone states query the speech states; LayerNorm inside the cross-attention block), and (iii) a 256 × 512 × 256 ReLU feed-forward module.
Finally, the MD Transformer outputs are fed into an additional position-wise FFN (output dimension 256, ReLU) and a linear projection 256 → 1 per phone position. After a sigmoid, we obtain phone-level posteriors $\hat{y}_n \in (0, 1)$ and compute the BCE loss against the corresponding MD label $y_n$.
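To make the overall data flow concrete, the following PyTorch sketch outlines a simplified version of the two-branch architecture. It follows the layer counts and dimensions given above but reduces the alternating, bidirectional cross-attention to a single decoder-style cross-attention (phone states querying speech states), uses multi-head rather than single-head cross-attention, and omits positional encoding; it is an illustrative sketch under these assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class MDModel(nn.Module):
    """Simplified two-branch MD model (sketch).

    The speech branch encodes frozen HuBERT features and feeds a CTC head;
    the phone branch encodes the canonical phone sequence and cross-attends
    to the speech states to produce one mispronunciation logit per canonical
    phone position.
    """

    def __init__(self, d_model=256, n_phones=69, n_ctc_out=218):
        super().__init__()
        # Speech branch: 1024-d HuBERT features -> 256-d, 6 Transformer layers.
        self.speech_proj = nn.Linear(1024, d_model)
        speech_layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(speech_layer, num_layers=6)
        self.ctc_head = nn.Linear(d_model, n_ctc_out)  # 214 phones + blank + 3 specials

        # Phone branch: embed canonical phones, 4 self-attention layers.
        self.phone_emb = nn.Embedding(n_phones, d_model)
        phone_layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.phone_encoder = nn.TransformerEncoder(phone_layer, num_layers=4)

        # MD branch: phone states query the speech states (cross-attention).
        md_layer = nn.TransformerDecoderLayer(
            d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.md_decoder = nn.TransformerDecoder(md_layer, num_layers=4)
        self.md_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, hubert_feats, phone_ids):
        # hubert_feats: (B, T, 1024); phone_ids: (B, N)
        speech = self.speech_encoder(self.speech_proj(hubert_feats))  # (B, T, 256)
        ctc_logits = self.ctc_head(speech)                            # (B, T, 218)
        phones = self.phone_encoder(self.phone_emb(phone_ids))       # (B, N, 256)
        md_states = self.md_decoder(tgt=phones, memory=speech)       # (B, N, 256)
        md_logits = self.md_head(md_states).squeeze(-1)              # (B, N)
        return ctc_logits, md_logits
```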

5.1.3. Training Objectives

We optimize the network with two complementary objectives: an auxiliary CTC loss from the speech branch (phone recognition) and a BCE loss from the phone branch (phone-level mispronunciation detection). The overall objective is the weighted sum:
$$\mathcal{L} = \mathcal{L}_{\text{CTC}} + \lambda \, \mathcal{L}_{\text{BCE}},$$
where λ balances the two terms. Intuitively, the CTC term encourages robust phone-discriminative representations in the speech stream, while the BCE term specializes the model for detecting phone-level errors conditioned on the canonical phone sequence.
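A minimal sketch of this weighted objective is given below, reusing the (assumed) outputs of the architecture sketch above; the function name, padding conventions, and the blank-index choice are illustrative.

```python
import torch.nn.functional as F


def md_loss(ctc_logits, md_logits, canonical_ids, input_lens, target_lens,
            md_labels, lam=2.0):
    """L = L_CTC + lambda * L_BCE (sketch).

    ctc_logits: (B, T, 218) speech-branch outputs; md_logits: (B, N) phone-branch
    logits; canonical_ids: (B, N) CTC targets; md_labels: (B, N) binary labels.
    """
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)  # (T, B, C)
    ctc = F.ctc_loss(log_probs, canonical_ids, input_lens, target_lens)
    bce = F.binary_cross_entropy_with_logits(md_logits, md_labels.float())
    return ctc + lam * bce
```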

5.2. Two-Stage Training: Native Speech Pre-Training and L2 Fine-Tuning

We follow a two-stage training recipe. The first stage learns general Mandarin acoustic–phonetic representations from large-scale native speech, and the second stage adapts the MD component to the target learner population and/or a specific annotator, depending on the norm being simulated.
  • Stage 1: Native-speech pre-training.
We pre-train on the 1000-h AISHELL-2 corpus [51]. For each utterance, we convert the transcript into a canonical Pinyin/phone sequence and create synthetic MD supervision by randomly substituting a small subset of phones with phones from the same broad class (vowel or consonant). The CTC target remains the original canonical sequence, while the BCE target indicates whether each canonical phone position has been corrupted. This produces self-supervised MD training signals without using L2 speech or human MD labels.
  • Stage 2: L2 fine-tuning.
We fine-tune on UY/CH-CHILD-MA and compare two adaptation regimes that correspond to different norms. (i) Synthetic-label adaptation uses the same corruption mechanism to generate labels and primarily aligns the model to the target population acoustics, serving as an operationalization of Target Norm. (ii) Human-label adaptation uses MD labels from a designated annotator and directly fits the listener-specific decision behavior, serving as an operationalization of Perception Norm.

6. Experiments

6.1. Data

Our experiments focus on Chinese L2 speech, using two corpora: AIShell-2 for pre-training and UY/CH-CHILD-MA for fine-tuning.
  • AIShell-2
We use AISHELL-2 [51], a large-scale Mandarin read-speech corpus (about 1000 h; 1991 speakers) with an official train/dev/test split, as the source of native-speech pre-training data.
  • UY/CH-CHILD-MA
We use UY/CH-CHILD-MA for fine-tuning and evaluation. Due to the limited data size, we adopt a 10-fold speaker-level cross-validation protocol. Specifically, the 106 speakers are partitioned into 10 folds. For each fold, we use one fold (held-out speakers) as the test split and the remaining folds as the adaptation split for fine-tuning. Speaker identities are strictly separated between adaptation and test splits to avoid speaker leakage. We report the mean and standard deviation of the main evaluation metrics across the 10 folds.
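For illustration, the speaker-level fold construction can be sketched as follows; the speaker identifiers and the random seed are hypothetical, and the actual partition used in our experiments may differ.

```python
import numpy as np


def speaker_folds(speaker_ids, n_folds=10, seed=0):
    """Partition speakers (not utterances) into folds so that no speaker
    appears in both the adaptation and test splits of the same fold."""
    rng = np.random.default_rng(seed)
    speakers = rng.permutation(sorted(set(speaker_ids)))
    folds = np.array_split(speakers, n_folds)
    for k in range(n_folds):
        test_spk = set(folds[k])
        adapt_spk = set(speakers) - test_spk
        yield adapt_spk, test_spk


# Example with hypothetical utterance-level speaker IDs (106 speakers).
speaker_ids = [f"spk{i:03d}" for i in range(106) for _ in range(5)]
for adapt, test in speaker_folds(speaker_ids):
    assert adapt.isdisjoint(test)  # no speaker leakage across splits
```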

6.2. Evaluation Metrics

We evaluate phone-level MD as a binary detection problem [7]. For each canonical phone position, the system outputs a binary decision (or a posterior) indicating whether the phone is perceived as mispronounced, which we compare against the corresponding human label. We report counts in a confusion-matrix style:
  • True reject (TR; true positive): A mispronounced phone correctly flagged as mispronounced.
  • False reject (FR; false positive): A correct phone incorrectly flagged as mispronounced.
  • True accept (TA; true negative): A correct phone correctly accepted as correct.
  • False accept (FA; false negative): A mispronounced phone incorrectly accepted as correct.
Precision and recall are then computed as
$$\text{Precision} = \frac{\text{TR}}{\text{TR} + \text{FR}}, \qquad \text{Recall} = \frac{\text{TR}}{\text{TR} + \text{FA}}.$$
We report the F1 score
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
as well as the ROC–AUC, which summarizes the trade-off between true-positive and false-positive rates across decision thresholds.
To better characterize system behavior in realistic use, we report results at two representative operating points. The first fixes precision at 0.5, i.e., among the phones flagged by the system, approximately half are true mispronunciations; we treat this as a reasonable working point for CAPT feedback. The second fixes recall at 0.5, meaning the system identifies about half of all mispronounced phones, which reflects a balanced detection regime. In addition to these two points, we report the full precision–recall (PR) curve to summarize performance across the entire operating range.
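For clarity, the sketch below shows one way to locate such an operating point from phone-level posteriors via the PR curve (choosing the threshold whose precision is closest to the target); the selection heuristic and the toy labels/scores are illustrative assumptions, not the exact evaluation script.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score


def report_at_fixed_precision(y_true, scores, target_precision=0.5):
    """Pick the threshold whose precision is closest to the target and
    report recall/F1 at that operating point, plus the overall ROC-AUC."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have len(thresholds)+1 entries; drop the final point.
    idx = np.argmin(np.abs(precision[:-1] - target_precision))
    p, r = precision[idx], recall[idx]
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return {"threshold": thresholds[idx], "precision": p, "recall": r,
            "f1": f1, "roc_auc": roc_auc_score(y_true, scores)}


# Hypothetical phone-level labels (1 = mispronounced) and model posteriors.
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.8, 0.4, 0.2, 0.6, 0.3, 0.5, 0.9])
print(report_at_fixed_precision(y_true, scores, target_precision=0.5))
```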

6.3. Settings

All models are implemented in PyTorch 2.6.0 using standard Transformer blocks. A HuBERT pre-trained front-end is used to extract acoustic features. The HuBERT model (chinese-hubert-large, with several hundred million parameters) is pre-trained on the WenetSpeech L subset, which contains approximately 10,000 h of Mandarin Chinese speech, and was downloaded from the official website (https://huggingface.co/TencentGameMate/chinese-hubert-large, accessed on 20 February 2026).
  • Synthetic MD labels.
We generate synthetic MD labels on the fly by randomly substituting phones with other phones sampled from the same broad category (vowel vs. consonant); a minimal code sketch of this procedure is given at the end of this subsection. For each phone sequence, corruption is applied with probability 0.9, and at most 50% of the phones can be substituted. We deliberately use relatively strong corruption to ensure that pre-training receives a stable and diverse supervision signal for learning robust MD-related representations; in our experience, milder corruption often leads to less stable pre-training and weaker downstream MD performance.
  • Optimization.
We use gradient clipping in all experiments. For AISHELL-2 pre-training, we adopt a learning-rate schedule with a peak rate of $1 \times 10^{-4}$ and 10 warm-up epochs. The batch size is 128, and λ in Equation (7) is set to 2. We pre-train for 25 epochs and take the averaged checkpoints from epochs 20–25 as the final pre-trained model.
For UY/CH-CHILD-MA fine-tuning, we initialize from the epoch-25 checkpoint and use a smaller learning rate ($1 \times 10^{-5}$) with batch size 32. We report results using the averaged checkpoints from epochs 15–25. The two test conditions in the fine-tuning stage are summarized below.
  • In the self-supervised MD condition, synthetic MD labels are again generated via random substitution, following the same principle as in the pre-training stage.
  • In the human-supervised MD condition, we train with human MD labels and focus on substitutions and deletions, while ignoring insertions. This choice aligns with the goal of the MD task—determining whether an intended phone is well pronounced—and is consistent with our previous work [18].
In this work, phone-level MD labels are defined on canonical phone positions, i.e., whether the intended phone is perceived as mispronounced (substitution or deletion relative to the canonical sequence). Insertion errors do not naturally map to an intended canonical position without introducing additional alignment/labeling conventions (e.g., insertion slots between phones). Therefore, we exclude insertions in the current formulation to keep the label space consistent across annotators. We note that this choice may under-represent certain error patterns, but it ensures a consistent and fair performance comparison. In fact, many recent studies have adopted this “insertion exclusion” protocol, as it is more consistent with the definition of the MD task. Further discussion can be found in [18].
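For reference, the synthetic corruption procedure described above (sequence-level corruption probability 0.9, at most 50% of phones substituted within the same broad class) can be sketched as follows; the toy phone inventory and the exact sampling details are illustrative assumptions.

```python
import random


def corrupt_phones(phones, vowels, consonants, p_corrupt=0.9, max_frac=0.5,
                   rng=random):
    """Generate synthetic MD supervision by substituting phones with others
    from the same broad class (vowel vs consonant).

    Returns (corrupted_sequence, labels) where labels[i] = 1 means position i
    was corrupted. The CTC target remains the original canonical sequence.
    """
    corrupted = list(phones)
    labels = [0] * len(phones)
    if rng.random() < p_corrupt:  # corrupt this sequence at all?
        n_sub = rng.randint(1, max(1, int(max_frac * len(phones))))
        for i in rng.sample(range(len(phones)), n_sub):
            pool = vowels if phones[i] in vowels else consonants
            candidates = [p for p in pool if p != phones[i]]
            if candidates:
                corrupted[i] = rng.choice(candidates)
                labels[i] = 1
    return corrupted, labels


# Toy phone inventory (illustrative, not the actual 214-phone set).
vowels = {"a1", "o1", "e1", "uang1", "uan1"}
consonants = {"g", "k", "h", "zh", "sh"}
print(corrupt_phones(["g", "uang1"], vowels, consonants))
```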

6.4. Experiment 1: Native Norm, Target Norm, and Perception Norm

6.4.1. Performance of Native Norm

The pre-trained model can be regarded as following Native Norm, as it was trained on large-scale native speech to detect mismatches between speech and phone labels. Table 5 shows the results when its predictions are evaluated with the annotations of different annotators as the ground truth. (Note that the missing value at target precision = 0.5 arises because there are very few positives (label = 1), so the operating point with precision = 0.5 could not be readily found.) It can be seen that there is a clear gap across annotators, confirming our argument that speech assessment is highly subjective and that there is no commonly agreed gold standard.

6.4.2. Performance of Target Norm

In the next experiment, we use synthetic MD labels to fine-tune the model. This fine-tuning adapts the model to the target pronunciation and essentially realizes Target Norm.
The results are shown in Table 6, where the fine-tuning and test are conducted independently for each annotator. Two observations can be made: (1) once again, performance varies significantly across annotators, confirming that annotation is subjective and annotator-specific; (2) compared to Table 5, significant and consistent improvements are observed. Since the models have been adapted to the target population and the test employs the human labels as ground truth, the obtained improvement implies that human annotators have adapted to the target pronunciation (as the models do), as advocated by Target Norm.

6.4.3. Performance of Perception Norm

In this experiment, we use human labels to fine-tune the model. Specifically, we fine-tune and test an individual model using labels from a single annotator. This learns the perceptual behavior of that annotator and thus represents Perception Norm. The results are shown in Table 7.
To complement the point-wise comparisons at fixed precision–recall, we further examine the overall ranking behavior of the two norms by plotting the full precision–recall (PR) curve. Figure 3 compares Target Norm and Perception Norm under collective-label supervision. As shown, Perception Norm yields a consistently higher PR curve across the entire recall range, with especially clear advantages in the mid-to-high recall region, suggesting substantially improved ranking quality and more favorable trade-offs for downstream CAPT feedback.
Compared to the Target Norm (Table 6), the model trained under Perception Norm achieves substantially higher F1 scores and ROC-AUC. This suggests that accent-specific adaptation explains only part of label alignment. Modeling listener-specific perceptual criteria explains substantially more. In other words, adaptation-to-accent is necessary but insufficient; adaptation-to-listener is decisive.
For example, at the operating point precision = 0.5, F1 for Anno1 improves from 0.2343 under Target Norm (Table 6) to 0.5429 under Perception Norm (Table 7), highlighting the additional gain from listener-specific alignment beyond accent/population adaptation.
Interestingly, the collective labels do not always yield the highest F1 or ROC-AUC, implying that aggregated annotation is not necessarily more stable than individual annotation.

6.4.4. Statistical Significance

To assess whether Perception Norm consistently outperforms Target Norm across folds, we conduct a paired two-sided Wilcoxon signed-rank test over the 10 fold-level scores (paired by fold) for each annotator. We report p-values together with the mean differences in Table 8.
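For reference, a minimal sketch of this fold-level paired test is shown below; the fold-level F1 values are hypothetical placeholders, not the reported results.

```python
from scipy.stats import wilcoxon

# Fold-level F1 scores for one annotator (hypothetical values), paired by fold.
target_norm_f1     = [0.21, 0.25, 0.23, 0.22, 0.26, 0.24, 0.20, 0.23, 0.25, 0.22]
perception_norm_f1 = [0.52, 0.55, 0.51, 0.54, 0.56, 0.53, 0.50, 0.55, 0.54, 0.53]

stat, p_value = wilcoxon(perception_norm_f1, target_norm_f1, alternative="two-sided")
mean_diff = sum(a - b for a, b in zip(perception_norm_f1, target_norm_f1)) / 10
print(f"W = {stat:.1f}, p = {p_value:.4f}, mean diff = {mean_diff:.3f}")
```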

6.5. Experiment 2: Inter-Annotator Variation

In the previous section, we empirically demonstrated that learning the perception process of human annotators is possible, and Perception Norm is a reasonable framework for MD. In this section, we further demonstrate that different annotators behave differently, and so there is not a gold standard ground truth for the MD task.

6.5.1. Correlation Analysis

We first analyze the label agreement and Pearson correlation among the four human annotators. The results are shown in Table 9. It can be seen that the Pearson correlations are not high—mostly below 0.5, confirming our argument that annotators are relatively independent. However, the label agreement degrees are high, indicating that the annotators agree in a large proportion of the samples, and the low correlation is due to a small set of confusing words and phrases. Another observation is that Anno1 and Anno3 are more alike. Finally, although the collective annotator is an aggregation of multiple annotators, there is no clear evidence that it is a representative annotator, as the correlation between the collective annotator and the three individual annotators is not significantly higher than the correlations among the three individual annotators.

6.5.2. Disagreement Concentration Analysis

To characterize when Perception Norm matters most, we analyze where annotator disagreement concentrates. At the phone level, disagreement is substantially higher for vowel/final units (8.36%) than for consonant/initial units (2.26%), suggesting that vowel and tone realizations introduce stronger perceptual ambiguity. The most disagreement-prone phones are mainly vowel+tone units, including iy3 (40.6%), u5 (38.3%), ang3 (35.8%), uang3 (35.3%), eng5 (35.3%), and ou5 (34.5%). At the speaker level, disagreement decreases with age (4–6 years: 5.79%; 7–9 years: 4.88%; 10–12 years: 2.98%), consistent with higher acoustic variability and less stable production in younger children. To further illustrate these patterns, representative borderline cases are shown in Table 10.
These examples mainly involve vowel- and tone-related units, which are more prone to perceptual ambiguity. As a result, the same token may lie near the perceptual decision boundary, leading to inconsistent labels across annotators.

6.6. Cross-Annotator Evaluation

In this experiment, we align a model fine-tuned with labels of a particular annotator to other annotators, by performing the test using the labels of different annotators as the ground truth. By this experiment, we wish to examine whether the fine-tuned models are annotator-specific. The results of the model fine-tuned with the collective labels are shown in Table 11, and the models fine-tuned with labels of the three individual annotators are shown in Table 12, Table 13 and Table 14. Some observations are as follows:
  • A model fine-tuned with a particular annotator’s labels achieves the best F1 score when aligned with that annotator, indicating that the model is simulating the annotator’s behavior.
  • The relative performance across annotators largely reflects the human correlations shown in Table 9, further confirming that the annotators behave very differently from each other, and fine-tuning can effectively capture the specific characteristics of single annotators.
  • ROC-AUC does not drop dramatically across annotators, suggesting that annotators largely agree on the ranking of sample difficulty, while differing more in score calibration and acceptance/rejection thresholds.
Table 11. Cross-annotator evaluation with model aligned to the collective annotator.

Operating Point     Annotator     Precision   Recall    F1-Score   ROC-AUC
Target Precision    Collective    0.5         0.6286    0.5533     0.94
Target Precision    Anno1         0.5         0.4045    0.4235     0.94
Target Precision    Anno2         0.5         0.3499    0.3684     0.96
Target Precision    Anno3         0.5         0.5268    0.5015     0.92
Target Recall       Collective    0.6078      0.5       0.5464     0.94
Target Recall       Anno1         0.4493      0.5       0.4679     0.94
Target Recall       Anno2         0.3915      0.5       0.4285     0.96
Target Recall       Anno3         0.5530      0.5       0.5139     0.92
Table 12. Cross-annotator evaluation with model aligned to Anno1.

Operating Point     Annotator     Precision   Recall    F1-Score   ROC-AUC
Target Precision    Anno1         0.5         0.6057    0.5429     0.95
Target Precision    Anno2         0.5         0.3240    0.3481     0.96
Target Precision    Anno3         0.5         0.4595    0.4487     0.91
Target Recall       Anno1         0.6085      0.5       0.5476     0.95
Target Recall       Anno2         0.3561      0.5       0.3996     0.96
Target Recall       Anno3         0.5091      0.5       0.4898     0.91
Table 13. Cross-annotator evaluation with model aligned to Anno2.

Operating Point     Annotator     Precision   Recall    F1-Score   ROC-AUC
Target Precision    Anno1         0.5         0.3308    0.3880     0.89
Target Precision    Anno2         0.5         0.5015    0.4688     0.96
Target Precision    Anno3         0.5         0.3261    0.3756     0.87
Target Recall       Anno1         0.2953      0.5       0.3620     0.89
Target Recall       Anno2         0.5403      0.5       0.4996     0.96
Target Recall       Anno3         0.3019      0.5       0.3587     0.87
Table 14. Cross-annotator evaluation with model aligned to Anno3.

Operating Point     Annotator     Precision   Recall    F1-Score   ROC-AUC
Target Precision    Anno1         0.5         0.4133    0.4241     0.92
Target Precision    Anno2         0.5         0.3699    0.3950     0.96
Target Precision    Anno3         0.5         0.7355    0.5879     0.95
Target Recall       Anno1         0.4490      0.5       0.4604     0.92
Target Recall       Anno2         0.3738      0.5       0.4112     0.96
Target Recall       Anno3         0.7429      0.5       0.5914     0.95

6.7. Experiment 3: Utility of Multiple Annotators

In this section, we investigate how multiple annotators can be used effectively. We first check whether simple voting can offer any benefits, and then try the ensemble approach, which averages the prediction of a group of individual models.

6.7.1. Whether Voting Is Reliable

First, we examine whether voting can be regarded as a reliable ground truth. For that purpose, we examine three “voting rules” for the individual human labels, and examine whether the voting labels are closely related to the collective labels, which essentially follows a majority voting rule.
The results are shown in Table 15. It can be seen that among the three voting rules, majority voting leads to a better alignment to the collective annotator. More importantly, when compared to the labels of the individual annotators, the majority voting labels obtain the highest agreement score and the second highest Pearson correlation when aligned with the collective annotator. Since the collective annotator is also derived via majority voting, these results suggest that majority voting can reduce annotation uncertainty. However, the correlation between the two majority voting-based annotators is not substantially higher than that between any two individual annotators, or an individual annotator and a majority voting-based annotator. This means that voting among only three annotators (as in both the collective annotator and the majority voting annotator in Table 15) may be insufficient to yield a fully stable “mean” annotator.

6.7.2. Majority Voting Model and Score Average

We now investigate how to utilize multiple annotators effectively. Traditionally, this is achieved by training a majority voting model, i.e., fine-tuning the pre-trained model using majority voting labels. This is essentially the popular end-to-end approach. As mentioned, it simulates complex MD processing involving both individual perception and social aggregation.
We present a new approach to utilizing multiple annotators more effectively. Specifically, we train three annotation models (anno1–3) for the three individual annotators, respectively, and then score the test pronunciation with the three models. After calibration, the scores produced by the three models are averaged to perform the MD decision.
The calibration step is designed to ensure comparability across the three individual annotation models. We adopt “temperature scaling” as a simple yet effective post hoc calibration method.
Within each cross-validation fold, after fine-tuning is completed, we construct a held-out calibration set by sampling 10% of the training speakers from the adaptation split (speaker-level sampling). Given the pre-sigmoid logits z produced by an individual model, a scalar temperature parameter T > 0 is then learned by minimizing the negative log-likelihood on this calibration set, while keeping all other model parameters fixed. The test speakers are never used for calibration.
$$\mathrm{Calib}(z; T) = \sigma\!\left(\frac{z}{T}\right),$$
where $\sigma(\cdot)$ denotes the sigmoid function.
This procedure rescales the confidence of each individual model without altering its ranking behavior, thereby aligning their probabilistic outputs under a common decision framework. After the temperature scaling, we average the calibrated probabilities as follows:
$$\bar{p}(x) = \frac{1}{3} \sum_{r=1}^{3} \mathrm{Calib}\left(z_r(x); T_r\right),$$
where $T_r$ is the annotator-specific temperature parameter and $z_r(x)$ is the pre-sigmoid logit produced by the model aligned to annotator $r$.
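A minimal sketch of the calibration and averaging steps is given below; the exp-parameterization of $T$ (to keep it positive) and the choice of L-BFGS as the optimizer are implementation assumptions for illustration.

```python
import torch


def fit_temperature(logits, labels, max_iter=100):
    """Learn a scalar temperature T > 0 by minimizing the NLL on a held-out
    calibration set, keeping all other model parameters fixed."""
    log_t = torch.zeros(1, requires_grad=True)      # T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], max_iter=max_iter)
    bce = torch.nn.BCEWithLogitsLoss()

    def closure():
        optimizer.zero_grad()
        loss = bce(logits / log_t.exp(), labels.float())
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()


def committee_score(logits_per_annotator, temperatures):
    """Average the calibrated posteriors of the annotator-specific models."""
    probs = [torch.sigmoid(z / t) for z, t in zip(logits_per_annotator, temperatures)]
    return torch.stack(probs).mean(dim=0)


# Hypothetical calibration logits/labels for one annotator-specific model.
cal_logits = torch.tensor([2.0, -1.0, 0.5, -2.5, 1.5])
cal_labels = torch.tensor([1, 0, 1, 0, 1])
T1 = fit_temperature(cal_logits, cal_labels)
```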
We compare the performance of the voting model approach and the score average approach, with the majority voting labels as the ground truth. The performance is shown in Table 16. It can be seen that the score average approach slightly outperforms the majority voting model, and the improvement is statistically significant according to Table 17, although the training objective of the majority voting model is to predict the majority voting labels. This is consistent with our argument that learning individual perception is easier than learning the entire MD process.
Beyond comparing summary metrics (e.g., ROC-AUC/PR-AUC and the fixed operating points), we also examine ranking quality across the full operating range using precision–recall (PR) curves. Figure 4 contrasts the score-averaging ensemble with the voting-label-trained single model. The ensemble yields a consistently higher PR curve across most recall values, indicating a more favorable precision–recall trade-off and more robust ranking behavior for downstream use.
  • Statistical significance
We conduct a paired two-sided Wilcoxon signed-rank test over the 10 fold-level scores (paired by fold) to assess whether the score average committee significantly outperforms the majority voting model. The results are summarized in Table 17.
This result suggests an ensemble paradigm that learns individual models to simulate individual annotators, and then makes decisions using a separate aggregation rule. A key advantage of this ensemble approach, compared to the end-to-end approach using the aggregation labels (e.g., voting labels), is that the aggregation function can be adjusted according to the request of real applications, without re-training the model. This provides a powerful paradigm that can simulate the social mechanism of voting (or other aggregation rules) with limited additional cost.

7. Discussion

7.1. Does the Perception Norm Lead to Subjective Evaluation?

Yes—Perception Norm explicitly acknowledges that pronunciation assessment is subjective, because it depends on listeners’ perceptual boundaries, priors, and acceptance thresholds. Importantly, “subjective” does not mean “arbitrary”. In our experiments, annotators show high label agreement while exhibiting low-to-moderate Pearson correlations (Table 9), suggesting that they largely agree on clearly correct/incorrect cases but differ on borderline cases and calibration. This is consistent with a signal-detection view: different annotators may share similar rankings of difficulty (reflected by relatively stable ROC-AUC across cross-annotator evaluation) but apply different decision thresholds and severities.
From an application perspective, subjectivity is unavoidable and must be specified rather than ignored. Any MD system that is trained and evaluated against human labels is implicitly tied to a listener population and an operational definition of “error”. Perception Norm makes this dependence explicit: the system designer can decide which listener(s) the system is meant to simulate and how multiple judgments should be aggregated for the target use case.

7.2. Implication for MD System Design

Perception Norm suggests separating two components that are conflated in many end-to-end approaches: (i) individual perceptual modeling and (ii) social aggregation. Concretely, instead of directly training a single model on aggregated labels, one can train annotator-specific models via fine-tuning (Table 7) and then combine them at inference time.
This separation has several practical benefits. First, learning each annotator is an easier and more stable learning problem than learning an aggregated label that already mixes perception and aggregation. Second, the aggregation function can be chosen to match the target application without retraining.
For example, a strict aggregation rule may be preferred for high-stakes testing (high precision; teacher-like), while a more inclusive aggregation may be preferred in CAPT to provide conservative feedback (high recall; peer-like). Third, model ensembles can naturally represent annotator diversity and provide uncertainty estimates: as an uncertainty signal, disagreement among the committee of models can be used to flag borderline pronunciations for human review or for adaptive feedback strategies. More broadly, the aggregation rule can be selected to align with different stakeholder preferences (e.g., teachers, learners, or examination boards) without retraining the underlying annotator-specific models.
Experiment 3 provides a concrete instance: calibrating annotator-specific models and averaging their outputs slightly outperforms a voting-label-trained model (Table 16), and the gain is statistically significant (Table 17), suggesting that “model-then-aggregate” can be competitive or superior to “aggregate-then-model”.
A practical concern is computational cost. Under the model architecture proposed in this paper, using N MD models would indeed increase the computation by a factor of N. However, this overhead can be significantly reduced by sharing a common backbone, while keeping only the MD head annotator-specific. We have verified this approach and found that it not only reduces computational cost but also improves accuracy.
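One plausible way to organize such sharing is sketched below; the assumption that the shared backbone returns per-phone hidden states, and the use of simple linear heads, are illustrative choices rather than the exact configuration we verified.

```python
import torch.nn as nn


class MultiAnnotatorMD(nn.Module):
    """Shared backbone with one lightweight MD head per annotator (sketch).

    Only the heads are annotator-specific, so running the committee adds
    little inference cost compared with running N full models.
    """

    def __init__(self, backbone, d_model=256, n_annotators=3):
        super().__init__()
        self.backbone = backbone  # assumed to return per-phone hidden states
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, 1) for _ in range(n_annotators)])

    def forward(self, hubert_feats, phone_ids):
        states = self.backbone(hubert_feats, phone_ids)  # (B, N, d_model)
        return [head(states).squeeze(-1) for head in self.heads]
```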

7.3. Implications for MD Database Construction

Perception Norm implies that MD datasets should preserve annotator information rather than collapsing it into a single aggregated label. Practically, we recommend (1) storing per-annotator labels and annotator metadata (experience, linguistic background) when possible; (2) ensuring sufficient overlap so that inter-annotator variability and calibration can be quantified; and (3) documenting the aggregation rule as part of the dataset specification, since it changes the construct the model is trained to predict.
In addition, borderline cases are especially informative for Perception Norm research and for training robust systems. Data collection protocols that elicit challenging pronunciations (e.g., accented child speech as in UY/CH-CHILD-MA) can amplify perceptual variability and make annotator effects observable.

7.4. Limitations and Future Work

This study has several limitations. First, UY/CH-CHILD-MA contains four annotators; while sufficient to reveal substantial annotator variation, larger and more diverse annotator pools would enable more fine-grained analysis and more stable estimates of an “average” listener. Second, our work focuses on segment-level/phone-level mispronunciation detection; extending Perception Norm to richer diagnoses (e.g., substitution types, suprasegmentals, and intelligibility/comprehensibility dimensions) is an important direction. Third, we used post hoc calibration (temperature scaling) for averaging; exploring jointly trained mixtures-of-experts or Bayesian models that explicitly represent annotator severity and uncertainty may further improve aggregation. Finally, evaluating how Perception-Norm-based systems affect real CAPT outcomes and perceived fairness across learner populations remains an open and practically important question.

8. Conclusions

This paper argues that mispronunciation detection should be framed as a subjective perception task rather than a purely objective measurement of production deviation. We propose Perception Norm, which models the perceptual and psychological decision process of individual annotators. To support this study, we introduce multiple independent annotations on the UY/CH-CHILD-MA corpus of Uyghur-accented child Mandarin. Experiments confirm substantial inter-annotator variation and show that a Transformer-based pre-training and fine-tuning pipeline can learn annotator-specific behaviors with high accuracy. Based on these findings, we advocate for a new “model-then-aggregate” paradigm: learn one model per annotator and then combine a committee of models with an aggregation rule aligned to the target application.

Author Contributions

Conceptualization, M.N. and A.H.; methodology, M.N. and Y.W.; software, Y.W.; validation, M.N., Y.W. and A.H.; formal analysis, M.N. and Y.W.; investigation, M.N. and Y.W.; resources, A.H.; data curation, M.N. and Y.W.; writing—original draft preparation, M.N.; writing—review and editing, Y.W. and A.H.; visualization, Y.W. and M.N.; supervision, A.H.; project administration, A.H.; funding acquisition, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Social Science Foundation of China (NSSFC) under Grant No. 22XYY048 and by the Tianshan Talents Cultivation Program—Leading Talents for Scientific and Technological Innovation (No. 2024TSYCLJ0002).

Data Availability Statement

The UY/CH-CHILD-MA dataset introduced in this study will be publicly released at http://child.cslt.org (accessed on 20 February 2026) upon publication, including audio recordings and the phone-level annotations from four annotators. The source code for training and evaluation and the scripts for reproducing the reported experiments will be released in a public GitHub repository maintained by the authors (https://github.com/Darlig/md_asr_joint, accessed on 20 February 2026). This work may also use additional publicly available datasets; these external resources are accessible from their original providers and are cited in the manuscript with corresponding links or references.

Acknowledgments

The authors would like to express their sincere appreciation to Dong Wang (Tsinghua University) for his generous support and guidance throughout this study. We also gratefully acknowledge the valuable advice and constructive input provided by colleagues at the Chinese Academy of Social Sciences, which helped inform the data-related aspects of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Witt, S.M.; Young, S.J. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Commun. 2000, 30, 95–108. [Google Scholar] [CrossRef]
  2. Harrison, A.M.; Lo, W.K.; Qian, X.J.; Meng, H. Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training. In Proceedings of the Speech and Language Technology in Education (SLaTE 2009), Warwickshire, UK, 3–5 September 2009; pp. 45–48. [Google Scholar] [CrossRef]
  3. Hu, W.; Qian, Y.; Soong, F.K.; Wang, Y. Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Commun. 2015, 67, 154–166. [Google Scholar] [CrossRef]
  4. Li, K.; Qian, X.; Meng, H. Mispronunciation detection and diagnosis in L2 English speech using multi-distribution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 193–207. [Google Scholar] [CrossRef]
  5. Leung, W.K.; Liu, X.; Meng, H. CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8132–8136. [Google Scholar] [CrossRef]
  6. Feng, Y.; Fu, G.; Chen, Q.; Chen, K. SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3492–3496. [Google Scholar] [CrossRef]
  7. Yan, B.C.; Wu, M.C.; Hung, H.T.; Chen, B. An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3032–3036. [Google Scholar] [CrossRef]
  8. Wu, M.; Li, K.; Leung, W.K.; Meng, H. Transformer based end-to-end mispronunciation detection and diagnosis. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 3954–3958. [Google Scholar] [CrossRef]
  9. Zheng, N.; Deng, L.; Huang, W.; Yeung, Y.T.; Xu, B.; Guo, Y.; Wang, Y.; Chen, X.; Jiang, X.; Liu, Q. CoCA-MDD: A coupled cross-attention based framework for streaming mispronunciation detection and diagnosis. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 4352–4356. [Google Scholar] [CrossRef]
  10. Wang, X.; Shi, M.; Wang, Y. Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis. In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024. [Google Scholar] [CrossRef]
  11. Cao, X.; Fan, Z.; Svendsen, T.; Salvi, G. A Framework for Phoneme-Level Pronunciation Assessment Using CTC. In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024. [Google Scholar] [CrossRef]
  12. Fort, A.; Tyers, F. Evaluating Wav2Vec2-Bert for Computer-Assisted Pronunciation Training for isiZulu. In Proceedings of the Interspeech 2025, Rotterdam, The Netherlands, 17–21 August 2025. [Google Scholar] [CrossRef]
  13. Clarke, C.M.; Garrett, M.F. Rapid adaptation to foreign-accented English. J. Acoust. Soc. Am. 2004, 116, 3647–3658. [Google Scholar] [CrossRef]
  14. Bradlow, A.R.; Bent, T. Perceptual adaptation to non-native speech. Cognition 2008, 106, 707–729. [Google Scholar] [CrossRef]
  15. Kleinschmidt, D.F.; Jaeger, T.F. Robust Speech Perception: Recognize the Familiar, Generalize to the Similar, and Adapt to the Novel. Psychol. Rev. 2015, 122, 148–203. [Google Scholar] [CrossRef]
  16. Norris, D.; McQueen, J.M.; Cutler, A. Perceptual learning in speech. Cogn. Psychol. 2003, 47, 204–238. [Google Scholar] [CrossRef]
  17. Phonetic category recalibration: What are the categories? J. Phon. 2014, 45, 91–105. [CrossRef]
  18. Nijat, M.; Wei, Y.; Li, S.; Dawut, A.; Hamdulla, A. Beyond Native Norms: A Perceptually Grounded and Fair Framework for Automatic Speech Assessment. Appl. Sci. 2026, 16, 647. [Google Scholar] [CrossRef]
  19. Zhao, G.; Sonsaat, S.; Silpachai, A.; Lucic, I.; Chukharev-Hudilainen, E.; Levis, J.; Gutierrez-Osuna, R. L2-ARCTIC: A non-native English speech corpus. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2783–2787. [Google Scholar] [CrossRef]
  20. Kadambi, P.; Mahr, T.; Annear, L.; Nomeland, H.; Liss, J.; Hustad, K.; Berisha, V. How Does Alignment Error Affect Automated Pronunciation Scoring in Children’s Speech? In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024; pp. 5133–5137. [Google Scholar] [CrossRef]
  21. Li, W.; Siniscalchi, S.M.; Chen, N.F.; Lee, C.H. Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6135–6139. [Google Scholar] [CrossRef]
  22. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  23. El Kheir, Y.; Ali, A.; Chowdhury, S.A. Automatic Pronunciation Assessment—A Review. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 8304–8324. [Google Scholar] [CrossRef]
  24. Eisner, F.; McQueen, J.M. The specificity of perceptual learning in speech processing. Percept. Psychophys. 2005, 67, 224–238. [Google Scholar] [CrossRef]
  25. Munro, M.J.; Derwing, T.M. Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Lang. Learn. 1995, 45, 73–97. [Google Scholar] [CrossRef]
  26. Derwing, T.M.; Munro, M.J. Second language accent and pronunciation teaching: A research-based approach. TESOL Q. 2005, 39, 379–397. [Google Scholar] [CrossRef]
  27. Trofimovich, P.; Isaacs, T. Disentangling Accent from Comprehensibility. Biling. Lang. Cogn. 2012, 15, 905–916. [Google Scholar] [CrossRef]
  28. Kang, O.; Rubin, D. Suprasegmental Measures of Accentedness and Judgments of Language Learner Proficiency in Oral English. Mod. Lang. J. 2010, 94, 554–566. [Google Scholar] [CrossRef]
  29. Shinoda, K.; Hojo, N.; Mizuno, S.; Suzuki, K.; Kobashikawa, S.; Masumura, R. Learning from Multiple Annotator Biased Labels in Multimodal Conversation. In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024. [Google Scholar] [CrossRef]
  30. Dawid, A.P.; Skene, A.M. Maximum likelihood estimation of observer error-rates using the EM algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 20–28. [Google Scholar] [CrossRef]
  31. Whitehill, J.; Wu, T.; Bergsma, J.; Movellan, J.; Ruvolo, P. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems (NeurIPS); Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., Culotta, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2009; Volume 22. [Google Scholar]
  32. Welinder, P.; Branson, S.; Perona, P.; Belongie, S. The Multidimensional Wisdom of Crowds. In Advances in Neural Information Processing Systems; Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2010; Volume 23. [Google Scholar]
  33. Raykar, V.C.; Yu, S.; Zhao, L.H.; Jerebko, A.; Florin, C.; Valadez, G.H.; Bogoni, L.; Moy, L. Learning from crowds. J. Mach. Learn. Res. 2010, 11, 1297–1322. [Google Scholar]
  34. Dietterich, T.G. Ensemble Methods in Machine Learning. In Multiple Classifier Systems; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2000; Volume 1857, pp. 1–15. [Google Scholar] [CrossRef]
  35. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In ICML’17: Proceedings of the 34th International Conference on Machine Learning—Volume 70; JMLR.org: Norfolk, MA, USA, 2017; pp. 1321–1330. [Google Scholar]
  36. Hao, Y.; Hu, C.; Gao, Y.; Zhang, S.; Feng, J. On Calibration of Speech Classification Models: Insights from Energy-Based Model Investigations. In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024. [Google Scholar] [CrossRef]
  37. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
  38. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  39. Davis, J.; Goadrich, M. The Relationship Between Precision-Recall and ROC Curves. In ICML ’06: Proceedings of the 23rd International Conference on Machine Learning (ICML); ACM: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
  40. Messick, S. Validity. In Educational Measurement, 3rd ed.; Linn, R.L., Ed.; Macmillan: London, UK, 1989; pp. 13–103. [Google Scholar]
  41. Kane, M. Validating the Interpretations and Uses of Test Scores. J. Educ. Meas. 2013, 50, 1–73. [Google Scholar] [CrossRef]
  42. Levis, J.M. Changing Contexts and Shifting Paradigms in Pronunciation Teaching. TESOL Q. 2005, 39, 369–377. [Google Scholar] [CrossRef]
  43. Linacre, J.M. Many-Facet Rasch Measurement; MESA Press: Chicago, IL, USA, 1989. [Google Scholar]
  44. McNamara, T. Measuring Second Language Performance; Longman: London, UK, 1996. [Google Scholar] [CrossRef]
  45. Jenkins, J. The Phonology of English as an International Language; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
  46. Lippi-Green, R. English with an Accent: Language, Ideology, and Discrimination in the United States, 2nd ed.; Routledge: London, UK, 2012. [Google Scholar] [CrossRef]
  47. Cai, D.; Naismith, B.; Kostromitina, M.; Teng, Z.; Yancey, K.P.; LaFlair, G.T. Developing an Automatic Pronunciation Scorer: Aligning Speech Evaluation Models and Applied Linguistics Constructs. Lang. Learn. 2025, 75, 170–203. [Google Scholar] [CrossRef]
  48. Nijat, M.; Chen, C.; Wang, D.; Hamdulla, A. UY/CH-CHILD: A Public Chinese L2 Speech Database of Uyghur Children. In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024. [Google Scholar] [CrossRef]
  49. Gao, J.; Li, A.; Xiong, Z. Mandarin multimedia child speech corpus: Cass_Child. In Proceedings of the 2012 International Conference on Speech Database and Assessments, Macau, China, 9–12 December 2012; pp. 7–12. [Google Scholar] [CrossRef]
  50. Gao, J.; Li, A.; Xiong, Z.; Shen, J.; Pan, Y. A Normative Database of Word Production of Putonghua-speaking Children. In Proceedings of the 2013 O-COCOSDA/CASLRE International Conference on Speech Database and Assessments, Gurgaon, India, 25–27 November 2013; pp. 1–4. [Google Scholar] [CrossRef]
  51. Du, J.; Na, X.; Liu, X.; Bu, H. AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale. arXiv 2018, arXiv:1808.10583. [Google Scholar] [CrossRef]
Figure 1. (a) Native Norm treats mispronunciation as deviation from native pronunciations, where the green shaded region indicates the native-pronunciation cluster in the acoustic/phonetic space. (b) Target-Population Norm treats mispronunciation as deviation from the expected pronunciation of the target population, where the yellow shaded region represents the target-population reference region. (c) Perception Norm assumes no unique canonical pronunciation; instead, each annotator may apply an individual decision boundary, illustrated by annotator-dependent acceptance regions across subplots. Green dots denote native pronunciations, yellow squares denote L2 pronunciations perceived as correct, and red crosses denote L2 pronunciations perceived as incorrect under the corresponding norm.
Figure 2. Overall framework of the MD model. The blue blocks denote the speech branch; the green blocks denote the phone branch. Dotted boxes group components into functional modules (e.g., branches and output heads). Arrows indicate the direction of information flow. ⊕ denotes element-wise addition.
Figure 3. Precision–recall curves comparing Target Norm and Perception Norm under collective-label supervision. The Perception Norm consistently outperforms the Target Norm across the entire recall range, with particularly notable gains in the mid-to-high recall region, indicating substantially improved ranking quality.
Figure 4. Precision–recall curves comparing the score-averaging ensemble and the voting-trained model. The ensemble approach consistently achieves higher precision across most operating regions, demonstrating improved robustness and ranking performance beyond the single-model baseline.
Table 1. Side-by-side comparison of the three norms discussed in this paper.
Aspect | Native Norm | Target Norm | Perception Norm
Inputs | Speech x + canonical phones S | Speech x + canonical phones S | Speech x + canonical phones S
Supervision | Content labels + Synthetic MD labels | Synthetic MD labels | Human MD labels
Training data | L1 speech | L1 + L2 speech | L1 + L2 speech
Pronunciation standard | Native accent | Target population accent | Human perception
Table 2. Coarse-grained profile of individual annotators (Anno1–Anno3) for interpreting inter-annotator variation.
ID | Native-Language Background | Phonetic/Linguistic Training | Experience with Child/L2 Speech Assessment | Familiarity with Uyghur-Accented Mandarin
Anno1 | Chinese | Some phonetics training | Some experience | Limited
Anno2 | Chinese | Some phonetics training | Some experience | Limited
Anno3 | Chinese | Some phonetics training | Some experience | Limited
Table 3. Basic statistics of UY/CH-CHILD-MA.
Quantity | Value
Number of Speakers | 106
Number of segments | 24,958
Number of Phones | 76,030
Age Range | 4–12
Count/Ratio of MP (Collective) | 6.76%
Count/Ratio of MP (Anno1) | 2349/3.09%
Count/Ratio of MP (Anno2) | 806/1.06%
Count/Ratio of MP (Anno3) | 3687/4.85%
Table 4. Proportion of the samples with different levels of agreement from the three individual annotators.
Pattern Across (Anno1, Anno2, Anno3) | Count | Ratio
All agree correct (0, 0, 0) | 71,425 | 93.94%
Two correct, one error (0, 0, 1), etc. | 2937 | 3.86%
One correct, two errors (0, 1, 1), etc. | 1101 | 1.45%
All agree error (1, 1, 1) | 567 | 0.75%
Table 5. Performance with pre-trained model.
Annotator | Precision | Recall | F1-Score | ROC-AUC
Target Precision
Collective | 0.5 | – | – | 0.83
Anno1 | 0.5 | – | – | 0.86
Anno2 | 0.5 | – | – | 0.89
Anno3 | 0.5 | – | – | 0.87
Target Recall
Collective | 0.2404 | 0.5 | 0.3252 | 0.83
Anno1 | 0.1700 | 0.5 | 0.2537 | 0.86
Anno2 | 0.0964 | 0.5 | 0.1618 | 0.89
Anno3 | 0.2670 | 0.5 | 0.3483 | 0.87
Table 6. Performance of models fine-tuned with synthetic labels (mean ± std over 10 folds).
Annotator | Precision | Recall | F1-Score | ROC-AUC
Target Precision
Collective | 0.5 | 0.2351 ± 0.0739 | 0.3138 ± 0.0668 | 0.85 ± 0.01
Anno1 | 0.5 | 0.1643 ± 0.0951 | 0.2343 ± 0.1013 | 0.88 ± 0.02
Anno2 | 0.5 | 0.2104 ± 0.1705 | 0.2588 ± 0.1762 | 0.91 ± 0.04
Anno3 | 0.5 | 0.2943 ± 0.1001 | 0.3610 ± 0.0851 | 0.88 ± 0.02
Target Recall
Collective | 0.2826 ± 0.0688 | 0.5 | 0.3594 ± 0.0559 | 0.85 ± 0.01
Anno1 | 0.2377 ± 0.0808 | 0.5 | 0.3162 ± 0.0747 | 0.88 ± 0.02
Anno2 | 0.2715 ± 0.1236 | 0.5 | 0.3415 ± 0.1034 | 0.91 ± 0.04
Anno3 | 0.3579 ± 0.0838 | 0.5 | 0.4144 ± 0.0632 | 0.88 ± 0.02
Table 7. Performance of models fine-tuned with human labels (mean ± std over 10 folds).
Annotator | Precision | Recall | F1-Score | ROC-AUC
Target Precision
Collective | 0.5 | 0.6286 ± 0.1051 | 0.5533 ± 0.0443 | 0.94 ± 0.02
Anno1 | 0.5 | 0.6057 ± 0.1187 | 0.5429 ± 0.0502 | 0.95 ± 0.01
Anno2 | 0.5 | 0.5015 ± 0.2307 | 0.4688 ± 0.1580 | 0.96 ± 0.02
Anno3 | 0.5 | 0.7355 ± 0.1521 | 0.5879 ± 0.0675 | 0.95 ± 0.02
Target Recall
Collective | 0.6078 ± 0.1008 | 0.5 | 0.5464 ± 0.0414 | 0.94 ± 0.02
Anno1 | 0.6085 ± 0.1182 | 0.5 | 0.5476 ± 0.0474 | 0.95 ± 0.01
Anno2 | 0.5403 ± 0.2611 | 0.5 | 0.4996 ± 0.1668 | 0.96 ± 0.02
Anno3 | 0.7429 ± 0.1627 | 0.5 | 0.5914 ± 0.0711 | 0.95 ± 0.02
Table 8. Target vs. Perception (ROC-AUC).
Annotator | Target | Perception | Δ | p-Value
Collective | 0.8539 | 0.9351 | +0.0812 | 0.00195
Anno1 | 0.8760 | 0.9533 | +0.0772 | 0.00195
Anno2 | 0.9115 | 0.9637 | +0.0522 | 0.00195
Anno3 | 0.8800 | 0.9548 | +0.0748 | 0.00195
Table 9. Pearson correlation and label agreement between individual annotators.
Pearson Correlation
 | Collective | Anno1 | Anno2 | Anno3
Collective | 1 | 0.431 | 0.333 | 0.468
Anno1 |  | 1 | 0.430 | 0.534
Anno2 |  |  | 1 | 0.422
Anno3 |  |  |  | 1
Label Agreement
 | Collective | Anno1 | Anno2 | Anno3
Collective | 1 | 94.35% | 94.03% | 94.1%
Anno1 |  | 1 | 97.41% | 96.06%
Anno2 |  |  | 1 | 95.91%
Anno3 |  |  |  | 1
Table 10. Representative borderline cases where annotators diverge.
uttid | phone | Anno1 | Anno2 | Anno3
c0490084 | ao4 | 0 | 0 | 1
c0060002 | a3 | 0 | 0 | 1
c0760100 | ian3 | 0 | 0 | 1
c0130213 | ou5 | 1 | 0 | 0
c0640116 | u5 | 1 | 0 | 0
Table 15. Correlation between the individual/voting annotators and the collective annotator.
Individual Annotator | Positive Rate (%) | Agreement | Pearson Correlation
Anno1 | 3.09 | 94.35 | 0.431
Anno2 | 1.06 | 94.03 | 0.333
Anno3 | 4.85 | 94.1 | 0.468
Voting Annotator | Positive Rate (%) | Agreement | Pearson Correlation
Majority vote (≥2/3 = 1) | 2.19 | 0.9476 | 0.4646
OR vote (≥1/3 = 1) | 6.06 | 0.9381 | 0.4852
AND vote (=3/3 = 1) | 0.75 | 0.9391 | 0.3036
Table 16. Comparison between the average annotator and the voting model (mean ± std over 10 folds). “Best-F1” is the maximum over thresholds. The voting labels are used as the ground truth.
Model | ROC-AUC | PR-AUC | Best-F1
Score Average | 0.9681 ± 0.0182 | 0.6231 ± 0.1259 | 0.5986 ± 0.0894
Voting Model | 0.9544 ± 0.0462 | 0.6115 ± 0.1323 | 0.5930 ± 0.0826
Table 17. Statistical significance for score average committee vs. voting model.
Metric | Voting Model | Score Average | Δ | p-Value
ROC-AUC | 0.9544 | 0.9681 | +0.0137 | 0.00195
PR-AUC | 0.6115 | 0.6231 | +0.0116 | 0.00195
F1 | 0.5930 | 0.5986 | +0.0056 | 0.00195
