Article

Beyond Native Norms: A Perceptually Grounded and Fair Framework for Automatic Speech Assessment

1 School of Computer Science and Technology, Xinjiang University, Ürümqi 830017, China
2 Center for Speech and Language Technologies, BNRist, Tsinghua University, Beijing 100084, China
3 School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
4 School of Software, Xinjiang University, Ürümqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 647; https://doi.org/10.3390/app16020647
Submission received: 10 December 2025 / Revised: 1 January 2026 / Accepted: 6 January 2026 / Published: 8 January 2026

Abstract

Pronunciation assessment is central to computer-assisted pronunciation training (CAPT) and speaking tests, yet most systems still adopt a native norm, treating deviations from canonical L1 pronunciations as errors. In contrast, rating rubrics and psycholinguistic evidence emphasize intelligibility for a target listener population and show that listeners rapidly adapt their phonetic categories to new accents. We argue that automatic assessment should likewise be referenced to the target learner group. We build a Transformer-based mispronunciation detection (MD) model that computationally mimics listener adaptation: it is first pre-trained on multi-speaker Librispeech, then fine-tuned on the non-native L2-ARCTIC corpus that represents a specific learner population. Fine-tuning, using either synthetic or human MD labels, constrains updates to the phonetic space (i.e., the representation space used to encode phone-level distinctions, the learned phone/phonetic embedding space, and its alignment with acoustic representations), which means that only the phonetic module is updated while the rest of the model stays fixed. Relative to the pre-trained model, L2 adaptation substantially improves MD recall and F1, increasing ROC–AUC from 0.72 to 0.85. The results support a target-population norm and inform the design of perception-aligned, fairer automatic pronunciation assessment systems.

1. Introduction

Automatic speech assessment—including pronunciation scoring and mispronunciation detection—is now a central component of computer-assisted pronunciation training (CAPT) systems and many speaking tests. Because these tools are deployed at scale and produce seemingly objective “scores” and “errors”, the choice of reference norm is not merely technical: it shapes what learners are told to change, what they are rewarded for, and what kinds of speaking are implicitly framed as “correct”.
A large fraction of existing systems implicitly assume a native norm: they compare second-language (L2) productions against a canonical first-language (L1) pronunciation dictionary and treat deviations as errors. Classic examples include forced-alignment-based goodness-of-pronunciation (GOP) scoring [1] and ASR-graph extensions such as the extended recognition network (ERN) [2]. While these approaches have enabled scalable feedback, making the L1 dictionary the ground truth requires a strong normative assumption: that the goal of pronunciation learning is to sound like a particular native standard.
A native norm is also a language fairness risk. Pronunciation standards are socially situated; treating a single “standard” accent as the only legitimate target can reproduce standard-language ideology and accent discrimination [3,4,5]. Automated assessment scales such norms by turning them into numbers and diagnostic labels, potentially hardening existing hierarchies—especially when deployed in large-scale education or hiring. More broadly, speech technologies have raised well-known fairness concerns even in core tasks such as ASR [6], which strengthens the case that reference norms should be made explicit and justified rather than assumed.
This assumption is at odds with how pronunciation is treated in many educational and testing settings. Major frameworks such as the Common European Framework of Reference (CEFR) and the TOEFL speaking rubrics define speaking proficiency primarily in terms of intelligibility and comprehensibility, explicitly tolerating foreign accents as long as communication is not impeded [7,8,9]. Linguistic and SLA research similarly emphasizes that accentedness and intelligibility are separable and that native-likeness is often an unnecessary (and sometimes unattainable) target for most learners [10,11,12,13].
Psycholinguistic research further undermines the idea of a fixed “native” ground truth. A large body of work shows that human listeners rapidly adapt to novel talkers and accents: native listeners exposed to Spanish- or Chinese-accented English recover an initial processing slowdown within about a minute [14], and with exposure to multiple foreign-accented talkers, they can form accent-level representations that can be generalized to previously unheard talkers [15]. Recent probabilistic accounts model such adaptation as rational inference over talker- and accent-specific distributions [16]. Listeners also use lexical information to retune phoneme category boundaries, and this retuning can be generalized across words, talkers, and even languages [17,18]. These findings suggest that human “ground truth” for pronunciation is not a fixed L1 template but a dynamically adapted mapping guided by the communicative context and the target population. We refer to this trend in human perception as the target-population norm.
This paper provides computational evidence for the target-population norm using a mispronunciation detection (MD) model. MD is a typical pronunciation assessment task that identifies which phones are perceived as mispronounced [19]. We propose a simple end-to-end Transformer-based MD system trained in two stages: pre-training on the multi-speaker Librispeech corpus [20] (interpreted as a broad multi-talker prior) and fine-tuning on the non-native L2-ARCTIC corpus (a specific learner population) to simulate phonetic adaptation. We evaluate against human-annotated MD labels from L2-ARCTIC, so performance directly reflects alignment with human judgments.
Our contributions are threefold:
  • Fairness-first framing. We clarify two competing reference norms for L2 pronunciation—native versus target population—and argue that native-norm automation (as instantiated by GOP/ERN-style pipelines) is a problematic default that can be pedagogically uninformative and socially inequitable.
  • Computational evidence for phonetic adaptation. We present an end-to-end MD model that simulates how listeners adapt from broad multi-talker experience to a specific learner population. The results show that a model trained only on native speech is overly tolerant of L2 deviations, while adapting to target-population data yields substantially better alignment with human judgments.
  • Design principles for automatic speech assessment. We show that end-to-end assessment trained against human judgments is naturally consistent with the target-population norm. When labeled data are scarce, we argue that any data augmentation, weak supervision, or adaptation strategy must be chosen to strengthen this alignment rather than re-imposing a native template.
In the rest of this paper, we first present evidence for listener adaptation and the target-population norm in Section 2 and then introduce the MD model in Section 3. Section 4 describes the experimental settings and results, Section 5 discusses implications—with an emphasis on language fairness and deployment—and the paper is concluded in Section 6.

2. Pronunciation Evaluation, Listener Adaptation, and Language Fairness

2.1. Two Reference Norms for L2 Pronunciation

Broadly speaking, L2 pronunciation can be evaluated with respect to two potential norms.
The first is a native norm. Pronunciation should be assessed as the deviation from native standard speech. Many automatic speech assessment systems implicitly adopt this perspective by training on native speech and using canonical phone sequences as the primary reference. Under this norm, any deviation from the canonical form is treated as an error, even if it is intelligible to listeners familiar with the learner population.
The second is a target-population norm. Here, pronunciation should be assessed based on the accented speech of the target population. The primary criteria under this norm are intelligibility and comprehensibility for that population, with accentedness being secondary. This perspective is prominent in language teaching and assessment, where the aim is communicative success rather than the elimination of all traces of a foreign accent [7,8,9].
Figure 1 visualizes these two reference norms by showing how L2 pronunciations can receive different correctness judgments depending on the adopted norm (with mispronunciations indicated by red crosses). Under the native norm (Figure 1a), deviations from native standard speech (i.e., the canonical form) are treated as errors, even when they are intelligible to listeners familiar with the learner population. Under the target-population norm (Figure 1b), the reference shifts to the accented speech of the target population so that intelligibility and comprehensibility for that population are prioritized and accentedness is secondary. This illustrates that mispronunciation judgments are norm-dependent rather than absolute.
It has been argued that human raters tend to operate closer to the target-population norm. When grading L2 speech, raters implicitly calibrate their expectations to the learners’ backgrounds, the task context, and the likely interlocutors. Some empirical evidence from perceptual and psychological experiments will be presented in the rest of this section.

2.2. Why the Native Norm Is a Problematic Default (and a Fairness Risk)

Native-like pronunciation is not a universal prerequisite for communicative success. In L2 speech, accentedness, comprehensibility, and intelligibility are related but distinct constructs: listeners may understand accented speech with little difficulty while still perceiving it as “non-native” [10,11]. As a result, systems that operationalize pronunciation quality purely as deviation from a canonical L1 phone sequence (e.g., GOP and ERN) may over-penalize benign, intelligible variation and provide feedback that is poorly aligned with communicative goals [1,2]. For this reason, pronunciation pedagogy has increasingly emphasized an intelligibility principle—prioritizing features that affect understanding and listener effort over attempting to eliminate all traces of an accent [9,12,13]. When a system scores accentedness against a canonical L1 dictionary, it can generate negative feedback that is psychologically costly yet pedagogically low-yield, because it penalizes deviations that may have little impact on intelligibility for the relevant interlocutors.
From the perspective of SLA and speech perception, it is also unsurprising that learners do not simply converge to an L1 template. Influential accounts such as the Speech Learning Model and the Perceptual Assimilation Model emphasize that L2 phonetic categories are shaped by pre-existing L1 categories and by experience, yielding systematic, population-specific patterns of variation [21,22]. Under this view, a scoring rule that treats any departure from an L1 dictionary as an “error” conflates physiological/acoustic difference with communicative failure and obscures the more meaningful question of whether a listener population actually experiences misunderstanding or increased processing effort.
Finally, choosing a native norm as the ground truth is a language fairness concern. Standard-language ideologies grant social and institutional authority to particular accents; sociolinguistic work has documented how accent-based gatekeeping and discrimination operate in education and employment [3,4,5]. When automated assessment systems uncritically encode a single native standard, they can scale this gatekeeping by presenting normative judgments as objective measurements. This risk is especially salient for widely deployed CAPT tools, where the “native template” can become a de facto mechanism that reinforces the dominance of majority varieties and marginalizes legitimate accented speech. These considerations motivate a clear position for automatic assessment in most learning contexts: native-norm scoring should be avoided as a default, and assessment targets should instead be grounded in the perception of the intended listener population.

2.3. Evidence from Rating Scales and Testing Practice

Large-scale language assessment frameworks provide explicit evidence for the target-population perspective. The CEFR companion volume, for example, describes pronunciation in terms of phonological control, emphasizing intelligibility and ease of comprehension rather than native-like accuracy in every segment [7]. The scale allows for substantial foreign-accentedness at intermediate levels, as long as communication is not hindered.
Similarly, high-stakes speaking tests such as TOEFL iBT assess pronunciation in terms of intelligibility, stress, and intonation, not segmental native-like accuracy [8]. The rubrics explicitly note that some degree of accent is acceptable as long as the speech is generally understandable to ordinary listeners. Current research on pronunciation assessment likewise advocates focusing on intelligibility and listener effort rather than penalizing all deviation from native norms [9].
In classroom practice, teachers routinely adjust their expectations based on the learners’ L1 background, level, and communicative goals. For example, a teacher working with Chinese high-school learners of English may judge some typical L1-influenced substitutions as acceptable as long as they do not interfere with communication, whereas the same deviations might be judged more harshly in an advanced interpreting program. This flexibility is difficult to reconcile with a rigid native norm but is naturally captured if assessment is anchored in a target population.

2.4. Evidence from Speech Perception and Experimental Phonetics

Experimental phonetics and psycholinguistics provide more direct evidence that listeners adapt their internal phonetic categories to the speech of specific talkers and accents.
Work on adaptation to foreign-accented speech has shown that native listeners can rapidly adjust to novel accents. Ref. [14] found that listeners initially experience a processing cost when hearing heavily accented English, but their performance recovers to near-native levels after about a minute of exposure. Ref. [23] further showed that this adaptation can transfer across talkers who share the same accent.
Listeners also form abstract accent-level representations. Ref. [15] demonstrated that exposure to multiple talkers with a given foreign accent leads to better generalization to new talkers than exposure to a single talker. This suggests that listeners are not only adapting to individual voices but also extracting patterns characteristic of an accent category.
Another line of work shows that listeners use lexical information to retune phoneme category boundaries. In the classic perceptual learning paradigm, listeners hear ambiguous sounds in words whose lexical identity disambiguates the intended phoneme; as a result, they shift their category boundaries in a way that generalizes to new items [17]. These effects extend across different voices and even across languages in bilinguals [18]. Ref. [24] further highlighted that what listeners learn may depend on acoustic similarity: if talkers share similar acoustic patterns, category shifts induced by one talker can generalize to another.
Taken together, these findings support a picture in which human listeners are highly adaptive. Their phonetic categories are not fixed by an L1 dictionary but are continuously reshaped by recent experience with particular talkers and accents. This strongly suggests that the “ground truth” for pronunciation is population-relative and history-dependent, which motivates our target-population perspective for automatic speech assessment.

3. Computational Evidence with Mispronunciation Detection Models

The research on human perception reviewed above provides both quantitative and qualitative support for the target-population norm. However, this support is largely indirect: the evidence is inferred from specific aspects of listening behavior and therefore cannot be taken as direct proof that listeners actually use the target-population pronunciation as their internal reference. To address this limitation, we introduce a computational model that mirrors the listener adaptation process and show that it indeed reproduces this adaptation.
We choose mispronunciation detection as a probe task, defined as testing whether a phone is correctly pronounced. We choose MD because it is simple and clearly defined:
  • Since MD is clearly defined, it is well suited to a computational model. In particular, we choose Transformers as the building block: given the Turing completeness of Transformers [25,26], they have the potential to solve clearly defined tasks such as MD, provided that the data and parameters are sufficient and training is well conducted.
  • Since MD is simple and clearly defined, human annotators can easily understand the task, perform it well, and convey their genuine opinions in their labels. We therefore argue that MD labels clearly reflect listeners' perceptual and psychological judgments of L2 speech.

3.1. Mispronunciation Detection: A Quick Review

Mispronunciation detection (MD) is a type of pronunciation assessment task. Most MD research addresses a broader task called mispronunciation detection and diagnosis (MDD), which provides learners with fine-grained feedback at the phoneme level, i.e., not only which phone is mispronounced but also how it was mispronounced. Over the past two decades, research on MDD has evolved from confidence-measure-based scoring to rule-based and free-phone recognition frameworks and, more recently, to end-to-end neural models that jointly learn acoustic and linguistic representations from data.
Early work was dominated by pronunciation scoring methods based on the GOP measure. In a Gaussian Mixture Model–Hidden Markov Model (GMM–HMM) system, Witt and Young [1] proposed to compute phone-level scores from forced alignments, which became the de facto baseline for automatic pronunciation assessment. With the advent of Deep Neural Network–Hidden Markov Model (DNN–HMM) acoustic models, GOP-style scores were reformulated on neural posteriors and combined with transfer learning and logistic regression classifiers to improve robustness and separability between native-like and non-native productions [27]. These methods can yield reliable scalar scores, but the decision is tightly coupled to the forced alignment and the underlying phone set, and detailed diagnosis is difficult because only limited information about error types is available.
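For concreteness, the GOP score of [1] is commonly written as follows (our notation; the exact normalization varies across implementations):
$$\mathrm{GOP}(p) = \frac{1}{N_p}\left|\log\frac{P\big(O^{(p)} \mid p\big)}{\max_{q \in Q} P\big(O^{(p)} \mid q\big)}\right|,$$
where $O^{(p)}$ is the acoustic segment force-aligned to phone $p$, $N_p$ is its number of frames, and $Q$ is the phone set; a phone is flagged as mispronounced when its GOP falls below a threshold.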
To provide richer diagnostics, rule-based and classification-based approaches extended conventional ASR systems. The extended recognition network (ERN) introduced phonological rules into the decoding graph so that typical L2 error patterns could be recognized explicitly and mapped to diagnostic labels [2]. Subsequent work replaced GMMs with deep neural networks and explored multi-distribution and multi-task acoustic–phonemic models that jointly model canonical, mispronounced, and auxiliary attribute distributions [28,29]. These systems improved detection accuracy and the coverage of error types but still followed a multi-stage pipeline (forced alignment, lattice generation, rule matching), and their performance was sensitive to alignment errors and hand-crafted rule sets.
Motivated by advances in end-to-end ASR, recent research has turned to neural architectures that directly map input speech to phoneme sequences and mispronunciation decisions. Leung et al. [30] proposed a Convolutional Neural Network–Recurrent Neural Network–Connectionist Temporal Classification (CNN–RNN–CTC) model that performs free-phone recognition without forced alignment and then compares the recognized sequence with the canonical transcription, achieving substantial gains over ERN and acoustic–phonemic baselines. Building on this idea, sentence-dependent end-to-end mispronunciation detection and diagnosis (SED-MDD) uses an additional text encoder and attention mechanism to fuse reference-phone and acoustic features so that the model can implicitly learn phonological correspondences between canonical and observed realizations [31]. Other work adopts hybrid CTC/attention architectures or anti-phone expansions of the label set to better capture non-categorical or heavily distorted L2 pronunciations [32].
At the architectural level, several lines of work further refine end-to-end MDD. Yan and Chen [33] replace hand-crafted spectral features with raw-waveform encoders, while Wu et al. [34] employ Transformer and wav2vec-style encoders to improve phone recognition and F-measure on MDD benchmarks. Self-supervised pre-training has proven particularly beneficial: Xu et al. [35] and Guo et al. [36] show that wav2vec 2.0-based representations dramatically reduce the amount of labeled L2 data needed for competitive performance, for both English and Mandarin MDD. In parallel, text-aware models explicitly condition acoustic encoders on the known canonical transcription using attention or gating mechanisms [37,38], and streaming architectures such as coupled cross-attention (CoCA)–MDD introduce coupled cross-attention to support low-latency, segment-by-segment feedback [39]. More recent feature fusion models integrate acoustic, textual, and error-type information with multi-head attention and task-specific loss functions to address data imbalance and further boost F1 scores on TIMIT and L2-ARCTIC [40,41].
Overall, the field has moved from GOP-based scoring relying on forced alignment, through rule-enhanced ASR and multi-distribution acoustic models, to a rich family of end-to-end neural systems that leverage pre-training, textual conditioning, and multi-task learning. Our work follows this end-to-end line but focuses on modeling listener adaptation to accents through multi-listener training and explicit conditioning on the target speaker population, as detailed in the following subsections.

3.2. Model Architecture

Given a speech feature sequence $X = (x_1, \ldots, x_T)$ and its canonical phone sequence $S = (s_1, \ldots, s_N)$, the goal of the model is to predict a binary mispronunciation label $y_n \in \{0, 1\}$ for each phone position $n$, where $y_n = 0$ denotes a correct pronunciation and $y_n = 1$ denotes an error (including substitutions and deletions). Figure 2 illustrates the overall architecture. It consists of a speech branch and a phone branch. The speech branch predicts true pronunciations, and the phone branch produces mispronunciation labels. The entire model is trained with a CTC loss on top of the speech branch and a binary cross-entropy (BCE) loss on top of the phone branch.

3.2.1. Speech Branch

The input is a sequence of 40-dimensional Mel filterbank features (Fbanks), denoted by $X \in \mathbb{R}^{T \times 40}$. We first apply a convolutional front-end with two 2-D CNN layers. Both layers use a 3 × 3 kernel, stride 2, and 256 channels, and each is followed by a ReLU activation. This front-end performs subsampling and produces higher-level local time–frequency feature maps.
The CNN output is then flattened and projected to a sequence of 256-dimensional vectors using a linear layer. Next, we apply a position-wise feed-forward block (FFN) with dimensions 256 → 256 → 256 and ReLU activation, followed by positional encoding.
The resulting sequence is passed to a 6-layer Speech Transformer. Each layer contains three submodules in the following order:
  • Multi-head self-attention.
  • Speech-to-phone cross-attention.
  • A position-wise feed-forward network.
For self-attention, we use four heads with $d_{\text{model}} = 256$. For speech-to-phone cross-attention, we use a single head with $d_{\text{model}} = 256$; this submodule takes the speech hidden states as queries, attends to the phone-branch representations as keys/values, and applies LayerNorm within the cross-attention block. The feed-forward network inside each Speech Transformer layer has dimensions 256 → 512 → 256 with ReLU activation. This cross-attention module explicitly conditions the speech representations on the canonical phone stream, rather than relying on a simple concatenation-based fusion.
For the auxiliary phoneme recognition task, the Speech Transformer output is fed to a linear layer that projects the hidden dimension from 256 to 74 logits (69 pronounced phones, 1 spn, 1 blank, 3 special tokens) at each (subsampled) time step. A softmax is applied, and we compute the CTC loss on this phoneme posterior sequence.
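A minimal PyTorch sketch of this speech branch is given below. Module and variable names (e.g., SpeechBranch, frontend) are ours, positional encoding is omitted, and the speech-to-phone cross-attention submodule is replaced by a standard encoder layer for brevity; cross-attention is illustrated in the phone-branch sketch in the next subsection.

```python
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """Sketch of the speech branch: CNN front-end, 6-layer encoder, CTC head."""
    def __init__(self, n_mels=40, d_model=256, n_layers=6, n_logits=74):
        super().__init__()
        # Two 3x3 convs with stride 2: ~4x subsampling in time and frequency.
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU())
        self.proj = nn.Linear(256 * (n_mels // 4), d_model)  # flatten -> 256
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=512,
                                       batch_first=True), num_layers=n_layers)
        # 74 logits: 69 pronounced phones + spn + blank + 3 special tokens.
        self.ctc_head = nn.Linear(d_model, n_logits)

    def forward(self, fbank):                        # fbank: (B, T, 40)
        x = self.frontend(fbank.unsqueeze(1))        # (B, 256, T/4, 10)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.ffn(self.proj(x))                   # positional encoding omitted
        h = self.encoder(x)                          # (B, T/4, 256)
        return self.ctc_head(h).log_softmax(-1), h   # CTC log-probs, hidden states
```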

3.2.2. Phone Branch

The input to the phone branch is a canonical phoneme sequence $S = (s_1, \ldots, s_N)$. We first map phoneme IDs to embeddings using an embedding layer with vocabulary size 69 and embedding dimension 256, producing $E \in \mathbb{R}^{N \times 256}$. We then apply a position-wise FFN with dimensions 256 → 256 → 256 (ReLU) and add positional encoding.
The sequence is then passed through a 4-layer Phone Transformer. Each Phone Transformer layer consists of the following: (i) multi-head self-attention with four heads and $d_{\text{model}} = 256$ and (ii) a position-wise FFN with dimensions 256 → 512 → 256 and ReLU.
To perform mispronunciation detection, we further stack a 4-layer mispronunciation detection (MD) Transformer. Each MD Transformer layer contains the following, in order:
  • Multi-head self-attention over the phone sequence (four heads, $d_{\text{model}} = 256$).
  • Phone-to-speech cross-attention (one head, $d_{\text{model}} = 256$) that uses the phone hidden states as queries and attends to the speech-stream hidden states as keys/values, with LayerNorm used in the cross-attention block.
  • A position-wise FFN with dimensions 256 → 512 → 256 and ReLU.
Finally, the MD Transformer output is passed through an additional position-wise FFN with output dimension 256 and ReLU activation, followed by a linear layer that maps 256 → 1 at each phoneme position. A sigmoid activation produces the mispronunciation posterior $\hat{y}_n \in (0, 1)$ for each canonical phone position $n$. We compute the binary cross-entropy (BCE) loss between $\hat{y}_n$ and the corresponding phone-level mispronunciation label $y_n$.
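The following hedged sketch illustrates the phone branch and one MD Transformer layer, again with hypothetical names (MDLayer, PhoneBranch) and positional encoding omitted; nn.MultiheadAttention with a single head implements the phone-to-speech cross-attention, with phone hidden states as queries and speech hidden states as keys/values.

```python
import torch
import torch.nn as nn

class MDLayer(nn.Module):
    """One MD Transformer layer: self-attention, phone-to-speech
    cross-attention, and a position-wise FFN (256 -> 512 -> 256)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, 512), nn.ReLU(),
                                 nn.Linear(512, d_model))

    def forward(self, phone_h, speech_h):
        a, _ = self.self_attn(phone_h, phone_h, phone_h)   # over the phone sequence
        phone_h = self.norm1(phone_h + a)
        # Phones query the speech stream (keys/values).
        c, _ = self.cross_attn(phone_h, speech_h, speech_h)
        phone_h = self.norm2(phone_h + c)
        return self.norm3(phone_h + self.ffn(phone_h))

class PhoneBranch(nn.Module):
    def __init__(self, vocab=69, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.phone_encoder = nn.TransformerEncoder(      # 4-layer Phone Transformer
            nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=512,
                                       batch_first=True), num_layers=4)
        self.md_layers = nn.ModuleList(MDLayer(d_model) for _ in range(4))
        self.out = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 1))  # maps 256 -> 1 per phone

    def forward(self, phone_ids, speech_h):              # phone_ids: (B, N)
        h = self.phone_encoder(self.embed(phone_ids))
        for layer in self.md_layers:
            h = layer(h, speech_h)
        return torch.sigmoid(self.out(h)).squeeze(-1)    # (B, N) MD posteriors
```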

3.2.3. Training Objectives

The model is trained with two objectives: (i) the CTC loss from the speech branch (phoneme recognition) and (ii) the BCE loss from the phone branch (phone-level mispronunciation detection). The whole network is trained end to end by combining the CTC loss and the BCE loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CTC}} + \lambda \, \mathcal{L}_{\mathrm{BCE}}, \qquad (1)$$
where λ controls the relative weight of the BCE loss. The CTC objective encourages the speech encoder to learn robust phone-discriminative representations, while the BCE loss specializes these representations for phone-level mispronunciation detection conditioned on the canonical phone sequence.
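As a minimal illustration of Equation (1), one training step could combine the two losses as follows, assuming the SpeechBranch and PhoneBranch sketches above; the blank index for CTC is an assumption, as the paper does not specify it.

```python
import torch.nn.functional as F

def training_step(speech_branch, phone_branch, fbank, phone_ids, md_labels,
                  feat_lengths, phone_lengths, lam=0.67):
    """One combined CTC + BCE step (Equation (1)); lam weights the BCE term."""
    log_probs, speech_h = speech_branch(fbank)       # (B, T', 74), (B, T', 256)
    ctc = F.ctc_loss(log_probs.transpose(0, 1),      # CTC expects (T', B, C)
                     phone_ids, feat_lengths, phone_lengths,
                     blank=0)                        # blank index is an assumption
    y_hat = phone_branch(phone_ids, speech_h)        # (B, N) MD posteriors
    bce = F.binary_cross_entropy(y_hat, md_labels.float())
    return ctc + lam * bce
```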

3.3. Two-Stage Training: Librispeech Pre-Training and L2 Fine-Tuning

We adopt a two-stage training scheme designed to mirror the trajectory of a human listener who first accumulates broad multi-talker experience and then adapts to a specific learner population.
  • Stage 1: Pre-training on Librispeech.
We pre-train the model on the 960 h Librispeech corpus [20]. For each utterance, the associated text is converted to phonemes using the CMU Pronouncing Dictionary. We then construct synthetic mispronunciation labels by randomly substituting a small proportion of phonemes with other phonemes of the same class (vowel or consonant). The ground truth for the CTC loss is the original canonical sequence, while the BCE loss is supervised by whether each phone has been substituted. This yields a fully self-supervised MD objective from the perspective of L2 errors: no L2 data or human MD labels are used.
  • Stage 2: Fine-tuning on L2-ARCTIC.
We fine-tune the model on L2-ARCTIC [42] to simulate the perceptual adaptation of listeners. Importantly, we only update the phonetic feature components of the phone branch, i.e., from the phone embedding to the Phone Transformer. This setting ensures that only the phonetic shift in perception is taken into account when performing the adaptation, eliminating any impact from acoustic domain mismatch.
We comparatively study the performance of the perceptual adaptation with synthetic MD data and human MD labels, detailed as follows:
  • Self-supervised MD fine-tuning. Only the BCE loss is used, and the MD labels are synthetic. We again perform random same-class phoneme substitution on the canonical sequence. Since the MD labels are synthesized, this model does not learn any human perceptual behavior on the MD task; instead, it aligns the model's phonetic coordinates to L2 pronunciation. With this experiment, we verify whether this alignment improves the consistency between model predictions and human perception (via listeners' MD labels), even without any human supervision or linguistic knowledge, i.e., priors on phone occurrence and priors on phone-pair substitution.
  • Human-supervised MD fine-tuning. Only the BCE loss is used, and the MD labels come from human annotations. This directly optimizes MD performance against the perceptual behavior of listeners on L2 pronunciations. Our assumption is that this will not only align the perceptual phonetic space to the target L2 pronunciation but also help in learning other psychological and perceptual behaviors that arise when listening to L2 speech, e.g., tolerance to pronunciation deviation. Most importantly, these complex psychological and perceptual behaviors are still learned via a simple phonetic shift in perception.

4. Experiments

4.1. Data

Our experiments focus on English L2 speech, using two corpora: Librispeech as a proxy for multi-talker L1 input and L2-ARCTIC as the target learner population.
  • Librispeech
Librispeech is a widely used corpus of read English audiobooks, containing around 1000 h of 16 kHz speech from approximately 2400 speakers [20]. Although the speakers are predominantly North American, there is variation in voice characteristics and speaking styles. We use the full 960 h training set for pre-training.
  • L2-ARCTIC
L2-ARCTIC is a corpus of non-native English speech read by 24 speakers with different L1 backgrounds (e.g., Mandarin, Korean, Hindi), each reading around 1132 phonetically balanced sentences [42]. The recordings are at 16 kHz, and a subset was annotated with phoneme-level mispronunciation labels, including substitutions, deletions, and insertions. Following common practice in MD research [32,34,37], we focus on phoneme-level substitution and deletion errors and ignore insertion errors.
We split the annotated portion of L2-ARCTIC into training, development, and test sets at the speaker level following the previous work [31,43], ensuring that all speakers in the test set are unseen during training.

4.2. Evaluation Metrics

We use standard MD metrics that treat mispronunciation detection as a binary classification problem at the phoneme level [32]. For each canonical phone position, the system predicts whether the phone is mispronounced, and we compare this prediction with the human annotation. We define
  • True reject (TR): Mispronounced phones correctly detected as mispronounced.
  • False reject (FR): Correctly pronounced phones incorrectly flagged as mispronounced.
  • True accept (TA): Correctly pronounced phones correctly accepted as correct.
  • False accept (FA): Mispronounced phones incorrectly accepted as correct.
Precision and recall are then computed as
$$\text{Precision} = \frac{\mathrm{TR}}{\mathrm{TR} + \mathrm{FR}}, \qquad \text{Recall} = \frac{\mathrm{TR}}{\mathrm{TR} + \mathrm{FA}}.$$
We report the F1 score
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
as well as the ROC–AUC, which summarizes the trade-off between true positive and false positive rates across decision thresholds.
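A small sketch of these counts and metrics, assuming binary NumPy arrays in which 1 denotes a mispronounced phone (function names are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # for ROC–AUC from raw posteriors

def md_metrics(pred, label):
    """Phoneme-level MD metrics; assumes non-degenerate counts."""
    tr = np.sum((pred == 1) & (label == 1))   # true rejects
    fr = np.sum((pred == 1) & (label == 0))   # false rejects
    fa = np.sum((pred == 0) & (label == 1))   # false accepts
    precision = tr / (tr + fr)
    recall = tr / (tr + fa)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# ROC–AUC is computed from the sigmoid posteriors rather than hard decisions:
# auc = roc_auc_score(label, posterior)
```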
To better understand practical system behavior, we focus on two operating points. The first is a fixed precision of 0.5. This corresponds to a regime where about half of the flagged phones are truly mispronounced, which we consider a reasonable working point for CAPT feedback. The second operating point is a fixed recall of 0.5, corresponding to the scenario where half of the mispronunciations are detected. Finally, a precision–recall (PR) curve is presented to evaluate the models across the full operating region.
  • Fairness Metric (Predicted Positive Rate)
To quantify group-wise disparity across different L1 backgrounds, we additionally report the predicted positive rate (PPR), defined as the proportion of phoneme positions that are flagged as mispronounced by the system:
$$\mathrm{PPR}_g = \frac{\mathrm{TR}_g + \mathrm{FR}_g}{\mathrm{TR}_g + \mathrm{FR}_g + \mathrm{TA}_g + \mathrm{FA}_g},$$
where $g \in \mathcal{G}$ indexes L1 groups. We summarize disparity using
$$\mathrm{MaxGap}(\mathrm{PPR}) = \max_{g \in \mathcal{G}} \mathrm{PPR}_g - \min_{g \in \mathcal{G}} \mathrm{PPR}_g.$$
In all group-wise analyses, we use a single global decision threshold and apply it to every L1 group to avoid per-group tuning.
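A sketch of the group-wise PPR and MaxGap computation under a single global threshold, assuming pred is the binary decision array and groups assigns each phoneme position an L1 label (names are ours):

```python
import numpy as np

def ppr_maxgap(pred, groups):
    """Group-wise predicted positive rates and their max-min gap."""
    pprs = {g: pred[groups == g].mean() for g in np.unique(groups)}
    return pprs, max(pprs.values()) - min(pprs.values())
```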

4.3. Settings

All models are implemented in PyTorch v2.4.0 using standard Transformer blocks. Acoustic features are 40-dimensional Mel filterbank coefficients, extracted with a 25 ms window and a 10 ms hop.
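The paper does not state the feature extraction toolkit; as one possibility, torchaudio can produce matching 40-dimensional Fbanks (at 16 kHz, a 25 ms window and a 10 ms hop correspond to 400 and 160 samples, and the file name below is hypothetical):

```python
import torchaudio

fbank_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400,  # 25 ms window
    hop_length=160, n_mels=40)                     # 10 ms hop, 40 Mel bins

waveform, sr = torchaudio.load("utt.wav")          # hypothetical file name
fbank = (fbank_extractor(waveform) + 1e-6).log().transpose(1, 2)  # (1, T, 40)
```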
  • Synthetic MD Labels
Synthetic MD labels are generated on the fly by randomly substituting phones with other phones from the same broad class (vowel or consonant). For each phone sequence, we apply corruption with probability 0.9, and the maximum proportion of substituted phones is 50%. We intentionally adopt relatively strong corruption so that the pre-training stage provides a stable and diverse supervision signal for learning robust MD-related representations; empirically, milder corruption tends to yield less stable pre-training and weaker downstream MD performance.
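A hedged sketch of this on-the-fly corruption, assuming vowels and consonants are phone-ID sets that partition the 69-phone inventory (function and argument names are ours):

```python
import random

def corrupt(phones, vowels, consonants, p_seq=0.9, max_frac=0.5):
    """Same-class random substitution; returns phones and binary MD labels."""
    phones, labels = list(phones), [0] * len(phones)
    if random.random() > p_seq:                  # ~10% of sequences stay intact
        return phones, labels
    n_sub = random.randint(0, int(max_frac * len(phones)))
    for i in random.sample(range(len(phones)), n_sub):
        pool = vowels if phones[i] in vowels else consonants
        phones[i] = random.choice([q for q in pool if q != phones[i]])
        labels[i] = 1                            # substituted -> mispronounced
    return phones, labels
```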
  • Optimization
We apply gradient clipping in all experiments. For Librispeech pre-training, we use a learning rate scheduler with a peak learning rate of $1 \times 10^{-4}$ and 10 epochs of warm-up. The batch size is 128, and $\lambda$ in Equation (1) is set to 0.67. We pre-train for 50 epochs and use the averaged checkpoints from epochs 45–50 as the pre-trained model.
For L2-ARCTIC fine-tuning, we update only the phonetic components of the phone branch (from the phone embedding through the Phone Transformer) while keeping the acoustic front-end and the speech encoder parameters fixed. This setting is designed to isolate phonetic shift and to avoid conflating it with other adaptation factors such as acoustic condition, channel, or device mismatch. Note that the speech encoder is explicitly conditioned on the phone stream via cross-attention, so phonetic adaptation can still affect speech-side representations even though the speech encoder parameters are frozen; any such change is driven solely by the updated phone representations, and we therefore treat it as a downstream consequence of phonetic shift rather than parameter adaptation in the speech branch. We fine-tune from the epoch-50 checkpoint with a smaller learning rate ($4 \times 10^{-6}$) and a batch size of 32 and report the averaged checkpoints from epochs 30–40. More details of the two test conditions in the fine-tuning stage are given below, followed by a sketch of the parameter-freezing setup.
  • In the self-supervised MD condition, we again generate synthetic MD labels by random substitution, following the same principle in the pre-training stage.
  • In the human-supervised MD condition, we use the human MD labels, restricting attention to substitutions and deletions while ignoring insertions. This follows the purpose of the MD task, i.e., determining whether a phone was well pronounced. Note that this “ignoring insertion” protocol is also adopted in the evaluation metrics.
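As referenced above, the partial fine-tuning setup can be sketched as follows, assuming model wraps the two branch sketches from Section 3.2; the attribute names embed and phone_encoder are hypothetical stand-ins for the phonetic components of the phone branch.

```python
import torch

def freeze_for_phonetic_adaptation(model: torch.nn.Module, lr=4e-6):
    """Freeze everything except the phone embedding and Phone Transformer."""
    for name, param in model.named_parameters():
        # `embed` / `phone_encoder` match the hypothetical PhoneBranch sketch.
        param.requires_grad = name.startswith(("embed", "phone_encoder"))
    trainable = (p for p in model.parameters() if p.requires_grad)
    return torch.optim.Adam(trainable, lr=lr)
```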

4.4. Main Results

Table 1 summarizes the phoneme-level MD performance on the L2-ARCTIC test set. The top block reports recall at a fixed precision of 0.5, and the bottom block reports the precision that yields a fixed recall of 0.5; in both cases we also report the corresponding F1 and ROC–AUC.
Several observations follow. First, the zero-shot model pre-trained on Librispeech is rather weak on the MD task. In particular, it behaves like a tolerant listener: at a fixed precision of 0.5, its recall on mispronounced phones is only 0.1744, which means most mispronunciations are not identified as errors. This indicates that the model tends to accept many L2 deviations as correct, consistent with the idea that a multi-talker-trained listener is accustomed to a wide range of pronunciation variation.
Second, fine-tuning with the MD loss and using synthetic MD supervision on L2-ARCTIC improves performance significantly (F1: 0.26/0.35 to 0.38/0.41; ROC–AUC: 0.72 to 0.76). Note that only the phonetic space is adapted, and no human MD labels are used. This means that this fine-tuning simply performs a phonetic shift in the model’s “perceptual behavior” to align with L2 speech patterns. Therefore, the improved MD performance suggests that this phonetic shift leads to consistent MD behavior between the computational model and human perception. This provides strong support for the target-population norm, i.e., human perception involves phonetic shift when listening to L2 speech.
Finally, using human MD labels yields the best results, with a recall of 0.5801 at a precision of 0.5 (F1 0.54, ROC–AUC 0.85) and a precision of 0.5545 at a recall of 0.50 (F1 0.53, ROC–AUC 0.85). The significant improvement over the synthetic MD setting suggests that humans' L2 perceptual adaptation is more complex than an alignment to L2 speech patterns; additional perception-related factors are likely involved, including, but not limited to, priors on error-prone phones and error tolerance when encountering L2 speech. Notably, these complex effects can all be learned by adapting the phonetic encoding component of the model, meaning that phonetic shift can represent most of human listeners' perceptual adaptation behavior.
Figure 3 presents the precision–recall curves of the three MD systems. With fine-tuning, overall MD performance improves at almost all operating points, especially with human labels. These curves confirm that the improvement from MD fine-tuning is consistent across a broad range of operating points.
Overall, the observations closely mirror psycholinguistic findings: a listener with broad multi-talker experience starts out tolerant of L2 deviations, but with exposure to a specific learner population, better MD can be obtained by aligning the perceptual phonetic space to the target speech patterns, even without any human supervision regarding "what is a mispronunciation". If explicit human feedback (here, MD labels) is available, sensitivity to community-defined errors increases, leading to significant MD performance improvement.

4.5. Additional Results

4.5.1. Comparison with Other Results

To verify that the MD systems constructed here are representative and reasonable, we compare the MD performance obtained here (with human MD labels) to results reported in the literature with a similar data profile, as shown in Table 2. Our models are comparable to published systems, and their performance is reasonable.

4.5.2. Full Fine-Tuning

In this experiment, we fine-tune both the speech branch and the phone branch and optimize the model with both CTC and BCE losses. This setting is closer to the standard practice in MD research and provides additional degrees of freedom compared to our main (partial fine-tuning) setup. The goal is to probe whether factors beyond phonetic shift—for example, acoustic/domain or device adaptation—can further explain the MD labels (i.e., human perceptual behavior), and whether the gap between the pre-trained and fine-tuned models admits alternative explanations. The results are shown in Table 3.
A key observation is that when human MD labels are used, full fine-tuning and partial fine-tuning achieve very similar performance (e.g., comparing System 5 vs. 6 and System 11 vs. 12). This suggests that most of the explainable variation in human MD annotation can be captured by shifting the phonetic space, while allowing for additional adaptation (e.g., in the speech branch) brings only marginal gains in this setting.
With this in mind, the advantage of full fine-tuning under synthetic supervision (e.g., System 3 vs. 4 and System 9 vs. 10) is plausibly due to the extra training signal from the auxiliary CTC task, which tends to stabilize optimization and, more importantly, helps the desired phonetic shift by jointly shaping representations from both the speech side and the phone side.

4.5.3. L1 Group Analysis and Demographic Parity-Style Disparity

We further break down performance by L1 background (Arabic, Mandarin, Hindi, Korean, Spanish, Vietnamese) in L2-ARCTIC. We focus on the human-supervised partial fine-tuning setting (System 5 and System 11 in Table 3) and evaluate each L1 group using the same global threshold determined at the corresponding operating point. Besides standard MD metrics, we report the predicted positive rate (PPR) and its MaxGap across L1 groups as a simple demographic parity-style disparity indicator. Performance varies across L1 backgrounds but remains relatively stable, suggesting that the method is robust across groups.

5. Discussion

5.1. Language Fairness: Why Rejecting the Native Norm Matters

Our results have direct implications for the fairness of L2 automatic speech assessment. The failure mode of the Librispeech-only model—very low recall on human-labeled L2 mispronunciations—shows that a “native” reference does not automatically yield decisions that match what human raters consider problematic for a specific learner population. This helps explain why native-norm pipelines such as GOP and ERN, which interpret deviations from a canonical L1 pronunciation as errors, can systematically disagree with target-population judgments [1,2]. In other words, what counts as an “error” is not a universal property of the acoustic signal; it is a population-relative perceptual decision shaped by experience and task goals. Systems built around GOP/ERN-style comparison to a canonical L1 dictionary therefore risk producing feedback that is systematically misaligned with the target-population judgments that matter in real learning and testing settings.
This misalignment is not only scientific but also normative. When native-likeness is operationalized as the default objective, automated tools can intensify learner pressure and legitimize a single dominant accent as the standard of correctness [4,11,12]. Because these tools can be deployed at scale, they may amplify existing inequalities and accent-based discrimination, turning sociolinguistic hierarchies into algorithmic scores [3,5]. A fairness-first positioning therefore requires that reference norms be explicit: evaluation should be justified in terms of intelligibility for a defined target population, rather than in terms of deviation from a native template.
Concretely, a fairness-first assessment pipeline should achieve the following:
  • State the target population and construct. Specify whose perception the system intends to approximate (e.g., trained raters for a particular test, classroom interlocutors, or an L2 community of practice) and whether the goal is intelligibility, comprehensibility, or some specialized notion of “native-like” performance.
  • Ground training and evaluation in target-population judgments. Whenever possible, train and validate against human ratings that reflect the target population, rather than treating an L1 dictionary as the ground truth [10,11].
  • Treat nativeness as an explicit, optional constraint. If a native norm is used in a particular setting, it should be justified by the communicative requirements of that setting, not assumed by default.
Finally, our experiments suggest a practical route forward. End-to-end models optimized directly against human judgments are naturally compatible with the target-population norm, but because labeled data are costly, deployments often rely on transfer and weak supervision. Our findings indicate that even unlabeled target-population speech with canonical transcripts can move a model toward human-aligned decisions and that constraining adaptation to the phonetic space can help preserve the intended notion of “error”. When labels are scarce, augmentation strategies (synthetic MD labels, self-training, or domain adaptation) should be selected to strengthen alignment with target-population perception rather than to re-impose native-norm constraints.
Empirically, when we break down the results by L1 background (Table 4), we observe non-trivial variation in group-wise predicted positive rates under a single global threshold (MaxGap(PPR) = 0.11 in our setting). This highlights that fairness questions are measurable at the system output level and should be reported alongside overall accuracy when deploying MD models across heterogeneous learner populations.

5.2. Relating Model Adaptation to Human Perceptual Learning

Our results provide a computational complement to the human perceptual learning literature. More broadly, robust speech perception has been formalized as rapid generalization and adaptation to novel talkers and accents [16]. In studies of adaptation to foreign-accented speech, listeners initially show slower or less accurate processing, then quickly adjust to the accent after brief exposure [14]. Adaptation can generalize across talkers sharing an accent [23] and can be facilitated by exposure to multiple talkers [15]. The perceptual learning of phoneme categories through lexically guided retuning similarly shows that listeners adjust their category boundaries based on experience [17,18].
In our MD setting, the pre-trained Librispeech model corresponds to a listener with extensive experience of native (and near-native) English speech from many talkers. Its low recall on L2-ARCTIC mispronunciations suggests that, like human listeners, it initially treats many L2 deviations as acceptable variations.
Fine-tuning on L2-ARCTIC with synthetic MD labels parallels unsupervised adaptation: the model is exposed to L2 speech with canonical transcriptions but no explicit error feedback. This already leads to a noticeable increase in sensitivity to mispronunciation, akin to listeners adjusting to the overall accent characteristics of the learner population. Additionally, human-supervised MD fine-tuning corresponds to explicit corrective feedback, leading to the strongest alignment between model judgments and human annotations.
From this perspective, our two-stage training scheme can be interpreted as a coarse computational model of how human listeners move from a broad multi-talker prior to a population-specific norm, with different levels of supervision corresponding to different kinds of feedback in real-world communication.

5.3. Implications for MD System Design

The target-population perspective has several implications for the design and deployment of MD systems.
First, zero-shot MD (one-class MD) based on a model trained solely on native speech is unlikely to be adequate. Our results show that such a model may be overly tolerant of deviations that target-population listeners would consider mispronunciations, leading to high false accept rates. This is not a failure of the model architecture; rather, it reflects that a purely native reference does not match the way human listeners perceive L2 speech.
Second, modest fine-tuning on L2 speech, even without MD labels, can substantially improve alignment with human judgments. This suggests that when deploying MD in a new context, collecting unlabeled L2 speech with canonical transcriptions may already yield meaningful gains.
Third, explicit MD supervision—human-labeled mispronunciations—provides further improvements, but the marginal gain must be weighed against annotation cost. Our results indicate that even a relatively small amount of labeled L2 data may be sufficient to move from a tolerant, multi-talker prior to a sharper, population-specific MD system.
Fourth, the target-population perspective helps reconcile advanced MD architectures with human practice. Techniques such as text-aware gating [37], multi-task learning with pronunciation scoring [45], and streaming cross-attention [39] can all be viewed as methods for efficiently encoding information about the target population and task constraints, rather than as attempts to enforce a universal native norm.
Finally, when deploying MD in a real CAPT system, it may be beneficial to continuously fine-tune the model on local data from the target population (e.g., Chinese high-school learners) to track shifts in community norms over time.

5.4. Implications for MD Database Construction

Our results suggest that the design of future mispronunciation detection (MD) databases should be reconsidered in light of the target-population norm and language fairness. Most existing MD corpora were not explicitly annotated with “being understandable” or “supporting successful communication” as primary criteria. Instead, labels are often based on an implicit native reference, and the perceptual tendencies of human listeners are only indirectly reflected in the data. This makes it difficult for automatic systems trained on such databases to learn the phonetic adaptation patterns that real listeners exhibit when processing L2 speech.
From the perspective of building fair and meaningful automatic assessment tools, MD databases should be constructed so that their labels represent how a clearly specified listener group actually perceives L2 speech. The goal is not to approximate a single native template but to approximate the judgments of a target population with whom the learner is expected to communicate. Under this view, native-based schemes such as the ERN or GOP should not serve as the primary source of ground-truth labels, because they tie evaluation to acoustic distance from native models rather than to communicative function.
Concretely, we recommend that future MD database construction follow at least the following principles:
  • Specify the target listener population. Annotation protocols should clearly define who the listeners are (e.g., native speakers familiar with a certain learner group, L2 users in a particular community, or mixed proficiency users) so that labels can be interpreted as approximating that population’s perception.
  • Use communicative criteria for labels. Instead of asking raters to judge whether a segment matches a native pronunciation, they should be asked whether the utterance is understandable, whether it supports the intended communication, and whether the deviation is acceptable for interaction within the target population.
  • Allow for graded and tolerant labels. Label schemes should allow for categories such as “clearly intelligible but accented” or “locally acceptable variant” so that automatic systems can learn a realistic tolerance range rather than a rigid native/non-native boundary.
  • Record rater background and instructions. Databases should include basic metadata about raters (e.g., language background, exposure to the learner group) and the instructions they received in order to make the target-population norm explicit and reproducible.
  • Avoid native-based thresholds as ground truth. When acoustic models of native speech are used internally, their outputs should be calibrated and validated against human judgments from the target population, rather than being directly treated as MD labels.
These guidelines shift the focus of MD database construction from reproducing native pronunciation to modeling human perception within a defined target population. Such databases would provide a more appropriate foundation for training end-to-end automatic assessment models that are aligned with the target-population norm and that better support language fairness in real-world educational and assessment settings.

5.5. Limitations and Future Work

This study has several limitations. First, our experiments so far rely on a single L2 corpus (L2-ARCTIC) and a pre-training corpus (Librispeech) that, while multi-speaker, is still relatively homogeneous in content (read audiobooks) and accent. Future work will incorporate additional L1 and L2 corpora to test the generality of the adaptation. In particular, testing on spontaneous speech could be an important direction.
Second, our MD architecture is intentionally simple. This is appropriate as preliminary research, but more advanced architectures—such as text-aware gating [37], multi-task learning with pronunciation scoring [45], or streaming cross-attention models [39]—could be used to verify the same target-population perspective.
Third, we did not directly model individual differences among listeners. In reality, raters may differ in their prior experience and adaptability to new accents [46], which could be reflected in ensembles or Bayesian formulations of MD.
Fourth, we did not test performance with self-supervised (SSL) pre-trained representations. Although our study is orthogonal to the choice of acoustic features, stronger features may make the adaptation easier.
Finally, we used MD as the probe task, and its simplicity made the training easier and the results straightforward to interpret. Future work can extend the scope to other speech assessment tasks, making the target-population norm more concrete.

6. Conclusions

Automatic pronunciation assessment requires an explicit choice of reference norm. We argued—on both scientific and language fairness grounds—that native-norm scoring should not be treated as the default for L2 assessment. By building an end-to-end MD model that mirrors listener adaptation, we provided computational evidence that perceptual standards shift toward target populations: a model trained only on native speech is highly tolerant of L2 deviations, while adaptation to a specific learner population yields substantially better alignment with human judgments. This complements traditional perceptual and psycholinguistic experiments on phonetic adaptation and strengthens the case that “ground truth” in pronunciation assessment is population-relative.
Practically, our results support a clear design principle: automatic assessment should be grounded in target-population perception and intelligibility rather than a native-like accent. End-to-end models trained on human judgments are naturally compatible with this norm. When labeled data are limited, the methods used to strengthen an assessment system—transfer learning, weak supervision, or data augmentation—should be chosen to preserve the target-population norm instead of re-imposing a native template. More broadly, we hope that this work encourages fairness-aware automatic speech assessment systems that support learners without reinforcing sociolinguistic hierarchies.

Author Contributions

Conceptualization, M.N.; Methodology, M.N.; Software, Y.W. and S.L.; Data curation, Y.W. and S.L.; Investigation, M.N., Y.W. and S.L.; Validation, Y.W. and S.L.; Formal analysis, Y.W. and S.L.; Visualization, Y.W. and S.L.; Resources, A.D. and A.H.; Writing—original draft preparation, M.N.; Writing—review and editing, M.N., A.D. and A.H.; Supervision, A.D. and A.H.; Project administration, M.N. and A.H.; Funding acquisition, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Social Science Foundation of China (NSSFC) under Grant No. 22XYY048 and the Tianshan Talents Cultivation Program—Leading Talents for Scientific and Technological Innovation (No. 2024TSYCLJ0002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: LibriSpeech (OpenSLR SLR12: https://www.openslr.org/12/) and L2-ARCTIC (https://psi.engr.tamu.edu/l2-arctic-corpus/) (accessed before 5 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Witt, S.M.; Young, S.J. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Commun. 2000, 30, 95–108. [Google Scholar] [CrossRef]
  2. Harrison, A.M.; Lo, W.K.; Qian, X.; Meng, H. Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training. In Proceedings of the International Workshop on Speech and Language Technology in Education (SLaTE), Warwickshire, UK, 3–5 September 2009; pp. 45–48. [Google Scholar]
  3. Bourdieu, P. Language and Symbolic Power; Harvard University Press: Cambridge, MA, USA, 1991. [Google Scholar]
  4. Lippi-Green, R. English with an Accent: Language, Ideology, and Discrimination in the United States, 2nd ed.; Routledge: London, UK, 2012. [Google Scholar]
  5. Rosa, J.; Flores, N. Unsettling race and language: Toward a raciolinguistic perspective. Lang. Soc. 2017, 46, 621–647. [Google Scholar] [CrossRef]
  6. Koenecke, A.; Nam, A.; Lake, E.; Nudell, J.; Quartey, M.; Mengesha, Z.; Toups, C.; Rickford, J.R.; Jurafsky, D.; Goel, S. Racial disparities in automatic speech recognition. Proc. Natl. Acad. Sci. USA 2020, 117, 7684–7689. [Google Scholar] [CrossRef]
  7. Council of Europe. Common European Framework of Reference for Languages: Companion Volume—Phonological Control Scale. 2020. Available online: https://www.coe.int/ (accessed on 2 December 2025).
  8. ETS. TOEFL iBT Independent and Integrated Speaking Rubrics. 2022. Available online: https://www.ets.org/ (accessed on 2 December 2025).
  9. Kang, O.; Hirschi, K. Pronunciation Assessment Criteria and Intelligibility. Speak Out! J. IATEFL Pronunciation Spec. Interest Group 2023, 68, 25–34. [Google Scholar]
  10. Munro, M.J.; Derwing, T.M. Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Lang. Learn. 1995, 45, 73–97. [Google Scholar] [CrossRef]
  11. Derwing, T.M.; Munro, M.J. Second language accent and pronunciation teaching: A research-based approach. TESOL Q. 2005, 39, 379–397. [Google Scholar] [CrossRef]
  12. Levis, J.M. Changing contexts and shifting paradigms in pronunciation teaching. TESOL Q. 2005, 39, 369–377. [Google Scholar] [CrossRef]
  13. Jenkins, J. The Phonology of English as an International Language; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
  14. Clarke, C.M.; Garrett, M.F. Rapid adaptation to foreign-accented English. J. Acoust. Soc. Am. 2004, 116, 3647–3658. [Google Scholar] [CrossRef]
  15. Bradlow, A.R.; Bent, T. Perceptual adaptation to non-native speech. Cognition 2008, 106, 707–729. [Google Scholar] [CrossRef] [PubMed]
  16. Kleinschmidt, D.F.; Jaeger, T.F. Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychol. Rev. 2015, 122, 148–203. [Google Scholar] [CrossRef]
  17. Norris, D.; McQueen, J.M.; Cutler, A. Perceptual learning in speech. Cogn. Psychol. 2003, 47, 204–238. [Google Scholar] [CrossRef]
  18. Reinisch, E.; Weber, A.; Mitterer, H. Listeners retune phoneme categories across languages. J. Exp. Psychol. Hum. Percept. Perform. 2013, 39, 75–86. [Google Scholar] [CrossRef]
  19. Witt, S. Automatic error detection in pronunciation training: Where we are and where we need to go. In Proceedings of the ISADEPT, Stockholm, Sweden, 6–8 June 2012. [Google Scholar]
  20. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the ICASSP, Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
  21. Flege, J.E. Second-language speech learning: Theory, findings, and problems. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research; Strange, W., Ed.; York Press: Timonium, MD, USA, 1995; pp. 233–277. [Google Scholar]
  22. Best, C.T.; Tyler, M.D. Nonnative and second-language speech perception: Commonalities and complementarities. In Second Language Speech Learning: The Role of Language Experience in Speech Perception and Production; Munro, M.J., Bohn, O.S., Eds.; John Benjamins: Amsterdam, The Netherlands, 2007; pp. 13–34. [Google Scholar]
  23. Xie, X.; Weatherholtz, K.; Bainton, L.; Rowe, E.; Burchill, Z.; Liu, L.; Jaeger, T.F. Rapid adaptation to foreign-accented speech and its transfer across talkers. J. Acoust. Soc. Am. 2018, 143, 2013–2026. [Google Scholar] [CrossRef]
  24. Xie, X.; Myers, J. Learning a talker or learning an accent: Acoustic similarity constrains generalization of foreign accent adaptation to new talkers. J. Mem. Lang. 2017, 95, 36–48. [Google Scholar] [CrossRef] [PubMed]
  25. Pérez, J.; Marinković, J.; Barceló, P. On the Turing Completeness of Modern Neural Network Architectures. arXiv 2019, arXiv:1901.03429. [Google Scholar] [CrossRef]
  26. Pérez, J.; Barceló, P.; Marinkovic, J. Attention is Turing-Complete. J. Mach. Learn. Res. 2021, 22, 1–35. [Google Scholar]
  27. Hu, W.; Qian, Y.; Soong, F.K.; Wang, Y. Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Commun. 2015, 67, 154–166. [Google Scholar] [CrossRef]
  28. Li, K.; Qian, X.; Meng, H. Mispronunciation detection and diagnosis in L2 English speech using multidistribution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 25, 193–207. [Google Scholar] [CrossRef]
  29. Mao, S.; Wu, Z.; Li, R.; Li, X.; Meng, H.; Cai, L. Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in L2 English speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6254–6258. [Google Scholar]
  30. Leung, W.K.; Liu, X.; Meng, H. CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8132–8136. [Google Scholar]
  31. Feng, Y.; Fu, G.; Chen, Q.; Chen, K. SED-MDD: Towards sentence dependent end-to-end mispronunciation detection and diagnosis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3492–3496. [Google Scholar]
  32. Yan, B.C.; Wu, M.C.; Hung, H.T.; Chen, B. An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 3032–3036. [Google Scholar] [CrossRef]
  33. Yan, B.C.; Chen, B. End-to-end mispronunciation detection and diagnosis from raw waveforms. In Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 61–65. [Google Scholar]
  34. Wu, M.; Li, K.; Leung, W.K.; Meng, H. Transformer based end-to-end mispronunciation detection and diagnosis. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 3954–3958. [Google Scholar] [CrossRef]
  35. Xu, X.; Kang, Y.; Cao, S.; Lin, B.; Ma, L. Explore wav2vec 2.0 for mispronunciation detection. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 4428–4432. [Google Scholar] [CrossRef]
  36. Guo, S.; Kadeer, Z.; Wumaier, A.; Wang, L.; Fan, C. Multi-Feature and Multi-Modal Mispronunciation Detection and Diagnosis Method Based on the Squeezeformer Encoder. IEEE Access 2023, 11, 66245–66256. [Google Scholar] [CrossRef]
  37. Peng, L.; Gao, Y.; Lin, B.; Ke, D.; Xie, Y.; Zhang, J. Text-aware end-to-end mispronunciation detection and diagnosis. arXiv 2022, arXiv:2206.07289. [Google Scholar] [CrossRef]
  38. Peng, L.; Gao, Y.; Bao, R.; Li, Y.; Zhang, J. End-to-End Mispronunciation Detection and Diagnosis Using Transfer Learning. Appl. Sci. 2023, 13, 6793. [Google Scholar] [CrossRef]
  39. Zheng, N.; Deng, L.; Huang, W.; Yeung, Y.T.; Xu, B.; Guo, Y.; Wang, Y.; Chen, X.; Jiang, X.; Liu, Q. CoCA-MDD: A Coupled Cross-Attention based framework for streaming mispronunciation detection and diagnosis. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 4352–4356. [Google Scholar] [CrossRef]
  40. Zhu, C.; Wumaier, A.; Wei, D.; Fan, Z.; Yang, J.; Yu, H.; Kadeer, Z.; Wang, L. Pronunciation error detection model based on feature fusion. Speech Commun. 2024, 156, 103009. [Google Scholar] [CrossRef]
  41. Lin, B.; Wang, L. Phoneme mispronunciation detection by jointly learning to align. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6822–6826. [Google Scholar]
  42. Zhao, G.; Sonsaat, S.; Silpachai, A.; Lucic, I.; Chukharev-Hudilainen, E.; Levis, J.; Gutierrez-Osuna, R. L2-ARCTIC: A non-native English speech corpus. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2783–2787. [Google Scholar] [CrossRef]
  43. Yan, B.C.; Wang, H.W.; Chen, B. Peppanet: Effective mispronunciation detection and diagnosis leveraging phonetic, phonological, and acoustic cues. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 1045–1051. [Google Scholar]
  44. Peng, L.; Fu, K.; Lin, B.; Ke, D.; Zhang, J. A Study on Fine-Tuning wav2vec 2.0 Model for the Task of Mispronunciation Detection and Diagnosis. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 4448–4452. [Google Scholar]
  45. Ryu, H.; Kim, S.; Chung, M. A Joint Model for Pronunciation Assessment and Mispronunciation Detection and Diagnosis with Multi-task Learning. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 959–963. [Google Scholar] [CrossRef]
  46. Baese-Berk, M.M. Perception of non-native speech. Lang. Linguist. Compass 2020, 14, e12375. [Google Scholar] [CrossRef]
Figure 1. The native norm and target-population norm for L2 pronunciation evaluation in an abstract acoustic/phonetic space. Green circles represent native pronunciations; yellow squares represent L2 pronunciations perceived as correct under the respective norm; red crosses represent L2 pronunciations perceived as incorrect (mispronunciations). The shaded ellipses indicate the reference region used for evaluation: native speech in (a) and the target population’s accented speech in (b). As a result, different L2 pronunciations may be labeled as mispronunciations under the two norms.
Figure 2. The overall framework of the MD model. The red phones in the phone sequence denote incorrect pronunciations. Green modules and arrows correspond to the speech branch and speech-related information flow, while yellow modules and arrows correspond to the phone branch and phonetic modeling. The green arrow from the Phone Transformer to the Speech Transformer denotes speech–phone cross-attention. The yellow arrow from the Speech Transformer to the MD Transformer denotes phone–speech cross-attention.
Figure 3. The precision–recall curves of the three MD systems. The AUC reported here is the area under the PR curve; it should not be confused with the ROC–AUC values in Table 1.
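To make the two areas concrete, the following minimal sketch (with toy scores and scikit-learn; it is not our evaluation code) computes both the area under the PR curve, as plotted in Figure 3, and the ROC–AUC, as reported in the tables:

```python
# Minimal sketch, not our evaluation code: PR-AUC (Figure 3) vs. ROC-AUC
# (Tables 1 and 3) for per-phone mispronunciation scores on toy data.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                   # 1 = mispronounced
y_score = np.array([0.1, 0.3, 0.4, 0.2, 0.8, 0.6, 0.5, 0.7])  # per-phone scores

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)            # area under the PR curve
roc_auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
print(f"PR-AUC = {pr_auc:.3f}, ROC-AUC = {roc_auc:.3f}")
```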
Table 1. Phoneme-level mispronunciation detection performance on L2-ARCTIC. The top block reports recall at a fixed precision of 0.50; the bottom block reports precision at a fixed recall of 0.50. “Pre-train” uses Librispeech only; other rows fine-tune on L2-ARCTIC.
Stage | Loss | MD Labels | Precision | Recall | F1 | ROC–AUC
Pre-train | CTC + BCE | synthetic (L1) | 0.50 | 0.1744 | 0.26 | 0.72
Fine-tune | BCE | synthetic (L2) | 0.50 | 0.2667 | 0.35 | 0.76
Fine-tune | BCE | human (L2) | 0.50 | 0.5801 | 0.54 | 0.85
Pre-train | CTC + BCE | synthetic (L1) | 0.3008 | 0.50 | 0.38 | 0.72
Fine-tune | BCE | synthetic (L2) | 0.3480 | 0.50 | 0.41 | 0.76
Fine-tune | BCE | human (L2) | 0.5545 | 0.50 | 0.53 | 0.85
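The operating points in Table 1 are obtained by sweeping the decision threshold over per-phone scores. The helper below is an illustrative sketch of one way to locate recall at a fixed precision of 0.50 (it is not our exact evaluation code); precision at a fixed recall is found symmetrically.

```python
# Illustrative helper, not our exact evaluation code: read off the recall at a
# fixed target precision by sweeping the decision threshold on the PR curve.
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_score, target=0.50):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision, recall) point has no threshold; drop it to align.
    precision, recall = precision[:-1], recall[:-1]
    feasible = precision >= target
    if not feasible.any():
        return None, None                 # target precision never reached
    i = np.argmax(np.where(feasible, recall, -1.0))  # best recall among feasible
    return recall[i], thresholds[i]
```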
Table 2. Performance comparison with results reported in the literature on L2-ARCTIC.
Model | Feature | Pre-Training | Fine-Tune | Precision | Recall | F1
GOP [33] | Fbanks | — | TIMIT + L2-ARCTIC | 0.35 | 0.53 | 0.42
CTC-Att [33] | Fbanks | — | TIMIT + L2-ARCTIC | 0.55 | 0.52 | 0.54
SSP [44] | w2v2.0 | LS | L2-ARCTIC | 0.59 | 0.50 | 0.54
Ours | Fbanks | LS | L2-ARCTIC | 0.55 | 0.50 | 0.53
Table 3. Phoneme-level mispronunciation detection performance on L2-ARCTIC. The top block reports recall at a fixed precision of 0.50; the bottom block reports precision at a fixed recall of 0.50. “Pre-train” uses Librispeech only; other rows fine-tune on L2-ARCTIC. The “*” label denotes partial fine-tuning, i.e., only the phone encoder is fine-tuned.
No. | Stage | Loss | MD Labels | Precision | Recall | F1 | ROC–AUC
1 | Pre-train | CTC + BCE | synthetic (L1) | 0.50 | 0.1744 | 0.26 | 0.72
2 | Fine-tune | CTC | none | 0.50 | 0.2542 | 0.34 | 0.75
3 | Fine-tune * | BCE | synthetic (L2) | 0.50 | 0.2667 | 0.35 | 0.76
4 | Fine-tune | CTC + BCE | synthetic (L2) | 0.50 | 0.3498 | 0.41 | 0.79
5 | Fine-tune * | BCE | human (L2) | 0.50 | 0.5801 | 0.54 | 0.85
6 | Fine-tune | CTC + BCE | human (L2) | 0.50 | 0.5956 | 0.54 | 0.84
7 | Pre-train | CTC + BCE | synthetic (L1) | 0.3008 | 0.50 | 0.38 | 0.72
8 | Fine-tune | CTC | none | 0.3337 | 0.50 | 0.40 | 0.75
9 | Fine-tune * | BCE | synthetic (L2) | 0.3480 | 0.50 | 0.41 | 0.76
10 | Fine-tune | CTC + BCE | synthetic (L2) | 0.4121 | 0.50 | 0.45 | 0.79
11 | Fine-tune * | BCE | human (L2) | 0.5545 | 0.50 | 0.53 | 0.85
12 | Fine-tune | CTC + BCE | human (L2) | 0.5549 | 0.50 | 0.53 | 0.84
Table 4. Group-wise mispronunciation detection (MD) performance across six L1 backgrounds on L2-ARCTIC using a single global threshold. Panel A: System 5 at the fixed-precision operating point (precision = 0.50; threshold = 0.2342). Panel B: System 11 at the fixed-recall operating point (recall = 0.50; threshold = 0.3159). PPR denotes the predicted positive rate; Max gap is computed as $\max_{g \in G} \mathrm{PPR}_g - \min_{g \in G} \mathrm{PPR}_g$, where $G$ is the set of L1 groups.
Panel A: System 5 (fixed precision = 0.50; threshold = 0.2342)
L1 Group | Precision | Recall | F1 | ROC–AUC | PPR
Spanish | 0.4664 | 0.5410 | 0.50 | 0.84 | 0.14
Vietnamese | 0.7300 | 0.7039 | 0.72 | 0.90 | 0.23
Hindi | 0.3561 | 0.5275 | 0.43 | 0.81 | 0.16
Chinese | 0.4142 | 0.5121 | 0.46 | 0.82 | 0.15
Korean | 0.4601 | 0.5079 | 0.48 | 0.80 | 0.14
Arabic | 0.4113 | 0.4819 | 0.44 | 0.84 | 0.12
Max gap | — | — | — | — | 0.11

Panel B: System 11 (fixed recall = 0.50; threshold = 0.3159)
L1 Group | Precision | Recall | F1 | ROC–AUC | PPR
Spanish | 0.5332 | 0.4939 | 0.51 | 0.84 | 0.11
Vietnamese | 0.7668 | 0.6171 | 0.68 | 0.90 | 0.20
Hindi | 0.4118 | 0.4462 | 0.43 | 0.81 | 0.12
Chinese | 0.4545 | 0.4545 | 0.45 | 0.82 | 0.12
Korean | 0.5161 | 0.4334 | 0.47 | 0.80 | 0.11
Arabic | 0.4452 | 0.4036 | 0.42 | 0.84 | 0.09
Max gap | — | — | — | — | 0.11
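The group-wise diagnostic in Table 4 follows directly from applying one global threshold and comparing predicted positive rates across L1 groups. The sketch below is illustrative (variable names and toy data are ours, not the experimental pipeline):

```python
# Illustrative sketch of the Table 4 fairness diagnostic: one global threshold,
# per-group predicted positive rate (PPR), and the largest between-group gap.
import numpy as np

def ppr_and_max_gap(scores, groups, threshold):
    """PPR per L1 group under a single global threshold, plus the max PPR gap."""
    positive = np.asarray(scores) >= threshold
    groups = np.asarray(groups)
    ppr = {g: positive[groups == g].mean() for g in np.unique(groups)}
    return ppr, max(ppr.values()) - min(ppr.values())

# Toy usage with the Panel A threshold:
scores = [0.10, 0.40, 0.30, 0.05, 0.60, 0.20]
groups = ["Spanish", "Spanish", "Hindi", "Hindi", "Arabic", "Arabic"]
ppr, gap = ppr_and_max_gap(scores, groups, threshold=0.2342)
print(ppr, gap)
```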