Beyond Native Norms: A Perceptually Grounded and Fair Framework for Automatic Speech Assessment
Abstract
1. Introduction
- Fairness-first framing. We clarify two competing reference norms for L2 pronunciation—native versus target population—and argue that native-norm automation (as instantiated by GOP/ERN-style pipelines) is a problematic default that can be pedagogically uninformative and socially inequitable.
- Computational evidence for phonetic adaptation. We present an end-to-end mispronunciation detection (MD) model that simulates how listeners adapt from broad multi-talker experience to a specific learner population. The results show that a model trained only on native speech is overly tolerant of L2 deviations, whereas adapting to target-population data yields substantially better alignment with human judgments.
- Design principles for automatic speech assessment. We show that end-to-end assessment trained against human judgments is naturally consistent with the target-population norm. When labeled data are scarce, we argue that any data augmentation, weak supervision, or adaptation strategy must be chosen to strengthen this alignment rather than re-imposing a native template.
2. Pronunciation Evaluation, Listener Adaptation, and Language Fairness
2.1. Two Reference Norms for L2 Pronunciation
2.2. Why the Native Norm Is a Problematic Default (and a Fairness Risk)
2.3. Evidence from Rating Scales and Testing Practice
2.4. Evidence from Speech Perception and Experimental Phonetics
3. Computational Evidence with Mispronunciation Detection Models
- Since MD is clearly defined, it is a suitable task for a computational model. In particular, we choose Transformers as the building block: given their Turing completeness [25,26], Transformers can in principle solve clearly defined tasks such as MD, provided that data and parameters are sufficient and training is carried out well.
- Since MD is simple and clearly defined, human annotators can readily understand the task, perform it reliably, and convey their genuine judgments in their labels. We therefore argue that MD labels faithfully reflect listeners’ perceptual and psychological responses to L2 speech.
3.1. Mispronunciation Detection: A Quick Review
3.2. Model Architecture
3.2.1. Speech Branch
- Multi-head self-attention.
- Speech-to-phone cross-attention.
- A position-wise feed-forward network.
3.2.2. Phone Branch
- Multi-head self-attention over the phone sequence (four heads).
- Phone-to-speech cross-attention (one head) that uses the phone hidden states as queries and attends to the speech-stream hidden states as keys/values, with LayerNorm used in the cross-attention block.
- A position-wise FFN with ReLU activation.
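The phone-branch block above can be sketched at the level of a single cross-attention head in plain NumPy. This is a minimal illustration only; the dimensions, random weights, and the exact residual/LayerNorm placement are assumptions for the sketch, not the paper’s configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phone_h, speech_h, Wq, Wk, Wv):
    """Single-head cross-attention: phone hidden states form the queries,
    speech hidden states supply the keys and values."""
    Q = phone_h @ Wq                           # (n_phones, d)
    K = speech_h @ Wk                          # (n_frames, d)
    V = speech_h @ Wv                          # (n_frames, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_phones, n_frames)
    attn = softmax(scores, axis=-1)            # each phone attends over frames
    return attn @ V                            # (n_phones, d)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

rng = np.random.default_rng(0)
d = 8
phone_h = rng.normal(size=(5, d))    # 5 canonical phones (toy sizes)
speech_h = rng.normal(size=(50, d))  # 50 acoustic frames
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Residual connection plus LayerNorm, as in a standard Transformer sub-layer.
out = layer_norm(phone_h + cross_attention(phone_h, speech_h, Wq, Wk, Wv))
print(out.shape)  # (5, 8): one contextualized vector per canonical phone
```

The key design point is that each canonical phone, as a query, summarizes the acoustic frames most relevant to it, which is what lets a per-phone MD decision be read off the phone-branch outputs.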
3.2.3. Training Objectives
3.3. Two-Stage Training: LibriSpeech Pre-Training and L2 Fine-Tuning
- Stage 1: Pre-training on LibriSpeech.
- Stage 2: Fine-tuning on L2-ARCTIC.
- Self-supervised MD fine-tuning. Only the BCE loss is used, and the MD labels are synthetic: we again perform random same-class phoneme substitution on the canonical sequence. Because the labels are synthesized, this model learns no human perceptual behavior on the MD task; instead, it aligns the listeners’ phonetic coordinate system to L2 pronunciation. This experiment verifies whether such alignment alone improves the consistency between model predictions and human perception (via listeners’ MD labels), without any human supervision or linguistic knowledge, i.e., priors on phone occurrence or on phone-pair substitution.
- Human-supervised MD fine-tuning. Only the BCE loss is used, with MD labels taken from human annotations. This directly optimizes MD performance against listeners’ perceptual behavior on L2 pronunciations. Our assumption is that this not only aligns the perceptual phonetic space to the target L2 pronunciation but also helps the model learn other psychological and perceptual behaviors involved in listening to L2 speech, e.g., tolerance of pronunciation deviation. Most importantly, these complex behaviors are still learned via a simple phonetic shift in perception.
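The synthetic-label scheme used in the self-supervised condition can be sketched as follows. The phoneme classes below are an illustrative placeholder inventory, not the paper’s exact grouping, and the substitution probability is a made-up parameter:

```python
import random

# Illustrative same-class phoneme groups (placeholder inventory).
PHONE_CLASSES = {
    "vowel":     ["AA", "AE", "AH", "IY", "UW"],
    "stop":      ["P", "B", "T", "D", "K", "G"],
    "fricative": ["F", "V", "S", "Z", "TH"],
}
CLASS_OF = {p: c for c, phones in PHONE_CLASSES.items() for p in phones}

def synthesize_md_labels(canonical, sub_prob=0.2, seed=0):
    """Randomly replace phones with another phone of the same class.
    Substituted positions get MD label 1 (mispronounced), others 0."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for p in canonical:
        alternatives = [q for q in PHONE_CLASSES[CLASS_OF[p]] if q != p]
        if alternatives and rng.random() < sub_prob:
            corrupted.append(rng.choice(alternatives))
            labels.append(1)
        else:
            corrupted.append(p)
            labels.append(0)
    return corrupted, labels

seq = ["P", "IY", "T", "AH", "S"]
corrupted, labels = synthesize_md_labels(seq)
print(corrupted, labels)
```

Because substitutions stay within a phoneme class, the corrupted sequences remain phonotactically plausible, so the model must learn fine phonetic distinctions rather than coarse class boundaries.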
4. Experiments
4.1. Data
- LibriSpeech
- L2-ARCTIC
4.2. Evaluation Metrics
- True reject (TR): Mispronounced phones correctly detected as mispronounced.
- False reject (FR): Correctly pronounced phones incorrectly flagged as mispronounced.
- True accept (TA): Correctly pronounced phones correctly accepted as correct.
- False accept (FA): Mispronounced phones incorrectly accepted as correct.
- Fairness Metric (Predicted Positive Rate)
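From these four counts, the precision, recall, and F1 reported in the tables follow directly, and the predicted positive rate (PPR) used as the fairness metric is simply the fraction of phones flagged as mispronounced. A small sketch, with made-up counts for illustration:

```python
def md_metrics(tr, fr, ta, fa):
    """Precision/recall/F1 over the 'mispronounced' class, from the four
    counts defined above: true/false rejects and true/false accepts."""
    precision = tr / (tr + fr) if tr + fr else 0.0   # of all rejects, how many were right
    recall = tr / (tr + fa) if tr + fa else 0.0      # of all mispronunciations, how many caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Predicted positive rate: share of all phones flagged as mispronounced.
    ppr = (tr + fr) / (tr + fr + ta + fa)
    return precision, recall, f1, ppr

# Made-up counts, purely for illustration.
p, r, f1, ppr = md_metrics(tr=120, fr=120, ta=700, fa=60)
print(round(p, 2), round(r, 2), round(f1, 2), round(ppr, 2))  # 0.5 0.67 0.57 0.24
```

Comparing PPR across L1 groups (rather than accuracy alone) is what makes the demographic parity-style disparity in Section 4.5.3 computable from system outputs without any ground-truth labels.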
4.3. Settings
- Synthetic MD Labels
- Optimization
- In the self-supervised MD condition, we again generate synthetic MD labels by random substitution, following the same principle as in the pre-training stage.
- In the human-supervised MD condition, we use the human MD labels, restricting attention to substitutions and deletions while ignoring insertions. This matches the purpose of the MD task, i.e., determining whether each canonical phone was well pronounced. Note that the same “ignore insertions” protocol is also adopted in the evaluation metrics.
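One way to derive per-phone labels under this protocol is to align the canonical sequence against the human-perceived sequence and mark unmatched canonical phones as mispronounced; inserted phones have no canonical counterpart and are dropped. The sketch below uses `difflib`'s heuristic longest-match alignment as a stand-in for whatever alignment procedure the annotation pipeline actually uses:

```python
import difflib

def md_labels_from_alignment(canonical, perceived):
    """Label each canonical phone: 0 if matched in the perceived sequence,
    1 if substituted or deleted. Insertions (phones present only in the
    perceived sequence) are ignored, since they correspond to no canonical
    phone."""
    labels = [1] * len(canonical)  # default: mispronounced
    sm = difflib.SequenceMatcher(a=canonical, b=perceived, autojunk=False)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 0          # matched canonical phones are accepted
    return labels

canonical = ["DH", "AH", "K", "AE", "T"]
# "DH"->"D" substitution, an extra "AH" inserted, "AE" deleted.
perceived = ["D", "AH", "AH", "K", "T"]
print(md_labels_from_alignment(canonical, perceived))  # → [1, 0, 0, 1, 0]
```

Note that the inserted "AH" leaves no trace in the labels, exactly as the protocol requires: labels are indexed by canonical phones only.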
4.4. Main Results
4.5. Additional Results
4.5.1. Comparison with Other Results
4.5.2. Full Fine-Tuning
4.5.3. L1 Group Analysis and Demographic Parity-Style Disparity
5. Discussion
5.1. Language Fairness: Why Rejecting the Native Norm Matters
- State the target population and construct. Specify whose perception the system intends to approximate (e.g., trained raters for a particular test, classroom interlocutors, or an L2 community of practice) and whether the goal is intelligibility, comprehensibility, or some specialized notion of “native-like” performance.
- Treat nativeness as an explicit, optional constraint. If a native norm is used in a particular setting, it should be justified by the communicative requirements of that setting, not assumed by default.
5.2. Relating Model Adaptation to Human Perceptual Learning
5.3. Implications for MD System Design
5.4. Implications for MD Database Construction
- Specify the target listener population. Annotation protocols should clearly define who the listeners are (e.g., native speakers familiar with a certain learner group, L2 users in a particular community, or mixed proficiency users) so that labels can be interpreted as approximating that population’s perception.
- Use communicative criteria for labels. Rather than asking raters to judge whether a segment matches a native pronunciation, ask whether the utterance is understandable, whether it supports the intended communication, and whether the deviation is acceptable for interaction within the target population.
- Allow for graded and tolerant labels. Label schemes should allow for categories such as “clearly intelligible but accented” or “locally acceptable variant” so that automatic systems can learn a realistic tolerance range rather than a rigid native/non-native boundary.
- Record rater background and instructions. Databases should include basic metadata about raters (e.g., language background, exposure to the learner group) and the instructions they received in order to make the target-population norm explicit and reproducible.
- Avoid native-based thresholds as ground truth. When acoustic models of native speech are used internally, their outputs should be calibrated and validated against human judgments from the target population, rather than being directly treated as MD labels.
5.5. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Witt, S.M.; Young, S.J. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Commun. 2000, 30, 95–108. [Google Scholar] [CrossRef]
- Harrison, A.M.; Lo, W.K.; Qian, X.; Meng, H. Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training. In Proceedings of the International Workshop on Speech and Language Technology in Education (SLaTE), Warwickshire, UK, 3–5 September 2009; pp. 45–48. [Google Scholar]
- Bourdieu, P. Language and Symbolic Power; Harvard University Press: Cambridge, MA, USA, 1991. [Google Scholar]
- Lippi-Green, R. English with an Accent: Language, Ideology, and Discrimination in the United States, 2nd ed.; Routledge: London, UK, 2012. [Google Scholar]
- Rosa, J.; Flores, N. Unsettling race and language: Toward a raciolinguistic perspective. Lang. Soc. 2017, 46, 621–647. [Google Scholar] [CrossRef]
- Koenecke, A.; Nam, A.; Lake, E.; Nudell, J.; Quartey, M.; Mengesha, Z.; Toups, C.; Rickford, J.R.; Jurafsky, D.; Goel, S. Racial disparities in automatic speech recognition. Proc. Natl. Acad. Sci. USA 2020, 117, 7684–7689. [Google Scholar] [CrossRef]
- Council of Europe. Common European Framework of Reference for Languages: Companion Volume—Phonological Control Scale. 2020. Available online: https://www.coe.int/ (accessed on 2 December 2025).
- ETS. TOEFL iBT Independent and Integrated Speaking Rubrics. 2022. Available online: https://www.ets.org/ (accessed on 2 December 2025).
- Kang, O.; Hirschi, K. Pronunciation Assessment Criteria and Intelligibility. Speak Out! J. IATEFL Pronunciation Spec. Interest Group 2023, 68, 25–34. [Google Scholar]
- Munro, M.J.; Derwing, T.M. Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Lang. Learn. 1995, 45, 73–97. [Google Scholar] [CrossRef]
- Derwing, T.M.; Munro, M.J. Second language accent and pronunciation teaching: A research-based approach. TESOL Q. 2005, 39, 379–397. [Google Scholar] [CrossRef]
- Levis, J.M. Changing contexts and shifting paradigms in pronunciation teaching. TESOL Q. 2005, 39, 369–377. [Google Scholar] [CrossRef]
- Jenkins, J. The Phonology of English as an International Language; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
- Clarke, C.M.; Garrett, M.F. Rapid adaptation to foreign-accented English. J. Acoust. Soc. Am. 2004, 116, 3647–3658. [Google Scholar] [CrossRef]
- Bradlow, A.R.; Bent, T. Perceptual adaptation to non-native speech. Cognition 2008, 106, 707–729. [Google Scholar] [CrossRef] [PubMed]
- Kleinschmidt, D.F.; Jaeger, T.F. Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychol. Rev. 2015, 122, 148–203. [Google Scholar] [CrossRef]
- Norris, D.; McQueen, J.M.; Cutler, A. Perceptual learning in speech. Cogn. Psychol. 2003, 47, 204–238. [Google Scholar] [CrossRef]
- Reinisch, E.; Weber, A.; Mitterer, H. Listeners retune phoneme categories across languages. J. Exp. Psychol. Hum. Percept. Perform. 2013, 39, 75–86. [Google Scholar] [CrossRef]
- Witt, S. Automatic error detection in pronunciation training: Where we are and where we need to go. In Proceedings of the ISADEPT, Stockholm, Sweden, 6–8 June 2012. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the ICASSP, Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
- Flege, J.E. Second-language speech learning: Theory, findings, and problems. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research; Strange, W., Ed.; York Press: Timonium, MD, USA, 1995; pp. 233–277. [Google Scholar]
- Best, C.T.; Tyler, M.D. Nonnative and second-language speech perception: Commonalities and complementarities. In Second Language Speech Learning: The Role of Language Experience in Speech Perception and Production; Munro, M.J., Bohn, O.S., Eds.; John Benjamins: Amsterdam, The Netherlands, 2007; pp. 13–34. [Google Scholar]
- Xie, X.; Weatherholtz, K.; Bainton, L.; Rowe, E.; Burchill, Z.; Liu, L.; Jaeger, T.F. Rapid adaptation to foreign-accented speech and its transfer across talkers. J. Acoust. Soc. Am. 2018, 143, 2013–2026. [Google Scholar] [CrossRef]
- Xie, X.; Myers, J. Learning a talker or learning an accent: Acoustic similarity constrains generalization of foreign accent adaptation to new talkers. J. Mem. Lang. 2017, 95, 36–48. [Google Scholar] [CrossRef] [PubMed]
- Pérez, J.; Marinković, J.; Barceló, P. On the Turing completeness of modern neural network architectures. arXiv 2019, arXiv:1901.03429. [Google Scholar] [CrossRef]
- Pérez, J.; Barceló, P.; Marinkovic, J. Attention is Turing-complete. J. Mach. Learn. Res. 2021, 22, 1–35. [Google Scholar]
- Hu, W.; Qian, Y.; Soong, F.K.; Wang, Y. Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Commun. 2015, 67, 154–166. [Google Scholar] [CrossRef]
- Li, K.; Qian, X.; Meng, H. Mispronunciation detection and diagnosis in L2 English speech using multidistribution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 25, 193–207. [Google Scholar] [CrossRef]
- Mao, S.; Wu, Z.; Li, R.; Li, X.; Meng, H.; Cai, L. Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in L2 English speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6254–6258. [Google Scholar]
- Leung, W.K.; Liu, X.; Meng, H. CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8132–8136. [Google Scholar]
- Feng, Y.; Fu, G.; Chen, Q.; Chen, K. SED-MDD: Towards sentence dependent end-to-end mispronunciation detection and diagnosis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3492–3496. [Google Scholar]
- Yan, B.C.; Wu, M.C.; Hung, H.T.; Chen, B. An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 3032–3036. [Google Scholar] [CrossRef]
- Yan, B.C.; Chen, B. End-to-end mispronunciation detection and diagnosis from raw waveforms. In Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 61–65. [Google Scholar]
- Wu, M.; Li, K.; Leung, W.K.; Meng, H. Transformer based end-to-end mispronunciation detection and diagnosis. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 3954–3958. [Google Scholar] [CrossRef]
- Xu, X.; Kang, Y.; Cao, S.; Lin, B.; Ma, L. Explore wav2vec 2.0 for mispronunciation detection. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 4428–4432. [Google Scholar] [CrossRef]
- Guo, S.; Kadeer, Z.; Wumaier, A.; Wang, L.; Fan, C. Multi-Feature and Multi-Modal Mispronunciation Detection and Diagnosis Method Based on the Squeezeformer Encoder. IEEE Access 2023, 11, 66245–66256. [Google Scholar] [CrossRef]
- Peng, L.; Gao, Y.; Lin, B.; Ke, D.; Xie, Y.; Zhang, J. Text-aware end-to-end mispronunciation detection and diagnosis. arXiv 2022, arXiv:2206.07289. [Google Scholar] [CrossRef]
- Peng, L.; Gao, Y.; Bao, R.; Li, Y.; Zhang, J. End-to-End Mispronunciation Detection and Diagnosis Using Transfer Learning. Appl. Sci. 2023, 13, 6793. [Google Scholar] [CrossRef]
- Zheng, N.; Deng, L.; Huang, W.; Yeung, Y.T.; Xu, B.; Guo, Y.; Wang, Y.; Chen, X.; Jiang, X.; Liu, Q. CoCA-MDD: A Coupled Cross-Attention based framework for streaming mispronunciation detection and diagnosis. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 4352–4356. [Google Scholar] [CrossRef]
- Zhu, C.; Wumaier, A.; Wei, D.; Fan, Z.; Yang, J.; Yu, H.; Kadeer, Z.; Wang, L. Pronunciation error detection model based on feature fusion. Speech Commun. 2024, 156, 103009. [Google Scholar] [CrossRef]
- Lin, B.; Wang, L. Phoneme mispronunciation detection by jointly learning to align. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6822–6826. [Google Scholar]
- Zhao, G.; Sonsaat, S.; Silpachai, A.; Lucic, I.; Chukharev-Hudilainen, E.; Levis, J.; Gutierrez-Osuna, R. L2-ARCTIC: A non-native English speech corpus. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2783–2787. [Google Scholar] [CrossRef]
- Yan, B.C.; Wang, H.W.; Chen, B. Peppanet: Effective mispronunciation detection and diagnosis leveraging phonetic, phonological, and acoustic cues. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 1045–1051. [Google Scholar]
- Peng, L.; Fu, K.; Lin, B.; Ke, D.; Zhang, J. A Study on Fine-Tuning wav2vec 2.0 Model for the Task of Mispronunciation Detection and Diagnosis. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 4448–4452. [Google Scholar]
- Ryu, H.; Kim, S.; Chung, M. A Joint Model for Pronunciation Assessment and Mispronunciation Detection and Diagnosis with Multi-task Learning. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 959–963. [Google Scholar] [CrossRef]
- Baese-Berk, M.M. Perception of non-native speech. Lang. Linguist. Compass 2020, 14, e12375. [Google Scholar] [CrossRef]



| Stage | Loss | MD Labels | Precision | Recall | F1 | ROC–AUC |
|---|---|---|---|---|---|---|
| Pre-train | CTC + BCE | synthetic (L1) | 0.50 | 0.1744 | 0.26 | 0.72 |
| Fine-tune | BCE | synthetic (L2) | 0.50 | 0.2667 | 0.35 | 0.76 |
| Fine-tune | BCE | human (L2) | 0.50 | 0.5801 | 0.54 | 0.85 |
| Pre-train | CTC + BCE | synthetic (L1) | 0.3008 | 0.50 | 0.38 | 0.72 |
| Fine-tune | BCE | synthetic (L2) | 0.3480 | 0.50 | 0.41 | 0.76 |
| Fine-tune | BCE | human (L2) | 0.5545 | 0.50 | 0.53 | 0.85 |
| Model | Feature | Pre-Training | Fine-Tune | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| GOP [33] | Fbanks | TIMIT + L2-ARCTIC | – | 0.35 | 0.53 | 0.42 |
| CTC-Att [33] | Fbanks | TIMIT + L2-ARCTIC | – | 0.55 | 0.52 | 0.54 |
| SSP [44] | w2v2.0 | LS | L2-ARCTIC | 0.59 | 0.50 | 0.54 |
| Ours | Fbanks | LS | L2-ARCTIC | 0.55 | 0.50 | 0.53 |
| No. | Stage | Loss | MD Labels | Precision | Recall | F1 | ROC–AUC |
|---|---|---|---|---|---|---|---|
| 1 | Pre-train | CTC + BCE | synthetic (L1) | 0.50 | 0.1744 | 0.26 | 0.72 |
| 2 | Fine-tune | CTC | none | 0.50 | 0.2542 | 0.34 | 0.75 |
| 3 | Fine-tune * | BCE | synthetic (L2) | 0.50 | 0.2667 | 0.35 | 0.76 |
| 4 | Fine-tune | CTC + BCE | synthetic (L2) | 0.50 | 0.3498 | 0.41 | 0.79 |
| 5 | Fine-tune * | BCE | human (L2) | 0.50 | 0.5801 | 0.54 | 0.85 |
| 6 | Fine-tune | CTC + BCE | human (L2) | 0.50 | 0.5956 | 0.54 | 0.84 |
| 7 | Pre-train | CTC + BCE | synthetic (L1) | 0.3008 | 0.50 | 0.38 | 0.72 |
| 8 | Fine-tune | CTC | none | 0.3337 | 0.50 | 0.40 | 0.75 |
| 9 | Fine-tune * | BCE | synthetic (L2) | 0.3480 | 0.50 | 0.41 | 0.76 |
| 10 | Fine-tune | CTC + BCE | synthetic (L2) | 0.4121 | 0.50 | 0.45 | 0.79 |
| 11 | Fine-tune * | BCE | human (L2) | 0.5545 | 0.50 | 0.53 | 0.85 |
| 12 | Fine-tune | CTC + BCE | human (L2) | 0.5549 | 0.50 | 0.53 | 0.84 |

Panel A: System 5 (Fixed Precision = 0.50; Threshold = 0.2342)

| L1 Group | Precision | Recall | F1 | ROC–AUC | PPR |
|---|---|---|---|---|---|
| Spanish | 0.4664 | 0.5410 | 0.50 | 0.84 | 0.14 |
| Vietnamese | 0.7300 | 0.7039 | 0.72 | 0.90 | 0.23 |
| Hindi | 0.3561 | 0.5275 | 0.43 | 0.81 | 0.16 |
| Chinese | 0.4142 | 0.5121 | 0.46 | 0.82 | 0.15 |
| Korean | 0.4601 | 0.5079 | 0.48 | 0.80 | 0.14 |
| Arabic | 0.4113 | 0.4819 | 0.44 | 0.84 | 0.12 |
| Max gap | – | – | – | – | 0.11 |

Panel B: System 11 (Fixed Recall = 0.50; Threshold = 0.3159)

| L1 Group | Precision | Recall | F1 | ROC–AUC | PPR |
|---|---|---|---|---|---|
| Spanish | 0.5332 | 0.4939 | 0.51 | 0.84 | 0.11 |
| Vietnamese | 0.7668 | 0.6171 | 0.68 | 0.90 | 0.20 |
| Hindi | 0.4118 | 0.4462 | 0.43 | 0.81 | 0.12 |
| Chinese | 0.4545 | 0.4545 | 0.45 | 0.82 | 0.12 |
| Korean | 0.5161 | 0.4334 | 0.47 | 0.80 | 0.11 |
| Arabic | 0.4452 | 0.4036 | 0.42 | 0.84 | 0.09 |
| Max gap | – | – | – | – | 0.11 |
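The “Max gap” rows report a demographic parity-style disparity: the largest pairwise difference in PPR across L1 groups at the shared decision threshold. For Panel B, for example:

```python
# PPR values per L1 group, copied from Panel B above.
ppr = {"Spanish": 0.11, "Vietnamese": 0.20, "Hindi": 0.12,
       "Chinese": 0.12, "Korean": 0.11, "Arabic": 0.09}

# Demographic parity-style disparity: max PPR minus min PPR across groups.
max_gap = max(ppr.values()) - min(ppr.values())
print(round(max_gap, 2))  # → 0.11 (Vietnamese 0.20 vs. Arabic 0.09)
```

Because PPR needs only the system's decisions, this disparity can be monitored on unlabeled deployment data, unlike accuracy-based group metrics.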
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Nijat, M.; Wei, Y.; Li, S.; Dawut, A.; Hamdulla, A. Beyond Native Norms: A Perceptually Grounded and Fair Framework for Automatic Speech Assessment. Appl. Sci. 2026, 16, 647. https://doi.org/10.3390/app16020647