Review

Perception–Production of Second-Language Mandarin Tones Based on Interpretable Computational Methods: A Review

1 College of Liberal Arts, Hunan Normal University, Changsha 410081, China
2 Faculty of Chinese Language and Culture, Guangdong University of Foreign Studies, Guangzhou 510420, China
3 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 145; https://doi.org/10.3390/math14010145
Submission received: 6 November 2025 / Revised: 25 December 2025 / Accepted: 26 December 2025 / Published: 30 December 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

We survey recent advances in second-language (L2) Mandarin lexical tone research and show how an interpretable computational approach can deliver parameter-aligned feedback across perception–production (P ↔ P). We synthesize four strands: (A) conventional evaluations and tasks (identification, same–different, imitation/read-aloud) that reveal robust tone-pair asymmetries and early P ↔ P decoupling; (B) physiological and behavioral instrumentation (e.g., EEG, eye-tracking) that clarifies cue weighting and time course; (C) audio-only speech analysis, from classic F0 tracking and MFCC–prosody fusion to CNN/RNN/CTC and self-supervised pipelines; and (D) interpretable learning, including attention and relational models (e.g., graph neural networks, GNNs) made transparent with explainable AI (XAI). Across strands, evidence converges on tones as time-evolving F0 trajectories, so movement, turning-point timing, and local F0 range are more diagnostic than height alone, and the contrast between Tone 2 (rising) and Tone 3 (dipping/low) remains the persistent difficulty; learners with tonal vs. non-tonal language backgrounds weight these cues differently. Guided by this synthesis, we outline a tool-oriented framework that pairs perception and production on the same items, jointly predicts tone labels and parameter targets, and uses XAI to generate local attributions and counterfactual edits, making feedback classroom-ready.

1. Introduction

Mandarin Chinese is a major world language with more than one billion first-language speakers and documented presence in 108 countries. This scale underscores both its global value and the need for scalable, data-driven support for second-language (L2) learning [1]. Mandarin Chinese is a tone language in which four lexical tones—Tone 1 (T1, high level), Tone 2 (T2, rising), Tone 3 (T3, dipping/low), and Tone 4 (T4, falling)—distinguish word meaning; for example: mā (mother), má (hemp), mǎ (horse), and mà (to scold). This function is carried by the fundamental frequency (F0) trajectory rather than by static pitch height alone [2,3,4,5]. For L2 learners (second-language learners), especially those from non-tonal L1s (first languages), accurate perception and production of these tones are central yet difficult skills.
Prior work can be divided into four complementary strands. Strand A covers conventional evaluations and tasks such as identification, same–different discrimination, and read-aloud or imitation. Strand B adds physio-behavioral sensors, including EEG/ERP and eye-tracking, to reveal the time course and cue weighting that underlie overt performance. Strand C focuses on audio-only speech analysis, ranging from classic prosodic and spectral features and reproducible Praat-based measurements to deep encoders (e.g., CNN/RNN/CTC/SSL) used as stand-alone baselines. Strand D wraps these models in relation-aware, explainable inference that links learners, items, and perception–production pairs, delivering parameter-aligned, instruction-oriented feedback. An overview of the proposed strands and framework is shown in Figure 1.
Tone learning hinges on two linked abilities: perceiving stable tonal categories and producing the F0 trajectories that realize them [9,10,11,12]. Since perception and production shape each other, we treat tone acquisition as a coupled perception–production (P ↔ P) problem supported by interpretable, parameter-aligned feedback. The discussion now turns to tone dynamics and their implications for L2 perception–production.

1.1. Tone Dynamics and L2 Perception–Production

Current views of F0 dynamics and L2 perception–production hold that lexical tone categories are realized as time-evolving F0 trajectories that gradually approach an abstract pitch target over the course of the syllable, rather than as static, discrete levels.
The quantitative target approximation (qTA) model of intonation [13], developed for Mandarin and English, formalizes this target-based perspective under physiological and contextual constraints.
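To convey the target-approximation intuition, the sketch below simulates an F0 contour that decays exponentially toward a linear pitch target over one syllable. This is a deliberately simplified first-order illustration, not the full qTA formulation (which uses a higher-order, critically damped system), and all parameter values are hypothetical.

```python
import numpy as np

def approach_target(duration_s=0.25, fs=100, f0_onset=220.0,
                    target_start=180.0, target_slope=120.0, rate=25.0):
    """Simplified first-order target approximation (illustrative only).

    The F0 trajectory starts at `f0_onset` (Hz, carried over from the
    preceding context) and decays exponentially toward a linear pitch
    target T(t) = target_start + target_slope * t, so the target is only
    gradually approximated within the syllable.
    """
    t = np.arange(0, duration_s, 1.0 / fs)       # time axis in seconds
    target = target_start + target_slope * t     # linear pitch target (Hz)
    f0 = target + (f0_onset - target_start) * np.exp(-rate * t)
    return t, f0

t, f0 = approach_target()
print(f"F0 at onset: {f0[0]:.1f} Hz, F0 at offset: {f0[-1]:.1f} Hz")
```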
Empirical timing studies further support this dynamic view, showing that F0 extrema (peaks/troughs) align systematically with segmental landmarks and vary with tone type and speech rate. In particular, rising tone peaks frequently occur after the host syllable, a phenomenon known as peak delay, whereas high tone peaks typically occur within the host syllable [2,14,15,16]. Corpus and laboratory evidence indicates that focus expands the F0 range and triggers post-focus compression, thereby altering both the temporal alignment and pitch excursion magnitude of tone contours [3,17,18]. Large-scale corpus analyses further reveal that, beyond focus effects, tone realization systematically varies with prosodic and positional factors such as speech style, word length, and prosodic position, and that both sandhi and neutral tone alignment exhibit consistent timing patterns in connected speech [19].
Taken together, these findings show that tonal trajectories are inherently dynamic, shaped by slope, the timing of the turning point, and context. Because second-language learners bring different temporal and contextual dependencies, these dynamics pose new challenges that call for perception–production experiments tracing how non-native speakers acquire and internalize tonal categories.
Over the last five years, L2 tone research has accelerated across behavioral, neurophysiological, and computational lines. High-variability perceptual training and classroom interventions incorporating visualized F0 traces (real-time pitch curves or spectrograms) and gesture cues have yielded reliable gains, particularly for the difficult T2/T3 contrast [20,21,22,23,24,25,26]. Cross-group comparisons confirm that L1 background shapes learning outcomes: tone-language learners and non-tone-language learners show different confusion patterns and improvement profiles in both perception and production, with mid-rising (T2) and dipping (T3) tones often forming the hardest pair [9,27].
Behavioral and electrophysiological studies reveal categorical perception for dissimilar contours but graded patterns for similar ones, along with longer latencies for non-native listeners [26,28]. Tone-language experience confers advantages in certain tasks. For example, Vietnamese listeners identify Mandarin tones more accurately than Russian listeners across levels, which is consistent with perceptual assimilation accounts [29]. Under noise, maintaining native-like tone patterns supports sentence intelligibility, and tone working memory predicts comprehension, highlighting the functional role of tones beyond isolated syllables [30,31]. Classroom and field studies document L1-specific strategies and error patterns: for instance, Korean or Central Asian learners adjust intensity and duration similarly to native speakers in noise but diverge in F0 adjustments; different L1s weight turning point location versus onset/offset contour shape when distinguishing T2 from T3 [32].
The perception–production (P ↔ P) link is teachable. High-variability identification and AX (same–different) discrimination training deliver sizable perception gains that generalize across talkers and items, demonstrating suprasegmental plasticity [10,33]. Complementary multimodal techniques that externalize movement and timing—such as visualizing one’s F0 trace and gesturing the rise/fall—improve both perception and production, with spectrogram-based displays often outperforming static graphics [34,35,36].

1.2. Computational Modeling Progress

On the audio-only front, recent systems learn tonal dynamics directly from waveforms or time–frequency representations. End-to-end convolutional and recurrent neural networks (CNN/RNN) with connectionist temporal classification (CTC) objectives, and self-supervised learning (SSL) approaches such as wav2vec 2.0 fine-tuned with CTC, have significantly improved recognition of suprasegmentals (syllables, tones, pitch accents), especially when combined with language modeling [37,38].
The classical F0-only pipeline has been overtaken by feature-fusion approaches that combine F0 dynamics with spectral features (e.g., MFCCs), yielding markedly more robust tone classification and mispronunciation detection [39]. Even compact random-forest (RF) models with optimized fusion features achieve strong accuracy and generalization on the SCSC corpus while remaining light enough for classroom tools [38,40,41]. Beyond audio-only pipelines, physiological and behavioral instrumentation such as EEG and eye-tracking has further extended our knowledge of tone processing. One EEG study dissociates the acoustic dependencies of syllables and tones, indicating that tone perception relies mainly on temporal fine structure (TFS), whereas syllable identity tracks the amplitude envelope (ENV) with robust delta-band activity; tone tasks additionally recruit prefrontal/temporal regions, suggesting that tone processing is more attention-sensitive.
Moreover, EEG decoding studies have shown that both perceived and imagined Mandarin tones can be decoded above chance levels, and that combined visual–auditory stimulation may improve classification accuracy relative to single modalities, reflecting benefits of cross-modal processing that could be harnessed by educators [42].
Recently developed graph-based methods model tones in a non-Euclidean setting, integrating linguistic structural features with acoustic cues. Graph Neural Network (GNN) architectures coupled with bidirectional LSTMs (BiLSTMs) can exploit cross-relationships between learners, items, tasks, and contexts that traditional sequence models ignore [43], thereby detecting prosodic events more effectively and offering richer possibilities for learner-corpus modeling [44].

1.3. Review Protocol and Reporting Framework

A systematic search and screening procedure was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement [45], covering publications from 2015 to 2025 (inclusive). Primary searches were run in Google Scholar and China National Knowledge Infrastructure (CNKI). Google Scholar was additionally used for forward and backward citation chasing to refine terms and to check search saturation; records identified via citation chasing were merged into the main screening set and processed under the same eligibility criteria as database-retrieved records. Complete search strings, query variants, date limits, and filters are reported in Appendix A.
Eligible records were peer-reviewed journal articles and full conference papers that focused on Mandarin lexical tones (T1–T4) in L2 learners and reported perception, production, or P ↔ P outcomes, including studies that proposed interpretable/mechanistic computational approaches supporting parameter-aligned evaluation targets (e.g., turning-point timing, slope, F0 range). Studies were required to report sufficient methodological and outcome details for extraction and appraisal. We excluded non-Mandarin or non-L2 studies, theory-only or non-archival items, abstracts/posters, inaccessible full texts, and duplicate or superseded reports.
Records were imported into EndNote 2025 [46] and de-duplicated using DOI plus title matching. The identification process is summarized in the PRISMA 2020 flow diagram (Figure 2), generated using the PRISMA2020 Shiny web app (based on the PRISMA2020 R package, v1.1.1) [47]. In total, 245 records were identified. After removing 88 duplicates, 157 records were screened by title/abstract and 41 were excluded. Full texts of 116 reports were assessed for eligibility; all full texts were successfully retrieved. At full text, 62 reports were excluded for the reasons listed in Figure 2, and 54 studies were included in the review. A reporting-based methodological audit rubric is provided in Appendix B (Table A1).
A washout period of 14 days was inserted between two screening passes conducted by the same reviewer, and intra-rater agreement was computed as an audit of screening reliability. The second pass was conducted after the washout using a fresh export, without consulting the pass-1 decisions. Intra-rater agreement between the two passes was quantified using Cohen’s kappa (κ). At the title-and-abstract stage, all screened records (n = 157) were re-screened in the second pass, yielding κ = 0.94 (a = 43, b = 2, c = 2, d = 110). At the full-text stage, a stratified audit subsample (n = 62) was re-assessed in the second pass, yielding κ = 0.93 (a = 20, b = 1, c = 1, d = 40). Any discordant cases between passes were re-checked against the eligibility criteria and resolved prior to data extraction.
Table 1 summarizes the 2 × 2 decision matrix comparing Pass 1 versus Pass 2 decisions from the same reviewer. Include is treated as the positive decision, and the table defines a, b, c, and d and gives the formulas for observed agreement, chance agreement, and κ [48,49].
$$N = a + b + c + d$$
$$P_o = \frac{a + d}{N}$$
$$P_e = \frac{a+b}{N}\cdot\frac{a+c}{N} + \frac{c+d}{N}\cdot\frac{b+d}{N}$$
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
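As a worked check of these formulas, the short sketch below computes κ from the title-and-abstract counts reported above (a = 43, b = 2, c = 2, d = 110).

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 include/exclude decision matrix.

    a: include/include, b: include/exclude (pass 1 / pass 2),
    c: exclude/include, d: exclude/exclude.
    """
    n = a + b + c + d
    p_o = (a + d) / n                                    # observed agreement
    p_e = ((a + b) / n) * ((a + c) / n) \
        + ((c + d) / n) * ((b + d) / n)                  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Title-and-abstract screening counts reported above
print(round(cohens_kappa(43, 2, 2, 110), 2))  # -> 0.94
```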
In this review, no inclusion or exclusion decision was made or overturned by AI. LLM-assisted screening has been reported in prior work, for example by Syriani, David, and Kumar (Journal of Computer Languages, 2024) [50]. In the present review, generative AI was limited to brainstorming synonym candidates and formatting query variants; all search strings, screening decisions and data extractions were performed and verified manually by the authors.
For each included study, the following were extracted: publication year, L1 background, sample size, task type, modality, features and models, parameter-aligned outputs, evaluation metrics, key findings, and instructional implications. A structured risk-of-bias and methodological quality appraisal was conducted using a compact checklist adapted for heterogeneous behavioral and computational study designs. The checklist covered participant characterization and proficiency control, task and stimulus specification, measurement reliability, leakage control and evaluation protocol for computational models (where applicable), metric definition and reporting, and transparency needed for replication. Appraisal outcomes were used to qualify the strength of evidence and to interpret inconsistent findings across included studies; however, full texts that lacked sufficient methodological reporting to support replication and risk-of-bias appraisal were excluded at the eligibility stage (criterion iv). The checklist, item-level coding rubric, and study-level/domain-level summaries (including rating distributions) are provided in Appendix B and are referenced in the Discussion section when interpreting heterogeneous findings.
As a robustness check, we extended screening from the first 10 pages to 20 pages per query and performed forward and backward citation checks; this did not yield additional eligible unique studies beyond the final de-duplicated set.

1.4. Motivation and Scope of This Review

While progress has been considerable, many L2-learner-facing tools still return inscrutable labels and treat L2 tokens independently, thereby overlooking the inherent relations present in learner corpora (e.g., same learner, same item, or matched perception–production pairs). Moreover, rather than a bare label for a confusable contrast (e.g., T2 vs. T3), instructors would find it more helpful to receive quantitative diagnostics broken down by the parameters underlying tonality, such as slope in semitones/s, turning-point timing relative to rhyme onset, and F0 range.
In addition to these broader trends, the present review is motivated by the author team’s complementary expertise in experimental phonetics and sequence modeling. Prior experimental phonetic work has provided an empirical and methodological foundation for Mandarin segmental and tonal acquisition. This includes studies of tone learning in Cantonese background learners and in Chaozhou learners, together with reproducible Praat-based workflows for acoustic measurement and parameter reporting [51,52,53]. In parallel, recent methodological work has developed time series-focused neural architectures for modeling sequential human behavioral signals, offering a transferable basis for structured sequence modeling and parameterized learning analytics [54]. Taken together, these foundations motivate a review that connects evidence on tone dynamics with modeling choices that can support interpretable and instruction-oriented feedback.
Across the literature synthesized here, three priorities emerge for future modeling.
  • Relational modeling—Encode links among learners, items, tasks, and modalities to share information across repeated attempts and matched P ↔ P pairs; GNNs and attention mechanisms are natural fits [37,55,56,57,58,59].
  • Parameter-aligned outputs—Report slope, turning point timing, and F0 range alongside tone labels, consistent with the temporal view of tones and with what teachers train in practice [19,20,60].
  • Explainable AI (XAI)—provide direct attributions (e.g., reliance on a late peak timing) and counterfactual edits (the minimal slope increase needed to flip T3 → T2) that correspond to teachable parameters.
Prior work spans the four interlocking strands summarized in Figure 1, but findings remain dispersed and evaluation practices are heterogeneous. Studies differ in whether they report label-based or parameter-based metrics, and in whether they adopt token-wise or structure-aware designs. A systematic review is therefore needed to consolidate what each strand reliably shows about tone dynamics and perception–production learning, to identify gaps that block classroom-ready feedback, and to align the evidence base with the outcomes and evaluation protocols used in this paper.
Building on these motivations, the next section surveys the empirical and computational landscape across four strands A–D, and we return to this map throughout. To guide the reader, Figure 3 outlines the paper’s structure and how evidence from these strands is integrated.

2. The Empirical Landscape for L2 Mandarin Tones: Measures, Sensors, and Models

Based on the motivations described above, this section surveys the empirical and computational landscape across four complementary strands (A–D) that are referenced repeatedly in the remainder of the review. Strands A–C summarize established empirical and modeling practices, whereas strand D is nascent in the present corpus (two studies) and is framed as a forward-looking methodological direction that integrates relational inductive bias and explainability to support parameter-aligned, classroom-actionable feedback. Although two mechanistic/explainable exemplars are included, none of the included studies meets our full criteria for a classroom-ready, item-matched perception–production (P ↔ P) framework; therefore, we treat strand D as an evidence gap and as the design target of Section 4.
Prior work on L2 Mandarin tones can be organized into four complementary strands: (A) conventional evaluations and tasks, such as identification/AX (same–different), read-aloud/imitation, and human ratings, paired with acoustic parameters such as F0 slope, turning-point (TP) timing, and F0 range; (B) physio-behavioral instrumentation (EEG/ERP, eye-tracking) that reveals time course, cue weighting, and processing constraints; (C) audio-only analysis (non-instrumented) using classic prosodic/spectral features and self-supervised representations (e.g., wav2vec 2.0) as stand-alone baselines; and (D) proposed relation-aware and explainable modeling, where CNN/RNN/CTC/self-supervised learning (SSL) act as front-end encoders, and graph-/attention-based relation layers plus explainable artificial intelligence (XAI) deliver parameter-aligned feedback in item-matched perception ↔ production corpora. Direct applications of strand D to L2 Mandarin tones remain limited; accordingly, strand D is treated as a gap in the literature and as the design target of the framework developed in Section 4.
Table 2 lists representative studies and methodological exemplars in each strand; it is representative rather than exhaustive. By prioritizing transparent, reproducible work, we may under-represent industrial systems and non-archival venues; full search strings and coding rubrics are listed in Appendix A for replication. Representative exemplars of perceptual training and neurophysiology/EEG are summarized in Table 2 [10,27,61,62]. Building on this curated evidence base, the following subsections unpack each strand in turn and distill the recurring dimensions—slope (movement), turning-point timing, and F0 range—that link empirical findings to perception–production (P ↔ P) learning. This positioning treats deep encoders as modules that can serve either strand C (independent tokens) or strand D (encoders embedded in relations plus XAI), depending on whether relations and explanations are explicitly modeled [63,64,65,66].
Figure 4 indicates a clear post-2018 increase in strand A studies, with strands B and C appearing intermittently across the decade. Strand D is present but remains scarce (two studies), represented by mechanistic or explainable computational modeling of tonal encoding [73,74]. Importantly, while these studies demonstrate the feasibility of mechanistic accounts and computational explanations for tone encoding, none of the included studies yet satisfy our full criteria for a classroom-ready, item-matched perception–production (P ↔ P) framework with relation-aware, pedagogically actionable explainability. This gap motivates treating strand D as a design target and motivates the framework proposed in Section 4.
Table 3 summarizes methodological trade-offs across strands A–D, including typical inputs and outputs, strengths, limitations, and classroom feasibility. Strand D is included but remains nascent, and we treat it as a design target because no included study yet satisfies the full criteria for classroom-ready, item-matched P ↔ P modeling with interpretable and pedagogically actionable outputs.

2.1. Conventional Evaluations and Tasks

Task formats and rationale. Building on the background outlined in Section 1, behavioral assessments form the empirical backbone of L2 tone research. Most classroom and lab studies evaluate tone perception via ID and AX discrimination, sometimes supplemented with same–different or lexical decision tasks; production is elicited by imitation or read-aloud and then rated by native judges and/or analyzed acoustically (e.g., F0 slope in semitones/s, peak/trough timing from rhyme onset, and F0 range) [10,75]. These tasks reveal which acoustic cues learners prioritize and when categorical-like perception emerges.
Evidence across studies shows that perception improves rapidly whereas production lags. High-variability identification (HVPT) produces robust, generalizable perceptual gains, and transfer to unseen speakers and items is reliable. However, changes in production are typically smaller and delayed, especially in early learning phases [10].
Building on this general pattern, four recurrent observations emerge from L2 tone studies. First, tone-pair asymmetries matter: behavioral and ERP data both show that learners discriminate very different contours (for example, T1 vs. T4) more easily than similar rising or low-rising contours (T2 vs. T3), which is in line with sensitivity to contour shape and to the timing of the turning point [75,76]. Second, L1 background matters: learners whose first language is tonal (e.g., Cantonese, Thai) tend to outperform learners from non-tonal languages (e.g., English, Japanese) on Mandarin tone identification, but they also carry over their own L1-specific cue-weighting strategies [77,78]. Third, context matters: when tones are processed in connected speech or in visually guided word-recognition tasks, competition with segmental information increases, and the extra alignment and sandhi requirements in disyllabic or sentential contexts lower accuracy compared with isolated monosyllables [79,80]. Fourth, instruction can target these weaknesses: visual–auditory training that makes pitch movement and timing explicit—for example, showing a learner’s F0 trace against a native template or using iconic pitch gestures—consistently improves both perception and production, with the clearest gains on the difficult T2 (rising) vs. T3 (dipping/low) contrast [81,82].

2.2. Physiological and Behavioral Instrumentation (Contact-Based/Contactless)

Whereas strand A reports overt behavioral outcomes, strand B investigates the neural and attentional processes that give rise to those overt outcomes, using physiological and behavioral instruments to see how learners process tones in real time. The basic idea is that overt accuracy tells us what learners got right or wrong, but EEG, ERP, eye-tracking, or video-based tracking can tell us when tone-related information is picked up, which cues are weighted, and whether prior tone experience changes the time course of processing. In this sense, physiological evidence complements behavioral tasks by providing converging and temporally precise indicators that pure identification or same–different tasks cannot supply.
Researchers typically adopt two measurement routes. One is contact-based, including EEG/ERP (sometimes together with EOG/EMG), to index neural or articulatory correlates of tone perception and control. The other is contactless, such as eye-tracking or RGB/infrared video, to capture attention and visible prosodic behavior without interfering with speech production. Both routes are useful for L2 tone research because they make it possible to compare learners from tonal and non-tonal backgrounds on the same stimuli and within the same time windows.
Figure 5 illustrates a typical ERP analysis pipeline based on a pitch-change discrimination paradigm, rather than a lexical-tone identification task. Panel A shows grand-averaged ERPs at representative electrodes for change vs. no-change trials; panel B maps difference-wave activity across time windows (≈50–100 ms, 125–250 ms, 350–550 ms). In this paradigm, an early positivity (P50, around 50–100 ms after stimulus onset) reflects very fast detection of pitch change. A little later, the N1 component (a negative-going wave at about 100–150 ms) and the P2 component (a positive-going wave at about 180–250 ms) index more detailed auditory and attentional processing of the same change. Tone-language speakers tend to show a stronger early P50-like response, whereas non-tonal speakers show more pronounced later N1/P2 (sometimes labelled N1c) activity, suggesting that prior tone experience shifts when the system becomes sensitive to F0 changes [83]. This example motivates the time-windowed analyses we use later for P ↔ P coupling.
Findings from mismatch-negativity (MMN) studies point in the same direction. When the tone change crosses a phonological boundary, MMN responses are categorical or near-categorical; when the change stays within a category or involves two very similar rising contours (such as the T2 vs. T3 pair), the MMN is much weaker. This neural asymmetry mirrors the behavioral difficulty with T2–T3 and provides independent, pre-attentive evidence for contour-shape and alignment sensitivity [75].
More recent contact-based studies using electroencephalography (EEG) have gone further by dissociating the acoustic and representational sources of tone-related effects. On the perception side, Ni et al. [27] used auditory-chimera speech to dissociate temporal fine structure (TFS) from the amplitude envelope (ENV) in Mandarin. Fifteen participants performed tone/syllable perception under three acoustic conditions—original speech, ENV-only (speech envelope modulating a sinusoidal carrier), and TFS-only (speech fine structure with a white-noise envelope)—while EEG was recorded. Analyses of auditory-evoked potentials (AEP; early N1 and later P2), power spectral density (PSD), and task-state microstates suggested a functional dissociation: syllable perception relied more on ENV, whereas tone perception relied more on TFS. In this paradigm, tone-sensitive effects around the P2 time range (≈200–330 ms) and delta–theta activity over temporal–prefrontal regions provide a candidate temporal anchor for subsequent time-resolved analyses. Complementing this perception-focused evidence, Chen et al. [84] examined Mandarin Tone 3 sandhi in a phonologically primed picture-naming task and showed that category-level (underlying) and context-specific (surface) tonal information are encoded at different stages, with distinct ERP signatures in stimulus-locked (≈320–550 ms) and response-locked windows (including a positive effect at −500 to −400 ms and a later negative effect at −240 to −100 ms). Together, these findings motivate our use of time-localized saliency windows and stage-specific counterfactual targets—implemented in practice as local attribution windows (e.g., around acoustically defined turning points such as TP ± 50 ms) and controllable pedagogical edits (e.g., advancing the peak or steepening the rise)—within the proposed modeling framework.
Contactless evidence supports this view. Eye-tracking studies reveal that tone information constrains lexical access before vowel or rime onset in visual-word tasks, especially when a supportive context is available [85].
Finally, these physiological and behavioral metrics provide a reference layer for the relation-aware and explainable models described later in strand D. If an interpretable model attributes a learner’s T2/T3 error to a late turning point, we can check whether the same learner also shows delayed ERP components or misaligned gaze patterns. Such cross-level consistency strengthens the claim that model attributions are not only numerically correct, but also cognitively and pedagogically meaningful [27,85].

2.3. Audio Speech Analysis

Unlike strands A and B, which assess human performance and often rely on sensor-based measures to recover the time course of tone processing, this strand works directly on the speech signal and needs no extra hardware. The input is an acoustic recording—typically isolated syllables, read disyllables, or short connected speech—and each token is usually treated as an independent sample. This makes audio-only methods easy to deploy in classroom or low-resource settings, but it also means they rarely exploit the relational structure that exists in L2 tone corpora (the same learner across items, the same item across learners, paired perception–production trials) [86].
A key enabler of strand C is the availability of learner corpora with sufficient scale and metadata. For instance, Chen et al. (2016) report a large-scale characterization of non-native Mandarin produced by speakers of European origin, illustrating the extent of cross-learner variability and the need for standardized annotation and evaluation protocols [87]. Such corpora provide the empirical basis for developing and validating automatic tone assessment models.
A typical starting point in this strand is feature-based modeling. Studies extract prosodic cues that are known to be tone-relevant in Mandarin—F0 contour, rise/fall slope, local F0 range, and the timing of the turning point—and fuse them with spectral descriptors such as MFCCs. On this fused representation, random-forest or shallow neural classifiers achieve better tone recognition than using F0 alone, because the prosodic features capture the shape of the tone while the spectral features stabilize recognition across speakers and recording conditions [41]. On top of these engineered features, several works have trained purely acoustic neural models—CNNs, RNNs, or CNNs with connectionist temporal classification (CTC)—to classify tones from spectrogram or filter bank inputs, still in a token-by-token fashion and without learner/task relations [63,64,88]. When a short window of left and right context is added to such end-to-end pipelines, performance on tones in connected speech improves, since coarticulation and post-focus compression become available to the model and tone realization can be disambiguated more reliably [65].
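As an illustration of this feature-fusion pipeline, the sketch below extracts F0-contour descriptors (overall slope, a crude turning-point time, local range) plus MFCC statistics with librosa and feeds them to a scikit-learn random forest. The file list `train_paths`, the labels `train_labels`, and all thresholds are hypothetical placeholders rather than settings from the cited studies.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def tone_features(wav_path, sr=16000):
    """Fuse F0-contour descriptors with MFCC statistics for one syllable token
    (assumes the token contains voiced frames)."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)   # F0 track (Hz)
    times = librosa.times_like(f0, sr=sr)                       # frame times (s)
    mask = ~np.isnan(f0)
    f0, times = f0[mask], times[mask]
    st = 12 * np.log2(f0 / np.median(f0))                       # semitones re: median
    slope = np.polyfit(times, st, 1)[0]                         # overall slope (st/s)
    turning = times[np.argmin(st)]                              # crude turning-point time (s)
    f0_range = st.max() - st.min()                              # local F0 range (st)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    spec = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    return np.concatenate([[slope, turning, f0_range], spec])

# Hypothetical token paths and tone labels (1-4)
X = np.stack([tone_features(p) for p in train_paths])
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, train_labels)
```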
A more recent direction in this strand is to use self-supervised learning as a front end. By fine-tuning wav2vec 2.0 on Mandarin, it is possible to obtain tone-sensitive representations even with limited labeled data. Layer-wise analyses, however, show that suprasegmental information for Mandarin and English tends to emerge in lower or middle layers, so tone-aware systems either tap those layers directly or add a small tone-specific head on top [40]. This makes SSL-based front ends a good candidate to be plugged later into the relation-aware, explainable models discussed in the next strand.
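A minimal sketch of such an SSL front end is shown below, using the Hugging Face `transformers` implementation of wav2vec 2.0. The checkpoint name, the choice of layer 6 as a "middle" layer, and the small linear tone head are illustrative assumptions, not the configurations of the cited works.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# A publicly available checkpoint; a Mandarin-adapted model could be substituted.
name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
encoder = Wav2Vec2Model.from_pretrained(name, output_hidden_states=True).eval()

def mid_layer_embedding(waveform_16k, layer=6):
    """Mean-pooled representation from a middle transformer layer, where
    prior work suggests suprasegmental (tone) information concentrates."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    return out.hidden_states[layer].mean(dim=1)       # shape (1, hidden_size)

# One second of hypothetical 16 kHz audio and a small tone head (4 Mandarin tones)
waveform = np.random.randn(16000)
tone_head = torch.nn.Linear(encoder.config.hidden_size, 4)
logits = tone_head(mid_layer_embedding(waveform))
```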
In this review we keep these CNN/RNN/CTC/SSL systems inside strand C only when they are used in this independent, audio-only way. Once a model starts to encode learner–item–trial relations or to produce explainable, parameter-aligned feedback, we treat it as part of the relation-aware strand rather than as a pure acoustic baseline [64,65,88]. The main limitation of the present audio-only studies is therefore conceptual: most of them assume i.i.d. tokens and output only a coarse tone label (T1–T4). Very few attempt to estimate the parameters that teachers correct—slope, turning-point timing in milliseconds from rhyme onset, or effective F0 range—and they do not use the repeated, learner-linked structure of L2 datasets. These gaps are precisely what motivate the relation-aware, explainable modeling in the following section.

2.4. Relation-Aware and Explainable Modeling

Most audio-only tone recognizers still make an i.i.d. assumption: every token is fed to an encoder, a tone label is returned, and the system moves on. This is convenient but it does not match how L2 tone corpora are collected. The same learner will repeat the same items in different tasks (ID/AX vs. read-aloud), several learners will pronounce the same item, and perception–production pairs are naturally aligned by item and session. In other words, the data are relational. Notably, mechanistic psycholinguistic modeling also treats tone as part of a structured recognition process rather than an isolated label; for example, Shuai and Malins (2017) implemented lexical-tone encoding in jTRACE to simulate spoken-word recognition in Mandarin, reinforcing the need to model structured dependencies beyond token-wise classification [73]. Graph-based and attention-based neural networks are designed precisely for this setting: they can treat “who produced what, in which task, and paired with which perception result” as first-class learning signals and let information flow across these links. Recent surveys on graph neural networks and dynamic graphs in speech and language show that such structures are learnable and stable for downstream tasks [56,57,58], and the original graph convolutional and graph attention formulations give us a light-weight way to implement this for L2 tones [68,69].
In the framework we propose, the encoder from Section 2.3 is kept. A CNN, RNN, CTC model, or a self-supervised front end such as wav2vec 2.0 still turns each utterance into an embedding. The difference is that this embedding is no longer the final step. It is inserted into a heterogeneous graph whose nodes represent learners, items, tokens, task types (ID, AX, imitation/read-aloud), and even modalities (perception vs. production). Recent computational modeling of Mandarin tonal encoding in disyllabic word production further motivates representing tone as a structured, stage-sensitive process rather than an isolated output label [74]. In our graph, this perspective is operationalized by separating nodes for items/tokens and by allowing message passing across learner–item–task links, which can capture stage- and context-dependent influences. Message passing on this graph lets the model borrow information from the same learner across tasks, from the same item across learners, and from perception–production pairs of the same item. A small attention head on perception–production edges can learn when perception performance should influence the interpretation of a production attempt. Similar strategies have already been used to model prosodic structure in expressive TTS and to decode auditory attention from EEG connectivity graphs, which indicates that prosodic and physiological relations are learnable in practice [43,44,56,89].
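To make this relational structure concrete, the sketch below builds a small heterogeneous graph with PyTorch Geometric's `HeteroData`; node counts, feature dimensions, and edge indices are hypothetical placeholders.

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Node features (hypothetical dimensions): learners, items, and trial-level tokens
data["learner"].x = torch.randn(30, 8)      # L1 background, proficiency, aptitude, ...
data["item"].x    = torch.randn(120, 16)    # canonical tone, segmental context, position
data["token"].x   = torch.randn(2400, 256)  # embedding from the strand-C front end

# Relation-specific edges (indices are placeholders)
data["learner", "produced", "token"].edge_index = torch.tensor([[0, 0, 1], [0, 1, 2]])
data["token", "realizes", "item"].edge_index    = torch.tensor([[0, 1, 2], [5, 5, 7]])
data["token", "pp_pair", "token"].edge_index    = torch.tensor([[0], [1]])     # perception <-> production pair
data["token", "acoustic_knn", "token"].edge_index = torch.tensor([[0, 2], [2, 0]])

print(data)
```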
A second ingredient is interpretability. For classroom use, it is not enough to return a bare verdict such as "T3 is wrong"; teachers need to know what to correct. Therefore, the relation-aware part is followed by an explainability part based on widely used model-interpretation methods for speech and audio [71,72,90,91,92]. Concretely, we ask the model to predict not only the tone label, but also the three dynamic parameters that this review has highlighted throughout: rise or fall slope, turning-point time, and local F0 range. Target approximation theory and ERP/MMN work both indicate that these parameters are the ones that distinguish tones and that teachers correct in real classrooms [10,75]. On top of that, the explainability module produces token-level time saliency (for example, the model may locate the error mainly after vowel onset) and a minimal counterfactual edit such as "advance the peak by about 80 ms" or "steepen the rise". This turns model decisions into instructions.
In this way, the four strands connect. Evidence from strand A shows that once movement and timing are made explicit through visual or gesture-based instruction, learners can improve exactly on those dimensions [35,81,82,93]. Evidence from strand B supplies temporal anchors, for example the P2 window in EEG or the early tone-driven fixations in eye-tracking, so that we can check whether a learner who is flagged by the model for late turning points in a T2 vs. T3 confusion also shows delayed ERP or gaze responses [27,85]. Strand C provides strong audio-only encoders on independent samples [37,63,64,65,88]. Strand D, which is the present section, wraps those encoders in relation-aware inference and adds explainable AI, so that the final output is tone-specific, parameter-aligned, and ready to be used in teaching.
In short, strand C provides strong stand-alone encoders. Strand D keeps those encoders but exploits the relational structure of L2 datasets and adds interpretability. The result is a model that still performs well but now tells us which learners, on which items, need which change, and in which time window.
In summary, L2 Mandarin tone research spans three evidence-rich strands (A–C) and an emerging, proposed strand (D). The evidence from conventional behavioral tasks, acoustic analyses, and physio-behavioral sensors converges on dynamic parameters (slope, turning-point timing, local F0 range) and on the perception–production (P ↔ P) link, motivating parameter-aligned, relation-aware, and explainable tools that can be operationalized in pedagogy.

3. Empirical States of T1–T4 Learning

Building on Section 2, recent evidence across strand A (conventional tasks), strand B (instrumentation), and strand C (audio-only analysis) converges on two themes: (i) early L2 outcomes hinge on movement and timing (slope; turning-point alignment), not pitch height alone; and (ii) the perception–production (P ↔ P) link strengthens as learners acquire those parameters. These themes motivate the graph-based + XAI approach in Section 4 that reports parameter-aligned feedback rather than opaque labels.

3.1. Perception–Production Patterns

Across the current L2 literature, perception training improves production, but the link is weak and highly individual (the degree of coupling varies across learners). Skill-specific production training transfers only marginally to tone perception, whereas perception training also yields benefits for tone production [11,12]. Classic high-variability training produces reliable perceptual gains that transfer to new talkers and items and establish a strong perceptual-plasticity baseline for tones [10,94].
At the acoustic-parameter level, L2 productions judged “categorically correct” often diverge from native targets in slope and turning-point timing; parameterized analysis therefore reveals residual gaps that label-only scoring misses [12,95]. Persistent T2–T3 difficulty is robust across proficiency levels [95].

3.2. Cross-Linguistic Transfer: Tonal vs. Non-Tonal L1s

Comparisons of tone-naïve and tone-experienced learners indicate differences in cue weighting and context sensitivity: tone-naïve listeners tend to undervalue movement and alignment and to over-rely on pitch height, whereas tone-experienced listeners' advantages shrink in more contextually demanding situations [95,96]. Most recently, a cross-context investigation shows that phonological context, such as disyllables and sandhi environments, degrades tone-naïve listeners' performance relative to isolated monosyllables [97].
For non-tonal L1s (e.g., Spanish-like profiles), initial reliance on absolute height and slower mastery of rise/fall dynamics are consistent with experience-based accounts of cue re-weighting; training rapidly improves perception but production gains lag [10,11,98]. For tonal L1s (e.g., Vietnamese-like profiles), advantages in basic identification coexist with challenges in connected-speech realization (alignment, sandhi), echoing the broader cross-linguistic pattern [12,95].

3.3. Multimodal and Classroom-Ready Scaffolds

Meta-analytic and experimental work shows that audiovisual/gesture-based designs and F0 visualization help beginners by externalizing movement and timing; benefits are strongest on T2/T3 [10,11,94,98,99,100]. Additionally, eye-tracking provides evidence that listeners differ in how they distribute their attention to tone versus vowel information depending on the context, which therefore bolsters the logic of movement-centric feedback [101].
More broadly, perceptual training works, but cueing matters. High-variability identification training produces durable perception gains that generalize to new talkers and syllables, establishing a baseline for plasticity at the suprasegmental level [10,33].
Gesture-based scaffolds help make movement and timing cues salient. In particular, pitch-iconic gestures externalize rise/fall dynamics and turning-point alignment, and have been shown to improve beginners’ tone perception and production, with especially robust benefits for the T2–T3 contrast [35,93].
Self-overlay visualization—displaying a learner’s own F0 in (quasi) real time against a native template—can yield larger gains than traditional or static feedback for approximating native-like F0 height and contour [82,102].
Beyond gesture and self-feedback, converging evidence indicates that visible head/neck motion and audiovisual cues convey lexically relevant pitch information, suggesting that visual channels can facilitate tone recognition and may partly explain variability across listening conditions [20,98].
Design considerations for our Spanish-L1 versus Vietnamese-L1 study of tone learning and production involve four main methodological decisions, detailed in the next subsection.

3.4. Design Consequences for Our Spanish-L1 and Vietnamese-L1 Study

First, to quantify item-level perception-to-production (P → P) coupling over and above global correlations [103], future studies should pair perceptual tasks (e.g., identification and AX discrimination) and production tasks (e.g., read-aloud and imitation) on identical, item-matched stimuli. This design enables direct comparisons between perceptual accuracy and the corresponding production outcomes at the level of each linguistic item.
Second, beyond categorical accuracy and confusion matrices [12], we recommend reporting parameter-aligned production metrics that capture tonal dynamics, including slope, turning-point timing, and F0 range. Such measures provide richer diagnostic information than label-only scores and are better aligned with classroom-relevant feedback.
Third, analyses should be stratified not only by L1 background (Spanish vs. Vietnamese) but also by phonological context (mono- vs. disyllabic words) and prosodic conditions (e.g., focus), because these factors are expected to modulate cue-weighting profiles and the T2/T3 difficulty locus [95,96,99].
Fourth, these behavioral metrics can serve as supervision signals for a relation-aware, graph-based computational model, with model explanations triangulated against neurophysiological anchors from Strand B (e.g., attenuated pre-attentive responses for T2–T3 discrimination) [75,76,104].
A practical implementation can combine AX and 4AFC identification with read-aloud and imitation tasks using identical wordlists, enabling computation of tone-wise accuracy, confusion patterns, acoustic parameters, and P ↔ P coupling indices per item and per learner.
Taken together, across strands A–C, T1–T4 learning appears to be driven mainly by contour movement and timing (slope, turning-point alignment) rather than by height, and the P ↔ P link strengthens as these parameters are acquired, modulated by L1 background and phonological context. Multimodal scaffolds combining gesture-based and audiovisual cues with F0 self-overlay particularly boost the T2–T3 contrast, motivating our item-matched P ↔ P design with parameterized metrics (slope, TP timing, F0 range) used to supervise the relation-aware, explainable model in Section 4.

4. Explainable Modeling for L2 Mandarin Tones (P ↔ P)

We bridge SLA theory and computational implementation through three key mappings: (1) L1 transfer effects encoded as node attributes, (2) perception–production coupling via bidirectional message passing, and (3) contextual sensitivity through phonological environment features. This framework ensures our graph-based model (Section 4.1) maintains cognitive validity while providing pedagogically meaningful explanations validated in Section 4.2.4.

4.1. System Overview

We adopt a modular stack: Encoder E (CNN/RNN/CTC/SSL) → Relation layer H (GNN/GAT/relational attention over learner–item–trial edges) → Task heads (tone classification + parameter regression) → XAI (feature attributions; counterfactuals). Encoders supply token-level embeddings; relations capture cross-token structure; XAI returns slope/turning-point-aligned rationales [63,65,68,88].
L2 tone datasets are inherently relational: each learner produces many tokens, the same items recur across tasks and sessions, perception and production can be paired on identical stimuli, and acoustically similar tokens form neighborhoods. Treating tokens as i.i.d. ignores these links and returns opaque labels, which limits both inference and pedagogy. Graph neural networks (GNNs) pass messages over such links and consistently improve learning on non-Euclidean structures (e.g., dependency or social graphs) [59,68,69].
At the same time, modern audio-only pipelines already capture tonal dynamics: fusing F0-centric prosodic cues with spectral features (e.g., MFCCs) raises tone accuracy over prosody-only or cepstrum-only models [105]. Recent self-supervised models (wav2vec 2.0 + CTC) further reduce tone error rates on standard corpora (e.g., Aishell/Hub-4), and even benefit from multi-tasking with English phoneme recognition [37]. These advances define strong baselines but leave two critical gaps open: (i) no relational inference over learner–item–task structure; and (ii) little parameter-aligned feedback (e.g., slope, turning-point timing) for teaching.
We therefore model tokens on a heterogeneous graph and couple tone labels with parameter-level outputs, opening the model with XAI (feature attributions + counterfactuals) so feedback maps to what teachers correct (slope, turning-point, range).

4.2. Heterogeneous Relational Graph and Learning

We model the dataset as a heterogeneous relational graph that links learners, linguistic items, and their trial-level responses across sessions and modalities. To avoid ambiguity between tone and trial, each trial/response is denoted by R. Table 4 consolidates, in one place, the notation for node types, edge relations, model outputs, and evaluation metrics used throughout Section 4.

4.2.1. Message Passing

We apply relation-specific message passing (Relational GCN/Graph Attention; R-GCN/GAT) so that information propagates differently along same-item, same-learner, perception–production (P ↔ P), and acoustic/SSL-kNN links [68,69]. For node $v$ at layer $k$, one update step is
$$h_v^{(k)} = \sigma\!\left( W_0^{(k)} h_v^{(k-1)} + \sum_{e \in E} \sum_{u \in N_e(v)} \frac{1}{c_{v,e}}\, W_e^{(k)} h_u^{(k-1)} \right)$$
Here, $E$ is the set of edge types, $N_e(v)$ denotes the neighbors of $v$ under edge type $e$, $c_{v,e}$ is a degree-based normalizer, and $\sigma$ is a nonlinearity.
Front-end encoders E ∈ {CNN, RNN, CTC, SSL} initialize node features before relation-specific propagation, ensuring that the same encoders used as strand C baselines are also embedded inside strand D relation-aware inference [59,68,69].
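A minimal relation layer consistent with the update above can be written with PyTorch Geometric's `RGCNConv`, which assigns a separate weight matrix to each edge type; the dimensions and the random toy graph below are illustrative only.

```python
import torch
from torch_geometric.nn import RGCNConv

class RelationLayer(torch.nn.Module):
    """Two relation-specific graph convolution layers, mirroring the update
    equation above: a separate weight matrix per edge type (same-item,
    same-learner, P<->P pair, acoustic kNN) plus a self-loop term."""
    def __init__(self, in_dim=256, hid_dim=128, num_relations=4):
        super().__init__()
        self.conv1 = RGCNConv(in_dim, hid_dim, num_relations=num_relations)
        self.conv2 = RGCNConv(hid_dim, hid_dim, num_relations=num_relations)

    def forward(self, x, edge_index, edge_type):
        h = torch.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(h, edge_index, edge_type)

# Hypothetical token graph: 2400 encoder embeddings, 4 edge types
x = torch.randn(2400, 256)
edge_index = torch.randint(0, 2400, (2, 10000))
edge_type = torch.randint(0, 4, (10000,))
h = RelationLayer()(x, edge_index, edge_type)   # (2400, 128) relational embeddings
```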
The P ↔ P edge links each learner’s perception trial to their production trial on the same item, allowing the model to align what was heard with what was spoken. This coupling is meaningful: EEG studies report above-chance decoding for imagined Mandarin tones, successful decoding for spoken tones, and EEG-based recognition of Mandarin sentences with tonal contrasts [42,62,106].

4.2.2. Node and Edge Features

Each response node R carries a compact acoustic–prosodic description together with optional physiological or gaze summaries when available. On the prosodic side, we use the F0 trajectory and derive slope in semitones per second, the turning-point time in milliseconds measured from a stated anchor such as rhyme onset, the local F0 range in semitones or hertz within a defined window, and supportive duration and intensity statistics so that model outputs can map transparently to classroom cues.
To capture spectral detail and long-range regularities, we attach either MFCC/log-Mel features or a self-supervised speech embedding such as wav2vec 2.0; in prior tone work, fusing prosodic and spectral information yields more reliable recognition than either source alone [105]. When a subset of trials includes physiological or eye-movement recordings, we summarize them into low-dimensional descriptors (e.g., ERP/MMN indices or Riemannian covariance features from EEG), which have supported above-chance decoding of spoken tones and improved imagined-tone decoding with audiovisual facilitation [42,62].
Item nodes I store the canonical tone label together with segmental context and prosodic position, while learner nodes L encode background type (tonal vs. non-tonal), proficiency, and musical aptitude. Edge attributes record information that is useful for learning dynamics and similarity structure, such as the time lag between a perception–production pair for the same item, acoustic distance in a kNN graph built in prosodic/SSL space, and session recency for tracking progress.
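As one way to construct the acoustic kNN edges and their distance attributes, the sketch below uses scikit-learn's `NearestNeighbors` over token embeddings; the embedding dimensionality and the choice of k are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_edges(embeddings, k=5):
    """Build acoustic kNN edges among token nodes from prosodic/SSL embeddings.
    Returns a (2, num_edges) index array and the corresponding distances,
    which can be stored as edge attributes in the heterogeneous graph."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dist, idx = nn.kneighbors(embeddings)
    src = np.repeat(np.arange(len(embeddings)), k)
    dst = idx[:, 1:].reshape(-1)          # column 0 is the self-neighbor; drop it
    return np.stack([src, dst]), dist[:, 1:].reshape(-1)

# Hypothetical SSL embeddings for 2400 tokens
edge_index, edge_dist = knn_edges(np.random.randn(2400, 256), k=5)
```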

4.2.3. Joint Objectives: Tone Classification and Parameter Regression

We jointly predict (i) the tone label $y \in \{T1, T2, T3, T4\}$ (where T1 = high level, T2 = rising, T3 = dipping/low, T4 = falling), and (ii) continuous parameters—slope, turning-point time, and local F0 range—because these dimensions are closely tied to intelligibility and align with classroom practice.
The loss function is defined as
$$\mathcal{L} = \lambda_{\mathrm{cls}}\,\mathrm{CE}(\hat{y}, y) + \lambda_{\mathrm{slope}}\,\mathrm{MSE}(\hat{s}, s) + \lambda_{\mathrm{TP}}\,\mathrm{MSE}(\hat{t}, t) + \lambda_{\mathrm{range}}\,\mathrm{MSE}(\hat{r}, r) + \lambda_{\mathrm{con}}\,\mathcal{L}_{\mathrm{con}}$$
We also add contrastive terms so that paired P ↔ P embeddings for the same item are close while different-item pairs are separated, thereby encouraging measurable P ↔ P coupling at the representation level.
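A compact sketch of this joint objective is given below. The loss weights are placeholders, and the contrastive term is instantiated with a cosine-embedding loss over matched versus shuffled P ↔ P pairs, which is one possible realization of $\mathcal{L}_{\mathrm{con}}$ rather than the specific formulation of any cited system.

```python
import torch
import torch.nn.functional as F

def joint_loss(out, gold, weights=(1.0, 0.5, 0.5, 0.5, 0.2)):
    """Joint tone-classification and parameter-regression objective.

    out:  dict with 'logits' (N,4), 'slope', 'tp', 'range' (each (N,)),
          plus paired 'p_emb'/'q_emb' perception/production embeddings (N,d).
    gold: dict with 'tone' (N,), 'slope', 'tp', 'range' (each (N,)).
    Weights are illustrative, not tuned values.
    """
    w_cls, w_slope, w_tp, w_range, w_con = weights
    loss = w_cls * F.cross_entropy(out["logits"], gold["tone"])
    loss = loss + w_slope * F.mse_loss(out["slope"], gold["slope"])
    loss = loss + w_tp    * F.mse_loss(out["tp"],    gold["tp"])
    loss = loss + w_range * F.mse_loss(out["range"], gold["range"])
    # Contrastive term: matched P<->P pairs (same item) pulled together,
    # mismatched pairs (shuffled items) pushed apart.
    p, q = out["p_emb"], out["q_emb"]
    neg = q[torch.randperm(q.size(0))]
    pos_target = torch.ones(p.size(0))
    neg_target = -torch.ones(p.size(0))
    loss = loss + w_con * (F.cosine_embedding_loss(p, q, pos_target)
                           + F.cosine_embedding_loss(p, neg, neg_target))
    return loss
```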
This parameterized objective is motivated by classroom and lab evidence showing that slope and turning-point timing often separate merely correct from intelligible productions; visualized feedback that highlights these parameters yields larger gains, especially for the T2 vs. T3 contrast (see Section 2 and Section 3). Fusing prosodic and spectral cues in the encoder is consistent with reports that prosody + MFCC feature sets outperform F0-only baselines in tone recognition [41,105].

4.2.4. Explaining the Model: Feature Attribution and Counterfactuals

To make the system’s decisions inspectable and teachable, we compute token-level attributions over time (e.g., integrated gradients or input × gradient) on the F0 trajectory and/or SSL frames, which highlights the sub-segments most responsible for the predicted tone [71,72,107]; in practice this often surfaces a late-rise window for T2 or an early-fall segment for T4. Beyond post hoc saliency, we also generate counterfactual edits by solving a small optimization for the minimal change to slope $s$ and turning-point time $\tau$ that would switch the predicted label. Concretely, we solve
$$\min_{\Delta s,\, \Delta \tau}\; \alpha\,|\Delta s| + \beta\,|\Delta \tau| \quad \text{s.t.} \quad f_\theta\!\left(F0;\, \ldots;\, s + \Delta s;\, \ldots;\, \tau + \Delta \tau\right) = y^{\ast}$$
where $y^{\ast}$ is the target tone. The system then verbalizes the result as an actionable prompt—for example, advance the turning-point time earlier, on the order of tens of milliseconds, to convert a T3-like trajectory into T2 [108]. To reduce the risk of over-interpreting local explanations, we additionally check a held-out subset with physiological or behavioral anchors, asking whether tokens flagged as late turning-point also show the expected ERP latencies or characteristic gaze patterns; this triangulation ties explanations back to measurable processing signatures.
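The counterfactual search can be approximated by relaxing the hard label constraint into a penalty and optimizing the edit magnitudes by gradient descent, as sketched below; `model` stands in for a trained predictor that maps parameterized F0 descriptions to tone logits, and all coefficients are illustrative.

```python
import torch

def minimal_counterfactual(model, f0_params, target_tone, alpha=1.0, beta=1.0,
                           lam=10.0, steps=200, lr=0.01):
    """Search for the smallest (delta slope, delta turning point) that flips the
    model's tone decision to `target_tone`. The hard constraint f(x') = y* is
    relaxed into a penalized objective (edit cost + lam * cross-entropy)."""
    d_slope = torch.zeros(1, requires_grad=True)
    d_tp = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([d_slope, d_tp], lr=lr)
    target = torch.tensor([target_tone])
    for _ in range(steps):
        opt.zero_grad()
        edited = dict(f0_params,
                      slope=f0_params["slope"] + d_slope,
                      tp=f0_params["tp"] + d_tp)
        logits = model(edited)                              # (1, 4) tone logits
        edit_cost = alpha * d_slope.abs().sum() + beta * d_tp.abs().sum()
        loss = edit_cost + lam * torch.nn.functional.cross_entropy(logits, target)
        loss.backward()
        opt.step()
    return d_slope.item(), d_tp.item()
```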

4.2.5. Evaluation Protocol for P ↔ P and Pedagogy

Learner-facing approaches are evaluated on complementary fronts that jointly capture predictive quality and classroom utility. The protocol combines categorical tone decisions with continuous, parameter-level estimates of tone dynamics. The perception to production link is also evaluated so that improvements can be interpreted at both learner and item levels. Results are reported separately for perception and production, with per tone breakdowns and contrast focused summaries that are directly relevant for instruction.
Table 5 summarizes the core metrics used throughout this review and recommended for future reporting.
Let $y_i \in \{1, \ldots, K\}$ denote the gold tone label for item $i$, and let $\hat{y}_i$ denote the predicted label. Accuracy is defined as
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\!\left(\hat{y}_i = y_i\right)$$
Per-tone $F1$ is reported for both perception and production, together with full confusion matrices. For each tone $k$, true positives $TP_k$, false positives $FP_k$, and false negatives $FN_k$ are defined in the standard way. Precision, recall, and per-tone $F1$ are computed as
$$P_k = \frac{TP_k}{TP_k + FP_k}, \qquad R_k = \frac{TP_k}{TP_k + FN_k}, \qquad F1_k = \frac{2\,P_k R_k}{P_k + R_k}$$
Macro-$F1$ is additionally reported as a balanced summary across tones,
$$\mathrm{MacroF1} = \frac{1}{K}\sum_{k=1}^{K} F1_k$$
Ninety-five percent confidence intervals for Accuracy and Macro-$F1$ are obtained via bootstrap resampling over items, and the same resampling procedure is applied to per-tone $F1_k$.
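These classification metrics and their bootstrap confidence intervals can be computed as in the sketch below, using scikit-learn; `y_true` and `y_pred` are hypothetical arrays of gold and predicted tone labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI over items for any label-based metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample items with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Hypothetical gold and predicted tone labels (1-4) for illustration
y_true = np.array([1, 2, 3, 4, 2, 3, 3, 1, 4, 2])
y_pred = np.array([1, 3, 3, 4, 2, 2, 3, 1, 4, 2])

acc = accuracy_score(y_true, y_pred)
per_tone_f1 = f1_score(y_true, y_pred, average=None, labels=[1, 2, 3, 4])
macro_f1 = f1_score(y_true, y_pred, average="macro")
acc_ci = bootstrap_ci(y_true, y_pred, accuracy_score)
macro_ci = bootstrap_ci(y_true, y_pred, lambda t, p: f1_score(t, p, average="macro"))
```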
Continuous outcomes are evaluated by benchmarking parameter estimates against native references under matched lexical and prosodic conditions whenever the design permits. Let $z_i$ denote a reference parameter value and $\hat{z}_i$ the predicted value. The error is defined as
$$e_i = \hat{z}_i - z_i$$
Mean absolute error and root mean squared error are computed as
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |e_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}$$
RMSE is reported for slope in semitones per second and for local fundamental frequency range in hertz or semitones. MAE is reported for turning point timing in milliseconds from rhyme onset. Parameter errors are reported per tone and per contrast, and item level distributions are provided to distinguish systematic bias from a small number of outliers.
The perception–production link is assessed at multiple levels. At the learner level and at the item level, the association between perception accuracy and production parameter error is computed so that the link can be interpreted in a learning-aligned way rather than as a single pooled statistic. Spearman rank correlation is used to reduce sensitivity to non-normality,
$$\rho = \mathrm{corr}_{\mathrm{Spearman}}\!\left(A_s, E_s\right)$$
where $A_s$ is a learner’s perception score and $E_s$ is a learner’s production error summary, such as slope RMSE or turning-point timing MAE.
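In code, this learner-level association reduces to a single call to a standard Spearman routine; the sketch below assumes SciPy is available and that perception scores and production error summaries have already been aggregated per learner.

```python
from scipy.stats import spearmanr

def perception_production_link(perception_scores, production_errors):
    """Learner-level P <-> P link: Spearman rho between each learner's
    perception score A_s and a production error summary E_s (e.g., slope RMSE).
    A negative rho indicates that better perceivers tend to produce smaller errors."""
    rho, p_value = spearmanr(perception_scores, production_errors)
    return float(rho), float(p_value)
```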
At the representation level, similarity of paired perception and production embeddings is examined for the same lexical item. Let $\mathbf{p}_i$ denote the perception embedding and $\mathbf{q}_i$ the production embedding for item $i$. Cosine similarity is computed as
$$\mathrm{sim}\!\left(\mathbf{p}_i, \mathbf{q}_i\right) = \frac{\mathbf{p}_i^{\top}\mathbf{q}_i}{\lVert \mathbf{p}_i \rVert \, \lVert \mathbf{q}_i \rVert}$$
Matched pairs are compared with mismatched pairs constructed by permuting items within each learner, to test whether item-matched perception–production structure is captured in the representation space.
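The matched-versus-mismatched comparison can be implemented as a within-learner permutation test; the sketch below is one possible realization, and the number of permutations and the one-sided p-value convention are illustrative choices.

```python
import numpy as np

def cosine(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def matched_vs_mismatched(P, Q, learner_ids, n_perm=200, seed=0):
    """Compare item-matched perception/production similarity against a null
    formed by permuting items *within* each learner.
    P, Q: (n_items, d) arrays of perception and production embeddings."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    learner_ids = np.asarray(learner_ids)
    rng = np.random.default_rng(seed)
    n = len(P)

    matched = np.mean([cosine(P[i], Q[i]) for i in range(n)])
    null = []
    for _ in range(n_perm):
        perm = np.arange(n)
        for lid in np.unique(learner_ids):
            idx = np.where(learner_ids == lid)[0]
            perm[idx] = rng.permutation(idx)             # shuffle items within learner
        null.append(np.mean([cosine(P[i], Q[perm[i]]) for i in range(n)]))

    null = np.asarray(null)
    p_value = (np.sum(null >= matched) + 1) / (n_perm + 1)  # one-sided permutation p
    return float(matched), float(null.mean()), float(p_value)
```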
Model interpretability is assessed by testing whether time-localized attributions concentrate around the turning-point region and by measuring the magnitude of counterfactual edits needed to change a model decision; smaller edits indicate better alignment with the intended parameterization. We also evaluate human usefulness by asking experienced instructors, under blinded conditions, whether the suggested edits (for example, earlier turning point or steeper rise) match their classroom corrections and produce measurable improvement on re-testing the same items. Robustness and fairness are examined by reporting all metrics separately for each tone (T1–T4) and for learners with tonal versus non-tonal language backgrounds, and by testing generalization to unseen speakers and items.
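A simple operationalization of the attribution check is the fraction of absolute attribution mass that falls inside a window around the annotated turning point; the helper below is an illustrative sketch, and the window size is a free parameter rather than a value established in the literature.

```python
import numpy as np

def attribution_concentration(attributions, turn_frame, half_window=10):
    """Fraction of absolute attribution mass inside a window centred on the
    annotated turning-point frame; values near 1 mean the explanation
    concentrates where the parameterization says it should."""
    a = np.abs(np.asarray(attributions, dtype=float))
    lo = max(0, int(turn_frame) - half_window)
    hi = min(len(a), int(turn_frame) + half_window + 1)
    return float(a[lo:hi].sum() / (a.sum() + 1e-12))
```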
Where feasible, we triangulate attribution windows and counterfactual directions with physiological and behavioral anchors from strand B. In particular, we verify that tokens flagged for late turning-points show correspondingly later N1/P2 latencies or are more difficult to decode from neural signals, drawing on evidence from EEG-based tone decoding and ERP time-course studies [42,62,106,109]. When eye-tracking is available, we additionally verify that model-identified difficulties align with delayed or weakened tone-driven lexical competition in gaze records [85].

4.3. Output and Deployment Based on the Tool

To capture the multidimensional nature of tonal production, the system integrates two complementary acoustic representations and, where available, a lightweight physiological stream. A BiGRU/CNN extracts movement-based information from the F0 trajectory so that rise/fall slope and the timing of the turning point are learned jointly from the input contour and supervisory targets. Time-series-focused architectures for sequential recognition in other modalities (e.g., wireless gesture sequences) provide transferable design choices for modeling tonal trajectories [54]. In parallel, a log-mel or wav2vec 2.0 frontend supplies robust suprasegmental features; fusing prosodic and spectral cues is known to outperform either cue alone in tone recognition [105]. For a consented subset, covariance-manifold (Riemannian) or CSP-style EEG features can be included to validate time windows and directions indicated by the acoustic model, but this physiological branch is optional and is not required for classroom deployment [42,62].
The encoder outputs are passed to a relation module composed of two to three relation-specific GCN/GAT layers over the heterogeneous graph defined in Section 4.2. An attention head on perception → production edges makes explicit how perception for a given learner and item informs that learner’s production on the same item. The prediction head returns both discrete tone labels and continuous, parameter-aligned diagnostics—slope, turning-point time, and local F0 range—so feedback can be phrased in teachable terms.
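To make the relation module concrete, the sketch below implements a single relation-specific attention layer over perception → production edges in plain PyTorch. It is a simplified stand-in for the GCN/GAT layers described above, not the implementation of any cited system, and all class, tensor, and edge-index names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAttentionLayer(nn.Module):
    """Minimal sketch: production-node states are updated from perception-node
    states along perception->production edges (same learner, same item)."""
    def __init__(self, dim):
        super().__init__()
        self.w_src = nn.Linear(dim, dim, bias=False)   # transform perception states
        self.w_dst = nn.Linear(dim, dim, bias=False)   # transform production states
        self.att = nn.Linear(2 * dim, 1, bias=False)   # edge attention score

    def forward(self, h_perc, h_prod, edges):
        # h_perc: (n_perc, dim); h_prod: (n_prod, dim)
        # edges: LongTensor (n_edges, 2) with columns [perception_idx, production_idx]
        src, dst = edges[:, 0], edges[:, 1]
        m_src, m_dst = self.w_src(h_perc[src]), self.w_dst(h_prod[dst])
        score = F.leaky_relu(self.att(torch.cat([m_src, m_dst], dim=-1))).squeeze(-1)

        # normalize attention over the incoming edges of each production node
        alpha = torch.zeros_like(score)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = F.softmax(score[mask], dim=0)

        out = h_prod.clone()
        out.index_add_(0, dst, alpha.unsqueeze(-1) * m_src)  # aggregate weighted messages
        return F.relu(out)
```

The per-edge attention weights themselves can be inspected, which is one way the perception → production influence becomes visible to instructors and analysts.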
Training uses a combined objective that mixes cross-entropy for tone labels with regression losses for the three parameters, along with class-balanced or focal weighting to address the T2 versus T3 imbalance. A contrastive term further encourages perception–production pairs of the same item to be close in the embedding space while keeping different items apart.
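The combined objective can be written compactly as follows; the loss weights, the smooth-L1 choice for the regression terms, and the margin-based contrastive form are assumptions made for illustration rather than settings reported in the cited studies.

```python
import torch
import torch.nn.functional as F

def combined_objective(logits, tone_labels, param_pred, param_target,
                       emb_perc, emb_prod, class_weights,
                       lambda_reg=1.0, lambda_con=0.1, margin=0.5):
    """Sketch of the joint loss: class-weighted cross-entropy for tone labels,
    regression on (slope, turning-point time, F0 range), and a simple
    contrastive term pulling matched perception/production embeddings together."""
    ce = F.cross_entropy(logits, tone_labels, weight=class_weights)   # addresses T2/T3 imbalance
    reg = F.smooth_l1_loss(param_pred, param_target)                  # three contour parameters

    sim = F.cosine_similarity(emb_perc, emb_prod)                     # matched pairs (same item)
    neg = F.cosine_similarity(emb_perc, emb_prod.roll(1, dims=0))     # mismatched pairs (shifted items)
    contrastive = F.relu(margin - sim + neg).mean()

    return ce + lambda_reg * reg + lambda_con * contrastive
```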
For deployment in classrooms, only the audio branches and the relation module are needed; the tool outputs a tone label together with parameter-level suggestions that map directly to practice (for example, advancing the turning point or steepening a rise). Where EEG is available for a subset, those signals are used to sanity-check attribution timing rather than to drive the main predictions.
To quantify each component’s contribution, we compare against three baselines: a feature-fusion model without relations (prosody + MFCC in CNN/SVM/BPNN) that reproduces the advantage of combining prosodic and spectral cues [105]; an end-to-end encoder with CTC and short-term context following recent tone pipelines [37]; and a wav2vec 2.0 fine-tuning setup with tonal syllable units and an n-gram language model, for which public-corpus tone error rates around 5% have been reported [37]. Physiological results provide supporting context: EEG can decode spoken Mandarin tones above chance (for example, about 42.9% accuracy in a four-class task where chance is 25%) and imagined-tone decoding improves with audio–visual stimulation [42,62]. These findings justify using small EEG subsets to validate attributions while keeping the primary system audio-only for scale.
To summarize, this section proposes an interpretable, relation-aware framework for L2 Mandarin tones: learner–item–trial data are modeled as a heterogeneous graph; audio encoders supply prosodic + spectral representations (with optional EEG for validation); relation-specific message passing aligns perception with production; and joint heads predict both tone labels and parameters (slope, turning-point timing, F0 range). XAI (attributions and counterfactual edits) turns decisions into teachable, parameter-aligned feedback. Evaluation covers per-tone accuracy/F1 and parameter errors, and the classroom deployment runs audio-only, delivering actionable guidance while preserving cognitive validity.

5. Discussion: Challenges and Future Directions for a Multimodal Explainable Model of L2 Mandarin Tones (P ↔ P)

To turn a literature map into a working research agenda, the remaining empirical and technical gaps must be addressed so that a graph-based, explainable pipeline can produce classroom-ready, parameter-aligned feedback.

5.1. Reconciling Inconsistent Findings Across Studies

5.1.1. Data and Task Coverage of Current Corpora Falls Short

Inconsistent conclusions across strands often reflect differences in task design, stimulus control, and reporting granularity [110]. Many studies measure perception and production separately, which makes the perception–production link hard to interpret beyond pooled correlations. Item-matched designs that test the same lexical items in perception and production enable learner-level and item-level coupling analyses and reduce ambiguity in interpretation [12].
Annotation practices further contribute to heterogeneity. To support instruction-oriented modeling, corpora should record not only tone labels but also dynamic parameters that teachers target in practice. These include F0 slope, turning-point timing relative to rhyme onset, and local F0 range [14]. This is particularly important for contrasts such as Tone 2 versus Tone 3, where duration and timing cues interact with contour shape [24].
Learner background and proficiency distribution must also be explicit. Tone-language and non-tone-language learners differ in cue weighting and learning trajectories. Key outcomes should therefore be reported with stratified analyses rather than only pooled means [77,111].

5.1.2. Bias, Confounds, and Cross-Strand Validity

The reporting-based methodological audit rubric (Appendix B, Table A1) highlights recurring validity threats and, critically, recurring reporting gaps that limit cross-study comparability and may partly explain inconsistent findings across strands. The most frequent issues include incomplete participant characterization (e.g., proficiency placement and relevant background factors), under-specified stimuli and contextual controls (e.g., tone distribution, segmental context, and talker variability), and limited explicit evidence for measurement reliability or quality control. When human ratings or manual labeling are used, studies should report interrater agreement using standard indices (e.g., Cohen’s κ), together with clear labeling protocols and adjudication rules [48,49].
To make the quality ranges explicit, Table 6 summarizes the domain-level rating distributions across the included studies (n = 54).
Ratings reflect explicit reporting in the included articles (“Unclear” indicates insufficient reporting for the corresponding item and does not necessarily imply poor underlying practice). Full rubric and audit details are provided in Appendix B (Table A1).
Cross-study comparability is further reduced by differences in recording conditions and test environments. Where feasible, robustness should be evaluated under realistic variation in noise and speech rate. Results should be reported at both the label level and the parameter level so that classroom-relevant progress can be assessed [10]. Physiological studies also frequently use small samples, which can limit statistical power and increase estimate variance; accordingly, effect sizes with uncertainty (e.g., confidence intervals) and replication/robustness checks are essential when integrating evidence across strands [112,113,114].

5.2. Modeling and Explainability

Audio-only models are effective for tone recognition, yet they often lack transparency and do not clarify how decisions are made. Convolutional and recurrent architectures with connectionist temporal classification (CTC) trained on F0-aware inputs or spectro-temporal representations reach strong performance without extra sensors [63,64,65,88]. Self-supervised learning (SSL) speech representations have also been examined for lexical tone and suprasegmental information and can provide stronger encoders for downstream tonal analysis tasks [38,40,66]. These pipelines provide necessary baselines, but they are commonly token-wise and label-focused, which limits their usefulness for instruction and for perception-production inference.
For assessment and learner diagnosis, weakly supervised designs such as multi-view or Siamese setups can lower annotation cost relative to Goodness of Pronunciation (GOP)-style baselines [115] and remain practical for classroom-scale feedback on spoken-language mispronunciation verification [116]. However, these pipelines generally output only a categorical label and treat trials as independent, leaving teachers without a rationale and ignoring the relational structure of learner corpora.
Relational inductive bias and transparent rationales make outputs more teachable. L2 tone datasets are inherently structured. The same learner attempts multiple items across sessions, and the same item recurs across learners and tasks. Perception trials can also be paired with production trials on matched items. Graph neural networks (GNNs) enable message passing over such links and can exploit this structure when data are sparse [59,68,69].
Electroencephalography (EEG) studies show that tone-related information is measurable in neural signals and can support tone classification and decoding in both overt and imagined conditions [27,42,62,106]. When such evidence is available, it can serve as an external anchor for interpretability checks. Model-identified difficulty windows should be consistent with plausible processing time courses.
Explanations are most useful when they align with tone-relevant parameters. Psycholinguistic evidence and training studies indicate that learners rely on movement direction and temporal anchoring in addition to absolute pitch height. This is especially true for contrasts in which turning-point alignment is critical [10,14,24]. Model rationales should therefore attribute decisions to segments of the F0 trajectory and to turning-point timing, rather than to global level alone.
In practice, feature-attribution methods such as Integrated Gradients and Shapley Additive exPlanations (SHAP) can be paired with counterfactual explanations. The counterfactual can propose a minimal, parameter-constrained edit that changes a decision, for example by advancing the turning point slightly or increasing rise slope within a bounded range [70,71,72,107]. Since these edits are framed in the same parameters instructors already teach (slope, timing, local range), they can be turned directly into classroom feedback.

5.3. From Modeling to Pedagogy: Validation Targets and Open Challenges

The purpose of modeling is to support better learning in real classrooms. Our approach therefore treats model outputs as prompts for concrete teaching actions rather than as end points.

5.3.1. Mapping Model Outputs to Teaching Actions

To close the research to practice gap, model outputs should be expressed in forms that align with how tone is taught and corrected. For perception, confusion patterns can be translated into contrast-focused practice sets. For production, parameter-level feedback can be stated as targeted actions, such as increasing rise slope, shifting the turning point earlier, or expanding local pitch range. Multimodal training studies suggest that such action-oriented feedback can be complemented by visual and gestural scaffolds when appropriate [26,28,41,98].

5.3.2. Scope of This Review and Recommended Validation Plan

This article is a systematic review, and the proposed evaluation protocol is intended to guide future empirical work and tool development. Prospective studies can strengthen the evidence base by using pre-specified hypotheses, item-matched perception–production designs, and the parameter-aligned metrics described in Section 4.2.5. Validation should test generalization to unseen speakers and items. It should report per-tone and background-stratified results to support instruction and fairness.

5.3.3. Open Challenges and Next Steps

Three open challenges recur across strands. First, classroom evidence remains limited relative to laboratory studies. Learner-facing tools should be evaluated under realistic classroom constraints, including limited time, variable microphones, and heterogeneous proficiency. Second, relation-aware modeling requires shared conventions for defining learners, items, and matched attempts so that P ↔ P structure is comparable across datasets. Third, explainability claims should be evaluated explicitly. This includes whether attributions concentrate around turning-point regions and whether counterfactual edits improve retest performance on matched items.

5.4. Practical Recommendations

In practice, perception and production should be recorded on the same items whenever feasible, ideally within the same session, so that item-level P ↔ P pairings can be evaluated directly. Corpora are more informative when they are balanced across speakers, lexical items, speaking rates, and contexts, including disyllables, focus or clear-speech conditions, and moderate noise. Alongside tone labels, slope in semitones per second, turning-point time in milliseconds from rhyme onset, and local F0 range in hertz or semitones are recommended because these parameters align with what instructors correct and make feedback teachable [81,102].
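To illustrate how these annotations can be derived from an F0 track, the sketch below computes slope, turning-point timing, and local range from a voiced contour. The extremum-based turning-point estimate and the fixed semitone reference are simplifying assumptions; production-quality pipelines may prefer smoothed or model-based estimates (e.g., from Praat output).

```python
import numpy as np

def tone_parameters(f0_hz, times_ms, rhyme_onset_ms=0.0, ref_hz=100.0):
    """Derive the three recommended annotation parameters from a voiced F0
    track (Hz) and its time stamps (ms). A simple extremum relative to the
    endpoint line is used as the turning point; `ref_hz` is an arbitrary
    semitone reference."""
    f0_hz, times_ms = np.asarray(f0_hz, dtype=float), np.asarray(times_ms, dtype=float)
    st = 12.0 * np.log2(f0_hz / ref_hz)                  # Hz -> semitones

    # turning point: contour sample farthest from the line joining the endpoints
    line = np.linspace(st[0], st[-1], len(st))
    turn_idx = int(np.argmax(np.abs(st - line)))
    turn_ms = times_ms[turn_idx] - rhyme_onset_ms        # timing from rhyme onset

    # rise/fall slope after the turning point, in semitones per second
    dt_s = (times_ms[-1] - times_ms[turn_idx]) / 1000.0
    slope_st_per_s = (st[-1] - st[turn_idx]) / dt_s if dt_s > 0 else 0.0

    local_range_st = float(st.max() - st.min())          # local F0 range
    return {"slope_st_per_s": float(slope_st_per_s),
            "turning_point_ms": float(turn_ms),
            "local_range_st": local_range_st}
```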

5.4.1. Minimum Reporting Set for Learner-Facing Evaluation

For learner-facing reporting, categorical and continuous metrics should be presented together. At the label level, accuracy, per-tone F1, and confusion matrices should be reported separately for perception and production [115]. At the parameter level, slope, turning-point timing, and local F0 range errors should be benchmarked against native references under matched lexical and prosodic conditions when the design permits [108,115]. These metrics support contrast-focused diagnosis, especially for persistent confusions such as Tone 2 versus Tone 3 [24,25,26].

5.4.2. Fairness, Ethics, and Governance for Classroom Deployment

Responsible deployment requires attention to fairness, privacy, and accountability. Performance should be reported with disaggregated analyses across L1 background, proficiency band, and other relevant learner attributes. Systematic failure modes should be documented and monitored. Data governance should specify consent, data retention, and access controls, particularly when recordings or physiological signals are collected. When explanations are presented to learners, they should be verifiable and actionable and should avoid implying certainty beyond what the evidence supports [72,91].
Taken together, these recommendations clarify how future studies can reduce heterogeneity, strengthen the perception–production evidence base, and support explainable, parameter-aligned feedback that is usable in instruction.

6. Conclusions

Research across phonetics, psycholinguistics, and speech technology now converges on a dynamic view of Mandarin tones: categories are realized as time-evolving F0 trajectories whose movement and temporal alignment carry much of the functional load. This synthesis explains why perception gains are often easier to obtain than production gains and why the T2 vs. T3 contrast remains a recurrent difficulty for many learners. It also motivates outcome measures that go beyond coarse labels to include slope, turning-point timing, and local F0 range.
This review integrates four strands—traditional behavioral tasks, physiological/behavioral instrumentation, audio-only modeling, and relation-aware, explainable methods—into a single framework for linking perception and production. In this framework, modern encoders are retained for accuracy, but their outputs are translated into parameter-aligned diagnostics, and relations among learners, items, and tasks are modeled explicitly. The result is feedback that maps directly to what teachers correct and that can be audited with interpretable evidence.
Looking ahead, progress will depend on paired perception–production datasets, parameter-level benchmarks, and careful attention to fairness across learner backgrounds. Ethical and practical guardrails are essential: balanced sampling, privacy-preserving data practices, and humility about what local attributions can and cannot explain. With these pieces in place, the field is well positioned to move from opaque judgments toward classroom-ready, interpretable tools that respect the time-varying nature of Mandarin tones.

Author Contributions

Writing—original draft preparation, Y.H.; formal analysis, Z.X.; writing—review and editing, X.B.; investigation and formal analysis, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Guangdong Province, China (No. 2025A1515011755) and, in part, by the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (No. CRKL240204).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Search Strategy and Database-Specific Search Strings (2015–2025)

Appendix A reports the database-specific search strings and reproducibility details used in this review. The search window covered 2015–2025 (inclusive). Searches were conducted in both English and Chinese. Eligible records were peer-reviewed journal articles and full conference papers with publicly accessible full texts and sufficient methodological detail for extraction.
Figure 4 summarizes the annual distribution of the final included set (n = 54). The venue-verification check used for Figure 4 (for trend visualization only, not for inclusion/exclusion) is described in Appendix A.3. The reporting-based methodological audit rubric is reported in Appendix B (Table A1), and the resulting domain-level rating distributions are summarized in Table 6 (main text). Records were exported in BibTeX/RIS format and imported into EndNote for DOI-plus-title de-duplication prior to screening.

Appendix A.1. Google Scholar

Google Scholar was searched using the advanced time filter with a custom range of 2015–2025. Results were sorted by relevance. For each query, the first 10 pages were screened. Screening was extended to 20 pages as a robustness check. Candidate records were exported in BibTeX or RIS format and imported into the main EndNote library for de-duplication and screening. The last search was conducted on 10 September 2025.
Representative query strings (exact phrases as entered in Google Scholar) included:
Q1: “Mandarin” “lexical tone” L2 perception
Q2: “Mandarin” “lexical tone” L2 production
Q3: “Mandarin” tone L2 identification discrimination -Cantonese -Thai
Q4: “Mandarin” tone “automatic assessment” L2
Q5: “Mandarin” tone mispronunciation detection
Q6: “Mandarin” tone classification “deep learning”
Q7: “Mandarin” tone explainable
Q8: “Mandarin” tone interpretable model
Q9: “Mandarin” tone L2 EEG ERP

Appendix A.2. China National Knowledge Infrastructure (CNKI)

CNKI advanced search was conducted using the Subject field, which searches titles, keywords, and abstracts. The following Chinese query was used:
(普通话 OR 汉语 OR 汉语普通话) AND (声调 OR 词汇声调) AND (二语 OR 外语 OR 非母语 OR 学习者) AND (感知 OR 产出 OR 识别 OR 评估) AND (计算 OR 机器学习 OR 深度学习 OR 可解释 OR 解释 OR 模型)
English gloss of key terms: Mandarin (普通话/汉语/汉语普通话); lexical tone (声调/词汇声调); L2/foreign/non-native learners (二语/外语/非母语/学习者); perception/production/recognition/assessment (感知/产出/识别/评估); computational/ML/DL/XAI/modeling (计算/机器学习/深度学习/可解释/解释/模型).
Filters were set to years 2015–2025. Chinese and English records indexed in CNKI were eligible. Full texts had to be accessible for screening and extraction.

Appendix A.3. Handling of Indexed Sources and the Figure 4 Subset

Google Scholar indexes records hosted by multiple curated services, including Web of Science, Scopus, IEEE Xplore, and the ACM Digital Library. These databases were not searched as independent sources. Records from these services could appear indirectly through Google Scholar indexing and were then subjected to the same de-duplication and screening procedure.
Figure 4 summarizes temporal trends by methodological strand for the included set (n = 54). For this trend visualization, records were additionally required to have verifiable venue information, such as journal identity and indexing status; records with insufficient bibliographic provenance were excluded from the trend analysis. Importantly, this filtering was applied only to Figure 4 and did not change the set of studies included in the review as reported in the PRISMA flow diagram.

Appendix B. Risk-of-Bias Appraisal Checklist

A compact six-item checklist was used to appraise potential risk of bias and methodological quality across the included studies. Each item was coded as Reported adequately, Partially reported, or Unclear/Not reported based on the information available in the article. For computational studies, items referring to model evaluation were applied when applicable.

Appendix B.1. Participant Characterization and Proficiency Control

Adequacy of reporting for L1 background, L2 proficiency, learning experience, and inclusion criteria, and whether key confounds were controlled or discussed.

Appendix B.2. Task and Stimulus Specification

Clarity and validity of the perception/production tasks, tone targets, phonological contexts, stimulus construction, and scoring criteria.

Appendix B.3. Measurement Reliability

Evidence of reliability for labeling/scoring procedures and acoustic extraction pipelines, including handling of F0 tracking errors and outliers when relevant.

Appendix B.4. Evaluation Validity for Computational Modeling

When computational evaluation was reported, adequacy of train–test split protocols, speaker or item independence, leakage control, and the presence of meaningful baselines.

Appendix B.5. Metric Definition and Reporting Transparency

Whether evaluation metrics were clearly defined and reported with sufficient granularity to interpret performance, including per-tone reporting where relevant.

Appendix B.6. Replicability-Critical Transparency

Availability of replicability-critical details such as preprocessing steps, feature definitions, model settings, and access to data or code when available.
Table A1 reports the audit rubric. Domain-level rating distributions across included studies (n = 54) are summarized in Table 6 in the main text.
Table A1. Compact reporting-based appraisal rubric (B1–B6).
| Domain | What We Check (Signaling Questions) | Adequate (2) | Partial (1) | Unclear/Not Reported (0) |
|---|---|---|---|---|
| B1 Participant characterization and proficiency control | Are L1 background and L2 proficiency/experience reported? Are inclusion criteria and key confounds addressed? | L1 + N + proficiency (or placement) reported; inclusion criteria and key confounds addressed. | N reported but proficiency/experience or confounds incomplete. | Learner profile insufficiently described; key confounds not addressed. |
| B2 Task and stimulus specification | Are perception/production tasks, tone targets, contexts, stimuli, and scoring criteria specified for replication? | Tasks, stimuli/contexts, and scoring clearly specified. | Task described but stimulus/context/scoring incomplete. | Task/stimuli insufficiently described. |
| B3 Measurement reliability | Is the acoustic/physiological pipeline specified (e.g., Praat settings; EEG/eye-tracking preprocessing)? Is reliability/QA reported where human labeling is involved? | Pipeline specified and reliability/QA reported (e.g., rater agreement, error handling, outlier rules). | Pipeline specified but reliability/QA partial or absent. | Pipeline unclear; reliability/QA not reported. |
| B4 Evaluation validity for computational modeling | If modeling is used: are training/validation protocols and leakage control specified (splits, CV, speaker/item independence, baselines)? | Clear evaluation protocol with leakage control and meaningful baselines. | Some evaluation details provided but safeguards unclear. | No clear evaluation protocol/validation. |
| B5 Metric definition and reporting transparency | Are performance outcomes clearly defined and reported with sufficient granularity (e.g., per-tone, confusion; MAE/RMSE for parameters; CIs)? | Reports recognition metric(s) and either parameter errors or uncertainty/effect sizes. | Reports only coarse metrics without uncertainty/parameter errors. | Outcomes not quantitatively reported or not extractable. |
| B6 Replicability-critical transparency | Are replicability-critical details available (preprocessing, features, model settings; data/code/materials access)? | Data/code/materials or full settings available (repo or detailed supplement). | Partial availability. | No availability statement or insufficient detail. |
Appraisal outcomes were used to qualify the strength of evidence in the narrative synthesis and to highlight recurring methodological limitations that may explain inconsistent results across studies.

References

  1. Ethnologue. Chinese, Mandarin (cmn). Available online: https://www.ethnologue.com/language/cmn (accessed on 13 August 2025).
  2. Xu, Y.; Wang, Q.E. Pitch targets and their realization: Evidence from Mandarin Chinese. Speech Commun. 2001, 33, 319–337. [Google Scholar] [CrossRef]
  3. Wu, Y.; Adda-Decker, M.; Lamel, L. Mandarin lexical tone duration: Impact of speech style, word length, syllable position and prosodic position. Speech Commun. 2023, 146, 45–52. [Google Scholar] [CrossRef]
  4. Yip, M.J.W. Tone; Cambridge University Press: London, UK, 2002. [Google Scholar]
  5. Wong, P.; Schwartz, R.G.; Jenkins, J.J. Perception and production of lexical tones by 3-year-old, Mandarin-speaking children. Perception 2005, 48, 1065–1079. [Google Scholar] [CrossRef] [PubMed]
  6. Ehinger, B.V.; Groß, K.; Ibs, I.; König, P. A new comprehensive eye-tracking test battery concurrently evaluating the Pupil Labs glasses and the EyeLink 1000. PeerJ 2019, 7, e7086. [Google Scholar] [CrossRef]
  7. Ha, J.; Baek, S.-C.; Lim, Y.; Chung, J.H. Validation of cost-efficient EEG experimental setup for neural tracking in an auditory attention task. Sci. Rep. 2023, 13, 22682. [Google Scholar] [CrossRef]
  8. Gupta, G.; Kshirsagar, M.; Zhong, M.; Gholami, S.; Ferres, J.L. Comparing recurrent convolutional neural networks for large scale bird species classification. Sci. Rep. 2021, 11, 17085. [Google Scholar] [CrossRef]
  9. Pelzl, E.; Lau, E.F.; Guo, T.; DeKeyser, R. Advanced Second Language Learners’ perception of Lexical Tone Contrasts. Stud. Second Lang. Acquis. 2019, 41, 59–86. [Google Scholar]
  10. Wang, Y.; Spence, M.M.; Jongman, A.; Sereno, J.A. Training American listeners to perceive Mandarin tones. J. Acoust. Soc. Am. 1999, 106, 3649–3658. [Google Scholar] [CrossRef]
  11. Li, M.; DeKeyser, R. Perception practice, production practice, and musical ability in L2 Mandarin tone-word learning. Stud. Second Lang. Acquis. 2017, 39, 593–620. [Google Scholar] [CrossRef]
  12. Leung, K.K.; Lu, Y.-A.; Wang, Y. Examining Speech Perception–Production Relationships Through Tone Perception and Production Learning Among Indonesian Learners of Mandarin. Brain Sci. 2025, 15, 671. [Google Scholar] [CrossRef]
  13. Prom-On, S.; Xu, Y.; Thipakorn, B. Modeling tone and intonation in Mandarin and English as a process of target approximation. J. Acoust. Soc. Am. 2009, 125, 405–424. [Google Scholar] [CrossRef] [PubMed]
  14. Xu, Y. Fundamental frequency peak delay in Mandarin. Phonetica 2001, 58, 26–52. [Google Scholar] [CrossRef] [PubMed]
  15. Hsieh, I.-H.; Yeh, W.-T. The interaction between timescale and pitch contour at pre-attentive processing of frequency-modulated sweeps. Front. Psychol. 2021, 12, 637289. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, Y. Effects of tone and focus on the formation and alignment of f0contours. J. Phon. 1999, 27, 55–105. [Google Scholar] [CrossRef]
  17. Jing, W.; Liu, J.; Wang, T.; Cho, S.; Lee, Y.-C. Comparisons of Mandarin on-focus expansion and post-focus compression between native speakers and L2 learners: Production and machine learning classification. Speech Commun. 2025, 173, 103280. [Google Scholar]
  18. Xu, Y. Prosody, tone, and intonation. In The Routledge Handbook of Phonetics; Routledge: Oxfordshire, England, 2019; pp. 314–356. [Google Scholar]
  19. Zheng, A.; Hirata, Y.; Kelly, S.D. Exploring the effects of imitating hand gestures and head nods on L1 and L2 Mandarin tone production. J. Speech Lang. Hear. Res. 2018, 61, 2179–2195. [Google Scholar] [CrossRef]
  20. Wei, Y.; Jia, L.; Gao, F.; Wang, J. Visual–auditory integration and high-variability speech can facilitate Mandarin Chinese tone identification. J. Speech Lang. Hear. Res. 2022, 65, 4096–4111. [Google Scholar] [CrossRef]
  21. Godfroid, A.; Lin, C.H.; Ryu, C. Hearing and seeing tone through color: An efficacy study of web-based, multimodal Chinese tone perception training. Lang. Learn. 2017, 67, 819–857. [Google Scholar] [CrossRef]
  22. Wang, X. Perception of Mandarin tones: The effect of L1 background and training. Mod. Lang. J. 2013, 97, 144–160. [Google Scholar] [CrossRef]
  23. Lee, Y.-C.; Wang, T.; Liberman, M. Production and perception of tone 3 focus in Mandarin Chinese. Front. Psychol. 2016, 7, 1058. [Google Scholar] [CrossRef]
  24. Blicher, D.L.; Diehl, R.L.; Cohen, L.B. Effects of syllable duration on the perception of the Mandarin Tone 2/Tone 3 distinction: Evidence of auditory enhancement. J. Phon. 1990, 18, 37–49. [Google Scholar] [CrossRef]
  25. Zou, T.; Zhang, J.; Cao, W. A comparative study of perception of tone 2 and tone 3 in Mandarin by native speakers and Japanese learners. In Proceedings of the 2012 8th International Symposium on Chinese Spoken Language Processing, Hong Kong, 5–8 December 2012; pp. 431–435. [Google Scholar]
  26. Shen, G.; Froud, K. Categorical perception of lexical tones by English learners of Mandarin Chinese. J. Acoust. Soc. Am. 2016, 140, 4396–4403. [Google Scholar] [CrossRef] [PubMed]
  27. Ni, G.; Xu, Z.; Bai, Y.; Zheng, Q.; Zhao, R.; Wu, Y.; Ming, D. EEG-based assessment of temporal fine structure and envelope effect in mandarin syllable and tone perception. Cereb. Cortex 2023, 33, 11287–11299. [Google Scholar] [CrossRef] [PubMed]
  28. Xu, Y. Cross-Language Studies of Tonal Perception: Hemodynamic, Electrophysiological and Behavioral Evidence. Ph.D. Thesis, Purdue University, West Lafayette, IN, USA, 2005. [Google Scholar]
  29. Zou, Q. Influence of L1 Background on Categorical Perception of Mandarin Tones by Russian and Vietnamese Listeners. Int. J. Engl. Linguist. 2019, 9, 4. [Google Scholar] [CrossRef]
  30. Chen, Y. Is Cantonese lexical tone information important for sentence recognition accuracy in quiet and in noise? PLoS ONE 2022, 17, e0276254. [Google Scholar] [CrossRef]
  31. Chen, F.; Guo, Q.; Deng, Y.; Zhu, J.; Zhang, H. Development of Mandarin lexical tone identification in noise and its relation with working memory. J. Speech Lang. Hear. Res. 2023, 66, 4100–4116. [Google Scholar] [CrossRef]
  32. Ji, J.; Hu, Y.; Yang, X.; Peng, G. Acoustic Features of Mandarin Tone Production in Noise: A Comparison Between Chinese Native Speakers and Korean L2 Learners. In Proceedings of the Interspeech 2025, Rotterdam, The Netherlands, 17–21 August 2025. [Google Scholar]
  33. Silpachai, A. The role of talker variability in the perceptual learning of Mandarin tones by American English listeners. J. Second Lang. Pronunciation 2020, 6, 209–235. [Google Scholar] [CrossRef]
  34. Shi, H. A Method of Teaching English Speaking Learners to Produce Mandarin-Chinese Tones. Ph.D. Thesis, West Virginia University, Morgantown, WV, USA, 2018. [Google Scholar]
  35. Baills, F.; Suarez-Gonzalez, N.; Gonzalez-Fuente, S.; Prieto, P. Observing and producing pitch gestures facilitates the learning of Mandarin Chinese tones and words. Stud. Second Lang. Acquis. 2019, 41, 33–58. [Google Scholar] [CrossRef]
  36. Yuan, C.; Gonzalez-Fuente, S.; Baills, F.; Prieto, P. Observing pitch gestures favors the learning of Spanish intonation by Mandarin speakers. Stud. Second Lang. Acquis. 2019, 41, 5–32. [Google Scholar] [CrossRef]
  37. Yuan, J.; Ryant, N.; Cai, X.; Church, K.; Liberman, M. Automatic recognition of suprasegmentals in speech. arXiv 2021, arXiv:2108.01122. [Google Scholar] [CrossRef]
  38. Shen, G.; Watkins, M.; Alishahi, A.; Bisazza, A. Encoding of lexical tone in self-supervised models of spoken language. arXiv 2024, arXiv:2403.16865. [Google Scholar] [CrossRef]
  39. Li, W.; Chen, N.F.; Siniscalchi, S.M.; Lee, C.-H. Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6249–6253. [Google Scholar]
  40. de la Fuente, A.; Jurafsky, D. A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models. arXiv 2024, arXiv:2408.13678. [Google Scholar] [CrossRef]
  41. Yan, J.; Tian, L.; Wang, X.; Liu, J.; Li, M. A mandarin tone recognition algorithm based on random forest and features fusion. In Proceedings of the 7th International Conference on Control Engineering and Artificial Intelligence, Sanya, China, 28–30 January 2023; pp. 168–172. [Google Scholar]
  42. Zhang, X.; Li, H.; Chen, F. EEG-based classification of imaginary Mandarin tones. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montréal, QC, Canada, 20–24 July 2020; pp. 3889–3892. [Google Scholar]
  43. Guo, D.; Zhu, X.; Xue, L.; Li, T.; Lv, Y.; Jiang, Y.; Xie, L. HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023; pp. 1–7. [Google Scholar]
  44. Sun, A.; Wang, J.; Cheng, N.; Peng, H.; Zeng, Z.; Kong, L.; Xiao, J. Graphpb: Graphical representations of prosody boundary in speech synthesis. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), virtual, 19–22 January 2021; pp. 438–445. [Google Scholar]
  45. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
  46. Clarivate. EndNote 2025; Build 21347; Clarivate: Philadelphia, PA, USA, 2025. [Google Scholar]
  47. Haddaway, N.R.; Page, M.J.; Pritchard, C.C.; McGuinness, L.A. PRISMA2020: An R package and Shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimised digital transparency and Open Synthesis. Campbell Syst. Rev. 2022, 18, e1230. [Google Scholar] [CrossRef]
  48. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Medica 2012, 22, 276–282. [Google Scholar] [CrossRef]
  49. Viera, A.J.; Garrett, J.M. Understanding interobserver agreement: The kappa statistic. Fam. Med. 2005, 37, 360–363. [Google Scholar]
  50. Syriani, E.; David, I.; Kumar, G. Screening articles for systematic reviews with ChatGPT. J. Comput. Lang. 2024, 80, 101287. [Google Scholar] [CrossRef]
  51. Bei, X.; Xiang, N. Fundamental Principles of Experimental Phonetics and Practical Use of Praat; Hunan Normal University Press: Changsha, China, 2016. [Google Scholar]
  52. Bei, X. A Study on Mandarin Speech Acquisition in Guangzhou, Hong Kong, and Macao; China Social Sciences Press: Beijing, China, 2021. [Google Scholar]
  53. Bei, X. The tone pattern in dialect contact. J. Chin. Linguist. 2015, 43, 34–52. [Google Scholar] [CrossRef]
  54. Huang, H.; Lin, L.; Zhao, L.; Ding, S.; Huang, H. Time Series Focused Neural Network for Accurate Wireless Human Gesture Recognition. IEEE Trans. Netw. Sci. Eng. 2025, 13, 118–129. [Google Scholar] [CrossRef]
  55. Zhou, Y.; Zhang, X.; Shen, J.; Han, T.; Chen, T.; Gall, H. Adversarial robustness of deep code comment generation. ACM Trans. Softw. Eng. Methodol. 2022, 31, 1–30. [Google Scholar] [CrossRef]
  56. Zheng, Y.; Yi, L.; Wei, Z. A survey of dynamic graph neural networks. Front. Comput. Sci. 2025, 19, 196323. [Google Scholar] [CrossRef]
  57. Vrahatis, A.G.; Lazaros, K.; Kotsiantis, S. Graph attention networks: A comprehensive review of methods and applications. Future Internet 2024, 16, 318. [Google Scholar] [CrossRef]
  58. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  59. Guo, L.; Huang, H.; Zhao, L.; Wang, P.; Jiang, S.; Su, C. Reentrancy vulnerability detection based on graph convolutional networks and expert patterns under subspace mapping. Comput. Secur. 2024, 142, 103894. [Google Scholar] [CrossRef]
  60. Wang, R. Audiovisual Perception of Mandarin Lexical Tones. Ph.D. Thesis, Bournemouth University, Poole, UK, 2018. [Google Scholar]
  61. Wayland, R.P.; Guion, S.G. Training English and Chinese listeners to perceive Thai tones: A preliminary report. Lang. Learn. 2004, 54, 681–712. [Google Scholar] [CrossRef]
  62. Li, M.; Pun, S.H.; Chen, F. Mandarin tone classification in spoken speech with EEG signals. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 1163–1166. [Google Scholar]
  63. Huang, H.; Hu, Y.; Xu, H. Mandarin tone modeling using recurrent neural networks. arXiv 2017, arXiv:1711.01946. [Google Scholar] [CrossRef]
  64. Lugosch, L.; Tomar, V.S. Tone recognition using lifters and CTC. arXiv 2018, arXiv:1807.02465. [Google Scholar] [CrossRef]
  65. Tang, J.; Li, M. End-to-end mandarin tone classification with short term context information. In Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; pp. 878–883. [Google Scholar]
  66. Yuan, J.; Cai, X.; Church, K. Improved contextualized speech representations for tonal analysis. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 4513–4517. [Google Scholar]
  67. Schenck, K.; Beguš, G. Unsupervised Learning and Representation of Mandarin Tonal Categories by a Generative CNN. arXiv 2025, arXiv:2509.17859. [Google Scholar] [CrossRef]
  68. Kipf, T. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  69. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  70. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
  71. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  72. Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. J. Law Tech. 2017, 31, 841. [Google Scholar] [CrossRef]
  73. Shuai, L.; Malins, J.G. Encoding lexical tones in jTRACE: A simulation of monosyllabic spoken word recognition in Mandarin Chinese. Behav. Res. Methods 2017, 49, 230–241. [Google Scholar] [CrossRef] [PubMed]
  74. Pan, J.; Chen, X.; Zhang, C. Computational Modeling of Tonal Encoding in Disyllabic Mandarin Word Production. In Proceedings of the Annual Meeting of the Cognitive Science Society, Rotterdam, The Netherlands, 24–27 July 2025. [Google Scholar]
  75. Xi, J.; Zhang, L.; Shu, H.; Zhang, Y.; Li, P. Categorical perception of lexical tones in Chinese revealed by mismatch negativity. Neuroscience 2010, 170, 223–231. [Google Scholar] [CrossRef]
  76. Shen, G.; Froud, K. Electrophysiological correlates of categorical perception of lexical tones by English learners of Mandarin Chinese: An ERP study. Bilingualism 2019, 22, 253–265. [Google Scholar] [CrossRef]
  77. So, C.K.; Best, C.T. Cross-language perception of non-native tonal contrasts: Effects of native phonological and phonetic influences. Lang. Speech 2010, 53, 273–293. [Google Scholar] [CrossRef]
  78. Hao, Y.-C. Second language acquisition of Mandarin Chinese tones by tonal and non-tonal language speakers. J. Phon. 2012, 40, 269–279. [Google Scholar] [CrossRef]
  79. Shen, W.; Hyönä, J.; Wang, Y.; Hou, M.; Zhao, J. The role of tonal information during spoken-word recognition in Chinese: Evidence from a printed-word eye-tracking study. Mem. Cogn. 2021, 49, 181–192. [Google Scholar] [CrossRef]
  80. Zou, T.; Liu, Y.; Zhong, H. The roles of consonant, rime, and tone in mandarin spoken word recognition: An eye-tracking study. Front. Psychol. 2022, 12, 740444. [Google Scholar] [CrossRef]
  81. Chun, D.M.; Jiang, Y.; Natalia, A. Visualization of tone for learning Mandarin Chinese. In Proceedings of the 4th Pronunciation in Second Language Learning and Teaching Conference, Vancouver, BC, Canada, 24–25 August 2012. [Google Scholar]
  82. Zhou, A.; Olson, D. The use of visual feedback to train L2 lexical tone: Evidence from Mandarin phonetic acquisition. In Proceedings of the 13th Pronunciation in Second Language Learning and Teaching Conference, St. Catharines, ON, Canada, 16–18 June 2022. [Google Scholar]
  83. Giuliano, R.J.; Pfordresher, P.Q.; Stanley, E.M.; Narayana, S.; Wicha, N.Y. Native experience with a tone language enhances pitch discrimination and the timing of neural responses to pitch change. Front. Psychol. 2011, 2, 146. [Google Scholar] [CrossRef]
  84. Chen, X.; Zhang, C.; Chen, Y.; Politzer-Ahles, S.; Zeng, Y.; Zhang, J. Encoding category-level and context-specific phonological information at different stages: An EEG study of Mandarin third-tone sandhi word production. Neuropsychologia 2022, 175, 108367. [Google Scholar] [CrossRef]
  85. Deng, X.; Farris-Trimble, A.; Yeung, H.H. Contextual effects on spoken word processing: An eye-tracking study of the time course of tone and vowel activation in Mandarin. J. Exp. Psychol. 2023, 49, 1145. [Google Scholar] [CrossRef] [PubMed]
  86. Wang, X.; Deng, W.; Meng, Z.; Chen, D. Hybrid-attention mechanism based heterogeneous graph representation learning. Expert Syst. Appl. 2024, 250, 123963. [Google Scholar] [CrossRef]
  87. Chen, N.F.; Wee, D.; Tong, R.; Ma, B.; Li, H. Large-scale characterization of non-native Mandarin Chinese spoken by speakers of European origin: Analysis on iCALL. Speech Commun. 2016, 84, 46–56. [Google Scholar] [CrossRef]
  88. Gao, Q.; Sun, S.; Yang, Y. ToneNet: A CNN Model of Tone Classification of Mandarin Chinese. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 3367–3371. [Google Scholar]
  89. Niu, Y.; Chen, N.; Zhu, H.; Zhu, Z.; Li, G.; Chen, Y. Auditory Spatial Attention Detection Based on Feature Disentanglement and Brain Connectivity-Informed Graph Neural Networks. In Proceedings of the Interspeech 2024, Kos Island, Greece, 1–5 September 2024; pp. 887–891. [Google Scholar]
  90. Pastor, E.; Koudounas, A.; Attanasio, G.; Hovy, D.; Baralis, E. Explaining speech classification models via word-level audio segments and paralinguistic features. arXiv 2023, arXiv:2309.07733. [Google Scholar] [CrossRef]
  91. Das, A.; Rad, P. Opportunities and challenges in explainable artificial intelligence (xai): A survey. arXiv 2020, arXiv:2006.11371. [Google Scholar] [CrossRef]
  92. Fucci, D.; Savoldi, B.; Gaido, M.; Negri, M.; Cettolo, M.; Bentivogli, L. Explainability for Speech Models: On the Challenges of Acoustic Feature Selection. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 4–6 December 2024; pp. 373–381. [Google Scholar]
  93. Morett, L.M.; Chang, L.-Y. Emphasising sound and meaning: Pitch gestures enhance Mandarin lexical tone acquisition. Lang. Cogn. Neurosci. 2015, 30, 347–353. [Google Scholar] [CrossRef]
  94. Wang, T.; Potter, C.E.; Saffran, J.R. Plasticity in second language learning: The case of Mandarin tones. Lang. Learn. Dev. 2020, 16, 231–243. [Google Scholar] [CrossRef]
  95. Hao, Y.-C. Second language perception of Mandarin vowels and tones. Lang. Speech 2018, 61, 135–152. [Google Scholar] [CrossRef]
  96. Hao, Y.-C. Contextual effect in second language perception and production of Mandarin tones. Speech Commun. 2018, 97, 32–42. [Google Scholar] [CrossRef]
  97. Vonessen, J.; Zellou, G. Perception of Mandarin tones across different phonological contexts by native and tone-naive listeners. In Proceedings of the Frontiers in Education, Washington, DC, USA, 13–16 October 2024; p. 1392022. [Google Scholar]
  98. Zeng, Y.; Leung, K.K.; Jongman, A.; Sereno, J.A.; Wang, Y. Multi-modal cross-linguistic perception of Mandarin tones in clear speech. Front. Hum. Neurosci. 2023, 17, 1247811. [Google Scholar] [CrossRef] [PubMed]
  99. Yu, K.; Zhang, J.; Li, Z.; Zhang, X.; Cai, H.; Li, L.; Wang, R. Production rather than observation: Comparison between the roles of embodiment and conceptual metaphor in L2 lexical tone learning. Learn. Instr. 2024, 92, 101905. [Google Scholar] [CrossRef]
  100. Farran, B.M.; Morett, L.M. Multimodal cues in L2 lexical tone acquisition: Current research and future directions. In Proceedings of the Frontiers in Education, Washington, DC, USA, 13–16 October 2024; p. 1410795. [Google Scholar]
  101. Deng, X. Processing Tone and Vowel Information in Mandarin: An Eye-Tracking Study of Contextual Effects on Speech Processing. Master’s Thesis, Simon Fraser University, Burnaby, BC, Canada, 2020. [Google Scholar]
  102. Zhang, F.; Wagner, M. Effects of F0 feedback on the learning of Chinese tones by native speakers of English. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 181–184. [Google Scholar]
  103. Yang, B. The gap between the perception and production of tones by American learners of Mandarin–An intralingual perspective. Chin. A Second Lang. Res. 2012, 1, 33–53. [Google Scholar] [CrossRef]
  104. Gao, Y.A.; Toscano, J.C.; Shih, C.; Tanner, D. Reassessing the electrophysiological evidence for categorical perception of Mandarin lexical tone: ERP evidence from native and naïve non-native Mandarin listeners. Atten. Percept. Psychophys. 2019, 81, 543–557. [Google Scholar] [CrossRef]
  105. Shen, L.; Wang, W. Fusion Feature Based Automatic Mandarin Chinese Short Tone Classification. Technol. Acoust 2018, 37, 167–174. [Google Scholar]
  106. Yang, S.-R.; Jung, T.-P.; Lin, C.-T.; Huang, K.-C.; Wei, C.-S.; Chiueh, H.; Hsin, Y.-L.; Liou, G.-T.; Wang, L.-C. Recognizing tonal and nontonal mandarin sentences for EEG-based brain–computer interface. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 1666–1677. [Google Scholar] [CrossRef]
  107. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3319–3328. [Google Scholar]
  108. Leung, K.K.; Wang, Y. Modelling Mandarin tone perception-production link through critical perceptual cues. J. Acoust. Soc. Am. 2024, 155, 1451–1468. [Google Scholar] [CrossRef]
  109. Chen, J.; Wang, Y.; Zhang, Z.; Han, J.; Liu, Y.-L.; Feng, R.; Liang, X.; Ling, Z.-H.; Yuan, J. Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception. arXiv 2025, arXiv:2505.19626. [Google Scholar] [CrossRef]
  110. Wiener, S.; Lee, C.Y.; Tao, L. Statistical regularities affect the perception of second language speech: Evidence from adult classroom learners of Mandarin Chinese. Lang. Learn. 2019, 69, 527–558. [Google Scholar] [CrossRef]
  111. Li, Y. English and Thai Speakers’ Perception of Mandarin Tones. Engl. Lang. Teach. 2016, 9, 122–132. [Google Scholar] [CrossRef]
  112. Button, K.S.; Ioannidis, J.P.; Mokrysz, C.; Nosek, B.A.; Flint, J.; Robinson, E.S.; Munafò, M.R. Power failure: Why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 2013, 14, 365–376. [Google Scholar] [CrossRef]
  113. Lakens, D. Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Front. Psychol. 2013, 4, 863. [Google Scholar] [CrossRef]
  114. Szucs, D.; Ioannidis, J.P. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol. 2017, 15, e2000797. [Google Scholar] [CrossRef]
  115. Bu, X.; Guo, W.; Yang, H.; Lu, X.; He, Y.; Xu, H.; Kong, W. Evaluating Mandarin tone pronunciation accuracy for second language learners using a ResNet-based Siamese network. Sci. Rep. 2025, 15, 24558. [Google Scholar] [CrossRef]
  116. Wana, Z.; Hansen, J.H.; Xie, Y. A multi-view approach for Mandarin non-native mispronunciation verification. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 8079–8083. [Google Scholar]
Figure 1. Overview of strands and our path for L2 Mandarin tones (Perception ↔ Production). (A) Conventional behavioral tasks for tone perception and production, including identification (ID), AX (same–different) judgement, and imitation/read-aloud; illustration created by the authors. (B) Physio-behavioral sensors. The eye-tracking photograph is reproduced from Ehinger et al. (2019) [6] and the EEG photograph from Ha et al. (2023) [7]; both are reproduced under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0). (C) Acoustic speech analysis, with an example waveform–spectrogram–F0 tracking display generated from our own recordings using Praat v6.4.23 (plus the Praat software icon). (D) Proposed relation-aware and explainable modeling as a methodological direction for parameter-aligned pedagogy; the upper icons illustrating AI-based encoders (CNN/RNN/CTC/SSL) are original to this work, whereas the lower network architecture is reproduced from Gupta et al. (2021) [8] (CC BY 4.0). (a) Hybrid classification model with CNN-based representations followed by a TCNN for temporal correlation extraction. (b) Hybrid classification model with CNN-based representations followed by an RNN for temporal correlation extraction. In (D-a) and (D-b), CNN outputs are concatenated before the temporal layer.
Figure 2. PRISMA 2020 flow diagram summarizing study identification from Google Scholar and CNKI, screening, eligibility assessment, and final inclusion (2015–2025) in the review.
Figure 3. Framework of this paper.
Figure 4. Annual distribution of included studies by methodological strand (A–D), 2015–2025. Counts are computed on all included studies (n = 54), and strand labels follow Section 2. Strand D is rare (n = 2) and reflects tone-domain mechanistic or explainable modeling exemplars.
Figure 5. Pitch-change ERP paradigm with grand-averaged ERPs and difference-wave topographies (three-time windows). Left panel: the upper part of the left panel shows an original schematic of the acoustic stimulus used in this study; the lower part of the left panel (EEG recording setup) is reproduced from Ha et al. [7], (Scientific Reports, 13, 22682). Right panel: (A) grand-averaged ERPs for Pitch Change vs. No Change trials in tone speakers and a control group (selected electrodes shown), reproduced from Giuliano et al. [83] (Frontiers in Psychology, Figure 3); the grey boxes mark the time window highlighting the early divergence between conditions (P50 range) as indicated in the original figure. (B) scalp topographies of the difference waves (Change−No Change) in three-time windows (50–100 ms, 125–250 ms, and 350–550 ms). Both reproduced under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Table 1. Decision matrix for two screening passes by the same reviewer (Pass 1 vs. Pass 2).
Pass 1 Decision | Pass 2 Include | Pass 2 Exclude
Include | a (both include) | b (Pass 1 include; Pass 2 exclude)
Exclude | c (Pass 1 exclude; Pass 2 include) | d (both exclude)
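The review does not state which agreement statistic was computed from the cells a–d of Table 1, so the following is only a minimal sketch, assuming Cohen's kappa as one common choice for two screening passes; the cell counts passed in at the bottom are illustrative placeholders, not values reported here.

```python
# Minimal sketch: intra-rater agreement from the 2x2 decision matrix in Table 1.
# The cell counts below are placeholders, not values reported in this review;
# Cohen's kappa is shown as one common agreement statistic for two screening passes.

def screening_agreement(a: int, b: int, c: int, d: int) -> dict:
    """a = both include, b = include/exclude, c = exclude/include, d = both exclude."""
    n = a + b + c + d
    p_observed = (a + d) / n                                 # raw percent agreement
    p_include_1, p_include_2 = (a + b) / n, (a + c) / n      # marginal include rates
    p_expected = (p_include_1 * p_include_2
                  + (1 - p_include_1) * (1 - p_include_2))   # chance agreement
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return {"percent_agreement": p_observed, "cohen_kappa": kappa}

print(screening_agreement(a=40, b=3, c=2, d=120))            # placeholder counts
```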
Table 2. Comparison of the representative studies in terms of learners/data, modality/method, key finding, and implication.
Reference | Learners/Data | Modality/Method | Key Finding | Implication
[10] | L1 Eng. | ID training | Perception ↑, generalizes | Establishes perceptual plasticity
[61] | L1 Eng./Chi → Thai | ID/AX | T2–T3-like confusions | Pair-specific asymmetries
[27] | L1 Chi | EEG | TFS vs. ENV separation | Neural cue weighting
[62] | L1 Chi | EEG decoding | 4-tone decoding | Physiological validity
[63] | Corpus | RNN | Seq. tone dynamics | Strong non-sensor baseline
[64] | Corpus | CTC | End-to-end tones | No explicit F0 needed
[65] | Corpus | Short-term context | Coarticulation captured | Better sequences
[66] | Corpus | SSL + CTC | Suprasegmental SOTA | Label-efficient
[67] | Tone Perfect corpus (isolated syllables) | Unsupervised generative CNN (InfoGAN) | Class variable ↔ tone (male-only best); Conv4/Conv3 encode F0; partial human-like order (T1/T4 > … > T2/T3) | Label-efficient + production-capable; bridges P ↔ P; adds layer-probing to XAI
[68,69] | — | GCN/GAT | Relation learning | Basis for our Strand D
[70,71,72] | — | XAI | Attribution/Counterfactual | Teachable diagnostics
Note: References are grouped by study type: Perceptual training (Refs. [10,61]); Neurophysiology/EEG (Refs. [27,62]); Speech modeling (Refs. [63,64,65,66,67]); Methodological foundations (Refs. [68,69,70,71,72]). For compactness, symbols are used as follows: ↑, improved performance/outcome; →, target tone language in a cross-language learning setting; ↔, bidirectional/associative link (e.g., perception–production mapping). Ref. [61] is included as a cross-tonal analogue; Refs. [68,69,70,71,72] provide methodological foundations (GNN/XAI).
Table 3. Methodological trade-offs across strands A–D in L2 Mandarin tone research.
Strand | Typical Inputs | Typical Outputs | Strengths | Limitations | Classroom Feasibility
A | Perception and production tasks, controlled stimuli, learner responses | Tone labels, accuracy, reaction time, ratings, basic acoustic measures | Direct behavioral evidence and low cost; easy to replicate | Often label-focused; perception and production are not item-matched | High: adaptable to classroom and online settings
B | EEG or ERP and eye-tracking with audio stimuli; synchronized recordings | Time-course indices, cue weighting, decoding or classification measures | Mechanistic insight into attention and timing; useful anchors for validation | Equipment and analysis burden; samples are often small | Low in classroom; mainly laboratory validation
C | Speech recordings; features such as F0, MFCC; self-supervised embeddings | Tone labels, confusion patterns, embeddings; sometimes parameter estimates | Scalable and sensor-free; strong baselines for automatic assessment | Often opaque; token-wise decisions can limit pedagogy and P ↔ P inference | Medium to high if packaged into teacher-facing tools
D | Speech features plus links across learner, item, and session; optional sensor anchors | Tone labels plus parameter-aligned targets; explanations and counterfactual edits | Uses corpus structure to support P ↔ P; delivers teachable, parameter-level feedback | Requires curated links, privacy safeguards, and robust evaluation; limited direct evidence in the current L2 Mandarin tone corpus | Emerging: feasible with tool engineering and fairness checks
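To make the strand-C inputs in Table 3 concrete, the sketch below extracts an F0 contour and MFCCs from a single recording. It assumes the librosa toolkit purely for illustration (the surveyed studies use a variety of toolchains, including Praat); the file name, pitch-range bounds, and frame settings are placeholder choices, not parameters taken from any reviewed study.

```python
# Minimal sketch of strand-C style inputs: an F0 contour and MFCCs from one recording.
# librosa is an assumed toolchain here; "syllable.wav", the pitch-range bounds,
# and the frame settings are placeholder choices.
import numpy as np
import librosa

y, sr = librosa.load("syllable.wav", sr=16000)        # placeholder path

# F0 tracking with probabilistic YIN; bounds roughly cover adult speaking pitch.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=65.0, fmax=400.0, sr=sr, frame_length=1024
)

# 13 MFCCs per frame as a classic spectral complement to the F0 contour.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Convert the voiced part of the contour to semitones re 100 Hz, a common
# speaker-normalization step before measuring slope or local F0 range.
voiced_f0 = f0[~np.isnan(f0)]
semitones = 12 * np.log2(voiced_f0 / 100.0)
print(f0.shape, mfcc.shape, semitones.size)
```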
Table 4. Notation for nodes, relations, outputs, and metrics (heterogeneous relational graph).
Table 4. Notation for nodes, relations, outputs, and metrics (heterogeneous relational graph).
SymbolMeaningNotes/Examples/Units
NodesLLearner nodeL1 background (tonal/non-tonal), proficiency, musicality
IItem nodeSyllable/word info, canonical tone, phonological/prosodic context
RTrial (response) nodeOne observation: a perception trial or a production recording
RpercPerception triale.g., identification or same–different response for an item
RprodProduction triale.g., imitation or read-aloud recording
SSession nodeSession index; task type (ID/AX/imitation/read-aloud); modality
Relations (edges)L ↔ RLearner–trial linkConnects a learner to their trials
R ↔ ITrial–item linkConnects a trial to its linguistic item
Ri ↔ Rj|IWithin-item linkRepeated attempts on the same item across tasks/sessions
Ri ↔ Rj|LWithin-learner linkMultiple trials from the same learner across tasks
Rperc ↔ RprodPerception–production pairSame learner and same item (P ↔ P coupling)
kNN(R, R)Acoustic/SSL neighborsLinks trials nearest in SSL embedding or prosodic feature space
Outputs (predictions) y ^ (R)Predicted tone label(1, 2, 3, 4); 1 = T1 (high level), 2 = T2 (rising), 3 = T3 (dipping/low), 4 = T4 (falling)
s ^ (R)SlopeRise/fall slope; semitones/s
t ^ _TP(R)Turning-point timems from rhyme onset (or stated anchor)
r ^ _F0(R)Local F0 rangesemitones or Hz over a defined window
Attribution windowTP ± 50 msSaliency windowTime window around the turning point used for attribution
Evaluation metricsF1 (per-tone)Classification metricReport per-tone and macro F1
RMSE_slopeParameter errorRMSE for slope (semitones/s)
MAE _ TP Parameter errorMean absolute error for TP timing (ms)
Notes: ID = identification; AX = same–different; SSL = self-supervised learning; kNN = k-nearest neighbors; P ↔ P = perception–production; F0 = fundamental frequency; TP = turning point; RMSE = root mean square error; MAE = mean absolute error.
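The following is a minimal sketch of how the node and edge types defined in Table 4 could be encoded before handing them to a relational model. It uses plain Python containers rather than any specific GNN library, and every identifier, attribute, and trial record is an illustrative placeholder rather than data from the reviewed studies.

```python
# Minimal sketch of the Table 4 notation as a heterogeneous relational graph,
# using plain Python containers rather than a specific GNN library. All
# identifiers and trial records below are illustrative placeholders.
from collections import defaultdict

nodes = {
    "learner": {"L1": {"l1_background": "non-tonal", "proficiency": "B1"}},
    "item":    {"I_ma3": {"syllable": "ma", "canonical_tone": 3}},
    "trial":   {
        "R1": {"kind": "perception", "task": "ID", "response_tone": 2},
        "R2": {"kind": "production", "task": "imitation", "wav": "R2.wav"},
    },
    "session": {"S1": {"task_types": ["ID", "imitation"]}},
}

edges = defaultdict(list)
edges[("learner", "has_trial", "trial")] += [("L1", "R1"), ("L1", "R2")]        # L <-> R
edges[("trial", "of_item", "item")]      += [("R1", "I_ma3"), ("R2", "I_ma3")]  # R <-> I
edges[("trial", "pp_pair", "trial")]     += [("R1", "R2")]   # Rperc <-> Rprod (same learner, item)
edges[("trial", "same_item", "trial")]   += [("R1", "R2")]   # Ri <-> Rj | I
# kNN(R, R) edges over SSL embeddings or prosodic features would be added in the same way.

print(len(nodes["trial"]), sum(len(v) for v in edges.values()))
```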
Table 5. Core evaluation metrics reported in this review and protocol.
Metric | Target Output | Definition | Unit | Reporting | Pedagogical Use
Accuracy | Tone label classification for perception and production | Proportion of items with the correct predicted tone label | Unitless | Report overall for perception and production; include 95% confidence intervals | Overall correctness; useful for monitoring progress but not diagnostic of specific confusions
Per-tone F1 (F1_k) | Tone label classification for perception and production | Harmonic mean of precision and recall for each tone k | Unitless | Report F1 for each tone; highlight Tone 2 vs. Tone 3 and Tone 1 vs. Tone 4 | Identifies which tones are reliably produced or perceived and which are most often confused
Confusion matrix | Tone label classification for perception and production | Cross-tabulation of predicted labels against gold labels | Counts | Report separately for perception and production; provide contrast-focused summaries | Directly reveals error directions and supports targeted practice design
RMSE for slope | Continuous parameter estimation for production dynamics | Root mean squared error between predicted and reference slope | Semitones per second | Report per tone and per contrast; also report item-level distributions | Quantifies mismatch in tonal movement strength; helpful for coaching contour shaping
MAE for turning-point timing | Continuous parameter estimation for production dynamics | Mean absolute error between predicted and reference turning-point timing relative to rhyme onset | Milliseconds | Report per tone and per contrast; also report item-level distributions | Quantifies timing misalignment; maps directly to actionable feedback on earlier or later turning points
RMSE for local F0 range | Continuous parameter estimation for production dynamics | Root mean squared error between predicted and reference local fundamental frequency range | Hertz or semitones | Report per tone and per contrast; also report item-level distributions | Quantifies pitch-range compression or expansion; useful for diagnosing under-articulation
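The sketch below computes the Table 5 metrics on toy arrays, as a worked illustration only. scikit-learn and NumPy are assumed tools, and the tone labels, slope values, and turning-point times are invented placeholders, not results from the reviewed studies.

```python
# Minimal sketch of the Table 5 metrics on toy data. scikit-learn / NumPy are
# assumed tools; the labels and parameter values below are placeholders, not
# results from the reviewed studies.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

tones = [1, 2, 3, 4]
y_true = np.array([1, 2, 3, 4, 2, 3, 3, 2])            # gold tone labels
y_pred = np.array([1, 3, 3, 4, 2, 2, 3, 2])            # predicted tone labels

acc = accuracy_score(y_true, y_pred)
f1_per_tone = f1_score(y_true, y_pred, labels=tones, average=None)
conf = confusion_matrix(y_true, y_pred, labels=tones)   # rows = gold, cols = predicted

# Continuous parameter errors (units follow Table 5).
slope_ref  = np.array([12.0, -8.5, 4.0])                # reference slopes, semitones/s
slope_pred = np.array([10.5, -7.0, 6.5])                # predicted slopes, semitones/s
rmse_slope = np.sqrt(np.mean((slope_pred - slope_ref) ** 2))

tp_ref  = np.array([180.0, 210.0, 160.0])               # reference turning points, ms
tp_pred = np.array([200.0, 190.0, 170.0])               # predicted turning points, ms
mae_tp = np.mean(np.abs(tp_pred - tp_ref))

print(acc, f1_per_tone.round(2), round(rmse_slope, 2), round(mae_tp, 1))
print(conf)
```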
Table 6. Summary of reporting-based audit distributions across included studies (n = 54).
Domain | Adequate n (%) | Partial n (%) | Unclear n (%)
B1 Participants/data | 23 (42.6) | 6 (11.1) | 25 (46.3)
B2 Task/stimuli | 24 (44.4) | 27 (50.0) | 3 (5.6)
B3 Measurement reliability/QA | 0 (0.0) | 18 (33.3) | 36 (66.7)
B4 Modelling/evaluation validity | 11 (20.4) | 22 (40.7) | 21 (38.9)
B5 Outcomes/metrics | 4 (7.4) | 27 (50.0) | 23 (42.6)
B6 Transparency/reproducibility | 8 (14.8) | 0 (0.0) | 46 (85.2)
