3.2. Results
A total of 1280 syllables were recorded and analysed for VOT. Additionally, recordings from two Chinese native speakers (one male, one female) from the first experiment were measured for VOT as the native control group. VOT for this experiment was measured from the onset of the burst release, identified as a sudden vertical spike in the waveform that emerges from the near-silent baseline, through the frication phase (and aspiration period in the case of ch [tʂʰ]), until the onset of voicing in the following vowel. The vowel onset was identified by the appearance of the first clearly visible periodic oscillation in the waveform. The spectrogram was used as a complementary reference to confirm the transition. An example can be seen in Figure 9 below.
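The measurement procedure described above reduces to the difference between two hand-placed landmarks. The following is a minimal sketch in Python, assuming each token has been annotated with a burst-onset time and a voicing-onset time in seconds. The `Token` class, its field names, and the example timestamps are hypothetical and are not taken from the study's annotation files.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One annotated syllable; the field names are hypothetical."""
    label: str            # e.g. "zh" or "ch"
    burst_onset: float    # time of the release spike, in seconds
    voicing_onset: float  # time of the first periodic cycle of the vowel, in seconds


def vot_ms(token: Token) -> float:
    """Return VOT in milliseconds: voicing onset minus burst onset."""
    return (token.voicing_onset - token.burst_onset) * 1000.0


# Illustrative timestamps only (not data from the study).
example = Token(label="ch", burst_onset=0.250, voicing_onset=0.375)
print(f"{example.label}: VOT = {vot_ms(example):.1f} ms")
```

With this definition, a token whose voicing begins before the release would yield a negative value, so such tokens need separate handling before any comparison of positive VOT values.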
Of the 1280 syllables recorded, 47 (3.67%) were excluded from the analysis for three reasons: damaged audio files (n = 29), a lost file (n = 1), and technically unusable recordings (n = 17). The unusable recordings resulted from production irregularities that rendered VOT measurement impossible. Some productions lacked the complete occlusion that distinguishes affricates from fricatives; fricatives, by definition, cannot be measured for VOT because they lack the stop phase that VOT calculation requires. Other excluded productions were fully voiced consonants with pre-voicing throughout the occlusion and release phases. Since VOT is defined as the time interval between the release of the occlusion and the onset of vocal fold vibration, fully voiced consonants would not provide meaningful data for the intended cross-linguistic comparison of aspiration contrasts.
Regarding Voice Onset Times (VOTs), participants demonstrated distinct production patterns for both phonemes whilst exhibiting systematic underproduction compared to native speakers. For zh [tʂ], participants produced an overall mean VOT of 58 ms (SD = 36.5, range: 8–220 ms), whilst the native control group demonstrated a mean of 67 ms (SD = 23.9, range: 37–149 ms). For ch [tʂʰ], participants exhibited an overall mean of 125 ms (SD = 48.3, range: 28–313 ms), compared to the native control group’s mean of 164 ms (SD = 30.6, range: 107–288 ms). Participants showed greater variability than the native control group for both phonemes, as evidenced by larger standard deviations and wider ranges. These patterns are illustrated in Figure 10.
Statistical analyses were conducted using linear mixed-effects models (lme4 package in R; Bates et al., 2015) to test phonological factors affecting VOT production, with fixed effects for Phoneme, Tone, and Syllable Type, and random intercepts for Subject and Item. Results revealed a highly significant main effect of Phoneme (β = −64.73, SE = 2.41, t = −26.87, p < 0.001), confirming that participants successfully distinguished zh [tʂ] (M = 58 ms) from ch [tʂʰ] (M = 125 ms) in production, with a mean difference of 67 ms. Tone also showed a significant effect (χ²(3) = 64.04, p < 0.001), indicating that tonal context systematically influenced VOT duration. However, Syllable Type (labialised vs. non-labialised) was not significant (β = −1.79, p = 0.477). Random effects revealed substantial between-subject variability (SD = 27.12) relative to between-item variability (SD = 8.74).
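The model structure described above can be written out explicitly. The following is a sketch under assumptions: treatment (dummy) coding with Tone 1 as the reference level and the symbol names are illustrative choices, since the coding scheme is not stated in the text.

```latex
\mathrm{VOT}_{si} = \beta_0
  + \beta_{P}\,\mathrm{Phoneme}_{i}
  + \sum_{k=2}^{4} \beta_{T_k}\,\mathbf{1}[\mathrm{Tone}_{i} = k]
  + \beta_{L}\,\mathrm{SyllableType}_{i}
  + u_{s} + w_{i} + \varepsilon_{si},
\qquad
u_{s} \sim \mathcal{N}(0, \sigma_{u}^{2}),\quad
w_{i} \sim \mathcal{N}(0, \sigma_{w}^{2}),\quad
\varepsilon_{si} \sim \mathcal{N}(0, \sigma^{2})
```

Here s indexes subjects and i indexes items. Under this notation, the reported random-effect standard deviations correspond to σ_u = 27.12 and σ_w = 8.74, and the three tone coefficients are the terms evaluated jointly by the χ²(3) test.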
The VOT data from participants were subsequently analysed relative to the control group intervals to convert measurements into binary outcomes. A conservative threshold approach was adopted to minimise false positives. Separate VOT ranges were established for zh [tʂ] and ch [tʂʰ] based on native speaker productions. Participant VOT values falling outside the phoneme-specific native speaker range were classified as incorrect, whilst those within the appropriate interval were deemed correct. The 47 previously excluded syllables were maintained as exclusions for this binary classification.
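The conversion described above can be sketched as a simple interval check. The text does not state how the interval bounds were derived, so the sketch below uses the control group's reported minimum and maximum (37–149 ms for zh [tʂ], 107–288 ms for ch [tʂʰ]) purely for illustration, and treats the bounds as inclusive. The function and variable names are hypothetical.

```python
# Native control ranges in ms, taken from the reported descriptive statistics.
# Using min-max bounds and treating them as inclusive are both assumptions.
NATIVE_RANGE_MS = {
    "zh": (37.0, 149.0),
    "ch": (107.0, 288.0),
}


def is_native_like(phoneme: str, vot_ms: float) -> bool:
    """Return True if the VOT falls inside the phoneme-specific native range."""
    low, high = NATIVE_RANGE_MS[phoneme]
    return low <= vot_ms <= high


# Participant group means and extremes, used here only as illustrative inputs.
for phoneme, vot in [("zh", 58.0), ("ch", 125.0), ("zh", 8.0), ("ch", 313.0)]:
    verdict = "correct" if is_native_like(phoneme, vot) else "incorrect"
    print(f"{phoneme} {vot:.0f} ms -> {verdict}")
```

Because the actual bounds used in the study may differ from these illustrative ones, the sketch is not expected to reproduce the reported error rates.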
A total of 1233 VOT measurements were converted to binary classifications, of which 916 (74.29%) fell outside the native range and were therefore classified as incorrect. By phoneme, 405 of the 601 valid measurements for zh [tʂ] (67.39%) and 511 of the 632 valid measurements for ch [tʂʰ] (80.85%) were classified as incorrect.
Binary accuracy was analysed using generalised linear mixed-effects models with binomial family (lme4 package in R; Bates et al., 2015), testing Phoneme, Tone, and Syllable Type as fixed effects with random intercepts for Subject and Item. Results confirmed a significant main effect of Phoneme (β = 0.96, z = 4.71, p < 0.001), with zh [tʂ] showing higher accuracy (32.6%) than ch [tʂʰ] (19.1%). Tone also significantly affected accuracy (χ²(3) = 11.74, p = 0.008), with declining accuracy from Tone 1 (31.2%) to Tone 4 (21.1%). Syllable Type showed a significant effect (β = 0.44, z = 2.09, p = 0.036), with non-labialised syllables achieving higher accuracy (28.0%) than labialised syllables (21.2%). Random effects indicated substantial between-subject variance (SD = 1.23) compared to between-item variance (SD = 0.72).
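Because binomial models report coefficients on the log-odds scale, exponentiating them gives odds ratios that are easier to interpret. The following is a small sketch, assuming the default logit link for a binomial model (the link function is not stated in the text) and using only the two coefficients reported above.

```python
import math

# Fixed-effect estimates on the log-odds scale, as reported above.
coefficients = {
    "Phoneme": 0.96,
    "Syllable Type": 0.44,
}

# exp(beta) is the multiplicative change in the odds of a correct production,
# conditional on the random effects for Subject and Item.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

Under these assumptions, the odds of a within-range production are roughly 2.6 times higher for zh [tʂ] than for ch [tʂʰ], and roughly 1.55 times higher for non-labialised than for labialised syllables.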
In terms of tonal distribution, Tone 1 produced 249 incorrect responses out of 362 total responses (68.78%), Tone 2 had 159 incorrect responses out of 221 (71.95%), Tone 3 had 250 incorrect responses out of 323 (77.40%), and Tone 4 had 258 incorrect responses out of 327 (78.90%). Regarding syllable structure, non-labialised syllables resulted in 585 incorrect responses out of 813 total responses (71.96%), whereas labialised syllables accounted for 331 incorrect responses out of 420 total responses (78.81%). The data are shown in Figure 11.
The distribution of production errors across phonological contexts was as follows. For zh [tʂ] syllables, labialised contexts showed the following error rates: ZH1L, 52 of 76 responses (68.42%); ZH2L, 8 of 9 responses (88.89%); ZH3L, 43 of 59 responses (72.88%); and ZH4L, 40 of 58 responses (68.97%). Non-labialised zh [tʂ] contexts yielded the following: ZH1NL, 60 of 110 responses (54.55%); ZH2NL, 38 of 60 responses (63.33%); ZH3NL, 78 of 110 responses (70.91%); and ZH4NL, 86 of 119 responses (72.27%). These patterns are illustrated in Figure 12.
Conversely, ch [tʂʰ] syllables in labialised contexts yielded the following: CH1L, 62 of 72 responses (86.11%); CH2L, 42 of 55 responses (76.36%); CH3L, 46 of 48 responses (95.83%); and CH4L, 38 of 43 responses (88.37%). Non-labialised ch [tʂʰ] contexts showed the following: CH1NL, 75 of 104 responses (72.12%); CH2NL, 71 of 97 responses (73.20%); CH3NL, 83 of 106 responses (78.30%); and CH4NL, 94 of 107 responses (87.85%). The data are shown in Figure 13.
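The sixteen context-level counts can be aggregated to check that they agree with the phoneme, tone, and syllable-type totals reported earlier. The following sketch reads each context code as phoneme + tone + labialisation (L or NL), which is how the surrounding text uses the codes. The helper name `marginals` is hypothetical.

```python
from collections import defaultdict

# (errors, total) per context, copied from the counts reported above.
# Context codes: phoneme (ZH/CH), tone (1-4), L = labialised, NL = non-labialised.
CELLS = {
    "ZH1L": (52, 76),   "ZH2L": (8, 9),    "ZH3L": (43, 59),   "ZH4L": (40, 58),
    "ZH1NL": (60, 110), "ZH2NL": (38, 60), "ZH3NL": (78, 110), "ZH4NL": (86, 119),
    "CH1L": (62, 72),   "CH2L": (42, 55),  "CH3L": (46, 48),   "CH4L": (38, 43),
    "CH1NL": (75, 104), "CH2NL": (71, 97), "CH3NL": (83, 106), "CH4NL": (94, 107),
}


def marginals(key_fn):
    """Sum errors and totals over cells grouped by key_fn(context_code)."""
    sums = defaultdict(lambda: [0, 0])
    for code, (errors, total) in CELLS.items():
        sums[key_fn(code)][0] += errors
        sums[key_fn(code)][1] += total
    return {level: tuple(pair) for level, pair in sums.items()}


by_phoneme = marginals(lambda code: code[:2])  # "ZH" / "CH"
by_tone = marginals(lambda code: code[2])      # "1" .. "4"
by_type = marginals(lambda code: code[3:])     # "L" / "NL"

for name, table in [("phoneme", by_phoneme), ("tone", by_tone), ("type", by_type)]:
    for level, (errors, total) in table.items():
        print(f"{name} {level}: {errors}/{total} = {100 * errors / total:.2f}%")
```

The aggregated counts match the marginal figures given in the text, for example 405 of 601 for zh [tʂ], 249 of 362 for Tone 1, and 331 of 420 for labialised syllables.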
For errors by gender, males had an overall error rate of 537 out of 723 (74.27%), an error rate of 244 out of 357 (68.35%) for zh [tʂ], and an error rate of 293 out of 366 (80.05%) for ch [tʂʰ], whilst females had an overall error rate of 379 out of 510 (74.31%), an error rate of 161 out of 244 (65.98%) for zh [tʂ], and an error rate of 218 out of 266 (81.95%) for ch [tʂʰ]. Gender was examined as an additional predictor in separate mixed-effects models for both VOT and binary accuracy, and showed no significant effect (VOT: χ²(1) = 0.01, p = 0.915; binary: χ²(1) = 0.45, p = 0.502). The error rates by gender are illustrated in Figure 14.
As for nationality, Peruvians had an overall error rate of 239 out of 348 (68.68%), an error rate of 102 out of 174 (58.62%) for zh [tʂ], and an error rate of 137 out of 174 (78.74%) for ch [tʂʰ], whilst Chileans had an overall error rate of 677 out of 885 (76.50%), an error rate of 303 out of 427 (70.96%) for zh [tʂ], and an error rate of 374 out of 458 (81.66%) for ch [tʂʰ]. Nationality was examined as an additional predictor in separate mixed-effects models for both VOT and binary accuracy, and showed no significant effect (VOT: χ²(1) = 0.55, p = 0.458; binary: χ²(1) = 0.32, p = 0.571). The error rates by nationality are illustrated in Figure 15.
For errors by HSK level, the HSK 3 group had an overall error rate of 127 out of 128 (99.22%), an error rate of 60 out of 61 (98.36%) for zh [tʂ], and an error rate of 67 out of 67 (100%) for ch [tʂʰ]. The HSK 4 group had an overall error rate of 350 out of 483 (72.46%), an error rate of 157 out of 235 (66.81%) for zh [tʂ], and an error rate of 193 out of 248 (77.82%) for ch [tʂʰ]. The HSK 5 group had an overall error rate of 284 out of 374 (75.94%), an error rate of 119 out of 183 (65.03%) for zh [tʂ], and an error rate of 165 out of 191 (86.39%) for ch [tʂʰ]. The HSK 6 group had an overall error rate of 155 out of 248 (62.50%), an error rate of 69 out of 122 (56.56%) for zh [tʂ], and an error rate of 86 out of 126 (68.25%) for ch [tʂʰ]. Due to unbalanced group sizes (particularly HSK 3 with n = 1), inferential statistical tests were not conducted for HSK level comparisons in production. These HSK level differences are illustrated in Figure 16.
Finally, regarding years of study, production error rates showed that the ≤5 years group had an overall error rate of 534 out of 730 (73.15%), an error rate of 224 out of 357 (62.75%) for zh [tʂ], and an error rate of 310 out of 373 (83.11%) for ch [tʂʰ]. The 6–10 years group had an overall error rate of 317 out of 375 (84.53%), an error rate of 159 out of 183 (86.89%) for zh [tʂ], and an error rate of 158 out of 192 (82.29%) for ch [tʂʰ]. The ≥11 years group had an overall error rate of 65 out of 128 (50.78%), an error rate of 22 out of 61 (36.07%) for zh [tʂ], and an error rate of 43 out of 67 (64.18%) for ch [tʂʰ]. Due to unbalanced group sizes (particularly ≥11 years with n = 1), inferential statistical tests were not conducted for years of study comparisons in production. These patterns across study durations are illustrated in Figure 17.
3.3. Discussion
Participants distinguished zh [tʂ] from ch [tʂʰ] in production, with mean VOTs of 58 ms (zh [tʂ]) and 125 ms (ch [tʂʰ]), yielding a 2.2× ratio comparable to natives (2.4×) and to Dai’s (2015) advanced learners (2.1×), indicating successful aspiration contrast maintenance despite VOT underproduction.
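The aspirated-to-unaspirated ratios quoted above follow directly from the group means reported in the Results. A brief sketch of the arithmetic, using only those four means (the dictionary layout is illustrative):

```python
# Group mean VOTs in ms, as reported in the Results.
means = {
    "participants": {"zh": 58.0, "ch": 125.0},
    "native controls": {"zh": 67.0, "ch": 164.0},
}

# Ratio of the aspirated mean (ch) to the unaspirated mean (zh) per group.
ratios = {group: m["ch"] / m["zh"] for group, m in means.items()}

for group, ratio in ratios.items():
    print(f"{group}: ch/zh = {ratio:.2f}")
```

To one decimal place these values round to the 2.2× and 2.4× figures given in the text.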
Additionally, native speakers demonstrated higher overall mean VOT values for both zh [tʂ] (control group: 67 ms; participants: 58 ms) and ch [tʂʰ] (control group: 164 ms; participants: 125 ms), indicating consistent VOT underproduction by Spanish-speaking participants. This phenomenon was also observed in Zhao’s (2010) study of voiceless alveolo-palatal consonants, in which Spanish speakers similarly produced shorter VOT values for j [ʨ] (control group: 99 ms; participants: 81 ms) and q [ʨʰ] (control group: 212 ms; participants: 96 ms) compared to native Mandarin speakers. However, Dai (2015) reported the opposite pattern, with non-native groups showing VOT overproduction for zh [tʂ] across all proficiency levels and for ch [tʂʰ] in advanced learners. The reasons for this discrepancy across studies remain unclear and warrant further investigation. These findings suggest that deviation from native-like production does not follow a uniform trajectory, manifesting as either underproduction or overproduction, which aligns with Dai’s (2015) findings, at least for ch [tʂʰ].
Participants showed greater VOT variability than the native control group, both in range (zh [tʂ]: 113 ms vs. 86 ms; ch [tʂʰ]: 162 ms vs. 145 ms) and standard deviation (zh [tʂ]: 36.5 ms vs. 23.9 ms; ch [tʂʰ]: 48.3 ms vs. 30.6 ms), indicating less precise articulatory control. This pattern aligns with previous research (Dai, 2015; Zhao, 2010) demonstrating that VOT variability decreases with increasing proficiency. However, our native control group showed larger standard deviations than Dai’s (2015) native speakers, likely due to differences in sample size (n = 2 vs. Dai’s larger control group) or methodological variations in stimuli selection.
Mixed-effects analysis confirmed that participants maintained a statistically significant distinction between zh [tʂ] and ch [tʂʰ] in production (β = −64.73, p < 0.001), with a mean difference of 67 ms. The substantial between-subject variance (SD = 27.12) relative to between-item variance (SD = 8.74) indicates that individual production patterns vary considerably more than stimulus characteristics, suggesting that learner-specific factors play a major role in VOT production accuracy. These findings indicate that despite VOT underproduction, Spanish speakers are able to acquire the aspirated–unaspirated contrast necessary for accurate Mandarin consonant differentiation, though their production patterns differ from those of native speakers.
When VOT measurements were converted to binary classifications based on whether values fell within or outside the native speaker interval, the overall production error rate was 74.29%, substantially higher than the perception error rate of 16.64%, and considerably higher than that reported by Dai (2015). This considerable discrepancy reflects both the conservative methodological threshold employed, in which any deviation from native VOT ranges was classified as incorrect, and the inherent difficulty of producing native-like VOT values compared to perceptually discriminating between zh [tʂ] and ch [tʂʰ]. The rate of correct productions was 32.61% for zh [tʂ] and 19.15% for ch [tʂʰ], demonstrating that whilst participants could distinguish these sounds perceptually with high accuracy, producing them within native parameters proved substantially more challenging.
The binary classification revealed that ch [tʂʰ] posed greater production challenges than zh [tʂ], with error rates of 80.85% and 67.39% respectively. This difference of 13.46 percentage points is substantially larger than the 2.04-point difference observed in perception, suggesting that whilst both sounds present production difficulties, the aspirated consonant ch [tʂʰ] requires more precise articulatory control that Spanish speakers find particularly challenging to achieve within native norms.
Regarding tones, production exhibited an inverse pattern to perception. Whilst perception showed Tone 1 with the highest error rate (20.26%) and Tone 4 with the lowest (14.71%), production demonstrated an ascending pattern, with Tone 1 having the lowest error rate (68.78%) and Tone 4 the highest (78.90%). This reversal suggests that the phonetic factors facilitating perceptual discrimination differ fundamentally from those enabling accurate production. High fundamental frequency (f0), which may interfere with aspiration perception in Tone 1, appears to support more accurate VOT production, whilst lower f0 contexts present greater production challenges.
Regarding syllable type, production showed a difference of 6.85 percentage points in error rates, with labialised syllables yielding higher error rates (78.81%) than non-labialised syllables (71.96%). This difference, whilst modest, is nearly three times greater than the 2.44-point difference observed in perception, indicating that coarticulation effects present greater challenges in production than in perception.
The distribution of production error rates across phonological contexts revealed patterns distinct from perception. Unlike perception, where the ZH2L context achieved 0% errors (though with limited tokens), no phonological context in production achieved error-free performance. Most notably, ZH2L, the most accurately perceived context, showed one of the highest production error rates at 88.89%, demonstrating that perceptual clarity does not predict production accuracy. The highest production error rate occurred in the CH3L context (95.83%), whilst the lowest occurred in ZH1NL (54.55%), suggesting that non-labialised contexts with high f0 facilitate more accurate VOT production.
Regarding gender and nationality, mixed-effects models revealed that neither factor significantly affected production. For VOT, gender (χ²(1) = 0.01, p = 0.915) and nationality (χ²(1) = 0.55, p = 0.458) showed no significant effects. Similarly, for binary accuracy, gender (χ²(1) = 0.45, p = 0.502) and nationality (χ²(1) = 0.32, p = 0.571) were not significant. Males and females demonstrated virtually identical overall error rates (74.27% vs. 74.31%), and whilst Peruvians showed numerically lower error rates than Chileans (68.68% vs. 76.50%), this difference was not statistically reliable. The small sample sizes (n = 6 males, n = 4 females; n = 3 Peruvians, n = 7 Chileans) and high within-group variability limited the statistical power to detect potential effects. Further investigation with larger, balanced samples is needed to determine whether gender or dialectal factors influence production accuracy.
Regarding HSK level and years of study in production, unbalanced group sizes (HSK 3: n = 1, HSK 4: n = 4, HSK 5: n = 3, HSK 6: n = 2; ≤5 years: n = 6, 6–10 years: n = 3, ≥11 years: n = 1) precluded inferential statistical testing. Descriptively, error rates showed a generally descending trend across HSK levels (HSK 3: 99.22%, HSK 4: 72.46%, HSK 5: 75.94%, HSK 6: 62.50%), whilst years of study revealed an unexpected inverted-U pattern (≤5 years: 73.15%, 6–10 years: 84.53%, ≥11 years: 50.78%), in which the intermediate group performed worse than the group with the fewest years of study. The single HSK 3 participant’s near-complete inability to produce native-like VOT (99.22%) and the ≥11 years participant’s substantially lower error rate (50.78%) are particularly unreliable data points, since each rests on a single participant (n = 1). The similarity between HSK 4 and HSK 5 performance (a difference of 3.48 percentage points) may suggest a plateau effect at intermediate levels, though this remains speculative without statistical validation. Further investigation with larger, stratified samples is essential to clarify the relationship between proficiency measures and production accuracy.