Perception and Production of the Aspiration Contrast in Mandarin Retroflex Affricates [tʂ] and [tʂ^h] by Adult Spanish Speakers Learning Mandarin Chinese: An Exploratory Study

Galhoz Maria Roque, Guilherme; Zhang, Quanzhen

doi:10.3390/languages11040069

by Guilherme Galhoz Maria Roque * and Quanzhen Zhang

International Chinese Education Department, College of Overseas Education, Nanjing University, Nanjing 210008, China

* Author to whom correspondence should be addressed.

Languages 2026, 11(4), 69; https://doi.org/10.3390/languages11040069

Submission received: 16 November 2025 / Revised: 18 March 2026 / Accepted: 19 March 2026 / Published: 2 April 2026

Abstract

This exploratory study examines the perception and production of the aspiration contrast in Mandarin voiceless retroflex affricates zh [tʂ] and ch [tʂ^h] by ten adult Spanish speakers (three Peruvian, seven Chilean) at Nanjing University. Participants completed a perception identification task and a production reading task using the same set of 128 syllables. Voice Onset Time (VOT) measurements from the production task were converted to binary classifications for cross-modality comparison. Perception accuracy was moderately high (zh [tʂ]: 84.43%; ch [tʂ^h]: 82.39%), whilst production accuracy was substantially lower (zh [tʂ]: 32.61%; ch [tʂ^h]: 19.15% within native VOT ranges). Participants maintained the aspiration contrast (zh [tʂ] = 58 ms, ch [tʂ^h] = 125 ms) but consistently underproduced VOT compared to native speakers (zh [tʂ] = 67 ms, ch [tʂ^h] = 164 ms). Perception patterns align with Category Goodness (CG) assimilation within PAM-L2: both Mandarin sounds map to Spanish [tʃ] but with different goodness-of-fit, enabling moderate discrimination. Production follows SLM-r predictions, with learners developing a Composite L1–L2 Category that maintains the aspiration contrast but fails to establish new phonetic categories. The small sample size (n = 10) precluded robust statistical testing of individual differences. The perception–production asymmetry supports independent modality development in L2 phonetic acquisition.

Keywords: voiceless retroflex affricates; Mandarin Chinese; perception; production; Spanish; second language acquisition

1. Introduction

1.1. Target Consonants: zh [tʂ] and ch [tʂ^h]

Standard Chinese has two voiceless retroflex affricate consonants, [tʂ] and [tʂ^h], the first unaspirated and the second aspirated, represented in pinyin by zh and ch, respectively (Ladefoged & Wu, 1984). These two consonants are also sometimes described as voiceless postalveolar affricates (Lee & Zee, 2003). Additionally, whilst some describe them as single complex consonants with two articulatory phases, occlusion and frication, others classify them as consonant clusters (Zhu, 2013). In contrast, Spanish does not possess retroflex affricates in its consonant inventory (Campos-Astorkiza, 2012). Their absence from the L1 system can make the acquisition of Mandarin zh [tʂ] and ch [tʂ^h] difficult for Spanish learners (Fandiño, 2019).

In a production-focused study with native Spanish speakers, Dai (2015) found that zh [tʂ] showed a VOT production error rate of 36% for beginner-level students (the third highest among all consonants tested) and 33.4% for advanced students (the second highest). For ch [tʂ^h], the error rate was 15.5% for beginner-level students (fifth highest) and 38.2% for advanced students (the highest). These figures demonstrate how challenging voiceless retroflex affricates can be for native Spanish speakers.

Similar challenges have been documented for speakers of other languages lacking retroflex affricates. In a production and perception study, it was found that native English speakers achieved a perception accuracy rate of 29% for zh [tʂ] and 80% for ch [tʂ^h] on a forced-choice identification task, and a production accuracy rate of 25% for zh [tʂ] and 47% for ch [tʂ^h] as appraised by trained listeners (Wang & Chen, 2020). By comparison, in a perception-focused study, it was found that Malay speakers mistook zh [tʂ] for ch [tʂ^h] 56% of the time whilst Burmese speakers did so 50% of the time; conversely, Malay speakers mistook ch [tʂ^h] for zh [tʂ] 44% of the time, whilst Burmese speakers showed a 50% error rate for this confusion (Wu, 2023).

Although these results, especially the production results, are not directly comparable because of methodological differences, they still show that difficulties in acquiring retroflex affricates arise across L1 backgrounds that lack these segments, not only for Spanish speakers.

1.2. Spanish Phonological System and Its Impact on zh [tʂ]/ch [tʂ^h] Acquisition

As mentioned previously, retroflex affricates do not exist in Spanish (Campos-Astorkiza, 2012). Additionally, Spanish also lacks aspiration as a phonemic contrast, instead using voicing (Zhao, 2010). This contrasts with Standard Mandarin Chinese, which lacks voicing distinctions (Lee & Zee, 2003), though voicing is found in other Chinese variants, such as Shanghainese (Chen & Gussenhoven, 2015).

Based on this phonological difference, Zhao (2010) hypothesised that Spanish speakers would encounter difficulties with aspiration in Mandarin due to their lack of experience with aspiration as a distinctive feature. This prediction requires reconsideration given that many Spanish dialects do exhibit aspiration in the form of syllable-final (coda) sibilant weakening. This phenomenon occurs in Western Andalusian Spanish and, to a lesser extent, Eastern Andalusian (De Haro & Hajek, 2011), Cuban and Puerto Rican Spanish (Terrell, 1977), and the Chilean and Peruvian varieties spoken by our participants. Chilean Spanish demonstrates this feature (Rogers, 2020), whilst Peruvian Spanish, though generally considered a non-/s/ aspiration dialect, shows aspiration in syllable-final preconsonantal position (Núñez-Méndez, 2022).

However, evidence from English acquisition suggests that this dialectal experience with aspiration may not be sufficient to ensure native-like VOT production in a second language. In one study, natives of Spanish introduced to English later in life underproduced VOT for the English /t/, whilst natives of Spanish introduced to English early in life did not differ from English monolinguals (Flege, 1991).

Furthermore, Spanish does possess one affricate: the voiceless alveopalatal affricate [tʃ] (Campos-Astorkiza, 2012). Given that Spanish speakers have experience with both aspiration (in many dialects) and affricate sounds, the primary challenge in acquiring Mandarin zh [tʂ] and ch [tʂ^h] seems to lie specifically in the retroflex characteristics rather than in the aspiration contrast or affricate features themselves. However, according to Dai (2015), accurate aspiration production should not be taken for granted, even though aspiration exists in many Spanish dialects, because Spanish aspiration never occurs in the context of a complex consonant or consonant cluster like ch [tʂ^h], potentially resulting in under-aspiration or over-aspiration.

1.3. L2 Phonetics Perception and Production Models

Researchers have developed L2 speech perception and production models to understand cross-linguistic influence in L2 phonetic learning, specifically, how L1 phonetic systems influence L2 acquisition. Two prominent models, the Revised Speech Learning Model (SLM-r) and the Perceptual Assimilation Model for L2 learners (PAM-L2), remain highly influential in contemporary research and provide crucial theoretical frameworks for understanding the acquisition of Mandarin retroflex affricates by native Spanish speakers.

The Revised Speech Learning Model (Flege & Bohn, 2021) postulates that sequential bilinguals reorganise their phonetic systems across the life span in response to L2 input received during naturalistic learning. SLM-r examines how L1 and L2 phonetic subsystems coexist and interact within a common phonetic space. SLM-r identifies two primary outcomes for L2 sound learning: (1) New Phonetic Category: the learner forms a separate, independent category for the L2 sound, distinct from any L1 category; (2) Composite L1–L2 Category: the learner does not form a new category, and instead develops a merged category based on the combined phonetic distributions of perceptually linked L1 and L2 sounds.

The formation of a new phonetic category depends primarily on three factors: (1.1) perceived cross-language phonetic dissimilarity between the L2 sound and the closest L1 sound; (1.2) quantity and quality of L2 input obtained through meaningful conversations; and (1.3) precision of the closest L1 category at the time L2 learning begins. When a new category is successfully formed, dissimilation may occur (the L1 category shifts away from the L2 category to maintain phonetic contrast). When a composite category develops instead, bidirectional assimilation occurs (both L1 and L2 realisations move toward intermediate values). SLM-r further proposes that production and perception coevolve without precedence, that learners have full access to all phonetic features (even those not exploited in L1), and that phonetic categories remain malleable throughout life, continuously responding to recent input.

The Perceptual Assimilation Model for L2 (Best & Tyler, 2007) postulates that naive listeners (i.e., individuals with no previous experience with the L2) who encounter an unfamiliar non-native phone (the smallest unit of sound) are prone to perceptually assimilate it to the closest native counterpart in articulatory terms, based on their L1 experience. PAM-L2 identifies three main assimilation types: (1) Categorised: the listener perceives the L2 phone as a segment of their own L1; (2) Uncategorised: the listener does not perceive the L2 phone as a segment of their own L1 but recognises it as a speech sound; and (3) Non-Assimilated: the listener does not recognise the L2 phone as a reproducible human speech sound at all.

Within Categorised assimilation, three scenarios are possible: (1.1) Two-Category (TC): two L2 phones map to two distinct L1 phones, yielding very good to excellent discrimination; (1.2) Single-Category (SC): two L2 phones map to the same L1 phone, yielding poor discrimination; and (1.3) Category Goodness (CG): two L2 phones map to the same L1 phone but are perceived as differing in goodness-of-fit, yielding intermediate discrimination. Within Uncategorised assimilation, two scenarios exist: (2.1) Uncategorised–Categorised (UC): one L2 phone maps to an L1 phone whilst the other does not, yielding very good discrimination; (2.2) Uncategorised–Uncategorised (UU): neither L2 phone maps to L1 phones, yielding poor to moderate discrimination. Finally, Non-Assimilated (NA) patterns typically yield good to excellent discrimination.
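The contrast types and their predicted discrimination levels listed above can be held in a simple lookup table. The following is a minimal Python sketch for reference only; the dictionary and function names are illustrative assumptions and are not part of PAM-L2 itself.

```python
# Predicted discrimination for each PAM-L2 contrast type, as summarised
# in the text. Names here are hypothetical helpers, not PAM-L2 terminology.
PAM_L2_DISCRIMINATION = {
    "TC": "very good to excellent",  # Two-Category
    "SC": "poor",                    # Single-Category
    "CG": "intermediate",            # Category Goodness
    "UC": "very good",               # Uncategorised-Categorised
    "UU": "poor to moderate",        # Uncategorised-Uncategorised
    "NA": "good to excellent",       # Non-Assimilated
}


def predicted_discrimination(contrast_type: str) -> str:
    """Return the discrimination level predicted for a contrast type."""
    return PAM_L2_DISCRIMINATION[contrast_type.upper()]


print(predicted_discrimination("SC"))
print(predicted_discrimination("CG"))
```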

1.4. Objectives and Hypotheses of the Current Study

Whilst research on L2 Mandarin perception and production has grown substantially when examined separately, studies investigating both modalities remain limited (Wang & Chen, 2020). Within this already limited field, research investigating the perception and production of Mandarin consonants by native Spanish speakers is even scarcer. Among studies addressing both perception and production, Zhao (2010) appears to be the only study found examining Mandarin j [ʨ] and q [ʨ^h] among Spanish native speakers.

The work of Zhao (2010) seems to constitute the sole precedent investigating both perception and production of Mandarin consonants among native Spanish speakers. Zhao (2010) applied PAM and SLM, the predecessors of PAM-L2 and SLM-r. The decision to employ PAM-L2 and SLM-r as theoretical frameworks for the perception and production tasks, respectively, was therefore grounded in that precedent, allowing for methodological alignment with updated frameworks and facilitating comparison across studies.

With regard to the target consonants zh [tʂ] and ch [tʂ^h] in the present study, only one investigation (Dai, 2015) with Spanish native speakers has been identified, but it focuses exclusively on the production dimension, with the perception component remaining unexplored.

Based on the literature reviewed, we formulated two hypotheses, one for perception and one for production. Regarding perception, considering the PAM-L2 model and the phonetics of both Mandarin and Spanish, we hypothesise that native Spanish speakers will assimilate both ch [tʂ^h] and zh [tʂ] to the only affricate available in Spanish, [tʃ]. Two outcomes would then be plausible: SC assimilation, with correspondingly poor discrimination, or CG assimilation, with better discrimination.

For production, according to the SLM-r model and considering the phonetics of both Mandarin and Spanish, Spanish speakers will likely treat Mandarin [tʂ] and [tʂ^h] as “similar sounds” to Spanish [tʃ], given that all three are affricates and that [tʃ] is articulatorily the closest sound available in their repertoire. This could lead to a Composite L1–L2 Category: rather than forming a new L2 sound category, learners would rely on their L1 repertoire, so their production would not be entirely accurate.

This study aims to fill this gap by investigating the relationship between the perception and production of the aspiration contrasts in Mandarin zh [tʂ] and ch [tʂ^h] by adult native Spanish speakers. Specifically, the objectives are to:

(1) Examine the perception and production accuracy of zh [tʂ]/ch [tʂ^h] aspiration contrasts by Spanish native speakers;
(2) Analyse VOT patterns in zh [tʂ]/ch [tʂ^h] production;
(3) Investigate the perception–production relationship.

2. Experiment 1: Perception of Mandarin Consonants

2.1. Methodology

2.1.1. Participants

The participants were ten native speakers of Spanish, comprising six males (60%) and four females (40%), with a mean age of 26.1 years, a minimum of 19 years, a maximum of 32 years, and a sample standard deviation of 4 years. All were studying at Nanjing University and were nationals of Spanish-speaking countries, with three from Peru and seven from Chile. Six participants were enrolled in a short-term Chinese language programme, whilst the remaining four were pursuing bachelor’s degrees: three in International Chinese Language Education and one in Chinese Language.

All but one participant, who spoke only Spanish and Chinese, reported speaking English as a third language; two participants also reported knowledge of a fourth language: Italian in one case and Portuguese in the other. One participant spoke five languages, including Italian and French, whilst another spoke six languages, including Korean, Japanese, and Thai.

The participants had studied Chinese for a mean of 6.2 years, with a minimum of 3 years and a maximum of 17 years, and a sample standard deviation of 4.5 years. Proficiency was self-reported as a level on the HSK (Hànyǔ Shuǐpíng Kǎoshì, or Chinese Proficiency Exam), version 2.0, which features six levels of exponentially increasing difficulty from HSK 1 (easiest) to HSK 6 (most advanced): one participant reported being at HSK 3, four at HSK 4, three at HSK 5, and two at HSK 6. At the time of data collection, all participants were familiar with the consonants under investigation and with the overall phonetic system of Mandarin Chinese.

2.1.2. Materials

The stimuli used in Experiment 1 consisted of 128 syllables formed by combining the initials zh and ch (corresponding to the phonemes [tʂ] and [tʂ^h], respectively) with four tones (Tone 1: high level; Tone 2: rising; Tone 3: falling–rising; and Tone 4: falling) and 20 different finals: i, a, e, ai, ei, ao, ou, an, en, ang, eng, ong, u, ua, uo, uai, ui, uan, un, and uang. These finals can be further grouped into kāikǒuhū (开口呼), or non-labialised syllables, which by definition have no medial glide immediately after the initial (i, a, e, ai, ei, ao, ou, an, en, ang, eng, ong), and hékǒuhū (合口呼), or labialised syllables, in which the initial is immediately followed by a medial glide, represented by u in pinyin (u, ua, uo, uai, ui, uan, un, and uang). Only combinations corresponding to existing Chinese characters were included, as shown in Figure 1.
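The design space described above can be enumerated in a few lines. The following Python sketch is illustrative only: it assumes that, for this particular list of finals, a pinyin spelling beginning with u is a sufficient test for the labialised (hékǒuhū) group, and it counts all possible initial, final, and tone combinations rather than the 128 attested ones, which are listed in Figure 1 and not reproduced here.

```python
from itertools import product

INITIALS = ["zh", "ch"]
TONES = [1, 2, 3, 4]
FINALS = [
    "i", "a", "e", "ai", "ei", "ao", "ou", "an", "en", "ang", "eng", "ong",
    "u", "ua", "uo", "uai", "ui", "uan", "un", "uang",
]


def is_labialised(final: str) -> bool:
    """True for finals in the labialised group.

    Assumption: within this list of 20 finals, starting with 'u' in pinyin
    identifies the group; this is not a general rule for all of pinyin.
    """
    return final.startswith("u")


non_labialised = [f for f in FINALS if not is_labialised(f)]
labialised = [f for f in FINALS if is_labialised(f)]

# Every possible (initial, final, tone) combination before filtering
# for syllables that correspond to existing Chinese characters.
candidates = list(product(INITIALS, FINALS, TONES))

print(len(non_labialised), len(labialised), len(candidates))  # 12 8 160
```

Of the 160 candidate combinations, the study retained the 128 that correspond to existing characters.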

The stimuli were recorded by two native speakers of Mandarin Chinese, one male and one female, both first-year master’s students of teaching Chinese to speakers of other languages at Nanjing University. The male speaker was from Fujian and the female speaker was from Hubei. The use of both male and female voices was intentional to simulate the format of the HSK exam, which features both male and female speakers. Recordings were made using a Lenovo Legion Y7000 laptop (Lenovo Group Ltd., Shanghai, China) and a HyperX SoloCast microphone (HP Inc., Lisbon, Portugal). Audio was captured in Praat (6.4.27) at a sampling frequency of 44,100 Hz and a bit depth of 16 bits, recorded in mono, and saved as WAV files. The speakers recorded at a distance of approximately 15 to 35 cm from the microphone.

The resulting 256 audio files (128 syllables per speaker) were edited in Microsoft Clipchamp to create 128 combined tokens, each consisting of a male–female pair. These tokens were presented in random order and compiled into a single video displaying a black screen with only the item number. The video had a total duration of 10 min and 44 s. Each token lasted approximately 5 s: the first second presented the male voice, the second presented the female voice, and the remaining 3 s were silent, allowing participants time to record their answers. Participants were provided with an answer sheet consisting of eight A4 pages numbered from 1 to 128; for each item, only two response options were available: zh and ch.

2.1.3. Procedure

The perception experiment was conducted at both the Gulou and Xianlin campuses of Nanjing University. The testing environment was kept as free from background noise as possible, and participants wore Black Shark JoyBuds headphones (Xiaomi Corporation, Shanghai, China) with active noise cancelling to listen to the audio stimuli, which were played from the same Lenovo Legion Y7000 laptop used for recording.

Before the experiment began, all participants received clear instructions to ensure that they understood the task. Specifically, they were told that each item would present a male voice in the first second, a female voice in the following second, and then a three-second interval before the next stimulus. Participants were not permitted to replay any stimuli, as the experiment was conducted in a single session. However, they were allowed to review and modify their responses after listening to each audio token, if desired.

2.2. Results

Statistical analyses were conducted using generalised linear mixed-effects models with binomial family (lme4 package in R (4.5.0); Bates et al., 2015), with fixed effects for phonological factors and random intercepts for Subject and Item. A total of 1280 responses were collected in the listening experiment, of which 213 (16.64%) were incorrect. By initial, syllables with zh [tʂ] drew 95 incorrect responses out of 610 (15.57%), whilst those with ch [tʂ^h] drew 118 out of 670 (17.61%); mixed-effects analysis revealed no significant difference (β = 0.17, SE = 0.17, z = 1.00, p = 0.320). In terms of tonal distribution, Tone 1 produced 77 incorrect responses out of 380 total responses (20.26%), Tone 2 had 37 incorrect responses out of 230 (16.09%), Tone 3 had 49 incorrect responses out of 330 (14.85%), and Tone 4 had 50 incorrect responses out of 340 (14.71%); the omnibus test was not significant (χ²(3) = 5.17, p = 0.160). Regarding syllable structure, non-labialised syllables resulted in 131 incorrect responses out of 830 total responses (15.78%), whereas labialised syllables accounted for 82 incorrect responses out of 450 total responses (18.22%); this difference was not significant (β = 0.16, SE = 0.18, z = 0.93, p = 0.354). Random effects revealed substantial between-subject variability (SD = 1.76) relative to between-item variability (SD = 0.31). The data are shown in Figure 2.
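The descriptive error rates reported above follow directly from the error and response counts. The short Python sketch below recomputes them and checks that each breakdown partitions the same 1280 responses and 213 errors. It covers only the descriptive percentages, not the mixed-effects models, which were fitted in R; the function and variable names are illustrative.

```python
def error_rate(errors: int, total: int) -> float:
    """Percentage of incorrect responses, rounded to two decimals."""
    return round(100 * errors / total, 2)


# (errors, total responses) as reported in the text
by_initial = {"zh": (95, 610), "ch": (118, 670)}
by_tone = {1: (77, 380), 2: (37, 230), 3: (49, 330), 4: (50, 340)}
by_structure = {"non-labialised": (131, 830), "labialised": (82, 450)}

breakdowns = [("initial", by_initial), ("tone", by_tone), ("structure", by_structure)]
for name, table in breakdowns:
    # Each breakdown splits the same pool of responses and errors.
    assert sum(errors for errors, _ in table.values()) == 213
    assert sum(total for _, total in table.values()) == 1280
    for level, (errors, total) in table.items():
        print(name, level, error_rate(errors, total))

print("overall", error_rate(213, 1280))
```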

Figure 3 and Figure 4 illustrate the error rate distribution across phonological contexts for zh [tʂ] and ch [tʂ^h], respectively. The horizontal axis categorises stimuli by combining tone (1–4) and vowel context. Specifically, “NL” denotes non-labialised syllables (those without a medial glide), whilst “L” represents labialised syllables (containing a medial glide). For example, “ZH1L” represents error rates for syllables with zh (unaspirated voiceless retroflex affricate [tʂ]), Tone 1 (high level tone), in a labialised vowel context.
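The labelling scheme can be expressed as a small function. This Python sketch is an illustration, not the authors' tooling: the function name is hypothetical, and it assumes that finals spelled with an initial u in pinyin form the labialised group, which holds for the finals used in this study.

```python
def condition_label(initial: str, tone: int, final: str) -> str:
    """Build a condition label such as 'ZH1L' from initial, tone and final."""
    if initial not in ("zh", "ch"):
        raise ValueError(f"unexpected initial: {initial!r}")
    if tone not in (1, 2, 3, 4):
        raise ValueError(f"unexpected tone: {tone!r}")
    # Assumption: a final beginning with 'u' carries the medial glide (L);
    # every other final in the stimulus set is non-labialised (NL).
    structure = "L" if final.startswith("u") else "NL"
    return f"{initial.upper()}{tone}{structure}"


print(condition_label("zh", 1, "uang"))  # ZH1L
print(condition_label("zh", 2, "u"))     # ZH2L
print(condition_label("ch", 4, "ang"))   # CH4NL
```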

For zh [tʂ] syllables (Figure 3), labialised contexts showed the following error rates: ZH1L, 16 of 80 responses (20.00%); ZH2L, 0 of 10 responses (0.00%); ZH3L, 10 of 60 responses (16.67%); and ZH4L, 12 of 60 responses (20.00%). Non-labialised contexts yielded the following: ZH1NL, 18 of 110 responses (16.36%); ZH2NL, 9 of 60 responses (15.00%); ZH3NL, 11 of 110 responses (10.00%); and ZH4NL, 19 of 120 responses (15.83%).

For ch [tʂ^h] syllables (Figure 4), labialised contexts yielded the following: CH1L, 19 of 80 responses (23.75%); CH2L, 12 of 60 responses (20.00%); CH3L, 9 of 50 responses (18.00%); and CH4L, 4 of 50 responses (8.00%). Non-labialised contexts showed the following: CH1NL, 22 of 110 responses (20.00%); CH2NL, 16 of 100 responses (16.00%); CH3NL, 19 of 110 responses (17.27%); and CH4NL, 15 of 110 responses (13.64%).

Additionally, the results were analysed by gender, nationality, HSK level, and years of study. Male participants (n = 768 responses) made 92 incorrect identifications (11.98%), with zh [tʂ] eliciting 44 errors from 366 responses (12.02%) and ch [tʂ^h] producing 48 errors from 402 responses (11.94%). Female participants (n = 512 responses) made 121 incorrect identifications (23.63%), with zh [tʂ] accounting for 51 errors from 244 responses (20.90%) and ch [tʂ^h] yielding 70 errors from 268 responses (26.12%). Gender was examined as an additional predictor in a separate mixed-effects model and showed a marginal effect (χ²(1) = 3.03, p = 0.082). These gender differences are illustrated in Figure 5.

Chilean participants (n = 896 responses) produced 168 total errors (18.75%), distributed as 76 errors for zh [tʂ] from 427 responses (17.80%) and 92 errors for ch [tʂ^h] from 469 responses (19.62%). Peruvian participants (n = 384 responses) had 45 total errors (11.72%), comprising 19 errors for zh [tʂ] from 183 responses (10.38%) and 26 errors for ch [tʂ^h] from 201 responses (12.94%). Nationality was examined as an additional predictor in a separate mixed-effects model and was not significant (χ²(1) = 2.52, p = 0.112). These nationality differences are illustrated in Figure 6.

The HSK 3 participant (n = 128 responses) had 4 total errors (3.13%), all occurring with zh [tʂ] stimuli (4 of 61 responses, 6.56%), with 0 errors for ch [tʂ^h] (0 of 67 responses, 0.00%). HSK 4 participants (n = 512 responses) produced 74 total errors (14.45%), with 33 errors for zh [tʂ] from 244 responses (13.52%) and 41 errors for ch [tʂ^h] from 268 responses (15.30%). HSK 5 participants (n = 384 responses) had 119 total errors (30.99%), comprising 51 errors for zh [tʂ] from 183 responses (27.87%) and 68 errors for ch [tʂ^h] from 201 responses (33.83%). HSK 6 participants (n = 256 responses) had 16 total errors (6.25%), distributed as 7 errors for zh [tʂ] from 122 responses (5.74%) and 9 errors for ch [tʂ^h] from 134 responses (6.72%). Due to unbalanced group sizes (particularly HSK 3 with n = 1), inferential statistical tests were not conducted for HSK level comparisons. These HSK level differences are illustrated in Figure 7.

Regarding years of study, participants were categorised into three distinct groups based on study duration: those with ≤5 years of Chinese study, those with 6–10 years, and those with ≥11 years. Participants in the ≤5 years group demonstrated 158 errors out of 768 responses (20.57%), with zh [tʂ] accounting for 69 errors out of 366 responses (18.85%) and ch [tʂ^h] yielding 89 errors out of 402 responses (22.14%). The 6–10 years group exhibited 55 errors out of 384 responses (14.32%), comprising 26 errors out of 183 responses for zh [tʂ] (14.21%) and 29 errors out of 201 responses for ch [tʂ^h] (14.43%). The ≥11 years group achieved perfect accuracy with 0 errors out of 128 responses (0.00%), including 0 errors out of 61 responses for zh [tʂ] (0.00%) and 0 errors out of 67 responses for ch [tʂ^h] (0.00%). Due to unbalanced group sizes (particularly ≥11 years with n = 1), inferential statistical tests were not conducted for years of study comparisons. These patterns across study duration are illustrated in Figure 8.
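The grouping by study duration can be sketched as a simple binning function. The Python example below assumes that years of study are recorded as whole numbers, since the text does not say how fractional years would be handled; the names are illustrative. It also recovers the group sizes from the reported response counts, given that each participant contributed 128 responses.

```python
RESPONSES_PER_PARTICIPANT = 128  # one response per stimulus syllable


def study_duration_group(years: int) -> str:
    """Assign a participant to one of the three study-duration bins.

    Assumption: years are whole numbers, so the bins do not overlap.
    """
    if years <= 5:
        return "<=5"
    if years <= 10:
        return "6-10"
    return ">=11"


# Responses per group as reported; dividing by 128 gives participants per group.
group_responses = {"<=5": 768, "6-10": 384, ">=11": 128}
group_sizes = {
    group: responses // RESPONSES_PER_PARTICIPANT
    for group, responses in group_responses.items()
}

# 3 and 17 are the reported minimum and maximum years of study.
print(study_duration_group(3), study_duration_group(17))
print(group_sizes)
```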

2.3. Discussion

The overall correct answer rate in the experiment was 83.36%, which is close to the 85% threshold used as a benchmark for successful perception in previous studies (Zhao, 2010). Descriptively, the correct answer rate was 84.43% for zh [tʂ] and 82.39% for ch [tʂ^h], better than the rates reported in the reviewed literature (Wang & Chen, 2020; Wu, 2023). However, mixed-effects analysis revealed no significant difference between these phonemes (p = 0.320), indicating that the 2.04 percentage-point difference likely reflects sampling variability rather than a systematic perceptual advantage. These findings suggest that Mandarin consonants zh [tʂ] and ch [tʂ^h] do not pose a major perceptual challenge for native Spanish speakers, with both phonemes being equally accessible. This is consistent with Zhao’s (2010) research on Mandarin consonant perception by Spanish speakers, which also reported minimal differences (1.6%) between palatalised affricates j [ʨ] and q [ʨ^h].

Interestingly, both studies demonstrate high accuracy rates (Zhao: 87.9%; our study: 83.36%), contradicting Zhao’s (2010) hypothesis that Spanish speakers would struggle with aspiration due to their lack of experience with this feature, at least with regard to perception. This discrepancy suggests that aspiration distinctions are more accessible to Spanish speakers than initially predicted, possibly because Spanish speakers are able to leverage their experience with the voicing contrast (e.g., /p/-/b/, /t/-/d/), which similarly relies on VOT differences along the same acoustic dimension. These results indicate that Spanish speakers can effectively utilise temporal acoustic cues for consonant identification in L2 Mandarin perception.

Regarding tone, error rates ranged descriptively from 14.71% (Tone 4) to 20.26% (Tone 1), but the omnibus test was not statistically significant (p = 0.160), meaning the present data failed to uncover a reliable effect of tone on perception accuracy. This contrasts with Zhao (2010), who found significant tone effects among Spanish speakers; such an effect may therefore still exist, but the present study may have lacked the statistical power to detect it given the limited sample size (n = 10).

Whether the syllable was labialised or non-labialised had a minimal effect on error rate: the two conditions differed by only 2.44 percentage points, and mixed-effects analysis confirmed that this difference was not statistically significant (p = 0.354). This suggests that coarticulation with a medial glide does not play a major role in native Spanish speakers’ differentiation of these Mandarin consonants in perception. The question remains largely unstudied, however, and there is little evidence beyond the present study to support this claim.

The distribution of error rates across phonological conditions revealed some unexpected patterns. Most notably, ZH2L syllables showed 0% errors, initially suggesting an exceptionally clear perceptual distinction. However, this outcome reflects a sampling limitation: only one phonotactically valid ZH2L syllable (zhú) was included, yielding just 10 tokens out of 1280 total items. Consequently, no phonological condition in this study demonstrated a consistently easier perceptual context.

Gender and nationality patterns emerged in the perception data. Male participants demonstrated lower error rates than females (11.98% vs. 23.63%), whilst Peruvians outperformed Chileans (11.72% vs. 18.75%), consistent across both consonants. However, mixed-effects models revealed that gender showed a marginal trend (χ²(1) = 3.03, p = 0.082) while nationality was not significant (χ²(1) = 2.52, p = 0.112). The substantial between-subject variance (SD = 1.76) compared to between-item variance (SD = 0.31) indicates that individual learner differences contributed more to performance variation than demographic or phonological factors. The small sample size (n = 10) and unbalanced group distributions (six males vs. four females; seven Chileans vs. three Peruvians) limited statistical power to detect these effects.

The relationship between proficiency measures and perception accuracy revealed divergent patterns. HSK level showed a non-linear relationship with performance (HSK 3: 3.13%, HSK 4: 14.45%, HSK 5: 30.99%, HSK 6: 6.25%), whilst years of study demonstrated a clearer linear trend (≤5 years: 20.57%, 6–10 years: 14.32%, ≥11 years: 0%). However, unbalanced group sizes (HSK 3: n = 1, HSK 4: n = 4, HSK 5: n = 3, HSK 6: n = 2; ≤5 years: n = 6, 6–10 years: n = 3, ≥11 years: n = 1) precluded inferential statistical testing for these variables. The extreme values for single-participant groups are particularly unreliable. Further investigation with larger, stratified samples is essential to determine whether these descriptive patterns reflect genuine relationships between proficiency measures and perception accuracy.

3. Experiment 2: Production of Mandarin Consonants

3.1. Methodology

3.1.1. Participants

The participants were the same ten native Spanish speakers (six males and four females) who participated in Experiment 1. These participants, with a mean age of 26.1 years (SD = 4 years) and an average of 6.2 years studying Chinese (SD = 4.5 years), provided production data by completing a reading task that followed the perception experiment.

3.1.2. Materials

The participants were given an A4 sheet containing two tables: one with 61 syllables beginning with the initial zh [tʂ] and another with 67 syllables beginning with the initial ch [tʂ^h], totalling 128 syllables. These were the same syllable combinations tested in Experiment 1. A sample of the reading sheet used for data collection is provided in Appendix A. The audio recordings were made using the same equipment as in Experiment 1: a Lenovo Legion Y7000 laptop and a HyperX SoloCast microphone. Recordings were conducted using Praat (6.4.27), set to a sampling frequency of 44,100 Hz and a bit depth of 16 bits. Audio was recorded in mono and saved as WAV files.

3.1.3. Procedure

All participants received detailed instructions before the recording session began. They were directed to read first the zh [tʂ] table followed by the ch [tʂ^h] table, proceeding from left to right and from top to bottom within each table. Participants were instructed to maintain a pause of approximately 2 s between syllables to prevent coarticulatory effects and to facilitate subsequent acoustic analysis. Throughout the recording, participants were positioned at a distance of approximately 15 to 35 cm from the microphone. A brief break was scheduled between the two tables to allow for file saving. Participants were permitted to repeat any syllable up to three times if they felt their pronunciation was not satisfactory. Only the final production was kept for analysis. The mean duration of the recording sessions was 10 min and 40 s, with a maximum of 14 min and 26 s, a minimum of 8 min and 32 s, and a sample standard deviation of 2 min and 9 s.

3.2. Results

A total of 1280 syllables were recorded and analysed for VOT. Additionally, recordings from two Chinese native speakers (one male, one female) from the first experiment were measured for VOT as the native control group. VOT for this experiment was measured from the onset of the burst release, identified as a sudden vertical spike in the waveform that emerges from the near-silent baseline, through the frication phase (and aspiration period in the case of ch [tʂ^h]), until the onset of voicing in the following vowel. The vowel onset was identified by the appearance of the first clearly visible periodic oscillation in the waveform. The spectrogram was used as a complementary reference to confirm the transition. An example can be seen in Figure 9 below.
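The measurement definition above reduces to the difference between two annotated time points. The following is a minimal sketch in Python, assuming that the burst onset and the voicing onset have already been marked by hand and exported in seconds; the function name and the example timestamps are hypothetical and are not taken from the study's data.

```python
def vot_ms(burst_onset_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds from two hand-annotated time points (seconds)."""
    if voicing_onset_s < burst_onset_s:
        # Voicing that starts before the release would give a negative VOT;
        # this sketch rejects such tokens instead of measuring them.
        raise ValueError("voicing onset precedes burst release")
    return (voicing_onset_s - burst_onset_s) * 1000.0


# Hypothetical token: burst at 0.412 s, first periodic cycle at 0.470 s.
example_vot = vot_ms(0.412, 0.470)
```

Rejecting negative intervals is a design choice of the sketch; it mirrors the study's decision to exclude fully voiced tokens from the VOT analysis.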

Of the 1280 syllables recorded, 47 (3.67%) were excluded from the analysis for three reasons: damaged audio files (n = 29), a lost file (n = 1), and technically unusable recordings (n = 17). The unusable recordings resulted from production irregularities that rendered VOT measurement impossible. One such irregularity was the absence of complete occlusion, the feature that distinguishes affricates from fricatives: fricatives lack the stop phase necessary for VOT calculation and therefore cannot be measured for VOT. Other excluded productions were fully voiced consonants with pre-voicing throughout the occlusion and release phases. Since VOT is defined as the time interval between the release of the occlusion and the onset of vocal fold vibration, completely voiced consonants would not provide meaningful data for the intended cross-linguistic comparison of aspiration contrasts.

Regarding Voice Onset Times (VOTs), participants demonstrated distinct production patterns for both phonemes whilst exhibiting systematic underproduction compared to native speakers. For zh [tʂ], participants produced an overall mean VOT of 58 ms (SD = 36.5, range: 8–220 ms), whilst the native control group demonstrated a mean of 67 ms (SD = 23.9, range: 37–149 ms). For ch [tʂ^h], participants exhibited an overall mean of 125 ms (SD = 48.3, range: 28–313 ms), compared to the native control group’s mean of 164 ms (SD = 30.6, range: 107–288 ms). Participants showed greater variability than the native control group for both phonemes, as evidenced by larger standard deviations and wider ranges. These patterns are illustrated in Figure 10.

Statistical analyses were conducted using linear mixed-effects models (lme4 package in R; Bates et al., 2015) to test phonological factors affecting VOT production, with fixed effects for Phoneme, Tone, and Syllable Type, and random intercepts for Subject and Item. Results revealed a highly significant main effect of Phoneme (β = −64.73, SE = 2.41, t = −26.87, p < 0.001), confirming that participants successfully distinguished zh [tʂ] (M = 58 ms) from ch [tʂ^h] (M = 125 ms) in production, with a mean difference of 67 ms. Tone also showed a significant effect (χ²(3) = 64.04, p < 0.001), indicating that tonal context systematically influenced VOT duration. However, Syllable Type (labialised vs. non-labialised) was not significant (β = −1.79, p = 0.477). Random effects revealed substantial between-subject variability (SD = 27.12) relative to between-item variability (SD = 8.74).

The VOT data from participants were subsequently analysed relative to the control group intervals to convert measurements into binary outcomes. A conservative threshold approach was adopted to minimise false positives. Separate VOT ranges were established for zh [tʂ] and ch [tʂ^h] based on native speaker productions. Participant VOT values falling outside the phoneme-specific native speaker range were classified as incorrect, whilst those within the appropriate interval were deemed correct. The 47 previously excluded syllables were maintained as exclusions for this binary classification.
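The conversion described above can be sketched as a simple interval check. This is a minimal illustration in Python, not the authors' script. The bounds below are placeholders taken from the overall native minimum and maximum reported earlier in the Results (zh [tʂ]: 37–149 ms; ch [tʂ^h]: 107–288 ms), and the intervals actually used in the study may have been defined differently. Treating the bounds as inclusive is also an assumption, and the example tokens are hypothetical.

```python
# Placeholder bounds in ms (overall native minimum and maximum per phoneme).
# ASSUMPTION: the study's actual classification intervals may differ.
NATIVE_RANGE_MS = {"zh": (37.0, 149.0), "ch": (107.0, 288.0)}


def classify(phoneme: str, vot_ms: float) -> bool:
    """True (correct) if the VOT lies within the native range for that phoneme."""
    low, high = NATIVE_RANGE_MS[phoneme]
    return low <= vot_ms <= high  # inclusive bounds are an assumption


# Hypothetical tokens, not study data.
tokens = [("zh", 58.0), ("zh", 20.0), ("ch", 125.0), ("ch", 90.0)]
labels = [classify(phoneme, vot) for phoneme, vot in tokens]
```

Because the check is phoneme-specific, a value such as 125 ms counts as correct for ch [tʂ^h] but would also fall inside the placeholder zh [tʂ] interval; the classification therefore measures proximity to native values for the intended target, not whether the two categories are kept apart.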

A total of 1233 VOT measurements were converted to binary classifications, of which 916 (74.29%) fell outside the native range and were therefore classified as incorrect. By initial, 405 of the 601 valid measurements for zh [tʂ] (67.39%) and 511 of the 632 valid measurements for ch [tʂ^h] (80.85%) were classified as incorrect.
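The percentages above follow directly from the reported counts. The short Python sketch below recomputes them as a consistency check; the variable and function names are illustrative only.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


recorded, excluded = 1280, 47
valid = recorded - excluded  # tokens entering the binary classification

# (incorrect, valid) counts per initial, as reported in the text.
counts = {"zh": (405, 601), "ch": (511, 632)}

total_incorrect = sum(wrong for wrong, _ in counts.values())
overall_error = pct(total_incorrect, valid)
error_rate = {k: pct(wrong, n) for k, (wrong, n) in counts.items()}
accuracy = {k: pct(n - wrong, n) for k, (wrong, n) in counts.items()}
```

The complementary accuracy values (32.61% for zh [tʂ] and 19.15% for ch [tʂ^h]) are the figures carried into the cross-modality comparison.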

Binary accuracy was analysed using generalised linear mixed-effects models with binomial family (lme4 package in R; Bates et al., 2015), testing Phoneme, Tone, and Syllable Type as fixed effects with random intercepts for Subject and Item. Results confirmed a significant main effect of Phoneme (β = 0.96, z = 4.71, p < 0.001), with zh [tʂ] showing higher accuracy (32.6%) than ch [tʂ^h] (19.1%). Tone also significantly affected accuracy (χ²(3) = 11.74, p = 0.008), with declining accuracy from Tone 1 (31.2%) to Tone 4 (21.1%). Syllable Type showed a significant effect (β = 0.44, z = 2.09, p = 0.036), with non-labialised syllables achieving higher accuracy (28.0%) than labialised syllables (21.2%). Random effects indicated substantial between-subject variance (SD = 1.23) compared to between-item variance (SD = 0.72).
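Coefficients from a binomial mixed-effects model are expressed on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below assumes the default logit link and an outcome coded as correct = 1; the direction of each contrast (zh [tʂ] relative to ch [tʂ^h], non-labialised relative to labialised) is inferred from the reported accuracies and is likewise an assumption.

```python
import math

# Fixed-effect estimates on the log-odds scale, as reported in the text.
log_odds = {"Phoneme": 0.96, "Syllable Type": 0.44}

# exp(beta) converts a log-odds coefficient into an odds ratio.
odds_ratio = {name: math.exp(beta) for name, beta in log_odds.items()}
```

Under these assumptions the odds of a within-range production are roughly 2.6 times higher for zh [tʂ] than for ch [tʂ^h], and roughly 1.6 times higher for non-labialised than for labialised syllables, conditional on the other predictors and the random effects.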

In terms of tonal distribution, Tone 1 produced 249 incorrect responses out of 362 total responses (68.78%), Tone 2 had 159 incorrect responses out of 221 (71.95%), Tone 3 had 250 incorrect responses out of 323 (77.40%), and Tone 4 had 258 incorrect responses out of 327 (78.90%). Regarding syllable structure, non-labialised syllables resulted in 585 incorrect responses out of 813 total responses (71.96%), whereas labialised syllables accounted for 331 incorrect responses out of 420 total responses (78.81%). The data are shown in Figure 11.

The distribution of production errors across phonological contexts was as follows. For zh [tʂ] syllables, labialised contexts showed the following error rates: ZH1L, 52 of 76 responses (68.42%); ZH2L, 8 of 9 responses (88.89%); ZH3L, 43 of 59 responses (72.88%); and ZH4L, 40 of 58 responses (68.97%). Non-labialised zh [tʂ] contexts yielded the following: ZH1NL, 60 of 110 responses (54.55%); ZH2NL, 38 of 60 responses (63.33%); ZH3NL, 78 of 110 responses (70.91%); and ZH4NL, 86 of 119 responses (72.27%). These patterns are illustrated in Figure 12.

Conversely, ch [tʂ^h] syllables in labialised contexts yielded the following: CH1L, 62 of 72 responses (86.11%); CH2L, 42 of 55 responses (76.36%); CH3L, 46 of 48 responses (95.83%); and CH4L, 38 of 43 responses (88.37%). Non-labialised ch [tʂ^h] contexts showed the following: CH1NL, 75 of 104 responses (72.12%); CH2NL, 71 of 97 responses (73.20%); CH3NL, 83 of 106 responses (78.30%); and CH4NL, 94 of 107 responses (87.85%). The data are shown in Figure 13.
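The sixteen context cells reported above can be summed to recover the totals by tone, by syllable type, and by initial. The Python sketch below performs that aggregation as a consistency check; the cell labels follow the text (initial, tone digit, L for labialised or NL for non-labialised), whilst the helper names are illustrative.

```python
from collections import defaultdict

# (incorrect, total) per context cell, as reported in the text.
cells = {
    "ZH1L": (52, 76), "ZH2L": (8, 9), "ZH3L": (43, 59), "ZH4L": (40, 58),
    "ZH1NL": (60, 110), "ZH2NL": (38, 60), "ZH3NL": (78, 110), "ZH4NL": (86, 119),
    "CH1L": (62, 72), "CH2L": (42, 55), "CH3L": (46, 48), "CH4L": (38, 43),
    "CH1NL": (75, 104), "CH2NL": (71, 97), "CH3NL": (83, 106), "CH4NL": (94, 107),
}


def aggregate(key_of):
    """Sum (incorrect, total) over all cells that share the same key."""
    sums = defaultdict(lambda: [0, 0])
    for label, (wrong, total) in cells.items():
        sums[key_of(label)][0] += wrong
        sums[key_of(label)][1] += total
    return {key: tuple(pair) for key, pair in sums.items()}


by_tone = aggregate(lambda s: s[2])      # third character is the tone digit
by_type = aggregate(lambda s: s[3:])     # "L" or "NL"
by_initial = aggregate(lambda s: s[:2])  # "ZH" or "CH"

# Cells with the highest and lowest error proportion.
worst = max(cells, key=lambda s: cells[s][0] / cells[s][1])
best = min(cells, key=lambda s: cells[s][0] / cells[s][1])
```

The aggregated counts match the tone and syllable-type totals reported earlier (for example, 249 of 362 for Tone 1 and 331 of 420 for labialised syllables), and the extreme cells are CH3L and ZH1NL.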

For errors by gender, males had an overall error rate of 537 out of 723 (74.27%), an error rate of 244 out of 357 (68.35%) for zh [tʂ], and an error rate of 293 out of 366 (80.05%) for ch [tʂ^h], whilst females had an overall error rate of 379 out of 510 (74.31%), an error rate of 161 out of 244 (65.98%) for zh [tʂ], and an error rate of 218 out of 266 (81.95%) for ch [tʂ^h]. Gender was examined as an additional predictor in separate mixed-effects models for both VOT and binary accuracy, and showed no significant effect (VOT: χ²(1) = 0.01, p = 0.915; binary: χ²(1) = 0.45, p = 0.502). These gender differences are illustrated in Figure 14.

As for nationality, Peruvians had an overall error rate of 239 out of 348 (68.68%), an error rate of 102 out of 174 (58.62%) for zh [tʂ], and an error rate of 137 out of 174 (78.74%) for ch [tʂ^h], whilst Chileans had an overall error rate of 677 out of 885 (76.50%), an error rate of 303 out of 427 (70.96%) for zh [tʂ], and an error rate of 374 out of 458 (81.66%) for ch [tʂ^h]. Nationality was examined as an additional predictor in separate mixed-effects models for both VOT and binary accuracy, and showed no significant effect (VOT: χ²(1) = 0.55, p = 0.458; binary: χ²(1) = 0.32, p = 0.571). These nationality differences are illustrated in Figure 15.

For errors by HSK level, the HSK 3 group had an overall error rate of 127 out of 128 (99.22%), an error rate of 60 out of 61 (98.36%) for zh [tʂ], and an error rate of 67 out of 67 (100%) for ch [tʂ^h]. The HSK 4 group had an overall error rate of 350 out of 483 (72.46%), an error rate of 157 out of 235 (66.81%) for zh [tʂ], and an error rate of 193 out of 248 (77.82%) for ch [tʂ^h]. The HSK 5 group had an overall error rate of 284 out of 374 (75.94%), an error rate of 119 out of 183 (65.03%) for zh [tʂ], and an error rate of 165 out of 191 (86.39%) for ch [tʂ^h]. The HSK 6 group had an overall error rate of 155 out of 248 (62.50%), an error rate of 69 out of 122 (56.56%) for zh [tʂ], and an error rate of 86 out of 126 (68.25%) for ch [tʂ^h]. Due to unbalanced group sizes (particularly HSK 3 with n = 1), inferential statistical tests were not conducted for HSK level comparisons in production. These HSK level differences are illustrated in Figure 16.

Finally, regarding years of study, production error rates showed that the ≤5 years group had an overall error rate of 534 out of 730 (73.15%), an error rate of 224 out of 357 (62.75%) for zh [tʂ], and an error rate of 310 out of 373 (83.11%) for ch [tʂ^h]. The 6–10 years group had an overall error rate of 317 out of 375 (84.53%), an error rate of 159 out of 183 (86.89%) for zh [tʂ], and an error rate of 158 out of 192 (82.29%) for ch [tʂ^h]. The ≥11 years group had an overall error rate of 65 out of 128 (50.78%), an error rate of 22 out of 61 (36.07%) for zh [tʂ], and an error rate of 43 out of 67 (64.18%) for ch [tʂ^h]. Due to unbalanced group sizes (particularly ≥11 years with n = 1), inferential statistical tests were not conducted for years of study comparisons in production. These patterns across study durations are illustrated in Figure 17.

3.3. Discussion

Participants distinguished zh [tʂ] from ch [tʂ^h] in production, with mean VOTs of 58 ms (zh [tʂ]) and 125 ms (ch [tʂ^h]), yielding a 2.2× ratio comparable to natives (2.4×) and to Dai’s (2015) advanced learners (2.1×), indicating successful aspiration contrast maintenance despite VOT underproduction.
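The ratios cited above can be reproduced from the group means. A brief Python sketch, using only the mean VOT values reported in this study (the dictionary layout and names are illustrative):

```python
# Mean VOT in ms, as reported for learners and the native control group.
mean_vot_ms = {
    "learners": {"zh": 58, "ch": 125},
    "natives": {"zh": 67, "ch": 164},
}

# Aspirated-to-unaspirated ratio per group, rounded to one decimal.
ratio = {group: round(v["ch"] / v["zh"], 1) for group, v in mean_vot_ms.items()}

# How far learners fall short of the native mean, per phoneme (ms).
shortfall_ms = {
    phoneme: mean_vot_ms["natives"][phoneme] - mean_vot_ms["learners"][phoneme]
    for phoneme in ("zh", "ch")
}
```

The ratio is similar across groups (2.2 vs. 2.4), whereas the absolute shortfall is much larger for ch [tʂ^h] (39 ms) than for zh [tʂ] (9 ms), which is why contrast maintenance and native-range accuracy can diverge.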

Additionally, native speakers demonstrated higher overall mean VOT values for both zh [tʂ] (control group: 67 ms; participants: 58 ms) and ch [tʂ^h] (control group: 164 ms; participants: 125 ms), indicating consistent VOT underproduction by Spanish-speaking participants. This phenomenon was also observed in Zhao’s (2010) study of voiceless alveolo-palatal consonants, in which Spanish speakers similarly produced shorter VOT values for j [ʨ] (control group: 99 ms; participants: 81 ms) and q [ʨ^h] (control group: 212 ms; participants: 96 ms) compared to native Mandarin speakers. However, Dai (2015) reported the opposite pattern, with non-native groups showing VOT overproduction for zh [tʂ] across all proficiency levels and for ch [tʂ^h] in advanced learners. The reasons for this discrepancy across studies remain unclear and warrant further investigation. These findings suggest that deviation from native-like production does not follow a uniform trajectory, manifesting as either underproduction or overproduction, which aligns with Dai’s (2015) findings, at least for ch [tʂ^h].

Participants showed greater VOT variability than the native control group, both in range (zh [tʂ]: 113 ms vs. 86 ms; ch [tʂ^h]: 162 ms vs. 145 ms) and standard deviation (zh [tʂ]: 36.5 ms vs. 23.9 ms; ch [tʂ^h]: 48.3 ms vs. 30.6 ms), indicating less precise articulatory control. This pattern aligns with previous research (Dai, 2015; Zhao, 2010) demonstrating that VOT variability decreases with increasing proficiency. However, our native control group showed larger standard deviations than Dai’s (2015) native speakers, likely due to differences in sample size (n = 2 vs. Dai’s larger control group) or methodological variations in stimuli selection.

Mixed-effects analysis confirmed that participants maintain statistically significant distinctions between zh [tʂ] and ch [tʂ^h] in production (β = −64.73, p < 0.001), with a mean difference of 67 ms. The substantial between-subject variance (SD = 27.12) relative to between-item variance (SD = 8.74) indicates that individual production patterns vary considerably more than stimulus characteristics, suggesting that learner-specific factors play a major role in VOT production accuracy. These findings indicate that despite VOT underproduction, Spanish speakers are able to successfully acquire the aspirated–unaspirated contrast necessary for accurate Mandarin consonant differentiation, though their production patterns differ significantly from native speakers.

When VOT measurements were converted to binary classifications based on whether values fell within or outside the native speaker interval, the overall production error rate was 74.29%, substantially higher than the perception error rate of 16.64%, and considerably higher than that reported by Dai (2015). This considerable discrepancy reflects both the conservative methodological threshold employed, in which any deviation from native VOT ranges was classified as incorrect, and the inherent difficulty of producing native-like VOT values compared to perceptually discriminating between zh [tʂ] and ch [tʂ^h]. Production accuracy was 32.61% for zh [tʂ] and 19.15% for ch [tʂ^h], demonstrating that whilst participants could distinguish these sounds perceptually with high accuracy, producing them within native parameters proved substantially more challenging.

The binary classification revealed that ch [tʂ^h] posed greater production challenges than zh [tʂ], with error rates of 80.85% and 67.39% respectively. This difference of 13.46 percentage points is substantially larger than the 2.04 percentage-point difference observed in perception, suggesting that whilst both sounds present production difficulties, the aspirated consonant ch [tʂ^h] requires more precise articulatory control that Spanish speakers find particularly challenging to achieve within native norms.
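The two gaps compared above are differences between error rates, so they are expressed in percentage points. The Python sketch below recomputes them; the perception error rates are derived as 100 minus the perception accuracies reported for this study (zh [tʂ]: 84.43%; ch [tʂ^h]: 82.39%), and the names are illustrative.

```python
# Error rates in percent. Perception values are 100 minus reported accuracy;
# production values are the reported error rates.
perception_error = {"zh": round(100 - 84.43, 2), "ch": round(100 - 82.39, 2)}
production_error = {"zh": 67.39, "ch": 80.85}


def gap(rates: dict) -> float:
    """Difference between ch and zh error rates, in percentage points."""
    return round(rates["ch"] - rates["zh"], 2)


perception_gap = gap(perception_error)
production_gap = gap(production_error)
```

Keeping the two gaps on the same percentage-point scale makes clear that the phoneme asymmetry in production is several times larger than in perception.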

Regarding tones, production exhibited an inverse pattern to perception. Whilst perception showed Tone 1 with the highest error rate (20.26%) and Tone 4 with the lowest (14.71%), production demonstrated an ascending pattern, with Tone 1 having the lowest error rate (68.78%) and Tone 4 the highest (78.90%). This reversal suggests that the phonetic factors facilitating perceptual discrimination differ fundamentally from those enabling accurate production. High fundamental frequency (f₀), which may interfere with aspiration perception in Tone 1, appears to support more accurate VOT production, whilst lower f₀ contexts present greater production challenges.

Regarding labialisation, production showed a difference of 6.85 percentage points in error rates, with labialised syllables yielding higher error rates (78.81%) than non-labialised syllables (71.96%). This difference, whilst modest, is nearly three times greater than the 2.44 percentage-point difference observed in perception, indicating that coarticulation effects present greater challenges in production than in perception.

The distribution of production error rates across phonological contexts revealed patterns distinct from perception. Unlike perception, where the ZH2L context achieved 0% errors (though with limited tokens), no phonological context in production achieved error-free performance. Most notably, ZH2L, the most accurately perceived context, showed one of the highest production error rates at 88.89%, demonstrating that perceptual clarity does not predict production accuracy. The highest production error rate occurred in the CH3L context (95.83%), whilst the lowest occurred in ZH1NL (54.55%), suggesting that non-labialised contexts with high f₀ facilitate more accurate VOT production.

Regarding gender and nationality, production error rates showed no statistically significant differences between groups. Mixed-effects models revealed that neither gender nor nationality significantly affected production accuracy. For VOT, gender (χ²(1) = 0.01, p = 0.915) and nationality (χ²(1) = 0.55, p = 0.458) showed no significant effects. Similarly, for binary accuracy, gender (χ²(1) = 0.45, p = 0.502) and nationality (χ²(1) = 0.32, p = 0.571) were not significant. Males and females demonstrated virtually identical overall performance (74.27% vs. 74.31%), and whilst Peruvians showed numerically lower error rates than Chileans (68.68% vs. 76.50%), these differences were not statistically reliable. The small sample sizes (n = 6 males, n = 4 females; n = 3 Peruvians, n = 7 Chileans) and high within-group variability limited statistical power to detect potential effects. Further investigation with larger, balanced samples is needed to determine whether gender or dialectal factors influence production accuracy.

Regarding HSK level and years of study in production, unbalanced group sizes (HSK 3: n = 1, HSK 4: n = 4, HSK 5: n = 3, HSK 6: n = 2; ≤5 years: n = 6, 6–10 years: n = 3, ≥11 years: n = 1) precluded inferential statistical testing. Descriptively, error rates showed a generally descending trend across HSK levels (HSK 3: 99.22%, HSK 4: 72.46%, HSK 5: 75.94%, HSK 6: 62.50%), whilst years of study revealed an unexpected inverted-U pattern (≤5 years: 73.15%, 6–10 years: 84.53%, ≥11 years: 50.78%), in which the intermediate group performed worse than the least experienced group. The single HSK 3 participant’s near-complete inability to produce native-like VOT (99.22% errors) and the ≥11 years participant’s substantially lower error rate (50.78%) are particularly unreliable datapoints because each rests on a single participant. The similarity between HSK 4 and HSK 5 performance (a difference of 3.48 percentage points) may suggest a plateau effect at intermediate levels, though this remains speculative without statistical validation. Further investigation with larger, stratified samples is essential to clarify the relationship between proficiency measures and production accuracy.

4. Relationship Between Perception and Production

The comparison between perception and production performance reveals several noteworthy patterns and divergences across the two modalities.

Overall, production error rates were substantially higher than perception error rates across all metrics (74.29% vs. 16.64%). This discrepancy can be attributed to the conservative methodological approach adopted for production analysis, where any VOT value falling outside the native speaker range was classified as incorrect. Whilst both modalities showed higher error rates for ch [tʂ^h] than zh [tʂ], the magnitude of this difference varied considerably. In perception, the difference was minimal (2.04%), whereas in production it was substantial (13.46%), suggesting that producing the aspirated consonant presents greater challenges than perceiving it.

The two modalities exhibited inverse patterns regarding tonal effects on error rates. Perception showed a descending pattern (Tone 1: 20.26%, Tone 2: 16.09%, Tone 3: 14.85%, Tone 4: 14.71%), whilst production demonstrated an ascending pattern (Tone 1: 68.78%, Tone 2: 71.95%, Tone 3: 77.40%, Tone 4: 78.90%). This reversal suggests that the phonetic factors influencing perceptual accuracy differ from those affecting production accuracy. Both modalities showed higher error rates for labialised syllables compared to non-labialised syllables, indicating consistent difficulty with this phonological feature across perception and production. However, the magnitude of this effect differed between modalities (perception: 2.44% difference; production: 6.85% difference).

Unlike perception, where the ZH2L context yielded 0% errors (albeit with only 10 tokens), no phonological context in production achieved error-free performance. Notably, ZH2L, the most accurately perceived context, showed one of the highest production error rates (88.89%), demonstrating that perceptual clarity does not necessarily predict production accuracy.

The relationship between gender and performance differed markedly across modalities. In perception, males demonstrated lower error rates than females (11.98% vs. 23.63%), whilst in production, this gap virtually disappeared (74.27% vs. 74.31%). However, mixed-effects models revealed that neither difference was statistically significant. Whilst the numerical patterns suggest possible modality-specific gender differences, these remain unconfirmed and require investigation with larger, adequately powered samples.

Nationality patterns, by contrast, pointed in the same direction in both modalities. Peruvians demonstrated lower error rates than Chileans in perception (11.72% vs. 18.75%) and in production (68.68% vs. 76.50%). However, mixed-effects models revealed that neither difference was statistically significant. Both nationalities maintained the pattern of lower error rates for zh [tʂ] compared to ch [tʂ^h] across both modalities.

The relationship between HSK level and performance showed distinct descriptive patterns across modalities. Perception error rates exhibited an inverted-U pattern (HSK 3: 3.13%, HSK 4: 14.45%, HSK 5: 30.99%, HSK 6: 6.25%), whilst production error rates showed a generally descending trend (HSK 3: 99.22%, HSK 4: 72.46%, HSK 5: 75.94%, HSK 6: 62.50%). Years of study revealed different patterns as well, with perception showing steady improvement (≤5 years: 20.57%, 6–10 years: 14.32%, ≥11 years: 0%) and production showing an inverted-U pattern (≤5 years: 73.15%, 6–10 years: 84.53%, ≥11 years: 50.78%). However, unbalanced group sizes (HSK 3: n = 1, ≥11 years: n = 1) precluded inferential statistical testing for both variables in both modalities. The extreme values for single-participant groups are particularly unreliable, making it impossible to draw firm conclusions about the relationship between proficiency measures and performance.

Whilst these descriptive patterns tentatively suggest that HSK level and years of study may relate differently to perception versus production abilities, these observations remain speculative without statistical validation. Further investigation with larger, stratified samples is essential to clarify these potential relationships.

5. General Discussion and Conclusions

After analysing all the data obtained and comparing them to the available literature, several significant findings emerged that contribute to our understanding of Mandarin phonetic acquisition by native Spanish speakers. First and foremost, we propose that perception demonstrates a PAM-L2 Category Goodness (CG) pattern, while production exemplifies an SLM-r Composite L1–L2 Category pattern.

For perception, this CG classification stems from the fact that Spanish possesses only one sound ([tʃ]) that is articulatorily similar to the Mandarin sounds under investigation. This phonological reality constrains the possible assimilation patterns to either Single-Category (SC) or Category Goodness (CG), both patterns in which two L2 sounds map onto a single L1 category. However, the SC pattern can be ruled out because it would predict poor discrimination between the two L2 sounds, which contradicts our findings. The perception experiment yielded an overall accuracy of 83.36%, indicating that while both Mandarin zh [tʂ] and ch [tʂ^h] map to Spanish [tʃ], participants perceive a qualitative difference between them. This moderately high discrimination accuracy is characteristic of CG assimilation, where the two L2 sounds are perceived as better and poorer exemplars of the same L1 category.

For production, the data support a Composite L1–L2 Category pattern. Participants clearly distinguished between zh [tʂ] and ch [tʂ^h] in their productions, maintaining a substantial VOT difference (58 ms vs. 125 ms) that approached the native ratio (2.2× vs. 2.4×). However, when VOT measurements were converted to binary classifications based on native speaker intervals, the overall accuracy was only 25.71%. This pattern (successful maintenance of the aspiration contrast combined with consistent deviation from native norms) is precisely what the composite category predicts: learners have failed to establish new phonetic categories for zh [tʂ] and ch [tʂ^h], instead developing a Composite L1–L2 Category based on phonetic input from both Spanish and Mandarin. This composite category preserves the L2 distinction but remains fundamentally influenced by L1 articulatory patterns, preventing native-like VOT precision. The proposed assimilation patterns are illustrated in Figure 18 below.

Contrary to Zhao’s (2010) hypothesis, aspiration did not pose a major obstacle in either modality. Spanish speakers distinguished zh [tʂ] from ch [tʂ^h] with high perceptual accuracy and produced distinct VOT patterns (zh [tʂ] = 58 ms, ch [tʂ^h] = 125 ms), maintaining a clear aspiration contrast despite consistent underproduction compared to native speakers (zh [tʂ] = 67 ms, ch [tʂ^h] = 164 ms). Voicing errors, including pre-voicing, voiced occlusion, and voiced release, occurred in less than 2% of productions. These findings suggest that whilst Spanish dialects’ aspiration phenomena (Rogers, 2020; Núñez-Méndez, 2022) may facilitate aspiration perception and production, the primary challenge lies in achieving native-like VOT precision, manifesting as either underproduction or overproduction as proposed by Dai (2015), rather than establishing the basic aspiration contrast.

The comparative analysis between perception and production revealed several unexpected patterns that illuminate the independent development of these modalities in L2 phonetic acquisition. The two modalities exhibited different tonal patterns: production showed a significant tonal effect, with error rates ascending from Tone 1 to Tone 4 (binary accuracy: χ²(3) = 11.74, p = 0.008; VOT: χ²(3) = 64.04, p < 0.001), whilst perception showed a descriptive descending pattern that was not statistically significant (p = 0.160). This divergence suggests that high fundamental frequency (f₀) may influence perception and production differently, though the perceptual pattern requires confirmation with larger samples.

Whilst distinct patterns emerged across gender, nationality, HSK level, and years of study in both perception and production, none of these differences reached statistical significance. The small sample size (n = 10), unbalanced group distributions, and particularly the single-participant groups (HSK 3: n = 1, ≥11 years: n = 1) precluded reliable inferential testing. These descriptive patterns, though potentially suggestive of underlying trends, remain speculative and require investigation with larger, adequately powered samples to determine their validity.

Methodologically, this study addressed the challenge of comparing perception and production through binary classification of VOT measurements based on native speaker intervals. Whilst this conservative approach resulted in high error rates (74.29%), it enabled direct comparison with perception data and revealed patterns that would be obscured by continuous VOT analysis alone. However, this approach also represents a significant limitation, as any deviation from the relatively narrow native range was classified as incorrect, potentially overestimating production difficulties. Additionally, the native speaker range was established using only two speakers, which may not adequately represent natural variation in native Mandarin production.

Additional limitations include the study’s examination of only VOT as a production measure, neglecting other potentially relevant acoustic parameters such as burst intensity, spectral characteristics, or articulatory placement that might reveal additional production challenges. Furthermore, the laboratory reading task, whilst allowing acoustic precision, may not reflect naturalistic production contexts where cognitive load and communicative demands differ substantially.

Future research should address these limitations through several avenues: larger-scale studies with balanced groups to enable robust statistical testing; longitudinal designs to clarify developmental trajectories; expanded acoustic analysis beyond VOT; investigation of other Mandarin affricate contrasts (alveolo-palatal, dental) to test generalisability; and more naturalistic production tasks to assess ecological validity.

In conclusion, this study demonstrates that Spanish native speakers face moderate challenges when acquiring Mandarin zh [tʂ] and ch [tʂ^h] sounds, with perception and production developing along independent trajectories. Perception shows Category Goodness (CG) assimilation, whereby both Mandarin sounds map to Spanish [tʃ] but with different goodness-of-fit, yielding 83.36% accuracy. Production reflects an SLM-r Composite L1–L2 Category: learners maintain the aspiration contrast (zh [tʂ] = 58 ms, ch [tʂ^h] = 125 ms) but fail to establish new phonetic categories, resulting in 25.71% accuracy within native norms. Whilst descriptive patterns emerged across gender, nationality, and proficiency measures in both modalities, only production showed significant tonal effects (VOT: p < 0.001; binary accuracy: p = 0.008); the remaining factors either did not reach significance or could not be tested with the limited sample (n = 10). These findings contribute to theoretical understanding of cross-linguistic phonetic transfer and modality-specific L2 development, whilst providing practical insights for Mandarin instruction to Spanish-speaking populations.

Author Contributions

Conceptualization, G.G.M.R. and Q.Z.; methodology, G.G.M.R. and Q.Z.; formal analysis, G.G.M.R.; investigation, G.G.M.R.; resources, G.G.M.R.; data curation, G.G.M.R.; writing—original draft preparation, G.G.M.R.; writing—review and editing, G.G.M.R. and Q.Z.; visualisation, G.G.M.R.; supervision, Q.Z.; project administration, G.G.M.R. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted under departmental supervision at Nanjing University. As the university does not currently have an Institutional Review Board for this type of research, formal IRB approval was not obtained.

Informed Consent Statement

All participants were fully informed about the study and agreed to participate voluntarily. Consent was obtained orally, as no written form was provided.

Data Availability Statement

All the data analysed can be accessed at 10.5281/zenodo.17719850 (accessed on 3 December 2025).

Acknowledgments

During the preparation of this study, the authors used Claude Sonnet 4.5 for grammar correction and punctuation, and for coding in R for data visualisation. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Reading sheet for production task.

References

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In W. Strange (Ed.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). John Benjamins.
Campos-Astorkiza, R. (2012). The phonemes of Spanish. In J. I. Hualde, A. Olarrea, & E. O’Rourke (Eds.), The handbook of Hispanic linguistics (pp. 89–110). Blackwell Publishing.
Chen, Y., & Gussenhoven, C. (2015). Shanghai Chinese. Journal of the International Phonetic Association, 45(3), 321–337.
Dai, J. (2015). 西班牙语国家学生汉语塞擦音和擦音习得研究 [A study of the acquisition of Mandarin affricates and fricatives by Spanish-speaking learners] [Master’s thesis, Nanjing University].
De Haro, A. H., & Hajek, J. (2011). Eastern Andalusian Spanish. Journal of the International Phonetic Association, 41(2), 135–156.
Fandiño, J. A. R. (2019). La enseñanza y percepción de las consonantes retroflejas del chino mandarín en hablantes de español como L1 [Master’s thesis, Universidad de los Andes].
Flege, J. E. (1991). Age of learning affects the authenticity of voice-onset time (VOT) in stop consonants produced in a second language. Journal of the Acoustical Society of America, 89(1), 395–411.
Flege, J. E., & Bohn, O.-S. (2021). The revised Speech Learning Model (SLM-r). In R. Wayland (Ed.), Second language speech learning: Theoretical and empirical progress (pp. 3–83). Cambridge University Press.
Ladefoged, P., & Wu, Z. (1984). Places of articulation: An investigation of Pekingese fricatives and affricates. Journal of Phonetics, 12(3), 267–278.
Lee, W.-S., & Zee, E. (2003). Standard Chinese (Beijing). Journal of the International Phonetic Association, 33(1), 109–112.
Núñez-Méndez, E. (2022). Variation in Spanish /s/: Overview and new perspectives. Languages, 7(2), 77.
Rogers, B. M. A. (2020). The state of Spanish /s/ variation in Concepción, Chile: Linguistic and social trends. Open Linguistics, 6(1), 530–550.
Terrell, T. D. (1977). Constraints on the aspiration and deletion of final /s/ in Cuban and Puerto Rican Spanish. Bilingual Review/La Revista Bilingüe, 4(1/2), 35–51.
Wang, X., & Chen, J. (2020). The acquisition of Mandarin consonants by English learners: The relationship between perception and production. Languages, 5(2), 20.
Wu, F. (2023). An experimental research into the identification of Chinese affricates as the second language. Journal of Yibin University, 23(7), 54–62.
Zhao, X. (2010, May 28). 西班牙语母语学习者对汉语普通话 [ʨ]、[ʨ^h] 的感知与产生 [Spanish native speakers’ perception and production of Mandarin [ʨ] and [ʨ^h]]. 9th Chinese Phonetics Academic Conference, Tianjin, China.
Zhu, X. (2013). 语音学 [Phonetics]. Commercial Press.

Figure 1. Syllable inventory.

Figure 2. Error rates in perception by initials.

Figure 3. Error rates in perception for zh [tʂ] by phonological context.

Figure 4. Error rates in perception for ch [tʂ^h] by phonological context.

Figure 5. Error rates in perception by gender.

Figure 6. Error rates in perception by nationality.

Figure 7. Error rates in perception by HSK level.

Figure 8. Error rates in perception by years of study.

Figure 9. Example of VOT measurement.

Figure 10. VOT density distributions by phoneme and group.

Figure 11. Error rates in production by initials.

Figure 12. Error rates in production for zh [tʂ] by phonological context.

Figure 13. Error rates in production for ch [tʂ^h] by phonological context.

Figure 14. Error rates in production by gender.

Figure 15. Error rates in production by nationality.

Figure 16. Error rates in production by HSK level.

Figure 17. Error rates in production by years of study.

Figure 18. Proposed perceptual and production assimilation patterns for Mandarin retroflex affricates by Spanish speakers.

