2. Experiment 1: Perception of Mandarin Consonants
2.1. Participants
The participants were 25 native English speaking (15 male, 10 female, mean age = 19.6) beginning level CFL learners enrolled in a first semester Chinese course in a public university in the U.S. All participants reported speaking English as their native language. Twenty of them were born and raised in the United States and five were born in foreign countries but moved to the U.S. between the ages of 2 and 5. Some participants reported speaking another language along with English as their first languages. They were four English/Spanish, four English/Hmong, two English/Tagalog, and one English/Vietnamese early bilinguals. At the point of data collection, the participants were about three months into the 16-week semester and all had learned and practiced Chinese consonants by then.
2.2. Material
The perceptual identification test initially included 10 Mandarin consonants in syllables (z /tsa/, c /tsʰa/, s /sa/, j /tɕʲa/, q /tɕʰʲa/, x /ɕʲa/, zh /tʂa/, ch /tʂʰa/, sh /ʂa/, and r /ʐa/) that were produced by two native Mandarin speakers, one male and one female, through a reading task. The target words were produced in the carrier sentence wo shuo ___ zi (我说 ___ 字, “I say ___ word”). The recordings were made on a MacBook Pro computer using the Praat software. The target syllables were separated from the sentences using waveform editing, normalized for peak volume, and saved as WAV files for presentation. Eight of the 10 sounds (z /ts/, c /tsʰ/, j /tɕ/, q /tɕʰ/, x /ɕ/, zh /tʂ/, ch /tʂʰ/, sh /ʂ/) were analyzed for this perception experiment to pair exactly with the eight target consonants in the production test.
2.3. Procedure
All participants gave their informed consent for inclusion before they participated in this study, which was approved by the ethics committee of the researchers’ university. Individual perception identification tasks were carried out in a sound booth on a MacBook Pro computer using the Praat ExperimentMFC identification test design. A total of 60 stimuli (10 sounds × 2 speakers × 3 repetitions) were randomized and presented in a 10-way forced choice task. The labels for the choices were the 10 consonants in pinyin (the official Romanized transcription of Mandarin Chinese), displayed on the computer screen during the test. The listeners were instructed to listen carefully for the initial consonant in each stimulus and to identify the sound they heard by clicking on the corresponding consonant on the screen. During the test, they could replay each stimulus up to twice in case of uncertainty. To familiarize the learners with the task, each participant completed a trial session before the real test began, using stimuli not included in the analyses. The software automatically recorded the test data to be exported for analysis.
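The stimulus count described above can be illustrated with a short sketch. This is not the Praat ExperimentMFC script used in the study; it only shows, under assumed names (`build_trial_list`, the speaker labels, and the `seed` parameter are hypothetical), how 10 sounds, 2 speakers, and 3 repetitions combine into 60 randomized trials.

```python
import itertools
import random

# Pinyin labels for the 10 initial consonants listed in the Material section.
CONSONANTS = ["z", "c", "s", "j", "q", "x", "zh", "ch", "sh", "r"]
SPEAKERS = ["male", "female"]  # two native Mandarin speakers (labels are illustrative)
REPETITIONS = 3


def build_trial_list(seed=None):
    """Return a shuffled list of (consonant, speaker, repetition) trials."""
    trials = list(
        itertools.product(CONSONANTS, SPEAKERS, range(1, REPETITIONS + 1))
    )
    rng = random.Random(seed)  # seed is hypothetical, used only for reproducibility
    rng.shuffle(trials)
    return trials


trials = build_trial_list(seed=1)
print(len(trials))  # 60 = 10 sounds x 2 speakers x 3 repetitions
```

Each trial appears exactly once, so every consonant is presented six times per listener.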
2.4. Results
Individual participants’ correct identifications of each target consonant were converted to percentage accuracy scores. Their misidentifications of the target sounds were also converted to percentage error rates and tallied for substitution patterns. The group mean percentage correct identification scores of the eight consonants ranged from 29% (zh /tʂ/) to 80% (ch /tʂʰ/). To investigate the learners’ perceptual substitution patterns for the misidentified consonants, a confusion matrix was created and is presented along with the percentage correct identification scores in Table 2.
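As an illustration of how percentage scores and a confusion matrix of this kind can be tallied from raw responses, the sketch below converts (target, response) pairs into row percentages. The function name and the toy responses are hypothetical and are not data from the study.

```python
from collections import Counter, defaultdict


def confusion_percentages(responses):
    """Tally (target, response) pairs into row percentages.

    Returns {target: {response: percent of that target's trials}}.
    The diagonal entry (response == target) is the percentage correct.
    """
    counts = defaultdict(Counter)
    for target, response in responses:
        counts[target][response] += 1

    matrix = {}
    for target, row in counts.items():
        total = sum(row.values())
        matrix[target] = {resp: 100.0 * n / total for resp, n in row.items()}
    return matrix


# Hypothetical toy responses, for illustration only.
toy = [
    ("zh", "zh"), ("zh", "j"), ("zh", "j"), ("zh", "ch"),
    ("ch", "ch"), ("ch", "ch"), ("ch", "ch"), ("ch", "q"),
]
matrix = confusion_percentages(toy)
print(matrix["zh"])
```

Off-diagonal cells then show which response most often replaced each target.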
A one-way repeated measures ANOVA was conducted to compare the perception scores across the consonants (8 levels). There was a significant effect of consonant (Wilks’ Lambda = 0.102, F(7, 25) = 22.526, p < 0.001). Subsequent post hoc Bonferroni tests, adjusted for multiple comparisons, revealed that a series of pairwise comparisons were significant. The results are presented in Table 3.
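To make the multiple-comparison adjustment concrete: eight consonants yield 28 possible pairwise comparisons, and a Bonferroni adjustment multiplies each raw p-value by the number of comparisons (capped at 1). The sketch below is a generic illustration of that adjustment, not the statistical software output from the study; the function name is hypothetical.

```python
from itertools import combinations

CONSONANTS = ["z", "c", "j", "q", "x", "zh", "ch", "sh"]


def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1.0.

    p_values: dict mapping a comparison label to its unadjusted p-value.
    """
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


pairs = list(combinations(CONSONANTS, 2))
print(len(pairs))  # 28 pairwise comparisons among 8 consonants
```

Equivalently, an unadjusted p-value must fall below 0.05 / 28 to be significant at the 0.05 level after adjustment.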
2.5. Discussion
The perception test results showed that Mandarin zh /tʂ/ (29%), q /tɕʰ/ (31%), and x /ɕ/ (46%) were the sounds most poorly identified by the native English CFL learners. The pairwise comparisons confirmed that the perception scores of zh /tʂ/ (29%), q /tɕʰ/ (31%), and x /ɕ/ (46%) were significantly different from those of the other five sounds but not from each other (see Table 3). The findings were similar to those of Wang and Chen (2019), in which the same three sounds were also among the four most difficult consonants for beginning level learners. These three sounds were also among the categories that fit the learners’ L1 English sounds most poorly, as established by native English listeners in the cross-linguistic identification test in Wang and Chen (2019). The current findings support the Wang and Chen (2019) finding that phonetic distances and differences between L1 and L2 consonants predict native English CFL learners’ perception problems with Mandarin consonants at the initial stage of learning.
An inspection of the confusion matrix led to the observation that misidentified retroflex and palatal sounds were mostly confused with each other. For example, the retroflex zh /tʂ/ was heard as the palatal j /tɕ/ 32% of the time, which exceeded the 29% correct identification rate for zh /tʂ/. Similarly, but to a lesser degree, the most frequent misidentification of sh /ʂ/ was as x /ɕ/ (15%). Misidentified palatal sounds were likewise heard mostly as retroflex sounds. The palatal sounds q /tɕʰ/ and x /ɕ/ were misidentified as the retroflex sounds ch /tʂʰ/ and sh /ʂ/ 43% and 31% of the time, respectively. Therefore, the retroflex and palatal fricatives and affricates are both difficult for English CFL learners to identify.
The dental affricates z /ts/ and c /tsʰ/ were also poorly perceived by the learners. The unaspirated dental affricate z /ts/ was most often misheard as /s/ (20% of the time). The aspirated dental affricate c /tsʰ/ was misidentified as ch /tʂʰ/ and as /s/ 11% of the time each.
3. Experiment 2: Production of Mandarin Consonants
3.1. Participants
The participants were the same 25 beginning level CFL learners who took the perception test in Experiment 1. They provided the production data through a reading task that took place immediately before the perception tests.
3.2. Material and Procedure
The reading list consisted of 20 target words in pinyin embedded in the carrier sentence wo shuo ___ zi (我说 ___ 字, “I say ___ word”). Each of the 20 target words was repeated once, yielding two tokens of each target sound. The participants were given a few minutes to prepare for the reading task. Any questions about the pronunciation of the sounds were answered during the preparation time. The participants were told to read the list at normal speed. The recordings were then made on a MacBook Pro computer using the Praat software. The words containing the eight target consonants (z /tsa/, c /tsʰa/, j /tɕʲa/, q /tɕʰʲa/, x /ɕʲa/, zh /tʂa/, ch /tʂʰa/, and sh /ʂa/) were separated from the sentences using waveform editing, normalized for peak volume, and saved as WAV files for presentation.
3.3. Assessment
Three phonetically trained native Mandarin speakers, all of whom have taught Mandarin Chinese courses in North America, assessed the participants’ productions in an eight-way forced choice identification task followed by a goodness rating task on a scale of 1 (poor) to 7 (good). The participants’ productions of the eight target sounds were blocked in groups of five speakers, yielding 80 tokens in each session (8 words × 5 speakers × 2 repetitions). The eight-way forced choice task and the subsequent rating task were created using the Praat ExperimentMFC identification test design. Individual identification tasks were carried out on a MacBook Pro computer in a quiet room. The native Mandarin speakers listened to each Mandarin stimulus and identified the initial consonant by clicking on the corresponding pinyin label on the computer screen. Immediately after identifying each sound, the listeners rated the fitness of the sound they had identified by choosing a number on the scale of 1 (poor) to 7 (good). In cases of uncertainty, the listener could replay the stimulus up to three times before making a choice. In addition to the eight target sounds in pinyin, the /s/, /t/, and /k/ sounds were included among the identification labels. Based on a screening test by the first author, these three sounds gave the listeners additional options for the mis-produced target sounds. The listeners all had a trial session to learn the test procedure before the real judgement test began. Each then completed 6 sessions with mandatory breaks between sessions. The data of one session were excluded from analysis because those five participants were not native English speakers.
3.4. Results
To assess interrater variability, a reliability test was carried out, and a high degree of agreement was found among the three raters. The average measures intraclass correlation was 0.821, with a 95% confidence interval from 0.788 to 0.849 (F(399, 798) = 5.605, p < 0.001). Therefore, the mean group production score for each consonant was calculated by averaging the three listeners’ identification scores. To further explore the substitution patterns for each mis-produced consonant, a confusion matrix was created. Table 4 summarizes the mean percentage correct production scores (in bold) and the confusion matrix of the eight mis-produced target consonants. The mean goodness rating score of each target consonant was also calculated and is presented (in italics) next to the mean correct identification score in Table 4.
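The reported degrees of freedom are consistent with 400 rated tokens (8 words × 25 speakers × 2 repetitions) and 3 raters: 400 − 1 = 399 and 399 × 2 = 798. The sketch below shows one common form of the average measures intraclass correlation, the two-way consistency form (MS_rows − MS_error) / MS_rows, which equals 1 − 1/F. Which ICC variant the study used is not stated, so this choice is an assumption; under it, 1 − 1/5.605 ≈ 0.82, in line with the reported value. The function name and toy ratings are hypothetical.

```python
def icc_consistency_average(ratings):
    """Average-measures consistency ICC from a tokens-by-raters table.

    ratings: list of rows, one per rated token, each a list of k rater scores.
    Assumed form (not confirmed by the source): (MS_rows - MS_error) / MS_rows,
    from a two-way ANOVA without replication.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Hypothetical toy ratings: 3 tokens scored by 2 raters.
print(icc_consistency_average([[1, 2], [2, 1], [3, 3]]))
```

In this form, raters who differ only by a constant offset still yield an ICC of 1, because systematic rater differences are removed before the error term is computed.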
Overall, the percentage correct production scores of the eight target sounds, as identified by the three native listeners, ranged from 25% (zh /tʂ/ and c /tsʰ/) to 88% (z /ts/). A one-way ANOVA on the percentage identification scores revealed a significant effect of consonant (Wilks’ Lambda = 0.288, F(7, 143) = 50.486, p < 0.001). Subsequent post hoc Bonferroni tests, adjusted for multiple comparisons, revealed that a series of pairwise comparisons were significant. The results are presented in Table 5.
In addition to the percentage correct identification scores, the assessment of the learners’ production performance also included the rating scores (on a scale of 1 to 7) for each correctly identified consonant. The mean rating scores of the correctly identified sounds ranged from 4.1 to 5.5 out of 7. The goodness rating task allowed the listeners to indicate how well the learner’s production fit the native norm of the target sound, even when the intended sound was correctly identified. Therefore, taking the rating scores into consideration, the “adjusted” production score for each consonant was calculated by multiplying the percentage correct identification score by the rating score. The adjusted production scores for the eight target consonants are summarized in Table 6.
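The adjusted score can be written out as a one-line calculation. The sketch below assumes the percentage is treated as a proportion before multiplying, which is consistent with adjusted scores near 1 for sounds identified 25% of the time; the function name and the example rating of 4.0 are illustrative, not values taken from Table 6.

```python
def adjusted_score(percent_correct, mean_rating):
    """Adjusted production score: proportion correct times mean goodness rating.

    percent_correct: 0-100; mean_rating: on the 1-7 goodness scale.
    Treating the percentage as a proportion is an assumption.
    """
    return (percent_correct / 100.0) * mean_rating


# Illustrative values only: 25% correct with a hypothetical mean rating of 4.0.
print(adjusted_score(25, 4.0))  # 1.0
```

On this scaling, the maximum possible adjusted score is 7 (100% correct with a mean rating of 7).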
3.5. Discussion
The two most difficult consonants for the learners to produce were c /tsʰ/ and zh /tʂ/, each with a production score of only 25%, which was significantly different from the scores of all the other sounds under investigation. Mandarin c /tsʰ/ and zh /tʂ/ also received the lowest adjusted production scores (1.0 for c /tsʰ/ and 1.2 for zh /tʂ/) when the rating scores were taken into consideration. On the other hand, the adjusted production scores for z /ts/ and j /tɕ/ were the highest among the eight sounds. The mean rating score of the eight target consonants was 5.0, with a range from 4.1 to 5.6. The difference between the best and worst rated sounds was 1.5. These rating scores suggest that the listeners did not use the full range of the rating scale, especially at the lower end, once a target sound was correctly identified. While the range of the rating scores was relatively small compared to the widely different percentage correct identification scores, the three poorly produced sounds with the lowest percentage correct identification scores, c /tsʰ/ (25%), zh /tʂ/ (25%), and q /tɕʰ/ (35%), also received rating scores below 5. The other five sounds all had rating scores of 5 or above.
The substitution patterns in production were similar to those in perception for the retroflex sounds. The mis-produced retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/ were overwhelmingly heard as the palatal sounds j /tɕ/ (37%), q /tɕʰ/ (35%), and x /ɕ/ (22%) by the native Mandarin listeners. However, the mis-produced palatal sounds q /tɕʰ/ and x /ɕ/ were mostly heard by the native Mandarin listeners as the unaspirated palatal affricate j /tɕ/. For the aspirated dental affricate c /tsʰ/, 23% of the productions were heard as the /k/ sound. This unexpected substitution pattern was most likely caused by the pinyin spelling “ca” being read according to English orthography. These speakers apparently had not learned the c /tsʰ/ sound or, at least, had not associated the pinyin c with the Mandarin sound /tsʰ/.
5. General Discussion and Conclusions
To answer research question 1, which asked which Mandarin consonants pose difficulties for native English CFL learners, the results of Experiment 1 showed that the learners had different degrees of difficulty in identifying the eight Mandarin consonants under investigation. The most difficult sounds were zh /tʂ/ (29%), q /tɕʰ/ (31%), and x /ɕ/ (46%). The findings were consistent with an earlier study by Wang and Chen (2019), in which zh /tʂ/, q /tɕʰ/, and x /ɕ/ were also among the four most difficult categories identified by the low level CFL learners. Similarly, Yang and Yu (2019) found that, among the six Mandarin affricates they investigated, zh /tʂ/ and q /tɕʰ/ were more difficult than their counterparts ch /tʂʰ/ and j /tɕ/ for the native English CFL learners.
The confusion matrix of misidentified sounds showed that the English CFL learners mostly substituted the retroflex and palatal sounds for each other, and such confusion patterns suggest that the learners have not established separate categories for these sounds. The English CFL learners’ problems with the retroflex and palatal sounds, also reported in Hao (2012) and Yang and Yu (2019), are closely related to the phonetic differences between their L1 and L2 sound systems. Wang and Chen (2019) found that native English listeners mapped both Mandarin retroflex affricates zh /tʂ/ and ch /tʂʰ/ and the aspirated palatal affricate q /tɕʰ/ onto the English /ʧ/ sound, though the degree of “fitness” differed: ch /tʂʰ/ was identified as a much better fit to /ʧ/ (4.4) than zh /tʂ/ (3.3) and q /tɕʰ/ (2.3), as indicated by their “fit indexes” (% identification × goodness rating score). The three-to-one perceptual assimilation pattern, to a large extent, underlies the native English CFL learners’ perception problems with zh /tʂ/, ch /tʂʰ/, and q /tɕʰ/. The better fitting category ch /tʂʰ/ was identified with an accuracy score of 80%, as compared with 31% for q /tɕʰ/ and 29% for zh /tʂ/ in the current study. The findings support the Category Goodness (CG) type of assimilation of the Perceptual Assimilation Model (PAM), in which two sounds are assimilated to a single native category, with a better fit for one than the other (Best et al. 2001). The current findings suggest that the two-to-one CG type of assimilation can be expanded to three-to-one assimilation. More such three-to-one and two-to-one mappings, as well as one-to-two mappings, of Mandarin consonants onto English categories found in the Wang and Chen (2019) study are presented in Figure 2. These cross-linguistic assimilation patterns shed light on the difficulties that the native English CFL learners demonstrated in their perception and production of the eight target consonants in the current study. (See the original study for a detailed analysis of the assimilation patterns.)
L2 to L1 assimilation patterns can also explain the English CFL learners’ difficulties with the Mandarin palatal sound x /ɕ/. Both x /ɕ/ and sh /ʂ/ were mapped onto English /ʃ/, but sh /ʂ/ was a better fit than x /ɕ/, a CG type of assimilation. Hao (2012) reported the same CG assimilation pattern for x /ɕ/ and sh /ʂ/ and similar learning results on x /ɕ/ in her study. Mandarin x /ɕ/ was also difficult for the English learners because it was assimilated to both English /z/ and /ʃ/, a one-to-two “split” match to the native categories, that is, a “revised” Single Category (SC) type of assimilation (see Figure 2).
The Mandarin dental affricates z /ts/ and c /tsʰ/ were also poorly perceived by the learners in the current study. As seen in Figure 2, the aspirated affricate /tsʰ/ was assimilated to both English /s/ and /t/, also a “revised” Single Category (SC) type of assimilation. The unaspirated dental affricate z /ts/, together with s /s/ and c /tsʰ/, was mapped onto English /s/, causing difficulties for the poorly fitting categories z /ts/ and c /tsʰ/. While both z /ts/ and c /tsʰ/ exist in English word-final position, as in “reads” and “boots”, these novice learners did not appear to have made the association when identifying the target sounds. One explanation may be that Mandarin z /ts/ and c /tsʰ/ are stand-alone phonemes and are more prominent in word-initial position than the morphological word endings /dz/ and /ts/ in English.
Overall, to answer research question 2, the current data suggest that the perceived phonetic differences and distances between Mandarin and English consonants predicted the learners’ perceptual difficulties with the L2 Mandarin consonants. The perception data also support the PAM model.
Flege’s Speech Learning Model (SLM) may also provide explanations for the current findings. The learners’ phonetic spaces for L1 and L2 consonants need to be reorganized to establish new phonetic categories for the Mandarin retroflex, palatal and dental sounds. For example, learners need to distinguish the differences between c /tsʰ/, x /ɕ/, z /ts/, q /tɕʰ/ and others in order to establish these categories. On the other hand, “equivalence classification” of the SLM may be at work for ch /tʂʰ/ to be identified as English /ʧ/. While ch /tʂʰ/ (80%) was the best identified category among the eight target sounds by the learners, its production score (47%) was much lower. Therefore, even if some of the L2 categories seemed to have been established by the majority of the listeners, “equivalence classifications” may have prevented them from forming the native-like perception category.
The results of Experiment 2 showed that the percentage correct production scores of the eight target sounds ranged from 25% (zh /tʂ/ and c /tsʰ/) to 88% (z /ts/). The pairwise comparison data shown in Table 4 indicated that the native English CFL learners had the most production difficulties with Mandarin c /tsʰ/ and zh /tʂ/, followed by q /tɕʰ/. The pattern of substitutions in production was similar to that in perception for the retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/, which were substituted with the palatals j /tɕ/, q /tɕʰ/, and x /ɕ/. However, the mis-produced palatal sounds q /tɕʰ/ and x /ɕ/ were not confused with the retroflex sounds but were mostly heard by the native Mandarin listeners as the unaspirated palatal affricate j /tɕ/. These substitution patterns suggest that the retroflex sounds were more difficult for the English CFL learners to produce.
Comparing the results of the two experiments, the learners tended to perform better on the retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/ in perception than in production, but the reverse held for the palatal sounds j /tɕ/, q /tɕʰ/, and x /ɕ/. The results for the dental sounds z /ts/ and c /tsʰ/ were mixed across the two domains. Mandarin retroflex and palatal fricatives and affricates, though both lack counterparts in English, pose different problems for the English CFL learners in perception and production. The correlation tests comparing the perception and production scores showed that the two sets of scores were moderately correlated for only two of the eight target consonants, x /ɕ/ and c /tsʰ/. The lack of correlation between the learners’ perception and production scores for the majority of the sounds under investigation suggests that the relationship between L2 speech perception and production is not straightforward.
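The type of correlation test is not specified here. As a hedged illustration, the sketch below computes a Pearson correlation coefficient between per-learner perception and production scores for a single consonant. Pearson's r is an assumption, and the function name and example scores are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-learner percentage scores for one consonant.
perception = [17, 33, 50, 67, 83]
production = [0, 50, 50, 100, 50]
print(round(pearson_r(perception, production), 2))
```

Running one such test per consonant gives eight coefficients, one for each target sound.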
One possible explanation for such misalignment in L2 speech perception and production might be that perception does not always lead production. For example, different mechanisms or strategies may be involved in the perception and production of the retroflex sound ch /tʂʰ/ by the beginning level English CFL learners. The learners’ better perception of ch /tʂʰ/ (80%) may be explained by the close match between the L2 Mandarin ch /tʂʰ/ and their L1 English /ʧ/ sound. In perception identification tasks, anything that is close enough to English /ʧ/ can be labeled as ch /tʂʰ/. However, the same strategy would not work in production if the learners have not established a native-like retroflex ch /tʂʰ/ category. The key phonetic gestures for producing the retroflex affricate in Mandarin ch /tʂʰ/ cannot be effectively replaced by the gestures of English /ʧ/. The intended ch /tʂʰ/ would then be heard by the native Mandarin listeners not as the target ch /tʂʰ/ but as q /tɕʰ/. The same pattern seems to hold for the other retroflex sounds, zh /tʂ/ and sh /ʂ/, which were identified by the native Mandarin listeners as the palatal sounds j /tɕ/ and x /ɕ/. These substitution patterns suggest that the cues for the retroflex sounds were absent in these non-native productions. Therefore, it is very likely that the misalignment between perception and production is partly due to the different mechanisms the learners relied on in the perception and production of the target consonants. Acoustic analyses of the learners’ productions of these sounds, along with perceptual tests using synthesized stimuli that manipulate the key acoustic cues differentiating the target categories, are needed in future studies to draw a firm conclusion.
The current data also show that partial alignment between perception and production of Mandarin consonants does exist. The listeners’ perception and production scores for both x /ɕ/ and c /tsʰ/ were moderately but significantly correlated, indicating a link between perception and production. Past studies have reached the same conclusion of partial alignment in the perception and production of L2 sounds (Flege 1999; Rochet 1995).
L2 phonetic training studies have also examined the relationship between perception and production when assessing the outcomes of training. There is evidence that perceptual training alone led to improvement in both the perception and production of L2 consonants (Bradlow et al. 1997) and lexical tones (Wang 2008, 2012, 2013; Wang et al. 2003), and that production gains are larger on obstruents than on sonorants and vowels (Sakai and Moorman 2018). There is also evidence that trainees’ perceptual learning did not lead to better production of L2 vowel contrasts (Wang 2002). These findings suggest that the relationship between perception and production of L2 speech sounds can be further complicated by differences among sound classes.
In conclusion, the current data showed partial alignment but more discrepancies in native English CFL learners’ perception and production of the eight Mandarin consonants. Different phonetic mechanisms and strategies may be involved in L2 speech sound perception and production. Perception may not always lead production. Future studies need to carry out more detailed acoustic analyses of native and non-native productions of the target consonants to investigate the specific problems that learners have in the perception and production of Mandarin consonants.
Finally, one limitation of the current study was the exclusion of the s /s/ and r /ʐ/ sounds from the analyses. It would have been better to include at least the s /s/ sound to form a complete set of dental fricative and affricates, parallel to the retroflex and palatal sets.