The Acquisition of Mandarin Consonants by English Learners: The Relationship between Perception and Production

Abstract: This study investigates native English CFL (Chinese as a Foreign Language) learners’ difficulties with Mandarin consonants at the initial stage of learning and explores the relationship between second language (L2) speech perception and production. Twenty-five native English CFL learners read the eight Mandarin consonants (j /tɕ/, q /tɕʰ/, x /ɕ/, zh /tʂ/, ch /tʂʰ/, sh /ʂ/, z /ts/, and c /tsʰ/) in sentences and identified the target sounds in a forced-choice identification task. Native Mandarin listeners identified the consonants produced by the learners and rated the quality of each sound they identified along a scale of 1 (poor) to 7 (good). The learners’ mean percentage accuracy scores ranged from 29% to 80% for perception and 25% to 88% for production. Moderate correlations between the perception and production scores were found for two of the eight target sounds. The Mandarin retroflex, palatal, and dental fricatives and affricates, though all lack counterparts in English, pose different problems to the English CFL learners. The misperceived retroflex and palatal sounds were substituted with each other in perception but mis-produced palatal sounds were substituted with each other, not with retroflex sounds. The relationship between perception and production of L2 consonants is not straightforward. The findings are discussed in terms of current speech learning models.


L2 Speech Perception and Perception Models
Adult second language (L2) speakers' problems with the perception and production of non-native speech sounds are closely related to their first language (L1) experience (Flege 1995). Research has shown that infants are language-general perceivers of speech sounds at the phonetic level. This universal perceptual pattern undergoes a profound change in the latter half of the first year of life as experience with the first language increases (Best 1994; Polka and Bohn 1996; Strange 1995; Werker 1994; Werker and Polka 1993). Adult monolinguals are language-specific perceivers of speech sounds. Perceptual studies using synthesized stimuli found that adult speakers identified stop consonants along a VOT (Voice Onset Time) continuum according to their L1 stop inventories (Abramson 1964, 1970). Similar studies on L2 vowel perception using a synthesized vowel continuum also indicated that listeners labeled vowel sounds according to their L1 vowel categories (Rochet 1995). To a large extent, the language-specific nature of adult monolinguals' speech perception underlies the difficulties adult learners face in L2 speech learning.
Evidence from cross-linguistic speech perception studies in which listeners map L2 sounds onto their L1 sound system suggests that the phonetic distances between learners' L1 and L2 sound systems play an important role in the degree of success in L2 speech perception (Flege and Wayland 2019; Guion et al. 2000; Wang and Chen 2019). For example, in a cross-linguistic perceptual study assessing the phonetic distances between Japanese and English consonants, monolingual Japanese speakers identified English consonants using Japanese consonant categories. The subsequent experiment found that the phonetic distances between Japanese and English consonants, as established by the cross-linguistic direct mapping experiment, predicted the discrimination patterns of English consonants by Japanese learners of English with different amounts of L2 experience (Guion et al. 2000). Wang and Chen (2019) also found that English CFL learners' perception problems with Mandarin consonants were closely related to the L2 to L1 assimilation patterns.
Languages 2020, 5, 20
In searching for the nature of such cross-linguistic influence in L2 phonetic learning, researchers have proposed different L2 speech perception models. The Perceptual Assimilation Model (PAM) (Best 1994; Best et al. 2001) assumes that several pairwise assimilation types are possible when two non-native phones are mapped onto the L1 sound system. The pair of L2 phones may be assimilated to two different L1 phones, the Two Category (TC) type, or to a single L1 category equally poorly or well, the Single Category (SC) type. The two L2 sounds can also be assimilated to a single native category with one being a better fit than the other, the Category Goodness (CG) type. The PAM also predicts the degree of difficulty in discriminating L2 sounds, from the most to the least difficult: SC > CG > TC (Best et al. 2001). Flege's (1995, 2007) Speech Learning Model (SLM) states that a learner's L1 and L2 sound systems interact and exist in a common phonological space. Learners will establish an L2 sound category if they perceive the phonetic differences between the L2 sound and the nearest L1 sound or the closest L2 sound. In contrast, "equivalence classification" of an L2 sound with the nearest L1 category blocks the formation of a new phonetic category. Flege claims that learners' ability to establish new phonetic categories remains intact throughout their life span and increases with their L2 experience. Perceptual learning will eventually lead to better production, although the alignment between perception and production may be only partial (Flege 1999). Therefore, the SLM is a dynamic model that emphasizes learners' experience with the target language.

The Relationship between Perception and Production
As both the PAM and the SLM place more emphasis on the perceptual assimilation or dissimilation of L2 sound categories to L1 sounds, the question arises about the relationship between L2 speech perception and production. Previous research on L2 speech perception and production has led to different conclusions. For example, Rochet (1995) found that native Portuguese speakers produced French /y/ as /i/ while native English speakers produced /y/ as /u/, although both English and Portuguese have /i/ and /u/ in their vowel systems. The subsequent perceptual test using a synthesized high vowel continuum revealed that Portuguese listeners assimilated /y/ to /i/ while English listeners assimilated /y/ to /u/ (Rochet 1995). Similarly, Mandarin speakers' production problem with French voiced stops was related to their faulty perception of the voiced stops, which do not exist in the Mandarin sound system (Rochet 1995). In a study on the English front vowels /i ɪ eɪ ɛ æ/, Wang (1997) found that Mandarin speakers had problems with both the perception and production of the English lax vowels /ɪ ɛ æ/, but they performed better in perception than in production on these three vowels. In contrast, they performed better in production than in perception on the English /i eɪ/ categories. Such performance discrepancies between the perception and production of the English front vowels suggest that native Mandarin ESL (English as a Second Language) learners may have used different cues or strategies in their perception and production of English vowels. Flege (1999) also reported a series of studies that showed partial alignment between L2 perception and production. In a more recent study on native Arabic speakers' acquisition of British English vowels and consonants, Evans and Alshangiti (2018) found a link between perception and production, as the better perceivers of English vowels were also the better producers.
L2 phonetic training studies have also examined the relationship between perception and production when assessing the effects of training in both modes. In a recent meta-analysis of 30 perception training studies on L2 segments conducted over the past 25 years, Sakai and Moorman (2018) found that perception training led to only small gains in production of the target sounds. Their subsequent statistical analysis based on 18 of the 30 studies led to the conclusion that the production gains were larger on obstruents than on sonorants and vowels. Correlation tests suggested there was a small to medium-sized but statistically nonsignificant relationship between gains in perception and production.

Studies on L2 Mandarin Consonants
While L2 learners' problems with non-native speech sounds are well documented for consonants (Bradlow et al. 1997; Guion et al. 2000; Munro et al. 2015), vowels (Evans and Alshangiti 2018; Munro and Derwing 2008; Wang 1997; Wang and Munro 2004), and lexical tones (Wang 2006, 2008, 2013), parallel studies on the perception and production of L2 speech sounds, particularly on CFL learners' difficulties with Mandarin consonants, are still very limited. Several studies on the perception or production of Mandarin consonants reported in the past two decades are summarized below.
Lai (2009) investigated the perception difficulties of native Malay and Burmese speakers residing in Taiwan with the six Mandarin affricates z /ts/, c /tsʰ/, zh /tʂ/, ch /tʂʰ/, j /tɕ/, and q /tɕʰ/. The learners and a control group of native Taiwan Mandarin speakers took a same/different discrimination test followed immediately by an identification test on the target affricates paired across different places and manners of articulation. Both learner groups were more accurate in identifying unaspirated affricates than their aspirated counterparts. They also had more problems identifying the dental-retroflex z /ts/-zh /tʂ/ and c /tsʰ/-ch /tʂʰ/ contrasts than the palatal affricates. Lai (2009) concluded that there was a merger of dental and retroflex affricates and that the dentalization of the retroflex sounds was better explained by Markedness theory than by interference from the learners' first language. In fact, the native Mandarin control group demonstrated exactly the same perceptual merger pattern, with the same error rate (around 67%) as the two learner groups on the z /ts/-zh /tʂ/ and c /tsʰ/-ch /tʂʰ/ identifications.
The findings were not surprising, as both L2 groups were learning Mandarin in Taiwan and the dental and retroflex fricative/affricate merger is common in many Mandarin dialects spoken in Southern China as well as in Taiwan (Zhu 2012; Chuang et al. 2019).
Hao (2012) investigated how learners' L2 to L1 sound mapping patterns and the amount of L2 experience affect the perception of Mandarin sounds. Three groups of native English speakers with different lengths of Mandarin learning experience took the perceptual tests: the Ex group (5.6 years), the Inex group (1.5 years), and the Noex (no experience) group. Hao (2012) found that phonetic context and L2 Mandarin experience affect the learners' L2 to L1 sound mapping patterns. More experienced learners gave more consistent responses in Mandarin to English sound classifications and were less affected by phonetic contexts than less experienced learners. The Noex group assimilated Mandarin /s/ to English /z/ more often than to /s/, while both learner groups identified /s/ as English /s/. All three groups assimilated Mandarin /ʂ/ and /ɕ/ mostly to English /ʃ/, except that the Noex group split the classification of /ɕ/ equally between /s/ and /ʃ/ when /ɕ/ was followed by the unrounded vowel /i/. While both Mandarin /ʂ/ and /ɕ/ were assimilated to English /ʃ/, /ʂ/ was a better fit than /ɕ/, as indicated by both the identification accuracy rate and the higher goodness rating score, a Category Goodness type of assimilation according to the PAM. In the identification test, the Mandarin /ʂ-ɕ/ contrast was found to be difficult for the learners, and more so for the Inex group than the Ex group. All three groups performed equally well in discriminating the /ʂu-su/ and /ʂɨ-sɨ/ contrasts in the discrimination test. The author concluded that the L2 to L1 assimilation patterns failed to predict discrimination accuracy of Mandarin contrasts in most cases.
In a more recent cross-linguistic perception study on Mandarin consonants (Wang and Chen 2019), native English listeners with no Mandarin learning experience identified 10 Mandarin consonants in syllables (z /tsa/, c /tsʰa/, s /sa/, j /tɕʲa/, q /tɕʰʲa/, x /ɕʲa/, zh /tʂa/, ch /tʂʰa/, sh /ʂa/, and r /ʐa/) using the closest English sounds in a ten-way forced choice task followed by a goodness rating task along a scale of 1 (poor) to 7 (good). L2 to L1 sound mapping fit indexes (identification score × rating score) were calculated to assess the phonetic distances between Mandarin and English consonants. Wang and Chen (2019) found there was a range of phonetic distances between the L1 and L2 sounds based on the fit indexes (range from 1.0 to 6.3 out of 7). The "poor" matching categories were x /ɕ/, c /tsʰ/, q /tɕʰ/, zh /tʂ/, and j /tɕ/, whose fit indexes were below the mean (3.7, s.d. = 1.7). The "fair" fitting categories were ch /tʂʰ/, s /s/, and z /ts/, whose fit indexes were at the mean. The "good" matching sounds were r /ʐ/ and sh /ʂ/, whose fit indexes were 1 s.d. above the mean (Wang and Chen 2019). In a subsequent study on the identification of Mandarin consonants by English CFL learners at two different proficiency levels, the learners' perception scores of Mandarin consonants were found to be closely related to the L2 to L1 assimilation patterns. Results showed that zh /tʂ/, q /tɕʰ/, c /tsʰ/, and x /ɕ/ (the poor fitting sounds) received the lowest percentage identification scores among the 10 sounds by the beginning level learners. The intermediate group outperformed the beginning group on zh /tʂ/, q /tɕʰ/, and c /tsʰ/. These findings suggest that the perceived phonetic distances between L1 and L2 consonants predicted the English CFL learners' L2 Mandarin consonant identification problems and that increased L2 experience improved perceptual learning. No production data were reported in this study.
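The fit index described above (identification score × rating score) can be illustrated with a minimal Python sketch. It assumes the identification score is expressed as a proportion between 0 and 1, which is consistent with the reported range of 1.0 to 6.3 out of 7. The function names, the input values, and the exact cut-offs for "poor", "fair", and "good" are illustrative assumptions, not figures or rules taken from Wang and Chen (2019).

```python
from statistics import mean, stdev


def fit_index(id_proportion: float, mean_rating: float) -> float:
    """Fit index = identification proportion (0-1) x mean goodness rating (1-7)."""
    if not 0.0 <= id_proportion <= 1.0:
        raise ValueError("identification score must be a proportion in [0, 1]")
    if not 1.0 <= mean_rating <= 7.0:
        raise ValueError("rating must lie on the 1-7 scale")
    return id_proportion * mean_rating


def label_fits(fit_by_sound: dict) -> dict:
    """Label each sound relative to the mean fit index.

    Assumed reading of the text: "good" is at least 1 s.d. above the mean,
    "poor" is below the mean, and everything in between is "fair".
    """
    values = list(fit_by_sound.values())
    m, sd = mean(values), stdev(values)
    labels = {}
    for sound, fit in fit_by_sound.items():
        if fit >= m + sd:
            labels[sound] = "good"
        elif fit < m:
            labels[sound] = "poor"
        else:
            labels[sound] = "fair"
    return labels


# Invented (proportion, mean rating) pairs for illustration only.
responses = {"sh": (0.90, 6.5), "z": (0.70, 5.0), "x": (0.30, 4.0), "c": (0.25, 4.0)}
fits = {sound: fit_index(p, r) for sound, (p, r) in responses.items()}
labels = label_fits(fits)
```

Because the index multiplies the two measures, a sound that is identified consistently but rated as a poor exemplar still receives a low value, which is the property that lets the index stand in for perceived phonetic distance.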
In a production study, Liu and Jongman (2012) investigated both the temporal and spectral features of the Mandarin dental affricates z /ts/ and c /tsʰ/ produced by native English CFL learners with different proficiency levels. The authors found that both the novice and the more experienced learner groups acquired the durational differences for the /ts/-/tsʰ/ contrast, but only the more advanced learners acquired the spectral (center of gravity) contrast between the target sound pair. It was not clear how much weight the temporal and spectral cues each carry for the perceptual accuracy of the target contrast, as no perception test was conducted to measure the accuracy of the learners' productions. This study dealt with only one pair of Mandarin affricates, contrasting at the dental place of articulation.
In a similar study involving more Mandarin affricate contrasts, Yang and Yu (2019) investigated the perception and production of the six Mandarin affricates z /ts/, c /tsʰ/, zh /tʂ/, ch /tʂʰ/, j /tɕ/, and q /tɕʰ/ by native English CFL learners at beginning and intermediate levels. Both learner groups matched the native Mandarin group in perception accuracy scores in discriminating but not in identifying the target sounds. The effects of place of articulation and aspiration were significant but not uniform across the board. For example, the unaspirated palatal j /tɕ/ was significantly better identified than its aspirated palatal counterpart q /tɕʰ/, but the aspirated retroflex ch /tʂʰ/ was better identified than its unaspirated counterpart zh /tʂ/. The authors concluded that different affricates pose different learning difficulties for English CFL learners. In the production test, the intermediate group outperformed the beginning group in approximating the native speakers in the production of some but not all of the acoustic features under investigation, indicating that the learners had not acquired the affricates completely. Their data suggest that the distinction between palatal and retroflex affricates is more difficult for learners because both classes are assimilated to the same English post-alveolar affricates, the two-to-one Single Category (SC) type of assimilation according to the PAM.
To summarize the findings of the above studies, the difficulties with the perception accuracy of Mandarin consonants by native English CFL learners are related to their L2 to L1 perceptual assimilation patterns (Hao 2012; Wang and Chen 2019; Yang and Yu 2019). In general, Mandarin retroflex and palatal contrasts pose more difficulties to English CFL learners than other place contrasts (Hao 2012; Yang and Yu 2019), while dental and retroflex contrasts were more difficult for Malay and Burmese learners (Lai 2009). L2 experience did not appear to affect the learners' discrimination accuracy but did influence their identification accuracy of the Mandarin consonants (Hao 2012; Lai 2009; Wang and Chen 2019; Yang and Yu 2019). These results confirmed earlier findings of the advantage of identification over discrimination tasks in L2 phonetic testing and training, because the former helps learners focus more on the key phonetic/acoustic features that distinguish the target sound contrasts (Wang and Munro 2004). In production, the effect of L2 experience was more evident, as the more experienced learners outperformed less experienced learners in approximating the native speakers in the production of some but not all of the acoustic features of the target sound contrasts (Liu and Jongman 2012; Yang and Yu 2019).

The Current Study
Several studies summarized in Section 1.3 (Hao 2012; Lai 2009; Wang and Chen 2019) investigated CFL learners' perception problems with Mandarin consonants but did not examine their production problems. Two production studies on Mandarin consonants (Liu and Jongman 2012; Yang and Yu 2019) compared the acoustic properties of the learners' productions with those of native speakers but did not include a direct assessment of the intelligibility of the L2 speech. Parallel studies that compare CFL learners' perception and production performance with Mandarin consonants are extremely rare. This study aims to fill this gap by investigating native English CFL learners' difficulties with the Mandarin consonants in both perception and production at the initial stage of learning. An additional goal is to examine the relationship between L2 speech perception and production. The research questions are:

1. Which Mandarin consonants are difficult to identify and produce for native English CFL learners at an early stage of learning?
2. How do the phonetic differences and distances between Mandarin and English consonants, as perceived by English listeners in an earlier study (Wang and Chen 2019), affect the perception and production of Mandarin consonants?
3. What are the learners' performance differences between their perception and production of Mandarin consonants?

Table 1 presents the 22 Mandarin consonants in IPA. The sounds in bold are the eight target Mandarin consonants under investigation in the current study: z /ts/, c /tsʰ/, j /tɕ/, q /tɕʰ/, x /ɕ/, zh /tʂ/, ch /tʂʰ/, sh /ʂ/. They form the fricative/affricate groups at the dental, retroflex, and alveolo-palatal (also commonly referred to as palatal) places, which are reported to be difficult for English CFL learners to acquire because these sounds do not have corresponding counterparts in English (Lin 2005; Wang and Chen 2019).

Participants
The participants were 25 native English-speaking (15 male, 10 female, mean age = 19.6) beginning level CFL learners enrolled in a first semester Chinese course at a public university in the U.S. All participants reported speaking English as their native language. Twenty of them were born and raised in the United States and five were born in foreign countries but moved to the U.S. between the ages of 2 and 5. Some participants reported speaking another language along with English as their first languages: four were English/Spanish, four English/Hmong, two English/Tagalog, and one English/Vietnamese early bilinguals. At the point of data collection, the participants were about three months into the 16-week semester and all had learned and practiced Chinese consonants by then.

Material
The perceptual identification test initially included 10 Mandarin consonants in syllables (z /tsa/, c /tsʰa/, s /sa/, j /tɕʲa/, q /tɕʰʲa/, x /ɕʲa/, zh /tʂa/, ch /tʂʰa/, sh /ʂa/, and r /ʐa/) that were produced by two native Mandarin speakers, one male and one female, through a reading task. The target words were produced in the carrier sentence wo shuo ___ zi (我说___字), "I say ___ word". The recordings were made on a MacBook Pro computer using the Praat software. The target syllables were separated from the sentences using waveform editing, normalized for peak volume, and saved in WAV format for presentation. Eight of the 10 sounds (z /ts/, c /tsʰ/, j /tɕ/, q /tɕʰ/, x /ɕ/, zh /tʂ/, ch /tʂʰ/, sh /ʂ/) were analyzed for this perception experiment to pair exactly with the eight target consonants in the production test.

Procedure
All participants gave their informed consent for inclusion before they participated in this study, which was approved by the ethics committee of the researchers' university. Individual perception identification tasks were carried out in a sound booth on a MacBook Pro computer using the Praat ExperimentMFC identification test design. A total of 60 stimuli (10 sounds × 2 speakers × 3 repetitions) were randomized and presented in a 10-way forced choice task. The labels for the choices were the 10 consonants in pinyin (the official Romanized transcription of Mandarin Chinese) displayed on the computer screen during the test. The listeners were instructed to listen carefully for the initial consonant in each stimulus and identify the sound they heard by clicking on the corresponding consonant on the screen. During the test, they could replay each stimulus twice in case of uncertainty. To familiarize the learners with the task, each participant had a trial session before the real test began, using stimuli not included in the analyses. The software automatically recorded the test data to be exported for analysis.
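The stimulus design above (10 sounds × 2 speakers × 3 repetitions, presented in random order) can be sketched in Python as follows. This is only an illustration of the crossing and randomization logic: the study itself ran the task in Praat ExperimentMFC, and the speaker labels, tuple layout, and seed parameter here are hypothetical.

```python
import random
from itertools import product

# The ten pinyin response categories named in the Material section.
CONSONANTS = ["z", "c", "s", "j", "q", "x", "zh", "ch", "sh", "r"]
SPEAKERS = ["male", "female"]  # hypothetical labels for the two talkers
REPETITIONS = 3


def build_trials(seed=None):
    """Cross sounds, speakers, and repetitions, then shuffle the trial order."""
    trials = list(product(CONSONANTS, SPEAKERS, range(1, REPETITIONS + 1)))
    rng = random.Random(seed)  # a seed makes one listener's order reproducible
    rng.shuffle(trials)
    return trials


trials = build_trials(seed=1)
```

Each trial is a (consonant, speaker, repetition) tuple, so every consonant is heard six times per listener, and only the eight target consonants would then be carried forward to the analysis.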

Results
Individual participants' correct identifications of each target consonant were converted to percentage accuracy scores. Their misidentified target sounds were also converted to percentage error rates and were tallied for substitution patterns. The group mean percentage correct identification scores of the eight consonants ranged from 29% for zh /tʂ/ to 80% for ch /tʂʰ/. To investigate the learners' perceptual substitution patterns of the misidentified consonants, a confusion matrix was created and is presented along with the percentage correct identification scores in Table 2.
Table 2. Mean percentage correct identification scores (in bold) and confusion matrix of Mandarin consonants by native English CFL (Chinese as a Foreign Language) learners (N = 25).
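The scoring step described above (percentage correct per target, plus a tally of what each misidentified target was heard as) can be sketched in Python. The function names and the response data below are invented for illustration and do not reproduce the values in Table 2.

```python
from collections import Counter, defaultdict


def confusion_matrix(trials):
    """Turn (target, response) pairs into row percentages per target."""
    counts = defaultdict(Counter)
    for target, response in trials:
        counts[target][response] += 1
    matrix = {}
    for target, row in counts.items():
        total = sum(row.values())
        matrix[target] = {resp: 100.0 * n / total for resp, n in row.items()}
    return matrix


def percent_correct(matrix):
    """The diagonal of the confusion matrix is the identification accuracy."""
    return {target: row.get(target, 0.0) for target, row in matrix.items()}


# Invented responses: six presentations each of zh and ch.
example = (
    [("zh", "zh")] * 2 + [("zh", "j")] * 3 + [("zh", "ch")] * 1
    + [("ch", "ch")] * 5 + [("ch", "q")] * 1
)
matrix = confusion_matrix(example)
accuracy = percent_correct(matrix)
```

Each row sums to 100%, so an off-diagonal cell that exceeds the diagonal value marks a substitution that is more frequent than the correct response.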

A one-way repeated measures ANOVA was conducted to compare the differences between the perception scores on the consonants (8 levels). There was a significant effect of consonant (Wilks' Lambda = 0.102, F (7, 25) = 22.526, p < 0.001). The subsequent post hoc Bonferroni tests adjusted for multiple comparisons revealed that a series of pairwise comparisons were significant. The results are presented in Table 3.
Table 3. Pairwise comparisons of differences between the consonants in perception (** p < 0.01, * p < 0.05).

Discussion
The perception test results showed that Mandarin zh /tʂ/ (29%), q /tɕʰ/ (31%), and x /ɕ/ (46%) were the worst identified sounds by native English CFL learners. The results of the pairwise comparisons confirmed that the perception scores of zh /tʂ/ (29%), q /tɕʰ/ (31%), and x /ɕ/ (46%) were significantly different from all the other five sounds but not different from each other (see Table 3). The findings were similar to those of the Wang and Chen (2019) study, in which the same three sounds were also among the four most difficult consonants for the beginning level learners. These three sounds were also among the poorest fitting categories to the learners' L1 English sounds, as established by the native English listeners in the cross-linguistic identification test in the Wang and Chen (2019) study. The current findings support the Wang and Chen (2019) finding that phonetic distances and differences between L1 and L2 consonants predict native English CFL learners' perception problems with Mandarin consonants at the initial stage of learning.
An inspection of the confusion matrix of the misidentified sounds led to the observation that misidentified retroflex and palatal sounds were mostly confused with each other. For example, the retroflex zh /tʂ/ was heard as palatal j /tɕ/ 32% of the time. This confusion score exceeded the correct identification score for zh /tʂ/ of only 29%. Similarly, but to a lesser degree, the most frequent misidentification of sh /ʂ/ was as x /ɕ/ (15%). Misidentified palatal sounds were heard mostly as retroflex sounds as well. The palatal sounds q /tɕʰ/ and x /ɕ/ were misidentified as the retroflex sounds ch /tʂʰ/ and sh /ʂ/ 43% and 31% of the time, respectively. Therefore, the retroflex and palatal fricatives and affricates are both difficult for English CFL learners to identify.
The dental affricates z /ts/ and c /tsʰ/ were also poorly perceived by the learners. The unaspirated dental z /ts/ was most often misidentified as /s/ (20% of the time). The aspirated dental affricate c /tsʰ/ was misidentified as ch /tʂʰ/ and as /s/ 11% of the time each.

Participants
The participants were the same 25 beginning level CFL learners who took the perception test in Experiment 1. They provided the production data through a reading task that took place immediately before the perception tests.

Material and Procedure
The reading list consisted of 20 target words in pinyin embedded in the carrier sentence wo shuo ___ zi (我说___字), "I say ___ word". Each of the 20 target words was repeated once, yielding two versions of the same target sounds. The participants were given a few minutes to prepare for the reading task. Any questions about the pronunciation of any sounds were answered during the preparation time. The participants were told to read the list at normal speed. The recordings were then made on a MacBook Pro computer using the Praat software. The words containing the eight target consonants (z /tsa/, c /tsʰa/, j /tɕʲa/, q /tɕʰʲa/, x /ɕʲa/, zh /tʂa/, ch /tʂʰa/, and sh /ʂa/) were separated from the sentences using waveform editing, normalized for peak volume, and saved in WAV format for presentation.

Assessment
Three phonetically trained native Mandarin speakers, all of whom have taught Mandarin Chinese courses in North America, assessed the participants' productions in an eight-way forced choice identification task followed by a goodness rating task along a scale of 1 (poor) to 7 (good). The participants' productions of the eight target sounds were blocked by groups of five speakers, yielding 80 tokens in each session (8 words × 5 speakers × 2 repetitions). The eight-way forced choice task and the subsequent rating task were created using the Praat ExperimentMFC identification test design. Individual identification tasks were carried out on a MacBook Pro computer in a quiet room. The native Mandarin speakers listened to each Mandarin stimulus and identified the initial consonant by clicking on the corresponding label in pinyin on the computer screen. Immediately after the identification of each sound, the listeners rated the fitness of the sound they identified by choosing a number along the scale of 1 (poor) to 7 (good). In cases of uncertainty, the listener could replay the stimulus up to three times before the choice was made. In addition to the eight target sounds in pinyin, the /s/, /t/, and /k/ sounds were also included in the labels for identification. Based on a screening test by the first author, these three sounds provided additional options for the listeners to choose for the mis-produced target sounds. The listeners all had a trial session to learn the test procedure before the real judgement test began. They each then completed six sessions with mandatory breaks between sessions. The data of one session were excluded from analysis because those five participants were not native English speakers.

Results
To assess inter-rater reliability, a reliability test was carried out and a high degree of agreement was found among the three raters. The average measures intraclass correlation was 0.821, with a 95% confidence interval from 0.788 to 0.849 (F (399, 798) = 5.605, p < 0.001). Therefore, the mean group production score for each consonant was calculated by taking the average of the three listeners' identification scores. To further explore the production substitution patterns of each mis-produced consonant, a confusion matrix was created. Table 4 summarizes the mean percentage correct production scores (in bold) and the confusion matrix of the eight mis-produced target consonants. The mean goodness rating scores of each target consonant were also calculated and are presented (in italic) beside the mean correct identification scores in Table 4.
Table 4. Mean percentage correct production scores (in bold), mean rating scores (in italic), and confusion matrix of mis-produced Mandarin consonants by native English CFL learners (N = 25).
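The reliability statistic reported above can be illustrated with a short Python sketch of an average-measures intraclass correlation computed from a two-way ANOVA decomposition. The source does not state which ICC form was used, so the consistency form shown here, (MS_rows − MS_error) / MS_rows, is an assumption, and the ratings matrix is invented. Under that assumption the ICC equals 1 − 1/F, and 1 − 1/5.605 ≈ 0.82 agrees with the reported value of 0.821. The degrees of freedom are also consistent with 400 rated tokens (25 speakers × 8 consonants × 2 repetitions) and three raters: 399 and 399 × 2 = 798.

```python
def icc_average_consistency(ratings):
    """Average-measures consistency ICC from an items x raters matrix.

    Assumed form: (MS_rows - MS_error) / MS_rows, with mean squares taken
    from a two-way ANOVA without replication. Returns (icc, F).
    """
    n = len(ratings)        # number of rated items (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    f_ratio = ms_rows / ms_error
    return (ms_rows - ms_error) / ms_rows, f_ratio


# Invented data: 5 tokens scored correct (1) or incorrect (0) by 3 raters.
ratings = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
icc, f_ratio = icc_average_consistency(ratings)
```

Averaging over raters is what makes this the "average measures" coefficient: it describes the reliability of the mean of the three listeners' scores, which is the quantity the production analysis goes on to use.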

Overall, the percentage correct production scores of the eight target sounds, as identified by the three native listeners, ranged from 25% (zh /tʂ/ and c /tsʰ/) to 88% (z /ts/). A one-way ANOVA on the percentage identification scores revealed a significant effect of consonant (Wilks' Lambda = 0.288, F (7, 143) = 50.486, p < 0.001). The subsequent post hoc Bonferroni tests adjusted for multiple comparisons revealed that a series of pairwise comparisons were significant. The results are presented in Table 5.
Table 5. Pairwise comparisons of differences between the consonants in production (** p < 0.01, * p < 0.05).

Table columns: Sounds, Production Score, Rating Score, Adjusted Score.

Discussion
The two most difficult consonants for the learners to produce were c /tsʰ/ and zh /tʂ/, each with a low percentage production score of only 25%, which was significantly different from all the other sounds under investigation. Mandarin c /tsʰ/ and zh /tʂ/ also received the lowest adjusted production scores (c /tsʰ/, 1.0; zh /tʂ/, 1.2) when the rating scores were taken into consideration. On the other hand, the adjusted production scores for z /ts/ and j /tɕ/ were the highest among the eight sounds. The mean rating score of the eight target consonants was 5.0, with a range from 4.1 to 5.6. The difference between the best and worst rated sounds was 1.5. These rating scores suggest that the listeners did not use the full range of the rating scale, especially at the lower end, once a target sound was correctly identified. While the range of the rating scores was relatively small compared to the widely different percentage correct identification scores, the three poorly produced sounds with the lowest percentage correct identification scores, c /tsʰ/ (25%), zh /tʂ/ (25%), and q /tɕʰ/ (35%), also received rating scores below 5. All the other five sounds had a rating score of 5 or above.
The substitution patterns in production were similar to those in perception for the retroflex sounds. The mis-produced retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/ were overwhelmingly heard as the palatal sounds j /tɕ/ (37%), q /tɕʰ/ (35%), and x /ɕ/ (22%) by the native Mandarin listeners. However, the mis-produced palatal sounds q /tɕʰ/ and x /ɕ/ were mostly heard by the native Mandarin listeners as the unaspirated palatal affricate j /tɕ/. For the aspirated dental affricate c /tsʰ/, 23% was heard as the /k/ sound. This unexpected substitution pattern was more likely caused by the pinyin spelling of "ca" being mistaken as the English orthography. Obviously, these speakers have not learned the c /tsʰ/ sound, or at least have not associated the pinyin c with the Mandarin sound /tsʰ/.

Perception and Production Comparisons
Visual inspection of Figure 1, which compares the mean percentage correct perception and production scores of the eight consonants, shows higher scores in perception than in production for the retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/, but the reverse for the palatal sounds j /tɕ/, q /tɕʰ/, and x /ɕ/. The results for the dental affricates were mixed: the aspirated c /tsʰ/ received the lowest production score of 25%, while its unaspirated counterpart z /ts/ had the highest production score of 88%.
To investigate the relationship between the participants' perception and production accuracy, Pearson correlation tests were performed on the percentage correct perception and production scores. The correlation results, along with the mean percentage perception and production scores and standard deviations, are presented in Table 7.
Results of the Pearson correlation tests (2-tailed) revealed moderate correlations between the perception and production scores for two of the eight consonants: c /tsʰ/ (r = 0.619, p < 0.01) and x /ɕ/ (r = 0.508, p < 0.01). No significant correlations between the perception and production scores were found for the remaining six consonants. Pearson's r ranged from (r = 0.028, p = 0.896) for q /tɕʰ/ to (r = 0.077, p = 0.715) for j /tɕ/.
Table 7. Pearson correlation tests between the % correct perception and production scores (** p < 0.01).
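For readers who wish to reproduce this kind of analysis, the sketch below computes Pearson's r between per-learner perception and production scores for a single consonant. It is an illustration under stated assumptions rather than the authors' procedure: the function name `pearson_r` and the five score pairs are hypothetical, and the 2-tailed p value (which would come from a t distribution with n − 2 degrees of freedom) is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical % correct scores for five learners on one consonant
# (NOT the study's data).
perception = [20, 40, 60, 80, 100]
production = [10, 30, 20, 50, 40]
r = pearson_r(perception, production)
print(round(r, 3))
```

In the study's design, one such coefficient would be computed per consonant, with each learner contributing one perception score and one production score.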

General Discussion and Conclusions
To answer research question 1, which asked which Mandarin consonants pose difficulties for native English CFL learners, the results of Experiment 1 showed that the learners had different degrees of difficulty in identifying the eight Mandarin consonants under investigation. The most difficult sounds were zh /tʂ/ (29%), q /tɕʰ/ (31%), and x /ɕ/ (46%). These findings are consistent with an earlier study by Wang and Chen (2019), in which zh /tʂ/, q /tɕʰ/, and x /ɕ/ were also among the four most difficult categories for low-level CFL learners. Similarly, Yang and Yu (2019) found that, among the six Mandarin affricates they investigated, zh /tʂ/ and q /tɕʰ/ were more difficult than their counterparts ch /tʂʰ/ and j /tɕ/ for native English CFL learners.
The confusion matrix of misidentified sounds showed that the English CFL learners mostly substituted the retroflex and palatal sounds with each other. Such confusion patterns suggest that the learners have not established separate categories for these sounds. The English CFL learners' problems with the retroflex and palatal sounds, also reported by Hao (2012) and Yang and Yu (2019), are closely related to the phonetic differences between their L1 and L2 sound systems. Wang and Chen (2019) found that native English listeners mapped both Mandarin retroflex affricates, zh /tʂ/ and ch /tʂʰ/, and the aspirated palatal affricate q /tɕʰ/ onto the English /ʧ/ sound, though the degree of fit differed: ch /tʂʰ/ (4.4) was a much better fit to /ʧ/ than zh /tʂ/ (3.3) and q /tɕʰ/ (2.3), as indicated by their "fit indexes" (% identification × goodness rating score). This three-to-one perceptual assimilation pattern, to a large extent, underlies the native English CFL learners' perception problems with zh /tʂ/, ch /tʂʰ/, and q /tɕʰ/. The better fitting category ch /tʂʰ/ was identified with an accuracy score of 80%, compared with 31% for q /tɕʰ/ and 29% for zh /tʂ/ in the current study. The findings support the Category Goodness (CG) type of assimilation of the Perceptual Assimilation Model (PAM), in which two sounds are assimilated to a single native category, with one being a better fit than the other (Best et al. 2001). The current findings suggest that the two-to-one CG type of assimilation can be expanded to three-to-one assimilation. More such three-to-one and two-to-one mappings, as well as one-to-two mappings, of Mandarin consonants onto English categories found in the Wang and Chen (2019) study are presented in Figure 2. These cross-linguistic assimilation patterns shed light on the difficulties native English CFL learners demonstrated in their perception and production of the eight target consonants in the current study.
(See the original study for detailed analysis of the assimilation patterns).
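The fit index mentioned above, the product of identification rate and goodness rating, can be illustrated with a short sketch. It assumes that the identification rate is expressed as a proportion between 0 and 1 and that the goodness rating is a mean on the rating scale; the function name `fit_index` and all input values are hypothetical and are not taken from Wang and Chen (2019).

```python
def fit_index(prop_identified, mean_goodness):
    """Fit index = proportion of identifications x mean goodness rating.

    Assumption: prop_identified is a proportion in [0, 1]; the original
    study may have scaled the percentage differently.
    """
    if not 0.0 <= prop_identified <= 1.0:
        raise ValueError("prop_identified must be between 0 and 1")
    return prop_identified * mean_goodness


# Hypothetical (proportion mapped to one English category, mean goodness)
# pairs for three L2 sounds; NOT values from the cited study.
mappings = {
    "sound_a": (0.75, 6.0),
    "sound_b": (0.50, 4.0),
    "sound_c": (0.25, 4.0),
}
fits = {sound: fit_index(p, g) for sound, (p, g) in mappings.items()}

# In a Category Goodness pattern, the sound with the highest fit index
# is the "better fit" to the shared native category.
best_fit = max(fits, key=fits.get)
print(fits, best_fit)
```

Ranking the L2 sounds that share one native category by this index gives the ordering that the Category Goodness account relies on.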
L2-to-L1 assimilation patterns can also explain the English CFL learners' difficulties with the Mandarin palatal sound x /ɕ/. Both x /ɕ/ and sh /ʂ/ were mapped onto English /ʃ/, but sh /ʂ/ was a better fit than x /ɕ/, a CG type of assimilation. Hao (2012) reported the same CG assimilation pattern for x /ɕ/ and sh /ʂ/ and similar learning results for x /ɕ/ in her study. Mandarin x /ɕ/ was also difficult for the English learners because it was assimilated to both English /z/ and /ʃ/, a one-to-two "split" match to the native categories, a "revised" Single Category (SC) type of assimilation (see Figure 2).
Figure 2. Mandarin to English sound mapping patterns by English listeners; 3 to 1 and 2 to 1 mappings are in blue squares and 1 to 2 mappings are in red circles.
The Mandarin dental affricates z /ts/ and c /tsʰ/ were also poorly perceived by the learners in the current study. As seen in Figure 2, the aspirated affricate /tsʰ/ was assimilated to both English /s/ and /t/, also a "revised" Single Category (SC) type of assimilation. The unaspirated dental affricate z /ts/, together with s /s/ and c /tsʰ/, was mapped onto English /s/, causing difficulties for the poorly fitting categories z /ts/ and c /tsʰ/. While both z /ts/ and c /tsʰ/ occur word-finally in English, as in "reads" and "boots", these novice learners did not appear to have made the association when identifying the target sounds. One explanation may be that Mandarin z /ts/ and c /tsʰ/ are stand-alone phonemes and are more prominent in word-initial position than the morphological word endings /dz/ and /ts/ in English.
Overall, to answer research question 2, the current data suggest that the perceived phonetic differences and distances between Mandarin and English consonants predicted the learners' perceptual difficulties with the L2 Mandarin consonants. The perception data also support the PAM.
Flege's Speech Learning Model (SLM) may also provide explanations for the current findings. The learners' phonetic spaces for L1 and L2 consonants need to be reorganized to establish new phonetic categories for the Mandarin retroflex, palatal and dental sounds. For example, learners need to distinguish the differences between c /tsʰ/, x /ɕ/, z /ts/, q /tɕʰ/ and others in order to establish these categories. On the other hand, "equivalence classification" of the SLM may be at work for ch /tʂʰ/ to be identified as English /ʧ/. While ch /tʂʰ/ (80%) was the best identified category among the eight target sounds by the learners, its production score (47%) was much lower. Therefore, even if some of the L2 categories seemed to have been established by the majority of the listeners, "equivalence classifications" may have prevented them from forming the native-like perception category.
The results of Experiment 2 showed that the percentage correct production scores of the eight target sounds ranged from 25% (zh /tʂ/ and c /tsʰ/) to 88% (z /ts/). The pairwise comparison data in Table 4 indicated that the native English CFL learners had the most production difficulties with Mandarin c /tsʰ/ and zh /tʂ/, followed by q /tɕʰ/. The pattern of substitutions in production was similar to that in perception for the retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/, which were substituted with the palatals j /tɕ/, q /tɕʰ/, and x /ɕ/. However, the mis-produced palatal sounds q /tɕʰ/ and x /ɕ/ were not confused with the retroflex sounds but were mostly heard by the native Mandarin listeners as the unaspirated palatal affricate j /tɕ/. These substitution patterns suggest that the retroflex sounds were more difficult for the English CFL learners to produce.
Comparing the results of the two experiments, the learners tended to perform better on the retroflex sounds zh /tʂ/, ch /tʂʰ/, and sh /ʂ/ in perception than in production, but the reverse held for the palatal sounds j /tɕ/, q /tɕʰ/, and x /ɕ/. The results for the dental sounds z /ts/ and c /tsʰ/ were mixed across the two domains. Mandarin retroflex and palatal fricatives and affricates, though both lack counterparts in English, pose different problems for English CFL learners in perception and production. The correlation tests comparing the perception and production scores showed that scores were moderately correlated for only two of the eight target consonants, x /ɕ/ and c /tsʰ/. The lack of correlation between the learners' perception and production scores for the majority of the sounds under investigation suggests that the relationship between L2 speech perception and production is not straightforward.
One possible explanation for such misalignment in L2 speech perception and production might be that perception does not always lead production. For example, different mechanisms or strategies may be involved in the perception and production of the retroflex sound ch /tʂʰ/ by beginning-level English CFL learners. The learners' better perception of ch /tʂʰ/ (80%) may be explained by the close match between L2 Mandarin ch /tʂʰ/ and their L1 English /ʧ/ sound. In perceptual identification tasks, anything that is close enough to English /ʧ/ can be labeled as ch /tʂʰ/. However, the same strategy would not work for production if the learners have not established a native-like retroflex ch /tʂʰ/ category. The key phonetic gestures needed to produce the retroflex affricate ch /tʂʰ/ cannot be effectively replaced by the gestures of English /ʧ/. The intended ch /tʂʰ/ would then be heard by the native Mandarin listeners not as the target ch /tʂʰ/ but as q /tɕʰ/. The same pattern seems to hold for the other retroflex sounds zh /tʂ/ and sh /ʂ/, which were identified by the native Mandarin listeners as the palatal sounds j /tɕ/ and x /ɕ/. These substitution patterns suggest that the cues for the retroflex sounds were absent in these non-native productions. Therefore, it is very likely that the misalignment between perception and production is partly due to the different mechanisms the learners relied on in perceiving and producing the target consonants. Acoustic analyses of the learners' productions of these sounds, along with perceptual tests using synthesized stimuli that manipulate the key acoustic cues differentiating the target categories, are needed in future studies before a firm conclusion can be drawn.
The current data also show that partial alignment between the perception and production of Mandarin consonants does exist. The learners' perception and production scores for both x /ɕ/ and c /tsʰ/ were moderately but significantly correlated, indicating a link between perception and production. Past studies have reached the same conclusion of partial alignment in the perception and production of L2 sounds (Flege 1999; Rochet 1995).
L2 phonetic training studies have also examined the relationship between perception and production when assessing the outcomes of training. There is evidence that perceptual training alone led to improvement in both perception and production of L2 consonants (Bradlow et al. 1997) and lexical tones (Wang 2008, 2012, 2013; Wang et al. 2003), and that production gains are larger on obstruents than on sonorants and vowels (Sakai and Moorman 2018). There is also evidence that trainees' perceptual learning did not lead to better production of L2 vowel contrasts (Wang 2002). These findings suggest that the relationship between perception and production of L2 speech sounds can be further complicated by differences among sound classes.
In conclusion, the current data showed partial alignment but more discrepancies between native English CFL learners' perception and production of the eight Mandarin consonants. Different phonetic mechanisms and strategies may be involved in L2 speech sound perception and production. Perception may not always lead production. Future studies need to carry out more detailed acoustic analyses of native and non-native productions of the target consonants to investigate the specific problems that learners have in the perception and production of Mandarin consonants.
Finally, one limitation of the current study was the exclusion of the s /s/ and r /ʐ/ sounds from the analyses. It would have been better to include at least the s /s/ sound to form a complete set of dental fricative and affricates, parallel to the retroflex and palatal sets.
Author Contributions: Conceptualization, X.W. and J.C.; methodology, X.W. and J.C.; data collection and analysis, X.W. and J.C.; writing-original draft preparation, X.W.; writing-review and editing, X.W. and J.C. All authors have read and agreed to the published version of the manuscript.