Article

Acoustic Analysis and Perceptual Evaluation of Second Language Cantonese Tones Produced by Advanced Mandarin-Speaking Learners

1 Department of Chinese Language and Literature, Hong Kong Shue Yan University, North Point, Hong Kong 999077, China
2 Center for Clinical Neurolinguistics, School of Foreign Languages and Literature, Shandong University, Jinan 250100, China
3 Institute of Language Sciences, Shanghai International Studies University, Shanghai 200083, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6590; https://doi.org/10.3390/app15126590
Submission received: 27 March 2025 / Revised: 2 June 2025 / Accepted: 6 June 2025 / Published: 11 June 2025
(This article belongs to the Special Issue Musical Acoustics and Sound Perception)

Abstract

The tonal system of Cantonese is very different from that of Mandarin, which creates potential challenges for Mandarin speakers when learning Cantonese. The aim of this study was to explore second language (L2) production of Cantonese tones by advanced learners whose first language (L1) is Mandarin. Forty-one informants participated in a recording experiment to provide production data of Cantonese tones. The speech data were measured acoustically using the computer software Praat (Version 6.3.10) and were evaluated perceptually by native Cantonese speakers. The relationship between the acoustic analysis and perceptual evaluation was also explored. The acoustic and perceptual evaluations confirmed that, while the tones that the Mandarin learners of Cantonese produced were non-native-like, their production of the Cantonese T1 and T2 was good in general. Furthermore, the accuracy of the perceptual evaluations could be predicted based on the acoustic features of the L2 tones. Our findings are in line with hypotheses in current speech learning models, and demonstrate that familiar phonetic categories are easier to acquire than are unfamiliar ones. To provide a more complete picture of L2 speech acquisition, future research should investigate L2 tone acquisition using both production and perception data obtained from participants with a greater variety of L1s.

1. Introduction

Lexical tones are challenging not only for first language (L1) speakers at a young age (e.g., [1]) but also for second language (L2) learners (e.g., [2]). Typically developing children acquire their L1s effortlessly [3], but the case for L2 acquisition is much more complex. Although some domains of linguistic knowledge have been proven to be acquirable at the end state of L2 acquisition, such as syntax [4] and lexicon [5], it has been argued that the attainment of native-like pronunciation is unlikely for late L2 learners [6,7,8]. Following this line of research, this study aimed to explore whether speakers of a tonal language could successfully acquire the tonal system of another tonal language with sufficient exposure to the target language by investigating the L2 Cantonese tone production of advanced Mandarin-speaking immigrants in Hong Kong.
In the following sections, we first review relevant models in L2 speech acquisition in Section 1.1 and introduce the Cantonese and Mandarin tonal systems in Section 1.2. The rationales for and hypotheses in this study are described in Section 1.3. In Section 2, we explain the research methods in detail, before presenting the findings of this study in Section 3. A summary of the findings is provided in Section 4, followed by discussion of relevant issues in the acquisition of L2 Cantonese tones and possible directions for future research.

1.1. L2 Speech Acquisition

Two influential models of speech learning that may provide insights into the issues involved in L2 speech acquisition have been proposed in the literature. The Speech Learning Model (SLM) was originally proposed by [9,10] to account for the differences in the learnability of L2 phonetic segments, and was recently updated as the Revised Speech Learning Model (SLM-r [11]). According to the SLM/SLM-r, the processes and mechanisms that guide L1 speech acquisition (including the ability to form new phonetic categories) remain intact and accessible for L2 speech learning across the lifespan. Another model, the Perceptual Assimilation Model (PAM), suggests that infants attune their perceptions of speech to the properties of the sounds in their ambient language environments during the early stage of language acquisition; consequently, they become less sensitive to the sounds of a new language and may have difficulty perceiving unfamiliar sound contrasts [12]. Similar to the SLM, the PAM also assumes that the ability to perceive speech continues to be refined throughout one’s life in its extended version, the Perceptual Assimilation Model of Second Language Speech Learning (PAM-L2 [13]). Both models postulate a common phonetic space in the learners’ minds in which the phonetic categories of both the L1 and the L2 are stored.
One hypothesis from the SLM/SLM-r is the category assimilation hypothesis (CAH), which claims that, in the common space, an L2 sound that is perceived as being similar to an L1 sound does not form a new category, and is understood as a variant of the L1 sound at an allophonic level; in other words, the cross-linguistic equivalence between the two sounds has been established. In this case, the phonemic variants for interdialectal contact are called diaphones [14], and the CAH posits that only one single phonetic category is used to process the two linked diaphones. This mapping of diaphones will eventually give rise to a new merged category in bilingual speakers’ mental representations, and will be realised differently from either the L1 sound or the L2 sound in production; this phenomenon has been documented in several studies [15,16].
Another hypothesis from the SLM/SLM-r, the category dissimilation hypothesis (CDH), suggests that a new category will be established if an L2 sound is absent in the L1 system, which will make the combined phonetic space more crowded; as a result, the phonemes tend to disperse in compensation to maintain the phonetic contrast. When category dissimilation takes place, neither the newly established L2 category nor the closest L1 category will be identical to the categories in monolinguals; consequently, both categories may shift away from their original phonetic spaces. Ref. [17] provided recent support for CDH, and found that Spanish–Catalan bilinguals had developed two categories to accommodate the mid-back vowels in the two languages.
However, most research within the SLM/SLM-r framework has focused on segmental features. Whether the CAH and CDH can be generalised to suprasegmentals remains to be explored.

1.2. Cantonese and Mandarin Tonal Systems

Although Cantonese and Mandarin are both varieties of Chinese, they have different phonological systems and are not mutually intelligible [18]. Like other Chinese languages, both Mandarin and Cantonese are tonal; that is, lexical tones contribute to the differentiation of words in Mandarin and in Cantonese. Table 1 presents the tonal systems in Cantonese and Mandarin, and it lists the name, the category, the letter and an example for each tone of these two languages. Tone letters are used to represent the relative values of the tones, with ‘1’ and ‘5’ representing the lowest and highest pitch levels, respectively. As shown in Table 1, there are six lexical tones in Cantonese and four in Mandarin [19,20]. Cantonese has a complex tonal system, with three level tones (T1, T3 and T6), two rising tones (T2 and T5) and one falling tone (T4). Following the convention of Cantonese phonology, the three entering tones in Cantonese are grouped into the three level tones. In Mandarin, there is one level tone (T1), one rising tone (T2), one dip-rising tone (T3) and one falling tone (T4). In addition, a Mandarin T3 syllable undergoes tone sandhi when it is not a sentence-final syllable [21]. In terms of potential mapping of the tones, Cantonese and Mandarin share similar tone levels and values for T1 and T2, as T1 is a high-level tone (55 in both languages) and T2 is a high rising tone (25 in Cantonese and 35 in Mandarin).
According to recent studies, several tone pairs (i.e., T2 and T5, T3 and T6, and T4 and T6) are merging in either the production or the perception of some native speakers [22,23], which is probably part of a sound change in contemporary Cantonese [24]. If immigrants in Hong Kong are exposed to different types of Cantonese input, namely some merged tones and some unmerged tones, it would be interesting to explore whether they can maintain the contrasts in the six tones in their L2 Cantonese.

1.3. The Current Study

Studies of Cantonese tone acquisition have mainly focused on younger populations, such as monolingual children [25], bilingual children [26,27] and bilingual adolescents [28], but adult L2 learners’ acquisition of Cantonese tones has not been investigated. Adults are fundamentally different from younger populations in terms of L2 acquisition because adult learners have fully acquired their L1 when they begin to acquire their L2, while younger populations are developing their L1 and L2 simultaneously or consecutively [29]. In addition, the acquisition of L2 tones by learners with a tonal language background has received little attention which, according to the SLM/SLM-r and the PAM-L2, is an essential topic for the better understanding of L1 influences on L2 acquisition and of L1 and L2 speech interactions. Thus, in the present study, we attempted to investigate whether adult learners with a tonal language background (native speakers of Mandarin) could acquire the complex Cantonese tonal system, and whether there were any modulations of lexical tones in their L2 Cantonese due to the influences from their L1 Mandarin. The purpose of this study was to examine the following two hypotheses:
Hypothesis 1.
Given the acoustic similarity of T1 and T2 in Cantonese and Mandarin, the Mandarin learners of Cantonese will be able to link these L1 and L2 tones and will show convergence with the native Cantonese speakers in the production of T1 and T2.
Hypothesis 2.
Perceptual evaluations can be predicted based on the acoustic features in the L2 Cantonese tone production.

2. Materials and Methods

2.1. Participants

Thirty-two native speakers of Mandarin (twenty-seven females and five males; mean age ± standard deviation (SD): 30.72 ± 4.87) were recruited as the experimental group for this study. All the immigrants were born and raised in northern China and had lived in Mandarin-speaking regions before residing in Hong Kong after puberty (mean age of arrival ± SD: 23.68 ± 4.88). The immigrants all spoke Mandarin as their L1 and had started to learn Cantonese after their arrival in Hong Kong (mean length of residence ± SD: 7.03 ± 3.29). To assess their language background, a revised version of the Bilingual Language Profile [30] was prepared for the immigrants, the results of which indicated that they spoke fluent Cantonese at the time of the recordings, despite Mandarin being their dominant language. In addition, nine native Cantonese speakers (four females and five males; mean age ± SD: 21.27 ± 1.88), who were born and raised in Hong Kong, were included as the control group.
The sample size of 32 participants in the target group is comparable to that of previous studies on the production of Cantonese tones (e.g., [26] and [28] included 21 and 29 target participants, respectively). The sample size of the reference group is relatively small because the production of native Cantonese speakers is considered to be more stable than that of L2 learners. Such a practice of recruiting fewer participants in the reference group has also been adopted in previous research such as [25,31]. In addition, an a priori power analysis was conducted using G*Power (Version 3.1.9.7) [32] for sample size estimation. With a significance criterion of α = 0.05 and power = 0.80, the minimum sample size needed with this effect size was N = 25 for multiple regression (different types of multiple regression models were fitted, as detailed below in Section 2.3). Thus, the obtained sample size of N = 41 is more than adequate to test the hypotheses. Participants were recruited through printed advertisements on campus and electronic advertisements on various social media platforms. None of the participants reported having any speech, language or hearing disorders.

2.2. Materials and Data Collection

Twelve monosyllabic words were chosen as target stimuli, and were formed by combining two base syllables (/si/ and /fu/) with six lexical tones. In addition, ten monosyllabic words with different syllable structures and tones were included as fillers. Both the target and the filler syllables are high-frequency words in Cantonese. The stimuli were presented in two contexts: either in isolation or embedded within a carrier phrase (我讀__呢個字, ngo5 duk6 __ nei1 go3 zi6, I read this character __). Each stimulus appeared twice in a randomised order. Therefore, 1968 tokens (2 syllables × 6 tones × 2 contexts × 2 repetitions × 41 speakers (32 L2 speakers + 9 native speakers)) were collected for this study.
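The token count follows directly from the crossed design. As a minimal sketch (the variable and function names below are illustrative and not part of the study materials), the target items per speaker can be enumerated and multiplied by the number of speakers:

```python
from itertools import product

# Design factors as described in the text; the labels themselves are illustrative.
SYLLABLES = ["si", "fu"]
TONES = [1, 2, 3, 4, 5, 6]
CONTEXTS = ["isolation", "carrier_phrase"]
REPETITIONS = [1, 2]
N_SPEAKERS = 32 + 9  # 32 L2 speakers + 9 native speakers


def target_items():
    """Enumerate every target item that one speaker produces."""
    return [
        {"syllable": s, "tone": t, "context": c, "repetition": r}
        for s, t, c, r in product(SYLLABLES, TONES, CONTEXTS, REPETITIONS)
    ]


items_per_speaker = len(target_items())          # 2 x 6 x 2 x 2
total_tokens = items_per_speaker * N_SPEAKERS    # all target tokens in the corpus
print(items_per_speaker, total_tokens)
```

Filler items are deliberately left out of the enumeration, since only the target tokens enter the count reported above.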
The recording sessions were conducted in a soundproof room at a local university. The participants were first shown the complete list of the target and filler syllables and were allowed to get familiar with them. The stimuli were then presented randomly on the screen as Chinese characters using E-Prime 2.0 [33] to guide the participants, and the participants were instructed to read the syllables both in isolation and in the carrier phrase. They were allowed to correct themselves when needed. The recordings were collected using Audacity (Version 3.6.1) [34] in the WAV format (mono soundtrack, 44,100 Hz sampling rate, 16-bit resolution).
This project was approved by the Human Research Ethics Committee of Hong Kong Shue Yan University (protocol code: HREC 22-05 (M12); date of approval: 1 June 2022). All participants gave their written informed consent prior to the recording sessions.

2.3. Data Processing and Analysis

Both acoustic analysis and perceptual evaluation were adopted for the assessment of the collected speech samples in this study. The acoustic analysis provided a comparison of the acoustic features of the tones produced by the native and the L2 Cantonese speakers from an objective perspective, while the perceptual evaluation provided direct insights into the overall intelligibility of the participants’ tone pronunciation from native speakers’ intuition [35]. We also aimed to determine whether perceptual evaluations could be predicted based on acoustic features.

2.3.1. Acoustic Analysis Procedures

After the speech samples were collected, trained phoneticians segmented the sonorant portion of each syllable manually using Praat (Version 6.3.10) [36]. Only the sonorant portions were included in the analysis because they are the lexical tone-bearing units. Following the segmentation, 20 time-normalised fundamental frequency (F0) values were extracted from the sonorant portion of each token using a Praat script ProsodyPro [37]; for other options, we used the default in the script: F0 range of 75–600 Hz, and F0 sample rate of 100 Hz. To eliminate pitch range differences across genders and speakers, the F0 values, originally measured in Hz, were converted to T values ranging from 0 to 5 using the following equation:
$$T_m = \frac{\log F_m - \log \mathrm{min}_n}{\log \mathrm{max}_n - \log \mathrm{min}_n} \times 5$$
where $T_m$ and $F_m$ represent the converted T value and the original F0 value at the $m$th time point, respectively, and $\mathrm{max}_n$ and $\mathrm{min}_n$ stand for the maximum value and the minimum value of all the original F0 values of the $n$th speaker. The converted T values are thus comparable across genders and speakers and can be directly compared with the tone letters in Table 1.
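The conversion can be illustrated with a short sketch. The function name and the example F0 values below are hypothetical; the base of the logarithm is not stated in the text, but any base gives the same T values because the base cancels in the ratio (base 10 is used here as an assumption).

```python
import math


def hz_to_t_values(f0_hz, speaker_min_hz, speaker_max_hz):
    """Convert F0 values (Hz) to T values on a 0-5 scale for one speaker.

    speaker_min_hz and speaker_max_hz are the minimum and maximum of all
    F0 values measured for that speaker.
    """
    if not 0 < speaker_min_hz < speaker_max_hz:
        raise ValueError("expected 0 < speaker_min_hz < speaker_max_hz")
    log_min = math.log10(speaker_min_hz)
    log_range = math.log10(speaker_max_hz) - log_min
    return [5.0 * (math.log10(f) - log_min) / log_range for f in f0_hz]


# Hypothetical speaker whose F0 spans 100-400 Hz (two octaves).
example = hz_to_t_values([100.0, 200.0, 400.0], 100.0, 400.0)
print(example)
```

Because the scale is logarithmic, the midpoint of the T scale corresponds to the geometric mean of the speaker's F0 range, not the arithmetic mean.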
To analyse the tone trajectories, we fitted generalised additive models (GAMs) with the ‘mgcv’ package [38] in R (Version 4.5.0) [39,40]. GAMs were chosen because the relationship between tone production and other variables may not be linear. To compare the production of each tone between the two groups, we fitted six models with T value as the dependent variable, group as the main predictor, and normalised time as a smooth term. An interaction between group and normalised time was included to detect potential differences in tonal contour between the two groups. To compare the production of the six tones within each group, we fitted two models with T value as the dependent variable, tone as the main predictor, and normalised time as a smooth term. An interaction between tone and normalised time was included to investigate differences in the tonal contour of the six tones. Additionally, in each model, repetition, context and syllable were included as controls, and by-speaker random intercepts were included to account for individual differences. A p-value of 0.05 was set as the threshold for statistical significance.

2.3.2. Perceptual Evaluation Procedures

The recordings were perceptually evaluated by five native speakers of Hong Kong Cantonese who did not exhibit tone merging. The listeners were all born and raised in Hong Kong and spoke Hong Kong Cantonese as their only L1. None of the listeners attended the recording session or were familiar with the speakers. The speech samples collected from the production experiment were first randomised and then presented to the native listeners, who were asked to judge the tones by selecting the corresponding Chinese characters in a six-alternative forced-choice question based on their native speakers’ intuition (each character representing one tone). Chinese characters, rather than tone labels, were used in the evaluation because the Cantonese tonal system is not explicitly taught in Hong Kong. Native speakers usually acquire the Cantonese tonal system through exposure rather than through formal instruction. Consequently, native speakers of Hong Kong Cantonese are not familiar with the tonal system and may not be able to determine the tone categories accurately.
During the evaluation, the native listeners were allowed to listen to each stimulus as many times as they needed, and there was no time limit for the evaluation. In total, 9840 judgement trials (1968 tokens × 5 listeners) were collected from the 5 listeners, and the judgements were coded in two ways. They were first coded as either correct or incorrect production of tones for statistical modelling, as described below. In addition, the judgements were coded as tone numbers (such as T1, T2, etc.) to enable further error analysis. To confirm the consistency among the listeners, the intraclass correlation coefficient (ICC) was calculated with the Mangold ICC calculation software (Version 2015) [41]. Tone numbers as judged by the five listeners were used in the ICC calculation, and the agreement of the judgement data was 89.5%, suggesting high inter-rater reliability among the five listeners. Thus, the judgement data from all five listeners were included in the subsequent statistical analysis.
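The two coding schemes can be sketched as follows. This is an illustration only: the function names are hypothetical and the judgement pairs are made-up toy data, not data from the study. Binary coding compares the target tone with the listener's response, while tone-number coding allows a target-by-response table of percentages to be built for error analysis.

```python
from collections import Counter

TONES = ["T1", "T2", "T3", "T4", "T5", "T6"]


def code_correct(target, response):
    """Binary coding: 1 if the listener chose the intended tone, else 0."""
    return int(target == response)


def confusion_percentages(judgements):
    """Row-normalised target-by-response table.

    judgements is an iterable of (target_tone, response_tone) pairs.
    Each row (target) sums to 100 when that target has at least one trial.
    """
    pairs = list(judgements)
    cell_counts = Counter(pairs)
    row_totals = Counter(target for target, _ in pairs)
    matrix = {}
    for target in TONES:
        total = row_totals[target]
        matrix[target] = {
            response: (100.0 * cell_counts[(target, response)] / total) if total else 0.0
            for response in TONES
        }
    return matrix


# Toy judgements (invented for illustration).
toy = [("T1", "T1")] * 3 + [("T1", "T3")] + [("T5", "T2"), ("T5", "T5")]
accuracy = 100.0 * sum(code_correct(t, r) for t, r in toy) / len(toy)
table = confusion_percentages(toy)
print(round(accuracy, 2), table["T1"]["T3"], table["T5"]["T2"])
```

The diagonal cells of such a table give per-tone accuracy, and the off-diagonal cells show which tones are confused with which.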
Generalised linear mixed-effects models (GLMMs) were used to analyse the perceptual evaluation data, with response (two levels: incorrect vs. correct) as the dependent variable, group (two levels: native vs. immigrant) and tone as independent variables, and speaker and listener as random intercepts. An interaction between group and tone was included as the production accuracy of the six tones may vary within each group. The GLMMs were implemented with the ‘lme4’ package [42] in R. Likelihood ratio tests were performed to check the significance of predictors, with the threshold of p-values being set at 0.05.

2.3.3. Linking the Perceptual and the Acoustic Results

To further explore the relationship between the perceptual evaluation results and the acoustic features of the L2 Cantonese tone production, we first quantified the tone production data. Following [28,43], we converted the 20 normalised F0 points for each syllable into three coefficients using the discrete cosine transform (DCT) approach, with DCT1, DCT2 and DCT3 representing mean F0 height, F0 linear slope and curvature, respectively. To provide a reference for comparison, the DCT values of each token produced by the native speakers were averaged. After that, we calculated the tonal distance between each token produced by the L2 speakers and the averaged DCT values of the same token produced by the native speakers with the following equation:
$$d = \sqrt{(\mathrm{DCT1}_m - \mathrm{DCT1}_c)^2 + (\mathrm{DCT2}_m - \mathrm{DCT2}_c)^2 + (\mathrm{DCT3}_m - \mathrm{DCT3}_c)^2}$$
where $d$ is the calculated tonal distance, $\mathrm{DCT1}_m$, $\mathrm{DCT2}_m$ and $\mathrm{DCT3}_m$ represent the DCT values of a particular trial produced by Mandarin speakers, and $\mathrm{DCT1}_c$, $\mathrm{DCT2}_c$ and $\mathrm{DCT3}_c$ stand for the averaged DCT values of the same token produced by native Cantonese speakers. For example, for our first Mandarin speaker, the three DCT coefficients for the Cantonese T1 production in a specific situation (syllable: /fu/; repetition: first production; context: in isolation) are 6.02, 0.16, and −0.39, respectively. The three DCT coefficients of the same Cantonese T1 token from the Cantonese reference group are 6.70, 0.13 and 0.05, respectively. The Euclidean distance between the two points in the three-dimensional space is 0.81.
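The two steps above (reducing a contour to three DCT coefficients, then measuring the distance to the native reference) can be sketched as follows. The function names are hypothetical. The scaling of the DCT is not stated in the text, so the version below, in which coefficient 0 equals the contour mean, is an assumption and will not necessarily reproduce the magnitudes reported above. The distance function follows the equation directly and is checked against the worked example.

```python
import math


def dct_coefficients(contour, n_coeffs=3):
    """First n_coeffs DCT-II coefficients of a contour.

    Assumed scaling: divide by the number of points, so coefficient 0 is
    the contour mean. Coefficients 1 and 2 then index the linear slope
    and the curvature of the contour.
    """
    n = len(contour)
    return [
        sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(contour)) / n
        for k in range(n_coeffs)
    ]


def tonal_distance(learner_dct, reference_dct):
    """Euclidean distance between two (DCT1, DCT2, DCT3) triples."""
    return math.sqrt(sum((m - c) ** 2 for m, c in zip(learner_dct, reference_dct)))


# Worked example from the text: learner token vs. native reference average.
d = tonal_distance([6.02, 0.16, -0.39], [6.70, 0.13, 0.05])
print(round(d, 2))

# A flat 20-point contour has only a height component; a rising one also has slope.
flat = dct_coefficients([3.0] * 20)
rising = dct_coefficients([0.1 * i for i in range(20)])
```

With this sign convention a rising contour yields a negative second coefficient, because the first cosine basis function decreases over the course of the syllable.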
To explore the relationship between the perceptual and the acoustic results, GLMMs were employed to investigate whether the acoustic results could predict the perceptual evaluations (correct or incorrect) of native-speaking listeners, with the DCT values (DCT1, DCT2 and DCT3) and the tonal distance each included as the predictor for separate models. We first fitted models with data of all six tones, and then fitted models with each tone separately. Again, we included by-speaker and by-listener random intercepts.

3. Results

This section reports the main findings of this project. Section 3.1 presents the data from acoustic analysis based on different GAMs and post hoc pairwise comparisons. In Section 3.2, the perceptual evaluation data on tone production are outlined, and both ranking and GLMM results are provided. Lastly, Section 3.3 is an attempt to examine the relationship between perceptual evaluation results and acoustic features.

3.1. Acoustic Analysis

Figure 1 presents the F0 contours of the six distinct Cantonese tones for the two groups as T values, which were generated by the GAMs. Each data point represents the F0 value calculated from one of the 20 points extracted for each token and converted to a T value. Coloured contours were used to represent different tones. As can be seen in Figure 1A, the native speakers’ productions showed a distinctive pattern, with the tones separated from each other in two forms: T1, T3 and T6 were level tones with varying heights, while T2, T4 and T5 were rising or falling tones with different shapes. The GAM with 0 as the reference revealed more linearity for level tones (EDF (effective degree of freedom): T1 = 1.161, T3 = 2.717, T6 = 2.977) and more nonlinearity for rising or falling tones (EDF: T2 = 5.023, T4 = 4.937, T5 = 4.242).
As shown in Figure 1B, although the immigrants produced a high-level T1, the remaining five tones were distributed in a lower and narrower acoustic space. No clear linear or nonlinear distinction was found for the tones that the immigrants produced, except for T1 (EDF: T1 = 1.043, T2 = 6.023, T3 = 4.734, T4 = 4.881, T5 = 3.994, T6 = 4.681). Therefore, it appeared that the five tones in the lower F0 region were not easily distinguishable. Post hoc pairwise comparisons revealed that the F0 curves of T3, T5 and T6 did not differ significantly from one another (T3–T5: β = 0.026, SE = 0.044, t = 0.597, p = 0.991; T3–T6: β = 0.046, SE = 0.045, t = 1.007, p = 0.916; T5–T6: β = 0.020, SE = 0.044, t = 0.451, p = 0.998).
A further analysis using more detailed GAMs revealed that the variable group showed significant linear effects but no significant nonlinear effects over time on T1, T2 and T6; by contrast, group showed significant linear and nonlinear effects over time on T3, T4 and T5. These results indicated that the main acoustic differences between the native Cantonese speakers and the L2 Cantonese immigrants for T1, T2 and T6 were F0 heights, whereas the main differences for T3, T4 and T5 included both F0 heights and shapes. The acoustic analyses showed that the immigrant group failed to produce Cantonese tones in a native-like manner in general, although their T1 and T2 were pronounced relatively well.

3.2. Perceptual Evaluation

The perceptual evaluation showed that the native group exhibited notably high accuracy in tone production, with an average of 82.22% (ranging from 59.44% to 99.72%). By contrast, the immigrant group displayed an average accuracy of 52.98% (ranging from 8.91% to 93.13%). Figure 2 presents the results of tone production accuracy as perceptually evaluated by native listeners, with Figure 2A showing the matrix from native speakers and Figure 2B illustrating the matrix from immigrants. The vertical y-axis and the horizontal x-axis represent the target tone and the response tone, respectively. For instance, Row 1 from Figure 2A lists the production of T1 by native speakers, with each cell from Row 1 depicting the percentage of the T1 production being perceived as pronounced as a certain tone (from T1 to T6). It can be seen that 99.2% of the T1 tokens were perceived as T1, and 0.8% of the T1 tokens were perceived as T3. The grey cells correspond to the production accuracy, and the pink cells highlight the incorrectly perceived tones with a percentage higher than 10%, i.e., the common errors. As shown in Figure 2, the order of tone production accuracy from highest to lowest was T4 > T1 > T2 > T3 > T6 > T5 for the native group and T1 > T2 > T4 > T6 > T3 > T5 for the immigrant group. The native speakers’ T1, T2 and T4 were perceived as the expected tone with near-perfect accuracy (>95%), and their T3 also showed substantial accuracy (about 80%). However, the accuracy of T6 and T5 produced by native speakers was only slightly above 50%. For the immigrants, only T1 and T2 yielded ideal results (>80%), and the remaining four tones generally failed to achieve 50% accuracy.
Next, GLMMs were fitted to examine the effects of predictors on listeners’ responses. The GLMMs revealed significant main effects of group [χ2(1) = 59.390, p < 0.001] and tone [χ2(5) = 767.108, p < 0.001], as well as a two-way interaction between group and tone [χ2(5) = 151.541, p < 0.001]. These results confirmed the observations above: the performances of the native speakers were significantly better than those of the immigrants, and the accuracy of some of the tones was significantly higher than was the accuracy of others. In addition, the pairwise comparisons showed that the native speakers’ accuracy was significantly higher than that of the immigrants for each tone at the 0.001 level, except for T6, for which the significance level was 0.05. With regard to the random effects, the speaker explained 49.37% of the variance and 70.26% of the standard deviation, implying considerable variability in tone production across individuals; the listener explained 0.26% of the variance and 5.07% of the standard deviation, suggesting consistent judgements amongst the listeners.
Regarding the confusions of each type of tone, the native speakers’ main confused tones were T3 → T6 (T3 was identified as T6), T5 → T2 and T6 → T3. These confusions heavily influenced the intelligibility of Cantonese tones, and the accuracy rate for native speakers for T3, T5 and T6 would have exceeded 90% if these confusions had been eliminated. In comparison, it was not surprising that the same confusions occurred for tones produced by the immigrants, whose notable tone confusions were T3 → T4, T4 → T6, T5 → T3, T5 → T6, and T6 → T4, all of which occurred at higher rates (by at least 15%) compared to their native counterparts’ productions. From the perspective of the perceptual evaluation, tone confusion within the immigrant group was predominantly observed between T3 and T6.

3.3. Relationship Between the Perceptual Evaluation Results and the Acoustic Features

GLMMs were fitted to investigate whether listeners’ responses could be predicted by the acoustic features of the tone production. The GLMMs revealed significant main effects of tonal distance [χ2(1) = 23.507, p < 0.001], DCT1 [χ2(1) = 5.819, p = 0.016], and DCT3 [χ2(1) = 5.940, p = 0.015] on the responses, suggesting that, in general, the tonal distances, the mean F0, and the F0 curvature were taken into account when the native-speaking listeners were making their judgements. Separate GLMMs were then fitted for each tone; the results of pairwise comparisons for each tone and variable are listed in Table 2. The main effect of tonal distance was found for T4 (p < 0.001) and T6 (p = 0.019), and there was also a marginal effect of tonal distance for T1, T2 and T5 (ps < 0.1). Furthermore, the effect of DCT1 was significant for T1, T4 and T6 (ps < 0.01), the effect of DCT2 was significant for T2, T4 and T5 (ps < 0.05), and the effect of DCT3 was significant for T1 and T4 (ps < 0.001). These statistics confirmed the relationship between the native listeners’ judgements and the acoustic features in L2 Cantonese tone production.

4. Discussion

In this study, we investigated the production of Cantonese tones by advanced Mandarin-speaking learners who had been immersed in a Cantonese-speaking environment for an average of seven years. Both acoustic analysis using computer software and perceptual evaluation by native listeners were employed to analyse the speech data. The results partially supported the two hypotheses that we proposed. Firstly, while the acoustic data showed that none of the six tones produced by L2 Cantonese speakers was native-like, their production of T1 and T2 was relatively good. This observation was confirmed by the native listeners’ judgements, as the production accuracy of T1 and T2 was above 80% and the accuracy of the remaining tones was extremely low (ranging from 8.9% to 52.5%). Secondly, our data supported the hypothesis that the perceptual evaluation accuracy (correct or incorrect) can be predicted based on the acoustic features of the tones. The effect of tonal distance on the native listeners’ perceptual evaluation was robust, and the listeners’ judgements were also influenced by the mean F0 (DCT1), the F0 slope (DCT2) and the F0 curvature (DCT3) of the produced tones.
The acoustic and perceptual results revealed differences in the L2 learners’ Cantonese tone production from that of native speakers. Note that the L2 learners had been immersed in a Cantonese-speaking environment for an average of seven years at the time of the recordings, and they were all identified as fluent speakers of Cantonese. Despite sufficient linguistic input from the environment, they still failed to acquire the tonal system of Cantonese, indicating that being speakers of an L1 tonal language does not necessarily guarantee the successful acquisition of a different tonal system in an L2 [2]. Regarding the source of difficulty, we believe that it was due to the phonetic similarity/dissimilarity between the L1 and the L2 tonal systems. As detailed in Section 1.2 and Table 1, T1 and T2 are phonetically similar in Cantonese and Mandarin, and the other tones in the two languages differ in both slope and level; thus, it is possible that the immigrants had assimilated Cantonese T1 and T2 to their L1 counterparts. Once the link had been established, the immigrants would use only one category to represent the L1 and L2 tonal category (e.g., T1 or T2). This explains the high accuracy of their T1 and T2 production, as evaluated by the Cantonese listeners. One exception was the T4 in Cantonese, which was judged as being pronounced correctly in more than 50% of the tokens. If we consider the actual pronunciation of Mandarin tones, the half-third sandhi tone in Mandarin resembles the F0 level and the slope of Cantonese T4, both of which share the value of 21. Consequently, some immigrants might also have created a merged tonal category for the Mandarin half-third sandhi tone and the Cantonese T4. This hypothesis is supported by the perceptual assimilation results from [2], where the Cantonese T4 was considered similar to the Mandarin T3 as much as 78% of the time. 
The remaining three tones, namely T3, T5 and T6, cannot be linked to an existing tonal category in Mandarin and require extra effort to be acquired. Our findings thus provide further evidence for the SLM/SLM-r and the PAM-L2 based on the suprasegmental features, and suggest that familiar phonetic categories might be easier to acquire than are unfamiliar ones.
Moreover, the immigrants appeared to have merged T3, T5 and T6 in their productions. This was evident based on our post hoc acoustic analysis (ps > 0.915) and on the native-speaking listeners’ judgements. According to the five listeners, the most frequently mispronounced patterns involving these three tones were T3 → T6, T5 → T6, and T6 → T3. The confusion of T3 and T6 was not unexpected because these two tones are reported to be in the process of being merged by some speakers [24]. When the input that the immigrants received was the merged T3–T6 category, we could not expect them to separate these two tones in their production. What was surprising was that almost one third of the immigrants’ T5 tokens were classified as T6, which was not a pattern that was evident in the native speakers’ data. This revealed the immigrants’ difficulty in correctly establishing the T5 and T6 categories because these categories are missing in their L1 system. Moreover, the acoustic and perceptual similarity between these two tones (23 versus 22) might have exceeded their linguistic input and caused confusion in their L2 category formation. Another possible explanation for the merging of T5 and T6 is the narrower F0 space in Mandarin speakers compared to Cantonese speakers [44]. When the tonal space is more limited, it becomes challenging (if not impossible) for all the tonal categories to remain distinguishable. Additional support for this point comes from a recent study of Cantonese spoken by South Asian Cantonese speakers, which also suggested a narrower tonal space compared to native speakers of Cantonese [28].
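The notion of a narrower tonal space can be illustrated with a simple measure of F0 range. The sketch below expresses the span between the lowest and highest F0 value of a speaker in semitones, using the standard conversion of 12 semitones per doubling of frequency. The function name `f0_range_semitones` and the F0 values are hypothetical and serve only as an example; this is not the measure reported in [44] or [28].

```python
import math


def f0_range_semitones(f0_values_hz):
    """Return the span between the lowest and highest F0 in semitones."""
    if not f0_values_hz or min(f0_values_hz) <= 0:
        raise ValueError("F0 values must be positive and non-empty")
    lowest, highest = min(f0_values_hz), max(f0_values_hz)
    # 12 semitones correspond to one octave, i.e. a doubling of frequency.
    return 12.0 * math.log2(highest / lowest)


# Hypothetical speakers: one spans a full octave, the other a smaller range.
wide = f0_range_semitones([100.0, 150.0, 200.0])
narrow = f0_range_semitones([120.0, 150.0, 180.0])
print(round(wide, 2), round(narrow, 2))
```

A speaker with a smaller range in semitones has less room in which to keep six tonal categories apart, which is the intuition behind the explanation above.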
There are some limitations in this study that could be addressed in future research. Firstly, although both acoustic analysis and perceptual evaluation were included to provide an appropriate assessment of the speech data, we only addressed Cantonese tone production by L2 learners in this study. Previous studies on L2 speech acquisition suggest that the development of productive abilities is not synchronous with that of perceptual abilities (e.g., [9]). To better understand L2 speech acquisition, future studies should consider employing both production and perception tasks with the same group of participants to explore the potential link between L2 speech production and perception. More specifically, discrimination or identification tasks can be implemented to study the perceptual abilities of L2 learners on non-native tones (e.g., [45,46]). Furthermore, the L1 background of the L2 participants in this study was homogeneous, as they were all native speakers of Mandarin without knowledge of any other Chinese languages before coming to Hong Kong. The participants’ homogeneous background limited the scope of potential generalisation of our findings. For instance, ref. [28] found that L2 speakers with different dominant languages may produce Cantonese tones differently, and ref. [47] suggested that quantity and quality of input across different settings should also be considered as a key feature of learners’ background. To provide a more complete picture of L2 Cantonese tone acquisition, it would be worthwhile for future studies to recruit L2 participants with a greater variety of L1 backgrounds (e.g., speakers of other tonal languages or non-tonal languages) and different degrees of exposure to Cantonese. Including learners with various L1s would also increase the generalisability of the main findings.
In conclusion, this study provided acoustic and perceptual data pertaining to Mandarin-speaking immigrants’ Cantonese tone production and demonstrated some relationship between the acoustic and perceptual measurements. In general, the Mandarin speakers’ tones were non-native-like, even though the speakers had received sufficient exposure to Cantonese input and were categorised as advanced learners of Cantonese. The findings add to the existing literature on learning L2 speech and support the hypotheses of current speech learning models. Future research is needed to investigate L2 tone acquisition with both production and perception data obtained from participants with more diverse L1s.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app15126590/s1. Table S1: Raw and converted F0 data; Table S2: Raw rating data.

Author Contributions

Conceptualisation, Y.Y.; methodology, Y.Y.; formal analysis, J.H., Y.Y. and Y.Z.; investigation, D.H. and Y.Y.; writing—original draft preparation, J.H. and Y.Y.; writing—review and editing, D.H., J.H., Y.Y. and Y.Z.; visualisation, J.H.; supervision, Y.Y.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Grants Council of the Hong Kong Special Administrative Region, grant number UGC/FDS15/H15/22, and the Acoustical Society of America to Y.Y. The APC was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Human Research Ethics Committee of Hong Kong Shue Yan University (protocol code: HREC 22-05 (M12); date of approval: 1 June 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the Supplementary Materials. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank all the informants for their participation in this study. An earlier version of this manuscript was presented at the 20th International Congress of the Phonetic Sciences and published in its proceedings [48]. The authors are grateful to the audience for their helpful comments, especially those from Xiaocong Chen, Albert Lee, Kechun Li, Youran Lin, Justin Lo, Quentin Qin, and Caicai Zhang.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mok, P.P.K.; Li, V.G.; Fung, H.S.H. Development of Phonetic Contrasts in Cantonese Tone Acquisition. J. Speech Lang. Hear. Res. 2020, 63, 95–108.
  2. Hao, Y.C. Second language acquisition of Mandarin Chinese tones by tonal and non-tonal language speakers. J. Phon. 2012, 40, 269–279.
  3. Zhu, H.; Dodd, B. The phonological acquisition of Putonghua (Modern Standard Chinese). J. Child Lang. 2000, 27, 3–42.
  4. Sorace, A.; Filiaci, F. Anaphora resolution in near-native speakers of Italian. Second Lang. Res. 2006, 22, 339–368.
  5. Saito, K. The Role of Age of Acquisition in Late Second Language Oral Proficiency Attainment. Stud. Second. Lang. Acquis. 2015, 37, 713–743.
  6. Singleton, D. The Critical Period Hypothesis: A coat of many colours. IRAL Int. Rev. Appl. Linguist. Lang. Teach. 2005, 43, 269–285.
  7. Yang, Y. First Language Attrition and Second Language Attainment of Mandarin-speaking Immigrants in Hong Kong: Evidence from Prosodic Focus. Ph.D. Thesis, The Hong Kong Polytechnic University, Hong Kong, China, 2022.
  8. Yang, Y. Acoustic Analyses of L1 and L2 Vowel Interactions in Mandarin–Cantonese Late Bilinguals. Acoustics 2024, 6, 568–578.
  9. Flege, J.E. Second Language Speech Learning: Theory, Findings, and Problems. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research; Strange, W., Ed.; York Press: Timonium, MD, USA, 1995; pp. 233–277.
  10. Flege, J.E. Interactions between the native and second-language phonetic systems. In An Integrated View of Language Development: Papers in Honor of Henning Wode; Piske, T., Rohde, A., Burmeister, P., Eds.; Wissenschaftlicher Verlag: Trier, Germany, 2002; pp. 217–244.
  11. Flege, J.E.; Bohn, O.-S. The Revised Speech Learning Model (SLM-r). In Second Language Speech Learning: Theoretical and Empirical Progress; Wayland, R., Ed.; Cambridge University Press: Cambridge, UK, 2021; pp. 3–83.
  12. Best, C.T. The emergence of native-language phonological influences in infants: A perceptual assimilation model. In The Development of Speech Perception: The Transition from Speech Sounds to Spoken Words; Goodman, J.C., Nusbaum, H.C., Eds.; The MIT Press: Cambridge, MA, USA, 1994; pp. 167–224.
  13. Best, C.T.; Tyler, M.D. Nonnative and second-language speech perception: Commonalities and complementarities. In Second Language Speech Learning: The Role of Language Experience in Speech Perception and Production; Munro, M.J., Bohn, O.-S., Eds.; John Benjamins: Amsterdam, The Netherlands, 2007; pp. 13–34.
  14. Weinreich, U. On the Description of Phonic Interference. Word 1957, 13, 1–11.
  15. Chang, C.B. A novelty effect in phonetic drift of the native language. J. Phon. 2013, 41, 520–533.
  16. Major, R.C. Losing English as a First Language. Mod. Lang. J. 1992, 76, 190–208.
  17. Simonet, M. Production of a Catalan-specific vowel contrast by early Spanish-Catalan bilinguals. Phonetica 2011, 68, 88–110.
  18. Zhang, X. Dialect MT: A case study between Cantonese and Mandarin. In Proceedings of the COLING 1998, Montreal, Canada, 10–14 August 1998; Volume 2, pp. 1460–1464.
  19. Bauer, R.S.; Benedict, P.K. Modern Cantonese Phonology; Walter de Gruyter: Berlin, Germany, 1997.
  20. Chao, Y.R. Mandarin Primer; Harvard University Press: Cambridge, MA, USA, 1948.
  21. Chen, S.; He, Y.; Wayland, R.; Yang, Y.; Li, B.; Yuen, C.W. Mechanisms of tone sandhi rule application by tonal and non-tonal non-native speakers. Speech Commun. 2019, 115, 67–77.
  22. Fung, R.S.Y.; Lee, C.K.C. Tone mergers in Hong Kong Cantonese: An asymmetry of production and perception. J. Acoust. Soc. Am. 2019, 146, EL424–EL430.
  23. Zhang, J. Tone mergers in Cantonese: Evidence from Hong Kong, Macao, and Zhuhai. Asia-Pacific Lang. Var. 2019, 5, 28–49.
  24. Mok, P.P.K.; Zuo, D.; Wong, P.W.Y. Production and perception of a sound change in progress: Tone merging in Hong Kong Cantonese. Lang. Var. Chang. 2013, 25, 341–370.
  25. Mok, P.P.K.; Fung, H.S.H.; Li, V.G. Assessing the Link Between Perception and Production in Cantonese Tone Acquisition. J. Speech Lang. Hear. Res. 2019, 62, 1243–1257.
  26. Yao, Y.; Chan, A.; Fung, R.; Wu, W.L.; Leung, N.; Lee, S.; Luo, J. Cantonese tone production in pre-school Urdu–Cantonese bilingual minority children. Int. J. Biling. 2020, 24, 767–782.
  27. Mok, P.P.K.; Lee, A. The acquisition of lexical tones by Cantonese–English bilingual children. J. Child Lang. 2018, 45, 1357–1376.
  28. Yu, A.C.L.; Lee, C.W.T.; Lan, C.; Mok, P.P.K. A New System of Cantonese Tones? Tone Perception and Production in Hong Kong South Asian Cantonese. Lang. Speech 2022, 65, 625–649.
  29. Meisel, J.M. First and Second Language Acquisition: Parallels and Differences; Cambridge University Press: New York, NY, USA, 2011.
  30. Birdsong, D.; Gertken, L.M.; Amengual, M. Bilingual Language Profile: An Easy-to-Use Instrument to Assess Bilingualism. COERLL, University of Texas at Austin. 2012. Available online: https://sites.la.utexas.edu/bilingual/ (accessed on 1 January 2023).
  31. Zou, Y.; Yang, Y.; Han, D. The effects of language dominance on the L1 and L2 tone production of Mandarin–Cantonese bilinguals. JASA Express Lett. 2024, 4, 125201.
  32. Faul, F.; Erdfelder, E.; Lang, A.-G.; Buchner, A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 2007, 39, 175–191.
  33. Schneider, W.; Eschman, A.; Zuccolotto, A. E-Prime User’s Guide; Psychological Software Tools Inc.: Pittsburgh, PA, USA, 2012.
  34. Audacity Team. Audacity(R): Free Audio Editor and Recorder. 2019. Available online: https://audacityteam.org/ (accessed on 1 January 2023).
  35. Wang, Y.; Jongman, A.; Sereno, J.A. Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training. J. Acoust. Soc. Am. 2003, 113, 1033–1043.
  36. Boersma, P.; Weenink, D. Praat: Doing Phonetics by Computer; Blackwell Publishers Ltd.: Malden, MA, USA, 2015.
  37. Xu, Y. ProsodyPro—A tool for large-scale systematic prosody analysis. In Proceedings of the TRASP’2013, Aix-en-Provence, France, 30 August 2013; pp. 7–10.
  38. Wood, S.N. Generalized Additive Models: An Introduction with R; Chapman and Hall/CRC: London, UK, 2017.
  39. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2018; Available online: https://www.r-project.org (accessed on 1 January 2023).
  40. RStudio Team. RStudio: Integrated Development for R; RStudio, Inc.: Boston, MA, USA, 2016; Available online: http://www.rstudio.com/ (accessed on 1 January 2023).
  41. Mangold, P. ICC Calculation Software, Based on Wirtz & Caspar 2002. 2018. Available online: www.mangold-international.com (accessed on 1 January 2023).
  42. Bates, D.; Mächler, M.; Bolker, B.; Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 2015, 67, 1–48.
  43. Laméris, T.J.; Li, K.K.; Post, B. Phonetic and Phono-Lexical Accuracy of Non-Native Tone Production by English-L1 and Mandarin-L1 Speakers. Lang. Speech 2023, 66, 974–1006.
  44. Yang, Y.; Chen, S.; Chen, X. F0 Patterns in Mandarin Statements of Mandarin and Cantonese Speakers. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 4163–4167.
  45. Qin, Z.; Lee-Kim, S.-I.; Qi, H. The effect of second-language learning experience on Korean listeners’ use of pitch cues in the perception of Cantonese tones. Second Lang. Res. 2024.
  46. Cao, M.; Pavlik, P.I.; Bidelman, G.M. Enhancing lexical tone learning for second language speakers: Effects of acoustic properties in Mandarin tone perception. Front. Psychol. 2024, 15, 1403816.
  47. Lin, Y.; Pollock, K.E.; Li, F. Speech Production of Mandarin Lexical Tones Among Canadian Elementary Students Enrolled in Mandarin–English Bilingual Schools. J. Speech Lang. Hear. Res. 2025, 68, 435–455.
  48. Yang, Y.; Han, D.; Wong, S.M.; Chan, C.S.; Leung, C.Y.; Chen, X. Production of Cantonese tones by Mandarin-speaking immigrants: Acoustic and perceptual measurements. In Proceedings of the ICPhS 2023, Prague, Czech Republic, 7–11 August 2023; pp. 1970–1974.
Figure 1. F0 contours of the six Cantonese tones produced by the native speakers (A) and the immigrants (B).
Figure 2. Production matrix of the six Cantonese tones produced by the native speakers (A) and the immigrants (B). The grey cells represent the production accuracy for each tone and the pink cells highlight the incorrectly perceived tones with a percentage higher than 10%.
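As an illustration of how a production matrix of this kind can be derived, the sketch below tabulates listener judgements as pairs of intended and perceived tones, converts each row to percentages, and lists the confusions that exceed the 10% threshold used for highlighting. The function names and the toy judgements are hypothetical; they are not the rating data of this study (see Table S2 for the raw ratings).

```python
from collections import Counter, defaultdict


def production_matrix(judgements):
    """Build {intended: {perceived: percentage}} from (intended, perceived) pairs."""
    counts = defaultdict(Counter)
    for intended, perceived in judgements:
        counts[intended][perceived] += 1
    matrix = {}
    for intended, row in counts.items():
        total = sum(row.values())
        matrix[intended] = {tone: 100.0 * n / total for tone, n in row.items()}
    return matrix


def frequent_confusions(matrix, threshold=10.0):
    """List (intended, perceived, percentage) for errors above the threshold."""
    return sorted(
        (intended, perceived, pct)
        for intended, row in matrix.items()
        for perceived, pct in row.items()
        if perceived != intended and pct > threshold
    )


# Hypothetical judgements: ten tokens each of T1 and T5.
toy_judgements = (
    [("T1", "T1")] * 10
    + [("T5", "T5")] * 6
    + [("T5", "T6")] * 3
    + [("T5", "T2")] * 1
)
matrix = production_matrix(toy_judgements)
confusions = frequent_confusions(matrix)
print(matrix["T5"]["T5"], confusions)
```

In this toy example the diagonal entry for T5 is its production accuracy, and only the T5 → T6 confusion passes the threshold, because the single T5 → T2 error accounts for exactly 10% of the tokens.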
Table 1. Tonal systems in Cantonese and Mandarin. There are six lexical tones in Cantonese and four in Mandarin.
| Language | Tone Name | Tone Category | Tone Letter | Example |
|---|---|---|---|---|
| Cantonese | Tone 1 | High Level | 55 | si55 ‘teacher’ |
| Cantonese | Tone 2 | High Rising | 25 | si25 ‘history’ |
| Cantonese | Tone 3 | Mid Level | 33 | si33 ‘test’ |
| Cantonese | Tone 4 | Mid-low Falling | 21 | si21 ‘time’ |
| Cantonese | Tone 5 | Mid-low Rising | 23 | si23 ‘market’ |
| Cantonese | Tone 6 | Mid-low Level | 22 | si22 ‘matter’ |
| Mandarin | Tone 1 | High Level | 55 | ma55 ‘mother’ |
| Mandarin | Tone 2 | High Rising | 35 | ma35 ‘hemp’ |
| Mandarin | Tone 3 | Low Dipping | 214 | ma214 ‘horse’ |
| Mandarin | Tone 4 | High Falling | 51 | ma51 ‘to scold’ |
Table 2. Summary statistics of main effects on native listeners’ responses from the GLMMs (significant effects have been highlighted in bold).
| Tone | Tonal Distance | DCT1 | DCT2 | DCT3 |
|---|---|---|---|---|
| T1 | χ²(1) = 2.907, p = 0.088 | **χ²(1) = 8.497, p = 0.004** | χ²(1) = 0.009, p = 0.924 | **χ²(1) = 14.359, p < 0.001** |
| T2 | χ²(1) = 3.136, p = 0.077 | χ²(1) = 0.386, p = 0.535 | **χ²(1) = 6.107, p = 0.013** | χ²(1) = 2.559, p = 0.110 |
| T3 | χ²(1) = 2.490, p = 0.115 | χ²(1) = 1.120, p = 0.290 | χ²(1) = 2.086, p = 0.149 | χ²(1) = 0.208, p = 0.648 |
| T4 | **χ²(1) = 10.905, p < 0.001** | **χ²(1) = 15.626, p < 0.001** | **χ²(1) = 6.526, p = 0.011** | **χ²(1) = 13.474, p < 0.001** |
| T5 | χ²(1) = 3.048, p = 0.080 | χ²(1) = 0.020, p = 0.889 | **χ²(1) = 5.998, p = 0.014** | χ²(1) = 0.712, p = 0.399 |
| T6 | **χ²(1) = 5.522, p = 0.019** | **χ²(1) = 7.790, p = 0.005** | χ²(1) = 0.213, p = 0.645 | χ²(1) = 1.788, p = 0.181 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Y.; Hou, J.; Zou, Y.; Han, D. Acoustic Analysis and Perceptual Evaluation of Second Language Cantonese Tones Produced by Advanced Mandarin-Speaking Learners. Appl. Sci. 2025, 15, 6590. https://doi.org/10.3390/app15126590

Chicago/Turabian Style

Yang, Yike, Jie Hou, Yue Zou, and Dong Han. 2025. "Acoustic Analysis and Perceptual Evaluation of Second Language Cantonese Tones Produced by Advanced Mandarin-Speaking Learners" Applied Sciences 15, no. 12: 6590. https://doi.org/10.3390/app15126590
