Within and Across-Language Comparison of Vocal Emotions in Mandarin and English

This study reports experimental results on whether the acoustic realization of vocal emotions differs between Mandarin and English. Prosodic cues, spectral cues and articulatory cues obtained with an electroglottograph (EGG) were compared for five emotions (anger, fear, happiness, sadness and neutral) within and across Mandarin and English through a production experiment. Results of the within-language comparison demonstrated that each vocal emotion had specific acoustic patterns in each language. Normalized data were then used in the across-language comparison. Results indicated that Mandarin and English used pitch differently for encoding emotions: the differences in pitch variation between neutral and the other emotions were significantly larger in English than in Mandarin. However, the variations of speech rate and certain phonation cues (e.g., cepstral peak prominence (CPP) and contact quotient (CQ)) were significantly greater in Mandarin than in English. The differences in emotional speech between the two languages may be due to the restriction of pitch variation by the presence of lexical tones in Mandarin. This study suggests that when a certain cue (e.g., pitch) is restricted in one language, other cues are strengthened to take on the responsibility of differentiating vocal emotions. Therefore, we posit that the acoustic realizations of emotional speech are multidimensional.


Introduction
Human speech is believed to be a reliable means of conveying a speaker's emotional state [1]. In the past, researchers looked for acoustic cues that could clearly distinguish different vocal emotions (e.g., [2-4], among many others), mostly anger, fear, happiness and sadness from a core set of basic emotions [5]. Much of the previous research is limited to prosodic cues including fundamental frequency (f0), duration and intensity [2,6], which are considered to be sufficient to differentiate vocal emotions [7]. However, studies focusing on emotions such as anger and happiness report similar prosodic patterns with high mean pitch, fast speech rate and high intensity [2], meaning that it is difficult to distinguish these two emotions using only prosodic cues. Moreover, evidence from research on emotional speech synthesis and recognition demonstrates that multiple cues in addition to f0, duration and intensity are necessary for the acoustic realization of emotional speech [8,9]. Thus, an understanding of the acoustic mechanisms underlying vocal emotions that is based on prosodic cues alone remains incomplete.
Recent trends in the acoustic analysis of vocal emotions have led to examinations of phonation-related spectral cues. Patel et al. (2011) [10] investigated the mapping between phonation cues and emotion dimensions in French speech. Results indicated three underlying components (tension, perturbation and voicing frequency), which were claimed to be related to the emotion dimensions of arousal, potency and the ability to control, respectively. Xu et al. (2013) [11] proposed a set of spectral and f0 cues, named BID measurements, to test the size code hypothesis using synthetic speech. Results revealed that listeners could infer emotions from the changes of cues caused by the physiological structure of a speaker. Similar ideas are also found in References [12-14]. Wang et al. (2014) [15] highlighted the importance of phonation cues in differentiating vocal emotions in Mandarin. By adding phonation-related spectral cues to a multi-dimensional scaling acoustic space, different emotions were separated clearly. Although these studies have enriched the knowledge of vocal expression of emotions, few attempts have been made to analyse the different roles played by prosodic cues and phonation cues in differentiating vocal emotions.
Another important issue is whether the vocal expression of emotions is universal across languages or language-specific. Many studies have attempted to untangle this debate through cross-cultural perception experiments by asking subjects from different language backgrounds to identify the underlying emotions expressed by speakers from either the same or a different language background (e.g., [16-20]). Results showed that listeners could successfully recognize vocal emotions expressed in non-native languages with no great variation in perception except for an in-group advantage when judging emotions expressed in their native languages [7]. Based on the evidence in perception studies, vocal expression of emotions seems to be universal across languages.
Unlike perception, only a limited number of studies have investigated the cross-language vocal expression of emotions through production experiments. This may be due to the difficulties in controlling the elicitation of emotions in different languages and in designing experimental paradigms. Among the few studies, Pell et al. (2009) [7] compared acoustic parameters of six vocal emotions in English, German, Hindi and Arabic. Results showed that mean pitch, pitch range and speech rate contributed in a similar way to differentiating vocal emotions across languages. In contrast, Anolli et al. (2008) [21] investigated the vocal expression of emotions in Chinese and Italian and found differences in the specific patterns of acoustic cues, including f0, duration and intensity, between the two languages. For example, their results showed that vocal emotions in Chinese had more restrained variations in acoustic cues among different emotions than those in Italian, indicating that the production of vocal emotions may be influenced by culture. Similarly, Li et al. (2013) [19] analysed the f0 features of vocal emotions across Mandarin and Japanese. Results indicated that these two languages had distinct f0 features, such as lower pitch for sadness in Mandarin but higher pitch in Japanese, which was claimed to reflect the influence of different language structures. Considering all of this evidence, there is still disagreement in production studies as to whether vocal expression of emotions is universal or language-specific.
A separate body of literature in production has investigated the influence of the lexical tone system in tonal languages on marking vocal emotions. The motivation to examine this possible influence arose from the hypothesis that the existence of a lexical tone system may restrict the paralinguistic use of pitch [22,23]. Ross et al. (1986) [24] found that the manipulation of f0 measures in emotional prosody was restricted in tonal languages, including Taiwanese, Mandarin and Thai, when compared to English as a non-tonal language. In References [25,26], prosodic cues of vocal emotions in Mandarin were analysed across different tone groups. A restriction of pitch variation for vocal emotions was found in sentences made up entirely of high-level tone syllables, because the pitch of a high-level tone must be quasi-static to maintain its tonal structure. This indicates that the acoustic patterns of the same emotion could differ even within a language. Likewise, Chong et al. (2015) [27] reported a restriction of f0 variation in signalling vocal emotions in Cantonese, a tonal language, when compared to English, a non-tonal language.
Taken together, the established literature has some limitations. First, a full discussion of comprehensive acoustic cues, including both prosodic and phonation cues, needs to be made within one research paradigm. Much of the previous research emphasizes prosodic cues only in encoding vocal emotions [2], especially in cross-language studies, leaving another small set of studies to focus on phonation cues separately [28,29], which limits the full understanding of mechanisms that underpin vocal emotions. Second, when examining the phonation mechanisms of vocal emotions using spectral measures, the noise in the speech signal caused by poor recording quality may lead to unreliable measurement. In addition, the short distance between the first harmonic and the first formant in the spectrum adds difficulties in extracting the spectral measures. Moreover, it is difficult to confirm the physiological mechanism of voice production since different physiological mechanisms may have similar acoustic performances with spectral measures. For example, breathy voice and soft voice both have a relatively large H1-H2 value [30]. Thus, an EGG experiment is worth doing to directly observe glottal activities and, more importantly, to supplement spectral measures. Third, very little is currently known about how other cues change if pitch variation is restricted when encoding vocal emotions in tonal languages such as Mandarin.
With these issues in mind, the current study aims to improve the understanding of the underlying mechanisms of vocal emotions in a tonal language (Mandarin) and a non-tonal language (English) through the collective analysis of prosodic cues, spectral cues and EGG cues. These two languages were selected because they are typical of their type and have distinctive language structures. Moreover, these two languages have been widely studied, which means our results can be compared to the existing literature. By adopting this approach, we attempt to answer the following research questions: (a) whether Mandarin and English have different patterns with respect to prosodic cues, spectral cues and EGG cues in vocal expression of emotions; (b) whether Mandarin shows a restricted pitch variation in encoding vocal emotions compared to English; and (c) if so, how other cues change to supplement the restricted use of pitch. The preliminary results of this study were presented at the 8th International Conference on Speech Prosody, as shown in Reference [31].

Speech Materials
Speech materials were developed using 15 declarative sentences. They were first designed in Chinese and then translated into English. Each target sentence was semantically neutral and suitable for conveying different kinds of emotions. The sentences were embedded in specific contexts to reflect five different emotions: anger, fear, happiness, sadness and neutral. Table 1 lists a sample target sentence in four emotional contexts in two languages.

Subjects
We recruited five native speakers (two males and three females) for each language. All speakers were graduate students at the University of Pennsylvania. Mandarin speakers had spent less than a year in the U.S. at the time of recording. They all had acting and public speaking experience. Participants signed a consent form before the experiment and received ten dollars as compensation for their time. Participants reported no problems with their speech and hearing.

Recording Procedure
Recordings were conducted in a sound-proof booth in the Department of Linguistics at the University of Pennsylvania. We obtained simultaneous electroglottograph (EGG) and audio recordings from all speakers. Audio recordings were made electronically and saved directly on a computer as 16-bit wave files at a sampling rate of 44.1 kHz, using a Glottal Enterprises M80 omnidirectional headset microphone. EGG data were obtained using a two-channel Glottal Enterprises Electroglottograph, model EG2. During the recordings, we presented speech materials on PowerPoint slides.
Each emotion was recorded in a separate block and speakers were offered a break between blocks for a smooth transition from one emotion to another. Neutral sentences were produced in isolation, and those with anger, fear, happiness and sadness were embedded in a dialogue setting in which speakers conducted a dialogue with a native speaker. This dialogue setting enabled speakers to express different emotions in a natural way. In total, we collected 750 sentences (= 15 sentences × 5 speakers × 2 repetitions × 5 emotions) for each language.

Listening Tests
Listening tests were conducted with the Qualtrics online survey tool [32], using the respective Mandarin and English stimuli, to confirm that the intended emotions were accurately produced. For each language, the stimuli were divided into five sets, one for each of the five speakers. Native Mandarin and English listeners were recruited online and the number of listeners for each set was at least 15. During the listening tests, participants were asked to listen carefully to randomized audio stimuli and select the most appropriate emotion in a five-choice task (i.e., anger, happiness, fear, neutral and sadness) by clicking with a computer mouse.
Recordings whose identification rate was less than 60% were excluded. This threshold is three times the 20% chance level of the five-choice task [7]. In order to compare five emotions with parallel texts in each language, we chose 530 perceptually valid recordings (106 sentences × 5 emotions) in Mandarin and 300 perceptually valid recordings (60 sentences × 5 emotions) in English.
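The exclusion rule can be illustrated with a minimal sketch. The data layout, the recording identifiers and the function name `is_valid` are hypothetical and are used only for illustration; the 60% cut-off and the five response options come from the text.

```python
def is_valid(intended, responses, threshold_percent=60):
    """True if at least `threshold_percent` of responses match `intended`."""
    if not responses:
        raise ValueError("recording has no listener responses")
    hits = sum(1 for r in responses if r == intended)
    # Integer comparison avoids floating-point trouble at exactly 60%.
    return hits * 100 >= threshold_percent * len(responses)


# Hypothetical listener responses (15 listeners per recording).
recordings = [
    {"id": "s01_anger", "intended": "anger",
     "responses": ["anger"] * 12 + ["happiness"] * 3},  # 12/15 correct
    {"id": "s02_fear", "intended": "fear",
     "responses": ["fear"] * 7 + ["sadness"] * 8},      # 7/15 correct
]

# Keep only the perceptually valid recordings.
kept = [rec for rec in recordings
        if is_valid(rec["intended"], rec["responses"])]
```

In this toy data set only the first recording passes the criterion; the second falls below 60% and would be excluded.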

Measurements
Automatic segmentation of recordings was made using SPPAS [33]. Three tiers (phoneme tier, syllable tier and sentence tier) were generated and manually corrected afterward. A sequence of pitch target points was detected by the Momel algorithm [34] using Praat. Pitch contours were modelled as continuous smooth curves, interpolated quadratically from the pitch target points of each utterance in order to eliminate microprosodic effects. Based on the Momel outputs, a set of prosodic cues was measured to quantify the degree of pitch movements as a function of time.
Seven prosodic cues were generated by a Praat script: (1) Number of syllables per second (Speech Rate); (2) Mean intensity of each sentence (Mean Intensity); (3) Mean pitch of each sentence (Mean Pitch); (4) Pitch range of each sentence (Pitch Range); (5) Average absolute difference between two adjacent pitch target points divided by distance in seconds (Mean Absolute Slope, MAS hereafter), reflecting the frequency of pitch movements [35]; (6) Average pitch difference between two adjacent pitch target points for rise and fall separately in each sentence (Rise, Fall), which determine the degree of pitch raising and pitch falling [35]; and (7) Average slope for rise and fall separately in each sentence (Rise Slope, Fall Slope), indicating the speed of pitch raising and pitch falling [35]. These cues portrayed both global and local pitch movements of each sentence. Pitch-related cues were normalized to eliminate individual differences using the OMe (Octave-Median) scale [36] by applying the following equation:

$$\mathrm{OMe} = \log_2\left(\frac{\mathrm{Hz}}{\mathrm{Median}}\right)$$

where Hz is a raw pitch value and Median indicates the median of a speaker's pitch range.
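As an illustration of the local pitch-movement cues, the following sketch computes MAS, Rise, Fall, Rise Slope and Fall Slope from a list of pitch target points. It is not the Praat script used in the study. The function names, the toy target points, the use of the utterance median as a stand-in for the speaker's median, the reporting of falls as positive magnitudes and the octave form log2(Hz/Median) assumed for the OMe scale are all assumptions made for this example.

```python
import math
from statistics import mean, median


def to_ome(hz, speaker_median_hz):
    """Pitch in octaves relative to a median (assumed form of the OMe scale)."""
    return math.log2(hz / speaker_median_hz)


def pitch_movement_cues(targets):
    """Compute local pitch-movement cues from (time_s, pitch) target points.

    `targets` must be sorted by time. Fall and Fall Slope are returned as
    positive magnitudes, which is an assumption of this sketch.
    """
    steps = [(p2 - p1, t2 - t1)
             for (t1, p1), (t2, p2) in zip(targets, targets[1:])]
    rises = [(dp, dt) for dp, dt in steps if dp > 0]
    falls = [(-dp, dt) for dp, dt in steps if dp < 0]

    def avg(values):
        values = list(values)
        return mean(values) if values else 0.0

    return {
        "mas": avg(abs(dp) / dt for dp, dt in steps),
        "rise": avg(dp for dp, _ in rises),
        "fall": avg(dp for dp, _ in falls),
        "rise_slope": avg(dp / dt for dp, dt in rises),
        "fall_slope": avg(dp / dt for dp, dt in falls),
    }


# Hypothetical Momel-style targets (seconds, Hz) for one utterance.
targets_hz = [(0.0, 200.0), (0.5, 400.0), (1.0, 200.0), (1.5, 100.0)]
# Stand-in for the speaker's median pitch (200 Hz for this toy data).
speaker_median = median(p for _, p in targets_hz)
targets_ome = [(t, to_ome(p, speaker_median)) for t, p in targets_hz]
cues = pitch_movement_cues(targets_ome)
```

With these toy targets the contour rises by one octave and then falls twice by one octave, each over 0.5 s, so Rise and Fall are both 1 octave and all three slope measures are 2 octaves per second.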
Next, three kinds of EGG cues were extracted using EggWorks [41], including (1) Contact Quotient (CQ), illustrating the duration of the vocal fold contact during one vocal fold period [42]; (2) Peak Increase in Contact (PIC), the peak positive value in the EGG derivative, indicating the highest speed of vocal fold contact [43]; and (3) Speed Quotient (SQ), the ratio between closing duration and opening duration, reflecting the asymmetry of the EGG pulses [44]. These cues are indicators of the physiological mechanism of vocal fold vibration during speech production.
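To make the notion of contact quotient concrete, here is a minimal sketch that estimates CQ for a single EGG period with a simple amplitude-threshold criterion. This is not the EggWorks procedure; the function name, the 25% threshold and the sample values are assumptions chosen only to illustrate the idea of "proportion of the period spent in contact".

```python
def contact_quotient(period, threshold=0.25):
    """Fraction of one EGG period in which the signal exceeds a threshold.

    `period` holds the EGG samples of one glottal cycle (higher = more
    contact). The contact level is set at `threshold` of the peak-to-peak
    amplitude, a simplification assumed for this sketch.
    """
    lo, hi = min(period), max(period)
    if hi == lo:
        raise ValueError("flat EGG period: no contact phase can be defined")
    level = lo + threshold * (hi - lo)
    in_contact = sum(1 for sample in period if sample > level)
    return in_contact / len(period)


# Hypothetical EGG samples for one cycle (arbitrary units).
one_cycle = [0, 0, 0, 0, 1, 2, 3, 2, 1, 0]
cq = contact_quotient(one_cycle)
```

Here five of the ten samples lie above the contact level, so the estimated CQ is 0.5. A longer contact phase within the cycle would yield a larger value.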
In summary, in order to assess the vocal expression of emotion in Mandarin and English, we measured a total of 20 cues, which were then converted to z-scores separately for each speaker, pooling all emotions.
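The per-speaker standardization can be sketched as follows. The row layout and the function name are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which variant was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_speaker(rows, cue):
    """Return z-scores of `cue`, standardized within each speaker.

    `rows` is a list of dicts with a "speaker" key and the cue name as a
    key. All emotions of a speaker are pooled for the mean and SD.
    """
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker"]].append(row[cue])

    # Population SD is an assumption of this sketch.
    stats = {spk: (mean(vals), pstdev(vals))
             for spk, vals in by_speaker.items()}

    scores = []
    for row in rows:
        mu, sd = stats[row["speaker"]]
        scores.append((row[cue] - mu) / sd if sd > 0 else 0.0)
    return scores


# Hypothetical speech-rate values (syllables per second).
rows = [
    {"speaker": "M1", "emotion": "neutral", "speech_rate": 4.0},
    {"speaker": "M1", "emotion": "anger", "speech_rate": 6.0},
    {"speaker": "E1", "emotion": "neutral", "speech_rate": 5.0},
    {"speaker": "E1", "emotion": "anger", "speech_rate": 5.5},
]
z = zscore_by_speaker(rows, "speech_rate")
```

Because each speaker is standardized against their own mean and spread, the two speakers end up on the same scale even though their raw speech rates differ.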

Results
In this section, we first looked at whether all the cues were significantly different between emotions within each language. After that, normalized data were used to compare the different acoustic and articulatory patterns of vocal emotions across languages.

Within-Language Comparison of Vocal Emotions
Table 2 presents the normalized mean values for each emotion in Mandarin and English. To test whether each of the 20 cues discriminated between the five emotions within each language, we fitted linear mixed-effects models with the lmerTest package [45] in R [46]. Each cue was fitted with a model for each language, where emotion was used as a fixed factor and speaker was treated as a random factor. Kolmogorov-Smirnov (K-S) tests were used to assess the normality of the residuals of the models. Results supported the normality assumption for most of the variables, except for the residuals of range (p = 0.001), rise (p = 0.003), fall (p = 0.018), rise slope (p = 0.001), fall slope (p = 0.023), PIC (p = 0.006) and SQ (p < 0.001) in Mandarin and rise (p < 0.001), fall (p = 0.013), rise slope (p < 0.001) and SQ (p = 0.007) in English. We attempted a data transformation (e.g., log transformation) to improve the normality of the residuals, but more variables then violated the normality assumption and the results became less straightforward to interpret. We therefore did not apply any data transformation, because the assumption of normality is of little importance in the regression model [47]. Here we present the results in the order of prosodic cues, spectral cues and EGG cues for each language.
To understand the physiological mechanism of phonation for different emotional expressions, we looked at the EGG results of five emotions in Mandarin. Figure 1 illustrates the pitch-normalized EGG waveform of five emotions. The nature of the EGG waveform differed across emotions. Neutral had a typical waveform of modal phonation, with a left-skewed EGG signal, indicating a slightly longer duration of the vocal fold opening period than the closing period. Likewise, the waveform of happiness showed a similar pattern of modal phonation. Anger had a larger vocal fold contact area and slower vocal fold closing, which was similar to tense phonation. The shape of the pulse for fear was symmetrical, indicating a comparatively equal vocal fold opening duration and closing duration and a small contact area, suggesting a breathy phonation. Sadness showed a slight degree of breathy phonation.
Appl. Sci. 2018, 10, x FOR PEER REVIEW
Three EGG parameters, including CQ, PIC and SQ, were extracted to quantify the different patterns of EGG waveform mentioned above. Statistical results confirmed our observation that the influence of emotion was significant for CQ (χ² = 185.98, df = 4, p < 0.001), PIC (χ² = 70.07, df = 4, p < 0.001) and SQ (χ² = 107.92, df = 4, p < 0.001).
Table 3 illustrates the post-hoc multiple comparison (Tukey) results between emotions for each language. In Mandarin, each emotion pair can be distinguished by most of the prosodic cues. However, for neutral and sadness, there was no significant difference in terms of mean absolute slope, rise and fall slope, and pitch range only had a marginal difference. Likewise, fear and sadness had no significant difference in rise and only a marginal difference in fall. These results suggested that these two pairs of emotions were produced with a similar degree of local pitch movements. For spectral cues, happiness and anger displayed similar patterns given the paucity of significant differences on H1, H2, H4, H1-H2, H1-A1, H1-A2 and H1-A3, indicating that these two emotions had a similar spectral distribution. For EGG cues, CQ, the contact quotient of the vocal folds, indicated a significant difference among all emotions, suggesting that different emotions were expressed with different phonation mechanisms.
Finally, statistical analysis for English yielded a significant emotion effect on all EGG cues: CQ (χ² = 8.61, df = 4, p < 0.001), PIC (χ² = 4.49, df = 4, p < 0.01) and SQ (χ² = 27.60, df = 4, p < 0.001). As illustrated in Figure 2, the phonation mechanism of some emotions could be identified from the EGG waveforms. The most prominent one was fear, which showed a symmetrical shape, indicating a breathy phonation. Anger had a slightly larger vocal fold contact area and slower vocal fold closing, which was close to tense phonation. However, compared with the EGG waveforms of Mandarin in Figure 1, the contrasts in Figure 2 are smaller.
A post-hoc multiple comparison procedure was conducted to determine which emotions were significantly different. The output of the multiple comparison analysis is shown in Table 3. For prosodic cues, sadness and neutral differed only in mean pitch, pitch range and intensity, indicating that they were expressed with similar degrees of pitch fluctuation and speech rate. It is worth mentioning that for speech rate, only sadness was significantly different from anger, fear and happiness, respectively. For spectral cues, happiness and anger exhibited a similar spectral distribution, as did sadness and neutral, since only a few cues differed within each pair. As for EGG cues, all contrasts were significantly different in terms of SQ except neutral and fear. The CQ of fear was significantly different from all other emotions, suggesting that fear had a different phonation mechanism.
In the above analysis, we examined the emotional effect for prosodic cues, spectral cues and EGG cues within Mandarin and English separately. Results showed that in both Mandarin and English, the five vocal emotions had different acoustic patterns and phonation mechanisms, indicating that prosodic cues, spectral cues and EGG cues were all indispensable in distinguishing these emotions. A further question is whether language-specific differences exist between Mandarin vocal emotions and English vocal emotions. It is noteworthy that we did not compare these results across languages directly. For example, it is meaningless to compare the mean pitch of anger in Mandarin with the mean pitch of anger in English. Although the same experimental paradigm was applied to each language and all cues were normalized within each language by using z-scores in order to eliminate individual differences, it is risky to compare the two languages directly, since variations of emotional effects may come from the language itself or from different intensities of emotional elicitation. Therefore, in this part, we focused only on how each cue performed among different emotions within each language and only compared the ordering of the five emotions for each cue in the two languages. For instance, anger had the highest pitch-related measures and sadness had the lowest (except for mean pitch) in Mandarin. However, in English, happiness was ordered the highest in terms of pitch-related cues and neutral was the lowest.

Across-Language Comparison of Vocal Emotions: Mandarin versus English
As mentioned above, to overcome evident biases arising from the absolute values in the two languages, we adopted the data normalization method of [21,48]. In this method, x is the absolute value of each parameter for anger, fear, happiness and sadness, respectively, and N is the absolute value of each parameter for neutral. The normalization thus produces the relative value of each parameter in each of the four emotions compared to the neutral baseline. The derived data have a positive or a negative value, depending on the relative difference with neutral.
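A baseline-relative normalization of this kind can be sketched as below. The exact equation is not reproduced in this text, so the relative-difference form (x - N) / |N| used here is an assumption chosen for illustration, as are the function name and the sample values.

```python
def relative_to_neutral(values_by_emotion):
    """Express each emotion's value relative to the neutral baseline.

    Assumed form for this sketch: (x - N) / |N|, where N is the neutral
    value. Positive results lie above neutral, negative ones below.
    """
    n = values_by_emotion["neutral"]
    if n == 0:
        raise ValueError("neutral baseline is zero; ratio is undefined")
    return {
        emotion: (x - n) / abs(n)
        for emotion, x in values_by_emotion.items()
        if emotion != "neutral"
    }


# Hypothetical speech-rate values (syllables per second) for one speaker.
speech_rate = {"neutral": 5.0, "anger": 6.0, "fear": 5.5,
               "happiness": 6.0, "sadness": 4.0}
normalized = relative_to_neutral(speech_rate)
```

In this toy example, anger comes out at 0.2 (20% faster than neutral) and sadness at -0.2 (20% slower), so neutral sits at 0 and the sign shows the direction of change.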

Prosodic Cues for Encoding Emotions in Mandarin and English
As presented in Figure 3, the variations of pitch-related cues around 0 (neutral emotion) are greater in English than in Mandarin, except for mean pitch. The main effect of Emotion was significant on all seven pitch-related cues, as shown in Table 4. In terms of Language, Mandarin and English had significant differences on pitch range, mean absolute slope, rise, fall, rise slope and fall slope. The interaction effect of Emotion × Language was significant on mean absolute slope, rise, fall and fall slope, suggesting that English vocal emotions showed more dynamic pitch-related movements than Mandarin vocal emotions. Specifically, compared to neutral emotion, happiness, fear and anger in both Mandarin and English were realized by raising mean pitch. However, the change of mean pitch for sadness was small in both languages. For pitch range, happiness and anger had a more expanded pitch range than neutral emotion in both Mandarin and English, and English had a much greater degree of pitch range expansion. Fear and sadness showed different pitch range patterns in the two languages. The pitch range values of fear and sadness were negative in Mandarin, indicating that their pitch range was compressed in comparison to neutral emotion. In contrast, the positive values in English indicate that their pitch range was expanded. Likewise, the mean absolute slope of sadness was negative in Mandarin and positive in English with a greater degree of variation, suggesting that pitch fluctuation was suppressed in Mandarin but expanded in English. For rise, rise slope and fall, anger and happiness were higher than neutral emotion in both languages, but fear and sadness had opposite patterns in the two languages. Again, English had a clearly greater degree of change as compared to neutral emotion. Anger, fear and happiness had a larger fall slope than neutral emotion in both Mandarin and English. However, the fall slope of sadness was reduced in Mandarin but increased in English.

Unlike pitch-related cues, the variations of speech rate around 0 are much greater in Mandarin than in English. Statistical results showed a significant main effect of Emotion on both speech rate and mean intensity, as indicated in Table 4. Language had a significant main effect only on speech rate. There was a significant interaction effect between Emotion and Language on speech rate. We can see in Figure 3 that anger, fear and happiness increased their speech rate in both Mandarin and English, while sadness lowered its speech rate as compared to neutral emotion. However, the degree of change in Mandarin was much larger than that in English. For intensity, both languages showed similar patterns of increasing intensity in anger, fear and happiness and lowering intensity in sadness.

Phonation Cues for Encoding Emotions in Mandarin and English
Figure 4 illustrates the effects of Emotion and Language on comprehensive spectral cues. It is clear that the mean differences of CPP, H4 and H1-H2 among four emotions are greater in Mandarin than in English. The results of mixed repeated-measures ANOVAs showed that Emotion had a significant main effect on CPP and H2. Language had no significant main effect on any spectral cue. The interaction between Emotion and Language was significant on CPP, meaning that the two languages had different CPP patterns. To be specific, the difference of CPP between anger and neutral in both Mandarin and English was small, since the mean values were close to 0. The CPP of fear and sadness in both languages was smaller than that of neutral emotion and the CPP of happiness was greater than that of neutral emotion. This result indicated that fear and sadness were less periodic with increased spectral noise in both languages, and Mandarin had a greater degree of change compared to neutral emotion. The pattern of H1 was similar to that of mean pitch: anger and happiness increased their H1, whereas sadness lowered its H1 in Mandarin. The changes of H2 were similar in Mandarin and English. Anger and happiness increased H2. Sadness and fear reduced H2.

Phonation Cues for Encoding Emotions in Mandarin and English
Figure 4 illustrates the effects of Emotion and Language on the spectral cues. The mean differences of CPP, H4 and H1-H2 among the four emotions are greater in Mandarin than in English. The mixed repeated-measures ANOVAs showed that Emotion had a significant main effect on CPP and H2. Language had no significant main effect on any spectral cue. The interaction between Emotion and Language was significant for CPP, meaning that the two languages had different CPP patterns. Specifically, the difference in CPP between anger and neutral was small in both Mandarin and English, since the mean values were close to 0. The CPP of fear and sadness was smaller than that of neutral emotion in both languages, and that of happiness was greater. This result indicates that fear and sadness were less periodic, with increased spectral noise, in both languages, and that Mandarin showed a greater degree of change relative to neutral emotion. The pattern of H1 was similar to that of mean pitch: anger and happiness increased H1, whereas sadness lowered H1 in Mandarin. The changes in H2 were similar in Mandarin and English: anger and happiness increased H2, whereas sadness and fear reduced it.
To examine the physiological mechanisms of vocal emotions in Mandarin and English, three EGG cues were analysed. Figure 5 displays the effects of Emotion and Language on these EGG cues. As can be seen from the figure, the mean differences of these three cues among the four emotions are greater in Mandarin than in English. In terms of Emotion, there were significant differences among the four emotions on CQ (F(3,24) = 20.77, p < 0.001) and SQ (F(3,24) = 20.41, p < 0.001), and a marginally significant difference on PIC (F(3,24) = 2.93, p = 0.054). However, the main effects of Language on these three cues were not significant. The interaction between Emotion and Language was significant for CQ (F(3,24) = 9.52, p < 0.001) and SQ (F(3,24) = 11.03, p < 0.001). Specifically, CQ varied more among the four emotions, relative to neutral emotion, in Mandarin than in English. Anger and happiness increased CQ relative to neutral emotion, whereas fear decreased CQ in both languages. However, the CQ of sadness was lower than that of neutral emotion in Mandarin but slightly higher in English. Similarly, the changes in PIC for the four emotions were close to 0 in English, whereas PIC greatly increased in Mandarin. For SQ, fear showed a significant increase in Mandarin, suggesting that fear had more symmetrical EGG waveforms.

Discussion and Conclusions
This study examined whether a tonal language shows restricted pitch variation when encoding vocal emotions, as compared to a non-tonal language. If this is indeed the case, how are other cues used to differentiate vocal emotions? To address these questions, both within- and cross-language comparisons were made among emotions on a comprehensive set of prosodic cues, phonation-related spectral cues and EGG cues through a production experiment.

Acoustic and Physiological Patterns of Each Vocal Emotion in Mandarin and English
The results of the within-language comparison showed that all the prosodic cues, including f0, duration and intensity, differed significantly between emotions in each language, suggesting that different vocal emotions had unique prosodic patterns, which is consistent with previous studies (e.g., [2,3,49,50]). Specifically, happiness and anger showed relatively higher pitch level, a greater degree of local pitch fluctuation, faster speech rate and higher intensity in both Mandarin and English, matching earlier observations (Mandarin: [15,20,51]; English: [7]). Fear had the highest pitch level and fastest speech rate among the five emotions. The intensity of fear was also higher, but fear had smaller local pitch movements. These results are also consistent with those of [7,52,53]. The mean pitch for sadness was close to that of neutral emotion, which was also reported by [54] for Mandarin and [7] for English. In addition, sadness had smaller local pitch movements, the slowest speech rate and the weakest intensity in both languages. Unlike previous studies, which mostly use measures such as mean pitch and pitch range to describe global changes of the pitch contour, our study includes finer pitch-related cues that quantify the degree of local pitch change over time, such as mean absolute slope, rise, fall, rise slope and fall slope. Thus, pairs of vocal emotions such as happiness and fear, or anger and fear, can be clearly separated with respect to the degree of local pitch movement. Furthermore, these cues have proved useful for examining whether pitch fluctuation is influenced by the existence of a lexical tone system within one language [25,26]. Consequently, we also include them in our across-language analysis to test whether the restriction exists.
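To make the notion of a local pitch cue concrete, the following minimal sketch computes a mean absolute slope from an f0 contour. It assumes that the cue is the average of the absolute f0 change per unit time between consecutive samples, expressed in semitones per second; this definition, the function names, the 100 Hz reference and the two contours are illustrative assumptions rather than the exact procedure of the study.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert f0 in Hz to semitones relative to ref_hz (an arbitrary reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def mean_absolute_slope(times_s, f0_st):
    """Mean of |delta f0 / delta t| over consecutive samples (semitones per second).

    Assumed definition for illustration; the study's own definition may differ.
    """
    if len(times_s) != len(f0_st) or len(times_s) < 2:
        raise ValueError("need two equally long sequences with at least 2 samples")
    slopes = [
        abs((f0_st[i + 1] - f0_st[i]) / (times_s[i + 1] - times_s[i]))
        for i in range(len(times_s) - 1)
    ]
    return sum(slopes) / len(slopes)


# Hypothetical contours (Hz) sampled every 50 ms; not data from the experiment.
times = [0.00, 0.05, 0.10, 0.15, 0.20]
flat_contour = [hz_to_semitones(f) for f in [200, 202, 200, 201, 200]]
lively_contour = [hz_to_semitones(f) for f in [200, 230, 190, 240, 200]]

print(mean_absolute_slope(times, flat_contour))
print(mean_absolute_slope(times, lively_contour))
```

Two contours with nearly the same mean pitch can differ strongly on such a measure, which is why local cues of this kind can separate emotions that global measures leave confounded.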
In addition, our study analysed a set of spectral cues that reflect the degree of periodicity, the signal-to-noise ratio (SNR) and the spectral tilt of the speech signal [30,38]. The spectral analysis revealed different patterns in the two languages. In Mandarin, happiness had better periodicity, higher SNR and smaller spectral tilt due to increased high-frequency energy. Anger showed patterns similar to those of happiness. Fear had the poorest periodicity, the smallest SNR and a higher spectral tilt caused by lower high-frequency energy. Finally, sadness was less periodic, with a smaller SNR and the highest spectral tilt, caused by the lowest high-frequency energy. In contrast, the periodicity and SNR of the five emotions exhibited few differences in English. In English, happiness had the highest high-frequency energy and thus a small spectral tilt, followed by anger, sadness and fear. These results are useful for inferring source features and for categorizing the vocal fold vibration patterns of different vocal emotions. On the one hand, periodicity and SNR are reduced when the vocal folds do not close completely during vibration and air flows through the glottis, which indicates a breathy voice or an aspirated sound. On the other hand, the spectral tilt reflects the shape of the glottal wave, which represents the mechanism of vocal fold vibration. A larger spectral tilt can be caused by a breathy voice or a soft voice [30]. Therefore, our spectral results suggest that fear and sadness may be produced with breathy voice in Mandarin.
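As an illustration of how a spectral tilt measure such as H1-H2 relates to harmonic amplitudes, the sketch below estimates the amplitudes of the first two harmonics with a single-bin discrete Fourier transform and reports their difference in dB. It is a simplified, uncorrected computation on synthetic two-harmonic signals; the function names, the sampling rate and the amplitudes are assumptions for illustration, and the measurement procedure used in the study is not reproduced here.

```python
import math


def harmonic_amplitude(signal, sample_rate, freq_hz):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT)."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n


def h1_minus_h2_db(signal, sample_rate, f0_hz):
    """Uncorrected H1-H2 in dB: level of the first harmonic minus the second."""
    h1 = harmonic_amplitude(signal, sample_rate, f0_hz)
    h2 = harmonic_amplitude(signal, sample_rate, 2 * f0_hz)
    return 20.0 * math.log10(h1 / h2)


def synth(a1, a2, f0_hz=200.0, sample_rate=16000, n=1600):
    """Synthetic signal with two harmonics of amplitudes a1 and a2 (hypothetical)."""
    return [
        a1 * math.sin(2 * math.pi * f0_hz * i / sample_rate)
        + a2 * math.sin(2 * math.pi * 2 * f0_hz * i / sample_rate)
        for i in range(n)
    ]


# A weaker second harmonic gives a larger H1-H2, i.e., a steeper tilt.
print(h1_minus_h2_db(synth(1.0, 0.8), 16000, 200.0))
print(h1_minus_h2_db(synth(1.0, 0.25), 16000, 200.0))
```

The analysis window here spans a whole number of pitch periods, so the two harmonics do not leak into each other; with real speech, f0 must first be estimated and the window chosen accordingly.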
Another new approach in this study was the use of an EGG experiment to examine the physiological mechanism of different vocal emotions directly, as a supplement to the spectral measures. Results revealed that neutral and happy speech were expressed with modal voice in both Mandarin and English. Anger was produced with a small degree of tense voice in Mandarin. Fear showed the typical features of breathy voice in both Mandarin and English. Sadness also had a small degree of breathy voice in Mandarin but not in English. Based on Table 3, CQ differed significantly among all vocal emotions in Mandarin, and all contrasts were significantly different in terms of SQ in English except that between neutral and fear. This indicates that EGG cues were helpful in discriminating all of the emotions and clearly complemented both prosodic and spectral cues. Furthermore, through the EGG experiment, we found that anger and happiness, which previous studies reported to be prosodically similar [55], could be differentiated better than with prosodic cues alone. Moreover, EGG cues helped identify the phonation type of fear as breathy voice rather than aspirated or soft voice, a distinction for which the spectral cues alone gave conflicting indications. Thus, EGG cues are a useful supplement to spectral cues for exploring the voice quality features of vocal emotions.
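The contact quotient discussed above can be illustrated with a short sketch. It assumes a threshold (criterion-level) method in which the contact phase of one glottal cycle is the portion of the EGG signal lying above a fixed fraction of the cycle's peak-to-peak amplitude, with larger EGG values meaning greater vocal fold contact. The 25% level, the function names and the synthetic cycles are illustrative assumptions; the extraction method used in the study may differ.

```python
import math


def contact_quotient(cycle, level=0.25):
    """Fraction of one EGG cycle spent above a threshold (assumed method).

    The threshold is placed at `level` of the peak-to-peak amplitude.
    """
    lo, hi = min(cycle), max(cycle)
    if hi == lo:
        raise ValueError("cycle has no amplitude variation")
    threshold = lo + level * (hi - lo)
    contacted = sum(1 for x in cycle if x > threshold)
    return contacted / len(cycle)


def synthetic_cycle(contact_fraction, n=1000):
    """Hypothetical EGG cycle: a half-sine contact pulse followed by an open phase."""
    n_contact = round(contact_fraction * n)
    return [
        math.sin(math.pi * i / n_contact) if i < n_contact else 0.0
        for i in range(n)
    ]


# A longer contact pulse (tenser phonation) yields a higher CQ than a
# shorter one (breathier phonation).
print(contact_quotient(synthetic_cycle(0.6)))
print(contact_quotient(synthetic_cycle(0.4)))
```

Under this assumed definition, a lower CQ corresponds to a shorter closed phase, which is the pattern associated with breathy voice in the discussion above.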

Multidimensionality of the Acoustic Realization of Vocal Emotions
It is worth noting that in the within-language analysis, no direct comparison of values between Mandarin and English was made, although the same experimental paradigm was used for both languages. Therefore, we further used normalized data to compare the patterns of vocal emotions across languages. Each value for anger, fear, happiness and sadness was normalized against that of neutral emotion in each language, so that the derived data represented the proportion of change for each emotion with neutral emotion as the reference.
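The normalization described above can be sketched as follows. The sketch assumes that the proportion of change is computed as (emotion - neutral) / neutral for each cue within a language; the exact formula is not restated here, so this form, the function name and the example values are assumptions for illustration only.

```python
def normalize_against_neutral(means, baseline="neutral"):
    """Return the proportional change of each emotion relative to the baseline.

    `means` maps emotion labels to the mean value of one cue in one language.
    """
    ref = means[baseline]
    if ref == 0:
        raise ValueError("baseline value must be non-zero")
    return {
        emotion: (value - ref) / ref
        for emotion, value in means.items()
        if emotion != baseline
    }


# Hypothetical mean f0 values (Hz) for one language; not data from the study.
example_means = {
    "neutral": 200.0,
    "anger": 250.0,
    "fear": 260.0,
    "happiness": 240.0,
    "sadness": 190.0,
}

for emotion, change in normalize_against_neutral(example_means).items():
    print(f"{emotion}: {change:+.2f}")
```

Because each value is expressed relative to the same speaker group's neutral speech, a value of 0 means no change from neutral, and the derived values can be compared across languages whose absolute ranges differ.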
Based on the acoustic analysis of the normalized pitch-related cues, we observed that Mandarin and English showed different mechanisms when utilizing pitch to express vocal emotions. There were significant interactions between emotion and language on mean absolute slope, rise, fall, rise slope and fall slope. The pitch variations, with neutral as the baseline, were significantly greater in English than in Mandarin. We posit that this difference is due to the restriction of pitch variation by the existence of lexical tones in Mandarin, since pitch is used primarily to maintain tonal shapes in order to distinguish lexical meaning and only secondarily to realize paralinguistic functions. Although the present study is not the first to propose this idea [24,25,27], it expands upon previous work in this area and provides additional objective measures. For example, this study included more sophisticated pitch measures, which better reflect the variation of pitch over time. Additionally, instead of comparing absolute values between Mandarin and English, we used derived values that reflect the magnitude of change each cue showed in emotional speech with neutral as the baseline. That is to say, we measured the relative contribution of each cue to differentiating among emotions.
Unlike the established literature, which looked only at how the paralinguistic use of pitch is influenced by the lexical tone system, we explored how other cues, such as duration, spectral and EGG cues, function when the restriction exists. Results showed that the differences between neutral and the other emotions in speech rate, CPP, CQ and SQ were significantly greater in Mandarin than in English, the opposite of the result for the pitch-related cues. A possible explanation is that these cues were enhanced in Mandarin vocal emotions in order to compensate for the suppressed pitch variation. Therefore, we believe our study provides new evidence regarding the pitch restriction hypothesis from a fresh perspective.
Despite these promising results, the study has several limitations. First, given that only one tonal and one non-tonal language were examined, one may argue that the differences between Mandarin and English vocal emotions are due to the different cultural backgrounds of the speakers. Therefore, the conclusions need to be further tested by analysing a larger and more diverse pool of languages. The second limitation lies in the small number of speakers for each language. Thus, more speakers need to be recruited in future research to increase the generalizability of the findings.
In conclusion, this study compared vocal emotions within and across Mandarin and English using a set of prosodic cues, phonation-related spectral cues and EGG cues. Results revealed that within each language, different vocal emotions showed significantly different acoustic patterns. However, pitch variation was suppressed in Mandarin vocal emotions, compared with English, due to the influence of the lexical tone system. Evidence from other cues supported the idea that when a certain dimension (for example, pitch) is restricted within a language, other dimensions may be strengthened in compensation. Thus, we posit that the acoustic realization of vocal emotions is multidimensional.

Figure 1. Pitch-normalized EGG waveform of five emotions in Mandarin.

Figure 2. Pitch-normalized EGG waveform of five emotions in English.

Figure 3. Prosodic cues of four emotions in Mandarin and English. Points indicate mean values and error bars 95% confidence intervals (Adapted from Figure 1 in Reference [31]).

Figure 4. Phonation-related acoustic measures of four emotions in Mandarin and English. Points indicate mean values and error bars 95% confidence intervals (Adapted from Figure 2 in Reference [31]).

Figure 5. Physiological measures of phonation of four emotions in Mandarin and English. Points indicate mean values and error bars 95% confidence intervals (Adapted from Figure 3 in Reference [31]).

Author Contributions: T.W., Y-c.L. and Q.W.M. designed the experiment; T.W. conducted the production experiment, analyzed the data and wrote the first draft of the manuscript; Y-c.L. reviewed and improved the manuscript.

Funding: This work was supported in part by the Youth Project of Humanities and Social Sciences Foundation [grant no. 18YJC740103] of the Ministry of Education in China, the Chenguang Program [grant no. 16CG21] of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission, and Fundamental Research Funds for the Central Universities [grant no. 22120170484] of Tongji University, which was awarded to the first author.

Table 1. An example of the target sentence "My advisor won't come to my presentation" in four emotional dialogue settings. He is the toughest advisor in the department and points out every little mistake during presentations.

Table 2. Normalized mean values of five emotions in each language.

Table 4. Results of mixed repeated measures ANOVAs for each cue.