English Focus Perception by Mandarin Listeners

Abstract: This study compared how well native Mandarin and native English speakers perceive prosodically marked focus in English echo questions. Twenty-five yes–no echo questions were each produced with a sentence focus, a verb focus, and an object focus. After hearing each question, listeners were asked to choose the correct response. Native English listeners were more accurate than native Mandarin listeners on verb and object focus, but not on sentence focus. More importantly, both groups confused object focus with sentence focus and vice versa. However, confusion between object and verb focus, and between verb and sentence focus, was infrequent. These results suggest that, in some cases, (1) acoustic prominence on the head of a phrase or its internal argument can project to the entire phrase and make the entire phrase focused, and (2) parallel transmission of the two functions of intonation, and cross-linguistic variation in focus marking (prosodically versus syntactically), may contribute to their perceptual ambiguity.


Introduction
The accuracy and speed with which an utterance is processed in discourse are facilitated by the ability to distinguish new information from background information already shared by the conversation partners, and to recognize how both are structured and conveyed in a sentence. The means by which this is accomplished is language specific. Some languages rely on syntactic structures to mark the focus constituent, whereas others may use focus-marking particles or prosodic features, such as phrasing and accentuation, as focus-marking devices (e.g., Büring 2009; Zimmermann and Onea 2011; Lee et al. 2015; Hagoort 2019). In English, new or relevant information is typically pitch-accented. For example, as a felicitous response to the question, "What did John buy?", BICYCLE in the response, "John bought a BICYCLE.", is pitch-accented. Accent placement is connected to focus, which, in turn, is closely linked to new versus already-shared or given information. Specifically, focus allows the speaker to present to the listener the part(s) of an utterance that s/he deems semantically and pragmatically prominent (Bishop 2012). Focused constituents may vary in breadth (broad or narrow) and in type (e.g., contrastive, non-contrastive).
This study investigates the ability to perceive broad sentence focus, verb narrow focus and object narrow focus in American English yes-no echo questions by native speakers of Mandarin Chinese. Echo questions allow all three types of focus to be examined in morphosyntactically identical utterances. Unlike English, focus is preferably realized syntactically, rather than prosodically, in Mandarin (Xu 2004). This cross-linguistic difference in focus marking may lead to perceptual difficulty among Mandarin listeners learning English. The remainder of the paper is organized as follows. Section 2 presents a brief overview of intonation phonology, information structure, and focus marking in English and Mandarin. The study is described in Section 3. The results are reported in Section 4 and discussed in Section 5. Finally, Section 6 summarizes and concludes.

Intonation, Information Structure, Focus, and Prominence
Phonological theories of speech prosody proposed that the utterance is hierarchically organized, with a larger prosodic unit dominating a smaller one (e.g., Selkirk 1978, 1980, 1981, 1984, 1995; Nespor and Vogel 1983). At the lower level, weak and strong syllables are grouped together to form feet. In turn, feet are grouped into phonological words, phonological words are bundled into phonological phrases, and phonological phrases into the entire utterance. However, research differs on how these prosodic units are determined. In one approach, prosodic units are largely identified based on syntactic structure, whereas, in the other, they are determined by their intonation features (see Jun 1993, 1998 for a discussion of the two approaches).
Different prosodic hierarchies have been proposed over the years (see Shattuck-Hufnagel and Turk 1996 for an overview). According to Grice (2006, p. 778), each well-formed utterance contains at least one intonation phrase (IP), the highest prosodic unit above Word (W), represented by "tonal marking of some kind at one or both of its edges and/or at least one prominent within it." In turn, an IP may contain one or more intermediate phrases (ip). Pierrehumbert's autosegmental-metrical (AM) model of intonation and its corresponding ToBI transcription system posit that each ip has at least one pitch accent (T*), which is linked to a stressed syllable. An ip can have additional pitch accents, and the final pitch accent in an ip is called the nuclear pitch accent. There are no pitch accents after the nuclear accent; therefore, any words following the nuclear accent are (postnuclear) unaccented, and any accents before the nuclear accent are called prenuclear accents (Ayers 1996). The end of an ip is marked by a phrasal accent (T-) and the end (and rarely the beginning) of an IP is marked by a boundary tone (T%) (see Figure 1) (Pierrehumbert 1980; Beckman and Pierrehumbert 1986; Beckman and Ayers 1994). In English ToBI, five pitch accents have been proposed: L*, H*, L*+H, L+H*, H+!H* (Jun 2005). H tones denote high pitch targets corresponding to F0 peaks, L tones are low pitch targets associated with F0 troughs or valleys, and * indicates alignment with stressed syllables of the accented word. Pitch-accent assignment is determined post-lexically based on the meaning of the utterance in the discourse. In other words, pitch accents are assigned, not to every word with stress, but to words that are semantically and/or pragmatically prominent in discourse, and different types of pitch accents deliver different meanings (Jun 2005; Pierrehumbert and Hirschberg 1990). In a neutral production of an utterance, the last pitch accent of an intermediate phrase, known as the nuclear pitch accent, is the most prominent pitch accent within the ip (Jun 2005).
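The terminology above (prenuclear, nuclear, and postnuclear unaccented) can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function name label_accents, the input format, and the accent labels in the example are hypothetical and are not taken from ToBI tooling or from the stimuli of this study.

```python
def label_accents(words):
    """Label each word of ONE intermediate phrase (ip).

    words: list of (word, pitch_accent) pairs, where pitch_accent is a
    ToBI-style label such as "H*" or None for an unaccented word.
    The input format is an assumption made for this illustration.
    """
    accented = [i for i, (_, acc) in enumerate(words) if acc is not None]
    if not accented:
        # Each ip has at least one pitch accent.
        raise ValueError("an intermediate phrase needs at least one pitch accent")
    nuclear = accented[-1]  # the final pitch accent in the ip is the nuclear one
    labels = []
    for i, (word, acc) in enumerate(words):
        if i == nuclear:
            role = "nuclear"
        elif acc is not None:
            role = "prenuclear"
        elif i > nuclear:
            role = "postnuclear unaccented"
        else:
            role = "unaccented"
        labels.append((word, role))
    return labels


# Hypothetical accent assignment for illustration only.
for word, role in label_accents(
    [("John", "H*"), ("bought", None), ("a", None), ("bicycle", "H*")]
):
    print(word, role)
```

With the accent moved to the first word only, every following word would be labeled postnuclear unaccented, which mirrors the statement that no pitch accents follow the nuclear accent.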
Phrasal accents (L-, H-) occur at intermediate phrase boundaries, whereas boundary tones (L%, H%) are associated with the right edge of a full intonation phrase. H denotes the high pitch level in the local pitch range and L refers to low pitch level in the local pitch range (Beckman and Hirschberg 1994).
Since an intonation phrase is usually comprised of one or more intermediate phrases and a boundary tone, a full intonation phrase boundary consists of two final tones: L-L%, L-H%, H-L% and H-H% (Beckman and Hirschberg 1994).
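The four phrase-final tone sequences listed above are simply every pairing of a phrasal accent with a boundary tone. A minimal Python sketch of that enumeration follows; the constant and function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Tonal inventory as described in the text; the names are illustrative.
PHRASAL_ACCENTS = ["L-", "H-"]  # mark the end of an intermediate phrase (ip)
BOUNDARY_TONES = ["L%", "H%"]   # mark the right edge of an intonation phrase (IP)


def ip_final_sequences():
    """Pair every phrasal accent with every boundary tone."""
    return [accent + tone for accent, tone in product(PHRASAL_ACCENTS, BOUNDARY_TONES)]


print(ip_final_sequences())  # ['L-L%', 'L-H%', 'H-L%', 'H-H%']
```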
Languages 2019, 4, 91
An utterance is often part of a longer interaction between the interlocutors; therefore, some of the information it contains is already shared between the speaker and the hearer while new information is being added to the conversation (Hagoort 2019). Information is "given" if it has been mentioned earlier in the same or a preceding sentence, and all other information is "new" (Chafe 1974). Information structure refers to the way in which shared and new information are packaged in the sentence (Seuren 2009), but how this is accomplished varies from language to language. In some languages, syntactic structures are used to mark focus constituents, while focus-marking particles or prosodic features such as phrasing and accentuation may be employed in others (Hagoort 2019).
Accent placement is connected to focus, which, in turn, is closely related to "new" versus "given" or "shared with the listener" information (Ayers 1996, p. 22). In English, new or relevant information is typically pitch-accented. For example, as an answer to the question, "What did Mary cook for dinner?", the focus constituent, "STEAK," in "Mary cooked STEAK for dinner." is pitch-accented to convey the new information and may be perceived as being prominent by the listeners.
Perceived prominence is the "subjective impression of 'prosodic strength,'" which can be driven by a bottom-up process based on acoustic cues or a non-signal-based, top-down process based on abstract patterns found in production (Bishop 2012, p. 5). Increased segmental duration, greater intensity, and salient F0 features such as F0 peaks, F0 valleys, and F0 movement or contours are acoustic cues that have been shown to be associated with perceived prominence. For the top-down process, it has been found that lexical frequency and the size of the focus constituent influence perceived degrees of prominence (e.g., Bishop 2012; Cole et al. 2010). For example, Bishop (2012) found that the same object was perceived as being more prominent when it was narrowly focused than when it was part of a broad VP focus or a sentence focus.

English Focus Marking
Focus is an important feature conveying a sentence's information structure (Bishop 2012). It is a means of emphasizing part of an utterance to specify new or informative information (Bolinger 1972; Gussenhoven 1983a; Selkirk 1995), "the information that the speaker presents as semantically or pragmatically prominent" (Bishop 2012, p. 238, emphasis in original). Focus may vary in two dimensions: size and type or contrastiveness (e.g., Gussenhoven 2008; Krifka 2008). The size of a focus constituent is the syntactic constituent that conveys the information determined by the speaker to be informative which, in turn, depends on the discourse context in which the sentence is produced (Bishop 2012). For example, the focus constituent of the answer (1D) depends on the contexts (1A-C) in which it is uttered:
1.
A: What happened?
B: What did John do?
C: What did John buy?
D: John bought a bicycle.
In response to 1A, the focus constituent of 1D is the entire sentence or sentence focus. In the context of 1B, only the verb phrase (bought a bicycle) forms the focus constituent and is referred to as VP focus. Finally, as a response to 1C, the object noun-phrase (bicycle) forms the focus constituent and is referred to as object focus. Larger focus constituents such as sentence focus and VP focus are also referred to as broad focus compared to a narrow focus on a single word such as the object focus (Ladd 1980;Selkirk 1984). In other words, narrow focus refers to cases where only the agent, the action, the patient, etc., of an event is focused, whereas broad focus refers to cases where the entire event is focused.
Regarding focus type, two categories can be differentiated: non-contrastive and contrastive focus. Non-contrastive focus refers to the information required by WH-questions as in (1), where no explicit alternative is mentioned. Contrastive focus, on the other hand, refers to cases where an explicit alternative is mentioned (also referred to as 'corrective focus'), as with 'bicycle' in 2B:

2.
A: Did John buy a car?
B: (No), he bought a bicycle.
Contrastive focus often involves cases of narrow focus (e.g., Baumann et al. 2007; Hanssen et al. 2008). However, a focus of any size can be contrastive or non-contrastive. That is, focus size and focus type are orthogonal properties of informational structure (Bishop 2012). For instance, (2B) could have contrastive VP focus if it is uttered as a response to such questions as, "Did John fix the car?" However, the distinction between contrastive and non-contrastive focus is not recognized by all researchers. Rooth (1992), for instance, argued that there is no principled difference between the two types of focus. According to Rooth (1992), two semantic representations are evoked for each expression: its actual meaning and a set of alternatives. When a constituent in an expression is focused, the alternative set consists of the expression itself and all expressions with an alternative substituted for the focus-marked constituent. If there is no focus within the expression, then the alternative set consists only of the expression itself (Breen et al. 2010).
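The construction of an alternative set, as summarized above, can be sketched in a few lines of Python. This is a toy illustration rather than an implementation of Rooth's formal semantics: the function name alternative_set is hypothetical, expressions are treated as flat word lists, and the substitutes "car" and "boat" are invented for the example, whereas real alternatives are supplied by the discourse context.

```python
def alternative_set(words, focus_index, alternatives):
    """Return the expression plus every variant with the focused word replaced.

    words: list of tokens in the expression.
    focus_index: position of the focus-marked token, or None if no focus.
    alternatives: hypothetical substitutes for the focused token.
    """
    expression = " ".join(words)
    result = {expression}  # the expression itself is always a member
    if focus_index is None:
        return result      # no focus: the set contains only the expression
    for alt in alternatives:
        variant = words[:focus_index] + [alt] + words[focus_index + 1:]
        result.add(" ".join(variant))
    return result


sentence = ["John", "bought", "a", "bicycle"]
# Focus on "bicycle" (index 3) with invented alternatives.
print(sorted(alternative_set(sentence, 3, ["car", "boat"])))
# No focus: only the expression itself.
print(alternative_set(sentence, None, ["car", "boat"]))
```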
In English, prosodically focused words are produced with an expanded pitch range, higher intensity, longer duration and/or a pause before or after them (e.g., Jun 2005; Xu and Xu 2005; Xu et al. 2012). An acoustic study of focus production on the first and the last noun phrase in English statements and questions observed an increase in duration for focused words in both statements and questions. For sentences with broad or sentence-final focus, the F0 peak on the last key word was significantly higher in questions than in statements. On the other hand, for sentences with focus on the first key word, the peak F0 on the focused word itself was comparable between statements and questions. However, while the F0 topline lowers in statements, it remains high for the remainder of the utterance in questions. In addition, the F0 contour of focused words is rising in questions but falling in statements. Xu and Xu (2005) examined focus realization in American English declarative statements and reported similar results; specifically, the pitch range is expanded under focus, reduced after focus, but remains unchanged before focus. A reduction in pitch range in post-focus position is known as post-focus compression (PFC) (Xu et al. 2012).
Foci of different sizes are acoustically differentiated. For example, to distinguish narrow focus on an object from broad VP focus, native English speakers may place more emphasis on the object and produce it with a higher F0 peak (Ladd 2012). Other studies (e.g., Sityaev and House 2003; Xu and Xu 2005; Breen et al. 2010) have also shown that focus constituents of different sizes are systematically distinguished by acoustic features. Native English speakers have been found to produce narrow-focused words with a higher F0 and a longer duration compared to the same words uttered in neutral- or broad-focus sentences. However, broad focus on the verb phrase is pronounced with an increase in duration without an accompanying higher F0 peak. F0 peaks of narrow-focused words were also found to be comparable in single- and dual-focus sentences. However, unlike in single-focus sentences, the word following an initial focused item in dual-focus sentences did not exhibit the low F0 value characteristic of post-focused words. Sityaev and House (2003) found a small but significant phonetic difference between broad focus, narrow focus and contrastive focus, with a greater difference between contrastive and non-contrastive (broad and narrow) focus, particularly in sentence-final position. Breen et al. (2010) found that focus location (subject, verb, and object) and focus type are reliably signaled by native speakers of English with greater acoustic intensity, longer duration, and higher mean and maximum F0, particularly when they were aware of prosodic ambiguity between focus types.
However, despite acoustic prominence, ambiguity between broad and narrow focus could still occur through the focus projection process (Selkirk 1984, 1995). According to Selkirk, an accented word is Focus (F)-marked, where F-marking is based on a semantic feature to be used in interpretation. F-marking of higher constituents is, however, projected according to a set of rules: F-marking of the head of a phrase licenses F-marking of the entire phrase; and F-marking of an internal argument of a head licenses the F-marking of the head. Thus, according to these principles, ambiguity between a reading where only the object is narrowly focused and one where the entire verb phrase is focused could arise in a clause containing a transitive verb whose direct object is accented, since F-marking on the object can license F-marking of the verb phrase. Support for the focus projection principles has been previously reported (e.g., Birch and Clifton 1995; Gussenhoven 1983b; Welby 2003).
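The two projection rules stated above can be applied mechanically to a small syntactic tree. The Python sketch below is a toy encoding under several assumptions: the Node class, the node labels, and the function licensed_f_marks are hypothetical; labels are assumed to be unique; and only the verb phrase is modeled, because projection to the clause level depends on assumptions about clause structure that are not spelled out in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # assumed unique within the tree
    head: Optional["Node"] = None   # head daughter (None for a word)
    arg: Optional["Node"] = None    # internal argument of the head, if any


def licensed_f_marks(root, accented_label):
    """Return the labels of all nodes that MAY be F-marked given one accented word."""
    nodes = []

    def collect(node):
        if node is None:
            return
        nodes.append(node)
        collect(node.head)
        collect(node.arg)

    collect(root)
    marked = {accented_label}  # an accented word is F-marked
    changed = True
    while changed:             # repeat until no rule licenses anything new
        changed = False
        for n in nodes:
            # F-marking of an internal argument licenses F-marking of the head.
            if n.arg is not None and n.head is not None:
                if n.arg.label in marked and n.head.label not in marked:
                    marked.add(n.head.label)
                    changed = True
            # F-marking of the head licenses F-marking of the entire phrase.
            if n.head is not None and n.head.label in marked and n.label not in marked:
                marked.add(n.label)
                changed = True
    return marked


# Toy verb phrase for "bought a bicycle".
noun = Node("N:bicycle")
obj = Node("NP:a bicycle", head=noun)
verb = Node("V:bought")
vp = Node("VP", head=verb, arg=obj)

print(sorted(licensed_f_marks(vp, "N:bicycle")))
print(sorted(licensed_f_marks(vp, "V:bought")))
```

With the accent on the object noun, the sketch licenses F-marking of the noun phrase, the verb, and the verb phrase, which is the source of the narrow object focus versus broad focus ambiguity described above. Licensing only makes a reading available; it does not say which reading a listener selects.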

Mandarin Focus Marking
F0 is used to distinguish meaning at the syllable/word level in Mandarin. For example, the syllable /ma/ means 'mother,' 'hemp,' 'horse,' and 'admonish or scold' when produced with a high-level (Tone 1), a rising (Tone 2), a falling-rising (Tone 3), and a falling (Tone 4) F0, respectively. In contrast to English, focus is preferably realized syntactically, rather than prosodically, in Mandarin (Xu 2004). Xu (2004) differentiated two types of focus in Mandarin, informational and contrastive focus. Informationally focused words are placed in the default, sentence-final position (see 3B), whereas words that are in contrastive focus are surrounded by the focus markers shi ... de (Xu 2004; Paul and Whitman 2008) (see 4A).

4.
A. Ta shi tan gangqin de
3SG shi play piano de
'He is a pianist.' (Paris 1998, p. 149)
Unlike English, phonological prominence (i.e., a prosodically prominent word) is not required in Mandarin when a sentence has broad focus, or when narrow focus is marked syntactically (Baker 2010; Paul and Whitman 2008; Xu 2004).
Acoustic analysis of prosodic focus in Mandarin showed that prosodically marked, narrow contrastive focus is expressed by an increased duration and expanded F0 range for the focused word, and by a lower maximum F0 and reduced F0 range for all following post-focused words, with little to no change to words in pre-focused positions (e.g., Wang and Xu 2006; Liu and Xu 2005; Xu et al. 2004; Yuan 2004; Xu 1999). According to Xu (1999, p. 95), "the full realization of a focus requires that the F0 of all words after the focus be suppressed." When this cannot be implemented, as when focus is on the last word, for example, utterances with final narrow focus may not be reliably differentiated from utterances with broad focus (Xu 1999; Cooper et al. 1985; Jin 1996) due to the lack of reliable F0 perceptual cues (Bunnell et al. 1997). Post-focus compression is a highly effective perceptual cue for focus in Mandarin (Xu et al. 2012; Liu and Xu 2005; Chen et al. 2009).
Prosodic realization of focus varies from tone to tone in Mandarin (e.g., Shih 1988; Xu 1999). For tones with a high pitch target (e.g., Tone 1), the pitch is raised under focus, but it remains unclear whether the pitch is lowered or remains the same for a focused low tone (e.g., Tone 2, Tone 3) (Shih 1988; Xu 1999; Chen and Gussenhoven 2008). Lee et al. (2016) found Tone 3 to be the least accurately identified (77.1%) in comparison to Tone 1 (90.8%), Tone 2 (90.2%), and Tone 4 (92.5%). The relatively low identification accuracy rate for Tone 3 is attributed to its smaller pitch range expansion, smaller increases in intensity, and increases in the duration, intensity, and pitch range of surrounding syllables within the same phrase due to a local dissimilarity effect (Lee et al. 2016).
However, focus perception accuracy is more affected by the position of the focus in the utterance than by tone, with sentence-medial focus being the most accurately identified (92.9%), followed by sentence-initial (87.2%) and sentence-final (75.5%) focus (Yuan 2004).
Few studies have examined the ability to process English phrase-level prosody by non-native listeners. This study fills this research gap by investigating Mandarin Chinese listeners' ability to determine focus locations in English interrogative intonation. Differences in focus marking preference between Mandarin and English could impair English focus identification among Mandarin listeners.

Non-Native Perception of Phrasal Prosody
The empirical evidence available suggests that native listeners process sentence accent and discourse implication conveyed by native prosody with ease (Cutler et al. 1997). However, few studies have investigated perception of non-native post-lexical prosody. Results of some of these previous studies suggested that listeners have difficulty perceiving non-native prosody (e.g., Cruz-Ferreira 1987;Akker and Cutler 2003;Baker 2010). Cruz-Ferreira (1987) examined comprehension of non-native intonation in Portuguese and in English, by native English and native Portuguese speakers, respectively, and found that the non-natives showed hesitation in assigning meaning to the L2 intonation patterns and performed significantly worse than the natives for intonation patterns that do not exist in their native language. Akker and Cutler (2003) found Dutch learners of English to be less efficient in processing accented and focused words in English than native speakers. More recently, Baker (2010) found that Korean and Mandarin learners of English had difficulty matching intended pitch accent to focus location. Particularly, they had difficulty identifying subject narrow focus and sentence broad focus. For the subject narrow focus context, they erroneously accepted a response with the nuclear-accented object to a question that asks about the subject. For example, they accepted the answer, "Kim bought a FAN.", with focus on the object as a felicitous response to a question narrowly focused on the subject, "Who bought a fan?" The fact that an object is nuclear-accented in a wide variety of contexts in English including the sentence broad focus, the verb phrase broad focus, and the object narrow focus may have led non-native speakers to extend the pattern to a wrong context (Baker 2010). Both Korean and Mandarin speakers also failed to reject an answer with the nuclear pitch accent on the subject as a response to a sentence with broad focus. 
For example, they accepted the answer "KIM bought a fan." as a felicitous response to the question, "What happened?", suggesting that non-native speakers may accept nuclear pitch accents (i.e., KIM) that are within the focused constituent (the entire sentence) even when they are not placed in the standard location (i.e., "fan") (Baker 2010). However, they were relatively more successful at rejecting an answer with incorrect nuclear pitch accent placement on the subject in the verb phrase broad focus context. For example, they accurately rejected the answer "KIM bought a fan." as a felicitous answer to the question "What did Kim do?" In this case, the nuclear accent (i.e., KIM) is neither in a common location nor within the focused constituent (the verb phrase), so such answers "may have seemed more clearly incorrect" (Baker 2010, p. 4).

This Study
Similar to Baker (2010), this study explores the ability to identify the location of focus in an utterance. Three focus types, namely broad sentence focus, narrow verb focus and narrow object focus, are examined. It complements Baker's (2010) study by investigating non-WH echo questions instead of WH-questions. Echo questions are questions that are structurally the same as the previous utterance (Artstein 2002; Repp and Rosin 2015). By using echo interrogative questions as stimuli, all three focus types can be examined in morphosyntactically identical utterances. In addition, "the entire echo question is given, so none of its parts needs to be marked with focus; therefore, focus can serve the purpose of indicating disputed (rather than new) material" (Artstein 2002, p. 98). Echo questions can be asked for various reasons including auditory failure or emotional arousal. For example, the speaker might request a clarification about the reference of the expression in the previous utterance (e.g., A: "She likes the baby."; B: "She likes WHO?"), the speaker might not believe what s/he heard and want to double-check, or the speaker might want to express his or her emotion (e.g., amazed, indignant) in response to what s/he heard. Intonational markings for different types of echo questions have been found to vary. Except for reference echo questions, it has been suggested that echo questions in German and English exhibit an obligatory rise at the end (H-H%), that the in situ wh-phrase is narrowly focused (e.g., Reis 2012), and that they carry either the (L+)H* or the L* nuclear accent. (L+)H* followed by H-H% typically signals an auditory failure, whereas emotional arousal due to disbelief or surprise is typically signaled by a L* accent followed by H-H% (Repp and Rosin 2015). In this study, the focus is on echo questions without wh-phrases (e.g., A: "She liked the sweater."; B: "She liked the SWEATER?") signaled by the L* H-H% intonation contour (see Figure 1).
Baker (2010) employed a question-answer prosodic matching task. The problem with such a metalinguistic judgement task is that the participants' interpretation of either the question or the answer is not known (Breen et al. 2010). Our study aimed to measure the participants' interpretation of the question. Specifically, our participants heard the question and were then asked to choose an appropriate response from three forced-choice alternatives written in standard English orthography without any indication of focus. In addition to obviating the need to train the participants on the meaning of prosody, this design allows the participants' interpretation of the question to be measured from their answer choice. Two main questions guided our research: (1) Are the three focus sizes (broad sentence focus, narrow verb focus and narrow object focus) equally differentiated? (2) Is the ability to differentiate the three focus sizes affected by how focus is marked in the native language? Based on Baker's (2010) findings, we hypothesized that verb narrow focus (VF) would be more accurately perceived than both object narrow focus (OF) and sentence broad focus (SF) among both groups of listeners due to the prevalence of nuclear-accented objects in English and the absence of PFC. In addition, consistent with Selkirk's focus projection principle, ambiguity between object focus and sentence focus, as well as between object focus and verb focus, was expected. However, with native experience, native listeners were expected to outperform non-native listeners.

Participants
Sixteen native Mandarin Chinese speakers (7 males) participated in this experiment. The mean age for this group of participants was 24.8 (SD = 3) years. None of them had been in the US for more than 2 years, and most participants had only been in the US for 5 months. All reported having at least an intermediate-level knowledge of English, most with 10-15 years of experience learning English, while 1 participant reported having 23 years of experience. All were graduate students at the University of Florida, where a minimum TOEFL score of 550 (paper-based) or 80 (internet-based), an International English Language Testing System (IELTS) score of 6, a Michigan English Language Assessment Battery (MELAB) score of 77, and a verbal Graduate Record Examination (GRE) score of 140 are required for admission. None reported prior knowledge of any other West Germanic languages.
Twelve (3 males, 9 females) native English speakers, also students at the University of Florida, participated in this experiment. The mean age for this group was 21.8 (SD = 2) years. None of the participants in either of the two groups reported having any known problems with their hearing, reading, or speech. None reported prior experience with Mandarin. All subjects gave their informed consent for inclusion before they participated in the experiment. The study was conducted in accordance with the University of Florida Institutional Review Board.

Materials
The stimuli were 25 sets of English yes-no echo question-answer pairs, as shown in (5). Only the questions were recorded and aurally presented to the participants. Q1 was produced with a broad sentence focus, Q2 with a narrow verb focus, and Q3 with a narrow focus on the object. The questions were produced and recorded (44,100 Hz sampling rate, 16-bit quantization) in Praat (Boersma and Weenink 2018) by a male native English speaker with phonetics training. Spectrograms, pitch tracks, and ToBI transcriptions of the three questions are shown in Figure 2. A total of 75 stimuli (25 questions × 3 focus types) were produced.

5.
Q1
F0 values of the target words in each utterance were automatically extracted using ProsodyPro, a custom-written script for the Praat program by Xu (2013). To avoid consonant onset F0 perturbation, F0 was extracted from the first full vocal pulse to the last vocal pulse of the vowel. In addition, F0 values were extracted separately for the first and the second syllables of the final disyllabic nouns. Figure 3 shows time-normalized pitch tracks for all three focus types averaged across 25 utterances each. A multivariate ANOVA indicated that: (1) the mean F0 of the first word (e.g., she) is significantly higher for OF questions than for VF questions (151 Hz vs. 135 Hz, p = 0.03); (2) the mean F0 of the verb (e.g., loved) is significantly lower for VF questions than for OF questions (117 Hz vs. 135 Hz, p = 0.025); and (3) the mean F0 values of word #3, the article (e.g., the), word #4, the adjective (e.g., new), and the first syllable of the object (e.g., sweat), or the first word of the two-word compound (e.g., bus), are significantly higher for VF questions than for OF and SF questions (ps < 0.001). However, the mean F0 of the last syllable (e.g., er), or the last word (e.g., stop), of the object is comparable across the three focus types (p > 0.10).
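Time normalization and averaging of pitch tracks, as used for Figure 3, can be illustrated with a generic Python sketch. This is not ProsodyPro's procedure, which is not described here; it assumes simple linear interpolation to a fixed number of points, and the function names, the default of 10 points, and the example values are hypothetical.

```python
def time_normalize(f0, n_points=10):
    """Linearly resample one F0 contour (in Hz) to n_points equally spaced samples."""
    if len(f0) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    out = []
    for i in range(n_points):
        pos = i * (len(f0) - 1) / (n_points - 1)  # fractional index into f0
        lo = int(pos)
        hi = min(lo + 1, len(f0) - 1)
        frac = pos - lo
        out.append(f0[lo] * (1 - frac) + f0[hi] * frac)
    return out


def mean_contour(contours, n_points=10):
    """Average several contours of different lengths after time normalization."""
    normalized = [time_normalize(c, n_points) for c in contours]
    return [sum(column) / len(column) for column in zip(*normalized)]


# Invented F0 samples for two tokens of different duration.
print(mean_contour([[100, 200], [200, 300, 400]], n_points=3))  # [150.0, 225.0, 300.0]
```

Resampling every token to the same number of points is what allows utterances of different durations to be averaged point by point.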

Procedures
Participants were tested individually in a quiet room. The three questions in each set were presented aurally in random order, one at a time, through headphones (Sennheiser HD 280 Pro) using E-Prime 2.0. After hearing each question, participants were instructed to choose one of the three possible answers (see A1, A2, and A3 in (5)) printed on an answer sheet in standard English orthography with no focus location indicated. A total of 75 questions (25 sets) were presented.
To familiarize the participants with the task, 6 practice trials preceded the experimental trials. Participants could repeat a trial by pressing the number 3 on the number pad, and they were asked to press 5 to move on to the next question. Responses were coded as correct (1) or incorrect (0) for statistical analyses. Data from one English participant were excluded as an outlier for poor performance (2 SD below the group mean).
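The scoring and exclusion steps can be sketched in Python as follows. The sketch assumes that the group mean and the sample standard deviation are computed over all participants in a group, including the candidate outlier, which the text does not specify. The function names and participant labels are hypothetical.

```python
from statistics import mean, stdev


def accuracy(responses):
    """Proportion correct for one participant, given 1/0 coded responses."""
    return mean(responses)


def flag_low_outliers(accuracy_by_participant, n_sd=2.0):
    """Return participants whose accuracy falls below mean - n_sd * SD.

    The cutoff is computed within one listener group. Using the sample SD
    and including every participant in the mean are assumptions here.
    """
    values = list(accuracy_by_participant.values())
    cutoff = mean(values) - n_sd * stdev(values)
    return sorted(p for p, a in accuracy_by_participant.items() if a < cutoff)
```

A call such as `flag_low_outliers(english_group)`, where `english_group` maps participant labels to proportions correct, would return the labels to exclude before the statistical analysis.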

Results
Log-transformed proportions of response accuracy for each focus type by the two groups of listeners are shown in Figure 4. Overall, native English speakers outperformed native Mandarin speakers on all three focus types. For both groups, VF was the most accurately perceived, whereas OF was the least. Tables 1 and 2 below show the confusion matrices for the three focus types for English and Mandarin speakers, respectively. English speakers were more accurate at detecting VF than either OF or SF, and they frequently confused SF with OF and vice versa. Mandarin speakers were less accurate than English speakers overall but, in contrast to English speakers, they identified VF and SF equally accurately, whereas their identification of OF was the worst. Like English speakers, they frequently confused OF with SF and vice versa.
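Confusion matrices of the kind reported in Tables 1 and 2 can be tabulated from trial-level data as in the minimal Python sketch below. It assumes that each trial is stored as an (intended focus, chosen focus) pair; the labels and function names are hypothetical, and no values from the tables are reproduced.

```python
from collections import Counter

FOCUS_TYPES = ("SF", "VF", "OF")  # sentence, verb, and object focus


def confusion_matrix(trials):
    """Count responses per (intended, chosen) pair.

    trials is an iterable of (intended, chosen) labels from FOCUS_TYPES.
    Rows are the intended focus type and columns the chosen one.
    """
    counts = Counter(trials)
    return {
        intended: {chosen: counts[(intended, chosen)] for chosen in FOCUS_TYPES}
        for intended in FOCUS_TYPES
    }


def row_proportions(matrix):
    """Convert each row of counts to proportions of that row's total."""
    proportions = {}
    for intended, row in matrix.items():
        total = sum(row.values())
        proportions[intended] = {
            chosen: (n / total if total else 0.0) for chosen, n in row.items()
        }
    return proportions
```

The diagonal cells give the accuracy for each focus type, while off-diagonal cells such as row OF, column SF show how often one focus type was mistaken for another.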
A generalized linear mixed-effects (logistic regression) model was fitted to the response accuracy data (1 = correct, 0 = incorrect) in R (R Development Core Team 2008) using the package lme4 (Bates et al. 2015). The maximal model that converged included random intercepts for subjects and items: full.model <- glmer(Response ~ factor(L1_background, levels = c("Mandarin", "English")) * factor(Focus_type, levels = c("sentence", "object", "verb")) + (1 | Subject) + (1 | Item), data = datafile, family = binomial).
The results (Table 3) show that, for both groups combined, object focus perception is significantly less accurate than sentence focus perception (sentence focus being the default focus type in the model). As shown in Tables 1 and 2, these results are largely driven by the performance of Mandarin speakers. Unlike English speakers, they are less accurate on object focus than on sentence focus. The analysis also confirmed that English speakers are significantly more accurate than Mandarin speakers in identifying both object focus and verb focus.
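To make the role of the default levels concrete, the Python sketch below shows how fixed-effect coefficients from a logistic model of this form map onto predicted accuracy. It assumes R's default treatment coding, with Mandarin and sentence focus as reference levels, and sets the random intercepts to zero. The coefficient names are hypothetical, and no estimates from Table 3 are used.

```python
import math


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_accuracy(coefs, l1, focus):
    """Fixed-effects prediction for one L1 group and one focus type.

    coefs holds log-odds coefficients under treatment coding; the keys
    ("intercept", "English", "object", "verb", "English:object",
    "English:verb") are names chosen for this sketch.
    """
    eta = coefs["intercept"]  # Mandarin listeners, sentence focus
    if l1 == "English":
        eta += coefs["English"]
    if focus != "sentence":
        eta += coefs[focus]  # focus effect for the reference (Mandarin) group
        if l1 == "English":
            eta += coefs[f"English:{focus}"]  # adjustment for English listeners
    return inv_logit(eta)
```

Under this coding, the coefficient for object focus describes the reference L1 group, and the interaction term expresses how far the English group departs from it. That is one way to read the observation that the object-focus effect is driven largely by Mandarin listeners.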

Discussion
Native speakers of Mandarin and native speakers of American English were evaluated for their ability to recognize sentence broad focus, verb narrow focus, and object narrow focus in English yes-no echo questions. Both verb narrow focus and object narrow focus were produced with a low (L*) nuclear pitch accent, and the sentence broad focus bore a default L* nuclear pitch accent on the object. Native speakers of English were more accurate than native Mandarin speakers overall, particularly on verb and object narrow focus, but not on sentence broad focus.
However, both groups struggled to differentiate between the sentence-final object narrow focus and the sentence broad focus. This finding agrees with that of Liu and Xu (2005), who found that Mandarin speakers frequently confused neutral focus in questions with final focus in statements. It is also consistent with the observation that interpretation of the focused constituent in sentence-final position is often ambiguous, not only in English, but also in other languages including German, Dutch, Spanish, European Portuguese and Italian (Nooteboom and Kruyt 1987; Birch and Clifton 1995; Welby 2003; Zubizarreta 2014; Frota 1997; Avesani and Vayra 2003). For example, Welby (2003) found that native English listeners judged a sentence like "I read the DISPATCH." with acoustic prominence on 'dispatch' as a felicitous response to a question narrowly focusing the object, "What newspaper do you read?", as well as to a question broadly focusing the entire event, "How do you keep up with the news?" Similar results were reported by Birch and Clifton (1995). These findings are predicted by the focus projection principles proposed by Selkirk (1984, 1995). A similar claim was also made by Gussenhoven (1983b, 1999). Specifically, Selkirk (1995) argued that, through the focus projection process, an acoustic prominence on the head of a phrase or its internal argument can project to the entire phrase, thus making the entire phrase focused. The focus projection process also appears to account for the confusion between narrow object focus and broad sentence focus among both Mandarin and English speakers in our study.
A lack of perceptible difference between object focus and sentence focus may have also been responsible for the result. Gussenhoven (1983b) found that a perceptible difference between narrow and broad focus exists in some productions, but that listeners cannot reliably use it to tell in which context (narrow or broad focus) the sentence was produced. In contrast to Gussenhoven's finding, Rump and Collier's (1996) listeners could accurately discriminate narrow and broad focus using pitch cues. Unfortunately, our English and Mandarin listeners could not rely on pitch cues to discriminate between object narrow focus and sentence broad focus. As shown in Figure 2, no significant differences in pitch values were found for any portion of the question uttered in the narrow object focus and the broad sentence focus contexts. In addition, Bunnell et al. (1997) found F0 to be less effective in focus perception at the end of the utterance. Furthermore, Xu (1999) argued that the full realization of a focus requires that the F0 of all words after the focus be suppressed. A lack of post-focus compression in utterances with focus on the last word can, therefore, lead to perceptual confusion between utterances with final narrow focus and utterances with broad focus (Cooper et al. 1985; Jin 1996).
Besides a lack of difference in acoustic prominence, the prevalence of nuclear pitch accent on the object in English may have also contributed to the confusion between object and sentence focus. As Baker (2010) pointed out, a nuclear pitch-accented object occurs in subject broad focus, verb broad focus and object narrow focus contexts. It appears that natives and non-natives alike easily mistake one for the other, and that alterations in acoustic prominence produced by the speaker were insufficient and/or were treated as irrelevant for the metrical structuring of the two focus types.
The confusion between object focus and sentence focus is also consistent with the functional view of intonation (Xu 2005; Liu and Xu 2005). According to this view, components of intonation or speech melody are individually and independently defined and organized by their communicative functions (e.g., lexical stress, focus, sentence type). More importantly, these functions are encoded in parallel, each with its own distinctive encoding scheme specifying its surface acoustic/articulatory values. However, due to limitations in acoustic/articulatory dimensions and space (e.g., F0, intensity, duration), interactions among the encoding schemes of these functions frequently occur, leading to a delicate balance between functions that share the same articulatory/acoustic parameters, as is the case with sentence-final focus and question. In turn, ambiguity or confusion may result from overlapping articulatory/acoustic characteristics between two or more communicative functions. According to this framework, the ambiguity between object focus and sentence focus found in our study may have resulted from interference in the transmission of the acoustic/articulatory features associated with two communicative functions, namely questions and focus. Specifically, the F0 rise at the end of the utterance required to signal a question obscures or raises the low pitch target on the object (see Figure 2) and reduces its perceived prominence.
However, the fact that native English speakers were more accurate than Mandarin speakers on object narrow focus relative to sentence broad focus suggests that native experience reduced the negative impact of the lack of acoustic distinctiveness between these two focus contexts that results from parallel transmission. More specifically, compared to native English speakers, Mandarin speakers are less familiar with associating a low F0 value with focus. In addition, unlike in English, no prosodically marked word is required in Mandarin, particularly when a sentence has broad focus, or when narrow focus is marked syntactically (Baker 2010). Moreover, when focus is prosodically marked in Mandarin, it raises rather than lowers the pitch of high tones. For low tones, it remains inconclusive whether focus decreases their low pitch targets.
Finally, the lack of ambiguity between verb focus and object focus, as well as between verb focus and sentence focus, is inconsistent with the focus projection principle. According to Selkirk (1984) and Gussenhoven (1983b), ambiguity between an object-only focused reading and a reading where the entire verb phrase is focused is predicted in a clause containing a transitive verb with an accented direct object. Similarly, ambiguous interpretation between a verb-only focus reading and a reading where the entire event is focused is also expected. While support for the focus projection principles has been reported in previous studies (e.g., Birch and Clifton 1995; Gussenhoven 1983b; Welby 2003), our findings are inconsistent with these hypotheses. In our case, the focus projection process, both to and from a constituent containing a narrowly focused verb carrying an L* nuclear pitch accent, the rarest type of pitch accent (Jun 2006), appeared to have been blocked. That is, English and Mandarin listeners exhibited a preference for verb focus when it was prominently marked with a low pitch target. Whether this is also true for other types of pitch accent awaits further research.

Summary and Conclusions
This study compared how well native Mandarin and native English speakers can perceive prosodically marked focus. Listeners were asked to choose a correct response for twenty-five English yes-no echo questions produced with a sentence-medial verb narrow focus, a sentence-final object narrow focus, and a sentence broad focus, each associated with a low pitch accent (L*). Native English listeners were more accurate than native Mandarin listeners on verb and object narrow focus, but not on sentence broad focus. More importantly, both groups confused object narrow focus with sentence broad focus and vice versa. However, confusion between object focus and verb focus, and between verb focus and sentence focus, was relatively infrequent. These results suggest that, in some cases, (1) acoustic prominence on the head of a phrase or its internal argument can project to the entire phrase and make the entire phrase focused, and (2) parallel transmission of the two functions of intonation and cross-linguistic variation in focus marking (prosodically versus syntactically) contribute to perceptual inaccuracy, the latter among Mandarin speakers. Native exposure reduces but does not remove the effect of parallel transmission among native English speakers.