1. Introduction
In daily life, speech perception typically happens in noisy environments with other background sounds. These background sounds can cause physical interference with the speech signal that results in greater difficulty in perceiving speech. This physical acoustic interference between environmental noises and target speech is called energetic masking [1]. When speech is masked by other speech, there is additional perceptual interference that occurs, called informational masking [1, 2]. Due to this additional perceptual interference, the perception of speech in competing speech is typically more difficult than perceiving speech in non-speech noise [3, 4, 5]. Because speech perception does not typically occur in a quiet environment, and because competing speech typically does not share all acoustic properties with target speech, understanding the acoustic variables that influence competing speech perception and how they interact is critically important in understanding the nature of speech perception in an ecologically valid context. In the present study, the interaction of f0 and speaking rate is examined.
A number of variables have been examined that influence competing speech perception. These variables include the characteristics of the target speech that listeners are instructed to pay attention to and the characteristics of the masker speech, which is typically multi-talker babble, superimposed over the target speech to experimentally create an environment of competing speech perception.
One reason why the perception of target speech in masker speech is particularly difficult compared to speech-in-noise perception stems from the lexical competition that occurs due to masker speech. Brungart and Simpson [
6] observed lexical competition during competing speech perception using Coordinate Response Measure (CRM) stimuli [
7] that include combinations of numbers and colors presented on a grid on a monitor. In this task, a number and color combination is auditorily presented within another number and color combination masker phrase, and participants are tasked with selecting the proper combination of number and color from the set of possible options on the screen. Results on this task demonstrated that when participants made mistakes, the mistake typically matched the content of the masker speech, meaning the participant had trouble disambiguating between the target and masker, rather than trouble with perceiving the target speech signal. Additionally, removal of the response that corresponded with the masker speech signal from the set of possible response options on the screen led to a dramatic improvement in performance on the task, indicating again that most errors on the task were due to misidentifying the masker as the target. These results offer evidence that lexical competition between target and masker speech influences performance on competing speech perception tasks.
Brouwer and Bradlow [
8] used eye-tracking to study lexical competition during competing speech perception. Their experiment involved disyllabic nouns where target and competitor words matched in either onset or offset. Participants viewed displays containing the target (e.g., ‘candle’), onset competitor (e.g., ‘candy’), rhyme competitor (e.g., ‘sandal’), and a distractor (e.g., ‘lemon’) while hearing a target word and a competing word spoken simultaneously by different female speakers. The target was 2 dB louder than the masker. Results showed greater visual fixations on onset-matching competitors compared to distractors, regardless of the masker type. Additionally, rhyme-matching competitors attracted more fixations than distractors and were preferred over onset-matching competitors when the masker speech matched the rhyme. This indicates that background speech can cause lexical competition with target speech, and the onset competitor advantage that is typically expected in visual world eye-tracking research [
9] can be reduced when the rhyme competitor is played as masker background speech. These results demonstrate that lexical competition from background speech influences lexical processing of a target speech signal.
Additionally, lexical properties of the masker speech modulate its effectiveness as a masker. For example, previous research has found that meaningful masker speech masks target speech more effectively (i.e., causes more interference) than nonmeaningful masker speech [10, 11]. Furthermore, masker speech containing high-frequency words has also been found to mask target speech more effectively than masker speech containing low-frequency words [12].
While previous research has identified a clear role of lexical competition from the masker speech signal during competing speech perception, much less is understood about which acoustic differences or similarities between target and masker influence competing speech perception.
Some previous research seems to suggest that increased acoustic similarity between target and masker speech is a major detriment to target speech perception. One acoustic cue that has been well-studied in the competing speech perception domain is f0. Previous research has found that when target and masker speech mismatch in speaker sex, which typically corresponds to differences in f0 and formant frequencies, the masker is less effective at masking the target speech signal [
13,
14,
15]. This phenomenon was demonstrated in Brungart [
13] through the use of the Coordinate Response Measure (CRM) task, with target speech played within single-talker masker speech that was produced by the same talker as the target speech, by a different talker of the same sex as the target speaker, or by a different talker of a different sex from the target speaker. Participants were more accurate at segregating the target speech signal from the masker speech signal when the masker was of a different sex from the target compared to the same sex, and participants were more accurate at perceiving the target speech signal when the masker speaker differed from the target speaker, even if the masker speaker was of the same sex as the target speaker. When using multi-talker (3- and 4-talker) masker babble, similar results were found, with target speech perception best when target and masker speech differed in sex and worst when target and masker speech were produced by the same talker [
14]. These studies indicate that f0 differences between target and masker speech can greatly improve competing speech segregation and offer release from masking.
This phenomenon of mismatches in sex between target and masker offering release from masking also occurs when f0 is artificially changed to approximate sex differences [16]. In that study, speech from the same talker was used as the target and masker speech. The masker speech was composed of only single-talker speech, not multi-talker babble. f0 differences between target and masker were artificially manipulated to fall between 0 and 12 semitones. Results indicated gradual improvement in target sentence perception as the difference in f0 between target and masker increased. This improvement, even at the largest f0 difference (12 semitones), was not as large as the improvement in the perception of target speech masked by a single masker talker of a different sex than the target.
Similarly, previous research using different sentential stimuli has also found that masker speech in a different pitch range from target speech is less effective at masking compared to speech in similar pitch ranges, and that greater differences between target and masker pitch lead to progressively less effective masking [17]. In that study, the f0 of syntactically correct, semantically anomalous sentences was artificially manipulated to 100, 103, 106, 109, 120, and 200 Hz for the target speech sentences. The masker speech was a continuous speech stream manipulated to an f0 of 100 Hz. Accuracy scores on a sentence repetition task indicated that error rates generally decreased as the difference in f0 between target and masker increased. When using natural productions of f0 to create distinct pitch ranges instead of artificially manipulating the pitch range (the low f0 stimuli comprised a male’s normal pitch utterances, while the high f0 stimuli comprised the same male speaker’s utterances when he attempted to imitate a female speaker’s pitch), results again indicated that error rates were lowest when target and masker differed in f0 range compared to when they were from a similar f0 range. Additionally, f0 contour differences (either flat, normal, or exaggerated speaking style) between target and masker speech decrease the effectiveness of masker speech at masking the target [18].
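The two manipulations described above are reported in different units (semitones in one study, Hz in the other), so a small conversion can make them easier to compare. The following is a minimal sketch, assuming the standard definition of a semitone as one twelfth of an octave; the function name `semitone_difference` and the constants are illustrative and do not come from the cited studies.

```python
import math


def semitone_difference(f_target_hz: float, f_masker_hz: float) -> float:
    """Signed distance in semitones from the masker f0 to the target f0.

    Uses the standard relation: 12 semitones per doubling of frequency.
    """
    if f_target_hz <= 0 or f_masker_hz <= 0:
        raise ValueError("f0 values must be positive")
    return 12.0 * math.log2(f_target_hz / f_masker_hz)


# Values taken from the text: a 100 Hz masker and six target f0 levels.
MASKER_F0_HZ = 100.0
TARGET_F0_HZ = [100.0, 103.0, 106.0, 109.0, 120.0, 200.0]

# Map each target f0 to its separation from the masker in semitones.
separations = {f0: semitone_difference(f0, MASKER_F0_HZ) for f0 in TARGET_F0_HZ}

if __name__ == "__main__":
    for f0, st in separations.items():
        print(f"{f0:5.0f} Hz vs {MASKER_F0_HZ:.0f} Hz: {st:5.2f} semitones")
```

Under this conversion, the 200 Hz target sits exactly one octave (12 semitones) above the 100 Hz masker, which corresponds to the largest separation used in the semitone-based manipulation.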
These previous studies indicate that differences in f0 properties between target and masker speech robustly aid in the segregation of a competing speech signal, regardless of whether the differences arise from natural productions or artificial manipulation.
While the role of f0 has been studied in competing speech perception, little research has identified the role of speaking rate in masker effectiveness. Calandruccio et al. [
19] indirectly investigated the role of speaking rate in competing speech perception by testing the perception of English clear and conversational target speech within English and Croatian clear and conversational masker speech. Due to the slower speaking rate of clear speech compared to conversational speech, this study partially tests the role of speaking rate, although other acoustic differences exist between clear and conversational speech beyond just speaking rate (such as differences in vowel reduction, consistency in release of stop bursts, and RMS intensity for obstruent sounds [
20]), meaning any findings of the study cannot be exclusively attributed to speaking rate. Participants in the experiment listened to female target speech spoken in either conversational or clear speaking style within female two-talker masker babble composed of either English clear speech, English conversational speech, Croatian clear speech, or Croatian conversational speech. Participants were not familiar with Croatian. Accuracy scores in a sentence repetition task, in which participants repeated aloud what the target speaker had just said, indicated that while clear target speech has an expected perceptual advantage over conversational target speech, masker speaking style (clear versus conversational) did not influence masker effectiveness. While these data suggest slower speaking rates in clear speech may improve intelligibility of target speech during competing speech perception, the clear speech benefit cannot be attributed solely to speaking rate given other potential acoustic differences between clear and conversational speech.
Speech rhythm has also been shown to play a role in competing speech perception. Using artificially manipulated speech rhythm irregularities in target and masker speech, McAuley [
21,
22] found that masker speech with an irregular rhythm led to improved target speech perception. However, when f0 differences were available as a cue, speech rhythm irregularities no longer mattered. Within these studies, the rhythm irregularities in the stimuli were created by speeding up and slowing down the speech in a sinusoidal pattern in order to create a disruption from the natural sentence timing.
Studies have also been conducted on speech rhythm differences in target and masker speech by studying languages with differing rhythm classes [
23,
24]. Rhythm classes are a means of categorizing languages based on rhythmic properties, but the exact criteria that are used when classifying languages have been highly debated in the literature [
25,
26,
27,
28]. Two of the three main rhythm classes that have been identified are stress-timed languages for which stress is said to occur at equal intervals and syllable-timed languages for which syllables tend to occur at equal intervals. Studies have found that languages that allow for complex syllables or vowel reduction are typically stress-timed languages, and languages that allow for only simpler syllable structures and no vowel reduction are typically syllable-timed languages [
29]. Acoustic studies (e.g., [
28]) have indicated that rhythm classes can sometimes be identified through the duration of intervals between vowels and between consonants. However, Arvaniti [
25,
26] claims that cross-linguistic differences between rhythm classes are not reliably supported based on acoustic measures and argues instead that rhythm research should focus on prominence and grouping measures of languages as opposed to the more traditionally defined stress-timed and syllable-timed type distinctions.
For the languages of the current study (English, Dutch, and French), previous research largely points to English and Dutch sharing properties traditionally corresponding to stress-timed languages, while French has properties traditionally corresponding to syllable-timed languages. Ramus et al. [
28] found that when plotting the proportion of vocalic intervals and the average standard deviation of consonantal intervals, English and Dutch clustered together, while French was plotted in a different cluster farther away, suggesting that English and Dutch belong to a different rhythm class from French. Similarly, when studying prominence features across languages, Dutch and English share similar prosodic structures [
30] and have lexical stress [
31,
32], while French uses prosody differently when cueing focus [
33] and has phrasal prominence instead of lexical stress [
34]. Overall, given these characteristics of English, Dutch, and French rhythm and prominence, English and Dutch share many similarities that are distinct from French. For the purposes of the present study, the terms stress-timed (for English and Dutch) and syllable-timed (for French) will be used to represent these differences between languages, although it should be noted that these terms are not agreed upon by all researchers.
Reel and Hicks [
24] compared the perception of English (a stress-timed language) target speech within varying masker languages that either match or mismatch English in rhythm class. When comparing the effectiveness of English, German, French, Spanish, and Japanese maskers at masking English target speech, they found that French and Spanish, which do not share a rhythm class with English, were less effective at masking compared to German and English, which do share a rhythm class with the English target. However, they additionally found that Japanese, which does not share a rhythm class with English, was as effective as German at masking English. This could be due to Japanese sharing some properties with stress-timed languages such as English and German, or it could point to a more complex relationship between masker effectiveness and language properties than rhythm class alone.
Previous research robustly finds that when the masker speech is in a language unknown to the listener, it is less effective at masking target speech than masker speech in a known language [
8,
23,
24,
35,
36,
37,
38,
39,
40,
41,
42]. Additionally, Calandruccio et al. [
39] showed that unknown languages that are similar to the target (Dutch masker for English target) mask more effectively than dissimilar languages (Mandarin masker for English target). Results indicated that English maskers were most effective at masking English target speech, with Dutch the next most effective and Mandarin the least effective.
A potential confound in the study by Calandruccio et al. [39] was that the Mandarin masker babble had less high-frequency energy than the English and Dutch maskers. A follow-up study was conducted with new maskers consisting of noise files that were spectrally matched to the English, Dutch, and Mandarin babble from the initial experiment. This allowed for the examination of spectral differences between English, Dutch, and Mandarin while removing any low-frequency temporal modulation differences that may have been present in the initial experiment. Additionally, another set of maskers was created with white noise that matched the low-frequency temporal modulations from the English, Dutch, and Mandarin babble maskers from the original experiment, which allowed for the study of the impact of temporal information on masker effectiveness. The use of white noise meant the spectral differences that typically appear for English, Dutch, and Mandarin were not present, so any differences found in masker effectiveness could be attributed to the low-frequency temporal modulations. Results indicated that the Mandarin masker babble’s spectral energy was less effective at masking the target compared to the English and Dutch maskers, meaning the results from the initial experiment do not fully represent differences in linguistic and phonetic properties of English, Dutch, and Mandarin. The fact that English and Dutch maskers were equally effective for the spectrally matched noise conditions suggests that something beyond just spectral differences drove the results from the initial experiment where English was a more effective masker than Dutch. The findings suggest that the ordering of English over Dutch over Mandarin in masker effectiveness could result from rhythm class differences between the languages.
Overall, these experimental results hint at a potential role of rhythm class in masker effectiveness. The findings indicate that unknown language maskers are less effective than a known language masker and differ from each other in terms of effectiveness.
Target and masker mismatch in terms of dialect, such as General American English masking Southern American English, also leads to less effective masking of target speech [
43]. Additionally, a foreign-accented masker is less effective in masking natively accented target speech [
44]. Finally, when comparing target speech perception within masker speech of the same language as the target (English) or a different language (Greek), which, crucially, is also known by the listeners, it has also been found that the target language masker is the more effective masker [
23]. This study tested monolingual English speakers’ and Greek–English bilinguals’ ability to segregate English target speech from either English or Greek two-talker masker babble. Results indicated that Greek masker speech was a less effective masker compared to English speech, even for the bilingual listeners who were familiar with both languages. Because the bilingual participants understood both the target and masker language in this study, their results cannot be explained by differences in lexical competition between the language match and mismatch conditions. Instead, the authors posited that masker effectiveness may be modulated by syllable rate or general rhythm pattern differences between target and masker speech spoken in different languages.
The consistent finding across these studies is that increased similarity between the target and masker hinders the segregation of target speech from masker speech, a pattern often described as the target-masker linguistic similarity hypothesis [
38]. The present study further explores the target-masker linguistic similarity hypothesis by systematically examining the influence of both fundamental frequency (f0) and speaking rate differences between target and masker speech. Specifically, it investigates f0 and speaking rate as distinct sources of variability. Prior research on earwitness testimony has highlighted significant differences in how memory biases affect these two acoustic features. When recalling a voice from memory, listeners tend to perceive lower f0 pitches as even lower and higher f0 pitches as even higher than they originally were. However, no such memory bias has been observed for speaking rate differences [
45]. Mullennix et al. [
45] attributed this to the transient nature of surface speech properties like speaking rate, which can vary significantly within a single speaker and thus provide an unreliable cue for speaker identification. In contrast, f0 exhibits less within-speaker variation, making it a more stable and reliable cue for distinguishing speakers. As a result, f0 is more likely to be stored in memory and subject to perceptual distortions, whereas speaking rate is not. Thus, while f0 serves as a memory cue for speaker identification, speaking rate does not, indicating that they may play fundamentally different roles in speech perception.
Beyond assessing the independent contributions of f0 and speaking rate differences between target and masker speech, this study also examines their combined effect to determine whether variations in both properties influence masker effectiveness differently than changes in just one. Additionally, the research introduces unfamiliar masker speech to investigate whether the absence of attention to fine-grained acoustic differences alters their importance in modulating speech perception during competition. This unfamiliar masker speech includes languages from the same and different rhythm classes as the target language, allowing for an analysis of how rhythm similarities affect speech perception in competing speech conditions.
To gain deeper insight into how multiple cues interact in competing speech perception, this study assesses whether cues not stored in memory for speaker identification, such as speaking rate, function differently from those that are, such as f0. This analysis will focus on their effectiveness in separating target speech from masker speech in a competing speech environment. Furthermore, the interaction between f0 and speaking rate may vary based on whether listeners are familiar with the masker speech and whether it shares a rhythm class with the target speech. Understanding these interactions will enhance our insight into the cognitive processes involved in competing speech perception.
In summary, while prior research has identified various factors affecting the effectiveness of masker speech, little is known about how multiple acoustic cues interact during competing speech perception. The present research comprises five experiments that systematically investigate the role of f0 and speaking rate in modulating target speech perception. The goal is to determine how these cues differentially influence listeners during competing speech perception due to their varying ability to signal a new speaker. Moreover, by incorporating unknown masker speech from different rhythm classes, this study examines whether the effects of speaking rate and f0 differ when listeners do not understand the masker speech and whether rhythm class or speaking rate plays a more significant role in shaping competing speech perception. The findings will contribute to a broader understanding of how multiple acoustic cues interact in speech perception, ultimately informing our knowledge of the factors influencing real-world speech processing in noisy environments.
Five experiments were conducted. Procedures and stimuli are similar across all experiments. They are described in detail for Experiment 1; subsequently, any differences in procedures as well as specific stimulus information are introduced for each experiment. Experiment 1 tested whether having a similar f0 range for target and masker speech results in more effective masking compared to having a distinct f0 range.
6. Experiment 5: f0 and Speaking Rate with French Masker Speech
Another factor that may influence competing speech perception is rhythm class. Reel and Hicks [24] suggest that the rhythm class of a language may influence its effectiveness at masking target speech. Similarly, Calandruccio and Zhou [23] found that Greek–English bilinguals still showed greater masker effectiveness when the target and masker were the same language even when listeners understood both languages, either because of rhythm class differences between English and Greek or because of differences in speaking rate between the two languages. Since that study’s methodology could not distinguish between those two possible explanations, the present study was designed to explicitly test whether masker effectiveness is driven more by speaking rate or by rhythm class properties.
In order to test this, masker effectiveness was compared for masker languages with the same stress-timed rhythm class as English (Dutch in Experiment 4) and masker languages with a different rhythm class from English (French in Experiment 5). French was selected as the masker language with a different rhythm class because it is a syllable-timed language [27]. In order to determine whether the present study’s stimuli demonstrate the expected stress-timed and syllable-timed differences among the three languages, %V (the proportion of vocalic intervals) and ΔC (the standard deviation of consonantal intervals), the two measures plotted by Ramus et al. [28], were plotted for a subset of 10 randomly selected English sentences produced by the target speaker, 10 randomly selected sentences produced by each Dutch masker, and 10 randomly selected sentences produced by each French masker, for a total of 50 sentences.
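To make the two rhythm measures concrete, the following minimal sketch computes %V and ΔC for one sentence from a list of labeled intervals. It assumes each sentence has already been segmented into vocalic ("V") and consonantal ("C") intervals with durations in seconds; the function name, the data layout, the toy durations, and the use of the population standard deviation are all assumptions made for illustration, not details reported for the present stimuli.

```python
from statistics import pstdev
from typing import List, Tuple

# One labeled interval: ("V" or "C", duration in seconds).
Interval = Tuple[str, float]


def rhythm_metrics(intervals: List[Interval]) -> Tuple[float, float]:
    """Return (%V, deltaC) for one segmented sentence.

    %V     : percentage of total sentence duration taken up by vocalic intervals.
    deltaC : standard deviation of consonantal interval durations
             (population SD here, which is an assumption).
    """
    vocalic = [dur for label, dur in intervals if label == "V"]
    consonantal = [dur for label, dur in intervals if label == "C"]
    if not vocalic or not consonantal:
        raise ValueError("need at least one vocalic and one consonantal interval")
    total = sum(vocalic) + sum(consonantal)
    percent_v = 100.0 * sum(vocalic) / total
    delta_c = pstdev(consonantal)
    return percent_v, delta_c


# Toy segmentation (hypothetical durations, not taken from the study).
toy_sentence: List[Interval] = [
    ("C", 0.08), ("V", 0.12), ("C", 0.16),
    ("V", 0.10), ("C", 0.06), ("V", 0.14),
]
percent_v, delta_c = rhythm_metrics(toy_sentence)

if __name__ == "__main__":
    print(f"%V = {percent_v:.1f}, deltaC = {delta_c:.3f} s")
```

Applying such a function to each of the 50 sentences would yield one (%V, ΔC) point per sentence, which is the kind of two-dimensional data that can be plotted by language and compared across groups.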
Figure 5 shows the differences across languages.
A one-way multivariate analysis of variance (MANOVA) was conducted to examine whether the combination of %V and ΔC differed depending on the language (English, Dutch, or French). Using Pillai’s trace, there was a significant multivariate effect of Language, V = 0.388, F (4, 94) = 5.65, p < 0.001.
Follow-up pairwise MANOVAs revealed that English and Dutch did not differ significantly (p = 0.132); however, French differed significantly from both English (p < 0.001) and Dutch (p = 0.002). These results suggest that French exhibits a distinct rhythm pattern compared to the other two languages, which follows expectations from previous literature.
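As a consistency check on the omnibus test, the reported degrees of freedom and F value can be recovered from Pillai's trace using the standard F approximation. The sketch below assumes p = 2 dependent variables (%V and ΔC), g = 3 language groups, and N = 50 sentences; the auxiliary quantities s, m, and n are the usual symbols of that approximation and are not taken from the original analysis.

```latex
\begin{align*}
s &= \min(p,\; g-1) = 2, \qquad
m = \frac{\lvert p-(g-1)\rvert - 1}{2} = -\tfrac{1}{2}, \qquad
n = \frac{(N-g) - p - 1}{2} = 22, \\
df_1 &= s\,(2m+s+1) = 4, \qquad
df_2 = s\,(2n+s+1) = 94, \\
F &= \frac{2n+s+1}{2m+s+1}\cdot\frac{V}{s-V}
   = \frac{47}{2}\cdot\frac{0.388}{2-0.388} \approx 5.66 .
\end{align*}
```

This agrees with the reported F (4, 94) = 5.65 up to rounding of V.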
The present study attempted to determine whether speaking rate influences masker effectiveness more for Dutch than for French masker speech, or whether the role of speaking rate does not depend on rhythm class. If masker effectiveness is driven more by speaking rate, both Dutch and French maskers would be expected to show similar effects of speaking rate on segregating the speech signal, regardless of rhythm class. If masker effectiveness is driven more by rhythm class properties, Dutch and French maskers would be expected to differ in how they are influenced by speaking rate; in particular, Dutch would likely cause more interference than French when target and masker speech have similar speaking rates. Alternatively, it is possible that neither speaking rate nor rhythm class effects appear in the results. Based on the target-masker linguistic similarity hypothesis, which states that more similarity between target and masker leads to greater masker effectiveness [38], it is hypothesized that there will be a greater influence of speaking rate for the same rhythm class language masker compared to the different rhythm class masker.
Additional differences between Dutch and French beyond just rhythm properties may involve their sound inventories and the similarity of these inventories to English. While all three languages share a number of overlapping sounds, there are also differences across the languages [
32,
60,
61]. While English and Dutch both allow for diphthongs, the exact diphthongs that appear differ across languages. Meanwhile, French does not allow for diphthongs. While English does not contain rounded front vowels, Dutch contains two, and French contains three. Both Dutch and French also contain uvular consonants that English does not have. Additionally, the same phoneme may also be produced in an acoustically different way depending on the language. For example, while all three languages distinguish between voiced and voiceless stops, voiced stops in English are sometimes realized as voiceless unaspirated stops in word-initial position [
62], but voiced stops in French and Dutch are typically realized as fully voiced [
32,
60]. Overall, English, Dutch, and French vary in similarity to each other in terms of multiple cues beyond rhythm properties, such as phoneme inventory (diphthongs, uvular consonants, rounded front vowels) and the phonetic realization of sounds (voiced versus voiceless unaspirated stops).
6.1. Participants
6.1.1. Speakers
Speakers included the same female native English speaker of a Midwestern dialect of American English who was the target speaker from Experiments 1–4. The masker speakers for this experiment were two female native French speakers (ages 32 and 29 years old) from France (one from Croix and the other from Grenoble).
6.1.2. Listeners
Participants included a subset of 45 of the native English speakers (17 females, 27 males, 1 preferred not to specify; mean age 36.2 years old) from Experiment 3 who were not familiar with French. Two-talker masker babble was presented in French, and target speech was presented in English.
6.2. Materials
Stimuli were composed of the same subset of the 1965 Revised List of Phonetically Balanced Sentences (Harvard Sentences) [47] as Experiment 3. All procedures for stimulus creation were identical to those of Experiment 3, except that the masker babble was in French instead of English. A set of 165 phonemically balanced French sentences inspired by the Harvard Sentences was used for the French babble [63].
Speaking rate was adjusted for all speakers to fall into a fast (6 syllables/second) and a slow (3 syllables/second) range using Praat [48]. The f0 of all utterances was also adjusted for all speakers to fall into a high f0 (225 Hz) and a low f0 (175 Hz) range using Praat [48].
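The speaking rate adjustment can be thought of as a duration scaling applied to each utterance. The following is a minimal sketch of that arithmetic only, assuming rate is measured as syllables per second over the whole utterance; the function names and the example utterance (9 syllables in 2.0 s) are hypothetical, and the sketch does not reproduce the specific Praat procedure used for the stimuli.

```python
def speaking_rate(n_syllables: int, duration_s: float) -> float:
    """Speaking rate in syllables per second for one utterance."""
    if n_syllables <= 0 or duration_s <= 0:
        raise ValueError("syllable count and duration must be positive")
    return n_syllables / duration_s


def duration_scale_factor(n_syllables: int, duration_s: float,
                          target_rate: float) -> float:
    """Factor by which to multiply the duration to reach target_rate.

    A factor below 1 compresses the utterance (faster speech);
    a factor above 1 stretches it (slower speech).
    """
    if target_rate <= 0:
        raise ValueError("target rate must be positive")
    return speaking_rate(n_syllables, duration_s) / target_rate


# Target rates taken from the text.
FAST_RATE = 6.0  # syllables/second
SLOW_RATE = 3.0  # syllables/second

# Hypothetical utterance: 9 syllables spoken in 2.0 s (4.5 syllables/second).
n_syll, dur = 9, 2.0
fast_factor = duration_scale_factor(n_syll, dur, FAST_RATE)  # 0.75
slow_factor = duration_scale_factor(n_syll, dur, SLOW_RATE)  # 1.5

if __name__ == "__main__":
    print(f"fast: x{fast_factor:.2f} -> {dur * fast_factor:.2f} s")
    print(f"slow: x{slow_factor:.2f} -> {dur * slow_factor:.2f} s")
```

Because the fast rate is twice the slow rate, the fast version of any utterance is half as long as its slow version under this scheme, regardless of the original duration.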
6.3. Procedure
The procedures for the experiment replicated those for Experiment 3. For 35 of the participants, the experiment was conducted remotely online using Gorilla Experiment Builder [53]. Ten additional participants completed the experiment at the KU Phonetics and Psycholinguistics Lab using procedures identical to those used for participants who completed the experiment remotely.
After completing Experiment 3, participants completed Experiment 5. Since the target speaker remained the same as in Experiment 3, no new familiarization phase was needed. Participants completed 80 experimental trials.
6.4. Results
A logistic mixed-effects regression model was run on keyword accuracy scores for the sentence transcription task using the same approach as in Experiment 4. It was hypothesized that when the target and masker match in both f0 and speaking rate, keyword accuracy will be the lowest, whereas when target and masker mismatch in both f0 and speaking rate, keyword accuracy will be the highest.
Table 5 contains the best fitting model results after backward likelihood-ratio testing (α = 0.05).
The final model retained fixed effects of f0 Match and Speaking Rate Match, and random slopes for both predictors by Subject and Item. While neither fixed effect was statistically significant in the final model (p > 0.05), model comparisons during backward fitting confirmed that both were necessary to retain, suggesting considerable inter-individual and item-level variability in their influence on accuracy. These results indicate that no statistically significant amount of variance in accuracy scores could be explained by the f0 Match or Speaking Rate Match status of the stimuli in French masker speech.
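One way to write out the structure of the final model described above is given below. This is a sketch rather than the exact specification: it assumes a binary keyword-level outcome y for subject s and item i, dummy-coded predictors F (f0 Match) and R (Speaking Rate Match) with match = 0 and mismatch = 1, and random intercepts alongside the random slopes; all symbols are introduced here for illustration.

```latex
\begin{align*}
\operatorname{logit} P(y_{si}=1)
  &= (\beta_0 + u_{0s} + w_{0i})
   + (\beta_1 + u_{1s} + w_{1i})\,F_{si}
   + (\beta_2 + u_{2s} + w_{2i})\,R_{si}, \\
(u_{0s}, u_{1s}, u_{2s}) &\sim \mathcal{N}(\mathbf{0}, \Sigma_{\text{Subject}}), \qquad
(w_{0i}, w_{1i}, w_{2i}) \sim \mathcal{N}(\mathbf{0}, \Sigma_{\text{Item}}).
\end{align*}
```

In this notation, the fixed effects β₁ and β₂ capture the average change in log-odds of a correct keyword under mismatch, while the subject and item terms allow that change to vary across listeners and sentences, which is the variability the retained random slopes point to.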
Figure 6 shows the results.
To determine whether the lack of an effect was due to failure to compare to the proper baseline, the logistic mixed-effects regression model was releveled to all possible baseline combinations for f0 Match and Speaking Rate Match and rerun. None of the models revealed any significant effects.
6.4.1. Comparison of Experiments 3 (English), 4 (Dutch), and 5 (French)
To identify whether performance differed by masker language, a logistic mixed-effects regression model was run on keyword accuracy scores for the sentence transcription task in Experiments 3, 4, and 5. Fixed effects included f0 Match (match vs. mismatch; reference = match), Speaking Rate Match (match vs. mismatch; reference = match), Language (English, Dutch, French; reference = Dutch), and their interactions. Random intercepts of Subject and Item were included, along with by-subject random slopes for f0 Match and Speaking Rate Match. Random slopes for Language by Subject were not included because Language was between-subjects in the combined dataset and therefore did not vary within individuals. Participants who completed both Experiment 3 and Experiment 4, or both Experiment 3 and Experiment 5, were treated as distinct subjects for each sub-experiment in this combined model in order to avoid conflating individual participants’ variability with variability resulting from experimental differences. By-item random slopes and interactions in the by-subject random slopes were omitted to achieve model convergence and interpretability. It was hypothesized that accuracy would be higher when the masker was an unknown language compared to English due to the lack of informational masking present when the masker language is unknown.
Table 6 contains the best fitting model results after backward likelihood-ratio testing (α = 0.05).
The significant positive simple effect of f0 Match indicates that accuracy for f0 Mismatch conditions was significantly higher than for f0 Match conditions when the masker language was Dutch. The lack of interaction between f0 Match and Language suggests this pattern remains the same for stimuli with English and French maskers. This means that regardless of the masker language (English, Dutch, or French), participants were more accurate when target and masker speech mismatched in f0 than when they matched. While Speaking Rate Match was retained during backward fitting, it did not reach significance in the final model, indicating that no statistically significant amount of variance in accuracy scores could be explained by the Speaking Rate Match status of the stimuli.
Figure 7 shows the results.
The significant negative simple effect of Language for English indicates that accuracy for English conditions was lower than for Dutch conditions when f0 matched. Likewise, the significant negative simple effect of Language for French indicates that accuracy for French conditions was lower than for Dutch conditions when f0 matched. The lack of interaction between f0 Match and Language suggests that both patterns remain the same for f0 Mismatch conditions. The model was releveled (baseline = English) to compare English and French, and a significant positive simple effect of Language was found (β = 0.337, SE = 0.107, t = 3.147, p = 0.002), indicating that accuracy for French conditions was higher than for English conditions for f0 Match conditions. The lack of interaction between f0 Match and Language suggests this pattern remains the same for f0 Mismatch conditions. These results indicate that accuracy was highest when identifying target speech within Dutch masker speech, lower when identifying target speech within French masker speech, and lowest when identifying target speech within English masker speech, regardless of whether f0 matched or mismatched for the targets and maskers.
Figure 8 illustrates the pattern.
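As a reading aid for the releveled French versus English contrast (β = 0.337, SE = 0.107), the sketch below shows how a log-odds coefficient and its standard error relate to a Wald statistic, a two-sided p-value, and an odds ratio. It assumes the reported statistic is a Wald statistic evaluated against a standard normal distribution, which is an assumption here (the text labels it t). The function name `wald_summary` is hypothetical, and only the rounded values printed in the text are used, so the output can differ slightly from the reported figures.

```python
import math


def wald_summary(beta, se):
    """Return the Wald statistic, two-sided normal p-value, and odds ratio.

    Assumes the coefficient is on the log-odds scale and that the test
    statistic is compared against a standard normal distribution.
    """
    z = beta / se
    # Two-sided tail probability of a standard normal:
    # P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    # Exponentiating a log-odds coefficient gives an odds ratio.
    odds_ratio = math.exp(beta)
    return z, p, odds_ratio


# Rounded values reported for French vs. English (baseline = English).
z, p, odds_ratio = wald_summary(beta=0.337, se=0.107)
print(f"z = {z:.3f}, p = {p:.4f}, odds ratio = {odds_ratio:.2f}")
```

The ratio 0.337 / 0.107 is about 3.15, close to the reported 3.147; a small difference is expected because the published β and SE are rounded. On the odds scale, exp(0.337) is roughly 1.40, meaning the odds of a correct keyword were about 40% higher with French maskers than with English maskers in the f0 Match conditions.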
To identify whether performance differed for participants tested online and in-person, the best fitting model for Experiments 3, 4, and 5 (Accuracy ~ f0 Match + Language + Speaking Rate Match + (1 + f0 Match + Speaking Rate Match|Subject) + (1|Item)) was compared to a model also containing Online status as a fixed effect (Accuracy ~ f0 Match + Language + Online + Speaking Rate Match + (1 + f0 Match + Speaking Rate Match|Subject) + (1|Item)). A single term deletion analysis conducted on the more complex model revealed that eliminating the Online term did not significantly impact model fit (χ²(1) = 0.072, p = 0.789). Participants therefore performed similarly on the task regardless of whether they completed the experiment in-person or remotely online.
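The single term deletion test for Online is a likelihood-ratio comparison with one degree of freedom. The sketch below illustrates how a p-value follows from such a statistic. It assumes the statistic is referred to a chi-square distribution with df = 1, and the function name `lrt_p_value_df1` is hypothetical. It does not refit the models; it only takes the rounded statistic from the text.

```python
import math


def lrt_p_value_df1(chi_square):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    A chi-square variable with df = 1 is the square of a standard normal,
    so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2))


# Rounded statistic reported for dropping the Online term.
p_online = lrt_p_value_df1(0.072)
print(f"p = {p_online:.3f}")
```

The result lands within rounding distance of the reported p = 0.789; any difference in the third decimal is attributable to the statistic itself being rounded to 0.072. For reference, the same function returns approximately 0.05 at the familiar critical value of 3.84.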
6.4.2. Discussion
The results of Experiment 5 (with French maskers) failed to reach significance for simple effects of f0 Match and Speaking Rate Match. The data do not show any difference in performance for different combinations of f0 and speaking rate for targets and maskers. Participants generally performed similarly regardless of the f0 and speaking rate characteristics of the stimuli.
The data again trend in the predicted direction of gradual increases in accuracy as the similarity between target and masker speech decreased. For example, when both f0 and speaking rate matched, accuracy was lowest (66%), but when only f0 or only speaking rate matched, accuracy was higher (68% and 69%, respectively). Finally, when both speaking rate and f0 mismatched, accuracy was highest (70%). These results again align with the target-masker linguistic similarity hypothesis [
38].
When conducting analyses on Experiments 3, 4, and 5 (English versus Dutch versus French), the data support the hypothesis that mismatch conditions yield an advantage over match conditions in competing speech perception, although the relationship seems to only depend on f0, not speaking rate. When target and masker share f0, regardless of whether it is high or low, accuracy in target speech perception within competing speech is lower than when target and masker mismatch in f0. In contrast, speaking rate did not explain a significant amount of variance in accuracy scores, meaning whether speaking rate matched or mismatched between targets and maskers could not predict performance on the task.
Overall, the data indicate a differential role of f0 compared to speaking rate, with f0 serving as the more salient cue that, when present, makes speaking rate unimportant. This pattern was observed when masker speech was in English and Dutch, but not when masker speech was in French. While the Experiment 5 data numerically patterned in the predicted direction, with performance highest for fully mismatching stimuli and performance lowest for fully matching stimuli, the simple effects of f0 Match and Speaking Rate Match did not reach significance when French maskers were used.
7. Discussion and Conclusions
The aim of this study was to examine the role of f0, speaking rate, and rhythm in competing speech perception. To achieve this, five sentence transcription tasks were conducted, in which target speech was presented within two-talker masker babble. In the first experiment, the voices of three female speakers were adjusted to have either high or low f0 values to determine whether an f0 mismatch between target and masker speech would facilitate speech perception. The second experiment similarly manipulated the speakers’ voices to have either a fast or slow speaking rate to assess whether a difference in speaking rate would reduce masking effects. The third experiment combined manipulations of both f0 and speaking rate to explore their interaction. In the fourth experiment, Dutch—a language with rhythmic properties similar to the target speech—was used as the masker to examine whether reduced lexical competition would aid speech segregation. Lastly, the fifth experiment introduced French, a language with distinct rhythmic properties from the target speech, to investigate the influence of rhythm on speech segregation.
Experiment 1 found clear evidence of release from masking when participants heard a target and masker mismatching in f0 compared to when they heard a target and masker with the same f0. Results from Experiment 2 indicated that when target and masker speech mismatched in speaking rate, accuracy was better than when they matched. This indicates f0 and speaking rate seem to behave in the same way during competing speech perception, and the results give further evidence for the target-masker linguistic similarity hypothesis [
38], which claims that as the similarity between target and masker increases, masking is more effective. This holds true for both f0 and speaking rate when examined separately.
The fast target was likely more difficult than the slow target in part because a slow speech signal is less cognitively taxing to disambiguate: it gives listeners more time as it unfolds to identify the target and process its contents.
In order to determine whether f0 and speaking rate interact, both f0 and speaking rate were manipulated for the target and masker speech in Experiment 3. The results for Experiment 3 indicated that only f0 seemed to matter when both f0 and speaking rate varied, as indicated by the simple effect of f0 Match without any simple effect or interactions for Speaking Rate Match. These results are in line with previous research by McAuley et al. [
22], which found that when a salient cue such as f0 was present, rhythm irregularities no longer influenced accuracy.
When comparing accuracy scores between Speaking Rate Match and Mismatch conditions for the subset of data with matching f0 values between target and masker speech, there was significantly higher accuracy for Speaking Rate Mismatch conditions. In contrast, for the subset of data with mismatching f0 values, there was no significant difference in accuracy scores depending on speaking rate properties. This indicates that speaking rate differences between target and masker speech only seem to matter when f0 cues do not distinguish the target from the masker.
Results from Experiments 1, 2, and 3 indicate that, while both f0 and speaking rate differences between target and masker speech offered release from masking when only one cue (either f0 or speaking rate) was manipulated, when both cues were manipulated, f0 and speaking rate behaved differently. f0, the cue that more effectively indicates a change in talker, became the only relevant cue to influence release from masking. In comparison, speaking rate, the cue that is not as effective in signaling a change in talker, did not seem to matter when f0 cues were present. These results suggest that the ability of a cue to identify a speaker may determine how much weight it carries when listeners try to segregate multiple speech streams.
While evidence for additive effects of both f0 and speaking rate in release from masking did not reach significance, the data numerically patterned in the predicted direction, with accuracy highest for target and masker speech combinations that were most different (mismatched in both f0 and speaking rate), accuracy in the middle for target and masker speech that differed in only one cue (mismatched in either f0 or speaking rate), and accuracy lowest for target and masker speech that differed in no cues (matched in both f0 and speaking rate). This general pattern does align with the predictions of the target-masker linguistic similarity hypothesis [
38] in that as the similarity between target and masker increases, accuracy decreases. This offers some degree of evidence that there can be additive effects of both f0 and speaking rate over just f0 or just speaking rate alone in aiding with speech segregation, although the lack of significance of these findings means no firm conclusions can be drawn.
To identify whether participants still pay attention to f0 and speaking rate in the same way when masker speech is in an unknown language, and thus does not cause additional lexical competition, Experiment 4 replicated Experiment 3 using Dutch masker babble instead of English masker babble. Dutch and English are both stress-timed languages, meaning both languages share similar rhythm properties. Results indicated that participants behaved similarly in their use of f0 and speaking rate cues as in Experiment 3 with English masker babble.
To determine whether the unknown language masker was less effective than the known language masker, the results of Experiments 3 and 4 were compared. Accuracy was higher for Dutch than for English in all conditions, as indicated by a simple effect of Language (β = −0.624, SE = 0.105, t = −5.968, p < 0.001) with no interactions between Language and either f0 Match or Speaking Rate Match. These findings may result from the lack of lexical competition and informational masking that the unknown language masker has compared to English. Additionally, it could be that there are other acoustic differences between Dutch and English that may make Dutch a less effective masker, such as differences in phoneme inventory.
Previous research focusing on quantifying the similarity between languages has used various measures to do so. For example, Eden [
64] measured distance between languages based on how many linguistic parameters differ between languages and used probabilistic models of one language to predict the upcoming sound or letter in another language. Eden [
64] used a multitude of parameters when comparing English, Dutch, and French, including phonotactic constraints and syllable structure properties, as well as vowel and consonant inventories. Any of these different parameters can contribute to the degree of masker effectiveness during competing speech perception. Based on these parameters, English and Dutch were classified as closer than English and French or Dutch and French. While these parameters did not involve many rhythm-related features of the languages, these measures still offer some means by which to quantify linguistic similarity between languages.
Experiment 5 replicated Experiment 3, except with French masker speech instead of English masker speech. The goal of this experiment was to identify whether a different unknown language masker with rhythm properties that differ more between target and masker speech would still cause participants to use f0 and speaking rate in the same way, with particular focus on how speaking rate’s influence may differ for languages of different rhythm classes. The results, however, demonstrated no significant effects, meaning neither the f0 nor the speaking rate of the stimuli predicted performance on the task. While no significant effects were found, results again numerically offered some evidence in favor of the target-masker linguistic similarity hypothesis [
38], with accuracy highest for the most distinct target-masker pairs (f0 and speaking rate both mismatching) and accuracy lowest for the most similar target-masker pairs (f0 and speaking rate both matching).
The lack of significant findings for Experiment 5 may result from French being a poor masker that participants can easily ignore regardless of the other acoustic cues available to aid in speech segregation. This would potentially mean participants can achieve good accuracy without needing to attend to acoustic cues such as f0 or speaking rate.
A model combining Experiments 3 (English masker speech) and 5 (French masker speech) established that the unknown language masker advantage found for Dutch also occurred for French. Specifically, the significant simple effect of Language without any interactions with f0 Match or Speaking Rate Match (
β = 0.339, SE = 0.108,
t = 3.133,
p = 0.002) indicated that participants were more accurate when the masker was an unknown language, French, compared to a known language. This again aligns with previous literature that unknown language maskers are typically less effective than known language maskers [
8,
23,
24,
35,
36,
37,
38,
39,
40,
41,
42].
Experiments 4 (Dutch masker speech) and 5 (French masker speech) were also compared. This analysis found no effect of Language (Language did not remain after backward fitting the model), indicating that while participants were more accurate with unknown masker speech (either Dutch or French) than with known masker speech (English), which unknown language served as the masker did not matter.
Further statistical analyses were conducted for Experiments 3, 4, and 5 to identify the role of masker background language in how rhythm and linguistic knowledge influence speech segregation. When data from Experiments 3 and 5 or from Experiments 4 and 5 were combined, the simple effect of f0 Match consistently remained (β = 0.411, SE = 0.107, t = 3.830, p < 0.001, and β = 0.400, SE = 0.153, t = 2.613, p = 0.009, respectively), and no effect or interaction of Speaking Rate Match was found (Speaking Rate Match did not remain after backwards fitting). Because nearly all models (those containing the results from Experiments 1, 3, 4, 3+4, 3+5, and 4+5) except for Experiment 5 alone contained the simple effect of f0 Match without any effect of speaking rate, the salient role of f0 in competing speech perception and the lack of role of speaking rate when f0 is present appear to be relatively robust patterns.
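To make the two f0 Match estimates above easier to compare, the sketch below computes approximate 95% Wald confidence intervals from the reported coefficients and standard errors. This is an illustration under stated assumptions, not an analysis from the study: it uses a normal approximation with a critical value of 1.96, works only from the rounded values in the text, and the names `wald_ci_95` and `intervals` are hypothetical.

```python
def wald_ci_95(beta, se, z_crit=1.96):
    """Approximate 95% Wald interval for a log-odds coefficient.

    Assumes a normal sampling distribution for the estimate; the
    critical value 1.96 is the usual two-sided 95% normal quantile.
    """
    half_width = z_crit * se
    return beta - half_width, beta + half_width


# Rounded f0 Match estimates (beta, SE) reported for the combined models.
f0_effects = {
    "Experiments 3 + 5": (0.411, 0.107),
    "Experiments 4 + 5": (0.400, 0.153),
}

intervals = {}
for label, (beta, se) in f0_effects.items():
    intervals[label] = wald_ci_95(beta, se)
    lower, upper = intervals[label]
    print(f"{label}: beta = {beta:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions both intervals lie above zero, which is consistent with the reported p-values, and they overlap substantially, which fits the description of the f0 Match effect as similar across the two combined datasets.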
These results become more complicated when running an analysis on Experiments 3, 4, and 5 together. In this analysis, there was an effect of language that indicated all three masker languages significantly differed in accuracy scores, with English maskers resulting in the poorest accuracy (most effective masking), French maskers resulting in the next poorest accuracy, and Dutch maskers resulting in the best accuracy (least effective masking). While English maskers were expected to be most effective, the advantage for Dutch maskers over French maskers was more surprising. Since Dutch and English share a rhythm class while French does not, it was expected that Dutch would be a more effective masker than French, but the opposite result was found. However, since these results did not hold when just comparing Dutch and French alone (Experiments 4 + 5), the finding may not be robust.
Another goal of the current study was to identify whether the role of speaking rate differed for Dutch masker stimuli compared to French masker stimuli due to their differences in rhythm properties. It was predicted that speaking rate differences between target and masker speech would be more impactful for Dutch masker speech compared to French masker speech because Dutch and English share rhythm properties as stress-timed languages that they do not share with French, a syllable-timed language. Because the study did not include a manipulation of speaking rate alone, without f0 manipulations, using Dutch and French maskers, any potential effects of speaking rate may have been overwhelmed by the f0 effects. A future study comparing Dutch and French maskers that only manipulates speaking rate may identify whether any differences in Speaking Rate Match occur.
Because any differences in participant performance between target stimuli within Dutch or French masker babble could be due to a variety of other factors beyond rhythm properties, including sound inventory, future studies may additionally want to compare another stress-timed language with a relatively distinct phoneme inventory to identify whether behavior differs for this language compared to Dutch.
Overall, the present research identified a clear role of f0 in segregating target speech from masker babble. While speaking rate can help in segregating target speech from masker babble, when a more salient cue that is more reliable in cuing the presence of a new talker, such as f0, is available, speaking rate, which is a less reliable cue for speaker identity, no longer seems to matter. When the masker babble is in a language unknown to listeners, such as Dutch or French, masking is less effective compared to when the masker is in a known language to listeners. The present study offers some evidence in support of the target-masker linguistic similarity hypothesis [
38], indicating that the greater the similarity between target and masker, the more effective the masker is at masking target speech. Due to f0’s greater ability to cue speaker identity, f0 cues were more effectively used by participants than speaking rate cues, which are typically less effective at signaling speaker identity, when both cues were available. Moreover, significant additive effects of f0 and speaking rate were not found. Instead, f0 effects were more influential than speaking rate effects. Future research should test a range of f0 and speaking rate differences against each other to determine more definitively whether f0 is always used as the primary cue over speaking rate when both are present, or whether the size of the f0 and speaking rate differences influences which cue listeners use.
This study is the first of its kind to attempt to identify the role of speaking rate differences in competing speech perception, offering motivation for continuing future research to identify how speaking rate may interact with other acoustic cues in speech segregation.
The present finding that f0 and speaking rate behave differently, likely as a result of their differences in ability to reliably signal the identity of a talker, also brings into question what other acoustic cues to speaker identity listeners may be attending to and how these cues rank against each other in order of importance when trying to identify a target speaker in noisy listening conditions. Future research further defining the identity cues that can be most reliably used by listeners to track a speaker in suboptimal listening conditions can offer insight into how to more effectively aid listeners in communicating in noisy environments.
The present study identified a differential role of f0 versus speaking rate in speech segregation based on the two cues’ ability to reliably signal the identity of a talker. More specifically, the current results suggest that while cues to speaker identity play an important role in speech segregation, listeners use whatever information is available in the speech signal unless a stronger cue is present. Moreover, the finding that listeners process background speech even when they do not know the language indicates that its acoustic characteristics matter. The focus on understanding the cognitive processes underlying speech perception in noisy ecologically valid contexts, such as in degraded conditions with background speech, allows for a better understanding of real-world speech processing.