1. Introduction
A foreign accent (FA) refers to the distinct pronunciation characteristics and prosodic patterns that manifest in a non-native speaker’s utterance of a second language (L2) (
Derwing and Munro 2015;
Uzun 2023). These characteristics stem from the influence of the speaker’s native language and can vary widely in prominence and nature. Interspeaker differences can pose a challenge for listeners and, more particularly, for students of a second language. These differences manifest in two key areas: variation in accent among non-native speakers who share the same first language (L1), and variation among non-native speakers from different L1 backgrounds. Both arise from a range of factors, including individual exposure to the target language, learning environment, and personal motivation (
Boduch-Grabka and Lev-Ari 2021;
Mackay et al. 2006;
Moyer 2007). Such complexity in FA can impact the effectiveness of language instruction and comprehension, emphasizing the importance of a nuanced understanding of FA in the pedagogy and practice of second-language learning.
The study of FA is integral to the field of second language acquisition, as it not only provides insights into the cognitive processes underlying language learning but also has practical implications (
Ioup 1984). The presence of an FA may impact a listener’s perception of the speaker’s credibility, fluency, or even identity, thus influencing social interactions and opportunities (
Adank et al. 2013). Furthermore, understanding and addressing FA can aid in the development of effective language teaching methodologies, assisting learners in achieving a higher level of communicative competence in their second language. The phenomenon of FA, therefore, occupies a significant role in both linguistic research and educational practice, linking the intricate mechanisms of language learning with broader social dynamics.
Traditionally, FA has been meticulously explored through a wide range of viewpoints in an attempt to capture its complex and manifold characteristics. A significant portion of this scholarly research has adopted a comprehensive or holistic approach, delving into the implications of FA in a broad sense (
Flege et al. 1995;
Munro and Derwing 1995;
Piske et al. 2001). Such studies have been instrumental in shedding light on the intricate relationships between FA and various factors that may influence it. For example, researchers have analyzed the impact of the age at which learning begins (
Asher and García 1969;
Chen et al. 2020), the period of study or exposure to the language, and the level of proficiency in the L2 (
Kang 2020). These variables have been shown to play vital roles in the way accented speech is both produced and perceived.
A complementary strand of literature has taken a more focused approach, examining the role of specific individual features in FA. Within this context, prosodic features, which include elements such as intonation (
Mennen 2015;
Van Els and De Bot 1987), rhythm (
Polyanskaya et al. 2017), and speech rate (
Munro and Derwing 1995), have been identified as pivotal in giving rise to accented speech. Furthermore, the segmental domain (that is, sounds and phonemes) has been recognized as another key factor in the production of foreign-accented speech (
Sereno et al. 2016). Researchers have delved into both vowels (
Chan et al. 2017;
Georgiou 2019) and consonants (
Chen et al. 2020;
Neuhauser 2011), uncovering how the mispronunciation of such sounds can be perceived as accented speech.
Research on FA has advanced considerably thanks to speech manipulation tools. These tools allow researchers to analyze and manipulate speech patterns and thus to isolate the subtle gradations of accented speech. The Pattern Playback machine, a pioneering instrument in its time, allowed researchers to convert visual patterns into sound, thereby laying the groundwork for systematic speech analysis. More recently, Praat (
Boersma and Weenink 2023) has emerged as an essential tool in the field, providing a way to manipulate speech in a controlled manner. Computational techniques such as diphone synthesis (
Taylor et al. 1998), Hidden Markov Model synthesis (
Kayte et al. 2015) or Deep Neural Networks (
Qian et al. 2014) have contributed significantly to the advancement of the field.
In the present study, we will explore the implications of segmental foreign accent. In particular, by using a corpus built with novel manipulation techniques such as the splicing technique (
García Lecumberri et al. 2014) and the gradation technique (
Pérez-Ramón et al. 2020), our research aims to provide insights into whether learners of a second language are able to discern a range of degrees of FA when produced by different speakers. The significance of exploring segmental foreign accents is well-founded, having been substantiated by prior investigations (
García Lecumberri et al. 2014;
Pérez-Ramón et al. 2022b,
2023). Previous findings align with certain outcomes of holistic studies of accent. For instance, it has been observed that a segment pronounced with a noticeably strong FA does not necessarily lead to a decrease in overall intelligibility (
Munro and Derwing 2020). Moreover, there are more specific conclusions that have emerged, such as the intriguing implication that the phonological representations of certain mispronounced segments may exert a more pronounced influence on intelligibility than others (
Imai et al. 2005;
Pérez-Ramón et al. 2022b).
For the purpose of our research, we will focus on the pronunciation of English vowels as articulated by native Spanish speakers. This focus is motivated by the substantial differences between the vocalic systems of English and Spanish. While the Spanish vowel system is relatively simple, comprising only five vowels [a, e, i, o, u] with no length distinctions, the English vowel system is far more complex [ɜː, æ, ɑː, ɪ, iː, ɒ, ɔː, ʊ, uː] (as considered in this study, which uses the Southern British English variety) and uses duration as a distinguishing feature, which can lead to intelligibility conflicts (
Nygard 2006). Furthermore, individual difficulties with vowels such as [ɪ] and [æ] (
Flege et al. 1995) have been reported, and
Franklin and Stoel-Gammon (2014) offer a comprehensive analysis of the intelligibility conflicts that may arise from the Spanish-like realisation of the vowels chosen for this study. These divergent structures compel Spanish speakers to map their existing vowel system onto the more complex English system, which can lead to recognizable confusion and, more importantly, significant intelligibility interference (
Franklin and Stoel-Gammon 2014). This interference involves not only a perceivable accent but also specific pronunciation patterns that may obscure the intended message, impeding clear communication. The intricacy of the English system juxtaposed with the simplicity of the Spanish one has led to challenges in the phonological translation between the two. Analyzing these specific challenges in pronunciation helps to provide deeper insights into not only the nature of language acquisition but also the unique phonetic characteristics that differentiate these two languages.
The primary question driving our research is whether learners of an L2 are capable of discerning various degrees of accent across different speakers. To explore this question more deeply, we will conduct an AXB discrimination experiment. Within this framework, listeners will be exposed to two voices, one male and one female, that have been manipulated and fine-tuned using the previously mentioned techniques. These voices will pronounce the core vowel in English one-syllable words with five levels of Spanish accent, ranging from a completely non-native, Spanish accent to a completely native Southern British English accent.
The experiment involved two distinct groups of participants: one group with Spanish as their L1 and another with Japanese as their L1. These two groups were selected because both languages have a similar system of five vowels, which contrasts with the much larger inventory of English. The most salient spectral difference between Spanish and Japanese vowels is that Japanese /u/ is unrounded ([ɯ]), whereas Spanish /u/ is rounded ([u]). Otherwise, the five vowels are similarly distributed in the vocalic space of the two languages. The two languages, however, differ in how vowels are realised. The Spanish system does not differentiate between long and short vowels, a feature that is used in English; Japanese, by contrast, has a moraic rhythm, in which long vowels count as two morae and can distinguish words (e.g.,
oba おば vs.
obaa おばあ) (
Bion et al. 2013). Additionally, in Japanese, vowels [ɯ] and [i] can be devoiced in certain contexts (e.g.,
suki すき [sɯ̥ki]). This design will allow us to evaluate the significance of having a matching first language when it comes to discerning accents.
Section 2 of this study will outline the methodology, and
Section 3 will present the experiment’s results, accompanied by a comprehensive statistical analysis.
Section 4 will explore possible answers to the research questions posed and
Section 5 will conclude with insights into the implications of segmental foreign accents in language teaching, as well as recommendations for future research in this field.
3. Results
In this section, the results provided by the participants will be analyzed. Unless otherwise specified, the models were generalized linear mixed models generated using the
lme4 package (
Bates et al. 2015) in R (
R Core Team 2023), and post hoc pairwise comparisons were computed using the
emmeans package (
Lenth 2023).
3.1. Pre-Processing
Since the experiment was self-paced and participants completed the task at home or in other locations not controlled by the researcher, trials with a reaction time over 5000 ms were removed from the dataset. This led to the removal of 2.62% of the answers provided by the participants.
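The exclusion rule described above can be sketched in a few lines. This is a minimal illustration in Python rather than the pipeline actually used (the analyses in this paper were run in R); the field name `rt_ms` and the sample trials are hypothetical.

```python
RT_CUTOFF_MS = 5000  # threshold stated in the text


def filter_trials(trials, cutoff_ms=RT_CUTOFF_MS):
    """Keep trials whose reaction time does not exceed the cutoff.

    Returns the kept trials and the percentage of trials removed.
    """
    kept = [t for t in trials if t["rt_ms"] <= cutoff_ms]
    if not trials:
        return kept, 0.0
    removed_pct = 100 * (len(trials) - len(kept)) / len(trials)
    return kept, removed_pct


# Hypothetical data: four trials, one of them slower than the cutoff
trials = [
    {"participant": "p01", "rt_ms": 1200, "correct": 1},
    {"participant": "p01", "rt_ms": 5300, "correct": 0},
    {"participant": "p02", "rt_ms": 4999, "correct": 1},
    {"participant": "p02", "rt_ms": 5000, "correct": 0},
]
kept, removed_pct = filter_trials(trials)
print(len(kept), removed_pct)
```

Because the text removes trials "over 5000 ms", the sketch keeps a trial at exactly 5000 ms; whether the boundary was inclusive in the original analysis is an assumption.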
In our analysis, we explored two potential influences on participant responses: the position of the ‘different’ token in each triad (either ‘A’ or ‘B’) and the direction of the triad (either ‘upwards’—e.g., A = 1, X = 1, B = 3—or ‘downwards’—e.g., A = 4, X = 4, B = 2). Despite the considerable number of trials (4785 in ‘A’ position and 5031 in ‘B’ position; 4941 ‘upwards’ and 4875 ‘downwards’), our models, accounting for interactions between cohorts, step pairs, token positions, and trial directions with participant IDs as a random intercept, revealed no significant impact from the token position for either cohort. However, a notable exception was found in the Spanish cohort, where the direction of the trial influenced responses for the comparison of steps 3 and 1 (z = 2.938, p = 0.0033). Given that this was an isolated case, we decided not to include the direction factor in further analyses, allowing us to focus on more substantial findings. No other factors were expected to interfere with the interpretation of the results.
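The two control factors can be made concrete with a small sketch that labels a triad by the position of the 'different' token and by its direction. It is written in Python for illustration only. It assumes that direction is defined by comparing the 'different' token with X, which is consistent with the two examples given above but is not stated explicitly; the function name `classify_triad` is hypothetical.

```python
def classify_triad(a, x, b):
    """Return (position, direction) for an AXB triad of accent steps.

    position:  which flanking token differs from X ('A' or 'B').
    direction: 'upwards' if the different token has a higher step than X,
               'downwards' otherwise (an assumed definition).
    """
    if x == a and x != b:
        position, odd = "B", b
    elif x == b and x != a:
        position, odd = "A", a
    else:
        raise ValueError("X must match exactly one of A and B")
    direction = "upwards" if odd > x else "downwards"
    return position, direction


# The two examples from the text
print(classify_triad(1, 1, 3))
print(classify_triad(4, 4, 2))
```

Under this definition, a triad such as A = 3, X = 1, B = 1 would be labelled as having the different token in 'A' position with an 'upwards' direction.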
3.2. Overall Results
Overall, results show similar behaviour of both cohorts for every pair of steps. Japanese listeners seem to be slightly more proficient in distinguishing pairs 1-3, 2-4 and 4-2, and steps 5-3 are slightly better distinguished by listeners of both cohorts (
Figure 1).
The effects of
cohort,
steps compared, and the interaction between both factors on the correct response variable were examined using a generalized linear mixed-effects model. The model included the id of the participants as a random factor to account for individual variability. The results of a Type II Wald chi-square test are presented in
Table 3.
The variable steps compared was found to be highly significant. Additionally, the interaction between cohort and steps compared was also significant, suggesting that the effect of steps depends on the cohort answering. The main effect of the cohort was found to be only marginally significant.
Post hoc analysis of the model revealed a significant difference between Japanese and Spanish in steps 1-3 (z = 2.351,
p = 0.0187) and a marginally significant difference in the comparison of steps 4-2 (z = 1.939,
p = 0.0525). The other contrasts were not found to be statistically significant at the 0.05 level. Additionally, differences across step pairs were also analyzed for each cohort. No significant difference was found between any pair of steps for the Japanese cohort; however, more variability was found for the Spanish cohort. Trials in which the target token (i.e., the X token pronounced by the female voice) was 1, 2 or 4 were significantly less accurately discriminated from the male voice than trials with 3 or 5 as the target token (
Table 4).
3.3. Results by Vowel
One of the main advantages of the manipulation of tokens via the splicing technique and the gradation technique is that it brings the possibility of examining the contribution of each individual segment to the overall perception of foreign accents analyzed in the previous section. In this section, a comprehensive analysis of the results for each [non-native]→[native] vowel continuum will be provided.
The effects of
cohort,
steps compared,
vowel, and their interactions were examined on the response variable using a generalized linear mixed-effects model, with the id of the participants included as a random factor. The results of a Type II Wald chi-square test are summarized in
Table 5.
The results provide insights into the relationships among steps compared, vowel, and cohort. Both the main effects of steps compared and vowel were found to be highly significant. Additionally, significant pairwise interactions were detected between cohort and steps compared, cohort and vowel, and steps compared and vowel. The three-way interaction among cohort, steps compared, and vowel was marginally significant.
Given these findings, we further examined the interaction between
cohort and
steps compared for each vowel individually (
Figure 2). No significant effect was detected in three continua (namely [a]→[æ], [a]→[ɑː] and [o]→[ɒ]). However, significant differences emerged in the
steps compared factor for the remaining continua, all with a significance level of
p < 0.001 except for [o]→[ɔː] (
p < 0.05), for which the
cohort factor was also found significant (
p < 0.01). Finally, for the [i]→[iː] continuum, a significant effect of
cohort (
p < 0.01),
steps compared (
p < 0.001) and the interaction of both factors (
p < 0.001) was also found.
The discrimination abilities of the two cohorts were evaluated using d-prime values for the [o]→[ɔː] and [i]→[iː] continua. Both cohorts generally discriminated above chance, as indicated by positive d-prime values. However, the Japanese listeners generally exhibited superior discrimination, particularly in the [o]→[ɔː] continuum, with d-prime values of 0.868 and 0.850 for step pairs 3-1 and 5-3, respectively, compared to the Spanish listeners’ 0.375 and 0.646 for the same pairs. In the [i]→[iː] continuum, the Japanese cohort again showed better discrimination, notably with a d-prime value of 1.11 for step pair 4-2. By contrast, their performance for step pair 3-5 yielded a d-prime value of 0, indicating discrimination no better than chance. This unexpected result warrants further exploration and will be discussed in the following section.
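To illustrate why a d-prime of 0 corresponds to chance-level discrimination, the following Python sketch uses the standard yes/no formula, d' = z(H) − z(F), where H is the hit rate and F the false-alarm rate. This formula is an assumption made for illustration: the text does not specify how d-prime was derived, and models specific to the AXB task would yield different values. The rates used below are hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Yes/no sensitivity index: z(H) - z(F).

    Both rates must lie strictly between 0 and 1; rates of exactly
    0 or 1 need a correction before the z-transform (not shown here).
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# Equal hit and false-alarm rates give d' = 0 (no sensitivity)
print(d_prime(0.6, 0.6))

# A hit rate above the false-alarm rate gives a positive d'
print(d_prime(0.7, 0.3) > 0)
```

When listeners respond "different" equally often regardless of which token actually differs, H equals F and the index is exactly zero, which is the situation described for step pair 3-5.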
A detailed pairwise analysis of each factor can be found in
Appendix A for every individual sound, and the main conclusions will be outlined here. The
steps compared factor is significant because comparisons at the native end of the continuum, i.e., steps 4-2 and especially 5-3, yield higher scores than the other pairs. This finding implies that listeners are better able to discern accents across voices when these voices sound near-native rather than strongly foreign-accented.
Moreover, among the continua exhibiting significant effects of cohort, the Japanese cohort generally displayed a greater ability to distinguish accents across voices compared to the Spanish group. This finding, where participants with an L1 different from the accent in the experimental words were more skilled at discerning accents, will be further explored and discussed in the following section.
4. Discussion
In this paper, we have explored the capacity of non-native English speakers to distinguish varying degrees of foreign accent across different voices. Specifically, our analysis focused on two distinct cohorts: one in which the participants’ L1 matched the accent of the English words (Spanish), and another in which the L1 was mismatched (Japanese). Our investigation looked into their ability to discern five distinct degrees of foreign accent imposed over seven English nuclear vowels. The results yielded a complex picture, revealing that listeners can indeed detect differences in foreign accents across voices. However, this ability is not uniform across all scenarios. Two key conditions emerged that appeared to enhance the discrimination capabilities of the participants: (i) when the listener’s L1 does not match the speaker’s, thereby providing a potentially contrasting perspective, and (ii) when the speech samples sounded closer to native-like, potentially facilitating more refined discrimination.
Previous research has looked into the challenges that listeners encounter when trying to distinguish between segmental contrasts that are considered “difficult” to acquire. Specifically, these difficulties arise when the two segments being compared are identified as a single phonemic unit in the listener’s native language (
Højen and Flege 2006;
Pérez-Ramón et al. 2020). This phenomenon leads to increased challenges for individuals learning a second language, as they often struggle to establish robust and distinct categories that differ from or interfere with the categories embedded in their L1. Essentially, the cognitive framework formed by a person’s L1 can obscure the subtleties between phonemic units in an L2, causing them to perceive distinct sounds as identical (
Escudero et al. 2009;
Tuninetti and Tokowicz 2018). This phenomenon is further illuminated by Tyler and Best’s research on perceptual assimilation, which examines how native English speakers perceive various non-native vowel contrasts. Their findings suggest a consistent influence of native-language attunement on speech perception (
Tyler 2014) that can result in the assimilation of non-native sounds to native phonological categories, affecting discrimination abilities.
Central to the overarching aim of this research is the observation that students who are learning a second language are generally exposed to a diverse array of accents. This exposure plays a crucial role in how learners interpret and process language. Learners are known to extract information from these accents in unique ways, often influenced by the preconceptions and expectations they harbour regarding their instructors. Specifically, there has been empirical evidence to demonstrate that these expectations extend to different areas of language expertise. For instance, Chinese students learning English often anticipate that native English-speaking teachers will demonstrate mastery in pronunciation. In contrast, they anticipate that non-native English-speaking teachers will display a deeper understanding of grammatical rules and strategic language use (
Sung 2014). Similar findings have been uncovered for Vietnamese and Japanese students of English (
Walkinshaw and Oanh 2014). In this case, participants regarded the speech of non-native teachers as not only more accented but also more comprehensible, i.e., easier to understand. This body of research brings to light the issue of interspeaker accent discrimination.
In our study, we have provided evidence that learners possess the ability to recognize varying degrees of foreign accents, a capability that manifests not only within individual speakers but also, and more notably, across different speakers. This ability to discriminate is significantly linked to the intensity of the FA as applied to nuclear vowels of monosyllabic words. Particularly, we observed that when the vowels exhibit a more native-like quality, listeners become more attuned to subtle variations in accent across speakers. This sensitivity to accentedness is not an isolated phenomenon but aligns with existing research that often reveals that small deviations from the norm can be enough to convey a significant increment in the perceived degree of foreign accent (
Pérez-Ramón et al. 2020), enhancing the discriminatory capabilities of the listener.
Our investigation revealed varying abilities among non-native English speakers in discerning degrees of foreign accent, influenced by both the listener’s native language and the native-likeness of the speech samples. This variability can be partially explained through the lens of top-down and bottom-up processing. When listeners encountered non-native-like tokens, they likely engaged in top-down processing, utilizing contextual information and their prior linguistic knowledge to comprehend the speech (
Xie and Myers 2017). This approach enables a talker-specific pathway to comprehension, which was particularly evident in the Japanese cohort’s ability to discriminate vowel length contrasts, a feature prominent in their L1. Conversely, in scenarios where speech samples were more native-like and thus provided fewer contextual cues for non-native listeners, a reliance on bottom-up processing was observed. Here, listeners depended more on the raw auditory information (
Gerrits and Schouten 2004) to distinguish between different degrees of accent. Such a shift in cognitive processing strategy might explain the varied performance across different speech samples and cohorts.
A secondary finding of our research is the difference in discrimination capabilities across cohorts of listeners. Prior research has shown that the way individuals perceive non-native phonetic segments is deeply influenced by the extent to which these segments align with the phonological structures present in the listener’s own language (
Hu 2021;
Imai et al. 2005;
Pérez-Ramón et al. 2020). This suggests that the phonological representations developed over time play a pivotal role in the interpretation of unfamiliar sounds.
It is known that listeners are more adept at distinguishing between segments when those segments have contrastive features in their native language (L1). Previous research has demonstrated that, for example, the contrast /ʃ/-/ʒ/ is perceived differently by English and Chinese listeners since the latter lack this specific phonological pair in their native consonant system (
Chen and van de Weijer 2022). In another study (
Schoonmaker-Gates 2015), the production of Spanish plosives by native and non-native speakers was rated for accentedness by English learners of Spanish. The conclusions drawn from this study suggest that contrasts in a second language can be acquired over time, but they are not straightforward for non-proficient speakers.
In our study, this can be seen in the differences between Spanish and Japanese listeners. Specifically, the ability to discriminate between degrees of accent across speakers is significantly superior in instances where the duration of a vowel serves as a differentiating cue, particularly in the [i]→[iː] and [o]→[ɔː] continua. Given that the Japanese language uses vowel duration as a distinctive feature (
de Weers and Munro 2018;
Hui and Arai 2020), it falls within expectations that Japanese listeners would more effortlessly discriminate varying degrees of accent in contrasts primarily defined by vowel length.
It remains unclear why only these two contrasts were especially clear for Japanese listeners. One possible explanation is that Japanese speakers of English realise the [ɜː] vowel not as [e] or [o], but as [a] (
D’Angelo et al. 2021;
Lengeris 2009). Therefore, their expectations for the realisation of these words may have hindered their perception of the subtle differences between degrees of accentedness across speakers.
While the primary focus of our study is on the nuances of accent discrimination in language perception, our findings may hold important implications for language education. In the formal education setting, students are often exposed to instructors with a variety of accents and linguistic backgrounds, ranging from native speakers to non-native speakers of different proficiency levels. This diversity in accent and teaching approach, as previous research suggests, can significantly influence the learning process (
Algethami 2017). Studies have shown that exposure to diverse accents can enhance students’ listening skills and linguistic flexibility, preparing them for real-world language use (
Burke et al. 2018;
Sumner 2009). Furthermore, adapting to varied teaching methodologies in response to these accents could foster cognitive adaptability and language processing skills, which are vital for language acquisition (
Clarke and Garrett 2004;
Cristia et al. 2012).
Our study primarily explored the interspeaker discrimination abilities of non-native speakers when faced with accented speech in their target language. The pedagogical ramifications stemming from our findings are considerable. A key takeaway is the need for educators to be deeply attuned to the pervasive influence of a student’s L1. Such an understanding is crucial in effectively guiding their progress in L2 acquisition and in crafting feedback that is both insightful and culturally sensitive. Our suggestion for future research is to broaden its scope to include an in-depth analysis of consonant discrimination, thus complementing the vowel-focused findings of this paper. Additionally, a more extensive exploration into the sociolinguistic intricacies of accent discrimination can further enrich our insights and foster more holistic and understanding language teaching approaches.