The Effect of Second Language Immersion Experience on the Perception of VOT by Saudi Arabic Learners of English

Alshangiti, Wafaa

doi:10.3390/languages11050081


English Language Institute, King Abdulaziz University, Jeddah 21589, Saudi Arabia

Languages 2026, 11(5), 81; https://doi.org/10.3390/languages11050081

Submission received: 7 January 2026 / Revised: 16 March 2026 / Accepted: 1 April 2026 / Published: 22 April 2026


Abstract

Increased experience with a second language (L2) can affect one’s speech perception and production. Some studies have suggested that experience does not affect the production of English bilabial stops by Arabic speakers. These speakers produce the English bilabial stops /p/ and /b/ as the Arabic /b/, which differs in VOT. However, the effect of English experience on the perception of English bilabial stops remains underinvestigated. This study examines the effect of L2 immersion experience on the perception of the English stops /p/–/b/ to investigate whether the lack of /p/ in Arabic can affect the perception of the /p/–/b/ contrast and whether L2 experience shifts the category boundary toward that of native speakers. Sixty-six participants, comprising two groups of Arabic speakers with differing L2 experience and a control group of native English speakers, completed identification and discrimination tasks using the /p/–/b/ VOT continuum. The regression analysis showed that listeners with more L2 experience (i.e., ≥3 years in the UK) had a category boundary closer to that of native listeners than those with less L2 experience. However, category discrimination accuracy did not differ significantly between the Arabic groups. The results highlight the importance of L2 immersion experience in altering VOT perceptual strategies, which can help in designing future training studies that focus on VOT perception as an L2 phonetic cue.

Keywords:

VOT perception; L2 experience; category boundary; category discrimination

1. Introduction

Mapping speech sounds to discrete phonetic categories is a fundamental function of speech perception (Liberman et al., 1967). However, adult learners of a second language (L2) often encounter challenges in perceiving L2 phonemes, as the perceptual strategies developed through their first language (L1) tend to influence the acquisition of some L2 phonemes (e.g., Best, 1995; Flege, 1995; Iverson et al., 2003; Best & Tyler, 2007). Several theoretical models offer explanations for these challenges, including the Speech Learning Model (SLM; Flege, 1995), the Perceptual Assimilation Model (PAM-L2; Best & Tyler, 2007), the Second Language Linguistic Perception Model (L2LP; Escudero, 2005; Van Leussen & Escudero, 2015), and the Universal Perceptual Model (UPM; Georgiou, 2021). Collectively, these models predict that the relative difficulty or ease of perceiving L2 phonemes depends largely on the degree of similarity between L1 and L2 phonetic categories.

According to the Perceptual Assimilation Model (PAM; Best & Tyler, 2007), nonnative speech sounds are perceived in terms of how they can be assimilated to an L1 phonetic category. Similarly, the Speech Learning Model (Flege, 1995) states that the L1 and L2 phonetic categories coexist during phonetic processing, and thus the formation of new L2 categories becomes difficult when L2 sounds are acoustically close to existing L1 categories. The model further suggests that during the early stages of L2 acquisition, the establishment of new phonetic categories may be hindered by this similarity; however, increased experience with L2 can enhance learners’ ability to discriminate and form distinct L2 phonemes.

In a related framework, the Second Language Linguistic Perception Model (L2LP; Escudero, 2005) states that L2 learners, in the initial stage of learning, perceive L2 sounds in a manner that resembles their L1 counterparts. According to this model, the acoustic similarities between L1 and L2 play a vital role in shaping L2 phonological development. The L2LP model proposes a gradual learning algorithm that involves perceptual learning, enabling learners to adjust to new phonetic input. This process may facilitate the formation of new phonological categories. According to the L2LP model, L2 learners may achieve native-like perception when they are exposed to sufficiently rich and extensive L2 input, which can be referred to as L2 experience.

Experience in an L2 can facilitate L2 learning and may be gained in different ways, including L2 use (e.g., Kartushina & Martin, 2019; Piske et al., 2001), L2 formal instruction (e.g., Nagle, 2019), and residency in an L2-speaking country (e.g., Flege, 1987; Flege et al., 1997; Flege & Liu, 2001; Gorba, 2019; Gorba & Cebrian, 2021). According to the PAM-L2 model (Best & Tyler, 2007), learners who have lived in an L2-speaking country for six months or more can be classified as L2-experienced. The period of residency in an L2 setting has been found to be positively related to the accuracy of vowel perception and production (Flege et al., 1997; Levy & Law, 2010), sentence production (Flege et al., 1995), and some suprasegmental features (Gorba, 2018; Petrova et al., 2023; Trofimovich & Baker, 2006). For example, Flege et al. (1997) examined the effect of English language experience on the perception and production of English vowels by German, Spanish, Mandarin, and Korean L1 speakers who had lived in the US for 5–9 years. They found that highly experienced L2 learners perceived and produced the English vowel contrast /ε/–/æ/ more similarly to native English speakers than less experienced L2 speakers did. At the suprasegmental level, Petrova et al. (2023) investigated the effect of English L2 immersion-based experience on the perceptual strategies of Mandarin listeners. Their results showed that participants who had lived in the UK for more than 3 years demonstrated more native-like weighting of pitch and duration cues than those who had lived there for less than one year. These findings suggest that extended residence in an L2-speaking country can positively influence the perception of L2 phonetic cues.

A phonetic cue that has been widely investigated in relation to L2 experience is voice onset time (VOT). VOT refers to the temporal interval between the release of a stop consonant and the onset of voicing in the subsequent vowel. It functions as an important acoustic cue that enables listeners to differentiate between voiced and voiceless stop consonants (Lisker & Abramson, 1964; Liberman et al., 1958; see also Cho et al., 2019). Since the perception of VOT category boundaries, i.e., the VOT values indicating a shift from one phonetic category to another, is language-specific (e.g., Hoonhorst et al., 2009; Souganidis et al., 2024), it is important to establish how these boundaries are perceived by L2 learners with a specific L1 background. This is important for shedding light on whether similarities between L1 and L2 phonetic categories can predict the ease or difficulty of perceiving the L2 categories, as suggested by the above-mentioned speech perception models.

For example, PAM (Best, 1995; Best & Tyler, 2007) proposes different types of assimilation when learners discriminate between nonnative phonemes. Learners can discriminate nonnative phonemes accurately if they are mapped to two different native phonological categories, a two-category (TC) assimilation. When one nonnative phoneme is perceived as a native phonetic category and the other is not perceived as any L1 phonetic category (uncategorised-categorised, UC), discrimination is predicted to be good. Less accurate discrimination might occur if the nonnative phonemes are assimilated into one native phonetic category. If two L2 phonemes are perceived as equally good or poor versions of a native phonetic category, a single-category (SC) assimilation occurs, and discrimination will be difficult. However, if one of the nonnative phonemes is perceived as a good exemplar and the other as a poor exemplar of a native phonetic category, a category-goodness (CG) assimilation is predicted, and discrimination can be relatively easier than in the SC case. Finally, when two L2 phones are both uncategorised, an uncategorised-uncategorised (UU) assimilation occurs, and the discrimination of these phones can vary from poor to excellent, depending on their similarity to native phonetic categories and their similarity to each other.

One of the implications of the PAM assimilation patterns is their ability to predict the accuracy with which L2 learners discriminate between nonnative phonemes. Specifically, when L2 learners encounter certain phonemic contrasts that are absent in their L1 inventory, they may assimilate these speech sounds to the closest L1 category. As a result, discrimination becomes more difficult, especially in cases of SC assimilation. For example, Arabic lacks the phoneme /p/, and therefore, Arabic speakers might assimilate the English contrast /p/–/b/ to the closest Arabic phonological category /b/, which may lead to reduced discrimination accuracy and possibly the formation of an SC assimilation pattern. However, when two L2 phonemes correspond to different L1 categories, TC assimilation occurs, typically resulting in relatively accurate discrimination. An example is the /k/–/g/ contrast, which exists in both Arabic and English and is thus expected to be easily distinguished by Arabic learners of English.

Given that the difficulty associated with perceiving L2 speech sounds is influenced by learners’ L1 phonological system, the current study investigates the perception of L2 phonemes that are absent from learners’ L1 inventory. Specifically, the study investigates whether, as proposed by SLM (Flege, 1995), speech sounds that do not exist in the learners’ L1 may be relatively easier to acquire. Moreover, if Arabic speakers perceive /p/ as an uncategorised speech sound, as suggested by PAM (Best, 1995), the /p/–/b/ contrast may be classified as a UC assimilation, which would predict relatively successful discrimination. Conversely, according to PAM (Best, 1995; Best & Tyler, 2007), nonnative phonetic categories that do not have direct L1 counterparts may be assimilated to existing L1 categories, potentially making them difficult to acquire; consequently, the English /p/–/b/ contrast is predicted to be assimilated to the Arabic /b/, which may make the discrimination harder.

Building on these theoretical predictions, the present study investigates the effect of L2 experience on shaping the perception of VOT. In particular, it examines whether Saudi Arabic learners of English with varying levels of L2 experience demonstrate shifts in the perceptual category boundaries of English bilabial stops, and whether increased experience improves their ability to discriminate between these categories. This issue is particularly important because variation in VOT has been shown to affect foreign-accented L2 speech and may sometimes lead to communication misunderstandings (e.g., Schoonmaker-Gates, 2015). Furthermore, given that English and Arabic have different voicing patterns in consonantal stops (e.g., Flege & Port, 1981), it is worth investigating whether the effect of the L1 voicing category might shift with L2 experience. In other words, the study seeks to establish whether greater exposure to English as an L2 leads learners to modify their perceptual strategies, shifting from reliance on L1 phonetic cues toward increased sensitivity to L2-specific cues.

The research questions then are:

How do learners perceive L2 phonemes that are absent from their native phonological inventory?
Do Saudi Arabic learners of English with varying levels of L2 experience exhibit shifts in the perceptual category boundaries of English bilabial stops?
Does increased L2 immersion experience enhance Saudi learners’ ability to discriminate English bilabial stops?

Both English and Arabic have a two-way distinction for laryngeal contrast between consonantal stops, which are classified as voiced and voiceless stops (Lisker & Abramson, 1964). However, the two languages differ in their VOT patterns. In English, there does not need to be any vocal cord vibration during the production of consonant stops, whether they are voiceless (/p/, /t/, and /k/) or voiced (/b/, /d/, and /g/); i.e., there is no prevoicing in English. Furthermore, the voiceless stops in English are aspirated, while the voiced stops are mostly unaspirated in word-initial position. Therefore, during the production of word-initial stops, the cue for voicing is not obtained from the presence or absence of glottal pulsing, but rather from the timing between glottal and supraglottal events, i.e., the time interval between the release of the stop consonant and the onset of vocal cord vibration (Ladefoged & Maddieson, 1996; Lisker & Abramson, 1971). On the other hand, Arabic relies on the presence or absence of glottal pulsing during the closure period of the consonant stops (e.g., Flege & Port, 1981). Unlike English, Arabic does not have the bilabial plosive /p/ (e.g., Al-Ani, 1970). It is described as a ‘true voicing language’ with fully voiced (prevoiced) stops (/b/, /d/, and /g/) and voiceless, unaspirated (short-lag) stops: /t/ and /k/ (Flege & Port, 1981, for Saudi Arabic; Kulikov, 2016, for Qatari Arabic). In Najdi Arabic, a Saudi dialect, Saudi speakers (the group investigated in this study) produce word-initial voiced stops with short-lag VOTs and voiceless stops with moderately aspirated, long-lag VOT values, with longer closure durations, bursts, and F0 and F1 onsets and less intensity than the voiced stops (Al-Ghamdi et al., 2019).

Alotaibi and AlDahri (2011) suggested that the Modern Standard Arabic (MSA) and English voiceless stops /t/ and /k/ have long-lag VOTs in word-initial positions (Lisker & Abramson, 1964), but that these two languages differ in the degree of aspiration. Therefore, in order for Arabic-English bilinguals to perceive or produce voiceless stops in English, they need to learn how to distinguish voiced and voiceless speech sounds, especially when they are aspirated, as the Arabic phonetic system does not include aspirated voiceless stops.

Studies that have investigated the difference between Arabic and English VOTs have mostly considered the differences concerning production by Arabic-English bilingual children (e.g., Khattab, 2002) and adults (e.g., Alharbi et al., 2023; Flege & Port, 1981). For example, Flege and Port (1981) investigated the production of English stops (word-initial and word-final stop voicing contrasts in CV: C minimal pairs) by adult Saudi Arabic speakers who had lived in the United States for several years. They found that in the word-initial position, Saudi speakers produce the English /p/ with the temporal properties of a bilabial stop, but with the closure voicing of /b/. This is possibly because the Arabic language lacks a voiceless bilabial stop, and Arabic speakers use their native voiced /b/ to produce the English /p/. The resulting sound is not the same as the English /p/ or the English /b/; it has the timing characteristics and place of articulation of an English /p/, but with the closure voicing of an English /b/, which can be explained by L1 interference in producing L2 speech sounds. A similar conclusion was drawn by Alanazi (2018), who investigated the differences in VOT values for consonant stops between advanced Saudi Arabic learners of English and monolingual Saudi Arabic speakers with little to no knowledge of English, along with a control group of monolingual English speakers. The participants were asked to record stop consonants in three vowel contexts, /i:/, /a:/, and /u:/, in isolated words and sentences. Alanazi (2018) found that Arabic learners produced English voiceless stops with aspiration more similar to the Arabic VOT values than those of English and voiced stops with prevoicing following the Arabic norm. The vowel context did not affect the VOT values. Alanazi (2018) concluded that the Saudi Arabic speakers emulated Arabic production for the English stops regardless of their L2 proficiency and experience.

In addition to the previously mentioned studies that investigated the production of English bilabial stops by Arabic speakers, other studies focused on how living in an English-speaking country affected Saudi speakers’ production of consonant stops. For example, Alluhaidah (2023) investigated the VOT production of a male Saudi speaker who lived in the UK at the time of testing and compared it to that of a native English speaker. He found that the length of stay in the UK and daily use of English did not make the Saudi speaker’s VOT values more similar to those of the native speaker. In contrast, Alharbi et al. (2023) found an effect of L2 experience on bilabial stop production. They investigated the VOT production of voiceless stops among highly proficient, older Arabic-English and English-Arabic bilinguals who had been living in an L2-speaking region for more than 15 years. The participants were recorded narrating cartoons in Arabic and English. For Arabic-English bilinguals, the voiceless bilabial /p/ was easier to produce than L2 sounds that are similar to L1 phonetic categories, such as /t/ and /k/. However, their perception of between-category and within-category VOT values was not investigated.

In sum, previous research has established that Saudi learners usually produce English bilabial stops similarly to the Arabic /b/. If we assume that production and perception of L2 speech sounds work in a similar way (for a review, see Baese-Berk et al., 2025), then, based on Alharbi et al.’s (2023) study, one might expect that L2 learners perceive the English /p/ more accurately in an immersion setting. Indeed, Alzahrani (2021) tested Saudi speakers who had lived in Australia for more than three years on their perception of English bilabial stops and compared them to those who had lived there for less than six months. Alzahrani (2021) found that the group that had lived in Australia for 3 years perceived minimal pairs starting with bilabial stops more accurately than those who had lived there for 6 months, indicating the positive effect of living in an L2-speaking country on speech perception. However, these studies did not investigate the category boundaries of different VOT values; rather, they used minimal pairs for identification tasks.

While previous research has extensively investigated the production of English bilabial stops by Arabic and specifically Saudi speakers, there is a notable gap in research on the perception of VOT category boundaries and whether living in an L2-speaking country might facilitate a shift in category boundaries towards those of native English speakers. This study investigated the effect of experience, defined as residence in the UK for at least three years, on Saudi Arabic speakers’ VOT perception and discrimination between the English /b/ and /p/, and compared it to that of Saudi speakers living in Saudi Arabia. Bilabial stops were chosen because the voiceless /p/ does not exist in the Arabic phonetic system (Newman, 2002), and Arabic speakers often produce the English /p/ similarly to the Arabic /b/, which is prevoiced (Flege & Port, 1981). The Saudi groups’ perception was compared to that of native Standard Southern British English (SSBE) speakers as the control to see whether L2 experience helped in shifting the category boundary between /b/ and /p/ towards that of the SSBE group.

To investigate the effect of L2 experience on category perception, the participants completed identification and discrimination tasks on a VOT continuum from /pi/ to /bi/. The hypothesis was that L2 experience might cause VOT perception to become more similar to that of native English speakers (cf. Gorba, 2018, 2019) compared to Arabic speakers who live in Saudi Arabia. Indeed, participants with less L2 experience have been found to assimilate L2 phonetic categories into one L1 category, while learners with more L2 experience showed more dispersed mapping (Doan & Oh, 2023). However, according to the SLM, L2 categories that do not have a direct L1 counterpart might be easier to learn, and thus, Saudi learners may accurately discriminate between the English bilabial stops, which would oppose the hypothesis. In fact, experienced Arabic learners produced the English /p/, which does not exist in Arabic, correctly, whereas L2 phonemes that are similar to their L1 counterparts, such as /t/ and /k/, were harder to produce in a native-like manner (Alharbi et al., 2023). On the other hand, since some studies found no effect of L2 experience on VOT production (e.g., Alanazi, 2018), a similar scenario for VOT perception is probable, which would confirm the null hypothesis that experience has no effect on VOT perception. Then again, perception and production can be affected differently (e.g., Baese-Berk et al., 2025; Hattori, 2010). Thus, perception, unlike production, might improve as a result of L2 experience.

2. Materials and Methods

VOT perception was investigated through category discrimination and identification tasks. The tasks were completed, and all data were collected, online using Gorilla Experiment Builder (www.gorilla.sc, accessed on 12 March 2024; Anwyl-Irvine et al., 2020). All participants received emails about the procedure. To ensure that stimulus presentation was as similar as possible across participants, they were instructed to be in a quiet room and to use a computer with the Google Chrome browser. The participants were also sent a questionnaire and a consent form via email a day before the experiment. The questionnaire included questions about the length of residency in the UK and their use of English. The Saudi participants were divided into two groups only: one group in the UK and one group in Saudi Arabia. The participants were asked to finish both tasks within an hour; otherwise, they were excluded.

2.1. Participants

A total of 66 participants took part in this study, including 31 Saudi Arabic participants living in Saudi Arabia (age range 18–39 years, median = 19.5 years); 25 Saudi Arabic participants in London (age range 18–45 years, median = 32.5 years), who had lived there for between 3 and 15 years (median = 5 years); and 10 native English speakers (age range 28–58 years, median = 42.5 years). The Arabic listeners began learning English between 5 and 24 years of age (median = 11 years). The participants in London were recruited from a scholarship program for Saudi postgraduate students. All participants were enrolled in postgraduate programs; they reported that they speak English at work or school and for general communication, and Arabic at home and with other Saudi students. The minimum time spent in London was 3 years, as this cutoff was found to have a positive effect in perceptual studies (e.g., Petrova et al., 2023), and thus, anyone who had spent less than 3 years in the UK was excluded from the study. The Saudi Arabic participants living in Saudi Arabia were also postgraduate students who were working toward an MA there. English is the most commonly used language for their MA courses. However, they reported that they speak mostly Arabic with some English words, and they do not communicate with native English speakers outside the university. Both Saudi Arabic groups reported having a high level of proficiency in English. None of the participants reported any hearing or speech problems. The participants were sent a link to the experiment via friends and friends of friends.

2.2. Stimuli

A 10-step /pi/–/bi/ VOT continuum ranging from 0 ms to 50 ms was presented to the participants. The stimuli were synthesised using the method described by Winn (2020), in which the VOT continuum incorporates covariation of voice pitch (F0), the first formant (F1) transition, and aspiration intensity, using Praat (Boersma & Weenink, 2022). The stimuli had VOT values of 0, 10, 15, 20, 25, 30, 35, 40, 45 and 50 ms. The English category boundary between the bilabial stops was found to be around 20 ms (cf. Nakai & Scobbie, 2016), so the values 0, 10 and 15 ms corresponded to [b], and the VOT values 20 to 50 ms corresponded to [p]. A 5 ms step after 10 ms was chosen to test for any subtle change in the category boundary between the groups. The stimuli were recorded by a 45-year-old monolingual SSBE-speaking male in a sound-attenuated booth using a Rode NT1A Studio Condenser Microphone. For the detailed manipulation technique, see Winn (2020).

The vowel /i/ was selected to follow the bilabial stops, as its formants are more stable over time than those of any other English vowel (Hillenbrand et al., 1995), and thus, F1 remains roughly the same throughout the VOT continuum (see also Winn, 2020). The same stimuli were used for the identification and the discrimination tasks. WAV files were converted to MP3 format, as required by the developers of Gorilla, the online experiment builder. To minimise display-related variability, participants were restricted to completing the experiment on computers using Chrome. However, controlling other factors, particularly those related to the acoustic presentation of stimuli, remained challenging.

2.3. Procedure

2.3.1. Identification Task

The listeners were presented with a two-alternative forced-choice task and were asked to identify whether the sound they heard was /p/ or /b/, with the key words bee for ‘b’ and pea for ‘p’. There were 50 stimuli (10 VOT continuum steps with 5 repetitions) presented in a random order. There was a ‘replay’ button that allowed 2 trials for each stimulus.
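The trial structure described above (10 continuum steps, 5 repetitions each, random order) can be illustrated with a short sketch. This is not the Gorilla implementation used in the study; the function name and the seed parameter are hypothetical and serve only to make the example reproducible.

```python
import random

# VOT continuum steps (ms) as listed in the Stimuli section.
VOT_STEPS_MS = [0, 10, 15, 20, 25, 30, 35, 40, 45, 50]
REPETITIONS = 5


def build_identification_trials(seed=None):
    """Return a shuffled list of VOT values, one entry per trial.

    Hypothetical helper: the study randomised trials inside Gorilla,
    and the exact randomisation scheme is not reported.
    """
    rng = random.Random(seed)
    trials = [vot for vot in VOT_STEPS_MS for _ in range(REPETITIONS)]
    rng.shuffle(trials)
    return trials


identification_trials = build_identification_trials(seed=1)
print(len(identification_trials))  # number of identification trials
```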

2.3.2. Discrimination Task (Same/Different)

The WAV files containing the VOT values on the 10-step VOT continuum were arranged into pairs (e.g., 10–10 ms and 10–50 ms), and each pair was combined in one WAV file to be presented to the participants. The pairs were presented in random order and comprised same pairs (10 pairs consisting of each VOT value from 0 to 50 ms paired with itself, such as 0–0 ms, 10–10 ms, 15–15 ms, and 20–20 ms) and different pairs (45 pairs consisting of each VOT value, e.g., 20 ms, combined with the other VOT values across the continuum, e.g., 20–0 ms, 20–10 ms, 20–15 ms, 20–30 ms, 20–35 ms, 20–40 ms, 20–45 ms, and 20–50 ms). There were 55 stimuli in total (see Appendix A). The participants were asked to listen to two sounds and decide whether they were ‘the same’ or ‘different’ by clicking on the corresponding buttons. There was a ‘replay’ button that allowed 2 trials for each stimulus. The ‘replay’ button was added, instead of repeating the stimuli, to avoid listener fatigue given the large number of pairs. There was no practice session for either task, but this was compensated for by stimulus repetition.
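The pair counts reported above follow directly from the 10 continuum steps: 10 same pairs plus the 45 unordered combinations of two distinct steps give 55 stimuli. The sketch below checks this arithmetic. It assumes that each different pair occurred in only one presentation order, which the text implies by the count of 45 but does not state explicitly.

```python
from itertools import combinations

# VOT continuum steps (ms) as listed in the Stimuli section.
VOT_STEPS_MS = [0, 10, 15, 20, 25, 30, 35, 40, 45, 50]

# Same pairs: each step paired with itself.
same_pairs = [(vot, vot) for vot in VOT_STEPS_MS]

# Different pairs: every unordered combination of two distinct steps.
# Assumption: presentation order within a pair was not counterbalanced.
different_pairs = list(combinations(VOT_STEPS_MS, 2))

all_pairs = same_pairs + different_pairs
print(len(same_pairs), len(different_pairs), len(all_pairs))
```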

The tasks can be found at this link: https://app.gorilla.sc/openmaterials/1103662 (accessed on 12 March 2024).

2.4. Statistical Analyses

The statistical analysis for both the identification and the discrimination tasks was conducted using R 4.5.2 (R Development Core Team, 2023). Given that the responses were binomial in both tasks, logistic mixed-effects regression models in the ‘lmerTest’ package (Kuznetsova et al., 2017) were used, and the package ‘ggplot2’ 4.0.2 (Wickham, 2016) was used to build the graphs. In both tasks, the significance of the main effects and interactions was evaluated with ANOVA in the ‘car’ package (Fox & Weisberg, 2019). The output of the models is presented in Table 1 and Table 2. In both models, the dependent variable was binary. For the identification task, /b/ was 1, and /p/ was 0, and for the discrimination task, 1 was ‘same’, and 0 was ‘different’. The VOT values were standardized to improve the computational efficiency of the model. The main fixed factors in the mixed models were standardized VOT values (a linear variable with ten steps from 0 to 50 ms), group (SSBE speakers, Saudi speakers who live in the UK ‘S–UK’, and Saudi speakers who live in Saudi Arabia ‘SA’), and their interactions. In both models, a random intercept for the participant was included to account for between-participant variability. For the discrimination task, Holm’s method (Holm, 1979) was used for post hoc correction rather than Bonferroni or Tukey adjustments, because it offers greater statistical power while maintaining the same familywise error rate, which makes it more likely to detect true effects (Agbangba et al., 2024).
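To illustrate why Holm’s method is less conservative than Bonferroni, the sketch below implements the step-down adjustment in plain Python. It is an illustration of the general procedure only, not the R code used in the study, and the p-values in the example are hypothetical rather than taken from the reported models.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so
    on; a running maximum keeps the adjusted values monotone, and each
    value is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def bonferroni_adjust(p_values):
    """Return Bonferroni adjusted p-values (each multiplied by m, capped at 1)."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


# Hypothetical p-values for three pairwise group comparisons.
hypothetical_p = [0.01, 0.04, 0.03]
holm_result = holm_adjust(hypothetical_p)
bonferroni_result = bonferroni_adjust(hypothetical_p)
print(holm_result)
print(bonferroni_result)
```

Every Holm-adjusted value is less than or equal to its Bonferroni counterpart, which is the source of the additional power mentioned above.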

The responses for the identification task were arranged by VOT value, as the purpose was to investigate the category boundaries across the VOT continuum. For the discrimination task, however, the data were organized to code the responses as ‘correct’ or ‘incorrect’, as the purpose was to investigate the accuracy of discrimination.

3. Results

3.1. The Identification Task

The identification task was designed to assess the effect of living in an immersion setting on the category boundary between voiced and voiceless bilabial stops. Figure 1 displays the mean percentages of responses to the VOT continuum by the three groups: S–UK, SA, and SSBE. Unlike the SA group, which seemed to identify both choices in the task at or below the chance level, the S–UK group followed an identification pattern similar to that of the SSBE listeners: both groups appear to show a category boundary on the /b/–/p/ continuum at around 20–30 ms.

Model 1 confirms this observation and shows a significant main effect of the VOT values [χ² (1) = 267.2, p < 0.001], and a significant interaction between the VOT values and the groups [χ² (2) = 206.6, p < 0.001]. However, the group effect was not significant (p > 0.05).

Figure 1 also shows that the category boundary for the SSBE and S–UK groups was around 20–30 ms, with different percentages of identification. In contrast, responses for the SA group were at around the chance level from 0 to 25 ms and changed only after 30 ms, so there did not seem to be a sharp category boundary in this group’s performance.

To assess the sharpness of the category boundary between the groups, contrast coefficient slopes were created (cf. Morrison, 2008). We followed some studies (e.g., García-Sierra et al., 2021) that assessed categorical perception in bilinguals by calculating and comparing the 50% crossover point (0.5 probability of responding ‘bi’) for each group. The 50% crossover was calculated by dividing the intercept by the slope for VOT and multiplying the result by −1 (cf. Casillas, 2021). For the SSBE group, the category boundary was 0.018 standard deviations below the mean (z = −0.018). To express the boundary in ms, the z-score was multiplied by the standard deviation of the VOT values and added to the VOT mean. This reverses the z-score transformation: the original raw value (x) is recovered from a given z-score, the dataset’s mean (μ), and its standard deviation (σ) using the formula x = μ + (z × σ). The result was around 26.7 ms for the SSBE group. The category boundary for the S–UK group was calculated by adding the S–UK and SSBE intercepts and dividing by their slopes, which gave a boundary 0.186 standard deviations below the mean (z = −0.186), or 24.16 ms (calculated by multiplying the z-score by the VOT SD and adding the VOT mean). The category boundary for the SA group was calculated in the same way and lay 0.82 standard deviations below the mean (z = −0.82), or 14.51 ms.
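The back-transformation from standardized units to milliseconds can be checked with a short worked sketch. It assumes a balanced design in which every continuum step contributes the same number of trials, so that the mean and (population) standard deviation of the ten steps approximate those of the trial-level VOT variable; the exact μ and σ used in the analysis are not reported, so the output is expected to match the reported boundaries only approximately. The model intercepts and slopes are given in Table 1 and are not reproduced here, so the reported z-scores are used as inputs.

```python
from statistics import mean, pstdev

# VOT continuum steps (ms) as listed in the Stimuli section.
VOT_STEPS_MS = [0, 10, 15, 20, 25, 30, 35, 40, 45, 50]


def crossover_z(intercept, slope):
    """50% crossover of a logistic curve, in standardized VOT units."""
    return -intercept / slope


def z_to_ms(z, mu, sigma):
    """Reverse the z-score transformation: x = mu + z * sigma."""
    return mu + z * sigma


# Assumption: balanced design, so step-level statistics stand in for
# the trial-level mean and standard deviation.
mu = mean(VOT_STEPS_MS)
sigma = pstdev(VOT_STEPS_MS)

# Crossover points (z-scores) as reported in the text.
reported_z = {"SSBE": -0.018, "S-UK": -0.186, "SA": -0.82}
boundaries_ms = {group: z_to_ms(z, mu, sigma) for group, z in reported_z.items()}

for group, boundary in boundaries_ms.items():
    print(group, round(boundary, 1))
```

Under these assumptions the sketch gives boundaries within about 0.1 ms of the reported 26.7, 24.16, and 14.51 ms; the small differences are consistent with rounding of the reported z-scores.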

In short, the contrast coefficient slope method showed that Saudi speakers who had lived in the UK for more than three years had a closer category boundary to that of the SSBE speakers than the Saudi speakers who lived in Saudi Arabia, indicating that L2 immersion experience may have helped in partial restructuring of category boundaries of English bilabial stops.

3.2. The Discrimination Task (Same/Different)

This task measured discrimination accuracy for pairs of stimuli drawn from the 10–step continuum (0 VOT, 10 VOT, 15 VOT, 20 VOT, 25 VOT, 30 VOT, 35 VOT, 40 VOT, 45 VOT, and 50 VOT). Owing to a technical error, the VOT pair 5–45 VOT was not included. The participants in the three groups were asked to listen and judge whether the two speech sounds they heard were the same or different. Most VOT pairs were different, but some (10 pairs) were the same.

Figure 2 shows that discrimination accuracy differed between the groups. The S–UK group seemed to demonstrate fairly accurate discrimination between and within the categories, which was similar to the SSBE group’s performance. However, the SA group seemed to show less accurate within–category discrimination but performed better at the endpoints (0 VOT and 50 VOT values).

A generalised linear mixed model was built with the categorical response (correct or incorrect) as the dependent variable, the pair of VOT values and group as fixed factors, and a by-participant random intercept. The model showed that the effect of the VOT pair was significant (χ² (56) = 449.10, p < 0.0001), as was the effect of group (χ² (2) = 6.36, p < 0.05; see Table 2). Post hoc pairwise comparisons with Holm’s correction (Holm, 1979) showed that the difference between the SSBE and S–UK groups was significant (β = 0.393, SE = 0.163, z = 2.41, p < 0.05), as was the difference between the SSBE and SA groups (β = 0.570, SE = 0.155, z = 3.67, p < 0.001). However, the difference between the S–UK and SA groups was not significant (see Table 3).
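Holm’s step-down correction can be sketched in a few lines of Python. The example below is an illustration, not the emmeans implementation: it assumes that the z ratios in Table 3 are referred to a standard normal distribution (the table reports df = Inf), computes two-sided p-values from them, and then applies the Holm adjustment. The function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def holm_adjust(pvals):
    """Holm (1979) step-down adjustment; returns p-values in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[i])
        # Enforce monotonicity so adjusted p-values never decrease.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# z ratios taken from Table 3.
Z_RATIOS = {"SSBE vs S-UK": 2.405, "SSBE vs SA": 3.667, "S-UK vs SA": 1.529}

raw = [two_sided_p(z) for z in Z_RATIOS.values()]
adjusted = dict(zip(Z_RATIOS, holm_adjust(raw)))
for contrast, p in adjusted.items():
    print(f"{contrast}: Holm-adjusted p = {p:.4f}")
```

With three comparisons, the smallest raw p-value is multiplied by 3, the next by 2, and the largest by 1. Under the normal-approximation assumption, the adjusted values agree with the p-values in Table 3 to about three decimal places.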

In order to investigate whether accuracy differed between pairs at the category boundary and pairs within a category, a further mixed model was built for the responses, with group and boundary position (within or between) as fixed factors and a by-participant random intercept. The model showed that the effect of group was significant (χ² (1) = 7.04, p < 0.05), as were the effect of boundary position (χ² (1) = 86.57, p < 0.001) and the interaction between group and position (χ² (1) = 12.85, p < 0.001). The fixed-effects coefficients showed that performance at the peak of the category boundaries was more accurate (β = 1.55, SE = 0.24, z = 2.4, p < 0.001) than within categories (β = −1.5, SE = 0.2, z = −7.5, p < 0.001). Participants in the SA group were less accurate on within–category pairs than the other two groups (β = 0.809, SE = 0.22, z = 3.58, p < 0.001).

4. Discussion

This study investigated the effect of living in an English–speaking environment on the perception of English bilabial stops by Saudi Arabic learners of English. Specifically, it compared the perception of VOT between two groups of Saudi Arabic speakers: one lives in Saudi Arabia (SA), and the other has lived in the UK (S–UK) for at least three years. Their performance was evaluated against that of SSBE speakers, who served as the normative reference group. Category boundaries were established using an identification task in which participants identified stimuli representing different VOT values across a 10–step continuum, while their accuracy in perceiving different VOT values was tested using a category discrimination task.

The results suggest that language experience affects the perceptual category boundaries of VOT values for the English /p/–/b/ contrast. Although both Saudi groups (SA and S–UK) had category boundaries at shorter VOT values than the SSBE speakers, the S–UK group's boundary (24 ms) was closer to the SSBE boundary (26.7 ms) than the SA group's boundary (14 ms) was. This is possibly because the less experienced listeners, the SA group, relied on the pre-voicing cue that they use for their Arabic /b/, while the SSBE listeners relied on aspiration, which is used contrastively in their L1. The S–UK group, on the other hand, showed some sensitivity to aspiration, which may indicate a change from relying on the L1 pre-voicing cue to using the aspiration cue that is salient in discriminating the English /p/–/b/ contrast, suggesting that cue shifting occurs as a result of living in an L2-speaking country. This finding aligns with Gorba’s (2019) study, which revealed that the category boundary for less experienced Spanish learners of English differed significantly from that of English speakers. A similar result was found by Gorba and Cebrian (2021), who investigated the effect of L2 experience on perceiving stop consonants. They found that English speakers place category boundaries, i.e., the VOT values at which the category changes from one to another, at longer VOT values than Spanish speakers do, as English speakers use aspiration contrastively in their language, while Spanish speakers rely on pre-voicing.

The category boundary results in this study are also consistent with the prediction of the SLM (Flege, 1995) that, in the initial stages of L2 learning, L2 phonemes are perceived in terms of L1 phonemes if the two are similar to each other. This is evident in the less experienced group, whose results resemble the SLM “initial stage”: they rely on their L1 bilabial stop to perceive English bilabial stops. Although Arabic does not have /p/, English /p/ shares its place of articulation with the Arabic /b/, which makes the two phonemes very similar to each other. This is possibly why Arabic learners of English are well known for producing the English /p/ with the voicing of the Arabic /b/ (e.g., Flege & Port, 1981; Alanazi, 2018). However, the SLM predicts that a ‘new’ L2 category can be formed with more experience in the L2. This might explain why the more experienced learners in this study seemed to form a new L2 category, as evidenced by their category boundary being closer to that of the native speakers. These results are also supported by the PAM–L2, which states that if two L2 categories are similar to a single L1 category, speakers will perform single–category assimilation, which makes it difficult to discriminate between the two categories. Once L2 learners establish new categories, they can discriminate the members of a given L2 phonemic contrast if these are assimilated to two different categories, which could be the case with the more experienced learners in this study.

However, the results showed that category discrimination accuracy was not affected by language experience; both Saudi groups had around chance–level accuracy, although the data showed some advantage of living in an L2-speaking country. The S–UK group performed slightly better (57% correct) than the SA group (53% correct), but this difference was not statistically significant. This is possibly an example of single-category assimilation (Best, 1995), as the learners likely assimilated the two categories /p/ and /b/ into their Arabic /b/ and thus found it difficult to discriminate between VOT values within and between categories. This is in line with other studies that found that L2 category discrimination did not change with residence in an L2-speaking country (e.g., Cebrian, 2006; Morrison, 2002). For example, Morrison (2002) investigated the perception of English vowels by Spanish and Japanese learners of English one and six months after they arrived in Canada and found that experience had no effect. That null result may reflect the relatively short period of residence (six months). The participants in this study, by contrast, had lived in the UK for more than 3 years, and yet this longer experience still had no significant effect on their category discrimination. However, other studies have found some change in perception as a result of living in an L2–speaking country, with lengths of residence ranging from 3 to 11 years (e.g., Flege et al., 1995) and from 0.7 to 7 years (e.g., Flege et al., 1997).

Although L2 experience shifted category boundaries closer to those of native English speakers in the identification task, the category discrimination task showed no significant effect of experience. This is possibly because the more experienced learners (S–UK) had begun to use aspiration as an L2 phonetic cue, rather than relying on pre-voicing, which is related to the Arabic bilabial stop, but their use of aspiration may not yet be robust enough to discriminate between different synthetic VOT values. Another possible reason for such low accuracy in the discrimination task might be related to the nature of the task, where the stimuli were pairs of VOT values ranging from 0 ms to 50 ms. Some values were very similar to each other, e.g., 15 ms and 10 ms, which may have made them difficult to discriminate. A few studies, as mentioned above, have investigated the perception of the VOTs of bilabial stops by Saudi learners of English. Some of these studies (e.g., Alzahrani, 2021) investigated the discrimination between the English /p/ and /b/ using minimal pairs to test the effect of living in Australia on the perception of English bilabial stops. Alzahrani (2021) found that Saudi speakers who lived in Australia perceived the distinction between the English /p/ and /b/ more accurately than the Saudi group who studied English in Saudi Arabia. The findings in this study showed an effect of experience on the category boundaries on the VOT continuum. However, it would have been more useful if discrimination of some minimal pairs (e.g., pit–bit) had been included as a separate task alongside discrimination along the VOT continuum. This would have allowed us to compare the accuracy of discriminating between categories (using minimal pairs) with the sensitivity to within–category shifts (using the VOT continuum).

Given that the participants with at least 3 years of immersion-based experience showed VOT perceptual category boundaries that are close to those of native English speakers and that their category discrimination was fairly accurate, another topic of future study could be whether the VOT category boundaries and discrimination accuracy can be improved by training to achieve native-like L2 perceptual strategies, i.e., not relying on L1 phonetic cues. Training that is focused on specific phonetic cues has been shown to change some perceptual strategies (e.g., Francis et al., 2008), and thus, if Saudi speakers were trained to perceive different VOT values related to English voicing, their performance might improve. Such training could be used as a teaching tool for this group of L2 learners.

Although other studies found that English bilabial production did not improve as a result of L2 experience for Saudi speakers (e.g., Alanazi, 2018), the results from this study showed that such experience can improve VOT perception. Perhaps this is because the relationship between the two domains is not always straightforward (cf. Gorba & Cebrian, 2021; Baese-Berk et al., 2025), and improvement in speech perception does not always mean improvement in speech production. For example, Gorba and Cebrian (2021) found no correlation between VOT perception and production, which means that some learners may perceive VOT values accurately but do not necessarily produce them in a native–like manner, and vice versa. That being said, it would have been more useful if the connection between VOT perception and production had been investigated in this study to see whether learners who perceive VOT values accurately or have closer category boundaries to those of native English speakers would also produce VOT values in a native–like manner.

A key limitation of this study is that the VOT values examined were restricted to the bilabial stops in English. Including VOT values for the Arabic bilabial stop /b/ would have allowed for a more direct assessment of whether Saudi participants’ perceptual judgement was influenced by VOT patterns associated with Arabic /b/. Another limitation concerns the variation in participants’ age of onset of English acquisition, which may have affected their performance, as early bilinguals might have performed differently. Future research should therefore control the age of onset of L2 acquisition when examining the impact of L2 experience. Finally, like many online experimental designs, the study lacked the environmental control typically found in laboratory-based settings. Factors such as participants’ audio quality, ambient noise, and display conditions could not be fully controlled. Although the Gorilla Experiment Builder platform restricted participation to computers, which partially controlled for display conditions, other variables such as audio quality, volume level, and the testing environment remained outside the researcher’s control.

A valuable direction for future research would be to examine how L2 experience shapes the perception of VOT boundaries, especially regarding the VOT continuum and the use of voicing contrasts in lexical contexts (see Giovannone & Theodore, 2021). Lexical information has been shown to support speech perception (Ganong, 1980). When learners hear VOT variants within real words, they may be more likely to detect contrasts because they can anchor these variants to familiar lexical items, e.g., bet versus pet. In contrast, identifying distinctions in isolated acoustic stimuli may be more difficult. Comparing learners’ perception of the VOT continuum with their perception of VOT values presented within words would allow researchers to determine whether increased L2 experience enhances sensitivity to VOT differences and whether learners can distinguish these contrasts when they appear in meaningful lexical contexts, potentially revealing their phonolexical representations.

5. Conclusions

This study focused on VOT perception by two Saudi Arabic groups that differ in English experience (experience is defined by living in the UK). This study aimed to investigate whether the category boundaries between English bilabial stops differ with experience and whether this could affect a learner’s ability to discriminate VOT values between and within categories. The results showed that experienced Saudi learners of English had closer category boundaries to those of native English speakers than the less experienced learners did. However, discrimination did not show a significant difference between the two groups. These findings highlight the importance of L2 experience in phonetic category formation, which can be useful for tailoring teaching tools for L2 learners. Future research should explore the implications of the potential advantages associated with L2 experience through targeted phonetic training interventions, such as high-variability phonetic training (HVPT; see Barriuso & Hayes-Harb, 2018, for review). These programmes aim to direct learners’ attention towards specific L2 cues, such as voicing, lexical stress, or formant frequencies, which can enhance L2 speech perception.

Funding

This study was supported by the KAU Endowment (WAQF) at King Abdulaziz University, Jeddah, Saudi Arabia. The author gratefully acknowledges the WAQF and the Deanship of Scientific Research (DSR) for their financial and technical support.

Institutional Review Board Statement

This research was approved by the ethical approval committee at the English Language Institute, King Abdulaziz University.

Informed Consent Statement

Informed consent was obtained from all participants involved in this study.

Data Availability Statement

The data that support the findings of this study are available from the author upon request.

Acknowledgments

The author would like to thank Mark Wibrow for his help with Gorilla, the experiment builder.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Audio Stimuli Used for Tasks in MP3 Format, with 10–Step VOT Continuum from /b/ to /p/

Identification Task	Discrimination Task
BP__VOT_01	3VOT-4VOT
BP__VOT_02	1VOT-9VOT
BP__VOT_03	10VOT-10VOT
BP__VOT_04	1VOT-10VOT
BP__VOT_05	1VOT-1VOT
BP__VOT_06	1VOT-2VOT
BP__VOT_07	1VOT-3VOT
BP__VOT_08	1VOT-4VOT
BP__VOT_09	1VOT-5VOT
BP__VOT_10	1VOT-6VOT
BP__VOT_01	1VOT-7VOT
BP__VOT_02	1VOT-8VOT
BP__VOT_03	2VOT-10VOT
BP__VOT_04	2VOT-2VOT
BP__VOT_05	2VOT-3VOT
BP__VOT_06	2VOT-4VOT
BP__VOT_07	2VOT-5VOT
BP__VOT_08	2VOT-6VOT
BP__VOT_09	2VOT-7VOT
BP__VOT_10	2VOT-8VOT
BP__VOT_01	2VOT-9VOT
BP__VOT_02	3VOT-10VOT
BP__VOT_03	3VOT-3VOT
BP__VOT_04	3VOT-5VOT
BP__VOT_05	3VOT-6VOT
BP__VOT_06	3VOT-7VOT
BP__VOT_07	3VOT-8VOT
BP__VOT_08	3VOT-9VOT
BP__VOT_09	4VOT-10VOT
BP__VOT_10	4VOT-4VOT
BP__VOT_01	4VOT-5VOT
BP__VOT_02	4VOT-6VOT
BP__VOT_03	4VOT-7VOT
BP__VOT_04	4VOT-8VOT
BP__VOT_05	4VOT-9VOT
BP__VOT_06	5VOT-10VOT
BP__VOT_07	5VOT-5VOT
BP__VOT_08	5VOT-6VOT
BP__VOT_09	5VOT-7VOT
BP__VOT_10	5VOT-8VOT
BP__VOT_01	5VOT-9VOT
BP__VOT_02	6VOT-10VOT
BP__VOT_03	6VOT-6VOT
BP__VOT_04	6VOT-7VOT
BP__VOT_05	6VOT-8VOT
BP__VOT_06	6VOT-9VOT
BP__VOT_07	7VOT-10VOT
BP__VOT_08	7VOT-7VOT
BP__VOT_09	7VOT-8VOT
BP__VOT_10	7VOT-9VOT
	8VOT-10VOT
	8VOT-8VOT
	8VOT-9VOT
	9VOT-10VOT
	9VOT-10VOT
	9VOT-9VOT

References

Agbangba, C. E., Aide, E. S., Honfo, H., & Kakai, R. G. (2024). On the use of post-hoc tests in environmental and biological sciences: A critical review. Heliyon, 10(3), e25131.
Alanazi, S. (2018). The acquisition of English stops by Saudi L2 learners [Unpublished doctoral dissertation]. University of Essex.
Al-Ani, S. H. (1970). Arabic phonology: An acoustical and physiological investigation. Mouton.
Al-Ghamdi, N., Al-Tamimi, J., & Khattab, G. (2019). The acoustic properties of laryngeal contrast in Najdi Arabic initial stops. In S. Calhoun, P. Escudero, M. Tabain, & P. Warren (Eds.), Proceedings of the 19th international congress of phonetic sciences (pp. 2051–2055). Australasian Speech Science and Technology Association Inc.
Alharbi, A., Foltz, A., Kornder, L., & Mennen, I. (2023). L2 acquisition and L1 attrition of VOTs of voiceless plosives in highly proficient late bilinguals. Second Language Research, 39(4), 1133–1163.
Alluhaidah, M. (2023). Comparison VOT production between Arabic ESL learner and English native speaker. American Journal of Educational Research, 11(2), 25–33.
Alotaibi, Y. A., & AlDahri, S. S. (2011). Investigating VOTs of Arabic stops /b, k/ with comparisons to other languages. In 2011 4th international congress on image and signal processing (Vol. 5, pp. 2413–2417). IEEE.
Alzahrani, S. (2021). The perception in Saudi learners of the English bilabial stops and the English labiodental fricatives. International Journal of English Linguistics, 11(1), 278.
Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2020). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods, 52, 388–407.
Baese-Berk, M. M., Kapnoula, E. C., & Samuel, A. G. (2025). The relationship of speech perception and speech production: It’s complicated. Psychonomic Bulletin & Review, 32(1), 226–242.
Barriuso, T. A., & Hayes-Harb, R. (2018). High variability phonetic training as a bridge from research to practice. The CATESOL Journal, 30(1), 177–194.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). York Press.
Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn, & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (Vol. 10389, pp. 13–34). John Benjamins.
Boersma, P., & Weenink, D. (2022). Praat: Doing phonetics by computer (Version 6.2.00). Available online: http://www.fon.hum.uva.nl/praat/download_win.html (accessed on 29 August 2025).
Casillas, J. V. (2021). Exploring phonemic boundaries using logistic regression. Available online: https://www.jvcasillas.com/posts/2021-05-15_logistic_regression_and_phonemic_boundaries/2021-05-15_logistic_regression_and_phonemic_boundaries.html (accessed on 29 August 2025).
Cebrian, J. (2006). Experience and the use of non-native duration in L2 vowel categorization. Journal of Phonetics, 34(3), 372–387.
Cho, T., Whalen, D. H., & Docherty, G. (2019). Voice onset time and beyond: Exploring laryngeal contrast in 19 languages. Journal of Phonetics, 72, 52–65.
Doan, T. L. A., & Oh, E. (2023). The role of L2 experience on the perceived similarity and identification of British English vowels by Vietnamese speakers. Linguistic Research, 40, 127–149.
Escudero, P. (2005). Linguistic perception and second language acquisition: Explaining the attainment of optimal phonological categorization [Ph.D. thesis, Utrecht University].
Flege, J. E. (1987). The production of “new” and “similar” phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15(1), 47–65.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). York Press.
Flege, J. E., Bohn, O. S., & Jang, S. (1997). Effects of experience on non-native speakers’ production and perception of English vowels. Journal of Phonetics, 25, 437–470.
Flege, J. E., & Liu, S. (2001). The effect of experience on adults’ acquisition of a second language. Studies in Second Language Acquisition, 23(4), 527–552.
Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995). Factors affecting the strength of perceived foreign accent in a second language. Journal of the Acoustical Society of America, 97(5), 3125–3134.
Flege, J. E., & Port, R. (1981). Cross-language phonetic interference: Arabic to English. Language and Speech, 24(2), 125–146.
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. Available online: https://www.johnfox.ca/Companion/ (accessed on 29 August 2025).
Francis, A. L., Kaganovich, N., & Driscoll-Huber, C. (2008). Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English. The Journal of the Acoustical Society of America, 124(2), 1234–1251.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6(1), 110–125.
García-Sierra, A., Schifano, E., Duncan, G. M., & Fish, M. S. (2021). An analysis of the perception of stop consonants in bilinguals and monolinguals in different phonetic contexts: A range-based language cueing approach. Attention, Perception, & Psychophysics, 83, 1878–1896.
Georgiou, G. P. (2021). Toward a new model for speech perception: The Universal Perceptual Model (UPM) of second language. Cognitive Processing, 22(2), 277–289.
Giovannone, N., & Theodore, R. M. (2021). Individual differences in lexical contributions to speech perception. Journal of Speech, Language, and Hearing Research: JSLHR, 64(3), 707–724.
Gorba, C. (2018). The effect of L2 experience on the categorization of native and non-native stops by Spanish learners of English. In S. Martin, D. Owen, & E. Pladevall-Ballester (Eds.), Persistence and resistance in English studies. New research (pp. 163–173). Cambridge Scholars Publishing.
Gorba, C. (2019). Bidirectional influence on L1 Spanish and L2 English stop perception: The role of L2 experience. The Journal of the Acoustical Society of America, 145(6), EL587–EL592.
Gorba, C., & Cebrian, J. (2021). The role of L2 experience in L1 and L2 perception and production of voiceless stops by English learners of Spanish. Journal of Phonetics, 88, 101094.
Hattori, K. (2010). Perception and production of English /r/-/l/ by adult Japanese speakers [Doctoral dissertation, UCL (University College London)].
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099–3111.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Hoonhorst, I., Colin, C., Markessis, E., Radeau, M., Deltenre, P., & Serniclaes, W. (2009). French native speakers in the making: From language-general to language-specific voicing boundaries. Journal of Experimental Child Psychology, 104(4), 353–366.
Iverson, P., Kuhl, P. K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., & Siebert, C. (2003). A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition, 87, B47–B57.
Kartushina, N., & Martin, C. D. (2019). Third-language learning affects bilinguals’ production in both their native languages: A longitudinal study of dynamic changes in L1, L2, and L3 vowel production. Journal of Phonetics, 77, 100920.
Khattab, G. (2002). VOT production in English and Arabic bilingual and monolingual children. In Perspectives on Arabic linguistics XIII–XIV: Papers from the thirteenth and fourteenth annual symposia on Arabic linguistics (Vol. 230, p. 1). John Benjamins Publishing.
Kulikov, V. (2016). Voicing in Qatari Arabic: Evidence for prevoicing and aspiration. In Qatar foundation annual research conference proceedings (Vol. 2016, p. SSHAPP2330). HBKU Press.
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82, 1–26.
Ladefoged, P., & Maddieson, I. (1996). The sounds of the world’s languages. Blackwell Publishing.
Levy, E. S., & Law, F. F. (2010). Production of French vowels by American-English learners of French: Language experience, consonantal context, and the perception-production relationship. The Journal of the Acoustical Society of America, 128(3), 1290–1305.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431.
Liberman, A. M., Delattre, P. C., & Cooper, F. S. (1958). Some cues for the distinction between voiced and voiceless stops in initial position. Language and Speech, 1(3), 153–167.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3), 384–422.
Lisker, L., & Abramson, A. S. (1971). Distinctive features and laryngeal control. Language, 47, 767–785.
Morrison, G. S. (2002, April 6–7). Perception of English /i/ and /I/ by Japanese and Spanish listeners: Longitudinal results [Paper presentation]. Northwest Linguistics Conference 2002 (pp. 29–48), Burnaby, BC, Canada.
Morrison, G. S. (2008). Logistic regression modelling for first and second language perception data. In Segmental and prosodic issues in Romance phonology (pp. 219–236). John Benjamins Publishing Company.
Nagle, C. (2019). A longitudinal study of voice onset time development in L2 Spanish stops. Applied Linguistics, 40(1), 86–107.
Nakai, S., & Scobbie, J. M. (2016). The VOT category boundary in word-initial stops: Counter-evidence against rate normalization in English spontaneous speech. Laboratory Phonology, 7(1), 13.
Newman, D. (2002). The phonetic status of Arabic within the world’s languages. Antwerp Papers in Linguistics, 100, 65–75.
Petrova, K., Jasmin, K., Saito, K., & Tierney, A. T. (2023). Extensive residence in a second language environment modifies perceptual strategies for suprasegmental categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 49(12), 1943–1955.
Piske, T., MacKay, I. R., & Flege, J. E. (2001). Factors affecting degree of foreign accent in an L2: A review. Journal of Phonetics, 29(2), 191–215.
R Development Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Schoonmaker-Gates, E. (2015). On voice-onset time as a cue to foreign accent in Spanish: Native and nonnative perceptions. Hispania, 98(4), 779–791.
Souganidis, C., Molinaro, N., & Stoehr, A. (2024). Bilinguals produce language-specific voice onset time in two true-voicing languages: The case of Basque-Spanish early bilinguals. Linguistic Approaches to Bilingualism, 14(3), 370–399.
Trofimovich, P., & Baker, W. (2006). Learning second language suprasegmentals: Effect of L2 experience on prosody and fluency characteristics of L2 speech. Studies in Second Language Acquisition, 28(1), 1–30.
Van Leussen, J. W., & Escudero, P. (2015). Learning to perceive and recognize a second language: The L2LP model revised. Frontiers in Psychology, 6, 1000.
Wickham, H. (2016). Programming with ggplot2. In ggplot2: Elegant graphics for data analysis (pp. 241–253). Springer International Publishing.
Winn, M. B. (2020). Manipulation of voice onset time in speech stimuli: A tutorial and flexible Praat script. The Journal of the Acoustical Society of America, 147(2), 852–866.

Figure 1. Mean percentages of /b/ responses for SSBE participants compared to those of Saudi listeners in the UK (S–UK) and Saudi Arabia (SA).

Figure 2. Heatmap of the accuracy of VOT discrimination between VOT pairs, VOT1 and VOT2, for three groups: SSBE, S–UK (Saudi speakers in the UK), and SA (Saudi speakers in Saudi Arabia).

Table 1. Fixed–effects coefficients in Model 1 of voicing identification. Model formula: response ~ VOT_std × Group + (1|Participant).

	β	SE	z Value	p Value
Fixed Effects
(Intercept)	−0.08835	0.33508	−0.264	0.792
VOT_std	−4.70684	0.47479	−9.913	<2 × 10⁻¹⁶ ***
GroupS–UK	−0.23741	0.38183	−0.622	0.534
GroupSA	−0.26398	0.37054	−0.712	0.476
VOT_std:GroupS–UK	2.95816	0.48274	6.128	8.91 × 10⁻¹⁰ ***
VOT_std:GroupSA	4.27779	0.47771	8.955	<2 × 10⁻¹⁶ ***
Random effects
	Groups	Name	Variance	SD
	Participant	(Intercept)	0.7001	0.8367

Note: *** p < 0.001.

Table 2. Analysis of deviance type II for Model 2. Model formula: nCorrect ~ PairVOT × Group + (1|Participant).

Response: nCorrect
	Chisq	Df	p Value
PairVOT	449.099	56	2.20 × 10⁻¹⁶
Group	6.356	2	0.041668
PairVOT:Group	151.584	106	0.002451

Table 3. Pairwise comparison using Holm’s method: group_emm <- emmeans(mod2, ~Group); pairs(group_emm, adjust = "Holm").

Contrast	β	SE	df	z Ratio	p Value
SSBE–(S–UK)	0.393	0.163	Inf	2.405	0.0323
SSBE–SA	0.57	0.155	Inf	3.667	0.0007
(S–UK)–SA	0.177	0.116	Inf	1.529	0.1262

