Article

Identification of Perceptual Phonetic Training Gains in a Second Language Through Deep Learning

by Georgios P. Georgiou 1,2
1 Department of Languages and Literature, University of Nicosia, Nicosia 2417, Cyprus
2 Phonetic Lab, University of Nicosia, Nicosia 2417, Cyprus
AI 2025, 6(7), 134; https://doi.org/10.3390/ai6070134
Submission received: 28 April 2025 / Revised: 12 June 2025 / Accepted: 20 June 2025 / Published: 23 June 2025

Abstract

Background/Objectives: While machine learning has made substantial strides in pronunciation detection in recent years, there remains a notable gap in the literature regarding improvements in the acquisition of speech sounds following a training intervention, especially in the domain of perception. This study addresses this gap by developing a deep learning algorithm designed to identify perceptual gains resulting from second language (L2) phonetic training. Methods: The participants underwent multiple sessions of high-variability phonetic training, focusing on discriminating challenging L2 vowel contrasts. The deep learning model was trained on perceptual data collected before and after the intervention. Results: The results demonstrated good model performance across a range of metrics, confirming that learners’ gains in phonetic training could be effectively detected by the algorithm. Conclusions: This research underscores the potential of deep learning techniques to track improvements in phonetic training, offering a promising and practical approach for evaluating language learning outcomes and paving the way for more personalized, adaptive language learning solutions. Deep learning enables the automatic extraction of complex patterns in learner behavior that might be missed by traditional methods. This makes it especially valuable in educational contexts where subtle improvements need to be captured and assessed objectively.

1. Introduction

Learning the sounds of a second language (L2) can be challenging for adolescents and adults due to their strong attunement to the phonological system of their first language (L1) [1,2,3,4,5]. This difficulty is especially pronounced for speakers of languages with phonological systems that are smaller, less diverse, or significantly different from those of the L2. Previous studies on speech perception suggest that perceptual learning can occur at almost any age, though it is shaped by the individual’s entire language learning experience (e.g., [6]). Similarly, the main assumptions of various speech models, including the revised Speech Learning Model (SLM-r) [7], the Perceptual Assimilation Model-L2 [8], and the Universal Perceptual Model [9], support the ability of learners to acquire non-native sounds across the lifespan under certain preconditions, such as the receipt of training.
Several studies show the advantage of high-variability phonetic training (HVPT) in enhancing speech acquisition skills [10,11,12,13,14]. HVPT employs multiple voices rather than a single speaker, exposing learners to stimuli produced by various talkers and presented in diverse linguistic contexts. During training, the learners hear target language stimuli and must identify the sounds they perceive, receiving immediate feedback. This process helps them focus on the critical acoustic cues necessary for distinguishing sounds while disregarding speaker-specific variations [11]. Perceptual training plays a foundational role in L2 phonetics by shaping learners’ ability to form accurate phonemic categories in the target language. It enhances their sensitivity to subtle phonetic distinctions that may not exist in their L1, which is essential for both accurate speech perception and intelligible production. By improving perceptual acuity, such training helps lay the groundwork for more robust and transferable language skills.
Simon et al. [12] investigated the robustness of HVPT effects across three dimensions: generalization to novel tokens and talkers, long-term retention, and performance in non-optimal listening conditions (i.e., with noise). The participants were French learners of Dutch in Belgium, divided into an experimental group and a control group. Both groups completed a pretest, posttest, and delayed posttest involving a lexical identification task with and without noise. The experimental group underwent five multimodal HVPT sessions, incorporating perceptual identification tasks with feedback and metalinguistic information. The results confirmed HVPT’s effectiveness in facilitating generalization to new tokens and talkers. Georgiou [13] trained Cypriot Greek children and adults in the identification of L2 English vowels using an HVPT paradigm. The findings indicated that training enhanced vowel identification, with children benefiting more than adults. Although adults showed lower accuracy, their phonological system adapted significantly post-training. The findings also suggest a link between speech perception and production, as training influenced learners’ productions, at least in the case of children.
Machine learning, a subfield of artificial intelligence, focuses on developing algorithms that enable computers to learn from data and make predictions or decisions without being explicitly programmed. Machine learning techniques have been widely adopted across various domains, including healthcare, finance, and education, due to their ability to identify patterns and improve over time through exposure to more data [15]. In the field of L2 acquisition, machine learning is increasingly being leveraged to enhance language learning processes, assess learner performance, and personalize instruction. These models have been applied in several areas of L2 acquisition, such as automated speech recognition, computer-assisted pronunciation training (CAPT), grammar correction, learner modeling, and intelligent tutoring systems [16,17]. These technologies not only provide immediate, data-driven feedback to learners but also offer scalable solutions for language education, especially in contexts where human instruction is limited.
Given the rapid advancements in technology, machine learning has also become an effective tool for identifying speech patterns in both L1 speakers and L2 learners [18,19,20,21,22,23,24,25]. Machine learning has emerged as a promising technique, surpassing the traditional statistical inference methods commonly used in phonetic training studies. Its ability to capture complex patterns and relationships offers greater precision, in contrast to statistical tests, which tend to provide less accurate inferences [26]. Moreover, statistical methods can yield varying results, potentially obscuring differences depending on the chosen test [27]. In a recent study, Georgiou and Theodorou [18] developed an automated method for identifying developmental language disorder (DLD) in Cypriot Greek-speaking children using a neural network machine learning algorithm. The model was trained on perceptual and production data from children with DLD and typically developing controls. The perceptual data included the correct and incorrect responses of children in a consonant discrimination task, a grammaticality judgment task, and a semantic judgment task. The performance ranged between 0.87 and 0.92 across various metrics, all of which indicated high classification accuracy. These findings highlighted the potential of machine learning for improving clinical assessments and the early detection of DLD.
Piotrowska et al. [21] investigated the automatic detection of aspiration in the pronunciation of Polish speakers of L2 English using a whole-word analysis. Speech recordings from English speakers and Polish L2 speakers were analyzed using various machine learning classifiers. The results indicated that convolutional neural network-based models performed well in distinguishing aspirated from unaspirated allophones, demonstrating the feasibility of automated pronunciation evaluation for specific phonological features. Korzekwa [22] argued that CAPT systems often underperform in detecting pronunciation errors, as reflected by their low evaluation metrics. A key limitation of current CAPT approaches is the scarcity of annotated mispronounced speech, which is crucial for effectively training pronunciation error detection models. To address this, the author proposed a deep learning-based method for identifying pronunciation errors in L2 English speech. This approach demonstrated improved performance in terms of the area under the curve (AUC) metric, outperforming several state-of-the-art techniques. Although there is increasing research at the intersection of machine learning and phonetics, the vast majority of previous work focuses on the production domain, whereas work on the perceptual domain is still in its infancy.
This study introduces an automated method for detecting perceptual gains in adult L2 learners following phonetic training using a deep learning algorithm. While machine learning has been widely applied in pronunciation detection, there remains a significant gap in research addressing perception-based improvements in L2 sound acquisition after targeted training interventions. Most existing studies focus on production outcomes, overlooking the perceptual domain, which is a critical precursor to accurate speech production. To address this gap, the present study focused on adult L2 learners of Greek whose training targeted specific vowel contrasts known to be problematic due to L1 interference. The goal was to improve their ability to discriminate these sounds through HVPT. Pre- and post-training performance data were used to train a deep learning algorithm capable of identifying individual learners’ perceptual gains. Deep learning models were selected over traditional machine learning methods due to their superior ability to handle complex classification tasks by learning directly from raw input data [28,29,30]. The model’s effectiveness was evaluated using multiple performance metrics, demonstrating its potential as a reliable tool for tracking phonetic training outcomes. The key contribution of this study lies in its novel application of deep learning to the domain of perceptual L2 phonetic training. By providing an automated, scalable method for evaluating perceptual learning outcomes, this research offers new insights into adaptive language learning technologies and opens the door to more personalized approaches in L2 acquisition.

2. Methodology

2.1. Participants

The study involved 15 female learners of Greek as an L2 [11]. These participants were adult L1 speakers of Egyptian Arabic who had been residing in Nicosia, Cyprus, for 4–5 years at the time of the study. Their ages ranged from 18 to 24 years (M = 21.6; SD = 2.06), and they had acquired Greek through formal education in schools and universities. They began learning Greek between the ages of 15 and 20 (M = 17.4; SD = 1.72). All participants were middle-class students who regularly used both their L1 and L2 in daily communication. Based on self-reported data from a questionnaire (5-point Likert scale), they rated their Greek-speaking skills as low (M = 1.93; SD = 0.59), while their comprehension (M = 3.2; SD = 0.41), reading (M = 3.53; SD = 0.64), and writing (M = 3.67; SD = 0.49) abilities ranged from moderate to good. All participants reported having normal hearing and received financial compensation for their participation in both the training and posttest sessions.

2.2. Materials

The experimental stimuli comprised four different trial types: AAB, ABB, BBA, and BAA, where A represented the first vowel in the contrast and B the second. Each trial contained triads of disyllabic pseudowords designed to test four vowel contrasts in Greek: stressed /i/–/e/, unstressed /i/–/e/, stressed /o/–/u/, and unstressed /o/–/u/. These pseudowords were embedded in the phonetic frames ˈsVsa, sVˈsa, ˈVsa, and Vˈsa, ensuring that lexical familiarity did not influence phonetic discrimination. The stimuli for the training phase were recorded by four L1 Greek speakers (two males and two females, aged 20–32) in a quiet room at a sampling rate of 44.1 kHz. The pretest and posttest stimuli were identical to those used in the training session, but additional distractor words with the phonetic frames ˈpVsa and pVˈsa were included. These stimuli were recorded by a female adult L1 speaker of Greek at the same 44.1 kHz sampling rate.

2.3. Procedure

2.3.1. Pretest

Participants were tested individually in a quiet environment using an AXB discrimination task implemented in Praat [31]. They listened to sequences of three words and had to identify whether the second word was the same as the first or the third. The stimuli were presented through headphones at an approximate volume of 75 dB. The test consisted of 108 items (4 vowel contrasts × 2 conditions × 4 trial types × 3 repetitions), along with 16 distractor words. The distractor words were included to prevent the participants from anticipating patterns and to maintain attention throughout the task. The participants were allowed short breaks after completing 36 and 72 items. The interstimulus interval was one second and the intertrial interval was five seconds. The stimuli were randomized for each participant.

2.3.2. Phonetic Training

The training phase took place one week after the pretest in a sound-attenuated room, where the participants were trained individually. The training was conducted using TP–Version 1.0 [32] and focused on vowel contrast discrimination. The stimuli were presented in multiple phonetic contexts, including stressed and unstressed positions, as well as initial fricative /s/ and initial vowel contexts. The procedure mirrored that of the pretest, with the key difference being immediate feedback provided for each triad. If the participant’s response was correct, a tick mark appeared next to the selected option. If incorrect, an X mark was displayed, and the participant was encouraged to use the “Replay” button to listen to the triad again. Each participant underwent a total of 10 h of training, spread across five sessions (two hours per session) over a period of 1.5 to 2 weeks. During the training, they completed 96 triads per session (4 vowel contrasts × 2 conditions × 4 trial types × 3 repetitions).
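For concreteness, the per-session trial structure can be enumerated programmatically. The sketch below, in R (the language used for the modeling in Section 2.4), generates the 96 training triads; the labels for the two conditions are illustrative assumptions, as the paper does not name them explicitly.

```r
# Enumerate the 96 training triads presented per session:
# 4 vowel contrasts x 2 conditions x 4 trial types x 3 repetitions.
triads <- expand.grid(
  contrast   = c("stressed /i/-/e/", "unstressed /i/-/e/",
                 "stressed /o/-/u/", "unstressed /o/-/u/"),
  condition  = c("sV frame", "V frame"),  # assumed labels for the two conditions
  trial_type = c("AAB", "ABB", "BBA", "BAA"),
  repetition = 1:3
)
nrow(triads)                              # 96
triads <- triads[sample(nrow(triads)), ]  # randomize order per participant
```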

2.3.3. Posttest

The posttest was conducted several days after the training session, following the same procedure as the pretest. The participants completed the AXB discrimination task again to assess their progress in vowel contrast perception.

2.4. Deep Learning Algorithm Training

To predict the effectiveness of phonetic training, we implemented a supervised deep learning approach using a feedforward artificial neural network. This model was built with TensorFlow via the Keras package version 2.15.0 in R [33]. We chose this model for its proven ability to capture complex, non-linear relationships between features and outcomes, which is a significant advantage when analyzing nuanced linguistic distinctions and working with limited datasets. The model was trained on a standard laptop with an 11th generation Intel Core i5 processor at 2.4 GHz, 16 GB RAM, running Windows 11.
The target variable for our network was the test phase (pre/post). The input features consisted of the learners’ correct or incorrect responses across three problematic Greek vowel contrasts: stressed /i/–/e/, stressed /o/–/u/, and unstressed /o/–/u/, as identified in Georgiou [11]. These features were encoded as binary values (1 for correct, 0 for incorrect), meaning that each of the three contrasts contributed a distinct binary input node. Our dataset comprised 2430 observations (810 for each vowel contrast) collected from 15 participants, with observations evenly balanced between pretest and posttest conditions.
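The paper does not specify the exact data layout, but one plausible reconstruction is a wide format with one binary column per contrast and the test phase as the label. The sketch below is illustrative only; the data frame `trials` and all column names are assumptions.

```r
library(tidyr)

# Hypothetical trial-level data frame `trials` with assumed columns:
# Participant, Phase ("pre"/"post"), Trial, Contrast, and Response
# (1 = correct, 0 = incorrect). Pivoting to one binary column per
# contrast yields the three input nodes shown in Figure 1.
wide <- pivot_wider(trials, names_from = Contrast, values_from = Response)
x <- as.matrix(wide[, c("stressed_i_e", "stressed_o_u", "unstressed_o_u")])
y <- as.integer(wide$Phase == "post")   # target: test phase (post = 1)
```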
The network architecture was carefully designed for this task. The input layer directly corresponded to the number of input features, which were the encoded binary responses across the three vowel contrasts. We employed two fully connected (dense) hidden layers, each comprising 32 neurons and utilizing the Rectified Linear Unit (ReLU) activation function. ReLU was chosen for its computational efficiency and its effectiveness in mitigating the vanishing gradient problem. To significantly reduce the risk of overfitting, dropout layers with a rate of 0.5 were strategically placed immediately after each hidden layer. This technique randomly deactivates 50% of neurons during each training step, preventing the overreliance on specific pathways and promoting more robust feature learning. Finally, a single neuron with a sigmoid activation function formed the output layer, providing a probability score between 0 and 1, suitable for binary classification tasks.
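A minimal sketch of this architecture in R with the keras package, consistent with the setup described above, might look as follows (the exact code is not published, so details such as the builder function are illustrative):

```r
library(keras)

# Builder so a fresh, identically specified network can be created
# for each cross-validation fold later on.
build_model <- function(input_dim = 3) {
  keras_model_sequential() %>%
    # Two dense hidden layers of 32 ReLU units, each followed by
    # dropout at a rate of 0.5 to reduce overfitting.
    layer_dense(units = 32, activation = "relu",
                input_shape = c(input_dim)) %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dropout(rate = 0.5) %>%
    # Single sigmoid output neuron for binary classification
    # (pretest vs. posttest).
    layer_dense(units = 1, activation = "sigmoid")
}
model <- build_model()
```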
The model’s training and evaluation were configured with specific parameters to ensure robustness and optimal performance. The dataset was initially partitioned using a stratified 90–10 ratio. This crucial step ensured that the proportion of the target variable’s classes was preserved in both the 90% training/validation set and the 10% independent test set to prevent biased evaluation. The model was compiled using the Adam optimizer, known for its adaptive learning rate capabilities and efficient convergence in deep learning applications. Binary cross-entropy was selected as the loss function, which is standard for binary classification problems, quantifying the divergence between predicted probabilities and true labels.
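A base-R sketch of the stratified 90–10 split and the compilation step described above (illustrative; the seed and variable names are assumptions):

```r
set.seed(123)  # illustrative seed, not from the paper

# Stratified 90-10 split: sample 90% of the indices within each class
# of the target so that class proportions are preserved in both sets.
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class,
                           function(i) sample(i, floor(0.9 * length(i)))))
x_train <- x[train_idx, , drop = FALSE];  y_train <- y[train_idx]
x_test  <- x[-train_idx, , drop = FALSE]; y_test  <- y[-train_idx]

# Adam optimizer and binary cross-entropy loss, as described above.
model %>% compile(optimizer = "adam",
                  loss      = "binary_crossentropy",
                  metrics   = c("accuracy"))
```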
A five-fold cross-validation procedure was applied to the 90% training data. This involved partitioning the training set into five equally sized folds, training the model five times (each time using four folds for training and one for validation), and averaging performance metrics across these folds. This approach provides a more reliable and less biased estimate of the model’s generalization capability than a single train-validation split. The training was conducted using mini-batches of size 8. This relatively small batch size aids in escaping shallow local minima and often improves generalization, albeit potentially with more volatile gradient updates per epoch. The model was trained for a maximum of 50 epochs. To prevent overfitting and optimize training time, an early stopping callback was implemented. This mechanism monitored the validation loss, and training automatically ceased if no improvement was observed for 10 consecutive epochs (patience = 10). The model performance was comprehensively monitored and reported using accuracy, precision, recall, and AUC [34]. The final model performance metrics were averaged across the five cross-validation folds on a single run. An indicative architecture of the neural model is illustrated in Figure 1.
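Under the same assumptions, the five-fold cross-validation loop with early stopping could be sketched as below; a fresh model is built per fold so that weights do not leak across folds.

```r
set.seed(123)
folds <- sample(rep(1:5, length.out = length(y_train)))  # fold assignment

histories <- lapply(1:5, function(k) {
  # Fresh model per fold so weights are not carried across folds.
  m <- build_model()
  m %>% compile(optimizer = "adam", loss = "binary_crossentropy",
                metrics = c("accuracy"))
  fit(m,
      x = x_train[folds != k, , drop = FALSE],
      y = y_train[folds != k],
      validation_data = list(x_train[folds == k, , drop = FALSE],
                             y_train[folds == k]),
      epochs = 50, batch_size = 8,
      callbacks = list(
        # Stop when validation loss fails to improve for 10
        # consecutive epochs and keep the best weights.
        callback_early_stopping(monitor = "val_loss", patience = 10,
                                restore_best_weights = TRUE)),
      verbose = 0)
})
```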

3. Results

The results of the phonetic training showed that all L2 vowel contrasts were poorly discriminated in the pretest, with the accuracy rates barely exceeding the chance level. However, the discrimination accuracy improved significantly in the posttest. This indicates that phonetic training was beneficial for the improvement of the learners’ vowel contrast discrimination abilities. Figure 2 displays the percentage of correct responses for the discrimination of three L2 Greek vowel contrasts during the pretest and posttest sessions.
A binomial mixed-effects model was employed to investigate the differences between the pretest and posttest for all contrasts. Response (correct/incorrect) was the binary dependent variable. Contrast (Stressed /i/–/e/, Stressed /o/–/u/, and Unstressed /o/–/u/), Test (Pre/Post), and their interaction served as fixed factors, and Participant served as the random factor. The results showed no significant differences between ContrastStressedo–u, ContrastUnstressedo–u, and the Intercept term, and no significant differences were observed for any of the interactions. However, the TestPre term differed significantly from the Intercept. Moreover, the odds ratio for TestPre was 0.21, indicating that the odds of a correct response in the pretest were 79% lower than in the posttest, a strong effect. This confirms the positive role of the intervention in the improvement of the learners’ discrimination abilities. Table 1 presents the results of the binomial mixed-effects model.
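The paper does not name the software used for this model, but a standard specification in R’s lme4 package (assumed here, with a hypothetical data frame `d`) would be:

```r
library(lme4)

# Binomial mixed-effects model: Response (correct/incorrect) predicted
# by Contrast, Test, and their interaction, with a by-participant
# random intercept.
m <- glmer(Response ~ Contrast * Test + (1 | Participant),
           data = d, family = binomial)
summary(m)

# Odds ratios from the fixed-effect estimates; for TestPre,
# exp(-1.55) is approximately 0.21, matching Table 1.
exp(fixef(m))
```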
A deep learning model was utilized to investigate how well it detects gains of perceptual phonetic training. To evaluate the model’s performance, we used different metrics. Accuracy indicates how well a model performs overall by measuring the proportion of correctly classified instances out of the total dataset, using the following formula: (true positives + true negatives)/(true positives + true negatives + false positives + false negatives). Precision evaluates the quality of positive predictions by calculating the ratio of true positives to all predicted positives: true positives/(true positives + false positives). Recall, also known as sensitivity, measures the model’s ability to identify all actual positive cases and is calculated as follows: true positives/(true positives + false negatives). The F1-score combines precision and recall into a single metric using their harmonic mean: 2 × (precision × recall)/(precision + recall); this offers a balanced view especially useful in cases of class imbalance. The AUC (area under the receiver operating characteristic curve) summarizes discrimination performance across all classification thresholds in a single value. An AUC close to 1 indicates strong model performance, whereas an AUC around 0.5 suggests the model is no better than random guessing.
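As a worked example, these formulas can be computed directly from confusion-matrix counts; the counts below are illustrative, not values from the study.

```r
# Illustrative confusion-matrix counts (not the study's actual values).
tp <- 93; tn <- 87; fp <- 32; fn <- 30

accuracy  <- (tp + tn) / (tp + tn + fp + fn)                # 0.74
precision <- tp / (tp + fp)                                 # 0.74
recall    <- tp / (tp + fn)                                 # 0.76
f1        <- 2 * precision * recall / (precision + recall)  # 0.75

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 2)
```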
To evaluate the contribution of early stopping to model generalization, we conducted an ablation experiment by removing the early stopping callback from the training process. Without early stopping, the model was allowed to train for the full 50 epochs regardless of the validation loss trends. Under this condition, the performance metrics ranged from 0.28 to 0.52, a substantial decline relative to the model trained with early stopping. This suggests that early stopping plays a crucial role in preventing overfitting and enhancing generalization. By monitoring validation loss and halting training when the performance deteriorates, early stopping ensures the model retains its best weights without overtraining.
The results of the deep learning algorithm suggested that the model performs moderately well, showcasing values between 0.74 and 0.80. The model achieved an average accuracy of 0.74, indicating that it correctly classified approximately 74% of the instances. The average precision was 0.74, suggesting that when the model predicted a positive class, 74% of those predictions were correct. The average recall was 0.76, demonstrating that the model correctly identified 76% of the actual positive cases. The average F1-score, which balances precision and recall, was 0.75, reflecting a stable trade-off between the two measures. Additionally, the model attained an average AUC of 0.80, which indicates strong discrimination between the positive and negative classes. The values of each metric are shown in Table 2. Figure 3 illustrates the training and validation performance curves, which indicate how the model improves over time.

4. Discussion

This study employed a deep learning algorithm to explore how well it distinguishes phonetic training intervention gains in L2 learners, specifically focusing on their ability to discriminate L2 vowel contrasts. The model was trained with data obtained from a controlled forced-choice discrimination task, in which the participants were required to differentiate between minimal pairs of L2 vowels. The deep learning model was evaluated using several metrics, including accuracy, precision, recall, F1-score, and AUC, to gauge its effectiveness in detecting changes in learners’ phonetic abilities after the training intervention.
It is well known that speakers often struggle to perceive L2 speech sounds due to the influence of their L1 [35]. However, this understanding has evolved with the recognition of lifelong speech learning through the positive effects of training interventions. Evidence for the effectiveness of phonetic training in L2 learners comes from the broader literature on perceptual training in L2 speech. Research by Bradlow et al. [10] has shown that targeted perceptual training led to measurable improvements in speech perception and production. Similarly, in a more recent study, Georgiou [13] found that both children and adults benefited from HVPT, with gains being transferred to production. In line with previous work, our study found that learners who underwent phonetic training intervention exhibited increased accuracy in identifying L2 vowel contrasts, which aligned with the performance metrics produced by the deep learning classifier. These findings corroborate the notion that phonetic training interventions are effective in enhancing L2 learners’ ability to perceive subtle phonetic distinctions. They are also consistent with the tenets of several speech models regarding the lifelong opportunity for L2 sound learning [7,8,9]. Phonetic training interventions, particularly those supported by deep learning technology, can continue to benefit learners of all ages. As the literature on lifelong learning suggests, speech perception is malleable, even in adulthood. This underscores the importance of incorporating effective training methods that cater to learners at various stages of their language acquisition journey. The use of deep learning to measure and enhance phonetic training gains contributes to the long-term success of language learners, regardless of their age.
Previous research in the area of phonetics and machine learning suggests that automatic classifiers, especially those based on neural networks, are well suited for analyzing complex speech data and capturing subtle differences in phonetic performance (e.g., [19,21,22]). In our study, the deep learning classifier showed a good capacity to identify phonetic training gains in L2 learners. The model achieved high accuracy in distinguishing between pre- and post-intervention responses in the speech perception task. The metrics used to evaluate the performance of the deep learning model in this study are well established in the literature as robust measures of classifier performance in speech and language processing tasks (e.g., [18,36]). Moreover, the model’s performance on the held-out test set indicates that it generalizes effectively to new, unseen data. Generalization is a crucial measure of a model’s ability to perform accurately in real-world situations, beyond just the training data. The good performance of the model on the testing subset suggests that it has successfully learned the underlying patterns in the data during training without overfitting to specific examples, thus improving its reliability and practical usefulness. While the deep learning classifier demonstrated positive results, it would be useful to compare its performance against traditional machine learning models, such as support vector machines or random forests. This comparison would clarify the relative strengths and weaknesses of deep learning models in the context of phonetic analysis.
This study adds to the growing impact of deep learning on pronunciation learning, paving the way for more effective and personalized language education. For example, Liu [37] introduces a backward recognition deep learning approach, which demonstrates the potential of deep learning to significantly enhance pronunciation accuracy in the teaching of spoken English. The study showcases the use of recurrent neural networks and transformers to model complex speech patterns and provide real-time feedback. Xu [38] argued that deep learning technology not only facilitates rapid identification and correction of pronunciation errors but also enhances the overall language learning efficiency through data-driven personalization. The author investigated the application of deep learning in assessing English pronunciation quality, with a specific focus on three key prosodic features. He concluded that automated pronunciation evaluation systems can effectively address the scalability and consistency limitations inherent in human-rated assessments; deep learning-based systems offer significant advantages for L2 pronunciation acquisition by delivering precise, objective, and instantaneous feedback. By demonstrating the feasibility of using deep learning to automatically identify phonetic training gains, this research opens up new avenues for personalized language learning interventions in the speech perception domain. Moreover, this approach could be used to develop more efficient and objective methods for assessing the effectiveness of different training protocols, leading to more evidence-based pedagogical practices. For example, by comparing the phonetic gains of students who undergo different training protocols (e.g., auditory training, visual feedback, or multisensory approaches), educators could identify which techniques are most effective in improving sound discrimination.

5. Conclusions

In conclusion, the results of this study demonstrate that deep learning algorithms, when trained on controlled discrimination task data, can effectively capture and distinguish phonetic training gains in L2 learners. This suggests that such models offer a promising tool for assessing the efficacy of phonetic training interventions and advancing our understanding of the mechanisms underlying L2 speech learning. However, it is important to acknowledge that the performance of the neural algorithm, while promising, still leaves room for improvement. Despite the encouraging findings, several limitations warrant further investigation.
Firstly, the study’s sample of L2 learners was relatively small and linguistically homogeneous, consisting exclusively of female L1 speakers of Egyptian Arabic. This narrow demographic profile imposes limitations on both the statistical reliability and the generalizability of our findings. The small sample size reduces the statistical power, thereby limiting our ability to detect subtle effects or interactions. Moreover, linguistic and gender homogeneity restricts the applicability of the results to broader L2 populations, such as those with different L1 backgrounds, gender identities, and language learning experiences. Although this controlled design was intentional, aiming to minimize variability from gender and L1 phonological systems, it inevitably sacrifices ecological validity for internal control. Future research should, therefore, prioritize larger, more heterogeneous samples that reflect the diversity of L2 learners, incorporating a wider range of L1s, gender identities, and proficiency levels. This would not only enhance the robustness and generalizability of the observed patterns but also support more precise analyses of individual differences in perceptual learning. Despite these constraints, our findings serve as an initial proof of concept for computational approaches to tracking L2 phonetic learning, offering a foundation for more scalable and inclusive investigations moving forward.
Secondly, this study focused exclusively on vowel contrasts; however, consonantal and prosodic contrasts are also essential for intelligibility in L2 speech. Expanding the model to include these features could provide a more comprehensive understanding of L2 phonetic learning. Incorporating such predictors may improve the model’s ability to generalize across different types of linguistic contrasts and learner profiles. However, doing so also introduces several challenges. These include increased data complexity, potential multicollinearity among features, and the need for larger datasets to ensure robust learning and avoid overfitting. Moreover, collecting high-quality, annotated data for consonantal and prosodic features often requires more sophisticated instrumentation and manual labor. Future work should weigh these trade-offs carefully, with the aim of enhancing model interpretability and predictive power while maintaining methodological rigor and feasibility.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Department of Languages and Literature of the University of Nicosia in March 2025.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study and the code are available on request from the corresponding author.

Acknowledgments

This study is supported by the Phonetic Lab of the University of Nicosia.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Black, M.R.; Rato, A.; Rafat, Y. Effect of perceptual training without feedback on bilingual speech perception: Evidence from approximant-stop discrimination in L1 Spanish and L1 English late bilinguals. J. Monolingual Biling. Speech 2024, 6, 127–150.
2. Fabra, L.R.; Tyler, M.D. Predicting discrimination difficulty of Californian English vowel contrasts from L2-to-L1 categorization. Ampersand 2023, 10, 100109.
3. Wang, Y.; Bundgaard-Nielsen, R.L.; Baker, B.J.; Maxwell, O. Same vowels but different contrasts: Mandarin listeners’ perception of English /ei/–/iː/ in unfamiliar phonotactic contexts. J. Phon. 2023, 97, 101221.
4. Kaucke, S.; Schlechtweg, M. English speakers’ perception of non-native vowel contrasts in adverse listening conditions: A discrimination study on the German front rounded vowels /y/ and /ø/. Lang. Speech 2025, 68, 162–180.
5. Li, Y. A comparison of perception-based and production-based training approaches to adults’ learning of L2 sounds. Lang. Learn. Dev. 2024, 20, 232–248.
6. Bohn, O.S. Cross-language and second language speech perception. In The Handbook of Psycholinguistics; Fernández, E.M., Cairns, H.S., Eds.; John Wiley & Sons: Hoboken, NJ, USA, 2017; pp. 213–239.
7. Flege, J.E.; Aoyama, K.; Bohn, O.S. The revised speech learning model (SLM-r) applied. In Second Language Speech Learning: Theoretical and Empirical Progress; Flege, J.E., Ed.; John Benjamins: Amsterdam, The Netherlands, 2021; pp. 84–118.
8. Best, C.T.; Tyler, M. Non-native and second-language speech perception: Commonalities and complementarities. In Second Language Speech Learning: In Honor of James Emil Flege; Bohn, O.-S., Munro, M.J., Eds.; John Benjamins: Amsterdam, The Netherlands, 2007; pp. 13–34.
9. Georgiou, G.P.; Giannakou, A.; Alexander, K. Perception of second language phonetic contrasts by monolinguals and bidialectals: A comparison of competencies. Q. J. Exp. Psychol. 2025, 78, 1148–1162.
10. Bradlow, A.R.; Pisoni, D.B.; Akahane-Yamada, R.; Tohkura, Y.I. Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. J. Acoust. Soc. Am. 1997, 101, 2299–2310.
11. Georgiou, G.P. Effects of phonetic training on the discrimination of second language sounds by learners with naturalistic access to the second language. J. Psycholinguist. Res. 2021, 50, 707–721.
12. Simon, E.; De Clercq, B.; Degrave, P.; Decourcelle, Q. On the robustness of high variability phonetic training effects: A study on the perception of non-native Dutch contrasts by French-speaking learners. In Second Language Pronunciation: Different Approaches to Teaching and Training; 2023; pp. 315–344.
13. Georgiou, G.P. The impact of auditory perceptual training on the perception and production of English vowels by Cypriot Greek children and adults. Lang. Learn. Dev. 2022, 18, 379–392.
14. Zhang, W.; Liao, Y.; Truong, H.T. High variability phonetic training facilitates categorical perception of Mandarin lexical tones in L2 older adults: A link to auditory processing. Lang. Teach. Res. 2024.
15. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
16. Xie, H.; Chu, H.C.; Hwang, G.J.; Wang, C.C. Trends and development in technology-enhanced adaptive/personalized learning: A systematic review of journal publications from 2007 to 2017. Comput. Educ. 2019, 140, 103599.
17. Strik, H.; Truong, K.; de Wet, F.; Cucchiarini, C. The impact of ASR-based pronunciation feedback on L2 learners’ pronunciation proficiency. Speech Commun. 2019, 108, 55–68.
18. Georgiou, G.P.; Theodorou, E. Detection of developmental language disorder in Cypriot Greek children using a neural network algorithm. J. Technol. Behav. Sci. 2024.
19. Graham, C. L1 identification from L2 speech using neural spectrogram analysis. In Proceedings of Interspeech 2021; pp. 3959–3963.
20. Hirschi, K.; Kang, O. Machine learning (ML) tools for measuring second language (L2) intelligibility. In Routledge Handbook of Technological Advances in Researching Language Learning; Routledge: London, UK, 2024; pp. 465–478.
21. Piotrowska, M.; Czyżewski, A.; Ciszewski, T.; Korvel, G.; Kurowski, A.; Kostek, B. Evaluation of aspiration problems in L2 English pronunciation employing machine learning. J. Acoust. Soc. Am. 2021, 150, 120–132.
22. Korzekwa, D. Automated detection of pronunciation errors in non-native English speech employing deep learning. arXiv 2022, arXiv:2209.06265.
23. Guo, W. A practical study of English pronunciation correction using deep learning techniques. J. Comput. Methods Sci. Eng. 2025, 25, 1015–1029.
24. Essaid, B.; Kheddar, H.; Batel, N.; Chowdhury, M.E. Deep learning-based coding strategy for improved cochlear implant speech perception in noisy environments. IEEE Access 2025, 13, 35707–35732.
25. Haro, S.; Smalt, C.J.; Ciccarelli, G.A.; Quatieri, T.F. Deep neural network model of hearing-impaired speech-in-noise perception. Front. Neurosci. 2020, 14, 588448.
26. Bzdok, D.; Altman, N.; Krzywinski, M. Statistics versus machine learning. Nat. Methods 2018, 15, 233–234.
27. Haendler, Y.; Lassotta, R.; Adelt, A.; Stadie, N.; Burchert, F.; Adani, F. Bayesian analysis as an alternative to frequentist methods: A demonstration with data from language-impaired children’s relative clause processing. In Proceedings of the 44th Boston University Conference on Language Development, Boston, MA, USA, 7–10 November 2019; pp. 168–181.
28. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695.
29. Chauhan, N.K.; Singh, K. A review on conventional machine learning vs deep learning. In Proceedings of the 2018 International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 28–29 September 2018; IEEE: New York, NY, USA, 2018; pp. 347–352.
30. Kamath, C.N.; Bukhari, S.S.; Dengel, A. Comparative study between traditional machine learning and deep learning approaches for text classification. In Proceedings of the ACM Symposium on Document Engineering 2018, Halifax, NS, Canada, 4–7 September 2018; ACM: New York, NY, USA, 2018; pp. 1–11.
31. Boersma, P.; Weenink, D. Praat: Doing Phonetics by Computer [Computer Program]. Available online: http://www.fon.hum.uva.nl/praat/ (accessed on 27 April 2025).
32. Rauber, A.; Rato, A.; Kluge, D.; Santos, G. TP-S (Version 1.0) [Application Software]. Available online: http://www.worken.com.br/sistemas/tp-s/ (accessed on 27 April 2025).
33. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2025. Available online: https://www.R-project.org/ (accessed on 27 April 2025).
34. Georgiou, G.P. Comparison of the prediction accuracy of machine learning algorithms in crosslinguistic vowel classification. Sci. Rep. 2023, 13, 15594.
35. Zampini, M.L. L2 speech production research: Findings, issues, and advances. In Phonology and Second Language Acquisition; Hansen Edwards, J.G., Zampini, M.L., Eds.; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2008; pp. 219–249.
36. Rukwong, N.; Pongpinigpinyo, S. An acoustic feature-based deep learning model for automatic Thai vowel pronunciation recognition. Appl. Sci. 2022, 12, 6595.
37. Liu, J. Deep learning backward recognition model to improve pronunciation accuracy in the teaching of spoken English. J. Electr. Syst. 2024, 20, 1516–1527.
38. Xu, Y. English speech recognition and evaluation of pronunciation quality using deep learning. Mob. Inf. Syst. 2022, 2022, 7186375.
Figure 1. Indicative architecture of the deep learning model. This diagram illustrates a neural network with three input variables, two hidden layers each containing 32 neurons (with only the first 5 shown for visual clarity), and an output layer representing pretest and posttest performance.
Figure 2. Percentage of correct responses for the discrimination of three L2 Greek contrasts in the pretest and posttest sessions.
Figure 3. Training and validation performance across epochs for each metric. The blue line represents the training performance and the red line indicates the validation performance.
Table 1. Results of binomial mixed-effects model. ContrastStressede-i and TestPost serve as Intercept terms.
Term                             Estimate   Std. Error   z-Value   p-Value
(Intercept)                          2.17         0.17     12.71    <0.001
ContrastStressedo–u                 −0.10         0.23     −0.45      0.65
ContrastUnstressedo–u                0.11         0.24      0.47      0.64
TestPre                             −1.55         0.19     −7.99    <0.001
ContrastStressedo–u:TestPre         −0.08         0.27     −0.29      0.77
ContrastUnstressedo–u:TestPre       −0.53         0.28     −1.91      0.06
Table 2. Performance values for each metric of the deep learning model.
Metric      Value
Accuracy    0.74
Precision   0.74
Recall      0.76
F1-score    0.75
AUC         0.80