1. Introduction
In an increasingly globalized world, English has established itself as the primary language for international communication, attracting learners from various linguistic backgrounds. Language learning encompasses not only grammar and vocabulary, but also accent and pronunciation, which significantly affect communicative competence and speech standardization [
1]. However, regional accents, which are often influenced by a speaker’s native language, introduce challenges in achieving clear and comprehensible English. Recent advances in deep learning technology have fueled the development of sophisticated speech analysis systems, offering potential for improving language-learning outcomes through enhanced pronunciation feedback [
2,
3]. Furthermore, pronunciation is a fundamental component of effective communication, yet it remains one of the most challenging aspects for language learners. Nonnative speakers often substitute phonemes with those familiar from their native languages, producing distinct accents that can impede mutual understanding [
4]. Addressing pronunciation through targeted learning tools not only enhances learners’ speech intelligibility but also fosters confidence and fluency in language use [
5]. Convolutional neural networks (CNNs) in particular have proven to be effective at feature extraction and classification tasks within speech recognition [
3,
6]. These models can discern intricate patterns in audio data, making them highly suitable for accent recognition and pronunciation analysis. Nevertheless, the majority of studies focus on major English accents, such as American and British, with limited exploration and application for a broader array of nonnative accents [
4,
5,
7]. Machine learning techniques, including mel-frequency cepstral coefficients (MFCCs) and spectrogram-based feature extraction, have shown promise in classifying and assessing nonnative pronunciation [
6,
8]. Such approaches enable more precise identification of speech characteristics, contributing to personalized feedback systems designed to support accent improvement and language learning. Deep learning-powered speech analysis systems offer a refined approach to evaluating pronunciation and accent, providing more accurate feedback. By utilizing models trained on diverse linguistic data, these systems tailor feedback to learners’ unique needs, encouraging an iterative learning process [
7,
9]. This focus on accent adaptation and enhancement is crucial for supporting language acquisition and reducing communication barriers among nonnative speakers. Despite substantial progress, current speech recognition and analysis systems encounter limitations, particularly in handling a wide range of accents with the same efficacy as native accents [
5]. Most models are optimized for standard American and British English, leading to an underrepresentation of nonnative accents [
10]. Additionally, noise sensitivity and variations in pronunciation reduce the effectiveness of these systems in real-world scenarios [
11,
12].
In a study by Ensslin et al. [
6], deep learning was investigated for speech accent detection within video games, with a focus on sociolinguistic aspects, such as stereotypical accent usage and related social judgments. AlexNet was trained on the speech accent archive (SAA) data and applied to audio from a video game. To optimize the model, experiments were conducted with varying parameters, including epochs, batch sizes, time windows, and frequency filters, which resulted in an optimal test accuracy of 61%. Following training, a 75% accuracy was achieved on the SAA data and a 52.7% accuracy on game audio samples, where the accuracy improved to 60% in low-noise conditions. Limitations in speech analysis systems, which are typically optimized for American and British English, were addressed by Upadhyay and Lui [
3]. A model capable of classifying nonnative accents was developed. Audio signals were pre-processed and converted to MFCCs. Four classification methods were tested: Random Forest, Gradient Boosting, CNN, and multi-layer perceptron (MLP). Among these methods, the CNN model demonstrated the highest accuracy, where it achieved rates between 80% and 88% and significantly outperformed traditional approaches.
Foreign-accented English classification was explored by Russell and Najafian [
5] to determine speakers’ countries of origin. A corpus of 30 speakers from six countries was developed, and MFCC features were used with a deep belief network (DBN) classifier. After noise cancellation and normalization, the DBN model, consisting of two hidden layers with 1000 nodes, achieved a 90.2% accuracy for two accents and 71.9% for six accents, which outperformed conventional classifiers, like SVM, k-NN, and random forest. The pronunciation quality in English learners was assessed by Nicolao et al. [
11] using deep neural network features and phoneme-specific discriminative classifiers. A system was introduced to provide phoneme-level scoring based on teacher-annotated error patterns. The learner pronunciation was compared with a reference, and pronunciation scores were generated based on phoneme duration and similarity.
For mobile-assisted pronunciation learning, the smartphone-assisted pronunciation learning technique (SAPT) was proposed by Lee et al. [
10]. Pronunciation errors were detected, and words were recommended for practice. Processing was offloaded to an internet of things (IoT) system to address the constraints of low-computation devices. Through a seven-step process, user speech was analyzed, phoneme correlations were evaluated, and practice words were suggested. Finally, pronunciation variation across English varieties was addressed by Kasahara et al. [
9]. A structure-based method to predict pronunciation distances was proposed. Support vector regression (SVR) and Bhattacharyya distances (BDs) were used to represent pronunciation differences. Local contrasts and phonetic class features were identified as significant contributors to accurate pronunciation distance predictions, as indicated by high correlation scores. In [
12], mobile-assisted language learning (MALL) was proposed as a tool to enhance student motivation and readiness, promoting flexibility and engagement in language learning to achieve positive outcomes. However, the study faced limitations, including a restricted sample from Indian universities, hardware constraints affecting speaking and listening tasks, and a narrow focus on English. Liu et al. [
13] proposed a knowledge-based intelligence program to address pronunciation challenges. The proposed methods achieved significant accuracy in classifying correct and incorrect pronunciations. However, the study’s limitations include a small dataset and uncertain generalizability to other phonemes and real-world contexts. Recently, Rukwong and Pongpinigpinyo [
14] introduced an innovative approach to computer-assisted pronunciation training (CAPT) for Thai vowel recognition, leveraging CNN and acoustic features such as mel-spectrograms. Their system effectively addresses key challenges in Thai vowel pronunciation training, including the reliance on expert intervention and the complexity of traditional manual methods. While the system demonstrated an impressive accuracy of 98.61%, its limitations include a reliance on a narrowly focused dataset of standard Thai speakers in controlled environments, raising concerns about its adaptability and robustness in diverse real-world scenarios.
Recent advancements in MALL have increasingly leveraged deep learning techniques to enhance pronunciation assessment. For example, systems using CNNs have demonstrated a strong capability in extracting hierarchical acoustic features, such as spectrograms and MFCCs [
15,
16,
17,
18], leading to improved classification accuracy for accents and pronunciation errors. However, many existing studies focus primarily on broad accent classification or overall pronunciation scoring without integrating personalized, context-aware feedback mechanisms. The present work differs in three main respects:
Multi-feature fusion with fuzzy inference [19]: Unlike prior works that often relied on a single feature type or straightforward classification, our system combines multiple complementary acoustic features, including spectrograms, MFCCs, and formant frequencies, with a fuzzy inference system (FIS). This integration enables context-aware, interpretable, and nuanced feedback tailored to the learner’s specific pronunciation challenges, especially for minimal pair discrimination.
Real-time mobile implementation: While many previous models have demonstrated strong performance offline or in desktop environments, our work emphasizes a real-time, mobile-friendly architecture. This practical focus supports accessible and immediate feedback for learners, promoting effective self-directed practice outside traditional classroom settings.
Targeted support for Chinese-accented English: Our study specifically addresses the pronunciation challenges faced by Chinese-accented English learners by including carefully selected standard words and minimal pairs known to be problematic. This targeted approach enhances the relevance and effectiveness of the feedback, distinguishing it from more general accent recognition systems.
Together, these elements advance the field by offering a comprehensive, multi-feature, context-aware system optimized for mobile use and focused on the particular needs of a specific learner group.
3. Results
The pre-processing phase was essential for enhancing the quality and consistency of audio signals before feature extraction and classification.
Figure 3 shows the pre-processing stages, where the original audio signal underwent transformations, such as zero-mean normalization, fixed-length segmentation, and noise removal, to improve the clarity and uniformity.
Figure 4 highlights the impact of pre-processing on the audio signal, illustrating significant improvements in clarity and consistency. These enhancements directly affect the quality of the extracted features, which form the basis of the classification model. In
Figure 4 (left), the symbols /s/, /pi/, and /der/ represent the segmented phonetic components of the spoken word “spider.” Linguistically, the word is divided into these three parts corresponding to its phonemic and syllabic structure: the initial fricative /s/; the stressed syllable /pi/, pronounced /paɪ/; and the final syllable /der/, pronounced /dər/.
Figure 4 (right) illustrates the segmented speech and non-speech parts that are processed as input for the subsequent feature extraction phase, such as MFCCs and spectrogram analysis.
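As an illustration of this stage, a minimal pre-processing sketch in Python is given below; the frame length, energy threshold, and one-second target duration are assumed values for demonstration, not the exact parameters used in the study.

```python
import numpy as np
import librosa

def preprocess(path, sr=44100, seg_len_s=1.0, energy_thresh=0.005):
    """Zero-mean normalization, energy-based noise/silence removal, and fixed-length segmentation."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = y - np.mean(y)                      # zero-mean normalization
    y = y / (np.max(np.abs(y)) + 1e-9)      # peak normalization to [-1, 1]

    # Drop non-overlapping frames whose short-time energy falls below the threshold.
    frame_len = 1024
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=frame_len)
    energy = np.mean(frames ** 2, axis=0)
    voiced = frames[:, energy > energy_thresh]
    y_clean = voiced.T.reshape(-1) if voiced.size else y

    # Pad or truncate so every utterance has the same fixed length.
    return librosa.util.fix_length(y_clean, size=int(seg_len_s * sr))
```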
Figure 5 provides an overview of the extracted features used for classification, including spectrograms, MFCCs, and formant frequencies. These features serve as critical inputs for the CNN model and contribute to its high classification accuracy, as demonstrated in [
19]. The proposed method involves a feature extraction procedure aimed at identifying salient characteristics from a pre-processed signal to form a two-dimensional array with unique properties. These two-dimensional representations—spectrogram and MFCC—are then used as input data for the CNN model, both as array data and image data, and formatted with a resolution of 640 × 480 pixels.
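A sketch of how these two-dimensional representations could be produced and brought to a uniform resolution is shown below; the FFT size, hop length, number of MFCCs, and the use of OpenCV for resizing are illustrative assumptions rather than the study's exact settings.

```python
import numpy as np
import librosa
import cv2

def extract_feature_images(y, sr=44100, size=(640, 480)):
    """Compute log-power spectrogram and MFCC matrices and resize them to a fixed resolution."""
    # Log-power spectrogram (time-frequency representation).
    spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=1024, hop_length=256)), ref=np.max)

    # Mel-frequency cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Resize both 2-D arrays to the same width x height so the CNN receives uniform inputs.
    w, h = size
    spec_img = cv2.resize(spec_db, (w, h), interpolation=cv2.INTER_LINEAR)
    mfcc_img = cv2.resize(mfcc, (w, h), interpolation=cv2.INTER_LINEAR)
    return spec_img, mfcc_img
```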
3.1. Data Preparation
The dataset used in this study was sourced from OSCAAR (the Online Speech/Corpora Archive and Analysis Resource) [
22], which contains a scripted reading scenario. In this scenario, participants clearly enunciated a scripted list of words one at a time. This dataset proved valuable during the pre-processing step, where we segmented individual word utterances from the original speech recording. The segmentation produced a collection of word-level utterances, which were then used for further feature extraction and analysis.
In addition, we used the Hoosier Database of Native and Nonnative Speech for Children [23], which includes digital audio recordings of both native and nonnative English speakers reading words, sentences, and paragraphs, providing a diverse range of speech samples for our study. The Hoosier Database consists of 27 speakers representing the aforementioned seven native language backgrounds. These speakers produced a total of 1139 recordings across the various tasks listed in
Table 3 and
Table 4.
This study also included a group of Chinese-accented subjects who participated in the experiment. A total of 50 participants were selected (25 male and 25 female), with ages that ranged from 18 to 30 years. None of the subjects had any prior background in English proficiency tests. The audio signals were recorded in a soundproof studio, which ensured a high signal-to-noise ratio (SNR) of 60 dB or higher, in line with the standard for high-quality studio sound recordings. The recordings were captured at the School of Foreign Languages, Dali University, where all participants gave their informed consent to take part in the study. The inclusion of this additional accent category served two primary purposes: first, to explore the development of a system designed to help native Chinese speakers learn other languages, and second, to investigate how variations in the dataset may influence the performance of deep learning models. This research aimed to measure the impacts of these differences on the model’s accuracy and robustness. The participants were selected from a group of native Chinese individuals with no history of exposure to environments that might influence their pronunciation, such as attending international schools from a young age or engaging in prolonged daily interactions with foreigners. For the recordings, a sampling rate of 44,100 Hz was used. The recordings were made in a mono-channel configuration, with a 16-bit resolution, which ensured high-quality, precise sound capture.
3.2. Model Training and Test Results
The approach employed in this study leveraged deep learning techniques to develop a reference-based model for word pronunciation. This model functioned as a classifier to analyze, assess, and classify pronunciation accuracy from a speech or word input. The proposed method utilized a CNN model to achieve this task. The data used to train the deep learning model consisted of the extracted features from the pronunciation audio of words or speech, which were represented through corresponding spectrograms, MFCCs, and formant frequencies. The CNN model was trained with different feature sets, including MFCCs, spectrograms, and formant frequencies. According to the CNN architecture in
Table 2, the input to the first convolutional layer was resized from the original high-resolution feature maps (640 × 400) down to approximately 62 × 46 spatial dimensions before the 3 × 3 filters were applied. The MFCCs were resized to smaller matrices, with tested dimensions of 28 × 28, 64 × 48, and 128 × 48, as shown in Table 5 and Table 6, to select a size that balanced feature detail against computational feasibility. The CNN architecture started with two convolutional layers with 3 × 3 filters and filter counts of 32 and 64, which operated on this resized input. The alternating convolution and max-pooling (2 × 2) layers reduced the spatial dimensions progressively: from 62 × 46 to 60 × 44, 30 × 22, 28 × 20, and finally 14 × 10.
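A minimal Keras sketch consistent with this description (a 62 × 46 single-channel input, two 3 × 3 convolutional layers with 32 and 64 filters, each followed by 2 × 2 max pooling) is shown below; the dense-layer width, dropout rate, and optimizer are our assumptions rather than the settings in Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(62, 46, 1), n_classes=8):
    """CNN roughly matching the described spatial progression: 62x46 -> 60x44 -> 30x22 -> 28x20 -> 14x10."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu'),   # 62x46 -> 60x44
        layers.MaxPooling2D((2, 2)),                    # 60x44 -> 30x22
        layers.Conv2D(64, (3, 3), activation='relu'),   # 30x22 -> 28x20
        layers.MaxPooling2D((2, 2)),                    # 28x20 -> 14x10
        layers.Flatten(),
        layers.Dense(128, activation='relu'),           # assumed width
        layers.Dropout(0.5),                            # assumed regularization
        layers.Dense(n_classes, activation='softmax'),  # eight accent classes
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```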
The results, summarized in
Table 5, highlight the effectiveness of the MFCCs and spectrograms, where the test accuracies peaked at 73.89% and 74.27%, respectively. As demonstrated by previous research, MFCCs can achieve high precision when used for accent classification. In this study, we aimed to investigate how the chosen dataset and pre-processing methods impacted the classification results across eight distinct accent classes.
Figure 6a illustrates the average accuracy across all parameters during the training of the classification model using the MFCC dataset with a 0.005 threshold over 30 epochs. After the 15th epoch, the model began to overfit: while the training accuracy continued to improve, the testing accuracy plateaued and remained stagnant.
Figure 6b presents the confusion matrix for the MFCC-based model. The distribution of correctly predicted accents was relatively uniform across the different accent classes, with most classes exceeding 300 correctly predicted instances in the test dataset; the German-accented class was the main exception. This discrepancy is attributed to the unequal distribution of the data, as the German-accented speech data were approximately 20% fewer than the other accent data, which led to a slight imbalance in the predictions. The spectrogram-based model demonstrated superior performance, achieving a peak test accuracy of 79% when optimized network parameters were applied, as shown in
Table 5. This underscores the ability of spectrograms to capture more detailed temporal and frequency-related information, which provided a more comprehensive representation of the speech signal compared with the MFCCs alone.
In
Table 6, the highest precision achieved by the model was 0.829, while the lowest precision across all parameters derived from MFCC-based data was 0.7878. These results indicate that the spectrogram approach consistently outperformed the MFCC method.
Figure 7a illustrates the average accuracy across all parameters while training the classification model using the spectrogram dataset with a threshold of 0.005 and 30 epochs. After approximately the 24th epoch, the accuracy stabilized and ceased to improve.
Figure 7b shows the prediction map for all classes from the classification model trained using spectrograms. The pattern was consistent across all classes, with a very high prediction rate, except for the German accent, which showed a lower accuracy. The highest classification accuracy, approximately 87%, was achieved by combining the MFCCs and spectrograms. This approach leveraged both the frequency emphasis from the MFCCs and the temporal detail from the spectrograms, which optimized the model’s classification capabilities for accent detection.
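The text does not specify how the two representations were fused; one simple possibility, shown purely as an assumed illustration, is to resize both to a common grid and stack them as input channels, after which a CNN such as the build_cnn sketch above would take input_shape=(62, 46, 2).

```python
import numpy as np
import cv2

def fuse_features(spec_db, mfcc, target_hw=(62, 46)):
    """Stack a spectrogram and an MFCC matrix as two channels of a single CNN input."""
    h, w = target_hw
    spec_r = cv2.resize(spec_db, (w, h), interpolation=cv2.INTER_LINEAR)
    mfcc_r = cv2.resize(mfcc, (w, h), interpolation=cv2.INTER_LINEAR)

    # Per-channel min-max normalization so both features share a comparable scale.
    def norm(a):
        return (a - a.min()) / (a.max() - a.min() + 1e-9)

    return np.stack([norm(spec_r), norm(mfcc_r)], axis=-1)   # shape (62, 46, 2)
```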
To design the assessment process, a fuzzy inference system (FIS) based on Scikit-Fuzzy 0.5.0 was implemented [
24]. The subjects were asked to select words from a curated vocabulary list, which included standard words (commonly used English words covering a broad range of phonetic contexts) and phonetic contrast pairs specifically designed to target known pronunciation challenges for Chinese-accented English learners. Upon selection, the subject pronounced the chosen word through the MALL system, which processed the recorded speech input by extracting the key acoustic features for evaluation.
3.3. Period of Phonetic (PoP)
The PoP approach begins by segmenting the phonetic components of the sample signal using the pre-processing method. The resulting PoP value represents the phoneme duration (in seconds) of each segmented part of the sample signal compared with the corresponding template segment. During segmentation, a single threshold was insufficient to achieve optimal results, so an adaptive thresholding method was employed, spanning values from 0.001 to 0.017. The PoP values were constrained to a range of 0 to 1 s. The PoP was converted into five fuzzy sets, namely, poor, mediocre, average, decent, and good, since this finer granularity better distinguished degrees of similarity. Thus, the normalized PoP score, pnorm, for the entire utterance could be computed using Equation (4), where si is the duration of the i-th phonetic segment in the subject’s utterance, ti is the duration of the i-th phonetic segment in the dataset’s template, and N is the total number of phonetic segments in the word. The final PoP score indicates the closeness of the duration match as a percentage, from 0 for no match to 100% for a perfect match.
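As a hedged illustration of such a duration-based score (the per-segment similarity form and the function and parameter names are assumptions, not the authors’ Equation (4)), a Python sketch might look like the following, with dur_sub and dur_temp corresponding to the segment durations si and ti defined above.

```python
import numpy as np

def pop_score(dur_sub, dur_temp):
    """Hypothetical normalized PoP score: mean per-segment duration similarity.

    dur_sub and dur_temp are arrays of phonetic segment durations (seconds), clipped to [0, 1] s,
    for the subject's utterance and the template; the result is a percentage in [0, 100],
    where 100 indicates a perfect duration match.
    """
    dur_sub = np.clip(np.asarray(dur_sub, dtype=float), 0.0, 1.0)
    dur_temp = np.clip(np.asarray(dur_temp, dtype=float), 0.0, 1.0)
    # Per-segment similarity: 1 when durations match, approaching 0 as they diverge.
    per_segment = 1.0 - np.abs(dur_sub - dur_temp) / np.maximum(np.maximum(dur_sub, dur_temp), 1e-9)
    return 100.0 * float(per_segment.mean())

# Example: three segments of "spider" (/s/, /pai/, /der/), durations in seconds.
print(round(pop_score([0.12, 0.25, 0.20], [0.10, 0.28, 0.22]), 1))  # about 91
```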
3.4. Dynamic Time Warping (DTW)
DTW is applied to the speech features (spectrogram, MFCCs, and formant frequencies) of both the template and sample signals [25]. DTW measures the similarity between the temporal sequences of the two signals, producing a normalized distance value, where lower distances correspond to better alignment. These distances were mapped onto fuzzy sets, namely, poor, average, and good, as shown in
Table 7. For all the feature types, after computing the local distances d(xi, yj), DTW finds the minimal cumulative distance alignment path using the standard DTW recursion
D(i, j) = d(xi, yj) + min{D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)},
where D(i, j) is the accumulated distance and d(xi, yj) is the distance measure, commonly the Euclidean distance, between the input feature vector xi and the template feature vector yj. The total cost along the optimal warping path is D = Σ d(xin, yjn) for n = 1, 2, …, P, where P is the length of the warping path. The final normalized DTW distance, dnorm, could be defined as
dnorm = (Dk − min(Dk)) / (max(Dk) − min(Dk)),
where min(Dk) and max(Dk) are the minimum and maximum DTW distances for feature k ∈ {spec, MFCC, formant}, respectively.
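For illustration, the alignment and min-max normalization could be implemented as follows; librosa’s sequence.dtw provides the standard recursion, while the file names, MFCC settings, and the d_min/d_max bounds are assumptions for demonstration.

```python
import numpy as np
import librosa

def normalized_dtw_distance(sample_feat, template_feat, d_min, d_max):
    """Normalized DTW distance between a sample and a template feature sequence.

    sample_feat and template_feat are feature matrices of shape (n_dims, n_frames),
    e.g., MFCCs. d_min and d_max are the minimum and maximum raw DTW distances
    observed for this feature type (used for min-max scaling).
    """
    # Cumulative cost matrix via the standard DTW recursion with Euclidean local distances.
    acc_cost, warp_path = librosa.sequence.dtw(X=sample_feat, Y=template_feat, metric='euclidean')
    raw_dist = acc_cost[-1, -1] / len(warp_path)          # average cost along the optimal path
    return float(np.clip((raw_dist - d_min) / (d_max - d_min), 0.0, 1.0))

# Example with MFCC features (hypothetical file names, 13 coefficients at 16 kHz).
y_sub, sr = librosa.load('subject_spider.wav', sr=16000)
y_tpl, _ = librosa.load('template_spider.wav', sr=16000)
mfcc_sub = librosa.feature.mfcc(y=y_sub, sr=sr, n_mfcc=13)
mfcc_tpl = librosa.feature.mfcc(y=y_tpl, sr=sr, n_mfcc=13)
d_norm = normalized_dtw_distance(mfcc_sub, mfcc_tpl, d_min=0.0, d_max=150.0)
```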
3.5. Knowledge Base
The proposed system features a server monitoring interface, as shown in
Figure 8, which presents an overview of the entire process. From Equations (4) and (6), both the PoP and DTW measurements were mapped and evaluated by comparing the subject’s pronunciation against the template pronunciations stored in the training dataset, which contained high-quality native speaker exemplars for each word. This template matching ensured that the assessments were grounded in authentic pronunciation standards, which enabled precise identification of the timing and acoustic deviations. These fuzzy-mapped inputs were fed into the fuzzy inference system (FIS), which applied a set of IF–THEN fuzzy rules in its knowledge base to interpret the combined phoneme duration and alignment quality. The IF–THEN fuzzy rules are shown in
Table 7 and represent a matrix of DTW values and PoP values. The fuzzy output from the FIS was then defuzzified to produce a continuous soft-decision pronunciation score. In
Figure 9c, the score obtained from the evaluation is shown. The fuzzy centroid estimation method, also known as the center of area (COA), is used to determine the feedback result.
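A minimal sketch of how such a two-input FIS could be assembled with the Scikit-Fuzzy control API is given below; the universes, membership function shapes, and the three illustrative rules are assumptions standing in for the full knowledge base in Table 7.

```python
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

# Antecedents: normalized DTW distance (lower = better alignment) and PoP score
# (higher = closer duration match). Consequent: the defuzzified pronunciation score.
dtw = ctrl.Antecedent(np.linspace(0, 1, 101), 'dtw')
pop = ctrl.Antecedent(np.linspace(0, 100, 101), 'pop')
score = ctrl.Consequent(np.linspace(0, 100, 101), 'score')  # centroid (COA) defuzzification by default

# Assumed triangular membership functions; the exact shapes are a design choice.
dtw['good'] = fuzz.trimf(dtw.universe, [0.0, 0.0, 0.4])
dtw['average'] = fuzz.trimf(dtw.universe, [0.2, 0.5, 0.8])
dtw['poor'] = fuzz.trimf(dtw.universe, [0.6, 1.0, 1.0])
pop.automf(5)  # default names: poor, mediocre, average, decent, good
score['low'] = fuzz.trimf(score.universe, [0, 0, 50])
score['medium'] = fuzz.trimf(score.universe, [25, 50, 75])
score['high'] = fuzz.trimf(score.universe, [50, 100, 100])

# Illustrative IF-THEN rules; the full DTW x PoP rule matrix lives in Table 7.
rules = [
    ctrl.Rule(dtw['good'] & pop['good'], score['high']),
    ctrl.Rule(dtw['average'] | pop['average'], score['medium']),
    ctrl.Rule(dtw['poor'] | pop['poor'], score['low']),
]

sim = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
sim.input['dtw'], sim.input['pop'] = 0.2, 85.0   # a well-aligned utterance with close durations
sim.compute()
print(round(sim.output['score'], 1))
```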
3.6. Mobile-Assisted Language Learning (MALL)
To implement a mobile platform interface showcasing the system’s workflow, we utilized the React Native framework, a hybrid mobile development framework that enables efficient development while ensuring compatibility with both Android and iOS platforms.
Figure 10a presents the main page, where users can interact with the MALL system by pressing the microphone button to record audio. The recorded audio is automatically sent to the server for processing.
Figure 10b displays the list of available words that can be assessed using the proposed method.
Figure 11 illustrates the results of the 10 selected subjects chosen from a total of 50 participants, who were tasked with pronouncing 10 of the 16 words with five trials per word. The scores ranged from 55 to 80, compared with the native speakers’ scores, which ranged from 80 to 100. The standard deviations are represented by the error bars. Notably, all the selected subjects had studied in an international program, which may have contributed to their relatively high initial scores.
Before using the proposed MALL system, the subjects’ pronunciation correctness was significantly lower, with scores that trailed the native speech data by 5% to 20%. After utilizing the MALL system, which provided real-time feedback and detailed pronunciation assessments, all the subjects showed notable improvement in their pronunciation accuracy. The increased scores highlight the system’s effectiveness in bridging the gap between nonnative and native pronunciation, demonstrating its potential as a powerful tool for language learning and accent refinement. To enhance phonetic coverage and address specific challenges faced by Chinese-accented English learners, additional vocabulary was selected with a focus on known pronunciation difficulties. This selection emphasizes minimal pairs and words that contrast frequently problematic sounds for Chinese speakers, including the following:
The /l/ versus /r/ distinction, which is often merged in many Chinese dialects.
Final consonant sounds (e.g., -s, -t, and -d endings), which are commonly omitted or altered.
Consonant clusters, such as /str/, /spr/, and /spl/, which are absent in Mandarin and many other Chinese dialects.
Interdental fricatives /θ/ and /ð/, which are often substituted with /s/, /z/, /d/, or /t/.
Vowel length and tense/lax contrasts, e.g., /iː/ versus /ɪ/ and /uː/ versus /ʊ/.
Minimal pairs of words differing by only a single phoneme and yielding different meanings are employed extensively in language instruction and phonetic research. These pairs highlight critical phonetic distinctions that are often problematic due to differences in phoneme inventories and constraints between English and various Chinese dialects, as shown in
Table 8. This targeted focus enhanced the system’s ability to accurately classify and assess pronunciation errors that are common among Chinese-accented English speakers, thereby improving the precision and effectiveness of pronunciation training.
4. Discussion
Although CNNs are traditionally associated with spatial data, their adaptability to sequential data, like speech, has been increasingly demonstrated. By leveraging convolutional filters, CNNs effectively capture local spatial–temporal speech features, which are vital for detailed and accurate pronunciation analysis. In addition, CNNs offer significant advantages in computational efficiency compared with alternative architectures, such as recurrent neural networks (RNNs), including long short-term memory (LSTM) networks. The inherently sequential nature of RNNs results in slower training times and higher computational costs, which are often compounded by challenges like vanishing gradients. Conversely, CNNs enable the parallel processing of input data, allowing for faster convergence and the more efficient utilization of computational resources, a crucial consideration for real-time feedback in mobile platforms. Numerous studies have confirmed that CNN-based models outperform other deep learning frameworks in extracting meaningful speech representations [
26,
27], thereby validating their suitability for applications focused on pronunciation assessment.
This study made several significant contributions to the field of MALL and pronunciation analysis. First, we demonstrated that combining multiple complementary feature extraction methods, such as spectrograms, MFCCs, and formant frequencies, achieved superior accent classification accuracy compared with single-feature approaches. Second, we introduced a novel integration of fuzzy inference systems with deep learning models to provide interpretable feedback that accommodates natural speech variability. Third, our system provides sophisticated pronunciation assessment accessible outside traditional learning environments, addressing a critical gap in self-directed language learning. Fourth, our experimental results confirmed that this approach led to measurable improvements in pronunciation accuracy (5% to 20%), validating its practical effectiveness.
Table 9 presents a performance comparison of different pronunciation assessment models. The proposed system demonstrated a robust performance, where it achieved accuracies that ranged from 82.41% to 90.52%, precisions between 81.46% and 90.09%, and recalls from 82.08% to 90.68%. This high level of performance was attributed to the integration of CNNs for feature extraction and FIS for decision-making. The CNN effectively captured local temporal patterns in speech, while the FIS provided context-aware assessments that accommodated the natural variability inherent in human speech. The combination of these methods enabled nuanced, interpretable feedback, which made the system particularly well-suited for real-time mobile applications focused on minimal pair practice. In comparison, traditional Goodness of Pronunciation (GoP) systems [
28] achieve accuracies ranging from 81.67% to 87.60%, with precisions between 82.49% and 89.19% and recalls between 80.43% and 89.36%. When combined with Dynamic Time Warping (DTW) [
29], GoP systems achieve an accuracy range of 80.26% to 87.27%, though detailed precision and recall metrics are not provided. In contrast, the spectrogram + FIS system achieved accuracies between 79.86% and 81.48%, with precisions that ranged from 75.65% to 79.23% and recalls between 76.10% and 80.13%. Although effective for extracting acoustic features, this approach did not fully leverage advanced temporal modeling or phoneme-level feedback, which made it less effective compared with the proposed method. The MFCCs + FIS system showed improved performance, with accuracies that ranged from 78.36% to 84.71%, precisions from 78.18% to 83.26%, and recalls from 79.47% to 82.29%. By combining MFCCs with FIS, this approach improved the system’s ability to capture detailed phoneme-level features and account for temporal variations, though it still did not achieve the same level of accuracy and recall as the proposed CNN + FIS system.
In summary, the proposed CNN + FIS system outperformed other models in terms of both accuracy and recall, demonstrating the efficacy of combining deep learning for feature extraction with fuzzy inference for decision-making. The integration of CNNs and FIS offers a promising solution for addressing these limitations and providing more precise, real-time feedback for pronunciation improvement.
4.1. Comparison of Study Findings with Existing Literature
The findings of this study align closely with existing research on the application of deep learning techniques in speech analysis, particularly within MALL contexts. The high accuracy rates achieved by the CNN model in this study are consistent with prior work that emphasized CNNs’ effectiveness in speech classification tasks. For example, Lesnichaia et al. [
17] demonstrated that CNNs excel at managing complex speech patterns and distinguishing subtle pronunciation variations due to their capacity to extract hierarchical features from input data. Similarly, Mikhailava et al. [
18] highlighted the robustness of CNN models in handling sparse and crowd-sourced speech data, further validating their applicability in diverse linguistic settings. The use of spectrograms and MFCCs as input features corroborates established literature on the effectiveness of combining time–frequency representations with perceptually relevant spectral data. Sejdic et al. [
4] demonstrated that the integration of MFCCs with spectrograms significantly enhances the classification performance, as the two feature types provide complementary information. This study confirmed these findings, showing that the combination of spectrograms and MFCCs increased the model accuracy to approximately 87%, which outperformed the individual contributions of each feature set.
Furthermore, the inclusion of formant frequencies added phonetic depth to the analysis, particularly for distinguishing vowel sounds. This finding aligns with Kasahara et al. [
9], who emphasized the importance of formant frequencies in differentiating phonetic elements. While formants alone did not achieve the accuracy of spectrograms or MFCCs, their integration into a multi-feature approach enhanced the overall robustness, reinforcing the argument that combining diverse features yields more comprehensive results. The application of fuzzy logic to pronunciation assessment represents a novel contribution that builds on the established role of fuzzy logic in managing uncertainty in speech-processing tasks [
19]. The FIS implemented in this study facilitated flexible, interpretable pronunciation feedback, making it well-suited for user-centric language-learning applications. By enabling context-aware evaluations, the system provided a more nuanced alternative to rigid scoring mechanisms. In summary, this study validated and extended the existing literature by demonstrating that a well-integrated approach, i.e., combining CNN models, multi-feature extraction methods, and fuzzy logic assessment, can substantially enhance pronunciation analysis.
4.2. System Potential and Challenges in Practical Application
The system developed in this study demonstrated significant potential for enhancing language learning by providing personalized feedback and real-time pronunciation analysis. By leveraging deep learning models, such as CNNs, the system effectively classifies accents and delivers targeted improvement recommendations. The mobile-assisted approach enhances accessibility, enabling users to practice pronunciation anytime and anywhere, thereby supporting the democratization of language education [
30].
However, challenges persist in real-world deployment. One primary concern is the system’s performance in noisy environments, which can impact the accuracy of speech recognition and classification. While pre-processing techniques, such as noise reduction and signal normalization, help mitigate these issues, achieving consistent performance across varied real-world scenarios remains complex.
Another challenge involves addressing the variability in user accents and speech rates, requiring the model to handle diversity effectively to provide reliable feedback [
31]. Incorporating advanced data augmentation techniques and expanding the training dataset to include more diverse accents could further improve the model’s performance in this regard. User engagement and feedback interpretability are also critical considerations. Ensuring that users understand and act on the system’s feedback requires clear, intuitive interfaces that present processed results and detailed analyses in an accessible way. Overcoming these challenges is essential for the practical adoption and sustained use of pronunciation learning tools.
4.3. Importance of Improved Pronunciation Feedback
The proposed system demonstrates remarkable accuracy in classifying nonnative English accents. By leveraging spectrograms and MFCCs as core features, the model achieved classification accuracies that exceeded 80%, aligning with prior studies that emphasized the effectiveness of these features in speech analysis. The integration of adaptive thresholding and advanced feature extraction techniques bolstered the model’s robustness, enabling reliable differentiation across eight accent classes, including a newly introduced class of Chinese-accented English speakers. This outcome validates the system’s potential for practical applications in accent classification and is comparable with findings in the existing literature on robust speech systems.
The study highlighted the complementary effectiveness of spectrograms, MFCCs, and formant frequencies in accent classification. The spectrogram-based features, which are known for capturing intricate temporal and frequency nuances, delivered superior classification performance compared with the MFCCs alone. Additionally, the integration of DTW enhanced the system’s ability to align the speech features across varying speaking speeds, which improved the identification of subtle pronunciation variations. These findings underscore the importance of combining diverse feature extraction methods for higher accuracy and consistency in pronunciation analysis.
The system’s real-time feedback mechanism, inspired by neural and cognitive processes of the human brain, significantly contributed to users’ improvement in pronunciation skills [
32]. By offering detailed assessments and visual feedback on phoneme accuracy and intonation patterns, users could iteratively refine their pronunciation [
33]. The experimental results revealed that the subjects improved their pronunciation correctness by 5% to 20% after using the system, corroborating neuroscientific findings on the auditory cortex’s role in processing time–frequency representations, similar to those in spectrograms [
34,
35]. This study effectively bridged cognitive neuroscience principles with technological applications to facilitate impactful language-learning outcomes.
4.4. Limitations and Future Directions
A primary limitation of the system lies in its adaptability to different languages. Although the model was trained on a diverse dataset encompassing various English accents, its performance may not be as robust for languages with significantly different phonetic structures, such as Mandarin or Thai. The system’s reliance on feature extraction methods optimized for English pronunciation, such as MFCCs and spectrograms, could restrict its effectiveness for languages requiring distinct acoustic emphases [
17]. Additionally, the feedback mechanism, which was fine-tuned for English-specific pronunciation nuances, may need reconfiguration to cater to other languages. Addressing these challenges would necessitate adapting feature extraction techniques and retraining the model on multilingual datasets to broaden its applicability beyond English [
30].
Another limitation relates to the dataset’s scale and diversity, which could impact the generalizability of the model’s performance. While the dataset includes a variety of English accents, its relatively constrained size and controlled recording conditions may not fully represent real-world scenarios [
18]. Incorporating larger datasets with more speakers and a broader range of pronunciation variations, including data from nonnative speakers with varying proficiency levels, would improve the model’s robustness and practical utility [
32]. Expanding the dataset to include recordings from diverse languages, dialects, and real-life settings is a critical step for future research [
31].
Deploying deep learning models, such as CNNs, on mobile platforms also presents challenges related to processing power and energy consumption. Mobile devices, while convenient for language learning, often have limited computational resources compared with desktops. Running complex models on mobile devices can lead to increased battery usage and longer response times, potentially affecting the user interface and experience. Techniques such as model pruning, quantization, and cloud-based processing could help optimize the model for mobile use, balancing resource efficiency and model complexity.
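As one concrete example of these optimizations, post-training quantization with TensorFlow Lite can shrink a trained Keras CNN for on-device inference; the sketch below assumes a trained model saved to a hypothetical file and is illustrative rather than part of the deployed system.

```python
import tensorflow as tf

# Post-training dynamic-range quantization of a trained Keras CNN for mobile deployment.
model = tf.keras.models.load_model('accent_cnn.h5')       # hypothetical trained model file
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]       # quantize weights to reduce size and latency
tflite_model = converter.convert()

with open('accent_cnn_quantized.tflite', 'wb') as f:       # bundle this file with the mobile app
    f.write(tflite_model)
```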